CN104679838A - Efficient information collection method - Google Patents


Info

Publication number
CN104679838A
Authority
CN
China
Prior art keywords
entrance
information
result
downloader
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510067379.2A
Other languages
Chinese (zh)
Inventor
赵金杰
许欢庆
郭永福
陈沛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
In Beijing Yun Yue Network Technology Co. Ltd.
Original Assignee
Beijing Zhongsou Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Abstract

The invention provides an efficient information collection method comprising entry scanning and news downloading. After screening, the entries to be scheduled are placed in a to-be-downloaded queue; a news downloader downloads the data in that queue and sends it to a database. The results collected by the method are accurate, the large amount of noise data in the collected information is reduced, the framework is simple to build, and important data is collected in time.

Description

A method for efficient information collection
Technical field
The present invention relates to information collection technology, and in particular to a method for efficient information collection.
Background
The development and spread of the Internet have brought a rapid expansion of information, and the forms that information takes have diversified accordingly; news is one of them.
Internet search engine technology gives users a convenient and fast channel that helps them obtain information more quickly, more accurately, and more comprehensively. Information search arose from search engine technology: when a user searches for a keyword, all related information is retrieved, and the user can pick what to read based on factors such as trust in and preference for particular websites. In information search, collection is the key step, and its accuracy and timeliness directly affect the quality of the search and the user experience.
Information collection has always been the key to information integration. Traditionally, collection was the main work of website editors and was mostly done by manual searching, which is repetitive, wastes labor, and is very inefficient. With the emergence of systematic collection frameworks, information is collected and organized by the system, which greatly improves efficiency and saves labor.
The shortcomings of current information collection are that sources are numerous and heterogeneous, noise data is abundant, key data is not highlighted, and collection is not timely.
Summary of the invention
To overcome the above deficiencies of the prior art, the present invention provides a method for efficient information collection. The method obtains the data for the data source entries from a database, gives scheduling priority to entries whose information is of high quality and is updated quickly, obtains the detail page data for each entry, downloads, transcodes, and extracts it, and sends out the complete information data.
To achieve the above object, the present invention adopts the following technical solution:
A method for efficient information collection, comprising two parts: entry scanning and news downloading.
In a preferred technical solution provided by the invention, the entry scanning comprises the following steps:
A. Schedule the entry information according to a priority policy, and put the entries scheduled for scanning into the entry downloader;
B. The entry downloader obtains the download results and pushes the successfully downloaded entries, together with the related template information, to the extractor;
C. Obtain the extraction results from the extractor;
C-1. For entries that were extracted successfully, re-schedule and analyze the results to obtain the corresponding detail page links and other related information, put them into the to-be-downloaded task queue, and wait for subsequent processing by the news download;
C-2. Entries whose extraction failed wait, according to the scheduling policy, for the scheduled scan of the next cycle.
In a second preferred technical solution provided by the invention, the priority policy in step A comprises the update volume of the entry and the weight of the website to which the entry belongs.
In a third preferred technical solution provided by the invention, the scheduling policy in step C-2 extends the scheduled scan cycle to a configured extended time.
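The priority policy and the extended scan cycle can be illustrated with a minimal Python sketch. The text names only the two inputs (update volume and website weight); multiplying them into a single score, the default interval of 600 seconds, and all names such as `ScheduledEntry` and `extend_cycle` are assumptions made for illustration, not details given in the patent.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class ScheduledEntry:
    # heapq is a min-heap, so the score is stored negated:
    # the entry with the highest score is popped first.
    sort_key: float
    url: str = field(compare=False)
    scan_interval: float = field(compare=False, default=600.0)  # seconds, assumed


def priority_score(update_count: int, site_weight: float) -> float:
    """Hypothetical score: more updates on a heavier site ranks higher."""
    return update_count * site_weight


def build_scan_queue(entries):
    """entries: iterable of (url, update_count, site_weight) tuples."""
    heap = [ScheduledEntry(-priority_score(n, w), url) for url, n, w in entries]
    heapq.heapify(heap)
    return heap


def extend_cycle(entry: ScheduledEntry, extended_interval: float) -> None:
    """Step C-2: a failed entry is rescanned only after the configured extended time."""
    entry.scan_interval = extended_interval


queue = build_scan_queue([
    ("http://a.example/news", 5, 1.0),
    ("http://b.example/news", 20, 2.0),
    ("http://c.example/news", 20, 0.5),
])
first = heapq.heappop(queue)  # the entry handed to the entry downloader first
```

Any monotone combination of the two factors would serve equally well here; the product is used only because it keeps the example short.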
In a fourth preferred technical solution provided by the invention, the news download comprises the following steps:
A. Push the data in the to-be-downloaded task queue to the news downloader;
B. Obtain the download results from the news downloader, and push the successfully downloaded detail pages and related information to the extractor for data extraction;
C. Obtain the extraction results of the detail pages from the extractor; the extraction results of paginated pages need to be merged;
D. Analyze the extraction results;
D-1. If the extraction result is complete information data, send it to the database through the sender;
D-2. If the extraction result contains a pagination link, put the pagination link into the to-be-downloaded task queue and go to step A; this pagination queue is processed with priority;
D-3. If the extraction result contains picture links, perform the picture processing operation and, once it finishes, check the information data: if it is complete, send it to the database through the sender; if it has a pagination link, go to step A. Picture processing operations take priority over pagination queue tasks.
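The ordering among steps D-1 to D-3 can be sketched as a small priority queue. This is a simplified model under stated assumptions: picture tasks, pagination tasks, and ordinary detail page tasks share one queue here, whereas the text sends pictures to a separate picture downloader; the dictionary keys `picture_links` and `next_page` and the class `TaskQueue` are hypothetical names.

```python
import heapq
import itertools

# Hypothetical priority levels: a lower number is served first.
PICTURE, PAGINATION, DETAIL = 0, 1, 2


class TaskQueue:
    """To-be-downloaded tasks ordered by priority, then by insertion order."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker that keeps FIFO order per level

    def put(self, priority, url):
        heapq.heappush(self._heap, (priority, next(self._seq), url))

    def get(self):
        priority, _, url = heapq.heappop(self._heap)
        return priority, url

    def __len__(self):
        return len(self._heap)


def dispatch(result, queue, sent):
    """Route one extraction result (a dict) roughly as in steps D-1 to D-3."""
    pictures = result.get("picture_links", [])
    next_page = result.get("next_page")
    for link in pictures:
        queue.put(PICTURE, link)          # D-3: picture work is served first
    if next_page:
        queue.put(PAGINATION, next_page)  # D-2: pagination ahead of ordinary tasks
    if not pictures and not next_page:
        sent.append(result)               # D-1: complete, hand over to the sender
```

With an ordinary detail page task already queued, a result carrying both a picture link and a pagination link yields the picture first, then the pagination link, then the earlier detail page.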
In a fifth preferred technical solution provided by the invention, the picture processing comprises the following steps:
(1) Push the picture links to the picture downloader;
(2) After the picture downloader has downloaded the pictures, analyze them, then compress and upload them.
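The two picture processing steps can be sketched as a small pipeline. The text does not say what the analysis consists of or how compression and upload are done, so this sketch assumes the analysis is a format check on the file signature and takes the download, compress, and upload operations as caller-supplied functions; `sniff_format` and `process_picture` are hypothetical names.

```python
def sniff_format(data: bytes) -> str:
    """Identify a picture format from its leading signature bytes."""
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if data.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    if data[:6] in (b"GIF87a", b"GIF89a"):
        return "gif"
    return "unknown"


def process_picture(link, download, compress, upload):
    """Download one picture, analyze it, then compress and upload it.

    download(link) -> bytes, compress(data, fmt) -> bytes and
    upload(link, data) -> stored location are injected by the caller.
    Returns the stored location, or None when the data is not a known format.
    """
    data = download(link)                 # step (1): picture downloader
    fmt = sniff_format(data)              # step (2): analysis (assumed: format check)
    if fmt == "unknown":
        return None
    return upload(link, compress(data, fmt))
```

Injecting the three operations keeps the sketch independent of any particular imaging or storage library.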
In a sixth preferred technical solution provided by the invention, the manually specified entry information comprises the following:
(1) An entry library, comprising the entry links, the templates used for entry extraction, the websites they belong to, and the corresponding labels;
(2) A website library, comprising the website links, the website PR ranks, and the corresponding labels;
(3) Template data.
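The three kinds of manually specified data could be modeled as plain records, as in the following sketch. The field names, the use of the website link as a lookup key, and the idea of a template as a mapping from field names to extraction rules are assumptions for illustration; the text lists only the contents of each library.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Site:
    """One record of the website library."""
    url: str
    pr_rank: int
    labels: List[str] = field(default_factory=list)


@dataclass
class Template:
    """Template data; 'rules' maps a field name to an extraction rule (assumed shape)."""
    template_id: str
    rules: Dict[str, str] = field(default_factory=dict)


@dataclass
class Entry:
    """One record of the entry library."""
    url: str
    template_id: str
    site_url: str
    labels: List[str] = field(default_factory=list)


def site_rank_for(entry: Entry, sites: Dict[str, Site]) -> int:
    """Look up the PR rank of the website an entry belongs to (0 if unknown)."""
    site = sites.get(entry.site_url)
    return site.pr_rank if site else 0


sites = {"http://s.example": Site("http://s.example", pr_rank=6, labels=["news"])}
entry = Entry("http://s.example/finance/", "tpl-1", "http://s.example", ["finance"])
```

A lookup of this kind is one way the website weight used by the priority policy could be obtained from the website library.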
Compared with the prior art, the beneficial effects of the invention are as follows:
The invention provides an efficient information collection framework and method. The collected results are accurate, the large amount of noise data in the collected information is reduced, the framework is simple to build, and important data is collected in time.
Description of the drawings
Fig. 1 is a flow chart of an information collection method.
Fig. 2 is a flow chart of entry scanning.
Fig. 3 is a flow chart of news downloading.
Embodiment
The invention is described in further detail below with reference to the drawings.
As shown in Fig. 1, the efficient information collection method comprises two parts: entry scanning and news downloading.
As shown in Fig. 2, the specific steps of entry scanning are as follows:
A. Put the entries scheduled for scanning into the entry downloader;
B. The entry downloader obtains the download results and pushes the successfully downloaded entries, together with the related template information, to the extractor;
C. The extractor obtains the extraction results, and the results are analyzed;
C-1. For entries that were extracted successfully, re-schedule the results to obtain the corresponding detail page links and other related information, put them into the to-be-downloaded task queue, and wait for subsequent processing by the news download;
C-2. Entries whose extraction failed wait, according to the scheduling policy, for the scheduled scan of the next cycle.
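One pass over the steps of Fig. 2 can be sketched as follows. The downloader and extractor are passed in as functions so the sketch runs without network access. Two points are assumptions not stated in the text: a failed download is handled like a failed extraction, and detail page links that were already seen are skipped. All names are hypothetical.

```python
from collections import deque


def scan_cycle(entries, download, extract, seen_links, task_queue, retry_next_cycle):
    """Run one entry scanning cycle.

    download(url) -> page or None; extract(page, template) -> list of
    detail page links, or None when extraction fails.
    """
    for entry in entries:
        page = download(entry["url"])                       # steps A and B
        links = extract(page, entry["template"]) if page is not None else None
        if links is None:
            retry_next_cycle.append(entry)                  # step C-2: wait for the next cycle
            continue
        for link in links:                                  # step C-1
            if link not in seen_links:                      # assumed: skip links already queued
                seen_links.add(link)
                task_queue.append(link)


tasks, retry, seen = deque(), [], set()
pages = {"http://s.example/list": "<html>listing</html>"}
scan_cycle(
    [{"url": "http://s.example/list", "template": "tpl-1"}],
    pages.get,
    lambda page, template: ["http://s.example/d1", "http://s.example/d2"],
    seen, tasks, retry,
)
```

After this call, `tasks` holds the two detail page links in order, ready for the news download stage.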
As shown in Fig. 3, the specific steps of news downloading are as follows:
A. The news downloader obtains the to-be-downloaded queue;
B. Obtain the download results from the news downloader, and push the successfully downloaded detail pages and related information to the extractor for data extraction;
C. Obtain the extraction results of the detail pages from the extractor; the extraction results of paginated pages need to be merged;
D. Transcode and analyze the extraction results;
D-1. If the extraction result is complete information data, send it to the database through the sender;
D-2. If the extraction result contains a pagination link, put the pagination link into to-be-downloaded task queue 2 and go to step A; this pagination queue is processed with priority;
D-3. If the extraction result contains picture links, push the picture links to the picture downloader; after the picture downloader has downloaded the pictures, analyze them, then compress and upload them. Once processing finishes, check the information data: if it is complete, send it to the database through the sender; if it has a pagination link, go to step A. Picture processing operations take priority over pagination queue tasks.
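Steps C and D mention merging the extraction results of paginated pages and transcoding. A minimal sketch of both follows, assuming each page result is a dictionary with the hypothetical keys `page_no`, `title`, `body`, and `picture_links`, and assuming that transcoding means converting the page bytes from their source character set to UTF-8. The text specifies neither the record layout nor the target encoding.

```python
def transcode(raw: bytes, source_charset: str) -> bytes:
    """Re-encode page bytes from the source charset to UTF-8 (assumed target)."""
    return raw.decode(source_charset, errors="replace").encode("utf-8")


def merge_pages(pages):
    """Merge the extraction results of one paginated article into a single record."""
    ordered = sorted(pages, key=lambda p: p["page_no"])
    merged = {
        "title": ordered[0]["title"],                     # title taken from the first page
        "body": "\n".join(p["body"] for p in ordered),    # bodies joined in page order
        "picture_links": [],
    }
    for page in ordered:
        for link in page.get("picture_links", []):
            if link not in merged["picture_links"]:       # keep each picture link once
                merged["picture_links"].append(link)
    return merged


article = merge_pages([
    {"page_no": 2, "title": "Report", "body": "second part", "picture_links": ["a.jpg"]},
    {"page_no": 1, "title": "Report", "body": "first part", "picture_links": ["a.jpg", "b.jpg"]},
])
```

Sorting by page number makes the merge independent of the order in which the paginated pages finish downloading.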
Finally, it should be noted that the above embodiment is intended only to illustrate the technical solution of the invention, not to limit it. Although the invention has been described in detail with reference to the above embodiment, those of ordinary skill in the art should understand that the specific embodiments of the invention may still be modified or equivalently replaced, and that any modification or equivalent replacement that does not depart from the spirit and scope of the invention shall fall within the scope of the claims of the invention.

Claims (7)

1. A method for efficient information collection, characterized in that the method comprises two parts: entry scanning and news downloading.
2. The method for information collection according to claim 1, characterized in that the entry scanning comprises the following steps:
A. Schedule the entry information according to a priority policy, and put the entries scheduled for scanning into the entry downloader;
B. The entry downloader obtains the download results and pushes the successfully downloaded entries, together with the related template information, to the extractor;
C. The extractor obtains the extraction results, and the results are analyzed;
C-1. For entries that were extracted successfully, re-schedule the results to obtain the corresponding detail page links and other related information, put them into the to-be-downloaded task queue, and wait for subsequent processing by the news download;
C-2. Entries whose extraction failed wait, according to the scheduling policy, for the scheduled scan of the next cycle.
3. The method for information collection according to claim 2, characterized in that the priority policy in step A comprises the update volume of the entry and the weight of the website to which the entry belongs.
4. The method for information collection according to claim 2, characterized in that the scheduling policy in step C-2 extends the scheduled scan cycle to a configured extended time.
5. The method for information collection according to claim 1, characterized in that the news download comprises the following steps:
A. Push the data in the to-be-downloaded task queue to the news downloader;
B. Obtain the download results from the news downloader, and push the successfully downloaded detail pages and related information to the extractor for data extraction;
C. Obtain the extraction results of the detail pages from the extractor; the extraction results of paginated pages need to be merged;
D. Analyze the extraction results;
D-1. If the extraction result is complete information data, send it to the database through the sender;
D-2. If the extraction result contains a pagination link, put the pagination link into the to-be-downloaded task queue and go to step A; this pagination queue is processed with priority;
D-3. If the extraction result contains picture links, perform the picture processing operation and, once it finishes, check the information data: if it is complete, send it to the database through the sender; if it has a pagination link, go to step A. Picture processing operations take priority over pagination queue tasks.
6. The method for information collection according to claim 5, characterized in that the picture processing comprises the following steps:
(1) Push the picture links to the picture downloader;
(2) After the picture downloader has downloaded the pictures, analyze them, then compress and upload them.
7. The method for information collection according to claim 2, characterized in that the manually specified entry information comprises the following:
(1) An entry library, comprising the entry links, the templates used for entry extraction, the websites they belong to, and the corresponding labels;
(2) A website library, comprising the website links, the website PR ranks, and the corresponding labels;
(3) Template data.
CN201510067379.2A 2015-02-09 2015-02-09 Efficient information collection method Pending CN104679838A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510067379.2A CN104679838A (en) 2015-02-09 2015-02-09 Efficient information collection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510067379.2A CN104679838A (en) 2015-02-09 2015-02-09 Efficient information collection method

Publications (1)

Publication Number Publication Date
CN104679838A true CN104679838A (en) 2015-06-03

Family

ID=53314880

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510067379.2A Pending CN104679838A (en) 2015-02-09 2015-02-09 Efficient information collection method

Country Status (1)

Country Link
CN (1) CN104679838A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106534341A (en) * 2016-12-02 2017-03-22 天脉聚源(北京)传媒科技有限公司 Method and device for pushing updated news
CN108664606A (en) * 2018-05-10 2018-10-16 北京鼎泰智源科技有限公司 A kind of big data coverage rate capturing analysis method

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6665658B1 (en) * 2000-01-13 2003-12-16 International Business Machines Corporation System and method for automatically gathering dynamic content and resources on the world wide web by stimulating user interaction and managing session information
CN101114285A (en) * 2006-07-25 2008-01-30 腾讯科技(深圳)有限公司 Internet topics file searching method, reptile system and search engine
CN102117275A (en) * 2009-12-31 2011-07-06 北大方正集团有限公司 Method and device for collecting webpage data of direction site based on internet
CN103177076A (en) * 2012-12-28 2013-06-26 中联竞成(北京)科技有限公司 Public sentiment monitoring system and method based on fixed point websites

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
许志凯: "网络舆情分析关键技术的研究与实现", 《中国优秀硕士学位论文全文数据库》 *

Similar Documents

Publication Publication Date Title
CN102054028B (en) Method for implementing web-rendering function by using web crawler system
CN113486833B (en) Multi-modal feature extraction model training method and device and electronic equipment
CN101441629A (en) Automatic acquiring method of non-structured web page information
CN103970801B (en) Microblogging advertisement blog article recognition methods and device
CN103823907B (en) A kind of method, apparatus and engine for integrating online video resource address
CN105577528B (en) A kind of wechat public platform collecting method and device based on virtual machine
CN106357635A (en) Vulnerability comparison analysis method based on homologous framework
CN102542061A (en) Intelligent product classification method
CN103440243A (en) Teaching resource recommendation method and device thereof
CN105262812A (en) Log data processing method based on cloud computing platform, log data processing device and log data processing system
CN104991904A (en) Page data acquisition method of dynamic webpage
CN104281619A (en) System and method for ordering search results
CN104765823A (en) Method and device for collecting website data
CN112818201A (en) Network data acquisition method and device, computer equipment and storage medium
CN106713859A (en) Image visual monitoring search system and search method thereof
CN103761257A (en) Webpage handling method and system based on mobile browser
CN104679838A (en) Efficient information collection method
CN103164438B (en) The acquisition method of a kind of network comment and system
CN104281680B (en) Data processing system, method and device for obtaining site resource
CN105426407A (en) Web data acquisition method based on content analysis
CN105956069A (en) Network information collection and analysis method and network information collection and analysis system
CN103605670B (en) A kind of method and apparatus for determining the crawl frequency of network resource point
CN104750812A (en) Automatic data collecting method based on webpage label analysis
CN104572996A (en) Processing method and device for video webpage
CN109389972B (en) Quality testing method and device for semantic cloud function, storage medium and equipment

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20170426

Address after: 100086 Beijing, Haidian District, North Third Ring Road West, No. 43, building 5, floor 08-09, No. 2

Applicant after: BEIJING ZHONGSOU CLOUD BUSINESS NETWORK TECHNOLOGY CO., LTD.

Address before: Shou Heng Technology Building No. 51 Beijing 100191 Haidian District Xueyuan Road room 0902

Applicant before: Beijing Zhongsou Network Technology Co,Ltd

TA01 Transfer of patent application right
TA01 Transfer of patent application right

Effective date of registration: 20180124

Address after: 102400 Beijing city Fangshan District Chenguang Road No. 16 Building No. 16 hospital 6 layer 612

Applicant after: In Beijing Yun Yue Network Technology Co. Ltd.

Address before: 100086 Beijing, Haidian District, North Third Ring Road West, No. 43, building 5, floor 08-09, No. 2

Applicant before: BEIJING ZHONGSOU CLOUD BUSINESS NETWORK TECHNOLOGY CO., LTD.

TA01 Transfer of patent application right
RJ01 Rejection of invention patent application after publication

Application publication date: 20150603

RJ01 Rejection of invention patent application after publication