Chinese Academy of Sciences Institutional Repositories Grid: Design and Implementation of a Focused Crawler System Based on Context Graphs

Chinese Academy of Sciences Institutional Repositories Grid

Design and Implementation of a Focused Crawler System Based on Context Graphs

Document type: Thesis


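Author	陈星 (Chen Xing)
Degree	Master's
Defense date	2010-06-01
Degree-granting institution	Graduate University of Chinese Academy of Sciences (中国科学院研究生院)
Place conferred	Beijing
Advisor	秦晓 (Qin Xiao)
Keywords	focused crawler; Context Graphs model; hierarchical modeling; link analysis; content analysis
Major	Computer Application Technology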
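Chinese abstract (translated)	To make the most of limited hardware resources and storage space while promptly acquiring the web data that users care about most, researchers proposed focused crawlers. Conventional general-purpose crawlers neither consider how relevant a page's content is to the topic nor make any predictions. In contrast, a focused crawler applies a strategy to rate the priority of web pages and searches along the paths most likely to lead to on-topic pages, so it can reach target pages faster and more accurately. There are many focused crawling strategies; among them, the Context Graphs method is a composite approach that combines the textual content of web pages with the hyperlink structure of the Web. The Context Graphs method treats pages on the Web as a layered structure: pages whose links lead to on-topic pages are assigned to a layer according to their features. The features of each layer can then guide the crawler to discover likely on-topic pages more quickly. However, when building the layer model, the earlier Context Graphs method does not distinguish how important the text in different parts of a page is, although in many cases the page title, hyperlink anchor text, and similar information are more important than the body text for identifying a page's topic. In addition, the Context Graphs method cannot update its model with newly acquired on-topic pages while guiding the crawl; in fact, using these new pages to update the model incrementally may yield more accurate results. Based on these two points, the author proposes the M-Context Graphs method, which adopts a mixed scoring method and introduces a model feedback update mechanism, and further designs and implements a prototype focused crawler system. This thesis first surveys and analyzes the focused crawling strategies that exist or are under exploration, and briefly introduces the state of focused crawler system development in China and abroad. It then describes in detail the mixed scoring method and the model feedback update strategy of the M-Context Graphs method, and gives a detailed design and implementation of a prototype focused crawler system. Finally, experiments with this system compare the M-Context Graphs algorithm with earlier algorithms; the results show that the M-Context Graphs method does achieve better results.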
English abstract	To crawl the web pages that users care about most within limited hardware resources and storage space, researchers proposed the focused web crawler. General-purpose crawlers neither consider how relevant a page's content is to the topic nor make any predictions. In contrast, focused web crawlers use a strategy to evaluate page priorities and search the most promising paths first, so they can acquire on-topic web pages more quickly. There are several main kinds of focused crawling strategies, and Context Graphs is one of them. The Context Graphs method combines web page content with the web hyperlink structure, making it a composite strategy, so it is worth analyzing in depth and improving. The Context Graphs method treats web pages as a hierarchical structure: pages that link to on-topic pages are assigned to a level. The crawler can then achieve better results by using the characteristics of these levels. However, the Context Graphs method treats the text in different sections of a web page equally when building the hierarchical model, although in many cases the page title and hyperlink anchor text are more important than the page body. Moreover, in the crawling stage the Context Graphs method cannot update its model dynamically; if newly acquired on-topic pages were used to update the model, more accurate results might be obtained. Based on these two points, the author proposes M-Context Graphs, which uses a mixed rating method and a dynamic updating mechanism, and designs and implements a prototype system. This thesis first reviews domestic and international research on focused crawling strategies and briefly introduces several crawler applications. It then details the M-Context Graphs method, including the mixed rating method and the dynamic updating mechanism. Based on the improved method, a design and implementation of the focused crawler system is also given. Finally, the prototype system is used to compare the M-Context Graphs method with earlier methods. Experimental results show that the proposed method performs better.
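Subject	Computer applications; other disciplines of computer applications
Language	Chinese
Date available	2010-06-07
Source URL	[http://124.16.136.157/handle/311060/2326]
Collection	Institute of Software_Laboratory of Human-Computer Interaction and Intelligent Information Processing_Theses
Recommended citation (GB/T 7714)	Chen Xing. Design and Implementation of a Focused Crawler System Based on Context Graphs [D]. Beijing. Graduate University of Chinese Academy of Sciences. 2010.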

Ingest method: OAI harvesting

Source: Institute of Software
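
The abstract describes three ideas: pages are grouped into layers by how many links separate them from on-topic pages, text from titles and anchors can count for more than body text when scoring, and newly found pages can be fed back into the layer model. The following is a minimal Python sketch of those ideas, not the thesis's implementation. This record does not give the actual formulas, so the field weights, the term-overlap score, the "other" bucket for unmatched pages, and every function name here are assumptions made for illustration.

```python
from collections import Counter
import heapq
import re

# Hypothetical field weights: the abstract only says titles and anchor text
# can matter more than body text; these numbers are illustrative.
FIELD_WEIGHTS = {"title": 3.0, "anchor": 2.0, "body": 1.0}


def tokenize(text):
    """Lowercase alphanumeric tokens; a stand-in for real text preprocessing."""
    return re.findall(r"[a-z0-9]+", text.lower())


def mixed_score(page, layer_terms):
    """Weighted share of each field's tokens that occur in a layer's vocabulary."""
    score = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        tokens = tokenize(page.get(field, ""))
        if tokens:
            hits = sum(1 for t in tokens if t in layer_terms)
            score += weight * hits / len(tokens)
    return score


def assign_layer(page, layers):
    """Return (layer_index, score) for the best-matching layer.

    Layer 0 holds on-topic pages, layer 1 pages one link away, and so on.
    A page that matches no layer goes to an assumed "other" bucket,
    index len(layers).
    """
    scores = [mixed_score(page, terms) for terms in layers]
    best = max(range(len(layers)), key=lambda i: scores[i])
    if scores[best] == 0.0:
        return len(layers), 0.0
    return best, scores[best]


def push(frontier, url, page, layers):
    """Queue a URL so that lower layers, then higher scores, are popped first."""
    layer, score = assign_layer(page, layers)
    heapq.heappush(frontier, (layer, -score, url))


def update_layer(layer_counts, page):
    """Feedback step: fold a newly found page's tokens into a layer's term counts."""
    for field in FIELD_WEIGHTS:
        layer_counts.update(tokenize(page.get(field, "")))
```

Each layer is represented as a `Counter` of terms, so `update_layer` can extend a layer's vocabulary in place as the crawl finds new on-topic pages. A real system would more likely train a per-layer text classifier, but the control flow (score, assign a layer, prioritize, feed back) would be similar.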

Unless otherwise stated, all content in this system is protected by copyright, and all rights are reserved.