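Research on Key Technologies for Topic-Specific Search Systems (主题搜索系统关键技术研究)
Document type: Dissertation
Author | 白鹤 (Bai He) |
Degree | Doctoral |
Defense date | 2009-05-26 |
Degree-granting institution | Institute of Acoustics, Chinese Academy of Sciences (中国科学院声学研究所) |
Place conferred | Institute of Acoustics |
Keywords | topic-specific search; Deep Web; main semantic block; crawling strategy; data-intensive |
Alternative title | Research on Key Technologies for Topic-specific Web Search Engine System |
Major | Signal and Information Processing |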
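Chinese abstract (translated) | Search engines are a foundational application of today's Internet, helping users query massive amounts of data. However, the traditional whole-web search model has limitations such as difficulty keeping the index up to date and low query accuracy. The emergence and development of topic-specific search can effectively make up for these shortcomings, and its key technologies have become a hot topic in current Internet research. This work examines the state of topic-specific search engines in depth, summarizes the system's functional requirements from the business, framework, engineering and algorithm perspectives, and carries out research on that basis. The research covers: a topic-specific search system architecture that accommodates multiple business lines, an algorithm for extracting the main semantic block of a Web page, an automatic query scheme for Deep Web interface pages, a topic crawling strategy for catalog-style pages, and a data extraction algorithm for "text-style" (正文式) data-intensive pages. The work proposes a solution or improvement for each of these; the main contributions are as follows: 1. An improved search system architecture based on data extractors is proposed. The architecture trains data extraction patterns in advance and supports multiple topic services through a classify-and-label strategy, improving on earlier systems that could provide only a single topic search service. For the distributed crawler system, the architecture implements a task partitioning algorithm based on weighted least-connection scheduling, which improves on the previous hash-based even-distribution strategy and raises both resource utilization and the scalability of the distributed crawler. 2. A method is proposed that uses an SVM classification model to distinguish the main semantic block of a Web page. After subsequent verification of the positive result set, the accuracy of locating the best main block node reaches 92.3%. The scheme successfully brings a text classification model into the field of page information extraction; compared with other page segmentation methods, it is domain- and platform-independent, and its accuracy is about two percentage points higher than that of the Data-Rover system, which held the best record. 3. A scheme for automatically querying Deep Web interfaces based on a domain instance library is proposed. The scheme implements a modeling method for the Deep Web domain instance library and, for the first time, gives a complete description of the elements and attributes of domain instances and the constraints among them. Test results show that, once a certain number of samples has been accumulated, the algorithm achieves a schema matching accuracy above 91%, which ensures correct automatic querying of Deep Web interface pages. 4. A topic crawling strategy for catalog-style pages is proposed. Unlike general topic crawling algorithms, which analyze page content or links, this algorithm starts from the structural features of the page and derives a set of a priori rules to guide the extraction of central links and pagination information. Experiments show that the F1 score for valid link extraction reaches 85.6%, more than 60% higher than the representative Fish-Search algorithm. 5. Two knowledge discovery algorithms for "text-style" data-intensive pages are proposed. 1) Combining theory from statistics and signal processing, an algorithm is implemented to extract the body text of "text-style" pages; it applies the FFT to information extraction for the first time and achieves an extraction accuracy of 91.9%. 2) Based on meta-search technology and some prior knowledge of DOM structure, an algorithm is implemented to automatically extract structured information from news pages. Without human intervention, its precision reaches a very good 88.2%. |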
English abstract | Search engines are one of the basic applications of today's Internet; they help users retrieve useful information from an enormous amount of available data. Traditionally, these systems offer whole-web search, which is limited by how often the index can be updated and by low search precision. The emergence and development of topic-specific search effectively address these shortcomings, and its key technologies have become an active area of Internet research. This study examines topic-specific search engine systems in depth. The functional requirements of the system are summarized from several aspects, including business, framework, engineering and algorithm. Based on these requirements, the research focuses on: the architecture of a topic-specific search system, a method for retrieving the main semantic segment of a web page, an automatic querying scheme for deep web interfaces, a focused crawling policy for “Catalog-style” web pages, and a data extraction algorithm for “Content-Dominated” web pages. Solutions and improved schemes are proposed for each of these problems. The contributions and innovative work of this dissertation are as follows: 1. An improved architecture for a focused search system based on data extractors is proposed. Given trained data extractor models, it can accommodate multiple businesses with a category-label policy, improving on the situation in which one system provides only a single topic search service. For the distributed crawler, a weighted least-connection scheduling algorithm is designed to perform target-guided URL assignment based on the agents’ load while providing better dynamic scalability. Furthermore, the hierarchy of website objects is analyzed to guide the focused crawling policy. 2. An approach to discovering the main semantic segment of a web page by applying an SVM model is proposed. After verification of the positive result units, the precision of locating the best main block node reaches 92.3%. This scheme successfully applies a text classification model to web page information extraction. Compared with other page segmentation methods, it is platform-independent and domain-independent, and its precision is about 2 percentage points higher than the best previously reported record (Data-Rover). 3. An approach to automatically querying deep web interface pages based on domain instances is proposed. It implements a modeling method for deep web domain instances which, for the first time, describes their elements, attributes and constraint conditions. Experimental results show that the interface schema matching precision exceeds 91%, which ensures correct automatic querying of deep web interfaces. 4. A topic-specific crawling policy for catalog pages is proposed. Unlike common crawling schemes, which analyze features of contents and links, this method mainly considers structural characteristics, and a series of heuristic rules is derived to find the central URLs and pagination information. Evaluated on catalog pages, the approach reaches an F1 measure of 85.6% for valid URL retrieval, an improvement of 60% over the representative Fish-Search algorithm. 5. Two algorithms for discovering knowledge in “Text-dominated” web pages are proposed. 1) An FFT-based algorithm for main body extraction is presented. By applying window segmentation, statistical theory and the FFT, this method calculates the weight of every possible range and selects the best one as the solution. It achieves an encouraging precision of 91.9%. 2) A meta-search based algorithm is presented to extract the information structure of news pages. Without human intervention, its retrieval precision reaches 88.2%. |
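Language | Chinese |
Release date | 2011-05-07 |
Pages | 143 |
Source URL | http://159.226.59.140/handle/311008/500 |
Collection | Institute of Acoustics_Institute of Acoustics Doctoral and Master's Theses_1981-2009 Doctoral and Master's Theses |
Recommended citation (GB/T 7714) | 白鹤. 主题搜索系统关键技术研究[D]. 声学研究所. 中国科学院声学研究所. 2009. |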
Ingest method: OAI harvesting
Source: Institute of Acoustics
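
The abstract contrasts hash-based URL assignment with weighted least-connection scheduling for the distributed crawler. The following is a minimal sketch of that contrast, not the dissertation's implementation: the `Agent` fields, the function names and the weights are assumptions made for illustration, and the selection rule shown (lowest ratio of active tasks to weight) is the usual formulation of weighted least-connection scheduling.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass


@dataclass
class Agent:
    """A hypothetical crawler node; all fields are illustrative assumptions."""
    name: str
    weight: int      # relative capacity; higher means the node can take more work
    active: int = 0  # URLs currently assigned and not yet finished


def assign_by_hash(url: str, agents: list[Agent]) -> Agent:
    """Baseline: a stable hash of the URL picks the agent, ignoring load and capacity."""
    digest = hashlib.md5(url.encode("utf-8")).digest()
    return agents[int.from_bytes(digest[:4], "big") % len(agents)]


def assign_weighted_least_connection(agents: list[Agent]) -> Agent:
    """Pick the agent with the smallest active/weight ratio and record the assignment."""
    candidates = [a for a in agents if a.weight > 0]
    if not candidates:
        raise ValueError("no agent with positive weight")
    # min() returns the first minimum, so ties go to the agent listed first.
    chosen = min(candidates, key=lambda a: a.active / a.weight)
    chosen.active += 1
    return chosen


def simulate(num_urls: int = 8) -> dict[str, int]:
    """Assign num_urls tasks to two agents with capacities 3 and 1."""
    agents = [Agent("big", weight=3), Agent("small", weight=1)]
    for _ in range(num_urls):
        assign_weighted_least_connection(agents)
    return {a.name: a.active for a in agents}


if __name__ == "__main__":
    print(simulate(8))
```

With weights 3 and 1, eight assignments are split 6 to 2, so load follows capacity. The hash baseline would instead send each URL to a fixed agent regardless of how busy that agent is, and adding an agent changes the modulus and reshuffles most URLs.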