Chinese Academy of Sciences Institutional Repositories Grid (中国科学院机构知识库网格)
Research on Entity Search and Its Application (实体搜索理论与应用研究)

Document type: Dissertation

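Author: Yang Liu (杨柳)
Degree: Doctor of Engineering
Defense date: 2010-06-05
Degree-granting institution: Graduate University of Chinese Academy of Sciences (中国科学院研究生院)
Place of conferral: Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所)
Advisor: Zhang Wensheng (张文生)
Keywords: Expert Search; Relationship Evidences; Query Proximity; Semantic Relation; Web Entity Summarization
Alternative title: Research on Entity Search and Its Application
Major: Pattern Recognition and Intelligent Systems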
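Abstract (translated from Chinese): The World Wide Web is one of the most popular services on the Internet. It has made it far easier for organizations, research institutions, universities, companies, and individuals around the world to share information resources, and it has by now grown into the world's largest repository of information. Search engines are the main tool for retrieving information on the Web. Traditional search engine research builds on the theory of document retrieval, aims to return the web pages most relevant to the user, and has made many valuable explorations into improving the relevance of search results. However, with the explosive growth of the Web and the increasing specialization of user needs, more and more users care about the specific information contained in web pages rather than the pages themselves, and the limitations of traditional search engines have become increasingly apparent. Descriptions of entities such as people, products, organizations, and places are scattered throughout the vast data space of the Web. As the Web becomes more social and more semantic, how to find the entity information users need effectively and accurately has attracted wide attention in recent years. Entity search belongs to information retrieval and involves two main problems: entity ranking and entity description. Entity ranking aims to find, within a given set of entities, the subset most relevant to a query; entity description aims to generate a reasonable description of an entity's attributes and characteristics. The main work and contributions of this thesis are as follows:

1. Among entity ranking problems, expert search has been studied most thoroughly. Previous work confirmed that modeling either the distance or the order relationship evidence between candidates and query terms on its own improves accuracy, but neglected to compare or combine the two. This thesis proposes an order kernel function to model order relationship evidence, and further proposes two probabilistic frameworks that model different kinds of relationship evidence in a unified way. Experiments on the standard TREC dataset compare the practical effect of the different kinds of relationship evidence and confirm the effectiveness of combining them for expert search.

2. The distance relationship among the multiple query terms in a query topic, i.e. query proximity, strongly affects the accuracy of traditional document retrieval, but so far no one has considered applying it to expert search. This thesis proposes an expert search algorithm based on query proximity, demonstrates its effectiveness through experiments on the standard TREC dataset, and further confirms that combining it effectively with the relationship evidence between candidates and query terms yields better results.

3. Most current research on entity search relies on structured and semi-structured text in web pages and neglects the free text that web pages contain in abundance. This thesis explores the feasibility of entity search using the free text of web pages. For entity ranking, building on expert search algorithms, it proposes an entity ranking algorithm that combines semantic relations and statistical relations between entities. For entity description, it gives the first definition of Web entity summarization, proposes four important features that effectively improve summary accuracy, and, building on the MMR algorithm for multi-document summarization, proposes the MMR-WE algorithm for Web entity summarization. A large-scale people search engine, "人立方" (Renlifang), was built as the experimental platform on which these algorithms were validated.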
English abstract: The World Wide Web is one of the most popular services built on the Internet, and it has greatly facilitated the sharing of information resources among organizations, research institutes, universities, enterprises, and individuals. The Web has by now evolved into the largest repository of information in the world. Search engines are the main tool for finding and retrieving information on the Web. Traditional search engine research aims at providing the most relevant web pages to the user, and has explored many ways to improve the relevance of the returned results. However, as the Web grows rapidly and user needs become more specialized, more and more web users are shifting their interest from the web page itself to the actual information within it. In a data space as vast as the Web, information about real-world entities such as people, products, organizations, and locations is scattered throughout. As the Web becomes more semantic and more social, there has been increasing research interest in finding information about real-world entities efficiently and effectively. Entity search falls within the scope of information retrieval and involves two main problems: entity ranking and entity description. The former aims at retrieving the most relevant subset of entities from a given set, and the latter aims at generating reasonable representations of the attributes and characteristics of entities. Overall, the work and contributions of this thesis are the following:

1. Entity ranking has been studied most thoroughly in the field of expert finding. Existing approaches have shown the benefits of using relationship evidence for this task by modeling either the distance or the sequential dependencies between candidates and query terms. However, these kinds of evidence have not yet been compared or combined. In this thesis we propose an order kernel function to model the sequential relationship evidence, and further construct two unified probabilistic frameworks to combine different kinds of evidence. Our experimental results on the standard TREC dataset show that the distance and sequential evidence achieve comparable performance gains over the baseline, and that a combination of both performs better than either of them alone.

2. The distance between the query terms, namely the query proximity, has a great influence on the precision of ad hoc retrieval methods. However, to th...
Language: Chinese
Other identifier: 200618014628056
Source URL: http://ir.ia.ac.cn/handle/173211/6297
Collection: Graduates_Doctoral Dissertations (毕业生_博士学位论文)
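Recommended citation (GB/T 7714):
杨柳. 实体搜索理论与应用研究[D]. 中国科学院自动化研究所. 中国科学院研究生院. 2010.
(Yang Liu. Research on Entity Search and Its Application [D]. Institute of Automation, Chinese Academy of Sciences. Graduate University of Chinese Academy of Sciences. 2010.)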

Ingest method: OAI harvesting

Source: Institute of Automation

Unless otherwise stated, all content in this system is protected by copyright, with all rights reserved.