Chinese Academy of Sciences Institutional Repositories Grid: Research on Key Technologies of Open-Domain Named Entity Extraction

Chinese Academy of Sciences Institutional Repositories Grid

Research on Key Technologies of Open-Domain Named Entity Extraction (开放域命名实体抽取关键技术研究)

Document type: Dissertation


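Author	Qi Zhenyu (齐振宇)
Degree	Doctor of Engineering
Defense date	2013-05-27
Degree-granting institution	University of Chinese Academy of Sciences (中国科学院大学)
Place of conferral	Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所)
Advisor	Zhao Jun (赵军)
Keywords	Open Entity Extraction; Seed Quality Evaluation; New Seeds Generating; Confidence Computing
Alternative title	Research on Methods of Open Domain Named Entity Extraction
Major	Pattern Recognition and Intelligent Systems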
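Chinese abstract (translated)	Open-domain named entity extraction has been an active research topic in information extraction in recent years. Its main task is to extract open-category named entities from multi-source heterogeneous data and build lists of them. The task draws on key techniques from natural language processing, machine learning, pattern classification, and information extraction, so it is of considerable academic interest. It is also a key component of applications such as query analysis and advertisement matching, so it has substantial practical value. The task involves two core problems: first, how to obtain high-quality seed entities; second, how to accurately compute the confidence of candidate entities. This dissertation studies both problems. Its main work and contributions are as follows:
1. A set of seed entity quality evaluation indicators and corresponding measurement methods, which partially solve the seed quality evaluation problem. The quality of seed entities has a very large effect on the results of an open-domain named entity extraction system (the difference between seeds can reach 40% [Vyas, 2009]), so measuring seed quality is an important problem. The dissertation proposes a set of evaluation indicators that incorporate entity semantic knowledge (semantic relatedness, ambiguity, and popularity) and designs a computation method for each. Experiments show that, compared with random seeds, the method improves performance by 9.2%.
2. A method for generating high-quality new seeds that incorporates semantic knowledge, which can effectively obtain high-quality seed entities. Initial seeds entered by hand are usually of rather poor quality [Vyas, 2009], so a way to generate high-quality new seeds is needed. Building on the seed quality indicators above, the dissertation proposes a method that starts from the initial seeds and automatically generates high-quality new ones. Experiments show that, compared with random seeds, the method improves performance by 7.3%.
3. Candidate entity confidence computation based on a graph random walk, which partially solves the candidate confidence computation problem. For pattern-based extraction, the dissertation argues that pattern quality strongly affects the estimate of a candidate's confidence, and that candidate confidence in turn matters for estimating pattern quality. It therefore builds a bipartite graph from the extracts/extracted-by relation between candidate entities and patterns, and runs a random walk on that graph to measure candidate confidence and pattern quality jointly. Experiments show a 4.4% performance improvement over confidence computation based on the pattern vector space. For extraction based on context statistics, the dissertation proposes confidence computation methods based on the entity space and on the document space. Experiments show improvements of 0.8% and 4.9%, respectively, over confidence computation based on context statistics.
4. An open-domain named entity extraction method that combines patterns with web tag expansion, which partially solves the problem of accurately describing candidate entity semantics. To capture the semantics of candidate entities more accurately, the dissertation first proposes an extraction method based on web tag expansion. Compared with traditional pattern-based methods, it extracts candidates more precisely and so reaches higher precision. To compensate for the limited coverage of web tag expansion, the dissertation then combines pattern-based extraction with web-tag-expansion-based extraction into a single method. Experiments show that, compared with traditional pattern-based extraction, the method achieves up to...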
English abstract	Open entity extraction is a new task in natural language processing. Its aim is to extract entities from multi-source heterogeneous data and build lists of entities that belong to the same semantic category, which involves key technologies from pattern recognition, machine learning, information extraction, and related fields. Research on open entity extraction therefore has significant academic value. Open entity extraction also benefits query analysis, advertisement matching, and similar tasks, so it is very useful in real applications. There are two key problems in the open entity extraction task: 1) how to measure the quality of seed entities accurately; 2) how to measure the confidence of candidate entities precisely. This dissertation focuses on these two problems. The main contributions are summarized as follows:
1. Evaluation Criteria and Methods for Measuring the Quality of Seed Entities. The quality of seed entities can greatly influence the results of open entity extraction systems (by as much as 40%), so it is important to study how to measure it. This dissertation proposes a set of evaluation criteria, and corresponding methods, for measuring seed quality based on the semantic knowledge of seed entities. The criteria comprise three indicators: semantic relatedness, ambiguity, and popularity. The experimental results show that our method outperforms the traditional approach of randomly chosen seeds by 9.2%.
2. A Semantic-Knowledge-based Method for High-quality Seed Generation. Seeds entered by human editors are generally of poor quality, so we need to study how to generate high-quality seeds. This dissertation proposes a novel method that generates new, high-quality seeds to replace the original, poor-quality ones. The method uses Wikipedia as a source of semantic knowledge to measure the semantic relatedness and ambiguity of each seed. Moreover, to avoid sparseness of the seed, we use web resources to measure its popularity. New seeds are then generated to replace the original, poor-quality seeds. Experimental results show that the new seed sets generated by our method improve entity expansion performance by up to 7.3% on average over the original seed sets.
3. A Graph-based Random Walk Method for Calculating Confidence of Candidates. This dissertation proposes a graph-based random walk method to calculate the confidence of candidates when a pattern-based strategy is used to extract entities. We con...
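Language	Chinese
Other identifier	200918014628043
Source URL	[http://ir.ia.ac.cn/handle/173211/6518]
Collection	Graduates_Doctoral Dissertations
Recommended citation (GB/T 7714)	Qi Zhenyu. Research on Key Technologies of Open-Domain Named Entity Extraction [D]. Institute of Automation, Chinese Academy of Sciences. University of Chinese Academy of Sciences. 2013.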
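
The abstract's third contribution scores candidate entities and extraction patterns jointly by a random walk on the bipartite graph that links each pattern to the candidates it extracts. The record gives no formulas, so the following is only a minimal sketch of that general idea, not the dissertation's actual method. It assumes a random walk with restart at the seed entities and uniform transition probabilities; the function name `random_walk_scores`, the `restart` and `iterations` parameters, and the toy data are all hypothetical.

```python
from collections import defaultdict


def random_walk_scores(extractions, seeds, restart=0.15, iterations=50):
    """Jointly score candidates and patterns on a bipartite extraction graph.

    extractions: iterable of (pattern, candidate) pairs, one per
                 "pattern extracted candidate" edge.
    seeds:       set of seed entities; the walk restarts at these nodes
                 (an assumption of this sketch, not taken from the source).
    Returns (candidate_scores, pattern_scores) as plain dicts.
    """
    pattern_to_cands = defaultdict(set)
    cand_to_patterns = defaultdict(set)
    for pattern, cand in extractions:
        pattern_to_cands[pattern].add(cand)
        cand_to_patterns[cand].add(pattern)

    seeds_in_graph = [s for s in seeds if s in cand_to_patterns]
    if not seeds_in_graph:
        raise ValueError("no seed entity appears in the extraction graph")
    restart_mass = {s: 1.0 / len(seeds_in_graph) for s in seeds_in_graph}

    cand_score = dict(restart_mass)  # all probability mass starts on the seeds
    pattern_score = {}
    for _ in range(iterations):
        # Half step 1: each candidate spreads its mass evenly over its patterns,
        # so a pattern scores well if it extracts high-confidence candidates.
        pattern_score = defaultdict(float)
        for cand, score in cand_score.items():
            share = score / len(cand_to_patterns[cand])
            for pattern in cand_to_patterns[cand]:
                pattern_score[pattern] += share

        # Half step 2: each pattern spreads its mass evenly over its candidates,
        # so a candidate scores well if good patterns extract it.
        new_cand_score = defaultdict(float)
        for pattern, score in pattern_score.items():
            share = (1.0 - restart) * score / len(pattern_to_cands[pattern])
            for cand in pattern_to_cands[pattern]:
                new_cand_score[cand] += share

        # Restart: return a fixed fraction of the mass to the seed entities.
        for seed, mass in restart_mass.items():
            new_cand_score[seed] += restart * mass
        cand_score = dict(new_cand_score)

    return cand_score, dict(pattern_score)


# Toy usage with made-up pattern ids and candidates.
toy_extractions = [
    ("P1", "Beijing"), ("P1", "Shanghai"), ("P1", "Tokyo"),
    ("P2", "Beijing"), ("P2", "Shanghai"),
    ("P3", "Tokyo"), ("P3", "banana"),
]
candidate_scores, pattern_scores = random_walk_scores(toy_extractions, {"Beijing"})
for name, score in sorted(candidate_scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}\t{score:.3f}")
```

In this toy graph, a candidate that shares patterns with the seed ends up ranked above one that is reachable only through a pattern the seed never matches. This is the mutual reinforcement between pattern quality and candidate confidence that the abstract describes.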

Ingest method: OAI harvesting

Source: Institute of Automation

Unless otherwise stated, all content in this system is protected by copyright, with all rights reserved.