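Chinese Academy of Sciences Institutional Repositories Grid System: The Recognition and Tagging of Chinese Place Name and Time Expression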

Chinese Academy of Sciences Institutional Repositories Grid

The Recognition and Tagging of Chinese Place Name and Time Expression (中文地名与时间的识别和标注)

Document type: Dissertation


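Author	Li Nuo (李诺)
Degree	Doctoral
Defense date	2009-05-27
Degree-granting institution	Institute of Acoustics, Chinese Academy of Sciences
Place conferred	Institute of Acoustics
Keywords	place names; two-stage recognition; maximum entropy model; feature functions; HNC theory; time expressions; tagging
Alternative title	The Recognition and Tagging of Chinese Place Name and Time Expression
Degree discipline	Signal and Information Processing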
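Abstract (translated from Chinese)	Place names and time information are two key pieces of information that describe the background of an event. Correctly recognizing place names and time expressions helps improve the performance of Chinese word segmentation, out-of-vocabulary word recognition, and named entity recognition. This work is also a foundation for information retrieval, content extraction, and question answering systems, so it is of considerable research significance. Chinese place names and time expressions appear in flexible and varied forms in real corpora, which makes them difficult to process. This dissertation designs and implements a system for recognizing and tagging Chinese place names and time expressions. Building on a thorough analysis of the internal structure and context information of each, it first performs an initial recognition pass that combines statistics and rules, and then applies a maximum entropy model to the initial results in a second pass to obtain the final result. Semantic concept knowledge is introduced into the maximum entropy model to improve its overall recognition performance. Finally, the dissertation studies the tagging of Chinese place names and time expressions. Specifically, the main research content and progress are as follows: 1. A Chinese place name recognition system was implemented. By training on a large number of Chinese place names and analyzing their compositional characteristics, an N-gram method is used for initial recognition, yielding initial results with recall above 97%. A maximum entropy model is then applied, combining multiple features of different kinds. In tests on real corpora, the final F value for Chinese place names reaches 88% (closed test) and 84% (open test). 2. For feature selection in the maximum entropy model, HNC concept attributes are introduced. Experimental data show that adding HNC concept attribute features improves recognition performance by 1%. The dissertation also uses variable-length feature windows, and reports and analyzes place name recognition results on a small-scale test set. 3. A system for recognizing and tagging Chinese time expressions was implemented. As with place names, the dissertation first analyzes the compositional structure of time expressions and, building on internationally used time annotation specifications such as TIMEX2, refines the definition of Chinese time expressions. Recognition uses regular expressions together with a maximum entropy statistical model, giving an F value of about 81% (closed test). Correctly recognized time expressions are then tagged using an implementation of the TIMEX2 annotation method; on real corpora, the tagging F value reaches 86%. Finally, the dissertation studies the relationship between time expressions and the time at which an event occurs. 4. Building on the recognition of Chinese place names and time expressions, the tagging of Chinese place names was studied. A regional information knowledge base was designed and built, covering the population, area, longitude and latitude, postal code, and administrative division of Chinese place names, and this knowledge base is used to guide place name tagging. In summary, the dissertation analyzes the compositional structure of place names and time expressions, recognizes both with a two-stage recognition scheme, and on that basis analyzes the tagging of each. The results can serve as a standalone system for recognizing and extracting place names and time expressions, or as a part or module of language information processing systems such as Chinese word segmentation, text retrieval, and machine translation.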
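The abstract reports its results as recall and F values. As a minimal sketch of how such scores are conventionally computed from entity counts, the following assumes the balanced F-measure (beta = 1); the record does not state which variant the dissertation uses. The function name and the counts are hypothetical and are not figures from the dissertation.

```python
def precision_recall_f(num_correct, num_predicted, num_gold, beta=1.0):
    """Return (precision, recall, F_beta) from entity recognition counts.

    num_correct:   predicted entities that match a gold-standard entity
    num_predicted: all entities the system output
    num_gold:      all entities in the gold standard
    """
    precision = num_correct / num_predicted if num_predicted else 0.0
    recall = num_correct / num_gold if num_gold else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta * beta
    f_value = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_value


# Hypothetical counts for illustration only.
p, r, f = precision_recall_f(num_correct=80, num_predicted=100, num_gold=90)
print(f"precision={p:.3f} recall={r:.3f} F={f:.3f}")
```

A first pass tuned for high recall, as described in the abstract, deliberately accepts lower precision; the second pass then has to remove false candidates for the F value to rise.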
English abstract	ABSTRACT Place names and time expressions are two kinds of key information that describe the background of a concrete event. Accurate recognition of place names and time expressions helps improve the performance of word segmentation, out-of-vocabulary word recognition, and named entity recognition. This work is also a foundation for information retrieval, content extraction, and question answering systems, and is therefore important. However, recognizing and tagging place names and time expressions is difficult because they take many different forms. This dissertation focuses on designing and implementing a system that recognizes and tags place names and time expressions, with particular attention to mining their context information. First, we recognize place names and time expressions using statistical methods and rules. Second, we apply a maximum entropy model to the initial results to obtain the final recognition. In designing the feature functions of the maximum entropy model, we also use semantic information to improve the results. Finally, we study the task of tagging place names and time expressions. The main research work of this dissertation is as follows: 1. We implement a Chinese place name recognition system. We analyze a large number of Chinese place names to extract their features, and then obtain initial recognition results using statistical data and an N-gram method. The recall of the initial recognition reaches 97%. Afterwards, we use a maximum entropy model to combine different context features. The F value of the system on a real corpus reaches 88% (closed) and 84% (open). 2. For the feature functions of the maximum entropy model, we introduce HNC concept features. The experimental results show that these semantic features contribute a 1% improvement. We also vary the window length of the maximum entropy model and analyze the results. 3. We implement a time expression recognition and tagging system. As with place names, we first analyze the features of time expressions. Based on international time expression tagging standards such as TIMEX2, we refine the rules for tagging Chinese time expressions. The F value of time expression recognition with the maximum entropy model reaches 81% (closed). For correctly recognized expressions, we implement the tagging system; the F value of tagging reaches 86%. Finally, we study the relationship between time expressions and the time of an event. 4. Building on the recognition of place names and time expressions, we study the tagging of place names. We design and build a place name information database, which includes population, area, longitude and latitude, postal code, administrative division, and so on. We then use this database to guide the tagging of place names. In summary, we analyze the features of place names and time expressions, adopt a two-step method to recognize both, and on that basis analyze the tagging task. The results of this dissertation can be used as a standalone system for recognizing and extracting place names and time expressions, or as a module of word segmentation, text retrieval, machine translation, or other language information processing systems.
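The Chinese abstract mentions regular expressions and TIMEX2-style tagging for time expressions. The following is a minimal sketch of that general idea, not the dissertation's actual rules: it handles only full calendar dates written with digits, and wraps each match in a TIMEX2 element whose VAL attribute holds the normalized date. The pattern and the function name `tag_dates` are assumptions made for illustration.

```python
import re

# Assumed pattern: four-digit year, then month and day, e.g. 2009年5月27日.
# Real Chinese time expressions are far more varied (Chinese numerals,
# relative expressions, durations), which this sketch does not cover.
DATE_PATTERN = re.compile(r"(\d{4})年(\d{1,2})月(\d{1,2})[日号]")


def tag_dates(text):
    """Wrap each matched date in a TIMEX2 element with a normalized VAL."""

    def annotate(match):
        year, month, day = (int(group) for group in match.groups())
        value = f"{year:04d}-{month:02d}-{day:02d}"
        return f'<TIMEX2 VAL="{value}">{match.group(0)}</TIMEX2>'

    return DATE_PATTERN.sub(annotate, text)


print(tag_dates("论文于2009年5月27日答辩。"))
```

For the sample sentence, the date 2009年5月27日 is wrapped with VAL="2009-05-27". The sketch does not check that a matched date is valid (for example, month 13), and a statistical second pass such as the maximum entropy model described in the abstract would be needed to filter or extend such rule-based candidates.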
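Language	Chinese
Date made public	2011-05-07
Pages	64
Source URL	[http://159.226.59.140/handle/311008/552]
Collection	Institute of Acoustics_Institute of Acoustics Doctoral and Master's Theses_1981-2009 Doctoral and Master's Theses
Recommended citation (GB/T 7714)	Li Nuo. The Recognition and Tagging of Chinese Place Name and Time Expression (中文地名与时间的识别和标注)[D]. Institute of Acoustics. Institute of Acoustics, Chinese Academy of Sciences. 2009.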

Ingest method: OAI harvesting

Source: Institute of Acoustics

Unless otherwise stated, all content in this system is protected by copyright, with all rights reserved.