Chinese Academy of Sciences Institutional Repositories Grid (中国科学院机构知识库网格)
Correlation between Texts Based on Semantic Analysis (基于语义的文本关联性分析)

Document type: Thesis

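Author: Liu Xianda (刘贤达)
Degree type: Master of Engineering
Defense date: 2011-05-27
Degree-granting institution: Graduate University of the Chinese Academy of Sciences (中国科学院研究生院)
Place of conferral: Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所)
Advisor: Yang Yiping (杨一平)
Keywords: Correlation; HowNet; Conceptual Knowledge Tree knowledge representation system; Keyword Network; Similarity
Alternative title: Correlation between Texts Based on Semantic Analysis
Degree discipline: Computer Application Technology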
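Abstract (translated from the Chinese): With the rapid growth of information on the web, improving the ability of information retrieval systems to process natural language has become a research focus. Text correlation computation, a fundamental technique in information retrieval, directly affects the quality of retrieval results. Traditional methods based on matching word strings are no longer adequate for today's complex language correlation problems. This thesis therefore proposes a semantics-based method for analyzing text correlation: with semantics at its core, it builds a keyword network between texts and analyzes their semantic correlation. The main contents are as follows.
1. Building the keyword network. The thesis analyzes the elements and structure of papers, introduces keyword features, and explains in detail the basic idea and computation of features such as the first-position feature, the first-occurrence position feature (POS), term frequency, TF×IDF, part of speech, and document length. It discusses four common keyword extraction methods and, given the available resources, adopts a statistics-based method. Finally, it defines the keyword network and the "core word", "branch-and-leaf word", and "potential word" nodes within it.
2. Studying and explaining two knowledge representation systems: HowNet and the Conceptual Knowledge Tree. In HowNet, the sememe is the basic unit of expression, and word senses are composed of sememes; HowNet describes each concept with a knowledge description language. In the Conceptual Knowledge Tree, the concept is the basic unit of expression, and each concept is described in terms of attributes, relations, and behaviors. The two systems are combined to analyze the semantics of automation-discipline vocabulary.
3. Analyzing text correlation. First, an improved HowNet-based algorithm for similarity between words is proposed. The improved sememe similarity algorithm takes into account the influence of the depth and the regional density of the concept hierarchy tree. The improved word-sense similarity algorithm handles sememe weighting by case analysis. Next, the structure of automation-discipline vocabulary is analyzed, and algorithms are proposed for determining the semantics of such terms and for computing the similarity between them. Finally, combined with the keyword network, a semantic analysis algorithm for text correlation is proposed.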
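The abstract names TF×IDF as one of the statistical keyword features but does not give the thesis's exact weighting. The following is a minimal sketch of the textbook TF×IDF score only, assuming texts are already tokenized; the function name `tf_idf_keywords` and the toy documents are hypothetical and do not come from the thesis.

```python
import math
from collections import Counter


def tf_idf_keywords(docs, top_k=3):
    """Rank the terms of each tokenized document by a plain TF x IDF score.

    docs: list of token lists (tokenization is assumed to be done already).
    Returns, per document, the top_k (term, score) pairs.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    results = []
    for tokens in docs:
        tf = Counter(tokens)
        total = len(tokens)
        scores = {
            # Relative term frequency times log inverse document frequency.
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        results.append(ranked[:top_k])
    return results


# Hypothetical toy corpus, for illustration only.
DOCS = [
    ["control", "system", "control", "robot"],
    ["system", "network", "keyword"],
    ["system", "similarity", "keyword"],
]
RANKED = tf_idf_keywords(DOCS)
```

A term such as "system" that occurs in every document gets an IDF of zero, so it never outranks a term that is concentrated in one document. The other features listed in the abstract (position, part of speech, document length) are not modeled here.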
English abstract: With the rapid growth of information on the Internet, improving the performance of information retrieval (IR) systems through natural language processing (NLP) has become a research focus. As a fundamental technique in IR systems, the computation of correlation between texts directly affects retrieval results. However, the traditional approach computes correlation by keyword string matching, which is inadequate for complex text correlation problems. This thesis therefore addresses the correlation between papers in the automation discipline on the basis of semantic analysis. The main content is as follows.
1. Build keyword networks. First, I analyze the elements and structure of papers. Then I introduce the characteristics of keywords and explain five of them in detail, including the first position, the term sequence, the TF-IDF value, the part of speech, and the document length. I also discuss four common ways of extracting keywords and decide to use the statistics-based method. Finally, I define keyword networks and put forward three kinds of nodes in them: the “core-word” node, the “leaf-word” node, and the “potential-word” node.
2. Explain the structure of knowledge representation: HowNet and the Conceptual Knowledge Tree. In HowNet, the sememe is the unit of semantic meaning, and a concept is made up of sememes. Each concept is expressed in a knowledge description language. In the Conceptual Knowledge Tree, attributes, relations, and behaviors are used to describe a concept. I use these two knowledge representation systems to analyze automation-discipline words on the basis of semantics.
3. Analyze correlation between texts. First, I use HowNet as the semantic representation foundation to compute similarity between common words, and I improve the algorithms for computing similarity between sememes and between concepts. When computing similarity between sememes, I take the height and density of the sememe tree into consideration. When computing similarity between concepts, I solve the problem of sememe weighting by treating each case separately. Second, I analyze the structure of automation-discipline words, put forward an algorithm to determine their semantic meaning with the help of the Conceptual Knowledge Tree, and then compute the similarity between them. Finally, I present the algorithm for computing correlation between papers: map papers...
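The abstract says that sememe similarity takes the depth of the hierarchy into account, but it does not state the formula. The sketch below is therefore an assumption, not the thesis's algorithm: it starts from the common distance-based form sim = α / (d + α) and scales α by the depth of the lowest common ancestor, so that two sememes the same path length apart score higher when they sit deeper in the tree. The toy hierarchy `PARENT`, the function names, and the value α = 1.6 are all hypothetical, and the density factor mentioned in the abstract is left out.

```python
# Hypothetical toy sememe hierarchy: child -> parent ("entity" is the root).
PARENT = {
    "thing": "entity",
    "event": "entity",
    "animal": "thing",
    "plant": "thing",
    "dog": "animal",
    "cat": "animal",
}


def path_to_root(node):
    """Return [node, parent, ..., root] for a node of the toy hierarchy."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def sememe_similarity(a, b, alpha=1.6):
    """Depth-aware similarity (illustrative assumption, see lead-in).

    d     = number of edges between a and b through their lowest common ancestor
    depth = depth of that ancestor, counting the root as 1
    sim   = alpha * depth / (d + alpha * depth)
    """
    path_a = path_to_root(a)
    path_b = path_to_root(b)
    steps_from_a = {node: i for i, node in enumerate(path_a)}
    for steps_from_b, node in enumerate(path_b):
        if node in steps_from_a:
            d = steps_from_a[node] + steps_from_b
            depth = len(path_to_root(node))
            return alpha * depth / (d + alpha * depth)
    return 0.0  # no common ancestor: the nodes lie in unrelated trees


SIM_DEEP = sememe_similarity("dog", "cat")        # common ancestor "animal", depth 3
SIM_SHALLOW = sememe_similarity("thing", "event")  # common ancestor "entity", depth 1
```

Both pairs are two edges apart, yet the deeper pair receives the higher score, which is the qualitative effect a depth term is meant to capture.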
Language: Chinese
Date made public: 2015-09-08
Other identifier: 200828009029013
Source URL: http://ir.ia.ac.cn/handle/173211/7560
Collection: Graduates, master's theses
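Recommended citation
GB/T 7714
Liu Xianda (刘贤达). Correlation between Texts Based on Semantic Analysis (基于语义的文本关联性分析) [D]. Institute of Automation, Chinese Academy of Sciences. Graduate University of the Chinese Academy of Sciences. 2011.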

Ingest method: OAI harvesting

Source: Institute of Automation

