Chinese Academy of Sciences Institutional Repositories Grid
Research on a Content-Oriented Information Retrieval Model (面向内容的信息检索模型研究)

Document type: Dissertation

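Author: Wu Chen (吴晨)
Degree: Doctoral
Defense date: 2007-06-06
Degree-granting institution: Institute of Acoustics, Chinese Academy of Sciences
Place conferred: Institute of Acoustics
Keywords: information retrieval; HNC theory; statistical natural language processing; semantics; language model
Alternative title: Content-oriented Information Retrieval Model
Major: Signal and Information Processing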
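Abstract (translated from Chinese): This dissertation addresses problems with information retrieval models, a current research focus in natural language processing. Drawing on the state of research in statistical natural language processing and in HNC natural language understanding, it proposes a new approach that builds retrieval models by combining semantic methods with statistical methods, and it gives a step-by-step plan for constructing a content-oriented information retrieval model together with the specific model for each stage. The main contributions are:
1) A new approach to building retrieval models that combines semantics with statistics. A step-by-step construction plan for a content-based information retrieval system is given, and its feasibility is demonstrated through the study of two key models.
2) Shortcomings of current statistics-based retrieval systems were identified during the research. Drawing on the characteristics of HNC semantic representation, the dissertation proposes targeted improvements and implements them in the DGMSys model, which performed well in the final tests and reached a high level on the precision-recall measure.
3) In exploring a "retrieval model based on sentence-group semantics", rules for segmenting and identifying sentence groups were formulated that meet the needs of computer processing. These rules build on the existing HNC language concept space representation and take full account of how sentence groups are composed.
4) The proposed retrieval model, which is based on lexical concept knowledge and uses concepts as an intermediary, provisionally solves the data sparseness problem. Experiments show that using concepts as the retrieval intermediary greatly reduces the size of the system's index file, which effectively improves the retrieval speed of the concept-based system.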
English abstract: Most information retrieval (IR) models rely purely on statistical language processing (SLP) methods, which focus mainly on term frequency within a document and do not attempt to analyze or differentiate the meanings of terms; as a result, they fall short of the intelligent system one would ideally want. To address this issue, and building on current research in both natural language understanding (NLU) and SLP, this dissertation proposes a new scheme intended to speed progress toward an ideal IR model by combining NLU methods with SLP methods. In the scheme, semantic extraction and semantic representation are embedded into term weighting and similarity measurement. Two IR models based on the scheme are proposed, differing in the proportion of SLP methods within the overall model: the “Concept-based Glossary Model (CGM)” and the “Concept-based Sentence group Model (CSM)”. Both models introduce concepts, that is, formalized expressions of meaning. The main contributions of this dissertation are as follows:
1) A new scheme intended to speed research toward an ideal IR model. Its distinguishing feature is that SLP methods and NLU methods are integrated with each other. Experiments on the scheme indicate that IR models with semantics outperform those without (by 2% to 8%).
2) Several shortcomings of current statistical IR models were identified while studying the models in this dissertation. By taking these points into account, the proposed models perform well.
3) A method for delimiting Chinese sentence groups based on the semantic relationships between sentences. Using the symbolic system of the language concept space defined by HNC, formalized rules for detecting Chinese sentence groups are also presented.
4) The proposed concept-based model provisionally solves the data sparseness problem in IR models. The concept-based IR system cuts the number of distinct tokens roughly in half: on a test collection of 381,375 documents, the word-based system has about 251,206 tokens, while the concept-based system has about 120,821, nearly half as many. As a result, retrieval takes less time in the concept-based system than in the word-based one.
In summary, this dissertation proposes a new scheme that takes advantage of both NLU methods and SLP methods. Experiments on the IR systems built according to the scheme indicate the feasibility and effectiveness of both the scheme and the models.
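The effect described in point 4 can be illustrated with a small sketch: when an inverted index is keyed by concepts instead of surface words, synonyms collapse into one entry, so the index has fewer keys and a query can match documents that use different wording. This is only an illustration under stated assumptions, not the dissertation's implementation. The `WORD_TO_CONCEPT` table, the concept labels, and the toy documents are all invented for the example and are not HNC concept symbols.

```python
from collections import defaultdict

# Hypothetical word-to-concept table; the labels are invented for this sketch
# and are not HNC concept symbols.
WORD_TO_CONCEPT = {
    "car": "VEHICLE",
    "automobile": "VEHICLE",
    "truck": "VEHICLE",
    "buy": "ACQUIRE",
    "purchase": "ACQUIRE",
    "doctor": "PHYSICIAN",
    "physician": "PHYSICIAN",
}


def to_concept(word):
    """Map a word to its concept label; unknown words stand for themselves."""
    return WORD_TO_CONCEPT.get(word, word)


def build_index(docs, key=lambda w: w):
    """Build an inverted index: index key -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[key(word)].add(doc_id)
    return dict(index)


def search(index, query, key=lambda w: w):
    """Return ids of documents that match every query term (boolean AND)."""
    postings = [index.get(key(w), set()) for w in query.lower().split()]
    return set.intersection(*postings) if postings else set()


# Toy collection, invented for illustration.
DOCS = {
    1: "buy a car",
    2: "purchase an automobile",
    3: "the doctor drives a truck",
    4: "a physician will buy a truck",
}

word_index = build_index(DOCS)
concept_index = build_index(DOCS, key=to_concept)

# Number of index keys with and without the concept mapping.
print(len(word_index), len(concept_index))
# The same query against both indexes.
print(search(word_index, "purchase car"))
print(search(concept_index, "purchase car", key=to_concept))
# Ratio of the two token counts quoted in the abstract.
print(round(120821 / 251206, 3))
```

On this toy collection the word index has 12 keys and the concept index has 8. The query "purchase car" matches nothing in the word index but matches documents 1, 2, and 4 through the concept index. The last line divides the two token counts quoted in the abstract and gives about 0.481, which is consistent with the "nearly 1/2" figure.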
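Language: Chinese
Date made public: 2011-05-07
Pages: 136
Source URL: [http://159.226.59.140/handle/311008/212]
Collection: Institute of Acoustics_IOA Doctoral and Master's Theses_1981-2009 Doctoral and Master's Theses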
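Recommended citation
GB/T 7714
Wu Chen. Research on a Content-Oriented Information Retrieval Model (面向内容的信息检索模型研究)[D]. Institute of Acoustics. Institute of Acoustics, Chinese Academy of Sciences. 2007.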

Ingest method: OAI harvesting

Source: Institute of Acoustics

Unless otherwise stated, all content in this system is protected by copyright, and all rights are reserved.