Chinese Academy of Sciences Institutional Repositories Grid: Research on Automatic Text Summarization Methods

Chinese Academy of Sciences Institutional Repositories Grid

Research on Automatic Text Summarization Methods (文本自动摘要方法研究)

Document type: Dissertation


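Author	Wu Xiaofeng (吴晓锋)
Degree type	Doctor of Engineering
Defense date	2010-11-24
Degree-granting institution	Graduate University of Chinese Academy of Sciences (中国科学院研究生院)
Place conferred	Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所)
Advisor	Zong Chengqing (宗成庆)
Keywords	automatic summarization; semi-Markov conditional random fields (SemiCRF); learning to rank; Latent Dirichlet Allocation; paraphrase recognition
Alternative title	Research on Automatic Document Summarization
Major	Pattern Recognition and Intelligent Systems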
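Abstract (translated from Chinese)	Automatic document summarization (ADS) is a subfield of natural language processing. It is the applied technology of using a computer system to generate summaries of text automatically or, put differently, of expressing the main content of a source text concisely according to the needs of the reader (or user). The theoretical value of studying ADS lies in the fact that a complete ADS system touches almost every aspect of natural language processing, so research in this area should help advance the field as a whole. The research also has broad application prospects: with today's highly developed Internet technology, ADS can effectively help people find the content that interests them among retrieved articles and improve the speed and quality of their reading. The main work and contributions of this thesis are summarized as follows: (1) In terms of modeling, this thesis proposes a supervised extractive summarization method based on Sequence Segmentation Models (SSM). In this method, summarization is treated as a "segment labeling" problem. Compared with previous work, the SSM method differs in that features are extracted not only from sentences but also from segments. Our SSM uses Semi-Markov Conditional Random Fields (SemiCRF), which can model and label "segments". Experiments show that this method yields a clear improvement over summarization methods that extract features from sentences only. (2) The other modeling method we propose uses Learning to Rank (LTK) to model generic summarization. The core problem of extractive summarization is scoring sentences; the scores are then used to rank the sentences, and the top-ranked sentences are output. Learning to rank is by nature designed to solve ranking problems, so it is a close fit for extractive summarization. Moreover, modeling with learning to rank puts more emphasis on comparing sentences within the same document, which differs considerably from earlier modeling approaches. We compared several currently popular learning-to-rank algorithms on the summarization problem and used listwise learning to rank for the first time. Our experiments show that modeling generic summarization with learning to rank is effective, and that with the listwise method SVMMAP the overall results even surpass those of earlier modeling methods. (3) In terms of feature extraction, this thesis proposes extracting features with Latent Dirichlet Allocation (LDA). In recent years LDA has been widely applied to text clustering, classification, paragraph segmentation, and similar tasks, and it has also been applied to query-based unsupervised multi-document summarization. It is considered to model the latent semantics of text well. Building on previous work, this thesis studies the role of LDA in supervised summarization, proposes adding the topics extracted by LDA as features to a supervised model for training, and analyzes how LDA affects the summarization results for different numbers of topics. The experimental results show that adding LDA features effectively improves the quality of a summarization system that takes traditional features as input. (4) In multi-document summarization, identifying and removing redundant sentences is a crucial problem. Whether an extractive or an understanding-based approach is used, the problem cannot be avoided. To address it, this thesis focuses on the recognition of paraphrase sentences. Traditional paraphrase recognition methods judge by similarity in word frequency or syntax. However, even sentences written with the same words can differ greatly in meaning, and identical syntactic structure does not guarantee identical meaning. Based on the characteristics of news corpora, this thesis proposes a method that introduces deep semantic role labeling to help recognize paraphrase sentences in the news domain. This method...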
English abstract	Automatic Document Summarization (ADS) is a subfield of Natural Language Processing (NLP). It can be defined as a technology for summarizing documents with the help of a computer, or for representing the original documents with short but comprehensive texts according to the demands of users. Research on ADS has both theoretical and practical value: a complete ADS system covers almost all NLP subfields, so research on it should boost the development of NLP; its practical value is that, with the advent of the information era and the development of the Internet, ADS can efficiently speed up people's reading. In this thesis, therefore, we make an intensive study of modeling and feature extraction for ADS. The main work and contributions are summarized as follows: (1) We propose and implement a new model that is based on Semi-Markov Conditional Random Fields (SemiCRF) and performs ADS using Sequence Segmentation Models (SSM). Compared with existing approaches, SemiCRF can use features extracted from segments as well as from sentences. According to our experiments, this new approach outperforms all existing approaches that extract features only from sentences. (2) The other new approach we propose is based on Learning to Rank (LTK). Because the core idea of extractive summarization is to rank sentences according to some criteria, it is natural to consider LTK for automatic summarization. We tested some of the best-known LTK approaches, both pairwise and listwise. Our experimental results show that LTK works well for generic extractive summarization, and that one of the listwise approaches, SVMMAP, can even outperform the best known result using the same feature space. (3) We also extract a new feature for automatic summarization that is based on Latent Dirichlet Allocation (LDA). LDA has been a hot topic in recent years owing to its sound theoretical foundation and good flexibility. Some researchers have adapted it for query-based summarization. Yet there has been no study of how the number of topics influences summarization results, and no one has used the original LDA as a feature for automatic summarization. Our experiments aim to address these two problems. Our results show that LDA is quite capable as an automatic summarization feature. (4) We have also studied the problem of Paraphrase Recognition as an approach to solve the redundant sentence recognition...
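Language	Chinese
Other identifier	200518014628097
Source URL	[http://ir.ia.ac.cn/handle/173211/6306]
Collection	Graduates_Doctoral Dissertations
Recommended citation (GB/T 7714)	Wu Xiaofeng (吴晓锋). Research on Automatic Text Summarization Methods (文本自动摘要方法研究)[D]. Institute of Automation, Chinese Academy of Sciences. Graduate University of Chinese Academy of Sciences. 2010.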
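
The abstract describes extractive summarization as scoring sentences, ranking them, and outputting the top-ranked ones. A minimal Python sketch of that score-then-rank pipeline is given here for orientation. The word-frequency scorer is a stand-in assumption: the thesis learns its scoring with SemiCRF and learning-to-rank models, which are not reproduced here, and all function names are hypothetical.

```python
import re
from collections import Counter


def split_sentences(text):
    """Split text into sentences after ., ! or ? (a crude, assumed splitter)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def score_sentences(sentences):
    """Score each sentence by the average document-level frequency of its words.

    This frequency heuristic is an assumption for illustration only; it is
    not the learned scoring function described in the thesis.
    """
    tokenized = [re.findall(r"[a-z']+", s.lower()) for s in sentences]
    freq = Counter(w for words in tokenized for w in words)
    scores = []
    for words in tokenized:
        # Empty sentences (no word tokens) get a score of 0.
        scores.append(sum(freq[w] for w in words) / len(words) if words else 0.0)
    return scores


def summarize(text, k=2):
    """Return the k top-scoring sentences, restored to document order."""
    sentences = split_sentences(text)
    scores = score_sentences(sentences)
    # Rank sentence indices by score, highest first.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    # Keep the top k and put them back in their original order.
    chosen = sorted(ranked[:k])
    return [sentences[i] for i in chosen]
```

Calling `summarize(text, k=2)` returns a list of at most two sentences. Swapping `score_sentences` for a trained model would keep the ranking and selection steps unchanged, which is the property that makes a ranking formulation a natural fit for this task.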

Ingest method: OAI harvesting

Source: Institute of Automation

Unless otherwise stated, all content in this system is protected by copyright, with all rights reserved.