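Research on Update-Style Multi-Document Summarization Based on Lexical Chains and Text Segmentation (基于词汇链与文本切分的更新型多文档摘要技术研究)
Document type: Thesis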
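Author | Li Jing (李靖) |
Degree | Master's |
Defense date | 2008-06-04 |
Degree-granting institution | Graduate University of Chinese Academy of Sciences (中国科学院研究生院) |
Place conferred | Institute of Software, Chinese Academy of Sciences (中国科学院软件研究所) |
Advisor | Sun Le (孙乐) |
Keywords | automatic text summarization; query-focused summarization; update-style summarization; lexical chains |
Major | Computer Software and Theory |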
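Chinese abstract (translated) | This thesis starts from the background and concepts of text summarization and focuses on two recently introduced types of automatic text summarization: query-focused summarization and update-style summarization. It then surveys the main approaches to automatic text summarization in three groups (those based on external features, on simple semantic analysis, and on deep semantic analysis) and the three automatic summary evaluation methods in common international use: ROUGE, Pyramid, and BE. The research centers on lexical-chain-based summarization. Building on a description of the lexical chain construction algorithm, the lexical chain summarization algorithm, and its optimizations, the work proceeds in four directions: 1) The word-sense sequence of the user's query keywords is introduced into lexical chain scoring: chains are scored by their semantic similarity to the query keyword sense sequence, and this score is combined with the classic Strongest Chain scoring method to obtain chains that both reflect the meaning of the source text and match the user's query. 2) Because lexical chains are an intermediate representation of the meaning of the source documents, similarity is computed between the chains from the historical documents and the chains from the documents to be summarized; this separates historical information from new information, and the update-style summary is generated on that basis. 3) The original lexical chain structure is extended to also hold the sentences and passages from which chain members originate. The TextTiling algorithm from text segmentation research is then applied to the temporary short text formed by the source passages of a single chain, and the resulting segments serve as summary candidates. This combines sentence-extraction and paragraph-extraction summary generation and improves the well-formedness of the generated summaries. 4) The existing automatic text summarization system is improved with the above algorithms and combined with a cross-language retrieval system to build a Chinese-English cross-language news summarization system. |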
Call number | Not available |
English abstract | In this thesis we first introduce the background and main concepts of automatic text summarization, describing two summarization tasks in particular: query-focused (or topic-based) summarization and update-style summarization. We discuss and analyze existing summarization methods in three categories: those based on external features, on simple semantic analysis, and on deep semantic analysis. We also introduce three main automatic methods for summary evaluation: ROUGE, Pyramid, and BE. We then focus on the Lexical Chain approach to summarization and its optimized variants. We analyze the strengths and weaknesses of these approaches in depth and propose improvement strategies in the following three aspects. First, we score lexical chains by calculating the semantic similarity between chains and query term sequences. By combining this scoring strategy with the classic scoring method, we can find the strongest chain, one that both represents the source text and fits the query terms better. Second, we separate historical (or outdated) information from new information by calculating similarities between the lexical chains from each side's documents, and then extract candidate sentences for the update-style summary. Third, we introduce the TextTiling algorithm into the Lexical Chain approach and extend the chain structure into a rich chain, which also contains the sentences from which chain elements come. By segmenting the sentence set of a rich chain with the TextTiling algorithm, we extract the resulting segments as summary candidates. In this way we benefit from the strengths of both sentence-based and paragraph-based extraction. In the last part of the thesis we implement these improvement strategies in our summarization system and apply the system to Chinese summarization as well. By combining our summarizer with cross-language IR technology we developed our News Browser System. |
Release date | 2011-03-17 |
Classification number | Not available |
Source URL | [http://124.16.136.157/handle/311060/6276] |
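Collection | Institute of Software_National Engineering Research Center of Fundamental Software_Theses |
Recommended citation (GB/T 7714) | Li Jing. Research on Update-Style Multi-Document Summarization Based on Lexical Chains and Text Segmentation[D]. Institute of Software, Chinese Academy of Sciences. Graduate University of Chinese Academy of Sciences. 2008. |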
Ingest method: OAI harvesting
Source: Institute of Software
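
Illustrative sketch: the first two contributions described in the abstract (query-aware chain scoring, and separating historical from new information by comparing lexical chains) can be pictured roughly as below. This is a minimal sketch under stated assumptions, not the thesis's implementation. The abstract does not give formulas, so Jaccard overlap of chain members stands in for semantic similarity, a length-times-homogeneity score stands in for the Strongest Chain score, and the function names, `alpha`, and `threshold` are all hypothetical.

```python
def jaccard(a, b):
    """Set overlap in [0, 1]; an assumed stand-in for semantic similarity."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def base_score(chain):
    """Length times homogeneity; an assumed stand-in for a chain-strength score."""
    length = len(chain)
    if length == 0:
        return 0.0
    distinct = len(set(chain))
    return length * (1.0 - distinct / length)


def query_aware_score(chain, query_terms, alpha=0.5):
    """Blend chain strength with similarity to the query (alpha is hypothetical)."""
    return alpha * base_score(chain) + (1.0 - alpha) * jaccard(chain, query_terms)


def novel_chains(new_chains, history_chains, threshold=0.5):
    """Keep chains from the new documents that no historical chain resembles."""
    kept = []
    for chain in new_chains:
        # Highest similarity to any chain built from the historical documents.
        best = max((jaccard(chain, old) for old in history_chains), default=0.0)
        if best < threshold:
            kept.append(chain)
    return kept


# Chains are plain lists of chain members (word occurrences) in this sketch.
history = [["flood", "river", "flood", "water"]]
new = [
    ["flood", "water", "river", "flood"],
    ["rescue", "helicopter", "rescue", "team"],
]
fresh = novel_chains(new, history)
ranked = sorted(fresh, key=lambda c: query_aware_score(c, ["rescue", "team"]), reverse=True)
```

In this toy run the chain that repeats the historical flood vocabulary is filtered out, and only the rescue chain remains as a source of update-summary candidates. The two blended terms are on different scales here (the base score is unbounded, the overlap is at most 1), so a real system would need to normalize them before combining.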