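Design and Implementation of HNC-Oriented Corpus Software (面向HNC的语料库软件设计与实现)
Document type: Dissertation
Author | Xie Fakui (谢法奎) |
Degree | Doctorate |
Defense date | 2008-05-29 |
Degree-granting institution | Institute of Acoustics, Chinese Academy of Sciences |
Place conferred | Institute of Acoustics |
Keywords | HNC theory; corpus; linguistic space; linguistic concept space; annotation; retrieval; statistics; XML; XQuery; maximum entropy model |
Alternative title | The Design and Implementation for HNC Corpus Software System |
Degree discipline | Signal and Information Processing |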
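Abstract (translated from Chinese) | A corpus is a collection of language material gathered for linguistic research and stored in electronic form; it is an important tool for linguistics, natural language processing, and related fields. HNC theory is a school of Chinese information processing, and its development requires that an HNC corpus develop alongside it. This work designs and implements a corpus that reflects the characteristics of HNC natural language processing theory and serves HNC research. The main advances and contributions are as follows: (1) A complete HNC corpus system was built, comprising a raw corpus and an annotated corpus and supporting corpus management, processing, annotation, retrieval, and statistics. The system uses a three-layer architecture: application layer, interface layer, and implementation layer. The interface layer abstracts a set of general corpus access interfaces that isolate upper-layer applications from the underlying corpus storage implementation, which simplifies development. (2) A multi-user corpus management platform was built. It runs on a server and centrally manages user corpora and public corpora. The system uses a client/server (C/S) model that allows concurrent access by multiple users, giving them a convenient shared platform for collaborative work. (3) The functions of the corpus system were improved. ① For annotation, a new XML-based annotation scheme was constructed that uses XML to combine annotation information from the linguistic space and the linguistic concept space, simplifying the annotation process. Tools for sentence-group segmentation and context-unit annotation are also provided, extending HNC annotation to the context-unit level. ② For retrieval, full-text search was implemented with Lucene full-text indexing, and three modes of HNC feature retrieval are provided: basic retrieval, advanced retrieval, and XQuery retrieval. ③ For statistics, in addition to conventional statistics, HNC feature statistics are provided, with four modes designed and implemented: count statistics, ratio statistics, user-defined distribution statistics, and attribute distribution statistics. Users can freely define what is counted, so that a wide range of statistical needs can be met. (4) Machine-assisted annotation was studied. Using existing annotated corpora, a maximum entropy model is applied to semantic chunk segmentation and an instance-based method to sentence category determination. (5) A sentence-category reorganization corpus was built. Based on the basic corpus, the annotated corpus is reorganized by sentence category, with functions for error feedback and for marking difficult cases. |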
Abstract (English) | A corpus is a collection of linguistic materials in electronic form, and an important tool for linguistic studies, NLP, and related fields. HNC theory, as a new NLP theory, needs a corresponding corpus. Our goal is to design and implement a corpus software system that embodies the characteristics of HNC and supports HNC research. The main contributions of this dissertation are as follows: (1) An integrated HNC corpus system is established, containing raw and tagged corpora. It provides management, processing, tagging, searching, and statistical functions. The system has a three-layer structure: application, interface, and implementation layers. The interface layer consists of a set of general corpus access interfaces that isolate the top layer from the bottom layer and simplify development. (2) A multi-user corpus management platform is constructed. User corpora and public corpora are managed on the server. The platform adopts a C/S model, which allows many users to access the server simultaneously. (3) Several functions of the corpus system are improved. ① For tagging, a novel XML-based tagging mode is realized, in which information from the linguistic space and the linguistic concept space is expressed in XML; this simplifies the tagging process. ② For searching, full-text search is implemented with Lucene, and HNC feature search is provided in three forms: basic search, advanced search, and XQuery search. ③ For statistical processing, in addition to conventional statistics, four basic modes of HNC feature statistics are designed and implemented: amount, ratio, user-defined distribution, and attribute distribution. Users can freely define the content to be counted. (4) Computer-aided tagging models are explored. Specifically, a maximum entropy model is adopted for semantic chunk segmentation, and an example-based model is adopted for sentence category parsing. (5) A sentence category reorganization corpus is constructed. Based on the basic corpus, the tagged corpus is reorganized by sentence category. Basic functions for reporting tagging mistakes and marking parsing difficulties are also provided. |
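Language | Chinese |
Release date | 2011-05-07 |
Pages | 66 |
Source URL | [http://159.226.59.140/handle/311008/400] |
Collection | Institute of Acoustics_Master's and Doctoral Theses of the Institute of Acoustics_Master's and Doctoral Theses 1981-2009 |
Recommended citation (GB/T 7714) | 谢法奎. 面向HNC的语料库软件设计与实现[D]. 声学研究所. 中国科学院声学研究所. 2008. |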
Ingest method: OAI harvesting
Source: Institute of Acoustics
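The abstract describes XML-based annotation and HNC feature statistics (count and ratio modes) without showing the annotation format. As a minimal sketch only, the Python below parses a made-up annotated fragment and computes one count and one ratio over it. The element and attribute names (`corpus`, `sentence`, `chunk`, `category`, `role`), the category labels, and the helper functions are hypothetical placeholders rather than the thesis's actual schema or code, and the standard library's limited XPath support stands in for the XQuery retrieval the abstract mentions.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical annotated fragment: tag names, attributes and labels are
# illustrative assumptions, not the HNC corpus schema.
SAMPLE = """
<corpus>
  <sentence id="1" category="A">
    <chunk role="main">chunk one</chunk>
    <chunk role="aux">chunk two</chunk>
  </sentence>
  <sentence id="2" category="B">
    <chunk role="main">chunk three</chunk>
  </sentence>
  <sentence id="3" category="A">
    <chunk role="main">chunk four</chunk>
  </sentence>
</corpus>
"""


def count_by_attribute(root, path, attribute):
    """Count mode: tally the elements matched by `path` by one attribute value."""
    return Counter(el.get(attribute) for el in root.findall(path))


def ratio(root, numerator_path, denominator_path):
    """Ratio mode: number of matches of one path divided by matches of another."""
    denominator = len(root.findall(denominator_path))
    if denominator == 0:
        return 0.0
    return len(root.findall(numerator_path)) / denominator


root = ET.fromstring(SAMPLE)

# How many sentences carry each (hypothetical) sentence category label.
category_counts = count_by_attribute(root, "sentence", "category")

# Average number of chunks per sentence.
chunks_per_sentence = ratio(root, "sentence/chunk", "sentence")

# Simple feature retrieval: ids of sentences whose category is "A".
category_a_ids = [s.get("id") for s in root.findall("sentence[@category='A']")]

print(category_counts)
print(chunks_per_sentence)
print(category_a_ids)
```

Passing the element path and attribute as arguments loosely mirrors the abstract's point that users define what is counted; a real system would presumably expose this through its own query interface.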