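Research on the Theory and Methods of Chinese Syntactic Parsing and Their Application
Document type: Dissertation
Author | Zhang Yan (张艳) |
Degree | Doctor of Engineering |
Defense date | 2003-04-07 |
Degree-granting institution | Graduate School of the Chinese Academy of Sciences |
Place of conferral | Institute of Automation, Chinese Academy of Sciences |
Advisors | Xu Bo (徐波) ; Zong Chengqing (宗成庆) |
Keywords | Natural Language Processing; Chinese Syntactic Parsing; Segmentation and Part-of-Speech Tagging; Term Definition; Text Proofreading |
Other title | Research on Theory and Methods of Chinese Syntactic Parsing and Application |
Degree discipline | Pattern Recognition and Intelligent Systems |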
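Chinese abstract (translated) | Syntactic parsing is a key problem in natural language processing. This thesis presents a systematic study of the theory and methods of Chinese syntactic parsing and of their application to term definition and text proofreading, with the following results: (1) A word segmentation and part-of-speech (POS) tagging system for large-scale real Chinese text is implemented. To better resolve POS ambiguity (words that admit more than one POS) in such text, the thesis combines statistical and rule-based methods. First, it proposes a tri-gram-based method that integrates Chinese word segmentation and POS tagging, and incorporates a language model to raise segmentation accuracy. On this basis, it further adopts Brill's template-based, error-driven automatic tagging method, which acquires rules automatically from rule templates and exploits the characteristics of grammatical rules to raise POS tagging accuracy. Experimental results show that this hybrid method is particularly well suited to POS tagging of large-scale real text. (2) Tomita's GLR parsing algorithm is studied and improved, and algorithms for Chinese base noun phrase identification and full syntactic parsing are implemented. For base noun phrase identification, three methods are used to resolve conflicts in shift-reduce parsing: preferring the syntactic rule with the highest probability, longest-rule matching, and, most importantly, using the LR parsing table and a simplified GLR process to resolve conflicts between shift and reduce actions. In the full parsing system, a probabilistic context-free grammar (PCFG) is used to improve the GLR algorithm: rule probabilities resolve conflicts between reductions, and a series of simplifications in the implementation improves parsing efficiency. This is the main part of the thesis and constitutes an improvement to the parsing algorithm. (3) Template structures for Chinese term definitions and an algorithm for discovering them automatically are proposed. This is an application built on Chinese syntactic parsing and a new research area in natural language processing. Using the parser, templates of term definitions are acquired automatically, general rules and methods of term definition are summarized, and an algorithm for automatic definition discovery is proposed. Defining terms in this way makes it possible to explain new, unknown terms. (4) Rule-based text error correction is implemented for a pinyin-to-Chinese-character conversion system. As the post-processing stage of the conversion system, and building on statistical methods, it uses syntactic and semantic information and contextual constraints to correct wrongly converted words; the method supplements and assists the statistical approach. |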
English abstract | Natural language parsing is one of the key problems in natural language processing. This thesis first discusses the main theories and approaches of natural language parsing and then presents work on Chinese syntactic parsing. The parser is applied in two fields: definition of terms and Chinese text proofreading. The main contributions are summarized as follows: (1) A Chinese segmentation and part-of-speech (POS) tagging system for large-scale real Chinese text is implemented. Chinese segmentation and POS tagging are the basis of a parsing system. To handle Chinese words that have multiple POS candidates in large-scale real text, the thesis combines a statistical approach with a rule-based approach. First, it proposes a tri-gram model as the statistical model that integrates Chinese segmentation and POS tagging. This statistical approach computes the language model and the POS model simultaneously to improve segmentation precision. On this basis, the thesis further uses Brill's transformation-based error-driven algorithm as the rule-based approach. The mixed approach takes the output of the statistical model as the initial corpus and uses grammatical rules to improve POS tagging precision, obtaining the transformation rules automatically. The experiments show that the mixed approach is better suited to tagging large-scale real text. (2) Tomita's GLR parsing algorithm is studied and extended, and identification of Chinese base noun phrases and a Chinese full parsing system are implemented. When the parser identifies base noun phrases, three approaches are used to handle conflicts between shift actions and reduce actions. The first is that the rule with the maximum probability is executed preferentially. The second is that the longest rule is matched first. The third, and most important, is to overcome conflicts through the LR parsing table and a simplified GLR parsing process. For Chinese full parsing, the thesis uses a probabilistic context-free grammar to extend the GLR parsing algorithm and resolves conflicts between shift actions and reduce actions by rule probability. This approach improves the efficiency of the parsing algorithm. This part is the main content of the thesis and the most important work of this research. (3) Pattern structures of Chinese terms and an algorithm for automatically discovering a term are proposed. This part is a new application of the Chinese syntactic parser in natural language processing. The patterns of terms are obtained automatically by pre-processing the corpus and parsing the sentences of term definitions. We summarize rules and approaches for defining term names and propose the discovery algorithm. This research can explain more unknown terms using the definition style obtained. (4) Implementation of a text proofreading system that applies a rule-based approach to a pinyin-to-Chinese-character conversion system. This pa |
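Language | Chinese |
Other identifier | 738 |
Source URL | [http://ir.ia.ac.cn/handle/173211/5744] |
Collection | Graduates_Doctoral Dissertations |
Recommended citation (GB/T 7714) | Zhang Yan. Research on Theory and Methods of Chinese Syntactic Parsing and Application [D]. Institute of Automation, Chinese Academy of Sciences. Graduate School of the Chinese Academy of Sciences. 2003. |
Ingest method: OAI harvesting
Source: Institute of Automation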
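
Item (2) of the abstract names two preferences for choosing among competing rules: the rule with the maximum probability, and the longest matching rule. The sketch below illustrates that idea only and is not the thesis's implementation. The names `Rule` and `pick_reduction`, the toy grammar, its probabilities, and the decision to apply probability first and use rule length as the tie-breaker are all assumptions made for this example; the record does not say how the two preferences are ordered.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Rule:
    """A PCFG-style rule lhs -> rhs with a probability (hypothetical type)."""
    lhs: str
    rhs: Tuple[str, ...]
    prob: float


def pick_reduction(stack: List[str], rules: List[Rule]) -> Optional[Rule]:
    """Choose one rule whose right-hand side matches the top of the stack.

    Assumed ordering: highest probability first, longest right-hand side
    as the tie-breaker. Returns None when no rule matches.
    """
    candidates = [
        r for r in rules
        if 1 <= len(r.rhs) <= len(stack)
        and tuple(stack[-len(r.rhs):]) == r.rhs
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda r: (r.prob, len(r.rhs)))


# Toy grammar with made-up probabilities: a = adjective, n = noun.
toy_rules = [
    Rule("NP", ("n",), 0.3),
    Rule("NP", ("n", "n"), 0.3),
    Rule("NP", ("a", "n", "n"), 0.2),
]

chosen = pick_reduction(["a", "n", "n"], toy_rules)
print(chosen)
```

Here all three rules match the top of the stack; the two rules with probability 0.3 tie, so the longer one, `NP -> n n`, is selected. Reversing the key to `(len(r.rhs), r.prob)` would make longest match the primary preference instead.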