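Chinese Academy of Sciences Institutional Repositories Grid: Research on Knowledge Acquisition and Machine Translation System Based on Bilingual Spoken Language Corpus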

Chinese Academy of Sciences Institutional Repositories Grid

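Research on Knowledge Acquisition and Machine Translation System Based on Bilingual Spoken Language Corpus (基于双语口语语料库的机器翻译知识获取和系统实现研究)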

Document type: Dissertation


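Author	Chen Boxing (陈博兴)
Degree	Doctoral
Defense date	2003
Degree-granting institution	Institute of Acoustics, Chinese Academy of Sciences
Place of conferral	Institute of Acoustics, Chinese Academy of Sciences
Keywords	natural language processing; machine translation; example-based machine translation; spoken language translation; knowledge acquisition; translation lexicon; translation template; word similarity
Alternative title	Research on Knowledge Acquisition and Machine Translation System Based on Bilingual Spoken Language Corpus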
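Chinese abstract	With the rise of corpus linguistics and the development of machine learning, automatically or semi-automatically acquiring linguistic knowledge and translation rules from corpora, and using them to perform machine translation, has become a new breakthrough point for the field. Building on earlier research, this thesis retains the strengths of previous work, remedies some of its shortcomings, proposes several new methods, and constructs a complete corpus-based machine translation system. With the goal of building a complete example-based machine translation (EBMT) system, the thesis studies the following problems in depth: automatic construction of translation lexicons, establishment of a translation template database, computation of word similarity for EBMT, implementation of a template-driven EBMT system, and a preliminary study of spoken language translation using EBMT. The work achieved its intended goals, and the main results are summarized as follows.
(1) A translation lexicon extraction model for spoken language was designed, and the idea of lexicon grading was proposed for the first time. The thesis first analyzes the characteristics of Chinese and of spoken language, then builds word-to-word, word-to-multiword and multiword-to-multiword translation lexicons in turn. By applying several association measures and by exchanging the roles of the source and target languages, multiple translation word lists are obtained and the lexicon is graded, which effectively raises the precision of the high-grade translation lexicon.
(2) Automatic extraction of a word-to-multiword translation lexicon was completed. The thesis proposes, for the first time, an algorithm that automatically extracts a translation lexicon whose entries pair a single source-language word with a target-language multiword unit. The algorithm uses the average association score and the normalized difference of association scores as its association measures, and applies a local-best algorithm together with several heuristics such as long-unit preference filtering and stop-word filtering. This work greatly improves the practicality of the translation lexicon: English and Chinese contain many cases in which a single source-language word corresponds to a multiword target-language phrase (for example, a single English personal name or place name is translated into a Chinese multiword unit), and the algorithm handles such cases effectively. It is therefore of great help to machine translation, especially Chinese-English translation.
(3) An algorithm that effectively extracts low-frequency bilingual multiword units was developed. Other related research has not solved the problem of extracting low-frequency bilingual multiword units effectively. This thesis solves it by applying a local-best algorithm that aligns multiword units at the same time as it identifies them. To improve precision, the algorithm also applies length-ratio filtering and stop-word filtering. A further advantage is that it avoids repeated bi-gram counting. Because an association score measures the degree of association between two objects, most comparable techniques extract multiword units by repeatedly computing bi-gram statistics; the error of each round accumulates over the iterations, so the error rate grows exponentially and seriously harms extraction precision. The algorithm in this thesis computes the association score only once, which greatly reduces the inaccuracy introduced by the association score itself.
(4) An algorithm that computes word similarity by combining syntactic and statistical properties was developed. To make word similarity better suited to EBMT, the thesis abandons the traditional approach of computing similarity from an existing thesaurus or dictionary. Instead, it computes similarity from the statistics of the contextual differences between a pair of words and from whether their parts of speech agree. The main advantage is that the resulting similarity is not merely a similarity of syntactic or semantic properties but a measure of how well the two words can substitute for each other, which suits an EBMT system better. Existing statistical similarity measures generally consider only how many kinds of properties two objects share, and ignore how often each specific property occurs within a given range. In practice, because language is flexible, and spoken language even more so, words are often used outside their usual word class or borrowed for other uses. Even if two words share some dependency structure or some context, this does not mean they are fully similar with respect to that property. To address this, the thesis proposes a similarity based on context word tokens: if two words being compared have the same context word at the same position, the similarity contributed by the co-occurrence frequencies of that context word with each of the two words is computed. This takes into account how often each shared property occurs, and it is proposed here for the first time.
(5) An algorithm for extracting translation templates was proposed that uses the similarity-and-difference principle together with a one-to-one alignment method and dynamic programming. The algorithm rests on the idea that the similar parts of several bilingual sentence pairs are translations of each other, and that the differing parts within each sentence pair are also translations of each other. To measure how well two bilingual fragments translate each other, a translation-degree function for bilingual word strings, computed from the translation lexicon, is proposed. Alignment is performed with the one-to-one alignment algorithm and dynamic programming, which improves alignment accuracy and reduces the amount of computation. Two filters, one on the translation-degree value of bilingual word strings and one on part-of-speech similarity, verify the accuracy of the extracted translation templates.
(6) A template-driven EBMT prototype system was designed and implemented. Translation is carried out by retrieving a translation template, identifying the variables and their translations, and recombining the template with the variables. When searching for the most suitable template for an input sentence, multiple mechanisms ensure that the retrieved template is the best fit for that sentence, and the method for computing the similarity of language fragments is extended. The system avoids relying on technologies that do not yet meet practical requirements, such as accurate syntactic and semantic analysis, which makes it more practical.
(7) A spoken language translation system using EBMT was designed around the characteristics of utterances. The thesis first analyzes the characteristics of utterances. Based on the internal stability of the chunks within an utterance and the flexibility of chunk position, it designs a chunk-based spoken language translation system that translates in three steps: chunk segmentation, chunk translation, and chunk recombination. It also analyzes several problems that spoken language translation may encounter and some ways of addressing them.
These results are of significant value and direct reference for acquiring machine translation knowledge from bilingual corpora, building structured machine translation knowledge bases, constructing practical machine translation systems and improving their accuracy, and extending the approach to other languages and to spoken dialogue.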
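As a rough illustration of the lexicon-grading idea in item (1), the sketch below scores word pairs in both translation directions and puts the pairs that agree in both directions in the higher grade. It is a minimal sketch under stated assumptions, not the thesis's algorithm: the Dice coefficient stands in for the unspecified association measures, and the names `dice_scores`, `best_links`, and `graded_lexicon` are hypothetical.

```python
from collections import Counter
from itertools import product


def dice_scores(pairs):
    """Dice association for every (source, target) word pair that
    co-occurs in at least one aligned sentence pair.
    Assumption: Dice stands in for the thesis's association measures."""
    src_freq, tgt_freq, joint = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        s_words, t_words = set(src), set(tgt)
        src_freq.update(s_words)
        tgt_freq.update(t_words)
        joint.update(product(s_words, t_words))
    return {
        (s, t): 2 * n / (src_freq[s] + tgt_freq[t])
        for (s, t), n in joint.items()
    }


def best_links(scores, key_index):
    """For each word on one side (0 = source, 1 = target),
    keep the single pair with the highest score."""
    best = {}
    for pair, score in scores.items():
        word = pair[key_index]
        if word not in best or score > best[word][1]:
            best[word] = (pair, score)
    return {pair for pair, _ in best.values()}


def graded_lexicon(pairs):
    """Grade A: pairs chosen in both directions.
    Grade B: pairs chosen in only one direction (hypothetical grading rule)."""
    scores = dice_scores(pairs)
    forward = best_links(scores, 0)   # source -> target
    backward = best_links(scores, 1)  # target -> source
    grade_a = forward & backward
    grade_b = (forward | backward) - grade_a
    return grade_a, grade_b


corpus = [
    (["我", "喝", "茶"], ["i", "drink", "tea"]),
    (["我", "喝", "水"], ["i", "drink", "water"]),
    (["他", "喝", "茶"], ["he", "drink", "tea"]),
]
grade_a, grade_b = graded_lexicon(corpus)
```

On this toy corpus every correct pair is selected in both directions, so all five entries land in grade A. On real data the one-direction-only entries in grade B would be the less reliable ones.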
English abstract	Along with the rise of corpus linguistics and the development of machine learning technology, a new breakthrough point has emerged in machine translation: to acquire linguistic knowledge and translation rules from a corpus automatically or semi-automatically by machine learning, and then to accomplish machine translation. On the basis of other research, we have followed its merits, improved on some shortcomings and limitations, proposed some original processing technologies, and integrated a corpus-based machine translation system. The work mainly includes the following:
Automatic Extraction of Translation Lexicon. It includes: extraction of the lexicon whose entries are composed of a single source word and a single target word, the lexicon whose entries are composed of a single source word and a target multi-word unit, the lexicon whose entries are composed of bilingual multi-word units, and the lexicon extracted based on the Similarity and Difference principle.
Establishment of Translation Template Database. It includes: automatic extraction of translation templates from the bilingual corpus, design of the storage format for the translation template database, and creation of the word-based index file for the translation template database.
Measurement of Word Similarity. It includes: analysis of the characteristics of similarity, and construction of the computing model of word similarity used in the EBMT system.
Realization of a Template-Driven EBMT System. It includes: the retrieval of translation templates, the identification and translation of variables, and the recombination of template and variables.
Preliminary Research on Spoken Language Translation Based on EBMT. It includes the analysis of the characteristics of utterances and the remodeling of the EBMT system for spoken language translation.
Aiming to establish an integrated EBMT system, the author studied the above issues and obtained the following achievements:
(1) We have analyzed the characteristics of Chinese and of spoken language, designed an extraction model of translation lexicons for English-Chinese spoken language, and compiled word-to-word, word-to-multiword and multiword-to-multiword translation lexicons. Furthermore, by grading the lexicons acquired with several parameters and by exchanging the roles of the source language and the target language, we have effectively improved the precision of the high-level lexicon. The idea of lexicon classification is proposed for the first time.
(2) We have studied the automatic extraction of translation lexicons that map a single source word to a target multiword unit, and proposed an algorithm for such extraction for the first time. The algorithm has the following characteristics: first, it uses the average association score and the normalized association score difference as the measures of association; second, it uses a Local Bests Algorithm with the help of several heuristic strategies, such as Stop-Word Filtration and Long-Length Units Preference. Since there are many cases of single English words corresponding to Chinese multiword units, especially English personal names and place names, solving this problem makes the translation lexicon more practical and is of great help to machine translation, especially Chinese-English translation.
(3) We have developed an algorithm for automatic extraction of bilingual Multi-Word Units, which solves the problem of extracting low-frequency bilingual multiword units. The algorithm constructs a model that identifies the Multi-Word Units in the process of alignment. Length Ratio Filtration and Stop-Word Filtration are introduced to improve the system's performance. In much similar research, Multi-Word Units are extracted by iterated bi-gram calculation, and the results depend mostly on identifying suitable bi-grams to start the iterative process, so errors can accumulate during the iteration. Our algorithm requires calculating the association only once, so the inaccuracy caused by the association measure itself is reduced.
(4) To meet the need of measuring the similarity between word pairs in an EBMT system, we have proposed an algorithm for measuring word similarity by using the Part of Speech (POS) of a word and its context. The method considers the syntactic characteristics and the statistical characteristics of the word pair simultaneously, which makes the similarity better suited to an EBMT system. The main idea of the algorithm is to first compare the POS of the word pair, then count the number of word types and the frequencies of the word tokens that co-occur with the word pair in the window, and then measure the similarity of the word pair by using the Dice Coefficient. The algorithm needs only a dictionary with POS and a corpus, avoiding parsing and POS tagging. Current methods for calculating similarity usually consider only the number of types of commonalities between two things, without regard to how many times each specific characteristic appears in a certain range. In fact, because of the flexibility of language, especially spoken language, word borrowing occurs often. Even if two words have some identical dependency structures or contexts, it does not mean that they are entirely similar in this respect. To solve this problem, the algorithm develops a similarity based on word tokens, that is: if two words have the same context word in the same position, the similarity contributed by the co-occurrence frequencies of the context word and the word tokens is calculated. This idea is proposed for the first time.
(5) In the algorithm for the extraction of translation templates and translation units based on the "Similarity and Difference" principle, we proposed a distance function based on the translation lexicon, and aligned the similar parts and the different parts by using the Dynamic Programming approach. Compared with recursive methods of the same kind, this method can greatly reduce the computational complexity. At the same time, we have also designed three filters to verify the accuracy of the translation units.
(6) We have designed and implemented a template-driven EBMT prototype system. In the process of finding the most suitable template for the input source language sentence, multiple mechanisms are used to ensure that the retrieved template is the most suitable one for the input sentence. The system avoids relying on precise syntactic analysis, semantic analysis and other unavailable linguistic resources, which makes it more practical.
(7) Having analyzed the characteristics of utterances, we designed a Spoken Language Translation system based on EBMT according to the internal stability of utterance chunks and the flexibility of chunk position. We also described the possible problems of Spoken Language Translation and proposed some methods to resolve them.
The above achievements are of great significance and valuable reference in the following aspects: acquisition of machine translation knowledge, construction of structured machine translation knowledge databases, establishment of commercial machine translation systems and improvement of their precision, and extension of the system to other languages and to the spoken language field.
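The token-based similarity described in the abstract (compare POS, count context tokens by position inside a window, combine with the Dice coefficient) might be sketched as follows. This is an illustrative reading only, not the thesis's formula: the function names, the window size, the use of `min` to combine shared counts, and the rule that differing POS gives zero similarity are all assumptions.

```python
from collections import Counter


def positional_contexts(sentences, word, window=2):
    """Count (offset, neighbour) tokens around every occurrence of `word`.
    Keeping the offset means a shared neighbour only counts when it
    appears in the same relative position."""
    counts = Counter()
    for sent in sentences:
        for i, tok in enumerate(sent):
            if tok != word:
                continue
            for off in range(-window, window + 1):
                j = i + off
                if off != 0 and 0 <= j < len(sent):
                    counts[(off, sent[j])] += 1
    return counts


def token_dice(sentences, w1, w2, pos=None, window=2):
    """Dice coefficient over context tokens rather than context types.
    Assumption: words whose POS (from the optional `pos` dict) differ
    get similarity 0."""
    if pos is not None and pos.get(w1) != pos.get(w2):
        return 0.0
    c1 = positional_contexts(sentences, w1, window)
    c2 = positional_contexts(sentences, w2, window)
    total = sum(c1.values()) + sum(c2.values())
    if total == 0:
        return 0.0
    # Shared mass: for each context both words have, the smaller token count.
    shared = sum(min(c1[k], c2[k]) for k in c1.keys() & c2.keys())
    return 2 * shared / total


sentences = [
    ["i", "want", "tea", "please"],
    ["i", "want", "coffee", "please"],
    ["he", "likes", "tea"],
]
pos = {"tea": "n", "coffee": "n", "please": "adv"}
sim_tea_coffee = token_dice(sentences, "tea", "coffee", pos)
sim_tea_please = token_dice(sentences, "tea", "please", pos)
```

Because counts are kept per token, a context that appears many times with one word but only once with the other contributes only the smaller count, which is one way to reflect how often a shared property occurs rather than merely whether it occurs.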
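Language	Chinese
Date available	2011-05-07
Pages	132
Source URL	[http://159.226.59.140/handle/311008/1012]
Collection	Institute of Acoustics_IOA Doctoral and Master's Theses_1981-2009 Doctoral and Master's Theses
Recommended citation (GB/T 7714)	Chen Boxing. Research on Knowledge Acquisition and Machine Translation System Based on Bilingual Spoken Language Corpus[D]. Institute of Acoustics, Chinese Academy of Sciences. Institute of Acoustics, Chinese Academy of Sciences. 2003.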

Ingest method: OAI harvesting

Source: Institute of Acoustics

Unless otherwise stated, all content in this system is protected by copyright, with all rights reserved.