Chinese Academy of Sciences Institutional Repositories Grid
基于数据增强的汉语词表和语言模型自适应技术 (Adaptation on Chinese Lexicon and Language Model Based on Data Augmentation)

Document type: Thesis/Dissertation

Author: Ning Zhenjiang (宁振江)
Degree level: Doctoral
Defense date: 2005
Degree-granting institution: Institute of Acoustics, Chinese Academy of Sciences
Degree-granting place: Institute of Acoustics, Chinese Academy of Sciences
Keywords: language model adaptation; data sparsity; Chinese continuous speech recognition; data augmentation
Alternative title: Adaptation on Chinese Lexicon and Language Model Based on Data Augmentation
Abstract (Chinese, translated): As one of the cornerstones of current continuous speech recognition technology, the word-based statistical language model has been applied successfully in speech recognition, information retrieval and data mining, and natural language understanding. When developing practical topic-specific speech recognition applications, sufficient training corpora matched to the recognition task are often unavailable, and a mismatch between the operating conditions of the recognition system and the conditions under which the existing language model was trained typically degrades recognition performance. Language model adaptation then becomes an effective means of improving system performance. Traditional MAP-based language model adaptation has been very successful at alleviating the data sparsity of training corpora, but it presupposes a certain amount of topic-related adaptation data; in some speech recognition applications even a small amount of topic-related data is hard to obtain, and MAP alone then cannot give good results. Against this background, this thesis studies methods that use latent semantic analysis to augment a very small topic-related adaptation corpus, so that the adapted language model can predict and correct the probabilities of words that do not even occur in the training corpus. The thesis focuses on word-based statistical language models. In Chinese, characters and words are written contiguously with no spaces and no explicit word boundaries within a sentence, and the notion of a word is rather flexible, so existing lexicon optimization and model adaptation techniques cannot be applied directly to Chinese continuous speech recognition. The thesis therefore also studies pre-segmentation of Chinese text corpora, topic-specific Chinese lexicon augmentation and optimization, and lexicon optimization for Chinese continuous speech recognition. The main contributions are as follows:
1. Lexicon augmentation and optimization for specific topic domains. A Bi-Gram-based selection measure for new words is proposed, and on this basis an iterative algorithm for extracting topic-related keywords is implemented. Compared with traditional non-iterative algorithms: first, because only a Bi-Gram language model is used, the data sparsity caused by the higher-order language models of non-iterative algorithms is avoided; second, the iterative process prevents the selection of semantically incomplete words, which is hard to avoid in non-iterative algorithms; third, compared with PAT-TREE-based keyword extraction, the iterative algorithm is far more efficient in both time and memory. Experiments show that the proposed algorithm's memory cost is only about 13.4%, and its time cost about 5.5%, of the PAT-TREE algorithm's.
2. A distinctive property of Chinese is that all words and sentences are composed of somewhat more than 6,000 elementary characters. Exploiting this, the thesis analyzes the impact of out-of-vocabulary (OOV) words on continuous speech recognition in both Chinese and English, and performs Chinese lexicon optimization experiments. The results show that for large-vocabulary topic-related Chinese continuous speech recognition, a 16k-word lexicon gives a good balance among the storage space, computational efficiency, and character recognition accuracy of the recognition system.
3. A method is proposed for augmenting a small topic-related adaptation corpus using contextual latent semantic analysis. The underlying idea is that the more similar two word contexts are in latent semantic function, the more likely the same word is to follow both. Singular value decomposition (SVD) of a word-context matrix is used to compute the latent semantic similarity between contexts, and the adaptation corpus is augmented on this basis. Since in a Bi-Gram model each context corresponds exactly to a word in the lexicon, a statistical word clustering method with Bi-Gram average mutual information as the objective function is proposed for the latent semantic classification of contexts; experiments show that it gives more accurate latent semantic classes than the traditional K-nearest-neighbor method. In the experiments, the MAP-adapted model built on SVD-based data augmentation clearly outperforms MAP alone: on the topic of The Legend of the Condor Heroes (《射雕英雄传》), language model perplexity drops by 11.6% relative to conventional MAP, and the character error rate of speech recognition drops by 3.1% relative. Adding the proposed Bi-Gram average-mutual-information word clustering improves the model further: perplexity drops by another 7.7% and the character error rate by another 2.2% relative.
4. Although SVD defines the latent semantic similarity between contexts in the word-context matrix well, it provides no explicit class partition; classification in the SVD space usually relies on techniques such as traditional K-means clustering, with large computational cost and classification error. Moreover, the low-rank approximation of SVD works well only when the analyzed data samples are normally distributed, whereas word counts in natural language text generally are not; they are closer to non-negative distributions such as the Poisson. The thesis therefore proposes a latent semantic classification method based on non-negative matrix factorization (NMF), defines contextual latent semantic similarity in the NMF space, and augments the topic-related adaptation corpus on this basis. Experiments show that the NMF-augmented MAP-adapted language model clearly outperforms the SVD-augmented one, both in perplexity and in topic-related continuous speech recognition, while also having lower computational complexity.
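The abstract reports relative reductions in language model perplexity. As background, bigram perplexity is the standard quantity being measured; a minimal sketch (function and variable names are illustrative, not from the thesis):

```python
import math

def bigram_perplexity(prob, test_bigrams):
    """Perplexity of a bigram model over test data: the exponential of
    the average negative log-probability assigned to the bigrams.

    prob: function (history, word) -> P(word | history).
    test_bigrams: list of (history, word) pairs from a held-out corpus.
    """
    logp = sum(math.log(prob(h, w)) for h, w in test_bigrams)
    return math.exp(-logp / len(test_bigrams))

# A uniform model over a 4-word vocabulary has perplexity 4.0
# regardless of the test data.
pp = bigram_perplexity(lambda h, w: 0.25, [("a", "b")] * 8)
```

Lower perplexity means the model is, on average, less "surprised" by the test text, which is why the reported 11.6% and 7.7% relative reductions translate into recognition gains.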
English abstract: As one of the cornerstones of today's continuous speech recognition technology, the word-based statistical language model has been applied successfully in areas such as speech recognition, information retrieval, data mining, and natural language understanding. However, adequate quantities of training corpus matched to the recognition task are often unavailable, and a mismatch between the conditions under which the recognition system is used and those under which the existing language model was trained tends to degrade the performance of the speech recognition system. Statistical language model adaptation is therefore introduced as a powerful means of improving system performance. Although MAP-based language model adaptation has contributed much to solving the data sparsity problem of training corpora in specific topic domains, it presupposes some quantity of topic-related adaptation corpus; in some speech recognition applications even a small amount of topic-related corpus is hard to obtain, and MAP adaptation then cannot work as well as expected. Focusing on the case of very little topic-related adaptation data, this thesis therefore explores data augmentation of the adaptation corpus by latent semantic analysis, so that the adapted language model can predict the probabilities of words that do not even occur in the training corpus. The work is based on word-based statistical language models. Characters and words in Chinese are written contiguously, with no space division and no explicit marker for words within a sentence, so existing methods of lexicon optimization and language model adaptation cannot simply be carried over to Chinese.
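MAP adaptation as described above blends a small topic-related corpus with a background model. One common formulation is count merging; a minimal sketch under that assumption (all names illustrative, not the thesis's exact formulation):

```python
from collections import Counter

def map_bigram(adapt_bigrams, background_prob, tau=10.0):
    """MAP-style bigram estimate that shrinks sparse adaptation counts
    toward a background model.

    adapt_bigrams: list of (history, word) pairs from the small
                   topic-related adaptation corpus.
    background_prob: function (history, word) -> P_B(word | history).
    tau: prior weight; larger tau trusts the background model more.
    """
    pair_counts = Counter(adapt_bigrams)
    hist_counts = Counter(h for h, _ in adapt_bigrams)

    def prob(history, word):
        c_hw = pair_counts[(history, word)]
        c_h = hist_counts[history]
        # MAP estimate: with no adaptation counts this falls back
        # exactly to the background probability.
        return (c_hw + tau * background_prob(history, word)) / (c_h + tau)

    return prob

# Tiny usage example with a uniform background over a 4-word vocabulary.
p = map_bigram([("the", "hero"), ("the", "hero"), ("the", "story")],
               lambda h, w: 0.25, tau=2.0)
```

The thesis's point is that when `adapt_bigrams` is extremely small, most estimates collapse to the background model, which is what the latent-semantic data augmentation below is meant to remedy.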
Considering this, the thesis also studies segmentation of Chinese text corpora, topic-related keyword selection, and lexicon optimization for Chinese continuous speech recognition. The main contributions are: 1. By studying lexicon augmentation and optimization for given topics and proposing a Bi-Gram-based measure for selecting new words, the thesis develops an iterative algorithm for topic-related keyword selection. The method uses only a Bi-Gram language model, which on the one hand avoids the data sparsity that high-order language models cause in traditional non-iterative algorithms; on the other hand, the iterative process prevents the selection of semantically incomplete words, which is hard to avoid in non-iterative algorithms. Compared with the PAT-TREE-based method, the iterative algorithm is more efficient in both memory and time: experiments show that its memory cost is approximately 13.4%, and its time cost about 5.5%, of the PAT-TREE algorithm's. 2. A unique feature of Chinese is that all words and sentences are composed of somewhat more than 6,000 elementary characters. Accordingly, the thesis analyzes the influence of out-of-vocabulary (OOV) words on continuous speech recognition in both Chinese and English and performs lexicon optimization experiments, which indicate that for large-vocabulary topic-related Chinese continuous speech recognition, a 16k-word lexicon strikes a good balance among the storage space, computational effort, and character accuracy of the recognition system. 3. The thesis proposes a data augmentation method that applies contextual latent semantic analysis to a small topic-related adaptation corpus.
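Contribution 1 above scores adjacent-token pairs with a Bi-Gram-based measure to propose new-word candidates. A hedged sketch of one such measure, pointwise mutual information (the thesis's exact selection criterion and iteration are not reproduced here):

```python
import math
from collections import Counter

def bigram_pmi(tokens):
    """Score adjacent token pairs by pointwise mutual information.
    High-PMI pairs co-occur far more often than chance and are
    candidate new words; an iterative scheme would merge the best
    pair into one token and rescore."""
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    n, nb = len(tokens), len(tokens) - 1
    scores = {}
    for (a, b), c in bi.items():
        p_ab = c / nb                      # joint probability of the pair
        p_a, p_b = uni[a] / n, uni[b] / n  # unigram probabilities
        scores[(a, b)] = math.log(p_ab / (p_a * p_b))
    return scores

# Toy corpus: "x y" recurs as a unit, so it scores above "y x".
scores = bigram_pmi(["x", "y", "x", "y", "x", "y", "z"])
```

Because only pair and unigram counts are needed, such a measure avoids the higher-order count tables (and their sparsity) that non-iterative selection schemes require.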
The basic idea of this approach is that if a word has been observed after a given context, the same word is more likely to appear after latent-semantically similar contexts. Singular value decomposition (SVD) of a word-context matrix is employed to compute the latent semantic similarity between contexts, on which basis the adaptation corpus is augmented. In a Bi-Gram model, each context corresponds exactly to a word in the lexicon, so a statistical word clustering method with Bi-Gram average mutual information as the objective function is proposed for the latent semantic classification of contexts. Experiments demonstrate that this method provides more accurate latent semantic classes than the conventional K-nearest-neighbor method. In the experiments, the performance of the MAP adaptation model built on SVD-based data augmentation is clearly superior to that obtained by MAP alone. Taking the topic of The Legend of the Condor Heroes as an example: compared with conventional MAP, language model perplexity decreases by 11.6%, and the speech recognition character error rate declines by 3.1% relative. When the proposed Bi-Gram average-mutual-information word clustering is added, the language model improves further: perplexity decreases by another 7.7% and the character error rate by another 2.2% relative. 4. Although the latent semantic similarity between contexts can be well defined via SVD of the word-context matrix, SVD provides no explicit latent semantic classification; classification in the SVD space usually relies on techniques such as traditional K-means clustering, with high computational cost and large classification error. In addition, the low-rank approximation given by SVD is optimal only under an assumption of normally distributed data samples.
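The SVD step described above can be sketched as follows: the contexts (columns of a word-context count matrix) are projected into a truncated singular space and compared by cosine similarity. The rank, weighting, and names here are illustrative, not the thesis's exact configuration:

```python
import numpy as np

def context_similarity(word_context, k=2):
    """Latent semantic similarity between contexts via truncated SVD
    of a word-context count matrix (rows: words, cols: contexts)."""
    U, s, Vt = np.linalg.svd(word_context, full_matrices=False)
    # Represent each context by its top-k right singular directions,
    # scaled by the singular values.
    ctx = (np.diag(s[:k]) @ Vt[:k]).T          # shape: (n_contexts, k)
    norms = np.linalg.norm(ctx, axis=1, keepdims=True)
    ctx = ctx / np.where(norms == 0, 1.0, norms)
    return ctx @ ctx.T                          # cosine similarity matrix

# Contexts 0 and 1 have identical word profiles; context 2 is disjoint.
S = context_similarity(np.array([[2., 2., 0.],
                                 [1., 1., 0.],
                                 [0., 0., 3.]]), k=2)
```

Highly similar contexts can then share bigram evidence, which is the augmentation step: a word seen after one context contributes (with a weight) to latent-semantically similar contexts as well.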
But the normality assumption is not valid for word counts, for which the Poisson and other non-negative distributions are more appropriate. Therefore, a latent semantic classification method based on non-negative matrix factorization (NMF) is proposed, and the contextual latent semantic similarity in the NMF space is defined. Experiment results show that the MAP-adaptive language model based on NMF data augmentation is superior to the one obtained by SVD, both in perplexity and in the word error rate of topic-related continuous speech recognition, while also having lower computational complexity.
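The NMF factorization mentioned above can be sketched with the standard Lee-Seung multiplicative updates for the Frobenius objective; the thesis's exact NMF variant and similarity definition may differ:

```python
import numpy as np

def nmf(V, r=2, iters=200, seed=0):
    """Non-negative matrix factorization V ≈ W @ H by multiplicative
    updates. Both factors stay non-negative throughout, matching the
    non-negative nature of word counts."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, r)) + 1e-3
    H = rng.random((r, m)) + 1e-3
    for _ in range(iters):
        # Multiplicative updates never change sign; 1e-9 guards
        # against division by zero.
        H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
        W *= (V @ H.T) / (W @ H @ H.T + 1e-9)
    return W, H

# Exact rank-1 non-negative data is recovered almost perfectly.
V = np.array([[1., 2.], [2., 4.], [3., 6.]])
W, H = nmf(V, r=1, iters=1000)
```

Context similarity in the NMF space can then be defined, for example, as cosine similarity between the columns of H, and the non-negative factors are directly interpretable as soft class memberships, which is why NMF yields an explicit partition where SVD does not.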
Language: Chinese
Date available: 2011-05-07
Pages: 101
Source URL: [http://159.226.59.140/handle/311008/914]
Collection: Institute of Acoustics_IOA Master's and Doctoral Theses_1981-2009 Master's and Doctoral Theses
Recommended citation:
GB/T 7714
Ning Zhenjiang. 基于数据增强的汉语词表和语言模型自适应技术[D]. Institute of Acoustics, Chinese Academy of Sciences, 2005.

Deposit method: OAI harvesting

Source: Institute of Acoustics


Unless otherwise stated, all content in this system is protected by copyright, and all rights are reserved.