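Chinese Academy of Sciences Institutional Repositories Grid: A Study on Acoustic Model Adaptation and Robust Mandarin Connected Digits Recognition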

Chinese Academy of Sciences Institutional Repositories Grid

A Study on Acoustic Model Adaptation and Robust Mandarin Connected Digits Recognition (声学模型自适应算法及稳健汉语连续数字识别应用研究)

Document type: Thesis


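Author	Zhao Bing (赵兵)
Degree	Master of Engineering
Defense date	2001-03-01
Degree-granting institution	Institute of Automation, Chinese Academy of Sciences
Place of conferral	Institute of Automation, Chinese Academy of Sciences
Advisor	Xu Bo (徐波)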
Keywords	MAP; MLLR; MDL; Viterbi; Pronunciation Modeling; VTLN; Duration Modeling; Pitch; A-star search algorithm
Alternative title	A Study on Acoustic Model Adaptation and Robust Mandarin Connected Digits Recognition
Degree discipline	Pattern Recognition and Intelligent Systems
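Abstract (translated from Chinese)	This thesis is divided into two parts: the first analyzes and implements acoustic model adaptation algorithms, and the second designs a robust Mandarin connected digit recognition system. Over the past few decades, HMM-based speech recognition has made considerable progress. Most recognition systems are speaker-independent and aim to achieve a low error rate across as wide a range of speakers as possible. When the recognition environment does not match the training environment, recognition accuracy drops sharply. Starting from a detailed derivation of the MAP and MLLR algorithms, the thesis analyzes the correlation among acoustic model parameters at different levels, such as HMM states and the Gaussian components of the state output densities, and, building on a split into two transform classes for initials and finals, goes on to discuss an MLLR algorithm with static block transforms. This algorithm is then extended to a structured MLLR algorithm based on a regression tree, which dynamically controls the degree of adaptation according to the amount of adaptation data under specific criteria. The thesis also analyzes adaptation strategies for the silence model and reports detailed experimental results. The experiments show that these algorithms reduce the average relative error rate by more than 30% and outperform standard MAP and MLLR whether the adaptation data are scarce or plentiful. The first part also examines how to avoid over-training when very little adaptation data is available. Using pronunciation dictionary techniques, the thesis computes the pairwise correlation between states and applies a specific relaxation algorithm while accumulating the MLLR statistics, so that the MLLR transforms become localized. The combination of MLLR with front-end VTLN is analyzed as well. The second part focuses on improving the performance of the Mandarin connected digit recognition system. Mandarin is a tonal language, and pitch (fundamental frequency), as a special prosodic feature, greatly improves recognition performance. Experiments show that, with the adjusted decision tree question set, the overall average error rate falls by more than 40% relative to the toneless system. The thesis describes a pitch extraction algorithm that combines AMDF and NCCF, with DP as post-processing, and obtains fairly accurate pitch values at limited computational cost. The algorithm runs comfortably in real time, needing only 0.2 times real time on a PIII-500 machine, which matters for small-vocabulary recognition systems such as digit dialing. The second part also gives a detailed formal description of several key techniques for building a high-performance Mandarin connected digit recognition system, such as Gaussian duration modeling and the use of duration information in the time-frame-synchronous Viterbi beam search. The thesis tests several adaptation methods and strategies for Mandarin digit recognition, and also tests and analyzes the SCMN (Segmental Cepstral Mean Normalization) algorithm, which is effective against short-time transient noise.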
English abstract	As its title suggests, this thesis is divided into two parts. The first part describes a systematic adaptation approach for our tied-state triphone acoustic model, and the second part discusses the specially designed pitch tracker for the Mandarin connected digit recognition system. HMM-based speech recognition systems have recently demonstrated impressive recognition performance. Most of them are speaker-independent systems that aim to provide low error rates for a large range of speakers. The first part starts from the detailed mathematical framework of MAP and MLLR. By analyzing the correlation of our acoustic model parameters at several levels (HMMs, states, and even Gaussian mixtures), the thesis shows how to design an effective static structured MLLR transformation derived from two classes: initials and finals. This is followed by how to dynamically build a regression-tree-based MLLR using different control strategies, including MDL, for a Mandarin speech recognition system. The thesis also discusses background model adaptation in detail. The algorithm in our system shows an average error rate reduction of more than 30% over the baseline and is superior to both the standard MAP and MLLR approaches. Detailed experiments are reported as well. When the adaptation data is sparse, the over-training problem has to be dealt with. On the basis of the structured MLLR, the first part also presents an algorithm that uses HMM state confusion characteristics as prior knowledge and applies relaxation during the accumulation stage of MLLR to reduce the risk of over-training. The idea is to use the HMM state confusion table to calculate the pairwise correlation of states. The state confusion reflects pronunciation information, which is useful for speaker adaptation. The first part also examines the combination of VTLN with MLLR and reports the results for the Mandarin acoustic model.
The second part of the thesis aims at improving the performance of Mandarin connected digit recognition. Pitch, as a special feature of Chinese as a tonal language, proved highly effective in our system. By incorporating new tone questions into the decision tree question set, the overall average error rate is reduced by more than 40% compared to the toneless system. We also present a special pitch tracker that combines AMDF and NCCF to secure the accuracy of the pitch values while reducing the computation of the front-end pitch extraction. Dynamic programming (DP) post-processing greatly improves its effectiveness. The overall pitch extraction takes only 0.2 times real time on a PIII-500. This is acceptable for small-vocabulary recognition systems, such as digit dialing, which generally require less computation and less memory. The second part of the paper also showed the three effective detail technologie
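Language	Chinese
Other identifier	589
Source URL	[http://ir.ia.ac.cn/handle/173211/7323]
Collection	Graduates_Master's Theses
Recommended citation (GB/T 7714)	Zhao Bing. A Study on Acoustic Model Adaptation and Robust Mandarin Connected Digits Recognition [D]. Institute of Automation, Chinese Academy of Sciences, 2001.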

Ingest method: OAI harvesting

Source: Institute of Automation
