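Research on Data-Reconstruction-Based Robustness Techniques for Speech Recognition (基于数据重建的语音识别鲁棒性技术研究)
Document type: Thesis
Author | Luo Yu (罗宇) |
Degree | Doctorate |
Defense date | 2003 |
Degree-granting institution | Institute of Acoustics, Chinese Academy of Sciences |
Degree-granting place | Institute of Acoustics, Chinese Academy of Sciences |
Keywords | missing feature method, data reconstruction, missing component estimation, speech recognition, robustness |
Alternative title | Missing data imputation method-based robust speech recognition |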
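Abstract (translated from Chinese) | Modern speech recognition systems perform well in quiet environments, but their performance drops sharply when the input speech is corrupted by noise. Noise robustness has become one of the main challenges facing speech recognition technology. This thesis studies how data reconstruction (missing data imputation) improves the noise robustness of a speech recognition system under a complex task: high-perplexity, speaker-independent continuous Mandarin speech recognition. Data reconstruction methods assume that noise and speech have different local signal-to-noise ratios (SNR) in different regions of the time-frequency plane. Missing component estimation marks regions with low local SNR as "missing" and regions with high local SNR as "reliable"; the "missing vector" is then reconstructed, and the completed vector is passed to the recognizer. Because the method makes no assumptions about the noise characteristics and places no restrictions on them, it is potentially advantageous when the noise is non-stationary. Earlier work on data reconstruction concentrated on relatively simple tasks such as connected digit string recognition. Experiments show that recognizers for complex tasks are more sensitive to noise: recognition performance drops noticeably even at fairly high SNR. The thesis therefore concentrates on the complex-task setting. Its main contributions are as follows.
1) Data reconstruction based on a Gaussian model set. Assuming that the Mel sub-band feature vectors of clean speech can be quantized with a single codeword from a codebook of N Gaussian models, the thesis studies the data reconstruction algorithm based on maximum marginal probability (SMRGDI), proposes an algorithm based on the Gaussian model expectation (SMNDI) and an algorithm based on maximum a posteriori probability (SMAPDI), and compares and analyzes the three. Experiments show that SMRGDI is stable, has a small reconstruction error, and has low computational requirements, so it is the better choice for data reconstruction. With ideal missing component estimation and SMRGDI reconstruction, robustness to additive noise improves greatly. For speech corrupted by babble noise, syllable accuracy rises from 45.97% to 68.75% at SNR = 20 dB and from -5.81% to 32.03% at SNR = 5 dB. For speech corrupted by Gaussian white noise, syllable accuracy rises from 28.00% to 60.17% at SNR = 20 dB and from 2.34% to 20.24% at SNR = 5 dB.
2) Data reconstruction based on a probability-weighted average. The reconstruction error of the Gaussian model set algorithms comes mainly from classification errors and from the fact that speech feature vectors are not truly Gaussian distributed. The thesis proposes the probability-weighted average data imputation (PWADI) algorithm. It assumes that the Mel sub-band feature vectors of clean speech can be quantized with multiple codewords from a codebook of N Gaussian models: the set of candidate Gaussian models for a feature S is enlarged from 1 to K, and the K reconstructed features are averaged, weighted by the probability that each candidate model generates the "reliable vector" S_o, to give the estimate of the "missing vector" S_m. PWADI reduces the impact of classification errors and of the mismatch between the model and the true distribution. With ideal missing component estimation and PWADI reconstruction, performance improves further over SMRGDI. For babble noise, syllable accuracy rises from 45.97% to 69.10% at SNR = 20 dB and from -5.81% to 35.08% at SNR = 5 dB. For Gaussian white noise, it rises from 28.00% to 63.08% at SNR = 20 dB and from 2.34% to 25.27% at SNR = 5 dB.
3) Data reconstruction based on HMMs. To account for the temporal correlation of speech features, the thesis uses the transition probability matrix of a hidden Markov model to describe the dynamics of the feature vectors over time and a full covariance matrix to describe the correlation between the "reliable vector" and the "missing vector". It proposes a data reconstruction algorithm based on the locally optimal state path (LOPDI) and one based on a marginalized Viterbi decoding process (VITDI). LOPDI estimates the state sequence that generated the feature vectors from the locally optimal state path and reconstructs the "missing" data under the maximum a posteriori (MAP) criterion. The locally optimal path estimate can get stuck in a state that is locally optimal but wrong, which greatly increases the reconstruction error. LOPDI performs well at high SNR, but at low SNR it is clearly worse than the other algorithms. With ideal missing component estimation and LOPDI reconstruction, syllable accuracy for babble noise rises from 45.97% to 67.05% at SNR = 20 dB and from -5.81% to 14.51% at SNR = 5 dB; for Gaussian white noise it rises from 28.00% to 61.57% at SNR = 20 dB and from 2.34% to 12.96% at SNR = 5 dB. VITDI estimates the optimal state sequence with a marginalized Viterbi decoding process and reconstructs the "missing" data by MAP. Experiments show that VITDI outperforms LOPDI for every noise type and SNR tested. With ideal missing component estimation and VITDI reconstruction, syllable accuracy for babble noise rises from 45.97% to 70.07% at SNR = 20 dB and from -5.81% to [illegible].58% at SNR = 5 dB; for Gaussian white noise it rises from 28.00% to 63.20% at SNR = 20 dB and from 2.34% to 26.[illegible]% at SNR = 5 dB. With ideal missing component estimation, VITDI performs best and PWADI is slightly behind; both outperform LOPDI and SMRGDI. On the other hand, Viterbi decoding must wait until the speech input has ended and then backtrack through the states to obtain the optimal state sequence, so it cannot reconstruct data in real time, whereas PWADI, LOPDI, and SMRGDI can all reconstruct the feature vector of the current frame in real time.
4) Missing component estimation. Since the reconstruction algorithms already perform well, the quality of missing component estimation has a strong effect on noise robustness. Estimation based on spectral subtraction imposes a stationarity constraint on the noise. For speech corrupted by non-stationary babble noise it produces severe estimation errors and recognition performance collapses; for speech corrupted by stationary Gaussian white noise it gives good results at high SNR and recognition performance improves noticeably. With spectral-subtraction-based missing component estimation and PWADI reconstruction, syllable accuracy for babble noise falls from 45.97% to 36.62% at SNR = 20 dB and from -5.81% to -9.41% at SNR = 5 dB; for Gaussian white noise it rises from 28.00% to 51.05% at SNR = 20 dB and from 2.34% to 4.09% at SNR = 5 dB. The thesis proposes a non-linear spectral subtraction algorithm for missing component estimation. It adjusts the noise update coefficient a dynamically according to the estimated SNR of the signal, so that the noise estimate is updated slowly when the SNR is high and quickly when the SNR is low. Experiments show that, for both stationary Gaussian white noise and non-stationary babble noise, non-linear spectral subtraction gives good results at high SNR and noticeably improves noise robustness. With non-linear spectral subtraction and PWADI reconstruction, syllable accuracy for babble noise rises from 45.97% to 51.43% at SNR = 20 dB and from -5.81% to -5.41% at SNR = 5 dB; for Gaussian white noise it rises from 28.00% to 47.57% at SNR = 20 dB and from 2.34% to 2.52% at SNR = 5 dB. |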
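The SNR-dependent noise update described under contribution 4 can be illustrated with a minimal Python sketch. The abstract only states that the update coefficient a makes the noise estimate change slowly at high SNR and quickly at low SNR; everything else here is an assumption made for illustration and is not taken from the thesis: the recursive form `noise = a * noise + (1 - a) * y`, the linear mapping from SNR to a, the 0 dB reliability threshold, and the names `estimate_mask`, `a_slow`, `a_fast`, `snr_lo_db`, and `snr_hi_db`.

```python
import math

def estimate_mask(frames, init_noise, snr_threshold_db=0.0,
                  a_slow=0.98, a_fast=0.6, snr_lo_db=0.0, snr_hi_db=10.0):
    """Label each time-frequency cell reliable (True) or missing (False).

    frames: list of frames, each a list of sub-band powers of noisy speech.
    init_noise: initial noise power per sub-band (hypothetical starting guess).
    All parameter values are illustrative assumptions, not thesis settings.
    """
    noise = list(init_noise)
    masks = []
    for frame in frames:
        mask = []
        for b, y in enumerate(frame):
            # Spectral subtraction: speech power estimate, floored above zero.
            speech = max(y - noise[b], 1e-10)
            snr_db = 10.0 * math.log10(speech / max(noise[b], 1e-10))
            mask.append(snr_db > snr_threshold_db)
            # Map local SNR to the update coefficient a:
            # high SNR -> a near a_slow (slow update), low SNR -> a_fast.
            t = (snr_db - snr_lo_db) / (snr_hi_db - snr_lo_db)
            t = min(max(t, 0.0), 1.0)
            a = a_fast + t * (a_slow - a_fast)
            noise[b] = a * noise[b] + (1.0 - a) * y
        masks.append(mask)
    return masks, noise
```

Calling `estimate_mask([[1.0, 1.0], [100.0, 1.0], [1.0, 1.0]], [1.0, 1.0])` marks only the loud cell in the second frame as reliable. Because a is close to 1 in that cell, little of the speech energy leaks into the noise estimate.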
Abstract (English) | Modern automatic speech recognition (ASR) systems work well in quiet environments, but their performance degrades rapidly when speech is distorted by additive noise. Robustness against additive noise has therefore become one of the most challenging problems. This thesis investigates missing data imputation in a complex task: high-perplexity, speaker-independent continuous speech recognition. Missing data imputation methods assume that additive noise distorts speech differently in different time-frequency regions. Regions of the noisy speech where the local signal-to-noise ratio (SNR) is low are marked as "missing", and regions where the local SNR is high are marked as "reliable". Once a mask estimation method has identified the "missing" data, they are recovered and used as input to the ASR system. Because missing data imputation makes no assumption about the characteristics of the additive noise, it has the potential to improve ASR robustness against non-stationary noise. Most previous research on missing data imputation focused on connected digit recognition. Our experimental results show that a speech recognition system with a complex acoustic model is sensitive to additive noise: its performance degrades greatly in noisy environments even when the SNR is high. This thesis therefore studies data imputation methods in a complex task, high-perplexity speaker-independent continuous Mandarin speech recognition. The main contributions of this thesis are:
1) Research on Gaussian model set-based data imputation. On the assumption that clean speech feature vectors can be represented by a codebook of N Gaussian models, Gaussian model set-based data imputation methods are studied to recover "missing" data. We studied maximum marginal probability-based data imputation (SMRGDI) and developed Gaussian model mean-based data imputation (SMNDI) and maximum a posteriori probability-based data imputation (SMAPDI). Experiments comparing these methods show that SMRGDI is more stable than SMAPDI and SMNDI, and that SMRGDI and SMNDI take less time than SMAPDI. We therefore select SMRGDI for data imputation. With ideal mask estimation and SMRGDI, the robustness of the ASR system increases significantly. For speech distorted by babble noise, syllable accuracy improves from 45.97% to 68.75% at SNR = 20 dB and from -5.81% to 32.03% at SNR = 5 dB. For speech distorted by Gaussian white noise, syllable accuracy improves from 28.00% to 60.17% at SNR = 20 dB and from 2.34% to 20.24% at SNR = 5 dB.
2) Development of a probability-weighted average algorithm for data imputation. The estimation error of Gaussian model set-based data imputation comes from codebook identification errors and from the difference between the distribution of real speech feature vectors and that of the Gaussian model. We propose the Probability Weighted Average algorithm for Data Imputation (PWADI). PWADI uses the K best Gaussian models rather than only the single best hypothesis. Each of the K best Gaussian models is used to impute the data separately, and the probability-weighted average of the K recovered vectors is taken as the estimate of the "missing" data. PWADI reduces codebook identification errors and the effect of the mismatch between the real feature distribution and the Gaussian model. With ideal mask estimation and PWADI, the robustness of the ASR system increases significantly. For speech distorted by babble noise, syllable accuracy improves from 45.97% to 69.10% at SNR = 20 dB and from -5.81% to 35.08% at SNR = 5 dB. For speech distorted by Gaussian white noise, syllable accuracy improves from 28.00% to 63.08% at SNR = 20 dB and from 2.34% to 25.27% at SNR = 5 dB. |
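The K-best weighting behind PWADI can be sketched in a few lines of Python. This is a simplified illustration under assumptions that are not from the thesis: covariances are diagonal (so each Gaussian's reconstruction of a missing component is simply its mean for that component), the codebook is a hypothetical list of `(prior, mean, var)` tuples, and the names `log_marginal` and `pwadi_impute` are invented for the example.

```python
import math

def log_marginal(x, reliable, mean, var):
    """Log-likelihood of the reliable components only (diagonal Gaussian)."""
    total = 0.0
    for d in reliable:
        total += -0.5 * (math.log(2.0 * math.pi * var[d])
                         + (x[d] - mean[d]) ** 2 / var[d])
    return total

def pwadi_impute(x, mask, codebook, k=2):
    """Fill the missing components of x (mask[d] is True when reliable).

    codebook: list of (prior, mean, var) tuples, a hypothetical layout.
    Returns a new vector; reliable components are copied unchanged.
    """
    reliable = [d for d, m in enumerate(mask) if m]
    # Score every Gaussian by prior times marginal likelihood of the
    # reliable part, in the log domain.
    scored = [(math.log(prior) + log_marginal(x, reliable, mean, var), mean)
              for prior, mean, var in codebook]
    scored.sort(key=lambda s: s[0], reverse=True)
    best = scored[:k]
    # Turn log scores into weights, subtracting the maximum for stability.
    top = best[0][0]
    weights = [math.exp(score - top) for score, _ in best]
    norm = sum(weights)
    out = list(x)
    for d, m in enumerate(mask):
        if not m:
            # Probability-weighted average of the K per-model reconstructions.
            out[d] = sum(w * mean[d]
                         for w, (_, mean) in zip(weights, best)) / norm
    return out
```

With `k=1` the sketch reduces to picking the single Gaussian that best explains the reliable components, which is the kind of single-codeword choice that PWADI is meant to soften. A full-covariance version would replace the per-model mean with the conditional mean of the missing part given the reliable part.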
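Language | Chinese |
Release date | 2011-05-07 |
Pages | 130 |
Source URL | [http://159.226.59.140/handle/311008/1016] |
Collection | Institute of Acoustics_IOA Doctoral and Master's Theses_1981-2009 Doctoral and Master's Theses |
Recommended citation (GB/T 7714) | Luo Yu. Research on Data-Reconstruction-Based Robustness Techniques for Speech Recognition [D]. Institute of Acoustics, Chinese Academy of Sciences. Institute of Acoustics, Chinese Academy of Sciences. 2003. |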
Ingest method: OAI harvesting
Source: Institute of Acoustics