Chinese Continuous Speech Recognition Based on Hybrid HMM-ANN Models
Document type: Thesis
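Author | Jia Ying (贾颖) |
Degree | Doctoral |
Defense date | 1999 |
Degree-granting institution | Institute of Acoustics, Chinese Academy of Sciences |
Place conferred | Institute of Acoustics, Chinese Academy of Sciences |
Keywords | speech recognition; artificial neural network (ANN); hidden Markov model (HMM); acoustic model; hybrid HMM-ANN model |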
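Chinese abstract | As the new century approaches, human society is entering a highly automated, information-driven digital age, and the widespread use of computers and networks throughout the economy has created a huge market for human-computer interaction technology. Speech is the most convenient, effective, and rapid way for people to exchange information, so research on and application of a new generation of speech-centered human-machine communication media has become the focus of the information-technology investment strategies that many research institutions and companies are drawing up for the next 5-10 years. Speech recognition, the most critical technology for spoken human-machine interaction, has accumulated 40 years of research and, in particular, a series of breakthroughs over the last 10 years; driven by the latest advances in microprocessors and computer software, it now stands on the eve of large-scale commercial application. Acoustic modeling of speech with hidden Markov models (HMMs) is the main reason for the breakthroughs in large-vocabulary continuous speech recognition. Regrettably, the achievements of the past decade owe far more to advances in computer technology and a series of system-level optimizations than to substantive progress in HMM theory and algorithms. The many unrealistic modeling assumptions of the HMM and its non-discriminative training algorithm (the maximum likelihood criterion, ML) are becoming a bottleneck for the future development of speech recognition systems. With that future in view, the hybrid HMM-ANN (artificial neural network) acoustic model departs from the HMM in many of its modeling assumptions and in its training algorithm. A neural network, acting as a nonparametric probability model, replaces the Gaussian mixture (GM) in computing the observation probabilities required by the HMM states. This gives the state observation probabilities a functional form closer to the true distribution and makes them discriminative at the frame level; the temporal correlation between adjacent frames can also be taken into account at the network input. Research has shown that a carefully trained neural network can estimate HMM state posterior probabilities fairly accurately, but designing and training a large network is difficult and time-consuming, which is the most prominent problem in hybrid model research. This thesis studies the basic problems and system techniques of hybrid HMM-ANN acoustic modeling. Its main original contributions are as follows. 1. Optimized design and training algorithms for large-scale neural networks: the quality of the network's posterior probability estimates directly determines the modeling accuracy of the hybrid acoustic model. A network that supplies observation probabilities for all HMM states is very large and is hard to train effectively with the standard error back-propagation (BP) algorithm, and choosing the network structure is still regarded as something of an art. We therefore propose a set of algorithms that optimize each aspect of network design and training. For structure optimization, we propose an algorithm that estimates the initial number of hidden nodes and prunes hidden nodes from the trained network. We also optimize the standard BP algorithm with respect to connection weight initialization, the form of the objective function, learning rate updates, and parameter generalization, and we propose a new way of constructing the objective function. With these optimizations, a large-scale neural network with 114,800 connection weights can be trained on a Chinese syllable speech database in only 25 iterations, reaching a frame recognition rate of 86.38%. 2. A hybrid acoustic model with shared posterior distributions: we first emphasize that acoustic modeling of speech is a static mapping problem, propose a hierarchy of discriminability among phonetic units at different levels, and analyze theoretically why HMM units overlap heavily in the speech feature vector space. To discriminate acoustic vector frames with overlapping distributions, we propose sharing posterior distributions among HMM states, together with algorithms that determine the sharing relationships from phoneme structure and from analysis of the acoustic feature space. In experiments on Chinese continuous speech test data, the hybrid acoustic model with shared posterior distributions achieved a base-unit recognition rate of 90.69% and a syllable recognition rate of 69.22%. We have not yet seen any similar work reported, so this part is highly original. 3. A context-dependent hybrid acoustic model: we first classify prior work on context-dependent hybrid HMM-ANN modeling into four forms and identify posterior probability factorization and sharing of hidden-layer parameters as the key points of a context-dependent hybrid model. Following this line of thought, we propose a new factorization of the context-dependent posterior probability, an implementation structure that shares the input-to-hidden parameters, and a parameter-smoothing training algorithm. Experiments show that the context-dependent hybrid model clearly improves the recognition rate of initials (92.75%). 4. A hybrid HMM-HNN model that incorporates the confusion structure of base units: intuition suggests that some base units are harder to tell apart than others. In the acoustic recognition results this appears as stubbornly high misrecognition rates between certain units, while other units are almost never confused. Based on these observations, we argue that discrimination among base units has a hierarchical structure, and we propose an algorithm that extracts the confusion structure of base units from the acoustic recognition confusion matrix. The confusion-structure tree we obtained for Chinese initials and finals shows that the degree of confusion between units is closely related to the acoustic features of initials and finals (voiced versus unvoiced, aspirated versus unaspirated, and the final's coda). This relationship lets us resolve different confusion cases with different speech features. To discriminate base units at the frame level, that is, to compute the posterior probability p(s_i|X_t), we propose a hierarchical neural network (HNN) computing structure matched to the tree-shaped confusion structure, together with an expectation-maximization (EM) learning algorithm for the hierarchy. Finally, we designed an HNN structure for estimating HMM state posterior probabilities from the acoustic recognition confusion tree of Chinese initials and finals and applied it to Chinese continuous speech recognition. Compared with the hybrid HMM-MLP, training time was halved and the character recognition rate on the test set improved by 7 percentage points. Confusion-structure analysis of the confusion matrix obtained by recognizing the sentences in the continuous speech test set shows that an HNN structure combining different input window lengths clearly reduces the confusion among initials. The most important result of this thesis is the idea, proposed and implemented here, of discriminating base units (that is, recognizing them) specifically according to the confusion structure among them, together with the algorithm for extracting the acoustic recognition confusion structure and the hierarchical network structure for computing observation probabilities. These results open a new and promising direction for speech recognition research. |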
English abstract | As computers permeate every corner of our daily lives, communicating with them through keyboards, mice, and other devices remains an order of magnitude slower and more cumbersome. It is becoming clear that an easier, faster, and more intuitive method of communicating with computers is needed. Speech is the primary mode of communication among human beings. The demand to speak to computer devices that require human interaction has created a substantial market for products incorporating ASR technology, so speech recognition will be a focus of the next generation of computer advances. In recent years, speech recognition technology has made major strides, and recognition algorithms have been developed and refined. However, these systems have significant limitations: they continue to rely on the same statistical technology, the hidden Markov model (HMM), a statistical framework that supports both acoustic and temporal modeling. Much of the improvement in ASR systems over the last decade can be accounted for by the increase in the speed and memory of personal computers, while HMM technology has not changed dramatically since the late 1980s. A number of suboptimal modeling assumptions made by HMMs limit their potential effectiveness, and the maximum likelihood (ML) training criterion is often criticized for its lack of discrimination. The problems we experienced with HMMs also confirm that this technology, although useful, still has a way to go before it is ready for widespread use. The hybrid HMM-ANN technology investigated in this thesis is a major departure from current industry methods and is designed to provide superior speech recognition for both current and future applications. In hybrid models, ANNs are used to estimate the scaled likelihood for each HMM state. Neural networks avoid many of the unrealistic assumptions of HMMs; furthermore, they can learn complex functions, generalize effectively, tolerate noise, and support parallelism. A hybrid HMM-ANN model therefore has several theoretical advantages over a pure HMM system, such as better acoustic modeling accuracy, better context sensitivity, more natural discrimination, and a more economical use of parameters. These advantages have been confirmed experimentally by pioneering work at ICSI, which reported a slight reduction in word error rate and showed great potential for discriminative training, faster decoding, and robustness. In this thesis, we examine how Chinese continuous speech recognition can benefit from hybrid HMM-ANN models. The main contributions made by the author are as follows: 1. To make the estimation of posterior probabilities with neural networks more accurate, we develop a complete set of efficient design and training algorithms for the Big-Dumb neural networks deployed in hybrid HMM-ANN models. To optimize the network structure, we propose an algorithm that removes hidden nodes from a trained neural network, and we determine the initial number of hidden nodes by cluster analysis of the training data. For the training of large-scale neural networks, we propose a new set of objective functions that eliminate the well-known false saturation of the mean square error function and the overspecialization of the cross-entropy function. 2. To improve discrimination and reduce the distribution overlap between HMM states at the frame level, we propose two strategies for state-dependent posterior distribution sharing, one based on phonetics and the other on clustering. 3. As required by continuous speech recognition, we propose a context-dependent hybrid model that differs from Bourlard's: it has a new form of context-dependent posterior factoring, an input-to-hidden connection sharing structure, and a smoothing training method. 4. The most important new idea proposed in this thesis is to discriminate HMM states according to their confusion structure. We developed an algorithm that extracts the confusion structure from the confusion matrix given by acoustic recognition and constructs a hierarchical neural network to resolve those confusion cases. |
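Language | Chinese |
Date made public | 2011-05-07 |
Pages | 124 |
Source URL | [http://159.226.59.140/handle/311008/620] |
Collection | Institute of Acoustics_Institute of Acoustics Doctoral and Master's Theses_1981-2009 Doctoral and Master's Theses |
Recommended citation (GB/T 7714) | Jia Ying. Chinese Continuous Speech Recognition Based on Hybrid HMM-ANN Models[D]. Institute of Acoustics, Chinese Academy of Sciences. 1999. |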
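The English abstract notes that in hybrid models the ANN estimates a scaled likelihood for each HMM state. The sketch below illustrates the standard Bayes-rule conversion behind that statement: dividing a state posterior p(s_i|x_t) by the state prior p(s_i) gives p(x_t|s_i)/p(x_t), which can stand in for the HMM observation probability during decoding. It is only an illustration, not the thesis's implementation; the function name, the `floor` parameter, and all numeric values are hypothetical.

```python
import math


def scaled_log_likelihoods(posteriors, priors, floor=1e-8):
    """Convert state posteriors for one frame into scaled log-likelihoods.

    By Bayes' rule, p(x_t | s_i) / p(x_t) = p(s_i | x_t) / p(s_i).
    The result is returned in the log domain: log posterior - log prior.
    `floor` (an assumption of this sketch) guards against log(0).
    """
    if len(posteriors) != len(priors):
        raise ValueError("posteriors and priors must have the same length")
    return [
        math.log(max(post, floor)) - math.log(max(prior, floor))
        for post, prior in zip(posteriors, priors)
    ]


# Hypothetical values for a single frame and three HMM states.
frame_posteriors = [0.7, 0.2, 0.1]  # assumed network outputs for this frame
state_priors = [0.5, 0.3, 0.2]      # assumed state frequencies in the training data

scores = scaled_log_likelihoods(frame_posteriors, state_priors)
```

A positive score means the network finds the state more probable for this frame than its prior alone would suggest; a decoder would use these scores in place of log observation probabilities.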
Ingest method: OAI harvesting
Source: Institute of Acoustics