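Chinese Academy of Sciences Institutional Repositories Grid: Research on Compensation Algorithms for the Acoustic Features of Telephone Speech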

Chinese Academy of Sciences Institutional Repositories Grid

Research on Compensation Algorithms for the Acoustic Features of Telephone Speech

Document type: Thesis


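Author	Zhou Bowen (周伯文)
Degree	Doctoral
Defense date	1999
Degree-granting institution	Institute of Acoustics, Chinese Academy of Sciences
Place conferred	Institute of Acoustics, Chinese Academy of Sciences
Keywords	telephone speech; acoustics; compensation algorithm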
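Chinese abstract (translated)	Human society today lives in an unprecedentedly information-rich, automated digital environment, and information exchange between people and machines is everywhere. People want to communicate with machines more naturally. Speech recognition, the technology of using a computer to extract useful information from a person's acoustic speech signal and determine its linguistic meaning, is an important means of realizing this goal. The telephone is one of the most widely used and most effective means of communication in the world, so telephone speech recognition is an extremely important application of speech recognition. At the same time, the telephone environment raises many robustness problems for speech recognition, which makes continuous telephone speech recognition a highly challenging research topic. For these two reasons, telephone speech recognition has become a focus of current speech recognition research. This thesis describes the author's work on Chinese continuous telephone speech recognition. Its goal is to build a reasonably robust Chinese continuous telephone speech recognition system. The main work includes:
1. Design of a fully automatic telephone speech collection system and construction of a Chinese telephone speech database: Chinese telephone speech corpora, an indispensable resource for speech recognition research, are still scarce in China. We therefore designed and built a fully automatic system to collect telephone speech in large quantities. The system runs 24 hours a day, answers incoming calls automatically, guides the caller with voice prompts, collects the caller's speech, and stores it by category on the local hard disk. When the caller hangs up, the system resets the local line and waits for the next call. The system runs on Windows, and its many collection parameters can easily be changed and the collection process monitored. Using this system, we built an initial Chinese telephone speech database. The database is rich in content, covering isolated characters, connected words, and a large number of continuous-speech sentences, and currently holds about 2 GB. The laboratory is also using the system to collect a large amount of Chinese spontaneous speech over the telephone. This work lays a solid foundation for research on and applications of telephone speech recognition.
2. Analysis of the acoustic characteristics of the telephone channel and study of compensation algorithms: The greatest difficulty in telephone speech recognition is the robustness challenge posed by the telephone environment. Its effects include the variability introduced into telephone speech by the background environment of the call (background noise), the handset type, and the transmission characteristics of the channel. This variability makes it hard to train statistical HMMs adequately, which leads to a severe mismatch between the data to be recognized and the trained models and lowers the recognition rate. Robust algorithms that reduce this variability are therefore especially important in telephone speech recognition. This thesis analyzes several salient acoustic characteristics of the telephone channel, namely handset characteristics, bandwidth limitation, convolutional noise, additive noise, intermodulation interference, low-frequency interference, and nonlinearity, together with their effects on speech recognition. On the robustness side, it studies the implementation, strengths, and weaknesses of cepstral mean normalization (CMS), maximum-likelihood (ML) channel estimation, and maximum a posteriori (MAP) channel estimation. In particular, it studies in detail the idea, implementation, and physical properties of the relative spectral (RASTA) method; analyzes how RASTA's dependence on the preceding (left) context affects different types of back-end HMM acoustic models; discusses how to apply phase compensation when RASTA is used in an HMM system with context-independent models; and implements an extraction algorithm for relative mel-cepstral features (RASTA-MFCC) that is both robust and computable in real time.
3. A training strategy and a fast training algorithm for telephone-speech acoustic models: Acoustic models for telephone speech are currently built in one of two ways, that is, by training on clean-speech features or by training on compensated telephone-speech features. This thesis proposes a new training strategy. First, a narrowband (300-3400 Hz) clean-speech acoustic model is trained from narrowband clean-speech features; to speed up this step, single-pass retraining is used to derive the narrowband model quickly from an existing wideband clean-speech acoustic model. Then the model is adapted with compensated, robust narrowband telephone-speech features (RASTA-MFCC) to obtain the telephone-speech acoustic model HMMTEL. Extensive experiments show that HMMTEL is robust: on a Chinese continuous telephone speech test set of nearly 45,000 syllables, its syllable correct rate reaches 70.27%.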
English abstract	We live in a digital era in which we need to interact with computers anywhere and at any time. People clearly expect a more natural way than the keyboard or mouse to communicate with machines. Automatic speech recognition (ASR), the technology that lets computers extract linguistic information from human speech signals, is an important means of meeting that expectation. The telephone is one of the most common and effective ways of spreading information in the world, which makes telephone speech recognition (TSR) an important application of ASR. Meanwhile, the robustness issues that telephone environments introduce into ASR systems make TSR more challenging. TSR has therefore become one of the most attractive topics in ASR research. In this thesis, we focus on Chinese continuous telephone speech recognition. The main contributions made by the author include:
1. An automatic telephone speech collection system and a corpus of Chinese telephone speech. Although such data is an indispensable resource for TSR research, few Chinese telephone speech databases are available. To start our research, we designed an automatic telephone speech collection system. The system works 24 hours a day and responds automatically when a call comes in. Callers are guided by the system's voice prompts, and their speech is collected and saved on the local hard disk. When the call is over and the caller hangs up, the system resets the local phone and waits for another call. Using this system, we created a Chinese telephone speech database of about 2 GB that includes isolated words, connected words, and continuous speech. We are also collecting more spontaneous Chinese speech through this system for future work. This work lays a solid foundation for our TSR research and future applications.
2. Study of the acoustic characteristics of the telephone channel and robust feature extraction for TSR. The biggest difficulty of TSR lies in the robustness issues introduced by telephone environments, which include the background noise when a user makes a call, the variability of handsets, and the transfer characteristics of the telephone channel. This variability causes a mismatch between the models and the test speech and severely degrades system performance. We analyze some typical acoustic characteristics of the telephone channel, such as handset variability, band limitation, convolutional noise, additive noise, intermodulation distortion, low-frequency tones, and nonlinearity, as well as their impact on ASR systems. We study the theory and implementation of several algorithms for extracting robust features for TSR, including cepstral mean normalization, maximum-likelihood channel vector estimation, and MAP channel vector estimation. In particular, we discuss the RASTA algorithm in more detail from the perspective of an ASR system. RASTA processing makes features more robust to channel distortion, and we found that its left-context dependency makes it more suitable for ASR systems based on context-dependent HMMs. We also discuss how to compensate for the phase distortion of RASTA with a pole-zero phase correction filter so that RASTA is also suitable for context-independent models. We propose RASTA-MFCC as our robust feature for TSR.
3. A training strategy and fast retraining algorithms for constructing robust acoustic models for telephone speech recognition. There are usually two ways to construct acoustic models for telephone speech recognition: training on high-quality speech or training on telephone speech. We propose our own training strategy. First, we train a narrowband (300-3400 Hz) acoustic model on high-quality speech; in this step, we map the new models from the existing models for high-quality speech recognition through a fast retraining method called single-pass retraining, which keeps the computational cost low. Second, we adapt the narrowband models with compensated telephone speech features. After a few rounds of adaptation, we obtain the acoustic models for telephone speech, HMMTEL. Extensive experiments show that HMMTEL is robust for telephone speech recognition. On a test corpus of nearly 45,000 continuous syllables, the syllable correct rate is 70.27%.
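Language	Chinese
Date available	2011-05-07
Pages	52
Source URL	[http://159.226.59.140/handle/311008/632]
Collection	Institute of Acoustics_Institute of Acoustics Doctoral and Master's Theses_1981-2009 Doctoral and Master's Theses
Recommended citation (GB/T 7714)	Zhou Bowen (周伯文). Research on Compensation Algorithms for the Acoustic Features of Telephone Speech [D]. Institute of Acoustics, Chinese Academy of Sciences. Institute of Acoustics, Chinese Academy of Sciences. 1999.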
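
The abstract names two channel-compensation steps, cepstral mean normalization (CMS) and RASTA filtering, without giving formulas. The sketch below is only an illustration of those two general techniques, not the thesis's implementation. The function names are hypothetical, and the RASTA coefficients (numerator 0.1 × (2, 1, 0, -1, -2), pole 0.98) are the commonly cited textbook form, assumed here because the record does not state which values the author used. The idea is that a fixed telephone channel shows up as a constant offset in the cepstral (or log-spectral) domain, so removing the mean, or band-pass filtering each coefficient over time, suppresses it.

```python
def cepstral_mean_normalize(frames):
    """Subtract the per-coefficient mean over all frames of one utterance.

    `frames` is a list of cepstral vectors (lists of floats), one per frame.
    """
    n_frames = len(frames)
    n_coeffs = len(frames[0])
    means = [sum(f[k] for f in frames) / n_frames for k in range(n_coeffs)]
    return [[f[k] - means[k] for k in range(n_coeffs)] for f in frames]


# Assumed (textbook) RASTA filter, causal form:
#   H(z) = 0.1 * (2 + z^-1 - z^-3 - 2 z^-4) / (1 - 0.98 z^-1)
# The numerator sums to zero, so a constant (channel) offset is rejected.
RASTA_NUMERATOR = (0.2, 0.1, 0.0, -0.1, -0.2)
RASTA_POLE = 0.98


def rasta_filter(trajectory, pole=RASTA_POLE):
    """Band-pass filter the time trajectory of one coefficient."""
    out = []
    prev = 0.0
    for n in range(len(trajectory)):
        # FIR part: weighted difference over the current and 4 previous frames.
        fir = sum(b * trajectory[n - i]
                  for i, b in enumerate(RASTA_NUMERATOR) if n - i >= 0)
        # IIR part: leaky integration, which makes each output depend on
        # all preceding frames (the left-context dependency noted above).
        prev = fir + pole * prev
        out.append(prev)
    return out


# Usage: the same utterance with and without a constant channel offset
# gives the same features after CMS (up to floating-point rounding).
clean = [[1.0, 2.0], [3.0, -1.0], [2.0, 0.5]]
channel = [0.7, -0.3]
distorted = [[c + h for c, h in zip(frame, channel)] for frame in clean]
cms_clean = cepstral_mean_normalize(clean)
cms_distorted = cepstral_mean_normalize(distorted)
```

CMS needs the whole utterance before it can compute the mean, whereas the RASTA filter runs frame by frame, which is consistent with the abstract's emphasis on real-time feature extraction.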

Ingest method: OAI harvesting

Source: Institute of Acoustics

Unless otherwise stated, all content in this system is protected by copyright, and all rights are reserved.