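Chinese Academy of Sciences Institutional Repositories Grid: Research on Key Technologies for Audio-Visual Bimodal Chinese Speech Recognition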


Chinese Academy of Sciences Institutional Repositories Grid

Research on Key Technologies for Audio-Visual Bimodal Chinese Speech Recognition

Document type: Dissertation


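Author	徐彦君 (Xu Yanjun)
Degree	Doctorate
Defense date	1998
Degree-granting institution	Institute of Acoustics, Chinese Academy of Sciences
Place conferred	Institute of Acoustics, Chinese Academy of Sciences
Keywords	speech recognition; bimodal database; Chinese AVSR system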
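Chinese abstract	Automatic bimodal speech recognition (AVSR) achieves higher recognition performance by using the visual information of speech to compensate for the acoustic information. After an in-depth analysis of the state of AVSR research, this thesis concentrates on several key technologies for AVSR: the construction of the Chinese bimodal database CAVSR1.0, a research platform for two-dimensional visual feature extraction, and three-dimensional stereo vision matching algorithms. Because no Chinese bimodal database had yet been built, and such a database is indispensable for research on Chinese AVSR, the first bimodal database for Chinese speech, CAVSR1.0, was built on the basis of an analysis of the structure of comparable foreign databases and the characteristics of Chinese speech. The database has the following features: it conforms to characteristics of the Chinese language; it is fairly large in scale, in terms of corpus, speakers, and number of utterance repetitions, and is suitable for short- and medium-term research; and it is not only a database but also a framework designed to be highly extensible, where the extensible dimensions include speakers, corpus, the spatial resolution of the images, and the temporal resolution of the image sequences. The database therefore not only fills a gap in Chinese language processing but also contributes to general methods and frameworks for building bimodal databases. Visual feature extraction has gradually become the bottleneck in implementing computer AVSR. Many experimental AVSR systems were built in the late 1980s and early 1990s, but because visual feature extraction has seen no major breakthrough since, computer AVSR research currently lacks momentum. Research should therefore focus on solving this key technology. Two-dimensional visual features are currently the main object of study in computer AVSR. Among established methods, model-based methods have gradually shown advantages over pixel-based methods, so this thesis analyzes the active shape model method and builds a model-based research platform for visual feature extraction on that basis. On this platform, the three main components of the active shape model (shape modeling, modeling of the image evidence for the shape, and the energy minimization process) are studied in depth and implemented. For shape modeling, a complex form of the point distribution model is proposed on the basis of a study of its basic representation; a comparison of the two shows that the complex form is more […] with the maturing of […] visual feature extraction technology, will become an effective component of AVSR systems. This thesis studies some fundamental problems in stereo vision and implements two stereo matching methods: one implemented with a neural network, and one that is a scale-adaptive phase-based method. The scale-adaptive phase-based method draws on the concept of scale adaptivity proposed by M.W. Maimone. For the construction of the multiscale filters, a selection rule based on the correlation of the integrated areas of the frequency responses is proposed, and a sequence of prime numbers is used as the wavelengths of the Gabor filter bank. The coarse-to-fine strategy and the scale-adaptive strategy are then combined into a grouped scale-adaptive algorithm, which largely retains the advantages of both schemes and overcomes their respective shortcomings. Experiments show that the algorithm is computationally efficient, robust, and recovers disparity with high accuracy. The research in this thesis aims to provide the key technologies and foundations for building a Chinese AVSR system. Owing to time constraints and the heavy workload, the completed research topics were largely pursued independently and in parallel, but in practice they are integral components of a Chinese AVSR system under construction. Future research includes system integration and further optimization of the individual technologies.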
English abstract	Automatic audiovisual speech recognition (AVSR) systems, through their use of visual information to supplement acoustic information, have been shown to yield better recognition performance. After reviewing the research literature on AVSR, this thesis presents research on several key technologies for Chinese AVSR: the construction of the Chinese audiovisual bimodal database CAVSR1.0, a research platform for 2-D visual feature extraction, and stereo matching algorithms. As no such database exists for Chinese, it is necessary to build a Chinese bimodal database as soon as possible. In this thesis, the first Chinese bimodal database, CAVSR1.0, is introduced. It has the following advantages. Its corpus includes all Chinese phonetic units (initials and finals), and its size is very large. Its corpus selection conforms to the distribution probability of initials and finals, so conclusions drawn from it can be taken as representative of the Chinese language. Algorithms for automatic segmentation and automatic labeling of the main features are bundled with it, so it has good extensibility. The extensibility covers talkers, corpus, and the temporal and spatial resolution of the images. The database therefore not only provides a bimodal database for Chinese speech processing but also contributes to general bimodal database construction methods and frameworks. Visual feature extraction is one of the most important techniques in AVSR and remains a very challenging area in image understanding. Some AVSR systems were developed in the 1980s and 1990s, but this problem has not yet been solved satisfactorily. Visual feature extraction is now becoming the main difficulty in AVSR research. 2-D visual features are the most useful for AVSR systems, and model-based methods are attracting more and more interest. In this thesis a model-based research platform for visual feature extraction is built. On this platform, three main processes (shape modeling, image evidence modeling, and the energy minimization procedure) have been studied in depth and implemented. In shape modeling, a complex-valued point distribution model is proposed, which is more accurate than the usual real-valued version. Image evidence modeling is implemented with gray-level profile vectors, Gabor filter profile vectors, and wavelet transform profile vectors respectively. An analysis of how the energy function varies with translation, rotation, scale, and shape is also performed. In the energy minimization procedure, a segmental downhill simplex method is proposed, which is faster than the standard method. Experimental results for lip localization on the Tulips1 database show that this research platform has high accuracy and good robustness. 3-D visual features have so far been used in only a few AVSR systems. As a very important perceptual feature for some phonemes, they are becoming one of the key components of AVSR systems. In this thesis, some basic issues in stereo vision are studied in depth. An ANN stereo matching algorithm and a scale-adaptive multiscale phase-based stereo matching algorithm are implemented. In the scale-adaptive phase-based stereo matching algorithm, M.W. Maimone's scale-adaptive strategy is adopted. A novel rule for selecting the filter bank frequencies is proposed, based on the correlation of the integrated areas of the filters' frequency responses. A sequence of prime numbers is taken as the wavelengths of the Gabor filter bank. A grouped scale-adaptive strategy is proposed, which combines the advantages of the coarse-to-fine and scale-adaptive strategies. Experiments show that this method is effective and robust. The research in this thesis aims at providing the key technologies and basis for constructing a Chinese AVSR system. Although the issues covered were researched independently and in parallel, they are in fact important components of a Chinese AVSR system under development. Future research includes system integration and further optimization of the individual technologies.
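Language	Chinese
Release date	2011-05-07
Pages	101
Source URL	[http://159.226.59.140/handle/311008/618]
Collection	Institute of Acoustics_Doctoral and Master's Theses of the Institute of Acoustics_Doctoral and Master's Theses 1981-2009
Recommended citation (GB/T 7714)	徐彦君 (Xu Yanjun). Research on Key Technologies for Audio-Visual Bimodal Chinese Speech Recognition [D]. Institute of Acoustics, Chinese Academy of Sciences. Institute of Acoustics, Chinese Academy of Sciences. 1998.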
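
The abstract mentions phase-based stereo matching with a Gabor filter bank whose wavelengths form a prime sequence, refined from coarse to fine. The following is only a minimal one-dimensional sketch of that general idea, not the thesis's grouped scale-adaptive algorithm. The wavelengths (23, 13, 7), the choice of envelope width equal to the wavelength, and the function names `gabor_response` and `phase_disparity` are all assumptions made for illustration.

```python
import numpy as np


def gabor_response(signal, wavelength):
    """Circularly convolve a 1-D signal with a complex Gabor kernel (via FFT)."""
    n = signal.size
    x = np.arange(n) - n // 2            # kernel coordinates, zero at index n // 2
    sigma = float(wavelength)            # assumed envelope width (illustrative choice)
    omega = 2.0 * np.pi / wavelength     # peak angular frequency of the filter
    kernel = np.exp(-x**2 / (2.0 * sigma**2)) * np.exp(1j * omega * x)
    kernel = np.fft.ifftshift(kernel)    # move the kernel centre to index 0
    return np.fft.ifft(np.fft.fft(signal) * np.fft.fft(kernel))


def phase_disparity(left, right, wavelengths=(23, 13, 7)):
    """Estimate per-sample disparity d, where right(x) is roughly left(x - d).

    Wavelengths are visited from coarse to fine. At each scale the phase
    difference of the two filter responses is compared with the phase
    predicted by the current estimate, and the wrapped residual refines it.
    """
    disparity = np.zeros(left.size)
    for wavelength in wavelengths:
        omega = 2.0 * np.pi / wavelength
        resp_left = gabor_response(left, wavelength)
        resp_right = gabor_response(right, wavelength)
        # Local phase difference, approximately omega * d (mod 2*pi).
        dphi = np.angle(resp_left * np.conj(resp_right))
        # Wrap the residual to (-pi, pi] so the coarser estimate resolves ambiguity.
        residual = np.angle(np.exp(1j * (dphi - omega * disparity)))
        disparity = disparity + residual / omega
    return disparity


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    left = rng.standard_normal(1024)
    right = np.roll(left, 3)             # synthetic pair with a known shift of 3 samples
    print(np.median(phase_disparity(left, right)))
```

Each scale can only resolve a residual smaller than half its wavelength, which is why the coarsest filter runs first. Estimates near points where the filter response amplitude is close to zero are unreliable, so a robust summary such as the median is used in the usage example.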

Ingest method: OAI harvesting

Source: Institute of Acoustics

