Chinese Academy of Sciences Institutional Repositories Grid
Research on Key Technologies of Chinese Visual Speech Synthesis (汉语视觉语音合成关键技术的研究)

Document Type: Dissertation

Author: Zhang Xin (张欣)
Degree Level: Doctoral
Defense Date: 2005
Degree-Granting Institution: Institute of Acoustics, Chinese Academy of Sciences
Place of Conferral: Institute of Acoustics, Chinese Academy of Sciences
Keywords: bimodal visual speech database; facial feature localization and tracking; lip-shape matching; viseme extraction
Alternative Title: Key Technologies Research for Chinese Visual Speech Synthesis
Chinese Abstract: Enhancing the ability to detect, identify and understand events by fusing multimodal information from different sensory channels is a fundamental characteristic of human information interaction. Human speech perception is also inherently a multimodal process: it depends not only on auditory information but also on visual information. The visual and auditory information of speech arise from the same physiological mechanism; the two are highly correlated and clearly complementary. The visual information of speech can greatly improve the speech recognition rate of both humans and machines in noisy environments and enhance the naturalness of human-computer interaction, and it is currently one of the most active research areas internationally. The work in this thesis aims to provide the key technologies and foundations for building a Chinese visual speech synthesis system. The main research results are as follows: (1) Based on the characteristics of Chinese speech, we built CVSS1.0, the first relatively complete multi-speaker Chinese visual speech synthesis database in China. It contains 136 monosyllables and 262 monologue sentences; the corpus covers all manners of articulation in Chinese speech, most prosodic structures and the inter-syllable segmental relations, and thus reflects the articulation patterns of Chinese visual speech well. It also records the 3D motion of some of the facial feature points defined in MPEG-4 during articulation, which facilitates research on parameterized talking-head animation and the tracking of MPEG-4 facial feature points, and makes the database suitable for professional research on visual speech synthesis. (2) We proposed an algorithm for locating and tracking the main facial features based on gray-level projection. Horizontal and vertical gray-level projections are first computed over different regions of a pre-processed color face image; knowledge of facial structure is then used to analyze the projection curves, and, combined with template matching, the positions of the pupils, nose, mouth and chin are located quickly and accurately. The algorithm is not only fast but also highly robust to different people, head poses, facial states and illumination changes, and it copes well with glasses. In tests on more than 2000 articulation images of 12 male and 8 female speakers from the Chinese audio-visual speech recognition bimodal database CAVSR1.0, its average accuracy reached 90%, and for image sequences processed with the tracking technique the labeling accuracy reached 99%. (3) We proposed a PCA-based Active Shape Model algorithm that achieves accurate lip-shape localization. A point distribution model and gray-level profile models describe the shape and gray-level features of the lips; principal component analysis extracts the principal modes of lip-shape variation from the training set; and a segmental downhill simplex method minimizes the energy function to obtain the best lip-shape match. The method labels the lip shape accurately under different articulation states and is unaffected by lip deformation, rotation and scaling. (4) We proposed an algorithm that extracts Chinese visemes by clustering mouth-shape sequences of continuous speech, implemented it on the CVSS1.0 visual speech synthesis database, and obtained 8 principal Chinese visemes.
English Abstract: Enhancing the ability to detect, identify and perceive events by integrating information from different sensory channels is one of the fundamental characteristics of human information interaction. The perception of human language is inherently a multi-modal process and depends not only on acoustic cues but also on visual cues. The visual and audio information of speech are strongly correlated and complement each other, since they are generated through the same physiological mechanism. Providing the visual information of speech can greatly improve speech intelligibility under noisy conditions. Meanwhile, visual speech synthesis can be used to build more natural human-machine interactive systems and has become one of the most active research topics worldwide. In this thesis we aim to provide the key technologies and foundations for constructing a Chinese visual speech synthesis system. The major achievements of this research are as follows: (1) According to the characteristics of Chinese speech, we have built a Chinese visual speech synthesis database (CVSS1.0), the first relatively exhaustive multi-speaker database of its kind in China. Its utterance material includes 136 Chinese monosyllables and 262 phonetically balanced sentences, which cover all manners of articulation, most prosodic structures and the inter-syllable segmental relations of Chinese speech, so the corpus reflects the rules of Chinese visual speech well. The corpus also records the 3D motion of facial feature points defined by MPEG-4 during articulation, which supports research on parameterized facial articulation animation and on tracking MPEG-4 facial feature points, and is suitable for professional research on visual speech synthesis. (2) An algorithm based on gray-level projection is proposed to locate and track the primary facial features. First, the vertical and horizontal gray-level projections of the pre-processed image are computed; then knowledge of facial structure is used to analyze the projection curves; finally, the results of this analysis, together with template matching, locate the pupils, nose, mouth and chin. The algorithm is not only fast but also robust to different people, head poses, facial states and illumination variability, and it copes with the presence of glasses. More than 2000 images from CAVSR1.0 (video data of 12 male and 8 female speakers) were used to test the algorithm; the average accuracy is 90%, and when tracking is applied to image sequences the accuracy reaches 99%. (3) An Active Shape Model algorithm based on Principal Component Analysis (PCA) is proposed to locate the lip shape accurately. A Point Distribution Model (PDM) describes the shape features and gray-level profile vectors describe the gray-level features; the principal components of lip-shape variation are extracted from the training corpus by PCA, and a segmental downhill simplex method minimizes the energy function to match the model to the lips. The algorithm is highly accurate and is unaffected by lip deformation, rotation and scaling. (4) We propose an algorithm to extract Chinese visemes from continuous articulation image sequences. The algorithm has been implemented on CVSS1.0 and yields eight Chinese visemes.
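The gray-level projection step described in result (2) can be illustrated with a short sketch. This is a minimal illustration, not the thesis' implementation: it assumes a pre-processed grayscale face region as a NumPy array, the function names (gray_projections, candidate_feature_rows) and the moving-average smoothing are hypothetical choices, and the template-matching refinement and color pre-processing used in the thesis are omitted.

```python
import numpy as np

def gray_projections(gray):
    """Row (horizontal) and column (vertical) mean-intensity projections.

    `gray` is assumed to be a 2-D array of a roughly frontal, pre-processed
    face region (illustrative only; the thesis' exact pre-processing is not
    reproduced here).
    """
    h_proj = gray.mean(axis=1)   # one value per row
    v_proj = gray.mean(axis=0)   # one value per column
    return h_proj, v_proj

def smooth(curve, win=9):
    """Simple moving-average smoothing of a projection curve."""
    kernel = np.ones(win) / win
    return np.convolve(curve, kernel, mode="same")

def candidate_feature_rows(gray, n=4):
    """Row indices of the deepest valleys in the horizontal projection;
    on a frontal face these tend to align with the eyes, nose base,
    mouth and chin shadow."""
    h_proj = smooth(gray_projections(gray)[0])
    # local minima: darker rows (eyes, mouth) give lower mean intensity
    minima = [i for i in range(1, len(h_proj) - 1)
              if h_proj[i] < h_proj[i - 1] and h_proj[i] < h_proj[i + 1]]
    minima.sort(key=lambda i: h_proj[i])   # darkest valleys first
    return sorted(minima[:n])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fake_face = rng.integers(80, 200, size=(120, 100)).astype(float)
    fake_face[40:44, 20:80] -= 60    # synthetic dark "eye" band
    fake_face[85:90, 30:70] -= 50    # synthetic dark "mouth" band
    print(candidate_feature_rows(fake_face))
```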
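The PCA-based point distribution model of result (3) can likewise be sketched. Assumptions: the training lip contours are already aligned and flattened into an (N, 2K) array, the regularised squared-error energy is a stand-in for the thesis' energy function, and SciPy's Nelder-Mead optimizer is used in place of the segmental downhill simplex scheme; all names are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def build_pdm(shapes, n_modes=4):
    """Point Distribution Model: mean lip shape plus principal modes.

    `shapes` is an (N, 2K) array of N aligned training lip contours,
    each flattened as (x1, y1, ..., xK, yK).
    """
    mean = shapes.mean(axis=0)
    dev = shapes - mean
    _, s, vt = np.linalg.svd(dev, full_matrices=False)   # PCA via SVD
    modes = vt[:n_modes]                    # principal deformation modes
    var = (s[:n_modes] ** 2) / len(shapes)  # variance captured by each mode
    return mean, modes, var

def reconstruct(mean, modes, b):
    """Shape instance x = mean + b @ modes for shape parameters b."""
    return mean + b @ modes

def fit_shape(mean, modes, var, target, w_reg=1e-3):
    """Fit shape parameters to an observed lip contour by minimising a
    simple energy (squared point error plus a regulariser on b) with the
    Nelder-Mead downhill-simplex method."""
    def energy(b):
        err = reconstruct(mean, modes, b) - target
        return np.sum(err ** 2) + w_reg * np.sum(b ** 2 / var)
    res = minimize(energy, x0=np.zeros(len(modes)), method="Nelder-Mead")
    return res.x

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    K = 12                               # landmarks per lip contour
    t = np.linspace(0, 2 * np.pi, K, endpoint=False)
    base = np.column_stack([np.cos(t), 0.5 * np.sin(t)]).ravel()
    train = base + 0.05 * rng.standard_normal((50, 2 * K))
    mean, modes, var = build_pdm(train)
    target = base + 0.05 * rng.standard_normal(2 * K)
    print("fitted shape parameters:", np.round(fit_shape(mean, modes, var, target), 3))
```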
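Result (4) extracts visemes by clustering mouth-shape sequences. A minimal sketch, assuming per-frame lip-shape feature vectors (the exact feature set is not given in the abstract) and using scikit-learn's k-means with 8 clusters to match the eight visemes reported; the cluster centroids serve as viseme prototypes.

```python
import numpy as np
from sklearn.cluster import KMeans

def extract_visemes(lip_features, n_visemes=8, random_state=0):
    """Cluster per-frame lip-shape feature vectors and return the cluster
    centroids as candidate viseme prototypes.

    `lip_features` is an (N, D) array, e.g. one row per video frame with
    lip width, lip height and a few PDM shape parameters (an assumed
    feature set, not the thesis' definition).
    """
    km = KMeans(n_clusters=n_visemes, n_init=10, random_state=random_state)
    labels = km.fit_predict(lip_features)
    return km.cluster_centers_, labels

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    # stand-in data: 2000 frames of 6-dimensional lip-shape descriptors
    frames = rng.standard_normal((2000, 6))
    visemes, labels = extract_visemes(frames)
    print("viseme prototypes shape:", visemes.shape)   # (8, 6)
```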
Language: Chinese
Date Available: 2011-05-07
Pages: 91
Source URL: [http://159.226.59.140/handle/311008/1048]
Collection: Institute of Acoustics_IOA Doctoral and Master's Dissertations_1981-2009 Doctoral and Master's Dissertations
Recommended Citation
GB/T 7714
Zhang Xin. Research on Key Technologies of Chinese Visual Speech Synthesis[D]. Institute of Acoustics, Chinese Academy of Sciences, 2005.

Deposit Method: OAI Harvesting

Source: Institute of Acoustics

