Complementary Characteristics of Chinese Audio-Visual Bimodal Information and Structured Modeling of Facial Features (汉语听觉视觉双模态信息的互补特性和人脸特征的结构化建模)
Document type: Dissertation
Author | 周治 (Zhou Zhi)
Degree | Doctoral
Defense date | 2000
Degree-granting institution | Institute of Acoustics, Chinese Academy of Sciences
Place of conferral | Institute of Acoustics, Chinese Academy of Sciences
Keywords | Chinese audiovisual bimodal database; audiovisual perception experiment; visual feature modeling; information processing; structured face model
Chinese abstract | Vision and hearing are two important but fundamentally different human information channels. What characteristics and relationships do they exhibit in the particular information-processing task of speech interaction? What insight does the bimodal nature of human speech interaction offer into the development of human-machine interface technology? If audiovisual complementarity exists, how can the visual features that convey auditory information be modeled in human-machine interface technology so that they can be exploited effectively? Taking Chinese speech interaction as its background, this thesis investigates these questions through experiments and analysis in order to gain a deeper understanding of them. As an essential foundation for this research, we first built the first Chinese audiovisual bimodal database. Drawing on the experience of audiovisual bimodal databases for other languages, we analyzed the particular characteristics of Chinese and selected a corpus that reflects the distribution of Chinese initials and finals; the database's main specifications exceed those of comparable foreign databases. We also designed tools for its maintenance, management, and use, giving the database flexible extensibility and practicality. One focus of the thesis is audiovisual perception experiments based on this database, which yield a quantitative understanding of human audiovisual complementarity. The experiments show that people can recognize Chinese speech fairly well from visual information alone; that the manner and place of articulation of the initials, together with the finals, account for the differences in how speech information is perceived visually; and that in noisy environments visual information compensates for auditory information very markedly, substantially raising the accuracy of speech perception. Visual feature modeling has long been a key technique in audiovisual bimodal information processing, and the other focus of the thesis is the modeling and extraction of lip-shape features, which carry dense speech information. Building on the active shape model, we constructed a structured face model and, on top of it, a research platform for visual feature extraction. The structured face model has two levels, global and local; the local level in turn comprises shape modeling, image evidence modeling, and an energy minimization procedure. The model can locate and track, in image sequences, visual information such as the lip-shape changes most closely tied to syllables, which serve as important visual features. We further optimized the model for the articulation patterns of Chinese. Tests of the Chinese audiovisual bimodal database on this visual feature extraction platform gave very good results, and the study of this key technique lays a solid foundation for research on a Chinese audiovisual bimodal speech recognition system. |
English abstract | Vision and hearing are two important human information channels, but they differ fundamentally. What characteristics and relations do they exhibit in speech-based interactive information processing? What does the bimodal character of speech interaction suggest for the development of the human-machine interface? Against the background of Chinese speech interaction, we performed experiments and analyzed the results to gain a better understanding of these problems. As an essential foundation of the research, we established the first Chinese Audiovisual Bimodal Database. Its corpus conforms to the distributional probabilities of Chinese initials and finals, so conclusions drawn from it are representative of the language. Compared with similar databases for other languages, its main specifications are superior. We also designed tools to maintain, manage, and use the database, giving it convenient extensibility and practicality. One focus of the research is an experiment on human perception of Chinese audiovisual information, conducted to quantify audiovisual mutual compensation. Analysis of the experimental results yields the following conclusions for Chinese speech: humans can recognize visual-only stimuli rather well; the manner and place of articulation of the initials and the identity of the finals determine the visual distinctions; and in noisy environments visual information markedly compensates for the audio information and improves recognition performance. The modeling of visual features is a key technique of audiovisual information processing, so the other focus of the research is face modeling and the extraction of lip features, which carry the densest visual speech information. Based on the active shape model, we established a hierarchical face model and a visual feature extraction platform. The hierarchical face model comprises global and local levels, and the local level involves shape modeling, image evidence modeling, and an energy minimization procedure. The model can locate and track the visual features of the lips and was optimized according to the rules of Chinese syllables. Experimental results of lip localization on the Chinese Audiovisual Bimodal Database show that the research platform has high accuracy and good robustness, laying a solid foundation for a Chinese Audiovisual Bimodal Speech Recognition System. |
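The abstracts describe a hierarchical face model built on the active shape model, whose local level combines shape modeling, image evidence modeling, and energy minimization to track lip contours. The thesis code is not part of this record; the sketch below is only an illustrative Python/NumPy toy of the general active-shape-model idea (a PCA point-distribution model of lip landmarks fitted to observed points under plausibility constraints). The synthetic data, the helper names `synthetic_lip` and `fit_shape`, and all parameter values are assumptions for illustration, not the author's implementation.

```python
# Illustrative, hypothetical sketch of an active-shape-model-style lip model
# (not the thesis implementation): PCA point-distribution model + constrained fit.
import numpy as np

rng = np.random.default_rng(0)

def synthetic_lip(n_points=16, openness=0.3, width=1.0):
    """Toy closed lip contour: an ellipse whose vertical radius mimics mouth opening."""
    t = np.linspace(0.0, 2.0 * np.pi, n_points, endpoint=False)
    return np.column_stack([width * np.cos(t), openness * np.sin(t)])

# 1) Shape modeling: mean shape + principal modes learned from training contours.
train = np.stack([
    synthetic_lip(openness=o, width=w).ravel()
    for o in rng.uniform(0.1, 0.6, size=40)
    for w in rng.uniform(0.8, 1.2, size=5)
])
mean_shape = train.mean(axis=0)
_, S, Vt = np.linalg.svd(train - mean_shape, full_matrices=False)
n_modes = 2
P = Vt[:n_modes].T                              # (2*n_points, n_modes) shape modes
std = S[:n_modes] / np.sqrt(len(train) - 1)     # per-mode standard deviation

def fit_shape(observed):
    """2) 'observed' stands in for landmark positions suggested by image evidence;
    3) energy minimization reduces here to least squares in the PCA subspace, with
    parameters clipped to +/- 3 std so the result stays a plausible lip shape."""
    b = P.T @ (observed - mean_shape)           # unconstrained least-squares solution
    b = np.clip(b, -3.0 * std, 3.0 * std)       # shape-plausibility constraint
    return mean_shape + P @ b, b

# Noisy "evidence" for a half-open mouth, then the constrained model fit.
target = synthetic_lip(openness=0.45).ravel() + rng.normal(0.0, 0.02, mean_shape.size)
fitted, b = fit_shape(target)
print("shape parameters b:", np.round(b, 3))
print("fit residual:", round(float(np.linalg.norm(fitted - target)), 4))
```

In a full active shape model the observed landmarks would instead come from searching the image along profile normals at each model point, and fitting would alternate between that image search and the constrained shape update; according to the abstracts, the thesis additionally embeds this local lip model in a global face level and tunes it to Chinese articulation patterns.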
Language | Chinese
Date made public | 2011-05-07
Pages | 75
Source URL | [http://159.226.59.140/handle/311008/678]
Collection | Institute of Acoustics_IOA doctoral and master's theses_1981-2009 doctoral and master's theses
Recommended citation (GB/T 7714) | 周治. 汉语听觉视觉双模态信息的互补特性和人脸特征的结构化建模[D]. 中国科学院声学研究所, 2000.
Ingestion method: OAI harvesting
Source: Institute of Acoustics