Chinese Academy of Sciences Institutional Repositories Grid
Research on Speech-Driven Articulatory Motion Visualization and Difference Analysis

Document Type: Doctoral Dissertation

Author: Zhang Dawei 1,2
Degree: Doctor of Engineering
Defense Date: 2017-03
Degree Grantor: University of Chinese Academy of Sciences
Place of Conferral: Beijing
Supervisor: Tao Jianhua
Keywords: visual speech; medical image processing; articulatory contour extraction; combined deep neural network; elastic conversion model
Chinese Abstract: Research on speech-driven articulatory motion visualization and difference analysis is an important topic in visual speech synthesis and pathological speech analysis. Human speech production is closely tied to articulatory motion; however, because most articulators are hidden inside the oral cavity and the medical data available for observation are very limited, fields such as articulation mechanism studies and pathological analysis still lack an objective and effective basis for evaluation, which also makes this research highly challenging. Toward the goal of speech-driven articulatory motion visualization and difference analysis, this thesis first extracts articulatory contours from medical images and builds a mapping between the contours and acoustic speech parameters, realizing the visualization of articulatory motion; it then uses articulatory contour conversion techniques to analyze differences in articulatory motion between speakers. This work lays the groundwork for research on the human articulation mechanism, pathological speech diagnosis and rehabilitation training, and standard pronunciation teaching. The thesis covers the following topics:
  An automatic articulatory contour extraction method for X-ray pronunciation videos. X-ray imaging captures the speech-synchronized motion of multiple articulators, including the lips, teeth, tongue, jaws, and larynx, which facilitates further study of the coordination among articulators, vocal tract shape, and sound; the surviving X-ray video archives are therefore of exceptional research value. However, the imaging is blurry and noisy, and the articulator contours occlude one another severely, which makes automatic contour extraction difficult. For the tongue, the most agile articulator in the oral cavity, edge keypoints are obtained with an edge detection operator based on regional grayscale contrast and an erroneous-point exclusion rule based on point-to-point distance ratios within clusters of adjacent points, and the tongue contour is then obtained by fitting a cubic spline through the control points (a sketch of this fitting step follows). For the lips, teeth, and jaws, Otsu's method, local-region gray histograms, and spline fitting are used for contour extraction and tracking. The method conveniently and accurately recovers articulatory contours, providing a large volume of speech-synchronized articulatory motion data for visualization and articulation mechanism research.
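The spline-fitting step above can be illustrated with a short Python sketch. Everything here is an assumption for illustration: the distance-ratio threshold in `reject_outliers`, the arc-length parameterization, and the toy keypoints are stand-ins, not the thesis's actual operator or parameters.

```python
# Minimal sketch: filter edge keypoints with a point-to-point distance-ratio
# test, then fit a cubic spline through the surviving control points.
import numpy as np
from scipy.interpolate import CubicSpline

def reject_outliers(points, max_ratio=3.0):
    """Drop a point when its distance to the previous kept point is much
    larger than the median inter-point spacing (illustrative threshold)."""
    dists = np.linalg.norm(np.diff(points, axis=0), axis=1)
    med = np.median(dists)
    kept = [points[0]]
    for p in points[1:]:
        if np.linalg.norm(p - kept[-1]) <= max_ratio * med:
            kept.append(p)
    return np.array(kept)

def fit_tongue_contour(keypoints, n_samples=100):
    """Fit a cubic spline through control points, parameterized by
    cumulative arc length, and resample it densely."""
    kp = reject_outliers(np.asarray(keypoints, dtype=float))
    t = np.concatenate([[0.0], np.cumsum(np.linalg.norm(np.diff(kp, axis=0), axis=1))])
    t /= t[-1]
    cs_x, cs_y = CubicSpline(t, kp[:, 0]), CubicSpline(t, kp[:, 1])
    u = np.linspace(0.0, 1.0, n_samples)
    return np.stack([cs_x(u), cs_y(u)], axis=1)

# Toy usage: a noisy arc standing in for detected tongue edge points.
theta = np.linspace(0.2, 2.9, 15)
keypoints = np.stack([100 * np.cos(theta), 60 * np.sin(theta)], axis=1)
contour = fit_tongue_contour(keypoints)
print(contour.shape)  # (100, 2)
```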
  An accurate tongue contour extraction method for MRI and ultrasound pronunciation videos. Magnetic resonance imaging is one of today's common medical observation techniques; researchers reconstructed a speech-synchronized MRI sequence from the midsagittal slice of each speaker's upper airway in every frame. Because only the midsagittal plane is reconstructed, the tongue contour is not occluded by other articulators as it is in X-ray images; however, when the tongue touches the palate or the back of the pharynx, its contour becomes extremely blurred or even missing, which greatly complicates automatic extraction. To address these characteristics, this thesis combines multi-directional gradient operators with the tongue's motion between adjacent frames to build a tongue edge gradient matrix over non-uniform intervals, and then obtains the tongue contour by searching for the optimal sequence of edge points (a sketch of this search follows). The proposed method requires only simple manual initialization to automatically recover a fairly complete tongue contour, clearly outperforms the baseline method in accuracy and robustness, and applies equally to tongue contour extraction from ultrasound pronunciation videos, providing accurate data for the subsequent contour synthesis, alignment, and difference analysis.
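The optimal edge-point search lends itself to a dynamic-programming illustration. The following Python sketch assumes a hypothetical gradient-strength matrix sampled along non-uniform gridlines and an illustrative smoothness weight; it shows the general path-search idea, not the thesis's exact formulation.

```python
# Minimal sketch: pick one edge candidate per gridline so that total edge
# strength is maximized while jumps between neighboring gridlines are
# penalized, via dynamic programming with backtracking.
import numpy as np

def search_optimal_contour(grad, smooth=0.5):
    """grad[i, j]: edge strength of candidate j on gridline i.
    Returns the chosen candidate index on each gridline."""
    n_lines, n_cand = grad.shape
    cost = np.full((n_lines, n_cand), np.inf)
    back = np.zeros((n_lines, n_cand), dtype=int)
    cost[0] = -grad[0]  # maximizing strength == minimizing negative strength
    jumps = np.abs(np.arange(n_cand)[:, None] - np.arange(n_cand)[None, :])
    for i in range(1, n_lines):
        # total[k, j]: cost of being at k on the previous line, jumping to j
        total = cost[i - 1][:, None] + smooth * jumps
        back[i] = np.argmin(total, axis=0)
        cost[i] = total[back[i], np.arange(n_cand)] - grad[i]
    # Backtrack the cheapest path from the last gridline.
    path = [int(np.argmin(cost[-1]))]
    for i in range(n_lines - 1, 0, -1):
        path.append(int(back[i, path[-1]]))
    return np.array(path[::-1])

# Toy usage: a ridge of strong gradients drifting across 30 gridlines.
rng = np.random.default_rng(0)
grad = rng.random((30, 40))
ridge = (10 + 8 * np.sin(np.linspace(0, 3, 30))).astype(int)
grad[np.arange(30), ridge] += 5.0
print(search_optimal_contour(grad))  # should track the ridge closely
```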
  A text-independent speech-driven articulatory motion synthesis method based on the articulatory contours automatically extracted from medical pronunciation videos. Given the goals of this thesis, the synthesized data must accurately reflect how articulatory motion differs across speakers and across sounds, which places high demands on accuracy and robustness. However, because medical pronunciation videos are heterogeneous and scarce, the data usable for training are limited, typically only minutes to tens of minutes of samples, so training is prone to over- or under-fitting; moreover, the articulatory motion parameters depend on the contours automatically extracted from the medical images, and extraction errors propagate into the speech-driven synthesis. This thesis compares several audio-visual feature sets and mapping models, including Gaussian mixture models and neural networks, and proposes a combined deep neural network audio-visual mapping model that achieves better overall performance on small-to-medium training sets. The work can generate a standard speaker's tongue contours for a given utterance and supports articulatory motion difference analysis.
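As a point of reference for the model comparison above, the following PyTorch sketch shows a plain MLP baseline regressing contour parameters from acoustic frames; it is not the thesis's combined deep neural network. The feature dimensions, layer sizes, and random data are illustrative assumptions, and the small model with weight decay reflects the scarce-data setting described above.

```python
# Minimal sketch: a small MLP mapping acoustic frames (e.g. MFCC + deltas)
# to low-dimensional contour parameters (e.g. PCA coefficients).
import torch
import torch.nn as nn

acoustic_dim, contour_dim = 39, 12  # assumed feature sizes

model = nn.Sequential(
    nn.Linear(acoustic_dim, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, contour_dim),
)
# Weight decay as mild regularization for a few minutes of training data.
opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
loss_fn = nn.MSELoss()

# Toy stand-in for parallel audio/contour frames.
X = torch.randn(2000, acoustic_dim)
Y = torch.randn(2000, contour_dim)

for epoch in range(20):
    opt.zero_grad()
    loss = loss_fn(model(X), Y)
    loss.backward()
    opt.step()
print(float(loss))
```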
  An elastic tongue contour conversion model between speakers and an articulatory motion difference analysis method. Because every person has a different articulatory physiology, differences in articulatory motion during speech arise not only from factors such as pathology or accent but also from differences in physiological structure between speakers; this thesis aims to remove the physiological component as far as possible during difference analysis. The small training set is first aligned in time and space via dynamic time warping and keypoint annotation (a sketch of the alignment follows); a tongue contour conversion model with elastic constraints is then built and optimized with an alternating iterative solver. Based on a limited sample set and a small amount of manual annotation, the method performs tongue contour conversion in real time, improves markedly on common traditional methods in accuracy, and also applies to contour conversion between heterogeneous imaging modalities, laying a theoretical foundation for inter-speaker articulatory motion difference analysis and multi-source image fusion. Finally, building on the tongue motion synthesis and conversion techniques, a speech-driven tongue motion difference analysis method is proposed, providing a theoretical basis and an effective analysis tool for pathological speech research and the diagnosis of speech disorders.
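The time-alignment step can be illustrated with a standard dynamic time warping routine. The Python sketch below assumes toy contour-feature sequences and a Euclidean frame distance; the elastic conversion model and its alternating optimization are not reproduced here.

```python
# Minimal sketch: classic DTW aligning two speakers' contour-feature
# sequences of different lengths before conversion training.
import numpy as np

def dtw_path(A, B):
    """Align two sequences of feature vectors (frames x dims);
    returns the list of matched frame-index pairs."""
    n, m = len(A), len(B)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(A[i - 1] - B[j - 1])
            D[i, j] = d + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Backtrack from the end of both sequences.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

# Toy usage: two differently-timed versions of the same trajectory.
t1, t2 = np.linspace(0, 1, 40), np.linspace(0, 1, 55) ** 1.3
A = np.stack([np.sin(3 * t1), np.cos(3 * t1)], axis=1)
B = np.stack([np.sin(3 * t2), np.cos(3 * t2)], axis=1)
print(len(dtw_path(A, B)))  # number of matched frame pairs
```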
English Abstract: Speech-driven articulatory motion visualization and difference analysis is an important task in research on visual speech synthesis and pathological speech analysis. Speech production depends closely on articulatory motion. However, as most of the articulators are hidden inside the mouth and the data available for observation are limited, an effective and objective basis for evaluation is still lacking, which makes the research highly challenging. This thesis first proposes methods for extracting articulatory contours from medical images and builds a mapping between acoustic features and articulatory motion data, which enables the visualization of articulatory motion; it then proposes an elastic conversion model that can be used for articulatory motion difference analysis between different speakers. This work lays the groundwork for exploring the vocal mechanism, pathological speech diagnosis and rehabilitation training, and standard pronunciation teaching. The thesis contains the following four studies:
    An automatic extraction method for articulatory contours in X-ray pronunciation videos is proposed. X-ray videos show the speech-synchronized motion of articulators such as the lips, teeth, tongue, jaw, and throat, and the surviving videos are valuable for research on the synergistic relationship between speech and the articulators or vocal tract shape. However, articulatory contours in X-ray images are usually difficult to extract because of image noise and low resolution. To extract the tongue contour, two region gradient-based edge detectors and a cluster-based point-to-point distance-ratio filter are used to obtain boundary points, which are then fitted with a cubic spline approximation. For other articulatory parts such as the lips, teeth, and jaw, Otsu's method, local-area gray histograms, and spline approximation are used to extract and track the contours. The proposed method can automatically obtain accurate articulatory contours, providing a large amount of data for articulatory motion visualization and research on the mechanism of human pronunciation.
    An accurate tongue contour extraction method is proposed for MRI and ultrasound pronunciation videos. Subjects' upper airways were imaged in the midsagittal plane by MRI, and the image sequences were reconstructed as pronunciation videos. As the tongue contour in MRI videos is clearer than in X-ray, the entire contour can be extracted from the glottis to the glossodesmus. However, the tongue contour may be blurred or incomplete when the tongue touches neighboring articulatory boundaries such as the palate or the back of the pharynx. This thesis first builds a boundary gradient matrix over non-uniform gridlines using multi-directional gradient operators and the relationship between adjacent frames, and then obtains the tongue contour by searching for the optimal boundary route in the matrix. With only simple manual marking, the tongue contour can be extracted automatically with higher accuracy and robustness. The proposed method also performs well on ultrasound videos and provides more accurate data for tongue contour synthesis and difference analysis.
    A text-independent speech-driven articulatory motion synthesis method is proposed based on the articulatory contours extracted from medical images. The synthesized articulatory data must accurately show the differences between speakers and pronunciations. However, owing to the diversity and scarcity of medical data, the data available for training are limited, which may cause over-fitting or under-fitting. Moreover, the articulatory motion data come from automatically extracted contours, which may carry extraction errors. By comparing different kinds of audio-visual features and mapping models such as GMMs and DNNs, this thesis proposes a combined deep neural network model that achieves better overall performance on the limited training set. This work can be used for standard articulatory motion synthesis from given speech and for difference analysis.
    A tongue contour conversion method with elastic constraints is proposed. As each speaker has a different vocal physiology, articulatory motion differences stem from both physiological and pathological factors. To reduce the influence of the physiological factor, this thesis proposes an elastic conversion model for aligned tongue contours. Trained with an alternating iteration method, the model can convert tongue contours between different speakers or imaging modalities. This work provides theoretical support for articulatory motion difference analysis and for the fusion of different medical images. Finally, a speech-driven tongue motion difference analysis method is proposed based on the research above, providing a theoretical basis and an effective analysis tool for pathological speech research and the diagnosis of speech disorders.
Source URL: http://ir.ia.ac.cn/handle/173211/14627
Collection: Graduates_Doctoral Dissertations
Author Affiliations: 1. National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
2. School of Computer and Control Engineering, University of Chinese Academy of Sciences
Recommended Citation
GB/T 7714
Zhang Dawei. Research on Speech-Driven Articulatory Motion Visualization and Difference Analysis[D]. Beijing: University of Chinese Academy of Sciences, 2017.

Deposit Method: OAI harvesting

Source: Institute of Automation

