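Chinese Academy of Sciences Institutional Repositories Grid System: Application of Speaker and Language Identification Techniques in Special Speech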


Chinese Academy of Sciences Institutional Repositories Grid

Application of Speaker and Language Identification Techniques in Special Speech

Document type: Thesis


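Author	Li Ming (李明)
Degree	Doctoral
Defense date	2008-05-24
Degree-granting institution	Institute of Acoustics, Chinese Academy of Sciences
Place of conferral	Institute of Acoustics
Keywords	language identification; speaker recognition; speech separation
Other title	Speaker Verification and Language Identification Techniques in Special Speech Application
Major	Signal and Information Processing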
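Abstract (translated from Chinese)	A large volume of telephone speech exists in everyday life, and both civilian and national-security applications urgently need to analyze it. Analyzing and organizing these calls by hand suffers from high cost, heavy labor, standards that are hard to unify, and limited reliability. At present the processing and understanding of speech signals still relies largely on human listening, and this imbalance has become the bottleneck in exploiting speech information. Current research concentrates on the detection and recognition of audio information, which calls for special speech techniques such as keyword spotting, speaker recognition, language identification, and fixed audio detection. This thesis first introduces the background of special speech techniques such as speaker recognition and language identification, together with the mainstream methods for acoustic-level modeling. It then proposes new algorithms and improvements, in the context of special speech applications, in three areas: language identification, speaker recognition, and single-channel overlapped speech separation. The main contributions are as follows. 1. To compensate for the effect of inter-speaker variability within a language on language identification training, and to address the excessive memory demand of training on large data sets, each language is divided into speaker groups obtained by speaker clustering, and each group takes part in training as a unit. These discriminative classifiers, built on speaker groups within each language, map the input cepstral features to discriminative language characterization score vectors (DLCSV). A back-end second-stage classifier then models each language in the score-vector space by exploiting the differences between the languages' distributions there, and posterior probability estimation is finally applied to the back-end classifier's output scores to obtain a posterior probability for each language. On the NIST 2003 language recognition 30-second test set, this gives a 30% relative reduction in equal error rate (EER). 2. A speaker verification system, PRO-GSV, is proposed, based on long-term prosodic features (fitting parameters of phone-level long-term trajectories of pitch, time-domain energy, formants, harmonic frequency-domain energy, and similar features). After front-end preprocessing of the extracted basic prosodic features, the speech is segmented by energy. Within each segment, the trajectories of the prosodic features are fitted with polynomials to extract fitting parameters, HLDA is applied for feature dimensionality reduction, the mean supervector of a Gaussian mixture model represents the statistics of each utterance's prosodic features, and a support vector machine (SVM) is used for modeling. On the NIST 2006 speaker recognition 1side-1side male test set, the system achieves an EER of 18.7%. When it is fused with the MFCC-based GSV system, the EER falls from 4.9% to 4.6%, a 6% relative reduction. 3. Within the existing framework for single-channel mixed speech separation, a sequential grouping method based on discriminative speaker models is proposed. It extends sequential grouping from non-overlapped portions alone to both non-overlapped speech and heavily overlapped speech. Using prior knowledge of the overlapped speakers, we train a speaker discrimination model and combine the traditional separation method, based on multi-pitch extraction and time-frequency continuity cues, with discrimination by this model. This improves the speaker purity of the separated speech, the accuracy of pitch extraction, and the signal-to-noise ratio (SNR) of the separated speech. On the Speech Separation Challenge database at 0 dB, multi-pitch extraction accuracy rises from 70.60% to 76.23%, and the SNR gain rises from 3.11 dB to 5.61 dB.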
English abstract	In real life, a large amount of speech is carried over telephone channels, and analyzing it is essential for both civilian and military use. However, handling telephone speech manually is time-consuming and difficult. Research on special speech applications focuses on audio information retrieval, including speech keyword spotting, speaker verification, language identification, and audio watermarking. This thesis first introduces the background of speaker verification and language identification in real-world applications, along with state-of-the-art systems and methods built at the acoustic level. It then proposes three innovations: score vector modeling for language identification, speaker verification based on long-span prosodic features, and model-based sequential grouping for cochannel speech separation. 1. To compensate for distortions caused by inter-speaker variability within a language, and to overcome the practical memory limits of training on large databases, multiple speaker-group-based discriminative classifiers map the cepstral features of speech utterances into discriminative language characterization score vectors (DLCSV). Back-end SVM classifiers then model the probability distribution of each target language in the DLCSV space. The output scores of the back-end SVM classifiers are calibrated into the final language recognition scores by a pair-wise posterior probability estimation algorithm. The proposed SVM framework is evaluated on the 2003 NIST Language Recognition Evaluation (LRE) databases and achieves an equal error rate (EER) of 4.0% on the 30-second task, outperforming the state-of-the-art SVM-SDC system by more than 30% relative error reduction. 2. A novel speaker verification system based on long-span prosodic features is proposed. The prosodic features are the contours of pitch, time-domain energy, formant, and individual harmonic energies spanning a syllable-like unit, together with the duration of that unit. We extract log pitch, log time-domain energy, and log harmonic energy values using the subharmonic summation (SHS) method and a spectral cancellation framework. Within each run of consecutive non-zero pitch values, energy contours are used to segment the speech into syllable-like units. In each syllable-like unit, after time normalization, the pitch, formant, time-domain energy, and top 10 harmonic energy contours are approximated by the 6 leading terms of a Legendre polynomial expansion. The Legendre coefficients, plus the duration of the syllable-like unit, form the feature vector for that unit. HLDA is used to reduce the feature dimension, with each individual Gaussian treated as a class. Following the GSV framework, an SVM then models each mapped GMM mean supervector. Experiments on NIST06 show that fusing the proposed system with the GSV-MFCC system reduces the EER from 4.9% to 4.6%. 3. A new cochannel speech separation algorithm is proposed that uses multi-pitch extraction and speaker-model-based sequential grouping. After auditory segmentation based on onset and offset analysis, a robust multi-pitch estimation algorithm is applied to each segment and the corresponding voiced portions are segregated. A speaker-pair model based on a support vector machine (SVM) then determines the optimal sequential grouping alignment and groups the speaker-homogeneous segments into pure speaker streams. Systematic evaluation on the Speech Separation Challenge database shows a significant improvement over the baseline.
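Language	Chinese
Release date	2011-05-07
Pages	94
Source URL	[http://159.226.59.140/handle/311008/420]
Collection	Institute of Acoustics_IOA Doctoral and Master's Theses_1981-2009 Doctoral and Master's Theses
Recommended citation (GB/T 7714)	Li Ming. Application of Speaker and Language Identification Techniques in Special Speech [D]. Institute of Acoustics. Institute of Acoustics, Chinese Academy of Sciences. 2008.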
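
The abstract describes approximating each prosodic contour in a syllable-like unit by the 6 leading terms of a Legendre polynomial expansion after time normalization, then appending the unit duration. The following is a minimal sketch of that step only, assuming NumPy's least-squares `legfit` as the fitting procedure and a mapping of frame indices onto [-1, 1] as the time normalization. The function names `legendre_contour_features` and `unit_feature_vector` are hypothetical, and the thesis's own fitting procedure and preprocessing may differ.

```python
import numpy as np
from numpy.polynomial import legendre


def legendre_contour_features(contour, n_terms=6):
    """Least-squares Legendre coefficients (lowest order first) for one contour.

    Assumption: time normalization maps the frames of the unit onto [-1, 1],
    the interval on which Legendre polynomials are orthogonal.
    """
    contour = np.asarray(contour, dtype=float)
    if contour.size < n_terms:
        raise ValueError("contour needs at least n_terms frames for a stable fit")
    t = np.linspace(-1.0, 1.0, contour.size)
    return legendre.legfit(t, contour, deg=n_terms - 1)


def unit_feature_vector(contours, duration_s, n_terms=6):
    """Concatenate the coefficients of every contour, then append the duration."""
    coeffs = [legendre_contour_features(c, n_terms) for c in contours]
    return np.concatenate(coeffs + [np.array([float(duration_s)])])


if __name__ == "__main__":
    # Synthetic stand-ins for a log-pitch and a log-energy contour (20 frames).
    t = np.linspace(-1.0, 1.0, 20)
    log_pitch = 5.0 + 0.2 * t
    log_energy = 1.0 - 0.5 * t ** 2
    features = unit_feature_vector([log_pitch, log_energy], duration_s=0.2)
    print(features.shape)
```

With K contours per unit and 6 terms each, this sketch yields a vector of 6K + 1 values per syllable-like unit; these per-unit vectors would then feed the HLDA and GMM supervector stages that the abstract mentions.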

Ingestion method: OAI harvesting

Source: Institute of Acoustics
