Chinese Academy of Sciences Institutional Repositories Grid (中国科学院机构知识库网格)
Speaker Diarization and Recognition of Telephone Conversations (电话信道下说话人分离及识别研究)

Document type: Dissertation

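Author: Zhang Ce (张策)
Degree type: Doctor of Engineering
Defense date: 2013-05-29
Degree-granting institution: University of Chinese Academy of Sciences
Place of conferral: Institute of Automation, Chinese Academy of Sciences
Advisor: Xu Bo (徐波)
Keywords: speaker recognition; speaker diarization; factor analysis; Gaussian mixture models; Bayesian analysis
Alternative title: Speaker Diarization and Recognition of Telephone Conversations
Degree major: Pattern Recognition and Intelligent Systems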
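Chinese abstract (translated): In the telephone-channel setting, the core problem for speaker verification and identification is the channel variability introduced by summed-channel speech, together with the mutual interference between the two parties' signals. This interference is a severe test for both speaker enrollment and testing. This thesis studies the robustness of speaker recognition on two-speaker conversational speech. The main work and contributions are as follows:
1. Within the joint factor analysis framework, several confidence (scoring) methods are studied and compared, and a symmetric scoring form is proposed on the basis of a first-order Taylor expansion. This method overcomes the asymmetry between enrollment and test utterances in conventional scoring, so that the speaker-level similarity of any two given utterances is consistent and independent of their order.
2. Building on this, the meaning of score normalization in inner-product form is analyzed in depth and extended to the kernel functions of support vector machines. An implicit normalization criterion is introduced directly into the kernel, which avoids score-normalization post-processing in the system back end.
3. Current mainstream speaker algorithms are all based on a universal background Gaussian mixture model, and extracting the sufficient statistics of that model has long been the bottleneck for system speed. A data-driven Gaussian selection method is proposed: the data are used to partition the acoustic space, and lists of Gaussians are then bound in advance using posterior probabilities, giving fast and efficient statistics extraction. Experiments show that the statistics extraction module runs about 10 times faster with almost no loss in performance.
4. For speaker diarization, the iVector technique, already mature in speaker recognition, is combined with variational Bayesian methods so that during clustering each segment belongs to a speaker with a certain probability (a soft decision), and the EM algorithm iteratively optimizes this posterior probability. On the NIST-SRE2008 summed-channel test data the diarization error rate drops from 13.8% to 6.88%, and to 5.34% after re-segmentation.
5. For enrollment involving multiple summed-channel utterances, a PLDA model is proposed for extracting the common speaker, and formal descriptions of several objective functions are given for the selection strategies over different combinations. On the 3summed-summed task of the NIST-SRE2008 evaluation, the equal error rate is reduced from the best result officially published by NIST (about 8%) to 4.05%.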
English abstract: The most challenging part of speaker recognition for telephone conversations is the intra-session variability in the summed channel. This thesis focuses on robust speaker diarization and recognition for two-speaker scenarios. The contributions are as follows:
1. We compare several confidence measures in the framework of joint factor analysis and derive a symmetric scoring method based on the first-order Taylor series approximation of the full likelihood calculation. It completely symmetrizes the problem, so that it no longer matters which utterance in a trial is used for enrollment and which for test.
2. Based on the symmetric scoring, we investigate various normalization methods and extend the implicit normalization formula to any confidence measure defined in the form of an inner product. Following the general form of symmetric normalization, we also modify the KL kernel to incorporate some forms of normalization in the kernel space.
3. Because GMMs dominate speaker-related fields and sufficient-statistics extraction is a bottleneck, especially when the number of components grows to thousands, we propose a data-driven Gaussian component selection algorithm based on a multi-layer partition of the acoustic space. It makes Baum-Welch statistics extraction about 10 times faster with almost no performance loss.
4. We apply variational Bayes in the context of the iVector representation for fuzzy clustering in speaker diarization, which proves more effective than traditional hierarchical agglomerative clustering. The diarization error rate decreases from 13.8% to 6.88%, and further to 5.34% after Viterbi re-segmentation.
5. Finally, we introduce the PLDA model into target speaker selection for enrollment with multiple summed-channel excerpts. We also propose and evaluate several objective functions that measure the purity of the selected segments. The resulting equal error rate (4.05%) is much better than that of the best system of NIST-SRE 2008 on the 3summed-summed test condition (about 8%).
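The order independence described in contribution 1 can be illustrated with a small sketch. It assumes, purely for illustration, that each utterance has already been reduced to a fixed-length vector (for example, a vector of speaker factors), and it uses a length-normalized inner product as the score. The function name `symmetric_score` and the toy vectors are hypothetical; this is not the thesis's Taylor-expansion formula, only one simple instance of an inner-product score that treats both utterances identically.

```python
import math
from typing import Sequence


def symmetric_score(x: Sequence[float], y: Sequence[float]) -> float:
    """Length-normalized inner product of two utterance-level vectors.

    Hypothetical illustration: swapping x and y gives the same score,
    so neither utterance is privileged as "enrollment" or "test".
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        raise ValueError("vectors must be non-zero")
    # Dividing by both norms acts as a simple built-in normalization.
    return dot / (norm_x * norm_y)


# Toy vectors standing in for two utterances of a trial (assumed values).
utt_a = [1.0, 2.0, 3.0]
utt_b = [0.5, -1.0, 2.0]
score_ab = symmetric_score(utt_a, utt_b)
score_ba = symmetric_score(utt_b, utt_a)
```

Here `score_ab` and `score_ba` agree, and rescaling either vector leaves the score unchanged, which loosely mirrors the idea of building normalization into the score itself rather than applying it as a back-end step.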
Language: Chinese
Other identifier: 201018014628071
Source URL: http://ir.ia.ac.cn/handle/173211/6536
Collection: Graduates_Doctoral Dissertations
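Recommended citation
GB/T 7714
Zhang Ce. Speaker Diarization and Recognition of Telephone Conversations [D]. Institute of Automation, Chinese Academy of Sciences. University of Chinese Academy of Sciences. 2013.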

Ingest method: OAI harvesting

Source: Institute of Automation

