Chinese Academy of Sciences Institutional Repositories Grid
Speaker Identification: The SIS95 Automatic Speaker Identification System

Document type: Dissertation

Author: 王宏 (Wang Hong)
Degree: Doctoral
Defense date: 1999
Degree-granting institution: Institute of Acoustics, Chinese Academy of Sciences
Place of conferral: Institute of Acoustics, Chinese Academy of Sciences
Keywords: speaker identification; text-dependent; text-independent; long-term average spectrum; 3D power spectrogram
Chinese abstract: This thesis implements a complete SIS95 automatic speaker identification system based on an IPC and a DSP. The system runs on the PC-WINDOWS 3.2 (Chinese version) platform, is driven by function-oriented menus, and offers a friendly, simple, and convenient user interface. Its main functions are: 1) acquiring the speech signal, 2) playing back the speech signal, 3) time-domain preprocessing, 4) LPC analysis and FFT analysis, 5) extracting the pitch and formants, 6) generating 3D power spectrograms, and 7) comparison and identification.

The system consists of hardware and software. Hardware: the DSP unit performs the FFT computation and controls the A/D and D/A units for speech acquisition and playback; the PREPROC board amplifies the input signal and performs anti-aliasing filtering; the POSPROC board performs reconstruction filtering and power amplification. Software: written in Visual Basic 4.0 (Chinese Enterprise Edition) and supplemented by DLL functions (dynamic-link library functions that handle data exchange between the CPU and the DSP), it controls all functions of the system; to improve processing speed, the FFT analysis routine is written in TMS320C50 assembly language.

The software is further divided into three parts: "text-dependent", "text-independent", and "3D power spectrogram". In all three parts, the speech samples to be analyzed are processed frame by frame and are weighted by a time-domain Hamming window and by frequency-domain weighting (pre-emphasis). The basic settings and functions of the three parts are as follows.

"Text-dependent" part: the sampling frequency is fixed at 12.5 kHz; the anti-aliasing filter is built from an analog low-pass filter with a 10 kHz cutoff and a digital low-pass filter with a 5 kHz cutoff; the analysis frame length is 40 ms and the frame step is 10 ms; silence and transition segments in the speech samples are discarded manually; the pitch is extracted with an improved SIFT method; the first four formants are extracted with formant-enhanced LPC analysis; the pitch and formant contours are time-normalized and used as the speaker's features.

"Text-independent" part: the sampling frequency can be set to 25 kHz or 50 kHz, with anti-aliasing provided by analog low-pass filters with cutoff frequencies of 10 kHz and 20 kHz, respectively; the FFT size can be chosen from 128, 256, 512, or 1024 points; silence and noise segments in the speech samples are removed automatically using a frame-energy threshold; the average spectrum is obtained by long-term averaged FFT analysis; after energy normalization and frequency normalization, the long-term average spectrum is used as the speaker's feature.

"3D power spectrogram" part: the sampling frequency is fixed at 25 kHz, with anti-aliasing provided by an analog low-pass filter with a 10 kHz cutoff; the analysis frequency range is fixed at 10 kHz; the FFT size is fixed at 1024 points; a single analysis yields a 10-second 3D power spectrogram with adjustable gain; the analysis bandwidth can be set to 50 Hz, 100 Hz, or 300 Hz, where 50 Hz gives a narrow-band and 300 Hz a wide-band spectrogram. The system produces color 3D power spectrograms using pseudo-color coding and black-and-white ones using a "Monte Carlo" method. The spectrograms serve as an important auxiliary means of speaker comparison and identification: they clearly display the time-varying spectral characteristics of the speech and enhance the overall performance of the system.

Experiments were carried out with speech samples from 20 speakers, used as both reference samples and test samples. The 20 speakers (10 male, 10 female) all speak fluent Mandarin Chinese and range in age from 20 to 60. They recorded in a custom-built recording booth, with an omnidirectional microphone placed 15 cm from the speaker's lips. One sample-collection session is defined as every speaker reading the "text-dependent" material once and the "text-independent" material once; two such sessions were conducted. For identification, the 20 speakers' samples were divided into two groups by gender and evaluated separately. The results show a correct identification rate of 100% for text-dependent speaker identification and 100% for text-independent speaker identification.
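The text-independent feature path described above (frame-by-frame processing with Hamming weighting and pre-emphasis, automatic silence removal by a frame-energy threshold, and a long-term averaged FFT spectrum) can be sketched roughly as follows. This is a minimal illustration only: the function name, the 0.97 pre-emphasis coefficient, the hop size, the -40 dB threshold, and the L2 energy normalization are assumptions for illustration and are not taken from the thesis, whose actual implementation was in Visual Basic 4.0 and TMS320C50 assembly.

```python
import numpy as np

def long_term_average_spectrum(signal, fs=25000, n_fft=1024,
                               frame_len=1024, hop=256,
                               energy_threshold_db=-40.0):
    """Energy-normalized long-term average magnitude spectrum of one utterance."""
    # Frequency-domain weighting ("pre-emphasis") as a first-order difference.
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])

    window = np.hamming(frame_len)
    spectra = []
    for start in range(0, len(emphasized) - frame_len + 1, hop):
        frame = emphasized[start:start + frame_len]
        # Frame-energy gate: drop silence / noise-only frames automatically.
        energy_db = 10.0 * np.log10(np.mean(frame ** 2) + 1e-12)
        if energy_db < energy_threshold_db:
            continue
        spectra.append(np.abs(np.fft.rfft(frame * window, n_fft)))

    if not spectra:
        raise ValueError("no frames above the energy threshold")

    avg = np.mean(spectra, axis=0)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    return freqs, avg / np.linalg.norm(avg)  # energy normalization (assumed L2 norm)
```

For a 25 kHz recording `x`, `freqs, ltas = long_term_average_spectrum(x, fs=25000)` would give the averaged spectrum to compare against stored reference spectra; the frequency-normalization step mentioned in the abstract is not reproduced here.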
English abstract: In this thesis we present a versatile speaker identification (acoustic-signature identification) system based on an IPC and a DSP. The system follows the usual sequence of i) acquiring the speech signal, ii) pre-processing and feature extraction, and iii) speaker identification, and integrates hardware and software on the PC-WINDOWS 3.2 (Chinese version) platform. On the hardware side, signal pre-amplification, anti-aliasing filtering, and sampling are carried out on a single PCB, the PREPROC board, while reconstruction filtering and power amplification are performed by the POSPROC board; a TMS320C50 DSP chip controls signal sampling and playback and also implements the FFT algorithm. The software is written in Visual Basic 4.0 (Chinese Enterprise Edition) and implements all identification functions. According to the methods adopted, the system is divided into three subsystems: text-dependent speaker identification, text-independent speaker identification, and the 3D power spectrum; the 3D power spectrum of speech acts as an auxiliary method that improves the performance of the system. All speech frames are weighted by a Hamming window and pre-emphasized, and the frames are then analyzed to estimate the pitch, the formants, and the long-term average amplitude spectrum of the speaker. For text-dependent speaker identification, the sampling frequency is fixed at 12.5 kHz, the frame width is 40 ms, the frame shift is 10 ms, and silence segments are discarded manually. For text-independent speaker identification, both the sampling frequency and the FFT size are adjustable, and silence segments are detected and discarded automatically based on a frame-energy threshold. For the 3D power spectrum, both narrow-band and wide-band spectrograms can be obtained. In the text-dependent subsystem, the pitch frequency and the four predominant formant frequencies are estimated by LPC analysis (using the Durbin method), then normalized and stored in the system as the key characteristics of the speaker. In the text-independent subsystem, the long-term average amplitude spectra are normalized and stored as the key characteristics of the speaker. Experiments were carried out under identical conditions: 20 speakers (ten male, ten female) were divided into two groups by gender and asked to read the text-dependent and text-independent materials twice each. Experimental results show that the correct identification rate reaches 100%.
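The abstract states that the formant frequencies are estimated by LPC analysis using the Durbin method. A hedged sketch of that idea, using a plain Levinson-Durbin recursion and a root-based formant pick, is given below; the LPC order, the 90 Hz low-frequency cutoff, and the function names are illustrative assumptions, and the thesis's formant-enhanced LPC variant and improved SIFT pitch extractor are not reproduced.

```python
import numpy as np

def lpc_levinson_durbin(frame, order=12):
    """LPC coefficients a[0..order] (a[0] == 1) of one pre-emphasized, windowed frame."""
    # Autocorrelation lags r[0..order].
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:len(frame) + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0] + 1e-12
    for i in range(1, order + 1):
        # Reflection coefficient from the current prediction error.
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err
        a[1:i + 1] = a[1:i + 1] + k * a[i - 1::-1][:i]
        err *= (1.0 - k * k)
    return a

def formants_from_lpc(a, fs=12500, n_formants=4):
    """Lowest n_formants pole frequencies (Hz) of the LPC all-pole model."""
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]            # keep one root per conjugate pair
    freqs = np.sort(np.angle(roots) * fs / (2.0 * np.pi))
    freqs = freqs[freqs > 90.0]                  # skip near-DC poles
    return freqs[:n_formants]
```

Here `frame` is assumed to already be Hamming-windowed and pre-emphasized, matching the preprocessing described in the abstracts, e.g. `formants_from_lpc(lpc_levinson_durbin(windowed_frame), fs=12500)` for the 12.5 kHz text-dependent path.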
Language: Chinese
Date made public: 2011-05-07
Pages: 164
Source URL: [http://159.226.59.140/handle/311008/652]
Collection: Institute of Acoustics_IOA Doctoral and Master's Theses_Doctoral and Master's Theses 1981-2009
Recommended citation
GB/T 7714
王宏. 说话人识别SIS95话者自动识别系统[D]. 中国科学院声学研究所, 1999.

Deposit method: OAI harvesting

Source: Institute of Acoustics
