Chinese Academy of Sciences Institutional Repositories Grid
音视频融合的情感识别技术研究 (Research on Audio-Visual Fusion Based Emotion Recognition)

Document Type: Degree Thesis

Author: 巢林林 (Chao Linlin)
Degree: Doctor of Engineering (工学博士)
Defense Date: 2016-05
Degree-Granting Institution: 中国科学院研究生院 (Graduate School of the Chinese Academy of Sciences)
Place of Conferral: Beijing
Supervisor: 陶建华 (Tao Jianhua)
Keywords: audio-visual data fusion
Major: Pattern Recognition and Intelligent Systems
Chinese Abstract (translated): Emotion recognition is the technology of identifying a person's emotional state by analyzing and processing speech, visual, and physiological signals. As an important branch of artificial intelligence, emotion recognition has broad applications in natural human-computer interaction, disease diagnosis and monitoring, public security, and other fields. In recent years, with advances in psychology, physiology, neuroscience, and computer technology, emotion recognition based on either speech or visual signals has made remarkable progress. However, owing to the complexity of emotion recognition and the diversity of application scenarios, single-modality emotion recognition can hardly meet real-world needs. Audio-visual fusion based emotion recognition has therefore attracted wide attention from researchers at home and abroad. This thesis takes audio-visual emotion recognition as its research objective and studies several key problems in dimensional emotion recognition and basic emotion recognition. The main work falls into the following four areas:
For the temporal-modeling problem in dimensional emotion recognition, this thesis proposes a multi-scale temporal modeling method that combines feature-level and decision-level modeling. At the feature level, a Deep Belief Network with a temporal pooling layer (DBN-TP) learns feature representations over several consecutive frames of a sequence, achieving short-range temporal modeling. Compared with the Long Short-Term Memory Recurrent Neural Network (LSTM-RNN), a leading algorithm in dimensional emotion recognition, DBN-TP obtains clearly better results on the emotion-challenge dataset. At the decision level, the thesis combines temporal modeling with multimodal decision-level fusion and proposes a multimodal-temporal fusion method. By jointly fusing predictions from multiple feature groups and from different time steps within each group, this method achieves longer-range temporal modeling that complements the feature-level modeling. The proposed multi-scale method ranked second in the 2014 Audio/Visual Emotion Challenge (AVEC 2014).
For the label-noise problem and the excessively high sampling rate of label data in dimensional emotion recognition, this thesis proposes targeted improvements to an LSTM-RNN based model from the perspectives of the optimization objective and the recognition model, respectively. For label noise, a survey of loss functions commonly used in regression shows that the ε-insensitive loss, with its linear penalty on outliers, makes the model more robust to noisy labels; at the same time, its selective "ignoring" of small errors helps the model produce predictions that correlate more strongly with the label data. For the over-sampled labels, a temporal pooling layer is introduced into the LSTM-RNN based model. By shortening both the label sequence and the input sequence, this solution fixes the overly short modeling time span caused by redundant label data and also speeds up convergence. Building on these two points, the proposed algorithm ranked second in the AVEC 2015 challenge and achieved competitive results on the AVEC 2014 dataset.
For the encoding of feature sequences in basic emotion recognition, this thesis proposes LSTM-RNN based sequence encoding and investigates two variants: average encoding and last-time-step encoding. Compared with traditional pooling based and temporal-pooling based encodings, the LSTM-RNN average encoding exploits the dynamic information in the feature sequence and achieves the best results among all encodings. The thesis also compares convolutional features drawn from different depths of a Convolutional Neural Network (CNN) on the emotion recognition task; the results show that features from different depths are redundant to some extent. Based on these encodings, the thesis implements a basic emotion recognition method with audio-visual feature-level fusion.
For the difficulty of modeling the temporal coupling between audio and video and for the sequence-encoding problem in basic emotion recognition, this thesis proposes targeted solutions built on the LSTM-RNN model and the soft attention mechanism. For temporal coupling, alignment scores are computed from the correlation between audio frames and video frames under soft attention, achieving automatic audio-visual temporal alignment so that the coupling information can be used by the recognition model. For sequence encoding, inspired by how humans perceive emotional data, the thesis adds an emotion embedding vector to the model and uses soft attention to locate emotionally salient segments in the sequence, then fuses the segments weighted by their emotional saliency. Finally, both solutions are implemented in a unified framework, and their effectiveness is verified by qualitative and quantitative experiments.
English Abstract: Emotion recognition is a technology that identifies human emotions by analyzing and processing vocal, visual, and physiological signals. As an important branch of artificial intelligence, emotion recognition is widely applicable to human-computer interaction, disease diagnosis and monitoring, public security, and other fields. In recent years, with the development of psychology, physiology, neurology, and computer science, emotion recognition based on either speech or vision has made remarkable progress. However, due to the complexity of emotion recognition and the diversity of application scenarios, single-modality emotion recognition can hardly meet real-world demands, and audio-visual emotion recognition has therefore received growing attention from researchers at home and abroad. This thesis focuses on audio-visual emotion recognition and on several key problems in dimensional and category emotion recognition. The main work comprises the following four parts:
   To achieve temporal modeling in dimensional emotion recognition, this thesis proposes a multi-scale temporal modeling method that combines feature-level and decision-level modeling. First, a Deep Belief Network with temporal pooling (DBN-TP) performs feature-level modeling at short range. Compared with the Long Short-Term Memory Recurrent Neural Network (LSTM-RNN), one of the state-of-the-art algorithms for dimensional emotion recognition, DBN-TP achieves better results on the Audio/Visual Emotion Challenge (AVEC) 2014 dataset. Second, a multimodal-temporal fusion method is proposed for decision-level modeling. It combines multimodal fusion with temporal fusion, achieving temporal modeling at longer range that complements the feature-level modeling. The proposed method obtains the second-best result among the participants in AVEC 2014.
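As a rough illustration of the temporal-pooling idea (a hedged sketch only, not the thesis's DBN-TP implementation; the module name, window size, and tensor shapes are assumptions), the following PyTorch snippet average-pools frame-level features over short windows so that each output vector summarizes several consecutive frames:

```python
# Sketch of short-range temporal pooling over frame features (illustrative only).
import torch
import torch.nn as nn

class TemporalPooling(nn.Module):
    """Average-pool a feature sequence along the time axis."""
    def __init__(self, window: int = 4):
        super().__init__()
        # AvgPool1d pools over the last dimension, so time must be moved there.
        self.pool = nn.AvgPool1d(kernel_size=window, stride=window)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, feat) -> (batch, feat, time) -> pool -> back
        return self.pool(x.transpose(1, 2)).transpose(1, 2)

frames = torch.randn(2, 16, 128)           # 16 frames of 128-d features
pooled = TemporalPooling(window=4)(frames)
print(pooled.shape)                        # torch.Size([2, 4, 128])
```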
   To handle label noise and the excessively high sampling rate of the label data, this thesis puts forward solutions from the perspectives of the optimization objective and the recognition model, respectively. First, widely used regression loss functions are studied in combination with an LSTM-RNN based prediction model, and the ε-insensitive loss proves the most suitable among them. On one hand, its linear penalty on outliers makes the model robust to label noise; on the other hand, by ignoring small errors, the ε-insensitive loss encourages predictions that correlate more strongly with the label data. Second, temporal pooling is introduced into the LSTM-RNN based model. By shortening both the label sequence and the audio-visual input sequence, this solution fixes the overly short modeling time span caused by label-data redundancy and also improves the convergence speed of the model. The proposed method obtains the second-best performance in AVEC 2015 and competitive results on the AVEC 2014 dataset.
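The ε-insensitive loss itself is simple to state: errors inside a ±ε tube incur no penalty, and larger errors are penalized linearly. A minimal sketch (the value of ε here is illustrative, not the thesis's setting):

```python
import torch

def eps_insensitive_loss(pred: torch.Tensor, target: torch.Tensor,
                         eps: float = 0.05) -> torch.Tensor:
    """max(0, |pred - target| - eps), averaged over all elements."""
    return torch.clamp((pred - target).abs() - eps, min=0.0).mean()

pred   = torch.tensor([0.10, 0.50, 0.90])
target = torch.tensor([0.12, 0.50, 0.60])
print(eps_insensitive_loss(pred, target))  # only the 0.30 error is penalized
```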
   For sequence encoding in category emotion recognition, this thesis proposes LSTM-RNN based encoding and studies two variants: average encoding and last-time encoding. Compared with the traditional pooling based and temporal-pooling based encodings, the LSTM-RNN average encoding exploits the dynamic information of the feature sequence and achieves the best performance. This thesis also studies Convolutional Neural Network (CNN) features for emotion recognition, in particular features drawn from convolutional layers at different depths; the results show that features from different layers are redundant to some extent. Based on the above encodings, an audio-visual feature-level fusion method for category emotion recognition is proposed.
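The two LSTM-RNN encodings compared in this part are easy to sketch: run the sequence through an LSTM, then either average the hidden states over all time steps (average encoding) or take only the final hidden state (last-time encoding). A minimal sketch with assumed dimensions:

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=128, hidden_size=64, batch_first=True)
x = torch.randn(2, 20, 128)        # (batch, time, feature) sequence

outputs, (h_n, _) = lstm(x)        # outputs: (batch, time, hidden)

mean_code = outputs.mean(dim=1)    # average encoding: mean over all steps
last_code = h_n[-1]                # last-time encoding: final hidden state
print(mean_code.shape, last_code.shape)   # both torch.Size([2, 64])
```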
   To exploit the audio-visual temporal coupling information and to encode the sequence data in category emotion recognition, this thesis proposes solutions based on the LSTM-RNN model and the soft attention mechanism. Audio-visual temporal alignment is achieved by soft attention, using the correlation between audio and visual features to compute alignment scores; after alignment, the temporal coupling information is modeled by the LSTM-RNN. Meanwhile, inspired by how humans perceive emotional data, this thesis proposes encoding the sequence according to the emotional saliency of audio-visual segments: an emotion embedding vector is added to assign saliency scores to the segments, again via soft attention. The two solutions are implemented in a unified framework, and both qualitative and quantitative analyses confirm their effectiveness.
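A hedged sketch of the saliency-weighted encoding (the class name, dimensions, and the dot-product scoring are assumptions; the thesis only specifies that an emotion embedding vector scores segments under soft attention):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SaliencyEncoder(nn.Module):
    """Fuse segment features weighted by attention against an emotion query."""
    def __init__(self, dim: int = 64):
        super().__init__()
        # Learned "emotion embedding" used as the attention query (assumed form).
        self.emotion_query = nn.Parameter(torch.randn(dim))

    def forward(self, segments: torch.Tensor) -> torch.Tensor:
        # segments: (batch, n_segments, dim)
        scores = segments @ self.emotion_query        # (batch, n_segments)
        alpha = F.softmax(scores, dim=-1)             # saliency weights
        return (alpha.unsqueeze(-1) * segments).sum(dim=1)  # (batch, dim)

segs = torch.randn(2, 10, 64)            # 10 audio-visual segments per sample
print(SaliencyEncoder(64)(segs).shape)   # torch.Size([2, 64])
```

The same softmax-over-scores mechanism could drive the audio-visual alignment, with scores computed between audio and video frames rather than against a fixed query.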
 
Language: Chinese
Source URL: [http://ir.ia.ac.cn/handle/173211/11842]
Collection: Graduates_Doctoral Dissertations
Author Affiliation: Institute of Automation, Chinese Academy of Sciences
Recommended Citation (GB/T 7714):
巢林林. 音视频融合的情感识别技术研究[D]. 北京: 中国科学院研究生院, 2016.

Deposit Method: OAI harvesting

Source: Institute of Automation


Unless otherwise stated, all content in this system is protected by copyright, and all rights are reserved.