Chinese Academy of Sciences Institutional Repositories Grid
Learning to predict salient faces: a novel visual-audio saliency model

Document type: Conference paper

Authors: Yufan Liu (2,3); Minglang Qiao (1); Mai Xu (1); Bing Li (2); Weiming Hu (2,3,5); Ali Borji (4)
Publication date: 2020-11
Conference dates: 2020.8.23-2020.8.28
Conference location: Virtual conference
Abstract

Recently, video streams have come to occupy a large proportion of Internet traffic, and most of them contain human faces. Hence, it is necessary to predict saliency on multiple-face videos, which can provide attention cues for many content-based applications. However, most multiple-face saliency prediction works consider only visual information and ignore audio, which is inconsistent with naturalistic scenarios. Several behavioral studies have established that sound influences human attention, especially during speech turn-taking in multiple-face videos. In this paper, we thoroughly investigate such influences by establishing a large-scale eye-tracking database of Multiple-face Video in Visual-Audio condition (MVVA). Inspired by the findings of our investigation, we propose a novel multi-modal video saliency model consisting of three branches: visual, audio and face. The visual branch takes the RGB frames as input and encodes them into visual feature maps. The audio and face branches encode the audio signal and multiple cropped faces, respectively. A fusion module is introduced to integrate the information from the three modalities and to generate the final saliency map. Experimental results show that the proposed method outperforms 11 state-of-the-art saliency prediction works and performs closer to human multi-modal attention.
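
The three-branch idea in the abstract can be illustrated with a toy example. The sketch below is not the paper's fusion module, which is learned and operates on encoded feature maps. It only shows, under simplifying assumptions, how three per-branch score maps of the same size could be combined into one saliency map by a fixed weighted sum. The names `normalize` and `fuse`, the equal default weights, and the list-of-lists map representation are all hypothetical choices made for illustration.

```python
from typing import List, Sequence

Map = List[List[float]]  # a 2-D grid of non-negative scores


def normalize(m: Map) -> Map:
    """Scale a map so its entries sum to 1 (uniform map if all entries are 0)."""
    h, w = len(m), len(m[0])
    total = sum(sum(row) for row in m)
    if total == 0:
        return [[1.0 / (h * w)] * w for _ in range(h)]
    return [[v / total for v in row] for row in m]


def fuse(visual: Map, audio: Map, face: Map,
         weights: Sequence[float] = (1 / 3, 1 / 3, 1 / 3)) -> Map:
    """Toy late fusion: weighted sum of three normalized maps, renormalized.

    Assumes all three maps share the same height and width.
    """
    maps = [normalize(m) for m in (visual, audio, face)]
    h, w = len(visual), len(visual[0])
    fused = [
        [sum(wt * m[i][j] for wt, m in zip(weights, maps)) for j in range(w)]
        for i in range(h)
    ]
    return normalize(fused)


# Hypothetical 2x2 inputs: the audio and face maps both favor the top-right cell.
visual = [[1.0, 1.0], [1.0, 1.0]]
audio = [[0.0, 4.0], [0.0, 0.0]]
face = [[0.0, 2.0], [0.0, 2.0]]
saliency = fuse(visual, audio, face)
```

In this toy setting the fused map sums to 1 and peaks where the audio and face maps agree, which mirrors the intuition that a speaking face attracts attention.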

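Source URL: http://ir.ia.ac.cn/handle/173211/51644
Collection: Institute of Automation_National Laboratory of Pattern Recognition_Video Content Security Team
Corresponding authors: Mai Xu; Bing Li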
Author affiliations:
1. The School of Electronic and Information Engineering and Hangzhou Innovation Institute, Beihang University
2. Institute of Automation, Chinese Academy of Sciences
3. The School of Artificial Intelligence (AI), University of Chinese Academy of Sciences
4. MarkableAI Inc.
5. CAS Center for Excellence in Brain Science and Intelligence Technology
Recommended citation
GB/T 7714
Yufan Liu, Minglang Qiao, Mai Xu, et al. Learning to predict salient faces: a novel visual-audio saliency model[C]. In: . Virtual conference. 2020.8.23-2020.8.28.

Ingest method: OAI harvesting

Source: Institute of Automation


Unless otherwise stated, all content in this system is protected by copyright, with all rights reserved.