Disentangled Representation Learning for Cross-modal Biometric Matching
Document Type: Journal Article
Authors | Ning, Hailong (1); Zheng, Xiangtao (2); Lu, Xiaoqiang (3); Yuan, Yuan (4)
Journal | IEEE Transactions on Multimedia
Keywords | Cross-modal biometric matching; Disentangled representation learning; Latent identity factors; Modality-dependent factors
ISSN | 1520-9210; 1941-0077
DOI | 10.1109/TMM.2021.3071243 |
Affiliation Rank | 1
Abstract | Cross-modal biometric matching (CMBM) aims to determine the corresponding voice from a face, or to identify the corresponding face from a voice. Recently, many CMBM methods have been proposed that force the distance between the two modal features to be narrowed. However, these methods ignore the alignability of the two modal features: because each feature is extracted under the supervision of identity information from a single modality, it can only reflect the identity information of that modality. To address this problem, a disentangled representation learning method is proposed to separate the alignable latent identity factors from the non-alignable modality-dependent factors for CMBM. The proposed method consists of two main steps: 1) feature extraction and 2) disentangled representation learning. First, an image feature extraction network is adopted to obtain face features, and a voice feature extraction network is applied to learn voice features. Second, a disentangled latent variable is explored to separate the latent identity factors that are shared across the modalities from the modality-dependent factors. The modality-dependent factors are filtered out, while the latent identity factors from the two modalities are drawn together to align the same identity information. The disentangled latent identity factors are then treated as pure identity information to bridge the two modalities for cross-modal verification, 1:N matching, and retrieval. Note that the proposed method learns identity information from the input face images and voice segments with only identity labels as supervision. Extensive experiments on the challenging VoxCeleb dataset demonstrate that the proposed method outperforms the state-of-the-art methods.
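The two-step pipeline described in the abstract (per-modality feature extraction, then disentangling shared identity factors from modality-dependent ones) can be sketched as below. This is a minimal PyTorch-style illustration of the idea, not the authors' published implementation: the encoder choices, layer sizes, loss form (MSE alignment plus identity classification), and all names are assumptions.

```python
# Minimal sketch of the pipeline from the abstract: per-modality feature
# extraction, then disentangling alignable identity factors from
# non-alignable modality-dependent factors. Every name, dimension, and
# loss term here is an illustrative assumption.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Disentangler(nn.Module):
    """Split one modality's feature into latent identity factors (shared
    across modalities) and modality-dependent factors (filtered out)."""

    def __init__(self, feat_dim=512, id_dim=128, mod_dim=128):
        super().__init__()
        self.identity_head = nn.Linear(feat_dim, id_dim)   # alignable part
        self.modality_head = nn.Linear(feat_dim, mod_dim)  # non-alignable part

    def forward(self, feat):
        return self.identity_head(feat), self.modality_head(feat)


class CMBMSketch(nn.Module):
    def __init__(self, face_encoder, voice_encoder, feat_dim=512,
                 id_dim=128, num_ids=1000):
        super().__init__()
        self.face_encoder = face_encoder      # e.g. a CNN over face images
        self.voice_encoder = voice_encoder    # e.g. a CNN over spectrograms
        self.face_disentangle = Disentangler(feat_dim, id_dim)
        self.voice_disentangle = Disentangler(feat_dim, id_dim)
        # Identity labels are the only supervision, so the identity
        # factors are classified over the training identities.
        self.id_classifier = nn.Linear(id_dim, num_ids)

    def forward(self, face, voice, labels):
        f_id, _f_mod = self.face_disentangle(self.face_encoder(face))
        v_id, _v_mod = self.voice_disentangle(self.voice_encoder(voice))
        # Narrow the distance between the two modalities' identity
        # factors so the same identity aligns across face and voice.
        align_loss = F.mse_loss(f_id, v_id)
        # Supervise both sets of identity factors with the shared label.
        cls_loss = (F.cross_entropy(self.id_classifier(f_id), labels)
                    + F.cross_entropy(self.id_classifier(v_id), labels))
        return align_loss + cls_loss
```

At test time, under this reading, the modality-dependent factors are discarded and distances between the face and voice identity factors support the verification, 1:N matching, and retrieval tasks mentioned in the abstract.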
Language | English
Publisher | Institute of Electrical and Electronics Engineers Inc.
Source URL | http://ir.opt.ac.cn/handle/181661/94690
Collection | Xi'an Institute of Optics and Precision Mechanics_Center for OPTical IMagery Analysis and Learning
Author Affiliations | 1. Key Laboratory of Spectral Imaging Technology CAS, Xi'an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences, Xi'an, Shaanxi, China, 710119 (e-mail: ninghailong93@gmail.com); 2. Xi'an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences, Xi'an, China, 710119 (e-mail: xiangtaoz@gmail.com); 3. OPTical IMagery Analysis and Learning, Chinese Academy of Sciences, Xi'an, China, 710119 (e-mail: luxq666666@gmail.com); 4. OPTical IMagery Analysis and Learning, Chinese Academy of Sciences, Xi'an, China (e-mail: y.yuan1.ieee@gmail.com)
Recommended Citation (GB/T 7714) | Ning, Hailong, Zheng, Xiangtao, Lu, Xiaoqiang, et al. Disentangled Representation Learning for Cross-modal Biometric Matching[J]. IEEE Transactions on Multimedia.
APA | Ning, Hailong, Zheng, Xiangtao, Lu, Xiaoqiang, & Yuan, Yuan. Disentangled Representation Learning for Cross-modal Biometric Matching. IEEE Transactions on Multimedia.
MLA | Ning, Hailong, et al. "Disentangled Representation Learning for Cross-modal Biometric Matching." IEEE Transactions on Multimedia.
Deposit Method: OAI Harvesting
Source: Xi'an Institute of Optics and Precision Mechanics