Chinese Academy of Sciences Institutional Repositories Grid
Visual enhanced hierarchical network for sentence-based video thumbnail generation

Document Type: Journal Article

Authors: Wu, Junxian1,2; Zhang, Yujia1; Zhao, Xiaoguang1
Journal: APPLIED INTELLIGENCE
Publication Date: 2023-06-30
Pages: 17
ISSN: 0924-669X
Keywords: Video thumbnail; DVTG task; Multi-modal fusion; Visual information; Hierarchical multi-layer perceptions
DOI: 10.1007/s10489-023-04726-x
Corresponding Author: Zhang, Yujia (zhangyujia2014@ia.ac.cn)
Abstract: With the development of the Internet, video has become the most prevalent medium for communication among people. To improve the user experience on video networks, the study of video thumbnail generation has become an important area of exploration. In this paper, we focus on the sentence-specified dynamic video thumbnail generation (DVTG) task. It aims to select several video clips to compose video thumbnails that not only provide a compressed presentation of the video, but also match the user sentence. Existing methods rely on complex multi-modal fusion modules to select thumbnail clips, while ignoring the semantic similarity between video clips and the coherence of the generated thumbnail. To address these issues, we propose a novel visual-enhanced hierarchical network (VEH-Net) comprising a clip-guided visual-enhanced fusion module (CVFM) and a dynamic prototype thumbnail generator (DPTG). The CVFM is designed to introduce more visual information during feature fusion to differentiate clips with similar semantic salience. The DPTG uses hierarchical multi-layer perceptrons (MLPs) to choose thumbnail clips in succession, based on prototypes that carry information about previously selected clips. Additionally, we employ a captioning loss function to learn the internal relationships among the selected thumbnail clips and thereby enhance the consistency of the generated thumbnails. Building on VEH-Net, we also explore the unsupervised DVTG task to reduce the effect of subjective thumbnail annotations. Extensive experiments demonstrate that our approach outperforms other baseline methods, achieving state-of-the-art performance on the dataset. The unsupervised version of our model also shows competitive performance in unsupervised experiments. All the code can be found at .
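The abstract describes the DPTG as choosing thumbnail clips one after another with hierarchical MLPs, guided by a prototype that summarizes the clips already selected. Below is a minimal, illustrative sketch of that sequential-selection idea only; the class name ClipSelector, the feature dimensions, the mean-based prototype update, and the greedy argmax choice are assumptions made for illustration and are not the authors' released implementation.

import torch
import torch.nn as nn


class ClipSelector(nn.Module):
    """Scores sentence-fused clip features and picks thumbnail clips one by one."""

    def __init__(self, feat_dim: int = 256, hidden_dim: int = 128, num_select: int = 3):
        super().__init__()
        # One small MLP per selection step ("hierarchical" MLPs); each step also
        # sees a prototype vector summarizing the clips chosen so far.
        self.mlps = nn.ModuleList(
            nn.Sequential(
                nn.Linear(feat_dim * 2, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, 1),
            )
            for _ in range(num_select)
        )

    def forward(self, fused_clips: torch.Tensor) -> list:
        # fused_clips: (num_clips, feat_dim) clip features already fused with the
        # query sentence. Returns the indices of the selected thumbnail clips.
        num_clips, feat_dim = fused_clips.shape
        prototype = fused_clips.mean(dim=0)              # initial prototype (assumed)
        selected = []
        for step_mlp in self.mlps:
            # Concatenate each clip with the current prototype and score it.
            proto = prototype.unsqueeze(0).expand(num_clips, feat_dim)
            scores = step_mlp(torch.cat([fused_clips, proto], dim=-1)).squeeze(-1)
            if selected:                                 # do not re-select a clip
                scores[selected] = float("-inf")
            idx = int(scores.argmax())
            selected.append(idx)
            # Update the prototype with the newly selected clips (assumed mean rule).
            prototype = fused_clips[selected].mean(dim=0)
        return selected


if __name__ == "__main__":
    torch.manual_seed(0)
    clips = torch.randn(20, 256)                         # 20 candidate fused clips
    print(ClipSelector()(clips))                         # e.g. a list of 3 clip indices

In the paper's actual model, the fused clip features would come from the CVFM and training would involve the captioning loss mentioned in the abstract; this sketch covers only the prototype-guided, step-by-step selection.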
Funding Projects: National Natural Science Foundation of China [62103410]; National Natural Science Foundation of China [62203438]
WOS Research Area: Computer Science
Language: English
Publisher: SPRINGER
WOS Record No.: WOS:001025927700001
Funding Organization: National Natural Science Foundation of China
Source URL: [http://ir.ia.ac.cn/handle/173211/53675]
Collection: State Key Laboratory of Multimodal Artificial Intelligence Systems
Corresponding Author: Zhang, Yujia
Author Affiliations:
1. Chinese Acad Sci, Inst Automat, State Key Lab Multimodal Artificial Intelligence S, Beijing 100190, Peoples R China
2. Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing 100190, Peoples R China
Recommended Citation
GB/T 7714: Wu, Junxian, Zhang, Yujia, Zhao, Xiaoguang. Visual enhanced hierarchical network for sentence-based video thumbnail generation[J]. APPLIED INTELLIGENCE, 2023: 17.
APA: Wu, Junxian, Zhang, Yujia, & Zhao, Xiaoguang. (2023). Visual enhanced hierarchical network for sentence-based video thumbnail generation. APPLIED INTELLIGENCE, 17.
MLA: Wu, Junxian, et al. "Visual enhanced hierarchical network for sentence-based video thumbnail generation". APPLIED INTELLIGENCE (2023): 17.

Ingestion Method: OAI Harvesting

Source: Institute of Automation


Unless otherwise stated, all content in this system is protected by copyright, with all rights reserved.