Chinese Academy of Sciences Institutional Repositories Grid
Visual enhanced hierarchical network for sentence-based video thumbnail generation

Document Type: Journal Article

Authors: Wu, Junxian1,2; Zhang, Yujia1; Zhao, Xiaoguang1
Journal: APPLIED INTELLIGENCE
Publication Date: 2023-06-30
Pages: 17
ISSN: 0924-669X
Keywords: Video thumbnail; DVTG task; Multi-modal fusion; Visual information; Hierarchical multi-layer perceptions
DOI: 10.1007/s10489-023-04726-x
Corresponding Author: Zhang, Yujia (zhangyujia2014@ia.ac.cn)
Abstract: With the development of the Internet, video has become the most prevalent medium for communication among people. To improve the user experience on video networks, the study of video thumbnail generation has become an important area of exploration. In this paper, we focus on the sentence-specified dynamic video thumbnail generation (DVTG) task. It aims to select several video clips to compose video thumbnails that not only provide a compressed presentation of the video, but also match the user sentence. Existing methods rely on complex multi-modal fusion modules to select thumbnail clips, while ignoring the semantic similarity between video clips and the coherence of the generated thumbnail. To address these issues, we propose a novel visual-enhanced hierarchical network (VEH-Net) comprising a clip-guided visual-enhanced fusion module (CVFM) and a dynamic prototype thumbnail generator (DPTG). The CVFM is designed to introduce more visual information during feature fusion to differentiate clips with similar semantic salience. The DPTG uses hierarchical multi-layer perceptrons (MLPs) to choose thumbnail clips in succession, based on prototypes that carry information about previously selected clips. Additionally, we employ a captioning loss function to learn the internal relationships among the selected thumbnail clips and thereby enhance the consistency of the generated thumbnails. Building on VEH-Net, we also explore the unsupervised DVTG task to reduce the effect of subjective thumbnail annotations. Extensive experiments demonstrate that our approach outperforms other baseline methods, achieving state-of-the-art performance on the dataset. The unsupervised version of our model also shows competitive performance in unsupervised experiments. All the code can be found at .
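The abstract describes the DPTG as choosing thumbnail clips one after another with hierarchical MLPs, guided by a prototype that summarizes the clips already selected. Below is a minimal, illustrative sketch of that sequential-selection idea only; the class name ClipSelector, the feature dimensions, the mean-based prototype update, and the greedy argmax choice are assumptions made for illustration and are not the authors' released implementation.

import torch
import torch.nn as nn


class ClipSelector(nn.Module):
    """Scores sentence-fused clip features and picks thumbnail clips one by one."""

    def __init__(self, feat_dim: int = 256, hidden_dim: int = 128, num_select: int = 3):
        super().__init__()
        # One small MLP per selection step ("hierarchical" MLPs); each step also
        # sees a prototype vector summarizing the clips chosen so far.
        self.mlps = nn.ModuleList(
            nn.Sequential(
                nn.Linear(feat_dim * 2, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, 1),
            )
            for _ in range(num_select)
        )

    def forward(self, fused_clips: torch.Tensor) -> list:
        # fused_clips: (num_clips, feat_dim) clip features already fused with the
        # query sentence. Returns the indices of the selected thumbnail clips.
        num_clips, feat_dim = fused_clips.shape
        prototype = fused_clips.mean(dim=0)              # initial prototype (assumed)
        selected = []
        for step_mlp in self.mlps:
            # Concatenate each clip with the current prototype and score it.
            proto = prototype.unsqueeze(0).expand(num_clips, feat_dim)
            scores = step_mlp(torch.cat([fused_clips, proto], dim=-1)).squeeze(-1)
            if selected:                                 # do not re-select a clip
                scores[selected] = float("-inf")
            idx = int(scores.argmax())
            selected.append(idx)
            # Update the prototype with the newly selected clips (assumed mean rule).
            prototype = fused_clips[selected].mean(dim=0)
        return selected


if __name__ == "__main__":
    torch.manual_seed(0)
    clips = torch.randn(20, 256)                         # 20 candidate fused clips
    print(ClipSelector()(clips))                         # e.g. a list of 3 clip indices

In the paper's actual model, the fused clip features would come from the CVFM and training would involve the captioning loss mentioned in the abstract; this sketch covers only the prototype-guided, step-by-step selection.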
Funding Projects: National Natural Science Foundation of China [62103410]; National Natural Science Foundation of China [62203438]
WOS Research Area: Computer Science
Language: English
Publisher: SPRINGER
WOS Record No.: WOS:001025927700001
Funding Organization: National Natural Science Foundation of China
Source URL: [http://ir.ia.ac.cn/handle/173211/53675]
Collection: State Key Laboratory of Multimodal Artificial Intelligence Systems
Corresponding Author: Zhang, Yujia
Author Affiliations:
1. Chinese Acad Sci, Inst Automat, State Key Lab Multimodal Artificial Intelligence S, Beijing 100190, Peoples R China
2. Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing 100190, Peoples R China
Recommended Citation
GB/T 7714: Wu, Junxian, Zhang, Yujia, Zhao, Xiaoguang. Visual enhanced hierarchical network for sentence-based video thumbnail generation[J]. APPLIED INTELLIGENCE, 2023: 17.
APA: Wu, Junxian, Zhang, Yujia, & Zhao, Xiaoguang. (2023). Visual enhanced hierarchical network for sentence-based video thumbnail generation. APPLIED INTELLIGENCE, 17.
MLA: Wu, Junxian, et al. "Visual enhanced hierarchical network for sentence-based video thumbnail generation". APPLIED INTELLIGENCE (2023): 17.

Ingestion Method: OAI Harvesting

Source: Institute of Automation


Unless otherwise stated, all content in this system is protected by copyright, with all rights reserved.