Chinese Academy of Sciences Institutional Repositories Grid: Image captioning: Semantic selection unit with stacked residual attention

Image captioning: Semantic selection unit with stacked residual attention

Document type: Journal article


Authors	Song, Lifei 1,2; Li, Fei 3; Wang, Ying 1,2; Liu, Yu 4; Wang, Yuanhua 4; Xiang, Shiming 1,2
Journal	IMAGE AND VISION COMPUTING
Publication date	2024-04-01
Volume	144; Pages: 12
Keywords	Image captioning; Semantic attributes; Semantic selection unit; Transformer; Stacked residual attention
ISSN	0262-8856
DOI	10.1016/j.imavis.2024.104965
Corresponding author	Wang, Ying (ying.wang@ia.ac.cn)
Abstract	Semantic information and attention mechanisms play important roles in image captioning. Semantic information strengthens the relationship between images and language, while attention steers the model toward the relevant spatial regions of the image. However, in most current work, semantic attributes are learned only from image-sentence pairs, which fails to exploit additional semantic attributes and the structural information of sentences, and thus limits the variety of the generated sentences. Meanwhile, current attention models usually lack the ability to learn positional information explicitly during attention generation, and they suffer from vanishing gradients during training. This paper proposes a Semantic Selection Unit (SSU) and a Stacked Residual Attention (SRA) to remedy these drawbacks. Specifically, the SSU is designed to selectively capture semantic information from expanded attributes or guidance sentences. With the help of the expanded vocabulary and the structural information in sentences, the SSU can improve the quality of the generated sentences. The SRA is constructed to address the missing positional information and the vanishing gradient problem during attention generation. Architecturally, the SSU and SRA work together in a joint framework with end-to-end learning for image captioning. Extensive experiments on the public MS COCO dataset achieve a CIDEr score of 139.7 on the test set.
WOS keywords	TRANSFORMER
Funding projects	National Key Research and Development Program of China [2018AAA0100400]; National Natural Science Foundation of China [62076242]
WOS research areas	Computer Science; Engineering; Optics
Language	English
WOS accession number	WOS:001202109600001
Publisher	ELSEVIER
Funding organizations	National Key Research and Development Program of China; National Natural Science Foundation of China
Source URL	[http://ir.ia.ac.cn/handle/173211/58150]
Collection	Institute of Automation_National Laboratory of Pattern Recognition_Remote Sensing Image Processing Team
Corresponding author	Wang, Ying
Author affiliations	1. Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing 100049, Peoples R China; 2. Chinese Acad Sci, Inst Automat, Natl Lab Pattern Recognit, Beijing 100190, Peoples R China; 3. China Tower Corp Ltd, Beijing 100029, Peoples R China; 4. Beijing Inst Tracking & Telecommun Technol, Beijing 100094, Peoples R China
Recommended citation (GB/T 7714)	Song, Lifei, Li, Fei, Wang, Ying, et al. Image captioning: Semantic selection unit with stacked residual attention[J]. IMAGE AND VISION COMPUTING, 2024, 144: 12.
APA	Song, Lifei, Li, Fei, Wang, Ying, Liu, Yu, Wang, Yuanhua, & Xiang, Shiming. (2024). Image captioning: Semantic selection unit with stacked residual attention. IMAGE AND VISION COMPUTING, 144, 12.
MLA	Song, Lifei, et al. "Image captioning: Semantic selection unit with stacked residual attention". IMAGE AND VISION COMPUTING 144 (2024): 12.

Ingestion method: OAI harvesting

Source: Institute of Automation
