Image captioning: Semantic selection unit with stacked residual attention
Document type: Journal article
Authors | Song, Lifei1,2; Li, Fei3; Wang, Ying1,2 |
Journal | IMAGE AND VISION COMPUTING |
Publication date | 2024-04-01 |
Volume | 144; pages: 12 |
Keywords | Image captioning; Semantic attributes; Semantic selection unit; Transformer; Stacked residual attention |
ISSN | 0262-8856 |
DOI | 10.1016/j.imavis.2024.104965 |
Corresponding author | Wang, Ying (ying.wang@ia.ac.cn) |
Abstract | Semantic information and attention mechanisms play important roles in image captioning. Semantic information can strengthen the relationship between images and language, while attention can steer the model toward the spatially relevant regions of the image. However, in most current work, semantic attributes are learned only from image-sentence pairs, which fails to exploit additional semantic attributes and the structural information of sentences, and thus limits the variety of the generated sentences. Meanwhile, current attention models usually lack the ability to learn positional information explicitly during attention generation, and they suffer from vanishing gradients during training. This paper proposes a Semantic Selection Unit (SSU) and a Stacked Residual Attention (SRA) to remedy these drawbacks. Specifically, the SSU is designed to selectively capture semantic information from expanding attributes or guidance sentences. With the help of the expanding vocabulary and the structural information in sentences, the SSU can improve the quality of the generated sentences. The SRA is constructed to address the missing positional information and the vanishing gradient problem during attention generation. Architecturally, the SSU and SRA work together in a joint framework with end-to-end learning for image captioning. Extensive experiments have been conducted on the public MS COCO dataset, achieving a CIDEr score of 139.7 on the test set. |
WOS keywords | TRANSFORMER |
Funding projects | National Key Research and Development Program of China[2018AAA0100400] ; National Natural Science Foundation of China[62076242] |
WOS research areas | Computer Science ; Engineering ; Optics |
Language | English |
WOS accession number | WOS:001202109600001 |
Publisher | ELSEVIER |
Funding organizations | National Key Research and Development Program of China ; National Natural Science Foundation of China |
Source URL | [http://ir.ia.ac.cn/handle/173211/58150] |
Collection | Institute of Automation_National Laboratory of Pattern Recognition_Remote Sensing Image Processing Team |
Corresponding author | Wang, Ying |
Author affiliations | 1. Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing 100049, Peoples R China; 2. Chinese Acad Sci, Inst Automat, Natl Lab Pattern Recognit, Beijing 100190, Peoples R China; 3. China Tower Corp Ltd, Beijing 100029, Peoples R China; 4. Beijing Inst Tracking & Telecommun Technol, Beijing 100094, Peoples R China |
Recommended citation, GB/T 7714 | Song, Lifei,Li, Fei,Wang, Ying,et al. Image captioning: Semantic selection unit with stacked residual attention[J]. IMAGE AND VISION COMPUTING,2024,144:12. |
APA | Song, Lifei,Li, Fei,Wang, Ying,Liu, Yu,Wang, Yuanhua,&Xiang, Shiming.(2024).Image captioning: Semantic selection unit with stacked residual attention.IMAGE AND VISION COMPUTING,144,12. |
MLA | Song, Lifei,et al."Image captioning: Semantic selection unit with stacked residual attention".IMAGE AND VISION COMPUTING 144(2024):12. |
Ingestion method: OAI harvesting
Source: Institute of Automation