Chinese Academy of Sciences Institutional Repositories Grid
Vision Transformers with Hierarchical Attention

Document Type: Journal Article

Authors: Yun Liu1; Yu-Huan Wu2; Guolei Sun3; Le Zhang4; Ajad Chhatkuli3; Luc Van Gool3
Journal: Machine Intelligence Research
Publication Date: 2024
Volume: 21; Issue: 4; Pages: 670-683
Keywords: vision transformer; hierarchical attention; global attention; local attention; scene understanding
ISSN: 2731-538X
DOI: 10.1007/s11633-024-1393-8
Abstract: This paper tackles the high computational/space complexity associated with multi-head self-attention (MHSA) in vanilla vision transformers. To this end, we propose hierarchical MHSA (H-MHSA), a novel approach that computes self-attention in a hierarchical fashion. Specifically, we first divide the input image into patches as commonly done, and each patch is viewed as a token. The proposed H-MHSA then learns token relationships within local patches, serving as local relationship modeling. Next, the small patches are merged into larger ones, and H-MHSA models the global dependencies for the small number of merged tokens. Finally, the local and global attentive features are aggregated to obtain features with powerful representation capacity. Since we only calculate attention for a limited number of tokens at each step, the computational load is reduced dramatically. Hence, H-MHSA can efficiently model global relationships among tokens without sacrificing fine-grained information. With the H-MHSA module incorporated, we build a family of hierarchical-attention-based transformer networks, namely HAT-Net. To demonstrate the superiority of HAT-Net in scene understanding, we conduct extensive experiments on fundamental vision tasks, including image classification, semantic segmentation, object detection and instance segmentation. Therefore, HAT-Net provides a new perspective for vision transformers. Code and pretrained models are available at https://github.com/yun-liu/HAT-Net.
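The hierarchical scheme described in the abstract (local attention within small patches, pooling into merged tokens, global attention over the merged tokens, then aggregation) can be sketched as follows. This is a minimal single-head NumPy illustration under assumed settings, not the authors' HAT-Net implementation; the function name `h_mhsa` and the `grid`/`window` parameters are illustrative only.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # standard scaled dot-product attention
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def h_mhsa(x, grid=8, window=2):
    """Hierarchical attention sketch (single head, no projections).

    x: (N, C) tokens laid out row-major on a grid x grid patch map.
    window: side length of the local patch group.
    """
    N, C = x.shape
    g = grid // window  # side length of the merged-token grid
    # Step 1: local attention within each window x window patch group.
    t = x.reshape(g, window, g, window, C).transpose(0, 2, 1, 3, 4)
    t = t.reshape(g * g, window * window, C)   # (num_windows, tokens_per_window, C)
    local = attention(t, t, t)
    # Step 2: merge small patches into larger ones (mean pooling here).
    merged = local.mean(axis=1)                # (g*g, C)
    # Step 3: global attention among the few merged tokens.
    glob = attention(merged, merged, merged)   # (g*g, C)
    # Step 4: aggregate local and global attentive features.
    out = local + glob[:, None, :]             # broadcast global back to each window
    # Restore the original (N, C) row-major token order.
    out = out.reshape(g, g, window, window, C).transpose(0, 2, 1, 3, 4).reshape(N, C)
    return out
```

Because attention is only ever computed over `window*window` tokens per group and over `(grid/window)**2` merged tokens, both steps avoid the quadratic cost in the full token count N that a vanilla MHSA layer would pay.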
Source URL: http://ir.ia.ac.cn/handle/173211/58566
Collection: Institute of Automation, Academic Journals, International Journal of Automation and Computing
Author Affiliations:
1. Institute for Infocomm Research (I2R), A*STAR, Singapore 138632, Singapore
2.Institute of High Performance Computing (IHPC), A*STAR, Singapore 138632, Singapore
3.Computer Vision Lab, ETH Zürich, Zürich 8092, Switzerland
4.School of Information and Communication Engineering, University of Electronic Science and Technology of China (UESTC), Chengdu 611731, China
Recommended Citation:
GB/T 7714: Yun Liu, Yu-Huan Wu, Guolei Sun, et al. Vision Transformers with Hierarchical Attention[J]. Machine Intelligence Research, 2024, 21(4): 670-683.
APA: Yun Liu, Yu-Huan Wu, Guolei Sun, Le Zhang, Ajad Chhatkuli, & Luc Van Gool. (2024). Vision Transformers with Hierarchical Attention. Machine Intelligence Research, 21(4), 670-683.
MLA: Yun Liu, et al. "Vision Transformers with Hierarchical Attention." Machine Intelligence Research 21.4 (2024): 670-683.

Deposit Method: OAI harvesting

Source: Institute of Automation


Unless otherwise noted, all content in this system is protected by copyright, with all rights reserved.