MST: Masked Self-supervised Transformer for Visual Representation
Document type: Conference paper
Authors | Li, Zhaowen |
Publication date | 2021 |
Conference date | 2021 |
Conference location | Beijing (virtual conference) |
Abstract | Transformer has been widely used for self-supervised pre-training in Natural Language Processing (NLP) and achieved great success. However, it has not been fully explored in visual self-supervised learning. Meanwhile, previous methods only consider the high-level feature and learning representation from a global perspective, which may fail to transfer to the downstream dense prediction tasks focusing on local features. In this paper, we present a novel Masked Self-supervised Transformer approach named MST, which can explicitly capture the local context of an image while preserving the global semantic information. Specifically, inspired by the Masked Language Modeling (MLM) in NLP, we propose a masked token strategy based on the multi-head self-attention map, which dynamically masks some tokens of local patches without damaging the crucial structure for self-supervised learning. More importantly, the masked tokens together with the remaining tokens are further recovered by a global image decoder, which preserves the spatial information of the image and is more friendly to the downstream dense prediction tasks. The experiments on multiple datasets demonstrate the effectiveness and generality of the proposed method. For instance, MST achieves a Top-1 accuracy of 76.9% with DeiT-S under linear evaluation using only 300-epoch pre-training, which outperforms supervised training with the same number of epochs by 0.4% and its comparable variant DINO by 1.0%. For dense prediction tasks, MST also achieves 42.7% mAP on MS COCO object detection and 74.04% mIoU on Cityscapes segmentation with only 100-epoch pre-training. |
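The masked token strategy described in the abstract can be illustrated with a minimal sketch: average the [CLS]-to-patch attention over heads to score each patch token, then sample the masked tokens only from the low-attention half, so high-attention (crucial) structure is never masked. All names, the 50% low-attention pool, and the layout of the attention tensor are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def attention_guided_mask(attn, mask_ratio=0.3, rng=None):
    """Sketch of attention-guided token masking (illustrative, not MST's exact code).

    attn: array of shape (heads, tokens, tokens) from a ViT block, where
          token 0 is assumed to be the [CLS] token and tokens 1.. are patches.
    Returns a boolean mask over patch tokens (True = token gets masked).
    """
    rng = np.random.default_rng() if rng is None else rng
    # One importance score per patch: [CLS]-to-patch attention, averaged over heads.
    scores = attn[:, 0, 1:].mean(axis=0)
    n = scores.shape[0]
    n_mask = int(n * mask_ratio)
    # Only the low-attention half is eligible, so crucial regions survive.
    low = np.argsort(scores)[: n // 2]
    chosen = rng.choice(low, size=min(n_mask, low.size), replace=False)
    mask = np.zeros(n, dtype=bool)
    mask[chosen] = True
    return mask
```

Masked tokens would then be replaced by a learnable embedding and, together with the visible tokens, fed to the global image decoder that reconstructs the full image.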
Source URL | http://ir.ia.ac.cn/handle/173211/56719
Collection | Zidong Taichu Large Model Research Center_Large Model Computing
Author affiliations | 1. Institute of Automation, Chinese Academy of Sciences; 2. University of Chinese Academy of Sciences
Recommended citation (GB/T 7714) | Li, Zhaowen, Chen, Zhiyang, Yang, Fan, et al. MST: Masked self-supervised transformer for visual representation[C]. In: . Beijing (virtual conference). 2021.
Deposit method: OAI harvesting
Source: Institute of Automation
Unless otherwise stated, all content in this system is protected by copyright, with all rights reserved.