Chinese Academy of Sciences Institutional Repositories Grid
Cross-scale Vision Transformer for crowd localization

Document type: Journal article

Authors: Liu, Shuang [3]; Lian, Yu [3]; Zhang, Zhong [3]; Xiao, Baihua [1]; Durrani, Tariq S. [2]
Journal: JOURNAL OF KING SAUD UNIVERSITY-COMPUTER AND INFORMATION SCIENCES
Publication date: 2024-02-01
Volume: 36; Issue: 2; Pages: 9
Keywords: Crowd localization; Multi-scale information fusion; Long-range context dependencies; Adaptive windows
ISSN: 1319-1578
DOI: 10.1016/j.jksuci.2024.101972
Corresponding author: Zhang, Zhong (zhong.zhang8848@gmail.com)
Abstract: Crowd localization provides both the positions of individuals and the total number of people, which is of great value for security monitoring and public management; however, it is challenged by lighting variation, occlusion, and perspective effects. Recently, Transformers have been applied to crowd localization to overcome these challenges. Yet such methods integrate multi-scale information only once, which results in incomplete multi-scale information fusion. In this paper, we propose a novel Transformer network named Cross-scale Vision Transformer (CsViT) for crowd localization, which fuses multi-scale information in both the encoder and decoder stages while building long-range context dependencies on the combined feature maps. To this end, we design a multi-scale encoder that fuses the feature maps of multiple scales at corresponding positions to obtain the combined feature maps, and a multi-scale decoder that integrates the tokens at multiple scales when modeling the long-range context dependencies. Furthermore, we propose a Multi-scale SSIM (MsSSIM) loss to adaptively compute head regions and optimize the similarity at multiple scales. Specifically, we set adaptive windows with different scales for each head and compute the loss values within these windows so as to enhance the accuracy of the predicted distance transform map. We perform comprehensive experiments on five public datasets, and the results validate the effectiveness of our method.
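WOS research area: Computer Science
Language: English
WOS accession number: WOS:001188598600001
Publisher: ELSEVIER
Source URL: http://ir.ia.ac.cn/handle/173211/56975
Collection: Institute of Automation_State Key Laboratory of Management and Control for Complex Systems_Image Analysis and Machine Vision Team
Corresponding author: Zhang, Zhong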
Author affiliations:
1. Chinese Acad Sci, Inst Automat, State Key Lab Management & Control Complex Syst, Beijing 100190, Peoples R China
2. Univ Strathclyde, Dept Elect & Elect Engn, Glasgow, Scotland
3. Tianjin Normal Univ, Tianjin Key Lab Wireless Mobile Commun & Power Tra, Tianjin 300387, Peoples R China
Recommended citation
GB/T 7714
Liu, Shuang, Lian, Yu, Zhang, Zhong, et al. Cross-scale Vision Transformer for crowd localization[J]. JOURNAL OF KING SAUD UNIVERSITY-COMPUTER AND INFORMATION SCIENCES, 2024, 36(2): 9.
APA Liu, Shuang, Lian, Yu, Zhang, Zhong, Xiao, Baihua, & Durrani, Tariq S. (2024). Cross-scale Vision Transformer for crowd localization. JOURNAL OF KING SAUD UNIVERSITY-COMPUTER AND INFORMATION SCIENCES, 36(2), 9.
MLA Liu, Shuang, et al. "Cross-scale Vision Transformer for crowd localization". JOURNAL OF KING SAUD UNIVERSITY-COMPUTER AND INFORMATION SCIENCES 36.2 (2024): 9.

Ingest method: OAI harvesting

Source: Institute of Automation

