arXiv:2106.03714 Abstract | arXiv Analytics

arXiv:2106.03714 [cs.CV]

Refiner: Refining Self-attention for Vision Transformers

Daquan Zhou, Yujun Shi, Bingyi Kang, Weihao Yu, Zihang Jiang, Yuan Li, Xiaojie Jin, Qibin Hou, Jiashi Feng

Published 2021-06-07, Version 1

Vision Transformers (ViTs) have shown competitive accuracy in image classification tasks compared with CNNs. Yet, they generally require much more data for model pre-training. Most recent works are thus dedicated to designing more complex architectures or training methods to address the data-efficiency issue of ViTs. However, few of them explore improving the self-attention mechanism, a key factor distinguishing ViTs from CNNs. Different from existing works, we introduce a conceptually simple scheme, called refiner, to directly refine the self-attention maps of ViTs. Specifically, refiner explores attention expansion, which projects the multi-head attention maps to a higher-dimensional space to promote their diversity. Further, refiner applies convolutions to augment local patterns of the attention maps, which we show is equivalent to a distributed local attention: features are aggregated locally with learnable kernels and then globally aggregated with self-attention. Extensive experiments demonstrate that refiner works surprisingly well. Significantly, it enables ViTs to achieve 86% top-1 classification accuracy on ImageNet with only 81M parameters.
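
The two operations named in the abstract, attention expansion and convolution over the attention maps, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. The function names (`expand_heads`, `conv2d_same`, `refine`), the use of a plain linear mix across the head axis, the zero padding, and the one-kernel-per-expanded-head layout are all assumptions made for illustration; the abstract does not specify kernel sizes or where the refinement sits relative to softmax.

```python
def expand_heads(attn_maps, weight):
    """Linearly mix H attention maps into H' maps.

    attn_maps: list of H maps, each an N x N list of lists.
    weight:    H' x H matrix (assumed learnable); H' > H gives "expansion".
    """
    n = len(attn_maps[0])
    return [
        [[sum(w * attn_maps[h][i][j] for h, w in enumerate(row))
          for j in range(n)] for i in range(n)]
        for row in weight
    ]


def conv2d_same(attn_map, kernel):
    """Zero-padded 2D cross-correlation of one N x N map with a k x k kernel (k odd)."""
    n, k = len(attn_map), len(kernel)
    r = k // 2
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = 0.0
            for di in range(k):
                for dj in range(k):
                    y, x = i + di - r, j + dj - r
                    if 0 <= y < n and 0 <= x < n:  # skip zero padding
                        acc += kernel[di][dj] * attn_map[y][x]
            out[i][j] = acc
    return out


def refine(attn_maps, expand_weight, kernels):
    """Expand the head dimension, then convolve each expanded map with its own kernel."""
    expanded = expand_heads(attn_maps, expand_weight)
    return [conv2d_same(m, k) for m, k in zip(expanded, kernels)]


if __name__ == "__main__":
    # Two toy 2 x 2 attention maps expanded to three heads (values are made up).
    maps = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [0.5, 0.5]]]
    weight = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    smooth = [[0, 1, 0], [1, 1, 1], [0, 1, 0]]  # hypothetical local kernel
    for head in refine(maps, weight, [smooth] * 3):
        print(head)
```

In a trained model the mixing weights and kernels would be learned parameters; here they are fixed by hand only to show how each refined attention entry depends on neighbouring entries and on several original heads.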

Categories: cs.CV

Keywords: vision transformers, refining self-attention, image classification tasks, multi-head attention maps, refiner applies convolutions

Related articles:

arXiv:2106.07617 [cs.CV] (Published 2021-06-14)

Delving Deep into the Generalization of Vision Transformers under Distribution Shifts

Chongzhi Zhang et al.

arXiv:2203.11894 [cs.CV] (Published 2022-03-22)

GradViT: Gradient Inversion of Vision Transformers

Ali Hatamizadeh, Hongxu Yin, Holger Roth, Wenqi Li, Jan Kautz, Daguang Xu, Pavlo Molchanov

arXiv:1609.02781 [cs.CV] (Published 2016-09-09)

An empirical study on the effects of different types of noise in image classification tasks

Gabriel B. Paranhos da Costa, Welinton A. Contato, Tiago S. Nazare, João E. S. Batista Neto, Moacir Ponti
