arXiv Analytics


arXiv:2106.03714 [cs.CV]

Refiner: Refining Self-attention for Vision Transformers

Daquan Zhou, Yujun Shi, Bingyi Kang, Weihao Yu, Zihang Jiang, Yuan Li, Xiaojie Jin, Qibin Hou, Jiashi Feng

Published 2021-06-07 (Version 1)

Vision Transformers (ViTs) have shown competitive accuracy in image classification tasks compared with CNNs. Yet, they generally require much more data for model pre-training. Most recent works are thus dedicated to designing more complex architectures or training methods to address the data-efficiency issue of ViTs. However, few of them explore improving the self-attention mechanism, a key factor distinguishing ViTs from CNNs. Different from existing works, we introduce a conceptually simple scheme, called refiner, to directly refine the self-attention maps of ViTs. Specifically, refiner explores attention expansion that projects the multi-head attention maps to a higher-dimensional space to promote their diversity. Further, refiner applies convolutions to augment local patterns of the attention maps, which we show is equivalent to a distributed local attention: features are aggregated locally with learnable kernels and then globally aggregated with self-attention. Extensive experiments demonstrate that refiner works surprisingly well. Significantly, it enables ViTs to achieve 86% top-1 classification accuracy on ImageNet with only 81M parameters.
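The abstract describes two operations applied directly to the multi-head attention maps: an expansion that projects them to more maps to promote diversity, and a convolution that augments their local patterns before they weight the values. The PyTorch sketch below illustrates that idea only; the module layout, expansion ratio, kernel size, and the placement of the refinement relative to the softmax are assumptions for illustration and are not taken from the paper's implementation.

```python
import torch
import torch.nn as nn

class RefinedAttention(nn.Module):
    """Minimal sketch of the refiner idea: expand the H attention maps to
    E > H maps with a 1x1 convolution, augment local patterns with a
    depth-wise convolution, reduce back to H maps, then aggregate values.
    Hyperparameters here (expansion=3, kernel_size=3) are assumptions."""

    def __init__(self, dim, num_heads=8, expansion=3, kernel_size=3):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        expanded = num_heads * expansion
        # Attention expansion: mix H attention maps into E higher-dimensional maps.
        self.expand = nn.Conv2d(num_heads, expanded, kernel_size=1)
        # Local augmentation: depth-wise conv over each expanded attention map.
        self.dwconv = nn.Conv2d(expanded, expanded, kernel_size,
                                padding=kernel_size // 2, groups=expanded)
        # Reduce back to H maps so they can weight the H groups of values.
        self.reduce = nn.Conv2d(expanded, num_heads, kernel_size=1)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)           # each: (B, H, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale  # (B, H, N, N)
        attn = attn.softmax(dim=-1)
        # Refine the attention maps themselves: expand, convolve locally, reduce.
        attn = self.reduce(self.dwconv(self.expand(attn)))
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```

This sketch treats the (B, H, N, N) attention tensor as an image with H channels, which is what lets ordinary 2D convolutions expand and locally smooth the maps; the depth-wise convolution is what the abstract interprets as aggregating features locally with learnable kernels before the global aggregation by self-attention.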

Related articles:
arXiv:2106.07617 [cs.CV] (Published 2021-06-14)
Delving Deep into the Generalization of Vision Transformers under Distribution Shifts
arXiv:2203.11894 [cs.CV] (Published 2022-03-22)
GradViT: Gradient Inversion of Vision Transformers
arXiv:1609.02781 [cs.CV] (Published 2016-09-09)
An empirical study on the effects of different types of noise in image classification tasks