arXiv Analytics

arXiv:2205.15173 [cs.CV]

Self-Supervised Pre-training of Vision Transformers for Dense Prediction Tasks

Jaonary Rabarisoa, Valentin Belissen, Florian Chabot, Quoc-Cuong Pham

Published 2022-05-30 (Version 1)

We present a new self-supervised pre-training method for Vision Transformers aimed at dense prediction tasks. It is based on a contrastive loss across views that compares pixel-level representations to global image representations. This strategy produces local features better suited to dense prediction tasks than contrastive pre-training based on global image representations alone. Furthermore, our approach does not suffer when the batch size is reduced, since the number of negative examples needed in the contrastive loss is on the order of the number of local features. We demonstrate the effectiveness of our pre-training strategy on two dense prediction tasks: semantic segmentation and monocular depth estimation.
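
To make the local-to-global contrastive idea concrete, here is a minimal sketch of one plausible instantiation of such a loss in PyTorch. It is not the authors' implementation: the function name, the `pos_mask` convention (patch tokens of view 1 that overlap view 2's crop are treated as positives, the remaining patches of the same image as negatives) and the temperature value are assumptions, chosen so that the number of negatives scales with the number of local features rather than the batch size, as the abstract claims.

```python
# Illustrative sketch of a local-to-global contrastive loss (assumed design,
# not the paper's released code): the global representation of view 2 is the
# anchor, and it is contrasted against the patch-level features of view 1
# from the same image, so negatives are local features rather than other
# images in the batch.
import torch
import torch.nn.functional as F


def local_to_global_nce(patch_tokens, global_token, pos_mask, temperature=0.1):
    """
    patch_tokens: (B, N, D) pixel/patch-level features from view 1
    global_token: (B, D)    global image representation from view 2
    pos_mask:     (B, N)    True where a patch counts as a positive match
    """
    patches = F.normalize(patch_tokens, dim=-1)   # (B, N, D)
    anchors = F.normalize(global_token, dim=-1)   # (B, D)

    # Cosine similarity between each image's global anchor and its own patches.
    logits = torch.einsum("bd,bnd->bn", anchors, patches) / temperature  # (B, N)

    # Multi-positive InfoNCE: maximise the probability mass assigned to the
    # positive patches relative to all patches of the same image.
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)     # (B, N)
    pos_log_prob = (log_prob * pos_mask).sum(dim=1) / pos_mask.sum(dim=1).clamp(min=1)
    return -pos_log_prob.mean()
```

Because the softmax in this sketch runs over the N patch tokens of a single image, the pool of negatives is fixed by the number of local features, which is consistent with the abstract's claim that the method tolerates small batch sizes.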

Related articles:
arXiv:2107.04735 [cs.CV] (Published 2021-07-10)
Local-to-Global Self-Attention in Vision Transformers
arXiv:2106.15788 [cs.CV] (Published 2021-06-30)
Align Yourself: Self-supervised Pre-training for Fine-grained Recognition via Saliency Alignment
Di Wu et al.
arXiv:2203.11894 [cs.CV] (Published 2022-03-22)
GradViT: Gradient Inversion of Vision Transformers