arXiv:2204.08721 Abstract | arXiv Analytics

arXiv:2204.08721 [cs.CV]

Multimodal Token Fusion for Vision Transformers

Yikai Wang, Xinghao Chen, Lele Cao, Wenbing Huang, Fuchun Sun, Yunhe Wang

Published 2022-04-19, Version 1

Many adaptations of transformers have emerged to address single-modal vision tasks, where self-attention modules are stacked to handle input sources like images. Intuitively, feeding multiple modalities of data to vision transformers could improve performance, yet the inner-modal attentive weights may also be diluted, which could undermine the final performance. In this paper, we propose a multimodal token fusion method (TokenFusion), tailored for transformer-based vision tasks. To effectively fuse multiple modalities, TokenFusion dynamically detects uninformative tokens and substitutes these tokens with projected and aggregated inter-modal features. Residual positional alignment is also adopted to enable explicit utilization of the inter-modal alignments after fusion. The design of TokenFusion allows the transformer to learn correlations among multimodal features, while the single-modal transformer architecture remains largely intact. Extensive experiments are conducted on a variety of homogeneous and heterogeneous modalities and demonstrate that TokenFusion surpasses state-of-the-art methods in three typical vision tasks: multimodal image-to-image translation, RGB-depth semantic segmentation, and 3D object detection with point cloud and images.
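
The substitution step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not say how token importance is scored, what the projection looks like, or which threshold is used, so the function name `fuse_tokens`, the per-token `scores_a`, the `project_b_to_a` callable, and the `threshold` default are all hypothetical. The sketch assumes two modalities whose tokens are already aligned one-to-one and omits the aggregation over several modalities and the residual positional alignment.

```python
from typing import Callable, List, Sequence

Token = List[float]


def fuse_tokens(
    tokens_a: Sequence[Token],
    tokens_b: Sequence[Token],
    scores_a: Sequence[float],
    project_b_to_a: Callable[[Token], Token],
    threshold: float = 0.02,  # hypothetical cutoff, not taken from the source
) -> List[Token]:
    """Replace low-scoring tokens of modality A with projected tokens of modality B.

    Assumes tokens_a[i] and tokens_b[i] describe the same spatial position.
    """
    if not (len(tokens_a) == len(tokens_b) == len(scores_a)):
        raise ValueError("token lists and scores must be aligned one-to-one")

    fused: List[Token] = []
    for tok_a, tok_b, score in zip(tokens_a, tokens_b, scores_a):
        if score < threshold:
            # Token judged uninformative: substitute the inter-modal feature.
            fused.append(project_b_to_a(tok_b))
        else:
            # Token kept as-is (copied so the input is not aliased).
            fused.append(list(tok_a))
    return fused


# Toy usage: three aligned RGB and depth tokens with made-up scores.
rgb = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
depth = [[10.0, 20.0], [30.0, 40.0], [50.0, 60.0]]
rgb_scores = [0.9, 0.001, 0.5]


def halve(token: Token) -> Token:
    """Stand-in for a learned projection between modalities."""
    return [0.5 * x for x in token]


fused = fuse_tokens(rgb, depth, rgb_scores, halve)
print(fused)
```

Only the second RGB token falls below the assumed threshold, so it alone is replaced by the projected depth token; the other tokens pass through unchanged, which mirrors the abstract's point that the single-modal architecture stays largely intact.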

Comments: CVPR 2022

Categories: cs.CV

Keywords: vision transformers, vision tasks, dynamically detects uninformative tokens, transformer architecture remains largely intact, tokenfusion surpasses state-of-the-art methods

Related articles:

arXiv:2201.02767 [cs.CV] (Published 2022-01-08, updated 2022-03-23)

QuadTree Attention for Vision Transformers

Shitao Tang, Jiahui Zhang, Siyu Zhu, Ping Tan

arXiv:2311.17983 [cs.CV] (Published 2023-11-29)

Improving Faithfulness for Vision Transformers

Lijie Hu, Yixin Liu, Ninghao Liu, Mengdi Huai, Lichao Sun, Di Wang

arXiv:2106.03714 [cs.CV] (Published 2021-06-07)

Refiner: Refining Self-attention for Vision Transformers

Daquan Zhou et al.
