arXiv Analytics

arXiv:2505.03703 [cs.CV]

Fill the Gap: Quantifying and Reducing the Modality Gap in Image-Text Representation Learning

François Role, Sébastien Meyer, Victor Amblard

Published 2025-05-06, Version 1

Vision-language models (VLMs) embed texts and images in a shared representation space. However, these models have been shown to suffer from a modality gap: the embeddings from one modality are clearly separated from those of the other in the embedding space. Although this misalignment is detrimental to downstream tasks such as multimodal retrieval, multimodal clustering, and zero-shot classification, no generic and practical methods have so far been proposed to assess it precisely, let alone reduce it. We therefore propose novel measures and effective techniques (spectral- and optimal transport-based methods) to achieve this goal. Extensive experiments conducted on several image-text datasets and models demonstrate their effectiveness and beneficial effects on downstream tasks. Our code is available at the URL provided in the paper's abstract.
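
To illustrate what "quantifying and reducing" the gap can look like in practice, a commonly used proxy (not necessarily one of the measures proposed in this paper) is the distance between the centroids of the image and text embeddings; translating each modality toward the joint centroid then gives a naive way to shrink it. A minimal sketch, assuming unit-normalized CLIP-style embeddings stored as NumPy arrays:

import numpy as np

def modality_gap(image_emb: np.ndarray, text_emb: np.ndarray) -> float:
    """Euclidean distance between the two modality centroids.

    A simple proxy for the modality gap; the paper proposes its own
    spectral- and optimal transport-based measures instead.
    """
    return float(np.linalg.norm(image_emb.mean(axis=0) - text_emb.mean(axis=0)))

def close_gap_by_centering(image_emb: np.ndarray, text_emb: np.ndarray):
    """Naively reduce the gap by removing each modality's mean offset
    from the joint centroid, then re-projecting onto the unit sphere."""
    joint_mean = np.concatenate([image_emb, text_emb]).mean(axis=0)
    shift = lambda x: x - x.mean(axis=0) + joint_mean
    renorm = lambda x: x / np.linalg.norm(x, axis=1, keepdims=True)
    return renorm(shift(image_emb)), renorm(shift(text_emb))

# Toy usage: random unit vectors stand in for real image/text embeddings.
rng = np.random.default_rng(0)
img = rng.normal(size=(100, 512))
img /= np.linalg.norm(img, axis=1, keepdims=True)
txt = rng.normal(size=(100, 512)) + 0.5   # constant offset simulates the gap
txt /= np.linalg.norm(txt, axis=1, keepdims=True)
print(modality_gap(img, txt))                              # gap before
print(modality_gap(*close_gap_by_centering(img, txt)))     # gap after (smaller)

The centering step here is only a baseline; the point of the paper is that more principled spectral and optimal-transport alignments can reduce the gap while preserving downstream performance.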

Related articles:
arXiv:2204.03934 [cs.CV] (Published 2022-04-08)
Does Robustness on ImageNet Transfer to Downstream Tasks?
arXiv:2109.01134 [cs.CV] (Published 2021-09-02)
Learning to Prompt for Vision-Language Models
arXiv:2301.04101 [cs.CV] (Published 2023-01-10)
Neural Radiance Field Codebooks