
arXiv:2407.18520 [cs.CV]

Text-Region Matching for Multi-Label Image Recognition with Missing Labels

Leilei Ma, Hongxing Xie, Lei Wang, Yanping Fu, Dengdi Sun, Haifeng Zhao

Published 2024-07-26, Version 1

Recently, large-scale visual language pre-trained (VLP) models have demonstrated impressive performance across various downstream tasks. Motivated by these advances, pioneering efforts have emerged in multi-label image recognition with missing labels, leveraging VLP prompt-tuning technology. However, these methods often fail to match text and visual features well because of the complicated semantic gap and the missing labels in a multi-label image. To tackle this challenge, we propose Text-Region Matching for optimizing Multi-Label prompt tuning, namely TRM-ML, a novel method for enhancing meaningful cross-modal matching. In contrast to existing methods, we advocate exploring the information in category-aware regions rather than the entire image or individual pixels, which helps bridge the semantic gap between textual and visual representations in a one-to-one matching manner. We further introduce multimodal contrastive learning to narrow the semantic gap between the textual and visual modalities and to establish intra-class and inter-class relationships. Additionally, to deal with missing labels, we propose a multimodal category prototype that leverages intra- and inter-category semantic relationships to estimate unknown labels, facilitating pseudo-label generation. Extensive experiments on the MS-COCO, PASCAL VOC, Visual Genome, NUS-WIDE, and CUB-200-2011 benchmark datasets demonstrate that our proposed framework outperforms the state-of-the-art methods by a significant margin. Our code is available at https://github.com/yu-gi-oh-leilei/TRM-ML.
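
To make the one-to-one matching idea in the abstract concrete, the toy sketch below scores each category by comparing that category's region feature with that category's text feature, then fills in unknown labels when the score is high. It is an illustration only and not the authors' implementation: the function names (`match_scores`, `fill_unknown`), the cosine score, the fixed `threshold`, the label coding (1 = positive, -1 = negative, 0 = unknown), and the use of the text feature as a stand-in for a category prototype are all assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def match_scores(region_feats, text_feats):
    """One-to-one matching: category c's region feature is compared
    only with category c's text feature (hypothetical helper)."""
    return [cosine(r, t) for r, t in zip(region_feats, text_feats)]


def fill_unknown(labels, scores, threshold=0.8):
    """Keep known labels (1 or -1) untouched; turn an unknown label (0)
    into a positive pseudo-label when its score reaches the assumed threshold."""
    return [1 if (y == 0 and s >= threshold) else y for y, s in zip(labels, scores)]


# Toy data: three categories, 2-D features (made up for illustration).
region_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_feats = [[1.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
labels = [1, 0, 0]  # category 0 is annotated; categories 1 and 2 are missing

scores = match_scores(region_feats, text_feats)
pseudo = fill_unknown(labels, scores)
print(pseudo)
```

In this toy case the second category stays unknown because its region and text features are orthogonal, while the third receives a positive pseudo-label. A real system would learn the features and would likely choose the threshold, or a softer weighting, by validation.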

Comments: Accepted to ACM International Conference on Multimedia (ACM MM) 2024

Categories: cs.CV

Keywords: multi-label image recognition, missing labels, text-region matching, semantic gap, large-scale visual language

Tags: conference paper

Related articles:

arXiv:2107.11159 [cs.CV] (Published 2021-07-23)

Learning Discriminative Representations for Multi-Label Image Recognition

Mohammed Hassanin, Ibrahim Radwan, Salman Khan, Murat Tahtali

arXiv:2204.03795 [cs.CV] (Published 2022-04-08)

Semantic Representation and Dependency Learning for Multi-Label Image Recognition

Tao Pu, Lixian Yuan, Hefeng Wu, Tianshui Chen, Ling Tian, Liang Lin

arXiv:2211.12739 [cs.CV] (Published 2022-11-23)