arXiv:2312.06968 Abstract | arXiv Analytics

arXiv:2312.06968 [cs.CV]

Hallucination Augmented Contrastive Learning for Multimodal Large Language Model

Chaoya Jiang, Haiyang Xu, Mengfan Dong, Jiaxing Chen, Wei Ye, Ming Yan, Qinghao Ye, Ji Zhang, Fei Huang, Shikun Zhang

Published 2023-12-12, Version 1

Multi-modal large language models (MLLMs) have been shown to efficiently integrate natural language with visual information to handle multi-modal tasks. However, MLLMs still face a fundamental limitation: hallucination, where they tend to generate erroneous or fabricated information. In this paper, we address hallucinations in MLLMs from a novel perspective of representation learning. We first analyze the representation distribution of textual and visual tokens in MLLMs, revealing two important findings: 1) there is a significant gap between textual and visual representations, indicating unsatisfactory cross-modal representation alignment; 2) representations of texts that do and do not contain hallucinations are entangled, making them challenging to distinguish. These two observations motivate a simple yet effective method to mitigate hallucinations. Specifically, we introduce contrastive learning into MLLMs and use text with hallucinations as hard negative examples, bringing representations of non-hallucinated text and visual samples closer together while pushing apart representations of non-hallucinated and hallucinated text. We evaluate our method quantitatively and qualitatively, showing its effectiveness in reducing hallucination occurrences and improving performance across multiple benchmarks. On the MMHal-Bench benchmark, our method obtains a 34.66%/29.5% improvement over the MiniGPT-4/LLaVA baselines, respectively.
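The abstract does not give the loss itself, so the following is only a minimal sketch of the general idea, assuming a standard InfoNCE-style objective over cosine similarities for a single image. The function names, the temperature value, and the toy 2-D vectors are all hypothetical and not taken from the paper; the authors' actual formulation may differ (for example, in how in-batch negatives are used).

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def hard_negative_contrastive_loss(image, caption, negatives, temperature=0.1):
    """InfoNCE-style loss for one image (illustrative, not the paper's code).

    image     : visual representation (list of floats)
    caption   : representation of the faithful, non-hallucinated text
    negatives : representations of hallucinated texts, used as hard negatives
    """
    pos = cosine(image, caption) / temperature
    logits = [pos] + [cosine(image, n) / temperature for n in negatives]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    # Negative log-probability of picking the faithful caption.
    return log_denominator - pos


# Toy vectors (assumed values, for illustration only).
image = [1.0, 0.0]
faithful = [0.9, 0.1]

# Hallucinated text far from the image: the loss is small.
separated = hard_negative_contrastive_loss(image, faithful, [[0.1, 0.9]])
# Hallucinated text almost identical to the faithful text: the loss is large.
entangled = hard_negative_contrastive_loss(image, faithful, [[0.9, 0.12]])
print(separated < entangled)
```

Minimizing a loss of this shape raises the similarity between the image and the faithful text while lowering the similarity between the image and the hallucinated text, which matches the two effects the abstract describes. When a hallucinated representation coincides with the faithful one, the loss for a single negative equals log 2, the chance level for a two-way choice.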

Categories: cs.CV

Keywords: multimodal large language model, hallucination augmented contrastive learning, cross-modal representation alignment

Related articles:

arXiv:2306.13549 [cs.CV] (Published 2023-06-23)

A Survey on Multimodal Large Language Models

Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, Enhong Chen

arXiv:2505.10769 [cs.CV] (Published 2025-05-16)

Unifying Segment Anything in Microscopy with Multimodal Large Language Model

Manyu Li, Ruian He, Zixian Zhang, Weimin Tan, Bo Yan

arXiv:2503.08507 [cs.CV] (Published 2025-03-11, updated 2025-05-12)

Referring to Any Person

Qing Jiang et al.
