
arXiv:2406.20092 [cs.CV]

LLaVolta: Efficient Multi-modal Models via Stage-wise Visual Context Compression

Jieneng Chen, Luoxin Ye, Ju He, Zhao-Yang Wang, Daniel Khashabi, Alan Yuille

Published 2024-06-28, Version 1

While significant advancements have been made in compressed representations for text embeddings in large language models (LLMs), the compression of visual tokens in large multi-modal models (LMMs) has remained a largely overlooked area. In this work, we study the redundancy of visual tokens and efficient training in these models. Our initial experiments show that eliminating up to 70% of visual tokens at test time through simple average pooling leads to only a 3% reduction in visual question answering accuracy on the GQA benchmark, indicating significant redundancy in the visual context. To address this, we introduce the Visual Context Compressor, which reduces the number of visual tokens during training to improve training efficiency without sacrificing performance. To minimize the information loss caused by compressing visual tokens while maintaining training efficiency, we develop LLaVolta as a lite training scheme. LLaVolta incorporates stage-wise visual context compression, progressively compressing the visual tokens from heavily to lightly and applying no compression at the end of training, so that no information is lost at test time. Extensive experiments demonstrate that our approach enhances the performance of LMMs in both image-language and video-language understanding while also significantly cutting training costs. Code is available at https://github.com/Beckschen/LLaVolta
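The two ideas in the abstract, average pooling over visual tokens and a stage-wise schedule that relaxes compression as training proceeds, can be illustrated with a small sketch. This is not the authors' implementation (see the linked repository for that). The function names, the even split of training into stages, and the stride values (4, 2, 1) are all assumptions chosen for illustration. A stride of 4 removes 75% of the tokens, which is close to, but not the same as, the 70% figure quoted above.

```python
from typing import List, Sequence

Token = List[float]


def average_pool_tokens(tokens: Sequence[Sequence[float]], stride: int) -> List[Token]:
    """Average each run of `stride` consecutive tokens into one token.

    Hypothetical helper: 1D average pooling along the token sequence.
    A trailing group shorter than `stride` is averaged over its own length.
    """
    if stride < 1:
        raise ValueError("stride must be >= 1")
    pooled: List[Token] = []
    for start in range(0, len(tokens), stride):
        group = tokens[start:start + stride]
        dim = len(group[0])
        pooled.append([sum(tok[d] for tok in group) / len(group) for d in range(dim)])
    return pooled


def stride_for_step(step: int, total_steps: int, stage_strides: Sequence[int]) -> int:
    """Return the pooling stride for a training step.

    Assumption: training is split into equal-length stages, one per entry
    in `stage_strides`, ordered from heavy compression to none (stride 1).
    """
    if not 0 <= step < total_steps:
        raise ValueError("step must lie in [0, total_steps)")
    stage = step * len(stage_strides) // total_steps
    return stage_strides[stage]


# Toy example: 8 visual tokens of dimension 2.
tokens = [[float(i), float(2 * i)] for i in range(8)]
stage_strides = (4, 2, 1)  # hypothetical: heavy -> light -> no compression

heavy = average_pool_tokens(tokens, stride_for_step(0, 9, stage_strides))
light = average_pool_tokens(tokens, stride_for_step(4, 9, stage_strides))
final = average_pool_tokens(tokens, stride_for_step(8, 9, stage_strides))

print(len(heavy), len(light), len(final))  # token counts per stage
```

Because the last stage uses a stride of 1, the tokens pass through unchanged, which mirrors the abstract's point that the final training stage and test time use the full visual context.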


Categories: cs.CV

Keywords: visual tokens, efficient multi-modal models, stage-wise visual context compression, visual question answering accuracy

Tags: github project

Related articles:

arXiv:2312.08870 [cs.CV] (Published 2023-12-12)

Vista-LLaMA: Reliable Video Narrator via Equal Distance to Visual Tokens

Fan Ma, Xiaojie Jin, Heng Wang, Yuchen Xian, Jiashi Feng, Yi Yang

arXiv:2501.09532 [cs.CV] (Published 2025-01-16, updated 2025-02-01)

AdaFV: Rethinking of Visual-Language alignment for VLM acceleration

Jiayi Han, Liang Du, Yiwen Wu, Xiangguo Zhou, Hongwei Du, Weibo Zheng

arXiv:2501.18269 [cs.CV] (Published 2025-01-30)