arXiv:2406.20092 [cs.CV]

Keywords: visual tokens, efficient multi-modal models, incorporates stage-wise visual context compression, llavolta incorporates stage-wise visual context, visual question answering accuracy

Tags: github project