arXiv:2505.03703 [cs.CV]
Keywords: image-text representation learning, downstream tasks, vision-language models, multimodal, beneficial effects