arXiv:2501.18269 [cs.CV]
Keywords: visual tokens, multi-modal video captioning methods, caption generation module, captioning methods typically extract, first model-agnostic module selection framework