arXiv:2203.16434 [cs.CV]

Keywords: transformers, models spatial multi-modal interactions, jointly performs spatio-temporal localization, spatio-temporal video grounding task, object detection