arXiv Analytics

arXiv:2304.04385 [cs.LG]

On Robustness in Multimodal Learning

Brandon McKinzie, Joseph Cheng, Vaishaal Shankar, Yinfei Yang, Jonathon Shlens, Alexander Toshev

Published 2023-04-10, Version 1

Multimodal learning is defined as learning over multiple heterogeneous input modalities such as video, audio, and text. In this work, we are concerned with understanding how models behave as the types of modalities differ between training and deployment, a situation that naturally arises in many applications of multimodal learning to hardware platforms. We present a multimodal robustness framework to provide a systematic analysis of common multimodal representation learning methods. Further, we identify robustness shortcomings of these approaches and propose two intervention techniques leading to $1.5\times$-$4\times$ robustness improvements on three datasets: AudioSet, Kinetics-400, and ImageNet-Captions. Finally, we demonstrate that these interventions better utilize additional modalities, if present, to achieve competitive results of $44.2$ mAP on AudioSet 20K.

Related articles:
arXiv:2402.06223 [cs.LG] (Published 2024-02-09, updated 2025-05-12)
Beyond DAGs: A Latent Partial Causal Model for Multimodal Learning
Yuhang Liu et al.
arXiv:2202.06218 [cs.LG] (Published 2022-02-13)
Emotion Based Hate Speech Detection using Multimodal Learning
arXiv:2312.00935 [cs.LG] (Published 2023-12-01)
A Theory of Unimodal Bias in Multimodal Learning