arXiv:2011.07191 [cs.LG]

On the Benefits of Early Fusion in Multimodal Representation Learning

George Barnum, Sabera Talukder, Yisong Yue

Published 2020-11-14 (Version 1)

Intelligently reasoning about the world often requires integrating data from multiple modalities, as any individual modality may contain unreliable or incomplete information. Prior work in multimodal learning fuses input modalities only after significant independent processing. The brain, in contrast, performs multimodal processing almost immediately. This divide between conventional multimodal learning and neuroscience suggests that a detailed study of early multimodal fusion could improve artificial multimodal representations. To facilitate the study of early multimodal fusion, we create a convolutional LSTM (C-LSTM) network architecture that simultaneously processes both audio and visual inputs, and allows us to select the layer at which audio and visual information combines. Our results demonstrate that fusing audio and visual inputs immediately, in the initial C-LSTM layer, yields higher-performing networks that are more robust to additive white noise in both the audio and visual inputs.
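To make the selectable fusion depth concrete, the sketch below is a minimal PyTorch interpretation, not the authors' released code: a stack of convolutional-LSTM cells in which a fusion_layer index chooses the depth at which audio and visual feature maps are concatenated channel-wise, with fusion_layer=0 corresponding to immediate fusion in the first layer. The class names, the input shapes, the assumption that both modalities arrive as spatially aligned feature maps, and the choice of channel concatenation as the fusion operation are all illustrative assumptions.

import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """One convolutional LSTM cell; all four gates come from a single Conv2d."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.hid_ch = hid_ch
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, g, o = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

class SelectableFusionNet(nn.Module):
    """Audio and video streams merge at depth `fusion_layer`;
    fusion_layer=0 is the immediate early fusion the abstract describes."""
    def __init__(self, a_ch, v_ch, hid_ch, depth, fusion_layer):
        super().__init__()
        assert 0 <= fusion_layer < depth
        # Separate unimodal stacks before the fusion point...
        self.audio = nn.ModuleList([
            ConvLSTMCell(a_ch if l == 0 else hid_ch, hid_ch)
            for l in range(fusion_layer)])
        self.video = nn.ModuleList([
            ConvLSTMCell(v_ch if l == 0 else hid_ch, hid_ch)
            for l in range(fusion_layer)])
        # ...then a shared multimodal stack after it.
        in0 = (a_ch + v_ch) if fusion_layer == 0 else 2 * hid_ch
        self.joint = nn.ModuleList([
            ConvLSTMCell(in0 if l == 0 else hid_ch, hid_ch)
            for l in range(depth - fusion_layer)])

    def forward(self, audio, video):
        # audio: (B, T, a_ch, H, W); video: (B, T, v_ch, H, W).
        # Spatial alignment of the two modalities is assumed here.
        B, T, _, H, W = video.shape
        def zero(cell):
            z = video.new_zeros(B, cell.hid_ch, H, W)
            return (z, z.clone())
        sa = [zero(c) for c in self.audio]
        sv = [zero(c) for c in self.video]
        sj = [zero(c) for c in self.joint]
        for t in range(T):
            xa, xv = audio[:, t], video[:, t]
            for l, cell in enumerate(self.audio):
                sa[l] = cell(xa, sa[l])
                xa = sa[l][0]
            for l, cell in enumerate(self.video):
                sv[l] = cell(xv, sv[l])
                xv = sv[l][0]
            # Fusion by channel-wise concatenation (an assumed choice).
            x = torch.cat([xa, xv], dim=1)
            for l, cell in enumerate(self.joint):
                sj[l] = cell(x, sj[l])
                x = sj[l][0]
        return x  # final hidden state of the top joint layer

if __name__ == "__main__":
    net = SelectableFusionNet(a_ch=1, v_ch=3, hid_ch=16, depth=4, fusion_layer=0)
    a = torch.randn(2, 5, 1, 32, 32)  # (batch, time, channels, H, W) audio features
    v = torch.randn(2, 5, 3, 32, 32)  # spatially aligned video frames
    print(net(a, v).shape)            # torch.Size([2, 16, 32, 32])

Under these assumptions, sweeping fusion_layer from 0 (immediate fusion) up to depth - 1 (late fusion) while holding total depth fixed is one way to run the kind of fusion-depth comparison the abstract reports.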

Related articles:
arXiv:2205.00142 [cs.LG] (Published 2022-04-30)
Multimodal Representation Learning With Text and Images
arXiv:2506.20494 [cs.LG] (Published 2025-06-25)
Multimodal Representation Learning and Fusion
Qihang Jin et al.
arXiv:2502.06846 [cs.LG] (Published 2025-02-07)
Prot2Chat: Protein LLM with Early Fusion of Sequence and Structure