arXiv:2312.03700 Abstract | arXiv Analytics

arXiv:2312.03700 [cs.CV]

OneLLM: One Framework to Align All Modalities with Language

Jiaming Han, Kaixiong Gong, Yiyuan Zhang, Jiaqi Wang, Kaipeng Zhang, Dahua Lin, Yu Qiao, Peng Gao, Xiangyu Yue

Published 2023-12-06, Version 1

Multimodal large language models (MLLMs) have gained significant attention due to their strong multimodal understanding capability. However, existing works rely heavily on modality-specific encoders, which usually differ in architecture and are limited to common modalities. In this paper, we present OneLLM, an MLLM that aligns eight modalities to language using a unified framework. We achieve this through a unified multimodal encoder and a progressive multimodal alignment pipeline. In detail, we first train an image projection module to connect a vision encoder with the LLM. Then, we build a universal projection module (UPM) by mixing multiple image projection modules and dynamic routing. Finally, we progressively align more modalities to the LLM with the UPM. To fully leverage the potential of OneLLM in following instructions, we also curate a comprehensive multimodal instruction dataset, including 2M items from image, audio, video, point cloud, depth/normal map, IMU and fMRI brain activity. OneLLM is evaluated on 25 diverse benchmarks, encompassing tasks such as multimodal captioning, question answering and reasoning, where it delivers excellent performance. Code, data, model and online demo are available at https://github.com/csuhan/OneLLM
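The abstract describes the universal projection module (UPM) only as a mix of several image projection modules combined through dynamic routing. The sketch below is one possible reading of that idea, not the authors' implementation. It assumes soft routing (a softmax over router scores) and treats each projection module as a single linear map. All names (`UniversalProjection`, `route`, `project`), the dimensions, and the random weights are hypothetical and chosen only for illustration.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Small random weights standing in for trained parameters."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


class UniversalProjection:
    """Hypothetical soft mixture of linear projection experts.

    Assumption: each expert is one linear map and the router is a linear
    layer followed by softmax. The abstract does not specify either choice.
    """

    def __init__(self, dim_in, dim_out, num_experts, seed=0):
        rng = random.Random(seed)
        self.experts = [
            random_matrix(dim_out, dim_in, rng) for _ in range(num_experts)
        ]
        self.router = random_matrix(num_experts, dim_in, rng)

    def route(self, token):
        """Return one mixing weight per expert for this token."""
        return softmax(matvec(self.router, token))

    def project(self, token):
        """Blend the expert outputs using the routing weights."""
        weights = self.route(token)
        outputs = [matvec(expert, token) for expert in self.experts]
        dim_out = len(outputs[0])
        return [
            sum(w * out[i] for w, out in zip(weights, outputs))
            for i in range(dim_out)
        ]


# A single 4-dimensional token from any modality encoder (made-up values).
upm = UniversalProjection(dim_in=4, dim_out=3, num_experts=3)
token = [0.5, -1.0, 0.25, 2.0]
weights = upm.route(token)
projected = upm.project(token)
```

Under this reading, every modality shares the same set of experts, and the router decides per token how much each expert contributes. That would let new modalities be aligned progressively without adding a new modality-specific projector.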

Comments: Code: https://github.com/csuhan/OneLLM

Categories: cs.CV, cs.AI, cs.CL, cs.LG, cs.MM

Keywords: modalities, mixing multiple image projection modules, multimodal large language models, multimodal alignment pipeline, delivers excellent performance

Tags: github project

Related articles:

arXiv:2404.11207 [cs.CV] (Published 2024-04-17)

Exploring the Transferability of Visual Prompting for Multimodal Large Language Models

Yichi Zhang, Yinpeng Dong, Siyuan Zhang, Tianzan Min, Hang Su, Jun Zhu

arXiv:2403.09072 [cs.CV] (Published 2024-03-14)

UniCode: Learning a Unified Codebook for Multimodal Large Language Models

Sipeng Zheng, Bohan Zhou, Yicheng Feng, Ye Wang, Zongqing Lu

arXiv:2404.12390 [cs.CV] (Published 2024-04-18)

BLINK: Multimodal Large Language Models Can See but Not Perceive

Xingyu Fu et al.
