arXiv Analytics

arXiv:2409.11059 [cs.CV]

OneEncoder: A Lightweight Framework for Progressive Alignment of Modalities

Bilal Faye, Hanane Azzag, Mustapha Lebbah

Published 2024-09-17 · Version 1

Cross-modal alignment learning integrates information from different modalities such as text, image, audio, and video to create unified models. This approach develops shared representations and learns correlations between modalities, enabling applications such as visual question answering and audiovisual content analysis. Current techniques rely on large modality-specific encoders that must be fine-tuned or trained from scratch on vast aligned datasets (e.g., text-image, text-audio, image-audio). This approach has limitations: (i) it is very expensive, since large encoders must be trained on extensive datasets; (ii) acquiring large aligned paired datasets is challenging; and (iii) adding a new modality requires retraining the entire framework. To address these issues, we propose OneEncoder, a lightweight framework that progressively represents and aligns four modalities (image, text, audio, video). We first train a lightweight Universal Projection module (UP) to align the image and text modalities. We then freeze the pretrained UP and progressively align each future modality to those already aligned. Thanks to its lightweight design, OneEncoder operates efficiently and cost-effectively even when vast aligned datasets are unavailable. Trained on small paired datasets, it shows strong performance on tasks such as classification, querying, and visual question answering, surpassing methods that rely on large datasets and specialized encoders.
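The two-stage procedure described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the class and function names are hypothetical, the "training" is omitted, and features are assumed to come from frozen modality-specific encoders. The sketch only shows the key structural idea: the shared Universal Projection (UP) is trained once on image-text pairs, then frozen, and each new modality is mapped into the space UP already defines.

```python
import numpy as np

rng = np.random.default_rng(0)

class UniversalProjection:
    """Hypothetical stand-in for the paper's lightweight UP module:
    a single shared linear map into the common embedding space."""
    def __init__(self, dim_in, dim_out):
        self.W = rng.normal(size=(dim_in, dim_out)) * 0.02
        self.frozen = False  # set to True after stage-1 (image-text) training

    def __call__(self, x):
        return x @ self.W

def align_new_modality(up, new_feats, anchor_feats):
    """Stage 2 sketch: UP is frozen; in the real framework only the new
    modality's adapter would be trained so that its UP projection matches
    the already-aligned anchor modalities."""
    assert up.frozen, "UP must be frozen before aligning new modalities"
    return up(new_feats), up(anchor_feats)

# Stage 1: (pretend-)train UP on image-text pairs, then freeze it.
up = UniversalProjection(dim_in=128, dim_out=64)
up.frozen = True

# Stage 2: bring in a new modality (e.g., audio) against an aligned anchor.
audio_feats = rng.normal(size=(4, 128))  # from a frozen audio encoder (assumed)
text_feats = rng.normal(size=(4, 128))   # from a frozen text encoder (assumed)
z_audio, z_text = align_new_modality(up, audio_feats, text_feats)
print(z_audio.shape, z_text.shape)  # both land in the shared 64-d space
```

Because UP is shared and frozen after the first stage, adding a modality never requires retraining the encoders or the projection for the modalities already aligned, which is the source of the framework's claimed cost savings.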
