arXiv:2312.12423 [cs.CV]

Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model

Shraman Pramanick, Guangxing Han, Rui Hou, Sayan Nag, Ser-Nam Lim, Nicolas Ballas, Qifan Wang, Rama Chellappa, Amjad Almahairi

Published 2023-12-19 (Version 1)

The ability of large language models (LLMs) to process visual inputs has given rise to general-purpose vision systems that unify various vision-language (VL) tasks through instruction tuning. However, due to the enormous diversity of input-output formats in the vision domain, existing general-purpose models fail to integrate segmentation and multi-image inputs with coarse-level tasks into a single framework. In this work, we introduce VistaLLM, a powerful visual system that addresses coarse- and fine-grained VL tasks over single and multiple input images within a unified framework. VistaLLM utilizes an instruction-guided image tokenizer that filters global embeddings using task descriptions to extract compressed and refined features from numerous images. Moreover, VistaLLM employs a gradient-aware adaptive sampling technique to represent binary segmentation masks as sequences, significantly improving over the previously used uniform sampling. To equip VistaLLM with these capabilities, we curate CoinIt, a comprehensive coarse-to-fine instruction-tuning dataset with 6.8M samples. We also address the lack of multi-image grounding datasets by introducing a novel task, AttCoSeg (Attribute-level Co-Segmentation), which boosts the model's reasoning and grounding capability over multiple input images. Extensive experiments on a wide range of vision-only and VL tasks demonstrate the effectiveness of VistaLLM, which achieves consistent state-of-the-art performance over strong baselines across all downstream tasks. Our project page can be found at https://shramanpramanick.github.io/VistaLLM/.
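
The gradient-aware adaptive sampling idea can be pictured with a short sketch. The Python snippet below is a minimal NumPy illustration of the general technique, not VistaLLM's actual implementation: the function name adaptive_sample, the second-difference curvature proxy, and the uniform_floor mixing term are assumptions made here for exposition.

import numpy as np

def adaptive_sample(contour, n_points=32, uniform_floor=0.1):
    """Pick n_points vertices from a closed contour of shape (K, 2),
    spending more of the fixed point budget where the boundary turns
    sharply and less on straight stretches, in contrast to uniform
    arc-length sampling."""
    # Curvature proxy: magnitude of the discrete second difference,
    # i.e. how sharply the polyline bends at each vertex.
    d2 = np.roll(contour, -1, axis=0) - 2.0 * contour + np.roll(contour, 1, axis=0)
    turn = np.linalg.norm(d2, axis=1)
    # A small uniform floor keeps flat regions from being skipped entirely.
    weights = turn + uniform_floor * (turn.mean() + 1e-8)
    # Inverse-CDF sampling at evenly spaced quantiles: sampled indices
    # cluster wherever the cumulative weight rises fastest.
    cdf = np.cumsum(weights)
    cdf /= cdf[-1]
    quantiles = (np.arange(n_points) + 0.5) / n_points
    return contour[np.searchsorted(cdf, quantiles)]

# Usage: sample a star-shaped contour, then quantize the (x, y) points
# into integer bins so the mask becomes a flat sequence an LLM can emit.
theta = np.linspace(0.0, 2.0 * np.pi, 400, endpoint=False)
radius = 1.0 + 0.3 * np.cos(5.0 * theta)
star = np.stack([radius * np.cos(theta), radius * np.sin(theta)], axis=1)
points = adaptive_sample(star, n_points=32)
normalized = (points - star.min(axis=0)) / (star.max(axis=0) - star.min(axis=0))
tokens = np.round(normalized * 255).astype(int).ravel()  # 64 integers in [0, 255]

Compared with placing points at equal arc-length steps, the inverse-CDF step concentrates the fixed token budget on high-curvature boundary details, which is the kind of improvement over uniform sampling that the abstract claims.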
