arXiv Analytics

arXiv:2105.11333 [cs.CV]

Multi-modal Understanding and Generation for Medical Images and Text via Vision-Language Pre-Training

Jong Hak Moon, Hyungyung Lee, Woncheol Shin, Edward Choi

Published 2021-05-24 (Version 1)

Recently, a number of studies have demonstrated impressive performance on diverse vision-language multi-modal tasks, such as image captioning and visual question answering, by extending the BERT architecture with multi-modal pre-training objectives. In this work, we explore a broad set of multi-modal representation learning tasks in the medical domain, specifically using radiology images and their unstructured reports. We propose Medical Vision Language Learner (MedViLL), which adopts a Transformer-based architecture combined with a novel multi-modal attention masking scheme to maximize generalization performance for both vision-language understanding tasks (image-report retrieval, disease classification, medical visual question answering) and a vision-language generation task (report generation). By rigorously evaluating the proposed model on four downstream tasks with two chest X-ray image datasets (MIMIC-CXR and Open-I), we empirically demonstrate the superior downstream task performance of MedViLL against various baselines, including task-specific architectures.
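
The abstract does not spell out the masking scheme, but a common way to let a single Transformer serve both understanding and generation is to switch between a fully bidirectional mask and a sequence-to-sequence mask over a joint [image tokens | text tokens] sequence. The sketch below is an illustrative assumption, not the authors' implementation: the function name build_multimodal_mask, the row-as-query convention, and the token counts (e.g. 49 visual tokens for a 7x7 patch grid) are all hypothetical.

    import torch

    def build_multimodal_mask(num_vis: int, num_txt: int, causal_text: bool) -> torch.Tensor:
        """Return an (L, L) boolean mask (True = attention allowed) for a
        joint sequence [visual tokens | text tokens]."""
        L = num_vis + num_txt
        mask = torch.ones(L, L, dtype=torch.bool)          # start fully bidirectional
        if causal_text:
            # Generation setting: text tokens attend to all visual tokens and
            # only to earlier text tokens; visual tokens do not attend to text.
            txt = torch.arange(num_txt)
            causal = txt.unsqueeze(1) >= txt.unsqueeze(0)   # lower-triangular over text
            mask[num_vis:, num_vis:] = causal
            mask[:num_vis, num_vis:] = False
        return mask

    # Understanding tasks (retrieval, classification, VQA): fully bidirectional.
    bi_mask = build_multimodal_mask(num_vis=49, num_txt=128, causal_text=False)
    # Generation task (report generation): causal attention over report tokens.
    s2s_mask = build_multimodal_mask(num_vis=49, num_txt=128, causal_text=True)

Under this assumption, the same pre-trained weights can be reused for both task families by passing the appropriate mask to the attention layers at fine-tuning time.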

Comments: v1: Main paper + supplementary material (15 pages, 5 figures, 6 tables)
Categories: cs.CV