arXiv Analytics

arXiv:2105.11333 [cs.CV]

Multi-modal Understanding and Generation for Medical Images and Text via Vision-Language Pre-Training

Jong Hak Moon, Hyungyung Lee, Woncheol Shin, Edward Choi

Published 2021-05-24 (Version 1)

Recently, a number of studies have demonstrated impressive performance on diverse vision-language multi-modal tasks, such as image captioning and visual question answering, by extending the BERT architecture with multi-modal pre-training objectives. In this work, we explore a broad set of multi-modal representation learning tasks in the medical domain, specifically using radiology images and their unstructured reports. We propose Medical Vision Language Learner (MedViLL), which adopts a Transformer-based architecture combined with a novel multi-modal attention masking scheme to maximize generalization performance for both vision-language understanding tasks (image-report retrieval, disease classification, medical visual question answering) and a vision-language generation task (report generation). By rigorously evaluating the proposed model on four downstream tasks with two chest X-ray image datasets (MIMIC-CXR and Open-I), we empirically demonstrate the superior downstream task performance of MedViLL against various baselines, including task-specific architectures.
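
The abstract does not spell out the masking scheme, but a common way to let a single Transformer serve both understanding and generation is to switch between a fully bidirectional mask and a sequence-to-sequence mask over a joint [image tokens | text tokens] sequence. The sketch below is an illustrative assumption, not the authors' implementation: the function name build_multimodal_mask, the row-as-query convention, and the token counts (e.g. 49 visual tokens for a 7x7 patch grid) are all hypothetical.

    import torch

    def build_multimodal_mask(num_vis: int, num_txt: int, causal_text: bool) -> torch.Tensor:
        """Return an (L, L) boolean mask (True = attention allowed) for a
        joint sequence [visual tokens | text tokens]."""
        L = num_vis + num_txt
        mask = torch.ones(L, L, dtype=torch.bool)          # start fully bidirectional
        if causal_text:
            # Generation setting: text tokens attend to all visual tokens and
            # only to earlier text tokens; visual tokens do not attend to text.
            txt = torch.arange(num_txt)
            causal = txt.unsqueeze(1) >= txt.unsqueeze(0)   # lower-triangular over text
            mask[num_vis:, num_vis:] = causal
            mask[:num_vis, num_vis:] = False
        return mask

    # Understanding tasks (retrieval, classification, VQA): fully bidirectional.
    bi_mask = build_multimodal_mask(num_vis=49, num_txt=128, causal_text=False)
    # Generation task (report generation): causal attention over report tokens.
    s2s_mask = build_multimodal_mask(num_vis=49, num_txt=128, causal_text=True)

Under this assumption, the same pre-trained weights can be reused for both task families by passing the appropriate mask to the attention layers at fine-tuning time.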

Comments: v1: Main paper + supplementary material (15 pages, 5 figures, 6 tables)
Categories: cs.CV