arXiv:2105.03761 [cs.CV]

e-ViL: A Dataset and Benchmark for Natural Language Explanations in Vision-Language Tasks

Maxime Kayser, Oana-Maria Camburu, Leonard Salewski, Cornelius Emde, Virginie Do, Zeynep Akata, Thomas Lukasiewicz

Published 2021-05-08 (Version 1)

Recently, an increasing number of works have introduced models capable of generating natural language explanations (NLEs) for their predictions on vision-language (VL) tasks. Such models are appealing because they can provide human-friendly and comprehensive explanations. However, there is still a lack of unified evaluation approaches for the explanations generated by these models. Moreover, there are currently only a few datasets of NLEs for VL tasks. In this work, we introduce e-ViL, a benchmark for explainable vision-language tasks that establishes a unified evaluation framework and provides the first comprehensive comparison of existing approaches that generate NLEs for VL tasks. e-ViL spans four models and three datasets. Both automatic metrics and human evaluation are used to assess model-generated explanations. We also introduce e-SNLI-VE, the largest existing VL dataset with NLEs (over 430k instances). Finally, we propose a new model that combines UNITER, which learns joint embeddings of images and text, and GPT-2, a pre-trained language model that is well-suited for text generation. It surpasses the previous state of the art by a large margin across all datasets.
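
The abstract describes a model in which UNITER's joint image-text embeddings condition GPT-2 for explanation generation. The following is a minimal, purely illustrative sketch of that kind of conditioning, assuming the HuggingFace transformers GPT-2 implementation; the UNITER encoder is replaced here by a random-tensor stand-in, and the variable names and prompt are hypothetical, not the paper's actual interface.

    # Illustrative sketch only: condition GPT-2 on joint image-text embeddings
    # from a UNITER-style encoder. The encoder output is faked with a random
    # tensor; the paper's actual fusion mechanism may differ.
    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")

    # Stand-in for UNITER output: one joint embedding per multimodal token
    # (image regions + text tokens), shape (batch, seq_len, hidden_size).
    joint_embeddings = torch.randn(1, 40, gpt2.config.n_embd)

    # Text prompt that the generated explanation should continue from.
    prompt_ids = tokenizer("The answer is yes because", return_tensors="pt").input_ids
    prompt_embeds = gpt2.transformer.wte(prompt_ids)  # GPT-2 token embeddings

    # Prefix the multimodal embeddings to the prompt and let GPT-2 decode.
    inputs_embeds = torch.cat([joint_embeddings, prompt_embeds], dim=1)
    outputs = gpt2(inputs_embeds=inputs_embeds)
    next_token_logits = outputs.logits[:, -1, :]  # distribution over the next token
    print(next_token_logits.shape)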
