arXiv:2105.03761 [cs.CV]

e-ViL: A Dataset and Benchmark for Natural Language Explanations in Vision-Language Tasks

Maxime Kayser, Oana-Maria Camburu, Leonard Salewski, Cornelius Emde, Virginie Do, Zeynep Akata, Thomas Lukasiewicz

Published 2021-05-08 (Version 1)

Recently, an increasing number of works have introduced models capable of generating natural language explanations (NLEs) for their predictions on vision-language (VL) tasks. Such models are appealing because they can provide human-friendly and comprehensive explanations. However, there is still a lack of unified evaluation approaches for the explanations generated by these models. Moreover, there are currently only a few datasets of NLEs for VL tasks. In this work, we introduce e-ViL, a benchmark for explainable vision-language tasks that establishes a unified evaluation framework and provides the first comprehensive comparison of existing approaches that generate NLEs for VL tasks. e-ViL spans four models and three datasets. Both automatic metrics and human evaluation are used to assess model-generated explanations. We also introduce e-SNLI-VE, the largest existing VL dataset with NLEs (over 430k instances). Finally, we propose a new model that combines UNITER, which learns joint embeddings of images and text, and GPT-2, a pre-trained language model that is well-suited for text generation. It surpasses the previous state of the art by a large margin across all datasets.
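
The abstract describes a model in which UNITER's joint image-text embeddings condition GPT-2 for explanation generation. The following is a minimal, purely illustrative sketch of that kind of conditioning, assuming the HuggingFace transformers GPT-2 implementation; the UNITER encoder is replaced here by a random-tensor stand-in, and the variable names and prompt are hypothetical, not the paper's actual interface.

    # Illustrative sketch only: condition GPT-2 on joint image-text embeddings
    # from a UNITER-style encoder. The encoder output is faked with a random
    # tensor; the paper's actual fusion mechanism may differ.
    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")

    # Stand-in for UNITER output: one joint embedding per multimodal token
    # (image regions + text tokens), shape (batch, seq_len, hidden_size).
    joint_embeddings = torch.randn(1, 40, gpt2.config.n_embd)

    # Text prompt that the generated explanation should continue from.
    prompt_ids = tokenizer("The answer is yes because", return_tensors="pt").input_ids
    prompt_embeds = gpt2.transformer.wte(prompt_ids)  # GPT-2 token embeddings

    # Prefix the multimodal embeddings to the prompt and let GPT-2 decode.
    inputs_embeds = torch.cat([joint_embeddings, prompt_embeds], dim=1)
    outputs = gpt2(inputs_embeds=inputs_embeds)
    next_token_logits = outputs.logits[:, -1, :]  # distribution over the next token
    print(next_token_logits.shape)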
