arXiv:2106.11539 [cs.CV]

DocFormer: End-to-End Transformer for Document Understanding

Srikar Appalaraju, Bhavan Jasani, Bhargava Urala Kota, Yusheng Xie, R. Manmatha

Published 2021-06-22 (Version 1)

We present DocFormer -- a multi-modal transformer-based architecture for the task of Visual Document Understanding (VDU). VDU is a challenging problem which aims to understand documents in their varied formats (forms, receipts, etc.) and layouts. In addition, DocFormer is pre-trained in an unsupervised fashion using carefully designed tasks which encourage multi-modal interaction. DocFormer uses text, vision and spatial features and combines them using a novel multi-modal self-attention layer. DocFormer also shares learned spatial embeddings across modalities, which makes it easy for the model to correlate text to visual tokens and vice versa. DocFormer is evaluated on four different datasets, each with strong baselines. DocFormer achieves state-of-the-art results on all of them, sometimes beating models 4x its size (in number of parameters).
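To make the fusion idea in the abstract concrete, below is a minimal sketch (not the authors' released code) of a multi-modal self-attention block in which text and visual token features each attend over their own stream while reusing a single learned spatial embedding table for layout positions. The module name, dimensions, the use of PyTorch's built-in multi-head attention, and the concatenate-and-project fusion rule are all illustrative assumptions, not the paper's exact design.

```python
# Hedged sketch of multi-modal attention with spatial embeddings shared
# across the text and vision streams. Shapes, names, and the fusion rule
# are assumptions for illustration only.
import torch
import torch.nn as nn


class SharedSpatialMultiModalAttention(nn.Module):
    def __init__(self, dim=768, num_heads=12, max_positions=1024):
        super().__init__()
        # One spatial embedding table reused by both modalities, so text
        # and visual tokens describe layout in the same learned space.
        self.spatial_emb = nn.Embedding(max_positions, dim)
        self.text_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.vision_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, text_feats, vision_feats, positions):
        # positions: (batch, seq) integer layout indices (e.g. quantized
        # bounding-box coordinates) shared by aligned text/visual tokens.
        spatial = self.spatial_emb(positions)
        t, _ = self.text_attn(text_feats + spatial,
                              text_feats + spatial,
                              text_feats + spatial)
        v, _ = self.vision_attn(vision_feats + spatial,
                                vision_feats + spatial,
                                vision_feats + spatial)
        # Concatenate the two attended streams and project back to model dim.
        return self.fuse(torch.cat([t, v], dim=-1))


if __name__ == "__main__":
    batch, seq, dim = 2, 16, 768
    block = SharedSpatialMultiModalAttention(dim=dim)
    text = torch.randn(batch, seq, dim)
    vision = torch.randn(batch, seq, dim)
    pos = torch.randint(0, 1024, (batch, seq))
    print(block(text, vision, pos).shape)  # torch.Size([2, 16, 768])
```

The point of the sketch is the shared `spatial_emb` lookup: because the same layout embedding is added to both streams before attention, corresponding text and visual tokens carry a common positional signal, which is the intuition the abstract gives for easier cross-modal correlation.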

Related articles:
arXiv:2305.14218 [cs.CV] (Published 2023-05-23, updated 2023-05-24)
DUBLIN -- Document Understanding By Language-Image Network
arXiv:2306.01733 [cs.CV] (Published 2023-06-02)
DocFormerv2: Local Features for Document Understanding
arXiv:2405.11757 [cs.CV] (Published 2024-05-20)
DLAFormer: An End-to-End Transformer For Document Layout Analysis