# [Unifying Vision, Text, and Layout for Universal Document Processing](https://arxiv.org/abs/2212.02623)

The proposed network, UDOP, jointly models visual, textual, and layout information for several document AI tasks.

### Proposed Method :

![](https://hackmd.io/_uploads/BJovXUWKh.png)

**VTL Encoder:**
- A vision-text-layout (VTL) transformer architecture is used to dynamically fuse and unite image pixels and text tokens based on layout information.
- The document image is split into square patches, each encoded as a D-dimensional vector, and the patch embeddings are grouped into a sequence.
- A unified representation is then built using ![](https://hackmd.io/_uploads/r1fuqIZY2.png)
- For each text token embedding s<sub>i</sub>, the joint representation is the sum of the text feature and the feature of the image patch containing the token: ![](https://hackmd.io/_uploads/SkJpc8ZY2.png)
- For image patches v<sub>j</sub> that contain no text tokens, the joint representation v<sub>j</sub>' is the patch feature itself: ![](https://hackmd.io/_uploads/rkz-jUbth.png) (a minimal sketch of this fusion appears under Code Sketches at the end of this note)
- These new embeddings are fed into the VTL transformer encoder.
- 2D text token positions are encoded as a 2D relative attention bias, as in the TILT paper, but without 1D position embeddings (see the bias sketch at the end of this note).

**VTL Decoder:**
- The VTL decoder consists of a text-layout decoder and a vision decoder.
- The text-layout decoder is a unidirectional Transformer decoder that generates text and layout tokens in a sequence-to-sequence manner.
- The vision decoder uses the decoder of MAE with cross-attention.

![](https://hackmd.io/_uploads/SykoCI-K2.png)

- In UDOP the encoder output cannot be fed directly to the vision decoder, since the joint embedding contains non-masked image patches that are fused with text tokens.
- The vision decoder therefore takes a sequence of trainable placeholder embeddings that indicate whether each image patch is masked (sketched at the end of this note).
- The vision decoder attends to the encoder's vision-text output AND to character embeddings via cross-attention.

**Pretraining:**
- UDOP is pretrained on large labeled and unlabeled datasets using a universal generative task format and prompts (see the prompt-construction sketch at the end of this note):
- Joint text-layout reconstruction: a percentage of text tokens (15%) is masked, and the network must generate the masked tokens along with their bounding boxes.
- Layout modeling: the model predicts the positions of given text tokens from the document image and the context text.
- Visual text recognition: the model identifies the text at a given location in the image.
- Masked image reconstruction with text and layout: the model reconstructs the masked image conditioned on the text and layout.

### Experiments :

- The datasets used are FUNSD, CORD, RVL-CDIP, DocVQA, and the datasets of the DUE-Benchmark.
- The pretrained models are finetuned on each evaluation dataset.
- UDOP achieves SOTA performance on all 7 tasks of the DUE benchmark and sets a new SOTA on CORD as well.
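### Code Sketches :

The fusion step pairs each text token with the image patch whose area contains the token's bounding-box center, then adds the two embeddings; patches that receive no token keep their own embedding. Below is a minimal PyTorch sketch of that indexing; the grid size, tensor shapes, and all function and variable names are assumptions for illustration, not the paper's released code.

```python
import torch

def fuse_vision_text(patch_emb, text_emb, token_bboxes, grid=14):
    """patch_emb:    (grid*grid, D) embeddings of the image patches
    text_emb:     (num_tokens, D) text token embeddings
    token_bboxes: (num_tokens, 4) normalized (x0, y0, x1, y1) in [0, 1]
    Returns joint token embeddings and embeddings of unmatched patches."""
    # Center of each token's bounding box.
    cx = (token_bboxes[:, 0] + token_bboxes[:, 2]) / 2
    cy = (token_bboxes[:, 1] + token_bboxes[:, 3]) / 2
    # Index of the patch that contains each token center.
    col = (cx * grid).long().clamp(max=grid - 1)
    row = (cy * grid).long().clamp(max=grid - 1)
    patch_idx = row * grid + col                      # (num_tokens,)

    # s_i' = s_i + v_j : token embedding plus the feature of its patch.
    joint_text = text_emb + patch_emb[patch_idx]

    # v_j' = v_j for patches whose area contains no token center.
    matched = torch.zeros(patch_emb.size(0), dtype=torch.bool)
    matched[patch_idx] = True
    joint_patches = patch_emb[~matched]
    return joint_text, joint_patches
```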
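The 2D relative attention bias can be sketched in the same spirit: relative horizontal and vertical distances between token centers are bucketized and mapped to learned per-head biases that are added to the attention logits. The log-spaced bucketing below is a simplification of the T5-style scheme; the exact bucketing and all module names are assumptions.

```python
import torch
import torch.nn as nn

def bucketize(rel, num_buckets=32):
    """Map signed relative distances to coarse log-spaced buckets
    (a simplified T5-style scheme; the exact scheme is an assumption)."""
    direction = (rel > 0).long() * num_buckets  # one half of the range per sign
    magnitude = rel.abs().clamp(min=1).float().log2().long().clamp(max=num_buckets - 1)
    return direction + magnitude

class RelBias2D(nn.Module):
    """Learned per-head attention bias from the horizontal and vertical
    distances between token bounding-box centers."""
    def __init__(self, num_heads, num_buckets=32):
        super().__init__()
        self.h_bias = nn.Embedding(2 * num_buckets, num_heads)
        self.v_bias = nn.Embedding(2 * num_buckets, num_heads)
        self.num_buckets = num_buckets

    def forward(self, cx, cy):
        # cx, cy: (seq_len,) float token-center coordinates (e.g. in pixels).
        dh = bucketize(cx[None, :] - cx[:, None], self.num_buckets)
        dv = bucketize(cy[None, :] - cy[:, None], self.num_buckets)
        # (seq, seq, heads) -> (heads, seq, seq); added to attention logits.
        return (self.h_bias(dh) + self.v_bias(dv)).permute(2, 0, 1)
```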
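The vision decoder's placeholder mechanism can be pictured as trainable per-position queries carrying a masked/visible flag; these queries then cross-attend to the encoder's vision-text output (and to character embeddings) inside the MAE-style decoder. This is a rough sketch under that reading; the module and parameter names are hypothetical.

```python
import torch
import torch.nn as nn

class VisionDecoderQueries(nn.Module):
    """Trainable per-position queries for the MAE-style vision decoder,
    with a flag embedding marking masked vs. visible patches."""
    def __init__(self, num_patches, dim):
        super().__init__()
        self.pos_queries = nn.Parameter(torch.zeros(num_patches, dim))
        self.mask_flag = nn.Embedding(2, dim)   # 0 = visible, 1 = masked

    def forward(self, mask):
        # mask: (num_patches,) bool, True where the patch was masked out.
        # The returned queries cross-attend to the encoder output.
        return self.pos_queries + self.mask_flag(mask.long())
```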
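Finally, the universal generative format casts every pretraining task as a prompt-plus-target text pair, with layout expressed as discretized coordinate tokens. The snippet below shows how a layout-modeling pair might be built; the prompt wording, the sentinel-token names, and the 500-bin layout vocabulary are assumptions for the sketch, not the paper's exact strings.

```python
def layout_modeling_example(tokens, bboxes, mask_ids, vocab=500):
    """Build a (source, target) pair asking the model to generate
    quantized bounding-box tokens for the selected words."""
    src, tgt = ["Layout Modeling."], []
    for i, (tok, box) in enumerate(zip(tokens, bboxes)):
        if i in mask_ids:
            # Wrap the word whose position must be predicted in sentinels.
            src.append(f"<layout_{i}> {tok} </layout_{i}>")
            # Quantize normalized (x0, y0, x1, y1) into discrete layout tokens.
            tgt.append(f"<layout_{i}>" + "".join(f"<{int(c * vocab)}>" for c in box))
        else:
            src.append(tok)
    return " ".join(src), " ".join(tgt)

src, tgt = layout_modeling_example(
    ["Invoice", "Date:", "2020-01-01"],
    [(0.1, 0.05, 0.3, 0.08), (0.1, 0.12, 0.2, 0.15), (0.25, 0.12, 0.45, 0.15)],
    mask_ids={2},
)
# src: "Layout Modeling. Invoice Date: <layout_2> 2020-01-01 </layout_2>"
# tgt: "<layout_2><125><60><225><75>"
```

The other pretraining tasks follow the same pattern with different prompts: visual text recognition swaps the roles of text and layout tokens (positions in the source, words in the target), so one sequence-to-sequence decoder covers all of them.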