# [Unifying Vision, Text, and Layout for Universal Document Processing](https://arxiv.org/abs/2212.02623)
The proposed network, UDOP, jointly models visual, textual, and layout information for several document AI tasks.
### Proposed Method :

**VTL Encoder:**
- A Vision-Text-Layout (VTL) Transformer architecture is used to dynamically fuse and unify image pixels and text tokens based on layout information.
- The document image is split into square patches; each patch is encoded as a D-dimensional vector, and the patches are grouped into a sequence of vectors.
- A unified vision-text representation is then built by layout-aware fusion (see the sketch after this list):
- For each text token embedding s<sub>i</sub> located inside image patch v<sub>j</sub>, the joint representation is the sum of the image patch feature and the text feature: s<sub>i</sub>' = s<sub>i</sub> + v<sub>j</sub>
- For image patches v<sub>j</sub> without any text tokens, the joint representation is the patch feature itself: v<sub>j</sub>' = v<sub>j</sub>
- These fused embeddings are fed into the VTL Transformer encoder.
- The 2D position of each text token is encoded as a 2D relative attention bias, as done in TILT, but without 1D position embeddings.
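
A minimal sketch of this layout-aware fusion, assuming normalized `[x0, y0, x1, y1]` token boxes and a square patch grid; the function name, shapes, and grid size are illustrative, not the paper's exact implementation:

```python
import torch

def fuse_vision_text(patch_embeds, token_embeds, token_boxes, grid=14):
    # patch_embeds: (grid*grid, D) image patch features in row-major order
    # token_embeds: (T, D) text token features
    # token_boxes:  (T, 4) normalized [x0, y0, x1, y1] boxes in [0, 1]

    # Map each token's box center to the patch that contains it
    cx = (token_boxes[:, 0] + token_boxes[:, 2]) / 2
    cy = (token_boxes[:, 1] + token_boxes[:, 3]) / 2
    col = (cx * grid).long().clamp(max=grid - 1)
    row = (cy * grid).long().clamp(max=grid - 1)
    patch_idx = row * grid + col                       # (T,)

    # s_i' = s_i + v_j: add the containing patch's feature to each text token
    fused_tokens = token_embeds + patch_embeds[patch_idx]

    # v_j' = v_j: patches holding no text token are kept unchanged
    has_token = torch.zeros(patch_embeds.size(0), dtype=torch.bool)
    has_token[patch_idx] = True
    standalone_patches = patch_embeds[~has_token]
    return fused_tokens, standalone_patches
```

The fused token embeddings and the remaining standalone patch embeddings together form the sequence fed to the VTL encoder.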
**VTL Decoder:**
- The VTL decoder consists of a text-layout decoder and a vision decoder.
- The text-layout decoder is a unidirectional Transformer decoder that generates text and layout tokens in a sequence-to-sequence manner.
- The vision decoder adopts the MAE (Masked Autoencoder) decoder, augmented with cross-attention.

- In UDOP, the encoder output cannot be fed directly to the vision decoder, since the joint embeddings contain non-masked image patches that are fused with text tokens.
- Thus the vision decoder instead takes a sequence of trainable placeholder embeddings that indicate whether each image patch is masked (see the sketch below).
- The vision decoder attends to the encoder's vision-text output AND to character embeddings via cross-attention.
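
A minimal sketch of these placeholder queries, assuming an MAE-style decoder; the module name, zero initialization, and shapes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class VisionDecoderQueries(nn.Module):
    # Builds the placeholder sequence fed to the vision decoder: one learned
    # embedding per patch position, chosen by whether that patch was masked.
    def __init__(self, num_patches, dim):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))     # "masked" marker
        self.visible_token = nn.Parameter(torch.zeros(1, 1, dim))  # "non-masked" marker
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))

    def forward(self, masked):  # masked: (B, num_patches) bool, True = masked patch
        b, n = masked.shape
        queries = torch.where(masked.unsqueeze(-1),
                              self.mask_token.expand(b, n, -1),
                              self.visible_token.expand(b, n, -1))
        # These queries then cross-attend to the encoder's vision-text output
        # (and character embeddings) inside the MAE-style decoder blocks.
        return queries + self.pos_embed
```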
**Pretraining:**
- UDOP is pretrained on large labeled and unlabeled datasets using a universal generative task format with prompts; illustrative prompt/target pairs are sketched after this list.
- Joint Text-Layout Reconstruction: a percentage (15%) of text tokens is masked, and the network must predict the masked tokens along with their bounding boxes.
- Layout Modeling: the model predicts the positions of given text tokens from the document image and the context text.
- Visual Text Recognition: the model identifies the text at a given location in the image.
- Masked Image Reconstruction with Text and Layout: the model reconstructs masked image regions conditioned on the text and layout.
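
A sketch of what the generative prompt/target pairs could look like; the sentinel spellings, the 500-bin coordinate quantization, and the sample strings are assumptions for illustration, not the paper's exact vocabulary:

```python
def layout_tokens(box, bins=500):
    # Quantize a normalized [x0, y0, x1, y1] box into discrete layout tokens
    return "".join(f"<{int(v * (bins - 1))}>" for v in box)

box = [0.32, 0.10, 0.45, 0.13]  # hypothetical box for the word "Main"

# Joint Text-Layout Reconstruction: predict the masked text AND its box
src = "Joint Text-Layout Reconstruction. Ship to: <text_layout_0> Street"
tgt = f"<text_layout_0> Main {layout_tokens(box)}"

# Layout Modeling: given the text, predict only its location
src = "Layout Modeling. Ship to: <layout_0> Main </layout_0> Street"
tgt = f"<layout_0> {layout_tokens(box)}"

# Visual Text Recognition: given a location, predict the text there
src = f"Visual Text Recognition. <text_0> {layout_tokens(box)} </text_0> Street"
tgt = "<text_0> Main"
```

All objectives share this single sequence-to-sequence interface, which is what lets one decoder handle every pretraining and downstream task.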
### Experiments :
- The datasets used for fine-tuning and evaluation are FUNSD, CORD, RVL-CDIP, DocVQA, and the datasets of the DUE-Benchmark.
- The pretrained models are fine-tuned on each evaluation dataset.
- UDOP achieves SOTA performance on all 7 tasks of the DUE-Benchmark and sets a new SOTA on CORD as well.