# [LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking](https://arxiv.org/abs/2204.08387)
The proposed network LayoutLMv3 jointly models interactions between text, visual, and layout information across scanned document images.
### Proposed Method :

- The model applies a unified text-image multimodal transformer to learn cross-modal representations.
- The text embeddings use segment-level layout positions instead of the word-level positions used in LayoutLM and LayoutLMv2.
- The image embeddings do not use a CNN to extract features; instead, image patches are linearly projected and combined with learnable 1D position embeddings (see the sketch after this list).
- Semantic relative 1D positions and spatial relative 2D positions are inserted as bias terms in the self-attention, as in LayoutLMv2.
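
A minimal sketch of the ViT-style image embedding described above (linear projection of patches plus learnable 1D position embeddings). The hyperparameters here (224x224 input, 16x16 patches, 768-dim hidden size) are illustrative assumptions, not values stated in this summary:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Linear projection of non-overlapping image patches + learnable 1D positions."""

    def __init__(self, image_size=224, patch_size=16, in_channels=3, hidden_size=768):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # A conv with stride == kernel_size is equivalent to a linear projection
        # applied to each flattened patch.
        self.proj = nn.Conv2d(in_channels, hidden_size,
                              kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, hidden_size))

    def forward(self, pixel_values):
        # pixel_values: (batch, 3, H, W) -> (batch, num_patches, hidden_size)
        x = self.proj(pixel_values).flatten(2).transpose(1, 2)
        return x + self.pos_embed
```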
**Pretraining:**
- Three pretraining objectives are used: masked language modeling (MLM), masked image modeling (MIM), and word-patch alignment (WPA).
- In masked language modeling, text tokens are masked with a span masking strategy whose span lengths are drawn from a Poisson distribution, and the masked tokens are reconstructed with a cross-entropy loss (see the masking sketch after this list).
- Masked image modeling mirrors MLM on the image side, masking image tokens with a blockwise masking strategy (also sketched below).
- Word-patch alignment helps bridge the text and image modalities.
- It assigns an "aligned" label to an unmasked text token whose corresponding image patch is also unmasked (and "unaligned" otherwise), trained with a cross-entropy loss (see the sketch below).
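
A rough sketch of the span masking used for MLM. The 30% mask budget and the Poisson mean of 3 are illustrative assumptions:

```python
import numpy as np

def span_mask(num_tokens, mask_ratio=0.3, poisson_lambda=3, rng=None):
    """Select token positions to mask in contiguous spans whose lengths
    are drawn from a Poisson distribution."""
    rng = rng or np.random.default_rng()
    budget = int(num_tokens * mask_ratio)
    masked = set()
    while len(masked) < budget:
        span_len = max(1, rng.poisson(poisson_lambda))      # span length >= 1
        start = rng.integers(0, num_tokens)                  # random span start
        for i in range(start, min(start + span_len, num_tokens)):
            masked.add(i)
    return sorted(masked)[:budget]
```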
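For MIM, a blockwise (BEiT-style) masking sketch over the patch grid; the block-size bounds and the 40% ratio are assumptions for illustration:

```python
import numpy as np

def blockwise_mask(grid_h, grid_w, mask_ratio=0.4, rng=None):
    """Mask rectangular blocks of image patches until a target ratio is reached."""
    rng = rng or np.random.default_rng()
    mask = np.zeros((grid_h, grid_w), dtype=bool)
    target = int(grid_h * grid_w * mask_ratio)
    while mask.sum() < target:
        h = rng.integers(2, max(3, grid_h // 2))             # block height
        w = rng.integers(2, max(3, grid_w // 2))             # block width
        top = rng.integers(0, grid_h - h + 1)
        left = rng.integers(0, grid_w - w + 1)
        mask[top:top + h, left:left + w] = True
    return mask
```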
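And a sketch of how WPA labels could be assigned, assuming each text token has been mapped to the index of the image patch covering it (`token_patch_ids` is a hypothetical precomputed alignment, not part of the paper's stated interface):

```python
import torch

def wpa_labels(token_patch_ids, image_patch_mask, text_token_mask):
    """Label an unmasked text token 'aligned' (1) if its corresponding image
    patch is unmasked, 'unaligned' (0) otherwise; masked text tokens are
    excluded from the cross-entropy loss via the -100 ignore index."""
    patch_is_masked = image_patch_mask[token_patch_ids]   # (num_tokens,) bool
    labels = (~patch_is_masked).long()                     # 1 = aligned, 0 = unaligned
    labels[text_token_mask] = -100                         # drop masked text tokens
    return labels
```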
### Experiments :
- The datasets used are FUNSD, CORD, RVL-CDIP, DocVQA, and PubLayNet, and the model is pretrained on IIT-CDIP.
- The evaluation metrics used are mAP@IoU, F1 score, ANLS, and accuracy.
- Two model sizes are used: BASE and LARGE.
- The model outperforms prior SOTA methods across these tasks.