# [LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking](https://arxiv.org/abs/2204.08387)

The proposed network LayoutLMv3 jointly models interactions between text, visual and layout information across scanned document images.

### Proposed Method :

![](https://hackmd.io/_uploads/ByLccMVKn.png)

- The model applies a unified text-image multimodal transformer to learn cross-modal representations.
- The text embeddings use segment-level layout positions instead of the word-level positions used in LayoutLM and LayoutLMv2.
- The image embeddings don't use a CNN to extract features; the image is split into patches that are linearly projected and combined with learnable 1D position embeddings (see the patch-embedding sketch below).
- Semantic relative 1D positions and spatial relative 2D positions are inserted as bias terms in the self-attention, as in LayoutLMv2.

**Pretraining:**

- Three objectives are used: masked language modeling (MLM), masked image modeling (MIM) and word-patch alignment (WPA).
- In masked language modeling, text tokens are masked with a span masking strategy whose span lengths are drawn from a Poisson distribution, and the masked tokens are reconstructed with a cross-entropy loss (see the span-masking sketch below).
- Masked image modeling is analogous to MLM but masks image tokens with a blockwise masking strategy.
- Word-patch alignment helps bridge the text and image modalities.
- It assigns an "aligned" label to an unmasked text token whose corresponding image patches are also unmasked, and is trained with a cross-entropy loss (see the sketch below).

### Experiments :

- The datasets used are FUNSD, CORD, RVL-CDIP, DocVQA and PubLayNet, and the model is pretrained on IIT-CDIP.
- The evaluation metrics are mAP@IoU, F-score, ANLS and accuracy.
- Two model sizes are used: BASE and LARGE.
- The model outperforms the previous SOTA methods across these subtasks.
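
To make the CNN-free image embedding concrete, here is a minimal PyTorch sketch of a ViT-style patch embedding: the page image is split into patches, each patch is linearly projected, and a learnable 1D position embedding is added. The class name and the specific sizes (224×224 input, 16×16 patches, 768-dim hidden size) are illustrative assumptions for the BASE configuration, not the authors' code.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """ViT-style image embedding: split the page image into patches,
    project each patch linearly, then add a learnable 1D position
    embedding. No CNN backbone is involved. Sizes are assumptions."""
    def __init__(self, image_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        self.num_patches = (image_size // patch_size) ** 2
        # A conv with stride == kernel_size is the standard way to
        # implement a per-patch linear projection.
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, dim))

    def forward(self, pixel_values):          # (B, 3, H, W)
        x = self.proj(pixel_values)           # (B, dim, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)      # (B, num_patches, dim)
        return x + self.pos_embed

emb = PatchEmbedding()
print(emb(torch.randn(2, 3, 224, 224)).shape)  # torch.Size([2, 196, 768])
```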
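
The span masking used for MLM can be sketched as follows: token indices are chosen by repeatedly sampling span lengths from a Poisson distribution until a target mask budget is reached. The 30% mask ratio and λ = 3 follow the settings reported in the paper, while the function name, the span cap and the exact sampling loop are assumptions made for illustration.

```python
import numpy as np

def span_mask(num_tokens, mask_ratio=0.3, poisson_lam=3.0, max_span=10, rng=None):
    """Pick text-token indices to mask using spans whose lengths are
    drawn from a (clipped) Poisson distribution. Illustrative sketch,
    not the released pre-training code."""
    rng = rng or np.random.default_rng()
    budget = int(num_tokens * mask_ratio)      # total tokens to mask
    masked = set()
    while len(masked) < budget:
        span_len = min(max(1, rng.poisson(poisson_lam)), max_span)
        start = rng.integers(0, num_tokens)
        for i in range(start, min(start + span_len, num_tokens)):
            masked.add(i)
            if len(masked) >= budget:
                break
    return sorted(masked)

# Example: mask roughly 30% of a 20-token sequence
print(span_mask(20))
```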
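
Finally, a toy sketch of how word-patch alignment targets could be formed, assuming each text token is already mapped to the image patch its bounding box falls in (`token_patch_index`). An unmasked text token is labelled "aligned" when its patch was not masked by MIM, and masked text tokens are excluded from the loss. All names here are hypothetical and the logic is a simplification of the objective, not the released implementation.

```python
import torch
import torch.nn.functional as F

def wpa_labels(token_patch_index, text_mask, patch_mask):
    """Word-patch alignment targets: label 1 ('aligned') for an unmasked
    text token whose covering image patch is also unmasked, 0 otherwise.
    Only unmasked text tokens contribute to the loss."""
    aligned = (~patch_mask[token_patch_index]).long()  # 1 = patch kept
    keep = ~text_mask                                  # unmasked text tokens only
    return aligned, keep

# Toy example: 6 text tokens mapped onto 4 image patches
token_patch_index = torch.tensor([0, 0, 1, 2, 3, 3])
text_mask  = torch.tensor([False, True, False, False, True, False])  # MLM mask
patch_mask = torch.tensor([False, True, False, True])                # MIM mask

labels, keep = wpa_labels(token_patch_index, text_mask, patch_mask)
logits = torch.randn(6, 2)                  # per-token binary classifier output
loss = F.cross_entropy(logits[keep], labels[keep])
print(labels, keep, loss.item())
```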