# [LayoutLMv2: Multi-modal Pre-training for Visually-rich Document Understanding](https://arxiv.org/abs/2012.14740)

The proposed network, LayoutLMv2, builds on LayoutLM for visually-rich document understanding.

### Proposed Method:

![](https://hackmd.io/_uploads/BJtB9-EYh.png)

- The authors use a multi-modal Transformer architecture as the backbone, together with a spatial-aware self-attention mechanism to model the document layout.
- The text embedding is the sum of a token embedding obtained with WordPiece, a 1D positional embedding, and a segment embedding that distinguishes text segments.
- The visual encoder uses a ResNeXt-FPN architecture as the backbone; its output feature map is flattened into a sequence and linearly projected to the hidden size.
- The final visual embedding is the sum of the projected visual feature, a 1D positional embedding, and a segment embedding.
- The layout embedding encodes the token bounding boxes (2D position embeddings). (A minimal sketch of how these embeddings are composed is given at the end of this note.)

**Spatial-Aware Self-Attention Mechanism:**

- This is used to model the local invariance in the document layout.
- This is done by using relative 1D and 2D position information.
- Regular self-attention is given by:
![](https://hackmd.io/_uploads/S1uEhZNY2.png)
- To add relative position information, learnable biases are added so that the attention score becomes:
![](https://hackmd.io/_uploads/H1XDhZVYh.png)
- The output vectors are weighted averages of the projected value vectors (see the attention-bias sketch at the end of this note):
![](https://hackmd.io/_uploads/S1v93-4t2.png)

**Pretraining:**

- Masked Visual-Language Modeling (MVLM) is used, similar to LayoutLM.
- Text-Image Alignment (TIA) helps the model learn the spatial correspondence between image regions and the coordinates of the text bounding boxes: some text lines are covered on the image and the model predicts whether each token's line is covered.
- Text-Image Matching (TIM) helps the model learn the correspondence between the document image and the textual content.
- Cross-entropy losses, together with the masking (MVLM) and covering (TIA) strategies, are used during optimisation.

### Experiments:

- The fine-tuning datasets are FUNSD, CORD, SROIE, Kleister-NDA, and RVL-CDIP, and the model is pretrained on IIT-CDIP.
- Two setups are used, BASE and LARGE, with the latter having more than twice the parameters of BASE.
- The benchmark models are UniLMv2, BERT, and LayoutLM.
- The model outperforms the previous SOTA not only on VrDU tasks but also on document VQA (the DocVQA dataset).
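
**Sketch: composing the text, visual, and layout embeddings.** To make the embedding composition described above concrete, here is a minimal PyTorch sketch. The class name, dimensions, segment IDs, coordinate range, and the way the box coordinates are combined (summation) are illustrative assumptions, not the paper's exact implementation; the projection of the ResNeXt-FPN features to the hidden size is assumed to happen upstream.

```python
import torch
import torch.nn as nn

class LayoutEmbeddingSketch(nn.Module):
    """Sketch: sum token/visual features with 1D, 2D (layout), and segment embeddings."""

    def __init__(self, vocab_size=30522, hidden=768, max_pos=512, coord_size=1001, num_segments=3):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, hidden)   # WordPiece token embedding
        self.pos_emb = nn.Embedding(max_pos, hidden)      # 1D positional embedding
        self.seg_emb = nn.Embedding(num_segments, hidden) # segment embedding (text vs. visual)
        # 2D layout embedding tables, assuming coordinates normalized to integers in [0, 1000].
        self.x_emb = nn.Embedding(coord_size, hidden)
        self.y_emb = nn.Embedding(coord_size, hidden)
        self.w_emb = nn.Embedding(coord_size, hidden)
        self.h_emb = nn.Embedding(coord_size, hidden)

    def layout(self, boxes):
        # boxes: (batch, seq, 4) long tensor of (x0, y0, x1, y1), normalized to [0, 1000].
        # Combined by summation here as a simplification.
        x0, y0, x1, y1 = boxes.unbind(-1)
        return (self.x_emb(x0) + self.x_emb(x1) + self.y_emb(y0) + self.y_emb(y1)
                + self.w_emb(x1 - x0) + self.h_emb(y1 - y0))

    def forward(self, token_ids, boxes, visual_feats, visual_boxes):
        # Text side: token + 1D position + segment + layout embeddings.
        t_pos = torch.arange(token_ids.size(1), device=token_ids.device).unsqueeze(0)
        text = (self.tok_emb(token_ids) + self.pos_emb(t_pos)
                + self.seg_emb(torch.zeros_like(token_ids)) + self.layout(boxes))
        # Visual side: projected ResNeXt-FPN features (already hidden-sized here)
        # + 1D position + segment + layout embeddings of the image-grid boxes.
        v_pos = torch.arange(visual_feats.size(1), device=token_ids.device).unsqueeze(0)
        v_seg = torch.full(visual_feats.shape[:2], 2, dtype=torch.long, device=token_ids.device)
        visual = (visual_feats + self.pos_emb(v_pos)
                  + self.seg_emb(v_seg) + self.layout(visual_boxes))
        # Concatenate visual and text tokens into one sequence for the Transformer encoder.
        return torch.cat([visual, text], dim=1)
```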
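
**Sketch: spatial-aware self-attention bias.** The spatial-aware mechanism adds learnable relative-position biases to the vanilla attention scores, i.e. roughly $\alpha'_{ij} = \alpha_{ij} + b^{1D}_{j-i} + b^{2D_x}_{x_j-x_i} + b^{2D_y}_{y_j-y_i}$. Below is a minimal single-head sketch; the clipping ranges, bias-table sizes, and single-head layout are assumptions for illustration, not the paper's exact parameterization, and the sketch only shows where the bias terms enter the score.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAwareSelfAttentionSketch(nn.Module):
    """One attention head with relative 1D and 2D (x, y) position biases."""

    def __init__(self, hidden=768, head_dim=64, max_rel_1d=128, max_rel_2d=256):
        super().__init__()
        self.q = nn.Linear(hidden, head_dim)
        self.k = nn.Linear(hidden, head_dim)
        self.v = nn.Linear(hidden, head_dim)
        self.scale = head_dim ** -0.5
        # Learnable bias tables indexed by clipped relative distances (assumed clipping ranges).
        self.bias_1d = nn.Embedding(2 * max_rel_1d + 1, 1)
        self.bias_2d_x = nn.Embedding(2 * max_rel_2d + 1, 1)
        self.bias_2d_y = nn.Embedding(2 * max_rel_2d + 1, 1)
        self.max_rel_1d = max_rel_1d
        self.max_rel_2d = max_rel_2d

    @staticmethod
    def _rel_bias(table, positions, max_rel):
        # positions: (batch, seq) long tensor; diff[b, i, j] = p_j - p_i, clipped and shifted.
        diff = positions.unsqueeze(1) - positions.unsqueeze(2)
        diff = diff.clamp(-max_rel, max_rel) + max_rel
        return table(diff).squeeze(-1)            # (batch, seq, seq)

    def forward(self, x, pos_1d, box_x, box_y):
        # x: (batch, seq, hidden); pos_1d, box_x, box_y: (batch, seq) long tensors.
        q, k, v = self.q(x), self.k(x), self.v(x)
        scores = torch.matmul(q, k.transpose(-1, -2)) * self.scale
        # Add relative 1D and 2D position biases to the raw attention scores.
        scores = (scores
                  + self._rel_bias(self.bias_1d, pos_1d, self.max_rel_1d)
                  + self._rel_bias(self.bias_2d_x, box_x, self.max_rel_2d)
                  + self._rel_bias(self.bias_2d_y, box_y, self.max_rel_2d))
        # Output vectors are weighted averages of the projected value vectors.
        attn = F.softmax(scores, dim=-1)
        return torch.matmul(attn, v)
```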