# [LayoutLMv2: Multi-modal Pre-training for Visually-rich Document Understanding](https://arxiv.org/abs/2012.14740)
The proposed network, LayoutLMv2, builds on LayoutLM for visually-rich document understanding.
### Proposed Method :

- The authors use a multi-modal Transformer architecture as the backbone, together with a spatial-aware self-attention mechanism to model the document layout.
- The text embedding is the sum of a WordPiece token embedding, a 1D positional embedding, and a segment embedding that distinguishes text segments.
- The visual encoder uses a ResNeXt-FPN backbone; its output feature map is flattened and then linearly projected into the embedding space.
- The final visual embedding is the sum of the projected visual token embedding, a 1D positional embedding, and a segment embedding.
- The layout embedding encodes the bounding boxes (2D position embeddings); a minimal sketch of how these embeddings combine follows this list.
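
Below is a minimal PyTorch-style sketch of how the token, 1D-position, segment, and 2D-layout embeddings described above could be combined. The class name, dimensions, and coordinate bucketing are assumptions for illustration, not the authors' exact implementation (which also includes details such as width/height terms, omitted here for brevity).

```python
import torch
import torch.nn as nn

class LayoutEmbeddingsSketch(nn.Module):
    """Illustrative only: sums token, 1D-position, segment, and 2D-layout
    embeddings as described in the summary above. Hyper-parameters
    (vocab_size, hidden=768, max_pos, coord_bins) are assumptions."""
    def __init__(self, vocab_size=30522, hidden=768, max_pos=512, coord_bins=1024):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, hidden)   # WordPiece token ids
        self.pos_1d = nn.Embedding(max_pos, hidden)   # 1D position in the sequence
        self.seg = nn.Embedding(4, hidden)            # segment id (e.g., text vs. image)
        # 2D layout: x0, y0, x1, y1 of each token's bounding box, bucketed to coord_bins
        self.x_emb = nn.Embedding(coord_bins, hidden)
        self.y_emb = nn.Embedding(coord_bins, hidden)

    def forward(self, token_ids, positions, segment_ids, boxes):
        # boxes: (batch, seq, 4) holding normalized integer coordinates (x0, y0, x1, y1)
        layout = (self.x_emb(boxes[..., 0]) + self.y_emb(boxes[..., 1])
                  + self.x_emb(boxes[..., 2]) + self.y_emb(boxes[..., 3]))
        text = self.tok(token_ids) + self.pos_1d(positions) + self.seg(segment_ids)
        return text + layout
```
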
**Spatial Aware Self-attention Mechanism:**
- This is used to model the local invariance in the document layout.
- This is done by incorporating relative 2D position information into the attention scores.
- Regular self-attention computes the score between tokens $i$ and $j$ as:

$$\alpha_{ij} = \frac{1}{\sqrt{d_{\mathrm{head}}}}\left(x_i W^Q\right)\left(x_j W^K\right)^\top$$
- To incorporate relative position information, learnable bias terms (a 1D bias for token distance and 2D biases for the horizontal and vertical bounding-box offsets) are added, so that the attention score becomes:

$$\alpha'_{ij} = \alpha_{ij} + \mathbf{b}^{(1D)}_{j-i} + \mathbf{b}^{(2D_x)}_{x_j - x_i} + \mathbf{b}^{(2D_y)}_{y_j - y_i}$$
- The output vectors are computed as softmax-weighted averages of the projected value vectors (a minimal code sketch of this biased attention follows):

$$h_i = \sum_j \frac{\exp(\alpha'_{ij})}{\sum_k \exp(\alpha'_{ik})}\, x_j W^V$$
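
A minimal single-head sketch of the spatial-aware attention described above, assuming the relative-position biases have already been gathered into dense (seq, seq) tensors. Function names, shapes, and the bias bucketing are assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def spatial_attention_scores(x, w_q, w_k, bias_1d, bias_2d_x, bias_2d_y):
    """Illustrative single-head attention scores with relative-position biases.
    x: (seq, hidden); w_q, w_k: (hidden, d_head);
    bias_*: (seq, seq) tensors of learned relative biases (1D token distance,
    2D x-offset, 2D y-offset) - how they are bucketed/gathered is an assumption."""
    d_head = w_q.shape[1]
    q = x @ w_q                                        # (seq, d_head)
    k = x @ w_k                                        # (seq, d_head)
    alpha = (q @ k.T) / d_head ** 0.5                  # vanilla attention scores
    return alpha + bias_1d + bias_2d_x + bias_2d_y     # spatial-aware scores

def spatial_attention_output(x, w_v, alpha):
    """Output vectors as softmax-weighted averages of projected value vectors."""
    v = x @ w_v                                        # (seq, d_head)
    return F.softmax(alpha, dim=-1) @ v

# Example shapes: 128 tokens, hidden 768, head dim 64 (illustrative values)
x = torch.randn(128, 768)
w_q, w_k, w_v = (torch.randn(768, 64) for _ in range(3))
biases = [torch.zeros(128, 128) for _ in range(3)]     # stand-ins for learned biases
alpha = spatial_attention_scores(x, w_q, w_k, *biases)
h = spatial_attention_output(x, w_v, alpha)            # (128, 64)
```
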

**Pretraining:**
- Masked Visual-Language Modeling (MVLM) is used, similar to LayoutLM.
- Text-Image Alignment (TIA) is used to help the model learn the spatial correspondence between image regions and the coordinates of the bounding boxes.
- Text-Image Matching (TIM) is applied to help the model learn the correspondence between the document image and the textual content.
- Cross-entropy loss with a masking and covering strategy is used during optimisation; a sketch of how the losses might combine follows this list.
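
A hedged sketch of how the three pre-training objectives could be combined with cross-entropy losses. The shapes, label conventions (e.g., ignore index -100 for unmasked tokens), and equal loss weighting are assumptions for illustration, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def pretraining_loss(mvlm_logits, mvlm_labels,
                     tia_logits, tia_labels,
                     tim_logits, tim_labels):
    """Illustrative combination of the three objectives with cross-entropy.
      - MVLM: predict masked tokens over the vocabulary (unmasked positions = -100)
      - TIA: per-token binary label, image region covered vs. not covered
      - TIM: per-document binary label, image matches the text or not"""
    mvlm = F.cross_entropy(mvlm_logits.view(-1, mvlm_logits.size(-1)),
                           mvlm_labels.view(-1), ignore_index=-100)
    tia = F.cross_entropy(tia_logits.view(-1, 2), tia_labels.view(-1))
    tim = F.cross_entropy(tim_logits, tim_labels)
    return mvlm + tia + tim  # equal weighting is an assumption
```
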
### Experiments :
- The datasets used are FUNSD, CORD, SROIE, Kleister-NDA, and RVL-CDIP, and the model is pre-trained on IIT-CDIP.
- Two setups are used, BASE and LARGE, with the latter having more than twice the parameters of BASE.
- The baseline models are UniLMv2, BERT, and LayoutLM.
- The model outperforms the previous SOTA not only on VrDU tasks but also on the VQA task.