# [LayoutLM: Pre-training of Text and Layout for Document Image Understanding](https://arxiv.org/abs/1912.13318)

The proposed network LayoutLM jointly models the interactions between text and layout information across scanned document images.

### Proposed Method :

![](https://hackmd.io/_uploads/SkWt84WK2.png)

- Building on the BERT architecture, the LayoutLM model encodes two additional types of information to improve performance.
    - Document Layout Information: embedded using 2-D position representations. Embedding these features into the language representation better aligns layout information with the semantic representation.
        - 2-D Position Embedding: models the relative spatial position of a token within a document. Each bounding box is defined by the coordinates of its upper-left and lower-right corners.
    - Visual Information: an important feature in document representation, captured through image features. Combining these features with the text representations brings richer semantics to the document representation.
        - Image Embedding: an image embedding layer is added by splitting the page image into pieces that correspond one-to-one with the words in the OCR bounding boxes.
        - The features are generated with a Faster R-CNN model, using the pieces as token image embeddings; for the [CLS] token, the entire image is taken as the ROI.
- The model is pre-trained on two tasks.
    - Masked Visual-Language Model (MVLM): the objective is to learn language representations with the help of the 2-D position and text embeddings. Some input tokens are masked, but their corresponding 2-D positions are kept; this way, the model utilises the position information, bridging the gap between the language and visual modalities.
    - Multi-label Document Classification (MDC): document tags are used to supervise the pre-training so that knowledge from different domains is clustered into better document-level representations. This is done by incorporating a Multi-label Document Classification (MDC) loss.
- The pre-trained LayoutLM model is fine-tuned on three tasks: form understanding, receipt understanding, and document image classification.
    - For the first two tasks, the model predicts a tag for each token and uses sequence labelling to detect each type of entity; for the last task, it predicts the class label from the representation of the [CLS] token.
- Minimal code sketches of these components are collected at the end of this note.

### Experiments and Results :

- The dataset used for pre-training is the IIT-CDIP Test Collection 1.0; the model is fine-tuned on three datasets: FUNSD, SROIE, and RVL-CDIP.
- The evaluation metrics are F1 score and classification accuracy.
- The pre-trained model is initialised with BERT weights and then fine-tuned for the specific tasks.
- The benchmark models are the BERT and RoBERTa BASE and LARGE models.
- LayoutLM substantially outperforms several SOTA pre-trained models on all three tasks (form understanding, receipt understanding, and document image classification).
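A minimal PyTorch sketch of the 2-D position embedding described above. The class name, vocabulary size, and coordinate range are illustrative assumptions rather than the authors' code; following the paper, the x- and y-tables are shared between the two corners of each box.

```python
import torch
import torch.nn as nn

class Layout2DEmbedding(nn.Module):
    """Embeds a token's bounding box (x0, y0, x1, y1) with coordinate
    lookup tables and adds the result to the token embedding."""
    def __init__(self, vocab_size=30522, hidden_size=768, max_position=1024):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, hidden_size)
        # One table per axis, shared between the upper-left and
        # lower-right corners, as in the paper.
        self.x_emb = nn.Embedding(max_position, hidden_size)
        self.y_emb = nn.Embedding(max_position, hidden_size)

    def forward(self, token_ids, boxes):
        # token_ids: (batch, seq_len); boxes: (batch, seq_len, 4) with
        # coordinates already discretised into [0, max_position).
        x0, y0, x1, y1 = boxes.unbind(dim=-1)
        return (self.word_emb(token_ids)
                + self.x_emb(x0) + self.y_emb(y0)
                + self.x_emb(x1) + self.y_emb(y1))

emb = Layout2DEmbedding()
tokens = torch.randint(0, 30522, (1, 8))
boxes = torch.randint(0, 1024, (1, 8, 4))
print(emb(tokens, boxes).shape)  # torch.Size([1, 8, 768])
```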
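A sketch of the per-token image embedding. A single convolution stands in for the Faster R-CNN backbone here, and `torchvision.ops.roi_align` pools one feature per OCR word box, with the full page as the ROI for the [CLS] token; the layer sizes and box values are assumptions.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align

backbone = nn.Conv2d(3, 64, kernel_size=3, padding=1)  # placeholder for the Faster R-CNN backbone
project = nn.Linear(64, 768)                           # map ROI features to the hidden size

page = torch.randn(1, 3, 224, 224)                     # one document image
word_boxes = torch.tensor([[10., 10., 80., 30.],       # OCR word boxes (x0, y0, x1, y1)
                           [90., 10., 200., 30.]])
cls_box = torch.tensor([[0., 0., 224., 224.]])         # whole page as the [CLS] ROI

features = backbone(page)                              # (1, 64, 224, 224)
rois = roi_align(features, [torch.cat([cls_box, word_boxes])],
                 output_size=(1, 1), spatial_scale=1.0)
image_emb = project(rois.flatten(1))                   # (3, 768): [CLS] + 2 word pieces
print(image_emb.shape)
```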
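A sketch of the MVLM input corruption: token ids are randomly masked while their 2-D positions pass through unchanged, so the model can use layout cues to predict the masked words. The 15% masking rate and [MASK] id follow BERT conventions and are assumptions here.

```python
import torch

MASK_ID = 103                                   # BERT's [MASK] token id

def mvlm_mask(token_ids, boxes, mask_prob=0.15):
    labels = token_ids.clone()
    masked = torch.rand(token_ids.shape) < mask_prob
    labels[~masked] = -100                      # ignore unmasked positions in the loss
    corrupted = token_ids.masked_fill(masked, MASK_ID)
    return corrupted, boxes, labels             # boxes are kept intact

tokens = torch.randint(1000, 2000, (1, 8))
boxes = torch.randint(0, 1024, (1, 8, 4))
inputs, boxes_out, labels = mvlm_mask(tokens, boxes)
```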
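A sketch of the MDC objective: a linear head over the [CLS] representation scored with binary cross-entropy, one logit per document tag. The number of tags is illustrative.

```python
import torch
import torch.nn as nn

num_tags, hidden = 16, 768
mdc_head = nn.Linear(hidden, num_tags)
criterion = nn.BCEWithLogitsLoss()

cls_repr = torch.randn(4, hidden)                  # [CLS] outputs for a batch of 4 documents
tags = torch.randint(0, 2, (4, num_tags)).float()  # multi-hot document tags
loss = criterion(mdc_head(cls_repr), tags)
```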
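A sketch of the two fine-tuning heads: a per-token classifier for the sequence-labelling tasks (FUNSD, SROIE) and a [CLS]-based classifier for document image classification (RVL-CDIP, which has 16 classes). The entity tag count is illustrative.

```python
import torch
import torch.nn as nn

hidden = 768
token_head = nn.Linear(hidden, 7)    # e.g. BIO tags over three entity types, plus "other"
doc_head = nn.Linear(hidden, 16)     # the 16 RVL-CDIP document classes

seq_out = torch.randn(2, 128, hidden)   # encoder outputs (batch, seq_len, hidden)
entity_logits = token_head(seq_out)     # one tag distribution per token
class_logits = doc_head(seq_out[:, 0])  # [CLS] is the first token
```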