###### tags: `2022Q1技術研討`, `detection`, `ocr`

# LayoutXLM

> Multimodal Pre-training for Multilingual Visually-rich Document Understanding — a layout-aware language model combining CV and NLP

## Objective

After full-document OCR, how do we turn the raw recognition results into meaningful, structured information?

![](https://i.imgur.com/8HbEIY2.png)

## Contribution

- A multi-modal Transformer model that integrates image, text, and layout information
- Pre-training strategies that fuse image, text, and layout signals during the pre-training stage

## Model Architecture

![](https://i.imgur.com/ZJTSBbR.png)

### Input Feature

- Text Embeddings (see the embedding sketch at the end of this note)
    - Text token embeddings
        - each distinct (sub)word corresponds to one token
    - 1D position embedding
        - encodes the sequential index of the token
    - 2D position embedding
        - bounding-box position (x_min, x_max, y_min, y_max, w, h)
    - Segment embeddings
        - {A, B}, distinguishing different text segments
- Vision Embeddings (see the vision-branch sketch at the end of this note)
    - Visual token embeddings
        1. The image is resized to 224 × 224 and fed into the visual backbone (ResNeXt-FPN)
        2. The output feature map is average-pooled to a fixed size with width W and height H
        3. It is then flattened into a visual embedding sequence of length W × H
        4. A linear projection layer is applied to each visual token embedding to unify its dimension with the text embeddings
    - 1D position embedding
        - encodes the sequential index of the visual token
    - 2D position embedding
        - the bounding-box position (x_min, x_max, y_min, y_max, w, h) corresponding to each visual token
    - Segment embeddings
        - all visual tokens belong to the same segment

### Pretraining Objectives

- Masked Visual-Language Modeling (see the masking sketch at the end of this note)
    - As in BERT, some text token embeddings are randomly masked during training, and the model predicts the original tokens
    - Makes the model learn the language side better with cross-modality clues
- Text-Image Alignment
    - Some regions of the image are randomly covered, and for each text token the model predicts from its output embedding whether that token's image region is covered
    - Helps the model capture the fine-grained alignment relationship between text and image
- Text-Image Matching
    - For part of the training samples, the image is replaced with a different image while the text is kept unchanged, and the model predicts from the [CLS] output embedding whether the image has been swapped
    - Aligns the high-level semantic representations of text and image

> 30 million documents are used to pre-train LayoutXLM.

### Fine Tune Objectives

![](https://i.imgur.com/y5VhZme.png)

#### Semantic Entity Recognition

- Discrete token set $t = \{t_0, t_1, t_2, \dots, t_n\}$
- Each token $t_i = (w, (x_0, y_0, x_1, y_1))$ consists of a word $w$ and its bounding box $(x_0, y_0, x_1, y_1)$
- $C = \{c_0, c_1, c_2, \dots, c_m\}$ is the set of semantic entity labels into which the tokens are classified
- Objective: find a function $F_{SER}: (t, C) \rightarrow \mathcal{E}$, where $\mathcal{E} = \{(\{t_0^0, \dots, t_0^{n_0}\}, c_0), \dots, (\{t_k^0, \dots, t_k^{n_k}\}, c_k)\}$

> Classify the output embedding of each input token into one of the N semantic entity labels (N-class classification); a sketch of the classification head follows at the end of this note.

#### Relation Extraction

- For each pair of semantic entities, predict whether a relation exists between them (2-class)

> Concatenate the output embeddings of any two entities and feed the result into a binary classifier; see the sketch at the end of this note.
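To make the Input Feature section concrete, here is a minimal sketch of how the text-side embeddings could be summed: token + 1D position + 2D layout + segment, in the style of the LayoutLM family (coordinates bucketed into integer bins). The bucket count, hidden size, and vocabulary size below are illustrative assumptions, not values taken from the paper.

```python
import torch
import torch.nn as nn

class TextLayoutEmbedding(nn.Module):
    """Sketch: token + 1D position + 2D layout + segment embeddings, summed.
    vocab_size / hidden / coord_bins are placeholder assumptions."""
    def __init__(self, vocab_size=250002, hidden=768, max_pos=512, coord_bins=1024):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, hidden)
        self.pos1d = nn.Embedding(max_pos, hidden)
        self.x = nn.Embedding(coord_bins, hidden)   # shared for x_min / x_max
        self.y = nn.Embedding(coord_bins, hidden)   # shared for y_min / y_max
        self.w = nn.Embedding(coord_bins, hidden)
        self.h = nn.Embedding(coord_bins, hidden)
        self.seg = nn.Embedding(3, hidden)          # segments A / B / visual

    def forward(self, ids, bbox, seg):
        # ids, seg: (B, L); bbox: (B, L, 4) = (x_min, y_min, x_max, y_max), in [0, coord_bins)
        pos = torch.arange(ids.size(1), device=ids.device)
        layout = (self.x(bbox[..., 0]) + self.y(bbox[..., 1]) +
                  self.x(bbox[..., 2]) + self.y(bbox[..., 3]) +
                  self.w(bbox[..., 2] - bbox[..., 0]) +
                  self.h(bbox[..., 3] - bbox[..., 1]))
        return self.tok(ids) + self.pos1d(pos) + layout + self.seg(seg)
```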
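Similarly, a sketch of the vision branch from the same section: backbone feature map → average pool to a fixed H × W → flatten → linear projection. A plain ResNet-50 stands in for the paper's ResNeXt-FPN backbone here, and `hidden_size`, `H`, `W` are assumptions.

```python
import torch.nn as nn
import torchvision

class VisualEmbedding(nn.Module):
    """Sketch of the vision branch: backbone -> avg-pool to H x W -> flatten -> project."""
    def __init__(self, hidden_size=768, H=7, W=7):
        super().__init__()
        backbone = torchvision.models.resnet50(weights=None)
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool + fc
        self.pool = nn.AdaptiveAvgPool2d((H, W))     # fixed H x W feature map
        self.proj = nn.Linear(2048, hidden_size)     # unify channel dim with text embeddings

    def forward(self, images):                       # images: (B, 3, 224, 224)
        feat = self.pool(self.backbone(images))      # (B, 2048, H, W)
        feat = feat.flatten(2).transpose(1, 2)       # (B, H*W, 2048) visual token sequence
        return self.proj(feat)                       # (B, H*W, hidden_size)
```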
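For the Masked Visual-Language Modeling objective, a BERT-style masking routine could look like the sketch below. The 15% rate, the 80/10/10 replacement split, and the `-100` ignore index follow common MLM practice rather than the paper's exact recipe, and blanking the masked tokens' image regions is omitted.

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mask_prob=0.15):
    """BERT-style masking: ~15% of positions are selected; of those, 80% become
    [MASK], 10% become a random token, 10% stay unchanged. Loss is computed
    only on selected positions (labels are -100 elsewhere)."""
    labels = input_ids.clone()
    masked = torch.rand(input_ids.shape) < mask_prob
    labels[~masked] = -100
    input_ids = input_ids.clone()
    replace = masked & (torch.rand(input_ids.shape) < 0.8)
    input_ids[replace] = mask_token_id
    randomize = masked & ~replace & (torch.rand(input_ids.shape) < 0.5)
    input_ids[randomize] = torch.randint(vocab_size, input_ids.shape)[randomize]
    return input_ids, labels
```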
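The SER fine-tuning head described above is then just a token-level linear classifier over the encoder's output embeddings. A minimal sketch, where `hidden_size` and `num_labels` are placeholders (XFUND-style datasets typically use a handful of labels such as header/question/answer/other, often in BIO form):

```python
import torch.nn as nn

class SERHead(nn.Module):
    """Token-level classifier: map each output embedding to one of num_labels entity labels."""
    def __init__(self, hidden_size=768, num_labels=7):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, sequence_output):              # (B, seq_len, hidden_size)
        return self.classifier(sequence_output)      # (B, seq_len, num_labels) logits
```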
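Finally, a sketch of the relation-extraction head as described above: the output embeddings of an entity pair are concatenated and passed through a small MLP for the 2-class (related / not related) decision. The MLP shape is an assumption; the actual head used with LayoutXLM may differ (e.g., adding entity-type embeddings or a biaffine classifier).

```python
import torch
import torch.nn as nn

class REHead(nn.Module):
    """Pairwise binary classifier: concat two entity embeddings, predict relation / no relation."""
    def __init__(self, hidden_size=768):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 2),               # 2-class output
        )

    def forward(self, head_emb, tail_emb):           # each: (num_pairs, hidden_size)
        return self.mlp(torch.cat([head_emb, tail_emb], dim=-1))
```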