###### tags: `2022Q1技術研討`, `detection`, `ocr`
# LayoutXLM
> Multimodal Pre-training for Multilingual Visually-rich Document Understanding — a layout-aware language model combining CV and NLP
## Objective
After running full-text OCR on a document, how do we turn the raw recognition results into meaningful, structured information?

## Contribution
- A multi-modal Transformer model that integrates image, text, and layout information
- Pre-training strategies that fuse the image, text, and layout modalities
## Model Architecture

### Input Feature
- Text Embeddings
    - Text token embeddings
        - Each distinct (sub)word is one token
    - 1-D position embeddings
        - Encode the sequential index of each token
    - 2-D position embeddings
        - Encode the token's bbox position $(x_0, x_1, y_0, y_1, w, h)$
    - Segment embeddings
        - {A, B}, to distinguish different text segments
- Vision Embeddings
    - Visual token embeddings
        1. Resize the image to 224 × 224 and feed it into the visual backbone (ResNeXt-FPN)
        2. Average-pool the output feature map to a fixed size with width W and height H
        3. Flatten it into a visual embedding sequence of length W×H
        4. Apply a linear projection layer to each visual token embedding to unify the dimensions
    - 1-D position embeddings
        - Encode the sequential index of each visual token
    - 2-D position embeddings
        - The bbox position $(x_0, x_1, y_0, y_1, w, h)$ corresponding to each visual token embedding
    - Segment embeddings
        - All visual embeddings belong to a single segment class
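The input features above can be sketched as a sum of embedding tables. This is a minimal illustration, not the exact LayoutXLM implementation: the vocabulary size, hidden size, and coordinate binning below are illustrative assumptions, and coordinates are taken as integers normalized to [0, 1000] as in the LayoutLM family.

```python
import torch
import torch.nn as nn

class InputEmbeddings(nn.Module):
    """Sum of token, 1-D position, 2-D layout, and segment embeddings (sketch).
    Hyperparameters here are illustrative, not the real LayoutXLM values."""
    def __init__(self, vocab_size=1000, hidden=768, max_pos=512,
                 coord_bins=1001, n_segments=2):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, hidden)
        self.pos_1d = nn.Embedding(max_pos, hidden)
        # separate tables for x/y coordinates and box width/height
        self.x_emb = nn.Embedding(coord_bins, hidden)
        self.y_emb = nn.Embedding(coord_bins, hidden)
        self.w_emb = nn.Embedding(coord_bins, hidden)
        self.h_emb = nn.Embedding(coord_bins, hidden)
        self.seg = nn.Embedding(n_segments, hidden)

    def forward(self, token_ids, bbox, segment_ids):
        # bbox: (batch, seq, 4) integer coords (x0, y0, x1, y1) in [0, 1000]
        seq_len = token_ids.size(1)
        pos_ids = torch.arange(seq_len, device=token_ids.device)
        x0, y0, x1, y1 = bbox.unbind(-1)
        layout = (self.x_emb(x0) + self.x_emb(x1)
                  + self.y_emb(y0) + self.y_emb(y1)
                  + self.w_emb((x1 - x0).clamp(min=0))
                  + self.h_emb((y1 - y0).clamp(min=0)))
        return self.tok(token_ids) + self.pos_1d(pos_ids) + layout + self.seg(segment_ids)

emb = InputEmbeddings()
ids = torch.tensor([[5, 17, 42]])
bbox = torch.tensor([[[10, 20, 110, 40], [120, 20, 200, 40], [10, 50, 90, 70]]])
seg = torch.zeros(1, 3, dtype=torch.long)
out = emb(ids, bbox, seg)  # (1, 3, 768): one fused embedding per token
```

Visual tokens would go through the same sum, with the ResNeXt-FPN feature-map cells as "tokens" and their grid cells as bboxes.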
### Pretraining Objectives
- Masked Visual-Language Modeling
    - As in BERT, randomly mask some text token embeddings during training and predict the original tokens
    - Makes the model learn the language side better with cross-modality clues
- Text-Image Alignment
    - Randomly cover some regions of the image; for each text token's output embedding, predict whether that token's region in the image is covered
    - Helps the model capture the fine-grained alignment between text and image
- Text-Image Matching
    - Replace the image of some samples with another document's image while keeping the text unchanged, and predict from the [CLS] output embedding whether the image has been swapped
    - Aligns the high-level semantic representations of text and image
> 30 million documents were used to pre-train LayoutXLM
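The Masked Visual-Language Modeling objective follows the standard BERT masking recipe; a sketch of the label construction, assuming the usual 80/10/10 split (the exact masking hyperparameters are an assumption, not taken from the paper):

```python
import torch

def mask_tokens(token_ids, mask_id, vocab_size, mlm_prob=0.15):
    """BERT-style masking for Masked Visual-Language Modeling (sketch).
    Masked positions keep their original id as the label; all other
    positions get -100 so the loss ignores them."""
    labels = token_ids.clone()
    masked = torch.bernoulli(torch.full(token_ids.shape, mlm_prob)).bool()
    labels[~masked] = -100
    token_ids = token_ids.clone()
    # 80% of masked positions -> [MASK] token
    replace = torch.bernoulli(torch.full(token_ids.shape, 0.8)).bool() & masked
    token_ids[replace] = mask_id
    # half of the rest (10% overall) -> random token; remainder unchanged
    rand = torch.bernoulli(torch.full(token_ids.shape, 0.5)).bool() & masked & ~replace
    token_ids[rand] = torch.randint(vocab_size, token_ids.shape)[rand]
    return token_ids, labels

ids = torch.randint(5, 100, (2, 20))
masked_ids, labels = mask_tokens(ids, mask_id=4, vocab_size=100)
```

The model then predicts the original token at every position where `labels != -100`, using the surrounding text, layout, and image features as clues.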
### Fine-tuning Objectives

#### Semantic Entity Recognition
- Discrete token set $t = \{t_0, t_1, t_2, ..., t_n\}$
- Each token $t_i = (w_i, (x_0, y_0, x_1, y_1))$ consists of a word $w_i$ and its bbox $(x_0, y_0, x_1, y_1)$
- $C=\{c_0, c_1, c_2, ..., c_m\}$ is the set of semantic entity labels into which the tokens are classified
- Objective: find a function $F_{SER}: t \rightarrow \mathcal{E}$, where $\mathcal{E}=\{(\{t^0_0,...,t^{n_0}_0\},c_0),...,(\{t^0_k,...,t^{n_k}_k\},c_k)\}$
> Classify each input text token's output embedding to decide which semantic entity label it belongs to (N-class classification)
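In practice SER is a token-classification head over the encoder output: one linear layer mapping each token's embedding to a label. A minimal sketch, assuming a hidden size of 768 and a FUNSD-style BIO label set (both are illustrative assumptions):

```python
import torch
import torch.nn as nn

# Illustrative BIO tag set; the real label set depends on the dataset.
LABELS = ["O", "B-HEADER", "I-HEADER", "B-QUESTION", "I-QUESTION",
          "B-ANSWER", "I-ANSWER"]

class SERHead(nn.Module):
    """Token classification head for Semantic Entity Recognition (sketch)."""
    def __init__(self, hidden=768, n_labels=len(LABELS)):
        super().__init__()
        self.classifier = nn.Linear(hidden, n_labels)

    def forward(self, token_embeddings):          # (batch, seq, hidden)
        return self.classifier(token_embeddings)  # (batch, seq, n_labels)

head = SERHead()
h = torch.randn(1, 6, 768)                 # encoder output embeddings
pred = head(h).argmax(-1)                  # one label id per token
tags = [LABELS[i] for i in pred[0].tolist()]
```

Tokens sharing one entity label span are then grouped into a single semantic entity, giving the set $\mathcal{E}$.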
#### Relation Extraction
- For each pair of semantic entities, predict whether a relation exists between them (2-class)
> Concatenate the output embeddings of any two input texts and run a binary classifier on the pair
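The pairwise step above can be sketched as follows: concatenate the two entities' embeddings and score the pair with a small binary classifier. The feed-forward shape and the use of one pooled embedding per entity are assumptions for illustration, not the paper's exact head.

```python
import torch
import torch.nn as nn

class REHead(nn.Module):
    """Binary relation classifier over concatenated entity pairs (sketch)."""
    def __init__(self, hidden=768):
        super().__init__()
        self.ffn = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),            # 2-class: related / not related
        )

    def forward(self, head_emb, tail_emb):   # each (n_pairs, hidden)
        return self.ffn(torch.cat([head_emb, tail_emb], dim=-1))

re_head = REHead()
entities = torch.randn(4, 768)   # e.g. one pooled embedding per entity
# score every ordered pair of distinct entities
idx = torch.cartesian_prod(torch.arange(4), torch.arange(4))
idx = idx[idx[:, 0] != idx[:, 1]]                            # 12 pairs
logits = re_head(entities[idx[:, 0]], entities[idx[:, 1]])   # (12, 2)
```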