DocReader
[paper] DocReader: Bounding-Box Free Training of a Document Information Extraction Model
A 2021 paper from SAP (Germany).
Abstract
Proposes an end-to-end architecture, DocReader, that takes a document image as input and directly outputs the values of the target text.
Prior Methods
Previous approaches all require bounding-box and text annotations for training.
The method proposed here requires no bounding-box annotations at all; only the target values are needed as supervision.
Model Architecture
DocReader architecture:
- Encoder + decoder structure
- The input image goes through the encoder to produce a feature map; the feature map, together with the key embedding, enters an RNN-based decoder, and the decoder output is trained against the target value with a cross-entropy loss (see the sketch below)
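A minimal sketch of this training step, assuming hypothetical `encoder` and `decoder` modules (the paper publishes no code, so the names and shapes here are assumptions):

```python
import torch
import torch.nn.functional as F

def docreader_loss(encoder, decoder, image, key_id, target_chars):
    # image: (B, C, H, W); key_id: field label id; target_chars: (B, T)
    memory = encoder(image)            # feature map, the "memory"
    logits = decoder(memory, key_id)   # (B, T, vocab) character predictions
    # Cross-entropy between the predicted characters and the target value
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           target_chars.reshape(-1))
```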
Encoder structure:
- Input: image (H × W × C)
- Three convolution blocks, each containing three convolution layers
- Every convolution layer includes dropout, batch normalization, and a ReLU activation
- The final output is called the memory
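A sketch of one possible realization of this encoder; the block/layer structure follows the description above, while the kernel size, dropout rate, channel widths, and pooling are assumptions not taken from the paper:

```python
import torch.nn as nn

def conv_layer(c_in, c_out):
    # One convolution layer: conv + dropout + batch-norm + ReLU.
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
        nn.Dropout(0.1),
        nn.BatchNorm2d(c_out),
        nn.ReLU())

def conv_block(c_in, c_out):
    # One block: three convolution layers (downsampling is an assumption).
    return nn.Sequential(
        conv_layer(c_in, c_out),
        conv_layer(c_out, c_out),
        conv_layer(c_out, c_out),
        nn.MaxPool2d(2))

# Three blocks; the final output is the "memory" feature map.
encoder = nn.Sequential(
    conv_block(3, 32), conv_block(32, 64), conv_block(64, 128))
```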
- Before entering the decoder, the memory is preprocessed (right side of the figure in the paper)
- The original memory is concatenated with a one-hot positional encoding, together forming the Spatial Aware Memory
- The position encoding is applied along H and W (a sketch follows below)
- In the paper's figure, blue denotes the pixel features and orange the location features
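A minimal sketch of this preprocessing, assuming the positional channels are one-hot encodings of the row and column indices concatenated onto the feature channels (my interpretation of the description, not the paper's code):

```python
import torch

def spatial_aware_memory(memory):
    # memory: (B, D, H, W) feature map from the encoder.
    B, D, H, W = memory.shape
    # One-hot encodings of the row and column index ("location" features).
    row_enc = torch.eye(H, device=memory.device).view(1, H, H, 1).expand(B, H, H, W)
    col_enc = torch.eye(W, device=memory.device).view(1, W, 1, W).expand(B, W, H, W)
    # Concatenate pixel features with location features along the channel dim.
    return torch.cat([memory, row_enc, col_enc], dim=1)
```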
Decoder structure:
- RNN-based (LSTM) plus an attention layer (sum-attention layer)
- Takes two inputs:
    - The Spatial Aware Memory (from the encoder plus preprocessing)
    - The key of the target value, first converted to a one-hot encoding or some other embedding
        - The key here presumably refers to the field label, e.g. invoice number, invoice amount…
- The output is the corresponding value
Score function:
- m: Spatial Aware Memory; k: embedded key; h: previous LSTM state; o: previous predicted character; a: previous attention weights
- This computes the attention scores; a softmax over them yields the attention weights (the inputs include the previous step's information)
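Written out as formulas (a reconstruction from the variable definitions above, since the original equation image is missing):

$$e_t = \operatorname{score}(m, k, h_{t-1}, o_{t-1}, \alpha_{t-1}), \qquad \alpha_t = \operatorname{softmax}(e_t)$$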
Context vector:
- The attention weights are multiplied with the memory and summed over all positions
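As a formula (again reconstructed from the description, summing over all spatial positions $i, j$):

$$c_t = \sum_{i,j} \alpha_{t,i,j}\, m_{i,j}$$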
- Text appears in many places throughout a document, so a hard problem is locating the target text; the solution proposed here is to feed this extra information (key, previous state, previous attention) into the sum-attention layer's computation
LSTM:
- h: previous LSTM state; l: previous character embedding; c: context vector; k: embedded key
- At each step the LSTM output is concatenated with that step's context vector, passed through a projection layer, and a softmax produces the predicted character
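A sketch of one decoding step under these definitions; all dimensions and module names are placeholders, not values from the paper:

```python
import torch
import torch.nn as nn

# Placeholder sizes: vocab 100, char dim 64, context dim 128, key dim 32.
char_emb = nn.Embedding(100, 64)        # l: previous character embedding
lstm = nn.LSTMCell(64 + 128 + 32, 256)  # input: [l, c, k]
proj = nn.Linear(256 + 128, 100)        # project [h, c] to character logits

def decode_step(prev_char, c_t, k, state):
    # state = (h, cell): the previous LSTM state.
    l = char_emb(prev_char)
    h, cell = lstm(torch.cat([l, c_t, k], dim=-1), state)
    # Concatenate the LSTM output with this step's context vector,
    # project, and softmax to get the next-character distribution.
    logits = proj(torch.cat([h, c_t], dim=-1))
    return torch.softmax(logits, dim=-1), (h, cell)
```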
Experimental Results
The training data amounts to 1.5M samples.
Attention map visualization
- Orange: invoice number
- Blue: invoice amount
Future work:
- Could only the attention part be reused as a detection model?
- This could potentially remove the need for human annotation, taking training data directly from a database