DocReader

tags: `2022Q1技術研討`, `detection`, `ocr`

[paper] DocReader: Bounding-Box Free Training of a Document Information Extraction Model

Document Information Extraction

A 2021 paper from SAP (Germany).

Abstract

Proposes DocReader, an end-to-end architecture: given a document image, it directly outputs the values of the target text fields.

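Previous methods

Existing information extraction approaches are mostly two-stage:
1. Detect all the text in the document
2. Find the target among all the detected text
Another structure: given the bounding box of the target, run an OCR model on it
- This is also a two-step architecture (e.g. YOLO + CRNN)

Previous methods all require both bounding-box and text annotations.

The method proposed in this paper needs no bounding-box annotations at all; only the values of the target fields have to be provided.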
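
To make the difference in labeling effort concrete, here is a hypothetical sketch of one training sample under each scheme. All field names, file names, and values are made up for illustration; they are not the paper's data format.

```python
# Hypothetical two-stage sample: every text region needs a box and a transcription.
two_stage_sample = {
    "image": "invoice_0001.png",
    "words": [
        {"box": [120, 40, 260, 62], "text": "INV-0001"},
        {"box": [300, 410, 380, 432], "text": "199.00"},
    ],
    # Indices into "words" that mark which region is which target field.
    "targets": {"invoice_number": 0, "invoice_amount": 1},
}

# Hypothetical bounding-box-free sample: only key -> value pairs are labeled.
docreader_sample = {
    "image": "invoice_0001.png",
    "targets": {"invoice_number": "INV-0001", "invoice_amount": "199.00"},
}


def needs_boxes(sample):
    """Return True if the sample carries any bounding-box annotation."""
    return any("box" in word for word in sample.get("words", []))
```

The second form is the kind of record that may already exist in a business database, which is what makes the bounding-box-free setup attractive.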

Model Architecture

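DocReader architecture:

Encoder + decoder structure
The input image passes through the encoder to produce a feature map; the feature map and the key embedding are fed together into an RNN-based decoder, and the decoder output is finally compared with the target value through a cross-entropy loss.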
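
The cross-entropy loss mentioned above can be sketched as follows. This is a minimal NumPy version that assumes the decoder emits one row of logits per output character and that the loss is averaged over characters; the function name and shapes are illustrative, not taken from the paper.

```python
import numpy as np


def sequence_cross_entropy(logits, target_ids):
    """Mean negative log-likelihood of the target characters.

    logits: (T, V) array, one row of unnormalized scores per decoding step.
    target_ids: (T,) integer array with the index of the correct character.
    """
    # Subtract the row maximum for numerical stability before the log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    picked = log_probs[np.arange(len(target_ids)), target_ids]
    return -picked.mean()
```

Because the loss only needs the target string, no bounding box enters the computation at any point.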

Encoder:

Input: image (H × W × C)
The image passes through three convolution blocks, each containing three convolution layers
- Every convolution layer comes with dropout, batch normalization, and a ReLU activation
The final output is called the memory
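Before the memory enters the decoder, it is preprocessed (right side of the figure above)
- The original memory is extended with a one-hot-encoded part; together they form the Spatial Aware Memory
- The one-hot part is a position encoding over H and W
- So the blue part holds the pixel features and the orange part holds the location features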
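
The preprocessing above can be sketched in NumPy. This assumes that the one-hot position features are concatenated to the memory along the channel axis; the function name `spatial_aware_memory` and the exact layout (row features first, then column features) are illustrative assumptions.

```python
import numpy as np


def spatial_aware_memory(memory):
    """Append one-hot row and column encodings to every memory cell.

    memory: (H, W, D) feature map from the encoder.
    Returns an (H, W, D + H + W) array.
    """
    H, W, _ = memory.shape
    rows = np.eye(H, dtype=memory.dtype)  # rows[i] is the one-hot code of row i
    cols = np.eye(W, dtype=memory.dtype)  # cols[j] is the one-hot code of column j
    row_feat = np.broadcast_to(rows[:, None, :], (H, W, H))
    col_feat = np.broadcast_to(cols[None, :, :], (H, W, W))
    # Pixel features first, location features after.
    return np.concatenate([memory, row_feat, col_feat], axis=-1)
```

With this layout each cell carries both what it looks like and where it sits, which lets the attention layer reason about position.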

Decoder:

The architecture is RNN-based (LSTM) plus an attention layer (sum-attention layer)
It takes two inputs
- Spatial Aware Memory (from encoder + preprocessing)
- The key of the target value, first converted to a one-hot encoding or some other embedding
  - The key here presumably refers to the field label, e.g. invoice number, invoice amount…
The output is the corresponding value

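Score function:

m: Spatial Aware Memory; k: embedded key; h: previous LSTM state; o: previously predicted character; a: previous attention weights
This step computes the attention: the scores go through a softmax to give the attention weights (the inputs include information from the previous step)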
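
Since the paper's formula is not reproduced in these notes, the following is only a sketch of a generic additive (sum) attention that takes the five inputs listed above. The parameter names (`W_m`, `W_k`, `W_h`, `W_o`, `w_a`, `v`), the `tanh` nonlinearity, and the way the previous attention weights enter are assumptions, not the paper's exact definition.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()


def attention_weights(m, k, h, o, a_prev, params):
    """Additive attention over all H*W memory cells.

    m: (H, W, D) Spatial Aware Memory; k: (K,) embedded key;
    h: (S,) previous LSTM state; o: (V,) previous predicted character;
    a_prev: (H, W) previous attention weights.
    """
    H, W, _ = m.shape
    # Project every input to a shared hidden size and sum the projections.
    pre = (
        m @ params["W_m"]                      # (H, W, A)
        + k @ params["W_k"]                    # (A,), broadcast over cells
        + h @ params["W_h"]
        + o @ params["W_o"]
        + a_prev[..., None] * params["w_a"]    # (H, W, A)
    )
    scores = np.tanh(pre) @ params["v"]        # (H, W)
    return softmax(scores.ravel()).reshape(H, W)
```

The softmax runs over all cells at once, so the weights form a single distribution over the whole feature map.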
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →

Context vector:

The memory is multiplied by the attention weights and summed, i.e. a weighted sum over the memory
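Text appears in many places in a document, so a hard problem is how to locate the target text. The solution proposed in this paper is to feed additional information into the sum-attention layer.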
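
The context vector described in this section is a weighted sum of memory cells. A minimal NumPy sketch, with the function name and shapes chosen for illustration:

```python
import numpy as np


def context_vector(attn, memory):
    """Weighted sum of memory cells.

    attn: (H, W) attention weights that sum to 1.
    memory: (H, W, D) Spatial Aware Memory.
    Returns a (D,) vector.
    """
    # Multiply each cell by its weight and sum over both spatial axes.
    return np.einsum("hw,hwd->d", attn, memory)
```

If the attention is concentrated on a single cell, the context vector is simply that cell's feature vector; a spread-out attention blends several cells.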

LSTM:

h: previous LSTM state; l: previous character embedding; c: context vector; k: embedded key
The LSTM output is first concatenated with the context vector of each step, then passed through a projection layer followed by a softmax, which gives the predicted character
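
The output step described above (concatenate, project, softmax) can be sketched for a single decoding step. The names `W_proj`, `b_proj`, and `vocab` are hypothetical, and greedy argmax decoding is an assumption made for the example.

```python
import numpy as np


def predict_char(lstm_out, context, W_proj, b_proj, vocab):
    """Predict one character from the LSTM output and the context vector.

    lstm_out: (S,) LSTM output at this step; context: (D,) context vector;
    W_proj: (S + D, V) projection matrix; b_proj: (V,) bias;
    vocab: list of V characters.
    """
    features = np.concatenate([lstm_out, context])
    logits = features @ W_proj + b_proj
    # Stable softmax over the character vocabulary.
    exp = np.exp(logits - logits.max())
    probs = exp / exp.sum()
    return vocab[int(probs.argmax())], probs
```

At training time the probabilities would feed the cross-entropy loss instead of the argmax.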

Experimental results

The training set size is 1.5 M

Attention map visualization

Orange: invoice number
Blue: invoice amount

Future work:

Could the model serve as a detection model by using only the attention part?
Manual annotation might be unnecessary; data could be taken directly from the database
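
One way the first idea might be explored is to threshold an attention map and take the bounding box of the cells that survive. This is a speculative sketch, not something the paper evaluates; the function name and the relative threshold of 0.5 are arbitrary assumptions, and the coordinates are in feature-map cells rather than image pixels.

```python
import numpy as np


def attention_to_box(attn, rel_threshold=0.5):
    """Bounding box (x_min, y_min, x_max, y_max) of the strongly attended cells.

    attn: (H, W) attention map, e.g. averaged over the decoding steps of one key.
    rel_threshold: keep cells whose weight is at least this fraction of the peak.
    """
    mask = attn >= rel_threshold * attn.max()
    ys, xs = np.nonzero(mask)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())
```

Mapping the box back to pixels would require scaling by the encoder's total downsampling factor, which these notes do not state.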
