Optical Character Recognition

Using the Tesseract tool to train an OCR model
GitHub source: https://github.com/tesseract-ocr/tesseract

Tesseract dataset preprocessing

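Convert the images to .tif files
Download the jTessBoxEditor tool from https://sourceforge.net/projects/vietocr/
Use jTessBoxEditor to merge the images into a single .tif file
Generate the .box file for the merged .tif
Use jTessBoxEditor to adjust the box positions
Generate the .lstmf file
Extract the .lstm file from a pretrained model (training uses fine-tuning)
Start training
Merge the model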

1. Convert the images to .tif files

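import glob
import os
from PIL import Image

# Convert every PNG in the current directory to a TIFF named <name>_new.tif.
for path in glob.glob("*.png"):
    stem = os.path.splitext(path)[0]  # safe for file names containing extra dots
    print(stem)
    with Image.open(path) as im:
        # quality only affects JPEG-compressed TIFFs, so it is omitted here
        im.save("{}_new.tif".format(stem))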

2. Download the jTessBoxEditor tool

Download it from https://sourceforge.net/projects/vietocr/

3. Use jTessBoxEditor to merge the images into a single .tif file

Double-click train.bat to launch jTessBoxEditor

Click Tools, then select Merge

Select the image files to merge and click Open

When saving, name the file [lang].[fontname].exp[num].tif
- lang: language
- fontname: font name
- num: sequence number
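
The naming rule above can be checked programmatically before the file is passed to Tesseract. The following is a minimal sketch; `build_name`, `parse_name`, and `NAME_PATTERN` are hypothetical helper names introduced here for illustration, and the pattern assumes that the language and font name contain no dots.

```python
import re

# Pattern for the [lang].[fontname].exp[num].tif convention (assumed: no dots
# inside the lang or fontname parts).
NAME_PATTERN = re.compile(
    r"^(?P<lang>[^.]+)\.(?P<fontname>[^.]+)\.exp(?P<num>\d+)\.tif$"
)


def build_name(lang: str, fontname: str, num: int) -> str:
    """Return a merged-TIFF file name such as 'eng.font.exp0.tif'."""
    return f"{lang}.{fontname}.exp{num}.tif"


def parse_name(filename: str) -> dict:
    """Split a file name into its lang, fontname, and num parts."""
    match = NAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(
            f"{filename!r} does not follow [lang].[fontname].exp[num].tif"
        )
    return {
        "lang": match["lang"],
        "fontname": match["fontname"],
        "num": int(match["num"]),
    }


print(build_name("eng", "font", 0))
print(parse_name("eng.font.exp0.tif"))
```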

4. Generate the .box file for the merged .tif

Generate the eng.font.exp0.box file, which stores the position of each character

$ tesseract eng.font.exp0.tif eng.font.exp0 batch.nochop makebox
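
To inspect the generated .box file outside jTessBoxEditor, it can be read as plain text. Each line of a Tesseract box file has the form `symbol left bottom right top page`, with coordinates measured from the bottom-left corner of the image. The sketch below assumes that layout; `Box`, `parse_box_line`, and `read_box_file` are hypothetical names, and the sample line is made up for illustration.

```python
from typing import NamedTuple


class Box(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    """Parse one 'symbol left bottom right top page' line."""
    # Split from the right so the symbol itself is left untouched.
    symbol, *numbers = line.rstrip("\n").rsplit(" ", 5)
    left, bottom, right, top, page = (int(n) for n in numbers)
    return Box(symbol, left, bottom, right, top, page)


def read_box_file(path: str) -> list[Box]:
    """Read every non-empty line of a .box file."""
    with open(path, encoding="utf-8") as handle:
        return [parse_box_line(line) for line in handle if line.strip()]


# Illustrative line, not taken from a real box file.
sample = "T 36 92 53 116 0"
print(parse_box_line(sample))
```

Calling `read_box_file("eng.font.exp0.box")` would return one `Box` per character, which makes it easy to spot boxes with zero or negative width before correcting them by hand.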

5. Use jTessBoxEditor to adjust the box positions

Select Box Editor -> Open, then choose the eng.font.exp0.tif file

Once the file is open, correct the position of each text box in turn (note that there are further pages)

6. Generate the .lstmf file

Generate the .lstmf file needed for training

$ tesseract eng.font.exp0.tif eng.font.exp0 -l eng --psm 6 lstm.train
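
When there are several merged .tif files, the same command can be assembled in Python. This is only a sketch: `lstmf_command` is a hypothetical helper that mirrors the command line above, and it builds the argument list without running Tesseract.

```python
from pathlib import Path


def lstmf_command(tif_path: str, lang: str = "eng", psm: int = 6) -> list[str]:
    """Build the tesseract argument list that produces a .lstmf file."""
    tif = Path(tif_path)
    # eng.font.exp0.tif -> eng.font.exp0 (the output base name)
    output_base = str(tif.with_suffix(""))
    return [
        "tesseract", str(tif), output_base,
        "-l", lang, "--psm", str(psm), "lstm.train",
    ]


print(lstmf_command("eng.font.exp0.tif"))

# To actually run it (requires tesseract on PATH):
# import subprocess
# subprocess.run(lstmf_command("eng.font.exp0.tif"), check=True)
```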

7. Extract the .lstm file from a pretrained model (training uses fine-tuning)

Download a pretrained model. This guide uses eng.traineddata from https://github.com/tesseract-ocr/tessdata_best

$ combine_tessdata -e eng.traineddata eng.lstm

8. Start training

$ lstmtraining \
    --model_output='/home/datasets/ocr'  \
    --continue_from='/home/datasets/eng.lstm' \
    --train_listfile='/home/datasets/eng.training_files.txt' \
    --traineddata='/home/datasets/eng.traineddata' \
    --debug_interval -1  \
    --max_iterations 4000

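--model_output: name (path prefix) of the output model
--continue_from: the .lstm file extracted from the pretrained model
--train_listfile: a text file listing the path to eng.font.exp0.lstmf
--traineddata: the pretrained model
--max_iterations: number of training iterations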
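
The file passed to --train_listfile is a plain text file with one .lstmf path per line. A minimal sketch for generating it is shown below; `write_training_list` is a hypothetical helper, and the demo uses a temporary directory with an empty placeholder file rather than real training data.

```python
import tempfile
from pathlib import Path


def write_training_list(lstmf_dir: str, list_path: str) -> list[str]:
    """Write one absolute .lstmf path per line and return the paths written."""
    paths = sorted(str(p.resolve()) for p in Path(lstmf_dir).glob("*.lstmf"))
    Path(list_path).write_text(
        "".join(f"{p}\n" for p in paths), encoding="utf-8"
    )
    return paths


# Demo with a placeholder file; in practice point lstmf_dir at the folder
# that holds the .lstmf files generated in step 6.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "eng.font.exp0.lstmf").touch()
    listing = Path(tmp) / "eng.training_files.txt"
    written = write_training_list(tmp, str(listing))
    print(listing.read_text(encoding="utf-8"))
```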

9. Merge the model

Training produces many checkpoint files. ocr_checkpoint is the most recently written model; merge it with eng.traineddata to produce the new model.

$ lstmtraining \
    --stop_training \
    --continue_from='/home/datasets/ocr_checkpoint' \
    --traineddata='/home/datasets/eng.traineddata' \
    --model_output='/home/datasets/ocr.traineddata'

--continue_from: name of the checkpoint model
--traineddata: the eng.traineddata model
--model_output: name of the output model

Thank you!

You can find me on

GitHub: https://github.com/shaung08
Email: a2369875@gmail.com
