
Optical Character Recognition

Using the Tesseract tool to train an OCR model
GitHub source: https://github.com/tesseract-ocr/tesseract

Tesseract dataset preprocessing

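  1. Convert the images to .tif files
  2. Download the jTessBoxEditor tool from https://sourceforge.net/projects/vietocr/
  3. Use jTessBoxEditor to merge the images into a single .tif file
  4. Generate the .box file for the merged .tif
  5. Use jTessBoxEditor to adjust the box positions
  6. Generate the .lstmf file
  7. Extract the .lstm file from a pretrained model (training is done by fine-tuning)
  8. Start training
  9. Merge the model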

1. Convert the images to .tif files

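import glob
import os
from PIL import Image

# Convert every PNG in the current directory to <name>_new.tif.
for path in glob.glob("*.png"):
    stem = os.path.splitext(path)[0]  # drop only the final extension
    print(stem)
    with Image.open(path) as im:
        im.save("{}_new.tif".format(stem))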



2. Download the jTessBoxEditor tool


3. Use jTessBoxEditor to merge the images into a single .tif file

Double-click train.bat to open jTessBoxEditor


Click Tools, then select Merge


Select the image files to merge, then click Open


When saving, name the file [lang].[fontname].exp[num].tif
lang: language
fontname: font name
num: sequence number
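
The naming rule can also be generated and checked in code before the files are handed to Tesseract. The following is a minimal sketch; the helper names `training_name` and `parse_training_name` are hypothetical and are not part of jTessBoxEditor or Tesseract.

```python
import re

# Pattern for [lang].[fontname].exp[num].tif, e.g. eng.font.exp0.tif
_NAME_RE = re.compile(
    r"^(?P<lang>[^.]+)\.(?P<fontname>[^.]+)\.exp(?P<num>\d+)\.tif$"
)


def training_name(lang, fontname, num):
    """Build a file name that follows the naming rule (hypothetical helper)."""
    return "{}.{}.exp{}.tif".format(lang, fontname, num)


def parse_training_name(name):
    """Return (lang, fontname, num), or None if the name does not match."""
    match = _NAME_RE.match(name)
    if match is None:
        return None
    return match.group("lang"), match.group("fontname"), int(match.group("num"))


print(training_name("eng", "font", 0))
print(parse_training_name("eng.font.exp0.tif"))
```

A name that returns None from `parse_training_name` is worth renaming before continuing, since the later commands in this note all assume the eng.font.exp0 base name.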



4. Generate the .box file for the merged .tif

Generate eng.font.exp0.box, which stores the position of each character

$ tesseract eng.font.exp0.tif eng.font.exp0 batch.nochop makebox
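
To inspect the generated positions outside jTessBoxEditor, the .box file can be read as plain text. This sketch assumes the usual Tesseract box layout: one symbol per line, followed by left, bottom, right, top and the page number, with the coordinate origin at the bottom-left corner of the image. The names `Box`, `parse_box_line` and `read_box_file` are hypothetical, and the sample line is made up for illustration.

```python
from collections import namedtuple

# One entry of a .box file: the symbol plus its bounding box and page index.
Box = namedtuple("Box", "symbol left bottom right top page")


def parse_box_line(line):
    """Parse one line such as 'T 36 92 53 116 0' into a Box."""
    parts = line.rstrip("\n").rsplit(" ", 5)
    if len(parts) != 6:
        raise ValueError("unexpected box line: {!r}".format(line))
    left, bottom, right, top, page = (int(p) for p in parts[1:])
    return Box(parts[0], left, bottom, right, top, page)


def read_box_file(path):
    """Read every non-empty line of a .box file into a list of Box tuples."""
    with open(path, encoding="utf-8") as handle:
        return [parse_box_line(line) for line in handle if line.strip()]


# Illustrative line only; real values come from eng.font.exp0.box.
box = parse_box_line("T 36 92 53 116 0")
print(box.symbol, box.right - box.left, box.top - box.bottom)
```

Because a multi-page .tif stores the page index in the last column, filtering on `box.page` is a simple way to check each page separately.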



5. Use jTessBoxEditor to adjust the box positions

Select Box Editor -> Open -> eng.font.exp0.tif


Once the file is open, correct the position of each text box in turn (note that there are further pages)


6. Generate the .lstmf file

Generate the .lstmf file needed for training

$ tesseract eng.font.exp0.tif eng.font.exp0 -l eng --psm 6 lstm.train
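
When there is more than one merged .tif, the same command can be repeated per file. The sketch below only assembles the argument list shown above and prints it by default. The function names, the `*.exp*.tif` pattern and the `dry_run` switch are assumptions made for illustration.

```python
import glob
import subprocess


def lstmf_command(tif_path, lang="eng", psm=6):
    """Build the tesseract call that writes <base>.lstmf for one .tif file."""
    base = tif_path[:-len(".tif")] if tif_path.endswith(".tif") else tif_path
    return ["tesseract", tif_path, base, "-l", lang,
            "--psm", str(psm), "lstm.train"]


def generate_all(pattern="*.exp*.tif", dry_run=True):
    """Print (or, with dry_run=False, run) the command for each matching file."""
    commands = [lstmf_command(path) for path in sorted(glob.glob(pattern))]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)  # requires tesseract on PATH
    return commands


generate_all()
```

Leaving `dry_run=True` makes it easy to confirm the file list before any .lstmf files are written.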



7. Extract the .lstm file from a pretrained model (training is done by fine-tuning)

Download a pretrained model. This note uses eng.traineddata, available from https://github.com/tesseract-ocr/tessdata_best

$ combine_tessdata -e eng.traineddata eng.lstm

8. Start training

$ lstmtraining \
    --model_output='/home/datasets/ocr'  \
    --continue_from='/home/datasets/eng.lstm' \
    --train_listfile='/home/datasets/eng.training_files.txt' \
    --traineddata='/home/datasets/eng.traineddata' \
    --debug_interval -1  \
    --max_iterations 4000

model_output: name of the output model
continue_from: the .lstm file extracted from the pretrained model
train_listfile: a text file containing the path to eng.font.exp0.lstmf
traineddata: the pretrained model
max_iterations: number of training iterations
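
The file passed to train_listfile is plain text with one .lstmf path per line. A minimal sketch of writing it follows; the function name `write_training_list` is hypothetical, and the demo runs in a temporary directory with an empty placeholder file rather than a real .lstmf.

```python
import glob
import os
import tempfile


def write_training_list(lstmf_dir, list_path):
    """Write the absolute path of every .lstmf in lstmf_dir, one per line."""
    paths = sorted(glob.glob(os.path.join(lstmf_dir, "*.lstmf")))
    paths = [os.path.abspath(p) for p in paths]
    with open(list_path, "w", encoding="utf-8") as handle:
        for path in paths:
            handle.write(path + "\n")
    return paths


# Demo in a throwaway directory; the empty .lstmf file is only a placeholder.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "eng.font.exp0.lstmf"), "wb").close()
    listed = write_training_list(tmp, os.path.join(tmp, "eng.training_files.txt"))
    print(listed)
```

For the paths used in this note, the call would be `write_training_list('/home/datasets', '/home/datasets/eng.training_files.txt')`, assuming the .lstmf files were generated in /home/datasets.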


9. Merge the model

Training produces many checkpoint files. ocr_checkpoint is the most recently written model; merge it with eng.traineddata to produce the new model

$ lstmtraining \
    --stop_training \
    --continue_from='/home/datasets/ocr_checkpoint' \
    --traineddata='/home/datasets/eng.traineddata' \
    --model_output='/home/datasets/ocr.traineddata'

continue_from: name of the checkpoint model
traineddata: the eng.traineddata model
model_output: name of the output model


Thank you!