---
title: Optical Character Recognition (Tesseract)
tags: Tesseract
description: Explain the Tesseract training process
---
# Optical Character Recognition
Train an OCR model using the Tesseract tool.
GitHub source: [https://github.com/tesseract-ocr/tesseract](https://github.com/tesseract-ocr/tesseract)
# Tesseract dataset preprocessing
1. Convert the images to .tif files
2. Download the jTessBoxEditor tool from [https://sourceforge.net/projects/vietocr/](https://sourceforge.net/projects/vietocr/)
3. Use jTessBoxEditor to merge the images into a single .tif file
4. Generate the .box file for the merged .tif
5. Use jTessBoxEditor to adjust the box positions
6. Generate the .lstmf file
7. Extract the .lstm file from an already-trained model (training uses fine-tuning)
8. Start training
9. Combine the model
1. Convert the images to .tif files
2. Download the jTessBoxEditor tool
3. Use jTessBoxEditor to merge the images into a single .tif file
Run train.bat to open jTessBoxEditor.
Open the Tools menu and select Merge.
Select the image files to merge and click Open.
When saving, name the file following the pattern [lang].[fontname].exp[num].tif:
- lang: language
- fontname: font name
- num: sequence number
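The naming rule above can be checked programmatically before training. The following is a minimal sketch, assuming that none of the three parts contains a dot; the helper names `build_training_name` and `parse_training_name` are hypothetical and are not part of Tesseract or jTessBoxEditor.

```python
import re

# Pattern for the [lang].[fontname].exp[num].tif convention.
# Assumption: lang and fontname contain no dots, and num is a non-negative integer.
TRAINING_NAME = re.compile(
    r"^(?P<lang>[^.]+)\.(?P<fontname>[^.]+)\.exp(?P<num>\d+)\.tif$"
)


def build_training_name(lang: str, fontname: str, num: int) -> str:
    """Return a file name such as 'eng.font.exp0.tif'."""
    return f"{lang}.{fontname}.exp{num}.tif"


def parse_training_name(filename: str) -> dict:
    """Split a training file name into its lang, fontname and num parts."""
    match = TRAINING_NAME.match(filename)
    if match is None:
        raise ValueError(f"name does not follow the convention: {filename!r}")
    parts = match.groupdict()
    parts["num"] = int(parts["num"])  # convert the sequence number to an int
    return parts
```

For example, `build_training_name("eng", "font", 0)` gives the `eng.font.exp0.tif` name used in the following steps.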
4. Generate the .box file for the merged .tif
Generate the eng.font.exp0.box file, which stores the position of each character.
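The exact command for this step is not preserved in the text, so the following is only a sketch of one common way to produce the box file: calling the `tesseract` CLI with the `batch.nochop` and `makebox` config files. The choice of those configs and the helper name `makebox_command` are assumptions. The function only builds the argument list and does not run anything.

```python
from pathlib import Path
from typing import List


def makebox_command(tif_path: str, lang: str = "eng") -> List[str]:
    """Build a tesseract command that writes <output_base>.box next to the .tif."""
    # The output base is the .tif path without its extension,
    # e.g. 'eng.font.exp0.tif' -> 'eng.font.exp0'.
    output_base = str(Path(tif_path).with_suffix(""))
    # Assumption: the 'batch.nochop' and 'makebox' configs are used for this step.
    return ["tesseract", tif_path, output_base, "-l", lang, "batch.nochop", "makebox"]
```

The resulting list can be passed to `subprocess.run(command, check=True)` on a machine where Tesseract is installed.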
5. Use jTessBoxEditor to adjust the box positions
Select Box Editor -> Open -> the eng.font.exp0.tif file.
Once the file is open, correct the position of each text box in turn (remember to check the following pages as well).
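To confirm that every page has been reviewed, it can help to read the box file directly. Each line of a Tesseract box file holds a symbol followed by its left, bottom, right and top coordinates and a zero-based page index. The sketch below parses those lines and lists the pages present; the `Box` type and both function names are hypothetical helpers.

```python
from typing import Iterable, List, NamedTuple


class Box(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    """Parse one '<symbol> <left> <bottom> <right> <top> <page>' line."""
    # Split from the right so the five numeric fields are taken first.
    symbol, left, bottom, right, top, page = line.rstrip("\n").rsplit(" ", 5)
    return Box(symbol, int(left), int(bottom), int(right), int(top), int(page))


def pages_in(lines: Iterable[str]) -> List[int]:
    """Return the sorted page indices that appear in the box file lines."""
    return sorted({parse_box_line(line).page for line in lines if line.strip()})
```

Calling `pages_in(open("eng.font.exp0.box", encoding="utf-8"))` would show how many pages need checking in the editor.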
6. Generate the .lstmf file
Generate the .lstmf file required for training.
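The command for this step is not preserved in the text. One way to produce the .lstmf file is to run `tesseract` with the `lstm.train` config, which reads the .tif together with its corrected .box file. The sketch below builds that command under this assumption; the helper name `lstmf_command` and the optional `psm` parameter are illustrative.

```python
from pathlib import Path
from typing import List, Optional


def lstmf_command(tif_path: str, lang: str = "eng", psm: Optional[int] = None) -> List[str]:
    """Build a tesseract command that writes <output_base>.lstmf."""
    output_base = str(Path(tif_path).with_suffix(""))
    command = ["tesseract", tif_path, output_base, "-l", lang]
    if psm is not None:
        # Optional page segmentation mode; whether one is needed depends on the images.
        command += ["--psm", str(psm)]
    # Assumption: the 'lstm.train' config is what generates the .lstmf output.
    command.append("lstm.train")
    return command
```

With the default arguments, `lstmf_command("eng.font.exp0.tif")` targets `eng.font.exp0.lstmf` as its output.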
7. Extract the .lstm file from an already-trained model (training uses fine-tuning)
Download an already-trained model. This guide uses eng.traineddata, available at https://github.com/tesseract-ocr/tessdata_best
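The extraction command itself is not given in the text. Tesseract ships a `combine_tessdata` tool whose `-e` option extracts a named component from a .traineddata file, so a plausible sketch is shown below. The output name `eng.lstm` and the helper name `extract_lstm_command` are assumptions.

```python
from typing import List


def extract_lstm_command(traineddata: str = "eng.traineddata",
                         lstm_output: str = "eng.lstm") -> List[str]:
    """Build a combine_tessdata command that extracts the LSTM component."""
    # 'combine_tessdata -e <traineddata> <output>' extracts the component whose
    # type matches the output file's extension (here: .lstm).
    return ["combine_tessdata", "-e", traineddata, lstm_output]
```

The extracted .lstm file is what the training step continues from.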
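8. Start training
--model_output: name of the output model
--continue_from: the .lstm file extracted from the already-trained model
--train_listfile: a file containing the path to eng.font.exp0.lstmf
--traineddata: the already-trained model
--max_iterations: number of iterations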
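The flags above belong to Tesseract's `lstmtraining` tool. As a rough sketch, the list file can be written and the command assembled as follows. All file names passed in are placeholders, and the helper names `write_listfile` and `training_command` are hypothetical.

```python
from typing import Iterable, List


def write_listfile(path: str, lstmf_paths: Iterable[str]) -> None:
    """Write one .lstmf path per line, as expected by --train_listfile."""
    with open(path, "w", encoding="utf-8") as handle:
        for lstmf_path in lstmf_paths:
            handle.write(lstmf_path + "\n")


def training_command(model_output: str, continue_from: str, train_listfile: str,
                     traineddata: str, max_iterations: int) -> List[str]:
    """Build an lstmtraining command for fine-tuning from an extracted .lstm file."""
    return [
        "lstmtraining",
        "--model_output", model_output,      # prefix for checkpoint files
        "--continue_from", continue_from,    # .lstm extracted from the trained model
        "--train_listfile", train_listfile,  # text file listing the .lstmf paths
        "--traineddata", traineddata,        # the already-trained model
        "--max_iterations", str(max_iterations),
    ]
```

Because `lstmtraining` names its checkpoints after the `--model_output` prefix, a prefix of `tocr` would presumably yield the `tocr_checkpoint` file mentioned in the next step.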
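9. Combine the model
Training produces many checkpoint files; tocr_checkpoint is the most recently written model. Combine this checkpoint with eng.traineddata to produce the new model.
--continue_from: name of the checkpoint model
--traineddata: the eng.traineddata model
--model_output: name of the output model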
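These three flags match the way `lstmtraining` converts a checkpoint into a final model when run with `--stop_training`. The sketch below builds that command under this assumption. The output name `tocr.traineddata` is a placeholder, and `finalize_command` is a hypothetical helper.

```python
from typing import List


def finalize_command(checkpoint: str = "tocr_checkpoint",
                     traineddata: str = "eng.traineddata",
                     model_output: str = "tocr.traineddata") -> List[str]:
    """Build an lstmtraining command that turns a checkpoint into a .traineddata file."""
    return [
        "lstmtraining",
        "--stop_training",               # convert the checkpoint instead of training further
        "--continue_from", checkpoint,   # checkpoint written during training
        "--traineddata", traineddata,    # base model the checkpoint was fine-tuned from
        "--model_output", model_output,  # file name of the combined model
    ]
```

The resulting .traineddata file can then be copied into the tessdata directory and selected with the `-l` option of `tesseract`.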
Thank you!