Tesseract 使用＆安裝＆訓練

--- GA: G-FFF1L2PEEZ --- # Tesseract 使用＆安裝＆訓練 ## 簡單驗證碼去噪灰度二值化 ###### tags: `python` `tessract` `辨識文字` 2024更新 : 剛好在上大型語言模型實作初階課程，可結合 RPA 工具，串接至 LINE 平台，實現上傳健檢報告自動執行文字掃描，並根據狀況回傳有趣的圖片。  ![image](https://hackmd.io/_uploads/S1BMoZ5gke.png) OpenCV 處理圖片雜訊方式： NYCU 資料科學軟體 2024 Week10: OpenCV & skimage https://www.youtube.com/live/d1fz8nvAkOM?si=DG0n8ZCvNSiGmgXk --- > [TOC] :::info >無意間找到網址是固定的教育部某平台之簡易驗證碼 ![](https://i.imgur.com/xBEiIlt.png) > 雖然可以爬好幾10xxx張，但是看了資訊安全政策可能會被 ban > 所以弄了一個驗證碼產生器，自製訓練庫， > 好處是可自定義驗證碼形式，本機端即可儲存圖片。實際操作畫面如下： ![](https://i.imgur.com/ORSlEI8.png) 簡單的製作驗證碼範例： https://ithelp.ithome.com.tw/articles/10159457 > 人工智慧課堂企劃報告後，老師說明有 OCR 這個目前由 Google 維護的識別引擎，用途為可辨識掃描的文件/圖片中的文字 > 老師提供實務上業界的需求，可應用在掃描健檢報告數據，存入資料庫。 **實做起來遇到一些問題微調操作** **所以做個紀錄有了這篇筆記整理** :page_facing_up: ::: > [name=wa.__.wa] > [人工智慧應用課程](http://120.108.221.55/PROFCHWU/dctai/index.php) > [TOC] ## 安裝 Tessract / 語言包： > `homebrew install tesseract-lang` https://formulae.brew.sh/formula/tesseract-lang 語言包一次打包/ 需要的再匯入 /usr/local/Cellar/tesseract/4.x/share/tessdata > `homebrew install tesseract ` https://formulae.brew.sh/formula/tesseract ## 編譯安裝問題： ``` Error: invalid option: --with-training-tools ``` 要安裝訓練模型時，發現 tesseract 最新4.x版本 homebrew 沒有訓練模型相關指令撰寫文章當下根據網路文章找到了解決方法需自行編譯 build,make install **參考下列文章（感謝文章作者）：** https://juejin.im/post/6844904097863188487 ## 可下載 `python` 的對應模組： ```python pip install pytesseract ``` 並搭配 pillow opencv numpy ![](https://i.imgur.com/Jk2wQ7i.png) 詳細教學： https://lufor129.medium.com/pytesseract-辨識圖片中的文字-b1024f678fac https://blog.csdn.net/huitailangyz/article/details/80390090 ## 查看安裝版本： ```python tesseract 4.1.1 leptonica-1.80.0 libgif 5.2.1 : libjpeg 9d : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 1.1.0 : libopenjp2 2.3.1 Found AVX2 Found AVX Found FMA Found SSE ``` ![](https://i.imgur.com/HVS8rMt.png) ## 參數設定：列出可使用的語言包 `tesseract --list-langs` * chi_tra 繁體 * chi_tra_vert 繁體，直排 `tesseract -l test.png test` -l :選擇輸入語言/ 預設是英文 default “eng” --- ### OEM * --oem ： OCR Engine modes, 演算法類型 / 目前沒了解這塊用預設 ``` tesseract --help-oem OCR Engine modes: 0 Legacy engine only. 1 Neural nets LSTM engine only. 2 Legacy + LSTM engines. 3 Default, based on what is available. ``` ```htmlembedded 參考文章解釋为了识别包含单个字符的图像，我们通常使用卷积神经网络(CNN)。任意长度的文本是一系列字符，这类问题可以通过RNNs解决，LSTM是RNN的一种常用形式 ``` ![](https://i.imgur.com/TdxEkRl.png) --- --psm : 判別語句類型模式 / 大陸翻譯：自动页面分割模式 * 0：定向脚本监测（OSD） * 1：使用OSD自动分页 * 2 ：自动分页，但是不使用OSD或OCR * 3 ：全自动分页，但是没有使用OSD（默认） * 4 ：假设可变大小的一个文本列。 * 5 ：假设垂直对齐文本的单个统一块。 * 6 ：假设一个统一的文本块。 - [x] 7 ：將圖片視為一行文字句子。 - [x] 8 ：將圖片視為一個『單字』。 - [x] 9 ：將圖像視為『環狀字』。 - [x] 10 ：將圖片視為一個『字母』。參數不懂的沿用參考文章簡體中文翻譯 PSM 參數詳細說明： https://pyimagesearch.com/2021/11/15/tesseract-page-segmentation-modes-psms-explained-how-to-improve-your-ocr-accuracy/ ## 測試 > 執行畫面： ![](https://i.imgur.com/Us8mDqc.png) ![](https://i.imgur.com/Q6Vdk25.png) 簡易驗證碼： ![](https://i.imgur.com/SVxpeYZ.png) 較複雜驗證碼：登入系統驗證碼 ![](https://i.imgur.com/ScpJvjH.png) 線條干擾誤判斷成 - 符號 ## 預先處理圖片 / 提升辨識度再來就是要測試一些更難辨識的照片跟訓練了官方文檔解釋： https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html#image-processing 每個Image都有不同的標記，需要乾淨能夠辨識標記的物件 OpenCV 套件預先處理圖片辨識原則： * 至少為300 DPI為最佳 * 白底黑字，字體清晰 * 去噪，灰度圖二值化轉灰度方式： ![](https://i.imgur.com/z4wqhcP.png) mode 可為下列幾種  ```python · 1 (1-bit pixels, black and white, stored with one pixel per byte) · L (8-bit pixels, black and white) · P (8-bit pixels, mapped to any other mode using a colour palette) · RGB (3x8-bit pixels, true colour) · RGBA (4x8-bit pixels, true colour with transparency mask) · CMYK (4x8-bit pixels, colour separation) · YCbCr (3x8-bit pixels, colour video format) · I (32-bit signed integer pixels) · F (32-bit floating point pixels) ``` opencv 辨識教學 : https://lufor129.medium.com/pytesseract-辨識圖片中的文字-b1024f678fac 常見錯誤： https://zhuanlan.zhihu.com/p/35590364 套件記得載 ![](https://i.imgur.com/Qr4oQwf.png) ```python # opencv 方式讀取圖檔轉灰度 imgcv = cv2.imread('CaptchaImage.jpg', cv2.IMREAD_GRAYSCALE) type(imgcv) cv2.imshow('CaptchaImage', imgcv) cv2.imwrite('output/gray05.tiff', imgcv) cv2.waitKey(0) cv2.destroyAllWindows() ```  灰度轉二值化(去噪): ```python print 'binary mode: ', binary.mode print binary.getpixel((0, 0)) co = binary.getcolors() print co #輸出 ---------------------------------------- binary mode: 1 0 #左上角為黑色 [(503, 0), (1993, 1)] 共有幾個黑與白像素點 ``` 判斷文字位置： https://youtu.be/6DjFscX4I_c?t=502 訓練文字辨識，手動標註｜工人智慧的力量 ## jTessBoxEditor / VietOCR-4.5 ### 安裝下載位置： https://sourceforge.net/projects/vietocr/ 軟體資訊： ![](https://i.imgur.com/LcyLe83.png) ## 使用語法總覽 ```bash= Usage: tesseract --help | --help-extra | --version tesseract --list-langs tesseract imagename outputbase [options...] [configfile...] OCR options: -l LANG[+LANG] Specify language(s) used for OCR. NOTE: These options must occur before any configfile. Single options: --help Show this help message. --help-extra Show extra help for advanced users. --version Show version information. --list-langs List available languages for tesseract engine. ``` ```python= Page segmentation modes: 0 Orientation and script detection (OSD) only. 1 Automatic page segmentation with OSD. 2 Automatic page segmentation, but no OSD, or OCR. (not implemented) 3 Fully automatic page segmentation, but no OSD. (Default) 4 Assume a single column of text of variable sizes. 5 Assume a single uniform block of vertically aligned text. 6 Assume a single uniform block of text. 7 Treat the image as a single text line. 8 Treat the image as a single word. 9 Treat the image as a single word in a circle. 10 Treat the image as a single character. 11 Sparse text. Find as much text as possible in no particular order. 12 Sparse text with OSD. 13 Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific. OCR Engine modes: 0 Legacy engine only. 1 Neural nets LSTM engine only. 2 Legacy + LSTM engines. 3 Default, based on what is available. Single options: -h, --help Show minimal help message. --help-extra Show extra help for advanced users. --help-psm Show page segmentation modes. --help-oem Show OCR Engine modes. -v, --version Show version information. --list-langs List available languages for tesseract engine. --print-parameters Print tesseract parameters. ``` 未完待續，下方整理了一些相關文章，有興趣的同學可以參考。 #### 實用文章： Python爬蟲結合辨識圖片文字 >https://medium.com/@kaojia/資料分析-python爬蟲專案實作-辨識圖片文字-驗證碼-pytesseract套件-43e7adb29316 此篇文章作者有一系列影像辨識/深度學習教學文章值得細讀 > https://vocus.cc/salon/crab [OCR_應用]Tesseract-OCR_Config說明 https://vocus.cc/tags/Tesseract [ 實用心得 ] Tesseract-OCR > [https://medium.com/@b98606021/實用心得-tesseract-ocr](https://medium.com/@b98606021/實用心得-tesseract-ocr-eef4fcd425f0) mac上文字识别(Tesseract-OCR for mac ) >https://blog.csdn.net/u010670689/article/details/78374623 OpenCV-Python Tutorials document & tutorial > [Image Thresholding](https://opencv24-python-tutorials.readthedocs.io/en/latest/py_tutorials/py_imgproc/py_thresholding/py_thresholding.html#thresholding) 使用Tesseract训练lang文件并OCR识别集装箱号 > https://www.jianshu.com/p/5f847d8089ce python 读取、保存、二值化、灰度化图片+opencv处理图片的方法 >https://blog.csdn.net/JohinieLi/article/details/69389980 设置白名单:让tesseract-OCR只识别指定内容 > tessdata/configs/digits 沒找到 > https://blog.csdn.net/ADebugMan/article/details/100592200 ### 檔案放置區: 組員分工：我負責程式.夥伴負責繪圖 >[Tesseract-OCR 辨識＆驗證碼分析及實作](https://medium.com/mr-wang/文字辨識實作及逆向解析-e42078f50916) 000webhost 提供免費虛擬主機服務 (支援 PHP)(網站也不會內置其他廣告)，當初臨時架站使用驗證碼用途找到的，但太久沒流量會砍站的樣子(某年某天查看掛了) ### 實際圖片辨識： https://linevoom.line.me/user/_dZzGNZLRzW9RJ1tQUGgcLuD1WYPEsedKzf9Jbjo 部署到 Heroku 串接 Line Bot https://www.heroku.com 數ㄕㄨˇ數ㄕㄨˋ鼠叔叔 (掛了) 2024 可以換成 Render.com 架站 https://linevoom.line.me/user/_dZzGNZLRzW9RJ1tQUGgcLuD1WYPEsedKzf9Jbjo ![截圖 2023-12-03 下午11.36.00](https://hackmd.io/_uploads/Hyrw-X5rp.png) ![](https://i.imgur.com/IxnKVer.png) 4! 8! :::spoiler 4! = 24 ~~所以辨識錯了~~ ![](https://hackmd.io/_uploads/SkWEbRFGT.png)