LHB阿好伯, 2021/07/16
R
Python
Ubuntu
Software
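This article shows how to add OCR (optical character recognition) to a scanned PDF.
OCR adds a text layer, so the text in the PDF can be selected and copied.
I recommend running it on Ubuntu or WSL.
For WSL, see my earlier article, "Self-hosting RStudio Server (Workbench) on Win10".
At present it cannot be used directly on Win10.
Installation is very simple: ocrmypdf installs with a single command.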
Operating system | Install command |
---|---|
Debian, Ubuntu | apt install ocrmypdf |
Windows Subsystem for Linux | apt install ocrmypdf |
Fedora | dnf install ocrmypdf |
macOS | brew install ocrmypdf |
LinuxBrew | brew install ocrmypdf |
FreeBSD | pkg install py37-ocrmypdf |
Conda | conda install ocrmypdf |
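After running one of the install commands above, it can be useful to confirm that the tools actually ended up on the `PATH`. The following is a minimal sketch in Python, not part of the original workflow. It assumes that the executables are named `ocrmypdf` and `tesseract`, and the helper name `missing_tools` is made up for this example.

```python
import shutil


def missing_tools(tools=("ocrmypdf", "tesseract"), which=shutil.which):
    """Return the names in `tools` that cannot be found on PATH.

    `which` is injectable so the check can be tried without touching
    the real system; by default it is shutil.which.
    """
    return [name for name in tools if which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("ocrmypdf and tesseract are both available")
```

If anything is reported as missing, rerunning the install command for your platform from the table is the first thing to try.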
The available Tesseract language packs can be listed with the following command:
apt-cache search tesseract-ocr
python3-tesserocr - Python wrapper for the tesseract-ocr API (Python3 version)
tesseract-ocr - Tesseract command line OCR tool
tesseract-ocr-afr - tesseract-ocr language files for Afrikaans
tesseract-ocr-all - Tesseract OCR with all language and script packages
tesseract-ocr-amh - tesseract-ocr language files for Amharic
tesseract-ocr-ara - tesseract-ocr language files for Arabic
tesseract-ocr-asm - tesseract-ocr language files for Assamese
tesseract-ocr-aze - tesseract-ocr language files for Azerbaijani
tesseract-ocr-aze-cyrl - tesseract-ocr language files for Azerbaijani (Cyrillic)
tesseract-ocr-bel - tesseract-ocr language files for Belarusian
tesseract-ocr-ben - tesseract-ocr language files for Bengali
tesseract-ocr-bod - tesseract-ocr language files for Tibetan Standard
tesseract-ocr-bos - tesseract-ocr language files for Bosnian
tesseract-ocr-bre - tesseract-ocr language files for Breton
tesseract-ocr-bul - tesseract-ocr language files for Bulgarian
tesseract-ocr-cat - tesseract-ocr language files for Catalan
tesseract-ocr-ceb - tesseract-ocr language files for Cebuano
tesseract-ocr-ces - tesseract-ocr language files for Czech
tesseract-ocr-chi-sim - tesseract-ocr language files for Chinese - Simplified
tesseract-ocr-chi-sim-vert - tesseract-ocr language files for Chinese - Simplified (vertical)
tesseract-ocr-chi-tra - tesseract-ocr language files for Chinese - Traditional
tesseract-ocr-chi-tra-vert - tesseract-ocr language files for Chinese - Traditional (vertical)
tesseract-ocr-chr - tesseract-ocr language files for Cherokee
tesseract-ocr-cos - tesseract-ocr language files for Corsican
tesseract-ocr-cym - tesseract-ocr language files for Welsh
tesseract-ocr-dan - tesseract-ocr language files for Danish
tesseract-ocr-deu - tesseract-ocr language files for German
tesseract-ocr-div - tesseract-ocr language files for Divehi
tesseract-ocr-dzo - tesseract-ocr language files for Dzongkha
tesseract-ocr-ell - tesseract-ocr language files for Greek
tesseract-ocr-eng - tesseract-ocr language files for English
tesseract-ocr-enm - tesseract-ocr language files for English, Middle (1100-1500)
tesseract-ocr-epo - tesseract-ocr language files for Esperanto
tesseract-ocr-est - tesseract-ocr language files for Estonian
tesseract-ocr-eus - tesseract-ocr language files for Basque
tesseract-ocr-fao - tesseract-ocr language files for Faroese
tesseract-ocr-fas - tesseract-ocr language files for Persian
tesseract-ocr-fil - tesseract-ocr language files for Filipino
tesseract-ocr-fin - tesseract-ocr language files for Finnish
tesseract-ocr-fra - tesseract-ocr language files for French
tesseract-ocr-frk - tesseract-ocr language files for German (Fraktur)
tesseract-ocr-frm - tesseract-ocr language files for French, Middle (ca.1400-1600)
tesseract-ocr-fry - tesseract-ocr language files for Frisian (Western)
tesseract-ocr-gla - tesseract-ocr language files for Gaelic (Scots)
tesseract-ocr-gle - tesseract-ocr language files for Irish
tesseract-ocr-glg - tesseract-ocr language files for Galician
tesseract-ocr-grc - tesseract-ocr language files for Greek, Ancient (to 1453)
tesseract-ocr-guj - tesseract-ocr language files for Gujarati
tesseract-ocr-hat - tesseract-ocr language files for Haitian
tesseract-ocr-heb - tesseract-ocr language files for Hebrew
tesseract-ocr-hin - tesseract-ocr language files for Hindi
tesseract-ocr-hrv - tesseract-ocr language files for Croatian
tesseract-ocr-hun - tesseract-ocr language files for Hungarian
tesseract-ocr-hye - tesseract-ocr language files for Armenian
tesseract-ocr-iku - tesseract-ocr language files for Inuktitut
tesseract-ocr-ind - tesseract-ocr language files for Indonesian
tesseract-ocr-isl - tesseract-ocr language files for Icelandic
tesseract-ocr-ita - tesseract-ocr language files for Italian
tesseract-ocr-ita-old - tesseract-ocr language files for Italian - Old
tesseract-ocr-jav - tesseract-ocr language files for Javanese
tesseract-ocr-jpn - tesseract-ocr language files for Japanese
tesseract-ocr-jpn-vert - tesseract-ocr language files for Japanese (vertical)
tesseract-ocr-kan - tesseract-ocr language files for Kannada
tesseract-ocr-kat - tesseract-ocr language files for Georgian
tesseract-ocr-kat-old - tesseract-ocr language files for Old Georgian
tesseract-ocr-kaz - tesseract-ocr language files for Kazakh
tesseract-ocr-khm - tesseract-ocr language files for Khmer
tesseract-ocr-kir - tesseract-ocr language files for Kyrgyz
tesseract-ocr-kmr - tesseract-ocr language files for Kurmanji (Latin)
tesseract-ocr-kor - tesseract-ocr language files for Korean
tesseract-ocr-kor-vert - tesseract-ocr language files for Korean (vertical)
tesseract-ocr-lao - tesseract-ocr language files for Lao
tesseract-ocr-lat - tesseract-ocr language files for Latin
tesseract-ocr-lav - tesseract-ocr language files for Latvian
tesseract-ocr-lit - tesseract-ocr language files for Lithuanian
tesseract-ocr-ltz - tesseract-ocr language files for Luxembourgish
tesseract-ocr-mal - tesseract-ocr language files for Malayalam
tesseract-ocr-mar - tesseract-ocr language files for Marathi
tesseract-ocr-mkd - tesseract-ocr language files for Macedonian
tesseract-ocr-mlt - tesseract-ocr language files for Maltese
tesseract-ocr-mon - tesseract-ocr language files for Mongolian
tesseract-ocr-mri - tesseract-ocr language files for Maori
tesseract-ocr-msa - tesseract-ocr language files for Malay
tesseract-ocr-mya - tesseract-ocr language files for Burmese
tesseract-ocr-nep - tesseract-ocr language files for Nepali
tesseract-ocr-nld - tesseract-ocr language files for Dutch
tesseract-ocr-nor - tesseract-ocr language files for Norwegian
tesseract-ocr-oci - tesseract-ocr language files for Occitan (post 1500)
tesseract-ocr-ori - tesseract-ocr language files for Oriya
tesseract-ocr-osd - tesseract-ocr language files for script and orientation
tesseract-ocr-pan - tesseract-ocr language files for Punjabi
tesseract-ocr-pol - tesseract-ocr language files for Polish
tesseract-ocr-por - tesseract-ocr language files for Portuguese
tesseract-ocr-pus - tesseract-ocr language files for Pashto
tesseract-ocr-que - tesseract-ocr language files for Quechua
tesseract-ocr-ron - tesseract-ocr language files for Romanian
tesseract-ocr-rus - tesseract-ocr language files for Russian
tesseract-ocr-san - tesseract-ocr language files for Sanskrit
tesseract-ocr-script-arab - tesseract-ocr data for Arabic script
tesseract-ocr-script-armn - tesseract-ocr data for Armenian script
tesseract-ocr-script-beng - tesseract-ocr data for Bengali script
tesseract-ocr-script-cans - tesseract-ocr data for Canadian Aboriginal script
tesseract-ocr-script-cher - tesseract-ocr data for Cherokee script
tesseract-ocr-script-cyrl - tesseract-ocr data for Cyrillic script
tesseract-ocr-script-deva - tesseract-ocr data for Devanagari script
tesseract-ocr-script-ethi - tesseract-ocr data for Ethiopic script
tesseract-ocr-script-frak - tesseract-ocr data for Fraktur script
tesseract-ocr-script-geor - tesseract-ocr data for Georgian script
tesseract-ocr-script-grek - tesseract-ocr data for Greek script
tesseract-ocr-script-gujr - tesseract-ocr data for Gujarati script
tesseract-ocr-script-guru - tesseract-ocr data for Gurmukhi script
tesseract-ocr-script-hang - tesseract-ocr data for Hangul script
tesseract-ocr-script-hang-vert - tesseract-ocr data for Hangul (vertical) script
tesseract-ocr-script-hans - tesseract-ocr data for Han - Simplified script
tesseract-ocr-script-hans-vert - tesseract-ocr data for Han - Simplified (vertical) script
tesseract-ocr-script-hant - tesseract-ocr data for Han - Traditional script
tesseract-ocr-script-hant-vert - tesseract-ocr data for Han - Traditional (vertical) script
tesseract-ocr-script-hebr - tesseract-ocr data for Hebrew script
tesseract-ocr-script-jpan - tesseract-ocr data for Japanese script
tesseract-ocr-script-jpan-vert - tesseract-ocr data for Japanese (vertical) script
tesseract-ocr-script-khmr - tesseract-ocr data for Khmer script
tesseract-ocr-script-knda - tesseract-ocr data for Kannada script
tesseract-ocr-script-laoo - tesseract-ocr data for Lao script
tesseract-ocr-script-latn - tesseract-ocr data for Latin script
tesseract-ocr-script-mlym - tesseract-ocr data for Malayalam script
tesseract-ocr-script-mymr - tesseract-ocr data for Myanmar script
tesseract-ocr-script-orya - tesseract-ocr data for Oriya (Odia) script
tesseract-ocr-script-sinh - tesseract-ocr data for Sinhala script
tesseract-ocr-script-syrc - tesseract-ocr data for Syriac script
tesseract-ocr-script-taml - tesseract-ocr data for Tamil script
tesseract-ocr-script-telu - tesseract-ocr data for Telugu script
tesseract-ocr-script-thaa - tesseract-ocr data for Thaana script
tesseract-ocr-script-thai - tesseract-ocr data for Thai script
tesseract-ocr-script-tibt - tesseract-ocr data for Tibetan script
tesseract-ocr-script-viet - tesseract-ocr data for Vietnamese script
tesseract-ocr-sin - tesseract-ocr language files for Sinhala
tesseract-ocr-slk - tesseract-ocr language files for Slovakian
tesseract-ocr-slv - tesseract-ocr language files for Slovenian
tesseract-ocr-snd - tesseract-ocr language files for Sindhi
tesseract-ocr-spa - tesseract-ocr language files for Spanish
tesseract-ocr-spa-old - tesseract-ocr language files for Spanish, Castilian - Old
tesseract-ocr-sqi - tesseract-ocr language files for Albanian
tesseract-ocr-srp - tesseract-ocr language files for Serbian
tesseract-ocr-srp-latn - tesseract-ocr language files for Serbian (Latin)
tesseract-ocr-sun - tesseract-ocr language files for Sundanese
tesseract-ocr-swa - tesseract-ocr language files for Swahili
tesseract-ocr-swe - tesseract-ocr language files for Swedish
tesseract-ocr-syr - tesseract-ocr language files for Syriac
tesseract-ocr-tam - tesseract-ocr language files for Tamil
tesseract-ocr-tat - tesseract-ocr language files for Tatar
tesseract-ocr-tel - tesseract-ocr language files for Telugu
tesseract-ocr-tgk - tesseract-ocr language files for Tajik
tesseract-ocr-tha - tesseract-ocr language files for Thai
tesseract-ocr-tir - tesseract-ocr language files for Tigrinya
tesseract-ocr-ton - tesseract-ocr language files for Tonga
tesseract-ocr-tur - tesseract-ocr language files for Turkish
tesseract-ocr-uig - tesseract-ocr language files for Uyghur
tesseract-ocr-ukr - tesseract-ocr language files for Ukrainian
tesseract-ocr-urd - tesseract-ocr language files for Urdu
tesseract-ocr-uzb - tesseract-ocr language files for Uzbek
tesseract-ocr-uzb-cyrl - tesseract-ocr language files for Uzbek (Cyrillic)
tesseract-ocr-vie - tesseract-ocr language files for Vietnamese
tesseract-ocr-yid - tesseract-ocr language files for Yiddish
tesseract-ocr-yor - tesseract-ocr language files for Yoruba
For this article's test document, which mixes English and Traditional Chinese, the relevant language packs are:
tesseract-ocr-eng - tesseract-ocr language files for English
tesseract-ocr-chi-tra - tesseract-ocr language files for Chinese - Traditional
tesseract-ocr-chi-tra-vert - tesseract-ocr language files for Chinese - Traditional (vertical)
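The apt package names in the listing use hyphens, while the language codes that Tesseract and the `-l` option of `ocrmypdf` expect use underscores (for example `chi_tra`), joined with `+` when several languages are combined. The sketch below shows one way to turn package names into an `ocrmypdf` command line. The function names and the file names `scan.pdf` and `scan_ocr.pdf` are hypothetical, and skipping the `script-`, `osd` and `all` packages is a simplification made for this example.

```python
PREFIX = "tesseract-ocr-"


def package_to_lang(package):
    """Convert an apt package name such as 'tesseract-ocr-chi-tra' into
    the Tesseract language code 'chi_tra'.

    Returns None for anything that is not a plain language pack
    (script data, orientation data, the -all metapackage, other packages).
    """
    if not package.startswith(PREFIX):
        return None
    suffix = package[len(PREFIX):]
    if suffix in ("all", "osd") or suffix.startswith("script-"):
        return None
    return suffix.replace("-", "_")


def build_ocr_command(input_pdf, output_pdf, packages):
    """Build the ocrmypdf argument list for the given language packages."""
    langs = [lang for lang in map(package_to_lang, packages) if lang]
    cmd = ["ocrmypdf"]
    if langs:
        # Several languages are combined with '+', e.g. chi_tra+eng
        cmd += ["-l", "+".join(langs)]
    return cmd + [input_pdf, output_pdf]


if __name__ == "__main__":
    cmd = build_ocr_command(
        "scan.pdf",       # hypothetical input file
        "scan_ocr.pdf",   # hypothetical output file
        ["tesseract-ocr-chi-tra", "tesseract-ocr-eng"],
    )
    print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the matching language packages have been installed with `apt install`.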
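I converted my previous article to a PDF to test it.
As the result below shows, the recognized text contains extra spaces,
but the recognition quality is quite good.
For the detailed results, see the sample file.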
PDF 免費 解鎖 _pikepdf 使 用 python&R 快 速 解鎖
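The extra spaces in the sample line above sit between Chinese characters. If the copied text is going to be reused, they can be removed afterwards. The following is a minimal sketch of such a cleanup step and not something OCRmyPDF does itself: it works on text copied out of the PDF, leaves the PDF's text layer unchanged, and assumes that only the basic CJK Unified Ideographs range (U+4E00 to U+9FFF) needs handling. The function name is made up for this example.

```python
import re

# Assumption: only the basic CJK Unified Ideographs block is considered.
_CJK = r"[\u4e00-\u9fff]"

# Whitespace that has a CJK character directly before and after it.
_GAP = re.compile(rf"(?<={_CJK})\s+(?={_CJK})")


def collapse_cjk_spaces(text):
    """Remove whitespace between two CJK characters.

    Spaces next to Latin letters, digits or punctuation are kept.
    """
    return _GAP.sub("", text)


if __name__ == "__main__":
    sample = "PDF 免費 解鎖 _pikepdf 使 用 python&R 快 速 解鎖"
    print(collapse_cjk_spaces(sample))
```

Spaces that separate Chinese from English words are kept on purpose, since they often help readability in mixed text.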
It also ran without problems on the RStudio Server I previously set up on Win10.
If the input PDF already contains a text layer, ocrmypdf stops with the following error:
ERROR - 1: page already has text! - aborting (use --force-ocr to force OCR)
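The error message itself points to `--force-ocr`. OCRmyPDF also offers `--skip-text`, which leaves pages that already have text untouched and only runs OCR on the rest. The sketch below builds the command line for each case. The function name, the mode names `error`, `skip` and `force`, and the file names are hypothetical, and the flags are written as I understand them from the OCRmyPDF documentation, so check `ocrmypdf --help` for your installed version.

```python
def ocr_command(input_pdf, output_pdf, existing_text="error"):
    """Build an ocrmypdf argument list.

    existing_text controls what happens on pages that already have text:
      "error" - default behaviour, ocrmypdf aborts with the error above
      "skip"  - pass --skip-text, keep those pages as they are
      "force" - pass --force-ocr, rasterize and OCR every page again
    """
    flags = {
        "error": [],
        "skip": ["--skip-text"],
        "force": ["--force-ocr"],
    }
    if existing_text not in flags:
        raise ValueError(f"unknown mode: {existing_text!r}")
    return ["ocrmypdf", *flags[existing_text], input_pdf, output_pdf]


if __name__ == "__main__":
    # Hypothetical file names, shown for illustration only.
    print(" ".join(ocr_command("mixed.pdf", "mixed_ocr.pdf", "skip")))
```

`--force-ocr` replaces any existing text layer with the OCR result, so `--skip-text` is usually the gentler choice for PDFs where only some pages are scans.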
🌟 The full article can be read, or added to, at the link below
https://hackmd.io/@LHB-0222/PDF_OCR
The full article is also shared at
https://www.facebook.com/LHB0222/
https://www.instagram.com/ahb0222/
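If you have questions or want to discuss anything, feel free to leave a comment below
If you like it, please share it with all your friends \o/
If you spot any errors, corrections are welcome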