Free PDF Text Recognition (OCR) with ocrmypdf

LHB阿好伯, 2021/07/16

tags: R Python Ubuntu Software

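This article shows how to run OCR (optical character recognition) on a scanned PDF.
OCR adds a text layer to the file so that the text in the PDF can be copied.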

Installation

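I recommend using Ubuntu or WSL.
For WSL, see my earlier article, "Self-hosting RStudio Server (Workbench) on Win10".
At the time of writing, ocrmypdf cannot be used directly on Win10.
Installation is very simple and takes a single line: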

sudo apt install ocrmypdf
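
After installing, it can help to confirm that the tools are actually on the PATH. The following is a minimal Python sketch rather than part of the original workflow. The helper name `find_tools` is hypothetical, and `gs` is assumed to be the Ghostscript executable name on your system. It only looks the executables up with the standard library's `shutil.which` and does not run them.

```python
import shutil


def find_tools(names=("ocrmypdf", "tesseract", "gs")):
    """Return a dict mapping each tool name to its full path, or None if missing."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    # Print one line per tool so a missing dependency is easy to spot.
    for name, path in find_tools().items():
        print(f"{name}: {path or 'not found'}")
```

If any entry prints "not found", the corresponding package probably still needs to be installed.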

sudo apt-get -y remove ocrmypdf  # remove system ocrmypdf, if installed
sudo apt-get -y update
sudo apt-get -y install \
    ghostscript \
    icc-profiles-free \
    libxml2 \
    pngquant \
    python3-pip \
    tesseract-ocr \
    zlib1g

Operating system               Install command
Debian, Ubuntu                 apt install ocrmypdf
Windows Subsystem for Linux    apt install ocrmypdf
Fedora                         dnf install ocrmypdf
macOS                          brew install ocrmypdf
LinuxBrew                      brew install ocrmypdf
FreeBSD                        pkg install py37-ocrmypdf
Conda                          conda install ocrmypdf

List the available language packs

apt-cache search tesseract-ocr

python3-tesserocr - Python wrapper for the tesseract-ocr API (Python3 version)
tesseract-ocr - Tesseract command line OCR tool
tesseract-ocr-afr - tesseract-ocr language files for Afrikaans
tesseract-ocr-all - Tesseract OCR with all language and script packages
tesseract-ocr-amh - tesseract-ocr language files for Amharic
tesseract-ocr-ara - tesseract-ocr language files for Arabic
tesseract-ocr-asm - tesseract-ocr language files for Assamese
tesseract-ocr-aze - tesseract-ocr language files for Azerbaijani
tesseract-ocr-aze-cyrl - tesseract-ocr language files for Azerbaijani (Cyrillic)
tesseract-ocr-bel - tesseract-ocr language files for Belarusian
tesseract-ocr-ben - tesseract-ocr language files for Bengali
tesseract-ocr-bod - tesseract-ocr language files for Tibetan Standard
tesseract-ocr-bos - tesseract-ocr language files for Bosnian
tesseract-ocr-bre - tesseract-ocr language files for Breton
tesseract-ocr-bul - tesseract-ocr language files for Bulgarian
tesseract-ocr-cat - tesseract-ocr language files for Catalan
tesseract-ocr-ceb - tesseract-ocr language files for Cebuano
tesseract-ocr-ces - tesseract-ocr language files for Czech
tesseract-ocr-chi-sim - tesseract-ocr language files for Chinese - Simplified
tesseract-ocr-chi-sim-vert - tesseract-ocr language files for Chinese - Simplified (vertical)
tesseract-ocr-chi-tra - tesseract-ocr language files for Chinese - Traditional
tesseract-ocr-chi-tra-vert - tesseract-ocr language files for Chinese - Traditional (vertical)
tesseract-ocr-chr - tesseract-ocr language files for Cherokee
tesseract-ocr-cos - tesseract-ocr language files for Corsican
tesseract-ocr-cym - tesseract-ocr language files for Welsh
tesseract-ocr-dan - tesseract-ocr language files for Danish
tesseract-ocr-deu - tesseract-ocr language files for German
tesseract-ocr-div - tesseract-ocr language files for Divehi
tesseract-ocr-dzo - tesseract-ocr language files for Dzongkha
tesseract-ocr-ell - tesseract-ocr language files for Greek
tesseract-ocr-eng - tesseract-ocr language files for English
tesseract-ocr-enm - tesseract-ocr language files for English, Middle (1100-1500)
tesseract-ocr-epo - tesseract-ocr language files for Esperanto
tesseract-ocr-est - tesseract-ocr language files for Estonian
tesseract-ocr-eus - tesseract-ocr language files for Basque
tesseract-ocr-fao - tesseract-ocr language files for Faroese
tesseract-ocr-fas - tesseract-ocr language files for Persian
tesseract-ocr-fil - tesseract-ocr language files for Filipino
tesseract-ocr-fin - tesseract-ocr language files for Finnish
tesseract-ocr-fra - tesseract-ocr language files for French
tesseract-ocr-frk - tesseract-ocr language files for German (Fraktur)
tesseract-ocr-frm - tesseract-ocr language files for French, Middle (ca.1400-1600)
tesseract-ocr-fry - tesseract-ocr language files for Frisian (Western)
tesseract-ocr-gla - tesseract-ocr language files for Gaelic (Scots)
tesseract-ocr-gle - tesseract-ocr language files for Irish
tesseract-ocr-glg - tesseract-ocr language files for Galician
tesseract-ocr-grc - tesseract-ocr language files for Greek, Ancient (to 1453)
tesseract-ocr-guj - tesseract-ocr language files for Gujarati
tesseract-ocr-hat - tesseract-ocr language files for Haitian
tesseract-ocr-heb - tesseract-ocr language files for Hebrew
tesseract-ocr-hin - tesseract-ocr language files for Hindi
tesseract-ocr-hrv - tesseract-ocr language files for Croatian
tesseract-ocr-hun - tesseract-ocr language files for Hungarian
tesseract-ocr-hye - tesseract-ocr language files for Armenian
tesseract-ocr-iku - tesseract-ocr language files for Inuktitut
tesseract-ocr-ind - tesseract-ocr language files for Indonesian
tesseract-ocr-isl - tesseract-ocr language files for Icelandic
tesseract-ocr-ita - tesseract-ocr language files for Italian
tesseract-ocr-ita-old - tesseract-ocr language files for Italian - Old
tesseract-ocr-jav - tesseract-ocr language files for Javanese
tesseract-ocr-jpn - tesseract-ocr language files for Japanese
tesseract-ocr-jpn-vert - tesseract-ocr language files for Japanese (vertical)
tesseract-ocr-kan - tesseract-ocr language files for Kannada
tesseract-ocr-kat - tesseract-ocr language files for Georgian
tesseract-ocr-kat-old - tesseract-ocr language files for Old Georgian
tesseract-ocr-kaz - tesseract-ocr language files for Kazakh
tesseract-ocr-khm - tesseract-ocr language files for Khmer
tesseract-ocr-kir - tesseract-ocr language files for Kyrgyz
tesseract-ocr-kmr - tesseract-ocr language files for Kurmanji (Latin)
tesseract-ocr-kor - tesseract-ocr language files for Korean
tesseract-ocr-kor-vert - tesseract-ocr language files for Korean (vertical)
tesseract-ocr-lao - tesseract-ocr language files for Lao
tesseract-ocr-lat - tesseract-ocr language files for Latin
tesseract-ocr-lav - tesseract-ocr language files for Latvian
tesseract-ocr-lit - tesseract-ocr language files for Lithuanian
tesseract-ocr-ltz - tesseract-ocr language files for Luxembourgish
tesseract-ocr-mal - tesseract-ocr language files for Malayalam
tesseract-ocr-mar - tesseract-ocr language files for Marathi
tesseract-ocr-mkd - tesseract-ocr language files for Macedonian
tesseract-ocr-mlt - tesseract-ocr language files for Maltese
tesseract-ocr-mon - tesseract-ocr language files for Mongolian
tesseract-ocr-mri - tesseract-ocr language files for Maori
tesseract-ocr-msa - tesseract-ocr language files for Malay
tesseract-ocr-mya - tesseract-ocr language files for Burmese
tesseract-ocr-nep - tesseract-ocr language files for Nepali
tesseract-ocr-nld - tesseract-ocr language files for Dutch
tesseract-ocr-nor - tesseract-ocr language files for Norwegian
tesseract-ocr-oci - tesseract-ocr language files for Occitan (post 1500)
tesseract-ocr-ori - tesseract-ocr language files for Oriya
tesseract-ocr-osd - tesseract-ocr language files for script and orientation
tesseract-ocr-pan - tesseract-ocr language files for Punjabi
tesseract-ocr-pol - tesseract-ocr language files for Polish
tesseract-ocr-por - tesseract-ocr language files for Portuguese
tesseract-ocr-pus - tesseract-ocr language files for Pashto
tesseract-ocr-que - tesseract-ocr language files for Quechua
tesseract-ocr-ron - tesseract-ocr language files for Romanian
tesseract-ocr-rus - tesseract-ocr language files for Russian
tesseract-ocr-san - tesseract-ocr language files for Sanskrit
tesseract-ocr-script-arab - tesseract-ocr data for Arabic script
tesseract-ocr-script-armn - tesseract-ocr data for Armenian script
tesseract-ocr-script-beng - tesseract-ocr data for Bengali script
tesseract-ocr-script-cans - tesseract-ocr data for Canadian Aboriginal script
tesseract-ocr-script-cher - tesseract-ocr data for Cherokee script
tesseract-ocr-script-cyrl - tesseract-ocr data for Cyrillic script
tesseract-ocr-script-deva - tesseract-ocr data for Devanagari script
tesseract-ocr-script-ethi - tesseract-ocr data for Ethiopic script
tesseract-ocr-script-frak - tesseract-ocr data for Fraktur script
tesseract-ocr-script-geor - tesseract-ocr data for Georgian script
tesseract-ocr-script-grek - tesseract-ocr data for Greek script
tesseract-ocr-script-gujr - tesseract-ocr data for Gujarati script
tesseract-ocr-script-guru - tesseract-ocr data for Gurmukhi script
tesseract-ocr-script-hang - tesseract-ocr data for Hangul script
tesseract-ocr-script-hang-vert - tesseract-ocr data for Hangul (vertical) script
tesseract-ocr-script-hans - tesseract-ocr data for Han - Simplified script
tesseract-ocr-script-hans-vert - tesseract-ocr data for Han - Simplified (vertical) script
tesseract-ocr-script-hant - tesseract-ocr data for Han - Traditional script
tesseract-ocr-script-hant-vert - tesseract-ocr data for Han - Traditional (vertical) script
tesseract-ocr-script-hebr - tesseract-ocr data for Hebrew script
tesseract-ocr-script-jpan - tesseract-ocr data for Japanese script
tesseract-ocr-script-jpan-vert - tesseract-ocr data for Japanese (vertical) script
tesseract-ocr-script-khmr - tesseract-ocr data for Khmer script
tesseract-ocr-script-knda - tesseract-ocr data for Kannada script
tesseract-ocr-script-laoo - tesseract-ocr data for Lao script
tesseract-ocr-script-latn - tesseract-ocr data for Latin script
tesseract-ocr-script-mlym - tesseract-ocr data for Malayalam script
tesseract-ocr-script-mymr - tesseract-ocr data for Myanmar script
tesseract-ocr-script-orya - tesseract-ocr data for Oriya (Odia) script
tesseract-ocr-script-sinh - tesseract-ocr data for Sinhala script
tesseract-ocr-script-syrc - tesseract-ocr data for Syriac script
tesseract-ocr-script-taml - tesseract-ocr data for Tamil script
tesseract-ocr-script-telu - tesseract-ocr data for Telugu script
tesseract-ocr-script-thaa - tesseract-ocr data for Thaana script
tesseract-ocr-script-thai - tesseract-ocr data for Thai script
tesseract-ocr-script-tibt - tesseract-ocr data for Tibetan script
tesseract-ocr-script-viet - tesseract-ocr data for Vietnamese script
tesseract-ocr-sin - tesseract-ocr language files for Sinhala
tesseract-ocr-slk - tesseract-ocr language files for Slovakian
tesseract-ocr-slv - tesseract-ocr language files for Slovenian
tesseract-ocr-snd - tesseract-ocr language files for Sindhi
tesseract-ocr-spa - tesseract-ocr language files for Spanish
tesseract-ocr-spa-old - tesseract-ocr language files for Spanish, Castilian - Old
tesseract-ocr-sqi - tesseract-ocr language files for Albanian
tesseract-ocr-srp - tesseract-ocr language files for Serbian
tesseract-ocr-srp-latn - tesseract-ocr language files for Serbian (Latin)
tesseract-ocr-sun - tesseract-ocr language files for Sundanese
tesseract-ocr-swa - tesseract-ocr language files for Swahili
tesseract-ocr-swe - tesseract-ocr language files for Swedish
tesseract-ocr-syr - tesseract-ocr language files for Syriac
tesseract-ocr-tam - tesseract-ocr language files for Tamil
tesseract-ocr-tat - tesseract-ocr language files for Tatar
tesseract-ocr-tel - tesseract-ocr language files for Telugu
tesseract-ocr-tgk - tesseract-ocr language files for Tajik
tesseract-ocr-tha - tesseract-ocr language files for Thai
tesseract-ocr-tir - tesseract-ocr language files for Tigrinya
tesseract-ocr-ton - tesseract-ocr language files for Tonga
tesseract-ocr-tur - tesseract-ocr language files for Turkish
tesseract-ocr-uig - tesseract-ocr language files for Uyghur
tesseract-ocr-ukr - tesseract-ocr language files for Ukrainian
tesseract-ocr-urd - tesseract-ocr language files for Urdu
tesseract-ocr-uzb - tesseract-ocr language files for Uzbek
tesseract-ocr-uzb-cyrl - tesseract-ocr language files for Uzbek (Cyrillic)
tesseract-ocr-vie - tesseract-ocr language files for Vietnamese
tesseract-ocr-yid - tesseract-ocr language files for Yiddish
tesseract-ocr-yor - tesseract-ocr language files for Yoruba

The entries most relevant to this article are:

tesseract-ocr-eng - tesseract-ocr language files for English
tesseract-ocr-chi-tra - tesseract-ocr language files for Chinese - Traditional
tesseract-ocr-chi-tra-vert - tesseract-ocr language files for Chinese - Traditional (vertical)

Install the commonly used language packs

sudo apt-get install tesseract-ocr-chi-tra       # Traditional Chinese
sudo apt-get install tesseract-ocr-chi-tra-vert  # Traditional Chinese (vertical)
sudo apt-get install tesseract-ocr-eng           # English
sudo apt-get install tesseract-ocr-jpn           # Japanese
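
Note that the apt package names use hyphens (chi-tra) while the language codes passed to `-l` use underscores (chi_tra). To check which language codes tesseract can actually see, something like the Python sketch below may be useful. It assumes the listing comes from `tesseract --list-langs`; the function names are hypothetical, and the parsing assumes one code per line after a "List of ..." header line.

```python
import subprocess


def parse_langs(output):
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    lines = [line.strip() for line in output.splitlines()]
    # Skip blank lines and the header, e.g. "List of available languages (3):".
    return [line for line in lines if line and not line.startswith("List of")]


def missing_langs(wanted, output):
    """Return the wanted language codes that do not appear in the listing."""
    installed = set(parse_langs(output))
    return [lang for lang in wanted if lang not in installed]


def installed_langs_text():
    """Run tesseract and return its language listing (requires tesseract)."""
    result = subprocess.run(
        ["tesseract", "--list-langs"], capture_output=True, text=True, check=True
    )
    # Assumption: some tesseract versions print the list on stderr,
    # so both streams are combined before parsing.
    return result.stdout + result.stderr
```

For example, `missing_langs(["eng", "chi_tra"], installed_langs_text())` should return an empty list once the packages above are installed.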

Run OCR for Chinese and English

ocrmypdf -l eng+chi_tra old_pdf new_pdf
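
The same call can be scripted from Python. This is a minimal sketch that shells out to the `ocrmypdf` command shown above. The function names `build_ocr_command` and `run_ocr` are hypothetical. Passing the arguments as a list, rather than as one string, means file names containing spaces need no extra quoting.

```python
import subprocess


def build_ocr_command(input_pdf, output_pdf, langs=("eng", "chi_tra"), extra=()):
    """Build the argument list for ocrmypdf; languages are joined with '+'."""
    return ["ocrmypdf", "-l", "+".join(langs), *extra, str(input_pdf), str(output_pdf)]


def run_ocr(input_pdf, output_pdf, langs=("eng", "chi_tra")):
    """Run ocrmypdf and return its exit code (0 means success)."""
    completed = subprocess.run(build_ocr_command(input_pdf, output_pdf, langs))
    return completed.returncode
```

Here `old_pdf` and `new_pdf` from the command above correspond to `input_pdf` and `output_pdf`.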

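I converted my previous article to a PDF file for testing.

As the result below shows, the recognized text contains extra spaces,
but the recognition itself is quite good.
See the sample file for the detailed test results.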

PDF 免費 解鎖 _pikepdf 使 用 python&R 快 速 解鎖
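
If the extra spaces are a nuisance after copying text out of the PDF, they can be cleaned up afterwards. The sketch below is an optional post-processing step of my own, not something ocrmypdf does. It assumes that only spaces between two CJK ideographs in the basic U+4E00 to U+9FFF block should be dropped; the function name `squeeze_cjk_spaces` is hypothetical.

```python
import re

# Spaces or tabs (including the full-width space U+3000) that sit between
# two CJK ideographs. Line breaks are left alone on purpose.
_CJK_GAP = re.compile(r"(?<=[\u4e00-\u9fff])[ \t\u3000]+(?=[\u4e00-\u9fff])")


def squeeze_cjk_spaces(text):
    """Remove whitespace between adjacent CJK characters, keep all other spacing."""
    return _CJK_GAP.sub("", text)
```

Spaces next to Latin letters, digits, or punctuation are kept, so mixed Chinese and English text stays readable.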

Running it from the RStudio Server I previously set up on Win10 also works without problems.

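pdffile <- file.choose()  # pick the PDF interactively
# shQuote() keeps paths that contain spaces intact
system(paste("ocrmypdf -l eng+chi_tra", shQuote(pdffile), shQuote(paste0(pdffile, "_OCR.pdf"))))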
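
For processing many files, the output name can be derived from the input name in the same spirit as the R snippet. Below is a minimal Python sketch with hypothetical helper names. Unlike the R code, which appends "_OCR.pdf" after the full file name, it assumes you would rather get `report_OCR.pdf` from `report.pdf`.

```python
from pathlib import Path


def ocr_output_path(pdf_path):
    """Return a sibling path such as report_OCR.pdf for report.pdf."""
    pdf_path = Path(pdf_path)
    return pdf_path.with_name(pdf_path.stem + "_OCR.pdf")


def plan_batch(folder):
    """List (input, output) pairs for the PDFs in a folder.

    Files whose name already ends in _OCR are skipped so that earlier
    results are not processed a second time.
    """
    pdfs = sorted(Path(folder).glob("*.pdf"))
    return [(p, ocr_output_path(p)) for p in pdfs if not p.stem.endswith("_OCR")]
```

Each pair returned by `plan_batch` can then be passed to an ocrmypdf call one file at a time.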

Troubleshooting

The PDF already contains text

ERROR - 1: page already has text! - aborting (use --force-ocr to force OCR)

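Adding --redo-ocr lets ocrmypdf process a file that already has a text layer:

# same call as before, with --redo-ocr added
system(paste("ocrmypdf -l eng+chi_tra --redo-ocr", shQuote(pdffile), shQuote(paste0(pdffile, "_OCR.pdf"))))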
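
When scripting this, the retry can be automated. The following Python sketch runs ocrmypdf once and runs it again with `--redo-ocr` only if the first attempt reports that the file already has text. It assumes that ocrmypdf signals this case with exit code 6, which is worth checking against the documentation for your installed version. The function name and the injectable `run` parameter are hypothetical and exist mainly to make the logic testable.

```python
import subprocess

# Assumption: ocrmypdf exits with code 6 when the input already has text.
ALREADY_HAS_TEXT = 6


def ocr_with_retry(input_pdf, output_pdf, langs="eng+chi_tra", run=subprocess.run):
    """Run ocrmypdf; if the PDF already has text, retry once with --redo-ocr."""
    base = ["ocrmypdf", "-l", langs]
    files = [str(input_pdf), str(output_pdf)]
    result = run(base + files)
    if result.returncode == ALREADY_HAS_TEXT:
        # Second attempt: same call with --redo-ocr inserted before the file names.
        result = run(base + ["--redo-ocr"] + files)
    return result.returncode
```

Any other non-zero exit code is returned unchanged, so real failures are not hidden by the retry.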

🌟The full article can be read, or added to, at the link below
https://hackmd.io/@LHB-0222/PDF_OCR

The full article is also shared on

https://www.facebook.com/LHB0222/

https://www.instagram.com/ahb0222/

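If you have questions or want to discuss anything, feel free to leave a comment below.

If you like it, please share it with all your friends \o/

Corrections are welcome if you spot any mistakes.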
