---
disqus: ahb0222
GA : G-VF9ZT413CG
---

# Free PDF text recognition (OCR) with ocrmypdf
> [color=#40f1ef][name=LHB阿好伯, 2021/07/16][:earth_africa:](https://www.facebook.com/LHB0222/)
###### tags: `R` `Python` `Ubuntu` `Software`

[TOC]

![](https://hackmd.io/_uploads/r1yscj8AF.png)

This article shows how to run OCR (optical character recognition) on a scanned PDF.
OCR adds a text layer to the file, so the text in the PDF can be selected and copied.

# Installation
I recommend using Ubuntu or WSL.
For WSL, see my earlier article [Self-hosting RStudio Server (Workbench) on Win10](/u5FrQz_cT8Wd7Dawvgw-Ew).
When this was written, it could not be used directly on Windows 10.
Installation is a single command:
```bash
sudo apt install ocrmypdf
```
![](https://hackmd.io/_uploads/r1ye0Z80Y.png)

To get a newer release than the distribution package, the upstream documentation installs the dependencies with apt and then ocrmypdf itself with pip:
:::spoiler
```bash
sudo apt-get -y remove ocrmypdf  # remove system ocrmypdf, if installed
sudo apt-get -y update
sudo apt-get -y install \
    ghostscript \
    icc-profiles-free \
    libxml2 \
    pngquant \
    python3-pip \
    tesseract-ocr \
    zlib1g
pip3 install --user ocrmypdf  # without this step ocrmypdf stays uninstalled
```
:::

| Operating system | Install command |
| :---: | :---: |
| Debian, Ubuntu | `apt install ocrmypdf` |
| Windows Subsystem for Linux | `apt install ocrmypdf` |
| Fedora | `dnf install ocrmypdf` |
| macOS | `brew install ocrmypdf` |
| LinuxBrew | `brew install ocrmypdf` |
| FreeBSD | `pkg install py37-ocrmypdf` |
| Conda | `conda install ocrmypdf` |

## List the available language packs
`apt-cache search tesseract-ocr`
:::spoiler
python3-tesserocr - Python wrapper for the tesseract-ocr API (Python3 version)
tesseract-ocr - Tesseract command line OCR tool
tesseract-ocr-afr - tesseract-ocr language files for Afrikaans
tesseract-ocr-all - Tesseract OCR with all language and script packages
tesseract-ocr-amh - tesseract-ocr language files for Amharic
tesseract-ocr-ara - tesseract-ocr language files for Arabic
tesseract-ocr-asm - tesseract-ocr language files for Assamese
tesseract-ocr-aze - tesseract-ocr language files for Azerbaijani
tesseract-ocr-aze-cyrl - tesseract-ocr language files for Azerbaijani (Cyrillic)
tesseract-ocr-bel - tesseract-ocr language files for Belarusian
tesseract-ocr-ben - tesseract-ocr language files for Bengali
tesseract-ocr-bod - tesseract-ocr language files for Tibetan Standard
tesseract-ocr-bos - tesseract-ocr language files for Bosnian
tesseract-ocr-bre - tesseract-ocr language files for Breton
tesseract-ocr-bul - tesseract-ocr language files for Bulgarian
tesseract-ocr-cat - tesseract-ocr language files for Catalan
tesseract-ocr-ceb - tesseract-ocr language files for Cebuano
tesseract-ocr-ces - tesseract-ocr language files for Czech
tesseract-ocr-chi-sim - tesseract-ocr language files for Chinese - Simplified
tesseract-ocr-chi-sim-vert - tesseract-ocr language files for Chinese - Simplified (vertical)
tesseract-ocr-chi-tra - tesseract-ocr language files for Chinese - Traditional
tesseract-ocr-chi-tra-vert - tesseract-ocr language files for Chinese - Traditional (vertical)
tesseract-ocr-chr - tesseract-ocr language files for Cherokee
tesseract-ocr-cos - tesseract-ocr language files for Corsican
tesseract-ocr-cym - tesseract-ocr language files for Welsh
tesseract-ocr-dan - tesseract-ocr language files for Danish
tesseract-ocr-deu - tesseract-ocr language files for German
tesseract-ocr-div - tesseract-ocr language files for Divehi
tesseract-ocr-dzo - tesseract-ocr language files for Dzongkha
tesseract-ocr-ell - tesseract-ocr language files for Greek
tesseract-ocr-eng - tesseract-ocr language files for English
tesseract-ocr-enm - tesseract-ocr language files for English, Middle (1100-1500)
tesseract-ocr-epo - tesseract-ocr language files for Esperanto
tesseract-ocr-est - tesseract-ocr language files for Estonian
tesseract-ocr-eus - tesseract-ocr language files for Basque
tesseract-ocr-fao - tesseract-ocr language files for Faroese
tesseract-ocr-fas - tesseract-ocr language files for Persian
tesseract-ocr-fil - tesseract-ocr language files for Filipino
tesseract-ocr-fin - tesseract-ocr language files for Finnish
tesseract-ocr-fra - tesseract-ocr language files for French
tesseract-ocr-frk - tesseract-ocr language files for German (Fraktur)
tesseract-ocr-frm - tesseract-ocr language files for French, Middle (ca.1400-1600)
tesseract-ocr-fry - tesseract-ocr language files for Frisian (Western)
tesseract-ocr-gla - tesseract-ocr language files for Gaelic (Scots)
tesseract-ocr-gle - tesseract-ocr language files for Irish
tesseract-ocr-glg - tesseract-ocr language files for Galician
tesseract-ocr-grc - tesseract-ocr language files for Greek, Ancient (to 1453)
tesseract-ocr-guj - tesseract-ocr language files for Gujarati
tesseract-ocr-hat - tesseract-ocr language files for Haitian
tesseract-ocr-heb - tesseract-ocr language files for Hebrew
tesseract-ocr-hin - tesseract-ocr language files for Hindi
tesseract-ocr-hrv - tesseract-ocr language files for Croatian
tesseract-ocr-hun - tesseract-ocr language files for Hungarian
tesseract-ocr-hye - tesseract-ocr language files for Armenian
tesseract-ocr-iku - tesseract-ocr language files for Inuktitut
tesseract-ocr-ind - tesseract-ocr language files for Indonesian
tesseract-ocr-isl - tesseract-ocr language files for Icelandic
tesseract-ocr-ita - tesseract-ocr language files for Italian
tesseract-ocr-ita-old - tesseract-ocr language files for Italian - Old
tesseract-ocr-jav - tesseract-ocr language files for Javanese
tesseract-ocr-jpn - tesseract-ocr language files for Japanese
tesseract-ocr-jpn-vert - tesseract-ocr language files for Japanese (vertical)
tesseract-ocr-kan - tesseract-ocr language files for Kannada
tesseract-ocr-kat - tesseract-ocr language files for Georgian
tesseract-ocr-kat-old - tesseract-ocr language files for Old Georgian
tesseract-ocr-kaz - tesseract-ocr language files for Kazakh
tesseract-ocr-khm - tesseract-ocr language files for Khmer
tesseract-ocr-kir - tesseract-ocr language files for Kyrgyz
tesseract-ocr-kmr - tesseract-ocr language files for Kurmanji (Latin)
tesseract-ocr-kor - tesseract-ocr language files for Korean
tesseract-ocr-kor-vert - tesseract-ocr language files for Korean (vertical)
tesseract-ocr-lao - tesseract-ocr language files for Lao
tesseract-ocr-lat - tesseract-ocr language files for Latin
tesseract-ocr-lav - tesseract-ocr language files for Latvian
tesseract-ocr-lit - tesseract-ocr language files for Lithuanian
tesseract-ocr-ltz - tesseract-ocr language files for Luxembourgish
tesseract-ocr-mal - tesseract-ocr language files for Malayalam
tesseract-ocr-mar - tesseract-ocr language files for Marathi
tesseract-ocr-mkd - tesseract-ocr language files for Macedonian
tesseract-ocr-mlt - tesseract-ocr language files for Maltese
tesseract-ocr-mon - tesseract-ocr language files for Mongolian
tesseract-ocr-mri - tesseract-ocr language files for Maori
tesseract-ocr-msa - tesseract-ocr language files for Malay
tesseract-ocr-mya - tesseract-ocr language files for Burmese
tesseract-ocr-nep - tesseract-ocr language files for Nepali
tesseract-ocr-nld - tesseract-ocr language files for Dutch
tesseract-ocr-nor - tesseract-ocr language files for Norwegian
tesseract-ocr-oci - tesseract-ocr language files for Occitan (post 1500)
tesseract-ocr-ori - tesseract-ocr language files for Oriya
tesseract-ocr-osd - tesseract-ocr language files for script and orientation
tesseract-ocr-pan - tesseract-ocr language files for Punjabi
tesseract-ocr-pol - tesseract-ocr language files for Polish
tesseract-ocr-por - tesseract-ocr language files for Portuguese
tesseract-ocr-pus - tesseract-ocr language files for Pashto
tesseract-ocr-que - tesseract-ocr language files for Quechua
tesseract-ocr-ron - tesseract-ocr language files for Romanian
tesseract-ocr-rus - tesseract-ocr language files for Russian
tesseract-ocr-san - tesseract-ocr language files for Sanskrit
tesseract-ocr-script-arab - tesseract-ocr data for Arabic script
tesseract-ocr-script-armn - tesseract-ocr data for Armenian script
tesseract-ocr-script-beng - tesseract-ocr data for Bengali script
tesseract-ocr-script-cans - tesseract-ocr data for Canadian Aboriginal script
tesseract-ocr-script-cher - tesseract-ocr data for Cherokee script
tesseract-ocr-script-cyrl - tesseract-ocr data for Cyrillic script
tesseract-ocr-script-deva - tesseract-ocr data for Devanagari script
tesseract-ocr-script-ethi - tesseract-ocr data for Ethiopic script
tesseract-ocr-script-frak - tesseract-ocr data for Fraktur script
tesseract-ocr-script-geor - tesseract-ocr data for Georgian script
tesseract-ocr-script-grek - tesseract-ocr data for Greek script
tesseract-ocr-script-gujr - tesseract-ocr data for Gujarati script
tesseract-ocr-script-guru - tesseract-ocr data for Gurmukhi script
tesseract-ocr-script-hang - tesseract-ocr data for Hangul script
tesseract-ocr-script-hang-vert - tesseract-ocr data for Hangul (vertical) script
tesseract-ocr-script-hans - tesseract-ocr data for Han - Simplified script
tesseract-ocr-script-hans-vert - tesseract-ocr data for Han - Simplified (vertical) script
tesseract-ocr-script-hant - tesseract-ocr data for Han - Traditional script
tesseract-ocr-script-hant-vert - tesseract-ocr data for Han - Traditional (vertical) script
tesseract-ocr-script-hebr - tesseract-ocr data for Hebrew script
tesseract-ocr-script-jpan - tesseract-ocr data for Japanese script
tesseract-ocr-script-jpan-vert - tesseract-ocr data for Japanese (vertical) script
tesseract-ocr-script-khmr - tesseract-ocr data for Khmer script
tesseract-ocr-script-knda - tesseract-ocr data for Kannada script
tesseract-ocr-script-laoo - tesseract-ocr data for Lao script
tesseract-ocr-script-latn - tesseract-ocr data for Latin script
tesseract-ocr-script-mlym - tesseract-ocr data for Malayalam script
tesseract-ocr-script-mymr - tesseract-ocr data for Myanmar script
tesseract-ocr-script-orya - tesseract-ocr data for Oriya (Odia) script
tesseract-ocr-script-sinh - tesseract-ocr data for Sinhala script
tesseract-ocr-script-syrc - tesseract-ocr data for Syriac script
tesseract-ocr-script-taml - tesseract-ocr data for Tamil script
tesseract-ocr-script-telu - tesseract-ocr data for Telugu script
tesseract-ocr-script-thaa - tesseract-ocr data for Thaana script
tesseract-ocr-script-thai - tesseract-ocr data for Thai script
tesseract-ocr-script-tibt - tesseract-ocr data for Tibetan script
tesseract-ocr-script-viet - tesseract-ocr data for Vietnamese script
tesseract-ocr-sin - tesseract-ocr language files for Sinhala
tesseract-ocr-slk - tesseract-ocr language files for Slovakian
tesseract-ocr-slv - tesseract-ocr language files for Slovenian
tesseract-ocr-snd - tesseract-ocr language files for Sindhi
tesseract-ocr-spa - tesseract-ocr language files for Spanish
tesseract-ocr-spa-old - tesseract-ocr language files for Spanish, Castilian - Old
tesseract-ocr-sqi - tesseract-ocr language files for Albanian
tesseract-ocr-srp - tesseract-ocr language files for Serbian
tesseract-ocr-srp-latn - tesseract-ocr language files for Serbian (Latin)
tesseract-ocr-sun - tesseract-ocr language files for Sundanese
tesseract-ocr-swa - tesseract-ocr language files for Swahili
tesseract-ocr-swe - tesseract-ocr language files for Swedish
tesseract-ocr-syr - tesseract-ocr language files for Syriac
tesseract-ocr-tam - tesseract-ocr language files for Tamil
tesseract-ocr-tat - tesseract-ocr language files for Tatar
tesseract-ocr-tel - tesseract-ocr language files for Telugu
tesseract-ocr-tgk - tesseract-ocr language files for Tajik
tesseract-ocr-tha - tesseract-ocr language files for Thai
tesseract-ocr-tir - tesseract-ocr language files for Tigrinya
tesseract-ocr-ton - tesseract-ocr language files for Tonga
tesseract-ocr-tur - tesseract-ocr language files for Turkish
tesseract-ocr-uig - tesseract-ocr language files for Uyghur
tesseract-ocr-ukr - tesseract-ocr language files for Ukrainian
tesseract-ocr-urd - tesseract-ocr language files for Urdu
tesseract-ocr-uzb - tesseract-ocr language files for Uzbek
tesseract-ocr-uzb-cyrl - tesseract-ocr language files for Uzbek (Cyrillic)
tesseract-ocr-vie - tesseract-ocr language files for Vietnamese
tesseract-ocr-yid - tesseract-ocr language files for Yiddish
tesseract-ocr-yor - tesseract-ocr language files for Yoruba
:::

For English and Traditional Chinese, these are the relevant packages:
:::success
tesseract-ocr-eng - tesseract-ocr language files for English
tesseract-ocr-chi-tra - tesseract-ocr language files for Chinese - Traditional
tesseract-ocr-chi-tra-vert - tesseract-ocr language files for Chinese - Traditional (vertical)
:::

## Install commonly used language packs
```bash
sudo apt-get install tesseract-ocr-chi-tra       # Traditional Chinese
sudo apt-get install tesseract-ocr-chi-tra-vert  # Traditional Chinese (vertical)
sudo apt-get install tesseract-ocr-eng           # English
sudo apt-get install tesseract-ocr-jpn           # Japanese
```

## Run OCR on Chinese and English text
```bash
# old_pdf is the scanned input file, new_pdf is the output file with the text layer
ocrmypdf -l eng+chi_tra old_pdf new_pdf
```
I converted my previous article to a PDF to test this.
As shown below, the recognized text contains extra spaces, but the recognition itself is quite good.
For the full result, see the [sample file](https://1drv.ms/u/s!AuqcoLPA0_SImNl-Tb8HsQo05ZCz7Q?e=ZuWI1p).
![](https://hackmd.io/_uploads/ByJVviU0K.png)
:::success
PDF 免費 解鎖 _pikepdf 使 用 python&R 快 速 解鎖
:::

It also works without problems from the [RStudio Server I set up on Win10](/u5FrQz_cT8Wd7Dawvgw-Ew).
```r
pdffile <- file.choose()  # pick the input PDF interactively
# shQuote() keeps paths with spaces or special characters intact in the shell
system(paste0("ocrmypdf -l eng+chi_tra ", shQuote(pdffile), " ", shQuote(paste0(pdffile, "_OCR.pdf"))))
```
![](https://hackmd.io/_uploads/r1TH2oICK.png)

# Troubleshooting
## The PDF already contains text
:::danger
ERROR - 1: page already has text! – aborting (use --force-ocr to force OCR)
:::

The error message suggests `--force-ocr`, which rasterizes every page and runs OCR on all of it.
The command below uses `--redo-ocr` instead, which redoes the OCR layer while keeping the text that is already in the file.
```r
system(paste0("ocrmypdf -l eng+chi_tra --redo-ocr ", shQuote(pdffile), " ", shQuote(paste0(pdffile, "_OCR.pdf"))))
```

🌟 The full article can be read, or added to, at the link below
https://hackmd.io/@LHB-0222/PDF_OCR

It is also shared at
https://www.facebook.com/LHB0222/
https://www.instagram.com/ahb0222/

If you have questions or want to discuss, feel free to leave a comment below
If you like it, please share it with all your friends \o/
Corrections are welcome

# [:page_with_curl: List of all articles](https://hackmd.io/@LHB-0222/AllWritings)
![](https://i.imgur.com/nHEcVmm.jpg)
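
# Appendix: batch OCR sketch

The single-file `ocrmypdf -l eng+chi_tra` command from the OCR section can be wrapped for a whole folder. The following is a minimal sketch, not part of the original workflow: the function names `ocr_output_name` and `ocr_one` are hypothetical helpers, and the output name drops the `.pdf` suffix before appending `_OCR.pdf`, which differs slightly from the R snippet. It assumes `ocrmypdf` and the `eng` and `chi_tra` language packs are installed, and it uses `--skip-text` so that pages which already contain text are left alone instead of aborting the run.

```shell
#!/usr/bin/env bash

# Hypothetical helper: turn "scan.pdf" into "scan_OCR.pdf".
# Inputs without a .pdf suffix simply get "_OCR.pdf" appended.
ocr_output_name() {
  local input="$1"
  printf '%s_OCR.pdf\n' "${input%.pdf}"
}

# Hypothetical helper: OCR one file in English + Traditional Chinese.
# --skip-text skips pages that already have a text layer.
ocr_one() {
  local input="$1"
  ocrmypdf -l eng+chi_tra --skip-text "$input" "$(ocr_output_name "$input")"
}

# Usage (not executed automatically):
#   for f in ./*.pdf; do ocr_one "$f"; done
```

Quoting `"$input"` everywhere keeps file names with spaces working. To redo an existing OCR layer instead of skipping those pages, `--skip-text` could be swapped for `--redo-ocr`.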