---
disqus: ahb0222
GA : G-VF9ZT413CG
---
# Free PDF Text Recognition (OCR) with ocrmypdf
> [color=#40f1ef][name=LHB阿好伯, 2021/07/16][:earth_africa:](https://www.facebook.com/LHB0222/)
###### tags: `R` `Python` `Ubuntu` `Software`
[TOC]
![](https://hackmd.io/_uploads/r1yscj8AF.png)
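This article shows how to run optical character recognition (OCR) on a scanned PDF.
OCR adds a text layer to the file, so the text in the PDF can be selected and copied.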
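# Installation
Ubuntu or WSL is the recommended environment.
For WSL, see the earlier article [Self-hosting RStudio Server (Workbench) on Win10](/u5FrQz_cT8Wd7Dawvgw-Ew).
At the time of writing, it cannot be run directly on Win10.
Installation is very simple and takes a single command: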
```bash=
sudo apt install ocrmypdf
```
![](https://hackmd.io/_uploads/r1ye0Z80Y.png)
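To confirm that the installation worked, a quick check like the one below can help. This is only a minimal sketch: `check_tools` is a hypothetical helper name, and it assumes Ghostscript is available under its usual command name `gs`.

```shell
# Report whether each required command-line tool can be found on PATH.
# Returns 0 when all tools are found, 1 otherwise.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# ocrmypdf relies on tesseract for recognition and on Ghostscript (gs).
check_tools ocrmypdf tesseract gs || echo "Install the missing tools before continuing."
```

If all three are found, `ocrmypdf --version` should also print the installed version.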
:::spoiler
```bash=
sudo apt-get -y remove ocrmypdf  # remove system ocrmypdf, if installed
sudo apt-get -y update
sudo apt-get -y install \
    ghostscript \
    icc-profiles-free \
    libxml2 \
    pngquant \
    python3-pip \
    tesseract-ocr \
    zlib1g
```
:::
| Operating system | Install command |
| :---: | :---: |
| Debian, Ubuntu | `apt install ocrmypdf` |
| Windows Subsystem for Linux | `apt install ocrmypdf` |
| Fedora | `dnf install ocrmypdf` |
| macOS | `brew install ocrmypdf` |
| LinuxBrew | `brew install ocrmypdf` |
| FreeBSD | `pkg install py37-ocrmypdf` |
| Conda | `conda install ocrmypdf` |
## Listing the available language packs
`apt-cache search tesseract-ocr`
:::spoiler
python3-tesserocr - Python wrapper for the tesseract-ocr API (Python3 version)
tesseract-ocr - Tesseract command line OCR tool
tesseract-ocr-afr - tesseract-ocr language files for Afrikaans
tesseract-ocr-all - Tesseract OCR with all language and script packages
tesseract-ocr-amh - tesseract-ocr language files for Amharic
tesseract-ocr-ara - tesseract-ocr language files for Arabic
tesseract-ocr-asm - tesseract-ocr language files for Assamese
tesseract-ocr-aze - tesseract-ocr language files for Azerbaijani
tesseract-ocr-aze-cyrl - tesseract-ocr language files for Azerbaijani (Cyrillic)
tesseract-ocr-bel - tesseract-ocr language files for Belarusian
tesseract-ocr-ben - tesseract-ocr language files for Bengali
tesseract-ocr-bod - tesseract-ocr language files for Tibetan Standard
tesseract-ocr-bos - tesseract-ocr language files for Bosnian
tesseract-ocr-bre - tesseract-ocr language files for Breton
tesseract-ocr-bul - tesseract-ocr language files for Bulgarian
tesseract-ocr-cat - tesseract-ocr language files for Catalan
tesseract-ocr-ceb - tesseract-ocr language files for Cebuano
tesseract-ocr-ces - tesseract-ocr language files for Czech
tesseract-ocr-chi-sim - tesseract-ocr language files for Chinese - Simplified
tesseract-ocr-chi-sim-vert - tesseract-ocr language files for Chinese - Simplified (vertical)
tesseract-ocr-chi-tra - tesseract-ocr language files for Chinese - Traditional
tesseract-ocr-chi-tra-vert - tesseract-ocr language files for Chinese - Traditional (vertical)
tesseract-ocr-chr - tesseract-ocr language files for Cherokee
tesseract-ocr-cos - tesseract-ocr language files for Corsican
tesseract-ocr-cym - tesseract-ocr language files for Welsh
tesseract-ocr-dan - tesseract-ocr language files for Danish
tesseract-ocr-deu - tesseract-ocr language files for German
tesseract-ocr-div - tesseract-ocr language files for Divehi
tesseract-ocr-dzo - tesseract-ocr language files for Dzongkha
tesseract-ocr-ell - tesseract-ocr language files for Greek
tesseract-ocr-eng - tesseract-ocr language files for English
tesseract-ocr-enm - tesseract-ocr language files for English, Middle (1100-1500)
tesseract-ocr-epo - tesseract-ocr language files for Esperanto
tesseract-ocr-est - tesseract-ocr language files for Estonian
tesseract-ocr-eus - tesseract-ocr language files for Basque
tesseract-ocr-fao - tesseract-ocr language files for Faroese
tesseract-ocr-fas - tesseract-ocr language files for Persian
tesseract-ocr-fil - tesseract-ocr language files for Filipino
tesseract-ocr-fin - tesseract-ocr language files for Finnish
tesseract-ocr-fra - tesseract-ocr language files for French
tesseract-ocr-frk - tesseract-ocr language files for German (Fraktur)
tesseract-ocr-frm - tesseract-ocr language files for French, Middle (ca.1400-1600)
tesseract-ocr-fry - tesseract-ocr language files for Frisian (Western)
tesseract-ocr-gla - tesseract-ocr language files for Gaelic (Scots)
tesseract-ocr-gle - tesseract-ocr language files for Irish
tesseract-ocr-glg - tesseract-ocr language files for Galician
tesseract-ocr-grc - tesseract-ocr language files for Greek, Ancient (to 1453)
tesseract-ocr-guj - tesseract-ocr language files for Gujarati
tesseract-ocr-hat - tesseract-ocr language files for Haitian
tesseract-ocr-heb - tesseract-ocr language files for Hebrew
tesseract-ocr-hin - tesseract-ocr language files for Hindi
tesseract-ocr-hrv - tesseract-ocr language files for Croatian
tesseract-ocr-hun - tesseract-ocr language files for Hungarian
tesseract-ocr-hye - tesseract-ocr language files for Armenian
tesseract-ocr-iku - tesseract-ocr language files for Inuktitut
tesseract-ocr-ind - tesseract-ocr language files for Indonesian
tesseract-ocr-isl - tesseract-ocr language files for Icelandic
tesseract-ocr-ita - tesseract-ocr language files for Italian
tesseract-ocr-ita-old - tesseract-ocr language files for Italian - Old
tesseract-ocr-jav - tesseract-ocr language files for Javanese
tesseract-ocr-jpn - tesseract-ocr language files for Japanese
tesseract-ocr-jpn-vert - tesseract-ocr language files for Japanese (vertical)
tesseract-ocr-kan - tesseract-ocr language files for Kannada
tesseract-ocr-kat - tesseract-ocr language files for Georgian
tesseract-ocr-kat-old - tesseract-ocr language files for Old Georgian
tesseract-ocr-kaz - tesseract-ocr language files for Kazakh
tesseract-ocr-khm - tesseract-ocr language files for Khmer
tesseract-ocr-kir - tesseract-ocr language files for Kyrgyz
tesseract-ocr-kmr - tesseract-ocr language files for Kurmanji (Latin)
tesseract-ocr-kor - tesseract-ocr language files for Korean
tesseract-ocr-kor-vert - tesseract-ocr language files for Korean (vertical)
tesseract-ocr-lao - tesseract-ocr language files for Lao
tesseract-ocr-lat - tesseract-ocr language files for Latin
tesseract-ocr-lav - tesseract-ocr language files for Latvian
tesseract-ocr-lit - tesseract-ocr language files for Lithuanian
tesseract-ocr-ltz - tesseract-ocr language files for Luxembourgish
tesseract-ocr-mal - tesseract-ocr language files for Malayalam
tesseract-ocr-mar - tesseract-ocr language files for Marathi
tesseract-ocr-mkd - tesseract-ocr language files for Macedonian
tesseract-ocr-mlt - tesseract-ocr language files for Maltese
tesseract-ocr-mon - tesseract-ocr language files for Mongolian
tesseract-ocr-mri - tesseract-ocr language files for Maori
tesseract-ocr-msa - tesseract-ocr language files for Malay
tesseract-ocr-mya - tesseract-ocr language files for Burmese
tesseract-ocr-nep - tesseract-ocr language files for Nepali
tesseract-ocr-nld - tesseract-ocr language files for Dutch
tesseract-ocr-nor - tesseract-ocr language files for Norwegian
tesseract-ocr-oci - tesseract-ocr language files for Occitan (post 1500)
tesseract-ocr-ori - tesseract-ocr language files for Oriya
tesseract-ocr-osd - tesseract-ocr language files for script and orientation
tesseract-ocr-pan - tesseract-ocr language files for Punjabi
tesseract-ocr-pol - tesseract-ocr language files for Polish
tesseract-ocr-por - tesseract-ocr language files for Portuguese
tesseract-ocr-pus - tesseract-ocr language files for Pashto
tesseract-ocr-que - tesseract-ocr language files for Quechua
tesseract-ocr-ron - tesseract-ocr language files for Romanian
tesseract-ocr-rus - tesseract-ocr language files for Russian
tesseract-ocr-san - tesseract-ocr language files for Sanskrit
tesseract-ocr-script-arab - tesseract-ocr data for Arabic script
tesseract-ocr-script-armn - tesseract-ocr data for Armenian script
tesseract-ocr-script-beng - tesseract-ocr data for Bengali script
tesseract-ocr-script-cans - tesseract-ocr data for Canadian Aboriginal script
tesseract-ocr-script-cher - tesseract-ocr data for Cherokee script
tesseract-ocr-script-cyrl - tesseract-ocr data for Cyrillic script
tesseract-ocr-script-deva - tesseract-ocr data for Devanagari script
tesseract-ocr-script-ethi - tesseract-ocr data for Ethiopic script
tesseract-ocr-script-frak - tesseract-ocr data for Fraktur script
tesseract-ocr-script-geor - tesseract-ocr data for Georgian script
tesseract-ocr-script-grek - tesseract-ocr data for Greek script
tesseract-ocr-script-gujr - tesseract-ocr data for Gujarati script
tesseract-ocr-script-guru - tesseract-ocr data for Gurmukhi script
tesseract-ocr-script-hang - tesseract-ocr data for Hangul script
tesseract-ocr-script-hang-vert - tesseract-ocr data for Hangul (vertical) script
tesseract-ocr-script-hans - tesseract-ocr data for Han - Simplified script
tesseract-ocr-script-hans-vert - tesseract-ocr data for Han - Simplified (vertical) script
tesseract-ocr-script-hant - tesseract-ocr data for Han - Traditional script
tesseract-ocr-script-hant-vert - tesseract-ocr data for Han - Traditional (vertical) script
tesseract-ocr-script-hebr - tesseract-ocr data for Hebrew script
tesseract-ocr-script-jpan - tesseract-ocr data for Japanese script
tesseract-ocr-script-jpan-vert - tesseract-ocr data for Japanese (vertical) script
tesseract-ocr-script-khmr - tesseract-ocr data for Khmer script
tesseract-ocr-script-knda - tesseract-ocr data for Kannada script
tesseract-ocr-script-laoo - tesseract-ocr data for Lao script
tesseract-ocr-script-latn - tesseract-ocr data for Latin script
tesseract-ocr-script-mlym - tesseract-ocr data for Malayalam script
tesseract-ocr-script-mymr - tesseract-ocr data for Myanmar script
tesseract-ocr-script-orya - tesseract-ocr data for Oriya (Odia) script
tesseract-ocr-script-sinh - tesseract-ocr data for Sinhala script
tesseract-ocr-script-syrc - tesseract-ocr data for Syriac script
tesseract-ocr-script-taml - tesseract-ocr data for Tamil script
tesseract-ocr-script-telu - tesseract-ocr data for Telugu script
tesseract-ocr-script-thaa - tesseract-ocr data for Thaana script
tesseract-ocr-script-thai - tesseract-ocr data for Thai script
tesseract-ocr-script-tibt - tesseract-ocr data for Tibetan script
tesseract-ocr-script-viet - tesseract-ocr data for Vietnamese script
tesseract-ocr-sin - tesseract-ocr language files for Sinhala
tesseract-ocr-slk - tesseract-ocr language files for Slovakian
tesseract-ocr-slv - tesseract-ocr language files for Slovenian
tesseract-ocr-snd - tesseract-ocr language files for Sindhi
tesseract-ocr-spa - tesseract-ocr language files for Spanish
tesseract-ocr-spa-old - tesseract-ocr language files for Spanish, Castilian - Old
tesseract-ocr-sqi - tesseract-ocr language files for Albanian
tesseract-ocr-srp - tesseract-ocr language files for Serbian
tesseract-ocr-srp-latn - tesseract-ocr language files for Serbian (Latin)
tesseract-ocr-sun - tesseract-ocr language files for Sundanese
tesseract-ocr-swa - tesseract-ocr language files for Swahili
tesseract-ocr-swe - tesseract-ocr language files for Swedish
tesseract-ocr-syr - tesseract-ocr language files for Syriac
tesseract-ocr-tam - tesseract-ocr language files for Tamil
tesseract-ocr-tat - tesseract-ocr language files for Tatar
tesseract-ocr-tel - tesseract-ocr language files for Telugu
tesseract-ocr-tgk - tesseract-ocr language files for Tajik
tesseract-ocr-tha - tesseract-ocr language files for Thai
tesseract-ocr-tir - tesseract-ocr language files for Tigrinya
tesseract-ocr-ton - tesseract-ocr language files for Tonga
tesseract-ocr-tur - tesseract-ocr language files for Turkish
tesseract-ocr-uig - tesseract-ocr language files for Uyghur
tesseract-ocr-ukr - tesseract-ocr language files for Ukrainian
tesseract-ocr-urd - tesseract-ocr language files for Urdu
tesseract-ocr-uzb - tesseract-ocr language files for Uzbek
tesseract-ocr-uzb-cyrl - tesseract-ocr language files for Uzbek (Cyrillic)
tesseract-ocr-vie - tesseract-ocr language files for Vietnamese
tesseract-ocr-yid - tesseract-ocr language files for Yiddish
tesseract-ocr-yor - tesseract-ocr language files for Yoruba
:::
:::success
tesseract-ocr-eng - tesseract-ocr language files for English
tesseract-ocr-chi-tra - tesseract-ocr language files for Chinese - Traditional
tesseract-ocr-chi-tra-vert - tesseract-ocr language files for Chinese - Traditional (vertical)
:::
## Installing commonly used language packs
```bash=
sudo apt-get install tesseract-ocr-chi-tra       # Traditional Chinese
sudo apt-get install tesseract-ocr-chi-tra-vert  # Traditional Chinese (vertical)
sudo apt-get install tesseract-ocr-eng           # English
sudo apt-get install tesseract-ocr-jpn           # Japanese
```
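Note that the apt package names use hyphens (`tesseract-ocr-chi-tra`), while the language codes passed to `-l` use underscores (`chi_tra`). Before running OCR it can be worth checking that every code you plan to use is really installed. The sketch below is one way to do that; `langs_available` is a hypothetical helper, and it assumes `tesseract --list-langs` prints one language code per line after a header line.

```shell
# Succeed only if every language code in a '+'-joined spec (e.g. eng+chi_tra)
# appears in the list read from standard input, one code per line.
langs_available() {
    spec=$1
    installed=$(cat)
    old_ifs=$IFS
    IFS=+
    set -- $spec            # split the spec on '+' into positional parameters
    IFS=$old_ifs
    for lang in "$@"; do
        printf '%s\n' "$installed" | grep -qxF -- "$lang" || {
            echo "language pack not installed: $lang"
            return 1
        }
    done
}

# Only run the real check when tesseract is present.
if command -v tesseract >/dev/null 2>&1; then
    if tesseract --list-langs 2>&1 | langs_available eng+chi_tra; then
        echo "eng and chi_tra are both installed"
    fi
fi
```

The header line printed by tesseract is harmless here, because only whole-line matches count.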
## Running OCR for Chinese and English
```bash=
# -l takes the languages to recognize, joined with "+"
ocrmypdf -l eng+chi_tra input.pdf output.pdf
```
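I converted the previous article to a PDF file for testing.
As shown below, the result contains some extra spaces,
but the recognition itself is quite good.
For the detailed results, see the [sample file](https://1drv.ms/u/s!AuqcoLPA0_SImNl-Tb8HsQo05ZCz7Q?e=ZuWI1p)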
![](https://hackmd.io/_uploads/ByJVviU0K.png)
:::success
PDF 免費 解鎖 _pikepdf 使 用 python&R 快 速 解鎖
:::
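It also works without problems from the [RStudio Server self-hosted on Win10](/u5FrQz_cT8Wd7Dawvgw-Ew) set up earlier:
```r=
pdffile <- file.choose()                # pick the scanned PDF interactively
outfile <- paste0(pdffile, "_OCR.pdf")  # e.g. scan.pdf -> scan.pdf_OCR.pdf
# shQuote() keeps paths that contain spaces intact
system(paste("ocrmypdf -l eng+chi_tra", shQuote(pdffile), shQuote(outfile)))
```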
![](https://hackmd.io/_uploads/r1TH2oICK.png)
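The same command can be applied to a whole folder of scans from the shell. The following is a rough sketch rather than a tested workflow: `ocr_output_name` and `ocr_directory` are hypothetical helper names, the `_OCR.pdf` suffix is just a naming choice, and only lower-case `.pdf` extensions are handled.

```shell
# Derive the output name for an input PDF: scans/a.pdf -> scans/a_OCR.pdf
ocr_output_name() {
    printf '%s_OCR.pdf\n' "${1%.pdf}"
}

# Run ocrmypdf on every *.pdf in a directory, skipping earlier results.
ocr_directory() {
    dir=${1:-.}
    for input in "$dir"/*.pdf; do
        [ -e "$input" ] || continue                # the glob matched nothing
        case $input in *_OCR.pdf) continue ;; esac  # already an OCR result
        output=$(ocr_output_name "$input")
        ocrmypdf -l eng+chi_tra "$input" "$output" || echo "failed: $input"
    done
}

# Example call (the directory name is hypothetical):
# ocr_directory ./scans
```

Writing to a separate `_OCR.pdf` file keeps the original scans untouched, so a failed run can simply be repeated.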
# Troubleshooting
## The PDF already contains text
:::danger
ERROR - 1: page already has text! – aborting (use --force-ocr to force OCR)
:::
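Adding `--redo-ocr` makes ocrmypdf redo the OCR on such a file instead of aborting:
```r=
# assumes pdffile still holds the path chosen with file.choose()
system(paste("ocrmypdf -l eng+chi_tra --redo-ocr", shQuote(pdffile), shQuote(paste0(pdffile, "_OCR.pdf"))))
```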
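As far as I know, ocrmypdf offers three options for files that already contain text: `--skip-text` leaves pages with text alone, `--redo-ocr` replaces an existing hidden OCR layer, and `--force-ocr` rasterizes every page and runs OCR again. The small helper below is only an illustration of picking one of them; `text_mode_flag` and its mode names are made up for this sketch and are not part of ocrmypdf.

```shell
# Map a plain-language choice to the matching ocrmypdf option.
text_mode_flag() {
    case $1 in
        skip)  echo "--skip-text" ;;   # keep pages that already have text as they are
        redo)  echo "--redo-ocr"  ;;   # replace an existing hidden OCR layer
        force) echo "--force-ocr" ;;   # rasterize every page and OCR it again
        *)     echo "unknown mode: $1" >&2; return 1 ;;
    esac
}

# Example (file names are placeholders):
# ocrmypdf -l eng+chi_tra "$(text_mode_flag redo)" input.pdf output.pdf
```

Since `--force-ocr` turns existing text into images, it is usually the last one to try.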
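🌟 The full article can be read, or added to, at the link below
https://hackmd.io/@LHB-0222/PDF_OCR
The article is also shared on
https://www.facebook.com/LHB0222/
https://www.instagram.com/ahb0222/
Questions and discussion are welcome in the comments below
If you like it, please share it with all your friends \o/
Corrections are welcome
# [:page_with_curl: List of all articles](https://hackmd.io/@LHB-0222/AllWritings)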
![](https://i.imgur.com/nHEcVmm.jpg)