###### tags: `Github`

Keras-Python3.6-captcha
===

**Recognizing Taiwan High Speed Rail (THSR) captchas with Keras on TensorFlow and Python 3.6**

(For academic research only. Do not use this illegally for bulk automated ticket booking.)

Open source code: [https://github.com/gary9987/-Keras-TensorFlow-Python3.6](https://github.com/gary9987/-Keras-TensorFlow-Python3.6-)

![](使用Keras%20和Python3.fld/image001.png)

Image processing:

The THSR captcha does not look too complicated; the main obstacle is a thick curve drawn across the characters. The captcha therefore has to be preprocessed before training:

1. Remove noise with OpenCV
2. Convert to black and white
3. Remove the curve

The method follows these two videos:

[[爬蟲實戰] 如何破解高鐵驗證碼 (1) - 去除圖片噪音點?](https://youtu.be/6HGbKdB4kVY)

[[爬蟲實戰] 如何破解高鐵驗證碼 (2) - 使用迴歸方法去除多餘弧線?](https://youtu.be/4DHcOPSfC4c)

In practice, however, THSR captcha images come in different sizes, so I do not recommend taking a screenshot and cropping it. Working from the image file itself also leaves the curve more complete when it is processed.

The captcha has to be fetched through a session, because the request needs cookies.

The code looks roughly like this:

![](使用Keras%20和Python3.fld/image004.png)

I later found that the curve removal worked very poorly, because there were too few pixels available to fit the regression line. So I multiply the captcha's original dpi by 10 before processing the curve, use a different curve width for each image size, and finally resize the result to a fixed size. I save it as (140, 48).

After processing, the image looks like this:

![](使用Keras%20和Python3.fld/image002.jpg)

Image processing was probably the part that took me the longest, most likely because I was not very familiar with Python.
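The regression-based curve removal described above can be sketched roughly as follows. This is a minimal NumPy illustration, not the author's code: the function name `remove_curve` and the parameters `margin`, `half_width` and `degree` are hypothetical, and it assumes an already binarized image (ink = 1, background = 0) whose outermost columns contain only the curve and no characters.

```python
import numpy as np


def remove_curve(binary, margin=5, half_width=1, degree=2):
    """Erase a curve from a 0/1 image by fitting row = poly(column).

    Only the left and right `margin` columns are used for the fit, on the
    assumption that they hold curve pixels but no characters.
    """
    height, width = binary.shape
    cols = np.r_[0:margin, width - margin:width]
    ys, idx = np.nonzero(binary[:, cols])
    xs = cols[idx]  # map back to real column indices
    if xs.size <= degree:
        return binary.copy()  # not enough pixels to fit anything

    # Least-squares polynomial fit through the curve pixels in the margins.
    coeffs = np.polyfit(xs, ys, degree)
    centre = np.rint(np.polyval(coeffs, np.arange(width))).astype(int)

    cleaned = binary.copy()
    for x in range(width):
        y = int(centre[x])
        # Clamp the band to the image so out-of-range predictions are ignored.
        lo = min(max(y - half_width, 0), height)
        hi = min(max(y + half_width + 1, 0), height)
        cleaned[lo:hi, x] = 0  # blank the band around the fitted curve
    return cleaned
```

With only a few margin columns the fit is poorly constrained, which is consistent with the observation above that too few pixels gave bad results; enlarging the image first gives the fit more points to work with. Note that this sketch simply blanks the band, so character strokes that cross the curve are erased too.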
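CNN architecture:

The architecture is based on this article, with my own modifications: [實作基於CNN的台鐵訂票驗證碼辨識以及驗證性高的訓練集產生器](https://github.com/JasonLiTW/simple-railway-captcha-solver)

![](使用Keras%20和Python3.fld/image003.png) ![](使用Keras%20和Python3.fld/image005.png)

The input at the top matches my images: 48x140 with 3 color channels (RGB).

The main difference in the middle is that I added two Batch_normalization layers to reduce overfitting.

After that comes Dropout(0.5), as in the reference, which drops 50% of the neurons.

I originally wanted to add a hidden layer with 4096 neurons, but the model size then ballooned to over 100 MB, even though the extra layer does improve accuracy slightly.

The output consists of four Digits because the THSR captcha is 4 characters long. Each Digit has 19 neurons because, on close inspection, the THSR captcha uses only 6 digits and 13 English letters.

This can be seen in the dictionary I built:

![](使用Keras%20和Python3.fld/image006.png)

Training:

I do not yet know enough about drawing images with Python to write my own training set generator, so I had to ask friends to help label the images. Thanks to my good friends.

We labeled 5000 training samples by hand and fed them to the machine. Training finished within about three hours on CPU only.

Training machine hardware:

`CPU: Intel Core I5-6360U`

`RAM: DDR3L 16GB`

At first this Model reached an accuracy of only about 85%, because there were too few training samples.

So I tried this Model against the THSR website: whenever a prediction was accepted, the captcha was downloaded and labeled automatically. I took the simpler route here and used Selenium for this, and ended up with 11000 training samples in total.

I then trained a new Model on the new samples only, since the hand-labeled ones may contain mistakes. Tested on 5000 samples, the new model reached 94.5% accuracy, which is good enough.

Below is the training process. There is a lot of data, so only some of the epochs are shown.

![](使用Keras%20和Python3.fld/image007.png) ![](使用Keras%20和Python3.fld/image008.png)

![](使用Keras%20和Python3.fld/image010.png)![](使用Keras%20和Python3.fld/image009.png)

![](使用Keras%20和Python3.fld/image011.png)![](使用Keras%20和Python3.fld/image012.png)

References:

1. TensorFlow+Keras深度學習人工智慧實務應用 (book)
2. [實作基於CNN的台鐵訂票驗證碼辨識以及驗證性高的訓練集產生器](https://github.com/JasonLiTW/simple-railway-captcha-solver)
3. [~~TensorFlow識別字母扭曲干擾型驗證碼-開放源碼與98%模型~~](https://www.urlteam.org/2017/03/tensorflow%E8%AF%86%E5%88%AB%E5%AD%97%E6%AF%8D%E6%89%AD%E6%9B%B2%E5%B9%B2%E6%89%B0%E5%9E%8B%E9%AA%8C%E8%AF%81%E7%A0%81-%E5%BC%80%E6%94%BE%E6%BA%90%E7%A0%81%E4%B8%8E98%E6%A8%A1%E5%9E%8B/)

Source code overview

**update: 2019.11.18**
---

The code has not been touched for a long time, so some of the requests to the THSR website no longer work. I do not have time to fix them for now.

**Changes**:

* Updated README.md
* Paths changed to relative paths; the workspace should be `-Keras-Python3.6-captcha/cnn/`
* Updated `cnn_model.hdf5`; I found that the file uploaded originally was the wrong version

**Main steps**:

1. Use `python catch img session/get img.py` to download some captcha images. (The network part has not been fixed yet.)
2. Use `img process plus/img p plus.py` to process the downloaded images, then label them.
3. Use `keras tensor cnn/keras train.py` to train the model.
4. Use `get_imgae/2/get img drive` together with the trained model to make predictions. Captchas that were predicted successfully are downloaded and labeled automatically, which gives a larger dataset for training the Model.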
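Step 4 combines two ideas from the text: decoding the four 19-way outputs into a 4-character string, and keeping a captcha as a training sample only when the website accepted the guess. The sketch below illustrates that logic in plain Python under stated assumptions. `ALPHABET` is a placeholder (the real dictionary is only shown as an image), and `decode`, `harvest`, `one_hot`, `predict` and `is_accepted` are hypothetical names; in the real project the prediction would come from the trained Keras model and the acceptance check from Selenium.

```python
# Placeholder alphabet: 6 digits + 13 letters = 19 classes. The actual
# character set is the author's dictionary, so treat this as an assumption.
ALPHABET = "234579ACFHKMNPQRTYZ"


def one_hot(index, size=len(ALPHABET)):
    """Helper for demos: a score vector with a single 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def decode(outputs, alphabet=ALPHABET):
    """Turn one score vector per character into text by taking the argmax."""
    chars = []
    for scores in outputs:
        if len(scores) != len(alphabet):
            raise ValueError("each output needs one score per class")
        best = max(range(len(scores)), key=scores.__getitem__)
        chars.append(alphabet[best])
    return "".join(chars)


def harvest(images, predict, is_accepted):
    """Return (image, label) pairs for guesses that the site accepted.

    `predict(image)` is assumed to return four score vectors, and
    `is_accepted(image, guess)` to report whether the site took the guess.
    """
    labeled = []
    for image in images:
        guess = decode(predict(image))
        if is_accepted(image, guess):
            labeled.append((image, guess))  # label verified by the site
    return labeled
```

Because a label is stored only when the site accepts it, the harvested labels do not depend on manual labeling, which fits the reason given above for retraining on the new samples.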