###### tags: `Github`
Keras-Python3.6-captcha
===
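**Recognizing Taiwan High Speed Rail (THSR) captchas with Keras on TensorFlow and Python 3.6**
(For academic research only. Do not use this illegally for mass automated ticket booking.)
Source code: [https://github.com/gary9987/-Keras-TensorFlow-Python3.6-](https://github.com/gary9987/-Keras-TensorFlow-Python3.6-)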
![](使用Keras%20和Python3.fld/image001.png)
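Image processing:
The THSR captcha does not look too complicated; it just has one very thick curve running through it.
So the captcha has to be cleaned up before training:

1. Denoise with OpenCV
2. Convert to black and white
3. Remove the curve

The method follows these videos:
[[Web Scraping in Practice] How to crack the THSR captcha (1): removing image noise?](https://youtu.be/6HGbKdB4kVY)
[[Web Scraping in Practice] How to crack the THSR captcha (2): removing the extra arc with regression?](https://youtu.be/4DHcOPSfC4c)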
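
As a rough sketch of steps 1 and 2, the snippet below removes isolated noise pixels with a 3x3 median filter and then thresholds the result to black and white. It uses plain NumPy as a stand-in for the OpenCV calls in the videos, so the function names `median3x3` and `binarize` and the threshold of 128 are assumptions for illustration, not the author's code.

```python
import numpy as np

def median3x3(gray):
    """Replace each pixel with the median of its 3x3 neighbourhood."""
    padded = np.pad(gray, 1, mode="edge")
    h, w = gray.shape
    # Stack the nine shifted views and take the per-pixel median.
    windows = np.stack([padded[dy:dy + h, dx:dx + w]
                        for dy in range(3) for dx in range(3)])
    return np.median(windows, axis=0).astype(gray.dtype)

def binarize(gray, threshold=128):
    """Map pixels below the (assumed) threshold to 0 and the rest to 255."""
    return np.where(gray < threshold, 0, 255).astype(np.uint8)

# Toy 5x5 "image": white background with a single dark noise pixel.
img = np.full((5, 5), 255, dtype=np.uint8)
img[2, 2] = 0
clean = binarize(median3x3(img))
```

A single dark pixel surrounded by white is outvoted by its neighbours, so it disappears, while larger strokes survive the filter.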
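In practice, however, THSR captcha images vary in size, so I do not recommend taking a screenshot and cropping it. Downloading the image itself also leaves the curve more complete for the removal step.
The captcha has to be fetched through a session, because the request needs cookies.
The code looks roughly like this: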
![](使用Keras%20和Python3.fld/image004.png)
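
The screenshot above is the author's code. As a text sketch of the same idea, the function below visits the page first so that cookies are set, then requests the image with the same session. The name `fetch_captcha` and both URL parameters are hypothetical; `session` is assumed to be any object with a `requests.Session`-style `get(url)` method.

```python
def fetch_captcha(session, page_url, captcha_url):
    """Fetch one captcha image through a cookie-preserving session.

    Both URLs are placeholders supplied by the caller.
    """
    # Visit the booking page first so the server sets its cookies
    # on the session.
    session.get(page_url)
    # Request the image with the same session so those cookies are sent.
    response = session.get(captcha_url)
    return response.content  # raw image bytes, at the original size
```

With the `requests` library this could be called as `fetch_captcha(requests.Session(), page_url, captcha_url)`, and the returned bytes written to a file unchanged.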
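Later I found that the curve removal worked very poorly, because too few pixels were available for fitting the regression line. So I multiplied the captcha's original dpi by 10 before processing the curve, and used a different curve width for each image size. Finally I resized everything to a fixed size; I save the images as (140, 48).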
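
A minimal sketch of this idea, under some assumptions: the image is already a boolean ink mask, the curve is the only ink in the left and right margins, and a quadratic fits it well. The function name `remove_curve`, its parameters and the block-majority downscale are illustrative choices, not the author's implementation.

```python
import numpy as np

def remove_curve(ink, margin, width, scale=10, degree=2):
    """Erase a smooth curve from a boolean ink mask (True = dark pixel).

    margin: columns at each side assumed to contain only the curve.
    width:  thickness of the band to erase, in original pixels.
    """
    # Upscale so that many more pixels are available for the fit.
    big = np.repeat(np.repeat(ink, scale, axis=0), scale, axis=1)
    h, w = big.shape
    m = margin * scale

    # Fit a polynomial y = f(x) to ink pixels in the two margins only.
    ys, xs = np.nonzero(big)
    side = (xs < m) | (xs >= w - m)
    coeffs = np.polyfit(xs[side], ys[side], degree)

    # Evaluate the curve at every column and clear a band around it.
    centre = np.polyval(coeffs, np.arange(w))
    rows = np.arange(h)[:, None]
    band = np.abs(rows - centre[None, :]) <= (width * scale) / 2
    big[band] = False

    # Shrink back: keep a pixel if most of its scale x scale block is dark.
    blocks = big.reshape(h // scale, scale, w // scale, scale)
    return blocks.mean(axis=(1, 3)) >= 0.5

# Toy example: a flat "curve" on row 5 and a blob standing in for a character.
ink = np.zeros((12, 20), dtype=bool)
ink[5, :] = True
ink[8:11, 8:12] = True
cleaned = remove_curve(ink, margin=3, width=1)
```

The `width` argument plays the role of the per-size curve width mentioned above. A real pipeline would still need to resize the result to the fixed (140, 48) size.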
After processing, a captcha looks like this:
![](使用Keras%20和Python3.fld/image002.jpg)
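Image processing was probably the part that took me the longest, likely because I am not very familiar with Python.
Overall CNN architecture:
The architecture is based on this article, with my own modifications: [Implementing CNN-based Taiwan Railways booking captcha recognition and a high-fidelity training-set generator](https://github.com/JasonLiTW/simple-railway-captcha-solver)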
![](使用Keras%20和Python3.fld/image003.png)
![](使用Keras%20和Python3.fld/image005.png)
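The input at the top is 48x140 with 3 color channels (RGB), matching my images.
The main difference in the middle is two added Batch_normalization layers to reduce overfitting.
Dropout(0.5) follows as before, dropping 50% of the neurons.
I originally wanted to add a hidden layer with 4096 neurons, but the model size then ballooned to over 100 MB, even though the extra layer does raise accuracy a little.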
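
The size jump follows from the parameter count of a fully connected layer, which is (inputs + 1) x units. The arithmetic below is only an illustration: `flat_features = 6144` is a made-up flattened feature size, not a value read from the author's model, and float32 storage (4 bytes per parameter) is assumed.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return (n_inputs + 1) * n_units

def size_mb(n_params, bytes_per_param=4):
    """Approximate float32 storage in megabytes (10^6 bytes)."""
    return n_params * bytes_per_param / 1e6

# Hypothetical flattened feature size; the real value depends on the
# convolution and pooling layers in the architecture figure.
flat_features = 6144

extra = dense_params(flat_features, 4096)
print(extra, round(size_mb(extra), 1))  # 25169920 100.7
```

Under these assumptions a single 4096-unit layer alone accounts for roughly 100 MB, which is the order of magnitude reported above.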
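The output is four digits, because the THSR captcha has 4 characters. Each digit has 19 neurons, because a close look shows that THSR captchas are made up of 6 digits and 13 letters.
The dictionary I built shows the mapping: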
![](使用Keras%20和Python3.fld/image006.png)
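
To make the output shape concrete, here is a small sketch of how a 4-character label could be turned into four one-hot vectors of length 19. The `ALPHABET` string is a placeholder with 6 digits and 13 letters; the real mapping is the dictionary in the figure above, and `encode_label` is a hypothetical helper name.

```python
import numpy as np

# Placeholder alphabet: 6 digits + 13 letters = 19 classes (assumption).
ALPHABET = "234579ACFHKMNPQRTYZ"
CHAR_TO_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}

def encode_label(text):
    """Turn a 4-character captcha string into a (4, 19) one-hot array."""
    if len(text) != 4:
        raise ValueError("expected exactly 4 characters")
    onehots = np.zeros((4, len(ALPHABET)), dtype=np.float32)
    for position, ch in enumerate(text):
        onehots[position, CHAR_TO_INDEX[ch]] = 1.0
    return onehots

targets = encode_label("7K2M")
```

Each row of `targets` would serve as the training target for one of the four output branches.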
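Training:
Since I do not yet know enough about drawing images with Python, I could not write code to generate a training set myself, so I asked friends to help label images by hand. Thanks to all of them.
We hand-labeled 5,000 training samples and fed them to the machine; training finished within about three hours on CPU only.
Training machine hardware: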
`CPU: Intel Core I5-6360U`
`RAM: DDR3L 16GB`
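At first the model's accuracy was only about 85%, because there were too few training samples.
So I tried this model on the THSR website: whenever a prediction passed, the captcha was downloaded and labeled automatically. I used a fairly simple approach, driving this with Selenium, and ended up with 11,000 training samples in total.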
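
The self-labeling loop can be sketched independently of Selenium. In the snippet below, `fetch_image`, `predict` and `submit` are hypothetical hooks supplied by the caller (in the author's setup they would be backed by the browser automation and the trained model); only the control flow is shown.

```python
def harvest_samples(fetch_image, predict, submit, n_attempts):
    """Collect (image, label) pairs whose predicted label the site accepted.

    fetch_image() -> image bytes
    predict(image) -> 4-character string
    submit(guess) -> True if the website accepted the captcha
    """
    samples = []
    for _ in range(n_attempts):
        image = fetch_image()
        guess = predict(image)
        # Keep only predictions the website confirmed, so every saved
        # label is known to be correct.
        if submit(guess):
            samples.append((image, guess))
    return samples
```

Because rejected guesses are discarded, the harvested set contains no labeling mistakes, at the cost of losing the images the model gets wrong.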
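I then trained a new model on the new samples, since the hand-made labels may contain mistakes. The new model reached 94.5% accuracy on 5,000 test samples, which is good enough.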
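
The source does not say how accuracy was counted. One plausible reading is whole-captcha accuracy, where a sample counts only if all four characters are right; under that reading 94.5% of 5,000 is 4,725 captchas. The sketch below shows that measure, with a placeholder `ALPHABET` and hypothetical helper names `decode` and `captcha_accuracy`.

```python
import numpy as np

ALPHABET = "234579ACFHKMNPQRTYZ"  # placeholder 19-character set (assumption)

def decode(outputs):
    """Convert four per-position score vectors into a 4-character string."""
    return "".join(ALPHABET[int(np.argmax(scores))] for scores in outputs)

def captcha_accuracy(predicted, expected):
    """Fraction of captchas where all four characters match."""
    hits = sum(p == e for p, e in zip(predicted, expected))
    return hits / len(expected)

# Toy scores: one-hot rows standing in for the four softmax outputs.
scores = np.eye(len(ALPHABET))[[4, 10, 0, 11]]
text = decode(scores)
```

A single wrong character makes the whole captcha count as a miss, so this measure is stricter than per-character accuracy.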
Training process below. There is a lot of output, so only some of the epochs are shown.
![](使用Keras%20和Python3.fld/image007.png)
![](使用Keras%20和Python3.fld/image008.png)
![](使用Keras%20和Python3.fld/image010.png)![](使用Keras%20和Python3.fld/image009.png)
![](使用Keras%20和Python3.fld/image011.png)![](使用Keras%20和Python3.fld/image012.png)
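References:

1. TensorFlow+Keras深度學習人工智慧實務應用 (book; roughly "TensorFlow+Keras Deep Learning: Practical AI Applications")
2. [Implementing CNN-based Taiwan Railways booking captcha recognition and a high-fidelity training-set generator](https://github.com/JasonLiTW/simple-railway-captcha-solver)
3. [~~TensorFlow recognition of distorted-letter captchas: open source code and a 98% model~~](https://www.urlteam.org/2017/03/tensorflow%E8%AF%86%E5%88%AB%E5%AD%97%E6%AF%8D%E6%89%AD%E6%9B%B2%E5%B9%B2%E6%89%B0%E5%9E%8B%E9%AA%8C%E8%AF%81%E7%A0%81-%E5%BC%80%E6%94%BE%E6%BA%90%E7%A0%81%E4%B8%8E98%E6%A8%A1%E5%9E%8B/)
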
Source code overview **update: 2019.11.18**
---
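Because the code has not been touched for a long time, some of the parts that send requests to the THSR website no longer work.
I do not have time to fix them for now.

**Updates**:
* Updated README.md
* Paths changed to relative paths; the workspace should be `-Keras-Python3.6-captcha/cnn/`
* Updated `cnn_model.hdf5`, after discovering that the originally uploaded file was the wrong version

**Main steps**:
1. First run `python catch img session/get img.py` to download some captcha images. (The network part has not been fixed yet.)
2. Use `img process plus/img p plus.py` to process the downloaded images, then label them.
3. Use `keras tensor cnn/keras train.py` to train the model.
4. Use `get_imgae/2/get img drive` together with the trained model to make predictions, download the images whose predictions succeed, and label them automatically, giving a larger dataset for training the model.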