# Gradient-Based Learning Applied to Document Recognition_Paper(翻譯)(III, IV)
###### tags: `LeNet-5` `CNN` `論文翻譯` `deeplearning` `Gradient-Based Learning Applied to Document Recognition`
[TOC]
## 說明
區塊如下分類,原文區塊為藍底,翻譯區塊為綠底,部份專業用語翻譯參考國家教育研究院
:::info
原文
:::
:::success
翻譯
:::
:::warning
任何的翻譯不通暢部份都請留言指導
:::
:::danger
* [paper hyperlink](http://yann.lecun.com/exdb/publis/pdf/lecun-98.pdf)
* [\*paper hyperlink](http://vision.stanford.edu/cs598_spring07/papers/Lecun98.pdf)
:::
## III. Results and Comparison with Other Methods
:::info
While recognizing individual digits is only one of many problems involved in designing a practical recognition system, it is an excellent benchmark for comparing shape recognition methods. Though many existing methods combine a handcrafted feature extractor and a trainable classifier, this study concentrates on adaptive methods that operate directly on size-normalized images.
:::
:::success
儘管辨識單個數字只是設計一個實用辨識系統所涉及的眾多問題之一,但它是比較形狀辨識方法的一個很好的[基準](http://terms.naer.edu.tw/detail/2338242/)。雖然許多現有方法結合手工特徵提取器以及可訓練的分類器,但本研究集中在直接對大小[標準化](http://terms.naer.edu.tw/detail/5425344/)影像進行操作的[自適應](http://terms.naer.edu.tw/detail/2454885/)方法上。
:::
### A. Database: the Modified NIST set
:::info
The database used to train and test the systems described in this paper was constructed from the NIST's Special Database 3 and Special Database 1 containing binary images of handwritten digits. NIST originally designated SD-3 as their training set and SD-1 as their test set. However, SD-3 is much cleaner and easier to recognize than SD-1. The reason for this can be found on the fact that SD-3 was collected among Census Bureau employees, while SD-1 was collected among high-school students. Drawing sensible conclusions from learning experiments requires that the result be independent of the choice of training set and test among the complete set of samples. Therefore it was necessary to build a new database by mixing NIST's datasets.
SD-1 contains 58,527 digit images written by 500 different writers. In contrast to SD-3, where blocks of data from each writer appeared in sequence, the data in SD-1 is scrambled. Writer identities for SD-1 are available and we used this information to unscramble the writers. We then split SD-1 in two: characters written by the first 250 writers went into our new training set. The remaining 250 writers were placed in our test set. Thus we had two sets with nearly 30,000 examples each. The new training set was completed with enough examples from SD-3, starting at pattern # 0, to make a full set of 60,000 training patterns. Similarly, the new test set was completed with SD-3 examples starting at pattern # 35,000 to make a full set with 60,000 test patterns. In the experiments described here, we only used a subset of 10,000 test images (5,000 from SD-1 and 5,000 from SD-3), but we used the full 60,000 training samples. The resulting database was called the Modified NIST, or MNIST, dataset.
The original black and white (bilevel) images were size-normalized to fit in a 20x20 pixel box while preserving their aspect ratio. The resulting images contain grey levels as result of the anti-aliasing (image interpolation) technique used by the normalization algorithm. Three versions of the database were used. In the first version, the images were centered in a 28x28 image by computing the center of mass of the pixels, and translating the image so as to position this point at the center of the 28x28 field. In some instances, this 28x28 field was extended to 32x32 with background pixels. This version of the database will be referred to as the regular database. In the second version of the database, the character images were deslanted and cropped down to 20x20 pixels images. The deslanting computes the second moments of inertia of the pixels (counting a foreground pixel as 1 and a background pixel as 0), and shears the image by horizontally shifting the lines so that the principal axis is vertical. This version of the database will be referred to as the deslanted database. In the third version of the database, used in some early experiments, the images were reduced to 16x16 pixels. The regular database (60,000 training examples, 10,000 test examples size-normalized to 20x20, and centered by center of mass in 28x28 fields) is available at http://www.research.att.com/yann/ocr/mnist. Figure 4 shows examples randomly picked from the test set.
:::
:::success
用來訓練與測試論文中所說的系統的資料庫是由NIST的特別資料庫3(SD-3)以及特別資料庫1(SD-1)所建構,包含手寫數字的[二進制影像](http://terms.naer.edu.tw/detail/2341775/)。NIST原本將SD-3指定為訓練資料集,SD-1為測試資料集。然而,SD-3比SD-1更乾淨,更容易辨識。究其原因,SD-3取樣自[普查局](http://terms.naer.edu.tw/detail/338511/)僱員所寫的字,而SD-1取樣自高中學生。要從學習實驗中得到合理的結論,要求結果必須獨立於訓練集與測試集在完整樣本集合中的選取方式。因此,有必要通過混合NIST的資料集來建立一個新的資料庫。
SD-1包含由500位不同書寫者所寫的58,527張數字影像。對比SD-3,SD-1中的資料是被打散的,而SD-3中每一個書寫者的資料區塊依序出現。SD-1的書寫者身份是可取得的,我們使用這個訊息來還原書寫者的順序。然後我們將SD-1分為兩部份:前250位書寫者所寫的字符進入新的訓練集,其餘250位書寫者的資料則置於測試集。因此,我們有兩個資料集,每一個擁有將近30,000筆資料。新的訓練集再以SD-3中足夠數量的範例(從圖像#0開始)補足,形成完整的60,000張訓練圖像。同樣的,新的測試集以SD-3的範例(從圖像#35,000開始)補足,形成完整的60,000張測試圖像。在這裡說明的實驗中,我們只使用10,000張測試影像的子集(5,000張來自SD-1,5,000張來自SD-3),但我們使用完整的60,000筆訓練樣本。所得的資料庫稱為Modified NIST,或MNIST資料集。
原始黑白(二值)影像做大小[標準化](http://terms.naer.edu.tw/detail/5425344/)以適應20x20像素的框,同時保留長寬比。由於[正規化](http://terms.naer.edu.tw/detail/2400626/)演算法使用的抗鋸齒(圖像插值)技術,所得影像包含[灰階](http://terms.naer.edu.tw/detail/2374203/)。本文使用三個版本的資料庫。第一個版本中,透過計算像素的質心,並[位移](http://terms.naer.edu.tw/detail/2432307/)影像,將這一點定位在28x28範圍的中心,從而將影像在28x28影像中置中。某些情況下,這個28x28範圍會以背景像素擴展為32x32。這個版本的資料庫被稱為正規資料庫。資料庫的第二個版本中,字符影像經過去斜(deslanting)並剪裁為20x20像素的影像。deslanting計算像素的[二次](http://terms.naer.edu.tw/detail/1330095/)[慣性矩](http://terms.naer.edu.tw/detail/2395319/)(前景像素計為1,背景像素計為0),並藉由水平移動各列來剪切影像,以使[主軸](http://terms.naer.edu.tw/detail/2122258/)垂直。這個版本的資料庫稱為deslanted資料庫。第三個版本的資料庫中(用於一些早期實驗),影像縮小為16x16像素。正規資料庫(60,000筆訓練樣本,10,000筆測試樣本,大小[標準化](http://terms.naer.edu.tw/detail/5425344/)為20x20,依[質心](http://terms.naer.edu.tw/detail/700455/)置中於28x28範圍)可以由[超連結](http://www.research.att.com/yann/ocr/mnist)取得。圖片4展示從測試集中隨機挑選的樣本。
:::
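下面是去斜(deslanting)步驟的一個最小示意(非論文原始程式),假設輸入為前景為1、背景為0的2D NumPy陣列;依上文所述,以像素的二次動差估計主軸斜率,再逐列水平平移做剪切。函數與變數名稱皆為示意用途自行命名。

```python
import numpy as np

def deslant(img):
    """以像素二次動差估計筆畫主軸斜率,水平剪切使主軸接近垂直(示意實作)。"""
    h, w = img.shape
    ys, xs = np.nonzero(img > 0)                 # 前景像素計為 1,背景為 0
    if xs.size == 0:
        return img
    cx, cy = xs.mean(), ys.mean()
    mu_xy = np.mean((xs - cx) * (ys - cy))       # 二次動差(混合項)
    mu_yy = np.mean((ys - cy) ** 2)
    alpha = mu_xy / mu_yy                        # 每單位高度的水平偏移量(斜率)
    out = np.zeros_like(img)
    for y in range(h):
        shift = int(round(alpha * (y - cy)))     # 逐列水平平移即為剪切
        src = np.arange(w) + shift
        valid = (src >= 0) & (src < w)
        out[y, valid] = img[y, src[valid]]
    return out
```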
:::info

Fig. 4. Size-normalized examples from the MNIST database.
圖片4. 來自MNIST資料庫中大小[標準化](http://terms.naer.edu.tw/detail/5425344/)的樣本。
:::
### B. Results
:::info
Several versions of LeNet-5 were trained on the regular MNIST database. 20 iterations through the entire training data were performed for each session. The values of the global learning rate $\eta$ (see Equation 21 in Appendix C for a definition) was decreased using the following schedule: 0.0005 for the first two passes, 0.0002 for the next three, 0.0001 for the next three, 0.00005 for the next 4, and 0.00001 thereafter. Before each iteration, the diagonal Hessian approximation was reevaluated on 500 samples, as described in Appendix C and kept fixed during the entire iteration. The parameter $\mu$ was set to 0.02. The resulting effective learning rates during the first pass varied between approximately $7 \times 10^{-5}$ and 0.016 over the set of parameters. The test error rate stabilizes after around 10 passes through the training set at 0.95%. The error rate on the training set reaches 0.35% after 19 passes. Many authors have reported observing the common phenomenon of over-training when training neural networks or other adaptive algorithms on various tasks. When over-training occurs, the training error keeps decreasing over time, but the test error goes through a minimum and starts increasing after a certain number of iterations. While this phenomenon is very common, it was not observed in our case as the learning curves in figure 5 show. A possible reason is that the learning rate was kept relatively large. The effect of this is that the weights never settle down in the local minimum but keep oscillating randomly. Because of those fluctuations, the average cost will be lower in a broader minimum. Therefore, stochastic gradient will have a similar effect as a regularization term that favors broader minima. Broader minima correspond to solutions with large entropy of the parameter distribution, which is beneficial to the generalization error.
:::
:::success
LeNet-5的多個版本在正規MNIST資料庫上進行訓練。每個session對整個訓練資料做20次的迭代。全域學習率$\eta$(定義見附錄C中的公式21)依下面的排程遞減:前兩次為0.0005,接下來三次為0.0002,再接下來三次為0.0001,接下來四次為0.00005,此後為0.00001。每次迭代之前,對500個樣本重新評估diagonal Hessian近似值(如附錄C所說明),並在整個迭代過程中保持固定。參數$\mu$設置為0.02。第一次通過訓練集時,各參數的有效學習率大約介於$7 \times 10^{-5}$與0.016之間。在大約十次通過訓練集後,測試誤差率穩定在0.95%。在19次之後,訓練集上的誤差率來到0.35%。許多作者都曾報告,在各種任務上訓練神經網路或其它自適應演算法時,會觀察到過度訓練的普遍現象。當過度訓練發生的時候,訓練誤差會隨著時間持續減少,但測試誤差會達到最小值,並在一定次數的迭代之後開始增加。儘管這種現象非常常見,但在我們的案例中並沒有觀察到,如圖5的學習曲線所示。一個可能的原因是學習率維持在相對較大的值。其影響是權重永遠不會安定在局部極小值中,而是持續隨機振盪。由於這些[波動](http://terms.naer.edu.tw/detail/2116285/),在較寬廣的極小值處平均成本會較低。因此,隨機梯度會有類似於偏好較寬廣極小值的[正則化](http://terms.naer.edu.tw/detail/2123513/)項的效果。較寬廣的極小值對應於參數分佈熵較大的解,這有利於泛化誤差。
:::
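上述全域學習率排程可以直接寫成程式,以下是一個簡單示意(pass索引從0起算;函數名稱為示意用途自行命名):

```python
def global_learning_rate(pass_idx):
    """回傳第 pass_idx 次(自 0 起算)通過訓練集時的全域學習率 eta。"""
    if pass_idx < 2:
        return 0.0005    # 前兩次
    if pass_idx < 5:
        return 0.0002    # 接下來三次
    if pass_idx < 8:
        return 0.0001    # 再接下來三次
    if pass_idx < 12:
        return 0.00005   # 接下來四次
    return 0.00001       # 此後

# 例如列出 20 次迭代的排程
schedule = [global_learning_rate(i) for i in range(20)]
```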
:::info

Fig. 5. Training and test error of LeNet-5 as a function of the number of passes through the 60,000 pattern training set (without distortions). The average training error is measured on-the-fly as training proceeds. This explains why the training error appears to be larger than the test error. Convergence is attained after 10 to 12 passes through the training set.
圖5. LeNet-5的訓練與測試誤差,為通過60,000張圖像訓練集(無失真)次數的函數。平均訓練誤差是在訓練進行中即時測量的,這說明了為什麼訓練誤差看起來大於測試誤差。通過訓練集10到12次之後達到收斂。
:::
:::info
The influence of the training set size was measured by training the network with 15,000, 30,000, and 60,000 examples. The resulting training error and test error are shown in figure 6. It is clear that, even with specialized architectures such as LeNet-5, more training data would improve the accuracy.
:::
:::success
通過使用15,000、30,000與60,000筆樣本訓練網路來測量訓練集大小的影響。產生的訓練誤差與測試誤差如圖6所示。很明顯的,即使是像LeNet-5這樣的專用架構,更多的訓練資料也會提高準確度。
:::
:::info

Fig.6. Training and test errors of LeNet-5 achieved using training sets of various sizes. This graph suggests that a larger training set could improve the performance of LeNet-5. The hollow square show the test error when more training patterns are artificially generated using random distortions. The test patterns are not distorted.
圖6. 使用各種不同大小的訓練集,LeNet-5達到的訓練與測試誤差。這張圖說明,更大的訓練集可以提高LeNet-5的效能。空心正方形顯示使用隨機[失真](http://terms.naer.edu.tw/detail/2360643/)(變形)人工生成更多訓練圖像時的測試誤差。測試圖像並沒有[失真](http://terms.naer.edu.tw/detail/2360643/)(變形)。
:::
:::info
To verify this hypothesis, we artificially generated more training examples by randomly distorting the original training images. The increased training set was composed of the 60,000 original patterns plus 540,000 instances of distorted patterns with randomly picked distortion parameters. The distortions were combinations of the following planar affine transformations: horizontal and vertical translations, scaling, squeezing (simultaneous horizontal compression and vertical elongation, or the reverse), and horizontal shearing. Figure 7 shows examples of distorted patterns used for training. When distorted data was used for training, the test error rate dropped to 0.8% (from 0.95% without deformation). The same training parameters were used as without deformations. The total length of the training session was left unchanged (20 passes of 60,000 patterns each). It is interesting to note that the network effectively sees each individual sample only twice over the course of these 20 passes.
Figure 8 shows all 82 misclassified test examples. Some of those examples are genuinely ambiguous, but several are perfectly identifiable by humans, although they are written in an under-represented style. This shows that further improvements are to be expected with more training data.
:::
:::success
為了驗證這個假設,我們對原始訓練影像做隨機[失真](http://terms.naer.edu.tw/detail/2360643/)(變形),人工生成更多訓練樣本。增加後的訓練集包含60,000張原始訓練圖像,加上540,000張以隨機挑選的[失真](http://terms.naer.edu.tw/detail/2360643/)(變形)參數生成的[失真](http://terms.naer.edu.tw/detail/2360643/)(變形)圖像。[失真](http://terms.naer.edu.tw/detail/2360643/)(變形)是以下平面仿射轉換的組合:水平與垂直平移、縮放、擠壓(同時水平壓縮與垂直[延伸](http://terms.naer.edu.tw/detail/2363797/),或者相反)以及水平剪切。圖7展示用於訓練的[失真](http://terms.naer.edu.tw/detail/2360643/)(變形)圖像範例。當[失真](http://terms.naer.edu.tw/detail/2360643/)(變形)的資料被用於訓練時,測試誤差率下降到0.8%(未變形時為0.95%)。訓練參數與未變形時相同,訓練時間總長度也保持不變(20次通過60,000張圖像)。值得注意的是,在這20次的過程中,網路實際上每個單獨樣本只看過兩次。
圖8展示全部82張分類錯誤的測試樣本。其中部份樣本確實模棱兩可,但也有幾個樣本儘管以訓練集中代表性不足的風格書寫,人類仍然可以完美辨識。這說明了,更多的訓練資料有望帶來進一步的改善。
:::
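下面以SciPy的仿射變換示意如何隨機產生上述的平面仿射失真(平移、縮放、擠壓、水平剪切);各參數的取值範圍為假設值,並非論文中的設定。

```python
import numpy as np
from scipy.ndimage import affine_transform

def random_affine_distort(img, rng=np.random.default_rng()):
    """對單張影像套用隨機平移、縮放、擠壓與水平剪切(參數範圍為示意用假設值)。"""
    sx = rng.uniform(0.9, 1.1)            # 水平縮放
    sy = rng.uniform(0.9, 1.1)            # 垂直縮放(sx、sy 不同即為擠壓)
    shear = rng.uniform(-0.15, 0.15)      # 水平剪切
    ty, tx = rng.uniform(-2, 2, size=2)   # 平移(像素)
    # 2x2 變換矩陣,座標順序為(列、行);affine_transform 以輸出座標對應回輸入座標
    A = np.array([[1.0 / sy, 0.0],
                  [shear,    1.0 / sx]])
    c = np.array(img.shape) / 2.0
    offset = c - A @ c + np.array([ty, tx])   # 以影像中心為基準並加上平移
    return affine_transform(img, A, offset=offset, order=1, mode='constant')
```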
:::info

Fig. 7. Examples of distortions of ten training patterns.
圖片7. 十個訓練圖像的[失真](http://terms.naer.edu.tw/detail/2360643/)(變形)範例。
:::
:::info

Fig.8. The 82 test patterns misclassified by LeNet-5. Below each image is displayed the correct answers (left) and the network answer (right). These errors are mostly caused either by genuinely ambiguous patterns or by digits written in a style that are under-represented in the training set.
圖片8. 82個LeNet-5分類錯誤的測試圖像。每張影像的下方顯示正確答案(左)以及網路的答案(右)。這些錯誤大多是由真正模棱兩可的圖像,或以訓練集中代表性不足的風格書寫的數字所引起的。
:::
### C. Comparison with Other Classifiers
:::info
For the sake of comparison, a variety of other trainable classifiers was trained and tested on the same database. An early subset of these results was presented in \[51\]. The error rates on the test set for the various methods are shown in figure 9.
:::
:::success
為了進行比較,用相同的資料集來訓練、測試各種其它可訓練的分類器。這些結果的早期子集在\[51\]中呈現。圖9說明各種方法在測試集上的誤差率。
:::
:::info

Fig. 9. Error rate on the test set (%) for various classification methods. \[deslant\] indicates that the classifier was trained and tested on the deslanted version of the database. \[dist\] indicates that the training set was augmented with artificially distorted examples. \[16x16\] indicates that the system used the 16x16 pixel images. The uncertainty in the quoted error rates is about 0.1%.
圖9. 各種分類方法在測試集上的誤差率(%)。\[deslant\]表示分類器訓練、測試在deslanted版本的資料集。\[dist\]表示訓練集增加人工[失真](http://terms.naer.edu.tw/detail/2360643/)(變形)樣本。\[16x16\]表示系統使用16x16像素影像。引用的誤差率的不確定性約0.1%
:::
#### C.1 Linear Classifier, and Pairwise Linear Classifier
:::info
Possibly the simplest classifier that one might consider is a linear classifier. Each input pixel value contributes to a weighted sum for each output unit. The output unit with the highest sum (including the contribution of a bias constant) indicates the class of the input character. On the regular data, the error rate is 12%. The network has 7850 free parameters. On the deslanted images, the test error rate is 8.4%. The network has 4010 free parameters. The deficiencies of the linear classifier are well documented \[1\] and it is included here simply to form a basis of comparison for more sophisticated classifiers. Various combinations of sigmoid units, linear units, gradient descent learning, and learning by directly solving linear systems gave similar results.
A simple improvement of the basic linear classifier was tested \[52\]. The idea is to train each unit of a single-layer network to separate each class from each other class. In our case this layer comprises 45 units labeled 0/1, 0/2,...0/9, 1/2....8/9. Unit $\frac{i}{j}$ is trained to produce +1 on patterns of class $i$, -1 on patterns of class $j$, and is not trained on other patterns. The final score for class $i$ is the sum of the outputs of all the units labeled $\frac{i}{x}$ minus the sum of the outputs of all the units labeled $\frac{y}{i}$, for all $x$ and $y$. The error rate on the regular test set was 7.6%.
:::
:::success
人們可能想到的最簡單的分類器是線性分類器。每一個輸入像素值都會貢獻到每一個輸出單元的加權總和中。總和最高的輸出單元(包含偏差常數的貢獻)代表輸入字符的類別。在常規資料上,誤差率為12%。網路擁有7850個可用參數。在deslanted影像上,測試誤差率為8.4%。網路擁有4010個可用參數。線性分類器的缺陷已有完整的文獻記載\[1\],這裡將它納入只是為了做為更複雜分類器的比較基礎。sigmoid單元、線性單元、梯度下降學習以及直接求解線性系統的學習等各種組合,都給出類似的結果。
我們也測試了基本線性分類器的一個簡單改進\[52\]。其想法是訓練單層網路的每個單元,把每個類別與其它每個類別分開。在我們的情況中,該層包含45個標記為0/1, 0/2,...0/9, 1/2....8/9的單元。單元$\frac{i}{j}$被訓練在類別$i$的圖像上產生+1,在類別$j$的圖像上產生-1,並且不在其它圖像上進行訓練。對所有的$x$、$y$而言,類別$i$的最終得分為所有標記為$\frac{i}{x}$的單元輸出總和,減去所有標記為$\frac{y}{i}$的單元輸出總和。常規測試集的誤差率為7.6%。
:::
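依上文的計分規則,可以用下列簡單程式示意成對(pairwise)線性分類器的最終得分計算;`unit_outputs`的鍵為(i, j)(i < j),值為單元i/j對某張影像的輸出,皆為示意用途假設的介面。

```python
import numpy as np

def pairwise_scores(unit_outputs, n_classes=10):
    """類別 i 的最終得分 = 所有 i/x 單元輸出之和 - 所有 y/i 單元輸出之和。"""
    scores = np.zeros(n_classes)
    for (i, j), out in unit_outputs.items():   # 單元 i/j:類別 i 為 +1,類別 j 為 -1
        scores[i] += out
        scores[j] -= out
    return scores

# 用法示意:45 個單元的輸出 -> 取得分最高的類別
# pred = int(np.argmax(pairwise_scores(unit_outputs)))
```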
#### C.2 Baseline Nearest Neighbor Classifier
:::info
Another simple classifier is a K-nearest neighbor classifier with a Euclidean distance measure between input images. This classifier has the advantage that no training time, and no brain on the part of the designer, are required. However, the memory requirement and recognition time are large: the complete 60,000 twenty by twenty pixel training images (about 24 Megabytes at one byte per pixel) must be available at run time. Much more compact representations could be devised with modest increase in error rate. On the regular test set the error rate was 5.0%. On the deslanted data, the error rate was 2.4%, with $k=3$. Naturally, a realistic Euclidean distance nearest-neighbor system would operate on feature vectors rather than directly on the pixels, but since all of the other systems presented in this study operate directly on the pixels, this result is useful for a baseline comparison.
:::
:::success
另一種簡單的分類器是以輸入影像之間的歐氏距離做度量的K最近鄰(KNN)分類器。這種分類器的優點是不需要訓練時間,也不需要設計人員在這部份燒腦。但記憶體需求與辨識時間很大:執行時必須能夠取得完整的60,000張20x20像素的訓練影像(以每像素1 byte計,大約24MB)。在誤差率適度增加的情況下,可以設計更為緊湊的表示形式。在常規測試集上的誤差率為5.0%。在deslanted資料上,$k=3$時誤差率為2.4%。很自然地,一個實際的歐氏距離最近鄰系統會運算於特徵向量而不是直接運算於像素上,但由於本研究中的其它系統都是直接運算在像素上,因此這個結果對基線比較非常有用。
:::
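基線K最近鄰分類器的核心只有幾行,以下為示意(k=3對應上文deslanted資料2.4%誤差率的設定;`train_x`為攤平成向量的訓練影像,名稱為示意用途自行命名)。

```python
import numpy as np

def knn_predict(x, train_x, train_y, k=3, n_classes=10):
    """以歐氏距離找出 k 個最近鄰,對其標籤投票(純示意,無任何加速結構)。"""
    d2 = np.sum((train_x - x) ** 2, axis=1)     # 平方歐氏距離即可比較遠近
    nearest = np.argsort(d2)[:k]
    votes = np.bincount(train_y[nearest], minlength=n_classes)
    return int(np.argmax(votes))
```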
#### C.3 Principal Component Analysis (PCA) and Polynomial Classifier
:::info
Following \[53\], \[54\], a preprocessing stage was constructed which computes the projection of the input pattern on the 40 principal components of the set of training vectors. To compute the principal components, the mean of each input component was first computed and subtracted from the training vectors. The covariance matrix of the resulting vectors was then computed and diagonalized using Singular Value Decomposition. The 40-dimensional feature vector was used as the input of a second degree polynomial classifier. This classifier can be seen as a linear classifier with 821 inputs, preceded by a module that computes all products of pairs of input variables. The error on the regular test set was 3.3%.
:::
:::success
根據\[53\]、\[54\],建構一個預處理階段,用以計算輸入圖像在訓練向量集的40個[主成份](http://terms.naer.edu.tw/detail/5588/)上的投影。為了計算[主成份](http://terms.naer.edu.tw/detail/5588/),首先計算每個輸入成份的均值,並從訓練向量中減去。然後計算所得向量的[共變異方陣](http://terms.naer.edu.tw/detail/3186847/)(協方差矩陣),並用[奇異值分解](http://terms.naer.edu.tw/detail/2124842/)將其對角化。40維的特徵向量被用來做為二次多項式分類器的輸入。這個分類器可以視為帶有821個輸入的線性分類器,其前面接著一個計算輸入變數所有成對乘積的模組。常規測試集上的誤差為3.3%。
:::
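下面以NumPy示意這個前處理與特徵展開的流程:先減去均值、以SVD求前40個主成份的投影,再計算40維特徵的所有成對乘積,作為後續線性分類器的輸入(展開後的維度與論文的821略有出入,僅為示意)。

```python
import numpy as np

def pca_projection(X, n_components=40):
    """以 SVD 對中心化後資料的共變異結構對角化,回傳均值與投影矩陣(示意)。"""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components].T            # 投影矩陣:每一欄是一個主成份方向

def quadratic_features(z):
    """40 維特徵的所有成對乘積(含平方項),供二次多項式分類器使用。"""
    iu = np.triu_indices(len(z))
    return np.outer(z, z)[iu]

# 用法示意:
# mean, P = pca_projection(train_images)        # train_images: (N, 784)
# feats = quadratic_features((x - mean) @ P)    # 之後接一個線性分類器
```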
#### C.4 Radial Basis Function Network
:::info
Following \[55\], an RBF network was constructed. The first layer was composed of 1,000 Gaussian RBF units with 28x28 inputs, and the second layer was a simple 1000 inputs / 10 outputs linear classifier. The RBF units were divided into 10 groups of 100. Each group of units was trained on all the training examples of one of the 10 classes using the adaptive K-means algorithm. The second layer weights were computed using a regularized pseudo-inverse method. The error rate on the regular test set was 3.6%
:::
:::success
根據\[55\]建構一個RBF網路。第一層由1,000個帶有28x28輸入的Gaussian RBF單元組成,第二層是簡單的1000個輸入/10個輸出的線性分類器。RBF單元劃分為10組,每組100個。每一組單元都使用adaptive K-means演算法,在十個類別之一的所有訓練樣本上進行訓練。第二層權重使用正則化的[擬逆(矩陣)](http://terms.naer.edu.tw/detail/3165026/)方法計算。常規測試集上的誤差率為3.6%。
:::
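以下是上述RBF網路前向計算的一個簡單示意:第一層為高斯RBF單元(中心可由各類別的K-means求得),第二層為線性分類器,其權重以正則化的擬逆(此處以ridge形式)求解;`centers`、`beta`、`lam`等皆為示意用的假設參數。

```python
import numpy as np

def rbf_features(X, centers, beta=1.0):
    """高斯 RBF 層:exp(-beta * ||x - c||^2);實務上應分批計算以節省記憶體。"""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-beta * d2)

def fit_second_layer(Phi, Y, lam=1e-3):
    """以正則化擬逆(ridge 形式)解第二層線性權重 W,使 Phi @ W ≈ Y(one-hot)。"""
    k = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(k), Phi.T @ Y)

# 用法示意:
# Phi = rbf_features(train_X, centers)   # centers 由每類 100 個 K-means 中心組成
# W = fit_second_layer(Phi, one_hot_labels)
# pred = np.argmax(rbf_features(test_X, centers) @ W, axis=1)
```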
#### C.5 One-Hidden Layer Fully Connected Multilayer Neural Network
:::info
Another classifier that we tested was a fully connected multi-layer neural network with two layers of weights (one hidden layer) trained with the version of back-propagation
described in Appendix C. Error on the regular test set was 4.7% for a network with 300 hidden units, and 4.5% for a network with 1000 hidden units. Using artificial distortions to generate more training data brought only marginal improvement: 3.6% for 300 hidden units, and 3.8% for 1000 hidden units. When deslanted images were used, the test error jumped down to 1.6% for a network with 300 hidden units.
It remains somewhat of a mystery that networks with such a large number of free parameters manage to achieve reasonably low testing errors. We conjecture that the dynamics of gradient descent learning in multi-layer nets has a "self-regularization" effect. Because the origin of weight space is a saddle point that is attractive in almost every direction, the weights invariably shrink during the first few epochs (recent theoretical analysis seem to confirm this \[56\]). Small weights cause the sigmoids to operate in the quasi-linear region, making the network essentially equivalent to a low-capacity, single-layer network. As the learning proceeds, the weights grow, which progressively increases the effective capacity of the network. This seems to be an almost perfect, if fortuitous, implementation of Vapnik's "Structural Risk Minimization" principle \[6\]. A better theoretical understanding of these phenomena, and more empirical evidence, are definitely needed.
:::
:::success
我們測試的另一個分類器是全連接的多層神經網路,帶有兩層權重(一個隱藏層),以附錄C中說明的反向傳播版本訓練。在常規測試集上,隱藏單元為300時誤差率為4.7%,隱藏單元為1000時誤差率為4.5%。使用人工變形來生成更多訓練資料只帶來些許的改善:300個隱藏單元時為3.6%,1000個隱藏單元時為3.8%。當使用deslanted影像時,帶300個隱藏單元的網路其測試誤差降為1.6%。
具有如此大量可用參數的網路能夠達到相當低的測試誤差,這仍然有點像一個謎。我們推測,多層網路中梯度下降學習的[動力學](http://terms.naer.edu.tw/detail/931674/)具有"[自我正規化](https://www.ithome.com.tw/news/110578)"效應。因為權重空間的原點是一個幾乎在每一個方向都具有吸引力的鞍點,因此權重在前幾個epochs總是會收縮(近期的理論分析似乎證實了這一點\[56\])。較小的權重使sigmoid在[準線性](http://terms.naer.edu.tw/detail/2413064/)區域中運作,這使得網路實質上相當於一個低容量的單層網路。隨著學習的進行,權重逐漸增大,這逐步增加網路的有效容量。這似乎是Vapnik的"結構風險最小化"原則\[6\]一個近乎完美(雖然是偶然)的實現。我們確實需要對這些現象有更好的理論理解以及更多的實驗證據。
:::
#### C.6 Two-Hidden Layer Fully Connected Multilayer Neural Network
:::info
To see the effect of the architecture, several two-hidden layer multilayer neural networks were trained. Theoretical results have shown that any function can be approximated by a one-hidden layer neural network \[57\]. However, several authors have observed that two-hidden layer architectures sometimes yield better performance in practical situations. This phenomenon was also observed here. The test error rate of a 28x28-300-100-10 network was 3.05%, a much better result than the one-hidden layer network, obtained using marginally more weights and connections. Increasing the network size to 28x28-1000-150-10 yielded only marginally improved error rates: 2.95%. Training with distorted patterns improved the performance somewhat: 2.50% error for the 28x28-300-100-10 network, and 2.45% for the 28x28-1000-150-10 network.
:::
:::success
為了瞭解架構的影響,訓練了多個帶兩層隱藏層的多層神經網路。理論結果表明,任意函數都可以由帶單層隱藏層的神經網路來近似\[57\]。然而,多位作者觀察到,在實際情況中,兩層隱藏層的架構有時會產生更好的效能。這裡也觀察到這種現象。28x28-300-100-10網路的測試誤差率為3.05%,只用了略多一點的權重與連接,結果就比單隱藏層網路好得多。將網路大小增加為28x28-1000-150-10只帶來些許改善:誤差率為2.95%。以[失真](http://terms.naer.edu.tw/detail/2360643/)(變形)圖像訓練能在一定程度上提高效能:28x28-300-100-10網路的誤差率為2.50%,28x28-1000-150-10網路的誤差率為2.45%。
:::
#### C.7 A Small Convolutional Network: LeNet-1
:::info
Convolutional Networks are an attempt to solve the dilemma between small networks that cannot learn the training set, and large networks that seem over-parameterized. LeNet-1 was an early embodiment of the Convolutional Network architecture which is included here for comparison purposes. The images were down-sampled to 16x16 pixels and centered in the 28x28 input layer. Although about 100,000 multiply/add steps are required to evaluate LeNet-1, its convolutional nature keeps the number of free parameters to only about 2600. The LeNet-1 architecture was developed using our own version of the USPS (US Postal Service zip codes) database and its size was tuned to match the available data \[35\]. LeNet-1 achieved 1.7% test error. The fact that a network with such a small number of parameters can attain such a good error rate is an indication that the architecture is appropriate for the task.
:::
:::success
卷積網路試圖解決一個兩難:小型網路無法學好訓練集,而大型網路又似乎過度參數化。LeNet-1是卷積網路架構的早期具體實現,出於比較目的,這裡也將它包含進來。影像被降採樣(大陸用語)為16x16像素,並在28x28的輸入層中置中。儘管評估LeNet-1需要大約100,000次乘法/加法步驟,但它的卷積性質將可用參數的數量保持在大約只有2600。LeNet-1架構是使用我們自己版本的USPS(US Postal Service zip codes)資料庫所開發,其大小已調整至與可用資料相匹配\[35\]。LeNet-1的測試誤差來到1.7%。具有如此少量參數的網路能夠達到如此好的誤差率,這一事實表明該架構適合這項任務。
:::
#### C.8 LeNet-4
:::info
Experiments with LeNet-1 made it clear that a larger convolutional network was needed to make optimal use of the large size of the training set. LeNet-4 and later LeNet-5 were designed to address this problem. LeNet-4 is very similar to LeNet-5, except for the details of the architecture. It contains 4 first-level feature maps, followed by 8 subsampling maps connected in pairs to each first-layer feature maps, then 16 feature maps, followed by 16 subsampling map, followed by a fully connected layer with 120 units, followed by the output layer (10 units). LeNet-4 contains about 260,000 connections and has about 10,000 free parameters. Test error was 1.1%. In a series of experiments, we replaced the last layer of LeNet-4 with a Euclidean Nearest Neighbor classifier, and with the "local learning" method of Bottou and Vapnik \[58\], in which a local linear classifier is retrained each time a new test pattern is shown. Neither of those methods improved the raw error rate, although they did improve the rejection performance.
:::
:::success
LeNet-1的實驗清楚表明,需要一個更大的卷積網路才能充分利用大型訓練集。LeNet-4與後來的LeNet-5就是為了解決這個問題而設計的。LeNet-4與LeNet-5除了架構細節外非常相似。它包含4個第一級feature map,接著是成對連接到各第一層feature map的8個subsampling map,接著是16個feature map,再接著16個subsampling map,然後是帶120個單元的全連接層,最後是輸出層(10個單元)。LeNet-4包含大約260,000個連接,並且約有10,000個可用參數。測試誤差為1.1%。在一系列的實驗中,我們用Euclidean Nearest Neighbor分類器,以及Bottou與Vapnik的"局部學習"方法\[58\](其中每次出現新的測試圖像時都會重新訓練一個局部線性分類器),來替換LeNet-4的最後一層。雖然這兩種方法都改善了拒絕效能,但都沒有降低原始誤差率。
:::
#### C.9 Boosted LeNet-4
:::info
Following theoretical work by R. Schapire \[59\], Drucker et al. \[60\] developed the "boosting" method for combining multiple classifiers. Three LeNet-4s are combined: the first one is trained the usual way. The second one is trained on patterns that are filtered by the first net so that the second machine sees a mix of patterns, 50% of which the first net got right, and 50% of which it got wrong. Finally, the third net is trained on new patterns on which the first and the second nets disagree. During testing, the outputs of the three nets are simply added. Because the error rate of LeNet-4 is very low, it was necessary to use the artificially distorted images (as with LeNet-5) in order to get enough samples to train the second and third nets. The test error rate was 0.7%, the best of any of our classifiers. At first glance, boosting appears to be three times more expensive as a single net. In fact, when the first net produces a high confidence answer, the other nets are not called. The average computational cost is about 1.75 times that of a single net.
:::
:::success
根據R. Schapire\[59\]的理論工作,Drucker等人\[60\]開發了結合多個分類器的"[推升法](http://terms.naer.edu.tw/detail/3646585/)"。三個LeNet-4結合在一起:第一個以一般方法訓練;第二個以經過第一個網路過濾的圖像訓練,使得第二個機器看到混合的圖像,其中50%是第一個網路答對的,50%是答錯的;最後,第三個網路以第一與第二個網路意見不一致的新圖像訓練。測試過程中,只需要將三個網路的輸出相加即可。因為LeNet-4的誤差率非常低,因此需要使用人工[失真](http://terms.naer.edu.tw/detail/2360643/)(變形)的影像(如同LeNet-5),以便獲得足夠的樣本來訓練第二、第三個網路。測試誤差率為0.7%,是我們所有分類器中最好的。乍看之下,[推升法](http://terms.naer.edu.tw/detail/3646585/)的成本似乎是單一網路的三倍。事實上,當第一個網路產生高置信度的答案時,並不會呼叫其它網路。平均計算成本約為單一網路的1.75倍。
:::
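上文提到平均計算成本約為單一網路的1.75倍,原因是第一個網路信心足夠時就不會呼叫其餘兩個網路;以下是這個推論流程的簡單示意(`confidence`的定義與`threshold`皆為示意用的假設)。

```python
import numpy as np

def boosted_predict(nets, x, threshold=0.9):
    """三個 LeNet-4 的推論:第一個網路信心高就直接採用,否則把三者輸出相加。"""
    out1 = nets[0](x)                                # 每個 net(x) 回傳 10 維分數向量
    confidence = np.max(out1) - np.sort(out1)[-2]    # 以前兩名分數差作為信心(假設)
    if confidence > threshold:
        return int(np.argmax(out1))                  # 高信心:不必呼叫其餘網路
    total = out1 + nets[1](x) + nets[2](x)           # 否則三個網路的輸出簡單相加
    return int(np.argmax(total))
```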
#### C.10 Tangent Distance Classifier (TDC)
:::info
The Tangent Distance classifier (TDC) is a nearest-neighbor method where the distance function is made insensitive to small distortions and translations of the input image \[61\]. If we consider an image as a point in a high dimensional pixel space (where the dimensionality equals the number of pixels), then an evolving distortion of a character traces out a curve in pixel space. Taken together, all these distortions define a low-dimensional manifold in pixel space. For small distortions, in the vicinity of the original image, this manifold can be approximated by a plane, known as the tangent plane. An excellent measure of "closeness" for character images is the distance between their tangent planes, where the set of distortions used to generate the planes includes translations, scaling, skewing, squeezing, rotation, and line thickness variations. A test error rate of 1.1% was achieved using 16x16 pixel images. Prefiltering techniques using simple Euclidean distance at multiple resolutions allowed to reduce the number of necessary Tangent Distance calculations.
:::
:::success
切線距離分類器(TDC)是一種最近鄰方法,其距離函數對輸入影像的小幅失真與平移不敏感\[61\]。如果我們將影像視為高維像素空間中的一個點(其維度等於像素數量),那麼一個字符連續演變的[失真](http://terms.naer.edu.tw/detail/2360643/)(變形)會在像素空間中描繪出一條曲線。綜合來看,所有這些[失真](http://terms.naer.edu.tw/detail/2360643/)(變形)在像素空間中定義出一個低維流形。對較小的[失真](http://terms.naer.edu.tw/detail/2360643/)(變形)而言,在原始影像附近,這個流形可以用一個平面來近似,稱為切平面。衡量字符影像之間"[接近程度](http://terms.naer.edu.tw/detail/3647068/)"的一個極佳度量,是它們切平面之間的距離,其中用於生成切平面的[失真](http://terms.naer.edu.tw/detail/2360643/)(變形)集合包含平移、縮放、傾斜、擠壓、旋轉以及線條粗細變化。使用16x16像素影像達到1.1%的測試誤差率。在多種解析度下使用簡單歐氏距離的預過濾技術,可以減少必要的切線距離計算次數。
:::
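單邊切線距離的計算可以化為一個最小平方問題:在原型p的切平面{p + T a}中找最接近測試影像x的點;以下為示意,`T`的每一欄假設是由平移、縮放、旋轉等變形事先算好的切線向量(論文中使用的是雙邊版本,此處僅示意概念)。

```python
import numpy as np

def tangent_distance(x, p, T):
    """單邊切線距離:min_a ||x - (p + T a)||,以最小平方求 a(示意)。"""
    a, *_ = np.linalg.lstsq(T, x - p, rcond=None)
    return float(np.linalg.norm(x - p - T @ a))
```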
#### C.11 Support Vector Machine (SVM)
:::info
Polynomial classifiers are well-studied methods for generating complex decision surfaces. Unfortunately, they are impractical for high-dimensional problems, because the number of product terms is prohibitive. The Support Vector technique is an extremely economical way of representing complex surfaces in high-dimensional spaces, including polynomials and many other types of surfaces \[6\].
A particularly interesting subset of decision surfaces is the ones that correspond to hyperplanes that are at a maximum distance from the convex hulls of the two classes in the high-dimensional space of the product terms. Boser, Guyon, and Vapnik \[62\] realized that any polynomial of degree $k$ in this "maximum margin" set can be computed by first computing the dot product of the input image with a subset of the training samples (called the "support vectors"), elevating the result to the $k$-th power, and linearly combining the numbers thereby obtained. Finding the support vectors and the coefficients amounts to solving a high-dimensional quadratic minimization problem with linear inequality constraints. For the sake of comparison, we include here the results obtained by Burges and Schölkopf reported in \[63\]. With a regular SVM, their error rate on the regular test set was 1.4%. Cortes and Vapnik had reported an error rate of 1.1% with SVM on the same data using a slightly different technique. The computational cost of this technique is very high: about 14 million multiply-adds per recognition. Using Schölkopf's Virtual Support Vectors technique (V-SVM), 1.0% error was attained. More recently, Schölkopf (personal communication) has reached 0.8% using a modified version of the V-SVM. Unfortunately, V-SVM is extremely expensive: about twice as much as regular SVM. To alleviate this problem, Burges has proposed the Reduced Set Support Vector technique (RS-SVM), which attained 1.1% on the regular test set \[63\], with a computational cost of only 650,000 multiply-adds per recognition, i.e. only about 60% more expensive than LeNet-5.
:::
:::success
多項式分類器用於生成複雜決策面是經過充份研究的方法。不幸的是,它們對高維度的問題是不切實際的,因為產生的乘積項目太多了。支援向量技術是一種在高維空間中表示複雜[曲面](http://terms.naer.edu.tw/detail/2125824/)的非常經濟的方法,包含多項式以及許多其它類型的[曲面](http://terms.naer.edu.tw/detail/2125824/)\[6\]。
[決策面](http://terms.naer.edu.tw/detail/2357266/)當中一個特別有趣的子集,是在乘積項的高維空間中與兩個類別的凸包距離最大的超平面。Boser、Guyon與Vapnik\[62\]發現,這個"最大邊界"集合中任何$k$次多項式,都可以藉由以下方式計算:先計算輸入影像與訓練樣本的一個子集(稱為"支援向量")的點積,將結果提升至$k$次方,再將由此獲得的數字做線性組合。尋找支援向量與係數相當於求解一個帶[線性不等式](http://terms.naer.edu.tw/detail/2119001/)約束的高維[二次](http://terms.naer.edu.tw/detail/3497942/)最小化問題。為了比較,我們這裡納入Burges與Schölkopf在\[63\]中報告的結果。使用常規的SVM時,他們在常規測試集上的誤差率為1.4%。Cortes與Vapnik曾報告,使用稍微不同的技術在相同資料上以SVM達到1.1%的誤差率。這個技術的計算成本非常高:每次辨識大約需要1400萬次乘加運算。使用Schölkopf的虛擬支援向量技術(V-SVM),可獲得1.0%的誤差率。最近,Schölkopf(個人通訊)使用V-SVM的修改版本,已經達到0.8%的誤差率。不幸的是,V-SVM非常昂貴:大約是常規SVM的兩倍。為了緩解這個問題,Burges提出Reduced Set Support Vector技術(RS-SVM),在常規測試集上達到1.1%的誤差率\[63\],每次辨識的計算成本只需要650,000次乘加運算,也就是只比LeNet-5貴了約60%。
:::
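上述以支援向量表示k次多項式決策面的作法,可用下列程式示意:先算輸入影像與各支援向量的點積,提升到k次方,再做線性組合;`alphas`、`b`等皆假設已由二次規劃求得,並非論文中的實際數值。

```python
import numpy as np

def polynomial_svm_score(x, support_vectors, alphas, b, degree=4):
    """f(x) = sum_i alpha_i * (x . s_i)^k + b(二元分類決策函數的示意)。"""
    dots = support_vectors @ x            # 與每個支援向量的點積
    return float(alphas @ (dots ** degree) + b)

# 多類別時常見作法(示意):每個類別各有一個決策函數 f_c,取 argmax_c f_c(x)
```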
:::info

Fig. 10. Rejection Performance: percentage of test patterns that must be rejected to achieve 0.5% error for some of the systems.
圖10. 拒絕效能:對部份系統而言,為達到0.5%誤差率所必須拒絕的測試圖像百分比。
:::
:::info

Fig. 11. Number of multiply-accumulate operations for the recognition of a single character starting with a size-normalized image.
圖11. 從大小[標準化](http://terms.naer.edu.tw/detail/5425344/)影像辨別單個字符的乘法累加執行次數
:::
### D. Discussion
:::info
A summary of the performance of the classifiers is shown in Figures 9 to 12. Figure 9 shows the raw error rate of the classifiers on the 10,000 example test set. Boosted LeNet-4 performed best, achieving a score of 0.7%, closely followed by LeNet-5 at 0.8%.
Figure 10 shows the number of patterns in the test set that must be rejected to attain a 0.5% error for some of the methods. Patterns are rejected when the value of corresponding output is smaller than a predefined threshold. In many applications, rejection performance is more significant than raw error rate. The score used to decide upon the rejection of a pattern was the difference between the scores of the top two classes. Again, Boosted LeNet-4 has the best performance. The enhanced versions of LeNet-4 did better than the original LeNet-4, even though the raw accuracies were identical.
:::
:::success
分類器的效能總結如圖9至圖12所示。圖9說明分類器在10,000測試集樣本的原始誤差率。Boosted LeNet-4的效能最好,得分為0.7%,緊接在後的是LeNet-5,得分為0.8%。
圖10顯示,對某些方法而言,測試集中必須拒絕多少圖像才能達到0.5%的誤差率。當相對應的輸出值小於預先定義的閥值時,圖像會被拒絕。在許多應用中,拒絕效能比原始誤差率更為重要。用來決定是否拒絕一張圖像的分數,是得分最高的兩個類別之間的分數差。再一次,Boosted LeNet-4有著最好的效能。即使原始精度相同,增強版的LeNet-4表現仍比原始的LeNet-4更好。
:::
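上文所述的拒絕準則(以得分最高兩個類別的分數差與預先定義的閥值比較)可以寫成如下的小函數示意:

```python
import numpy as np

def should_reject(scores, threshold):
    """得分最高兩個類別的分數差小於閥值時拒絕該圖像(示意)。"""
    top2 = np.sort(scores)[-2:]          # 由小到大,取最大的兩個
    return (top2[1] - top2[0]) < threshold
```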
:::info

Fig. 12. Memory requirements measured in number of variables, for each of the methods. Most of the methods only require one byte per variable for adequate performance.
圖12. 以變數量衡量每一種方法的記憶體需求。多數的方法每一個變數只需要1byte就有足夠的效能。
:::
:::info
Figure 11 shows the number of multiply-accumulate operations necessary for the recognition of a single size-normalized image for each method. Expectedly, neural networks are much less demanding than memory-based methods. Convolutional Neural Networks are particularly well suited to hardware implementations because of their regular structure and their low memory requirements for the weights. Single chip mixed analog digital implementations of LeNet-5's predecessors have been shown to operate at speeds in excess of 1000 characters per second \[64\]. However, the rapid progress of mainstream computer technology renders those exotic technologies quickly obsolete. Cost-effective implementations of memory-based techniques are more elusive, due to their enormous memory requirements, and computational requirements.
Training time was also measured. K-nearest neighbors and TDC have essentially zero training time. While the single-layer net, the pairwise net, and PCA+quadratic net could be trained in less than an hour, the multi-layer net training times were expectedly much longer, but only required 10 to 20 passes through the training set. This amounts to 2 to 3 days of CPU to train LeNet-5 on a Silicon Graphics Origin 2000 server, using a single 200MHz R10000 processor. It is important to note that while the training time is somewhat relevant to the designer, it is of little interest to the final user of the system. Given the choice between an existing technique, and a new technique that brings marginal accuracy improvements at the price of considerable training time, any final user would choose the latter.
Figure 12 shows the memory requirements, and therefore the number of free parameters, of the various classifiers measured in terms of the number of variables that need to be stored. Most methods require only about one byte per variable for adequate performance. However, Nearest-Neighbor methods may get by with 4 bits per pixel for storing the template images. Not surprisingly, neural networks require much less memory than memory-based methods.
The overall performance depends on many factors including accuracy, running time, and memory requirements. As computer technology improves, larger-capacity recognizers become feasible. Larger recognizers in turn require larger training sets. LeNet-1 was appropriate to the available technology in 1989, just as LeNet-5 is appropriate now. In 1989 a recognizer as complex as LeNet-5 would have required several weeks' training, and more data than was available, and was therefore not even considered. For quite a long time, LeNet-1 was considered the state of the art. The local learning classifier, the optimal margin classifier, and the tangent distance classifier were developed to improve upon LeNet-1 - and they succeeded at that. However, they in turn motivated a search for improved neural network architectures. This search was guided in part by estimates of the capacity of various learning machines, derived from measurements of the training and test error as a function of the number of training examples. We discovered that more capacity was needed. Through a series of experiments in architecture, combined with an analysis of the characteristics of recognition errors, LeNet-4 and LeNet-5 were crafted.
We find that boosting gives a substantial improvement in accuracy, with a relatively modest penalty in memory and computing expense. Also, distortion models can be used to increase the effective size of a data set without actually requiring to collect more data.
The Support Vector Machine has excellent accuracy, which is most remarkable, because unlike the other high performance classifiers, it does not include a priori knowledge about the problem. In fact, this classifier would do just as well if the image pixels were permuted with a fixed mapping and lost their pictorial structure. However, reaching levels of performance comparable to the Convolutional Neural Networks can only be done at considerable expense in memory and computational requirements. The reduced-set SVM requirements are within a factor of two of the Convolutional Networks, and the error rate is very close. Improvements of those results are expected, as the technique is relatively new.
When plenty of data is available, many methods can attain respectable accuracy. The neural-net methods run much faster and require much less space than memory-based techniques. The neural nets' advantage will become more striking as training databases continue to increase in size.
:::
:::success
圖11說明每一種方法辨識單張大小[標準化](http://terms.naer.edu.tw/detail/5425344/)影像所需要的乘加運算次數。如預期的,神經網路的需求遠低於基於記憶體的方法。卷積神經網路因為其規則的結構以及權重的低記憶體需求,特別適合硬體實現。LeNet-5前代架構的[單晶片](http://terms.naer.edu.tw/detail/5433797/)混合[類比數位](http://terms.naer.edu.tw/detail/2341250/)實現,已被證明可以以每秒超過1000個字符的速度執行\[64\]。然而,主流電腦技術的快速發展讓這些特殊技術很快就被淘汰。基於記憶體的技術由於巨大的記憶體與計算需求,較難有具成本效益的實現。
我們也測量了訓練時間。KNN與TDC的訓練時間實質上為零。儘管單層網路、成對網路以及PCA加二次網路可以在不到一小時內訓練完成,多層網路的訓練時間如預期長得多,但只需要通過訓練集10到20次。在Silicon Graphics Origin 2000 server上,使用單個200MHz R10000處理器,訓練LeNet-5需要2至3天的CPU時間。要注意的是,儘管訓練時間與設計人員有些相關,但對系統的最終用戶來說幾乎無關緊要。在現有技術與以可觀的訓練時間為代價換取些微精度提升的新技術之間做選擇時,任何最終用戶都會選擇後者。
圖12說明就需要被保存的變數數量來量測各種分類器的記憶體需求以及由此得出的可用參數的數量。為了足夠的效能,多數的方法每一個變數只需要約1byte。然而,Nearest-Neighbor方法可能用每個像素4bit來保存模板影像。不意外的,神經網路需要的記憶體比基於記憶體的方法還要來的少。
整體效能取決於許多因素,包含準確性、執行時間以及記憶體需求。隨著電腦技術的進步,更大容量的識別器變得可行,而更大的識別器又需要更大的訓練集。LeNet-1適用於1989年的可用技術,正如LeNet-5適用於現在。在1989年,一個像LeNet-5這樣複雜的識別器需要數週的訓練,以及比當時可取得的更多的資料,因此甚至沒有被考慮。在相當長的一段時間裡,LeNet-1被認為是最先進的技術。局部學習分類器、最佳邊界分類器以及切線距離分類器,都是為了改進LeNet-1而開發的,而且也確實成功了。然而,它們反過來又促使人們尋找改進的神經網路架構。這項探索的部份依據是對各種學習機器容量的估計,而這些估計是由訓練誤差與測試誤差作為訓練樣本數量的函數的量測結果推導而來。我們發現需要更多的容量。通過一系列的架構實驗,結合對辨識錯誤特徵的分析,設計出LeNet-4與LeNet-5。
我們發現[推升法](http://terms.naer.edu.tw/detail/3646585/)能大幅提升準確性,而記憶體與計算開銷的代價相對較小。同樣的,[失真](http://terms.naer.edu.tw/detail/2360643/)(變形)模型可以用來增加資料集的有效大小,而不需要真的收集更多資料。
支援向量機有著極佳的準確性,這非常出色,因為與其它高效能分類器不同,它並不包含關於該問題的先驗知識。事實上,即使影像像素以某個固定的映射重新排列而失去其圖像結構,這個分類器也能做得一樣好。但是,要達到與卷積神經網路相當的效能水平,必須在記憶體與計算需求方面付出可觀的代價。[縮減集](http://terms.naer.edu.tw/detail/2123333/)SVM的需求與卷積網路相差不到兩倍,而且誤差率非常接近。由於這項技術相對較新,這些結果有望進一步改善。
當有大量資料可用時,許多方法都可以達到可觀的準確性。相較於基於記憶體的技術,神經網路方法執行得更快,需要的空間也更少。隨著訓練資料庫規模持續擴大,神經網路的優勢將更加顯著。
:::
### E. Invariance and Noise Resistance
:::info
Convolutional networks are particularly well suited for recognizing or rejecting shapes with widely varying size, position, and orientation, such as the ones typically produced by heuristic segmenters in real-world string recognition systems.
In an experiment like the one described above, the importance of noise resistance and distortion invariance is not obvious. The situation in most real applications is quite different. Characters must generally be segmented out of their context prior to recognition. Segmentation algorithms are rarely perfect and often leave extraneous marks in character images (noise, underlines, neighboring characters), or sometimes cut characters too much and produce incomplete characters. Those images cannot be reliably size-normalized and centered. Normalizing incomplete characters can be very dangerous. For example, an enlarged stray mark can look like a genuine 1. Therefore many systems have resorted to normalizing the images at the level of fields or words. In our case, the upper and lower profiles of entire fields (amounts in a check) are detected and used to normalize the image to a fixed height. While this guarantees that stray marks will not be blown up into character looking images, this also creates wide variations of the size and vertical position of characters after segmentation. Therefore it is preferable to use a recognizer that is robust to such variations. Figure 13 shows several examples of distorted characters that are correctly recognized by LeNet-5. It is estimated that accurate recognition occurs for scale variations up to about a factor of 2, vertical shift variations of plus or minus about half the height of the character, and rotations up to plus or minus 30 degrees. While fully invariant recognition of complex shapes is still an elusive goal, it seems that Convolutional Networks offer a partial answer to the problem of invariance or robustness with respect to geometrical distortions.
Figure 13 includes examples of the robustness of LeNet-5 under extremely noisy conditions. Processing those images would pose unsurmountable problems of segmentation and feature extraction to many methods, but LeNet-5 seems able to robustly extract salient features from these cluttered images. The training set used for the network shown here was the MNIST training set with salt and pepper noise added. Each pixel was randomly inverted with probability 0.1. More examples of LeNet-5 in action are available on the Internet at http://www.research.att.com/~yann/ocr.
:::
:::success
卷積網路特別適合用於辨識或拒絕大小、位置、方向變化很大的形狀,像是在真實世界的字串辨識系統中,透過啟發式分段器典型生成的形狀。
在類似上述的實驗中,抗噪性與[失真](http://terms.naer.edu.tw/detail/2360643/)(變形)不變性的重要性並不明顯,但多數實際應用中的情況則截然不同。字符通常必須在辨識之前從其上下文中切割出來。切割演算法很少是完美的,常常會在字符影像中留下多餘的標記(噪點、底線、相鄰字符),或者有時切得太多而產生不完整的字符。這些影像無法被可靠地做大小[標準化](http://terms.naer.edu.tw/detail/5425344/)與置中。將不完整的字符[標準化](http://terms.naer.edu.tw/detail/5425344/)可能非常危險,舉例來說,一個被放大的雜散標記可能看起來像真正的1。因此,許多系統改為在字段或單詞的層級上對影像進行[標準化](http://terms.naer.edu.tw/detail/5425344/)。在我們的案例中,會偵測整個字段(支票中的金額)的上下輪廓,並用來將影像[標準化](http://terms.naer.edu.tw/detail/5425344/)為固定高度。雖然這保證了雜散標記不會被放大成看起來像字符的影像,但也使得切割之後字符的大小與垂直位置有很大的變化。因此,最好使用對這類變化具有魯棒性的識別器。圖13展示幾個被LeNet-5正確辨識的變形字符範例。據估計,比例變化高達約兩倍、垂直平移變化約正負字符高度的一半,以及旋轉角度在正負30度以內時,都可以準確地辨識。儘管對複雜形狀的完全不變辨識仍然是一個遙不可及的目標,但卷積網路似乎對幾何變形的不變性或魯棒性問題提供了部份答案。
圖13也包含LeNet-5在極度雜訊條件下展現魯棒性的範例。處理這些影像會給許多方法帶來無法克服的分割與特徵提取問題,但LeNet-5似乎能夠從這些雜亂的影像中穩健地提取出顯著的特徵。這裡所示網路使用的訓練集,是加入椒鹽雜訊的MNIST訓練集,每個像素以0.1的機率隨機反轉。關於LeNet-5運作的更多範例,請參考連結http://www.research.att.com/~yann/ocr。
:::
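上述「每個像素以0.1的機率隨機反轉」的椒鹽雜訊可以用幾行NumPy產生,以下為示意(假設影像為0/1的二值陣列):

```python
import numpy as np

def add_salt_and_pepper(img, p=0.1, rng=np.random.default_rng()):
    """以機率 p 隨機反轉每個像素(0 變 1、1 變 0),模擬文中所述的雜訊訓練集。"""
    flip = rng.random(img.shape) < p
    return np.where(flip, 1 - img, img)
```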
:::info

Fig. 13. Examples of unusual distorted and noisy characters correctly recognized by LeNet-5. The grey-level of the output label represents the penalty (lighter for higher penalties).
圖13. LeNet-5正確辨識的異常變形與帶雜訊字符範例。輸出標籤的灰階代表懲罰值(懲罰越高顏色越淺)。
:::
## IV. Multi-Module Systems and Graph Transformer Networks
:::info
The classical backpropagation algorithm, as described and used in the previous sections, is a simple form of Gradient-Based Learning. However, it is clear that the gradient back-propagation algorithm given by Equation 4 describes a more general situation than simple multi-layer feed-forward networks composed of alternated linear transformations and sigmoidal functions. In principle, derivatives can be back-propagated through any arrangement of functional modules, as long as we can compute the product of the Jacobians of those modules by any vector. Why would we want to train systems composed of multiple heterogeneous modules? The answer is that large and complex trainable systems need to be built out of simple, specialized modules. The simplest example is LeNet-5, which mixes convolutional layers, sub-sampling layers, fully-connected layers, and RBF layers. Another less trivial example, described in the next two sections, is a system for recognizing words, that can be trained to simultaneously segment and recognize words, without ever being given the correct segmentation
Figure 14 shows an example of a trainable multi-modular system. A multi-module system is defined by the function implemented by each of the modules, and by the graph of interconnection of the modules to each other. The graph implicitly defines a partial order according to which the modules must be updated in the forward pass. For example in Figure 14, module 0 is first updated, then modules 1 and 2 are updated (possibly in parallel), and finally module 3. Modules may or may not have trainable parameters. Loss functions, which measure the performance of the system, are implemented as module 4. In the simplest case, the loss function module receives an external input that carries the desired output. In this framework, there is no qualitative difference between trainable parameters (W1,W2 in the figure), external inputs and outputs (Z, D, E) and intermediate state variables(X1, X2, X3, X4, X5).
:::
:::success
如前幾節所描述與使用的,經典的反向傳播演算法是基於梯度學習的一種簡單形式。但是,很明顯地,公式4給出的梯度反向傳播演算法所描述的情況,比由線性轉換與sigmoid函數交替組成的簡單多層前饋網路更為一般。原則上,只要我們能夠計算這些模組的Jacobian與任意向量的乘積,導數就可以通過任何功能模組的排列進行反向傳播。為什麼我們要訓練由多個異質模組組成的系統?答案是,大型且複雜的可訓練系統需要由簡單、專門的模組建構而成。最簡單的範例就是LeNet-5,它混合了卷積層、子採樣層、全連接層以及RBF層。接下來兩節描述另一個較不平凡的範例:一個辨識單詞的系統,它可以被訓練來同時切割與辨識單詞,而不需要給出正確的切割。
圖14展示一個可訓練的多模組系統範例。一個多模組系統是由每一個模組所實現的功能,以及模組彼此之間互連的圖來定義。這個圖隱式地定義了一個偏序,模組在前向傳播時必須依此順序更新。舉例來說,圖14中,首先更新模組0,然後更新模組1與2(可能並行),最後更新模組3。模組可能有,也可能沒有可訓練參數。衡量系統效能的損失函數以模組4實現。在最簡單的情況下,損失函數模組會接收帶有期望輸出的外部輸入。在這個框架中,可訓練參數(圖中的W1、W2)、外部輸入與輸出(Z、D、E)以及中間狀態變數(X1、X2、X3、X4、X5)之間沒有本質上的差異。
:::
:::info

Fig. 14. A trainable system composed of heterogeneous modules.
圖14 一個由異質模組組成的可訓練系統。
:::
### A. An Object-Oriented Approach
:::info
Object-Oriented programming offers a particularly convenient way of implementing multi-module systems. Each module is an instance of a class. Module classes have a "forward propagation" method (or member function) called fprop whose arguments are the inputs and outputs of the module. For example, computing the output of module 3 in Figure 14 can be done by calling the method fprop on module 3 with the arguments X3, X4, X5. Complex modules can be constructed from simpler modules by simply defining a new class whose slots will contain the member modules and the intermediate state variables between those modules. The fprop method for the class simply calls the fprop methods of the member modules, with the appropriate intermediate state variables or external input and outputs as arguments. Although the algorithms are easily generalizable to any network of such modules, including those whose influence graph has cycles, we will limit the discussion to the case of directed acyclic graphs (feed-forward networks).
Computing derivatives in a multi-module system is just as simple. A "backward propagation" method, called bprop, for each module class can be defined for that purpose. The bprop method of a module takes the same arguments as the fprop method. All the derivatives in the system can be computed by calling the bprop method on all the modules in reverse order compared to the forward propagation phase. The state variables are assumed to contain slots for storing the gradients computed during the backward pass, in addition to storage for the states computed in the forward pass. The backward pass effectively computes the partial derivatives of the loss $E$ with respect to all the state variables and all the parameters in the system. There is an interesting duality property between the forward and backward functions of certain modules. For example, a sum of several variables in the forward direction is transformed into a simple fan-out (replication) in the backward direction. Conversely, a fan-out in the forward direction is transformed into a sum in the backward direction. The software environment used to obtain the results described in this paper, called SN3.1, uses the above concepts. It is based on a homegrown object oriented dialect of Lisp with a compiler to $C$.
The fact that derivatives can be computed by propagation in the reverse graph is easy to understand intuitively. The best way to justify it theoretically is through the use of Lagrange functions \[21\], \[22\]. The same formalism can be used to extend the procedures to networks with recurrent connections.
:::
:::success
物件導向程式設計提供一種特別方便的方式來實作多模組系統。每一個模組都是一個類別的實例。模組類別擁有稱為fprop的"前向傳播"方法(或[成員函數](http://terms.naer.edu.tw/detail/2392599/)),其參數為模組的輸入與輸出。舉例來說,圖14中,計算模組3的輸出,可以用參數X3、X4、X5呼叫模組3的fprop方法來完成。複雜的模組可以由較簡單的模組建構:只要定義一個新的類別,其slots包含成員模組以及這些模組之間的中間狀態變數即可。該類別的fprop方法只是以適當的中間狀態變數或外部輸入、輸出做為參數,呼叫各成員模組的fprop方法。儘管這些演算法很容易推廣到由這類模組構成的任何網路,包含影響圖帶有循環的網路,但我們將討論限制在[有向非循環圖](http://terms.naer.edu.tw/detail/2359779/)(前饋網路)的情況。
在多模組系統中計算導數也同樣簡單。為此,可以為每一個模組類別定義稱為bprop的"反向傳播"方法。模組的bprop方法帶有與fprop方法相同的參數。只要以與前向傳播階段相反的順序,對所有模組呼叫bprop方法,就可以計算出系統中的所有導數。假設狀態變數除了保存前向傳播中計算出的狀態外,還有欄位用來保存反向傳播期間計算出的梯度。反向傳遞有效地計算出損失$E$相對於系統中所有狀態變數與所有參數的偏導數。某些模組的前向與反向函數之間有一個有趣的對偶性質。舉例來說,[正向](http://terms.naer.edu.tw/detail/3475427/)的幾個變數加總,在[反向](http://terms.naer.edu.tw/detail/5902650/)就轉換為簡單的[扇出](http://terms.naer.edu.tw/detail/2453764/)(複製);相反地,[正向](http://terms.naer.edu.tw/detail/3475427/)的[扇出](http://terms.naer.edu.tw/detail/2453764/)在[反向](http://terms.naer.edu.tw/detail/5902650/)則轉換為加總。本論文所述結果所使用的軟體環境稱為SN3.1,它使用了上述概念。它基於一個自行開發的Lisp物件導向[方言](http://terms.naer.edu.tw/detail/3154468/),並附有編譯至$C$的編譯器。
導數可以藉由在反向圖中傳播來計算,這一事實很容易直觀地理解。從理論上證明它的最好方法是使用Lagrange functions \[21\]、\[22\]。相同的形式化方法可以用來將這個程序擴展到具有遞迴連接的網路。
:::
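以下用Python類別示意上述的物件導向作法:每個模組提供fprop與bprop,複合模組只是依序呼叫成員模組的fprop,反向時以相反順序呼叫bprop。這僅是概念示意,並非SN3.1的實際介面;類別與變數名稱皆為假設。

```python
import numpy as np

class Linear:
    """y = W x。fprop 計算輸出;bprop 依輸出端梯度回傳輸入端梯度並累積 dW。"""
    def __init__(self, W):
        self.W, self.dW = W, np.zeros_like(W)
    def fprop(self, x):
        self.x = x
        return self.W @ x
    def bprop(self, dy):
        self.dW += np.outer(dy, self.x)   # 損失對參數的梯度
        return self.W.T @ dy              # 損失對輸入的梯度(乘上矩陣轉置)

class Sigmoid:
    def fprop(self, x):
        self.y = 1.0 / (1.0 + np.exp(-x))
        return self.y
    def bprop(self, dy):
        return dy * self.y * (1.0 - self.y)

class Sequential:
    """複合模組:fprop 依偏序(此處簡化為串列順序)呼叫成員,bprop 以相反順序呼叫。"""
    def __init__(self, modules):
        self.modules = modules
    def fprop(self, x):
        for m in self.modules:
            x = m.fprop(x)
        return x
    def bprop(self, dy):
        for m in reversed(self.modules):
            dy = m.bprop(dy)
        return dy
```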
### B. Special Modules
:::info
Neural networks and many other standard pattern recognition techniques can be formulated in terms of multi-modular systems trained with Gradient-Based Learning. Commonly used modules include matrix multiplications and sigmoidal modules, the combination of which can be used to build conventional neural networks. Other modules include convolutional layers, sub-sampling layers, RBF layers, and "softmax" layers \[65\]. Loss functions are also represented as modules whose single output produces the value of the loss. Commonly used modules have simple bprop methods. In general, the bprop method of a function $F$ is a multiplication by the Jacobian of $F$. Here are a few commonly used examples. The bprop method of a fanout (a "Y" connection) is a sum, and vice versa. The bprop method of a multiplication by a coefficient is a multiplication by the same coefficient. The bprop method of a multiplication by a matrix is a multiplication by the transpose of that matrix. The bprop method of an addition with a constant is the identity.
:::
:::success
神經網路以及許多其它標準圖像辨識技術,都可以表述為以基於梯度學習訓練的多模組系統。常用的模組包含矩陣乘法與sigmoid模組,兩者的組合可以用來建立傳統的神經網路。其它模組包含卷積層、子採樣層、RBF層以及"softmax"層\[65\]。損失函數也表示為模組,其單一輸出產生損失值。常用模組的bprop方法都很簡單。一般來說,函數$F$的bprop方法是乘上$F$的Jacobian。以下是幾個常用的例子。[扇出](http://terms.naer.edu.tw/detail/2453764/)(一個"Y"形連接)的bprop方法是加總,反之亦然。乘上一個係數的bprop方法是乘上同一個係數。乘上一個矩陣的bprop方法是乘上該矩陣的轉置。加上一個常數的bprop方法是恆等映射。
:::
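上述幾條bprop的對偶規則可以具體寫成幾個小函數示意(扇出與加總互為對偶、乘係數與加常數的情形):

```python
# 前向為「扇出」(把 x 複製成多份),其 bprop 為把各分支傳回的梯度相加
def fanout_fprop(x, n):
    return [x] * n
def fanout_bprop(grads):
    return sum(grads)

# 前向為「加總」,其 bprop 為把輸出端梯度複製(扇出)給每個輸入
def sum_fprop(xs):
    return sum(xs)
def sum_bprop(dy, n):
    return [dy] * n

# 乘上係數 c 的 bprop 是乘上同一個係數;加上常數的 bprop 是恆等映射
def scale_bprop(dy, c):
    return c * dy
def add_const_bprop(dy):
    return dy
```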
:::info

Fig. 15. Traditional neural networks, and multi-module systems communicate fixed-size vectors between layers. Multi-Layer Graph Transformer Networks are composed of trainable modules that operate on and produce graphs whose arcs carry numerical information.
圖15. 傳統神經網路與多模組系統在層與層之間傳遞固定大小的向量。Multi-Layer Graph Transformer Networks由可訓練模組組成,這些模組對弧帶有數值信息的圖進行運算,並產生這樣的圖。
:::
:::info
Interestingly, certain non-differentiable modules can be inserted in a multi-module system without adverse effect. An interesting example of that is the multiplexer module. It has two (or more) regular inputs, one switching input, and one output. The module selects one of its inputs, depending upon the (discrete) value of the switching input, and copies it on its output. While this module is not differentiable with respect to the switching input, it is differentiable with respect to the regular inputs. Therefore the overall function of a system that includes such modules will be differentiable with respect to its parameters as long as the switching input does not depend upon the parameters. For example, the switching input can be an external input.
Another interesting case is the min module. This module has two (or more) inputs and one output. The output of the module is the minimum of the inputs. The function of this module is differentiable everywhere, except on the switching surface which is a set of measure zero. Interestingly, this function is continuous and reasonably regular, and that is sufficient to ensure the convergence of a Gradient-Based Learning algorithm.
The object-oriented implementation of the multi-module idea can easily be extended to include a bbprop method that propagates Gauss-Newton approximations of the second derivatives. This leads to a direct generalization for modular systems of the second-derivative back-propagation Equation 22 given in the Appendix.
The multiplexer module is a special case of a much more general situation, described at length in Section VIII, where the architecture of the system changes dynamically with the input data. Multiplexer modules can be used to dynamically rewire (or reconfigure) the architecture of the system for each new input pattern.
:::
:::success
有趣的是,某些不可微模組可以被插入到多組模系統中,而不會有負面影響。一個有趣的例子是[多工器](http://terms.naer.edu.tw/detail/2396982/)模組。它有兩個常規輸入(或更多),一個開關輸入以及一個輸出。模組依據開關輸入的(離散)值選擇其中一個輸入,然後將它複製到輸出上。儘管這個模組對開關輸入是不可微的,但對常規輸入是可微的。因此,包含這樣的模組的系統整體功能就能對其參數微分,只要開關輸入不取決於參數即可。舉例來說,開關輸入可以是外部輸入。
另一個有趣的情況是min模組。這個模組擁有兩個(或多個)輸入以及一個輸出,輸出是各輸入的最小值。這個模組的函數除了在切換面(一個[零測度](http://terms.naer.edu.tw/detail/2119539/)集合)上之外處處可微。有趣的是,這個函數是連續且相當規則的,這足以確保基於梯度的學習演算法的收斂性。
多模組概念的物件導向實現可以很容易地擴展,加入一個bbprop方法,用來傳播二階導數的Gauss-Newton(高斯牛頓)近似值。這直接將附錄中給出的二階導數反向傳播公式22[推廣](http://terms.naer.edu.tw/detail/2116747/)到模組化系統。
[多工器](http://terms.naer.edu.tw/detail/2396982/)模組是一種更為一般的情況的特例,將在第八節詳細說明,其中系統架構會隨著輸入資料動態改變。多工器模組可以用來為每一個新的輸入圖像動態重新佈線(或重新配置)系統架構。
:::
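多工器與min模組的fprop/bprop可以示意如下:多工器對開關輸入不可微,bprop只把梯度送回被選到的那個常規輸入;min模組則把梯度送回取得最小值的那個輸入(在切換面以外處處可微)。類別與介面皆為示意用的假設。

```python
import numpy as np

class Multiplexer:
    """依(離散)開關輸入選擇一個常規輸入複製到輸出;對開關輸入不可微。"""
    def fprop(self, inputs, switch):
        self.switch, self.n = switch, len(inputs)
        return inputs[switch]
    def bprop(self, dy):
        grads = [np.zeros_like(dy) for _ in range(self.n)]
        grads[self.switch] = dy           # 只有被選到的常規輸入收到梯度
        return grads

class MinModule:
    """輸出為各輸入的最小值;除切換面(零測度集合)外處處可微。"""
    def fprop(self, inputs):
        self.idx = int(np.argmin(inputs))
        self.n = len(inputs)
        return inputs[self.idx]
    def bprop(self, dy):
        grads = [0.0] * self.n
        grads[self.idx] = dy              # 梯度只流向取得最小值的輸入
        return grads
```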
### C. Graph Transformer Networks
:::info
Multimodule systems are a very flexible tool for building large trainable systems. However, the descriptions in the previous sections implicitly assumed that the set of parameters, and the state information communicated between the modules, are all fixed-size vectors. The limited flexibility of fixed-size vectors for data representation is a serious deficiency for many applications, notably for tasks that deal with variable length inputs (e.g continuous speech recognition and handwritten word recognition), or for tasks that require encoding relationships between objects or features whose number and nature can vary (invariant perception, scene analysis, recognition of composite objects). An important special case is the recognition of strings of characters or words.
More generally, fixed-size vectors lack flexibility for tasks in which the state must encode probability distributions over sequences of vectors or symbols as is the case in linguistic processing. Such distributions over sequences are best represented by stochastic grammars, or, in the more general case, directed graphs in which each arc contains a vector (stochastic grammars are special cases in which the vector contains probabilities and symbolic information). Each path in the graph represents a different sequence of vectors. Distributions over sequences can be represented by interpreting elements of the data associated with each arc as parameters of a probability distribution or simply as a penalty. Distributions over sequences are particularly handy for modeling linguistic knowledge in speech or hand writing recognition systems: each sequence, i.e., each path in the graph, represents an alternative interpretation of the input. Successive processing modules progressively refine the interpretation. For example, a speech recognition system might start with a single sequence of acoustic vectors, transform it into a lattice of phonemes (distribution over phoneme sequences), then into a lattice of words (distribution over word sequences), and then into a single sequence of words representing the best interpretation.
:::
:::success
多模組系統是建置大型可訓練系統非常靈活的工具。然而,前面各節的說明隱含地假設參數集與模組之間傳遞的狀態信息都是固定大小的向量。固定大小向量在資料表示上的靈活性有限,對許多應用而言是嚴重的缺陷,特別是處理可變長度輸入的任務(例如連續語音辨識與手寫單詞辨識),或是需要對數量與性質可變的物件或特徵之間的關係進行編碼的任務(不變感知、場景分析、複合物件辨識)。一個重要的特殊情況是字符串或單詞串的辨識。
更一般地說,對於狀態必須對向量或符號序列上的機率分佈進行編碼的任務(例如[語言](http://terms.naer.edu.tw/detail/2387045/)處理),固定大小的向量缺乏靈活性。序列上的這種分佈最好以隨機文法表示,或者在更一般的情況下,以每個弧都帶有一個向量的[有向圖](http://terms.naer.edu.tw/detail/2114799/)表示(隨機文法是其特例,其中向量包含機率與符號信息)。圖中的每一條路徑代表一個不同的向量序列。藉由把與每個弧相關聯的資料元素解釋為機率分佈的參數,或簡單地解釋為懲罰,就可以表示序列上的分佈。序列上的分佈對於在語音或手寫辨識系統中為[語言](http://terms.naer.edu.tw/detail/2387045/)知識建模特別方便:每個序列,也就是圖中的每條路徑,代表對輸入的一種可能解釋。[逐次](http://terms.naer.edu.tw/detail/2125698/)的處理模組逐步精煉這個解釋。舉例來說,語音辨識系統可以從單一的聲學向量序列開始,將它轉換為[音素](http://terms.naer.edu.tw/detail/2407911/)網格(音素序列上的分佈),再轉換為單詞網格(單詞序列上的分佈),最後轉換為代表最佳解釋的單一單詞序列。
:::
:::info
In our work on building large-scale handwriting recognition systems, we have found that these systems could much more easily and quickly be developed and designed by viewing the system as networks of modules that take one or several graphs as input and produce graphs as output. Such modules are called Graph Transformers, and the complete systems are called Graph Transformer Networks, or GTN. Modules in a GTN communicate their states and gradients in the form of directed graphs whose arcs carry numerical information (scalars or vectors) \[66\].
From the statistical point of view, the fixed-size state vectors of conventional networks can be seen as representing the means of distributions in state space. In variable size networks such as the Space-Displacement Neural Networks described in section VII, the states are variable-length sequences of fixed-size vectors. They can be seen as representing the mean of a probability distribution over variable-length sequences of fixed-size vectors. In GTNs, the states are represented as graphs, which can be seen as representing mixtures of probability distributions over structured collections (possibly sequences) of vectors (Figure 15).
One of the main points of the next several sections is to show that Gradient-Based Learning procedures are not limited to networks of simple modules that communicate through fixed-size vectors, but can be generalized to GTNs. Gradient back-propagation through a Graph Transformer takes gradients with respect to the numerical information in the output graph, and computes gradients with respect to the numerical information attached to the input graphs, and with respect to the module's internal parameters. Gradient-Based Learning can be applied as long as differentiable functions are used to produce the numerical data in the output graph from the numerical data in the input graph and the functions parameters.
The second point of the next several sections is to show that the functions implemented by many of the modules used in typical document processing systems (and other image recognition systems), though commonly thought to be combinatorial in nature, are indeed differentiable with respect to their internal parameters as well as with respect to their inputs, and are therefore usable as part of a globally trainable system.
In most of the following, we will purposely avoid making references to probability theory. All the quantities manipulated are viewed as penalties, or costs, which if necessary can be transformed into probabilities by taking exponentials and normalizing.
:::
:::success
在我們建構[大型](http://terms.naer.edu.tw/detail/2384935/)手寫辨識系統的工作中,我們發現,把系統視為以一個或多個圖為輸入並產生圖為輸出的模組網路,可以更容易且快速地開發與設計這些系統。這樣的模組稱為Graph Transformers,完整的系統稱為Graph Transformer Networks,或GTN。GTN中的模組以[有向圖](http://terms.naer.edu.tw/detail/2114799/)的形式傳遞其狀態與梯度,圖中的弧帶有數值信息([純量](http://terms.naer.edu.tw/detail/2124076/)或向量)\[66\]。
從統計的觀點來看,常規網路的固定大小狀態向量可以視為狀態空間中分佈的平均值。在可變大小的網路中,如第七節所說明的Space-Displacement Neural Networks,狀態是固定大小向量的[可變長度](http://terms.naer.edu.tw/detail/2435422/)序列,可以視為固定大小向量的[可變長度](http://terms.naer.edu.tw/detail/2435422/)序列上機率分佈的平均值。在GTN中,狀態以圖來表示,可以視為向量的結構化集合(可能是序列)上機率分佈的混合(圖15)。
接下來幾節的重點之一,是說明基於梯度的學習程序不僅限於以固定大小向量溝通的簡單模組網路,還可以推廣到GTN。通過Graph Transformer的梯度反向傳播,取相對於輸出圖中數值信息的梯度,計算出相對於輸入圖所附數值信息以及模組內部參數的梯度。只要是以可微分的函數,從輸入圖中的數值資料與函數參數產生輸出圖中的數值資料,就可以應用基於梯度的學習。
接下來幾節要說明的第二點是,典型的[文件處理系統](http://terms.naer.edu.tw/detail/2361101/)(以及其它影像辨識系統)中許多模組所實現的功能,儘管普遍被認為本質上是組合式的,但它們相對於其內部參數以及其輸入確實是可微的,因此可以做為全域可訓練系統的一部份。
在接下來的大部份內容中,我們將刻意避免引用機率論。所有被操作的量都被視為懲罰或成本,必要時可以藉由取指數並正規化將其轉換為機率。
:::
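最後一段提到「必要時可以藉由取指數並正規化把懲罰轉成機率」,這相當於對負的懲罰做softmax,示意如下:

```python
import numpy as np

def penalties_to_probabilities(penalties):
    """p_i = exp(-penalty_i) / sum_j exp(-penalty_j);先平移避免數值溢位。"""
    z = -np.asarray(penalties, dtype=float)
    z -= z.max()
    e = np.exp(z)
    return e / e.sum()
```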