# Gradient-Based Learning Applied to Document Recognition_Paper(翻譯)(IX)
###### tags: `LeNet-5` `CNN` `論文翻譯` `deeplearning` `Gradient-Based Learning Applied to Document Recognition`
>[name=Shaoe.chen] [time=Mon, Jan 6, 2020 1:14 PM]
[TOC]
## 說明
區塊如下分類,原文區塊為藍底,翻譯區塊為綠底,部份專業用語翻譯參考國家教育研究院
:::info
原文
:::
:::success
翻譯
:::
:::warning
任何的翻譯不通暢部份都請留言指導
:::
:::danger
* [paper hyperlink](http://yann.lecun.com/exdb/publis/pdf/lecun-98.pdf)
* [\*paper hyperlink](http://vision.stanford.edu/cs598_spring07/papers/Lecun98.pdf)
:::
Page 34~37
## IX. An On-Line Handwriting Recognition System
:::info
Natural handwriting is often a mixture of different "styles": lower case printed, upper case, and cursive. A reliable recognizer for such handwriting would greatly improve interaction with pen-based devices, but its implementation presents new technical challenges. Characters taken in isolation can be very ambiguous, but considerable information is available from the context of the whole word. We have built a word recognition system for pen-based devices based on four main modules: a preprocessor that normalizes a word, or word group, by fitting a geometrical model to the word structure; a module that produces an "annotated image" from the normalized pen trajectory; a replicated convolutional neural network that spots and recognizes characters; and a GTN that interprets the network's output by taking word-level constraints into account. The network and the GTN are jointly trained to minimize an error measure defined at the word level.
In this work, we have compared a system based on SDNNs (such as described in Section VII), and a system based on Heuristic Over-Segmentation (such as described in Section V). Because of the sequential nature of the information in the pen trajectory (which reveals more information than the purely optical input from an image), Heuristic Over-Segmentation can be very efficient in proposing candidate character cuts, especially for non-cursive script.
:::
:::success
自然手寫通常是不同"樣式"的混合:小寫印刷體、大寫與草寫。可靠的手寫辨識器將會大大改善與基於筆的設備之間的互動,但是它的實作帶來新的技術挑戰。單獨提取的字符可能非常模棱兩可,但是從整個單詞的上下文中可以獲得大量信息。我們建立了一個基於筆的設備的單詞辨識系統,由四個主要模組構成:透過將幾何模型擬合到單詞結構來標準化單詞或單詞群組的預處理器;從標準化筆跡生成"[帶註影像](http://terms.naer.edu.tw/detail/6635163/)"的模組;可以發現並辨識字符的複製卷積神經網路;以及透過考慮單詞級約束來解釋網路輸出的GTN。網路與GTN一起訓練,以最小化在單詞級別定義的誤差量測。
在這項工作中,我們比較了基於SDNN的系統(如第七章節所述)與基於Heuristic Over-Segmentation的系統(如第五章節所述)。由於筆跡中信息的順序性質(比起影像中純光學的輸入透露出更多信息),Heuristic Over-Segmentation可以非常有效地提出候選的字符切分,特別是對於非草寫的文字。
:::
### A. Preprocessing
:::info
Input normalization reduces intra-character variability, simplifying character recognition. We have used a word normalization scheme \[92\] based on fitting a geometrical model of the word structure. Our model has four "flexible" lines representing respectively the ascenders line, the core line, the base line and the descenders line. The lines are fitted to local minima or maxima of the pen trajectory. The parameters of the lines are estimated with a modified version of the EM algorithm to maximize the joint probability of observed points and parameter values, using a prior on parameters that prevents the lines from collapsing on each other.
The recognition of handwritten characters from a pen trajectory on a digitizing surface is often done in the time domain \[110\], \[44\], \[111\]. Typically, trajectories are normalized, and local geometrical or dynamical features are extracted. The recognition may then be performed using curve matching \[110\], or other classification techniques such as TDNNs \[44\], \[111\]. While these representations have several advantages, their dependence on stroke ordering and individual writing styles makes them difficult to use in high accuracy, writer independent systems that integrate the segmentation with the recognition.
:::
:::success
輸入正規化減少字符內的[變化性](http://terms.naer.edu.tw/detail/6685669/),從而簡化字符辨識。我們使用了一種基於擬合單詞結構幾何模型的單詞正規化方案\[92\]。我們的模型具有四條"靈活"的線,分別代表上升線、核心線、基線以及下降線。這些線被擬合到筆跡的局部最小值或最大值。線的參數以EM演算法的修改版本估計,以最大化觀察點與參數值的[聯合機率](http://terms.naer.edu.tw/detail/88558/),並使用參數的先驗來防止這些線彼此重疊塌縮。
從數位化表面上的筆跡辨識手寫字符通常在時域中進行\[110\],\[44\],\[111\]。通常會先將軌跡正規化,然後提取局部幾何或動態特徵。接著可以使用曲線匹配\[110\],或其它分類技術,像是TDNNs\[44\],\[111\]來執行辨識。儘管這些表示有許多優點,但它們依賴[筆順](http://terms.naer.edu.tw/detail/6682988/)與個人書寫風格,這使得它們難以用於將分段與辨識整合在一起的高精度、與書寫者無關(writer independent)的系統。
:::
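以下以Python簡單示意擬合四條參考線的想法:對筆跡y座標的局部極值,以k-means式的交替估計(EM的簡化替代)將其指派到四條水平線。論文中的線是"靈活"的且參數帶有先驗,這裡的固定水平線、分位數初始化等細節皆為示意用的假設。

```python
import numpy as np

def local_extrema(ys):
    """回傳筆跡 y 座標序列中的局部極小與極大值(線擬合的觀察點)。"""
    ext = []
    for i in range(1, len(ys) - 1):
        if (ys[i] <= ys[i-1] and ys[i] <= ys[i+1]) or \
           (ys[i] >= ys[i-1] and ys[i] >= ys[i+1]):
            ext.append(ys[i])
    return np.array(ext)

def fit_reference_lines(ys, n_iter=20):
    """以簡化的交替估計(k-means 式,EM 的簡化替代)將局部極值
    指派給四條水平參考線:下降線、基線、核心線、上升線。
    論文中的線是「彈性」的且帶有防止塌縮的先驗,這裡僅為示意。"""
    pts = local_extrema(ys)
    # 以分位數初始化四條線,讓它們一開始就彼此分開(對應論文中的先驗)
    lines = np.quantile(pts, [0.05, 0.35, 0.65, 0.95])
    for _ in range(n_iter):
        # E 步:每個觀察點指派給最近的線
        assign = np.argmin(np.abs(pts[:, None] - lines[None, :]), axis=1)
        # M 步:以各線被指派到的點的平均更新線的位置
        for k in range(4):
            if np.any(assign == k):
                lines[k] = pts[assign == k].mean()
    return np.sort(lines)
```

例如一條在四個高度之間擺動的合成筆跡,擬合結果會收斂到那四個高度附近。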
:::info
Since the intent of the writer is to produce a legible image, it seems natural to preserve as much of the pictorial nature of the signal as possible, while at the same time exploit the sequential information in the trajectory. For this purpose we have designed a representation scheme, called AMAP \[38\], where pen trajectories are represented by low-resolution images in which each picture element contains information about the local properties of the trajectory. An AMAP can be viewed as an "annotated image" in which each pixel is a 5-element feature vector: 4 features are associated to four orientations of the pen tra jectory in the area around the pixel, and the fifth one is associated to local curvature in the area around the pixel. A particularly useful feature of the AMAP representation is that it makes very few assumptions about the nature of the input trajectory. It does not depend on stroke ordering or writing speed, and it can be used with all types of handwriting (capital, lower case, cursive, punctuation, symbols). Unlike many other representations (such as global features), AMAPs can be computed for complete words without requiring segmentation.
:::
:::success
由於書寫者的意圖是生成[易於辨識](http://terms.naer.edu.tw/detail/6324093/)的影像,因此很自然地要盡可能保留信號的[圖像的](http://terms.naer.edu.tw/detail/6665311/)性質,同時還要利用軌跡中的[順序](http://terms.naer.edu.tw/detail/6673467/)信息。為此,我們設計了一種表示方案,稱為AMAP\[38\],其筆跡由低解析度影像來表示,其中每個圖片元素包含關於軌跡局部屬性的信息。AMAP可以視為是[帶註影像](http://terms.naer.edu.tw/detail/6635163/),其中每一個像素為5個元素的特徵向量:4個特徵關聯於像素週圍區域中筆跡的四個方向,第五個關聯於像素週圍區域中的局部[曲率](http://terms.naer.edu.tw/detail/2114063/)。AMAP表示的一個特別有用的特性是,它對輸入軌跡的性質所做的假設非常少。它不取決於[筆順](http://terms.naer.edu.tw/detail/6682988/)或書寫速度,而且可以用於所有手寫類型(大寫,小寫,草寫,標點符號,符號)。不同於許多其它的表示(例如全域特徵),AMAPs可以對完整的單詞計算,而不需要分段。
:::
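AMAP的確切計算細節(取樣、量化與平滑方式)論文並未完全給出,以下是一個示意性的簡化實作:把筆跡畫進低解析度網格,每個像素是5維特徵向量(4個方向平面加1個曲率平面);其中的方向分箱與曲率量化方式皆為假設。

```python
import numpy as np

def compute_amap(points, height=20, width=18):
    """AMAP 的簡化示意:將筆跡 (x, y) 點列畫進低解析度網格,
    平面 0-3 記錄線段方向(量化為 0/45/90/135 度四個無向方位),
    平面 4 記錄相鄰線段的方向變化(局部曲率的粗略代理)。"""
    amap = np.zeros((5, height, width))
    pts = np.asarray(points, dtype=float)
    # 將軌跡正規化到網格座標
    mn, mx = pts.min(axis=0), pts.max(axis=0)
    span = np.maximum(mx - mn, 1e-6)
    grid = (pts - mn) / span * [width - 1, height - 1]
    angles = np.arctan2(np.diff(grid[:, 1]), np.diff(grid[:, 0]))
    for i in range(len(grid) - 1):
        x, y = grid[i].astype(int)
        # 無向方向量化成 4 個方位平面
        o = int((angles[i] % np.pi) / np.pi * 4 + 0.5) % 4
        amap[o, y, x] = 1.0
        if i > 0:  # 曲率:相鄰線段的夾角(正規化到 [0, 1])
            dtheta = np.abs((angles[i] - angles[i-1] + np.pi) % (2*np.pi) - np.pi)
            amap[4, y, x] = max(amap[4, y, x], dtheta / np.pi)
    return amap
```

例如一條水平直線只會點亮方向平面0,曲率平面保持為零;這也呼應了它不依賴筆順與書寫速度的性質。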
### B. Network Architecture
:::info
One of the best networks we found for both online and offline character recognition is a 5-layer convolutional network somewhat similar to LeNet-5 (Figure 2), but with multiple input planes and different numbers of units on the last two layers; layer 1: convolution with 8 kernels of size 3x3, layer 2: 2x2 sub-sampling, layer 3: convolution with 25 kernels of size 5x5, layer 4: convolution with 84 kernels of size 4x4, layer 5: 2x1 sub-sampling, classification layer: 95 RBF units (one per class in the full printable ASCII set). The distributed codes on the output are the same as for LeNet-5, except they are adaptive unlike with LeNet-5. When used in the heuristic over-segmentation system, the input to the above network consisted of an AMAP with five planes, 20 rows and 18 columns. It was determined that this resolution was sufficient for representing handwritten characters. In the SDNN version, the number of columns was varied according to the width of the input word. Once the number of sub-sampling layers and the sizes of the kernels are chosen, the sizes of all the layers, including the input, are determined unambiguously. The only architectural parameters that remain to be selected are the number of feature maps in each layer, and the information as to what feature map is connected to what other feature map. In our case, the sub-sampling rates were chosen as small as possible (2x2) and the kernels as small as possible in the first layer (3x3) to limit the total number of connections. Kernel sizes in the upper layers are chosen to be as small as possible while satisfying the size constraints mentioned above. Larger architectures did not necessarily perform better and required considerably more time to be trained. A very small architecture with half the input field also performed worse, because of insufficient input resolution.
Note that the input resolution is nonetheless much less than for optical character recognition, because the angle and curvature provide more information than would a single grey level at each pixel.
:::
:::success
我們發現,對在線與離線字符辨識兩者都表現最佳的網路之一,是一個有點類似於LeNet-5(圖2)的5層卷積網路,但具有多個輸入平面,且最後兩層的單元數量不同;layer 1:8個3x3 kernel的卷積,layer 2:2x2的[分階抽樣(sub-sampling)](http://terms.naer.edu.tw/detail/2125658/),layer 3:25個5x5 kernel的卷積,layer 4:84個4x4 kernel的卷積,layer 5:2x1的[分階抽樣(sub-sampling)](http://terms.naer.edu.tw/detail/2125658/),classification layer:95個RBF單元(完整可列印ASCII集中每個類別一個)。輸出的分佈式編碼與LeNet-5相同,除了它們是自適應的(LeNet-5則不是)。在heuristic over-segmentation系統中使用時,上述網路的輸入是一個具有五個平面、20個rows與18個columns的AMAP。已經確認這個解析度足以表示手寫字符。在SDNN版本中,columns的數量會隨輸入單詞的寬度而變化。一旦選定[分階抽樣(sub-sampling)](http://terms.naer.edu.tw/detail/2125658/)層的數量與kernel的大小,包含輸入層在內的所有層的大小就被唯一確定。唯一還需要選擇的架構參數是每一層中feature map的數量,以及哪些feature map連接到哪些feature map的信息。在我們的案例中,[分階抽樣(sub-sampling)](http://terms.naer.edu.tw/detail/2125658/)率選擇得盡可能小(2x2),第一層的kernel也盡可能小(3x3),以限制連接的總數。較高層的kernel大小則在滿足上述大小約束的前提下盡可能小。較大的架構並不一定表現得更好,而且需要多得多的訓練時間。輸入視野只有一半的極小架構,由於輸入解析度不足,表現也較差。
注意,儘管如此,這個輸入解析度仍遠低於光學字符辨識所需,因為每個像素的角度與[曲率](http://terms.naer.edu.tw/detail/2114063/)提供的信息比單一灰階值更多。
:::
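上述"各層大小被唯一確定"可以直接用算術驗證:從5x20x18的AMAP輸入開始,依序套用valid卷積(不補零)與不重疊下採樣,即可推得每層的空間大小。以下僅為尺寸推導的示意(feature map之間的連接表並非由此決定):

```python
def layer_sizes(h=20, w=18):
    """依論文描述推導各層輸出大小:valid 卷積輸出為 (H-k+1, W-k+1),
    不重疊下採樣輸出為 (H//sh, W//sw)。回傳 (名稱, 通道數, 高, 寬) 列表。"""
    sizes = [("input", 5, h, w)]
    def conv(name, n_maps, kh, kw):   # valid 卷積,無 padding
        _, _, ph, pw = sizes[-1]
        sizes.append((name, n_maps, ph - kh + 1, pw - kw + 1))
    def sub(name, sh, sw):            # 不重疊下採樣,通道數不變
        _, c, ph, pw = sizes[-1]
        sizes.append((name, c, ph // sh, pw // sw))
    conv("L1 conv 8@3x3", 8, 3, 3)
    sub("L2 subsample 2x2", 2, 2)
    conv("L3 conv 25@5x5", 25, 5, 5)
    conv("L4 conv 84@4x4", 84, 4, 4)
    sub("L5 subsample 2x1", 2, 1)
    return sizes
```

依此推導,20x18的輸入經過五層後正好縮為84個1x1的feature map,再接95個RBF輸出單元;這也印證了kernel與下採樣大小一經選定,各層大小便唯一確定。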
### C. Network Training
:::info
Training proceeded in two phases. First, we kept the centers of the RBFs fixed, and trained the network weights so as to minimize the output distance of the RBF unit corresponding to the correct class. This is equivalent to minimizing the mean-squared error between the previous layer and the center of the correct-class RBF. This bootstrap phase was performed on isolated characters. In the second phase, all the parameters, network weights and RBF centers were trained globally to minimize a discriminative criterion at the word level.
:::
:::success
訓練分為兩個階段進行。首先,我們固定RBF的中心,訓練網路權重,以最小化對應正確類別的RBF單元的輸出距離。這等價於最小化上一層輸出與正確類別RBF中心之間的[均方誤差](http://terms.naer.edu.tw/detail/6568223/)。這個引導階段是在單獨的字符上執行的。第二階段中,所有參數(網路權重與RBF中心)都進行全域訓練,以最小化單詞級別的判別準則。
:::
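第一階段的等價性可以用幾行程式說明:RBF單元的輸出(懲罰)是到類別中心的平方歐氏距離,固定中心後對前一層輸出做梯度下降,就是在最小化與正確類別中心的均方誤差。以下的中心、特徵與學習率數值皆為示意用的假設。

```python
import numpy as np

def rbf_penalties(features, centers):
    """RBF 輸出單元:前一層輸出到各類別中心的平方歐氏距離(懲罰)。"""
    diff = features[None, :] - centers        # (n_classes, dim)
    return np.sum(diff ** 2, axis=1)

# 單步梯度下降示意:固定中心,只更新前一層輸出
centers = np.array([[1.0, 0.0],              # 類別 0 的中心(假設值)
                    [0.0, 1.0]])             # 類別 1 的中心(假設值)
f = np.array([0.5, 0.5])                     # 前一層的輸出
target = 0                                   # 正確類別
lr = 0.1
grad = 2 * (f - centers[target])             # d/df ||f - c||^2
f_new = f - lr * grad                        # 正確類別的懲罰應下降
```

這一步正是把 f 往正確類別中心拉近,也就是最小化兩者之間的均方誤差。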
:::info
With the Heuristic Over-Segmentation approach, the GTN was composed of four main Graph Transformers:
1. The Segmentation Transformer performs the Heuristic Over-Segmentation, and outputs the segmentation graph. An AMAP is then computed for each image attached to the arcs of this graph.
2. The Character Recognition Transformer applies the convolutional network character recognizer to each candidate segment, and outputs the recognition graph, with penalties and classes on each arc.
3. The Composition Transformer composes the recognition graph with a grammar graph representing a language model incorporating lexical constraints.
4. The Beam Search Transformer extracts a good interpretation from the interpretation graph. This task could have been achieved with the usual Viterbi Transformer. The Beam Search algorithm however implements pruning strategies which are appropriate for large interpretation graphs.
:::
:::success
透過Heuristic Over-Segmentation,GTN由四個主要的圖轉換器組成:
1. [分段](http://terms.naer.edu.tw/detail/6674993/)轉換器(Segmentation Transformer)執行Heuristic Over-Segmentation,並輸出[分段](http://terms.naer.edu.tw/detail/6674993/)圖。然後為附加在該圖各弧上的每個影像計算AMAP。
2. 字符辨識轉換器(Character Recognition Transformer)將卷積網路字符識別器應用到每個候選分段,然後輸出辨識圖,以及每個弧上的懲罰與類別。
3. 組合轉換器(Composition Transformer)以語法圖構成辨識圖,這語法圖表示一個包含[詞彙](http://terms.naer.edu.tw/detail/6636104/)約束的語言模型。
4. [定向搜索](http://terms.naer.edu.tw/detail/6673485/)轉換器(Beam Search Transformer)從解釋圖提取一個好的解釋。通常的[維特比](http://terms.naer.edu.tw/detail/6304135/)轉換器可以完成這個任務。但是,[定向搜索](http://terms.naer.edu.tw/detail/6673485/)演算法實現了適用大型解釋圖的修剪策略。
:::
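第4點的[定向搜索](http://terms.naer.edu.tw/detail/6673485/)可以簡單示意如下:在以懲罰(越小越好)標註弧的解釋圖上,每一步只保留累積懲罰最小的beam_width條部份路徑。圖的表示格式與剪枝細節皆為假設;實際系統中的解釋圖更大,且以程序方式展開。

```python
import heapq

def beam_search(graph, start, is_final, beam_width=3):
    """定向搜索的簡化示意。
    graph: {節點: [(弧懲罰, 標籤, 下一節點), ...]}(格式為假設)。
    每一輪展開所有存活路徑,只保留懲罰最小的 beam_width 條。"""
    beam = [(0.0, start, [])]        # (累積懲罰, 目前節點, 標籤序列)
    best = None
    while beam:
        nxt = []
        for pen, node, labels in beam:
            if is_final(node):
                if best is None or pen < best[0]:
                    best = (pen, labels)
                continue
            for p, lab, dst in graph.get(node, []):
                nxt.append((pen + p, dst, labels + [lab]))
        # 剪枝:只留下懲罰最小的 beam_width 條部份路徑
        beam = heapq.nsmallest(beam_width, nxt, key=lambda t: t[0])
    return best
```

在小圖上它與維特比轉換器給出同樣的最佳解釋;剪枝的好處要在大型解釋圖上才顯現。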
:::info
With the SDNN approach, the main Graph Transformers are the following:
1. The SDNN Transformer replicates the convolutional network over the whole word image, and outputs a recognition graph that is a linear graph with class penalties for every window centered at regular intervals on the input image.
2. The Character-Level Composition Transformer composes the recognition graph with a left-to-right HMM for each character class (as in Figure 27).
3. The Word-Level Composition Transformer composes the output of the previous transformer with a language model incorporating lexical constraints, and outputs the interpretation graph.
4. The Beam Search Transformer extracts a good interpretation from the interpretation graph.
:::
:::success
使用SDNN方法,主要的圖轉換器如下:
1. SDNN轉換器(SDNN Transformer)將卷積網路複製到整個單詞影像上,然後輸出辨識圖;這個辨識圖是一個線性圖,對輸入影像上以固定間隔為中心的每個窗口帶有類別懲罰。
2. 字符級別合成轉換器(Character-Level Composition Transformer)為每個字符類別以由左至右的HMM組合辨識圖(如圖27所示)。
3. 單詞級別合成轉換器(Word-Level Composition Transformer)以包含[詞彙](http://terms.naer.edu.tw/detail/6636104/)約束的語言模型組合上一個轉換器的輸出,並輸出解釋圖。
4. [定向搜索](http://terms.naer.edu.tw/detail/6673485/)轉換器(Beam Search Transformer)從解釋圖中提取好的解釋。
:::
:::info
In this application, the language model simply constrains the final output graph to represent sequences of character labels from a given dictionary. Furthermore, the interpretation graph is not actually completely instantiated: the only nodes created are those that are needed by the Beam Search module. The interpretation graph is therefore represented procedurally rather than explicitly.
:::
:::success
在這個應用中,語言模型僅約束最終的輸出圖,使其表示給定字典中的字符標籤序列。此外,解釋圖實際上並未完全實例化:唯一建立的節點就是[定向搜索](http://terms.naer.edu.tw/detail/6673485/)模組需要的那些節點而已。因此,解釋圖是以程序方式表示,而不是顯式表示。
:::
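"以程序方式表示而非顯式表示"的意思可以這樣示意:圖不預先建好,只在搜索模組要求某節點的後繼時才實例化該節點。以下的succ介面與快取方式皆為假設。

```python
def procedural_graph(succ):
    """程序式(lazy)圖的示意:succ(node) 回傳該節點的出弧列表
    [(懲罰, 標籤, 下一節點), ...]。expand 只在第一次被詢問時
    呼叫 succ 並快取,created 紀錄實際被實例化的節點。"""
    cache = {}
    created = []
    def expand(node):
        if node not in cache:
            created.append(node)          # 節點在此刻才真正存在
            cache[node] = succ(node)
        return cache[node]
    return expand, created
```

即使隱含的圖大得無法顯式建構,搜索模組實際觸碰到的節點可以非常少。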
:::info
A crucial contribution of this research was the joint training of all graph transformer modules within the network with respect to a single criterion, as explained in Sections VI and VIII. We used the Discriminative Forward loss function on the final output graph: minimize the forward penalty of the constrained interpretation (i.e., along all the "correct" paths) while maximizing the forward penalty of the whole interpretation graph (i.e., along all the paths).
:::
:::success
如第六與第八章節所說明,這項研究的一個重大貢獻在於,以單一準則對網路中所有的圖轉換器模組做聯合訓練。我們在最終的輸出圖上使用[鑑別式](http://terms.naer.edu.tw/detail/6320279/)正向損失函數:最小化約束解釋的正向懲罰(即沿著所有"正確"路徑),同時最大化整個解釋圖的正向懲罰(即沿著所有路徑)。
:::
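正向懲罰就是在"懲罰域"做log-sum-exp:forward = -log Σ exp(-penalty)。以下為了示意直接枚舉路徑懲罰;實際的forward演算法是在圖上逐節點遞迴計算,這裡只用來展示損失的結構(約束圖的forward減去全圖的forward,恆為非負)。

```python
import math

def forward_penalty(path_penalties):
    """一組路徑的正向懲罰:-log(sum(exp(-p)))。
    比所有路徑中的最小懲罰還小,是 min 的「軟化」版本。"""
    return -math.log(sum(math.exp(-p) for p in path_penalties))

def discriminative_forward_loss(correct_paths, all_paths):
    """鑑別式正向損失:最小化正確路徑的正向懲罰,
    同時最大化(在損失中以負號出現)全圖的正向懲罰。"""
    return forward_penalty(correct_paths) - forward_penalty(all_paths)
```

當錯誤路徑的懲罰遠大於正確路徑時,兩個forward幾乎相等,損失趨近於零;正確路徑集合等於全部路徑時,損失恰為零。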
:::info
During global training, the loss function was optimized with the stochastic diagonal Levenberg-Marquardt procedure described in Appendix C, that uses second derivatives to compute optimal learning rates. This optimization operates on all the parameters in the system, most notably the network weights and the RBF centers.
:::
:::success
在全域訓練期間,損失函數以附錄C所說明的stochastic diagonal Levenberg-Marquardt程序進行最佳化,該程序使用二階導數來計算最佳學習率。這個最佳化作用於系統中的所有參數,最值得注意的是網路權重與RBF中心。
:::
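該程序的核心是每個參數有自己的學習率 eps / (mu + h_ii),其中h_ii是對角二階導數的估計(附錄C中以滑動平均維護),mu防止在曲率很小時步長爆掉。以下的單步更新為示意,eps與mu為假設的超參數。

```python
import numpy as np

def sdlm_step(w, grad, h_diag, eps=0.01, mu=0.1):
    """stochastic diagonal Levenberg-Marquardt 的單步示意:
    學習率按每個參數的估計曲率 h_ii 縮放,高曲率方向步長較小。"""
    return w - (eps / (mu + h_diag)) * grad
```

在一個各維曲率不同的二次損失 L = 0.5 * h * w^2 上,這個更新對高曲率維度採用較小的有效學習率,同時仍讓所有參數朝最小值移動。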
### D. Experimental Results
:::info
In the first set of experiments, we evaluated the generalization ability of the neural network classifier coupled with the word normalization preprocessing and AMAP input representation. All results are in writer independent mode (different writers in training and testing). Initial training on isolated characters was performed on a database of approximately 100,000 hand printed characters (95 classes of upper case, lower case, digits, and punctuation). Tests on a database of isolated characters were performed separately on the four types of characters: upper case (2.99% error on 9122 patterns), lower case (4.15% error on 8201 patterns), digits (1.4% error on 2938 patterns), and punctuation (4.3% error on 881 patterns). Experiments were performed with the network architecture described above. To enhance the robustness of the recognizer to variations in position, size, orientation, and other distortions, additional training data was generated by applying local affine transformations to the original characters.
:::
:::success
在第一組實驗中,我們評估神經網路分類器搭配單詞正規化預處理以及AMAP輸入表示的泛化能力。所有的結果皆為與書寫者無關(writer independent)模式(訓練與測試使用不同的書寫者)。[隔離](http://terms.naer.edu.tw/detail/6637401/)字符的初始訓練在約100,000個手寫印刷字符(95個類別:大寫、小寫、數字與標點符號)的資料庫上進行。對[隔離](http://terms.naer.edu.tw/detail/6637401/)字符資料庫的測試分別在四種字符類型上執行:大寫(9122個樣本上2.99%誤差),小寫(8201個樣本上4.15%誤差),數字(2938個樣本上1.4%誤差),標點符號(881個樣本上4.3%誤差)。實驗以上述網路架構執行。為了提高辨識器對位置、大小、方向以及其它變形的魯棒性,透過對原始字符套用局部[仿射](http://terms.naer.edu.tw/detail/2110955/)轉換來生成額外的訓練資料。
:::
:::info
The second and third set of experiments concerned the recognition of lower case words (writer independent). The tests were performed on a database of 881 words. First we evaluated the improvements brought by the word normalization to the system. For the SDNN/HMM system we have to use word-level normalization since the network sees one whole word at a time. With the Heuristic OverSegmentation system, and before doing any word-level training, we obtained with character-level normalization 7.3% and 3.5% word and character errors (adding insertions, deletions and substitutions) when the search was constrained within a 25461-word dictionary. When using the word normalization preprocessing instead of a character level normalization, error rates dropped to 4.6% and 2.0% for word and character errors respectively, i.e., a relative drop of 37% and 43% in word and character error respectively. This suggests that normalizing the word in its entirety is better than first segmenting it and then normalizing and processing each of the segments.
:::
:::success
第二與第三組實驗涉及小寫單詞的辨識(與書寫者無關)。測試在881個單詞的資料庫上進行。首先,我們評估單詞正規化對系統帶來的改進。對於SDNN/HMM系統,由於網路一次看到一整個單詞,因此必須使用單詞級別的正規化。使用Heuristic Over-Segmentation系統,並且在做任何單詞級別訓練之前,當搜索被限制在25461個單詞的字典中時,我們以字符級別正規化獲得7.3%的單詞誤差與3.5%的字符誤差(計入插入、刪除與替換)。當使用單詞正規化預處理取代字符級別正規化時,單詞與字符的誤差率分別下降到4.6%與2.0%,即單詞與字符誤差分別相對下降37%與43%。這說明對單詞做整體正規化,比起先分段再對每個分段做正規化與處理還來得好。
:::
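文中的"相對下降37%與43%"可以直接核算:相對下降率是誤差差值除以原誤差,而不是百分點的差。

```python
def relative_drop(before, after):
    """相對下降率:(before - after) / before。
    例如單詞誤差 7.3% -> 4.6%,相對下降約 37%,
    而非百分點差 2.7%。"""
    return (before - after) / before
```

字符誤差 3.5% -> 2.0% 同理對應約 43% 的相對下降;後文的 32%/34% 與 30.4%/30.0% 也都是這種相對降幅。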
:::info
In the third set of experiments, we measured the improvements obtained with the joint training of the neural network and the post-processor with the word-level criterion, in comparison to training based only on the errors performed at the character level. After initial training on individual characters as above, global word-level discriminative training was performed with a database of 3500 lower case words. For the SDNN/HMM system, without any dictionary constraints, the error rates dropped from 38% and 12.4% word and character error to 26% and 8.2% respectively after word-level training, i.e., a relative drop of 32% and 34%. For the Heuristic Over-Segmentation system and a slightly improved architecture, without any dictionary constraints, the error rates dropped from 22.5% and 8.5% word and character error to 17% and 6.3% respectively, i.e., a relative drop of 24.4% and 25.6%. With a 25461-word dictionary, errors dropped from 4.6% and 2.0% word and character errors to 3.2% and 1.4% respectively after word-level training, i.e., a relative drop of 30.4% and 30.0%. Even lower error rates can be obtained by drastically reducing the size of the dictionary to 350 words, yielding 1.6% and 0.94% word and character errors.
:::
:::success
第三組實驗中,我們量測了以單詞級別準則對神經網路與[後處理器](http://terms.naer.edu.tw/detail/6662025/)做聯合訓練,相較於僅基於字符級別誤差的訓練所獲得的改進。如上在各別字符上做初始訓練之後,使用3500個小寫單詞的資料庫做全域單詞級別的[鑑別式](http://terms.naer.edu.tw/detail/6320279/)訓練。對於沒有任何字典約束的SDNN/HMM系統,在單詞級別訓練之後,單詞與字符誤差率從38%與12.4%分別下降到26%與8.2%,即相對下降32%與34%。對於Heuristic Over-Segmentation系統以及稍微改進的架構,在沒有任何字典約束下,單詞與字符誤差率從22.5%與8.5%分別下降到17%與6.3%,即相對下降24.4%與25.6%。使用25461個單詞的字典,在單詞級別訓練之後,單詞與字符誤差從4.6%與2.0%分別下降到3.2%與1.4%,即相對下降30.4%與30.0%。透過將字典大小大幅縮減到350個單詞,甚至可以獲得更低的誤差率:1.6%的單詞誤差與0.94%的字符誤差。
:::
:::info
These results clearly demonstrate the usefulness of globally trained Neural-Net/HMM hybrids for handwriting recognition. This confirms similar results obtained earlier in speech recognition \[77\].
:::
:::success
這些結果清楚地展示了全域訓練的Neural-Net/HMM混合模型在手寫辨識中的實用性。這證實了先前在語音辨識中獲得的類似結果\[77\]。
:::