# Gradient-Based Learning Applied to Document Recognition_Paper(翻譯)(X)
###### tags: `LeNet-5` `CNN` `論文翻譯` `deeplearning` `Gradient-Based Learning Applied to Document Recognition`
>[name=Shaoe.chen] [time=Tue, Feb 4, 2020 4:59 PM]
[TOC]
## 說明
區塊如下分類,原文區塊為藍底,翻譯區塊為綠底,部份專業用語翻譯參考國家教育研究院
:::info
原文
:::
:::success
翻譯
:::
:::warning
任何的翻譯不通暢部份都請留言指導
:::
:::danger
* [paper hyperlink](http://yann.lecun.com/exdb/publis/pdf/lecun-98.pdf)
* [\*paper hyperlink](http://vision.stanford.edu/cs598_spring07/papers/Lecun98.pdf)
:::
Page 37~40
## X. A Check Reading System
:::info
This section describes a GTN based Check Reading System, intended for immediate industrial deployment. It also shows how the use of Gradient-Based Learning and GTNs makes this deployment fast and cost-effective while yielding an accurate and reliable solution.
:::
:::info
這章節說明一個基於GTN的支票讀取系統,其目標是立即的工業佈署。它還說明使用基於梯度的學習與GTN如何讓這種佈署快速且經濟高效,同時產生準確而可靠的解決方案。
:::
:::info
The verification of the amount on a check is a task that is extremely time and money consuming for banks. As a consequence, there is a very high interest in automating the process as much as possible (see for example \[112\], \[113\], \[114\]). Even a partial automation would result in considerable cost reductions. The threshold of economic viability for automatic check readers, as set by the bank, is when 50% of the checks are read with less than 1% error. The other 50% of the checks are rejected and sent to human operators. In such a case, we describe the performance of the system as 50% correct / 49% reject / 1% error. The system presented here was one of the first to cross that threshold on representative mixtures of business and personal checks.
:::
:::success
驗證支票上的金額對銀行來說是一項極度耗費時間與金錢的任務。因此,人們對於盡可能自動化這個流程非常感興趣(例如見\[112\],\[113\],\[114\])。即使只是部份自動化也會帶來可觀的成本降低。銀行所設定的自動支票讀取器的經濟可行性門檻是:50%的支票能以低於1%的錯誤率被讀取,其餘50%的支票被拒絕並交由人工處理。這種情況下,我們將系統效能描述為50%正確/49%拒絕/1%錯誤。這裡介紹的系統是第一批在具代表性的商業支票與個人支票混合樣本上跨過這一門檻的系統之一。
:::
:::info
Checks contain at least two versions of the amount. The Courtesy amount is written with numerals, while the Legal amount is written with letters. On business checks, which are generally machine-printed, these amounts are relatively easy to read, but quite difficult to find due to the lack of standard for business check layout. On the other hand, these amounts on personal checks are easy to find but much harder to read.
:::
:::success
支票至少包含兩種版本的金額。小寫金額(Courtesy amount)以數字書寫,而大寫金額(Legal amount)以文字書寫。商業支票通常是機器列印的,這些金額相對容易讀取,但由於商業支票的佈局缺乏標準而相當難以找到。另一方面,個人支票上的這些金額很容易找到,但更難讀取。
:::
:::info
For simplicity (and speed requirements), our initial task is to read the Courtesy amount only. This task consists of two main steps:
* The system has to find, among all the fields (lines of text), the candidates that are the most likely to contain the courtesy amount. This is obvious for many personal checks, where the position of the amount is standardized. However, as already noted, finding the amount can be rather difficult in business checks, even for the human eye. There are many strings of digits, such as the check number, the date, or even "not to exceed" amounts, that can be confused with the actual amount. In many cases, it is very difficult to decide which candidate is the courtesy amount before performing a full recognition.
* In order to read (and choose) some Courtesy amount candidates, the system has to segment the fields into characters, read and score the candidate characters, and finally find the best interpretation of the amount using contextual knowledge represented by a stochastic grammar for check amount.
:::
:::success
簡單起見(速度要求),我們的[初始任務](http://terms.naer.edu.tw/detail/6630778/)僅讀取小寫金額。這任務包含兩個主要步驟:
* 系統必須在所有字段(文本行)中找到最有可能包含小寫金額的候選。對許多個人支票而言,這很明顯,因為其金額的位置是標準化的。然而,如前所述,即使對於人眼,在商業支票中找到金額也相當困難。有很多數字字串,像是支票號碼,日期,甚至"不超過"金額,都可能與實際金額混淆。許多情況下,在執行完整的辨識之前,非常難以確定哪一個候選是小寫金額。
* 為了讀取(並選擇)一些小寫金額候選,系統必須將字段分段為字符,讀取並評分候選字符,最後利用以支票金額隨機語法所表示的上下文知識,找到金額的最佳解釋。
:::
:::info
The GTN methodology was used to build a check amount reading system that handles both personal checks and business checks.
:::
:::success
GTN方法被用於建置一個同時處理個人支票與商業支票的金額讀取系統。
:::
### A. A GTN for Check Amount Recognition
:::info
We now describe the successive graph transformations that allow this network to read the check amount (cf. Figure 33). Each Graph Transformer produces a graph whose paths encode and score the current hypotheses considered at this stage of the system.
:::
:::success
現在,我們說明讓該網路能夠讀取支票金額的連續圖轉換(見圖33)。每一個圖轉換器生成一個圖,其路徑對系統在此階段所考慮的假設做編碼與評分。
:::
:::info

Fig. 33. A complete check amount reader implemented as a single cascade of Graph Transformer modules. Successive graph transformations progressively extract higher level information.
圖33. 完整的支票金額讀取器,以圖轉換器模組的單一[級聯](http://terms.naer.edu.tw/detail/6595667/)實作。連續圖轉換逐步提取更高級的信息。
:::
:::info
The input to the system is a trivial graph with a single arc that carries the image of the whole check (cf. Figure 33).
:::
:::success
系統的輸入是一個簡單的圖,帶有單一個弧,其承載整張支票的影像(見圖33)。
:::
:::info
The field location transformer $T_{field}$ first performs classical image analysis (including connected component analysis, ink density histograms, layout analysis, etc...) and heuristically extracts rectangular zones that may contain the check amount. $T_{field}$ produces an output graph, called the field graph (cf. Figure 33) such that each candidate zone is associated with one arc that links the start node to the end node. Each arc contains the image of the zone, and a penalty term computed from simple features extracted from the zone (absolute position, size, aspect ratio, etc...). The penalty term is close to zero if the features suggest that the field is a likely candidate, and is large if the field is deemed less likely to be an amount. The penalty function is differentiable, therefore its parameters are globally tunable.
:::
:::success
字段位置轉換器$T_{field}$首先執行經典的影像分析(包含連通元件分析,墨水密度直方圖,佈局分析,等),然後以啟發式方法提取可能包含支票金額的矩形區域。$T_{field}$生成一個輸出圖,稱為字段圖(field graph)(見圖33),每一個候選區域都關聯到一個連結[開始節點](http://terms.naer.edu.tw/detail/6661837/)與[端節點](http://terms.naer.edu.tw/detail/2115413/)的弧。每一個弧包含該區域的影像,以及一個由該區域提取的簡單特徵(絕對位置,大小,長寬比,等)計算出來的懲罰項。如果特徵顯示這字段是可能的候選,那這懲罰項接近於零;如果認為該字段不大可能是金額,那懲罰項會較大。懲罰函數是可微的,因此其參數是全域可調的。
:::
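懲罰函數「可微,因此參數可全域調整」這一點,可以用一個極簡的Python草圖來示意(此為假設性的示意,並非論文的原始實作;特徵與權重皆為虛構):

```python
import math

def field_penalty(features, weights, bias):
    """Differentiable penalty for one candidate amount field.

    features: simple zone measurements (e.g. absolute position,
    size, aspect ratio); weights/bias: tunable parameters.
    Softplus keeps the penalty non-negative and smooth, so the
    parameters can be tuned by gradient descent together with
    the rest of the system.
    """
    score = sum(w * f for w, f in zip(weights, features)) + bias
    return math.log1p(math.exp(score))  # softplus(score) >= 0
```

分數越高(特徵越不像金額字段)懲罰越大,而且整個函數對weights與bias皆可求導,這正是全域訓練所需要的性質。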
:::info
An arc may represent separate dollar and cent amounts as a sequence of fields. In fact, in handwritten checks, the cent amount may be written over a fractional bar, and not aligned at all with the dollar amount. In the worst case, one may find several cent amount candidates (above and below the fraction bar) for the same dollar amount.
:::
:::success
一個弧可以將分開書寫的美元金額與美分金額表示為一系列字段。事實上,在手寫支票中,美分金額可能寫在分數線的上方,而完全沒有與美元金額對齊。最壞情況下,對同一個美元金額可能找到多個美分金額候選(分數線的上方與下方)。
:::
:::info
The segmentation transformer $T_{seg}$, similar to the one described in Section VIII examines each zone contained in the field graph, and cuts each image into pieces of ink using heuristic image processing techniques. Each piece of ink may be a whole character or a piece of character. Each arc in the field graph is replaced by its corresponding segmentation graph that represents all possible groupings of pieces of ink. Each field segmentation graph is appended to an arc that contains the penalty of the field in the field graph. Each arc carries the segment image, together with a penalty that provides a first evaluation of the likelihood that the segment actually contains a character. This penalty is obtained with a differentiable function that combines a few simple features such as the space between the pieces of ink or the compliance of the segment image with a global baseline, and a few tunable parameters. The segmentation graph represents all the possible segmentations of all the field images. We can compute the penalty for one segmented field by adding the arc penalties along the corresponding path. As before, using a differentiable function for computing the penalties will ensure that the parameters can be optimized globally.
:::
:::success
分段轉換器$T_{seg}$類似於第八章節所描述的,檢查字段圖中包含的每一個區域,並使用[啟發式](http://terms.naer.edu.tw/detail/6629346/)影像處理技術將每一個影像切割為墨水塊。每一個墨水塊可以是一個完整的字符或字符的一部份。字段圖中的每一個弧都以其相對應的分段圖替代,這分段圖表示墨水塊所有可能的群組方式。每一個字段分段圖都附加在一個弧之後,這個弧帶有該字段在字段圖中的懲罰。每一個弧都帶有分段影像,以及一個懲罰,該懲罰對該分段實際包含一個字符的可能性提供初步評估。這個懲罰由一個可微函數計算而得,該函數結合一些簡單的特徵(像是墨水塊之間的間距,或分段影像與全域基線的一致性)以及一些可調參數。分段圖表示所有字段影像的所有可能分段。我們可以透過加總沿著相對應路徑的弧懲罰,來計算某個分段字段的懲罰。與之前一樣,使用可微函數來計算懲罰可以確保參數能夠全域最佳化。
:::
:::info
The segmenter uses a variety of heuristics to find candidate cuts. One of the most important ones is called "hit and deflect" \[115\]. The idea is to cast lines downward from the top of the field image. When a line hits a black pixel, it is deflected so as to follow the contour of the object. When a line hits a local minimum of the upper profile, i.e. when it cannot continue downward without crossing a black pixel, it is just propagated vertically downward through the ink. When two such lines meet each other, they are merged into a single cut. The procedure can be repeated from the bottom up. This strategy allows the separation of touching characters such as double zeros.
:::
:::success
分段器使用各種啟發式方法來尋找候選切割。其中最重要的一種稱為"hit and deflect"\[115\]。其想法是從字段影像的頂部向下投射直線。當線碰到黑色像素時,它會偏轉以跟隨物件的輪廓。當線碰到上部輪廓的局部最小值時,即當它無法在不穿過黑色像素的情況下繼續向下時,它就直接垂直向下穿過墨水傳播。當兩條這樣的線彼此相遇時,它們將合併為一個切割。這過程可以由下而上重覆進行。這策略可以分離相連的字符,像是連在一起的兩個零。
:::
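"hit and deflect"的行為可以用一個玩具版的Python草圖來示意(僅為假設性的示意,並不代表\[115\]的實際實作細節;影像以0/1的二維串列表示,1為墨水):

```python
def hit_and_deflect(img, start_col):
    """Trace one candidate cut from the top of a binary field image.

    img: list of rows, 1 = ink, 0 = background (an assumed toy
    representation). The line moves down one row at a time; when it
    hits ink it deflects sideways to follow the contour, and when
    no sideways escape exists (a local minimum of the upper profile)
    it simply continues straight down through the ink.
    Returns the column visited at each row, i.e. the cut path.
    """
    h, w = len(img), len(img[0])
    col = start_col
    path = []
    for row in range(h):
        if img[row][col]:  # hit ink: try to deflect along the contour
            for step in (1, -1, 2, -2):  # small sideways search
                c = col + step
                if 0 <= c < w and not img[row][c]:
                    col = c
                    break
            # no escape found: propagate vertically through the ink
        path.append(col)
    return path
```

論文中的完整程序會投射多條這樣的線,合併相遇的線,並由下而上重覆一次。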
:::info
The recognition transformer $T_{rec}$ iterates over all segment arcs in the segmentation graph and runs a character recognizer on the corresponding segment image. In our case, the recognizer is LeNet-5, the Convolutional Neural Network described in Section II, whose weights constitute the largest and most important subset of tunable parameters. The recognizer classifies segment images into one of 95 classes (full printable ASCII set) plus a rubbish class for unknown symbols or badly-formed characters. Each arc in the input graph of $T_{rec}$ is replaced by 96 arcs in the output graph. Each of those 96 arcs contains the label of one of the classes, and a penalty that is the sum of the penalty of the corresponding arc in the input (segmentation) graph and the penalty associated with classifying the image in the corresponding class, as computed by the recognizer. In other words, the recognition graph represents a weighted trellis of scored character classes. Each path in this graph represents a possible character string for the corresponding field. We can compute a penalty for this interpretation by adding the penalties along the path. This sequence of characters may or may not be a valid check amount.
:::
:::success
識別轉換器$T_{rec}$迭代分段圖中的所有分段弧,並在相對應的分段影像上執行字符識別器。在我們的案例中,識別器為LeNet-5,即第二章所說的卷積神經網路,其權重構成可調參數中最大也最重要的子集。識別器將分段影像分類為95個類別(完整可列印的ASCII集)中的一種,再加上一個用於未知符號或變形字符的垃圾類別。$T_{rec}$輸入圖中的每個弧被輸出圖中的96個弧替代。這96個弧中,每個弧都包含其中一個類別的標籤,以及一個懲罰,這懲罰是輸入(分段)圖中相對應弧的懲罰,與識別器所計算出的將該影像分類為相對應類別的懲罰之總和。換句話說,識別圖表示一個由已評分字符類別構成的加權網格(trellis)。這圖中的每條路徑都代表相對應字段的一個可能字符串。我們可以透過沿路徑加總懲罰來為該解釋計算懲罰。這個字符序列可能是有效的支票金額,也可能不是。
:::
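$T_{rec}$將輸入圖中的每個分段弧展開為「每個類別一個弧」的做法,可以用下面的Python草圖示意(假設性的簡化,只保留「兩種懲罰相加」這個核心;實際系統中的懲罰來自LeNet-5的輸出):

```python
def expand_segment_arc(segment_penalty, class_penalties):
    """Replace one segmentation arc by one arc per character class.

    class_penalties: mapping from class label to the recognizer's
    penalty for that class on this segment image (95 printable
    ASCII classes plus a rubbish class in the paper; any dict works
    in this sketch). Each output arc carries the label and the sum
    of the segmentation penalty and the classification penalty.
    """
    return [(label, segment_penalty + p)
            for label, p in class_penalties.items()]
```

對所有分段弧套用這個展開,就得到文中所說的加權網格(trellis):路徑懲罰即沿路各弧懲罰之和。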
:::info
The composition transformer $T_{gram}$ selects the paths of the recognition graph that represent valid character sequences for check amounts. This transformer takes two graphs as input: the recognition graph, and the grammar graph. The grammar graph contains all possible sequences of symbols that constitute a well-formed amount. The output of the composition transformer, called the interpretation graph, contains all the paths in the recognition graph that are compatible with the grammar. The operation that combines the two input graphs to produce the output is a generalized transduction (see Section VIII). A differentiable function is used to compute the data attached to the output arc from the data attached to the input arcs. In our case, the output arc receives the class label of the two arcs, and a penalty computed by simply summing the penalties of the two input arcs (the recognizer penalty, and the arc penalty in the grammar graph). Each path in the interpretation graph represents one interpretation of one segmentation of one field on the check. The sum of the penalties along the path represents the "badness" of the corresponding interpretation and combines evidence from each of the modules along the process, as well as from the grammar.
:::
:::success
[合成](http://terms.naer.edu.tw/detail/2113135/)轉換器$T_{gram}$選擇識別圖中代表支票金額有效字符序列的路徑。這轉換器將兩個圖做為輸入:識別圖與語法圖。語法圖包含構成格式良好金額的所有可能符號序列。[合成](http://terms.naer.edu.tw/detail/2113135/)轉換器的輸出稱為解釋圖,包含識別圖中與語法兼容的所有路徑。結合兩個輸入圖來產生輸出的操作是一種[一般型轉導](http://terms.naer.edu.tw/detail/5457299/)(見第八章)。一個可微函數用於從附加在輸入弧上的資料,計算附加在輸出弧上的資料。在我們的案例中,輸出弧接收兩個弧的類別標籤,以及一個透過簡單加總兩個輸入弧的懲罰(識別器懲罰,以及語法圖中的弧懲罰)所計算出的懲罰。解釋圖中的每條路徑代表支票上一個字段的一個分段的一種解釋。沿路徑的懲罰總和代表相對應解釋的"劣度"(badness),並結合了過程中每個模組以及語法所提供的證據。
:::
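合成操作的核心——標籤相符的弧才保留,且輸出弧的懲罰為兩個輸入弧懲罰之和——可以用下面的草圖示意(假設性的簡化:忽略了一般型轉導中輸出標籤重寫等細節,只做加權自動機的乘積):

```python
def compose(rec_arcs, gram_arcs):
    """Sketch of composition: keep recognition arcs whose label also
    appears on a grammar arc, summing the two penalties.

    rec_arcs / gram_arcs: lists of (src, dst, label, penalty) over
    the recognition and grammar graphs. Output arcs live in the
    product graph, with composite states (rec_state, gram_state).
    """
    out = []
    for rs, rd, rl, rp in rec_arcs:
        for gs, gd, gl, gp in gram_arcs:
            if rl == gl:  # labels must agree for the path to survive
                out.append(((rs, gs), (rd, gd), rl, rp + gp))
    return out
```

不在語法中的字符序列因此沒有對應路徑,自然被排除在解釋圖之外。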
:::info
The Viterbi transformer finally selects the path with the lowest accumulated penalty, corresponding to the best grammatically correct interpretations.
:::
:::success
最後,維特比轉換器選擇累積懲罰最低的路徑,其對應於語法上正確的最佳解釋。
:::
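維特比轉換器等同於在解釋圖上尋找總懲罰最低的路徑。以下是一個假設性的Python草圖(假設節點為已拓撲排序的整數,且每個弧都由小編號節點指向大編號節點):

```python
def viterbi(arcs, start, end):
    """Pick the lowest-total-penalty path through an acyclic graph.

    arcs: list of (src, dst, label, penalty); nodes are assumed to
    be topologically ordered integers (a toy stand-in for the
    interpretation graph). Returns (best penalty, label sequence).
    """
    best = {start: (0.0, [])}          # node -> (penalty, labels)
    for src, dst, label, pen in sorted(arcs):  # process in src order
        if src in best:
            cand = (best[src][0] + pen, best[src][1] + [label])
            if dst not in best or cand[0] < best[dst][0]:
                best[dst] = cand
    return best[end]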
### B. Gradient-Based Learning
:::info
Each stage of this check reading system contains tunable parameters. While some of these parameters could be manually adjusted, for example the parameters of the field locator and segmenter, the vast majority of them must be learned, particularly the weights of the neural net recognizer.
:::
:::success
支票讀取系統每個階段都包含可調參數。儘管其中一些參數可以手動調整(像是字段定位器與分段器的參數),但大多數的參數是需要學習的,尤其是神經網路識別器的權重。
:::
:::info
Prior to globally optimizing the system, each module's parameters are initialized with reasonable values. The parameters of the field locator and the segmenter are initialized by hand, while the parameters of the neural net character recognizer are initialized by training on a database of presegmented and labeled characters. Then, the entire system is trained globally from whole check images labeled with the correct amount. No explicit segmentation of the amounts is needed to train the system: it is trained at the check level.
:::
:::success
在全域最佳化系統之前,每個模組的參數都以合理的數值進行初始化。字段定位器與分段器的參數以手動初始化,而神經網路字符識別器的參數則透過在預先分段並標記的字符資料庫上訓練來初始化。然後,整個系統以標有正確金額的完整支票影像做全域訓練。訓練系統不需要對金額做明確的分段:它是在支票級別進行訓練的。
:::
:::info
The loss function $E$ minimized by our global training procedure is the Discriminative Forward criterion described in Section VI: the difference between (a) the forward penalty of the constrained interpretation graph (constrained by the correct label sequence), and (b) the forward penalty of the unconstrained interpretation graph. Derivatives can be back-propagated through the entire structure, although it is only practical to do it down to the segmenter.
:::
:::success
由全域訓練程序所最小化的損失函數$E$,就是第六章節中說明的鑑別式正向準則:即(a)受約束解釋圖(受正確標籤序列約束)的正向懲罰,與(b)無約束解釋圖的正向懲罰,兩者之間的差。導數可以透過整個結構做反向傳播,儘管實務上只反向傳播到分段器為止。
:::
### C. Rejecting Low Confidence Checks
:::info
In order to be able to reject checks which are the most likely to carry erroneous Viterbi answers, we must rate them with a confidence, and reject the check if this confidence is below a given threshold. To compare the unnormalized Viterbi Penalties of two different checks would be meaningless when it comes to decide which answer we trust the most.
:::
:::success
為了能夠拒絕最有可能帶有錯誤維特比答案的支票,我們必須對它們給出一個置信度評分,如果置信度低於給定閥值,就拒絕該支票。在決定我們最信任哪一個答案時,比較兩張不同支票的未正規化維特比懲罰是毫無意義的。
:::
:::info
The optimal measure of confidence is the probability of the Viterbi answer given the input image. As seen in Section VI-E, given a target sequence (which, in this case, would be the Viterbi answer), the discriminative forward loss function is an estimate of the logarithm of this probability. Therefore, a simple solution to obtain a good estimate of the confidence is to reuse the interpretation graph (see Figure 33) to compute the discriminative forward loss as described in Figure 21, using as our desired sequence the Viterbi answer. This is summarized in Figure 34, with:
confidence = exp($-E_{dforw}$)
:::
:::success
置信度的最佳度量就是給定輸入影像下維特比答案的機率。如VI-E章節所示,給定目標序列(這種情況下為維特比答案),鑑別式正向損失函數就是對該機率對數的估計。因此,要得到良好置信度估計的一個簡單解決方法,就是重覆使用解釋圖(見圖33)來計算鑑別式正向損失(如圖21所述),並使用維特比答案作為我們所期望的序列。圖34對此做了總結,其中:
confidence = exp($-E_{dforw}$)
:::
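上述置信度計算可以用一個小草圖示意(此為假設性的示意:實際系統以圖上的前向演算法計算正向懲罰,這裡為了簡明直接枚舉各路徑的總懲罰;若$E_{dforw}$定義為受約束與無約束正向懲罰之差的非負損失,置信度即$\exp(-E_{dforw})$):

```python
import math

def forward_penalty(path_penalties):
    """Forward penalty of a graph given the total penalties of its
    paths: a soft minimum, -log(sum(exp(-p))) over all paths.
    (Real GTNs compute this by dynamic programming over the graph;
    enumerating paths here is just an illustrative shortcut.)"""
    return -math.log(sum(math.exp(-p) for p in path_penalties))

def confidence(constrained, unconstrained):
    """confidence = exp(-E_dforw), with E_dforw the constrained
    forward penalty minus the unconstrained one; this estimates
    the probability of the Viterbi answer given the input."""
    e_dforw = forward_penalty(constrained) - forward_penalty(unconstrained)
    return math.exp(-e_dforw)
```

當受約束圖只剩維特比路徑、而無約束圖還有其它低懲罰的競爭路徑時,置信度就會明顯低於1,這樣的支票即可被拒絕並轉交人工。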
:::info

Fig. 34. Additional processing required to compute the confidence.
圖34. 計算置信度所需要額外處理。
:::
### D. Results
:::info
A version of the above system was fully implemented and tested on machine-print business checks. This system is basically a generic GTN engine with task specific heuristics encapsulated in the `check` and `fprop` methods. As a consequence, the amount of code to write was minimal: mostly the adaptation of an earlier segmenter into the segmentation transformer. The system that deals with hand-written or personal checks was based on earlier implementations that used the GTN concept in a restricted way.
:::
:::success
上述系統的一個版本已經完整實現,並在機器列印的商業支票上做了測試。這個系統基本上是一個通用的GTN引擎,任務特定的啟發式方法封裝在`check`與`fprop`方法中。因此,需要撰寫的程式碼很少:主要是將早期的分段器改寫到分段轉換器中。處理手寫或個人支票的系統則基於早期的實作,這些實作以受限的方式使用GTN的概念。
:::
:::info
The neural network classifier was initially trained on 500,000 images of character images from various origins spanning the entire printable ASCII set. This contained both handwritten and machine-printed characters that had been previously size normalized at the string level. Additional images were generated by randomly distorting the original images using simple affine transformations of the images. The network was then further trained on character images that had been automatically segmented from check images and manually truthed. The network was also initially trained to reject non-characters that resulted from segmentation errors. The recognizer was then inserted in the check reading system and a small subset of the parameters were trained globally (at the field level) on whole check images.
:::
:::success
神經網路分類器最初在來自不同來源的500,000張字符影像上訓練,這些影像橫跨整個可列印ASCII集,包含手寫與機器列印字符,且已預先在字符串級別做過大小正規化。另外,透過對原始影像做簡單的仿射轉換來隨機變形,生成額外的影像。然後,網路進一步在自動從支票影像分段出來並經人工確認標籤的字符影像上訓練。網路最初也經過訓練,以拒絕由分段錯誤產生的非字符。之後,將識別器插入支票讀取系統中,並在完整支票影像上對一小部份參數做全域訓練(在字段級別)。
:::
:::info
On 646 business checks that were automatically categorized as machine-printed the performance was 82% correctly recognized checks, 1% errors, and 17% rejects. This can be compared to the performance of the previous system on the same test set: 68% correct, 1% errors, and 31% rejects. A check is categorized as machine-printed when characters that are near a standard-position dollar sign are detected as machine-printed, or when, if nothing is found in the standard position, at least one courtesy amount candidate is found somewhere else. The improvement is attributed to three main causes. First, the neural network recognizer was bigger, and trained on more data. Second, because of the GTN architecture, the new system could take advantage of grammatical constraints in a much more efficient way than the previous system. Third, the GTN architecture provided extreme flexibility for testing heuristics, adjusting parameters, and tuning the system. This last point is more important than it seems. The GTN framework separates the "algorithmic" part of the system from the "knowledge-based" part of the system, allowing easy adjustments of the latter. The importance of global training was only minor in this task because the global training only concerned a small subset of the parameters.
:::
:::success
在646張自動分類為機器列印的商業支票上,效能為82%的支票正確辨識,1%錯誤,17%拒絕。可以將它與之前的系統在相同測試集上的效能做比較:68%正確,1%錯誤,以及31%拒絕。當偵測到標準位置的美元符號附近的字符為機器列印時,支票即被分類為機器列印;或者,如果在標準位置沒有找到任何東西,但在其它位置至少找到一個小寫金額候選時,也是如此。改善歸因於三個主要原因。首先,神經網路識別器更大,並且在更多的資料上訓練。第二,因為GTN架構,新的系統可以比之前的系統更有效地利用語法約束。第三,GTN架構為測試啟發式方法,調整參數,與調校系統提供了極大的靈活性。最後一點比表面上看起來重要得多。GTN框架將系統的"演算法"部份與"基於知識"的部份分開,從而可以輕鬆調整後者。全域訓練在這個任務中的重要性較小,因為全域訓練只涉及參數的一小部份。
:::
:::info
An independent test performed by systems integrators in 1995 showed the superiority of this system over other commercial Courtesy amount reading systems. The system was integrated in NCR's line of check reading systems. It has been fielded in several banks across the US since June 1996, and has been reading millions of checks per day since then.
:::
:::success
系統整合商在1995年進行的獨立測試表明,這系統優於其它商用小寫金額讀取系統。這系統已整合到NCR的支票讀取系統產品線中。自1996年6月以來,它已在美國多家銀行投入使用,此後每天讀取數百萬張支票。
:::