# Gradient-Based Learning Applied to Document Recognition_Paper(翻譯)(V, VI)
###### tags: `LeNet-5` `CNN` `論文翻譯` `deeplearning` `Gradient-Based Learning Applied to Document Recognition`
[TOC]
## 說明
區塊如下分類,原文區塊為藍底,翻譯區塊為綠底,部份專業用語翻譯參考國家教育研究院
:::info
原文
:::
:::success
翻譯
:::
:::warning
任何的翻譯不通暢部份都請留言指導
:::
:::danger
* [paper hyperlink](http://yann.lecun.com/exdb/publis/pdf/lecun-98.pdf)
* [\*paper hyperlink](http://vision.stanford.edu/cs598_spring07/papers/Lecun98.pdf)
:::
Page 19~27
## V. Multiple Object Recognition: Heuristic Over-Segmentation
:::info
One of the most difficult problems of handwriting recognition is to recognize not just isolated characters, but strings of characters, such as zip codes, check amounts, or words. Since most recognizers can only deal with one character at a time, we must first segment the string into individual character images. However, it is almost impossible to devise image analysis techniques that will infallibly segment naturally written sequences of characters into well formed characters. The recent history of automatic speech recognition \[28\],\[67\] is here to remind us that training a recognizer by optimizing a global criterion (at the word or sentence level) is much preferable to merely training it on hand-segmented phonemes or other units. Several recent works have shown that the same is true for handwriting recognition \[38\]: optimizing a word-level criterion is preferable to solely training a recognizer on pre-segmented characters because the recognizer can learn not only to recognize individual characters, but also to reject mis-segmented characters thereby minimizing the overall word error. This section and the next describe in detail a simple example of GTN to address the problem of reading strings of characters, such as words or check amounts. The method avoids the expensive and unreliable task of hand-truthing the result of the segmentation often required in more traditional systems trained on individually labeled character images.
:::
:::success
手寫辨識最困難的問題之一,就是不僅要辨識[隔離](http://terms.naer.edu.tw/detail/2383122/)的字符,還要辨識字符串,像是郵政編碼、支票金額或單詞。多數的識別器一次只能處理一個字符,因此我們必需首先將字串分割為獨立的字符影像。然而,幾乎不可能設計出能萬無一失地將自然書寫的字符序列分割為格式正確字符的影像分析技術。
自動語音辨識的近代歷史\[28\],\[67\]在這裡提醒我們,透過最佳化全域準則(在單詞或語句級別)來訓練識別器,比僅僅在手動分段的[音素](http://terms.naer.edu.tw/detail/2407911/)或其它單元上訓練要好得多。幾個最近的研究顯示,手寫辨識也是如此\[38\]:最佳化單詞級別的準則比只以預分割字符訓練識別器來的好,因為識別器不只可以學習辨識單獨的字符,也可以學習拒絕錯誤分割的字符,從而最小化整體單詞錯誤。
本章節與下一章節將詳述GTN的簡單範例,以解決讀取字符串的問題,像是單詞或支票金額。這個方法避免了對分段結果進行人工標註這項昂貴且不可靠的工作,而這在以單獨標記的字符影像訓練的較傳統系統中通常是必要的。
:::
### A. Segmentation Graph
:::info
A now classical method for word segmentation and recognition is called Heuristic Over-Segmentation \[68\], \[69\]. Its main advantages over other approaches to segmentation are that it avoids making hard decisions about the segmentation by taking a large number of different segmentations into consideration. The idea is to use heuristic image processing techniques to find candidate cuts of the word or string, and then to use the recognizer to score the alternative segmentations thereby generated. The process is depicted in Figure 16. First, a number of candidate cuts are generated. Good candidate locations for cuts can be found by locating minima in the vertical projection profile, or minima of the distance between the upper and lower contours of the word. Better segmentation heuristics are described in section X. The cut generation heuristic is designed so as to generate more cuts than necessary, in the hope that the "correct" set of cuts will be included.
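:::

下面是一個極簡的示意程式(非論文的實作),假設輸入是一張已二值化的字串影像(numpy 陣列,墨跡像素為 1),示範「以垂直投影輪廓的局部最小值產生候選切割位置」這個啟發法;函數名稱與參數(如 `min_gap`)皆為說明用的假設,實際系統使用第X章所述更複雜的啟發法。

```python
import numpy as np

def candidate_cuts(binary_img: np.ndarray, min_gap: int = 3):
    """以垂直投影(每一欄的墨跡量)的局部最小值做為候選切割位置(示意)。"""
    profile = binary_img.sum(axis=0)                  # 垂直投影輪廓
    cuts = []
    for x in range(1, len(profile) - 1):
        # 局部最小值(含幾乎沒有墨跡的欄)都視為候選切割;刻意多切
        if profile[x] <= profile[x - 1] and profile[x] <= profile[x + 1]:
            if not cuts or x - cuts[-1] >= min_gap:   # 避免過密的重複切割
                cuts.append(x)
    return [0] + cuts + [binary_img.shape[1]]         # 加上左右邊界
```

:::info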
Once the cuts have been generated, alternative segmentations are best represented by a graph, called the segmentation graph. The segmentation graph is a Directed Acyclic Graph (DAG) with a start node and an end node. Each internal node is associated with a candidate cut produced by the segmentation algorithm. Each arc between a source node and a destination node is associated with an image that contains all the ink between the cut associated with the source node and the cut associated with the destination node. An arc is created between two nodes if the segmentor decided that the ink between the corresponding cuts could form a candidate character. Typically, each individual piece of ink would be associated with an arc. Pairs of successive pieces of ink would also be included, unless they are separated by a wide gap, which is a clear indication that they belong to different characters. Each complete path through the graph contains each piece of ink once and only once. Each path corresponds to a different way of associating pieces of ink together so as to form characters.
:::
:::success
如今已屬經典的[斷詞](http://terms.naer.edu.tw/detail/1274471/)與辨識方法稱為Heuristic Over-Segmentation \[68\],\[69\]。相較於其它的分段方法,它的主要優勢在於,通過考慮大量不同的分段來避免對分段做出困難的決定。這個想法就是使用[啟發式](http://terms.naer.edu.tw/detail/2911140/)影像處理技術來找出單詞或字串的候選切割,然後使用識別器對由此產生的各種分段評分。過程如圖16所示。首先,生成多個候選切割。好的候選切割位置可以透過在垂直投影輪廓中定位最小值,或定位單詞上下輪廓之間距離的最小值來找到。更好的分割啟發法在第X章中說明。切割生成啟發法被刻意設計成產生比實際需要更多的切割,期望「正確」的切割集合能被包含在內。一旦切割生成,各種可能的分段最好用一種稱為分段圖的圖來表示。分段圖是具有起始節點與結束節點的有向無環圖(DAG)。每一個內部節點都跟透過分段演算法生成的候選切割關聯。來源節點與目的節點間的每一個弧都與一個影像關聯,這個影像包含了來源節點所關聯的切割與目的節點所關聯的切割之間的所有墨跡。如果分段器判斷相對應的切割之間的墨跡可以形成一個候選字符,那就會在兩個節點之間建立弧。通常,每一塊墨跡都會跟一個弧關聯。成對的連續墨跡也會被包含進來,除非它們之間有很大的間隙,這清楚地表明它們屬於不同的字符。通過圖的每個完整路徑都包含每塊墨跡一次且僅一次。每個路徑對應於將墨跡關聯在一起來形成字符的不同方式。
:::
:::info
![](https://i.imgur.com/saYC25j.png)
Fig. 16. Building a segmentation graph with Heuristic Over-Segmentation.
圖16. 以Heuristic Over-Segmentation建立分段圖。
:::
### B. Recognition Transformer and Viterbi Transformer
:::info
A simple GTN to recognize character strings is shown in Figure 17. It is composed of two graph transformers called the recognition transformer $T_{rec}$, and the Viterbi transformer $T_{vit}$. The goal of the recognition transformer is to generate a graph, called the interpretation graph or recognition graph $G_{int}$, that contains all the possible interpretations for all the possible segmentations of the input. Each path in $G_{int}$ represents one possible interpretation of one particular segmentation of the input. The role of the Viterbi transformer is to extract the best interpretation from the interpretation graph. The recognition transformer $T_{rec}$ takes the segmentation graph $G_{seg}$ as input, and applies the recognizer for single characters to the images associated with each of the arcs in the segmentation graph. The interpretation graph $G_{int}$ has almost the same structure as the segmentation graph, except that each arc is replaced by a set of arcs from and to the same node. In this set of arcs, there is one arc for each possible class for the image associated with the corresponding arc in $G_{seg}$. As shown in Figure 18, to each arc is attached a class label, and the penalty that the image belongs to this class as produced by the recognizer. If the segmentor has computed penalties for the candidate segments, these penalties are combined with the penalties computed by the character recognizer, to obtain the penalties on the arcs of the interpretation graph.
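:::

以下用簡單的 Python 資料結構示意分段圖,以及識別轉換器如何把 $G_{seg}$ 的每個弧展開成解釋圖 $G_{int}$ 中「每個可能類別一個」帶懲罰與標籤的弧。這只是依上文結構寫出的概念性草圖:`recognizer` 是假設的函數,輸入一張候選字符影像、回傳各類別的懲罰;`images` 則假設保存每個弧對應的墨跡影像,皆非論文的實際介面。

```python
from dataclasses import dataclass, field

@dataclass
class Arc:
    src: int            # 來源節點(候選切割)
    dst: int            # 目的節點(候選切割)
    penalty: float      # 弧上的懲罰
    label: str = ""     # 解釋圖中附加的類別標籤

@dataclass
class Graph:
    arcs: list = field(default_factory=list)

def recognition_transformer(seg_graph: Graph, recognizer, images) -> Graph:
    """把分段圖的每個弧替換成解釋圖中一組帶類別標籤的弧(示意)。
    recognizer(image) 假設回傳 {label: penalty};images[(src, dst)] 是該弧的墨跡影像。"""
    g_int = Graph()
    for arc in seg_graph.arcs:
        class_penalties = recognizer(images[(arc.src, arc.dst)])
        for label, pen in class_penalties.items():
            # 分段器的懲罰與識別器的懲罰相加,得到解釋圖弧上的懲罰
            g_int.arcs.append(Arc(arc.src, arc.dst, arc.penalty + pen, label))
    return g_int
```

:::info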
Although combining penalties of different nature seems highly heuristic, the GTN training procedure will tune the penalties and take advantage of this combination anyway. Each path in the interpretation graph corresponds to a possible interpretation of the input word. The penalty of a particular interpretation for a particular segmentation is given by the sum of the arc penalties along the corresponding path in the interpretation graph. Computing the penalty of an interpretation independently of the segmentation requires to combine the penalties of all the paths with that interpretation. An appropriate rule for combining the penalties of parallel paths is given in section VI-C.
:::
:::success
圖17中說明了一個簡單的GTN來辨識字符串。它由兩個圖轉換器所組成,分別為識別轉換器$T_{rec}$與[維特比](http://terms.naer.edu.tw/detail/5973878/)轉換器$T_{vit}$。識別轉換器的目標是生成一個圖,稱為解釋圖或識別圖$G_{int}$,包含對輸入的所有可能分段的所有可能解釋。$G_{int}$中的每一個路徑代表對輸入一個特定分段的一種可能解釋。[維特比](http://terms.naer.edu.tw/detail/5973878/)轉換器的作用在於從解釋圖中提取最佳解釋。
識別轉換器$T_{rec}$以分段圖$G_{seg}$做為輸入,並將單個字符的識別器應用於分段圖中每個弧相關聯的影像。解釋圖$G_{int}$與分段圖擁有幾乎相同的架構,除了每個弧都被從同一節點到同一節點的一組弧替代。這組弧中,對$G_{seg}$中與相對應弧相關聯的影像的每個可能類別都有一個弧。如圖18所示,每一個弧上附加一個類別標籤,以及由識別器產生的、該影像屬於此類別的懲罰。如果分段器已經計算出候選分段的懲罰,將這些懲罰與字符識別器計算的懲罰結合,以獲得解釋圖的弧上的懲罰。儘管將不同性質的懲罰結合起來似乎非常啟發式,但是GTN訓練程序將調整懲罰並利用這種組合。解釋圖中的每個路徑對應輸入單詞的一種可能解釋。對特定分段的特定解釋的懲罰,由解釋圖中相對應路徑上的弧懲罰加總給出。要計算與分段無關的某個解釋的懲罰,需要結合所有具有該解釋的路徑的懲罰。一個適合的結合平行路徑的懲罰準則在VI-C章中給出。
:::
:::info
![](https://i.imgur.com/jwKTG2E.png)
Fig. 17. Recognizing a character string with a GTN. For readability, only the arcs with low penalties are shown.
圖17. 以GTN辨識字符串。為了可讀性,只顯示懲罰低的弧。
:::
:::info
![](https://i.imgur.com/RC37l3y.png)
Fig. 18. The recognition transformer refines each arc of the segmentation graph into a set of arcs in the interpretation graph, one per character class with attached penalties and labels.
圖18. 識別轉換器將分段圖中的每一個弧細分為解釋圖中的一組弧,每個字符類別一個,並附上懲罰與標籤。
:::
:::info
The Viterbi transformer produces a graph $G_{vit}$ with a single path. This path is the path of least cumulated penalty in the Interpretation graph. The result of the recognition can be produced by reading off the labels of the arcs along the graph $G_{vit}$ extracted by the Viterbi transformer. The Viterbi transformer owes its name to the famous Viterbi algorithm \[70\], an application of the principle of dynamic programming to find the shortest path in a graph efficiently. Let $c_i$ be the penalty associated to arc $i$, with source node $s_i$, and destination node $d_i$ (note that there can be multiple arcs between two nodes). In the interpretation graph, arcs also have a label $l_i$. The Viterbi algorithm proceeds as follows. Each node $n$ is associated with a cumulated Viterbi penalty $v_n$. Those cumulated penalties are computed in any order that satisfies the partial order defined by the interpretation graph (which is directed and acyclic). The start node is initialized with the cumulated penalty $v_{start}=0$. The other nodes cumulated penalties $v_n$ are computed recursively from the $v$ values of their parent nodes, through the upstream arcs $U_n$=\{arc $i$ with destination $d_i=n$}:
![](https://i.imgur.com/TeGIG6G.png)
Furthermore, the value of $i$ for each node $n$ which minimizes the right hand side is noted $m_n$, the minimizing entering arc. When the end node is reached we obtain in $v_{end}$ the total penalty of the path with the smallest total penalty. We call this penalty the Viterbi penalty, and this sequence of arcs and nodes the Viterbi path. To obtain the Viterbi path with nodes $n_1 \dots n_T$ and arcs $i_1 \dots i_{T-1}$,
we trace back these nodes and arcs as follows, starting with $n_T =$ the end node, and recursively using the minimizing entering arc: $i_t=m_{n_{t+1}}$, and $n_t = s_{i_t}$ until the start node is reached. The label sequence can then be read off the arcs of the Viterbi path.
:::
:::success
[維特比](http://terms.naer.edu.tw/detail/5973878/)轉換器生成具有單個路徑的圖$G_{vit}$。這個路徑是解釋圖中最小累積懲罰的路徑。辨識的結果可以通過讀取[維特比](http://terms.naer.edu.tw/detail/5973878/)轉換器沿著圖$G_{vit}$提取的弧的標記來生成。[維特比](http://terms.naer.edu.tw/detail/5973878/)轉換器的名稱得益於著名的[維特比演算法](http://terms.naer.edu.tw/detail/1062121/)\[70\],這演算法應用動態規劃原理來有效地找到圖中的最短路徑。假設$c_i$是與弧$i$相關聯的懲罰,其[原始節點](http://terms.naer.edu.tw/detail/2424576/)為$s_i$,[目的地節點](http://terms.naer.edu.tw/detail/2358449/)為$d_i$(注意到,兩個節點之間可以有多個弧)。在解釋圖中,弧也擁有標記$l_i$。[維特比演算法](http://terms.naer.edu.tw/detail/1062121/)流程如下。每一個節點$n$與累積的[維特比](http://terms.naer.edu.tw/detail/5973878/)懲罰$v_n$相關聯。那些累積的懲罰以滿足由解釋圖(其為有向無環圖)所定義的偏序的任何順序計算。使用累積懲罰$v_{start}=0$初始化[開始節點](http://terms.naer.edu.tw/detail/2425451/)。其它節點累積的懲罰$v_n$由它們的父節點的$v$值通過上游弧$U_n$=\{arc $i$ with destination $d_i=n$}遞迴計算:
![](https://i.imgur.com/peCujwf.png)
此外,每一個節點$n$使右側最小化的$i$值記為$m_n$,即最小化的進入弧。當到達[端節點](http://terms.naer.edu.tw/detail/5981292/)的時候,我們在$v_{end}$中獲得擁有最小總懲罰的路徑的總懲罰。我們稱這個懲罰為[維特比](http://terms.naer.edu.tw/detail/5973878/)懲罰,而這個弧與節點的序列即為[維特比](http://terms.naer.edu.tw/detail/5973878/)路徑。為了獲得帶有節點$n_1 \dots n_T$與弧$i_1 \dots i_{T-1}$的[維特比](http://terms.naer.edu.tw/detail/5973878/)路徑,我們用下面的方法追溯這些節點與弧,由$n_T=$[端節點](http://terms.naer.edu.tw/detail/5981292/)開始,遞迴地使用最小化的進入弧:$i_t=m_{n_{t+1}}$,以及$n_t = s_{i_t}$,直到到達[開始節點](http://terms.naer.edu.tw/detail/2425451/)。然後就可以從[維特比](http://terms.naer.edu.tw/detail/5973878/)路徑的弧讀取標記序列。
:::
## VI. Global Training for Graph Transformer Networks
:::info
The previous section describes the process of recognizing a string using Heuristic Over-Segmentation, assuming that the recognizer is trained so as to give low penalties for the correct class label of correctly segmented characters, high penalties for erroneous categories of correctly segmented characters, and high penalties for all categories for badly formed characters. This section explains how to train the system at the string level to do the above without requiring manual labeling of character segments. This training will be performed with a GTN whose architecture is slightly different from the recognition architecture described in the previous section. In many applications, there is enough a priori knowledge about what is expected from each of the modules in order to train them separately. For example, with Heuristic Over-Segmentation one could individually label single-character images and train a character recognizer on them, but it might be difficult to obtain an appropriate set of non-character images to train the model to reject wrongly segmented candidates. Although separate training is simple, it requires additional supervision information that is often lacking or incomplete (the correct segmentation and the labels of incorrect candidate segments). Furthermore it can be shown that separate training is sub-optimal \[67\]. The following section describes four different gradient-based methods for training GTN-based handwriting recognizers at the string level: Viterbi training, discriminative Viterbi training, forward training, and discriminative forward training. The last one is a generalization to graph-based systems of the MAP criterion introduced in Section II-C.
Discriminative forward training is somewhat similar to the so-called Maximum Mutual Information criterion used to train HMM in speech recognition. However, our rationale differs from the classical one. We make no recourse to a probabilistic interpretation, but show that, within the Gradient-Based Learning approach, discriminative training is a simple instance of the pervasive principle of error correcting learning. Training methods for graph-based sequence recognition systems such as HMMs have been extensively studied in the context of speech recognition \[28\]. Those methods require that the system be based on probabilistic generative models of the data, which provide normalized likelihoods over the space of possible input sequences. Popular HMM learning methods, such as the Baum-Welsh algorithm, rely on this normalization. The normalization cannot be preserved when non-generative models such as neural networks are integrated into the system. Other techniques, such as discriminative training methods, must be used in this case. Several authors have proposed such methods to train neural network/HMM speech recognizers at the word or sentence level \[71\],\[72\],\[73\],\[74\],\[75\],\[76\],\[77\],\[78\],\[29\],\[67\]. Other globally trainable sequence recognition systems avoid the difficulties of statistical modeling by not resorting to graph-based techniques. The best example is Recurrent Neural Networks (RNN). Unfortunately, despite early enthusiasm, the training of RNNs with gradient-based techniques has proved very difficult in practice \[79\]. The GTN techniques presented below simplify and generalize the global training methods developed for speech recognition.
:::
:::success
前一章節說明使用Heuristic Over-Segmentation來辨識字串的過程,假設識別器已被訓練成:對正確分割字符的正確類別標籤給予低懲罰,對正確分割字符的錯誤類別給予高懲罰,並對格式不正確的字符的所有類別都給予高懲罰。本節說明,如何在字串級別上訓練系統執行上述動作,而不需要手動標記字符段。這個訓練會使用GTN來進行,該GTN的架構與上一章節所說的辨識架構會有些許的不同。
在許多應用程式中,對每個模組的預期,有足夠的先驗知識,以便分別訓練它們。舉例來說,通過Heuristic Over-Segmentation,可以單獨標記單個字符影像,在這些字符影像上訓練字符識別器,但也許難以獲得適當的非字符影像集來訓練模型拒絕錯誤分割的候選片段。雖然單獨訓練是簡單的,但它需要額外的監督信息,這些信息通常是缺乏或不完整的(正確的分段與錯誤的候選片段標記)。此外,可以證明,單獨的訓練是次優的\[67\]。
下一章節說明在字串級別上訓練基於GTN的手寫識別器的四種不同的基於梯度的方法:[維特比](http://terms.naer.edu.tw/detail/6239753/)訓練,判別[維特比](http://terms.naer.edu.tw/detail/6239753/)訓練,[正向](http://terms.naer.edu.tw/detail/2371564/)訓練與判別[正向](http://terms.naer.edu.tw/detail/2371564/)訓練。最後一個是II-C章節中介紹的MAP準則對基於圖的系統的推廣。判別[正向](http://terms.naer.edu.tw/detail/2371564/)訓練有點像是用於訓練語音辨識中的HMM的所謂最大相互資訊標準。但是,我們的理論基礎不同於經典的理論基礎。我們並沒有訴諸於機率的解釋,而是表明,在基於梯度學習方法中,判別訓練是一個普遍的[錯誤糾正](http://terms.naer.edu.tw/detail/2455012/)學習原理的一個簡單實例。
基於圖的序列辨識系統(像是HMMs)的訓練方法,已經在語音辨識領域中有大量的研究\[28\]。這些方法要求系統基於資料的機率生成模型,在可能的輸入序列空間上提供[正規化](http://terms.naer.edu.tw/detail/721767/)的可能性。受歡迎的HMM學習方法,像是Baum-Welsh algorithm,都依賴這個[正規化](http://terms.naer.edu.tw/detail/721767/)。當非生成模型(如神經網路)被集成到系統中的時候,[正規化](http://terms.naer.edu.tw/detail/721767/)無法被保留。在這情況下,必須使用其它技術,像是判別訓練方法。幾位作者提出這類方法來訓練單詞或語句級別的神經網路/HMM語音識別器\[71\],\[72\],\[73\],\[74\],\[75\],\[76\],\[77\],\[78\],\[29\],\[67\]。
其它全域可訓練序列辨識系統透過不使用基於圖的技術避免統計建模的困難。最好的例子就是遞迴神經網路(RNN)。不幸的是,儘管早期的熱情度很高,在實作中仍然證明很難用基於梯度的技術來訓練RNN\[79\]。下面介紹的GTN技術簡化並推廣了為語音辨識所開發的全域訓練方法。
:::
:::info
![](https://i.imgur.com/1ODgZmA.png)
Fig. 19. Viterbi Training GTN Architecture for a character string recognizer based on Heuristic Over-Segmentation.
圖19. 針對基於Heuristic Over-Segmentation的字符串識別器的[維特比](http://terms.naer.edu.tw/detail/6239753/)訓練GTN架構。
:::
### A. Viterbi Training
:::info
During recognition, we select the path in the Interpretation Graph that has the lowest penalty with the Viterbi algorithm.
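:::

下面是維特比轉換器核心遞迴的一個簡化示意(沿用上面示意的 `Arc`/`Graph` 結構,並假設節點已依解釋圖的偏序排成拓撲順序):每個節點保存累積懲罰 $v_n$ 與最小化的進入弧 $m_n$,最後從終點回溯出懲罰最低的路徑,並可沿路讀出標籤序列。

```python
import math

def viterbi(graph, nodes_in_topo_order, start, end):
    """回傳 (最低累積懲罰 v_end, 最佳路徑上的弧序列)。示意實作,非論文原始碼。"""
    v = {n: math.inf for n in nodes_in_topo_order}   # 累積維特比懲罰 v_n
    m = {}                                           # 最小化的進入弧 m_n
    v[start] = 0.0
    for n in nodes_in_topo_order:                    # 依偏序(拓撲順序)計算
        if v[n] == math.inf:
            continue
        for arc in graph.arcs:
            if arc.src == n and v[n] + arc.penalty < v[arc.dst]:
                v[arc.dst] = v[n] + arc.penalty      # v_d = min_i (v_{s_i} + c_i)
                m[arc.dst] = arc
    path, node = [], end                             # 由終點回溯維特比路徑
    while node != start:
        arc = m[node]
        path.append(arc)
        node = arc.src
    return v[end], list(reversed(path))
```

標籤序列即 `[a.label for a in path]`,對應文中「從維特比路徑的弧讀取標記序列」。

:::info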
Ideally, we would like this path of lowest penalty to be associated with the correct label sequence as often as possible. An obvious loss function to minimize is therefore the average over the training set of the penalty of the path associated with the correct label sequence that has the lowest penalty. The goal of training will be to find the set of recognizer parameters (the weights, if the recognizer is a neural network) that minimize the average penalty of this "correct" lowest penalty path. The gradient of this loss function can be computed by back-propagation through the GTN architecture shown in figure 19. This training architecture is almost identical to the recognition architecture described in the previous section, except that an extra graph transformer called a path selector is inserted between the Interpretation Graph and the Viterbi Transformer. This transformer takes the interpretation graph and the desired label sequence as input. It extracts from the interpretation graph those paths that contain the correct (desired) label sequence. Its output graph $G_c$ is called the constrained interpretation graph (also known as forced alignment in the HMM literature), and contains all the paths that correspond to the correct label sequence. The constrained interpretation graph is then sent to the Viterbi transformer which produces a graph $G_{cvit}$ with a single path. This path is the "correct" path with the lowest penalty. Finally, a path scorer transformer takes $G_{cvit}$, and simply computes its cumulated penalty $C_{cvit}$ by adding up the penalties along the path. The output of this GTN is the loss function for the current pattern:
![](https://i.imgur.com/vGY8kcx.png)
The only label information that is required by the above system is the sequence of desired character labels. No knowledge of the correct segmentation is required on the part of the supervisor, since it chooses among the segmentations in the interpretation graph the one that yields the lowest penalty.
:::
:::success
辨識過程中,我們用[維特比](http://terms.naer.edu.tw/detail/6239753/)演算法在解釋圖中選擇懲罰最低的路徑。理想情況下,我們希望這個最低懲罰的路徑盡可能與正確標記序列相關聯。因此,一個明顯的要最小化的損失函數,就是「與正確標記序列相關聯且懲罰最低的路徑」的懲罰在訓練集上的平均值。訓練的目標將是找到一組識別器參數(如果識別器是神經網路,那就是權重),這組參數可以最小化這個"正確"最低懲罰路徑的平均懲罰。這個損失函數的梯度可以通過圖19所示的GTN架構,以反向傳播計算。除了在解釋圖與[維特比](http://terms.naer.edu.tw/detail/6239753/)轉換器之間額外插入一個稱為路徑選擇器的圖轉換器之外,這個訓練架構與上一章節所說的辨識架構幾乎相同。這個轉換器將解釋圖與所需的標籤序列做為輸入。它從解釋圖中提取出包含正確(所需)的標籤序列的路徑。它的輸出圖$G_c$稱為約束解釋圖(HMM文獻中稱為強制對齊),並包含與正確標籤序列相對應的所有路徑。然後將約束解釋圖發送到[維特比](http://terms.naer.edu.tw/detail/6239753/)轉換器,這個轉換器生成一個帶有單一路徑的圖$G_{cvit}$。這個路徑就是懲罰最低的"正確"路徑。最後,路徑計分器轉換器取得$G_{cvit}$,並通過加總路徑上的懲罰簡單計算它的累積懲罰$C_{cvit}$。這個GTN的輸出就是當前樣本(模式)的損失函數:
![](https://i.imgur.com/PGwUsFq.png)
上述系統唯一需要的標籤信息就是所需字符標籤的序列。監督者並不需要提供正確分段的知識,因為系統會在解釋圖的各個分段中選擇產生最低懲罰的那一個。
:::
:::info
The process of back-propagating gradients through the Viterbi training GTN is now described. As explained in section IV, the gradients must be propagated backwards through all modules of the GTN, in order to compute gradients in preceding modules and thereafter tune their parameters. Back-propagating gradients through the path scorer is quite straightforward. The partial derivatives of the loss function with respect to the individual penalties on the constrained Viterbi path $G_{cvit}$ are equal to 1, since the loss function is simply the sum of those penalties. Back-propagating through the Viterbi Transformer is equally simple.
The partial derivatives of $E_{vit}$ with respect to the penalties on the arcs of the constrained graph $G_c$ are 1 for those arcs that appear in the constrained Viterbi path $G_{cvit}$, and 0 for those that do not. Why is it legitimate to back-propagate through an essentially discrete function such as the Viterbi Transformer? The answer is that the Viterbi Transformer is nothing more than a collection of min functions and adders put together. It was shown in Section IV that gradients can be back-propagated through min functions without adverse effects. Back-propagation through the path selector transformer is similar to back-propagation through the Viterbi transformer. Arcs in $G_{int}$ that appear in $G_c$ have the same gradient as the corresponding arc in $G_c$, i.e. 1 or 0, depending on whether the arc appears in $G_{cvit}$. The other arcs, i.e. those that do not have an alter ego in $G_c$ because they do not contain the right label have a gradient of 0. During the forward propagation through the recognition transformer, one instance of the recognizer for single character was created for each arc in the segmentation graph. The state of recognizer instances was stored. Since each arc penalty in $G_{int}$ is produced by an individual output of a recognizer instance, we now have a gradient (1 or 0) for each output of each instance of the recognizer. Recognizer outputs that have a non zero gradient are part of the correct answer, and will therefore have their value pushed down. The gradients present on the recognizer outputs can be back-propagated through each recognizer instance. For each recognizer instance, we obtain a vector of partial derivatives of the loss function with respect to the recognizer instance parameters. All the recognizer instances share the same parameter vector, since they are merely clones of each other, therefore the full gradient of the loss function with respect to the recognizer's parameter vector is simply the sum of the gradient vectors produced by each recognizer instance. Viterbi training, though formulated differently, is often used in HMM-based speech recognition systems \[28\]. Similar algorithms have been applied to speech recognition systems that integrate neural networks with time alignment \[71\],\[72\], \[76\] or hybrid neural-network/HMM systems \[29\], \[74\], \[75\].
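:::

以下是維特比訓練中「損失對弧懲罰的梯度」的概念性示意(沿用前面示意的結構,函數與介面皆為假設,非論文的實作):損失$E_{vit}$只是約束維特比路徑$G_{cvit}$上懲罰的總和,因此落在該路徑上的弧其梯度為1、其餘為0;各弧的梯度再回傳給產生該懲罰的識別器實例,而由於所有實例共享同一參數向量,對參數的完整梯度就是各實例梯度的總和。

```python
def viterbi_training_arc_gradients(g_int, constrained_viterbi_path):
    """回傳 {id(arc): dE_vit/d(penalty)}:在 G_cvit 上為 1,否則為 0。"""
    on_path = {id(a) for a in constrained_viterbi_path}
    return {id(a): (1.0 if id(a) in on_path else 0.0) for a in g_int.arcs}

def total_recognizer_gradient(arc_grads, recognizer_instances):
    """recognizer_instances[id(arc)].backward(g) 假設回傳該實例對共享參數的梯度向量;
    所有實例互為 clone,故完整梯度為各實例梯度之和(示意)。"""
    total = None
    for arc_id, g in arc_grads.items():
        if g == 0.0:
            continue                                  # 梯度為 0 的輸出不貢獻
        grad_vec = recognizer_instances[arc_id].backward(g)
        total = grad_vec if total is None else total + grad_vec
    return total
```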
:::success
現在說明通過[維特比](http://terms.naer.edu.tw/detail/6239753/)訓練GTN反向傳播梯度的過程。如第四章節所說明,梯度必須通過GTN的所有模組向後傳播,以便計算[前導](http://terms.naer.edu.tw/detail/2409848/)模組的梯度,然後調整它們的參數。通過路徑計分器反向傳播梯度是非常簡單的。損失函數相對於約束[維特比](http://terms.naer.edu.tw/detail/6239753/)路徑$G_{cvit}$上各別懲罰的[偏導數](http://terms.naer.edu.tw/detail/254320/)等於1,因為損失函數只是這些懲罰的總和。通過[維特比](http://terms.naer.edu.tw/detail/6239753/)轉換器做反向傳播同樣簡單。$E_{vit}$相對於約束圖$G_c$的弧的懲罰的[偏導數](http://terms.naer.edu.tw/detail/254320/),對於出現在約束[維特比](http://terms.naer.edu.tw/detail/6239753/)路徑$G_{cvit}$中的弧是1,對於不出現在$G_{cvit}$中的弧是0。為什麼通過本質上是離散的函數(像是[維特比](http://terms.naer.edu.tw/detail/6239753/)轉換器)來反向傳播是合理的?答案是,[維特比](http://terms.naer.edu.tw/detail/6239753/)轉換器不過是由最小(min)函數與加法器組合而成的集合。第四章節已經說明,梯度可以通過min函數做反向傳播,而不會有不良的影響。通過路徑選擇器轉換器的反向傳播類似於通過[維特比](http://terms.naer.edu.tw/detail/6239753/)轉換器做反向傳播。$G_{int}$中出現於$G_c$中的弧,具有與$G_c$中相對應的弧相同的梯度,即1或0,這取決於該弧是否出現在$G_{cvit}$中。其它的弧,也就是那些在$G_c$中沒有對應弧的弧,因為它們並不包含正確的標籤,其梯度為0。在通過辨識轉換器的[正向](http://terms.naer.edu.tw/detail/2371564/)傳播期間,為分段圖中的每個弧建立一個單字符識別器的實例。識別器實例的狀態已被保存。因為$G_{int}$中的每一個弧懲罰都是由某個識別器實例的各別輸出產生,因此識別器每個實例的每個輸出現在都有一個梯度(1或0)。具有非零梯度的識別器輸出是正確答案的一部份,因此它們的值會被向下推。識別器輸出上的梯度可以透過每個識別器實例做反向傳播。對每個識別器實例,我們獲得損失函數相對於該識別器實例參數的偏導數向量。所有的識別器實例共享相同的參數向量,因為它們只是彼此的clone,因此,損失函數相對於識別器參數向量的完整梯度就是每個識別器實例產生的梯度向量總和。[維特比](http://terms.naer.edu.tw/detail/6239753/)訓練,雖然形式不同,但通常應用於基於[HMM](http://terms.naer.edu.tw/detail/3617876/)(隱藏式馬可夫模型)的語音辨識系統\[28\]。類似的演算法已經應用於集成具時間對齊的神經網路的語音辨識系統\[71\],\[72\],\[76\],或混合神經網路/HMM系統\[29\],\[74\],\[75\]。
:::
:::info
While it seems simple and satisfying, this training architecture has a flaw that can potentially be fatal. The problem was already mentioned in Section II-C. If the recognizer is a simple neural network with sigmoid output units, the minimum of the loss function is attained, not when the recognizer always gives the right answer, but when it ignores the input, and sets its output to a constant vector with small values for all the components. This is known as the collapse problem. The collapse only occurs if the recognizer outputs can simultaneously take their minimum value. If on the other hand the recognizer's output layer contains RBF units with fixed parameters, then there is no such trivial solution. This is due to the fact that a set of RBF with fixed distinct parameter vectors cannot simultaneously take their minimum value. In this case, the complete collapse described above does not occur. However, this does not totally prevent the occurrence of a milder collapse because the loss function still has a "flat spot" for a trivial solution with constant recognizer output. This flat spot is a saddle point, but it is attractive in almost all directions and is very difficult to get out of using gradient-based minimization procedures. If the parameters of the RBFs are allowed to adapt, then the collapse problem reappears because the RBF centers can all converge to a single vector, and the underlying neural network can learn to produce that vector, and ignore the input. A different kind of collapse occurs if the width of the RBFs are also allowed to adapt. The collapse only occurs if a trainable module such as a neural network feeds the RBFs. The collapse does not occur in HMM-based speech recognition systems because they are generative systems that produce normalized likelihoods for the input data (more on this later).
Another way to avoid the collapse is to train the whole system with respect to a discriminative training criterion, such as maximizing the conditional probability of the correct interpretations (correct sequence of class labels) given the input image. Another problem with Viterbi training is that the penalty of the answer cannot be used reliably as a measure of confidence because it does not take low-penalty (or high-scoring) competing answers into account.
:::
:::success
儘管看似簡單且令人滿意,這個訓練架構仍然存在一個可能致命的缺陷。這問題已經在II-C章節提到。如果識別器是一個帶有sigmoid輸出單元的簡單神經網路,那麼損失函數的最小值並不是在識別器總是給出正確答案時達到,而是在它忽略輸入、把輸出設為所有分量都是小數值的常數向量時達到。這稱為崩潰問題。只有在識別器的各個輸出可以同時取其最小值的時候才會崩潰。另一方面,如果識別器的輸出層包含具固定參數的RBF單元,那就不存在這種平凡解。這是因為一組具有固定且相異參數向量的RBF不能同時取其最小值。在這情況下,上面所說的完全崩潰不會發生。然而,這並不能完全防止較溫和的崩潰發生,因為對於識別器輸出為常數的平凡解,損失函數仍然存在一個"[平坦點](http://terms.naer.edu.tw/detail/2116266/)"。這個[平坦點](http://terms.naer.edu.tw/detail/2116266/)是一個鞍點,但它在幾乎所有方向上都具有吸引力,而且使用基於梯度的最小化程序很難從中脫離。如果允許RBF的參數進行調整,那崩潰的問題會再出現,因為RBF中心可以全部收斂到單個向量,而且其下的神經網路可以學習生成這個向量,而忽略輸入。如果還允許調整RBF的寬度,就會發生另一種崩潰。只有在可訓練模組(如神經網路)饋入RBF的時候才會發生崩潰。在基於HMM的語音辨識系統中不會發生崩潰,因為它們是生成系統,會為輸入資料生成[正規化](http://terms.naer.edu.tw/detail/2400631/)的[似然度](http://terms.naer.edu.tw/detail/2118868/)(後續說明)。另一種避免崩潰的方法是以判別式訓練標準來訓練整個系統,像是最大化給定輸入影像下正確解釋(類別標籤的正確序列)的條件機率。
[維特比](http://terms.naer.edu.tw/detail/6239753/)訓練的另一個問題是,答案的懲罰不能可靠地用作信心的衡量標準,因為它沒有考慮低懲罰(或高分)的競爭答案。
:::
### B. Discriminative Viterbi Training
:::info
A modification of the training criterion can circumvent the collapse problem described above and at the same time produce more reliable confidence values. The idea is to not only minimize the cumulated penalty of the lowest penalty path with the correct interpretation, but also to somehow increase the penalty of competing and possibly incorrect paths that have a dangerously low penalty. This type of criterion is called discriminative, because it plays the good answers against the bad ones. Discriminative training procedures can be seen as attempting to build appropriate separating surfaces between classes rather than to model individual classes independently of each other. For example, modeling the conditional distribution of the classes given the input image is more discriminative (focusing more on the classification surface) than having a separate generative model of the input data associated to each class (which, with class priors, yields the whole joint distribution of classes and inputs). This is because the conditional approach does not need to assume a particular form for the distribution of the input data. One example of discriminative criterion is the difference between the penalty of the Viterbi path in the constrained graph, and the penalty of the Viterbi path in the (unconstrained) interpretation graph, i.e. the difference between the penalty of the best correct path, and the penalty of the best path (correct or incorrect). The corresponding GTN training architecture is shown in figure 20. The left side of the diagram is identical to the GTN used for non-discriminative Viterbi training. This loss function reduces the risk of collapse because it forces the recognizer to increase the penalty of wrongly recognized objects. Discriminative training can also be seen as another example of error correction procedure, which tends to minimize the difference between the desired output computed in the left half of the GTN in figure 20 and the actual output computed in the right half of figure 20.
Let the discriminative Viterbi loss function be denoted $E_{dvit}$, and let us call $C_{cvit}$ the penalty of the Viterbi path in the constrained graph, and $C_{vit}$ the penalty of the Viterbi path in the unconstrained interpretation graph:
![](https://i.imgur.com/hSrkeKy.png)
$E_{dvit}$ is always positive since the constrained graph is a subset of the paths in the interpretation graph, and the Viterbi algorithm selects the path with the lowest total penalty. In the ideal case, the two paths $C_{cvit}$ and $C_{vit}$ coincide, and $E_{dvit}$ is zero.
:::
:::success
訓練準則的修正可以避免上述的崩潰問題,同時產生更可靠的置信度值。這個想法不僅是要最小化具有正確解釋的最低懲罰路徑的累積懲罰,也要設法提高那些懲罰低得危險、可能不正確的競爭路徑的懲罰。這種類型的標準稱為判別式,因為它讓好的答案與壞的答案相互競爭。判別式訓練程序可以視為嘗試在類別之間建立適當的分離面,而不是彼此獨立地對各個類別建模。舉例來說,對給定輸入影像下的類別條件分佈建模(更關注分類面),比為每個類別各建立一個輸入資料的生成模型(加上類別先驗後,可得到類別與輸入的完整聯合分佈)更具判別性。這是因為條件方法不需要為輸入資料的分佈假設特定形式。
判別式標準的一個範例是,約束圖中[維特比](http://terms.naer.edu.tw/detail/6239753/)路徑的懲罰與(無約束)解釋圖中[維特比](http://terms.naer.edu.tw/detail/6239753/)路徑的懲罰之間的差異,即最佳正確路徑的懲罰與最佳路徑(正確或不正確)懲罰的差異。相對應的GTN訓練架構如圖20所示。該圖左邊與用於非判別性[維特比](http://terms.naer.edu.tw/detail/6239753/)訓練的GTN相同。這種損失函數降低崩潰的風險,因為它強制識別器增加對錯誤辨識物件的懲罰。判別性訓練也可以視為另一種[錯誤校正](http://terms.naer.edu.tw/detail/2455012/)程序的範例,它傾向於最小化圖20中GTN的左半部份的期望輸出與圖20中右半部的實際輸出之間的差異。
假設判別式[維特比](http://terms.naer.edu.tw/detail/6239753/)損失函數以$E_{dvit}$表示,將$C_{cvit}$表示為約束圖中[維特比](http://terms.naer.edu.tw/detail/6239753/)路徑的懲罰,$C_{vit}$為[無約束](http://terms.naer.edu.tw/detail/2433496/)解釋圖中的[維特比](http://terms.naer.edu.tw/detail/6239753/)路徑的懲罰:
![](https://i.imgur.com/Phu059F.png)
$E_{dvit}$始終為正,因為約束圖為解釋圖中的路徑子集,而且[維特比](http://terms.naer.edu.tw/detail/6239753/)演算法選擇具有最低總懲罰的路徑。理想狀況下,兩個路徑$C_{cvit}$與$C_{vit}$是[重合](http://terms.naer.edu.tw/detail/2348962/)的,而$E_{dvit}$為0。
:::
:::info
![](https://i.imgur.com/jlo8S0a.png)
Fig. 20. Discriminative Viterbi Training GTN Architecture for a character string recognizer based on Heuristic Over-Segmentation. Quantities in square brackets are penalties computed during the forward propagation. Quantities in parentheses are partial derivatives computed during the backward propagation.
圖20. 基於Heuristic Over-Segmentation的判別性[維特比](http://terms.naer.edu.tw/detail/6239753/)訓練GTN架構,用於字符串辨識器。方括號中的量是正向傳播期間計算的懲罰,圓括號中的量是反向傳播期間計算的偏導數。
:::
:::info
Back-propagating gradients through the discriminative Viterbi GTN adds some "negative" training to the previously described non-discriminative training. Figure 20 shows how the gradients are back-propagated. The left half is identical to the non-discriminative Viterbi training GTN, therefore the back-propagation is identical. The gradients back-propagated through the right half of the GTN are multiplied by -1, since $C_{vit}$ contributes to the loss with a negative sign. Otherwise the process is similar to the left half. The gradients on arcs of $G_{int}$ get positive contributions from the left half and negative contributions from the right half. The two contributions must be added, since the penalties on $G_{int}$ arcs are sent to the two halves through a "Y" connection in the forward pass. Arcs in $G_{int}$ that appear neither in $G_{vit}$ nor in $G_{cvit}$ have a gradient of zero. They do not contribute to the cost. Arcs that appear in both $G_{vit}$ and $G_{cvit}$ also have zero gradient. The -1 contribution from the right half cancels the +1 contribution from the left half. In other words, when an arc is rightfully part of the answer, there is no gradient. If an arc appears in $G_{cvit}$ but not in $G_{vit}$, the gradient is +1. The arc should have had a lower penalty to make it to $G_{vit}$.
If an arc is in $G_{vit}$ but not in $G_{cvit}$, the gradient is -1. The arc had a low penalty, but should have had a higher penalty since it is not part of the desired answer. Variations of this technique have been used for speech recognition. Driancourt and Bottou \[76\] used a version of it where the loss function is saturated to a fixed value. This can be seen as a generalization of the Learning Vector Quantization 2 (LVQ2) loss function \[80\]. Other variations of this method use not only the Viterbi path, but the K-best paths. The Discriminative Viterbi algorithm does not have the flaws of the non-discriminative version, but there are problems nonetheless. The main problem is that the criterion does not build a margin between the classes. The gradient is zero as soon as the penalty of the constrained Viterbi path is equal to that of the Viterbi path. It would be desirable to push up the penalties of the wrong paths when they are dangerously close to the good one. The following section presents a solution to this problem.
:::
:::success
透過判別性[維特比](http://terms.naer.edu.tw/detail/6239753/)GTN反向傳播梯度,為先前說明的非判別性訓練增加一些"負"的訓練。圖20說明梯度如何反向傳播。左半部與非判別性[維特比](http://terms.naer.edu.tw/detail/6239753/)訓練GTN相同,因此反向傳播是相同的。通過GTN右半部反向傳播的梯度要乘上-1,因為$C_{vit}$在損失中帶有負號。除此之外,其過程與左半部類似。$G_{int}$的弧上的梯度從左半部得到正的[貢獻](http://terms.naer.edu.tw/detail/3462412/),從右半部得到負的[貢獻](http://terms.naer.edu.tw/detail/3462412/)。這兩個[貢獻](http://terms.naer.edu.tw/detail/3462412/)必須相加,因為$G_{int}$弧上的懲罰會透過[正向](http://terms.naer.edu.tw/detail/2371564/)傳遞中的"Y"連結發送到左右[各半](http://terms.naer.edu.tw/detail/3478233/)。$G_{int}$中未出現在$G_{vit}$與$G_{cvit}$中的弧,其梯度為0。它們並不會對成本有貢獻。同時出現在$G_{vit}$與$G_{cvit}$中的弧,其梯度也是0。來自右半部的-1[貢獻](http://terms.naer.edu.tw/detail/3462412/)抵消掉來自左半部的+1[貢獻](http://terms.naer.edu.tw/detail/3462412/)。換句話說,當一個弧理所當然是答案的一部份的時候,它就沒有梯度。如果弧出現在$G_{cvit}$中,但不存在$G_{vit}$,其梯度為+1。這個弧的懲罰應該要更低,才能進入$G_{vit}$。如果弧存在$G_{vit}$,但不存在於$G_{cvit}$,其梯度為-1。該弧的懲罰較低,但由於它不是期望答案的一部份,應該要有較高的懲罰。
這技術的變體已經應用於語音辨識。Driancourt與Bottou\[76\]使用其中一個版本,其損失函數飽和到一個固定值。這可以視為是[學習向量量化2](http://terms.naer.edu.tw/detail/3623254/)(LVQ2)損失函數的推廣\[80\]。此方法的其它變體不僅使用[維特比](http://terms.naer.edu.tw/detail/6239753/)路徑,也使用K-best路徑。判別式[維特比](http://terms.naer.edu.tw/detail/6239753/)演算法不具有非判別式版本的缺點,但仍然存在問題。主要的問題在於,該標準不會在類別之間建立邊際(margin)。當約束[維特比](http://terms.naer.edu.tw/detail/6239753/)路徑的懲罰等於[維特比](http://terms.naer.edu.tw/detail/6239753/)路徑的懲罰時,梯度即為0。當錯誤的路徑危險地接近正確的路徑的時候,最好[上推](http://terms.naer.edu.tw/detail/2412634/)它們的懲罰。下一章節介紹此問題的解決方法。
:::
### C. Forward Scoring, and Forward Training
:::info
While the penalty of the Viterbi path is perfectly appropriate for the purpose of recognition, it gives only a partial picture of the situation. Imagine the lowest penalty paths corresponding to several different segmentations produced the same answer (the same label sequence). Then it could be argued that the overall penalty for the interpretation should be smaller than the penalty obtained when only one path produced that interpretation, because multiple paths with identical label sequences are more evidence that the label sequence is correct. Several rules can be used to compute the penalty associated to a graph that contains several parallel paths. We use a combination rule borrowed from a probabilistic interpretation of the penalties as negative log posteriors. In a probabilistic framework, the posterior probability for the interpretation should be the sum of the posteriors for all the paths that produce that interpretation.
Translated in terms of penalties, the penalty of an interpretation should be the negative logarithm of the sum of the negative exponentials of the penalties of the individual paths. The overall penalty will be smaller than all the penalties of the individual paths. Given an interpretation, there is a well known method, called the forward algorithm for computing the above quantity efficiently \[28\]. The penalty computed with this procedure for a particular interpretation is called the forward penalty. Consider again the concept of constrained graph, the subgraph of the interpretation graph which contains only the paths that are consistent with a particular label sequence. There is one constrained graph for each possible label sequence (some may be empty graphs, which have infinite penalties). Given an interpretation, running the forward algorithm on the corresponding constrained graph gives the forward penalty for that interpretation. The forward algorithm proceeds in a way very similar to the Viterbi algorithm, except that the operation used at each node to combine the incoming cumulated penalties, instead of being the min function is the so-called logadd operation, which can be seen as a "soft" version of the min function:
![](https://i.imgur.com/kEq9MDt.png)
where $f_{start} = 0$, $U_n$ is the set of upstream arcs of node $n$, $c_i$ is the penalty on arc $i$, and
![](https://i.imgur.com/rvz4elB.png)
Note that because of numerical inaccuracies, it is better to factorize the largest $e^{-x_i}$ (corresponding to the smallest penalty) out of the logarithm. An interesting analogy can be drawn if we consider that a graph on which we apply the forward algorithm is equivalent to a neural network on which we run a forward propagation, except that multiplications are replaced by additions, the additions are replaced by logadds, and there are no sigmoids. One way to understand the forward algorithm is to think about multiplicative scores (e.g., probabilities) instead of additive penalties on the arcs: score=exp(-penalty). In that case the Viterbi algorithm selects the path with the largest cumulative score (with scores multiplied along the path), whereas the forward score is the sum of the cumulative scores associated to each of the possible paths from the start to the end node. The forward penalty is always lower than the cumulated penalty on any of the paths, but if one path "dominates" (with a much lower penalty), its penalty is almost equal to the forward penalty. The forward algorithm gets its name from the forward pass of the well-known Baum-Welsh algorithm for training Hidden Markov Models \[28\]. Section VIII-E gives more details on the relation between this work and HMMs.
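:::

下面以 Python 示意正向演算法:把維特比遞迴中的 min 換成 logadd,並如文中所述先把最小的懲罰(也就是最大的 $e^{-x_i}$)提出對數之外以避免數值問題。結構沿用前面的示意,非論文的實作。

```python
import math

def logadd(xs):
    """logadd(x_1..x_k) = -log(sum_i exp(-x_i));先提出最小的 x 以保持數值穩定。"""
    x_min = min(xs)
    return x_min - math.log(sum(math.exp(-(x - x_min)) for x in xs))

def forward_penalty(graph, nodes_in_topo_order, start, end):
    """回傳起點到終點的正向懲罰 f_end(即 min 的 "soft" 版本)。"""
    f = {start: 0.0}
    for n in nodes_in_topo_order:
        if n == start:
            continue
        incoming = [f[a.src] + a.penalty for a in graph.arcs
                    if a.dst == n and a.src in f]     # 上游弧 U_n
        if incoming:
            f[n] = logadd(incoming)                   # f_n = logadd_i(c_i + f_{s_i})
    return f[end]
```

若某一條路徑的懲罰遠低於其它路徑,logadd 的結果幾乎等於該路徑的累積懲罰,這正對應文中「某條路徑佔優勢」的情況。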
:::success
儘管[維特比](http://terms.naer.edu.tw/detail/6239753/)路徑的懲罰對辨識來說是非常適合,但它只給出部份情況。想像一下,對應數個不同分段的最低懲罰路徑生成相同的答案(相同標籤序列)。那麼可以認為,該解釋的總體懲罰應該小於只有一條路徑產生該解釋時獲得的懲罰,因為具相同標籤序列的多個路徑更能證明該標籤序列是正確的。可以使用幾個規則來計算包含多個平行路徑的圖的相關懲罰。我們使用的組合規則,借用自「把懲罰解釋為負對數後驗機率」這種機率性解釋。在機率框架中,該解釋的[後驗機率](http://terms.naer.edu.tw/detail/3189962/)應該為生成該解釋的所有路徑的後驗機率的總和。從懲罰的角度來看,一個解釋的懲罰應該是各別路徑懲罰的負指數總和的負對數。總體的懲罰將會小於各別路徑的所有懲罰。
給定一個解釋,有一個眾所皆知的方法,稱為[正向](http://terms.naer.edu.tw/detail/2371564/)演算法,用於有效地計算上述量\[28\]。以此程序計算而得的特定解釋的懲罰稱為[正向](http://terms.naer.edu.tw/detail/2371564/)懲罰。再次考慮約束圖的概念,也就是解釋圖中僅包含與特定標籤序列一致的路徑的子圖。每個可能的標籤序列都有一個約束圖(部份可能是[空圖](http://terms.naer.edu.tw/detail/2364040/),擁有無限大的懲罰)。給定一個解釋,在相對應的約束圖上執行[正向](http://terms.naer.edu.tw/detail/2371564/)演算法即給出該解釋的[正向](http://terms.naer.edu.tw/detail/2371564/)懲罰。[正向](http://terms.naer.edu.tw/detail/2371564/)演算法以類似[維特比](http://terms.naer.edu.tw/detail/6239753/)演算法的方式執行,除了在每個節點上用於組合傳入累積懲罰的操作不是"min"函數,而是所謂的"logadd"操作,它可以視為"min"函數的"soft"版本:
![](https://i.imgur.com/NhqaKES.png)
其中$f_{start} = 0$,$U_n$是節點$n$的上游弧集合,$c_i$是弧$i$上的懲罰,並且:
![](https://i.imgur.com/YyxPmeG.png)
注意到,因為數值[不準確](http://terms.naer.edu.tw/detail/2378499/),最好從對數中分解出最大的$e^{-x_i}$(對應於最小懲罰)。
如果我們認為應用[正向](http://terms.naer.edu.tw/detail/2371564/)演算法的圖,等效於在其上執行[正向](http://terms.naer.edu.tw/detail/2371564/)傳播的神經網路,那可以得到一個有趣的類比,只是乘法被加法取代,加法被"logadd"取代,而且沒有sigmoid。
理解[正向](http://terms.naer.edu.tw/detail/2371564/)演算法的一個方法,就是考慮弧上[乘法](http://terms.naer.edu.tw/detail/2120128/)性的分數(像是機率),而不是加法性的懲罰:score=exp(-penalty)。那種情況下,[維特比](http://terms.naer.edu.tw/detail/6239753/)演算法選擇具有最大累積分數的路徑(分數沿路徑相乘),然而,[正向](http://terms.naer.edu.tw/detail/2371564/)分數是從起始到結束節點的每一條可能路徑的累積分數的總和。[正向](http://terms.naer.edu.tw/detail/2371564/)懲罰總是低於任一路徑上的累積懲罰,但如果一條路徑"[佔優勢](https://tw.voicetube.com/definition/dominate)"(懲罰低得多),其懲罰幾乎等於[正向](http://terms.naer.edu.tw/detail/2371564/)懲罰。[正向](http://terms.naer.edu.tw/detail/2371564/)演算法的名稱來自訓練隱馬可夫模型的著名Baum-Welsh演算法的正向傳遞\[28\]。第VIII-E章節中有更多關於這項工作與HMMs之間關係的細節。
:::
:::info
The advantage of the forward penalty with respect to the Viterbi penalty is that it takes into account all the different ways to produce an answer, and not just the one with the lowest penalty. This is important if there is some ambiguity in the segmentation, since the combined forward penalty of two paths $C_1$ and $C_2$ associated with the same label sequence may be less than the penalty of a path $C_3$ associated with another label sequence, even though the penalty of $C_3$ might be less than any one of $C_1$ or $C_2$. The Forward training GTN is only a slight modification of the previously introduced Viterbi training GTN. It suffices to turn the Viterbi transformers in Figure 19 into Forward Scorers that take an interpretation graph as input and produce the forward penalty of that graph on output. Then the penalties of all the paths that contain the correct answer are lowered, instead of just that of the best one. Back-propagating through the forward penalty computation (the forward transformer) is quite different from back-propagating through a Viterbi transformer. All the penalties of the input graph have an influence on the forward penalty, but penalties that belong to low-penalty paths have a stronger influence. Computing derivatives with respect to the forward penalties $f_n$ computed at each $n$ node of a graph is done by back-propagation through the graph $G_c$:
![](https://i.imgur.com/vUwdOml.png)
where $D_n=$ \{ arc $i$ with source $s_i=n$ \} is the set of downstream arcs from node $n$.
From the above derivatives, the derivatives with respect to the arc penalties are obtained:
![](https://i.imgur.com/X92gxvx.png)
This can be seen as a "soft" version of the back-propagation through a Viterbi scorer and transformer. All the arcs in $G_c$ have an influence on the loss function. The arcs that belong to low penalty paths have a larger influence. Back-propagation through the path selector is the same as before. The derivatives with respect to $G_{int}$ arcs that have an alter ego in $G_c$ are simply copied from the corresponding arc in $G_c$. The derivatives with respect to the other arcs are 0. Several authors have applied the idea of back-propagating gradients through a forward scorer to train speech recognition systems, including Bridle and his $\alpha$-net model \[73\] and Haffner and his $\alpha \beta$-TDNN model \[81\], but these authors recommended discriminative training as described in the next section.
:::
:::success
相對於[維特比](http://terms.naer.edu.tw/detail/6239753/)懲罰,[正向](http://terms.naer.edu.tw/detail/2371564/)懲罰的優勢在於它考慮所有不同的方法來產生一個答案,而不僅是懲罰最低的那一個。如果分段存在某些模糊性,這一點非常重要,因為與相同標籤序列相關的兩個路徑$C_1$與$C_2$的組合正向懲罰,也許小於與另一標籤序列相關的路徑$C_3$的懲罰,即使$C_3$的懲罰可能小於$C_1$與$C_2$中的任一個。
[正向](http://terms.naer.edu.tw/detail/2371564/)訓練GTN只是對先前介紹的[維特比](http://terms.naer.edu.tw/detail/6239753/)訓練GTN的些許修正。只需把圖19中的[維特比](http://terms.naer.edu.tw/detail/6239753/)轉換器換成正向計分器即可:它以解釋圖為輸入,並在輸出端產生該圖的[正向](http://terms.naer.edu.tw/detail/2371564/)懲罰。這樣一來,被降低的是包含正確答案的所有路徑的懲罰,而不只是最佳路徑的懲罰。
通過[正向](http://terms.naer.edu.tw/detail/2371564/)懲罰計算([正向](http://terms.naer.edu.tw/detail/2371564/)轉換器)做反向傳播,與通過[維特比](http://terms.naer.edu.tw/detail/6239753/)轉換器做反向傳播非常不同。輸入圖的所有懲罰對正向懲罰都有影響,但屬於低懲罰路徑的懲罰影響更大。相對於在圖的每個節點$n$上計算出的正向懲罰$f_n$的導數,是通過圖$G_c$的反向傳播來計算的:
![](https://i.imgur.com/f6cmK3E.png)
其中$D_n=$ \{ arc $i$ with source $s_i=n$ \}是節點$n$的下游弧的集合。從上面的導數中,可以獲得相對於弧懲罰的導數:
![](https://i.imgur.com/AqvdpTl.png)
這可以視為是通過[維特比](http://terms.naer.edu.tw/detail/6239753/)計分器與轉換器的反向傳播的"soft"版本。$G_c$中的所有弧都會影響損失函數。屬於低懲罰路徑的弧會有較大的影響。通過路徑選擇器的反向傳播與之前相同。對於在$G_c$中有對應弧的$G_{int}$弧,其導數直接從$G_c$中相對應的弧複製而來。相對於其它弧的導數為0。
一些作者已經把通過正向計分器反向傳播梯度的想法應用於訓練語音辨識系統,包含Bridle與他的$\alpha$-net模型\[73\],以及Haffner與他的$\alpha \beta$-TDNN模型\[81\],但這些作者建議做判別式訓練,如下一章節所說明。
:::
### D. Discriminative Forward Training
:::info
The information contained in the forward penalty can be used in another discriminative training criterion which we will call the discriminative forward criterion. This criterion corresponds to maximization of the posterior probability of choosing the paths associated with the correct interpretation. This posterior probability is defined as the exponential of minus the constrained forward penalty, normalized by the exponential of minus the unconstrained forward penalty. Note that the forward penalty of the constrained graph is always larger or equal to the forward penalty of the unconstrained interpretation graph. Ideally, we would like the forward penalty of the constrained graph to be equal to the forward penalty of the complete interpretation graph. Equality between those two quantities is achieved when the combined penalties of the paths with the correct label sequence is negligibly small compared to the penalties of all the other paths, or that the posterior probability associated to the paths with the correct interpretation is almost 1, which is precisely what we want. The corresponding GTN training architecture is shown in figure 21.
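:::

承接上面 C 節給出的導數公式,下面示意如何沿逆拓撲順序把 $\partial E / \partial f_n$ 從終點傳回各節點,再得到每個弧懲罰的導數(即文中所說、通過維特比計分器反向傳播的「soft」版本);判別式正向訓練便是分別對約束圖與無約束圖做這樣的反向傳播,再把右半部取負後相加。沿用前面示意的結構與 `forward_penalty`,僅為概念性草圖。

```python
import math

def forward_backprop(graph, nodes_in_topo_order, f, end, dE_dfend=1.0):
    """f:各節點的正向懲罰 f_n(假設已在正向計算時保存下來)。
    回傳 {id(arc): dE/dc_i},其中 dE/dc_i = dE/df_{d_i} * exp(f_{d_i} - c_i - f_{s_i})。"""
    dE_df = {n: 0.0 for n in f}
    dE_df[end] = dE_dfend
    arc_grads = {}
    for n in reversed(nodes_in_topo_order):           # 逆拓撲順序走訪
        if n not in f:
            continue
        for arc in graph.arcs:                        # 下游弧 D_n:以 n 為來源的弧
            if arc.src != n or arc.dst not in f:
                continue
            w = math.exp(f[arc.dst] - arc.penalty - f[n])   # 該弧的 "softmin" 權重
            dE_df[n] += dE_df[arc.dst] * w            # 節點導數:沿下游弧累加
            arc_grads[id(arc)] = dE_df[arc.dst] * w   # 弧懲罰的導數
    return arc_grads
```

:::info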
Let the difference be denoted $E_{dforw}$, and let us call $C_{cforw}$ the forward penalty of the constrained graph, and $C_{forw}$ the forward penalty of the complete interpretation graph:
![](https://i.imgur.com/m5uUo8K.png)
$E_{dforw}$ is always positive since the constrained graph is a subset of the paths in the interpretation graph, and the forward penalty of a graph is always larger than the forward penalty of a subgraph of this graph. In the ideal case, the penalties of incorrect paths are infinitely large, therefore the two penalties coincide and $E_{dforw}$ is zero. Readers familiar with the Boltzmann machine connectionist model might recognize the constrained and unconstrained graphs as analogous to the "clamped" (constrained by the observed values of the output variable) and "free" (unconstrained) phases of the Boltzmann machine algorithm \[13\].
:::
:::success
[正向](http://terms.naer.edu.tw/detail/2371564/)懲罰中包含的信息可用於另一個判別式訓練標準,我們將它稱為判別式[正向](http://terms.naer.edu.tw/detail/2371564/)標準。這個標準對應於「選擇與正確解釋相關聯的路徑」的[後驗機率](http://terms.naer.edu.tw/detail/3189962/)的最大化。這個[後驗機率](http://terms.naer.edu.tw/detail/3189962/)定義為負的約束正向懲罰的指數,再以負的[無約束](http://terms.naer.edu.tw/detail/941064/)[正向](http://terms.naer.edu.tw/detail/2371564/)懲罰的指數做[正規化](http://terms.naer.edu.tw/detail/721767/)。注意到,約束圖的[正向](http://terms.naer.edu.tw/detail/2371564/)懲罰總是大於或等於[無約束](http://terms.naer.edu.tw/detail/941064/)解釋圖的[正向](http://terms.naer.edu.tw/detail/2371564/)懲罰。理想情況下,我們希望約束圖的[正向](http://terms.naer.edu.tw/detail/2371564/)懲罰等於完整解釋圖的[正向](http://terms.naer.edu.tw/detail/2371564/)懲罰。當具有正確標籤序列的路徑的組合懲罰,與所有其它路徑的懲罰相比可忽略不計,或者說與正確解釋的路徑相關的[後驗機率](http://terms.naer.edu.tw/detail/3189962/)幾乎為1的時候,這兩個量就會相等,這正是我們想要的。相對應的GTN訓練架構如圖21所示。
假設差異以$E_{dforw}$表示,並將$C_{cforw}$稱為約束圖的[正向](http://terms.naer.edu.tw/detail/2371564/)懲罰,$C_{forw}$為完整解釋圖的[正向](http://terms.naer.edu.tw/detail/2371564/)懲罰:
![](https://i.imgur.com/m5uUo8K.png)
$E_{dforw}$始終為正,因為約束圖是解釋圖中路徑的子集,而且圖的[正向](http://terms.naer.edu.tw/detail/2371564/)懲罰始終大於該圖的子圖的[正向](http://terms.naer.edu.tw/detail/2371564/)懲罰。在理想情況下,錯誤路徑的懲罰是無限大的,因此這兩個懲罰是[重合](http://terms.naer.edu.tw/detail/2348962/)的,並且$E_{dforw}$為0。熟悉[波茲曼機](http://terms.naer.edu.tw/detail/2342986/)連接模型的讀者可能會發現,約束圖與[無約束](http://terms.naer.edu.tw/detail/941064/)圖類似於[波茲曼機](http://terms.naer.edu.tw/detail/2342986/)演算法\[13\]的"clamped"(受輸出變數的觀察值約束)與"free"([無約束](http://terms.naer.edu.tw/detail/941064/))階段。
:::
:::info
![](https://i.imgur.com/mf1cJ0S.png)
Fig. 21. Discriminative Forward Training GTN Architecture for a character string recognizer based on Heuristic Over-Segmentation.
圖21. 基於Heuristic Over-Segmentation的字符串識別器的判別式正向訓練GTN架構。
:::
:::info
Back-propagating derivatives through the discriminative Forward GTN distributes gradients more evenly than in the Viterbi case. Derivatives are back-propagated through the left half of the GTN in Figure 21 down to the interpretation graph. Derivatives are negated and back-propagated through the right-half, and the result for each arc is added to the contribution from the left half. Each arc in $G_{int}$ now has a derivative. Arcs that are part of a correct path have a positive derivative. This derivative is very large if an incorrect path has a lower penalty than all the correct paths. Similarly, the derivatives with respect to arcs that are part of a low-penalty incorrect path have a large negative derivative. On the other hand, if the penalty of a path associated with the correct interpretation is much smaller than all other paths, the loss function is very close to 0 and almost no gradient is back-propagated.
The training therefore concentrates on examples of images which yield a classification error, and furthermore, it concentrates on the pieces of the image which cause that error. Discriminative forward training is an elegant and efficient way of solving the infamous credit assignment problem for learning machines that manipulate "dynamic" data structures such as graphs. More generally, the same idea can be used in all situations where a learning machine must choose between discrete alternative interpretations. As previously, the derivatives on the interpretation graph penalties can then be back-propagated into the character recognizer instances. Back-propagation through the character recognizer gives derivatives on its parameters. All the gradient contributions for the different candidate segments are added up to obtain the total gradient associated to one pair (input image, correct label sequence), that is, one example in the training set. A step of stochastic gradient descent can then be applied to update the parameters.
:::
:::success
通過判別式[正向](http://terms.naer.edu.tw/detail/2371564/)GTN反向傳播導數,比[維特比](http://terms.naer.edu.tw/detail/6239753/)的情況更均勻地分佈梯度。導數通過圖21中GTN的左半部反向傳播,一直到解釋圖。導數取負後通過右半部反向傳播,每個弧的結果會加到左半部的貢獻中。$G_{int}$中的每個弧現在都有一個導數。屬於正確路徑的弧具有正導數。如果一個不正確的路徑比所有正確路徑的懲罰都低,那這個導數會非常大。同樣地,屬於低懲罰錯誤路徑的弧,其導數是很大的負值。另一方面,如果與正確解釋相關聯的路徑的懲罰比所有其它路徑都小得多,損失函數非常接近0,而且幾乎沒有梯度反向傳播。因此,訓練著重於產生分類錯誤的影像範例,此外,它著重於引起該錯誤的影像片段。對於[操作](http://terms.naer.edu.tw/detail/3263938/)"dynamic"(動態)資料結構(如圖)的學習機器,判別式[正向](http://terms.naer.edu.tw/detail/2371564/)訓練是解決惡名昭彰的信用分配問題的一種優雅且有效的方法。更一般而言,在學習機器必須在離散的替代解釋之間進行選擇的所有情況下,都可以使用相同的想法。
如前所述,解釋圖懲罰上的導數接著可以反向傳播到字符識別器實例中。通過字符識別器的反向傳播給出其參數上的導數。不同候選片段的所有梯度貢獻相加,以獲得與一對(輸入影像,正確標籤序列)相關聯的總梯度,也就是訓練集中的一個範例。然後可以應用隨機梯度下降步驟來更新參數。
:::
### E. Remarks on Discriminative Training
:::info
In the above discussion, the global training criterion was given a probabilistic interpretation, but the individual penalties on the arcs of the graphs were not. There are good reasons for that. For example, if some penalties are associated to the different class labels, they would (1) have to sum to 1 (class posteriors), or (2) integrate to 1 over the input domain (likelihoods). Let us first discuss the first case (class posteriors normalization). This local normalization of penalties may eliminate information that is important for locally rejecting all the classes \[82\], e.g., when a piece of image does not correspond to a valid character class, because some of the segmentation candidates may be wrong. Although an explicit "garbage class" can be introduced in a probabilistic framework to address that question, some problems remain because it is difficult to characterize such a class probabilistically and to train a system in this way (it would require a density model of unseen or unlabeled samples). The probabilistic interpretation of individual variables plays an important role in the Baum-Welsh algorithm in combination with the Expectation-Maximization procedure. Unfortunately, those methods cannot be applied to discriminative training criteria, and one is reduced to using gradient-based methods. Enforcing the normalization of the probabilistic quantities while performing gradient-based learning is complex, inefficient, time consuming, and creates ill-conditioning of the loss-function. Following \[82\], we therefore prefer to postpone normalization as far as possible (in fact, until the final decision stage of the system).
Without normalization, the quantities manipulated in the system do not have a direct probabilistic interpretation.
:::
:::success
在上面的討論中,全域訓練標準得到一個機率解釋,但是圖上的弧的各別懲罰則沒有。這有充份的理由。舉例來說,如果部份懲罰與不同類別標籤相關聯,那它們將(1)必須相加為1(類別後驗),或(2)在輸入域上積分為1([似然性](http://terms.naer.edu.tw/detail/366257/))。
讓我們首先討論第一種狀況(類別後驗[正規化](http://terms.naer.edu.tw/detail/420654/))。這種懲罰的局部[正規化](http://terms.naer.edu.tw/detail/420654/)或許會消除對「局部拒絕所有類別」很重要的信息\[82\],舉例來說,當一塊影像不對應於任何有效字符類別的時候,這是因為某些候選分段可能是錯的。雖然可以在機率性框架中引入一個明確的「垃圾類別」來解決這個問題,但依然存在部份問題,因為很難用機率來刻劃這個類別,也很難用這種方式訓練系統(這將需要一個未見過或未標記樣本的密度模型)。
各別變數的機率性解釋,在Baum-Welsh演算法與[期望最大化](http://terms.naer.edu.tw/detail/3648649/)程序的結合中有著重要作用。不幸的是,這些方法不能應用在判別式訓練標準,只能退而使用基於梯度的方法。在執行基於梯度學習時,強制機率量的[正規化](http://terms.naer.edu.tw/detail/420654/)既複雜、沒效率又耗時,而且會造成損失函數的[病態](http://terms.naer.edu.tw/detail/3188252/)。因此,根據\[82\],我們傾向於儘可能地延遲[正規化](http://terms.naer.edu.tw/detail/420654/)(事實上,一直到系統的最終決策階段)。沒有[正規化](http://terms.naer.edu.tw/detail/420654/),系統中操作的量就沒有直接的機率性解釋。
:::
:::info
Let us now discuss the second case (using a generative model of the input). Generative models build the boundary indirectly, by first building an independent density model for each class, and then performing classification decisions on the basis of these models. This is not a discriminative approach in that it does not focus on the ultimate goal of learning, which in this case is to learn the classification decision surface. Theoretical arguments \[6\], \[7\] suggest that estimating input densities when the real goal is to obtain a discriminant function for classification is a suboptimal strategy. In theory, the problem of estimating densities in high-dimensional spaces is much more ill-posed than finding decision boundaries. Even though the internal variables of the system do not have a direct probabilistic interpretation, the overall system can still be viewed as producing posterior probabilities for the classes. In fact, assuming that a particular label sequence is given as the "desired sequence" to the GTN in figure 21, the exponential of minus $E_{dforw}$ can be interpreted as an estimate of the posterior probability of that label sequence given the input. The sum of those posteriors for all the possible label sequences is 1. Another approach would consist of directly minimizing an approximation of the number of misclassifications \[83\], \[76\]. We prefer to use the discriminative forward loss function because it causes less numerical problems during the optimization. We will see in Section X-C that this is a good way to obtain scores on which to base a rejection strategy. The important point being made here is that one is free to choose any parameterization deemed appropriate for a classification model. The fact that a particular parameterization uses internal variables with no clear probabilistic interpretation does not make the model any less legitimate than models that manipulate normalized quantities. An important advantage of global and discriminative training is that learning focuses on the most important errors, and the system learns to integrate the ambiguities from the segmentation algorithm with the ambiguities of the character recognizer. In Section IX we present experimental results with an on-line handwriting recognition system that confirm the advantages of using global training versus separate training. Experiments in speech recognition with hybrids of neural networks and HMMs also showed marked improvements brought by global training \[77\], \[29\], \[67\], \[84\].
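:::

下面把前面示意的 `forward_penalty` 組合成判別式正向損失的簡單草圖:$E_{dforw} = C_{cforw} - C_{forw}$,而 $\exp(-E_{dforw})$ 就是文中所說「給定輸入下,該標籤序列的後驗機率估計」。`g_constrained` 假設已由路徑選擇器從 $G_{int}$ 取出與正確標籤序列一致的路徑;名稱與介面皆為示意用的假設。

```python
import math

def discriminative_forward_loss(g_int, g_constrained, topo, start, end):
    """E_dforw = 約束圖的正向懲罰 - 無約束解釋圖的正向懲罰(恆 >= 0)。"""
    c_forw  = forward_penalty(g_int, topo, start, end)          # 無約束
    c_cforw = forward_penalty(g_constrained, topo, start, end)  # 約束(只含正確標籤序列)
    e_dforw = c_cforw - c_forw
    posterior = math.exp(-e_dforw)   # 正確標籤序列的後驗機率估計
    return e_dforw, posterior
```

對應的梯度即為:對約束圖與無約束圖各做一次 `forward_backprop`,右半部(無約束圖)取負後與左半部相加,再回傳給各識別器實例。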
:::success
現在,讓我們討論第二種情況(使用輸入的生成模型)。生成模型間接地建立邊界:首先為每個類別建立一個獨立的密度模型,然後依據這些模型執行分類決策。這並不是判別式方法,因為它並不關注學習的最終目標,在這種情況下,最終目標是學習分類[決策面](http://terms.naer.edu.tw/detail/2357266/)。理論論證\[6\],\[7\]表明,當實際目標是獲得用於分類的判別函數時,估計輸入密度是一個次優策略。理論上,在高維度空間中估計密度的問題,比尋找決策邊界更加不適定(ill-posed)。
即使系統的內部變數沒有直接的機率性解釋,整個系統仍然可以視為生成類別的[後驗機率](http://terms.naer.edu.tw/detail/150830/)。事實上,假設將特定的標籤序列作為圖21中GTN的"所需序列",那麼負$E_{dforw}$的指數可以解釋為:給定輸入下,該標籤序列的[後驗機率](http://terms.naer.edu.tw/detail/150830/)的估計值。對所有可能的標籤序列而言,這些後驗的總和為1。另一種方法是直接最小化錯誤分類數目的近似值\[83\],\[76\]。我們更喜歡使用判別式[正向](http://terms.naer.edu.tw/detail/2371564/)損失函數,因為它在最佳化過程中引起的數值問題較少。我們將在X-C章節中看到,這是一個獲得分數的好方法,可以此為拒絕策略的基礎。這邊要說的重點是,可以自由選擇任何認為適合分類模型的[參數化](http://terms.naer.edu.tw/detail/254345/)。特定參數化使用沒有明確機率解釋的內部變數,這個事實並不會使該模型比操作[正規化](http://terms.naer.edu.tw/detail/721767/)量的模型更不合理。
全域與判別式訓練的一個重要優勢在於,學習會集中在最重要的錯誤上,而且系統會學習把分段演算法的模糊性與字符識別器的模糊性整合起來。我們將在第IX章中展示線上手寫辨識系統的實驗結果,證實全域訓練相對於分別訓練的優勢。神經網路與HMM混合的語音辨識實驗也顯示全域訓練帶來顯著的改善\[77\],\[29\],\[67\],\[84\]。
:::
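作為收尾,下面用一個玩具解釋圖驗證上文的說法:對所有可能標籤序列的 $\exp(-E_{dforw})$ 加總為 1(亦即這些後驗機率是正規化的)。圖中只有一段墨跡、兩個類別假設,數值純屬假設,並沿用前面示意的 `Graph`/`Arc` 與 `forward_penalty`。

```python
import math

# 玩具解釋圖:start(0) -> end(1),同一段墨跡有 "3" 與 "8" 兩個類別假設
g = Graph(arcs=[Arc(0, 1, 1.2, "3"), Arc(0, 1, 2.0, "8")])
topo = [0, 1]
c_forw = forward_penalty(g, topo, 0, 1)        # 無約束解釋圖的正向懲罰

total = 0.0
for label in ("3", "8"):
    g_c = Graph(arcs=[a for a in g.arcs if a.label == label])   # 路徑選擇器:約束圖
    c_cforw = forward_penalty(g_c, topo, 0, 1)                   # 約束圖的正向懲罰
    total += math.exp(-(c_cforw - c_forw))                       # exp(-E_dforw)

print(round(total, 6))   # 輸出 1.0:各標籤序列的後驗機率總和為 1
```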