# Gradient-Based Learning Applied to Document Recognition_Paper(翻譯)(VII, VIII)
###### tags: `LeNet-5` `CNN` `論文翻譯` `deeplearning` `Gradient-Based Learning Applied to Document Recognition`
>[name=Shaoe.chen] [time=Mon, Jan 6, 2020 1:14 PM]
[TOC]
## 說明
區塊如下分類,原文區塊為藍底,翻譯區塊為綠底,部份專業用語翻譯參考國家教育研究院
:::info
原文
:::
:::success
翻譯
:::
:::warning
任何的翻譯不通暢部份都請留言指導
:::
:::danger
* [paper hyperlink](http://yann.lecun.com/exdb/publis/pdf/lecun-98.pdf)
* [\*paper hyperlink](http://vision.stanford.edu/cs598_spring07/papers/Lecun98.pdf)
:::
Page 27~34
## VII. Multiple Object Recognition: Space Displacement Neural Network
:::info
There is a simple alternative to explicitly segmenting images of character strings using heuristics. The idea is to sweep a recognizer at all possible locations across a normalized image of the entire word or string as shown in Figure 22. With this technique, no segmentation heuristics are required since the system essentially examines all the possible segmentations of the input. However, there are problems with this approach. First, the method is in general quite expensive. The recognizer must be applied at every possible location on the input, or at least at a large enough subset of locations so that misalignments of characters in the field of view of the recognizers are small enough to have no effect on the error rate. Second, when the recognizer is centered on a character to be recognized, the neighbors of the center character will be present in the field of view of the recognizer, possibly touching the center character. Therefore the recognizer must be able to correctly recognize the character in the center of its input field, even if neighboring characters are very close to, or touching the central character. Third, a word or character string cannot be perfectly size normalized. Individual characters within a string may have widely varying sizes and baseline positions. Therefore the recognizer must be very robust to shifts and size variations.
:::
:::success
有一個簡單的替代方法,可以取代使用[啟發式方法](http://terms.naer.edu.tw/detail/1824705/)對字符串影像做顯式[分段](http://terms.naer.edu.tw/detail/2419996/)。這個想法是,在整個單詞或字串的[正規化](http://terms.naer.edu.tw/detail/721767/)影像上,於所有可能的位置[掃描](http://terms.naer.edu.tw/detail/2427142/)一個識別器,如圖22所示。使用這個技術,不需要分段啟發式方法,因為系統實質上檢查了輸入的所有可能分段。然而,這個方法還是存在問題。首先,這種方法通常十分昂貴。識別器必須應用在輸入的每個可能位置上,或者至少在足夠大的位置子集上,以便字符在識別器[視野](http://terms.naer.edu.tw/detail/2368726/?index=7)中的[錯位](http://terms.naer.edu.tw/detail/854687/)足夠小,而不會影響誤差率。第二,當識別器以要辨識的字符為中心時,中心字符的相鄰字符會出現在識別器的[視野](http://terms.naer.edu.tw/detail/2378452/)中,甚至可能接觸到中心字符。因此,即使相鄰字符非常靠近或接觸到中心字符,識別器也必須能夠正確地辨識其輸入視野中心的字符。第三,一個單詞或字串無法完美地做大小[正規化](http://terms.naer.edu.tw/detail/721767/)。字串中的各別字符可能有差異很大的大小與基線位置。因此,識別器必須對[位移](http://terms.naer.edu.tw/detail/2422749/)與大小變化非常具有魯棒性。
:::
:::info

Fig. 22. Explicit segmentation can be avoided by sweeping a recognizer at every possible location in the input field.
圖22. 透過在輸入字段中的每個可能位置[掃描](http://terms.naer.edu.tw/detail/2427142/)一個識別器,可以避免顯式分段。
:::
:::info
These three problems are elegantly circumvented if a convolutional network is replicated over the input field. First of all, as shown in section III, convolutional neural networks are very robust to shifts and scale variations of the input image, as well as to noise and extraneous marks in the input. These properties take care of the latter two problems mentioned in the previous paragraph. Second, convolutional networks provide a drastic saving in computational requirement when replicated over large input fields. A replicated convolutional network, also called a Space Displacement Neural Network or SDNN \[27\], is shown in Figure 23. While scanning a recognizer can be prohibitively expensive in general, convolutional networks can be scanned or replicated very efficiently over large, variable-size input fields. Consider one instance of a convolutional net and its alter ego at a nearby location. Because of the convolutional nature of the network, units in the two instances that look at identical locations on the input have identical outputs, therefore their states do not need to be computed twice. Only a thin "slice" of new states that are not shared by the two network instances needs to be recomputed. When all the slices are put together, the result is simply a larger convolutional network whose structure is identical to the original network, except that the feature maps are larger in the horizontal dimension. In other words, replicating a convolutional network can be done simply by increasing the size of the fields over which the convolutions are performed, and by replicating the output layer accordingly. The output layer effectively becomes a convolutional layer. An output whose receptive field is centered on an elementary object will produce the class of this object, while an in-between output may indicate no character or contain rubbish. The outputs can be interpreted as evidences for the presence of object at all possible positions in the input field.
:::
:::success
如果在輸入字段上複製卷積網路,就可以巧妙地規避這三個問題。首先,如第三章節所說,卷積神經網路對輸入影像的[位移](http://terms.naer.edu.tw/detail/2422749/)與縮放變化,以及輸入中的噪點與[無關的](http://terms.naer.edu.tw/detail/3473091/)標記非常具有魯棒性。這些性質解決了上一段中提到的後兩個問題。第二,當在大型輸入字段上複製的時候,卷積網路可以大幅節省計算需求。一個複製的卷積網路,亦稱為Space Displacement Neural Network或SDNN \[27\],如圖23所示。一般而言,掃描一個識別器可能昂貴得令人卻步,但是卷積網路可以在大型、可變大小的輸入字段上非常有效率地掃描或複製。考慮一個卷積網路的實例,以及位於附近位置的[另一個自我](http://terms.naer.edu.tw/detail/713752/)。因為網路具有卷積性質,在這兩個實例中,觀察輸入上相同位置的單元具有相同的輸出,因此不需要對它們的狀態計算兩次。只需要重新計算兩個網路實例未共享的新狀態的一小"[片](http://terms.naer.edu.tw/detail/2423941/)"。當所有的[切片](http://terms.naer.edu.tw/detail/2423941/)都放在一起的時候,結果就只是一個更大的卷積網路,其架構與原始網路相同,只是`feature maps`的水平維度較大而已。換句話說,複製一個卷積網路,可以簡單地透過增加執行卷積的字段大小,並相應地複製輸出層來完成。輸出層實際上成為一個卷積層。一個[接收域](http://terms.naer.edu.tw/detail/3285175/)以[基本](http://terms.naer.edu.tw/detail/2363712/)物件為中心的輸出將生成此物件的類別,而介於中間的輸出可能表示沒有字符或包含垃圾。這些輸出可以解釋為在輸入字段中所有可能位置上存在物件的證據。
:::
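The computational sharing described above can be made concrete with a minimal NumPy sketch (an illustration, not code from the paper): a single convolution computed once over a wide input produces exactly the same values as sweeping a narrow window recognizer across every location, so no state is ever computed twice.

```python
import numpy as np

def conv1d_valid(x, k):
    """Valid 1-D correlation: output length is len(x) - len(k) + 1."""
    n = len(x) - len(k) + 1
    return np.array([np.dot(x[i:i + len(k)], k) for i in range(n)])

rng = np.random.default_rng(0)
kernel = rng.standard_normal(5)       # a single convolutional feature detector
wide_input = rng.standard_normal(32)  # whole string image (1-D for simplicity)

# SDNN view: one pass of the convolution over the wide input field...
sdnn_out = conv1d_valid(wide_input, kernel)

# ...equals sweeping a narrow recognizer (window = kernel size) at every location.
swept_out = np.array([np.dot(wide_input[i:i + 5], kernel) for i in range(28)])

assert np.allclose(sdnn_out, swept_out)
```

The same identity holds layer by layer in a full convolutional network, which is why replication only enlarges the feature maps instead of multiplying the cost.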
:::info

Fig. 23. A Space Displacement Neural Network is a convolutional network that has been replicated over a wide input field.
圖23. Space Displacement Neural Network是一個在寬輸入字段上複製的卷積網路。
:::
:::info
The SDNN architecture seems particularly attractive for recognizing cursive handwriting where no reliable segmentation heuristic exists. Although the idea of SDNN is quite old, and very attractive by its simplicity, it has not generated wide interest until recently because, as stated above, it puts enormous demands on the recognizer \[26\], \[27\]. In speech recognition, where the recognizer is at least one order of magnitude smaller, replicated convolutional networks are easier to implement, for instance in Haffner's Multi-State TDNN model \[78\], \[85\].
:::
:::success
SDNN架構對於在不存在可靠分段[啟發式方法](http://terms.naer.edu.tw/detail/1824705/)的情況下辨識[草書](http://terms.naer.edu.tw/detail/2354557/)手寫似乎特別有吸引力。儘管SDNN的概念相當古老,而且其簡單性非常具有吸引力,但直到最近它才開始引起廣泛的興趣,因為如上所述,它對識別器有著極大的要求\[26\],\[27\]。在語音辨識中,識別器至少小一個數量級,複製卷積網路更容易實現,例如Haffner的Multi-State TDNN模型\[78\],\[85\]。
:::
### A. Interpreting the Output of an SDNN with a GTN
:::info
The output of an SDNN is a sequence of vectors which encode the likelihood, penalties, or scores of finding a character of a particular class label at the corresponding location in the input. A post-processor is required to pull out the best possible label sequence from this vector sequence. An example of SDNN output is shown in Figure 23. Very often, individual characters are spotted by several neighboring instances of the recognizer, a consequence of the robustness of the recognizer to horizontal translations. Also quite often, characters are erroneously detected by recognizer instances that see only a piece of a character. For example a recognizer instance that only sees the right third of a "4" might output the label 1. How can we eliminate those extraneous characters from the output sequence and pull-out the best interpretation? This can be done using a new type of Graph Transformer with two input graphs as shown in Figure 24. The sequence of vectors produced by the SDNN is first coded into a linear graph with multiple arcs between pairs of successive nodes. Each arc between a particular pair of nodes contains the label of one of the possible categories, together with the penalty produced by the SDNN for that class label at that location. This graph is called the SDNN Output Graph. The second input graph to the transformer is a grammar transducer, more specifically a finite-state transducer \[86\], that encodes the relationship between input strings of class labels and corresponding output strings of recognized characters. The transducer is a weighted finite-state machine (a graph) where each arc contains a pair of labels and possibly a penalty. Like a finite-state machine, a transducer is in a state and follows an arc to a new state when an observed input symbol matches the first symbol in the symbol pair attached to the arc.
At this point the transducer emits the second symbol in the pair together with a penalty that combines the penalty of the input symbol and the penalty of the arc. A transducer therefore transforms a weighted symbol sequence into another weighted symbol sequence. The graph transformer shown in figure 24 performs a composition between the recognition graph and the grammar transducer. This operation takes every possible sequence corresponding to every possible path in the recognition graph and matches them with the paths in the grammar transducer. The composition produces the interpretation graph, which contains a path for each corresponding output label sequence. This composition operation may seem combinatorially intractable, but it turns out there exists an efficient algorithm for it described in more details in Section VIII.
:::
:::success
SDNN的輸出是一個向量序列,這些向量對在輸入的相對應位置中找到特定類別標籤字符的可能性、懲罰或分數進行編碼。需要一個[後處理機](http://terms.naer.edu.tw/detail/2049335/)從這個向量序列中找出最佳的標籤序列。SDNN的輸出範例如圖23所示。通常,各別字符會被識別器的幾個相鄰實例發現,這是識別器對水平平移具有魯棒性的[結果](http://terms.naer.edu.tw/detail/2352038/)。同樣常見的是,只看到字符一部份的識別器實例會[錯誤地](https://tw.voicetube.com/definition/erroneously)偵測到字符。舉例來說,一個只看到"4"右邊三分之一的識別器實例也許會輸出標籤1。我們要如何消除輸出序列中的那些多餘字符並提取出最佳解釋?這可以使用一種帶有兩個輸入圖的新型圖轉換器來完成,如圖24所示。由SDNN生成的向量序列首先被編碼成一個線性圖,在[後繼節點](http://terms.naer.edu.tw/detail/2426746/)對之間帶有多個弧。特定節點對之間的每一個弧包含一個可能類別的標籤,以及SDNN在該位置為該類別標籤所生成的懲罰。這個圖稱為SDNN輸出圖。轉換器的第二個輸入圖是[語法](http://terms.naer.edu.tw/detail/2373942/)[轉換器](http://terms.naer.edu.tw/detail/1061461/),更具體地說是[有限狀態轉換器](http://terms.naer.edu.tw/detail/2369686/)\[86\],它對類別標籤的輸入字串與相對應辨識字符的輸出字串之間的關係進行編碼。這個轉換器是一個加權[有限狀態機](http://terms.naer.edu.tw/detail/2369679/)(一個圖),其每一個弧包含一對標籤,而且可能還包含一個懲罰。與有限狀態機一樣,轉換器處於某個狀態,當觀察到的輸入[符號](http://terms.naer.edu.tw/detail/1060655/)與附加到弧的符號對中的第一個符號匹配的時候,轉換器將依循該弧到達新狀態。此時,轉換器發出該對中的第二個符號,以及一個結合了輸入符號懲罰與弧懲罰的懲罰。因此,轉換器將一個加權符號序列轉換為另一個加權符號序列。圖24所示的圖轉換器執行辨識圖與語法轉換器之間的合成。這個操作取得對應於辨識圖中每條可能路徑的每個可能序列,並將它們與語法轉換器中的路徑匹配。這個合成生成解釋圖,其中包含每個相對應輸出標籤序列的一條路徑。這個合成操作看起來似乎在組合上難以處理,但事實上存在一種有效的演算法,這部份將在第VIII章節中更詳細說明。
:::
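As a hedged illustration (the data and helper name here are invented, not from the paper), the SDNN Output Graph can be sketched as a linear graph whose arcs between successive nodes carry (label, penalty) pairs; picking the cheapest arc between each node pair gives a trivial decode before any grammar is composed in:

```python
def build_sdnn_output_graph(penalties, labels):
    """Build the linear SDNN Output Graph.
    penalties: T x C list, per-position, per-class penalties.
    Returns arcs (src_node, dst_node, label, penalty)."""
    arcs = []
    for t, row in enumerate(penalties):
        for label, p in zip(labels, row):
            arcs.append((t, t + 1, label, p))
    return arcs

labels = ["0", "1", "2"]
penalties = [[0.1, 2.0, 3.0],   # position 0: "0" is cheapest
             [4.0, 0.2, 5.0]]   # position 1: "1" is cheapest
graph = build_sdnn_output_graph(penalties, labels)

# Naive decode: cheapest arc between each successive node pair.
best = [min((a for a in graph if a[0] == t), key=lambda a: a[3])[2]
        for t in range(len(penalties))]
assert best == ["0", "1"]
```

The grammar transducer's job, composed on top of this graph, is precisely to replace this naive per-position minimum with a search over globally legal label sequences.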
:::info

Fig. 24. A Graph Transformer pulls out the best interpretation from the output of the SDNN.
圖24. 圖轉換器從SDNN輸出中取得最佳解釋。
:::
### B. Experiments with SDNN
:::info
In a series of experiments, LeNet-5 was trained with the goal of being replicated so as to recognize multiple characters without segmentations. The data was generated from the previously described Modified NIST set as follows. Training images were composed of a central character, flanked by two side characters picked at random in the training set. The separation between the bounding boxes of the characters were chosen at random between -1 and 4 pixels. In other instances, no central character was present, in which case the desired output of the network was the blank space class. In addition, training images were degraded with 10% salt and pepper noise (random pixel inversions).
:::
:::success
在一系列的實驗中,LeNet-5的訓練目標是進行複製,以便在不做[分段](http://terms.naer.edu.tw/detail/2419959/)的情況下識別多個字符。資料是由先前所述的Modified NIST資料集生成,方式如下。訓練影像由一個中心字符組成,兩側是從訓練集中隨機挑選的兩個側面字符。字符[定界框](http://terms.naer.edu.tw/detail/2343363/)之間的間隔是在-1到4像素之間隨機選擇的。在其它情況下,不存在中心字符,這種情況下,網路的期望輸出是[空白空間](http://terms.naer.edu.tw/detail/2342517/)類別。此外,訓練影像會加上10%的[椒鹽噪聲](https://zh.wikipedia.org/wiki/%E6%A4%92%E7%9B%90%E5%99%AA%E5%A3%B0)(隨機像素[反轉](http://terms.naer.edu.tw/detail/839295/))使其退化。
:::
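The data generation recipe above (random -1..4 pixel gaps, 10% salt-and-pepper noise) can be sketched roughly as follows; this is a NumPy illustration assuming 28x28 binary glyphs, and the blank-center variant is omitted:

```python
import numpy as np

def make_training_image(left, center, right, rng, noise=0.10):
    """Paste three 28x28 glyphs with random gaps in [-1, 4] pixels between
    bounding boxes, then flip 10% of pixels (salt-and-pepper noise)."""
    h, w = 28, 28
    gaps = rng.integers(-1, 5, size=2)          # separations in [-1, 4]
    width = 3 * w + int(gaps.sum())
    canvas = np.zeros((h, width))
    x = 0
    for glyph, gap in ((left, gaps[0]), (center, gaps[1]), (right, 0)):
        # np.maximum lets overlapping ink merge when a gap is negative.
        canvas[:, x:x + w] = np.maximum(canvas[:, x:x + w], glyph)
        x += w + int(gap)
    flip = rng.random(canvas.shape) < noise     # random pixel inversions
    return np.where(flip, 1.0 - canvas, canvas)

rng = np.random.default_rng(0)
glyphs = (rng.random((3, 28, 28)) > 0.7).astype(float)  # stand-in characters
img = make_training_image(*glyphs, rng)
assert img.shape[0] == 28 and 82 <= img.shape[1] <= 92
```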
:::info

Fig. 25. An example of multiple character recognition with SDNN. With SDNN, no explicit segmentation is performed.
圖25. 使用SDNN做多字符辨識的範例。使用SDNN,並不會進行顯式分段。
:::
:::info

Fig. 26. An SDNN applied to a noisy image of digit string. The digits shown in the SDNN output represent the winning class labels, with a lighter grey level for high-penalty answers.
圖26. SDNN應用於數字字串的噪點影像。SDNN輸出中顯示的數字代表獲勝的類別標籤,較淺的灰階代表高懲罰的答案。
:::
:::info
Figures 25 and 26 show a few examples of successful recognitions of multiple characters by the LeNet-5 SDNN. Standard techniques based on Heuristic Over-Segmentation would fail miserably on many of those examples. As can be seen on these examples, the network exhibits striking invariance and noise resistance properties. While some authors have argued that invariance requires more sophisticated models than feedforward neural networks \[87\], LeNet-5 exhibits these properties to a large extent.
:::
:::success
圖25與26顯示了LeNet-5 SDNN成功辨識多個字符的一些範例。基於Heuristic Over-Segmentation的標準技術在許多這類範例中都會失敗得很慘。從這些範例可以看得出來,網路展現出驚人的[不變性](http://terms.naer.edu.tw/detail/88724/)與抗噪性。儘管部份作者認為不變性需要比前饋神經網路更複雜的模型\[87\],但LeNet-5很大程度上展現了這些特性。
:::
:::info
Similarly, it has been suggested that accurate recognition of multiple overlapping objects require explicit mechanisms that would solve the so-called feature binding problem \[87\]. As can be seen on Figures 25 and 26, the network is able to tell the characters apart, even when they are closely intertwined, a task that would be impossible to achieve with the more classical Heuristic Over-Segmentation technique. The SDNN is also able to correctly group disconnected pieces of ink that form characters. Good examples of that are shown in the upper half of figure 26. In the top left example, the 4 and the 0 are more connected to each other than they are connected with themselves, yet the system correctly identifies the 4 and the 0 as separate objects. The top right example is interesting for several reasons. First the system correctly identifies the three individual ones. Second, the left half and right half of disconnected 4 are correctly grouped, even though no geometrical information could decide to associate the left half to the vertical bar on its left or on its right. The right half of the 4 does cause the appearance of an erroneous 1 on the SDNN output, but this one is removed by the character model transducer which prevents characters from appearing on contiguous outputs.
:::
:::success
同樣地,已經有人提出,要準確辨識多個重疊的物件,需要能解決所謂特徵綁定問題的明確機制\[87\]。從圖25、26可以看出,即使字符緊密纏繞在一起,網路也能夠將字符區分開來,這是較經典的Heuristic Over-Segmentation技術無法實現的任務。SDNN還能夠正確地將構成字符的斷開墨跡片段分組。圖26的上半部有很好的例子。在左上的範例中,4與0彼此之間的連接比它們各自本身的連接還要緊密,但是系統仍正確地將4與0識別為分開的物件。右上的範例很有趣,原因有幾個。首先,系統正確識別出三個各別的1。第二,斷開的4的左半部與右半部被正確地分組在一起,即使沒有任何幾何信息能夠決定應將左半部與其左邊還是右邊的[垂直條](http://terms.naer.edu.tw/detail/2435628/)關聯。4的右半部確實使SDNN的輸出出現一個錯誤的1,但是這個1被字符模型[轉換器](http://terms.naer.edu.tw/detail/955587/)刪除了,該轉換器防止字符出現在連續的輸出上。
:::
:::info
Another important advantage of SDNN is the ease with which they can be implemented on parallel hardware. Specialized analog/digital chips have been designed and used in character recognition, and in image preprocessing applications \[88\]. However the rapid progress of conventional processor technology with reduced-precision vector arithmetic instructions (such as Intel's MMX) make the success of specialized hardware hypothetical at best. Short video clips of the LeNet-5 SDNN can be viewed at [hyperlink](http://www.research.att.com/~yann/ocr).
:::
:::success
SDNN的另一個重要優點是它們可以輕鬆地在[平行](http://terms.naer.edu.tw/detail/419322/)硬體上實現。已經有專用的類比/數位[晶片](http://terms.naer.edu.tw/detail/2347534/)被設計出來,並用於字符辨識以及影像預處理應用\[88\]。然而,具有降低精度向量算術指令(像是Intel的MMX)的傳統處理器技術的快速發展,使得專用硬體的成功充其量只是一種假設。LeNet-5 SDNN的簡短影片可以在[hyperlink](http://www.research.att.com/~yann/ocr)觀賞。
:::
### C. Global Training of SDNN
:::info
In the above experiments, the string images were artificially generated from individual characters. The advantage is that we know in advance the location and the label of the important character. With real training data, the correct sequence of labels for a string is generally available, but the precise locations of each corresponding character in the input image are unknown.
In the experiments described in the previous section, the best interpretation was extracted from the SDNN output using a very simple graph transformer. Global training of an SDNN can be performed by back-propagating gradients through such graph transformers arranged in architectures similar to the ones described in section VI.
This is somewhat equivalent to modeling the output of an SDNN with a Hidden Markov Model. Globally trained, variable-size TDNN/HMM hybrids have been used for speech recognition and on-line handwriting recognition \[77\],\[89\],\[90\],\[67\]. Space Displacement Neural Networks have been used in combination with HMMs or other elastic matching methods for handwritten word recognition \[91\],\[92\].
:::
:::success
在上面的實驗中,字串影像是由單獨字符中人工生成的。優點是,我們可以提前知道重要字符的位置與標籤。對於真實的訓練資料,通常可以得到字串的正確標籤序列,但是輸入影像中的每個相應字符的精確位置是未知的。
在上一章節所述的實驗中,使用非常簡單的圖轉換器從SDNN輸出中提取出最佳解釋。SDNN的全域訓練可以透過在類似於第六章節所述架構中排列的這類圖轉換器反向傳播梯度來進行。
這在某種程度上等價於使用[隱馬爾可夫模型](http://terms.naer.edu.tw/detail/6218215/)對SDNN的輸出建模。全域訓練的、可變大小的TDNN/HMM混合系統已經用於語音辨識以及在線手寫辨識\[77\],\[89\],\[90\],\[67\]。Space Displacement Neural Networks已經與HMMs或其它彈性匹配方法結合使用,用於手寫單詞辨識\[91\],\[92\]。
:::
:::info
Figure 27 shows the graph transformer architecture for training an SDNN/HMM hybrid with the Discriminative Forward Criterion. The top part is comparable to the top part of figure 21. On the right side the composition of the recognition graph with the grammar gives the interpretation graph with all the possible legal interpretations. On the left side the composition is performed with a grammar that only contains paths with the desired sequence of labels. This has a somewhat similar function to the path selector used in the previous section. Like in Section VI-D the loss function is the difference between the forward score obtained from the left half and the forward score obtained from the right half. To back-propagate through the composition transformer, we need to keep a record of which arc in the recognition graph originated which arcs in the interpretation graph. The derivative with respect to an arc in the recognition graph is equal to the sum of the derivatives with respect to all the arcs in the interpretation graph that originated from it. Derivative can also be computed for the penalties on the grammar graph, allowing to learn them as well. As in the previous example, a discriminative criterion must be used, because using a non-discriminative criterion could result in a collapse effect if the network's output RBF are adaptive. The above training procedure can be equivalently formulated in term of HMM. Early experiments in zip code recognition \[91\], and more recent experiments in on-line handwriting recognition \[38\] have demonstrated the idea of globally-trained SDNN/HMM hybrids. SDNN is an extremely promising and attractive technique for OCR, but so far it has not yielded better results than Heuristic Over-Segmentation. We hope that these results will improve as more experience is gained with these models.
:::
:::success
圖27顯示了使用判別式正向準則訓練SDNN/HMM混合系統的圖轉換器架構。其上部與圖21的上部相當。在右側,辨識圖與[語法](http://terms.naer.edu.tw/detail/2373942/)的合成給出了帶有所有可能合法解釋的解釋圖。在左側,則使用僅包含具有所需標籤序列路徑的語法進行合成。這與前面章節所使用的路徑選擇器有某種程度上相似的功能。如同第VI-D章節,損失函數是由左半部獲得的正向分數與由右半部獲得的正向分數之間的差值。要透過合成轉換器做反向傳播,我們需要記錄辨識圖中的哪一個弧產生了解釋圖中的哪些弧。相對於辨識圖中某一弧的導數,等於相對於解釋圖中源於它的所有弧的導數之和。也可以計算相對於語法圖上懲罰的導數,從而也能學習它們。如同前面的範例,必須使用判別式準則,因為如果網路的輸出RBF是[自適應](http://terms.naer.edu.tw/detail/2454885/)的,使用非判別式準則可能導致崩潰效應。上面的訓練程序可以等價地用HMM的術語來表述。郵遞區號辨識的早期實驗\[91\],以及在線手寫辨識的較新實驗\[38\],已經證明了全域訓練SDNN/HMM混合系統的想法。SDNN是對OCR而言極有希望且具吸引力的技術,但到目前為止,它並沒有比Heuristic Over-Segmentation產生更好的結果。我們希望隨著對這些模型累積更多經驗,這些結果將有所改善。
:::
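The discriminative forward criterion can be illustrated with a small sketch (penalties invented here, log-domain arithmetic only): the forward score of a graph is the log-add of the total penalties over all of its paths, and the loss is the constrained (left-half) score minus the full interpretation (right-half) score.

```python
import math

def forward_score(arcs, n_nodes):
    """Forward score of an acyclic graph with nodes 0..n_nodes-1 (0 = start,
    n_nodes-1 = end): f = -log(sum over paths of exp(-path penalty))."""
    INF = float("inf")
    f = [INF] * n_nodes          # INF = "no path reaches this node yet"
    f[0] = 0.0
    for src, dst, penalty in arcs:        # arcs listed in topological order
        if f[src] == INF:
            continue
        inc = f[src] + penalty
        lo, hi = min(f[dst], inc), max(f[dst], inc)
        # logadd: f[dst] <- -log(exp(-f[dst]) + exp(-inc))
        f[dst] = lo - math.log1p(math.exp(lo - hi))
    return f[-1]

# Right half: interpretation graph with all label sequences (src, dst, penalty).
full = [(0, 1, 0.5), (0, 1, 2.0), (1, 2, 1.0), (1, 2, 3.0)]
# Left half: constrained graph containing only the desired label sequence.
constrained = [(0, 1, 0.5), (1, 2, 1.0)]
loss = forward_score(constrained, 3) - forward_score(full, 3)
assert loss >= 0.0   # constrained paths are a subset, so their score is larger
```

Back-propagation then distributes the gradient of this difference over arc penalties, and from there into the recognition-graph arcs that originated them.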
:::info

Fig. 27. A globally trainable SDNN/HMM hybrid system expressed as a GTN.
圖27. 以GTN表示的全域可訓練SDNN/HMM混合系統。
:::
### D. Object Detection and Spotting with SDNN
:::info
An interesting application of SDNNs is object detection and spotting. The invariance properties of Convolutional Networks, combined with the efficiency with which they can be replicated over large fields suggest that they can be used for "brute force" object spotting and detection in large images. The main idea is to train a single Convolutional Network to distinguish images of the object of interest from images present in the background. In utilization mode, the network is replicated so as to cover the entire image to be analyzed, thereby forming a two-dimensional Space Displacement Neural Network. The output of the SDNN is a two-dimensional plane in which activated units indicate the presence of the object of interest in the corresponding receptive field. Since the sizes of the object to be detected within the image are unknown, the image can be presented to the network at multiple resolutions, and the results at multiple resolutions combined. The idea has been applied to face location \[93\], address block location on envelopes \[94\], and hand tracking in video \[95\].
To illustrate the method, we will consider the case of face detection in images as described in \[93\]. First, images containing faces at various scales are collected. Those images are filtered through a zero-mean Laplacian filter so as to remove variations in global illumination and low spatial frequency illumination gradients. Then, training samples of faces and non-faces are manually extracted from those images. The face sub-images are then size normalized so that the height of the entire face is approximately 20 pixels while keeping fairly large variations (within a factor of two). The scale of background sub-images is picked at random. A single convolutional network is trained on those samples to classify face sub-images from non-face sub-images.
:::
:::success
SDNN有一個有趣的應用是物體偵測與發現。卷積網路的[不變性性質](http://terms.naer.edu.tw/detail/2118252/),再加上它們可以大範圍內複製的效率,這表明在大型影像中它們可以用"[蠻力](http://terms.naer.edu.tw/detail/1217737/)"做物體發現與偵測。主要的想法是訓練一個卷積網路來區分感興趣的物件的影像與背景影像。在[利用](http://terms.naer.edu.tw/detail/488455/)模式下,複製網路來覆蓋要分析的整個影像,從而形成一個二維的Space Displacement Neural Network。SDNN的輸出是一個二維平面,其中啟動的單元指示感興趣的物件在相對應的接收域中存在。由於影像中要檢測的物件大小未知,因此可以將影像以多種解析度呈現給網路,再將多種解析度的結果結合起來。這個想法已經應用於臉部位置 \[93\]、信封上的地址區塊 \[94\],以及視訊中的手部追蹤\[95\]。
為了說明這個方法,我們考慮\[93\]中所述的影像臉部偵測的情況。首先,收集包含各種比例臉部的影像。這些影像通過零均值的拉普拉斯濾波器進行濾波,以消除全域照明與低空間頻率照明梯度的變化。然後,從這些影像中手動取出臉部與非臉部的訓練樣本。接著,將臉部子影像做大小正規化,使整個臉部的高度約為20像素,同時保留相當大的變化(在兩倍的因子之內)。背景子影像的比例則隨機選取。在這些樣本上訓練單一卷積網路,以區分臉部子影像與非臉部子影像。
:::
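The zero-mean Laplacian filtering step can be illustrated with a small NumPy sketch (a generic 3x3 Laplacian kernel is assumed here; the paper does not give the exact filter): because the kernel's coefficients sum to zero, constant illumination vanishes entirely, and so do linear illumination gradients.

```python
import numpy as np

# 3x3 zero-mean Laplacian kernel: coefficients sum to zero, so the response
# to any constant image is exactly zero.
LAPLACIAN = np.array([[0,  1, 0],
                      [1, -4, 1],
                      [0,  1, 0]], dtype=float)

def filter2d(img, k):
    """Valid 2-D correlation with a 3x3 kernel (no padding)."""
    h, w = img.shape
    out = np.zeros((h - 2, w - 2))
    for dy in range(3):
        for dx in range(3):
            out += k[dy, dx] * img[dy:dy + h - 2, dx:dx + w - 2]
    return out

flat = np.full((20, 20), 7.0)             # uniformly lit region
assert np.allclose(filter2d(flat, LAPLACIAN), 0.0)   # global illumination removed
ramp = np.tile(np.arange(20.0), (20, 1))  # linear illumination gradient
assert np.allclose(filter2d(ramp, LAPLACIAN), 0.0)   # linear gradients removed too
```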
:::info
When a scene image is to be analyzed, it is first filtered through the Laplacian filter, and sub-sampled at powers-of-two resolutions. The network is replicated over each of multiple resolution images. A simple voting technique is used to combine the results from multiple resolutions.
A two-dimensional version of the global training method described in the previous section can be used to alleviate the need to manually locate faces when building the training sample \[93\]. Each possible location is seen as an alternative interpretation, i.e. one of several parallel arcs in a simple graph that only contains a start node and an end node.
Other authors have used Neural Networks, or other classifiers such as Support Vector Machines for face detection with great success \[96\], \[97\].Their systems are very similar to the one described above, including the idea of presenting the image to the network at multiple scales. But since those systems do not use Convolutional Networks, they cannot take advantage of the speedup described here, and have to rely on other techniques, such as pre-filtering and real-time tracking, to keep the computational requirement within reasonable limits. In addition, because those classifiers are much less invariant to scale variations than Convolutional Networks, it is necessary to multiply the number of scales at which the images are presented to the classifiers.
:::
:::success
當要分析一張場景影像的時候,首先通過拉普拉斯濾波器進行濾波,然後以二的幂次解析度進行子採樣。網路在多種解析度影像的每一個上進行複製。再用一種簡單的投票技術來組合多種解析度的結果。
上一章節所述全域訓練方法的二維版本,可以用來減少建構訓練樣本時手動定位臉部的需求\[93\]。每一個可能的位置都被視為一種替代解釋,也就是一個只包含[開始節點](http://terms.naer.edu.tw/detail/2425451/)與[端節點](http://terms.naer.edu.tw/detail/2115413/)的簡單圖中幾個平行弧的其中之一。
其它作者使用神經網路或其它分類器(像是[支援向量機](http://terms.naer.edu.tw/detail/3621763/))成功地進行臉部偵測\[96\],\[97\]。他們的系統與上面所述的系統非常相似,包括以多種比例將影像呈現給網路的想法。但是,由於這些系統不使用卷積網路,它們無法利用這裡所述的加速優點,而不得不依賴其它技術(像是預濾波與實時追蹤)來將計算需求保持在合理範圍內。此外,因為這些分類器對比例變化的不變性遠小於卷積網路,因此必須增加將影像呈現給分類器的比例數量。
:::
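The multi-resolution scheme can be sketched as follows (an illustration with an invented brightness scorer standing in for the trained network): the image is sub-sampled at powers of two, a fixed-size window detector is replicated over every level, and hits are mapped back to original coordinates so the levels can vote together.

```python
import numpy as np

def pyramid(image, min_size=20):
    """Yield (scale, image) at power-of-two resolutions."""
    scale = 1
    while min(image.shape) >= min_size:
        yield scale, image
        image = image[::2, ::2]        # crude 2x sub-sampling
        scale *= 2

def detect(image, score_fn, window=20, threshold=0.0):
    """Replicate a window scorer over every pyramid level; map hits back to
    original coordinates and combine levels by simple voting (union here)."""
    hits = []
    for scale, im in pyramid(image, window):
        h, w = im.shape
        for y in range(h - window + 1):
            for x in range(w - window + 1):
                if score_fn(im[y:y + window, x:x + window]) > threshold:
                    hits.append((y * scale, x * scale, window * scale))
    return hits

# Toy scorer: a "face" is a bright window; one 40x40 blob in an 80x80 image.
img = np.zeros((80, 80))
img[10:50, 10:50] = 1.0
hits = detect(img, lambda w: w.mean() - 0.9)
assert any(size == 40 for _, _, size in hits)  # found at half resolution
```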
## VIII. Graph Transformer Networks and Transducers
:::info
In Section IV, Graph Transformer Networks (GTN) were introduced as a generalization of multi-layer, multi-module networks where the state information is represented as graphs instead of fixed-size vectors. This section re-interprets the GTNs in the framework of Generalized Transduction, and proposes a powerful Graph Composition algorithm.
:::
:::success
在第四章節中,引入了圖轉換器網路(GTN)作為多層、多模組網路的一般化,其狀態信息以圖表示,而不是固定大小的向量。本章節在[一般型轉導](http://terms.naer.edu.tw/detail/5457299/)的框架中重新解釋GTNs,並提出一種強力的圖合成演算法。
:::
### A. Previous Work
:::info
Numerous authors in speech recognition have used Gradient-Based Learning methods that integrate graph-based statistical models (notably HMM) with acoustic recognition modules, mainly Gaussian mixture models, but also neural networks \[98\], \[78\], \[99\], \[67\]. Similar ideas have been applied to handwriting recognition (see \[38\] for a review). However, there has been no proposal for a systematic approach to multi-layer graph-based trainable systems. The idea of transforming graphs into other graphs has received considerable interest in computer science, through the concept of weighted finite-state transducers \[86\]. Transducers have been applied to speech recognition \[100\] and language translation \[101\], and proposals have been made for handwriting recognition \[102\]. This line of work has been mainly focused on efficient search algorithms \[103\] and on the algebraic aspects of combining transducers and graphs (called acceptors in this context), but very little effort has been devoted to building globally trainable systems out of transducers. What is proposed in the following sections is a systematic approach to automatic training in graph-manipulating systems. A different approach to graph-based trainable systems, called Input-Output HMM, was proposed in \[104\], \[105\].
:::
:::success
語音辨識領域的諸多作者已經使用基於梯度的學習方法,將基於圖的統計模型(特別是HMM)與[聲學](http://terms.naer.edu.tw/detail/2337143/)辨識模組(主要是高斯混合模型,但也有神經網路)整合在一起\[98\],\[78\],\[99\],\[67\]。類似的想法已經應用於手寫辨識(相關回顧,見\[38\])。然而,目前還沒有針對多層的、基於圖的可訓練系統的系統化方法的提案。透過加權有限狀態[轉換器](http://terms.naer.edu.tw/detail/2432148/)的概念\[86\],將圖轉換為其它圖的想法已經在計算機科學中引起極大的興趣。轉換器已經應用於語音辨識\[100\]與語言翻譯\[101\],也已有人提出將其用於手寫辨識\[102\]。這方面的工作主要集中在高效搜尋演算法\[103\],以及結合轉換器與圖(在此上下文中稱為[接收器](http://terms.naer.edu.tw/detail/2338394/))的代數方面,但是很少有人致力於以轉換器建構全域可訓練的系統。下面章節提出的是一種在圖處理系統中進行自動訓練的系統化方法。\[104\],\[105\]中提出了另一種基於圖的可訓練系統的方法,稱為Input-Output HMM。
:::
### B. Standard Transduction
:::info
In the established framework of finite-state transducers \[86\], discrete symbols are attached to arcs in the graphs. Acceptor graphs have a single symbol attached to each arc whereas transducer graphs have two symbols (an input symbol and an output symbol). A special null symbol is absorbed by any other symbol (when concatenating symbols to build a symbol sequence). Weighted transducers and acceptors also have a scalar quantity attached to each arc. In this framework, the composition operation takes as input an acceptor graph and a transducer graph and builds an output acceptor graph. Each path in this output graph (with symbol sequence $S_{out}$) corresponds to one path (with symbol sequence $S_{in}$) in the input acceptor graph and one path and a corresponding pair of input/output sequences ($S_{in}$, $S_{out}$) in the transducer graph. The weights on the arcs of the output graph are obtained by adding the weights from the matching arcs in the input acceptor and transducer graphs. In the rest of the paper, we will call this graph composition operation using transducers the (standard) transduction operation.
:::
:::success
在有限狀態[轉換器](http://terms.naer.edu.tw/detail/2432148/)的既有框架中\[86\],離散符號被附加到圖中的弧上。[接收器](http://terms.naer.edu.tw/detail/2338394/)圖在每個弧上都附有單一符號,而[轉換器](http://terms.naer.edu.tw/detail/2432148/)圖則有兩個符號(一個輸入符號與一個輸出符號)。特殊的空符號會被任何其它符號吸收(當連接符號以構建符號序列時)。加權轉換器與接收器的每一個弧上也附有一個標量。在這個框架中,合成操作以一個接收器圖與一個轉換器圖作為輸入,並建立一個輸出接收器圖。這個輸出圖中的每條路徑(帶有符號序列$S_{out}$)對應於輸入接收器圖中的一條路徑(帶有符號序列$S_{in}$),以及轉換器圖中的一條路徑及其相對應的一對輸入/輸出序列($S_{in}$, $S_{out}$)。輸出圖弧上的權重,是透過將輸入接收器圖與轉換器圖中匹配弧的權重相加而得。在論文的其餘部份,我們將這種使用轉換器的圖合成操作稱為(標準)[轉導](http://terms.naer.edu.tw/detail/2432151/)操作。
:::
:::info
A simple example of transduction is shown in Figure 28. In this simple example, the input and output symbols on the transducer arcs are always identical. This type of transducer graph is called a grammar graph. To better understand the transduction operation, imagine two tokens sitting each on the start nodes of the input acceptor graph and the transducer graph. The tokens can freely follow any arc labeled with a null input symbol. A token can follow an arc labeled with a non-null input symbol if the other token also follows an arc labeled with the same input symbol. We have an acceptable trajectory when both tokens reach the end nodes of their graphs (i.e. the tokens have reached the terminal configuration). This trajectory represents a sequence of input symbols that complies with both the acceptor and the transducer. We can then collect the corresponding sequence of output symbols along the trajectory of the transducer token. The above procedure produces a tree, but a simple technique described in Section VIII-C can be used to avoid generating multiple copies of certain subgraphs by detecting when a particular output state has already been seen.
:::
:::success
[轉導](http://terms.naer.edu.tw/detail/2432151/)的一個簡單範例如圖28所示。在這個簡單的範例中,[轉換器](http://terms.naer.edu.tw/detail/2432148/)弧上的輸入與輸出符號總是相同。這種類型的轉換器圖稱為語法圖。為了更好地瞭解轉導操作,想像有兩個[符記](http://terms.naer.edu.tw/detail/2431517/),分別位於輸入[接收器](http://terms.naer.edu.tw/detail/2338394/)圖與轉換器圖的[開始節點](http://terms.naer.edu.tw/detail/2425451/)上。符記可以自由地跟隨任何標有空輸入符號的弧。一個符記可以跟隨標有非空輸入符號的弧,前提是另一個符記也跟隨標有相同輸入符號的弧。當兩個符記都到達其圖的[端節點](http://terms.naer.edu.tw/detail/2364339/)時(即符記已到達終端配置),我們就有了一條可接受的軌跡。這條軌跡代表一個同時符合接收器與轉換器的輸入符號序列。然後,我們可以沿著轉換器符記的軌跡收集相對應的輸出符號序列。上面的過程會生成一棵樹,但可以使用VIII-C章節中所述的簡單技術,透過偵測某個特定輸出狀態是否已經出現過,來避免生成某些子圖的多個複本。
:::
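上述「兩個符記同步走訪接收器圖與語法圖」的轉導過程,可以用下面這個極簡的Python草圖輔助理解(假設性的簡化示意,並非論文原始碼:圖以dict表示節點到「(符號, 下一節點)」弧的映射,且只處理非空符號):

```python
# 以兩個「符記」同步走訪 acceptor 圖與 transducer(語法)圖,
# 收集同時被兩個圖接受的符號序列(輸入與輸出符號在語法圖上相同)。
def transduce(acceptor, transducer, a_start, t_start, a_end, t_end):
    results = []

    def walk(a_node, t_node, output):
        # 兩個符記都到達端節點 => 一條可接受的軌跡
        if a_node == a_end and t_node == t_end:
            results.append(tuple(output))
            return
        for sym_a, a_next in acceptor.get(a_node, []):
            for sym_t, t_next in transducer.get(t_node, []):
                if sym_a == sym_t:  # 兩弧標有相同輸入符號時,兩個符記才可同時前進
                    walk(a_next, t_next, output + [sym_t])

    walk(a_start, t_start, [])
    return results

# 用法示意:acceptor 接受 "ab" 或 "ac",語法圖只允許 "ab"
acceptor = {0: [("a", 1)], 1: [("b", 2), ("c", 2)]}
grammar = {0: [("a", 1)], 1: [("b", 2)]}
print(transduce(acceptor, grammar, 0, 0, 2, 2))  # [('a', 'b')]
```

這個草圖以深度優先枚舉軌跡,對應於原文中「生成一棵樹」的過程;避免重複展開子圖的技術見VIII-C章節。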
:::info

Fig. 28. Example of composition of the recognition graph with the grammar graph in order to build an interpretation that is consistent with both of them. During the forward propagation (dark arrows), the methods check and fprop are used. Gradients (dashed arrows) are back-propagated with the application of the method bprop.
圖28. 識別圖與語法圖的組成範例,以便構建與兩者一致的解釋。在正向傳播期間(黑箭頭),使用方法check與fprop。梯度(虛線箭頭)使用方法bprop進行反向傳播。
:::
:::info
The transduction operation can be performed very efficiently \[106\], but presents complex book-keeping problems concerning the handling of all combinations of null and non null symbols. If the weights are interpreted as probabilities (normalized appropriately) then an acceptor graph represents a probability distribution over the language defined by the set of label sequences associated to all possible paths (from the start to the end node) in the graph.
:::
:::success
[轉導](http://terms.naer.edu.tw/detail/2432151/)操作可以非常有效率的執行\[106\],但存在關於處理空符號與非空符號的所有組合的複雜的[簿記](http://terms.naer.edu.tw/detail/2343012/)問題。如果權重被解釋為機率(適當的標準化),那麼[接收器](http://terms.naer.edu.tw/detail/2338394/)圖表示由與圖中所有可能路徑(從[開始節點](http://terms.naer.edu.tw/detail/2425451/)到[端節點](http://terms.naer.edu.tw/detail/2364339/))相關聯的標籤序列集合定義的語言上的機率分佈。
:::
:::info
An example of application of the transduction operation is the incorporation of linguistic constraints (a lexicon or a grammar) when recognizing words or other character strings. The recognition transformer produces the recognition graph (an acceptor graph) by applying the neural network recognizer to each candidate segment. This acceptor graph is composed with a transducer graph for the grammar. The grammar transducer contains a path for each legal sequence of symbol, possibly augmented with penalties to indicate the relative likelihoods of the possible sequences. The arcs contain identical input and output symbols. Another example of transduction was mentioned in Section V: the path selector used in the heuristic over-segmentation training GTN is implementable by a composition. The transducer graph is linear graph which contains the correct label sequence. The composition of the interpretation graph with this linear graph yields the constrained graph.
:::
:::success
[轉導](http://terms.naer.edu.tw/detail/2432151/)操作的一個應用範例,是在辨識單詞或其它字符串時合併語言約束([詞彙](http://terms.naer.edu.tw/detail/2385960/)或語法)。識別[轉換器](http://terms.naer.edu.tw/detail/2432148/)透過將神經網路識別器應用於每個候選片段來生成識別圖([接收器](http://terms.naer.edu.tw/detail/2338394/)圖)。這個[接收器](http://terms.naer.edu.tw/detail/2338394/)圖與語法的[轉換器](http://terms.naer.edu.tw/detail/2432148/)圖進行組合。語法轉換器包含每個合法符號序列的路徑,可能會增加懲罰以表示可能序列的相對似然性。這些弧包含相同的輸入與輸出符號。第五章節提到另一個[轉導](http://terms.naer.edu.tw/detail/2432151/)範例:heuristic over-segmentation training GTN中使用的路徑選擇器可以通過組合來實現。[轉換器](http://terms.naer.edu.tw/detail/2432148/)圖是包含正確標籤序列的線性圖。解釋圖與該線性圖的組合產生約束圖。
:::
### C. Generalized Transduction
:::info
If the data structures associated to each arc took only a finite number of values, composing the input graph and an appropriate transducer would be a sound solution. For our applications however, the data structures attached to the arcs of the graphs may be vectors, images or other high-dimensional objects that are not readily enumerated. We present a new composition operation that solves this problem.
Instead of only handling graphs with discrete symbols and penalties on the arcs, we are interested in considering graphs whose arcs may carry complex data structures, including continuous-valued data structures such as vectors and images. Composing such graphs requires additional information:
* When examining a pair of arcs (one from each input graph), we need a criterion to decide whether to create corresponding arc(s) and node(s) in the output graph, based on the information attached to the input arcs. We can decide to build an arc, several arcs, or an entire sub-graph with several nodes and arcs.
* When that criterion is met, we must build the corresponding arc(s) and node(s) in the output graph and compute the information attached to the newly created arc(s) as a function of the information attached to the input arcs.
:::
:::success
如果與每個弧相關聯的資料結構只取有限的值,那麼組合輸入圖與適當的[轉換器](http://terms.naer.edu.tw/detail/2432148/)將是一個合理的解決方案。但是,對於我們的應用程式,附加到圖的弧上的資料結構也許是向量,影像或是其它不易枚舉的高維度物件。我們提出一種新的組合操作來解決這個問題。我們不只處理弧上帶有離散符號與懲罰的圖,還考慮其弧可能包含複雜的資料結構的圖,包含連續數值的資料結構,像是向量與影像。組合這類圖需求其它信息:
* 在檢查一對弧(每個輸入圖中的一個)的時候,我們需要一個標準來決定是否根據附加到輸入弧的資料在輸出圖中建立相對應的弧與節點。我們可以決定建立一個弧,多個弧,或一個帶有多個節點與弧的子圖。
* 當滿足該標準的時候,我們必須在輸出圖中建立相對應的弧與節點,並將附加到新建立弧上的信息,計算為附加到輸入弧上信息的函數。
:::
:::info
These functions are encapsulated in an object called a Composition Transformer. An instance of Composition Transformer implements three methods:
* check(arc1, arc2)
compares the data structures pointed to by arcs arc1 (from the first graph) and arc2 (from the second graph) and returns a boolean indicating whether corresponding arc(s) should be created in the output graph.
* fprop(ngraph, upnode, downnode, arc1, arc2)
is called when check(arc1, arc2) returns true. This method creates new arcs and nodes between nodes upnode and downnode in the output graph ngraph, and computes the information attached to these newly created arcs as a function of the attached information of the input arcs arc1 and arc2.
* bprop(ngraph, upnode, downnode, arc1, arc2)
is called during training in order to propagate gradient information from the output sub-graph between upnode and downnode into the data structures on the arc1 and arc2, as well as with respect to the parameters that were used in the fprop call with the same arguments. This assumes that the function used by fprop to compute the values attached to its output arcs is differentiable.
:::
:::success
這些功能被封裝在稱為Composition Transformer的物件中。Composition Transformer的一個實例實作三種方法:
* check(arc1, arc2)
比較弧,`arc1`(來自第一個圖)與`arc2`(來自第二個圖)所指向的資料結構,並回傳一個boolean指示是否應在輸出圖中建立相對應的弧。
* fprop(ngraph, upnode, downnode, arc1, arc2)
當`check(arc1, arc2)`回傳為true的時候呼叫。這個方法在輸出圖`ngraph`中的節點`upnode`與`downnode`之間建立新的弧與節點,並計算附加到這些新建立弧的信息,做為輸入弧`arc1`與`arc2`的附加信息的函數。
* bprop(ngraph, upnode, downnode, arc1, arc2)
訓練期間呼叫,以便將梯度信息從`upnode`與`downnode`之間的輸出子圖,傳播到`arc1`與`arc2`上的資料結構中,以及相對於以相同參數呼叫`fprop`時所使用的參數。這假設`fprop`用於計算附加到其輸出弧的值的函數是可微的。
:::
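`check`/`fprop`/`bprop`介面可以用下面的Python骨架示意(假設性實作,非論文原始碼:弧以dict表示,帶有符號與懲罰;此處以標準轉導為例,`check`測試輸入符號相等,`fprop`建立一條懲罰相加的弧,`bprop`因懲罰為相加關係而把梯度原封不動傳回兩條輸入弧):

```python
class CompositionTransformer:
    """check/fprop/bprop 介面的極簡骨架(示意用)。"""

    def check(self, arc1, arc2):
        # 是否應在輸出圖中建立相對應的弧
        return arc1["symbol"] == arc2["in_symbol"]

    def fprop(self, ngraph, upnode, downnode, arc1, arc2):
        # 建立新弧;懲罰相加(對應於對數域中機率相乘)
        ngraph.append({"up": upnode, "down": downnode,
                       "symbol": arc2["out_symbol"],
                       "penalty": arc1["penalty"] + arc2["penalty"],
                       "grad": 0.0})

    def bprop(self, ngraph, upnode, downnode, arc1, arc2):
        # penalty = p1 + p2 => d(penalty)/d(p1) = d(penalty)/d(p2) = 1,
        # 因此輸出弧的梯度直接累加到兩條輸入弧上
        for arc in ngraph:
            if arc["up"] == upnode and arc["down"] == downnode:
                arc1["grad"] = arc1.get("grad", 0.0) + arc["grad"]
                arc2["grad"] = arc2.get("grad", 0.0) + arc["grad"]

# 用法示意
t = CompositionTransformer()
ngraph = []
a1 = {"symbol": "a", "penalty": 1.5}
a2 = {"in_symbol": "a", "out_symbol": "a", "penalty": 0.5}
if t.check(a1, a2):
    t.fprop(ngraph, 0, 1, a1, a2)
ngraph[0]["grad"] = 1.0  # 假想的上游梯度
t.bprop(ngraph, 0, 1, a1, a2)
print(a1["grad"], ngraph[0]["penalty"])  # 1.0 2.0
```

實際系統中弧上附加的可以是向量或影像等高維度物件,`fprop`也可以是任何可微函數;這裡僅以純量懲罰為例。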
:::info
The check method can be seen as constructing a dynamic architecture of functional dependencies, while the fprop method performs a forward propagation through that architecture to compute the numerical information attached to the arcs. The bprop method performs a backward propagation through the same architecture to compute the partial derivatives of the loss function with respect to the information attached to the arcs. This is illustrated in Figure 28.
Figure 29 shows a simplified generalized graph composition algorithm. This simplified algorithm does not handle null transitions, and does not check whether the tokens trajectory is acceptable (i.e. both tokens simultaneously reach the end nodes of their graphs). The management of null transitions is a straightforward modification of the token simulation function. Before enumerating the possible non null joint token transitions, we loop on the possible null transitions of each token, recursively call the token simulation function, and finally call the method fprop. The safest way for identifying acceptable trajectories consists in running a preliminary pass for identifying the token configurations from which we can reach the terminal configuration (i.e. both tokens on the end nodes). This is easily achieved by enumerating the trajectories in the opposite direction. We start on the end nodes and follow the arcs upstream. During the main pass, we only build the nodes that allow the tokens to reach the terminal configuration.
:::
:::success
方法`check`可以視為是建構函數依賴項的動態體系架構,而方法`fprop`透過該架構執行正向傳播,以計算附加到弧上的數值信息。方法`bprop`透過相同架構執行反向傳播,以計算相對於附加到弧上的信息的損失函數的偏導數。如圖28所示。
圖29展示了一個簡化的generalized graph composition(一般型圖合成)演算法。這個簡化的演算法並沒有處理`null`轉換,也沒有檢查[符記](http://terms.naer.edu.tw/detail/2431517/)軌跡是否可接受(即兩個[符記](http://terms.naer.edu.tw/detail/2431517/)同時到達其圖的[端節點](http://terms.naer.edu.tw/detail/2364339/))。`null`轉換的管理是對[符記](http://terms.naer.edu.tw/detail/2431517/)模擬函數的直接修改。在枚舉可能的非空(non null)聯合[符記](http://terms.naer.edu.tw/detail/2431517/)轉換之前,我們先循環每一個[符記](http://terms.naer.edu.tw/detail/2431517/)可能的`null`轉換,遞迴呼叫符記模擬函數,最後呼叫方法`fprop`。識別可接受軌跡最安全的方法,是先執行一次初步遍歷,找出能夠到達終端配置(即兩個[符記](http://terms.naer.edu.tw/detail/2431517/)都在[端節點](http://terms.naer.edu.tw/detail/2364339/)上)的[符記](http://terms.naer.edu.tw/detail/2431517/)配置。透過以相反方向枚舉軌跡可以輕鬆實現這一點:我們從[端節點](http://terms.naer.edu.tw/detail/2364339/)開始,沿著弧向上游走。在主要遍歷中,我們只建立允許[符記](http://terms.naer.edu.tw/detail/2431517/)到達終端配置的節點。
:::
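圖29所述的兩個主要組件(遞迴的`simtoken()`與記錄已訪問節點的關聯陣列`map`)可以用以下Python草圖示意(假設性簡化,同圖29:不處理`null`轉換,也不做死路避免):

```python
def compose(g1, g2, start1, start2):
    # g1, g2:node -> [(symbol, next_node)];回傳組合圖的節點與弧
    ngraph = {"nodes": set(), "arcs": []}
    visited = {}  # 關聯陣列 map:(n1, n2) -> 組合圖節點

    def simtoken(n1, n2):
        if (n1, n2) in visited:       # 此組合狀態已展開過,直接重用,避免重複子圖
            return visited[(n1, n2)]
        node = (n1, n2)
        visited[(n1, n2)] = node
        ngraph["nodes"].add(node)
        for sym1, m1 in g1.get(n1, []):
            for sym2, m2 in g2.get(n2, []):
                if sym1 == sym2:       # 相當於 check
                    down = simtoken(m1, m2)
                    ngraph["arcs"].append((node, down, sym1))  # 相當於 fprop
        return node

    simtoken(start1, start2)
    return ngraph
```

兩個符記的聯合狀態`(n1, n2)`即組合圖的節點;`visited`正是原文中「記住已訪問組合圖節點」的關聯陣列。
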
:::info
Graph composition using transducers (i.e. standard transduction) is easily and efficiently implemented as a generalized transduction. The method check simply tests the equality of the input symbols on the two arcs, and the method fprop creates a single arc whose symbol is the output symbol on the transducer's arc.
The composition between pairs of graphs is particularly useful for incorporating linguistic constraints in a hand-writing recognizer. Examples of its use are given in the on-line handwriting recognition system described in Section IX and in the check reading system described in Section X.
:::
:::success
使用[轉換器](http://terms.naer.edu.tw/detail/2432148/)(即標準[轉導](http://terms.naer.edu.tw/detail/2432151/))做圖合成是非常容易的而且有效地實現為[一般型轉導](http://terms.naer.edu.tw/detail/5457299/)。方法`check`僅測試兩個弧上輸入符號的相等性,而方法`fprop`建立單個弧,其符號為[轉換器](http://terms.naer.edu.tw/detail/2432148/)弧上的輸出符號。
成對圖之間的組合對於在手寫辨識器中合併語言約束特別有用。第IX章節說明的在線手寫辨識系統,以及第X章節說明的支票讀取系統中給出了其使用範例。
:::
:::info
In the rest of the paper, the term Composition Transformer will denote a Graph Transformer based on the generalized transductions of multiple graphs. The concept of generalized transduction is a very general one. In fact, many of the graph transformers described earlier in this paper, such as the segmenter and the recognizer, can be formulated in terms of generalized transduction. In this case, the generalized transduction does not take two input graphs but a single input graph. The method fprop of the transformer may create several arcs or even a complete subgraph for each arc of the initial graph. In fact the pair (check, fprop) itself can be seen as procedurally defining a transducer.
:::
:::success
論文的其餘部份中,術語"Composition Transformer"將表示基於多個圖的[一般型轉導](http://terms.naer.edu.tw/detail/5457299/)的圖轉換器。[一般型轉導](http://terms.naer.edu.tw/detail/5457299/)的概念是一個非常普遍的概念。事實上,論文前面說明的許多圖轉換器,像是分段器與識別器,都可以以[一般型轉導](http://terms.naer.edu.tw/detail/5457299/)來表示。這種情況下,[一般型轉導](http://terms.naer.edu.tw/detail/5457299/)不採用兩個輸入圖,而是一個輸入圖。轉換器的方法`fprop`可以為初始圖的每個弧建立多個弧,甚至是建立一個完整的子圖。事實上,該對(`check`, `fprop`)本身可以視為是在程式上定義一個轉換器。
:::
:::info
In addition, it can be shown that the generalized transduction of a single graph is theoretically equivalent to the standard composition of this graph with a particular transducer graph. However, implementing the operation this way may be very inefficient since the transducer can be very complicated.
In practice, the graph produced by a generalized transduction is represented procedurally, in order to avoid building the whole output graph (which may be huge when for example the interpretation graph is composed with the grammar graph). We only instantiate the nodes which are visited by the search algorithm during recognition (e.g. Viterbi). This strategy propagates the benefits of pruning algorithms (e.g. Beam Search) in all the Graph Transformer Network.
:::
:::success
此外,可以證明,單圖的[一般型轉導](http://terms.naer.edu.tw/detail/5457299/)在理論上等價於該圖與某個特定轉換器圖的標準組合。然而,由於該轉換器可能非常複雜,以這種方式實作此操作可能效率很低。
實際上,為了避免建構整個輸出圖(例如,當解釋圖與語法圖組合時,該圖可能非常大),[一般型轉導](http://terms.naer.edu.tw/detail/5457299/)的生成圖是依程序表示。我們只有實例化辨識期間搜尋演算法訪問的節點(如,Viterbi)。這種策略將[修剪](http://terms.naer.edu.tw/detail/2412213/)演算法(如Beam Search)的好處推廣到所有的Graph Transformer Network中。
:::
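「只實例化搜尋演算法訪問到的節點」的策略可以用下面的草圖示意(假設性實作,非論文原始碼:圖以`expand(node)`函數依需求產生弧,配合best-first搜尋;未被搜尋訪問的節點永遠不會被建立):

```python
import heapq

def lazy_search(start, is_end, expand):
    # expand(node) 依需求回傳 [(arc_penalty, next_node)] —
    # 圖以「程序」表示,整個輸出圖從未被完整建構
    heap = [(0.0, 0, start)]  # (累積懲罰, 序號避免比較節點, 節點)
    best = {}
    counter = 1
    while heap:
        pen, _, node = heapq.heappop(heap)
        if is_end(node):
            return pen, node  # 最低懲罰路徑的終點
        if node in best and best[node] <= pen:
            continue
        best[node] = pen
        for arc_pen, nxt in expand(node):  # 節點在此才被實例化
            heapq.heappush(heap, (pen + arc_pen, counter, nxt))
            counter += 1
    return None

# 用法示意:0 →(1.0)→ 1 →(1.0)→ 2,另有 0 →(3.0)→ 2
def expand(n):
    return {0: [(1.0, 1), (3.0, 2)], 1: [(1.0, 2)]}.get(n, [])

print(lazy_search(0, lambda n: n == 2, expand))  # (2.0, 2)
```

將best-first換成有寬度限制的佇列即得到Beam Search;修剪的好處由此傳播到整個Graph Transformer Network。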
:::info

Fig. 29. Pseudo-code for a simplified generalized composition algorithm. For simplifying the presentation, we do not handle null transitions nor implement dead end avoidance. The two main components of the composition appear clearly here: (a) the recursive function simtoken() enumerating the token trajectories, and (b) the associative array map used for remembering which nodes of the composed graph have been visited.
圖29. 簡化的一般型合成演算法的[虛擬碼](http://terms.naer.edu.tw/detail/2412226/)。為了簡化表示,我們不處理`null`轉換,也不實作死路(dead end)避免。合成的兩個主要組件在這邊清晰可見:(a)枚舉符記軌跡的遞迴函數`simtoken()`,以及(b)關聯陣列`map`,用來記住組合圖中哪些節點已被訪問。
:::
### D. Notes on the Graph Structures
:::info
Section VI has discussed the idea of global training by back-propagating gradient through simple graph transformers. The bprop method is the basis of the back-propagation algorithm for generic graph transformers. A generalized composition transformer can be seen as dynamically establishing functional relationships between the numerical quantities on the input and output arcs. Once the check function has decided that a relationship should be established, the fprop function implements the numerical relationship. The check function establishes the structure of the ephemeral network inside the composition transformer.
Since fprop is assumed to be differentiable, gradients can be back-propagated through that structure. Most parameters affect the scores stored on the arcs of the successive graphs of the system. A few threshold parameters may determine whether an arc appears or not in the graph. Since non existing arcs are equivalent to arcs with very large penalties, we only consider the case of parameters affecting the penalties.
:::
:::success
第六章節討論了通過簡單的圖轉換器做梯度反向傳播的全域訓練的想法。方法`bprop`是一般圖轉換器反向傳播演算法的基礎。一般型合成轉換器可以視為動態建立輸入與輸出弧上數值量之間的函數關係。一旦函數`check`決定應建立某個關係,函數`fprop`就實作該數值關係。函數`check`建立了合成轉換器內部臨時網路的結構。
由於假設`fprop`是可微分的,因此梯度可以透過該結構反向傳播。多數的參數影響存儲在系統連續圖上的弧的分數。一些閥值參數可以決定弧是否出現在圖中。由於不存在的弧等價於具非常大懲罰的弧,因此我們只考慮影響懲罰的參數的情況。
:::
:::info
In the kind of systems we have discussed until now (and the application described in Section X), much of the knowledge about the structure of the graph that is produced by a Graph Transformer is determined by the nature of the Graph Transformer, but it may also depend on the value of the parameters and on the input. It may also be interesting to consider Graph Transformer modules which attempt to learn the structure of the output graph. This might be considered a combinatorial problem and not amenable to Gradient-Based Learning, but a solution to this problem is to generate a large graph that contains the graph candidates as sub-graphs, and then select the appropriate sub-graph.
:::
:::success
目前為止我們討論過的系統類型中(以及第X章節說明的應用),關於圖轉換器所生成圖的結構的多數知識,是由圖轉換器本身的性質決定的,但它也可能取決於參數值與輸入。考慮試著學習輸出圖結構的圖轉換器模組可能也會很有趣。這可能被認為是組合問題,不適合基於梯度的學習,但解決該問題的一種方法是生成一個包含候選圖做為子圖的大圖,然後從中選擇適當的子圖。
:::
### E. GTN and Hidden Markov Models
:::info
GTNs can be seen as a generalization and an extension of HMMs. On the one hand, the probabilistic interpretation can be either kept (with penalties being log-probabilities), pushed to the final decision stage (with the difference of the constrained forward penalty and the unconstrained forward penalty being interpreted as negative log-probabilities of label sequences), or dropped altogether (the network just represents a decision surface for label sequences in input space). On the other hand, Graph Transformer Networks extend HMMs by allowing to combine in a well-principled framework multiple levels of processing, or multiple models (e.g., Pereira et al have been using the transducer framework for stacking HMMs representing different levels of processing in automatic speech recognition \[86\]).
:::
:::success
GTNs可以視為HMMs的概括與擴展。一方面,機率解釋既可以保留(懲罰為對數機率),也可以推到最終決策階段(約束正向懲罰與無約束正向懲罰之差被解釋為標籤序列的負對數機率),也可以完全放棄(網路僅表示輸入空間中標籤序列的一個決策面)。另一方面,Graph Transformer Networks透過允許在一個具有良好原則的框架中組合[多層次](http://terms.naer.edu.tw/detail/2396728/)處理或多模型來擴展HMMs(例如,Pereira et al.使用轉換器框架堆疊表示自動語音辨識中不同處理級別的HMMs\[86\])。
:::
:::info
Unfolding a HMM in time yields a graph that is very similar to our interpretation graph (at the final stage of processing of the Graph Transformer Network, before Viterbi recognition). It has nodes $n(t, i)$ associated to each time step $t$ and state $i$ in the model. The penalty $c_i$ for an arc from $n(t - 1, j)$ to $n(t, i)$ then corresponds to the negative log-probability of emitting observed data $o_t$ at position $t$ and going from state $j$ to state $i$ in the time interval $(t - 1, t)$. With this probabilistic interpretation, the forward penalty is the negative logarithm of the likelihood of whole observed data sequence (given the model).
:::
:::success
在時間上展開HMM會生成一個與我們的解釋圖非常相似的圖(在[維特比](http://terms.naer.edu.tw/detail/6280079/)辨識之前,即Graph Transformer Network處理的最後階段)。它具有與模型中每一個[時步](http://terms.naer.edu.tw/detail/955441/)$t$與狀態$i$相關聯的節點$n(t, i)$。從$n(t - 1, j)$到$n(t, i)$的弧的懲罰$c_i$對應於在位置$t$[放射](http://terms.naer.edu.tw/detail/2363953/)觀測數據$o_t$,並在時間間隔$(t - 1, t)$中從狀態$j$轉移到狀態$i$的負對數機率。在這個機率解釋下,正向懲罰是整個觀測資料序列(給定模型)似然性的負對數。
:::
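上述機率解釋可以用一個小的Python示意呈現(假設性的簡化實作,非論文原始碼):弧懲罰為負對數機率,正向演算法在對數域以logadd(即對懲罰集合取$-\log\sum e^{-\cdot}$)累積,最後得到整個觀測序列似然性的負對數,即正向懲罰:

```python
import math

def logadd(penalties):
    # 懲罰域的 log-sum-exp:-log(sum(exp(-p))),以最小懲罰為基準保持數值穩定
    m = min(penalties)
    return m - math.log(sum(math.exp(m - p) for p in penalties))

def forward_penalty(arc_penalty, T, num_states):
    # arc_penalty(t, j, i):時間區間 (t-1, t) 從狀態 j 到狀態 i
    # 並放射觀測 o_t 的懲罰(負對數機率)
    f = [0.0] * num_states  # 初始懲罰(示意用,全部設為 0)
    for t in range(1, T + 1):
        f = [logadd([f[j] + arc_penalty(t, j, i) for j in range(num_states)])
             for i in range(num_states)]
    return logadd(f)  # 整個觀測序列似然性的負對數
```

單一狀態、每步懲罰固定為$c$時,正向懲罰恰為$Tc$,即沿唯一路徑的懲罰總和;多狀態時logadd在所有路徑上做機率意義下的加總。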
:::info
In Section VI we mentioned that the collapsing phenomenon can occur when non-discriminative loss functions are used to train neural networks/HMM hybrid systems. With classical HMMs with fixed preprocessing, this problem does not occur because the parameters of the emission and transition probability models are forced to satisfy certain probabilistic constraints: the sum or the integral of the probabilities of a random variable over its possible values must be 1. Therefore, when the probability of certain events is increased, the probability of other events must automatically be decreased. On the other hand, if the probabilistic assumptions in an HMM (or other probabilistic model) are not realistic, discriminative training, discussed in Section VI, can improve performance as this has been clearly shown for speech recognition systems \[48\], \[49\], \[50\], \[107\], \[108\].
:::
:::success
在第六章節中,我們提到當使用無區別性損失函數來訓練神經網路/HMM混合系統的時候,可能會發生崩潰現象。對於具有固定預處理的經典HMM,這個問題並不會發生,因為[發射](http://terms.naer.edu.tw/detail/2363953/)與[轉移機率](http://terms.naer.edu.tw/detail/2126604/)模型的參數被強制滿足某些機率約束:隨機變數在其可能值上的機率總和或積分必須為1。因此,當某些事件的機率增加,其它事件的機率就必須自動減少。另一方面,如果HMM(或其它機率模型)的機率假設不切實際,第六章節討論的區別性訓練可以提高效能,正如在語音辨識系統中已清楚顯示的\[48\],\[49\],\[50\],\[107\],\[108\]。
:::
:::info
The Input-Output HMM model (IOHMM) \[105\], \[109\], is strongly related to graph transformers. Viewed as a probabilistic model, an IOHMM represents the conditional distribution of output sequences given input sequences (of the same or a different length). It is parameterized from an emission probability module and a transition probability module. The emission probability module computes the conditional emission probability of an output variable (given an input value and the value of a discrete "state" variable). The transition probability module computes conditional transition probabilities of a change in the value of the "state" variable, given an input value. Viewed as a graph transformer, it assigns an output graph (representing a probability distribution over the sequences of the output variable) to each path in the input graph. All these output graphs have the same structure, and the penalties on their arcs are simply added in order to obtain the complete output graph. The input values of the emission and transition modules are read off the data structure on the input arcs of the IOHMM Graph Transformer. In practice, the output graph may be very large, and needs not be completely instantiated (i.e., it is pruned: only the low penalty paths are created).
:::
:::success
輸出入HMM模型(IOHMM)\[105\], \[109\]與圖轉換器密切相關。從機率模型的角度來看,IOHMM表示給定輸入序列(長度相同或不同)下輸出序列的條件分佈。它由一個[發射機率](http://terms.naer.edu.tw/detail/383582/)模組與一個[轉移機率](http://terms.naer.edu.tw/detail/2126604/)模組參數化。[發射機率](http://terms.naer.edu.tw/detail/383582/)模組計算輸出變數的條件[發射機率](http://terms.naer.edu.tw/detail/383582/)(給定輸入值與離散"狀態"變數的值)。[轉移機率](http://terms.naer.edu.tw/detail/2126604/)模組計算在給定輸入值下,"狀態"變數值發生變化的條件[轉移機率](http://terms.naer.edu.tw/detail/2126604/)。將它視為圖轉換器的話,它將一個輸出圖(表示輸出變數序列上的機率分佈)分配給輸入圖中的每一條路徑。所有這些輸出圖具有相同的結構,只需將它們弧上的懲罰相加就可以得到完整的輸出圖。發射模組與轉移模組的輸入值從IOHMM圖轉換器的輸入弧上的資料結構中讀出。實際上,輸出圖可能很大,不需要完全實例化(即加以修剪:僅建立低懲罰路徑)。
:::