
Gradient-Based Learning Applied to Document Recognition_Paper(翻譯)(I, II)

tags: LeNet-5 CNN 論文翻譯 deeplearning Gradient-Based Learning Applied to Document Recognition

說明

區塊如下分類,原文區塊為藍底,翻譯區塊為綠底,部份專業用語翻譯參考國家教育研究院

原文

翻譯

任何的翻譯不通暢部份都請留言指導

I. INTRODUCTION

Over the last several years, machine learning techniques, particularly when applied to neural networks, have played an increasingly important role in the design of pattern recognition systems. In fact, it could be argued that the availability of learning techniques has been a crucial factor in the recent success of pattern recognition applications such as continuous speech recognition and handwriting recognition.

The main message of this paper is that better pattern recognition systems can be built by relying more on automatic learning, and less on hand-designed heuristics. This is made possible by recent progress in machine learning and computer technology. Using character recognition as a case study, we show that hand-crafted feature extraction can be advantageously replaced by carefully designed learning machines that operate directly on pixel images. Using document understanding as a case study, we show that the traditional way of building recognition systems by manually integrating individually designed modules can be replaced by a unified and well principled design paradigm, called Graph Transformer Networks, that allows training all the modules to optimize a global performance criterion.

過去幾年裡,機器學習技術,尤其是應用於神經網路時,在圖形辨識系統的設計中有著日益重要的角色。事實上,可以這麼說,學習技術的可用性已經是圖形辨識應用程式近來成功的關鍵因素,像是連續語音識別與手寫辨識。

這篇論文的主要訊息是,可以透過更多地依賴自動學習、更少地依賴手工設計的啟發式方法,來建置更好的圖形辨識系統。機器學習與電腦技術的最新進展讓這成為可能。以character recognition為例,我們說明了手工製作的特徵提取可以被精心設計、直接操作像素影像的學習機器所取代。以document understanding為例,我們說明,利用手動整合單獨設計的模組來建置辨識系統的傳統方式,可以被一個統一而且有原則的設計模式所取代,稱為Graph Transformer Networks,它允許訓練所有的模組來最佳化一個全域性能標準。

Since the early days of pattern recognition it has been known that the variability and richness of natural data, be it speech, glyphs, or other types of patterns, make it almost impossible to build an accurate recognition system entirely by hand. Consequently, most pattern recognition systems are built using a combination of automatic learning techniques and hand-crafted algorithms. The usual method of recognizing individual patterns consists in dividing the system into two main modules shown in figure 1. The first module, called the feature extractor, transforms the input patterns so that they can be represented by low-dimensional vectors or short strings of symbols that (a) can be easily matched or compared, and (b) are relatively invariant with respect to transformations and distortions of the input patterns that do not change their nature. The feature extractor contains most of the prior knowledge and is rather specific to the task. It is also the focus of most of the design effort, because it is often entirely hand-crafted. The classifier, on the other hand, is often general-purpose and trainable. One of the main problems with this approach is that the recognition accuracy is largely determined by the ability of the designer to come up with an appropriate set of features. This turns out to be a daunting task which, unfortunately, must be redone for each new problem. A large amount of the pattern recognition literature is devoted to describing and comparing the relative merits of different feature sets for particular tasks.

從早期的圖形辨識開始就已經知道,自然資料的可變性與豐富性,不論是語音、字符,或是其它類型的圖形,使得幾乎不可能完全靠人工來建置一個精確的辨識系統。因此,大部份的圖形辨識系統的建置都是結合自動學習技術以及手工演算法。辨識單個圖形的常用方式是將系統劃分為兩個主要模組,如Figure 1.所示。第一個模組稱為特徵提取器,它轉換輸入的圖形,以便它們可以用低維度的向量或短符號字串來表示,使其(a)可以容易地被匹配或比較,並且(b)對於不改變其本質的輸入圖形轉換與變形具有相對的不變性。特徵提取器包含大部份的先驗知識,而且相當依特定任務而定。它也是大部份設計工作的重點,因為它通常完全是手工製作的。另一方面,分類器通常是通用而且可訓練的。這種方法的主要問題之一在於辨識準確率很大程度取決於設計人員提出一個適當特徵集的能力。事實證明,這是一項艱鉅的任務,而且很不幸的是,每一個新問題都需要重做。大部份的圖形辨識文獻都致力於描述與比較不同特徵集在特定任務上的相對優劣。


Fig. 1. Traditional pattern recognition is performed with two modules: a fixed feature extractor, and a trainable classifier.
Fig. 1. 傳統圖形辨識透過兩個模組執行:一個固定的特徵提取以及一個可訓練的分類器。
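
As a rough illustration of the two-module pipeline in Fig. 1, the sketch below (a toy example of my own, not anything from the paper) pairs a fixed, hand-designed feature extractor with a generic trainable classifier. Both the feature choice and the untrained linear classifier are purely hypothetical.

```python
import numpy as np

def handcrafted_features(image):
    """Stand-in for the fixed, hand-designed feature extractor of Fig. 1.
    Here: mean ink density per row and per column (an arbitrary choice)."""
    return np.concatenate([image.mean(axis=0), image.mean(axis=1)])

class LinearClassifier:
    """Stand-in for the general-purpose trainable classifier of Fig. 1."""
    def __init__(self, n_features, n_classes, rng):
        self.W = rng.normal(scale=0.01, size=(n_classes, n_features))

    def predict(self, features):
        return int(np.argmax(self.W @ features))

rng = np.random.default_rng(0)
clf = LinearClassifier(n_features=56, n_classes=10, rng=rng)  # 28 + 28 features
digit = rng.random((28, 28))                                  # a toy input pattern
print(clf.predict(handcrafted_features(digit)))
```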

Historically, the need for appropriate feature extractors was due to the fact that the learning techniques used by the classifiers were limited to low-dimensional spaces with easily separable classes [1]. A combination of three factors has changed this vision over the last decade. First, the availability of low-cost machines with fast arithmetic units allows one to rely more on brute-force "numerical" methods than on algorithmic refinements. Second, the availability of large databases for problems with a large market and wide interest, such as handwriting recognition, has enabled designers to rely more on real data and less on hand-crafted feature extraction to build recognition systems. The third and very important factor is the availability of powerful machine learning techniques that can handle high-dimensional inputs and can generate intricate decision functions when fed with these large data sets. It can be argued that the recent progress in the accuracy of speech and handwriting recognition systems can be attributed in large part to an increased reliance on learning techniques and large training data sets. As evidence to this fact, a large proportion of modern commercial OCR systems use some form of multi-layer Neural Network trained with back-propagation.

歷史上,需要適當的特徵提取器是基於一個事實:分類器使用的學習技術僅限於具有容易分離類別的低維度空間中[1]。過去十年中,三個因素共同改變了這個情況。首先,具有快速運算單元的低成本機器的可用性允許更依賴蠻力的"數值"方法,而不是演算法的改進。第二,大型市場和廣泛興趣的問題(像是手寫辨識)具有大型資料庫的可用性,讓設計人員能夠更依賴實際資料、減少手工製作的特徵提取來建置辨識系統。第三點,也是非常重要的因素,就是強大的機器學習技術的可用性,它可以處理高維度的輸入,並且在給定大資料集的時候可以生成複雜的決策函數。可以這麼說,語音、手寫辨識系統最近在準確率上的進展,很大部份可以歸因於對學習技術以及大量訓練資料集的日益依賴。做為這事實的證明,現代商業OCR系統很大一部份使用某種形式的多層神經網路,而且是以反向傳播訓練。

In this study, we consider the tasks of handwritten character recognition (Sections I and II) and compare the performance of several learning techniques on a benchmark data set for handwritten digit recognition (Section III). While more automatic learning is beneficial, no learning technique can succeed without a minimal amount of prior knowledge about the task. In the case of multi-layer neural networks, a good way to incorporate knowledge is to tailor its architecture to the task. Convolutional Neural Networks[2] introduced in Section II are an example of specialized neural network architectures which incorporate knowledge about the invariances of 2D shapes by using local connection patterns, and by imposing constraints on the weights. A comparison of several methods for isolated handwritten digit recognition is presented in section III. To go from the recognition of individual characters to the recognition of words and sentences in documents, the idea of combining multiple modules trained to reduce the overall error is introduced in Section IV. Recognizing variable-length objects such as handwritten words using multi-module systems is best done if the modules manipulate directed graphs.

This leads to the concept of trainable Graph Transformer Network (GTN) also introduced in Section IV. Section V describes the now classical method of heuristic over-segmentation for recognizing words or other character strings. Discriminative and non-discriminative gradient-based techniques for training a recognizer at the word level without requiring manual segmentation and labeling are presented in Section VI. Section VII presents the promising Space-Displacement Neural Network approach that eliminates the need for segmentation heuristics by scanning a recognizer at all possible locations on the input. In section VIII, it is shown that trainable Graph Transformer Networks can be formulated as multiple generalized transductions, based on a general graph composition algorithm.

The connections between GTNs and Hidden Markov Models, commonly used in speech recognition, are also treated. Section IX describes a globally trained GTN system for recognizing handwriting entered in a pen computer. This problem is known as "online" handwriting recognition, since the machine must produce immediate feedback as the user writes. The core of the system is a Convolutional Neural Network. The results clearly demonstrate the advantages of training a recognizer at the word level, rather than training it on pre-segmented, hand-labeled, isolated characters. Section X describes a complete GTN-based system for reading handwritten and machine-printed bank checks. The core of the system is the Convolutional Neural Network called LeNet-5 described in Section II. This system is in commercial use in the NCR Corporation line of check recognition systems for the banking industry. It is reading millions of checks per month in several banks across the United States.

在這項研究中,我們考慮手寫字符辨識任務(第一、第二章節),並且在用於手寫數字辨識的基準資料集上比較了幾種學習技術的效能(第三章節)。雖然更多的自動學習是有益的,但任何學習技術若沒有關於任務的最基本先驗知識,都無法成功。在多層神經網路的情況下,整合知識的一個好方法就是依任務調整其架構。在第二章節介紹的卷積神經網路[2]是一個特殊的神經網路架構範例,它透過使用局部連結模式以及在權重上施加約束的方式,來整合關於2D形狀不變性的知識。第三章節中介紹幾種孤立手寫數字辨識方法的比較。為了從單個字符的辨識推進到文件中的詞彙、語句辨識,在第四章節中介紹結合多個訓練模組來減少整體誤差的想法。使用多模組系統辨識可變長度的物件(像是手寫單詞)時,若各模組操作的是有向圖,效果最好。

這引出第四章節介紹的可訓練的Graph Transformer Network (GTN)的概念。第五章節說明目前用於辨識文字或其它字串的啟發式過度分割的經典方法。第六章節介紹具鑑別性與非鑑別性的基於梯度的技術,用於訓練詞彙級別的識別器,而不需要手動分割和標記。第七章節介紹一個有發展性的Space-Displacement Neural Network方法,通過在輸入端所有可能的位置上掃描辨識器,消除了分割啟發法的需求。第八章節中說明了,基於通用的圖合成(graph composition)演算法,可訓練的Graph Transformer Networks可以被表述為多個廣義轉換。

還討論了語音辨識中常用的GTNs與隱馬可夫模型之間的關聯。第九章描述一個全域訓練的GTN系統,用於辨識筆式電腦的手寫輸入。這問題被稱為"線上"手寫辨識,因為機器必須在使用者書寫的同時立即產生回饋。系統的核心就是卷積神經網路。結果明確地證明了訓練一個詞彙級別識別器的優點,而不是在預先分割、手工標記的獨立字符上訓練。第十章描述一個基於GTN的完整系統,用於讀取手寫與機器列印的銀行支票。系統的核心是第二章說明的稱為LeNet-5的卷積神經網路。這個系統已經商業化應用於NCR Corporation的銀行業支票辨識系統產品線。系統每個月在美國多家銀行中讀取數百萬張支票。

A. Learning from Data

There are several approaches to automatic machine learning, but one of the most successful approaches, popularized in recent years by the neural network community, can be called "numerical" or gradient-based learning. The learning machine computes a function $Y^p = F(Z^p, W)$, where $Z^p$ is the $p$-th input pattern, and $W$ represents the collection of adjustable parameters in the system. In a pattern recognition setting, the output $Y^p$ may be interpreted as the recognized class label of pattern $Z^p$, or as scores or probabilities associated with each class. A loss function $E^p = \mathcal{D}(D^p, F(W, Z^p))$ measures the discrepancy between $D^p$, the "correct" or desired output for pattern $Z^p$, and the output produced by the system. The average loss function $E_{train}(W)$ is the average of the errors $E^p$ over a set of labeled examples called the training set $\{(Z^1,D^1),\ldots,(Z^P,D^P)\}$. In the simplest setting, the learning problem consists in finding the value of $W$ that minimizes $E_{train}(W)$. In practice, the performance of the system on a training set is of little interest. The more relevant measure is the error rate of the system in the field, where it would be used in practice. This performance is estimated by measuring the accuracy on a set of samples disjoint from the training set, called the test set. Much theoretical and experimental work [3], [4], [5] has shown that the gap between the expected error rate on the test set $E_{test}$ and the error rate on the training set $E_{train}$ decreases with the number of training samples approximately as

$$E_{test} - E_{train} = k(h/P)^{\alpha} \tag{1}$$

where $P$ is the number of training samples, $h$ is a measure of "effective capacity" or complexity of the machine [6], [7], $\alpha$ is a number between 0.5 and 1.0, and $k$ is a constant. This gap always decreases when the number of training samples increases. Furthermore, as the capacity $h$ increases, $E_{train}$ decreases. Therefore, when increasing the capacity $h$, there is a trade-off between the decrease of $E_{train}$ and the increase of the gap, with an optimal value of the capacity $h$ that achieves the lowest generalization error $E_{test}$. Most learning algorithms attempt to minimize $E_{train}$ as well as some estimate of the gap. A formal version of this is called structural risk minimization [6], [7], and is based on defining a sequence of learning machines of increasing capacity, corresponding to a sequence of subsets of the parameter space such that each subset is a superset of the previous subset. In practical terms, Structural Risk Minimization is implemented by minimizing $E_{train} + \beta H(W)$, where the function $H(W)$ is called a regularization function, and $\beta$ is a constant. $H(W)$ is chosen such that it takes large values on parameters $W$ that belong to high-capacity subsets of the parameter space. Minimizing $H(W)$ in effect limits the capacity of the accessible subset of the parameter space, thereby controlling the trade-off between minimizing the training error and minimizing the expected gap between the training error and test error.

有很多種實現自動機器學習的方法,但其中最成功、近幾年在神經網路社群中被推廣的方法之一,可以稱為"數值的"或基於梯度的學習。學習機器計算函數 $Y^p = F(Z^p, W)$,其中 $Z^p$ 是第 $p$ 個輸入圖形,而 $W$ 代表系統中可調整參數的集合。在圖形辨識的設定中,輸出 $Y^p$ 可以解釋為圖形 $Z^p$ 的辨識類別標籤,或是與每個類別相關的分數或機率。損失函數 $E^p = \mathcal{D}(D^p, F(W, Z^p))$ 量測 $D^p$(即圖形 $Z^p$ 的"正確"或期望輸出)與系統產生的輸出之間的差異。平均損失函數 $E_{train}(W)$ 是在一組稱為訓練集 $\{(Z^1,D^1),\ldots,(Z^P,D^P)\}$ 的標記樣本上的誤差 $E^p$ 的平均值。在最簡單的設置中,學習問題就是找出能最小化 $E_{train}(W)$ 的 $W$ 值。實務上,系統在訓練集上的效能並不是重點;更相關的量測是系統在實際使用場合中的誤差率。這個效能是透過量測一組與訓練集不相交的樣本(稱為測試集)上的準確率來估計的。許多理論與實驗工作[3],[4],[5]已經說明,測試集上的期望誤差率 $E_{test}$ 與訓練集上的誤差率 $E_{train}$ 之間的差距,會隨著訓練樣本數的增加而減少,近似於

$$E_{test} - E_{train} = k(h/P)^{\alpha} \tag{1}$$

其中 $P$ 是訓練樣本的數量,$h$ 是"有效容量"或機器複雜度的量測[6],[7],$\alpha$ 是介於 0.5 到 1.0 之間的數值,$k$ 是一個常數。當訓練樣本數增加時,這個差距總是會減少。另外,當容量 $h$ 增加時,$E_{train}$ 會減少。因此,增加容量 $h$ 時,在 $E_{train}$ 的減少與差距的增加之間存在一個折衷,其中存在一個最佳的容量 $h$ 值,可以達到最低的泛化誤差 $E_{test}$。大多數的學習演算法都嘗試最小化 $E_{train}$ 以及某種對差距的估計。其正式的版本稱為結構風險最小化[6],[7],它基於定義一系列容量遞增的學習機器,對應於參數空間的一系列子集,使得每個子集都是前一個子集的超集。實務上,結構風險最小化(SRM)是透過最小化 $E_{train} + \beta H(W)$ 來實作,其中函數 $H(W)$ 稱為正規化函數,$\beta$ 為常數。$H(W)$ 的選擇方式是讓它在屬於參數空間高容量子集的參數 $W$ 上取較大的值。最小化 $H(W)$ 實際上限制了參數空間可達子集的容量,從而控制最小化訓練誤差與最小化訓練誤差和測試誤差之間期望差距的平衡。
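
A minimal sketch of the Structural Risk Minimization recipe above: the training loss $E_{train}$ is augmented with $\beta H(W)$. The toy linear model and the squared-norm regularizer below are assumptions for illustration only; the paper does not prescribe a specific $H(W)$.

```python
import numpy as np

def regularized_loss(W, Z, D, beta=1e-3):
    """E_train(W) + beta * H(W) for a toy linear model Y = Z @ W.

    Z : (P, d) input patterns, D : (P, k) desired outputs.
    H(W) is taken here as the squared L2 norm, one common (hypothetical)
    choice of regularization function penalizing high-capacity W.
    """
    Y = Z @ W                                        # model outputs F(Z^p, W)
    E_train = np.mean(np.sum((Y - D) ** 2, axis=1))  # average per-pattern loss E^p
    H = np.sum(W ** 2)                               # regularization function H(W)
    return E_train + beta * H
```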

B. Gradient-Based Learning

The general problem of minimizing a function with respect to a set of parameters is at the root of many issues in computer science. Gradient-Based Learning draws on the fact that it is generally much easier to minimize a reasonably smooth, continuous function than a discrete (combinatorial) function. The loss function can be minimized by estimating the impact of small variations of the parameter values on the loss function. This is measured by the gradient of the loss function with respect to the parameters. Efficient learning algorithms can be devised when the gradient vector can be computed analytically (as opposed to numerically through perturbations). This is the basis of numerous gradient-based learning algorithms with continuous-valued parameters. In the procedures described in this article, the set of parameters $W$ is a real-valued vector, with respect to which $E(W)$ is continuous, as well as differentiable almost everywhere. The simplest minimization procedure in such a setting is the gradient descent algorithm where $W$ is iteratively adjusted as follows:

$$W_k = W_{k-1} - \epsilon \frac{\partial E(W)}{\partial W} \tag{2}$$

In the simplest case, $\epsilon$ is a scalar constant. More sophisticated procedures use variable $\epsilon$, or substitute it for a diagonal matrix, or substitute it for an estimate of the inverse Hessian matrix as in Newton or Quasi-Newton methods. The Conjugate Gradient method [8] can also be used. However, Appendix B shows that despite many claims to the contrary in the literature, the usefulness of these second-order methods to large learning machines is very limited.

A popular minimization procedure is the stochastic gradient algorithm, also called the online update. It consists in updating the parameter vector using a noisy, or approximated, version of the average gradient. In the most common instance of it, $W$ is updated on the basis of a single sample:

$$W_k = W_{k-1} - \epsilon \frac{\partial E^{p_k}(W)}{\partial W} \tag{3}$$

With this procedure the parameter vector fluctuates around an average trajectory, but usually converges considerably faster than regular gradient descent and second-order methods on large training sets with redundant samples (such as those encountered in speech or character recognition). The reasons for this are explained in Appendix B. The properties of such algorithms applied to learning have been studied theoretically since the 1960's [9], [10], [11], but practical successes for non-trivial tasks did not occur until the mid eighties.

就一組參數最小化一個函數的一般性問題,是電腦科學中許多問題的根源。基於梯度的學習利用了這個事實:通常最小化一個合理平滑的連續函數,比最小化離散(組合)函數容易得多。損失函數可以透過估測參數值的微小變化對損失函數的影響來最小化,而這是透過損失函數相對於參數的梯度來量測的。當梯度向量可以解析地計算(而不是透過擾動以數值方式計算)時,就可以設計出有效率的學習演算法。這就是許多具有連續數值參數的基於梯度的學習演算法的基礎。在本文描述的過程中,參數集合 $W$ 是一個實數向量,$E(W)$ 對它而言是連續的,而且幾乎處處可微。在這種情況下,最簡單的最小化過程就是梯度下降演算法,其中 $W$ 的迭代調整如下:

$$W_k = W_{k-1} - \epsilon \frac{\partial E(W)}{\partial W} \tag{2}$$

在最簡單的情況中,$\epsilon$ 是一個純量常數。更複雜的過程使用可變的 $\epsilon$,或以對角矩陣替代,或像牛頓法、準牛頓法一樣以 inverse Hessian matrix 的估測值替代。共軛梯度法[8]也可以使用。然而,附錄B說明,儘管文獻中有很多相反的說法,這些二階方法對大型學習機器的實用性非常有限。

一個受歡迎的最小化過程是隨機梯度演算法,也稱為線上更新。它使用帶有雜訊的、或近似的平均梯度版本來更新參數向量。在最常見的情況下,$W$ 是基於單一樣本更新的:

$$W_k = W_{k-1} - \epsilon \frac{\partial E^{p_k}(W)}{\partial W} \tag{3}$$

透過這個過程,參數向量會在一個平均軌跡附近波動,但在具有冗餘樣本(例如語音或字符辨識中遇到的樣本)的大型訓練集上,其收斂速度通常比一般的梯度下降法和二階方法快得多。相關原因在附錄B中解釋。自1960年代以來,這類用於學習的演算法的性質已經有理論上的研究[9],[10],[11],但直到80年代中期才在非平凡的任務上取得實際成功。
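
The two update rules of Eqs. (2) and (3) can be sketched as follows on a toy least-squares problem; `epsilon` plays the role of the scalar learning rate, and the data and model are an invented example, not anything from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.normal(size=(256, 8))        # training patterns Z^p
D = Z @ rng.normal(size=(8, 1))      # desired outputs D^p (toy linear target)
W = np.zeros((8, 1))
epsilon = 0.01

def grad_single(W, z, d):
    """Gradient of the per-sample squared error E^p with respect to W."""
    return 2.0 * z[:, None] * (z[None, :] @ W - d)

# Eq. (2): batch gradient descent -- average gradient over the whole training set.
for _ in range(100):
    grad = np.mean([grad_single(W, Z[p], D[p]) for p in range(len(Z))], axis=0)
    W = W - epsilon * grad

# Eq. (3): stochastic (online) update -- one randomly drawn sample per step.
for _ in range(100):
    p = rng.integers(len(Z))
    W = W - epsilon * grad_single(W, Z[p], D[p])
```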

C. Gradient Back-Propagation

Gradient-Based Learning procedures have been used since the late 1950's, but they were mostly limited to linear systems [1]. The surprising usefulness of such simple gradient descent techniques for complex machine learning tasks was not widely realized until the following three events occurred. The first event was the realization that, despite early warnings to the contrary [12], the presence of local minima in the loss function does not seem to be a major problem in practice. This became apparent when it was noticed that local minima did not seem to be a major impediment to the success of early non-linear gradient-based learning techniques such as Boltzmann machines [13], [14]. The second event was the popularization by Rumelhart, Hinton and Williams [15] and others of a simple and efficient procedure, the back-propagation algorithm, to compute the gradient in a nonlinear system composed of several layers of processing. The third event was the demonstration that the back-propagation procedure applied to multi-layer neural networks with sigmoidal units can solve complicated learning tasks. The basic idea of back-propagation is that gradients can be computed efficiently by propagation from the output to the input. This idea was described in the control theory literature of the early sixties [16], but its application to machine learning was not generally realized then. Interestingly, the early derivations of back-propagation in the context of neural network learning did not use gradients, but "virtual targets" for units in intermediate layers [17], [18], or minimal disturbance arguments [19]. The Lagrange formalism used in the control theory literature provides perhaps the best rigorous method for deriving back-propagation [20] and for deriving generalizations of back-propagation to recurrent networks [21], and networks of heterogeneous modules [22]. A simple derivation for generic multi-layer systems is given in Section I-E.

The fact that local minima do not seem to be a problem for multi-layer neural networks is somewhat of a theoretical mystery. It is conjectured that if the network is oversized for the task (as is usually the case in practice), the presence of "extra dimensions" in parameter space reduces the risk of unattainable regions. Back-propagation is by far the most widely used neural-network learning algorithm, and probably the most widely used learning algorithm of any form.

1950年代晚期就已經開始使用基於梯度的學習過程,但幾乎侷限在線性系統[1]。簡單的梯度下降技術應用於複雜的機器學習任務上的驚人效果,一直到下面三件事發生之後才被廣泛認識。第一件事是認識到,儘管早期存在相反的警告[12],但損失函數中存在的局部最小值在實務中似乎不是一個很大的問題。當注意到局部最小值似乎不是早期非線性基於梯度的學習技術(像是Boltzmann machines [13],[14])成功的主要障礙時,這一點就變得顯而易見。第二件事是由Rumelhart、Hinton與Williams [15]等人推廣的一種簡單而有效率的過程,即反向傳播演算法,用於在一個由多層處理組成的非線性系統中計算梯度。第三件事是證明了將反向傳播過程應用在帶有sigmoidal units的多層神經網路,可以解決複雜的學習任務。反向傳播的基本概念是,梯度可以透過從輸出到輸入的傳播來有效率地計算。這個想法在六十年代初的控制理論文獻中已有描述[16],只是當時對它在機器學習上的應用並沒有普遍的認知。有趣的是,早期在神經網路學習脈絡下的反向傳播推導並沒有使用梯度,而是使用中間層單元的"虛擬目標"[17],[18],或最小干擾論證[19]。控制理論文獻中使用的拉格朗吉形式主義(Lagrange formalism)或許為推導反向傳播[20],以及推導反向傳播對遞迴網路[21]與異質模組網路[22]的泛化,提供了最嚴謹的方法。第I-E節提供一般多層系統的簡單推導。

事實上,對於多層神經網路而言,局部最小值似乎不是問題,這件事理論上還是一謎。可以推測,如果對任務的網路配置過大(實務上通常是這種情況),那在參數空間中存在額外的維度會降低無法到達區域的風險。反向傳播是目前為止使用最廣泛的神經網路學習演算法,也可能是任何形式中使用最廣泛的學習演算法。

D. Learning in Real Handwriting Recognition Systems

Isolated handwritten character recognition has been extensively studied in the literature (see [23],[24] for reviews), and was one of the early successful applications of neural networks [25]. Comparative experiments on recognition of individual handwritten digits are reported in Section III. They show that neural networks trained with Gradient-Based Learning perform better than all other methods tested here on the same data. The best neural networks, called Convolutional Networks, are designed to learn to extract relevant features directly from pixel images (see Section II).

One of the most difficult problems in handwriting recognition, however, is not only to recognize individual characters, but also to separate out characters from their neighbors within the word or sentence, a process known as segmentation. The technique for doing this that has become the "standard" is called Heuristic Over-Segmentation. It consists in generating a large number of potential cuts between characters using heuristic image processing techniques, and subsequently selecting the best combination of cuts based on scores given for each candidate character by the recognizer. In such a model, the accuracy of the system depends upon the quality of the cuts generated by the heuristics, and on the ability of the recognizer to distinguish correctly segmented characters from pieces of characters, multiple characters, or otherwise incorrectly segmented characters. Training a recognizer to perform this task poses a major challenge because of the difficulty in creating a labeled database of incorrectly segmented characters. The simplest solution consists in running the images of character strings through the segmenter, and then manually labeling all the character hypotheses. Unfortunately, not only is this an extremely tedious and costly task, it is also difficult to do the labeling consistently. For example, should the right half of a cut up 4 be labeled as a 1 or as a noncharacter? should the right half of a cut up 8 be labeled as a 3?

The first solution, described in Section V consists in training the system at the level of whole strings of characters, rather than at the character level. The notion of Gradient-Based Learning can be used for this purpose. The system is trained to minimize an overall loss function which measures the probability of an erroneous answer. Section V explores various ways to ensure that the loss function is differentiable, and therefore lends itself to the use of Gradient-Based Learning methods. Section V introduces the use of directed acyclic graphs whose arcs carry numerical information as a way to represent the alternative hypotheses, and introduces the idea of GTN.

The second solution described in Section VII is to eliminate segmentation altogether. The idea is to sweep the recognizer over every possible location on the input image, and to rely on the "character spotting" property of the recognizer, i.e. its ability to correctly recognize a well-centered character in its input field, even in the presence of other characters besides it, while rejecting images containing no centered characters [26], [27]. The sequence of recognizer outputs obtained by sweeping the recognizer over the input is then fed to a Graph Transformer Network that takes linguistic constraints into account and finally extracts the most likely interpretation. This GTN is somewhat similar to Hidden Markov Models (HMM), which makes the approach reminiscent of the classical speech recognition [28], [29]. While this technique would be quite expensive in the general case, the use of Convolutional Neural Networks makes it particularly attractive because it allows significant savings in computational cost.

單獨的手寫字符辨識在文獻中已經被廣泛地研究(回顧請參閱[23],[24]),也是神經網路早期成功的應用之一[25]。第三章報告了對個別手寫數字辨識的比較實驗。這些實驗證明了在相同資料集上以基於梯度的學習訓練的神經網路,其效能優於這裡測試的所有其它方法。最好的神經網路稱為卷積網路,設計來學習直接從像素影像提取相關特徵(見第二章)。

然而,手寫辨識中最困難的問題之一,不僅僅是要辨識個別的字符,還要將字符從單詞或語句中的相鄰字符分離出來,這過程稱為分割。這個已經成為"標準"的技術稱為啟發式過度分割。它使用啟發式影像處理技術在字符之間生成大量潛在的切割,然後根據識別器為每個候選字符給出的分數,選擇切割的最佳組合。在這樣的模型裡,系統的準確率取決於啟發式生成的切割品質,以及識別器從字符片段、多個字符或其它不正確分割的字符中區分出正確分割字符的能力。訓練一個識別器來執行這樣的任務帶來了重大的挑戰,因為很難建立一個不正確分割字符的標記資料庫。最簡單的解決方法是將字串影像跑過分割器,然後手動標記所有字符假設。不幸的是,這不僅是一個十分乏味而且高成本的任務,也非常難以一致地執行標記。舉例來說,4的右半部要標為1或是非字符?8的右半部是否標記為3?

第五章描述的第一個解決方案,是在整個字串級別上訓練系統,而不是字符級別。基於梯度的學習概念可以用於這個目的。系統被訓練來最小化一個整體損失函數(用來量測錯誤答案的機率)。第五章探討確保損失函數可微的各種方法,因而適合使用基於梯度的學習方法。第五章介紹有向非循環圖的使用,其弧線帶有數值資訊,做為表示各種替代假設的一種方式,並且介紹GTN的概念。

第二種解決方案描述於第七章,為完全消除分割。想法上是讓識別器掃過輸入影像上的每一個可能位置,並且依靠識別器"字符偵測(character spotting)"的特性,亦即,即使輸入欄位中旁邊存在其它字符,它還是有能力正確辨識居中的字符,同時拒絕不含居中字符的影像[26],[27]。通過將識別器掃過輸入而獲得的識別器輸出序列,被饋送到考慮語言限制的Graph Transformer Network,最後提取出最可能的解釋。這個GTN有點類似於Hidden Markov Models(HMM,隱馬可夫模型),這使得該方法讓人聯想起經典的語音辨識方法[28],[29]。儘管這個技術在一般情況下會十分昂貴,使用卷積神經網路讓它特別有吸引力,因為它可以很顯著地節省計算成本。

E. Globally Trainable Systems

As stated earlier, most practical pattern recognition systems are composed of multiple modules. For example, a document recognition system is composed of a field locator, which extracts regions of interest, a field segmenter, which cuts the input image into images of candidate characters, a recognizer, which classifies and scores each candidate character, and a contextual post-processor, generally based on a stochastic grammar, which selects the best grammatically correct answer from the hypotheses generated by the recognizer. In most cases, the information carried from module to module is best represented as graphs with numerical information attached to the arcs. For example, the output of the recognizer module can be represented as an acyclic graph where each arc contains the label and the score of a candidate character, and where each path represents an alternative interpretation of the input string. Typically, each module is manually optimized, or sometimes trained, outside of its context. For example, the character recognizer would be trained on labeled images of pre-segmented characters. Then the complete system is assembled, and a subset of the parameters of the modules is manually adjusted to maximize the overall performance. This last step is extremely tedious, time-consuming, and almost certainly suboptimal.

A better alternative would be to somehow train the entire system so as to minimize a global error measure such as the probability of character misclassifications at the document level. Ideally, we would want to find a good minimum of this global loss function with respect to all the parameters in the system. If the loss function $E$ measuring the performance can be made differentiable with respect to the system's tunable parameters $W$, we can find a local minimum of $E$ using Gradient-Based Learning. However, at first glance, it appears that the sheer size and complexity of the system would make this intractable.

To ensure that the global loss function $E^p(Z^p, W)$ is differentiable, the overall system is built as a feed-forward network of differentiable modules. The function implemented by each module must be continuous and differentiable almost everywhere with respect to the internal parameters of the module (e.g. the weights of a Neural Net character recognizer in the case of a character recognition module), and with respect to the module's inputs. If this is the case, a simple generalization of the well-known back-propagation procedure can be used to efficiently compute the gradients of the loss function with respect to all the parameters in the system [22]. For example, let us consider a system built as a cascade of modules, each of which implements a function $X_n = F_n(W_n, X_{n-1})$, where $X_n$ is a vector representing the output of the module, $W_n$ is the vector of tunable parameters in the module (a subset of $W$), and $X_{n-1}$ is the module's input vector (as well as the previous module's output vector). The input $X_0$ to the first module is the input pattern $Z^p$. If the partial derivative of $E^p$ with respect to $X_n$ is known, then the partial derivatives of $E^p$ with respect to $W_n$ and $X_{n-1}$ can be computed using the backward recurrence

$$\frac{\partial E^p}{\partial W_n} = \frac{\partial F}{\partial W}(W_n, X_{n-1}) \frac{\partial E^p}{\partial X_n}$$
$$\frac{\partial E^p}{\partial X_{n-1}} = \frac{\partial F}{\partial X}(W_n, X_{n-1}) \frac{\partial E^p}{\partial X_n} \tag{4}$$

where $\frac{\partial F}{\partial W}(W_n, X_{n-1})$ is the Jacobian of $F$ with respect to $W$ evaluated at the point $(W_n, X_{n-1})$, and $\frac{\partial F}{\partial X}(W_n, X_{n-1})$ is the Jacobian of $F$ with respect to $X$. The Jacobian of a vector function is a matrix containing the partial derivatives of all the outputs with respect to all the inputs. The first equation computes some terms of the gradient of $E^p(W)$, while the second equation generates a backward recurrence, as in the well-known back-propagation procedure for neural networks. We can average the gradients over the training patterns to obtain the full gradient. It is interesting to note that in many instances there is no need to explicitly compute the Jacobian matrix. The above formula uses the product of the Jacobian with a vector of partial derivatives, and it is often easier to compute this product directly without computing the Jacobian beforehand. By analogy with ordinary multi-layer neural networks, all but the last module are called hidden layers because their outputs are not observable from the outside. In more complex situations than the simple cascade of modules described above, the partial derivative notation becomes somewhat ambiguous and awkward. A completely rigorous derivation in more general cases can be done using Lagrange functions [20], [21], [22].

如之前說明的,大多數實用的圖形辨識系統都是由多個模組所組成。舉例來說,一個文件辨識系統包含一個欄位定位器來提取感興趣的區域,一個欄位分割器將輸入影像切割為候選字符的影像,一個識別器來為每一個候選字符分類與評分,以及一個上下文後置處理器(通常基於隨機文法),它從識別器生成的假設中選擇文法正確的最佳答案。大多數情況,模組到模組之間傳遞的訊息最好用圖(graph)來表示,並將數值資訊附加在弧線上。舉例來說,識別器模組的輸出可以表示為非循環圖,每個弧都包含一個候選字符的標記與得分,而每一條路徑都代表輸入字串的一種替代解釋。通常,每一個模組都是手動最佳化,或有時在其脈絡之外單獨訓練。舉例來說,字符識別器會以預分割字符的標記影像來訓練。然後,組裝完整的系統,並手動調整模組參數的子集讓整體性能最大化。最後這個步驟非常冗長乏味、耗時,而且幾乎可以肯定不是最佳的。

一個更好的替代方案是以某種方式訓練整個系統,以最小化一個全域的誤差量測,例如文件層級上字符錯誤分類的機率。理想上,我們希望相對於系統中的所有參數,找到這個全域損失函數的一個好的最小值。如果量測效能的損失函數 $E$ 相對於系統的可調參數 $W$ 是可微的,那麼我們就可以使用基於梯度的學習找到 $E$ 的局部最小值。然而,乍看之下,系統的龐大和複雜性似乎會讓這個問題變得棘手。

要確保全域損失函數 $E^p(Z^p, W)$ 是可微的,整個系統要建置為由可微模組組成的前饋網路。每一個模組實作的函數,相對於模組的內部參數(例如,對於字符辨識模組而言,即神經網路字符識別器的權重)以及相對於模組的輸入,都必須是連續且幾乎處處可微的。如果是這種情況,就可以使用眾所皆知的反向傳播過程的簡單一般化,有效率地計算損失函數相對於系統中所有參數的梯度[22]。舉例來說,讓我們考慮一個建置為級聯模組的系統,每一個模組都實作一個函數 $X_n = F_n(W_n, X_{n-1})$,其中 $X_n$ 是代表模組輸出的向量,$W_n$ 是模組中可調參數的向量($W$ 的子集),而 $X_{n-1}$ 是模組的輸入向量(也是前一個模組的輸出向量)。第一個模組的輸入 $X_0$ 就是輸入圖形 $Z^p$。如果已知 $E^p$ 相對於 $X_n$ 的偏導數,那麼就可以使用下列反向遞迴計算 $E^p$ 相對於 $W_n$ 與 $X_{n-1}$ 的偏導數

$$\frac{\partial E^p}{\partial W_n} = \frac{\partial F}{\partial W}(W_n, X_{n-1}) \frac{\partial E^p}{\partial X_n}$$
$$\frac{\partial E^p}{\partial X_{n-1}} = \frac{\partial F}{\partial X}(W_n, X_{n-1}) \frac{\partial E^p}{\partial X_n} \tag{4}$$

其中 $\frac{\partial F}{\partial W}(W_n, X_{n-1})$ 是 $F$ 相對於 $W$ 在點 $(W_n, X_{n-1})$ 計算的 Jacobian,而 $\frac{\partial F}{\partial X}(W_n, X_{n-1})$ 是 $F$ 相對於 $X$ 的 Jacobian。向量函數的 Jacobian 是一個矩陣,包含所有輸出相對於所有輸入的偏導數。第一個方程式計算 $E^p(W)$ 的梯度的某些項,而第二個方程式產生一個反向遞迴,正如眾所皆知的神經網路反向傳播過程。我們可以對所有訓練圖形的梯度取平均值,以得到完整的梯度。有趣的是,在很多情況下並不需要顯式地計算 Jacobian matrix。上面的公式使用 Jacobian 與偏導數向量的乘積,而直接計算這個乘積通常比事先計算 Jacobian 更容易。與普通的多層神經網路類比,除了最後一個模組之外的所有模組都稱為隱藏層,因為它們的輸出無法從外部觀察。在比上面描述的簡單級聯模組更複雜的情況下,偏導數的符號會變得有點模棱兩可與彆扭。更一般的情況下,可以使用 Lagrange functions 做完全嚴謹的推導[20],[21],[22]。

Traditional multi-layer neural networks are a special case of the above where the state information $X_n$ is represented with fixed-sized vectors, and where the modules are alternated layers of matrix multiplications (the weights) and component-wise sigmoid functions (the neurons). However, as stated earlier, the state information in complex recognition systems is best represented by graphs with numerical information attached to the arcs. In this case, each module, called a Graph Transformer, takes one or more graphs as input, and produces a graph as output. Networks of such modules are called Graph Transformer Networks (GTN). Sections IV, VI and VIII develop the concept of GTNs, and show that Gradient-Based Learning can be used to train all the parameters in all the modules so as to minimize a global loss function. It may seem paradoxical that gradients can be computed when the state information is represented by essentially discrete objects such as graphs, but that difficulty can be circumvented, as shown later.

傳統的多層神經網路是上述情況的一個特例,其中狀態資訊 $X_n$ 用固定大小的向量來表示,而模組是矩陣乘法(權重)與逐元素 sigmoid functions(神經元)的交替層。但是,如之前所說,複雜辨識系統中的狀態資訊最好由弧線上附加數值資訊的圖來表示。這種情況下,每一個稱為 Graph Transformer 的模組都以一個或多個圖做為輸入,並生成一個圖做為輸出。這種模組構成的網路稱為 Graph Transformer Networks (GTN)。第四、六、八章發展了GTNs的概念,並說明基於梯度的學習可以用來訓練所有模組中的所有參數,從而最小化一個全域損失函數。當狀態資訊由本質上離散的物件(例如圖)表示時仍可以計算梯度,這看起來似乎自相矛盾,但這個困難是可以避免的,如後文所示。
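
The backward recurrence of Eq. (4) above can be illustrated with a small cascade of differentiable modules. The sketch below is a toy example of my own (a hypothetical `AffineModule` with a tanh squashing, not the paper's modules); each module's `backward` maps ∂E/∂X_n to (∂E/∂W_n, ∂E/∂X_{n-1}) using vector-Jacobian products, so the Jacobians are never formed explicitly, as noted in the text.

```python
import numpy as np

class AffineModule:
    """X_n = F_n(W_n, X_{n-1}) = tanh(W_n @ X_{n-1})."""
    def __init__(self, n_out, n_in, rng):
        self.W = rng.normal(scale=0.1, size=(n_out, n_in))

    def forward(self, x_prev):
        self.x_prev = x_prev
        self.a = self.W @ x_prev
        return np.tanh(self.a)

    def backward(self, dE_dx):
        """Given dE/dX_n, return dE/dW_n and dE/dX_{n-1} (Eq. 4)."""
        dE_da = dE_dx * (1.0 - np.tanh(self.a) ** 2)   # back through the squashing
        dE_dW = np.outer(dE_da, self.x_prev)           # (dF/dW)^T applied to dE/dX_n
        dE_dx_prev = self.W.T @ dE_da                  # (dF/dX)^T applied to dE/dX_n
        return dE_dW, dE_dx_prev

rng = np.random.default_rng(0)
modules = [AffineModule(16, 32, rng), AffineModule(10, 16, rng)]

x = rng.normal(size=32)            # input pattern Z^p
for m in modules:                  # forward pass through the cascade
    x = m.forward(x)

dE_dx = 2.0 * (x - np.ones(10))    # gradient of a toy squared-error loss at the output
for m in reversed(modules):        # backward recurrence of Eq. (4)
    dE_dW, dE_dx = m.backward(dE_dx)
```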

II. Convolutional Neural Networks for Isolated Character Recognition

The ability of multi-layer networks trained with gradient descent to learn complex, high-dimensional, non-linear mappings from large collections of examples makes them obvious candidates for image recognition tasks. In the traditional model of pattern recognition, a hand-designed feature extractor gathers relevant information from the input and eliminates irrelevant variabilities. A trainable classifier then categorizes the resulting feature vectors into classes. In this scheme, standard, fully-connected multi-layer networks can be used as classifiers. A potentially more interesting scheme is to rely as much as possible on learning in the feature extractor itself. In the case of character recognition, a network could be fed with almost raw inputs (e.g. size-normalized images). While this can be done with an ordinary fully connected feed-forward network with some success for tasks such as character recognition, there are problems.

Firstly, typical images are large, often with several hundred variables (pixels). A fully-connected first layer with, say one hundred hidden units in the first layer, would already contain several tens of thousands of weights. Such a large number of parameters increases the capacity of the system and therefore requires a larger training set. In addition, the memory requirement to store so many weights may rule out certain hardware implementations. But, the main deficiency of unstructured nets for image or speech applications is that they have no built-in invariance with respect to translations, or local distortions of the inputs. Before being sent to the fixed-size input layer of a neural net, character images, or other 2D or 1D signals, must be approximately size-normalized and centered in the input field. Unfortunately, no such preprocessing can be perfect: handwriting is often normalized at the word level, which can cause size, slant, and position variations for individual characters. This, combined with variability in writing style, will cause variations in the position of distinctive features in input objects. In principle, a fully-connected network of sufficient size could learn to produce outputs that are invariant with respect to such variations. However, learning such a task would probably result in multiple units with similar weight patterns positioned at various locations in the input so as to detect distinctive features wherever they appear on the input. Learning these weight configurations requires a very large number of training instances to cover the space of possible variations. In convolutional networks, described below, shift invariance is automatically obtained by forcing the replication of weight configurations across space.

Secondly, a deficiency of fully-connected architectures is that the topology of the input is entirely ignored. The input variables can be presented in any (fixed) order without affecting the outcome of the training. On the contrary, images (or time-frequency representations of speech) have a strong 2D local structure: variables (or pixels) that are spatially or temporally nearby are highly correlated. Local correlations are the reasons for the well-known advantages of extracting and combining local features before recognizing spatial or temporal objects, because configurations of neighboring variables can be classified into a small number of categories (e.g. edges, corners). Convolutional Networks force the extraction of local features by restricting the receptive fields of hidden units to be local.

以梯度下降訓練的多層網路,從大量範例集中學習複雜、高維度、非線性映射的能力,讓它們明顯成為影像辨識任務的候選者。在傳統的圖形辨識模型中,手工設計的特徵提取器從輸入收集相關訊息,並消除不相關的變化。然後,一個可訓練的分類器將所得特徵向量歸類。在這個方案中,標準的、全連接的多層網路可以用來做為分類器。一個可能更有趣的方案是盡可能地依賴特徵提取器本身的學習。在字符辨識的情況下,網路可以餵入幾乎是原始的輸入(例如大小標準化的影像)。儘管這可以用普通的全連接前饋網路來完成,而且在字符辨識等任務上也有一些成效,但仍然存在一些問題。

首先,典型的影像很大,通常具有數百個變數(像素)。一個全連接的第一層,假設有一百個隱藏單元,就已經包含數萬個權重。這麼大量的參數增加了系統的容量,因此需要更大的訓練資料集。此外,保存這麼多權重的記憶體需求可能會排除某些硬體實現。但是,用於影像或語音應用的非結構網路的主要缺陷在於,它們對於平移或輸入的局部失真沒有內建的不變性。在將字符影像,或是其它2D、1D信號,送到神經網路固定大小的輸入層之前,必須要做近似的大小標準化,並在輸入欄位中居中。不幸的是,沒有任何這種預處理是完美的:手寫通常在單詞級別上做標準化,這可能導致單一字符的大小、傾斜角度、位置變化。再加上書寫風格的變化性,將導致輸入物件中獨特特徵的位置變化。原則上,一個足夠大的全連接網路可以學習生成對這種變化不變的輸出。然而,學習這類任務可能會導致多個具有相似權重模式的單元被放置在輸入的不同位置,以便偵測輸入中任何位置出現的獨特特徵。學習這些權重配置需要大量的訓練實例來覆蓋可能的變化空間。在卷積網路中(下面說明),移位不變性是通過強制跨空間複製權重配置來自動獲得的。

其次,全連接架構的不足在於輸入的拓撲完全被忽略。輸入的變數可以以任何(固定)順序呈現,而不影響訓練的結果。相反地,影像(或語音的時頻表示)具有很強的2D局部結構:空間或時間上相鄰的變數(或像素)具有高度相關性。局部相關性正是眾所皆知的「在辨識空間或時間物件之前先提取並組合局部特徵」這項優勢的原因,因為相鄰變數的配置可以被分類為少量類別(像是邊、角)。卷積網路透過將隱藏單元的感受域(大陸譯為感受野)限制為局部,來強制提取局部特徵。

A. Convolutional Networks

Convolutional Networks combine three architectural ideas to ensure some degree of shift, scale, and distortion invariance: local receptive fields, shared weights (or weight replication), and spatial or temporal sub-sampling. A typical convolutional network for recognizing characters, dubbed LeNet-5 is shown in figure 2. The input plane receives images of characters that are approximately size-normalized and centered. Each unit in a layer receives inputs from a set of units located in a small neighborhood in the previous layer. The idea of connecting units to local receptive fields on the input goes back to the Perceptron in the early 60s, and was almost simultaneous with Hubel and Wiesel's discovery of locally-sensitive, orientation-selective neurons in the cat's visual system [30]. Local connections have been used many times in neural models of visual learning [31], [32], [18], [33], [34], [2]. With local receptive fields, neurons can extract elementary visual features such as oriented edges, end-points, corners (or similar features in other signals such as speech spectrograms). These features are then combined by the subsequent layers in order to detect higher-order features. As stated earlier, distortions or shifts of the input can cause the position of salient features to vary. In addition, elementary feature detectors that are useful on one part of the image are likely to be useful across the entire image. This knowledge can be applied by forcing a set of units, whose receptive fields are located at different places on the image, to have identical weight vectors [32],[15], [34]. Units in a layer are organized in planes within which all the units share the same set of weights. The set of outputs of the units in such a plane is called a feature map. Units in a feature map are all constrained to perform the same operation on different parts of the image. A complete convolutional layer is composed of several feature maps (with different weight vectors), so that multiple features can be extracted at each location. A concrete example of this is the first layer of LeNet-5 shown in Figure 2. Units in the first hidden layer of LeNet-5 are organized in 6 planes, each of which is a feature map. A unit in a feature map has 25 inputs connected to a 5 by 5 area in the input, called the receptive field of the unit. Each unit has 25 inputs, and therefore 25 trainable coefficients plus a trainable bias. The receptive fields of contiguous units in a feature map are centered on correspondingly contiguous units in the previous layer. Therefore receptive fields of neighboring units overlap. For example, in the first hidden layer of LeNet-5, the receptive fields of horizontally contiguous units overlap by 4 columns and 5 rows. As stated earlier, all the units in a feature map share the same set of 25 weights and the same bias so they detect the same feature at all possible locations on the input. The other feature maps in the layer use different sets of weights and biases, thereby extracting different types of local features. In the case of LeNet-5 at each input location six different types of features are extracted by six units in identical locations in the six feature maps. A sequential implementation of a feature map would scan the input image with a single unit that has a local receptive field, and store the states of this unit at corresponding locations in the feature map. 
This operation is equivalent to a convolution, followed by an additive bias and squashing function, hence the name convolutional network. The kernel of the convolution is the set of connection weights used by the units in the feature map. An interesting property of convolutional layers is that if the input image is shifted, the feature map output will be shifted by the same amount, but will be left unchanged otherwise. This property is at the basis of the robustness of convolutional networks to shifts and distortions of the input.

卷積網路結合三種架構的思維來確保某種程度的移位、縮放、失真不變性:局部感受域,共享權重(或權重複製),以及空間或時間子採樣。圖二說明一個典型用於辨識字符的卷積網路,稱為LeNet-5。輸入平面接收字符的影像,大致是大小標準化而且居中的。一層中的每一個單元從上一層小鄰域中的一組單元接收輸入。將單元連接到輸入端的局部感受域的想法可以追溯到六十年代初的感知器,幾乎跟Hubel與Wiesel在貓的視覺系統中發現局部敏感、定向選擇性神經元同時[30]。局部連接已經在視覺學習的神經模型中多次使用[31],[32],[18],[33],[34],[2]。利用局部感受域,神經元可以提取基本的視覺特徵,像是定向邊緣、端點、角(或其它信號裡的相似特徵,如語音聲譜圖)。然後,這些特徵由後續的層組合,以便偵測更高階的特徵。如之前說明,輸入的失真或移位都會導致顯著特徵的位置發生變化。此外,在影像的一小部份上有用的基本特徵偵測器很可能對整張影像都有用。這項知識可以透過強制一組感受域位於影像上不同位置的單元具有相同的權重向量來應用[32],[15],[34]。一層網路層中的單元組織在平面中,同一平面內的所有單元共享相同的權重集合。在這樣的平面中的單元輸出集合稱為feature map。feature map中的所有單元都受到約束,對影像的不同部位執行相同的操作。一個完整的卷積層由多個feature maps組成(具有不同的權重向量),因此可以在每一個位置提取多個特徵。關於這點的一個具體範例是圖二中LeNet-5的第一層。LeNet-5的第一個隱藏層的單元組織成六個平面,每一個平面都是一個feature map。feature map中的一個單元具有二十五個輸入,這二十五個輸入連接到輸入中的五乘五區域,稱為該單元的感受域。每一個單元具有二十五個輸入,因此有二十五個可訓練的係數加上一個可訓練的偏差。feature map中,連續單元的感受域集中在前一層中相應的連續單元上。因此,相鄰單元的感受域重疊。舉例來說,LeNet-5的第一個隱藏層中,水平連續單元的感受域重疊四列與五行。如早前說明,feature map中的所有單元共享相同的二十五個權重集合以及相同的偏差,因此它們可以在輸入的所有可能位置偵測相同的特徵。網路層中的其它feature map使用不同的權重與偏差集合,從而提取不同類型的局部特徵。在LeNet-5的情況中,每一個輸入位置上,由六個feature map中相同位置的六個單元提取六種不同類型的特徵。feature map的循序實現會使用具有局部感受域的單個單元掃描輸入影像,並將該單元的狀態存儲在feature map中的相應位置。這個操作相當於卷積,接著是additive bias與擠壓函數(squashing function),因此稱為卷積網路。卷積核(kernel)就是feature map中各單元使用的連接權重集合。卷積層的一個有趣的特性是,如果輸入的影像發生移位,那麼feature map的輸出也會移位相同的量,否則將保持不變。這個特性是卷積網路對於輸入的移位以及失真的魯棒性的基礎。
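
A minimal NumPy sketch of one feature map as described above: a single shared 5x5 weight set plus one bias is swept over every location of a 32x32 input, giving a 28x28 map as in the first layer of LeNet-5. The plain `tanh` squashing here is a placeholder assumption; the squashing function actually used in the paper is given later in Eq. (6).

```python
import numpy as np

def feature_map(image, kernel, bias):
    """Compute one convolutional feature map with shared 5x5 weights.

    image  : (32, 32) input plane
    kernel : (5, 5) shared weight set of the feature map
    bias   : scalar trainable bias shared by all units
    """
    k = kernel.shape[0]
    out = np.empty((image.shape[0] - k + 1, image.shape[1] - k + 1))  # 28x28
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            receptive_field = image[i:i + k, j:j + k]        # local 5x5 neighborhood
            out[i, j] = np.sum(receptive_field * kernel) + bias
    return np.tanh(out)                                      # squashing (placeholder)

rng = np.random.default_rng(0)
fmap = feature_map(rng.normal(size=(32, 32)), rng.normal(size=(5, 5)), 0.1)
print(fmap.shape)   # (28, 28)
```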

Once a feature has been detected, its exact location becomes less important. Only its approximate position relative to other features is relevant. For example, once we know that the input image contains the endpoint of a roughly horizontal segment in the upper left area, a corner in the upper right area, and the endpoint of a roughly vertical segment in the lower portion of the image, we can tell the input image is a 7. Not only is the precise position of each of those features irrelevant for identifying the pattern, it is potentially harmful because the positions are likely to vary for different instances of the character. A simple way to reduce the precision with which the position of distinctive features are encoded in a feature map is to reduce the spatial resolution of the feature map. This can be achieved with a so-called sub-sampling layers which performs a local averaging and a sub-sampling, reducing the resolution of the feature map, and reducing the sensitivity of the output to shifts and distortions. The second hidden layer of LeNet-5 is a sub-sampling layer. This layer comprises six feature maps, one for each feature map in the previous layer. The receptive field of each unit is a 2 by 2 area in the previous layer's corresponding feature map. Each unit computes the average of its four inputs, multiplies it by a trainable coefficient, adds a trainable bias, and passes the result through a sigmoid function. Contiguous units have non-overlapping contiguous receptive fields. Consequently, a sub-sampling layer feature map has half the number of rows and columns as the feature maps in the previous layer. The trainable coefficient and bias control the effect of the sigmoid non-linearity. If the coefficient is small, then the unit operates in a quasilinear mode, and the sub-sampling layer merely blurs the input. If the coefficient is large, sub-sampling units can be seen as performing a "noisy OR" or a "noisy AND" function depending on the value of the bias. Successive layers of convolutions and sub-sampling are typically alternated, resulting in a "bi-pyramid": at each layer, the number of feature maps is increased as the spatial resolution is decreased. Each unit in the third hidden layer in figure 2 may have input connections from several feature maps in the previous layer. The convolution/sub-sampling combination, inspired by Hubel and Wiesel's notions of "simple" and "complex" cells, was implemented in Fukushima's Neocognitron [32], though no globally supervised learning procedure such as back-propagation was available then. A large degree of invariance to geometric transformations of the input can be achieved with this progressive reduction of spatial resolution compensated by a progressive increase of the richness of the representation (the number of feature maps).

Since all the weights are learned with back-propagation, convolutional networks can be seen as synthesizing their own feature extractor. The weight sharing technique has the interesting side effect of reducing the number of free parameters, thereby reducing the "capacity" of the machine and reducing the gap between test error and training error [34]. The network in figure 2 contains 340,908 connections, but only 60,000 trainable free parameters because of the weight sharing.

Fixed-size Convolutional Networks have been applied to many applications, among other handwriting recognition [35], [36], machine-printed character recognition [37], online handwriting recognition [38], and face recognition [39]. Fixed-size convolutional networks that share weights along a single temporal dimension are known as Time-Delay Neural Networks (TDNNs). TDNNs have been used in phoneme recognition (without subsampling) [40], [41], spoken word recognition (with sub-sampling) [42], [43], on-line recognition of isolated handwritten characters [44], and signature verification [45].

一旦偵測到特徵,它的確切位置就變得不再重要,只有它相對於其它特徵的大致位置是相關的。舉例來說,一旦我們知道輸入影像在左上方區域包含一個大致水平線段的端點,在右上方區域有一個角,以及在影像下半部份有一個大致垂直線段的端點,我們就可以判斷這個輸入影像是一個7。這些特徵的精確位置不僅與辨識該圖形無關,而且還可能有害,因為位置可能因字符的不同實例而有所變化。降低feature map中特徵位置編碼精度的一個簡單方法,是降低feature map的空間解析度。這可以透過所謂的分階抽樣層(sub-sampling layer,部份稱為子採樣)來達成,它執行局部平均以及分階抽樣,降低feature map的解析度,並降低輸出對移位以及失真的敏感度。LeNet-5的第二個隱藏層就是分階抽樣層。這一層包含六個feature maps,前一層中的每個feature map各對應一個。每個單元的感受域是上一層對應feature map中的一個2x2區域。每個單元計算其四個輸入的平均值,乘上一個可訓練係數,加上一個可訓練偏差,然後將結果通過sigmoid function傳送。連續單元擁有不重疊的連續感受域。因此,分階抽樣層的feature map的行、列數是前一層feature map的一半。可訓練係數以及偏差控制sigmoid非線性的影響。如果係數很小,那該單元會以準線性模式運作,而分階抽樣層僅是讓輸入模糊。如果係數很大,那麼分階抽樣單元可以視為依偏差值的不同,執行"noisy OR"或"noisy AND"函數。卷積層與分階抽樣層通常是交替的,這形成一個"雙金字塔":在每一層,隨著空間解析度降低,feature map的數量增多。圖二中,第三個隱藏層的每一個單元都可能具有來自前一層中數個feature maps的輸入連接。受Hubel與Wiesel的"簡單"與"複雜"細胞概念啟發的卷積/分階抽樣組合,已在Fukushima的Neocognitron [32]中實現,儘管當時還沒有像反向傳播這樣的全域監督學習程序。透過逐步增加表示的豐富性(feature maps的數量)來補償空間解析度的逐漸降低,可以實現對輸入幾何轉換的高度不變性。
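
Following the description above, a sub-sampling unit averages a non-overlapping 2x2 neighborhood, multiplies it by one trainable coefficient, adds one trainable bias, and passes the result through a sigmoid. The sketch below only illustrates that operation; the function and variable names are my own.

```python
import numpy as np

def subsample(feature_map, coeff, bias):
    """2x2 average sub-sampling with a trainable coefficient and bias.

    feature_map : (H, W) with H, W even, e.g. a 28x28 map from C1
    coeff, bias : the single trainable coefficient and bias of this map
    """
    h, w = feature_map.shape
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
    avg = blocks.mean(axis=(1, 3))                        # local averaging over 2x2 windows
    return 1.0 / (1.0 + np.exp(-(coeff * avg + bias)))    # sigmoidal squashing

rng = np.random.default_rng(0)
print(subsample(rng.normal(size=(28, 28)), coeff=0.5, bias=0.0).shape)  # (14, 14)
```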

由於所有的權重都是透過反向傳播學習,卷積網路可以視為是綜合它們自己的特徵提取器。權值共享技術有一個有趣的副作用,就是減少自由參數的數量,而從減少機器的"容量",並縮減測試誤差與訓練誤差之間的差距[34]。圖二中的網路包含340,908個連接,但是因為權值共享,因此只有60,000個可訓練自由參數。

固定大小的卷積網路已經應用在許多應用中,其中包括手寫辨識[35],[36],機器列印字符辨識[37],線上手寫辨識[38]以及臉部辨識[39]。沿著單一時間維度共享權重的固定大小卷積網路被稱為延時類神經網路(Time-Delay Neural Networks, TDNNs)。TDNNs已經應用於音素辨識(沒有分階抽樣)[40],[41],口語詞辨識(有分階抽樣)[42],[43],孤立手寫字符的線上辨識[44],以及簽名驗證[45]。

Fig. 2. Architecture of LeNet-5, a Convolutional Neural Network, here for digits recognition. Each plane is a feature map i.e. a set of units whose weights are constrained to be identical.

B. LeNet-5

This section describes in more detail the architecture of LeNet-5, the Convolutional Neural Network used in the experiments. LeNet-5 comprises 7 layers, not counting the input, all of which contain trainable parameters (weights). The input is a 32x32 pixel image. This is significantly larger than the largest character in the database (at most 20x20 pixels centered in a 28x28 field). The reason is that it is desirable that potential distinctive features such as stroke end-points or corner can appear in the center of the receptive field of the highest-level feature detectors. In LeNet-5 the set of centers of the receptive fields of the last convolutional layer (C3, see below) form a 20x20 area in the center of the 32x32 input. The values of the input pixels are normalized so that the background level (white) corresponds to a value of -0.1 and the foreground (black) corresponds to 1.175. This makes the mean input roughly 0, and the variance roughly 1 which accelerates learning [46].

In the following, convolutional layers are labeled Cx, sub-sampling layers are labeled Sx, and fully-connected layers are labeled Fx, where x is the layer index.

Layer C1 is a convolutional layer with 6 feature maps. Each unit in each feature map is connected to a 5x5 neighborhood in the input. The size of the feature maps is 28x28 which prevents connection from the input from falling off the boundary. C1 contains 156 trainable parameters, and 122,304 connections.

Layer S2 is a sub-sampling layer with 6 feature maps of size 14x14. Each unit in each feature map is connected to a 2x2 neighborhood in the corresponding feature map in C1. The four inputs to a unit in S2 are added, then multiplied by a trainable coefficient, and added to a trainable bias. The result is passed through a sigmoidal function. The 2x2 receptive fields are non-overlapping, therefore feature maps in S2 have half the number of rows and columns as feature maps in C1. Layer S2 has 12 trainable parameters and 5,880 connections.

Layer C3 is a convolutional layer with 16 feature maps. Each unit in each feature map is connected to several 5x5 neighborhoods at identical locations in a subset of S2's feature maps. Table I shows the set of S2 feature maps combined by each C3 feature map. Why not connect every S2 feature map to every C3 feature map? The reason is twofold. First, a non-complete connection scheme keeps the number of connections within reasonable bounds. More importantly, it forces a break of symmetry in the network. Different feature maps are forced to extract different (hopefully complementary) features because they get different sets of inputs. The rationale behind the connection scheme in table I is the following. The first six C3 feature maps take inputs from every contiguous subset of three feature maps in S2. The next six take input from every contiguous subset of four. The next three take input from some discontinuous subsets of four. Finally the last one takes input from all S2 feature maps. Layer C3 has 1,516 trainable parameters and 151,600 connections.

Layer S4 is a sub-sampling layer with 16 feature maps of size 5x5. Each unit in each feature map is connected to a 2x2 neighborhood in the corresponding feature map in C3, in a similar way as C1 and S2. Layer S4 has 32 trainable parameters and 2,000 connections.

Layer C5 is a convolutional layer with 120 feature maps. Each unit is connected to a 5x5 neighborhood on all 16 of S4's feature maps. Here, because the size of S4 is also 5x5, the size of C5's feature maps is 1x1: this amounts to a full connection between S4 and C5. C5 is labeled as a convolutional layer, instead of a fully-connected layer, because if LeNet-5 input were made bigger with everything else kept constant, the feature map dimension would be larger than 1x1. This process of dynamically increasing the size of a convolutional network is described in Section VII. Layer C5 has 48,120 trainable connections.

Layer F6 contains 84 units (the reason for this number comes from the design of the output layer, explained below) and is fully connected to C5. It has 10,164 trainable parameters.

這一節更詳細地說明實驗中使用的卷積神經網路LeNet-5的架構。LeNet-5包含七層(不計輸入層),所有層皆包含可訓練參數(權重)。輸入為32x32像素影像。這明顯大於資料庫中最大的字符(最大20x20像素,置中於28x28的區域)。原因在於,希望潛在的獨特特徵,像是筆劃的端點或邊角,可以出現在最高階特徵偵測器的感受域的中心。在LeNet-5中,最後一個卷積層(C3,見下說明)的感受域中心的集合,在32x32輸入的中心形成一個20x20的區域。輸入的像素值被標準化,使得背景(白)對應於值-0.1,前景(黑)對應於值1.175。這讓輸入的均值大致為0,方差大致為1,從而加速學習[46]。

下面開始,卷積層標記為Cx,分階抽樣層標記為Sx,而全連接層標記為Fx,其中x代表層的索引值。

C1層是卷積層,帶有六個feature maps。每一個feature map的每一個單元都連接到輸入中5x5的鄰域。feature maps的大小為28x28,這可以防止來自輸入的連接落到邊界之外。C1包含156個可訓練參數,以及122,304個連接。

S2層是分階抽樣層,帶有六個feature maps,大小為14x14。每一個feature maps的每一個單元都連接到C1中相應feature map中的2x2鄰域。S2中一個單元的四個輸入相加,然後乘上可訓練係數,再加上可訓練偏差。所得結果通過sigmoid function傳遞。2x2的接收域並不是重疊的,因此,S2的feature maps的行、列數是C1的feature maps的一半。S2有12個可訓練參數,以及5,880個連接。

C3層是卷積層,帶有十六個feature maps。每一個feature map的每一個單元都連接到S2的feature map子集中相同位置的幾個5x5的鄰域。表格I說明了每個C3 feature map所組合的S2 feature maps集合。為什麼不將每一個S2 feature map連接到每一個C3 feature map?原因有兩個。首先,一個不完整的連接方案將連接數保持在合理範圍內。更重要的是,它強制打破網路的對稱性。不同的feature maps被強制提取不同的特徵(希望是互補的),因為它們得到不同的輸入集。表格I中連接方案背後的原理如下。前六個C3 feature maps從S2中每三個相連feature maps的子集得到輸入。接下來六個從每四個相連的子集得到輸入。接下來三個從某些四個不相連的子集得到輸入。最後一個則從所有S2 feature maps得到輸入。C3層有1,516個可訓練參數,以及151,600個連接。

S4層是分階抽樣層,帶有十六個feature maps,大小為5x5。每一個feature map的每個單元都以與C1和S2之間類似的方式,連接到C3中相應feature map的2x2鄰域。S4層有32個可訓練參數,以及2,000個連接。

C5層是一個卷積層,帶有120個feature maps。每一個單元都連接到S4層全部16個feature maps上的5x5鄰域。在這裡,因為S4的大小也是5x5,因此C5的feature map大小是1x1:這相當於S4與C5之間的完全連接。C5被標記為卷積層,而不是全連接層,因為如果LeNet-5的輸入在其它條件保持不變的情況下變大,那feature map的尺寸將會大於1x1。動態增加卷積網路大小的過程在第七章中說明。C5層有48,120個可訓練連接。

F6層包含84個單元(這個數字的原因來自輸出層的設計,後面說明),並且與C5完全連接。它具有10,164個可訓練參數。
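
As a sanity check on the layer descriptions above, the quoted parameter and connection counts can be re-derived with a short calculation. This sketch only redoes the arithmetic of the text; the C3 line uses the connection scheme summarized in Table I (six maps fed by 3 S2 maps, nine fed by 4, one fed by all 6).

```python
# Re-deriving the trainable-parameter counts quoted in the text (a sketch, not
# an implementation).  C3 uses Table I: 6 maps see 3 S2 maps, 9 see 4, 1 sees 6.
c1 = 6 * (5 * 5 + 1)                                           # 156
s2 = 6 * 2                                                     # 12 (coefficient + bias per map)
c3 = 6 * (3 * 25 + 1) + 9 * (4 * 25 + 1) + 1 * (6 * 25 + 1)    # 1,516
s4 = 16 * 2                                                    # 32
c5 = 120 * (16 * 25 + 1)                                       # 48,120
f6 = 84 * (120 + 1)                                            # 10,164
print(c1, s2, c3, s4, c5, f6)

# Connection counts multiply by the size of each output map (28x28, 14x14, 10x10, 5x5):
print(c1 * 28 * 28)           # 122,304 connections in C1
print(6 * 14 * 14 * (4 + 1))  # 5,880 connections in S2
print(c3 * 10 * 10)           # 151,600 connections in C3
print(16 * 5 * 5 * (4 + 1))   # 2,000 connections in S4
```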

TABLE I. Each column indicates which feature maps in S2 are combined by the units in a particular feature map of C3.

As in classical neural networks, units in layers up to F6 compute a dot product between their input vector and their weight vector, to which a bias is added. This weighted sum, denoted $a_i$ for unit $i$, is then passed through a sigmoid squashing function to produce the state of unit $i$, denoted by $x_i$:

$$x_i = f(a_i) \tag{5}$$

The squashing function is a scaled hyperbolic tangent:

$$f(a) = A \tanh(Sa) \tag{6}$$

where $A$ is the amplitude of the function and $S$ determines its slope at the origin. The function $f$ is odd, with horizontal asymptotes at $+A$ and $-A$. The constant $A$ is chosen to be 1.7159. The rationale for this choice of a squashing function is given in Appendix A.

Finally, the output layer is composed of Euclidean Radial Basis Function units (RBF), one for each class, with 84 inputs each. The output of each RBF unit $y_i$ is computed as follows:

$$y_i = \sum_j (x_j - w_{ij})^2 \tag{7}$$

In other words, each output RBF unit computes the Euclidean distance between its input vector and its parameter vector. The further away the input is from the parameter vector, the larger the RBF output is. The output of a particular RBF can be interpreted as a penalty term measuring the fit between the input pattern and a model of the class associated with the RBF. In probabilistic terms, the RBF output can be interpreted as the unnormalized negative log-likelihood of a Gaussian distribution in the space of configurations of layer F6. Given an input pattern, the loss function should be designed so as to get the configuration of F6 as close as possible to the parameter vector of the RBF that corresponds to the pattern's desired class. The parameter vectors of these units were chosen by hand and kept fixed (at least initially). The components of those parameter vectors were set to -1 or +1. While they could have been chosen at random with equal probabilities for -1 and +1, or even chosen to form an error correcting code as suggested by [47], they were instead designed to represent a stylized image of the corresponding character class drawn on a 7x12 bitmap (hence the number 84). Such a representation is not particularly useful for recognizing isolated digits, but it is quite useful for recognizing strings of characters taken from the full printable ASCII set. The rationale is that characters that are similar, and therefore confusable, such as uppercase O, lowercase o, and zero, or lowercase l, digit 1, square brackets, and uppercase I, will have similar output codes. This is particularly useful if the system is combined with a linguistic post-processor that can correct such confusions. Because the codes for confusable classes are similar, the output of the corresponding RBFs for an ambiguous character will be similar, and the post-processor will be able to pick the appropriate interpretation. Figure 3 gives the output codes for the full ASCII set.

Another reason for using such distributed codes, rather than the more common "1 of N" code (also called place code, or grand-mother cell code) for the outputs is that non-distributed codes tend to behave badly when the number of classes is larger than a few dozens. The reason is that output units in a non-distributed code must be off most of the time. This is quite difficult to achieve with sigmoid units. Yet another reason is that the classifiers are often used to not only recognize characters, but also to reject non-characters. RBFs with distributed codes are more appropriate for that purpose because unlike sigmoids, they are activated within a well circumscribed region of their input space that non-typical patterns are more likely to fall outside of.

The parameter vectors of the RBFs play the role of target vectors for layer F6. It is worth pointing out that the components of those vectors are +1 or -1, which is well within the range of the sigmoid of F6, and therefore prevents those sigmoids from getting saturated. In fact, +1 and -1 are the points of maximum curvature of the sigmoids. This forces the F6 units to operate in their maximally non-linear range. Saturation of the sigmoids must be avoided because it is known to lead to slow convergence and ill-conditioning of the loss function.

與經典的神經網路一樣,直到F6層為止的各層單元都會計算其輸入向量與權重向量的內積,再加上偏差。這個加權總和,對於單元 $i$ 記為 $a_i$,然後通過 sigmoid squashing function 產生單元 $i$ 的狀態,記為 $x_i$:

$$x_i = f(a_i) \tag{5}$$

這個 squashing function 是縮放的雙曲正切函數:

$$f(a) = A \tanh(Sa) \tag{6}$$

其中 $A$ 是函數的振幅,$S$ 決定其在原點的斜率。函數 $f$ 是奇函數,水平漸近線在 $+A$ 與 $-A$ 處。常數 $A$ 設為 1.7159。附錄A提供選擇這個 squashing function 的理由。

最後,輸出層由 Euclidean Radial Basis Function units (RBF) 組成,每一個類別一個單元,每個單元有84個輸入。每一個 RBF 單元的輸出 $y_i$ 計算如下:

$$y_i = \sum_j (x_j - w_{ij})^2 \tag{7}$$

換句話說,每一個輸出 RBF 單元都會計算其輸入向量與參數向量之間的歐幾里德距離(Euclidean distance)。輸入離參數向量越遠,RBF 的輸出就越大。特定 RBF 的輸出可以解釋為懲罰項(penalty),用來量測輸入圖形與該 RBF 相關類別的模型之間的擬合度。以機率術語來說,RBF 的輸出可以解釋為 F6 層配置空間中高斯分佈的未正規化負對數概似。給定一個輸入圖形,損失函數的設計應讓 F6 的配置盡可能接近該圖形期望類別所對應 RBF 的參數向量。這些單元的參數向量是手動選擇而且保持固定的(至少初始時是固定的)。這些參數向量的分量被設為 -1 或 +1。儘管它們可以用 -1、+1 等機率隨機選擇,甚至按照[47]的建議選擇形成一個糾錯碼,但它們被設計為表示繪製在 7x12 點陣圖上的相應字符類別的風格化影像(因此是數字84)。這樣的表示對於辨識單獨的數字並不特別有用,但對辨識從完整可列印 ASCII 集中取出的字符串非常有用。理由是相似因而容易混淆的字符,像是大寫的 O、小寫的 o 與零,或小寫的 l、數字 1、方括號與大寫的 I,將會有相似的輸出編碼。如果系統與能夠糾正這類混淆的語言後處理器結合,這會特別有用。因為易混淆類別的編碼相似,所以歧義字符對應的 RBF 輸出也會相似,後處理器便能選擇適當的解釋。圖三給出完整 ASCII 集的輸出編碼。

使用這類分佈式編碼,而不是更常見的 "1 of N" 編碼(也稱為位置編碼或祖母細胞編碼)做為輸出的另一個原因是,當類別數量超過幾十個的時候,非分佈式編碼往往表現不佳。原因是,非分佈式編碼中的輸出單元在大多數時間必須是關閉的。這用 sigmoid 單元是非常難以做到的。另一個原因是,分類器通常不只用於辨識字符,還要用於拒絕非字符。具分佈式編碼的 RBF 更適合這個目的,因為不同於 sigmoid,它們是在輸入空間中一個明確限定的區域內被啟動,而非典型的圖形更有可能落在這個區域之外。

RBF 的參數向量在 F6 層中扮演目標向量的角色。值得指出的是,這些向量的分量是 +1 或 -1,剛好在 F6 的 sigmoid 範圍內,因此可以防止這些 sigmoid 飽和。事實上,+1 與 -1 是 sigmoid 的最大曲率點。這迫使 F6 單元在其最大非線性範圍內運作。必須避免 sigmoid 飽和,因為已知這會導致收斂緩慢以及損失函數的病態。

Fig. 3. Initial parameters of the output RBFs for recognizing the full ASCII set.
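
A small NumPy sketch of Eqs. (5)-(7): the scaled hyperbolic tangent applied to an F6-style weighted sum, followed by Euclidean RBF output units whose parameter vectors are +-1 codes. A = 1.7159 is the constant quoted in the text; the slope S and the random codes below are assumed values for illustration only (the actual S and the codes are given in Appendix A and Fig. 3 of the paper).

```python
import numpy as np

A = 1.7159          # amplitude of the squashing function (from the text)
S = 2.0 / 3.0       # slope at the origin -- an assumed value for illustration

def squash(a):
    """Scaled hyperbolic tangent of Eq. (6): f(a) = A * tanh(S * a)."""
    return A * np.tanh(S * a)

def rbf_outputs(x, W_rbf):
    """Eq. (7): y_i = sum_j (x_j - w_ij)^2 for every RBF unit i.

    x     : (84,) state of layer F6
    W_rbf : (n_classes, 84) fixed parameter vectors (entries +1 / -1)
    """
    return np.sum((x[None, :] - W_rbf) ** 2, axis=1)

rng = np.random.default_rng(0)
x_f6 = squash(rng.normal(size=84))                # F6 states x_i = f(a_i), Eq. (5)
W_rbf = rng.choice([-1.0, 1.0], size=(10, 84))    # one +-1 code per digit class (toy codes)
penalties = rbf_outputs(x_f6, W_rbf)              # smaller penalty = better fit
print(penalties.argmin())                         # predicted class
```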

C. Loss Function

The simplest output loss function that can be used with the above network is the Maximum Likelihood Estimation criterion (MLE), which in our case is equivalent to the Minimum Mean Squared Error (MSE). The criterion for a set of training samples is simply:

$$E(W) = \frac{1}{P}\sum_{p=1}^{P} y_{D^p}(Z^p, W) \tag{8}$$

where $y_{D^p}$ is the output of the $D^p$-th RBF unit, i.e. the one that corresponds to the correct class of input pattern $Z^p$. While this cost function is appropriate for most cases, it lacks three important properties. First, if we allow the parameters of the RBF to adapt, $E(W)$ has a trivial, but totally unacceptable, solution. In this solution, all the RBF parameter vectors are equal, and the state of F6 is constant and equal to that parameter vector. In this case the network happily ignores the input, and all the RBF outputs are equal to zero. This collapsing phenomenon does not occur if the RBF weights are not allowed to adapt. The second problem is that there is no competition between the classes. Such a competition can be obtained by using a more discriminative training criterion, dubbed the MAP (maximum a posteriori) criterion, similar to Maximum Mutual Information criterion sometimes used to train HMMs [48], [49], [50]. It corresponds to maximizing the posterior probability of the correct class $D^p$ (or minimizing the logarithm of the probability of the correct class), given that the input image can come from one of the classes or from a background "rubbish" class label. In terms of penalties, it means that in addition to pushing down the penalty of the correct class like the MSE criterion, this criterion also pulls up the penalties of the incorrect classes:

$$E(W) = \frac{1}{P}\sum_{p=1}^{P}\left(y_{D^p}(Z^p, W) + \log\left(e^{-j} + \sum_i e^{-y_i(Z^p, W)}\right)\right) \tag{9}$$

The negative of the second term plays a "competitive" role. It is necessarily smaller than (or equal to) the first term, therefore this loss function is positive. The constant $j$ is positive, and prevents the penalties of classes that are already very large from being pushed further up. The posterior probability of this rubbish class label would be the ratio of $e^{-j}$ and $e^{-j} + \sum_i e^{-y_i(Z^p, W)}$. This discriminative criterion prevents the previously mentioned "collapsing effect" when the RBF parameters are learned because it keeps the RBF centers apart from each other. In Section VI, we present a generalization of this criterion for systems that learn to classify multiple objects in the input (e.g., characters in words or in documents).

Computing the gradient of the loss function with respect to all the weights in all the layers of the convolutional network is done with back-propagation. The standard algorithm must be slightly modified to take account of the weight sharing. An easy way to implement it is to first compute the partial derivatives of the loss function with respect to each connection, as if the network were a conventional multilayer network without weight sharing. Then the partial derivatives of all the connections that share a same parameter are added to form the derivative with respect to that parameter.

Such a large architecture can be trained very efficiently, but doing so requires the use of a few techniques that are described in the appendix. Section A of the appendix describes details such as the particular sigmoid used, and the weight initialization. Section B and C describe the minimization procedure used, which is a stochastic version of a diagonal approximation to the Levenberg-Marquardt procedure.

可以跟上面說明的網路一起使用的最簡單的損失函數就是最大似然概率標準(MLE),在我們的情況中,它等價於最小均方誤差(MSE)。一組訓練樣本的標準很簡單:

$$E(W) = \frac{1}{P}\sum_{p=1}^{P} y_{D^p}(Z^p, W) \tag{8}$$

其中 $y_{D^p}$ 是第 $D^p$ 個RBF單元的輸出,即對應於輸入圖形 $Z^p$ 正確類別的那個單元。儘管這個成本函數適用於大多數情況,但它缺少三個重要的性質。首先,如果我們允許RBF的參數自適應,那 $E(W)$ 有一個平凡但完全不能接受的解。在這個解裡面,所有的RBF參數向量都相等,並且F6的狀態為常數且等於該參數向量。這種情況下,網路會很樂意地忽略輸入,而且所有的RBF輸出都為零。如果不允許RBF權重自適應,這種崩潰現象就不會發生。第二個問題是,類別之間沒有競爭。可以使用更具鑑別性的訓練準則,稱為MAP(最大後驗機率)準則,來獲得這類競爭;它類似於有時用於訓練HMMs的最大交互資訊準則[48],[49],[50]。這對應於最大化正確類別 $D^p$ 的後驗機率(或最小化正確類別機率的對數),前提是輸入影像可以來自其中一個類別,或來自背景的"垃圾"類別標記。就懲罰項而言,這意味著除了像MSE準則一樣壓低正確類別的懲罰項之外,這個準則還拉高了不正確類別的懲罰項:

$$E(W) = \frac{1}{P}\sum_{p=1}^{P}\left(y_{D^p}(Z^p, W) + \log\left(e^{-j} + \sum_i e^{-y_i(Z^p, W)}\right)\right) \tag{9}$$

第二項的負值起到了競爭的作用。它必然小於(或等於)第一項,因此這個損失函數為正。常數 $j$ 是正數,防止已經很大的類別懲罰項被進一步推高。這個垃圾類別標記的後驗機率就是 $e^{-j}$ 與 $e^{-j} + \sum_i e^{-y_i(Z^p, W)}$ 的比率。當學習RBF參數時,這個鑑別準則防止了先前提到的"崩潰效應",因為它讓RBF中心彼此分開。在第六章中,我們提出這個準則對於學習分類輸入中多個物件(例如單詞或文件中的字符)之系統的一般化。
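
A sketch of the per-sample terms of the two criteria, Eqs. (8) and (9), computed on top of the RBF penalties $y_i$. The helper names and the toy numbers are hypothetical; `j` is the positive constant of the rubbish-class term.

```python
import numpy as np

def mle_loss(penalties, correct_class):
    """Eq. (8) per sample: the MLE/MSE criterion is just the penalty of the correct class."""
    return penalties[correct_class]

def map_loss(penalties, correct_class, j=1.0):
    """Eq. (9) per sample: push down the correct class and pull up the competing ones."""
    competition = np.log(np.exp(-j) + np.sum(np.exp(-penalties)))
    return penalties[correct_class] + competition

penalties = np.array([12.3, 4.1, 9.8, 30.0, 7.7])   # toy RBF outputs y_i(Z^p, W)
print(mle_loss(penalties, correct_class=1))
print(map_loss(penalties, correct_class=1))
```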

通過反向傳播計算損失函數相對於卷積網路所有層中的所有權重的梯度。標準演算法必須被稍微調整,以考慮權重共享。一個簡單的實作方法就是,首先計算損失函數相對於每個連接的偏導數,就像該網路是沒有權重共享的常規多層網路一樣。然後,將共享同一參數的所有連接的偏導數相加,形成該參數的導數。
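
The weight-sharing adjustment described in the previous paragraph can be sketched as follows: per-connection partial derivatives are computed as if the weights were independent, and the derivatives of all connections that share a parameter are then summed. The index arrays below are hypothetical; this only illustrates the accumulation step.

```python
import numpy as np

def shared_weight_gradient(per_connection_grads, shared_index, n_params):
    """Accumulate per-connection gradients onto their shared parameters.

    per_connection_grads : (n_connections,) dE/d(connection weight), computed as
                           if the network had no weight sharing
    shared_index         : (n_connections,) index of the shared parameter that
                           each connection actually uses
    """
    grad = np.zeros(n_params)
    np.add.at(grad, shared_index, per_connection_grads)   # sum over tied connections
    return grad

# Toy example: 6 connections tied to 3 shared parameters.
g_conn = np.array([0.1, -0.2, 0.05, 0.3, -0.1, 0.2])
idx = np.array([0, 0, 1, 1, 2, 2])
print(shared_weight_gradient(g_conn, idx, n_params=3))   # [-0.1, 0.35, 0.1]
```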

這樣的大型架構可以非常有效率地訓練,但這麼做需要使用附錄中描述的一些技術。附錄A節說明了諸如所使用的特定sigmoid以及權重初始化等細節。附錄B、C節說明了所使用的最小化過程,它是Levenberg-Marquardt(雷文柏格-馬括特)過程之對角近似的隨機版本。