# Gradient-Based Learning Applied to Document Recognition_Paper(翻譯)(XI)

###### tags: `LeNet-5` `CNN` `論文翻譯` `deeplearning` `Gradient-Based Learning Applied to Document Recognition`

>[name=Shaoe.chen] [time=Thu, Feb 6, 2020 12:35 PM]

[TOC]

## 說明
區塊如下分類,原文區塊為藍底,翻譯區塊為綠底,部份專業用語翻譯參考國家教育研究院

:::info
原文
:::

:::success
翻譯
:::

:::warning
任何的翻譯不通暢部份都請留言指導
:::

:::danger
* [paper hyperlink](http://yann.lecun.com/exdb/publis/pdf/lecun-98.pdf)
* [\*paper hyperlink](http://vision.stanford.edu/cs598_spring07/papers/Lecun98.pdf)
:::

Page 37~43

## XI. Conclusions
:::info
During the short history of automatic pattern recognition, increasing the role of learning seems to have invariably improved the overall performance of recognition systems. The systems described in this paper are more evidence to this fact. Convolutional Neural Networks have been shown to eliminate the need for handcrafted feature extractors. Graph Transformer Networks have been shown to reduce the need for hand-crafted heuristics, manual labeling, and manual parameter tuning in document recognition systems. As training data becomes plentiful, as computers get faster, as our understanding of learning algorithms improves, recognition systems will rely more and more of learning, and their performance will improve.
:::

:::success
在自動圖型識別的短暫歷史中,增加學習的作用似乎總是能提高辨識系統的整體效能。此篇論文所說的系統更是為這一事實提供更多的證據。卷積神經網路已經被證明可以去除對手工提取特徵的需求。Graph Transformer Networks已經被證明可以減少文件辨識系統中對手工啟發法,手動標記與手動調校參數的需求。隨著訓練資料的豐富,計算機速度的提高,以及我們對學習演算法的瞭解不斷加深,辨識系統將更依賴學習,它們的效能也將提高。
:::

:::info
Just as the back-propagation algorithm elegantly solved the credit assignment problem in multi-layer neural networks, the gradient-based learning procedure for Graph Transformer Networks introduced in this paper solves the credit assignment problem in systems whose functional architecture dynamically changes with each new input. The learning algorithms presented here are in a sense nothing more than unusual forms of gradient descent in complex, dynamic architectures, with efficient back-propagation algorithms to compute the gradient. The results in this paper help establish the usefulness and relevance of gradient-based minimization methods as a general organizing principle for learning in large systems.
:::

:::success
一如反向傳播很好的解決多層神經網路credit assignment的問題一樣,此篇論文中介紹的用於Graph Transformer Networks的基於梯度學習程序,也解決了功能架構隨每個新輸入動態改變的系統中的credit assignment問題。這邊提出的學習演算法在某種意義上不過是複雜動態架構中特殊形式的梯度下降,具有高效率的反向傳播演算法來計算梯度。此篇論文的結果有助於建立基於梯度的最小化方法作為大型系統學習的一般組織原理的有用性與相關性。
:::

:::info
It was shown that all the steps of a document analysis system can be formulated as graph transformers through which gradients can be back-propagated. Even in the non-trainable parts of the system, the design philosophy in terms of graph transformation provides a clear separation between domain-specific heuristics (e.g. segmentation heuristics) and generic, procedural knowledge (the generalized transduction algorithm)
:::

:::success
結果說明,文件分析系統的所有步驟都可以用圖轉換器來表示,通過該圖轉換器可以反向傳播梯度。即使在系統中不可訓練的部份,以圖轉換為基礎的設計理念也明確區分領域特定啟發方法(如segmentation heuristics)與通用的程序性知識([一般型轉導](http://terms.naer.edu.tw/detail/5457299/)演算法)。
:::

:::info
It is worth pointing out that data generating models (such as HMMs) and the Maximum Likelihood Principle were not called upon to justify most of the architectures and the training criteria described in this paper.
Gradient based learning applied to global discriminative loss functions guarantees optimal classification and rejection without the use of "hard to justify" principles that put strong constraints on the system architecture, often at the expense of performances.
:::

:::success
值得指出的是,資料生成模型(如HMMs)與最大似然原理沒有被要求用來證明論文中所述的大多數架構與訓練標準。基於梯度學習應用到全域判別式損失函數確保了最佳分類與拒絕,而不需要使用"難以證明"的原則,這些原則在系統架構上施加了很強的約束,這往往以犧牲效能做為代價。
:::

:::info
More specifically, the methods and architectures presented in this paper offer generic solutions to a large number of problems encountered in pattern recognition systems:

1. Feature extraction is traditionally a fixed transform, generally derived from some expert prior knowledge about the task. This relies on the probably incorrect assumption that the human designer is able to capture all the relevant information in the input. We have shown that the application of Gradient-Based Learning to Convolutional Neural Networks allows to learn appropriate features from examples. The success of this approach was demonstrated in extensive comparative digit recognition experiments on the NIST database.
2. Segmentation and recognition of objects in images cannot be completely decoupled. Instead of taking hard segmentation decisions too early, we have used Heuristic Over-Segmentation to generate and evaluate a large number of hypotheses in parallel, postponing any decision until the overall criterion is minimized.
3. Hand truthing images to obtain segmented characters for training a character recognizer is expensive and does not take into account the way in which a whole document or sequence of characters will be recognized (in particular the fact that some segmentation candidates may be wrong, even though they may look like true characters). Instead we train multi-module systems to optimize a global measure of performance, which does not require time consuming detailed hand-truthing, and yields significantly better recognition performance, because it allows to train these modules to cooperate towards a common goal.
4. Ambiguities inherent in the segmentation, character recognition, and linguistic model should be integrated optimally. Instead of using a sequence of task-dependent heuristics to combine these sources of information, we have proposed a unified framework in which generalized transduction methods are applied to graphs representing a weighted set of hypotheses about the input. The success of this approach was demonstrated with a commercially deployed check reading system that reads millions of business and personal checks per day: the generalized transduction engine resides in only a few hundred lines of code.
5. Traditional recognition systems rely on many hand crafted heuristics to isolate individually recognizable objects. The promising Space Displacement Neural Network approach draws on the robustness and efficiency of Convolutional Neural Networks to avoid explicit segmentation altogether. Simultaneous automatic learning of segmentation and recognition can be achieved with Gradient-Based Learning methods
:::

:::success
更具體而言,論文中介紹的方法與架構為圖型識別系統中遇到的大量問題提出通用的解決方案:

1. 傳統上,特徵提取是一個固定轉換,通常是由任務相關的專家先驗知識[推導](http://terms.naer.edu.tw/detail/6591151/)而來。這依賴於一個可能錯誤的假設,即人工設計者能夠捕捉到輸入中所有相關訊息。我們已經證明,將基於梯度學習應用到卷積神經網路能夠從樣本學到適當的特徵。這種方法的成功在NIST資料庫上廣泛比較數字辨識實驗中得到證明。
2. 影像中物件的[分段](http://terms.naer.edu.tw/detail/6674993/)與辨識無法完全地[解耦](http://terms.naer.edu.tw/detail/2114307/)。我們採用Heuristic Over-Segmentation平行生成並評估大量的假設,並推遲任何決策,直到總體標準最小化,而不是過早做出硬性的[分段](http://terms.naer.edu.tw/detail/6674993/)決策。
3. 為了訓練字符識別器而手工truthing影像來獲得分段字符的成本很高,而且沒有考慮到辨識整體文件或字符序列的方式(特別是某些[分段](http://terms.naer.edu.tw/detail/6674993/)候選可能是錯的,即使它們看起來像真實的字符)。取而代之,我們訓練多模組系統來最佳化全域的效能量測,這不需要耗時的詳細手工truthing,而且因為能夠訓練這些模組朝著共同目標合作,所以產生明顯更好的辨識效能。
4. 分段,字符辨識與語言模型中固有的[模糊(歧義)性](http://terms.naer.edu.tw/detail/6624649/)應該得到最佳的整合。相較於使用一系列依賴任務的啟發法來組合這些信息來源,我們提出一個統一的框架,框架中[一般型轉導](http://terms.naer.edu.tw/detail/5457299/)方法應用於表示一組關於輸入的加權假設的圖。商業佈署的支票讀取系統證明了這個方法的成功,系統每天讀取數百萬張商業與個人支票:[一般型轉導](http://terms.naer.edu.tw/detail/5457299/)引擎僅[常駐](http://terms.naer.edu.tw/detail/6673245/)在幾百行程式碼內而已。
5. 傳統的辨識系統依賴許多手工製作的啟發法來隔離單獨可辨識的物件。前途看好的Space Displacement Neural Network方法利用了卷積神經網路的魯棒性與高效來完全避免顯式[分段](http://terms.naer.edu.tw/detail/6674993/)。使用基於梯度學習方法,可以同時完成自動[分段](http://terms.naer.edu.tw/detail/6674993/)學習與辨識。
:::

:::info
This paper presents a small number of examples of graph transformer modules, but it is clear that the concept can be applied to many situations where the domain knowledge or the state information can be represented by graphs. This is the case in many audio signal recognition tasks, and visual scene analysis applications. Future work will attempt to apply Graph Transformer Networks to such problems, with the hope of allowing more reliance on automatic learning, and less on detailed engineering.
:::

:::success
這篇論文只提出少量的圖轉換器模組範例,但很明顯的,這概念可以被應用到很多像是以圖來表示領域知識或狀態信息的情況。在許多音頻信號辨識任務與視覺場景分析應用就是這種情況。未來的工作會嘗試將Graph Transformer Networks應用到這類問題,希望能夠更依賴自動學習,而減少對細部工程的依賴。
:::

### A. Preconditions for faster convergence
:::info
As seen before, the squashing function used in our Convolutional Networks is f(a) = Atanh(Sa). Symmetric functions are believed to yield faster convergence, although the learning can become extremely slow if the weights are too small. The cause of this problem is that in weight space the origin is a fixed point of the learning dynamics, and, although it is a saddle point, it is attractive in almost all directions \[116\]. For our simulations, we use $A=1.7159$ and $S=\dfrac{2}{3}$(see \[20\], \[34\]). With this choice of parameters, the equalities $f(1)=1$ and $f(-1)=-1$ are satisfied. The rationale behind this is that the overall gain of the squashing transformation is around 1 in normal operating conditions, and the interpretation of the state of the network is simplified. Moreover, the absolute value of the second derivative of $f$ is a maximum at +1 and -1, which improves the convergence towards the end of the learning session. This particular choice of parameters is merely a convenience, and does not affect the result.
:::

:::success
如先前所看到的,在我們的卷積網路中所用的squashing function為f(a) = Atanh(Sa)。儘管如果權重太小會造成學習變得太慢,但[對稱函數](http://terms.naer.edu.tw/detail/2125937/)被認為會產生更快的收斂。這問題的原因是,在權重空間中,[原點](http://terms.naer.edu.tw/detail/419966/)是學習動力學的[定點](http://terms.naer.edu.tw/detail/2116246/),儘管它是一個鞍點,但它在幾乎所有方向上都是有吸引力的\[116\]。對於我們的模擬,我們使用$A=1.7159$、$S=\dfrac{2}{3}$(見\[20\],\[34\])。通過選擇這些參數,可以滿足等式$f(1)=1$與$f(-1)=-1$。這背後的原理是,在正常操作條件下,squashing轉換的總增益約為1,並且簡化對網路狀態的解釋。此外,$f$二階導數的絕對值在+1與-1處是最大的,這提高學習階段結束時的收斂性。這種特定的參數選擇只是為了方便,並不影響結果。
:::

:::warning
squashing function在目前皆稱為activation function
:::
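:::warning
(譯者補充,非原文內容)下面是一個簡單的數值草稿,假設使用numpy,驗證上述squashing function $f(a)=A\tanh(Sa)$在$A=1.7159$、$S=2/3$時滿足$f(1)\approx 1$、$f(-1)\approx -1$,且原點附近的增益約為1;`squash`、`squash_gain`等名稱為說明用的假設命名。

```python
import numpy as np

A, S = 1.7159, 2.0 / 3.0

def squash(a):
    """LeNet-5所用的squashing(activation)function:f(a) = A * tanh(S * a)"""
    return A * np.tanh(S * a)

def squash_gain(a):
    """f'(a) = A * S * (1 - tanh(S*a)^2),在原點附近的增益約為A*S ≈ 1.14"""
    return A * S * (1.0 - np.tanh(S * a) ** 2)

print(squash(1.0), squash(-1.0))   # 約 1.0 與 -1.0
print(squash_gain(0.0))            # 約 1.14,總增益接近1
```
:::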
:::info
Before training, the weights are initialized with random values using a uniform distribution between $-2.4/F_i$ and $2.4/F_i$ where $F_i$ is the number of inputs (fan-in) of the unit which the connection belongs to. Since several connections share a weight, this rule could be difficult to apply, but in our case, all connections sharing a same weight belong to units with identical fan-ins. The reason for dividing by the fan-in is that we would like the initial standard deviation of the weighted sums to be in the same range for each unit, and to fall within the normal operating region of the sigmoid. If the initial weights are too small, the gradients are very small and the learning is slow. If they are too large, the sigmoids are saturated and the gradient is also very small. The standard deviation of the weighted sum scales like the square root of the number of inputs when the inputs are independent, and it scales linearly with the number of inputs if the inputs are highly correlated. We chose to assume the second hypothesis since some units receive highly correlated signals.
:::

:::success
在訓練之前,權重使用介於$-2.4/F_i$與$2.4/F_i$之間的[均勻分佈](http://terms.naer.edu.tw/detail/955962/)做隨機值的初始化,其中$F_i$是指連接所屬單元的輸入([扇入](http://terms.naer.edu.tw/detail/415982/))數。因為多個連接共享一個權重,這個規則可能難以應用,但是在我們的案例中,共用相同權重的所有連接都屬於具有相同[扇入](http://terms.naer.edu.tw/detail/415982/)的單元。之所以除以扇入,是因為我們希望[加權和](http://terms.naer.edu.tw/detail/3128693/)的初始[標準差](http://terms.naer.edu.tw/detail/1113798/)在每個單元都落在相同區間內,並且落在sigmoid的正常操作區間。如果初始權重過小,那梯度會非常小,而且學習會非常緩慢。但如果權重過大,那會造成sigmoid飽和,而且梯度也會非常小。當輸入彼此獨立的時候,[加權和](http://terms.naer.edu.tw/detail/3128693/)的[標準差](http://terms.naer.edu.tw/detail/1113798/)會隨輸入數的平方根縮放;如果輸入高度相關,則會隨輸入數線性縮放。由於部份單元接收高度相關的信號,因此我們選擇第二種假設。
:::
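:::warning
(譯者補充,非原文內容)以下是權重初始化規則的最小範例草稿:以均勻分佈$U(-2.4/F_i,\ 2.4/F_i)$初始化一個假設的全連接層(論文中的共享權重情況較複雜,這裡僅示意);`init_weights`與其中的維度數字皆為假設。

```python
import numpy as np

rng = np.random.default_rng(0)

def init_weights(fan_in, fan_out):
    """以均勻分佈U(-2.4/F_i, 2.4/F_i)初始化權重,F_i為該單元的扇入數。"""
    limit = 2.4 / fan_in
    return rng.uniform(-limit, limit, size=(fan_out, fan_in))

W = init_weights(fan_in=256, fan_out=120)   # 例:256個輸入、120個單元
x = rng.standard_normal(256)                # 假設輸入近似獨立、變異數約為1

print(W.min(), W.max())                     # 權重落在 ±2.4/F_i 之內
print((W @ x).std())                        # 加權和的標準差不大,不會使sigmoid飽和
```
:::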
### B. Stochastic Gradient vs Batch Gradient
:::info
Gradient-Based Learning algorithms can use one of two classes of methods to update the parameters. The first method, dubbed "Batch Gradient", is the classical one: the gradients are accumulated over the entire training set, and the parameters are updated after the exact gradient has been so computed. In the second method, called "Stochastic Gradient", a partial, or noisy, gradient is evaluated on the basis of one single training sample (or a small number of samples), and the parameters are updated using this approximate gradient. The training samples can be selected randomly or according to a properly randomized sequence. In the stochastic version, the gradient estimates are noisy, but the parameters are updated much more often than with the batch version. An empirical result of considerable practical importance is that on tasks with large, redundant data sets, the stochastic version is considerably faster than the batch version, sometimes by orders of magnitude \[117\]. Although the reasons for this are not totally understood theoretically, an intuitive explanation can be found in the following extreme example. Let us take an example where the training database is composed of two copies of the same subset. Then accumulating the gradient over the whole set would cause redundant computations to be performed. On the other hand, running Stochastic Gradient once on this training set would amount to performing two complete learning iterations over the small subset. This idea can be generalized to training sets where there exist no precise repetition of the same pattern but where some redundancy is present. In fact stochastic update must be better when there is redundancy, i.e., when a certain level of generalization is expected.
:::

:::success
基於梯度學習演算法可以使用兩類方法之一來更新參數。第一個方法稱為"Batch Gradient(批次梯度)",是一種經典方法:累積整個訓練集的梯度,並在計算出精確梯度之後更新參數。在稱為"Stochastic Gradient(隨機梯度)"的第二個方法中,以單一訓練樣本(或少量樣本)為基礎估計部份的或帶有雜訊的梯度,並使用這個近似梯度更新參數。訓練樣本可以隨機選擇,或根據適當的隨機順序選擇。在隨機版本中,其梯度的估計是有雜訊的,但是參數的更新次數要比批次版本來得多。一個具有相當實際意義的經驗結果是,對於具有大型[多餘](http://terms.naer.edu.tw/detail/2123372/)資料集的任務,隨機版本會比批次版本要快得多,有時候會快幾個[數量級](http://terms.naer.edu.tw/detail/2121081/)\[117\]。儘管理論上還不能完全瞭解其原因,但可以在下面這個極端的例子看到一個直觀的解釋。讓我們舉一個例子,其訓練資料庫是由同一子集的兩個複本所組成。那麼,在整個資料集上累積梯度將導致執行多餘的計算。另一方面,在這訓練集上執行一次Stochastic Gradient相當於在小的子集上執行兩次完整的學習迭代。這個想法可以推廣到不存在相同影像的精確重覆、但存在一些[冗餘](http://terms.naer.edu.tw/detail/579044/)的訓練集。事實上,當存在冗餘的時候,即當期待某種程度的泛化時,隨機更新必定更好。
:::

:::info
Many authors have claimed that second-order methods should be used in lieu of gradient descent for neural net training. The literature abounds with recommendations \[118\] for classical second-order methods such as the Gauss-Newton or Levenberg-Marquardt algorithms, for Quasi-Newton methods such as the Broyden-Fletcher-Goldfarb-Shanno method (BFGS), Limited-storage BFGS, or for various versions of the Conjugate Gradients (CG) method. Unfortunately, all of the above methods are unsuitable for training large neural networks on large data sets. The Gauss-Newton and Levenberg-Marquardt methods require $O(N^3)$ operations per update, where $N$ is the number of parameters, which makes them impractical for even moderate size networks. Quasi-Newton methods require "only" $O(N^2)$ operations per update, but that still makes them impractical for large networks. Limited-Storage BFGS and Conjugate Gradient require only $O(N)$ operations per update so they would appear appropriate. Unfortunately, their convergence speed relies on an accurate evaluation of successive "conjugate descent directions" which only makes sense in "batch" mode. For large data sets, the speed-up brought by these methods over regular batch gradient descent cannot match the enormous speed up brought by the use of stochastic gradient. Several authors have attempted to use Conjugate Gradient with small batches, or batches of increasing sizes \[119\], \[120\], but those attempts have not yet been demonstrated to surpass a carefully tuned stochastic gradient. Our experiments were performed with a stochastic method that scales the parameter axes so as to minimize the eccentricity of the error surface.
:::

:::success
許多作者聲稱,在訓練神經網路的時候,應該使用二階方法取代梯度下降。文獻中充滿對經典二階方法的推薦\[118\],像是Gauss-Newton([高斯牛頓](http://terms.naer.edu.tw/detail/149913/))或Levenberg-Marquardt([雷文柏格-馬括特](http://terms.naer.edu.tw/detail/67491/))演算法,Quasi-Newton([準牛頓法](http://terms.naer.edu.tw/detail/5269/))像是Broyden-Fletcher-Goldfarb-Shanno(BFGS),Limited-storage BFGS,或各種版本的Conjugate Gradients([共軛梯度](http://terms.naer.edu.tw/detail/2113379/))方法。不幸的是,上面所說的方法都不適合在大型資料集上訓練大型神經網路。Gauss-Newton與Levenberg-Marquardt在每次更新都需要$O(N^3)$個操作,其中$N$是參數量,這造成它們即使在中型網路也是不切實際。Quasi-Newton方法每次更新"僅"需要$O(N^2)$個操作,但對大型網路而言仍然是不切實際。Limited-Storage BFGS與Conjugate Gradient每次更新僅需要$O(N)$個操作,因此它們看起來很適合。不幸的是,它們的收斂速度取決於對連續的"conjugate descent directions(共軛下降方向)"的準確評估,而這只在"批次"模式下才有意義。對大型資料集而言,這些方法相較於常規批次梯度下降所帶來的[加速](http://terms.naer.edu.tw/detail/6680400/),無法與使用隨機梯度所帶來的極大[加速](http://terms.naer.edu.tw/detail/6680400/)相提並論。許多作者嘗試將[共軛梯度](http://terms.naer.edu.tw/detail/2113379/)用於小批次或大小遞增的批次\[119\],\[120\],但這些嘗試至今尚未被證明超越仔細調整的隨機梯度。我們的實驗是以隨機方法進行,該方法縮放參數軸,以此最小化誤差曲面的[離心率](http://terms.naer.edu.tw/detail/2115229/)。
:::
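:::warning
(譯者補充,非原文內容)下面用一個玩具線性回歸(假設的例子,非論文實驗)重現正文中「同一子集的兩個複本」的論證:相同遍歷次數下,批次梯度在複製過的資料上做了多餘的計算,而隨機梯度等同於在小子集上做了接近兩倍的更新,因此收斂快得多。

```python
import numpy as np

rng = np.random.default_rng(0)

# 玩具例子:一維線性回歸 y = 3x,損失為MSE
x_small = rng.standard_normal(100)
y_small = 3.0 * x_small
# 訓練集 = 同一子集的兩個複本(刻意製造冗餘)
x_full = np.concatenate([x_small, x_small])
y_full = np.concatenate([y_small, y_small])

eta = 0.05

def batch_pass(w, x, y):
    """Batch Gradient:累積整個訓練集的梯度後,才更新一次參數。"""
    grad = np.mean(2.0 * (w * x - y) * x)
    return w - eta * grad

def stochastic_pass(w, x, y):
    """Stochastic Gradient:每看一個樣本就以近似(帶雜訊的)梯度更新一次。"""
    for i in rng.permutation(len(x)):
        w -= eta * 2.0 * (w * x[i] - y[i]) * x[i]
    return w

w_batch, w_sgd = 0.0, 0.0
for _ in range(5):
    w_batch = batch_pass(w_batch, x_full, y_full)   # 對複製過的資料集而言是多餘的計算
    w_sgd = stochastic_pass(w_sgd, x_full, y_full)  # 等同於在小子集上做了近兩倍的更新
print(w_batch, w_sgd)  # 相同遍歷次數下,隨機版本通常更接近真值3.0
```
:::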
### C. Stochastic Diagonal Levenberg-Marquardt
:::info
Owing to the reasons given in Appendix B, we prefer to update the weights after each presentation of a single pattern in accordance with stochastic update methods. The patterns are presented in a constant random order, and the training set is typically repeated 20 times.
:::

:::success
由於附錄B給出的原因,我們傾向於依隨機更新的方法在每次呈現單一影像後更新權重。影像以固定的隨機順序呈現,訓練集通常重覆20次。
:::

:::info
Our update algorithm is dubbed the Stochastic Diagonal Levenberg-Marquardt method where an individual learning rate (step size) is computed for each parameter (weight) before each pass through the training set \[20\], \[121\], \[34\]. These learning rates are computed using the diagonal terms of an estimate of the Gauss-Newton approximation to the Hessian (second derivative) matrix. This algorithm is not believed to bring a tremendous increase in learning speed but it converges reliably without requiring extensive adjustments of the learning parameters. It corrects major ill-conditioning of the loss function that are due to the peculiarities of the network architecture and the training data. The additional cost of using this procedure over standard stochastic gradient descent is negligible.
:::

:::success
我們的更新演算法稱為Stochastic Diagonal Levenberg-Marquardt,在每次通過訓練集之前,為每個參數(權重)計算單獨的學習速率(步長)\[20\],\[121\],\[34\]。這些學習速率是使用Hessian(二階導數)矩陣的Gauss-Newton近似估計的對角線項來計算。這演算法不被認為可以在學習速度上帶來驚人的增益,但它可以可靠地收斂,而不需要對學習參數做大量調整。它更正了由於網路架構與訓練資料的特殊性而引起的損失函數的主要[病態](http://terms.naer.edu.tw/detail/2117559/)。相較於標準隨機梯度下降,使用這個程序的額外成本是微不足道的。
:::

:::info
At each learning iteration a particular parameter $w_k$ is updated according to the following stochastic update rule

![](https://i.imgur.com/iSsAenM.png)

where $E^p$ is the instantaneous loss function for pattern $p$. In Convolutional Neural Networks, because of the weight sharing, the partial derivative $\dfrac{\partial E^p}{\partial w_k}$ is the sum of the partial derivatives with respect to the connections that share the parameter $w_k$:

![](https://i.imgur.com/hGQSMdH.png)

where $u_{ij}$ is the connection weight from unit $j$ to unit $i$, $V_k$ is the set of unit index pairs $(i, j)$ such that the connection between $i$ and $j$ share the parameter $w_k$, i.e.:

![](https://i.imgur.com/a6BXg0l.png)
:::

:::success
在每次的學習迭代中,根據下面隨機更新規則來更新特定參數$w_k$

![](https://i.imgur.com/iSsAenM.png)

其中$E^p$是圖像$p$的[瞬時](http://terms.naer.edu.tw/detail/722071/)損失函數。在卷積神經網路中,因為權重共享,其偏導數$\dfrac{\partial E^p}{\partial w_k}$是共享參數$w_k$的連接相關的偏導數之和:

![](https://i.imgur.com/hGQSMdH.png)

其中$u_{ij}$是指從單元$j$到單元$i$的連接權重,$V_k$是單元索引對$(i, j)$的集合,使得$i$、$j$之間的連接共享參數$w_k$,即:

![](https://i.imgur.com/a6BXg0l.png)
:::
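:::warning
(譯者補充,非原文內容)下面以一個假設的一維卷積玩具例子示意上式:共享參數$w_k$的梯度$\dfrac{\partial E^p}{\partial w_k}$是所有共享該參數的連接$(i,j)\in V_k$的偏導數之和,之後再依上下文的隨機更新規則(應為$w_k \leftarrow w_k - \epsilon_k \dfrac{\partial E^p}{\partial w_k}$的形式)做逐樣本更新;損失取MSE,`forward`、`shared_grad`等名稱與數值皆為說明用的假設。

```python
import numpy as np

rng = np.random.default_rng(0)

# 玩具例子:一維「卷積」y_i = sum_m w_m * x_{i+m},同一個w_m被許多連接共享
x = rng.standard_normal(16)
w = rng.standard_normal(3)          # 共享參數 w_k
t = rng.standard_normal(14)         # 目標輸出
eps = 0.05 * np.ones_like(w)        # 每個參數各自的步長 epsilon_k(此處先取常數)

def forward(x, w):
    n_out = len(x) - len(w) + 1
    return np.array([np.dot(w, x[i:i + len(w)]) for i in range(n_out)])

def shared_grad(x, w, t):
    """dE^p/dw_k = 對所有共享w_k的連接(i,j)的偏導數求和(對應正文中的V_k)。"""
    y = forward(x, w)
    dy = 2.0 * (y - t)                       # 瞬時損失E^p(MSE)對每個輸出的梯度
    g = np.zeros_like(w)
    for i, d in enumerate(dy):               # 每個輸出位置i都用到全部的w_m
        for m in range(len(w)):
            g[m] += d * x[i + m]             # 逐一累加每條連接u_{ij}的貢獻
    return g

# 隨機更新:w_k <- w_k - epsilon_k * dE^p/dw_k
w -= eps * shared_grad(x, w, t)
print(w)
```
:::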
:::info
As stated previously, the step sizes $\epsilon_k$ are not constant but are function of the second derivative of the loss function along the axis $w_k$:

![](https://i.imgur.com/lFEk9DW.png)

where $\mu$ is a hand-picked constant and $h_{kk}$ is an estimate of the second derivative of the loss function $E$ with respect to $w_k$. The larger $h_{kk}$, the smaller the weight update. The parameter $\mu$ prevents the step size from becoming too large when the second derivative is small, very much like the "model-trust" methods, and the Levenberg-Marquardt methods in non-linear optimization \[8\]. The exact formula to compute $h_{kk}$ from the second derivatives with respect to the connection weights is:

![](https://i.imgur.com/TC8rPVp.png)
:::

:::success
如前所述,步長$\epsilon_k$並不是一個[常數](http://terms.naer.edu.tw/detail/3186676/),而是損失函數沿著軸$w_k$的二階導數的函數:

![](https://i.imgur.com/lFEk9DW.png)

其中$\mu$是手工挑選的常數,而$h_{kk}$是損失函數$E$相對於$w_k$的二階導數估測。$h_{kk}$愈大,則權重更新愈小。參數$\mu$可以預防步長在二階導數過小的時候變得過大,類似於非線性最佳化中的"model-trust"方法與Levenberg-Marquardt方法\[8\]。從相對於連接權重的二階導數計算$h_{kk}$的確切公式為:

![](https://i.imgur.com/TC8rPVp.png)
:::

:::info
However, we make three approximations. The first approximation is to drop the off-diagonal terms of the Hessian with respect to the connection weights in the above equation:

![](https://i.imgur.com/GUv4oPD.png)
:::

:::success
然而,我們做了三個近似。第一個近似是刪掉上述公式中Hessian相對於連接權重的[非對角線](http://terms.naer.edu.tw/detail/2120883/)項:

![](https://i.imgur.com/GUv4oPD.png)
:::

:::info
Naturally, the terms $\dfrac{\partial^2 E}{\partial u^2_{ij}}$ are the average over the training set of the local second derivatives:

![](https://i.imgur.com/QSBrNG3.png)
:::

:::success
自然地,$\dfrac{\partial^2 E}{\partial u^2_{ij}}$這些項是局部二階導數在訓練集上的平均值:

![](https://i.imgur.com/QSBrNG3.png)
:::

:::info
Those local second derivatives with respect to connection weights can be computed from local second derivatives with respect to the total input of the downstream unit:

![](https://i.imgur.com/KvxGbhO.png)

where $x_j$ is the state of unit $j$ and $\dfrac{\partial^2 E^p}{\partial a^2_i}$ is the second derivative of the instantaneous loss function with respect to the total input to unit $i$ (denoted $a_i$). Interestingly, there is an efficient algorithm to compute those second derivatives which is very similar to the back-propagation procedure used to compute the first derivatives \[20\], \[121\]:

![](https://i.imgur.com/62uEZtX.png)
:::

:::success
這些相對於連接權重的局部二階導數可以從相對於下游單元總輸入的局部二階導數計算:

![](https://i.imgur.com/KvxGbhO.png)

其中$x_j$是單元$j$的狀態,而$\dfrac{\partial^2 E^p}{\partial a^2_i}$是[瞬時](http://terms.naer.edu.tw/detail/722071/)損失函數相對於單元$i$的總輸入(標記為$a_i$)的二階導數。有趣的是,有一種高效的演算法可以計算這些二階導數,這方法非常類似用於計算一階導數的反向傳播程序\[20\],\[121\]:

![](https://i.imgur.com/62uEZtX.png)
:::

:::info
Unfortunately, using those derivatives leads to well-known problems associated with every Newton-like algorithm: these terms can be negative, and can cause the gradient algorithm to move uphill instead of downhill. Therefore, our second approximation is a well-known trick, called the Gauss-Newton approximation, which guarantees that the second derivative estimates are non-negative. The Gauss-Newton approximation essentially ignores the non-linearity of the estimated function (the Neural Network in our case), but not that of the loss function. The back-propagation equation for Gauss-Newton approximations of the second derivatives is:

![](https://i.imgur.com/uJ0YrYk.png)
:::

:::success
不幸的是,使用這些導數會導致每一個Newton-like演算法上眾所皆知的問題:這些項可能是負的,並可能導致梯度演算法往上坡而不是下坡移動。因此,我們的第二個近似是一個眾所皆知的技巧,稱為Gauss-Newton近似,它保證二階導數的估計是非負的。Gauss-Newton近似本質上忽略估計函數(本例中為神經網路)的非線性,但不忽略損失函數的非線性。二階導數的Gauss-Newton近似的反向傳播方程式為:

![](https://i.imgur.com/uJ0YrYk.png)
:::

:::info
This is very similar to the formula for back-propagating the first derivatives, except that the sigmoid's derivative and the weight values are squared. The right-hand side is a sum of products of non-negative terms, therefore the left-hand side term is non-negative.
:::

:::success
這與一階導數的反向傳播公式非常相似,除了sigmoid的導數與權重值是平方的。右手邊是非負項的乘積之和,因此左手邊的項也是非負的。
:::
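:::warning
(譯者補充,非原文內容)下面以一個假設的兩層小網路(非LeNet-5)與MSE損失,示意上述Gauss-Newton近似的對角線二階導數反向傳播:只保留$f'(a)^2$與權重的平方,再以$\dfrac{\partial^2 E^p}{\partial u_{ij}^2}\approx\dfrac{\partial^2 E^p}{\partial a_i^2}x_j^2$得到每個權重的$h$,最後依上下文以$\epsilon_k=\eta/(\mu+h_{kk})$的形式計算步長;所有變數名稱與數值皆為假設。

```python
import numpy as np

rng = np.random.default_rng(0)
A, S = 1.7159, 2.0 / 3.0
f  = lambda a: A * np.tanh(S * a)
fp = lambda a: A * S * (1.0 - np.tanh(S * a) ** 2)   # f'(a)

# 假設的小型兩層網路(非LeNet-5),損失取 E = 0.5 * ||output - target||^2
W1 = rng.standard_normal((8, 4)) * 0.3
W2 = rng.standard_normal((3, 8)) * 0.3
x0 = rng.standard_normal(4)

a1 = W1 @ x0; x1 = f(a1)
a2 = W2 @ x1; x2 = f(a2)

# Gauss-Newton近似的二階導數反向傳播:只剩 f'(a)^2 與權重平方
# 輸出層:MSE使 d2E/dx2^2 = 1,故 d2E/da2^2 ≈ f'(a2)^2
d2_a2 = fp(a2) ** 2
# 隱藏層:d2E/da1_j^2 ≈ f'(a1_j)^2 * sum_i( W2[i,j]^2 * d2E/da2_i^2 )
d2_a1 = fp(a1) ** 2 * ((W2 ** 2).T @ d2_a2)

# 對連接權重的局部二階導數:d2E/du_ij^2 ≈ d2E/da_i^2 * x_j^2
h_W2 = np.outer(d2_a2, x1 ** 2)
h_W1 = np.outer(d2_a1, x0 ** 2)

eta, mu = 0.01, 0.02
eps_W2 = eta / (mu + h_W2)   # 每個權重各自的步長 epsilon_k = eta / (mu + h_kk)
eps_W1 = eta / (mu + h_W1)
print(eps_W1.min(), eps_W1.max())   # 二階導數大的權重得到較小的步長
```
:::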
:::info
The third approximation we make is that we do not run the average in Equation 24 over the entire training set, but run it on a small subset of the training set instead. In addition the re-estimation does not need to be done often since the second order properties of the error surface change rather slowly. In the experiments described in this paper, we re-estimate the $h_{kk}$ on 500 patterns before each training pass through the training set. Since the size of the training set is 60,000, the additional cost of re-estimating the $h_{kk}$ is negligible. The estimates are not particularly sensitive to the particular subset of the training set used in the averaging. This seems to suggest that the second-order properties of the error surface are mainly determined by the structure of the network, rather than by the detailed statistics of the samples. This algorithm is particularly useful for shared-weight networks because the weight sharing creates ill-conditioning of the error surface. Because of the sharing, one single parameter in the first few layers can have an enormous influence on the output. Consequently, the second derivative of the error with respect to this parameter may be very large, while it can be quite small for other parameters elsewhere in the network. The above algorithm compensates for that phenomenon.
:::

:::success
我們做的第三個近似是,我們不在整個訓練集上執行公式24中的平均,而是在訓練集的一小部份子集上執行。另外,由於誤差曲面的二階特性變化相當緩慢,因此不需要經常重新估算。在本論文所述的實驗中,我們在每次通過整個訓練集之前,在500個圖像上重新估算$h_{kk}$。因為訓練集的大小為60,000,因此重新估算$h_{kk}$的額外成本可以忽略不計。估計值對平均時所使用的訓練集特定子集並不是特別敏感。這似乎表明,誤差曲面的二階特性主要由網路的結構決定,而不是由樣本的詳細統計決定。這個演算法對共享權重網路特別有用,因為權重共享造成誤差曲面的病態化。因為共享,前幾層中的單一參數可以對輸出造成極大的影響。因此,誤差相對於這個參數的二階導數也許非常大,而對網路中其它地方的參數而言可能非常小。上面的演算法補償了這種現象。
:::

:::info
Unlike most other second-order acceleration methods for back-propagation, the above method works in stochastic mode. It uses a diagonal approximation of the Hessian. Like the classical Levenberg-Marquardt algorithm, it uses a "safety" factor $\mu$ to prevent the step sizes from getting too large if the second derivative estimates are small. Hence the method is called the Stochastic Diagonal Levenberg-Marquardt method.
:::

:::success
與其它多數用於反向傳播的二階[加速](http://terms.naer.edu.tw/detail/6597901/)方法不同,上述方法在隨機模式中執行。它使用Hessian的對角線近似。像經典的Levenberg-Marquardt演算法一樣,它使用"安全"因子$\mu$來預防在二階導數估計值過小時步長變得過大。因此,這方法稱為Stochastic Diagonal Levenberg-Marquardt。
:::
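:::warning
(譯者補充,非原文內容)最後是Stochastic Diagonal Levenberg-Marquardt整體流程的極簡草稿:每次通過訓練集前,在小子集(正文中為500個樣本)上重新估算對角線項$h_{kk}$,再以$\epsilon_k=\eta/(\mu+h_{kk})$做逐樣本的隨機更新。為了讓例子保持簡短,這裡用一個線性玩具模型(此時$h_{kk}$恰為$x_k^2$在子集上的平均)代替正文中對網路的Gauss-Newton反向傳播;資料與超參數皆為示意用的假設。

```python
import numpy as np

rng = np.random.default_rng(0)

# 玩具例子:線性模型 y = w.x,E^p = 0.5*(y - t)^2,此時 d2E^p/dw_k^2 = x_k^2
X = rng.standard_normal((2000, 10)) * np.linspace(0.2, 3.0, 10)  # 各維尺度不同 -> 病態
w_true = rng.standard_normal(10)
T = X @ w_true

eta, mu, subset = 0.02, 0.1, 500
w = np.zeros(10)

for epoch in range(5):
    # 每次通過訓練集之前,在小子集上重新估算對角線項 h_kk
    idx = rng.choice(len(X), size=subset, replace=False)
    h = np.mean(X[idx] ** 2, axis=0)
    eps = eta / (mu + h)                    # epsilon_k = eta / (mu + h_kk)

    # 之後以隨機模式逐樣本更新:w_k <- w_k - epsilon_k * dE^p/dw_k
    for p in rng.permutation(len(X)):
        err = w @ X[p] - T[p]
        w -= eps * err * X[p]

print(np.max(np.abs(w - w_true)))           # 即使各參數尺度差異很大,仍能可靠收斂
```
:::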