# deep learning

* A — early stopping avoids overfitting
* B — avoids overfitting <code>A</code>
* C — the vanishing gradient problem arises from using the sigmoid activation function
* C ~~increasing the learning rate avoids local minima~~
* D — gradient descent is used to find the best function
* C — the ReLU function is a special case of the maxout function
* C — for classification problems, the softmax layer turns larger outputs into higher confidence scores
* <code>D</code>
* B — dropout (with dropout rate p%, the weights are multiplied by (1-p)% at testing time)
* D

## part 2

* What is deep learning (physical meaning and mathematical view)?

  Deep learning usually refers to neural-network-based approaches. Each neuron is a very simple function; cascading neurons forms a neural network, and each layer is a simple function in the production line. The whole network is a complex function f: R^N → R^M. End-to-end training: what each function (layer) should do is learned automatically. (A toy forward-pass sketch appears at the end of this section.)

* Because vanilla mini-batch gradient descent does not guarantee good convergence, several challenges remain to be solved. AdaGrad automatically adjusts the learning rate during training: parameters that appear infrequently get a larger learning rate, while frequently updated parameters get a smaller one. This makes it well suited to sparse training data and improves the robustness of SGD. AdaGrad adapts the learning rate based on the gradients accumulated over all previous iterations. (A minimal sketch of the update rule appears at the end of this section.)

  https://medium.com/%E9%9B%9E%E9%9B%9E%E8%88%87%E5%85%94%E5%85%94%E7%9A%84%E5%B7%A5%E7%A8%8B%E4%B8%96%E7%95%8C/%E6%A9%9F%E5%99%A8%E5%AD%B8%E7%BF%92ml-note-sgd-momentum-adagrad-adam-optimizer-f20568c968db

  ![](https://imgur.com/pRja9hw.png)

### how to reduce the vanishing gradient problem

* **ReLU:** The idea is simple: if the activation function's derivative is 1, the vanishing/exploding gradient problem disappears and every layer is updated at the same rate; this is exactly what ReLU provides. Its mathematical form is f(x) = max(0, x). The simplest solution is therefore to use another activation function, such as ReLU, whose derivative does not become small.
* Residual networks are another solution, as they provide residual connections straight to earlier layers. As seen in Image 2, the residual connection directly adds the value at the beginning of the block, x, to the end of the block (F(x) + x). This residual connection does not go through activation functions that "squash" the derivatives, resulting in a higher overall derivative for the block.
* Finally, batch normalization layers can also resolve the issue. As stated before, the problem arises when a large input space is mapped to a small one, causing the derivatives to disappear. In Image 1, this is most clearly seen when |x| is large. Batch normalization reduces this problem by normalizing the input so that |x| does not reach the outer edges of the sigmoid function. As seen in Image 3, it normalizes the input so that most of it falls in the green region, where the derivative is not too small. (Numerical sketches of these three fixes appear at the end of this section.)

![](https://imgur.com/yLyjgk6.png)
![](https://imgur.com/c7H0R4D.png)
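A minimal NumPy sketch of the "cascade of simple functions" view above. The layer sizes, weights, and the helper names `layer` and `f` are arbitrary illustrations of mine, not from any particular course or library; it only shows how composing simple layers yields one function from R^N to R^M.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(x, W, b, activation=np.tanh):
    """One layer = one simple function: an affine map followed by a nonlinearity."""
    return activation(W @ x + b)

# Cascading two simple layers yields one complex function f: R^N -> R^M.
# The sizes (N=4 inputs, H=8 hidden units, M=3 outputs) are arbitrary.
N, H, M = 4, 8, 3
W1, b1 = rng.normal(size=(H, N)), np.zeros(H)
W2, b2 = rng.normal(size=(M, H)), np.zeros(M)

def f(x):
    h = layer(x, W1, b1)                              # hidden layer
    return layer(h, W2, b2, activation=lambda z: z)   # linear output layer

x = rng.normal(size=N)
print(f(x).shape)   # (3,) -- the cascade maps R^4 to R^3
```

End-to-end training would then adjust W1, b1, W2, b2 from data, so what each layer computes is learned rather than hand-designed.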
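A minimal sketch of the AdaGrad update rule described in part 2. The toy quadratic objective, the learning rate, and the helper name `adagrad_update` are assumptions for illustration only; the point is that each parameter divides the base learning rate by the square root of its own accumulated squared gradients.

```python
import numpy as np

def adagrad_update(w, grad, cache, lr=0.1, eps=1e-8):
    """One AdaGrad step.

    Each parameter is scaled by 1 / sqrt(sum of its own past squared
    gradients), so rarely updated (sparse) parameters keep a large
    effective learning rate while frequently updated ones shrink.
    """
    cache = cache + grad ** 2
    w = w - lr * grad / (np.sqrt(cache) + eps)
    return w, cache

# Toy demo: minimize 0.5 * sum(A * w**2), whose coordinates have very
# different curvatures, mimicking features of very different frequencies.
A = np.array([10.0, 0.1])       # per-coordinate curvature
w = np.array([1.0, 1.0])
cache = np.zeros_like(w)

for step in range(200):
    grad = A * w                # gradient of the toy quadratic
    w, cache = adagrad_update(w, grad, cache)

print(w)                        # both coordinates shrink toward 0, each at its own adaptive rate
```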
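A minimal numerical sketch of the three fixes listed under "how to reduce the vanishing gradient problem". The depth of 20, the pre-activation value 2.0, the tiny F'(x), and the sample inputs are arbitrary choices of mine; the sketch only illustrates why repeated sigmoid derivatives shrink the gradient, why ReLU and residual connections do not, and how normalizing inputs keeps the sigmoid in its high-derivative region.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)                    # never exceeds 0.25

def relu_grad(x):
    return np.asarray(x > 0, dtype=float)   # exactly 1 on the active side

# Backprop multiplies one activation-derivative factor per layer. Assume a
# 20-layer stack whose pre-activations all happen to equal 2.0:
depth, z = 20, 2.0
print("sigmoid factors:", sigmoid_grad(z) ** depth)   # ~3e-20, gradient vanishes
print("relu factors:   ", relu_grad(z) ** depth)      # 1.0, gradient survives

# Residual block: y = F(x) + x, so dy/dx = F'(x) + 1.
# Even if F'(x) is tiny, the gradient through the block stays close to 1.
F_prime = 1e-6
print("residual block derivative:", F_prime + 1.0)

# Batch-normalization idea: standardize the inputs so |x| stays away from
# the flat outer edges of the sigmoid, where the derivative is tiny.
x = np.array([-8.0, -4.0, 0.0, 4.0, 8.0])
x_norm = (x - x.mean()) / (x.std() + 1e-5)
print("sigmoid grads before normalizing:", sigmoid_grad(x))       # edges ~3e-4
print("sigmoid grads after normalizing: ", sigmoid_grad(x_norm))  # all ~0.16-0.25
```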