---
disqus: ierosodin
---
# L6-DeepFeedforwardNetworks
> Organization contact [name= [ierosodin](ierosodin@gmail.com)]
###### tags: `deep learning` `學習筆記`
==[Back to Catalog](https://hackmd.io/@ierosodin/Deep_Learning)==
* http://www.deeplearningbook.org/contents/mlp.html
* The goal of a feedforward network is to approximate some function $f^*$.
* There are no feedback connections in which outputs of the model are fed back into itself.
* When feedforward neural networks are extended to include feedback connections, they are called recurrent neural networks.
* Output layer
* The final layer of a feedforward network is called the output layer.
* Hidden layers
* Because the training data does not show the desired output for other layers, they are called hidden layers.
* Width
* The dimensionality of these hidden layers determines the width of the model.
* Depth
* The number of layers determines the depth of the model.
* Training
* Find network weights $w$ to minimize the error between the true training labels $y_i$ and the estimated labels $f_w(x_i)$.
* Back-propagation
* Minimization can be done by gradient descent, provided the error is differentiable with respect to $w$. The procedure for computing the required gradients is called back-propagation.
* **Chain rule**
* In vector case
* $\nabla_x z = \left(\frac{\partial y}{\partial x}\right)^T \nabla_y z$
* Jacobian matrix
* $\begin{bmatrix} \frac{\partial z}{\partial x_1} \\ \frac{\partial z}{\partial x_2} \\ \vdots \\ \frac{\partial z}{\partial x_n} \end{bmatrix} = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_2}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_1} \\ \frac{\partial y_1}{\partial x_2} & \frac{\partial y_2}{\partial x_2} & \cdots & \frac{\partial y_m}{\partial x_2} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial y_1}{\partial x_n} & \frac{\partial y_2}{\partial x_n} & \cdots & \frac{\partial y_m}{\partial x_n} \end{bmatrix} \begin{bmatrix} \frac{\partial z}{\partial y_1} \\ \frac{\partial z}{\partial y_2} \\ \vdots \\ \frac{\partial z}{\partial y_m} \end{bmatrix}$
* Written out, this is $\nabla_x z = \left(\frac{\partial y}{\partial x}\right)^T \nabla_y z$: the matrix above is the transposed Jacobian.
* For a computational graph in which $v$ feeds $w$ and $y$, $w$ feeds $x$, $x$ feeds $y$, and $y$ feeds $z$, back-propagation computes (see the sketch after this list):
* $g_y = \nabla_yz$
* $g_x = \nabla_xz = J_{y, x}^Tg_y$
* $g_w = \nabla_wz = J_{x, w}^Tg_x$
* $g_v = \nabla_vz = J_{w, v}^Tg_w + J_{y, v}^Tg_y$
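As a sanity check, the four formulas above can be run directly in numpy. The sketch below assumes a small made-up graph with that structure ($w = Av$, $x = Bw$, $y = \tanh(x) + Cv$, $z = \sum_i y_i$); every backward step is a Jacobian-transpose (vector-Jacobian) product, and the result is verified against finite differences.

```python
import numpy as np

# Hypothetical graph matching the formulas above:
#   w = A v,  x = B w,  y = tanh(x) + C v,  z = sum(y)
# v feeds both w and y, so g_v gets two terms.
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 3))   # J_{w,v} = A
B = rng.normal(size=(5, 4))   # J_{x,w} = B
C = rng.normal(size=(5, 3))   # J_{y,v} = C

v = rng.normal(size=3)
w = A @ v
x = B @ w
y = np.tanh(x) + C @ v
z = y.sum()

# Reverse pass: each gradient is a Jacobian transpose times the
# gradient flowing in from the node above.
g_y = np.ones_like(y)                 # nabla_y z for z = sum(y)
J_yx = np.diag(1.0 - np.tanh(x)**2)   # Jacobian of elementwise tanh
g_x = J_yx.T @ g_y                    # g_x = J_{y,x}^T g_y
g_w = B.T @ g_x                       # g_w = J_{x,w}^T g_x
g_v = A.T @ g_w + C.T @ g_y           # g_v = J_{w,v}^T g_w + J_{y,v}^T g_y

# Verify against central finite differences.
def z_of(v):
    return (np.tanh(B @ (A @ v)) + C @ v).sum()

eps = 1e-6
g_num = np.array([(z_of(v + eps*e) - z_of(v - eps*e)) / (2*eps)
                  for e in np.eye(3)])
print(np.allclose(g_v, g_num, atol=1e-5))  # True
```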
* Nonlinear function
* Activation function
* To extend linear models to represent nonlinear functions of $x$: $y = w^Tx + b \Rightarrow y = w^T\phi(x) + b$
* To use kernel functions such as radial basis functions (RBFs), e.g.: $\phi (x) = e^{-(a||x - x_i||)^2}$
* Manually engineer $\phi$, that is, features in computer vision, speech recognition, etc.
* To learn $\phi$ from data: $y = f(x; \theta, w) = \phi(x; \theta)^Tw$
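A minimal sketch of the fixed-feature option: RBF features $\phi$ centered at the training points, with the linear readout $w$ fit in closed form. The 1-D data, the choice of centers, and the width $a$ are all illustrative assumptions, not from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 40)            # 1-D training inputs x_i
y = np.sin(X) + 0.1 * rng.normal(size=X.size)

a = 1.0                               # assumed RBF width parameter
def phi(x):
    # phi(x) = exp(-(a * |x - x_i|)^2), one feature per training point
    return np.exp(-(a * np.abs(x[:, None] - X[None, :])) ** 2)

# y = w^T phi(x) + b: fit [w, b] by least squares (phi is fixed, not learned)
Phi = np.hstack([phi(X), np.ones((X.size, 1))])
wb, *_ = np.linalg.lstsq(Phi, y, rcond=None)

x_test = np.linspace(-3, 3, 7)
pred = np.hstack([phi(x_test), np.ones((7, 1))]) @ wb
print(np.round(pred - np.sin(x_test), 2))  # residuals near the noise level
```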
* **ReLU (rectified linear unit): $g(z) = \max\{0, z\}$**
* **Although ReLU is not differentiable at all points, it is still fine to use with gradient-based learning algorithms.**
* This is in part because neural network training algorithms do not usually arrive at a local minimum of the cost function, but instead merely reduce its value significantly. Because we do not expect training to actually reach a point where the gradient is 0, it is acceptable for the minima of the cost function to correspond to points with undefined gradient.
* When **initializing the parameters** of the affine transformation, it can be a good practice to set all elements of $b$ to **a small positive value**, such as 0.1. Doing so makes it very likely that the rectified linear units will be **initially active for most inputs** in the training set and allow the derivatives to pass through.
* **One drawback to rectified linear units is that they cannot learn via gradient-based methods on examples for which their activation is zero.**
* **Leaky ReLU, $g(z)_i = \max\{0, z_i\} + \alpha_i \min\{0, z_i\}$, fixes $\alpha_i$ to a small value like 0.01.**
* PReLU treats $\alpha_i$ as a learnable parameter.
* **Maxout units generalize rectified linear units.**
* **Maxout: $g(z)_i = max_{j \in [1, k]}z_{ij} = max\{w_1^Tx + b_1, w_2^Tx + b_2, ..., w_k^Tx + b_k\}$**
* A maxout unit can learn a piece-wise linear, convex function with up to k pieces. Maxout units can thus be seen as learning the activation function itself rather than just the relationship between units.
* If $k = 2$, it can recover the ReLU (e.g., by fixing one piece to $w_2 = 0$, $b_2 = 0$), as in the sketch below.
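The sketch below collects these units in numpy: ReLU with the small-positive-bias initialization mentioned above, leaky ReLU (PReLU would instead make `alpha` learnable), and a maxout unit that recovers ReLU when one of its $k = 2$ pieces is fixed to zero. Shapes and values are arbitrary.

```python
import numpy as np

def relu(z):
    # g(z) = max{0, z}
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    # g(z) = max{0, z} + alpha * min{0, z}; PReLU instead learns alpha
    return np.maximum(0.0, z) + alpha * np.minimum(0.0, z)

def maxout(x, W, b):
    # g(x)_i = max_j (W[j] x + b[j]): max over k affine pieces.
    # W has shape (k, out, in), b has shape (k, out).
    return (W @ x + b).max(axis=0)

rng = np.random.default_rng(0)
x = rng.normal(size=5)

# Small positive bias so the ReLUs start active for most inputs
W1, b1 = rng.normal(size=(3, 5)), np.full(3, 0.1)
print(relu(W1 @ x + b1))

# Maxout with k = 2 pieces; setting one piece to (0, 0) recovers ReLU
W = np.stack([W1, np.zeros((3, 5))])
b = np.stack([b1, np.zeros(3)])
print(np.allclose(maxout(x, W, b), relu(W1 @ x + b1)))  # True
```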
* Sigmoid units for Bernoulli output distributions
* Use a logistic sigmoid output combined with maximum likelihood
* $\hat y = \sigma(z) = \frac{1}{1 + e^{-z}}, z = w^Th + b$
* $\log\tilde{P}(y) = yz,\ y \in \{0, 1\}$
* $\tilde{P}(y) = e^{yz}$
* $P(y) = \frac{e^{yz}}{\sum_{y' \in \{0, 1\}} e^{y'z}}$
* $P(y) = \sigma((2y - 1)z)$
* $J(\theta) = -\log P(y) = \zeta((1 - 2y)z)$, where $\zeta(x) = \log(1 + e^x)$ is the softplus
* **The softplus in $J(\theta)$ saturates (and the gradient shrinks) only when $(1 - 2y)z$ is very negative, i.e., only when the model already has the right answer, so learning does not stall.**
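A small numeric sketch of this loss: computing $J(\theta) = \zeta((1 - 2y)z)$ directly on the logit $z$ keeps the log/exp composition numerically stable, and its gradient $\sigma(z) - y$ vanishes only when the model is confidently correct. The example logits and labels are made up.

```python
import numpy as np

def softplus(x):
    # zeta(x) = log(1 + e^x), computed stably for large |x|
    return np.maximum(x, 0.0) + np.log1p(np.exp(-np.abs(x)))

def bernoulli_nll(z, y):
    # J(theta) = zeta((1 - 2y) z): negative log-likelihood on the logit z
    return softplus((1 - 2 * y) * z)

z = np.array([-30.0, -2.0, 0.0, 2.0, 30.0])  # logits
y = np.array([1, 0, 1, 0, 1])                 # labels in {0, 1}

# Large loss for the confidently wrong first example,
# near-zero loss for the confidently right last one.
print(bernoulli_nll(z, y))

# The gradient w.r.t. z is sigma(z) - y: it only vanishes when the
# model is confidently correct, so mistakes keep producing signal.
sigma = 1.0 / (1.0 + np.exp(-z))
print(sigma - y)
```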
* Softmax units for Multinoulli output distributions
* $\text{softmax}(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$
* Minimize the negative log-likelihood: $-\log P(y = i) = -(z_i - \log\sum_j e^{z_j})$
* The first term encourages $z_i$ to be pushed up, while the second term encourages all of $z$ to be pushed down.
* Negative log-likelihood cost function always strongly penalizes the most active incorrect prediction.
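A minimal stable implementation of both formulas, using the standard max-subtraction trick (shifting $z$ by a constant leaves the softmax unchanged); it is also an instance of the point made in the cost-function section below, where the log undoes the exp. The example logits are made up.

```python
import numpy as np

def log_softmax(z):
    # log softmax(z)_i = z_i - log sum_j e^{z_j}, stabilized by
    # subtracting max(z) (constant shifts cancel inside the softmax)
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

def nll(z, i):
    # -log P(y = i) = -(z_i - log sum_j e^{z_j})
    return -log_softmax(z)[i]

z = np.array([1.0, 5.0, -2.0])
print(np.exp(log_softmax(z)))  # the softmax probabilities, sum to 1
print(nll(z, 1))               # small: class 1 already dominates
print(nll(z, 2))               # large: the wrong answer is strongly penalized
```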
* Cost function
* An important aspect of the design of a deep neural network is the choice of the cost function.
* Most modern neural networks are trained using maximum likelihood.
* negative log-likelihood:
* **$J(\theta) = -E_{x, y \sim \hat{p}_{data}} \log p_{model}(y | x)$**
* Many output units involve an exp function that can saturate when its argument is very negative.
* The log in the cost function undoes the exp, preventing this saturation.
* If we define $p_{model}(y | x) = N(y; f(x; \theta), I)$
* Linear output units
* $J(\theta) = \frac{1}{2}E_{x, y \sim \hat{p}_{data}} ||y - f(x; \theta)||^2 + const$
* The network will learn the conditional mean.
* **Maximizing the log-likelihood is equivalent to minimizing the mean squared error.**
* If we define $p_{model}(y | x) = \frac{1}{2\gamma}e^{-\frac{|y - f(x; \theta)|}{\gamma}}$ as a Laplace distribution
* $J(\theta) = E_{x, y \sim \hat{p}_{data}} |y - f(x; \theta)|$
* The network will learn the conditional median.
* **View the cost function as being a functional, rather than a function**
* A functional is a mapping from functions to real numbers
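Seen as functionals, the two costs above pick out different statistics of $p(y \mid x)$, which can be checked numerically: minimizing squared error over a constant prediction recovers the mean, while minimizing absolute error recovers the median. The sample values below, including the outlier, are made up.

```python
import numpy as np

# Hypothetical y-samples for a single fixed x, with one outlier
y = np.array([1.0, 1.2, 0.8, 1.1, 9.0])

# Scan constant predictions c and evaluate the two cost functionals
c = np.linspace(0.0, 10.0, 10001)
mse = ((y[None, :] - c[:, None]) ** 2).mean(axis=1)
mae = np.abs(y[None, :] - c[:, None]).mean(axis=1)

print(c[mse.argmin()], y.mean())      # MSE minimizer ~ mean (2.62)
print(c[mae.argmin()], np.median(y))  # MAE minimizer ~ median (1.1)
```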
* Architecture Design
* How many units it should have and how these units should be connected to each other.
* How to choose the depth and width of each layer.
* Deeper networks are often able to use far fewer units per layer and far fewer parameters, and frequently generalize better to the test set, but they also tend to be harder to optimize.
* The universal approximation theorem states that a feedforward network with one linear output layer and at least one hidden layer with any “squashing” activation function (e.g., sigmoid) can approximate any Borel measurable function, provided that the network is given enough hidden units.
* A feedforward network with a single hidden layer is sufficient to represent any such function, but the layer may be infeasibly large and may fail to learn and generalize correctly.
* Using deeper models can reduce the number of units required to represent the desired function and can reduce the generalization error.
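As a small illustration of the theorem (not a proof), the sketch below trains a single-hidden-layer network with a sigmoid "squashing" activation and a linear output by plain gradient descent, i.e., back-propagation as derived earlier, to approximate $\sin$ on an interval. The width $H = 20$, learning rate, and iteration count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-np.pi, np.pi, 200)[:, None]  # inputs, shape (200, 1)
y = np.sin(x)                                 # target function

H = 20                                        # hidden width (assumed)
W1, b1 = rng.normal(size=(1, H)), np.zeros(H)
W2, b2 = 0.1 * rng.normal(size=(H, 1)), np.zeros(1)

lr = 0.1
for step in range(20000):
    # Forward: one sigmoid ("squashing") hidden layer, linear output
    h = 1.0 / (1.0 + np.exp(-(x @ W1 + b1)))  # (200, H)
    pred = h @ W2 + b2
    # Backward: chain rule, the same Jacobian-transpose products as above
    g_pred = 2.0 * (pred - y) / len(x)        # grad of MSE w.r.t. pred
    g_W2, g_b2 = h.T @ g_pred, g_pred.sum(axis=0)
    g_h = g_pred @ W2.T
    g_z = g_h * h * (1.0 - h)                 # sigmoid derivative
    g_W1, g_b1 = x.T @ g_z, g_z.sum(axis=0)
    W1 -= lr * g_W1; b1 -= lr * g_b1
    W2 -= lr * g_W2; b2 -= lr * g_b2

print(float(np.mean((pred - y) ** 2)))  # MSE well below Var[sin] ~ 0.5
```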