## The Neural Network
### The Limits of Traditional Computer Programs
- Traditional computer programs are very good at
    - Performing arithmetic really fast
    - Explicitly following a list of instructions
### The Mechanics of Machine Learning
- ***Humans*** usually **learn by example**, **not by formula**
:::info
**Ex.** How do we recognize a dog?
- By measuring the shape of its nose or the contour of its body? ❌
- By being shown many examples and being corrected whenever we make a wrong guess ⭕
:::
- Over our lifetime, our model becomes more and more accurate as we assimilate more and more examples
- Deep learning is a subset of a more general field of artificial intelligence called machine learning
- Machine learning works by giving the machine a model with which it can evaluate examples, along with a small set of instructions for modifying the model when it makes a mistake
---
- Define a model to be a function **h(x, θ)**, where **x** is an example expressed in vector form, and **θ** is a parameter vector that our model uses
- For example, if $x$ were a grayscale image, it could be represented as a vector of pixel intensity values

- Assume we want to “Determine how to **predict exam performance** based on the number of **hours of sleep we get** and the number of hours we study the previous day.”
- Collect data as $x = [x_1, x_2]^T$, where $x_1$ is the number of hours of sleep we got and $x_2$ is the number of hours we spent studying
- Our goal might be to learn a model $h(x, \theta)$ with parameter vector $\theta = [\theta_0, \theta_1, \theta_2]^T$ such that:
$$
h(x, \theta) =
\begin{cases}
-1, & \text{if } x^T \begin{bmatrix} \theta_1 \\ \theta_2 \end{bmatrix} + \theta_0 < 0 \\
1, & \text{if } x^T \begin{bmatrix} \theta_1 \\ \theta_2 \end{bmatrix} + \theta_0 \ge 0
\end{cases}
$$
- We want to learn a parameter vector $\theta$ such that our model **makes the right prediction** given an input example $x$.
- This model is called a **linear perceptron**, and it’s a model that’s been used since the 1950s.

- By selecting $\theta = [-24, 3, 4]^T$, this machine learning model makes the correct prediction on every data point
$$
h(x, \theta) =
\begin{cases}
-1 & \text{if } 3x_1 + 4x_2 - 24 < 0 \\
1 & \text{if } 3x_1 + 4x_2 - 24 \ge 0
\end{cases}
$$
- In **most cases**, there are many possible choices for $\theta$ that are **optimal**.
- Questions remain:
    - How do we even come up with an **optimal value** for the parameter vector $\theta$ in the first place?
    - It’s quite clear that this particular model is **quite limited** in the relationships it can learn.
        - For example, the distribution of data below cannot be described well by a linear perceptron.

### The Neuron
- Using the structure of biological neural networks, we can build machine models that solve problems in an analogous way

:::danger
- Formulate the **input** as a vector $x = [x_1, x_2, ..., x_n]$ and the **weights** of the neuron as $w = [w_1, w_2, ..., w_n]$
- The **output** of the neuron is $y = f(w \cdot x + b)$, where $b$ is the bias term
:::
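
As a minimal sketch of this definition (assuming NumPy; the function name is illustrative, not from the source):

```python
import numpy as np

def neuron_output(x, w, b, f):
    """Single neuron: apply activation f to the weighted sum w.x plus the bias b."""
    return f(np.dot(w, x) + b)
```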
### Study or Sleep?
$$
h(x, \theta) =
\begin{cases}
-1, & \text{if } 3x_1 + 4x_2 - 24 < 0 \\
1, & \text{if } 3x_1 + 4x_2 - 24 \ge 0
\end{cases}
$$
- Previously, we constructed a linear perceptron classifier that divided the Cartesian coordinate plane into two halves

- This is an optimal choice for $\theta$ because it correctly classified every sample in the dataset
- Consider using a neuron to model the same problem, with the activation function
$$
f(z) =
\begin{cases}
-1, & \text{if } z < 0 \\
1, & \text{if } z \ge 0
\end{cases}
$$
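
A minimal sketch tying this back to the perceptron (assuming NumPy; `step` plays the role of $f$ above, and the weights and bias come from the earlier example):

```python
import numpy as np

def step(z):
    """f(z): -1 if z < 0, otherwise 1."""
    return np.where(z < 0, -1, 1)

w, b = np.array([3.0, 4.0]), -24.0   # theta = [-24, 3, 4]^T from before
x = np.array([4.0, 3.0])             # 4 hours of sleep, 3 hours of study
print(step(np.dot(w, x) + b))        # -> 1, the same answer as the perceptron
```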
### Feed-Forward Neural Networks (FFN)
- Although single neurons are more powerful than linear perceptrons, they’re not nearly expressive enough to solve complicated learning problems
- Neurons in the human brain are organized into layers. In fact, the human cerebral cortex is made up of six layers.
- This is a simplified version of these layers:

---
- Borrowing from these concepts, we can construct an ***artificial neural network***.
- The **bottom layer** of the network pulls in the **input** data
- The **top layer** of neurons computes the final **answer**
- The **middle layer(s)** of neurons are called the ***hidden layers***
- We let $W^{(k)}_{i,j}$ be the **weight** of the connection between the $i^{th}$ neuron in the $k^{th}$ layer and the $j^{th}$ neuron in the $(k+1)^{st}$ layer
- These weights constitute our **parameter vector** $\theta$.
- Our ability to solve problems with neural networks depends on the network's complexity, the training algorithm, and **finding the optimal values** to plug into $\theta$
---
- Notice that in the preceding example, connections only traverse from a lower layer to a higher layer
    - **There are no connections between neurons in the same layer, and none that transmit data from a higher layer back to a lower layer**
    - $\Rightarrow$ such a network is therefore called a ***feed-forward network***
- In the example, every layer has the same number of neurons, but this is neither necessary nor recommended. Hidden layers often have fewer neurons than the input layer, which forces the network to learn a compressed representation of the raw input
- It is also not required that every neuron's output be connected to the inputs of all neurons in the next layer
- The inputs and outputs are vectorized representations.
- $y = f(W^Tx + b),\ x = [x_1\ x_2\ ...\ x_n],\ y = [y_1\ y_2\ ...\ y_m]$
- $W$: a weight matrix of size $n \times m$, and b is a bias vector of size m
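
A rough sketch of this vectorized forward computation (assuming NumPy; the sigmoid is just an example choice of $f$):

```python
import numpy as np

def layer_forward(x, W, b, f):
    """Compute y = f(W^T x + b) for one layer: W is n x m, b has m entries."""
    return f(W.T @ x + b)

f = lambda z: 1.0 / (1.0 + np.exp(-z))  # sigmoid, for illustration
x = np.array([0.5, -1.2, 3.0])          # n = 3 inputs
W = 0.1 * np.ones((3, 2))               # n x m weight matrix
b = np.zeros(2)                         # bias vector of size m = 2
y = layer_forward(x, W, b, f)           # y has m = 2 entries
```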
### Linear Neurons and Their Limitations
- Most neuron types are defined by the function $f$ they apply to their **logit** $z$, the weighted sum of the neuron's inputs plus the bias
- [How to understand the "logits" that frequently appear in deep-learning source code? - Zhihu](https://www.zhihu.com/question/60751553)
- Let’s first consider layers of neurons that use a linear function: $f(z) = az + b$
- For example,
    - A neuron that tries to estimate the cost of a meal at a fast-food restaurant would be a linear neuron with $a = 1$ and $b = 0$, i.e., it uses $f(z) = z$ with weights equal to the price of each item
- Linear neurons run into serious limitations, however. It can be shown that **any FFN consisting of only linear neurons can be expressed as a network with no hidden layers**, as the short derivation below shows. This is problematic because the magic happens in the hidden layers
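
The collapse is easy to verify: composing two linear layers (ignoring the per-neuron constants $a$, $b$ of $f$ for clarity) gives

$$
W_2^T\big(W_1^T x + b_1\big) + b_2
= (W_1 W_2)^T x + \big(W_2^T b_1 + b_2\big)
= W'^T x + b'
$$

which is just a single linear layer with weights $W' = W_1 W_2$ and bias $b' = W_2^T b_1 + b_2$.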

### Sigmoid, Tanh, and ReLU Neurons
- Three main types of neurons are used in practice; each introduces a nonlinearity into its computation
- ***Sigmoid neuron*** uses the function: $f(z)=\cfrac{1}{1+e^{-z}}$
- **When the logit is very small**, the **output** of a logistic neuron is very **close to 0**
- **When the logit is very large**, the **output** of the logistic neuron is **close to 1**
- In between, the neuron assumes an **S-shape**

---
- ***Tanh neurons*** use a similar kind of S-shaped nonlinearity, but instead of ranging from 0 to 1, their output **ranges from -1 to 1**
- They use $f(z) = \tanh(z)$

---
- A different kind of nonlinearity is used by the ***Rectified Linear Unit (ReLU) neuron***
- It uses the function $f(z) = \max(0, z)$, resulting in a characteristic hockey-stick-shaped response
- The ReLU has recently become the neuron of choice for many tasks (especially in computer vision)
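
All three nonlinearities are one-liners (a sketch assuming NumPy):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))  # S-shaped, output in (0, 1)

def tanh(z):
    return np.tanh(z)                # S-shaped, output in (-1, 1)

def relu(z):
    return np.maximum(0.0, z)        # hockey stick: 0 for z < 0, z otherwise
```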

### Softmax Output Layers
- Often we want the output vector to be a **probability distribution** over a set of mutually exclusive labels
    - For example, consider a neural network for recognizing handwritten digits from the MNIST dataset. Each label (0 through 9) is mutually exclusive, but we are unlikely to identify a digit with 100% confidence
    - A probability distribution gives us a better sense of how confident we are in our predictions
- As a result, the desired output vector is of the form $[p_0\ p_1\ p_2\ \ldots\ p_9]$, where $\sum^9_{i=0}p_i=1$
- This can be achieved by using a special output layer called a ***softmax layer***.
- The output of a neuron in a softmax layer **depends on the outputs of all the other neurons in its layer**
- Letting $z_i$ be the logit of the $i^{th}$ softmax neuron, we can achieve this normalization by setting its output to
$$
y_i=\cfrac{e^{z_i}}{\sum\limits_{j} e^{z_j}}
$$
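
A minimal sketch of this computation (assuming NumPy; subtracting the maximum logit is a standard numerical-stability trick and cancels out in the ratio):

```python
import numpy as np

def softmax(z):
    """Turn a vector of logits into a probability distribution."""
    e = np.exp(z - np.max(z))  # shifted for numerical stability
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p, p.sum())  # components in (0, 1), summing to 1
```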
## Training Feed-Forward Neural Networks
### The Fast-Food Problem
- How exactly do we figure out what the parameter vector should be?
- This is accomplished by a process commonly referred to as **training**
    - We show the neural network a large number of training examples and **iteratively modify the weights** to minimize the errors we make on the training examples
- After enough examples, we expect the network to solve its trained task quite effectively
---

$y^{(i)}=w_1x_1^{(i)}+w_2x_2^{(i)}+w_3x_3^{(i)}$
- As in the previous example, we want to be able to predict how much a meal is going to cost us, but the items don’t have price tags
- The only thing the cashier will tell us is the total price of the meal
- We want to train a single linear neuron to solve this problem. How do we do it?
- Easy way
- Be intelligent about picking training cases.
- For example, we could buy only a single serving of burgers, then the other day, only a single serving of fries, and then a single serving of soda
- But the issue with this approach is that in real situations, it rarely gets anywhere close to a solution.
- For example, there’s no clear analog of this strategy in image recognition.
---
- Another approach works well in general.
- Let’s consider **minimizing the squared error** over all of the training examples that we encounter.
- More formally, if we know that $t^{(i)}$ is the true answer for the $i^{th}$ training example and $y^{(i)}$ is the value computed by the neural network, we want to **minimize** the value of the ***error function*** $E$:
$$
E=\frac{1}{2}\sum_i\Big(t^{(i)}-y^{(i)}\Big)^2
$$
- The squared error is zero when our model makes a perfectly correct prediction on every training example. Moreover, **the closer $E$ is to 0, the better our model is**.
- As a result, our goal will be to select a parameter vector $\theta$ such that $E$ is as close to 0 as possible.
### Questions
- Why do we need the error function?
- Why can’t we treat the problem as a system of equations and solve for the weights directly?
### Gradient Descent
- Assume the **linear neuron** has only two inputs (and thus only two weights, $w_1$ and $w_2$).
- Then we can imagine a **three-dimensional space** where the horizontal dimensions correspond to the weights $w_1$ and $w_2$, and the vertical dimension corresponds to the value of the error function $E$.
- If we consider the errors we make over all possible weights, we get a surface in this three-dimensional space, in particular, a quadratic bowl as in the figure.

---
- We can also visualize this surface as a set of elliptical contours, where the minimum error is at the center of the ellipses.
- The closer the contours are to each other, the steeper the slope.
- The direction of steepest descent is always perpendicular to the contours.
- This direction is expressed as a vector known as the **gradient**.
- From this we can develop a high-level strategy for finding the values of the weights that minimize the error function.

---
- Suppose we randomly initialize the weights of our network, so we find ourselves somewhere on the horizontal plane.
- By **evaluating the gradient** at our current position, we can find the direction of steepest descent, and we can take a step in that direction.
- Then we’ll find ourselves at a new position that’s closer to the minimum than we were before.
- We then repeat this process of evaluating the gradient and taking a step.
- Eventually, this strategy will get us to the point of minimum error.
- This algorithm is known as ***gradient descent***.
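
A minimal sketch of the loop (a hand-written quadratic bowl stands in for the real error surface; $\epsilon$ is the learning rate discussed next):

```python
import numpy as np

def grad_E(w):
    """Gradient of the toy bowl E(w1, w2) = w1^2 + 2 * w2^2."""
    return np.array([2.0 * w[0], 4.0 * w[1]])

w = np.random.randn(2)         # random initialization on the plane
epsilon = 0.1                  # step size
for _ in range(100):
    w -= epsilon * grad_E(w)   # step in the direction of steepest descent
print(w)                       # close to the minimum at (0, 0)
```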
### Learning Rates
- A quick note on hyperparameters.
- In addition to the weight parameters, learning algorithms also require a couple of **additional parameters** to carry out the training process.
- One of these so-called **hyperparameters** is the learning rate.
- At each step, **we need to determine how far we want to walk before recalculating our new direction**.
- **The closer we are to the minimum, the shorter we want to step forward**.
- However, if the error surface is rather mellow (flat), training can potentially take a large amount of time.
- As a result, we often multiply the gradient by a factor $\epsilon$ (epsilon), the **learning rate**.

### The Delta Rule
- In order to calculate how to change each weight, we evaluate the gradient, which is essentially the **partial derivative of the error function** with respect to each of the weights.
We know
$$
\begin{aligned}
E &= \frac{1}{2}\sum_{i}\Big(t^{(i)}-y^{(i)}\Big)^{2} \\
y^{(i)} &= w_1x_1^{(i)}+w_2x_2^{(i)}+w_3x_3^{(i)}
\end{aligned}
$$
Then
$$
\begin{aligned}
\Delta w_{k}&=-\epsilon\frac{\partial E}{\partial w_{k}} \\
&=-\epsilon\frac{\partial}{\partial w_{k}}\biggl(\frac{1}{2}\sum_{i}\Big(t^{(i)}-y^{(i)}\Big)^{2}\biggr) \\
&=\sum_{i}\epsilon\Big(t^{(i)}-y^{(i)}\Big)\frac{\partial y^{(i)}}{\partial w_{k}} \\
&=\sum_{i}\epsilon x_{k}^{(i)}\Big(t^{(i)}-y^{(i)}\Big)
\end{aligned}
$$
- Applying this method of changing the weights at every iteration, we are finally able to utilize gradient descent.
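
Putting the delta rule to work on the fast-food problem (a sketch assuming NumPy; the menu quantities and prices are made up for illustration):

```python
import numpy as np

# Each row of X: quantities of (burgers, fries, sodas) in one meal.
# Hypothetical true prices: burger 3, fries 1, soda 2.
X = np.array([[1., 2., 1.], [3., 0., 2.], [2., 1., 0.]])
t = np.array([7., 13., 7.])          # observed meal totals

w = np.zeros(3)
epsilon = 0.05
for _ in range(2000):
    y = X @ w                        # y^(i) = w1*x1 + w2*x2 + w3*x3
    w += epsilon * X.T @ (t - y)     # delta w_k = sum_i eps * x_k * (t - y)
print(np.round(w, 2))                # -> approximately [3. 1. 2.]
```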
### Gradient Descent with Sigmoidal Neurons
- **Logistic neurons** compute their output value from their inputs:
$$
\begin{aligned}
z &= \sum_k w_k x_k \\
y &= \frac{1}{1+e^{-z}}
\end{aligned}
$$
- The neuron computes the **weighted sum of its inputs**, the logit $z$. It then feeds $z$ through the logistic function to compute its final output $y$.
- We want to compute the gradient of the error function with respect to the weights.
---
- We start by taking the **derivative of the logit** with respect to the inputs and the weights:
$$
\begin{aligned}
\frac{\partial z}{\partial w_k} &= x_k \\
\frac{\partial z}{\partial x_k} &= w_k
\end{aligned}
$$
- The derivative of the output with respect to the logit is quite simple
$$
\begin{aligned}
\frac{dy}{dz} = \frac{e^{-z}}{\left(1+e^{-z}\right)^{2}} &= \frac{1}{1+e^{-z}}\frac{e^{-z}}{1+e^{-z}} \\
&= \frac{1}{1+e^{-z}}\Bigg(1-\frac{1}{1+e^{-z}}\Bigg) \\
&= y(1-y)
\end{aligned}
$$
- Using the chain rule, we get the derivative of the output with respect to each weight:
$$
\frac{\partial y}{\partial w_{k}} = \frac{dy}{dz}\frac{\partial z}{\partial w_{k}} = x_{k}y(1-y)
$$
- Putting all of this together, we can now compute the derivative of the error function with respect to each weight:
$$
\frac{\partial E}{\partial w_{k}} = \sum_{i}\frac{\partial E}{\partial y^{(i)}}\frac{\partial y^{(i)}}{\partial w_{k}} = -\sum_{i}x_{k}^{(i)}y^{(i)}\Big(1-y^{(i)}\Big)\Big(t^{(i)}-y^{(i)}\Big)
$$
- Thus, the final rule for modifying the weights becomes:
$$
\Delta w_{k}=\sum_{i}\epsilon x_{k}^{(i)}y^{(i)}\Big(1-y^{(i)}\Big)\Big(t^{(i)}-y^{(i)}\Big)
$$
- The new modification rule is just like the delta rule, except with **extra multiplicative terms** included to account for the logistic component of the **sigmoidal neuron**
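
The same kind of training loop for a sigmoidal neuron, with the extra $y(1-y)$ factor (a sketch; the toy data is made up, and there is no bias term, matching the derivation above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0]])  # toy inputs
t = np.array([1.0, 1.0, 0.0])                          # toy targets in [0, 1]

w = np.zeros(2)
epsilon = 0.5
for _ in range(2000):
    y = sigmoid(X @ w)
    w += epsilon * X.T @ (y * (1 - y) * (t - y))  # the rule derived above
```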
### The Backpropagation Algorithm
- The backpropagation algorithm was pioneered by David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams in 1986.
- We don’t know what the hidden units ought to be doing, but what we can do is **compute how fast the error changes** as we change the weight of an individual connection.
- **Each hidden unit can affect many output units**. Thus we’ll have to combine many separate effects on the error in an informative way.
- Once we have the error derivatives for one layer of hidden units, we’ll use them to compute the error derivatives for the activities of the layer below.
- And once we find the error derivatives for the activities of the hidden units, it’s easy to get the error derivatives for the weights leading into a hidden unit.
---
- The subscript we use will refer to the layer of the neuron.
- The symbol **y** will refer to the **activity of a neuron**, and the symbol **z** will refer to the **logit of the neuron**.
- The error function derivatives at the output layer:
$$E=\frac{1}{2}\sum_{j\in output}\Big(t_j-y_j\Big)^2\implies\frac{\partial E}{\partial y_j}=-\Big(t_j-y_j\Big)$$
- Let’s presume we have the error derivatives for layer $j$.
- Now, we aim to calculate the error derivatives for the layer below it, layer $i$.
- To do so, we must accumulate information about how the output of a neuron in layer $i$ affects the logits of every neuron in layer $j$.

---
- This can be done as follows, using the fact that the partial derivative of the logit with respect to the incoming output from the layer beneath is merely the weight of the connection $w_{ij}$:
$$\frac{\partial E}{\partial y_{i}}=\sum_{j}\frac{\partial E}{\partial z_{j}}\frac{dz_j}{dy_{i}}=\sum_{j}w_{ij}\frac{\partial E}{\partial z_{j}}$$
- Furthermore, we observe the following
$$\frac{\partial E}{\partial z_{j}}=\frac{\partial E}{\partial y_{j}}\frac{dy_{j}}{dz_{j}}=y_{j}\Big(1-y_{j}\Big)\frac{\partial E}{\partial y_{j}}$$
- Combining these two, we can express the error derivatives of layer $i$ in terms of the error derivatives of layer $j$:
$$\frac{\partial E}{\partial y_{i}}=\sum_{j}w_{ij}y_{j}\Big(1-y_{j}\Big)\frac{\partial E}{\partial y_{j}}$$
---
- We can then determine how the error changes with respect to the weights. This gives us how to modify the weights after each training example:
$$\frac{\partial E}{\partial w_{ij}}=\frac{\partial z_{j}}{\partial w_{ij}}\frac{\partial E}{\partial z_{j}}=y_{i}y_{j}\Big(1-y_{j}\Big)\frac{\partial E}{\partial y_{j}}
$$
- Finally, to complete the algorithm, we merely sum up the partial derivatives over all the training examples in our dataset.
- This gives us the following modification formula:
$$\Delta w_{ij}=-\sum_{k\in dataset}\epsilon y_{i}^{(k)}y_{j}^{(k)}\Big(1-y_{j}^{(k)}\Big)\frac{\partial E^{(k)}}{\partial y_{j}^{(k)}}
$$
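
A compact sketch of one forward and backward pass for a two-layer sigmoidal network (assuming NumPy; the shapes and names are illustrative, with `y1` the hidden activities in layer $i$ and `y2` the outputs in layer $j$):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -0.3])                 # input
W1, W2 = 0.1 * np.ones((2, 3)), 0.1 * np.ones((3, 2))
t = np.array([1.0, 0.0])                  # target

y1 = sigmoid(W1.T @ x)                    # hidden activities (layer i)
y2 = sigmoid(W2.T @ y1)                   # output activities (layer j)

dE_dy2 = -(t - y2)                        # dE/dy at the output layer
dE_dz2 = y2 * (1 - y2) * dE_dy2           # dE/dz = y(1 - y) dE/dy
dE_dy1 = W2 @ dE_dz2                      # push derivatives down one layer
dE_dz1 = y1 * (1 - y1) * dE_dy1
dE_dW2 = np.outer(y1, dE_dz2)             # dE/dw_ij = y_i * dE/dz_j
dE_dW1 = np.outer(x, dE_dz1)
```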
### Stochastic and Minibatch Gradient Descent
- In the backpropagation algorithm, we’ve been using a version of gradient descent known as ***batch gradient descent***.
    - We use the entire dataset to compute the error surface and then follow the gradient to take the path of steepest descent.
- For a simple quadratic error surface, it works quite well. But consider the error surface below, which has a **flat region**; if we get unlucky, we might find ourselves **getting stuck**.

---
- To solve this problem, another approach called **Stochastic Gradient Descent (SGD)** may be used.
- At each iteration, the error surface is estimated only with respect to a single example, which improves the ability to navigate flat regions.

- The major pitfall of SGD
    - Looking at the error incurred one example at a time may not be a good enough approximation of the error surface.
    - As a result, gradient descent could take a significant amount of time to converge.
- One way to combat this problem is to use ***mini-batch gradient descent***.
    - At every iteration, we compute the error surface with respect to some subset of the total dataset (instead of a single example).
    - This subset is called a **minibatch**; the minibatch size is another hyperparameter.
- In the context of backpropagation, the weight update step becomes
$$
\Delta w_{ij}=-\sum_{k\in minibatch}\epsilon y_{i}^{(k)}y_{j}^{(k)}\Big(1-y_{j}^{(k)}\Big)\frac{\partial E^{(k)}}{\partial y_{j}^{(k)}}
$$
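
A sketch of the minibatch loop itself (framework-agnostic; `compute_gradient` is a hypothetical stand-in for the backpropagation step above):

```python
import numpy as np

def minibatches(X, t, batch_size):
    """Shuffle the dataset, then yield it in chunks of batch_size."""
    order = np.random.permutation(len(X))
    for start in range(0, len(X), batch_size):
        picked = order[start:start + batch_size]
        yield X[picked], t[picked]

# Usage sketch (compute_gradient is hypothetical):
# for X_batch, t_batch in minibatches(X, t, batch_size=32):
#     w -= epsilon * compute_gradient(w, X_batch, t_batch)
```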
### Test Sets, Validation Sets, and Overfitting
- Suppose we are given a set of data points in the plane and want to find a curve that best describes the dataset.
- Two models might be trained: a linear model and a degree-12 polynomial.
- Which curve should we trust? What about adding more data to the dataset?

---
- The linear model is not only better subjectively but also quantitatively.
- This leads to an interesting point about training and evaluating machine learning models.
- By building a very complex model, it is **easy to perfectly fit our training dataset**. But when we evaluate such a complex model **on new data, it performs very poorly**.
- In other words, the model does not generalize well.
- This phenomenon is called ***overfitting***.
---
- A neural network with two inputs, a softmax output of size two, and a hidden layer with 3, 6, or 20 neurons.

- It’s apparent that as the number of connections in the network increases, so does the propensity to overfit to the data.
- A similar overfitting phenomenon can be seen as the neural network becomes deeper.
- Here the neural networks have one, two, or four hidden layers of three neurons each.

---
- This leads to three major observations:
1. The machine learning engineer is always working with a direct **trade-off between overfitting and model complexity**. If the model isn’t complex enough, it may not be powerful enough to capture all the useful information necessary to solve a problem.
2. **It is very misleading to evaluate a model using the data we used to train it**. By **splitting up the data** into a ***training set*** and a ***test set***, it enables us to make a fair evaluation to the model by directly measuring how well it generalizes on new data it has not yet seen.
3. It’s quite likely that while we’re training our data, there’s a point in time where instead of learning useful features, we start overfitting to the training set. To avoid that, **we want to be able to stop the training process as soon as we start overfitting**.
---
- To do this, the training process can be divided into ***epochs***.
- **An epoch is a single iteration over the entire training set**. If we have a training set of size $d$ and we are doing mini-batch gradient descent with batch size $b$, then an epoch is equivalent to $d/b$ model updates.
- **At the end of each epoch, we use *validation set* to measure how well the model is generalizing**. If the accuracy on the training set continues to increase while the accuracy on the validation set stays the same (or decreases), it’s a good sign that it’s time to stop training because we’re overfitting.
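
In rough pseudocode (the helpers are hypothetical stand-ins, stubbed so the sketch runs):

```python
# Hypothetical stand-ins; replace with real training and evaluation code.
train_one_epoch = lambda: None            # one epoch = d/b minibatch updates
validation_accuracy = lambda: 0.9

best_val_acc, patience, bad_epochs = 0.0, 3, 0
for epoch in range(100):
    train_one_epoch()
    val_acc = validation_accuracy()       # measure generalization
    if val_acc > best_val_acc:
        best_val_acc, bad_epochs = val_acc, 0
    else:
        bad_epochs += 1                   # validation stopped improving
        if bad_epochs >= patience:
            break                         # stop training: we are overfitting
```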

### The Workflow of Building a Model
- Define the problem
- Build a neural network architecture
- Collect a significant amount of data
- Shuffle and divide this data up into three sets
- **Training**, **validation**, and **test** sets.

### Preventing Overfitting in Deep Neural Networks
- One method is called ***regularization***.
- It modifies the objective function that we minimize by adding additional terms that penalize large weights.
- Objective function $\Rightarrow$ $\text{Error} + \lambda f(\theta)$, where $f(\theta)$ grows larger as the components of $\theta$ grow larger, and $\lambda$ is the regularization strength (another hyperparameter).
- The value we choose for $λ$ determines how much we want to protect against overfitting. A $λ=0$ implies that we do not take any measures against the possibility of overfitting.
- The most common type of regularization in machine learning is L2 regularization. It can be implemented by augmenting the error function with the squared magnitude of all weights in the neural network.
- For every weight $w$ in the neural network, we add $\tfrac{1}{2}\lambda w^2$ to the error function. This heavily penalizes peaky weight vectors and prefers diffuse weight vectors; see the sketch below.
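
In code, L2 regularization adds one term to the error and one to the gradient (a sketch; `grad_E` is a hypothetical stand-in for the unregularized gradient):

```python
import numpy as np

lam = 0.1                             # regularization strength lambda
grad_E = lambda w: 2.0 * (w - 2.0)    # stand-in unregularized gradient

w, epsilon = np.array([5.0]), 0.05
for _ in range(500):
    grad = grad_E(w) + lam * w        # d/dw of [E + (lam/2) * w^2]
    w -= epsilon * grad               # every step also shrinks w toward 0
print(w)                              # slightly below the unregularized optimum of 2
```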
---
- Using the L2 regularization ultimately means that every weight is decayed linearly to zero. Because of this phenomenon, L2 regularization is also commonly referred to as weight decay.
- The visualized effect can be found below with regularization strengths of 0.01, 0.1 and 1.

---
- Another common type of regularization is L1 regularization.
- Add the term $λ|w|$ for every weight $w$ in the neural network.
- It has the intriguing property that it **leads the weight vector to become sparse** during optimization (i.e., very close to exactly zero).
- In other words, neurons with L1 regularization end up using only a small subset of their most important inputs and become quite resistant to noise in the inputs.
- In comparison, weight vectors from **L2 regularization** are **usually diffuse, small numbers**.
- **L1 regularization is very useful when you want to understand exactly which features are contributing to a decision**.
- If this feature analysis is not important, L2 regularization is preferred since it empirically performs better.
---
- ***Max norm constraints*** have a similar goal of attempting to restrict θ from becoming too large, but they do this more directly.
- Max norm constraints enforce an absolute upper bound on the magnitude of the incoming weight vector for every neuron and use projected gradient descent to enforce the constraint.
- In other words, any time a gradient descent step moves the incoming weight vector such that $||w||_2 > c$, we project the vector back onto the ball with radius c. Typical values of c are 3 and 4.
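
The projection step is a single rescale (a sketch assuming NumPy):

```python
import numpy as np

def max_norm_project(w, c=3.0):
    """After a gradient step, clip the incoming weight vector to norm c."""
    norm = np.linalg.norm(w)
    return w * (c / norm) if norm > c else w
```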
---
- ***Dropout*** is a very different kind of technique that has become one of the most favored methods for preventing overfitting in deep neural networks.
- While training, dropout is implemented by keeping a neuron active with some probability $p$ (a hyperparameter) and setting its output to zero otherwise.
- Intuitively, this forces the network to be accurate even in the absence of certain information. It prevents the network from becoming too dependent on any one neuron.
- It prevents overfitting by providing a way of approximately combining exponentially many different neural network architectures efficiently.
### Dropout
- Dropout is easy to understand, but there are some important intricacies to consider.
- First, we’d like the output of neurons during test time to be equivalent to their expected outputs at training time.
- This could be fixed by scaling the output at test time.
- But it’s always preferable to use inverted dropout, where the scaling occurs at training time instead of at test time.

- https://datasciocean.tech/deep-learning-core-concept/understand-dropout-in-deep-learning/
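
A sketch of inverted dropout (assuming NumPy; dividing by $p$ at training time keeps the expected activation equal to $y$, so test-time code needs no extra scaling):

```python
import numpy as np

def inverted_dropout(y, p=0.5, training=True):
    """Keep each activation with probability p; rescale so E[output] = y."""
    if not training:
        return y                               # nothing to do at test time
    mask = (np.random.rand(*y.shape) < p) / p  # 1/p where kept, 0 where dropped
    return y * mask
```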