# 11 & 12 Applications
## Question 1
### Answer
False
### Explanation
In linear regression, the "best fit model" minimizes the mean squared error (MSE), but it does not necessarily make the error zero, especially in real-world datasets where noise or imperfect relationships exist between features and the target.
$$
C(\widehat{\beta}) = \frac{1}{n}\sum_{i = 1}^n (y_i - \widehat{y}_i)^2
$$
Even if we find the optimal $\hat\beta$ (model parameters), the predictions $\hat{y}_i$ won't perfectly match the actual values $y_i$ unless:
- the regression line passes exactly through every data point, and
- there is no noise or unexplained variance in the data.
For example, suppose we have the following points:
| x | y |
| ---- | ---- |
| 1 | 2 |
| 2 | 4.1 |
| 3 | 6 |
The best fit is close to $y = 2x$, but because of the slight deviation at $x = 2$ (where $y = 4.1$ instead of $4$), the MSE cannot be exactly zero.
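This can be checked directly with a minimal sketch: the closed-form least-squares fit to the three points above still leaves a small but nonzero MSE.

```python
# Hypothetical check: closed-form least-squares fit to the three points above.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.1, 6.0]
n = len(xs)

x_mean = sum(xs) / n
y_mean = sum(ys) / n
slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) \
    / sum((x - x_mean) ** 2 for x in xs)
intercept = y_mean - slope * x_mean

# MSE at the optimal parameters: small, but not zero
mse = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)) / n
print(round(slope, 3), round(intercept, 3), round(mse, 5))
```

Even the best achievable line (slope exactly 2) carries a residual at $x = 2$, so the minimized cost stays positive.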

## Question 2
### Answer
False
### Explanation
Gradient descent does not always converge to the global minimum, especially when the cost function is non-convex. Many models, particularly in deep learning, have complex loss surfaces with multiple local minima and saddle points.
Gradient descent can get stuck in:
- A local minimum (not the best possible solution)
- A saddle point, where the gradient is zero but the point is neither a minimum nor a maximum
It only guarantees reaching the global minimum when the cost function is convex, like in simple linear regression.
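A minimal sketch of this behaviour, using the hypothetical non-convex function $f(w) = (w^2 - 1)^2 + 0.3w$, which has two minima: depending on the starting point, gradient descent lands in either the global minimum or a shallower local one.

```python
def grad(w):
    # f(w) = (w**2 - 1)**2 + 0.3*w  (hypothetical non-convex cost, two minima)
    return 4 * w * (w ** 2 - 1) + 0.3

def gd(w, eta=0.01, steps=2000):
    for _ in range(steps):
        w -= eta * grad(w)
    return w

w_left = gd(-2.0)   # reaches the deeper (global) minimum near w = -1
w_right = gd(2.0)   # gets stuck in the shallower local minimum near w = +1
print(round(w_left, 3), round(w_right, 3))
```

Both runs stop at a point of zero gradient, but only one of them is the global minimum; the other start is trapped by the intervening hill.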
## Question 3
### Answer
True
### Explanation
In a quadratic cost function, the gradient is proportional to the distance from the minimum.
A quadratic function has the shape of a parabola: the slope of the tangent (the angle it makes with the x-axis) decreases as the point approaches the global minimum, which means the update steps in gradient descent also become smaller.
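A short sketch with the hypothetical cost $C(w) = (w - 3)^2$: its gradient $2(w - 3)$ is proportional to the distance from the minimum, so each gradient-descent step is smaller than the last.

```python
def grad(w):
    # C(w) = (w - 3)**2, so grad C(w) = 2*(w - 3): proportional to the
    # distance from the minimum at w = 3
    return 2 * (w - 3)

w, eta = 0.0, 0.1
step_sizes = []
for _ in range(5):
    step = eta * grad(w)
    w -= step
    step_sizes.append(abs(step))
print([round(s, 4) for s in step_sizes])
```

With $\eta = 0.1$ each step is 80% of the previous one, shrinking geometrically as $w$ approaches 3.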

## Question 4
### Answer
True
### Explanation
The learning rate $\eta$ controls the step size in gradient descent. If it's too small, learning is slow. But if it's too large, the updates might overshoot the minimum, causing the algorithm to oscillate or even diverge.
Update rule in gradient descent:
$$
w_{new} = w_{old} - \eta \, \nabla C(w_{old})
$$
When $\eta$ is too large:
- Instead of approaching the minimum, the algorithm jumps back and forth across it.
- This leads to a zig-zag pattern in the cost curve or complete instability.
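The overshoot can be seen in a minimal sketch on the hypothetical cost $C(w) = w^2$: with $\eta = 1.1$ each update multiplies $w$ by $-1.2$, so the iterates flip sign and grow.

```python
def grad(w):
    return 2 * w  # gradient of C(w) = w**2

def run(eta, w=1.0, steps=10):
    history = [w]
    for _ in range(steps):
        w -= eta * grad(w)
        history.append(w)
    return history

small = run(eta=0.1)  # w shrinks by factor 0.8 each step: smooth convergence
large = run(eta=1.1)  # w is multiplied by -1.2 each step: zig-zag and divergence
print(round(small[-1], 4), round(large[-1], 2))
```

The small learning rate converges monotonically toward 0, while the large one produces the zig-zag, diverging pattern described above.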

## Question 5
### Answer
False
### Explanation
In Stochastic Gradient Descent (SGD), we do not use the entire dataset to compute the gradient. Instead, we update the parameter $\beta$ using the gradient from a single randomly selected data point at a time.
This makes SGD:
- Faster (especially with large datasets)
- Noisier, since updates are based on individual samples
- Able to escape local minima more easily due to noise in the updates
In contrast:
- Batch Gradient Descent computes the gradient using the entire dataset.
- Mini-Batch Gradient Descent uses a small batch of samples.
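A minimal single-sample SGD sketch on hypothetical toy data $y \approx 3x$: each update uses the gradient from one randomly chosen point, not the whole dataset.

```python
import random

random.seed(0)
# Hypothetical toy data: y is roughly 3x with a little noise
data = [(i / 10, 3 * (i / 10) + random.uniform(-0.1, 0.1)) for i in range(1, 21)]

beta, eta = 0.0, 0.01
for _ in range(500):
    x, y = random.choice(data)       # one randomly selected sample per update
    g = -2 * x * (y - beta * x)      # gradient of (y - beta*x)**2 w.r.t. beta
    beta -= eta * g
print(round(beta, 2))
```

The individual updates are noisy, but over many iterations $\beta$ still settles near the true slope of 3.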
## Question 6
### Answer
True
### Explanation
In vanilla (batch) gradient descent, the gradient is computed using all samples in the dataset. This results in longer and more complex expressions involving all data points.
In Stochastic Mini-Batch Gradient Descent (SMBGD), the gradient is computed using only a small batch of the dataset (e.g., 32 or 64 samples), rather than the full dataset.
This means the gradient calculation at each step is simpler and involves fewer terms.
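A mini-batch sketch on hypothetical data (100 samples of $y = 3x$, batch size 8): each gradient sum involves only 8 terms instead of all 100.

```python
import random

random.seed(1)
# Hypothetical data: 100 samples of y = 3x exactly
data = [(i / 50, 3 * (i / 50)) for i in range(1, 101)]

def batch_grad(beta, batch):
    # average gradient of (y - beta*x)**2 over the batch only
    return sum(-2 * x * (y - beta * x) for x, y in batch) / len(batch)

beta, eta, batch_size = 0.0, 0.05, 8
for _ in range(300):
    batch = random.sample(data, batch_size)  # 8 terms instead of 100
    beta -= eta * batch_grad(beta, batch)
print(round(beta, 3))
```

Each step is cheaper to compute than a full-batch gradient, yet $\beta$ still converges to the true slope.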
## Question 7
### Answer
True
### Explanation
The learning rate $\eta$ controls how big the steps are when updating parameters during gradient descent.
- If $\eta$ is too large, the algorithm might overshoot the minimum or diverge.
- If $\eta$ is very small, the updates are tiny, and although stable, it takes much longer to converge.
This means decreasing the learning rate results in slower convergence, requiring more iterations (epochs) to reach the optimal solution.
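A sketch comparing iteration counts for two learning rates on the hypothetical cost $C(w) = w^2$: the smaller rate needs roughly ten times as many updates to reach the same tolerance.

```python
def iterations_to_converge(eta, w=10.0, tol=1e-6):
    # minimize C(w) = w**2; count updates until |w| < tol
    n = 0
    while abs(w) >= tol:
        w -= eta * 2 * w
        n += 1
    return n

fast = iterations_to_converge(eta=0.1)   # w shrinks by factor 0.8 per step
slow = iterations_to_converge(eta=0.01)  # w shrinks by factor 0.98 per step
print(fast, slow)
```

Both runs reach the same minimum; the smaller learning rate is simply far slower, matching the trade-off described above.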

## Question 8
### Answer
False
### Explanation
Stochastic Mini-Batch Gradient Descent (SMBGD) only requires the cost function to be differentiable on the mini-batches it operates on, not across the full dataset at once.
In practice:
- The algorithm computes gradients based on small random subsets of the data.
- It updates model parameters using the approximate gradient from those subsets.
- Therefore, the cost function does not need to be differentiable everywhere across all data; it only needs to be differentiable at the sampled points.
This makes SMBGD more flexible and often more robust to non-smooth regions than full-batch gradient descent.
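As a hedged sketch of this point (a hypothetical example, not a general proof), the code below runs mini-batch updates on a mean-absolute-error loss, which is not differentiable where a residual is exactly zero; using a subgradient at the sampled points is enough for the updates to work in practice.

```python
import random

random.seed(2)
# Hypothetical data: 100 samples of y = 3x exactly
data = [(i / 50, 3 * (i / 50)) for i in range(1, 101)]

def sign(r):
    return (r > 0) - (r < 0)  # subgradient of |r|; defined as 0 at the kink

beta, eta = 0.0, 0.05
for _ in range(2000):
    batch = random.sample(data, 8)
    # MAE loss |y - beta*x| is not differentiable at residual 0,
    # but the subgradient above is usable at every sampled point
    g = sum(-x * sign(y - beta * x) for x, y in batch) / len(batch)
    beta -= eta * g
print(round(beta, 2))
```

Despite the kink in the loss, the mini-batch updates oscillate tightly around the true slope of 3.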