# Week 11 and 12
## Question 1
For any linear regression model, the "best fit model" will have a cost (MSE) value of zero.
### Answer: False
- The Mean Squared Error (MSE) measures the average squared difference between the observed values ($y_i$) and the predicted values ($\hat{y}_i$) from the model:
$$
\text{MSE} = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2.
$$
- While the goal of linear regression is to minimize the MSE, an MSE of exactly zero is essentially unattainable on real-world data, for several reasons:
- **Noise in Data**: Real-world data inherently contains noise (e.g., measurement errors, unobserved variables). Even the "best" linear model cannot perfectly account for random fluctuations.
- *Example*: In the Boston suburb dataset, home values depend on factors beyond room count (e.g., location).
- **Model Limitations**: Linear regression assumes a linear relationship between features and the target. If the true relationship is nonlinear, the model will have residual error.
- *Example*: Exponential growth in home prices vs. room count.
- **Feature Incompleteness**: If the model lacks relevant features, it cannot fully explain the variation in the target.
- *Example*: Predicting home values solely from room count ignores critical features like crime rate or proximity to parks, leading to higher MSE.
- Hence the statement is false; see the sketch below.
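A minimal numerical check (synthetic data and NumPy only; the names here are hypothetical, not from the notebook): even the closed-form least-squares fit to noisy data leaves a positive MSE, roughly the noise variance.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.uniform(0, 10, n)
y = 3.0 * x + 2.0 + rng.normal(0.0, 1.0, n)  # true line plus Gaussian noise

# Closed-form least squares: beta = (X^T X)^{-1} X^T y
X = np.column_stack([np.ones(n), x])
beta = np.linalg.solve(X.T @ X, X.T @ y)

mse = np.mean((y - X @ beta) ** 2)
print(f"best-fit MSE: {mse:.3f}")  # ~1.0 (the noise variance), never 0
```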
---
## Question 2
For any model, the gradient descent algorithm converges to the global minimum of the cost function.
### Answer: False
- Gradient descent iteratively updates parameters (e.g., $\beta$ in linear regression) by moving in the direction of steepest descent of the cost function. However, convergence to the global minimum is not guaranteed for all models.
$$
\beta_{\text{new}} = \beta_{\text{old}} - \eta \nabla C(\beta).
$$
- Convergence can fail when:
1. **Non-Convex Cost**: If the cost function has multiple local minima (common in complex models like neural networks), gradient descent may converge to a local minimum instead of the global one.
- *Example*: The MSE for linear regression is convex (bowl-shaped), guaranteeing a single global minimum. But for models that are nonlinear in their parameters (e.g., neural networks), the cost surface may have multiple minima. (Note that polynomial regression is still linear in its coefficients, so its MSE stays convex.)
2. **Poor $\eta$**:
- Too large → The algorithm may overshoot the minimum and diverge.
- Too small → Progress becomes excessively slow, potentially stalling in a suboptimal region.
3. **Saddle Points**: In high-dimensional spaces, gradients can vanish near saddle points (flat regions), halting progress.
- Gradient descent excels on convex problems (like linear regression) but may fail on non-convex or high-dimensional cost surfaces. Hence the statement is false. (The sketch below shows gradient descent settling into a local minimum.)
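A minimal sketch of this failure mode, using a hypothetical non-convex 1-D cost (not from the notebook): gradient descent started in the wrong basin settles into the local minimum.

```python
# Toy non-convex cost: local minimum near x ≈ 1.13,
# deeper global minimum near x ≈ -1.30.
cost = lambda x: x**4 - 3 * x**2 + x
grad = lambda x: 4 * x**3 - 6 * x + 1

x, eta = 2.0, 0.01       # start in the basin of the LOCAL minimum
for _ in range(1000):
    x -= eta * grad(x)

print(f"converged to x = {x:.2f}, cost = {cost(x):.2f}")  # x ≈ 1.13, not -1.30
# Starting from x = -2.0 instead would reach the deeper global minimum.
```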
---
## Question 3
For the quadratic cost function in the notebook, the gradients have higher magnitudes further away from the global minimum. Therefore, updates in gradient descent become less drastic closer to the global minimum.
### Answer: True
- MSE gradient:
$$
\nabla C(\beta) = -\frac{2}{n}X^T(y - X\beta).
$$
- **Far from minimum**: Large residuals → steep gradient $\nabla C(\beta)$ (large magnitude) → large steps.
- **Near minimum**: Small residuals → flat gradient → tiny steps.
- The notebook’s contour plots show this clearly: gradient arrows shrink near the center (the global minimum).
- Quadratic cost functions naturally enforce larger updates far from the optimum and finer adjustments near convergence. Hence the statement is true. (See the sketch below.)
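A small sketch (synthetic data, hypothetical parameters) that prints the gradient norm along a gradient-descent trajectory; the norm shrinks as $\beta$ approaches the optimum, so the steps $\eta \nabla C(\beta)$ shrink too.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
y = X @ np.array([1.5, -0.5]) + rng.normal(0.0, 0.1, 50)

beta, eta = np.zeros(2), 0.05
for step in range(6):
    grad = -(2 / len(y)) * X.T @ (y - X @ beta)  # MSE gradient
    print(f"step {step}: |grad| = {np.linalg.norm(grad):.4f}")  # norm shrinks
    beta -= eta * grad
```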
---
## Question 4
Given some cost function, it is possible for the learning rate ($\eta$) to be too large. That is, if this value is too large, we can only "bounce around" past the minimum of the cost function, never making any progress.
### Answer: True
The learning rate $\eta$ scales the gradient step:
$$
\beta_{\text{new}} = \beta_{\text{old}} - \eta \nabla C(\beta).
$$
1. **Overshooting**
- Steps exceed the minimum → Oscillations (e.g., zig-zag in parameter space).
- *Example*: If $\eta = 1.0$ (vs. the notebook's $\eta = 0.02$), updates diverge.
2. **Divergence**
- For MSE, $\eta \geq 2/\lambda_{\text{max}}$ causes instability, where $\lambda_{\text{max}}$ is the largest eigenvalue of the Hessian $\frac{2}{n}X^TX$.
- A Taylor expansion shows the cost may *increase* if $\eta$ exceeds $2/L$, where $L$ is the Lipschitz constant of the gradient.
3. **Trade-off**
- Small $\eta$: Slow convergence (e.g., the notebook's "diminishing returns" phase).
- $\eta$ must balance speed and precision: too large prevents convergence, too small wastes compute. Hence the statement is true. (The sketch below contrasts three step sizes.)
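The following sketch contrasts three step sizes on the hypothetical 1-D quadratic $C(b) = b^2$, where $\nabla C = 2b$, the Hessian is the constant $2$, and the stability threshold is therefore $\eta < 1$.

```python
# Gradient descent on C(b) = b**2; the update is b <- b * (1 - 2*eta),
# so |b| shrinks iff eta < 1 (the 2/lambda_max threshold here).
for eta in (0.02, 0.9, 1.1):
    b = 5.0
    for _ in range(30):
        b -= eta * 2 * b
    print(f"eta={eta}: b after 30 steps = {b:.3g}")
# eta=0.02 -> shrinks slowly and steadily toward 0
# eta=0.9  -> overshoots and oscillates in sign, but still converges
# eta=1.1  -> |b| grows every step: divergence
```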
---
## Question 5
"In Stochastic Gradient Descent, we make updates to β using the gradients calculated from the entire dataset, at each epoch."
### Answer: False
- **Batch Gradient Descent**:
- Computes gradient using ALL training data:
$$\nabla C(\beta) = \frac{1}{n}\sum_{i=1}^n \nabla C_i(\beta)$$
- for Example: Notebook's "Basic Gradient Descent" uses full dataset for 15K epochs
- **Stochastic Gradient Descent (SGD)**:
- Uses **one random sample** (or mini-batch) per update:
$$\nabla C(\beta) \approx \nabla C_j(\beta) \quad \text{(for randomly chosen } j\text{)}$$
- for Example: Notebook's "Stochastic Mini-Batch GD" shows noisy cost curves from batch-wise updates
- **Speed**: SGD avoids expensive full-dataset computations.
- **Noise**: Single-sample gradients are "noisy" but help escape local minima.
- Hence the statement is false: SGD updates $\beta$ from one sample (or a small mini-batch) per step, not the entire dataset. (A side-by-side sketch follows.)
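A side-by-side sketch of the two update rules (synthetic data; the array names and sample index are hypothetical): batch GD touches all $n$ rows per update, SGD touches one.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(506, 3))      # n = 506, as in the Boston data
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(0.0, 0.5, 506)
beta = np.zeros(3)

# Batch GD: one update uses ALL n samples
grad_full = -(2 / len(y)) * X.T @ (y - X @ beta)

# SGD: one update uses a SINGLE randomly chosen sample j
j = rng.integers(len(y))
grad_sgd = -2 * X[j] * (y[j] - X[j] @ beta)

print(grad_full)  # exact gradient
print(grad_sgd)   # noisy single-sample estimate of it
```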
---
## Question 6
In Stochastic Mini-Batch Gradient Descent, the gradient calculation will always involve fewer polynomial terms than with vanilla Gradient Descent.
### Answer: True
1. **Vanilla GD**:
For MSE with $n$ samples:
$$\nabla C(\beta) = -\frac{2}{n}X_{\text{full}}^T(y - X_{\text{full}}\beta)$$
- Requires $O(n)$ operations per update (sum over all $n$ samples).
2. **Mini-Batch GD**:
With batch size $m \ll n$:
$$\nabla C(\beta) \approx -\frac{2}{m}X_{\text{batch}}^T(y_{\text{batch}} - X_{\text{batch}}\beta)$$
- Only $O(m)$ operations per update.
- In the notebook, mini-batches of size $m=4$ (see "n_batches=fixed(4)") reduce computation vs. the full batch ($n=506$ for the Boston data).
- Fewer terms → Less memory/I/O → Better hardware utilization.
- Hence the statement is true. (A minimal mini-batch sketch follows.)
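A minimal mini-batch sketch under the same assumptions (synthetic data; batch size $m = 4$ as in the text): the gradient sum has $m$ terms instead of $n$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 506, 4                               # full dataset vs. mini-batch size
X = rng.normal(size=(n, 2))
y = X @ np.array([2.0, 1.0]) + rng.normal(0.0, 0.3, n)
beta = np.zeros(2)

idx = rng.choice(n, size=m, replace=False)  # draw one mini-batch
Xb, yb = X[idx], y[idx]
grad = -(2 / m) * Xb.T @ (yb - Xb @ beta)   # sum over m terms, not n
beta -= 0.02 * grad
```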
---
## Question 7
"Decreasing the learning rate (η) will typically increase the amount of time it takes for the algorithm to reach optimization."
### Answer: True
**Role of $\eta$**:
Controls step size in parameter updates:
$$\beta_{\text{new}} = \beta_{\text{old}} - \eta \nabla C(\beta)$$
| Learning Rate | Pros | Cons |
|--------------|------|------|
| **Large $\eta$** | Fast initial progress | Risk of overshooting/divergence |
| **Small $\eta$** | Stable convergence | Slower progress (more epochs needed) |
- With the notebook's $\eta=0.02$, convergence takes 15K epochs.
- If $\eta=0.002$, convergence might require 150K epochs (10× slower).
- For convex functions, convergence requires $\eta < 2/L$ (where $L$ is Lipschitz constant). Smaller $\eta$ guarantees convergence but increases time.
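A sketch on the hypothetical quadratic $C(b) = b^2$ that counts iterations until $|b| < 10^{-6}$; a 10× smaller $\eta$ needs roughly 10× more steps.

```python
def steps_to_converge(eta, tol=1e-6):
    """Gradient descent on C(b) = b**2; count steps until |b| < tol."""
    b, steps = 5.0, 0
    while abs(b) > tol:
        b -= eta * 2 * b   # gradient of b**2 is 2*b
        steps += 1
    return steps

print(steps_to_converge(0.02))   # 378 steps
print(steps_to_converge(0.002))  # 3849 steps: roughly 10x slower
```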
---
## Question 8
"Stochastic Mini-Batch Gradient Descent requires the cost function of the full data set to be differentiable at every point."
### Answer: False
- **Full Differentiability**: Needed for theoretical guarantees in *Batch GD* (global gradients must exist).
- **Mini-Batch SGD**: Only requires differentiability *within batches*:
- Even if cost is non-differentiable at some points (e.g., ReLU activation), batches may avoid them.
- The notebook uses MSE (differentiable everywhere), but SGD would work even with piecewise costs.
- Real-world data often has outliers/irregularities.
- Mini-batches "smooth" the cost landscape, making optimization feasible.
- A cost function non-differentiable at 50% of points could still be optimized via SGD if batches avoid those points.
- Hence the statement is false. (The sketch below runs stochastic updates on a non-differentiable cost.)
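As a sketch of this point (synthetic data; mean absolute error swapped in for MSE as the non-differentiable cost), stochastic subgradient steps still make progress even though $|r|$ has a kink at $r = 0$.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 2))
y = X @ np.array([1.0, -2.0]) + rng.normal(0.0, 0.2, 200)

# The MAE cost |y_j - x_j @ beta| is non-differentiable wherever a
# residual is exactly 0; np.sign supplies a valid subgradient at the kink.
beta = np.zeros(2)
for epoch in range(200):
    for j in rng.permutation(len(y)):
        r = y[j] - X[j] @ beta
        beta += 0.001 * np.sign(r) * X[j]   # subgradient step for |r|

print(beta)  # approximately [1.0, -2.0]
```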
---