# CS 3120 Machine Learning HW1
Devon DeJohn, Spring 2020
## Imports
I used the `plotly` backend for my plotting:
```python
import numpy as np
import random as rand
import plotly.graph_objects as go
from plotly.subplots import make_subplots
```
## Loss function 'landscape'
I 'normalized' the error matrix $Z$ by its maximum value to scale the loss 'landscape' down to a more reasonable range for plotting:
```python
# accumulate the squared error for each (b, w) grid point over the training set
for i, b in enumerate(bb):
    for j, w in enumerate(ww):
        for x, y in zip(x_train, y_train):
            Z[i, j] += (w*x + b - y)**2

# 'normalize' Z by scaling its components by the maximum value found in Z
Z = Z / Z.max()
```
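For reference, the loop above assumes a parameter grid along these lines; the ranges and resolution here are placeholders rather than the exact values I used:
```python
# illustrative grid setup; the bb/ww ranges and resolution are placeholders
bb = np.linspace(0, 100, 100)       # candidate intercepts b
ww = np.linspace(-10, 10, 100)      # candidate slopes w
Z = np.zeros((len(bb), len(ww)))    # squared-error surface, Z[i, j] <-> (b, w)
```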
I also decided to combine this plot with the gradient descent path plot from exercise 6 into a single figure.
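The construction of that combined figure is roughly the following; the contour/scatter styling and trace names are illustrative, not my exact plotting code:
```python
# illustrative: loss 'landscape' as a contour with the descent path overlaid
fig = go.Figure()
fig.add_trace(go.Contour(x=ww, y=bb, z=Z, colorscale="Viridis", name="loss"))
fig.add_trace(go.Scatter(x=w_path, y=b_path, mode="lines+markers",
                         name="gradient descent path"))
fig.update_layout(xaxis_title="w", yaxis_title="b")
fig.show()
```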

## Gradient descent details (exercises 4, 5, and 6)
I used separate 'learning' rates for `w` and `b`. Here is the call signature for my gradient descent method:
```python
def grad(X, Y, w_lr=1e-3, b_lr=5e-2, tol=1e-5):
    ...
```
$X$ and $Y$ are vectors containing the training data. `w_lr` is the learning rate for gradient descent on the `w` parameter and defaults to $10^{-3}$; `b_lr` is the learning rate on the `b` parameter and defaults to $5\times 10^{-2}$.
The method also takes a tolerance parameter, `tol`, which defaults to $10^{-5}$.
I also implemented a very basic 'adaptive' rate control for the two 'learning' rates, which decreases them as the total error approaches zero. I'm not convinced this helps much with precision, and if I had the time, I would have liked to explore a more robust method for adaptive rate control.
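The idea amounts to something like the sketch below; the exact scaling inside `grad()` may differ, and `adaptive_rate` is an illustrative name rather than a function from my submission:
```python
def adaptive_rate(base_lr, total_error, initial_error):
    # illustrative sketch: shrink the rate in proportion to how far the
    # total error has fallen from its initial value (never increase it)
    return base_lr * min(1.0, total_error / initial_error)
```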
One issue I've found is that the choice of initial learning rate is *highly* sensitive to the amount of training data provided. Because the loss sums the squared error over every training point, more data means larger gradients, and therefore larger steps at the same rate. The default learning rates as I've set them introduce major instability in `grad()` when the training data is increased beyond the ten points asked for in exercise 1: the defaults are too 'fast', the steps taken become too large, and the result is wild oscillation and, often, floating-point overflow.
Obviously this wouldn't be an issue with 'slower' initial rates, but I wanted to push the iterative gradient descent to its limit to see how quickly I could get it to converge.
The initial values for `b` and `w` are randomized, with $b\in(1, 99]$ and $w\in(-9, 9]$. Some initial guesses converge in fewer than 30 iterations.
Each iteration checks whether the error term for `w` is within the specified tolerance. If so, we append the current value of `w` to `w_path`; if not, we update `w` according to
$$
w = w - \alpha\frac{\partial J}{\partial w}
$$
where $\alpha$ is the corresponding learning rate (`w_lr`). The same logic applies to the `b` parameter, using `b_lr`.
The method will exit when *both* `b` and `w` error terms are within the tolerance, *or* when `max_iter` is exceeded.
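Putting those pieces together, the descent loop behaves roughly like the sketch below. It is not my exact implementation: `grad_sketch` is an illustrative name, the per-parameter 'error' is taken to be the magnitude of the corresponding partial derivative, the `max_iter` default is a placeholder, and the adaptive rate control is left out.
```python
def grad_sketch(X, Y, w_lr=1e-3, b_lr=5e-2, tol=1e-5, max_iter=10_000):
    # random starting guesses, as described above
    w, b = rand.uniform(-9, 9), rand.uniform(1, 99)
    w_path, b_path = [w], [b]
    for _ in range(max_iter):
        # partial derivatives of J(w, b) = sum((w*x + b - y)^2)
        dJ_dw = sum(2 * x * (w*x + b - y) for x, y in zip(X, Y))
        dJ_db = sum(2 * (w*x + b - y) for x, y in zip(X, Y))
        w_done = abs(dJ_dw) < tol
        b_done = abs(dJ_db) < tol
        if not w_done:
            w -= w_lr * dJ_dw        # step w with its own learning rate
        if not b_done:
            b -= b_lr * dJ_db        # step b with its own learning rate
        w_path.append(w)
        b_path.append(b)
        if w_done and b_done:        # exit only when *both* are within tolerance
            break
    return w, b, w_path, b_path
```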
## Model performance
I created a set of testing data on an entirely different interval from the one used for the training data. I then compare the test data evaluated with the regression model against the original function $f(x) = 2x + 50$. I also plot the absolute error $|f(x) - \mathrm{reg}(x)|$ as it changes over the interval.
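Concretely, the comparison boils down to something like the following, using the `w` and `b` returned by the descent; the test interval below is a placeholder, not necessarily the one I used:
```python
# illustrative evaluation; the test interval is a placeholder
x_test = np.linspace(20, 40, 50)    # an interval disjoint from the training data
y_true = 2*x_test + 50              # the original function f(x) = 2x + 50
y_pred = w*x_test + b               # the fitted regression model reg(x)
abs_err = np.abs(y_true - y_pred)   # plotted over the test interval
```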
