# <center><i class="fa fa-edit"></i> Machine Learning: Reducing Loss </center>
###### tags: `Internship`
:::info
**Goal:**
- [x] Meeting with Ren
- [x] Reducing Loss
**Resources:**
[Framing](https://developers.google.com/machine-learning/crash-course/framing/video-lecture)
[Descending Into ML](https://developers.google.com/machine-learning/crash-course/descending-into-ml/video-lecture)
[Machine Learning](https://hackmd.io/@Derni/HJQkjlnIP)
:::
### Meeting Notes
- Not really possible to trace BMI or height data continuously --> measuring every semester is the limit
- Considering t-tests, ANOVAs but need to discuss with Professor Ray
- Can try to navigate dds for LSTM predictions
- Check with Prof. Ray for other tasks
### Reducing Loss
- Hyperparameters: configuration settings used to tune how the model is trained
- The derivative of the squared loss $(y - y')^2$ with respect to the weights and biases tells us how the loss changes for a given example
    - Simple to compute and convex
- **Gradient descent**: repeatedly take small steps in the direction of the negative gradient to minimize loss (a sketch follows below)
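
A minimal sketch of gradient descent on the squared loss for a toy linear model $y' = w_1 x + b$; the data, starting values, and learning rate below are invented for illustration, not taken from the course:

```python
# Toy gradient descent on the squared loss L = (y - (w1*x + b))**2
# for a single example. All values here are made-up illustrations.

def gradients(x, y, w1, b):
    """Analytic gradient of (y - (w1*x + b))**2 w.r.t. w1 and b."""
    error = y - (w1 * x + b)      # prediction error
    dL_dw1 = -2.0 * x * error     # dL/dw1
    dL_db = -2.0 * error          # dL/db
    return dL_dw1, dL_db

w1, b = 0.0, 0.0                  # arbitrary starting weights
learning_rate = 0.05              # step size (a hyperparameter)
x, y = 2.0, 5.0                   # one training example

for step in range(50):
    dL_dw1, dL_db = gradients(x, y, w1, b)
    # Move in the direction of the negative gradient.
    w1 -= learning_rate * dL_dw1
    b -= learning_rate * dL_db

print(w1, b)  # w1*x + b should now be close to y
```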


- Weight initialization:
    - Convex problems: weights can start anywhere
        - Bowl-shaped: only 1 minimum
    - Non-convex problems: the choice of initial values is very important
        - Egg crate: more than 1 minimum
- Unnecessary to compute the gradient over the entire data set at each step
    - Can compute the gradient on small data samples instead
    - Each step uses a new random sample
- Stochastic Gradient Descent: one example at a time
- Mini-Batch Gradient Descent: batches of 10-1000 examples (see the sketch after this list)
    - Loss and gradients are averaged over the batch
- Convergence: keep trying new values for b and w1 until the overall loss stops changing (or changes extremely slowly)
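
A rough sketch of how the gradient estimate changes with batch size, using an invented toy data set and the same squared-loss linear model as above (`batch_gradient` and every value here are made up for illustration):

```python
import random

# Made-up toy data: y is roughly 3*x + 1 plus a little noise.
random.seed(0)
data = [(x, 3.0 * x + 1.0 + random.uniform(-0.5, 0.5)) for x in range(100)]

def batch_gradient(batch, w1, b):
    """Average gradient of the squared loss over a batch of (x, y) pairs."""
    dw1 = db = 0.0
    for x, y in batch:
        error = y - (w1 * x + b)
        dw1 += -2.0 * x * error
        db += -2.0 * error
    return dw1 / len(batch), db / len(batch)

w1, b = 0.0, 0.0

# Full batch: uses every example each step (accurate but expensive).
full = batch_gradient(data, w1, b)

# Mini-batch: a random sample of, say, 32 examples per step.
mini = batch_gradient(random.sample(data, 32), w1, b)

# Stochastic (batch size 1): a single random example -> noisy estimate.
sgd = batch_gradient([random.choice(data)], w1, b)

print(full, mini, sgd)  # similar directions, but increasing noise
```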

- Gradient Descent
    * A way to perform the "compute parameter updates" step
    * Assumes a convex problem with one minimum
    * Does NOT compute the loss for every possible w1 over the entire data set
    * Randomly pick a starting value for w1
    * Calculate the gradient of the loss curve at the starting point
    * Step in the direction of the negative gradient (largest possible decrease)
    * Learning rate: step size
        * Too small: takes too long to converge
        * Too large: may overshoot the minimum
        * Ideal (Goldilocks) learning rate depends on how flat the loss function is
            * Flatter = try a larger learning rate
        * Ideal learning rate (see the sketch below):
            * 1-D: inverse of the second derivative
            * 2-D or more: inverse of the Hessian matrix
    * Batch: number of examples used to calculate the gradient in a single iteration
        * If batch = entire data set: too inefficient, with too much redundancy
        * If batch = 1: very noisy gradients
        * Stochastic gradient descent (SGD) uses a batch size of just 1
        * Mini-batch SGD: batches of 10-1000 examples; efficient and reduces noise
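
A small worked example with an invented 1-D quadratic loss, $L(w) = (w - 3)^2$: its second derivative is the constant 2, so a learning rate of $1/2$ (the inverse of the second derivative) lands on the minimum in a single step, while a much smaller rate crawls and a too-large rate overshoots and diverges:

```python
# 1-D quadratic loss L(w) = (w - 3)**2, so dL/dw = 2*(w - 3)
# and the second derivative is the constant 2. Toy example only.

def descend(learning_rate, w=0.0, steps=10):
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)      # gradient at the current w
        w -= learning_rate * grad   # step toward the negative gradient
    return w

print(descend(0.05))  # too small: still far from the minimum at w = 3
print(descend(0.5))   # ideal rate = 1 / second derivative: reaches 3 in one step
print(descend(1.05))  # too large: overshoots more each step and diverges
```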