# <center><i class="fa fa-edit"></i> Machine Learning: Reducing Loss </center>

###### tags: `Internship`

:::info
**Goal:**
- [x] Meeting with Ren
- [x] Reducing Loss

**Resources:**
[Framing](https://developers.google.com/machine-learning/crash-course/framing/video-lecture)
[Descending Into ML](https://developers.google.com/machine-learning/crash-course/descending-into-ml/video-lecture)
[Machine Learning](https://hackmd.io/@Derni/HJQkjlnIP)
:::

### Meeting Notes
- Not really possible to trace data with BMI or height --> measuring every semester is the limit
- Considering t-tests and ANOVAs, but need to discuss with Professor Ray
- Can try to navigate dds for LSTM predictions
- Check with Prof. Ray for other tasks

### Reducing Loss
- Hyperparameters: configuration settings used to tune how the model is trained
- Derivative of (y - y')^2 with respect to the weights and biases: shows how loss changes for a given example
    - Simple to compute and convex
- **Gradient descent**: repeatedly take small steps in the direction that minimizes loss (see the sketch at the end of these notes)

![](https://i.imgur.com/zmUMYCl.png)
![](https://i.imgur.com/C4YHQ55.png)

- Weight initialization:
    - Convex: weights can start anywhere
        - Bowl-shaped: only 1 minimum
    - Non-convex: the selection of the initial value matters
        - Egg crate: more than 1 minimum
- Unnecessary to compute the gradient over the entire data set at each step
    - Can compute the gradient on small data samples instead
    - Each step gets a new random sample
    - Stochastic Gradient Descent: one example at a time
    - Mini-Batch Gradient Descent: batches of 10-1000
        - Loss and gradients are averaged over the batch
- Convergence: keep trying new values for b and w1 until loss is minimized and stops changing

![](https://i.imgur.com/OcQKvPo.png)

- Gradient Descent
    * One way to perform the "compute parameter updates" step
    * Assumes a convex problem with one minimum
    * Does NOT compute the loss for every possible w1 over the entire data set
    * Randomly pick a starting value for w1
    * Calculate the gradient of the loss curve at the starting point
    * Step in the direction of the negative gradient (largest possible decrease)
    * Learning rate: step size (see the learning-rate sketch below)
        * Too small: takes too long to converge
        * Too large: may overshoot the minimum
        * The ideal (Goldilocks) learning rate depends on how flat the loss function is
            * Flatter = can try a larger learning rate
        * Ideal learning rate:
            * 1-D: inverse of the second derivative
            * 2-D or more: inverse of the Hessian matrix
    * Batch: number of examples used to calculate the gradient in a single iteration
        * If batch = entire data set: too inefficient, with too much redundancy
        * If batch = 1: very noisy gradients
        * Stochastic gradient descent (SGD) uses a batch size of 1
        * Mini-batch SGD: batches of 10-1000; efficient and reduces noise (see the mini-batch sketch below)
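
A minimal sketch of the gradient-descent idea above, assuming a one-feature linear model y' = w1\*x + b with squared loss (y - y')^2; the function and variable names are illustrative placeholders, not from the crash course:

```python
# Minimal sketch of batch gradient descent for a one-feature linear model
# y' = w1 * x + b with squared loss (y - y')^2, averaged over the data.
# Names (w1, b, learning_rate) are illustrative placeholders.
import numpy as np

def gradient_step(x, y, w1, b, learning_rate):
    """One gradient-descent step: compute d(loss)/dw1 and d(loss)/db, move against them."""
    y_pred = w1 * x + b
    error = y_pred - y                      # (y' - y) for each example
    grad_w1 = np.mean(2 * error * x)        # derivative of mean squared loss w.r.t. w1
    grad_b = np.mean(2 * error)             # derivative of mean squared loss w.r.t. b
    # Step in the direction of the negative gradient.
    return w1 - learning_rate * grad_w1, b - learning_rate * grad_b

# Tiny synthetic data set: y = 3x + 1 plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 3 * x + 1 + rng.normal(scale=0.1, size=100)

w1, b = 0.0, 0.0                            # for a convex loss, the weights can start anywhere
for step in range(200):
    w1, b = gradient_step(x, y, w1, b, learning_rate=0.1)

print(w1, b)                                # ends up near w1 ~ 3, b ~ 1
```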
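
A small learning-rate sketch, assuming a made-up 1-D convex loss L(w) = (w - 5)^2 (gradient 2(w - 5), second derivative 2), to illustrate the "too small / too large / Goldilocks" behavior noted above; the values are illustrative only:

```python
# Effect of the learning rate on a 1-D convex loss L(w) = (w - 5)^2.
# The ideal 1-D learning rate is 1 / second derivative = 1/2 here.
def descend(learning_rate, steps=30, w=0.0):
    for _ in range(steps):
        grad = 2 * (w - 5)                 # dL/dw
        w = w - learning_rate * grad       # step against the gradient
    return w

print(descend(0.01))   # too small: still far from the minimum after 30 steps
print(descend(0.5))    # ideal here (1/2): jumps straight to the minimum at w = 5
print(descend(1.05))   # too large: overshoots each step and diverges
```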
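
A mini-batch SGD sketch under the same assumed linear model as the first example: each step draws a new random batch and averages the gradients over it; setting `batch_size=1` would reduce this to plain stochastic gradient descent. Names and data are illustrative placeholders:

```python
# Sketch of mini-batch SGD for y' = w1 * x + b with squared loss.
# Each step samples a fresh random batch; gradients are averaged over the batch.
import numpy as np

def minibatch_sgd(x, y, batch_size=32, learning_rate=0.1, steps=500, seed=0):
    rng = np.random.default_rng(seed)
    w1, b = 0.0, 0.0
    for _ in range(steps):
        idx = rng.choice(len(x), size=batch_size, replace=False)  # new random sample each step
        xb, yb = x[idx], y[idx]
        error = (w1 * xb + b) - yb
        grad_w1 = np.mean(2 * error * xb)   # gradient averaged over the batch
        grad_b = np.mean(2 * error)
        w1 -= learning_rate * grad_w1
        b -= learning_rate * grad_b
    return w1, b

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=1000)
y = 3 * x + 1 + rng.normal(scale=0.1, size=1000)
print(minibatch_sgd(x, y, batch_size=32))   # near (3, 1); batch_size=1 gives noisier steps
```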