---
title: Linear Neural Networks
tags: ml
---
### Linear Neural Networks
#### Linear regression
Assumptions
- The relationship between the independent variables $\mathbf{x}$ and the dependent variable $y$ is linear
- Any noise is well-behaved (following a Gaussian distribution)
$\hat{y} = w_1 x_1 + \cdots + w_d x_d + b$
- The $w$’s are the weights and $b$ is the bias (see the sketch below)
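A minimal NumPy sketch of this model (the names `linreg`, `X`, `w`, and `b` are illustrative): the prediction for every example at once is a single matrix-vector product.

```python
import numpy as np

def linreg(X, w, b):
    """Linear model: y_hat = Xw + b, computed for all rows of X at once."""
    return X @ w + b

# Toy usage: 5 examples with d = 3 features
X = np.random.randn(5, 3)
w = np.array([1.0, -2.0, 0.5])
b = 1.0
y_hat = linreg(X, w, b)  # shape (5,)
```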
#### Loss Function
- Quantifies the distance between the real and predicted values of the target
- The loss will usually be a non-negative number where smaller values are better and perfect predictions have a loss of 0
- Squared error is the most popular loss function in regression problems

    $l^{(i)}(\mathbf{w}, b) = \frac{1}{2}\left(\hat{y}^{(i)} - y^{(i)}\right)^2$
- Average the losses on the training set to measure the quality of a model on the entire dataset of $n$ examples (a code sketch follows)

    $L(\mathbf{w}, b) = \frac{1}{n} \sum_{i=1}^{n} l^{(i)}(\mathbf{w}, b) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{2}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right)^2$
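Both quantities as a short NumPy sketch, with `squared_loss` and `average_loss` as assumed helper names:

```python
import numpy as np

def squared_loss(y_hat, y):
    """Per-example squared error; the 1/2 factor simplifies the gradient."""
    return 0.5 * (y_hat - y) ** 2

def average_loss(y_hat, y):
    """Average the per-example losses over the whole dataset."""
    return squared_loss(y_hat, y).mean()

# Perfect predictions give a loss of 0
y = np.array([1.0, 2.0, 3.0])
print(average_loss(y, y))  # 0.0
```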
#### Analytic Solution
- Subsume the bias $b$ into the parameter $\mathbf{w}$ by appending a column of all ones to the design matrix
- Minimize $\|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2$, where $\mathbf{y}$ is the vector of observations, $\mathbf{X}$ is the design matrix of features, and $\mathbf{w}$ is the weight vector
- Take the derivative with respect to $\mathbf{w}$ and set it equal to zero, yielding the closed-form solution $\mathbf{w}^* = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}$
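A sketch of the analytic solution on toy data (in practice `np.linalg.lstsq` is numerically safer than forming the normal equations explicitly):

```python
import numpy as np

n, d = 100, 3
X = np.random.randn(n, d)
y = np.random.randn(n)

# Subsume the bias: append a column of all ones to the design matrix
X1 = np.hstack([X, np.ones((n, 1))])

# Setting the derivative of ||y - X1 w||^2 to zero gives the
# normal equations  X1^T X1 w = X1^T y
w_full = np.linalg.solve(X1.T @ X1, X1.T @ y)
w, b = w_full[:-1], w_full[-1]
```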
#### Minibatch Stochastic Gradient Descent
- Gradient descent
    - Iteratively reduce the error by updating the parameters in the direction that incrementally lowers the loss function
    - Requires taking the derivative of the loss function, which is an average of the losses computed on every single example in the dataset
    - Can be super slow
- Minibatch stochastic gradient descent
    - Sample a random minibatch of examples every time an update needs to be computed

- Algorithm
    - Initialize the values of the model parameters, typically at random
    - Iteratively sample random minibatches from the data, updating the parameters in the direction of the negative gradient (a code sketch follows below)
    - For quadratic losses and affine transformations, we can write it as

        $\mathbf{w} \leftarrow \mathbf{w} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \mathbf{x}^{(i)} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right)$

        $b \leftarrow b - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right)$

    - $|\mathcal{B}|$ is the number of examples in each minibatch (batch size)
    - $\eta$ is the learning rate
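A sketch of the full loop in NumPy (the function name `sgd_linreg` and its default hyperparameter values are illustrative), implementing the update above:

```python
import numpy as np

def sgd_linreg(X, y, lr=0.03, batch_size=10, num_epochs=3):
    """Minibatch SGD for a linear model with squared loss."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0  # initialize parameters (zeros suffice here: the loss is convex)
    for _ in range(num_epochs):
        idx = np.random.permutation(n)  # fresh random order each epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            err = X[batch] @ w + b - y[batch]        # w^T x^(i) + b - y^(i)
            w -= lr * X[batch].T @ err / len(batch)  # step along the negative gradient
            b -= lr * err.mean()
    return w, b
```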
- Hyperparameters
    - Parameters that are tunable but not updated in the training loop
    - ex: batch size, learning rate
- Hyperparameter tuning
    - Process by which hyperparameters are chosen
    - Typically requires adjusting them based on the results of the training loop as assessed on a separate validation set (see the sketch below)
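A naive grid-search sketch over these two hyperparameters, reusing the hypothetical `sgd_linreg` from the previous section and scoring each configuration on a held-out validation set (all data here is synthetic):

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(80, 3)), rng.normal(size=80)
X_val, y_val = rng.normal(size=(20, 3)), rng.normal(size=20)

def validation_loss(lr, batch_size):
    """Train with one hyperparameter setting, then score on validation data."""
    w, b = sgd_linreg(X_train, y_train, lr=lr, batch_size=batch_size)
    return 0.5 * ((X_val @ w + b - y_val) ** 2).mean()

# Try every (learning rate, batch size) pair; keep the one with lowest validation loss
grid = itertools.product([0.01, 0.03, 0.1], [8, 16, 32])
best_lr, best_bs = min(grid, key=lambda cfg: validation_loss(*cfg))
```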
#### Normal distribution and squared loss