This is a compilation of all topics covered in CS178 at UC Irvine (the intro machine learning course), taught by Alex Berg.
Machine learning is the process of learning from lots of data to make predictions about future, unseen data.
It's important to compute loss because the model can accumulate errors in its predictions. The most common errors are:

underfitting
: the model is too simple (or trained on too little data) to capture the underlying patterns in the features

overfitting
: the model fits the training data too closely and cannot generalize to new data

There are also several ways to compute loss depending on the problem:
MSE
is used to calculate the mean squared difference between the predicted and expected values and penalizes large error margins. (Regression)

Negative Log-Likelihood (NLL)
scores predictions based on the probability the model assigns to the correct label and penalizes low confidence. (Logistic Regression)

Cross-Entropy Loss
measures the difference between the predicted probability distribution and the true labels (how uncertain the model is). (Classification)

Hinge Loss
measures the margin between classes and tries to push them as far apart as possible. (SVM)

MSE is used to compute loss by summing the squared errors of all predicted values compared to the actual values. It is most commonly used for regression tasks.
Formula: $\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$
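A quick NumPy sketch of this computation (the numbers are made up for illustration):

```python
import numpy as np

# Hypothetical predicted and actual values (made-up numbers)
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

# MSE = mean of squared differences between predictions and true values
mse = np.mean((y_true - y_pred) ** 2)
print(mse)  # 0.875
```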
Gradient Descent (GD) is an iterative optimization algorithm that aims to minimize the error (loss function) and find the best model that represents the data. For a convex loss it converges to a single "global optimum" (meaning, the one absolute best model).

It does this iteratively by taking steps that reduce the cost function (ex: MSE), beginning with a random initialization.

Here is the general update rule, derived via the chain rule, that is applied at each step of the GD process. $\frac{\partial L}{\partial w}$ represents the weight adjustment while $\frac{\partial L}{\partial b}$ represents the bias adjustment. These derivatives, scaled by the learning rate $\alpha$, determine the size of the step at each iteration:

$$w \leftarrow w - \alpha \frac{\partial L}{\partial w}, \qquad b \leftarrow b - \alpha \frac{\partial L}{\partial b}$$
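To make the update rule concrete, here is a minimal gradient descent sketch for a 1-D linear model with MSE loss; the toy data, learning rate, and iteration count are arbitrary choices:

```python
import numpy as np

# Toy 1-D dataset (made up): y is roughly 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

w, b = 0.0, 0.0   # initial guesses for weight and bias
alpha = 0.05      # learning rate (step size)

for _ in range(1000):
    y_hat = w * x + b
    error = y_hat - y
    # Gradients of MSE with respect to w and b (chain rule)
    dw = 2 * np.mean(error * x)
    db = 2 * np.mean(error)
    # Step in the direction that decreases the loss
    w -= alpha * dw
    b -= alpha * db

print(w, b)  # approaches ~2 and ~1 for this toy data
```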
There are two main types of GD:
Stochastic Gradient Descent (SGD)
optimizes the prediction for a single datapoint (or small mini-batch) at a time and iterates over the dataset

Batch Gradient Descent
calculates the gradient based on the entire dataset before each update

Variance
is when an overly complex model learns the training data too well and picks up irrelevant patterns and noise

Bias
is the tendency of an overly simplified model to favor certain outcomes, making it more likely to make errors.
General ways to reduce bias/variance
VC Dimension measures a model's complexity and how powerful it is by measuring whether it can correctly separate every possible labeling of a set of data points (called "shattering"). If it can accurately separate these different labelings, we call the model highly expressive.

shattering
example below with different data points. Usually, a higher-dimensional (more complex) model can shatter more points.

Thus, the output of VC Dimension is the maximum number of points the model is able to shatter.
PCA (Principal Component Analysis) is a way to reduce the number of features by keeping only the most important directions of variation in the data. This reduces the complexity of the model.
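A minimal sketch using scikit-learn's PCA (the data here is random, purely for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

# Made-up dataset: 100 samples with 5 features
X = np.random.rand(100, 5)

# Keep only the 2 principal components that explain the most variance
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                 # (100, 2)
print(pca.explained_variance_ratio_)   # fraction of variance each component captures
```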
Now we will go into the main kinds of models and explain the relevant loss functions in more depth for each one.
There are two main ways to split a dataset and validate predictions:
Hold out
is when you split data into two discrete groups: training and testing. Usually an 80-20 split.

Cross validation
is when you split the dataset into different folds and iterate over it to assign different folds to "testing". See this section.
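A hold-out split is commonly done with scikit-learn's train_test_split; a small sketch with placeholder data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 samples, 3 features, binary labels
X = np.random.rand(100, 3)
y = np.random.randint(0, 2, size=100)

# 80-20 hold-out split: train on 80%, test on the held-out 20%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print(X_train.shape, X_test.shape)  # (80, 3) (20, 3)
```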
These models compute the probability distribution of an input datapoint across possible output labels.

Bayesian Classification is based on Bayes' Theorem and classifies objects based on posterior and prior probabilities. You most commonly use this for classifying things with probabilistic outcomes, like spam filtering and classifying words:

$$P(y \mid x) = \frac{P(x \mid y)\, P(y)}{P(x)}$$
Select this when:
There are both simple (Naive Bayes) and complex (Bayesian networks, Gaussian processes) variants.
A specific kind of Bayesian classifier is called the Naïve Bayes Classifier,
which is a simplified version of the classifier. This model assumes that all features are conditionally independent of each other given the class.

The Gaussian Bayes Classifier
is a variant of Naive Bayes that classifies continuous data by assuming the features follow a Gaussian (normal) distribution.
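A minimal Gaussian Naive Bayes sketch using scikit-learn; the spam-style features and labels are invented for illustration:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Invented continuous features (e.g. word frequencies) with binary labels
X = np.array([[1.0, 0.2], [0.9, 0.1], [0.2, 1.1], [0.1, 0.9]])
y = np.array([1, 1, 0, 0])  # 1 = spam, 0 = not spam (hypothetical)

# Fits one Gaussian per feature per class, then applies Bayes' theorem
model = GaussianNB()
model.fit(X, y)

print(model.predict([[0.95, 0.15]]))        # most likely class
print(model.predict_proba([[0.95, 0.15]]))  # posterior probabilities
```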
This is a type of learning algorithm that does not require an explicit training phase to make predictions. Instead, it compares new data points directly with the stored training data.
Nearest-Centroid takes the mean value of all datapoints in a class and assigns it as the centroid. It then compares new datapoints to these centroids to determine which class label they belong to.
Properties
K-nearest neighbor is a model that takes a datapoint and compares it to the K closest neighbors to determine the final class label.
It's a form of supervised learning
because all data used as neighbors come from the training data, where x is a feature vector and y is the actual output.
It can be used for either regression or classification tasks
The process is as follows:
Calculate the Euclidean distance between the point (p) and all other training points.
If there is more than one nearest neighbor (K > 1), take the majority vote of their class labels (or average their values for regression).
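A small sketch of this process using scikit-learn's KNeighborsClassifier with K = 3 (the toy points are invented):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy 2-D points with two class labels
X_train = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
y_train = np.array([0, 0, 0, 1, 1, 1])

# Euclidean distance is the default metric; majority vote over the 3 nearest neighbors
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

print(knn.predict([[2, 2], [7, 9]]))  # [0 1]
```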
Training MSE of 1-NN is always 0
because the nearest neighbor of any training point is always itself, so the prediction exactly matches the true label.

Test MSE of 1-NN is volatile because each prediction depends entirely on a single training point, so it is very sensitive to noise and to exactly which training point happens to be closest.

Here is a computational example of calculating loss:
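A rough sketch of that computation with invented numbers, showing that the training MSE of 1-NN is 0 while the test MSE is not:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Invented 1-D training data
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train = np.array([1.0, 2.5, 2.0, 4.0])

# Invented test data
X_test = np.array([[1.4], [3.6]])
y_test = np.array([2.0, 3.5])

knn = KNeighborsRegressor(n_neighbors=1)
knn.fit(X_train, y_train)

train_mse = np.mean((knn.predict(X_train) - y_train) ** 2)
test_mse = np.mean((knn.predict(X_test) - y_test) ** 2)

print(train_mse)  # 0.0 -- each training point is its own nearest neighbor
print(test_mse)   # 0.625 here -- nonzero and sensitive to which training point is closest
```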
Cross validation is the process of repeatedly splitting a dataset into training and test data (often an 80/20 split). This helps ensure the model is generalizable to new data and helps catch overfitting during the training process.
There are different ways to approach this, each with their own MSE values:
leave-one-out
removes one datapoint and uses it as the test data, retrains the model on the rest of the data, and then predicts the removed datapoint with the new model (repeating this for every datapoint).

k-fold cross validation
splits the dataset into k folds, trains on k-1 folds, and tests on the remaining fold, repeating so that each fold is used as the test set once.
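A minimal k-fold cross-validation sketch with scikit-learn (k = 5; the model and placeholder data are arbitrary choices):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Placeholder regression data
X = np.random.rand(100, 3)
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * np.random.randn(100)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_mses = []

for train_idx, test_idx in kf.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    # Each fold takes a turn as the test set
    fold_mses.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))

print(np.mean(fold_mses))  # average test MSE across the 5 folds
```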
These models are used when the relationship between the features and the output is linear and structured.

Unlike Linear Regression, Logistic Regression is a classification model that predicts the probability that a point belongs to a class, given a set of possible class labels.
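A quick sketch of the idea: the sigmoid squashes a linear score into a probability, and a threshold (0.5 here, as an assumption) turns that probability into a class label. The weights below are hypothetical:

```python
import numpy as np

def sigmoid(z):
    # Maps any real-valued score to a probability between 0 and 1
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical learned weights and bias for two features
w = np.array([1.5, -2.0])
b = 0.3

x = np.array([2.0, 0.5])    # a new datapoint
p = sigmoid(w @ x + b)      # P(class = 1 | x)
label = int(p >= 0.5)       # threshold at 0.5

print(p, label)
```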
Properties:
sigmoid activation function
squashes the output to a value between 0 and 1; a datapoint is predicted as 1 if it passes a threshold and 0 otherwise

Use cases:
Logistic Regression works best when data is linearly separable (it can achieve 0 loss), but when this is not the case, models such as SVMs (below) can handle it better.
An SVM is an optimization algorithm that finds the best hyperplane separating the data by maximizing the margin between classes.
Properties:
convex optimization
where there is only one optimal solution (the best hyperplane)

Kernels
allow SVMs to operate on higher-dimensional data by transforming nonlinear data into a linearly separable representation. However, kernels are computationally expensive and are best used on small datasets.
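A minimal sketch of an SVM with a kernel in scikit-learn; the RBF kernel choice, C value, and toy data are illustrative assumptions:

```python
import numpy as np
from sklearn.svm import SVC

# Toy 1-D data that is NOT linearly separable: the class depends on |x|
X = np.array([[-3], [-2], [-0.5], [0.0], [0.5], [2], [3]], dtype=float)
y = np.array([1, 1, 0, 0, 0, 1, 1])

# The RBF kernel implicitly maps the data into a higher-dimensional space
# where a maximum-margin separating hyperplane can be found.
clf = SVC(kernel="rbf", C=10.0)
clf.fit(X, y)

print(clf.predict([[-2.5], [0.2]]))  # should predict [1 0] for this toy data
```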
This is used for complex and multi-dimensional data.

Note: technically the perceptron is a linear model, but we include it under Neural Networks since it's more relevant here.
A perceptron is a simplified binary classification model that computes its output by taking a weighted sum of its inputs and passing it through an activation function.
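A tiny perceptron sketch; the weights, bias, and step activation are illustrative assumptions:

```python
import numpy as np

def perceptron(x, w, b):
    # Weighted sum of inputs plus bias, passed through a step activation
    z = np.dot(w, x) + b
    return 1 if z >= 0 else 0

# Hypothetical weights/bias implementing a simple AND-like rule
w = np.array([1.0, 1.0])
b = -1.5

print(perceptron(np.array([1, 1]), w, b))  # 1
print(perceptron(np.array([1, 0]), w, b))  # 0
```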
MIT explains Neural Networks as:
a means of doing machine learning, in which a computer learns to perform some task by analyzing training examples. Usually, the examples have been hand-labeled in advance.
Essentially, NNs find patterns in large amounts of data instead of having the features given to them. They can then use these patterns to generalize across new data.
The Universal Approximation Theorem
states that a NN with enough hidden nodes and layers can model any continuous function (it has a very high capability ceiling)
These are the steps of a NN. Usually, you begin the training process by feeding in lots of input data and checking that the output reflects the labels in the dataset.

During Feedforward (the forward pass):

During Backpropagation, the model learns from the mistakes made at the output by moving backwards through the NN and updating the weights of each node that contributed to the error. It uses gradient descent (e.g. SGD) to apply these updates.
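A rough numeric sketch of one feedforward + backpropagation step for a single sigmoid neuron with squared-error loss (all numbers here are made up; a real NN repeats this across many layers and nodes):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, y = 2.0, 1.0    # one input and its true label (made up)
w, b = 0.5, 0.0    # current weight and bias
alpha = 0.1        # learning rate

# Feedforward: compute the prediction and the loss
z = w * x + b
y_hat = sigmoid(z)
loss = (y_hat - y) ** 2

# Backpropagation: chain rule from the loss back to w and b
dloss_dyhat = 2 * (y_hat - y)
dyhat_dz = y_hat * (1 - y_hat)   # derivative of the sigmoid
dw = dloss_dyhat * dyhat_dz * x
db = dloss_dyhat * dyhat_dz

# Update the parameters that contributed to the error
w -= alpha * dw
b -= alpha * db
print(loss, w, b)
```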
Activation functions are used to apply nonlinearity on data which allows NN models to learn more complex patterns compared to linear models.
Some of the most popular activation functions include:
Sigmoid
is a continuous function that squashes the activation to a value between 0 and 1

ReLU
outputs 0 for negative values and the value itself for positive values.

Softmax
gives a probability distribution of an input over all class labels that sums to 1.
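Minimal NumPy versions of these three activation functions, written from their standard definitions (not course-provided code):

```python
import numpy as np

def sigmoid(z):
    # Squashes any value into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # 0 for negative inputs, the input itself for positive inputs
    return np.maximum(0, z)

def softmax(z):
    # Probability distribution over classes that sums to 1
    e = np.exp(z - np.max(z))  # subtract max for numerical stability
    return e / e.sum()

z = np.array([2.0, -1.0, 0.5])
print(sigmoid(z), relu(z), softmax(z), softmax(z).sum())
```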
This is an issue with the gradients during backpropagation. It happens when activation functions squash the gradient at each layer, so the gradients of the earlier (input-side) layers become extremely small.

This causes the early layers of the model to learn really slowly. To fix this, use activation functions like ReLU, which pass positive values through unchanged and prevent the gradients from shrinking to near zero.

Recall that the gradient represents the steepness of the loss curve.
Reinforcement learning is an area of ML (often implemented with NNs) in which an agent updates its actions iteratively based on new information and reward functions.
The Discount Factor (𝛾)
shows how much the agent values future vs immediate rewards
The State-Action Function
returns the expected return for taking an action in a given state. The goal is to learn the optimal state-action function that maximizes reward, which directly influences the policy function:

$$Q(s, a) = \mathbb{E}[\text{return} \mid s, a], \qquad \pi(s) = \arg\max_a Q(s, a)$$
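As one concrete (and much simplified) way to learn a state-action function, here is a tabular Q-learning update sketch; the state/action sizes, learning rate, and transition are placeholder assumptions:

```python
import numpy as np

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))   # Q(s, a): estimates of expected return

alpha = 0.1   # learning rate
gamma = 0.9   # discount factor: how much future rewards are valued vs. immediate ones

# One hypothetical transition: took action a in state s, got reward r, landed in s_next
s, a, r, s_next = 0, 1, 1.0, 2

# Q-learning update: move Q(s, a) toward the reward plus the discounted best future value
Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])

# The (greedy) policy picks the action with the highest Q value in each state
policy = np.argmax(Q, axis=1)
print(Q[s, a], policy)
```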
The Markov Model
is a simplified RL model that only computes the rewards between state transitions.
Decision trees are rule-based models that categorize data into discrete groups based on binary decisions at each node.
Properties:
computing decision boundary
The decision boundary is made of vertical and horizontal lines based on the depth of a tree. In the example below, there are 4 lines which reflect the height of the tree (4)
How does the model determine the best questions to ask at each node?
It uses entropy
which is a measure of how pure and homogeneous a dataset is. The more homogeneous the data becomes (i.e. all objects are classified under the same class label), the lower the entropy.

1
is the worst entropy value (for two classes) since it means that the data is divided between two classes with a 50/50 probability for each.
Here is the entropy formula:

$$H(S) = -\sum_{i} p_i \log_2 p_i$$
The goal of the decision tree is to minimize the entropy at each split. The information gain
(IG) is the reduction in entropy. We use this to determine the best attributes to classify the training data on at each step.
Here is the IG formula given feature $A$:

$$IG(S, A) = H(S) - \sum_{v \in \text{values}(A)} \frac{|S_v|}{|S|}\, H(S_v)$$
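A small sketch (with made-up labels) that computes entropy and the information gain for a hypothetical split:

```python
import numpy as np

def entropy(labels):
    # H(S) = -sum(p_i * log2(p_i)) over the class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# Made-up parent node labels and a candidate split into two children
parent = np.array([1, 1, 1, 0, 0, 0, 1, 0])   # 50/50 -> entropy = 1
left   = np.array([1, 1, 1, 1])               # pure  -> entropy = 0
right  = np.array([0, 0, 0, 0])               # pure  -> entropy = 0

# IG = parent entropy minus the size-weighted entropy of the children
weighted_children = (len(left) / len(parent)) * entropy(left) + (len(right) / len(parent)) * entropy(right)
ig = entropy(parent) - weighted_children
print(ig)  # 1.0 -- this split perfectly separates the classes
```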
The most efficient trees:
Ensemble models combine multiple models to form one better model.
There are two main types of ensemble learning:
Boosting
uses the loss output and weighted dataset of a preceding model and builds a new model

Bagging
divides the training data into different subsets and trains different submodels on them before combining them into a new one (see the sketch below).

bootstrapped sampling
meaning that subsets of data are chosen at random with replacement, allowing the same datapoint to be sampled multiple times – reducing overfitting

Properties:
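Here is the bagging sketch referenced above, using scikit-learn's BaggingClassifier (whose default submodel is a decision tree); the data is a placeholder:

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier

# Placeholder data: 200 samples, 4 features, binary labels
X = np.random.rand(200, 4)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

# Each of the 10 submodels is trained on a bootstrap sample (drawn with replacement)
bag = BaggingClassifier(n_estimators=10, bootstrap=True)
bag.fit(X, y)

print(bag.predict(X[:5]))  # combined (majority-vote) predictions
```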
Clustering algorithms are unsupervised learning algorithms that group data based on distance-based similarity. They assume the dataset is unlabeled and there is no ground truth.
Properties:
O(n^2)
time complexity (quadratic scaling as the dataset grows) due to the pairwise comparisons made with other data points

There are two main types of clustering:

Single Linkage
groups points one by one based on the next shortest distance to a point.

Complete Linkage
groups clusters of points based on the smallest MAXIMUM distance between points. This keeps the clusters compact.

K-Means is a specific type of clustering algorithm that works by the following:
Algorithm below:

1. Choose K and randomly initialize K centroids.
2. Assign each datapoint to its nearest centroid.
3. Recompute each centroid as the mean of the datapoints assigned to it.
4. Repeat steps 2-3 until the assignments no longer change.
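A minimal K-Means sketch using scikit-learn (K = 2 and the toy points are arbitrary choices):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two obvious groups of 2-D points (no labels -- clustering is unsupervised)
X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]], dtype=float)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(labels)                   # cluster assignment for each point
print(kmeans.cluster_centers_)  # the final centroids (means of each cluster)
```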
Here are the types of problems that each ML model attempts to solve. These are the main types of models we've covered:
Linear Models
: Linear Regression, Linear Classification,