This is a compilation of all topics covered in CS178 at UC Irvine (the intro machine learning course), taught by Alex Berg.
Machine learning is the process of learning from lots of data to make predictions about future, unseen data.
It's important to compute loss because the model can accumulate errors in its predictions. The most common errors are:

underfitting
: the model is too simple (or trained on too little data) to capture the underlying patterns in the features

overfitting
: the model fits the training data too closely and cannot generalize to new data

There are also several ways to compute loss depending on the problem:
MSE
is used to calculate the mean squared difference between the predicted and expected values and penalizes large error margins. (Regression)

Negative Log-Likelihood (NLL)
scores predictions based on the probability the model assigns to the correct label and penalizes low confidence. (Logistic Regression)

Cross-Entropy Loss
measures the difference between the predicted probability distribution and the true labels (how uncertain the model is). (Classification)

Hinge Loss
measures the margin between classes and tries to push them as far apart as possible. (SVM)

MSE is used to compute loss by summing the squared errors of all predicted values compared to the actual values. It is most commonly used for regression tasks.
Formula: $\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$
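A quick NumPy sketch of this computation (the numbers are made up for illustration):

```python
import numpy as np

# Hypothetical predicted and actual values (made-up numbers)
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

# MSE = mean of squared differences between predictions and true values
mse = np.mean((y_true - y_pred) ** 2)
print(mse)  # 0.875
```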
Gradient Descent (GD) is an iterative optimization algorithm that aims to minimize the error (loss function) and find the best model that represents the data. For a convex loss it converges to a single "global optimum" (meaning, the one absolute best model).

It does this iteratively by taking steps that reduce the cost function (ex: MSE), beginning with a random initialization.

Here is the general update rule, derived via the chain rule, that is applied at each step of the GD process. $\frac{\partial L}{\partial w}$ represents the weight adjustment while $\frac{\partial L}{\partial b}$ represents the bias adjustment. These derivatives, scaled by the learning rate $\alpha$, determine the size of the step at each iteration:

$$w \leftarrow w - \alpha \frac{\partial L}{\partial w}, \qquad b \leftarrow b - \alpha \frac{\partial L}{\partial b}$$
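To make the update rule concrete, here is a minimal gradient descent sketch for a 1-D linear model with MSE loss; the toy data, learning rate, and iteration count are arbitrary choices:

```python
import numpy as np

# Toy 1-D dataset (made up): y is roughly 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

w, b = 0.0, 0.0   # initial guesses for weight and bias
alpha = 0.05      # learning rate (step size)

for _ in range(1000):
    y_hat = w * x + b
    error = y_hat - y
    # Gradients of MSE with respect to w and b (chain rule)
    dw = 2 * np.mean(error * x)
    db = 2 * np.mean(error)
    # Step in the direction that decreases the loss
    w -= alpha * dw
    b -= alpha * db

print(w, b)  # approaches ~2 and ~1 for this toy data
```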
There are two main types of GD:
Stochastic Gradient Descent (SGD)
optimizes the prediction for a single datapoint (or small mini-batch) at a time and iterates over the dataset

Batch Gradient Descent
calculates the gradient based on the entire dataset before each update

Variance
is when an overly complex model learns the training data too well and picks up irrelevant patterns and noise

Bias
is the tendency of an overly simplified model to favor certain outcomes, making it more likely to make errors.
General ways to reduce bias/variance
VC Dimension measures a model's complexity and how powerful it is by measuring whether it can correctly separate every possible labeling of a set of data points (called "shattering"). If it can accurately separate these different labelings, we call the model highly expressive.

shattering
example below with different data points. Usually, a higher-dimensional (more complex) model can shatter more points.

Thus, the output of VC Dimension is the maximum number of points the model is able to shatter.
PCA (Principal Component Analysis) is a way to reduce the number of features by keeping only the most important directions of variation in the data. This reduces the complexity of the model.
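A minimal sketch using scikit-learn's PCA (the data here is random, purely for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

# Made-up dataset: 100 samples with 5 features
X = np.random.rand(100, 5)

# Keep only the 2 principal components that explain the most variance
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                 # (100, 2)
print(pca.explained_variance_ratio_)   # fraction of variance each component captures
```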
Now we will go into the main kinds of models and explain the relevant loss functions in more depth for each one.
There are two main ways to split a dataset and validate predictions:
Hold out
is when you split data into two discrete groups: training and testing. Usually an 80-20 split.

Cross validation
is when you split the dataset into different folds and iterate over it to assign different folds to "testing". See this section.
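A hold-out split is commonly done with scikit-learn's train_test_split; a small sketch with placeholder data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 samples, 3 features, binary labels
X = np.random.rand(100, 3)
y = np.random.randint(0, 2, size=100)

# 80-20 hold-out split: train on 80%, test on the held-out 20%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print(X_train.shape, X_test.shape)  # (80, 3) (20, 3)
```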
These models compute the probability distribution of an input datapoint across possible output labels.

Bayesian Classification is based on Bayes' Theorem and classifies objects based on posterior and prior probabilities. You most commonly use this for classifying things with probabilistic outcomes, like spam filtering and classifying words:

$$P(y \mid x) = \frac{P(x \mid y)\, P(y)}{P(x)}$$
Select this when:
There are both simple (Naive Bayes) and complex (Bayesian networks, Gaussian processes) variants.
A specific kind of Bayesian classifier is called the Naïve Bayes Classifier,
which is a simplified version of the classifier. This model assumes that all features are conditionally independent of each other given the class.

The Gaussian Bayes Classifier
is a variant of Naive Bayes that classifies continuous data by assuming the features follow a Gaussian (normal) distribution.
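A minimal Gaussian Naive Bayes sketch using scikit-learn; the spam-style features and labels are invented for illustration:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Invented continuous features (e.g. word frequencies) with binary labels
X = np.array([[1.0, 0.2], [0.9, 0.1], [0.2, 1.1], [0.1, 0.9]])
y = np.array([1, 1, 0, 0])  # 1 = spam, 0 = not spam (hypothetical)

# Fits one Gaussian per feature per class, then applies Bayes' theorem
model = GaussianNB()
model.fit(X, y)

print(model.predict([[0.95, 0.15]]))        # most likely class
print(model.predict_proba([[0.95, 0.15]]))  # posterior probabilities
```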
This is a type of learning algorithm that does not require an explicit training phase to make predictions. Instead, it compares new data points directly with the stored training data.
Nearest-Centroid takes the mean value of all datapoints in a class and assigns it as the centroid. It then compares new datapoints to these centroids to determine which class label they belong to.
Properties
K-nearest neighbor is a model that takes a datapoint and compares it to the K closest neighbors to determine the final class label.
It's a form of supervised learning
because all data used as neighbors come from the training data, where x is a feature vector and y is the actual output.
It can be used for either regression or classification tasks
The process is as follows:
Calculate the Euclidean distance between the point (p) and all other training points.
If there is more than one nearest neighbor (K > 1), take the majority vote of their class labels (or average their values for regression).
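A small sketch of this process using scikit-learn's KNeighborsClassifier with K = 3 (the toy points are invented):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy 2-D points with two class labels
X_train = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
y_train = np.array([0, 0, 0, 1, 1, 1])

# Euclidean distance is the default metric; majority vote over the 3 nearest neighbors
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

print(knn.predict([[2, 2], [7, 9]]))  # [0 1]
```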
Training MSE of 1-NN is always 0
because the nearest neighbor of any training point is always itself, so the prediction exactly matches the true label.

Test MSE of 1-NN is volatile because each prediction depends entirely on a single training point, so it is very sensitive to noise and to exactly which training point happens to be closest.

Here is a computational example of calculating loss:
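A rough sketch of that computation with invented numbers, showing that the training MSE of 1-NN is 0 while the test MSE is not:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Invented 1-D training data
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train = np.array([1.0, 2.5, 2.0, 4.0])

# Invented test data
X_test = np.array([[1.4], [3.6]])
y_test = np.array([2.0, 3.5])

knn = KNeighborsRegressor(n_neighbors=1)
knn.fit(X_train, y_train)

train_mse = np.mean((knn.predict(X_train) - y_train) ** 2)
test_mse = np.mean((knn.predict(X_test) - y_test) ** 2)

print(train_mse)  # 0.0 -- each training point is its own nearest neighbor
print(test_mse)   # 0.625 here -- nonzero and sensitive to which training point is closest
```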
Cross validation is the process of repeatedly splitting a dataset into training and test data (often an 80/20 split). This helps ensure the model is generalizable to new data and helps catch overfitting during the training process.
There are different ways to approach this, each with their own MSE values:
leave-one-out
removes one datapoint and uses it as the test data, retrains the model on the rest of the data, and then predicts the removed datapoint with the new model (repeating this for every datapoint).

k-fold cross validation
splits the dataset into k folds, trains on k-1 folds, and tests on the remaining fold, repeating so that each fold is used as the test set once.
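A minimal k-fold cross-validation sketch with scikit-learn (k = 5; the model and placeholder data are arbitrary choices):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Placeholder regression data
X = np.random.rand(100, 3)
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * np.random.randn(100)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_mses = []

for train_idx, test_idx in kf.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    # Each fold takes a turn as the test set
    fold_mses.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))

print(np.mean(fold_mses))  # average test MSE across the 5 folds
```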
These models are used when the relationship between the features and the output is linear and structured.

Unlike Linear Regression, Logistic Regression is a classification model that predicts the probability that a point belongs to a class, given a set of possible class labels.
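A quick sketch of the idea: the sigmoid squashes a linear score into a probability, and a threshold (0.5 here, as an assumption) turns that probability into a class label. The weights below are hypothetical:

```python
import numpy as np

def sigmoid(z):
    # Maps any real-valued score to a probability between 0 and 1
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical learned weights and bias for two features
w = np.array([1.5, -2.0])
b = 0.3

x = np.array([2.0, 0.5])    # a new datapoint
p = sigmoid(w @ x + b)      # P(class = 1 | x)
label = int(p >= 0.5)       # threshold at 0.5

print(p, label)
```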
Properties:
sigmoid activation function
squashes the output to a value between 0 and 1; a datapoint is predicted as 1 if it passes a threshold and 0 otherwise

Use cases:
Logistic Regression works best when data is linearly separable (it can achieve 0 loss), but when this is not the case, models such as SVMs (below) can handle it better.
An SVM is an optimization algorithm that finds the best hyperplane separating the data by maximizing the margin between classes.
Properties:
convex optimization
where there is only one optimal solution (the best hyperplane)

Kernels
allow SVMs to operate on higher-dimensional data by transforming nonlinear data into a linearly separable representation. However, kernels are computationally expensive and are best used on small datasets.
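A minimal sketch of an SVM with a kernel in scikit-learn; the RBF kernel choice, C value, and toy data are illustrative assumptions:

```python
import numpy as np
from sklearn.svm import SVC

# Toy 1-D data that is NOT linearly separable: the class depends on |x|
X = np.array([[-3], [-2], [-0.5], [0.0], [0.5], [2], [3]], dtype=float)
y = np.array([1, 1, 0, 0, 0, 1, 1])

# The RBF kernel implicitly maps the data into a higher-dimensional space
# where a maximum-margin separating hyperplane can be found.
clf = SVC(kernel="rbf", C=10.0)
clf.fit(X, y)

print(clf.predict([[-2.5], [0.2]]))  # should predict [1 0] for this toy data
```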
This is used for complex and multi-dimensional data.

Note: technically the perceptron is a linear model, but we include it under Neural Networks since it's more relevant here.
A perceptron is a simplified binary classification model that computes its output by taking a weighted sum of its inputs and passing it through an activation function.
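A tiny perceptron sketch; the weights, bias, and step activation are illustrative assumptions:

```python
import numpy as np

def perceptron(x, w, b):
    # Weighted sum of inputs plus bias, passed through a step activation
    z = np.dot(w, x) + b
    return 1 if z >= 0 else 0

# Hypothetical weights/bias implementing a simple AND-like rule
w = np.array([1.0, 1.0])
b = -1.5

print(perceptron(np.array([1, 1]), w, b))  # 1
print(perceptron(np.array([1, 0]), w, b))  # 0
```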
MIT explains Neural Networks as:
a means of doing machine learning, in which a computer learns to perform some task by analyzing training examples. Usually, the examples have been hand-labeled in advance.
Essentially, NNs find patterns in large amounts of data instead of having the features given to them. They can then use these patterns to generalize across new data.
The Universal Approximation Theorem
states that a NN with enough hidden nodes and layers can model any continuous function (it has a very high capability ceiling)
These are the steps of a NN. Usually, you begin the training process by feeding in lots of input data and checking that the output reflects the labels in the dataset.

During Feedforward (the forward pass):

During Backpropagation, the model learns from the mistakes made at the output by moving backwards through the NN and updating the weights of each node that contributed to the error. It uses gradient descent (e.g. SGD) to apply these updates.
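A rough numeric sketch of one feedforward + backpropagation step for a single sigmoid neuron with squared-error loss (all numbers here are made up; a real NN repeats this across many layers and nodes):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, y = 2.0, 1.0    # one input and its true label (made up)
w, b = 0.5, 0.0    # current weight and bias
alpha = 0.1        # learning rate

# Feedforward: compute the prediction and the loss
z = w * x + b
y_hat = sigmoid(z)
loss = (y_hat - y) ** 2

# Backpropagation: chain rule from the loss back to w and b
dloss_dyhat = 2 * (y_hat - y)
dyhat_dz = y_hat * (1 - y_hat)   # derivative of the sigmoid
dw = dloss_dyhat * dyhat_dz * x
db = dloss_dyhat * dyhat_dz

# Update the parameters that contributed to the error
w -= alpha * dw
b -= alpha * db
print(loss, w, b)
```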
Activation functions are used to apply nonlinearity on data which allows NN models to learn more complex patterns compared to linear models.
Some of the most popular activation functions include:
Sigmoid
is a continuous function that squashes the activation to a value between 0 and 1

ReLU
outputs 0 for negative values and the value itself for positive values.

Softmax
gives a probability distribution of an input over all class labels that sums to 1.
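Minimal NumPy versions of these three activation functions, written from their standard definitions (not course-provided code):

```python
import numpy as np

def sigmoid(z):
    # Squashes any value into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # 0 for negative inputs, the input itself for positive inputs
    return np.maximum(0, z)

def softmax(z):
    # Probability distribution over classes that sums to 1
    e = np.exp(z - np.max(z))  # subtract max for numerical stability
    return e / e.sum()

z = np.array([2.0, -1.0, 0.5])
print(sigmoid(z), relu(z), softmax(z), softmax(z).sum())
```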
This is an issue with the gradients during backpropagation. It happens when activation functions squash the gradient at each layer, so the gradients of the earlier (input-side) layers become extremely small.

This causes the early layers of the model to learn really slowly. To fix this, use activation functions like ReLU, which pass positive values through unchanged and prevent the gradients from shrinking to near zero.

Recall that the gradient represents the steepness of the loss curve.
Reinforcement learning is an area of ML (often implemented with NNs) in which an agent updates its actions iteratively based on new information and reward functions.
The Discount Factor (𝛾)
shows how much the agent values future vs immediate rewards
The State-Action Function
returns the expected return for taking an action in a given state. The goal is to learn the optimal state-action function that maximizes reward, which directly influences the policy function:

$$Q(s, a) = \mathbb{E}[\text{return} \mid s, a], \qquad \pi(s) = \arg\max_a Q(s, a)$$
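As one concrete (and much simplified) way to learn a state-action function, here is a tabular Q-learning update sketch; the state/action sizes, learning rate, and transition are placeholder assumptions:

```python
import numpy as np

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))   # Q(s, a): estimates of expected return

alpha = 0.1   # learning rate
gamma = 0.9   # discount factor: how much future rewards are valued vs. immediate ones

# One hypothetical transition: took action a in state s, got reward r, landed in s_next
s, a, r, s_next = 0, 1, 1.0, 2

# Q-learning update: move Q(s, a) toward the reward plus the discounted best future value
Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])

# The (greedy) policy picks the action with the highest Q value in each state
policy = np.argmax(Q, axis=1)
print(Q[s, a], policy)
```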
The Markov Model
is a simplified RL model that only computes the rewards between state transitions.
Decision trees are rule-based models that categorize data into discrete groups based on binary decisions at each node.
Properties:
computing decision boundary
The decision boundary is made of vertical and horizontal lines based on the depth of a tree. In the example below, there are 4 lines which reflect the height of the tree (4)
How does the model determine the best questions to ask at each node?
It uses entropy
which is a measure of how pure and homogeneous a dataset is. The more homogeneous the data becomes (i.e. all objects are classified under the same class label), the lower the entropy.

1
is the worst entropy value (for two classes) since it means that the data is divided between two classes with a 50/50 probability for each.
Here is the entropy formula:

$$H(S) = -\sum_{i} p_i \log_2 p_i$$
The goal of the decision tree is to minimize the entropy at each split. The information gain
(IG) is the reduction in entropy. We use this to determine the best attributes to classify the training data on at each step.
Here is the IG formula given feature $A$:

$$IG(S, A) = H(S) - \sum_{v \in \text{values}(A)} \frac{|S_v|}{|S|}\, H(S_v)$$
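A small sketch (with made-up labels) that computes entropy and the information gain for a hypothetical split:

```python
import numpy as np

def entropy(labels):
    # H(S) = -sum(p_i * log2(p_i)) over the class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# Made-up parent node labels and a candidate split into two children
parent = np.array([1, 1, 1, 0, 0, 0, 1, 0])   # 50/50 -> entropy = 1
left   = np.array([1, 1, 1, 1])               # pure  -> entropy = 0
right  = np.array([0, 0, 0, 0])               # pure  -> entropy = 0

# IG = parent entropy minus the size-weighted entropy of the children
weighted_children = (len(left) / len(parent)) * entropy(left) + (len(right) / len(parent)) * entropy(right)
ig = entropy(parent) - weighted_children
print(ig)  # 1.0 -- this split perfectly separates the classes
```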
The most efficient trees:
Ensemble models combine multiple models to form one better model.
There are two main types of ensemble learning:
Boosting
uses the loss output and weighted dataset of a preceding model and builds a new model

Bagging
divides the training data into different subsets and trains different submodels on them before combining them into a new one (see the sketch below).

bootstrapped sampling
meaning that subsets of data are chosen at random with replacement, allowing the same datapoint to be sampled multiple times – reducing overfitting

Properties:
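Here is the bagging sketch referenced above, using scikit-learn's BaggingClassifier (whose default submodel is a decision tree); the data is a placeholder:

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier

# Placeholder data: 200 samples, 4 features, binary labels
X = np.random.rand(200, 4)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

# Each of the 10 submodels is trained on a bootstrap sample (drawn with replacement)
bag = BaggingClassifier(n_estimators=10, bootstrap=True)
bag.fit(X, y)

print(bag.predict(X[:5]))  # combined (majority-vote) predictions
```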
Clustering algorithms are unsupervised learning algorithms that group data based on distance-based similarity. They assume the dataset is unlabeled and there is no ground truth.
Properties:
O(n^2)
time complexity (quadratic scaling as the dataset grows) due to the pairwise comparisons made with other data points

There are two main types of clustering:

Single Linkage
groups points one by one based on the next shortest distance to a point.

Complete Linkage
groups clusters of points based on the smallest MAXIMUM distance between points. This keeps the clusters compact.

K-Means is a specific type of clustering algorithm that works by the following:
Algorithm below:

1. Choose K and randomly initialize K centroids.
2. Assign each datapoint to its nearest centroid.
3. Recompute each centroid as the mean of the datapoints assigned to it.
4. Repeat steps 2-3 until the assignments no longer change.
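A minimal K-Means sketch using scikit-learn (K = 2 and the toy points are arbitrary choices):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two obvious groups of 2-D points (no labels -- clustering is unsupervised)
X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]], dtype=float)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(labels)                   # cluster assignment for each point
print(kmeans.cluster_centers_)  # the final centroids (means of each cluster)
```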
Here are the types of problems that each ML model attempts to solve. These are the main types of models we've covered:
Linear Models
: Linear Regression, Linear Classification,