
CS 178: Machine Learning Final Review

This is a compilation of all topics covered in CS 178 at UC Irvine (the intro machine learning course), taught by Alex Berg.

Table of contents

  • What is machine learning?
  • 0. Some basics
  • 1. Probabilistic Models
  • 2. Instance-Based Learning
  • 3. Linear Models
  • 4. Deep Learning
  • 6. Decision Trees
  • 7. Hierarchical Clustering
  • Applications of ML in the real world

What is machine learning?

Machine learning is the process of learning from lots of data to make predictions about future, unseen data.


0. Some basics

Loss

It's important to compute loss because it quantifies how much error the model makes in its predictions. The most common failure modes are:

  • underfitting: the model is too simple (or has too little training data) to capture the important patterns in the features
  • overfitting: the model fits the training data too closely and cannot generalize to new data


There are also several ways to compute loss depending on the problem:

  • MSE (Mean Squared Error) measures the average squared difference between the predicted and expected values, so it penalizes large error margins heavily. (Regression)
  • Negative Log-Likelihood (NLL) scores the probability the model assigns to the correct label and penalizes low confidence in the true class. (Logistic Regression)
  • Cross-Entropy Loss compares the predicted probability distribution with the true label distribution (how uncertain the model is about the correct class). (Classification)
  • Hinge Loss measures the margin between classes and tries to push it as wide as possible. (SVM)

Mean Squared Error (MSE)

MSE computes loss by averaging the squared errors between the predicted values and the actual values. It is most commonly used for regression.

Formula:

$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$
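
As a quick illustration, here is a minimal NumPy sketch of this formula (the arrays are made-up example values):

```python
import numpy as np

# hypothetical true values and model predictions
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

# MSE = (1/n) * sum((y_i - y_hat_i)^2)
mse = np.mean((y_true - y_pred) ** 2)
print(mse)  # 0.375
```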

Gradient Descent

Gradient Descent (GD) is an iterative optimization algorithm that aims to minimize the error (loss function) and find the best model parameters for the data. When the loss function is convex, it converges to a single global optimum (the one absolute best model).

It does this iteratively by taking steps that reduce the cost function (e.g., MSE), beginning from a random initial value.

Here is the general update rule applied at each step of the GD process. $w$ represents the weight update and $b$ the bias update; the partial derivatives set the size of the step at each iteration ($\alpha$ is the learning rate):

$w = w - \alpha \frac{\partial J(w,b)}{\partial w}, \qquad b = b - \alpha \frac{\partial J(w,b)}{\partial b}$
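
For concreteness, here is a minimal sketch of the update applied to a 1-D linear model $\hat{y} = wx + b$ with MSE as the cost $J$ (the data, learning rate, and iteration count are made up for the example):

```python
import numpy as np

def gd_step(w, b, x, y, alpha=0.05):
    """One gradient descent update for y_hat = w*x + b under MSE loss."""
    y_hat = w * x + b
    # partial derivatives of J(w, b) = mean((y - y_hat)^2)
    dJ_dw = np.mean(-2 * x * (y - y_hat))
    dJ_db = np.mean(-2 * (y - y_hat))
    # step against the gradient to decrease the loss
    return w - alpha * dJ_dw, b - alpha * dJ_db

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])   # true relationship is y = 2x
w, b = 0.0, 0.0                 # start from an arbitrary initialization
for _ in range(1000):
    w, b = gd_step(w, b, x, y)
print(w, b)                     # w approaches 2, b approaches 0
```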

There are two main types of GD:

  • Stochastic Gradient Descent (SGD) updates the parameters using one datapoint (or a small mini-batch) at a time and iterates over the dataset
  • Batch Gradient Descent computes each update using the entire dataset
  • Batch GD is more computationally expensive since every update requires a full pass over the data
  • SGD makes faster (but noisier) updates and scales better to large datasets


Variance-Bias Tradeoff

Variance describes how sensitive a model is to its particular training set: an overly complex model fits irrelevant patterns and noise instead of the underlying signal

  • leads to overfitting because it learns the training data too well and cannot generalize
  • example: a decision tree that learns every aspect of your home's interior but only needed to classify it as a house or not.

Bias is the tendency of an overly simplified model to make systematic errors because its assumptions cannot capture the true relationship in the data.

  • leads to underfitting because it cannot capture all the relevant features of the data
  • example: trying to classify patterns in admissions to a university (many features) with a linear regression model (simplified)


General ways to reduce bias/variance

  • increasing the amount of training data usually reduces variance by smoothing out noise in the learned decision boundary
  • hand-picking (selecting) a smaller set of features reduces variance by forcing the model to learn broader patterns
  • using a more complex or expressive model reduces bias

VC Dimensions

VC Dimension measures a model's complexity and expressive power by how many points it can "shatter": a model shatters a set of points if it can correctly separate every possible labeling of those points.

A higher VC dimension means a more expressive model, but one that typically needs more data to generalize well.


Thus, the output of the VC dimension analysis is the number of points the model is able to shatter.

  • ex: a linear classifier in 2D has VC = 3 because it can separate every labeling of any three (non-collinear) points but not of four

Principal Component Analysis (PCA)

PCA is a way to reduce the number of features by projecting the data onto a smaller set of directions (principal components) that capture the most variance. This reduces the complexity of the model.
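
A minimal sketch using scikit-learn's PCA (the random data is only there to show the shape change):

```python
import numpy as np
from sklearn.decomposition import PCA

# hypothetical dataset: 100 samples, 10 features
X = np.random.randn(100, 10)

# keep only the 2 directions that capture the most variance
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (100, 2)
print(pca.explained_variance_ratio_)  # fraction of variance kept per component
```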

Now we will go into the main kinds of models and explain the relevant loss functions in more depth for each one.

Training-Test data

There are two main ways to split a dataset and validate predictions:

  • Hold-out is when you split the data into two discrete groups, training and testing, usually with an 80/20 split (see the sketch below).
  • Cross validation is when you split the dataset into different folds and iterate over them, assigning a different fold to "testing" each time. See this section.
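
A minimal NumPy sketch of a hold-out (80/20) split on a made-up dataset:

```python
import numpy as np

X = np.random.randn(100, 5)        # hypothetical features
y = np.random.randint(0, 2, 100)   # hypothetical labels

# shuffle the indices, then use the first 80% for training and the rest for testing
idx = np.random.permutation(len(X))
split = int(0.8 * len(X))
train_idx, test_idx = idx[:split], idx[split:]

X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]
print(X_train.shape, X_test.shape)  # (80, 5) (20, 5)
```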

1. Probabilistic Models

These models compute the probability distribution of an input datapoint across possible output labels.

Bayesian Classifiers

Bayesian Classification is based on Bayes' Theorem and classifies objects using posterior and prior probabilities. You most commonly use this for classifying things with probabilistic outcomes, like spam filtering and classifying words:

$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$


Select this when:

  • There are a lot of features
  • Not a lot of values in the dataset
  • The goal is to output a simple categorical answer (0/1)

There are both simple variants (Naive Bayes) and more complex ones (Bayesian networks, Gaussian processes).

A specific kind of Bayesian classifier is the Naïve Bayes Classifier, a simplified version that assumes all features are conditionally independent of each other given the class.

The Gaussian Naive Bayes Classifier is a variant of Naive Bayes for continuous data, assuming each feature follows a Gaussian distribution within each class.
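
A minimal sketch using scikit-learn's GaussianNB (the tiny labeled dataset is made up):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# hypothetical continuous features with binary labels (e.g., spam vs. not spam)
X = np.array([[1.0, 2.1], [1.2, 1.9], [3.8, 4.0], [4.1, 3.9]])
y = np.array([0, 0, 1, 1])

model = GaussianNB()
model.fit(X, y)

print(model.predict([[1.1, 2.0]]))        # expected: class 0
print(model.predict_proba([[1.1, 2.0]]))  # posterior P(class | features)
```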

2. Instance-Based Learning

This is a type of learning algorithm that does not build an explicit model during training. Instead, it compares new data points directly with the stored training data.

Nearest-Centroid Classifier

Nearest-Centroid takes the mean of all datapoints in a class and uses it as that class's centroid. It then compares new datapoints to these centroids to determine which class label each belongs to.

Properties

  • High bias since it doesn't account for local variations within a class
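
A minimal NumPy sketch of the nearest-centroid rule (the data is made up):

```python
import numpy as np

def fit_centroids(X, y):
    # one centroid per class: the mean of that class's points
    classes = np.unique(y)
    return classes, np.array([X[y == c].mean(axis=0) for c in classes])

def predict_nearest_centroid(x, classes, centroids):
    # assign the label of the closest centroid
    dists = np.linalg.norm(centroids - x, axis=1)
    return classes[np.argmin(dists)]

X = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [6.0, 6.0]])
y = np.array([0, 0, 1, 1])
classes, centroids = fit_centroids(X, y)
print(predict_nearest_centroid(np.array([0.8, 0.9]), classes, centroids))  # 0
```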

K-Nearest-Neighbor Classifier

K-nearest neighbor is a model that takes a datapoint and compares it to K number of neighbors that are closest in proximity to determine the final class label.

It's a form of supervised learning because the neighbors come from labeled training pairs $(x_i, y_i)$, where $x_i$ is a feature vector and $y_i$ is the actual output.

It can be used for either regression or classification tasks

The process is as follows:

  1. Calculate the Euclidean distance between the point $p$ and all other points $q_i$:

    $d(p,q) = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2 + \cdots + (q_n - p_n)^2}$

  2. If there is more than one neighbor ($k > 1$):

    • for classification tasks: take the majority vote
    • for regression tasks: take the average
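
A minimal NumPy sketch of k-NN classification following the steps above (the tiny dataset is made up; for regression you would average the neighbors' values instead of voting):

```python
import numpy as np
from collections import Counter

def knn_predict(x_new, X_train, y_train, k=3):
    # 1. Euclidean distance from x_new to every training point
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # 2. majority vote among the k closest points
    nearest = np.argsort(dists)[:k]
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(np.array([0.5, 0.5]), X_train, y_train, k=3))  # 0
```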


Training MSE of 1-NN is always 0 because the nearest neighbor of any training point is always itself ($\hat{y}_i = y_i$).
Test MSE of 1-NN can be volatile because each prediction depends entirely on the single closest training point.


Cross Validation

Cross validation is the process of repeatedly splitting a dataset into training and validation portions so that every datapoint is used for validation at some stage. This helps ensure the model generalizes to new data and guards against overfitting during the training process.

There are different ways to approach this, each with their own MSE values:

  • leave-one-out removes one datapoint to use as the test data, trains on the rest, and then predicts the held-out datapoint with the new model; this is repeated for every datapoint and the errors are averaged.
  • k-fold cross validation splits the dataset into k folds, trains on k-1 folds, and validates on the remaining fold, rotating so that each fold is used for validation once (see the sketch below).
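
A minimal sketch of k-fold cross validation using scikit-learn's KFold (the data and choice of model are placeholders):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

X = np.random.randn(100, 4)        # hypothetical features
y = np.random.randint(0, 2, 100)   # hypothetical labels

scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True).split(X):
    model = KNeighborsClassifier(n_neighbors=3)
    model.fit(X[train_idx], y[train_idx])                 # train on k-1 folds
    scores.append(model.score(X[test_idx], y[test_idx]))  # validate on the held-out fold
print(np.mean(scores))  # average accuracy across the folds
```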

3. Linear Models

These models are used when the relationship between the features and the output is (approximately) linear and structured.

Linear Regression

Linear Classification

Logistic Regression

Unlike Linear Regression, Logistic Regression solves a classification problem: it predicts the probability that a point belongs to a class, given a set of possible class labels.


Properties:

  • the sigmoid activation function maps the weighted sum of features to a probability between 0 and 1, which is thresholded (commonly at 0.5) to output a 0 or 1
  • Assumes that all features are independent of each other
  • Supervised learning with a given set of class labels

Use cases:

  • probability of spam
  • probability of a specific disease
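
A minimal sketch with scikit-learn's LogisticRegression on a made-up one-feature dataset:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])  # hypothetical single feature
y = np.array([0, 0, 0, 1, 1, 1])                          # class labels

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([[2.0]]))  # [P(class 0), P(class 1)] for a new point
print(clf.predict([[3.8]]))        # probability thresholded at 0.5 -> class 1
```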

Negative Log-Loss

Logistic Regression works best when the data is linearly separable (it can approach 0 loss). When this is not the case, the negative log-loss measures how much probability the model assigns to the correct labels and penalizes confident mistakes most heavily.
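
A minimal NumPy sketch of the negative log-loss for binary predictions (the probabilities and labels are made up):

```python
import numpy as np

def negative_log_loss(y_true, p_pred, eps=1e-12):
    # penalize assigning low probability to the correct class
    p_pred = np.clip(p_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))

y_true = np.array([1, 0, 1, 1])
p_pred = np.array([0.9, 0.1, 0.8, 0.3])   # predicted P(y = 1)
print(negative_log_loss(y_true, p_pred))
# the confident, correct predictions contribute little loss;
# the 0.3 assigned to a true 1 dominates the total
```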

Cross-Entropy Loss

Support Vector Machines

SVMs use an optimization algorithm to find the hyperplane that best separates the data by maximizing the margin between classes.

Properties:

  • Uses convex optimization where there is only one optimal solution (best hyperplane)

Kernels allow SVMs to operate on higher-dimensional data by implicitly transforming nonlinear data into a representation where it is linearly separable. However, kernel methods are computationally expensive and are best used on small datasets.
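
A minimal sketch with scikit-learn's SVC using an RBF kernel on a tiny made-up dataset that is not linearly separable in one dimension:

```python
import numpy as np
from sklearn.svm import SVC

# points near 0 are one class, points far from 0 are the other
X = np.array([[-2.0], [-1.5], [0.0], [0.2], [1.8], [2.2]])
y = np.array([1, 1, 0, 0, 1, 1])

clf = SVC(kernel="rbf", C=1.0)      # the kernel implicitly maps the data to a higher dimension
clf.fit(X, y)
print(clf.predict([[0.1], [2.0]]))  # expected: [0 1]
```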

4. Deep Learning

This is used for complex and multi-dimensional data

Perceptrons

Note: technically this is a linear model, but we include it under Neural Networks since it's most relevant here.

A perceptron is a simplified binary classification model that computes its output by applying weights to the inputs and passing the weighted sum through an activation function.
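
A minimal NumPy sketch of a single perceptron with a step activation; the weights here are hand-picked to implement a logical AND:

```python
import numpy as np

def perceptron_predict(x, w, b):
    # weighted sum of inputs, then a step activation function
    z = np.dot(w, x) + b
    return 1 if z > 0 else 0

# hand-picked weights/bias implementing AND on two binary inputs
w, b = np.array([1.0, 1.0]), -1.5
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron_predict(np.array(x), w, b))  # only (1, 1) outputs 1
```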


Neural Networks

MIT explains Neural Networks as:

a means of doing machine learning, in which a computer learns to perform some task by analyzing training examples. Usually, the examples have been hand-labeled in advance.

Essentially, NNs find patterns in large amounts of data instead of the features being given to them. They can then use these patterns to generalize across new data.


The Universal Approximation Theorem states that a NN with enough hidden nodes and layers can model any continuous function (it has a high ceiling capability)

These are the steps of a NN. Usually, you begin the training process by feeding in lots of input data and checking that the outputs match the labels in the dataset.

During Feedforward learning:

  1. Input data points
  2. Compute weighted sum of inputs
  3. Apply an Activation function
  4. Repeat for all hidden layers
  5. compute the output layer weighted sum
  6. Apply the output activation function
  7. Output consists of: final prediction and error
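
A minimal NumPy sketch of the feedforward steps above for a network with one hidden layer (the layer sizes and random weights are made up):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.standard_normal(3)                          # 1. input data point (3 features)
W1, b1 = rng.standard_normal((4, 3)), np.zeros(4)   # hidden layer weights
W2, b2 = rng.standard_normal((1, 4)), np.zeros(1)   # output layer weights

h = sigmoid(W1 @ x + b1)       # 2-3. weighted sum of inputs + activation
y_hat = sigmoid(W2 @ h + b2)   # 5-6. output weighted sum + output activation
y_true = 1.0
error = y_true - y_hat         # 7. final prediction and error (used by backprop)
print(y_hat, error)
```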

During Backpropagation, the model learns from the mistakes made at the output by moving backwards through the NN and updating the weights of each node that contributed to the error. It typically uses gradient descent (e.g., SGD) on the error to update these weights.


Activation Functions

Activation functions apply nonlinearity to the data, which allows NN models to learn more complex patterns than linear models can.

Some of the most popular activation functions include:

Sigmoid is a continuous function that maps any input to an activation value between 0 and 1:

$\sigma(x) = \frac{1}{1 + e^{-x}}$

ReLU outputs 0 for negative values and the input itself for positive values:

$f(x) = \max(0, x)$

Softmax converts a vector of scores into a probability distribution over all class labels that sums to 1:

$\sigma(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}$
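
A minimal NumPy sketch of these three activation functions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # squashes any input into (0, 1)

def relu(x):
    return np.maximum(0, x)           # 0 for negatives, the input itself for positives

def softmax(z):
    e = np.exp(z - np.max(z))         # subtract the max for numerical stability
    return e / e.sum()                # probabilities that sum to 1

z = np.array([2.0, 1.0, -1.0])
print(sigmoid(z), relu(z), softmax(z), softmax(z).sum())
```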

Vanishing Gradient Problem

This is an issue with the gradients during backpropagation. It happens when saturating activation functions repeatedly shrink the gradient at each layer, so by the time it reaches the input layers it becomes extremely small.

This causes the early layers of the model to learn very slowly. To mitigate it, use activation functions like ReLU, which do not squash positive values and therefore keep the gradients from shrinking toward zero.

recall that gradient represents steepness of the loss curve

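A small numerical illustration of the effect, assuming sigmoid activations (whose derivative is at most 0.25):

```python
# backpropagating through many sigmoid layers multiplies the gradient
# by a factor of at most 0.25 per layer, so the gradient reaching the
# early layers shrinks roughly geometrically with depth
max_sigmoid_deriv = 0.25
for depth in [2, 5, 10, 20]:
    print(depth, max_sigmoid_deriv ** depth)
```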

Reinforcement Learning

Reinforcement learning is a learning paradigm (often built on NNs) in which an agent updates its actions iteratively based on new information and reward functions.

The Discount Factor (𝛾) shows how much the agent values future vs immediate rewards

  • High 𝛾 (near 1): long term planning
  • Low 𝛾 (near 0): greedy agent

$V(s) = E[R_0 + \gamma R_1 + \gamma^2 R_2 + \gamma^3 R_3 + \dots]$

The State-Action Function

$Q(s,a)$ returns the expected return for taking action $a$ in state $s$. The goal is to find the optimal state-action function that maximizes reward, which directly determines the policy function:

$\pi(s) = \arg\max_a Q(s,a)$
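
A minimal sketch of reading the greedy policy out of a made-up Q-table:

```python
import numpy as np

# hypothetical Q(s, a) table: 3 states (rows) x 2 actions (columns)
Q = np.array([[1.0, 0.5],
              [0.2, 0.9],
              [0.4, 0.3]])

# pi(s) = argmax_a Q(s, a): in each state, pick the action with the highest expected return
policy = np.argmax(Q, axis=1)
print(policy)  # [0 1 0]
```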

The Markov Model is a simplified RL model that only computes the rewards between state transitions.

6. Decision Trees

Decision trees are rule-based models that categorize data into discrete groups based on binary decisions at each node.

Properties:

  • High variance: because of how it subdivides the data, it can lead to deep trees that learn extreme details of the dataset


Computing the decision boundary
The decision boundary is made up of vertical and horizontal line segments determined by the splits of the tree. For example, a tree of depth 4 produces a boundary made up of 4 such lines.


Constructing Decision Trees

How does the model determine the best questions to ask at each node?

It uses entropy, which is a measure of how pure and homogeneous a dataset is. The more homogeneous the data becomes (i.e., all objects fall under the same class label), the lower the entropy. An entropy of 1 is the worst value (for a binary problem) since it means the data is split 50/50 between the two classes.

Here is the entropy formula:

$H(X) = -\sum_{x} p(x)\log p(x)$

The goal of decision tree policy is to minimize the entropy. The information gain (IG) is the reduction in entropy. We use this to determine the best attributes to classify the training data on at each step.

Here is the IG formula for a given feature $A_i$:

$IG(D, A_i) = H(D) - H(D \mid A_i)$
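
A minimal NumPy sketch of entropy and information gain for one candidate binary split (the labels are made up):

```python
import numpy as np

def entropy(labels):
    # H(X) = -sum_x p(x) log2 p(x)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# hypothetical parent node and a candidate split on some feature A_i
parent = np.array([1, 1, 1, 0, 0, 0, 0, 0])
left   = np.array([1, 1, 1, 0])   # points where A_i is true
right  = np.array([0, 0, 0, 0])   # points where A_i is false

h_split = (len(left) / len(parent)) * entropy(left) \
        + (len(right) / len(parent)) * entropy(right)
print(entropy(parent) - h_split)  # information gain of splitting on A_i
```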

The most efficient trees:

  • place the highest-IG features at the top of the tree
  • are short in height (efficient attribute selection)

Ensemble Learning

Ensemble models combine multiple models to form one better model.

  • In classification tasks: takes the majority vote
  • In regression tasks: takes the average

There are two main types of ensemble learning:

  • Boosting trains models sequentially: each new model is built from the errors and re-weighted dataset of the preceding model
  • Bagging divides the training data into different subsets and trains separate submodels on them before combining them into one.
    • Bagging uses bootstrapped sampling, meaning subsets are drawn at random with replacement (the same datapoint can be sampled multiple times), which reduces overfitting

Properties:

  • Boosting can increase overfitting because it keeps refining a model on the same dataset, focusing on the hardest examples (a specialist)
  • Bagging reduces overfitting by averaging the outputs of multiple models trained on different subsets of the data (a generalist); both are sketched below
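
A minimal sketch of both approaches using scikit-learn's ensemble classes, whose default base models are decision trees (the dataset is made up):

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier

X = np.random.randn(200, 5)                # hypothetical features
y = (X[:, 0] + X[:, 1] > 0).astype(int)    # hypothetical labels

# bagging: many trees, each on a bootstrapped sample, combined by majority vote
bagging = BaggingClassifier(n_estimators=25).fit(X, y)

# boosting: models trained sequentially, re-weighting the points the previous ones got wrong
boosting = AdaBoostClassifier(n_estimators=25).fit(X, y)

print(bagging.score(X, y), boosting.score(X, y))  # training accuracy of each ensemble
```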

7. Hierarchical Clustering

Clustering is an unsupervised learning approach that groups data based on distance-based similarity. It assumes the dataset is unlabeled and there is no ground truth.

Properties:

  • O(n^2) time complexity (quadratic scaling with the amount of data) due to the pairwise comparisons between data points

There are two main linkage criteria for hierarchical (agglomerative) clustering:

  • Single Linkage merges clusters based on the smallest MINIMUM distance between their points, growing clusters one nearest point at a time.
  • Complete Linkage merges clusters based on the smallest MAXIMUM distance between their points. This keeps the clusters compact.
  • Note that each data point is initialized as its own cluster in the beginning
# algorithm pseudocode
# init: every point starts as its own cluster
clusters = [c1, c2, ...]

while len(clusters) > 1:
    # distance between two clusters depends on the linkage type:
    #   single linkage   -> the MINIMUM pairwise distance between their points
    #   complete linkage -> the MAXIMUM pairwise distance between their points
    if single_link:
        distances = cluster_distances(clusters, linkage="single")
    if complete_link:
        distances = cluster_distances(clusters, linkage="complete")

    # in both cases, merge the pair of clusters that is closest
    # under the chosen linkage distance
    c1, c2 = find_closest_pair(distances)
    clusters[c1].merge_with(clusters[c2])

return clusters


K Means Clustering

K-Means is a specific type of clustering algorithm that works by the following:

  1. select k random points as the initial centroids
  2. assign each point to the cluster of its closest centroid
  3. recompute each centroid as the mean of its cluster's points
  4. repeat steps 2-3 until the centroids stop changing (or a maximum number of iterations is reached)

Algorithm below:

import random
import numpy as np

def kmeans(data, k, max_iterations=100):
    """
    data: numpy array of shape (n_samples, n_features)
    k: number of clusters
    max_iterations: int, maximum number of iterations
    """
    n_samples, n_features = data.shape
    # randomly choose k centroids
    centroid_indices = random.sample(range(n_samples), k)
    centroids = data[centroid_indices]
    labels = np.zeros(n_samples, dtype=int)
    
    # Loop through all points
    for iteration in range(max_iterations):
        old_centroids = centroids.copy()
        
        # Assign each point to the nearest centroid
        for i in range(n_samples):
            distances = np.linalg.norm(data[i] - centroids, axis=1)
            labels[i] = np.argmin(distances)
        
        # Update centroids by taking the mean of its cluster points
        for j in range(k):
            cluster_points = data[labels == j]
            if len(cluster_points) > 0:
                centroids[j] = np.mean(cluster_points, axis=0)
        # Check for convergence
        if np.all(old_centroids == centroids):
            break
            
    return centroids, labels
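
# Example usage on made-up 2-D data: two blobs, one near (0, 0) and one near (5, 5)
data = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centroids, labels = kmeans(data, k=2)
print(centroids)  # one centroid should land near each blob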

Applications of ML in the real world

Here are the types of problems that each ML model attempts to solve. These are the main types of models we've covered:

  • Linear Models: Linear Regression, Linear Classification,