# Machine Learning
## Foundational Concepts
This note lays the groundwork for understanding the core principles and challenges in machine learning.
### What is Machine Learning?
Machine learning is a subfield of artificial intelligence (AI) that focuses on enabling computers to learn from data without being explicitly programmed for every specific task. Instead of relying on hard-coded rules, machine learning algorithms use data to identify patterns, make predictions, and improve their performance over time.
### Goals of Machine Learning
* **Learning from data:** Extract meaningful patterns and insights from data.
* **Generalization:** Apply learned knowledge to new, unseen data and make accurate predictions or decisions.
### The Core Problem
The fundamental challenge in machine learning is to find a reliable rule or pattern within a dataset that allows us to accurately predict outcomes for new data. This involves:
* **Defining the data:** Understanding the type and structure of the data (e.g., images, text, sensor readings).
* **Defining the task:** Specifying the prediction or decision-making goal (e.g., classifying images, translating languages, predicting stock prices).
### The Importance of "Simple" Rules
A key principle in machine learning is the preference for simplicity. When faced with multiple rules or patterns that explain the training data, we often favor the simplest one. This is because simpler rules are less likely to overfit the data (memorize noise or irrelevant details) and tend to generalize better to new data. This principle is related to Occam's Razor, which states that the simplest explanation is often the best.
### Types of Learning
Machine learning encompasses various learning paradigms, each with its own characteristics and applications:
* **Supervised Learning:** The algorithm learns from labeled examples, where each data point includes both the input features and the correct output or label.
* **Classification:** Predicting a categorical label (e.g., spam/not spam, cat/dog/bird).
* **Regression:** Predicting a continuous value (e.g., house prices, stock prices).
* **Unsupervised Learning:** The algorithm learns from unlabeled data, where only the input features are available. The goal is to discover underlying structure or patterns in the data.
* **Clustering:** Grouping similar data points together (e.g., customer segmentation).
* **Dimensionality Reduction:** Reducing the number of input features while preserving important information.
* **Online Learning:** The algorithm learns sequentially, processing data one example at a time. This is suitable for dynamic environments where data arrives continuously.
* **Batch Learning:** The algorithm learns from the entire dataset at once. This is more common when the dataset is static and available upfront.
### Key Challenges in Machine Learning
* **Overfitting:** Occurs when a model learns the training data too well, capturing noise and irrelevant details. This leads to poor generalization to new data. Overfitting is often caused by excessive model complexity or limited training data.
* **Generalization:** The ability of a model to perform well on unseen data is crucial. Achieving good generalization requires balancing model complexity with the amount of training data and using techniques like regularization.
* **Bias-Variance Trade-off:** This trade-off relates to the balance between model complexity and generalization.
* **Bias:** Error due to oversimplification of the model. High bias leads to underfitting.
* **Variance:** Error due to excessive sensitivity to the training data. High variance leads to overfitting. Finding the optimal balance between bias and variance is essential for achieving good generalization.
This foundational section provides a clear overview of machine learning, its goals, and the key challenges that need to be addressed when designing and training effective machine learning models.
## Linear Models and Algorithms
This section explores fundamental linear models and algorithms used in machine learning for classification tasks.
### The Perceptron Algorithm
The Perceptron algorithm is a binary linear classifier that aims to find a linear separator (a hyperplane) to divide data points into two categories. In essence, it learns a linear function to perform classification.
#### Task: Linear Separation
Given a set of data points $\mathbf{x}_i \in \mathbb{R}^d$ (where $d$ is the number of features) and their corresponding labels $l_i \in \{-1, 1\}$, the Perceptron seeks to find a weight vector $\mathbf{w} \in \mathbb{R}^d$ and a bias term $t$ such that:
* $\mathbf{w} \cdot \mathbf{x}_i + t > 0$ for each $\mathbf{x}_i$ labeled $+1$
* $\mathbf{w} \cdot \mathbf{x}_i + t < 0$ for each $\mathbf{x}_i$ labeled $-1$
The equation $\mathbf{w} \cdot \mathbf{x} + t = 0$ defines the separating hyperplane.

#### Algorithm: Iterative Weight Updates
The Perceptron algorithm employs an iterative process to learn a separating weight vector:
:::success
<font color=blue>Perceptron Algorithm</font>
1. **Initialization:** Initialize the weight vector $\mathbf{w}$ to $\mathbf{0}$ (or small random values) and the bias $t$ to $0$.
2. **Iteration:**
* For each data point $(\mathbf{x}_i, l_i)$:
* Predict the label: $\hat{l}_i = \text{sign}(\mathbf{w} \cdot \mathbf{x}_i + t)$
* If $\hat{l}_i \neq l_i$ (misclassified):
* Update the weights: $\mathbf{w} \leftarrow \mathbf{w} + \mathbf{x}_i l_i$
* Update the bias: $t \leftarrow t + l_i$
3. **Repeat** step 2 until all points are correctly classified or a maximum number of iterations is reached.
:::
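A minimal NumPy sketch of the loop above is shown below; the toy dataset, the epoch cap, and treating $\text{sign}(0)$ as a mistake are illustrative assumptions rather than part of the algorithm itself.

```python
import numpy as np

def perceptron(X, labels, max_epochs=100):
    """Learn (w, t) such that sign(w . x_i + t) == l_i for all i."""
    n, d = X.shape
    w = np.zeros(d)  # weight vector
    t = 0.0          # bias term
    for _ in range(max_epochs):
        mistakes = 0
        for x_i, l_i in zip(X, labels):
            if np.sign(w @ x_i + t) != l_i:  # misclassified (sign(0) counts as a mistake)
                w += l_i * x_i               # w <- w + l_i x_i
                t += l_i                     # t <- t + l_i
                mistakes += 1
        if mistakes == 0:                    # all points correctly classified
            break
    return w, t

# Toy linearly separable data in R^2
X = np.array([[2.0, 2.0], [1.5, 1.0], [0.0, 0.0], [-1.0, 0.5]])
labels = np.array([1, 1, -1, -1])
w, t = perceptron(X, labels)
print(np.sign(X @ w + t))  # matches labels: [ 1.  1. -1. -1.]
```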
#### Convergence and Margin
* **Convergence:** If the data is linearly separable with a margin $\gamma$ (meaning there exists a hyperplane that separates the data with a minimum distance of $\gamma$ from any data point), the Perceptron algorithm is guaranteed to converge in a finite number of steps. The maximum number of updates is bounded by $R^2/\gamma^2$, where $R$ is the maximum norm of any data point.
<font color=red>Bonus proof</font>
* **Margin:** Mathematically, the margin $\gamma$ is defined as: $$\gamma = \min_i \frac{|\mathbf{w} \cdot \mathbf{x}_i + t|}{||\mathbf{w}||}$$
#### Example: Spam Classification
In spam classification, each email can be represented by a feature vector $\mathbf{x}_i$ whose elements are frequencies of specific words or phrases. The Perceptron learns a weight vector $\mathbf{w}$ that assigns an importance to each word. For example, words like "free," "viagra," or "lottery" might receive large positive weights (indicating spam), while words like "report," "meeting," or "project" might receive negative weights. The bias term $t$ shifts the decision threshold. The Perceptron classifies an email as spam if $\mathbf{w} \cdot \mathbf{x}_i + t > 0$.
### Handling Non-linearly Separable Data
Linear models struggle when the decision boundary between classes is non-linear.
#### Feature Transformations
To address this, we can apply feature transformations, mapping the original data points $\mathbf{x}_i$ to a higher-dimensional space using a function $\phi(\mathbf{x}_i)$. This can make the data linearly separable in the new space.

#### The Kernel Trick
The Kernel Trick avoids the explicit computation of $\phi(\mathbf{x}_i)$, which can be computationally expensive. Instead, it uses a kernel function $K(\mathbf{x}_i, \mathbf{x}_j)$ that directly computes the dot product of the transformed data points:
$$
K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i) \cdot \phi(\mathbf{x}_j)
$$
**Lemma** A matrix $K$ is a kernel matrix, i.e., there is a function $\phi$ such that $k_{ij} = \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)$, if and only if $K$ is positive semi-definite.
<font color=red>Bonus proof</font>
#### Kernel Functions
* **Polynomial Kernel:** $K(\mathbf{x}, \mathbf{y}) = (1 + \mathbf{x}^T \mathbf{y})^p$ for some degree $p$
* This kernel implicitly maps data points to a higher-dimensional space where features are polynomial combinations of the original features.
* **Gaussian Kernel:** $K(\mathbf{x}, \mathbf{y}) = e^{-c \|\mathbf{x} - \mathbf{y}\|^2}$
* This kernel implicitly maps data points to an infinite-dimensional space. It measures the similarity between points based on their Euclidean distance.
By using kernel functions, we can apply linear models to non-linear data effectively.
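As a concrete check, the sketch below uses an explicit degree-2 feature map for $\mathbb{R}^2$ (an illustrative choice) to verify that the polynomial kernel value equals the dot product in the transformed space, even though kernel evaluation itself never constructs $\phi$:

```python
import numpy as np

def poly_kernel(x, y, p=2):
    """Degree-p polynomial kernel K(x, y) = (1 + x . y)^p."""
    return (1.0 + x @ y) ** p

def gaussian_kernel(x, y, c=1.0):
    """Gaussian kernel K(x, y) = exp(-c * ||x - y||^2)."""
    return np.exp(-c * np.sum((x - y) ** 2))

def phi_degree2(x):
    """Explicit feature map for the degree-2 polynomial kernel in R^2."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1**2, x2**2, s * x1 * x2])

x = np.array([1.0, 2.0])
y = np.array([0.5, -1.0])
print(poly_kernel(x, y))                # kernel value, no explicit mapping
print(phi_degree2(x) @ phi_degree2(y))  # same value via explicit phi
```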
## Measuring Model Complexity and Generalization
This section delves into how we can quantify the complexity of machine learning models and how this complexity relates to their ability to generalize to new, unseen data.
### VC-Dimension
The Vapnik-Chervonenkis (VC) dimension is a fundamental concept in statistical learning theory that measures the expressive power or **capacity** of a hypothesis class (a set of possible models).
#### Definition
* **Shattering:** A hypothesis class $\mathcal{H}$ is said to shatter a set of points $S$ if for every possible labeling of the points in $S$, there exists a hypothesis $h \in \mathcal{H}$ that perfectly classifies the points according to that labeling.
* **VC-Dimension:** The VC-dimension of $\mathcal{H}$, denoted $\text{VCdim}(\mathcal{H})$, is the size of the largest set of points that can be shattered by $\mathcal{H}$.
#### Examples of VC-Dimensions
* **Linear Classifiers (hyperplanes) in $\mathbb{R}^d$:** $\text{VCdim}(\mathcal{H})=d+1$
* **Axis-aligned rectangles in $\mathbb{R}^2$:** $\text{VCdim}(\mathcal{H}) = 4$

* **Intervals on the real line:** $\text{VCdim}(\mathcal{H}) = 2$
* **Convex polygons in $\mathbb{R}^2$:** $\text{VCdim}(\mathcal{H}) = \infty$
<font color=red>Bonus explanation</font>
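To make the intervals claim concrete, the brute-force sketch below checks shattering by enumerating all labelings against a finite set of candidate endpoints (sufficient here because only the ordering of endpoints relative to the points matters; an interval with $a > b$ plays the role of the empty interval):

```python
import itertools

def interval_shatters(points):
    """Brute-force check: can intervals [a, b] realize every +/-1 labeling of points?"""
    pts = sorted(points)
    # Midpoints between consecutive points, plus one value on each side,
    # cover all distinct interval behaviors.
    cuts = ([pts[0] - 1.0]
            + [(pts[i] + pts[i + 1]) / 2.0 for i in range(len(pts) - 1)]
            + [pts[-1] + 1.0])
    for labeling in itertools.product([1, -1], repeat=len(pts)):
        ok = any(all((1 if a <= x <= b else -1) == l for x, l in zip(pts, labeling))
                 for a in cuts for b in cuts)
        if not ok:
            return False
    return True

print(interval_shatters([0.0, 1.0]))        # True: 2 points can be shattered
print(interval_shatters([0.0, 1.0, 2.0]))   # False: labeling (+, -, +) is unachievable
```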
#### Connection to Sample Complexity and Generalization
The VC-dimension is closely linked to the number of training examples needed for a model to generalize well.
* **Higher VC-dimension:** Indicates a more complex model that can represent a wider range of functions. This requires more training data to avoid overfitting and achieve good generalization.
* **Lower VC-dimension:** Indicates a simpler model with less expressive power. It may require fewer training examples to generalize well but might underfit complex data.
#### Sauer's Lemma
Sauer's Lemma provides a bound on the number of different ways a hypothesis class can label a set of points. It states that for a hypothesis class $\mathcal{H}$ with VC-dimension $d$, the maximum number of distinct labelings it can produce on a set of $n$ points is bounded by:
$$
\sum_{i=0}^d \binom{n}{i} \leq n^d+1
$$
<font color=red>Bonus proof</font>
##### Implications
* The growth function (number of possible labelings) is either exponential (for infinite VC-dimension) or polynomial (for finite VC-dimension).
* This lemma connects the VC-dimension to the complexity of the hypothesis class and its ability to represent different patterns in the data.
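A quick numeric illustration of the bound (the choice $d = 3$ and the values of $n$ are arbitrary): for $n$ somewhat larger than $d$, the polynomial bound falls far below the $2^n$ labelings an unrestricted class could produce.

```python
from math import comb

d = 3  # assumed VC-dimension
for n in [3, 5, 10, 20]:
    sauer = sum(comb(n, i) for i in range(d + 1))  # Sauer bound on distinct labelings
    print(f"n={n}: sum C(n,i)={sauer}, n^d+1={n**d + 1}, 2^n={2**n}")
```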
### Uniform Convergence and Sample Bounds
Uniform convergence is a key concept in statistical learning theory that describes the scenario where the empirical error (error on the training data) of a hypothesis converges to its true error (error on the underlying data distribution) uniformly across all hypotheses in the hypothesis class.
#### Key Theorems and Corollaries
Several theorems and corollaries provide bounds on the generalization error of a learning algorithm based on the VC-dimension and the number of training examples. Here are some important ones:
* **Sample Bound Theorem:** Gives a sample size sufficient to ensure that, with high probability, every hypothesis with large true error also has non-zero empirical error (and so is eliminated by any learner that fits the training data).
* **Growth Function Uniform Convergence Theorem:** Gives a sufficient condition for uniform convergence based on the growth function of the hypothesis class.
* **Corollary:** Relates the sample complexity directly to the VC-dimension, showing that the number of training examples needed to achieve a certain generalization error grows linearly with the VC-dimension.
<font color=red>Bonus proof</font>
##### Interpretation
These results provide theoretical guarantees that, with enough training data, the empirical error of a hypothesis will be a good estimate of its true error, allowing us to confidently generalize from the training data to unseen data. The VC-dimension plays a crucial role in these bounds, highlighting the importance of controlling model complexity for good generalization.
## Deep Learning
Deep learning has revolutionized machine learning by leveraging the power of artificial neural networks to learn complex patterns and representations from vast amounts of data. This section explores the fundamental building blocks and architectures of deep neural networks, along with some advanced concepts.
### Introduction to Neural Networks
Inspired by the structure and function of the human brain, artificial neural networks are composed of interconnected nodes, or neurons, organized in layers.
#### Basic Building Blocks
* **Neurons:** Each neuron receives input from other neurons (or from the input data), performs a weighted sum of the inputs, adds a bias term, and applies an **activation function** to produce an output.
* Mathematically: $$\text{output} = \text{activation}(w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b)$$ where $b$ is the bias term.
* **Activation Functions:** Introduce non-linearity into the network, enabling it to learn complex relationships. Common activation functions include:

* **Sigmoid:** Maps any input value to a value between 0 and 1.
* **ReLU (Rectified Linear Unit):** Outputs the input directly if it is positive, otherwise, it outputs 0.
* **Tanh (Hyperbolic Tangent):** Maps any input value to a value between -1 and 1.
* **Layers:** Neurons are organized into layers:
* **Input Layer:** Receives the raw input data.
* **Hidden Layers:** Perform intermediate computations and learn representations.
* **Output Layer:** Produces the final output or prediction.
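The sketch below wires these pieces together in NumPy: each layer computes a weighted sum plus a bias, followed by an activation (the layer sizes and random weights are illustrative assumptions).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def forward(x, params):
    """Forward pass of a tiny fully connected net: input -> hidden (ReLU) -> output (sigmoid)."""
    W1, b1, W2, b2 = params
    h = relu(W1 @ x + b1)        # hidden layer: weighted sum + bias, then activation
    return sigmoid(W2 @ h + b2)  # output layer

rng = np.random.default_rng(0)
params = (rng.normal(size=(4, 3)), np.zeros(4),   # W1, b1: 3 inputs -> 4 hidden units
          rng.normal(size=(1, 4)), np.zeros(1))   # W2, b2: 4 hidden -> 1 output
print(forward(np.array([0.5, -1.0, 2.0]), params))
```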
#### Common Architectures
* **Fully Connected Layers:** Every neuron in one layer is connected to every neuron in the next layer. This architecture is suitable for tasks where global relationships between features are important.

* **Convolutional Layers:** Specialized for processing grid-like data, such as images. They use filters that slide across the input, detecting local patterns.

* **Pooling Layers:** Reduce the dimensionality of the data by summarizing the outputs of neighboring neurons. This helps make the network more robust to small variations in the input.

#### How Neural Networks Learn
Deep neural networks are most commonly trained via **supervised learning**, using labeled examples. The learning process involves:
1. **Forward Pass:** Input data is fed through the network, and the output is calculated.
2. **Loss Function:** The difference between the network's prediction and the true label is measured using a loss function (e.g., mean squared error for regression, cross-entropy for classification).
3. **Backpropagation:** The error is propagated back through the network, and the weights and biases of each neuron are adjusted to minimize the loss. This is typically done using an optimization algorithm like stochastic gradient descent.
### Advanced Deep Learning Concepts
#### Softmax
* Softmax is an activation function used in the output layer of a neural network for multi-class classification problems.
* It takes a vector of raw scores (logits) and transforms them into a probability distribution over the different classes. This allows the network to assign probabilities to each possible class, making it easier to interpret the output and make predictions.
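A common NumPy implementation is sketched below; subtracting the maximum logit first is a standard numerical-stability trick and leaves the output unchanged, since softmax is invariant to shifting all logits.

```python
import numpy as np

def softmax(logits):
    """Map raw scores (logits) to a probability distribution over classes."""
    z = logits - np.max(logits)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p, p.sum())  # probabilities summing to 1
```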
#### Generative Adversarial Networks (GANs)
* GANs consist of two neural networks: a **generator** and a **discriminator**.
* The generator creates synthetic data (e.g., images, text), while the discriminator tries to distinguish real data from the synthetic data produced by the generator.
* These two networks are trained in an adversarial manner, where the generator tries to fool the discriminator, and the discriminator tries to correctly identify real and fake data.
* This process leads to the generator learning to produce increasingly realistic data, which has applications in image generation, style transfer, and more.

## Optimization Techniques
Optimization is the engine that drives machine learning. It's the process of finding the best parameters for a model to minimize the error on the training data. This section explores key optimization techniques, delving into their mathematical foundations.
### Gradient Descent
Gradient descent is a first-order iterative optimization algorithm used to find the minimum of a function. In machine learning, this function is typically the loss function that measures the error of the model.
#### Concept
Imagine a function $J(\pmb{\theta})$ that represents the loss landscape, where $\pmb{\theta}$ represents the model parameters. The goal is to find the values of $\pmb{\theta}$ that minimize $J(\pmb{\theta})$. Gradient descent achieves this by iteratively updating the parameters in the direction of steepest descent, which is the negative gradient of the loss function.
#### Update Rule
The update rule for gradient descent is:
$$
\pmb{\theta}_{t+1} = \pmb{\theta}_t - \eta \nabla_{\theta} J(\pmb{\theta}_t)
$$
where:
* $\pmb{\theta}_t$ is the vector of parameters at iteration $t$.
* $\eta > 0$ is the learning rate, a scalar that controls the step size.
* $\nabla_{\theta} J(\pmb{\theta}_t)$ is the gradient of the loss function $J$ with respect to the parameters at iteration $t$. This is a vector where each component is the partial derivative of $J$ with respect to a specific parameter:
$$
\nabla_{\theta} J(\pmb{\theta}_t) = \left(\frac{\partial J(\pmb{\theta}_t)}{\partial \theta_1}, \frac{\partial J(\pmb{\theta}_t)}{\partial \theta_2}, \dots, \frac{\partial J(\pmb{\theta}_t)}{\partial \theta_n} \right)
$$
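The sketch below applies this update rule to a simple convex quadratic whose gradient is known in closed form (the target vector, learning rate, and iteration count are illustrative):

```python
import numpy as np

target = np.array([3.0, -2.0])

def grad_J(theta):
    """Gradient of J(theta) = ||theta - target||^2."""
    return 2.0 * (theta - target)

theta = np.zeros(2)
eta = 0.1  # learning rate
for _ in range(100):
    theta = theta - eta * grad_J(theta)  # theta_{t+1} = theta_t - eta * grad J(theta_t)
print(theta)  # approaches [3, -2], the unique minimizer
```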
#### Challenges
* **Learning Rate Selection:**
* A small $\eta$ can lead to slow convergence.
* A large $\eta$ can cause oscillations and prevent convergence.
* Adaptive learning rate methods (e.g., Adam, RMSprop) adjust $\eta$ during training to improve convergence.
* **Local Minima:**
* For non-convex functions, gradient descent can converge to a local minimum instead of the global minimum.
* Techniques like momentum or simulated annealing can help escape local minima.
* **Saddle Points:**
* Saddle points have a zero gradient but are not minima.
* Second-order methods (e.g., Newton's method) that use the Hessian matrix (second derivatives) can help navigate saddle points.
#### Convex Functions
A function $f$ is convex if for any two points $\mathbf{x}$ and $\mathbf{y}$ and any $\lambda \in [0,1]$:
$$
f(\lambda \mathbf{x} + (1−\lambda) \mathbf{y}) \leq \lambda f(\mathbf{x}) + (1−\lambda) f(\mathbf{y})
$$
**Theorem** If $f$ is a convex, differentiable function defined on a closed bounded convex domain, then:
* Any first-order local minimum is also a global minimum.
* Infinitesimal gradient descent is guaranteed to reach the global minimum.
**Proof.**
**Part 1: Local to Global Minimum**
Assume $\mathbf{x}$ is a local minimum but not a global one. Then there exists $\mathbf{y}$ with $f(\mathbf{y}) < f(\mathbf{x})$. By convexity, for $0 < \epsilon < 1$:
$$
f(\mathbf{x} + \epsilon(\mathbf{y} - \mathbf{x})) \leq (1 - \epsilon) f(\mathbf{x}) + \epsilon f(\mathbf{y}) < f(\mathbf{x})
$$
Since this holds for arbitrarily small $\epsilon > 0$, every neighborhood of $\mathbf{x}$ contains points with strictly smaller function value, contradicting the assumption that $\mathbf{x}$ is a local minimum.
**Part 2: Gradient Descent Convergence**
Gradient descent updates: $\mathbf{x}_{new} = \mathbf{x}_{old} - \eta \nabla f(\mathbf{x}_{old})$. By Taylor approximation:
$$
f(\mathbf{x}_{new}) \approx f(\mathbf{x}_{old}) - \eta || \nabla f(\mathbf{x}_{old})||^2
$$
so $f$ strictly decreases at each step unless $\nabla f(\mathbf{x}_{old}) = 0$. In the infinitesimal-step limit, gradient descent therefore stops only at a first-order local minimum, which by Part 1 is a global minimum. $\Box$
### Stochastic Gradient Descent (SGD)
SGD is a variant of gradient descent that uses a mini-batch of training examples to estimate the gradient in each iteration.
#### Mathematical Formulation
Instead of computing the gradient using the entire dataset:
$$
\nabla_{\theta} J(\pmb{\theta}_t) = \frac1m \sum_{i=1}^m \nabla_{\theta} J_i(\pmb{\theta}_t; \mathbf{x}^{(i)}, \mathbf{y}^{(i)})
$$
(where $m$ is the total number of training examples, and $J_i$ is the loss for the $i$-th example), SGD uses a randomly sampled mini-batch $B_t$ of size $b$:
$$
\nabla_{\theta} J(\pmb{\theta}_t) \approx \frac1b \sum_{i \in B_t} \nabla_{\theta} J_i(\pmb{\theta}_t; \mathbf{x}^{(i)}, \mathbf{y}^{(i)})
$$
##### Advantages
* **Computational Efficiency:** Each iteration is faster as it processes a smaller batch of data.
* **Escape Local Minima:** The noise from mini-batch sampling can help escape shallow local minima.
##### Trade-offs
* **Noisy Updates:** The gradient estimate is less accurate, leading to oscillations during training.
* **Hyperparameter Tuning:** Requires careful selection of learning rate and mini-batch size.
##### Role of Mini-batch Size
* **Small Batches ($b = 1$):** This is called online learning. It provides maximum exploration but has the noisiest updates.
* **Large Batches ($b = m$):** This is equivalent to batch gradient descent. It provides accurate gradient estimates but is computationally expensive.
* **Intermediate Batches:** A balance between accuracy and efficiency. Typical values range from 32 to 512.
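A mini-batch SGD sketch on least-squares linear regression is given below; the synthetic data, batch size $b = 32$, learning rate, and epoch count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
m, d, b = 1000, 5, 32                           # dataset size, features, mini-batch size
theta_true = rng.normal(size=d)
X = rng.normal(size=(m, d))
y = X @ theta_true + 0.1 * rng.normal(size=m)   # noisy linear targets

theta = np.zeros(d)
eta = 0.05
for epoch in range(20):
    perm = rng.permutation(m)                   # shuffle each epoch
    for start in range(0, m, b):
        idx = perm[start:start + b]
        Xb, yb = X[idx], y[idx]
        grad = 2.0 / len(idx) * Xb.T @ (Xb @ theta - yb)  # mini-batch gradient of MSE
        theta -= eta * grad
print(np.linalg.norm(theta - theta_true))       # should be small
```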
### Regularization
Regularization modifies the loss function to prevent overfitting.
#### Mathematical Formulation
The regularized loss function is:
$$
J_{reg}(\pmb{\theta}) = J(\pmb{\theta}) + \lambda R(\pmb{\theta})
$$
where:
* $J(\pmb{\theta})$ is the original loss function.
* $\lambda \geq 0$ is the regularization parameter that controls the strength of regularization.
* $R(\pmb{\theta})$ is the regularization term.
##### Examples
* **L1 Regularization:** $R(\pmb{\theta}) = ||\pmb{\theta}||_1 = \sum_{i=1}^n |\theta_i|$
* **L2 Regularization (Weight Decay):** $R(\pmb{\theta}) = ||\pmb{\theta}||_2^2 = \sum_{i=1}^n \theta^2_i$
##### Impact
* **L1 Regularization:** Encourages sparsity (many weights become zero).
* **L2 Regularization:** Encourages smaller weights, making the model less sensitive to individual features.
##### Benefits
* **Improved Generalization:** Reduces overfitting and improves performance on unseen data.
* **Increased Interpretability:** Can simplify the model by reducing the number of features or making the weights smaller.
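The sketch below contrasts the two penalties on a synthetic problem where only the first three features matter. The data, $\lambda = 0.1$, and the plain subgradient step for L1 are illustrative simplifications; dedicated solvers (e.g., proximal methods) handle the L1 penalty more cleanly.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 10))
theta_sparse = np.zeros(10)
theta_sparse[:3] = [2.0, -1.5, 1.0]             # only 3 relevant features
y = X @ theta_sparse + 0.1 * rng.normal(size=200)

def fit(lam, reg):
    """Gradient descent on MSE + lam * R(theta), for R = L1 or L2."""
    theta = np.zeros(10)
    for _ in range(2000):
        grad = 2.0 / len(y) * X.T @ (X @ theta - y)
        if reg == "l2":
            grad = grad + lam * 2.0 * theta     # gradient of lam * sum(theta_i^2)
        else:
            grad = grad + lam * np.sign(theta)  # subgradient of lam * sum |theta_i|
        theta -= 0.05 * grad
    return theta

print(np.round(fit(0.1, "l1"), 3))  # irrelevant weights driven near zero (sparsity)
print(np.round(fit(0.1, "l2"), 3))  # all weights shrunk toward zero
```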
## Online Learning
Online learning offers a dynamic approach where the algorithm learns sequentially, processing data one instance at a time. This section delves deeper into the mathematical foundations of online learning algorithms and their adaptation to batch settings.
### Introduction to Online Learning
In online learning, the algorithm interacts with a sequence of data instances, $(x_1, y_1), (x_2, y_2), \dots, (x_t,y_t), \dots$, where $x_t$ is the input instance at time $t$ and $y_t$ is its corresponding label. For each instance:
1. **Prediction:** The algorithm makes a prediction $\hat{y}_t$ based on its current hypothesis $h_t$.
2. **Feedback:** The true label $y_t$ is revealed, and the algorithm suffers a loss $\ell(y_t, \hat{y}_t)$ based on the discrepancy between its prediction and the true label.
3. **Update:** The algorithm updates its hypothesis from $h_t$ to $h_{t+1}$ based on the feedback.
#### Mathematical Formulation
The goal in online learning is to minimize the cumulative loss over time:
$$
\sum_{t=1}^T \ell(y_t, \hat{y}_t) = \sum_{t=1}^T \ell(y_t, h_t(x_t))
$$
where $T$ is the total number of time steps.
#### Challenges
* **Non-IID Data:** The data sequence may not be i.i.d., meaning the instances can be dependent or drawn from different distributions. This requires algorithms that can adapt to changing data characteristics.
* **Adaptivity:** Online learners need to adapt to new information and evolving patterns in the data stream.
* **Mistake Bound:** A key theoretical concept is the mistake bound, which provides an upper bound on the number of mistakes an algorithm can make on a sequence of instances.
### Online Learning Algorithms
#### The Halving Algorithm
* Maintains a version space $V_t$ of all hypotheses consistent with the observed data up to time $t$.
* Predicts according to the majority vote of hypotheses in $V_t$: $$\hat{y}_t = \text{majority}\{h(x_t) | h \in V_t\}$$
* If a mistake occurs, the version space is at least halved: $$V_{t+1} = \{h \in V_t \mid h(x_t) = y_t \}$$
* **Mistake Bound:** The Halving Algorithm makes at most $\log_2 |\mathcal{H}|$ mistakes, where $|\mathcal{H}|$ is the size of the initial hypothesis class.
<font color=red>Bonus proof</font>
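A sketch with a finite class of threshold hypotheses follows; the grid of 64 thresholds, the target hypothesis, and the random query stream are illustrative. The code discards inconsistent hypotheses every round, which only tightens the version space and preserves the mistake bound (each mistake still removes at least half of it).

```python
import numpy as np

# Hypothesis class: thresholds h_c(x) = +1 if x >= c, for c in a finite grid.
cuts = np.linspace(0, 1, 64)
H = [(lambda x, c=c: 1 if x >= c else -1) for c in cuts]

true_h = H[40]  # target hypothesis (realizable setting)
version_space = list(H)
mistakes = 0
rng = np.random.default_rng(3)
for _ in range(50):
    x = rng.uniform(0, 1)
    votes = sum(h(x) for h in version_space)
    y_hat = 1 if votes >= 0 else -1                          # majority vote
    y = true_h(x)
    if y_hat != y:
        mistakes += 1
    version_space = [h for h in version_space if h(x) == y]  # keep consistent hypotheses
print(mistakes, "<=", int(np.log2(len(H))))                  # mistake bound log2 |H| = 6
```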
#### Online Perceptron
* Maintains a weight vector $\mathbf{w}_t$.
* Predicts: $\text{sign}(\mathbf{w}_t \cdot \mathbf{x}_t)$
* Updates the weight vector if a mistake occurs: $$\mathbf{w}_{t+1} = \mathbf{w}_t + y_t \mathbf{x}_t$$
* **Mistake Bound:** If the data is linearly separable with margin $\gamma$, the Online Perceptron makes at most $R^2/\gamma^2$ mistakes, where $R$ is the maximum norm of any instance.
<font color=red>Bonus proof</font>
#### Combining Sleeping Experts Algorithm
* Maintains a weight $w_{i, t}$ for each expert $i$ at time $t$.
* Predicts using a weighted majority vote of the active experts (those that provide a prediction).
* Updates weights using exponential updates based on the expert's performance: $w_{i, t+1} = w_{i, t}(1+\epsilon)^{r_{i, t}}$ where $r_{i, t}$ is a reward function that depends on the accuracy of the expert's prediction.
* **Regret Bound:** This algorithm aims to minimize regret, which is the difference between the cumulative loss of the algorithm and the cumulative loss of the best expert in hindsight. It achieves a regret bound of $O(\sqrt{T \log n})$, where $n$ is the number of experts.
<font color=red>Bonus proof</font>
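A simplified sketch of the exponential-weights update in the special case where every expert is active at every step (so no "sleeping"); the synthetic experts and the tuning $\epsilon = \sqrt{\ln n / T}$ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
T, n = 1000, 5
eps = np.sqrt(np.log(n) / T)                    # typical tuning of the update rate
w = np.ones(n)

# Experts predict +/-1; expert 0 is correct 90% of the time, the rest are random.
total_loss, expert_losses = 0, np.zeros(n)
for t in range(T):
    y = rng.choice([-1, 1])
    preds = rng.choice([-1, 1], size=n)
    preds[0] = y if rng.uniform() < 0.9 else -y
    y_hat = 1 if np.dot(w, preds) >= 0 else -1  # weighted majority vote
    total_loss += int(y_hat != y)
    r = (preds == y).astype(float)              # reward 1 for each correct expert
    w *= (1 + eps) ** r                         # exponential update
    expert_losses += (preds != y)
print(total_loss, expert_losses.min())          # algorithm vs. best expert in hindsight
```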
### From Online to Batch Learning
Online learning algorithms can be adapted to batch settings using techniques like:
#### Random Stopping
* Let $M$ be the mistake bound of the online algorithm.
* Draw a sample $S$ of size $m = \frac{M}{\epsilon}$ from the data distribution.
* Run the online algorithm on a random permutation of $S$.
* Stop at a random iteration $t$ and output $h_t$.
* **Expected Error:** The expected error of the output hypothesis is at most $\epsilon$.
<font color=red>Bonus explanation & proof</font>
#### Controlled Testing
* Divide the data into a training set and a test set.
* Run the online algorithm on the training set.
* After each update, evaluate the current hypothesis on the test set.
* Stop when the hypothesis achieves a desired error rate on the test set.
* This method provides a probabilistic guarantee on the error of the final hypothesis.
<font color=red>Bonus explanation & proof</font>
## Ensemble Methods
Ensemble methods combine multiple individual models (often referred to as "base learners" or "weak learners") to create a more powerful and robust predictive model. This section focuses on boosting, a prominent ensemble method that has achieved remarkable success in various machine learning applications.
### Boosting
Boosting is a technique that sequentially trains a series of weak learners, where each subsequent learner focuses on correcting the mistakes of the previous learners. The final prediction is made by combining the predictions of all the weak learners, typically through a weighted majority vote.
#### Concept: Combining Weak Learners
A weak learner is a model that performs slightly better than random guessing. Boosting leverages the idea that even weak learners, when combined strategically, can create a strong learner with high accuracy.
#### Boosting Algorithm (AdaBoost as an example)
:::success
<font color=blue>AdaBoost</font>
1. **Initialization:** Assign equal weights to all training examples.
2. **Iteration:** For $t = 1$ to $T$ (number of iterations):
* Train a weak learner $h_t$ on the weighted training data.
* Calculate the error rate $\epsilon_t$ of $h_t$ on the weighted data.
* Compute the weight $\alpha_t$ of the weak learner: $\alpha_t = \frac12 \ln \frac{1 - \epsilon_t}{\epsilon_t}$
* Update the weights of the training examples:
* Increase the weights of misclassified examples.
* Decrease the weights of correctly classified examples.
3. **Output:** Combine the weak learners into a final strong learner: $H(x) = \text{sign}\left(\sum_{t=1}^T \alpha_t h_t(x) \right)$
:::
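A compact NumPy sketch of this loop, using depth-one decision stumps as the weak learners, appears below; the exhaustive stump search, the circular toy dataset, and $T = 20$ rounds are illustrative assumptions.

```python
import numpy as np

def stump_predict(X, feat, thresh, sign):
    """Decision stump: predict +/-1 by thresholding one feature."""
    return sign * np.where(X[:, feat] >= thresh, 1, -1)

def best_stump(X, y, D):
    """Weak learner: the stump with the lowest weighted error under weights D."""
    best = None
    for feat in range(X.shape[1]):
        for thresh in np.unique(X[:, feat]):
            for sign in (1, -1):
                err = np.sum(D * (stump_predict(X, feat, thresh, sign) != y))
                if best is None or err < best[0]:
                    best = (err, feat, thresh, sign)
    return best

def adaboost(X, y, T=10):
    n = len(y)
    D = np.ones(n) / n                     # 1. equal initial example weights
    ensemble = []
    for _ in range(T):                     # 2. iterate weak learners
        eps, feat, thresh, sign = best_stump(X, y, D)
        eps = max(eps, 1e-10)              # guard against division by zero
        alpha = 0.5 * np.log((1 - eps) / eps)
        h = stump_predict(X, feat, thresh, sign)
        D *= np.exp(-alpha * y * h)        # up-weight mistakes, down-weight correct points
        D /= D.sum()
        ensemble.append((alpha, feat, thresh, sign))
    return ensemble

def predict(ensemble, X):
    """3. Weighted vote of the weak learners."""
    score = sum(a * stump_predict(X, f, t, s) for a, f, t, s in ensemble)
    return np.sign(score)

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 < 1, 1, -1)  # non-linear target
ens = adaboost(X, y, T=20)
print(np.mean(predict(ens, X) == y))                  # training accuracy
```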
#### Theoretical Guarantees
* **Training Error Reduction:** Boosting algorithms are proven to reduce the training error exponentially fast with the number of iterations.
<font color=red>Bonus proof</font>
* **Generalization:** Boosting can improve generalization performance by increasing the margin of the training data. A larger margin implies better separation between classes and increased robustness to noise.
#### Connections to VC-Dimension
* **VC-Dimension of the Ensemble:** The VC-dimension of the ensemble (the combined model) is related to the VC-dimension of the weak learners and the number of iterations.
* **Sample Complexity:** Boosting can potentially reduce the sample complexity (the number of training examples needed) compared to using a single complex model. This is because it combines multiple simple models, each with lower VC-dimension.
#### Key Advantages of Boosting
* **High Accuracy:** Boosting can achieve very high accuracy by combining weak learners.
* **Versatility:** It can be used with various types of weak learners (e.g., decision trees, linear models).
* **Robustness:** Less prone to overfitting compared to single complex models, especially when the base learners are simple.
## Advanced Topics and Future Directions
Machine learning is a rapidly evolving field with ongoing research exploring new frontiers and pushing the boundaries of what's possible. This section provides a glimpse into some advanced topics and promising future directions in machine learning.
### Semi-Supervised Learning
Semi-supervised learning addresses the challenge of learning from datasets where only a small portion of the data is labeled, while the majority remains unlabeled. This is a common scenario in many real-world applications where labeling data can be expensive or time-consuming.
#### Key Ideas
* **Leveraging Unlabeled Data:** Semi-supervised learning algorithms aim to utilize the information present in unlabeled data to improve the learning process, even without explicit labels.
* **Techniques:** Common techniques include self-training, co-training, and graph-based methods.
#### Benefits
* **Improved Performance:** By leveraging unlabeled data, semi-supervised learning can lead to significant improvements in accuracy, especially when labeled data is scarce.
* **Reduced Labeling Cost:** Reduces the need for large amounts of labeled data, making machine learning more applicable in scenarios where labeling is expensive.
### Active Learning
Active learning shifts the role of the learning algorithm from a passive receiver of labeled data to an active querier. The algorithm actively selects the most informative instances to be labeled by an oracle (e.g., a human annotator), aiming to achieve a desired level of accuracy with minimal labeling effort.
#### Key Ideas
* **Informative Instance Selection:** Active learning strategies employ various criteria to identify the most informative instances, such as uncertainty sampling (querying instances the model is most uncertain about), expected model change (querying instances that would most significantly impact the model), or query-by-committee (querying instances where a committee of models disagrees).
* **Labeling Efficiency:** The goal is to achieve high accuracy with a significantly smaller number of labeled instances compared to traditional passive learning.
#### Benefits
* **Reduced Labeling Cost:** Minimizes the cost and effort associated with data labeling.
* **Improved Efficiency:** Focuses on the most informative instances, leading to faster learning and better performance with limited labeled data.
### Multi-Task Learning
Multi-task learning aims to learn multiple related tasks simultaneously, rather than treating each task in isolation. This approach leverages the shared information and underlying relationships between tasks to improve learning efficiency and generalization performance across all tasks.
#### Key Ideas
* **Shared Representations:** Multi-task learning models typically learn shared representations or features that are relevant to multiple tasks.
* **Transfer Learning:** Knowledge learned from one task can be transferred to improve performance on other related tasks.
* **Regularization:** Multi-task learning can act as a form of regularization, preventing overfitting to a single task by encouraging the model to learn more general representations.
#### Benefits
* **Improved Generalization:** Learning across multiple tasks can lead to more robust and generalizable models.
* **Increased Efficiency:** Sharing information across tasks can improve learning speed and reduce the amount of data needed for each individual task.
* **Better Performance:** Multi-task learning can lead to improved performance on all tasks, especially when tasks are closely related.
#### Examples
* In natural language processing, multi-task learning can be used to jointly learn tasks like part-of-speech tagging, named entity recognition, and sentiment analysis.
* In computer vision, it can be used to simultaneously learn object detection, image segmentation, and depth estimation.
## References
* Chapter 5 of Blum A, Hopcroft J, Kannan R. Foundations of Data Science. Cambridge University Press; 2020.
### Further Reading
#### Support Vector Machine (SVM)
* James, G., Witten, D., Hastie, T., Tibshirani, R., & Taylor, J. (2023). An Introduction to Statistical Learning with Applications in Python. Springer. (See Chapter 9)
* YouTube Playlist: [link 1](https://youtu.be/Op0OyOuDjcQ?si=8biztGgf2mqmDQCC) [link 2](https://youtu.be/pjvnCEfAswc?si=Sk8bzutDt2ObGqRA) [link 3](https://youtu.be/02icdqOJsH4?si=RVc4x4GnpR9IWWIl) [link 4](https://youtu.be/m5d7-URGnVY?si=H9s8bpfOTrwjCAEo) [link 5](https://youtu.be/94wKFGS4Zm4?si=u18p9-uDMHG2RZxI)
#### Kernel Methods
* Schölkopf, B., & Smola, A. J. (2002). Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press. https://doi.org/10.7551/mitpress/4175.001.0001
* Hofmann, T., Schölkopf, B., & Smola, A. J. (2008). Kernel methods in machine learning. The Annals of Statistics, 36(3), 1171-1220. https://doi.org/10.1214/009053607000000677
* Rudi, A., Mairal, J., Arbel, M., & Vert, J.-P. (2023). Machine learning with kernel methods [Course](https://mva-kernel-methods.github.io/course-2023-2024/)
* Gretton, A. (2023). Reproducing kernel Hilbert spaces in Machine Learning [Course](https://www.gatsby.ucl.ac.uk/~gretton/coursefiles/rkhscourse.html)
#### Machine Learning Theory
* Shalev-Shwartz, S., & Ben-David, S. (2014). Understanding machine learning: From theory to algorithms. Cambridge University Press.
* Abu-Mostafa, Y. S., Magdon-Ismail, M., & Lin, H.-T. (2012). Learning from data. AMLBook.
#### Deep Learning
* LeCun, Y., Bengio, Y. & Hinton, G. (2015). Deep learning. Nature, 521, 436–444. https://doi.org/10.1038/nature14539
* Prince, S. J. D. (2023). Understanding deep learning. The MIT Press. http://udlbook.com
* Geiger, A. (2024). Deep learning [Course](https://uni-tuebingen.de/fakultaeten/mathematisch-naturwissenschaftliche-fakultaet/fachbereiche/informatik/lehrstuehle/autonomous-vision/lectures/deep-learning/)
#### Optimization for Machine Learning
* Hazan, E. (2019). Lecture notes: Optimization for machine learning. arXiv preprint arXiv:1909.03550.
* Garrigos, G., & Gower, R. M. (2023). Handbook of convergence theorems for (stochastic) gradient methods. arXiv preprint arXiv:2301.11235.
#### Convex Optimization
* Boyd, S., & Vandenberghe, L. (2004). Convex optimization. Cambridge University Press. https://web.stanford.edu/~boyd/cvxbook/
* Convex Optimization Short Course by S. Boyd, S. Diamond, J. Park, A. Agrawal, and J. Zhang. [link](https://web.stanford.edu/~boyd/papers/cvx_short_course.html)
#### Online Learning
* Cesa-Bianchi, N., & Lugosi, G. (2006). Prediction, learning, and games. Cambridge University Press.
* Bubeck, S. (2011). Introduction to online optimization [Lecture notes, draft](http://sbubeck.com/BubeckLectureNotes.pdf)
* Hazan, E. (2016). Introduction to online convex optimization. MIT Press. [link](https://sites.google.com/view/intro-oco/)
#### Multi-Armed Bandit
* Lattimore, T., & Szepesvári, C. (2020). Bandit algorithms. Cambridge University Press.
#### Boosting
* Schapire, R. E., & Freund, Y. (2012). Boosting: Foundations and algorithms. The MIT Press. https://doi.org/10.7551/mitpress/8291.001.0001
* Freund, Y., & Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1), 119-139.
* Freund, Y., & Schapire, R. E. (1996). Experiments with a new boosting algorithm. In Proceedings of the Thirteenth International Conference on International Conference on Machine Learning (ICML'96) (pp. 148-156). Morgan Kaufmann Publishers Inc.