---
tags: COTAI Training, Coding, Software Engineering, TodayILearned
title: Notes on Training at CoTAI
---
# Notes on Training at CoTAI
## ML & DL
## MC & MathAIR
(Math foundations for AI & Robotics)
**6/10/2023: Linear Algebra in Python notes**
1. Basic Linear Algebra
:::spoiler
- Vector
:::spoiler
- What?
- A series of numbers ordered vertically (column-form vector) or horizontally (row-form vector).
- The number of elements in a vector is called the 'length' or 'dimension' of a vector.
- Each element in a vector represents the coordinate of its associated dimension.
:::
- Similarity
- What?
- A method to measure the similarity of objects (vectors).
- There are 2 basic methods to calculate the similarity:
- Dot product:
- Formula: $p = a.b = a_1b_1 + a_2b_2 + ... + a_nb_n$
- How?
- The larger $p$ is, the more similar vectors $a$ and $b$ are (they point in the same direction!).
- If $p = 0$, $a$ and $b$ are orthogonal.
- Cons:
- The value of $p$ is unbounded (from $-\infty$ to $\infty$).
- Cosine similarity:
- Formula: $\text{cosine}(a, b) = \frac{a \cdot b}{|a|\,|b|} = \frac{a_1b_1 + a_2b_2 + ... + a_nb_n}{\sqrt{a_1^2 + a_2^2 + ... + a_n^2} \cdot \sqrt{b_1^2 + b_2^2 + ... + b_n^2}}$.
- How?
- The closer $\text{cosine}(a, b)$ is to $1$, the more similar the vectors are.
- If $\text{cosine}(a, b)$ is 0, $a$ and $b$ are orthogonal.
- Matrix
- What?
- A collection of vectors
- Transposing matrix:
- Flipping the matrix over its main diagonal, so rows become columns and vice versa.
- Matrix multiplication:
- $A(m, n) \times B(n, k) = C(m, k)$
- Note: The number of columns of matrix $A$ must be equal to the number of rows of matrix $B$.
:::
2. Numpy
- What?
- A robust library for scientific computing, working with multidimensional array data, and much more.
- How?
- Install, import and use it as a normal module.
- It supports many methods to efficiently work with vectors and matrices:
- ```transpose```
- Formula: ```np.transpose(a)```
- ```dot product```
- Formula: ```np.dot(a, b)``` or ```a.dot(b)``` or ```a @ b```
- ```norm```
- Formula: ```np.linalg.norm(a)```
- ```Element-wise operators```
- Performing operations on the individual elements of vectors, matrices, tensors, and/or scalars.
- Read [more](https://numpy.org/devdocs/index.html) (*highly recommended*)
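A minimal sketch of the operations above (the array values are invented for illustration):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

# Dot product: three equivalent forms
p = np.dot(a, b)    # 32.0
p = a.dot(b)
p = a @ b

# Norm (vector length) and cosine similarity
cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Transpose and matrix multiplication: (2, 3) @ (3, 2) -> (2, 2)
A = np.array([[1, 2, 3], [4, 5, 6]])
C = A @ np.transpose(A)

# Element-wise operations apply to every element
print(a + b, a * 2, cos, C.shape)
```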
**Common Numpy Functions & Matplotlib**
1. Numpy
- Some common Numpy functions to work with ndarray data type:
- mean, min, max, concatenate, **argmin, argmax, where, boolean filtering (masking)**
- Note: ```axis=0``` aggregates down the rows (one result per column); ```axis=1``` aggregates across the columns (one result per row). See the sketch after this list.
- Reference [here](https://numpy.org/doc/stable/reference/routines.sort.html)
2. Matplotlib
- Some common methods to visualize data: ```plot```, ```scatter```, ```bar```, ```pie```, etc
- Learn [more](https://matplotlib.org/stable/users/index)
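A small sketch of the axis convention, the 'arg' functions, and a couple of the plotting methods above (values invented for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

X = np.array([[3, 7, 1],
              [4, 2, 9]])

print(X.mean(axis=0))        # axis=0 -> one result per column: [3.5 4.5 5. ]
print(X.max(axis=1))         # axis=1 -> one result per row: [7 9]
print(np.argmax(X, axis=0))  # row index of the max in each column: [1 0 1]
print(X[np.where(X > 3)])    # boolean filtering: [7 4 9]

# A couple of the plotting methods listed above
x = np.linspace(0, 10, 50)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.plot(x, np.sin(x))               # line plot
ax1.scatter(x, np.cos(x), s=10)      # scatter on the same axes
ax2.bar(["a", "b", "c"], [3, 1, 2])  # bar chart with toy values
plt.show()
```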
**Pandas and Data Analysis**
1. Pandas
- Provides DataFrame data type to work with structured and labeled data.
- Accessing rows and columns using ```loc``` or ```iloc```
- Some common built-in methods: ```filter```, ```map```, ```apply```
2. Data Analysis
- Cleaning, structuring and handling missing values.
- Get some statistics from data:
- Notice on the box plot with some quantiles.
- Visualizing data using various plots: ```pie```, ```box```, ```histogram```, ```bar```
- Aggregating data
- Simplifies complex data sets and extracts key insights (see the sketch below)
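A quick sketch of ```loc```/```iloc```, handling a missing value, ```apply```, and a group-wise aggregation; the column names and values are made up for illustration:

```python
import pandas as pd

# Toy DataFrame (column names are invented for illustration)
df = pd.DataFrame({"city": ["HN", "HCM", "HN", "DN"],
                   "price": [120, 250, None, 90]})

# Label-based vs. position-based access
print(df.loc[0, "price"])    # by label
print(df.iloc[0, 1])         # by position

# Handle a missing value, transform a column, then aggregate
df["price"] = df["price"].fillna(df["price"].mean())
df["price_k"] = df["price"].apply(lambda p: p / 1000)
print(df.groupby("city")["price"].mean())    # aggregate per group
```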
**09/10/2023: KNN & K-Means notes**
1. KNN
- What?
- A supervised learning algorithm to classify new data points based on K-nearest neighbors.
- How?
- Main ideas:
- Calculating the distance from a new data point to all data points in the dataset.
- Pick the K nearest neighbors together with their labels.
- Choose the dominant (majority) label.
- How to choose K neighbors?
- Perhaps trial and error, but K should be an odd number to avoid ties (see the sklearn sketch after this list)!
- Reference [here](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)
2. K-Means
- What?
- An unsupervised learning algorithm to cluster data points into K groups.
- How?
- Main ideas:
- Step 1: Randomly initialize a collection of ```centers```
- Step 2: Calculate the distance from each data point in the dataset to every center in ```centers```
- Step 3: Create a label array by assigning each data point to its nearest center
- Step 4: Re-calculate the new centers as the mean of the points assigned to each cluster
- Step 5: If the new centers are the same as the previous ones, stop. Otherwise, loop from Step 2.
- How to choose K?
- Elbow method
- Silhouette method (more intuitive)
- Reference [here](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html)
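A minimal sklearn sketch of both algorithms on the iris dataset (hyperparameter values chosen only for illustration):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# KNN (supervised): classify held-out points from their 5 nearest labeled neighbors
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print(knn.score(X_test, y_test))   # mean accuracy on the test split

# K-Means (unsupervised): cluster the same data into K=3 groups, ignoring the labels
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)         # the final `centers`
print(km.labels_[:10])             # the label array from Step 3
print(km.inertia_)                 # within-cluster sum of squares (used by the elbow method)
```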
**10/10/2023: Derivatives, gradient descent and plotly**
1. Derivatives
- What?
- Determines the 'trend' of a function: the function is increasing where its derivative is positive and decreasing where it is negative.
- Extremum
- What?
- Refers to 'Local min/max' and 'Global min/max'
- Gradient vector
- What?
- A collection of derivatives of function $f$ by each individual variable.
- Reveals how the function changes with respect to each of its variables -> determines the direction of fastest change.
2. Gradient descent
- What?
- A technique to find an *approximate* extreme value of a function by iteratively following the Gradient vector (see the sketch after this list):
- ascent $\equiv$ *same* direction of Gradient vector $\equiv$ (local) *peak*-finding.
- descent $\equiv$ *opposite* direction of Gradient vector $\equiv$ (local) *valley*-finding.
- An improved version of traditional Gradient descent: Gradient Descent with Momentum.
3. Plotly
- What?
- An open-source library to explore and visualize data
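A toy sketch of gradient descent (with and without momentum) on the one-variable function $f(w) = (w - 3)^2$; the learning rate and momentum values are chosen only for illustration:

```python
# Gradient descent on f(w) = (w - 3)^2, whose gradient is 2 * (w - 3)
w, lr = 0.0, 0.1
for _ in range(100):
    grad = 2 * (w - 3)
    w -= lr * grad       # descent: step against the gradient
print(w)                 # close to the minimizer w = 3

# The same update with momentum: keep a running "velocity"
w, v, beta = 0.0, 0.0, 0.9
for _ in range(100):
    grad = 2 * (w - 3)
    v = beta * v + grad
    w -= lr * v
print(w)                 # also converges towards w = 3
```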
**13/10/2023: Linear Regression**
1. What?
- A model to predict real values from a given dataset.
- Tries to find a parameter set that represents the relationship between data points as a function.
- 2 types: Simple (one feature) vs. Multivariate Linear Regression
2. How?
- Optimize the loss function: Mean Absolute Error (MAE) or Mean Squared Error (MSE),...
3. Why?
- Suitable for simple problems, especially in prediction and forecasting.
- The model is simple enough to explain the relationships among features.
4. Notes
- To save a model: use ```pickle``` module
- Read more on [Linear Regression](https://en.wikipedia.org/wiki/Linear_regression)
- Linear Regression in [scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)
- Linear Regression in [TensorFlow](https://www.tensorflow.org/tutorials/keras/regression)
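A minimal sklearn sketch on invented data, including saving the model with ```pickle```:

```python
import pickle
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: y = 2x + 1 plus a little noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, (100, 1))
y = 2 * X[:, 0] + 1 + rng.normal(0, 0.5, 100)

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)   # close to [2.] and 1.

# Save and reload the trained model with pickle
with open("linreg.pkl", "wb") as f:
    pickle.dump(model, f)
with open("linreg.pkl", "rb") as f:
    model = pickle.load(f)
print(model.predict([[4.0]]))          # roughly 9.0
```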
**13/10/2023: Basic probability & Probability distribution**
1. Probability distribution
- What?
- Describes how the values of a random variable are distributed, i.e. the probability of each possible outcome.
- A discrete probability distribution is represented by a function called the 'Probability Mass Function ```f(x)```' showing the probability of getting the value ```x``` (continuous distributions use a Probability Density Function instead).
- Bernoulli distribution
- What?
- A discrete distribution to describe a binary random variable.
- The value is either ```1``` or ```0```.
- Probability Mass Function: $f_p(k) = p^k(1 - p)^{1 - k}$.
- When?
- The random variable is expected to receive one of two values (```success``` or ```failure```)
- Bernoulli distribution in ```numpy```: [```np.random.binomial```](https://numpy.org/doc/stable/reference/random/generated/numpy.random.binomial.html)
- Categorical distribution
- What?
- A generalization of the Bernoulli distribution where the random variable can take more than 2 values.
- A set of ```k``` parameters can be represented as a vector: $p = (p_1, p_2, p_3,..., p_k)$, in which $p_i \geq 0$ and $\sum_{i}p_i = 1$.
- Probability Mass Function: $f(k) = p(X = k) = p_k$
- When?
- The random variable is not in a binary form, i.e. it can receive $k$ values ($k > 2$).
- Categorical distribution in ```numpy```: [```numpy.random.multinomial```](https://numpy.org/doc/stable/reference/random/generated/numpy.random.multinomial.html).
- Uniform distribution
- What?
- A probability distribution where all results share the same chance to occur.
- There're 2 types: discrete and continuous.
- When?
- All outcomes within a range are equally likely.
- Uniform distribution in ```numpy```: [```np.random.uniform```](https://numpy.org/doc/stable/reference/random/generated/numpy.random.uniform.html).
- Normal/Gauss distribution
- What?
- Representation of many natural phenomena and events in real life (ex: IQ, heights, weights, etc).
- Probability Density Function: $f(x,\mu,\sigma) = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
- When?
- Data is continuous and symmetrically distributed around a mean with finite variance.
- Normal distribution in ```numpy```: [```np.random.normal```](https://numpy.org/doc/stable/reference/random/generated/numpy.random.normal.html).
- Notes:
- Set a random seed (```random.seed```) when working with probability distributions and random functions so that results can be reproduced later (see the sketch below)!
- Learn more about other distribution: [```np.random```](https://numpy.org/doc/stable/reference/random/index.html).
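A small sketch of drawing from the distributions above with a fixed seed (parameter values invented for illustration):

```python
import numpy as np

np.random.seed(42)  # reproducible draws

bern = np.random.binomial(n=1, p=0.3, size=10)            # Bernoulli = binomial with n=1
cat = np.random.multinomial(n=1, pvals=[0.2, 0.5, 0.3])   # one categorical draw (one-hot)
unif = np.random.uniform(low=0.0, high=1.0, size=5)       # all values in [0, 1) equally likely
norm = np.random.normal(loc=0.0, scale=1.0, size=5)       # mean 0, standard deviation 1

print(bern, cat, unif.round(2), norm.round(2))
```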
**16/10/2023: Logistic Regression**
1. What?
- An algorithm for classification problems with 2 classes (labels).
- LogisticRegression = Sigmoid(LinearRegression)
- Using Binary Cross-Entropy as the loss function: $e_i = -y_i\ln\hat{y_i} - (1 - y_i)\ln(1 - \hat{y_i})$.
- Activation functions:
- Sigmoid (more popular nowadays): $\sigma(x) = \frac{1}{1 + e^{-x}}$.
- Tanh: $\text{tanh}(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$.
- LogisticRegression with [```sklearn```](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).
2. Why?
- Efficient on (near-)linearly separable datasets with 2 classes.
- Interpretable through its set of coefficients and bias.
- Provide probabilistic outputs, which can be valuable in various applications.
3. Notes
- Measure accuracy: ```accuracy```, ```precision```, ```recall```, or ```F1```
- The trade-off between ```precision``` and ```recall```:
- $\text{precision} = 1 => \text{False Positive} = 0$
- $\text{recall} = 1 => \text{False Negative} = 0$
- ~Balanced => go with F1
- Remember True Positive, False Negative, False Positive, True Negative:
- The latter word (Positive or Negative): What the model predicts.
- The former word (True or False): Whether the prediction matches the true label.
- E.g. True Negative -> The model says the sample is 'Negative' and that matches the true label.
- Confusion matrix: Represents TP, FN, FP, and TN as a matrix.
- Thresholding: A technique to adjust the decision threshold, trading off TP, FN, FP, and TN depending on whether $precision$ or $recall$ matters more (see the sketch below).
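A minimal sklearn sketch on a built-in binary dataset, computing the metrics above and moving the decision threshold (the 0.3 cutoff is only for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(accuracy_score(y_test, y_pred), precision_score(y_test, y_pred),
      recall_score(y_test, y_pred), f1_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))    # rows: true label, columns: predicted label

# Thresholding: move the default 0.5 cutoff on the predicted probabilities
proba = clf.predict_proba(X_test)[:, 1]
y_high_recall = (proba > 0.3).astype(int)  # lower threshold -> fewer False Negatives
```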
**17/10/2023: Multiclass Linear Classifier**
1. What?
- An algorithm to classify data into $k$ classes ($k \geq 3$).
- MulticlassLinearClassifier = Softmax(LinearRegression).
- Softmax function: Converts an arbitrary real-valued vector into one with positive values that sum to 1, while preserving the ordering of the input entries.
- Formula: $\sigma(z)_i = \frac{e^{z_i}}{\sum_{j = 1}^{k}e^{z_j}}$ for $i = \overline{1..k}, z=[z_1, z_2,..., z_k]$.
- Cross-Entropy: Reflects the difference between 2 probability distributions $p$ and $q$. The smaller the value, the more similar $p$ and $q$ are.
- Formula: $L_{CE}(p, q) = -\sum_{i}p_i\ln q_i$.
- Note: As the values of $q_i$ are always in $(0, 1)$, $\ln q_i$ is always negative. Therefore, $L_{CE}(p, q)$ is always a positive value.
- One-hot encoding: An operation to transform a $true$ label $y = c$ with $c \in \{0,...,k - 1\}$ into a $k$-dimensional vector of $0$s and a single $1$, where the $1$ is at position $c$.
- Categorical Cross-Entropy Loss: Is the sum of Cross-Entropy at each individual sample.
- Formula: $E(\overline{W})=\sum_{i=1}^{n}L_{CE}(y_i,\hat{y_i})$
2. Why?
- Suitable for tasks where there are more than 2 classes with linear (or semi-linear) distribution.
- Simplicity and interpretability: Easy to understand and interpret the parameter set that drives the model's decisions.
- Can serve as a baseline model for other complex algorithms.
3. How?
- Read more on [Multiclass Classification with TensorFlow](https://www.tensorflow.org/tutorials/keras/classification).
4. Notes:
- Data normalization: Transform the values of features to a smaller space, normally $(-1, 1)$ or $(0, 1)$.
- In the formula $y = Wx + b$, $x$ should be a vector; otherwise the shapes are incompatible when building the model. If a sample is given as an $(m, n)$ matrix (e.g. an image), it must be converted into a vector before the Dense layer. In TensorFlow, add a Flatten layer before the Dense layer to flatten the $(m, n)$ matrix into a vector of length $m \cdot n$ (see the sketch below).
- Learn more about [Gradio](https://www.gradio.app/) to quickly interact with machine learning models.
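A minimal TensorFlow/Keras sketch of the Flatten + Dense(softmax) idea on MNIST, along the lines of the tutorial linked above (the 2-epoch setting is only for illustration):

```python
import tensorflow as tf

# MNIST digits: each sample is a (28, 28) matrix, so Flatten turns it into a 784-vector
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0   # normalize pixel values into (0, 1)

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28, 28)),
    tf.keras.layers.Flatten(),                        # (28, 28) matrix -> 784-vector
    tf.keras.layers.Dense(10, activation="softmax"),  # one output probability per class
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",  # integer labels, no one-hot needed
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=2)
model.evaluate(x_test, y_test)
```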
**19/10/2023: Pytorch Basics notes**
1. What?
- A powerful library for Deep Learning, getting more and more popular compared to TensorFlow.
- The core syntax has many similarities to ```numpy``` (index accessing, slicing, filtering, etc) => These basic operations always return a ```tensor``` => To get the actual Python value of a single-element tensor, invoke the ```.item()``` method.
2. Why?
- It comes with the built-in data structure ```tensor``` which makes it easier to work with vectors, matrices and tensors.
- More intuitive and convenient in many scenarios (compared to TensorFlow).
- It may be considered a bit harder-to-write (compared to TensorFlow), but it's a good reason to learn ha?!
3. How?
- Import ```torch``` as a module.
- Learn to use its APIs: https://pytorch.org/
4. Notes
- To design a model in ```PyTorch```, you write a class that inherits from ```torch.nn.Module``` (see the sketch after this list).
- ```model.train()```: Set the model into training mode (as opposed to using it for inference or evaluation --```model.eval()```, etc).
- ```optimizer.zero_grad()```: Reset the gradients of the parameters in a neural network model => Ensure the gradients from the previous batch or iteration don't affect the current batch's parameter updates.
- ```loss.backward()```: Compute the gradients at each iteration.
- ```optimizer.step()```: Update model parameters at each iteration.
- ```with torch.no_grad():```: Disabling Gradient computation => Significantly reduce memory usage and computation time. Learn more [here](https://pytorch.org/docs/stable/generated/torch.no_grad.html#torch.no_grad).
- ```with torch.inference_mode():```: Enabling inference_mode. Learn more [here](https://pytorch.org/docs/stable/generated/torch.inference_mode.html).
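A minimal PyTorch sketch tying the notes above together; the tiny model and toy data are invented for illustration:

```python
import torch
import torch.nn as nn

class TinyClassifier(nn.Module):   # models inherit from torch.nn.Module
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(2, 2)
    def forward(self, x):
        return self.linear(x)

# Toy data: class = 1 if x1 + x2 > 0
X = torch.randn(100, 2)
y = (X.sum(dim=1) > 0).long()

model = TinyClassifier()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

model.train()                      # training mode
for epoch in range(20):
    optimizer.zero_grad()          # clear gradients from the previous iteration
    loss = loss_fn(model(X), y)
    loss.backward()                # compute gradients
    optimizer.step()               # update parameters

model.eval()                       # evaluation mode
with torch.no_grad():              # no gradient bookkeeping during inference
    acc = (model(X).argmax(dim=1) == y).float().mean()
print(loss.item(), acc.item())     # .item() turns a single-element tensor into a Python number
```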
**[MathAIR'19.01: Nền Tảng Toán Cho Trí Tuệ Nhân Tạo](https://www.youtube.com/playlist?list=PLeFDKx7ZCnBZlZCiXnbs2vberhDgSUGge)**
**26/10/2023: MathAIR-brief - Overview**
1. Intelligence
- What?
- The ability to learn and apply knowledge to various areas in life: causal reasoning, planning, creativity, intuition, imagination, commonsense, etc
- General modeling: Knowledge + Skills -> *functions* -> Computer programs
- AI = inputs ~> functions ~> outputs
2. Machine learning
- What?
- Computers automatically learn by experience.
- General model: Given task (T), experience (E), performance measure (P), algorithm (A), and function space (F). Find $\hat{f} \in F$ with as high a generalization measure as possible.
- Learning process $\equiv$ searching in the function space/programs to find the optimal $\hat{f}$
**26/10/2023: MathAIR-brief - Function space**
1. Functions and parameters
- What?
- In a vector space, the linear function $f(x) = ax + b, x \in \mathbb{R}$ is generally called an affine function, obtained by shifting the function $f(x) = ax$ by a distance of $b$.
- Quadratic function: $f(x) = ax^2 + bx + c, a \neq 0$, and $a, b, c \in \mathbb{R}$ are called parameter $\theta$.
- Graphing
- In 2D
- $\{(x, y = f(x))\}$ -> The output $f(x)$ depends on the input variable $x$
- $\{(x, \theta) \mid f(x, \theta) = const\}$ -> The set of inputs $(x, \theta)$ on which $f(x, \theta)$ takes a fixed constant value => This is called $level \; sets/curves$ or $contour \; lines/map$.
- Non-linear functions
- Logistic sigmoid: $\sigma(z) = \frac{1}{1 + e^{-z}} = \frac{e^z}{e^z + 1}$
- Tanh: $f(z) = tanh(z) = 2\sigma(2z) - 1 = \frac{e^z - e^{-z}}{e^z + e^{-z}} = \frac{e^{2z} - 1}{e^{2z} + 1} = \frac{1 - e^{-2z}}{1 + e^{-2z}}$
- General form: $f(x) = \sigma(ax + b)$, $f(x) = tanh(ax + b)$ -> Pass $x$ through the linear function $ax + b$, then pass the output to $sigmoid$ or $tanh$ to squash the final output into a fixed range: $(0, 1)$ for $sigmoid$ or $(-1, 1)$ for $tanh$ (see the sketch at the end of this section).
- Function parameters & function space
- A function is described through operations on variables $x$ and parameters $\theta$:
- $f(x; \theta)$ or $f_\theta(x)$ for $f_\theta: X \to Y, x \in X, \theta \in \Theta$
- The parameter $\theta$ is treated as a kind of high-level input -> Choose the parameter $\theta$ first, then the variable $x$.
- As $\theta$ varies, it creates a function space $\mathcal{F}_\Theta := \{ f(x; \theta) : \theta \in \Theta \}$
- Find the optimal function in the function space: **representation** & **searching**
- Representation: Represent functions in a way that reflects the structure of the input data and makes it easier to search for the optimal function.
- Searching: Parameter space (e.g. $\theta_1$, $\theta_2$) creates a function space -> Scan in the parameter space to find an optimal value of the function.
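A tiny numpy sketch of $f(x) = \sigma(ax + b)$ and $\tanh(ax + b)$; the parameter values $a, b$ are chosen only for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

a, b = 2.0, -1.0                 # parameters theta, chosen for illustration
x = np.linspace(-3, 3, 7)

f_sig = sigmoid(a * x + b)       # outputs squashed into (0, 1)
f_tanh = np.tanh(a * x + b)      # outputs squashed into (-1, 1)
print(f_sig.round(3))
print(f_tanh.round(3))
```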
**26/10/2023: MathAIR-brief - Representation: Vector and vector space**
1. Vector
- What?
- Vector $\equiv$ an arrow $\longmapsto$ (with root & top, direction and magnitude).
- Vector $\equiv$ a point $\odot$ in a vector space.
- Vector space $\equiv$ 4-tuple $\langle V, \mathbb{R}, +, * \rangle$.
- Definition: If a set $V$ with two operations satisfies:
1) addition of 2 elements: $u + v \in V, \forall u, v \in V \; (translation)$
2) multiplied by a scalar: $aV \in V, \forall v \in V, a \in \mathbb{R}$
-> $V$ is a vector space, and $v \in V$ is a vector.
- That is to say, if we shift (add $+$) or scale (multiply $*$, scalar multiply) vectors in $V$, their results are still in $V$.
- Examples
- Describing a person (height, weight, age, etc)
- Coordinate space:
- $V = \mathbb{R}^n = \mathbb{R} \times ... \times \mathbb{R}: Cartesian \; (direct) \; product$
- $V \ni v = (v_1, ..., v_n) =: [v_i], v_i \in \mathbb{R}, \forall i \in \{ 1,...,n \}$
- Addition of 2 vectors:
- $v_i = u_i + w_i$
- $v = u + w \in V$
- Scalar multiplication:
- $v_i = \alpha u_i$
- $v = \alpha u \in V$ (*element-wise* operation)
- Describing video, image data
- Tensor space (multidimensional arrays)
- 0-d tensor (scalars): $\mathbb{R} \ni v$
- 1-d tensor (vectors): $\mathbb{R}^m \ni v = [v_i]$
- 2-d tensor (matrices): $\mathbb{R}^{m\times n} \ni v = [v_{ij}]$
- 3-d tensor: $V = \mathbb{R}^{m \times n \times p} \ni v = [v_{ijk}]$, etc
With $v_{ijk} \in \mathbb{R}, i \in \{1,...,m\}, j \in \{1,...,n\}, k \in \{1,...,p\}$
- Addition of 2 vectors:
- $v_{ijk} = u_{ijk} + w_{ijk}$
- $v = u + w \in V$
- Scalar multiplication:
- $v_{ijk} = \alpha u_{ijk}$
- $v = \alpha u \in V$ (*element-wise* operation)
- Representing some function spaces
- $P_n(\mathbb{R})$: Polynomial function space of $z \in \mathbb{R}$ with the degree $\le n$
- $f(z) = a_0 + a_1z + a_2z^2 + ... + a_nz^n = \sum_{i = 0}^{n}a_iz^i \in P_n(\mathbb{R})$
- $g(z) = b_0 + b_1z + b_2z^2 + ... + b_nz^n = \sum_{i = 0}^{n}b_iz^i \in P_n(\mathbb{R})$
- Addition of 2 vectors: $h = f + g; h(z) = \sum_{i = 0}^{n}(a_i + b_i)z^i \in P_n(\mathbb{R})$
- Scalar multiplication: $v = \alpha f; v(z) = \sum_{i = 0}^{n}\alpha a_iz^i \in P_n(\mathbb{R})$
- Note: $\{ 1, z, z^2,..., z^n \}$ are also vectors in $P_n(\mathbb{R})$
- $\{e_i\} = \{z^i\}_{i = 0}^n$ is called a basis of $P_n(\mathbb{R})$
- $\forall v \in V = P_n(\mathbb{R}) : v = a_0e_0 + ... + a_ne_n = \sum_{i = 0}^{n}a_ie_i$
- $v$ = linear combination of basis vectors
- Finding optimal basis functions is at the heart of ML!
- Basis $\Rightarrow$ coordinate space
- Directions
- Landmarks
- Features
- Words
- Prototypes, patterns, templates, ...
- Regularities, abstractions, ...
Choose and arrange a basis (ordered basis) $\mathcal{E} = (e_1, ..., e_n) \xrightarrow{\forall v \in V} \text{decomposition } [v]_\mathcal{E} = \text{coordinates } (a_1, ..., a_n) \in \mathbb{R}^n$
$\fbox{1}$ $a_i \approx$ the degree of similarity/difference (e.g., frequency) of $v$ and $e_i$
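A small numpy sketch of the polynomial example: once the basis $\{1, z, ..., z^n\}$ is fixed, a polynomial is just its coefficient (coordinate) vector, and addition/scaling act element-wise on those coordinates (values invented for illustration):

```python
import numpy as np

# Coordinates of f(z) = 1 + 2z + 3z^2 and g(z) = 4 - z^2 in the basis {1, z, z^2}
f = np.array([1.0, 2.0, 3.0])
g = np.array([4.0, 0.0, -1.0])

h = f + g        # coefficients of (f + g)(z) = 5 + 2z + 2z^2
v = 0.5 * f      # coefficients of 0.5 * f(z)

# Evaluating a polynomial = dot product of its coordinates with (1, z, z^2)
z = 2.0
basis = np.array([1.0, z, z**2])
print(h @ basis)  # (f + g)(2) = 5 + 4 + 8 = 17
print(v @ basis)  # 0.5 * f(2) = 0.5 * 17 = 8.5
```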
**26/10/2023: MathAIR-brief - Basis (cont.)**
1. Purpose
- Transforming an abstract vector space into a coordinate $\mathbb{R}^n$ for easier manipulating.
- Playing the role of directions or landmarks, it helps build functions that let us search for optimal functions in the function space.
- Notes
- $\fbox{2}$ $V \xrightarrow[]{\mathcal{E}} \mathbb{R}^n$: all vector spaces can be transformed to $\mathbb{R}^n$.
- Dimension $n$: The ***least*** number of vectors to represent $\forall v \in V$.
2. Linear transformation $\Leftrightarrow$ Matrix
- What?
- Function $f: V \to W$ transforms from vector space $V$ to $W$.
- Choose basis $\mathcal{B}$ for $V \xrightarrow[space]{coordinate} [v]_\mathcal{B} \in \mathbb{R}^n, \forall v \in V$.
- Choose basis $\mathcal{D}$ for $W \xrightarrow[space]{coordinate} [w]_\mathcal{D} \in \mathbb{R}^m, \forall w \in W$.
- Function $w = f(v) \xrightarrow[space]{coordinate}[w]_\mathcal{D} = f_\mathcal{D}^\mathcal{B}[v]_\mathcal{B}$ -> $f_\mathcal{D}^\mathcal{B}$ is a matrix $\mathbb{R}^{m \times n}$.
- $f: V \to W$ is linear if $f(au + bv) = af(u) + bf(v)$ (i.e. Add and scale the input -> the output is also added and scaled)
- $f(au + bv) \neq af(u) + bf(v)$: non-linear function.
- Example: Sigmoid function: $\sigma(z) = \frac{1}{1 + e^{-z}} = \frac{e^z}{e^z + 1}$ for $z \in \mathbb{R}^n$
- Provable: in coordinate spaces with the corresponding $\mathbb{R}^n, \mathbb{R}^m$,
$\fbox{3}$ linear $f \Leftrightarrow f_\mathcal{D}^\mathcal{B} = \text{matrix } M_f \in \mathbb{R}^{m \times n}: [w]_\mathcal{D} = M_f[v]_\mathcal{B}$.
$\fbox{2} + \fbox{3}$ results in the growth of matrix operations in $\mathbb{R}^n$
3. System of linear equations
- What?
$$
\begin{cases}
a_{11}x_1 + a_{12}x_2 + ... + a_{1n}x_n = b_1 \\
a_{21}x_1 + a_{22}x_2 + ... + a_{2n}x_n = b_2 \\
\quad\vdots \\
a_{m1}x_1 + a_{m2}x_2 + ... + a_{mn}x_n = b_m
\end{cases}
\Leftrightarrow Ax = b
$$
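A quick numpy sketch of solving such a system (the 3×3 values are invented for illustration):

```python
import numpy as np

# A small system A x = b
A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
b = np.array([3.0, 5.0, 3.0])

x = np.linalg.solve(A, b)      # exact solution when A is square and invertible
print(x)                       # [1. 1. 1.]
print(np.allclose(A @ x, b))   # True
```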
**27/10/2023: MathAIR-brief - Cont.**
1. Basis and coordinates
- What?
- Provable in the inner product space $\langle V, \mathbb{R}, +, *, \cdot \rangle$:
$\fbox{4}$ Representation theorem: $\forall$ linear $f: V \to \mathbb{R}$, $\exists$ unique $v_f \in V$ s.t. $f(v) = \langle v_f, v \rangle, \forall v \in V$.
- Each basis vector (e.g., image) $\stackrel{\langle \cdot, \cdot \rangle}{=}$ one linear function.
$\fbox{5}$ In the orthonormal basis $\mathcal{U}$:
- Inner product $\langle v, u \rangle \in V \Rightarrow$ dot product $[v]_\mathcal{U} \cdot [u]_\mathcal{U}$ in $\mathbb{R}^n$.
$\Rightarrow \forall$ abstract vector space $V$ $\stackrel{\mathcal{U}}{\equiv}$ Euclidean space $\mathbb{R}^n$.
- Coordinates $[v]_\mathcal{U}$ = inner products $\langle v, u_i \rangle, i = 1,...,n$.
- Most of ML models are built by 2 methods:
1. Select from the input space a set of vectors as the basis:
- just right: linearly independent basis vectors. The set $\{ e_1, ..., e_n \} = \mathcal{E} \subset V$ is linearly independent if $0_V = \sum{a_ie_i}$ holds iff $a_i = 0, \forall i = 1...n$.
- Overcomplete: dictionary, codebook, bag-of-features.
- Undercomplete: bottleneck/dimensionality reduction.
- Examples:
- Sparse coding for images
- Representation $v = \sum{\alpha_i\phi_i}$ with $n \gg m$ = the number of coordinates $\alpha_i \neq 0$
- E.g., $[\alpha_1,...,\alpha_{64}] = [0,...,0.8,...,0.3,...,0.5,...,0]$
- Sparse coding for acoustics
- Sparse coding for text documents
- "One-hot" encoding: one word is a basis vector.
- Vector length = vocabulary size (huge, ~10k).
- Orthogonal: Doesn't reflect word similarity/distance.
- Encoding for text documents
- "Bag-of-words" encoding: e.g., frequencies, ignoring context.
- TF-IDF Statistic: Multiplication of 1) TF score (term frequency) of each word in the document, and 2) IDF score (inverse document frequency --word rarity, i.e. feature) in all documents.
- Featurized representations for text documents
- Word embedding: vector length (~ 300) $\ll$ vocabulary size.
- Basis vectors capture semantic meanings.
- t-SNE visualization
$\Rightarrow$ Parameterized basis vectors (functions) $\Rightarrow$ function space
Coordinates as latent embedding, latent state variables
$\Rightarrow$ Computing coordinates = "feature extraction"
- To increase the expressiveness of a function space: (2) + (3)
- (2) affine + nonlinear mappings of coordinate vector x
$x'\leftarrow \sigma(Ax + b)$
2. Affine & nonlinear coordinate mappings
- What?
- $\forall$ linear $T: V \to W$ consists of 3 basic transforming operations: rotate (or reflect), scale (some coordinates set to 0) & rotate back (not necessarily the same as the initial rotation).
- Linear mapping in coordinate space: $y = Ax$
$\fbox{6}$ Singular Value Decomposition (SVD): $\forall A \in \mathbb{R}^{m \times n}$
$A_{m \times n} = U_{m \times r}S_{r \times r}V_{n \times r}^T, r \leq min(m, n)$
$Ax$ = rotate $x$ + scale (some coordinates to 0) + rotate
$U, V$: matrices of orthonormal vectors, $S = diag(\sigma_1,...,\sigma_r)$ of positive scalars
- Special case: symmetric $A \in \mathbb{R}^{n \times n}$
$\fbox{7}$ Symmetric Eigenvalue/Spectral Decomposition
$A_{n \times n} = U_{n \times n}\Lambda_{n \times n}U_{n \times n}^T$ -> (Rotate + Scale + Rotate) -> Note: In this case, both the 'rotate' operations are the same
$U$: matrix of orthonormal vectors $\{u_i \}_{i=1}^n$, $\Lambda = diag(\lambda_1,...,\lambda_n)$ of real numbers (they can be *negative* or *positive*)
$Ax$ = "scale x vertically on directions of $u_i$ by an amount of $\lambda_i$"
Eigenvalue $\lambda_i$ is associated with eigenvector $u_i$: $Au_i = \lambda_i u_i$ (see the numpy sketch at the end of this section)
$y \leftarrow y + b$: translation (by $b$)
Translation $\notin$ linear transformations
3. Compositional & hierarchical basis: increased abstractions
- ML as "pattern recognition/template matching"
- Final representation amenable to linear models
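A small numpy sketch of the two decompositions in $\fbox{6}$ and $\fbox{7}$ on a toy random matrix:

```python
import numpy as np

rng = np.random.default_rng(0)

# SVD of an arbitrary (3, 2) matrix: A = U S V^T  (rotate + scale + rotate)
A = rng.normal(size=(3, 2))
U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(np.allclose(A, U @ np.diag(s) @ Vt))           # True

# Spectral decomposition of a symmetric matrix: B = U Lambda U^T
B = A.T @ A                                          # symmetric by construction
lam, Q = np.linalg.eigh(B)
print(np.allclose(B, Q @ np.diag(lam) @ Q.T))        # True
print(np.allclose(B @ Q[:, 0], lam[0] * Q[:, 0]))    # B u_i = lambda_i u_i
```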
**27/10/2023: MathAIR-brief - Search in the function space**
Machine learning problem $\equiv$ 5-tuple $\langle T, E, F, P, A \rangle$
- Target function $f(x; \theta)$
- Performance: objective/cost/loss function $P(f(x; \theta), E)$
- In ML: $\theta$ becomes variables and data $E$ becomes parameters
$\Rightarrow$ optimize $P(\theta; E)$: ML as "curve-fitting"
- Minimize the loss function $L(\theta; D): \Theta \to \mathbb{R}$ with 2D variables $\theta = w = (w_1, w_2)$ i.e., 2 parameters in the target function $f(x; \theta)$
1. Local search for optimum
- E.g., gradient-based (steepest direction)
- At each point $M_0(w_1^0, w_2^0)$ in the parameter space $\Theta = \mathcal{W}$, find the direction so that the function $L(w; D)$ changes with the fastest pace.
2. Global search for optimum
- E.g., blackbox (gradient-free), evolution strategies, etc
Gradient as steepest direction & extremums
- Given a function $f: V \to W$ between 2 vector spaces.
- Considering a fixed direction in the input space $v \in V$.
- We want to describe the rate of change (if any) of $f(x)$ in the direction of $v$ at an arbitrary point $x$, denoted $\partial_vf(x) \equiv \frac{\partial f(x)}{\partial v}$:
- $\forall x \in V$, $\partial_vf(x) := \lim_{t\to0} \frac{f(x + tv) - f(x)}{t} \in W, t \in \mathbb{R}$
$\rightarrow$ This is the directional derivative of $f(x)$ at $x$ by the given direction $v$.
Provable: $\partial_vf: V \to W$ is a linear function by $v$
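A small numerical sketch checking that the directional derivative equals the dot product of the gradient with the direction, for a toy function chosen only for illustration:

```python
import numpy as np

def f(x):         # toy function f(x1, x2) = x1^2 + 3 * x1 * x2
    return x[0] ** 2 + 3 * x[0] * x[1]

def grad_f(x):    # its analytic gradient
    return np.array([2 * x[0] + 3 * x[1], 3 * x[0]])

x = np.array([1.0, 2.0])
v = np.array([0.6, 0.8])   # a fixed direction (unit length)
t = 1e-6

numeric = (f(x + t * v) - f(x)) / t   # the limit definition, with a small t
analytic = grad_f(x) @ v              # directional derivative = <grad f, v>
print(numeric, analytic)              # both approximately 7.2
```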
## Python
**3/10/2023: Streamlit notes**
- What?
- An open-source Python library for creating web apps for machine learning and data science.
- Read [more](https://docs.streamlit.io).
- Why?
- Before people can make use of machine learning models, we need a place for them to interact with the models?!
- It's pretty flexible to include HTML code inside the framework Python file.
- Faster to build with than Dash or Flask for data-focused web apps and dashboards, without needing extensive web development knowledge!
- How?
- Install it using pip
- Import it as a module
- Use layouts and containers (e.g. sidebar, columns, tabs, container, etc) to structure the page
- Create elements (e.g. text, data, chart, media, etc) and widgets (e.g. input, etc), then attach them to the layouts (see the sketch below)
- Run [Streamlit in Colab](https://discuss.streamlit.io/t/how-to-launch-streamlit-app-from-google-colab-notebook/42399).
- Some use cases for Streamlit
- Data dashboards & visualization: Excellent for building data dashboards that allow users to interact with and explore data visually.
- Machine learning prototypes: Streamlit allows users to create web apps to showcase model results, input parameters, and predictions.
- EDA: Users can build apps that load data, perform basic statistical analyses and generate visualizations to explore data and gain more insights.
- ...
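A minimal Streamlit app sketch using a sidebar widget, columns, a data element, and a chart element; the file name and values are invented for illustration:

```python
# app.py -- run with: streamlit run app.py
import numpy as np
import pandas as pd
import streamlit as st

st.title("Tiny demo dashboard")

# Widgets live in the sidebar; the main area holds elements (text, data, charts)
n = st.sidebar.slider("Number of points", 10, 500, 100)

df = pd.DataFrame(np.random.randn(n, 2), columns=["x", "y"])
col1, col2 = st.columns(2)
col1.dataframe(df.head())   # data element
col2.line_chart(df)         # chart element
```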
<!-- +++++++++ DRAFT +++++++++ -->
$L := MSE = \frac{1}{n}\sum_{i = 1}^{n}(\hat{y_i} - y_i)^2$
$= \frac{1}{n}\sum_{i = 1}^{n}(\bar{w}\bar{x_i} - y_i)^2$
$= \frac{1}{n}\sum_{i = 1}^{n}(\bar{w}^2\bar{x_i}^2 - 2\bar{w}\bar{x_i}y_i + y_i^2)$
$$\nabla_\bar{w}L(\bar{w}) = \frac{1}{n}\sum_{i = 1}^{n}(2\bar{x_i}^2\bar{w} - 2\bar{x_i}y_i)$$
$$= \frac{2}{n}\sum_{i = 1}^{n}\bar{x_i}(\bar{w}\bar{x_i} - y_i)$$
$$= \frac{2}{n}\sum_{i = 1}^{n}\bar{x_i}(\hat{y_i} - y_i)$$
<!-- Logistic Regression -->
$L := BCE = \sum_{i}e_i$
$= \sum_{i}(-y_i\ln\hat{y_i} - (1 - y_i)\ln(1 - \hat{y_i}))$, with $\hat{y_i} = \sigma(\bar{w}\bar{x_i})$
Using $\sigma'(z) = \sigma(z)(1 - \sigma(z))$, we have $\nabla_\bar{w}\hat{y_i} = \hat{y_i}(1 - \hat{y_i})\bar{x_i}$, so:
$$
\nabla_\bar{w}L(\bar{w}) = \sum_{i}\left(-y_i\frac{\nabla_\bar{w}\hat{y_i}}{\hat{y_i}} + (1 - y_i)\frac{\nabla_\bar{w}\hat{y_i}}{1 - \hat{y_i}}\right)
$$
$$
= \sum_{i}\left(-y_i(1 - \hat{y_i})\bar{x_i} + (1 - y_i)\hat{y_i}\bar{x_i}\right)
$$
$$
= \sum_{i}\bar{x_i}(\hat{y_i} - y_i)
$$