---
tags: COTAI Training, Coding, Software Engineering, TodayILearned
title: Notes on Training at CoTAI
---
# Notes on Training at CoTAI
## ML & DL
## MC & MathAIR
(Math foundations for AI & Robotics)
**6/10/2023: Linear Algebra in Python notes**
1. Basic Linear Algebra
:::spoiler
- Vector
:::spoiler
- What?
- A series of numbers ordered vertically (column-form vector) or horizontally (row-form vector).
- The number of elements in a vector is called the 'length' or 'dimension' of a vector.
- Each element in a vector represents the coordinate of its associated dimension.
:::
- Similarity
- What?
- A method to measure the similarity of objects (vectors).
- There are 2 basic methods to calculate the similarity:
- Dot product:
- Formula: $p = a.b = a_1b_1 + a_2b_2 + ... + a_nb_n$
- How?
- The larger $p$ is, the more similar vectors $a$ and $b$ are (they point in the same direction!).
- If $p = 0$, $a$ and $b$ are orthogonal.
- Cons:
- The value of $p$ is unbounded (from $-\infty$ to $\infty$).
- Cosine similarity:
- Formula: $\text{cosine}(a, b) = \frac{a \cdot b}{|a|\,|b|} = \frac{a_1b_1 + a_2b_2 + ... + a_nb_n}{\sqrt{a_1^2 + a_2^2 + ... + a_n^2} \cdot \sqrt{b_1^2 + b_2^2 + ... + b_n^2}}$.
- How?
- The closer $\text{cosine}(a, b)$ is to $1$, the more similar the vectors are.
- If $\text{cosine}(a, b)$ is 0, $a$ and $b$ are orthogonal.
- Matrix
- What?
- A collection of vectors
- Transposing matrix:
- Flipping the matrix over its main diagonal, so rows become columns and vice versa.
- Matrix multiplication:
- $A(m, n) \times B(n, k) = C(m, k)$
- Note: The number of columns of matrix $A$ must be equal to the number of rows of matrix $B$.
:::
2. Numpy
- What?
- A robust library for scientific computing, working with multidimensional array data, and much more.
- How?
- Install, import and use it as a normal module.
- It supports many methods to efficiently work with vectors and matrices:
- ```transpose```
- Formula: ```np.transpose(a)```
- ```dot product```
- Formula: ```np.dot(a, b)``` or ```a.dot(b)``` or ```a @ b```
- ```norm```
- Formula: ```np.linalg.norm(a)```
- ```Element-wise operators```
- Performing operations on the individual elements of vectors, matrices, tensors, and/or scalars.
- Read [more](https://numpy.org/devdocs/index.html) (*highly recommended*)
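A minimal sketch of the operations above (the array values are invented for illustration):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

# Dot product: three equivalent forms
p = np.dot(a, b)    # 32.0
p = a.dot(b)
p = a @ b

# Norm (vector length) and cosine similarity
cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Transpose and matrix multiplication: (2, 3) @ (3, 2) -> (2, 2)
A = np.array([[1, 2, 3], [4, 5, 6]])
C = A @ np.transpose(A)

# Element-wise operations apply to every element
print(a + b, a * 2, cos, C.shape)
```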
**Common Numpy Functions & Matplotlib**
1. Numpy
- Some common Numpy functions to work with ndarray data type:
- mean, min, max, concatenate, **argmin, argmax, where, boolean filtering (masking)**
- Note: ```axis=0``` aggregates down the rows (one result per column); ```axis=1``` aggregates across the columns (one result per row). See the sketch after this list.
- Reference [here](https://numpy.org/doc/stable/reference/routines.sort.html)
2. Matplotlib
- Some common methods to visualize data: ```plot```, ```scatter```, ```bar```, ```pie```, etc
- Learn [more](https://matplotlib.org/stable/users/index)
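A small sketch of the axis convention, the 'arg' functions, and a couple of the plotting methods above (values invented for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

X = np.array([[3, 7, 1],
              [4, 2, 9]])

print(X.mean(axis=0))        # axis=0 -> one result per column: [3.5 4.5 5. ]
print(X.max(axis=1))         # axis=1 -> one result per row: [7 9]
print(np.argmax(X, axis=0))  # row index of the max in each column: [1 0 1]
print(X[np.where(X > 3)])    # boolean filtering: [7 4 9]

# A couple of the plotting methods listed above
x = np.linspace(0, 10, 50)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.plot(x, np.sin(x))               # line plot
ax1.scatter(x, np.cos(x), s=10)      # scatter on the same axes
ax2.bar(["a", "b", "c"], [3, 1, 2])  # bar chart with toy values
plt.show()
```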
**Pandas and Data Analysis**
1. Pandas
- Provides DataFrame data type to work with structured and labeled data.
- Accessing rows and columns using ```loc``` or ```iloc```
- Some common built-in methods: ```filter```, ```map```, ```apply```
2. Data Analysis
- Cleaning, structuring and handling missing values.
- Get some statistics from data:
- Notice on the box plot with some quantiles.
- Visualizing data using various plots: ```pie```, ```box```, ```histogram```, ```bar```
- Aggregating data
- Simplifies complex data sets and extracts key insights (see the sketch below)
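A quick sketch of ```loc```/```iloc```, handling a missing value, ```apply```, and a group-wise aggregation; the column names and values are made up for illustration:

```python
import pandas as pd

# Toy DataFrame (column names are invented for illustration)
df = pd.DataFrame({"city": ["HN", "HCM", "HN", "DN"],
                   "price": [120, 250, None, 90]})

# Label-based vs. position-based access
print(df.loc[0, "price"])    # by label
print(df.iloc[0, 1])         # by position

# Handle a missing value, transform a column, then aggregate
df["price"] = df["price"].fillna(df["price"].mean())
df["price_k"] = df["price"].apply(lambda p: p / 1000)
print(df.groupby("city")["price"].mean())    # aggregate per group
```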
**09/10/2023: KNN & K-Means notes**
1. KNN
- What?
- A supervised learning algorithm to classify new data points based on K-nearest neighbors.
- How?
- Main ideas:
- Calculating the distance from a new data point to all data points in the dataset.
- Pick the K nearest neighbors together with their labels.
- Choose the dominant (majority) label.
- How to choose K neighbors?
- Perhaps trial and error, but K should be an odd number to avoid ties (see the sklearn sketch after this list)!
- Reference [here](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)
2. K-Means
- What?
- An unsupervised learning algorithm to cluster data points into K groups.
- How?
- Main ideas:
- Step 1: Randomly initialize a collection of ```centers```
- Step 2: Calculate the distance from each data point in the dataset to every center in ```centers```
- Step 3: Create a label array by assigning each data point to its nearest center
- Step 4: Re-calculate the new centers as the mean of the points assigned to each cluster
- Step 5: If the new centers are the same as the previous ones, stop. Otherwise, loop from Step 2.
- How to choose K?
- Elbow method
- Silhouette method (more intuitive)
- Reference [here](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html)
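A minimal sklearn sketch of both algorithms on the iris dataset (hyperparameter values chosen only for illustration):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# KNN (supervised): classify held-out points from their 5 nearest labeled neighbors
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print(knn.score(X_test, y_test))   # mean accuracy on the test split

# K-Means (unsupervised): cluster the same data into K=3 groups, ignoring the labels
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)         # the final `centers`
print(km.labels_[:10])             # the label array from Step 3
print(km.inertia_)                 # within-cluster sum of squares (used by the elbow method)
```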
**10/10/2023: Derivatives, gradient descent and plotly**
1. Derivatives
- What?
- Determines the 'trend' of a function: the function is increasing where its derivative is positive and decreasing where it is negative.
- Extremum
- What?
- Refers to 'Local min/max' and 'Global min/max'
- Gradient vector
- What?
- A collection of derivatives of function $f$ by each individual variable.
- Reveals how the function changes with respect to each of its variables -> determines the direction of fastest change.
2. Gradient descent
- What?
- A technique to find an *approximate* extreme value of a function by iteratively following the Gradient vector (see the sketch after this list):
- ascent $\equiv$ *same* direction of Gradient vector $\equiv$ (local) *peak*-finding.
- descent $\equiv$ *opposite* direction of Gradient vector $\equiv$ (local) *valley*-finding.
- An improved version of traditional Gradient descent: Gradient Descent with Momentum.
3. Plotly
- What?
- An open-source library to explore and visualize data
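A toy sketch of gradient descent (with and without momentum) on the one-variable function $f(w) = (w - 3)^2$; the learning rate and momentum values are chosen only for illustration:

```python
# Gradient descent on f(w) = (w - 3)^2, whose gradient is 2 * (w - 3)
w, lr = 0.0, 0.1
for _ in range(100):
    grad = 2 * (w - 3)
    w -= lr * grad       # descent: step against the gradient
print(w)                 # close to the minimizer w = 3

# The same update with momentum: keep a running "velocity"
w, v, beta = 0.0, 0.0, 0.9
for _ in range(100):
    grad = 2 * (w - 3)
    v = beta * v + grad
    w -= lr * v
print(w)                 # also converges towards w = 3
```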
**13/10/2023: Linear Regression**
1. What?
- A model to predict real values from a given dataset.
- Tries to find a parameter set that represents the relationship between data points as a function.
- 2 types: Simple (one feature) vs. Multivariate Linear Regression
2. How?
- Optimize the loss function: Mean Absolute Error (MAE) or Mean Squared Error (MSE),...
3. Why?
- Suitable for simple problems, especially in prediction and forecasting.
- The model is simple enough to explain the relationships among features.
4. Notes
- To save a model: use ```pickle``` module
- Read more on [Linear Regression](https://en.wikipedia.org/wiki/Linear_regression)
- Linear Regression in [scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)
- Linear Regression in [TensorFlow](https://www.tensorflow.org/tutorials/keras/regression)
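A minimal sklearn sketch on invented data, including saving the model with ```pickle```:

```python
import pickle
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: y = 2x + 1 plus a little noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, (100, 1))
y = 2 * X[:, 0] + 1 + rng.normal(0, 0.5, 100)

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)   # close to [2.] and 1.

# Save and reload the trained model with pickle
with open("linreg.pkl", "wb") as f:
    pickle.dump(model, f)
with open("linreg.pkl", "rb") as f:
    model = pickle.load(f)
print(model.predict([[4.0]]))          # roughly 9.0
```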
**13/10/2023: Basic probability & Probability distribution**
1. Probability distribution
- What?
- Describes how the values of a random variable are distributed, i.e. the probability of each possible outcome.
- A discrete probability distribution is represented by a function called the 'Probability Mass Function ```f(x)```' showing the probability of getting the value ```x``` (continuous distributions use a Probability Density Function instead).
- Bernoulli distribution
- What?
- A discrete distribution to describe a binary random variable.
- The value is either ```1``` or ```0```.
- Probability Mass Function: $f_p(k) = p^k(1 - p)^{1 - k}$.
- When?
- The random variable is expected to receive one of two values (```success``` or ```failure```)
- Bernoulli distribution in ```numpy```: [```np.random.binomial```](https://numpy.org/doc/stable/reference/random/generated/numpy.random.binomial.html)
- Categorical distribution
- What?
- A generalization of the Bernoulli distribution where the random variable can take more than 2 values.
- A set of ```k``` parameters can be represented as a vector: $p = (p_1, p_2, p_3,..., p_k)$, in which $p_i \geq 0$ and $\sum_{i}p_i = 1$.
- Probability Mass Function: $f(k) = p(X = k) = p_k$
- When?
- The random variable is not in a binary form, i.e. it can receive $k$ values ($k > 2$).
- Categorical distribution in ```numpy```: [```numpy.random.multinomial```](https://numpy.org/doc/stable/reference/random/generated/numpy.random.multinomial.html).
- Uniform distribution
- What?
- A probability distribution where all results share the same chance to occur.
- There're 2 types: discrete and continuous.
- When?
- All outcomes within a range are equally likely.
- Uniform distribution in ```numpy```: [```np.random.uniform```](https://numpy.org/doc/stable/reference/random/generated/numpy.random.uniform.html).
- Normal/Gauss distribution
- What?
- Representation of many natural phenomena and events in real life (ex: IQ, heights, weights, etc).
- Probability Density Function: $f(x,\mu,\sigma) = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
- When?
- Data is continuous and symmetrically distributed around a mean with finite variance.
- Normal distribution in ```numpy```: [```np.random.normal```](https://numpy.org/doc/stable/reference/random/generated/numpy.random.normal.html).
- Notes:
- Set a random seed (```random.seed```) when working with probability distributions and random functions so that results can be reproduced later (see the sketch below)!
- Learn more about other distribution: [```np.random```](https://numpy.org/doc/stable/reference/random/index.html).
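A small sketch of drawing from the distributions above with a fixed seed (parameter values invented for illustration):

```python
import numpy as np

np.random.seed(42)  # reproducible draws

bern = np.random.binomial(n=1, p=0.3, size=10)            # Bernoulli = binomial with n=1
cat = np.random.multinomial(n=1, pvals=[0.2, 0.5, 0.3])   # one categorical draw (one-hot)
unif = np.random.uniform(low=0.0, high=1.0, size=5)       # all values in [0, 1) equally likely
norm = np.random.normal(loc=0.0, scale=1.0, size=5)       # mean 0, standard deviation 1

print(bern, cat, unif.round(2), norm.round(2))
```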
**16/10/2023: Logistic Regression**
1. What?
- An algorithm for classification problems with 2 classes (labels).
- LogisticRegression = Sigmoid(LinearRegression)
- Using Binary Cross-Entropy as the loss function: $e_i = -y_i\ln\hat{y_i} - (1 - y_i)\ln(1 - \hat{y_i})$.
- Activation functions:
- Sigmoid (more popular nowadays): $\sigma(x) = \frac{1}{1 + e^{-x}}$.
- Tanh: $\text{tanh}(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$.
- LogisticRegression with [```sklearn```](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).
2. Why?
- Efficient on (near-)linearly separable datasets with 2 classes.
- Interpretable through its set of coefficients and bias.
- Provide probabilistic outputs, which can be valuable in various applications.
3. Notes
- Measure accuracy: ```accuracy```, ```precision```, ```recall```, or ```F1```
- The trade-off between ```precision``` and ```recall```:
- $\text{precision} = 1 => \text{False Positive} = 0$
- $\text{recall} = 1 => \text{False Negative} = 0$
- ~Balanced => go with F1
- Remember True Positive, False Negative, False Positive, True Negative:
- The latter word (Positive or Negative): What the model predicts.
- The former word (True or False): Whether the prediction matches the true label.
- E.g. True Negative -> The model says the sample is 'Negative' and that matches the true label.
- Confusion matrix: Represents TP, FN, FP, and TN as a matrix.
- Thresholding: A technique to adjust the decision threshold, trading off TP, FN, FP, and TN depending on whether $precision$ or $recall$ matters more (see the sketch below).
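A minimal sklearn sketch on a built-in binary dataset, computing the metrics above and moving the decision threshold (the 0.3 cutoff is only for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(accuracy_score(y_test, y_pred), precision_score(y_test, y_pred),
      recall_score(y_test, y_pred), f1_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))    # rows: true label, columns: predicted label

# Thresholding: move the default 0.5 cutoff on the predicted probabilities
proba = clf.predict_proba(X_test)[:, 1]
y_high_recall = (proba > 0.3).astype(int)  # lower threshold -> fewer False Negatives
```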
**17/10/2023: Multiclass Linear Classifier**
1. What?
- An algorithm to classify data into $k$ classes ($k \geq 3$).
- MulticlassLinearClassifier = Softmax(LinearRegression).
- Softmax function: Converts an arbitrary real-valued vector into one with positive values that sum to 1, while preserving the ordering of the input entries.
- Formula: $\sigma(z)_i = \frac{e^{z_i}}{\sum_{j = 1}^{k}e^{z_j}}$ for $i = \overline{1..k}, z=[z_1, z_2,..., z_k]$.
- Cross-Entropy: Reflects the difference between 2 probability distributions $p$ and $q$. The smaller the value, the more similar $p$ and $q$ are.
- Formula: $L_{CE}(p, q) = -\sum_{i}p_i\ln q_i$.
- Note: As the values of $q_i$ are always in $(0, 1)$, $\ln q_i$ is always negative. Therefore, $L_{CE}(p, q)$ is always a positive value.
- One-hot encoding: An operation to transform a $true$ label $y = c$ with $c \in \{0,...,k - 1\}$ into a $k$-dimensional vector of $0$s and a single $1$, where the $1$ is at position $c$.
- Categorical Cross-Entropy Loss: Is the sum of Cross-Entropy at each individual sample.
- Formula: $E(\overline{W})=\sum_{i=1}^{n}L_{CE}(y_i,\hat{y_i})$
2. Why?
- Suitable for tasks where there are more than 2 classes with linear (or semi-linear) distribution.
- Simplicity and interpretability: Easy to understand and interpret the parameter set that drives the model's decisions.
- Can serve as a baseline model for other complex algorithms.
3. How?
- Read more on [Multiclass Classification with TensorFlow](https://www.tensorflow.org/tutorials/keras/classification).
4. Notes:
- Data normalization: Transform the values of features to a smaller space, normally $(-1, 1)$ or $(0, 1)$.
- In the formula $y = Wx + b$, $x$ should be a vector; otherwise the shapes are incompatible when building the model. If a sample is given as an $(m, n)$ matrix (e.g. an image), it must be converted into a vector before the Dense layer. In TensorFlow, add a Flatten layer before the Dense layer to flatten the $(m, n)$ matrix into a vector of length $m \cdot n$ (see the sketch below).
- Learn more about [Gradio](https://www.gradio.app/) to quickly interact with machine learning models.
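A minimal TensorFlow/Keras sketch of the Flatten + Dense(softmax) idea on MNIST, along the lines of the tutorial linked above (the 2-epoch setting is only for illustration):

```python
import tensorflow as tf

# MNIST digits: each sample is a (28, 28) matrix, so Flatten turns it into a 784-vector
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0   # normalize pixel values into (0, 1)

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28, 28)),
    tf.keras.layers.Flatten(),                        # (28, 28) matrix -> 784-vector
    tf.keras.layers.Dense(10, activation="softmax"),  # one output probability per class
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",  # integer labels, no one-hot needed
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=2)
model.evaluate(x_test, y_test)
```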
**19/10/2023: Pytorch Basics notes**
1. What?
- A powerful library for Deep Learning, getting more and more popular compared to TensorFlow.
- The core syntax has many similarities to ```numpy``` (index accessing, slicing, filtering, etc) => These basic operations always return a ```tensor``` => To get the actual Python value of a single-element tensor, invoke the ```.item()``` method.
2. Why?
- It comes with the built-in data structure ```tensor``` which makes it easier to work with vectors, matrices and tensors.
- More intuitive and convenient in many scenarios (compared to TensorFlow).
- It may be considered a bit harder-to-write (compared to TensorFlow), but it's a good reason to learn ha?!
3. How?
- Import ```torch``` as a module.
- Learn to use its APIs: https://pytorch.org/
4. Notes
- To design a model in ```PyTorch```, you write a class that inherits from ```torch.nn.Module``` (see the sketch after this list).
- ```model.train()```: Set the model into training mode (as opposed to using it for inference or evaluation --```model.eval()```, etc).
- ```optimizer.zero_grad()```: Reset the gradients of the parameters in a neural network model => Ensure the gradients from the previous batch or iteration don't affect the current batch's parameter updates.
- ```loss.backward()```: Compute the gradients at each iteration.
- ```optimizer.step()```: Update model parameters at each iteration.
- ```with torch.no_grad():```: Disabling Gradient computation => Significantly reduce memory usage and computation time. Learn more [here](https://pytorch.org/docs/stable/generated/torch.no_grad.html#torch.no_grad).
- ```with torch.inference_mode():```: Enabling inference_mode. Learn more [here](https://pytorch.org/docs/stable/generated/torch.inference_mode.html).
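A minimal PyTorch sketch tying the notes above together; the tiny model and toy data are invented for illustration:

```python
import torch
import torch.nn as nn

class TinyClassifier(nn.Module):   # models inherit from torch.nn.Module
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(2, 2)
    def forward(self, x):
        return self.linear(x)

# Toy data: class = 1 if x1 + x2 > 0
X = torch.randn(100, 2)
y = (X.sum(dim=1) > 0).long()

model = TinyClassifier()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

model.train()                      # training mode
for epoch in range(20):
    optimizer.zero_grad()          # clear gradients from the previous iteration
    loss = loss_fn(model(X), y)
    loss.backward()                # compute gradients
    optimizer.step()               # update parameters

model.eval()                       # evaluation mode
with torch.no_grad():              # no gradient bookkeeping during inference
    acc = (model(X).argmax(dim=1) == y).float().mean()
print(loss.item(), acc.item())     # .item() turns a single-element tensor into a Python number
```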
**[MathAIR'19.01: Nền Tảng Toán Cho Trí Tuệ Nhân Tạo](https://www.youtube.com/playlist?list=PLeFDKx7ZCnBZlZCiXnbs2vberhDgSUGge)**
**26/10/2023: MathAIR-brief - Overview**
1. Intelligence
- What?
- The ability to learn and apply knowledge to various areas in life: causal reasoning, planning, creativity, intuition, imagination, commonsense, etc
- General modeling: Knowledge + Skills -> *functions* -> Computer programs
- AI = inputs ~> functions ~> outputs
2. Machine learning
- What?
- Computers automatically learn by experience.
- General model: Given task (T), experience (E), performance measure (P), algorithm (A), and function space (F). Find $\hat{f} \in F$ with as high a generalization measure as possible.
- Learning process $\equiv$ searching in the function space/programs to find the optimal $\hat{f}$
**26/10/2023: MathAIR-brief - Function space**
1. Functions and parameters
- What?
- In a vector space, the linear function $f(x) = ax + b, x \in \mathbb{R}$ is generally called an affine function, obtained by shifting the function $f(x) = ax$ by a distance of $b$.
- Quadratic function: $f(x) = ax^2 + bx + c, a \neq 0$, and $a, b, c \in \mathbb{R}$ are called parameter $\theta$.
- Graphing
- In 2D
- $\{(x, y = f(x))\}$ -> The output $f(x)$ depends on the input variable $x$
- $\{(x, \theta) \mid f(x, \theta) = const\}$ -> The set of inputs $(x, \theta)$ on which $f(x, \theta)$ takes a fixed constant value => This is called $level \; sets/curves$ or $contour \; lines/map$.
- Non-linear functions
- Logistic sigmoid: $\sigma(z) = \frac{1}{1 + e^{-z}} = \frac{e^z}{e^z + 1}$
- Tanh: $f(z) = tanh(z) = 2\sigma(2z) - 1 = \frac{e^z - e^{-z}}{e^z + e^{-z}} = \frac{e^{2z} - 1}{e^{2z} + 1} = \frac{1 - e^{-2z}}{1 + e^{-2z}}$
- General form: $f(x) = \sigma(ax + b)$, $f(x) = tanh(ax + b)$ -> Pass $x$ through the linear function $ax + b$, then pass the output to $sigmoid$ or $tanh$ to squash the final output into a fixed range: $(0, 1)$ for $sigmoid$ or $(-1, 1)$ for $tanh$ (see the sketch at the end of this section).
- Function parameters & function space
- A function is described through operations on variables $x$ and parameters $\theta$:
- $f(x; \theta)$ or $f_\theta(x)$ for $f_\theta: X \to Y, x \in X, \theta \in \Theta$
- The parameter $\theta$ is treated as a kind of high-level input -> Choose the parameter $\theta$ first, then the variable $x$.
- As $\theta$ varies, it creates a function space $\mathcal{F}_\Theta := \{ f(x; \theta) : \theta \in \Theta \}$
- Find the optimal function in the function space: **representation** & **searching**
- Representation: Represent functions in a way that reflects the structure of the input data and makes it easier to search for the optimal function.
- Searching: Parameter space (e.g. $\theta_1$, $\theta_2$) creates a function space -> Scan in the parameter space to find an optimal value of the function.
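A tiny numpy sketch of $f(x) = \sigma(ax + b)$ and $\tanh(ax + b)$; the parameter values $a, b$ are chosen only for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

a, b = 2.0, -1.0                 # parameters theta, chosen for illustration
x = np.linspace(-3, 3, 7)

f_sig = sigmoid(a * x + b)       # outputs squashed into (0, 1)
f_tanh = np.tanh(a * x + b)      # outputs squashed into (-1, 1)
print(f_sig.round(3))
print(f_tanh.round(3))
```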
**26/10/2023: MathAIR-brief - Representation: Vector and vector space**
1. Vector
- What?
- Vector $\equiv$ an arrow $\longmapsto$ (with root & top, direction and magnitude).
- Vector $\equiv$ a point $\odot$ in a vector space.
- Vector space $\equiv$ 4-tuple $\langle V, \mathbb{R}, +, * \rangle$.
- Definition: If a set $V$ with two operations satisfies:
1) addition of 2 elements: $u + v \in V, \forall u, v \in V \; (translation)$
2) multiplied by a scalar: $aV \in V, \forall v \in V, a \in \mathbb{R}$
-> $V$ is a vector space, and $v \in V$ is a vector.
- That is to say, if we shift (add $+$) or scale (multiply $*$, scalar multiply) vectors in $V$, their results are still in $V$.
- Examples
- Describing a person (height, weight, age, etc)
- Coordinate space:
- $V = \mathbb{R}^n = \mathbb{R} \times ... \times \mathbb{R}: Cartesian \; (direct) \; product$
- $V \ni v = (v_1, ..., v_n) =: [v_i], v_i \in \mathbb{R}, \forall i \in \{ 1,...,n \}$
- Addition of 2 vectors:
- $v_i = u_i + w_i$
- $v = u + w \in V$
- Scalar multiplication:
- $v_i = \alpha u_i$
- $v = \alpha u \in V$ (*element-wise* operation)
- Describing video, image data
- Tensor space (multidimensional arrays)
- 0-d tensor (scalars): $\mathbb{R} \ni v$
- 1-d tensor (vectors): $\mathbb{R}^m \ni v = [v_i]$
- 2-d tensor (matrices): $\mathbb{R}^{m\times n} \ni v = [v_{ij}]$
- 3-d tensor: $V = \mathbb{R}^{m \times n \times p} \ni v = [v_{ijk}]$, etc
With $v_{ijk} \in \mathbb{R}, i \in \{1,...,m\}, j \in \{1,...,n\}, k \in \{1,...,p\}$
- Addition of 2 vectors:
- $v_{ijk} = u_{ijk} + w_{ijk}$
- $v = u + w \in V$
- Scalar multiplication:
- $v_{ijk} = \alpha u_{ijk}$
- $v = \alpha u \in V$ (*element-wise* operation)
- Representing some function spaces
- $P_n(\mathbb{R})$: Polynomial function space of $z \in \mathbb{R}$ with the degree $\le n$
- $f(z) = a_0 + a_1z + a_2z^2 + ... + a_nz^n = \sum_{i = 0}^{n}a_iz^i \in P_n(\mathbb{R})$
- $g(z) = b_0 + b_1z + b_2z^2 + ... + b_nz^n = \sum_{i = 0}^{n}b_iz^i \in P_n(\mathbb{R})$
- Addition of 2 vectors: $h = f + g; h(z) = \sum_{i = 0}^{n}(a_i + b_i)z^i \in P_n(\mathbb{R})$
- Scalar multiplication: $v = \alpha f; v(z) = \sum_{i = 0}^{n}\alpha a_iz^i \in P_n(\mathbb{R})$
- Note: $\{ 1, z, z^2,..., z^n \}$ are also vectors in $P_n(\mathbb{R})$
- $\{e_i\} = \{z^i\}_{i = 0}^n$ is called a basis of $P_n(\mathbb{R})$
- $\forall v \in V = P_n(\mathbb{R}) : v = a_0e_0 + ... + a_ne_n = \sum_{i = 0}^{n}a_ie_i$
- $v$ = linear combination of basis vectors
- Finding optimal basis functions is at the heart of ML!
- Basis $\Rightarrow$ coordinate space
- Directions
- Landmarks
- Features
- Words
- Prototypes, patterns, templates, ...
- Regularities, abstractions, ...
Choose and arrange a basis (ordered basis) $\mathcal{E} = (e_1, ..., e_n) \xrightarrow{\forall v \in V} \text{decomposition } [v]_\mathcal{E} = \text{coordinates } (a_1, ..., a_n) \in \mathbb{R}^n$
$\fbox{1}$ $a_i \approx$ the degree of similarity/difference (e.g., frequency) of $v$ and $e_i$
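A small numpy sketch of the polynomial example: once the basis $\{1, z, ..., z^n\}$ is fixed, a polynomial is just its coefficient (coordinate) vector, and addition/scaling act element-wise on those coordinates (values invented for illustration):

```python
import numpy as np

# Coordinates of f(z) = 1 + 2z + 3z^2 and g(z) = 4 - z^2 in the basis {1, z, z^2}
f = np.array([1.0, 2.0, 3.0])
g = np.array([4.0, 0.0, -1.0])

h = f + g        # coefficients of (f + g)(z) = 5 + 2z + 2z^2
v = 0.5 * f      # coefficients of 0.5 * f(z)

# Evaluating a polynomial = dot product of its coordinates with (1, z, z^2)
z = 2.0
basis = np.array([1.0, z, z**2])
print(h @ basis)  # (f + g)(2) = 5 + 4 + 8 = 17
print(v @ basis)  # 0.5 * f(2) = 0.5 * 17 = 8.5
```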
**26/10/2023: MathAIR-brief - Basis (cont.)**
1. Purpose
- Transforming an abstract vector space into a coordinate $\mathbb{R}^n$ for easier manipulating.
- Playing the role of directions or landmarks, it helps build functions that let us search for optimal functions in the function space.
- Notes
- $\fbox{2}$ $V \xrightarrow[]{\mathcal{E}} \mathbb{R}^n$: all vector spaces can be transformed to $\mathbb{R}^n$.
- Dimension $n$: The ***least*** number of vectors to represent $\forall v \in V$.
2. Linear transformation $\Leftrightarrow$ Matrix
- What?
- Function $f: V \to W$ transforms from vector space $V$ to $W$.
- Choose basis $\mathcal{B}$ for $V \xrightarrow[space]{coordinate} [v]_\mathcal{B} \in \mathbb{R}^n, \forall v \in V$.
- Choose basis $\mathcal{D}$ for $W \xrightarrow[space]{coordinate} [w]_\mathcal{D} \in \mathbb{R}^m, \forall w \in W$.
- Function $w = f(v) \xrightarrow[space]{coordinate}[w]_\mathcal{D} = f_\mathcal{D}^\mathcal{B}[v]_\mathcal{B}$ -> $f_\mathcal{D}^\mathcal{B}$ is a matrix $\mathbb{R}^{m \times n}$.
- $f: V \to W$ is linear if $f(au + bv) = af(u) + bf(v)$ (i.e. Add and scale the input -> the output is also added and scaled)
- $f(au + bv) \neq af(u) + bf(v)$: non-linear function.
- Example: Sigmoid function: $\sigma(z) = \frac{1}{1 + e^{-z}} = \frac{e^z}{e^z + 1}$ for $z \in \mathbb{R}^n$
- Provable: in coordinate spaces with the corresponding $\mathbb{R}^n, \mathbb{R}^m$,
$\fbox{3}$ linear $f \Leftrightarrow f_\mathcal{D}^\mathcal{B} = \text{matrix } M_f \in \mathbb{R}^{m \times n}: [w]_\mathcal{D} = M_f[v]_\mathcal{B}$.
$\fbox{2} + \fbox{3}$ results in the growth of matrix operations in $\mathbb{R}^n$
3. System of linear equations
- What?
$$
\begin{cases}
a_{11}x_1 + a_{12}x_2 + ... + a_{1n}x_n = b_1 \\
a_{21}x_1 + a_{22}x_2 + ... + a_{2n}x_n = b_2 \\
\quad\vdots \\
a_{m1}x_1 + a_{m2}x_2 + ... + a_{mn}x_n = b_m
\end{cases}
\Leftrightarrow Ax = b
$$
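A quick numpy sketch of solving such a system (the 3×3 values are invented for illustration):

```python
import numpy as np

# A small system A x = b
A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
b = np.array([3.0, 5.0, 3.0])

x = np.linalg.solve(A, b)      # exact solution when A is square and invertible
print(x)                       # [1. 1. 1.]
print(np.allclose(A @ x, b))   # True
```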
**27/10/2023: MathAIR-brief - Cont.**
1. Basis and coordinates
- What?
- Provable in the inner product space $\langle V, \mathbb{R}, +, *, \cdot \rangle$:
$\fbox{4}$ Representation theorem: $\forall$ linear $f: V \to \mathbb{R}$, $\exists$ unique $v_f \in V$ s.t. $f(v) = \langle v_f, v \rangle, \forall v \in V$.
- Each basis vector (e.g., image) $\stackrel{\langle \cdot, \cdot \rangle}{=}$ one linear function.
$\fbox{5}$ In the orthonormal basis $\mathcal{U}$:
- Inner product $\langle v, u \rangle \in V \Rightarrow$ dot product $[v]_\mathcal{U} \cdot [u]_\mathcal{U}$ in $\mathbb{R}^n$.
$\Rightarrow \forall$ abstract vector space $V$ $\stackrel{\mathcal{U}}{\equiv}$ Euclidean space $\mathbb{R}^n$.
- Coordinates $[v]_\mathcal{U}$ = inner products $\langle v, u_i \rangle, i = 1,...,n$.
- Most of ML models are built by 2 methods:
1. Select from the input space a set of vectors as the basis:
- just right: linearly independent basis vectors. The set $\{ e_1, ..., e_n \} = \mathcal{E} \subset V$ is linearly independent if $0_V = \sum{a_ie_i}$ holds iff $a_i = 0, \forall i = 1...n$.
- Overcomplete: dictionary, codebook, bag-of-features.
- Undercomplete: bottleneck/dimensionality reduction.
- Examples:
- Sparse coding for images
- Representation $v = \sum{\alpha_i\phi_i}$ with $n \gg m$ = the number of coordinates $\alpha_i \neq 0$
- E.g., $[\alpha_1,...,\alpha_{64}] = [0,...,0.8,...,0.3,...,0.5,...,0]$
- Sparse coding for acoustics
- Sparse coding for text documents
- "One-hot" encoding: one word is a basis vector.
- Vector length = vocabulary size (huge, ~10k).
- Orthogonal: Doesn't reflect word similarity/distance.
- Encoding for text documents
- "Bag-of-words" encoding: e.g., frequencies, ignoring context.
- TF-IDF Statistic: Multiplication of 1) TF score (term frequency) of each word in the document, and 2) IDF score (inverse document frequency --word rarity, i.e. feature) in all documents.
- Featurized representations for text documents
- Word embedding: vector length (~ 300) $\ll$ vocabulary size.
- Basis vectors capture semantic meanings.
- t-SNE visualization
$\Rightarrow$ Parameterized basis vectors (functions) $\Rightarrow$ function space
Coordinates as latent embedding, latent state variables
$\Rightarrow$ Computing coordinates = "feature extraction"
- To increase the expressiveness of a function space: (2) + (3)
- (2) affine + nonlinear mappings of coordinate vector x
$x'\leftarrow \sigma(Ax + b)$
2. Affine & nonlinear coordinate mappings
- What?
- $\forall$ linear $T: V \to W$ consists of 3 basic transforming operations: rotate (or reflect), scale (some coordinates set to 0) & rotate back (not necessarily the same as the initial rotation).
- Linear mapping in coordinate space: $y = Ax$
$\fbox{6}$ Singular Value Decomposition (SVD): $\forall A \in \mathbb{R}^{m \times n}$
$A_{m \times n} = U_{m \times r}S_{r \times r}V_{n \times r}^T, r \leq min(m, n)$
$Ax$ = rotate $x$ + scale (some coordinates to 0) + rotate
$U, V$: matrices of orthonormal vectors, $S = diag(\sigma_1,...,\sigma_r)$ of positive scalars
- Special case: symmetric $A \in \mathbb{R}^{n \times n}$
$\fbox{7}$ Symmetric Eigenvalue/Spectral Decomposition
$A_{n \times n} = U_{n \times n}\Lambda_{n \times n}U_{n \times n}^T$ -> (Rotate + Scale + Rotate) -> Note: In this case, both the 'rotate' operations are the same
$U$: matrix of orthonormal vectors $\{u_i \}_{i=1}^n$, $\Lambda = diag(\lambda_1,...,\lambda_n)$ of real numbers (they can be *negative* or *positive*)
$Ax$ = "scale x vertically on directions of $u_i$ by an amount of $\lambda_i$"
Eigenvalue $\lambda_i$ is associated with eigenvector $u_i$: $Au_i = \lambda_i u_i$ (see the numpy sketch at the end of this section)
$y \leftarrow y + b$: translation (by $b$)
Translation $\notin$ linear transformations
3. Compositional & hierarchical basis: increased abstractions
- ML as "pattern recognition/template matching"
- Final representation amenable to linear models
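A small numpy sketch of the two decompositions in $\fbox{6}$ and $\fbox{7}$ on a toy random matrix:

```python
import numpy as np

rng = np.random.default_rng(0)

# SVD of an arbitrary (3, 2) matrix: A = U S V^T  (rotate + scale + rotate)
A = rng.normal(size=(3, 2))
U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(np.allclose(A, U @ np.diag(s) @ Vt))           # True

# Spectral decomposition of a symmetric matrix: B = U Lambda U^T
B = A.T @ A                                          # symmetric by construction
lam, Q = np.linalg.eigh(B)
print(np.allclose(B, Q @ np.diag(lam) @ Q.T))        # True
print(np.allclose(B @ Q[:, 0], lam[0] * Q[:, 0]))    # B u_i = lambda_i u_i
```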
**27/10/2023: MathAIR-brief - Search in the function space**
Machine learning problem $\equiv$ 5-tuple $\langle T, E, F, P, A \rangle$
- Target function $f(x; \theta)$
- Performance: objective/cost/loss function $P(f(x; \theta), E)$
- In ML: $\theta$ becomes variables and data $E$ becomes parameters
$\Rightarrow$ optimize $P(\theta; E)$: ML as "curve-fitting"
- Minimize the loss function $L(\theta; D): \Theta \to \mathbb{R}$ with 2D variables $\theta = w = (w_1, w_2)$ i.e., 2 parameters in the target function $f(x; \theta)$
1. Local search for optimum
- E.g., gradient-based (steepest direction)
- At each point $M_0(w_1^0, w_2^0)$ in the parameter space $\Theta = \mathcal{W}$, find the direction so that the function $L(w; D)$ changes with the fastest pace.
2. Global search for optimum
- E.g., blackbox (gradient-free), evolution strategies, etc
Gradient as steepest direction & extremums
- Given a function $f: V \to W$ between 2 vector spaces.
- Considering a fixed direction in the input space $v \in V$.
- We want to describe the rate of change (if any) of $f(x)$ in the direction of $v$ at an arbitrary point $x$, denoted $\partial_vf(x) \equiv \frac{\partial f(x)}{\partial v}$:
- $\forall x \in V$, $\partial_vf(x) := \lim_{t\to0} \frac{f(x + tv) - f(x)}{t} \in W, t \in \mathbb{R}$
$\rightarrow$ This is the directional derivative of $f(x)$ at $x$ by the given direction $v$.
Provable: $\partial_vf: V \to W$ is a linear function by $v$
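A small numerical sketch checking that the directional derivative equals the dot product of the gradient with the direction, for a toy function chosen only for illustration:

```python
import numpy as np

def f(x):         # toy function f(x1, x2) = x1^2 + 3 * x1 * x2
    return x[0] ** 2 + 3 * x[0] * x[1]

def grad_f(x):    # its analytic gradient
    return np.array([2 * x[0] + 3 * x[1], 3 * x[0]])

x = np.array([1.0, 2.0])
v = np.array([0.6, 0.8])   # a fixed direction (unit length)
t = 1e-6

numeric = (f(x + t * v) - f(x)) / t   # the limit definition, with a small t
analytic = grad_f(x) @ v              # directional derivative = <grad f, v>
print(numeric, analytic)              # both approximately 7.2
```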
## Python
**3/10/2023: Streamlit notes**
- What?
- An open-source Python library for creating web apps for machine learning and data science.
- Read [more](https://docs.streamlit.io).
- Why?
- Before people can make use of machine learning models, we need a place for them to interact with the models?!
- It's pretty flexible to include HTML code inside the framework Python file.
- Faster to build with than Dash or Flask for data-focused web apps and dashboards, without needing extensive web development knowledge!
- How?
- Install it using pip
- Import it as a module
- Use layouts and containers (e.g. sidebar, columns, tabs, container, etc) to structure the page
- Create elements (e.g. text, data, chart, media, etc) and widgets (e.g. input, etc), then attach them to the layouts (see the sketch below)
- Run [Streamlit in Colab](https://discuss.streamlit.io/t/how-to-launch-streamlit-app-from-google-colab-notebook/42399).
- Some use cases for Streamlit
- Data dashboards & visualization: Excellent for building data dashboards that allow users to interact with and explore data visually.
- Machine learning prototypes: Streamlit allows users to create web apps to showcase model results, input parameters, and predictions.
- EDA: Users can build apps that load data, perform basic statistical analyses and generate visualizations to explore data and gain more insights.
- ...
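A minimal Streamlit app sketch using a sidebar widget, columns, a data element, and a chart element; the file name and values are invented for illustration:

```python
# app.py -- run with: streamlit run app.py
import numpy as np
import pandas as pd
import streamlit as st

st.title("Tiny demo dashboard")

# Widgets live in the sidebar; the main area holds elements (text, data, charts)
n = st.sidebar.slider("Number of points", 10, 500, 100)

df = pd.DataFrame(np.random.randn(n, 2), columns=["x", "y"])
col1, col2 = st.columns(2)
col1.dataframe(df.head())   # data element
col2.line_chart(df)         # chart element
```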
<!-- +++++++++ DRAFT +++++++++ -->
$L := MSE = \frac{1}{n}\sum_{i = 1}^{n}(\hat{y_i} - y_i)^2$
$= \frac{1}{n}\sum_{i = 1}^{n}(\bar{w}\bar{x_i} - y_i)^2$
$= \frac{1}{n}\sum_{i = 1}^{n}(\bar{w}^2\bar{x_i}^2 - 2\bar{w}\bar{x_i}y_i + y_i^2)$
$$\nabla_\bar{w}L(\bar{w}) = \frac{1}{n}\sum_{i = 1}^{n}(2\bar{x_i}^2\bar{w} - 2\bar{x_i}y_i)$$
$$= \frac{2}{n}\sum_{i = 1}^{n}\bar{x_i}(\bar{w}\bar{x_i} - y_i)$$
$$= \frac{2}{n}\sum_{i = 1}^{n}\bar{x_i}(\hat{y_i} - y_i)$$
<!-- Logistic Regression -->
$L := BCE = \sum_{i}e_i$
$= \sum_{i}(-y_i\ln\hat{y_i} - (1 - y_i)\ln(1 - \hat{y_i}))$, with $\hat{y_i} = \sigma(\bar{w}\bar{x_i})$
Using $\sigma'(z) = \sigma(z)(1 - \sigma(z))$, we have $\nabla_\bar{w}\hat{y_i} = \hat{y_i}(1 - \hat{y_i})\bar{x_i}$, so:
$$
\nabla_\bar{w}L(\bar{w}) = \sum_{i}\left(-y_i\frac{\nabla_\bar{w}\hat{y_i}}{\hat{y_i}} + (1 - y_i)\frac{\nabla_\bar{w}\hat{y_i}}{1 - \hat{y_i}}\right)
$$
$$
= \sum_{i}\left(-y_i(1 - \hat{y_i})\bar{x_i} + (1 - y_i)\hat{y_i}\bar{x_i}\right)
$$
$$
= \sum_{i}\bar{x_i}(\hat{y_i} - y_i)
$$