# FINAL EXAM MACHINE LEARNING REVISION

![](https://i.imgur.com/wCFju2u.jpg)

# ML4AI Bullet Points & Flashcards

:::danger
:fire: Students should copy this page and create their own page with flashcards
:::

### Session 1

:::info
Representations: Feature Extraction & Embedding Coordinates.
:::

- <span><!-- .element: class="fragment highlight-blue" -->Good embedding space $Z\Rightarrow$ accurate predictions: $x\overset{B}{\to}z\overset{P}{\to}\hat{y}$. </span>
- <span><!-- .element: class="fragment highlight-blue" -->Basis/templates/filters $B$. </span>
- <span><!-- .element: class="fragment highlight-blue" -->Feature engineering: $B_i = \phi_i(x)$ "hand-crafted". </span>
- <span><!-- .element: class="fragment highlight-blue" -->Sparse coding. </span>
- <span><!-- .element: class="fragment highlight-blue" -->PCA: principal components. </span>
- <span><!-- .element: class="fragment highlight-blue" -->Convolutional kernels. </span>
- <span><!-- .element: class="fragment highlight-blue" -->Word embeddings. </span>
- Clustering: k-means algorithm & objective function
- Elbow method
- Silhouette method
- Classification: kNN

*Example flashcards*:

1. What does representation in ML mean?
:::spoiler
Your answer here
:::
2. Relation of feature extraction & embedding coordinates?
:::spoiler
Your answer here
:::

### Session 2

:::info
Predictions: Introduction to linear models $s(Wz+b)$
:::

- <span><!-- .element: class="fragment highlight-blue" -->AI as prediction machines: $x\overset{f}{\to}\hat{y}$. </span>
- <span><!-- .element: class="fragment highlight-blue" -->Feature extraction & concept embedding space: $x\overset{B}{\to}z\overset{P}{\to}\hat{y}$. </span>
- <span><!-- .element: class="fragment highlight-blue" -->Types of predictions: regression & classification (supervised & unsupervised). </span>
- <span><!-- .element: class="fragment highlight-blue" -->Linear regression = linear combination = weighted sum. </span>
- <span><!-- .element: class="fragment highlight-blue" -->4 popular similarity measures: dot product, convolution, distance, angle. </span>
- <span><!-- .element: class="fragment highlight-blue" -->Linear classification = argmax of some similarity score. </span>
- <span><!-- .element: class="fragment highlight-blue" -->Confidence/uncertainty of classifications: softmax probability. </span>
- <span><!-- .element: class="fragment highlight-blue" -->Linear predictors general form: $\hat{y}=s(Wz+b)$. </span>

### Session 3

:::info
Linear predictors for decision making: RecSys and Bandits.
:::

- CF (item-based & user-based) key ideas: items (or users) as features $\to$ compute item-item (or user-user) similarity matrix (cosine, dot product).

![](https://i.imgur.com/2kB0rKh.jpg)

1. Mean user ratings: the average rating of each user, computed only over the items that user has actually rated.
2. Normalized utility matrix: subtract each user's mean rating from that user's rating of every item; items the user has not rated are set to 0.
3. User similarity matrix: the cosine similarity between the normalized rating vectors of the corresponding pair of users.
4. Compute $\hat{y}$ to fill in the empty cells as described in item e).

- Fill in empty cells of the utility matrix with each user's average rating $\to$ subtract the mean to normalize $\to$ use the normalized vectors as coordinates to compute the similarity matrix $\to$ select the $k$ nearest neighbors that have already rated the item and compute a normalized weighted-average rating (see the sketch below).
- Matrix factorization after CF: $D_{\text{full}} = USV^\top = WT$ or by "Regularized Alternating Least Squares". Extract hidden features for both users & items.

![](https://i.imgur.com/DFIg0LI.png)

- Content-based RecSys: extract features of each item using external methods, then learn a predictor for each user.

![](https://i.imgur.com/PseEFdZ.png)
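To make the user-based CF recipe above concrete, here is a minimal NumPy sketch. The 4x5 utility matrix, the `predict` helper, and $k=2$ are all invented for illustration (0 marks an unrated item): it mean-centers each user's ratings, computes user-user cosine similarity, and predicts a missing rating as the similarity-weighted average of the normalized ratings of the most similar users who rated that item.

```python
import numpy as np

# Toy utility matrix (users x items); 0 = not rated (illustrative values).
R = np.array([[5, 4, 0, 1, 0],
              [4, 0, 3, 1, 0],
              [1, 1, 0, 5, 4],
              [0, 1, 2, 4, 0]], dtype=float)

rated = R > 0
user_mean = R.sum(axis=1) / rated.sum(axis=1)          # mean over rated items only
R_norm = np.where(rated, R - user_mean[:, None], 0.0)  # unrated cells stay 0 after normalizing

# User-user cosine similarity on the normalized rows.
norms = np.linalg.norm(R_norm, axis=1, keepdims=True)
S = (R_norm @ R_norm.T) / (norms @ norms.T + 1e-12)

def predict(u, i, k=2):
    """Predict user u's rating of item i from the k most similar users who rated i."""
    neighbors = [v for v in np.argsort(-S[u]) if v != u and rated[v, i]][:k]
    if not neighbors:
        return user_mean[u]
    w = S[u, neighbors]
    return user_mean[u] + w @ R_norm[neighbors, i] / (np.abs(w).sum() + 1e-12)

print(round(predict(0, 2), 2))  # fill one empty cell of the utility matrix
```

The same similarity matrix, transposed to items, gives item-based CF; matrix factorization would instead learn the low-dimensional user and item vectors jointly.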
### Session 4

:::info
Predictions: Introduction to nonlinear models & MLP $s(W'z'+b')$ with $z'=\gamma(Wz+b)$
:::

- <span><!-- .element: class="fragment highlight-blue" -->Regression surface & decision boundary. </span>
- <span><!-- .element: class="fragment highlight-blue" -->Manifold hypothesis. </span>
- <span><!-- .element: class="fragment highlight-blue" -->Separability & linearly/nonlinearly separable. </span>
- <span><!-- .element: class="fragment highlight-blue" -->Concept embedding space transformation: $x\overset{B}{\to}z\overset{T}{\to}z'\overset{P}{\to}\hat{y}$. </span>
- <span><!-- .element: class="fragment highlight-blue" -->Transformations $T$: Linear $L(~)$ vs. Nonlinear $\phi(~)$. </span>
- <span><!-- .element: class="fragment highlight-blue" -->Nonlinear predictors: $\hat{y}=s(W'\phi(z)+b')$. </span>
- <span><!-- .element: class="fragment highlight-blue" -->Feature engineering: $\phi(z)$ "hand-crafted". </span>
- <span><!-- .element: class="fragment highlight-red" -->MLP: $\phi(z) = \gamma(Wz+b)$ "multi-layer". </span>
- <span><!-- .element: class="fragment highlight-blue" -->Kernel machines: $\phi(z_i)\cdot\phi(z_j) = \kappa(z_i,z_j)$ "kernel trick". </span>
- <span><!-- .element: class="fragment highlight-blue" -->Locally linear: dividing the $Z$ space into subspaces. </span>

### Session 5

:::info
Search: Gradient-based optimization & training, over/underfitting, regularization, and generalization.
:::

- Machine learning pipeline
- Maximizing the performance measure = minimizing a loss
- Parameter space $\to$ loss surface
- Velocity & gradient
- Gradient & steepest direction
- Gradient descent & back-propagation (MLP)
- Realistic loss surfaces & their challenging issues

### Session 6

:::info
Evaluations: Common metrics and loss functions.
:::

- Evaluation procedure: train--dev (validation)--test sets & [stratified](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) splits
- Evaluation metrics for regression: MSE, MAE, MAPE, etc.
- Evaluation metrics for classification: mistakes/accuracy, confusion matrix
- Evaluation metrics for class-imbalanced datasets (see the sketch below)
  - Precision, recall & F1 scores
  - ROC & AUC
- Losses: convex & nonconvex
- Large-margin classifier, hinge loss, and kernel SVM
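To anchor the class-imbalance metrics above, here is a minimal sketch with invented toy labels: it builds the binary confusion matrix and derives accuracy, precision, recall, and F1 from it (the same numbers scikit-learn's metrics would report).

```python
import numpy as np

# Toy ground-truth and predicted labels for an imbalanced binary task
# (values invented for illustration; 1 = the rare, positive class).
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 1, 0, 0, 1, 0, 1])

tp = np.sum((y_pred == 1) & (y_true == 1))
fp = np.sum((y_pred == 1) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))
tn = np.sum((y_pred == 0) & (y_true == 0))

accuracy  = (tp + tn) / len(y_true)           # can look good even if the rare class is missed
precision = tp / (tp + fp)                    # of the predicted positives, how many are correct?
recall    = tp / (tp + fn)                    # of the true positives, how many did we catch?
f1        = 2 * precision * recall / (precision + recall)

print(f"confusion matrix: [[{tn} {fp}] [{fn} {tp}]]")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```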
### Session 7

:::info
TEFPA unified framework.
:::

$\text{Input } X \xrightarrow[B_{\beta}]{\text{Features}}\text{ Embedding Coordinates }Z \xrightarrow[P_{\theta}]{\text{Predictor}}\text{ Prediction }\hat{Y}$

**ML training pipeline:**

- The experience $\mathcal{E}$: give the computer a set of $N$ examples $\mathcal{D}=\{ (x^t,y^t) \}_{t=1}^N$. The task $\mathcal{T}$: produce predictions $\{\hat y^t\}_{t=1}^N$ with $\hat y^t = f(x^t)$ for some function from a given function space $f\in\mathcal{F}$.
- How do we know $f$ is a good predictor? $\to$ a performance measure (evaluation metric, loss) $\mathcal{P}$.
- $\mathcal{P}$ gives a **real number** saying how "good" $f$ is on the examples $\mathcal{D}$: $\mathcal{P}(f,\mathcal{D})\in\mathbb{R}$.
- For regression, one option is [MSE](https://en.wikipedia.org/wiki/Mean_squared_error): $\mathcal{P}(f,\mathcal{D}) = \frac{1}{N}\sum_{t=1}^N (\hat{y}^t - y^t)^2$.
- A "good" function or model must **generalize** well to unseen data $\to$ split the labeled data into train--dev (validation)--test sets.
- Optimization algorithms $\mathcal{A}$ (e.g., gradient descent) can give us a "good" $f_\theta$.
- That is, a good set of parameters $\theta^i = (W^i,b^i)$ at the $i$-th iteration, with $\theta^0$ initialized randomly. This is called "**fitting** a model $f_\theta$ to the dataset $\mathcal{D}$".
- Performance measure $\equiv$ evaluation metric $\to$ in training it is replaced by a smooth loss / cost / objective / error function to ease the optimization.
- Hence the unified TEFPA framework: task $\mathcal{T}$, experience $\mathcal{E}$, function space $\mathcal{F}$, performance $\mathcal{P}$, and algorithm $\mathcal{A}$ to search.

### Session 8

:::info
Convolutional Neural Networks (CNNs)
:::

* **"Function of intelligence" model?**
* **Examples of array data?** An array is a data structure consisting of a collection of elements (values or variables), each identified by at least one array index or key, stored so that the position of each element can be computed from its index tuple by a mathematical formula. The simplest type is the linear (one-dimensional) array. Examples: images, speech, text, ...
* **What does feature extraction mean?**
* **What does template matching mean?**
* Matching & similarity: weighted/linear combination
* What is a generalized dot product? Dot product of matrices & data "tensors"?
* Spatial matching: sliding
* Matching + sliding = convolution (1D, 2D, 3D)
* **CNN operations: padding, stride, pooling** (see the sketch below)
  https://medium.com/analytics-vidhya/convolution-padding-stride-and-pooling-in-cnn-13dc1f3ada26
  * Padding: adds extra pixels as a border around the outside of the input image.
  * Stride: the number of pixels the filter shifts over the input matrix at each step (the step size of the filter as it slides across the image).
  * Pooling progressively reduces the spatial size of the representation, lowering the network's complexity and computational cost. Two common types: max pooling & average pooling.
* **CNN effective receptive field**
  https://ahmdtaha.medium.com/understanding-the-effective-receptive-field-in-deep-convolutional-neural-networks-b2642297927e
  In CNNs, the receptive field grows incrementally one layer after another.

  ![](https://i.imgur.com/cOO2Rn1.png)
* DCNN abstract features
* PCA vs. CNN?
* Generalized equation for ANNs (including CNNs)?
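A minimal NumPy sketch of the Session 8 operations. The function names `conv2d` and `max_pool`, the 6x6 toy image, and the vertical-edge kernel are illustrative assumptions: zero-padding adds the border pixels, the nested loops implement "matching + sliding" with a stride, and max pooling shrinks the resulting feature map.

```python
import numpy as np

def conv2d(image, kernel, stride=1, padding=0):
    """Strided 2D cross-correlation ("matching + sliding") with zero padding."""
    if padding:
        image = np.pad(image, padding)           # extra border pixels around the image
    kh, kw = kernel.shape
    oh = (image.shape[0] - kh) // stride + 1     # output height
    ow = (image.shape[1] - kw) // stride + 1     # output width
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(patch * kernel)   # dot product = template-matching score
    return out

def max_pool(feature_map, size=2):
    """Non-overlapping max pooling to reduce the spatial size."""
    h, w = feature_map.shape
    fm = feature_map[:h - h % size, :w - w % size]
    return fm.reshape(h // size, size, w // size, size).max(axis=(1, 3))

img = np.arange(36, dtype=float).reshape(6, 6)        # toy 6x6 "image"
edge_kernel = np.array([[1., 0., -1.],
                        [1., 0., -1.],
                        [1., 0., -1.]])                # vertical-edge template

fmap = conv2d(img, edge_kernel, stride=1, padding=1)  # padding=1 keeps the 6x6 size
print(fmap.shape, max_pool(fmap).shape)               # (6, 6) -> (3, 3)
```

Stacking such layers is what makes the effective receptive field grow: each output pixel of the second layer already depends on a 5x5 neighborhood of the original image.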
### Session 9

:::info
Recurrent Neural Networks (RNNs).
:::

* Sequential predictions: temporal inputs & outputs
* Vanilla RNN equation? Why [`weight-sharing`](https://stats.stackexchange.com/questions/221513/why-are-the-weights-of-rnn-lstm-networks-shared-across-time) in time?
* [`BPTT`](https://en.wikipedia.org/wiki/Backpropagation_through_time) training & the [`vanishing gradient`](https://en.wikipedia.org/wiki/Vanishing_gradient_problem) problem
* [`Bi-directional`](https://en.wikipedia.org/wiki/Bidirectional_recurrent_neural_networks) & hierarchical RNNs
* Introduction to LSTM & GRU

### Session 10

:::info
Sequential decision making: Perception $\rightleftharpoons$ Action Loop & MDP Planning
:::

- Planning = sequential decision making
- What is a plan? What is an optimal plan?
- Markov Decision Process (MDP) formulation $\mathcal{M} = (S, A, R, T, \gamma)$
- "World model" $(T,R)$: $s\overset{a}{\to}s'\sim r$
- Trajectory, episode, return, value, optimal policy
- Solving discrete MDPs: Bellman optimality equation & Value Iteration
- MDP $(S,A,T,R,\gamma) \to \text{policy } \pi(s) \text{ and value } q^\pi(s,a)$

### Session 11

:::info
Sequential decision making: Reinforcement Q-Learning
:::

- MDP Planning vs. Learning
- Learning in MDPs: Interacting $\rightleftharpoons$ Learning
- Q-learning: approximating the Bellman optimality equation
- Q-learning pseudocode (see the sketch below)
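Since the Q-learning pseudocode is left as a flashcard, here is a minimal tabular sketch. The 1-D chain environment, the +1 reward at the rightmost state, and the hyperparameters are illustrative assumptions: the agent interacts with an epsilon-greedy policy, and each transition applies the update $Q(s,a) \leftarrow Q(s,a) + \alpha\,[r + \gamma \max_{a'} Q(s',a') - Q(s,a)]$, which approximates the Bellman optimality equation without knowing $T$ or $R$.

```python
import random

# Toy 1-D chain MDP (illustrative): states 0..4, actions 0 = left, 1 = right,
# reward +1 only for reaching the rightmost state, which ends the episode.
N_STATES, ACTIONS, GOAL = 5, (0, 1), 4

def step(s, a):
    s_next = min(max(s + (1 if a == 1 else -1), 0), N_STATES - 1)
    reward = 1.0 if s_next == GOAL else 0.0
    return s_next, reward, s_next == GOAL            # (s', r, done)

alpha, gamma, eps = 0.1, 0.9, 0.1                    # assumed hyperparameters
Q = [[0.0, 0.0] for _ in range(N_STATES)]            # tabular Q(s, a)

for episode in range(500):
    s, done = 0, False
    while not done:
        # epsilon-greedy behaviour policy: explore sometimes, exploit otherwise
        a = random.choice(ACTIONS) if random.random() < eps else max(ACTIONS, key=lambda x: Q[s][x])
        s_next, r, done = step(s, a)
        # Q-learning update: bootstrap toward the Bellman optimality target
        target = r + (0.0 if done else gamma * max(Q[s_next]))
        Q[s][a] += alpha * (target - Q[s][a])
        s = s_next

greedy_policy = [max(ACTIONS, key=lambda x: Q[s][x]) for s in range(N_STATES)]
print(greedy_policy)  # should prefer "right" (1) in the non-terminal states
```

Compare with Session 10: Value Iteration would compute the same $q^*(s,a)$ by sweeping the known model $(T,R)$, whereas Q-learning estimates it purely from sampled transitions.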