Lecture Notes ML4AI 2021

---
tags: COTAI LHP
---

# Lecture Notes ML4AI 2021

## Session 1 -- Linear Predictors
* Input data x, extracted features z, prediction y.
* 4 types of data: discrete data, continuous data, nominal data, ordinal data.
* 2 kinds of predictive models: regression and classification.
    * Input data -> Regression -> a continuous quantity
    * Input data -> Classification -> 1 or more discrete labels
* Linear function: its graph is flat (a straight line): $y = ax + b$
* Linear regression function: its graph is flat (a plane): $y = ax_1 + bx_2 + c$
* Classify data by computing the distances between data points (KNN).
* Matching by: dot product, Euclidean (straight-line) distance, direction and angle.
* Linear predictors: $\hat{y} = s(wz+b)$

---

## Session 2 -- Feature Extraction
* Images: data "tensors" $\mathbb{R}^{m\times n\times k}$
* Color images: rows x columns x channels
* Hand-crafted (engineered) features
* **Sparse coding**
* Face decomposition & synthesis: simplest model
* Coordinate vectors for faces: PCA & Eigenface
* Inner product and convolutional operator
* Inner product of matrices & data “tensors”
* Word embeddings: word2vec
* Item2Vec: embeddings of things

---

## Session 3 -- Transformations and Nonlinear Predictors
* Concept embedding space transformation: $x\overset{B}{\to}z\overset{T}{\to}z'\overset{P}{\to}\hat{y}$
* $\text{extract features } z\to\text{transform } z \text{ into } z'\to\text{predict } \hat{y}$
* Regression curves & surfaces:
    * Find curves and curved surfaces that predict more accurately when the values are not distributed along a straight line or a plane.
* Linearly separable & decision boundary:
    * Some points cannot be assigned to either class because the confidence for each class is equal there (they lie on the decision boundary).
* Nonlinearly separable & decision boundary:
    * The classes cannot be separated by a flat cut; a curved surface is needed.
    * The data space has to be transformed, e.g. lifting 2D to 3D, after which a plane can be used to classify.
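
The 2D-to-3D idea above can be sketched in a few lines of Python. This is only an illustrative sketch, not something taken from the lecture: the two concentric rings, the lifting map `lift` (which appends $x_1^2 + x_2^2$ as a third coordinate), and the threshold value 5 are all assumptions chosen for the example.

```python
import math
import random


def lift(point):
    """Map a 2D point (x1, x2) to 3D by appending the squared radius."""
    x1, x2 = point
    return (x1, x2, x1 * x1 + x2 * x2)


def sample_ring(radius, n, rng):
    """Draw n points on a circle of the given radius (no noise, for clarity)."""
    points = []
    for _ in range(n):
        angle = rng.uniform(0.0, 2.0 * math.pi)
        points.append((radius * math.cos(angle), radius * math.sin(angle)))
    return points


rng = random.Random(0)
inner = sample_ring(1.0, 50, rng)  # class 0: small ring (hypothetical data)
outer = sample_ring(3.0, 50, rng)  # class 1: larger ring around class 0

# In 2D no straight line separates a ring from the ring that surrounds it.
# After lifting, the third coordinate is about 1 for the inner ring and
# about 9 for the outer ring, so the flat plane z3 = 5 separates them.
THRESHOLD = 5.0


def predict(point):
    """Classify with a plane in the lifted space: is z3 above the threshold?"""
    return 1 if lift(point)[2] > THRESHOLD else 0


correct = sum(predict(p) == 0 for p in inner) + sum(predict(p) == 1 for p in outer)
accuracy = correct / (len(inner) + len(outer))
print(accuracy)  # 1.0
```

The predictor stays linear in the lifted coordinates; only the transformation from $z$ to $z'$ is nonlinear.
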
* Manifold Hypothesis (thin region, layer, manifold):
    * The information lies on a winding curved surface, which we unroll into 2D.
* There are 2 types of transformation: linear and nonlinear.
* Linear transformations: z (flat) --> z' (flat) (represented by a matrix and applied by matrix multiplication)
    * Rotation, scaling, flipping
* Nonlinear transformations: z (flat) --> z' (not flat)
* Nonlinear predictors:

---

## Session 4
* **RecSys: recommender systems**
    * A core system in many online business platforms, user platforms...
    * Contributes to the success of many technology companies.
* A good RecSys:
    * Generalization counts
        * Generalize over products
        * Generalize over users
    * Personalization counts
        * Recommendations are tailored to, and different for, each user.
    * Long-tail issues
* Problem formulation
* **Similarity**
    * Look for embeddings of users & items
    * Recommend products to users who have similar preferences.
* Item-based CF: recommends based on how similar the items are to each other, as measured from the users' ratings.
    * Step 1: Build the user-item rating matrix
    * Step 2: Compute the cosine similarity for every pair of items
    * Step 3: Predict the rating of an unrated item as the average of the user's known ratings, weighted by the similarity between that item and each rated item.
* User-based CF
    * Use the users as features.
    * To predict a user's rating for a product they have not used yet, compare that user with the other users $\to$ pick the 2 most similar users and compute the prediction from their ratings.

---

# ML4AI Midterm Questions (written exam, open book, 1 hour)

Students **copy Markdown code** of these questions to their Lecture notes and write down solutions in there.

1. [**5** Points] Given a set of inputs $(X^1,\dots,X^N)$, we use PCA to represent them as $X^t = X_0 + z_1U_1+\dots+z_nU_n$ with $X_0$ the mean input and $U_i$ the orthonormal principal components.
    - [**2** Points] Write down equation for $X_0$, and equations for properties of $U_i,U_j$: unit length & mutual orthogonal.
      **Solution**:
    - [**1** Point] We need to reduce the dimensions of $X^t$ to visualize them on 2D.
      What is the embedding vector ${\bf z}^t$ of $X^t$ if we use only 2 first principal components to represent it? What is the last feature of ${\bf z}^t$ in this case?
      **Solution**:
    - [**1** Point] What are the main differences between representations by PCA and by sparse coding?
      **Solution**:
        * Sparse coding: overcomplete set of basis vectors.
        * PCA: orthonormal basis vectors.
    - [**1** Point] If we cluster the dataset into 3 groups with centroids $({\bf m}_1, {\bf m}_2, {\bf m}_3),$ what is the label embedding coordinates of $X^t$ if it belongs to cluster 2?
      **Solution**:

2. [**1** Point] If we use each song as a feature to represent the users, what is the embedding coordinates ${\bf z}_A$ of user A in the dataset below?
   **Solution**:

   ![](https://i.imgur.com/PseEFdZ.png)

3. [**3** Points] From the general form of linear predictors: $\hat{y}=\mathsf{s}(Wz+b)$ with $\mathsf{s}(\cdot)$ a transfer function for desired output interpretation.
    - [**1** Point] What is $W$ for
        - 1 dimensional linear regression?
          **Solution**: $W$ is a single number.
        - softmax regression with 3 classes?
          **Solution**: $W$ has 3 rows, one per class (3 numbers when $z$ is 1-dimensional).
    - [**1** Point] What is function $\mathsf{s}(\cdot)$ for
        - 1 dimensional linear regression?
          **Solution**: $\mathsf{s}$ is the identity, so $\hat{y} = Wz + b$.
        - SVM binary classification?
          **Solution**:
    - [**1** Point] Why logistic regression (for binary classification) has only 1 probability output while there are 2 classes?
      **Solution**: Because the output is the probability of class 1, the probability of class 0 is its complement $1-\hat{y}$. Comparing that single number with a threshold gives the predicted label (0 or 1).

4. [**2** Points] Evaluation procedure
    - [**1** Point] Explain the main use of the train--dev (validation)--test sets.
      **Solution**: They act like an exam for the model fitted on the train set: the dev set is used to tune the model and improve accuracy, and the test set checks the final result.
    - [**1** Point] What are the main similarity and differences between linear SVM and logistic regression?
      **Solution**:
        * Similarity: Both are used to separate the classes with a linear decision boundary.
        * Differences: SVM outputs a score relative to a straight-line boundary; logistic regression passes the score through an S-shaped (sigmoid) curve to output a probability.

5. [**2** Points] There are **1100 items** and **one million users**.
   We need to build a content-based RecSys by extracting **120 features** ${\bf z}_i$ describing each item $i$ then learn a classifier ${\bf \theta}_j$ for each user $j$ to predict **ratings from 1 to 5 stars** of each user for each item.

   ![](https://i.imgur.com/PseEFdZ.png)

    - [**1** Point] How many classes do we need?
      **Solution**: 5.
    - [**1** Point] What is the size of $W$ if we use softmax regression $\hat{y}=s(Wz+b)$ to classify ratings?
      **Solution**: $5\times 120$: one row per class and one column per feature.

6. [**2** Points] Nonlinear predictors have general form $\hat{y}=s(W'\phi(z)+b')$. For Multilayer Perceptrons (MLP) in particular: $\phi(z) = \gamma(Wz+b)$ recursively, each called a "hidden layer".
    - [**1** Point] Give explicit equation of an MLP with 2 hidden layers.
      **Solution**:
    - [**1** Point] What are the parameters of the fully-connected layer in your equation?
      **Solution**:

7. [**2** Points] Kernel machines use "kernel trick" $\phi(z_i)\cdot\phi(z_j) = \kappa(z_i,z_j)$.
    - [**1** Point] Explain why kernel trick is useful.
      **Solution**:
    - [**1** Point] Explain how we can use kernel trick in feature-space prediction $\hat{y}=s(W^\phi\phi(z)+b)$ to turn a linear predictor into a nonlinear one.
      **Solution**:

---

# Session 7
* Data input $\to$ find the logistic function $\to \hat{y}$

---

# Session 9
## RNNS
* Models for sequential data (continuous over time), such as:
    * Audio, speech
    * USD exchange rate, gold price, etc.
    * A robot that has to carry out a sequence of actions...
    * ...
* Moving in space:
    * $z\to z'=z+\delta(z)$ (to move $z$ to $z'$ we add $\delta(z)$)
* Vanilla RNN:
* Common architectures of RNN models:
    * Vanilla RNN $\to$
    * Feed the output representation back in as input in order to remember ...

---

# Session 10
## Clustering
* **Application:** Used to split data into groups based on their different features (e.g. customer segmentation in business)
* **Clustering:**
    * Similar to the Softmax Regression method: compute distances.
* **Clustering into K clusters:**
    * Partition the data into K clusters
    * Take the centroid of each cluster
    * K is the number of clusters; each point is assigned to the cluster whose centroid is nearest.
* Clustering shares characteristics with linear models.

---

# Session 11
* Features do not sit out in the environment; they are analyzed and extracted by the AI.
* Therefore Z (inside the agent, i.e. the AI) may or may not equal O (the point in the environment).

## MDP Planning
* Plan in order to solve a problem logically and efficiently.
    * E.g.: self-driving cars, ball-balancing robots, ...
* A plan is a sequence of actions that reaches the result in the fastest, leanest way.
* Planning requires:
    * A goal
    * Determining each action in the sequence of actions (each action has its own state and magnitude).
* Simplest formulation
    * an MDP = $(S,A,T,R,\gamma)$: the 5 components of planning.
        * S: state
        * A: action
        * T: transition
        * R: reward
        * $\gamma$: discount factor
    * The larger the reward, the higher the probability of taking that action.

___

# Session 12 Final Exam
**Questions**

1. [**8** Points] The unified TEFPA framework to build a ML model: task $\mathcal{T}$, experience $\mathcal{E}$, function space $\mathcal{F}$, performance $\mathcal{P}$, and algorithm $\mathcal{A}$ to search. What are the elements in TEFPA framework...
    - 1.1 [**2** Points, 3.1] to build a face recognition model using a DCNN?
        * task T
    - 1.2 [**2** Points, 3.1] to build a RecSys? (using one of the models you have learned: item-based, user-based, content-based, MF)
        * function F
    - 1.3 [**2** Points, 3.1] to build a customer segmentation model using k-means?
        * algorithm A
    - 1.4 [**2** Points, 3.1] to build a sentiment analysis model (good, bad, neutral comments) using RNN+Softmax Classifier?
        * performance P

2. [**6** Points] Convolutional filters (or kernels)
    - 2.1 [**1** Point, 1.1, 3.2] How do we extend them from dot product?
        *
    - 2.2 [**1** Point, 1.1, 3.2, 3.4] Why do we call their outputs "feature maps"?
        *
    - 2.3 [**1** Point, 3.2] Explain padding: how to do & main purpose
        *
    - 2.4 [**1** Point, 3.2] Explain pooling: how to do & main purpose
        *
    - 2.5 [**1** Point, 3.2] Explain stride: how to do & main purpose
        *
    - 2.6 [**1** Point, 3.2, 3.4] Explain their **effective** receptive field: why do they produce highly abstract features?
        *

3. [**6** Points] Recurrent neural networks (RNNs) can be used for sequential modeling.
    - 3.1 [**1** Point, 3.2] What does sequential data mean?
        * Sequential data are data whose elements come in an order, such as text, audio, ...
    - 3.2 [**1** Point, 1.1, 3.2, 3.4] Explain each element in this basic equation of RNNs $h_t = \mathsf{\gamma}(Ah_{t-1}+Wz_t)$
        *
    - 3.3 [**2** Points, 1.3, 2.1, 3.2] What does back-propagation-through-time mean, why do we need it instead of using plain back-prop, and how does it work for training RNNs?
    - 3.4 [**1** Point, 1.3, 3.2] Explain vanishing gradient problem for simple RNNs.
    - 3.5 [**1** Point, 3.1, 3.3] If we want to classify the sentiment of each user comment (good, bad, neutral) at the end of each sequence using RNN+Softmax classifier: explain briefly the model architecture.

4. [**6** Points] Planning in Markov Decision Process (MDP) $(S,A,T,R,\gamma)$.
    - 4.1 [**1** Point, 3.1, 3.2] Explain 5 elements in MDP model (equation of each element if available).
        * S: state, the current state of the agent
        * A: action, the action taken by the agent
        * T: transition, how the state changes
        * R: reward, what the agent receives
        * $\gamma$:
    - 4.2 [**1** Point, 3.2] Following a policy $\pi(s)$ to generate a trajectory of 10 time steps $(s_t,a_t,s_{t+1},r_{t+1})$. Compute the return. Equation of $a_t$?
        *
    - 4.3 [**1** Point, 1.2, 3.2] Repeat for 10 days: from $s_0 = \text{HOME}$ take action $a_0 = \text{GET_BUS}$ with cost $r_1 = 6000 \text{VNĐ}$ then following policy $\pi(s)$ to generate $K=10$ trajectories, each with total cost $G_k$. Compute the average cost of taking bus then following $\pi$: $Q^\pi(\text{HOME, GET_BUS})$.
    - 4.4 [**1** Point, 1.1, 1.3, 2.1, 3.2] How do we compute an optimal plan (i.e., optimal policy $\pi^*$) of a known MDP $(S,A,T,R,\gamma)$?
    - 4.5 [**1** Point, 3.2] Why do we say that the action value function $Q^\pi(s,a)$ gives predictions into very far future?
    - 4.6 [**1** Point, 1.2, 3.2] What is the meaning of action value function when we set $\gamma = 1$? $\gamma = 0$?

5. [**7** Points] Unified ML models

   $\text{Input } X \xrightarrow[B_{\beta}]{\text{Features}}\text{ Embedding Coordinates }Z \xrightarrow[P_{\theta}]{\text{Predictor}}\text{ Predictions }\hat{Y} \xrightarrow[{\pi^*}]{\text{Policy}}\text{ Action }A$

    - 5.1 [**2** Points] List all *taught* algorithms for feature extraction and their main ideas.
        * Kernel PCA
        * Nonlinear Regression
    - 5.2 [**2** Points] List all *taught* algorithms for making predictions and their main ideas.
    - 5.3 [**2** Points] What are the main *general* differences between linear predictors? And in your opinion why do we need different algorithms?
        * Linear: classifies with straight lines and flat planes.
        * Nonlinear: classifies with nonlinear curves and bumpy (curved) surfaces.
    - 5.4 [**1** Point] For MDPs, what are the predictions $\hat{Y}$ used to make decisions $A$?
        * $\hat{Y}$: the probability of obtaining a higher reward.

6. [**2** Points] RecSys

   ![](https://i.imgur.com/PseEFdZ.png)

   We build item embeddings ${\bf z}_i \in \mathbb{R}^2$ as in table, and use **softmax regression** to predict ratings 1 to 5. Choose a specific user $X\in \{A,\dots,F\}$, what is the training set for learning $\theta_X$? What are the parameters $\theta_X$ to be learned (with their shapes)?

7. [**6** Points] MDP Planning for playing Chess. Let rewards = 1 for winning, -1 for losing, and 0 for a draw or unfinished game, and no discount.
    - 7.1 [**2** Points] What is the range of value of the optimal action-value function $Q^*(s,a)$, and how to derive probability of win/loss from it?
    - 7.2 [**2** Points] If we use all the games already played in history to compute $Q^*(s,a)$, explain the method?
    - 7.3 [**2** Points] Because there are so many state and action pairs $(s,a)$, we need to use *learning* to approximate and generalize for all $(s,a)$ pairs. If we use MLP to learn $Q^*(s,a)$, what is the dataset and possible network structure?
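
As a small worked illustration of the return in Question 4.2 and the averaged cost $Q^\pi(\text{HOME, GET_BUS})$ in Question 4.3, the sketch below computes a discounted return per trajectory and then averages over trajectories. It is a minimal sketch rather than a model answer: the function names `discounted_return` and `monte_carlo_q` and the three daily cost lists are hypothetical, and only the 6000 VNĐ bus fare comes from the question.

```python
def discounted_return(rewards, gamma):
    """Return G = r_1 + gamma * r_2 + gamma^2 * r_3 + ... for one trajectory."""
    total = 0.0
    # Accumulate from the last reward backwards: G_t = r_{t+1} + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def monte_carlo_q(trajectory_rewards, gamma):
    """Estimate Q^pi(s, a) as the average return over K trajectories that all
    start by taking action a in state s and then follow the policy pi."""
    returns = [discounted_return(rewards, gamma) for rewards in trajectory_rewards]
    return sum(returns) / len(returns)


# Hypothetical costs in VND for K = 3 days (the question uses K = 10).
# Every day starts at HOME with GET_BUS, so the first cost is always 6000.
days = [
    [6000, 0, 3000],
    [6000, 5000],
    [6000, 0, 0, 1000],
]

print(monte_carlo_q(days, gamma=1.0))  # 9000.0: average of the full daily totals
print(monte_carlo_q(days, gamma=0.0))  # 6000.0: only the immediate cost counts
```

With $\gamma = 1$ every later cost counts fully, while with $\gamma = 0$ only the immediate cost $r_1$ survives, which is the contrast Question 4.6 asks about.
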