---
tags: COTAI LHP
---

# Lecture Notes ML4AI 2021

## SESSION 3: NONLINEAR PREDICTORS

###### **_NOTE_** Practice typing equations, etc. with an online LaTeX editor.

### LESSON REVIEW

### Linear predictors:
+ $X\rightarrow Y$ (regression or classification)
+ **Characteristic equation**: $\hat{y} = s(Wz+b)$
    * For regression, $s$ is **linear**:
        * $s(x) = x \Leftrightarrow s(Wz+b) = Wz+b$
    * For classification, $s$ is **softmax, sigmoid, etc.**, with outputs interpreted as confidence scores.

### Regression curve (non-linear)
* A linear predictor may not be accurate (low accuracy)
* Regression surfaces are curves

#### Decision boundary:
The boundary in input space separating the regions assigned to different classes.

#### Non-linearly separable
Use curves to separate the classes.
* When the input is "noisy" (sparse), linear separation is highly inaccurate, or the data is simply not linearly separable.

#### Non-linear transformation
Transforms the data into a $Z$ space that can be sliced by linear separators.

### Manifold Hypothesis (thin regions, layers, manifolds)
"High-dimensional data are points on a meaningful low-dimensional manifold with high-dimensional noise added."

#### Multimodal embeddings
Maximally compatible w.r.t. multimodal training data.

#### Entangled embedding manifolds (twisted, tangled)
* Use transformations to "stretch", and thus disentangle, the data.

### Transformations: stretch out Z space
- We want to transform objects from ambient space into coordinate space.
- We want to use simple "cuts" to separate concepts in embedding space $Z$.
- Rotate, stretch, bend, curl to disentangle the embedding manifolds.
- Input: $Z$; Output: $Z'$
- Transform $Z$ using matrix $A$ to get the $Z'$ space with $Z' = AZ$, then warp it with a nonlinear activation $\gamma$.

### Matrix multiplication
$C_{m\times p} = A_{m\times n}B_{n\times p}$

### Transformations
#### Linear transformations
= rotate/flip + scale + rotate/flip

#### Nonlinear transformations
Activation functions.

### Nonlinear predictors
- Feature engineering (hand-crafted)
- MLP ("multi-layer perceptron") or neural network: $Z \rightarrow Z' \rightarrow Z'' \rightarrow Z'''$ (deep learning)

#### Interactive demo: TensorFlow Playground
[Find out here](https://news.ycombinator.com/item?id=11486058)

#### Visualize in 2D: nonlinear decision boundary
http://cs231n.github.io/neural-networks-case-study/
https://machinelearningcoban.com/2017/02/24/mlp/

#### Kernel trick in kernel machines

### Decision Tree & Locally linear models
Partition the input space into many small regions, each of which is modeled linearly.

## SESSION 4: RecSys: Recommender Systems
- Widely used as the core engine of user platforms (Amazon, Spotify, etc.)
- ((recsys success story insert))
- Can learn to serve humans: virtual assistants, cobots

### Generalization counts
- Observe items or users and recommend other items to the users
- **Vital for a good RecSys**

### Personalization counts
- Going beyond common statistics of the mass
- Maintain unique, differentiated statistics between users
- **Vital for a good RecSys**

### Long-tail issues
(insert image here)

### Problem formulation
- Input: users' ratings $\rightarrow$ Output: predictions (ratings, stars, etc.)
- Two common methods: **user-based filtering** and **item-based filtering**

#### Input data
![](https://i.imgur.com/H0uSsqe.png)

#### Taxonomy
![](https://i.imgur.com/I0z56IK.png)

### Collaborative filtering
![](https://i.imgur.com/RWQFs9r.png)
- Requires:
    - $\overrightarrow{Z}$: embeddings
    - a similarity measure

##### Item-based CF
![](https://i.imgur.com/1CzwP3O.png)
- Key idea: users as features → compute an item-item similarity matrix.
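The item-item similarity above can be measured with cosine similarity over the item rows of the utility matrix (users as features). A minimal pure-Python sketch — the song names and ratings below are made up for illustration:

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two rating vectors (users as features)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy utility matrix: rows = items, columns = users (ratings, 0 = unrated).
items = {
    "song1": [5, 4, 0, 1],
    "song2": [4, 5, 1, 0],
    "song3": [0, 1, 5, 4],
}

# Item-item similarity matrix.
sim = {(i, j): cosine(items[i], items[j]) for i in items for j in items}
print(round(sim[("song1", "song2")], 3))  # high: similar taste profiles
print(round(sim[("song1", "song3")], 3))  # low: dissimilar profiles
```

In practice, ratings are often mean-centered first (adjusted cosine) and only co-rated entries are compared, but the sliding idea is the same.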
- Advantage: items don't change as much as users do.

##### User-based CF
![](https://i.imgur.com/oO48Tcy.png)
- Key idea: items as features → compute a user-user similarity matrix.

###### Note: Transposing the utility matrix before running user-based CF essentially yields the item-based CF result (more efficient, since the number of users > the number of items).

##### Matrix factorization (model-based) CF
![](https://i.imgur.com/TxUcROW.png)
- Key idea: user-item weighted-sum/similarity score ≈ rating score
- SVD: singular value decomposition.
- Matrix factorization solved by Regularized Alternating Least Squares
- Solved by truncated SVD ***after*** CF

### Personalized RecSys
![](https://i.imgur.com/xdoht2n.png)

#### Content-based filtering
![](https://i.imgur.com/0IFka6z.png)
- Key idea: $\theta_k \cdot z_i + b_k$ = score of user $k$ for item $i$; this weighted sum ≈ rating (can be formulated as a regression or a classification problem)

#### Context-aware & KB personalized RecSys
![](https://i.imgur.com/7tlsUUH.png)

### Hybrid RecSys
![](https://i.imgur.com/AmRAync.png)

## SESSION 6: MIDTERM EXAM

1. [**5** Points] Given a set of inputs $(X^1,\dots,X^N)$, we use PCA to represent them as $X^t = X_0 + z_1U_1+\dots+z_nU_n$ with $X_0$ the mean input and $U_i$ the orthonormal principal components.
    - [**2** Points] Write down the equation for $X_0$, and equations for the properties of $U_i, U_j$: unit length & mutual orthogonality.
        **Solution**:
        - mean: $X_0 = \frac{1}{N}\sum_{t=1}^{N}X^t$
        - unit length: $U_i \cdot U_i = \|U_i\|^2 = 1$
        - mutual orthogonality: $U_i \cdot U_j = 0$ for $i \neq j$
    - [**1** Point] We need to reduce the dimensions of $X^t$ to visualize them in 2D. What is the embedding vector ${\bf z}^t$ of $X^t$ if we use only the first 2 principal components to represent it? What is the last feature of ${\bf z}^t$ in this case?
        **Solution**:
    - [**1** Point] What are the main differences between representations by PCA and by sparse coding?
        **Solution**: PCA represents every input with the same small set of principal components, while sparse coding represents each input with a few components chosen from a larger dictionary.
    - [**1** Point] If we cluster the dataset into 3 groups with centroids $({\bf m}_1, {\bf m}_2, {\bf m}_3)$, what are the label-embedding coordinates of $X^t$ if it belongs to cluster 2?
        **Solution**:
2. [**1** Point] If we use each song as a feature to represent the users, what are the embedding coordinates ${\bf z}_A$ of user A in the dataset below?
    **Solution**: $z_A = [5,5,0,1,1]$
    ![](https://i.imgur.com/PseEFdZ.png)
3. [**3** Points] From the general form of linear predictors: $\hat{y}=\mathsf{s}(Wz+b)$ with $\mathsf{s}(\cdot)$ a transfer function for the desired output interpretation.
    - [**1** Point] What is $W$ for
        - 1-dimensional linear regression? **Solution**:
        - softmax regression with 3 classes? **Solution**:
    - [**1** Point] What is the function $\mathsf{s}(\cdot)$ for
        - 1-dimensional linear regression? **Solution**: the identity (no transfer function needed)
        - SVM binary classification? **Solution**: the sign function
    - [**1** Point] Why does logistic regression (for binary classification) have only 1 probability output while there are 2 classes? **Solution**:
4. [**2** Points] Evaluation procedure
    - [**1** Point] Explain the main use of the train--dev (validation)--test sets. **Solution**:
    - [**1** Point] What are the main similarities and differences between linear SVM and logistic regression? **Solution**:
        + Differences: SVM is deterministic while logistic regression is probabilistic.
        + Similarities: both are linear binary classifiers.
5. [**2** Points] There are **1100 items** and **one million users**. We need to build a content-based RecSys by extracting **120 features** ${\bf z}_i$ describing each item $i$, then learn a classifier ${\bf \theta}_j$ for each user $j$ to predict **ratings from 1 to 5 stars** of each user for each item.
    ![](https://i.imgur.com/PseEFdZ.png)
    - [**1** Point] How many classes do we need?
        **Solution**: *5 classes* (one per star rating)
    - [**1** Point] What is the size of $W$ if we use softmax regression $\hat{y}=s(Wz+b)$ to classify ratings?
        **Solution**: $5 \times 120$ (one row of 120 weights per class)
6. [**2** Points] Nonlinear predictors have the general form $\hat{y}=s(W'\phi(z)+b')$. For Multilayer Perceptrons (MLP) in particular: $\phi(z) = \gamma(Wz+b)$ recursively, each application called a "hidden layer".
    - [**1** Point] Give the explicit equation of an MLP with 2 hidden layers.
        **Solution**:
        + $\hat{y}=s\big(W'\gamma(W_1\gamma(W_2 z+b_2)+b_1)+b'\big)$
    - [**1** Point] What are the parameters of the fully-connected layers in your equation?
        **Solution**:
        - $W', W_1, W_2$: weights
        - $b', b_1, b_2$: biases
        - ($s$ and $\gamma$ are fixed transfer/activation functions, not learned parameters.)
7. [**2** Points] Kernel machines use the "kernel trick" $\phi(z_i)\cdot\phi(z_j) = \kappa(z_i,z_j)$.
    - [**1** Point] Explain why the kernel trick is useful. **Solution**:
    - [**1** Point] Explain how we can use the kernel trick in the feature-space prediction $\hat{y}=s(W^\phi\phi(z)+b)$ to turn a linear predictor into a nonlinear one. **Solution**:

-------------------------

## SESSION 7: Search: Optimization & training
The workhorse of machine learning, based on derivatives.

##### Notes: features $\neq$ embedding vector $Z$

### Main goal of today's session:
- To find the best function $f$ in the function space $F$
- Further understand the TEFPA model:
    - T: Task
    - E: Experience
    - F: Function space
    - P: Performance
    - A: Algorithm to search
- Focus on the **search algorithm A**

#### Maximizing performance measure = minimizing loss
- Search/optimization: moving in parameter space $\theta = (\theta_1,\theta_2)$ to find a "good" function $f_{\theta}$

##### Parameter space to loss surface
![](https://i.imgur.com/6zJ13M8.gif)
- We need to find the optimum $\theta^{*} = \arg\min_\theta J(\theta;X,y)$
- by following the steepest direction (like water flowing downhill) $\rightarrow$ the gradient.
![](https://i.imgur.com/dy631wF.gif)
###### for every setting of the $Weight$s and $bias$es

##### Velocity and gradient
Velocity $v_i = \lim_{\Delta t\to0}\frac{\Delta p_i}{\Delta t}$
- The direction of the velocity vector is the direction in which the rate of change is largest.
- We need the analogous direction in parameter space: the steepest direction for $\theta$.
- The velocity vector has 2 components, $(\theta_1,\theta_2)$.

Gradient vector:
$\nabla_\theta J(\theta;X,{\bf y}) = \Big(\frac{\partial}{\partial\theta_1}J_\theta, \dots, \frac{\partial}{\partial\theta_k}J_\theta, \dots, \frac{\partial}{\partial\theta_n}J_\theta\Big)$
with
$\frac{\partial}{\partial\theta_k}J_\theta = \lim_{\Delta \theta_k\to 0}\frac{J(\theta_1,\dots,\theta_k + \Delta \theta_k,\dots,\theta_n) - J(\theta_1,\dots,\theta_k,\dots,\theta_n)}{\Delta \theta_k}$

- To minimize the loss, move *against* the gradient (the gradient points in the direction of fastest increase).
- Update rule: $\theta \leftarrow \theta - \eta \nabla_\theta J$ with step size $0 < \eta \le 1$ called the "learning rate"
- Ex: $\nabla_\theta J = (0.5, -0.2)$

##### Back-propagation (MLP)
- Forward pass to compute the loss, then backward pass to update the parameters toward better values.

##### Realistic loss surface
![](https://i.imgur.com/s0qpEYz.png) ![](https://i.imgur.com/vUXkcBP.png)
- local optima
- saddle points
- flat areas
- narrow valleys

Other issues:
![](https://i.imgur.com/PReTUfc.png)
- Too-large learning rate: the loss may oscillate or even diverge
- Too-small learning rate: very slow convergence
- $Batch \rightarrow mini\text{-}batch \rightarrow SGD$
- Momentum: ![](https://i.imgur.com/VTzSqFC.png)
- Some optimizers: SGD, SGD with momentum, AdaGrad, RMSprop, Adam (most common)

##### Heuristics
- The validation loss often increases when the model overfits → **early stopping**.
- **Dropout** forces a neural network to learn more robust features.
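The update rule $\theta \leftarrow \theta - \eta \nabla_\theta J$ can be demonstrated on a toy quadratic loss whose minimum is known in advance; the loss below is an illustrative stand-in for $J(\theta;X,y)$, not a function from the lecture:

```python
# Minimal gradient-descent sketch for the update θ ← θ - η·∇J(θ).
# Toy loss: J(θ) = (θ1 - 3)² + (θ2 + 2)², minimized at (θ1, θ2) = (3, -2).

def grad_J(theta):
    t1, t2 = theta
    return (2 * (t1 - 3), 2 * (t2 + 2))  # ∇J for the toy quadratic above

def gradient_descent(theta, lr=0.1, steps=100):
    for _ in range(steps):
        g = grad_J(theta)
        # move *against* the gradient: the direction of steepest decrease
        theta = (theta[0] - lr * g[0], theta[1] - lr * g[1])
    return theta

theta = gradient_descent((0.0, 0.0))
print(theta)  # approaches (3, -2)
```

Try `lr=1.1` to watch the too-large-learning-rate divergence described above, or `lr=0.001` to see the slow convergence of a too-small one.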
----------------------

## SESSION 8: Convolutional neural networks
Used for processing images, videos, or audio.

### Revision:
- Dot product for matching & similarity

### Spatial matching: sliding
- Use the dot product to recognize patterns by template matching.
- Feature maps

### Matching + sliding:
##### 1D convolution: horizontal **OR** vertical slide only:
![](https://i.imgur.com/m190bDk.png)
##### 2D convolution: horizontal **AND** vertical slide:
![](https://i.imgur.com/megrY91.png)

### CNN operations
#### Padding
- Keeps or increases the output shape
- Reduces information loss at the edges

#### Stride
- How the filter moves
- Controls how fast the data shrinks

#### Pooling
- Reduces the input shape, e.g. by half: $4 \times 4 \rightarrow 2\times 2$
- Max pooling and average pooling
![](https://i.imgur.com/nAhKuNZ.png)
- Implemented as a kernel with stride = 2
- Pooling has no learnable variables; the operation always stays the same.
- Max pooling: sharp features
- Avg pooling: smooth features

###### Convolution bias:
- Each filter adds one bias term to its output, so a filter has (kernel size + 1) parameters.

### CNN output shape
- **Formula**: $n_{out} = \Big\lfloor \frac{n_{in} +2p - k}{s} \Big\rfloor + 1$
- **Parameters**:
    - $n_{in}$: number of input features
    - $n_{out}$: number of output features
    - $k$: convolution kernel size
    - $p$: convolution padding size
    - $s$: convolution stride size

### DCNN: Deep CNN
- DCNN for MNIST: ![](https://i.imgur.com/pwf5Hiy.png)
- DCNN for faces: ![](https://i.imgur.com/zOh3guP.png)
- Some popular DCNNs: ![](https://i.imgur.com/G9niXDW.png)

### CNN effective receptive field
![](https://i.imgur.com/06MGqqR.png) ![](https://i.imgur.com/HNPqVSi.png)
- A single value in layer 3 represents (summarizes) a large region of layer 1.

-------------------

## SESSION 9: Recurrent neural networks (RNNs)
- Sequential data
- Vanilla RNN
- Vanishing gradient problem
- Introduction to LSTM & GRU

###### RNNs produce a series of predictions as output
- The output of image classification on one image does not affect the output on another image.
- RNNs output a series of outputs, where one output affects the others:
    - Time series
    - Voice/speech recognition

#### Moving in concept embedding spaces
![](https://i.imgur.com/r4TpVDJ.png) ![](https://i.imgur.com/16oPtyQ.png)
[More examples](http://visual.cs.ucl.ac.uk/pubs/interpTransform/)
- When moving in a concept embedding space, the values of $Z$ change.

### The mathematical meaning of 'moving'
- $z \rightarrow z' = z + \Delta z$
- $z \rightarrow z' = \sigma(Uz + W\Delta{z} +b)$
- $h \rightarrow h' = \sigma(Uh + Wx +b)$
- $h_{t-1} \rightarrow h_t = \sigma(Uh_{t-1}+Wx_t + b)$
    - $h_{t-1}$: the old concept
    - $h_t$: the new concept

### Vanilla RNNs
- Feed-forward networks ![](https://i.imgur.com/GqmQ27a.png)
- To turn a feed-forward network into an RNN: leverage any information already extracted from the sequence and maintain it in the "hidden states".
- Vanilla RNNs: ![](https://i.imgur.com/NWATsmC.png)
    - The output becomes part of the next input.
- Math: ![](https://i.imgur.com/supf0bQ.png)
- RNNs: weight sharing in time. CNNs: weight sharing in space.
- Backpropagation through time: backpropagate errors to earlier time steps during training. ![](https://i.imgur.com/ym8fB1n.png)

#### Training vanilla RNNs
Gradient-based method: backpropagation through time (BPTT)
- Treat the unfolded network as one big feed-forward network.
- The whole input sequence is given to this FFNN.
- The weight updates are computed for each copy in the unfolded network using the usual back-propagation method.
- All the updates are then summed (or averaged) and applied to the RNN's (shared) weights.

#### Common structures
![](https://i.imgur.com/nABET1D.png)

#### Vanishing or exploding gradients
- The recurrent layer's weights are shared across time steps.
- Issue:
    - When multiplied repeatedly, gradient values can shrink (vanishing) or grow (exploding).
    - 'Short-term' memory: ![](https://i.imgur.com/One4kEw.png)

#### Other RNNs:
- Vanilla RNNs are simple but do not work very well.
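A single vanilla-RNN step $h_t = \sigma(Uh_{t-1}+Wx_t+b)$ can be sketched in pure Python; the tiny weight matrices and the tanh activation below are illustrative assumptions, not values from the lecture:

```python
from math import tanh

def rnn_step(h_prev, x, U, W, b):
    """One vanilla-RNN update: h_t = tanh(U·h_prev + W·x + b)."""
    h = []
    for i in range(len(b)):
        s = b[i]
        s += sum(U[i][j] * h_prev[j] for j in range(len(h_prev)))
        s += sum(W[i][j] * x[j] for j in range(len(x)))
        h.append(tanh(s))
    return h

# Tiny illustrative weights: hidden size 2, input size 1.
U = [[0.5, 0.0], [0.0, 0.5]]
W = [[1.0], [-1.0]]
b = [0.0, 0.0]

# Weight sharing in time: the SAME U, W, b process every input in the sequence.
h = [0.0, 0.0]
for x_t in ([1.0], [0.5], [-1.0]):
    h = rnn_step(h, x_t, U, W, b)
print(h)  # final hidden state summarizes the whole sequence
```

The final `h` depends on every input seen so far — this is the "hidden state" that carries information forward through the sequence.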
- Some variants of vanilla RNNs fix these drawbacks: LSTMs & GRUs (more in later sessions). ![](https://i.imgur.com/Z1qYRMJ.png)

## SESSION 10: K-means Clustering: Unsupervised Learning

### In today's lesson:
- Unsupervised learning & clustering
- The K-means algorithm & its objective function
- Elbow method
- Silhouette method

### Unsupervised Learning: Clustering
#### Unsupervised learning: learn from inputs X to produce predictions $\hat{Y}$ without needing example output labels Y.

#### Clustering & classification:
![](https://i.imgur.com/yqFteIY.png)
- Classification maps one input X to one output Y (a label or class)
- Clustering maps the inputs X to groups (each segment of Xs becomes a class Y)

#### Applications:
- Customer segmentation: ![](https://i.imgur.com/ZVmecMQ.png)
- Example: use customers' shopping frequency or spending to group customers into different classes, so each customer can be offered better-targeted services and goods.

### K-means Clustering
- Centroid = mean of the points in a cluster
- Each input is assigned to the nearest cluster using Euclidean distance.
- Algorithm: ![](https://i.imgur.com/MJa1wis.png)
- Example: ![](https://i.imgur.com/shEKAg1.png)

#### Objective of K-means clustering
- In K-means clustering, performance depends on how tightly the points of each cluster stick together (intra-cluster) and how far apart the clusters are (inter-cluster).
- Optimizing K-means means minimizing the sum of intra-cluster distances:
- Formula: $J = \sum_{k=1}^{K}\sum_{x_i \in C_k} \|x_i - m_k\|^2$ with $m_k$ the centroid of cluster $C_k$

#### How to choose a good K
##### Elbow method:
##### Silhouette method:
- Formula: $S_i = \frac{b_i - a_i}{\max(a_i,b_i)}$ with $a_i$ the mean distance from point $i$ to the other points in its own cluster, and $b_i$ the mean distance from point $i$ to the points of the nearest other cluster.

## SESSION 11: MDP Planning

### WHAT IS PLANNING?
##### Planning is at the core of intelligence
- 5P: proper planning prevents poor performance. Failing to plan = planning to fail.
- From input X extract Z, predict $\hat{Y}$, and from $\hat{Y}$ derive an action A.
- For example:
    - Playing chess: the input is the current chess board; extract Z (strategies) and predict the next move with the highest winning chances.
    - Tower of Hanoi: find the best algorithm to solve the problem and achieve the goal. ![](https://i.imgur.com/33sIYOh.png)
- From an observation O, feature extraction creates Z. Sometimes the observation O is used directly as Z.
- **Good planning**: ![](https://i.imgur.com/YpQ7Avt.png)
- **Bad planning**: ![](https://i.imgur.com/1HwrvG7.png)

### Planning
- Planning is the highest level of data processing.
- A plan = a sequence of actions to achieve a goal (or goals) as efficiently as possible → sequential decision making.
- Examples: chess playing, autonomous driving, assistant robots, etc.

#### Elements that formulate a planning problem
- Actions: depend on the environment
- Performance evaluation: client satisfaction, etc.
- State of the agent (environment): observation O and Z.
- Ex: ![](https://i.imgur.com/HRoM1rw.png)
- The planning model is called a transition model: at state $O_1$, take action $A_1$; the state changes to a new state $O_n$, where we act again.

#### Simplest formulation:
- MDP (Markov decision process) = $(S,A,T,R,\gamma)$ with:
    - S: states
    - A: actions
    - T: transition model
    - R: rewards
    - $\gamma$: discount factor (a reward received now is worth more than the same reward received after waiting, so the agent prefers to collect rewards early)
- ![](https://i.imgur.com/1W7yFmp.png)
- At state $S_t$, the agent decides which action to take in order to maximize the reward R.

#### Optimal plan
- Policy $\pi(s)$: which action $A$ to take in each & every situation $S$
- Action value $q^{\pi}(s,a)$: the sum of rewards obtained from state $S$ after action $A$, averaged over the possible outcomes.
- *Find $q$ to find $\pi$.*
- We can also find $\pi$ directly using policy search (will be taught later).

#### Trajectory, return, value function, and optimal policy
The plan with the highest return is the best plan.
- (slide 16, session 11 — remember to write it down!)
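The return of a trajectory is the (discounted) sum of its rewards, and averaging returns over several trajectories gives a simple estimate of the action value $q^{\pi}(s,a)$. A minimal sketch — the rewards and discount values below are made up for illustration:

```python
def discounted_return(rewards, gamma):
    """G = r_1 + γ·r_2 + γ²·r_3 + ... for one trajectory."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# A 4-step trajectory with made-up rewards: reward only at the end.
rewards = [0.0, 0.0, 0.0, 1.0]
print(discounted_return(rewards, gamma=1.0))  # 1.0: no discounting
print(discounted_return(rewards, gamma=0.5))  # 0.125: late rewards count less

# Averaging the returns G_k of K trajectories that all start with the same
# (state, action) gives a Monte-Carlo estimate of the action value q(s, a).
trajectories = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
q_estimate = sum(discounted_return(tr, gamma=0.9) for tr in trajectories) / len(trajectories)
print(q_estimate)
```

With `gamma=0` only the immediate reward survives; with `gamma=1` all rewards count equally — the two extremes asked about later in the final exam.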
#### Bellman equations
- Formula: $q^*(s,a)=R(s,a)+\gamma\sum_{s'} P(s'|s,a) \max_{a'}q^*(s',a')$

#### Reinforcement learning
will be taught in the next class

---------------------

# ML4AI Session 12 Final Exam (for Students)

<br>

![](https://i.imgur.com/eDOwrsO.png)

<br>

CoTAI AI Training Program for Highschool Students © [CoTAI](https://gem.cot.ai) 2021

---

Open book. Choose to answer as many questions as possible, as concisely as possible. Ask the instructors if you have difficulty understanding the questions.

**Questions**

1. [**8** Points] The unified TEFPA framework to build an ML model: task $\mathcal{T}$, experience $\mathcal{E}$, function space $\mathcal{F}$, performance $\mathcal{P}$, and algorithm $\mathcal{A}$ to search. What are the elements of the TEFPA framework...
    - 1.1 [**2** Points, 3.1] to build a face recognition model using a DCNN?
        - Answer:
            - Task: Input: images or camera footage. Output: recognized faces (names, ...)
            - Experience: face images
            - Function Space:
            - Performance:
            - Algorithm:
    - 1.2 [**2** Points, 3.1] to build a RecSys? (using one of the models you have learned: item-based, user-based, content-based, MF)
        - Answer:
            - Task: Input: a user's id. Output: recommendations
            - Experience: users' interactions, item features, user features, or both
            - Function Space: user-similarity and item-similarity models
            - Performance: users' interactions
            - Algorithm:
    - 1.3 [**2** Points, 3.1] to build a customer segmentation model using k-means?
        - Answer:
            - Task: Input: customers. Output: customer segments
            - Experience: customers' traits (hobbies, affordable goods, etc.)
            - Function Space: assignments of customers to different segments
            - Performance: intra-cluster and inter-cluster distances
            - Algorithm: k-means (iteratively assign points and update centroids)
    - 1.4 [**2** Points, 3.1] to build a sentiment analysis model (good, bad, neutral comments) using an RNN+Softmax classifier?
        - Answer:
            - Task: Input: a comment. Output: its sentiment (good, bad, neutral)
            - Experience: previously labeled comments
            - Function Space: RNN + softmax classifier
            - Performance: agreement with the comments' labels
            - Algorithm:
2. [**6** Points] Convolutional filters (or kernels)
    - 2.1 [**1** Point, 1.1, 3.2] How do we extend them from the dot product?
        - Answer: by sliding the dot product (template matching) across the input.
    - 2.2 [**1** Point, 1.1, 3.2, 3.4] Why do we call their outputs "feature maps"?
        - Answer: because each number in the output represents how strongly a feature appears at that location of the input.
    - 2.3 [**1** Point, 3.2] Explain padding: how to do it & its main purpose:
        - Answer:
            - Main purpose: keep or increase the output shape and reduce information loss at the edges
            - How to do it: add pads around the data
    - 2.4 [**1** Point, 3.2] Explain pooling: how to do it & its main purpose:
        - Answer:
            - Main purpose: reduce the input shape
            - How to do it: Avg Pool (mean of the values inside the kernel) or Max Pool (max of the values inside the kernel)
    - 2.5 [**1** Point, 3.2] Explain stride: how to do it & its main purpose:
        - Answer:
            - Main purpose: decide how the kernel moves across the data.
    - 2.6 [**1** Point, 3.2, 3.4] Explain their **effective** receptive field: why do they produce highly abstract features?
        - Answer: one number in the output layer(s) represents a feature computed from many numbers in the input data.
3. [**6** Points] Recurrent neural networks (RNNs) can be used for sequential modeling.
    - 3.1 [**1** Point, 3.2] What does sequential data mean?
        - Answer: a series of data points ordered in time.
    - 3.2 [**1** Point, 1.1, 3.2, 3.4] Explain each element in this basic equation of RNNs $h_t = \mathsf{\gamma}(Ah_{t-1}+Wz_t)$:
        - Answer:
            - $h_t$: the new concept (hidden state)
            - $\gamma$: activation function
            - $A$: recurrent weights applied to the previous hidden state
            - $h_{t-1}$: the old concept (previous hidden state)
            - $W$: input weights
    - 3.3 [**2** Points, 1.3, 2.1, 3.2] What does back-propagation-through-time mean, why do we need it instead of plain back-prop, and how does it work for training RNNs?
    - 3.4 [**1** Point, 1.3, 3.2] Explain the vanishing gradient problem for simple RNNs.
    - 3.5 [**1** Point, 3.1, 3.3] If we want to classify the sentiment of each user comment (good, bad, neutral) at the end of each sequence using an RNN+Softmax classifier: explain briefly the model architecture.
4. [**6** Points] Planning in a Markov Decision Process (MDP) $(S,A,T,R,\gamma)$.
    - 4.1 [**1** Point, 3.1, 3.2] Explain the 5 elements of the MDP model (with the equation of each element, if available).
    - 4.2 [**1** Point, 3.2] Follow a policy $\pi(s)$ to generate a trajectory of 10 time steps $(s_t,a_t,s_{t+1},r_{t+1})$. Compute the return. Equation of $a_t$?
    - 4.3 [**1** Point, 1.2, 3.2] Repeat for 10 days: from $s_0 = \text{HOME}$ take action $a_0 = \text{GET_BUS}$ with cost $r_1 = 6000 \text{VNĐ}$, then follow policy $\pi(s)$ to generate $K=10$ trajectories, each with total cost $G_k$. Compute the average cost of taking the bus and then following $\pi$: $Q^\pi(\text{HOME, GET_BUS})$.
    - 4.4 [**1** Point, 1.1, 1.3, 2.1, 3.2] How do we compute an optimal plan (i.e., an optimal policy $\pi^*$) for a known MDP $(S,A,T,R,\gamma)$?
    - 4.5 [**1** Point, 3.2] Why do we say that the action value function $Q^\pi(s,a)$ gives predictions into the very far future?
    - 4.6 [**1** Point, 1.2, 3.2] What is the meaning of the action value function when we set $\gamma = 1$? $\gamma = 0$?
5. [**7** Points] Unified ML models $\text{Input } X \xrightarrow[B_{\beta}]{\text{Features}}\text{ Embedding Coordinates }Z \xrightarrow[P_{\theta}]{\text{Predictor}}\text{ Predictions }\hat{Y} \xrightarrow[{\pi^*}]{\text{Policy}}\text{ Action }A$
    - 5.1 [**2** Points] List all *taught* algorithms for feature extraction and their main ideas.
        - Answer: PCA, CNN.
    - 5.2 [**2** Points] List all *taught* algorithms for making predictions and their main ideas.
        - Answer: Linear Regression, Logistic Regression, SVM, KMeans, kNN.
    - 5.3 [**2** Points] What are the main *general* differences between linear predictors?
      And in your opinion, why do we need different algorithms?
        - Answer: each linear predictor uses a different transfer function $s(\cdot)$ and loss, suited to a different output interpretation.
    - 5.4 [**1** Point] For MDPs, what are the predictions $\hat{Y}$ used to make decisions $A$?
        - Answer: the action values $Q$.
6. [**2** Points] RecSys ![](https://i.imgur.com/PseEFdZ.png)
    We build item embeddings ${\bf z}_i \in \mathbb{R}^2$ as in the table, and use **softmax regression** to predict ratings 1 to 5. Choose a specific user $X\in \{A,\dots,F\}$: what is the training set for learning $\theta_X$? What are the parameters $\theta_X$ to be learned (with their shapes)?
    - Answer:
        - For user $A$: the training set is the set of pairs $({\bf z}_i, \text{rating}_{A,i})$ over the items $i$ that user A has rated.
7. [**6** Points] MDP planning for playing chess. Let the reward be 1 for winning, -1 for losing, and 0 for a draw or an unfinished game, with no discount.
    - 7.1 [**2** Points] What is the range of values of the optimal action-value function $Q^*(s,a)$, and how do we derive the probability of win/loss from it?
    - 7.2 [**2** Points] If we use all the games already played in history to compute $Q^*(s,a)$, explain the method.
    - 7.3 [**2** Points] Because there are so many state-action pairs $(s,a)$, we need to use *learning* to approximate and generalize over all $(s,a)$ pairs. If we use an MLP to learn $Q^*(s,a)$, what is the dataset and a possible network structure?