---
tags: COTAI MathAIR
---

# [Trần Duy Thanh] COTAI EXAM QUESTIONS 21/5/2020

## ABSTRACT VECTOR SPACES & LINEAR ALGEBRA

## General

### G.1 Rosenblatt's PERCEPTRON 1957

> ![](https://i.imgur.com/WVUp2DK.png)
>
> $x \to \varphi_i(x) \to \alpha_i\varphi_i(x) \to \sum_{i=1}^n\alpha_i\varphi_i(x) \to \Psi(x) = \Omega\Big(\sum_{i=1}^n\alpha_i\varphi_i(x)\Big)$
>
> Decomposition $\to$ aggregation $\to$ prediction
>
> - What does the decomposition step do that we call it "decomposition"?

The decomposition step tries to "cut" the input data into several parts or aspects and to measure how much of each part/aspect is contained in the input.

> - What does the aggregation step do that we call it "aggregation"?

The aggregation step uses a linear combination of those decomposed parts/aspects as a description of the input data.

### G.2 Cats and dogs features

> - Choose the 3 **most discriminative features** between cats and dogs and position them on 3D axes. Give a **label** to each axis.

![](https://i.imgur.com/viOGtWA.jpg)

### G.3 Manifold hypothesis in ML

> - Explain it concisely and as intuitively as possible.

- Poincaré embedding
- local matrix

### G.4 Linear combination

> - **G4.1** Give all variants of the basic linear combination that researchers have developed over time, in a systematic manner (i.e., in increasing order of complexity).
> - **G4.2** What is a factorization? Is it a linear combination of factors? Give one way to turn a factorization into a linear combination.

Take the logarithm: the log of a product of factors becomes a sum of log-factors, i.e., a linear combination (with unit coefficients) of the log-factors.

> - **G4.3** Explain your intuitive understanding when we say "linear combination as decomposition & synthesis in ML."
> - **G4.4** What is the good property of convex combination that we exploit in ML and in probability theory? How do we exploit it (i.e., how do we guarantee it is a convex combination) in the attention model?

### G.5 Vector spaces

> ![](https://i.imgur.com/JIg2kfx.png =x500)
> - **G5.1** In the figure above, what is $\text{span}(\mathbf{w})$?

The span of $\mathbf{w}$ is the line $\mathbf{L}_0$.

> - **G5.2** Construct the equation of $L_2$ step-by-step.

First construct $\mathbf{L}_0$: it is the set of all scalings of $\mathbf{w}$,
$$\mathbf{L}_0 = \text{span}(\mathbf{w}) = \{\mathbf{u}=t\mathbf{w} \mid t\in \mathbb{R}\}.$$
Translating each vector $\mathbf{u}$ of $\mathbf{L}_0$ by $\mathbf{v}$ is equivalent to translating the whole line $\mathbf{L}_0$ by $\mathbf{v}$. Hence
$$\mathbf{L}_2 = \{\mathbf{z}=\mathbf{v}+t\mathbf{w} \mid t\in \mathbb{R}\}.$$
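To make the G5.2 construction concrete, here is a minimal numerical sketch in Python/NumPy. The specific vectors `w` and `v` are illustrative assumptions, since the figure does not give numeric values.

```python
import numpy as np

# A minimal numerical sketch of the G.5 construction.
# The vectors w and v below are illustrative choices only, not values read off the figure.
w = np.array([2.0, 1.0])   # direction vector; L0 = span(w)
v = np.array([0.0, 3.0])   # offset vector that translates L0 to L2

t = np.linspace(-3, 3, 7)[:, None]   # a few parameter values t in R
L0 = t * w                           # points u = t*w on L0
L2 = v + t * w                       # points z = v + t*w on L2

# Every point of L2 is the corresponding point of L0 shifted by v.
assert np.allclose(L2 - L0, v)
print(np.round(L2, 2))
```

Swapping in any other nonzero `w` and any `v` leaves the structure unchanged: $\mathbf{L}_2$ is always $\mathbf{L}_0$ translated by $\mathbf{v}$.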
### G.6 Embeddings

> - **G6.1** What does an embedding mean? Why is it such an important, central concept in ML?
> - **G6.2** Describe the general (abstract) flow of processing steps from observations to embeddings to concepts and predictions/decisions. Explain the main goals of each step.

### G.7 Inner product

> - **G7.1** What can we define for abstract vectors using the inner product, so that when we change the definition of the inner product those definitions also change?

- We can define the magnitude (norm) of a vector.
- We can define the angle between two vectors, and hence orthogonality.
- We can define distance functions.

> - **G7.2** Explain why we can use a flattening step when comparing the similarity of 2 arrays of data.

Element-wise multiplication followed by summation is exactly the inner product of the two flattened vectors, so mathematically the two computations are the same.

> - **G7.3** Why is an orthonormal basis so useful? List all the nice properties we can get.

With an orthonormal basis, the coordinates of a vector are obtained simply by taking inner products with the basis vectors.

> - **G7.4** What is the relationship between orthogonal projection and distance? Why do you think we have such a relationship, i.e., least squares instead of some other distance measure?

The shortest distance from a point $\mathbf{x}$ to a given subspace is attained at the orthogonal projection of $\mathbf{x}$ onto that subspace: the residual $\mathbf{x}-\mathrm{proj}(\mathbf{x})$ is perpendicular to the subspace.

## Linear transformations

> - **L.1** Draw the big picture of a linear transformation $V^{(n)}\overset{T}{\to}W^{(m)}$ with abstract and coordinate spaces, as complete as possible.

![](https://i.imgur.com/O6zuYV7.png)

> - **L.2** Write equations and explain everything in this figure.
> ![](https://i.imgur.com/ba6Amft.png =x250)

The linear transformation $T$ maps each element of $V$ to an element of $W$.

- The set of elements of $V$ that $T$ maps to the zero vector $\mathbf{0}_W$ of $W$ is called the kernel (or null space) of $T$, denoted $\ker(T)$.
- The set of elements of $W$ that are images of elements of $V$ under $T$ is called the image of $T$, denoted $\mathrm{im}(T)$.
- $\dim(\mathrm{im}(T))$ measures how much information of $V$ is preserved by $T$.
- $\dim(\ker(T))$ measures how much information of $V$ is lost under $T$.
- $\dim(V)$ can be seen as the total information of $V$ before the transformation $T$.

$$\dim(V) = \dim(\ker(T)) + \dim(\mathrm{im}(T))$$

> - **L.3** Explain intuitively the 4 views of matrix multiplication (not just the equations).
> - **L.4** Describe everything about the general [4 fundamental subspaces](https://web.mit.edu/18.06/www/Essays/newpaper_ver3.pdf) of $A_{m\times n}$.
> ![](https://i.imgur.com/3pFuOI7.png =x250)

- $A$ is the linear transformation that maps each element of $\mathbb{R}^n$ to an element of $\mathbb{R}^m$.
- $A^\top$ maps each element of $\mathbb{R}^m$ back to $\mathbb{R}^n$; it is not the inverse of $A$ in general, but it plays the role of the reverse-direction map in this picture.
- The result of $A\mathbf{x}$ is a linear combination of the column vectors of $A$, so the set of all vectors $\mathbf{b}=A\mathbf{x}$ is called the column space of $A$, denoted $C(A) = \mathrm{im}(A)$.
- The result of $A^\top\mathbf{b}$ is a linear combination of the column vectors of $A^\top$, which are exactly the row vectors of $A$, so the set of all vectors $\mathbf{x}=A^\top\mathbf{b}$ is called the row space of $A$, denoted $R(A) = \mathrm{im}(A^\top)$.
- $N(A)$ and $N(A^\top)$ are the null spaces of $A$ and $A^\top$, as described in exercise L.2 (a numerical check of these dimensions follows at the end of this section).

> - **L.5** Describe everything about this subspace projection of an *overcomplete* linear system.
> ![](https://i.imgur.com/o4lapza.png =x250)
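As a numerical companion to **L.2** and **L.4**, here is a minimal NumPy check of the rank–nullity identity and the dimensions of the four fundamental subspaces. The $3\times 4$ matrix `A` is a hypothetical example, not the matrix behind the figures.

```python
import numpy as np

# Hypothetical 3x4 example matrix (rank 2 by construction: row 3 = row 1 + row 2).
A = np.array([[1., 2., 0., 1.],
              [0., 1., 1., 0.],
              [1., 3., 1., 1.]])
m, n = A.shape
r = np.linalg.matrix_rank(A)

# Dimensions of the four fundamental subspaces of A:
dim_col_space  = r          # C(A)    in R^m
dim_row_space  = r          # R(A)    in R^n, equal to rank(A^T)
dim_null_space = n - r      # N(A)    in R^n
dim_left_null  = m - r      # N(A^T)  in R^m

# Rank–nullity (L.2): dim(domain) = dim(kernel) + dim(image)
assert n == dim_null_space + dim_col_space
assert np.linalg.matrix_rank(A.T) == r          # row rank equals column rank
assert np.linalg.matrix_rank(A.T @ A) == r      # rank(A^T A) = rank(A)

print(dict(m=m, n=n, rank=r,
           dim_CA=dim_col_space, dim_RA=dim_row_space,
           dim_NA=dim_null_space, dim_NAT=dim_left_null))
```

The same dimensions reappear in the Matrix rank section below: $\dim C(A) = \dim R(A) = r$, $\dim N(A) = n-r$, and $\dim N(A^\top) = m-r$.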
## PCA

> - **P.1** Explain intuitively, e.g. by text and figures, what the PCA procedure does step-by-step.
> - **P.2** What are the properties of the basis vectors of PCA?
> - **P.3** Why do we say PCA is used for dimensionality reduction? How?

PCA gives us a list of information dimensions (features) ordered by importance. Keeping the top of that list and dropping the dimensions at the bottom is exactly dimensionality reduction. Remember that dimensionality reduction always drops part of the information in your data; the amount dropped is measured by the reconstruction error.

> - **P.4** What is the objective function of PCA?
> - **P.5** Prove that the first component of PCA must be the mean vector.
> - **P.6** Using SVD, prove that PCA gives the best approximation of the dataset for a fixed budget of $k$ basis vectors.

## [SVD](https://en.wikipedia.org/wiki/Singular_value_decomposition): $\forall A_{m\times n} = U_{m\times m}\Sigma_{m\times n}V^{\top}_{n\times n}$

> - **S.1** Describe the geometric intuitions of SVD
>     - for a general matrix
>     - for a square matrix
>     - for a symmetric matrix
>     - for a positive definite matrix
> - **S.2** Use SVD to derive the inverse transform from $C(A)$ back to $R(A)$ in this figure
> ![](https://i.imgur.com/3pFuOI7.png =x650)

## Matrix rank

- $A_{m\times n}$ has $\mathsf{rank}(A) = r \leq \min(m,n)$.
- $\mathsf{rank}(A^\top A) = \mathsf{rank}(AA^\top) = r$.
    - **M.1** Prove this using the nullspace.
- Full-rank square matrices $\Leftrightarrow$ invertible.
- $r=n$
    - $\Leftrightarrow A$ has full column rank.
    - $\Leftrightarrow A$ is injective (or "one-to-one"): **M.2** **prove this**!
    - $\Rightarrow A$ is a tall (slim) matrix: $n\leq m$.
    - $\Leftrightarrow A^\top A$ is full rank.
    - $\Leftrightarrow A$ has a left inverse $A_{\text{left}}^{-1} = (A^\top A)^{-1}A^\top$.
- $r=m$
    - $\Leftrightarrow A$ has full row rank.
    - $\Leftrightarrow A$ is surjective (or "onto"): **M.3** **prove this**!
    - $\Rightarrow A$ is a wide (fat) matrix: $m\leq n$.
    - $\Leftrightarrow AA^\top$ is full rank.
    - $\Leftrightarrow A$ has a right inverse $A_{\text{right}}^{-1} = A^\top(AA^\top)^{-1}$.
- **M.4** Why do we call them left & right inverses?
- **M.5** Plot the 4 fundamental subspaces for each case
    - $r=n$
    - $r=m$
- **M.6** In case of **rank deficiency** (not full rank), use the inverse transformations by SVD above to find the solutions of [linear least square regression](https://hackmd.io/@COTAI/LinearRegression).
- **M.7** Prove that for the case $m<n$ the solution $\boldsymbol \beta^*$ has the smallest norm: $\|\boldsymbol \beta_{\text{least-square}}\| \geq \|\boldsymbol \beta^*\|$ (see the numerical sketch at the end of this document).

### 3h, free-style questions so answer as much as you know. Enjoy! :heart_eyes_cat:
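Finally, a worked sketch for **M.6** and **M.7**: for a wide system ($m<n$) the SVD, equivalently the pseudoinverse, picks the exact solution of smallest norm. The matrix `X` and vector `y` below are randomly generated illustrations, not data from the linked regression notes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical wide system (m < n): infinitely many exact solutions beta with X @ beta = y.
m, n = 3, 6
X = rng.normal(size=(m, n))
y = rng.normal(size=m)

# Minimum-norm solution via SVD: X = U @ Sigma @ Vt, beta* = V_r @ diag(1/s_r) @ U_r^T @ y
U, s, Vt = np.linalg.svd(X)
r = int(np.sum(s > 1e-12))                           # numerical rank
beta_star = Vt[:r].T @ ((U[:, :r].T @ y) / s[:r])

assert np.allclose(X @ beta_star, y)                 # it solves the system exactly
assert np.allclose(beta_star, np.linalg.pinv(X) @ y) # same as the pseudoinverse solution

# Any other exact solution is beta* plus a null-space component z; since z is
# orthogonal to beta* (row space ⟂ null space), its norm can only be larger (M.7).
z = Vt[r:].T @ rng.normal(size=n - r)                # arbitrary vector in N(X)
beta_other = beta_star + z
assert np.allclose(X @ beta_other, y)
assert np.linalg.norm(beta_other) >= np.linalg.norm(beta_star)
print(np.linalg.norm(beta_star), np.linalg.norm(beta_other))
```

The inequality $\|\boldsymbol\beta_{\text{least-square}}\| \geq \|\boldsymbol\beta^*\|$ follows from the orthogonality of the row space and the null space: $\|\boldsymbol\beta^* + \mathbf{z}\|^2 = \|\boldsymbol\beta^*\|^2 + \|\mathbf{z}\|^2$.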