# HW2 Handwritten Assignment (Revised 2022.10.21)

## Problem 1 (1%)

Consider a generative classification model with $K$ classes $C_1,\dots,C_K$ described by prior probabilities $p(C_k) = \pi_k$ and class-conditional probabilities $p(\mathbf{x}|C_k)$, where $\mathbf{x}$ is the input feature vector. Suppose we are given a training data set $((\mathbf{x}_i, \mathbf{t}_i))_{i=1}^N$, where $\mathbf{t}_i = \left( t_i^1,\dots,t_i^K \right) \in \{0,1\}^K$ is the 1-of-K encoding given by
$$t_i^k = \left\{\begin{array}{cl}1 & \mbox{, if the $i$-th pattern is from class $C_k$}\\ 0 & \mbox{, otherwise}\end{array}\right.$$
Assume the data points are drawn independently from this model. Show that the maximum-likelihood solution for the prior probabilities is given by
$$\pi_k =\frac{N_k}{N},$$
where $N_k$ is the number of data points belonging to class $C_k$.

---

## Problem 2 (1%)

a. (0.3%) Given $\mathbf{w}\in \mathbb{R}^m$ and $A\in\mathbb{R}^{m\times m}$, where $\mathbf{w}$ does not depend on $A$. Show that
$$ \frac{\partial\ \mathbf{w}^TA\mathbf{w}}{\partial\ \mathbf{w}} = A^T\mathbf{w} + A\mathbf{w}. $$
In particular, if $A$ is a symmetric matrix, then
$$ \frac{\partial\ \mathbf{w}^TA\mathbf{w}}{\partial \mathbf{w}} = 2A\mathbf{w}. $$

b. (0.3%) Given $A, B\in \mathbb{R}^{m\times m}$. Show that
$$ \frac{\partial\ tr(AB)}{\partial a_{ij}} = b_{ji}, $$
where
$$ A=\left[\begin{array}{cccc} a_{11} & a_{12} & \cdots & a_{1 m} \\ a_{21} & a_{22} & \cdots & a_{2 m} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m 1} & a_{m 2} & \cdots & a_{m m} \end{array}\right] $$
and
$$ B=\left[\begin{array}{cccc} b_{11} & b_{12} & \cdots & b_{1 m} \\ b_{21} & b_{22} & \cdots & b_{2 m} \\ \vdots & \vdots & \ddots & \vdots \\ b_{m 1} & b_{m 2} & \cdots & b_{m m} \end{array}\right]. $$

c. (0.4%) Show that
$$\frac{\partial \log(\det(\boldsymbol{\Sigma}))}{\partial \sigma_{ij}} = \mathbf{e}_j^{\color{red}{T}} \boldsymbol{\Sigma}^{-1}\space \mathbf{e}_i,$$
where $\boldsymbol{\Sigma} = \left[\begin{array}{cccc} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1 m} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2 m} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{m 1} & \sigma_{m 2} & \cdots & \sigma_{m m} \end{array}\right] \in \mathbb{R}^{m\times m}$ is a (non-singular) covariance matrix and $\mathbf{e}_j$ is the unit vector along the $j$-th axis (e.g. $\mathbf{e}_3=[0,0,1,0,\dots,0]^T$).

- Hint: ![](https://i.imgur.com/gq5ylwQ.jpg)
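The three identities above can be sanity-checked numerically before you prove them. Below is a minimal sketch (not part of the required solution; it assumes NumPy and perturbs one arbitrarily chosen entry) comparing one-sided finite differences against the closed forms of parts a–c:

```python
import numpy as np

rng = np.random.default_rng(0)
m, eps = 4, 1e-6
A = rng.standard_normal((m, m))
B = rng.standard_normal((m, m))
w = rng.standard_normal(m)
i, j = 1, 2                                    # an arbitrary entry to perturb

# (a) d(w^T A w)/dw should equal A^T w + A w.
num_grad = np.array([((w + eps * e) @ A @ (w + eps * e) - w @ A @ w) / eps
                     for e in np.eye(m)])
print(np.allclose(num_grad, A.T @ w + A @ w, atol=1e-4))

# (b) d tr(AB)/d a_ij should equal b_ji.
A_pert = A.copy()
A_pert[i, j] += eps
print(np.isclose((np.trace(A_pert @ B) - np.trace(A @ B)) / eps, B[j, i], atol=1e-4))

# (c) d log det(Sigma)/d sigma_ij should equal e_j^T Sigma^{-1} e_i
#     (the entries of Sigma are treated as independent variables).
S = rng.standard_normal((m, m))
Sigma = S @ S.T + m * np.eye(m)                # a non-singular covariance matrix
Sigma_pert = Sigma.copy()
Sigma_pert[i, j] += eps
num_c = (np.log(np.linalg.det(Sigma_pert)) - np.log(np.linalg.det(Sigma))) / eps
print(np.isclose(num_c, np.linalg.inv(Sigma)[j, i], atol=1e-4))
```

Each check should print `True`; this is only a numerical plausibility check, not a substitute for the derivations.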
---

## Problem 3 (1%)

Consider the classification model in **Problem 1** as well as the results in **Problem 2**, and suppose that the class-conditional probability density functions are given by Gaussian distributions with a shared covariance matrix, namely
$$p(\mathbf{x}|C_k)= \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}_k,\boldsymbol{\Sigma}).$$
Show that the maximum-likelihood solution for the mean of the Gaussian distribution for class $C_k$ is given by
$$\boldsymbol{\mu}_k = \frac{1}{N_k}\sum_{i=1}^N t_i^k \mathbf{x}_i.$$
In other words, $\boldsymbol{\mu}_k$ is the mean of the feature vectors belonging to class $C_k$.

Similarly, show that the maximum-likelihood solution for the shared covariance matrix is given by
$$\boldsymbol{\Sigma} = \sum_{k=1}^K \frac{N_k}{N}\boldsymbol{\Sigma}_k,$$
where $\boldsymbol{\Sigma}_k$ is the covariance matrix of the data points associated with class $C_k$:
$$\boldsymbol{\Sigma}_k = \frac{1}{N_k}\sum_{i=1}^N t_i^k(\mathbf{x}_i-\boldsymbol{\mu}_k)(\mathbf{x}_i-\boldsymbol{\mu}_k)^T.$$
In other words, $\boldsymbol{\Sigma}$ is a weighted average of the covariance matrices of the individual classes, weighted by their corresponding prior probabilities.

- Hint: Given $A, B, C\in\mathbb{R}^{m\times m}$, the following properties hold:
    - The trace is invariant under cyclic permutations of matrix products: $tr(ABC) = tr(CAB) = tr(BCA)$.
    - If $x^T A x$ is a scalar, then $x^T A x = tr(x^T A x)$.

---

## Problem 4 (Convergence of K-means Clustering) (1%)

In the K-means clustering algorithm, we are given a set of $n$ points $x_i \in \mathbb{R}^d, i \in\{1, \ldots, n\}$, and we want to find the centers of $k$ clusters $\boldsymbol{\mu}=\left(\mu_1, \ldots, \mu_k\right)$ by minimizing the average distance from the points to the closest cluster center. In general, $n\geq k$. Define the function $\mathcal{C} :\{1, \cdots, n\}\rightarrow \{1, 2,\cdots, k\}$ that assigns one of the $k$ clusters to each point in the data set, i.e. $\mathcal{C}(i) = q$ if the $i$-th data point $x_i$ is assigned to the $q$-th cluster, where $i\in\{1, 2, \cdots, n\}$ and $q\in\{1, 2, \cdots, k\}$.

Formally, we want to minimize the following loss function
$$ L(\mathcal{C}, \boldsymbol{\mu})=\sum_{i=1}^n\left\|x_i-\mu_{\mathcal{C}(i)}\right\|_2^2 = \sum_{q=1}^{k}\sum_{i:\mathcal{C}(i)=q}\left\|x_i-\mu_{q}\right\|_2^2 $$

The K-means algorithm: initialize the cluster centers $\mu_j,\ j = 1, 2, \cdots, k$ ($k$ points chosen at random from the data set), then repeat:

1. Fix $\boldsymbol{\mu}$ and update $\mathcal{C}(i)$ for each $i$ so as to minimize $L$. Formally, consider a data point $x_i$, let $\mathcal{C}(i)$ be the assignment from the previous iteration, and let $\mathcal{C}^*(i)$ be the new assignment obtained as $$\mathcal{C}^*(i)= \arg\min_{j=1, \cdots, k} \Vert x_i - \mu_j\Vert^2_2.$$
2. Fix $\mathcal{C}$ and update the centers $\mu_j$ so that $$\left|\left\{i: \mathcal{C}(i)=j\right\}\right|\mu_j= \sum_{i: \mathcal{C}(i)=j} x_i$$ holds for each $j$, where $\left|\left\{i: \mathcal{C}(i)=j\right\}\right|$ is the number of elements of the set $\left\{i: \mathcal{C}(i)=j\right\}$ (i.e. set each cluster center to the mean of the points assigned to it).

The algorithm stops when the loss function does not change during the assignment step.
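The two steps above translate almost directly into code. The following minimal NumPy sketch is not part of the required solution; the function name `kmeans`, the empty-cluster handling, and the stopping test on the assignments are choices made for this sketch rather than details specified by the assignment. It records the loss after every iteration, which makes the behaviour studied in parts (b)–(e) below easy to observe.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain K-means on the rows of X (an n x d array).

    Returns the centers, the assignment C, and the loss L(C, mu)
    recorded after every full (assignment + update) iteration."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Initialization: k points chosen at random from the data set.
    mu = X[rng.choice(n, size=k, replace=False)].astype(float)
    C = np.zeros(n, dtype=int)
    losses = []
    for _ in range(n_iter):
        # Step 1 (assignment): C(i) = argmin_j ||x_i - mu_j||^2 with mu fixed.
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # shape (n, k)
        C_new = dists.argmin(axis=1)
        # Step 2 (update): each center becomes the mean of the points assigned to it.
        for j in range(k):
            members = X[C_new == j]
            if len(members) > 0:              # keep the old center if a cluster is empty
                mu[j] = members.mean(axis=0)
        losses.append(float(((X - mu[C_new]) ** 2).sum()))
        if np.array_equal(C_new, C):          # assignments stable -> loss can no longer change
            break
        C = C_new
    return mu, C, losses

# Tiny usage example on synthetic data: the recorded losses are non-increasing.
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(c, 0.5, size=(50, 2)) for c in (-3, 0, 3)])
    _, _, losses = kmeans(X, k=3)
    print(losses)
    assert all(b <= a + 1e-9 for a, b in zip(losses, losses[1:]))
```

Printed out, `losses` should be a non-increasing sequence that becomes constant after finitely many iterations; this is exactly the behaviour that parts (b)–(e) below ask you to prove.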
Suppose that the algorithm proceeds from iteration $t$ to $t+1$.

a. Consider points $z_1, z_2, \cdots, z_m$, where $m\geq 1$ and $z_i\in \mathbb{R}^d$ for $i\in\{1, 2, \cdots, m\}$. Let $\bar{z} = \frac{1}{m}\sum_{i=1}^m z_i$ be the mean of these points, and let $z\in \mathbb{R}^d$ be an arbitrary point in the same ($d$-dimensional) space. Show that $$ \sum_{i=1}^m \Vert z_i - z\Vert^2_2 \geq \sum_{i=1}^m \Vert z_i - \bar{z}\Vert^2_2. $$

b. Show that $L(\mathcal{C}^{t+1}, \boldsymbol{\mu}^{t}) \leq L(\mathcal{C}^{t}, \boldsymbol{\mu}^{t})$, i.e. the assignment step (step 1) of K-means clustering does not increase the loss.

c. Show that $L(\mathcal{C}^{t+1}, \boldsymbol{\mu}^{t+1})\leq L(\mathcal{C}^{t+1}, \boldsymbol{\mu}^{t})$, i.e. the update step (step 2) of K-means clustering does not increase the loss.
    - Hint: Use the result of (a).

d. Use the results in (b) and (c) to show that the loss of the K-means clustering algorithm is monotonically decreasing.
    - Hint: Show that the sequence $\{l_t\}$, where $l_t = L(\mathcal{C}^{t}, \boldsymbol{\mu}^{t})$, is monotonically decreasing ($l_{t+1}\leq l_t, \forall t$) and bounded below ($l_t\geq 0$). Then, by the monotone convergence theorem for sequences, $\{l_t\}$ converges.

e. Show that the K-means clustering algorithm converges in finitely many steps.
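For reference, the monotonicity used in the hint for part (d) is exactly what you obtain by chaining the statements of (b) and (c) for consecutive iterations:
$$ l_{t+1} = L(\mathcal{C}^{t+1}, \boldsymbol{\mu}^{t+1}) \leq L(\mathcal{C}^{t+1}, \boldsymbol{\mu}^{t}) \leq L(\mathcal{C}^{t}, \boldsymbol{\mu}^{t}) = l_t. $$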