# HW2 Answer

## Problem 1

Denote $\pi = (\pi_1, \cdots, \pi_K)$. The probability of one data point $(\mathbf{x}_n, \mathbf{t}_n)$ is
\begin{equation*}
p(\mathbf{x}_n, \mathbf{t}_n) = p(\mathbf{x}_n\vert \mathbf{t}_n)p(\mathbf{t}_n) = \prod_{k = 1}^{K}(p(\mathbf{x}_n\vert C_k)\pi_k)^{t_{n}^k}
\end{equation*}
So the likelihood function of the whole dataset is given by
\begin{equation*}
p(\mathbf{x}_n, \mathbf{t}_n\vert \pi) = \prod_{n = 1}^{N}\prod_{k=1}^{K} (p(\mathbf{x}_n\vert C_k)\pi_k)^{t_{n}^k}
\end{equation*}
and taking the logarithm, we get
\begin{equation*}
\log p(\mathbf{x}_n, \mathbf{t}_n\vert \pi) = \sum_{n = 1}^{N}\sum_{k=1}^{K} t_{n}^k(\log p(\mathbf{x}_n\vert C_k)+ \log \pi_k)
\end{equation*}
Note that $\log$ denotes the natural logarithm.

Now we can formalize our maximum likelihood problem as a constrained optimization problem:
\begin{equation*}
\begin{aligned}
& \underset{\pi}{\text{max}} & & \log p(\mathbf{x}_n, \mathbf{t}_n\vert \pi)\\
& \text{subject to} & & \sum_{k = 1}^{K}\pi_k = 1
\end{aligned}
\end{equation*}
We introduce a Lagrange multiplier $\lambda$ and maximize
\begin{align*}
\mathcal{L}(\pi, \lambda) &= \log p(\mathbf{x}_n, \mathbf{t}_n\vert \pi) + \lambda\Big(\sum_{k = 1}^{K}\pi_k - 1\Big)
\end{align*}
Taking the derivative with respect to $\pi_k$ and setting it to 0, we have
\begin{align*}
\frac{\partial}{\partial\pi_k}\mathcal{L}(\pi, \lambda) &= \frac{\partial}{\partial\pi_k} \Big\{\sum_{n = 1}^{N}\sum_{k=1}^{K} t_{n}^k(\log p(\mathbf{x}_n\vert C_k)+ \log \pi_k) + \lambda\Big(\sum_{k = 1}^{K}\pi_k - 1\Big)\Big\}\\
&= \frac{1}{\pi_k}\sum_{n=1}^N t_{n}^k + \lambda = 0\\
&\Rightarrow \pi_k = -\frac{1}{\lambda}\sum_{n=1}^N t_{n}^k = -\frac{N_k}{\lambda}
\end{align*}
where $N_k$ is the number of data points whose label is class $k$. Taking the derivative with respect to $\lambda$, we have
\begin{align*}
&\frac{\partial}{\partial \lambda}\mathcal{L}(\pi, \lambda) = \sum_{k = 1}^{K}\pi_k - 1 = 0\\
&\Rightarrow \sum_{k = 1}^{K}\pi_k = 1
\end{align*}
Combining the two equations, we get
\begin{equation*}
\sum_{k = 1}^K \pi_k = \sum_{k = 1}^K -\frac{N_k}{\lambda} = -\frac{N}{\lambda} = 1 \Rightarrow \lambda = -N
\end{equation*}
Finally, substituting $\lambda = -N$ back, we can solve for the $\pi_k$'s:
\begin{equation*}
\pi_k = \frac{N_k}{N}
\end{equation*}

## Problem 2

1. Consider $f(\mathbf{w}) = \mathbf{w}^T A \mathbf{w}$, so that $f(\mathbf{w}+h) = (\mathbf{w} + h)^T A (\mathbf{w} + h)$. Then,
$$
\begin{aligned}
f(\mathbf{w}+h) - f(\mathbf{w}) &= \mathbf{w}^T A\mathbf{w} + h^T A \mathbf{w} + \mathbf{w}^TA h + h^T A h - \mathbf{w}^T A \mathbf{w}\\
&= h^T A \mathbf{w} + \mathbf{w}^TA h + h^T A h\\
&= h^T(A^T \mathbf{w} + A\mathbf{w}) + h^T A h\\
&= (A^T \mathbf{w} + A\mathbf{w})\cdot h + h^T A h
\end{aligned}
$$
Since the remainder $h^T A h$ is of order $\|h\|^2$, by definition $\frac{\partial\ \mathbf{w}^TA\mathbf{w}}{\partial\ \mathbf{w}} = A^T\mathbf{w} + A\mathbf{w}$. In particular, if $A$ is a symmetric matrix, i.e. $A = A^T$, then $\frac{\partial\ \mathbf{w}^TA\mathbf{w}}{\partial\ \mathbf{w}} = 2A\mathbf{w}$.

2. Define $C = AB$ and note that $c_{ij} := \sum_{k=1}^m a_{ik}b_{kj}$, where $c_{ij}$ is the entry in the $i$-th row and $j$-th column of $C$, i.e. the dot product of the $i$-th row of $A$ and the $j$-th column of $B$. Then,
$$
tr(AB) = tr(C) = \sum_{l=1}^m c_{ll} = \sum_{l=1}^m\sum_{k=1}^m a_{lk}b_{kl}
$$
The only term in this double sum that involves $a_{ij}$ is $a_{ij}b_{ji}$. Hence,
$$
\begin{aligned}
\frac{\partial\ tr(AB)}{\partial a_{ij}} &= \frac{\partial\ \sum_{l=1}^m\sum_{k=1}^m a_{lk}b_{kl}}{\partial a_{ij}} = b_{ji}
\end{aligned}
$$
i.e. $\frac{\partial\ tr(AB)}{\partial A} = B^T$.

3. Follow the hint.
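As a quick sanity check (not required by the problem), both identities above can be verified numerically with central finite differences. The snippet below is a minimal sketch assuming NumPy and randomly generated $A$, $B$, and $\mathbf{w}$; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 4
A = rng.normal(size=(m, m))
B = rng.normal(size=(m, m))
w = rng.normal(size=m)
eps = 1e-6

# Identity 1: d(w^T A w)/dw = A^T w + A w
f = lambda v: v @ A @ v
num_grad = np.array([(f(w + eps * e) - f(w - eps * e)) / (2 * eps)
                     for e in np.eye(m)])
assert np.allclose(num_grad, A.T @ w + A @ w, atol=1e-4)

# Identity 2: d tr(AB)/da_ij = b_ji, i.e. d tr(AB)/dA = B^T
g = lambda M: np.trace(M @ B)
num_grad_A = np.zeros((m, m))
for i in range(m):
    for j in range(m):
        E = np.zeros((m, m)); E[i, j] = eps   # perturb a single entry a_ij
        num_grad_A[i, j] = (g(A + E) - g(A - E)) / (2 * eps)
assert np.allclose(num_grad_A, B.T, atol=1e-4)
```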
## Problem 3

By assumption, we know
\begin{align*}
& p\left(x_n \vert C_{k}\right)=\frac{1}{(2 \pi)^{D / 2}|\boldsymbol{\Sigma}|^{1 / 2}} \exp \left(-\frac{1}{2}\left(x_n-\boldsymbol{\mu}_{k}\right)^{T} \boldsymbol{\Sigma}^{-1}\left(x_n-\boldsymbol{\mu}_{k}\right)\right)\\
& p(x_n, t_n) = p(x_n\vert t_n)p(t_n) = \prod_{k = 1}^{K}(p(x_n\vert C_k)\pi_k)^{t_{n}^k}
\end{align*}
where $D$ is the dimension of $x$.

The log-likelihood function is given by
\begin{align*}
&\log p(x_n, t_n\vert \pi_k, \boldsymbol{\mu_k}, \boldsymbol{\Sigma}) = \sum_{n = 1}^{N}\sum_{k=1}^{K} t_{n}^k(\log p(x_n\vert C_k)+\log \pi_k)\\
&= \sum_{n = 1}^{N}\sum_{k=1}^{K} t_{n}^k\Big(\log \pi_k -\frac{1}{2}\left(x_n-\boldsymbol{\mu}_{k}\right)^{T} \boldsymbol{\Sigma}^{-1}\left(x_n-\boldsymbol{\mu}_{k}\right) -\frac{1}{2}\log \vert\boldsymbol\Sigma\vert - \frac{D}{2}\log 2\pi\Big)
\end{align*}
We introduce a Lagrange multiplier $\lambda$ and maximize
\begin{align*}
\mathcal{L}(\pi, \boldsymbol{\mu_k}, \boldsymbol{\Sigma}, \lambda) &= \log p(x_n, t_n\vert \pi_k, \boldsymbol{\mu_k}, \boldsymbol{\Sigma}) + \lambda\Big(\sum_{k = 1}^{K}\pi_k - 1\Big)
\end{align*}
Taking the derivative with respect to $\pi_k$ and setting it to 0, we have (following Problem 1)
\begin{align*}
\pi_k = \frac{N_k}{N}
\end{align*}
Taking the derivative with respect to $\boldsymbol{\mu_k}$ and setting it to 0, we have
\begin{align*}
\frac{\partial}{\partial \boldsymbol{\mu_k}}\mathcal{L}(\pi, \boldsymbol{\mu_k}, \boldsymbol{\Sigma}, \lambda) &= \sum_{n = 1}^N t_{n}^k\boldsymbol\Sigma^{-1}(x_n-\boldsymbol{\mu_k})\\
&= \boldsymbol\Sigma^{-1}\sum_{n=1}^N t_{n}^k(x_n-\boldsymbol{\mu_k}) = 0
\end{align*}
Since $\boldsymbol{\Sigma}^{-1}$ is positive definite (hence invertible), we get
\begin{equation*}
\begin{aligned}
&\sum_{n = 1}^N t_{n}^k(x_n - \boldsymbol{\mu_k}) = \sum_{n=1}^N t_{n}^kx_n - \boldsymbol{\mu_k}\sum_{n=1}^{N}t_{n}^k = 0\\
&\Rightarrow \boldsymbol{\mu_k} = \frac{\sum_{n = 1}^N t_{n}^k x_n}{\sum_{n=1}^N t_{n}^k} = \frac{\sum_{n=1}^N t_{n}^k x_n}{N_k}
\end{aligned}
\end{equation*}
Dropping the terms that do not depend on $\boldsymbol{\Sigma}$, we rewrite the log-likelihood as
\begin{align*}
\mathcal{L}(\pi, \boldsymbol{\mu_k}, \boldsymbol{\Sigma}, \lambda) &= \sum_{n=1}^N\sum_{k=1}^K t_{n}^k\Big(-\frac{1}{2}\left(x_n-\boldsymbol{\mu}_{k}\right)^{T} \boldsymbol{\Sigma}^{-1}\left(x_n-\boldsymbol{\mu}_{k}\right) - \frac{1}{2}\log\vert\boldsymbol{\Sigma}\vert\Big) + \text{const}\\
&=\sum_{n=1}^N\sum_{k=1}^K t_{n}^k\Big(-\frac{1}{2}tr\big\{\left(x_n-\boldsymbol{\mu}_{k}\right)^{T} \boldsymbol{\Sigma}^{-1}\left(x_n-\boldsymbol{\mu}_{k}\right)\big\} + \frac{1}{2}\log\vert\boldsymbol{\Sigma}^{-1}\vert\Big) + \text{const}\\
&=\sum_{n=1}^N\sum_{k=1}^K t_{n}^k\Big(-\frac{1}{2}tr\big\{\boldsymbol{\Sigma}^{-1}\left(x_n-\boldsymbol{\mu}_{k}\right) \left(x_n-\boldsymbol{\mu}_{k}\right)^{T}\big\} + \frac{1}{2}\log\vert\boldsymbol{\Sigma}^{-1}\vert\Big) + \text{const}
\end{align*}
where the last step uses the cyclic property of the trace. Taking the derivative with respect to $\boldsymbol{\Sigma}^{-1}$ and setting it to 0, we get
\begin{align*}
\frac{\partial}{\partial \boldsymbol{\Sigma}^{-1}}\mathcal{L}(\pi, \boldsymbol{\mu_k}, \boldsymbol{\Sigma}, \lambda) &= -\frac{1}{2}\sum_{n=1}^N\sum_{k=1}^K \Big(t_{n}^k\left(x_n-\boldsymbol{\mu}_{k}\right) \left(x_n-\boldsymbol{\mu}_{k}\right)^{T} - t_{n}^k\boldsymbol\Sigma^T\Big)\\
&= -\frac{1}{2}\Big(\sum_{k = 1}^K\sum_{n = 1}^N t_{n}^k\left(x_n-\boldsymbol{\mu}_{k}\right) \left(x_n-\boldsymbol{\mu}_{k}\right)^{T} - \sum_{k = 1}^K\sum_{n = 1}^Nt_{n}^k\boldsymbol\Sigma\Big) = 0
\end{align*}
using $\frac{\partial\ \log\vert X\vert}{\partial X} = (X^{-1})^T$, the result of Problem 2, and the symmetry of $\boldsymbol\Sigma$. Hence,
\begin{equation*}
\begin{aligned}
&\sum_{k = 1}^K\sum_{n = 1}^Nt_{n}^k\boldsymbol\Sigma = \sum_{k = 1}^K\sum_{n = 1}^N t_{n}^k\left(x_n-\boldsymbol{\mu}_{k}\right) \left(x_n-\boldsymbol{\mu}_{k}\right)^{T}\\
&\Rightarrow N\boldsymbol\Sigma = \sum_{k = 1}^K N_k \boldsymbol{\Sigma}_k \Rightarrow {\boldsymbol\Sigma}=\sum_{k=1}^K\frac{N_k}{N}\boldsymbol{\Sigma}_k
\end{aligned}
\end{equation*}
where $\boldsymbol{\Sigma}_k := \frac{1}{N_k}\sum_{n=1}^N t_{n}^k\left(x_n-\boldsymbol{\mu}_{k}\right)\left(x_n-\boldsymbol{\mu}_{k}\right)^{T}$ is the sample covariance of class $k$.

Note:
1. By the invariance property of the MLE, we may take the derivative of $\mathcal{L}$ with respect to $\boldsymbol{\Sigma}^{-1}$ rather than $\boldsymbol{\Sigma}$.
2. If you prefer to take the derivative of $\mathcal{L}$ with respect to $\boldsymbol{\Sigma}$ directly, you would apply the fact that $\frac{\partial\ tr\left(X^{-1}M\right)}{\partial X}=-\left(X^{-1}MX^{-1}\right)$.
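The closed-form estimators derived above (including $\pi_k = N_k/N$ from Problem 1) can be computed directly from data. The following is a minimal sketch, assuming NumPy, a data matrix `X` of shape $(N, D)$, and a one-hot label matrix `T` of shape $(N, K)$; the function name `gda_mle` is made up for illustration.

```python
import numpy as np

def gda_mle(X, T):
    """MLE for the shared-covariance Gaussian class-conditional model above.

    X: (N, D) data matrix, T: (N, K) one-hot label matrix (every class appears at least once).
    Returns priors pi (K,), class means mu (K, D), and shared covariance Sigma (D, D).
    """
    N, D = X.shape
    N_k = T.sum(axis=0)               # N_k = sum_n t_n^k
    pi = N_k / N                      # pi_k = N_k / N        (Problem 1)
    mu = (T.T @ X) / N_k[:, None]     # mu_k = (1/N_k) sum_n t_n^k x_n
    Sigma = np.zeros((D, D))
    for k in range(T.shape[1]):
        diff = X - mu[k]              # x_n - mu_k for all n
        Sigma += (T[:, k:k + 1] * diff).T @ diff   # N_k * Sigma_k
    Sigma /= N                        # Sigma = sum_k (N_k / N) Sigma_k
    return pi, mu, Sigma
```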
## Problem 4

1.
$$
\begin{aligned}
\sum_{i=1}^m\left\|\boldsymbol{z}^i-\boldsymbol{z}\right\|^2 &=\sum_{i=1}^m\left\|\left(\boldsymbol{z}^i-\overline{\boldsymbol{z}}\right)+(\overline{\boldsymbol{z}}-\boldsymbol{z})\right\|^2 \\
&=\sum_{i=1}^m\left(\left\|\boldsymbol{z}^i-\overline{\boldsymbol{z}}\right\|^2+\|\overline{\boldsymbol{z}}-\boldsymbol{z}\|^2+2\left(\boldsymbol{z}^i-\overline{\boldsymbol{z}}\right) \cdot(\overline{\boldsymbol{z}}-\boldsymbol{z})\right) \\
&=\sum_{i=1}^m\left\|\boldsymbol{z}^i-\overline{\boldsymbol{z}}\right\|^2+\sum_{i=1}^m\|\overline{\boldsymbol{z}}-\boldsymbol{z}\|^2+2 \sum_{i=1}^m\left(\boldsymbol{z}^i \cdot \overline{\boldsymbol{z}}-\boldsymbol{z}^i \cdot \boldsymbol{z}-\overline{\boldsymbol{z}} \cdot \overline{\boldsymbol{z}}+\overline{\boldsymbol{z}} \cdot \boldsymbol{z}\right) \\
&=\sum_{i=1}^m\left\|\boldsymbol{z}^i-\overline{\boldsymbol{z}}\right\|^2+m\|\overline{\boldsymbol{z}}-\boldsymbol{z}\|^2+2(m \overline{\boldsymbol{z}} \cdot \overline{\boldsymbol{z}}-m \overline{\boldsymbol{z}} \cdot \boldsymbol{z}-m \overline{\boldsymbol{z}} \cdot \overline{\boldsymbol{z}}+m \overline{\boldsymbol{z}} \cdot \boldsymbol{z}) \\
&=\sum_{i=1}^m\left\|\boldsymbol{z}^i-\overline{\boldsymbol{z}}\right\|^2+m\|\overline{\boldsymbol{z}}-\boldsymbol{z}\|^2 \\
& \geq \sum_{i=1}^m\left\|\boldsymbol{z}^i-\overline{\boldsymbol{z}}\right\|^2 .
\end{aligned}
$$

2. It follows directly from the logic of the algorithm: $\mathcal{C}^t$ and $\mathcal{C}^{t+1}$ are different only if there is a point that finds a closer cluster center in $\boldsymbol{\mu}^t$ than the one assigned to it by $\mathcal{C}^t$:
$$
L\left(\mathcal{C}^{t+1}, \boldsymbol{\mu}^t\right)=\sum_{i=1}^n\left\|\boldsymbol{x}^i-\boldsymbol{\mu}_{C^{t+1}(i)}^t\right\|^2<\sum_{i=1}^n\left\|\boldsymbol{x}^i-\boldsymbol{\mu}_{C^t(i)}^t\right\|^2=L\left(\mathcal{C}^t, \boldsymbol{\mu}^t\right)
$$

3. Use the result in (1): within each cluster, the new center $\boldsymbol{\mu}_{k'}^{t+1}$ is the mean of the points assigned to that cluster by $\mathcal{C}^{t+1}$, so it minimizes the within-cluster sum of squared distances. Hence,
$$
\begin{aligned}
L\left(\mathcal{C}^{t+1}, \boldsymbol{\mu}^{t+1}\right) &=\sum_{i=1}^n\left\|\boldsymbol{x}^i-\boldsymbol{\mu}_{C^{t+1}(i)}^{t+1}\right\|^2 \\
&=\sum_{k^{\prime}=1}^k \sum_{i \in\{1,2, \ldots, n\}, \mathcal{C}^{t+1}(i)=k^{\prime}}\left\|\boldsymbol{x}^i-\boldsymbol{\mu}_{C^{t+1}(i)}^{t+1}\right\|^2 \\
& \leq \sum_{k^{\prime}=1}^k \sum_{i \in\{1,2, \ldots, n\}, \mathcal{C}^{t+1}(i)=k^{\prime}}\left\|\boldsymbol{x}^i-\boldsymbol{\mu}_{C^{t+1}(i)}^t\right\|^2 \quad(\text{by } 1)\\
&=\sum_{i=1}^n\left\|\boldsymbol{x}^i-\boldsymbol{\mu}_{C^{t+1}(i)}^t\right\|^2 \\
&=L\left(\mathcal{C}^{t+1}, \boldsymbol{\mu}^t\right) .
\end{aligned}
$$

4. Define the sequence $\{l_t\}$, where $l_t = L(\mathcal{C}^t, \boldsymbol{\mu}^t)$. By the previous results, we have
$$
l_{t+1} = L(\mathcal{C}^{t+1}, \boldsymbol{\mu}^{t+1}) \leq L(\mathcal{C}^{t+1}, \boldsymbol{\mu}^{t}) \leq L(\mathcal{C}^{t}, \boldsymbol{\mu}^{t}) = l_t
$$
for all $t$. Hence, $\{l_t\}$ is a monotonically non-increasing sequence, and it is bounded below by $0$.
Note that we apply the **monotone convergence theorem for sequences** to conclude that $\{l_t\}$ converges; this does not guarantee that the algorithm finds the **global** minimum, only a **local** one. A small numerical illustration of this monotone decrease is sketched after this problem.

5. There are at most $k^N$ ways to partition $N$ data points into $k$ clusters, and by (2) and (3) the loss strictly decreases whenever the assignment changes, so no assignment can be revisited. Therefore, the algorithm converges in finitely many steps. Note that the upper bound ($k^N$) may not be tight.
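The illustration referenced above is a minimal NumPy sketch of the k-means update analyzed in this problem (assignment step $\mathcal{C}^{t+1}$ from $\boldsymbol{\mu}^t$, then mean step $\boldsymbol{\mu}^{t+1}$ from $\mathcal{C}^{t+1}$). The function name, synthetic data, and parameters are made up for the example; the final assertion checks that the recorded losses $l_t$ never increase, matching parts 2 through 4.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain k-means; returns the centers and the loss l_t recorded after each iteration."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)]          # initial centers mu^0
    losses = []
    for _ in range(iters):
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # squared distances (n, k)
        assign = d2.argmin(axis=1)                                  # assignment step: C^{t+1}
        mu = np.array([X[assign == j].mean(axis=0) if np.any(assign == j) else mu[j]
                       for j in range(k)])                          # mean step: mu^{t+1}
        losses.append(((X - mu[assign]) ** 2).sum())                # l_{t+1} = L(C^{t+1}, mu^{t+1})
    return mu, losses

X = np.random.default_rng(1).normal(size=(300, 2))
_, losses = kmeans(X, k=3)
# The sequence of losses is non-increasing (up to floating-point tolerance).
assert all(a >= b - 1e-9 for a, b in zip(losses, losses[1:]))
```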