# HW2 Answer
## Problem 1
Denote $\pi = (\pi_1, \cdots, \pi_K)$.
The joint probability of a single data point $\mathbf{x}_n$ and its label $\mathbf{t}_n$ is
\begin{equation*}
p(\mathbf{x}_n, \mathbf{t}_n) = p(\mathbf{x}_n\vert \mathbf{t}_n)p(\mathbf{t}_n) = \prod_{k = 1}^{K}(p(\mathbf{x}_n\vert C_k)\pi_k)^{t_{n}^k}
\end{equation*}
So the likelihood function of the whole data set $\mathbf{X} = \{\mathbf{x}_n\}_{n=1}^N$, $\mathbf{T} = \{\mathbf{t}_n\}_{n=1}^N$ is given by
\begin{equation*}
p(\mathbf{X}, \mathbf{T}\vert \pi) = \prod_{n = 1}^{N}\prod_{k=1}^{K} (p(\mathbf{x}_n\vert C_k)\pi_k)^{t_{n}^k}
\end{equation*}
\end{equation*}
and taking the logarithm, we get
\begin{equation*}
\log p(\mathbf{X}, \mathbf{T}\vert \pi) = \sum_{n = 1}^{N}\sum_{k=1}^{K} t_{n}^k(\log p(\mathbf{x}_n\vert C_k)+ \log \pi_k)
\end{equation*}
Note that $\log$ denotes the natural logarithm.
Now, we can formulate the maximum likelihood problem as a constrained optimization problem:
\begin{equation*}
\begin{aligned}
& \underset{\pi}{\text{max}}
& & \log p(\mathbf{X}, \mathbf{T}\vert \pi)\\
& \text{subject to}
& & \sum_{k = 1}^{K}\pi_k = 1
\end{aligned}
\end{equation*}
We introduce a Lagrange multiplier $\lambda$ and maximize the Lagrangian
\begin{align*}
\mathcal{L}(\pi, \lambda) &= \log p(\mathbf{X}, \mathbf{T}\vert \pi) + \lambda(\sum_{k = 1}^{K}\pi_k - 1)
\end{align*}
Taking the derivative with respect to $\pi_k$ and setting it to 0, we have
\begin{align*}
\frac{\partial}{\partial\pi_k}\mathcal{L}(\pi, \lambda) &= \frac{\partial}{\partial\pi_k} \{\sum_{n = 1}^{N}\sum_{k=1}^{K} t_{n}^k(\log p(\mathbf{x}_n\vert C_k)+ \log \pi_k) + \lambda(\sum_{k = 1}^{K}\pi_k - 1)\}\\
&= \frac{1}{\pi_k}\sum_{n=1}^N t_{n}^k + \lambda = 0\\
&\Rightarrow \pi_k = -\frac{1}{\lambda}\sum_{n=1}^N t_{n}^k = -\frac{N_k}{\lambda}
\end{align*}
where $N_k$ is the number of data points whose label is class $k$. Taking the derivative with respect to $\lambda$, we have
\begin{align*}
&\frac{\partial}{\partial \lambda}\mathcal{L}(\pi, \lambda) = \sum_{k = 1}^{K}\pi_k - 1 = 0\\
&\Rightarrow \sum_{k = 1}^{K}\pi_k = 1
\end{align*}
Combining the two equations, we get
\begin{equation*}
\sum_{k = 1}^K \pi_k = \sum_{k = 1}^K -\frac{N_k}{\lambda} = -\frac{N}{\lambda} \Rightarrow \lambda = -N
\end{equation*}
Finally, we substitute $\lambda = -N$ back to solve for the $\pi_k$'s. Thus, we have
\begin{equation*}
\pi_k = \frac{N_k}{N}
\end{equation*}
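As an optional numerical sanity check of this result (not part of the assignment), the sketch below compares the closed-form estimate $\pi_k = N_k/N$ against randomly sampled points on the probability simplex for some hypothetical class counts; the counts and sample size are made up for illustration.

```python
# Sketch: check numerically that pi_k = N_k / N maximizes sum_k N_k * log(pi_k)
# over the probability simplex. The class counts below are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
N_k = np.array([30, 50, 20])                 # hypothetical class counts N_k
N = N_k.sum()

def loglik(pi):
    # Log-likelihood terms that depend on pi: sum_k N_k * log(pi_k)
    return np.sum(N_k * np.log(pi))

pi_mle = N_k / N                             # closed-form MLE from the derivation
best_random = max(loglik(rng.dirichlet(np.ones(len(N_k)))) for _ in range(10_000))
print(loglik(pi_mle) >= best_random)         # expected: True
```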
## Problem 2
1. Consider $f(\mathbf{w}) = \mathbf{w}^T A \mathbf{w}$, so that $f(\mathbf{w}+h) = (\mathbf{w} + h)^T A (\mathbf{w} + h)$.
Then,
$$
\begin{aligned}
f(\mathbf{w}+h) - f(\mathbf{w}) &= \mathbf{w}^T A\mathbf{w} + h^T A \mathbf{w} + \mathbf{w}^TA h + h^T A h - \mathbf{w}^T A \mathbf{w}\\
&= h^T A \mathbf{w} + \mathbf{w}^TA h + h^T A h\\
&= h^T(A^T \mathbf{w} + A\mathbf{w}) + h^T A h\\
&= (A^T \mathbf{w} + A\mathbf{w})\cdot h + h^T A h
\end{aligned}
$$
By definition, $\frac{\partial\ \mathbf{w}^TA\mathbf{w}}{\partial\ \mathbf{w}} = A^T\mathbf{w} + A\mathbf{w}$. In particular, if $A$ is a symmetric matrix, i.e. $A = A^T$, then $\frac{\partial\ \mathbf{w}^TA\mathbf{w}}{\partial\ \mathbf{w}} = 2A\mathbf{w}$.
2. Define $C = AB$, where $A, B \in \mathbb{R}^{m\times m}$. Note that $c_{ij} := \sum_{k=1}^m a_{ik}b_{kj}$, where $c_{ij}$ is the entry in the $i$-th row and $j$-th column of $C$, i.e. the dot product of the $i$-th row of $A$ and the $j$-th column of $B$.
Then,
$$
tr(AB) = tr(C) = \sum_{l=1}^m c_{ll} = \sum_{l=1}^m\sum_{k=1}^m a_{lk}b_{kl}
$$
Hence,
$$
\begin{aligned}
\frac{\partial\ tr(AB)}{\partial a_{ij}} &= \frac{\partial}{\partial a_{ij}}\sum_{l=1}^m\sum_{k=1}^m a_{lk}b_{kl} = b_{ji},
\end{aligned}
$$
i.e. $\frac{\partial\ tr(AB)}{\partial A} = B^T$. (Both identities in parts 1 and 2 are checked numerically in the sketch after this list.)
3. Follow the hint
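As noted in part 2 above, the following sketch checks both matrix-calculus identities, $\frac{\partial\ \mathbf{w}^TA\mathbf{w}}{\partial\ \mathbf{w}} = (A + A^T)\mathbf{w}$ and $\frac{\partial\ tr(AB)}{\partial A} = B^T$, by finite differences; the matrix size, random seed, and tolerance are arbitrary choices for illustration.

```python
# Sketch: finite-difference checks of the identities from parts 1 and 2.
import numpy as np

rng = np.random.default_rng(1)
m, eps = 4, 1e-6
A = rng.standard_normal((m, m))
B = rng.standard_normal((m, m))
w = rng.standard_normal(m)

# Part 1: gradient of f(w) = w^T A w should be (A + A^T) w.
grad_w = np.array([
    ((w + eps * e) @ A @ (w + eps * e) - w @ A @ w) / eps
    for e in np.eye(m)
])
print(np.allclose(grad_w, (A + A.T) @ w, atol=1e-3))    # expected: True

# Part 2: gradient of tr(AB) with respect to the entries a_ij should be B^T.
E = np.eye(m)
grad_A = np.array([
    [(np.trace((A + eps * np.outer(E[i], E[j])) @ B) - np.trace(A @ B)) / eps
     for j in range(m)]
    for i in range(m)
])
print(np.allclose(grad_A, B.T, atol=1e-3))              # expected: True
```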
## Problem 3
By assumption, we know
\begin{align*}
& p\left(x_n \vert C_{k}\right)=\frac{1}{(2 \pi)^{D / 2}|\boldsymbol{\Sigma}|^{1 / 2}} \exp \left(-\frac{1}{2}\left(x_n-\boldsymbol{\mu}_{k}\right)^{T} \boldsymbol{\Sigma}^{-1}\left(x_n-\boldsymbol{\mu}_{k}\right)\right)\\
& p(x_n, t_n) = p(x_n\vert t_n)p(t_n) = \prod_{k = 1}^{K}(p(x_n\vert C_k)\pi_k)^{t_{n}^k}
\end{align*}
where $D$ is the dimension of $x$.\
The log-likelihood of the whole data set $\mathbf{X} = \{x_n\}_{n=1}^N$, $\mathbf{T} = \{t_n\}_{n=1}^N$ is given by
\begin{align*}
&\log p(\mathbf{X}, \mathbf{T}\vert \pi, \boldsymbol{\mu_k}, \boldsymbol{\Sigma}) = \sum_{n = 1}^{N}\sum_{k=1}^{K} t_{n}^k(\log p(x_n\vert C_k)+\log \pi_k)\\
&= \sum_{n = 1}^{N}\sum_{k=1}^{K} t_{n}^k\left(\log \pi_k -\frac{1}{2}\left(x_n-\boldsymbol{\mu}_{k}\right)^{T} \boldsymbol{\Sigma}^{-1}\left(x_n-\boldsymbol{\mu}_{k}\right) -\frac{1}{2}\log \vert\boldsymbol\Sigma\vert - \frac{D}{2}\log 2\pi\right)
\end{align*}
We introduce a Lagrange multiplier $\lambda$ and maximize the Lagrangian
\begin{align*}
\mathcal{L}(\pi, \boldsymbol{\mu_k}, \boldsymbol{\Sigma}, \lambda) &= \log p(\mathbf{X}, \mathbf{T}\vert \pi, \boldsymbol{\mu_k}, \boldsymbol{\Sigma}) + \lambda(\sum_{k = 1}^{K}\pi_k - 1)
\end{align*}
Taking the derivative with respect to $\pi_k$ and setting it to 0, we have
\begin{align*}
\pi_k = \frac{N_k}{N} \quad (\text{as in Problem 1})
\end{align*}
Taking the derivative with respect to $\boldsymbol{\mu_k}$ and setting it to 0, we have
\begin{align*}
\frac{\partial}{\partial \boldsymbol{\mu_k}}\mathcal{L}(\pi, \boldsymbol{\mu_k}, \boldsymbol{\Sigma}, \lambda)
&= \sum_{n = 1}^N t_{n}^k\boldsymbol\Sigma^{-1}(x_n-\boldsymbol{\mu_k})\\
&= \boldsymbol\Sigma^{-1}\sum_{n=1}^N t_{n}^k(x_n-\boldsymbol{\mu_k}) = 0
\end{align*}
Since $\boldsymbol{\Sigma}^{-1}$ is positive definite (hence invertible), it follows that
\begin{equation*}
\begin{aligned}
& &\sum_{n = 1}^N t_{n}^k(x_n - \boldsymbol{\mu_k}) = \sum_{n=1}^N t_{n}^kx_n - \boldsymbol{\mu_k}\sum_{n=1}^{N}t_{n}^k = 0\\
& &\Rightarrow \boldsymbol{\mu_k} = \frac{\sum_{n = 1}^N t_{n}^k x_n}{\sum_{n=1}^N t_{n}^k} = \frac{\sum_{n=1}^N t_{n}^k x_n}{N_k}
\end{aligned}
\end{equation*}
We rewrite the log-likelihood, keeping only the terms that depend on $\boldsymbol{\Sigma}$:
\begin{align*}
\mathcal{L}(\pi, \boldsymbol{\mu_k}, \boldsymbol{\Sigma}, \lambda) &= \sum_{n=1}^N\sum_{k=1}^K t_{n}^k\left(-\frac{1}{2}\left(x_n-\boldsymbol{\mu}_{k}\right)^{T} \boldsymbol{\Sigma}^{-1}\left(x_n-\boldsymbol{\mu}_{k}\right) - \frac{1}{2}\log\vert\boldsymbol{\Sigma}\vert\right) + \text{const}\\
&=\sum_{n=1}^N\sum_{k=1}^K t_{n}^k\left(-\frac{1}{2}tr\left\{\left(x_n-\boldsymbol{\mu}_{k}\right)^{T} \boldsymbol{\Sigma}^{-1}\left(x_n-\boldsymbol{\mu}_{k}\right)\right\} + \frac{1}{2}\log\vert\boldsymbol{\Sigma}^{-1}\vert\right) + \text{const}\\
&=\sum_{n=1}^N\sum_{k=1}^K t_{n}^k\left(-\frac{1}{2}tr\left\{\boldsymbol{\Sigma}^{-1}\left(x_n-\boldsymbol{\mu}_{k}\right) \left(x_n-\boldsymbol{\mu}_{k}\right)^{T}\right\} + \frac{1}{2}\log\vert\boldsymbol{\Sigma}^{-1}\vert\right) + \text{const}
\end{align*}
Taking the derivative with respect to $\boldsymbol{\Sigma}^{-1}$ (using $\frac{\partial\ tr(XM)}{\partial X} = M^T$ and $\frac{\partial \log\vert X\vert}{\partial X} = (X^{-1})^T$) and setting it to 0, we get
\begin{align*}
\frac{\partial}{\partial \boldsymbol{\Sigma}^{-1}}\mathcal{L}(\pi, \boldsymbol{\mu_k}, \boldsymbol{\Sigma}, \lambda) &= -\frac{1}{2}\sum_{n=1}^N\sum_{k=1}^K t_{n}^k\left(\left(x_n-\boldsymbol{\mu}_{k}\right) \left(x_n-\boldsymbol{\mu}_{k}\right)^{T} - \boldsymbol\Sigma^T\right)\\
&= -\frac{1}{2}\left(\sum_{k = 1}^K\sum_{n = 1}^N t_{n}^k\left(x_n-\boldsymbol{\mu}_{k}\right) \left(x_n-\boldsymbol{\mu}_{k}\right)^{T} - \sum_{k = 1}^K\sum_{n = 1}^N t_{n}^k\boldsymbol\Sigma\right) = 0
\end{align*}
Hence,
\begin{equation*}
\begin{aligned}
&\sum_{k = 1}^K\sum_{n = 1}^Nt_{n}^k\boldsymbol\Sigma = \sum_{k = 1}^K\sum_{n = 1}^N t_{n}^k\left(x_n-\boldsymbol{\mu}_{k}\right) \left(x_n-\boldsymbol{\mu}_{k}\right)^{T}\\
&\Rightarrow N\boldsymbol\Sigma = \sum_{k = 1}^K N_k \mathbf{\Sigma}_k \Rightarrow {\boldsymbol\Sigma}=\sum_{k=1}^K\frac{N_k}{N}{\mathbf{\Sigma}_k}
\end{aligned}
\end{equation*}
where $\mathbf{\Sigma}_k := \frac{1}{N_k}\sum_{n = 1}^N t_{n}^k\left(x_n-\boldsymbol{\mu}_{k}\right)\left(x_n-\boldsymbol{\mu}_{k}\right)^{T}$ is the sample covariance of the data points belonging to class $k$.
Note:
1. By the invariance property of the MLE, we may take the derivative of $\mathcal{L}$ with respect to $\Sigma^{-1}$ instead of $\Sigma$.
2. If you want to take the derivative of $\mathcal{L}$ with respect to $\Sigma$ directly, you can apply the fact that $\frac{\partial\ tr\left(X^{-1}M\right)}{\partial X}=-\left(X^{-1}MX^{-1}\right)$ for symmetric $X$ and $M$.
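As an optional numerical illustration of the estimators derived above (not part of the assignment), the sketch below computes $\pi_k$, $\boldsymbol{\mu}_k$, and the shared covariance $\boldsymbol{\Sigma} = \sum_k \frac{N_k}{N}\mathbf{\Sigma}_k$ from synthetic one-hot labelled data, and checks that this weighted average of per-class covariances equals the pooled form $\frac{1}{N}\sum_n\sum_k t_n^k (x_n - \boldsymbol{\mu}_k)(x_n - \boldsymbol{\mu}_k)^T$; the data, dimensions, and variable names are made up.

```python
# Sketch: closed-form MLEs for pi_k, mu_k, and the shared covariance Sigma,
# computed from synthetic one-hot labelled data.
import numpy as np

rng = np.random.default_rng(2)
N, D, K = 300, 2, 3
labels = rng.integers(0, K, size=N)                      # class of each point
T = np.eye(K)[labels]                                    # one-hot targets t_n^k
X = rng.standard_normal((N, D)) + 3.0 * labels[:, None]  # class-shifted data

N_k = T.sum(axis=0)                                      # class counts N_k
pi = N_k / N                                             # pi_k = N_k / N
mu = (T.T @ X) / N_k[:, None]                            # mu_k = (1/N_k) sum_n t_n^k x_n

# Per-class covariances Sigma_k and the shared covariance sum_k (N_k/N) Sigma_k.
Sigma_k = np.stack([
    (X[labels == k] - mu[k]).T @ (X[labels == k] - mu[k]) / N_k[k]
    for k in range(K)
])
Sigma = np.einsum('k,kij->ij', N_k / N, Sigma_k)

# Equivalent pooled form: (1/N) sum_n sum_k t_n^k (x_n - mu_k)(x_n - mu_k)^T.
R = X - mu[labels]                                       # residual to assigned class mean
print(np.allclose(Sigma, (R.T @ R) / N))                 # expected: True
```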
## Problem 4
1.
$$
\begin{aligned}
\sum_{i=1}^m\left\|\boldsymbol{z}^i-\boldsymbol{z}\right\|^2 &=\sum_{i=1}^m\left\|\left(\boldsymbol{z}^i-\overline{\boldsymbol{z}}\right)+(\overline{\boldsymbol{z}}-\boldsymbol{z})\right\|^2 \\
&=\sum_{i=1}^m\left(\left\|\boldsymbol{z}^i-\overline{\boldsymbol{z}}\right\|^2+\|\overline{\boldsymbol{z}}-\boldsymbol{z}\|^2+2\left(\boldsymbol{z}^i-\overline{\boldsymbol{z}}\right) \cdot(\overline{\boldsymbol{z}}-\boldsymbol{z})\right) \\
&=\sum_{i=1}^m\left\|\boldsymbol{z}^i-\overline{\boldsymbol{z}}\right\|^2+\sum_{i=1}^m\|\overline{\boldsymbol{z}}-\boldsymbol{z}\|^2+2 \sum_{i=1}^m\left(\boldsymbol{z}^i \cdot \overline{\boldsymbol{z}}-\boldsymbol{z}^i \cdot \boldsymbol{z}-\overline{\boldsymbol{z}} \cdot \overline{\boldsymbol{z}}+\overline{\boldsymbol{z}} \cdot \boldsymbol{z}\right) \\
&=\sum_{i=1}^m\left\|\boldsymbol{z}^i-\overline{\boldsymbol{z}}\right\|^2+m\|\overline{\boldsymbol{z}}-\boldsymbol{z}\|^2+2(m \overline{\boldsymbol{z}} \cdot \overline{\boldsymbol{z}}-m \overline{\boldsymbol{z}} \cdot \boldsymbol{z}-m \overline{\boldsymbol{z}} \cdot \overline{\boldsymbol{z}}+m \overline{\boldsymbol{z}} \cdot \boldsymbol{z}) \\
&=\sum_{i=1}^m\left\|\boldsymbol{z}^i-\overline{\boldsymbol{z}}\right\|^2+m\|\overline{\boldsymbol{z}}-\boldsymbol{z}\|^2 \\
& \geq \sum_{i=1}^m\left\|\boldsymbol{z}^i-\overline{\boldsymbol{z}}\right\|^2 .
\end{aligned}
$$
2. It follows directly from the assignment step of the algorithm: $\mathcal{C}^t$ and $\mathcal{C}^{t+1}$ differ only if some point finds a strictly closer cluster center in $\boldsymbol{\mu}^t$ than the one assigned to it by $\mathcal{C}^t$, while no point's distance to its assigned center can increase; in that case
$$
L\left(\mathcal{C}^{t+1}, \boldsymbol{\mu}^t\right)=\sum_{i=1}^n\left\|\boldsymbol{x}^i-\boldsymbol{\mu}_{C^{t+1}(i)}^t\right\|^2<\sum_{i=1}^n\left\|\boldsymbol{x}^i-\boldsymbol{\mu}_{C^t(i)}^t\right\|^2=L\left(\mathcal{C}^t, \boldsymbol{\mu}^t\right)
$$
3. Use the result in (1):
$$
\begin{aligned}
L\left(\mathcal{C}^{t+1}, \boldsymbol{\mu}^{t+1}\right) &=\sum_{i=1}^n\left\|\boldsymbol{x}^i-\boldsymbol{\mu}_{C^{t+1}(i)}^{t+1}\right\|^2 \\
&=\sum_{k^{\prime}=1}^k \sum_{i \in\{1,2, \ldots, n\}, \mathcal{C}^{t+1}(i)=k^{\prime}}\left\|\boldsymbol{x}^i-\boldsymbol{\mu}_{C^{t+1}(i)}^{t+1}\right\|^2 \\
& \leq \sum_{k^{\prime}=1}^k \sum_{i \in\{1,2, \ldots, n\}, \mathcal{C}^{t+1}(i)=k^{\prime}}\left\|\boldsymbol{x}^i-\boldsymbol{\mu}_{C^{t+1}(i)}^t\right\|^2 (\text{by part 1})\\
&=\sum_{i=1}^n\left\|\boldsymbol{x}^i-\boldsymbol{\mu}_{C^{t+1}(i)}^t\right\|^2 \\
&=L\left(\mathcal{C}^{t+1}, \boldsymbol{\mu}^t\right) .
\end{aligned}
$$
4. Define the sequence $\{l_t\}$, where $l_t = L(\mathcal{C}^t, \boldsymbol{\mu}^t)$. By the previous two results, we have
$$
l_{t+1} = L(\mathcal{C}^{t+1}, \boldsymbol{\mu}^{t+1}) \leq L(\mathcal{C}^{t+1}, \boldsymbol{\mu}^{t}) \leq L(\mathcal{C}^t, \boldsymbol{\mu}^t) = l_t
$$for all $t$. Hence, $\{l_t\}$ is a monotonically non-increasing sequence, and it is bounded below by $0$.
Note that we apply the **monotone convergence theorem** for sequences to conclude that $\{l_t\}$ converges; this does not guarantee that the algorithm finds the **global** minimum, just a **local** one. (A numerical check of this monotonicity is sketched after this list.)
5. There are at most $k^N$ ways to assign $N$ data points to $k$ clusters. Since the objective strictly decreases whenever the assignment changes (by parts 2 and 3), no assignment can be revisited, so the algorithm converges in finitely many steps.
Note that the upper bound $k^N$ may not be tight.
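As referenced in part 4 above, the sketch below runs a minimal Lloyd's-style $k$-means loop on synthetic data and records the objective $l_t = L(\mathcal{C}^t, \boldsymbol{\mu}^t)$ after each iteration, checking that it never increases; the data, initialization, and iteration count are arbitrary illustrative choices.

```python
# Sketch: check that the k-means objective l_t = L(C^t, mu^t) is non-increasing.
import numpy as np

rng = np.random.default_rng(3)
X = np.vstack([rng.standard_normal((50, 2)) + c for c in ([0, 0], [5, 0], [0, 5])])
k = 3
mu = X[rng.choice(len(X), size=k, replace=False)]             # random initial centers

objective = []
for _ in range(20):
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # squared distances to centers
    C = d2.argmin(axis=1)                                     # assignment step: C^{t+1}
    mu = np.array([X[C == j].mean(axis=0) if np.any(C == j) else mu[j]
                   for j in range(k)])                        # update step: mu^{t+1}
    objective.append(((X - mu[C]) ** 2).sum())                # l_{t+1} = L(C^{t+1}, mu^{t+1})

print(np.all(np.diff(objective) <= 1e-9))                     # expected: True (non-increasing)
```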