## Multi-variate Gaussian and PAC-Bayesian Minimization
### The PAC-Bayesian Bound
We simplify the linear form of the PAC-Bayesian bound
$$L(Q, \mathcal{D}) \leq L(Q, D) + \frac{1}{\lambda} KL(Q||P) + C(\delta, \lambda, N)$$
by assuming the prior and the posterior distribution of the parameter $\theta$ to be
$$P = \mathcal{N}(0, \sigma^2_0I), Q = \mathcal{N}(\theta, \Sigma)$$
where $\Sigma$ is diagonal with diagonal entries $[\sigma^2_1, \sigma^2_2, ..., \sigma^2_K]$ and $K = |\theta|$. Also denote the vector $\sigma = [\sigma_1, \sigma_2, ..., \sigma_K]$.
With the help of this [blog post](https://mr-easy.github.io/2020-04-16-kl-divergence-between-2-gaussian-distributions/), we can directly write out the KL divergence between $Q$ and $P$:
\begin{align}
KL(Q||P) &= \frac{1}{2}[\log \frac{(\sigma^{2}_0)^K}{|\Sigma|} - K + \frac{||\theta||^2_2}{\sigma^2_0} + \frac{Tr(\Sigma)}{\sigma^2_0}]\\
&=\frac{1}{2}[\log \frac{(\sigma^{2}_0)^K}{\prod_i \sigma^2_i} - K + \frac{||\theta||^2_2}{\sigma^2_0} + \frac{\sum_i \sigma^2_i}{\sigma^2_0}]
\end{align}
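As a quick sanity check of this simplification (not part of the derivation), here is a small NumPy snippet with made-up toy values that compares the general multivariate Gaussian KL formula against the diagonal closed form above:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 5
sigma0_sq = 0.5                          # prior variance sigma_0^2 (made-up value)
theta = rng.standard_normal(K)           # posterior mean
sigma_sq = rng.uniform(0.1, 2.0, K)      # posterior variances sigma_i^2

# General Gaussian KL(Q||P) computed with full matrices
Sigma_p = sigma0_sq * np.eye(K)
Sigma_q = np.diag(sigma_sq)
Sp_inv = np.linalg.inv(Sigma_p)
kl_general = 0.5 * (np.trace(Sp_inv @ Sigma_q)
                    + theta @ Sp_inv @ theta
                    - K
                    + np.linalg.slogdet(Sigma_p)[1]
                    - np.linalg.slogdet(Sigma_q)[1])

# Diagonal closed form derived above
kl_diag = 0.5 * (np.log(sigma0_sq**K / np.prod(sigma_sq)) - K
                 + theta @ theta / sigma0_sq
                 + sigma_sq.sum() / sigma0_sq)

assert np.isclose(kl_general, kl_diag)
```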
Now we handle $L(Q, D) = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, \Sigma)}[L(\theta + \epsilon, D)]$ with a Taylor expansion around $\theta$, up to second order.
\begin{align}
L(Q, D) &= L(\theta, D) + \mathbb{E}_{\epsilon \sim \mathcal{N}(0, \Sigma)} [\epsilon^\top \nabla L] + \frac{1}{2} \mathbb{E}_{\epsilon \sim \mathcal{N}(0, \Sigma)} [\epsilon^\top \nabla^2 L\epsilon ] + O(\sigma^3)\\
&= L(\theta, D) + \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)} [(\sigma \circ \epsilon)^\top \nabla L] + \frac{1}{2} \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)} [(\sigma \circ \epsilon)^\top \nabla^2 L(\sigma \circ \epsilon) ] + O(\sigma^3)
\end{align}
where $\circ$ is the element-wise product, a.k.a. the Hadamard product. Note that the Hadamard product can be written as a matrix-vector product: for any vectors $a, b$,
$$a \circ b = diag(a)\cdot b$$
Using the connection above, we write
\begin{align}
L(Q, D) &= L(\theta, D) + \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)} [(diag(\sigma) \epsilon)^\top \nabla L] + \frac{1}{2} \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)} [(diag(\sigma)\epsilon)^\top \nabla^2 L(diag(\sigma) \epsilon) ] + O(\sigma^3)\\
&=L(\theta, D) + \frac{1}{2} \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)} [\epsilon^\top (diag(\sigma) \nabla^2 L \, diag(\sigma)) \epsilon ] + O(\sigma^3)\\
&=L(\theta, D) + \frac{1}{2} Tr(diag(\sigma) \nabla^2 L \, diag(\sigma)) + O(\sigma^3)
\end{align}
where the first-order term vanishes because $\mathbb{E}[\epsilon] = 0$, and the last step uses $\mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}[\epsilon^\top A \epsilon] = Tr(A)$ for any square matrix $A$.
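To sanity-check the second-order approximation, the sketch below uses a made-up quadratic loss, for which the expansion is exact, and compares a Monte Carlo estimate of $L(Q, D)$ against the trace formula:

```python
import numpy as np

rng = np.random.default_rng(1)
K = 4
# Made-up quadratic loss L(t) = 0.5 t^T H t + b^T t, so the expansion is exact
A = rng.standard_normal((K, K))
H = A @ A.T                              # Hessian of the loss
b = rng.standard_normal(K)

theta = rng.standard_normal(K)
sigma = rng.uniform(0.1, 0.5, K)         # posterior standard deviations

# Monte Carlo estimate of E_{eps ~ N(0, Sigma)}[L(theta + eps)],
# sampling eps as sigma * (standard normal)
X = theta + rng.standard_normal((200_000, K)) * sigma
mc = np.mean(0.5 * np.einsum('ni,ij,nj->n', X, H, X) + X @ b)

# Second-order formula: L(theta) + 0.5 * Tr(diag(sigma) H diag(sigma))
D = np.diag(sigma)
second_order = 0.5 * theta @ H @ theta + b @ theta + 0.5 * np.trace(D @ H @ D)

print(mc, second_order)   # the two numbers agree up to Monte Carlo error
```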
To evaluate the remaining trace term, note that
\begin{align}
Tr(diag(\sigma) \nabla^2 L \, diag(\sigma)) = Tr(\Sigma \circ \nabla^2 L) = \sum^K_{i=1} \sigma^2_i \frac{\partial^2 L}{\partial \theta^2_i}
\end{align}
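This identity is easy to confirm numerically, e.g. with a random symmetric stand-in for the Hessian:

```python
import numpy as np

rng = np.random.default_rng(2)
K = 6
H = rng.standard_normal((K, K))
H = 0.5 * (H + H.T)                      # symmetric stand-in for the Hessian
sigma = rng.uniform(0.1, 2.0, K)
D = np.diag(sigma)

# Tr(diag(sigma) H diag(sigma)) picks out sum_i sigma_i^2 * H_ii
assert np.isclose(np.trace(D @ H @ D), np.sum(sigma**2 * np.diag(H)))
```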
Putting the pieces together via the linear-form PAC-Bayesian bound, we obtain
\begin{align}
L(Q, \mathcal{D}) \leq L(\theta, D) + \frac{1}{2} Tr(\Sigma \circ \nabla^2 L) + \frac{1}{2\lambda}[\log \frac{(\sigma^{2}_0)^K}{\prod_i \sigma^2_i} - K + \frac{||\theta||^2_2}{\sigma^2_0} + \frac{\sum_i \sigma^2_i}{\sigma^2_0}]\\ + C(\delta, \lambda, N) + O(\sigma^3)
\end{align}
where one will need the following identity to take the derivative w.r.t. $\sigma^2_i$ later:
$$\log \frac{(\sigma^{2}_0)^K}{\prod_i \sigma^2_i} = K\log \sigma^2_0 - \sum_i \log \sigma^2_i.$$
### Minimizing the bound
First of all, we rearrange the terms to separate out the part that depends on $\sigma^2_i$.
\begin{align}
L(Q, \mathcal{D}) \leq L(\theta, D) + \frac{1}{2}[\sum_i \sigma^2_i (\frac{1}{\lambda \sigma^2_0} + \frac{\partial^2 L}{\partial \theta^2_i}) - \frac{1}{\lambda}\sum_i \log \sigma^2_i] + \frac{1}{2\lambda}(\frac{||\theta||^2_2}{\sigma^2_0} - K + K\log\sigma^2_0) \\+ C(\delta, \lambda, N) + O(\sigma^3)
\end{align}
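The rearrangement is pure algebra; here is a quick numerical check with made-up values for $\lambda$, $\sigma^2_0$, and the Hessian diagonal:

```python
import numpy as np

rng = np.random.default_rng(3)
K, lam, sigma0_sq = 5, 10.0, 0.5         # made-up lambda and prior variance
h = rng.uniform(0.0, 3.0, K)             # Hessian diagonal d^2 L / d theta_i^2
theta = rng.standard_normal(K)
sigma_sq = rng.uniform(0.1, 2.0, K)

# Grouping from the previous section: 0.5 * Tr(Sigma o Hessian) + KL / lambda
before = (0.5 * np.sum(sigma_sq * h)
          + (K * np.log(sigma0_sq) - np.sum(np.log(sigma_sq)) - K
             + theta @ theta / sigma0_sq + sigma_sq.sum() / sigma0_sq) / (2 * lam))

# Rearranged grouping that isolates the sigma_i^2-dependent part
after = (0.5 * (np.sum(sigma_sq * (1 / (lam * sigma0_sq) + h))
                - np.sum(np.log(sigma_sq)) / lam)
         + (theta @ theta / sigma0_sq - K + K * np.log(sigma0_sq)) / (2 * lam))

assert np.isclose(before, after)
```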
Second, we take the derivative of the actual bound (i.e., excluding $C$ and the higher-order term) w.r.t. $\sigma^2_i$, $1 \leq i \leq K$, and set it to zero
\begin{align}
\frac{\partial^2 L}{\partial \theta^2_i} + \frac{1}{\lambda}(\frac{1}{\sigma^2_0} - \frac{1}{\sigma^2_i}) = 0
\end{align}
Solving the equation above yields the optimal variance
$$ \frac{\sigma^2_0}{\sigma^{2*}_i} = 1+\sigma^2_0\lambda\frac{\partial^2 L}{\partial \theta^2_i}. $$
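In code, the optimal variances and the stationarity condition look like this (same made-up values as before; the Hessian diagonal is assumed nonnegative so that $\sigma^{2*}_i > 0$):

```python
import numpy as np

rng = np.random.default_rng(4)
K, lam, sigma0_sq = 5, 10.0, 0.5         # made-up values, as before
h = rng.uniform(0.0, 3.0, K)             # Hessian diagonal (assumed nonnegative)

# Optimal posterior variances from the stationarity condition
sigma_sq_opt = sigma0_sq / (1.0 + sigma0_sq * lam * h)

# The derivative h_i + (1/lambda)(1/sigma0^2 - 1/sigma_i^2) vanishes there
grad = h + (1.0 / lam) * (1.0 / sigma0_sq - 1.0 / sigma_sq_opt)
assert np.allclose(grad, 0.0)
```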
Plugging the optimal $\sigma^{2*}_i$ back into the bound (the $K\log\sigma^2_0$ terms cancel), we obtain
\begin{align}
\min_{\sigma^2} (\text{RHS}) = L(\theta, D) + \frac{1}{2}[\frac{K}{\lambda} +\frac{1}{\lambda}\sum_i \log(1+\sigma^2_0\lambda\frac{\partial^2 L}{\partial \theta^2_i})] + \frac{1}{2\lambda}(\frac{||\theta||^2_2}{\sigma^2_0} - K) \\+ C(\delta, \lambda, N) + O(\sigma^3)
\end{align}
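Finally, a numerical check that this closed-form minimum agrees with the bound evaluated at $\sigma^{2*}$, and that other choices of $\sigma^2$ can only do worse (again with made-up values):

```python
import numpy as np

rng = np.random.default_rng(5)
K, lam, sigma0_sq = 5, 10.0, 0.5         # made-up values, as before
h = rng.uniform(0.0, 3.0, K)             # Hessian diagonal (assumed nonnegative)
theta = rng.standard_normal(K)

def rhs(sigma_sq):
    """Bound terms that depend on sigma^2 and theta (dropping L, C, O terms)."""
    return (0.5 * (np.sum(sigma_sq * (1 / (lam * sigma0_sq) + h))
                   - np.sum(np.log(sigma_sq)) / lam)
            + (theta @ theta / sigma0_sq - K + K * np.log(sigma0_sq)) / (2 * lam))

sigma_sq_opt = sigma0_sq / (1.0 + sigma0_sq * lam * h)
closed_form = (0.5 * (K / lam + np.sum(np.log1p(sigma0_sq * lam * h)) / lam)
               + (theta @ theta / sigma0_sq - K) / (2 * lam))

# The closed form matches the bound evaluated at the optimum ...
assert np.isclose(rhs(sigma_sq_opt), closed_form)
# ... and no other sigma^2 does better
assert rhs(rng.uniform(0.1, 2.0, K)) >= rhs(sigma_sq_opt)
```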