$$
\newcommand{\set}[1]{\left\{#1\right\}}
\newcommand{\Set}[1]{\left\{#1\right\}}
\newcommand{\E}{\mathbb{E}}
\newcommand{\D}{\mathcal{D}}
\newcommand{\N}{\mathcal{N}}
\newcommand{\norm}[1]{\left\lVert#1\right\rVert}
\newcommand{\brac}[1]{[#1 ]}
\newcommand{\Brac}[1]{\left[#1\right]}
\newcommand{\ex}[1]{\E\brac{#1}}
\newcommand{\Ex}[1]{\E\Brac{#1}}
$$
# Bishop Chapter 2 Notes
* **Density Estimation**: model the probability distribution $p(x)$ of a random variable $x$, given a finite set $x_1,..., x_N$ of observations.
:::info
Density estimation is fundamentally ill-posed: infinitely many distributions could have generated the observed data. The issue of choosing an appropriate distribution relates to the problem of model selection.
:::
* The Bayesian viewpoint is natural for ==sequential learning==, with posteriors serving as updated priors. It is useful to work with *conjugate priors* in this scenario. Maximum likelihood can also be cast into a sequential framework, e.g., via the **Robbins-Monro algorithm**.
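As a small illustration, here is a sketch (in numpy, with illustrative data) of the sequential maximum-likelihood update for a Gaussian mean, $\mu_{ML}^{(N)} = \mu_{ML}^{(N-1)} + \frac{1}{N}(x_N - \mu_{ML}^{(N-1)})$, which is an instance of a Robbins-Monro-style update with step size $1/N$:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=1000)  # toy stream of observations

# Sequential ML estimate of the Gaussian mean: each new point nudges the
# running estimate towards itself with step size 1/N (Robbins-Monro style).
mu_ml = 0.0
for n, x in enumerate(data, start=1):
    mu_ml += (x - mu_ml) / n

print(mu_ml, data.mean())  # the sequential estimate equals the batch ML mean
```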
* As the number of observations increases, the posterior distribution becomes more sharply peaked. $\E_{\theta}[{\theta}] = \E_\D[\E_{\theta}[{\theta | \D}]]$ shows that the posterior mean, averaged over the distribution generating the data, equals the prior mean. $\text{var}_\theta [\theta] = \E_\D[\text{var}_\theta[\theta|\D]] + \text{var}_\D [\E_\theta[\theta|\D]]$ shows that, *on average*, the posterior variance of $\theta$ is smaller than the prior variance.
* **Gaussian Property**: If two sets of variables are jointly Gaussian, then the conditional distribution of one set conditioned on the other is Gaussian. Similarly, the marginal distribution of either set is also Gaussian.
**Conditional Gaussian Distribution**: $\mu_{a|b} = \mu_a + \Sigma_{ab} \Sigma_{bb}^{-1} (x_b - \mu_b)$ and $\Sigma_{a|b} = \Sigma_{aa}-\Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba}$. The mean is a linear function of $x_b$ and the covariance is independent of $x_b$. This is an example of a *linear-Gaussian* model.
**Marginal Gaussian Distribution**: $p(x_a) = \mathcal N(\mu_a, \Sigma_{aa})$.
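A minimal sketch of the conditional formulas above for a 2-D joint Gaussian (scalar blocks $x_a$, $x_b$); the parameter values are made up for illustration:

```python
import numpy as np

# Joint Gaussian over (x_a, x_b) with partitioned mean and covariance.
mu = np.array([0.0, 1.0])             # [mu_a, mu_b]
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])        # [[S_aa, S_ab], [S_ba, S_bb]]

def conditional(mu, Sigma, x_b):
    """p(x_a | x_b): mean linear in x_b, variance independent of x_b."""
    mu_a, mu_b = mu
    S_aa, S_ab, S_bb = Sigma[0, 0], Sigma[0, 1], Sigma[1, 1]
    mu_cond = mu_a + S_ab / S_bb * (x_b - mu_b)
    var_cond = S_aa - S_ab / S_bb * S_ab
    return mu_cond, var_cond

print(conditional(mu, Sigma, x_b=2.0))
```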
* **Conjugate Priors**: Identify priors that have same functional dependence on parameters as likelihood.
* Dirichlet for multinomial distribution.
    * Gaussian for a Gaussian likelihood when $\sigma^2$ is known: $p(\mu|x) \propto p(x|\mu) \cdot p(\mu)$, and letting $p(\mu) = \mathcal N(\mu|\mu_0,\sigma_0^2)$ gives $p(\mu|x) = \mathcal N(\mu|\mu_N, \sigma_N^2)$, where $\mu_N = \frac{\sigma^2}{N\sigma_0^2+\sigma^2}\mu_0 + \frac{N\sigma_0^2}{N\sigma_0^2+\sigma^2} \mu_{ML}$, and $\frac{1}{\sigma_N^2} = \frac{1}{\sigma_0^2} + \frac{N}{\sigma^2}$. ==Note that the **precisions** (inverse variances) are additive.== See the sketch after this list.
    * Gamma for the precision of a Gaussian when the mean is known and the variance (equivalently, the precision) has to be inferred.
    * Normal-Gamma or Gaussian-Gamma for a Gaussian when both the mean and the precision are unknown.
    * Gaussian for a multivariate Gaussian with known covariance, and Wishart (over the precision matrix) for a multivariate Gaussian with known mean. Normal-Wishart or Gaussian-Wishart when both the mean and the covariance are unknown.
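A quick sketch of the Gaussian-mean conjugate update above, assuming a known noise variance $\sigma^2$; the data and hyperparameters are illustrative:

```python
import numpy as np

def posterior_mean_var(x, mu_0, sigma2_0, sigma2):
    """Posterior N(mu_N, sigma_N^2) over a Gaussian mean with known variance sigma2.

    Precisions add: 1/sigma_N^2 = 1/sigma2_0 + N/sigma2.
    """
    N = len(x)
    mu_ml = np.mean(x)
    sigma2_N = 1.0 / (1.0 / sigma2_0 + N / sigma2)
    # Equivalent to mu_N = sigma2/(N*sigma2_0+sigma2)*mu_0 + N*sigma2_0/(N*sigma2_0+sigma2)*mu_ml
    mu_N = sigma2_N * (mu_0 / sigma2_0 + N * mu_ml / sigma2)
    return mu_N, sigma2_N

rng = np.random.default_rng(1)
x = rng.normal(loc=3.0, scale=1.0, size=50)
print(posterior_mean_var(x, mu_0=0.0, sigma2_0=10.0, sigma2=1.0))
```

With more data the posterior mean moves towards $\mu_{ML}$ and the posterior variance shrinks, consistent with the "sharper posterior" point above.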
* **Student's $t$-Distribution**: obtained by integrating out the precision of a univariate Gaussian $\N(x|\mu, \tau^{-1})$ under a Gamma prior $\text{Gam}(\tau |a, b)$; it can therefore be seen as an infinite mixture of Gaussians. It has longer tails than a Gaussian and is therefore more robust to outliers.
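A small sampling sketch of the "infinite mixture of Gaussians" view: drawing a precision from the Gamma prior for each sample and then drawing a Gaussian with that precision yields Student's-$t$-distributed samples with heavier tails than a Gaussian; the parameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
a, b, mu = 2.0, 2.0, 0.0   # Gam(a, b) prior over the precision tau

# One precision per sample, then a Gaussian draw with that precision;
# marginally the samples follow a Student's t distribution.
tau = rng.gamma(shape=a, scale=1.0 / b, size=100_000)
x = rng.normal(loc=mu, scale=1.0 / np.sqrt(tau))

print(np.mean(np.abs(x - mu) > 3.0))                    # t-mixture tail mass
print(np.mean(np.abs(rng.normal(size=100_000)) > 3.0))  # Gaussian tail mass, much smaller
```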
* **Periodic RVs**: a *Gaussian-like* distribution called the von Mises distribution: $p(\theta|\theta_0, m) = \frac{1}{2\pi I_0(m)} \exp\Set{m \cos(\theta-\theta_0)}$, where $m$ is analogous to the precision and $I_0(m)$ is the zeroth-order modified Bessel function of the first kind.
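A tiny sketch of evaluating this density, using `scipy.special.i0` for $I_0(m)$; the parameter values are illustrative:

```python
import numpy as np
from scipy.special import i0  # modified Bessel function of the first kind, order 0

def von_mises_pdf(theta, theta_0, m):
    """Von Mises density p(theta | theta_0, m); m plays the role of a precision."""
    return np.exp(m * np.cos(theta - theta_0)) / (2.0 * np.pi * i0(m))

theta = np.linspace(-np.pi, np.pi, 5)
print(von_mises_pdf(theta, theta_0=0.0, m=4.0))
```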
* **Mixtures of Gaussians**: $p(x) = \sum_{k=1}^{K} \pi_k \N(x|\mu_k,\Sigma_k)$, where the mixing coefficients satisfy $0 \le \pi_k \le 1$ and $\sum_k \pi_k = 1$. Given a data set $X$ of $N$ points, one way to set the parameters of a mixture of Gaussians is by maximum likelihood. The log-likelihood is $\ln p(X|\pi,\mu,\Sigma) = \sum_{n=1}^N \ln\Set{\sum_{k=1}^K \pi_k \N(x_n|\mu_k,\Sigma_k)}$. The summation inside the logarithm means maximum likelihood has no closed-form solution; instead we use the powerful EM algorithm.
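A sketch of evaluating this log-likelihood for a 1-D mixture, using `logsumexp` to handle the sum inside the log; the parameters and data are illustrative, and nothing here maximises the likelihood (that is what EM is for):

```python
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

def gmm_log_likelihood(x, pi, mu, sigma):
    """ln p(X | pi, mu, sigma) for a 1-D Gaussian mixture."""
    # log pi_k + log N(x_n | mu_k, sigma_k^2), shape (N, K)
    log_comp = np.log(pi) + norm.logpdf(x[:, None], loc=mu, scale=sigma)
    # Sum over components inside the log, then sum over data points.
    return logsumexp(log_comp, axis=1).sum()

rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 0.5, 100)])
print(gmm_log_likelihood(x, pi=np.array([2/3, 1/3]),
                         mu=np.array([-2.0, 3.0]),
                         sigma=np.array([1.0, 0.5])))
```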
* **Non-Informative Priors**: priors chosen so that the posterior depends only on terms arising from the data, not from the prior. A constant prior is problematic for continuous parameters: it is improper (does not integrate to one), and it does not stay constant under a change of variables because of the Jacobian term. Examples: for a location parameter, $p(x|\mu) = f(x-\mu)$, the non-informative prior is $p(\mu) = \text{const}$; for a scale parameter, $p(x|\sigma) = \frac{1}{\sigma} f(x/\sigma)$, it is $p(\sigma) \propto 1/\sigma$.
## Non-Parametric Methods
>Non-parametric approaches to density estimation that make few assumptions about the form of the distribution. Here we shall focus mainly on simple frequentist methods.
* **Histograms**: $N$ data points, bins of width $\Delta_i$, each bin containing $n_i$ points; then $p_i = \frac{n_i}{N\Delta_i}$. The data can be discarded after creating the histogram (useful for data compression), and the method is highly amenable to online learning, but it suffers from the curse of dimensionality.
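A minimal sketch of the histogram estimate $p_i = n_i/(N\Delta_i)$ on synthetic 1-D data (this is what `np.histogram(..., density=True)` computes):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=1000)

# p_i = n_i / (N * Delta_i): count per bin, normalised by sample size and bin width.
edges = np.linspace(-4.0, 4.0, 21)         # 20 bins of width Delta = 0.4
counts, _ = np.histogram(x, bins=edges)
p = counts / (len(x) * np.diff(edges))

print(p.round(3))
print((p * np.diff(edges)).sum())          # ~1: the estimate integrates to one over the bins
```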
:::info
Consider that each region $\mathcal R$ has a probability $P$ of data-point falling in it. Let $K \simeq PN$, and let $P \simeq p(x)V$, where $V$ is volume of $\mathcal R$. Therefore $p(x) = \frac{K}{NV}$. Fixing $K$ and determining the value of $V$ from data gives rise to k-NN method and fixing $V$ and determining $K$ gives rise to kernel approach.
:::
* **Kernel Method**: choosing a smooth Gaussian kernel gives$$ p(x) = \frac{1}{N} \sum_{n=1}^N \frac{1}{(2\pi h^2)^{D/2}} \exp\Set{-\frac{\norm{x-x_n}^2}{2h^2}}, $$where $h$ acts as a smoothing parameter (bandwidth).
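A minimal 1-D sketch of this kernel density estimate; the bandwidth $h$ and data are illustrative:

```python
import numpy as np

def gaussian_kde(x_query, data, h):
    """1-D kernel density estimate: one Gaussian bump of bandwidth h per data point."""
    diffs = x_query[:, None] - data[None, :]
    kernels = np.exp(-diffs**2 / (2.0 * h**2)) / np.sqrt(2.0 * np.pi * h**2)
    return kernels.mean(axis=1)            # average over the N bumps

rng = np.random.default_rng(5)
data = rng.normal(size=500)
print(gaussian_kde(np.linspace(-3, 3, 7), data, h=0.3))
```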
* **Nearest-Neighbour Methods**: Consider a small sphere centred on the point $x$ at which we wish to estimate the density $p(x)$, and we allow the radius of the sphere to grow until it contains precisely $K$ data points. The estimate of the density $p(x)$ is then given by $p(x) = \frac{K}{NV}$ with V set to the volume of the resulting sphere. Note that the model produced by $K$ nearest neighbours is not a true density model because the integral over all space diverges.
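A minimal 1-D sketch of the $K$-nearest-neighbour estimate $p(x) = \frac{K}{NV}$, taking $V = 2r$ as the length of the smallest interval around $x$ containing $K$ points; the data and $K$ are illustrative:

```python
import numpy as np

def knn_density(x_query, data, K):
    """1-D K-NN density estimate p(x) = K / (N * V), with V = 2r."""
    dists = np.sort(np.abs(data[None, :] - x_query[:, None]), axis=1)
    r = dists[:, K - 1]                    # distance to the K-th nearest neighbour
    return K / (len(data) * 2.0 * r)

rng = np.random.default_rng(6)
data = rng.normal(size=1000)
print(knn_density(np.linspace(-2, 2, 5), data, K=30))
```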
* An interesting property of the ==nearest-neighbour $(K = 1)$ classifier== is that, in the limit $N \rightarrow \infty$, the error rate is never more than twice the minimum achievable error rate of an optimal classifier.