$$
\newcommand{\set}[1]{\left\{#1\right\}}
\newcommand{\Set}[1]{\left\{#1\right\}}
\newcommand{\E}{\mathbb{E}}
\newcommand{\norm}[1]{\left\lVert#1\right\rVert}
\newcommand{\brac}[1]{[#1]}
\newcommand{\Brac}[1]{\left[#1\right]}
\newcommand{\ex}[1]{\E\brac{#1}}
\newcommand{\Ex}[1]{\E\Brac{#1}}
$$
# Bishop Chapter 1 Notes
* **Bayesian Approach**: the *over-fitting problem* can be avoided. We shall see that there is no difficulty from a Bayesian perspective in employing models for which the ==number of parameters greatly exceeds the number of data points==. Indeed, in a Bayesian model the effective number of parameters adapts automatically to the size of the data set.
* **Bayes Theorem**:
$$ p(Y|X) = \frac{p(X|Y) \cdot p(Y)}{p(X)} = \frac{p(X|Y) \cdot p(Y)}{\sum_{Y} p(X|Y) \cdot p(Y)} $$
We can view the denominator in Bayes’ theorem as the normalization constant that ensures the conditional probability on the left-hand side of the first equality sums to one over all values of $Y$, i.e., $\sum_{Y}p(X|Y)p(Y) = p(X)$.
* **Interpretation of Bayes Theorem**: If we had been asked which box $B$ had been chosen before being told the identity of the selected item of fruit $F$, then the most complete information we have available is provided by the probability $p(B)$. We call this the **prior probability** because it is the ==probability available before we observe the identity of the fruit.== Once we are told that the fruit is an orange, we can then use Bayes’ theorem to compute the probability $p(B|F)$, which we shall call the **posterior probability** because it is the ==probability obtained after we have observed $F$.==
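A quick numerical check with the book's box-and-fruit numbers (red box chosen with probability $4/10$, holding 2 apples and 6 oranges; blue box chosen with probability $6/10$, holding 3 apples and 1 orange):
```python
# Bayes' theorem on the box-and-fruit example
prior = {"red": 0.4, "blue": 0.6}                  # p(B)
likelihood_orange = {"red": 6 / 8, "blue": 1 / 4}  # p(F = orange | B)

# denominator of Bayes' theorem: p(F = orange) = sum_B p(F = orange | B) p(B)
evidence = sum(likelihood_orange[b] * prior[b] for b in prior)

# posterior p(B | F = orange)
posterior = {b: likelihood_orange[b] * prior[b] / evidence for b in prior}
print(posterior)  # {'red': 0.666..., 'blue': 0.333...}; sums to one
```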
* **Likelihood**: Note that given data $\mathcal D$, we need to fit a parameter $\bf w$. Prior belief is encoded by $p({\bf w})$, and posterior probability is given by $p({\bf w} |\mathcal D)$. We know from Bayes' theorem that $p({\bf w} |\mathcal D) = p(\mathcal D| {\bf w}) p({\bf w}) / p(\mathcal D)$. The quantity $p(\mathcal D| {\bf w})$ on the right-hand side of Bayes’ theorem is evaluated for the observed data set $\mathcal D$ and can be viewed as a function of the parameter vector $\bf w$, in which case it is called the ==likelihood function==.
:::info
Note that the likelihood is a function of $\bf w$, not a density over it: in general $\int_{\bf w} p(\mathcal D| {\bf w})\, d{\bf w} \neq 1$ (see the Bernoulli example below).
:::
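A concrete check (my own example, not from the book): for a Bernoulli model with $N$ coin flips and $m$ heads, the likelihood is $p(\mathcal D \mid \mu) = \mu^m (1-\mu)^{N-m}$, and
$$ \int_0^1 \mu^m (1-\mu)^{N-m}\, d\mu = \frac{m!\,(N-m)!}{(N+1)!}, $$
which equals $1$ only in the trivial case $N = 0$; e.g., $N = 2$, $m = 1$ gives $1/6$.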
* > **Bootstrap**: One approach to determining frequentist error bars is the bootstrap, in which multiple data sets are created as follows. Suppose our original data set consists of $N$ data points $X = \set{x_1, \ldots, x_N}$. We can create a new data set $X_B$ by drawing $N$ points at random from $X$, with replacement, so that some points in $X$ may be replicated in $X_B$, whereas other points in $X$ may be absent from $X_B$. This process can be repeated $L$ times to generate $L$ data sets, each of size $N$ and each obtained by sampling from the original data set $X$. ==The statistical accuracy of parameter estimates can then be evaluated by looking at the variability of predictions between the different bootstrap data sets==.
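A minimal sketch of this procedure for the standard error of a sample mean (numpy, illustrative numbers):
```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=100)  # original data set, N = 100

L = 1000  # number of bootstrap data sets
estimates = []
for _ in range(L):
    # draw N points from X with replacement -> one bootstrap data set X_B
    X_B = rng.choice(X, size=X.size, replace=True)
    estimates.append(X_B.mean())  # parameter estimate on this replicate

# variability across replicates serves as a frequentist error bar
print("bootstrap std of the mean:", np.std(estimates))
print("analytic  std of the mean:", X.std(ddof=1) / np.sqrt(X.size))
```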
* :::warning
In fact, as we shall see, the ==issue of bias (the bias of the empirical variance estimate) in maximum likelihood== lies at the root of the over-fitting problem that we encountered earlier in the context of polynomial curve fitting. :confused: **Check how later?** (The bias itself is stated right after this note.)
:::
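The bias in question (Bishop, Section 1.2.4): for Gaussian data, the maximum likelihood estimate of the variance,
$$ \sigma^2_{\mathrm{ML}} = \frac{1}{N}\sum_{n=1}^N \left(x_n - \mu_{\mathrm{ML}}\right)^2, \qquad \Ex{\sigma^2_{\mathrm{ML}}} = \frac{N-1}{N}\,\sigma^2, $$
systematically underestimates the true variance, i.e., the fit looks better on the observed data than it will on fresh data.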
* Let $t = y({\bf x}, {\bf w}) + \epsilon$ with zero-mean Gaussian noise $\epsilon$. The sum-of-squares error function arises as a consequence of maximizing the likelihood under this Gaussian noise assumption.
* **MAP**: *maximum a posteriori*, $\max_{\bf w} p({\bf w}|\mathcal D)$. This is opposed to *maximum likelihood*, i.e., $\max_{\bf w} p(\mathcal D | {\bf w})$.
> Assume the problem of polynomial curve fitting with Gaussian noise, and take the prior on the parameter $\bf w$ to be a zero-mean Gaussian with some variance. MAP estimation then automatically yields regularization (this relates to the first point of these notes; see the sketch after the equation), since minimizing the negative log posterior is equivalent to minimizing
$$ \norm{{\bf y}-\bf{t}}^2 + \lambda \norm{{\bf w}}^2. $$
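A minimal sketch of this MAP fit as ridge regression on polynomial features (here $\lambda$ plays the role of the prior-to-noise precision ratio; the data are illustrative):
```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)  # noisy targets

M = 9                      # polynomial degree: as many parameters as data points
Phi = np.vander(x, M + 1)  # design matrix of polynomial features
lam = 1e-3                 # lambda = (prior precision) / (noise precision)

# MAP solution: minimize ||Phi w - t||^2 + lam ||w||^2
# closed form: w = (Phi^T Phi + lam I)^{-1} Phi^T t
w_map = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M + 1), Phi.T @ t)

# lam = 0 recovers the (wildly over-fitting) maximum likelihood solution
w_ml = np.linalg.lstsq(Phi, t, rcond=None)[0]
print("||w_map|| =", np.linalg.norm(w_map), " ||w_ml|| =", np.linalg.norm(w_ml))
```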
### Decision Theory
For classification with decision regions $\mathcal R_k$ (every $x \in \mathcal R_k$ is assigned to class $\mathcal C_k$), the probability of deciding correctly is
$$ p(\text{correct}) = \sum_{k=1}^K p(x \in \mathcal R_k, \mathcal C_k) = \sum_{k=1}^K \int_{\mathcal R_k} p(x, \mathcal C_k)\, dx, $$
which is maximized when the regions $\mathcal R_k$ are chosen such that each $x$ is assigned to the class for which $p(x,\mathcal C_k)$ is largest. Since $p(x,\mathcal C_k) = p(\mathcal C_k|x)p(x)$, and noting that the factor of $p(x)$ is common to all terms, we see that each $x$ should be assigned to the class having the largest posterior probability $p(\mathcal C_k |x)$.
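In code, the optimal rule is just a per-point argmax over posteriors (made-up posterior values for illustration):
```python
import numpy as np

# posteriors[i, k] = p(C_k | x_i) for 4 points and 3 classes (made-up numbers)
posteriors = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.4, 0.4, 0.2],
    [0.2, 0.5, 0.3],
])
decisions = posteriors.argmax(axis=1)  # assign each x to its largest posterior
print(decisions)  # [0 2 0 1]; ties are broken toward the lower class index
```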
* **Inference Stage**: in which we use training data to learn a model for $p(\mathcal C_k|x)$.
* **Decision Stage**: in which we use these posterior probabilities to make optimal class assignments.
* **Discriminant Function**: $f: x \rightarrow \text{decision}$; solves both the inference and decision problems together.
* **Naive Bayes Model**: assumes the features are conditionally independent given the class, e.g., $p(x_1,x_2|\mathcal C_k) = p(x_1|\mathcal C_k)\, p(x_2 | \mathcal C_k)$; see the sketch below.
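A minimal Gaussian naive Bayes sketch built on that independence assumption (synthetic data; not the book's code):
```python
import numpy as np

rng = np.random.default_rng(2)
# two classes, two features, generated with class-conditional independence
X0 = rng.normal([0.0, 0.0], 1.0, size=(50, 2))
X1 = rng.normal([2.0, 2.0], 1.0, size=(50, 2))
X, y = np.vstack([X0, X1]), np.array([0] * 50 + [1] * 50)

def log_gauss(x, mu, var):
    # log density of independent univariate Gaussians, one per feature
    return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

# fit per-class, per-feature Gaussians plus class priors p(C_k)
params = {}
for k in (0, 1):
    Xk = X[y == k]
    params[k] = (Xk.mean(axis=0), Xk.var(axis=0), len(Xk) / len(X))

def predict(x):
    # log p(C_k) + sum_i log p(x_i | C_k), using conditional independence
    scores = {k: np.log(pk) + log_gauss(x, mu, var).sum()
              for k, (mu, var, pk) in params.items()}
    return max(scores, key=scores.get)

print(predict(np.array([0.1, -0.2])), predict(np.array([2.1, 1.8])))  # 0 1
```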
* **Loss Function for Regression**: with squared loss, the expected loss is $\E[L] = \iint \left(y(x) - t\right)^2 p(x,t)\, dx\, dt$. Expand the integrand:
\begin{align*}
(y(x) - t)^2 & = (y(x) - \E[t|x] + \E[t|x] - t)^2 \\
& = (y(x) - \E[t|x])^2 + 2\,(y(x) - \E[t|x])(\E[t|x] - t) + (\E[t|x] - t)^2
\end{align*}
Substituting into the loss function and performing the integral over $t$, we see that the cross-term vanishes and we obtain an expression for the loss function in the form
$$ \E[L] = \int \left(y(x) - \E[t|x]\right)^2 p(x)\, dx + \iint \left(\E[t|x] - t\right)^2 p(x,t)\, dx\, dt $$
The expected loss is minimized when $y(x) = \E[t|x]$. The second term is the variance of the distribution of $t$, averaged over $x$; because it is independent of $y(x)$, it represents the irreducible minimum value of the loss function. A numerical sanity check follows.
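A Monte Carlo check of this decomposition (a sketch with a known conditional mean $\E[t|x] = \sin(2\pi x)$ and noise standard deviation $0.25$):
```python
import numpy as np

rng = np.random.default_rng(3)
N = 200_000
x = rng.uniform(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.25, size=N)  # E[t|x] = sin(2 pi x)

y = lambda x: 0.9 * np.sin(2 * np.pi * x)  # a deliberately suboptimal predictor
cond_mean = np.sin(2 * np.pi * x)

total = np.mean((y(x) - t) ** 2)
term1 = np.mean((y(x) - cond_mean) ** 2)  # reducible part: depends on y
term2 = np.mean((cond_mean - t) ** 2)     # irreducible noise, close to 0.25**2
print(total, term1 + term2)               # the two agree; the cross-term vanishes
```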
### Information Theory
* We see that there is an intimate relationship between ==data compression and density estimation== (i.e., the problem of modeling an unknown probability distribution) because the most efficient compression is achieved when we know the true distribution.
* From a Bayesian perspective, we can view $p(x)$ as the prior distribution for $x$ and $p(x|y)$ as the posterior distribution after we have observed new data $y$. The mutual information therefore represents the reduction in uncertainty about $x$ as a consequence of the new observation $y$.
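A small numerical illustration of that reading of mutual information (a made-up $2 \times 2$ joint distribution):
```python
import numpy as np

# joint p(x, y) for binary x (rows) and y (columns); made-up numbers
p_xy = np.array([[0.30, 0.10],
                 [0.10, 0.50]])
p_x = p_xy.sum(axis=1)  # prior over x
p_y = p_xy.sum(axis=0)

# I(x; y) = sum_{x,y} p(x,y) log[ p(x,y) / (p(x) p(y)) ]
I = np.sum(p_xy * np.log(p_xy / np.outer(p_x, p_y)))

# equivalently H[x] - H[x|y]: the reduction in uncertainty about x given y
H_x = -np.sum(p_x * np.log(p_x))
H_x_given_y = -np.sum(p_xy * np.log(p_xy / p_y))  # p(x|y) = p(x,y) / p(y)
print(I, H_x - H_x_given_y)  # the two agree
```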