$$
\newcommand{\set}[1]{\left\{#1\right\}}
\newcommand{\Set}[1]{\left\{#1\right\}}
\newcommand{\E}{\mathbb{E}}
\newcommand{\D}{\mathcal{D}}
\newcommand{\N}{\mathcal{N}}
\newcommand{\M}{\mathcal{M}}
\newcommand{\norm}[1]{\left\lVert#1\right\rVert}
\newcommand{\brac}[1]{[#1 ]}
\newcommand{\Brac}[1]{\left[#1\right]}
\newcommand{\ex}[1]{\E\brac{#1}}
\newcommand{\Ex}[1]{\E\Brac{#1}}
$$

# Bishop Chapter 3 Notes: Linear Models for Regression

* A common choice of loss function for real-valued variables is the squared loss, for which the optimal solution is given by the conditional expectation of the target value $t$.
* **Spline Functions**: One limitation of polynomial basis functions is that they are global functions of the input variable, so changes in one region of input space affect all other regions. This can be resolved by dividing the input space into regions and fitting a different polynomial in each region, leading to spline functions.

### Maximum Likelihood and Least Squares

* Let $p({\bf t}|X, w, \beta) = \prod_{n=1}^{N} \N(t_n|w^\top \phi(x_n),\beta^{-1})$. Setting the gradient of the log-likelihood (equivalently, of the sum-of-squares error) with respect to $w$ to zero gives $w_{ML} = (\Phi^\top \Phi)^{-1} \Phi^\top {\bf t}$, where $\Phi_{n,j} = \phi_j(x_n)$ is the design matrix.
* The addition of a ==regularization== term ensures that the matrix $\lambda I + \Phi^\top\Phi$ is non-singular, even in the presence of degeneracies.
* **Regularization** allows complex models to be trained on data sets of limited size without severe over-fitting, essentially by limiting the effective model complexity. However, we then have to determine a suitable value of the regularization coefficient $\lambda$.
* **Over-fitting** is an unfortunate property of maximum likelihood and does not arise when we marginalize over parameters in a Bayesian setting.
* **Bias-Variance Decomposition**: The bias-variance decomposition may provide some insight into the model-complexity issue from a frequentist perspective, but it is of limited practical value, because it is based on averages wrt ensembles of data sets, whereas in practice we have only the single observed data set.

### Bayesian Linear Regression

* **Predictive Distribution**: $p(t|{\bf t},\alpha,\beta) = \int p(t|w,\beta)\, p(w|{\bf t},\alpha,\beta)\, dw$; we are interested in the prediction itself rather than in the parameter values (see the sketch at the end of this section).
* **Gaussian process**: If we use localized basis functions such as Gaussians, then in regions away from the basis-function centres the predictive variance collapses to the noise level, so the model becomes very confident in its predictions when extrapolating outside the region occupied by the basis functions. This problem can be avoided by an alternative Bayesian approach to regression known as a Gaussian process.
* The mean of the predictive distribution at a point $x$ is given by $y(x,m_{N}) = \sum_{n=1}^N k(x,x_n)t_n$, where $k(\cdot,\cdot)$ is known as the smoother matrix or equivalent kernel.
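The posterior and predictive distributions above have closed forms, so they are easy to sanity-check numerically. Below is a minimal numpy sketch (not from the book's code): the Gaussian basis functions, data, and names such as `phi`, `centres`, `alpha`, `beta` are illustrative choices, and the hyperparameters are held fixed. It also checks that the predictive mean can be rewritten in the equivalent-kernel form $\sum_n k(x, x_n) t_n$.

```python
import numpy as np

# Minimal sketch of Bayesian linear regression with fixed alpha, beta.
rng = np.random.default_rng(0)

# Synthetic 1-D data: t = sin(2*pi*x) + noise
N = 25
x = rng.uniform(0.0, 1.0, size=N)
beta = 25.0                        # noise precision (assumed known here)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, beta ** -0.5, size=N)

# Gaussian basis functions plus a bias term
centres = np.linspace(0.0, 1.0, 9)
s = 0.1                            # basis-function width

def phi(x):
    """Design-matrix rows: [1, exp(-(x - mu_j)^2 / (2 s^2))]."""
    x = np.atleast_1d(x)[:, None]
    return np.hstack([np.ones((x.shape[0], 1)),
                      np.exp(-(x - centres) ** 2 / (2 * s ** 2))])

Phi = phi(x)                       # N x M design matrix
alpha = 2.0                        # prior precision on w

# Posterior over w: S_N^{-1} = alpha*I + beta*Phi^T Phi,  m_N = beta*S_N*Phi^T*t
S_N_inv = alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi
S_N = np.linalg.inv(S_N_inv)
m_N = beta * S_N @ Phi.T @ t

# Predictive distribution at new points:
#   mean = m_N^T phi(x*),  var = 1/beta + phi(x*)^T S_N phi(x*)
x_star = np.linspace(0.0, 1.0, 5)
Phi_star = phi(x_star)
pred_mean = Phi_star @ m_N
pred_var = 1.0 / beta + np.sum(Phi_star @ S_N * Phi_star, axis=1)

# Equivalent-kernel view: y(x*, m_N) = sum_n k(x*, x_n) t_n,
# with k(x, x') = beta * phi(x)^T S_N phi(x').
K = beta * Phi_star @ S_N @ Phi.T          # smoother matrix, rows sum to ~1
print(np.allclose(K @ t, pred_mean))       # True: same predictive mean
print(pred_mean)
print(np.sqrt(pred_var))
```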
### Bayesian Model Comparison

* The over-fitting associated with maximum likelihood can be avoided by ==marginalizing (summing or integrating) over the model parameters instead of making point estimates of their values==. Models can then be compared directly on the training data, without the need for a validation set.
* **Bayesian view of model comparison**: simply involves the use of probabilities to represent uncertainty in the choice of model, along with a consistent application of the sum and product rules of probability.
* **Model Evidence**: defined as $p(\D|\M_i)$, where $\M_i$ denotes the $i$-th model. Once we know the posterior distribution over models, the predictive distribution is given by $p(t|x,\D) = \sum_{i} p(t|x,\M_i,\D)\, p(\M_i|\D)$.
* The model evidence is given by $p(\D|\M_i) = \int p(\D|w,\M_i)\, p(w|\M_i)\, dw$.
* The marginal likelihood (model evidence) can be viewed as the probability of generating the data set $\D$ from a model whose parameters are sampled at random from the prior.

#### Model Comparison Heuristic

> Fix a model $\M_i$. Assume the posterior is peaked around $w_{\text{MAP}}$, with width $\Delta w_{\text{posterior}}$. Assume that the prior is flat with width $\Delta w_{\text{prior}}$, so that $p(w) = 1/\Delta w_{\text{prior}}$.

* $p(\D) = \int p(\D|w)\, p(w)\, dw \simeq p(\D|w_{\text{MAP}})\, (\Delta w_{\text{posterior}}/\Delta w_{\text{prior}})$.
* The first term measures how well the data are fit by the most probable parameters $w_{\text{MAP}}$; for a sufficiently complex model this fit can be made very good. The second term penalizes the model according to its complexity: if the parameters are finely tuned to the data, the posterior is narrow, the ratio $\Delta w_{\text{posterior}}/\Delta w_{\text{prior}}$ is small, and the penalty is large.
* The optimal model complexity, as determined by the maximum evidence, is given by a trade-off between these two competing terms.

:::warning
Implicit in the Bayesian model comparison framework is the assumption that the true distribution from which the data are generated is contained within the set of models under consideration.
:::

### Evidence Approximation

An approximation in which we set the hyperparameters to specific values determined by maximizing the marginal likelihood function obtained by first integrating over the parameters $w$.

> Introduce hyper-priors over $\alpha$ and $\beta$ (the ratio $\alpha/\beta$ plays the role of the regularization coefficient). From Bayes' theorem, $p(\alpha,\beta|{\bf t}) \propto p({\bf t}|\alpha,\beta)\, p(\alpha,\beta)$; assuming a flat hyper-prior and a hyper-posterior sharply peaked around $\hat\alpha$, $\hat\beta$, we maximize the marginal likelihood $p({\bf t}|\alpha,\beta)$.

* **Evaluation of the evidence**: $p({\bf t}|\alpha,\beta) = \int p({\bf t}|w,\beta)\, p(w|\alpha)\, dw$. If the above assumption holds, the predictive distribution can be approximated as $p(t|{\bf t}) \simeq p(t|{\bf t},\hat\alpha,\hat\beta)$. One then **maximizes the marginal likelihood function** wrt $\alpha$ and $\beta$, which can be done iteratively by re-estimating them from the current posterior (see the sketch below).
* **Effective number of parameters**: $\gamma = \sum_i \lambda_i/(\alpha + \lambda_i)$, where the $\lambda_i$ are the eigenvalues of $\beta\Phi^\top\Phi$, counts the well-determined parameters. Because the re-estimate of the noise variance divides by $N-\gamma$ rather than $N$, the Bayesian result automatically corrects for the bias of the maximum-likelihood variance estimate of a Gaussian over a single variable.
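The iterative re-estimation mentioned above fits in a few lines. Below is a minimal, self-contained numpy sketch of the updates $\alpha \leftarrow \gamma/(m_N^\top m_N)$ and $\beta^{-1} \leftarrow \frac{1}{N-\gamma}\sum_n \big(t_n - m_N^\top\phi(x_n)\big)^2$; the function name, toy data, and polynomial basis are illustrative choices, not from the book.

```python
import numpy as np

def evidence_approximation(Phi, t, alpha=1.0, beta=1.0, n_iter=100, tol=1e-6):
    """Re-estimate alpha, beta by maximizing the marginal likelihood
    p(t | alpha, beta) of the linear-Gaussian model."""
    N, M = Phi.shape
    eig0 = np.linalg.eigvalsh(Phi.T @ Phi)          # eigenvalues of Phi^T Phi
    for _ in range(n_iter):
        # Posterior mean for the current hyperparameters
        S_N_inv = alpha * np.eye(M) + beta * Phi.T @ Phi
        m_N = beta * np.linalg.solve(S_N_inv, Phi.T @ t)

        # Effective number of well-determined parameters:
        #   gamma = sum_i lambda_i / (alpha + lambda_i), lambda_i eigvals of beta*Phi^T*Phi
        lam = beta * eig0
        gamma = np.sum(lam / (alpha + lam))

        # Re-estimation: alpha = gamma / m_N^T m_N,
        #   1/beta = sum_n (t_n - m_N^T phi_n)^2 / (N - gamma)
        alpha_new = gamma / (m_N @ m_N)
        beta_new = (N - gamma) / np.sum((t - Phi @ m_N) ** 2)
        converged = abs(alpha_new - alpha) < tol and abs(beta_new - beta) < tol
        alpha, beta = alpha_new, beta_new
        if converged:
            break
    return alpha, beta, gamma, m_N

# Toy example: cubic-polynomial design matrix for noisy sinusoidal data.
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, size=30)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, size=30)
Phi = np.vander(x, N=4, increasing=True)            # columns [1, x, x^2, x^3]
alpha_hat, beta_hat, gamma, m_N = evidence_approximation(Phi, t)
print(alpha_hat, beta_hat, gamma)                   # gamma <= M well-determined parameters
```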