# Bayesian Linear Regression
:::info
Why do we want to use Bayesian methods when we can use frequentist methods?
- Different tools for different jobs. Non-Bayesian models are not as expressive as their Bayesian counterparts. If all we care about is predictive power, then there’s little need for parameter confidence intervals and a non-Bayesian approach will suffice in most instances. However, when we want to do inference and compare effects (coefficients) with some level of confidence, Bayesian methods shine.
- One advantage of the Bayesian approach is its ability to perform online learning recursively: the posterior from one batch of data serves as the prior for the next as new information arrives incrementally.
- Point and interval estimation: **In Bayesian inference the outcome of interest for a parameter is its full posterior distribution**; however, we are often interested in summaries of this distribution. A simple point estimate is the posterior mean (the median and mode are alternatives).
There is a well-known philosophical difference between the frequentist and Bayesian interpretations; [the debate can be found here](https://en.wikipedia.org/wiki/Probability_interpretations).
:::
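Given draws from a posterior (however obtained), these summaries are one-liners. A minimal sketch, using simulated draws as a stand-in for a real posterior:

```python
import numpy as np

# Simulated stand-in for posterior draws of a coefficient (illustrative values);
# in practice these would come from a closed-form posterior or from MCMC.
rng = np.random.default_rng(0)
posterior_samples = rng.normal(loc=2.0, scale=0.5, size=10_000)

# Point summaries of the posterior
posterior_mean = posterior_samples.mean()
posterior_median = np.median(posterior_samples)

# A 95% credible interval from the empirical quantiles
ci_low, ci_high = np.quantile(posterior_samples, [0.025, 0.975])
print(posterior_mean, posterior_median, (ci_low, ci_high))
```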
:::info
There’s a Bayesian version of most models. If we have a model for data that can be expressed as a probability distribution, then we can specify distributions for its parameters and come up with a Bayesian formulation.
:::
The standard regression model is based on observations $(X,y)$ and parameters $(\beta,\sigma)$:
$$
y =X\beta + \epsilon \\
\epsilon \sim N(0, \sigma^{2})
$$
For Bayesian regression, since $\epsilon$ is Normally distributed, $y$ is also Normally distributed in this model, so we can write down a distribution for $y$. This is called the likelihood.
$$
y \mid \beta, \sigma \sim N(X\beta, \sigma^2 I)
$$
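The likelihood can also be evaluated directly. A minimal sketch; the `log_likelihood` helper, the simulated data, and all parameter values below are illustrative:

```python
import numpy as np

def log_likelihood(y, X, beta, sigma):
    """Gaussian log-likelihood: y ~ N(X @ beta, sigma^2 I)."""
    resid = y - X @ beta
    n = len(y)
    return -0.5 * n * np.log(2 * np.pi * sigma**2) - 0.5 * np.sum(resid**2) / sigma**2

# Toy data from known parameters (intercept 1.0, slope 2.0, sigma 0.5)
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(100), rng.normal(size=100)])
true_beta = np.array([1.0, 2.0])
y = X @ true_beta + rng.normal(scale=0.5, size=100)

# The log-likelihood is higher near the true coefficients than far from them
print(log_likelihood(y, X, true_beta, 0.5))
print(log_likelihood(y, X, np.zeros(2), 0.5))
```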
We’re interested in estimating values for $\beta$. Before we get to estimating, the Bayesian framework allows us to add anything we know about our parameters to the model. This step is referred to as prior specification.
:::info
We often do not have any prior information (although a true Bayesian would argue we always have some), in which case we use a **non-informative prior** (diffuse or flat priors are better terms, as no prior is strictly non-informative).
For example, for an unknown mean, candidate priors are a Uniform distribution over a large range or a Normal distribution with a huge variance.
It is typical to use a Normal distribution as the prior for regression coefficients, which are real-valued. For the variance parameter, which is always positive, an inverse-gamma distribution is usually used:
\begin{align}
p(\sigma^2) = IG \left(\frac{T_0}{2},\frac{\theta_0}{2} \right) \propto (\sigma^2)^{-\frac{T_0}{2}-1} e^{-\frac{\theta_0/2}{\sigma^2}}
\end{align}
The two arguments of the inverse-gamma distribution, $T_0/2$ and $\theta_0/2$, are its shape and scale parameters. The former plays the role of a degrees-of-freedom term that determines the height of the distribution; the latter fixes its spread.
:::
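To get a feel for this prior, we can sample from it; the hyperparameter values below are illustrative, and we use the fact that the reciprocal of a Gamma variate is inverse-gamma distributed:

```python
import numpy as np

# Inverse-gamma prior IG(T0/2, theta0/2) for sigma^2, as in the text.
# If Y ~ Gamma(shape=a, rate=b), then 1/Y ~ InverseGamma(a, b),
# so numpy alone suffices for sampling.
T0, theta0 = 6.0, 10.0          # illustrative hyperparameters, not from the text
a, b = T0 / 2, theta0 / 2

rng = np.random.default_rng(0)
gamma_samples = rng.gamma(shape=a, scale=1.0 / b, size=100_000)
sigma2_samples = 1.0 / gamma_samples

# Support is (0, inf), matching a variance parameter;
# the theoretical mean is b / (a - 1) when a > 1.
print(sigma2_samples.mean(), b / (a - 1))
```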
Using Bayes' theorem,
$$
f(\beta|y,X)=\frac{f(y|\beta,X)f(\beta|X)}{f(y|X)}
$$
where $f(\beta|X)$ is the **prior** belief about the parameter; $f(y|\beta,X)$ is the **likelihood** of observing the data given the parameters; and $f(y|X)$ is called the **evidence**. Finally, $f(\beta|y,X)$ is the **posterior** belief about the parameter $\beta$ once the evidence $y$ has been taken into account. This posterior then becomes the new prior, and the cycle continues recursively.
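As a sketch of how the posterior follows from this formula, we can approximate it by brute force on a grid for a single slope coefficient; the data, prior, and grid below are illustrative:

```python
import numpy as np

# Bayes' theorem on a grid for one slope beta with known sigma.
rng = np.random.default_rng(0)
x = rng.normal(size=50)
sigma = 1.0
y = 2.0 * x + rng.normal(scale=sigma, size=50)   # true slope = 2

beta_grid = np.linspace(0, 4, 401)

# Prior: N(0, 10^2), weakly informative (log density up to a constant)
log_prior = -0.5 * (beta_grid / 10.0) ** 2

# Log-likelihood of the data for each candidate beta
resid = y[None, :] - beta_grid[:, None] * x[None, :]
log_lik = -0.5 * np.sum(resid**2, axis=1) / sigma**2

# Posterior ∝ likelihood × prior; the evidence is just the normalizing constant
log_post = log_prior + log_lik
post = np.exp(log_post - log_post.max())
post /= post.sum()

posterior_mean = np.sum(beta_grid * post)
print(posterior_mean)   # close to the true slope of 2
```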
## Conjugate Bayesian Simple Regression
To find the posterior distribution determined by Bayes' theorem in closed form, we exploit a critical property known as **conjugacy**, here in the Normal-Inverse-Gamma family: it can be shown that a Normally distributed prior combined with a Normally distributed likelihood produces a Normally distributed posterior.
When the posterior is in the same family as the prior we have conjugacy. Examples include:
| Likelihood | Parameter | Prior | Posterior |
| -------- | -------- | -------- |-------- |
| Normal | Mean | Normal | Normal |
| Normal| Precision (= 1/Variance)| Gamma| Gamma |
| Binomial | Probability | Beta | Beta |
| Poisson | Mean | Gamma | Gamma |
In these cases the posterior is available in closed form.
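For instance, the Binomial-Beta row of the table gives a closed-form update by simple counting; a minimal sketch with illustrative numbers:

```python
# Binomial likelihood + Beta prior -> Beta posterior, by adding counts.
# Hyperparameters and data below are illustrative.
a0, b0 = 2.0, 2.0            # Beta(2, 2) prior on the success probability
successes, trials = 30, 50   # observed data

# Conjugate update: data counts add to the prior's pseudo-counts
a_post = a0 + successes
b_post = b0 + (trials - successes)

posterior_mean = a_post / (a_post + b_post)
print(a_post, b_post, posterior_mean)  # 32.0 22.0 0.592...
```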
### Example - Known variance, unknown mean
It is easier to consider first a model with 1 unknown parameter. Suppose we have a sample of Normal data:
$$
x_i \sim N(\mu,\sigma^2), \quad i=1,...,n
$$
Let us assume we know the variance $\sigma^2$, and we assume a prior distribution for the mean $\mu$ based on our prior beliefs:
$$
\mu \sim N(\mu_0,\sigma_0^2)
$$
Then the posterior for $\mu$ is also Normal, with variance $\sigma_p^2 = \left(1/\sigma_0^2+n/\sigma^2\right)^{-1}$ and mean $\mu_p = \sigma_p^2\left(\mu_0/\sigma_0^2+\sum_i x_i/\sigma^2\right)$.
:::info
Precision and Mean
In Bayesian statistics the precision = 1/variance is often more important than the variance. For the Normal model we have:
$$
1/\sigma_p^2 = 1/\sigma_0^2+n/\sigma^2 \\
\mu_p = \sigma_p^2(\mu_0/\sigma_0^2+\bar{x}/(\sigma^2/n))
$$
:::
In other words the posterior precision = sum of prior precision and data precision, and the posterior mean is a (precision weighted) average of the prior mean and data mean.
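This precision-weighted update can be checked numerically; a minimal sketch with illustrative prior and data values:

```python
import numpy as np

# Known-variance / unknown-mean conjugate update, written out numerically.
mu0, sigma0 = 0.0, 2.0       # prior N(mu0, sigma0^2)
sigma = 1.0                  # known data standard deviation

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=sigma, size=25)    # true mean = 3
n = len(x)

# Posterior precision = prior precision + data precision (n / sigma^2)
post_precision = 1 / sigma0**2 + n / sigma**2
sigma_p2 = 1 / post_precision

# Posterior mean = precision-weighted average of prior mean and sample mean
mu_p = sigma_p2 * (mu0 / sigma0**2 + x.sum() / sigma**2)
print(mu_p, sigma_p2)   # mean shrinks slightly from the sample mean toward mu0
```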
## MCMC
In reality, we rarely have the luxury of a conjugate distribution, so we rely instead on a technique called Markov Chain Monte Carlo (MCMC). MCMC answers this question: if we don't know what the posterior distribution looks like and we don't have a closed-form solution, how do we obtain the posterior distribution? Can we at least approximate it?
Metropolis–Hastings provides a numerical Monte Carlo simulation method to draw samples from the posterior distribution. The trick is to construct a Markov chain that converges to the given distribution as its stationary (equilibrium) distribution. Hence the name Markov Chain Monte Carlo (MCMC).
The Metropolis-Hastings algorithm works as follows:
```
Initialize current state x
Initialize number of iterations N
Initialize proposal distribution q(x' | x)
Initialize target distribution p(x)
Initialize empty list samples
for i = 1 to N:
    Sample x' from the proposal distribution q(x' | x)
    Calculate acceptance ratio alpha = min(1, p(x') / p(x) * q(x | x') / q(x' | x))
    Sample a uniform random number u from [0, 1]
    if u < alpha:
        Accept the proposed state: x = x'
    else:
        Reject the proposed state: x = x
    Add x to the samples list
Return the list of samples
```
In this pseudocode:
- x represents the current state of the Markov chain. ($\beta$ in the regression)
- N is the number of iterations or samples you want to generate.
- q(x' | x) is the proposal distribution that defines how you generate the next state.
- p(x) is the target distribution you want to sample from.
- alpha is the acceptance ratio that determines whether to accept or reject the proposed state.
- u is a uniformly distributed random number between 0 and 1.
In the end, we discard some initial $\beta$ draws according to a burn-in hyper-parameter, and the Metropolis-Hastings construction guarantees that the remaining series is (approximately) a sample from the posterior distribution.
A toy implementation:
```python
import numpy as np
import matplotlib.pyplot as plt

# Target distribution: standard normal (mean=0, std=1)
def target_distribution(x):
    return np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)

# Metropolis-Hastings algorithm
def metropolis_hastings(num_samples, proposal_std=1.0):
    samples = []
    current_sample = np.random.randn()  # initial sample
    for _ in range(num_samples):
        # Propose a new sample from the (symmetric) proposal distribution
        proposed_sample = current_sample + np.random.normal(scale=proposal_std)
        # Acceptance ratio; the symmetric proposal makes the q-terms cancel
        acceptance_ratio = target_distribution(proposed_sample) / target_distribution(current_sample)
        # Accept or reject the proposed sample
        if np.random.rand() < acceptance_ratio:
            current_sample = proposed_sample
        samples.append(current_sample)
    return samples

# Number of samples to generate
num_samples = 10000

# Run Metropolis-Hastings algorithm
samples = metropolis_hastings(num_samples)

# Plot the histogram of generated samples against the target density
plt.hist(samples, bins=50, density=True, alpha=0.6, color='g', label='Generated Samples')
x = np.linspace(-5, 5, 100)
plt.plot(x, target_distribution(x), color='r', label='Target Distribution')
plt.xlabel('x')
plt.ylabel('Density')
plt.title('Metropolis-Hastings Algorithm')
plt.legend()
plt.show()
```
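The same sampler can target the posterior of a regression coefficient rather than a standard normal. A minimal sketch, assuming a flat prior on the slope and a known noise level (so the posterior is proportional to the likelihood), with the burn-in step applied; all data and tuning values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data with a known noise level
x = rng.normal(size=100)
sigma = 1.0
y = 2.0 * x + rng.normal(scale=sigma, size=100)   # true slope = 2

def log_posterior(beta):
    # Flat prior on beta, so log posterior = log likelihood + const
    resid = y - beta * x
    return -0.5 * np.sum(resid**2) / sigma**2

def metropolis_hastings_beta(num_samples, proposal_std=0.1, burn_in=1000):
    beta = 0.0
    samples = []
    for _ in range(num_samples + burn_in):
        proposal = beta + rng.normal(scale=proposal_std)
        # Symmetric proposal, so the q-ratio cancels; compare in log space
        if np.log(rng.random()) < log_posterior(proposal) - log_posterior(beta):
            beta = proposal
        samples.append(beta)
    return np.array(samples[burn_in:])   # discard burn-in draws

samples = metropolis_hastings_beta(5000)
print(samples.mean())   # close to the true slope of 2
```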

References:
- [Ordinary VS Bayesian Linear Regression](https://jramkiss.github.io/2020/03/01/regression-vs-bayesian-regression/)
- [An Introduction to MCMC methods and Bayesian Statistics](https://dam.ukdataservice.ac.uk/media/307220/presentation4.pdf)
- [Markov Chain Monte Carlo Linear Regression](https://letianzj.github.io/mcmc-linear-regression.html)