# Bayesian Linear Regression
:::info
Why do we want to use Bayesian methods when we can use frequentist methods?
- Different tools for different jobs. Non-Bayesian models are not as expressive as their Bayesian counterparts. If all we care about is predictive power, then there’s little need for parameter confidence intervals and a non-Bayesian approach will suffice in most instances. However, when we want to do inference and compare effects (coefficients) with some level of confidence, Bayesian methods shine.
- One advantage of the Bayesian approach is its ability to perform online learning recursively: the posterior from one batch of data serves as the prior for the next as new information arrives incrementally.
- Point and interval estimation: **In Bayesian inference the outcome of interest for a parameter is its full posterior distribution**; however, we are often interested in summaries of this distribution. A simple point estimate is the posterior mean (the median and mode are alternatives).
There is a well-known philosophical difference between the frequentist and Bayesian interpretations; [the debate can be found here](https://en.wikipedia.org/wiki/Probability_interpretations).
:::
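Given draws from a posterior (however obtained), these summaries are one-liners. A minimal sketch, using simulated draws as a stand-in for a real posterior:

```python
import numpy as np

# Simulated stand-in for posterior draws of a coefficient (illustrative values);
# in practice these would come from a closed-form posterior or from MCMC.
rng = np.random.default_rng(0)
posterior_samples = rng.normal(loc=2.0, scale=0.5, size=10_000)

# Point summaries of the posterior
posterior_mean = posterior_samples.mean()
posterior_median = np.median(posterior_samples)

# A 95% credible interval from the empirical quantiles
ci_low, ci_high = np.quantile(posterior_samples, [0.025, 0.975])
print(posterior_mean, posterior_median, (ci_low, ci_high))
```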
:::info
There’s a Bayesian version of most models. If we have a model for data that can be expressed as a probability distribution, then we can specify distributions for its parameters and come up with a Bayesian formulation.
:::
The standard regression model is based on observations $(X,y)$ and parameters $(\beta,\sigma)$:
$$
y =X\beta + \epsilon \\
\epsilon \sim N(0, \sigma^{2})
$$
For Bayesian regression, since $\epsilon$ is Normally distributed, $y$ is also Normally distributed in this model, so we can write down a distribution for $y$. This is called the likelihood.
$$
y \mid \beta, \sigma \sim N(X\beta, \sigma^2 I)
$$
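The likelihood can also be evaluated directly. A minimal sketch; the `log_likelihood` helper, the simulated data, and all parameter values below are illustrative:

```python
import numpy as np

def log_likelihood(y, X, beta, sigma):
    """Gaussian log-likelihood: y ~ N(X @ beta, sigma^2 I)."""
    resid = y - X @ beta
    n = len(y)
    return -0.5 * n * np.log(2 * np.pi * sigma**2) - 0.5 * np.sum(resid**2) / sigma**2

# Toy data from known parameters (intercept 1.0, slope 2.0, sigma 0.5)
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(100), rng.normal(size=100)])
true_beta = np.array([1.0, 2.0])
y = X @ true_beta + rng.normal(scale=0.5, size=100)

# The log-likelihood is higher near the true coefficients than far from them
print(log_likelihood(y, X, true_beta, 0.5))
print(log_likelihood(y, X, np.zeros(2), 0.5))
```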
We’re interested in estimating values for $\beta$. Before we get to estimating, the Bayesian framework allows us to add anything we know about our parameters to the model. This step is referred to as prior specification.
:::info
We often do not have any prior information (although a true Bayesian would argue we always have some), in which case we use a **non-informative prior** (diffuse or flat priors are better terms, as no prior is strictly non-informative).
For example, for an unknown mean, candidate priors are a Uniform distribution over a large range or a Normal distribution with a huge variance.
It is typical to use a Normal distribution as the prior for regression coefficients, which are real-valued. For the variance parameter, which is always positive, an inverse-gamma distribution is usually used:
\begin{align}
p(\sigma^2) = IG \left(\frac{T_0}{2},\frac{\theta_0}{2} \right) \propto (\sigma^2)^{-\frac{T_0}{2}-1} e^{-\frac{\theta_0/2}{\sigma^2}}
\end{align}
The two arguments of the inverse-gamma distribution, $T_0/2$ and $\theta_0/2$, are its shape and scale parameters. The former plays the role of a degrees-of-freedom term that determines the height of the distribution; the latter fixes its spread.
:::
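To get a feel for this prior, we can sample from it; the hyperparameter values below are illustrative, and we use the fact that the reciprocal of a Gamma variate is inverse-gamma distributed:

```python
import numpy as np

# Inverse-gamma prior IG(T0/2, theta0/2) for sigma^2, as in the text.
# If Y ~ Gamma(shape=a, rate=b), then 1/Y ~ InverseGamma(a, b),
# so numpy alone suffices for sampling.
T0, theta0 = 6.0, 10.0          # illustrative hyperparameters, not from the text
a, b = T0 / 2, theta0 / 2

rng = np.random.default_rng(0)
gamma_samples = rng.gamma(shape=a, scale=1.0 / b, size=100_000)
sigma2_samples = 1.0 / gamma_samples

# Support is (0, inf), matching a variance parameter;
# the theoretical mean is b / (a - 1) when a > 1.
print(sigma2_samples.mean(), b / (a - 1))
```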
Using Bayes' theorem,
$$
f(\beta|y,X)=\frac{f(y|\beta,X)f(\beta|X)}{f(y|X)}
$$
where $f(\beta|X)$ is the **prior** belief about the parameter; $f(y|\beta,X)$ is the **likelihood** of observing the data given the parameters; and $f(y|X)$ is called the **evidence**. Finally, $f(\beta|y,X)$ is the **posterior** belief about the parameter $\beta$ once the evidence $y$ has been taken into account. This posterior then becomes the new prior, and the cycle continues recursively.
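As a sketch of how the posterior follows from this formula, we can approximate it by brute force on a grid for a single slope coefficient; the data, prior, and grid below are illustrative:

```python
import numpy as np

# Bayes' theorem on a grid for one slope beta with known sigma.
rng = np.random.default_rng(0)
x = rng.normal(size=50)
sigma = 1.0
y = 2.0 * x + rng.normal(scale=sigma, size=50)   # true slope = 2

beta_grid = np.linspace(0, 4, 401)

# Prior: N(0, 10^2), weakly informative (log density up to a constant)
log_prior = -0.5 * (beta_grid / 10.0) ** 2

# Log-likelihood of the data for each candidate beta
resid = y[None, :] - beta_grid[:, None] * x[None, :]
log_lik = -0.5 * np.sum(resid**2, axis=1) / sigma**2

# Posterior ∝ likelihood × prior; the evidence is just the normalizing constant
log_post = log_prior + log_lik
post = np.exp(log_post - log_post.max())
post /= post.sum()

posterior_mean = np.sum(beta_grid * post)
print(posterior_mean)   # close to the true slope of 2
```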
## Conjugate Bayesian Simple Regression
To find the posterior distribution determined by Bayes' theorem in closed form, we exploit a critical property known as **conjugacy**, here in the Normal-Inverse-Gamma family: it can be shown that a Normally distributed prior combined with a Normally distributed likelihood produces a Normally distributed posterior.
When the posterior is in the same family as the prior we have conjugacy. Examples include:
| Likelihood | Parameter | Prior | Posterior |
| -------- | -------- | -------- |-------- |
| Normal | Mean | Normal | Normal |
| Normal| Precision (= 1/Variance)| Gamma| Gamma |
| Binomial | Probability | Beta | Beta |
| Poisson | Mean | Gamma | Gamma |
In these cases the posterior is available in closed form.
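For instance, the Binomial-Beta row of the table gives a closed-form update by simple counting; a minimal sketch with illustrative numbers:

```python
# Binomial likelihood + Beta prior -> Beta posterior, by adding counts.
# Hyperparameters and data below are illustrative.
a0, b0 = 2.0, 2.0            # Beta(2, 2) prior on the success probability
successes, trials = 30, 50   # observed data

# Conjugate update: data counts add to the prior's pseudo-counts
a_post = a0 + successes
b_post = b0 + (trials - successes)

posterior_mean = a_post / (a_post + b_post)
print(a_post, b_post, posterior_mean)  # 32.0 22.0 0.592...
```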
### Example - Known variance, unknown mean
It is easier to consider first a model with 1 unknown parameter. Suppose we have a sample of Normal data:
$$
x_i \sim N(\mu,\sigma^2), \quad i=1,...,n
$$
Let us assume we know the variance $\sigma^2$, and we assume a prior distribution for the mean $\mu$ based on our prior beliefs:
$$
\mu \sim N(\mu_0,\sigma_0^2)
$$
Then the posterior for $\mu$ is also Normal, with variance $\sigma_p^2 = \left(1/\sigma_0^2+n/\sigma^2\right)^{-1}$ and mean $\mu_p = \sigma_p^2\left(\mu_0/\sigma_0^2+\sum_i x_i/\sigma^2\right)$.
:::info
Precision and Mean
In Bayesian statistics the precision = 1/variance is often more important than the variance. For the Normal model we have:
$$
1/\sigma_p^2 = 1/\sigma_0^2+n/\sigma^2 \\
\mu_p = \sigma_p^2(\mu_0/\sigma_0^2+\bar{x}/(\sigma^2/n))
$$
:::
In other words the posterior precision = sum of prior precision and data precision, and the posterior mean is a (precision weighted) average of the prior mean and data mean.
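This precision-weighted update can be checked numerically; a minimal sketch with illustrative prior and data values:

```python
import numpy as np

# Known-variance / unknown-mean conjugate update, written out numerically.
mu0, sigma0 = 0.0, 2.0       # prior N(mu0, sigma0^2)
sigma = 1.0                  # known data standard deviation

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=sigma, size=25)    # true mean = 3
n = len(x)

# Posterior precision = prior precision + data precision (n / sigma^2)
post_precision = 1 / sigma0**2 + n / sigma**2
sigma_p2 = 1 / post_precision

# Posterior mean = precision-weighted average of prior mean and sample mean
mu_p = sigma_p2 * (mu0 / sigma0**2 + x.sum() / sigma**2)
print(mu_p, sigma_p2)   # mean shrinks slightly from the sample mean toward mu0
```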
## MCMC
In reality, we rarely have the luxury of a conjugate distribution, so we rely instead on a technique called Markov Chain Monte Carlo (MCMC). MCMC answers this question: if we don't know what the posterior distribution looks like and we don't have a closed-form solution, how do we obtain the posterior distribution? Can we at least approximate it?
Metropolis–Hastings provides a numerical Monte Carlo simulation method to draw samples from the posterior distribution. The trick is to construct a Markov chain that converges to the given distribution as its stationary (equilibrium) distribution. Hence the name Markov Chain Monte Carlo (MCMC).
The Metropolis-Hastings algorithm works as follows:
```
Initialize current state x
Initialize number of iterations N
Initialize proposal distribution q(x' | x)
Initialize target distribution p(x)
Initialize empty list samples
for i = 1 to N:
    Sample x' from the proposal distribution q(x' | x)
    Calculate acceptance ratio alpha = min(1, p(x') / p(x) * q(x | x') / q(x' | x))
    Sample a uniform random number u from [0, 1]
    if u < alpha:
        Accept the proposed state: x = x'
    else:
        Reject the proposed state: x = x
    Add x to the samples list
Return the list of samples
```
In this pseudocode:
- x represents the current state of the Markov chain. ($\beta$ in the regression)
- N is the number of iterations or samples you want to generate.
- q(x' | x) is the proposal distribution that defines how you generate the next state.
- p(x) is the target distribution you want to sample from.
- alpha is the acceptance ratio that determines whether to accept or reject the proposed state.
- u is a uniformly distributed random number between 0 and 1.
In the end, we discard some initial $\beta$ draws according to a burn-in hyper-parameter, and the Metropolis-Hastings construction guarantees that the remaining series is (approximately) a sample from the posterior distribution.
A toy implementation:
```python
import numpy as np
import matplotlib.pyplot as plt

# Target distribution: standard normal (mean=0, std=1)
def target_distribution(x):
    return np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)

# Metropolis-Hastings algorithm
def metropolis_hastings(num_samples, proposal_std=1.0):
    samples = []
    current_sample = np.random.randn()  # initial sample
    for _ in range(num_samples):
        # Propose a new sample from the (symmetric) proposal distribution
        proposed_sample = current_sample + np.random.normal(scale=proposal_std)
        # Acceptance ratio; the symmetric proposal makes the q-terms cancel
        acceptance_ratio = target_distribution(proposed_sample) / target_distribution(current_sample)
        # Accept or reject the proposed sample
        if np.random.rand() < acceptance_ratio:
            current_sample = proposed_sample
        samples.append(current_sample)
    return samples

# Number of samples to generate
num_samples = 10000

# Run Metropolis-Hastings algorithm
samples = metropolis_hastings(num_samples)

# Plot the histogram of generated samples against the target density
plt.hist(samples, bins=50, density=True, alpha=0.6, color='g', label='Generated Samples')
x = np.linspace(-5, 5, 100)
plt.plot(x, target_distribution(x), color='r', label='Target Distribution')
plt.xlabel('x')
plt.ylabel('Density')
plt.title('Metropolis-Hastings Algorithm')
plt.legend()
plt.show()
```
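The same sampler can target the posterior of a regression coefficient rather than a standard normal. A minimal sketch, assuming a flat prior on the slope and a known noise level (so the posterior is proportional to the likelihood), with the burn-in step applied; all data and tuning values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data with a known noise level
x = rng.normal(size=100)
sigma = 1.0
y = 2.0 * x + rng.normal(scale=sigma, size=100)   # true slope = 2

def log_posterior(beta):
    # Flat prior on beta, so log posterior = log likelihood + const
    resid = y - beta * x
    return -0.5 * np.sum(resid**2) / sigma**2

def metropolis_hastings_beta(num_samples, proposal_std=0.1, burn_in=1000):
    beta = 0.0
    samples = []
    for _ in range(num_samples + burn_in):
        proposal = beta + rng.normal(scale=proposal_std)
        # Symmetric proposal, so the q-ratio cancels; compare in log space
        if np.log(rng.random()) < log_posterior(proposal) - log_posterior(beta):
            beta = proposal
        samples.append(beta)
    return np.array(samples[burn_in:])   # discard burn-in draws

samples = metropolis_hastings_beta(5000)
print(samples.mean())   # close to the true slope of 2
```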

References:
- [Ordinary VS Bayesian Linear Regression](https://jramkiss.github.io/2020/03/01/regression-vs-bayesian-regression/)
- [An Introduction to MCMC methods and Bayesian Statistics](https://dam.ukdataservice.ac.uk/media/307220/presentation4.pdf)
- [Markov Chain Monte Carlo Linear Regression](https://letianzj.github.io/mcmc-linear-regression.html)