# Update your priors! (Maximum Likelihood Estimation)
One of the most important aspects of statistics with respect to Quant/ML is what's known as [Maximum Likelihood Estimation](https://en.wikipedia.org/wiki/Maximum_likelihood_estimation), or MLE for short. The idea is that you first create some "priors", i.e. a hypothesis that you presume to be true - usually in the sense that you claim that the world adheres to some statistical model with some unknown parameters. Then, you collect data. Then, you calculate the parameters under which the observed data is the _most likely_ (i.e., the likelihood of the data is maximized). Equivalently, we say that the observed data is the _least surprising_.
Often, it's hard to select priors a priori, so the way this is done in practice is that you _first_ collect the data, then you figure out your priors, then you optimize the parameters via MLE. The core research loop involves collecting more data, updating your priors, and then iterating.
## Line of best fit
Let's work through a super simple example. I assume you know that the traditional way to calculate a line of best fit is "least-squares regression", but have you ever wondered why? Here we'll rederive this formula from first principles.
Let's say that you collect some data and it looks like this:

Eyeballing it, the relationship appears linear. Therefore, your "prior" is that the data is heavily influenced by some underlying function $y = b_0 + b_1 x$, and that the rest of the relationship is unpredictable noise.
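To make this concrete, here's a minimal sketch (Python/NumPy) of simulated data standing in for the collected data, generated exactly according to that prior. The coefficients, noise level, and sample size are all made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical "true" underlying relationship (unknown to us in practice),
# plus unpredictable noise. All numbers here are made up for illustration.
b0_true, b1_true = 2.0, 0.5
x = rng.uniform(0, 10, size=100)
y = b0_true + b1_true * x + rng.normal(loc=0.0, scale=1.0, size=100)
```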

Now our goal is to optimize $b_0, b_1$ such that the observed values are the "least surprising" with respect to that model. Of course, the data is not _exactly_ linear, so our model is not perfect. Regardless of our choice of $b_0, b_1$, we're always left with some errors known as _residuals_ (i.e., the unpredictable noise). Our goal therefore is to create a _loss function_, where the loss is somehow related to the accumulated errors from all residuals, and then to optimize $b_0, b_1$ by minimizing the loss. However, before we can even begin to do this, we need a prior on our residuals as well. Let's say that I draw a line of best fit purely by eyeballing it, and then we plot a histogram of the residuals like so:

Ah, the residuals appear to be normally distributed. Let's set this in stone by defining another _prior_, that $e_i \sim N(0, \sigma)$.
$$\hat{y} = b_0 + b_1x$$
$$y = \hat{y} + \epsilon$$
$$\epsilon \sim N(0, \sigma)$$
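As a quick sanity check on this prior, one could compute the residuals against the eyeballed line and inspect their distribution. A minimal sketch, reusing the simulated data from above (the eyeballed $b_0, b_1$ here are just guesses I made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 0.5 * x + rng.normal(0.0, 1.0, size=100)  # same toy data as above

# An eyeballed guess at the line of best fit (not optimized yet).
b0_guess, b1_guess = 1.8, 0.55

residuals = y - (b0_guess + b1_guess * x)

# Crude check of the normality prior: roughly zero mean, bell-shaped histogram.
print("mean:", residuals.mean(), "stddev:", residuals.std())
counts, bin_edges = np.histogram(residuals, bins=10)
print(counts)
```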
Now, do we have a prior on the distribution of $\sigma$? Not at the moment, but we can leave it undefined for now and come back to it if it becomes an issue.
Under these priors, how can we calculate the $b_0, b_1$ that give the "maximum likelihood" of the observed data? Well, to do this, we need an expression for the probability density of the observed data. First we note that the probability density function of $N(\mu, \sigma)$ is defined by
$$\text{pdf}(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{1}{2}\frac{(x-\mu)^2}{\sigma^2}\right)$$
Therefore, if we have residuals $e_i \;\forall i \in [0, N-1]$, and we assume they are independent, then the combined probability density of the observed values is
$$\prod_i \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{1}{2}\frac{(e_i-0)^2}{\sigma^2}\right)$$
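Translated into code, the single-residual density and the combined density look like this (a minimal sketch; `normal_pdf` and `joint_density` are names introduced here for illustration, and `sigma` is just an assumed value since we haven't pinned it down):

```python
import numpy as np

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma) evaluated at x (sigma is the stddev)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def joint_density(residuals, sigma=1.0):
    """Product of the individual residual densities, i.e. the likelihood."""
    return np.prod(normal_pdf(residuals, mu=0.0, sigma=sigma))
```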
So our optimization problem is to find
$$\underset{(b_0, b_1)}{\arg\max} \prod_i \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{1}{2}\frac{e_i^2}{\sigma^2}\right)$$
Now, we just have to do algebra. First note that the logarithm function is monotonically increasing, so we can make progress by optimizing the _log-likelihood_ instead,
$$\underset{(b_0, b_1)}{\arg\max} \ln\left(\prod_i \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{1}{2}\frac{e_i^2}{\sigma^2}\right)\right)$$
$$\underset{(b_0, b_1)}{\arg\max} \sum_i\ln\left( \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{1}{2}\frac{e_i^2}{\sigma^2}\right)\right)$$
Expanding the logarithm, we can also drop the constant term, since it doesn't depend on $b_0, b_1$:
$$\underset{(b_0, b_1)}{\arg\max} \sum_i\left(\ln \frac{1}{\sigma\sqrt{2\pi}} + \ln\exp\left(-\frac{1}{2}\frac{e_i^2}{\sigma^2}\right)\right)$$
$$\underset{(b_0, b_1)}{\arg\max} \sum_i\left(\ln\exp\left(-\frac{1}{2}\frac{e_i^2}{\sigma^2}\right)\right)$$
$$\underset{(b_0, b_1)}{\arg\max} \sum_i-\frac{1}{2}\frac{e_i^2}{\sigma^2}$$
Since scaling by the positive constant $\frac{1}{2\sigma^2}$ doesn't change where the maximum occurs, this is equivalent to
$$\underset{(b_0, b_1)}{\arg\max} -\sum_ie_i^2$$
$$\underset{(b_0, b_1)}{\arg\min} \sum_ie_i^2$$
And just like that, we're done. We didn't even need a prior on $\sigma$: regardless of what the stddev was, as long as the residuals are normally distributed, the end result is that we're looking for the $b_0, b_1$ such that the _sum of the squares of the residuals_ is minimized. In other words, we've derived _least squares regression_. $\square$
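To sanity check the derivation numerically, one can maximize the log-likelihood directly with a generic optimizer (here `scipy.optimize.minimize`) and compare the result against a closed-form least-squares fit (`np.polyfit`). A sketch, again on the toy data from above:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(seed=0)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 0.5 * x + rng.normal(0.0, 1.0, size=100)  # same toy data as above

def neg_log_likelihood(params, sigma=1.0):
    """Negative log-likelihood of the residuals under e_i ~ N(0, sigma).

    The assumed sigma is arbitrary; as derived above, the arg max over
    (b0, b1) doesn't depend on it.
    """
    b0, b1 = params
    e = y - (b0 + b1 * x)
    return -np.sum(-np.log(sigma * np.sqrt(2 * np.pi)) - 0.5 * (e / sigma) ** 2)

b0_mle, b1_mle = minimize(neg_log_likelihood, x0=[0.0, 0.0]).x

# Closed-form least squares; np.polyfit returns the highest-degree coefficient first.
b1_ls, b0_ls = np.polyfit(x, y, deg=1)

print("MLE:          ", b0_mle, b1_mle)
print("Least squares:", b0_ls, b1_ls)  # should agree to numerical precision
```

Both routes land on (essentially) the same $b_0, b_1$, which is exactly what the derivation says should happen.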
## Addendum
Of course, you can see how this extends to more complex data relationships. One thing you'll discover is that "noise" is never truly random. Our job as researchers is to graph and analyze the noise, find ways to predict from other variables whether the noise will be positive or negative and by how much, and then add more parameters to our statistical model, optimizing the loss until the unexplained noise is squeezed down to some fundamental entropy of the Universe.