# Bayesian Optimization (Hands-on)
[Repository][repo_jujo] (ask Julian for permissions)
## Problem Definition
$\max_{x \in A}(f(x))$
with
- $f$ is a black box, potentially expensive to evaluate
- $\text{len}(x) < 10$
- Membership $x \in A$ is easy to test
## Solution Process
Treat $f$ as a random function and
1. Place a prior on $f$
2. Select initial points $x_\text{init}$ via an initial space-filling experiment
3. Evaluate $f$ at the initial points
4. Update the prior with the knowledge of $f(x_\text{init})$ -> posterior
5. Loop
    1. Update the posterior with the function evaluations
    2. Construct an acquisition function from the current posterior
    3. Select new points $x_\text{new}$ at the maximum of the acquisition function
    4. Evaluate $f(x_\text{new})$
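A minimal runnable sketch of this loop (my own toy example, not from the literature above; it assumes scikit-learn's `GaussianProcessRegressor` as the surrogate, expected improvement as the acquisition function, a plain random initial design, and maximization over the unit hypercube $A = [0,1]^d$):

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

def f(x):
    # toy black-box objective; maximum at x = (0.3, ..., 0.3)
    return -np.sum((x - 0.3) ** 2)

rng = np.random.default_rng(0)
d, n_init, n_iter = 2, 5, 20

# steps 2/3: initial design (plain random here; a space-filling design would be better)
X = rng.uniform(size=(n_init, d))
y = np.array([f(x) for x in X])

for _ in range(n_iter):
    # step 5.1: condition the GP on all evaluations so far (posterior update)
    gp = GaussianProcessRegressor(ConstantKernel() * RBF(), normalize_y=True)
    gp.fit(X, y)

    # step 5.2: expected improvement over the current best value on random candidates
    cand = rng.uniform(size=(2048, d))
    mu, sigma = gp.predict(cand, return_std=True)
    best = y.max()
    z = (mu - best) / np.maximum(sigma, 1e-12)
    ei = (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)

    # steps 5.3/5.4: evaluate f at the acquisition maximum and append
    x_new = cand[np.argmax(ei)]
    X = np.vstack([X, x_new])
    y = np.append(y, f(x_new))

print("best x:", X[np.argmax(y)], "best f(x):", y.max())
```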
## On the Selection of x_new
See [animations below](#animations)
- Favor points you have no idea about (large credible intervals)
- Favor points you expect to be good in value (large posterior mean)
- Exploration vs. exploitation tradeoff (high expected quality vs. high uncertainty)
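One way to make this tradeoff explicit is an upper-confidence-bound acquisition; a small sketch (the posterior mean/std values are made up for illustration, `kappa` weights exploration against exploitation):

```python
import numpy as np

def ucb(mu, sigma, kappa=2.0):
    """Upper confidence bound: posterior mean (exploitation)
    plus kappa * posterior std (exploration)."""
    return mu + kappa * sigma

# large kappa -> favor uncertain points, small kappa -> favor high predicted values
mu = np.array([1.0, 0.2]); sigma = np.array([0.1, 1.5])
print(ucb(mu, sigma, kappa=0.1))   # exploitation: first point wins
print(ucb(mu, sigma, kappa=3.0))   # exploration: second point wins
```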
## Gaussian process regression
- Assume the values $f(x_i)$ at points $x_i$ to be distributed normally, with a mean $\mu_0$ and a covariance (kernel) $\Sigma_0$
- Hence, $f(x) \sim \text{Normal}(\mu_0,\Sigma_0)$
- Compute conditional distribution (= posterior probability distribution) for new point $x_\text{new}$ as $f(x_\text{new})\vert f(x_{1:n}) \sim \text{Normal}(\mu_n,\sigma_n^2)$ (with $\mu_n$ and $\sigma_n^2$ being functions of $\mu_0$ and $\Sigma_0$) and $x_{1:n}$ being the known, evaluated points
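A small example of exactly this conditioning step using scikit-learn (also referenced below); the RBF kernel and the toy data are assumptions for illustration:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# known, evaluated points x_{1:n} and their values f(x_{1:n})
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.sin(X_train).ravel()

gpr = GaussianProcessRegressor(kernel=RBF(length_scale=1.0))
gpr.fit(X_train, y_train)                      # prior -> posterior

# posterior mean mu_n(x_new) and std sigma_n(x_new) at a new point
x_new = np.array([[1.5]])
mu_n, sigma_n = gpr.predict(x_new, return_std=True)
print(mu_n, sigma_n)
```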
## Questions
1. How to select initial points?
2. Given $\mathbf{x}$ of dimension $n$, i.e., $\text{len}(\mathbf{x})=n$, and $k$ initial points, how is "the mean vector" constructed?
## Resources
- Literature
- [Platypus][platy_blog]
- [Frazier_2018][Frazier_2018]
- [Snoek_2012][Snoek_2012]
- [Lecture UBC][lecture_ubc]
- Code
- Gaussian Process
- [Krasser][krasser]
- Optimization
- [GPyOpt][gpyopt_github]
## Thoughts
- General: In Frazier_2018, beware the difference between
- $k$ arbitrary points for the prior and
- $n$ known evaluated points
- Could the number of points $k$ for the prior be $k=0$ or $k=1$ (depending on the sampling)? And would we subsequently always update $\mu_n$ and $\sigma_n^2$ based on all already known points?
- I think we should be aware that the literature focuses on machine learning (data already present) whereas we think about optimization...
- [x] good point
### Mean
- Point $x$ : Parameter set (vector), e.g., $[k_1, k_2, k_3, k_4, k_5]$
- Function value $f(x)$ : Scalar value, e.g., residuum
- Mean of prior $\mu_0(x)$ is usually constant, i.e. $\mu_0(x) = \mu$ with $\mu$ being the mean of the function values $f(x_n)$ of the prior data points.
- $\mu_0(x) = \mu + \sum_{i=1}^p\beta_i\Psi_i(x)$ : Scalar value for any point $x$? With $\beta_i$ being scalar parameters and $\Psi_i(x)$ being scalar-valued (e.g., polynomial) functions.
Setting the mean **of the prior** to a non-constant expression could be beneficial if you know that there is a trend in the data. To illustrate this, let us think about $f(x)$ describing the price of fish, which varies over time $t$, with $x=t$. If we know there is inflation going on, we might be better off specifying a structural dependency of the mean of $f(t)$ as a function of $t$ rather than letting it be constant. See this [great hands-on blog post][platy_blog].
- Where does $\mu$ come from? The mean of the function values $f(x_k)$ at all given/known points $x_k$?
- Can we CHOOSE $\mu$ (see Eq. (2))? Hence, could we define $\mu\equiv 0$?
1. $\mu$ is a hyperparameter of the model and has to be optimized itself
2. $\mu$ is usually zero (if data is normalized) or the mean of the **prior** data points. See [scikit-learn](https://scikit-learn.org/stable/modules/gaussian_process.html#gaussian-process-regression-gpr).
- Eq. (2) : $f(x_{1:k}) \sim \text{Normal}(\mu_0(x_{1:k}),\Sigma_0(x_{1:k},x_{1:k}))$ : We assume the function values $f(x_{1:k})$ to follow a multivariate normal distribution with mean vector $\mu_0(x_{1:k})$ and covariance $\Sigma_0(x_{1:k},x_{1:k})$? What does this mean? Why do we have several means $\mu_0(x_{1:k})$?
- $\mu_0(x_{1:n})$ : Vector of all $\mu_0(x_i)$ evaluated at the points $x_1$ till $x_n$. Could it be that usually we have $\mu_0(x_{1:n}) = \underbrace{[\mu,\mu,\dots,\mu]^T}_{n \text{ times}}$?
- *~~Wouldn't that be stupid?~~ .. It fits my mental model and makes sense in equation (3).*
- Yes :D .. But usually we seem to choose $\mu_0 = \mu = \text{const}. \equiv 0$ anyways. See above.
- $\mu_n(x)$ is the mean function over all functions consistent with the data. All these functions MUST take the given values $f(x_i)$ at the evaluated points $x_i$.
<img src="https://upload.wikimedia.org/wikipedia/commons/7/7f/Gaussian_Process_Regression.png"/>
- Prior: We randomly choose functions $f(x)$ based on a normal distribution with (manually chosen) $\mu_0$ and $\Sigma_0$.
- Posterior: We force the functions $f(x)$ to take the given values $f(x_i)$
- From that we can compute $\mu_n(x)$ and $\sigma_n^2(x)$
Does this help? (Text and image are links to the original source):
["Gaussian process is a stochastic process (a collection of random variables indexed by time or space), such that every finite collection of those random variables has a multivariate normal distribution, i.e. every finite linear combination of them is normally distributed."](https://en.wikipedia.org/wiki/Gaussian_process)
[<img src="https://upload.wikimedia.org/wikipedia/commons/8/8e/MultivariateNormal.png" width="500" height="300"/>](https://en.wikipedia.org/wiki/Multivariate_normal_distribution)
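One more way to illustrate the prior/posterior distinction is a sketch with scikit-learn's `sample_y` (an unfitted regressor samples functions from the prior, a fitted one from the posterior); kernel and training points are arbitrary choices:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

X_plot = np.linspace(0, 5, 100).reshape(-1, 1)
gpr = GaussianProcessRegressor(kernel=RBF(length_scale=1.0))

# Prior: random functions drawn from Normal(mu_0, Sigma_0)
prior_samples = gpr.sample_y(X_plot, n_samples=3, random_state=0)

# Posterior: samples are forced through the evaluated points (x_i, f(x_i))
X_train = np.array([[1.0], [3.0], [4.0]])
y_train = np.sin(X_train).ravel()
gpr.fit(X_train, y_train)
posterior_samples = gpr.sample_y(X_plot, n_samples=3, random_state=0)
```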
### Kernel (correlation function) and Covariance matrix
- Kernel (correlation function) $\Sigma_0(x,x')$ : Scalar-valued relation between two points $x$ and $x'$
- E.g., $\Sigma_0(x,x') = \alpha_0\exp(-\Vert x-x' \Vert^2)$
- Covariance matrix $\Sigma_0(x_{1:n},x_{1:n})$ : Matrix of relations between all points $x_1$ till $x_n$ and all points $x_1$ till $x_n$ (Kernel evaluated at these points)
- Covariance vector? $\Sigma_0(x,x_{1:n})$ : Vector of relations between a point $x$ and all points $x_1$ till $x_n$
- [x] Agree
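A small numpy sketch of how these objects are assembled from the squared-exponential kernel above (points and `alpha0` are arbitrary choices for illustration):

```python
import numpy as np

def kernel(x, x_prime, alpha0=1.0):
    """Sigma_0(x, x') = alpha_0 * exp(-||x - x'||^2), scalar for two points."""
    return alpha0 * np.exp(-np.sum((x - x_prime) ** 2))

X = np.random.default_rng(1).uniform(size=(4, 2))   # points x_1 ... x_n (n=4, dim=2)
x = np.array([0.5, 0.5])                            # a new point x

# covariance matrix Sigma_0(x_{1:n}, x_{1:n}): n x n
Sigma_nn = np.array([[kernel(xi, xj) for xj in X] for xi in X])

# covariance vector Sigma_0(x, x_{1:n}): 1 x n
Sigma_xn = np.array([kernel(x, xi) for xi in X])
```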
### Posterior probability distribution
- $\mu_n(x) = \underbrace{\Sigma_0(x,x_{1:n})}_{1\times n} \underbrace{\Sigma_0(x_{1:n},x_{1:n})^{-1}}_{n\times n} \underbrace{(f(x_{1:n})-\mu_0(x_{1:n}))}_{n\times 1} + \underbrace{\mu_0(x)}_{1}$
- [x] Can this be correct?
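The dimensions work out and this matches the standard GP conditioning result; the corresponding posterior variance is $\sigma_n^2(x) = \Sigma_0(x,x) - \Sigma_0(x,x_{1:n})\,\Sigma_0(x_{1:n},x_{1:n})^{-1}\,\Sigma_0(x_{1:n},x)$. A numpy sketch of both, assuming noise-free observations, a constant prior mean $\mu_0(x) \equiv \mu$, and the squared-exponential kernel from above:

```python
import numpy as np

def kernel(a, b, alpha0=1.0):
    # same squared-exponential kernel as in the sketch above
    return alpha0 * np.exp(-np.sum((a - b) ** 2))

def gp_posterior(x, X, y, mu=0.0, jitter=1e-10):
    """Posterior mean mu_n(x) and variance sigma_n^2(x) at a single point x,
    given evaluated points X (n x d), values y (n,), and constant prior mean mu."""
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])   # Sigma_0(x_{1:n}, x_{1:n}), n x n
    k = np.array([kernel(x, xi) for xi in X])                  # Sigma_0(x, x_{1:n}), 1 x n
    K_inv = np.linalg.inv(K + jitter * np.eye(len(X)))         # jitter for numerical stability
    mu_n = k @ K_inv @ (y - mu) + mu                           # the formula above
    sigma_n2 = kernel(x, x) - k @ K_inv @ k                    # posterior variance
    return mu_n, sigma_n2

X = np.array([[0.0], [1.0], [2.0]])
y = np.array([0.0, 1.0, 0.5])
print(gp_posterior(np.array([1.5]), X, y, mu=y.mean()))
```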
## Notes
Acquisition functions are often maximized with quasi-Newton methods such as Broyden-Fletcher-Goldfarb-Shanno (BFGS) or with derivative-free methods such as Nelder-Mead.
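A sketch of that inner optimization with SciPy; `acq` is a hypothetical stand-in for a real acquisition function (EI, UCB, ...), and because acquisition functions are typically multimodal, the optimizer is restarted from several points:

```python
import numpy as np
from scipy.optimize import minimize

def acq(x):
    """Hypothetical acquisition function (stand-in for EI, UCB, ...)."""
    return float(np.exp(-np.sum((x - 0.7) ** 2)))

bounds = [(0.0, 1.0)] * 2
rng = np.random.default_rng(0)

# multi-start L-BFGS-B on the negative acquisition (we maximize by minimizing -acq)
starts = rng.uniform(size=(10, 2))
results = [minimize(lambda x: -acq(x), x0, method="L-BFGS-B", bounds=bounds)
           for x0 in starts]
x_new = min(results, key=lambda r: r.fun).x
print("next evaluation point:", x_new)
```

Nelder-Mead would work the same way via `method="Nelder-Mead"`, just without using gradient information.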
## Buzz words
- Gaussian process regression
- Acquisition function
- Expected improvement (Efficient Global Optimization?)
- Entropy search
- Knowledge gradient
- [Surrogate methods][surrogate_matlab]
## More buzz words
- Gaussian process
- Kriging
- Parzen-Tree (Tree-structured Parzen estimator, TPE)
- Probability of improvement
- Expected improvement
- Bayesian expected losses
- Upper confidence bound (UCB)
- Thompson sampling
## Animations

- Source: https://towardsdatascience.com/bayesian-hyperparameter-optimization-17dc5834112d
- Source: https://en.wikipedia.org/wiki/Bayesian_optimization#/media/File:GpParBayesAnimationSmall.gif
- Source: https://ax.dev/docs/bayesopt.html
[bayesian_optimization_wiki]: https://en.wikipedia.org/wiki/Bayesian_optimization
[Frazier_2018]: https://arxiv.org/pdf/1807.02811.pdf
[Snoek_2012]: https://proceedings.neurips.cc/paper/2012/file/05311655a15b75fab86956663e1819cd-Paper.pdf
[surrogate_matlab]: https://de.mathworks.com/help/gads/what-is-surrogate-optimization.html
[animation_bayesion_op_wiki]: https://en.wikipedia.org/wiki/Bayesian_optimization#/media/File:GpParBayesAnimationSmall.gif
[platy_blog]: http://platypusinnovation.blogspot.com/2016/05/a-simple-intro-to-gaussian-processes.html
[repo_jujo]: https://git.scc.kit.edu/jujo/bayesian
[lecture_ubc]: https://www.cs.ubc.ca/~nando/540-2013/lectures.html
[krasser]: http://krasserm.github.io/2018/03/19/gaussian-processes/
[gpyopt_github]: https://github.com/SheffieldML/GPyOpt