# Bayesian Optimization (Hands-on)
[Repository][repo_jujo] (ask Julian for permissions)
## Problem Definition
$\max_{x \in A}(f(x))$
with
- $f$ is a black box, potentially expensive to evaluate
- $\text{len}(x) < 10$
- Membership $x \in A$ is easy to test
## Solution Process
Treat $f$ as a random function and
1. Place a prior on $f$
2. Select initial points $x_\text{init}$ via an initial space-filling experiment
3. Evaluate $f$ at the initial points
4. Update the prior with the knowledge of $f(x_\text{init})$ -> posterior
5. Loop
    1. Update the posterior with the function evaluations
    2. Construct an acquisition function from the current posterior
    3. Select new points $x_\text{new}$ at the maximum of the acquisition function
    4. Evaluate $f(x_\text{new})$
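A minimal runnable sketch of this loop (my own toy example, not from the literature above; it assumes scikit-learn's `GaussianProcessRegressor` as the surrogate, expected improvement as the acquisition function, a plain random initial design, and maximization over the unit hypercube $A = [0,1]^d$):

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

def f(x):
    # toy black-box objective; maximum at x = (0.3, ..., 0.3)
    return -np.sum((x - 0.3) ** 2)

rng = np.random.default_rng(0)
d, n_init, n_iter = 2, 5, 20

# steps 2/3: initial design (plain random here; a space-filling design would be better)
X = rng.uniform(size=(n_init, d))
y = np.array([f(x) for x in X])

for _ in range(n_iter):
    # step 5.1: condition the GP on all evaluations so far (posterior update)
    gp = GaussianProcessRegressor(ConstantKernel() * RBF(), normalize_y=True)
    gp.fit(X, y)

    # step 5.2: expected improvement over the current best value on random candidates
    cand = rng.uniform(size=(2048, d))
    mu, sigma = gp.predict(cand, return_std=True)
    best = y.max()
    z = (mu - best) / np.maximum(sigma, 1e-12)
    ei = (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)

    # steps 5.3/5.4: evaluate f at the acquisition maximum and append
    x_new = cand[np.argmax(ei)]
    X = np.vstack([X, x_new])
    y = np.append(y, f(x_new))

print("best x:", X[np.argmax(y)], "best f(x):", y.max())
```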
## On the Selection of x_new
See [animations below](#animations)
- Favor points you have no idea about (large credible intervals)
- Favor points you expect to be good in value (large posterior mean)
- Exploration vs. exploitation tradeoff (high expected quality vs. high uncertainty)
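One way to make this tradeoff explicit is an upper-confidence-bound acquisition; a small sketch (the posterior mean/std values are made up for illustration, `kappa` weights exploration against exploitation):

```python
import numpy as np

def ucb(mu, sigma, kappa=2.0):
    """Upper confidence bound: posterior mean (exploitation)
    plus kappa * posterior std (exploration)."""
    return mu + kappa * sigma

# large kappa -> favor uncertain points, small kappa -> favor high predicted values
mu = np.array([1.0, 0.2]); sigma = np.array([0.1, 1.5])
print(ucb(mu, sigma, kappa=0.1))   # exploitation: first point wins
print(ucb(mu, sigma, kappa=3.0))   # exploration: second point wins
```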
## Gaussian process regression
- Assume the values $f(x_i)$ at points $x_i$ to be distributed normally, with a mean $\mu_0$ and a covariance (kernel) $\Sigma_0$
- Hence, $f(x) \sim \text{Normal}(\mu_0,\Sigma_0)$
- Compute conditional distribution (= posterior probability distribution) for new point $x_\text{new}$ as $f(x_\text{new})\vert f(x_{1:n}) \sim \text{Normal}(\mu_n,\sigma_n^2)$ (with $\mu_n$ and $\sigma_n^2$ being functions of $\mu_0$ and $\Sigma_0$) and $x_{1:n}$ being the known, evaluated points
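A small example of exactly this conditioning step using scikit-learn (also referenced below); the RBF kernel and the toy data are assumptions for illustration:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# known, evaluated points x_{1:n} and their values f(x_{1:n})
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.sin(X_train).ravel()

gpr = GaussianProcessRegressor(kernel=RBF(length_scale=1.0))
gpr.fit(X_train, y_train)                      # prior -> posterior

# posterior mean mu_n(x_new) and std sigma_n(x_new) at a new point
x_new = np.array([[1.5]])
mu_n, sigma_n = gpr.predict(x_new, return_std=True)
print(mu_n, sigma_n)
```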
## Questions
1. How to select initial points?
2. Given $\mathbf{x}$ of dimension $n$, i.e., $\text{len}(\mathbf{x})=n$, and $k$ initial points, how is "the mean vector" constructed?
## Resources
- Literature
- [Platypus][platy_blog]
- [Frazier_2018][Frazier_2018]
- [Snoek_2012][Snoek_2012]
- [Lecture UBC][lecture_ubc]
- Code
- Gaussian Process
- [Krasser][krasser]
- Optimization
- [GPyOpt][gpyopt_github]
## Thoughts
- General: In Frazier_2018, beware the difference between
- $k$ arbitrary points for the prior and
- $n$ known evaluated points
- Could the number of points $k$ for the prior be $k=0$ or $k=1$ (depending on the sampling)? And would we subsequently always update $\mu_n$ and $\sigma_n^2$ based on all already known points?
- I think we should be aware that the literature focuses on machine learning (data already present) whereas we think about optimization...
- [x] good point
### Mean
- Point $x$ : Parameter set (vector), e.g., $[k_1, k_2, k_3, k_4, k_5]$
- Function value $f(x)$ : Scalar value, e.g., residuum
- Mean of prior $\mu_0(x)$ is usually constant, i.e. $\mu_0(x) = \mu$ with $\mu$ being the mean of the function values $f(x_n)$ of the prior data points.
- $\mu_0(x) = \mu + \sum_{i=1}^p\beta_i\Psi_i(x)$ : Scalar value for any point $x$? With $\beta_i$ being scalar parameters and $\Psi_i(x)$ being scalar-valued (e.g., polynomial) functions.
Setting the mean **of the prior** to a non-constant expression could be beneficial if you know that there is a trend in the data. To illustrate this, let us think about $f(x)$ describing the price of fish, which varies over time $t$, with $x=t$. If we know there is inflation going on, we might be better off specifying a structural dependency of the mean of $f(t)$ as a function of $t$ rather than letting it be constant. See this [great hands-on blog post][platy_blog].
- Where does $\mu$ come from? The mean of the function values $f(x_k)$ at all given/known points $x_k$?
- Can we CHOOSE $\mu$ (see Eq. (2))? Hence, could we define $\mu\equiv 0$?
1. $\mu$ is a hyperparameter of the model and has to be optimized itself
2. $\mu$ is usually zero (if data is normalized) or the mean of the **prior** data points. See [scikit-learn](https://scikit-learn.org/stable/modules/gaussian_process.html#gaussian-process-regression-gpr).
- Eq. (2) : $f(x_{1:k}) \sim \text{Normal}(\mu_0(x_{1:k}),\Sigma_0(x_{1:k},x_{1:k}))$ : We assume the function values $f(x_{1:k})$ to follow a multivariate normal distribution with mean vector $\mu_0(x_{1:k})$ and covariance $\Sigma_0(x_{1:k},x_{1:k})$? What does this mean? Why do we have several means $\mu_0(x_{1:k})$?
- $\mu_0(x_{1:n})$ : Vector of all $\mu_0(x_i)$ evaluated at the points $x_1$ till $x_n$. Could it be that usually we have $\mu_0(x_{1:n}) = \underbrace{[\mu,\mu,\dots,\mu]^T}_{n \text{ times}}$?
- *~~Wouldn't that be stupid?~~ .. It fits my mental model and makes sense in equation (3).*
- Yes :D .. But usually we seem to choose $\mu_0 = \mu = \text{const}. \equiv 0$ anyways. See above.
- $\mu_n(x)$ is the mean function over all functions consistent with the data. All these functions MUST take the given values $f(x_i)$ at the evaluated points $x_i$.
<img src="https://upload.wikimedia.org/wikipedia/commons/7/7f/Gaussian_Process_Regression.png"/>
- Prior: We randomly choose functions $f(x)$ based on a normal distribution with (manually chosen) $\mu_0$ and $\Sigma_0$.
- Posterior: We force the functions $f(x)$ to take the given values $f(x_i)$
- From that we can compute $\mu_n(x)$ and $\sigma_n^2(x)$
Does this help? (Text and image are links to the original source):
["Gaussian process is a stochastic process (a collection of random variables indexed by time or space), such that every finite collection of those random variables has a multivariate normal distribution, i.e. every finite linear combination of them is normally distributed."](https://en.wikipedia.org/wiki/Gaussian_process)
[<img src="https://upload.wikimedia.org/wikipedia/commons/8/8e/MultivariateNormal.png" width="500" height="300"/>](https://en.wikipedia.org/wiki/Multivariate_normal_distribution)
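One more way to illustrate the prior/posterior distinction is a sketch with scikit-learn's `sample_y` (an unfitted regressor samples functions from the prior, a fitted one from the posterior); kernel and training points are arbitrary choices:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

X_plot = np.linspace(0, 5, 100).reshape(-1, 1)
gpr = GaussianProcessRegressor(kernel=RBF(length_scale=1.0))

# Prior: random functions drawn from Normal(mu_0, Sigma_0)
prior_samples = gpr.sample_y(X_plot, n_samples=3, random_state=0)

# Posterior: samples are forced through the evaluated points (x_i, f(x_i))
X_train = np.array([[1.0], [3.0], [4.0]])
y_train = np.sin(X_train).ravel()
gpr.fit(X_train, y_train)
posterior_samples = gpr.sample_y(X_plot, n_samples=3, random_state=0)
```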
### Kernel (correlation function) and Covariance matrix
- Kernel (correlation function) $\Sigma_0(x,x')$ : Scalar-valued relation between two points $x$ and $x'$
- E.g., $\Sigma_0(x,x') = \alpha_0\exp(-\Vert x-x' \Vert^2)$
- Covariance matrix $\Sigma_0(x_{1:n},x_{1:n})$ : Matrix of relations between all points $x_1$ till $x_n$ and all points $x_1$ till $x_n$ (Kernel evaluated at these points)
- Covariance vector? $\Sigma_0(x,x_{1:n})$ : Vector of relations between a point $x$ and all points $x_1$ till $x_n$
- [x] Agree
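A small numpy sketch of how these objects are assembled from the squared-exponential kernel above (points and `alpha0` are arbitrary choices for illustration):

```python
import numpy as np

def kernel(x, x_prime, alpha0=1.0):
    """Sigma_0(x, x') = alpha_0 * exp(-||x - x'||^2), scalar for two points."""
    return alpha0 * np.exp(-np.sum((x - x_prime) ** 2))

X = np.random.default_rng(1).uniform(size=(4, 2))   # points x_1 ... x_n (n=4, dim=2)
x = np.array([0.5, 0.5])                            # a new point x

# covariance matrix Sigma_0(x_{1:n}, x_{1:n}): n x n
Sigma_nn = np.array([[kernel(xi, xj) for xj in X] for xi in X])

# covariance vector Sigma_0(x, x_{1:n}): 1 x n
Sigma_xn = np.array([kernel(x, xi) for xi in X])
```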
### Posterior probability distribution
- $\mu_n(x) = \underbrace{\Sigma_0(x,x_{1:n})}_{1\times n} \underbrace{\Sigma_0(x_{1:n},x_{1:n})^{-1}}_{n\times n} \underbrace{(f(x_{1:n})-\mu_0(x_{1:n}))}_{n\times 1} + \underbrace{\mu_0(x)}_{1}$
- [x] Can this be correct?
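The dimensions work out and this matches the standard GP conditioning result; the corresponding posterior variance is $\sigma_n^2(x) = \Sigma_0(x,x) - \Sigma_0(x,x_{1:n})\,\Sigma_0(x_{1:n},x_{1:n})^{-1}\,\Sigma_0(x_{1:n},x)$. A numpy sketch of both, assuming noise-free observations, a constant prior mean $\mu_0(x) \equiv \mu$, and the squared-exponential kernel from above:

```python
import numpy as np

def kernel(a, b, alpha0=1.0):
    # same squared-exponential kernel as in the sketch above
    return alpha0 * np.exp(-np.sum((a - b) ** 2))

def gp_posterior(x, X, y, mu=0.0, jitter=1e-10):
    """Posterior mean mu_n(x) and variance sigma_n^2(x) at a single point x,
    given evaluated points X (n x d), values y (n,), and constant prior mean mu."""
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])   # Sigma_0(x_{1:n}, x_{1:n}), n x n
    k = np.array([kernel(x, xi) for xi in X])                  # Sigma_0(x, x_{1:n}), 1 x n
    K_inv = np.linalg.inv(K + jitter * np.eye(len(X)))         # jitter for numerical stability
    mu_n = k @ K_inv @ (y - mu) + mu                           # the formula above
    sigma_n2 = kernel(x, x) - k @ K_inv @ k                    # posterior variance
    return mu_n, sigma_n2

X = np.array([[0.0], [1.0], [2.0]])
y = np.array([0.0, 1.0, 0.5])
print(gp_posterior(np.array([1.5]), X, y, mu=y.mean()))
```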
## Notes
Acquisition functions are often maximized with quasi-Newton methods such as Broyden-Fletcher-Goldfarb-Shanno (BFGS) or with derivative-free methods such as Nelder-Mead.
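A sketch of that inner optimization with SciPy; `acq` is a hypothetical stand-in for a real acquisition function (EI, UCB, ...), and because acquisition functions are typically multimodal, the optimizer is restarted from several points:

```python
import numpy as np
from scipy.optimize import minimize

def acq(x):
    """Hypothetical acquisition function (stand-in for EI, UCB, ...)."""
    return float(np.exp(-np.sum((x - 0.7) ** 2)))

bounds = [(0.0, 1.0)] * 2
rng = np.random.default_rng(0)

# multi-start L-BFGS-B on the negative acquisition (we maximize by minimizing -acq)
starts = rng.uniform(size=(10, 2))
results = [minimize(lambda x: -acq(x), x0, method="L-BFGS-B", bounds=bounds)
           for x0 in starts]
x_new = min(results, key=lambda r: r.fun).x
print("next evaluation point:", x_new)
```

Nelder-Mead would work the same way via `method="Nelder-Mead"`, just without using gradient information.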
## Buzz words
- Gaussian process regression
- Acquisition function
- Expected improvement (Efficient Global Optimization?)
- Entropy search
- Knowledge gradient
- [Surrogate methods][surrogate_matlab]
## More buzz words
- Gaussian process
- Kriging
- Parzen-Tree (Tree-structured Parzen estimator, TPE)
- Probability of improvement
- Expected improvement
- Bayesian expected losses
- Upper confidence bound (UCB)
- Thompson sampling
## Animations

- Source: https://towardsdatascience.com/bayesian-hyperparameter-optimization-17dc5834112d
- Source: https://en.wikipedia.org/wiki/Bayesian_optimization#/media/File:GpParBayesAnimationSmall.gif
- Source: https://ax.dev/docs/bayesopt.html
[bayesian_optimization_wiki]: https://en.wikipedia.org/wiki/Bayesian_optimization
[Frazier_2018]: https://arxiv.org/pdf/1807.02811.pdf
[Snoek_2012]: https://proceedings.neurips.cc/paper/2012/file/05311655a15b75fab86956663e1819cd-Paper.pdf
[surrogate_matlab]: https://de.mathworks.com/help/gads/what-is-surrogate-optimization.html
[animation_bayesion_op_wiki]: https://en.wikipedia.org/wiki/Bayesian_optimization#/media/File:GpParBayesAnimationSmall.gif
[platy_blog]: http://platypusinnovation.blogspot.com/2016/05/a-simple-intro-to-gaussian-processes.html
[repo_jujo]: https://git.scc.kit.edu/jujo/bayesian
[lecture_ubc]: https://www.cs.ubc.ca/~nando/540-2013/lectures.html
[krasser]: http://krasserm.github.io/2018/03/19/gaussian-processes/
[gpyopt_github]: https://github.com/SheffieldML/GPyOpt