# Maximum Entropy Search, everything under the hood

*...for level set estimation of computationally intractable functions*

### Notation

Disclaimer: here we use statistics/machine learning notation; we make explicit what each variable means in a HEP context.

- The implicit intractable function $f: \mathbb{R}^d \to \mathbb{R}$. In the context of HEP this is the evaluation of an analysis pipeline like SimpleAnalysis, or evaluation at reco-level.
- $\mathbf{x} \in \mathbb{R}^d$ is an item in the domain of the computationally intractable function $f$. In the context of HEP this is the parameter space of a BSM theory, e.g. $\mathbf{x} = (m_{Z'}=500\text{GeV}, m_{\chi}=200\text{GeV}, g=1.0, ...)$
- $y \in \mathbb{R}$ is the observation of the image of $f$, $y = f(\mathbf{x}) + \epsilon$. In the context of HEP it is often a p-value-related quantity. The random variable $\epsilon$ models the homoskedastic noise that arises when measuring $f(\mathbf{x})$; it is also called the *observational model* or *likelihood*. No assumption needs to be made about the distribution of $\epsilon$, although we often use a 1-d Gaussian. In the context of HEP this models the fact that a Monte Carlo simulation cannot run forever and give an exact result. $\epsilon$ models *aleatoric* uncertainty, **not** *epistemic* uncertainty.
- We use a Gaussian Process $Y \sim \mathcal{GP}(\mu(\mathbf{x}), k(\mathbf{x}, \mathbf{x'}))$ to perform regression on a dataset $D=\{(\mathbf{x}_i, y_i)\}_{i=1..n}$ (see *meta-modelling*).
- $t\in\mathbb{R}$ is the threshold at which we want to find the level set of $f$; that is, the level set is defined as $\{\mathbf{x}\in\mathbb{R}^d \,|\, f(\mathbf{x})=t \}$. In the context of HEP, $t$ is typically a p-value threshold or a given fixed upper limit on the signal strength ($\mu^{UL}=1$).

### Definition

For every point $\mathbf{x}$ in the domain we define a random variable $Z(\mathbf{x}) \sim \text{Bernoulli}(S(\mathbf{x}))$ (whether $f(\mathbf{x})$ is below or above $t$), with

$$ S(\mathbf{x}) = \int_{-\infty}^{t} p(Y=y\,|\,\mathbf{x})\, dy $$

(the cumulative of the posterior probability of the GP at the point $\mathbf{x}$).

<picture>

The Maximum Entropy Search (MES) acquisition function is defined as follows:

$$ \boxed{u_{MES}(\mathbf{x}) = H[Z(\mathbf{x})]} $$

and the suggested point at which to evaluate the pipeline is

$$ \mathbf{x}_{new} = \text{argmax}_{\mathbf{x}}\,\, H[Z(\mathbf{x})] $$
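Under a Gaussian observational model, $S(\mathbf{x})$ reduces to a Gaussian CDF of the GP posterior at $\mathbf{x}$, and the acquisition is just the binary entropy of $S(\mathbf{x})$. Below is a minimal sketch under that assumption; the function name `mes_acquisition` and its vectorized signature are ours for illustration, not from any existing package:

```python
import numpy as np
from scipy.stats import norm

def mes_acquisition(mu, sigma, t):
    """Maximum Entropy Search acquisition u_MES(x) = H[Z(x)].

    mu, sigma : GP posterior mean and standard deviation at the candidate
                points (sigma should include the noise level of epsilon
                when observations are noisy).
    t         : the level-set threshold.

    Assuming a Gaussian observational model, S(x) = Phi((t - mu) / sigma),
    and H[Z(x)] is the entropy of a Bernoulli(S(x)) variable.
    """
    s = norm.cdf((t - mu) / sigma)
    s = np.clip(s, 1e-12, 1 - 1e-12)  # guard against log(0)
    return -(s * np.log(s) + (1.0 - s) * np.log1p(-s))
```

Note that $u_{MES}$ peaks where $S(\mathbf{x}) = 1/2$, i.e. where the posterior mean crosses $t$: this is why MES concentrates evaluations on the predicted boundary of the level set (see finding 1.1 below).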
### Warm start strategy for Active Learning (brief)

<to write>

### Summary of theoretical findings

(See the [Proofs](#proofs) section for details.)

1.1. **In the low-$\epsilon$ regime MES chooses points from the predicted level set with probability proportional to the kernel variance of the GP posterior around the predicted boundary of the level set.** (co-author Jaume de Dios, Dept. of Mathematics, UCLA)

Corollaries:
+ 1.2. In the $d=1$ case the level set is a finite set of points. Therefore, if we select a batch size larger than the number of points in the level set, MES will choose points that are very close to the points in the level set, ignoring exploration. This is the only case where this happens; for all other $d$ the level set always has an infinite number of points. $d=1$ is a degenerate case.
+ 1.3. Scalability with $d$: in high dimensions most of the volume of the $d$-ball is concentrated near its surface, *not* deep inside. Prioritizing the boundary, as MES does, is in any case the asymptotic behaviour of most acquisition functions. (<to do: double-check this affirmation>)

2. **It is not a good idea to artificially increase uncertainty when performing a warm start.** The $\epsilon$ in $y=f(\mathbf{x}) + \epsilon$ is linked to aleatoric uncertainty, not epistemic uncertainty. I naively and wrongly suggested increasing $\epsilon$ when going from one step to another during warm start, to make up for the fact that there is a bias between the two ground-truth functions. This type of uncertainty is epistemic and cannot be compensated for with aleatoric uncertainty.

### Proofs

TO TYPESET. Handwritten notes [HERE](https://drive.google.com/file/d/1H9KAnQhvSCto2urDMXhNofzK2whFSuKj/view?usp=sharing)

### Frequently Asked Questions & Comments

- How do you know you converge?
- Is this the optimal way of choosing points?
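To make the acquisition loop concrete, here is a toy end-to-end sketch of the sequential version (batch size 1), reusing the `mes_acquisition` helper sketched earlier. The synthetic `f`, the `RBF` kernel, and the grid-based argmax are illustrative stand-ins, not part of the method's specification: in practice `f` is the expensive analysis pipeline and the argmax is done with a proper optimizer.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# mes_acquisition as defined in the sketch above

rng = np.random.default_rng(0)

def f(x):
    # cheap stand-in for the expensive pipeline (e.g. SimpleAnalysis)
    return np.sin(3.0 * x[:, 0])

t = 0.0       # level-set threshold
noise = 0.05  # std of the aleatoric noise epsilon

# initial design
X = rng.uniform(-2.0, 2.0, size=(5, 1))
y = f(X) + noise * rng.standard_normal(len(X))

grid = np.linspace(-2.0, 2.0, 400).reshape(-1, 1)  # candidates for the argmax

for _ in range(20):
    # fit the GP; alpha is the noise *variance* added to the kernel diagonal
    gp = GaussianProcessRegressor(kernel=RBF(), alpha=noise**2).fit(X, y)
    mu, sigma = gp.predict(grid, return_std=True)
    # x_new = argmax_x H[Z(x)]
    x_new = grid[np.argmax(mes_acquisition(mu, sigma, t))]
    y_new = f(x_new.reshape(1, -1)) + noise * rng.standard_normal(1)
    X = np.vstack([X, x_new])
    y = np.concatenate([y, y_new])
```

Note that this toy example is in the degenerate $d=1$ regime of corollary 1.2, which is fine for a sequential loop but would suppress exploration with large batches.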