---
title: Heteroscedastic Preferential BO
---
## Solving the non-identifiability problem
Current work suffers from the non-identifiability problem, i.e., more than one latent function satisfies the same preference relation. In our work, we assume the user can provide a set of anchors, and that these anchors are equipped with measurement results. We leverage the anchors for model (hyperparameter) selection. The goal is to match the magnitude of our GP to the anchor measurement values $Y_0 = \{y_0^{(\ell)}\}_{\ell = 1}^L$. The presence of $Y_0$ constrains the magnitude of our GP, hence resolving the non-identifiability problem. Mathematically, we can write the model selection criterion, which we call the anchor marginal likelihood, as follows
\begin{align}
\log p(Y_0 \vert \mathbf{X}_0, \theta) = \log \int p(Y_0 \vert \mathbf{X}_0, f) \, p(f \vert \theta) \, df
\end{align}
We then maximize this criterion w.r.t. the GP hyperparameters. Intuitively, the expert's initial latent function is guided by the anchor measurements. Combining this criterion with the conventional log marginal likelihood gives us
\begin{align}
\theta^\ast = \operatorname*{arg\,max}_{\theta \in \Theta} \; \beta \log p(\mathbf{v}_m < 0 \vert \theta) + (1 - \beta) \log p(Y_0 \vert \mathbf{X}_0, \theta)
\end{align}
with $\beta \in (0, 1)$ weighting the two terms.
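For concreteness, here is a minimal sketch of the combined criterion, assuming a zero-mean GP with i.i.d. Gaussian noise for the anchor term; the preference marginal likelihood is left as a user-supplied callable, and `pref_log_lik`, the hyperparameter packing in `theta`, and the `kernel` signature are all illustrative assumptions.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def anchor_log_marginal_likelihood(theta, X0, Y0, kernel):
    # log p(Y_0 | X_0, theta) for a zero-mean GP with i.i.d. Gaussian noise.
    # Assumption: kernel(X, X, theta) returns the covariance matrix and
    # theta[-1] is the log noise variance.
    K = kernel(X0, X0, theta) + np.exp(theta[-1]) * np.eye(len(X0))
    L, lower = cho_factor(K, lower=True)
    alpha = cho_solve((L, lower), Y0)
    return (-0.5 * Y0 @ alpha
            - np.sum(np.log(np.diag(L)))
            - 0.5 * len(Y0) * np.log(2.0 * np.pi))

def combined_objective(theta, X0, Y0, kernel, pref_log_lik, beta=0.5):
    # beta * log p(v_m < 0 | theta) + (1 - beta) * log p(Y_0 | X_0, theta).
    # pref_log_lik is a placeholder for the preference marginal likelihood.
    return (beta * pref_log_lik(theta)
            + (1.0 - beta) * anchor_log_marginal_likelihood(theta, X0, Y0, kernel))
```

The combined objective can then be maximized over $\theta$ with any off-the-shelf optimizer, e.g., by minimizing its negation with `scipy.optimize.minimize`.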
<!--Furthermore, we are interested when the expert's latent function is flexible enough. This is reasonable since human may develop an abstractive intuition after gaining some information (the measurements). Therefore, we proposed the invariance property on the function or the anchor's measurement values.-->
<!--Let us assume the evolved latent function can be constructed by a composition of $f$ and and and a group $g \in G$. Assumes that there exists a probability measure over $G$. -->
<!--Given a group $G$ and suppose there is a probability measure over $G$, we can rewrite the anchor's marginal likelihood as follows
**Model 1**
\begin{align}
\log p(Y_0 \vert X_0, \theta, M) = \log \int \prod_{\ell = 1}^L \int p(g \circ y_0^\ell \vert x_0^\ell, f) \, p(g) \, dg \, p(f \vert \theta, M) \, df
\end{align}
If we view the $g \in G$ as noise, then this is analogous to a noise-robust model.
**Model 2**
\begin{align}
\log p(Y_0 \vert X_0, \theta, M) = \log \int \int p(Y_0 \vert X_0, g \circ f) \, p(g) \, p(f \vert \theta, M) \, dg \, df
\end{align}
This model force GP to be invariant under $G$. -->
<!--Questions so far:
- should we combine the above criterion with the conventional log marginal likelihood which depends on the dataset, i.e., the preference observations?
- How do we handle multiple anchor measurements (multi-output)? -->
## Gaussian process model
Since we assume the anchors are equipped with measurements, we can now utilize $(X_0, Y_0)$ to build our noise estimator $g$. Recall that we follow the most likely heteroscedastic Gaussian process procedure (a minimal sketch follows the steps below):
- Given the anchors $(X_0, Y_0)$, we estimate a standard, homoscedastic GP $\hat{f}$ by maximizing the likelihood of $Y_0$ given $X_0$.
- Given $\hat{f}$, we estimate the empirical noise level at each anchor, i.e., $Z_0 = \{ z_\ell = \log \mathbb{V}[y_0^{(\ell)}, \hat{f}(x_0^{(\ell)}; X_0)] \}_{\ell \leq L}$, forming a new dataset $(X_0, Z_0)$.
- On $(X_0, Z_0)$, we estimate a second GP $g$ that models the log noise level.
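A minimal sketch of these three steps, assuming scikit-learn's `GaussianProcessRegressor`, synthetic placeholder anchors, and a single-sample squared residual as a crude stand-in for the empirical variance $\mathbb{V}$:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Hypothetical anchor data (X_0, Y_0); shapes and values are placeholders.
rng = np.random.default_rng(0)
X0 = rng.uniform(0.0, 1.0, size=(8, 1))
Y0 = np.sin(6.0 * X0).ravel() + 0.1 * rng.standard_normal(8)

# Step 1: homoscedastic GP f_hat fit on the anchors.
f_hat = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
f_hat.fit(X0, Y0)

# Step 2: empirical log noise levels Z_0 from the residuals of f_hat.
mu0 = f_hat.predict(X0)
Z0 = np.log((Y0 - mu0) ** 2 + 1e-8)  # squared residual as a crude variance estimate

# Step 3: second GP g on (X_0, Z_0) modelling the log noise level.
g = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
g.fit(X0, Z0)

# Predicted heteroscedastic noise variance at query points.
X_query = np.linspace(0.0, 1.0, 50).reshape(-1, 1)
noise_var = np.exp(g.predict(X_query))
```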
## The anchors
Storing a large number of anchors is not feasible due to the computational cost it would incur. This becomes particularly important in a scenario where the expert updates the anchors mid-iteration. To keep the computational load bounded, the expert must replace one of the existing anchors with the new candidate. Intuitively, the expert is willing to replace an anchor when the new candidate looks more promising, i.e., has a better chance of being the optimum of the objective function. If the expert keeps updating the anchors, the anchor set eventually becomes a set of candidate optima.
In bandit optimization there are three common methods for dealing with non-stationary environments (a sliding-window sketch follows the list):
- restarting
- sliding window
- weighted penalty
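As a sketch of the sliding-window option applied to anchor maintenance, assuming a fixed buffer size and a FIFO eviction rule (both illustrative choices, not part of the method above):

```python
from collections import deque

class AnchorBuffer:
    """Fixed-size anchor set: adding a new anchor evicts the oldest (sliding window)."""

    def __init__(self, max_anchors):
        self.anchors = deque(maxlen=max_anchors)  # stores (x, y) pairs

    def add(self, x, y):
        # deque with maxlen drops the oldest entry automatically once full.
        self.anchors.append((x, y))

    def as_lists(self):
        xs, ys = zip(*self.anchors)
        return list(xs), list(ys)

buffer = AnchorBuffer(max_anchors=5)
for t in range(8):
    buffer.add(x=t / 10.0, y=float(t) ** 2)  # placeholder measurements
X0, Y0 = buffer.as_lists()  # only the 5 most recent anchors remain
```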
## Acquisition function
The anchor points play a crucial role in shaping the behavior of the noise. My hypothesis is that we need to adjust the magnitude of the noise as the optimization progresses over many iterations. An interesting idea is to gradually reduce the noise as we grant the surrogate model more influence. Ultimately, our objective is for the predictive model to exhibit homoscedastic noise, with the informed anchor points having a diminishing impact on predictions in the long run.
#### Simple model
$\varepsilon(x) \sim \mathcal{N}(0, \phi(x)^2)$ with $\phi(x) = 1 + \sum_{\ell \leq L} (w_\ell - 1) \, k_\ell(d_\ell(x, x_0^{(\ell)}))$
\begin{align}
\mathrm{ANPEI}(x) = \rho \, \mathrm{EI}(x; \phi(x)^{\frac{\gamma}{t}}) - (1 - \rho) \, \phi(x)^{\frac{\gamma}{t}}
\end{align}
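A minimal sketch of this acquisition, assuming an analytic EI under a Gaussian posterior, squared-exponential kernels $k_\ell$ of the Euclidean distance, and that the tempered noise term inflates the predictive standard deviation inside EI (all illustrative assumptions, not prescribed above):

```python
import numpy as np
from scipy.stats import norm

def phi(x, anchors_x, weights, lengthscale=0.2):
    # phi(x) = 1 + sum_l (w_l - 1) * k_l(d_l(x, x0_l)); each k_l is assumed
    # to be a squared-exponential of the Euclidean distance to the anchor.
    dists = np.linalg.norm(np.atleast_2d(x) - np.atleast_2d(anchors_x), axis=-1)
    return 1.0 + np.sum((np.asarray(weights) - 1.0)
                        * np.exp(-0.5 * (dists / lengthscale) ** 2))

def expected_improvement(mu, sigma, best_y):
    # Analytic EI for minimization under a Gaussian posterior N(mu, sigma^2).
    z = (best_y - mu) / sigma
    return (best_y - mu) * norm.cdf(z) + sigma * norm.pdf(z)

def anpei(x, mu, sigma, best_y, anchors_x, weights, rho=0.5, gamma=1.0, t=1):
    # ANPEI(x) = rho * EI(x; phi(x)^(gamma/t)) - (1 - rho) * phi(x)^(gamma/t).
    # As t grows, phi^(gamma/t) -> 1, so the anchors' influence on the noise fades.
    noise = phi(x, anchors_x, weights) ** (gamma / t)
    # Assumption: the tempered noise inflates the predictive std inside EI.
    return (rho * expected_improvement(mu, np.sqrt(sigma ** 2 + noise), best_y)
            - (1.0 - rho) * noise)
```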