# [4-30-2021][1-Pager] Oralytics Simulation Environment and Pooling
## Motivation and Problem
We model the dental health engagement problem as a contextual bandit, with the goal of finding an engagement strategy (optimal policy) that maximizes the participant's overall oral brushing quality score (total expected reward). Before running a clinical trial, we consider how the scheme for updating the approximation of the value function can speed up learning. **We model our value function using a Gaussian Process (GP) and compare different kernel hyperparameter update schemes and their performance in different environments.**
## Problem Setup
We assume the underlying control problem is a *Contextual Bandit Problem*.
At each decision time $t$, taking action $A_t \in \mathcal{A}$ from state $S_t \in \mathcal{S}$ yields reward $R_t \in \mathbf{R}$. We assume the value function is a GP of the form $f(S_t, A_t) = \theta^T\phi(S_t, A_t)$, where $\phi$ is the feature map induced by the choice of kernel. We use a Thompson sampler that, at each decision time, draws from the current posterior (mean and covariance) of: $$\mathrm{E}_{\theta_t}[R_t \mid S_t, A_t] = \theta_t^T\phi(S_t, A_t)$$
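A minimal sketch of this action-selection step, assuming a hypothetical feature map `phi`, a binary action space, and a Gaussian posterior over $\theta$ maintained elsewhere:

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(state, action):
    """Hypothetical feature map: the state concatenated with an action indicator."""
    return np.concatenate([state, [float(action)]])

def thompson_action(state, post_mean, post_cov, actions=(0, 1)):
    """Draw theta from the current Gaussian posterior and act greedily on the
    sampled expected reward theta^T phi(s, a)."""
    theta = rng.multivariate_normal(post_mean, post_cov)
    sampled_rewards = [theta @ phi(state, a) for a in actions]
    return actions[int(np.argmax(sampled_rewards))]

# Example with an 8-dimensional state, so phi(s, a) is 9-dimensional.
d = 9
a_t = thompson_action(np.zeros(8), post_mean=np.zeros(d), post_cov=np.eye(d))
```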
### Updating the Value Function
At time $T$, we use the past history $H_{T-1}=(S_t, A_t, R_t)_{t=1}^{T-1}$ to update the value function $f$, which is modeled as a GP with a squared exponential kernel. For simplicity, we assume the prior mean is 0. We re-train the GP on the newly observed data: if $y$ denotes the training points and $y^*$ the test points, we augment $y$ with the new data and, at prediction time, generate the posterior predictive distribution of $y^*$.
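A minimal sketch of this re-training step, assuming for now a squared exponential kernel with hand-picked hyperparameters and observation noise `sigma_n` (both stand-ins until the tuning discussed below): new observations are appended to the training set and the posterior predictive is recomputed by conditioning on the enlarged set.

```python
import numpy as np

rng = np.random.default_rng(0)

def sq_exp_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    """Squared exponential kernel matrix between the rows of X1 and X2."""
    sq_dists = (np.sum(X1**2, axis=1)[:, None]
                + np.sum(X2**2, axis=1)[None, :]
                - 2.0 * X1 @ X2.T)
    return variance * np.exp(-0.5 * sq_dists / lengthscale**2)

def posterior_predictive(X_train, y_train, X_test, sigma_n=0.1, **kern):
    """Zero-mean GP posterior predictive mean and covariance at X_test."""
    K = sq_exp_kernel(X_train, X_train, **kern) + sigma_n**2 * np.eye(len(X_train))
    K_s = sq_exp_kernel(X_train, X_test, **kern)
    K_ss = sq_exp_kernel(X_test, X_test, **kern)
    mean = K_s.T @ np.linalg.solve(K, y_train)
    cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)
    return mean, cov

# Toy (state, action) features and rewards (brushing lengths).
X_train, y_train = rng.normal(size=(20, 9)), rng.normal(size=20)
X_test = rng.normal(size=(5, 9))

# "Re-training" on new observations = appending them and re-conditioning.
X_new, y_new = rng.normal(size=(3, 9)), rng.normal(size=3)
X_train, y_train = np.vstack([X_train, X_new]), np.concatenate([y_train, y_new])
mean, cov = posterior_predictive(X_train, y_train, X_test)
```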
## Hyperparameter Optimization and Cadence
A popular method for tuning the kernel hyperparameters is maximizing the marginal likelihood $p(y \mid \theta, x)$ with respect to those hyperparameters (Type-II maximum likelihood); here $\theta$ denotes the kernel hyperparameters rather than the weight vector above.
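As one concrete way to do this (a sketch, not the trial implementation), scikit-learn's `GaussianProcessRegressor` maximizes the log marginal likelihood over the kernel hyperparameters when `fit` is called; the kernel composition and toy data below are assumptions:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel, WhiteKernel

rng = np.random.default_rng(0)
X, y = rng.normal(size=(30, 9)), rng.normal(size=30)  # toy (state, action) features, rewards

# Squared exponential kernel with a signal-variance term and learned observation noise.
kernel = ConstantKernel(1.0) * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)

# fit() maximizes the log marginal likelihood over the kernel hyperparameters
# (Type-II ML), restarting the optimizer from several random initializations.
gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=5).fit(X, y)
print(gp.kernel_, gp.log_marginal_likelihood_value_)
```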
We will explore the following hyperparameter update schemes for our GP. Each scheme is represented by a Thompson sampler that approximates the reward function with a GP and updates the kernel hyperparameters at one of the following cadences (see the sketch after this list):
* Update hyperparameters every week
* Update hyperparameters every day
* Fix hyperparameters at the beginning and never update them
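A minimal sketch of how these cadences could sit on top of such a regressor (the two-decision-times-per-day assumption and the cadence names are ours): re-optimize the hyperparameters only when the cadence says so, and otherwise re-condition on the new data with the hyperparameters held fixed via `optimizer=None`.

```python
from sklearn.gaussian_process import GaussianProcessRegressor

def hypers_due(t, cadence, decisions_per_day=2):
    """Is decision time t a hyperparameter-update time under the given cadence?"""
    if cadence == "weekly":
        return t % (7 * decisions_per_day) == 0
    if cadence == "daily":
        return t % decisions_per_day == 0
    return t == 0  # "fixed": only the initial fit sets the hyperparameters

def refit_gp(X_hist, y_hist, current_kernel, update_hypers):
    """Refit the GP on all data so far; re-optimize hyperparameters only if asked."""
    gp = GaussianProcessRegressor(
        kernel=current_kernel,
        optimizer="fmin_l_bfgs_b" if update_hypers else None,  # None keeps hypers fixed
    )
    return gp.fit(X_hist, y_hist)

# At each decision time t:
#   gp = refit_gp(X_hist, y_hist, gp.kernel_, hypers_due(t, cadence="weekly"))
```

Passing the previously fitted kernel (`gp.kernel_`) forward is what lets the fixed-cadence scheme keep using the hyperparameters chosen at the initial fit while still conditioning on all newly observed data.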
## Simulation Environment
We formed a generative model using the UCLA Dental School's ROBAS 2 study data set, which contains multimodal sensory dental data for $n = 40$ individuals across 30 days, and used it to simulate clinical trials in order to compare hyperparameter tuning cadences. The data contains raw brushing time, brushing length, and phone engagement for each brushing session.
We use the data to construct states of the following form:
* Morning brushing indicator of previous day $\in \{0, 1\}$
* Evening brushing indicator of previous day $\in \{0, 1\}$
* Morning brushing length of previous day $\in \mathbf{R}$
* Evening brushing length of previous day $\in \mathbf{R}$
* Weekend indicator $\in \{0, 1\}$ (0 if weekday, 1 if weekend)
* Time of day $\in \{0, 1\}$ (0 if morning, 1 if evening)
* Phone engagement length $\in \mathbf{R}$
* Indicator of missingness of phone engagement $\in \{0, 1\}$
Therefore each state $s \in \mathcal{S} = \mathbf{R}^8$. The reward $R \in \mathbf{R}$ is the brushing length for that session.
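As a rough sketch of how one such state vector might be assembled from per-session records (the field names and record layout here are hypothetical, not the actual ROBAS 2 schema):

```python
import numpy as np

def make_state(prev_day, session, phone):
    """Build the 8-dimensional state for one brushing session.

    prev_day: dict with hypothetical keys 'am_brushed', 'pm_brushed',
              'am_length', 'pm_length' summarizing the previous day.
    session:  dict with 'is_weekend' and 'is_evening' flags for this session.
    phone:    phone engagement length, or None if missing.
    """
    return np.array([
        float(prev_day["am_brushed"]),      # morning brushing indicator, previous day
        float(prev_day["pm_brushed"]),      # evening brushing indicator, previous day
        prev_day["am_length"],              # morning brushing length, previous day
        prev_day["pm_length"],              # evening brushing length, previous day
        float(session["is_weekend"]),       # weekend indicator (0 weekday, 1 weekend)
        float(session["is_evening"]),       # time of day (0 morning, 1 evening)
        0.0 if phone is None else phone,    # phone engagement length
        1.0 if phone is None else 0.0,      # missingness indicator for phone engagement
    ])

s = make_state(
    prev_day={"am_brushed": 1, "pm_brushed": 0, "am_length": 95.0, "pm_length": 0.0},
    session={"is_weekend": 0, "is_evening": 1},
    phone=None,
)
```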