April 25, 2025 Brainstorm Overview

# How Can We Help JITAI Bandits Use Measures of Uncertainty in State Predictions to Make Better Decisions?

<div style="text-align: center;">
<img src="https://hackmd.io/_uploads/ryWOXzvJll.png" alt="Jedi Bandit" title="Jedi Bandit" width="180" height="250"/>

Jedi Bandit Checking a Notification Sent Using a JITAI Bandit (credit: ChatGPT)
</div>

## Goals for the Brainstorm

My primary goal is for us to brainstorm ideas on [use-inspired](https://nap.nationalacademies.org/read/12015/chapter/5) methodology for how contextual bandit algorithms can leverage measures of uncertainty in context predictions to improve action selection. Broadly, our use case involves sequential decision-making and its personalization in the context of JITAIs. A concrete application would be Yoon's work discussed last week, where **fallible prediction models use passive sensor data to estimate context variables** such as negative affect. These context variables inform stochastic decisions about which intervention option a JITAI should provide. Our methodology would help a JITAI's bandit algorithm use the uncertainty information it is provided about a context variable (e.g., the standard deviation of the predictive distribution for how anxious someone is) to better assign action probabilities in online settings -- and learn to do so within a number of decision times that is realistic for JITAI deployment.

My secondary goals include brainstorming simulation plans to evaluate method performance, identifying key literature I should add to my reading list, and exploring ways to better motivate and communicate this work for future presentations, papers, etc.

## Questions to Think About

- If I told you to make the decision of whether to tell someone to practice mindfulness or not and gave you a faulty prediction of the person's stress level (on a scale of 0-10) *and the standard deviation of the predictive distribution for stress*, practically speaking, how would you use that standard deviation to make a better decision? (A toy sketch of one possible answer appears below, after the problem setting intro.)
  - How would your answer vary if I told you the person gets more or less annoyed if they are not really that stressed and you suggest mindfulness?
- What features of the predictive distribution of a continuous context variable (e.g., a 0-10 scale for stress) would you prioritize as most helpful to making decisions?
  - What about for categorical variables (e.g., high vs. low negative affect)?
- If we went the route of providing our bandit algorithm with one uncertainty feature as an additional state element (keeping things simple with just one predicted context variable), how can we warm start (i.e., bias) our bandit algorithm to make up for the increased state dimension (i.e., increased variance)?
- Are there approaches that come to mind other than augmenting the state vector with additional features related to uncertainty?
  - Maybe a hierarchical approach where the higher-level agent learns what percentile to provide the online bandit algorithm based on offline data, given the cost-benefit ratio of intervening? That way, there is no increase in state dimension online...
  - But every user will likely have a different cost-benefit ratio... how much do we personalize? What should be learned offline and what should be learned online?
- What baseline methods would you be interested in comparing new methods against?
- What factors are unique to this problem setting that should be left as variables when designing the simulation testbed and varied when evaluating proposed methods?
- Any literature you know of that absolutely must be prioritized on my reading list?
- Are you convinced that this is both an important and challenging problem? If not, what additional information or preliminary data would convince you?

## Motivation

There is growing interest in using passive sensor data to inform a JITAI's decision making. Surveys and other forms of active measurement burden individuals and make them less engaged with your mobile health system, and decreased engagement can in turn reduce efficacy when your JITAI intervenes. Passive sensing using smartphones, wearables (e.g., a smartwatch), or other ubiquitous sensors bypasses the need to interact with an individual to collect relevant data from them. This data could include their location, phone behavior, or even cardiovascular data from their smartwatch.

The challenge with relying on passive sensing data to make intervention decisions is that some of the data streams sensed using phones and wearables can be corrupted by intermittent noise. Noise sources include motion (e.g., aperiodic movement of someone's wrist corrupts cardiovascular data) and the quality of a sensor's contact with the skin. Moreover, nascent research such as Yoon's discussed last week aims to use sensor-informed predictions to inform a JITAI's decision-making -- but as we saw from Yoon's slides, those predictors are fallible (e.g., an ROC AUC of 0.7). Although it is significant and impressive that the predictors can predict when someone will have high negative affect better than chance, they will still be wrong quite often. Prediction errors will be even more prevalent when these predictors have not had additional data from the target individual to help fine-tune their weights.

Uncertainty quantification (UQ) approaches could be used to provide measures of uncertainty alongside point predictions. In this brainstorm, we will not dive into the limitations of these methods or opportunities to improve them, as this is an area of future research. For the time being, let's assume that some UQ approach provides us with a measure of confidence associated with each of your context predictor's predictions. What do you do with those measures of uncertainty? How can we help a JITAI's contextual bandit algorithm leverage those measures of uncertainty in a practical way -- fast enough for it to be useful in our clinical trial settings?

## Problem Setting

Before diving into any details, note that this is a [POMDP](https://en.wikipedia.org/wiki/Partially_observable_Markov_decision_process) setting, as are many of the settings we deal with in the lab, where the environment's true state is not as perfectly known as we would like it to be. We focus on contextual bandit algorithms because even if we knew all states and were in a true MDP setting, we would still have to manage the bias-variance tradeoff of learning online -- fast enough to make personalized decisions for an individual within weeks of them using our system, while simultaneously addressing the [cold start problem](https://en.wikipedia.org/wiki/Cold_start_(recommender_systems)#:~:text=Cold%20start%20is%20a%20potential,not%20yet%20gathered%20sufficient%20information.) and remaining stable enough to minimize erratic RL algorithm behavior down the road. Life is unfortunately not as forgiving as a video game.
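As a toy answer to the first brainstorm question above: one simple way to use a predictive standard deviation is to compute the probability that stress exceeds a meaningful threshold and weigh it against an assumed annoyance cost. This is only a minimal sketch for discussion -- the Gaussian predictive distribution, the threshold, the benefit/annoyance values, and the squashing into an action probability are all made-up assumptions, not anything estimated from our data.

```python
import numpy as np
from scipy.stats import norm

def prob_intervene(stress_mean, stress_sd, threshold=6.0,
                   benefit_if_stressed=1.0, annoyance_if_not=1.0,
                   min_prob=0.1, max_prob=0.9):
    """Toy rule: turn a fallible stress prediction plus its predictive SD
    into a probability of suggesting mindfulness. All numeric values
    (threshold, benefit, annoyance cost) are hypothetical."""
    # Probability that latent stress exceeds the threshold under a
    # Gaussian predictive distribution N(stress_mean, stress_sd^2).
    p_stressed = 1.0 - norm.cdf(threshold, loc=stress_mean, scale=stress_sd)

    # Expected utility of intervening: benefit if truly stressed,
    # annoyance cost if not. Not intervening is worth 0 in this toy model.
    eu = p_stressed * benefit_if_stressed - (1.0 - p_stressed) * annoyance_if_not

    # Arbitrary squashing onto (0, 1) for illustration, clipped away from
    # 0 and 1 so the agent keeps exploring.
    raw = 1.0 / (1.0 + np.exp(-5.0 * eu))
    return float(np.clip(raw, min_prob, max_prob))

print(prob_intervene(stress_mean=7.0, stress_sd=0.5))  # confident & clearly stressed -> high probability
print(prob_intervene(stress_mean=7.0, stress_sd=3.0))  # same mean, very uncertain -> lower probability
print(prob_intervene(7.0, 3.0, annoyance_if_not=2.0))  # uncertain and annoyance is costly -> mostly hold back
```

Raising `annoyance_if_not` (the second sub-question above) pushes the uncertain case further toward not intervening, even when the point prediction alone would suggest acting.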
Key characteristics of our problem setting:

- Uncertainty in state can vary over time and will not converge to 0 or to a fixed value
  - Sometimes sensor data can be noisy, while at other times it is clean
  - Predictors will come across out-of-distribution settings that increase uncertainty
- Uncertainty information will only be available for a subset of state elements
  - Only a subset of state elements are error-prone; others, like GPS, are fairly reliable
  - UQ approaches are not yet common and may only be worthwhile for critical states
- The action space is binary, ternary, or otherwise of low cardinality
  - Intervention options are often no intervention (0) or intervention (1)
  - Some studies differentiate medium-effort (1) vs. high-effort (2) interventions
  - Simulations in [related work](https://doi.org/10.48550/arXiv.2305.09913) include a few more possible actions (e.g., a generic intervention vs. an intervention tailored to the predicted context)
- Rewards depend on interactions between states and actions
  - In low-risk states, there may be no real benefit to intervening, but there can be significant cost in bothering the person
  - In high-risk states, the benefit of intervening can far outweigh the cost of bothering the person (e.g., Yoon's story of mindfulness from last week)
  - Engineered rewards attempt to account for the long-term cost of burdening the individual alongside the potential short-term benefit of intervening (e.g., [Oralytics](https://doi.org/10.1609/aaai.v37i13.26866))

More precisely, the JITAI's bandit algorithm has the option to intervene on the individual at decision times $t \in \{1, 2, ..., T\}$. At each decision time $t$, the agent must select an action $a_t \in \mathcal{A}$, where $\mathcal{A} = \{0, 1, ..., K\}$. It is provided with a predicted state vector $\hat{s}_t$ that approximates the latent state $s_t \in \mathcal{S} \subseteq \mathbb{R}^d$. For a subset of the elements of $s_t$, $s_t^{(1:b)}$ with $b < d$, we assume $\hat{s}_t^{(1:b)} = s_t^{(1:b)}$, either because we know those estimates are reliable or because we have no information on the uncertainty in those $b$ elements of $\hat{s}_t$. The remaining $d-b$ elements of $\hat{s}_t$ are approximate -- but we are provided with information on those elements' prediction uncertainties. For simplicity, let that uncertainty information be represented by a vector of scalar values, one per approximate state element $s_t^{(i)}$: $\Sigma_t = [\sigma_t^{(b+1)}, \sigma_t^{(b+2)}, ..., \sigma_t^{(d)}] \in \mathbb{R}^{d-b}_{\geq 0}$, where $\sigma_t^{(i)} \in \mathbb{R}_{\geq 0}$. Importantly, there is no guarantee that $\forall t, i \;\; \sigma_{t+1}^{(i)} \leq \sigma_{t}^{(i)}$, and there is no guarantee that $\Sigma_t \rightarrow \Sigma$ as $t \rightarrow T$.

The reward $r_t = R(s_t, a_t)$ depends on both the latent state and the selected action. The reward signal engineered to teach the JITAI's bandit algorithm is designed to capture burden vs. benefit tradeoffs (e.g., the measured benefit after the action minus the estimated cumulative cost of bothering the participant). One approach would be to learn a (biased) domain-specific (Bayesian linear) function $\phi_\theta(\hat{s}_t, \Sigma_t, a_t)$, with parameters $\theta$, that incorporates the predicted context variables and their associated uncertainty features to approximate the reward $R(s_t, a_t)$. If we go this route, how can we help the bandit algorithm learn $\theta$ (i.e., obtain a better $\hat{\theta}$ quickly)?
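To make that question concrete, here is a minimal sketch of what such a $\phi_\theta$ could look like if we simply append $\Sigma_t$ to the context and run Bayesian linear Thompson sampling over binary actions. The feature map, prior, and noise variance below are placeholder assumptions for discussion, not a proposed design.

```python
import numpy as np

class UncertaintyAugmentedTS:
    """Toy Bayesian linear Thompson sampler over features built from
    (predicted state, uncertainty vector, action). The feature map,
    prior, and noise variance are all placeholder assumptions."""

    def __init__(self, feat_dim, prior_mean=None, prior_cov=None, noise_var=1.0):
        self.noise_var = noise_var
        self.mean = np.zeros(feat_dim) if prior_mean is None else prior_mean
        self.cov = np.eye(feat_dim) if prior_cov is None else prior_cov

    @staticmethod
    def features(s_hat, sigma, action):
        """phi(s_hat, Sigma, a): baseline terms plus action interactions,
        with the uncertainty vector treated as extra context."""
        ctx = np.concatenate(([1.0], s_hat, sigma))
        return np.concatenate((ctx, action * ctx))  # action in {0, 1}

    def select_action(self, s_hat, sigma):
        """Thompson sampling: draw theta from the posterior and act greedily."""
        theta = np.random.multivariate_normal(self.mean, self.cov)
        rewards = [self.features(s_hat, sigma, a) @ theta for a in (0, 1)]
        return int(np.argmax(rewards))

    def update(self, s_hat, sigma, action, reward):
        """Conjugate Bayesian linear regression update with known noise variance."""
        x = self.features(s_hat, sigma, action)
        prior_prec = np.linalg.inv(self.cov)
        post_cov = np.linalg.inv(prior_prec + np.outer(x, x) / self.noise_var)
        post_mean = post_cov @ (prior_prec @ self.mean + x * reward / self.noise_var)
        self.mean, self.cov = post_mean, post_cov
```

With one predicted context variable and one uncertainty scalar, the feature vector has dimension 6; the extra uncertainty coefficients are exactly where warm starting or informative priors would need to help, which leads to the next question.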
Are there sensible priors we could place on the components of $\theta$ associated with the elements of $\Sigma$?
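One hypothetical option, continuing the sketch above: center the coefficients of the uncertainty features at zero with a small prior variance, so that a priori $\Sigma_t$ has no effect and the bandit starts out behaving like a context-only algorithm, while warm-starting the remaining coefficients from pooled offline estimates if available. A sketch under those assumptions (the variance values are made up):

```python
import numpy as np

def build_prior(feat_dim, uncertainty_idx, offline_mean=None,
                base_var=1.0, uncertainty_var=0.1):
    """Construct an informative prior for the toy Thompson sampler above.

    - Coefficients for uncertainty features are centered at 0 with a small
      prior variance (shrinkage): a priori, Sigma_t has no effect.
    - If pooled offline estimates exist for the other coefficients, use
      them as the prior mean (a simple warm start).
    All variance values here are placeholder assumptions.
    """
    mean = np.zeros(feat_dim) if offline_mean is None else offline_mean.copy()
    mean[uncertainty_idx] = 0.0                    # no effect of sigma a priori
    var = np.full(feat_dim, base_var)
    var[uncertainty_idx] = uncertainty_var         # shrink the sigma coefficients
    return mean, np.diag(var)

# Example: one predicted context variable plus one uncertainty scalar,
# features = [1, s_hat, sigma, a, a*s_hat, a*sigma] -> indices 2 and 5 are sigma terms.
prior_mean, prior_cov = build_prior(feat_dim=6, uncertainty_idx=[2, 5])
```

How much of this prior should come from pooled offline data versus be personalized online is exactly the open question raised in the brainstorm questions above.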