JITAI-Inspired Toy Simulations of Inaccurate State Predictions for RL

August 27, 2025 Brainstorm Overview

# JITAI-Inspired Toy Simulations of Inaccurate State Predictions for RL

<div style="text-align: center;">

<img src="https://hackmd.io/_uploads/S1rFA9cKee.png" alt="StepCountJITAI" title="StepCountJITAI"/>

[StepCountJITAI](https://doi.org/10.48550/arXiv.2411.00336) is a JITAI-inspired toy simulator that we are using and building upon to gain intuition and develop a better method for handling state prediction uncertainty.

</div>

## Goals for the Brainstorm

I will give an overview of StepCountJITAI and of the work done to date using it to investigate "[context inference error and partial observability](https://doi.org/10.48550/arXiv.2305.09913)." I will also present some initial simulations I conducted to reproduce and extend the existing results, pointing out observations I made along the way. In doing so, my hope is that you all can think through the following with me:

1. What modifications of StepCountJITAI do you think we should investigate, given my problem setup and your knowledge of RL methods for JITAIs?
2. What baseline RL algorithms would you expect to see alongside Thompson sampling, regularized least squares value iteration (RLSVI), dueling DQN (from the StepCountJITAI paper), and REINFORCE (from the paper)?
3. What baseline approaches for incorporating uncertainty would you expect to see as comparison methods if a novel approach were being proposed?
4. Do you notice anything interesting in the simulation results we have reproduced and extended from the StepCountJITAI papers?

## Motivation

There is growing interest in using passive sensor data to inform a JITAI's decision making. The challenge with relying on passive sensing data to inform the state predictions on which decisions are based is that some of the data streams sensed using phones and wearables can be corrupted by intermittent noise.
Noise sources include motion (e.g., aperiodic movement of someone's wrist corrupts cardiovascular data) and the quality of a sensor's contact with the skin. In addition, sensor-informed predictions are often fallible (e.g., an ROC AUC of 0.7 for a predictor of high negative affect), even if they still perform impressively better than random chance.

Uncertainty quantification (UQ) approaches could be used to provide measures of uncertainty alongside point predictions. In this brainstorm, we will not dive into the limitations of UQ methods or opportunities to improve them, as this is an area of future research. For the time being, let's assume that some UQ approach provides us with a measure of confidence for each of the state predictor's predictions. What do you do with those measures of uncertainty when developing an RL algorithm for a JITAI?

## General Problem Setting

Focusing on a single individual for the sake of clarity (i.e., ignoring pooling and staggered recruitment for now), the JITAI's RL algorithm is designed to instantiate an RL agent that intervenes on the individual at decision times $t \in \{1, 2, \ldots, T\}$. At timestep $t$, the agent must select an action $a_t \in \mathcal{A}$, where $\mathcal{A} = \{0, 1, \ldots, K\}$. At each timestep, it is provided with a predicted state vector $\hat{s}_t$ that approximates the latent state $s_t \in \mathcal{S} \subseteq \mathbb{R}^d$. The reward is a function of the action and the true state, $r_t = R(s_t, a_t)$, but the RL agent only has access to $\hat{s}_t$. For a subset of the elements of $s_t$, $s_t^{(1:b)}$ with $b < d$, we assume $\hat{s}_t^{(1:b)} = s_t^{(1:b)}$, either because we know the estimates are reliable or because we have no information on uncertainty in those $b$ elements of $\hat{s}_t$. The remaining $d-b$ elements of $\hat{s}_t$ are approximate, but we are provided with information on those elements' prediction uncertainties.
For the sake of simplicity, let that information on uncertainty be represented by a vector of scalar values, where each scalar value corresponds to an approximate context element, $s_t^{(i)}$, $i \in \{b+1, \ldots, d\}$: $\Sigma_t = [\sigma_t^{(b+1)}, \sigma_t^{(b+2)}, \ldots, \sigma_t^{(d)}] \in \mathbb{R}^{d-b}_{\geq 0}$, where $\sigma_t^{(i)} \in \mathbb{R}_{\geq 0}$. Importantly, there is no guarantee that $\forall t, i \: \: \sigma_{t+1}^{(i)} \leq \sigma_{t}^{(i)}$, and there is no guarantee that $\Sigma_t \rightarrow \Sigma$ as $t \rightarrow T$. In other words, uncertainty need not shrink over time, and it need not settle to a fixed value.

Assume for now that the predictor used to output $\hat{s}_t$ is not learning online. That is, we are focused only on our RL algorithm, not on also improving online learning for the supervised predictor. This is a use-inspired assumption. Recall Yoonho Chung's brainstorm last spring, in which he discussed a detector of high negative affect that his team had developed: the person-specific predictor is initialized using previously collected pooled data, personalized using initial baseline data (when labels are collected), and then used for predictions without updating its weights for the remainder of the study.

From a POMDP perspective, this is as if we have a separate model that takes the history of observations (and actions) at each timestep and outputs a belief state with mean $\hat{s}_t$ and uncertainty information from the belief distribution, $\Sigma_t$.

## StepCountJITAI's Toy Simulation Setup

StepCountJITAI is a JITAI-inspired toy simulator that somewhat aligns with our problem setting. It is modeled after a JITAI for physical activity (e.g., HeartSteps) and includes 4 actions, 3 state variables, and a reward signal, step count, $s_t$ (sorry for the overlapping use of $s_t$ above...). For one of the state variables, termed the "context" variable, $c_t \in \{0, 1\}$, a noisy observation and classification process with UQ is included.
The following tables and equations summarize the base version of StepCountJITAI:

![actionTable](https://hackmd.io/_uploads/rJerzSitel.png)

![stateTable](https://hackmd.io/_uploads/BkRBMBsFlg.png)

![dynamicEquations](https://hackmd.io/_uploads/ryAUzBsKgx.png)

Each instantiation of StepCountJITAI models a single individual. The individual's "trial" ends if disengagement, $d_t$, exceeds a threshold (e.g., 0.99) or if $t$ reaches $T$ (e.g., 50 timesteps). The latest version of StepCountJITAI adds within-participant stochasticity to the mix (i.e., adding noise to the deterministic dynamics shown) and also attempts to model heterogeneity in parameters such as $\delta_d$.

We can see already that one addition to StepCountJITAI we may want to investigate is the use of a time-varying $\sigma_t$ (sorry for the overlapping use of $\sigma$ in the problem setting...), rather than a fixed $\sigma$, to model changes in signal quality over time when extracting feature $x_t$. We will discuss more in the brainstorm.
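
To make the time-varying $\sigma_t$ idea concrete, here is a minimal sketch of what such a noisy context stream could look like. Everything in it is an assumption for illustration rather than StepCountJITAI's actual code or API: the Gaussian observation model $x_t \sim \mathcal{N}(c_t, \sigma_t^2)$, the equal prior on $c_t$, the "occasional noise burst" schedule for $\sigma_t$, and all function names and parameter values are hypothetical.

```python
import math
import random


def sample_sigma(rng, base=0.4, burst=2.0, burst_prob=0.2):
    """Hypothetical signal-quality model: mostly `base`, occasionally a noisy `burst`."""
    return burst if rng.random() < burst_prob else base


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def context_posterior(x, sigma):
    """P(c = 1 | x), assuming x ~ Normal(c, sigma^2) and an equal prior on c in {0, 1}."""
    # Log-likelihood ratio: log N(x; 1, sigma^2) - log N(x; 0, sigma^2)
    llr = (2.0 * x - 1.0) / (2.0 * sigma ** 2)
    return sigmoid(llr)


def simulate_context_stream(T=50, seed=0):
    """Generate T timesteps of (true context, noise level, feature, posterior, label)."""
    rng = random.Random(seed)
    rows = []
    for t in range(1, T + 1):
        c = rng.randint(0, 1)              # true binary context c_t (assumed i.i.d. here)
        sigma_t = sample_sigma(rng)        # time-varying noise level (assumption)
        x = rng.gauss(c, sigma_t)          # noisy feature x_t
        p = context_posterior(x, sigma_t)  # confidence that c_t = 1
        label = int(p >= 0.5)              # point prediction of the context
        rows.append({"t": t, "c": c, "sigma": sigma_t, "x": x, "p": p, "label": label})
    return rows


if __name__ == "__main__":
    rows = simulate_context_stream()
    for level in (0.4, 2.0):
        subset = [r for r in rows if r["sigma"] == level]
        if subset:
            acc = sum(r["label"] == r["c"] for r in subset) / len(subset)
            print(f"sigma_t={level}: n={len(subset)}, context accuracy={acc:.2f}")
```

Under these assumptions, the same feature value yields a posterior closer to 0.5 when $\sigma_t$ is large, so the confidence handed to the RL agent tracks signal quality at each timestep rather than staying fixed. Whether the agent should also see $\sigma_t$ itself, or only the posterior, is one of the design choices we could discuss.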