# [10/15/2021 Brainstorm] Designing A Simulation Environment For RL Algorithms In Clinical Trials

## Goal

We are aiming for a paper submission on how to create a quality simulation environment with variants. We want to cheaply, efficiently, and effectively evaluate different RL models against a wide range of axioms, in order to make design decisions for the RL algorithm that will be deployed in a real clinical trial.

## Problems

* We only have one shot at a clinical trial, so we have to make careful, intelligent decisions that are immutable once the trial starts.
* We are often in a poor data regime: we have observational data without interventions, and the data is sparse.
* The users in the eventual clinical trial may differ from those in the initial data collection (the data we have available to form a simulation environment), i.e., distribution shift.

## Motivation

* Creating a testbed for clinical trials is different from creating one for other popular applications of RL, such as games or advertisements, because:
  * Data is often sparse, batch data with no interventions.
  * There is no mechanistic model, and data is sparse.
  * Any other properties?
* A lot of work highlights the general framework for simulation for RL (Ie et al., 2019; https://arxiv.org/pdf/1909.04847.pdf), but **they don't offer a framework for evaluating how good that simulation is.** We want to be able to create AND evaluate a good simulation environment/testbed so that we can properly test RL algorithms.

## We Want Feedback on

* Intro/Motivation:
  * Does this story make sense? Does it offer enough novelty?
  * Does the flow of my motivation work, or are there any other motivating suggestions?
* Principles of Simulation Env.:
  * Are these axioms good enough? Can you help us think of more?
* Experiments/Case Studies:
  * How would I do experiments and results? Other papers I've seen only include case studies. We eventually want to also add a section on feature selection and RL algorithm design, so maybe the experiments section could test different RL algorithm candidates on variants of the simulation environment and explain why/how we eventually chose a candidate (a rough sketch of this follows below).
  * Case studies in Oralytics and HeartSteps.
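A minimal sketch of what that candidate-selection experiment might look like, assuming we can roll each candidate algorithm out in each environment variant; the variant names, candidate names, and the `run_trial` stand-in below are hypothetical placeholders, not our actual setup:

```python
import numpy as np

# Hypothetical variant and candidate names, for illustration only.
VARIANTS = ["base", "low_treatment_effect", "shifted_population"]
CANDIDATES = ["posterior_sampling", "boltzmann", "epsilon_greedy"]

def run_trial(algorithm: str, variant: str, seed: int) -> float:
    """Stand-in for one simulated trial; a real version would roll the
    candidate algorithm out in the environment variant and return the
    average reward across simulated users."""
    rng = np.random.default_rng(abs(hash((algorithm, variant, seed))) % 2**32)
    return float(rng.normal())

# Score every candidate on every variant, averaging over repeated runs.
results = {
    (alg, var): np.mean([run_trial(alg, var, seed) for seed in range(25)])
    for alg in CANDIDATES
    for var in VARIANTS
}
for (alg, var), avg_reward in sorted(results.items()):
    print(f"{alg:>18} | {var:<22} | avg reward {avg_reward:+.3f}")
```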
## Axioms for a Good Simulation Environment

1. **Diversity:** We want multiple variants of the simulation environment, where each variant represents and tests a specific concern or problem that could occur during the real study. Variants are usually constructed with the help of a domain expert.
2. **Validity:** Each variant of the simulation environment should closely resemble the original dataset for user states with no interventions. For simulated users and states congruent with the training data, the environment should produce rewards close to the training data.
    a. Measure up to the 4th moment to compare the distributions in the dataset with our simulated distributions (e.g., we look at the distribution of non-zero rewards); see the first sketch after this list.
    b. Simulated rewards should be not worse than the worst and not better than the best observed in the data.
3. **Robustness, Generalizability:**
    a. We also want to consider simulated users and states OUTSIDE of the training data.
    b. Incorporate domain knowledge to generate subgroups of users NOT found in the training set but still congruent with what domain experts have seen (e.g., the data might be overwhelmed by a certain group trait, such as having many females and few males in the study). This goes beyond merely capturing the training dataset's distribution; see the second sketch after this list.
    c. Imputing data with interventions: we want an accurate effect size for the reward under action 1 (intervention) even when interventions are not present in the dataset; see the third sketch after this list.
4. **Fairness:** The variants should be able to test whether the RL algorithm does well for one group but not another, for example by gender; see the fourth sketch after this list.
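A minimal sketch of the validity checks in 2a and 2b, assuming we have arrays of observed and simulated non-zero rewards; the relative tolerance `tol` is an illustrative placeholder, not a threshold we have settled on:

```python
import numpy as np
from scipy import stats

def validity_check(observed: np.ndarray, simulated: np.ndarray,
                   tol: float = 0.1) -> dict:
    """Compare the first four moments of observed vs. simulated non-zero
    rewards (2a), and check that simulated rewards are not worse than the
    worst nor better than the best observed reward (2b)."""
    report = {}
    for name, fn in [("mean", np.mean), ("variance", np.var),
                     ("skew", stats.skew), ("kurtosis", stats.kurtosis)]:
        obs, sim = float(fn(observed)), float(fn(simulated))
        report[name] = {"observed": obs, "simulated": sim,
                        "close": abs(obs - sim) <= tol * max(abs(obs), 1e-8)}
    report["within_observed_range"] = bool(
        simulated.min() >= observed.min() and simulated.max() <= observed.max())
    return report
```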
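For 3b, one simple way to build a variant containing subgroups underrepresented in the training data is to resample users so that each group appears equally often; the feature matrix and group labels here are assumed inputs:

```python
import numpy as np

def rebalanced_population(user_features: np.ndarray,
                          group_labels: np.ndarray,
                          n_users: int,
                          rng: np.random.Generator = None) -> np.ndarray:
    """Sample a simulated population in which every group appears equally
    often, even if the training data is dominated by one group (e.g.,
    mostly female participants)."""
    rng = rng or np.random.default_rng()
    groups = np.unique(group_labels)
    per_group = n_users // len(groups)
    rows = [user_features[rng.choice(np.flatnonzero(group_labels == g),
                                     size=per_group)]
            for g in groups]
    return np.concatenate(rows)
```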
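For 3c, a sketch of imputing the missing intervention arm: fit a reward model for action 0 on the observational data, then add a domain-expert-elicited effect size under action 1. The sklearn-style `base_model` and the 0.3 effect size are assumptions for illustration:

```python
import numpy as np

def simulated_reward(state: np.ndarray, action: int, base_model,
                     effect_size: float = 0.3, noise_sd: float = 1.0,
                     rng: np.random.Generator = np.random.default_rng()) -> float:
    """Reward under action 0 comes from a model fit to the no-intervention
    observational data; under action 1 we add the imputed treatment effect."""
    base = float(base_model.predict(state.reshape(1, -1))[0])
    treatment = effect_size if action == 1 else 0.0
    return base + treatment + float(rng.normal(scale=noise_sd))
```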
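Finally, a sketch of the fairness check in axiom 4, assuming per-user average rewards from a simulated run of the algorithm plus one group label (e.g., gender) per user:

```python
import numpy as np

def per_group_gap(avg_rewards: np.ndarray, group_labels: np.ndarray) -> dict:
    """Compare the algorithm's average reward across groups; a large gap
    flags that it may serve one group well but not another."""
    per_group = {str(g): float(avg_rewards[group_labels == g].mean())
                 for g in np.unique(group_labels)}
    gap = max(per_group.values()) - min(per_group.values())
    return {"per_group": per_group, "gap": gap}
```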