# [ARCHIVED - PRIOR WAS TOO STRONG] PCS For RL Full Paper Initial Experiments $$p_{\text{data}} \neq p_{\theta}$$

<!--

## Cluster By Random Users

### Experiment 1: Average Reward and Lower 25th Percentile Reward

We first randomly draw 75 users with replacement from the 32 ROBAS 2 users. We compare clustering ($k = 5$) with learning one model per user ($k = 1$). We simulate $T = 140$ time-steps for each user.

#### Results: $k = 1$ does better than $k = 5$, even for the 25th percentile

Clustering ($k = 5$) did not do as well as learning one model per user ($k = 1$) for both the average reward and the average 25th percentile reward. (Before, I was calculating the average 25th percentile wrong: the cluster average has a higher 25th percentile reward, but not on an individual level.) This could be because:
1) With enough data, learning one model per user should do better.
2) Our users are really heterogeneous.

| | Average Rewards | Average Rewards | 25th Percentile Rewards | 25th Percentile Rewards |
| :--- | :----: | :----: | :----: | :----: |
| **RL Algorithms** | bayesian lin reg k = 1 | bayesian lin reg k = 5 | bayesian lin reg k = 1 | bayesian lin reg k = 5 |
| **Environment Variant** |||||
| stationary user effect | 92.233 (1.339) | 88.228 (1.063) | 60.542 (3.892) | 52.203 (3.050) |
| stationary user context effect | 87.851 (1.486) | 85.278 (1.346) | 65.526 (1.687) | 50.481 (1.377) |
| stationary population effect | 85.898 (1.970) | 85.093 (1.920) | 60.181 (0.865) | 53.412 (0.639) |
| non-stationary user effect | 103.717 (1.735) | 98.917 (1.420) | 71.182 (2.621) | 64.590 (1.057) |
| non-stationary user context effect | 94.217 (2.823) | 90.247 (2.758) | 61.186 (2.322) | 57.596 (1.682) |
| non-stationary population effect | 96.621 (2.772) | 93.479 (2.517) | 70.089 (1.534) | 67.991 (0.783) |

### Experiment 2: Regret Plots

Note: When I say average, explain what I am averaging over.

Point 1) suggests that maybe clustering does better when there are not a lot of data points per user.
But with enough data, learning one model per user ($k = 1$) should do better. So, I plotted average regret across time.

#### Results: $k = 1$ does better than $k = 5$

$k = 1$ still does better than $k = 5$ in both the between-trial violin plots and the between-user violin plots. This could be because, by time $t = 20$, there are already enough data points for $k = 1$ to do better. Our experiments update each algorithm on a weekly basis and do the first update after a week of data (14 data points per user). Even for the first 7 time steps ($t = 14$ to $t = 20$), $k = 1$ still achieves lower average regret.

#### Violin Plot (Between Trial Variance)

(Average Case) Average regret of users in a trial, across all trials. (5 (num_trials) x 75 (num_users) x 7 (timesteps))

![](https://i.imgur.com/v1rL23D.png)

#### Violin Plot (Between User Variance)

(Considering all users, is this regret for everyone? Who am I leaving behind? Consider user variance.) Maybe fix error bars: SE, not range.

Average regret of users in all trials. (375 (all users across all trials) x 7 (timesteps))

![](https://i.imgur.com/FabUbRa.png)

Note: revisit the way we sample different users per trial. Ask Susan again. Maybe use the same 75 users per trial.

### First Time Steps Plot (Between Trial Variance)

![](https://i.imgur.com/qQPq9ZN.png)

### First Time Steps Plot (Between User Variance)

![](https://i.imgur.com/yFwyxpr.png)

-->

## Cluster By Homogeneous Users

### Experiment 3: Average Reward and Lower 25th Percentile

Ideas for fixes:
1. Increase the cluster size (idea: more data with the same noise level will do better).
2. Consider one user's trajectory and just add more data: one user with $5T$ data points compared to one cluster with the same 5 users. (Sanity check, can be done with the metrics colab.)

In the previous experiment, we noticed that clustering ($k = 5$) did not do as well as learning one model per user ($k = 1$) for both the average reward and the average 25th percentile reward.
One reason could be that our users are just too heterogeneous for clustering to be effective. We therefore compare cluster size $k = 1$ with cluster size $k = 5$ using "homogeneous users" per cluster. For clustering of size $k = 5$, we first randomly draw 15 users from the 32 ROBAS 2 users. Then, for each of the 15 clusters, we repeat that user 5 times to make a cluster (75 users in total, but each of the 15 users is repeated 5 times in its cluster).

#### Results: $k = 1$ and $k = 5$ do about the same

It turns out that $k = 1$ and $k = 5$ have similar results. This makes sense because each cluster contains copies of the same user, so we should expect clustering with $k = 1$ and $k = 5$ to perform the same.

| | Average Rewards | Average Rewards | 25th Percentile Rewards | 25th Percentile Rewards |
| :--- | :----: | :----: | :----: | :----: |
| **RL Algorithms** | bayesian lin reg k = 1 | bayesian lin reg k = 5 | bayesian lin reg k = 1 | bayesian lin reg k = 5 |
| **Environment Variant** |||||
| stationary user effect | 97.484 (3.805) | 95.296 (3.376) | 69.634 (9.027) | 66.110 (8.569) |
| stationary user context effect | 90.616 (3.790) | 88.983 (3.116) | 69.366 (8.929) | 70.312 (8.153) |
| stationary population effect | 86.832 (4.897) | 84.512 (4.295) | 62.442 (9.358) | 61.994 (8.781) |
| non-stationary user effect | 107.957 (3.241) | 107.356 (3.247) | 70.051 (4.379) | 71.399 (3.646) |
| non-stationary user context effect | 92.320 (7.983) | 92.265 (8.061) | 63.661 (6.813) | 62.153 (7.205) |
| non-stationary population effect | 96.214 (7.112) | 95.397 (7.152) | 70.526 (5.295) | 69.175 (4.638) |

### Experiment 4: Reward Plots

Plotted average rewards across trajectories, across users, and then across trials.
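The tables report, per environment variant, an average reward and a lower 25th percentile reward. The computation described above can be sketched as follows, assuming rewards are stored as a `(num_trials, num_users, T)` NumPy array; the function and array names, the shape, and the reading of the parenthesized values as standard errors across trials are illustrative assumptions, not the actual simulation code.

```python
import numpy as np

def summarize_rewards(rewards):
    """Summarize rewards of assumed shape (num_trials, num_users, T)."""
    # Average across the trajectory: one value per (trial, user) pair.
    per_user = rewards.mean(axis=2)                  # (num_trials, num_users)
    # Average across users, then across trials.
    per_trial = per_user.mean(axis=1)                # (num_trials,)
    avg_reward = per_trial.mean()
    # Lower 25th percentile across users within each trial, then
    # averaged across trials (an individual-level metric).
    p25_reward = np.percentile(per_user, 25, axis=1).mean()
    # Standard error across trials (assumed meaning of the parentheses).
    se = per_trial.std(ddof=1) / np.sqrt(rewards.shape[0])
    return avg_reward, p25_reward, se
```

Taking the percentile over per-user trajectory means, rather than over pooled rewards, keeps the 25th percentile an individual-level quantity.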
#### Results: Average Reward (Average Across Trajectory, Average Across Users)

$k = 1$ still beats $k = 5$.

Stationary Env | Non-Stationary Env
:-------------------------:|:-------------------------:
![](https://i.imgur.com/asHXXGl.png) | ![](https://i.imgur.com/yFBHdxM.png)
![](https://i.imgur.com/KMM65tm.png) | ![](https://i.imgur.com/NntPEIG.png)
![](https://i.imgur.com/Wc9JxvV.png) | ![](https://i.imgur.com/C12fg9R.png)

#### Results: Sum of Rewards (Sum of Trajectory, Average Across Users)

$k = 1$ still beats $k = 5$.

Stationary Env | Non-Stationary Env
:-------------------------:|:-------------------------:
![](https://i.imgur.com/23ylYdk.png) | ![](https://i.imgur.com/O8Fshrw.png)
![](https://i.imgur.com/KH8VjGw.png) | ![](https://i.imgur.com/a8HZjjM.png)
![](https://i.imgur.com/497GfDy.png) | ![](https://i.imgur.com/MHiFtml.png)

### Experiment 5: Regret Plots

Plotted average regret across trajectories, across users, and then across trials.

#### Results: Average Across Trajectories, Average Across Users, Average Across Trials

$k = 5$ mean beats $k = 1$ mean.

Stationary Env | Non-Stationary Env
:-------------------------:|:-------------------------:
![user_eff_stat](https://i.imgur.com/AYxUYJs.png) | ![user_eff_non_stat](https://i.imgur.com/UgieSEt.png)
![pop_eff_stat](https://i.imgur.com/drmf01U.png) | ![pop_eff_non_stat](https://i.imgur.com/T4xLXG9.png)
![cont_user_eff_stat](https://i.imgur.com/LI4QEY7.png) | ![cont_user_eff_non_stat](https://i.imgur.com/1v2XbgS.png)

#### Results: Average Across Users, Average Across Trials

Stationary Env | Non-Stationary Env
:-------------------------:|:-------------------------:
![](https://i.imgur.com/XjUGxiX.png) | ![](https://i.imgur.com/4Aan7WT.png)
![](https://i.imgur.com/Cmu406N.png) | ![](https://i.imgur.com/KT3IGCj.png)
![](https://i.imgur.com/xCZRPBd.png) | ![](https://i.imgur.com/Sq1SZ2x.png)

<!-- ### How can we deal with between trial and between user variances?
- If we had only cluster size 1 algorithms:
  - Sample a user from the population
  - Run algorithm 1 on that user
  - Run algorithm 2 on that user
  - Report regret over time
  - Look at between-user variance
  - (There is no between-trial variance)
- 75 users, 5 trials -> 375 users
- If we have clusters
-->
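The regret curves in Experiment 5 can carry either between-trial or between-user error bars. Both computations can be sketched as follows, assuming regret is stored as a `(num_trials, num_users, T)` NumPy array; the names and shape are illustrative assumptions about the simulation output, not its actual interface.

```python
import numpy as np

def regret_error_bars(regret):
    """Mean regret curve with two kinds of standard-error bands.

    `regret` has assumed shape (num_trials, num_users, T).
    """
    num_trials, num_users, T = regret.shape
    mean_curve = regret.mean(axis=(0, 1))                        # (T,)
    # Between-trial SE: spread of the per-trial, user-averaged curves.
    per_trial = regret.mean(axis=1)                              # (num_trials, T)
    se_trial = per_trial.std(axis=0, ddof=1) / np.sqrt(num_trials)
    # Between-user SE: pool every user curve across all trials.
    per_user = regret.reshape(num_trials * num_users, T)
    se_user = per_user.std(axis=0, ddof=1) / np.sqrt(num_trials * num_users)
    return mean_curve, se_trial, se_user
```

Plotting `mean_curve` with `se_trial` answers "how stable is the result across repeated trials", while `se_user` answers "is this regret typical for every user, or is someone being left behind".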