## Analyzing the "Simple" Scoring System
For $t\in[T]$, generate $a_t \in \{0,1\}^d$ with exactly $pd$ entries equal to one, and observe
$$
y_t \sim \mathrm{Ber}(\mu+a_t^\top \theta) - \mathrm{Ber}(\mu).
$$
Let $A \in \{0,1\}^{T\times d}$ be the matrix whose rows are $a_1,\ldots,a_T$. Denote the columns of $A$ by $A_1,\ldots,A_d$.
Each treatment $i$ receives a final "score":
$$s_i = \frac{A_i ^\top y^+}{pd} \sim \frac{\mathrm{Bin}\big(pT,\,(1-\mu)(\mu + (1-p)\theta_i + p\sum_{j\in [d]}\theta_j)\big)}{pd}$$
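As a sanity check, the scoring pipeline can be simulated directly. A minimal sketch (the values of $d$, $T$, $p$, $\mu$, and $\theta$ below are illustrative, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)

d, T, p, mu = 10, 2000, 0.3, 0.1
theta = np.linspace(0.05, -0.05, d)        # illustrative effects, summing to ~0

# Each row of A has exactly p*d ones.
k = int(p * d)
A = np.zeros((T, d), dtype=int)
for t in range(T):
    A[t, rng.choice(d, size=k, replace=False)] = 1

# y_t ~ Ber(mu + a_t' theta) - Ber(mu); keep the positive part y+.
prob = mu + A @ theta
y = rng.binomial(1, np.clip(prob, 0, 1)) - rng.binomial(1, mu, size=T)
y_plus = np.maximum(y, 0)

# Scores s_i = A_i' y+ / (p d).
s = A.T @ y_plus / (p * d)
print(s.round(3))
```

With these parameters the empirical scores should roughly track the ordering of $\theta$.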
Useful for later calculations:
$\mathbb{E}[y_t^+] = (\mu+a_t^\top\theta)(1-\mu)$
$\mathbb{E}[y^+] = (\mu e +A\theta)(1-\mu)$
$\mathrm{Var}[y_t^+] = (\mu+a_t^\top\theta)(1-\mu)[1-(\mu+a_t^\top\theta)(1-\mu)]$
Assume w.l.o.g. $\theta_1 \ge \theta_2 \ge \cdots \ge \theta_d$.
To what extent do the scores preserve this ranking?
$$\mathbb{E}[s_i] = \frac{A_i ^\top (\mu e + A\theta)(1-\mu)}{pd} = \frac{T \mu (1-\mu)}{d} + \frac{(1-p)T\theta_i(1-\mu)}{d} + \sum_{j \in [d] } \frac{pT\theta_j(1-\mu)}{d}$$
$$\mathrm{Var}[s_i] = \frac{T(1-\mu)(\mu + (1-p)\theta_i + p\sum_{j\in [d]}\theta_j)[1-(1-\mu)(\mu + (1-p)\theta_i + p\sum_{j\in [d]}\theta_j)]}{pd^2}$$
A simplifying assumption: $\sum_{j \in [d]}\theta_j = 0$
$$\mathbb{E}[s_i] = \frac{A_i ^\top (\mu e + A\theta)(1-\mu)}{pd} = \frac{T \mu (1-\mu)}{d} + \frac{(1-p)T\theta_i(1-\mu)}{d} $$
$$\mathrm{Var}[s_i] = \frac{T(1-\mu)(\mu + (1-p)\theta_i )[1-(1-\mu)(\mu + (1-p)\theta_i )]}{pd^2}$$
We care about $\mathbb{E}[s_1-s_2]/\mathrm{SD}(s_1)$ being large.
$$\mathbb{E}[s_1-s_2] = \frac{(1-p)(1-\mu)T(\theta_1-\theta_2)}{d}$$
$$\frac{\mathbb{E}[s_1-s_2]}{\sqrt{\mathrm{Var}[s_1]}} = (1-p)\sqrt{T(1-\mu)}\,(\theta_1-\theta_2)\sqrt{\frac{p}{(\mu + (1-p)\theta_1 )[1-(1-\mu)(\mu + (1-p)\theta_1 )]}}$$
Simplifying assumption: $\mu \approx 0$
$$\frac{\mathbb{E}[s_1-s_2]}{\sqrt{\mathrm{Var}[s_1]}} = (1-p)\sqrt{T}\,(\theta_1-\theta_2)\sqrt{\frac{p}{ (1-p)\theta_1[1-(1-p)\theta_1] }}$$
There exists an optimal sampling fraction $p^*(\theta_1)$ maximizing this ratio, and we would probably use $\theta_1 = \gamma$.
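The optimum $p^*$ can be located numerically by evaluating the ratio above on a grid. A sketch under the $\mu \approx 0$ simplification (the values of $T$, $\theta_1 = \gamma$, and $\theta_1-\theta_2$ are illustrative):

```python
import numpy as np

T, gamma, delta = 1000, 0.05, 0.02   # illustrative: theta_1 = gamma, theta_1 - theta_2 = delta

def snr(p, theta1=gamma):
    """Signal-to-noise ratio E[s_1 - s_2] / SD(s_1) under the mu ~ 0 approximation."""
    q = (1 - p) * theta1
    return (1 - p) * np.sqrt(T) * delta * np.sqrt(p / (q * (1 - q)))

ps = np.linspace(0.01, 0.99, 981)
p_star = ps[np.argmax(snr(ps))]
print(p_star)
```

For small $\theta_1$ the ratio is approximately proportional to $\sqrt{p(1-p)}$, so $p^*$ lands near $1/2$.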
### 080823
### Proposed Frequentist Model ###
Given treatment effect distribution $P$
Frequentist assumption: treatment effect vector $\theta \in \mathbb{R}^d$ has sufficiently high likelihood:
$$ S = \Bigg\{\theta:\sum_{i=1}^d \log P(\theta_i) \ge \text{some chosen threshold} \Bigg\}$$
**Inferential Procedure at the "end" of the game:**
Procedure for constructing (ellipsoidal) confidence regions of level $1-\alpha$ based solely on observations during the game:
1. Define $R$ as below
2. Define $y \in \mathbb{R}^T$ as below
3. Center of the ellipsoid: $$ (AA')^{-1} Ay $$
Full ellipsoid:
$$ E_\alpha = \Bigg\{ \theta : (\theta - (AA')^{-1} Ay)'R^{-1}(\theta - (AA')^{-1} Ay) \le r_\alpha\Bigg\}$$
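The center and the quadratic form defining $E_\alpha$ are straightforward to compute. A sketch (here $A$ is taken as $d$-by-$T$ so that $AA'$ is $d\times d$, and $R = AA' + I$ is borrowed from the preconditioning note purely for illustration; all numbers are made up):

```python
import numpy as np

rng = np.random.default_rng(1)

d, T = 5, 200
A = rng.integers(0, 2, size=(d, T))       # columns are the chosen a_t
theta_true = rng.normal(0, 0.1, size=d)
y = A.T @ theta_true + rng.normal(0, 0.05, size=T)

center = np.linalg.solve(A @ A.T, A @ y)  # (AA')^{-1} A y

# Quadratic form (theta - center)' R^{-1} (theta - center), with R = AA' + I
R = A @ A.T + np.eye(d)
def in_ellipsoid(theta, r_alpha):
    diff = theta - center
    return diff @ np.linalg.solve(R, diff) <= r_alpha

print(center.round(3), in_ellipsoid(center, 1.0))
```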
Hacky algorithm (for computing $p$-values for each treatment effect):
For each $i \in [d]$, compute smallest $\alpha$ such that $S \cap E_\alpha$ intersects $\{\theta: \theta_i \le \gamma\}$. Call this $p_i$.
Finally, apply Benjamini-Hochberg for the given FDR threshold.
Reformulation (?): For each $i \in [d]$, compute the smallest value of $(\theta - (AA')^{-1} Ay)'R^{-1}(\theta - (AA')^{-1} Ay)$ over the region $S\cap\{\theta \mid \theta_i \leq \gamma\}$; $p_i$ is then the level whose radius $r_\alpha$ equals this value.
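The final Benjamini-Hochberg step is standard and can be sketched as follows (the p-values passed in at the bottom are illustrative):

```python
import numpy as np

def benjamini_hochberg(pvals, fdr=0.1):
    """Return a boolean mask of rejected (i.e. 'discovered') hypotheses."""
    p = np.asarray(pvals)
    d = len(p)
    order = np.argsort(p)
    # BH step-up thresholds: fdr * k / d for the k-th smallest p-value
    thresh = fdr * np.arange(1, d + 1) / d
    below = p[order] <= thresh
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    reject = np.zeros(d, dtype=bool)
    reject[order[:k]] = True
    return reject

print(benjamini_hochberg([0.001, 0.2, 0.03, 0.8], fdr=0.1))
```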
### Notes prior to Aug 1 ###
Things to "select" or "estimate" to match reality: $P$, $n$, $d$, $T$, FDR threshold, $\mu$.
### Experiment Specification ###
Treatment effects $\theta \in \mathbb{R}^d$
Draw each $\theta_i$ independently from distribution $P$
Time horizon of the "game": $t = 1,\ldots,T$
At each time period $t$, select $a_t \in \{0,1\}^d$.
Then an "A/B" test is run on $2n$ users, and the following quantity is observed:
**DATA GENERATING DISTRIBUTION: $\pi_\theta(y)$**
$$
y_t \sim \frac{\mathrm{Bin}(n,\mu_t+a_t^\top \theta) - \mathrm{Bin}(n,\mu_t)}{n}
$$
Aside 1: all that really "matters" is that this holds (and some control on the variation):
$$
\mathbb{E}[y_t] = a_t^\top \theta
$$
Aside 2: For large $n$:
$$ \mathrm{Bin}(n,p) \approx \mathcal{N}\big(np,\,np(1-p)\big) $$
There exists a threshold $\gamma$ such that we seek to identify the $\theta_i \ge \gamma$.
At the end of the $T$ time periods, the algorithm must output $s \in \{0,1\}^d$ (where $s_i = 1$ implies that the algorithm believes $\theta_i \ge \gamma$)
How is performance measured?
1. Hard constraint: an upper bound on the expected FDR over experiment "instances" (intuitively, a low fraction of garbage in the output; technicality: $0/0 = 0$)
2. Objective: maximize expected TPR (over experiment "instances")
### Bayesian Algorithms (i.e. we know $P$)
Slight notation change: $\theta^*$ are the "true" treatment effects
Prior distribution $P$ over $\theta$
Posterior likelihood $\ell(\theta|y)$. Any algorithm requires efficiently sampling $\theta$ from $\ell(\theta|y)$. How to do this?
First, remember Bayes rule:
$$ \ell(\theta|y_1,\dots, y_T) \propto P(\theta) \prod_{t=1}^T \pi_\theta(y_t) $$
Hack: can approximate $\pi_\theta(y)$ by "plugging in" estimate of $\mu$, and then making normal approximation
Now can approximately sample from $\ell(\theta|y)$ via MCMC (e.g. Metropolis-Hastings)
##### Side note: preconditioning for MCMC
Preconditioning = re-scale (and rotate) the coordinates so that things are "spherical"
One example "recipe": given $a_1,\ldots,a_t \in \{0,1\}^d$, let $A$ be the $d$-by-$t$ matrix whose columns are $a_1,\ldots,a_t$. Then let $R = A A^\top + I = I +\sum_{i=1}^t a_i a_i^\top$. Sanity check: $R_{ii} - 1$ is the number of rounds in which treatment $i$ has been included in the A/B test. Then when performing MCMC, rather than take random steps $Z \sim \mathcal{N}(0,\sigma^2 I) \in \mathbb{R}^d$, take random steps $R^{-1/2} Z$, where $R^{-1/2}$ is the matrix satisfying $R^{-1/2} R^{-1/2} = R^{-1}$ (computable with python's numpy/scipy packages). $\sigma$ must be tuned.
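The recipe above can be sketched with `scipy.linalg.sqrtm` (the dimensions and $\sigma$ below are illustrative):

```python
import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(2)

d, t = 8, 100
A = rng.integers(0, 2, size=(d, t))            # columns a_1, ..., a_t
R = A @ A.T + np.eye(d)

# Sanity check from the text: R_ii - 1 = number of rounds including treatment i
assert np.allclose(np.diag(R), 1 + A.sum(axis=1))

R_inv_sqrt = np.real(sqrtm(np.linalg.inv(R)))  # R^{-1/2}

# Preconditioned proposal step: theta' = theta + R^{-1/2} Z, Z ~ N(0, sigma^2 I)
sigma = 0.1
theta = np.zeros(d)
z = rng.normal(0, sigma, size=d)
proposal = theta + R_inv_sqrt @ z
print(proposal.round(3))
```

Coordinates sampled heavily (large $R_{ii}$) get proportionally smaller proposal steps, which is the intended "spherical" effect.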
#### Algorithm 1: Thompson Sampling (requires only one sample from $\ell(\theta|y)$ per time period) ####
At each time period:
1. Sample $\tilde{\theta} \sim \ell(\theta|y)$
2. "Solve" $\max_a a^\top (\tilde{\theta} - \gamma)$, i.e. select $a$ such that entry $i$ equals 1 iff $\tilde{\theta}_i \ge \gamma$
#### Algorithm 1b: Russo-stylized Second Best
#### Algorithm 2: BayesUCB (requires many samples from $\ell(\theta|y)$ per time period) ####
Tuning parameter: $m$, $\alpha$
At each time period:
1. Let $C \subset \mathbb{R}^d$ consist of $m$ independent samples from $\ell(\theta|y)$
2. Sort the elements of $C$ according to $\ell(\theta|y)$, and remove the bottom $\alpha$ fraction from $C$
3. Solve $\max_a \max_{\theta \in C} a^\top (\theta - \gamma)$ and return the maximizing $a$. To do this, solve $\max_{\theta \in C} \max_a a^\top (\theta - \gamma) = \max_{\theta \in C} e^\top (\theta - \gamma)^+$, where $e$ is the vector of ones and superscript $+$ means "positive part", and then for the maximizing $\theta$, compute the maximizing $a$
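The selection steps can be sketched as follows (the samples and log-likelihoods are random stand-ins for what MCMC would produce; $m$, $\alpha$, $\gamma$ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

d, m, gamma, alpha = 6, 200, 0.05, 0.2
# Stand-ins: m posterior samples and their log-likelihoods (would come from MCMC)
C = rng.normal(0, 0.1, size=(m, d))
loglik = rng.normal(size=m)

# Remove the bottom alpha fraction of samples by likelihood
keep = np.argsort(loglik)[int(alpha * m):]
C = C[keep]

# max over samples of e' (theta - gamma)^+, then recover the maximizing a
scores = np.maximum(C - gamma, 0).sum(axis=1)
theta_star = C[np.argmax(scores)]
a = (theta_star >= gamma).astype(int)
print(a)
```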
### Variants/Additions on Bayesian algorithms
#### Variant 1a: Good-Enough Elimination
Run BayesUCB or TS in the background, with the added tuning parameters $p, m$ (where we must have $p>\frac{1}{2}$). At each time step perform the following:
1. Let $S \subset [d]$ be the set of indices of treatments we are confident are effective. If an index $i$ is in $S$, we call treatment $i$ "eliminated".
2. Draw $m$ independent samples from $\ell(\theta|y)$ and let $D$ denote the resulting empirical distribution. Let $\hat{\theta}$ be a random variable drawn from $D$. For each $i \in [d]$, calculate $\alpha_i = P(\hat{\theta}_i \geq \gamma)$.
3. For every $i \in [d] \setminus S$, if $\alpha_i > p$, then for every remaining time step: (1) we set $\tilde{\theta}_i$ to be $\mathbb{E}[\hat{\theta}_i]$, where $\tilde{\theta}$ is the sample we draw from $\ell(\theta|y)$ in M-H, (2) we add $i$ to $S$, (3) we set the $i$th element of $a_t$ to 0.
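The elimination test can be sketched as below (the posterior samples are random stand-ins; $m$, $\gamma$, and the confidence level $p$ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)

d, m, gamma, p_conf = 6, 500, 0.05, 0.9
samples = rng.normal(0.05, 0.05, size=(m, d))   # stand-in for m posterior samples

# alpha_i = empirical probability that theta_i >= gamma under the samples
alpha = (samples >= gamma).mean(axis=0)

# Eliminate (freeze) treatments whose alpha_i exceeds the confidence level,
# pinning them at their posterior mean for the remaining time steps
eliminated = np.nonzero(alpha > p_conf)[0]
frozen_values = samples.mean(axis=0)[eliminated]
print(eliminated, frozen_values.round(3))
```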
#### Variant 1b: Double Containment Elimination
Implicitly, during each step of the algorithm, in order to solve $\max_a a^\top (\tilde{\theta} - \gamma)$ we utilize a function $f: \mathbb{R}^d \to \{0,1\}^d$. Now, let's define a "backwards" variant of $f$, which we will denote $f^{(b)}: \mathbb{R}^d \to \{0,1\}^d$, where $f^{(b)}(v) = w$ means that $v_i \geq \gamma \iff w_i = 0$ and $v_i < \gamma \iff w_i = 1$. In fact, $f^{(b)}$ solves $\min_a a^\top (\tilde{\theta} - \gamma)$.
Run BayesUCB or TS in the background, with the added tuning parameters $\tau \leq T$, $\delta > 0$.
1. Run BayesUCB or TS for $\tau$ time steps and obtain a set $E_1$ of indices for the treatments the algorithm has identified to be past the threshold of $\gamma - \delta$.
2. Run BayesUCB or TS on only the treatments with indices in $E_1$ and utilize $f^{(b)}$ to convert between $\theta$-space and the hypercube. In other words, during each time step of the algorithm, we want to solve $\min_a a^\top (\tilde{\theta} - \gamma)$. After running this for $T-\tau$ steps, we will obtain an output set $E_2$.
3. Return $E_1 \setminus E_2$.
Note: I believe if $TPR_1$ and $FDR_1$ are the true positive/false discovery rates for step 1, and $TPR_2$ and $FDR_2$ are those for step 2, then overall this algorithm has a false discovery rate of $FDR_1\cdot\frac{(1-TPR_2)}{1+(FDR_1-FDR_2)-TPR_2FDR_1}$ and a true positive rate of $TPR_1\cdot(1-FDR_2)$.
#### Variant 2: Controlling $\lVert a_t \rVert_1$
Run BayesUCB or TS in the background, with the added tuning parameters $0 \leq k_1 \leq k_2 \leq d$. At each time step perform the following:
1. Once we obtain our $a_t$ and $\tilde{\theta}$ for a certain time step, calculate $\lVert a_t \rVert_1$. If $k_1 \leq \lVert a_t \rVert_1 \leq k_2$, we do nothing.
2. If $k_2 < \lVert a_t \rVert_1$, let $S$ be the set of all indices $i$ such that the $i$th entry of $a_t$ is 1. Find $\hat{i} = \arg\min_{i \in S}\tilde{\theta}_i$ and set the $\hat{i}$th element of $a_t$ to 0. Continue until the inequality is satisfied: $\lVert a_t \rVert_1 \leq k_2$.
3. If $\lVert a_t \rVert_1 < k_1$, let $S$ be the set of all indices $i$ such that the $i$th entry of $a_t$ is 0. Find $\hat{i} = \arg\max_{i \in S}\tilde{\theta}_i$ and set the $\hat{i}$th element of $a_t$ to 1. Continue until the inequality is satisfied: $k_1 \leq \lVert a_t \rVert_1$.
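The trimming loop above can be sketched as (the example $a_t$, $\tilde{\theta}$, $k_1$, $k_2$ are illustrative):

```python
import numpy as np

def clip_support(a, theta_tilde, k1, k2):
    """Force k1 <= ||a||_1 <= k2 by dropping the weakest 1s / adding the strongest 0s."""
    a = a.copy()
    while a.sum() > k2:
        ones = np.nonzero(a == 1)[0]
        a[ones[np.argmin(theta_tilde[ones])]] = 0   # drop weakest included treatment
    while a.sum() < k1:
        zeros = np.nonzero(a == 0)[0]
        a[zeros[np.argmax(theta_tilde[zeros])]] = 1  # add strongest excluded treatment
    return a

a = np.array([1, 1, 1, 1, 0])
theta = np.array([0.10, 0.01, 0.07, 0.02, 0.04])
print(clip_support(a, theta, k1=1, k2=2))
```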
### Frequentist Algorithms (i.e. we know, or have an upper bound on, the number of "effective" treatments. We don't know $P$)
#### Algorithm 3 (bad benchmark): LinUCB
Bad because it ignores the upper bound on the number of "effective" treatments
Need to be able to solve $\max_\theta \pi_\theta(y)$. This is easy.
#### Algorithm 4: The analogous LinUCB (leverages sparsity information)
Easy to state, not sure it's efficiently computable yet...
Need to be able to solve $\max_\theta \pi_\theta(y)$, but over "sparse" $\theta$. This is not easy, but probably do-able.
#### Algorithm 5: TS Hack (with uniform prior)