## Analyzing the "Simple" Scoring System
For $t\in[T]$, generate $a_t \in \{0,1\}^d$ with exactly $pd$ entries equal to one, and observe
$$
y_t \sim \mathrm{Ber}(\mu+a_t^\top \theta) - \mathrm{Ber}(\mu).
$$
Let $A \in \{0,1\}^{T\times d}$ be the matrix whose rows are $a_1,\ldots,a_T$. Denote the columns of $A$ by $A_1,\ldots,A_d$.
Each treatment $i$ receives a final "score":
$$s_i = \frac{A_i ^\top y^+}{pd} \sim \frac{\mathrm{Bin}\big(pT,\,(1-\mu)(\mu + (1-p)\theta_i + p\sum_{j\in [d]}\theta_j)\big)}{pd}$$
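As a sanity check, the scoring pipeline can be simulated directly. A minimal sketch (the values of $d$, $T$, $p$, $\mu$, and $\theta$ below are illustrative, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)

d, T, p, mu = 10, 2000, 0.3, 0.1
theta = np.linspace(0.05, -0.05, d)        # illustrative effects, summing to ~0

# Each row of A has exactly p*d ones.
k = int(p * d)
A = np.zeros((T, d), dtype=int)
for t in range(T):
    A[t, rng.choice(d, size=k, replace=False)] = 1

# y_t ~ Ber(mu + a_t' theta) - Ber(mu); keep the positive part y+.
prob = mu + A @ theta
y = rng.binomial(1, np.clip(prob, 0, 1)) - rng.binomial(1, mu, size=T)
y_plus = np.maximum(y, 0)

# Scores s_i = A_i' y+ / (p d).
s = A.T @ y_plus / (p * d)
print(s.round(3))
```

With these parameters the empirical scores should roughly track the ordering of $\theta$.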
Useful for later calculations:
$\mathbb{E}[y_t^+] = (\mu+a_t^\top\theta)(1-\mu)$
$\mathbb{E}[y^+] = (\mu e +A\theta)(1-\mu)$
$\mathrm{Var}[y_t^+] = (\mu+a_t^\top\theta)(1-\mu)[1-(\mu+a_t^\top\theta)(1-\mu)]$
Assume w.l.o.g. $\theta_1 \ge \theta_2 \ge \cdots \ge \theta_d$.
To what extent do the scores preserve this ranking?
$$\mathbb{E}[s_i] = \frac{A_i ^\top (\mu e + A\theta)(1-\mu)}{pd} = \frac{T \mu (1-\mu)}{d} + \frac{(1-p)T\theta_i(1-\mu)}{d} + \sum_{j \in [d] } \frac{pT\theta_j(1-\mu)}{d}$$
$$\mathrm{Var}[s_i] = \frac{T(1-\mu)(\mu + (1-p)\theta_i + p\sum_{j\in [d]}\theta_j)[1-(1-\mu)(\mu + (1-p)\theta_i + p\sum_{j\in [d]}\theta_j)]}{pd^2}$$
A simplifying assumption: $\sum_{j \in [d]}\theta_j = 0$
$$\mathbb{E}[s_i] = \frac{A_i ^\top (\mu e + A\theta)(1-\mu)}{pd} = \frac{T \mu (1-\mu)}{d} + \frac{(1-p)T\theta_i(1-\mu)}{d} $$
$$\mathrm{Var}[s_i] = \frac{T(1-\mu)(\mu + (1-p)\theta_i )[1-(1-\mu)(\mu + (1-p)\theta_i )]}{pd^2}$$
We care about $\mathbb{E}[s_1-s_2]/\mathrm{SD}(s_1)$ being large.
$$\mathbb{E}[s_1-s_2] = \frac{(1-p)(1-\mu)T(\theta_1-\theta_2)}{d}$$
$$\frac{\mathbb{E}[s_1-s_2]}{\sqrt{\mathrm{Var}[s_1]}} = (1-p)\sqrt{T(1-\mu)}\,(\theta_1-\theta_2)\sqrt{\frac{p}{(\mu + (1-p)\theta_1 )[1-(1-\mu)(\mu + (1-p)\theta_1 )]}}$$
Simplifying assumption: $\mu \approx 0$
$$\frac{\mathbb{E}[s_1-s_2]}{\sqrt{\mathrm{Var}[s_1]}} = (1-p)\sqrt{T}\,(\theta_1-\theta_2)\sqrt{\frac{p}{ (1-p)\theta_1[1-(1-p)\theta_1] }}$$
There exists an optimal sampling fraction $p^*(\theta_1)$ maximizing this ratio, and we would probably use $\theta_1 = \gamma$.
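The optimum $p^*$ can be located numerically by evaluating the ratio above on a grid. A sketch under the $\mu \approx 0$ simplification (the values of $T$, $\theta_1 = \gamma$, and $\theta_1-\theta_2$ are illustrative):

```python
import numpy as np

T, gamma, delta = 1000, 0.05, 0.02   # illustrative: theta_1 = gamma, theta_1 - theta_2 = delta

def snr(p, theta1=gamma):
    """Signal-to-noise ratio E[s_1 - s_2] / SD(s_1) under the mu ~ 0 approximation."""
    q = (1 - p) * theta1
    return (1 - p) * np.sqrt(T) * delta * np.sqrt(p / (q * (1 - q)))

ps = np.linspace(0.01, 0.99, 981)
p_star = ps[np.argmax(snr(ps))]
print(p_star)
```

For small $\theta_1$ the ratio is approximately proportional to $\sqrt{p(1-p)}$, so $p^*$ lands near $1/2$.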
### 080823
### Proposed Frequentist Model ###
Given treatment effect distribution $P$
Frequentist assumption: treatment effect vector $\theta \in \mathbb{R}^d$ has sufficiently high likelihood:
$$ S = \Bigg\{\theta:\sum_{i=1}^d \log P(\theta_i) \ge \text{some chosen threshold} \Bigg\}$$
**Inferential Procedure at the "end" of the game:**
Procedure for constructing (ellipsoidal) confidence regions of level $1-\alpha$ based solely on observations during the game:
1. Define $R$ as below
2. Define $y \in \mathbb{R}^T$ as below
3. Center of the ellipsoid: $$ (AA')^{-1} Ay $$
Full ellipsoid:
$$ E_\alpha = \Bigg\{ \theta : (\theta - (AA')^{-1} Ay)'R^{-1}(\theta - (AA')^{-1} Ay) \le r_\alpha\Bigg\}$$
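The center and the quadratic form defining $E_\alpha$ are straightforward to compute. A sketch (here $A$ is taken as $d$-by-$T$ so that $AA'$ is $d\times d$, and $R = AA' + I$ is borrowed from the preconditioning note purely for illustration; all numbers are made up):

```python
import numpy as np

rng = np.random.default_rng(1)

d, T = 5, 200
A = rng.integers(0, 2, size=(d, T))       # columns are the chosen a_t
theta_true = rng.normal(0, 0.1, size=d)
y = A.T @ theta_true + rng.normal(0, 0.05, size=T)

center = np.linalg.solve(A @ A.T, A @ y)  # (AA')^{-1} A y

# Quadratic form (theta - center)' R^{-1} (theta - center), with R = AA' + I
R = A @ A.T + np.eye(d)
def in_ellipsoid(theta, r_alpha):
    diff = theta - center
    return diff @ np.linalg.solve(R, diff) <= r_alpha

print(center.round(3), in_ellipsoid(center, 1.0))
```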
Hacky algorithm (for computing $p$-values for each treatment effect):
For each $i \in [d]$, compute smallest $\alpha$ such that $S \cap E_\alpha$ intersects $\{\theta: \theta_i \le \gamma\}$. Call this $p_i$.
Finally, apply Benjamini-Hochberg for the given FDR threshold.
Reformulation (?): For each $i \in [d]$, compute the smallest value of $(\theta - (AA')^{-1} Ay)'R^{-1}(\theta - (AA')^{-1} Ay)$ over the region $S\cap\{\theta \mid \theta_i \leq \gamma\}$; $p_i$ is then the level whose radius $r_\alpha$ equals this value.
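The final Benjamini-Hochberg step is standard and can be sketched as follows (the p-values passed in at the bottom are illustrative):

```python
import numpy as np

def benjamini_hochberg(pvals, fdr=0.1):
    """Return a boolean mask of rejected (i.e. 'discovered') hypotheses."""
    p = np.asarray(pvals)
    d = len(p)
    order = np.argsort(p)
    # BH step-up thresholds: fdr * k / d for the k-th smallest p-value
    thresh = fdr * np.arange(1, d + 1) / d
    below = p[order] <= thresh
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    reject = np.zeros(d, dtype=bool)
    reject[order[:k]] = True
    return reject

print(benjamini_hochberg([0.001, 0.2, 0.03, 0.8], fdr=0.1))
```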
### Notes prior to Aug 1 ###
Things to "select" or "estimate" to match reality: $P$, $n$, $d$, $T$, FDR threshold, $\mu$.
### Experiment Specification ###
Treatment effects $\theta \in \mathbb{R}^d$
Draw each $\theta_i$ independently from distribution $P$
Time horizon of the "game": $t = 1,\ldots,T$
At each time period $t$, select $a_t \in \{0,1\}^d$.
Then an "A/B" test is run on $2n$ users, and the following quantity is observed:
**DATA GENERATING DISTRIBUTION: $\pi_\theta(y)$**
$$
y_t \sim \frac{\mathrm{Bin}(n,\mu_t+a_t^\top \theta) - \mathrm{Bin}(n,\mu_t)}{n}
$$
Aside 1: all that really "matters" is that this holds (and some control on the variation):
$$
\mathbb{E}[y_t] = a_t^\top \theta
$$
Aside 2: For large $n$:
$$ \mathrm{Bin}(n,p) \approx \mathcal{N}\big(np,\,np(1-p)\big) $$
There exists a threshold $\gamma$ such that we seek to identify the $\theta_i \ge \gamma$.
At the end of the $T$ time periods, the algorithm must output $s \in \{0,1\}^d$ (where $s_i = 1$ implies that the algorithm believes $\theta_i \ge \gamma$)
How is performance measured?
1. Hard constraint: an upper bound on the expected FDR over experiment "instances" (intuitively, a low fraction of garbage in the output; technicality: $0/0 = 0$)
2. Objective: maximize expected TPR (over experiment "instances")
### Bayesian Algorithms (i.e. we know $P$)
Slight notation change: $\theta^*$ are the "true" treatment effects
Prior distribution $P$ over $\theta$
Posterior likelihood $\ell(\theta|y)$. Any algorithm requires efficiently sampling $\theta$ from $\ell(\theta|y)$. How to do this?
First, remember Bayes rule:
$$ \ell(\theta|y_1,\dots, y_T) \propto P(\theta) \prod_{t=1}^T \pi_\theta(y_t) $$
Hack: can approximate $\pi_\theta(y)$ by "plugging in" estimate of $\mu$, and then making normal approximation
Now can approximately sample from $\ell(\theta|y)$ via MCMC (e.g. Metropolis-Hastings)
##### Side note: preconditioning for MCMC
Preconditioning = re-scale (and rotate) the coordinates so that things are "spherical"
One example "recipe": given $a_1,\ldots,a_t \in \{0,1\}^d$, let $A$ be the $d$-by-$t$ matrix whose columns are $a_1,\ldots,a_t$. Then let $R = A A^\top + I = I +\sum_{i=1}^t a_i a_i^\top$. Sanity check: $R_{ii} - 1$ is the number of rounds in which treatment $i$ has been included in the A/B test. Then when performing MCMC, rather than take random steps $Z \sim \mathcal{N}(0,\sigma^2 I) \in \mathbb{R}^d$, take random steps $R^{-1/2} Z$, where $R^{-1/2}$ is the matrix satisfying $R^{-1/2} R^{-1/2} = R^{-1}$ (computable with python's numpy/scipy packages). $\sigma$ must be tuned.
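The recipe above can be sketched with `scipy.linalg.sqrtm` (the dimensions and $\sigma$ below are illustrative):

```python
import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(2)

d, t = 8, 100
A = rng.integers(0, 2, size=(d, t))            # columns a_1, ..., a_t
R = A @ A.T + np.eye(d)

# Sanity check from the text: R_ii - 1 = number of rounds including treatment i
assert np.allclose(np.diag(R), 1 + A.sum(axis=1))

R_inv_sqrt = np.real(sqrtm(np.linalg.inv(R)))  # R^{-1/2}

# Preconditioned proposal step: theta' = theta + R^{-1/2} Z, Z ~ N(0, sigma^2 I)
sigma = 0.1
theta = np.zeros(d)
z = rng.normal(0, sigma, size=d)
proposal = theta + R_inv_sqrt @ z
print(proposal.round(3))
```

Coordinates sampled heavily (large $R_{ii}$) get proportionally smaller proposal steps, which is the intended "spherical" effect.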
#### Algorithm 1: Thompson Sampling (requires only one sample from $\ell(\theta|y)$ per time period) ####
At each time period:
1. Sample $\tilde{\theta} \sim \ell(\theta|y)$
2. "Solve" $\max_a a^\top (\tilde{\theta} - \gamma)$, i.e. select $a$ such that entry $i$ equals 1 iff $\tilde{\theta}_i \ge \gamma$
#### Algorithm 1b: Russo-stylized Second Best
#### Algorithm 2: BayesUCB (requires many samples from $\ell(\theta|y)$ per time period) ####
Tuning parameter: $m$, $\alpha$
At each time period:
1. Let $C \subset \mathbb{R}^d$ consist of $m$ independent samples from $\ell(\theta|y)$
2. Sort the elements of $C$ according to $\ell(\theta|y)$, and remove the bottom $\alpha$ fraction from $C$
3. Solve $\max_a \max_{\theta \in C} a^\top (\theta - \gamma)$ and return the maximizing $a$. To do this, solve $\max_{\theta \in C} \max_a a^\top (\theta - \gamma) = \max_{\theta \in C} e^\top (\theta - \gamma)^+$, where $e$ is the vector of ones and superscript $+$ means "positive part", and then for the maximizing $\theta$, compute the maximizing $a$
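The selection steps can be sketched as follows (the samples and log-likelihoods are random stand-ins for what MCMC would produce; $m$, $\alpha$, $\gamma$ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

d, m, gamma, alpha = 6, 200, 0.05, 0.2
# Stand-ins: m posterior samples and their log-likelihoods (would come from MCMC)
C = rng.normal(0, 0.1, size=(m, d))
loglik = rng.normal(size=m)

# Remove the bottom alpha fraction of samples by likelihood
keep = np.argsort(loglik)[int(alpha * m):]
C = C[keep]

# max over samples of e' (theta - gamma)^+, then recover the maximizing a
scores = np.maximum(C - gamma, 0).sum(axis=1)
theta_star = C[np.argmax(scores)]
a = (theta_star >= gamma).astype(int)
print(a)
```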
### Variants/Additions on Bayesian algorithms
#### Variant 1a: Good-Enough Elimination
Run BayesUCB or TS in the background, with the added tuning parameters $p, m$ (where we must have $p>\frac{1}{2}$). At each time step perform the following:
1. Let $S \subset [d]$ be the set of indices of treatments we are confident are effective. If an index $i$ is in $S$, we call treatment $i$ "eliminated".
2. Draw $m$ independent samples from $\ell(\theta|y)$ and let $D$ denote the resulting empirical distribution. Let $\hat{\theta}$ be a random variable drawn from $D$. For each $i \in [d]$, calculate $\alpha_i = P(\hat{\theta}_i \geq \gamma)$.
3. For every $i \in [d] \setminus S$, if $\alpha_i > p$, then for every remaining time step: (1) we set $\tilde{\theta}_i$ to be $\mathbb{E}[\hat{\theta}_i]$, where $\tilde{\theta}$ is the sample we draw from $\ell(\theta|y)$ in M-H, (2) we add $i$ to $S$, (3) we set the $i$th element of $a_t$ to 0.
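The elimination test can be sketched as below (the posterior samples are random stand-ins; $m$, $\gamma$, and the confidence level $p$ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)

d, m, gamma, p_conf = 6, 500, 0.05, 0.9
samples = rng.normal(0.05, 0.05, size=(m, d))   # stand-in for m posterior samples

# alpha_i = empirical probability that theta_i >= gamma under the samples
alpha = (samples >= gamma).mean(axis=0)

# Eliminate (freeze) treatments whose alpha_i exceeds the confidence level,
# pinning them at their posterior mean for the remaining time steps
eliminated = np.nonzero(alpha > p_conf)[0]
frozen_values = samples.mean(axis=0)[eliminated]
print(eliminated, frozen_values.round(3))
```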
#### Variant 1b: Double Containment Elimination
Implicitly, during each step of the algorithm, in order to solve $\max_a a^\top (\tilde{\theta} - \gamma)$ we utilize a function $f: \mathbb{R}^d \to \{0,1\}^d$. Now, let's define a "backwards" variant of $f$, which we will denote $f^{(b)}: \mathbb{R}^d \to \{0,1\}^d$, where $f^{(b)}(v) = w$ means that $v_i \geq \gamma \iff w_i = 0$ and $v_i < \gamma \iff w_i = 1$. In fact, $f^{(b)}$ solves $\min_a a^\top (\tilde{\theta} - \gamma)$.
Run BayesUCB or TS in the background, with the added tuning parameters $\tau \leq T$, $\delta > 0$.
1. Run BayesUCB or TS for $\tau$ time steps and obtain a set $E_1$ of indices for the treatments the algorithm has identified to be past the threshold of $\gamma - \delta$.
2. Run BayesUCB or TS on only the treatments with indices in $E_1$ and utilize $f^{(b)}$ to convert between $\theta$-space and the hypercube. In other words, during each time step of the algorithm, we want to solve $\min_a a^\top (\tilde{\theta} - \gamma)$. After running this for $T-\tau$ steps, we will obtain an output set $E_2$.
3. Return $E_1 \setminus E_2$.
Note: I believe if $TPR_1$ and $FDR_1$ are the true positive/false discovery rates for step 1, and $TPR_2$ and $FDR_2$ are those for step 2, then overall this algorithm has a false discovery rate of $FDR_1\cdot\frac{(1-TPR_2)}{1+(FDR_1-FDR_2)-TPR_2FDR_1}$ and a true positive rate of $TPR_1\cdot(1-FDR_2)$.
#### Variant 2: Controlling $\lVert a_t \rVert_1$
Run BayesUCB or TS in the background, with the added tuning parameters $0 \leq k_1 \leq k_2 \leq d$. At each time step perform the following:
1. Once we obtain our $a_t$ and $\tilde{\theta}$ for a certain time step, calculate $\lVert a_t \rVert_1$. If $k_1 \leq \lVert a_t \rVert_1 \leq k_2$, we do nothing.
2. If $k_2 < \lVert a_t \rVert_1$, let $S$ be the set of all indices $i$ such that the $i$th entry of $a_t$ is 1. Find $\hat{i} = \arg\min_{i \in S}\tilde{\theta}_i$ and set the $\hat{i}$th element of $a_t$ to 0. Continue until the inequality is satisfied: $\lVert a_t \rVert_1 \leq k_2$.
3. If $\lVert a_t \rVert_1 < k_1$, let $S$ be the set of all indices $i$ such that the $i$th entry of $a_t$ is 0. Find $\hat{i} = \arg\max_{i \in S}\tilde{\theta}_i$ and set the $\hat{i}$th element of $a_t$ to 1. Continue until the inequality is satisfied: $k_1 \leq \lVert a_t \rVert_1$.
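The trimming loop above can be sketched as (the example $a_t$, $\tilde{\theta}$, $k_1$, $k_2$ are illustrative):

```python
import numpy as np

def clip_support(a, theta_tilde, k1, k2):
    """Force k1 <= ||a||_1 <= k2 by dropping the weakest 1s / adding the strongest 0s."""
    a = a.copy()
    while a.sum() > k2:
        ones = np.nonzero(a == 1)[0]
        a[ones[np.argmin(theta_tilde[ones])]] = 0   # drop weakest included treatment
    while a.sum() < k1:
        zeros = np.nonzero(a == 0)[0]
        a[zeros[np.argmax(theta_tilde[zeros])]] = 1  # add strongest excluded treatment
    return a

a = np.array([1, 1, 1, 1, 0])
theta = np.array([0.10, 0.01, 0.07, 0.02, 0.04])
print(clip_support(a, theta, k1=1, k2=2))
```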
### Frequentist Algorithms (i.e. we know, or have an upper bound on, the number of "effective" treatments. We don't know $P$)
#### Algorithm 3 (bad benchmark): LinUCB
Bad because it ignores the upper bound on the number of "effective" treatments
Need to be able to solve $\max_\theta \pi_\theta(y)$. This is easy.
#### Algorithm 4: The analogous LinUCB (leverages sparsity information)
Easy to state, not sure it's efficiently computable yet...
Need to be able to solve $\max_\theta \pi_\theta(y)$, but over "sparse" $\theta$. This is not easy, but probably do-able.
#### Algorithm 5: TS Hack (with uniform prior)