# Experimentation Frameworks

---

## Identification of a single unifying framework

* Traditional A/B Test has fixed allocation for the two arms (blind to rewards)
* A/B Test with Prior
* Bandit algorithm with Thompson Sampling (dependent on immediate rewards) has dynamic allocation

Empirical evaluation of various methods of doing experimentation

---

* Traditional A/B Test (we have historical data on it, and it serves as a baseline)
  * Allocation is fixed (usually 50/50, but also dependent on the phase)
* A/B Test with Prior on Conversion Rate (Bayesian approach; a minimal Beta-prior sketch appears at the end of these notes)
  * Prior is a GP prior
  * A/B Test with Informative prior (prior chosen based on previous tests/historical data)
  * A/B Test with Generic prior (no prior knowledge of the shape the distribution can take)
  * Allocation is fixed
  * Conjecture: regularization is helpful; incorporating prior knowledge will reduce false positives and speed up learning
* Bandit algorithm with Thompson Sampling
  * Uses a posterior to adapt allocation; allocation is dynamic
  * Conjecture: dynamic allocation is helpful

## Simulation Metrics

* Accuracy - does the true rate lie within the confidence interval?
* Speed - when can we stop the experiment? How quickly we identify the better-performing treatment, and how many users (trials) are needed before we can stop.
* Total Average Reward (Profit) - the total average reward captured by the end of the experiment.

## Background on Bandits vs. A/B Testing

Standard A/B testing is the de facto procedure for running online experiments and testing the performance of a new variant against the old (control) one. However, the procedure has two well-known costs:

1) To be P% confident that we have identified the better-performing variant, we must run a fixed number of trials before we can declare statistical significance (time cost).
2) While the experiment runs, we risk showing the suboptimal treatment to many users when we could have exploited the better-performing variant (financial cost).

These costs are incurred because A/B tests are designed to prioritize exploration and statistical significance. We consider an alternative: a bandit algorithm with Thompson Sampling. Because bandit algorithms offer dynamic traffic allocation, we can continuously redirect traffic toward the better-performing variant instead of being locked into the fixed allocation of an A/B test. The trade-off is that the standard bandit problem focuses on conversions rather than statistical certainty. We choose Thompson Sampling as the bandit policy because it maintains posterior estimates of each variation's conversion rate and selects each arm in proportion to the probability that it is the best one.

MAB optimizes a given metric (total reward, etc.), while A/B testing is used to collect data for critical decisions and to learn the impact of the variations with statistical confidence. MAB offers little to no interpretation of the results or of the performance of the variations; we just want to maximize conversions. This is useful when there is not enough time to gather statistical significance (fast rollout).
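The dynamic allocation described above can be made concrete with a Beta-Bernoulli model: each arm's conversion rate gets a Beta posterior, one rate is sampled from each posterior for every incoming user, and the user is routed to the arm with the highest sample. The sketch below is a minimal illustration under those assumptions; the names (`thompson_allocate`, `true_rates`, the default Beta(1, 1) prior) are ours, not from any particular library.

```python
# Minimal sketch of Thompson Sampling for two Bernoulli arms (control vs. variant).
# Assumes Beta(1, 1) priors unless informative priors are supplied.
import numpy as np

rng = np.random.default_rng(0)

def thompson_allocate(successes, failures, prior=(1.0, 1.0)):
    """Sample a conversion rate from each arm's Beta posterior and pick the best arm."""
    samples = [
        rng.beta(prior[0] + s, prior[1] + f) for s, f in zip(successes, failures)
    ]
    return int(np.argmax(samples))

# Toy simulation: true (unknown) conversion rates of the two arms.
true_rates = [0.05, 0.07]
successes = [0, 0]
failures = [0, 0]

for _ in range(10_000):                        # each iteration is one user/trial
    arm = thompson_allocate(successes, failures)
    reward = rng.random() < true_rates[arm]    # Bernoulli conversion
    successes[arm] += reward
    failures[arm] += 1 - reward

# Traffic drifts toward the better arm as the posteriors sharpen.
print("trials per arm:", [s + f for s, f in zip(successes, failures)])
print("posterior means:", [(1 + s) / (2 + s + f) for s, f in zip(successes, failures)])
```

Because traffic shifts toward whichever arm currently looks better, the trial counts printed at the end are typically heavily skewed toward the higher-converting arm, which is the behavior the conjecture about dynamic allocation is meant to test.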
### It seems that MAB has advantages in the following use cases:

1. When the cost of lost conversions is too high, i.e., exploring means forgoing a very large amount of profit.
2. Time-sensitive settings, such as a seasonal ad that will only be displayed for a week.
3. Settings where we need the flexibility to add or remove elements from the variations. In a traditional A/B test, once the experiment is live we cannot make changes without compromising the integrity of the data.
4. When there is not enough traffic, so an A/B test would take a long time to reach statistical significance.

### When MAB fails:

1. When the main goal is statistical significance.
2. When we are optimizing for multiple metrics.
3. When we want to do post-experiment inference/analysis.
4. MAB does not tell us the confidence level for poorly performing arms, which may be important knowledge for making a business decision.
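For contrast with the bandit, the sketch below shows the fixed-allocation "A/B Test with Prior" variant referenced earlier, producing exactly the quantities the failure cases above ask for: a posterior probability that the variant beats the control and a credible interval for its rate (the Accuracy metric). The notes mention a GP prior; to keep this example self-contained it uses a conjugate Beta prior instead, with an informative prior built from hypothetical historical data and a generic uniform Beta(1, 1) prior. All names and numbers here are illustrative assumptions.

```python
# Minimal sketch of a fixed-allocation Bayesian A/B test on conversion rate,
# using a conjugate Beta prior (our simplification) rather than a GP prior.
import numpy as np

rng = np.random.default_rng(1)

def posterior(successes, trials, prior=(1.0, 1.0)):
    """Beta posterior over a conversion rate given Bernoulli data."""
    return prior[0] + successes, prior[1] + trials - successes

# Fixed 50/50 allocation: each arm sees the same number of users.
n_per_arm = 5_000
true_a, true_b = 0.05, 0.06
conv_a = rng.binomial(n_per_arm, true_a)
conv_b = rng.binomial(n_per_arm, true_b)

informative = (5.0, 95.0)   # hypothetical historical conversion around 5%
generic = (1.0, 1.0)        # uniform prior: no knowledge of the rate

for name, prior in [("informative", informative), ("generic", generic)]:
    a_post = posterior(conv_a, n_per_arm, prior)
    b_post = posterior(conv_b, n_per_arm, prior)
    # Monte Carlo estimate of P(rate_B > rate_A) and a 95% credible interval for B.
    draws_a = rng.beta(*a_post, size=100_000)
    draws_b = rng.beta(*b_post, size=100_000)
    p_b_better = float(np.mean(draws_b > draws_a))
    ci_low, ci_high = np.percentile(draws_b, [2.5, 97.5])
    print(f"{name}: P(B > A) = {p_b_better:.3f}, "
          f"95% CI for B = ({ci_low:.4f}, {ci_high:.4f})")
```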