# A/B testing: Optimizing systems with experiments

###### tags: `data science`

> *In God we trust. All others bring data.*

---

An A/B test measures the business metric for each of version A and B. If you find that B has a better business metric, you make the modification permanent; otherwise you leave the system as is.

![](https://hackmd.io/_uploads/ByajWiGT3.png)

- Design — You prepare for the experiment by **determining how many measurements to take**.
- Measure — When you measure the business metric, you'll take care to measure only the effect of switching from version A to version B by using a technique called **randomization**.
- Analyze — Finally, you'll compare the business metric estimates for A and B and decide whether to switch to B or not.

If you take `num_ind` individual measurements and collect them in an array, `costs`, you can calculate:

- The aggregate measurement: `costs.mean()`
- The standard deviation of the individual measurements: `sd_1 = costs.std()`
- The standard error of the aggregate measurement: `se = sd_1 / np.sqrt(num_ind)`

```python
import numpy as np

np.random.seed(17)
num_individual_measurements = 10

# aggregate_measurement_with_se returns the mean cost and its standard error
# for one exchange (a sketch of this helper appears later in this section).
agg_asdaq, se_asdaq = aggregate_measurement_with_se("ASDAQ", num_individual_measurements)
agg_byse, se_byse = aggregate_measurement_with_se("BYSE", num_individual_measurements)

delta = agg_byse - agg_asdaq
se_delta = np.sqrt(se_byse**2 + se_asdaq**2)
```

:::info
The A/B testing decision logic:

1. Assume the true value—the expectation—of `delta` is zero; in other words, that BYSE and ASDAQ have the same expected execution costs.
2. If your measurement of `delta` is so far from zero that there's less than a 5% chance the statement in step 1 is right, then act as if step 1 was wrong and the true delta is not zero.
:::

## Design the A/B Test

Replication reduces variation, so you want to take multiple individual measurements of execution cost. But these measurements aren't free, so it's better to determine how many measurements to take before you start. More precisely, **if the difference between execution costs is large enough to be practically significant, you want your aggregate measurement to be statistically significant**.

:::info
Experimentation costs:

- The process (the measurement stage) will take some of your time. You might have to configure software, monitor the system for safety reasons, or even explain to customers or other members of your firm why system behavior is different than usual.
- While experimenting, half of your trades go to the exchange with the higher cost. The sooner you stop the experiment, the sooner you can send all trades to the better exchange.
- The less time a single experiment takes, the more experiments you can run. You have lots of ideas, and they each need to be A/B tested.
:::

The phrase *statistically significant* means `z = delta / se_delta = np.sqrt(num_ind) * delta / sd_1_delta < -1.64`, which means you must take at least `num_ind > (1.64 * sd_1_delta / delta)**2` individual measurements for your analysis stage to be able to give a statistically significant result.

:::warning
To be clear: this says that you'll make `num_ind` trades on BYSE and `num_ind` trades on ASDAQ; then you'll compute two averages, `agg_byse` and `agg_asdaq`, from which you'll compute `delta = agg_byse - agg_asdaq`. That means you're making `2*num_ind` trades in total.
:::
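The code in this section calls two helpers, `trading_system` and `aggregate_measurement_with_se`, that aren't defined in this note. Here is a minimal sketch of what they might look like, assuming a toy simulator in which each exchange's execution cost is normally distributed; the cost levels of 12 and 10 and the noise level of 1 are made-up values for illustration, not measurements:

```python
import numpy as np

def trading_system(exchange):
    # Hypothetical cost simulator: ASDAQ trades cost about 12 on average,
    # BYSE trades about 10, both with standard deviation 1.
    if exchange == "ASDAQ":
        return np.random.normal(12, 1)
    return np.random.normal(10, 1)

def aggregate_measurement_with_se(exchange, num_individual_measurements):
    # Take repeated individual cost measurements on one exchange and return
    # their mean (the aggregate measurement) and its standard error.
    costs = np.array(
        [trading_system(exchange) for _ in range(num_individual_measurements)]
    )
    return costs.mean(), costs.std() / np.sqrt(num_individual_measurements)
```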
### Substitution

The problem with this expression is that, at design time, **you don't know `delta` or `sd_1_delta`**. They are summary statistics of the individual measurements that you have yet to take. The good news is that **you can find good design-time substitutes** for these numbers.

- In place of `delta`, substitute your practical significance level, `prac_sig`. (You are saying that you want to be able to measure—to make statistically significant—delta values that are at least as large in magnitude as `prac_sig`.)
- You can estimate `sd_1_delta` in one of two ways:
    - Take the standard deviation of existing measurements of your business metric from logged data. From that, you can calculate `sd_1_asdaq` directly. Then you can guess that `sd_1_byse = sd_1_asdaq`. Usually this is a good guess.
    - Alternatively, run a small-scale measurement, called a *pilot study*. You could estimate `sd_1_asdaq` and `sd_1_byse` directly from logged costs collected during a pilot study, then compute `sd_1_delta` from those estimates.

```python
def ab_test_design(sd_1_delta, prac_sig):
    num_individual_measurements = (1.64 * sd_1_delta / prac_sig)**2
    return np.ceil(num_individual_measurements)

# Estimate sd_1_asdaq from the production log
np.random.seed(17)
sd_1_asdaq = np.array([trading_system("ASDAQ") for _ in range(100)]).std()
sd_1_byse = sd_1_asdaq
sd_1_delta = np.sqrt(sd_1_asdaq**2 + sd_1_byse**2)

prac_sig = 1  # specify the practical significance level
ab_test_design(sd_1_delta, prac_sig)  # minimum number of individual measurements
```

If you take this many individual measurements, you'll have a **5% chance of a false positive**—of incorrectly acting as if BYSE is better than ASDAQ.

### Power Analysis

What happens if BYSE really is better than ASDAQ, but you incorrectly conclude that it isn't, and so you don't switch your trading to BYSE? That's called a **false negative**.

:::info
- False positive: BYSE isn't better, but you do switch.
- False negative: BYSE is better, but you don't switch.
:::

Just as you adjusted `num_ind` to limit the rate of false positives, you'd like a way to limit the rate of false negatives, too. You just designed an A/B test that would limit the rate of false positives to 5%. You can use a similar approach to also limit the rate of false negatives. **By convention, we usually limit false negatives to 20% of the time** (a simulation illustrating why this extra constraint matters appears after the box below).

:::warning
*The conventional limit on the false-positive rate in A/B test design is 5%. The conventional limit on the false-negative rate is 20%.*

Why are they different? When you run an A/B test, you start with a system running version A. If you make a false-positive error, you switch your system over to version B, even though B is worse than A. You have reduced the quality of your system. Your costs will go up or revenue will go down, or what have you. It's worse than having done nothing at all.

When you make a false-negative error, on the other hand, you leave the system running version A. You have done no harm. You've missed out on an opportunity to improve the system by switching to B—which is better than A in this case—but you haven't made the system any worse.
:::
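To see why a separate limit on false negatives matters, here is a Monte Carlo sketch. It assumes normally distributed costs with an assumed `sd_1` of 1, an arbitrary baseline cost of 12, and a true BYSE advantage equal to `prac_sig`; none of these numbers come from the text. With a design sized only for the 1.64 threshold, the false-positive rate comes out near 5%, but a real improvement of exactly `prac_sig` is missed roughly half the time:

```python
import numpy as np

def simulate_z(num_ind, true_delta, sd_1, rng):
    # One simulated A/B test: num_ind cost measurements per exchange, where
    # BYSE's true expected cost differs from ASDAQ's by true_delta.
    ind_asdaq = rng.normal(12, sd_1, num_ind)
    ind_byse = rng.normal(12 + true_delta, sd_1, num_ind)
    delta = ind_byse.mean() - ind_asdaq.mean()
    se_delta = np.sqrt(ind_asdaq.std()**2 / num_ind + ind_byse.std()**2 / num_ind)
    return delta / se_delta

rng = np.random.default_rng(17)
sd_1 = 1.0
prac_sig = 0.2
sd_1_delta = np.sqrt(2) * sd_1
num_ind = int(np.ceil((1.64 * sd_1_delta / prac_sig)**2))  # sized for the 1.64 threshold only

false_positive_rate = np.mean(
    [simulate_z(num_ind, 0.0, sd_1, rng) < -1.64 for _ in range(10_000)]
)
false_negative_rate = np.mean(
    [simulate_z(num_ind, -prac_sig, sd_1, rng) >= -1.64 for _ in range(10_000)]
)
print(false_positive_rate)  # close to 0.05
print(false_negative_rate)  # close to 0.5 -- a real improvement is missed about half the time
```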
![](https://hackmd.io/_uploads/Sy51x3GT2.png)

The left panel, figure 2.10a, depicts the rule for deciding whether an aggregate measurement is statistically significant, which limits false positives. In figure 2.10b, we simultaneously require that if the true distribution has a nonzero expectation—if BYSE really is cheaper than ASDAQ—the probability of the z value falling to the right of the vertical line is only 20%. `z0 > 0`, so `-z0`, a negative number, is the center of the left-hand histogram in figure 2.10b.

The two requirements on the vertical line—which we use to decide whether to switch to BYSE or stick with ASDAQ—are:

- It must be 1.64 to the left of `z = 0`, where `z = 0` is the expectation of the hypothetical distribution where BYSE and ASDAQ have the same cost. This limits the false-positive rate to 5%.
- It must be 0.84 to the right of `-z0`, where `-z0` is the expectation of the hypothetical distribution where BYSE is cheaper than ASDAQ. This limits the false-negative rate to 20%.

For both statements to be true, we must have `0 - 1.64 = -z0 + 0.84`, or `z0 = 2.48`. This is the design criterion. When BYSE really is cheaper by `prac_sig`, we have `z0 = prac_sig / se_delta = np.sqrt(num_ind) * prac_sig / sd_1_delta`, and the design criterion requires `z0 >= 2.48`. Solving for `num_ind`:

```
np.sqrt(num_ind) * prac_sig / sd_1_delta >= 2.48
num_ind >= (2.48 * sd_1_delta / prac_sig)**2
```

A/B test design with power analysis:

```python
def ab_test_design_2(sd_1_delta, prac_sig):
    # Minimum number of individual measurements per exchange for a 5%
    # false-positive rate and a 20% false-negative rate.
    num_individual_measurements = (2.48 * sd_1_delta / prac_sig)**2
    return np.ceil(num_individual_measurements)

np.random.seed(17)
sd_1_asdaq = np.array([trading_system("ASDAQ") for _ in range(100)]).std()
sd_1_byse = sd_1_asdaq
sd_1_delta = np.sqrt(sd_1_asdaq**2 + sd_1_byse**2)
prac_sig = 1.0
ab_test_design_2(sd_1_delta, prac_sig)
```

## Measure

You randomize to remove any biases from these measurements.

```python
def measure(min_individual_measurements):
    ind_asdaq = []
    ind_byse = []
    # Keep measuring until both exchanges have at least the minimum number
    # of individual measurements.
    while (
        len(ind_asdaq) < min_individual_measurements
        or len(ind_byse) < min_individual_measurements
    ):
        # Randomize: each trade goes to ASDAQ or BYSE with equal probability.
        if np.random.randint(2) == 0:
            ind_asdaq.append(trading_system("ASDAQ"))
        else:
            ind_byse.append(trading_system("BYSE"))
    return np.array(ind_asdaq), np.array(ind_byse)
```

## Analyze

First, you estimate the difference in estimated expected costs, the aggregate measurement `ind_byse.mean() - ind_asdaq.mean()`. If its magnitude is greater than the practical significance level, `prac_sig`, BYSE has, so to speak, passed the first test. Then check for statistical significance:

```python
def analyze(ind_asdaq, ind_byse):
    agg_asdaq = ind_asdaq.mean()
    se_asdaq = ind_asdaq.std() / np.sqrt(len(ind_asdaq))

    agg_byse = ind_byse.mean()
    se_byse = ind_byse.std() / np.sqrt(len(ind_byse))

    delta = agg_byse - agg_asdaq
    se_delta = np.sqrt(se_asdaq**2 + se_byse**2)
    z = delta / se_delta
    return z
```

If this result is statistically significant (`z < -1.64`), BYSE has passed the second test. Since BYSE's aggregate measurement is both practically and statistically significantly lower than ASDAQ's, you decide to reconfigure your production trading system to direct all trades to BYSE. You're fairly confident this is a good idea, but you recognize that, unavoidably, there's still a 5% chance BYSE is not better than ASDAQ.
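Putting the three stages together, a hypothetical end-to-end run might look like the sketch below. It assumes the `trading_system` simulator sketched earlier plus the `ab_test_design_2`, `measure`, and `analyze` functions defined above; the switch rule (`delta < -prac_sig` and `z < -1.64`) restates the practical- and statistical-significance checks:

```python
np.random.seed(17)

# Design: size the experiment for prac_sig = 1.0 using logged ASDAQ costs.
prac_sig = 1.0
sd_1_asdaq = np.array([trading_system("ASDAQ") for _ in range(100)]).std()
sd_1_byse = sd_1_asdaq
sd_1_delta = np.sqrt(sd_1_asdaq**2 + sd_1_byse**2)
num_ind = ab_test_design_2(sd_1_delta, prac_sig)

# Measure: randomize trades between the two exchanges.
ind_asdaq, ind_byse = measure(num_ind)

# Analyze: switch only if the difference is practically and statistically significant.
delta = ind_byse.mean() - ind_asdaq.mean()
z = analyze(ind_asdaq, ind_byse)
if delta < -prac_sig and z < -1.64:
    print("Switch to BYSE")
else:
    print("Stick with ASDAQ")
```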
## Example

Let's work through an example. Suppose you're running an A/B test to compare the click-through rates (CTR) of two versions of an advertisement: the control version and the treatment version. You want to detect a minimum difference of 0.5% in CTR with a significance level of 0.05 (95% confidence) and a power of 0.80. Assume that the estimated standard deviation $\sigma$ is 2.0%. The critical value for a two-tailed test at the 0.05 significance level is $Z_{\alpha/2} = 1.96$, and the critical value corresponding to 0.80 power is $Z_{\beta} = 0.84$.

The per-group sample size for comparing two means with a minimum detectable difference $\delta$ is

$$
n = \frac{2\,(Z_{\alpha/2} + Z_{\beta})^2\,\sigma^2}{\delta^2}
$$

Plugging in these values:

$$
n = \frac{2 \cdot (1.96 + 0.84)^2 \cdot (0.02)^2}{(0.005)^2} \approx 251
$$

So you would need approximately 251 measurements in each group (control and treatment) to detect a minimum difference of 0.5% in CTR with a significance level of 0.05 and a power of 0.80.
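The same calculation is easy to script. A small sketch (the helper name `sample_size_per_group` is just for illustration):

```python
import numpy as np

def sample_size_per_group(sigma, min_detectable_diff, z_alpha=1.96, z_beta=0.84):
    # Per-group sample size for comparing two means with equal variance.
    n = 2 * (z_alpha + z_beta)**2 * sigma**2 / min_detectable_diff**2
    return int(np.ceil(n))

print(sample_size_per_group(sigma=0.02, min_detectable_diff=0.005))  # 251
```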