# STAR DIGITAL - ASSESSING THE EFFECTIVENESS OF DISPLAY ADVERTISING

### Business Overview

Star Digital is a multimedia video service provider with over $100 million in annual advertising spend and a growing focus on online advertising. Understanding the return on investment of each of its media investments has been central to its spending decisions. To assess the effectiveness of its digital display advertising, Star Digital designed a controlled experiment around an online campaign. Through this experiment, Star Digital wanted to address the following key questions:

1. Is online advertising effective for Star Digital?
2. Is there a frequency effect of advertising on purchase?
3. Which sites should Star Digital advertise on? In particular, should it invest in Site 6 or in Sites 1 through 5?

This document discusses these key questions, the experiment setup, the threats to causal inference, and the analysis performed to answer them.

### Experiment

Acknowledging the potential issues in measuring digital campaigns, Star Digital designed a controlled experiment for the above marketing campaign. The campaign was scheduled to run on six websites, with the primary objective of increasing subscription package sales. The experiment setup has two sources of variation. First, it established randomly assigned treatment and control groups with a 90%-10% split of the overall customer base. Star Digital arrived at this split based on factors such as the baseline conversion rate, campaign reach, the minimum expected lift, and other statistical tests, keeping the opportunity cost in mind. Customers in the treatment group were exposed to the actual campaign ad as the call to action, while the control group was shown a charity ad in its place to ensure no spillover between the two cohorts. The second source of variation is the set of websites on which the ad was displayed.
The websites differ in the cost of serving the ad: the first type of websites (Sites 1-5) charges $25 per thousand impressions, while the other (Site 6) charges $20 per thousand impressions.

### Threats to Causal Inference

**Selection Bias:** The primary threat of selection bias arises when selecting the treatment and control groups. In this experiment, Star Digital avoided selection bias by using its entire customer population and assigning treatment and control at random. To obtain the dataset, a choice-based division of the population into purchase and no-purchase groups was done, and the sample was drawn randomly from each. Since both the treatment-control split and the sampling were random, the impact of selection bias is negated. We further verify this through a randomization check as part of our analysis.

**Omitted Variable Bias:** The current dataset contains only impression-related information. There might be other external variables that are correlated with both the impression count and the purchase decision.

**Simultaneity Bias:** There is no simultaneity bias in the experiment, since we find no reason to believe that purchasing the subscription package would cause an increase in ad viewing.

**Measurement Error:** We assume there is no measurement error, as the only variable captured at the user level, impressions, is not difficult to track.

**Awareness:** We assume that consumers are not aware of being part of the experiment, since the mere display of an ad provides no added value that would draw customers' attention to the experiment itself.

### Data Exploration and Overview

We were provided a sample of the overall data, tracking purchase decision outcomes at the customer level. It also records whether the customer has seen the relevant ad and its impression frequency split across websites.
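Since the raw `starDigital.csv` file is not reproduced here, the following sketch uses a toy data frame with entirely made-up values to illustrate the schema that the transformations in the next section assume (one row per customer, with the purchase outcome, the test/control flag, and per-site impression counts `imp_1`..`imp_6`):

```R
# Toy illustration of the expected schema (hypothetical values, not real data).
toy <- data.frame(
  id       = 1:4,
  purchase = c(0, 1, 0, 1),   # 1 = bought the subscription package
  test     = c(1, 1, 0, 1),   # 1 = treatment (campaign ad), 0 = control (charity ad)
  imp_1 = c(2, 0, 1, 5), imp_2 = c(0, 3, 0, 1),
  imp_3 = c(1, 1, 0, 0), imp_4 = c(0, 0, 2, 2),
  imp_5 = c(0, 4, 1, 0), imp_6 = c(3, 0, 0, 7)
)
str(toy)  # 4 obs. of 9 variables
```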
###### Importing necessary packages

```R
library(dplyr)
library(ggplot2)
library(naniar)  # for vis_miss()
library(pwr)     # for the power test
```

###### Data Transformations

After loading the dataset, we noticed that the total number of impressions a treatment or control user has seen across all six websites is not captured. Hence, we captured the overall impressions per user in the variable "tot_impressions".

```R
data = read.csv("starDigital.csv")
```

```R
data <- data %>%
  rowwise() %>%
  mutate(tot_impressions = sum(imp_1, imp_2, imp_3, imp_4, imp_5, imp_6))
```

#### Sanity Checks

##### Missing Values

Before moving on to the analysis, we quickly performed a few sanity checks on the dataset. We checked the variables for missing values and the percentage of missing values, if any.

```R
vis_miss(data)
```

![](https://i.imgur.com/x2sDXu9.png)

As seen in the graph above, no variable contains null values, so no imputation is required.

##### Randomization Check

Although Star Digital assigned test and control randomly to prevent selection bias, the analyzed sample could still, by chance, carry some bias. Hence, we wanted to check the assumption that the test and control groups have similar exposure to the internet. This assumption matters because if the test and control users in the sample are not comparable in their overall internet exposure, the causal inference would not be accurate. Before running any statistical test, we first checked the distributions of overall impressions for both test and control.
```R
data_test <- data %>% filter(test == 1)
data_control <- data %>% filter(test == 0)

par(mfrow = c(1, 2), oma = c(0, 0, 2, 0))
hist(data_test$tot_impressions, xlab = 'Total Impressions', main = "Treatment")
hist(data_control$tot_impressions, xlab = 'Total Impressions', main = "Control")
# mtext() must be called after the plots exist
mtext("Distributions of Overall Impressions", outer = TRUE, cex = 1.5)
```

![](https://i.imgur.com/y9hZoWb.png)

From the plots above, we see that the treatment and control groups have similar distributions, i.e. both are heavily right-skewed. To establish this statistically, we ran a t-test on total impressions between the treatment and control users.

```R
t.test(data$tot_impressions ~ data$test)
```

![](https://i.imgur.com/30e6Tn0.png)

The results indicate that the average impressions for test and control are similar and not significantly different from each other. Based on the distributions and t-test results above, we establish the similarity between treatment and control; hence we do not perform any matching techniques, and any difference between treatment and control can be interpreted as a causal effect.

##### Power Test

For the current experiment, the treatment and control groups contain around 23K and 2.6K users respectively. Given this sample size, we ran a power test to determine the minimum effect size that can be detected, with the probability of incorrectly rejecting the null hypothesis (that the two groups are the same) set at 0.05, and the probability of correctly rejecting the null hypothesis set at 0.8 (a commonly used value). Under these conditions, the minimum detectable lift is 5.7%.
Any lift we observe below 5.7% should be treated with caution.

```R
tst_count = data %>% filter(test == 1) %>% select(id) %>% unique() %>% nrow()
control_count = data %>% filter(test == 0) %>% select(id) %>% unique() %>% nrow()
pwr.2p2n.test(n1 = control_count, n2 = tst_count, sig.level = .05, power = .8)
```

![](https://i.imgur.com/CgzvGoH.png)

### Data Analysis

#### Is online advertising effective for Star Digital?

We started with our first key question: determining the causal impact of seeing Star Digital's ad on the purchase decision. We fit a logistic regression with the purchase flag as the dependent variable and the treatment-vs-control flag as the predictor.

```R
glm(purchase ~ test, data = data, family = binomial(link = "logit"))
```

![](https://i.imgur.com/zplseu6.png)

We found that being in the treatment group and seeing the relevant ad increases the odds of the customer purchasing the Star Digital subscription package by 7.6%, compared to being in the control group and seeing charity ads. This result is statistically significant at the 90% confidence level.

#### Does increasing frequency increase the probability of purchase?

Having established the causal impact of display advertising, we then analyzed how the frequency of the advertisements drives the purchase decision. The target variable is still the purchase flag, while the predictors are the test-vs-control flag, the (log-transformed) overall ad frequency, and the interaction between the two. The interaction term captures the effect of ad frequency specifically for users in the test group.

```R
model_freq_log = glm(purchase ~ test + log(tot_impressions) + test*log(tot_impressions),
                     family = "binomial", data = data)
summary(model_freq_log)
```

![](https://i.imgur.com/Rg1Sd56.png)

```
# Coefficient on the interaction term = 0.07. This is the change in log odds.
# Taking e^0.07 - 1 = 0.0725, we get the change in odds, i.e. 7.25%.
```

For treated users, each unit increase in log(total impressions), i.e. each e-fold increase in impressions, raises the odds of purchase by roughly 57%.

```
# Calculation: sum of coefficients = 0.38 + 0.07 = 0.45. This is the change in log odds.
# Taking e^0.45 - 1 = 0.57, we get the change in odds, i.e. roughly 57%.
```

#### Which sites should Star Digital advertise on? In particular, should it invest in Site 6 or Sites 1 through 5?

Having established the causal impact of online advertising for Star Digital in the prior analysis, we now focus on guiding Star Digital toward the websites that offer better returns. Here the variation is in the cost per thousand impressions. Instead of using the raw impression counts for the two classes of websites, we estimated the cost per class and substituted it into the regression equation. The model also includes interaction terms between the test variable and each of these cost measures.

```R
data$sum1to5 <- data$imp_1 + data$imp_2 + data$imp_3 + data$imp_4 + data$imp_5
data$cost_1to5 <- data$sum1to5 * (25/1000)
data$cost_6 <- data$imp_6 * (20/1000)
```

```R
# adding a small value of 0.0001 to avoid log(0)
model_1to5_6 = glm(purchase ~ test + log(cost_1to5 + 0.0001) + log(cost_6 + 0.0001) +
                     test*log(cost_1to5 + 0.0001) + test*log(cost_6 + 0.0001),
                   family = "binomial", data = data)
summary(model_1to5_6)
```

![](https://i.imgur.com/6wksdXY.png)

For the Class 1 websites (Sites 1-5), which charge $25 per thousand impressions, we see that for each unit increase in log(dollars invested) the odds of purchase increase by about 16%.

```
# Calculation: sum of coefficients = 0.13 + 0.025 = 0.155. This is the change in log odds.
# Taking e^0.155 - 1 = 0.17,
# we get the change in odds, roughly 16-17%.
```

For the Class 2 website (Site 6), which charges $20 per thousand impressions, we observe that for each unit increase in log(dollars invested) the odds of purchase increase by 5.4%.

```
# Calculation: sum of coefficients = 0.023 + 0.03 = 0.053. This is the change in log odds.
# Taking e^0.053 - 1 = 0.054, we get the change in odds, i.e. 5.4%.
```

From the above, investing in Class 1 (Sites 1 through 5) is more effective: it yields roughly a 16% increase in the odds of purchase per unit of log spend, compared to only 5.4% for Site 6. Hence Star Digital should invest in Sites 1 through 5.
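The odds-change arithmetic used throughout this analysis can be reproduced directly in R. Note the coefficient values below are the rounded figures quoted above (and 0.073 for the first model is back-solved from the quoted 7.6%, since the exact coefficient appears only in the regression output image), so the results are approximations:

```R
# Helper: convert a change in log odds into a proportional change in odds.
odds_change <- function(log_odds) exp(log_odds) - 1

# Q1: treatment effect on purchase (implied test coefficient ~0.073).
round(100 * odds_change(0.073), 1)          # ~7.6% increase in odds

# Q2: slope on log(impressions) for treated users (main effect + interaction).
round(100 * odds_change(0.38 + 0.07), 1)    # ~57% increase in odds per unit log-impressions

# Q3: per-unit-log-spend effects for the two site classes.
round(100 * odds_change(0.13 + 0.025), 1)   # Sites 1-5: ~17%
round(100 * odds_change(0.023 + 0.03), 1)   # Site 6:    ~5.4%
```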