# Case study for Business Intelligence Intern -- Marketing B2C

If you have ever logged in to Quero Bolsa, you certainly know "Carlos, from Quero Bolsa." Good ol' Carlos is our CRM expert; he is the one in charge of sending our customers ~~spam~~ meaningful content about their scholarship interests. The problem is that Carlos is not sure whether the emails he sends are effective or whether their only use is to annoy people with the company's hard-earned cash. In this case study, we will help Carlos keep his job (or not) by inferring the efficacy (or lack thereof) of his email campaigns.

## Modeling a test

The efficacy of an email, in our cold capitalist world, is measured by the _additional_ profit that that email delivers. I should have put "additional" in a glowing unicorn-vomiting-rainbow font! It means absolutely nothing if someone who receives your email buys something later. For all we know, you might as well have handed the email to clients already on the checkout page. Of course they are going to buy!

The right way to measure this additional profit would be to discover what would have happened had each person not received their email. Of course, this would require a time machine, or the aliens from _Edge of Tomorrow_. Until we find a way around this small inconvenience, we will have to _infer_ what would have happened.

This is how we do it: we choose some people to whom we would send an email and purposefully "forget" to send it. Then, we compare their behavior against the behavior of the people who actually received the email. Our assumption is that both groups "react the same", that is, that they would act identically in the same situation. Thus, we get a glimpse of what _could_ have happened to either group in any alternate reality.

So, what could have happened to the people who received Carlos' email? Well, he could have used his copywriting skills to cunningly affect his audience in three effective ways, namely:

1. He could convince them to buy the product when they otherwise would not. With more people buying, more revenue for Quero Bolsa.
2. He could convince them to buy sooner rather than later, instilling what is called a _sense of urgency_, or just reminding people that we exist, after all. While this seems a legitimate way of bringing in more revenue, it is only so when it also makes people buy more, which is already covered by (1). Buying faster by itself amounts to nothing, even when time preference is taken into account (time preference only matters at a greater order of magnitude).
3. He could convince them to buy more expensive products. With the same number of people buying more expensive things, more revenue for Quero Bolsa.

Since only (1) and (3) actually result in more revenue, we can get away with not modeling (2). Don't get me wrong: buying time is very important if you want a good enough result fast, which is what our company needs. However, we will make things easy here; you may learn the full solution if you are admitted.

So, consider that we have sent Carlos' email to $N_t$ customers and kept $N_c$ customers in our control group. After that, we observed $n_t$ conversions in the first group, each at a different price $P_{ti}$, and $n_c$ conversions in the second one, each at a different price $P_{ci}$. From our prior knowledge, we know that there exists a _hidden_ probability of conversion for each group, distributed according to

$$p_t, p_c \sim \mathrm{Beta}(\alpha, \beta)$$

where $\alpha$ and $\beta$ are known constants.
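As a quick sanity check, this prior can be inspected directly with `scipy.stats`. The sketch below uses made-up placeholder hyperparameters, not the real $\alpha$ and $\beta$ supplied with the case data:

```python
from scipy import stats

# Placeholder hyperparameters for illustration only; the real alpha and
# beta are given with the case data.
alpha, beta = 2.0, 50.0
prior = stats.beta(alpha, beta)

# Both hidden conversion probabilities share this same prior.
p_t, p_c = prior.rvs(size=2, random_state=42)

print(prior.mean())  # E[p] = alpha / (alpha + beta)
```

Plotting `prior.pdf` over a grid is a good way to check that the chosen hyperparameters encode a plausible conversion rate before running any inference.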
With these probabilities fixed, the number of conversions (people buying products) is basically a skewed coin toss:

$$n_t | p_t \sim \mathrm{Binomial}(p_t, N_t) \\ n_c | p_c \sim \mathrm{Binomial}(p_c, N_c)$$

Lastly, each group has a "price parameter" which controls the price preference in that group:

$$\beta_t, \beta_c \sim \mathrm{Gamma}(\alpha_p, \beta_p)$$

The dependency is as follows:

$$P_{ti} | \beta_t \sim \mathrm{Gamma}(\alpha_p, \beta_t) \\ P_{ci} | \beta_c \sim \mathrm{Gamma}(\alpha_p, \beta_c)$$

Unless explicitly stated otherwise, all variables are jointly independent. All definitions of these distributions are available on Wikipedia (the notation is also the same). As stated, $N_t$ and $N_c$ are chosen ahead of the experiment, while $\alpha$, $\beta$, $\alpha_p$ and $\beta_p$ are known constants.

Now, it's your turn: using Bayes' theorem, write the _joint_ posterior probability distribution $p(p_t, p_c, \beta_t, \beta_c | n_t, n_c, P_{ti}, P_{ci})$. To make things easier, note that, because certain things are independent, this distribution can be broken down into two independent chunks, like so:

$$p(p_t, \beta_t | n_t, P_{ti}) \times p(p_c, \beta_c | n_c, P_{ci})$$

and, because our model is symmetric, both chunks are also symmetric.

## Implementing the test

Now comes the tricky part: suppose we have a new batch of customers to which we wish to send the same email. There are two actions we might take:

1. send the emails and pay a small cost $c$ per customer;
2. not send the emails, sparing the work and the money, but possibly losing a chance to generate revenue.

Each option has its pros and cons. However, we have the inferred data from our previous test. Note that it is _inferred_, so there is a margin of error associated with it. Therefore, we might always _be wrong_ in our analysis. If we do not pick the best option, we will pay the difference in profit per customer between our choice and the better one. The _expected value_ of this cost is called _regret_.
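To make this definition concrete, here is a toy sketch of regret. All the profit numbers below are invented (in the real solution they would come from your posterior samples, with the email cost netted out of the "send" profits); the point is only the mechanics: in each simulated scenario, an action's shortfall is the gap to whichever action turned out better, and its regret is the average of that shortfall:

```python
import numpy as np

rng = np.random.default_rng(1)
n_scenarios = 10_000

# Invented per-customer profit samples, one per simulated scenario.
profit_send = rng.normal(loc=1.2, scale=0.5, size=n_scenarios)
profit_dont = rng.normal(loc=1.0, scale=0.5, size=n_scenarios)

# Per-scenario profit of the better option, then the average shortfall.
best = np.maximum(profit_send, profit_dont)
regret = {
    "send": float(np.mean(best - profit_send)),
    "dont_send": float(np.mean(best - profit_dont)),
}
print(regret)  # "send" ends up with the smaller regret in this toy setup
```

Note that both regrets are nonnegative, and even the better action can have a strictly positive regret, because there are scenarios where the other action would have paid off more.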
Of course, our best choice is to pick the option (either _send_ or _don't send_) with the smallest regret. However, in our case, the regrets cannot be calculated analytically. Therefore, you will have to run a Monte Carlo simulation, sampling scenarios from your posterior distribution and calculating the regret of each choice in each scenario. The average regret over all simulated scenarios will slowly converge to the true regret value.

Write a Python 3 program that performs such a simulation. You may use any package available in the PyPI registry. In particular, functions from `scipy.stats` (or `numpy.random`) will make your life much easier.

The input will be a JSON file named `input.json` containing a dictionary with the following keys:

* `n_control`: the number $N_c$ of users that were not sent an email.
* `conversions_control`: the list of all prices paid by buyers that were not sent an email.
* `n_test`: the number $N_t$ of users that were sent an email.
* `conversions_test`: the list of all prices paid by buyers that were sent an email.
* `unit_cost`: the cost of sending each email.
* `alpha`: the parameter $\alpha$ from the model.
* `beta`: the parameter $\beta$ from the model.
* `alpha_price`: the parameter $\alpha_p$ from the model.
* `beta_price`: the parameter $\beta_p$ from the model.

The output of your program will be a JSON dictionary printed to STDOUT with the following keys:

* `best_action`: the name of the best action to be taken (either `send` or `dont_send`).
* `regret`: a dictionary with the _per customer_ regret calculated for each action. The keys are the action names (as above) and the values are the corresponding regrets.

The number of samples in the Monte Carlo simulation is at your discretion; just choose a big enough constant and don't sweat it.

## Does Carlos keep his job?

That is the question we have been waiting for.
Based on your findings and on the numbers we have supplied, do you think a CRM strategy (a.k.a. sending emails to people) is worthwhile? Do you have enough data to come to a conclusion? What are the possible weaknesses of the proposed model? Discuss and explore your findings.