<p style="font-size:2em; text-align: center;"><b>Comparison between Different Hypothesized Models and Priors for Thompson Sampling</b></p> <p style="text-align: center;">Pengyu Chen, Peilin Sun, and You-Syuan Liou </p>

## Abstract

This report gives an introduction to the multivariate multi-armed bandit (MAB) in the context of identifying the most appropriate web-based content. Our target audience is undergraduate computer science students who are unfamiliar with MAB. We expect the audience to gain a fundamental understanding of the principles of MAB, to learn how different priors and hypothesized models affect performance, and to be able to implement the method based on the code we provide. The report can be evaluated from three aspects: the application with various priors, its legibility, and the sample code in the appendix. Our work is based on [1], and we extend its algorithm by adding different prior assumptions about the model parameters. We conduct a set of simulation experiments and present empirical results using interactive web apps to help the audience understand how the algorithm behaves as we vary the parameters of the simulated data. The results show the suitability of the algorithm for combinatorial decision-making problems that involve a large decision space with strong interactions between content choices.

## 1 Introduction and Background

Suppose we want to design a promotional web page that consists of several components, where each component is a separate widget with several alternative contents. For example, a promotional message may contain a title, offer details, an image, and a call-to-action button. Our goal is to find the layout that maximizes the conversion rate. This can be seen as a decision-making problem that involves distinct but interdependent decisions which jointly determine the layout.

### 1.1 Exploitation vs Exploration

Decision making typically involves a choice between exploitation and exploration. Exploitation refers to sticking with the best-so-far option given current information, whereas exploration refers to gathering more information that might lead us to better decisions in the future. The exploitation/exploration dilemma exists in many aspects of our lives. For example, when you are about to have lunch, you can either "exploit" – choose your favorite restaurant – or "explore" – try something you haven't had before, which may become your new favorite or something you will never order again. Similarly, in the context of web page design, one may want to balance the known most attractive design against new ones that could be more successful.

The dilemma comes from incomplete information about the environment: we need to collect more information to find the best overall strategy while keeping the cost of exploring under control. This means that sometimes you have to take an action which you think will be less rewarding, for the purpose of learning more about the environment. The dilemma is in fact a trade-off: you can either be conservative and select what you already know, or take a risk to find something better.

### 1.2 Multi-armed bandit

The multi-armed bandit, also known as the $k$-armed bandit, is a classic model for studying the exploration versus exploitation trade-off in sequential decision problems. Consider the following learning problem. You are in a casino facing multiple slot machines. Each machine is configured with an unknown probability of yielding a reward on a single play. Your objective is to maximize the expected total reward over some number of time steps, for example, 1000 plays.

Now let's formalize this mathematically. A Bernoulli multi-armed bandit can be described as a tuple $\langle A, R \rangle$, where:

- We have $K$ machines with reward probabilities $\{\theta_1, \dots, \theta_K\}$.
- At each time step $t$, we take an action $a$ on one slot machine and receive a reward $r$.
- $A$ is a set of actions, each referring to the interaction with one slot machine. The value of action $a$ is defined as the expected reward $Q(a)=E[r|a]=\theta$. If action $a_t$ at time step $t$ is on the $i$-th machine, then $Q(a_t)=\theta_i$.
- $R$ is a reward function. At time step $t$, $r_t=R(a_t)$ returns 1 with probability $Q(a_t)$ and 0 otherwise.

This is a simplified version of a Markov decision process, as there is no state $S$. Our goal is to maximize the cumulative reward $\sum^{T}_{t=1}r_t$. The optimal reward probability $\theta^*$ of the optimal action $a^*$ with the best reward is:

$$
\theta^*=Q(a^*)=\max_{a\in A}Q(a)=\max_{1\leq i \leq K}\theta_i
$$

Our loss function is the cumulative regret due to not selecting the optimal action:

$$
L_T=E\left[\sum^{T}_{t=1}\left(\theta^*-Q(a_t)\right)\right]
$$

If we knew the value of each action, it would be trivial to minimize the loss, since we could always select the action with the highest value. Unfortunately, most of the time we don't know the values of the actions, although we may have estimates. We denote the estimated value of action $a$ at time step $t$ as $q_t(a)$, and we would like $q_t(a)$ to be close to $Q(a)$. At any time step, there is at least one action whose estimated expected reward is largest; we call these the greedy actions. Again, we are faced with the exploitation/exploration dilemma: selecting greedy actions (exploitation) is the right thing to do to maximize the expected reward on the current step, but selecting one of the non-greedy actions (exploration) improves our estimate of that action's value and may therefore lead to a greater total reward in the long run. We attempt to resolve this dilemma with Thompson sampling, a method that balances exploitation and exploration.
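Before turning to the web-page application, here is a minimal sketch of Thompson sampling on the Bernoulli bandit just described. Each arm gets a Beta prior on its reward probability, so the posterior update is a simple success/failure count; the arm probabilities in `true_theta` are made-up values for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
true_theta = np.array([0.2, 0.5, 0.7])    # unknown to the learner; made up for illustration
K, T = len(true_theta), 1000

# Beta(1, 1) prior on each arm's theta: alpha tracks successes, beta tracks failures
alpha, beta = np.ones(K), np.ones(K)
cumulative_regret = 0.0

for t in range(T):
    theta_sample = rng.beta(alpha, beta)       # one posterior draw per arm
    a = int(np.argmax(theta_sample))           # play the arm that looks best under this draw
    r = rng.binomial(1, true_theta[a])         # observe a Bernoulli reward
    alpha[a] += r                              # conjugate posterior update
    beta[a] += 1 - r
    cumulative_regret += true_theta.max() - true_theta[a]

print(f"Cumulative regret after {T} plays: {cumulative_regret:.1f}")
```

Arms whose posteriors are still wide occasionally produce large draws and therefore get played, which is exactly the exploration mechanism; the contextual algorithm in Section 2 uses the same idea, but draws a posterior sample of regression weights instead of per-arm Beta posteriors.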
## 2 Application Example

### 2.1 Problem Setting

We define the web page design problem as follows: we want to select a website layout $A$ under a context $X$. We try to find the optimal combination of widget contents to present when users browse the web page, in order to maximize the reward $R$, which can be a click-through rate, purchase rate, or purchase amount.

<center> <img src="https://i.imgur.com/Q9Fbg5a.png" alt="drawing" width="500"/> <figcaption>Fig 1. Example of a webpage layout</figcaption> </center> <br><br>

Each layout $A$ follows a common template that contains $D$ widgets, each having $N$ alternative selections. Note that we assume all widgets have the same number of alternatives in our experiment. Thus, our web page template can produce $N^D$ possible layouts. We represent $A$ as a $D$-dimensional vector, $A\in \{1,2,\dots,N\}^D$, where $A[i]$ denotes the selection for the $i^{th}$ widget.

Context $X$ stores user- and session-related information that may have an impact on a layout's expected reward, e.g., user history, the user's age, or device type. We assume context $X$ is drawn from a fixed but unknown distribution and represent it as a vector of length $L$, where each dimension can take one of $G$ values. Let $X_l$ be the $l^{th}$ feature of the context.

The feature vector $B\in R^M$, obtained by combining $X$ and $A$, contains the layout and user information and possibly their interactions. The expected reward $R_{A,X}$ for a given layout and context is determined by a linear combination of $B$ and a fixed unknown weight vector $\mu \in R^M$. Thus, we can model the reward with a generalized linear model:

$$
E[R|A, X]=g(B^{T}\mu),
$$

where $g$ is the GLM's link function.

Now we have enough notation to formalize a contextual multi-armed bandit problem. The contextual multi-armed bandit extends the multi-armed bandit by incorporating context information that may affect the expected reward of an action. Each distinct layout can be viewed as an arm, so we have $N^D$ arms in total. At time step $t$, a context $X_t$ is received, and a feature vector $B_{A,X_t}$ is generated for each arm $A$. An arm $A_t$ is selected and displayed, and its corresponding reward $R_t$ is observed. We denote the optimal arm at time $t$ as:

$$
A^*_{t}=\arg\max_{A}E[R_{A, X_t}]
$$

Thus, the regret at time $t$ for not choosing the optimal arm is:

$$
L_t = E[R_{A^*_t, X_t}]-E[R_{A_t, X_t}]
$$

Our ultimate goal is to approximate the solution of $A^*_{t}$ effectively by estimating $\mu$ while minimizing the cumulative regret over $T$ time steps:

$$
L^T=\sum_{t=1}^{T}L_t
$$

In the remainder of the report, for notational convenience, we assume the reward $R$ and the feature vector elements are binary: $B\in \{0,1\}^M$ and $R\in\{-1,+1\}$. We use the probit link function, i.e., the standard normal cumulative distribution function $\Phi$, for our generalized linear model. Let $W$ represent the estimate of $\mu$. Our binary-response probit model is:

$$
P(R|A, X)=\Phi(R \cdot B^TW).
$$

We assume that the regression parameters $W$ are mutually independent random variables with priors of a certain type; possible priors include Gaussian, Laplace, and horseshoe. We update the posterior distribution of $W$ using approximate inference with MCMC. To limit the number of interaction terms estimated, we assume three-way and higher-order interactions are effectively zero and only consider pairwise interaction effects.
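To make the feature construction and the probit reward model concrete, here is a minimal sketch. The dimensions ($D=3$, $N=2$, $L=2$, $G=2$) and the helpers `one_hot` and `feature_vector` are hypothetical choices for illustration; the `Utils.get_all_dummy` method in the appendix builds the same kind of vector (intercept, action dummies, context dummies, and pairwise interactions) for the actual experiments.

```python
import numpy as np
from scipy.stats import norm

D, N = 3, 2   # hypothetical: 3 widgets with 2 alternatives each
L, G = 2, 2   # hypothetical: 2 context features with 2 levels each

def one_hot(values, levels):
    """Dummy-code each entry, dropping the last level as the reference category."""
    return np.concatenate([np.eye(levels)[v, :levels - 1] for v in values])

def feature_vector(A, X):
    a, x = one_hot(A, N), one_hot(X, G)
    # with N = 2 each widget contributes one dummy, so the upper triangle of the
    # outer product contains exactly the cross-widget interaction indicators
    aa = np.outer(a, a)[np.triu_indices(len(a), k=1)]
    ax = np.outer(a, x).ravel()                       # widget-context interactions
    return np.concatenate(([1.0], a, x, aa, ax))      # intercept first

A = np.array([0, 1, 0])       # one layout: a selection for each widget
X = np.array([1, 0])          # one context
B = feature_vector(A, X)
W = np.zeros(len(B))          # stand-in weights; the algorithm uses posterior samples here
print("P(R = +1 | A, X) =", norm.cdf(B @ W))   # probit model Phi(B^T W); 0.5 with zero weights
```

With these dimensions, $B$ has $M = 1 + 3 + 2 + 3 + 6 = 15$ entries: the intercept, $D(N-1)$ action dummies, $L(G-1)$ context dummies, $D(D-1)(N-1)^2/2$ widget-widget interactions, and $DL(N-1)(G-1)$ widget-context interactions.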
### 2.2 Algorithms

#### 2.2.1 Thompson Sampling for Contextual Bandits

In our setting, we would like to show users the best layout of promotional messages. Since we have limited information about users' preferences, we need to trade off between exploiting layouts that are known to be preferred and exploring new layouts whose appeal is unknown. Thompson sampling is an algorithm that addresses this trade-off.

In the beginning, without any data available, randomly generated layouts $A_0$ are assigned to users with contexts $X_0$. By recording user behaviour (click or not), we obtain the rewards $R_0$. Based on our prior knowledge and the data collected so far, we draw a weight sample $W_1$ from the posterior of the probit regression and, by maximizing the predicted probability of reward, select a better layout $A_1$. This layout is then shown to new users $X_1$, and the rewards $R_1$ are recorded. Finally, we update the history incrementally with the new data. We repeat this for $T$ iterations, by which point the selected layouts should be close to optimal. The algorithm is as follows:

- Step 1: Randomly initialize context $X_0$ and action $A_0$.
- Step 2: Observe the reward $R_0$, which gives the history $H_0=(A_0, R_0, X_0)$.
- Step 3: Set a prior distribution for the weights $W$ of the Bayesian regression, say $\mathrm{Normal}(0, I)$.
- Step 4: For $t = 1, \dots, T$ do:
    - Given $H_{t-1}$, sample $W_t$ from the posterior $P(W|H_{t-1})$.
    - Receive a new context $X_t$.
    - Select $A_t$ by optimizing $B^{T}W_t$ over layouts.
    - Observe the reward $R_t$ for $X_t$ and $A_t$.
    - Update $H_t$ by combining $H_{t-1}$ with $(A_t, R_t, X_t)$.
#### 2.2.2 Hill Climbing Optimization

In the Thompson sampling loop above, optimizing $B^{T}W_t$ over all $N^D$ layouts is an NP-hard combinatorial problem. Instead of an exhaustive search for the best layout $A_t$, we use hill climbing to obtain an approximate solution. Hill climbing selects one widget of the layout at a time and examines all of its alternatives to find the best one while the other widgets are held fixed. In non-technical terms: we initialize a layout randomly and keep updating it by repeatedly picking a random widget and setting it to its best choice given the rest of the layout. After several such steps, the resulting layout approximates the optimal combination. To avoid being trapped near a poor initial layout, we restart from multiple random initializations and keep the layout with the highest predicted reward. The algorithm is as follows (a minimal code sketch is given after the list):

- For $s = 1, \dots, S$ do:
    - Randomly initialize a layout $A_0$.
    - For $k = 1, \dots, K$ do:
        1. Randomly choose a widget $i$ and find the alternative that maximizes the predicted reward $B^{T}W_t$ with all other widgets fixed.
        2. Update layout $A_{k-1}$ to $A_k$ by switching widget $i$ to that alternative.
- Return the best layout found among the $S$ restarts.
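Below is a minimal, self-contained sketch of this restart-based hill climbing, written against a generic score function. The names `hill_climb` and the toy quadratic `score` are ours for illustration; in the experiments the score is the sampled linear predictor $B^{T}W_t$, and the search is implemented in `Agent.hill_climbing_search` in the appendix.

```python
import numpy as np

rng = np.random.default_rng(1)

def hill_climb(score, D, N, restarts=5, steps=20):
    """Coordinate-wise search: repeatedly re-optimize one randomly chosen widget."""
    best_layout, best_score = None, -np.inf
    for _ in range(restarts):                            # random restarts to escape poor starts
        layout = rng.integers(N, size=D)                 # random initial layout A_0
        for _ in range(steps):
            i = rng.integers(D)                          # pick a widget at random
            candidates = [np.r_[layout[:i], [j], layout[i + 1:]] for j in range(N)]
            layout = candidates[int(np.argmax([score(c) for c in candidates]))]
        if score(layout) > best_score:                   # keep the best layout across restarts
            best_layout, best_score = layout, float(score(layout))
    return best_layout, best_score

# toy quadratic score with pairwise interactions, standing in for B(A, X)^T W_t
pair_weights = rng.normal(size=(4, 4))
score = lambda a: float(a.sum() + a @ pair_weights @ a)
print(hill_climb(score, D=4, N=3))
```

Each inner step only evaluates $N$ candidate layouts, so one restart costs $O(KN)$ score evaluations instead of the $N^D$ required by exhaustive search.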
## 3 Simulation

We test our implementation on a simulated dataset. Our goal is to compare performance under different hypothesized reward models and priors.

### 3.1 Simulation Scheme

In each simulation, we fix a true model that includes all features (MAB-AX2). The weight of each feature is sampled from a normal distribution whose mean is itself drawn uniformly from (-3, 3) or (-5, 5), to ensure randomness.

#### 3.1.1 Hypothesized Models

In our experiment, we set the true reward model's linear predictor to

$$B^{T}W = W_0 + W_1*A + W_2*X + W_3*X_{AA} + W_4*X_{AX} + \epsilon$$

Then, we define four hypothesized reward models to test the performance of the contextual bandit with Thompson sampling.

1. Model with only widget actions (MAB-A)
$$B^{T}W = W_0 + W_1*A$$
2. Model with both widget actions and context (MAB-AX)
$$B^{T}W = W_0 + W_1*A + W_2*X$$
3. Model with widget actions and interactions between widget actions (MAB-A2)
$$B^{T}W = W_0 + W_1*A + W_3*X_{AA}$$
4. Model with widget actions, context, interactions between widgets, and interactions between widgets and context (MAB-AX2)
$$B^{T}W = W_0 + W_1*A + W_2*X + W_3*X_{AA} + W_4*X_{AX}$$

where $A$ denotes the widget actions, $X$ the context, $X_{AA}$ the interaction terms between widgets, and $X_{AX}$ the interaction terms between widgets and context. By evaluating each of these four hypothesized models in terms of regret, we want to know how the results vary with the hypothesis.

#### 3.1.2 Priors

In the simulations, we also vary the prior distribution placed on the weights. Because these priors have different regularization effects, we want to explore how they influence the performance of contextual Thompson sampling under different conditions on the simulated data. (A condensed PyMC3 sketch of the three priors follows the list.)

1. Normal Prior

Under the Bayesian regression setting,
$$B^{T}W = W_0 + W_1A + W_2X + W_3X_{AA} + W_4X_{AX} + \epsilon,$$
where $\epsilon \sim N(0,\sigma^2)$. Let $W = (W_0, W_1, W_2, W_3, W_4)$ collect the weights of all features in the model. Suppose $(W, \sigma^2)$ has a normal-inverse-gamma prior; if $\sigma^2=1$ is known, this reduces to $W\sim N(\mu_0, I)$. In this case, the posterior mode coincides with ridge regression in the frequentist setting, so no $w_i$ is regularized exactly to 0. Despite its conjugacy, this may lead to bias in the model.

2. Laplace Prior

Also called the double-exponential distribution; if $w_i\sim \mathrm{Laplace}(\tau)$, its pdf is
$$f(w_i)=\frac{1}{2\tau}\exp\left(-\frac{|w_i|}{\tau}\right)$$
It can be shown that the solution to the loss function with the $L_1$ regularization constraint $\sum |w_i|\leq t$ can be interpreted as the posterior mode of $W$ in the Bayesian setting with a Laplace prior. Compared with a normal prior, it has wider tails and assigns more mass to the region near zero, so more features tend to be eliminated while the more informative features are kept.

3. Horseshoe Prior

Although the Laplace prior can shrink parameters to zero, it introduces bias and can reduce accuracy significantly. Thus, we also try the horseshoe prior proposed by Carvalho et al.:
$$
\begin{aligned}
W_m &\sim \mathrm{N}(0, \tau\lambda_m), \\
\lambda_m &\sim \text{Half-Cauchy}(0, 1), \\
\tau &\sim \text{Half-Cauchy}(0, \tau_0)
\end{aligned}
$$
Because of the special properties of the half-Cauchy distribution, the marginal prior density of $W_m$ diverges at 0, i.e., it has an infinite spike at zero. Therefore, with its even wider tails, the horseshoe introduces less bias while eliminating more insignificant features. We expect the horseshoe prior to perform best when the true model has zero weights on many features.
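For reference, the three priors above can be specified in PyMC3 as follows. This is a condensed sketch of the `Model.mcmc` method in the appendix; `tau_0` here is a tuning choice, and unlike the appendix code (which uses a single shared local scale), the horseshoe below uses one local scale per coefficient, as in Carvalho et al.

```python
import pymc3 as pm

def weight_prior(prior, M, tau_0=0.1):
    """Return a length-M weight vector W under the chosen prior.
    Must be called inside a `with pm.Model():` block; tau_0 is a tuning choice."""
    if prior == 'normal':          # ridge-like: no coefficient is shrunk exactly to zero
        return pm.Normal('w', mu=0, sigma=1, shape=M)
    if prior == 'laplace':         # lasso-like shrinkage toward zero
        return pm.Laplace('w', mu=0, b=1, shape=M)
    if prior == 'horseshoe':       # global-local shrinkage (Carvalho et al.)
        tau = pm.HalfCauchy('tau', beta=tau_0)          # global scale
        lam = pm.HalfCauchy('lam', beta=1, shape=M)     # one local scale per weight
        return pm.Normal('w', mu=0, sigma=tau * lam, shape=M)
    raise ValueError("unknown prior: " + prior)
```

For example, `with pm.Model(): W = weight_prior('horseshoe', M=20)` sets up the weights before the probit likelihood is added, as in `Model.get_trace` in the appendix.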
### 3.2 Experimental Results

Results of the simulation experiments are shown in the figures below. As stated in Section 3.1, the true underlying model accounts for widgets, contextual information, interactions between widgets, and interactions between widgets and context. Fig. 2 shows how the regret varies as training proceeds under the horseshoe prior assumption; the plots under the other two prior assumptions reveal a similar pattern. As expected, ``MAB-AX2`` gives a superior result because its formulation matches the true model: its regret drops quickly and slowly converges to $0$. The regrets of the other three models reach a plateau due to their inability to capture all the interaction effects.

<center> <img src="https://i.imgur.com/6rQdEs5.png" alt="drawing" width="500"/> <figcaption>Fig 2. Comparison of regrets among different training models</figcaption> </center> <br><br>

Next, we compare the performance of ``MAB-AX2`` under different prior assumptions. Fig. 3 indicates that the choice of prior does not have a significant impact on the performance of the training model in terms of regret.

<center> <img src="https://i.imgur.com/1z0V64N.png" alt="drawing" width="500"/> <figcaption>Fig 3. Comparison of regrets among different priors</figcaption> </center> <br>

Fig. 4 confirms this observation for ``MAB-A2`` and ``MAB-AX2``, since their average regret values are roughly at the same level across priors. However, further study is required to understand how the prior interacts with ``MAB-A`` and ``MAB-AX``.

<center> <img src="https://i.imgur.com/DzDG3bQ.png" alt="drawing" width="500"/> <figcaption>Fig 4. Comparison of mean regrets among different training models and priors</figcaption> </center> <br>

These simulation results provide a preliminary check of the correctness of our implementation of the proposed multivariate bandit algorithm. They suggest that the choice of hypothesized model has a dominant influence on the performance of the algorithm compared with the choice of prior.

## 4 Discussion

### 4.1 Limitations

Although the time complexity is less than exponential, since only two-way interactions are considered, training still takes a long time. In the greedy hill-climbing search, for example, at least three nested loops are required to find near-optimal actions, so if the dimension of the layout increases by one, the time needed grows by a much larger factor. Moreover, since our Bayesian regression is based on the MCMC sampler of the PyMC3 package, which does not directly support updating priors from previous posteriors, extra computation is needed to obtain new posteriors and update the history $H_t$.

In addition, according to the figures in the previous section, the Laplace and horseshoe priors did not perform as well as we expected, even though the normal prior does not eliminate any features. More experiments should therefore be conducted to find a better configuration; for example, the hyperparameters we chose may not be the best setting for our model, so we should explore more hyperparameter choices and other priors.

### 4.2 Future Work

Future studies could consider three aspects: different true models, additional priors, and other MAB algorithms. First, true models with varied weights on widgets, context, and their interactions could be examined. Second, there are many prior distributions we did not try, such as the spike-and-slab prior. Third, we used Thompson sampling for all of our experiments; it would also be interesting to explore the effect of other algorithms, such as the Upper Confidence Bound (UCB).

## 5 Conclusion

We present an introduction to the multi-armed bandit (MAB), from the exploitation-versus-exploration trade-off to the algorithmic details. We apply MAB to a web page layout example, in which a page is composed of multiple widgets with several alternatives each, and we aim to find the optimal combination of widgets to maximize the user click-through rate. In our application, we set the true model to include widget actions, context, interactions between actions, and interactions between actions and context. We then simulate four hypothesized models and use three priors (normal, Laplace, and horseshoe) to test the performance of MAB. The algorithm we use for the contextual bandit is Thompson sampling, and we optimize the action selection with a hill climbing search. The results suggest that ``MAB-AX2``, which matches the true model, achieves the lowest regret, and that each model's performance is similar no matter which prior is used. As for the priors, the three we used show no significant difference in the regret after 5000 iterations or in its decay rate.

## References

- Hill, D. N., Nassif, H., Liu, Y., Iyer, A., & Vishwanathan, S. (2017). An Efficient Bandit Algorithm for Realtime Multivariate Optimization. Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '17). doi: 10.1145/3097983.3098184
- Carvalho, C. M., Polson, N. G., & Scott, J. G. (2010). The horseshoe estimator for sparse signals. Biometrika, 97(2), 465–480. doi: 10.1093/biomet/asq017
- Qian, H.
(2018). Big Data Bayesian Linear Regression and Variable Selection by Normal-Inverse-Gamma Summation. Bayesian Analysis, 13(4), 1011–1035. doi: 10.1214/17-ba1083 - Tibshirani, R. (1996). Regression Shrinkage and Selection Via the Lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267–288. doi: 10.1111/j.2517-6161.1996.tb02080.x ## Appendix ### Motivation Our project is based on Amazon's multi-armed bandit paper, *An Efficient Bandit Algorithm for Realtime Multivariate Optimization*. As we think MAB will be useful in our future internships or jobs, we decided to go through all details in the paper and implement the algorithm by ourselves. Also, in many situations, there are many insignificant features in the data, so besides Normal prior used in the paper, we also tried two more priors, Laplace and Horseshoe, in our experiment. These two priors have stronger regularization power and would shrink the weights of uninformative features to zero. ### How we checked our code Basically, we have written testing methods for most of the functions. For example, in Utils class, we set the dimension of action as ``(2, 3)``, and an action ``A = np.array([1, 2])``, so the function ``get_A_dummy_vector`` should transfer this array to ``[[0, 1], [0, 0]]``. Also, in Model class, we tested simple Bayesian linear regressions with different priors, and the weights sampled converged according to the trace plots. Actually, each of the team members checked others' code to make sure everything worked correctly. ### Code Our code run on University of Toronto computer science department CDF. The instruction of how to use the server can be found from CSLab webpage: https://support.cs.toronto.edu/systems/linuxgpu.html ```python import numpy as np import numpy.random as random import pymc3 as pm import theano.tensor as tt from scipy.stats import norm ``` ```python class Utils: def __init__(self, dim_action, dim_context): self.widget_dim = dim_action[0] self.alternative_num = dim_action[1] self.context_dim = dim_context[0] self.context_alt = dim_context[1] def get_A_dummy_vector(self, a): D = self.widget_dim N = self.alternative_num # given data vector a, generate D(N-1) dummy variables without interaction terms dummy_mat = np.zeros((D, N-1)) for i, a_i in enumerate(a): if a_i < N-1: dummy_mat[i, a_i] = 1 return dummy_mat.reshape((-1,)) def get_A_dummy_matrix(self, A): D = self.widget_dim N = self.alternative_num # given data matrix A, generate dummy matrix without interaction terms M = A.shape[0] dummy_matrix = np.zeros((M, D*(N-1))) for i in range(M): dummy_matrix[i] = self.get_A_dummy_vector(A[i]) return dummy_matrix def get_X_dummy_vector(self, a): D = self.context_dim N = self.context_alt # given data vector a, generate D(N-1) dummy variables without interaction terms dummy_mat = np.zeros((D, N-1)) for i, a_i in enumerate(a): if a_i < N-1: dummy_mat[i, a_i] = 1 return dummy_mat.reshape((-1,)) def get_X_dummy_matrix(self, A): D = self.context_dim N = self.context_alt # given data matrix A, generate dummy matrix without interaction terms M = A.shape[0] dummy_matrix = np.zeros((M, D*(N-1))) for i in range(M): dummy_matrix[i] = self.get_X_dummy_vector(A[i]) return dummy_matrix def get_A_inter_vector(self, a): D = self.widget_dim N = self.alternative_num # given vector a, generate all interaction terms within features of a dummy_mat = self.get_A_dummy_vector(a).reshape(D, -1) idx_start = 0 idx_end = (D-1) * (N-1) ** 2 inter_vector = np.empty(D*(D-1) * (N-1)**2 // 2) for i in range(D-1): 
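            # pair widget i with every later widget: the outer product of widget i's
            # dummy row with the stacked dummy rows of widgets i+1..D-1 gives all
            # pairwise interaction indicators for this block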
int_mat = dummy_mat[i, :].reshape((-1, 1)) @ dummy_mat[(i+1):, :].reshape((1, -1)) inter_vector[idx_start:idx_end] = int_mat.reshape((1, -1))[0] idx_start = idx_end idx_end += (D-2-i) * (N-1) ** 2 return inter_vector def get_A_inter_matrix(self, A): D = self.widget_dim N = self.alternative_num # given data matrix A, generate all the interaction terms within features of A M = A.shape[0] inter_matrix = np.zeros((M, D*(D-1)*(N-1)**2 // 2)) for i in range(M): inter_matrix[i] = self.get_A_inter_vector(A[i]) return inter_matrix def get_AX_inter_vector(self, a, x, dummy_context=False): D = self.widget_dim N = self.alternative_num L = self.context_dim G = self.context_alt # given data vectors a and x, generate all the interaction terms among features of a and x a_dummy_mat = self.get_A_dummy_vector(a).reshape(D, -1) # x is already dummy if dummy_context is true if not dummy_context: x_dummy_mat = self.get_X_dummy_vector(x).reshape(L, -1) else: x_dummy_mat = x.reshape(L, -1) idx_start = 0 idx_add = (N-1)*(G-1) inter_vector = np.empty(D*L*(N-1)*(G-1)) for i in range(D): for j in range(L): inter_mat = np.dot(a_dummy_mat[i, :].reshape((-1, 1)), x_dummy_mat[j, :].reshape((1, -1))) inter_vector[idx_start:(idx_start+idx_add)] = inter_mat.reshape((1, -1))[0] idx_start += idx_add return inter_vector def get_AX_inter_matrix(self, A, X, dummy_context=False): D = self.widget_dim N = self.alternative_num L = self.context_dim G = self.context_alt # given data matrices A and X, generate all the interaction terms among features of A and X M = A.shape[0] inter_matrix = np.zeros((M, D*L*(N-1)*(G-1))) for i in range(M): # x is already dummy if dummy_context is true if not dummy_context: inter_matrix[i] = self.get_AX_inter_vector(A[i], X[i]) else: inter_matrix[i] = self.get_AX_inter_vector(A[i], X[i], dummy_context=True) return inter_matrix def get_all_dummy(self, A, X, dummy_context=False): N = A.shape[0] A_dummy = self.get_A_dummy_matrix(A) if dummy_context: X_dummy = X else: X_dummy = self.get_X_dummy_matrix(X) A_inter = self.get_A_inter_matrix(A) AX_inter = self.get_AX_inter_matrix(A, X, dummy_context) return np.hstack((np.ones((N, 1)), A_dummy, X_dummy, A_inter, AX_inter)) ``` ```python class Environment(Utils): def __init__(self, dim_action, dim_context, a_params, beta): # inherit Utils class super().__init__(dim_action, dim_context) # a_params is a list of a1 - a4, controlling the weights of the ture model self.a_params = a_params self.mu = self.true_model_init(beta) # Sampling mu def true_model_init(self, beta): self.num_A = self.widget_dim * (self.alternative_num-1) self.num_X = self.context_dim * (self.context_alt-1) self.num_AA = self.widget_dim * (self.widget_dim-1) * (self.alternative_num-1)**2 // 2 self.num_AX = self.widget_dim * self.context_dim * (self.alternative_num-1)*(self.context_alt-1) self.dim_list = np.array([self.num_A, self.num_X, self.num_AA, self.num_AX]).cumsum() + 1 bias = random.normal(5, 1) A_idx = random.randint(-5, 5, size=self.num_A) X_idx = random.randint(-5, 5, size=self.num_X) AA_idx = random.randint(-3, 3, size=self.num_AA) AX_idx = random.randint(-5, 5, size=self.num_AX) beta_A = np.array([random.normal(A_idx[i], 3, size=1)[0] for i in range(self.num_A)]) beta_X = np.array([random.normal(X_idx[i], 3, size=1)[0] for i in range(self.num_X)]) beta_AA = np.array([random.normal(AA_idx[i], 3, size=1)[0] for i in range(self.num_AA)]) beta_AX = np.array([random.normal(AX_idx[i], 3, size=1)[0] for i in range(self.num_AX)]) return (1/beta) * np.concatenate(([bias], beta_A, beta_X, 
beta_AA, beta_AX)) # compute reward def compute_reward(self, data, idx): N = data.shape[0] y = norm.cdf(data @ self.mu.reshape((-1, )) + self.noise[idx:(idx+N)]) reward = np.zeros(N) for i, p in enumerate(y): reward[i] = random.binomial(1, p) return reward # data initialization def data_init(self, N, minibatch): # noise of the true model self.noise = random.normal(0, 3, size=N) X = np.random.randint(0, self.context_alt, size=(N, self.context_dim)) A = np.random.randint(0, self.alternative_num, size=(minibatch, self.widget_dim)) # generate first minibatch of data, say, 1000 lines data_minibatch = self.get_all_dummy(A, X[0:minibatch, :]) # generate a data matrix with all features which can be used and updated in training steps data = np.zeros((N - minibatch, len(self.mu))) # intercept terms data[:, 0] = np.ones(N - minibatch) # generate the rest of X data[:, self.dim_list[0]:self.dim_list[1]] = self.get_X_dummy_matrix(X[minibatch:, :]) # concatenate all data and compute rewards data = np.vstack((data_minibatch, data)) # generate a vector with rewards reward = np.zeros(N) reward[0:minibatch] = self.compute_reward(data_minibatch, idx=0) self.data = data self.reward = reward def update_data(self, A, X, idx): # the input X should be dummy N = A.shape[0] A_dummy = self.get_A_dummy_matrix(A) A_inter = self.get_A_inter_matrix(A) AX_inter = self.get_AX_inter_matrix(A, X, dummy_context=True) self.data[idx:(idx+N), 1:self.dim_list[0]] = A_dummy self.data[idx:(idx+N), self.dim_list[1]:self.dim_list[2]] = A_inter self.data[idx:(idx+N), self.dim_list[2]:self.dim_list[3]] = AX_inter reward = self.compute_reward(self.data[idx:(idx+N), :], idx=idx) self.reward[idx:(idx+N)] = reward ``` ```python # Please feel free to compare with the following Github repo: # https://github.com/neal-o-r/horseshoe/blob/master/horseshoe.py class Model(): def __init__(self, data, reward, n_draw = 1000, n_tune = 1000): self.X = data self.y = reward self.n_draw = n_draw self.n_tune = n_tune def probit(self, Y): # Probit function is acutally the CDF of the standard normal distribution N(0, 1) mu = 0 sd = 1 return 0.5 * (1 + tt.erf((Y - mu) / (sd * tt.sqrt(2)))) def mcmc(self, prior = 'normal'): model = pm.Model() with model: # set the prior distribution of weights if prior == 'normal': W = pm.Normal('w', mu=0, sigma=1, shape=self.X.shape[1]) elif prior == 'laplace': W = pm.Laplace('w', 0, b=1, shape=self.X.shape[1]) elif prior == 'horseshoe': sigma = pm.HalfNormal('sigma', 2) tau_0 = 10 / (self.X.shape[1] - 10) * sigma / tt.sqrt(self.X.shape[0]) tau = pm.HalfCauchy('tau', tau_0) lambda_m = pm.HalfCauchy('lambda', 1) W = pm.Normal('w', mu=0, sigma=tau * lambda_m, shape=self.X.shape[1]) else: print("Invlaid prior type.") return None return self.get_trace(W, model) def get_trace(self, W, model): # get samples of W given a prior distribution with model: y_hat = tt.dot(self.X, W) theta = self.probit(y_hat) # likelihood y_new = pm.Bernoulli('r', p=theta, observed=self.y) trace = pm.sample(draws = self.n_draw, tune = self.n_tune, chains=1, init='adapt_diag') return trace def sample_weight(self, trace, environment, train_model): # get a random sample of weight W_posterior = trace.get_values('w') random_idx = random.randint(W_posterior.shape[0]) W = W_posterior[random_idx] dim_list = environment.dim_list # distinguish model types by values of a # dim_list = np.array([num_A, num_X, num_AA, num_AX]).cumsum() + 1 if train_model == 'MAB-A': W[dim_list[1]:] = 0 elif train_model == 'MAB-A2': W[dim_list[0]:dim_list[1]] = 0 
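            # MAB-A2 keeps only the action and action-action terms: the context
            # main effects were zeroed above, and the action-context interactions
            # are zeroed next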
W[dim_list[2]:dim_list[3]] = 0 elif train_model == 'MAB-AX2': pass elif train_model == 'MAB-AX': W[dim_list[1]:] = 0 else: print("Invalid model") return None return W ``` ```python class Agent(Utils): def __init__(self, dim_action, dim_context, train_model): # inherit Utils class super().__init__(dim_action, dim_context) # train_model is the type of the training model self.train_model = train_model def compute_BW(self, W, A, X, noise=False, environment=None, idx=None): # the input X should be dummy N = A.shape[0] data = self.get_all_dummy(A, X, dummy_context=True) if noise: BW = np.dot(data, W) + environment.noise[idx:(idx+N)] else: BW = np.dot(data, W) return BW def hill_climbing_search(self, W, X, N, times=5, steps=20, noise=False, environment=None, idx=None): # N is the batch size in thompson sampling # the input X should be dummy batch_size = X.shape[0] reward_matrix = np.zeros((batch_size, times), dtype=int) action_matrix = np.zeros((times, batch_size, self.widget_dim), dtype=int) for s in range(times): A = random.randint(self.alternative_num, size=(N, self.widget_dim)) # pick a layout A_0 randomly for k in range(steps): i = random.randint(self.widget_dim) # choose a widget randomly to optimize A_all_possible = np.repeat(A.reshape(N, -1, 1), self.alternative_num, axis=2) A_all_possible[:, i, :] = np.repeat(np.array([t for t in range(self.alternative_num)]).reshape(1, -1), N, axis=0) result = np.zeros((batch_size, self.alternative_num)) for n in range(self.alternative_num): A_test_one = A_all_possible[:, :, n] result[:, n] = self.compute_BW(W, A_test_one, X, noise, environment, idx) j_star = np.argmax(result, axis=1) A[:, i] = j_star BW = self.compute_BW(W, A, X, noise, environment, idx) reward_matrix[:, s] = BW action_matrix[s, :, :] = A max_idx = np.argmax(reward_matrix, axis=1) # print("max", max_idx) max_idx = np.array([[i]*self.widget_dim for i in max_idx]).reshape(N, -1) return max_idx.choose(action_matrix) def compute_regret(self, A, X, N, environment, idx): # the input A is the selected by hill climbing search # the input X should be dummy optimal_true_action = self.hill_climbing_search(environment.mu, X, N, noise=True, environment=environment, idx=idx) # At* optimal_empirical_action = A # At BW_true_action = self.compute_BW(environment.mu, optimal_true_action, X, True, environment, idx) BW_empirical_action = self.compute_BW(environment.mu, optimal_empirical_action, X, True, environment, idx) true_reward = norm.cdf(BW_true_action) empirical_reward = norm.cdf(BW_empirical_action) return true_reward - empirical_reward # shape = (X.shape[0], ) def thompson_sampling(self, environment, minibatch, prior='normal', times=10, steps=100): N = environment.reward.shape[0] t = minibatch regret = np.zeros(N) while t < N: print("Thompson sampling iteration {}/{}".format(t/minibatch, N/minibatch-1)) model = Model(environment.data[0:t], environment.reward[0:t]) trace = model.mcmc(prior) W = model.sample_weight(trace, environment, self.train_model) print("Starting hill-climbing search...") # the newest minibatch of context X is used in the optimization X_dummy = environment.data[(t-minibatch):t, environment.dim_list[0]:environment.dim_list[1]] A = self.hill_climbing_search(W, X_dummy, minibatch, times, steps) print("Hill-climbing search finished!") print("Starting computing regret...") # compute regret regret[t:(t+minibatch)] = self.compute_regret(A, X_dummy, minibatch, environment, idx=t) print("Regret computing completed!") # update data with A, and time t for the next iteration 
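            # (update_data fills in the chosen layouts' dummy and interaction
            # columns and draws the realized rewards for this minibatch)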
environment.update_data(A, X_dummy, t) print("Data updating completed!") t += minibatch return regret[minibatch:] ``` ```python import dash import dash_core_components as dcc import dash_html_components as html from dash.dependencies import Input, Output import pandas as pd external_stylesheets = ['https://codepen.io/chriddyp/pen/bWLwgP.css'] app = dash.Dash(__name__, external_stylesheets=external_stylesheets) df = pd.read_csv("all_data.csv") available_priors = ['normal', 'laplace', 'horseshoe'] available_train_models = ['MAB-A', 'MAB-A2', 'MAB-AX', 'MAB-AX2'] app.layout = html.Div(children=[ html.Div([ html.Div('Prior Assumption'), dcc.Dropdown( id='prior_type', options=[{'label': i, 'value': i} for i in available_priors], value='normal'), ], style={'width': '48%', 'display': 'inline-block'} ), html.Div([ dcc.Graph(id='line_plot') ]), html.Div('Training Model'), dcc.Dropdown( id='training_model', options=[{'label': i, 'value': i} for i in available_train_models], value='MAB-A' ), html.Div([ dcc.Graph(id='line_plot2') ]) ]) @app.callback( Output('line_plot', 'figure'), [Input('prior_type', 'value')] ) def line_plot_by_prior(prior_type): dff = df[(df['prior'] == prior_type)] models = dff.train_model.unique() return { 'data': [dict( x=dff[dff['train_model'] == model]['iteration'], y=dff[dff['train_model'] == model]['regret'], mode='lines', name=model ) for model in models], 'layout': dict( xaxis={ 'title': 'Number of Iterations', 'type': 'linear' }, yaxis={ 'title': 'Regret', 'type': 'linear' }, margin={'l': 40, 'b': 30, 't': 10, 'r': 0}, # title="Local Regret vs. Number of Iterations" ) } @app.callback( Output('line_plot2', 'figure'), [Input('training_model', 'value')] ) def line_plot_by_train_model(train_model): dff = df[(df['train_model'] == train_model)] priors = dff.prior.unique() return { 'data': [dict( x=dff[dff['prior'] == prior]['iteration'], y=dff[dff['prior'] == prior]['regret'], mode='lines', name=prior ) for prior in priors], 'layout': dict( xaxis={ 'title': 'Number of Iterations', 'type': 'linear' }, yaxis={ 'title': 'Regret', 'type': 'linear' }, margin={'l': 40, 'b': 30, 't': 10, 'r': 0}, # title="Local Regret vs. Number of Iterations" ) } if __name__ == '__main__': app.run_server(debug=True) ``` ```python import dash import dash_core_components as dcc import dash_html_components as html from dash.dependencies import Input, Output import plotly.graph_objs as go import pandas as pd external_stylesheets = ['https://codepen.io/chriddyp/pen/bWLwgP.css'] app = dash.Dash(__name__, external_stylesheets=external_stylesheets) df = pd.read_csv("mean_data.csv") def plot_hist(): models = df.train_model.unique() return go.Figure( data=[ go.Bar( x=df[df['train_model'] == model]['prior'], y=df[df['train_model'] == model]['regret'], name=model ) for model in models ], layout=dict( xaxis={ 'title': 'prior', }, yaxis={ 'title': 'Mean regret', }, margin={'l': 40, 'b': 30, 't': 10, 'r': 0} ) ) app.layout = html.Div(children=[ html.Div([ dcc.Graph(id='plot', figure=plot_hist()) ]) ]) if __name__ == '__main__': app.run_server(debug=True) ```