# Progressive TSLU

The TSLU bandit is run through the [run_bandit.py](https://github.com/CVC-Lab/Thompson-Sampling/blob/main/run_bandit.py) script. It uses an argument parser to handle the various datasets and bandit models; we can tackle the details of each dataset and model later. For now, let's focus on understanding the skeleton of the code. As always, the code starts with the main function, and the core line is `regret_i = run_iter()`.

```{.py}
def main(_):
  IterMax = 50
  num_algos = 9
  regrets = np.zeros((2000, num_algos, IterMax))
  # print('results/' + FLAGS.dataset + '-lr-' + str(FLAGS.lr) + '.npy')

  for i in range(IterMax):
    regret_i = run_iter()
    regrets[:, :, i] = regret_i
    print('---------- Finish Loop (', i, ') --------------')

  results = np.cumsum(regrets, axis=0)
  print('Average Cumulative Regret: ', np.mean(results[-1, :, :], axis=1))

  cwd = os.getcwd()
  save_dir = cwd + '/camera_ready/'
  if not os.path.exists(save_dir):
    os.makedirs(save_dir)
  np.save(save_dir + FLAGS.dataset + '.npy', results)
```
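The saved array has shape `(2000, num_algos, IterMax)`: cumulative regret per context, per algorithm, per repetition. As a minimal sketch (assuming a run on the mushroom dataset, so the file is `camera_ready/mushroom.npy`), the final average cumulative regret can be recovered from the saved file like this:

```{.py}
import numpy as np

# Load the cumulative-regret array saved by main():
# shape (num_contexts, num_algos, IterMax). The file name assumes the run
# used the mushroom dataset; adjust the path for other datasets.
results = np.load('camera_ready/mushroom.npy')

final_regret = results[-1, :, :]           # cumulative regret after the last context
avg_per_algo = final_regret.mean(axis=1)   # average over the IterMax repetitions
print('Average Cumulative Regret:', avg_per_algo)
```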
In the function run_iter(), many hyperparameters are defined and the different bandit models are instantiated. Let's focus on the first TSLU model introduced in the paper: `VariationalSampling_v4('LU_Gaussian', hparams_luga)` (algorithm #6). We will look at the definition of the hyperparameters later.

```{.py}
hparams_luga = tf.contrib.training.HParams(num_actions=num_actions,
                                           num_contexts=num_contexts,
                                           context_dim=context_dim,
                                           activation=tf.nn.relu,
                                           latent_dim=50,
                                           batch_size=512,
                                           initial_lr=2e-4,
                                           show_training=False,
                                           lr_decay=False,
                                           freq_summary=10000,
                                           buffer_s=-1,
                                           initial_pulls=2,
                                           training_freq=20,
                                           training_epochs=40,
                                           lambda_prior=0.25,
                                           show_loss=False,
                                           kl=1.0,
                                           recon=1.0,
                                           psigma=1.0,
                                           glnoise=False)

algos = [
    VariationalSampling_v4('LU_Gaussian', hparams_luga),  # 6
]
```

Then the code runs

```{.py}
results = run_contextual_bandit(context_dim, num_actions, dataset, algos)
```

Here, `context_dim`, `num_actions`, and `dataset` are problem specific and are produced earlier in run_iter():

```{.py}
# Create dataset
sampled_vals = sample_data(data_type, num_contexts)
dataset, opt_rewards, opt_actions, num_actions, context_dim = sampled_vals
```

Among the available datasets, let's consider only the mushroom dataset.

```{.py}
def sample_data(data_type, num_contexts=None):
  """Sample data from given 'data_type'.

  Args:
    data_type: Dataset from which to sample.
    num_contexts: Number of contexts to sample.

  Returns:
    dataset: Sampled matrix with rows: (context, reward_1, ..., reward_num_act).
    opt_rewards: Vector of expected optimal reward for each context.
    opt_actions: Vector of optimal action for each context.
    num_actions: Number of available actions.
    context_dim: Dimension of each context.
  """
  # ... (branches for the other datasets omitted) ...
  elif data_type == 'mushroom':
    # Create mushroom dataset
    num_actions = 2
    context_dim = 117
    file_name = FLAGS.mushroom_data
    dataset, opt_mushroom = sample_mushroom_data(file_name, num_contexts)
    opt_rewards, opt_actions = opt_mushroom

  return dataset, opt_rewards, opt_actions, num_actions, context_dim
```

Then, the next step is running the function run_contextual_bandit() in the [script](https://github.com/CVC-Lab/Thompson-Sampling/blob/main/bandits/core/contextual_bandit.py).

```{.py}
from bandits.core.contextual_bandit import run_contextual_bandit
```

The overall purpose of the run_contextual_bandit() function is, as the name implies, to run the contextual bandit problem on a set of algorithms. Its core is the ContextualBandit class, which holds the data and serves contexts and rewards to the algorithms.

```{.py}
class ContextualBandit(object):
  """Implements a Contextual Bandit with d-dimensional contexts and k arms."""

  def __init__(self, context_dim, num_actions):
    """Creates a contextual bandit object.

    Args:
      context_dim: Dimension of the contexts.
      num_actions: Number of arms for the multi-armed bandit.
    """
    self._context_dim = context_dim
    self._num_actions = num_actions

  def feed_data(self, data):
    """Feeds the data (contexts + rewards) to the bandit object.

    Args:
      data: Numpy array with shape [n, d+k], where n is the number of contexts,
        d is the dimension of each context, and k the number of arms (rewards).

    Raises:
      ValueError: when data dimensions do not correspond to the object values.
    """
    if data.shape[1] != self.context_dim + self.num_actions:
      raise ValueError('Data dimensions do not match.')

    self._number_contexts = data.shape[0]
    self.data = data
    self.order = range(self.number_contexts)

  def reset(self):
    """Randomly shuffle the order of the contexts to deliver."""
    self.order = np.random.permutation(self.number_contexts)

  def context(self, number):
    """Returns the number-th context."""
    return self.data[self.order[number]][:self.context_dim]

  def reward(self, number, action):
    """Returns the reward for the number-th context and action."""
    return self.data[self.order[number]][self.context_dim + action]

  def optimal(self, number):
    """Returns the optimal action (in hindsight) for the number-th context."""
    return np.argmax(self.data[self.order[number]][self.context_dim:])

  @property
  def context_dim(self):
    return self._context_dim

  @property
  def num_actions(self):
    return self._num_actions

  @property
  def number_contexts(self):
    return self._number_contexts
```

Since the overall structure and function of the ContextualBandit class is explained reasonably well by its docstrings, we should instead evaluate the role of the class within the general run_contextual_bandit() function:

```{.py}
def run_contextual_bandit(context_dim, num_actions, dataset, algos):
  """Run a contextual bandit problem on a set of algorithms.

  Args:
    context_dim: Dimension of the context.
    num_actions: Number of available actions.
    dataset: Matrix where every row is a context + num_actions rewards.
    algos: List of algorithms to use in the contextual bandit instance.

  Returns:
    h_actions: Matrix with actions: size (num_context, num_algorithms).
    h_rewards: Matrix with rewards: size (num_context, num_algorithms).
  """
  num_contexts = dataset.shape[0]

  # Create contextual bandit
  cmab = ContextualBandit(context_dim, num_actions)
  cmab.feed_data(dataset)

  h_actions = np.empty((0, len(algos)), float)
  h_rewards = np.empty((0, len(algos)), float)

  # Run the contextual bandit process
  for i in range(num_contexts):
    context = cmab.context(i)
    actions = [a.action(context) for a in algos]
    rewards = [cmab.reward(i, action) for action in actions]

    for j, a in enumerate(algos):
      a.update(context, actions[j], rewards[j])

    h_actions = np.vstack((h_actions, np.array(actions)))
    h_rewards = np.vstack((h_rewards, np.array(rewards)))

  # for algo in algos:
  #   plt.plot(algo.loss)

  return h_actions, h_rewards
```

Overall, the function is relatively simple: it takes a dataset and runs the contextual bandit problem over it with a set of algorithms. The code features a variety of algorithms to test, but we will only focus on the LU_SIVI algorithm, as it is the only one relevant to our current problem.
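Note that run_contextual_bandit() only requires each algorithm to expose an action(context) method and an update(context, action, reward) method. Before diving into LU_SIVI, here is a minimal, hypothetical policy (not part of the repository) sketching that interface:

```{.py}
import numpy as np

class RandomPolicy(object):
  """Hypothetical minimal algorithm: satisfies the interface that
  run_contextual_bandit() relies on (action() and update())."""

  def __init__(self, name, num_actions):
    self.name = name
    self.num_actions = num_actions
    self.history = []  # (context, action, reward) tuples seen so far

  def action(self, context):
    # Pick an arm uniformly at random, ignoring the context.
    return np.random.randint(self.num_actions)

  def update(self, context, action, reward):
    # Record the observation; a real algorithm would retrain its model here.
    self.history.append((context, action, reward))
```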
The code implementation of the LU_SIVI algorithm is rather nuanced, so it is worth breaking it into several distinct parts to make it easier to understand. The core of the algorithm lies within the VariationalSamplingSivi_dgf_v7 class. Before we go into this class, we must first define the hyperparameters being used as part of the model:

* num_actions: number of possible actions to take in the bandit problem
* num_contexts: number of contexts within the problem
* context_dim: ?
* activation: neural network activation function (ReLU)
* latent_dim: ?
* batch_size: training batch size
* initial_lr: learning rate of the neural network
* show_training:
* verbose:
* lr_decay:
* freq_summary:
* buffer_s:
* initial_pulls:
* training_freq:
* training_epochs:
* lambda_prior:
* show_loss:
* kl:
* recon:
* two_decoder:
* glnoise:
* psigma:

> need to finish this part later when I get to the parts of the code that actually reference the hyperparams

We define the class variables as below:

* name: name of the algorithm (LU_SIVI)
* hparams: the hyperparameters fed into the algorithm (see the hyperparameter list above)
* Optimizer_n: loss-function optimizer (RMS, i.e. RMSProp, in this case)
* training_freq:
* training_epochs: number of training epochs, as defined by the hyperparameter inputs
* num_actions: number of actions, as defined by the hyperparameters
* t: time (initialized at 0 and incremented by 1 every iteration)
* data_h: dataset being passed into the class; defined by the VIContextualDataset class described in a later section
* bnn: Bayesian neural network used in the problem (defined by the VariationalSivi_dgf_v7 class)
* loss: ?

From the run_contextual_bandit() function, we can see that the two primary methods to analyze are the action and update methods of the VariationalSamplingSivi_dgf_v7 class. The action method is shown below:

```{.py}
def action(self, context):
  """Samples beta's from posterior, and chooses best action accordingly."""
  # Round robin until each action has been selected "initial_pulls" times
  if self.t < self.hparams.num_actions * self.hparams.initial_pulls:
    return self.t % self.hparams.num_actions

  with self.bnn.graph.as_default():
    self.c = context.reshape((1, self.hparams.context_dim))
    y_pred_mu = self.bnn.sess.run(self.bnn.y_pred_mu,
                                  feed_dict={self.bnn.x: self.c})
    r = y_pred_mu.mean(axis=0).mean(axis=0)
    # r = np.random.normal(y_pred_mu, y_pred_var)

  return np.argmax(r)
```

The action method takes a context as input and outputs the arm that the bandit will pull in the contextual bandit problem. First, it selects each arm `initial_pulls` times in round-robin fashion to obtain a rough idea of each arm's reward distribution; afterwards, it feeds the current context through the Bayesian neural network and picks the arm with the highest predicted mean reward.

....
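The selection logic above can be summarized without the TensorFlow plumbing. The following is a simplified, plain-numpy sketch of what action() does; the function and parameter names here are illustrative, not taken from the repository:

```{.py}
import numpy as np

# Simplified illustration of the selection rule in action(): round-robin over
# the arms for the first num_actions * initial_pulls steps, then act greedily
# on the model's predicted mean reward for the current context.
def select_action(t, num_actions, initial_pulls, predict_mean, context):
    if t < num_actions * initial_pulls:
        return t % num_actions   # warm-up: pull every arm in turn
    r = predict_mean(context)    # predicted mean reward per arm, shape (num_actions,)
    return int(np.argmax(r))     # exploit the arm with the highest predicted mean
```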
## Algorithms

There are 7 algorithms listed below that are tested against each other on each of the datasets shown in the testing section:

1.) Uniform Sampling: An action is chosen at random from the uniform distribution. Significantly less efficient than the other sampling methods.

2.) Posterior BNN Sampling (BBB): Utilizes variational inference to quantify uncertainty. The weights of the neural network are drawn from Gaussian distributions, so instead of optimizing the weights themselves, the mean and variance of each Gaussian become the parameters to optimize.

3.) Neural Linear Posterior Sampling: The reward of the BNN is sampled from $y_t \sim N(\beta^T z_x, \sigma^2)$, where $z_x$ is the representation produced by a neural network given context $x_t$, $\beta$ is the regression coefficient with a Gaussian prior, and $\sigma^2$ is sampled from an inverse-gamma conditional prior.

4.) Linear Full Posterior Sampling: Uses the linear reward function $y_t = x_t^T \beta + \epsilon_t$ at time $t$, samples $\beta$ from a normal prior, and $\epsilon_t \sim N(0, \sigma_t^2)$, where $\sigma_t^2$ is sampled from an inverse-gamma conditional prior. At every time $t$, the agent receives a context $x_t$, takes the highest-reward action according to the sampled reward model $y_t$, and updates $\beta$ and $\sigma$ accordingly.

5.) Particle-Interactive TS via Discrete Gradient Flow: (I actually have no idea how this is quantified)

6.) TSLU Gaussian:

7.) TSLU SIVI:

## Testing

To benchmark the efficacy of the proposed TSLU-SIVI algorithm, the researchers test it on a series of 8 datasets against 8 other algorithms. For brevity, we will only look at the LU_SIVI approach in relation to the LU_Gaussian approach, although additional algorithms may be added in the future. Overall, we can see that the Average Cumulative Regret of the LU_SIVI approach is generally much lower than that of the LU_Gauss approach, although for some datasets, such as the Jester dataset, the regret is roughly equal.

| Dataset   | Regret (LU_SIVI) | Regret (LU_Gauss) |
| --------- | ---------------- | ----------------- |
| Mushroom  | 1525.8           | 2992              |
| Financial | 232.43           | 356.28            |
| Statlog   | 126.24           | 142.7             |
| Jester    | 7227.98          | 7010.43           |
| Wheel     |                  |                   |
| Covertype |                  |                   |
| Adult     |                  |                   |
| Census    |                  |                   |

![](https://i.imgur.com/zLJPW72.png)
![](https://i.imgur.com/eoq34Pe.png)
![](https://i.imgur.com/8xQh7qG.png)
![](https://i.imgur.com/jSIOocn.png)

More Algorithms:

![](https://i.imgur.com/esuBvxc.png)
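The regret numbers in the table above are final cumulative regrets, averaged over repetitions. As a hedged sketch (not the repository's exact code), they can be derived from two quantities we have already seen: opt_rewards from sample_data() and h_rewards from run_contextual_bandit():

```{.py}
import numpy as np

# Hedged sketch: per-step and cumulative regret from the quantities above.
# h_rewards:   (num_contexts, num_algos) rewards actually collected.
# opt_rewards: (num_contexts,) expected optimal reward for each context.
def cumulative_regret(h_rewards, opt_rewards):
    instant_regret = opt_rewards[:, None] - h_rewards  # regret at each step, per algorithm
    return np.cumsum(instant_regret, axis=0)           # running total per algorithm

# The last row is the final cumulative regret per algorithm; averaging it over
# the repetitions gives the kind of numbers reported in the table above.
```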
### Adaptive Inducing Points Selection for Gaussian Processes

https://arxiv.org/abs/2107.10066

A major disadvantage of a Gaussian Process (GP) is that its computational complexity is cubic in the number of data points. Sparse Variational GPs (SVGPs) address this problem by using a smaller number of inducing points (IPs). A standard GP has training data $\mathcal{D} = \{X,\boldsymbol{y}\}$, where the inputs are $X=\{x_i\}_{i=1}^N$ and the labels are $\boldsymbol{y}=\{y_i\}_{i=1}^N$. The goal is to predict the distribution for other inputs. We have a latent function $f$ and the associated vector $\boldsymbol{f}=\{f(x_i)\}_{i=1}^N$. The posterior distribution is calculated by:

$$p(\boldsymbol{f} | \mathcal{D}) = \frac{\prod_{i=1}^N p(y_i|f_i)\, p(\boldsymbol{f})}{p(\mathcal{D})}$$

With an SVGP there is an inducing variable $\boldsymbol{u}$ with inducing locations $Z=\{Z_i\}_{i=1}^M$. The relationship between $\boldsymbol{u}$ and $\boldsymbol{f}$ is $p(\boldsymbol{f},\boldsymbol{u})=p(\boldsymbol{f}|\boldsymbol{u})p(\boldsymbol{u})$, but it is approximated by $q(\boldsymbol{f},\boldsymbol{u})=p(\boldsymbol{f}|\boldsymbol{u})q(\boldsymbol{u})$, where $q(\boldsymbol{u})=\mathcal{N}(\boldsymbol{\mu},\Sigma)$ is chosen such that $KL(q(\boldsymbol{f},\boldsymbol{u})\,||\,p(\boldsymbol{f},\boldsymbol{u}|\mathcal{D}))$ is minimized. If a streaming SVGP is used, the data arrives in batches $\mathcal{D}_t$. For each batch there is a variational distribution $q_t(\boldsymbol{u}_t,\boldsymbol{f})$ which approximates

$$p(\boldsymbol{u}_t,\boldsymbol{f}|\mathcal{D}_{1:t}) = \frac{p(\mathcal{D}_t|\boldsymbol{f})\, p(\mathcal{D}_{1:(t-1)}|\boldsymbol{f})\, p(\boldsymbol{u}_t,\boldsymbol{f}|\theta_t)}{p(\mathcal{D}_{1:t})}$$

In the above formula, $p(\mathcal{D}_{1:(t-1)}|\boldsymbol{f})$ is approximated using $q_{t-1}(\boldsymbol{u}_{t-1})$:

$$p(\mathcal{D}_{1:(t-1)}|\boldsymbol{f}) \approx \frac{ q_{t-1}(\boldsymbol{u}_{t-1})\, p(\mathcal{D}_{1:(t-1)})}{p(\boldsymbol{u}_{t-1},\boldsymbol{f}|\theta_{t-1})}$$

This paper develops an algorithm called Online Inducing Points Selection (OIPS) to determine the number of IPs and their locations. The parameter of the algorithm is its acceptance threshold $\rho\in(0,1)$, and it generates a set of IPs. The algorithm is as follows:

1. Let $Z=\{Z_i\}_{i=1}^M$ be the set of IPs generated so far.
2. Let $x$ be the next sample.
3. Calculate the kernel function of $x$ with each of the IPs: $k(x,Z_i)$.
4. If $\max_i k(x,Z_i) < \rho$, then:
5. add $x$ to the set of IPs: $Z=Z\cup\{x\}$;
6. increment the number of IPs: $M=M+1$.

### Algorithm summary

Suppose the data tensor $\mathcal{A}\in \mathbb{R}^{N_1 \times N_2 \times N_3}$ is matricized into $A\in\mathbb{R}^{N \times M}$, where $N = N_1N_2$ and $M = N_3$. The algorithm starts by dividing the row and column index spaces into $k$ segments. Segment $i$ contains $d_i$ indices, $i=1,\dots,k$, such that $\sum_{i=1}^{k}d_i = N+M$. Sampling data fibers can then be formulated as a $k$-armed bandit problem. At each step, $b$ arms (segments) are randomly selected based on the arm scores, which are initially set to one. For the arm scores, we use the sum of absolute differences (SAD) of the fibers corresponding to the indices included in each arm. The SAD is a simple measure of how informative a data fiber is, defined as

$$ SAD(row_j) = \sum_{i=1}^{M-1} |A[j,i] - A[j,i+1]|,$$
$$ SAD(col_j) = \sum_{i=1}^{N-1} |A[i,j] - A[i+1,j]|.$$

Then, a sparse variational Gaussian process (SVGP) is used to model the distribution of SAD values for each arm. Using the SAD values of the fibers sampled from each arm, the inducing points are estimated, along with the mean and standard deviation of the SAD function. The score of the selected arm is then updated using a combination of the mean and standard deviation of the SAD function. To avoid oversampling, a factor that decreases with the number of already-sampled fibers is applied:

$$ u_t(k) = \left(1 - \frac{l_k}{d_k}\right)\left(\mu_t(k) + \beta_t\,\sigma_t(k)\right)$$

where $l_k$ is the number of indices already selected from arm $k$ and $\beta_t$ is a balancing parameter that must be tuned. Using the updated arm scores, the next batch of $b$ arms is selected. The procedure is repeated until the target number of fibers has been sampled.
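As a concrete reference, here is a minimal numpy sketch of the two scoring pieces above (the SAD formulas and the score update $u_t(k)$); the function and variable names are illustrative, not taken from the repository:

```{.py}
import numpy as np

def sad_row(A, j):
    """Sum of absolute differences along row j of the matricized tensor A."""
    return np.sum(np.abs(np.diff(A[j, :])))

def sad_col(A, j):
    """Sum of absolute differences along column j of the matricized tensor A."""
    return np.sum(np.abs(np.diff(A[:, j])))

def update_score(mu_k, sigma_k, l_k, d_k, beta_t):
    """Score of arm k: SVGP mean/std of its SAD values, discounted by the
    fraction of the arm's indices that have already been sampled."""
    return (1.0 - l_k / d_k) * (mu_k + beta_t * sigma_k)
```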
### TSLU Implementation

Given the same matrix $A$ described in the Algorithm Summary above, we can generalize the Thompson-Sampling algorithm to a $3$-armed bandit, in which the action taken at time $t$ determines the number of fibers sampled from each of the tensor's three dimensions. We can convert this approach to a TSLU approach as follows: at each time $t$, we denote the current context $x_t$ as the mean and variance of the sampled fibers, $z_t \sim p(z)$ as the latent representation of $x_t$, and $r_t \sim p(r_t|x_t,z_t)$ as the SAD score (to be replaced later by the expected SAD entropy) of selecting a fiber within each dimension.

Initially, the sampling ratios in the original Thompson Sampling approach were determined using a Dirichlet prior whose concentration parameters were updated using the SAD entropy of the sampled fibers. However, since the reward distribution of the TSLU approach can also be negative, it is not feasible to simply rescale the returned rewards into a probability distribution. Instead, for each batch, we run TSLU $n$ times to get $n$ approximate rewards, and select our $n$ actions to sample fibers from based on those rewards. Currently, the fiber-selection method is the same as the one described in Progressive SketchyCoreTucker, in which the weight of each unsampled fiber is determined by a linear interpolation function $\psi_{k}(\Delta_{k}(i))$, where $k\in\{1,2,3\}$ is the dimension being sampled.
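The following is a rough sketch of that batched selection step, under my own reading that each TSLU action simply picks one of the three tensor dimensions to draw the next fiber from; the function names are hypothetical and not from the implementation:

```{.py}
import numpy as np

# Rough sketch: query the TSLU bandit n times per batch; each returned action
# chooses one of the three tensor dimensions, and the counts decide how many
# fibers to draw from each dimension in this batch.
def select_batch(tslu_action, context, n):
    actions = [tslu_action(context) for _ in range(n)]  # n actions in {0, 1, 2}
    counts = np.bincount(actions, minlength=3)          # fibers per dimension
    return counts
```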
### Results 01/02/2022

Approximation error of Random Sampling (RS), Progressive Sampling (PS), and TSLU-based sampling over 100 trials on the Cardiac dataset. The error is calculated from the accuracy of the SketchyCoreTucker approximation constructed from the sampled fibers. Note that the TSLU approach outperforms both the RS and PS approaches, but at the cost of computation time (although this can be mitigated to some degree with optimization).

![](https://i.imgur.com/mxXEokT.png)

Approximation error of Progressive Sampling and TSLU sampling on the matrix-reshaped Cardiac dataset (2 dimensions instead of 3). The error is calculated from the accuracy of the SketchyCoreSVD approximation constructed from the sampled fibers. Note that the TSLU approach in 2 dimensions is roughly equal to the PS approach. This is likely because the fiber-selection method is the same for both approaches, and it may be harder to optimize the sampling ratios beyond a certain point when the dimensionality is reduced.

![](https://i.imgur.com/0t4cFFr.png)

### Results <12/28/2021

Error graph of the TSLU implementation on the Cardiac dataset:

![](https://i.imgur.com/OWgOgCz.png)

A few things to consider from this graph:

1.) The input matrix has dimension 45056x160, and we begin by sampling 2 fibers from each block before starting TSLU sampling, beginning with the column blocks.

2.) There are two main perceived drops in error: the first occurs when TSLU sampling begins on the input matrix, and the second when all of the column fibers in the matrix have been sampled.

3.) In the error calculation, I initially set the approximation error to 1 if only rows (and no columns) had been sampled, and vice versa. Combined with the fact that the initial sampling drew from the column blocks before sampling the rows, this caused the algorithm to inadvertently avoid selecting columns, since their error was perceived as "higher", unintentionally causing the error to converge much more slowly than the uniform sampling model. Currently, the default error is set to 0, but I am not sure whether this is a better way of accurately determining the efficacy of TSLU, or simply a band-aid workaround.

Error graph of the Uniform Sampling implementation on the Cardiac dataset:

![](https://i.imgur.com/XqJ1u3g.png)

Unfortunately, the error graph for the uniform sampling approach is not entirely visible due to massive outliers around the 7000th selection step. Regraphing with the outlier values clipped to 1, we get:

![](https://i.imgur.com/qQDWktF.png)

Here we can see that the rank error takes significantly longer to drop off than it does for the TSLU approach. Currently, I am not sure why this outlier occurs, given that the approximation error should theoretically not exceed 1 (although I may be wrong on this) and that the TSLU implementation does not contain such an outlier.

## 3d Implementation Error Histogram

![](https://i.imgur.com/P4BpqNS.png)

For some reason, the approximation error of the TSLU implementation is orders of magnitude higher than the approximation error of both the Random Sampling and Progressive Sampling implementations. This may be caused by the algorithm over-exploiting the best possible action instead of exploring the rewards of other potential actions, or perhaps the blocks are constructed in a way that groups low-information fibers with high-information ones, or I've just been implementing this incorrectly the entire time. :(

![](https://i.imgur.com/0amBMXM.png)

I'm not sure why the initial TS learning curve looks like this, but at the end the log error of TS is higher than the log error of both PS and RS.

![](https://i.imgur.com/qxWpOO5.png)

First row = PS approach, second row = RS, third row = TSLU. The images from the TSLU approach are only slightly clearer than those from the RS approach and significantly less clear than those from the PS approach.