# Stochastic search variable selection (SSVS) Tutorial

## Variable Selection

Stochastic search variable selection (SSVS), introduced by George and McCulloch[1], is one of the prominent Bayesian variable selection approaches for regression problems. Some of the basic principles of modern Bayesian variable selection methods, such as the use of a vector of variable inclusion indicators, were first introduced via the SSVS algorithm. SSVS can effectively search large model spaces, identifying the maximum a posteriori and median probability models, and also readily produces Bayesian model averaging estimates. A number of generalizations and extensions of the method have appeared in the statistical literature, applying SSVS to a variety of settings such as generalized linear models, contingency tables, time series data, and factor analysis. This tutorial explains the basics of the SSVS algorithm.

The first step in understanding SSVS is to define the term *Bayes factor*:

> A Bayes factor is the ratio of the likelihood of one particular hypothesis to the likelihood of another. It can be interpreted as a measure of the strength of evidence in favor of one theory among two competing theories.

$$
BF=\frac{P(D|H_1)}{P(D|H_0)}
$$

In other words, the Bayes factor compares two competing theories and weighs the relative chances of occurrence of one against the other given the data. For variable selection, the Bayes factor is reformulated to answer the critical question of whether or not to include a particular variable:

$$
BF=\frac{\int\limits_{\beta \,\in\, M_{k\backslash j}} L( \beta \,|\,M_{k\backslash j})\, \pi ( \beta \,|\,M_{k\backslash j})\, d\beta }{\int\limits_{\beta \,\in\, M_{k}} L( \beta \,|\,M_{k})\, \pi ( \beta \,|\,M_{k})\, d\beta }
$$

In the above expression, $\beta$ represents the coefficients of the variables in a standard linear regression setting. The prior placed on each $\beta_j$ is the spike-and-slab prior, which puts a point mass at zero and smears the remaining mass with a dispersion of $\psi$:

$$
\pi(\beta_j)=\pi_0\,\delta(\beta_j=0)+(1-\pi_0)\,N(\beta_j\,|\,0,\psi^2).
$$

$M_{k\backslash j}$ is the model excluding the $j^{th}$ variable and $M_{k}$ is the model including it. The BF value provides a "score" of how model $M_{k\backslash j}$ fares against model $M_{k}$ given the data. The integrals above are cumbersome in general, but a closed-form solution is available when the regression has normal errors:

$$
y_i=\sum_{j=1}^p f_j(x_{i,j})\,\beta_j +\epsilon_i,\qquad i=1,\dots, N.
$$

For linear regression with normal errors, the BF expression reduces to

$$
BF=\frac{\sqrt{2\pi }\, \psi \, e^{\frac{1}{2} \mu ^{T}_{\backslash j} E^{-1}_{\backslash j} \mu _{\backslash j}}\,( 2\pi )^{k/2}\, |E_{\backslash j}|^{1/2}}{e^{\frac{1}{2} \mu ^{T} E^{-1} \mu }\,( 2\pi )^{(k+1)/2}\, |E|^{1/2}},
$$

$$
E_{\backslash j} =\left(\frac{X^{T}_{\backslash j} X_{\backslash j}}{\sigma ^{2}} +\frac{I}{\psi ^{2}}\right)^{-1},\qquad
E=\left(\frac{X^{T} X}{\sigma ^{2}} +\frac{I}{\psi ^{2}}\right)^{-1},
$$

where $\mu = E\,X^{T}y/\sigma^{2}$ and $\mu_{\backslash j} = E_{\backslash j}\,X^{T}_{\backslash j}y/\sigma^{2}$ are the usual conjugate posterior means of the coefficients. The expression simplifies in log form, $\log(BF) = \log(\text{numerator}) - \log(\text{denominator})$:

![](https://i.imgur.com/h1jrg2g.png)

In the above expression, $\psi$ is a hyperparameter to be set by the user; it controls how much dispersion is allowed in the $\beta$ coefficients.
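To make the closed-form expression concrete, here is a minimal MATLAB sketch of the log Bayes factor for dropping the $j^{th}$ column of the design matrix. The function `log_bayes_factor` and its interface are illustrative only and are not part of the downloadable code; working in log space avoids overflow in the exponential terms.

```
function logBF = log_bayes_factor(X, y, j, sigma2, psi2)
% Sketch of the closed-form log Bayes factor above: model without
% column j (numerator) versus the full model (denominator).
% Illustrative helper only; not part of the repository code.
p     = size(X, 2);
Xj    = X;  Xj(:, j) = [];                 % design matrix without column j
Pfull = X'*X /sigma2 + eye(p)  /psi2;      % full-model posterior precision, E^{-1}
Pred  = Xj'*Xj/sigma2 + eye(p-1)/psi2;     % reduced-model posterior precision
mu    = Pfull \ (X' *y) / sigma2;          % full-model posterior mean
muj   = Pred  \ (Xj'*y) / sigma2;          % reduced-model posterior mean
% |E|^{1/2} = |E^{-1}|^{-1/2}, hence the -0.5*log(det(precision)) terms
logNum = 0.5*log(2*pi*psi2) + 0.5*(muj'*Pred*muj) ...
         + 0.5*(p-1)*log(2*pi) - 0.5*log(det(Pred));
logDen = 0.5*(mu'*Pfull*mu) + 0.5*p*log(2*pi) - 0.5*log(det(Pfull));
logBF  = logNum - logDen;
end
```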
One way to set $\psi$ is to standardize all predictor variables to have zero mean and unit standard deviation and then set $\psi$ to 1.69, 1.96, or 2.25. To download the SSVS algorithm, interested users are directed to my personal [GitHub page](https://github.com/manukrisvt/Stochastic-Search-Variable-Selection-SSVS-//).

## Example problem

A simple linear regression toy example is created to demonstrate the applicability of the SSVS method:

$$
y_i=\sum_{j=1}^p f_j(x_{i,j})\,\beta_j +\epsilon_i,\qquad i=1,\dots, N.
$$

### Step 1: Generate pseudo data

$$
Y_i=-0.41X_1+1.89X_2+3.51X_3 -2.38X_4+6.94X_{10}
$$

Fifty data points of $y$ are generated using the above formula, where $X_1$ to $X_{10}$ are the possible predictor variables.

![](https://i.imgur.com/A8hOAP0.jpg)

The required task is to identify the pertinent predictor variables and find their coefficient values. To recreate this setting, a candidate library of 10 variables ($X_1$ to $X_{10}$) is created; its box plot is shown below.

![](https://i.imgur.com/vkMbk0Y.jpg)

The main aim of the algorithm is to fish out the correct $X$ variables that went into forming the response and to estimate the coefficients of those regressors.

```
init_data.predictor_np=10;                                       % No of predictor terms
init_data.order=1;                                               % No of order of terms to be included
init_data.N=50;                                                  % Total no of data points
init_data.predictor=randn([init_data.N,init_data.predictor_np]); % Random predictor variable generation
init_data=generate_data(init_data);                              % Build the predictor library (predictor_column) using the specified order
init_data.noofterms=floor(size(init_data.predictor_column,2)/2); % Half of the library terms enter the true model
% rng(1,'v5uniform');
% init_data.actualmodelterms=randi([1,size(init_data.predictor_column,2)],[init_data.noofterms,1]);
init_data.actualmodelterms=sort(randsample(size(init_data.predictor_column,2),init_data.noofterms)); % Randomly select the true predictor variables
init_data.actualmodelpredictors=init_data.predictor_column(:,init_data.actualmodelterms);
beta_predictors=4*randn([length(init_data.actualmodelterms),1]); % Beta coefficient generation
init_data.response=init_data.actualmodelpredictors*beta_predictors+0.15*randn([init_data.N,1]); % Actual y term
```

### Step 2: SSVS algorithm initialization

This involves setting the hyperparameter $\psi$ and the prior belief $\pi_0$. Default values of 2.25 and 0.5, respectively, are used.

```
data_available.response=init_data.response;
% data_available.response=zscore(init_data.response);
% data_available.predictor_library=init_data.predictor_column;
data_available.predictor_library=zscore(init_data.predictor_column); % Standardized predictor library
data_available.org_predictor=init_data.predictor;
data_available.init_pi0=0.5;   % Prior belief pi_0
data_available.Xvar=2.25^2;    % Slab variance psi^2
```

### Step 3: Markov chain Monte Carlo

The algorithm now iterates through an MCMC run, producing a new candidate model at each iteration. Gibbs sampling is used because closed-form full conditionals are available; the sampler can be customized to use Metropolis-Hastings instead.
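Before reading the code, it helps to see how the Bayes factor enters each Gibbs step. With prior exclusion probability $\pi_0$ and the BF defined earlier (model without variable $j$ in the numerator), Bayes' theorem gives the conditional probability of excluding variable $j$ as

$$
P(\beta_j = 0 \mid \text{rest}) = \frac{\pi_0\, BF}{\pi_0\, BF + (1-\pi_0)}.
$$

This is the exclusion probability the sampler draws against (`pr.H0` in the code below); the identity is stated here for exposition and follows from the prior above, rather than being quoted from the repository.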
```
disp('Gibbs sampling running');
disp('%%%%%%%%%%%%%%%');
T=40000; flg=0;                          % No of MCMC iterations
p=size(init_data.predictor_column,2);
model.heatmap=cell(T,1);
model.global_indicator=cell(T,1);
model.global_indicator{1,1}=ones(1,p);   % Start from the full model
model.selected=cell(p,1);
model.sumterms=zeros(p,1);
N=length(data_available.response);
data_available.var_Y=11;                 % Initial guess for the noise variance
list_var=1:1:p;
for t=2:T
    model.indicator=model.global_indicator{t-1,1};
    data_available.beta=model.indicator;
    if (rem(t*100/T,20)==0)
        flg=flg+1;
        fprintf('Gibbs sampling: %d percent done\n',20*flg);
    end
    for j=1:p
        M1=j;
        M2=find(list_var~=j);
        [bf_inv,pr]=compute_bayes_fac_V2(M1,M2,data_available);
        if binornd(1,pr.H0)==1           % pr.H0: conditional probability of excluding variable j
            model.indicator(1,j)=0;
            model.betaind(1,j)=0;
        else
            model.indicator(1,j)=normrnd(pr.E,sqrt(pr.V)); % Draw beta_j from its conditional posterior
            model.betaind(1,j)=1;
        end
        data_available.beta=model.indicator;
    end
    model.residuals=data_available.response-data_available.predictor_library*model.indicator';
    model.B=(sum(model.residuals.^2))*0.5;
    model.phi(t,1)=gamrnd(N/2,1/model.B); % Gibbs draw of the error precision
    data_available.var_Y=1/model.phi(t,1);
    model.global_indicator{t,1}=model.indicator;
    model.global_betaindicator{t,1}=model.betaind;
    model.heatmap{t,1}=find(model.indicator~=0);
    if flg>1                              % Track visited model sizes after burn-in
        if isempty(model.heatmap{t,1})==0
            model.selected{length(model.heatmap{t,1}),1}(1,end+1)=t;
            model.sumterms(length(model.heatmap{t,1}),1)=model.sumterms(length(model.heatmap{t,1}),1)+1;
        end
    end
end
```

### Step 4: Post-processing results

This step mostly involves sifting through the results: finding the model size that appears the maximum number of times across the MCMC iterations, identifying the most frequently visited model of that size, and displaying it along with its averaged coefficients.

```
[~,Ind] = max(model.sumterms);           % Most frequently visited model size
newjj=1; jj=1; member=[]; clstr=0;
while(jj<length(model.selected{Ind,1}))  % Cluster iterations that visited the same indicator pattern
    tterm=model.selected{Ind,1}(1,jj);
    totmember=0; clstr=clstr+1; newjj=[];
    for kk=jj:length(model.selected{Ind,1})
        tterm2=model.selected{Ind,1}(1,kk);
        comp_model=model.global_betaindicator{tterm2,1};
        if(ismember(comp_model,model.global_betaindicator{tterm,1},'rows'))
            totmember=totmember+1;
            member{clstr,1}(totmember)=tterm2;
            clstrlen(clstr,1)=totmember;
        else
            newjj(1,end+1)=kk;
        end
    end
    if isempty(newjj)
        break
    else
        jj=newjj(1);
    end
end
[~,indc]=max(clstrlen);                   % Largest cluster = most visited model
model.betarray=cell2mat(model.global_indicator(member{indc,1}));
model.meanbetarray=mean(model.betarray);  % Averaged coefficients of that model
disp(model.meanbetarray);
```
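A simpler complementary summary, sketched below, is the marginal posterior inclusion probability of each candidate variable, computed directly from the stored indicator draws. It reuses `model` and `T` from Step 3 and assumes the first half of the chain is discarded as burn-in; this summary is an illustrative addition, not part of the repository's post-processing.

```
% Marginal posterior inclusion probabilities (illustrative sketch;
% reuses model and T from Step 3, discarding the first half as burn-in)
burnin=T/2;
draws=cell2mat(model.global_betaindicator(burnin+1:T)); % (T/2) x p matrix of 0/1 indicators
incl_prob=mean(draws,1);                                % Fraction of draws that include each variable
disp('Posterior inclusion probabilities:');
disp(incl_prob);
median_model=find(incl_prob>0.5);                       % Median probability model
disp('Variables in the median probability model:');
disp(median_model);
```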