# Choice of Shrinkage Prior

Main idea: the likelihood assigns similar, or even greater, value to the structure with the added edge (or to using both the added and expert paths). We therefore need to construct our prior so that the total objective value is higher for the ground-truth path than for the structure with the added edge.

## Initial Experiment and Problem

### Setup

![](https://i.imgur.com/bnT5Fco.png)

* $N = 100$ datapoints.
* The expert specifies the green path, and the ground-truth graph has the green-path structure. Our method considers $X \to Z$ as a candidate edge.
* We optimize our objective function under both structures. For the expert graph, we initialize $\sigma_{XZ}$ to 0 and then optimize.

### Desiderata

Our method should assign a greater objective value to the green-path structure (expert-specified and ground truth) than to the candidate structure (the green path plus the red edge).

## The Likelihood Weighs Both Structures Equally, So We Rely Heavily on the Prior

#### The likelihood assigns similar value to the added-edge path and the expert path.

1. Likelihood when using both paths ![](https://i.imgur.com/psHKM39.png)
2. Likelihood when preferring the added-edge path over the expert path ![](https://i.imgur.com/2525PHJ.png)
3. Likelihood when preferring the expert path over the added-edge path ![](https://i.imgur.com/rA84bPx.png)

## Overview and Definition

### The Horseshoe Prior

To encourage sparsity and a strong preference for the expert graph, we place a horseshoe prior on the amplitude hyperparameter $\sigma_{f_e}$ of each edge. There are two levels of shrinkage in our problem.

1. [Edge complexity] If the expert correctly specified the edge's complexity, we want $\sigma_{f_e}$ to go to 0, meaning the added GP component was not needed.
2. [Edge structure] If the expert correctly specified the graph structure, we also want $\sigma_{f_e}$ to go to 0, meaning the added candidate edge was not needed.

**Since making a structural change is more expensive than making a complexity change, we want to assign a stronger shrinkage prior to the $\sigma_{f_e}$'s whose edge $e$ is a candidate NOT specified by the expert.**

The horseshoe prior is defined as follows:

$$
\sigma_{f_e} \mid \lambda_e \sim \mathcal{N}(0, \lambda_e^2 \tau^2),
\qquad
\begin{cases}
\lambda_e \sim \mathcal{C}^+(0, b_e) & \text{if $e$ is specified by the expert} \\
\lambda_e \sim \mathcal{C}^+(0, b_c) & \text{otherwise,}
\end{cases}
$$

where $\tau = 1$ and $\mathcal{C}^+$ denotes the half-Cauchy distribution; taking $b_c < b_e$ imposes the stronger shrinkage on candidate edges.

The reparameterization trick (ref: [here](https://jmlr.org/papers/volume20/19-236/19-236.pdf)) states:

$$
\lambda \sim \mathcal{C}^+(0, b) \iff \lambda^2 \mid a \sim \text{Inv-Gamma}\Big(\tfrac{1}{2}, \tfrac{1}{a}\Big), \quad a \sim \text{Inv-Gamma}\Big(\tfrac{1}{2}, \tfrac{1}{b^2}\Big).
$$

Note: the half-normal was chosen because it has shorter tails than a half-Cauchy.

### Objective Function

$$
\beta^*, \theta^* = \underset{\beta, \theta}{\arg \max} \bigg( \sum_{i = 1}^N \log p(Y_1, \dots, Y_K) + \log p(\sigma_{f} \mid \lambda) + \log p(\lambda) \bigg)
$$

Finale suggests something like the following (read the paper on the horseshoe reparameterization trick: [here](https://jmlr.org/papers/volume20/19-236/19-236.pdf)):

$$
\beta^*, \theta^* = \underset{\beta, \theta}{\arg \max} \bigg( \sum_{i = 1}^N \log p(Y_1, \dots, Y_K) + \log p(\sigma_{f}) \bigg)
$$

We are maximizing the log posterior, where the first term is the data likelihood and the remaining terms come from the shrinkage prior.
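A minimal sketch, assuming a Python/SciPy implementation, of how the per-edge prior terms and the resulting objective could be assembled. The function names, the placeholder `log_lik`, and the scales `b_expert` / `b_candidate` are illustrative assumptions, not our actual code; the last few lines sanity-check the reparameterization trick by sampling.

```python
import numpy as np
from scipy import stats


def horseshoe_log_prior(sigma_f, lam, b, tau=1.0):
    """Per-edge prior terms: log p(sigma_f | lambda) + log p(lambda).

    sigma_f | lambda ~ Normal(0, lambda^2 * tau^2)
    lambda           ~ HalfCauchy(0, b)
    """
    log_p_sigma = stats.norm.logpdf(sigma_f, loc=0.0, scale=lam * tau)
    log_p_lambda = stats.halfcauchy.logpdf(lam, scale=b)
    return log_p_sigma + log_p_lambda


def total_objective(log_lik, edge_params, b_expert=1.0, b_candidate=0.1, tau=1.0):
    """Data log-likelihood plus the shrinkage prior summed over edges.

    edge_params: dict mapping edge -> (sigma_f, lambda, is_expert_edge).
    Using b_candidate < b_expert encodes the stronger shrinkage on
    edges the expert did not specify.
    """
    log_prior = 0.0
    for sigma_f, lam, is_expert in edge_params.values():
        b = b_expert if is_expert else b_candidate
        log_prior += horseshoe_log_prior(sigma_f, lam, b, tau=tau)
    return log_lik + log_prior


# Sanity check of the reparameterization trick:
#   a ~ Inv-Gamma(1/2, 1/b^2),  lambda^2 | a ~ Inv-Gamma(1/2, 1/a)
# should give lambda ~ HalfCauchy(0, b) marginally.
rng = np.random.default_rng(0)
b = 0.5
a = stats.invgamma.rvs(0.5, scale=1.0 / b**2, size=100_000, random_state=rng)
lam = np.sqrt(stats.invgamma.rvs(0.5, scale=1.0 / a, random_state=rng))
print(np.median(lam), stats.halfcauchy.median(scale=b))  # medians should roughly agree
```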
### Results

Optimizing the objective under both structures gives:

| Structure | Objective Value | Likelihood | Prior | Prior Value on $\sigma$ | Prior Value on $\lambda_{XZ}$ | Train RMSE |
| :--- | :----: | :----: | :----: | :----: | :----: | :----: |
| Expert | -1443.948 | 181.199 | -1625.148 | -6.285 | -1613.605 | 4.73e-5 |
| Candidate | 19.373 | 177.516 | -158.142 | -10.083 | -146.165 | 56.115 |

| Structure | $\sigma_{XZ}$ | $\lambda_{XZ}$ |
| :--- | :----: | :----: |
| Expert | 0 | 0.568 |
| Candidate | 1.721 | 0.173 |

#### Takeaway: we are penalizing a slightly incorrect $\lambda_{XZ}$ more than we are rewarding a good $\sigma_Y$.

### How do I weight a "good" value of $\sigma$ more than I penalize a "bad" value of $\lambda$?
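One hypothetical way to probe this question is to plug the two reported $(\sigma_{XZ}, \lambda_{XZ})$ pairs into the per-edge prior terms and vary the scale and the family on $\lambda$. The scale `b` below is a made-up placeholder (the experiment's actual scales are not recorded here), so the numbers will not reproduce the table; the sketch only shows how a short-tailed half-normal (cf. the note above) penalizes a moderate $\lambda$ far more harshly than a half-Cauchy at the same scale, which is the trade-off behind the question.

```python
from scipy import stats

# (sigma_XZ, lambda_XZ) as reported in the tables above.
settings = {"Expert": (0.0, 0.568), "Candidate": (1.721, 0.173)}
b = 0.1    # placeholder scale; not the value actually used in the experiment
tau = 1.0

for name, (sigma_xz, lam_xz) in settings.items():
    log_p_sigma = stats.norm.logpdf(sigma_xz, loc=0.0, scale=lam_xz * tau)
    log_p_lam_hc = stats.halfcauchy.logpdf(lam_xz, scale=b)  # heavy-tailed option
    log_p_lam_hn = stats.halfnorm.logpdf(lam_xz, scale=b)    # short-tailed option
    print(f"{name:9s}  log p(sigma|lam) = {log_p_sigma:8.3f}  "
          f"half-Cauchy log p(lam) = {log_p_lam_hc:8.3f}  "
          f"half-normal log p(lam) = {log_p_lam_hn:9.3f}")
```

Increasing the scale on $\lambda$ for expert-specified edges, or reserving the short-tailed family for candidate edges only, is one possible lever for weighting a good $\sigma$ more heavily relative to the $\lambda$ penalty.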