# Explain/get rid of $\beta$

- [x] $b_r = F(\mathbf{s}) - J\mathbf{s}$, i.e. $b_r = \mathbf{o} - J\mathbf{s}$. What we do: calculate $b_r$ and $J$ for each training sample and average over them (see the sketch after this list).

:::danger
Needs $\beta$
:::

- [x] $b_r = \mathbf{o_{mean}} - J_{mean} \mathbf{s_{mean}}$

:::danger
Needs $\beta$
:::

- [x] $b_r = F(\mathbf{s_{mean}}) - J\mathbf{s_{mean}}$

:::danger
Needs $\beta$
:::

- [ ] How does the magnitude $|\mathbf{o}|$ change with $|\mathbf{s}|$?
    - [x] Calculate with unit $\mathbf{s}$ and check if we still need the $\beta$ ==> Doesn't work. $|\mathbf{s}|$ seems to have little effect on $|\mathbf{o}|$
    - [ ] Only consider samples with smaller $|\mathbf{s}|$ while calculating $b_r$? Maybe then it doesn't skew $b_r$ towards the training samples as much. ==> Will not work for the same reason stated above
    - [ ] Make all the $|\mathbf{s}|$ equal before doing any calculations ==> Will not work for the same reason stated above
- [x] Do some sort of weighted averaging over $\mathbf{o}$?
    - [x] Make all $|\mathbf{o}|$ equal to their smallest value and then do the averaging, so they are all given the same weight. ==> Still needs $\beta$
- [x] For the calculation using each sample, do a binary search over $\beta$: the smallest $\beta$ that doesn't change $\mathbf{o}$. (Just another way of getting to $\beta$; see the sketch at the end of this note.) ==> Still needs $\beta$ ==> Weird

$z_{proj} = W_r(\mathbf{s}_i)$ --- [$W_r$ calculated for the current sample]

```
Chile -> Santiago | z_norm=258.0 | z_proj=30.75 || o_pred=([(':[', 0.014), (' ready', 0.012), (' demanding', 0.011)], {})
France -> Paris | z_norm=261.5 | z_proj=21.828125 || o_pred=([(' Mil', 0.068), (' Los', 0.041), (' Roma', 0.034)], {})
China -> Beijing | z_norm=247.875 | z_proj=24.9375 || o_pred=([(' Los', 0.032), (' Baltimore', 0.029), (' Oakland', 0.022)], {})
Peru -> Lima | z_norm=237.375 | z_proj=31.296875 || o_pred=([(' Chicago', 0.021), (' commercial', 0.013), (' campaign', 0.011)], {})
Brazil -> Brasília | z_norm=269.75 | z_proj=24.140625 || o_pred=([(' Pentagon', 0.04), (' Harlem', 0.037), (' White', 0.033)], {})
```

$z_{proj} = W_{mean}(\mathbf{s}_i)$ --- [$W_{mean}$ is the mean of all $W_r$]

```
Chile -> Santiago | z_norm=258.0 | z_proj=52.0625 || o_pred=([(' Santiago', 0.942), (' Chili', 0.033), (' Chile', 0.008)], {})
France -> Paris | z_norm=261.5 | z_proj=51.15625 || o_pred=([(' Paris', 0.663), (' France', 0.141), (' French', 0.042)], {})
China -> Beijing | z_norm=247.875 | z_proj=51.71875 || o_pred=([(' Beijing', 0.504), (' Chinese', 0.335), (' China', 0.076)], {})
Peru -> Lima | z_norm=237.375 | z_proj=55.34375 || o_pred=([(' Lima', 0.129), (' Pi', 0.017), ('ée', 0.013)], {})
Brazil -> Brasília | z_norm=269.75 | z_proj=54.34375 || o_pred=([(' Bras', 0.927), (' bras', 0.011), (' Brazil', 0.009)], {})
```

- [x] Make $J\mathbf{s} = 0$ and drop it entirely. The corner method doesn't take the $J\mathbf{s}$ term into account at all. But we saw that, when applied with the projection term, $\mathbf{o} = W_r\mathbf{s} + \mathrm{corner}$ kind of worked.

:::danger
Needs $\beta$
:::

:::danger
Setting $b_r = \mathrm{corner}$ (corner calculated by averaging the embeddings, i.e. the rows of the decoder head matrix) also needs $\beta$
:::

- [ ] The Taylor series expansion is really
    $$F(\mathbf{s}) = F(\mathbf{s_0}) + J(\mathbf{s} - \mathbf{s_0}) + \epsilon$$
    $$b_r = F(\mathbf{s}) - J\mathbf{s} + \epsilon$$
    So the noise term $\epsilon$ gets absorbed into $b_r$. And when we reduce $b_r$ by multiplying it with $\beta\ (<1)$, are we in turn just reducing this noise term $\epsilon$?
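A minimal sketch of the per-sample estimate from the first item above: average $J$ and $b_r = F(\mathbf{s}) - J\mathbf{s}$ over the training samples, and apply $\beta$ to the translation term. The `forward_from_s` wrapper (mapping the subject representation $\mathbf{s}$ at the chosen layer to the object representation $\mathbf{o}$) is a hypothetical stand-in for however $F$ is actually evaluated inside the model, so this illustrates the calculation rather than the actual pipeline.

```python
# Sketch only: per-sample Jacobian + bias, averaged over training samples.
# `forward_from_s` is a hypothetical wrapper s -> o = F(s); the real code
# evaluates F inside the transformer and may differ.
import torch
from torch.autograd.functional import jacobian

def estimate_lre(forward_from_s, subject_reps, beta=1.0):
    Js, brs = [], []
    for s in subject_reps:                       # s: (hidden_dim,) tensor
        J = jacobian(forward_from_s, s)          # (hidden_dim, hidden_dim)
        o = forward_from_s(s)
        Js.append(J)
        brs.append(o - J @ s)                    # b_r = F(s) - J s for this sample
    J_mean = torch.stack(Js).mean(dim=0)
    b_mean = torch.stack(brs).mean(dim=0)
    # beta < 1 shrinks the translation term b_r; beta = 1 means no correction.
    return lambda s: J_mean @ s + beta * b_mean
```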
## TODO:

- [x] Fix $\beta=1$ and search over layers? (will mess up the efficacy vs faithfulness scatterplots)
- [x] Fix hparams on some trials and report the numbers for the different trials.
- [x] $J$ after layer-norm, with $\beta = 1$
- [x] Would it make sense to amplify the projection term instead of reducing the translation?
- [x] Can we make $\beta$ a global constant?
- [x] Check the sweeps for the same $\beta$
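For reference, a rough sketch of the per-sample binary search over $\beta$ from the list above (the smallest $\beta$ that doesn't change $\mathbf{o}$). It assumes a hypothetical `decode_top_token` helper (decoder head + argmax over the vocab) and that the top prediction changes monotonically as $\beta$ shrinks; both are assumptions for illustration, not the actual implementation.

```python
# Sketch only: smallest beta in [0, 1] that keeps the top prediction unchanged.
# `decode_top_token` (decoder head + argmax) is hypothetical, and the search
# assumes the prediction flips monotonically as beta decreases.
def smallest_beta(decode_top_token, W, b_r, s, tol=1e-3):
    target = decode_top_token(W @ s + b_r)       # prediction with beta = 1
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if decode_top_token(W @ s + mid * b_r) == target:
            hi = mid                             # still matches: try a smaller beta
        else:
            lo = mid                             # prediction changed: beta too small
    return hi
```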