# Explain/get rid of $\beta$
- [x] $b_r = F(\mathbf{s}) - J\mathbf{s}$, i.e. $b_r = \mathbf{o} - J\mathbf{s}$
This is what we currently do: calculate $b_r$ and $J$ for each training sample and average over them (sketch below).
:::danger
Needs $\beta$
:::
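A minimal sketch of this per-sample estimate, assuming a callable `F` that maps a subject representation $\mathbf{s}$ to the object representation $\mathbf{o}$; the name `F`, the `subjects` list, and the use of `torch.autograd.functional.jacobian` are assumptions for illustration, not the actual pipeline.
```python
import torch
from torch.autograd.functional import jacobian

def per_sample_estimate(F, subjects):
    """Estimate J_i and b_i = F(s_i) - J_i s_i for each training sample, then average."""
    Js, bs = [], []
    for s in subjects:
        J_i = jacobian(F, s)        # local Jacobian dF/ds at this sample
        o_i = F(s)                  # o = F(s)
        Js.append(J_i)
        bs.append(o_i - J_i @ s)    # b_r = o - J s for this sample
    return torch.stack(Js).mean(0), torch.stack(bs).mean(0)
```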
- [x] $b_r = \mathbf{o_{mean}} - J_{mean} \mathbf{s_{mean}}$
:::danger
Needs $\beta$
:::
- [x] $b_r = F(\mathbf{s_{mean}}) - J\mathbf{s_{mean}}$ (combined sketch below)
:::danger
Needs $\beta$
:::
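For comparison, a combined sketch of the two "means first" variants above; `F`, `subjects`, `objects`, and `jacobians` are assumed to be the same hypothetical per-sample quantities as in the first sketch.
```python
import torch

def mean_first_estimates(F, subjects, objects, jacobians):
    """Two 'means first' variants of b_r from per-sample s_i, o_i, J_i."""
    s_mean = torch.stack(subjects).mean(0)
    o_mean = torch.stack(objects).mean(0)
    J_mean = torch.stack(jacobians).mean(0)
    b_from_means = o_mean - J_mean @ s_mean      # b_r = o_mean - J_mean s_mean
    b_at_s_mean = F(s_mean) - J_mean @ s_mean    # b_r = F(s_mean) - J s_mean
    return b_from_means, b_at_s_mean
```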
- [ ] How does $|\mathbf{o}|$ change with $|\mathbf{s}|$?
- [x] Calculate with unit-norm $\mathbf{s}$ and check whether we still need $\beta$
==> Doesn't work. $|\mathbf{s}|$ seems to have little effect on $|\mathbf{o}|$
- [ ] Only consider samples with smaller $|\mathbf{s}|$ when calculating $b_r$? Maybe that would skew $b_r$ towards the training samples less.
==> Will not work for the same reason stated above
- [ ] Make all the $|\mathbf{s}|$ equal before doing any calculations (sketch below)
==> Will not work for the same reason stated above
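A minimal sketch of that normalization step, assuming the same hypothetical `subjects` list of per-sample vectors as above; the `target_norm` value is arbitrary.
```python
import torch

def rescale_subjects(subjects, target_norm=1.0):
    """Rescale every s_i to a common norm (unit norm by default)
    before estimating J and b_r, so |s| cannot drive the result."""
    return [s * (target_norm / s.norm()) for s in subjects]
```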
- [x] Do some sort of weighted averaging over $\mathbf{o}$?
- [x] Make all $|\mathbf{o}|$ equal to the smallest value and then do the averaging, so every sample gets the same weight (sketch below).
==> Still needs $\beta$
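A sketch of that equal-weight averaging, assuming a hypothetical `objects` list of per-sample $\mathbf{o}_i$ tensors.
```python
import torch

def equalize_and_average(objects):
    """Shrink every o_i to the smallest |o_i| before averaging,
    so no single training sample dominates the mean."""
    min_norm = min(o.norm() for o in objects)
    rescaled = [o * (min_norm / o.norm()) for o in objects]
    return torch.stack(rescaled).mean(0)
```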
- [x] For the per-sample calculation, do a binary search over $\beta$: find the smallest $\beta$ that doesn't change $\mathbf{o}$ (sketch after the logs below). (Just another way of arriving at $\beta$.)
==> Still needs $\beta$
==> Weird
$z_{proj} = W_r(\mathbf{s}_i)$ --- [$W_r$ calculated for the current sample]
```
Chile -> Santiago | z_norm=258.0 | z_proj=30.75 || o_pred=([(':[', 0.014), (' ready', 0.012), (' demanding', 0.011)], {})
France -> Paris | z_norm=261.5 | z_proj=21.828125 || o_pred=([(' Mil', 0.068), (' Los', 0.041), (' Roma', 0.034)], {})
China -> Beijing | z_norm=247.875 | z_proj=24.9375 || o_pred=([(' Los', 0.032), (' Baltimore', 0.029), (' Oakland', 0.022)], {})
Peru -> Lima | z_norm=237.375 | z_proj=31.296875 || o_pred=([(' Chicago', 0.021), (' commercial', 0.013), (' campaign', 0.011)], {})
Brazil -> Brasília | z_norm=269.75 | z_proj=24.140625 || o_pred=([(' Pentagon', 0.04), (' Harlem', 0.037), (' White', 0.033)], {})
```
$z_{proj} = W_{mean}(\mathbf{s}_i)$ --- [$W_{mean}$ is the mean of all $W_r$]
```
Chile -> Santiago | z_norm=258.0 | z_proj=52.0625 || o_pred=([(' Santiago', 0.942), (' Chili', 0.033), (' Chile', 0.008)], {})
France -> Paris | z_norm=261.5 | z_proj=51.15625 || o_pred=([(' Paris', 0.663), (' France', 0.141), (' French', 0.042)], {})
China -> Beijing | z_norm=247.875 | z_proj=51.71875 || o_pred=([(' Beijing', 0.504), (' Chinese', 0.335), (' China', 0.076)], {})
Peru -> Lima | z_norm=237.375 | z_proj=55.34375 || o_pred=([(' Lima', 0.129), (' Pi', 0.017), ('ée', 0.013)], {})
Brazil -> Brasília | z_norm=269.75 | z_proj=54.34375 || o_pred=([(' Bras', 0.927), (' bras', 0.011), (' Brazil', 0.009)], {})
```
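A minimal sketch of the per-sample binary search over $\beta$ mentioned above. It assumes a hypothetical `decode` helper that maps a predicted vector to its top token, a `target` token to preserve, and that the prediction flips monotonically in $\beta$.
```python
def smallest_beta(J, b_r, s, decode, target, iters=20):
    """Smallest beta in (0, 1] such that decoding J s + beta * b_r
    still yields `target` (e.g. the correct object token)."""
    lo, hi = 0.0, 1.0
    assert decode(J @ s + hi * b_r) == target   # must already hold at beta = 1
    for _ in range(iters):
        mid = (lo + hi) / 2
        if decode(J @ s + mid * b_r) == target:
            hi = mid    # prediction preserved: try a smaller beta
        else:
            lo = mid    # prediction lost: need a larger beta
    return hi
```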
- [x] Set $J\mathbf{s} = 0$ and drop it entirely.
The corner method doesn't take $J\mathbf{s}$ into account at all. But we saw that, when applied with the projection term as $\mathbf{o} = W_r\mathbf{s} + \text{corner}$, it kind of worked.
:::danger
Needs $\beta$
:::
:::danger
Setting $b_r = \text{corner}$ (the corner calculated by averaging the embeddings, i.e. the rows of the decoder head matrix; sketch below) also needs $\beta$
:::
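A minimal sketch of that corner computation, assuming `decoder_head` is the unembedding matrix with one row per vocabulary token (the name is an assumption).
```python
def corner_bias(decoder_head):
    """b_r = corner: the mean of the decoder-head rows (token embeddings)."""
    return decoder_head.mean(dim=0)

# Used as in the corner method above: o_pred = W_r @ s + corner_bias(decoder_head)
```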
- [ ] The Taylor series expansion is really
$$F(\mathbf{s}) = F(\mathbf{s_0}) + J(\mathbf{s} - \mathbf{s_0}) + \epsilon$$
so the per-sample estimate becomes
$$b_r = F(\mathbf{s}) - J\mathbf{s} = F(\mathbf{s_0}) - J\mathbf{s_0} + \epsilon$$
So the higher-order noise term $\epsilon$ gets absorbed into $b_r$. And when we reduce $b_r$ by multiplying it by $\beta\ (<1)$, are we in turn just reducing this noise term $\epsilon$?
## TODO:
- [x] Fix $\beta=1$ and search over layers? (will mess up the efficacy vs faithfulness scatterplots)
- [x] Fix hparams on some trials and report the numbers of different trials.
- [x] J after layer-norm, with $\beta = 1$
- [x] Would it make sense to amplify the projection term instead of reducing the translation? (sketch after this list)
- [x] Can we make $\beta$ a global constant?
- [x] Check the sweeps for the same $\beta$
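A tiny sketch of the rebalancing idea from the amplification item above; the `lre_predict` helper and its `alpha`/`beta` arguments are assumptions about how the two knobs could be wired in, not the existing code.
```python
def lre_predict(J, s, b_r, beta=1.0, alpha=1.0):
    """LRE-style prediction o_hat = alpha * (J s) + beta * b_r.

    beta < 1 shrinks the translation term (the current approach);
    alpha > 1 would amplify the projection term instead.
    """
    return alpha * (J @ s) + beta * b_r
```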