# Research Thoughts
## Emergence of Syntax through Speed of Learning
The hypothesis is that syntax is the structure that allows models to learn to predict words as fast as possible. To probe this hypothesis, we can set up an experiment as follows:
At step $t$:
1. Sample parameters $\theta_t \sim \text{init}(\theta)$, or set $\theta_t = \theta_0$.
2. Train an LM $p(x; \theta_t)$ with attention $q(x; \phi_{t-1})$ for K iterations.
3. Train the attention masking $q(x; \phi_t)$ to decrease the loss of $p(x; \theta_t)$.
In 1., $\theta$ is re-initialized, which reduces co-adaptation between $p$ and $q$. In a sense, the fact that $p$ gets better at performing language modeling *diminishes* the signal for learning a good $q$. This has already been observed in the setting of VAEs (https://arxiv.org/pdf/1802.04537.pdf), where increasing the number of samples in the IWAE estimator is beneficial for $p_\theta$ but detrimental to $q_\phi$. Here, this is exemplified by the fact that, to a certain extent, if the generative model $p_\theta$ is overly powerful, then $q_\phi$ collapses to trivial solutions. We argue that this problem can be solved by training $q_\phi$ to decrease the loss of a partially trained $p_\theta$, or equivalently, to increase the learning speed of $p_\theta$.
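Below is a minimal PyTorch sketch of this alternating procedure, a sketch under assumptions rather than a definitive implementation: $q$ is a toy bilinear attention model, random tokens stand in for a real corpus, and all sizes and hyperparameters are illustrative.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, EMB, SEQ, K, OUTER = 100, 32, 16, 50, 10

def make_p():
    # Toy LM p(x; theta): embeddings plus an output head; the context at
    # each position is a q-weighted average of the preceding embeddings.
    return nn.ModuleDict({"emb": nn.Embedding(VOCAB, EMB),
                          "out": nn.Linear(EMB, VOCAB)})

def lm_loss(p, q, x):
    h = p["emb"](x)                                    # (B, T, E)
    scores = q(h) @ h.transpose(1, 2) / EMB ** 0.5     # (B, T, T)
    causal = torch.triu(torch.full((x.size(1),) * 2, float("-inf")), 1)
    ctx = torch.softmax(scores + causal, dim=-1) @ h   # (B, T, E)
    logits = p["out"](ctx[:, :-1])                     # predict x[:, 1:]
    return F.cross_entropy(logits.reshape(-1, VOCAB), x[:, 1:].reshape(-1))

q = nn.Linear(EMB, EMB)        # attention model q(.; phi), bilinear scores
opt_q = torch.optim.Adam(q.parameters(), lr=1e-3)

for t in range(OUTER):
    p = make_p()               # step 1: re-initialize theta_t
    opt_p = torch.optim.Adam(p.parameters(), lr=1e-3)
    for _ in range(K):         # step 2: train p for K iterations, q fixed
        x = torch.randint(VOCAB, (64, SEQ))   # stand-in for real text
        loss = lm_loss(p, q, x)
        opt_p.zero_grad(); loss.backward(); opt_p.step()
    x = torch.randint(VOCAB, (64, SEQ))
    loss = lm_loss(p, q, x)    # step 3: train q to lower the loss of p
    opt_q.zero_grad(); loss.backward(); opt_q.step()
```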
It occurs to me that this can also be tested in the context of SCAN, where $q_\phi$ is just the alignment between input and output tokens, e.g. a model that outputs an attention matrix for each step, and $p_\theta$ is the encoder-decoder architecture.
Procedurally, one would proceed as follows:
1. Initialize a random alignment between input and output words for each sentence.
2. Train the decoder model for K epochs.
3. Train the alignments to decrease the loss of the decoder model.
4. Loop 2-3 (see the sketch below).
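A hedged sketch of this loop on toy data, with one alignment matrix per sentence held as free parameters; the decoder, vocabulary sizes, and random command/action pairs are stand-ins for the actual SCAN setup.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

IN_LEN, OUT_LEN, VOCAB_IN, VOCAB_OUT, EMB = 6, 8, 20, 10, 32
N_SENT, K = 100, 25

# Step 1: one (OUT_LEN x IN_LEN) alignment logit matrix per sentence.
align_logits = nn.Parameter(torch.randn(N_SENT, OUT_LEN, IN_LEN))
emb_in = nn.Embedding(VOCAB_IN, EMB)
decoder = nn.Sequential(nn.Linear(EMB, 64), nn.Tanh(), nn.Linear(64, VOCAB_OUT))

x = torch.randint(VOCAB_IN, (N_SENT, IN_LEN))    # toy commands
y = torch.randint(VOCAB_OUT, (N_SENT, OUT_LEN))  # toy action sequences

def decoder_loss():
    a = torch.softmax(align_logits, dim=-1)      # (N, OUT_LEN, IN_LEN)
    ctx = a @ emb_in(x)                          # aligned input context
    logits = decoder(ctx)                        # (N, OUT_LEN, VOCAB_OUT)
    return F.cross_entropy(logits.reshape(-1, VOCAB_OUT), y.reshape(-1))

opt_dec = torch.optim.Adam(list(decoder.parameters())
                           + list(emb_in.parameters()), lr=1e-3)
opt_align = torch.optim.Adam([align_logits], lr=1e-2)

for _ in range(10):                              # step 4: loop 2-3
    for _ in range(K):                           # step 2: train the decoder
        loss = decoder_loss()
        opt_dec.zero_grad(); loss.backward(); opt_dec.step()
    loss = decoder_loss()                        # step 3: train the alignments
    opt_align.zero_grad(); loss.backward(); opt_align.step()
```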
Investigate the relationship between this and variational Bayesian EM (https://cse.buffalo.edu/faculty/mbeal/thesis/beal03_2.pdf).
## Test-time inference for compositional generalization
```text
jump -> JUMP
walk twice -> WALK WALK
-----------------------
jump twice -> JUMP JUMP
```
How can we infer the embedding of *jump* given a new, unseen context? In other words, how do we estimate $p(e_{\text{jump}} \mid \text{jump twice}, \mathcal{D})$, where $\mathcal{D}$ is the training set? (Note this does not make use of $y$.)
A first step in this direction would be to (unrealistically) assume that we have $y$ for $\text{jump twice}$ and update only $e_{\text{jump}}$ by gradient descent. It seems plausible that gradient descent will move $e_{\text{jump}}$ towards $e_{\text{walk}}$ (the embedding of *walk*), but it is not clear whether the model can solve the task by updating $e_{\text{jump}}$ alone (e.g., the model may have a completely entangled representation of $\text{walk twice}$).
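Here is a minimal sketch of that first step, assuming we are handed the target $\text{JUMP JUMP}$: freeze all parameters and run gradient descent on $e_{\text{jump}}$ alone. The mean-pooled linear "decoder" is a hypothetical stand-in for a trained SCAN model.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab = {"jump": 0, "walk": 1, "twice": 2}
actions = {"JUMP": 0, "WALK": 1}
EMB = 32

emb = nn.Embedding(len(vocab), EMB)
model = nn.Linear(EMB, len(actions))   # stand-in for the trained decoder

for p in list(emb.parameters()) + list(model.parameters()):
    p.requires_grad_(False)            # freeze everything...

# ...except a leaf copy of the row for "jump", which we optimize.
e_jump = emb.weight[vocab["jump"]].clone().requires_grad_(True)
opt = torch.optim.SGD([e_jump], lr=0.1)

x = torch.tensor([vocab["jump"], vocab["twice"]])
y = torch.tensor([actions["JUMP"], actions["JUMP"]])   # "JUMP JUMP"

for _ in range(100):
    h = emb(x)                                          # (2, EMB), frozen
    h = torch.where((x == vocab["jump"]).unsqueeze(-1), e_jump, h)
    logits = model(h.mean(0, keepdim=True)).repeat(len(y), 1)
    loss = F.cross_entropy(logits, y)
    opt.zero_grad(); loss.backward(); opt.step()

# Did e_jump drift towards e_walk?
print(F.cosine_similarity(e_jump, emb.weight[vocab["walk"]], dim=0))
```
In the real experiment one would substitute the trained model for the toy stand-in, then check both whether the loss can be driven down at all and whether $e_{\text{jump}}$ drifts towards $e_{\text{walk}}$.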
### Distilling a powerful inference procedure
Another thought in this respect is whether we can train the forward pass of a neural net to distill a powerful inference procedure that is quite costly to perform. For example, consider a network $f$ and its application to an input $x$, $f(x; \theta)$. $f$ produces internal states $H(x; \theta)$ and an output $o(x; \theta)$. If we have a set of examples $X$ related to $x$, we can train $H(x; \theta)$ and $o(x; \theta)$ to imitate the $H(x; \theta')$ and $o(x; \theta')$ obtained with $\theta' = \arg\min_\theta L(X; \theta)$. This needs to be formalized more: what does this mean in practice? Two candidates (the second is sketched in code after the list):
- MAML-style: $\min_\theta L\big(x; \theta - \alpha \nabla_\theta L(X; \theta)\big)$
- Distillation-style: set $\theta' = \arg\min_\theta L(X; \theta)$, then minimize $\lVert H(x; \theta) - H(x; \theta') \rVert$ and $\lVert o(x; \theta) - o(x; \theta') \rVert$
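A hedged sketch of the second bullet, assuming a toy two-layer $f$ and random data: fine-tune a copy of $f$ on the related set $X$ to obtain $\theta'$, then train $\theta$ so that a plain forward pass on $x$ reproduces $H(x; \theta')$ and $o(x; \theta')$.
```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.hidden = nn.Linear(8, 16)
        self.out = nn.Linear(16, 4)
    def forward(self, x):
        h = torch.tanh(self.hidden(x))   # internal state H(x; theta)
        return h, self.out(h)            # (H(x; theta), o(x; theta))

f = Net()
x = torch.randn(1, 8)                               # the query input x
X, Y = torch.randn(32, 8), torch.randint(4, (32,))  # related examples

# theta' = argmin_theta L(X; theta): fine-tune a copy of f on X.
f_prime = copy.deepcopy(f)
opt = torch.optim.Adam(f_prime.parameters(), lr=1e-2)
for _ in range(100):
    _, logits = f_prime(X)
    loss = F.cross_entropy(logits, Y)
    opt.zero_grad(); loss.backward(); opt.step()

# Distill: train theta so the plain forward pass on x reproduces
# H(x; theta') and o(x; theta').
with torch.no_grad():
    h_target, o_target = f_prime(x)
opt = torch.optim.Adam(f.parameters(), lr=1e-3)
for _ in range(100):
    h, o = f(x)
    loss = F.mse_loss(h, h_target) + F.mse_loss(o, o_target)
    opt.zero_grad(); loss.backward(); opt.step()
```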
During training, I see $A \rightarrow a$ and $B, T \rightarrow b, b$. During testing, I see $A, T \rightarrow a, a$. The Bayesian view: $p(y \mid x, \mathcal{D}) = \int p(y \mid x, f)\, p(f \mid \mathcal{D})\, df$.
Inference procedure to estimate $p(x | \tilde x)$.
## Sharpness-Aware Minimization for Spurious Correlations
The idea is to take adversarial weight-perturbation steps in the direction of spurious correlations.
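A minimal sketch of one such step, following the SAM update of https://arxiv.org/abs/2010.01412 but computing the ascent direction on a separate batch assumed to expose the spurious correlation; the `x_spur, y_spur` split is an assumption of this sketch, not a worked-out part of the idea.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(10, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
rho = 0.05   # perturbation radius, as in SAM

def sam_step(x, y, x_spur, y_spur):
    # 1. The gradient on the "spurious" batch gives the ascent direction.
    loss = F.cross_entropy(model(x_spur), y_spur)
    grads = torch.autograd.grad(loss, list(model.parameters()))
    norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
    eps = [rho * g / (norm + 1e-12) for g in grads]
    # 2. Perturb the weights, take the gradient there, then restore and step.
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            p.add_(e)
    loss = F.cross_entropy(model(x), y)
    opt.zero_grad(); loss.backward()
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            p.sub_(e)
    opt.step()

sam_step(torch.randn(16, 10), torch.randint(2, (16,)),
         torch.randn(16, 10), torch.randint(2, (16,)))
```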
## Data Augmentation Forcing
The idea is to reduce the distributional shift between real and augmented data.