## Emergence of Syntax through Speed of Learning

The hypothesis is that syntax is the structure that allows models to learn to predict words in the fastest way. To probe this hypothesis, we can set up an experiment as follows. At step $t$:

1. Sample a parameter $\theta_t \sim \text{init}(\theta)$ or set $\theta_t = \theta_0$.
2. Train an LM $p(x; \theta_t)$ with attention $q(x; \phi_{t-1})$ for $K$ iterations.
3. Train the attention masking $q(x; \phi_t)$ to decrease the loss of $p(x; \theta_t)$.

(A minimal code sketch of this loop appears at the end of the section.)

In step 1, $\theta$ is re-initialized. This has the effect of reducing co-adaptation between $p$ and $q$: in a sense, the fact that $p$ gets better at language modeling *diminishes* the signal for learning a good $q$. This has already been observed in the setting of VAEs in https://arxiv.org/pdf/1802.04537.pdf, where increasing the number of samples for the IWAE estimator is beneficial for $p_\theta$ but detrimental to $q_\phi$. Here, this shows up in the fact that, to a certain extent, if the generative model $p_\theta$ is overly powerful, then $q_\phi$ collapses to trivial solutions. We argue that this problem can be solved by training $q_\phi$ to decrease the loss of a partially trained $p_\theta$ model, or equivalently, to increase the learning speed of $p_\theta$.

It occurs to me that this can also be tested in the context of SCAN, where $q_\phi$ is just the alignment between input and output tokens, e.g. a model that outputs an attention matrix for each step, and $p_\theta$ is the encoder-decoder architecture. Procedurally, one would proceed as follows (see the alignment snippet at the end of the section):

1. Initialize a random alignment between the input and output words of each sentence pair.
2. Train the decoder model for $K$ epochs.
3. Train the alignments such that they decrease the loss of the decoder model.
4. Loop steps 2–3.

Investigate the relationship between this and variational Bayesian EM: https://cse.buffalo.edu/faculty/mbeal/thesis/beal03_2.pdf.
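Below is a minimal sketch of the alternating loop above, under toy assumptions: `MaskedLM` stands in for $p(x; \theta)$ as a tiny next-token predictor, and `AttnMask` stands in for $q(\cdot; \phi)$ as a single learned soft causal mask shared across inputs (rather than a fully input-dependent $q(x; \phi)$). All names, sizes, and the random toy data are illustrative and not from the notes above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, SEQ_LEN, DIM, K_INNER, OUTER_STEPS = 50, 16, 32, 200, 5

def sample_toy_batch(batch_size=64):
    """Random token sequences standing in for real language data."""
    return torch.randint(0, VOCAB, (batch_size, SEQ_LEN))

class MaskedLM(nn.Module):
    """p(x; theta): predicts token x_i from a q-weighted average of x_{<i}."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.out = nn.Linear(DIM, VOCAB)

    def forward(self, tokens, attn):            # attn: (SEQ_LEN, SEQ_LEN)
        h = self.embed(tokens)                  # (B, T, D)
        ctx = torch.einsum("ij,bjd->bid", attn, h)
        return self.out(ctx)                    # logits, (B, T, V)

class AttnMask(nn.Module):
    """q(.; phi): a soft, strictly causal attention mask shared across inputs."""
    def __init__(self):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(SEQ_LEN, SEQ_LEN))

    def forward(self):
        causal = torch.tril(torch.ones(SEQ_LEN, SEQ_LEN), diagonal=-1)
        masked = self.logits.masked_fill(causal == 0, float("-inf"))
        attn = torch.softmax(masked, dim=-1)
        return torch.nan_to_num(attn)           # position 0 has no context

def lm_loss(p, attn, tokens):
    logits = p(tokens, attn)
    # predict token i (for i >= 1) from positions < i
    return F.cross_entropy(logits[:, 1:].reshape(-1, VOCAB),
                           tokens[:, 1:].reshape(-1))

q = AttnMask()
opt_q = torch.optim.Adam(q.parameters(), lr=1e-2)

for t in range(OUTER_STEPS):
    p = MaskedLM()                              # 1. re-initialize theta_t
    opt_p = torch.optim.Adam(p.parameters(), lr=1e-3)

    for _ in range(K_INNER):                    # 2. train p for K iterations, q frozen
        loss = lm_loss(p, q().detach(), sample_toy_batch())
        opt_p.zero_grad(); loss.backward(); opt_p.step()

    for _ in range(50):                         # 3. train q against the partially trained p
        loss = lm_loss(p, q(), sample_toy_batch())
        opt_q.zero_grad(); loss.backward(); opt_q.step()

    print(f"step {t}: loss of partially trained p = {loss.item():.3f}")
```

The design choice this is meant to illustrate is that $q$ never gets to fit a converged $p$: because $\theta$ is re-initialized and only trained for $K$ inner iterations before $\phi$ is updated, lowering the loss of this partially trained $p$ amounts to making $p$ learn faster, which is the learning-speed signal the hypothesis is about.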
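For the SCAN variant, the outer loop is the same as in the sketch above; what changes is the form of $q_\phi$, which becomes one free alignment matrix per training pair. A hypothetical parameterization, again only a sketch with illustrative names:

```python
import torch
import torch.nn as nn

class SentenceAlignments(nn.Module):
    """q_phi for SCAN: one learnable alignment matrix per (input, output) pair."""
    def __init__(self, out_lens, in_lens):
        super().__init__()
        # out_lens[k] x in_lens[k] logits for the k-th training sentence pair
        self.logits = nn.ParameterList(
            [nn.Parameter(torch.zeros(o, i)) for o, i in zip(out_lens, in_lens)]
        )

    def forward(self, idx):
        # soft alignment of each output token to input tokens for sentence idx
        return torch.softmax(self.logits[idx], dim=-1)

# e.g. q = SentenceAlignments(out_lens=[5, 7], in_lens=[3, 4]); A = q(0)  # (5, 3)
```

Here the encoder-decoder $p_\theta$ would consume these alignments in place of its own attention, and steps 2–4 of the SCAN procedure correspond to the two inner loops of the previous sketch, with the per-sentence matrices playing the role of $\phi$.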
## Possible Tasks