---
title: PCFG vs LLM
tags: Idea
description: Study inductive biases of LLMs when trained to generate PCFG samples.
---

# Probabilistic Context-Free Grammars vs Language Models

## Setup: training a LM on samples from a PCFG

Let $P$ be a probabilistic context-free grammar (PCFG) over vocabulary $\mathcal{X}$ with associated context-free grammar $\mathcal{G}$. The grammar is the support of the distribution $P$ over finite-length sequences $S$ composed of tokens from $\mathcal{X}$.

Let $Q$ denote an arbitrary distribution over infinite sequences of tokens taken from $\mathcal{X}\cup \{๐Ÿ›‘\}$, where ๐Ÿ›‘ is a special 'end of string' token. This $Q$ is going to be our language model, perhaps a Transformer-based model like GPT. We are going to optimize $Q$ via maximum likelihood on samples from $P$. Thus, $Q$ will approximate $P$ in the KL divergence sense, i.e. we are minimizing $\operatorname{KL}[P\|Q]$. This KL divergence is one of the quantities we can track to see how well $Q$ fits the data from $P$.

But we are also interested in another quantity, $Q(\mathcal{G})$, which is the probability that $Q$ generates grammatical sentences from the grammar $\mathcal{G}$. In other words, we are interested in measuring $Q[\{S๐Ÿ›‘S': S \in \mathcal{G}\}] = Q[\{S๐Ÿ›‘S': P[S]>0\}]$. This is the probability that a random sequence sampled from $Q$ is a valid member of the grammar $\mathcal{G}$, or equivalently, lies within the support of $P$.

### Example: $a^nb^n$ toy grammar

Let's make the vocabulary simply $\{a, b\}$, and define the PCFG as follows:
\begin{align}
X &\rightarrow ab \text{ with probability } 0.2\\
X &\rightarrow aXb \text{ with probability } 0.8
\end{align}
This PCFG generates the grammar $\mathcal{G} = \{ab, aabb, aaabbb, aaaabbbb, \ldots\} = \{a^nb^n: n\in\mathbb{N}\}$. Furthermore, the length of the sequence follows a geometric distribution: $P(a^nb^n) = 0.8^{n-1}\cdot 0.2$.

By contrast, $Q$ can generate arbitrary sequences of vocabulary elements $\{a, b, ๐Ÿ›‘\}$. $Q(\mathcal{G})$ measures the total probability that $Q$ generates a member of the grammar, followed by the ๐Ÿ›‘ token.
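To make these quantities concrete, here is a minimal Python sketch of how one might sample from this toy PCFG and estimate $Q(\mathcal{G})$ by Monte Carlo. The function names (`sample_pcfg`, `is_grammatical`, `estimate_grammaticality`) and the token-list interface are illustrative choices of mine, not part of any existing codebase:

```python
import random

def sample_pcfg(p_stop=0.2, rng=random):
    """Sample a string from the toy PCFG  X -> ab (0.2) | aXb (0.8)."""
    n = 1
    # Each recursive expansion X -> aXb happens with probability 0.8,
    # so n follows the geometric distribution P(n) = 0.8^(n-1) * 0.2.
    while rng.random() > p_stop:
        n += 1
    return "a" * n + "b" * n

def is_grammatical(s):
    """Membership check for G = { a^n b^n : n >= 1 }."""
    n = len(s) // 2
    return len(s) % 2 == 0 and n >= 1 and s == "a" * n + "b" * n

def estimate_grammaticality(sample_from_q, num_samples=10_000):
    """Monte Carlo estimate of Q(G): the fraction of sampled sequences that
    consist of a grammatical string immediately followed by the ๐Ÿ›‘ token.
    `sample_from_q` is assumed to return a finite (truncated) list of tokens."""
    hits = 0
    for _ in range(num_samples):
        tokens = sample_from_q()
        if "๐Ÿ›‘" in tokens:
            prefix = "".join(tokens[: tokens.index("๐Ÿ›‘")])
            hits += is_grammatical(prefix)
    return hits / num_samples
```

As a sanity check, `estimate_grammaticality(lambda: list(sample_pcfg()) + ["๐Ÿ›‘"])` should return a value close to 1, since $P$ itself only produces grammatical strings.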
### Example: programming language

Most programming languages are described by context-free grammars. So when we train an LLM to generate computer code, we are in fact asking it to approximate a PCFG. The P, or probabilistic, part comes from the fact that not all programs are equally likely to be written on GitHub, for example. (Note that the distribution of C++ code on GitHub may not be exactly a PCFG, merely a distribution supported on a subset of a context-free grammar.)

We are interested in $Q(\mathcal{G})$ because it represents the language model's ability to generate, say, C++ code that is syntactically correct. The fact that LLMs like GPT can do this rather reliably is pretty surprising to me, especially given that they are trained using a compression objective.

## The relationship between KL divergence and $Q(\mathcal{G})$

Question: what is the relationship between approximation quality ($\operatorname{KL}[P\|Q]$) and grammaticality ($Q(\mathcal{G})$)? Is it possible to make the KL very small, yet also keep generating agrammatical sequences?

We can decompose the language model as a mixture of grammatical and agrammatical sequences: $Q = qQ_\mathcal{G} + (1-q)Q_{\overline{\mathcal{G}}}$, where $q = Q(\mathcal{G})$, $Q_\mathcal{G}$ is a distribution supported on $\mathcal{G}$, while $Q_{\overline{\mathcal{G}}}$ is supported outside of $\mathcal{G}$.

Using this decomposition we can write
\begin{align}
\operatorname{KL}[P\|Q] &= - \mathbb{E}_{S\sim P} \log Q[S] - \mathbb{H}[P]\\
&= - \mathbb{E}_{S\sim P} \log \left(qQ_\mathcal{G}[S] + (1-q)Q_{\overline{\mathcal{G}}}[S]\right) - \mathbb{H}[P]\\
&= - \log q - \mathbb{E}_{S\sim P} \log Q_\mathcal{G}[S] - \mathbb{H}[P]\\
&= - \log q + \operatorname{KL}[P\|Q_\mathcal{G}]\\
&\geq -\log q,
\end{align}
where the third equality uses the fact that $P$ is supported on $\mathcal{G}$, so $Q_{\overline{\mathcal{G}}}[S] = 0$ for $P$-almost every $S$. Thus we can see that
$$
Q(\mathcal{G}) \geq e^{-\operatorname{KL}[P\|Q]}.
$$
Consequently, as we reduce the KL divergence (equivalently, the cross-entropy), the probability of generating grammatical sentences has to increase.

### Research question 1.

In practice, when we fit $Q$ to $P$ via stochastic gradient descent in a neural network architecture, how does $Q(\mathcal{G})$ evolve? We have a worst-case bound which says it should increase at least as fast as $e^{-\operatorname{KL}[P\|Q]}$. My hypothesis is that it increases

1. faster than this worst-case bound,
2. abruptly, as a phase transition, rather than smoothly in line with $e^{-\operatorname{KL}[P\|Q]}$.

Experiment: train a language model to fit a PCFG. Monitor both $\operatorname{KL}[P\|Q]$ and $Q(\mathcal{G})$. Observe how the two quantities evolve relative to each other.

### Research question 2.

Let's say we train a neural network on a hard-to-learn subset $\mathcal{D}$ of a nice, easy-to-learn grammar $\mathcal{G}$. We can track $Q(\mathcal{D})$ and $Q(\mathcal{G})$ as well. Hypothesis: first, $Q(\mathcal{G})$ will approach $1$; only later, if at all, will $Q(\mathcal{D})$ approach $1$.

Experiment: come up with a way to split a nice (low algorithmic complexity) grammar into a training and test set, in such a way that the training set has high algorithmic complexity (i.e. it can't be described by a simple, concise set of production rules). For example, one can consider $\{a^nb^n: n\in\mathbb{N}\setminus\text{primes}\}$, i.e. split the $a^nb^n$ language into a training set where $n$ is not prime and a test set where $n$ is prime. Fit a neural network to this data and monitor the likelihood of the training and test set. Hypothesis: training and test likelihood will both increase at first, until at some point the network is forced to overfit to $\mathcal{D}$, causing the test likelihood to drop.

### Research question 3.

Let's say we are in the situation above, where the LLM learnt to generate a subset $\mathcal{D}$ of a nice grammar $\mathcal{G}$. If the likelihood is high enough, then $Q(\mathcal{G}\setminus\mathcal{D})$ should be effectively $0$. Question: can the LLM produce grammatical completions of prefixes that were not in its training set, and that have (near) $0$ probability of generation?

Example: let's say that we trained the LLM to generate $a^nb^n$ strings but we left $n=5$ out of the training set. If we train long enough, we should find that $Q(aaaaabbbbb)\approx 0$. Hypothesis: we will find that $Q(bbb \mid aaaaabb) \approx 1$; in other words, the model can grammatically complete a prefix of $aaaaabbbbb$ even though:

* it has never seen it in the training set, and
* it actually assigns probability (close to) $0$ to the entire string $aaaaabbbbb$.

Consequence of this: even if LLMs have never seen a piece of code written on GitHub, or if they assign very low, or practically $0$, probability to the whole piece of code, they may still be able to complete the code once prompted. This has important implications for the practice of prompting: prompting can be successful even on out-of-distribution samples.
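To make Research questions 2 and 3 concrete, here is a rough Python sketch of the held-out-$n$ setup. The `next_token_log_probs` interface is an assumption standing in for whatever next-token distribution a trained model exposes; all names here are hypothetical:

```python
import math

def is_prime(n):
    """Trial-division primality test (fine for the small n used here)."""
    if n < 2:
        return False
    return all(n % d for d in range(2, math.isqrt(n) + 1))

def make_split(max_n=50):
    """Split {a^n b^n : 1 <= n <= max_n} into a non-prime training set
    and a prime test set, as in Research question 2."""
    train = ["a" * n + "b" * n for n in range(1, max_n + 1) if not is_prime(n)]
    test = ["a" * n + "b" * n for n in range(1, max_n + 1) if is_prime(n)]
    return train, test

def log_prob(next_token_log_probs, prefix, continuation):
    """log Q(continuation | prefix) for an autoregressive model.

    `next_token_log_probs(tokens)` is assumed to return a dict mapping each
    vocabulary item ('a', 'b', '๐Ÿ›‘') to its log-probability given `tokens`."""
    tokens = list(prefix)
    total = 0.0
    for tok in continuation:
        total += next_token_log_probs(tokens)[tok]
        tokens.append(tok)
    return total

# Hypothetical usage with a trained model `q_model`:
#   train, test = make_split()
#   ... fit q_model on `train` by maximum likelihood ...
#   completion = math.exp(log_prob(q_model, "aaaaabb", "bbb๐Ÿ›‘"))
#   whole = math.exp(log_prob(q_model, "", "aaaaabbbbb๐Ÿ›‘"))
#   # Hypothesis: `completion` is close to 1 even though `whole` is close to 0.
```

Tracking $Q(\mathcal{G})$, $Q(\mathcal{D})$, and these two conditional quantities over the course of training would directly test the hypotheses above.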