---
title: PCFG vs LLM
tags: Idea
description: Study inductive biases of LLMs when trained to generate PCFG samples.
---
# Probabilistic Context Free Grammars vs Language Models
## Setup: training a LM on samples from a PCFG
Let $P$ be a probabilistic context-free grammar (PCFG) over a vocabulary $\mathcal{X}$, with associated context-free grammar $\mathcal{G}$. The grammar $\mathcal{G}$ is the support of the distribution $P$ over finite-length sequences $S$ composed of tokens from $\mathcal{X}$.
Let $Q$ denote an arbitrary distribution over infinite sequences of tokens taken from $\mathcal{X}\cup \{๐\}$, where ๐ is a special 'end of string' token. This $Q$ is going to be our language model, maybe a Transformer-based model like GPT.
We are going to optimize $Q$ via maximum likelihood on samples from $P$. Thus, $Q$ will approximate $P$ in the KL divergence sense, i.e. we are minimizing $\operatorname{KL}[P\|Q]$. This KL divergence is therefore one quantity we can track to see how well $Q$ fits the data from $P$.
But, we are also interested in another quantity, $Q(\mathcal{G})$, which is the probability that $Q$ generates grammatical sentences from the grammar $\mathcal{G}$. In other words, we are interested in measuring $Q[\{S๐S': S \in \mathcal{G}\}] = Q[\{S๐S': P[S]>0\}]$. This is the probability that a random sequence sampled from $Q$ is a valid member of the grammar $\mathcal{G}$, or lies within the support of $P$.
### Example: $a^nb^n$ toy grammar
Let's make the vocabulary simply $\{a, b\}$, and the PCFG as follows:
\begin{align}
X \rightarrow ab \text{ with probability }0.2\\
X \rightarrow aXb \text{ with probability }0.8
\end{align}
This PCFG generates the grammar $\mathcal{G} = \{ab, aabb, aaabbb, aaaabbbb, \ldots\} = \{a^nb^n: n\geq 1\}$.
Furthermore, the length of the sequence follows a geometric distribution: $P(a^nb^n) = 0.8^{n-1}\cdot 0.2$. For example, $P(aaabbb) = 0.8^{2}\cdot 0.2 = 0.128$.
By contrast, $Q$ can generate arbitrary sequences of vocabulary elements $\{a, b, ๐\}$.
$Q(\mathcal{G})$ measures the total probability that $Q$ generates a member of the grammar, followed by the "๐" token.
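To make the toy setup concrete, here is a minimal sketch in plain Python (the helper names are my own) that samples from $P$, tests membership in $\mathcal{G}$, and estimates $Q(\mathcal{G})$ by Monte Carlo; as an illustration, the stand-in $Q$ is a deliberately bad model that emits i.i.d. uniform tokens.

```python
import random

EOS = "$"  # stand-in symbol for the end-of-string token

def sample_p(p_stop=0.2):
    """Sample a^n b^n from P: expand X -> ab w.p. 0.2, X -> aXb w.p. 0.8."""
    n = 1
    while random.random() > p_stop:
        n += 1
    return "a" * n + "b" * n

def in_grammar(s):
    """Membership test for G = {a^n b^n : n >= 1}."""
    n = len(s) // 2
    return n >= 1 and s == "a" * n + "b" * n

def estimate_q_grammatical(sample_q, n_samples=10_000):
    """Monte Carlo estimate of Q(G), given a function that samples one string from Q."""
    return sum(in_grammar(sample_q()) for _ in range(n_samples)) / n_samples

def sample_q_uniform():
    """A deliberately bad Q: emit i.i.d. uniform tokens until the EOS symbol."""
    s = ""
    while True:
        t = random.choice(["a", "b", EOS])
        if t == EOS:
            return s
        s += t

print(estimate_q_grammatical(sample_p))          # 1.0: P only produces grammatical strings
print(estimate_q_grammatical(sample_q_uniform))  # roughly 1/24 ~ 0.04 for this uniform Q
```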
### Example: programming language
Most programming languages are described by context-free grammars. So when we train an LLM to generate computer code, we are in fact asking it to approximate a PCFG. The P, or probabilistic, part comes from the fact that not all programs are equally likely to be written on GitHub, for example. (Note that the distribution of C++ code on GitHub may not be a PCFG, merely a distribution over a subset of a context-free grammar.)
We are interested in $Q(\mathcal{G})$ because it represents the language model's ability to generate, say, C++ code that is syntactically correct. The fact that LLMs like GPT can do this rather reliably is pretty surprising to me, especially given that they are trained with a compression objective.
## The relationship between KL divergence and $Q(\mathcal{G})$
Question: what is the relationship between approximation ($\operatorname{KL}[P\|Q]$) and grammaticality $Q(\mathcal{G})$? Is it possible to make KL very small, yet also keep generating agrammatical sequences?
We can decompose the language model into a mixture of grammatical and agrammatical components: $Q = qQ_\mathcal{G} + (1-q)Q_{\overline{\mathcal{G}}}$, where $q = Q(\mathcal{G})$, $Q_\mathcal{G}$ is a distribution supported on $\mathcal{G}$, while $Q_{\overline{\mathcal{G}}}$ is supported outside of $\mathcal{G}$. Using this decomposition we can write
\begin{align}
\operatorname{KL}[P\|Q] &= - \mathbb{E}_{S\sim P} \log Q[S] - \mathbb{H}[P]\\
&= - \mathbb{E}_{S\sim P} \log \left(qQ_\mathcal{G}[S] + (1-q)Q_{\overline{\mathcal{G}}}[S]\right) - \mathbb{H}[P]\\
&= - \log q - \mathbb{E}_{S\sim P} \log Q_\mathcal{G}[S] - \mathbb{H}[P]\\
&= - \log q + \operatorname{KL}[P\|Q_\mathcal{G}]\\
&\geq -\log q
\end{align}
The third line uses that $Q_{\overline{\mathcal{G}}}[S] = 0$ whenever $P[S] > 0$, so the mixture reduces to $qQ_\mathcal{G}[S]$ on the support of $P$, and the final inequality uses the non-negativity of the KL divergence. Thus we can see that
$$
Q(\mathcal{G}) \geq e^{-\operatorname{KL}[P\|Q]}
$$
Consequently, as we reduce the KL divergence (equivalently, the cross-entropy), the lower bound on the probability of generating grammatical sequences increases towards $1$.
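As a quick sanity check of the bound, we can evaluate both sides on a small, hand-picked finite example (the probabilities below are arbitrary illustrative numbers, not derived from any trained model):

```python
import math

def in_grammar(s):
    """Membership test for G = {a^n b^n : n >= 1}."""
    n = len(s) // 2
    return n >= 1 and s == "a" * n + "b" * n

P = {"ab": 0.2, "aabb": 0.8}                          # supported on G
Q = {"ab": 0.1, "aabb": 0.5, "ba": 0.3, "abb": 0.1}   # leaks mass outside G

kl = sum(p * math.log(p / Q[s]) for s, p in P.items())     # KL[P||Q]
q_g = sum(prob for s, prob in Q.items() if in_grammar(s))  # Q(G)

print(f"KL[P||Q] = {kl:.4f}")
print(f"Q(G) = {q_g:.3f} >= exp(-KL) = {math.exp(-kl):.3f}")
```

With these numbers $Q(\mathcal{G}) = 0.6$ while $e^{-\operatorname{KL}[P\|Q]} \approx 0.598$, so the bound holds and happens to be nearly tight.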
### Research question 1.
In practice, when we fit $Q$ to $P$ via stochastic gradient descent in a neural network architecture, how does $Q(\mathcal{G})$ evolve? The worst-case bound above says it must be at least $e^{-\operatorname{KL}[P\|Q]}$, so it should grow as the KL divergence shrinks. My hypothesis is that it increases
1. faster than this worst-case bound, and
2. abruptly, as a phase transition, rather than smoothly in line with $e^{-\operatorname{KL}[P\|Q]}$.
Experiment: train a language model to fit a PCFG. Monitor both $\operatorname{KL}[P\|Q]$ and $Q(\mathcal{G})$, and observe how the two quantities evolve relative to each other.
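A minimal sketch of such an experiment, assuming PyTorch, a tiny LSTM language model and the $a^nb^n$ PCFG from above (all names and hyperparameters here are my own illustrative choices, not a definitive setup). The training loss estimates the cross-entropy $-\mathbb{E}_{S\sim P}\log Q[S]$, which equals $\operatorname{KL}[P\|Q]$ up to the constant $\mathbb{H}[P]$, and $Q(\mathcal{G})$ is estimated by sampling from the model:

```python
import random
import torch
import torch.nn as nn

PAD, BOS, EOS, A, B = 0, 1, 2, 3, 4   # token ids; PAD is only used for batching
VOCAB = 5
MAX_N = 40                            # truncate the geometric length distribution for practicality

def sample_p(p_stop=0.2):
    """Sample a^n b^n followed by EOS (X -> ab w.p. 0.2, X -> aXb w.p. 0.8)."""
    n = 1
    while random.random() > p_stop and n < MAX_N:
        n += 1
    return [A] * n + [B] * n + [EOS]

def batchify(seqs):
    """Right-pad token lists and build (input, target) tensors for next-token prediction."""
    L = max(len(s) for s in seqs)
    tgt = torch.full((len(seqs), L), PAD, dtype=torch.long)
    for i, s in enumerate(seqs):
        tgt[i, :len(s)] = torch.tensor(s)
    inp = torch.cat([torch.full((len(seqs), 1), BOS, dtype=torch.long), tgt[:, :-1]], dim=1)
    return inp, tgt

class TinyLM(nn.Module):
    def __init__(self, d=64):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, d)
        self.rnn = nn.LSTM(d, d, batch_first=True)
        self.out = nn.Linear(d, VOCAB)
    def forward(self, x):
        h, _ = self.rnn(self.emb(x))
        return self.out(h)

def grammatical(tokens):
    """Membership test for G = {a^n b^n : n >= 1} (tokens exclude BOS and EOS)."""
    n = len(tokens) // 2
    return n >= 1 and tokens == [A] * n + [B] * n

@torch.no_grad()
def estimate_q_grammatical(model, n_samples=200, max_len=2 * MAX_N + 1):
    """Monte Carlo estimate of Q(G): sample sequences from Q and check membership."""
    hits = 0
    for _ in range(n_samples):
        seq, x = [], torch.tensor([[BOS]])
        for _ in range(max_len):
            probs = torch.softmax(model(x)[0, -1], dim=-1)
            nxt = torch.multinomial(probs, 1).item()
            if nxt == EOS:
                hits += grammatical(seq)
                break
            seq.append(nxt)
            x = torch.cat([x, torch.tensor([[nxt]])], dim=1)
        # sequences that never emit EOS within max_len count as agrammatical
    return hits / n_samples

model = TinyLM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss(ignore_index=PAD, reduction="sum")

for step in range(2001):
    inp, tgt = batchify([sample_p() for _ in range(32)])
    logits = model(inp)
    # per-sequence NLL: a Monte Carlo estimate of -E_{S~P} log Q[S] = KL[P||Q] + H[P]
    loss = loss_fn(logits.reshape(-1, VOCAB), tgt.reshape(-1)) / inp.shape[0]
    opt.zero_grad(); loss.backward(); opt.step()
    if step % 200 == 0:
        print(step, loss.item(), estimate_q_grammatical(model))
```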
### Research question 2.
Let's say we train a neural network on a hard-to-learn subset $\mathcal{D}$ of a nice, easy-to-learn grammar $\mathcal{G}$. We can track $Q(\mathcal{D})$ and $Q(\mathcal{G})$ as well. Hypothesis: $Q(\mathcal{G})$ will reach $1$ first; only later, if at all, will $Q(\mathcal{D})$ reach $1$.
Experiment: come up with a way to split a nice (low algorithmic complexity) grammar into a training and a test set, in such a way that the training set has high algorithmic complexity (i.e. it can't be described by a simple, concise set of production rules). For example, one can consider $\{a^nb^n: n\in\mathbb{N}\setminus\text{primes}\}$, i.e. split the $a^nb^n$ language into a training set where $n$ is not prime and a test set where $n$ is prime. Fit a neural network to this data and monitor the likelihood of the training and test set. Hypothesis: training and test likelihood will both increase at first, until at some point the network is forced to overfit to $\mathcal{D}$, causing the test likelihood to drop.
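One possible realization of such a split (an illustrative choice of mine, capping the sequence length at a hypothetical `N_MAX`):

```python
# Split {a^n b^n : 1 <= n <= N_MAX} into a high-complexity training set
# (non-prime n) and a test set (prime n).
def is_prime(n):
    return n >= 2 and all(n % d for d in range(2, int(n ** 0.5) + 1))

N_MAX = 64
train_strings = ["a" * n + "b" * n for n in range(1, N_MAX + 1) if not is_prime(n)]
test_strings  = ["a" * n + "b" * n for n in range(1, N_MAX + 1) if is_prime(n)]

print(len(train_strings), len(test_strings))  # 46 training lengths, 18 test lengths
```

During training, one would then sample (or weight) each training string $a^nb^n$ according to a renormalized version of $P(a^nb^n) = 0.8^{n-1}\cdot 0.2$, and monitor training and test likelihood alongside $Q(\mathcal{G})$ and $Q(\mathcal{D})$.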
### Research question 3.
Let's say we are in the situation above, where the LLM learnt to generate a subset $\mathcal{D}$ of a nice grammar $\mathcal{G}$. If the likelihood on $\mathcal{D}$ is high enough, then $Q(\mathcal{G}\setminus\mathcal{D})$ should be effectively $0$. Question: can the LLM produce grammatical completions of prefixes of strings that were not in its training set and that have probability $0$ of being generated?
Example: let's say that we trained the LLM to generate $a^nb^n$ strings but we left $n=5$ out of the training set. If we train long enough, we should find that $Q(aaaaabbbbb)\approx 0$. Hypothesis: we will find that $Q(bbb \mid aaaaabb) \approx 1$ (see the probe sketched after the list below); in other words, the model can produce a grammatically correct completion of a prefix of $aaaaabbbbb$ even though:
* it has never seen it in the training set, and
* it actually assigns practically zero probability to the entire string $aaaaabbbbb$.
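A sketch of this probe, reusing the token ids and the trained `model` from the RQ1 sketch above (so it assumes that code has been run, now with $n=5$ held out of the training data; `completion_probability` is my own helper name):

```python
import torch

@torch.no_grad()
def completion_probability(model, prefix, completion):
    """Q(completion | prefix) under an autoregressive model over token ids;
    `completion` should normally end with EOS to score a full continuation."""
    x = torch.tensor([[BOS] + prefix])
    prob = 1.0
    for tok in completion:
        p = torch.softmax(model(x)[0, -1], dim=-1)[tok].item()
        prob *= p
        x = torch.cat([x, torch.tensor([[tok]])], dim=1)
    return prob

# Q(bbb EOS | aaaaabb): the hypothesis is that this is close to 1 even though
# Q(aaaaabbbbb) itself is practically 0.
print(completion_probability(model, prefix=[A] * 5 + [B] * 2, completion=[B] * 3 + [EOS]))
```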
Consequence of this: even if LLMs have never seen a piece of code written on github, or if they assign very low, or practically 0 probability to the whole piece of code, they may still be able to complete the code once prompted. This has important implications for the practice of prompting: prompting can be successful even on out-of-distribution samples.