The task of language modeling is to assign probabilities to text such that text that comes from the language is assigned higher probability than text that doesn't. Let's look at some examples:
An English language model would assign high probability to (2), lower probability to (4), and extremely low probabilities to (1) and (3). Let's test this out:
from kenlm import LanguageModel

# Load a pre-trained n-gram model (here, one trained on WikiText-103).
lm = LanguageModel('wiki103.klm')

sentences = [
    'я иду домой',           # Russian
    'I am going home',       # English
    'Ես տուն եմ գնում',      # Armenian
    'je vais a la maison',   # French
]

# lm.score returns a log10 probability, so 10 ** score recovers the probability.
scores = [(10 ** lm.score(sent), sent) for sent in sentences]
scores.sort(reverse=True)
The results are:
[(2.0708687409496807e-13, 'I am going home'),
(4.944977001922915e-25, 'я иду домой'),
(2.1057933808153777e-30, 'je vais a la maison'),
(6.335664110472439e-32, 'Ես տուն եմ գնում')]
This particular language model didn't agree with us on the probability of (4), but, as you can see, it assigned the highest probability to the English sentence.
That was easy, you say, since non-English sentences contain almost no words from English. You're right, but language models help us do other sorts of useful things. The most important problem they help us with is that of ordering English sentences based on their probabilities. For example, let's look at these sentences:
The language model would, ideally, assign a higher probability to (6) than to (5). Our example lm does this properly:
[(2.0708687409496807e-13, 'I am going home'),
(8.306997584230622e-15, 'I am going house')]
A note on what exactly we mean by the probability of a piece of text: given text composed of characters x[0], x[1], ..., x[T], by its probability we mean the joint probability P(x[0], x[1], ..., x[T]) that the model assigns to that sequence.
I've spent some time studying various language models, so in this talk I want to share with you some of what I've learned. Language modeling is a foundational problem in NLP and is among the very first topics taught in NLP and computational linguistics courses. The outline of the talk is the following:
In Applications we'll briefly list some of the places language models show up in NLP. In History, we take a very biased look at the history of the subject. In Models, we consider the three main classes of approaches to actually implementing language models: statistical Markov models, neural Markov models, and neural non-Markov models. In Resources we'll provide some links to the literature and state-of-the-art implementations of the different models.
Language modeling is a foundational task in NLP; many downstream tasks have an LM as a subcomponent.
Before we get into the weeds, I thought it might be good to take a brief glance at the history of the subject. A language model has two components: there's the language and there's the probabilistic model. The history of language goes back millennia; the history of probabilistic modeling goes back several hundred years.
The history of language is only slightly younger than history itself. As Gleick writes in The Information, one of the very first applications of the technology of language was to language itself: first in the hands of philosophers, then philologists, and now linguists. Needless to say, it's outside the scope of this presentation to offer an outline of this history.
Although probability has been thought about for a very long time, the first major breakthroughs trace to Renaissance-era gambling. The big players of this revolution were Cardano, Pascal, Fermat, Laplace, and the Bernoullis, among others. We may call this the first phase of the development of probabilistic models. An essential part of these models is the idea of independence, which says that the probability of an outcome is unaffected by the outcomes of other events. In the context of games of chance, this assumption makes sense. For more on this early history of probability take a look at [links].
Before we get to the second phase of development, a quick diversion to point out an important realization that lies at the foundation of statistics and science in general. The idea is due to Karl Pearson and is this: observations of natural phenomena are probabilistic in nature; what we have are not deterministic truth-values but random variables with underlying distributions. The classic example here is that of the electron, whose properties are modeled using probability distributions [Albert]. In the context of language modeling, text from a language is thought of as a sequence of random variables unfolding over time. Even when each token in the text is as likely to be seen as any other, there is still an underlying distribution over which token appears at a particular position.
The second phase of development of probabilistic modeling coincides with the birth of language modeling in the hands of the great Russian mathematician Andrei Andreevich Markov. His insight was that the probabilities of outcomes in a sequence are often not independent! For example, the probability that the letter following "t" is "h" is much higher than the probability of "e" following "t", even though "e" is more frequent than "h" in English. Instead of outright independence, we assume that each event in a sequence is conditionally independent of the distant past given its recent history.
The most restrictive Markov model is the 0th-order one and coincides with the independent model: the probability of the random variable (r.v.) x[t] is simply P(x[t]), which knows nothing of what came before at x[:t].
It's important to note that this is not necessarily a uniform probability distribution, since the probabilities of events can have any distribution. For example, the letter "e" occurs more frequently than "z" in English, so even if we ignore all context, "e" should be assigned a higher probability than "z".
The first non-trivial Markov model is the 1st-order one. It says that the probability of x[t] depends only on the immediately preceding element: P(x[t] | x[:t]) = P(x[t] | x[t-1]).
As we increase the order, we obtain sequence models that are able to capture more and more of the past to influence the probability of the next state in a sequence.
For expositions of the early work on Markov language models, look at [links, incl. Mind at Play].
Markov models have been flourishing since then and have found use in most ML tasks. When people say "statistical models", they're often referring to Markov models. State of the art statistical models use what are called Hidden Markov Models (HMMs). This is a huge subject with lots of cool algorithms (e.g. Viterbi for decoding and Baum-Welch for parameter estimation). When we get to models, I'll mention the role HMMs play in language modeling. Until then we leave this subject alone. [links]
Another place where Markov models are heavily applied is Bayesian learning. The problem of Bayesian learning is to compute what's called a posterior. This posterior is generally intractable to compute (though there are exceptions, e.g. conjugate methods), so in practice people resort to approximation methods. Markov models show up in one such method called Markov Chain Monte Carlo (MCMC), which marries the concepts of Monte Carlo simulation and Markov models. MCMC, and Bayesian learning in general, are immensely rich fields of inquiry. Saying anything more about them would take us too far afield, but I'll mention some expositions of the topic at the end for independent exploration.
The history of language modeling starts, but doesn't end, with Markov models. Since the pioneering work of Bengio and others, neural networks have been applied to language modeling with incredible success. We'll discuss these neural models in the Models section, but I want to conclude this historical section with an important note: applying a neural network to language modeling doesn't automatically constitute an advancement over statistical modeling. In a sense, it's a mere implementation detail: we have a different architecture, a different loss function, different learning and inference algorithms, but the fundamental probabilistic assumptions may be the same. As we'll see, however, there is a class of neural networks that allows us to go beyond Markov models, capturing, in theory at least, all of the history.
As we noted in the Introduction, language models assign probabilities to text. The first task is to represent text in a way that's convenient for modeling.
A piece of text is inherently ordered. For example, the following pairs aren't the same:
The atomic units are characters (or Unicode code points, if you insist), such as "d", "o", and "g". To say that text is ordered is to say that it's not a set (or bag), but a sequence (or tuple) of characters. There's only one question that can be asked of a set: is a particular element a member of it? A sequence answers that same question, but also tells us where the element can be found. In a sequence, every element has a position. Since text is naturally produced over time, we say that "d" and "o" occur at time points 0 and 1 in the text "dog".
Once we think of text as a sequence of tokens over time, we can model producers of a language using functions that generate such sequences. Our task then is to approximate this generating function using machine learning.
Although characters are the atoms of text, language models are often defined at different levels of text representation:
Consider the text "I am going home". At the sentence level, we just have the sequence of one element ("I am going home"). At the word level, we have ("I", "am", "going", "home"). At the character level we have ("I", " ", "a", "m", " ", "g", "o", "i", "n", "g", " ", "h", "o", "m", "e").
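As a quick illustration, here's a minimal Python sketch of these three representations (the whitespace split below is a deliberately naive word tokenizer):

text = "I am going home"

# Sentence level: the whole string is a single element.
sentence_tokens = (text,)

# Word level: a naive whitespace split (real tokenizers are more careful).
word_tokens = tuple(text.split())    # ('I', 'am', 'going', 'home')

# Character level: every character, including spaces, is an element.
char_tokens = tuple(text)            # ('I', ' ', 'a', 'm', ' ', 'g', ...)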
Sentence-level models aren't used commonly, as pretty much all you can do is count occurrences of sentences in datasets. That leads, due to sparsity, to very poor estimators.
Character and word-level models are widespread. Word-level models lead to short but deep sequences, whereas character-level models lead to shallow but long sequences. I use length and depth in the following sense. The sentence "I am going home" can be represented at word and character levels as:
The length of (8.5a) is 4, while the length of (8.5b) is 15. The depth of (8.5a) is the number of words in the language, while the depth of (8.5b) is the number of possible characters in the language, so (8.5a) is much much deeper than (8.5b).
These quantities of length and depth have important consequences for modeling. Long sequences can lead to very deep neural networks that are difficult to train. Deep sequences can lead to very slow training and inference due to softmax layers used in modeling.
Whatever level we choose will determine what elements go into our elementary outcome space, i.e. the vocabulary of sentences, words, or characters.
Since everything represented in a computer must be finite, these vocabularies are truncated to their most frequent elements, and everything that falls outside the vocabulary is mapped to a special <unk> (for unknown) token.
In this setup, text is a sequence of random variables x[0], x[1], ..., x[T], each taking values in the vocabulary.
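Here's a minimal sketch of how such a truncated vocabulary might be built and applied; the max_size cutoff and the helper names are just illustrative:

from collections import Counter

def build_vocab(tokens, max_size=10000):
    # Keep the most frequent tokens; reserve one slot for <unk>.
    counts = Counter(tokens)
    vocab = ['<unk>'] + [tok for tok, _ in counts.most_common(max_size - 1)]
    return {tok: i for i, tok in enumerate(vocab)}

def encode(tokens, token_to_id):
    # Map out-of-vocabulary tokens to the <unk> id.
    unk = token_to_id['<unk>']
    return [token_to_id.get(tok, unk) for tok in tokens]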
Rant: it's common to hear people say that language modeling is an unsupervised learning task. This is understandable, since there is no need to manually label a dataset for this task. But it's still an incorrect characterization of the learning problem. Language modeling is, in fact, a supervised learning task; it just happens that the labels are trivially obtained from unlabeled, raw text.
We said that the goal is to minimize the difference between the distribution our model assigns to text and the true distribution of the language that generated it. In practice, we minimize the average negative log-likelihood (ANLL) of held-out text under the model:
ANLL = -(1/T) * sum over t of log P(x[t] | x[:t])
where T is the length of the text.
Perplexity (PPL) is the dimensionless quantity computed as PPL = exp(ANLL), using the same base for the exponential as for the logarithm.
Depending on the type of model, we may optimize with respect to ANLL or PPL. For reporting evaluation results PPL is pretty standard (small ANLL gains show up as large PPL gains!).
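To make the relationship concrete, here's a small sketch that computes ANLL and PPL from per-token conditional probabilities; the probabilities here are made up for illustration:

import math

# Hypothetical per-token conditional probabilities P(x[t] | x[:t]) for one text.
token_probs = [0.2, 0.05, 0.1, 0.3]

anll = -sum(math.log(p) for p in token_probs) / len(token_probs)
ppl = math.exp(anll)  # natural log above, so exp here
print(anll, ppl)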
In this final section we discuss three of the main modeling approaches. The first two are Markov models; the third goes beyond Markov models, allowing us to condition on the entire sequence.
As we noted earlier, Markov models make strong conditional independence assumptions. An nth-order Markov model assumes the following identity: P(x[t] | x[0], ..., x[t-1]) = P(x[t] | x[t-n], ..., x[t-1]).
Given a sequence of text x[0], x[1], ..., x[T], the chain rule lets us write its probability as P(x[0], ..., x[T]) = P(x[0]) * P(x[1] | x[0]) * P(x[2] | x[0], x[1]) * ... * P(x[T] | x[0], ..., x[T-1]).
So far this is just the chain rule applied to the sequence. The next step is to apply a Markov assumption of some order to shorten those long conditioning contexts. For concreteness, let's stick to a 2nd-order (trigram) model: P(x[t] | x[0], ..., x[t-1]) = P(x[t] | x[t-2], x[t-1]).
As you can see, starting from the second element of the sequence, we condition on the previous two elements, ignoring all elements preceding those. This is the essential characteristic of Markov models.
Now that we've defined Markov language models, how do we go about estimating these conditional probabilities from data?
The statistical approach tries to maximize the likelihood of the data under the model, which leads to the estimate
P(x[t] | x[t-2], x[t-1]) = count(x[t-2], x[t-1], x[t]) / count(x[t-2], x[t-1])
where count(.) is the number of times the given n-gram occurs in the training dataset.
Once we gather these n-gram counts, we can compute trigram probabilities by simple division.
For example, let's say we have "I am going home" and we want to know P(I, am, going, home). The chain rule factors this into P(I) * P(am | I) * P(going | I, am) * P(home | I, am, going).
Then we use a second-order Markov assumption to simplify that last term to P(home | am, going).
Then we count how many times the relevant n-grams occur in the training dataset: count(I), count(I, am), count(am, going, home), count(am, going), and so on.
Using those we compute the probabilities, e.g. P(home | am, going) = count(am, going, home) / count(am, going).
Lastly, we multiply those out to get the probability of the entire sequence.
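Here's a minimal sketch of that counting recipe; the padding symbols and function names are my own choices for illustration:

from collections import Counter

def count_ngrams(corpus):
    # corpus: an iterable of token lists, e.g. [['I', 'am', 'going', 'home'], ...]
    bigrams, trigrams = Counter(), Counter()
    for tokens in corpus:
        padded = ['<s>', '<s>'] + tokens + ['</s>']
        for i in range(2, len(padded)):
            bigrams[tuple(padded[i-2:i])] += 1
            trigrams[tuple(padded[i-2:i+1])] += 1
    return bigrams, trigrams

def trigram_prob(w1, w2, w3, bigrams, trigrams):
    # Maximum-likelihood estimate: count(w1, w2, w3) / count(w1, w2).
    return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]

Note that trigram_prob divides one raw count by another, which is exactly where the zero-count problem described below shows up.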
This is nice and easy, but has a huge problem due to sparsity. The problem is that n-gram counts can be 0! If a 0 occurs in the denominator of the estimate, the probability is undefined, and a 0 in the numerator drives the probability of the entire sequence to 0, no matter how plausible the rest of it is.
To overcome this problem of 0 counts, statistical Markov language models use smoothed estimates of these n-gram probabilities.
The most basic approach is additive smoothing. The idea is to add a small positive number alpha to every count:
P(x[t] | x[t-2], x[t-1]) = (count(x[t-2], x[t-1], x[t]) + alpha) / (count(x[t-2], x[t-1]) + alpha * V)
where V is the size of the vocabulary; the alpha * V term in the denominator keeps the estimates summing to 1. With alpha = 1 this is the classic Laplace smoothing.
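Continuing the counting sketch from above, additive smoothing only changes the probability computation; alpha and vocab_size are hypothetical parameters:

def trigram_prob_additive(w1, w2, w3, bigrams, trigrams, vocab_size, alpha=0.1):
    # Every trigram count is inflated by alpha; the denominator grows by
    # alpha * vocab_size so the smoothed probabilities still sum to 1.
    numerator = trigrams[(w1, w2, w3)] + alpha
    denominator = bigrams[(w1, w2)] + alpha * vocab_size
    return numerator / denominator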
The second common approach is backoff smoothing. The idea is to learn three language model parameter classes simultaneously: trigram, bigram, and unigram estimates.
Once we have these, we apply the highest-order estimate for which we actually have counts, and back off to the next lower order (paying a penalty) whenever the higher-order count is 0.
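As a rough illustration, here's a sketch in the spirit of "stupid backoff" (a simplified, unnormalized variant; proper schemes such as Katz backoff add discounting and backoff weights to keep the distribution normalized). It assumes unigrams, bigrams, and trigrams are Counters keyed by 1-, 2-, and 3-token tuples, gathered the same way as above, and that total_tokens is the training-set size:

def backoff_score(w1, w2, w3, unigrams, bigrams, trigrams, total_tokens, lam=0.4):
    # Use the trigram estimate when we have evidence for it...
    if trigrams[(w1, w2, w3)] > 0:
        return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]
    # ...otherwise back off to the bigram estimate, paying a penalty lam...
    if bigrams[(w2, w3)] > 0:
        return lam * bigrams[(w2, w3)] / unigrams[(w2,)]
    # ...and finally to the unigram estimate.
    return lam * lam * unigrams[(w3,)] / total_tokens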
The third most common approach is interpolated smoothing, which, instead of backing off from higher- to lower-order models, interpolates or mixes all three probabilities (this is why it's often called a mixture model). Interpolated smoothing parametrizes the mixture using non-negative weights lambda1, lambda2, lambda3 that sum to 1:
P(x[t] | x[t-2], x[t-1]) = lambda3 * P3(x[t] | x[t-2], x[t-1]) + lambda2 * P2(x[t] | x[t-1]) + lambda1 * P1(x[t])
These lambdas are typically tuned on held-out data.
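A sketch of the interpolated estimate, reusing the counters from above; the mixture weights here are picked by hand purely for illustration:

def interpolated_prob(w1, w2, w3, unigrams, bigrams, trigrams, total_tokens,
                      lambdas=(0.6, 0.3, 0.1)):
    l3, l2, l1 = lambdas  # weights for the trigram, bigram, and unigram terms
    p3 = trigrams[(w1, w2, w3)] / bigrams[(w1, w2)] if bigrams[(w1, w2)] else 0.0
    p2 = bigrams[(w2, w3)] / unigrams[(w2,)] if unigrams[(w2,)] else 0.0
    p1 = unigrams[(w3,)] / total_tokens
    return l3 * p3 + l2 * p2 + l1 * p1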
Last, but not least, there is Kneser-Ney smoothing, which is widely used in state-of-the-art statistical models. It's a little involved for our scope, but there are some nice expositions of the technique in the statistical language modeling literature.
The neural approach to Markov language models traces back to Bengio and others. The problem is a supervised one: we want to predict x[t] given the previous n elements x[t-n], ..., x[t-1]. The classic feedforward architecture maps each context element to a learned embedding vector, concatenates the embeddings, passes them through a hidden layer with a nonlinearity, and finally through a softmax layer that produces a distribution over the vocabulary; the network is trained with the cross-entropy loss between that distribution and the actual next element.
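Here's a minimal PyTorch sketch of such a feedforward n-gram model; the layer sizes are arbitrary, and this is an illustration of the general idea rather than Bengio et al.'s exact architecture:

import torch
import torch.nn as nn

class FeedforwardLM(nn.Module):
    def __init__(self, vocab_size, context_size=2, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.hidden = nn.Linear(context_size * embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, context_ids):
        # context_ids: (batch, context_size) indices of the previous n tokens.
        emb = self.embed(context_ids).flatten(start_dim=1)  # concatenate embeddings
        h = torch.tanh(self.hidden(emb))
        return self.out(h)  # logits over the vocabulary; softmax lives in the loss

# Training minimizes nn.CrossEntropyLoss() between these logits and the id
# of the token that actually follows the context.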
Our final class of models allows us to go beyond the Markov assumptions. The idea is to model sequences using recurrent neural networks (RNNs). Given a sequence x[0], ..., x[T], the RNN reads one element at a time and maintains a hidden state that summarizes everything it has seen so far; at each step the next element is predicted from that hidden state, so the prediction can, in principle, depend on the entire history rather than on a fixed-size window.
The same cross-entropy loss is used to accumulate a total loss for the entire sequence, which is used to pass a gradient back through the network. Since cells of the network are arranged in time, this application of backpropagation has acquired the cool name "back-propagation through time".
It turns out that this basic setup doesn't do very well in practice due to two problems:
The network is computing a composition of the same transformation applied once per time step, so the gradient that flows back through a long sequence is a product of many similar factors. When those factors are mostly smaller than one the gradient vanishes; when they are mostly larger than one it explodes.
In practice, the exploding gradients problem is solved by clipping the gradients, forcing them to remain within a particular interval. The vanishing gradients problem is solved by using a more complicated RNN architecture, such as LSTMs or GRUs. These models introduce gates that allow the network to memorize long-range dependencies in the presence of the vanishing gradient problem, by selectively updating the network weights. LSTMs and GRUs are used in state-of-the-art non-Markovian language modeling. With proper regularization (e.g. dropout), they outperform all other models when it comes to modeling sequences with long-range context dependencies.
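To tie these pieces together, here's a hedged PyTorch sketch of an LSTM language model trained with gradient clipping; the hyperparameters are arbitrary and regularization such as dropout is omitted for brevity:

import torch
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids, state=None):
        # token_ids: (batch, seq_len); each hidden state predicts the next token.
        output, state = self.rnn(self.embed(token_ids), state)
        return self.out(output), state  # logits: (batch, seq_len, vocab_size)

def train_step(model, optimizer, inputs, targets, clip=1.0):
    logits, _ = model(inputs)
    # cross_entropy wants (batch, vocab, seq_len) logits for (batch, seq_len) targets.
    loss = nn.functional.cross_entropy(logits.transpose(1, 2), targets)
    optimizer.zero_grad()
    loss.backward()  # backpropagation through time
    torch.nn.utils.clip_grad_norm_(model.parameters(), clip)  # bound exploding gradients
    optimizer.step()
    return loss.item()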