The task of language modeling is to assign probabilities to text such that text that comes from the language is assigned higher probability than text that doesn't. Let's look at some examples:
An English language model would assign high probability to (2), lower probability to (4), and extremely low probabilities to (1) and (3). Let's test this out:
from kenlm import LanguageModel

# Load a pre-trained n-gram model (here, one trained on WikiText-103).
lm = LanguageModel('wiki103.klm')

sentences = [
    'я иду домой',           # Russian
    'I am going home',       # English
    'Ես տուն եմ գնում',      # Armenian
    'je vais a la maison',   # French
]

# lm.score returns a log10 probability, so 10 ** score recovers the probability.
scores = [(10 ** lm.score(sent), sent) for sent in sentences]
scores.sort(reverse=True)
The results are:
[(2.0708687409496807e-13, 'I am going home'),
(4.944977001922915e-25, 'я иду домой'),
(2.1057933808153777e-30, 'je vais a la maison'),
(6.335664110472439e-32, 'Ես տուն եմ գնում')]
This particular language model didn't agree with us on the probability of (4), but, as you can see, it assigned the highest probability to the English sentence.
That was easy, you say, since non-English sentences contain almost no words from English. You're right, but language models help us do other sorts of useful things. The most important problem they help us with is that of ordering English sentences based on their probabilities. For example, let's look at these sentences:
The language model would, ideally, assign a higher probability to (6) than to (5). Our example lm does this properly:
[(2.0708687409496807e-13, 'I am going home'),
(8.306997584230622e-15, 'I am going house')]
A note on what exactly we mean by the probability of a piece of text: given text composed of characters x[0], x[1], ..., x[T], by its probability we mean the joint probability P(x[0], x[1], ..., x[T]) that the model assigns to that sequence.
I've spent some time studying various language models, so in this talk I want to share with you some of what I've learned. Language modeling is a foundational problem in NLP and is among the very first topics taught in NLP and computational linguistics courses. The outline of the talk is the following:
In Applications we'll briefly list some of the places language models show up in NLP. In History, we take a very biased look at the history of the subject. In Models, we consider the three main classes of approaches to actually implementing language models: statistical Markov models, neural Markov models, and neural non-Markov models. In Resources we'll provide some links to the literature and state-of-the-art implementations of the different models.
Language modeling is a foundational task in NLP; many downstream tasks have an LM as a subcomponent.
Before we get into the weeds, I thought it might be good to take a brief glance at the history of the subject. A language model has two components: there's the language and there's the probabilistic model. The history of language goes back millennia; the history of probabilistic modeling goes back several hundred years.
The history of language is only slightly younger than history itself. As Gleick writes in The Information, one of the very first applications of the technology of language was to language itself: first in the hands of philosophers, then philologists, and now linguists. Needless to say, it's outside the scope of this presentation to offer an outline of this history.
Although probability has been thought about for a very long time, the first major breakthroughs trace to Renaissance-era gambling. The big players of this revolution were Cardano, Pascal, Fermat, Laplace, and the Bernoullis, among others. We may call this the first phase of the development of probabilistic models. An essential part of these models is the idea of independence, which says that the probability of an outcome is unaffected by the outcomes of other events. In the context of games of chance, this assumption makes sense. For more on this early history of probability take a look at [links].
Before we get to the second phase of development, a quick diversion to point out an important realization that lies at the foundation of statistics and science in general. The idea is due to Karl Pearson and is this: observations of natural phenomena are probabilistic in nature; what we have are not deterministic truth-values but random variables with underlying distributions. The classic example here is that of the electron, whose properties are modeled using probability distributions [Albert]. In the context of language modeling, text from a language is thought of as a sequence of random variables unfolding over time. Even when each token in the text is as likely to be seen as any other, there is still an underlying distribution over which token appears at a particular position.
The second phase of development of probabilistic modeling coincides with the birth of language modeling in the hands of the great Russian mathematician Andrei Andreevich Markov. His insight was that the probabilities of outcomes in a sequence are often not independent! For example, the probability that the letter following "t" is "h" is much higher than the probability of "e" following "t", even though "e" is more frequent than "h" in English. Instead of outright independence, we assume that each event in a sequence is conditionally independent of the distant past given its recent history.
The most restrictive Markov model is the 0th-order one and coincides with the independent model: the probability of the random variable (r.v.) x[t] is simply P(x[t]), which knows nothing of what came before at x[:t].
It's important to note that this is not necessarily a uniform probability distribution, since the probabilities of events can have any distribution. For example, the letter "e" occurs more frequently than "z" in English, so even if we ignore all context, "e" should be assigned a higher probability than "z".
The first non-trivial Markov model is the 1st-order one. It says that the probability of x[t] depends only on the immediately preceding element: P(x[t] | x[:t]) = P(x[t] | x[t-1]).
As we increase the order, we obtain sequence models that are able to capture more and more of the past to influence the probability of the next state in a sequence.
For expositions of the early work on Markov language models, look at [links, incl. Mind at Play].
Markov models have been flourishing since then and have found use in most ML tasks. When people say "statistical models", they're often referring to Markov models. State of the art statistical models use what are called Hidden Markov Models (HMMs). This is a huge subject with lots of cool algorithms (e.g. Viterbi for decoding and Baum-Welch for parameter estimation). When we get to models, I'll mention the role HMMs play in language modeling. Until then we leave this subject alone. [links]
Another place where Markov models are heavily applied is Bayesian learning. The problem of Bayesian learning is to compute what's called a posterior. This posterior is generally intractable to compute (though there are exceptions, e.g. conjugate methods), so in practice people resort to approximation methods. Markov models show up in one such method called Markov Chain Monte Carlo (MCMC), which marries the concepts of Monte Carlo simulation and Markov models. MCMC, and Bayesian learning in general, are immensely rich fields of inquiry. Saying anything more about them would take us too far afield, but I'll mention some expositions of the topic at the end for independent exploration.
The history of language modeling starts, but doesn't end, with Markov models. Since the pioneering work of Bengio and others, neural networks have been applied to language modeling with incredible success. We'll discuss these neural models in the Models section, but I want to conclude this historical section with an important note: applying a neural network to language modeling doesn't automatically constitute an advancement over statistical modeling. In a sense, it's a mere implementation detail: we have a different architecture, a different loss function, different learning and inference algorithms, but the fundamental probabilistic assumptions may be the same. As we'll see, however, there is a class of neural networks that allows us to go beyond Markov models, capturing, in theory at least, all of the history.
As we noted in the Introduction, language models assign probabilities to text. The first task is to represent text in a way that's convenient for modeling.
A piece of text is inherently ordered. For example, the following pairs aren't the same:
The atomic units are characters (or Unicode code points, if you insist), such as "d", "o", and "g". To say that text is ordered is to say that it's not a set (or bag), but a sequence (or tuple) of characters. There's only one question that can be asked of a set: is a particular element a member of it? A sequence answers that same question, but also tells us where the element can be found. In a sequence, every element has a position. Since text is naturally produced over time, we say that "d" and "o" occur at time points 0 and 1 in the text "dog".
Once we think of text as a sequence of tokens over time, we can model producers of a language using functions that generate such sequences. Our task then is to approximate this generating function using machine learning.
Although characters are the atoms of text, language models are often defined at different levels of text representation:
Consider the text "I am going home". At the sentence level, we just have the sequence of one element ("I am going home"). At the word level, we have ("I", "am", "going", "home"). At the character level we have ("I", " ", "a", "m", " ", "g", "o", "i", "n", "g", " ", "h", "o", "m", "e").
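As a quick illustration, here's a minimal Python sketch of these three representations (the whitespace split below is a deliberately naive word tokenizer):

text = "I am going home"

# Sentence level: the whole string is a single element.
sentence_tokens = (text,)

# Word level: a naive whitespace split (real tokenizers are more careful).
word_tokens = tuple(text.split())    # ('I', 'am', 'going', 'home')

# Character level: every character, including spaces, is an element.
char_tokens = tuple(text)            # ('I', ' ', 'a', 'm', ' ', 'g', ...)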
Sentence-level models aren't used commonly, as pretty much all you can do is count occurrences of sentences in datasets. That leads, due to sparsity, to very poor estimators.
Character and word-level models are widespread. Word-level models lead to short but deep sequences, whereas character-level models lead to shallow but long sequences. I use length and depth in the following sense. The sentence "I am going home" can be represented at word and character levels as:
The length of (8.5a) is 4, while the length of (8.5b) is 15. The depth of (8.5a) is the number of words in the language, while the depth of (8.5b) is the number of possible characters in the language, so (8.5a) is much much deeper than (8.5b).
These quantities of length and depth have important consequences for modeling. Long sequences can lead to very deep neural networks that are difficult to train. Deep sequences can lead to very slow training and inference due to softmax layers used in modeling.
Whatever level we choose will determine what elements go into our elementary outcome space, i.e. the vocabulary of sentences, words, or characters.
Since everything represented in a computer must be finite, these vocabularies are truncated to their most frequent elements, and everything that falls outside the vocabulary is mapped to a special <unk> (for unknown) token.
In this setup, text is a sequence of random variables x[0], x[1], ..., x[T], each taking values in the vocabulary.
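Here's a minimal sketch of how such a truncated vocabulary might be built and applied; the max_size cutoff and the helper names are just illustrative:

from collections import Counter

def build_vocab(tokens, max_size=10000):
    # Keep the most frequent tokens; reserve one slot for <unk>.
    counts = Counter(tokens)
    vocab = ['<unk>'] + [tok for tok, _ in counts.most_common(max_size - 1)]
    return {tok: i for i, tok in enumerate(vocab)}

def encode(tokens, token_to_id):
    # Map out-of-vocabulary tokens to the <unk> id.
    unk = token_to_id['<unk>']
    return [token_to_id.get(tok, unk) for tok in tokens]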
Rant: it's common to hear people say that language modeling is an unsupervised learning task. This is understandable, since there is no need to manually label a dataset for this task. But it's still an incorrect characterization of the learning problem. Language modeling is, in fact, a supervised learning task; it just happens that the labels are trivially obtained from unlabeled, raw text.
We said that the goal is to minimize the difference between the distribution our model assigns to text and the true distribution of the language that generated it. In practice, we minimize the average negative log-likelihood (ANLL) of held-out text under the model:
ANLL = -(1/T) * sum over t of log P(x[t] | x[:t])
where T is the length of the text.
Perplexity (PPL) is the dimensionless quantity computed as PPL = exp(ANLL), using the same base for the exponential as for the logarithm.
Depending on the type of model, we may optimize with respect to ANLL or PPL. For reporting evaluation results PPL is pretty standard (small ANLL gains show up as large PPL gains!).
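To make the relationship concrete, here's a small sketch that computes ANLL and PPL from per-token conditional probabilities; the probabilities here are made up for illustration:

import math

# Hypothetical per-token conditional probabilities P(x[t] | x[:t]) for one text.
token_probs = [0.2, 0.05, 0.1, 0.3]

anll = -sum(math.log(p) for p in token_probs) / len(token_probs)
ppl = math.exp(anll)  # natural log above, so exp here
print(anll, ppl)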
In this final section we discuss three of the main modeling approaches. The first two are Markov models; the third goes beyond Markov models, allowing us to condition on the entire sequence.
As we noted earlier, Markov models make strong conditional independence assumptions. An nth-order Markov model assumes the following identity: P(x[t] | x[0], ..., x[t-1]) = P(x[t] | x[t-n], ..., x[t-1]).
Given a sequence of text x[0], x[1], ..., x[T], the chain rule lets us write its probability as P(x[0], ..., x[T]) = P(x[0]) * P(x[1] | x[0]) * P(x[2] | x[0], x[1]) * ... * P(x[T] | x[0], ..., x[T-1]).
So far this is just the chain rule applied to the sequence. The next step is to apply a Markov assumption of some order to shorten those long conditioning contexts. For concreteness, let's stick to a 2nd-order (trigram) model: P(x[t] | x[0], ..., x[t-1]) = P(x[t] | x[t-2], x[t-1]).
As you can see, starting from the second element of the sequence, we condition on the previous two elements, ignoring all elements preceding those. This is the essential characteristic of Markov models.
Now that we've defined Markov language models, how do we go about estimating these conditional probabilities from data?
The statistical approach tries to maximize the likelihood of the data under the model, which leads to the estimate
P(x[t] | x[t-2], x[t-1]) = count(x[t-2], x[t-1], x[t]) / count(x[t-2], x[t-1])
where count(.) is the number of times the given n-gram occurs in the training dataset.
Once we gather these n-gram counts, we can compute trigram probabilities by simple division.
For example, let's say we have "I am going home" and we want to know P(I, am, going, home). The chain rule factors this into P(I) * P(am | I) * P(going | I, am) * P(home | I, am, going).
Then we use a second-order Markov assumption to simplify that last term to P(home | am, going).
Then we count how many times the relevant n-grams occur in the training dataset: count(I), count(I, am), count(am, going, home), count(am, going), and so on.
Using those we compute the probabilities, e.g. P(home | am, going) = count(am, going, home) / count(am, going).
Lastly, we multiply those out to get the probability of the entire sequence.
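Here's a minimal sketch of that counting recipe; the padding symbols and function names are my own choices for illustration:

from collections import Counter

def count_ngrams(corpus):
    # corpus: an iterable of token lists, e.g. [['I', 'am', 'going', 'home'], ...]
    bigrams, trigrams = Counter(), Counter()
    for tokens in corpus:
        padded = ['<s>', '<s>'] + tokens + ['</s>']
        for i in range(2, len(padded)):
            bigrams[tuple(padded[i-2:i])] += 1
            trigrams[tuple(padded[i-2:i+1])] += 1
    return bigrams, trigrams

def trigram_prob(w1, w2, w3, bigrams, trigrams):
    # Maximum-likelihood estimate: count(w1, w2, w3) / count(w1, w2).
    return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]

Note that trigram_prob divides one raw count by another, which is exactly where the zero-count problem described below shows up.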
This is nice and easy, but has a huge problem due to sparsity. The problem is that n-gram counts can be 0! If a 0 occurs in the denominator of the estimate, the probability is undefined, and a 0 in the numerator drives the probability of the entire sequence to 0, no matter how plausible the rest of it is.
To overcome this problem of 0 counts, statistical Markov language models use smoothed estimates of these n-gram probabilities.
The most basic approach is additive smoothing. The idea is to add a small positive number alpha to every count:
P(x[t] | x[t-2], x[t-1]) = (count(x[t-2], x[t-1], x[t]) + alpha) / (count(x[t-2], x[t-1]) + alpha * V)
where V is the size of the vocabulary; the alpha * V term in the denominator keeps the estimates summing to 1. With alpha = 1 this is the classic Laplace smoothing.
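Continuing the counting sketch from above, additive smoothing only changes the probability computation; alpha and vocab_size are hypothetical parameters:

def trigram_prob_additive(w1, w2, w3, bigrams, trigrams, vocab_size, alpha=0.1):
    # Every trigram count is inflated by alpha; the denominator grows by
    # alpha * vocab_size so the smoothed probabilities still sum to 1.
    numerator = trigrams[(w1, w2, w3)] + alpha
    denominator = bigrams[(w1, w2)] + alpha * vocab_size
    return numerator / denominator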
The second common approach is backoff smoothing. The idea is to learn three language model parameter classes simultaneously: trigram, bigram, and unigram estimates.
Once we have these, we apply the highest-order estimate for which we actually have counts, and back off to the next lower order (paying a penalty) whenever the higher-order count is 0.
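As a rough illustration, here's a sketch in the spirit of "stupid backoff" (a simplified, unnormalized variant; proper schemes such as Katz backoff add discounting and backoff weights to keep the distribution normalized). It assumes unigrams, bigrams, and trigrams are Counters keyed by 1-, 2-, and 3-token tuples, gathered the same way as above, and that total_tokens is the training-set size:

def backoff_score(w1, w2, w3, unigrams, bigrams, trigrams, total_tokens, lam=0.4):
    # Use the trigram estimate when we have evidence for it...
    if trigrams[(w1, w2, w3)] > 0:
        return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]
    # ...otherwise back off to the bigram estimate, paying a penalty lam...
    if bigrams[(w2, w3)] > 0:
        return lam * bigrams[(w2, w3)] / unigrams[(w2,)]
    # ...and finally to the unigram estimate.
    return lam * lam * unigrams[(w3,)] / total_tokens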
The third most common approach is interpolated smoothing, which, instead of backing off from higher- to lower-order models, interpolates or mixes all three probabilities (this is why it's often called a mixture model). Interpolated smoothing parametrizes the mixture using non-negative weights lambda1, lambda2, lambda3 that sum to 1:
P(x[t] | x[t-2], x[t-1]) = lambda3 * P3(x[t] | x[t-2], x[t-1]) + lambda2 * P2(x[t] | x[t-1]) + lambda1 * P1(x[t])
These lambdas are typically tuned on held-out data.
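A sketch of the interpolated estimate, reusing the counters from above; the mixture weights here are picked by hand purely for illustration:

def interpolated_prob(w1, w2, w3, unigrams, bigrams, trigrams, total_tokens,
                      lambdas=(0.6, 0.3, 0.1)):
    l3, l2, l1 = lambdas  # weights for the trigram, bigram, and unigram terms
    p3 = trigrams[(w1, w2, w3)] / bigrams[(w1, w2)] if bigrams[(w1, w2)] else 0.0
    p2 = bigrams[(w2, w3)] / unigrams[(w2,)] if unigrams[(w2,)] else 0.0
    p1 = unigrams[(w3,)] / total_tokens
    return l3 * p3 + l2 * p2 + l1 * p1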
Last, but not least, there is Kneser-Ney smoothing, which is widely used in state-of-the-art statistical models. It's a little involved for our scope, but there are some nice expositions of the technique in the statistical language modeling literature.
The neural approach to Markov language models traces back to Bengio and others. The problem is a supervised one: we want to predict x[t] given the previous n elements x[t-n], ..., x[t-1]. The classic feedforward architecture maps each context element to a learned embedding vector, concatenates the embeddings, passes them through a hidden layer with a nonlinearity, and finally through a softmax layer that produces a distribution over the vocabulary; the network is trained with the cross-entropy loss between that distribution and the actual next element.
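Here's a minimal PyTorch sketch of such a feedforward n-gram model; the layer sizes are arbitrary, and this is an illustration of the general idea rather than Bengio et al.'s exact architecture:

import torch
import torch.nn as nn

class FeedforwardLM(nn.Module):
    def __init__(self, vocab_size, context_size=2, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.hidden = nn.Linear(context_size * embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, context_ids):
        # context_ids: (batch, context_size) indices of the previous n tokens.
        emb = self.embed(context_ids).flatten(start_dim=1)  # concatenate embeddings
        h = torch.tanh(self.hidden(emb))
        return self.out(h)  # logits over the vocabulary; softmax lives in the loss

# Training minimizes nn.CrossEntropyLoss() between these logits and the id
# of the token that actually follows the context.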
Our final class of models allows us to go beyond the Markov assumptions. The idea is to model sequences using recurrent neural networks (RNNs). Given a sequence x[0], ..., x[T], the RNN reads one element at a time and maintains a hidden state that summarizes everything it has seen so far; at each step the next element is predicted from that hidden state, so the prediction can, in principle, depend on the entire history rather than on a fixed-size window.
The same cross-entropy loss is used to accumulate a total loss for the entire sequence, which is used to pass a gradient back through the network. Since cells of the network are arranged in time, this application of backpropagation has acquired the cool name "back-propagation through time".
It turns out that this basic setup doesn't do very well in practice due to two problems:
The network is computing a composition of the same transformation applied once per time step, so the gradient that flows back through a long sequence is a product of many similar factors. When those factors are mostly smaller than one the gradient vanishes; when they are mostly larger than one it explodes.
In practice, the exploding gradients problem is solved by clipping the gradients, forcing them to remain within a particular interval. The vanishing gradients problem is solved by using a more complicated RNN architecture, such as LSTMs or GRUs. These models introduce gates that allow the network to memorize long-range dependencies in the presence of the vanishing gradient problem, by selectively updating the network weights. LSTMs and GRUs are used in state-of-the-art non-Markovian language modeling. With proper regularization (e.g. dropout), they outperform all other models when it comes to modeling sequences with long-range context dependencies.
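To tie these pieces together, here's a hedged PyTorch sketch of an LSTM language model trained with gradient clipping; the hyperparameters are arbitrary and regularization such as dropout is omitted for brevity:

import torch
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids, state=None):
        # token_ids: (batch, seq_len); each hidden state predicts the next token.
        output, state = self.rnn(self.embed(token_ids), state)
        return self.out(output), state  # logits: (batch, seq_len, vocab_size)

def train_step(model, optimizer, inputs, targets, clip=1.0):
    logits, _ = model(inputs)
    # cross_entropy wants (batch, vocab, seq_len) logits for (batch, seq_len) targets.
    loss = nn.functional.cross_entropy(logits.transpose(1, 2), targets)
    optimizer.zero_grad()
    loss.backward()  # backpropagation through time
    torch.nn.utils.clip_grad_norm_(model.parameters(), clip)  # bound exploding gradients
    optimizer.step()
    return loss.item()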