---
title: "An Architecture For Neural Language Modeling"
format:
  html:
    toc: true
    # toc-location: left
    toc-title: " "
author:
  - Jacob Buckman
  - Carles Gelada
date: "2023-12-27"
categories: false
draft: false
---

$$
\newcommand{\R}{\mathbb{R}}
\newcommand{\Z}{\mathbb{Z}}
\newcommand{\N}{\mathbb{N}}
\newcommand{\sft}{\text{softmax}}
\newcommand{\List}{\text{List}}
\newcommand{\Seq}{\text{Seq}}
\newcommand{\SeqT}{\text{SeqT}}
\newcommand{\CSeqT}{\text{CSeqT}}
\newcommand{\Dist}{\text{Dist}}
\newcommand{\SM}{\text{SM}}
\newcommand{\id}{\text{id}}
\newcommand{\Fn}{\text{Fn}}
\newcommand{\Tok}{\text{Tok}}
\newcommand{\Aij}{ A_{[i,j]}}
$$

A crucial decision in neural language modeling is the choice of architecture, which governs the computational and statistical properties of the model. A well-selected architecture can improve performance-per-dollar by orders of magnitude. The objective of this work is to find an architecture with the following four properties:

* Good scaling law on model size.
* Good scaling law on context size.
* Efficient usage of GPUs during training.
* Efficient sampling.

At the time of writing, the most popular architecture for neural language modeling, the Transformer, has only three of these properties (it lacks efficient sampling). Many other architectures have been proposed, but none that meets all four criteria. We believe that such an architecture exists. This document is an expository writeup of the progress we have made towards the goal of discovering it.

This document contains our evolving understanding of the key concepts and results needed to understand neural language modeling, including mathematical, computational, and experimental results. Unlike a traditional research paper, this is a living document that will continually be updated as we make progress. Much of the content of this document will already be familiar to anyone who has thought deeply about language modeling. There is nothing particularly groundbreaking here. But we found a lot of value in explicitly formalizing the concepts, e.g. how to think of Transformers and RNNs as members of the same family, as well as in grounding all of these ideas in rigorous mathematics. Also, in order to properly understand our specific setting, we often find that it is helpful to describe more general objects of which neural language modeling is a specific case. Where applicable and insightful, we provide such results in their full generality; as such, this document contains many ideas that may be of independent interest to researchers in related areas.

It's not about minimizing floating point operations. It's about optimizing tokens per second on our hardware. GPUs perform well on tasks that are highly **parallelizable**. RNNs, the more "classical" architectures for sequences, are not very well suited for these types of tasks. An important feature of the Transformer architecture is that it can compute its outputs in parallel. We find that it's important to think about sequences, and maps between sequences, as the core objects.

<!-- During the course of this research we've converged to a framework that helps us think about what makes an architecture suitable for efficient language modeling. We refer to the key property that enables efficient training as **causal sequence transformations**. Surprisingly, efficient sampling turns out to be a consequence of a totally independent property, one that has to do with **state machines**.
This helps us to draw the connection between the classic KV caching optimization for transformer sampling and the recurrence property that makes RNNs so well suited for sampling. A funny side effect of understanding how to do things efficiently is that you also learn how to do them inefficiently. We will see that there is a very straightforward way to turn any generic architecture (like an MLP) into an autoregressive language model, albeit one that is very inefficient for both training and sampling. -->

# 1) Language Modeling

Let's begin by defining some basic concepts and definitions.

**Definition 1:** Given a set $X$, $t\in \N$ and elements $x_1, \cdots, x_t \in X$, we say that $[x_1, \cdots, x_t]$ is a length $t$ sequence with elements in $X$. We denote by $\Seq_t X$ the collection of all such sequences. In other words, $\Seq_t X$ is the cross product of $t$ copies of the set $X$.

$$
\Seq_t X = \underbrace{X\times \cdots \times X }_\text{t}
$$

This is a somewhat uncommon notation. To refer to the cross product of $t$ copies of $X$, it would be more standard to write $X^{\times t}$ instead of $\Seq_t X$. But in the context of language modeling we find it nicer to talk about sequences instead of cross products.

**Definition 2:** We denote by $\Seq X$ the set of all sequences of any finite length:

$$
\Seq X = \bigcup_{t\in \N} \Seq_t X
$$

::: {.column-margin}
Note that even when $X$ is a finite set, $\Seq X$ will contain infinitely many elements. But don't confuse that with the sequences in $\Seq X$ potentially having infinite length. Any sequence in $\Seq X$ will be a collection of $t$ elements of $X$ for some $t\in \N$. Infinite length sequences are of course an important mathematical object, but we are not concerned with them in this work.
:::

In this work, when we talk about **language** we refer to written language, represented as a sequence of tokens, which could be alphanumeric characters, words in a vocabulary, or some other tokenization scheme. Here $Z = \{\text{"a"},\text{"b"}, \cdots \}$ would be the set of alphanumeric characters, and an element of $\Seq Z$ is just a piece of text like $\text{"I like penguins"} \in \Seq Z$, but also $\text{"I like potatos"} \in \Seq Z$, the sequence $\text{"Adk23rhf"} \in \Seq Z$, and all the works of Shakespeare. $\Seq Z$ contains every religious holy book, the proof of the Riemann hypothesis, the details of what happened to Amelia Earhart. Every truth and every lie. The entire story of your life and of how you will die. Our ancestors communicate with us via elements of $\Seq Z$. And even after other forms of communication became technologically possible, we still largely prefer to use $\Seq Z$ to share work and ideas. $\Seq Z$ seems to have some sort of universality, where all the important questions and their answers can be naturally encoded as points in it. The reason text has been playing such an outsized role in the field of artificial intelligence has something to do with the combination of this universality and the fact that, for the last 3000 years, one of the major endeavours of humanity has been to gather a vast collection of interesting points in $\Seq Z$.

![387588502_1063711931451275_6779269320171301447_n.jpeg](https://hackmd.io/_uploads/HJLq0zp7a.jpg)

The first step one must take when doing language modeling is to go into the wild and collect a dataset of them $D \in \Seq_n \Seq_t Z$.

::: {.column-margin}
We could generalize the language modeling task to involve variable length sequences, but it's convenient to assume all sequences in the dataset $D \subset \Seq Z$ have length $t$. First of all, talking about $\Dist \Seq X$ requires significantly more complex mathematics: when talking about distributions over infinite sets one must consider carefully what it means for the probabilities to sum to 1. It involves considerations about the convergence of the sum. Another reason we prefer to think about $\Seq_t Z$ is that we will be very interested in how the computational cost of training models changes with the sequence length $t$. It would be very cumbersome to do that analysis when the dataset is composed of variable length sequences. And since any variable length dataset can be trivially converted into a fixed length one (by padding short sequences or cutting off long ones), we make this simplifying assumption.
:::
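As a purely illustrative sketch (the character-level tokenization, the padding symbol, and $t=16$ are assumptions made for this example, not choices made in this work), here is one way to turn raw text into such a fixed-length dataset:

```python
# A minimal illustration: building a tiny fixed-length, character-level dataset
# D ∈ Seq_n Seq_t Z. The alphabet, padding character, and length t are arbitrary.
def make_dataset(texts, t, pad="_"):
    """Pad short texts and cut off long ones so every sequence has length t."""
    dataset = []
    for text in texts:
        tokens = list(text)[:t]                 # cut off long sequences
        tokens += [pad] * (t - len(tokens))     # pad short sequences
        dataset.append(tokens)
    return dataset

D = make_dataset(["I like penguins", "Adk23rhf"], t=16)
print(D[0])  # ['I', ' ', 'l', 'i', 'k', 'e', ' ', 'p', 'e', 'n', 'g', 'u', 'i', 'n', 's', '_']
```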
Before we discuss the next step, let us properly define the concept of a distribution.

**Definition.** Given a finite set $X$, the space of distributions $\Dist X$ is the collection of functions $f: X \to \R$ where $f(x) \ge 0$ and $\sum_{x\in X} f(x) = 1$.

In language modeling, we aim to find a model $f \in \Dist \Seq_t Z$ that closely matches our natural-language-derived training data. We want our model to assign high probability to the sequences in our dataset. To turn that into a precise mathematical statement we need the concept of an independent distribution.

**Definition.** Given two finite sets $X,Y$ and $f_X\in \Dist X, \; f_Y \in \Dist Y$, the independent distribution $f_{X\times Y}$ is a distribution over the finite set $X\times Y$ defined as

$$
f_{X\times Y}(x,y)=f_X(x) f_Y(y)
$$

**Proof that it is well defined.** To see that $f_{X\times Y}$ is well defined we need to check that the outputs are non-negative and sum to 1. Non-negativity is immediate, since each output is a product of two non-negative numbers, and

$$
\sum_{(x,y)\in X\times Y} f_X(x) f_Y(y) = \left(\sum_{x\in X} f_X(x)\right)\left(\sum_{y\in Y} f_Y(y)\right) = 1 \cdot 1 = 1.
$$

$\blacksquare$

From this we can construct the basic learning objective. To know if our language model is any good, we ask the question: what is the probability that all the sequences in $D$ were sampled independently from $f$? From the definition of the independent distribution we know that probability is

$$
\prod_{(x_1 \cdots x_t)\in D} f(x_1 \cdots x_t)
$$

The goal of learning is to find an $f\in \Dist \Seq_t Z$ where this probability is high. In practice, it is more common to frame the problem as minimization rather than maximization, and useful to work in log-space (this doesn't change the optimum since log is monotonic), giving us our loss

$$l(D) = -\ln Pr(D | f) = \sum_{(x_1 \cdots x_t)\in D} -\ln f(x_1 \cdots x_t)$$

::: {.column-margin}
The primary reason to use the log-prob loss is that it enables minibatching, which is an important technique when training models on large datasets. The minibatch approximates the loss on the entire dataset by computing the loss on a subsample from the dataset. In order for this approximation to be unbiased, we require $l(D) = \mathbb{E}_{B \sim D}[l(B)]$, a property that can easily be seen to hold for the log-prob loss. A secondary reason to use the log-prob loss is numerical stability; working directly with probabilities when using floating-point numbers is challenging, since for relatively long sequences the cumulative product of probabilities becomes very small. Since the logarithm transforms products into sums, log-probabilities tend to stay within representable regions.
:::
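To make the loss concrete, here is a tiny illustrative computation (the distribution and the dataset are made up for this example):

```python
import math

# Illustrative only: a hand-picked distribution f over Seq_2 Z for Z = {"a", "b"}.
f = {
    ("a", "a"): 0.4,
    ("a", "b"): 0.3,
    ("b", "a"): 0.2,
    ("b", "b"): 0.1,
}
assert abs(sum(f.values()) - 1.0) < 1e-12   # f is a distribution

D = [("a", "a"), ("a", "b"), ("a", "a")]    # a toy dataset of length-2 sequences

# l(D) = sum over sequences in D of -ln f(x_1 ... x_t)
loss = sum(-math.log(f[x]) for x in D)
print(loss)  # ≈ 3.037
```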
### 1.1) Modeling with Deep Learning

A modeling task involves two problems. First you construct a model family / architecture / parametrized family, etc.; then you find a good point in it, usually with a variant of gradient descent.

**Definition.** Given sets $X$ and $Y$, a model architecture is a map $\phi: \R^p \to \Fn(X, Y)$. Given a parameter vector $\theta \in \R^p$, the architecture gives us a function $f = \phi(\theta)$.

**Deep learning** generally refers to cases where the architecture $\phi$ is constructed with the following principles in mind:

* **Compositional:** individual building blocks stacked on top of each other, creating a stacked computation, usually alternating between linear functions and some nonlinearity.
* **Differentiable:** We optimize the parameters $\theta$ with a variant of gradient descent. This means we will need to construct a loss function $l : \R^p \to \R$ over the parameters such that we can compute $\nabla_\theta l$ during the optimization stage. This means that the architecture $\phi$ must be differentiable, meaning that if the parameters change a little bit, so do the outputs of the function.
* **Scale:** There are some trends that reliably increase performance. One can reliably estimate the gains of training larger models on more data. When evaluating an architecture, the goal is to understand what performance would be attained at any model scale and data amount.
* **High throughput:** It's not about minimizing floating point operations. It's about optimizing tokens per second on our hardware.
* **Parallelizable:** GPUs and other specialized deep learning hardware perform well on tasks that are highly parallelizable. It's another essential characteristic of deep learning.

In this work, we will generally suppress $\theta$ for notational cleanliness, writing models as $f$ instead of $f_\theta$ unless necessary.

### 1.2) Autoregressive Modeling

Ok, so we will be looking for a distribution $f \in \Dist \Seq_t X$ assigning high log-probability to the data. But how do we go about constructing (i.e. implementing in a computer) such a distribution? If the number of tokens is $c$, the space of all documents of length $t$ has $c^t$ elements. *WARNING: cliche incoming.* Even for short sentences of length $t=100$, $c^t$ will be larger than the number of atoms in the universe, and so impossible to represent explicitly.

The most common approach to tackle this problem is via **autoregressive modeling**, which emerges out of two key realizations:

* It's very easy to computationally represent a model $g: \Seq X\to \Dist X$ that maps sequences to distributions over tokens.
* For any fixed $t\in \N$, there is a natural way to transform any $g: \Seq X\to \Dist X$ into an $f\in \Dist \Seq_t X$, and vice versa. We call this transformation the *autoregressive map*.

**Definition 3:** For a given $t\in \N$ and any $g: \Seq X \to \Dist X$ and $x_1 \cdots x_t \in \Seq_t X$, the autoregressive map $A_t$ is defined as

$$
A_t(g)(x_1 \cdots x_t) = \prod_{i=1}^t g(x_1 \cdots x_{i-1})(x_i)
$$
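As a concrete illustration of Definition 3 (not part of the original text: the particular $g$ below is made up, represented as a Python function from a prefix tuple to a dictionary of token probabilities):

```python
def g(prefix):
    """A toy next-token model over Z = {"a", "b"}: it slightly prefers to repeat
    the previous token. Purely illustrative; any map Seq Z -> Dist Z would do."""
    if not prefix:
        return {"a": 0.5, "b": 0.5}
    last = prefix[-1]
    return {last: 0.8, ("b" if last == "a" else "a"): 0.2}

def autoregressive_prob(g, seq):
    """A_t(g)(x_1 ... x_t) = prod_i g(x_1 ... x_{i-1})(x_i)."""
    prob = 1.0
    for i, token in enumerate(seq):
        prob *= g(seq[:i])[token]
    return prob

print(autoregressive_prob(g, ("a", "a", "b")))  # 0.5 * 0.8 * 0.2 = 0.08
```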
Out of this definition, it's obvious that $A_t(g)$ is a function in $\Fn(\Seq_t X, \R)$. But it's not clear that it also is a distribution in $\Dist \Seq_t X$. It's also not clear that for all $f\in \Dist \Seq_t X$ there exists a $g\in \Fn(\Seq X, \Dist X)$ such that $f=A_t(g)$. The result below addresses these concerns.

**Result 1:** The map $A_t: \Fn(\Seq X, \Dist X) \to \Dist \Seq_t X$ is well defined and is surjective.

**Proof.** To see that $A_t(g)\in \Dist \Seq_t X$ we need to show that the outputs of $A_t(g)$ are non-negative and sum to $1$. Seeing that $A_t(g)(x_1 \cdots x_t) \ge 0$ is trivial, since each output is a product of $t$ terms of the form $g(x_1 \cdots x_{i-1})(x_i)$, which are all non-negative due to the assumption that $g: \Seq X \to \Dist X$. Now, showing that

$$
\sum_{x_1 \cdots x_t \in \Seq_t X} A_t(g)(x_1 \cdots x_t) = 1
$$

is a little trickier. We are going to proceed via induction. Take $t=1$ as the base case:

$$
\sum_{x_1 \in \Seq_1 X} A_1(g)(x_1) = \sum_{x \in X} g()(x) = 1
$$

Now assume the result holds for sequences of length $t-1$. Then

\begin{align}
\sum_{x_1 \cdots x_t \in \Seq_t X} A_t(g)(x_1 \cdots x_t) &= \sum_{x_1 \cdots x_t \in \Seq_t X} \prod_{i=1}^t g(x_1 \cdots x_{i-1})(x_i) \\
&= \sum_{x_1 \cdots x_{t-1} \in \Seq_{t-1} X} \sum_{x_t \in X} \left( \prod_{i=1}^{t-1} g(x_1 \cdots x_{i-1})(x_i) \right) g(x_1 \cdots x_{t-1})(x_t) \\
&= \sum_{x_1 \cdots x_{t-1} \in \Seq_{t-1} X} \left( \prod_{i=1}^{t-1} g(x_1 \cdots x_{i-1})(x_i) \right) \sum_{x_t \in X} g(x_1 \cdots x_{t-1})(x_t) \\
&= \sum_{x_1 \cdots x_{t-1} \in \Seq_{t-1} X} \prod_{i=1}^{t-1} g(x_1 \cdots x_{i-1})(x_i) \\
&= \sum_{x_1 \cdots x_{t-1} \in \Seq_{t-1} X} A_{t-1}(g)(x_1 \cdots x_{t-1}) \\
&= 1
\end{align}

We have shown that $A_t(g) \in \Dist\Seq_t X$. Now we just need to show the map is surjective. TODO $\blacksquare$

Once we are looking for a function $g: \Seq X \to \Dist X$ we are completely within the domain of deep learning. This result is at the core of what allows deep learning to tackle sequence modeling. With this established, let's now explore in more detail how to construct a neural network $g: \Seq X \to \Dist X$.

ASIDE 4: There are also radically different approaches to using NNs to model distributions over sequences that are non-autoregressive and thus do not rely on this result; but autoregressive modeling is what has been behind everything impressive so far.

TODO: perhaps we could explain here how to turn an MLP into an autoregressive language model

### 1.3) Autoregressive Sampling

There are two main things we want to do with our language model $f\in \Dist \Seq_t Z$. For training we want to evaluate the probability $f(x_1\cdots x_t)$ of specific sequences in the dataset. But once the model is trained, presumably we will want to use it to sample sequences from it.

**The dumbest way to sample:** Given a finite set $X$ with $n$ elements, there is a very simple way to sample from any distribution $p \in \Dist X$. You first have to pick an ordering of the elements $x_1 \cdots x_n$. You then sample a random real number $r \sim \text{Uniform}(0, 1)$. Then you start going through the elements in order and accumulate the total probability seen

$$
a_i = \sum_{j=1}^i p(x_j)
$$

Once the accumulated probability $a_i > r$, you return the element $x_i$ as the sample. When we are sampling sequences, the total number of elements in $\Seq_t Z$ is $c^t$. This means we would need to evaluate the probability of an enormous number of sequences to get a single sample (yes, there are a lot of optimizations we could do on top of this basic algorithm, but it is so astronomically inefficient that it's really not worth trying to find improvements).

**The autoregressive way:** Luckily for us, the fact that we get our distribution $f$ out of $g\in \Fn(\Seq X, \Dist X)$ via $f=A_t(g)$ opens up a completely different approach to sampling. Think about the following procedure. Start with the empty sequence and sample $x_1 \sim g()$, then sample $x_2 \sim g(x_1)$ followed by $x_3\sim g(x_1, x_2)$ and so on.
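A minimal sketch of this procedure (illustrative; $g$ is again represented as a function from a prefix tuple to a dictionary of token probabilities):

```python
import random

def sample_token(dist):
    """Sample one token from a distribution given as {token: probability}:
    accumulate probability in some fixed order until it exceeds r."""
    r = random.random()
    a = 0.0
    for token, p in dist.items():
        a += p
        if r < a:
            return token
    return token  # guard against floating-point round-off

def autoregressive_sample(g, t):
    """Sample x_1 ~ g(), x_2 ~ g(x_1), ..., x_t ~ g(x_1 ... x_{t-1})."""
    seq = ()
    for _ in range(t):
        seq = seq + (sample_token(g(seq)),)
    return seq

# A trivial g for demonstration: it ignores the prefix entirely.
print(autoregressive_sample(lambda prefix: {"a": 0.9, "b": 0.1}, t=5))
```

Note that `sample_token` is just the "dumbest" sampler above applied to a single token distribution of size $c$ rather than to all $c^t$ sequences; the whole loop therefore needs only $t$ evaluations of $g$.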
**Result:** When recursively sampling from $g$, the probability of sampling the sequence $z_1 \cdots z_t \in \Seq_t Z$ is precisely $A_t(g)(z_1 \cdots z_t )$.

**Proof:** Fix a sequence $z_1 \cdots z_t \in \Seq_t Z$. Clearly, the probability of first sampling $z_1$ is $g()(z_1)$. Then, given that we've sampled $z_1$, the probability of sampling $z_2$ is $g(z_1)(z_2)$. Continuing the pattern we get that the probability of sampling the full sequence $z_1 \cdots z_t$ is

$$
\prod_{i=1}^t g(z_1 \cdots z_{i-1})(z_i) = A_t(g)(z_1 \cdots z_t)
$$

$\blacksquare$

Talk about gains. We've gone from needing $O(|Z|^t)$ evaluations of the function $g$ to only needing $t$! That is the other major advantage of using $\Fn(\Seq Z, \Dist Z)$ to construct our language model. Not only does $A_t$ enable us to construct distributions over sequences in a way that works well with neural networks, it does so in a way that is cheap to sample from.

# 2) Deep Learning with Sequences

Ok, so to train an autoregressive LM we need an architecture $\Seq Z \to \Dist Z$. To think clearly about all the ways we can go about constructing such architectures it is useful to take a step back and briefly consider a more general class of deep learning problems: tasks that involve sequences as inputs or outputs. This more generic problem is of independent interest, because sequence spaces turn out to be common in modeling many real world phenomena, not just language. Some examples would be:

* **Point clouds.** The set $X = \R^3$, 3 coordinates representing a position in space. $\Seq X$ would be the set of all possible point clouds.
* **Molecular representations.** Here an element of $X$ could be the description of an atom: a list of numbers representing its position (like for 3D point clouds) but also other information like momentum, charge, or anything else physically relevant. $\Seq X$ would represent the configuration of a physical world with many such atoms.
* **Amino acid sequences.** $X$ is the set of amino acids. The description of an amino acid might or might not have a spatial location.
* **Images.** Images can be thought of as a sequence of pixels. An image of RGB pixels with $h$ vertical pixels and $w$ horizontal ones would correspond to an element in $\Seq \R^3$ of length $h\times w$. What about elements of $\Seq \R^3$ whose length is not a multiple of $w$? They correspond to non-rectangular images... you can see that $\Seq \R^3$ is not a super natural way to represent the space of images. And indeed, the architectures that "think" about images in a different way tend to be more successful (yes, I'm talking about convolutional networks). But still, $\Seq \R^3$ is a perfectly legitimate way to think of images.

These sequence spaces are used in tasks like the following:

* **Molecular modeling.** Given the description of the physical state, predict the evolution of the system.
* **Protein folding.** Here $X$ is a representation of an amino acid and the outputs are a sequence of 3D points giving the position in space of the corresponding amino acid.
* **Point cloud classification.** For a finite set of categories $C$ we are looking for a function $\Seq \R^3 \to \Dist C$.
* **Image coloring.** Given a grayscale image, assign an RGB value to each pixel.

In these tasks we've seen signatures of the type

* $\Seq X \to Y$
* $\Seq X \to \Seq Y$

And we would like to have a good sense of how to construct architectures in all of these cases.
Even though this seems like a very generic problem, it turns out that there is a much more constrained type of architecture that plays an outsized role in the space of deep learning architectures. And even when our task demands a more generic architecture, we can reduce the problem to using architectures of this constrained type under the hood.

## 2.1) Sequence Transformations

**Definition.** We say $h: \Seq X \to \Seq Y$ is a **sequence transformation** if $h$ preserves the length of the sequence, meaning that if the input sequence has length $t$, so does the output sequence. We refer to $\SeqT(X, Y)\subset \Fn(\Seq X, \Seq Y)$ as the subset of all sequence transformations.

In deep learning we care about composing multiple layers into more complex architectures. Sequence transformations are very well behaved in this way, as the composition of two sequence transformations is a sequence transformation itself.[^1] This is a completely obvious fact, but since it is quite an important one it's worth stating it as a result.

[^1]: Sequence transformations form a category.

**Result:** If $h\in \SeqT(X, Y)$ and $f\in \SeqT(Y, Z)$ then $f\circ h \in \SeqT(X,Z)$.

In the next section we'll see how RNNs and Transformers are examples of sequence transformations. So are elementwise functions on the sequence elements.

Just from a quick glance at the list above, it's quite obvious that many learning problems naturally involve sequence transformations. For tasks like **protein folding** and **image coloring**, each entry of the output sequence is trying to predict missing information about the same entry of the input sequence. Thus, for this type of task we need the output sequence to have the same length as the input one, and so our model architectures must be sequence transformations.

But of course that isn't always the case. In particular, our central objective of autoregressive language modeling requires architectures with a different signature, $\Seq X \to Y$. What is surprising is that even for tasks like these, sequence transformations turn out to be an extremely useful building block under the hood. To see why, take a moment to think about how you might construct an architecture with the desired input and output spaces. On one side you have the space $\Seq X$, where a point might contain an unboundedly large amount of information. On the other, you have $Y$, which often is a finite dimensional vector space. $Y$ is a much more "bounded" space than $\Seq X$. When designing your architecture you are forced to decide where to place the projection from the variable sized space to the fixed sized one. Given that neural network architectures are always constructed as the composition of many individual layers, there are two very natural choices of where to place the projection: at the beginning, or at the end. We call these choices early drop and late drop respectively. Early drop architectures would look something like

$$
\Seq X \to \R^d \to \cdots \to \R^d \to Y
$$

::: {.column-margin}
$d$ represents the width of each layer of the neural network. The arrows represent the individual layers of the neural network.
:::

and the late drop type would be constructed out of many sequence transformation layers, except for the last one:

$$
\Seq X \to \Seq \R^d \to \cdots \to \Seq \R^d \to Y
$$

One can definitely construct neural network architectures in the two ways, and we are about to see some examples of both; a minimal sketch of the two shapes follows below.
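The following toy sketch is purely illustrative (the embedding, the mean-pooling, and the elementwise layers are arbitrary assumptions, not architectures used in this work); its only purpose is to show where the projection from the variable sized space to the fixed sized one sits in each case.

```python
import numpy as np

d = 8                                               # layer width
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(d, d)) for _ in range(3)]    # weights for 3 illustrative layers

def embed(x):
    """Map a single token (here: a character) to a vector in R^d."""
    v = np.zeros(d)
    v[hash(x) % d] = 1.0
    return v

def early_drop(xs):
    # Seq X -> R^d: immediately collapse the whole sequence into one vector ...
    h = np.mean([embed(x) for x in xs], axis=0)
    # ... then R^d -> R^d layers, ending in Y = R^d.
    for W in Ws:
        h = np.tanh(W @ h)
    return h

def late_drop(xs):
    # Seq X -> Seq R^d: keep a full sequence of vectors ...
    hs = [embed(x) for x in xs]
    # ... pass it through sequence transformations (here: the same map applied elementwise) ...
    for W in Ws:
        hs = [np.tanh(W @ h) for h in hs]
    # ... and only drop down to Y = R^d at the very end.
    return hs[-1]

print(early_drop("hat").shape, late_drop("hat").shape)  # (8,) (8,)
```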
But it does seem to be the case that all the architectures we actually use in practice (e.g. transformers and RNNs) are of the late drop type. We could speculate about why:

* Using sequences internally is a good inductive bias for tasks that are naturally expressed as sequences.
* Adaptive computation: more complex tasks have more internal state available to solve them.

But honestly, that just leads to a lot of talk with little substance. What is certainly true is that the second type is the one with widespread use; it seems to be the deep learning way of doing things. Moreover, in section 3 we will see how thinking about architectures based on sequence transformations enables one massive optimization to train autoregressive language models.

TODO: Even when your architecture can be thought of in a different way, the sequence transformation perspective is what allows you to implement it efficiently on hardware. Unleash all your cores to compute different parts of the output all at the same time. Unconstrain yourself.

## 2.3) Example Architectures

TODO: make the point that sequence transforms give us a unified perspective to think about RNNs and Transformers

### 2.3.1) Recurrent Neural Networks

An RNN has the signature

$$
r(s_{t-1}, x_t) = (s_t, y_t)
$$

For example, a very basic RNN layer might be something like

$$
s_{i} = \sigma(W s_{i-1}) + x_i \;\;\;\;\;\;\;\;\;\; y_i = U s_i
$$

where $x_i, s_i, y_i \in \R^d$ are the inputs, states and outputs respectively, $W, U \in \R^{d\times d}$ are weight matrices, and $\sigma$ is a nonlinearity like a sigmoid. Of course, there is an infinitude of variations. The specific step equations matter much less than the general pattern of a layer with the signature of $r$.

As written, it's not clear that we have a sequence transformation in our hands. But once we pick an initial state $s_{0}\in \R^d$, any function with such a recurrent formulation can be *unrolled*: given the input sequence $x_1 \cdots x_t$ we can unroll the recurrent step and compute the sequence $y_1 \cdots y_t$. Thus we have a sequence transformation in our hands; see the sketch at the end of this section.

In deep learning we will want to stack many such layers $r$. For example, let's consider stacking $n$ layers $r$ to construct a deep RNN. The stack of layers will have the same signature as our basic RNN; the only difference is that the state will consist of $n$ vectors in $\R^d$. The following diagram represents the computation of a deep RNN with 2 layers and $t=3$. The blue boxes represent each step of unrolling the RNN through time.

![File (11)](https://hackmd.io/_uploads/rJjhy0Mda.jpg)

But people don't tend to implement deep RNNs this way! Anyone who has seen an implementation of an RNN will recognize that the actual computation we implement is instead:

![File (12)](https://hackmd.io/_uploads/r1HhkCfup.jpg)

It's like we try to avoid the recurrent formulation of a deep RNN and instead prefer to see it as a stack of sequence transformations. But why? Of course they are mathematically equivalent, but there is a difference in practical performance on hardware: cache hits EXPAND FURTHER. We start to see how thinking about sequence transformations seems to have some computational advantages.

To create the function we need for our autoregressive LM we could take a stack of RNN layers and on the last layer throw out all the outputs except for $y_t$. That gives us a function with signature $\Seq X\to Y$.
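As a small illustrative sketch of unrolling (the specific step equation mirrors the toy layer above; the shapes and weights are arbitrary):

```python
import numpy as np

def rnn_step(W, U, s_prev, x):
    """One tick of the toy RNN layer: s_i = sigmoid(W s_{i-1}) + x_i, y_i = U s_i."""
    s = 1.0 / (1.0 + np.exp(-(W @ s_prev))) + x
    y = U @ s
    return s, y

def unroll(W, U, xs):
    """Turn the recurrent step into a sequence transformation Seq R^d -> Seq R^d."""
    d = W.shape[0]
    s = np.zeros(d)          # initial state s_0
    ys = []
    for x in xs:             # inherently sequential: step i needs s_{i-1}
        s, y = rnn_step(W, U, s, x)
        ys.append(y)
    return ys                # same length as the input sequence

rng = np.random.default_rng(0)
d, t = 4, 3
W, U = rng.normal(size=(d, d)), rng.normal(size=(d, d))
xs = [rng.normal(size=d) for _ in range(t)]
print(len(unroll(W, U, xs)))  # 3
```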
### 2.3.2) Transformers

A transformer layer takes in a sequence of inputs $X_1 \cdots X_t \in \R^d$ and uses weight matrices $W_Q, W_K, W_V \in \R^{d\times d}$ to compute $Q_i = W_Q X_i$, $K_i = W_K X_i$ and $V_i = W_V X_i$. The outputs of the layer are

$$
Y_i = \sum_{j=1}^t e^{K_j^T Q_i} V_j
$$

And the causal transformer layer is

$$
Y_i = \sum_{j=1}^i e^{K_j^T Q_i} V_j
$$

They are both clearly sequence transformations, since we get $t$ outputs $Y_i$. One massive advantage of transformers over alternative architectures like the RNNs we've just seen is that they are highly parallelizable. Our GPUs with many cores can split the work of computing all the $Y_i$ in parallel. You don't need to finish computing $Y_1$ before you can move on to $Y_2$. The potential for this parallelism is something quite natural when one thinks in terms of sequence transformations $h: \Seq X \to \Seq Y$. There is nothing inherent that says that one must compute the output sequence one step at a time.

To create a practical architecture for our autoregressive language modeling we would stack a series of transformer layers, normally with some MLPs in between. Just like for RNNs, we would throw out all the outputs of the last layer except for $Y_t$. That gives us a function with the desired signature $\Seq X\to Y$.

*Note how it really doesn't matter if we use causal or non-causal transformers for this task. We can get an autoregressive LM out of each. Yes, there is a very good reason to use causal transformers, but to see why we will need to think carefully about training costs. Something that we will do in section 3.*
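A minimal sketch of the causal layer above (illustrative; it implements the unnormalized equation as written, with no softmax or scaling):

```python
import numpy as np

def causal_attention(Q, K, V):
    """Y_i = sum_{j<=i} exp(K_j^T Q_i) V_j, with tokens stored as columns of d x t matrices."""
    d, t = Q.shape
    Y = np.zeros((d, t))
    for i in range(t):                      # each output can be computed independently (parallelizable)
        for j in range(i + 1):              # causal: only attend to positions j <= i
            Y[:, i] += np.exp(K[:, j] @ Q[:, i]) * V[:, j]
    return Y

rng = np.random.default_rng(0)
d, t = 4, 6
X = rng.normal(size=(d, t))
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))
Y = causal_attention(W_Q @ X, W_K @ X, W_V @ X)
print(Y.shape)  # (4, 6)
```

The outer loop over $i$ has no dependencies between iterations, which is exactly the parallelism described in the text.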
### 2.3.3) k-gram MLP

We've seen a couple of common examples of "late drop" architectures. Let's give one of the "early drop" type. As stated, this is not a common thing to do, and the experienced deep learning practitioner will recognize it's kind of funky. Let's do it nonetheless for the sake of completeness.

Inspired by the classic $k$-gram models: if the input sequence is in $\Seq \R^d$, we could "stack" the $k$ last inputs and apply an MLP to the resulting vector in $\R^{kd}$. The last layer of the MLP would map into $Y$.

Independent of how well this architecture learns, it turns out that it is very poorly behaved computationally compared to RNNs and causal transformers. In section 3 we will see how this architecture is poorly suited for the purposes of autoregressive language modeling in the same way that non-causal transformers are.

# 3) Efficient Autoregressive Modeling

TODO: come up with a better name for the section

Combining what we have seen in the last two sections, we have everything we need to implement a neural network that can be used for autoregressive language modeling. But as we will now see, there is a lot left on the plate. We will now study the costs of training and sampling and see how, by introducing a new type of architecture, we can get a lot more value out of our GPUs.

## 3.1) Cost of Training

Now, let's return to considering autoregressive language modeling with a model $g : \Seq X \to \Dist X$. How many times do we have to call the underlying function $g$ to evaluate the loss $l$? When we evaluate the loss we have to compute

$$ l = \sum_{x\in D} \sum_{i=1}^t -\ln[g(x_1 \cdots x_{i-1})(x_i)]$$

For example, suppose our dataset consists of a single sentence: $\text{"I like penguins"}$. To evaluate the loss we need to evaluate our model $g$ on the empty sequence $\text{""}$ and the first character $\text{"I"}$ and $\text{"I "}$ and $\text{"I l"}$, $\text{"I lik"}$, etc.

$$
g(\text{""}) \\
g(\text{"I"}) \\
g(\text{"I "}) \\
g(\text{"I l"}) \\
g(\text{"I li"}) \\
\vdots
$$

![385505895_353915030504501_2594593902089499332_n.jpeg](https://hackmd.io/_uploads/BySruQpX6.jpg)

So, for every sequence $x_1 \cdots x_t \in D$, to compute the loss we need to call our model on $t$ different inputs.

**The motivation to find a speed-up.** We have been considering the possibility of using an architecture of the type

$$
g:\Seq X \to \Seq \R^d \to \cdots \to \Seq \R^d \to Y
$$

but it might be dumb that all the steps of the computation $g(x_1\cdots x_t)$ involve sequences of length $t$ and at the end we project down and output a single vector. Perhaps we could use a different architecture that is a sequence transformation

$$
h: \Seq X \to \Seq \R^d \to \cdots \to \Seq \R^d \to \Seq Y
$$

The dream would be that $h(\text{"I like penguins"}) = [g(\text{"I"}), g(\text{"I "}), g(\text{"I l"}), g(\text{"I li"}), \cdots]$, so the output sequence contains all the values we need in a single call to our architecture.

## 3.2) Causal Sequence Transformations

The fundamental principle of causality is that *the past does not depend on the future*. The way to formalize this intuition is to say that evaluating a causal sequence transformation on a subsequence produces an output sequence that is also a subsequence.

**Definition.** We say $h\in \SeqT(X,Y)$ is a **causal sequence transformation** if, given a sequence $x_1 \cdots x_t \in \Seq_t X$ with output sequence $y_1 \cdots y_t = h(x_1 \cdots x_t)$, for any $i\le t$, evaluating $h$ on the subsequence $x_1 \cdots x_i \in \Seq_i X$ yields outputs $y'_1 \cdots y'_i = h(x_1 \cdots x_i)$ with the property that $y_j = y'_j$ for all $j\le i$. We refer to $\CSeqT(X, Y) \subset \SeqT(X, Y)$ as the subset of all causal sequence transformations.

The reader is encouraged to check that all functions below are causal sequence transformations:

* RNNs as defined in EQX
* Causal transformers as defined in EQY
* Elementwise functions $h(x_1 \cdots x_t) = \sigma(x_1) \cdots \sigma(x_t)$

Just like sequence transformations, causal sequence transformations form a category, meaning that they are closed under composition. This means we can take layers that we know are causal sequence transformations and combine them as building blocks into more complex architectures.

**Result:** If $h\in \CSeqT(X, Y)$ and $f\in \CSeqT(Y, Z)$ then $f\circ h \in \CSeqT(X,Z)$.

**Proof.** Bruh, why you expanding this proof? Don't be lazy and do it yourself. It's real easy.

For example, a full transformer architecture usually would be constructed by composing many transformer blocks, where each block consists of a transformer layer as in EQY followed by an MLP applied elementwise to all outputs of the transformer layer. Thanks to the previous result we know that the full transformer architecture will be causal just by confirming that elementwise functions and EQY are causal sequence transformations.

Alright, now that we understand causal sequence transformations and we have a vast space of such architectures at our disposal, let's see why they are so useful for the problem we were trying to solve.

**Definition:** The "take last" function $L: \CSeqT(X, Y) \to \Fn(\Seq X, Y)$ is defined the following way: given an $h\in \CSeqT(X,Y)$, if $x_1 \cdots x_t \in \Seq_t X$ then $L(h)(x_1 \cdots x_t) = [h(x_1 \cdots x_t)]_t$. In other words, if $y_1 \cdots y_t = h(x_1 \cdots x_t)$ then $L(h)(x_1 \cdots x_t) = y_t$.
**Result:** $L$ is a 1-1 mapping whose inverse "concatifies" a function $f \in \Fn(\Seq X, Y)$ by evaluating it on $x_1$, then $x_1, x_2$, etc. Concretely:

$$
L^{-1}(f)(x_1 \cdots x_t) = \left(f(x_1), f(x_1, x_2), \cdots, f(x_1 \cdots x_t) \right)
$$

**Proof.** First note that, given an $f\in \Fn(\Seq X, Y)$, if $h$ is the concatification of $f$, meaning that $h(x_1 \cdots x_t) = \left(f(x_1), f(x_1, x_2), \cdots, f(x_1 \cdots x_t) \right)$, then clearly $L(h)(x_1 \cdots x_t) = f(x_1 \cdots x_t)$. We also need to check the other direction. Given an $h\in \CSeqT(X, Y)$:

\begin{align}
h(x_1 \cdots x_t) &= \left([h(x_1 \cdots x_t)]_1, [h(x_1 \cdots x_t)]_2, \cdots, [h(x_1 \cdots x_t)]_t \right) \\
&= \left([h(x_1)]_1, [h(x_1, x_2)]_2, \cdots, [h(x_1 \cdots x_t)]_t \right) & \text{using the fact that $h$ is causal} \\
&= \left(L(h)(x_1), L(h)(x_1, x_2), \cdots, L(h)(x_1 \cdots x_t) \right) & \text{using the definition of $L$}
\end{align}

so if we concatify $L(h)$ we get $h$ back. $\blacksquare$

In summary,

![File (5)](https://hackmd.io/_uploads/r1f_JCaIa.jpg)

We've established a 1-1 mapping between the two function spaces. But computationally there is a very big difference between applying $L$ and applying $L^{-1}$. One is expensive, the other is cheap. The color of the arrows represents that.

We get another interesting diagram when we combine this result with the previous one (where horizontal lines indicate the function is invertible and vertical ones indicate the function is surjective):

![File (6)](https://hackmd.io/_uploads/B1hwM0TLp.jpg)

When we are trying to construct our distribution over length $t$ sequences we lose no expressivity if we start with an $h\in \CSeqT(X, \Dist X)$ and we apply $A_t\circ L: \CSeqT(X, \Dist X) \to \Dist \Seq_t X$. For any $f\in \Dist \Seq_t X$ there will be an $h\in \CSeqT(X, \Dist X)$ s.t. $f = A_t \circ L (h)$. So mathematically it really doesn't matter if our base architecture is $g\in \Fn(\Seq X, \Dist X)$ or $h\in \CSeqT(X, \Dist X)$, but computationally it makes a massive difference! Suppose $g = L (h)$. Then, to evaluate the loss on a single sequence $\text{"hat"}$ we required 3 invocations of the function $g$:

$$
g(\text{"h"}), \; g(\text{"ha"}), \; g(\text{"hat"})
$$

But when we evaluate $h(\text{"hat"})$, since $h = L^{-1} \circ L (h) = L^{-1}(g)$, then

$$
h(\text{"hat"}) = L^{-1}(g)(\text{"hat"}) = \left (g(\text{"h"}), \; g(\text{"ha"}), \; g(\text{"hat"}) \right)
$$

For a sequence of length $t$, just by evaluating $h$ a single time we get the $t$ evaluations of $g$ that we need for the loss. Using a causal sequence transformation is an incredible bargain. *You compute 1 and get $t-1$ for free!*
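A tiny illustrative sketch of the two directions of this correspondence (the helper names `take_last` and `concatify` are made up, and the toy causal transformation is just cumulative sums):

```python
def take_last(h):
    """L: turn a causal sequence transformation into a Seq X -> Y function."""
    return lambda xs: h(xs)[-1]

def concatify(g):
    """L^{-1}: rebuild the causal sequence transformation from g, the expensive way.
    It calls g once per prefix, which is exactly the t-fold cost we want to avoid."""
    return lambda xs: [g(xs[: i + 1]) for i in range(len(xs))]

# A toy causal sequence transformation: cumulative sums (output i depends only on x_1..x_i).
h = lambda xs: [sum(xs[: i + 1]) for i in range(len(xs))]
g = take_last(h)

xs = [1, 2, 3]
assert concatify(g)(xs) == h(xs) == [1, 3, 6]   # same values: t calls to g vs. one call to h
```

The point is the last line: calling $g$ on every prefix reproduces $h$, but $h$ computes all $t$ values in one pass.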
<!-- ### The Causal Partial Ordering

*NOTE: this section should be folded by default*

There is an alternative approach to reach the same definition of causal sequence transformation. We won't really use it anywhere else in the doc, but we believe it's a particularly beautiful angle, so we discuss it anyway. Feel free to skip this section.

There is a natural partial ordering of $\Seq X$. A sequence $x_1 \cdots \in \Seq X$ is a continuation of $y_1 \cdots \in \Seq X$ iff $len(x)\ge len(y) = t$ and $x_1=y_1, \;\cdots, \; x_t =y_t$.

A sequence transformation $f:\Seq X \to \Seq Y$ is causal if it preserves this ordering relation, meaning that if we have two inputs $x, x' \in \Seq X$ with $x \le x'$ ($x'$ is a continuation of $x$) then $f(x')$ will also be a continuation of $f(x)$. In other words, if $x\le x'$ then $f(x)\le f(x')$. Any function that preserves this ordering must respect that **the past does not depend on the future**. If $y_1 \cdots y_t = f(x_1 \cdots x_t)$, the first output $y_1$ depends only on $x_1$, meaning that for two completely different input sequences sharing the same initial token, $x_1 x_2 \cdots x_t$ and $x_1 x'_2 \cdots x'_t$, the two output sequences would have the same $y_1$.

----

**Result:** A sequence transformation $f: \Seq X \to \Seq Y$ is causal iff it preserves the causal ordering.

**Proof:** -->

## 3.3) Efficient Sampling

**Can we do even better?** Let's consider what it would look like to follow the autoregressive sampling procedure for the RNN of EQX. First we call $g(x_1)$, which requires

$$
s_1, y_1 = r(s_0, x_1) \;\;\;\;\; z_1 \sim y_1 \in \Dist Z
$$

Then we want to call $g(x_1, x_2)$, which requires

$$
s_1, y_1 = r(s_0, x_1) \;\;\;\;\; s_2, y_2 = r(s_1, x_2) \;\;\;\;\; z_2 \sim y_2 \in \Dist Z
$$

And we can already see the problem. The second invocation of $g$ internally recomputes the term $s_1$, which was already computed in the first call. This is a consequence of the fact that the way we are calling $g$ is very structured, and often there is potential for caching values and reusing computation between the individual calls. In the case of an RNN, the way you actually want to sample is to first compute $s_1, y_1 = r(s_0, x_1)$ and sample $z_1 \sim y_1$. Then reuse $s_1$ and compute $s_2, y_2 = r(s_1, x_2)$ and sample $z_2 \sim y_2$, and so on. This is a big optimization because we are caching computation into the state.

## 3.4) State Machines

To optimize training we exchanged $t$ calls to an expensive function $g$ for $1$ call to an equally expensive function $h$. To optimize sampling we will exchange $t$ calls to an expensive function $g$ for $t$ calls to a function $r$ which is cheaper because it reuses cached computation from the previous calls.

**Definition:** A **state machine** in $\SM(X,Y)$ is a tuple $(S, s_0, r)$ where $S$ is a set called the state space, $s_0 \in S$ is a point we call the origin, and $r: S\times X \to S\times Y$ is a step/tick function

$$
r(s_{t-1}, x_t) = (s_t, y_t)
$$

There is no difference between these equations and the basic form of an RNN, but there are a few reasons we are choosing to call this object a state machine instead of an RNN:

* We will apply this type of equation to things that in no way can be considered neural networks. A good example is a tape, which we will discuss shortly. Using the term NN (i.e. neural network) to refer to such objects feels silly.
* Another reason is that for an RNN it is usually assumed that $s_t \in \R^d$ for some $d\in \N$. In other words, the state has a "fixed size". As we will now see, to be able to unify transformers and RNNs, it's very important to think in terms of state machines with states of "growing size".
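Here is a minimal sketch of sampling with an RNN-style state machine, carrying the state forward so nothing is recomputed (illustrative assumptions throughout: the toy vocabulary, the tanh/softmax step, and the use of a fixed start token are all made up for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
Z = ["a", "b", "c"]                        # a toy vocabulary
d = 8
W, U, E = (rng.normal(size=(d, d)), rng.normal(size=(len(Z), d)),
           rng.normal(size=(d, len(Z))))   # illustrative weights; E embeds tokens

def step(s_prev, x):
    """One tick r(s_{t-1}, x_t) = (s_t, y_t), where y_t is a distribution over Z."""
    s = np.tanh(W @ s_prev + E[:, Z.index(x)])
    logits = U @ s
    y = np.exp(logits) / np.exp(logits).sum()    # softmax
    return s, y

def sample(t):
    """Autoregressive sampling that reuses the cached state instead of
    re-running the whole prefix at every step."""
    s = np.zeros(d)                        # the origin s_0
    out = []
    for _ in range(t):
        # assumption for this sketch: feed a fixed start token before anything is sampled
        s, y = step(s, out[-1] if out else Z[0])
        out.append(rng.choice(Z, p=y))
    return "".join(out)

print(sample(5))
```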
It's kind of obvious that, if the whole idea of state machines is to allow for the caching of information during the repeated invocations of $g$, there will always be a "worst case state machine" for any $g$: the state machine that caches no computations. We call this the **tape state machine** because the only thing it stores is the inputs.

**Definition:** The tape state machine $P(g)\in \SM(X,Y)$ of a function $g\in \Fn(\Seq X, Y)$ is

$$
P(g)=( \underbrace{\Seq X}_S, \underbrace{[\;]}_{s_0}, r_g )
$$

where the tick function $r_g$ is defined as

$$
r_g \bigg( \underbrace{[x_1 \cdots x_{t-1}]}_{s_{t-1}}, \; x_t \bigg) = \bigg( \underbrace{[x_1 \cdots x_t]}_{s_t}, \; \underbrace{g(s_t)}_{y_t} \bigg)
$$

The tape state machine might seem like a bit of a silly object, but it is useful in several ways:

* It helps us see that every $g$ can be thought of as a state machine. Once we have this unified framework we can compare the sampling costs of different models: we just compare the size of the states and the cost of every tick. We are going to do that next.
* Even though you never want to run the tape state machine, it's a useful starting point for the analysis of improvements. We can sample from autoregressive MLPs. It's also useful to construct baselines.
* And it is useful as a software engineering idea. You can check whether the samples from a model are good and only then take the effort to implement a highly optimized state machine. Another application is testing: when you are implementing an optimized SM, it's useful to have a (slow) reference that has the exact same interface and outputs as the function you are trying to implement.

**Result:** The unrolling function $U: \SM(X,Y) \to \CSeqT(X,Y)$ is surjective. Given $m = (S, s_0, r) \in \SM(X,Y)$,

$$
U(m)(x_1 \cdots x_t) = [y_1 \cdots y_t] \;\;\;\; \text{where} \;\;\;\; (s_i, y_i) = r(s_{i-1}, x_i)
$$

**Proof:** Use the tape state machine... $\blacksquare$

The following diagram summarizes the result

![File (10)](https://hackmd.io/_uploads/BJ3601kva.jpg)

Note how $P\circ L$ behaves similarly to an inverse of $U$, because $U \circ P \circ L = \text{id} : \CSeqT(X,Y) \to \CSeqT(X,Y)$. But the other direction is not necessarily true. For example, if we start with a classic RNN $(\R^d, 0, r)$ and we apply $P \circ L \circ U$, we end up with a tape state machine that is different from the RNN.

We can also combine it with the autoregressive correspondence result to get the diagram:

![File (9)](https://hackmd.io/_uploads/SJxRAy1D6.jpg)

### RNN State Machines

#### Tape SM: $O(t)$ memory and $O(td^2)$ compute

#### Efficient SM: $O(d)$ memory and $O(d^2)$ compute

### Self Attention State Machines

#### Tape SM: $O(t)$ memory and $O(t^2d)$ compute

#### KV cache SM: $O(td)$ memory and $O(td)$ compute

Funnily enough, this improvement...

#### Efficient SM: $O(d^2)$ memory and $O(d^2)$ compute

Does it exist for the transformer? Yes, under the assumption that...
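To make the KV cache row above concrete, here is an illustrative sketch of one tick of the corresponding state machine (assumed interface; it uses the unnormalized causal attention of section 2.3.2 rather than a full softmax transformer):

```python
import numpy as np

def kv_cache_step(state, qkv):
    """One tick of the KV-cache state machine for the (unnormalized) causal
    attention layer Y_i = sum_{j<=i} exp(K_j^T Q_i) V_j. The state stores all
    past keys and values, so memory is O(td) and each tick costs O(td)."""
    Ks, Vs = state
    q, k, v = qkv
    Ks = Ks + [k]
    Vs = Vs + [v]
    weights = np.array([np.exp(kj @ q) for kj in Ks])     # attend to every cached key
    y = sum(w * vj for w, vj in zip(weights, Vs))
    return (Ks, Vs), y

# Illustrative usage: ticking through a short sequence of (q, k, v) triples.
rng = np.random.default_rng(0)
d, t = 4, 5
state = ([], [])                                          # the origin: empty caches
for _ in range(t):
    state, y = kv_cache_step(state, rng.normal(size=(3, d)))
print(len(state[0]), y.shape)  # 5 (4,)
```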
## 3.5) Summary

Ok, we've seen a lot of changes of perspective. The following diagram summarizes the relationship between the objects we've been using, and what each perspective is useful for.

![File (1)](https://hackmd.io/_uploads/By-KTapIT.jpg)

Our objective now is to construct good architectures with both an efficient causal sequence transformation implementation and an efficient state machine implementation. We will see how causal transformers can be seen as examples of causal sequence transformations, and then we will present our star architecture, a small modification that we call linear self attention, which retains all the advantages of transformers but gives a big improvement on the causal sequence transformation implementation. But it's going to be easier to introduce the key ideas that allow for these optimizations in the simpler setting of sequence transformations.

Transformers are amazing for two general reasons:

* they generalize very well (better than RNNs?)
* they run very fast on our GPUs (way better than RNNs)

but suffer from some disadvantages:

* Cost $O(t^2)$ to train on datasets of length $t$, as opposed to $O(t)$ for RNNs
* Sampling a sequence of length $t$ has cost $O(t^2)$ vs $O(t)$ for RNNs

Our empirical results will be focused on an architecture we implemented that we call Self Attention RNN (SARNN) / Self Attention State Machine. We think it's a very successful architecture at combining the best of both worlds because it has the following properties:

* generalizes very well
* runs very fast on our GPUs
* Cost $O(t)$ to train on datasets of length $t$
* Sampling a sequence of length $t$ has cost $O(t)$

# 4) Linear Transformer

<!-- # 2) Linear Self Attention Transformer

$$\newcommand{\Aij}{ A_{[i,j]}}$$

It turns out that several sets of equations that look quite different are all mathematically equivalent:

1. $Y_{i} = S_i Q_i$ with $S_i = S_{i-1} + V_i K_i^T$ and $S_0 = 0 \in \R^{d\times d}$
2. $Y_i = \sum_{j=1}^i V_j A_{[i,j]}$ with $A_{[i,j]} = K_j^T Q_i$
3. $Y = VA$ where the attention matrix is defined as $A = M \circ (K^T Q)$
4. $Y_i = S_{i-k} Q_i + \sum_{j=i-k}^i V_j A_{[i,j]}$, a hybrid of 1 and 2, with $A_{[i,j]} = K_j^T Q_i$ and assuming $S_i = 0$ for all $i\le 0$.

Before showing that they are indeed equivalent, let's look at the computational properties of each.

1. Computing $Y$ costs $O(td^2)$. To see why this is the cost, let's see what it would take to compute the entire length $t$ sequence $Y_1, Y_2, \cdots Y_t$. Each step of the recurrent equation first requires $V_i K_i^T$ with cost $O(d^2)$, then a sum of two $d \times d$ matrices, also with cost $O(d^2)$, and finally a matrix vector product with the same cost. Since we need to do $t$ of those steps, the cost is indeed $O(td^2)$.
2. Computing $Y$ costs $O(t^2d)$. To implement 2 we would first compute $\Aij$, which requires doing an inner product with cost $O(d)$; then for each of the $i$ elements of the loop we would do $\Aij V_j$, which is a vector scalar multiplication with cost $O(d)$. Thus, computing $Y_i$ would be $O(id)$ and computing the entire sequence $Y$ would be $O(t^2d)$. For very large sequence lengths $t^2$ will dominate and it would seem like option 1 would be preferable; but option 2 might be faster in cases where the inner dimension of the model $d$ is much larger than the sequence length $t$, although that is pretty much never the case for real world applications.
3. Computing $Y$ costs $O(t^2d)$. This is because computing $A$ is just the cost of computing all $t^2$ combinations of $K_j^T Q_i$, which takes $O(t^2d)$. Then to apply the mask we need $O(t^2)$ and finally to do $VA$ costs $O(t^2d)$.

Clearly 2 and 3 are related: they are two different implementations of the same mathematical function with the same cost. But 3 has a very useful feature. Computing it requires only the invocation of 3 primitives on matrices: matrix multiplication, transpose and elementwise multiplication. These are all highly parallelizable and run very efficiently on GPUs.

In our initial experiments we were using $d=64$ and $t=1024$. The mathematically savvy reader will realize $1024$ is larger than $64$, and so $t^2= 1024^2 \simeq 1000000$ is much larger than $d^2 \simeq 4000$. Yet, compare the times it takes for 1 and 3 to compute $Y$ on an H100 GPU: SHOW DATA

Method 3 requires on the order of 250x more floating point operations than 1 and yet it completely obliterates it. This is how insanely powerful GPUs can be for parallelizable computations.
But GPUs only have so many cores and at some point operations start to be scheduled sequentially. Then, we are guaranteed that for a large enough $t$, the quadratic cost $O(t^2d)$ will dominate $O(td^2)$ and method 1 will be faster. This is indeed the case, but it is not a regime where people train models. SHOW EXPERIMENT: we keep increasing the size of the context until we see 1 win over 3. Perhaps at around 16 or 32k.

----------------

# 2) Linear Self Attention

Mathematically, SARNN is a tiny modification of a transformer self attention layer. But it's much better behaved computationally; mainly, it has $O(t)$ cost as opposed to $O(t^2)$. The basic idea that makes this possible is very simple, and can be found in the literature from years ago. But that paper went wrong in how it proposed computing this different type of architecture.

## As a Sequence to Sequence Model

SARNN vs LSA (Linear Self Attention)

First we do a simplification of the transformer attention layer:

$$
[f(Q,K,V)]_i = Y_i = \sum_{j=1}^i e^{K_j^T Q_i} V_j
$$

The normalization doesn't matter for the issues at hand. In section 2.X we will revisit the issue of normalization.

**Linear Self Attention**

$$
[h(K,V,Q)]_i = Y_i = \sum_{j=1}^i K_j^T Q_i V_j
$$

Simply removing the exponential produces all the mathematical advantages: better context scaling, better sampling.

**The attention vectors:** There is also a sequence of $t$ attention vectors $A_1, \cdots, A_t \in \R^t$, whose equation is just

$$
A_i[j] = 1_{j\le i} K_j^TQ_i
$$

We also have a mask $M_i$ s.t. $M_i[j] = 1_{j\le i}$, which we use in the equation that summarizes what the entire attention vector is:

$$A_i = \left(K^T Q_i \right) \circ M_i$$

**The output vector sequence:** Once we have the attention vectors $A_i$, we can compute $Y_i$, the output vector at timestep $i$:

$$
Y_i = \sum_{j=1}^i A_i[j] V_j
$$

or equivalently,

$$Y_i = VA_i$$

It would seem natural to implement this as the following sequential program:

```python!
import numpy as np

def linear_mask_vec(i, t):
    """Mask vector M_i with M_i[j] = 1 for j <= i and 0 otherwise."""
    m = np.zeros(t)
    m[: i + 1] = 1.0
    return m

def sequential_linear_self_attn(K, V, Q):
    t = K.shape[1]
    Y = []
    for i in range(t):
        T_i = K.T @ Q[:, i]                 # all inner products K_j^T Q_i
        A_i = T_i * linear_mask_vec(i, t)   # zero out the non-causal entries j > i
        Y.append(V @ A_i)
    return Y
```

## As a Recurrent Sequence Model

**Key result:** Consider the recurrent sequence model defined by the equations:

* $S_0=0 \in \R^{d\times d}$
* $S_i = S_{i-1} + V_i K_i^T$

Now that we have the sequence of states, we define the output sequence to be $S_i Q_i$ (the multiplication of a matrix and a vector). Then this recurrent model is equivalent to the linear self attention layer. In other words:

$$Y_i=S_i Q_i$$

**Proof.** Clearly

$$
S_i = \sum_{j=1}^i V_j K_j^T
$$

And so the output sequence $S_i Q_i$ is

\begin{align}
S_i Q_i &= \left( \sum_{j=1}^i V_j K_j^T \right) Q_i = \sum_{j=1}^i (V_j K_j^T) Q_i = \sum_{j=1}^i (K_j^T Q_i) V_j \\
&= Y_i
\end{align}

QED

**A better sequential program:** This recurrence result gives us a big improvement on the previous sequential implementation of the linear self attention.

```python!
import numpy as np

def step(s, k, v, q):
    sp = s + np.outer(v, k)   # S_i = S_{i-1} + V_i K_i^T
    y = sp @ q                # Y_i = S_i Q_i
    return sp, y

def self_atten_layer_recurrent(K, V, Q):
    d, t = K.shape
    s = np.zeros([d, d])
    Y = []
    for i in range(t):
        k, v, q = K[:, i], V[:, i], Q[:, i]
        s, Y_i = step(s, k, v, q)
        Y.append(Y_i)
    return Y
```

Even though this program is sequential (and thus not very well suited for GPUs) it is way better than the last one, mainly because each of the $t$ steps only uses the individual vectors $k,q,v$ as opposed to the entire sequences $K,Q,V$. Also, as we will see in a later section, the RNN formulation is much better behaved. -->
