---
title: "An Architecture For Neural Language Modeling"
format:
  html:
    toc: true
    # toc-location: left
    toc-title: " "
author:
  - Jacob Buckman
  - Carles Gelada
date: "2023-12-27"
categories: false
draft: false
---

$$
\newcommand{\R}{\mathbb{R}}
\newcommand{\Z}{\mathbb{Z}}
\newcommand{\N}{\mathbb{N}}
\newcommand{\sft}{\text{softmax}}
\newcommand{\List}{\text{List}}
\newcommand{\Seq}{\text{Seq}}
\newcommand{\SeqT}{\text{SeqT}}
\newcommand{\CSeqT}{\text{CSeqT}}
\newcommand{\Dist}{\text{Dist}}
\newcommand{\SM}{\text{SM}}
\newcommand{\id}{\text{id}}
\newcommand{\Fn}{\text{Fn}}
\newcommand{\Tok}{\text{Tok}}
\newcommand{\Aij}{ A_{[i,j]}}
$$

A crucial decision in neural language modeling is the choice of architecture, which governs the computational and statistical properties of the model. A well-selected architecture can improve performance-per-dollar by orders of magnitude. The objective of this work is to find an architecture with the following four properties:

* Good scaling law on model size.
* Good scaling law on context size.
* Efficient usage of GPUs during training.
* Efficient sampling.

At the time of writing, the most popular architecture for neural language modeling, the Transformer, has only three of these properties (it lacks efficient sampling). Many other architectures have been proposed, but none that meets all four criteria. We believe that such an architecture exists. This document is an expository writeup of the progress we have made towards the goal of discovering it.

This document contains our evolving understanding of the key concepts and results needed to understand neural language modeling, including mathematical, computational, and experimental results. Unlike a traditional research paper, this is a living document that will continually be updated as we make progress. Much of the content of this document will already be familiar to anyone who has thought deeply about language modeling. There is nothing particularly groundbreaking here. But we found a lot of value in explicitly formalizing the concepts, e.g. how to think of Transformers and RNNs as members of the same family, as well as in grounding all of these ideas in rigorous mathematics. Also, in order to properly understand our specific setting, we often find that it is helpful to describe more general objects of which neural language modeling is a specific case. Where applicable and insightful, we provide such results in their full generality; as such, this document contains many ideas that may be of independent interest to researchers in related areas.

It's not about minimizing floating point operations. It's about optimizing tokens per second on our hardware. GPUs perform well on tasks that are highly **parallelizable**. RNNs, the more "classical" architectures for sequences, are not very well suited for these types of tasks. An important feature of the Transformer architecture is that it can compute its outputs in parallel. We find that it's important to think about sequences, and maps between sequences, as the core objects.

<!-- During the course of this research we've converged to a framework that helps us think about what makes an architecture suitable for efficient language modeling. We refer to the key property that enables efficient training as **causal sequence transformations**. Surprisingly, efficient sampling turns out to be a consequence of a totally independent property, one that has to do with **state machines**.
This helps us to draw the connection between the classic KV caching optimization for transformer sampling and the recurrence property that makes RNNs so well suited for sampling. A funny side effect of understanding how to do things efficiently is that you also learn how to do them inefficiently. We will see that there is a very straightforward way to turn any generic architecture (like an MLP) into an autoregressive language model, albeit one that is very inefficient for both training and sampling. -->

# 1) Language Modeling

Let's begin by defining some basic concepts and definitions.

**Definition 1:** Given a set $X$, $t\in \N$ and elements $x_1, \cdots, x_t \in X$, we say that $[x_1, \cdots, x_t]$ is a length $t$ sequence with elements in $X$. We denote by $\Seq_t X$ the collection of all such sequences. In other words, $\Seq_t X$ is the cross product of $t$ copies of the set $X$.

$$
\Seq_t X = \underbrace{X\times \cdots \times X }_\text{t}
$$

This is a somewhat uncommon notation. To refer to the cross product of $t$ copies of $X$, it would be more standard to write $X^{\times t}$ instead of $\Seq_t X$. But in the context of language modeling we find it nicer to talk about sequences instead of cross products.

**Definition 2:** We denote by $\Seq X$ the set of all sequences of any finite length:

$$
\Seq X = \bigcup_{t\in \N} \Seq_t X
$$

::: {.column-margin}
Note that even when $X$ is a finite set, $\Seq X$ will contain infinitely many elements. But don't confuse that with the sequences in $\Seq X$ potentially having infinite length. Any sequence in $\Seq X$ will be a collection of $t$ elements of $X$ for some $t\in \N$. Infinite length sequences are of course an important mathematical object, but we are not concerned with them in this work.
:::

In this work, when we talk about **language** we refer to written language, represented as a sequence of tokens, which could be alphanumeric characters, words in a vocabulary, or some other tokenization scheme. Here $Z = \{\text{"a"},\text{"b"}, \cdots \}$ would be the set of alphanumeric characters, and an element of $\Seq Z$ is just a piece of text like $\text{"I like penguins"} \in \Seq Z$, but also $\text{"I like potatos"} \in \Seq Z$, the sequence $\text{"Adk23rhf"} \in \Seq Z$, and all the works of Shakespeare. $\Seq Z$ contains every religious holy book, the proof of the Riemann hypothesis, the details of what happened to Amelia Earhart. Every truth and every lie. The entire story of your life and of how you will die. Our ancestors communicate with us via elements of $\Seq Z$. And even after other forms of communication became technologically possible, we still largely prefer to use $\Seq Z$ to share work and ideas. $\Seq Z$ seems to have some sort of universality, where all the important questions and their answers can be naturally encoded as points in it. The reason text has been playing such an outsized role in the field of artificial intelligence has something to do with the combination of this universality and the fact that, for the last 3000 years, one of the major endeavours of humanity has been to gather a vast collection of interesting points in $\Seq Z$.

![387588502_1063711931451275_6779269320171301447_n.jpeg](https://hackmd.io/_uploads/HJLq0zp7a.jpg)

The first step one must take when doing language modeling is to go into the wild and collect a dataset of them $D \in \Seq_n \Seq_t Z$.

::: {.column-margin}
We could generalize the language modeling task to involve variable length sequences, but it's convenient to assume all sequences in the dataset $D \subset \Seq Z$ have length $t$. First of all, talking about $\Dist \Seq X$ requires significantly more complex mathematics: when talking about distributions over infinite sets one must consider carefully what it means for the probabilities to sum to 1. It involves considerations about the convergence of the sum. Another reason we prefer to think about $\Seq_t Z$ is that we will be very interested in how the computational cost of training models changes with the sequence length $t$. It would be very cumbersome to do that analysis when the dataset is composed of variable length sequences. And since any variable length dataset can be trivially converted into a fixed length one (by padding short sequences or cutting off long ones), we make this simplifying assumption.
:::
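As a purely illustrative sketch (the character-level tokenization, the padding symbol, and $t=16$ are assumptions made for this example, not choices made in this work), here is one way to turn raw text into such a fixed-length dataset:

```python
# A minimal illustration: building a tiny fixed-length, character-level dataset
# D ∈ Seq_n Seq_t Z. The alphabet, padding character, and length t are arbitrary.
def make_dataset(texts, t, pad="_"):
    """Pad short texts and cut off long ones so every sequence has length t."""
    dataset = []
    for text in texts:
        tokens = list(text)[:t]                 # cut off long sequences
        tokens += [pad] * (t - len(tokens))     # pad short sequences
        dataset.append(tokens)
    return dataset

D = make_dataset(["I like penguins", "Adk23rhf"], t=16)
print(D[0])  # ['I', ' ', 'l', 'i', 'k', 'e', ' ', 'p', 'e', 'n', 'g', 'u', 'i', 'n', 's', '_']
```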
Before we discuss the next step, let us properly define the concept of a distribution.

**Definition.** Given a finite set $X$, the space of distributions $\Dist X$ is the collection of functions $f: X \to \R$ where $f(x) \ge 0$ and $\sum_{x\in X} f(x) = 1$.

In language modeling, we aim to find a model $f \in \Dist \Seq_t Z$ that closely matches our natural-language-derived training data. We want our model to assign high probability to the sequences in our dataset. To turn that into a precise mathematical statement we need the concept of an independent distribution.

**Definition.** Given two finite sets $X,Y$ and $f_X\in \Dist X, \; f_Y \in \Dist Y$, the independent distribution $f_{X\times Y}$ is a distribution over the finite set $X\times Y$ defined as

$$
f_{X\times Y}(x,y)=f_X(x) f_Y(y)
$$

**Proof that it is well defined.** To see that $f_{X\times Y}$ is well defined we need to check that the outputs are non-negative and sum to 1. Non-negativity is immediate, since each output is a product of two non-negative numbers, and

$$
\sum_{(x,y)\in X\times Y} f_X(x) f_Y(y) = \left(\sum_{x\in X} f_X(x)\right)\left(\sum_{y\in Y} f_Y(y)\right) = 1 \cdot 1 = 1.
$$

$\blacksquare$

From this we can construct the basic learning objective. To know if our language model is any good, we ask the question: what is the probability that all the sequences in $D$ were sampled independently from $f$? From the definition of the independent distribution we know that probability is

$$
\prod_{(x_1 \cdots x_t)\in D} f(x_1 \cdots x_t)
$$

The goal of learning is to find an $f\in \Dist \Seq_t Z$ where this probability is high. In practice, it is more common to frame the problem as minimization rather than maximization, and useful to work in log-space (this doesn't change the optimum since log is monotonic), giving us our loss

$$l(D) = -\ln Pr(D | f) = \sum_{(x_1 \cdots x_t)\in D} -\ln f(x_1 \cdots x_t)$$

::: {.column-margin}
The primary reason to use the log-prob loss is that it enables minibatching, which is an important technique when training models on large datasets. The minibatch approximates the loss on the entire dataset by computing the loss on a subsample from the dataset. In order for this approximation to be unbiased, we require $l(D) = \mathbb{E}_{B \sim D}[l(B)]$, a property that can easily be seen to hold for the log-prob loss. A secondary reason to use the log-prob loss is numerical stability; working directly with probabilities when using floating-point numbers is challenging, since for relatively long sequences the cumulative product of probabilities becomes very small. Since the logarithm transforms products into sums, log-probabilities tend to stay within representable regions.
:::
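To make the loss concrete, here is a tiny illustrative computation (the distribution and the dataset are made up for this example):

```python
import math

# Illustrative only: a hand-picked distribution f over Seq_2 Z for Z = {"a", "b"}.
f = {
    ("a", "a"): 0.4,
    ("a", "b"): 0.3,
    ("b", "a"): 0.2,
    ("b", "b"): 0.1,
}
assert abs(sum(f.values()) - 1.0) < 1e-12   # f is a distribution

D = [("a", "a"), ("a", "b"), ("a", "a")]    # a toy dataset of length-2 sequences

# l(D) = sum over sequences in D of -ln f(x_1 ... x_t)
loss = sum(-math.log(f[x]) for x in D)
print(loss)  # ≈ 3.037
```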
### 1.1) Modeling with Deep Learning

A modeling task involves two problems. First you construct a model family / architecture / parametrized family, etc.; then you find a good point in it, usually with a variant of gradient descent.

**Definition.** Given sets $X$ and $Y$, a model architecture is a map $\phi: \R^p \to \Fn(X, Y)$. Given a parameter vector $\theta \in \R^p$, the architecture gives us a function $f = \phi(\theta)$.

**Deep learning** generally refers to cases where the architecture $\phi$ is constructed with the following principles in mind:

* **Compositional:** individual building blocks stacked on top of each other, creating a stacked computation, usually alternating between linear functions and some nonlinearity.
* **Differentiable:** We optimize the parameters $\theta$ with a variant of gradient descent. This means we will need to construct a loss function $l : \R^p \to \R$ over the parameters such that we can compute $\nabla_\theta l$ during the optimization stage. This means that the architecture $\phi$ must be differentiable, meaning that if the parameters change a little bit, so do the outputs of the function.
* **Scale:** There are some trends that reliably increase performance. One can reliably estimate the gains of training larger models on more data. When evaluating an architecture, the goal is to understand what performance would be attained at any model scale and data amount.
* **High throughput:** It's not about minimizing floating point operations. It's about optimizing tokens per second on our hardware.
* **Parallelizable:** GPUs and other specialized deep learning hardware perform well on tasks that are highly parallelizable. It's another essential characteristic of deep learning.

In this work, we will generally suppress $\theta$ for notational cleanliness, writing models as $f$ instead of $f_\theta$ unless necessary.

### 1.2) Autoregressive Modeling

Ok, so we will be looking for a distribution $f \in \Dist \Seq_t X$ assigning high log-probability to the data. But how do we go about constructing (i.e. implementing in a computer) such a distribution? If the number of tokens is $c$, the space of all documents of length $t$ has $c^t$ elements. *WARNING: cliche incoming.* Even for short sentences of length $t=100$, $c^t$ will be larger than the number of atoms in the universe, and so impossible to represent explicitly.

The most common approach to tackle this problem is via **autoregressive modeling**, which emerges out of two key realizations:

* It's very easy to computationally represent a model $g: \Seq X\to \Dist X$ that maps sequences to distributions over tokens.
* For any fixed $t\in \N$, there is a natural way to transform any $g: \Seq X\to \Dist X$ into an $f\in \Dist \Seq_t X$, and vice versa. We call this transformation the *autoregressive map*.

**Definition 3:** For a given $t\in \N$ and any $g: \Seq X \to \Dist X$ and $x_1 \cdots x_t \in \Seq_t X$, the autoregressive map $A_t$ is defined as

$$
A_t(g)(x_1 \cdots x_t) = \prod_{i=1}^t g(x_1 \cdots x_{i-1})(x_i)
$$
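As a concrete illustration of Definition 3 (not part of the original text: the particular $g$ below is made up, represented as a Python function from a prefix tuple to a dictionary of token probabilities):

```python
def g(prefix):
    """A toy next-token model over Z = {"a", "b"}: it slightly prefers to repeat
    the previous token. Purely illustrative; any map Seq Z -> Dist Z would do."""
    if not prefix:
        return {"a": 0.5, "b": 0.5}
    last = prefix[-1]
    return {last: 0.8, ("b" if last == "a" else "a"): 0.2}

def autoregressive_prob(g, seq):
    """A_t(g)(x_1 ... x_t) = prod_i g(x_1 ... x_{i-1})(x_i)."""
    prob = 1.0
    for i, token in enumerate(seq):
        prob *= g(seq[:i])[token]
    return prob

print(autoregressive_prob(g, ("a", "a", "b")))  # 0.5 * 0.8 * 0.2 = 0.08
```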
Out of this definition, it's obvious that $A_t(g)$ is a function in $\Fn(\Seq_t X, \R)$. But it's not clear that it also is a distribution in $\Dist \Seq_t X$. It's also not clear that for all $f\in \Dist \Seq_t X$ there exists a $g\in \Fn(\Seq X, \Dist X)$ such that $f=A_t(g)$. The result below addresses these concerns.

**Result 1:** The map $A_t: \Fn(\Seq X, \Dist X) \to \Dist \Seq_t X$ is well defined and is surjective.

**Proof.** To see that $A_t(g)\in \Dist \Seq_t X$ we need to show that the outputs of $A_t(g)$ are non-negative and sum to $1$. Seeing that $A_t(g)(x_1 \cdots x_t) \ge 0$ is trivial, since each output is a product of $t$ terms of the form $g(x_1 \cdots x_{i-1})(x_i)$, which are all non-negative due to the assumption that $g: \Seq X \to \Dist X$. Now, showing that

$$
\sum_{x_1 \cdots x_t \in \Seq_t X} A_t(g)(x_1 \cdots x_t) = 1
$$

is a little trickier. We are going to proceed via induction. Take $t=1$ as the base case:

$$
\sum_{x_1 \in \Seq_1 X} A_1(g)(x_1) = \sum_{x \in X} g()(x) = 1
$$

Now assume the result holds for sequences of length $t-1$. Then

\begin{align}
\sum_{x_1 \cdots x_t \in \Seq_t X} A_t(g)(x_1 \cdots x_t) &= \sum_{x_1 \cdots x_t \in \Seq_t X} \prod_{i=1}^t g(x_1 \cdots x_{i-1})(x_i) \\
&= \sum_{x_1 \cdots x_{t-1} \in \Seq_{t-1} X} \sum_{x_t \in X} \left( \prod_{i=1}^{t-1} g(x_1 \cdots x_{i-1})(x_i) \right) g(x_1 \cdots x_{t-1})(x_t) \\
&= \sum_{x_1 \cdots x_{t-1} \in \Seq_{t-1} X} \left( \prod_{i=1}^{t-1} g(x_1 \cdots x_{i-1})(x_i) \right) \sum_{x_t \in X} g(x_1 \cdots x_{t-1})(x_t) \\
&= \sum_{x_1 \cdots x_{t-1} \in \Seq_{t-1} X} \prod_{i=1}^{t-1} g(x_1 \cdots x_{i-1})(x_i) \\
&= \sum_{x_1 \cdots x_{t-1} \in \Seq_{t-1} X} A_{t-1}(g)(x_1 \cdots x_{t-1}) \\
&= 1
\end{align}

We have shown that $A_t(g) \in \Dist\Seq_t X$. Now we just need to show the map is surjective. TODO $\blacksquare$

Once we are looking for a function $g: \Seq X \to \Dist X$ we are completely within the domain of deep learning. This result is at the core of what allows deep learning to tackle sequence modeling. With this established, let's now explore in more detail how to construct a neural network $g: \Seq X \to \Dist X$.

ASIDE 4: There are also radically different approaches to using NNs to model distributions over sequences that are non-autoregressive and thus do not rely on this result; but autoregressive modeling is what has been behind everything impressive so far.

TODO: perhaps we could explain here how to turn an MLP into an autoregressive language model

### 1.3) Autoregressive Sampling

There are two main things we want to do with our language model $f\in \Dist \Seq_t Z$. For training we want to evaluate the probability $f(x_1\cdots x_t)$ of specific sequences in the dataset. But once the model is trained, presumably we will want to use it to sample sequences from it.

**The dumbest way to sample:** Given a finite set $X$ with $n$ elements, there is a very simple way to sample from any distribution $p \in \Dist X$. You first have to pick an ordering of the elements $x_1 \cdots x_n$. You then sample a random real number $r \sim \text{Uniform}(0, 1)$. Then you start going through the elements in order and accumulate the total probability seen

$$
a_i = \sum_{j=1}^i p(x_j)
$$

Once the accumulated probability $a_i > r$, you return the element $x_i$ as the sample. When we are sampling sequences, the total number of elements in $\Seq_t Z$ is $c^t$. This means we would need to evaluate the probability of an enormous number of sequences to get a single sample (yes, there are a lot of optimizations we could do on top of this basic algorithm, but it is so astronomically inefficient that it's really not worth trying to find improvements).

**The autoregressive way:** Luckily for us, the fact that we get our distribution $f$ out of $g\in \Fn(\Seq X, \Dist X)$ via $f=A_t(g)$ opens up a completely different approach to sampling. Think about the following procedure. Start with the empty sequence and sample $x_1 \sim g()$, then sample $x_2 \sim g(x_1)$ followed by $x_3\sim g(x_1, x_2)$ and so on.
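A minimal sketch of this procedure (illustrative; $g$ is again represented as a function from a prefix tuple to a dictionary of token probabilities):

```python
import random

def sample_token(dist):
    """Sample one token from a distribution given as {token: probability}:
    accumulate probability in some fixed order until it exceeds r."""
    r = random.random()
    a = 0.0
    for token, p in dist.items():
        a += p
        if r < a:
            return token
    return token  # guard against floating-point round-off

def autoregressive_sample(g, t):
    """Sample x_1 ~ g(), x_2 ~ g(x_1), ..., x_t ~ g(x_1 ... x_{t-1})."""
    seq = ()
    for _ in range(t):
        seq = seq + (sample_token(g(seq)),)
    return seq

# A trivial g for demonstration: it ignores the prefix entirely.
print(autoregressive_sample(lambda prefix: {"a": 0.9, "b": 0.1}, t=5))
```

Note that `sample_token` is just the "dumbest" sampler above applied to a single token distribution of size $c$ rather than to all $c^t$ sequences; the whole loop therefore needs only $t$ evaluations of $g$.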
**Result:** When recursively sampling from $g$, the probability of sampling the sequence $z_1 \cdots z_t \in \Seq_t Z$ is precisely $A_t(g)(z_1 \cdots z_t )$.

**Proof:** Fix a sequence $z_1 \cdots z_t \in \Seq_t Z$. Clearly, the probability of first sampling $z_1$ is $g()(z_1)$. Then, given that we've sampled $z_1$, the probability of sampling $z_2$ is $g(z_1)(z_2)$. Continuing the pattern we get that the probability of sampling the full sequence $z_1 \cdots z_t$ is

$$
\prod_{i=1}^t g(z_1 \cdots z_{i-1})(z_i) = A_t(g)(z_1 \cdots z_t)
$$

$\blacksquare$

Talk about gains. We've gone from needing $O(|Z|^t)$ evaluations of the function $g$ to only needing $t$! That is the other major advantage of using $\Fn(\Seq Z, \Dist Z)$ to construct our language model. Not only does $A_t$ enable us to construct distributions over sequences in a way that works well with neural networks, it does so in a way that is cheap to sample from.

# 2) Deep Learning with Sequences

Ok, so to train an autoregressive LM we need an architecture $\Seq Z \to \Dist Z$. To think clearly about all the ways we can go about constructing such architectures it is useful to take a step back and briefly consider a more general class of deep learning problems: tasks that involve sequences as inputs or outputs. This more generic problem is of independent interest, because sequence spaces turn out to be common in modeling many real world phenomena, not just language. Some examples would be:

* **Point clouds.** The set $X = \R^3$, 3 coordinates representing a position in space. $\Seq X$ would be the set of all possible point clouds.
* **Molecular representations.** Here an element of $X$ could be the description of an atom: a list of numbers representing its position (like for 3D point clouds) but also other information like momentum, charge, or anything else physically relevant. $\Seq X$ would represent the configuration of a physical world with many such atoms.
* **Amino acid sequences.** $X$ is the set of amino acids. The description of an amino acid might or might not have a spatial location.
* **Images.** Images can be thought of as a sequence of pixels. An image of RGB pixels with $h$ vertical pixels and $w$ horizontal ones would correspond to an element in $\Seq \R^3$ of length $h\times w$. What about elements of $\Seq \R^3$ whose length is not a multiple of $w$? They correspond to non-rectangular images... you can see that $\Seq \R^3$ is not a super natural way to represent the space of images. And indeed, the architectures that "think" about images in a different way tend to be more successful (yes, I'm talking about convolutional networks). But still, $\Seq \R^3$ is a perfectly legitimate way to think of images.

These sequence spaces are used in tasks like the following:

* **Molecular modeling.** Given the description of the physical state, predict the evolution of the system.
* **Protein folding.** Here $X$ is a representation of an amino acid and the outputs are a sequence of 3D points giving the position in space of the corresponding amino acid.
* **Point cloud classification.** For a finite set of categories $C$ we are looking for a function $\Seq \R^3 \to \Dist C$.
* **Image coloring.** Given a grayscale image, assign an RGB value to each pixel.

In these tasks we've seen signatures of the type

* $\Seq X \to Y$
* $\Seq X \to \Seq Y$

And we would like to have a good sense of how to construct architectures in all of these cases.
Even though this seems like a very generic problem, it turns out that there is a much more constrained type of architecture that plays an outsized role in the space of deep learning architectures. And even when our task demands a more generic architecture, we can reduce the problem to using architectures of this constrained type under the hood.

## 2.1) Sequence Transformations

**Definition.** We say $h: \Seq X \to \Seq Y$ is a **sequence transformation** if $h$ preserves the length of the sequence, meaning that if the input sequence has length $t$, so does the output sequence. We refer to $\SeqT(X, Y)\subset \Fn(\Seq X, \Seq Y)$ as the subset of all sequence transformations.

In deep learning we care about composing multiple layers into more complex architectures. Sequence transformations are very well behaved in this way, as the composition of two sequence transformations is a sequence transformation itself.[^1] This is a completely obvious fact, but since it is quite an important one it's worth stating it as a result.

[^1]: Sequence transformations form a category.

**Result:** If $h\in \SeqT(X, Y)$ and $f\in \SeqT(Y, Z)$ then $f\circ h \in \SeqT(X,Z)$.

In the next section we'll see how RNNs and Transformers are examples of sequence transformations. So are elementwise functions on the sequence elements.

Just from a quick glance at the list above, it's quite obvious that many learning problems naturally involve sequence transformations. For tasks like **protein folding** and **image coloring**, each entry of the output sequence is trying to predict missing information about the same entry of the input sequence. Thus, for this type of task we need the output sequence to have the same length as the input one, and so our model architectures must be sequence transformations.

But of course that isn't always the case. In particular, our central objective of autoregressive language modeling requires architectures with a different signature, $\Seq X \to Y$. What is surprising is that even for tasks like these, sequence transformations turn out to be an extremely useful building block under the hood. To see why, take a moment to think about how you might construct an architecture with the desired input and output spaces. On one side you have the space $\Seq X$, where a point might contain an unboundedly large amount of information. On the other, you have $Y$, which often is a finite dimensional vector space. $Y$ is a much more "bounded" space than $\Seq X$. When designing your architecture you are forced to decide where to place the projection from the variable sized space to the fixed sized one. Given that neural network architectures are always constructed as the composition of many individual layers, there are two very natural choices of where to place the projection: at the beginning, or at the end. We call these choices early drop and late drop respectively. Early drop architectures would look something like

$$
\Seq X \to \R^d \to \cdots \to \R^d \to Y
$$

::: {.column-margin}
$d$ represents the width of each layer of the neural network. The arrows represent the individual layers of the neural network.
:::

and the late drop type would be constructed out of many sequence transformation layers, except for the last one:

$$
\Seq X \to \Seq \R^d \to \cdots \to \Seq \R^d \to Y
$$

One can definitely construct neural network architectures in the two ways, and we are about to see some examples of both; a minimal sketch of the two shapes follows below.
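The following toy sketch is purely illustrative (the embedding, the mean-pooling, and the elementwise layers are arbitrary assumptions, not architectures used in this work); its only purpose is to show where the projection from the variable sized space to the fixed sized one sits in each case.

```python
import numpy as np

d = 8                                               # layer width
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(d, d)) for _ in range(3)]    # weights for 3 illustrative layers

def embed(x):
    """Map a single token (here: a character) to a vector in R^d."""
    v = np.zeros(d)
    v[hash(x) % d] = 1.0
    return v

def early_drop(xs):
    # Seq X -> R^d: immediately collapse the whole sequence into one vector ...
    h = np.mean([embed(x) for x in xs], axis=0)
    # ... then R^d -> R^d layers, ending in Y = R^d.
    for W in Ws:
        h = np.tanh(W @ h)
    return h

def late_drop(xs):
    # Seq X -> Seq R^d: keep a full sequence of vectors ...
    hs = [embed(x) for x in xs]
    # ... pass it through sequence transformations (here: the same map applied elementwise) ...
    for W in Ws:
        hs = [np.tanh(W @ h) for h in hs]
    # ... and only drop down to Y = R^d at the very end.
    return hs[-1]

print(early_drop("hat").shape, late_drop("hat").shape)  # (8,) (8,)
```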
But it does seem to be the case that all the architectures we actually use in practice (e.g. transformers and RNNs) are of the late drop type. We could speculate about why:

* Using sequences internally is a good inductive bias for tasks that are naturally expressed as sequences.
* Adaptive computation: more complex tasks have more internal state available to solve them.

But honestly, that just leads to a lot of talk with little substance. What is certainly true is that the second type is the one with widespread use; it seems to be the deep learning way of doing things. Moreover, in section 3 we will see how thinking about architectures based on sequence transformations enables one massive optimization to train autoregressive language models.

TODO: Even when your architecture can be thought of in a different way, the sequence transformation perspective is what allows you to implement it efficiently on hardware. Unleash all your cores to compute different parts of the output all at the same time. Unconstrain yourself.

## 2.3) Example Architectures

TODO: make the point that sequence transforms give us a unified perspective to think about RNNs and Transformers

### 2.3.1) Recurrent Neural Networks

An RNN has the signature

$$
r(s_{t-1}, x_t) = (s_t, y_t)
$$

For example, a very basic RNN layer might be something like

$$
s_{i} = \sigma(W s_{i-1}) + x_i \;\;\;\;\;\;\;\;\;\; y_i = U s_i
$$

where $x_i, s_i, y_i \in \R^d$ are the inputs, states and outputs respectively, $W, U \in \R^{d\times d}$ are weight matrices, and $\sigma$ is a nonlinearity like a sigmoid. Of course, there is an infinitude of variations. The specific step equations matter much less than the general pattern of a layer with the signature of $r$.

As written, it's not clear that we have a sequence transformation in our hands. But once we pick an initial state $s_{0}\in \R^d$, any function with such a recurrent formulation can be *unrolled*: given the input sequence $x_1 \cdots x_t$ we can unroll the recurrent step and compute the sequence $y_1 \cdots y_t$. Thus we have a sequence transformation in our hands; see the sketch at the end of this section.

In deep learning we will want to stack many such layers $r$. For example, let's consider stacking $n$ layers $r$ to construct a deep RNN. The stack of layers will have the same signature as our basic RNN; the only difference is that the state will consist of $n$ vectors in $\R^d$. The following diagram represents the computation of a deep RNN with 2 layers and $t=3$. The blue boxes represent each step of unrolling the RNN through time.

![File (11)](https://hackmd.io/_uploads/rJjhy0Mda.jpg)

But people don't tend to implement deep RNNs this way! Anyone who has seen an implementation of an RNN will recognize that the actual computation we implement is instead:

![File (12)](https://hackmd.io/_uploads/r1HhkCfup.jpg)

It's like we try to avoid the recurrent formulation of a deep RNN and instead prefer to see it as a stack of sequence transformations. But why? Of course they are mathematically equivalent, but there is a difference in practical performance on hardware: cache hits EXPAND FURTHER. We start to see how thinking about sequence transformations seems to have some computational advantages.

To create the function we need for our autoregressive LM we could take a stack of RNN layers and on the last layer throw out all the outputs except for $y_t$. That gives us a function with signature $\Seq X\to Y$.
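As a small illustrative sketch of unrolling (the specific step equation mirrors the toy layer above; the shapes and weights are arbitrary):

```python
import numpy as np

def rnn_step(W, U, s_prev, x):
    """One tick of the toy RNN layer: s_i = sigmoid(W s_{i-1}) + x_i, y_i = U s_i."""
    s = 1.0 / (1.0 + np.exp(-(W @ s_prev))) + x
    y = U @ s
    return s, y

def unroll(W, U, xs):
    """Turn the recurrent step into a sequence transformation Seq R^d -> Seq R^d."""
    d = W.shape[0]
    s = np.zeros(d)          # initial state s_0
    ys = []
    for x in xs:             # inherently sequential: step i needs s_{i-1}
        s, y = rnn_step(W, U, s, x)
        ys.append(y)
    return ys                # same length as the input sequence

rng = np.random.default_rng(0)
d, t = 4, 3
W, U = rng.normal(size=(d, d)), rng.normal(size=(d, d))
xs = [rng.normal(size=d) for _ in range(t)]
print(len(unroll(W, U, xs)))  # 3
```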
### 2.3.2) Transformers

A transformer layer takes in a sequence of inputs $X_1 \cdots X_t \in \R^d$ and uses weight matrices $W_Q, W_K, W_V \in \R^{d\times d}$ to compute $Q_i = W_Q X_i$, $K_i = W_K X_i$ and $V_i = W_V X_i$. The outputs of the layer are

$$
Y_i = \sum_{j=1}^t e^{K_j^T Q_i} V_j
$$

And the causal transformer layer is

$$
Y_i = \sum_{j=1}^i e^{K_j^T Q_i} V_j
$$

They are both clearly sequence transformations, since we get $t$ outputs $Y_i$. One massive advantage of transformers over alternative architectures like the RNNs we've just seen is that they are highly parallelizable. Our GPUs with many cores can split the work of computing all the $Y_i$ in parallel. You don't need to finish computing $Y_1$ before you can move on to $Y_2$. The potential for this parallelism is something quite natural when one thinks in terms of sequence transformations $h: \Seq X \to \Seq Y$. There is nothing inherent that says that one must compute the output sequence one step at a time.

To create a practical architecture for our autoregressive language modeling we would stack a series of transformer layers, normally with some MLPs in between. Just like for RNNs, we would throw out all the outputs of the last layer except for $Y_t$. That gives us a function with the desired signature $\Seq X\to Y$.

*Note how it really doesn't matter if we use causal or non-causal transformers for this task. We can get an autoregressive LM out of each. Yes, there is a very good reason to use causal transformers, but to see why we will need to think carefully about training costs. Something that we will do in section 3.*
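A minimal sketch of the causal layer above (illustrative; it implements the unnormalized equation as written, with no softmax or scaling):

```python
import numpy as np

def causal_attention(Q, K, V):
    """Y_i = sum_{j<=i} exp(K_j^T Q_i) V_j, with tokens stored as columns of d x t matrices."""
    d, t = Q.shape
    Y = np.zeros((d, t))
    for i in range(t):                      # each output can be computed independently (parallelizable)
        for j in range(i + 1):              # causal: only attend to positions j <= i
            Y[:, i] += np.exp(K[:, j] @ Q[:, i]) * V[:, j]
    return Y

rng = np.random.default_rng(0)
d, t = 4, 6
X = rng.normal(size=(d, t))
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))
Y = causal_attention(W_Q @ X, W_K @ X, W_V @ X)
print(Y.shape)  # (4, 6)
```

The outer loop over $i$ has no dependencies between iterations, which is exactly the parallelism described in the text.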
### 2.3.3) k-gram MLP

We've seen a couple of common examples of "late drop" architectures. Let's give one of the "early drop" type. As stated, this is not a common thing to do, and the experienced deep learning practitioner will recognize it's kind of funky. Let's do it nonetheless for the sake of completeness.

Inspired by the classic $k$-gram models: if the input sequence is in $\Seq \R^d$, we could "stack" the $k$ last inputs and apply an MLP to the resulting vector in $\R^{kd}$. The last layer of the MLP would map into $Y$.

Independent of how well this architecture learns, it turns out that it is very poorly behaved computationally compared to RNNs and causal transformers. In section 3 we will see how this architecture is poorly suited for the purposes of autoregressive language modeling in the same way that non-causal transformers are.

# 3) Efficient Autoregressive Modeling

TODO: come up with a better name for the section

Combining what we have seen in the last two sections, we have everything we need to implement a neural network that can be used for autoregressive language modeling. But as we will now see, there is a lot left on the plate. We will now study the costs of training and sampling and see how, by introducing a new type of architecture, we can get a lot more value out of our GPUs.

## 3.1) Cost of Training

Now, let's return to considering autoregressive language modeling with a model $g : \Seq X \to \Dist X$. How many times do we have to call the underlying function $g$ to evaluate the loss $l$? When we evaluate the loss we have to compute

$$ l = \sum_{x\in D} \sum_{i=1}^t -\ln[g(x_1 \cdots x_{i-1})(x_i)]$$

For example, suppose our dataset consists of a single sentence: $\text{"I like penguins"}$. To evaluate the loss we need to evaluate our model $g$ on the empty sequence $\text{""}$ and the first character $\text{"I"}$ and $\text{"I "}$ and $\text{"I l"}$, $\text{"I lik"}$, etc.

$$
g(\text{""}) \\
g(\text{"I"}) \\
g(\text{"I "}) \\
g(\text{"I l"}) \\
g(\text{"I li"}) \\
\vdots
$$

![385505895_353915030504501_2594593902089499332_n.jpeg](https://hackmd.io/_uploads/BySruQpX6.jpg)

So, for every sequence $x_1 \cdots x_t \in D$, to compute the loss we need to call our model on $t$ different inputs.

**The motivation to find a speed-up.** We have been considering the possibility of using an architecture of the type

$$
g:\Seq X \to \Seq \R^d \to \cdots \to \Seq \R^d \to Y
$$

but it might be dumb that all the steps of the computation $g(x_1\cdots x_t)$ involve sequences of length $t$ and at the end we project down and output a single vector. Perhaps we could use a different architecture that is a sequence transformation

$$
h: \Seq X \to \Seq \R^d \to \cdots \to \Seq \R^d \to \Seq Y
$$

The dream would be that $h(\text{"I like penguins"}) = [g(\text{"I"}), g(\text{"I "}), g(\text{"I l"}), g(\text{"I li"}), \cdots]$, so the output sequence contains all the values we need in a single call to our architecture.

## 3.2) Causal Sequence Transformations

The fundamental principle of causality is that *the past does not depend on the future*. The way to formalize this intuition is to say that evaluating a causal sequence transformation on a subsequence produces an output sequence that is also a subsequence.

**Definition.** We say $h\in \SeqT(X,Y)$ is a **causal sequence transformation** if, given a sequence $x_1 \cdots x_t \in \Seq_t X$ with output sequence $y_1 \cdots y_t = h(x_1 \cdots x_t)$, for any $i\le t$, evaluating $h$ on the subsequence $x_1 \cdots x_i \in \Seq_i X$ yields outputs $y'_1 \cdots y'_i = h(x_1 \cdots x_i)$ with the property that $y_j = y'_j$ for all $j\le i$. We refer to $\CSeqT(X, Y) \subset \SeqT(X, Y)$ as the subset of all causal sequence transformations.

The reader is encouraged to check that all functions below are causal sequence transformations:

* RNNs as defined in EQX
* Causal transformers as defined in EQY
* Elementwise functions $h(x_1 \cdots x_t) = \sigma(x_1) \cdots \sigma(x_t)$

Just like sequence transformations, causal sequence transformations form a category, meaning that they are closed under composition. This means we can take layers that we know are causal sequence transformations and combine them as building blocks into more complex architectures.

**Result:** If $h\in \CSeqT(X, Y)$ and $f\in \CSeqT(Y, Z)$ then $f\circ h \in \CSeqT(X,Z)$.

**Proof.** Bruh, why you expanding this proof? Don't be lazy and do it yourself. It's real easy.

For example, a full transformer architecture usually would be constructed by composing many transformer blocks, where each block consists of a transformer layer as in EQY followed by an MLP applied elementwise to all outputs of the transformer layer. Thanks to the previous result we know that the full transformer architecture will be causal just by confirming that elementwise functions and EQY are causal sequence transformations.

Alright, now that we understand causal sequence transformations and we have a vast space of such architectures at our disposal, let's see why they are so useful for the problem we were trying to solve.

**Definition:** The "take last" function $L: \CSeqT(X, Y) \to \Fn(\Seq X, Y)$ is defined the following way: given an $h\in \CSeqT(X,Y)$, if $x_1 \cdots x_t \in \Seq_t X$ then $L(h)(x_1 \cdots x_t) = [h(x_1 \cdots x_t)]_t$. In other words, if $y_1 \cdots y_t = h(x_1 \cdots x_t)$ then $L(h)(x_1 \cdots x_t) = y_t$.
**Result:** $L$ is a 1-1 mapping whose inverse "concatifies" a function $f \in \Fn(\Seq X, Y)$ by evaluating it on $x_1$, then $x_1, x_2$, etc. Concretely:

$$
L^{-1}(f)(x_1 \cdots x_t) = \left(f(x_1), f(x_1, x_2), \cdots, f(x_1 \cdots x_t) \right)
$$

**Proof.** First note that, given an $f\in \Fn(\Seq X, Y)$, if $h$ is the concatification of $f$, meaning that $h(x_1 \cdots x_t) = \left(f(x_1), f(x_1, x_2), \cdots, f(x_1 \cdots x_t) \right)$, then clearly $L(h)(x_1 \cdots x_t) = f(x_1 \cdots x_t)$. We also need to check the other direction. Given an $h\in \CSeqT(X, Y)$:

\begin{align}
h(x_1 \cdots x_t) &= \left([h(x_1 \cdots x_t)]_1, [h(x_1 \cdots x_t)]_2, \cdots, [h(x_1 \cdots x_t)]_t \right) \\
&= \left([h(x_1)]_1, [h(x_1, x_2)]_2, \cdots, [h(x_1 \cdots x_t)]_t \right) & \text{using the fact that $h$ is causal} \\
&= \left(L(h)(x_1), L(h)(x_1, x_2), \cdots, L(h)(x_1 \cdots x_t) \right) & \text{using the definition of $L$}
\end{align}

so if we concatify $L(h)$ we get $h$ back. $\blacksquare$

In summary,

![File (5)](https://hackmd.io/_uploads/r1f_JCaIa.jpg)

We've established a 1-1 mapping between the two function spaces. But computationally there is a very big difference between applying $L$ and applying $L^{-1}$. One is expensive, the other is cheap. The color of the arrows represents that.

We get another interesting diagram when we combine this result with the previous one (where horizontal lines indicate the function is invertible and vertical ones indicate the function is surjective):

![File (6)](https://hackmd.io/_uploads/B1hwM0TLp.jpg)

When we are trying to construct our distribution over length $t$ sequences we lose no expressivity if we start with an $h\in \CSeqT(X, \Dist X)$ and we apply $A_t\circ L: \CSeqT(X, \Dist X) \to \Dist \Seq_t X$. For any $f\in \Dist \Seq_t X$ there will be an $h\in \CSeqT(X, \Dist X)$ s.t. $f = A_t \circ L (h)$. So mathematically it really doesn't matter if our base architecture is $g\in \Fn(\Seq X, \Dist X)$ or $h\in \CSeqT(X, \Dist X)$, but computationally it makes a massive difference! Suppose $g = L (h)$. Then, to evaluate the loss on a single sequence $\text{"hat"}$ we required 3 invocations of the function $g$:

$$
g(\text{"h"}), \; g(\text{"ha"}), \; g(\text{"hat"})
$$

But when we evaluate $h(\text{"hat"})$, since $h = L^{-1} \circ L (h) = L^{-1}(g)$, then

$$
h(\text{"hat"}) = L^{-1}(g)(\text{"hat"}) = \left (g(\text{"h"}), \; g(\text{"ha"}), \; g(\text{"hat"}) \right)
$$

For a sequence of length $t$, just by evaluating $h$ a single time we get the $t$ evaluations of $g$ that we need for the loss. Using a causal sequence transformation is an incredible bargain. *You compute 1 and get $t-1$ for free!*
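A tiny illustrative sketch of the two directions of this correspondence (the helper names `take_last` and `concatify` are made up, and the toy causal transformation is just cumulative sums):

```python
def take_last(h):
    """L: turn a causal sequence transformation into a Seq X -> Y function."""
    return lambda xs: h(xs)[-1]

def concatify(g):
    """L^{-1}: rebuild the causal sequence transformation from g, the expensive way.
    It calls g once per prefix, which is exactly the t-fold cost we want to avoid."""
    return lambda xs: [g(xs[: i + 1]) for i in range(len(xs))]

# A toy causal sequence transformation: cumulative sums (output i depends only on x_1..x_i).
h = lambda xs: [sum(xs[: i + 1]) for i in range(len(xs))]
g = take_last(h)

xs = [1, 2, 3]
assert concatify(g)(xs) == h(xs) == [1, 3, 6]   # same values: t calls to g vs. one call to h
```

The point is the last line: calling $g$ on every prefix reproduces $h$, but $h$ computes all $t$ values in one pass.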
<!-- ### The Causal Partial Ordering

*NOTE: this section should be folded by default*

There is an alternative approach to reach the same definition of causal sequence transformation. We won't really use it anywhere else in the doc, but we believe it's a particularly beautiful angle, so we discuss it anyway. Feel free to skip this section.

There is a natural partial ordering of $\Seq X$. A sequence $x_1 \cdots \in \Seq X$ is a continuation of $y_1 \cdots \in \Seq X$ iff $len(x)\ge len(y) = t$ and $x_1=y_1, \;\cdots, \; x_t =y_t$.

A sequence transformation $f:\Seq X \to \Seq Y$ is causal if it preserves this ordering relation, meaning that if we have two inputs $x, x' \in \Seq X$ with $x \le x'$ ($x'$ is a continuation of $x$) then $f(x')$ will also be a continuation of $f(x)$. In other words, if $x\le x'$ then $f(x)\le f(x')$. Any function that preserves this ordering must respect that **the past does not depend on the future**. If $y_1 \cdots y_t = f(x_1 \cdots x_t)$, the first output $y_1$ depends only on $x_1$, meaning that for two completely different input sequences sharing the same initial token, $x_1 x_2 \cdots x_t$ and $x_1 x'_2 \cdots x'_t$, the two output sequences would have the same $y_1$.

----

**Result:** A sequence transformation $f: \Seq X \to \Seq Y$ is causal iff it preserves the causal ordering.

**Proof:** -->

## 3.3) Efficient Sampling

**Can we do even better?** Let's consider what it would look like to follow the autoregressive sampling procedure for the RNN of EQX. First we call $g(x_1)$, which requires

$$
s_1, y_1 = r(s_0, x_1) \;\;\;\;\; z_1 \sim y_1 \in \Dist Z
$$

Then we want to call $g(x_1, x_2)$, which requires

$$
s_1, y_1 = r(s_0, x_1) \;\;\;\;\; s_2, y_2 = r(s_1, x_2) \;\;\;\;\; z_2 \sim y_2 \in \Dist Z
$$

And we can already see the problem. The second invocation of $g$ internally recomputes the term $s_1$, which was already computed in the first call. This is a consequence of the fact that the way we are calling $g$ is very structured, and often there is potential for caching values and reusing computation between the individual calls. In the case of an RNN, the way you actually want to sample is to first compute $s_1, y_1 = r(s_0, x_1)$ and sample $z_1 \sim y_1$. Then reuse $s_1$ and compute $s_2, y_2 = r(s_1, x_2)$ and sample $z_2 \sim y_2$, and so on. This is a big optimization because we are caching computation into the state.

## 3.4) State Machines

To optimize training we exchanged $t$ calls to an expensive function $g$ for $1$ call to an equally expensive function $h$. To optimize sampling we will exchange $t$ calls to an expensive function $g$ for $t$ calls to a function $r$ which is cheaper because it reuses cached computation from the previous calls.

**Definition:** A **state machine** in $\SM(X,Y)$ is a tuple $(S, s_0, r)$ where $S$ is a set called the state space, $s_0 \in S$ is a point we call the origin, and $r: S\times X \to S\times Y$ is a step/tick function

$$
r(s_{t-1}, x_t) = (s_t, y_t)
$$

There is no difference between these equations and the basic form of an RNN, but there are a few reasons we are choosing to call this object a state machine instead of an RNN:

* We will apply this type of equation to things that in no way can be considered neural networks. A good example is a tape, which we will discuss shortly. Using the term NN (i.e. neural network) to refer to such objects feels silly.
* Another reason is that for an RNN it is usually assumed that $s_t \in \R^d$ for some $d\in \N$. In other words, the state has a "fixed size". As we will now see, to be able to unify transformers and RNNs, it's very important to think in terms of state machines with states of "growing size".
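Here is a minimal sketch of sampling with an RNN-style state machine, carrying the state forward so nothing is recomputed (illustrative assumptions throughout: the toy vocabulary, the tanh/softmax step, and the use of a fixed start token are all made up for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
Z = ["a", "b", "c"]                        # a toy vocabulary
d = 8
W, U, E = (rng.normal(size=(d, d)), rng.normal(size=(len(Z), d)),
           rng.normal(size=(d, len(Z))))   # illustrative weights; E embeds tokens

def step(s_prev, x):
    """One tick r(s_{t-1}, x_t) = (s_t, y_t), where y_t is a distribution over Z."""
    s = np.tanh(W @ s_prev + E[:, Z.index(x)])
    logits = U @ s
    y = np.exp(logits) / np.exp(logits).sum()    # softmax
    return s, y

def sample(t):
    """Autoregressive sampling that reuses the cached state instead of
    re-running the whole prefix at every step."""
    s = np.zeros(d)                        # the origin s_0
    out = []
    for _ in range(t):
        # assumption for this sketch: feed a fixed start token before anything is sampled
        s, y = step(s, out[-1] if out else Z[0])
        out.append(rng.choice(Z, p=y))
    return "".join(out)

print(sample(5))
```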
It's kind of obvious that, if the whole idea of state machines is to allow for the caching of information during the repeated invocations of $g$, there will always be a "worst case state machine" for any $g$: the state machine that caches no computations. We call this the **tape state machine** because the only thing it stores is the inputs.

**Definition:** The tape state machine $P(g)\in \SM(X,Y)$ of a function $g\in \Fn(\Seq X, Y)$ is

$$
P(g)=( \underbrace{\Seq X}_S, \underbrace{[\;]}_{s_0}, r_g )
$$

where the tick function $r_g$ is defined as

$$
r_g \bigg( \underbrace{[x_1 \cdots x_{t-1}]}_{s_{t-1}}, \; x_t \bigg) = \bigg( \underbrace{[x_1 \cdots x_t]}_{s_t}, \; \underbrace{g(s_t)}_{y_t} \bigg)
$$

The tape state machine might seem like a bit of a silly object, but it is useful in several ways:

* It helps us see that every $g$ can be thought of as a state machine. Once we have this unified framework we can compare the sampling costs of different models: we just compare the size of the states and the cost of every tick. We are going to do that next.
* Even though you never want to run the tape state machine, it's a useful starting point for the analysis of improvements. We can sample from autoregressive MLPs. It's also useful to construct baselines.
* And it is useful as a software engineering idea. You can check whether the samples from a model are good and only then take the effort to implement a highly optimized state machine. Another application is testing: when you are implementing an optimized SM, it's useful to have a (slow) reference that has the exact same interface and outputs as the function you are trying to implement.

**Result:** The unrolling function $U: \SM(X,Y) \to \CSeqT(X,Y)$ is surjective. Given $m = (S, s_0, r) \in \SM(X,Y)$,

$$
U(m)(x_1 \cdots x_t) = [y_1 \cdots y_t] \;\;\;\; \text{where} \;\;\;\; (s_i, y_i) = r(s_{i-1}, x_i)
$$

**Proof:** Use the tape state machine... $\blacksquare$

The following diagram summarizes the result

![File (10)](https://hackmd.io/_uploads/BJ3601kva.jpg)

Note how $P\circ L$ behaves similarly to an inverse of $U$, because $U \circ P \circ L = \text{id} : \CSeqT(X,Y) \to \CSeqT(X,Y)$. But the other direction is not necessarily true. For example, if we start with a classic RNN $(\R^d, 0, r)$ and we apply $P \circ L \circ U$, we end up with a tape state machine that is different from the RNN.

We can also combine it with the autoregressive correspondence result to get the diagram:

![File (9)](https://hackmd.io/_uploads/SJxRAy1D6.jpg)

### RNN State Machines

#### Tape SM: $O(t)$ memory and $O(td^2)$ compute

#### Efficient SM: $O(d)$ memory and $O(d^2)$ compute

### Self Attention State Machines

#### Tape SM: $O(t)$ memory and $O(t^2d)$ compute

#### KV cache SM: $O(td)$ memory and $O(td)$ compute

Funnily enough, this improvement...

#### Efficient SM: $O(d^2)$ memory and $O(d^2)$ compute

Does it exist for the transformer? Yes, under the assumption that...
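To make the KV cache row above concrete, here is an illustrative sketch of one tick of the corresponding state machine (assumed interface; it uses the unnormalized causal attention of section 2.3.2 rather than a full softmax transformer):

```python
import numpy as np

def kv_cache_step(state, qkv):
    """One tick of the KV-cache state machine for the (unnormalized) causal
    attention layer Y_i = sum_{j<=i} exp(K_j^T Q_i) V_j. The state stores all
    past keys and values, so memory is O(td) and each tick costs O(td)."""
    Ks, Vs = state
    q, k, v = qkv
    Ks = Ks + [k]
    Vs = Vs + [v]
    weights = np.array([np.exp(kj @ q) for kj in Ks])     # attend to every cached key
    y = sum(w * vj for w, vj in zip(weights, Vs))
    return (Ks, Vs), y

# Illustrative usage: ticking through a short sequence of (q, k, v) triples.
rng = np.random.default_rng(0)
d, t = 4, 5
state = ([], [])                                          # the origin: empty caches
for _ in range(t):
    state, y = kv_cache_step(state, rng.normal(size=(3, d)))
print(len(state[0]), y.shape)  # 5 (4,)
```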
## 3.5) Summary

Ok, we've seen a lot of changes of perspective. The following diagram summarizes the relationship between the objects we've been using, and what each perspective is useful for.

![File (1)](https://hackmd.io/_uploads/By-KTapIT.jpg)

Our objective now is to construct good architectures with both an efficient causal sequence transformation implementation and an efficient state machine implementation. We will see how causal transformers can be seen as examples of causal sequence transformations, and then we will present our star architecture, a small modification that we call linear self attention, which retains all the advantages of transformers but gives a big improvement on the causal sequence transformation implementation. But it's going to be easier to introduce the key ideas that allow for these optimizations in the simpler setting of sequence transformations.

Transformers are amazing for two general reasons:

* they generalize very well (better than RNNs?)
* they run very fast on our GPUs (way better than RNNs)

but suffer from some disadvantages:

* Cost $O(t^2)$ to train on datasets of length $t$, as opposed to $O(t)$ for RNNs
* Sampling a sequence of length $t$ has cost $O(t^2)$ vs $O(t)$ for RNNs

Our empirical results will be focused on an architecture we implemented that we call Self Attention RNN (SARNN) / Self Attention State Machine. We think it's a very successful architecture at combining the best of both worlds because it has the following properties:

* generalizes very well
* runs very fast on our GPUs
* Cost $O(t)$ to train on datasets of length $t$
* Sampling a sequence of length $t$ has cost $O(t)$

# 4) Linear Transformer

<!-- # 2) Linear Self Attention Transformer

$$\newcommand{\Aij}{ A_{[i,j]}}$$

It turns out that several sets of equations that look quite different are all mathematically equivalent:

1. $Y_{i} = S_i Q_i$ with $S_i = S_{i-1} + V_i K_i^T$ and $S_0 = 0 \in \R^{d\times d}$
2. $Y_i = \sum_{j=1}^i V_j A_{[i,j]}$ with $A_{[i,j]} = K_j^T Q_i$
3. $Y = VA$ where the attention matrix is defined as $A = M \circ (K^T Q)$
4. $Y_i = S_{i-k} Q_i + \sum_{j=i-k}^i V_j A_{[i,j]}$, a hybrid of 1 and 2, with $A_{[i,j]} = K_j^T Q_i$ and assuming $S_i = 0$ for all $i\le 0$.

Before showing that they are indeed equivalent, let's look at the computational properties of each.

1. Computing $Y$ costs $O(td^2)$. To see why this is the cost, let's see what it would take to compute the entire length $t$ sequence $Y_1, Y_2, \cdots Y_t$. Each step of the recurrent equation first requires $V_i K_i^T$ with cost $O(d^2)$, then a sum of two $d \times d$ matrices, also with cost $O(d^2)$, and finally a matrix vector product with the same cost. Since we need to do $t$ of those steps, the cost is indeed $O(td^2)$.
2. Computing $Y$ costs $O(t^2d)$. To implement 2 we would first compute $\Aij$, which requires doing an inner product with cost $O(d)$; then for each of the $i$ elements of the loop we would do $\Aij V_j$, which is a vector scalar multiplication with cost $O(d)$. Thus, computing $Y_i$ would be $O(id)$ and computing the entire sequence $Y$ would be $O(t^2d)$. For very large sequence lengths $t^2$ will dominate and it would seem like option 1 would be preferable; but option 2 might be faster in cases where the inner dimension of the model $d$ is much larger than the sequence length $t$, although that is pretty much never the case for real world applications.
3. Computing $Y$ costs $O(t^2d)$. This is because computing $A$ is just the cost of computing all $t^2$ combinations of $K_j^T Q_i$, which takes $O(t^2d)$. Then to apply the mask we need $O(t^2)$ and finally to do $VA$ costs $O(t^2d)$.

Clearly 2 and 3 are related: they are two different implementations of the same mathematical function with the same cost. But 3 has a very useful feature. Computing it requires only the invocation of 3 primitives on matrices: matrix multiplication, transpose and elementwise multiplication. These are all highly parallelizable and run very efficiently on GPUs.

In our initial experiments we were using $d=64$ and $t=1024$. The mathematically savvy reader will realize $1024$ is larger than $64$, and so $t^2= 1024^2 \simeq 1000000$ is much larger than $d^2 \simeq 4000$. Yet, compare the times it takes for 1 and 3 to compute $Y$ on an H100 GPU: SHOW DATA

Method 3 requires on the order of 250x more floating point operations than 1 and yet it completely obliterates it. This is how insanely powerful GPUs can be for parallelizable computations.
But GPUs only have so many cores and at some point operations start to be scheduled sequentially. Then, we are guaranteed that for a large enough $t$, the quadratic cost $O(t^2d)$ will dominate $O(td^2)$ and method 1 will be faster. This is indeed the case, but it is not a regime where people train models. SHOW EXPERIMENT: we keep increasing the size of the context until we see 1 win over 3. Perhaps at around 16 or 32k.

----------------

# 2) Linear Self Attention

Mathematically, SARNN is a tiny modification of a transformer self attention layer. But it's much better behaved computationally; mainly, it has $O(t)$ cost as opposed to $O(t^2)$. The basic idea that makes this possible is very simple, and can be found in the literature from years ago. But that paper went wrong in how it proposed computing this different type of architecture.

## As a Sequence to Sequence Model

SARNN vs LSA (Linear Self Attention)

First we do a simplification of the transformer attention layer:

$$
[f(Q,K,V)]_i = Y_i = \sum_{j=1}^i e^{K_j^T Q_i} V_j
$$

The normalization doesn't matter for the issues at hand. In section 2.X we will revisit the issue of normalization.

**Linear Self Attention**

$$
[h(K,V,Q)]_i = Y_i = \sum_{j=1}^i K_j^T Q_i V_j
$$

Simply removing the exponential produces all the mathematical advantages: better context scaling, better sampling.

**The attention vectors:** There is also a sequence of $t$ attention vectors $A_1, \cdots, A_t \in \R^t$, whose equation is just

$$
A_i[j] = 1_{j\le i} K_j^TQ_i
$$

We also have a mask $M_i$ s.t. $M_i[j] = 1_{j\le i}$, which we use in the equation that summarizes what the entire attention vector is:

$$A_i = \left(K^T Q_i \right) \circ M_i$$

**The output vector sequence:** Once we have the attention vectors $A_i$, we can compute $Y_i$, the output vector at timestep $i$:

$$
Y_i = \sum_{j=1}^i A_i[j] V_j
$$

or equivalently,

$$Y_i = VA_i$$

It would seem natural to implement this as the following sequential program:

```python!
import numpy as np

def linear_mask_vec(i, t):
    """Mask vector M_i with M_i[j] = 1 for j <= i and 0 otherwise."""
    m = np.zeros(t)
    m[: i + 1] = 1.0
    return m

def sequential_linear_self_attn(K, V, Q):
    t = K.shape[1]
    Y = []
    for i in range(t):
        T_i = K.T @ Q[:, i]                 # all inner products K_j^T Q_i
        A_i = T_i * linear_mask_vec(i, t)   # zero out the non-causal entries j > i
        Y.append(V @ A_i)
    return Y
```

## As a Recurrent Sequence Model

**Key result:** Consider the recurrent sequence model defined by the equations:

* $S_0=0 \in \R^{d\times d}$
* $S_i = S_{i-1} + V_i K_i^T$

Now that we have the sequence of states, we define the output sequence to be $S_i Q_i$ (the multiplication of a matrix and a vector). Then this recurrent model is equivalent to the linear self attention layer. In other words:

$$Y_i=S_i Q_i$$

**Proof.** Clearly

$$
S_i = \sum_{j=1}^i V_j K_j^T
$$

And so the output sequence $S_i Q_i$ is

\begin{align}
S_i Q_i &= \left( \sum_{j=1}^i V_j K_j^T \right) Q_i = \sum_{j=1}^i (V_j K_j^T) Q_i = \sum_{j=1}^i (K_j^T Q_i) V_j \\
&= Y_i
\end{align}

QED

**A better sequential program:** This recurrence result gives us a big improvement on the previous sequential implementation of the linear self attention.

```python!
import numpy as np

def step(s, k, v, q):
    sp = s + np.outer(v, k)   # S_i = S_{i-1} + V_i K_i^T
    y = sp @ q                # Y_i = S_i Q_i
    return sp, y

def self_atten_layer_recurrent(K, V, Q):
    d, t = K.shape
    s = np.zeros([d, d])
    Y = []
    for i in range(t):
        k, v, q = K[:, i], V[:, i], Q[:, i]
        s, Y_i = step(s, k, v, q)
        Y.append(Y_i)
    return Y
```

Even though this program is sequential (and thus not very well suited for GPUs) it is way better than the last one, mainly because each of the $t$ steps only uses the individual vectors $k,q,v$ as opposed to the entire sequences $K,Q,V$. Also, as we will see in a later section, the RNN formulation is much better behaved. -->
