<style>
section { text-align: left; }
.reveal .slides section { font-size: 30px; }
</style>

# Presentation on Complexity Theoretic Limitations of Large Language Models

<!-- Put the link to this slide here so people can follow -->
Slide: https://hackmd.io/@tobi-alafin/llm-complexity-limitations

---

## Background

Contemporary large language models (LLMs) perform a fixed number of computational steps per generated token (they have a "fixed-length forward pass"), regardless of how computationally difficult predicting a particular next token is. Unlike humans, LLMs do not by default "think" harder or for longer when faced with more difficult problems.

This project aims to investigate what complexity-theoretic constraints (if any) we can place on autoregressive sequence prediction models with a fixed-length forward pass.

---

### Key Research Questions

1. Can we derive asymptotic upper bounds on the complexity of problems solvable by such models as a function of their (hyper)parameters?
2. How do architectural choices of the model affect these bounds?

---

## Transformers

- Auto-regressive sequence models
- Rely on the attention mechanism
- Widely used in NLP tasks

----

### Components of Attention

1. Query (Q): Represents the token we are currently trying to focus on or decode.
2. Keys (K): Represent all tokens in the sequence, in relation to which the query's importance is assessed.
3. Values (V): Also represent all tokens in the sequence; these are the values that are summed to produce the final output.

**Mathematical Description**

The attention mechanism can be defined as follows:

$$
\operatorname{Attention}(Q, K, V)=\operatorname{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
$$

Here, $d_k$ is the dimension of the key vectors, and the division by $\sqrt{d_k}$ is for scaling purposes. (A NumPy sketch of this computation appears in Appendix A at the end of the deck.)

---

### Architecture of a Transformer

![Transformer Architecture](https://i.imgur.com/iMeODZ9.png)

N. Elhage, N. Nanda, C. Olsson, et al., "A Mathematical Framework for Transformer Circuits," Transformer Circuits Thread, 2021. [Online]. Available: https://transformer-circuits.pub/2021/framework/index.html. [Accessed: 03-Nov-2023]. (See "High Level Architecture".)

---

## Modeling Transformers

- Key characteristics:
  - Finite sliding window context
  - Reads the entire context
  - Intermediate computations
  - Fixed computation cost
  - Statelessness
- Initial simplified model:
  - Deterministic (temperature = 0)
  - Non-interactive
  - Sequential

---

## Basic Conceptual Model

### State Description

- The model has a tape, two heads ("read" and "write"), and a finite buffer (working memory).
- Constants define the tape alphabet, input alphabet, sliding window length, processing depth, and buffer size.
- Variables track the current state, window fullness, start position, steps since the last write, and buffer contents.

----

### Characteristics

- Sliding window context of fixed length
- Writes to the first empty cell
- Periodically reads the entire context into the buffer
- Fixed intermediate computations between reads and writes
- Stateless: forgets computation details between writes

----

### Operational Behavior

- Initialisation sets up the tape, window, heads, and buffer
- Reading copies the context into the buffer
- Computations involve state transitions using the buffer
- Writing outputs a symbol and resets the buffer
- Termination on a special "end of output" symbol
  - Analogous to the end-of-sequence symbol in a Transformer
- Simplified cycle: read context, compute, write symbol; repeat until termination (a sketch of this cycle follows on the next slide)
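----

### Sketch: Read-Compute-Write Cycle

A minimal Python sketch of the simplified cycle, under stated assumptions: the names `run`, `step`, `END`, `window_length`, and `processing_depth` are illustrative placeholders, not part of the model's formal definition, and the choice of written symbol is arbitrary.

```python
END = "<eos>"  # stand-in for the special "end of output" symbol

def run(prompt, window_length, processing_depth, step, q0="q0"):
    """Simulate the simplified read-compute-write cycle.

    `step` plays the role of one intermediate computation: it maps a
    (state, buffer) pair to a new (state, buffer) pair.
    """
    tape = list(prompt)            # tape initialised with the input
    start = 0                      # left edge of the sliding window
    while True:
        # Read: copy the whole context window into the working buffer.
        buffer = tape[start:start + window_length]
        state = q0                 # stateless: restart from q0 each cycle
        # Compute: exactly `processing_depth` intermediate transitions.
        for _ in range(processing_depth):
            state, buffer = step(state, buffer)
        # Write: emit one symbol (here, arbitrarily, the last buffer
        # entry), then flush the buffer.
        symbol = buffer[-1] if buffer else END
        tape.append(symbol)
        if symbol == END:          # termination condition
            return tape
        # Slide the window forward once it is full.
        if len(tape) - start > window_length:
            start += 1
```

For instance, with a toy `step = lambda q, buf: (q, buf + [END]) if len(buf) >= 3 else (q, buf)`, the call `run("hi", window_length=4, processing_depth=1, step=step)` echoes one symbol and then halts.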
----

### Basic Formal Model

#### Components

1. **States**: $Q$ is a finite set of states.
2. **Halt States**: $Q_{\text{halt}} = Q_{\text{accept}} \cup Q_{\text{reject}}$, where $Q_{\text{halt}} \subseteq Q$.
3. **Accept States**: $Q_{\text{accept}} \subseteq Q$ is a set of accept states.
4. **Reject States**: $Q_{\text{reject}} \subseteq Q$ is a set of reject states.
5. **Alphabet**: $\Sigma$ is a finite set of symbols, including a special "blank" symbol $\text{BLANK}$.
6. **Buffer Alphabet**: $\Sigma_{\text{buffer}} \subseteq \Sigma$, the set of symbols that can be stored in the buffer.

----

7. **Tape**: An infinite tape, each cell of which can contain a symbol from $\Sigma$.
8. **Write Head**: Positioned over a single cell on the tape at any given time. The head can write a symbol to the cell it is positioned over.
9. **Read Head**: Positioned over a single cell within the sliding window at any time. The head can read the symbol in the cell it is positioned over.
10. **Buffer**: A finite-sized, fully random-access memory that serves as the model's working memory, analogous to the residual stream of a transformer. The buffer can contain symbols from $\Sigma_{\text{buffer}}$.

----

11. **Start Position**: $\text{START} \in \mathbb{N}$, the beginning of the sliding window on the tape.
12. **Sliding Window "Full" Indicator**: $f \in \{0, 1\}$, a flag indicating whether the sliding window is full.
13. **Initial State**: $q_0 \in Q$, the state in which the model starts.
14. **Processing Depth**: $\text{PROCESSING\_DEPTH} \in \mathbb{N}$, a constant specifying the number of intermediate computation steps.

----

15. **Transition Function**:
$$\delta : Q \times \Sigma_{\text{buffer}} \times \{0, 1\} \times \mathbb{N} \rightarrow Q \times \Sigma \times \Sigma_{\text{buffer}} \times \mathbb{N}$$

Here, $\delta$ takes as input:
- The current state $q \in Q$
- The last symbol read into the buffer $s \in \Sigma_{\text{buffer}}$
- The "full" flag $f \in \{0, 1\}$
- The current start position $\text{START}$

And it returns:
- The next state $q' \in Q$
- The symbol to write $w \in \Sigma$
- The symbol to store or manipulate in the buffer $b \in \Sigma_{\text{buffer}}$
- The new start position $\text{START}'$

---

#### Operational Semantics

1. **Initialization**: The machine starts in state $q_0$, with the write head positioned at the first cell and the read head within the sliding window. The start position $\text{START}$ is initially set to 0. An empty buffer is initialized; it can store up to $\text{LENGTH}$ symbols (the sliding window length).
2. **Reading**: The entire context is copied into the buffer.
3. **Intermediate Computations**: The automaton transitions between states based on its current state, the "full" flag, and possibly the contents of the buffer. These computations take $\text{PROCESSING\_DEPTH}$ steps.

----

4. **Writing and Resetting**: After the intermediate computations, the automaton writes a new symbol to the tape and flushes the buffer. It then returns to the start state $q_0$.
5. **Sliding Window Update**: If the "full" flag is set to 1, the start position $\text{START}$ is incremented by 1.
6. **Termination Condition**: The automaton halts if it writes the special "end of output" token to the tape.
7. **Acceptance and Rejection**:
   - If the machine enters a state in $Q_{\text{accept}}$, it accepts the input.
   - If the machine enters a state in $Q_{\text{reject}}$, it rejects the input.

A code sketch of one full cycle follows on the next slide.
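----

### Sketch: One Formal Cycle

The earlier conceptual sketch glossed over $\delta$'s exact signature. The hedged Python sketch below follows the signature and the numbered semantics above for a single read-compute-write cycle; where the formal definition leaves details open (which buffer symbol is fed to $\delta$ at each step, how the finite buffer is bounded), it makes explicitly arbitrary choices. All names are illustrative placeholders.

```python
from typing import Callable, Tuple

State = str
Symbol = str
BLANK = "_"  # stand-in for the blank symbol

# delta : Q x Sigma_buffer x {0, 1} x N  ->  Q x Sigma x Sigma_buffer x N
Delta = Callable[[State, Symbol, int, int], Tuple[State, Symbol, Symbol, int]]

def one_cycle(tape: list, start: int, length: int,
              q0: State, delta: Delta, processing_depth: int) -> Tuple[Symbol, int]:
    """Run one read-compute-write cycle; return (written symbol, new start)."""
    # 2. Reading: copy the entire context window into the buffer.
    buffer = tape[start:start + length]
    full = int(len(buffer) == length)          # sliding-window "full" flag f
    state, write_symbol = q0, BLANK
    # 3. Intermediate computations: PROCESSING_DEPTH applications of delta.
    for i in range(processing_depth):
        # Arbitrary choice of which buffer symbol delta sees at each step.
        last_read = buffer[i % len(buffer)] if buffer else BLANK
        state, write_symbol, stored, start = delta(state, last_read, full, start)
        buffer.append(stored)                  # symbol stored in the buffer
    # 4. Writing and resetting: write one symbol; the buffer is then discarded.
    tape.append(write_symbol)
    # 5. Sliding window update: advance START when the window is full.
    if full:
        start += 1
    return write_symbol, start
```

A full run would simply repeat `one_cycle` until the written symbol is the end-of-output token.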
----

### Summary

An abstract machine with a sliding window context, fixed processing steps per write, and no persistent memory across writes. The operational behavior follows a read-compute-write cycle.

---

## Questions

### Complexity Analysis

- Time complexity:
  - Processing depth $\text{PROCESSING\_DEPTH}$ is a function of context length/input size $N$
  - Canonical $\text{PROCESSING\_DEPTH}$ is quadratic or linear in $N$, depending on attention sparsity
  - How does time complexity scale with $\text{PROCESSING\_DEPTH}$ compared to a standard RAM model?
  - Sample complexities: logarithmic (binary search), linear (Dyck languages), quadratic (3SUM), cubic (matrix multiplication)

----

- Space complexity:
  - Buffer size $\text{BUFFER\_SIZE}$ is a function of context length/input size
  - Canonical space is $O(N \times L)$
  - How do time and output space complexity change with buffer size compared to a RAM?

----

- Also consider:
  - Time-space tradeoffs of the model
  - Output space vs. working memory space complexity

----

### Summary

Key questions revolve around formally analysing the time and space complexities of the transformer model as a function of context length, processing depth, and buffer size. The goal is to derive complexity bounds on sample problems and understand how they differ from a standard RAM model.

---

## Extensions

### Parallelism

Extending the transformer automaton model to incorporate parallelism (multiple attention heads in a given layer).

Some challenges identified:

- Managing multiple computation "streams" requires a more complex state representation
- The transition function must also account for parallel operations

----

Proposed modifications:

- Extend the single buffer to multiple (logical) buffers, one per parallel head
- Expand the transition function to operate on all buffers simultaneously
- Add state variables to track each head's status

New components:

- Divide the finite buffer into non-overlapping slices, one per head
- A parallel transition function operates on the slices in parallel

----

Modified operation cycle:

- Reading copies the context to all logical buffers
- The parallel transition function manipulates the buffer slices
- Writing aggregates the outputs from all heads

The revised model aims to capture transformer parallelism while maintaining the simplicity of the original design. The main adjustments are buffer slicing and an expanded transition function. This is still a work in progress; the formal additions would expand the state representation and the transition function. (A toy sketch of buffer slicing appears in Appendix B at the end of the deck.)

---

### Further Extensions

- Interactivity
  - Online algorithms
- Probabilistic sampling
  - Non-zero temperature
  - Non-greedy sampling
  - Randomised/probabilistic algorithms framework

---

## Summary of Progress

- Developed conceptual model
- Initial formalisation
- Core analysis still ongoing
- Future extensions planned

---

## Future Work

- Complete formal bounds
- Extend model
- Validate on simple datasets!
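---

## Appendix A: Attention in NumPy

For reference, a minimal NumPy sketch of the scaled dot-product attention defined in "Components of Attention": single head, no masking, no batching. The function name and toy shapes are illustrative only, not a claim about any particular implementation.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n_q, n_k) similarity scores
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V                            # (n_q, d_v) weighted values

# Example: 3 query tokens attending over 5 key/value tokens, d_k = d_v = 4.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=s) for s in [(3, 4), (5, 4), (5, 4)])
print(attention(Q, K, V).shape)  # (3, 4)
```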
{"metaMigratedAt":"2023-06-18T03:49:35.451Z","metaMigratedFrom":"YAML","breaks":true,"title":"Presentation on Complexity Theoretic Limitations of Large Language Models","description":"","contributors":"[{\"id\":\"7cfe1b4a-8dc1-4d22-83e3-6d5018def783\",\"add\":461813,\"del\":451555}]"}