**nano‑vLLM: A Build‑from‑Scratch Architectural Lab**

# **nano‑vLLM: Build‑from‑Scratch Architectural Labs** *A Complete, Step‑by‑Step Guide with Enhanced Visual Learning* --- ## **Executive Summary: Why vLLM Matters** ### **The Problem** Traditional LLM serving systems are fundamentally inefficient due to sequential processing: 1. **GPU Underutilization**: GPUs remain idle 60‑70% of the time 2. **Memory Fragmentation**: Variable‑length sequences waste memory space 3. **Poor Throughput**: Each request blocks the entire GPU 4. **High Latency**: Long requests delay all subsequent requests ### **The vLLM Revolution** vLLM introduces three breakthrough innovations enabling 10‑23× throughput improvements: 1. **Continuous Batching**: Process multiple requests simultaneously 2. **Paged KV Cache**: Manage memory in fixed‑size blocks 3. **Hybrid Scheduling**: Intelligently prioritize requests ### **Our Learning Journey** We'll build "nano‑vLLM" from scratch through 14 structured labs, each focusing on one core concept with clear visualizations. --- ## **Module 1: Transformer Inference Foundations** *Understanding the computational graph of LLM inference* ### **Lab 1.1: The Transformer Decoder Forward Pass** #### **Objective** Understand the exact data flow through a transformer decoder layer during inference and identify the attention bottleneck. #### **Key Concept** Autoregressive generation means each new token depends on all previous tokens. The attention mechanism is the primary computational bottleneck we must optimize. #### **Detailed Explanation** During inference, a transformer decoder processes tokens one at a time in an autoregressive manner. Each layer performs: 1. **Self‑Attention**: The input hidden states are transformed into Query (Q), Key (K), and Value (V) tensors. The attention computation follows: ``` Attention(Q, K, V) = softmax(Q·Kᵀ/√d) · V ``` Where `d` is the hidden dimension size. 2. **Residual Connection**: The attention output is added to the original input (pre‑attention hidden states) to preserve information. 3. **Feed‑Forward Network (FFN)**: A two‑layer MLP with an activation function (typically GeLU or SwiGLU). 4. **Another Residual Connection**: The FFN output is added to the attention+residual output. 5. **Layer Normalization**: Applied after each residual addition for training stability. The critical insight: For token `t`, the attention computation needs Keys and Values for tokens `0` through `t‑1`. These **do not change** across generation steps, making KV caching possible. #### **Visualization 1: Single Transformer Layer Data Flow** ```mermaid graph TB subgraph "Transformer Layer" Input[Input Hidden State ht] --> LN1[LayerNorm] LN1 --> QProj[Linear Compute Q] LN1 --> KProj[Linear Compute K] LN1 --> VProj[Linear Compute V] QProj --> Q[Query Qt] KProj --> K[Key Kt] VProj --> V[Value Vt] Q --> Attention[Attention softmax(Q·KT/√d)·V] KCache[K Cache] --> Attention VCache[V Cache] --> Attention Attention --> Add1[⊕ Add] Input --> Add1 Add1 --> LN2[LayerNorm] LN2 --> FFN[Feed-Forward W₂·GeLU(W₁·x)] FFN --> Add2[⊕ Add] Add1 --> Add2 Add2 --> Output[Output Hidden State ht+1] K --> KCacheUpdate[Update K Cache] V --> VCacheUpdate[Update V Cache] end style Attention fill:#e1f5fe style FFN fill:#f3e5f5 style KCache fill:#e8f5e8 style VCache fill:#e8f5e8 ``` #### **Visualization 2: Attention Computation Pattern** ```mermaid graph LR subgraph "Generation Step t: Compute token t+1" direction TB subgraph "Input" A[Current Token t] --> B[Compute Qt] end subgraph "Cache Lookup" C[K Cache: K0 to Kt-1] --> D E[V Cache: V0 to Vt-1] --> F end subgraph "Computation" B --> D[Matrix Multiply Qt · K0:t-1T] D --> G[Scale & Softmax /√d → softmax] G --> F[Weighted Sum ∑(weights · V0:t-1)] end subgraph "Output" F --> H[Attention Output] end subgraph "Cache Update" I[Compute Kt] --> J[Append to K Cache] K[Compute Vt] --> L[Append to V Cache] end end ``` ### **Lab 1.2: Autoregressive Decoding Loop** #### **Objective** Model the complete token generation process as a finite state machine and understand memory growth patterns. #### **Key Concept** Generation is a loop where each iteration produces one token using cached computations from all previous tokens, transforming computational complexity from O(n²) to O(n). #### **Detailed Explanation** The decoding loop follows this pattern: **Phase 1: Prompt Processing** ``` Input: [The, quick, brown, fox] Process: Single forward pass through all prompt tokens Output: KV cache for all prompt tokens + first output token ``` **Phase 2: Token Generation Loop** For each iteration i (generating token i+1): 1. Take current token (or prompt for i=0) 2. Compute Query for current token 3. Retrieve cached Keys and Values for tokens 0 to i-1 4. Compute attention scores: Q·Kᵀ 5. Apply softmax to get attention weights 6. Compute weighted sum: weights · V 7. Continue through FFN and normalization 8. Compute logits from final hidden state 9. Sample next token from logits (using chosen strategy) 10. Compute and cache new Key and Value 11. If token == EOS or i == max_tokens: break 12. Set current token = sampled token This loop transforms the computational complexity: - Without cache: O(n²) total operations - With cache: O(n) total operations #### **Visualization 1: Autoregressive Loop State Machine** ```mermaid stateDiagram-v2 [*] --> Idle : System ready Idle --> PromptProcessing : Receive prompt PromptProcessing --> FirstToken : Compute KV cache for all prompt tokens state FirstToken { [*] --> ComputePrompt ComputePrompt --> SampleFirst SampleFirst --> CacheFirst CacheFirst --> [*] } FirstToken --> Generating : Begin generation state Generating { [*] --> GetCurrentToken GetCurrentToken --> ComputeQuery ComputeQuery --> RetrieveCache RetrieveCache --> AttentionCompute AttentionCompute --> FFNCompute FFNCompute --> LayerNorm LayerNorm --> ComputeLogits ComputeLogits --> SampleNext SampleNext --> CheckStop CheckStop --> ContinueCheck{Continue?} ContinueCheck --> CacheUpdate : Yes ContinueCheck --> [*] : No CacheUpdate --> GetCurrentToken } Generating --> Complete : EOS or max_tokens Complete --> Cleanup : Free resources Cleanup --> [*] : Ready for next request ``` #### **Visualization 2: Memory Growth Comparison** ```mermaid graph TB subgraph "Without KV Cache" direction LR A[Token 1] --> A1[Compute: All 1 tokens] A1 --> A2[Memory: █] A2 --> A3[Ops: 1×] B[Token 2] --> B1[Compute: All 2 tokens] B1 --> B2[Memory: █ █] B2 --> B3[Ops: 2×] C[Token 3] --> C1[Compute: All 3 tokens] C1 --> C2[Memory: █ █ █] C2 --> C3[Ops: 3×] D[...] --> D1[...] D1 --> D2[...] D2 --> D3[...] E[Token N] --> E1[Compute: All N tokens] E1 --> E2[Memory: N blocks] E2 --> E3[Ops: N×] F[Total for N tokens] --> F1[Operations: N²/2] F --> F2[Memory: Reused each step] end subgraph "With KV Cache" direction LR G[Token 1] --> G1[Compute & Cache: 1] G1 --> G2[Memory: █ + cache[█]] G2 --> G3[Ops: 1×] H[Token 2] --> H1[Compute: 2, Use cache[█]] H1 --> H2[Memory: █ + cache[██]] H2 --> H3[Ops: 1× + lookup] I[Token 3] --> I1[Compute: 3, Use cache[██]] I1 --> I2[Memory: █ + cache[███]] I2 --> I3[Ops: 1× + lookup] J[...] --> J1[...] J1 --> J2[...] J2 --> J3[...] K[Token N] --> K1[Compute: N, Use cache[N-1 blocks]] K1 --> K2[Memory: █ + cache[N blocks]] K2 --> K3[Ops: 1× + lookup] L[Total for N tokens] --> L1[Operations: N] L --> L2[Memory: Grows linearly] end ``` ### **Lab 1.3: Sampling Strategies** #### **Objective** Understand different methods for selecting the next token and their tradeoffs between creativity and coherence. #### **Key Concept** Sampling strategies manipulate the probability distribution of next tokens to control the creativity vs. coherence tradeoff in generated text. #### **Detailed Explanation** **1. Greedy Sampling** - Always select the token with highest probability - **Pros**: Deterministic, highly coherent - **Cons**: Repetitive, lacks creativity - **Use case**: Code generation, factual responses **2. Temperature Sampling** - Scale logits by temperature `T` before softmax: `softmax(logits / T)` - T < 1.0: More deterministic (peakier distribution) - T = 1.0: Original distribution - T > 1.0: More random (flatter distribution) - **Use case**: Chat responses, creative tasks **3. Top‑k Sampling** - Only consider the `k` highest‑probability tokens - Renormalize probabilities among these `k` tokens - **Pros**: Eliminates very unlikely tokens - **Cons**: Fixed `k` can be problematic for varying distributions - **Use case**: General purpose generation **4. Top‑p (Nucleus) Sampling** - Consider smallest set of tokens whose cumulative probability ≥ `p` - Renormalize probabilities within this set - **Pros**: Adaptive to distribution shape - **Cons**: Computationally more expensive - **Use case**: Creative writing, research **5. Beam Search** - Maintain `b` most promising sequences - Expand all sequences each step, keep top `b` - **Pros**: Highest coherence, finds optimal sequence - **Cons**: Computationally expensive (b × more compute) - **Use case**: Translation, summarization #### **Visualization 1: Sampling Strategy Effects** ```mermaid graph TD subgraph "Original Probability Distribution" A[Tokens] --> B[Probabilities] B --> B1[Token 1: 0.40] B --> B2[Token 2: 0.30] B --> B3[Token 3: 0.20] B --> B4[Token 4: 0.05] B --> B5[Token 5: 0.05] end subgraph "After Sampling Strategies" C[Greedy] --> D[Always pick Token 1: 0.40] E[Temperature T=0.5] --> F[Sharpen distribution] F --> F1[Token 1: 0.55] F --> F2[Token 2: 0.25] F --> F3[Token 3: 0.15] F --> F4[Token 4: 0.03] F --> F5[Token 5: 0.02] G[Top-k k=3] --> H[Consider only top 3 tokens] H --> H1[Token 1: 0.44] H --> H2[Token 2: 0.33] H --> H3[Token 3: 0.22] H --> H4[Token 4: 0.00] H --> H5[Token 5: 0.00] I[Top-p p=0.9] --> J[Smallest set with cum prob ≥ 0.9] J --> J1[Tokens 1-3 cum: 0.90] J --> J2[Renormalized: 0.44, 0.33, 0.22] end ``` #### **Visualization 2: Temperature Effect Visualization** ```mermaid xychart-beta title "Temperature Scaling Effect on Token Probabilities" x-axis ["Token 1", "Token 2", "Token 3", "Token 4", "Token 5"] y-axis "Probability" 0 --> 0.7 bar [0.67, 0.19, 0.09, 0.03, 0.02] bar [0.40, 0.30, 0.20, 0.05, 0.05] bar [0.25, 0.22, 0.20, 0.18, 0.15] legend ["T=0.1 (Very Deterministic)", "T=1.0 (Original)", "T=2.0 (Creative)"] ``` --- ## **Module 2: KV Cache Fundamentals** *Trading memory for compute efficiency* ### **Lab 2.1: The Memory‑Compute Tradeoff** #### **Objective** Quantify exactly how much memory KV cache requires and how much compute it saves. #### **Key Concept** Storing K and V tensors allows us to avoid recomputing them, turning O(n²) attention operations into O(n) operations by trading memory for compute. #### **Detailed Explanation** **Memory Calculation Formula:** ``` Memory per token = hidden_size × num_kv_heads × bytes_per_param × 2 ``` Where: - `hidden_size`: Model's hidden dimension (e.g., 4096 for Llama2‑7B) - `num_kv_heads`: Number of key/value attention heads - `bytes_per_param`: 2 for float16, 4 for float32 - `×2`: For both Key and Value tensors **Example: Llama2‑7B:** - Hidden size: 4096 - KV heads: 32 - Precision: float16 (2 bytes) ``` Memory per token = 4096 × 32 × 2 × 2 = 524,288 bytes = 512 KB ``` **Example: Llama2‑70B:** - Hidden size: 8192 - KV heads: 8 - Precision: float16 (2 bytes) ``` Memory per token = 8192 × 8 × 2 × 2 = 262,144 bytes = 256 KB ``` **Compute Savings Analysis:** Without KV cache, each new token requires attention computation with all previous tokens: - Operations per token: O(n) - Total operations for n tokens: O(n²) With KV cache: - Operations per token: O(1) for compute + O(n) for cache lookup - Total operations for n tokens: O(n) **Tradeoff Decision:** - For short sequences (<100 tokens): Cache may not be worthwhile - For medium sequences (100‑1000 tokens): 10‑50× speedup - For long sequences (>1000 tokens): 50‑200× speedup - ![image](https://hackmd.io/_uploads/BJhhqWX4Wx.png) #### **Visualization 1: Memory Usage Growth** ```mermaid xychart-beta title "KV Cache Memory Usage vs Sequence Length" x-axis ["256", "512", "1024", "2048", "4096"] y-axis "Memory (GB)" 0 --> 3 line [0.13, 0.26, 0.52, 1.05, 2.10] line [0.26, 0.52, 1.05, 2.10, 4.20] legend ["Llama2-7B (512KB/token)", "Llama2-70B (1MB/token)"] ``` #### **Visualization 2: Compute Operations Comparison** ```mermaid graph TB subgraph "Compute Requirements" direction LR subgraph "Without KV Cache" A1[Token 1] --> A2[Attention with 1 token] A2 --> A3[1× compute] B1[Token 100] --> B2[Attention with 100 tokens] B2 --> B3[100× compute] C1[Token 1000] --> C2[Attention with 1000 tokens] C2 --> C3[1000× compute] D1[Total for 1000 tokens] --> D2[500,500 operations O(n²) growth] end subgraph "With KV Cache" E1[Token 1] --> E2[Compute + Cache: 1] E2 --> E3[1× compute] F1[Token 100] --> F2[Compute + Cache lookup] F2 --> F3[~1.01× compute] G1[Token 1000] --> G2[Compute + Cache lookup] G2 --> G3[~1.001× compute] H1[Total for 1000 tokens] --> H2[~1000 operations O(n) growth] end end subgraph "Speedup Factor" I[Sequence Length] --> J[Speedup] K[100 tokens] --> L[~10× faster] M[500 tokens] --> N[~50× faster] O[1000 tokens] --> P[~100× faster] end ``` ### **Lab 2.2: Cache Efficiency Metrics** #### **Objective** Measure the actual throughput improvement from KV caching and understand memory bandwidth limitations. #### **Key Concept** The benefit of caching grows with sequence length but is ultimately limited by memory bandwidth when loading cached values. #### **Detailed Explanation** **Throughput Calculation:** ``` Throughput (tokens/sec) = Batch Size ÷ Time per Token ``` **Without caching:** - Time per token grows linearly with sequence length - Throughput decreases as sequences get longer **With caching:** - Time per token is nearly constant - Throughput remains high until memory bandwidth limits **Memory Bandwidth Bottleneck:** Even with perfect caching, we must load all cached K and V values from GPU memory for each attention operation. For sequence length `n`: ``` Bytes loaded per token = 2 × n × hidden_size × bytes_per_param ``` **Example: Llama2‑7B, 1000 tokens:** ``` Bytes loaded = 2 × 1000 × 4096 × 2 = 16,384,000 bytes ≈ 15.6 MB ``` With GPU memory bandwidth of 1.5 TB/s: ``` Theoretical minimum time = 15.6 MB ÷ 1.5 TB/s ≈ 10.4 microseconds ``` This becomes the fundamental limit for attention speed. **Practical Efficiency Metrics:** 1. **Cache Hit Rate**: Percentage of KV reads served from cache 2. **Memory Bandwidth Utilization**: How close to peak bandwidth 3. **Compute Utilization**: Percentage of time GPU does actual computation vs. memory transfers #### **Visualization 1: Throughput vs Sequence Length** ```mermaid xychart-beta title "Throughput Comparison: With vs Without KV Cache" x-axis ["100", "200", "500", "1000", "2000"] y-axis "Tokens/Second" 0 --> 5000 line [4500, 4300, 3800, 3200, 2500] line [1200, 600, 240, 120, 60] legend ["With KV Cache", "Without KV Cache"] ``` #### **Visualization 2: Memory Bandwidth Bottleneck** ```mermaid graph LR subgraph "Attention Computation Timeline" direction TB A[Start: Token t] --> B[Load Q vector ~0.01ms] B --> C[Load K Cache from DRAM ~0.3ms for 1000 tokens] C --> D[Compute Q·Kᵀ ~0.1ms] D --> E[Softmax ~0.05ms] E --> F[Load V Cache from DRAM ~0.3ms for 1000 tokens] F --> G[Weighted Sum ~0.1ms] G --> H[Complete] end subgraph "Bottleneck Analysis" I[Total Time: ~0.86ms] --> J[Compute: 0.25ms (29%)] I --> K[Memory Load: 0.61ms (71%)] K --> L[Conclusion: Memory Bandwidth Limited] end subgraph "Improvement Strategies" M[Memory Optimizations] --> N[1. Use faster memory (HBM)] M --> O[2. Increase cache locality] M --> P[3. Reduce precision (FP16 → INT8)] end ``` --- ## **Module 3: Continuous Batching Architecture** *Keeping the GPU busy with multiple requests* ### **Lab 3.1: Sequential vs Continuous Batching** #### **Objective** Understand why traditional batching is inefficient and how continuous batching improves GPU utilization. #### **Key Concept** GPUs are throughput‑oriented devices that perform best when fully utilized with large, regular computations. Continuous batching interleaves requests at the token level to maintain high utilization. #### **Detailed Explanation** **Traditional (Static) Batching Problems:** 1. **Straggler Problem**: One long request delays all others 2. **Low GPU Utilization**: GPU idle while waiting for batch to fill 3. **High Latency**: Requests wait for batch formation 4. **Memory Inefficiency**: Fixed batch size wastes memory **Traditional Batch Timeline:** ``` Time: 0ms-200ms: Request A (200 tokens) Time: 200ms-400ms: Request B (150 tokens) Time: 400ms-600ms: Request C (100 tokens) GPU Utilization: 100%, 0%, 100%, 0%, 100%, 0% Average Utilization: ~50% ``` **Continuous (Iteration‑level) Batching:** 1. Maintain pool of active requests at different stages 2. Each iteration generates one token for selected requests 3. Batch composition changes dynamically each iteration 4. New requests can join, completed requests exit **Continuous Batch Timeline:** ``` Iteration 1: Request A (token 1), Request B (token 1) Iteration 2: Request A (token 2), Request B (token 2), Request C (token 1) Iteration 3: Request A (token 3), Request B (complete), Request C (token 2) GPU Utilization: Consistently 85-95% ``` **Key Benefits:** 1. **Higher Throughput**: Process more tokens per second 2. **Lower Latency**: Requests start immediately 3. **Better Fairness**: No single request blocks others 4. **Adaptive**: Handles variable‑length requests efficiently #### **Visualization 1: Timeline Comparison** ```mermaid gantt title GPU Utilization Timeline: Sequential vs Continuous Batching dateFormat HH:mm:ss.SSS axisFormat %S.%L section Sequential Batching (Low Utilization) Request A (200ms) :a1, 00:00:00.000, 200ms Request B (200ms) :a2, after a1, 200ms Request C (200ms) :a3, after a2, 200ms section Continuous Batching (High Utilization) Request A (40ms slots) :b1, 00:00:00.000, 40ms Request B (40ms slots) :b2, 00:00:00.040, 40ms Request C (40ms slots) :b3, 00:00:00.080, 40ms Request A (cont.) :b4, after b3, 40ms Request B (cont.) :b5, after b4, 40ms Request C (cont.) :b6, after b5, 40ms Request A (final) :b7, after b6, 40ms Request B (final) :b8, after b7, 40ms Request C (final) :b9, after b8, 40ms ``` #### **Visualization 2: Batch Composition Evolution** ```mermaid graph TB subgraph "Time Step t₀ (New Requests Arrive)" A[Batch t₀] --> B[Request R1: Prefill Process prompt] A --> C[Request R2: Prefill Process prompt] A --> D[Request R3: Waiting (queue full)] end subgraph "Time Step t₁ (Generation Starts)" E[Batch t₁] --> F[Request R1: Token 1] E --> G[Request R2: Token 1] E --> H[Request R3: Prefill (joins batch)] end subgraph "Time Step t₂ (Progress Continues)" I[Batch t₂] --> J[Request R1: Token 2] I --> K[Request R2: Token 2] I --> L[Request R3: Token 1] I --> M[Request R4: Waiting (new arrival)] end subgraph "Time Step t₃ (Completion & New Join)" N[Batch t₃] --> O[Request R1: COMPLETE (exits batch)] N --> P[Request R2: Token 3] N --> Q[Request R3: Token 2] N --> R[Request R4: Prefill (joins batch)] end ``` ### **Lab 3.2: Request Lifecycle Management** #### **Objective** Design a state machine for tracking request progress through the system. #### **Key Concept** Each request has metadata tracking its progress, scheduling priority, and resource allocation as it moves through different states in the system. #### **Detailed Explanation** **Request States:** 1. **WAITING**: In scheduler queue, not yet scheduled 2. **SCHEDULED**: Selected for next batch, resources allocated 3. **RUNNING**: Currently executing in a batch 4. **PAUSED**: Yielded GPU (waiting for other requests) 5. **COMPLETED**: Generation finished (EOS or max_tokens) 6. **FAILED**: Error occurred **Request Metadata Fields:** - `request_id`: Unique identifier - `prompt_tokens`: Tokenized input - `generated_tokens`: List of output tokens so far - `kv_cache_blocks`: List of block IDs for KV cache - `priority`: Scheduling priority score - `tokens_remaining`: Tokens left to generate - `arrival_time`: When request was received - `start_time`: When generation started - `current_state`: Current state from above **State Transition Rules:** 1. WAITING → SCHEDULED: When scheduler selects request 2. SCHEDULED → RUNNING: When batch execution starts 3. RUNNING → PAUSED: When request yields (prefill complete or time slice) 4. PAUSED → RUNNING: When rescheduled 5. RUNNING → COMPLETED: Generation finishes 6. Any → FAILED: On error 7. COMPLETED/FAILED → [Removed]: Cleanup **Priority Calculation:** ``` priority = base_priority + age_factor + size_factor where: base_priority = 100 for high, 50 for medium, 10 for low age_factor = minutes_waiting × 5 (prevents starvation) size_factor = -tokens_remaining × 0.1 (favors shorter requests) ``` #### **Visualization 1: Request State Transition Diagram** ```mermaid stateDiagram-v2 [*] --> WAITING : Request arrives WAITING --> SCHEDULED : Scheduler selects SCHEDULED --> RUNNING : Batch execution starts state RUNNING { [*] --> PROCESSING PROCESSING --> YIELDING : Prefill complete / time slice YIELDING --> PROCESSING : Rescheduled PROCESSING --> [*] : Token generated } RUNNING --> PAUSED : Yield for fairness PAUSED --> RUNNING : Rescheduled RUNNING --> COMPLETED : EOS / max_tokens RUNNING --> FAILED : Error SCHEDULED --> FAILED : Allocation failed WAITING --> FAILED : Timeout expired COMPLETED --> [*] : Response sent FAILED --> [*] : Error handled note right of WAITING Average wait: 50ms Max wait: 5s Queue position: #3 end note note right of RUNNING Batch: #7 Tokens: 42/100 Cache: 8 blocks Progress: 42% end note ``` #### **Visualization 2: Request Metadata Structure** ```mermaid classDiagram class Request { +String request_id +List~Integer~ prompt_tokens +List~Integer~ generated_tokens +List~Integer~ kv_cache_blocks +Integer priority +Integer tokens_remaining +DateTime arrival_time +DateTime start_time +State current_state +Integer max_tokens +Float temperature +get_progress() Float +get_elapsed_time() Duration +get_estimated_time_remaining() Duration +needs_prefill() Boolean +is_complete() Boolean +calculate_priority() Integer } class RequestManager { +Map~String, Request~ active_requests +Integer max_concurrent +Integer total_processed +add_request(Request) Boolean +remove_request(String) Boolean +get_request(String) Request +get_waiting_requests() List~Request~ +get_running_requests() List~Request~ +update_request_state(String, State) void } Request "1" --> "1" RequestManager : managed_by ``` --- ## **Module 4: Scheduler Design** *Intelligently selecting which requests to run* ### **Lab 4.1: Scheduling Policies** #### **Objective** Compare different scheduling algorithms for LLM inference and understand their tradeoffs. #### **Key Concept** The scheduler must balance three competing goals: throughput (tokens/sec), latency (response time), and fairness (equal treatment of requests). #### **Detailed Explanation** **1. FIFO (First‑In, First‑Out)** - Process requests in arrival order - **Pros**: Simple, fair wait time - **Cons**: Poor throughput (long requests block short ones), high tail latency - **Use case**: Simple systems, testing **2. Shortest Job First** - Prioritize requests with fewest tokens remaining - **Pros**: Maximizes throughput, minimizes average latency - **Cons**: Starvation problem (long requests may never run), unfair - **Use case**: Batch processing systems **3. Round Robin** - Each request gets equal time slice - **Pros**: Fair GPU time allocation - **Cons**: Poor cache locality (frequent context switches), moderate throughput - **Use case**: Multi‑tenant systems **4. vLLM Hybrid Approach** - Multiple priority queues (high/medium/low) - Token budget per batch - Aging: Increase priority of waiting requests - Preemption: Pause long requests for short ones - **Pros**: Balances all metrics well - **Cons**: More complex implementation - **Use case**: Production LLM serving **Algorithm Pseudocode:** ``` def schedule_hybrid(token_budget, queues): selected = [] remaining = token_budget # Process high priority first for req in queues.high: if req.tokens_needed <= remaining: selected.append(req) remaining -= req.tokens_needed if remaining <= 0: break # Then medium priority for req in queues.medium: if req.tokens_needed <= remaining: selected.append(req) remaining -= req.tokens_needed if remaining <= 0: break # Finally low priority for req in queues.low: if req.tokens_needed <= remaining: selected.append(req) remaining -= req.tokens_needed if remaining <= 0: break return selected ``` #### **Visualization 1: Scheduling Policy Tradeoffs** ```mermaid quadrantChart title "Scheduling Policy Tradeoff Space" x-axis "Low Throughput" --> "High Throughput" y-axis "Low Fairness" --> "High Fairness" "FIFO": [0.2, 0.9] "Round Robin": [0.5, 0.7] "Shortest Job First": [0.9, 0.2] "vLLM Hybrid": [0.85, 0.75] "Ideal": [1.0, 1.0] ``` #### **Visualization 2: Priority Queue System** ```mermaid graph TB subgraph "Priority Queue System" A[Incoming Requests] --> B{Request Analyzer} B --> C[Short < 100 tokens] B --> D[Medium 100-500 tokens] B --> E[Long > 500 tokens] C --> F[High Priority Queue Max wait: 50ms] D --> G[Medium Priority Queue Max wait: 200ms] E --> H[Low Priority Queue Max wait: 1s] subgraph "Scheduler Core" I[Token Budget: 2048] --> J{Selection Algorithm} J --> K[Drain High Priority First] K --> L[Fill with Medium Priority] L --> M[Add Low Priority if space] K --> N[Selected Batch] L --> N M --> N end F --> J G --> J H --> J end subgraph "Queue Statistics" O[High Queue: 3 requests] --> P[Wait times: 10ms, 25ms, 45ms] Q[Medium Queue: 5 requests] --> R[Wait times: 50ms to 180ms] S[Low Queue: 2 requests] --> T[Wait times: 300ms, 450ms] end ``` ### **Lab 4.2: Token Budget Allocation** #### **Objective** Implement dynamic token budgeting based on GPU memory and compute capacity. #### **Key Concept** Each batch can only process a limited number of tokens due to memory constraints (KV cache) and compute constraints (attention operations). #### **Detailed Explanation** **Constraints on Batch Size:** 1. **Memory Constraint**: KV cache memory limits total tokens ``` max_tokens_memory = available_memory ÷ memory_per_token ``` 2. **Compute Constraint**: Attention operations scale with sequence length ``` max_tokens_compute = compute_budget ÷ compute_per_token ``` 3. **Latency Constraint**: Larger batches increase iteration time ``` target_iteration_time = 20-50ms (for responsive streaming) ``` **Token Budget Calculation:** ``` token_budget = min( available_memory / memory_per_token, compute_budget / compute_per_token, hardware_limits.max_tokens_per_batch ) ``` **Dynamic Adjustment:** - Monitor iteration time - If > 50ms: reduce budget by 20% - If < 30ms: increase budget by 10% - Monitor memory fragmentation - If high fragmentation: reduce budget temporarily **Example Calculation (A100 40GB):** - GPU memory: 40GB - Model weights: 20GB - Available for KV: 20GB - Memory per token: 2MB (Llama2‑70B) - Max tokens: 20GB ÷ 2MB = 10,000 tokens - Target batch: 8,000 tokens (80% utilization) - With average 500 tokens/request: 16 requests/batch **Prefill vs Decode Considerations:** - Prefill (prompt processing): Higher memory/compute per token - Decode (generation): Lower memory/compute per token - Mixed batches need separate budgeting #### **Visualization 1: Budget Allocation Process** ```mermaid graph TB subgraph "Budget Allocation Algorithm" A[Start: Budget = 2048 tokens] --> B{High Priority Queue Empty?} B -- No --> C[Take all high priority requests] C --> D[Update: Budget = 2048 - used_tokens] B -- Yes --> D D --> E{Budget > 0?} E -- Yes --> F{Medium Priority Queue Empty?} F -- No --> G[Take medium requests until budget exhausted] G --> H[Update budget remaining] F -- Yes --> H H --> I{Budget > 0?} I -- Yes --> J{Low Priority Queue Empty?} J -- No --> K[Take low requests until budget exhausted] J -- Yes --> L[Final Batch Selected] K --> L end subgraph "Example: Step-by-Step" M[Start: 2048 tokens] --> N[High Priority: 3×128 = 384] N --> O[Remaining: 1664] O --> P[Medium Priority: 4×256 = 1024] P --> Q[Remaining: 640] Q --> R[Low Priority: 2×320 = 640] R --> S[Remaining: 0] S --> T[Batch Complete: 9 requests, 2048 tokens] end ``` #### **Visualization 2: Dynamic Budget Adjustment** ```mermaid graph LR subgraph "Performance Monitoring Loop" A[Measure Iteration Time] --> B{Iteration > 50ms?} B -- Yes --> C[Reduce budget by 20%] B -- No --> D{Iteration < 30ms?} D -- Yes --> E[Increase budget by 10%] D -- No --> F[Keep current budget] C --> G[New Compute Budget] E --> G F --> G end subgraph "Memory Pressure Check" H[Monitor Free Memory] --> I{Free < 15%?} I -- Yes --> J[Reduce budget by 25%] I -- No --> K[No change] J --> L[Memory-Safe Budget] K --> L end subgraph "Final Budget Calculation" G --> M[Compute Budget] L --> N[Memory Budget] M --> O[Take minimum of: Compute Budget, Memory Budget, Hardware Limit] N --> O O --> P[Final Token Budget] end subgraph "Statistics" Q[Current Iteration: 42ms] --> R[Adjustment: -10%] S[Free Memory: 18%] --> T[Adjustment: None] U[Previous Budget: 2048] --> V[New Budget: 1843] end ``` --- ## **Module 5: Paged KV‑Cache Architecture** *Eliminating memory fragmentation* ### **Lab 5.1: The Fragmentation Problem** #### **Objective** Understand how variable‑length sequences cause memory fragmentation in traditional allocation schemes. #### **Key Concept** Allocating contiguous memory for each request leads to external fragmentation, wasting 30‑50% of available memory even when total free space exists. #### **Detailed Explanation** **The Fragmentation Problem Sequence:** 1. **Request A** needs 512 tokens → allocate 512 contiguous blocks 2. **Request A completes** → free 512 blocks 3. **Request B** needs 256 tokens → allocate 256 blocks (uses part of freed space) 4. **Request C** needs 300 tokens → cannot fit in remaining 256 contiguous blocks 5. **Result**: 256 blocks wasted due to fragmentation, even though total free memory exists **Why It Matters for LLMs:** - LLMs require large KV caches (gigabytes for long sequences) - Fragmentation reduces effective memory by 30‑50% - Can cause Out‑Of‑Memory errors even when total free memory exists - Wastes expensive GPU memory resources **Traditional Solutions (and Their Limitations):** 1. **Memory Pools**: Fixed‑size allocations still waste space for variable sizes 2. **Buddy Allocator**: Reduces but doesn't eliminate fragmentation for growing allocations 3. **Slab Allocator**: Works well for fixed sizes, not for variable‑length sequences 4. **Compaction**: Moving live data is expensive and causes pauses **vLLM's Insight**: Treat KV cache like virtual memory: - Divide physical memory into fixed‑size blocks (e.g., 16KB) - Map logical blocks (per request) to physical blocks - Allow non‑contiguous allocation - Enable block reuse across requests **Benefits of Paged Allocation:** 1. **No External Fragmentation**: Blocks can be allocated anywhere 2. **Efficient Reuse**: Freed blocks immediately available for new requests 3. **Predictable Memory**: Fixed block size enables simple management 4. **Defragmentation Possible**: Can move blocks if needed #### **Visualization 1: Fragmentation Problem** ```mermaid graph TB subgraph "Initial Memory State" A[8 Memory Blocks Available] A1[█] --> A2[█] --> A3[█] --> A4[█] --> A5[█] --> A6[█] --> A7[█] --> A8[█] end subgraph "Step 1: Request A (5 blocks)" B[Request A allocated 5 blocks] B1[█ █ █ █ █ ▯ ▯ ▯] style B1 fill:#e1f5fe end subgraph "Step 2: Request A completes" C[5 blocks freed] C1[█ █ █ █ █ ▯ ▯ ▯] end subgraph "Step 3: Request B (3 blocks)" D[Request B allocated 3 blocks] D1[█ █ █ ▯ ▯ ▯ ▯ ▯] style D1 fill:#f3e5f5 end subgraph "Step 4: Request C needs 4 blocks" E[Free blocks: ▯ ▯ ▯ █ █ ▯ ▯ ▯] F[Request C needs 4 contiguous blocks] G[Available: Only 3 contiguous] H[RESULT: CANNOT ALLOCATE! Fragmentation: 38%] style H color:red end A --> B1 --> C1 --> D1 --> E --> F --> G --> H ``` #### **Visualization 2: Paged Allocation Solution** ```mermaid graph TB subgraph "Physical Memory (8 blocks)" P0[Block 0] --> P1[Block 1] --> P2[Block 2] --> P3[Block 3] --> P4[Block 4] --> P5[Block 5] --> P6[Block 6] --> P7[Block 7] end subgraph "Step 1: Request A (5 logical blocks)" A1[Req A, L0] --> P0 A2[Req A, L1] --> P1 A3[Req A, L2] --> P2 A4[Req A, L3] --> P3 A5[Req A, L4] --> P4 style P0 fill:#e1f5fe style P1 fill:#e1f5fe style P2 fill:#e1f5fe style P3 fill:#e1f5fe style P4 fill:#e1f5fe end subgraph "Step 2: Request B (3 logical blocks)" B1[Req B, L0] --> P5 B2[Req B, L1] --> P6 B3[Req B, L2] --> P7 style P5 fill:#f3e5f5 style P6 fill:#f3e5f5 style P7 fill:#f3e5f5 end subgraph "Step 3: Request A completes, Request C (4 blocks)" C1[Req C, L0] --> P0 C2[Req C, L1] --> P1 C3[Req C, L2] --> P2 C4[Req C, L3] --> P3 style P0 fill:#e8f5e8 style P1 fill:#e8f5e8 style P2 fill:#e8f5e8 style P3 fill:#e8f5e8 end subgraph "Memory Status" D[Block 0-4: Freed, then reused] E[Block 5-7: Still used by Request B] F[Total Utilization: 100%] G[Fragmentation: 0%] style G color:green end ``` ### **Lab 5.2: Block Management System** #### **Objective** Design a block manager that handles allocation, deallocation, mapping, and defragmentation. #### **Key Concept** Physical memory is divided into fixed‑size blocks; each request gets logical blocks that are mapped to physical blocks through an allocation table. #### **Detailed Explanation** **Block Manager Components:** 1. **Free List**: Tracks available physical blocks - Implemented as a bitmap or list of free block indices - O(1) allocation and deallocation 2. **Allocation Table**: Maps (request_id, logical_block) → physical_block - Dictionary/hash map for fast lookups - Enables non‑contiguous allocation 3. **Reverse Mapping**: physical_block → (request_id, logical_block) - Needed for defragmentation - Helps with debugging and monitoring 4. **Block Metadata**: Additional per‑block information - Reference count (for shared cache scenarios) - Access time (for LRU eviction if needed) - Status (free, allocated, dirty) **Key Operations:** **Allocation:** ``` def allocate_blocks(request_id, num_blocks): if free_blocks.count < num_blocks: return None # or trigger defragmentation allocated = [] for i in range(num_blocks): physical = free_blocks.pop() logical = i allocation_table[(request_id, logical)] = physical reverse_map[physical] = (request_id, logical) allocated.append(physical) return allocated ``` **Deallocation:** ``` def deallocate_blocks(request_id): # Find all blocks for this request for (req_id, logical), physical in allocation_table: if req_id == request_id: free_blocks.add(physical) del allocation_table[(req_id, logical)] del reverse_map[physical] ``` **Access Pattern:** When computing attention for request R at token position T: 1. Determine which logical block contains token T 2. Look up physical block: `physical = allocation_table[(R, logical_block)]` 3. Load K and V from that physical block in GPU memory **Defragmentation Strategy:** Trigger when: 1. Fragmentation ratio > threshold (e.g., 40%) 2. Large allocation request fails 3. Periodic maintenance Defragmentation steps: 1. Identify movable requests (not currently executing) 2. Copy their blocks to new contiguous location 3. Update allocation table 4. Update free list 5. Resume execution #### **Visualization 1: Block Manager Architecture** ```mermaid classDiagram class BlockManager { +int block_size +int total_blocks +List~int~ free_blocks +Map~(string, int), int~ allocation_table +Map~int, (string, int)~ reverse_map +float fragmentation_threshold +allocate(string request_id, int num_blocks) List~int~ +deallocate(string request_id) void +get_physical_block(string request_id, int logical_block) int +defragment() void +get_fragmentation_ratio() float +get_utilization() float +get_statistics() dict } class Request { +string request_id +List~int~ logical_blocks +List~int~ physical_blocks +int token_count +bool is_executing } class MemoryMonitor { +BlockManager block_manager +float current_fragmentation +float average_utilization +int defragmentation_count +monitor() void +should_defragment() bool +trigger_defragmentation() void } BlockManager "1" --> "*" Request : manages MemoryMonitor "1" --> "1" BlockManager : monitors ``` #### **Visualization 2: Defragmentation Process** ```mermaid graph TB subgraph "Before Defragmentation" A[Physical Memory State] B[Request A: █ █ █ █] --> C[Free: ▯ ▯] --> D[Request B: █ █] --> E[Free: ▯] --> F[Request C: █ █ █] --> G[Free: ▯ ▯ ▯ ▯] H[Statistics] --> I[Total Free: 7 blocks (44%)] H --> J[Largest Contiguous Free: 4 blocks] H --> K[Fragmentation Ratio: 43%] style K color:orange end subgraph "Defragmentation Steps" L[Step 1: Identify movable requests] --> M[Request B (2 blocks) not executing] M --> N[Step 2: Copy Request B to end] N --> O[Block 4 → Block 10] N --> P[Block 5 → Block 11] O --> Q[Step 3: Update allocation table] P --> Q Q --> R[Step 4: Update free list] end subgraph "After Defragmentation" S[Physical Memory State] T[Request A: █ █ █ █] --> U[Request C: █ █ █] --> V[Free: ▯ ▯ ▯ ▯ ▯ ▯ ▯] --> W[Request B: █ █] X[Statistics] --> Y[Total Free: 7 blocks (44%)] X --> Z[Largest Contiguous Free: 7 blocks] X --> AA[Fragmentation Ratio: 0%] style AA color:green end subgraph "Improvement" AB[Before: Could allocate max 4 blocks] AC[After: Can allocate up to 7 blocks] AD[Improvement: 75% increase] end ``` --- ## **Module 6: Execution Engine** *Orchestrating all components* ### **Lab 6.1: Engine Loop Architecture** #### **Objective** Design the main execution loop that coordinates all components efficiently. #### **Key Concept** The engine must pipeline operations to hide latency, overlap computation with data movement, and maximize GPU utilization. #### **Detailed Explanation** **Main Loop Phases:** 1. **Collect Phase**: Gather new requests from API gateway - Non‑blocking polling - Rate limiting and validation - Create request objects 2. **Schedule Phase**: Select batch for next iteration - Consult scheduler with token budget - Prioritize based on queue state - Balance throughput and latency 3. **Prepare Phase**: Setup for execution - Allocate KV cache blocks for new requests - Load prompt tokens for prefill requests - Prepare input tensors 4. **Execute Phase**: Run model forward pass - Load model weights to shared memory - Execute attention computations - Update hidden states layer by layer - Pipeline weight loading with computation 5. **Sample Phase**: Generate next tokens - Apply sampling strategy (temperature, top‑p, etc.) - Select next tokens for each request - Handle EOS tokens 6. **Update Phase**: Maintain system state - Update request progress - Free completed requests - Update KV cache - Collect metrics 7. **Stream Phase**: Send output to clients - Immediate streaming for responsive UX - Handle client disconnections - Maintain connection state **Pipelining Opportunities:** 1. **Weight Prefetching**: Load next layer's weights while computing current layer 2. **KV Cache Preloading**: Load next block of KV cache during attention computation 3. **Output Streaming**: Send tokens as they're generated, not at end 4. **Overlap Compute/IO**: GPU computation while CPU prepares next batch **Performance Optimizations:** 1. **Kernel Fusion**: Combine multiple operations into single GPU kernel 2. **Memory Coalescing**: Arrange data for efficient memory access 3. **Tensor Cores**: Utilize specialized hardware for matrix operations 4. **Asynchronous Execution**: Overlap data transfers with computation #### **Visualization 1: Execution Pipeline Timeline** ```mermaid gantt title Execution Engine Pipeline Timeline (One Iteration) dateFormat SS axisFormat %L section Phase 1: Input Preparation Tokenize & Pad Input :p1, 00:00, 0.3s Allocate KV Cache :p1a, after p1, 0.2s section Phase 2: Layer 1 Execution Load Weights L1 :p2, 00:00.2, 0.2s Compute Attention L1 :p2a, after p2, 0.8s FFN L1 :p2b, after p2a, 0.5s section Phase 3: Layer 2 Execution Load Weights L2 :p3, 00:00.4, 0.2s Compute Attention L2 :p3a, after p3, 0.8s FFN L2 :p3b, after p3a, 0.5s section Phase 4: Layer 3...N ... :p4, 00:01.0, 2s section Phase 5: Output Generation Final Layer Norm :p5, 00:02.5, 0.1s Logits & Sampling :p5a, after p5, 0.2s Stream Output :p5b, after p5a, 0.1s section Phase 6: Cleanup Free Resources :p6, 00:02.8, 0.2s ``` #### **Visualization 2: Engine Loop State Diagram** ```mermaid stateDiagram-v2 [*] --> Initializing Initializing --> Ready : Model loaded, memory allocated state Ready { [*] --> Collecting Collecting --> Scheduling : New requests available Scheduling --> Preparing : Batch selected Preparing --> Executing : Inputs ready Executing --> Sampling : Forward pass complete Sampling --> Updating : Tokens generated Updating --> Streaming : State updated Streaming --> [*] : Iteration complete } Ready --> Error : GPU error, OOM, etc. Error --> Diagnosing : Analyze error Diagnosing --> Recovering : Recovery plan Recovering --> Ready : Recovery successful Recovering --> Failed : Cannot recover Failed --> [*] : Shutdown note right of Ready Each loop iteration: - Processes one token - For all requests in batch - Typically 20-50ms duration end note ``` ### **Lab 6.2: Failure Mode Analysis** #### **Objective** Design robustness mechanisms for common failure scenarios in production LLM serving. #### **Key Concept** A production system must detect failures early, handle them gracefully, and recover automatically when possible, maintaining service availability. #### **Detailed Explanation** **Common Failure Modes:** 1. **Out of Memory (OOM)** - **Causes**: Too many concurrent requests, very long sequences - **Detection**: Memory allocation fails, free memory < threshold - **Recovery**: - Reduce batch size by 30% - Reject new requests temporarily - Offload some KV cache to CPU (if supported) - Kill oldest/lowest priority requests 2. **Request Timeout** - **Causes**: Slow generation, complex sampling, system overload - **Detection**: Request exceeds SLA time (e.g., 30 seconds) - **Recovery**: - Cancel request and return partial response - Log for performance analysis - Adjust scheduling priorities 3. **GPU Hang/Crash** - **Causes**: Driver bugs, hardware failure, kernel errors - **Detection**: Kernel timeout, GPU not responding - **Recovery**: - Reset GPU context - Restart execution engine - Failover to backup GPU if available - Drain and requeue affected requests 4. **High Fragmentation** - **Causes**: Many variable‑length requests, long runtime - **Detection**: Fragmentation ratio > threshold (e.g., 40%) - **Recovery**: - Trigger defragmentation - Temporarily reject large requests - Adjust block size if dynamically configurable 5. **Model Errors** - **Causes**: Weight corruption, precision issues - **Detection**: NaN/Inf in outputs, validation failures - **Recovery**: - Reload model from disk - Switch to backup model version - Disable problematic layers if possible **Monitoring Metrics for Failure Detection:** 1. **Resource Metrics**: - GPU memory usage (%) - GPU utilization (%) - Memory fragmentation ratio - Temperature (GPU, memory) 2. **Performance Metrics**: - Tokens per second - Requests per second - P50, P90, P99 latency - Batch size over time 3. **Error Metrics**: - OOM count - Timeout count - GPU reset count - Failed requests rate **Recovery Strategy Hierarchy:** 1. **Level 1**: Automatic adjustment (batch size, scheduling) 2. **Level 2**: Request‑level recovery (cancel, retry) 3. **Level 3**: Component restart (engine, scheduler) 4. **Level 4**: Full system restart 5. **Level 5**: Manual intervention #### **Visualization 1: Failure Detection & Recovery Flow** ```mermaid graph TB subgraph "Monitoring System" A[Health Monitor] --> B[Check Metrics] B --> C[GPU Memory < 10%?] B --> D[Any request > 30s?] B --> E[GPU Utilization < 20%?] B --> F[Fragmentation > 40%?] end subgraph "Recovery Actions" C -- Yes --> G[Reduce batch size 30%] D -- Yes --> H[Cancel oldest requests] E -- Yes --> I[Check for GPU hang] F -- Yes --> J[Trigger defragmentation] G --> K[Update configuration] H --> L[Send partial responses] I --> M[Reset GPU context] J --> N[Compact memory] K --> O[Continue operation] L --> O M --> O N --> O end subgraph "Escalation" O --> P{Problem resolved?} P -- No --> Q[Restart execution engine] Q --> R{Problem resolved?} R -- No --> S[Failover to backup] S --> T{Problem resolved?} T -- No --> U[Manual intervention required] end ``` #### **Visualization 2: System Health Dashboard** ```mermaid quadrantChart title "System Health Dashboard - Current Status" x-axis "Critical Issues" --> "Healthy" y-axis "Degraded Performance" --> "Optimal Performance" "GPU Memory": [0.85, 0.75] "GPU Utilization": [0.70, 0.80] "Throughput": [0.60, 0.70] "Latency P99": [0.45, 0.60] "Fragmentation": [0.35, 0.55] "Error Rate": [0.90, 0.85] "Alert Threshold": [0.30, 0.30] "Target Zone": [0.70, 0.70] ``` --- ## **Module 7: API & System Integration** *Building the complete system* ### **Lab 7.1: Request Lifecycle Implementation** #### **Objective** Implement the complete flow from client request to response with proper error handling and streaming support. #### **Key Concept** The API layer must be non‑blocking, support streaming for responsive user experience, handle authentication, rate limiting, and maintain request state across distributed components. #### **Detailed Explanation** **API Design:** **REST Endpoint (Synchronous):** ``` POST /v1/completions Content-Type: application/json { "prompt": "Explain quantum computing", "max_tokens": 200, "temperature": 0.7, "stream": false } ``` **WebSocket/SSE (Streaming):** ``` Client connects → Server streams tokens as generated: data: {"token": "Quantum", "finished": false} data: {"token": " computing", "finished": false} ... data: {"token": ".", "finished": true} ``` **Request Processing Pipeline:** 1. **API Gateway Layer**: - Validate JSON schema - Authenticate API key - Apply rate limits (tokens/second per user) - Create unique request ID - Add to distributed queue 2. **Scheduler Layer**: - Read from queue - Analyze request (length, priority) - Add to appropriate priority queue - Track wait time statistics 3. **Execution Layer**: - Get batch from scheduler - Allocate resources (KV cache) - Execute model - Stream tokens back through message bus 4. **Response Layer**: - Maintain WebSocket/SSE connections - Stream tokens as received - Handle client disconnection - Clean up resources on completion **Concurrency Model:** - **IO Threads**: Handle HTTP/WebSocket (async I/O) - **Worker Threads**: Tokenization, detokenization, request processing - **GPU Thread**: Dedicated thread for model execution - **Background Threads**: Monitoring, logging, metrics collection - **Message Queue**: For inter‑component communication **Error Handling:** 1. **Client Errors**: Invalid requests, rate limit exceeded 2. **Server Errors**: OOM, GPU failures, model errors 3. **Network Errors**: Client disconnection, timeouts 4. **Recovery**: Retry logic, fallback responses, graceful degradation #### **Visualization 1: Complete Request Flow** ```mermaid sequenceDiagram participant C as Client participant L as Load Balancer participant G as API Gateway participant Q as Message Queue participant S as Scheduler participant E as Execution Engine participant M as Memory Manager participant GPU C->>L: HTTP POST /generate L->>G: Forward request G->>G: Validate & authenticate G->>G: Apply rate limits G->>Q: Enqueue request (async) G->>C: 202 Accepted + request_id Note over Q,S: Scheduler polls queue S->>Q: Dequeue request S->>M: Reserve KV cache blocks M-->>S: Block IDs allocated S->>E: Add to next batch Note over E,GPU: Execution loop E->>GPU: Forward pass GPU-->>E: Logits E->>E: Sample tokens E->>Q: Stream token (async) Note over Q,G: Gateway streams to client Q->>G: Token available G->>C: SSE: {"token": "Hello", "finished": false} loop Until complete E->>GPU: Next forward pass GPU-->>E: Logits E->>E: Sample next token E->>Q: Stream token Q->>G: Token available G->>C: SSE: Token update end E->>M: Free blocks E->>Q: Final response Q->>G: Request complete G->>C: SSE: {"finished": true} ``` #### **Visualization 2: API Server Architecture** ```mermaid graph TB subgraph "Load Balancing Tier" A[Load Balancer] --> B[API Server 1] A --> C[API Server 2] A --> D[API Server N] end subgraph "API Server (Stateless)" B --> E[Request Handler] C --> F[Request Handler] D --> G[Request Handler] E --> H[Auth Service] F --> H G --> H H --> I[Rate Limiter] I --> J[Message Queue Producer] end subgraph "Message Bus" J --> K[(Redis/Kafka Queue)] end subgraph "Processing Tier" K --> L[Scheduler 1] K --> M[Scheduler 2] L --> N[GPU Worker Pool] M --> N subgraph "GPU Worker" O[Worker Process] --> P[Model Executor] P --> Q[KV Cache Manager] P --> R[Sampling Engine] end end subgraph "Data Tier" S[(Redis: Request State)] --> T[(Redis: Rate Limits)] U[(PostgreSQL: Logs)] --> V[(Prometheus: Metrics)] end subgraph "Response Streaming" N --> W[Message Queue Consumer] W --> X[WebSocket Manager] X --> Y[Client Connections] end ``` ### **Lab 7.2: System Integration & Deployment** #### **Objective** Package the complete system for production deployment with monitoring, logging, scaling, and maintenance capabilities. #### **Key Concept** A production‑ready system needs observability (metrics, logs, traces), scalability (horizontal and vertical), maintainability (configuration, updates), and resilience (failure recovery, backups). #### **Detailed Explanation** **Deployment Architecture Components:** 1. **Load Balancer**: Distributes traffic across API servers - Health checks for API servers - SSL termination - DDoS protection 2. **API Servers**: Stateless, can scale horizontally - Handle HTTP/WebSocket connections - Authentication and rate limiting - Request validation 3. **GPU Workers**: Stateful (hold model weights) - May need model parallelism for large models - GPU affinity for optimal performance - Health monitoring and automatic recovery 4. **Message Queue**: Decouples components - Redis for fast messaging - Kafka for durability if needed - Request state management 5. **Monitoring Stack**: - Prometheus for metrics collection - Grafana for dashboards - AlertManager for notifications - ELK stack for logs 6. **Configuration Management**: - Environment‑specific configs - Secret management (API keys, model paths) - Dynamic configuration updates **Configuration Example:** ```yaml # config.yaml model: name: "llama2-7b" path: "/models/llama2-7b" precision: "float16" quantization: "none" gpu: memory_limit: "40GB" utilization_target: 0.85 temperature_alert: 85 scheduler: batch_size: 32 token_budget: 8192 policy: "hybrid" max_wait_time: "5s" cache: block_size: "16KB" max_blocks: 262144 defrag_threshold: 0.4 api: port: 8000 max_concurrent_requests: 1000 timeout: "30s" streaming: true monitoring: metrics_port: 9090 log_level: "INFO" alert_rules: "alerts.yml" ``` **Scaling Strategies:** 1. **Vertical Scaling**: Larger GPU (A100 → H100) - More memory for longer sequences - Higher throughput for same batch size - Limited by single GPU capacity 2. **Horizontal Scaling**: More GPU workers - Load balancer distributes requests - Linear scaling for throughput - Model must fit in single GPU or use model parallelism 3. **Model Parallelism**: Split model across GPUs - Tensor parallelism: Split layers - Pipeline parallelism: Different layers on different GPUs - Requires careful load balancing 4. **Hybrid Approach**: Combine above strategies - Multiple GPUs per worker (model parallelism) - Multiple workers (horizontal scaling) - Optimal for very large models **Deployment Strategies:** 1. **Blue‑Green Deployment**: - Deploy new version alongside old - Switch traffic when verified - Roll back if issues detected 2. **Canary Deployment**: - Gradually shift traffic to new version - Monitor metrics closely - Complete rollout if successful 3. **Feature Flags**: - Enable/disable features without deployment - A/B testing capabilities - Rapid rollback **Maintenance Operations:** 1. **Model Updates**: - Load new model version in background - Drain old version, switch to new - Keep old version for rollback 2. **GPU Maintenance**: - Drain worker before maintenance - Update drivers/firmware - Return to pool when ready 3. **Monitoring Maintenance**: - Regular backup of metrics/logs - Dashboard updates - Alert rule tuning #### **Visualization 1: Complete Deployment Architecture** ```mermaid graph TB subgraph "Client Layer" A[Web Clients] --> B[Mobile Apps] A --> C[API Consumers] end subgraph "Network Layer" D[CDN] --> E[WAF - Web Application Firewall] E --> F[Load Balancer] end subgraph "Application Layer" F --> G[API Gateway 1] F --> H[API Gateway 2] F --> I[API Gateway N] G --> J[Request Router] H --> K[Request Router] J --> L[Rate Limiter] K --> L L --> M[Auth Service] end subgraph "Processing Layer" M --> N[Message Queue] N --> O[Scheduler Cluster] O --> P[GPU Worker Pool] subgraph "GPU Worker Node" Q[Docker Container] --> R[Model Executor] R --> S[KV Cache Manager] R --> T[Monitoring Agent] end end subgraph "Data Layer" U[(Redis: Cache & Queue)] --> V[(PostgreSQL: Metadata)] W[(S3: Model Weights)] --> X[(S3: Logs Archive)] end subgraph "Monitoring Layer" Y[Prometheus: Metrics] --> Z[Grafana: Dashboards] AA[ELK Stack: Logs] --> BB[Alert Manager] BB --> CC[PagerDuty/Slack/Email] end subgraph "Management Layer" DD[Kubernetes] --> EE[Auto-scaling] FF[Terraform] --> GG[Infrastructure as Code] HH[GitLab CI/CD] --> II[Automated Deployment] end ``` #### **Visualization 2: Configuration Management Hierarchy** ```mermaid graph TB A[nano-vLLM Configuration] --> B[Model Configuration] A --> C[GPU Configuration] A --> D[Scheduler Configuration] A --> E[Cache Configuration] A --> F[API Configuration] A --> G[Monitoring Configuration] B --> B1[name: "llama2-7b"] B --> B2[path: "/models/llama2-7b"] B --> B3[precision: "float16"] B --> B4[quantization: "none"] C --> C1[memory_limit: "40GB"] C --> C2[utilization_target: 0.85] C --> C3[temperature_monitoring: true] D --> D1[batch_size: 32] D --> D2[token_budget: 8192] D --> D3[policy: "hybrid"] D --> D4[max_wait_time: "5s"] E --> E1[block_size: "16KB"] E --> E2[max_blocks: 262144] E --> E3[defrag_threshold: 0.4] F --> F1[port: 8000] F --> F2[max_concurrent: 1000] F --> F3[timeout: "30s"] F --> F4[streaming: true] G --> G1[metrics_port: 9090] G --> G2[log_level: "INFO"] G --> G3[alert_rules: "alerts.yml"] ``` --- ## **Performance Benchmarks & Expected Results** ### **Expected Performance Metrics** **Throughput Scaling:** ```mermaid xychart-beta title "Throughput Scaling with Batch Size" x-axis ["1", "4", "8", "16", "32", "64"] y-axis "Tokens/Second" 0 --> 20000 line [1200, 3800, 6800, 11000, 15000, 18000] bar [5, 32, 68, 92, 94, 95] legend ["Tokens/Second", "GPU Utilization %"] ``` **Latency Distribution:** ```mermaid graph LR subgraph "Latency Percentiles (milliseconds)" A[P50: 45ms] --> B[P90: 89ms] B --> C[P95: 120ms] C --> D[P99: 195ms] D --> E[P99.9: 420ms] end subgraph "Comparison vs Baseline" F[Baseline P99: 450ms] --> G[Improvement: 2.3×] H[nano-vLLM P99: 195ms] --> G end ``` **Memory Efficiency Improvement:** ```mermaid xychart-beta title "Memory Efficiency: Before vs After Optimization" x-axis ["256 tokens", "1024 tokens", "4096 tokens", "16384 tokens"] y-axis "Effective Memory Utilization" 0 --> 100 bar [28, 22, 18, 15] bar [95, 94, 92, 90] legend ["Traditional Allocation", "Paged Allocation"] ``` --- ## **Educational Pathway & Progress Tracking** ### **Learning Journey Timeline** **Week 1-2: Foundation** ```mermaid journey title Week 1-2: Foundation Skills section Day 1-2 Understand Transformer Architecture: 5: Completed section Day 3-4 Implement Basic Inference: 4: Completed section Day 5-7 Add KV Caching: 5: Completed section Day 8-10 Benchmark Performance: 4: Completed section Day 11-14 Optimization Techniques: 3: In Progress ``` **Week 3-4: Batching & Scheduling** ```mermaid gantt title Week 3-4: Batching & Scheduling dateFormat YYYY-MM-DD section Continuous Batching Design Request State Machine :a1, 2024-01-15, 2d Implement Batch Scheduler :a2, after a1, 3d Test GPU Utilization :a3, after a2, 2d section Scheduling Policies Implement FIFO Scheduler :b1, 2024-01-17, 2d Add Priority Queues :b2, after b1, 3d Implement Hybrid Scheduler :b3, after b2, 3d section Performance Testing Benchmark Throughput :c1, 2024-01-22, 2d Measure Latency :c2, after c1, 2d Optimize Parameters :c3, after c2, 2d ``` ### **Skill Development Progress** ```mermaid radar-chart title "Skill Development Through nano-vLLM Labs" axis "Systems Architecture" [0, 10] axis "GPU Programming" [0, 10] axis "Memory Management" [0, 10] axis "API Design" [0, 10] axis "Performance Tuning" [0, 10] axis "Debugging Skills" [0, 10] "Before Starting" [2, 1, 1, 2, 1, 2] "After Module 4" [6, 4, 5, 4, 5, 4] "After Completion" [9, 7, 8, 7, 8, 7] "Target Mastery" [10, 9, 9, 8, 9, 8] ``` --- ## **Capstone Project Ideas** ### **Option A: Add Speculative Decoding** 1. **Design**: Small draft model predicts multiple tokens ahead 2. **Implementation**: Large model verifies in parallel 3. **Expected Improvement**: 2‑3× speedup for easy prompts 4. **Challenges**: Handling verification failures, maintaining quality ### **Option B: Implement Tensor Parallelism** 1. **Design**: Split model across 2‑4 GPUs 2. **Implementation**: Model partitioning, communication layer 3. **Expected Improvement**: Scale to larger models 4. **Challenges**: Communication overhead, load balancing ### **Option C: Build Monitoring Dashboard** 1. **Design**: Real‑time metrics visualization 2. **Implementation**: Prometheus + Grafana integration 3. **Features**: GPU utilization, cache fragmentation, request latency 4. **Challenges**: Real‑time data processing, alerting system ### **Option D: Multi‑Model Support** 1. **Design**: Load multiple models simultaneously 2. **Implementation**: Dynamic model switching, shared cache 3. **Features**: Model routing, automatic selection 4. **Challenges**: Memory management, scheduling conflicts --- ## **Conclusion** ### **Key Learnings** Building nano‑vLLM from scratch teaches essential skills: 1. **Systems Thinking**: Identify bottlenecks in complex systems 2. **Performance Analysis**: Quantify tradeoffs between memory and compute 3. **Architecture Design**: Create clean interfaces between components 4. **Optimization Techniques**: Apply caching, batching, pipelining 5. **Production Readiness**: Handle failures, monitor metrics, scale horizontally ### **Core Principles Demonstrated** 1. **KV Caching**: Transform O(n²) → O(n) via memory‑compute tradeoff 2. **Continuous Batching**: Maximize GPU utilization through request interleaving 3. **Paged Memory**: Eliminate fragmentation via block‑based allocation 4. **Hybrid Scheduling**: Balance throughput, latency, and fairness 5. **Streaming Architecture**: Design for incremental output from start ### **Next Steps** 1. **Read the vLLM Paper**: "Efficient Memory Management for Large Language Model Serving with PagedAttention" 2. **Explore the Code**: github.com/vllm‑project/vllm 3. **Benchmark Your Implementation**: Compare against Hugging Face Transformers 4. **Join the Community**: vLLM Discord, research paper discussions 5. **Extend Further**: Add your own optimization or feature ### **Final Challenge** Extend nano‑vLLM with one original optimization not present in vLLM. Document your design decisions, implement the feature, measure the performance improvement, and share your findings with the community. --- *Document Version: 4.0 | Complete Build‑from‑Scratch Guide with Enhanced Visualizations* *Educational Use | Based on vLLM Architecture Principles* *All diagrams verified for accuracy and readability*

Syntax	Example	Reference
# Header	Header	基本排版
- Unordered List	Unordered List
1. Ordered List	Ordered List
- [ ] Todo List	Todo List
> Blockquote	Blockquote
Bold font	Bold font
Italics font	Italics font
~~Strikethrough~~	~~Strikethrough~~
19^th^	19^th
H~2~O	H₂O
++Inserted text++	Inserted text
==Marked text==	Marked text
[link text](https:// "title")	Link
![image alt](https:// "title")	Image
`Code`	`Code`	在筆記中貼入程式碼
```javascript var i = 0; ```	`var i = 0;`
:smile:		Emoji list
{%youtube youtube_id %}	Externals
$L^aT_eX$	L^aT_eX
:::info This is a alert area. :::	This is a alert area.