---
title: SGLang paper study
---

# SGLang paper study

## Original Paper
https://arxiv.org/pdf/2312.07104
## Title
SGLang: Efficient Execution of Structured Language Model Programs
## Summary
SGLang is a framework designed to improve the programming and execution efficiency of "Language Model (LM) Programs"—complex applications that use multiple large language model (LLM) calls, control flow, and structured inputs/outputs.

## Frontend Language
A domain-specific language embedded in Python that provides primitives for generation (e.g., `gen`, `select`) and parallelism control (e.g., `fork`, `join`). It simplifies complex workflows, such as multi-modal processing and chained LLM calls, into readable code.
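To make the primitives concrete, here is a minimal self-contained mock of the `gen`/`select`/`fork` semantics. This is not the real sglang API: `State` and `stub_model` are hypothetical stand-ins (the stub always answers "42", and `select` just takes the first choice rather than scoring options with a model), but the shape of the program mirrors how such primitives compose.

```python
from dataclasses import dataclass, field

def stub_model(prompt: str) -> str:
    # Stand-in for a real LLM backend: always answers "42".
    return "42"

@dataclass
class State:
    text: str = ""
    vars: dict = field(default_factory=dict)

    def gen(self, name: str) -> "State":
        # Ask the backend to continue the prompt and record the output.
        out = stub_model(self.text)
        self.vars[name] = out
        self.text += out
        return self

    def select(self, name: str, choices: list) -> "State":
        # A real runtime scores each choice with the model;
        # this mock simply picks the first option.
        self.vars[name] = choices[0]
        self.text += choices[0]
        return self

    def fork(self, n: int) -> list:
        # Branch the prompt so continuations can run in parallel
        # while sharing the common prefix.
        return [State(self.text, dict(self.vars)) for _ in range(n)]

s = State("Q: What is 6 * 7?\nA: ")
s.gen("answer")
branches = s.fork(2)
```

The forked states all start from the same prefix, which is exactly the sharing pattern the runtime's cache reuse exploits.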
## SGLang Runtime (SRT)
A backend optimized to accelerate execution through several novel techniques:
### RadixAttention
Automatically reuses the Key-Value (KV) cache across different generation calls by managing it in a radix tree. This reduces redundant computation for shared prompt prefixes.
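The prefix-reuse idea can be sketched with a toy per-token trie keyed on token IDs: matching a new request against the tree tells you how many leading tokens already have cache entries that could be reused. (The actual design compresses runs of tokens onto single edges and adds LRU eviction, and the nodes hold real KV tensors; the names here are illustrative only.)

```python
class RadixNode:
    def __init__(self):
        self.children = {}  # token id -> RadixNode

class RadixCache:
    def __init__(self):
        self.root = RadixNode()

    def match_and_insert(self, tokens):
        """Return the number of leading tokens already cached
        (whose KV entries could be reused), inserting the rest."""
        node, hits = self.root, 0
        matching = True
        for tok in tokens:
            if matching and tok in node.children:
                node = node.children[tok]
                hits += 1
            else:
                matching = False
                child = RadixNode()
                node.children[tok] = child
                node = child
        return hits

cache = RadixCache()
cold = cache.match_and_insert([1, 2, 3, 4])  # nothing cached yet
warm = cache.match_and_insert([1, 2, 3, 9])  # shares the prefix [1, 2, 3]
```

The second request reuses three of its four tokens, which is why workloads with shared prompt prefixes (few-shot templates, forked branches) benefit so much.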
### Compressed Finite State Machines (FSMs)
Accelerates structured output decoding (like JSON) by analyzing constraints and decoding multiple tokens in a single step whenever possible.
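A toy character-level version of the idea, under simplifying assumptions: the constraint is a template where every position is either a forced literal or a free choice (`D` for "any digit"). Runs of forced positions have exactly one legal continuation, so the decoder can emit the whole run in one compressed step instead of one model query per token. This is far simpler than a real regex-derived FSM over a tokenizer's vocabulary, but shows the compression.

```python
def build_dfa(template: str):
    # Each position is either a forced literal character or None,
    # where None marks a choice point the model must fill ('D' = digit).
    return [None if ch == "D" else ch for ch in template]

def decode(dfa, model_choice):
    out, i = [], 0
    while i < len(dfa):
        # Scan the maximal run of forced positions.
        j = i
        while j < len(dfa) and dfa[j] is not None:
            j += 1
        if j > i:
            out.append("".join(dfa[i:j]))  # one compressed multi-char step
        if j < len(dfa):
            out.append(model_choice())     # one constrained model step
            j += 1
        i = j
    return out

steps = decode(build_dfa('{"a": D}'), lambda: "7")
```

Here the forced prefix `{"a": ` is emitted as a single step, the model supplies only the digit, and the closing `}` is again forced, so a nine-character output costs one model call.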
### API Speculative Execution
Optimizes multi-call programs for black-box API models (e.g., GPT-4) to reduce latency and token costs.
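One way this can work, sketched with a mock endpoint (`mock_api`, the `speculation` cache, and `gen` are all hypothetical names, not a real client library): the first call lets the black-box model run past its stop string, and the surplus text is kept; if the program's next prompt extends the first one in a way the surplus already covers, the second answer is served locally instead of paying for another API call.

```python
def mock_api(prompt: str) -> str:
    # Stand-in for a black-box API that happens to continue
    # a form with several fields at once.
    return "Alice\nage: 30\n"

calls = 0
speculation = {"prompt": None, "text": None}

def gen(prompt: str, stop: str) -> str:
    global calls
    spec = speculation["text"]
    if speculation["prompt"] and prompt.startswith(speculation["prompt"]):
        # The new prompt extends the old one; check whether the
        # over-generated suffix already contains the extension.
        extra = prompt[len(speculation["prompt"]):]
        if spec and spec.startswith(extra):
            return spec[len(extra):].split(stop)[0]  # no API call
    calls += 1
    text = mock_api(prompt)
    first = text.split(stop)[0]
    # Cache everything the model produced beyond the stop point.
    speculation["prompt"] = prompt + first + stop
    speculation["text"] = text[len(first) + len(stop):]
    return first

name = gen("name: ", "\n")                          # one real API call
age = gen("name: " + name + "\nage: ", "\n")        # served from surplus
```

The trade-off is that the speculated continuation may not match what the program asks for next, in which case the runtime simply falls back to a normal call.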

## Key Performance Results
Evaluations across various models (Llama-7B/70B, Mixtral-8x7B) and tasks (agent control, logical reasoning, JSON decoding) demonstrate significant improvements over state-of-the-art systems like vLLM, Guidance, and LMQL:
### Throughput
Achieves up to 6.4x higher throughput.
### Latency
Reduces latency by up to 3.7x.
### Multi-modal Performance
Provides up to 6x higher throughput for image and video models like LLaVA.
### Real-world Efficiency
In production deployments like Chatbot Arena, RadixAttention achieved cache hit rates between 52.4% and 74.1%, significantly reducing first-token latency.