Article · Wikipedia archive · Last revised Jun 25, 2026

SGLang

SGLang
Developer: LMSYS
Initial release: January 17, 2024
Written in: Python, Rust, CUDA, C++
Type: Large language model inference engine
License: Apache License 2.0
Website: sglang.io
Repository: github.com/sgl-project/sglang

SGLang (short for Structured Generation Language) is an open-source framework for programming and serving large language models and multimodal models. It was introduced by researchers affiliated with LMSYS[1] and other institutions as a system combining a Python-embedded language for structured generation with a runtime for high-throughput inference.[2][3][4]

The project is designed for low-latency, high-throughput inference workloads. Its documentation describes support for structured outputs, speculative decoding, continuous batching, quantization, and compatibility with OpenAI-style APIs.[5]

History

SGLang was publicly introduced in January 2024 by researchers affiliated with Stanford, UC Berkeley, Texas A&M, and Shanghai Jiao Tong University.[2] Its academic description later appeared in the proceedings of NeurIPS 2024.[3] In January 2026, TechCrunch reported that contributors associated with the project had formed the startup RadixArk to commercialize services around SGLang while continuing its open-source development.[6][7]

Architecture

According to the NeurIPS paper, SGLang consists of two main components: a front-end language embedded in Python and a back-end runtime that executes language model programs efficiently.[3] The front end provides primitives for generation, selection, and parallel control flow, while the runtime applies optimizations intended to reduce repeated computation and improve throughput.[3]

Among the techniques described by the project are RadixAttention for reusing key–value cache state across multiple generation calls, compressed finite-state machines for faster constrained decoding, and speculative execution for API-based models.[3] The current documentation also describes support for serving both language models and multimodal models across a range of hardware back ends.[5]
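
The idea of reusing cached state across calls that share a prefix can be illustrated with a small sketch. The example below is not SGLang's implementation: it is a toy token-level trie, whereas a radix tree would also compress chains of single-child nodes, and it only counts tokens rather than storing key–value tensors. The class names, method names, and token ids are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Children keyed by token id. A real radix tree would merge chains of
    # single-child nodes into one edge; this toy trie does not.
    children: dict = field(default_factory=dict)


class PrefixCache:
    """Toy prefix index that records which token prefixes have been seen."""

    def __init__(self):
        self.root = Node()

    def match_prefix(self, tokens):
        """Return how many leading tokens are already present in the cache."""
        node, matched = self.root, 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                break
            node, matched = child, matched + 1
        return matched

    def insert(self, tokens):
        """Record a token sequence so later requests can reuse its prefix."""
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, Node())

    def process(self, tokens):
        """Return (reused, computed) token counts for one request."""
        reused = self.match_prefix(tokens)
        self.insert(tokens)
        return reused, len(tokens) - reused


cache = PrefixCache()
system_prompt = [101, 7, 7, 42]  # hypothetical token ids for a shared prompt

# The first request finds nothing cached; the second shares the prompt prefix.
first = cache.process(system_prompt + [5, 6])
second = cache.process(system_prompt + [9])
print(first, second)
```

In this sketch the second request reuses the four shared prompt tokens and only its one new token counts as fresh computation, which is the kind of saving that prefix reuse is intended to provide.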

References

  1. "LMSYS". GitHub. GitHub, Inc. Retrieved April 22, 2026.
  2. "Fast and Expressive LLM Inference with RadixAttention and SGLang". LMSYS Org. January 17, 2024. Retrieved April 19, 2026.
  3. Zheng, Lianmin; Yin, Liangsheng; Xie, Zhiqiang; Sun, Chuyue; Huang, Jeff; Yu, Cody Hao; Cao, Shiyi; Kozyrakis, Christos; Stoica, Ion; Gonzalez, Joseph E.; Barrett, Clark; Sheng, Ying (2024). SGLang: Efficient Execution of Structured Language Model Programs (PDF). Advances in Neural Information Processing Systems 37. Retrieved April 19, 2026.
  4. "SGLang". UC Berkeley Sky Computing Lab. April 25, 2024. Retrieved April 22, 2026.
  5. "SGLang Documentation". SGLang. Retrieved April 19, 2026.
  6. Hu, Krystal (January 21, 2026). "Sources: Project SGLang spins out as RadixArk with $400M valuation as inference market explodes". TechCrunch. Retrieved April 19, 2026.
  7. R, Vignesh (January 23, 2026). "From Berkeley lab to $400M startup: SGLang becomes RadixArk". TFN. Retrieved April 22, 2026.