vLLM

Original authors	Sky Computing Lab, UC Berkeley
Developer	vLLM contributors
Initial release	2023
Written in	Python, CUDA, C++
Type	Large language model inference engine
License	Apache License 2.0
Website	vllm.ai
Repository	github.com/vllm-project/vllm

vLLM is an open-source software framework for inference and serving of large language models and related multimodal models. Originally developed at the University of California, Berkeley's Sky Computing Lab,¹ the project is centered on PagedAttention, a memory-management method for transformer key–value caches, and supports features such as continuous batching, distributed inference, quantization, and OpenAI-compatible APIs.²³⁴ According to a project maintainer, the "v" in vLLM originally referred to "virtual", inspired by virtual memory.⁵

History

vLLM was introduced in 2023 by researchers affiliated with the Sky Computing Lab at UC Berkeley.²³ Its core ideas were described in the 2023 paper "Efficient Memory Management for Large Language Model Serving with PagedAttention",⁶ which presented the system as a high-throughput and memory-efficient serving engine for large language models.³

In May 2025, the PyTorch Foundation announced that vLLM had become a Foundation-hosted project. According to PyTorch's project page, the University of California, Berkeley had contributed vLLM to the Linux Foundation in July 2024.⁷⁴

In January 2026, TechCrunch reported that the creators of vLLM had launched the startup Inferact to commercialize the project, raising $150 million in seed funding.⁸

Architecture

According to its 2023 paper, vLLM was designed to improve the efficiency of large language model serving by reducing memory waste in the key–value cache used during transformer inference.³ The paper introduced PagedAttention, an algorithm inspired by virtual memory and paging techniques in operating systems, and described vLLM as using block-level memory management and request scheduling to increase throughput while maintaining similar latency.³
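
The block-level memory management described above can be illustrated with a toy allocator. The sketch below is not vLLM's implementation: the class name PagedKVCache, its methods, and the block size of 4 tokens are assumptions made for illustration. It shows only the general idea of a per-request block table that maps logical token positions to fixed-size physical blocks allocated on demand, so that unused reserved space is limited to the last block of each request.

```python
class PagedKVCache:
    """Toy block-level KV-cache allocator (illustrative only, not vLLM's code)."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size                 # tokens per block (assumed value)
        self.free_blocks = list(range(num_blocks))   # pool of physical block ids
        self.block_tables = {}                       # request id -> physical block ids
        self.lengths = {}                            # request id -> tokens stored

    def append_token(self, request_id):
        """Reserve space for one more token; return the physical block used."""
        table = self.block_tables.setdefault(request_id, [])
        length = self.lengths.get(request_id, 0)
        if length == len(table) * self.block_size:
            # Every block mapped so far is full, so map one more physical block.
            if not self.free_blocks:
                raise MemoryError("no free KV-cache blocks")
            table.append(self.free_blocks.pop())
        self.lengths[request_id] = length + 1
        return table[length // self.block_size]

    def free(self, request_id):
        """Return all blocks of a finished request to the pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)

    def wasted_slots(self, request_id):
        """Reserved but unused token slots (always fewer than block_size)."""
        reserved = len(self.block_tables[request_id]) * self.block_size
        return reserved - self.lengths[request_id]


cache = PagedKVCache(num_blocks=8, block_size=4)
for _ in range(6):
    cache.append_token("a")   # 6 tokens occupy 2 blocks
for _ in range(3):
    cache.append_token("b")   # 3 tokens occupy 1 block
```

In this example, request "a" holds two blocks that need not be contiguous in the pool, and the unused space per request never reaches a full block. Real systems also have to handle block sharing, eviction, and GPU memory layout, none of which is modeled here.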

The project documentation and repository describe support for continuous batching, chunked prefill, speculative decoding, prefix caching, quantization, and multiple forms of distributed inference and serving.²⁴ PyTorch has described vLLM as a high-throughput, memory-efficient inference and serving engine that supports a range of hardware back ends, including NVIDIA and AMD GPUs, Google TPUs, AWS Trainium, and Intel processors.⁷⁴
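
Continuous batching, one of the features listed above, can be illustrated with a small simulation. The sketch below is a simplified model and not vLLM's scheduler: the function names and the example output lengths are assumptions, prefill cost is ignored, and every request is assumed to generate at least one token. It counts decode iterations when a batch must wait for its longest request (static batching) and when a finished request's slot is refilled immediately (continuous batching).

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a finished request's slot is refilled immediately."""
    waiting = deque(lengths)
    running = []   # remaining tokens for each active request
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots before each iteration.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One iteration emits one token per active request; requests with
        # a single token left finish in this step and release their slot.
        running = [r - 1 for r in running if r > 1]
        steps += 1
    return steps


# Hypothetical output lengths (in tokens) for six queued requests.
output_lengths = [2, 8, 3, 8, 2, 3]
static_steps = static_batching_steps(output_lengths, batch_size=2)
continuous_steps = continuous_batching_steps(output_lengths, batch_size=2)
```

For these example lengths and a batch size of 2, the static schedule takes 19 iterations and the continuous schedule takes 13, because short requests no longer leave a slot idle while a long request in the same batch finishes.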

References

[1] "vLLM - A High-Throughput and Memory-Efficient Inference and Serving Engine for LLMs". UC Berkeley, Sky Computing Lab.
[2] "GitHub - vllm-project/vllm: A high-throughput and memory-efficient inference and serving engine for LLMs". GitHub. GitHub, Inc. Retrieved April 22, 2026.
[3] Kwon, Woosuk; Li, Zhuohan; Zhuang, Siyuan; Sheng, Ying; Zheng, Lianmin; Yu, Cody Hao; Gonzalez, Joseph E.; Zhang, Hao; Stoica, Ion (2023). "Efficient Memory Management for Large Language Model Serving with PagedAttention". Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles. Retrieved April 22, 2026.
[4] "vLLM". PyTorch. PyTorch Foundation. Retrieved April 22, 2026.
[5] "vLLM full name". GitHub. GitHub, Inc. August 23, 2023. Retrieved April 22, 2026.
[6] Kwon, Woosuk; Li, Zhuohan; Zhuang, Siyuan; Sheng, Ying; Zheng, Lianmin; Yu, Cody Hao; Gonzalez, Joseph E.; Zhang, Hao; Stoica, Ion (September 12, 2023). "Efficient Memory Management for Large Language Model Serving with PagedAttention". arXiv:2309.06180 [cs.LG].
[7] "PyTorch Foundation Welcomes vLLM as a Hosted Project". PyTorch. PyTorch Foundation. May 7, 2025. Retrieved April 22, 2026.
[8] Temkin, Marina (January 22, 2026). "Inference startup Inferact lands $150M to commercialize vLLM". TechCrunch. Retrieved April 22, 2026.
