vLLM is the production-grade open-source inference server for open-weights LLMs. Originating at UC Berkeley's Sky Lab and released under Apache 2.0, vLLM is now the canonical inference engine for self-hosted production deployments and the under-the-hood serving layer for many hosted providers, including Together AI and Fireworks. Key features: PagedAttention for memory-efficient KV cache management, continuous batching for high-throughput multi-user serving, an OpenAI-compatible API surface, and native CUDA / ROCm / TPU support. It is the default choice once you've outgrown Ollama and need real production throughput on dedicated GPUs.
vLLM is a Python library for high-throughput LLM inference. Released as an academic project from UC Berkeley's Sky Lab (formerly RISELab) in mid-2023 and now maintained as an open-source project with broad industry contribution, vLLM has become the de facto serving layer for self-hosted open-weights LLMs and the hidden infrastructure behind many hosted-inference providers.
Two architectural ideas distinguish vLLM from earlier inference servers:
PagedAttention, the original research contribution, applies virtual-memory-style paging to the KV cache. Conventional inference servers allocate KV cache contiguously, leaving large fragmented gaps when sequences finish at different times. PagedAttention stores KV in fixed-size blocks managed like virtual memory pages, dramatically reducing memory waste — typically 4× better memory utilisation, which translates directly to higher throughput on the same GPU.
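The mechanism is easiest to see as a sketch. Below is a toy block allocator, illustrative only and nothing like vLLM's CUDA internals: the KV cache is one shared pool of fixed-size blocks, each sequence keeps an indirection table from its logical token positions to physical blocks, and finished sequences return their blocks to the pool immediately, so any block can serve any sequence next.

```python
# Toy sketch of PagedAttention-style KV block management (illustrative only).
BLOCK_SIZE = 16  # tokens per KV block (vLLM's default block size is 16)

class BlockAllocator:
    """Shared pool of physical KV blocks."""
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))

    def allocate(self) -> int:
        return self.free.pop()

    def release(self, block_id: int) -> None:
        self.free.append(block_id)

class Sequence:
    """Maps logical token positions to physical blocks via a block table."""
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []
        self.num_tokens = 0

    def append_token(self) -> None:
        if self.num_tokens % BLOCK_SIZE == 0:  # current block full: grab another
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def finish(self) -> None:
        for b in self.block_table:    # blocks return to the pool at once, so
            self.allocator.release(b) # there are no contiguous-region gaps

pool = BlockAllocator(num_blocks=1024)
seq = Sequence(pool)
for _ in range(40):
    seq.append_token()
print(len(seq.block_table))  # 3 blocks for 40 tokens (ceil(40 / 16))
seq.finish()
```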
Continuous batching (sometimes called dynamic batching or in-flight batching) means new requests can join an in-progress batch instead of waiting for the current batch to finish. Combined with PagedAttention, this lets vLLM serve far more concurrent users on the same hardware than naive batching approaches. The result: hosted providers running vLLM can offer Llama / Qwen / Mistral inference at very competitive per-token prices because the per-GPU throughput is genuinely high.
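Schematically (hypothetical request objects, not vLLM's scheduler code), the serving loop admits queued requests at every decode step and evicts finished ones, so short requests never block behind long ones:

```python
# Schematic of continuous batching -- hypothetical structure for illustration.
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    tokens_left: int  # decode steps remaining for this stub request

    def decode_one_token(self) -> None:
        self.tokens_left -= 1

    def is_finished(self) -> bool:
        return self.tokens_left <= 0

def serve_loop(waiting: deque, max_batch: int) -> int:
    """Run all requests to completion; return the number of decode steps."""
    running: list[Request] = []
    steps = 0
    while waiting or running:
        # New requests join the in-progress batch immediately -- no waiting
        # for the current batch to drain (the heart of continuous batching).
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())

        # One batched decode step for every running sequence.
        for req in running:
            req.decode_one_token()
        steps += 1

        # Finished sequences exit now, freeing batch slots (and KV blocks).
        running = [r for r in running if not r.is_finished()]
    return steps

# Short and long requests mixed: short ones finish and free slots mid-flight.
print(serve_loop(deque(Request(n) for n in (3, 10, 5, 8)), max_batch=2))
```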
By 2026, vLLM is the inference engine you'll find under the hood at most hosted-inference providers serving open-weights models. Together AI uses vLLM-derived infrastructure. Fireworks builds on vLLM. NVIDIA's TensorRT-LLM is the main alternative; SGLang and Hugging Face's TGI (Text Generation Inference) are smaller competitors. If you're self-hosting open-weights models for production, vLLM is the path of least resistance and has the broadest ecosystem. If you're consuming hosted inference, you're probably using vLLM-served endpoints whether you know it or not.
vLLM exposes two surfaces. The API server mode launches an OpenAI-compatible HTTP endpoint at localhost:8000 — drop-in replacement for OpenAI in any code that speaks the OpenAI SDK. The library mode embeds vLLM directly in your Python process for offline batched inference. Both modes share the same engine.
Server mode — the production default. Single command launches an OpenAI-compatible API:
```bash
# pip install vllm

# Start an OpenAI-compatible server
vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --tensor-parallel-size 2 \
  --max-model-len 32768 \
  --port 8000

# Query it like OpenAI
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.3-70B-Instruct",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```
Library mode — for offline batch inference, evaluations, or embedding into a custom serving stack:
```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.3-70B-Instruct", tensor_parallel_size=2)

prompts = [
    "Summarise SA fiscal policy in 2026.",
    "What's stage 6 load shedding?",
]
sampling = SamplingParams(temperature=0.3, max_tokens=512)

outputs = llm.generate(prompts, sampling)
for output in outputs:
    print(output.outputs[0].text)
```
vLLM's OpenAI-compatible mode means agent frameworks that speak OpenAI work unchanged. Point LangGraph's ChatOpenAI at http://your-vllm-server:8000/v1, set the model name, and your LangGraph stack runs against your self-hosted Llama 70B. Same trick works for the OpenAI Agents SDK, ADK via LiteLLM, CrewAI, and AutoGen. This is what makes vLLM operationally pragmatic — you can swap closed-frontier APIs for self-hosted vLLM by changing one config value.
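A minimal sketch, assuming langchain-openai is installed and the server from earlier is running; the key is a throwaway placeholder unless the server was launched with --api-key:

```python
from langchain_openai import ChatOpenAI

# Point LangChain/LangGraph at the self-hosted vLLM server instead of OpenAI.
llm = ChatOpenAI(
    base_url="http://your-vllm-server:8000/v1",
    api_key="not-needed",  # any placeholder unless vLLM was started with --api-key
    model="meta-llama/Llama-3.3-70B-Instruct",
    temperature=0.3,
)

print(llm.invoke("Hello from self-hosted Llama").content)
```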
vLLM sits between Ollama (dev / single-user) and TensorRT-LLM (extreme NVIDIA-specific optimisation) in the inference stack. Hosted providers (Together / Fireworks / Groq) abstract everything below the API surface; vLLM is what you reach for when you want self-host control and broad model + hardware support.
| Layer | Best for | Trade-off |
|---|---|---|
| Ollama | Dev, prototyping, single-user laptop / server | Single-tenant, lower throughput; not for production multi-user |
| vLLM | Production self-host, multi-user, broad model + HW support | More ops complexity than Ollama; less NVIDIA-specific tuning than TensorRT-LLM |
| TensorRT-LLM | Maximum throughput on NVIDIA hardware specifically | NVIDIA-only, more setup complexity, narrower model support |
| Together AI / Fireworks | Don't want to self-host; hosted inference simplicity | Per-token billing; less control; vendor lock-in on pricing |
| Groq | Extreme low-latency on a curated model set | LPU hardware-specific; narrower model catalogue |
For most SA studios, hosted inference (Together AI, Fireworks, Groq) is the right answer until you cross meaningful volume — roughly 50M+ tokens / month on a single model. Below that, the per-token economics of hosted providers beat the ops cost of running vLLM on dedicated GPUs. Above that, self-hosted vLLM on dedicated hardware can structurally undercut hosted billing — particularly relevant when FX exposure on USD-billed hosted services is a concern.
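The break-even arithmetic itself is one line; every number below is a placeholder assumption, so substitute your actual hosted quote and your all-in self-hosting cost (GPU rental plus ops time):

```python
def breakeven_m_tokens(selfhost_usd_per_month: float,
                       hosted_usd_per_m_tokens: float) -> float:
    """Monthly volume (millions of tokens) above which self-hosting is cheaper
    on raw token economics alone."""
    return selfhost_usd_per_month / hosted_usd_per_m_tokens

# Placeholder inputs -- substitute your real quotes. The crossover moves
# linearly with both numbers; the ~50M/month rule of thumb above also weighs
# factors beyond raw token price (FX exposure, residency, control).
print(breakeven_m_tokens(selfhost_usd_per_month=2000.0,
                         hosted_usd_per_m_tokens=0.90))
```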
Hetzner doesn't have an SA region; its nearest GPU-equipped data centres are Helsinki and Frankfurt. For genuinely SA-resident vLLM hosting, options narrow to: GCP Johannesburg with custom GPU instances (A100s available), AWS Cape Town with EC2 GPU SKUs, or local hosting providers with GPU infrastructure. None are as cheap as Hetzner Frankfurt, but all keep data SA-resident. A practical compromise for SA enterprises: vLLM on AWS Cape Town for residency-sensitive workloads, hosted Together AI for everything else.
For POPIA-sensitive workloads, vLLM on SA-region GPU instances is the cleanest open-weights inference path: Llama / Mistral / Qwen weights downloaded once, served from SA-region infrastructure, no inference data crossing borders. The audit trail flows into Cloud Logging or CloudWatch, and IAM controls integrate with the existing AWS / GCP enterprise posture.
LangChain's ChatOpenAI works against any vLLM endpoint by setting base_url, as sketched above. Self-hosted Llama / Qwen behind LangGraph orchestration is a popular cost-optimised stack for SA teams.