vLLM is the production-grade open-source inference server for open-weights LLMs. Originating at UC Berkeley's Sky Lab and released under Apache 2.0, vLLM is now the canonical inference engine for self-hosted production deployments and the under-the-hood serving layer for many hosted providers, including Together AI and Fireworks. Key features: PagedAttention for memory-efficient KV cache management, continuous batching for high-throughput multi-user serving, an OpenAI-compatible API surface, and native CUDA / ROCm / TPU support. It is the default choice once you've outgrown Ollama and need real production throughput on dedicated GPUs.
vLLM is a Python library for high-throughput LLM inference. Released as an academic project from UC Berkeley's Sky Lab (formerly RISELab) in mid-2023 and now maintained as an open-source project with broad industry contribution, vLLM has become the de facto serving layer for self-hosted open-weights LLMs and the hidden infrastructure behind many hosted-inference providers.
Two architectural ideas distinguish vLLM from earlier inference servers:
PagedAttention, the original research contribution, applies virtual-memory-style paging to the KV cache. Conventional inference servers allocate KV cache contiguously, leaving large fragmented gaps when sequences finish at different times. PagedAttention stores KV in fixed-size blocks managed like virtual memory pages, dramatically reducing memory waste — typically 4× better memory utilisation, which translates directly to higher throughput on the same GPU.
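The mechanism is easiest to see as a sketch. Below is a toy block allocator, illustrative only and nothing like vLLM's CUDA internals: the KV cache is one shared pool of fixed-size blocks, each sequence keeps an indirection table from its logical token positions to physical blocks, and finished sequences return their blocks to the pool immediately, so any block can serve any sequence next.

```python
# Toy sketch of PagedAttention-style KV block management (illustrative only).
BLOCK_SIZE = 16  # tokens per KV block (vLLM's default block size is 16)

class BlockAllocator:
    """Shared pool of physical KV blocks."""
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))

    def allocate(self) -> int:
        return self.free.pop()

    def release(self, block_id: int) -> None:
        self.free.append(block_id)

class Sequence:
    """Maps logical token positions to physical blocks via a block table."""
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []
        self.num_tokens = 0

    def append_token(self) -> None:
        if self.num_tokens % BLOCK_SIZE == 0:  # current block full: grab another
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def finish(self) -> None:
        for b in self.block_table:    # blocks return to the pool at once, so
            self.allocator.release(b) # there are no contiguous-region gaps

pool = BlockAllocator(num_blocks=1024)
seq = Sequence(pool)
for _ in range(40):
    seq.append_token()
print(len(seq.block_table))  # 3 blocks for 40 tokens (ceil(40 / 16))
seq.finish()
```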
Continuous batching (sometimes called dynamic batching or in-flight batching) means new requests can join an in-progress batch instead of waiting for the current batch to finish. Combined with PagedAttention, this lets vLLM serve far more concurrent users on the same hardware than naive batching approaches. The result: hosted providers running vLLM can offer Llama / Qwen / Mistral inference at very competitive per-token prices because the per-GPU throughput is genuinely high.
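Schematically (hypothetical request objects, not vLLM's scheduler code), the serving loop admits queued requests at every decode step and evicts finished ones, so short requests never block behind long ones:

```python
# Schematic of continuous batching -- hypothetical structure for illustration.
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    tokens_left: int  # decode steps remaining for this stub request

    def decode_one_token(self) -> None:
        self.tokens_left -= 1

    def is_finished(self) -> bool:
        return self.tokens_left <= 0

def serve_loop(waiting: deque, max_batch: int) -> int:
    """Run all requests to completion; return the number of decode steps."""
    running: list[Request] = []
    steps = 0
    while waiting or running:
        # New requests join the in-progress batch immediately -- no waiting
        # for the current batch to drain (the heart of continuous batching).
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())

        # One batched decode step for every running sequence.
        for req in running:
            req.decode_one_token()
        steps += 1

        # Finished sequences exit now, freeing batch slots (and KV blocks).
        running = [r for r in running if not r.is_finished()]
    return steps

# Short and long requests mixed: short ones finish and free slots mid-flight.
print(serve_loop(deque(Request(n) for n in (3, 10, 5, 8)), max_batch=2))
```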
By 2026, vLLM is the inference engine you'll find under the hood at most hosted-inference providers serving open-weights models. Together AI uses vLLM-derived infrastructure. Fireworks builds on vLLM. NVIDIA's TensorRT-LLM is the main alternative; SGLang and Hugging Face's TGI (Text Generation Inference) are smaller competitors. If you're self-hosting open-weights models for production, vLLM is the path of least resistance and has the broadest ecosystem. If you're consuming hosted inference, you're probably using vLLM-served endpoints whether you know it or not.
vLLM exposes two surfaces. The API server mode launches an OpenAI-compatible HTTP endpoint at localhost:8000 — drop-in replacement for OpenAI in any code that speaks the OpenAI SDK. The library mode embeds vLLM directly in your Python process for offline batched inference. Both modes share the same engine.
Server mode — the production default. Single command launches an OpenAI-compatible API:
```bash
# pip install vllm

# Start an OpenAI-compatible server
vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --tensor-parallel-size 2 \
  --max-model-len 32768 \
  --port 8000

# Query it like OpenAI
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.3-70B-Instruct",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```
Library mode — for offline batch inference, evaluations, or embedding into a custom serving stack:
```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.3-70B-Instruct", tensor_parallel_size=2)

prompts = [
    "Summarise SA fiscal policy in 2026.",
    "What's stage 6 load shedding?",
]
sampling = SamplingParams(temperature=0.3, max_tokens=512)

outputs = llm.generate(prompts, sampling)
for output in outputs:
    print(output.outputs[0].text)
```
vLLM's OpenAI-compatible mode means agent frameworks that speak OpenAI work unchanged. Point LangGraph's ChatOpenAI at http://your-vllm-server:8000/v1, set the model name, and your LangGraph stack runs against your self-hosted Llama 70B. Same trick works for the OpenAI Agents SDK, ADK via LiteLLM, CrewAI, and AutoGen. This is what makes vLLM operationally pragmatic — you can swap closed-frontier APIs for self-hosted vLLM by changing one config value.
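A minimal sketch, assuming langchain-openai is installed and the server from earlier is running; the key is a throwaway placeholder unless the server was launched with --api-key:

```python
from langchain_openai import ChatOpenAI

# Point LangChain/LangGraph at the self-hosted vLLM server instead of OpenAI.
llm = ChatOpenAI(
    base_url="http://your-vllm-server:8000/v1",
    api_key="not-needed",  # any placeholder unless vLLM was started with --api-key
    model="meta-llama/Llama-3.3-70B-Instruct",
    temperature=0.3,
)

print(llm.invoke("Hello from self-hosted Llama").content)
```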
vLLM sits between Ollama (dev / single-user) and TensorRT-LLM (extreme NVIDIA-specific optimisation) in the inference stack. Hosted providers (Together / Fireworks / Groq) abstract everything below the API surface; vLLM is what you reach for when you want self-host control and broad model + hardware support.
| Layer | Best for | Trade-off |
|---|---|---|
| Ollama | Dev, prototyping, single-user laptop / server | Single-tenant, lower throughput; not for production multi-user |
| vLLM | Production self-host, multi-user, broad model + HW support | More ops complexity than Ollama; less NVIDIA-specific tuning than TensorRT-LLM |
| TensorRT-LLM | Maximum throughput on NVIDIA hardware specifically | NVIDIA-only, more setup complexity, narrower model support |
| Together AI / Fireworks | Don't want to self-host; hosted inference simplicity | Per-token billing; less control; vendor lock-in on pricing |
| Groq | Extreme low-latency on a curated model set | LPU hardware-specific; narrower model catalogue |
For most SA studios, hosted inference (Together AI, Fireworks, Groq) is the right answer until you cross meaningful volume — roughly 50M+ tokens / month on a single model. Below that, the per-token economics of hosted providers beat the ops cost of running vLLM on dedicated GPUs. Above that, self-hosted vLLM on dedicated hardware can structurally undercut hosted billing — particularly relevant when FX exposure on USD-billed hosted services is a concern.
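The break-even arithmetic itself is one line; every number below is a placeholder assumption, so substitute your actual hosted quote and your all-in self-hosting cost (GPU rental plus ops time):

```python
def breakeven_m_tokens(selfhost_usd_per_month: float,
                       hosted_usd_per_m_tokens: float) -> float:
    """Monthly volume (millions of tokens) above which self-hosting is cheaper
    on raw token economics alone."""
    return selfhost_usd_per_month / hosted_usd_per_m_tokens

# Placeholder inputs -- substitute your real quotes. The crossover moves
# linearly with both numbers; the ~50M/month rule of thumb above also weighs
# factors beyond raw token price (FX exposure, residency, control).
print(breakeven_m_tokens(selfhost_usd_per_month=2000.0,
                         hosted_usd_per_m_tokens=0.90))
```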
Hetzner doesn't have an SA region; its nearest GPU-equipped data centres are Helsinki and Frankfurt. For genuinely SA-resident vLLM hosting, options narrow to: GCP Johannesburg with custom GPU instances (A100s available), AWS Cape Town with EC2 GPU SKUs, or local hosting providers with GPU infrastructure. None are as cheap as Hetzner Frankfurt, but all keep data SA-resident. A practical compromise for SA enterprises: vLLM on AWS Cape Town for residency-sensitive workloads, hosted Together AI for everything else.
For POPIA-sensitive workloads, vLLM on SA-region GPU instances is the cleanest open-weights inference path: Llama / Mistral / Qwen weights downloaded once, served from SA-region infrastructure, no inference data crossing borders. The audit trail flows into Cloud Logging or CloudWatch, and IAM controls integrate with the existing AWS / GCP enterprise posture.
LangChain's ChatOpenAI works against any vLLM endpoint by setting base_url, as sketched above. Self-hosted Llama / Qwen behind LangGraph orchestration is a popular cost-optimised stack for SA teams.