Llama is Meta's open-weights model family — the most-deployed open-weights family in the world by a meaningful margin, with the largest community fine-tune ecosystem of any model. The Llama 3.x and Llama 4 generations cover sizes from 1B (mobile / edge) to 405B+ (frontier-scale, multi-GPU clusters). Vision variants, code variants, instruction-tuned and base variants all ship. Available through Hugging Face, Ollama, Together AI, Groq, AWS Bedrock, Vertex AI Model Garden, and direct from llama.com. Permissive licence with commercial restrictions only above 700M monthly active users — practically open for almost every business.
Llama is the model family from Meta AI, originally launched in February 2023 with Llama 1 (which leaked publicly and arguably catalysed the entire open-weights movement) and continued through Llama 2 (July 2023, the first commercially-licensed Meta open-weights release), Llama 3 (April 2024), Llama 3.1 (July 2024), Llama 3.2 (September 2024, including vision), Llama 3.3 (December 2024), and Llama 4 (early 2025). Each release ships in multiple sizes, from edge devices up to multi-GPU clusters.
Where Gemma is the frontier-lab open-weights family from Google DeepMind, Llama is the workhorse open-weights family. The structural advantages: the largest community fine-tune ecosystem of any model (tens of thousands of community fine-tunes on Hugging Face), broad inference platform support (every credible LLM hosting provider supports Llama as a default), and steady frontier-adjacent quality (Llama 70B and 405B class models are competitive with mid-tier closed models on most general-purpose benchmarks).
Distribution is wide. Direct downloads from llama.com with licence acceptance. Hugging Face hosts the canonical model cards. Ollama exposes Llama as one of its default model collections (`ollama run llama3.3`). Hosted inference: Together AI, Groq (specialises in low-latency Llama inference at production scale), Fireworks, AWS Bedrock (in af-south-1 for SA residency on some variants), Google Vertex AI Model Garden. The distribution breadth is itself the practical advantage — you can swap hosting providers without changing the model.
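Most of these hosts expose OpenAI-compatible endpoints, so the swap is usually a base-URL change. A minimal sketch, assuming the OpenAI Python SDK and the current Together and Groq endpoints (the model IDs are illustrative; check each provider's catalogue for exact names):

```python
from openai import OpenAI

# Same client code, different provider: only base_url, api_key, and the
# provider-specific model ID change. Model IDs are illustrative placeholders.
PROVIDERS = {
    "together": ("https://api.together.xyz/v1", "meta-llama/Llama-3.3-70B-Instruct-Turbo"),
    "groq": ("https://api.groq.com/openai/v1", "llama-3.3-70b-versatile"),
}

def ask(provider: str, api_key: str, prompt: str) -> str:
    base_url, model = PROVIDERS[provider]
    client = OpenAI(base_url=base_url, api_key=api_key)
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```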
Llama is not released under a fully open licence like Apache 2.0 or MIT. The Llama Community Licence permits free commercial use except for organisations with more than 700 million monthly active users at the time the model is released — in which case Meta requires you to request a separate licence. This affects only a handful of global tech companies. For 99.9% of business users, it's a practically permissive licence. The other notable restrictions: Meta requires "Built with Llama" attribution on user-facing applications, and any model you train or improve using Llama outputs must carry "Llama" at the start of its name (the older Llama 2 terms prohibited using outputs to improve other models outright). Read the licence text if your use case might fall in the edge cases.
Llama 3.x and Llama 4 are the active generations as of mid-2026. Llama 4 represents the modern frontier of the family with native multimodal and longer context; Llama 3.x remains widely deployed for production work and has the biggest community fine-tune library. The size points are deliberate — each maps to a hardware target most teams have access to.
**1B-3B.** Phones, edge devices, lightweight serverless. On-device assistants, classification, basic chat. Llama 3.2 1B / 3B are the canonical size points; Llama 4 nano-class continues the trajectory.
**8B.** Laptops, single GPU 12GB+. The "good enough for most things" tier; runs comfortably on consumer hardware and is the default for many open-weights production deployments. Llama 3.1 8B is among the most fine-tuned 8B-class models in the field.
**70B-class.** RTX 4090 / single H100 / multi-GPU server. Where Llama starts feeling competitive with mid-tier closed-frontier models. The most popular size for production self-hosted deployments. Llama 3.3 70B is the workhorse; in the Llama 4 generation, Scout fills a similar deployment tier.
**405B+.** Multi-H100 cluster. Llama 3.1 405B and Llama 4 frontier variants. Genuine frontier-adjacent quality on hard reasoning. Most teams access this tier via hosted providers (Together AI, Groq) rather than self-hosting.
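A rough way to map a size point to hardware: weights need roughly params × bytes-per-parameter, plus overhead for KV cache and activations. A back-of-envelope sketch; the 1.3× overhead factor is an assumption, and real usage depends on context length and batch size:

```python
def vram_estimate_gb(params_b: float, bits: int = 4, overhead: float = 1.3) -> float:
    """Rough VRAM needed to serve a model: weights plus cache/activation overhead."""
    weights_gb = params_b * bits / 8  # params (in billions) x bytes per param
    return weights_gb * overhead

for size in (8, 70, 405):
    q4 = vram_estimate_gb(size, bits=4)
    fp16 = vram_estimate_gb(size, bits=16)
    print(f"{size}B: ~{q4:.0f} GB at 4-bit, ~{fp16:.0f} GB at fp16")
# 8B fits a 12GB consumer GPU at 4-bit; 70B wants ~45GB (one H100-class card);
# 405B needs a multi-GPU node even when quantised.
```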
Llama 3.2 Vision ships in 11B and 90B variants with vision-text understanding. Llama 4 brings native multimodal across the family at most size points. Code Llama (the code-specialised variant) is now largely superseded by general Llama variants for code, plus dedicated open-weights coding models (Qwen Coder, DeepSeek Coder). For specifically code-heavy work, Qwen 2.5 Coder remains a stronger pick than Llama at most size points.
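To poke at the vision variants locally, the ollama Python client accepts image paths alongside the prompt. A minimal sketch, assuming the llama3.2-vision tag has been pulled and the image file exists:

```python
import ollama

# Requires `ollama pull llama3.2-vision` first; the tag is Ollama's
# catalogue name, not Meta's official model ID.
response = ollama.chat(
    model="llama3.2-vision",
    messages=[{
        "role": "user",
        "content": "Describe this invoice in one sentence.",
        "images": ["invoice.png"],  # local file path
    }],
)
print(response["message"]["content"])
```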
Honest open-weights positioning. Llama's strengths are community ecosystem reach, hosting breadth, and steady frontier-adjacent quality. Its weaknesses: vision (Gemma is more consistent), code (Qwen Coder is stronger), and the >700M MAU licence edge cases. Against the closed frontier (Claude, GPT, Gemini), Llama wins on cost predictability and residency and loses on absolute frontier-tier reasoning quality.
| Family | Strengths | Watch out for |
|---|---|---|
| Llama (Meta) | Largest fine-tune ecosystem, broadest hosting, steady production quality, runs on Ollama | 700M MAU restriction; "Built with Llama" attribution; vision less consistent than Gemma; code lags Qwen |
| Gemma (Google) | Frontier-lab safety tuning, native multimodal, 128k context, licence without a MAU cap or attribution requirement | Smaller community fine-tune ecosystem than Llama; smaller deployment surface |
| Qwen (Alibaba) | Best open-weights for code (Qwen Coder), strong multilingual, fast cadence | Less Western community presence; English-centric benchmarks can lag the strongest alternatives at some size points |
| Mistral | Function-calling reliability, French / European multilingual, strong instruction-following | Smaller variant range than Llama; less open ecosystem; some variants are commercial-licence-restricted |
| Claude / GPT / Gemini | Frontier reasoning quality on hardest tasks, native extended thinking / search grounding / 1M context | Closed weights; USD billing; cross-border data residency concerns |
Several open-weights families now match or exceed Llama on specific tasks — Gemma on multimodal safety tuning, Qwen on code, Mistral on function-calling. What Llama still has that none of them match: ecosystem breadth. If you self-host, you'll find the most pre-built fine-tunes for Llama. If you use hosted inference, you'll find the most provider options for Llama. If you read research papers, you'll see Llama as the baseline. That ecosystem network effect is durable, and it explains why Llama remains the default choice when the team doesn't have a specific reason to prefer something else.
The model weights are free under the Llama Community Licence; you pay for compute. Three structurally different cost paths: self-host on your hardware (cheapest at high volume, highest ops cost), run on Ollama for local dev (free), use hosted inference providers (per-token billing, no ops). The right answer depends on volume, latency requirements, and operational appetite.
**Ollama (local dev).** `ollama run llama3.3` for instant local inference. Free, fast, works offline. The default for development and prototyping. Quantised by default; full precision available on capable hardware.
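Ollama also serves an OpenAI-compatible API on localhost, so the same client code you'd ship against a hosted provider works in local dev. A minimal sketch:

```python
from openai import OpenAI

# Ollama exposes an OpenAI-compatible API on localhost:11434; the api_key
# is required by the client but ignored by Ollama.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
resp = client.chat.completions.create(
    model="llama3.3",
    messages=[{"role": "user", "content": "One-line summary of POPIA?"}],
)
print(resp.choices[0].message.content)
```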
**vLLM (self-hosted).** Production-grade serving on your own GPUs. PagedAttention, continuous batching, OpenAI-compatible API. Cheapest per-token at high volume but highest operational complexity (GPU management, scaling, monitoring).
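A minimal offline-inference sketch with vLLM (the model ID is the Hugging Face repo name and assumes you've accepted the licence there; production deployments usually run the `vllm serve` OpenAI-compatible server instead):

```python
from vllm import LLM, SamplingParams

# Offline batch inference: loads the weights onto the local GPU. For
# larger models, pass tensor_parallel_size=N to split across N cards.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain continuous batching in two sentences."], params)
print(outputs[0].outputs[0].text)
```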
**Groq.** Specialises in extreme low-latency Llama inference on its custom LPU hardware. Often 5-10× faster than other hosted providers. The right pick for latency-sensitive production agents.
**Together AI / Fireworks.** Production hosted-inference providers with the broadest Llama variant catalogues. Per-token pricing similar to closed-frontier flash tiers. The most flexible hosted path for production workloads.
**AWS Bedrock.** Llama on Bedrock with regional residency, including af-south-1 Cape Town for SA-resident inference. AWS enterprise contracting model. The path for SA enterprises that want AWS commercial relationships.
**Vertex AI Model Garden.** Llama on Vertex AI with regional residency, including africa-south1 Johannesburg. The path for SA enterprises on GCP. Some Llama variants are only available in US regions; check the model availability matrix.
Llama 70B on hosted providers (Together, Fireworks) typically costs well below Claude Haiku or GPT-4o-mini per token. Llama 405B / Maverick costs more but still meaningfully below frontier closed models. For volume workloads where the closed frontier's quality lift isn't structurally required, Llama via hosted inference can be 5-10× cheaper than running everything on Claude Sonnet or GPT-5. The pattern that wins: tier-route Haiku/Sonnet/Opus or Llama-8B/70B/405B based on task difficulty, mixing both closed and open as needed.
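A minimal sketch of that routing pattern; the difficulty thresholds, providers, and model IDs are all illustrative placeholders, and real routers typically score difficulty with a cheap classifier or task metadata:

```python
# Hypothetical difficulty-based router: cheap open weights for easy traffic,
# closed frontier for the hard tail. Thresholds and model IDs are placeholders.
TIERS = [
    (0.3, ("together", "meta-llama/Llama-3.1-8B-Instruct-Turbo")),   # easy bulk work
    (0.7, ("together", "meta-llama/Llama-3.3-70B-Instruct-Turbo")),  # medium
    (1.0, ("anthropic", "claude-sonnet-latest")),                    # hard 5-15%
]

def route(difficulty: float) -> tuple[str, str]:
    """Map a difficulty score in [0, 1] to a (provider, model) pair."""
    for threshold, target in TIERS:
        if difficulty <= threshold:
            return target
    return TIERS[-1][1]

print(route(0.2))  # bulk traffic lands on the 8B tier
print(route(0.9))  # hard tail escalates to the closed frontier
```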
For SA banks, insurers, telcos, and government with POPIA cross-border concerns, Llama on AWS Bedrock in Cape Town (af-south-1) is one of the cleanest residency stories. Not every Llama variant lands in Bedrock af-south-1 at launch — check the model availability matrix — but Llama 3.x 70B and the Llama 4 mid-tier are typically available with full regional residency. AWS enterprise contracting handles the commercial side; the data stays on-region.
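A minimal residency sketch using boto3's Converse API; the model ID is illustrative, so confirm which Llama variants are actually enabled in af-south-1 for your account:

```python
import boto3

# Pinning region_name keeps inference in Cape Town (af-south-1).
client = boto3.client("bedrock-runtime", region_name="af-south-1")
resp = client.converse(
    modelId="meta.llama3-3-70b-instruct-v1:0",  # illustrative ID: check the regional catalogue
    messages=[{"role": "user", "content": [{"text": "Hello from Cape Town"}]}],
    inferenceConfig={"maxTokens": 256, "temperature": 0.5},
)
print(resp["output"]["message"]["content"][0]["text"])
```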
The cheap and fast SA studio path: develop locally on Llama 70B via Ollama (free, runs on a maxed-out MacBook), then ship to Together AI or Groq for production. Together's per-token costs are very competitive; Groq's LPU latency is genuinely best-in-class for chat agents. For most SA studios watching FX, the Llama-on-Together pattern is meaningfully cheaper than Claude or GPT for the same quality bar at production volume. Tier-route to the closed frontier only for the genuinely hard 5-15% of traffic.
Hetzner has no SA region; its nearest GPU-equipped DCs are Helsinki / Frankfurt. For genuinely SA-resident self-hosted Llama, options narrow to GCP Johannesburg with GPU instances, AWS Cape Town with EC2 GPU SKUs, or local hosting providers with GPU offerings. None are as cheap as Hetzner, but all keep data SA-resident. For most SA studios, the self-hosting path makes sense only at scale (above ~50M tokens / month) where the per-token economics structurally beat hosted inference.
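The ~50M floor is worth sanity-checking against your own quotes. A back-of-envelope sketch, where every number is a placeholder:

```python
def breakeven_m_tokens(gpu_usd_per_month: float, hosted_usd_per_m_tokens: float) -> float:
    """Monthly volume (millions of tokens) at which a fixed-cost GPU node's
    bill equals hosted per-token spend. Fold ops and engineering time into
    gpu_usd_per_month, or the comparison flatters self-hosting."""
    return gpu_usd_per_month / hosted_usd_per_m_tokens

# Placeholder quotes: substitute your actual GPU rental and hosted rates.
print(f"~{breakeven_m_tokens(gpu_usd_per_month=2000.0, hosted_usd_per_m_tokens=0.90):.0f}M tokens/month")
```

Depending on the quotes you plug in, the raw breakeven can land well above the ~50M floor; treat 50M as the point where the evaluation becomes worth doing, not where self-hosting automatically wins.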
`ollama run llama3.3` is the canonical "try it" command, and most SA studio dev work runs through Ollama locally. langchain-community integrations cover the Together, Groq, Bedrock, and Vertex Llama hosting endpoints uniformly.
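A minimal sketch of the local-dev side of that stack (newer installs may prefer the langchain-ollama package; this assumes the Ollama daemon is running with the model pulled):

```python
from langchain_community.chat_models import ChatOllama

# Talks to the local Ollama daemon on its default port.
llm = ChatOllama(model="llama3.3")
print(llm.invoke("One-line summary of the Llama Community Licence?").content)
```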