agents · Llama · Skill Leaf

Open weights, at frontier scale.

Llama is Meta's open-weights model family — the most-deployed open-weights family in the world by a meaningful margin, with the largest community fine-tune ecosystem of any model. The Llama 3.x and Llama 4 generations cover sizes from 1B (mobile / edge) to 405B+ (frontier scale, multi-GPU servers and clusters). Vision variants, code variants, instruction-tuned and base variants all ship. Available through Hugging Face, Ollama, Together AI, Groq, AWS Bedrock, Vertex AI Model Garden, and direct from llama.com. Permissive licence with commercial restrictions only above 700M monthly active users — practically open for almost every business.

Llama 3.x & Llama 4 · current Open weights 1B · 8B · 70B · 405B+ Llama Community Licence Largest fine-tune ecosystem

Meta's open-weights flagship.

Llama is the model family from Meta AI, originally launched in February 2023 with Llama 1 (which leaked publicly and arguably catalysed the entire open-weights movement) and continued through Llama 2 (July 2023, the first commercially-licensed Meta open-weights release), Llama 3 (April 2024), Llama 3.1 (July 2024), Llama 3.2 (September 2024, including vision), Llama 3.3 (December 2024), and Llama 4 (April 2025). Each release ships in multiple sizes, from edge devices up to multi-GPU frontier clusters.

Where Gemma is the frontier-lab open-weights family from Google DeepMind, Llama is the workhorse open-weights family. The structural advantages: the largest community fine-tune ecosystem of any model (tens of thousands of community fine-tunes on Hugging Face), broad inference platform support (every credible LLM hosting provider supports Llama as a default), and steady frontier-adjacent quality (Llama 70B and 405B class models are competitive with mid-tier closed models on most general-purpose benchmarks).

Distribution is wide. Direct downloads from llama.com with licence acceptance. Hugging Face hosts the canonical model cards. Ollama exposes Llama as one of its default model collections (ollama run llama3.3). Hosted inference: Together AI, Groq (specialises in low-latency Llama inference at production scale), Fireworks, AWS Bedrock (in af-south-1 for SA residency on some variants), Google Vertex AI Model Garden. The distribution breadth is itself the practical advantage — you can swap hosting providers without changing the model.
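Because most of these hosts expose OpenAI-compatible chat endpoints, the provider swap can be as small as a base-URL change. A minimal sketch — the base URLs and model ids below are illustrative and should be checked against each provider's current catalogue:

```python
# Sketch: the same chat request routed to different Llama hosts purely by
# changing the base URL and model id. Endpoints and model ids are
# illustrative placeholders -- verify against each provider's docs.
PROVIDERS = {
    "together": {"base_url": "https://api.together.xyz/v1",
                 "model": "meta-llama/Llama-3.3-70B-Instruct-Turbo"},
    "groq":     {"base_url": "https://api.groq.com/openai/v1",
                 "model": "llama-3.3-70b-versatile"},
}

def build_chat_request(provider: str, prompt: str) -> dict:
    """Build an OpenAI-style chat-completions request for the given host."""
    cfg = PROVIDERS[provider]
    return {
        "url": f"{cfg['base_url']}/chat/completions",
        "json": {
            "model": cfg["model"],
            "messages": [{"role": "user", "content": prompt}],
        },
    }
```

Swapping hosts is a one-line config change; the prompt, message format, and response handling stay identical.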

The licence reality · Llama Community Licence

Llama is not released under a fully open licence like Apache 2.0 or MIT. The Llama Community Licence permits free commercial use except for organisations with more than 700 million monthly active users at the time the model is released — in which case Meta requires you to request a separate licence. This affects only a handful of global tech companies. For 99.9% of business users, it's a practically permissive licence. The other notable restrictions: Meta requires "Built with Llama" attribution on user-facing applications, and models trained or improved using Llama outputs must carry "Llama" at the start of their name (earlier licence versions barred using outputs to improve other models outright). Read the licence text if your use case might fall in the edge cases.

Five sizes, two generations actively current.

Llama 3.x and Llama 4 are the active generations as of mid-2026. Llama 4 represents the modern frontier of the family with native multimodal and longer context; Llama 3.x remains widely deployed for production work and has the biggest community fine-tune library. The size points are deliberate — each maps to a hardware target most teams have access to.

Size · edge

1B / 3B

Phones, edge devices, lightweight serverless. On-device assistants, classification, basic chat. Llama 3.2 1B / 3B are the canonical size points at this tier.

Size · default

8B

Laptops, single GPU 12GB+. The "good enough for most things" tier; runs comfortably on consumer hardware and is the default for many open-weights production deployments. Llama 3.1 8B is the most-fine-tuned 8B-class model in the field.

Size · flagship

70B

Single H100, multi-GPU server, or an RTX 4090 pair with aggressive quantisation. Where Llama starts feeling competitive with mid-tier closed-frontier models. The most popular size for production self-hosted deployments. Llama 3.3 70B is the workhorse; Llama 4 Scout (MoE, 17B active parameters) occupies a similar deployment tier.

Size · frontier

405B / Maverick

Multi-H100 / cluster. Llama 3.1 405B and Llama 4 frontier variants. Genuine frontier-adjacent quality on hard reasoning. Most teams access this via hosted providers (Together AI, Groq) rather than self-hosting.
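The size-tier mapping above can be expressed as a simple router. A sketch — the difficulty thresholds and model names are illustrative, not tuned:

```python
# Sketch: route a task to a Llama size tier by a rough difficulty score
# (0.0 trivial .. 1.0 hardest). Thresholds are illustrative placeholders.
def pick_llama_tier(difficulty: float) -> str:
    if difficulty < 0.2:
        return "llama-3.2-3b"    # edge: classification, basic chat
    if difficulty < 0.5:
        return "llama-3.1-8b"    # default: most routine traffic
    if difficulty < 0.85:
        return "llama-3.3-70b"   # flagship: production workhorse
    return "llama-3.1-405b"      # frontier: hard reasoning only
```

In production the difficulty score would come from a cheap classifier or heuristics (prompt length, tool-use depth, prior failure), not a hand-set constant.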

Vision and code variants

Llama 3.2 Vision ships in 11B and 90B variants with vision-text understanding. Llama 4 brings native multimodal across the family at most size points. Code Llama (the code-specialised variant) is now largely superseded by general Llama variants for code, plus dedicated open-weights coding models (Qwen Coder, DeepSeek Coder). For specifically code-heavy work, Qwen 2.5 Coder remains a stronger pick than Llama at most size points.

Where Llama is the right pick.

Honest open-weights positioning. Llama's strengths sit on community ecosystem reach, hosting breadth, and steady frontier-adjacent quality. Its weaknesses are around vision (Gemma is more consistent), code (Qwen Coder is stronger), and the >700M MAU licence edge cases. Against closed frontier (Claude, GPT, Gemini), Llama wins on cost predictability and residency, loses on absolute frontier-tier reasoning quality.

Family · Strengths · Watch out for
Llama (Meta) · Largest fine-tune ecosystem, broadest hosting, steady production quality, runs on Ollama · 700M MAU restriction; "Built with Llama" attribution; vision less consistent than Gemma; code lags Qwen
Gemma (Google) · Frontier-lab safety tuning, native multimodal, 128k context, more permissive licence terms · Smaller community fine-tune ecosystem than Llama; smaller deployment surface
Qwen (Alibaba) · Best open-weights for code (Qwen Coder), strong multilingual, fast cadence · Chinese training emphasis means English benchmarks slightly lag; less Western community presence
Mistral · Function-calling reliability, French / European multilingual, strong instruction-following · Smaller variant range than Llama; less open ecosystem; some variants are commercial-licence-restricted
Claude / GPT / Gemini · Frontier reasoning quality on hardest tasks, native extended thinking / search grounding / 1M context · Closed weights; USD billing; cross-border data residency concerns

Why Llama is still the open-weights default

Several open-weights families now match or exceed Llama on specific tasks — Gemma on multimodal safety tuning, Qwen on code, Mistral on function-calling. What Llama still has that none of them match: ecosystem breadth. If you self-host, you'll find the most pre-built fine-tunes for Llama. If you use hosted inference, you'll find the most provider options for Llama. If you read research papers, you'll see Llama as the baseline. That ecosystem network effect is durable, and explains why Llama remains the default choice when the team doesn't have a specific reason to prefer something else.

Free to download, multiple paths to run.

The model weights are free under the Llama Community Licence; you pay for compute. Three structurally different cost paths: self-host on your hardware (cheapest at high volume, highest ops cost), run on Ollama for local dev (free), use hosted inference providers (per-token billing, no ops). The right answer depends on volume, latency requirements, and operational appetite.

Local · dev

Ollama

ollama run llama3.3 for instant local inference. Free, fast, works offline. Default for development and prototyping. Quantised by default; full precision available on capable hardware.
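Beyond the CLI, Ollama also serves a local REST API on port 11434. A minimal sketch of calling it directly — the `/api/generate` endpoint and payload shape follow Ollama's documented defaults, but verify against your installed version:

```python
import json
import urllib.request

# Sketch: talk to a locally running Ollama server over its REST API
# (default port 11434). Assumes `ollama run llama3.3` has pulled the model.
def build_generate_payload(prompt: str, model: str = "llama3.3") -> dict:
    # stream=False asks for one JSON object instead of a chunk stream
    return {"model": model, "prompt": prompt, "stream": False}

def ask_llama(prompt: str, host: str = "http://localhost:11434") -> str:
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(build_generate_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

The same server also exposes an OpenAI-compatible `/v1/chat/completions` endpoint, which makes the local-to-hosted swap later a configuration change rather than a rewrite.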

Self-host · prod

vLLM / TensorRT-LLM

Production-grade serving on your own GPUs. PagedAttention, continuous batching, OpenAI-compatible API. Cheapest per-token at high volume but highest operational complexity (GPU management, scaling, monitoring).

Hosted · speed

Groq

Specialises in extreme low-latency Llama inference using their custom LPU hardware. Often 5-10× faster than other hosted providers. The right pick for latency-sensitive production agents.

Hosted · throughput

Together AI / Fireworks

Production hosted-inference providers with the broadest Llama variant catalogue. Per-token pricing similar to closed-frontier flash tiers. The most flexible hosted path for production workloads.

Cloud · managed

AWS Bedrock

Llama on Bedrock with regional residency including af-south-1 Cape Town for SA-resident inference. AWS enterprise contracting model. The path for SA enterprise that wants AWS commercial relationships.

Cloud · managed

Vertex AI Model Garden

Llama on Vertex AI with regional residency including africa-south1 Johannesburg. Path for SA enterprise on GCP. Some Llama variants only available in US regions; check the model availability matrix.

The cost-per-token reality

Llama 70B on hosted providers (Together, Fireworks) typically costs well below Claude Haiku or GPT-4o-mini per token. Llama 405B / Maverick costs more but still meaningfully below frontier closed models. For volume workloads where the closed frontier's quality lift isn't structurally required, Llama via hosted inference can be 5-10× cheaper than running everything on Claude Sonnet or GPT-5. The pattern that wins: tier-route Haiku/Sonnet/Opus or Llama-8B/70B/405B based on task difficulty, mixing both closed and open as needed.
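The tier-routing arithmetic is worth making explicit. A sketch with placeholder prices — substitute your providers' current per-million-token rates:

```python
# Sketch: blended per-million-token cost when a share of traffic goes to a
# closed frontier model and the rest to hosted Llama. Prices are
# illustrative placeholders, not quotes.
def blended_cost(frontier_share: float,
                 frontier_usd_per_mtok: float = 15.0,
                 llama_usd_per_mtok: float = 1.0) -> float:
    """USD per million tokens for the routed mix."""
    return (frontier_share * frontier_usd_per_mtok
            + (1 - frontier_share) * llama_usd_per_mtok)

# Routing only the hardest 10% of traffic to the frontier model:
# blended_cost(0.10) -> 2.4 USD/Mtok, vs 15.0 for frontier-only.
```

The saving scales with how small you can make the frontier share, which is why cheap difficulty classification up front pays for itself quickly.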

When Llama is the right model.

Use Llama when

  • You want the largest open-weights community fine-tune library
  • Cost predictability matters — per-token costs are meaningfully below closed frontier
  • POPIA / data-residency forbids closed-frontier US-hosted inference
  • You self-host on your own GPUs and want broad inference-server support
  • You need the broadest hosting-provider flexibility (Together / Groq / Bedrock / Vertex)
  • You're building dev tools or research where Llama is the de facto baseline
  • Latency-sensitive workloads where Groq's LPU speed is structural

Where Llama lands in SA delivery work.

Enterprise · AWS Bedrock af-south-1

For SA banks, insurers, telcos, and government with POPIA cross-border concerns, Llama on AWS Bedrock in Cape Town (af-south-1) is one of the cleanest residency stories. Not every Llama variant lands in Bedrock af-south-1 at launch — check the model availability matrix — but Llama 3.x 70B and the Llama 4 mid-tier are typically available with full regional residency. AWS enterprise contracting handles the commercial side; the data stays on-region.

Studio · Ollama + Together

The cheap and fast SA studio path: develop locally on Llama 70B via Ollama (free; the quantised 70B fits on a maxed-out MacBook), ship to Together AI or Groq for production. Together's per-token costs are very competitive; Groq's LPU latency is genuinely best-in-class for chat agents. For most SA studios watching FX, the Llama-on-Together pattern is meaningfully cheaper than Claude or GPT for the same quality bar at production volume. Tier-route to closed frontier only for the genuinely hard 5-15% of traffic.

Self-hosting on Hetzner / on-prem

Hetzner has no SA region; their nearest GPU-equipped DC is Helsinki / Frankfurt. For genuinely SA-resident self-hosted Llama, options narrow to: GCP Johannesburg with custom GPU instances, AWS Cape Town with EC2 GPU SKUs, or local hosting providers with GPU SKUs. None are as cheap as Hetzner, but all keep data SA-resident. For most SA studios, the self-hosting path makes sense only at scale (above ~50M tokens / month) where the per-token economics structurally beat hosted inference.
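The break-even point is straightforward to compute. A sketch with placeholder figures — substitute your actual GPU node cost and hosted per-token rate, and remember ops time is a real cost on top:

```python
# Sketch: monthly token volume at which self-hosting beats hosted inference.
# All figures are illustrative placeholders, not quotes. Ops overhead
# (monitoring, scaling, on-call) should be folded into gpu_monthly_usd.
def breakeven_mtok_per_month(gpu_monthly_usd: float,
                             hosted_usd_per_mtok: float,
                             selfhost_usd_per_mtok: float = 0.0) -> float:
    """Million tokens/month where self-host total cost equals hosted cost."""
    return gpu_monthly_usd / (hosted_usd_per_mtok - selfhost_usd_per_mtok)
```

Usage: `breakeven_mtok_per_month(4000, 0.9)` — a hypothetical USD 4,000/month GPU node against hosted Llama 70B at USD 0.9/Mtok — returns the monthly volume in millions of tokens above which self-hosting is cheaper.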

Where Llama links in the tree.

agents
Agents hub
The sub-tree landing. Llama in the Models band as the open-weights workhorse alongside Gemma.
agents/ollama
Ollama
The default local serving path for Llama. ollama run llama3.3 is the canonical "try it" command. Most SA studio dev work runs through this combination.
agents/gemma
Gemma
The other major open-weights family. Gemma leads on multimodal and frontier-lab safety tuning; Llama leads on community ecosystem and hosting breadth. Often paired in production stacks.
agents/claude
Claude
The closed-frontier comparison. Common production pattern: Llama-via-Together for routine traffic, Claude for the hardest 5-15%. Tier routing across closed + open is the cost-optimal path.
agents/gpt
GPT · OpenAI
Same pattern as Claude. GPT for hardest reasoning, Llama for volume. Llama via Together is the practical alternative when GPT-5 cost or POPIA residency is the constraint.
agents/gemini
Gemini
Closed-frontier with extreme context. Pair Gemini for 1M-context analysis steps with Llama for routing and high-volume routine work — both can run via Vertex Model Garden in africa-south1.
agents/langgraph
LangGraph
Vendor-neutral framework that pairs naturally with Llama. langchain-community integrations cover Together, Groq, Bedrock, Vertex Llama hosting endpoints uniformly.
tech/cloudflare/workers
Cloudflare Workers
Cloudflare Workers AI hosts a curated subset of Llama variants. Useful for low-latency edge-served Llama agents without standing up dedicated infrastructure.

Primary sources only.