know.2nth.ai › Agents › GLM · Z.ai

agents · GLM · Z.ai · Model Leaf

The open-weight flagship that trades blows with the frontier.

GLM is Zhipu AI's model family, shipped internationally under the Z.ai brand. GLM-5.2 (June 2026) is the current headline: a ~753B-parameter Mixture-of-Experts model with ~40B active per token, a 1M-token context window, and a pure MIT licence. It is built for long-horizon agentic and coding work, and its API is Anthropic-Messages-compatible — so Claude Code retargets to it with a base-URL change. The institutionally-deployable Chinese open-weight flagship: genuinely competitive on agentic engineering, self-hostable to keep your data on your own infrastructure — if you can afford the GPUs.

Live · open weights MIT licence GLM-5.2 · Jun 2026 753B MoE · 40B active ~800GB VRAM to self-host

01 · What it is

Zhipu AI's family, told as one tight arc.

GLM (General Language Model) is the family from Zhipu AI, a Beijing lab spun out of Tsinghua University, now shipping internationally under the Z.ai brand. Its defining trait is cadence: roughly quarterly releases, the most consistent of any Chinese frontier lab. The lineage worth anchoring runs GLM-4.5 (Jul 2025) → GLM-4.6 (Sep 2025, 200K context, MIT) → GLM-4.7 (Dec 2025) → GLM-5 (Feb 2026, 744B MoE) → GLM-5.1 (Apr 2026) → GLM-5.2 (Jun 2026). Z.ai listed on the Hong Kong Stock Exchange in January 2026 — the first publicly-listed Chinese AI lab (ticker 02513.HK).

The strategic point for this tree is not the changelog — it is the licence and the shape of the business. GLM-5.2 ships under pure MIT: free to self-host, fine-tune, and use commercially, with none of the MAU caps or attribution clauses that complicate Llama's Community Licence. Z.ai monetises through its hosted API and the GLM Coding Plan, not by withholding weights. That makes GLM an open-weight model you can run next to your own data and systems-of-record — which is the only frame the Agents domain cares about. Judge it the way this sub-tree judges every model: on tool-use reliability and agent-framework integration, not chat vibes.

Where it sits in the Models band

GLM is the seventh distinct model story in this band and the third open-weight option, alongside Llama and Mistral / Qwen. Where Llama is the broad community workhorse and Qwen Coder is the small-model code specialist, GLM-5.2 is the frontier-adjacent open-weight flagship — the one you reach for when you want closed-frontier-class agentic and coding quality but need the weights in your own hands. The cost of that is hardware, not licence (see §02).

02 · How GLM-5.2 works

The architecture, the context, and the two hard caveats.

GLM-5.2 was released 13 June 2026 to GLM Coding Plan subscribers; the open weights dropped on Hugging Face three days later (zai-org/GLM-5.2 in BF16, zai-org/GLM-5.2-FP8). It is text-only and tuned for long-horizon agentic loops, not one-shot answers.

Architecture

~753B MoE · ~40B active

Mixture-of-Experts: 8 of 256 routed experts fire per token plus 1 shared expert. 78 layers, hidden size 6,144, 64 attention heads. You hold three-quarters of a trillion parameters; you pay compute for only ~40B per token.

Context

1M-token window

Up from GLM-5.1's 200K. Long enough to hold a whole repository, a multi-document brief, or a long agent trajectory in-context — the structural requirement for long-horizon coding work.

Efficiency · IndexShare

~2.9× lower FLOPs at 1M

Reuses one lightweight indexer across every four sparse-attention layers, cutting per-token FLOPs ~2.9× at the 1M-context end. The trick that makes a 1M window economically serveable rather than just nominally supported.

Efficiency · MTP

Multi-token prediction

An improved multi-token-prediction layer for speculative decoding lifts acceptance length ~20%, so more tokens clear per forward pass. Throughput, not capability — but throughput is what makes agentic loops affordable.

Licence

Pure MIT

Self-host, fine-tune, and use commercially with no MAU ceiling and no attribution clause. The monetisation lives in the hosted API and GLM Coding Plan, not in the weights.

Interop

Anthropic-Messages-compatible

The API speaks the Anthropic Messages format, so agent tooling built for Claude — Claude Code included — retargets to GLM with a base-URL change rather than a rewrite. The open-standards seam, made concrete.

Hard caveat #1 — the hardware reality

The MIT licence is free; the hardware is the binding constraint for most teams. The FP8 weights need ~800GB of VRAM spread across multiple GPUs — on the order of 8×H100. Self-hosting GLM-5.2 is not a laptop exercise and not a single-card exercise; it is a serious GPU cluster. This is precisely where the Inference band's honest answer is vLLM on a dedicated host, not Ollama on a workstation. If the cluster is out of reach, hosted access (DeepInfra, OpenRouter, Z.ai's own API) is the realistic path — with the governance trade-off below.

Hard caveat #2 — the serving path, not the weights

Z.ai is Beijing-based, and its former name, Zhipu AI, was added to the US BIS Entity List in January 2025. If you call Z.ai's hosted API, your prompts route through servers governed by China's National Intelligence Law — a real data-path consideration for regulated or sensitive workloads. The distinction that matters: the risk is in the serving path, not in the weights. Self-hosting the open weights, or serving them through a provider you trust, removes the data-path exposure entirely.

The honest tension to hold, rather than resolve glibly: the same export-control geopolitics that makes GLM's hosted API a governance risk is also what makes holding open weights you control — from any lab — a continuity hedge. Both directions are real; weigh them against your own threat model. The continuity side is unpacked in §04.

03 · Where it actually lands

Competitive on agentic engineering — read the rows honestly.

The strong rows below are largely vendor-claimed. Independent composites place GLM-5.2 as the strongest open-weight model while still trailing the closed frontier on the hardest general tasks. Both things are true at once; the real story is not any single number but that a self-hostable, MIT-licensed model now trades blows on agentic coding.

Benchmark	GLM-5.2	Comparison	Source
SWE-bench Pro	62.1	GPT-5.5 · 58.6 (GLM ahead)	Vendor-claimed
FrontierSWE	74.4%	GPT-5.5 72.6% · Claude Opus 4.8 75.1%	Vendor-claimed
Terminal-Bench 2.1	81.0	—	Vendor-claimed
Artificial Analysis Intelligence Index v4.1	51	MiniMax-M3 · DeepSeek V4 Pro 44 · Kimi K2.6 43	Independent

The honest read

On the independent Artificial Analysis Intelligence Index v4.1, GLM-5.2 tops the open-weight field at 51 — ahead of MiniMax-M3, DeepSeek V4 Pro (44) and Kimi K2.6 (43), as reported via Artificial Analysis and Simon Willison. It still trails the closed frontier on the hardest reasoning rows. Net: GLM-5.2 trades blows with Claude Opus 4.8, GPT-5.5 and Gemini 3.1 Pro — winning some rows, losing others. Do not upgrade the vendor rows into flat assertions; the load-bearing claim is that an open-weight, self-hostable model is genuinely in the conversation on agentic engineering.

Cost — named, not oversold

For comparable long-horizon coding, hosted GLM-5.2 runs at roughly one-sixth the cost of GPT-5.5. Published hosted pricing lands around $0.95–$1.40 per 1M input tokens and ~$3.00–$4.10 per 1M output tokens, depending on provider (DeepInfra, OpenRouter). Self-hosting is compute-and-electricity only — no per-token fee, but the ~800GB-VRAM cluster from §02 is the up-front price. Which model is cheaper for you depends entirely on volume; the token saving is real and so is the cluster bill.

04 · Decision guide

The model was the easy part. The harness is the decision.

The most important content on this leaf is not a benchmark. It is the economics of switching — because GLM-5.2 forces the question that the whole Agents domain has been circling: if a cheap open-weight model now does the everyday work, why isn't everyone switching? The answer is that the moat moved from the model to the harness, the routing, and the context around it. Five decision criteria follow, then the use / skip call.

1 · Audit your task distribution before assuming you need the frontier

Most knowledge work is standard and repeatable — first-pass websites, presentation outlines, content synthesis, common coding problems. In that fat middle of the curve, GLM-5.2 is frequently not just a budget alternative but the better execution, notably on front-end aesthetics. The frontier's edge shows up on the hard tail — novel reasoning, ambiguous specs — which is a minority of real usage. Most teams have never measured what share of their AI calls are predictable middle-of-curve work. Measure it first; the answer usually argues for a cheaper model on most of the volume.

2 · “Lift and shift” is a myth — switching models means rewriting the harness

Swapping a cheaper model in is not an API-key change. System prompts, memory architecture, and tool-call / integration handling are all tuned to a specific model's logic and must be re-engineered for the new one. This re-engineering is the concrete cost that sits opposite the token savings. The saving is real; the migration friction is also real; which one dominates depends on your volume. State that plainly and you avoid both the hype and the despair.

3 · The ROI line is currently drawn at gross margin

The teams actually migrating today are “AI-as-a-service” businesses where token cost is cost-of-goods — cutting it hits margin directly, so the harness rewrite pays back fast. For internal-productivity use, the up-front engineering friction often obscures the payback, and waiting is the correct call. This is the honest when-to-switch / when-to-wait criterion: not everyone should migrate yet, and that is not a failure of nerve.

4 · Convenience is the real lock-in, not capability

Frontier labs are defending price with ergonomics — drop-in presence inside Slack and shared workspaces that quietly absorbs uncodified corporate context. A team running entirely on out-of-the-box frontier tooling is, in effect, renting its own operational context back from a third party; cheap open weights don't help if you can't extract yourself. This is the decisions-follow-data-and-systems-of-record position from the Agents hub: own your context layer or stay locked in regardless of model price. (The context-ownership argument lives in the Data domain.)

The routing corollary — the barbell architecture

The practical answer to criteria 1–3 is an auto-router: a cheap open-weight model for center-of-distribution tasks, a frontier model for the hard tail. The industry name for this — confirmed across the June 2026 CNBC open-source-AI panel — is the barbell: reserve the expensive closed frontier for top-tier orchestration, planning, and final review only; route the bulk — heavy document processing, data combing, repetitive agentic loops — to cheap open weights. Agentic work consumes far more tokens than chatbot queries, so the barbell is what stops an agent pipeline from blowing its budget. The scarce thing is the talent to build that router safely, not the models. The seam where routing lives is LiteLLM and OpenRouter — build the router there, not inside this model.

5 · Open weights as a business-continuity backstop (sovereign AI)

Software access can no longer be assumed a stable global constant. Western labs have cut off access to frontier models under regulatory and export pressure, which turns “can we self-host a capable model?” from a cost question into a continuity question. An MIT-licensed, self-hostable model like GLM-5.2 is the backstop — the thing you fall back to if a hosted frontier vendor becomes unavailable to you. Two honest notes: (a) this is redundancy, not a wholesale replacement — the barbell still sends the hard tail to the frontier when it is available; and (b) the same geopolitics cuts both ways for GLM specifically (see the serving-path caveat in §02). The sovereign-AI argument is for holding open weights you control, whoever made them; for SA/EU teams it also overlaps data-sovereignty (your proprietary data stays in-house rather than transiting a foreign vendor). The residency mechanics live in the Tech and Data domains.

Reach for GLM-5.2 when

You want frontier-adjacent agentic / coding quality with the weights in your own hands
Token cost is your cost-of-goods — the harness rewrite pays back on margin
You have (or can rent) a serious GPU cluster, or trust a non-Z.ai hosted provider
You're building the cheap end of a barbell router for high-volume agentic loops
You need a self-hostable continuity backstop against frontier-vendor cut-off
POPIA / data-residency rules out routing sensitive prompts to a foreign hosted API

Skip when

You need the absolute hardest-tail reasoning — the closed frontier still edges it
You have no GPU cluster and the workload is too sensitive for a hosted Chinese API
You want a laptop / load-shedding-resilient local model — Gemma on Ollama fits, GLM doesn't
Your AI use is low-volume internal productivity — the harness rewrite won't pay back yet
You need multimodal — GLM-5.2 is text-only
You can't put the talent on a safe router and just want one drop-in frontier vendor

05 · South African context

The POPIA-friendly path — and the VRAM bill that comes with it.

The self-host story is genuinely POPIA-friendly

For SA teams, the open-weight + self-host route is the clean POPIA answer: the data never leaves your infrastructure, and there is no foreign hosted endpoint in the prompt path. Unlike a closed frontier API, GLM-5.2's MIT weights let you run inference entirely in-country — in your own rack or an SA-region GPU host — so customer data stays where the regulator expects it.

But this is a serious-cluster option, not a load-shedding laptop model

Be honest about the cost: GLM-5.2 is a serious on-prem GPU cluster decision (~800GB VRAM, 8×H100-class), not a lightweight local model. It is the opposite end of the spectrum from Gemma on Ollama, which runs FX-free and load-shedding-resilient on a maxed laptop. If you need a POPIA-friendly model that survives a power cut on a workstation, GLM is the wrong shape — reach for the lighter open-weights leaves. GLM earns its place when you have real GPU infrastructure and want frontier-adjacent quality on it.

The pragmatic SA middle path

For most mid-sized SA teams the near-term move is not a data-centre build. It is to measure the task mix, build the barbell routing muscle, and keep GLM-5.2 as the open-weight backstop you can stand up if a frontier vendor becomes unavailable — using trusted hosted access (a non-Z.ai provider) for the cheap-end volume in the meantime, with eyes open about which prompts may cross a border. Own the context layer regardless; that is the part no model price rescues.

06 · Connections