know.2nth.ai Software Planning with agents
software · Planning with agents · Skill Leaf

The backlog is the prompt.

Strip the ritual and every agile ceremony solves one problem: moving context from one head to another. Agents are stateless — they need that transfer more than humans do, on every task. So nothing about planning is obsolete; it just compiled down from meetings into files. The modern backlog stops being a to-do list and becomes executable specification under version control: the spec is the ticket, the ticket is the prompt, and the operator's job is keeping the queue full of well-specified, independently shippable work.

Every ritual was always about moving context.

Sprint planning scopes work before a line of code is written. Backlog refinement chunks a big idea into pieces one worker can hold in their head at once. A PRD gets the picture out of one brain and into a format someone else can execute. Look past the ceremony and agile has always been a set of context-transfer mechanisms — disciplines for getting what's in one person's head into another's before work starts. The practitioner argument that named this cleanly is Josh Owens' “Your AI Doesn't Need Better Prompts. It Needs a Sprint Planning Session.” (Feb 2026).

Here is the reframe the whole leaf turns on: an agent is stateless. It holds nothing between sessions, and it starts every task cold. Where a human teammate accumulates context over months and can coast on it, an agent needs the full transfer on every single task. So the planning disciplines aren't obsolete in the agent era — they're load-bearing in a way they never quite were for humans. What changes is the medium: the context that used to live in a two-hour meeting and three people's memory now has to live in files an agent can read. Planning didn't die. It compiled down from meetings into artifacts under version control.

The one-line version

A ceremony is a context-transfer protocol with a human runtime. Give it an agent runtime and the protocol survives — it just runs against committed text instead of a room full of people. The rest of this leaf is what each ceremony becomes once you write it down for a machine.

Spec-driven development: the spec is the source of truth.

In spec-driven development (SDD), a versioned, structured specification — not the code, not the Jira card — is the source of truth, and code is generated and maintained against it. It exists to answer three specific failure modes of hand-a-model-a-prompt development.

Those three failure modes: intent drift (the code slowly stops matching what anyone actually asked for), context decay (the “why” behind a decision evaporates the moment the session ends), and unverifiable output (you can't tell whether the agent did the right thing because “the right thing” was never written down). A working ticket in this model carries what a good agent brief always carried: outcomes, scope boundaries, constraints, prior decisions, a task breakdown, and verification criteria. That is not new project-management theology — it's the same brief a competent contractor would demand before quoting, made executable.

Tool | What it is | Since
GitHub Spec Kit | Open-source toolkit for spec-driven development against coding agents | Open-sourced Sept 2025
AWS Kiro | Spec-first IDE that generates requirements, design, and tasks before code | 2025
Claude Code plan mode | A read-only planning pass that produces an approved plan before any edit | 2025
OpenSpec | Open convention for structured, versioned specifications in the repo | 2025
BMAD-METHOD | Agentic agile framework wrapping planning roles around the build loop | 2025

For the requirement statements themselves, EARS notation (Easy Approach to Requirements Syntax) gives a small, testable grammar — “When <trigger>, the system shall <response>” — that turns a vague wish into something an agent can build against and a reviewer can check.
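As a rough illustration of why that grammar is checkable, the sketch below tests whether a statement fits the event-driven form quoted above. It is a minimal sketch under stated assumptions: the regular expression, the `parse_ears` and `EarsRequirement` names, and both sample sentences are hypothetical, and only the "When ..., the system shall ..." pattern is covered, not the rest of the EARS pattern set.

```python
import re
from typing import NamedTuple, Optional


class EarsRequirement(NamedTuple):
    """Hypothetical container for the two slots of the event-driven pattern."""
    trigger: str
    response: str


# Assumption: only "When <trigger>, the system shall <response>" is recognised.
_EARS_EVENT = re.compile(
    r"^When\s+(?P<trigger>.+?),\s+the\s+system\s+shall\s+(?P<response>.+?)\.?$",
    re.IGNORECASE,
)


def parse_ears(statement: str) -> Optional[EarsRequirement]:
    """Return trigger and response, or None if the statement does not fit."""
    match = _EARS_EVENT.match(statement.strip())
    if match is None:
        return None
    return EarsRequirement(match["trigger"].strip(), match["response"].strip())


# A requirement that fits the pattern, and a vague wish that does not.
req = parse_ears(
    "When a key exceeds 5 requests per minute, the system shall return HTTP 429."
)
vague = parse_ears("Exports should be rate limited somehow.")
```

A check like this could run in CI against the spec files, so a requirement that cannot be split into trigger and response is flagged before an agent is asked to build against it.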

Honest caveat — SDD is not waterfall with extra YAML

Thoughtworks' Technology Radar Vol. 33 (2025) places spec-driven development in the Assess ring (worth a look, not yet a default) and explicitly warns that heavy up-front specification and big-bang releases are an antipattern. The discipline earns its keep when specs stay small, living, and close to the code. Push it toward exhaustive up-front documents and you've reinvented waterfall. It is also a genuinely poor fit for exploratory work, where you're still discovering what to build and a firm spec would just be a confident guess.

From a human-only artifact to a machine-operable one.

Once the ticket is a structured spec, the backlog stops being a human-only list and becomes data an agent can operate on — issues in GitHub, Linear, or Jira reached over MCP, or plain markdown work items committed alongside the code. What that changes in practice is concrete: agents cluster duplicates, flag stale tickets, draft sizing, and produce the planning pre-read. The planning meeting stops being “let's build a plan while everyone watches” and starts from a complete draft — the question in the room becomes “do we agree?” instead of “let's write this from scratch.”
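To make "cluster duplicates" and "flag stale tickets" concrete, here is a minimal sketch of the drafting step. Everything in it is an assumption for illustration: the `Ticket` shape, the token-overlap (Jaccard) similarity standing in for whatever an agent would really use, the 0.6 and 90-day thresholds, and every ticket except the WORK-142 title borrowed from this page.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Ticket:
    """Hypothetical minimal ticket record; real trackers expose far more."""
    key: str
    title: str
    updated: date


def _tokens(title: str) -> set[str]:
    return set(title.lower().split())


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity between two titles, from 0.0 to 1.0."""
    ta, tb = _tokens(a), _tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def likely_duplicates(tickets, threshold=0.6):
    """Return key pairs whose titles overlap at or above the threshold."""
    pairs = []
    for i, first in enumerate(tickets):
        for second in tickets[i + 1:]:
            if jaccard(first.title, second.title) >= threshold:
                pairs.append((first.key, second.key))
    return pairs


def stale(tickets, today, max_age_days=90):
    """Return keys of tickets not updated within max_age_days."""
    cutoff = today - timedelta(days=max_age_days)
    return [t.key for t in tickets if t.updated < cutoff]


# Hypothetical backlog and dates, for illustration only.
backlog = [
    Ticket("WORK-142", "Add rate limit to export endpoint", date(2026, 5, 1)),
    Ticket("WORK-150", "Add rate limit to the export endpoint", date(2026, 5, 20)),
    Ticket("WORK-077", "Migrate billing cron to queue", date(2025, 11, 3)),
]
duplicate_pairs = likely_duplicates(backlog)
stale_keys = stale(backlog, today=date(2026, 6, 1))
```

The output is a pre-read, not a decision: the pairs and the stale list go to the people in the room, who choose what to merge, close, or rank.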

The hard boundary — drafting is not deciding

AI is good at clustering and drafting. It is bad at deciding business priority, and pretending otherwise is how teams automate themselves into shipping the wrong thing efficiently. The division of labour that holds: the human ranks the backlog (priority is a judgement about the business, not a text-prediction task), and the engineer who ships the work owns the estimate. The agent prepares; the people decide.

A markdown work item, committed in-repo — runnable-in-spirit
## [WORK-142] Add rate-limit to the /export endpoint

outcome:     /export refuses more than 5 requests/min per API key,
             returning HTTP 429 with a Retry-After header.
scope:       api/export handler + shared limiter. No UI changes.
constraints: Limiter state in existing Redis. No new deps.
decisions:   Per-key, not per-IP (keys already authenticated).
verify:
  - Unit: 6th call within 60s returns 429 + Retry-After.
  - Unit: counter resets after the window.
  - CI: existing /export tests still pass.
context:     See software-for-agents.html §grounding; limiter at
             api/lib/limiter.ts.

That item is small enough for one agent session, states its own acceptance criteria, points at the live code to ground against, and can be verified without a human reading the agent's reasoning — only its diff and its passing tests. A backlog full of items shaped like this is the prompt library the operator runs against.
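One way to keep every item in that shape is a small lint step that refuses work items with missing fields. The sketch below is an assumption, not part of any tool named on this page: the parser, the `REQUIRED_FIELDS` tuple (taken from the field names in the example item), and the deliberately incomplete sample are all hypothetical.

```python
import re

# Field names taken from the example work item; treating all as required is an assumption.
REQUIRED_FIELDS = ("outcome", "scope", "constraints", "decisions", "verify", "context")

# A field starts at column 0: lowercase name, colon, optional value.
_FIELD = re.compile(r"^(?P<name>[a-z]+):\s*(?P<value>.*)$")


def parse_work_item(text: str) -> dict[str, str]:
    """Parse a 'name: value' work item, folding indented continuation lines."""
    fields: dict[str, str] = {}
    current = None
    for line in text.splitlines():
        if line.startswith("## "):
            fields["title"] = line[3:].strip()
            current = None
            continue
        match = _FIELD.match(line)
        if match:
            current = match["name"]
            fields[current] = match["value"].strip()
        elif current and line.strip():
            # Indented continuation: append to the field currently being read.
            fields[current] = (fields[current] + " " + line.strip()).strip()
    return fields


def missing_fields(fields: dict[str, str]) -> list[str]:
    """List required fields that are absent or empty."""
    return [name for name in REQUIRED_FIELDS if not fields.get(name)]


# Deliberately incomplete copy of the item above, to show the check firing.
ITEM = """\
## [WORK-142] Add rate-limit to the /export endpoint

outcome:     /export refuses more than 5 requests/min per API key,
             returning HTTP 429 with a Retry-After header.
scope:       api/export handler + shared limiter. No UI changes.
verify:
  - Unit: 6th call within 60s returns 429 + Retry-After.
"""

item = parse_work_item(ITEM)
gaps = missing_fields(item)
```

Here `gaps` names the three fields the trimmed item lacks, which is the signal to send it back for refinement before it reaches an agent.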

When spec-to-commit is minutes, two weeks stops being the unit.

The two-week sprint was calibrated to a human build loop. When spec-to-first-commit compresses to minutes, that unit stops being natural — you can cycle through several well-specified items in the time a planning meeting used to take. The habit is sticky, though: State of Agile data still shows roughly 59% of teams on two-week sprints, a share that has declined every year since 2022 (via Cadence, “How to plan a software development sprint in 2026”, June 2026). The mainstream alternatives are flow-based cycles (Linear-style continuous flow) and Shape Up appetites (fixed time, variable scope).

Practical guidance, not dogma

  • Pick cycle length by release cadence, not by convention — the cycle should match how often you actually ship, which for many agent-assisted teams is now continuous.
  • Keep a ~20% interrupt buffer when committing to a cycle: agent throughput makes it tempting to pack the cycle to 100%, which leaves no room for the review and rework that autonomy generates.
  • Move refinement out of the planning meeting into an AI-assisted 30-minute triage before it — the agent produces the pre-read; the humans spend the meeting deciding, not drafting.

Velocity was calibrated on typing speed. That constraint moved.

Story-point velocity is an estimate of how much a team can write in a cycle. When agents write most of the code, the constraint stops being how fast anyone types and shifts to spec quality and review capacity — and a metric calibrated on the old bottleneck goes haywire against the new one. What survives the shift is measurement of throughput and cycle-time on merged, verified work, and probabilistic forecasting off actual completion data, which beats ceremony-driven estimation precisely because it doesn't care who wrote the code, only what shipped and passed.
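Probabilistic forecasting off completion data can be sketched as a small Monte Carlo simulation: resample past weekly throughput until the backlog is cleared, then read percentiles off the results. This is a minimal sketch under assumptions: the `history` numbers are invented, the function names are hypothetical, and weeks are treated as independent draws, which real teams rarely are.

```python
import random


def forecast_weeks(weekly_throughput, backlog_size, runs=5000, seed=7):
    """Simulate weeks needed to clear backlog_size items; return sorted outcomes."""
    if max(weekly_throughput) <= 0:
        raise ValueError("need at least one week with merged, verified work")
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    outcomes = []
    for _ in range(runs):
        remaining, weeks = backlog_size, 0
        while remaining > 0:
            # Draw one past week at random and assume it repeats.
            remaining -= rng.choice(weekly_throughput)
            weeks += 1
        outcomes.append(weeks)
    outcomes.sort()
    return outcomes


def percentile(sorted_values, pct):
    """Nearest-rank style percentile over an already sorted list."""
    index = min(len(sorted_values) - 1, int(len(sorted_values) * pct / 100))
    return sorted_values[index]


# Hypothetical history: merged and verified items per week.
history = [6, 9, 4, 11, 7, 8, 5, 10]
weeks = forecast_weeks(history, backlog_size=40)
p50, p85 = percentile(weeks, 50), percentile(weeks, 85)
```

Reporting the pair ("even odds by week `p50`, 85% confident by week `p85`") keeps the forecast tied to what actually shipped rather than to estimated points.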

State the recalibration honestly, because someone will wave a chart: velocity numbers will move sharply, and stakeholder expectations need resetting. A bigger number is not a better forecast if review is the bottleneck: code drafted but stuck behind a human reviewer isn't delivered, and counting it as velocity just moves the lie downstream. What to watch instead of raw velocity:

  • First-pass success rate. Share of agent tasks that pass review and CI on the first attempt. The truest read on spec quality.
  • Review queue depth. How many finished-but-unreviewed PRs are waiting. The new bottleneck, made visible.
  • CI pass rate on agent PRs. How often agent changes clear the pipeline unaided. Falling rate = specs or tests degrading.
  • Rework rate. Share of merged work later reverted or reopened. Catches the “shipped fast, wrong anyway” failure.
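The four signals above reduce to simple ratios over pull-request records. The sketch below shows one possible reading, with loud assumptions: the `AgentPR` fields are hypothetical, "clears the pipeline unaided" is taken to mean CI passing on the first run, first-pass success is measured only over PRs that have already been reviewed, and the sample data is invented.

```python
from dataclasses import dataclass


@dataclass
class AgentPR:
    """Hypothetical per-PR record; map these flags from your own tracker and CI."""
    ci_passed_first_run: bool
    approved_first_review: bool
    merged: bool
    awaiting_review: bool
    reverted_or_reopened: bool


def _share(part: int, whole: int) -> float:
    return part / whole if whole else 0.0


def planning_signals(prs):
    """Compute the four signals from a list of AgentPR records."""
    reviewed = [p for p in prs if not p.awaiting_review]
    merged = [p for p in prs if p.merged]
    return {
        # Passed both CI and review at the first attempt, among reviewed PRs.
        "first_pass_success_rate": _share(
            sum(p.ci_passed_first_run and p.approved_first_review for p in reviewed),
            len(reviewed),
        ),
        # Finished but still waiting on a human.
        "review_queue_depth": sum(p.awaiting_review for p in prs),
        # Cleared the pipeline on the first run, across all agent PRs.
        "ci_pass_rate": _share(sum(p.ci_passed_first_run for p in prs), len(prs)),
        # Merged work that later came back.
        "rework_rate": _share(sum(p.reverted_or_reopened for p in merged), len(merged)),
    }


# Invented sample: three reviewed and merged PRs, one still in the queue.
sample = [
    AgentPR(True, True, True, False, False),
    AgentPR(True, False, True, False, True),
    AgentPR(False, True, True, False, False),
    AgentPR(True, False, False, True, False),
]
signals = planning_signals(sample)
```

Tracked per cycle, a falling first-pass rate alongside a growing queue depth points at spec quality and review capacity, which is where the constraint now sits.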

Planning's job is keeping the operator's queue full.

The throughline across this branch is the operator: one person commanding several agents in parallel, with human specialists in the loop for review and judgement. In that model, planning has a precise, humble job — keep the operator's queue full of well-specified, independently shippable work items, each small enough for one agent session and each gated by CI and PR review exactly as software/software-for-agents describes. Planning isn't running the build anymore; it's feeding it clean inputs.

What makes this planning model safe

Everything here rests on the machinery from software-for-agents: the repo as context, a CLAUDE.md that orients the agent, SKILL.md procedures it can load, and CI as the safety net that gates every change. A spec-driven backlog is only safe because each item lands as a branch, runs the full pipeline, and waits for a human to approve the diff. Take away the harness and “let agents work the backlog” becomes the reckless sentence it sounds like. With the harness, it's a controlled, auditable operation — the queue is full, and every item that clears it was reviewed and is reversible.

Where this costs more than it returns — and the objection to answer.

No leaf in this branch ships without the counter-case. Spec-driven, agent-operated planning is not free, and it is not always worth it.

The real costs

  • Automation noise. Every auto-generated ticket, sync, and status update creates notification load. Past a point, the drafting that was supposed to save attention drowns the signal it was meant to surface.
  • Governance gap. Deloitte's State of AI 2026 reports only about one in five companies has a mature governance model for autonomous agents. Most teams are running the practice ahead of the guardrails.
  • Quality tail-risk. A Feb 2026 large-scale empirical study (arXiv preprint) counted 110,000+ surviving AI-introduced issues in production repositories. Unit tests catch regressions; they do not catch architectural drift, which accumulates silently under a green pipeline.
  • Spec overhead is a real line item. For small teams, low feature volume, or fast-pivoting pre-PMF work, the specification machinery costs more than it returns — you're documenting decisions you'll reverse next week.

“Human gates are just a hindrance” — the objection, answered

The position exists and deserves a straight answer: if agents can produce a passing PR in minutes, isn't a human reviewer just a bottleneck slowing the machine down? No. The gate isn't there because agents can't type; it's there for risk, compliance, and accountability. Someone has to be answerable for what shipped, a regulator will ask who approved it, and “the model did” is not an answer that a review under POPIA (South Africa's Protection of Personal Information Act) or by the FSCA (Financial Sector Conduct Authority) accepts. Review is not friction to be optimised away; it's where responsibility lives. The correct optimisation is making the diff easy to review (small, well-specified, well-tested), not removing the reviewer.

Frontier-grade leverage on hygiene, not budget.

The affordable path — and the audit trail comes free

Same move as software-for-agents: this is leverage South African teams can afford. A spec-and-PR planning discipline gives you frontier-grade agent output on hygiene, not budget: you don't need a research team or a frontier bill, you need work items written well enough for an agent to execute and a pipeline that gates them. And the by-product is exactly what regulators ask for: the versioned spec plus the PR trail is the change-control audit artifact a POPIA or FSCA review wants (who changed what, why, who approved it, and how it was verified), produced automatically as you work, not assembled in a panic before an audit.

Primary sources only.

Also cited inline: “Spec-Driven Development: From Code to Contract in the Age of AI” (arXiv preprint, Feb 2026); Deloitte, State of AI 2026; Cadence, “How to plan a software development sprint in 2026” (June 2026); Josh Owens, “Your AI Doesn't Need Better Prompts. It Needs a Sprint Planning Session.” (Feb 2026). Linked here only where a stable primary URL resolves.