agents · AutoGen / AG2 · Skill Leaf

Agents in conversation. Debate to converge.

AutoGen is Microsoft Research's multi-agent conversation framework. Released in late 2023, it pioneered the idea that multi-agent systems should look like a group chat — agents take turns talking, debate, refine answers, and converge through dialogue rather than explicit graph orchestration. Two flavours coexist in 2026: Microsoft's AutoGen v0.4+ (the modular rewrite under active Microsoft Research development) and AG2 (the community fork that continues the original v0.2 design). Both are MIT-licensed Python frameworks. The strongest fit is research workflows, agent-debate experiments, and academic / R&D use cases — with the honest caveat that production tooling lags LangGraph and CrewAI.

Live · v0.4+ (Microsoft) · AG2 (community) · MIT licensed · Python · Microsoft Research origin · Research-strong, production-mixed

A multi-agent framework where the metaphor is conversation.

AutoGen is a Python framework for building multi-agent systems where the agents primarily communicate through structured conversation. Released by Microsoft Research in October 2023 as one of the first serious multi-agent frameworks, it pioneered patterns that other frameworks later borrowed: a "group chat" with multiple LLM-driven agents and optionally a human, the ConversableAgent base class, and the use of agent-to-agent dialogue as the primary control flow.

The thesis: real reasoning often emerges from disagreement and debate. A single agent monologue tends to commit early to one approach; multiple agents arguing through a problem produces better answers on hard tasks — especially research, planning, and analysis where multiple perspectives genuinely matter. AutoGen codified this pattern when most frameworks were still single-agent ReAct loops.

What confuses people in 2026 is the framework's two-track present. After the original v0.2 design caught on, Microsoft restructured the project: Microsoft AutoGen v0.4+ is a substantial rewrite with a more modular, message-passing-driven architecture, maintained inside the AutoGen GitHub org. AG2 (originally also called "AutoGen 2") is the community-led fork that continued the original v0.2 design and APIs, governed independently. Both are still active. Both work well. They share design DNA but diverge on specific APIs — pick one and stick with it; mixing examples from both creates pain.

Which one to use?

For new projects in 2026, the honest pragmatic answer: start with Microsoft AutoGen v0.4+ if you want active Microsoft research backing, the newest features, and Magentic-One (their multi-agent research framework built on top of AutoGen). Use AG2 if you want continuity with the v0.2 patterns most older tutorials and papers use, or if you prefer community-governed projects. Both are MIT, both are production-capable for research-shaped workloads, and both run against any LLM. Don't agonise over the choice — the design ideas transfer.

ConversableAgent + GroupChat + an optional human.

Two primitives carry most use cases. ConversableAgent represents an LLM-driven participant; UserProxyAgent represents a human (or a programmatic stand-in for a human, used to drive automated experiments). Several agents go into a GroupChat, and a GroupChatManager picks who speaks next. That's the structural core.

The minimal AG2 (v0.2-style) example — a researcher and a critic debating:

# pip install ag2  (or autogen-agentchat for v0.4+)
from autogen import AssistantAgent, UserProxyAgent, GroupChat, GroupChatManager

# v0.2-style config: a config_list of model entries
llm_config = {"config_list": [{"model": "gpt-5", "api_key": "..."}]}

researcher = AssistantAgent(
    name="researcher",
    system_message="You research SA fintech trends with rigor.",
    llm_config=llm_config,
)

critic = AssistantAgent(
    name="critic",
    system_message="You challenge weak claims and ask for evidence.",
    llm_config=llm_config,
)

# Programmatic stand-in for a human; disable code execution since
# this chat is pure debate.
human = UserProxyAgent(
    name="human",
    human_input_mode="NEVER",
    code_execution_config=False,
)

groupchat = GroupChat(
    agents=[human, researcher, critic],
    messages=[],
    max_round=10,
)
manager = GroupChatManager(groupchat=groupchat, llm_config=llm_config)

human.initiate_chat(
    manager,
    message=(
        "Identify 3 emerging SA fintech trends. "
        "Researcher proposes; critic challenges; reach consensus."
    ),
)

The conversational loop. The GroupChatManager uses an LLM to pick the next speaker based on the conversation so far and each agent's role description. Agents take turns; the chat continues until a termination condition fires (max rounds, a specific keyword, or the manager deciding the conversation has converged). The human or a programmatic stand-in can interject at any point.
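The keyword-style termination condition is usually expressed as a predicate. A minimal sketch in the shape of AG2's `is_termination_msg` parameter on `ConversableAgent` — the predicate receives a message dict and returns `True` to stop the chat (the `TERMINATE` keyword is the common v0.2 convention; the exact check here is illustrative):

```python
# A termination predicate in the shape AG2's ConversableAgent accepts
# via is_termination_msg: it receives the message dict and returns
# True when the chat should stop.
def is_termination_msg(message: dict) -> bool:
    content = (message.get("content") or "").strip()
    # Stop when an agent signals consensus with the keyword.
    return content.endswith("TERMINATE")

# Wired up (sketch): AssistantAgent(..., is_termination_msg=is_termination_msg)
print(is_termination_msg({"content": "We agree. TERMINATE"}))   # True
print(is_termination_msg({"content": "More evidence needed."})) # False
```

Max-round limits and predicates compose: whichever fires first ends the conversation.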

v0.4+ changes the surface API meaningfully — messages are now first-class objects, agents communicate through a runtime, and the architecture is built for distributed deployment. The mental model is the same; the code looks different. The autogen-agentchat package provides v0.2-compatible high-level APIs over the v0.4 runtime if you want both: modern architecture, familiar API.
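To make the architectural shift concrete, here is a toy sketch of the v0.4 mental model — messages as first-class objects routed through a runtime, rather than agents calling each other directly. All names here (`Message`, `Runtime`) are hypothetical illustrations of the idea, not the real autogen-core API:

```python
from dataclasses import dataclass

# Hypothetical names for illustration only -- not the autogen-core API.
@dataclass
class Message:
    sender: str
    recipient: str
    content: str

class Runtime:
    """Routes Message objects to registered agent handlers."""
    def __init__(self):
        self.handlers = {}  # agent name -> callable(Message) -> Message | None
        self.log = []       # every message that passed through the runtime

    def register(self, name, handler):
        self.handlers[name] = handler

    def send(self, message):
        # Deliver, log, and keep routing until a handler returns None.
        while message is not None:
            self.log.append(message)
            message = self.handlers[message.recipient](message)

rt = Runtime()
rt.register("critic", lambda m: Message("critic", "researcher", f"Challenge: {m.content}"))
rt.register("researcher", lambda m: None)  # no reply: exchange ends
rt.send(Message("researcher", "critic", "Trend: instant payments"))
print([m.content for m in rt.log])
# ['Trend: instant payments', 'Challenge: Trend: instant payments']
```

Because every message crosses a runtime boundary, logging, interception, and distributed deployment fall out of the architecture rather than being bolted on — which is the point of the v0.4 rewrite.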

Tools and code execution. AutoGen's signature feature from day one was code-executing agents — an agent that writes Python, runs it in a sandbox, observes output, and refines. Useful for analytical tasks where the agent benefits from actually computing rather than reasoning about computation. v0.4 cleaned up the security model around this; in production, run code execution in containers with explicit sandboxing.
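The write-run-observe loop reduces to: write the snippet to a scratch directory, run it in a subprocess, capture what happened. A minimal local sketch of that loop shape — not AutoGen's own executor classes, which you should prefer in practice precisely because they add container-backed sandboxing:

```python
import os
import subprocess
import sys
import tempfile

def run_snippet(code: str, timeout: int = 10) -> tuple[int, str]:
    """Execute a Python snippet in a scratch directory and capture output.
    A bare-bones sketch of the write-run-observe loop; real deployments
    should use container-backed executors for isolation."""
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "snippet.py")
        with open(path, "w") as f:
            f.write(code)
        result = subprocess.run(
            [sys.executable, path],
            capture_output=True, text=True, timeout=timeout, cwd=workdir,
        )
    return result.returncode, result.stdout or result.stderr

rc, out = run_snippet("print(sum(range(10)))")
print(rc, out.strip())  # 0 45
```

The agent loop then feeds `out` back into the conversation: the model reads its own program's output, notices errors or surprising results, and revises.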

Where conversation beats orchestration

The structural advantage AutoGen has over LangGraph or CrewAI: tasks where the answer benefits from explicit disagreement. Research where one agent advocates and another critiques; planning where one agent proposes and another stress-tests; coding where one agent writes and another reviews. These work because the conversation surface forces both agents to engage with each other's reasoning — producing better outputs than either alone. If your task naturally decomposes into "A says X, B should challenge X," AutoGen / AG2 is the right framework.
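The "A proposes, B challenges" pattern is just a bounded loop that ends when the critic runs out of objections. A sketch with canned stand-ins for the two LLM calls (real agents would call a model where `propose` and `critique` sit; both functions here are illustrative stubs):

```python
# The propose/critique loop in miniature, with canned stand-ins for the
# two LLM calls a real researcher/critic pair would make.
def propose(feedback):
    # A real proposer would revise its draft using the critic's feedback.
    return "Plan v2 with rollback step" if feedback else "Plan v1"

def critique(proposal):
    # A real critic would probe the proposal; None means "no objections".
    return None if "rollback" in proposal else "Missing a rollback step"

proposal, feedback = None, None
for _ in range(5):           # bounded, like GroupChat's max_round
    proposal = propose(feedback)
    feedback = critique(proposal)
    if feedback is None:     # critic has no objections: converged
        break

print(proposal)  # Plan v2 with rollback step
```

The conversation surface is what makes this work: the proposer must answer the critic's specific objection, not just regenerate.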

Two libraries, AutoGen Studio, Magentic-One on top.

The framework's ecosystem in 2026 is split between Microsoft and the community. Both publish their own packages, docs, and tutorials. The split is real but pragmatically manageable.

Library · Microsoft

AutoGen v0.4+

Python package autogen-agentchat (high-level) + autogen-core (runtime). Microsoft Research-backed. Modular architecture, distributed-deployment-ready, active development cadence.

Library · community

AG2

Python package ag2. Continues the original v0.2 API. Community-led after Microsoft restructured. Independent governance, MIT, broadly compatible with the v0.2 examples in older tutorials and research papers.

Tool · visual

AutoGen Studio

Microsoft's no-code / low-code UI for designing and running AutoGen workflows. Useful for prototyping conversation patterns and evaluating outputs before committing to code.

Framework · on top

Magentic-One

Microsoft's research-grade multi-agent framework built on AutoGen v0.4. Specialised orchestrator + four pre-built agents (FileSurfer, WebSurfer, Coder, ComputerTerminal) for "complete tasks the way a research assistant would." Open source, still research-leaning.

Production tooling lags the research output

AutoGen / AG2 has been more researcher-focused than production-focused since launch. Compared to LangGraph (LangSmith observability), CrewAI (CrewAI Enterprise), or vendor SDKs (built-in tracing), AutoGen's production-grade tooling story is thinner. Microsoft Azure offers some integration paths, but expect to build observability, retry logic, and audit trails yourself if AutoGen is your production framework. For research and R&D — the framework's strongest fit — this isn't a blocker. For audit-heavy production agents, it is.

Where AutoGen / AG2 fits.

All four production-agent frameworks make different bets. AutoGen's bet is conversational debate. CrewAI's is role-based teams. LangGraph's is explicit graph control. Vendor SDKs bet on first-party model alignment. Each is right for a specific shape of project.

Dimension | AutoGen / AG2 | CrewAI | LangGraph
Mental model | Group chat · debate · converge | Roles · goals · tasks | Graph nodes · edges · state
Best for | Research, planning, agent-debate | Role-decomposable workflows | Compliance-heavy, audit-driven
Production tooling | Lean — build it yourself | CrewAI Enterprise (paid) | LangSmith (paid)
Code execution | First-class since v0.2 | Via tools | Via tool nodes
Visual UI | AutoGen Studio | CrewAI Studio (in Enterprise) | LangGraph Studio
Multi-vendor LLMs | Yes (any compatible API) | Yes (LiteLLM) | Yes (any model)
Worst fit | Production agents needing audit; latency-sensitive flows | Workflows that don't decompose into roles | Single-shot agents; non-graph thinkers

The three multi-agent frameworks side-by-side

If you're choosing a multi-agent framework in 2026, you're effectively choosing between AutoGen / AG2, CrewAI, and LangGraph. AutoGen / AG2 wins on agent-debate research and code-execution-heavy work. CrewAI wins on role-decomposable workflows and prototyping speed. LangGraph wins on production-grade audit and explicit control. Vendor SDKs (Anthropic, OpenAI) sit one layer below this — they're the framework choice when the model is the load-bearing pick. Many production stacks combine two or three via A2A.

Where AutoGen / AG2 plays best.

Six patterns that play to the conversational-debate metaphor and code-executing-agent strength. Most are research-leaning rather than customer-facing; that's not a bug — that's the framework's structural fit.

  • Multi-agent research workflows — the canonical use case. Researcher + critic + summariser agents debate a topic and produce a balanced output. Common pattern in academic / R&D applications and the source of most early AutoGen papers.
  • Code-executing analysis agents — an agent writes Python, runs it in a sandbox, observes output, refines. Strong fit for data analysis, ML experiments, mathematical proofs. AutoGen's code-execution surface was designed for this from day one.
  • Planning + critique workflows — one agent proposes a plan; another stress-tests for failure modes; they converge through revision. Useful for project planning, architecture decisions, risk analysis.
  • Conversational evaluations and red-teaming — multi-agent debate as an evaluation method, where critic agents probe for hallucinations, weak reasoning, or policy violations in another agent's outputs. Increasingly used in eval pipelines.
  • Magentic-One research-assistant pattern — web research, file analysis, code execution, computer-use. Microsoft's reference implementation of "the research assistant agent" built on AutoGen v0.4. Useful as a starting template for similar workflows.
  • Educational / curriculum agents — tutor + student + evaluator agents working through a problem. Common pattern in academic AI / education research; fits the conversation metaphor naturally.

Pick AutoGen / AG2 when. Skip when.

The honest framing: AutoGen / AG2 is the right choice for research-shaped work and agent-debate patterns. It's a weaker choice for production agents that need audit, observability, or vendor-tight integrations. Pick the framework that fits the shape of your project — for research that shape is often AutoGen.

Use AutoGen / AG2 when

  • The task benefits from explicit agent-to-agent debate or critique
  • Code-executing agents are load-bearing (analysis, ML experiments, math)
  • Research, R&D, or academic work where novel agent patterns matter
  • You're building eval / red-teaming infrastructure that uses multi-agent dialogue
  • Magentic-One's research-assistant pattern fits your use case
  • You don't need vendor-tight tracing or paid observability tools
  • Multi-vendor LLM strategy with no preference for a specific cloud

Where AutoGen / AG2 lands in SA delivery work.

AutoGen / AG2 has a narrow but valuable role in SA work: research projects, academic / university partnerships, evaluation infrastructure, and the small subset of production work where conversational debate is structurally the right pattern. For mainstream production agent work, other frameworks fit better.

Academia · SA universities and research labs

For SA universities (Wits, UCT, Stellenbosch) and research-leaning organisations, AutoGen / AG2 is genuinely useful. The framework's strength on multi-agent research patterns aligns with academic AI work, and the open-source MIT licence removes commercial barriers. SA AI research groups working on multi-agent debate, evaluation methods, or curriculum agents will find AutoGen the right primary framework.

Studio · eval / red-team infrastructure

For SA studios that need to build evaluation infrastructure for client projects — "test this customer-support agent for hallucinations" or "red-team this RAG system" — AutoGen's multi-agent dialogue patterns are a strong fit. Pair AutoGen for the eval layer with LangGraph or vendor SDKs for the production agent itself; AutoGen runs the tests, the production framework runs the work.

Production caution

For SA enterprise production agents (banking, insurance, healthcare), AutoGen / AG2 is rarely the right primary framework. The thinner production tooling story creates audit and observability gaps that compliance-heavy domains struggle with. Use AutoGen patterns where they shine (research, eval, code-execution) and choose LangGraph or vendor SDKs for the customer-facing production layer. The frameworks aren't mutually exclusive — an AutoGen-driven eval layer testing a LangGraph-driven production agent is a sensible architecture.

Where AutoGen / AG2 links in the tree.

agents
Agents hub
The sub-tree landing. AutoGen / AG2 sits in the Frameworks band as the conversational-debate option alongside ADK, LangGraph, CrewAI, and the vendor SDKs.
agents/langgraph
LangGraph
The major comparison anchor for production work. Where AutoGen wins on debate / research, LangGraph wins on explicit graph audit. Often combined: AutoGen for eval, LangGraph for production.
agents/crewai
CrewAI
The other major multi-agent framework. AutoGen is debate / converge; CrewAI is roles / tasks. Different mental models, different fits.
agents/google-adk
Google ADK
ADK provides hierarchical agent orchestration with managed runtime. ADK + AutoGen via A2A: ADK for production orchestration, AutoGen for sub-tasks where debate matters.
agents/anthropic-agent-sdk
Anthropic Agent SDK
First-party Claude SDK. AutoGen runs against Claude via the OpenAI-compatible Anthropic endpoint or directly. Common pairing for research workflows on Claude.
agents/openai-agents-sdk
OpenAI Agents SDK
First-party GPT SDK. AutoGen and OpenAI SDK overlap on patterns — OpenAI's Handoffs are similar to AutoGen's GroupChat. Pick OpenAI SDK for production GPT agents; AutoGen for research patterns.
agents/a2a
A2A Protocol
Cross-framework interop. Community A2A wrappers exist for AutoGen and AG2; AutoGen sub-systems can plug into A2A-orchestrated production stacks.
agents/ollama
Ollama
AutoGen / AG2 runs against any OpenAI-compatible endpoint, including Ollama. Pattern: AutoGen multi-agent research against a local Gemma 3 served by Ollama for cost-free experimentation.
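Pointing the v0.2-style `llm_config` at a local server is just a `base_url` swap. A sketch assuming Ollama's default port and its OpenAI-compatible `/v1` endpoint, with a locally pulled Gemma 3 build (the exact model tag depends on what you've pulled):

```python
# v0.2-style llm_config aimed at a local Ollama server. Ollama's
# OpenAI-compatible endpoint lives at /v1; the api_key is unused by
# Ollama but required by the client, so any placeholder works.
llm_config = {
    "config_list": [
        {
            "model": "gemma3",                       # assumes: ollama pull gemma3
            "base_url": "http://localhost:11434/v1",  # Ollama's default port
            "api_key": "ollama",                      # placeholder; ignored
        }
    ]
}
print(llm_config["config_list"][0]["base_url"])
```

Everything else — agents, group chat, manager — stays exactly as in the earlier example; only the config changes.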

Primary sources only.