Choosing between the Claude Agent SDK, LangGraph, and CrewAI
Three tools for the same job, each with a different shape. Here is how we choose between them in real Australian client engagements — and what each is quietly bad at.
If you narrow the 2026 AI orchestration stack to one question, it is this: which agent framework are you standardising on? The choice matters because the framework defines how agents hand off, how state is persisted, how humans intervene, how failures propagate, and how much engineering effort sits between you and a production deploy. Pick badly and the next two years are a rewrite.
We evaluate three frameworks most often in Australian client engagements: the Claude Agent SDK from Anthropic, LangGraph from the LangChain team, and CrewAI. Each is good at something different. Here is how we choose between them.
Claude Agent SDK — the production-first choice
The Claude Agent SDK is Anthropic’s first-party runtime for agents. It is the one we default to when the engagement has two characteristics: the workload benefits from Claude’s reasoning strengths (long documents, code, structured analysis, tool use with guardrails) and the client wants the fewest moving parts between prototype and production.
What it is quietly good at: durable multi-step tool use, memory across sessions, native MCP support, safety guardrails that actually work, and observability that maps cleanly to Anthropic’s billing and usage surface. You get telemetry for free. If something breaks you can trace the bad response back to the prompt that produced it without reinventing logging.
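The loop the SDK owns for you, step-by-step tool use with every step logged against the originating prompt, can be sketched in framework-agnostic Python. This is the shape of the pattern, not the Agent SDK's actual API: `call_model`, the tool registry, and the message format are all illustrative stand-ins.

```python
import json
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("agent")

# Illustrative tool registry -- a stand-in, not the Agent SDK's API.
TOOLS = {
    "add": lambda args: args["a"] + args["b"],
}

def call_model(messages):
    """Stubbed model: requests the add tool once, then answers."""
    if not any(m["role"] == "tool" for m in messages):
        return {"tool": "add", "args": {"a": 2, "b": 3}}
    return {"answer": f"The sum is {messages[-1]['content']}"}

def run_agent(prompt):
    messages = [{"role": "user", "content": prompt}]
    for step in range(10):  # hard cap so a runaway loop cannot spin forever
        reply = call_model(messages)
        # Every step is logged against the originating prompt, so a bad
        # response can be traced back to the context that produced it.
        log.info("step=%d prompt=%r reply=%s", step, prompt, json.dumps(reply))
        if "answer" in reply:
            return reply["answer"]
        result = TOOLS[reply["tool"]](reply["args"])
        messages.append({"role": "tool", "content": result})

print(run_agent("What is 2 + 3?"))  # -> The sum is 5
```

The point of the sketch is what you do not have to write when the SDK owns this loop: the step cap, the tool dispatch, and the per-step trace all come with the runtime.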
What it is quietly bad at: multi-model workloads. If your architecture requires a Gemini long-context pass, a GPT-class voice stage, and a Claude reasoning stage all in the same agent run, the Agent SDK will work but you are using it against the grain. It is opinionated about Claude. That is normally a feature, occasionally a constraint.
We pick it when: the workload is Claude-native, the team is small, and the timeline matters. Most single-vendor enterprise engagements, most internal-tool deployments, most assistant-style products.
LangGraph — the stateful-graph choice
LangGraph is the graph-based orchestration layer from the LangChain team. It is the right choice when the workflow looks like a state machine — branching paths, retries, human-in-the-loop checkpoints, conditional routing, parallel fan-out. If you can whiteboard the workflow as a directed graph with interesting edges, LangGraph is usually the closest fit.
What it is quietly good at: stateful workflows that persist across long horizons, model-agnostic execution (it does not care whether a node calls Claude, GPT, Gemini, or a local Llama), and explicit human-in-the-loop patterns. It gives you a first-class way to say “pause here, wait for a human, resume.” That is harder than it sounds in most agent frameworks.
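The "pause here, wait for a human, resume" pattern can be sketched in plain Python. This shows the shape of the pattern, not LangGraph's actual API; the node names, state keys, and `PAUSE` sentinel are illustrative.

```python
from dataclasses import dataclass

@dataclass
class Run:
    state: dict
    node: str = "draft"
    paused: bool = False

# Illustrative nodes: each returns the name of the next node (an edge).
def draft(state):
    state["doc"] = f"draft of {state['topic']}"
    return "review"      # conditional edge: route to human review

def review(state):
    return "PAUSE"       # checkpoint: wait for a human decision

def publish(state):
    state["published"] = True
    return "END"

NODES = {"draft": draft, "review": review, "publish": publish}

def step(run):
    """Advance the graph until it ends or hits a human checkpoint."""
    while run.node != "END":
        nxt = NODES[run.node](run.state)
        if nxt == "PAUSE":
            run.paused = True
            return run   # persist `run` here and wait for the human
        run.node = nxt
    return run

run = step(Run(state={"topic": "Q3 report"}))
assert run.paused                          # checkpointed at review
run.node, run.paused = "publish", False    # human approves; resume
step(run)
print(run.state["published"])              # -> True
```

What makes this hard in most agent frameworks is the middle step: the entire `Run` has to be serialisable and durable so the resume can happen hours or days later, on a different process. That persistence is the part LangGraph gives you first-class.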
What it is quietly bad at: everything around the graph. You still need to own deployment, secrets, observability, retries, and the runtime that hosts the graph. LangGraph solves orchestration; it does not solve operations. Pair it with Temporal, or build the operational layer yourself.
We pick it when: the workflow has real branching logic, human review is central to the design, and the client has an engineering team that can own the surrounding operational surface.
CrewAI — the role-based multi-agent choice
CrewAI models agents as roles on a team — a researcher, an analyst, a writer, a reviewer — each with a goal, backstory, tools, and collaboration rules. It maps unusually well to workflows that already exist as human processes, because humans already do their work this way.
What it is quietly good at: rapidly prototyping multi-agent workflows whose shape is “research, then analyse, then draft, then review.” The role abstraction is genuinely productive for non-engineers. We have had operations leads describe a crew over coffee that engineers then implemented in a morning.
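The role abstraction can be sketched in a few lines of plain Python. These are illustrative role names and a hand-rolled sequential handoff, not CrewAI's actual classes; each `act` stands in for an LLM call made under that role's persona.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Agent:
    role: str
    goal: str
    act: Callable[[str], str]  # stand-in for an LLM call with this persona

# Illustrative crew: sequential handoff, each role's output feeds the next.
crew = [
    Agent("researcher", "gather facts", lambda t: f"facts about {t}"),
    Agent("writer", "draft the piece", lambda t: f"draft from [{t}]"),
    Agent("reviewer", "check the draft", lambda t: f"approved: {t}"),
]

def kickoff(crew, task):
    output = task
    for agent in crew:
        output = agent.act(output)
        print(f"{agent.role}: {output}")
    return output

result = kickoff(crew, "agent frameworks")
```

The reason non-engineers can specify this is visible in the structure: the crew is just the org chart, and the handoff order is the process they already run by hand.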
What it is quietly bad at: determinism and safety boundaries. The role abstraction encourages agents to be autonomous, which is exactly what you do not want in a regulated workflow where every handoff needs an audit trail. CrewAI can be constrained, but the defaults push in the wrong direction for APRA-facing or ATO-facing work.
We pick it when: the workflow maps cleanly to human roles, speed-to-prototype is the priority, and the consequences of a bad agent action are reversible. Content production, research summarisation, internal knowledge work.
The decision we actually make
In the last eight engagements we have run, the split has been roughly: four Claude Agent SDK, three LangGraph (often paired with Temporal for durability), one CrewAI. The Agent SDK wins when the workload is Claude-native and the team is small. LangGraph wins when the workflow has real state. CrewAI wins when the team needs to move in a week, not a month.
What almost never wins: “let’s use all three.” We have seen that pattern in two client architectures this year and it does not work. The frameworks overlap enough that running all three produces a stack that is neither coherent nor easy to staff. Pick one as the default, use a second deliberately for a specific workload it is better at, and do not let a third in unless there is a specific reason.
What we also consider
AutoGen is a credible fourth option for research-style multi-agent work, particularly inside Microsoft-first shops. Google’s Agent Development Kit matters if the client is Gemini-native. OpenAI’s Assistants and Responses APIs are capable, but the agentic primitives are less mature than Anthropic’s today. n8n is not an agent framework, but it is the right glue layer when integrating with existing SaaS tools is the critical path.
The MCP ecosystem matters more than any single framework choice. If you pick a framework that does not speak MCP, you are betting that you will not want to swap models in the next two years. That is a bet we do not recommend taking.
How this decision lands in an engagement
This is roughly the conversation we have at the end of the technology-stack session of a deep dive. The question is never “which framework is best” in the abstract. It is “which framework is best given your workload, compliance posture, team skill, and two-year roadmap.” The answer is usually clearer than clients expect going in.
If you want to walk through that decision against your specific operations, our deep-dive engagements include it by default. Two to four weeks, written roadmap, fixed price.