Durable Execution in LangGraph: Agents That Survive Failure and Resume Where They Left Off
Most AI agents are built as a single process holding state in memory: a while loop, local variables, maybe a sleep(). That holds up until the workflow has to outlive the process that started it, and in production it eventually does. The math is unforgiving: chain ten steps that each succeed 85% of the time and, assuming the failures are independent, the whole run finishes only about 20% of the time (0.85¹⁰ ≈ 0.20). Without durability, every one of those failures restarts the run from scratch. The model might be reliable; the tool calls aren't. Better LLMs don't fix network failures; durable execution is what makes them recoverable.
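The compounding arithmetic is easy to check. The sketch below assumes each step fails independently with the same success rate, which is a simplification; the function name is illustrative only.

```python
def chain_success_probability(step_success: float, steps: int) -> float:
    """Probability that every step in a chain succeeds.

    Assumes steps fail independently and share one success rate.
    """
    return step_success ** steps


# Show how quickly end-to-end reliability decays as the chain grows.
for n in (1, 5, 10, 20):
    probability = chain_success_probability(0.85, n)
    print(f"{n:>2} steps at 85% each: {probability:.1%} end-to-end")
```

At ten steps the result is just under 20%, matching the figure above; doubling the chain to twenty steps pushes it below 4%.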
The research consensus is that the infrastructure around the model, not the model itself, is where production agents live. The 2026 design-space analysis "Dive into Claude Code" found that only 1.6% of Claude Code's codebase is AI decision logic; the other 98.4% is operational infrastructure for context management, tool routing, and recovery. LangGraph's answer to that reality is durable execution through its persistence layer, which makes the agent a row in a checkpoint store rather than a stack frame in a living process. This article dissects how that works, the sharp edges it creates, and how to observe a workflow that, by design, no longer runs as a single process.
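To make the "row in a checkpoint store" idea concrete, here is a minimal standard-library sketch of checkpoint-and-resume. It is not LangGraph's implementation or API: the `checkpoints` table, `open_store`, `run`, and the two example steps are all hypothetical names chosen for illustration, and the only assumption carried over from the text is that state is saved after every completed step and keyed by a thread identifier.

```python
import json
import sqlite3
from typing import Callable

Step = Callable[[dict], dict]


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open a checkpoint store (hypothetical schema: one row per thread).

    ":memory:" only lasts for this process; pass a file path to survive a restart.
    """
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checkpoints ("
        "thread_id TEXT PRIMARY KEY, next_step INTEGER NOT NULL, state TEXT NOT NULL)"
    )
    return conn


def run(conn: sqlite3.Connection, thread_id: str, steps: list[Step], initial: dict) -> dict:
    """Run steps in order, resuming from the last saved checkpoint if one exists."""
    row = conn.execute(
        "SELECT next_step, state FROM checkpoints WHERE thread_id = ?", (thread_id,)
    ).fetchone()
    next_step, state = (row[0], json.loads(row[1])) if row else (0, dict(initial))

    for index in range(next_step, len(steps)):
        # If this raises, the stored row still points at this step.
        state = steps[index](state)
        conn.execute(
            "INSERT OR REPLACE INTO checkpoints VALUES (?, ?, ?)",
            (thread_id, index + 1, json.dumps(state)),
        )
        conn.commit()
    return state


# Two example steps; the second fails once to simulate a flaky tool call.
calls = {"fetch": 0, "summarize": 0}


def fetch(state: dict) -> dict:
    calls["fetch"] += 1
    return {**state, "doc": "raw text"}


def summarize(state: dict) -> dict:
    calls["summarize"] += 1
    if calls["summarize"] == 1:
        raise ConnectionError("simulated network failure")
    return {**state, "summary": state["doc"].upper()}


conn = open_store()
try:
    run(conn, "thread-1", [fetch, summarize], {})
except ConnectionError:
    pass  # the first attempt dies after `fetch` has been checkpointed

# The retry picks up at `summarize`; `fetch` is not executed again.
final_state = run(conn, "thread-1", [fetch, summarize], {})
print(final_state, calls)
```

The design choice to notice is that the loop's position lives in the database row rather than in a local variable, so whichever process calls `run` next with the same thread identifier continues from the saved step instead of starting over.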
