
Durable Execution in LangGraph: Agents That Survive Failure and Resume Where They Left Off

· 12 min read
Vadim Nicolai
Senior Software Engineer

Most AI agents are built as a single process holding state in memory: a while loop, local variables, maybe a sleep(). That holds up until the workflow has to outlive the process that started it, and in production it always does. The math is unforgiving: chain ten steps that each succeed 85% of the time and the whole run finishes only about 20% of the time (0.85¹⁰ ≈ 0.20). Without durability, every one of those failures restarts from scratch. The model might be reliable; the tool calls aren't. Better LLMs don't fix network failures; durable execution is what lets a run recover from them.
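The compounding arithmetic is easy to check. The sketch below assumes every step fails independently with the same probability and that failed work is retried until it succeeds; both are simplifications, and the function names are illustrative only.

```python
def chain_success(p: float, n: int) -> float:
    """Probability that n independent steps all succeed on the first try."""
    return p ** n


def expected_steps_with_restart(p: float, n: int) -> float:
    """Expected step executions when any failure restarts the run from step 1."""
    return (p ** -n - 1) / (1 - p)


def expected_steps_with_resume(p: float, n: int) -> float:
    """Expected step executions when only the failed step is retried."""
    return n / p


if __name__ == "__main__":
    p, n = 0.85, 10
    print(f"whole-run success: {chain_success(p, n):.3f}")
    print(f"restart from scratch: {expected_steps_with_restart(p, n):.1f} step executions")
    print(f"resume from checkpoint: {expected_steps_with_resume(p, n):.1f} step executions")
```

Under these assumptions a ten-step run costs about 27 step executions when every failure restarts it, against about 12 when it resumes at the failed step.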

The research consensus is that the infrastructure around the model, not the model itself, is where production agents live. The 2026 design-space analysis Dive into Claude Code found that only 1.6% of Claude Code's codebase is AI decision logic; the other 98.4% is operational infrastructure for context management, tool routing, and recovery. LangGraph's answer to that reality is durable execution through its persistence layer — making the agent a row in a checkpoint store, not a stack frame in a living process. This article dissects how that works, the sharp edges it creates, and how to observe a workflow that — by design — no longer runs as a single process.

What durable execution actually means — and why it isn't just retry logic

Durable execution means the agent's state is saved to persistent storage after every logical step, so a crash or restart resumes from the last saved checkpoint — not from the beginning. That is fundamentally different from an in-memory retry loop, where the state dies with the process. In durable execution, the state survives the process.

LangGraph achieves this through its checkpointing system. When you compile a graph with a checkpointer, a checkpoint of the graph state is saved at every super-step and exposed as a StateSnapshot. The handle that ties snapshots together is the thread_id: per the LangGraph persistence docs, "The checkpointer uses thread_id as the primary key for storing and retrieving checkpoints. Without it, the checkpointer cannot save state or resume execution after an interrupt." Every resume reloads the last checkpoint and hands control back to the node that was interrupted.

This is not the same as a traditional job queue (Celery, Sidekiq). Those persist tasks but not the intermediate state of a multi-step computation. LangGraph persists the full graph state — accumulated tool outputs, message history, and control-flow position. That is what lets a month-long workflow sleep for a week and then continue exactly where it paused, holding zero compute between actions.
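To make the contrast with an in-memory retry loop concrete, here is a minimal sketch of a checkpoint store keyed by thread_id, using only the standard library. It is not LangGraph's implementation or schema; the table layout, function names, and the simulated crash are all assumptions for illustration.

```python
import json
import sqlite3
from typing import Optional


def save_checkpoint(db: sqlite3.Connection, thread_id: str, step: int, state: dict) -> None:
    """Persist the full state after a step, keyed by (thread_id, step)."""
    db.execute(
        "INSERT OR REPLACE INTO checkpoints (thread_id, step, state) VALUES (?, ?, ?)",
        (thread_id, step, json.dumps(state)),
    )
    db.commit()


def load_latest(db: sqlite3.Connection, thread_id: str):
    """Return (step, state) for the newest checkpoint, or None for a new thread."""
    row = db.execute(
        "SELECT step, state FROM checkpoints WHERE thread_id = ? ORDER BY step DESC LIMIT 1",
        (thread_id,),
    ).fetchone()
    return (row[0], json.loads(row[1])) if row else None


def run(db: sqlite3.Connection, thread_id: str, steps, crash_before: Optional[int] = None) -> dict:
    """Run steps in order, starting after the last checkpointed step."""
    latest = load_latest(db, thread_id)
    start, state = (latest[0] + 1, latest[1]) if latest else (0, {"outputs": []})
    for index in range(start, len(steps)):
        if crash_before == index:
            raise RuntimeError(f"simulated crash before step {index}")
        state = steps[index](state)
        save_checkpoint(db, thread_id, index, state)
    return state


db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE checkpoints (thread_id TEXT, step INTEGER, state TEXT, "
    "PRIMARY KEY (thread_id, step))"
)

calls = []  # records which steps actually executed


def make_step(name):
    def step(state):
        calls.append(name)
        return {"outputs": state["outputs"] + [name]}
    return step


steps = [make_step("a"), make_step("b"), make_step("c")]

try:
    run(db, "campaign-123-456", steps, crash_before=2)
except RuntimeError:
    pass  # the "process" died after steps a and b were checkpointed

final = run(db, "campaign-123-456", steps)  # a fresh call picks up at step c
```

The second call never re-runs steps a and b because their results live in the store rather than in the memory of the call that died.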

LangGraph's persistence architecture: modes and tradeoffs

LangGraph exposes three durability modes (durable-execution docs):

  • 'exit' — persists changes only when graph execution exits. Best performance, no mid-execution recovery.
  • 'async' — persists changes asynchronously while the next step executes, with "a small risk that LangGraph does not write checkpoints if the process crashes."
  • 'sync' — persists changes synchronously before the next step starts, writing every checkpoint before continuing, at "the cost of some performance overhead."

The choice is a safety-versus-performance dial. For a month-long thread where a missed checkpoint means a re-sent or skipped email, stronger durability is the correct trade. But there is real evidence that most of this checkpoint cost is avoidable. The 2026 Crab checkpoint/restore study found that over 75% of agent turns produce no recovery-relevant state, so blanket checkpointing is mostly waste; its semantics-aware approach raised recovery correctness from 8% to 100% while cutting checkpoint traffic by up to 87% and staying within 1.9% of fault-free execution time. The lesson for LangGraph users: pick the mode by consequence-of-loss, not by reflex.

| Mode | Safety | Best for |
| --- | --- | --- |
| 'exit' | None mid-execution | Short, cheap tasks; non-critical experiments |
| 'async' | Medium (risk of lost checkpoint on crash) | Long workflows where a missed checkpoint means a re-run, not a side effect |
| 'sync' | Maximum (every step persisted before the next) | Any workflow where a missed checkpoint produces a real-world duplicate (email, payment, API write) |
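The difference between the two ends of that dial can be shown with a toy simulation. This is a sketch of the write timing only, not LangGraph code: the function name and crash model (a hard process kill) are assumptions, and 'async' is left out because its outcome depends on timing.

```python
def persisted_checkpoints(step_names, mode, crash_at=None):
    """Return the checkpoints that reached storage before the run ended.

    crash_at is the index of the first step that never completes because the
    process is killed; None means the run finishes normally.
    """
    if mode not in ("sync", "exit"):
        raise ValueError("this sketch models only 'sync' and 'exit'")
    stored = []     # checkpoints that reached durable storage
    in_memory = []  # outputs accumulated by the live process
    for index, name in enumerate(step_names):
        if index == crash_at:
            return stored  # hard kill: whatever was only in memory is gone
        in_memory.append(name)
        if mode == "sync":
            stored.append(list(in_memory))  # written before the next step starts
    if mode == "exit":
        stored.append(list(in_memory))  # single write when execution exits
    return stored


steps = ["check_reply", "compose_touch", "send_touch"]
sync_after_crash = persisted_checkpoints(steps, "sync", crash_at=2)
exit_after_crash = persisted_checkpoints(steps, "exit", crash_at=2)
```

With a crash during the third step, the 'sync' run has two checkpoints to resume from and the 'exit' run has none.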

The backend matters too. LangGraph ships SQLite and Postgres savers and supports custom ones. In the production system I'll use as a worked example, the team built a custom AsyncCloudflareD1Saver over Cloudflare D1 (SQLite), constructed with enable_logging=True — forcing write failures to surface rather than silently swallowing them. D1 rejects statements with more than 100 bound parameters, so batch writes are chunked under that limit. That kind of platform-specific constraint is exactly the sharp edge that doesn't appear in tutorials but dominates real deployments.
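Chunking under a bound-parameter limit is simple to sketch. The code below is not the AsyncCloudflareD1Saver itself; the helper names, table name, and column names are hypothetical, and it only shows how rows can be grouped so no statement carries more than 100 placeholders.

```python
from typing import Iterator, Sequence

MAX_BOUND_PARAMS = 100  # per-statement limit described in the text


def chunk_rows(rows: Sequence[tuple], max_params: int = MAX_BOUND_PARAMS) -> Iterator[list]:
    """Yield groups of rows whose total placeholder count stays within max_params."""
    if not rows:
        return
    width = len(rows[0])
    if width == 0 or width > max_params:
        raise ValueError("row width must be between 1 and max_params")
    per_chunk = max_params // width
    for start in range(0, len(rows), per_chunk):
        yield list(rows[start:start + per_chunk])


def build_insert(table: str, columns: Sequence[str], chunk: Sequence[tuple]):
    """Build one multi-row INSERT and its flattened parameter list."""
    row_sql = "(" + ", ".join("?" for _ in columns) + ")"
    sql = (
        f"INSERT INTO {table} ({', '.join(columns)}) VALUES "
        + ", ".join(row_sql for _ in chunk)
    )
    params = [value for row in chunk for value in row]
    return sql, params


# 30 rows of 4 columns would need 120 placeholders in a single statement.
rows = [(f"thread-{i}", i, "blob", "meta") for i in range(30)]
columns = ["thread_id", "step", "payload", "metadata"]  # hypothetical column names
statements = [build_insert("writes", columns, chunk) for chunk in chunk_rows(rows)]
```

Here the 30 rows split into one statement with 100 parameters and one with 20.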

Worked example: a month-long campaign on Cloudflare Workers

An email campaign runs as one durable LangGraph thread per (campaign, contact). The thread_id is deterministic: "campaign-{campaign_id}-{contact_id}". The campaign is a six-touch sequence that stops the instant the prospect replies. Default cadence: [0, 4, 7, 7, 7, 7] days — touch 0 immediately, then gaps of 4, 7, 7, 7, 7 days. A single thread spans roughly a month while holding zero compute between touches. The stack: LangGraph for the state machine, the custom AsyncCloudflareD1Saver checkpointer, and a Cloudflare Workers cron as the scheduler.
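The thread id and cadence arithmetic can be sketched as follows. The helper names are illustrative, and the schedule is computed from a fixed start time for simplicity; a real engine may well compute each wake time from the moment the previous touch actually went out.

```python
from datetime import datetime, timedelta, timezone

CADENCE_DAYS = [0, 4, 7, 7, 7, 7]  # gap in days before each touch, from the text


def thread_id_for(campaign_id, contact_id) -> str:
    """Deterministic id: the same (campaign, contact) always maps to one thread."""
    return f"campaign-{campaign_id}-{contact_id}"


def touch_schedule(start: datetime, cadence_days=CADENCE_DAYS) -> list:
    """Return the planned send time of each touch by accumulating the gaps."""
    times, current = [], start
    for gap in cadence_days:
        current = current + timedelta(days=gap)
        times.append(current)
    return times


start = datetime(2025, 1, 1, tzinfo=timezone.utc)
schedule = touch_schedule(start)
span_days = (schedule[-1] - start).days
```

The six touches land on days 0, 4, 11, 18, 25, and 32, which is where the "roughly a month" figure comes from.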

Node flow: check_reply → compose_touch → [interrupt: approval] → send_touch → schedule_next → [interrupt: cadence] → (back to) check_reply

Two pauses use interrupt(). After scheduling the next touch, schedule_next calls interrupt({"kind": "cadence", "wake_at": ..., "next_step": ...}) and the run ends. Later it is revived with graph.ainvoke(Command(resume=True), cfg), which reloads state and re-enters the paused node: a workflow that sleeps for a week and then continues mid-graph, without holding a thread or a connection.

External scheduler: a thin Cloudflare Worker fires every five minutes (*/5 * * * *) and POSTs /cron/tick. The resume driver queries the index — WHERE status='waiting' AND wake_at <= now — and resumes each due thread by id, strictly sequentially (parallel resumes against a single-worker runtime poison the queue), bounded by a batch size of 25 per tick. The scheduler is dumb on purpose — it asks "what's due?" and the durable state answers. The agent's lifetime is decoupled from any one process, and the checkpoint store is the single source of truth.
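The resume driver described above can be sketched against SQLite. The table name, column names, and the resume callback are assumptions; in a real deployment the callback would wrap something like graph.ainvoke(Command(resume=True), cfg) and the graph would update the index row itself.

```python
import sqlite3
import time

BATCH_SIZE = 25  # per-tick bound from the text


def due_thread_ids(db: sqlite3.Connection, now_ts: int, limit: int = BATCH_SIZE) -> list:
    """Return ids of waiting threads whose wake time has passed, oldest first."""
    rows = db.execute(
        "SELECT thread_id FROM campaign_threads "
        "WHERE status = 'waiting' AND wake_at <= ? "
        "ORDER BY wake_at LIMIT ?",
        (now_ts, limit),
    ).fetchall()
    return [row[0] for row in rows]


def tick(db: sqlite3.Connection, resume, now_ts=None) -> list:
    """Handle one cron tick: resume each due thread strictly one after another."""
    now_ts = int(time.time()) if now_ts is None else now_ts
    resumed = []
    for thread_id in due_thread_ids(db, now_ts):
        resume(thread_id)  # hypothetical wrapper around the graph resume call
        resumed.append(thread_id)
    return resumed


db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE campaign_threads (thread_id TEXT PRIMARY KEY, status TEXT, wake_at INTEGER)"
)
db.executemany(
    "INSERT INTO campaign_threads VALUES (?, ?, ?)",
    [
        ("campaign-1-10", "waiting", 100),  # due
        ("campaign-1-11", "waiting", 900),  # not due yet
        ("campaign-1-12", "done", 50),      # finished, never selected
    ],
)
resumed = tick(db, resume=lambda thread_id: None, now_ts=500)
```

The driver holds no state of its own; everything it needs to know comes back from the query on each tick.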

The sharp edge: nodes re-execute on resume

This is the most important durable-execution fact in LangGraph, and it's easy to miss. From the durable-execution docs: "Nodes after the checkpoint re-execute, including any LLM calls, API requests, or interrupts — which are always re-triggered during replay." Resume re-runs the node; it does not continue from the next line of source.

That makes idempotency non-negotiable for any side-effectful node. In the campaign engine, three mechanisms handle it:

  1. Exactly-once sends. A cadence/duplicate guard checks whether a touch for that (contact, sequence_step) already went out and skips the send if so; the check reads persisted state, so it holds across resumes.
  2. Stop-on-reply. check_reply runs first on every wake and queries for an inbound message since the last touch; if the prospect replied, the thread ends and never sends again. The node re-runs and the world changed while it slept, so re-checking reality is mandatory.
  3. Determinism. A re-executed node must not branch on wall-clock now() or a live random read; capture any non-deterministic value once, store it in checkpointed state, and have later code read the stored value.
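The first mechanism, a duplicate guard that reads persisted state, can be sketched like this. The table, the function names, and the deliver callback are hypothetical stand-ins for the engine's own storage and email provider.

```python
import sqlite3


def send_touch(db: sqlite3.Connection, contact_id: str, sequence_step: int, deliver) -> bool:
    """Send a touch unless one was already recorded for this (contact, step)."""
    already_sent = db.execute(
        "SELECT 1 FROM sent_touches WHERE contact_id = ? AND sequence_step = ?",
        (contact_id, sequence_step),
    ).fetchone()
    if already_sent:
        return False  # a replayed node lands here and skips the side effect
    deliver(contact_id, sequence_step)
    db.execute(
        "INSERT INTO sent_touches (contact_id, sequence_step) VALUES (?, ?)",
        (contact_id, sequence_step),
    )
    db.commit()
    return True


db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE sent_touches (contact_id TEXT, sequence_step INTEGER, "
    "PRIMARY KEY (contact_id, sequence_step))"
)
outbox = []  # stands in for the real email provider


def deliver(contact_id, step):
    outbox.append((contact_id, step))


first = send_touch(db, "contact-456", 0, deliver)
replayed = send_touch(db, "contact-456", 0, deliver)  # same node re-executed on resume
```

One caveat of this ordering: a crash between deliver and the INSERT can still produce a duplicate on the next resume. Recording the row before delivering removes that window at the price of a possible missed send, so the order is a choice about which failure is more acceptable.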

This is the same instinct the 2025 framework Blueprint First, Model Second generalizes. It argues that "the inherent non-determinism of large language model (LLM) agents limits their application in structured operational environments where procedural fidelity and predictable execution are strict requirements," and pulls workflow decisions out of the model into a deterministic blueprint, reporting a 10.1 percentage-point improvement over baselines. LangGraph's node re-execution forces you to treat every node as a near-pure function over checkpointed state — exactly that discipline.

Durable human-in-the-loop

compose_touch generates a touch but holds it as a draft (status='draft_pending') instead of sending. The cron's due-query only selects status='waiting', so a held draft is structurally impossible to auto-send — only an explicit human approval advances it.

LangGraph treats a pause-for-a-human as the same interrupt() primitive as a pause-for-a-clock (human-in-the-loop docs). The machinery that waits four days waits for a person. Durability is what makes the pause safe across a deploy: the held draft is checkpointed state, not a blocked thread. If the server restarts while waiting for approval, the thread is still there, still holding the draft, still waiting for Command(resume=True).
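The gate on held drafts can be modeled in a few lines. This is a sketch under assumptions: the row shape, the approve function, and the "approved" status are hypothetical, and the resume callback stands in for a call such as graph.ainvoke(Command(resume=True), cfg).

```python
from dataclasses import dataclass


@dataclass
class ThreadRow:
    thread_id: str
    status: str   # "waiting" or "draft_pending" in this sketch
    wake_at: int  # unix timestamp


def cron_due(rows, now_ts):
    """The scheduler's view: only waiting threads can ever be auto-resumed."""
    return [r.thread_id for r in rows if r.status == "waiting" and r.wake_at <= now_ts]


def approve(row, resume):
    """The human's path: the only code that resumes a held draft."""
    if row.status != "draft_pending":
        raise ValueError(f"{row.thread_id} has no draft awaiting approval")
    resume(row.thread_id)  # hypothetical wrapper around the graph resume call
    row.status = "approved"  # hypothetical status name


rows = [
    ThreadRow("campaign-1-10", "waiting", 100),
    ThreadRow("campaign-1-11", "draft_pending", 100),
]
auto = cron_due(rows, now_ts=500)

resumed_by_human = []
approve(rows[1], resumed_by_human.append)
```

The held draft is past its wake time yet never appears in the scheduler's result; it moves only when approve is called.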

The 2025 guide Architecting Resilient LLM Agents advocates a Plan-then-Execute pattern with explicit DAG dependencies to give "control-flow integrity." In the campaign engine, the graph structure is the plan, and the human sits as a gate that advances it one step at a time — possible only because the checkpoint persists the plan's state across human decisions.

Observing a workflow that isn't one process: LangSmith

Durable execution creates an observability problem: a single logical workflow no longer runs as one process or one call. The month-long campaign is dozens of separate ainvoke resumes across many cron ticks and process lifetimes — and in LangSmith, each resume is a separate root trace. LangSmith structures each trace as "a tree of runs (a root run for the top-level call and child runs for each inner call)" (LangSmith observability docs); a durable workflow shatters into many such trees.

Three features stitch it back together, and the campaign engine uses all three:

  • Metadata correlation. Every campaign run is tagged runtime:cron and carries metadata={campaign_id, contact_id, sequence_step}; each touch is wrapped in an agent_run_span("campaign_touch", ...) so the email_outreach LLM call nests as a child run. Filtering on metadata.campaign_id reassembles the whole sequence.
  • Outcome feedback on the originating run. When check_reply sees a reply, the graph posts record_outcome_feedback(run_id, "reply_outcome", 1.0) to the touch run that earned it; a no-reply completion posts 0.0. The feedback lands on the exact run even though it finished in a process that exited days earlier — possible because the run id was checkpointed with the thread.
  • Evaluation against a dataset. A LangSmith dataset is "a set of examples, with each example containing an input and an expected or reference output." We keep a final_response dataset for the campaign touch so cadence and copy changes can be A/B tested, and a content-hash prompt-version tag on each touch run flags templating changes.
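The correlation and prompt-version tagging above can be sketched as a helper that builds the config passed to each resume. It assumes the engine passes a standard LangChain-style config dict with tags and metadata keys; the helper names and the exact tag format are illustrative, not the engine's own.

```python
import hashlib


def prompt_version(template: str) -> str:
    """Short content hash of a prompt template; changes whenever the text changes."""
    return hashlib.sha256(template.encode("utf-8")).hexdigest()[:12]


def build_run_config(campaign_id: str, contact_id: str, sequence_step: int, template: str) -> dict:
    """Build a per-resume config whose tags and metadata stay stable across processes."""
    return {
        "configurable": {"thread_id": f"campaign-{campaign_id}-{contact_id}"},
        "tags": ["runtime:cron", f"prompt:{prompt_version(template)}"],
        "metadata": {
            "campaign_id": campaign_id,
            "contact_id": contact_id,
            "sequence_step": sequence_step,
        },
    }


template_v1 = "Hi {first_name}, following up on my last note."
template_v2 = "Hi {first_name}, quick follow-up on my last note."
cfg_a = build_run_config("123", "456", 2, template_v1)
cfg_b = build_run_config("123", "456", 3, template_v1)
cfg_c = build_run_config("123", "456", 3, template_v2)
```

Because the metadata is derived from the campaign and contact rather than from the process, every resume of the same thread carries the same campaign_id, and a template edit shows up as a different prompt tag.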

Once execution is durable, observability has to be durable too: stable metadata, run-anchored feedback, and datasets are what give you the whole picture back.

The research throughline: infrastructure over model

The papers cited here converge on one theme at different levels of the stack: the hard part of production agents isn't the LLM, it's the operational scaffolding.

  • Dive into Claude Code (2026) quantifies the ratio: 98.4% of the codebase is operational scaffolding, not AI logic.
  • Crab (2026) shows checkpoint/restore itself must be semantics-aware for agents — blanket snapshotting wastes resources, and its approach moved recovery correctness from 8% to 100%.
  • Blueprint First, Model Second (2025) demonstrates that pulling control flow out of the model into a deterministic blueprint yields a 10.1 percentage-point improvement.
  • Architecting Resilient LLM Agents (2025) codifies Plan-then-Execute with DAG dependencies as the architecture for control-flow integrity.

If you're building an agent for production, spend your engineering time on the durability infrastructure, not on prompt micro-optimizations.

Practical takeaways for building durable LangGraph agents

  1. Compile with a real checkpointer and a stable thread_id. Without the id, the checkpointer can't reload state. Make the id deterministic so you know where to resume:

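    import asyncio
    from typing import TypedDict
    from langgraph.checkpoint.memory import MemorySaver
    from langgraph.graph import END, START, StateGraph

    class CampaignState(TypedDict):  # placeholder schema for illustration
        step: int

    graph = StateGraph(CampaignState)
    graph.add_node("touch", lambda state: {"step": state["step"] + 1})  # placeholder node
    graph.add_edge(START, "touch")
    graph.add_edge("touch", END)
    checkpointer = MemorySaver()  # in-memory only; use a Postgres/SQLite saver in production
    app = graph.compile(checkpointer=checkpointer)

    # A deterministic thread_id lets any later process find and resume this thread.
    config = {"configurable": {"thread_id": "campaign-123-456"}}
    asyncio.run(app.ainvoke({"step": 0}, config))
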
  2. Pick the durability mode by consequence-of-loss, not reflex. Prefer sync/async over exit when a missed checkpoint means a real-world side effect.

  3. Assume nodes re-execute on resume. Make every side effect idempotent and re-check reality on each wake — the docs guarantee replay re-triggers LLM and API calls, so treat it as a design constraint, not a bug.

  4. Let a dumb external scheduler drive, sequentially on single-worker runtimes; parallel resumes against a shared store invite races unless the checkpointer is transactional.

  5. Treat human pauses as checkpointed state, not blocked threads, so they survive deploys, crashes, and scaling events.

  6. Make observability durable too. Correlate scattered resume traces by stable metadata, anchor feedback to the originating run via checkpointed run ids, and evaluate against a dataset.

The payoff is an agent that runs for a month, survives crashes and deploys, pauses for a human when it must, always resumes from exactly where it left off, and stays observable the whole time — because "where it left off" is a row in a checkpoint store, not a stack frame in a process that may already be gone. The infrastructure isn't glamorous, but it's the thing that separates a demo from a deployment.

References