
Eval Driven Development

· 11 min read
Vadim Nicolai
Senior Software Engineer

Here's the counterintuitive premise: for any LLM application where errors have real consequences, you must build your evaluation harness before you write a single prompt. You don't prompt-engineer by vibes, tweaking until an output looks good. You start by defining what "good" means, instrumenting its measurement, and only then do you optimize. This is Eval-Driven Development. It's the only sane way to build reliable, high-stakes AI systems.

In most software, a bug might crash an app. In high-stakes AI, a bug can trigger a misdiagnosis, approve a fraudulent transaction, deploy vulnerable code to production, or greenlight a toxic post to millions of users. The consequences are not hypothetical. An AI-generated radiology summary that fabricates a nodule sends a patient into an unnecessary biopsy. A compliance pipeline that hallucinates a regulatory citation exposes a bank to enforcement action. A code review agent that misses a SQL injection in a PR puts an entire user base at risk. The tolerance for error in these domains is asymptotically approaching zero. This changes everything about how you build.

The typical LLM workflow—prompt, eyeball output, tweak, repeat—fails catastrophically here. You cannot perceive precision and recall by looking at a single response. You need structured, automated measurement against known ground truth. I learned this building a multi-agent fact-checking pipeline: a five-agent system that ingests documents, extracts claims, cross-references them against source material, and synthesizes a verification report. The entire development process was inverted. The planted errors, the matching algorithm, and the evaluation categories were defined first. Prompt tuning came second, with every change measured against the established baseline. The harness wasn't a validation step; it was the foundation.

1. The Asymmetric Cost of Error Dictates Architecture

In high-stakes AI, false positives and false negatives are not equally bad. The asymmetry is domain-specific, but it's always there.

  • A false negative means the system misses a real problem—an inconsistency in a medical record, a miscalculated risk exposure, an unpatched vulnerability. This is bad—it reduces the system's value—but it's the baseline state of the world without the AI. The document would have gone unreviewed anyway.
  • A false positive means the system raises a false alarm—flagging a healthy scan as abnormal, blocking a legitimate transaction as fraudulent, rejecting safe code as vulnerable. This is actively harmful. It wastes expert time, erodes trust, and trains users to ignore the system. It makes the system a net negative.

Consider a medical record summarizer used during clinical handoffs. A missed allergy (false negative) is dangerous but recoverable—clinicians have other safeguards. A fabricated allergy to a first-line antibiotic (false positive) can delay critical treatment and cause the care team to distrust every future output. In financial compliance, a missed suspicious transaction is bad; flagging a Fortune 500 client's routine wire transfer as money laundering is a relationship-ending event.

This asymmetry directly shapes the evaluation strategy. You cannot collapse quality into a single "accuracy" score. You must measure recall (completeness) and precision (correctness) independently, and you must design your metrics to reflect their unequal impact. In most domains, the architecture must be built to maximize precision, even at some cost to recall. Crying wolf is the cardinal sin.

2. Build a Multi-Layer Diagnostic Harness, Not a Monolith

When a test fails, you need to know why. A single, monolithic eval script conflates pipeline failures, prompt failures, and data-passing bugs. The fact-checking pipeline I built uses a four-layer architecture for diagnostic precision.

  1. The Integrated Harness (run_evals.py): A 700+ line orchestrator that runs the full multi-agent pipeline end-to-end. It executes 30+ structured assertions across six categories (Recall, Precision, Hallucination, Grounding, Consistency, Severity). This layer answers: does the whole system work?
  2. The Promptfoo Pipeline Eval (promptfoo.yaml): A separate layer using the open-source Promptfoo framework. It runs 20+ JavaScript assertions on the same cached pipeline output, providing a standardized web viewer and parallel execution. This layer ensures results are shareable and reproducible.
  3. Agent-Level Evals: Isolated Promptfoo configs that test individual agents (Claim Extractor, Cross-Referencer, Synthesizer) with direct inputs. If the pipeline misses a date inconsistency, this layer tells you if it's because the Cross-Referencer failed to detect it or because the Synthesizer later dropped the finding.
  4. Prompt Precision A/B Tests: Controlled experiments that run the same test cases against two prompt variants: a precise, detailed prompt and a vague, underspecified one. This quantifies the causal impact of prompt engineering choices, separating signal from noise.

This stratification is crucial. The integrated test catches systemic issues, the agent tests isolate component failures, and the A/B tests measure prompt efficacy. Development velocity skyrockets because you can iterate on a single agent in 5 seconds instead of running the full 30-second pipeline.

3. Ground Truth is a Domain Argument, Not a Checklist

Your ground truth test case is the foundation. If it's simplistic, your metrics are lies. You need richly constructed test fixtures that mirror real-world complexity—not toy examples.

For a medical record summarizer, this means building a synthetic patient chart across multiple encounter notes, lab results, and imaging reports, then planting specific errors weighted by clinical severity. For a financial report analyzer, it means constructing a set of interconnected filings—10-K, earnings transcript, risk disclosures—with deliberate inconsistencies.

Here's what a well-designed ground truth looks like. Suppose you're building a pipeline that cross-checks corporate filings. Your test fixture contains eight planted errors, each designed to test a specific capability and weighted by business impact:

  • D-01: Date Discrepancy (Weight: 2). The earnings call transcript references Q3 revenue of $4.2B recognized through September 30; the 10-K reports the fiscal quarter ending October 31. Tests basic cross-document consistency.
  • D-02: Metric Contradiction (Weight: 2). The CEO's letter claims 18% year-over-year growth; the actual financial tables show 11.3%. Tests direct factual contradiction between narrative and data.
  • D-03: Source Misquotation (Weight: 2). The risk disclosure paraphrases a regulatory requirement but subtly inverts a key condition. Tests citation accuracy requiring domain knowledge.
  • D-05: Cross-Document Inference (Weight: 2). The filing claims no material exposure to a specific market; a subsidiary's footnote reveals 30% revenue concentration there. Tests cross-document inference.

The other four errors (unit conversion mistakes, outdated references, omissions, inconsistent terminology) have a weight of 1. The weighting is critical. The total possible weight is 12. A system that catches the four critical errors (weight 2 each) but misses all others scores a weighted recall of 8/12 (67%). A system that catches only the four minor errors scores 4/12 (33%). The metric reflects domain impact, not just a detection count.
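
The weighted-recall arithmetic above can be sketched in a few lines. This is a hedged illustration, not the article's actual harness code; the IDs D-04, D-06, D-07, and D-08 are hypothetical stand-ins for the four weight-1 errors, which are not individually named here.

```python
# Illustrative weighted-recall computation over the planted-error fixture.
# D-04/D-06/D-07/D-08 are hypothetical IDs for the four minor (weight-1) errors.
PLANTED_ERRORS = {
    "D-01": 2, "D-02": 2, "D-03": 2, "D-05": 2,   # critical errors
    "D-04": 1, "D-06": 1, "D-07": 1, "D-08": 1,   # minor errors
}

def weighted_recall(detected_ids: set[str]) -> float:
    """Fraction of total planted-error weight the pipeline caught."""
    total = sum(PLANTED_ERRORS.values())  # 12 for this fixture
    caught = sum(w for eid, w in PLANTED_ERRORS.items() if eid in detected_ids)
    return caught / total
```

Catching only the four critical errors yields 8/12 (about 67%); catching only the four minor ones yields 4/12 (about 33%), matching the figures above.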

4. Matching Algorithms Encode Precision-Recall Tradeoffs

Determining if a pipeline "detected" an error is not a simple string equality. The eval harness uses a nuanced matching algorithm with two primary modes, each representing a deliberate tradeoff.

  • "Any" Mode: The check passes if any keyword from a list appears in the extracted output text. Used for errors where a keyword is uniquely specific. For the date discrepancy (D-01), keywords are "september 30", "october 31", "quarter end", "date discrepancy". If the output contains "october 31," it's almost certainly referring to the planted error. This mode maximizes recall (sensitivity).

  • "Keyword Plus Signal" Mode: The check passes only if BOTH a topic keyword and a signal word are present. Used for ambiguous topics. For the metric contradiction (D-02), keywords are "growth", "18%", "year-over-year". But merely mentioning growth doesn't mean the system detected the contradiction. So it also requires a signal word from "contradict", "incorrect", "does not match", "overstated". This mode maximizes precision (specificity).
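
The two modes can be sketched as simple predicates. Function names and signatures here are illustrative, not the harness's actual API:

```python
# Sketch of the two matching modes (names are illustrative, not the real harness).
def match_any(output: str, keywords: list[str]) -> bool:
    """'Any' mode: pass if any uniquely specific keyword appears."""
    text = output.lower()
    return any(kw in text for kw in keywords)

def match_keyword_plus_signal(output: str, keywords: list[str],
                              signals: list[str]) -> bool:
    """'Keyword plus signal' mode: require a topic keyword AND a detection signal."""
    text = output.lower()
    return (any(kw in text for kw in keywords)
            and any(sig in text for sig in signals))
```

For D-02, "revenue growth was 18% year-over-year" alone fails the check (topic, no signal), while "the stated 18% growth does not match the tables" passes.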

The algorithm defensively extracts and concatenates text from nested report fields (like top_findings.description and verified_facts.summary) into flat blobs for searching. This avoids the brittleness of relying on a single, specific output field. The matching logic itself is more predictable than the LLM it evaluates, which is essential.
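
A minimal sketch of that defensive flattening, assuming the report arrives as nested dicts and lists (field names in the example are from the article; the walker itself is illustrative):

```python
# Recursively concatenate every string leaf of a report into one searchable blob,
# so matching never depends on any single output field existing.
def flatten_report(report) -> str:
    parts: list[str] = []

    def walk(node):
        if isinstance(node, str):
            parts.append(node)
        elif isinstance(node, dict):
            for value in node.values():
                walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(report)
    return " ".join(parts).lower()
```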

5. Precision is Measured on Pristine Documents, Recall on Dirty Ones

You cannot measure precision and recall on the same dataset. The harness uses two separate test cases.

Recall is measured on the "dirty" fixture with its eight planted errors. Did you find them?

Precision is measured on a separate "clean" fixture—a set of four internally consistent documents with zero planted errors. Any finding produced here is a false positive. The precision checks are pragmatic:

  • P-01: Clean docs should produce at most one finding. (A zero-tolerance policy is unrealistic for stochastic LLMs; allowing one accommodates inherent noise.)
  • P-02: No verified facts marked contradictory.
  • P-03: No claims marked unsupported or misleading.
  • P-04: Overall confidence score >= 0.6.

The False Discovery Rate (false positives / total findings) is calculated here. It's a more intuitive metric for stakeholders: "30% of this system's findings are wrong" is clearer than "precision is 70%."
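
As a one-line sanity check, the False Discovery Rate computation looks like this (a sketch; the zero-findings convention is my assumption, not stated in the article):

```python
# FDR = false positives / total findings; undefined at zero findings,
# so we return 0.0 by convention (no findings, nothing falsely discovered).
def false_discovery_rate(false_positives: int, total_findings: int) -> float:
    if total_findings == 0:
        return 0.0
    return false_positives / total_findings
```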

6. Evidence Grounding is a Bulwark Against Hallucination

The most important guardrail is evidence grounding. Every finding must cite its source. The harness implements a mechanistic check: for each finding's evidence text, does a substring of at least 10 characters appear verbatim in the concatenated source documents?

This prevents the system from generating plausible-sounding findings plucked from thin air—the exact failure mode that has already caused real damage across industries. When a code review agent says "this function is vulnerable to path traversal," the engineer must be able to see the actual code it's referencing. When a medical summarizer says "patient has a documented penicillin allergy," the clinician must be able to trace that to a specific note. Grounding makes claims traceable.

The threshold is deliberately set at >=50% of findings grounded, not 100%. Why? Because some legitimate findings are inferential. A finding like "the company's stated growth rate is inconsistent with its reported revenue figures" synthesizes data from two different documents—the narrative section and the financial tables. The synthesized insight may not appear verbatim anywhere. The 50% threshold catches egregious hallucination while permitting necessary analytical synthesis.
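
The mechanistic check and the 50% threshold together fit in a few lines. A minimal sketch, assuming evidence strings and sources are compared case-insensitively (the article does not specify casing):

```python
# Grounding check: does any substring of at least min_len characters from the
# evidence appear verbatim in the concatenated source documents?
def is_grounded(evidence: str, source_blob: str, min_len: int = 10) -> bool:
    ev, src = evidence.lower(), source_blob.lower()
    if len(ev) < min_len:
        return ev in src
    return any(ev[i:i + min_len] in src
               for i in range(len(ev) - min_len + 1))

def grounding_rate(evidences: list[str], source_blob: str) -> float:
    """Fraction of findings whose evidence passes the verbatim check."""
    if not evidences:
        return 1.0  # no findings, vacuously grounded
    grounded = sum(is_grounded(e, source_blob) for e in evidences)
    return grounded / len(evidences)
```

The harness-level assertion is then simply `grounding_rate(...) >= 0.5`.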

7. LLM-as-Judge is a Semantic Supplement, Not the Primary

An LLM can be used as an evaluator ("LLM-as-Judge") to catch semantic matches keyword search misses (e.g., "date discrepancy" vs. "temporal inconsistency between filings"). However, it is an opt-in secondary signal, not the primary metric.

This is a critical philosophical stance. Using an LLM to evaluate an LLM introduces the very uncertainty you're trying to measure. It's non-deterministic, costly (80+ API calls per run), and creates an infinite regress (who evaluates the evaluator?).

In my pipeline, the primary judge is the deterministic keyword algorithm. The LLM judge (DeepSeek, temperature 0.0) runs in parallel, and the combined metrics take the union of matches from both methods. You get the transparency and stability of keywords, augmented by the semantic understanding of the judge. The final scoring is defensible and explainable.
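
The combination step is deliberately trivial, which is the point: the auditable keyword matches remain primary, and the judge only adds reach. A sketch (names illustrative):

```python
# Union of detections: an error counts as found if EITHER the deterministic
# keyword matcher or the LLM judge flagged it. Keyword hits stay auditable;
# the judge contributes semantic matches the keywords miss.
def combined_detections(keyword_hits: set[str], judge_hits: set[str]) -> set[str]:
    return keyword_hits | judge_hits
```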

Practical Takeaways for Your Project

  1. Invert the Workflow. For any non-trivial LLM app, define your evaluation metrics, ground truth, and matching logic before prompt engineering. Build the harness first.
  2. Separate Precision & Recall. Test them on different datasets. Use a "clean" case to measure false positives and a "dirty" case to measure detection capability.
  3. Implement Mechanistic Grounding. Add a mandatory check that evidence strings appear in source text. It's the single best guardrail against catastrophic hallucination.
  4. Build a Multi-Layer Eval. Use integrated tests for end-to-end validation and isolated agent-level tests for rapid iteration and root-cause analysis.
  5. Weight Your Ground Truth. Not all errors are equal. Embed domain-specific severity into your scoring.
  6. Prefer Deterministic Evaluation. Use LLM-as-Judge as a semantic supplement, not your primary scoring mechanism. Your measurement tool must be more stable than the system under test.
  7. Persist Everything. Log every eval run with a git SHA to a database (SQLite works). You need to track metrics over time to detect regressions the moment they happen.
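
Takeaway 7 needs little more than the standard library. A hedged sketch (table name and schema are illustrative, not taken from the article's harness; obtain the SHA with `git rev-parse HEAD`):

```python
import sqlite3
import time

# Persist one eval run per call, keyed by timestamp and git SHA, so metric
# regressions can be traced to a specific commit.
def log_eval_run(db_path: str, git_sha: str, metrics: dict[str, float]) -> None:
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS eval_runs "
        "(ts REAL, git_sha TEXT, metric TEXT, value REAL)"
    )
    con.executemany(
        "INSERT INTO eval_runs VALUES (?, ?, ?, ?)",
        [(time.time(), git_sha, name, value) for name, value in metrics.items()],
    )
    con.commit()
    con.close()
```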

The Broader Implication

Eval-Driven Development is the necessary maturation of LLM engineering for any domain where outputs have consequences—healthcare, finance, compliance, code review, content moderation, education. The era of shipping AI features built on prompt-engineering-by-anecdote is ending. The next generation of AI applications will be distinguished not by the cleverness of their prompts, but by the rigor of their evaluation.

The four-layer architecture, weighted ground truth, and grounding checks described here form a blueprint. It acknowledges that the hardest part of building trustworthy AI isn't the model orchestration; it's building the system that tells you, unequivocally and at every step, whether you can trust it. Start there.

Forget Elite DORA Scores. Your Platform’s Job is to Make Slow Teams Less Slow.

· 5 min read
Vadim Nicolai
Senior Software Engineer

If your platform team’s North Star is getting every development squad into the “elite” performer bracket for DORA metrics, you’re aiming at the wrong target. You’re probably making things worse. I’ve watched organizations obsess over average deployment frequency or lead time, only to see platform complexity balloon and team friction increase. The real goal isn’t to build a rocket ship for your top performers; it’s to build a reliable highway for everyone else.

The corrective lens comes from a pivotal but under-appreciated source: the CNCF’s Platform Engineering Metrics whitepaper. It makes a contrarian, data-backed claim that cuts through the industry hype. The paper states bluntly that platform teams should focus on “improving the performance of the lowest-performing teams” and “reducing the spread of outcomes, not just the average.” This isn’t about settling for mediocrity. It’s about systemic stability and scaling effectively. When you measure platform success by how much you compress the variance in team performance, you start building for adoption and predictability—not vanity metrics.

The Misapplied DORA Dashboard

The DORA research is foundational, but it’s been weaponized by poor management. Deployment Frequency, Lead Time for Changes, Time to Restore Service, and Change Failure Rate are diagnostic metrics for teams. They were never intended to be uniform, organization-wide KPIs. A team maintaining a legacy financial monolith with regulatory gates will never deploy daily. A greenfield mobile app team shouldn’t be the benchmark for everyone.

Yet, leadership demands: “Why isn’t Team X also elite?” So, platform teams are tasked with building capabilities to chase these uniform highs. This leads to a fatal misalignment. The platform’s roadmap gets driven by the needs of the already-fast teams who want newer, shinier abstractions, while the teams struggling with basic deployment toil get left further behind. The performance spread widens, and systemic risk increases.

Why Variance is the Real Enemy

The CNCF paper’s focus on the lowest performers isn’t about compassion; it’s about risk management and predictability.

  1. Reduces Systemic Risk: A single team with a terrible Change Failure Rate or a glacial lead time can become a bottleneck for the entire product release. It can cause outages that impact dependent services. Improving the floor of your performance distribution directly mitigates this operational risk. As the whitepaper notes, this focus “reduces risk” for the entire organization.
  2. Enables Business Predictability: Product and business planning are built on forecasts. If delivery timelines are a wild guess because team capabilities vary wildly, planning fails. Compressing performance toward a reliable, predictable baseline makes forecasting possible. This is what Abby Bangser means by “The goal is predictability, not necessarily pushing every team to elite.”
  3. Scales Effective Practice: A platform is a vehicle for disseminating best practices. If it only serves the teams that already have their act together, it fails. Success, as Manuel Pais frames it, is about the “flow of most teams, not top ones.” By making the “paved road” so simple and effective that struggling teams naturally adopt it, you encode and scale good practices where they’re needed most.

Measuring the Spread, Not Just the Average

So, what do you put on your dashboard instead?

  • Ditch Averages, Embrace Distributions: Stop reporting “Average Lead Time: 3 days.” Start reporting “Lead Time Std Dev: Reduced from 10 days to 4 days over Q2” or “Bottom Quartile Lead Time: Improved from 14 days to 7 days.” This shift tells the true story of platform impact.
  • Track the Paved Path Adoption Rate: The most telling platform metric might be the percentage of services/deployments using the platform's standardized workflow. High adoption with low outcome variance is a slam-dunk success signal. It means the easy path is also the good path.
  • Use Cognitive Load as a Leading Indicator: The whitepaper hints at this. Survey developer cognitive load. A decrease in average reported cognitive load is a powerful leading indicator that DORA variance will soon follow. If your platform makes the hard things easier for the teams struggling the most, you’re winning.
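
The "distributions, not averages" shift is mechanical to implement. A minimal sketch of the dashboard-side computation, assuming lead times arrive as a list of per-team values in days (the "worst quartile" slice is my framing of the whitepaper's bottom-quartile focus):

```python
import statistics

# Report the spread of lead times across teams, not just the mean.
def lead_time_spread(lead_times_days: list[float]) -> dict[str, float]:
    xs = sorted(lead_times_days)
    worst_quartile = xs[-max(1, len(xs) // 4):]  # slowest 25% of teams
    return {
        "mean": statistics.mean(xs),
        "stdev": statistics.stdev(xs),
        "worst_quartile_mean": statistics.mean(worst_quartile),
    }
```

Tracking `stdev` and `worst_quartile_mean` over quarters tells the platform-impact story that a lone average hides.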

Addressing the Counterarguments: Yes, You Can Still Innovate

The immediate pushback is that this approach fosters complacency and stifles high performers. This is a false dichotomy.

A well-designed platform provides a solid, boring, and incredibly reliable foundation for 80-90% of all use cases—this is the variance-reducing constraint. However, it must also provide well-abstracted escape hatches. Elite teams should be able to opt-out of the paved road for a specific, justified reason (e.g., experimenting with a new service mesh) without having to rebuild the entire CI/CD pipeline from scratch. The platform enables their innovation by giving them a stable base to jump from, not by letting them pave their own chaotic roads everywhere.

The goal is not to cap the ceiling but to raise the floor so high that “going off-road” is a deliberate, costly choice for exceptional needs, not a necessity borne of frustration.

Practical Takeaways for Your Platform Team

  1. Audit Your Dashboards Today: Replace every average DORA metric with a view of its distribution (a histogram) and track the standard deviation over time. This one change will reframe every roadmap discussion.
  2. Prioritize by Friction, Not by Glamour: Your highest-priority features should come from the teams using your platform the least or struggling the most. Their pain points are your most valuable product insights.
  3. Redefine Platform ROI for Leadership: Stop talking about “faster deployments.” Start framing value as “reduced operational risk” and “increased delivery predictability for product planning.” Translate a reduced standard deviation in lead time into weeks saved in project timelines. This is the language of business.

The broader implication is that platform engineering is ultimately about engineering management. It’s about applying constraints thoughtfully to create a more predictable, scalable, and lower-risk system. The CNCF whitepaper gives us the correct measure: don’t be distracted by the stars. Focus on lifting the tide for everyone, and you’ll build a platform that truly moves the business forward.

Claude Code Doesn't Index Your Codebase. Here's What It Does Instead.

· 21 min read
Vadim Nicolai
Senior Software Engineer

Last verified: March 2026

Boris Cherny's team built RAG into early Claude Code. They tested it against agentic search. Agentic search won — not narrowly. A Claude engineer confirmed it in a Hacker News thread: "In our testing we found that agentic search outperformed [it] by a lot, and this was surprising."

That thread is the clearest primary source on how Claude Code actually works — and why it works that way. Most articles on the topic paraphrase it from memory. This one starts from the source.

Q: Does Claude Code index your codebase? A: No. Claude Code does not pre-index your codebase or use vector embeddings. Instead, it uses filesystem tools — Glob for file pattern matching, Grep for content search, and Read for loading specific files — to explore code on demand as it works through each task. Anthropic calls this "agentic search."


The Confession: What Boris Cherny Actually Said

In a public Hacker News thread, Boris Cherny — principal software engineer at Anthropic and Claude Code's creator — wrote this directly:

"Early versions of Claude Code used RAG + a local vector db, but we found pretty quickly that agentic search generally works better. It is also simpler and doesn't have the same issues around security, privacy, staleness, and reliability."

Cherny's statement drew a follow-up from another Claude engineer in the same HN thread: "Right — Claude Code doesn't use RAG currently. In our testing we found that agentic search outperformed [it] by a lot, and this was surprising."

The "surprising" qualifier matters. This was not a story where the team started with a principled position and built to confirm it. They built RAG, tested it, found it underperformed, and redesigned. The result is an architecture that runs against the grain of every major competing tool — and the gap is not accidental.

Cherny's background shapes how to read this decision. Before Anthropic, he was a principal engineer at Meta. He describes Claude Code's origin as a personal experiment: he gave the model a bash tool, watched it autonomously write AppleScript to query his music library, and realized the implication. An agent with tools beats a script with pre-retrieved context. That insight drove Claude Code's entire design. The YC Startup Library interview goes deeper on this philosophy: Cherny believes the future of development lies in "agent topologies" — multiple agents with fresh, isolated context windows working in parallel, not a single large agent with accumulated, polluted memory.

The architectural bet against indexing is downstream from that belief.


How Claude Code Actually Searches Your Code

"Agentic" means the model drives the search process rather than receiving pre-retrieved context. Claude Code decides what to look for, picks the right tool, acts on the result, and loops until it has enough to complete the task. The loop is think → act → observe → repeat, continuing until the model produces a plain text response with no tool call attached.
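
That loop is compact enough to sketch. This is an illustrative skeleton, not Anthropic's implementation; the `call_model` contract (returning either a final answer or a tool call as a dict) is an assumption made for the example:

```python
# Minimal think -> act -> observe loop. call_model(messages) returns either
# {"text": ...} (final answer, no tool call) or {"tool": name, "args": {...}}.
def agent_loop(call_model, tools, prompt: str) -> str:
    messages = [{"role": "user", "content": prompt}]
    while True:
        reply = call_model(messages)                           # think
        if "tool" not in reply:                                # plain text, no
            return reply["text"]                               # tool call => done
        result = tools[reply["tool"]](**reply["args"])         # act
        messages.append({"role": "tool", "content": result})   # observe; repeat
```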

What makes this work in practice is that the tools have very different cost profiles — and Claude Code is designed to use them in cost order.

The Tool Hierarchy with Token Economics

| Tool | What It Does | Token Cost | Use Case |
| --- | --- | --- | --- |
| Glob | File path pattern matching | Near-zero — returns paths only | `workers/**/*.toml`, `src/**/*.graphql` |
| Grep | Regex content search (powered by ripgrep) | Lightweight — returns matching lines | `createD1HttpClient`, `is_remote_eu` |
| Read | Full file contents into context | Heavy — 500–5,000 tokens per file | Confirm and load a specific file |
| Explore agent | Isolated read-only sub-agent (Haiku model) | Isolated — does not touch main context window | Deep codebase exploration across many files |

Eighteen built-in tools are confirmed in BrightCoding's reverse-engineering of Claude Code's minified JS, including Bash, Grep, Glob, Read, WebFetch, and the Task tool that spawns sub-agents. The Piebald-AI GitHub repo tracks all system prompt components and sub-agent prompts per version, updated within minutes of each Claude Code release. George Sung independently confirmed the same loop structure in January 2026 by forking Ollama to intercept API traffic.

Glob is the opening move. workers/**/*.toml costs almost nothing — it returns file paths, not file contents. Claude Code uses Glob to narrow the search space before any expensive operations begin.

Grep does heavier lifting: searching file contents by regex. Running grep -r "createD1HttpClient" . returns every line containing that string, with surrounding context. It is fast, exact, and composable. Claude Code chains Grep calls the way a developer would in a terminal — each search informed by the previous result, progressively narrowing toward the relevant files.

# The kind of grep chain Claude Code runs:
grep -r "createD1HttpClient" src/
grep -r "D1HttpClient" src/db/
grep -r "import.*d1-http" src/

Read loads a full file into the context window. A 200-line TypeScript file costs roughly 500–1,500 tokens. Claude Code reserves Read for files already identified as relevant via Glob and Grep — it is the confirm step, not the discovery tool.

The Explore Sub-Agent Architecture

For deep exploration, Claude Code spawns an Explore sub-agent: a read-only specialist that runs on the Haiku model inside its own isolated context window. The Piebald-AI repo documents three sub-agent prompt types with their sizes as of current versions: Explore agent (516 tokens), Plan mode enhanced (633 tokens), Task tool (294 tokens).

The Explore agent can Glob, Grep, Read, and run limited Bash (list, copy, move). It cannot create or modify files. When it finishes, it returns a summary to the main agent — not raw file contents. That summary preserves the insight while discarding the tokens.

This is the key isolation property: exploration work does not consume the main conversation's context budget. Cherny has described this as essential to his "agent topologies" philosophy — fresh context windows prevent the main session from accumulating irrelevant content from early searches that turned out to be dead ends.

Q: How does Claude Code search code in large repositories? A: Claude Code uses a three-tool hierarchy: Glob (lightweight file path pattern matching), Grep (content search returning matching lines), and Read (full file content into context). For deep exploration, it spawns an Explore sub-agent — a read-only Haiku model with its own isolated context window — to keep heavy search from consuming the main conversation's token budget.


The Economics: Why This Approach Is Viable at Scale

The most important financial fact about Claude Code's architecture is the 92% prompt prefix reuse rate. LMCache's December 2025 analysis found that across all phases of Claude Code's agentic loop — including the ReAct-based sub-agent loops — the same prefix (system prompt, tool definitions, CLAUDE.md contents) appears in 92% of turns.

This matters because of how Anthropic's prompt caching works: cache write tokens cost 1.25x base price, but cache read tokens cost only 0.1x. For a 2M-token session, processing without caching costs $6.00. With prefix caching at 92% reuse, that drops to $1.15 — an 81% cost reduction.
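
The arithmetic checks out as a back-of-envelope calculation, assuming a $3 per million input-token base price, a 1.25x cache-write premium, a 0.1x cache-read rate, and 92% of a 2M-token session served from cache:

```python
# Back-of-envelope check of the prompt-caching economics described above.
BASE = 3.00             # $ per million input tokens (assumed base price)
WRITE = 1.25 * BASE     # cache-write premium
READ = 0.10 * BASE      # cache-read discount
TOKENS_M = 2.0          # session size in millions of tokens
CACHE_HIT = 0.92        # fraction of tokens served from cache

no_cache = TOKENS_M * BASE
with_cache = CACHE_HIT * TOKENS_M * READ + (1 - CACHE_HIT) * TOKENS_M * WRITE
savings = 1 - with_cache / no_cache   # fractional cost reduction
```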

Without this, the "burn tokens iteratively" critique would be damning. With it, the economics of agentic search become defensible even on large codebases.

There is a real pricing cliff to understand. Claude API input tokens are priced at $3/million up to 200K tokens per request; beyond 200K, all tokens in that request cost $6/million — a 2x jump. This is a hard threshold, not a gradual escalation. Agentic sessions that accumulate significant context must manage this cliff deliberately. Anthropic's cost documentation estimates heavy API coding sessions at $3,650+/month. Claude Max at $200/month works out to approximately 18x cheaper for intensive use — which is why most developers using Claude Code heavily are on the subscription plan rather than the API.

The latency problem with sequential tool calls is real — but being solved. Relace's Fast Agentic Search (FAS) showed what is possible: an RL-trained sub-agent calling 4–12 tools in parallel instead of sequentially. Each sequential tool call takes 1–2 seconds; 20 sequential turns means 20–40 seconds of latency. FAS reduced 20 turns to 5 and 10 turns to 4, a 4x latency reduction, while maintaining accuracy comparable to Claude Sonnet 4.5. The bottleneck is sequential execution, not the agentic approach itself.


How the Competition Does It: Cursor, Windsurf, and Copilot

Claude Code's no-index bet cuts against the design of every major competing tool.

| Tool | Search Strategy | Index Location | Privacy Model | Freshness |
| --- | --- | --- | --- | --- |
| Claude Code | Agentic: Glob → Grep → Read → Explore agents | No index (runtime search) | Data never leaves machine | Always current (filesystem reads) |
| Cursor | Semantic vector RAG + optional @Codebase | Turbopuffer (cloud) + local cache | Embeddings + masked paths in cloud | Merkle-tree delta sync; incremental lag |
| Windsurf Cascade | AST-level semantic RAG, local index | Local (+ optional remote) | Local-first; enterprise options | Auto-updated on file change |
| GitHub Copilot | Code-tuned transformer embeddings | GitHub API (remote) + local for under 750 files | Embeddings in GitHub cloud | Indexed per commit; local for uncommitted |
| Zed AI | Automatic context discovery (agentic-leaning) | Varies by model provider | Depends on provider | Runtime |

Cursor is the most technically detailed comparison. The Engineers Codex analysis documents the full pipeline: Cursor computes a Merkle tree of hashes of all valid files, sends delta diffs to AWS-cached embedding storage, and queries Turbopuffer — a serverless vector and full-text search engine — at inference time. Only metadata is stored in the cloud: masked paths (each path component hashed with a secret key and fixed nonce), line ranges, and embedding vectors. Raw source code never leaves the machine. Indexing time dropped from a median of 7.87s to 525ms after optimization. Cursor shows an index status indicator; Claude Code shows nothing, because nothing needs to build.

Windsurf Cascade takes a different approach: AST-level indexing, building semantic blocks at function, method, and class boundaries rather than naive text chunks. The index starts immediately on workspace open and stays updated automatically on file change. It is local-first, which gives it the freshness advantage of no sync lag.

GitHub Copilot went generally available with semantic search in March 2025. The embedding model is a proprietary transformer fine-tuned on source code. For projects under 750 files, VS Code builds a local advanced index automatically; 750–2,500 files requires a manual trigger; above 2,500 falls back to a basic index. Uncommitted changes use a hybrid local approach.
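
The file-count tiers above reduce to a tiny decision function. A hypothetical helper for illustration (the tier strings are mine, not VS Code's):

```python
# Sketch of Copilot's local-index tiers by repository file count, as described
# above: <750 automatic advanced index, 750-2,500 manual trigger, >2,500 basic.
def copilot_index_tier(file_count: int) -> str:
    if file_count < 750:
        return "advanced index, built automatically"
    if file_count <= 2500:
        return "advanced index, manual trigger required"
    return "basic index fallback"
```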

The user experience difference is immediate: Cursor and Copilot require a setup phase with progress indicators. Claude Code requires nothing. That zero-friction start is not just UX polish — it reflects the architecture. There is genuinely nothing to build.

Q: What is the difference between Claude Code and Cursor indexing? A: Cursor proactively indexes your codebase using tree-sitter chunking and vector embeddings stored in Turbopuffer, updated incrementally via Merkle tree sync. Claude Code does not index at all — it searches on demand using grep-style exact-match tools. Cursor wins on semantic and conceptual search; Claude Code wins on precision, freshness, and zero setup time.


Why Anthropic Chose Grep Over Embeddings

Q: Why doesn't Claude Code use RAG? A: Claude Code's creator Boris Cherny explained on Hacker News that early versions did use RAG with a local vector database, but the team found agentic search consistently outperformed it. The main reasons: precision (grep finds exact matches, embeddings introduce fuzzy positives), simplicity (no index to build or maintain), freshness (a pre-built index drifts from code during active editing), and privacy (no data leaves the machine for embedding computation).

The precision argument is the strongest one for code specifically. createD1HttpClient either appears in a file or it does not. There is no fuzzy positive. Vector embeddings can surface "conceptually adjacent" code that shares no tokens with the target symbol — and in a coding context, conceptual adjacency without textual match is usually noise, not signal.

There is also academic validation. An Amazon Science paper published February 2026 (arXiv 2602.23368, "Keyword Search Is All You Need") ran a systematic comparison of RAG against agentic keyword search across retrieval tasks and found that keyword search via agentic tool use achieves over 90% of RAG-level performance without a vector database. The benchmark focused on document Q&A rather than code navigation specifically — but the principle that exact-match retrieval with iterative refinement competes with semantic search holds in the code context where symbols are precise by definition.

Anthropic's own engineering blog makes the philosophical case explicit. Their September 2025 post, "Effective Context Engineering for AI Agents", states: "Good context engineering means finding the smallest possible set of high-signal tokens that maximize the likelihood of some desired outcome." The "just in time" framing is key — agents should maintain lightweight identifiers (file paths, function names) and load data at runtime rather than pre-loading a large static context.

The December 2024 "Building Effective Agents" post reinforces this: "The most successful implementations use simple, composable patterns rather than complex frameworks." The basic building block is an LLM enhanced with retrieval, tools, and memory — but critically, with the model generating its own search queries rather than receiving pre-retrieved context.

Four specific objections drove the RAG abandonment decision. Security: an index stored somewhere is a target; Cursor's path masking adds cryptographic complexity that Claude Code avoids entirely. Privacy: embeddings of proprietary code leak information even as dense vectors; research on embedding inversion has shown partial text recovery in some settings. Staleness: an index built at session start is stale as soon as the first file changes. Reliability: every additional system is a failure point; vector DBs have latency spikes, embedding APIs have rate limits, sync pipelines have bugs.


The Real Costs: Token Burn and the Semantic Miss

The strongest published critique of agentic search came from Milvus. Their argument: "Grep is a dead end that drowns you in irrelevant matches, burns tokens, and stalls your workflow. Without semantic understanding, it's like asking your AI to debug blindfolded." They propose their Claude Context vector MCP plugin as a hybrid fix, claiming 40% token reduction.

Milvus sells a vector database. That commercial interest is transparent and worth noting. It does not make the technical criticism wrong.

The token burn problem is real on common terms. Search useState across a React codebase and you will get hundreds of matches across dozens of files. Claude Code must either process all of them (expensive) or refine the query (adds turns). On codebases with inconsistent naming or high churn, the refinement loop can consume substantial context before reaching the target file.

The 200K token pricing cliff makes this worse when hit: any request exceeding 200K input tokens pays 2x on all tokens in that request, not just the excess. The jump from $3/million to $6/million is a hard threshold, not a gradual escalation — and agentic sessions on large codebases with vague prompts can hit it faster than expected.
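Because the 2x rate applies to the whole request, the cost function is discontinuous at the threshold, which is easy to mis-model as marginal pricing. A sketch using the stated rates ($3/M at or below 200K input tokens, $6/M on every token once the request exceeds it):

```rust
// Input cost in dollars for one request under the cliff pricing described
// above: the 2x rate applies to ALL tokens, not just the excess.
fn input_cost_usd(tokens: u64) -> f64 {
    let rate_per_million = if tokens > 200_000 { 6.0 } else { 3.0 };
    tokens as f64 / 1_000_000.0 * rate_per_million
}
```

One token over the cliff roughly doubles the input bill for that request: 200,000 input tokens cost $0.60, while 200,001 cost about $1.20.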

The semantic miss problem is the other genuine limitation. Grep finds what you name. If createD1HttpClient was renamed buildGatewayClient six months ago, grep finds nothing. Vector embeddings preserve semantic relationships across renames — a real advantage on codebases with heavy refactoring history or cryptic abbreviation conventions.

In practice, Claude Code compensates by running multiple searches: "auth", "session", "token", "middleware", "jwt", "bearer" — triangulating toward the module rather than naming it directly. This multi-step reasoning is something static embedding retrieval cannot do (a vector DB returns its top-k hits and stops). But it costs more turns and more tokens than a single well-placed semantic query would.
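Mechanically, that triangulation amounts to scoring files by how many distinct probe keywords hit them. A toy sketch; the probe list and the scoring are illustrative, not Claude Code's actual internals:

```rust
use std::collections::BTreeMap;

// Rank files by how many distinct probe keywords they contain. No single
// keyword names the target module, but the file matching the most probes
// rises to the top — the triangulation described above.
fn triangulate<'a>(
    files: &'a BTreeMap<&'a str, &'a str>,
    probes: &[&str],
) -> Vec<(&'a str, usize)> {
    let mut scored: Vec<(&str, usize)> = files
        .iter()
        .map(|(path, content)| {
            let hits = probes.iter().filter(|p| content.contains(**p)).count();
            (*path, hits)
        })
        .filter(|(_, hits)| *hits > 0) // drop files no probe touched
        .collect();
    scored.sort_by(|a, b| b.1.cmp(&a.1)); // most probe hits first
    scored
}
```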

Where agentic search wins clearly:

  • Exact symbol lookup — function names, class names, import paths are precise by definition
  • Active editing sessions — grep reads current filesystem state; no index can be as fresh
  • Security and privacy contexts — zero data leaves the machine
  • Well-named, medium-sized codebases — consistent naming discipline eliminates most semantic miss risk

Where proactive indexing wins:

  • Large monorepos — millions of lines where iterative grep exploration burns context faster than it narrows
  • Conceptual search — "find all places we handle authentication" without knowing exact symbol names
  • Unfamiliar codebases — when you cannot yet name what you are looking for, semantic similarity is more useful than exact match
  • Enterprise teams — persistent cross-session context without re-exploration cost

Q: Is agentic search better than RAG for code? A: For many workloads, yes. A February 2026 Amazon Science paper (arXiv 2602.23368) found keyword search via agentic tool use achieves over 90% of RAG-level performance without a vector database. For code specifically, exact-match search outperforms semantic retrieval on stable, well-named codebases because code symbols are precise. RAG's advantage is on conceptual search across large repos with inconsistent naming.


What Developers Built to Fill the Gap

The community response to Claude Code's no-index architecture is itself a data point. Developers who needed semantic search on top of agentic search built it as an MCP extension rather than switching tools.

Several projects emerged:

  • Claude Context (Milvus/Zilliz) — an MCP server adding vector-powered semantic search to Claude Code's tool set; the same Milvus that wrote the critique built the fix
  • claude-codebase-indexer — vector-based search with intelligent chunking as a Claude Code add-on
  • claude-code-project-index — a PROJECT_INDEX system for persistent architectural awareness across sessions
  • CocoIndex — real-time codebase indexing designed to work alongside any AI coding agent
  • ast-grep — structural search understanding code ASTs, not raw text; finds patterns like "all arrow functions returning a Promise" without exact symbol names

The architectural significance: Claude Code is simultaneously an MCP client (connecting to external tool servers like these) and an MCP server (exposing its own file editing and command execution tools to Claude Desktop, Cursor, and Windsurf). The MCP documentation describes both directions. The no-index architecture is not a closed position — it is a composable default. Vector search is a plugin away for anyone who needs it.

The community's response tells us who the current architecture serves well (developers on medium-to-large codebases with disciplined naming who need precision and privacy) and who it does not fully serve out of the box (teams working on large legacy systems with inconsistent conventions where conceptual search across sessions would save significant time).


Where This Is Going

Context windows keep expanding. Claude Sonnet 4.6 supports 1M tokens in beta. At that scale, the distinction between "indexing" and "just loading everything" starts to blur — a sufficiently large context window could theoretically hold a medium-sized codebase in its entirety.

There is a catch. NxCode's analysis of Opus 4.6 at 1M tokens documents a 17-point MRCR retrieval accuracy drop as context fills (93% at shorter contexts, 76% at 1M tokens). Large context is available but not free of quality degradation — models lose precision at the edges of their effective attention range. Loading an entire codebase into a context window does not guarantee the model uses that context accurately.

Three trajectories are running in parallel:

Agentic search improves its execution. Relace's parallel tool call result — 4x latency reduction by calling 4–12 tools simultaneously via RL-trained optimization — shows the sequential bottleneck can be engineered away. The fundamental approach stays the same; the execution gets more efficient. Expect Claude Code's own tool execution to move in this direction.

Hybrid architectures become the production consensus. The HN community thread on agentic vs. RAG in production reflects what practitioners are reaching for at enterprise scale: vector prefiltering to narrow candidates, followed by agentic confirmation. Faster first-query response from embeddings, precision and freshness from grep-based verification. Neither architecture alone is the final answer for the largest systems.

Context window economics change the calculus. With 1M token contexts and Anthropic's 81% cost reduction from prefix caching, the "loading an entire codebase is prohibitively expensive" constraint is weakening. Anthropic's principle — "just in time retrieval of the smallest possible set of high-signal tokens" — remains the right engineering philosophy, but the practical threshold for "too large to load" keeps rising.

What is not changing is Cherny's underlying bet. Claude Code is described by its creator as "a Unix utility, not a product". The design principle is "do the simple thing first": memory is a markdown file, prompt summarization is done simply, search is grep. Complexity is deferred until it is demonstrated to be necessary. The RAG experiment demonstrated it was not — at least not for the majority of workloads.

The one scenario where indexing becomes necessary is the scenario that is genuinely hard to grep: a monorepo at Google or Meta scale, with millions of files, multiple programming languages, decades of naming inconsistency, and teams who need to ask conceptual questions about code they have never read. That is a real workload. It is not the workload Claude Code was designed for.

For the rest — developers working on their own codebases, on team projects with shared naming conventions, on repositories they understand well enough to name what they are looking for — the agentic search approach holds. Grep is precise, fresh, and private. The model learns to search the way you would, because it has the same tools you do. And as 1M-token context windows become the baseline, the gap between "search" and "load everything" shrinks further — which means the principle Anthropic bet on (retrieve just in time, keep context tight, prefer simplicity) only becomes more relevant as the underlying capability improves.



Pixel-Perfect UI with Playwright and Figma MCP: What Actually Works in 2026

· 14 min read
Vadim Nicolai
Senior Software Engineer

I asked an AI coding assistant to implement a page layout from a Figma design. It got the heading size wrong (28px instead of 24px), inserted a 4px gap where there should have been 8px, and hallucinated a duplicate magnifying glass icon inside the search bar. The overall structure was fine. The details were not.

This is the state of AI-assisted design-to-code in 2026. The tools get you 65-80% of the way there, then leave you in a no-man's land where the remaining pixels matter more than all the ones that came before. Every frontend engineer who has shipped production UI knows: "close enough" is not close enough.

I spent a session trying to close that gap using the toolchain everyone is talking about: Figma MCP for design context, headless Playwright for runtime measurement, and an AI assistant for the correction loop. Here is what happened, what broke, and what produced results.

Two Paradigms of Multi-Agent AI: Rust Parallel Agents vs Claude Code Agent Teams

· 28 min read
Vadim Nicolai
Senior Software Engineer
TL;DR

Three multi-agent coordination positions, one codebase. A static Rust/Tokio fan-out assigns 20 agents at compile time with zero coordination overhead. A team.rs library implements the full Claude Code agent-teams model in pure Rust — TaskQueue, Mailbox, PlanGate, ShutdownToken — and the study pipeline now uses it to run a 2-step search→write flow with inter-worker messaging. Claude Code agent teams invert every assumption of static fan-out: dynamic task claiming, file-locked concurrency, full bidirectional messaging. The decision rule is one question: do your agents need to talk to each other? If no, tokio::spawn + Arc<T>. If yes: build team.rs, or use TeamCreate.

Multi-agent AI engineering has become a core discipline in production software development. The interesting question is no longer whether to build multi-agent systems. It is how — and specifically, which architectural pattern to reach for given the nature of the work. The clearest way to see the trade-offs is a single codebase where multiple fundamentally different paradigms live side by side.

When this article was first published, the comparison was binary: the Rust crate used bare tokio::spawn fan-out while Claude Code provided the coordination model. That binary is no longer accurate. The research crate now ships team.rs — a 641-line generic coordination library in pure Rust that implements the complete Claude Code agent-teams model. The codebase now demonstrates all three positions simultaneously.

Why Multi-Agent AI Systems Are Having a Moment in 2026

Agent papers grew from roughly 820 in 2024 to over 2,500 in 2025. Enterprise AI projects using multi-agent architectures reportedly reached 72% in 2025. LangGraph is the most-adopted orchestration framework in the ecosystem; AutoGen and CrewAI follow. The concept has moved from research to production infrastructure faster than most practitioners anticipated.

What the research papers do not tell you is which architectural pattern to use. That is the gap this article closes.

Paradigm 1: Infrastructure-Owned Parallelism — The Rust/DeepSeek Approach

The research crate is a real Rust binary that fans out up to 20 parallel DeepSeek agents against Semantic Scholar, collects their outputs, and writes results to Cloudflare D1. Its architecture is aggressively simple.

The entry point (research/src/bin/research_agent.rs) exposes five subcommands: research (single agent), study (20 parallel agents over agentic-coding topics), prep (10 parallel agents over application-prep topics), enhance (10 agents per application section), and backend (20 agents for backend interview prep). Every subcommand follows the same pattern: define a static list of tasks, queue them, spawn workers, collect results.

The task list is a compile-time constant:

// research/src/study.rs — 20 topics, statically defined
pub const TOPICS: &[TopicDef] = &[
    TopicDef { slug: "tool-use-patterns", ... },
    TopicDef { slug: "react-agent-loop", ... },
    // ... 18 more
];

The task structure is fully known before the binary starts. There is no runtime negotiation over which agent handles which topic.

How the DeepSeek Tool-Use Loop Works in Rust

Each spawned agent runs the same inner loop, implemented in research/src/agent.rs. The loop is a direct implementation of the OpenAI-compatible function-calling protocol — without a Python SDK wrapper, without a framework abstraction layer:

// research/src/agent.rs — the agentic tool-use loop
impl DeepSeekAgent {
    pub async fn prompt(&self, user_prompt: String) -> Result<String> {
        let mut messages: Vec<Value> = vec![
            json!({"role": "system", "content": self.preamble}),
            json!({"role": "user", "content": user_prompt}),
        ];

        loop {
            let resp: Value = self.http
                .post(&format!("{}/v1/chat/completions", self.base_url))
                .bearer_auth(&self.api_key)
                .json(&body)
                .send().await?
                .json().await?;

            let finish_reason = resp["choices"][0]["finish_reason"]
                .as_str().unwrap_or("stop");

            match finish_reason {
                "tool_calls" => {
                    // Execute each requested tool, append results, loop again
                    messages.push(message.clone());
                    for call in calls {
                        let result = tool.call_json(args).await?;
                        messages.push(json!({
                            "role": "tool",
                            "tool_call_id": call_id,
                            "content": result,
                        }));
                    }
                }
                _ => {
                    // "stop" — return the final content
                    return message["content"].as_str().map(String::from)...;
                }
            }
        }
    }
}

The Tool trait that backs this loop uses async_trait and is simple by design:

#[async_trait]
pub trait Tool: Send + Sync {
    fn name(&self) -> &str;
    fn definition(&self) -> ToolDefinition;
    async fn call_json(&self, args: Value) -> Result<String>;
}

Tools register their own JSON Schema via definition(), and the agent loop dispatches by name. In the research crate, the tools are search_papers (Semantic Scholar API) and get_paper_detail. Agents in the study subcommand use both tools for paper lookup; agents in the prep subcommand run without tools — direct chat completions for speed, because their task structure does not require external lookups.
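The dispatch-by-name step reduces to a map from tool name to trait object. A simplified synchronous sketch (the real trait is async via async_trait; this SearchPapers stub is a stand-in for the Semantic Scholar tool, not the crate's code):

```rust
use std::collections::HashMap;

// Synchronous stand-in for the async Tool trait: same shape, no async_trait.
trait Tool {
    fn name(&self) -> &str;
    fn call_json(&self, args: &str) -> Result<String, String>;
}

struct SearchPapers;

impl Tool for SearchPapers {
    fn name(&self) -> &str { "search_papers" }
    fn call_json(&self, args: &str) -> Result<String, String> {
        // A real implementation would hit the Semantic Scholar API here.
        Ok(format!("results for {args}"))
    }
}

// The agent loop's dispatch step: look the tool up by the name the model
// requested, then invoke it with the model-provided arguments.
fn dispatch(
    tools: &HashMap<String, Box<dyn Tool>>,
    name: &str,
    args: &str,
) -> Result<String, String> {
    tools
        .get(name)
        .ok_or_else(|| format!("unknown tool: {name}"))?
        .call_json(args)
}
```

An unknown tool name surfaces as an error result rather than a panic, which is what the loop needs: the error text can be fed back to the model as the tool result.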

Spawning Parallel Agents with Tokio

The prep pipeline still demonstrates the flat fan-out pattern — no inter-worker communication, no dependency graph. The APPLICATION_TOPICS path is the cleanest example of infrastructure-owned parallelism:

// research/src/study.rs — run_prep()
let queue: TaskQueue<TopicDef> = TaskQueue::new();
for topic_def in APPLICATION_TOPICS {
    queue.push(topic_def.slug, *topic_def, vec![], 2).await;
}

let mailbox = Mailbox::new();
let (_shutdown_tx, shutdown) = shutdown_pair();
let summary = TeamLead::new(APPLICATION_TOPICS.len())
    .run(queue, mailbox, shutdown, move |ctx, task| {
        let api_key = Arc::clone(&api_key);
        let d1 = Arc::clone(&d1);
        let topic_def = task.payload;
        async move {
            info!(worker = %ctx.worker_id, topic = topic_def.slug, "Prep agent starting");
            let row = run_direct_agent(topic_def, &api_key).await?;
            d1.insert_study_topic(&row)
                .await
                .with_context(|| format!("D1 insert failed for {}", topic_def.slug))?;
            info!(worker = %ctx.worker_id, topic = topic_def.slug, "Saved to D1");
            Ok::<(), anyhow::Error>(())
        }
    })
    .await;

No mailbox traffic, no dependencies: each worker reads its own topic, makes its own API call, and writes its own row to D1, never communicating with its peers. This is the flat fan-out case expressed through the team abstraction — functionally equivalent to a bare tokio::spawn loop, but now with retry, idle notifications, and cooperative shutdown included for free.

Shared state is wrapped in Arc<T> and cloned cheaply into each task. A Tokio task carries roughly 64 bytes of overhead and spawns in sub-microsecond time. Spinning up 20 agents adds negligible latency to program startup.
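The cheap-clone pattern is easy to demonstrate with std threads, which behave the same as Tokio tasks with respect to Arc: one allocation, N pointer clones. An illustrative sketch (the topic list and the per-worker partitioning rule are made up for the example):

```rust
use std::sync::Arc;
use std::thread;

// Share one immutable topic list across many workers: the Vec is allocated
// once, and each worker receives only a refcount bump, never a deep copy.
fn fan_out(topics: Arc<Vec<&'static str>>, workers: usize) -> Vec<usize> {
    let handles: Vec<_> = (0..workers)
        .map(|i| {
            let topics = Arc::clone(&topics); // cheap: bumps an atomic refcount
            thread::spawn(move || {
                // Arbitrary partition rule so each topic lands on one worker.
                topics.iter().filter(|t| t.len() % workers == i).count()
            })
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```

After every handle is joined, the worker clones have been dropped and the strong count is back where it started: the shared data outlives no one and is copied by no one.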

The Third Path: Implementing Agent-Teams Primitives in Rust

After observing the mismatch between "static fan-out is too rigid for the study pipeline" and "spinning up a full Claude session per research topic is too expensive," the research crate grew a third position: research/src/team.rs, a 641-line Rust coordination library that implements the complete Claude Code agent-teams model natively.

This is not an accidental similarity. The module-level doc comment states the goal explicitly, mapping every agent-teams concept to its Rust equivalent:

Agent-teams concept → team.rs equivalent:

  • Shared task list: TaskQueue<P>
  • Atomic task claiming: TaskQueue::claim
  • Task dependencies: depends_on in TaskQueue::push
  • Retry on failure: max_attempts + re-queue on fail
  • Queue change notification: TaskQueue::notify_handle
  • Lead / worker separation: TeamLead + TeamContext
  • Worker identity: stable worker-NN IDs
  • Peer discovery: ctx.peer_ids in TeamContext
  • Point-to-point message: Mailbox::send
  • Broadcast to all teammates: Mailbox::broadcast
  • Idle notifications: worker → team-lead inbox on exit
  • Plan approval gate: PlanGate
  • Cooperative shutdown: ShutdownToken / shutdown_pair

Every concept from the Claude Code agent-teams documentation has a direct Rust/Tokio equivalent. The target audience is clear: engineers who need the coordination semantics of agent teams but cannot or will not run a full Claude session per task — whether because of WASM constraints, cost at scale, or infrastructure ownership requirements.

The full implementation is in research/src/team.rs.

TaskQueue — Atomic Claiming with Dependency Support

TaskQueue<P> is generic over the task payload type. Its claim() method is the coordination core — it holds the mutex for the full claim operation, computes which tasks have their dependencies satisfied, and claims the lowest available ID:

// research/src/team.rs — TaskQueue::claim
pub async fn claim(&self, worker: &str) -> Option<(TaskId, String, P)> {
    let mut s = self.inner.lock().await;
    let done: HashSet<TaskId> = s.tasks.values()
        .filter(|t| t.status == TaskStatus::Completed)
        .map(|t| t.id)
        .collect();
    let id = s.tasks.values()
        .filter(|t| {
            t.status == TaskStatus::Pending
                && t.depends_on.iter().all(|d| done.contains(d))
        })
        .map(|t| t.id)
        .min()?; // lowest ID wins (ID-order preference)
    let task = s.tasks.get_mut(&id).unwrap();
    task.status = TaskStatus::Claimed(worker.into());
    task.attempts += 1;
    Some((id, task.name.clone(), task.payload.clone()))
}

Tasks are pushed with an explicit dependency list:

// research/src/team.rs — TaskQueue::push
pub async fn push(
    &self,
    name: impl Into<String>,
    payload: P,
    depends_on: Vec<TaskId>, // IDs that must be Completed before this can be claimed
    max_attempts: u32,
) -> TaskId {
    let mut s = self.inner.lock().await;
    let id = s.next_id;
    s.next_id += 1;
    s.tasks.insert(id, TaskEntry {
        id,
        name: name.into(),
        payload,
        status: TaskStatus::Pending,
        depends_on,
        attempts: 0,
        max_attempts,
    });
    id
}

Failure handling re-queues the task as Pending if attempts remain, permanently marks it Failed otherwise, and notifies idle workers via Notify:

// research/src/team.rs — TaskQueue::fail
pub async fn fail(&self, id: TaskId) {
    {
        let mut s = self.inner.lock().await;
        if let Some(t) = s.tasks.get_mut(&id) {
            if t.attempts >= t.max_attempts {
                warn!(task = %t.name, attempts = t.attempts, "Task permanently failed");
                t.status = TaskStatus::Failed;
            } else {
                warn!(task = %t.name, attempt = t.attempts, max = t.max_attempts,
                    "Task failed — re-queuing for retry");
                t.status = TaskStatus::Pending;
            }
        }
    }
    self.changed.notify_waiters();
}

Mailbox — Point-to-Point and Broadcast Messaging

The Mailbox is an Arc-wrapped HashMap<String, VecDeque<Envelope>> — named inboxes, FIFO order. Any string can be an inbox name: worker IDs, task slugs, topic slugs. From the doc comment:

Workers write to named inboxes and read from their own. The inbox name can be a worker ID, a task name, a topic slug — any agreed-upon key. This mirrors the agent-teams mailbox where teammates message each other directly without going through the lead.

Point-to-point send:

// research/src/team.rs — Mailbox::send
pub async fn send(
    &self,
    from: impl Into<String>,
    to: impl Into<String>,
    subject: impl Into<String>,
    body: impl Into<String>,
) { ... }

Broadcast delivers the same message to every recipient in the slice:

// research/src/team.rs — Mailbox::broadcast
pub async fn broadcast(
    &self,
    from: impl Into<String>,
    recipients: &[&str],
    subject: impl Into<String>,
    body: impl Into<String>,
) {
    let from = from.into();
    let subject = subject.into();
    let body = body.into();
    for recipient in recipients {
        self.send(from.clone(), *recipient, subject.clone(), body.clone()).await;
    }
}

Blocking receive parks the task until a message arrives:

// research/src/team.rs — Mailbox::recv_wait
pub async fn recv_wait(&self, inbox: &str) -> Envelope {
    loop {
        if let Some(env) = self.recv(inbox).await {
            return env;
        }
        self.notify.notified().await;
    }
}
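The same park-until-notified shape can be written with std primitives alone, where a Condvar plays the role of Notify. A minimal synchronous sketch of a named-inbox mailbox, illustrative rather than the team.rs code itself:

```rust
use std::collections::{HashMap, VecDeque};
use std::sync::{Arc, Condvar, Mutex};

// Blocking mailbox: named FIFO inboxes; recv_wait parks on a Condvar until
// a message arrives (the std analogue of waiting on tokio's Notify).
#[derive(Clone, Default)]
struct Mailbox {
    inner: Arc<(Mutex<HashMap<String, VecDeque<String>>>, Condvar)>,
}

impl Mailbox {
    fn send(&self, to: &str, body: &str) {
        let (lock, cvar) = &*self.inner;
        lock.lock().unwrap()
            .entry(to.to_string())
            .or_default()
            .push_back(body.to_string());
        cvar.notify_all(); // wake every parked receiver to re-check its inbox
    }

    fn recv_wait(&self, inbox: &str) -> String {
        let (lock, cvar) = &*self.inner;
        let mut map = lock.lock().unwrap();
        loop {
            if let Some(msg) = map.get_mut(inbox).and_then(|q| q.pop_front()) {
                return msg;
            }
            // Atomically release the lock and park; re-loop handles
            // spurious wakeups and messages for other inboxes.
            map = cvar.wait(map).unwrap();
        }
    }
}
```

The loop-and-recheck structure is the important part: both Condvar and Notify permit spurious wakeups, so the receiver must re-test its own inbox rather than assume the wakeup was for it.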

The Envelope struct carries a monotonic message ID, sender, recipient, subject, and body (plain text or JSON):

// research/src/team.rs
pub struct Envelope {
    pub id: u64,
    pub from: String,
    pub to: String,
    pub subject: String,
    pub body: String,
}

TeamLead::run() — The Worker Driver

TeamLead holds two fields: worker_count and idle_poll_ms. Its run() method is fully generic — the task payload type, return type, and worker closure are all type parameters:

// research/src/team.rs — TeamLead::run (signature)
pub async fn run<P, R, F, Fut>(
    &self,
    queue: TaskQueue<P>,
    mailbox: Mailbox,
    shutdown: ShutdownToken,
    worker_fn: F,
) -> QueueSummary
where
    P: Send + Clone + 'static,
    R: Send + 'static,
    F: Fn(TeamContext<P>, WorkerTask<P>) -> Fut + Send + Sync + Clone + 'static,
    Fut: std::future::Future<Output = anyhow::Result<R>> + Send,

Each worker loop checks shutdown, claims tasks, invokes the worker closure, and handles success or failure. When idle, workers wait on a Notify handle rather than busy-polling:

// research/src/team.rs — worker loop inside TeamLead::run
loop {
    if shutdown.is_cancelled() {
        info!(worker = %worker_id, "Shutdown requested — exiting");
        break;
    }

    match queue.claim(&worker_id).await {
        Some((id, name, payload)) => {
            info!(worker = %worker_id, task = %name, "Claimed task");
            let ctx = TeamContext {
                worker_id: worker_id.clone(),
                peer_ids: peer_ids.clone(),
                queue: queue.clone(),
                mailbox: mailbox.clone(),
                shutdown: shutdown.clone(),
            };
            let task = WorkerTask { id, name: name.clone(), payload };
            match worker_fn(ctx, task).await {
                Ok(_) => queue.complete(id).await,
                Err(e) => {
                    tracing::error!(worker = %worker_id, task = %name, "Task failed: {e}");
                    queue.fail(id).await;
                }
            }
        }
        None => {
            if queue.all_done().await {
                info!(worker = %worker_id, "All tasks done — idle");
                break;
            }
            let notify = queue.notify_handle();
            tokio::select! {
                _ = notify.notified() => {}
                _ = tokio::time::sleep(Duration::from_millis(idle_poll_ms)) => {}
            }
        }
    }
}

// Idle notification — mirrors agent-teams: "teammates notify the lead when they finish"
mailbox.send(
    &worker_id,
    "team-lead",
    "idle",
    format!("{worker_id} idle — queue: {} pending, ...", summary_snapshot.pending),
).await;

Workers send an "idle" message to the "team-lead" inbox on exit. This mirrors the agent-teams behavior where teammates automatically notify the lead when they finish.

Peer Discovery via TeamContext

Each worker receives a TeamContext containing its own ID, the list of all peer IDs, the shared queue, the shared mailbox, and the shutdown token:

// research/src/team.rs
pub struct TeamContext<P: Clone + Send + 'static> {
    pub worker_id: String,
    /// IDs of all other active workers — mirrors agent-teams members array.
    pub peer_ids: Vec<String>,
    pub queue: TaskQueue<P>,
    pub mailbox: Mailbox,
    pub shutdown: ShutdownToken,
}

peer_ids is computed by TeamLead::run() before spawning. Each worker gets all IDs except its own:

// research/src/team.rs — inside TeamLead::run
let all_ids: Vec<String> = (1..=self.worker_count)
    .map(|i| format!("worker-{:02}", i))
    .collect();

// per-worker:
let peer_ids: Vec<String> = all_ids.iter()
    .filter(|id| *id != &worker_id)
    .cloned()
    .collect();

Workers can address each other directly via ctx.mailbox.send(&ctx.worker_id, peer_id, ...) using ctx.peer_ids as the address book — the exact same model as the agent-teams members array.

Cooperative Shutdown via ShutdownToken

The ShutdownToken uses a watch::channel — the lead's sender writes true to signal shutdown, and each worker checks the value between task iterations, never inside task execution:

// research/src/team.rs
#[derive(Clone)]
pub struct ShutdownToken(watch::Receiver<bool>);

impl ShutdownToken {
    pub fn is_cancelled(&self) -> bool { *self.0.borrow() }
}

pub struct ShutdownSender(watch::Sender<bool>);

impl ShutdownSender {
    pub fn shutdown(&self) { let _ = self.0.send(true); }
}

pub fn shutdown_pair() -> (ShutdownSender, ShutdownToken) {
    let (tx, rx) = watch::channel(false);
    (ShutdownSender(tx), ShutdownToken(rx))
}

From the doc comment: "Workers poll is_cancelled() between task iterations. They always finish their current task before checking — matching the agent-teams behaviour: 'teammates finish their current request before shutting down'." Workers are never cancelled mid-flight; the shutdown is cooperative.

PlanGate — Plan Approval Gate

PlanGate is the Rust equivalent of Claude Code's plan approval flow. Workers call submit_and_wait() and block on a oneshot::Receiver. The lead calls approve() or reject(), which sends on the oneshot::Sender and unblocks the worker:

// research/src/team.rs — PlanGate
pub async fn submit_and_wait(&self, worker_id: &str, plan: &str) -> PlanDecision {
    let (tx, rx) = tokio::sync::oneshot::channel();
    info!(worker = %worker_id, "Plan submitted, awaiting approval");
    self.pending.lock().await.insert(
        worker_id.into(),
        PlanEntry { plan: plan.into(), tx },
    );
    self.notify.notify_waiters();
    rx.await.unwrap_or(PlanDecision::Rejected { feedback: "Gate dropped".into() })
}

pub async fn approve(&self, worker_id: &str) {
    if let Some(e) = self.pending.lock().await.remove(worker_id) {
        info!(worker = %worker_id, "Plan approved");
        let _ = e.tx.send(PlanDecision::Approved);
    }
}

pub async fn reject(&self, worker_id: &str, feedback: &str) {
    if let Some(e) = self.pending.lock().await.remove(worker_id) {
        warn!(worker = %worker_id, "Plan rejected");
        let _ = e.tx.send(PlanDecision::Rejected { feedback: feedback.into() });
    }
}
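The oneshot handshake has a direct std analogue using mpsc channels. A synchronous sketch of the same submit/approve/reject flow (not the team.rs code, which is async; names mirror it for readability):

```rust
use std::collections::HashMap;
use std::sync::mpsc;

#[derive(Debug, PartialEq)]
enum PlanDecision {
    Approved,
    Rejected(String),
}

// std analogue of the PlanGate: a worker registers a one-shot reply channel
// under its ID; the lead resolves it with approve or reject.
#[derive(Default)]
struct PlanGate {
    pending: HashMap<String, mpsc::Sender<PlanDecision>>,
}

impl PlanGate {
    // Worker side: register the plan, get a receiver to block on.
    fn submit(&mut self, worker_id: &str) -> mpsc::Receiver<PlanDecision> {
        let (tx, rx) = mpsc::channel();
        self.pending.insert(worker_id.to_string(), tx);
        rx
    }

    // Lead side: resolve the pending plan, unblocking the worker.
    fn approve(&mut self, worker_id: &str) {
        if let Some(tx) = self.pending.remove(worker_id) {
            let _ = tx.send(PlanDecision::Approved);
        }
    }

    fn reject(&mut self, worker_id: &str, feedback: &str) {
        if let Some(tx) = self.pending.remove(worker_id) {
            let _ = tx.send(PlanDecision::Rejected(feedback.to_string()));
        }
    }
}
```

Removing the sender from the map on resolution gives the same one-shot guarantee as tokio's oneshot channel: a plan can be decided at most once.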

The minimal example from the module's doc comment shows the full API surface in a dozen lines:

// research/src/team.rs — doc example
let queue: TaskQueue<String> = TaskQueue::new();
queue.push("greet", "hello".into(), vec![], 2).await;

let mailbox = Mailbox::new();
let (_sd_tx, shutdown) = shutdown_pair();

let summary = TeamLead::new(2)
    .run(queue, mailbox, shutdown, |_ctx, task| async move {
        println!("{}: {}", task.name, task.payload);
        Ok::<(), anyhow::Error>(())
    })
    .await;

assert_eq!(summary.completed, 1);

The 2-Step Mailbox Pipeline: search→write via Mailbox

The study pipeline is where team.rs coordination replaces the old static fan-out. For each of the 20 agentic-coding topics, the pipeline queues two dependent tasks: a Search task that queries Semantic Scholar and deposits findings into the mailbox, and a Write task that reads those findings and generates the study guide.

The ResearchTask Enum

// research/src/study.rs
#[derive(Clone)]
enum ResearchTask {
    Search(TopicDef),
    Write { topic: TopicDef, category: &'static str },
}

The old run_single_agent() function — which handled the full research-and-write pipeline in one agent — has been replaced by two phase-specific functions: search_topic_papers() (runs the tool-use agent with SearchPapers and GetPaperDetail tools, returns raw findings as markdown) and write_study_guide() (pure-completion agent, no tools, receives findings string, returns a StudyTopicRow). Only the search phase needs the Semantic Scholar API; the write phase is self-contained given the findings.

Queuing Paired Tasks with Dependencies

For each topic, run_topics() pushes two tasks. The write:{slug} task carries the search task's ID in its depends_on list, so TaskQueue::claim cannot return it until the paired search task is completed:

// research/src/study.rs — run_topics()
let queue: TaskQueue<ResearchTask> = TaskQueue::new();
for topic_def in topics {
    let search_id = queue
        .push(
            format!("search:{}", topic_def.slug),
            ResearchTask::Search(*topic_def),
            vec![], // no dependencies
            2,      // max 2 attempts
        )
        .await;
    queue
        .push(
            format!("write:{}", topic_def.slug),
            ResearchTask::Write { topic: *topic_def, category },
            vec![search_id], // blocked until search completes
            2,
        )
        .await;
}

For 20 topics, this pushes 40 tasks total. The queue enforces that no write:{slug} task can be claimed until its paired search:{slug} is completed.

The TeamLead::run() Call

The old bare tokio::spawn loop is replaced by TeamLead::new(topics.len()).run(...). The number of workers equals the number of topics, so search and write tasks for different topics can overlap even while write tasks within one topic block on their own search:

// research/src/study.rs — run_topics()
let mailbox = Mailbox::new();
let (_shutdown_tx, shutdown) = shutdown_pair();
let summary = TeamLead::new(topics.len())
    .run(queue, mailbox, shutdown, move |ctx, task| {
        let api_key = Arc::clone(&api_key);
        let scholar = Arc::clone(&scholar);
        let d1 = Arc::clone(&d1);
        async move {
            match task.payload {
                ResearchTask::Search(topic) => {
                    info!(worker = %ctx.worker_id, topic = topic.slug, "Search phase starting");
                    let findings = search_topic_papers(topic, &scholar, &api_key).await?;
                    ctx.mailbox
                        .send(&ctx.worker_id, format!("findings:{}", topic.slug), "paper-findings", findings)
                        .await;
                    info!(worker = %ctx.worker_id, topic = topic.slug, "Search phase done, findings in mailbox");
                }
                ResearchTask::Write { topic, category } => {
                    info!(worker = %ctx.worker_id, topic = topic.slug, "Write phase starting");
                    let env = ctx.mailbox.recv_wait(&format!("findings:{}", topic.slug)).await;
                    let row = write_study_guide(topic, category, &env.body, &api_key).await?;
                    d1.insert_study_topic(&row)
                        .await
                        .with_context(|| format!("D1 insert failed for {}", topic.slug))?;
                    info!(worker = %ctx.worker_id, topic = topic.slug, "Saved to D1");
                }
            }
            Ok::<(), anyhow::Error>(())
        }
    })
    .await;

The mailbox inbox name convention is findings:{slug}. The search worker sends to that inbox; the write worker calls recv_wait(&format!("findings:{slug}")), blocking until the message is available. The queue's task dependency guarantees the Write task cannot even be claimed until Search completes, so recv_wait unblocks almost immediately in practice — but the blocking receive is a safety net in case task completion and message delivery ever fall slightly out of sync.

Before team.rs existed, the study.rs pipeline was the opposite of this: isolated agents, no inter-worker communication, outputs collected after the fact from D1. Adding the mailbox turned it from independent parallel agents into a coordinated pipeline where one worker's output is another's input — exactly the pattern the Claude Code agent-teams SendMessage primitive enables.

Paradigm 2: Platform-Managed Agent Teams — The Claude Code Approach

Claude Code's experimental agent teams feature inverts every architectural assumption of static fan-out. Where the Rust system owns its concurrency at the OS level, Claude teams delegate coordination to the platform. Where Rust pre-assigns tasks via a queue, Claude teams use a shared task list with file-locked claiming at runtime. Where the flat Rust fan-out has isolated agents, Claude teammates send messages to each other directly.

The feature is enabled in the nomadically.work repo via .claude/settings.json:

{
  "env": {
    "CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS": "1"
  }
}

That one line unlocks five coordination primitives: TeamCreate, TaskCreate, TaskUpdate, TaskList, and SendMessage. Each teammate is a full, independent Claude Code session — a separate process with its own context window, its own file system access, and its own ability to write and run code.

How Claude Code Agent Teams Self-Organize

The team lead creates a shared task list stored in ~/.claude/tasks/{team-name}/. Teammates discover available tasks by calling TaskList, claim work by calling TaskUpdate (setting themselves as owner), and the platform uses file locking to prevent two teammates from claiming the same task simultaneously. When a teammate finds something unexpected, they send a direct message to the relevant peer via SendMessage — the lead does not need to relay it.

The nomadically.work repo uses Claude agent teams in its SDD (Spec-Driven Development) orchestrator. The /sdd:ff command spawns a team where spec-writing and design run in parallel — two teammates working simultaneously, each producing artifacts the other may need to reference. The key point is that these phases are not fully independent: a spec decision can change a design constraint. If that happens, the teammates can tell each other directly.

The teammate display cycles through sessions with Shift+Down when running in-process, or splits into panes under tmux or iTerm2. The team lead's conversation history does not carry over to teammates — each starts fresh from the shared CLAUDE.md project context and the task description. This context isolation is a real overhead: a teammate may re-investigate something the lead already resolved, spending tokens to rediscover context the lead holds but cannot transfer.

Known limitations apply: no session resumption, task status can lag under contention, no nested teams (teammates cannot spawn sub-teams), and the lead is fixed for the team's lifetime. These are experimental constraints, not permanent design decisions — but they are real constraints today.

Comparing the Three Positions: A Decision Framework

These are not competing patterns converging toward the same solution. They occupy distinct positions on a coordination spectrum, each optimal for a different class of work.

The sharpest question to ask before choosing is: do your agents need to talk to each other?

If the answer is no — if you can define all tasks before the run starts, if each agent's output is independent, if a failure in one agent should not affect the scope of another — then flat fan-out is the right call. The run_prep() path in study.rs demonstrates this with TeamLead providing retry and shutdown for free, but no mailbox communication.

If the answer is yes — and you own your infrastructure, need deterministic concurrency control, or operate under WASM or cost constraints that make full Claude sessions per task unviable — then team.rs is the answer. You get the full coordination model (messaging, dependency graphs, plan gates, cooperative shutdown) at Tokio cost, not Claude token cost.

If the answer is yes — and you want the coordination model without writing it, prefer natural-language task definitions, need human-in-the-loop steering mid-run, or cannot afford the maintenance overhead of a custom coordination library — then Claude Code agent teams are the answer.

| Dimension | Rust flat fan-out | Rust with team.rs | Claude Code agent teams |
| --- | --- | --- | --- |
| Task assignment | Static, pre-queued | Dynamic, atomic claim() | Dynamic, file-locked claiming |
| Inter-agent communication | None | Mailbox::send / broadcast | Full bidirectional via SendMessage |
| Task dependency support | None | depends_on: Vec<TaskId> | Blocked/unblocked dependency graph |
| Task retry | Manual | max_attempts + re-queue | Platform-managed |
| Human-in-the-loop | Fire-and-forget | PlanGate::submit_and_wait | Direct message injection to any teammate |
| Cooperative shutdown | None | ShutdownToken / watch channel | Platform-managed |
| Concurrency overhead | ~64 bytes + sub-μs spawn | Same — TeamLead uses tokio::spawn internally | Full context window per teammate; token-linear scaling |
| Partial failure handling | Counter; peers continue | fail() re-queues within max_attempts | Failed teammate replaceable without aborting team |
| Task dynamism | Zero | Re-queue on failure; dependency graph changes effective availability | Tasks can be created, re-assigned, or cancelled at runtime |
| Observability | Structured logs (tracing) | Structured logs + QueueSummary + mailbox inbox counts | Teammate display modes (in-process, tmux, iTerm2) |
| Infrastructure ownership | Full | Full | Platform-managed |

Cost, Latency, and Observability Tradeoffs

The Rust crate's cost model is transparent regardless of which coordination layer you use. Workers make independent API calls to DeepSeek. Each call consumes tokens proportional to the agent's preamble, context, and tool results. Total cost is roughly N times the cost of a single agent — no platform overhead, no coordination messages, no duplicate context. The team.rs coordination layer is in-process Rust with zero token cost.

Claude agent teams cost more: the official documentation describes higher cost but publishes no specific multiplier. Each teammate carries its own full context window. Broadcast messages sent to all teammates multiply token usage by team size. The official recommendation is 3–5 teammates with 5–6 tasks each — beyond that, coordination overhead accumulates faster than parallelism saves.

Latency follows the opposite pattern. The Rust system's wall-clock time is bounded by the slowest agent plus network latency — typically 30–90 seconds for 20 agents running fully parallel. The team.rs 2-step pipeline adds the mailbox handoff latency (sub-millisecond, in-process), which is negligible compared to LLM inference time. A Claude team doing the same breadth of work sequentially within a single session would take proportionally longer.

This engineering work is well-compensated precisely because operating these systems at production scale requires understanding these tradeoffs, not just knowing the API.

When to Build Your Own vs Use Claude Code Agent Teams

Build infrastructure-owned concurrency (Rust team.rs, Python asyncio, TypeScript Promise.all) when:

  • Task structure is fully or partially known before execution starts
  • You need deterministic concurrency control with predictable retry behavior
  • You are running on constrained infrastructure (Cloudflare Workers, WASM) where a full agent session per task is not viable
  • Per-token cost matters at scale — flat API cost per agent, no platform overhead
  • Inter-agent communication is needed but full Claude sessions per worker are too expensive
  • You want compile-time type safety over agent payload shapes

Use Claude Code agent teams when:

  • The task is exploratory — agents may discover things that change the plan
  • Agents need to challenge or build on each other's reasoning in natural language
  • Task dependencies are dynamic — you cannot know the full task graph upfront
  • You want human steering capability mid-run without aborting the whole run
  • Orchestration code itself is a maintenance burden you want to avoid writing
  • Task definitions benefit from natural language rather than typed enum variants

The run_prep() path is an example of flat fan-out. The run_topics() pipeline is an example of team.rs coordination. The SDD orchestrator is an example of Claude agent teams. All three exist in the same codebase because the tasks they handle are structurally different — not because one pattern supersedes the others.

One nuance worth stating plainly: the static fan-out pattern is not Rust-specific. Python's asyncio.gather() and TypeScript's Promise.all() implement the same model. The Rust implementation is a hook into the nomadically.work codebase, not an argument for Rust as the only language for this problem. The DeepSeek API is OpenAI-compatible; the tool-use loop in agent.rs could be ported to Python in an afternoon. The Rust choice reflects specific constraints — WASM compilation targets, type-safe JSON handling, and zero-cost abstractions for a system intended for Cloudflare Worker environments. Those are valid reasons; they are also not universal.

What This Means for the Future of AI-Powered Software Development

The three positions now occupy distinct points on a coordination spectrum that will remain relevant regardless of how individual frameworks evolve.

At one end: static fan-out, owned concurrency, zero coordination overhead, compile-time task structure. Maximally efficient for embarrassingly parallel work where the task graph is known. Gets faster as inference costs fall and async runtimes improve.

In the middle: owned-infrastructure coordination (team.rs or equivalent), dynamic task claiming, in-process messaging, cooperative shutdown, plan gates. Maximally efficient when you need coordination semantics but cannot pay full-session cost per worker. Gets easier to build as the primitives become better understood.

At the other end: platform-managed coordination, dynamic teams, full messaging infrastructure, runtime task discovery in natural language. Maximally flexible for exploratory work where the task graph emerges during execution. Gets cheaper as context window costs fall and team-size recommendations increase.

The emerging challenge — genuinely unsolved — is automated task structure detection: given a goal, should the system fan-out statically, build a team.rs-style queue, or stand up a full agent team? The agentic frameworks (Claude Agent SDK, OpenAI Agents SDK, LangGraph) are converging on common primitives for describing tasks and dependencies. But the decision of which concurrency model to use still requires human judgment about the nature of the work.

That judgment is increasingly a senior engineering skill — and it is what separates engineers who can operate these systems at production scale from those who merely know the API.


FAQ

What is the Rust equivalent of Claude Code agent teams? The team.rs module in the nomadically.work research crate implements full parity: TaskQueue replaces the shared task list, TaskQueue::claim handles atomic claiming, Mailbox::send and Mailbox::broadcast replace SendMessage, PlanGate implements the plan approval gate, and ShutdownToken (via tokio::sync::watch) handles cooperative shutdown. Every agent-teams primitive has a direct Rust/Tokio equivalent.

What is the difference between multi-agent orchestration and agent swarms? Orchestration implies a coordinator that assigns tasks to workers based on a defined structure — the coordinator knows the plan. Swarms imply emergent coordination where agents self-organize without a central planner. Claude Code agent teams are closer to orchestration (a lead agent coordinates); the team.rs library is also orchestration (a TeamLead drives the queue); the bare tokio::spawn fan-out is neither — it is static parallelism without ongoing coordination of any kind.

How does Claude Code agent teams pricing work? Each teammate is a full Claude session consuming its own token budget. The official documentation describes cost as higher than a single session, scaling linearly with team size. Broadcast messages multiply by team size. Targeted teammate-to-teammate messages add tokens to both sending and receiving contexts. No specific multiplier is published.

Can I run AI agents in parallel with Rust? Yes. For flat fan-out (no inter-agent communication needed), the tokio::spawn + Arc<T> pattern is idiomatic. Wrap shared clients in Arc, clone into each spawned task, collect JoinHandles, await results. For coordination (dynamic claiming, messaging, dependencies, retry), use TeamLead::new(n).run(queue, mailbox, shutdown, worker_fn) from research/src/team.rs. The overhead for either is approximately 64 bytes per task and sub-microsecond spawn latency — the team.rs coordination layer is in-process Rust with no additional cost.
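The fan-out shape described above can be sketched without a runtime. In this hypothetical, dependency-free sketch, std::thread stands in for tokio::spawn and run_agent is an illustrative stub for an LLM call; only the Arc-clone-per-task pattern is the point:

```rust
use std::sync::Arc;
use std::thread;

// Stand-in for a shared API client (hypothetical; not the agent.rs type).
struct Client {
    name: &'static str,
}

// Stub for an LLM call, so the sketch runs offline.
fn run_agent(client: &Client, topic: &str) -> String {
    format!("{} researched {}", client.name, topic)
}

fn main() {
    // Wrap the shared client in Arc, clone it into each spawned task.
    let client = Arc::new(Client { name: "deepseek" });
    let topics = ["retrieval", "planning", "tool-use"];

    let handles: Vec<_> = topics
        .into_iter()
        .map(|topic| {
            let client = Arc::clone(&client);
            thread::spawn(move || run_agent(&client, topic))
        })
        .collect();

    // Collect JoinHandles, then join (await) the results in order.
    let results: Vec<String> = handles.into_iter().map(|h| h.join().unwrap()).collect();
    assert_eq!(results.len(), 3);
    assert!(results[0].contains("retrieval"));
}
```

The async version has the same structure: swap thread::spawn for tokio::spawn and join for .await on the JoinHandles.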

How do I implement inter-agent messaging in Rust? Use a shared Mailbox: a Mutex<HashMap<String, VecDeque<Envelope>>> with a Notify for wake-up. Workers call mailbox.send(from, to, subject, body) to deposit messages into named inboxes; receivers call recv_wait(inbox) to block until a message arrives. For broadcast (send to all peers simultaneously), pass &ctx.peer_ids as recipients. Worker addresses (peer_ids) are pre-computed by TeamLead::run() so every worker can address peers directly without going through the lead.
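That mailbox shape can be sketched with the standard library alone. In this hypothetical sketch a Condvar stands in for Tokio's Notify, and Envelope is reduced to a (from, subject, body) tuple; the names mirror the description above, not the team.rs source:

```rust
use std::collections::{HashMap, VecDeque};
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

// Simplified envelope: (from, subject, body).
type Envelope = (String, String, String);

#[derive(Default)]
struct Mailbox {
    // Named inboxes behind a single lock.
    inboxes: Mutex<HashMap<String, VecDeque<Envelope>>>,
    // Wake-up signal for blocked receivers (Notify in the async version).
    wakeup: Condvar,
}

impl Mailbox {
    fn send(&self, from: &str, to: &str, subject: &str, body: &str) {
        let mut inboxes = self.inboxes.lock().unwrap();
        inboxes
            .entry(to.to_string())
            .or_default()
            .push_back((from.into(), subject.into(), body.into()));
        self.wakeup.notify_all();
    }

    // Block until a message is available in the named inbox.
    fn recv_wait(&self, inbox: &str) -> Envelope {
        let mut inboxes = self.inboxes.lock().unwrap();
        loop {
            if let Some(env) = inboxes.get_mut(inbox).and_then(|q| q.pop_front()) {
                return env;
            }
            // Releases the lock while waiting; re-checks on wake-up.
            inboxes = self.wakeup.wait(inboxes).unwrap();
        }
    }
}

fn main() {
    let mailbox = Arc::new(Mailbox::default());
    let receiver = {
        let mb = Arc::clone(&mailbox);
        thread::spawn(move || mb.recv_wait("findings:topic-a"))
    };
    mailbox.send("worker-1", "findings:topic-a", "paper-findings", "three papers found");
    let (from, subject, body) = receiver.join().unwrap();
    assert_eq!(from, "worker-1");
    assert_eq!(subject, "paper-findings");
    assert_eq!(body, "three papers found");
}
```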

What is cooperative shutdown in Tokio async Rust? Cooperative shutdown means workers finish their current task before stopping — they are never cancelled mid-flight. In Tokio, implement with watch::channel(false): the lead calls sender.send(true) to signal shutdown; each worker checks *receiver.borrow() between task iterations (not inside task execution). This matches the Claude Code agent-teams behavior where "teammates finish their current request before shutting down." The ShutdownToken / shutdown_pair() pattern in team.rs is a direct implementation of this.
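The between-iterations check can be sketched with the standard library. In this hypothetical sketch an AtomicBool stands in for the watch channel, and drain_until_shutdown is an illustrative name, not the team.rs API; the point is that the flag is read between tasks, never inside one, so the current task always runs to completion:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

// Process tasks until the shutdown flag is observed between iterations.
fn drain_until_shutdown(flag: &AtomicBool, max_tasks: u32) -> u32 {
    let mut completed = 0;
    for _task in 0..max_tasks {
        if flag.load(Ordering::Relaxed) {
            break; // stop between tasks, never mid-flight
        }
        completed += 1; // stand-in for running one task to completion
    }
    completed
}

fn main() {
    let shutdown = Arc::new(AtomicBool::new(false));
    let flag = Arc::clone(&shutdown);
    let worker = thread::spawn(move || drain_until_shutdown(&flag, 1_000_000));

    // Lead signals shutdown; the worker finishes its current task first.
    shutdown.store(true, Ordering::Relaxed);
    let completed = worker.join().unwrap();
    assert!(completed <= 1_000_000);
}
```

The watch-channel version replaces the flag load with *receiver.borrow() at the same point in the loop.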

How do I implement task dependencies in an async task queue? Store tasks as HashMap<TaskId, TaskEntry> behind a Mutex. Each TaskEntry has a depends_on: Vec<TaskId> field. claim() locks the queue, computes the set of completed IDs, and picks the lowest-ID pending task whose all dependencies are in that set. On complete() or fail(), call notify.notify_waiters() to wake idle workers blocked on queue.notify_handle().notified().await. Workers that go idle call tokio::select! on the notify handle and a poll timeout, then re-attempt claim() on wake.
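The claim rule itself is a few lines. In this hypothetical, synchronous sketch (TaskEntry and claim_next are illustrative names, and the Mutex/Notify machinery is omitted), claim_next picks the lowest-ID pending task whose dependencies are all completed:

```rust
use std::collections::{HashMap, HashSet};

#[derive(Clone, PartialEq)]
enum Status {
    Pending,
    Completed,
}

struct TaskEntry {
    status: Status,
    depends_on: Vec<u64>,
}

// Lowest-ID pending task whose every dependency is completed, or None.
fn claim_next(tasks: &HashMap<u64, TaskEntry>) -> Option<u64> {
    let completed: HashSet<u64> = tasks
        .iter()
        .filter(|(_, t)| t.status == Status::Completed)
        .map(|(id, _)| *id)
        .collect();
    tasks
        .iter()
        .filter(|(_, t)| {
            t.status == Status::Pending
                && t.depends_on.iter().all(|d| completed.contains(d))
        })
        .map(|(id, _)| *id)
        .min()
}

fn main() {
    let mut tasks = HashMap::new();
    tasks.insert(1, TaskEntry { status: Status::Pending, depends_on: vec![] });
    tasks.insert(2, TaskEntry { status: Status::Pending, depends_on: vec![1] });

    // Task 2 is blocked until task 1 completes.
    assert_eq!(claim_next(&tasks), Some(1));
    tasks.get_mut(&1).unwrap().status = Status::Completed;
    assert_eq!(claim_next(&tasks), Some(2));
}
```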

What is a plan approval gate in multi-agent systems? A plan gate is a synchronization point where a worker submits its plan and blocks until the lead approves or rejects it — used to give a human or lead agent a chance to review before the worker makes irreversible changes. In Rust, implement with Mutex<HashMap<worker_id, oneshot::Sender<PlanDecision>>>: the worker calls submit_and_wait(plan) which inserts a oneshot channel sender and awaits the receiver. The lead calls approve(worker_id) or reject(worker_id, feedback), which sends on the channel and unblocks the worker. PlanGate in team.rs is a direct implementation.
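The blocking handshake can be sketched with the standard library. In this hypothetical sketch std::sync::mpsc stands in for tokio's oneshot; PlanGate and submit_and_wait mirror the description above, not the team.rs source:

```rust
use std::collections::HashMap;
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

enum PlanDecision {
    Approved,
    Rejected(String),
}

#[derive(Default)]
struct PlanGate {
    // worker_id -> channel on which the lead will send the decision.
    pending: Mutex<HashMap<String, mpsc::Sender<PlanDecision>>>,
}

impl PlanGate {
    // Worker side: register a channel and block until the lead decides.
    fn submit_and_wait(&self, worker_id: &str, _plan: &str) -> PlanDecision {
        let (tx, rx) = mpsc::channel();
        self.pending.lock().unwrap().insert(worker_id.to_string(), tx);
        rx.recv().unwrap()
    }

    // Lead side: unblock the worker with an approval.
    fn approve(&self, worker_id: &str) {
        if let Some(tx) = self.pending.lock().unwrap().remove(worker_id) {
            let _ = tx.send(PlanDecision::Approved);
        }
    }
}

fn main() {
    let gate = Arc::new(PlanGate::default());
    let worker = {
        let gate = Arc::clone(&gate);
        thread::spawn(move || gate.submit_and_wait("worker-1", "refactor the resolver"))
    };
    // Wait until the worker has registered its channel, then approve.
    while !gate.pending.lock().unwrap().contains_key("worker-1") {
        thread::yield_now();
    }
    gate.approve("worker-1");
    assert!(matches!(worker.join().unwrap(), PlanDecision::Approved));
}
```

A reject path would send PlanDecision::Rejected(feedback) on the same channel.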

What is DeepSeek's tool use API? DeepSeek's tool use (function calling) is an OpenAI-compatible API feature where the model returns structured tool_calls JSON when it needs external data. The caller executes the requested function, appends the result as a tool message, and calls the API again. This repeats until finish_reason == "stop". The agent.rs loop implements this directly in Rust without a framework dependency.
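The loop's control flow can be sketched with the model stubbed by a closure so it runs offline. ModelTurn and run_tool are illustrative names, not the agent.rs API; the structure — execute the requested tool, append its result, call the model again until it stops — is the point:

```rust
// One model response: either a tool request or a final answer
// (finish_reason == "stop" in the real API).
enum ModelTurn {
    ToolCall { name: String, args: String },
    Stop { answer: String },
}

// Stub tool executor, so the sketch needs no network access.
fn run_tool(name: &str, args: &str) -> String {
    format!("result of {name}({args})")
}

fn agent_loop(mut model: impl FnMut(&[String]) -> ModelTurn) -> String {
    let mut messages = vec!["user: find papers on agentic coding".to_string()];
    loop {
        match model(&messages) {
            ModelTurn::ToolCall { name, args } => {
                // Execute the requested function, append the result as a
                // tool message, then call the model again.
                let result = run_tool(&name, &args);
                messages.push(format!("tool: {result}"));
            }
            ModelTurn::Stop { answer } => return answer,
        }
    }
}

fn main() {
    // Stub model: one tool call, then stop.
    let mut turn = 0;
    let answer = agent_loop(move |_msgs: &[String]| {
        turn += 1;
        if turn == 1 {
            ModelTurn::ToolCall { name: "SearchPapers".into(), args: "agentic coding".into() }
        } else {
            ModelTurn::Stop { answer: "done".into() }
        }
    });
    assert_eq!(answer, "done");
}
```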

When should I use a multi-agent system instead of a single agent? When the task exceeds what a single context window can reliably hold, when subtasks can be parallelized for speed, or when different subtasks benefit from different system prompts or tool sets. Multi-agent overhead is only justified when the task structure genuinely benefits from it — for single-context tasks, a well-prompted single agent is faster and cheaper.

What Rust crates support async LLM agents? The rig crate from 0xPlaygrounds is the most actively maintained Rust LLM agent framework (supports OpenAI, Anthropic, Cohere, and others). async_openai provides lower-level async bindings. The research crate implements its own thin client (agent.rs) against the DeepSeek API directly, plus a full coordination layer (team.rs) — a valid approach when framework overhead outweighs the convenience.


Code samples are taken from research/src/agent.rs, research/src/study.rs, and research/src/team.rs and lightly condensed for readability; no logic has been altered.

The Two-Layer Model That Separates AI Teams That Ship from Those That Demo

· 53 min read
Vadim Nicolai
Senior Software Engineer

In February 2024, a Canadian court ruled that Air Canada was liable for a refund policy its chatbot had invented. The policy did not exist in any document. The bot generated it from parametric memory, presented it as fact, a passenger relied on it, and the airline refused to honor it. The tribunal concluded it did not matter whether the policy came from a static page or a chatbot — it was on Air Canada's website and Air Canada was responsible. The chatbot was removed. Total cost: legal proceedings, compensation, reputational damage, and the permanent loss of customer trust in a support channel the company had invested in building.

This was not a model failure. GPT-class models producing plausible-sounding but false information is a known, documented behavior. It was a process failure: the team built a customer-facing system without a grounding policy, without an abstain path, and without any mechanism to verify that the bot's outputs corresponded to real company policy. Every one of those gaps maps directly to a meta approach this article covers.

In 2025, a multi-agent LangChain setup entered a recursive loop and made 47,000 API calls in six hours. Cost: $47,000+. There were no rate limits, no cost alerts, no circuit breakers. The team discovered the problem by checking their billing dashboard.

These are not edge cases. A January 2025 Mount Sinai study found leading AI chatbots hallucinated on 50–82.7% of fictional medical scenarios — GPT-4o's best-case error rate was 53%. Forty-seven percent of enterprise AI users admitted making at least one major business decision based on hallucinated content in 2024. Gartner estimates only 5% of GenAI pilots achieve rapid revenue acceleration. MIT research puts the fraction of enterprise AI demos that reach production-grade reliability at approximately 5%. The average prototype-to-production gap: eight months of engineering effort that often ends in rollback or permanent demo-mode operation.

The gap between a working demo and a production-grade AI system is not a technical gap. It is a strategic one. Teams that ship adopt a coherent set of meta approaches — architectural postures that define what the system fundamentally guarantees — before they choose frameworks, models, or methods. Teams that demo have the methods without the meta approaches.

This article gives you both layers, shows how they map to each other, examines the real-world failures that happen when each is ignored, and explains exactly how to start activating eval-first development and each of the other approaches in your system today.

Industry Context (2025)

McKinsey reports 78% of organizations now use AI in at least one business function — up from 55% twelve months prior. Databricks found organizations put 11x more models into production year-over-year. Yet MIT research finds only 5% of GenAI pilots achieve rapid revenue acceleration. The gap is almost always strategic, not technical. Enterprise LLM spend reached $8.4 billion in H1 2025 alone, with approximately 40% of enterprises now spending $250,000+ per year on LLM infrastructure.

5 Meta-Tools, 0 Ad-Hoc Edits: Structured Code Repair with AI Agents

· 9 min read
Vadim Nicolai
Senior Software Engineer

There's a difference between an AI that can edit code and an AI that can repair code. Editing is mechanical — find a string, replace it. Repair requires understanding what's broken, why it's broken, and what the minimal fix looks like within the constraints of an existing codebase.

The Code Improver is the fourth agent in our six-agent autonomous self-improvement pipeline for nomadically.work. It's the only agent that writes application code. The Trajectory Miner finds patterns, the Codebase Auditor diagnoses issues, and the Skill Evolver improves instructions — but the Code Improver is the one that actually opens files and changes them.

Five research papers informed its design, curated from the VoltAgent/awesome-ai-agent-papers collection. The central insight across all of them: structured repair workflows outperform ad-hoc fixing.

Note: The implementation has since evolved from a generic code improver into a goal-driven "Skill Optimizer" focused on AI/ML skill taxonomy, extraction, and matching for the job search pipeline. The research principles described here still underpin the architecture. The meta-tools and scope constraints below reflect the original design that these papers informed.

The Difference Between Editing and Repairing Code

Ask any AI coding agent to "fix the N+1 query in this resolver" and you'll get an edit. Maybe it's correct. Maybe it breaks the DataLoader contract. Maybe it fixes the symptom but introduces a new query pattern that's worse. The SWE-bench benchmark shows even the best agents resolve only 21% of real-world software engineering tasks — and that's with well-defined problem statements.

The issue isn't capability. It's methodology. An ad-hoc edit is a guess informed by context. A structured repair is a workflow: observe the code, analyze the pattern, apply a proven fix template, verify the result. The difference is the same as between a developer who "tries stuff until it works" and one who follows a diagnostic procedure.

Five Papers, One Principle: Structured Workflows Outperform Ad-Hoc Fixing

Agyn: Team-Based Autonomous Software Engineering

Agyn (Benkovich and Valkov, 2026) demonstrates that decomposing software engineering into specialized agent roles improves outcomes compared to a single monolithic agent. The paper defines roles like architect, developer, tester, and reviewer, each with distinct responsibilities and communication protocols.

How we used it: The Code Improver exists because of Agyn's principle of role specialization. In our pipeline, the roles are distributed across six agents. The Code Improver has a precisely defined scope:

CAN edit: Application source code (src/, workers/, scripts/), GraphQL schema files, database schema, tests, worker code.

CANNOT edit: Skill files, commands, hooks, CLAUDE.md, optimization strategy — that's the Skill Evolver's job.

This boundary prevents role confusion. When the Code Improver encounters an issue rooted in bad instructions rather than bad code, it skips it with an explanation rather than trying to fix both. The audit report's fix_strategy field tells it whether a finding belongs to improve-apply (code) or improve-evolve (skills).

Team of Rivals: Specialized Roles with Code Executor

Team of Rivals (Vijayaraghavan et al., 2026) extends the multi-agent approach by introducing adversarial dynamics. Rather than agents cooperating blindly, they challenge each other's approaches. A code executor validates proposals by actually running them, creating a tight feedback loop between proposal and verification.

How we used it: While we don't implement adversarial dynamics within the Code Improver itself, the pipeline structure achieves the same effect. The Code Improver proposes fixes. The Verification Gate challenges them — running pnpm lint, pnpm build, checking conventions, looking for regressions. If the Verification Gate rejects a change, the Meta-Optimizer investigates. The "rivalry" is structural rather than within a single agent.

TraceCoder: Observe-Analyze-Repair

TraceCoder (Huang et al., 2026) formalizes the three-phase approach to code debugging. The Codebase Auditor already implements the observe and analyze phases. The Code Improver completes the loop with the repair phase.

How we used it: The Code Improver's process explicitly follows Observe-Analyze-Repair:

Observe: For every finding, read the audit report entry, read the affected files, read related files (imports, callers, tests), and understand the existing pattern.

Analyze: Determine the minimal change. Does it need a new function or can existing code be modified? Does it affect the public API? Are there tests that need updating?

Repair: Implement the fix following all CLAUDE.md conventions.

The "read before write" rule (Rule 1: "NEVER implement a finding without reading the code first") directly encodes TraceCoder's insight that agents produce better repairs when they observe thoroughly before acting.

Optimizing Agentic Workflows: Meta-Tools for Recurring Patterns

"Optimizing Agentic Workflows" (Chen et al., 2025) proposes meta-tools — reusable workflow templates for common agent operations. Rather than having agents figure out each task from scratch, meta-tools provide step-by-step procedures for recurring patterns.

How we used it: This paper directly inspired the five Workflow Meta-Tools — standardized repair procedures for the most common fix types:

Fix N+1 Query

  1. Read the resolver making per-item DB calls
  2. Check if a DataLoader exists in context.loaders
  3. If not, create a batch-loading DataLoader
  4. Replace direct DB call with context.loaders.X.load(id)

In our codebase, N+1 queries are a known issue — skills, company, and ATS board sub-fields all lack DataLoaders. This meta-tool standardizes the fix pattern.

Fix Type Safety

  1. Find the any type usage
  2. Trace the actual type (from schema, DB, or generated types)
  3. Replace any with the correct type
  4. Fix downstream type errors

With 283+ any types in resolvers, this is a high-frequency repair. The meta-tool ensures each fix follows the same approach: trace the type from @/__generated__/resolvers-types.ts rather than guessing.

Fix Performance (Full Table Scan)

  1. Read the query
  2. Add appropriate WHERE clause or index
  3. Verify with the correct Drizzle filter
  4. Check with EXPLAIN if possible

The enhance-job.ts resolver that fetches all jobs to find one by external_id is the canonical example.

Fix Dead Code

  1. Verify it's truly unused (Grep for imports/references)
  2. Remove the code
  3. Remove now-unused imports
  4. Don't leave "removed" comments

The @libsql/client and pg dependencies after the D1 migration are prime candidates.

Fix Security

  1. Add admin guard if missing (isAdminEmail() from src/lib/admin.ts)
  2. Replace raw SQL with Drizzle ORM
  3. Add input validation at system boundaries
  4. Never log secrets

PatchIsland: Diverse LLM Agent Ensemble for Repair

PatchIsland (2026) uses multiple LLM agents with different strategies to generate diverse repair candidates, then selects the best one. The insight is that different models and prompting strategies produce different types of fixes, and an ensemble approach catches issues that any single approach would miss.

How we used it: While we don't run multiple models in parallel (cost constraints), the PatchIsland principle shows up in two ways. First, the confidence score (0.0-1.0) on each implementation reflects the agent's self-assessed certainty — findings with confidence below 0.7 get flagged for human review, acknowledging that a single agent's fix might not be optimal. Second, the meta-tool approach itself provides "strategy diversity" — the same agent uses different repair procedures for different fix types rather than applying a one-size-fits-all approach.

Post-Implementation Discipline

The Code Improver doesn't just write code and move on. After making changes:

  1. If GraphQL schema was modified → run pnpm codegen
  2. If DB schema was modified → run pnpm db:generate
  3. Run pnpm lint for every change
  4. Run pnpm build for significant changes
  5. Re-read modified files to confirm correctness
  6. Check that imports are valid and no regressions exist in nearby code

This post-implementation checklist exists because early versions of the agent would make correct edits that broke unrelated code through import changes or type propagation. The checklist catches these cascade failures before the Verification Gate even sees the changes.

Why Zero Ad-Hoc Edits Matters

The Code Improver's maximum of 10 findings per run isn't a limitation — it's a design choice. Quality over quantity. Each fix must be traceable to an audit finding, implemented with a standardized meta-tool when applicable, verified by the agent itself, and then verified again by the Verification Gate.

This structured approach produces something that ad-hoc AI code editing cannot: confidence. When the Code Improver reports a fix with confidence 0.95, it means the finding was diagnosed by the Codebase Auditor, the fix follows a proven meta-tool pattern, the lint passed, and the agent re-read its own output and confirmed correctness. That's not perfection, but it's a lot closer to how senior engineers work than "ask GPT to fix it."

References

  1. Benkovich, N. and Valkov, V. "Agyn: A Multi-Agent System for Team-Based Autonomous Software Engineering." arXiv preprint, 2026. https://arxiv.org/abs/2602.01465

  2. Vijayaraghavan, G., et al. "If You Want Coherence, Orchestrate a Team of Rivals: Multi-Agent Models of Organizational Intelligence." arXiv preprint, 2026. https://arxiv.org/abs/2601.14351

  3. Huang, J., et al. "TraceCoder: A Trace-Driven Multi-Agent Framework for Automated Debugging of LLM-Generated Code." ICSE 2026. https://conf.researchr.org/details/icse-2026/icse-2026-research-track/145/

  4. "Optimizing Agentic Workflows using Meta-tools." arXiv preprint, 2026. https://arxiv.org/abs/2601.22037

  5. "PatchIsland: Orchestration of LLM Agents for Continuous Vulnerability Repair." arXiv preprint, 2026. https://arxiv.org/abs/2601.17471


This article is part of a six-part series on building autonomous self-improvement agents, grounded in research from VoltAgent/awesome-ai-agent-papers. Data and implementation details from nomadically.work.

Your Linter Can't Trace Execution Paths. This Agent Can.

· 9 min read
Vadim Nicolai
Senior Software Engineer

Static analysis tools find pattern violations. Linters catch style issues. But neither traces an N+1 query from a GraphQL resolver through a DataLoader absence to a frontend performance degradation. That requires understanding execution paths — and that's what the Codebase Auditor does.

The Codebase Auditor is the second agent in our six-agent autonomous self-improvement pipeline for nomadically.work. It receives pattern IDs from the Trajectory Miner, investigates the actual code exhaustively, and produces findings with exact file:line references. It never modifies code — it only reads and reports.

Four research papers shaped its design, curated from the VoltAgent/awesome-ai-agent-papers collection. Here is how each one translated into practice.

Note: The implementation has since evolved from a generic codebase auditor into a goal-driven "Discovery Expander" focused on finding more AI engineering companies and job boards. The research principles described here still underpin the architecture. The playbooks and data structures below reflect the original design that these papers informed.

What Linters Miss

ESLint will tell you about an unused variable. SonarQube will flag a code smell. Neither will tell you that your enhance-job.ts resolver fetches every job in the database to find one by external_id — a full table scan that degrades with every job you add. Neither will trace a missing admin guard from mutation definition through resolver to production exposure. The gap between pattern matching and execution path tracing is where real bugs live.

Google's 2025 DORA Report found that AI adoption has reached roughly 90% of developers, correlating with a 91% increase in code review time. More AI-generated code means more code to review — and linters aren't scaling to meet the need. What's needed isn't a faster linter. It's an agent that investigates like a senior engineer: following imports, reading callers, tracing data flow across layers.

Four Papers, One Auditor

TraceCoder: Observe-Analyze-Repair

TraceCoder (Huang et al., 2025) introduces a three-phase loop for LLM-based code debugging: observe the code and its runtime behavior, analyze what's wrong, and propose a repair. The paper demonstrates that separating observation from analysis significantly improves debugging accuracy — agents that jump to conclusions before thoroughly reading the code produce worse fixes.

The critical insight is that LLM agents tend to pattern-match on surface-level code features rather than tracing actual execution paths. TraceCoder forces a structured observation phase that must complete before analysis begins.

How we used it: The Codebase Auditor's first two phases directly implement TraceCoder's observe-analyze loop (we omit the repair phase — that's the Code Improver's job):

Observe Phase: For each target area, the agent systematically reads the primary files, traces imports and dependencies, reads related tests, checks the schema and types that constrain the code, and looks for TODO/FIXME/HACK comments. This isn't a keyword search — it's following the call chain.

Analyze Phase: Each observation gets classified into a structured finding with type, severity, location, evidence (the actual code snippet), root cause, impact, fix strategy, and a confidence score from 0.0 to 1.0.

The separation matters. In early prototypes without this structure, the agent would read a file, spot an any type, immediately suggest a fix, and miss that the any was masking a deeper schema mismatch three files away.

TrajAD: Trajectory Anomaly Detection

TrajAD (Pathak et al., 2025) is a specialized verifier that detects errors in LLM agent trajectories — sequences of actions that went wrong. Rather than evaluating individual actions, it evaluates the trajectory as a whole, catching errors that look reasonable in isolation but form a problematic pattern in sequence.

How we used it: TrajAD's approach directly maps to the Analyze Phase's classification system. Each finding is not just a point-in-time observation but includes links to mining report patterns (related_patterns: ["P-xxx"]), creating a trajectory view. The confidence score (0.0-1.0) is a direct application of TrajAD's anomaly scoring — findings below 0.7 confidence must be flagged as "needs verification."

This prevents the auditor from reporting false positives with false certainty. When the agent reads code and isn't sure whether something is actually a bug or an intentional design choice, it says so.
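
The finding structure and the 0.7 threshold can be sketched together; the field names mirror the prose, but the exact schema is an assumption:

```typescript
// Sketch of the structured finding from the Analyze Phase.
interface Finding {
  id: string;                 // e.g. "F-001"
  type: string;
  severity: "low" | "medium" | "high" | "critical";
  location: string;           // exact file:line reference
  evidence: string;           // the actual code snippet
  rootCause: string;
  impact: string;
  fixStrategy: string;
  confidence: number;         // TrajAD-style anomaly score, 0.0 to 1.0
}

// Findings below 0.7 confidence are flagged as "needs verification" rather than asserted.
function needsVerification(finding: Finding): boolean {
  return finding.confidence < 0.7;
}
```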

Graph-RAG for Codebases: AST-Derived Navigation

"Reliable Graph-RAG for Codebases" (Zhang et al., 2025) proposes using AST-derived knowledge graphs for code understanding. Instead of treating code as flat text, it builds a graph of relationships — function calls, imports, type hierarchies, data flow — and uses this graph to navigate the codebase intelligently.

How we used it: While we don't build an explicit AST graph, the Observe Phase implements the same principle through structured code navigation. The agent doesn't grep for keywords — it follows imports, reads callers, checks type definitions, and traces data flow through resolvers. In a GraphQL codebase like ours, this means following the path from schema definition → resolver → Drizzle query → D1 database.

Architecture-Aware Evaluation

Architecture-Aware Evaluation (Wang et al., 2025) links code findings to architectural components, answering not just "what's wrong" but "which layer is affected and what cascades." A bug in a database query might seem isolated, but if it's in a resolver called by every frontend page, the blast radius is the entire application.

How we used it: The auditor's Architecture Trace phase maps every finding to the project's layers:

{
  "architecture_map": {
    "affected_layers": ["db", "api", "resolver", "frontend", "worker", "agent"],
    "cascade_risks": ["Finding F-001 in resolver affects F-003 in frontend"],
    "systemic_issues": ["Issues that appear across multiple files"]
  }
}

This is especially valuable in our architecture where data flows through multiple layers: D1 Database → Gateway Worker → Drizzle ORM → Apollo Resolver → GraphQL → React Frontend. A type mismatch at the database layer cascades through every layer above it.
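
The cascade principle can be sketched as a function of layer ordering, assuming a simplified version of the layer names above:

```typescript
// Sketch: layers ordered from data to UI, mirroring the
// D1 → Gateway → Drizzle → Apollo → GraphQL → React flow described above.
const LAYERS = ["db", "gateway", "orm", "resolver", "graphql", "frontend"] as const;
type Layer = (typeof LAYERS)[number];

// A finding at one layer potentially affects that layer and every layer above it.
function blastRadius(layer: Layer): Layer[] {
  return LAYERS.slice(LAYERS.indexOf(layer)) as Layer[];
}
```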

The Four Playbooks

The Codebase Auditor includes four investigation playbooks — standardized checklists for common audit types. Each playbook is grounded in real issues we've encountered in the nomadically.work codebase.

Performance Playbook

  1. Search for N+1 query patterns — resolvers calling the database inside field resolvers without DataLoaders
  2. Check for missing indexes on frequently-queried columns
  3. Look for full table scans — SELECT * without WHERE
  4. Find unbounded queries — no LIMIT
  5. Check for synchronous operations that could be parallel

Real example: The enhance-job.ts resolver fetches all jobs to find one by external_id — a full table scan documented in CLAUDE.md's known issues. The playbook catches this systematically rather than stumbling on it.
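
The anti-pattern can be sketched abstractly; `store` here is a hypothetical stand-in for the project's data layer, not its actual API:

```typescript
interface Job { externalId: string; title: string }

// Anti-pattern the playbook catches: fetch everything, filter in application code.
async function findJobSlow(store: { allJobs(): Promise<Job[]> }, id: string) {
  const jobs = await store.allJobs();               // full table scan
  return jobs.find((j) => j.externalId === id);     // filtering after the fact
}

// The fix: push the lookup into the database, where an index can serve it.
async function findJobFast(store: { jobByExternalId(id: string): Promise<Job | undefined> }, id: string) {
  return store.jobByExternalId(id);
}
```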

Type Safety Playbook

  1. Grep for any type usage in resolvers and agents
  2. Check for missing null checks on nullable DB columns
  3. Verify GraphQL resolver return types match schema
  4. Look for unchecked JSON.parse calls
  5. Check for D1 boolean coercion issues (0/1 vs true/false)

Real example: The project has 283+ any types in resolvers. D1 returns 0/1 for SQLite integers while GraphQL expects true/false — the playbook specifically checks for this D1-specific gotcha.
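
A small coercion helper at the resolver boundary, sketched here, avoids the D1 gotcha; the function name and placement are assumptions:

```typescript
// SQLite (via D1) stores booleans as 0/1 integers, while the GraphQL schema
// expects true/false. Coerce once at the boundary instead of ad hoc everywhere.
function coerceD1Boolean(value: number | boolean | null): boolean | null {
  if (value === null) return null;
  return typeof value === "number" ? value !== 0 : value;
}
```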

Security Playbook

  1. Verify admin guards on all mutations
  2. Check for SQL injection vectors (raw SQL strings)
  3. Look for exposed secrets or API keys in code
  4. Check CORS configuration
  5. Verify input validation on API routes

Real example: The D1 Gateway Worker has CORS: * — a known issue. The playbook ensures every mutation includes the isAdminEmail() guard from src/lib/admin.ts.
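
The guard pattern the playbook verifies might look like the following sketch; the real check lives in src/lib/admin.ts, and these signatures are assumptions:

```typescript
// Wrap a mutation resolver so it rejects non-admin callers before doing any work.
function requireAdmin<TArgs, TResult>(
  isAdminEmail: (email: string) => boolean,
  resolve: (args: TArgs) => TResult,
): (args: TArgs, ctx: { userEmail: string }) => TResult {
  return (args, ctx) => {
    if (!isAdminEmail(ctx.userEmail)) {
      throw new Error("Forbidden: admin access required");
    }
    return resolve(args);
  };
}
```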

Dead Code Playbook

  1. Find exports with no importers
  2. Find files with no references
  3. Check for unused dependencies in package.json
  4. Look for commented-out code blocks
  5. Find TODO comments older than current patterns

Real example: @libsql/client and pg dependencies remain in package.json after the D1 migration — likely unused. The insert-jobs worker still references Turso in its help text. The playbook catches stale code that humans overlook because it doesn't break anything.

Cross-Referencing Known Issues

A subtle but important feature: the auditor reads CLAUDE.md's "Known issues" section before reporting. It won't re-report issues that are already documented unless it has new information — such as discovering the issue is worse than documented or finding the root cause of a known symptom.

This prevents the improvement pipeline from generating redundant work. The team already knows about CORS * on the gateway. The auditor's job is to find what the team doesn't know.

Read-Only by Design

Traditional static analysis operates at the syntax level. Code review by humans operates at the understanding level but doesn't scale. The Codebase Auditor sits between these — it traces execution paths like a human reviewer but does so systematically across the entire codebase, guided by playbooks that encode institutional knowledge about what to look for.

The read-only constraint is fundamental. By never modifying code, the auditor can be aggressive in its investigation without risk. It can report 20 findings per audit (its configured limit), each with confidence scores, and let the downstream Code Improver decide which ones to actually fix. This separation of diagnosis from treatment mirrors how senior engineers work: the person who identifies the problem isn't always the person who fixes it.

Your linter will keep catching semicolons. This agent will keep tracing the execution paths where the real problems hide.

References

  1. Huang, J., et al. "TraceCoder: A Trace-Driven Multi-Agent Framework for Automated Debugging of LLM-Generated Code." ICSE 2026. https://conf.researchr.org/details/icse-2026/icse-2026-research-track/145/

  2. Pathak, D., et al. "Detecting Silent Failures in Multi-Agentic AI Trajectories." arXiv preprint, 2025. https://arxiv.org/abs/2511.04032

  3. "Reliable Graph-RAG for Codebases: AST-Derived Graphs vs LLM-Extracted Knowledge Graphs." arXiv preprint, 2026. https://arxiv.org/abs/2601.08773

  4. "Toward Architecture-Aware Evaluation Metrics for LLM Agents." arXiv preprint, 2026. https://arxiv.org/abs/2601.19583


This article is part of a six-part series on building autonomous self-improvement agents, grounded in research from VoltAgent/awesome-ai-agent-papers. Data and implementation details from nomadically.work.

We Built a Strategic Brain for Our AI Pipeline. Here's What It Learned.

· 10 min read
Vadim Nicolai
Senior Software Engineer

Five agents in our pipeline know how to mine patterns, audit code, evolve skills, fix bugs, and verify changes. None of them knows when to do any of those things. That is the Meta-Optimizer's job.

The Meta-Optimizer is the sixth and final agent in our autonomous self-improvement pipeline for nomadically.work. It is the strategic brain: it reads all reports from other agents, determines the current phase of the system, creates prioritized action plans, and enforces safety constraints. It never edits code or skills directly. It only decides what should happen next.

Six research papers shaped its design. Together, they address the hardest problem in autonomous improvement: knowing when to improve, when to stop, and when to call for help.

Note: The implementation has since evolved from a generic code-improvement coordinator into a goal-driven job-search optimizer ("Strategy Brain"). The research principles described here still underpin the architecture, but the phase names, decision tables, and state schema have changed to reflect domain-specific priorities. The code snippets below reflect the original design that these papers informed.

Five Agents, Zero Coordination

Gartner reported a 1,445% surge in multi-agent system inquiries from Q1 2024 to Q2 2025. Everyone is building multi-agent systems. Almost nobody is building the coordinator. The result: agent pipelines that run every agent on every cycle, regardless of what the system actually needs. Mining reports pile up unread. Audits repeat areas already audited. Fixes conflict with other fixes. The agents are individually capable but collectively aimless.

The Meta-Optimizer exists because five specialist agents without coordination produce churn, not improvement. By February 2026, frontier models crossed 14.5 hours of autonomous work — but autonomy without strategy is just expensive compute.

Six Papers That Shaped the Strategic Brain

ROMA: Recursive Task Decomposition

ROMA (Li et al., 2026) proposes a recursive framework for decomposing complex tasks into parallel subtrees that multiple agents can work on simultaneously. The key insight: not all subtasks have equal priority or dependencies. ROMA's recursive decomposition respects these constraints while maximizing parallelism.

How we used it: The Meta-Optimizer's action plan is a ROMA-style decomposition:

ACTION_PLAN: {
  phase: "IMPROVEMENT|SATURATION|COLLAPSE_RISK",
  actions: [
    {
      priority: 1,
      agent: "improve-mine|improve-audit|improve-evolve|improve-apply|improve-verify",
      task: "Specific task description",
      inputs: { ... },
      expected_outcome: "...",
      cost_estimate: "low|medium|high",
      risk_level: "low|medium|high"
    }
  ],
  deferred: [...],
  meta_actions: [...]
}

Actions are prioritized, and the orchestrator executes them respecting dependencies. Mining and auditing run in parallel (both are read-only). Evolution and code improvement run in parallel (they target different file scopes). Verification runs after both. The Meta-Optimizer encodes these dependencies in the action plan.
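
The dependency ordering above can be sketched as staged execution; the stage grouping follows the prose, while the orchestration code itself is an assumption:

```typescript
type Agent = "improve-mine" | "improve-audit" | "improve-evolve" | "improve-apply" | "improve-verify";

const STAGES: Agent[][] = [
  ["improve-mine", "improve-audit"],   // both read-only, safe to parallelize
  ["improve-evolve", "improve-apply"], // disjoint file scopes
  ["improve-verify"],                  // runs only after all writes
];

// Parallel within a stage, strictly sequential across stages.
async function runCycle(run: (agent: Agent) => Promise<void>): Promise<void> {
  for (const stage of STAGES) {
    await Promise.all(stage.map(run));
  }
}
```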

DyTopo: Dynamic Topology Rewiring

DyTopo (Zhang et al., 2026) introduces dynamic topology rewiring in multi-agent systems. Instead of fixed agent-to-agent connections, DyTopo adjusts which agents communicate with which others based on the current task. Some tasks need deep collaboration; others need isolation.

How we used it: The Meta-Optimizer implements DyTopo through selective routing. Not every cycle uses every agent. The Decision Framework encodes this:

| Situation | Action |
| --- | --- |
| No mining report exists | Run improve-mine first |
| Mining report exists, no audit | Run improve-audit on top patterns |
| Audit exists, no implementation | Route findings to improve-evolve or improve-apply |
| Changes made, no verification | Run improve-verify |
| Verification REJECT | Investigate rejection, fix or revert |
| Verification ACCEPT | Update meta-state, plan next cycle |
| Same pattern recurring 3+ times | Escalate -- the fix is not working |
| No improvement files exist | Cold start -- wait for sessions to accumulate |
| Score collapse detected | HALT everything, recommend human review |

This is dynamic routing: the topology of agent communication changes based on what state files exist and what their contents show. A cycle might involve all six agents or just two, depending on what the system needs.
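
The decision table reduces to a routing function over state-file presence; the boolean state fields below are assumptions mirroring the situations listed:

```typescript
interface PipelineState {
  scoreCollapse: boolean;
  hasMiningReport: boolean;
  hasAudit: boolean;
  changesMade: boolean;
  verified: boolean;
}

// First matching situation wins; the safety halt outranks everything.
function route(s: PipelineState): string {
  if (s.scoreCollapse) return "HALT";
  if (!s.hasMiningReport) return "improve-mine";
  if (!s.hasAudit) return "improve-audit";
  if (!s.changesMade) return "improve-evolve|improve-apply";
  if (!s.verified) return "improve-verify";
  return "plan-next-cycle";
}
```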

CASTER: Self-Optimization via Negative Feedback

CASTER (Liu et al., 2026) builds a self-optimizing router that improves its routing decisions based on failure signals. When a routed task fails, CASTER adjusts routing weights to avoid similar failures in the future.

How we used it: The Meta-Optimizer tracks pattern recurrence. If pattern P-003 was identified, a fix was applied, and the same pattern appears again in a later mining report, the routing strategy failed. The Meta-Optimizer records this and adjusts: the fix should target architecture instead of code, or the finding needs a Skill Evolver edit instead of a Code Improver fix.

After three recurrences of the same pattern, the Meta-Optimizer escalates with a "need different approach" flag. This prevents the pipeline from applying the same ineffective fix repeatedly -- CASTER's negative feedback applied to the improvement pipeline itself.
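
The recurrence tracking reduces to a counter with an escalation threshold, sketched here under the assumption that patterns are keyed by ID:

```typescript
// Record one more sighting of a pattern; the third sighting triggers the
// "need different approach" escalation described above.
function recordRecurrence(counts: Map<string, number>, patternId: string): boolean {
  const seen = (counts.get(patternId) ?? 0) + 1;
  counts.set(patternId, seen);
  return seen >= 3; // true → escalate
}
```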

MonoScale: Safe Scaling with Non-Decreasing Performance

MonoScale (Wang et al., 2026) addresses a persistent problem in scaling multi-agent systems: adding more agents or more iterations does not always improve performance. Sometimes it degrades it. MonoScale provides guarantees that scaling operations produce non-decreasing performance.

How we used it: MonoScale's principle maps directly to our safety constraints:

  • Maximum three code changes per cycle -- prevents churn from excessive modifications
  • Maximum two skill evolutions per cycle -- prevents instruction drift
  • Mandatory verification after any write operation -- catches degradation immediately
  • 10+ files modified without human review triggers a pause -- cumulative change threshold
  • Score collapse (3+ dimensions dropping) halts everything -- the last-resort safeguard

These are not arbitrary limits. They are MonoScale-style bounds that ensure each cycle produces non-negative improvement. A cycle that makes three changes and verifies them is safer than a cycle that makes 20 changes and hopes for the best.
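
The bounds above can be checked in one place before a cycle proceeds; the limits come from the list, while the field names are assumptions:

```typescript
interface CycleStats {
  codeChanges: number;
  skillEvolutions: number;
  cumulativeFilesModified: number;
  dimensionsDropping: number;
}

// A cycle may proceed only while every MonoScale-style bound holds.
function cycleAllowed(stats: CycleStats): boolean {
  return (
    stats.codeChanges <= 3 &&              // max three code changes per cycle
    stats.skillEvolutions <= 2 &&          // max two skill evolutions per cycle
    stats.cumulativeFilesModified < 10 &&  // 10+ files triggers a human-review pause
    stats.dimensionsDropping < 3           // 3+ dropping dimensions halts everything
  );
}
```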

Phase Transition Theory for Multi-Agent Systems

Phase Transition research (Chen et al., 2026) studies how multi-agent systems move between distinct operating regimes. The key finding: systems exhibit three phases -- improvement, saturation, and collapse -- and the transitions between them are predictable from score trends.

How we used it: Phase Detection is the most consequential decision the Meta-Optimizer makes:

IMPROVEMENT phase: Scores are trending up. The system is working. Keep making changes, focusing on high-impact patterns. This is the normal operating mode.

SATURATION phase: Scores are stable. The easy wins are done. Switch to diminishing-returns awareness: focus on untouched areas or architectural improvements. Do not force changes for the sake of activity.

COLLAPSE RISK phase: Scores are dropping after recent changes. Something went wrong. Stop making changes immediately. Investigate regressions. Consider reverting recent changes. Recommend human review.

The phase determines everything downstream. In IMPROVEMENT, the Meta-Optimizer routes aggressively -- mine, audit, evolve, apply, verify. In SATURATION, it routes conservatively -- mine and audit only, looking for new areas to explore. In COLLAPSE RISK, it routes defensively -- verify only, no new changes.
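
Phase detection from the score history might be sketched as a simple trend check; the window and tolerance are illustrative assumptions:

```typescript
type Phase = "IMPROVEMENT" | "SATURATION" | "COLLAPSE_RISK";

// Classify the operating regime from the latest score delta.
function detectPhase(scoreHistory: number[], tolerance = 0.01): Phase {
  if (scoreHistory.length < 2) return "IMPROVEMENT"; // cold start: default mode
  const delta = scoreHistory[scoreHistory.length - 1] - scoreHistory[scoreHistory.length - 2];
  if (delta > tolerance) return "IMPROVEMENT";
  if (delta < -tolerance) return "COLLAPSE_RISK";
  return "SATURATION";
}
```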

Bayesian Orchestration: Cost-Aware Decision Making

Bayesian Orchestration (Kim et al., 2026) applies cost-aware sequential decision-making to multi-LLM workflows. The insight: not all operations cost the same, and an orchestrator should prefer cheaper operations when the expected benefit is small.

How we used it: The Meta-Optimizer maintains a cost hierarchy:

| Operation | Cost | Tools Used |
| --- | --- | --- |
| Mining, auditing | Low | Glob, Grep, Read |
| Skill evolution | Medium | Read, Edit |
| Code improvement | High | Read, Edit, Bash (builds) |
| Verification | High | Bash (lint, build) |

The Meta-Optimizer prefers cheaper actions when the expected improvement is small. A pattern with severity "low" and effort "large" gets deferred in favor of a "high" severity, "small" effort pattern -- even if the large-effort pattern would produce bigger improvement eventually. This prevents the pipeline from spending API tokens on marginal improvements.
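
One way to sketch this ordering is severity divided by effort, so a high-severity/small-effort pattern outranks a low-severity/large-effort one; the numeric weights are illustrative assumptions:

```typescript
const SEVERITY = { low: 1, medium: 2, high: 3 } as const;
const EFFORT = { small: 1, medium: 2, large: 3 } as const;

// Higher score → fix sooner. Cheap, severe fixes dominate expensive, marginal ones.
function priorityScore(p: { severity: keyof typeof SEVERITY; effort: keyof typeof EFFORT }): number {
  return SEVERITY[p.severity] / EFFORT[p.effort];
}
```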

Persistent State

Unlike other agents that produce one-off reports, the Meta-Optimizer maintains persistent state across cycles:

{
  "last_updated": "ISO timestamp",
  "cycle_count": 0,
  "phase": "IMPROVEMENT|SATURATION|COLLAPSE_RISK",
  "score_history": [],
  "patterns_resolved": ["P-001"],
  "patterns_recurring": ["P-003"],
  "files_under_improvement": [],
  "total_files_modified": 0,
  "human_review_needed": false,
  "next_action": "What to do in the next cycle"
}

This state accumulates across sessions. When a new improvement cycle starts, the Meta-Optimizer reads its own history before planning. This prevents it from re-auditing areas that were just audited, re-mining data that was just mined, or applying fixes to files already under active improvement.

When to Ask for Help

The Meta-Optimizer has a human_attention_needed field in its output. It activates when:

  • Score collapse is detected (3+ dimensions dropping)
  • The same pattern has recurred three or more times despite fixes
  • 10 or more cumulative files have been modified without human review
  • The Meta-Optimizer's own action plans keep leading to REJECT verdicts
  • The improvement pipeline itself appears to be degrading

This is the most important feature. An autonomous system that does not know when to stop being autonomous is not safe -- it is unmonitored. The Meta-Optimizer's final rule: "Be conservative -- a stable system is better than a constantly-changing one."
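
The trigger conditions can be sketched as a single predicate; the REJECT-streak threshold is an assumption, since the post gives no number for it:

```typescript
// Escalate to a human when any of the enumerated conditions holds.
function humanAttentionNeeded(s: {
  dimensionsDropping: number;
  maxPatternRecurrence: number;
  filesModifiedSinceReview: number;
  consecutiveRejects: number;
}): boolean {
  return (
    s.dimensionsDropping >= 3 ||          // score collapse
    s.maxPatternRecurrence >= 3 ||        // same pattern despite fixes
    s.filesModifiedSinceReview >= 10 ||   // cumulative change threshold
    s.consecutiveRejects >= 3             // plans keep leading to REJECT (threshold assumed)
  );
}
```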

Here's What It Learned

The Meta-Optimizer addresses the recursive control problem: who controls the controllers? In our pipeline, five agents do work, and the Meta-Optimizer decides what work to do. But who decides if the Meta-Optimizer is making good decisions? The answer is the Meta-Optimizer itself -- through phase detection, score tracking, and the willingness to request human review when its own strategies fail.

This is not a solved problem. The Meta-Optimizer cannot distinguish between "my action plan was wrong" and "the downstream agents executed poorly." But by tracking score trends across cycles rather than evaluating individual cycles, it detects systemic issues: if five consecutive cycles improve scores, the strategy is working. If three consecutive cycles degrade scores, something is wrong at the strategic level.

The improvement pipeline is only as good as its coordinator. An aggressive Meta-Optimizer that routes every finding to immediate fixing produces churn. A passive one that defers everything produces stagnation. The right balance -- informed by Phase Transition theory, constrained by MonoScale bounds, guided by CASTER's negative feedback -- is what makes autonomous self-improvement work in practice.

That's what the strategic brain learned: the hardest decision in an autonomous system isn't what to improve. It's when to stop.

References

  1. Li, H., et al. "ROMA: Recursive Open Meta-Agent Framework for Multi-Agent Systems." arXiv preprint, 2026. https://arxiv.org/abs/2602.01848

  2. Zhang, Y., et al. "DyTopo: Dynamic Topology Routing via Semantic Matching for Multi-Agent Systems." arXiv preprint, 2026. https://arxiv.org/abs/2602.06039

  3. Liu, Y., et al. "CASTER: Context-Aware Strategy for Task Efficient Routing." arXiv preprint, 2026. https://arxiv.org/abs/2601.19793

  4. Wang, J., et al. "MonoScale: Scaling Multi-Agent Systems with Monotonic Improvement Guarantees." arXiv preprint, 2026. https://arxiv.org/abs/2601.23219

  5. Chen, Z., et al. "Phase Transition for Budgeted Multi-Agent Synergy." arXiv preprint, 2026. https://arxiv.org/abs/2601.17311

  6. Kim, S., et al. "Bayesian Orchestration: Cost-Aware Sequential Decision-Making for Multi-LLM Workflows." arXiv preprint, 2026. https://arxiv.org/abs/2601.01522


This article is part of a six-part series on building autonomous self-improvement agents, grounded in research from VoltAgent/awesome-ai-agent-papers. Data and implementation details from nomadically.work.

How We Built an Agent That Edits Its Own Instructions

· 9 min read
Vadim Nicolai
Senior Software Engineer

Most AI systems have a hard boundary between the instructions they follow and the work they do. Developers write prompts; the AI executes them. If the prompts are wrong, a human fixes them. We built an agent that fixes its own prompts.

The Skill Evolver is the third agent in our six-agent autonomous self-improvement pipeline for nomadically.work. Its scope is precisely defined: it can edit skill files, commands, hooks, CLAUDE.md, and memory files. It cannot touch application source code — that's the Code Improver's job. This agent improves the instructions that all other agents follow.

Five research papers informed its design, curated from the VoltAgent/awesome-ai-agent-papers collection. Each one solved a different aspect of the self-modification problem.

Note: The implementation has since evolved from a generic skill evolver into a goal-driven "Classifier Tuner" focused on reducing false negatives in remote EU job classification. The research principles described here still underpin the architecture. The data structures and anti-patterns below reflect the original design that these papers informed.

The Self-Modification Problem

The idea of AI editing its own prompts sounds either brilliant or terrifying, depending on your perspective. The Awesome Self-Evolving Agents survey catalogs dozens of papers on the topic, but almost none show a working implementation with safety constraints. The gap between "agents that can self-modify" and "agents that should self-modify" is where the interesting engineering lives.

The Skill Evolver's scope is deliberately narrow: it can edit Markdown skill files, commands, hooks, and memory files. It cannot touch application source code. This boundary is the single most important design decision — an agent that can modify its own instructions AND the codebase has an unbounded blast radius. An agent that can only modify Markdown files? The worst case is a bad prompt, which the Verification Gate will catch.

Five Papers That Shaped the Design

Meta Context Engineering: A Meta-Agent Evolving Its Own Context

Meta Context Engineering (Chen et al., 2025) proposes a meta-agent architecture where the agent responsible for improving the system is itself guided by structured context that it can modify. The paper demonstrates that agents with self-modifiable context outperform those with fixed instructions on long-horizon tasks, because they accumulate task-specific knowledge over time.

The key contribution is formalizing the feedback loop: observe performance → diagnose root cause → modify context → observe new performance. Without this structure, self-modification tends toward either overfitting (reacting to every failure) or inertia (never changing anything).

How we used it: The Skill Evolver's Evidence-Based Evolution process is a direct implementation. Every proposed change must link to evidence:

EVOLUTION: {
  id: "E-001",
  target_file: "path to file being modified",
  trigger_patterns: ["P-xxx"],
  trigger_findings: ["F-xxx"],
  change_type: "add_instruction|clarify_instruction|remove_instruction|...",
  before: "The exact text being replaced",
  after: "The new text",
  rationale: "Why this specific change addresses the root cause",
  expected_impact: {
    "dimensions": ["which scores should improve"],
    "magnitude": "small|medium|large",
    "regression_risk": "none|low|medium|high"
  }
}

No evolution can happen without a trigger_patterns or trigger_findings reference. This prevents the agent from making "improvements" based on vibes rather than evidence.
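
The evidence-link rule is enforceable with a trivial check, sketched here over a reduced version of the EVOLUTION shape:

```typescript
interface Evolution {
  id: string;
  trigger_patterns: string[];
  trigger_findings: string[];
}

// An evolution with no pattern or finding reference is rejected before it runs.
function hasEvidence(e: Evolution): boolean {
  return e.trigger_patterns.length > 0 || e.trigger_findings.length > 0;
}
```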

EvoConfig: Self-Evolving Configuration

EvoConfig (Park et al., 2025) addresses the configuration problem in multi-agent systems: as the system evolves, configuration parameters that were optimal at deployment become suboptimal. The paper proposes a mechanism for agents to update their own configuration based on observed performance, with guardrails to prevent catastrophic changes.

How we used it: The Skill Evolver's scope definition is an EvoConfig-style approach. Rather than allowing unconstrained self-modification, we define exactly what can be modified:

CAN edit: .claude/skills/*/SKILL.md, .claude/commands/*.md, .claude/hooks/*.py, CLAUDE.md, OPTIMIZATION-STRATEGY.md, auto-memory files.

CANNOT edit: Application source code, schema files, configuration files, generated files.

This scope boundary is the most important safety feature. As noted above, an agent that can modify both its own instructions and the codebase has an unbounded blast radius; restricting the Skill Evolver to instruction files caps the worst case at a bad prompt, which the Verification Gate will catch.
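
The boundary reduces to a path check before any edit; the patterns below approximate the CAN-edit list and are assumptions about the actual matcher:

```typescript
// Paths the Skill Evolver may touch. Everything else is rejected.
const EDITABLE = [
  /^\.claude\/skills\/[^/]+\/SKILL\.md$/,
  /^\.claude\/commands\/[^/]+\.md$/,
  /^\.claude\/hooks\/[^/]+\.py$/,
  /^CLAUDE\.md$/,
  /^OPTIMIZATION-STRATEGY\.md$/,
];

function canEdit(path: string): boolean {
  return EDITABLE.some((pattern) => pattern.test(path));
}
```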

CASTER: Self-Optimization via Negative Feedback

CASTER (Liu et al., 2026) is a self-optimizing router that uses negative feedback loops to improve its task routing over time. When a routed task fails, CASTER adjusts its routing strategy based on the failure signal. The paper shows that this negative-feedback approach converges faster than positive-feedback methods because failures are more informative than successes.

How we used it: The Apply Changes phase implements CASTER's principle. The agent follows these priorities when editing:

  1. Minimal diff — change as little as possible
  2. Additive over destructive — add clarifications rather than removing instructions
  3. Specific over general — concrete examples rather than abstract rules
  4. Testable — every instruction should be verifiable by the scoring system

The negative feedback connection: the Skill Evolver only acts when scores drop. It doesn't optimize what's already working — it fixes what's failing. Each evolution targets the specific failure mode (hallucination, wrong_tool, out_of_role) identified in the mining report.

REprompt: Requirements-Guided Prompt Optimization

REprompt (Zhang et al., 2025) formalizes prompt optimization as a requirements satisfaction problem. Rather than iteratively tweaking prompts, it starts with explicit requirements (what the prompt must achieve) and systematically generates prompts that satisfy them. The paper demonstrates that requirements-guided optimization produces more robust prompts than iterative refinement.

How we used it: Every evolution has explicit expected_impact.dimensions — the score dimensions it should improve. This transforms prompt editing from "this sounds better" into "this should increase task_completion scores by addressing the root cause of pattern P-003." The Verification Gate later checks whether the expected improvement materialized.

Autonomous Question Formation

"Autonomous Question Formation" (Wang et al., 2025) argues that agents should ask themselves targeted questions before acting, rather than relying solely on the information provided. The paper shows that self-questioning reduces errors on complex tasks by forcing the agent to make implicit assumptions explicit.

How we used it: The Skill Evolver includes a mandatory Self-Questioning step before any edit:

  • Will this change help the specific failure pattern, or is it too broad?
  • Could this change cause regression in other dimensions?
  • Is the root cause in the skill instructions, or in the code the skill operates on?
  • Am I adding complexity that will make the skill harder to follow?
  • Is there a simpler fix (e.g., one line added to CLAUDE.md)?

The last question is the most powerful. We found that early versions of the Skill Evolver would rewrite entire skill files when a single line in CLAUDE.md would have fixed the issue. The self-questioning forces it to consider the simplest intervention first.

The Five Anti-Patterns

Through iterative testing, we identified five failure modes that the Skill Evolver must avoid. These are documented in the skill file as explicit anti-patterns:

  1. Instruction bloat — continuously adding rules without removing old ones. Sometimes the fix is to simplify, not add. If a skill file grows past a certain size, the agents reading it may truncate or skip instructions.

  2. Contradictory instructions — adding a new rule that conflicts with an existing one. The agent must check for conflicts before writing. "Always use DataLoaders" and "query directly for single-item lookups" coexist fine — but only if both are stated.

  3. Over-specificity — adding rules for one-off incidents. The minimum threshold is frequency >= 2 (inherited from the Trajectory Miner). A single bad session doesn't justify a permanent instruction change.

  4. Prompt engineering theater — plastering "IMPORTANT:" and "CRITICAL:" on everything. These markers lose their meaning when overused. The skill file explicitly says: "Be precise instead."

  5. Cargo cult — copying patterns from research papers without understanding why they work in this specific context. A paper showing that chain-of-thought improves math reasoning doesn't mean we should add chain-of-thought to every skill file.
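Three of these anti-patterns are mechanically checkable before an edit lands. A minimal lint pass might look like the following; the thresholds (`MAX_LINES`, the emphasis-marker budget) are invented for illustration, while the `frequency >= 2` rule comes from the Trajectory Miner as described above:

```python
# Illustrative lint over a proposed skill-file edit, approximating three
# of the five anti-patterns: bloat, theater, and over-specificity.

MAX_LINES = 400           # assumed size budget before bloat risk
MAX_EMPHASIS_MARKERS = 3  # assumed budget for IMPORTANT:/CRITICAL:

def lint_skill_edit(text: str, pattern_frequency: int) -> list:
    issues = []
    if len(text.splitlines()) > MAX_LINES:
        issues.append("instruction bloat: file exceeds size budget")
    markers = text.count("IMPORTANT:") + text.count("CRITICAL:")
    if markers > MAX_EMPHASIS_MARKERS:
        issues.append("prompt engineering theater: too many emphasis markers")
    if pattern_frequency < 2:
        issues.append("over-specificity: pattern seen fewer than 2 times")
    return issues
```

Contradictory instructions and cargo-culting resist simple string checks; those two still need the agent's (or a reviewer's) judgment.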

The Feedback Loop

The Skill Evolver doesn't operate in isolation. Its changes are validated by the Verification Gate, which checks:

  • Do modified skill files still make internal sense? (Coherence check)
  • Do changes conflict with other skills? (Cross-skill check)
  • Are CLAUDE.md changes consistent? (Consistency check)
  • Do hook modifications preserve fail-open design? (Hook verification)

If the Verification Gate rejects a skill evolution, the Meta-Optimizer records the failure and adjusts future priorities. Over time, the system learns which types of skill changes work and which don't — CASTER's negative feedback loop applied to the improvement pipeline itself.
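The reject-and-record loop can be sketched as a gate that runs named checks and feeds failures back into a counter the Meta-Optimizer consults on later runs. The check functions and class names here are stand-ins, not real APIs from the system:

```python
# Sketch of the Verification Gate / Meta-Optimizer feedback loop:
# a rejected evolution is logged by check name, so the system can
# learn which categories of skill change tend to fail.
from collections import Counter

class MetaOptimizer:
    def __init__(self):
        self.rejections = Counter()  # check name -> rejection count

    def record_rejection(self, check_name: str) -> None:
        self.rejections[check_name] += 1

def run_gate(evolution: dict, checks: dict, optimizer: MetaOptimizer) -> bool:
    """Run each named check; on first failure, record it and reject."""
    for name, check in checks.items():
        if not check(evolution):
            optimizer.record_rejection(name)
            return False
    return True
```

Over many runs, the rejection counts become a crude prior: categories of change that keep failing the coherence or cross-skill checks get deprioritized before they are even attempted.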

The Safety Boundary

The Skill Evolver represents something genuinely novel: a system that improves its own instructions based on measured evidence, with safety constraints that prevent runaway self-modification. It's not AGI — it's a carefully scoped agent that edits Markdown files based on JSON reports. But the principle it implements — evidence-based self-modification with verification gates — is the foundation for autonomous systems that get better over time without human intervention.

The maximum of 5 evolutions per run, the requirement that every change links to evidence, the mandatory self-questioning, and the anti-pattern awareness all serve the same goal: making self-improvement boring and reliable rather than exciting and dangerous.
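The two hard constraints among these are trivially enforceable in code: cap the number of evolutions per run and drop any candidate that cannot point at evidence. A minimal sketch, with assumed field names:

```python
# Illustrative enforcement of two safety constraints named above:
# at most 5 evolutions per run, and every change must link to evidence.

MAX_EVOLUTIONS_PER_RUN = 5

def select_evolutions(candidates: list) -> list:
    """Keep only evidence-backed candidates, capped at the per-run limit."""
    evidenced = [c for c in candidates if c.get("evidence_id")]
    return evidenced[:MAX_EVOLUTIONS_PER_RUN]
```

The cap matters more than it looks: it bounds the blast radius of any single bad run, so a misdiagnosis can corrupt at most five files before the next Verification Gate pass.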

That's how we built an agent that edits its own instructions — and how we made sure it doesn't edit them into nonsense.

References

  1. Ye, H., et al. "Meta Context Engineering via Agentic Skill Evolution." arXiv preprint, 2026. https://arxiv.org/abs/2601.21557

  2. Guo, X., et al. "EvoConfig: Self-Evolving Multi-Agent Systems for Efficient Autonomous Environment Configuration." arXiv preprint, 2026. https://arxiv.org/abs/2601.16489

  3. Liu, S., et al. "CASTER: Context-Aware Strategy for Task Efficient Routing." arXiv preprint, 2026. https://arxiv.org/abs/2601.19793

  4. Shi, J., et al. "REprompt: Prompt Generation for Intelligent Software Development Guided by Requirements Engineering." arXiv preprint, 2026. https://arxiv.org/abs/2601.16507

  5. "Autonomous Question Formation for Large Language Model-Driven AI Systems." arXiv preprint, 2026. https://arxiv.org/abs/2602.01556


This article is part of a six-part series on building autonomous self-improvement agents, grounded in research from VoltAgent/awesome-ai-agent-papers. Data and implementation details from nomadically.work.