
6 posts tagged with "autonomous-agents"


5 Meta-Tools, 0 Ad-Hoc Edits: Structured Code Repair with AI Agents

· 9 min read
Vadim Nicolai
Senior Software Engineer

There's a difference between an AI that can edit code and an AI that can repair code. Editing is mechanical — find a string, replace it. Repair requires understanding what's broken, why it's broken, and what the minimal fix looks like within the constraints of an existing codebase.

The Code Improver is the fourth agent in our six-agent autonomous self-improvement pipeline for nomadically.work. It's the only agent that writes application code. The Trajectory Miner finds patterns, the Codebase Auditor diagnoses issues, and the Skill Evolver improves instructions — but the Code Improver is the one that actually opens files and changes them.

Five research papers, curated from the VoltAgent/awesome-ai-agent-papers collection, informed its design. The central insight across all of them: structured repair workflows outperform ad-hoc fixing.

Note: The implementation has since evolved from a generic code improver into a goal-driven "Skill Optimizer" focused on AI/ML skill taxonomy, extraction, and matching for the job search pipeline. The research principles described here still underpin the architecture. The meta-tools and scope constraints below reflect the original design that these papers informed.

The Difference Between Editing and Repairing Code

Ask any AI coding agent to "fix the N+1 query in this resolver" and you'll get an edit. Maybe it's correct. Maybe it breaks the DataLoader contract. Maybe it fixes the symptom but introduces a new query pattern that's worse. The SWE-bench benchmark shows even the best agents resolve only 21% of real-world software engineering tasks — and that's with well-defined problem statements.

The issue isn't capability. It's methodology. An ad-hoc edit is a guess informed by context. A structured repair is a workflow: observe the code, analyze the pattern, apply a proven fix template, verify the result. The difference is the same as between a developer who "tries stuff until it works" and one who follows a diagnostic procedure.

Five Papers, One Principle: Structured Workflows Outperform Ad-Hoc Fixing

Agyn: Team-Based Autonomous Software Engineering

Agyn (Benkovich and Valkov, 2026) demonstrates that decomposing software engineering into specialized agent roles improves outcomes compared to a single monolithic agent. The paper defines roles like architect, developer, tester, and reviewer, each with distinct responsibilities and communication protocols.

How we used it: The Code Improver exists because of Agyn's principle of role specialization. In our pipeline, the roles are distributed across six agents. The Code Improver has a precisely defined scope:

CAN edit: Application source code (src/, workers/, scripts/), GraphQL schema files, database schema, tests, worker code.

CANNOT edit: Skill files, commands, hooks, CLAUDE.md, optimization strategy — that's the Skill Evolver's job.

This boundary prevents role confusion. When the Code Improver encounters an issue rooted in bad instructions rather than bad code, it skips it with an explanation rather than trying to fix both. The audit report's fix_strategy field tells it whether a finding belongs to improve-apply (code) or improve-evolve (skills).
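The scope boundary and the fix_strategy routing can be sketched as a small triage step. This is a minimal illustration, not the pipeline's actual code: the `Finding` shape, the `.claude/` directory, and the function names `canCodeImproverEdit` and `triage` are assumptions made for the example.

```typescript
// Sketch only: the Finding shape and the skill-file paths are assumptions.
type FixStrategy = "improve-apply" | "improve-evolve";

interface Finding {
  id: string;
  file: string;
  fix_strategy: FixStrategy;
}

// Directories the Code Improver may edit, taken from the scope list above.
// Schema and test locations are not given in the article; add them as needed.
const CODE_PREFIXES = ["src/", "workers/", "scripts/"];

// Files reserved for the Skill Evolver. ".claude/" is an assumed location.
const SKILL_PATHS = ["CLAUDE.md", ".claude/"];

function canCodeImproverEdit(path: string): boolean {
  if (SKILL_PATHS.some((p) => path === p || path.startsWith(p))) return false;
  return CODE_PREFIXES.some((p) => path.startsWith(p));
}

// Decide whether the Code Improver fixes a finding or skips it with a reason.
function triage(finding: Finding): { action: "fix" | "skip"; reason: string } {
  if (finding.fix_strategy !== "improve-apply") {
    return { action: "skip", reason: "belongs to the Skill Evolver" };
  }
  if (!canCodeImproverEdit(finding.file)) {
    return { action: "skip", reason: "file is outside the Code Improver's scope" };
  }
  return { action: "fix", reason: "in scope" };
}
```

Checking the strategy before the path means an instruction-level finding is skipped even when it happens to point at a source file.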

Team of Rivals: Specialized Roles with Code Executor

Team of Rivals (Vijayaraghavan et al., 2026) extends the multi-agent approach by introducing adversarial dynamics. Rather than agents cooperating blindly, they challenge each other's approaches. A code executor validates proposals by actually running them, creating a tight feedback loop between proposal and verification.

How we used it: While we don't implement adversarial dynamics within the Code Improver itself, the pipeline structure achieves the same effect. The Code Improver proposes fixes. The Verification Gate challenges them — running pnpm lint, pnpm build, checking conventions, looking for regressions. If the Verification Gate rejects a change, the Meta-Optimizer investigates. The "rivalry" is structural rather than within a single agent.

TraceCoder: Observe-Analyze-Repair

TraceCoder (Huang et al., 2026) formalizes a three-phase approach to code debugging: observe, analyze, repair. The Codebase Auditor already implements the observe and analyze phases. The Code Improver completes the loop with the repair phase.

How we used it: The Code Improver's process explicitly follows Observe-Analyze-Repair:

Observe: For every finding, read the audit report entry, read the affected files, read related files (imports, callers, tests), and understand the existing pattern.

Analyze: Determine the minimal change. Does it need a new function or can existing code be modified? Does it affect the public API? Are there tests that need updating?

Repair: Implement the fix following all CLAUDE.md conventions.

The "read before write" rule (Rule 1: "NEVER implement a finding without reading the code first") directly encodes TraceCoder's insight that agents produce better repairs when they observe thoroughly before acting.

Optimizing Agentic Workflows: Meta-Tools for Recurring Patterns

"Optimizing Agentic Workflows" (Chen et al., 2026) proposes meta-tools: reusable workflow templates for common agent operations. Rather than having agents figure out each task from scratch, meta-tools provide step-by-step procedures for recurring patterns.

How we used it: This paper directly inspired the five Workflow Meta-Tools — standardized repair procedures for the most common fix types:

Fix N+1 Query

  1. Read the resolver making per-item DB calls
  2. Check if a DataLoader exists in context.loaders
  3. If not, create a batch-loading DataLoader
  4. Replace direct DB call with context.loaders.X.load(id)

In our codebase, N+1 queries are a known issue — skills, company, and ATS board sub-fields all lack DataLoaders. This meta-tool standardizes the fix pattern.
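Steps 3 and 4 depend on the DataLoader contract: a batch function must return exactly one result per requested key, in the same order as the keys. The sketch below shows only that alignment step against in-memory rows. The `SkillRow` shape and the names `alignToKeys` and `fetchSkillsForJobs` are hypothetical, since the article does not show the real tables or loaders.

```typescript
// Hypothetical row shape; the real skills table is not shown in the article.
interface SkillRow {
  jobId: number;
  name: string;
}

// Group rows fetched in ONE query by their key, then return one entry per
// requested key, in the order the keys were given. Keys with no rows get [].
function alignToKeys(
  keys: readonly number[],
  rows: readonly SkillRow[],
): SkillRow[][] {
  const byJob = new Map<number, SkillRow[]>();
  for (const row of rows) {
    const bucket = byJob.get(row.jobId);
    if (bucket) bucket.push(row);
    else byJob.set(row.jobId, [row]);
  }
  return keys.map((k) => byJob.get(k) ?? []);
}

// A batch function for the `dataloader` package could wrap this, for example:
//   new DataLoader(async (ids) => alignToKeys(ids, await fetchSkillsForJobs(ids)))
// where fetchSkillsForJobs is a hypothetical single "WHERE job_id IN (...)" query.
```

Returning an empty array for keys without rows keeps the result the same length as the key list, which is the part of the contract an ad-hoc fix is most likely to break.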

Fix Type Safety

  1. Find the any type usage
  2. Trace the actual type (from schema, DB, or generated types)
  3. Replace any with the correct type
  4. Fix downstream type errors

With 283+ any types in resolvers, this is a high-frequency repair. The meta-tool ensures each fix follows the same approach: trace the type from @/__generated__/resolvers-types.ts rather than guessing.

Fix Performance (Full Table Scan)

  1. Read the query
  2. Add appropriate WHERE clause or index
  3. Verify with the correct Drizzle filter
  4. Check with EXPLAIN if possible

The enhance-job.ts resolver that fetches all jobs to find one by external_id is the canonical example.

Fix Dead Code

  1. Verify it's truly unused (Grep for imports/references)
  2. Remove the code
  3. Remove now-unused imports
  4. Don't leave "removed" comments

The @libsql/client and pg dependencies left in package.json after the D1 migration are prime candidates.

Fix Security

  1. Add admin guard if missing (isAdminEmail() from src/lib/admin.ts)
  2. Replace raw SQL with Drizzle ORM
  3. Add input validation at system boundaries
  4. Never log secrets

PatchIsland: Diverse LLM Agent Ensemble for Repair

PatchIsland (Chen et al., 2026) uses multiple LLM agents with different strategies to generate diverse repair candidates, then selects the best one. The insight is that different models and prompting strategies produce different types of fixes, and an ensemble approach catches issues that any single approach would miss.

How we used it: While we don't run multiple models in parallel (cost constraints), the PatchIsland principle shows up in two ways. First, the confidence score (0.0-1.0) on each implementation reflects the agent's self-assessed certainty — findings with confidence below 0.7 get flagged for human review, acknowledging that a single agent's fix might not be optimal. Second, the meta-tool approach itself provides "strategy diversity" — the same agent uses different repair procedures for different fix types rather than applying a one-size-fits-all approach.

Post-Implementation Discipline

The Code Improver doesn't just write code and move on. After making changes:

  1. If GraphQL schema was modified → run pnpm codegen
  2. If DB schema was modified → run pnpm db:generate
  3. Run pnpm lint for every change
  4. Run pnpm build for significant changes
  5. Re-read modified files to confirm correctness
  6. Check that imports are valid and no regressions exist in nearby code

This post-implementation checklist exists because early versions of the agent would make correct edits that broke unrelated code through import changes or type propagation. The checklist catches these cascade failures before the Verification Gate even sees the changes.
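The first four checklist items amount to a mapping from changed files to commands. A minimal sketch follows, assuming GraphQL schema files end in `.graphql` and the database schema lives under `src/db/schema`. Both path conventions are guesses for illustration, as is the `postChangeCommands` name.

```typescript
// Hypothetical path conventions: the article does not say where schemas live.
const isGraphQLSchema = (file: string): boolean => file.endsWith(".graphql");
const isDbSchema = (file: string): boolean => file.startsWith("src/db/schema");

// Return the commands to run after a set of edits, in checklist order.
function postChangeCommands(
  changedFiles: string[],
  significant: boolean,
): string[] {
  const commands: string[] = [];
  if (changedFiles.length === 0) return commands; // nothing changed, nothing to run
  if (changedFiles.some(isGraphQLSchema)) commands.push("pnpm codegen");
  if (changedFiles.some(isDbSchema)) commands.push("pnpm db:generate");
  commands.push("pnpm lint"); // runs for every change
  if (significant) commands.push("pnpm build"); // only for significant changes
  return commands;
}
```

Items 5 and 6 (re-reading files, checking imports) are judgment steps for the agent and are deliberately left out of the sketch.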

Why Zero Ad-Hoc Edits Matters

The Code Improver's maximum of 10 findings per run isn't a limitation — it's a design choice. Quality over quantity. Each fix must be traceable to an audit finding, implemented with a standardized meta-tool when applicable, verified by the agent itself, and then verified again by the Verification Gate.

This structured approach produces something that ad-hoc AI code editing cannot: confidence. When the Code Improver reports a fix with confidence 0.95, it means the finding was diagnosed by the Codebase Auditor, the fix follows a proven meta-tool pattern, the lint passed, and the agent re-read its own output and confirmed correctness. That's not perfection, but it's a lot closer to how senior engineers work than "ask GPT to fix it."

References

  1. Benkovich, N. and Valkov, V. "Agyn: A Multi-Agent System for Team-Based Autonomous Software Engineering." arXiv preprint, 2026. https://arxiv.org/abs/2602.01465

  2. Vijayaraghavan, G., et al. "If You Want Coherence, Orchestrate a Team of Rivals: Multi-Agent Models of Organizational Intelligence." arXiv preprint, 2026. https://arxiv.org/abs/2601.14351

  3. Huang, J., et al. "TraceCoder: A Trace-Driven Multi-Agent Framework for Automated Debugging of LLM-Generated Code." ICSE 2026. https://conf.researchr.org/details/icse-2026/icse-2026-research-track/145/

  4. "Optimizing Agentic Workflows using Meta-tools." arXiv preprint, 2026. https://arxiv.org/abs/2601.22037

  5. "PatchIsland: Orchestration of LLM Agents for Continuous Vulnerability Repair." arXiv preprint, 2026. https://arxiv.org/abs/2601.17471


This article is part of a six-part series on building autonomous self-improvement agents, grounded in research from VoltAgent/awesome-ai-agent-papers. Data and implementation details from nomadically.work.

Your Linter Can't Trace Execution Paths. This Agent Can.

· 9 min read
Vadim Nicolai
Senior Software Engineer

Static analysis tools find pattern violations. Linters catch style issues. But neither traces an N+1 query from a GraphQL resolver through a DataLoader absence to a frontend performance degradation. That requires understanding execution paths — and that's what the Codebase Auditor does.

The Codebase Auditor is the second agent in our six-agent autonomous self-improvement pipeline for nomadically.work. It receives pattern IDs from the Trajectory Miner, investigates the actual code exhaustively, and produces findings with exact file:line references. It never modifies code — it only reads and reports.

Four research papers, curated from the VoltAgent/awesome-ai-agent-papers collection, shaped its design. Here is how each one translated into practice.

Note: The implementation has since evolved from a generic codebase auditor into a goal-driven "Discovery Expander" focused on finding more AI engineering companies and job boards. The research principles described here still underpin the architecture. The playbooks and data structures below reflect the original design that these papers informed.

What Linters Miss

ESLint will tell you about an unused variable. SonarQube will flag a code smell. Neither will tell you that your enhance-job.ts resolver fetches every job in the database to find one by external_id — a full table scan that degrades with every job you add. Neither will trace a missing admin guard from mutation definition through resolver to production exposure. The gap between pattern matching and execution path tracing is where real bugs live.

Google's 2025 DORA Report found that a 90% increase in AI adoption correlates with a 91% increase in code review time. More AI-generated code means more code to review, and linters aren't scaling to meet the need. What's needed isn't a faster linter. It's an agent that investigates like a senior engineer: following imports, reading callers, tracing data flow across layers.

Four Papers, One Auditor

TraceCoder: Observe-Analyze-Repair

TraceCoder (Huang et al., 2026) introduces a three-phase loop for LLM-based code debugging: observe the code and its runtime behavior, analyze what's wrong, and propose a repair. The paper demonstrates that separating observation from analysis significantly improves debugging accuracy: agents that jump to conclusions before thoroughly reading the code produce worse fixes.

The critical insight is that LLM agents tend to pattern-match on surface-level code features rather than tracing actual execution paths. TraceCoder forces a structured observation phase that must complete before analysis begins.

How we used it: The Codebase Auditor's first two phases directly implement TraceCoder's observe-analyze loop (we omit the repair phase — that's the Code Improver's job):

Observe Phase: For each target area, the agent systematically reads the primary files, traces imports and dependencies, reads related tests, checks the schema and types that constrain the code, and looks for TODO/FIXME/HACK comments. This isn't a keyword search — it's following the call chain.

Analyze Phase: Each observation gets classified into a structured finding with type, severity, location, evidence (the actual code snippet), root cause, impact, fix strategy, and a confidence score from 0.0 to 1.0.

The separation matters. In early prototypes without this structure, the agent would read a file, spot an any type, immediately suggest a fix, and miss that the any was masking a deeper schema mismatch three files away.

TrajAD: Trajectory Anomaly Detection

TrajAD (Li et al., 2025) is a specialized verifier that detects errors in LLM agent trajectories — sequences of actions that went wrong. Rather than evaluating individual actions, it evaluates the trajectory as a whole, catching errors that look reasonable in isolation but form a problematic pattern in sequence.

How we used it: TrajAD's approach directly maps to the Analyze Phase's classification system. Each finding is not just a point-in-time observation but includes links to mining report patterns (related_patterns: ["P-xxx"]), creating a trajectory view. The confidence score (0.0-1.0) is a direct application of TrajAD's anomaly scoring — findings below 0.7 confidence must be flagged as "needs verification."

This prevents the auditor from reporting false positives with false certainty. When the agent reads code and isn't sure whether something is actually a bug or an intentional design choice, it says so.
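A structured finding and the 0.7 threshold might look like the following sketch. The field names are inferred from the Analyze Phase description; the enum values, the `"confirmed"` label, and the `labelFinding` helper are assumptions added for illustration.

```typescript
// Field names follow the Analyze Phase list; the exact schema is an assumption.
interface AuditFinding {
  id: string; // e.g. "F-001"
  type: "performance" | "type_safety" | "security" | "dead_code";
  severity: "low" | "medium" | "high";
  location: string; // file:line reference
  evidence: string; // the actual code snippet
  root_cause: string;
  impact: string;
  fix_strategy: string;
  confidence: number; // 0.0 to 1.0
  related_patterns: string[]; // e.g. ["P-003"]
}

const VERIFICATION_THRESHOLD = 0.7;

// Findings below the threshold are reported, but flagged as uncertain.
function labelFinding(f: AuditFinding): "confirmed" | "needs verification" {
  if (f.confidence < 0 || f.confidence > 1) {
    throw new RangeError(`confidence out of range for ${f.id}`);
  }
  return f.confidence < VERIFICATION_THRESHOLD ? "needs verification" : "confirmed";
}
```

Rejecting out-of-range scores up front keeps a malformed report from silently passing as a high-confidence finding.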

Graph-RAG for Codebases: AST-Derived Navigation

"Reliable Graph-RAG for Codebases" (Zhang et al., 2026) proposes using AST-derived knowledge graphs for code understanding. Instead of treating code as flat text, it builds a graph of relationships (function calls, imports, type hierarchies, data flow) and uses this graph to navigate the codebase intelligently.

How we used it: While we don't build an explicit AST graph, the Observe Phase implements the same principle through structured code navigation. The agent doesn't grep for keywords — it follows imports, reads callers, checks type definitions, and traces data flow through resolvers. In a GraphQL codebase like ours, this means following the path from schema definition → resolver → Drizzle query → D1 database.

Architecture-Aware Evaluation

Architecture-Aware Evaluation (Wang et al., 2026) links code findings to architectural components, answering not just "what's wrong" but "which layer is affected and what cascades." A bug in a database query might seem isolated, but if it's in a resolver called by every frontend page, the blast radius is the entire application.

How we used it: The auditor's Architecture Trace phase maps every finding to the project's layers:

{
  "architecture_map": {
    "affected_layers": ["db", "api", "resolver", "frontend", "worker", "agent"],
    "cascade_risks": ["Finding F-001 in resolver affects F-003 in frontend"],
    "systemic_issues": ["Issues that appear across multiple files"]
  }
}

This is especially valuable in our architecture where data flows through multiple layers: D1 Database → Gateway Worker → Drizzle ORM → Apollo Resolver → GraphQL → React Frontend. A type mismatch at the database layer cascades through every layer above it.

The Four Playbooks

The Codebase Auditor includes four investigation playbooks — standardized checklists for common audit types. Each playbook is grounded in real issues we've encountered in the nomadically.work codebase.

Performance Playbook

  1. Search for N+1 query patterns — resolvers calling the database inside field resolvers without DataLoaders
  2. Check for missing indexes on frequently-queried columns
  3. Look for full table scans — SELECT * without WHERE
  4. Find unbounded queries — no LIMIT
  5. Check for synchronous operations that could be parallel

Real example: The enhance-job.ts resolver fetches all jobs to find one by external_id — a full table scan documented in CLAUDE.md's known issues. The playbook catches this systematically rather than stumbling on it.

Type Safety Playbook

  1. Grep for any type usage in resolvers and agents
  2. Check for missing null checks on nullable DB columns
  3. Verify GraphQL resolver return types match schema
  4. Look for unchecked JSON.parse calls
  5. Check for D1 boolean coercion issues (0/1 vs true/false)

Real example: The project has 283+ any types in resolvers. D1 returns boolean columns as SQLite integers (0/1) while GraphQL expects true/false, so the playbook specifically checks for this D1-specific gotcha.
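The boolean coercion check in item 5 usually comes down to one conversion at the resolver boundary. A minimal sketch, with a hypothetical `JobRow` shape and `is_remote` column chosen only for the example:

```typescript
// Convert a D1/SQLite integer flag into a GraphQL-friendly boolean.
// null and undefined stay null so nullable columns remain nullable.
function toBoolean(value: number | boolean | null | undefined): boolean | null {
  if (value === null || value === undefined) return null;
  if (typeof value === "boolean") return value;
  return value !== 0;
}

// Hypothetical row and field names, used only to show where the call goes.
interface JobRow {
  id: number;
  is_remote: number | null;
}

function toGraphQLJob(row: JobRow): { id: number; isRemote: boolean | null } {
  return { id: row.id, isRemote: toBoolean(row.is_remote) };
}
```

Converting once, where the row is mapped, avoids scattering `=== 1` comparisons across individual field resolvers.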

Security Playbook

  1. Verify admin guards on all mutations
  2. Check for SQL injection vectors (raw SQL strings)
  3. Look for exposed secrets or API keys in code
  4. Check CORS configuration
  5. Verify input validation on API routes

Real example: The D1 Gateway Worker has CORS: * — a known issue. The playbook ensures every mutation includes the isAdminEmail() guard from src/lib/admin.ts.

Dead Code Playbook

  1. Find exports with no importers
  2. Find files with no references
  3. Check for unused dependencies in package.json
  4. Look for commented-out code blocks
  5. Find TODO comments older than current patterns

Real example: @libsql/client and pg dependencies remain in package.json after the D1 migration — likely unused. The insert-jobs worker still references Turso in its help text. The playbook catches stale code that humans overlook because it doesn't break anything.

Cross-Referencing Known Issues

A subtle but important feature: the auditor reads CLAUDE.md's "Known issues" section before reporting. It won't re-report issues that are already documented unless it has new information — such as discovering the issue is worse than documented or finding the root cause of a known symptom.

This prevents the improvement pipeline from generating redundant work. The team already knows about CORS * on the gateway. The auditor's job is to find what the team doesn't know.

Read-Only by Design

Traditional static analysis operates at the syntax level. Code review by humans operates at the understanding level but doesn't scale. The Codebase Auditor sits between these — it traces execution paths like a human reviewer but does so systematically across the entire codebase, guided by playbooks that encode institutional knowledge about what to look for.

The read-only constraint is fundamental. By never modifying code, the auditor can be aggressive in its investigation without risk. It can report 20 findings per audit (its configured limit), each with confidence scores, and let the downstream Code Improver decide which ones to actually fix. This separation of diagnosis from treatment mirrors how senior engineers work: the person who identifies the problem isn't always the person who fixes it.

Your linter will keep catching semicolons. This agent will keep tracing the execution paths where the real problems hide.

References

  1. Huang, J., et al. "TraceCoder: A Trace-Driven Multi-Agent Framework for Automated Debugging of LLM-Generated Code." ICSE 2026. https://conf.researchr.org/details/icse-2026/icse-2026-research-track/145/

  2. Pathak, D., et al. "Detecting Silent Failures in Multi-Agentic AI Trajectories." arXiv preprint, 2025. https://arxiv.org/abs/2511.04032

  3. "Reliable Graph-RAG for Codebases: AST-Derived Graphs vs LLM-Extracted Knowledge Graphs." arXiv preprint, 2026. https://arxiv.org/abs/2601.08773

  4. "Toward Architecture-Aware Evaluation Metrics for LLM Agents." arXiv preprint, 2026. https://arxiv.org/abs/2601.19583


This article is part of a six-part series on building autonomous self-improvement agents, grounded in research from VoltAgent/awesome-ai-agent-papers. Data and implementation details from nomadically.work.

We Built a Strategic Brain for Our AI Pipeline. Here's What It Learned.

· 10 min read
Vadim Nicolai
Senior Software Engineer

Five agents in our pipeline know how to mine patterns, audit code, evolve skills, fix bugs, and verify changes. None of them knows when to do any of those things. That is the Meta-Optimizer's job.

The Meta-Optimizer is the sixth and final agent in our autonomous self-improvement pipeline for nomadically.work. It is the strategic brain: it reads all reports from other agents, determines the current phase of the system, creates prioritized action plans, and enforces safety constraints. It never edits code or skills directly. It only decides what should happen next.

Six research papers shaped its design. Together, they address the hardest problem in autonomous improvement: knowing when to improve, when to stop, and when to call for help.

Note: The implementation has since evolved from a generic code-improvement coordinator into a goal-driven job-search optimizer ("Strategy Brain"). The research principles described here still underpin the architecture, but the phase names, decision tables, and state schema have changed to reflect domain-specific priorities. The code snippets below reflect the original design that these papers informed.

Five Agents, Zero Coordination

Gartner reported a 1,445% surge in multi-agent system inquiries from Q1 2024 to Q2 2025. Everyone is building multi-agent systems. Almost nobody is building the coordinator. The result: agent pipelines that run every agent on every cycle, regardless of what the system actually needs. Mining reports pile up unread. Audits repeat areas already audited. Fixes conflict with other fixes. The agents are individually capable but collectively aimless.

The Meta-Optimizer exists because five specialist agents without coordination produce churn, not improvement. By February 2026, frontier models crossed 14.5 hours of autonomous work — but autonomy without strategy is just expensive compute.

Six Papers That Shaped the Strategic Brain

ROMA: Recursive Task Decomposition

ROMA (Li et al., 2026) proposes a recursive framework for decomposing complex tasks into parallel subtrees that multiple agents can work on simultaneously. The key insight: not all subtasks have equal priority or dependencies. ROMA's recursive decomposition respects these constraints while maximizing parallelism.

How we used it: The Meta-Optimizer's action plan is a ROMA-style decomposition:

ACTION_PLAN: {
  phase: "IMPROVEMENT|SATURATION|COLLAPSE_RISK",
  actions: [
    {
      priority: 1,
      agent: "improve-mine|improve-audit|improve-evolve|improve-apply|improve-verify",
      task: "Specific task description",
      inputs: { ... },
      expected_outcome: "...",
      cost_estimate: "low|medium|high",
      risk_level: "low|medium|high"
    }
  ],
  deferred: [...],
  meta_actions: [...]
}

Actions are prioritized, and the orchestrator executes them respecting dependencies. Mining and auditing run in parallel (both are read-only). Evolution and code improvement run in parallel (they target different file scopes). Verification runs after both. The Meta-Optimizer encodes these dependencies in the action plan.
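The dependency ordering described above can be sketched as three stages: read-only agents, then writers, then verification. The `toStages` helper and the trimmed `Action` shape are illustrative assumptions, not the orchestrator's real code.

```typescript
type Agent =
  | "improve-mine"
  | "improve-audit"
  | "improve-evolve"
  | "improve-apply"
  | "improve-verify";

// Trimmed version of an action-plan entry; only the fields needed here.
interface Action {
  priority: number; // 1 = highest
  agent: Agent;
  task: string;
}

// Stage per agent: read-only first, writers second, verification last.
const STAGE: Record<Agent, number> = {
  "improve-mine": 0,
  "improve-audit": 0,
  "improve-evolve": 1,
  "improve-apply": 1,
  "improve-verify": 2,
};

// Returns stages in execution order. Actions inside a stage may run in
// parallel and are sorted by priority. Empty stages are dropped.
function toStages(actions: Action[]): Action[][] {
  const stages: Action[][] = [[], [], []];
  for (const action of actions) {
    const stage = stages[STAGE[action.agent]];
    if (stage) stage.push(action);
  }
  return stages
    .map((stage) => [...stage].sort((a, b) => a.priority - b.priority))
    .filter((stage) => stage.length > 0);
}
```

An orchestrator could then await each stage in turn, running the actions within a stage concurrently.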

DyTopo: Dynamic Topology Rewiring

DyTopo (Zhang et al., 2026) introduces dynamic topology rewiring in multi-agent systems. Instead of fixed agent-to-agent connections, DyTopo adjusts which agents communicate with which others based on the current task. Some tasks need deep collaboration; others need isolation.

How we used it: The Meta-Optimizer implements DyTopo through selective routing. Not every cycle uses every agent. The Decision Framework encodes this:

Situation | Action
No mining report exists | Run improve-mine first
Mining report exists, no audit | Run improve-audit on top patterns
Audit exists, no implementation | Route findings to improve-evolve or improve-apply
Changes made, no verification | Run improve-verify
Verification REJECT | Investigate rejection, fix or revert
Verification ACCEPT | Update meta-state, plan next cycle
Same pattern recurring 3+ times | Escalate -- the fix is not working
No improvement files exist | Cold start -- wait for sessions to accumulate
Score collapse detected | HALT everything, recommend human review

This is dynamic routing: the topology of agent communication changes based on what state files exist and what their contents show. A cycle might involve all six agents or just two, depending on what the system needs.
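Read as code, the decision table is a chain of guards over the state files. The sketch below assumes an order of precedence (safety conditions first, then the normal progression) that the article does not spell out, and the `PipelineState` fields are hypothetical names for "which files exist".

```typescript
// Hypothetical summary of which state files exist and what they contain.
interface PipelineState {
  hasAnyImprovementFiles: boolean;
  hasMiningReport: boolean;
  hasAudit: boolean;
  hasImplementation: boolean;
  hasVerification: boolean;
  verdict?: "ACCEPT" | "REJECT";
  maxPatternRecurrence: number; // highest recurrence count of any pattern
  scoreCollapse: boolean;
}

// Assumed precedence: halting and escalation win over normal routing.
function nextStep(s: PipelineState): string {
  if (s.scoreCollapse) return "HALT everything, recommend human review";
  if (s.maxPatternRecurrence >= 3) return "Escalate: the fix is not working";
  if (!s.hasAnyImprovementFiles) return "Cold start: wait for sessions to accumulate";
  if (!s.hasMiningReport) return "Run improve-mine first";
  if (!s.hasAudit) return "Run improve-audit on top patterns";
  if (!s.hasImplementation) return "Route findings to improve-evolve or improve-apply";
  if (!s.hasVerification) return "Run improve-verify";
  if (s.verdict === "REJECT") return "Investigate rejection, fix or revert";
  return "Update meta-state, plan next cycle";
}
```

Putting the collapse and recurrence checks first means a degraded system never falls through to a routine "run the next agent" decision.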

CASTER: Self-Optimization via Negative Feedback

CASTER (Liu et al., 2026) builds a self-optimizing router that improves its routing decisions based on failure signals. When a routed task fails, CASTER adjusts routing weights to avoid similar failures in the future.

How we used it: The Meta-Optimizer tracks pattern recurrence. If pattern P-003 was identified, a fix was applied, and the same pattern appears again in a later mining report, the routing strategy failed. The Meta-Optimizer records this and adjusts: the fix should target architecture instead of code, or the finding needs a Skill Evolver edit instead of a Code Improver fix.

After three recurrences of the same pattern, the Meta-Optimizer escalates with a "need different approach" flag. This prevents the pipeline from applying the same ineffective fix repeatedly -- CASTER's negative feedback applied to the improvement pipeline itself.

MonoScale: Safe Scaling with Non-Decreasing Performance

MonoScale (Wang et al., 2026) addresses a persistent problem in scaling multi-agent systems: adding more agents or more iterations does not always improve performance. Sometimes it degrades it. MonoScale provides guarantees that scaling operations produce non-decreasing performance.

How we used it: MonoScale's principle maps directly to our safety constraints:

  • Maximum three code changes per cycle -- prevents churn from excessive modifications
  • Maximum two skill evolutions per cycle -- prevents instruction drift
  • Mandatory verification after any write operation -- catches degradation immediately
  • 10+ files modified without human review triggers a pause -- cumulative change threshold
  • Score collapse (3+ dimensions dropping) halts everything -- the last-resort safeguard

These are not arbitrary limits. They are MonoScale-style bounds that ensure each cycle produces non-negative improvement. A cycle that makes three changes and verifies them is safer than a cycle that makes 20 changes and hopes for the best.
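The five constraints translate naturally into a guard that runs before a cycle's changes are accepted. This is a sketch under assumptions: the `CycleSummary` fields and the action names (`reject_plan`, `pause`, `halt`) are invented for the example, while the numeric limits come from the list above.

```typescript
// Hypothetical per-cycle counters; field names are illustrative.
interface CycleSummary {
  codeChanges: number;
  skillEvolutions: number;
  verifiedAfterWrite: boolean;
  filesModifiedSinceReview: number;
  dimensionsDropping: number;
}

interface LimitCheck {
  action: "continue" | "reject_plan" | "pause" | "halt";
  reasons: string[];
}

function checkLimits(c: CycleSummary): LimitCheck {
  // Last-resort safeguard: score collapse halts everything.
  if (c.dimensionsDropping >= 3) {
    return { action: "halt", reasons: ["score collapse: 3+ dimensions dropping"] };
  }
  // Cumulative change threshold: pause until a human reviews.
  if (c.filesModifiedSinceReview >= 10) {
    return { action: "pause", reasons: ["10+ files modified without human review"] };
  }
  const reasons: string[] = [];
  if (c.codeChanges > 3) reasons.push("more than 3 code changes in one cycle");
  if (c.skillEvolutions > 2) reasons.push("more than 2 skill evolutions in one cycle");
  const wroteSomething = c.codeChanges > 0 || c.skillEvolutions > 0;
  if (wroteSomething && !c.verifiedAfterWrite) reasons.push("write without verification");
  return reasons.length > 0
    ? { action: "reject_plan", reasons }
    : { action: "continue", reasons };
}
```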

Phase Transition Theory for Multi-Agent Systems

Phase Transition research (Chen et al., 2026) studies how multi-agent systems move between distinct operating regimes. The key finding: systems exhibit three phases -- improvement, saturation, and collapse -- and the transitions between them are predictable from score trends.

How we used it: Phase Detection is the most consequential decision the Meta-Optimizer makes:

IMPROVEMENT phase: Scores are trending up. The system is working. Keep making changes, focusing on high-impact patterns. This is the normal operating mode.

SATURATION phase: Scores are stable. The easy wins are done. Switch to diminishing-returns awareness: focus on untouched areas or architectural improvements. Do not force changes for the sake of activity.

COLLAPSE RISK phase: Scores are dropping after recent changes. Something went wrong. Stop making changes immediately. Investigate regressions. Consider reverting recent changes. Recommend human review.

The phase determines everything downstream. In IMPROVEMENT, the Meta-Optimizer routes aggressively -- mine, audit, evolve, apply, verify. In SATURATION, it routes conservatively -- mine and audit only, looking for new areas to explore. In COLLAPSE RISK, it routes defensively -- verify only, no new changes.
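One simple way to detect the phase is to look at the net score change over a recent window. The article does not describe the real trend calculation, so the window size, the `epsilon` tolerance, the single aggregate score per cycle, and the `detectPhase` name are all assumptions. The routing table mirrors the paragraph above.

```typescript
type Phase = "IMPROVEMENT" | "SATURATION" | "COLLAPSE_RISK";

// Classify the trend from the net change across the last `window` cycles.
// Assumes one aggregate score per cycle, oldest first.
function detectPhase(scores: number[], window = 3, epsilon = 0.01): Phase {
  // Assumption: with too little history, stay in the normal operating mode.
  if (scores.length < 2) return "IMPROVEMENT";
  const recent = scores.slice(-(window + 1));
  const first = recent[0] ?? 0;
  const last = recent[recent.length - 1] ?? 0;
  const net = last - first;
  if (net > epsilon) return "IMPROVEMENT";
  if (net < -epsilon) return "COLLAPSE_RISK";
  return "SATURATION";
}

// Which agents each phase is allowed to route to, as described above.
const ROUTES: Record<Phase, string[]> = {
  IMPROVEMENT: ["improve-mine", "improve-audit", "improve-evolve", "improve-apply", "improve-verify"],
  SATURATION: ["improve-mine", "improve-audit"],
  COLLAPSE_RISK: ["improve-verify"],
};
```

A net-change test is deliberately crude. A production version would likely track each score dimension separately, since the collapse rule elsewhere in the article counts dropping dimensions.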

Bayesian Orchestration: Cost-Aware Decision Making

Bayesian Orchestration (Kim et al., 2026) applies cost-aware sequential decision-making to multi-LLM workflows. The insight: not all operations cost the same, and an orchestrator should prefer cheaper operations when the expected benefit is small.

How we used it: The Meta-Optimizer maintains a cost hierarchy:

Operation | Cost | Tools Used
Mining, auditing | Low | Glob, Grep, Read
Skill evolution | Medium | Read, Edit
Code improvement | High | Read, Edit, Bash (builds)
Verification | High | Bash (lint, build)

The Meta-Optimizer prefers cheaper actions when the expected improvement is small. A pattern with severity "low" and effort "large" gets deferred in favor of a "high" severity, "small" effort pattern -- even if the large-effort pattern would produce bigger improvement eventually. This prevents the pipeline from spending API tokens on marginal improvements.
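The severity-versus-effort trade-off can be expressed as a benefit-per-cost ratio. The numeric weights, the deferral cutoff, and the `Candidate` shape below are illustrative assumptions. The article only states the preference, not a formula.

```typescript
type Severity = "low" | "medium" | "high";
type Effort = "small" | "medium" | "large";

interface Candidate {
  id: string; // pattern id, e.g. "P-001"
  severity: Severity;
  effort: Effort;
}

// Illustrative weights; the real pipeline may weigh these differently.
const SEVERITY_VALUE: Record<Severity, number> = { low: 1, medium: 2, high: 3 };
const EFFORT_COST: Record<Effort, number> = { small: 1, medium: 2, large: 3 };

// Expected benefit per unit of cost.
const ratio = (c: Candidate): number =>
  SEVERITY_VALUE[c.severity] / EFFORT_COST[c.effort];

// Sort by ratio (best first) and defer anything below the cutoff.
function prioritize(
  candidates: Candidate[],
  deferBelow = 1,
): { planned: Candidate[]; deferred: Candidate[] } {
  const sorted = [...candidates].sort((a, b) => ratio(b) - ratio(a));
  return {
    planned: sorted.filter((c) => ratio(c) >= deferBelow),
    deferred: sorted.filter((c) => ratio(c) < deferBelow),
  };
}
```

With these weights, a low-severity, large-effort pattern scores about 0.33 and is deferred, while a high-severity, small-effort pattern scores 3 and is planned first.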

Persistent State

Unlike other agents that produce one-off reports, the Meta-Optimizer maintains persistent state across cycles:

{
  "last_updated": "ISO timestamp",
  "cycle_count": 0,
  "phase": "IMPROVEMENT|SATURATION|COLLAPSE_RISK",
  "score_history": [],
  "patterns_resolved": ["P-001"],
  "patterns_recurring": ["P-003"],
  "files_under_improvement": [],
  "total_files_modified": 0,
  "human_review_needed": false,
  "next_action": "What to do in the next cycle"
}

This state accumulates across sessions. When a new improvement cycle starts, the Meta-Optimizer reads its own history before planning. This prevents it from re-auditing areas that were just audited, re-mining data that was just mined, or applying fixes to files already under active improvement.

When to Ask for Help

The Meta-Optimizer has a human_attention_needed field in its output. It activates when:

  • Score collapse is detected (3+ dimensions dropping)
  • The same pattern has recurred three or more times despite fixes
  • 10 or more cumulative files have been modified without human review
  • The Meta-Optimizer's own action plans keep leading to REJECT verdicts
  • The improvement pipeline itself appears to be degrading

This is the most important feature. An autonomous system that does not know when to stop being autonomous is not safe -- it is unmonitored. The Meta-Optimizer's final rule: "Be conservative -- a stable system is better than a constantly-changing one."

Here's What It Learned

The Meta-Optimizer addresses the recursive control problem: who controls the controllers? In our pipeline, five agents do work, and the Meta-Optimizer decides what work to do. But who decides if the Meta-Optimizer is making good decisions? The answer is the Meta-Optimizer itself -- through phase detection, score tracking, and the willingness to request human review when its own strategies fail.

This is not a solved problem. The Meta-Optimizer cannot distinguish between "my action plan was wrong" and "the downstream agents executed poorly." But by tracking score trends across cycles rather than evaluating individual cycles, it detects systemic issues: if five consecutive cycles improve scores, the strategy is working. If three consecutive cycles degrade scores, something is wrong at the strategic level.

The improvement pipeline is only as good as its coordinator. An aggressive Meta-Optimizer that routes every finding to immediate fixing produces churn. A passive one that defers everything produces stagnation. The right balance (informed by Phase Transition theory, constrained by MonoScale bounds, guided by CASTER's negative feedback) is what makes autonomous self-improvement work in practice.

That's what the strategic brain learned: the hardest decision in an autonomous system isn't what to improve. It's when to stop.

References

  1. Li, H., et al. "ROMA: Recursive Open Meta-Agent Framework for Multi-Agent Systems." arXiv preprint, 2026. https://arxiv.org/abs/2602.01848

  2. Zhang, Y., et al. "DyTopo: Dynamic Topology Routing via Semantic Matching for Multi-Agent Systems." arXiv preprint, 2026. https://arxiv.org/abs/2602.06039

  3. Liu, Y., et al. "CASTER: Context-Aware Strategy for Task Efficient Routing." arXiv preprint, 2026. https://arxiv.org/abs/2601.19793

  4. Wang, J., et al. "MonoScale: Scaling Multi-Agent Systems with Monotonic Improvement Guarantees." arXiv preprint, 2026. https://arxiv.org/abs/2601.23219

  5. Chen, Z., et al. "Phase Transition for Budgeted Multi-Agent Synergy." arXiv preprint, 2026. https://arxiv.org/abs/2601.17311

  6. Kim, S., et al. "Bayesian Orchestration: Cost-Aware Sequential Decision-Making for Multi-LLM Workflows." arXiv preprint, 2026. https://arxiv.org/abs/2601.01522


This article is part of a six-part series on building autonomous self-improvement agents, grounded in research from VoltAgent/awesome-ai-agent-papers. Data and implementation details from nomadically.work.

How We Built an Agent That Edits Its Own Instructions

· 9 min read
Vadim Nicolai
Senior Software Engineer

Most AI systems have a hard boundary between the instructions they follow and the work they do. Developers write prompts; the AI executes them. If the prompts are wrong, a human fixes them. We built an agent that fixes its own prompts.

The Skill Evolver is the third agent in our six-agent autonomous self-improvement pipeline for nomadically.work. Its scope is precisely defined: it can edit skill files, commands, hooks, CLAUDE.md, and memory files. It cannot touch application source code — that's the Code Improver's job. This agent improves the instructions that all other agents follow.

Five research papers informed its design, curated from the VoltAgent/awesome-ai-agent-papers collection. Each one solved a different aspect of the self-modification problem.

Note: The implementation has since evolved from a generic skill evolver into a goal-driven "Classifier Tuner" focused on reducing false negatives in remote EU job classification. The research principles described here still underpin the architecture. The data structures and anti-patterns below reflect the original design that these papers informed.

The Self-Modification Problem

The idea of AI editing its own prompts sounds either brilliant or terrifying, depending on your perspective. The Awesome Self-Evolving Agents survey catalogs dozens of papers on the topic, but almost none show a working implementation with safety constraints. The gap between "agents that can self-modify" and "agents that should self-modify" is where the interesting engineering lives.

The Skill Evolver's scope is deliberately narrow: it can edit skill files, commands, hooks, CLAUDE.md, and memory files. It cannot touch application source code. This boundary is the single most important design decision: an agent that can modify its own instructions and the codebase has an unbounded blast radius. For an agent restricted to instruction files, the worst case is a bad prompt or hook, which the Verification Gate will catch.

Five Papers That Shaped the Design

Meta Context Engineering: A Meta-Agent Evolving Its Own Context

Meta Context Engineering (Ye et al., 2026) proposes a meta-agent architecture where the agent responsible for improving the system is itself guided by structured context that it can modify. The paper demonstrates that agents with self-modifiable context outperform those with fixed instructions on long-horizon tasks, because they accumulate task-specific knowledge over time.

The key contribution is formalizing the feedback loop: observe performance → diagnose root cause → modify context → observe new performance. Without this structure, self-modification tends toward either overfitting (reacting to every failure) or inertia (never changing anything).

How we used it: The Skill Evolver's Evidence-Based Evolution process is a direct implementation. Every proposed change must link to evidence:

EVOLUTION: {
id: "E-001",
target_file: "path to file being modified",
trigger_patterns: ["P-xxx"],
trigger_findings: ["F-xxx"],
change_type: "add_instruction|clarify_instruction|remove_instruction|...",
before: "The exact text being replaced",
after: "The new text",
rationale: "Why this specific change addresses the root cause",
expected_impact: {
"dimensions": ["which scores should improve"],
"magnitude": "small|medium|large",
"regression_risk": "none|low|medium|high"
}
}

No evolution can happen without a trigger_patterns or trigger_findings reference. This prevents the agent from making "improvements" based on vibes rather than evidence.
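A guard for this rule might look like the following sketch. The function name and error strings are hypothetical; only the two trigger fields and the "P-" and "F-" ID prefixes come from the structure above.

```python
def validate_evolution(evolution: dict) -> list[str]:
    """Return a list of problems; an empty list means the evolution is evidence-backed."""
    errors = []
    patterns = evolution.get("trigger_patterns") or []
    findings = evolution.get("trigger_findings") or []
    # The core rule: at least one pattern or finding must be referenced.
    if not patterns and not findings:
        errors.append("no trigger_patterns or trigger_findings reference")
    # Assumed extra check: references should use the ID prefixes shown in the schema.
    errors += [f"malformed pattern id: {p}" for p in patterns if not p.startswith("P-")]
    errors += [f"malformed finding id: {f}" for f in findings if not f.startswith("F-")]
    return errors
```

Running a check like this before any file is touched means an evolution without evidence is rejected at proposal time rather than at verification time.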

EvoConfig: Self-Evolving Configuration

EvoConfig (Guo et al., 2026) addresses the configuration problem in multi-agent systems: as the system evolves, configuration parameters that were optimal at deployment become suboptimal. The paper proposes a mechanism for agents to update their own configuration based on observed performance, with guardrails to prevent catastrophic changes.

How we used it: The Skill Evolver's scope definition is an EvoConfig-style approach. Rather than allowing unconstrained self-modification, we define exactly what can be modified:

CAN edit: .claude/skills/*/SKILL.md, .claude/commands/*.md, .claude/hooks/*.py, CLAUDE.md, OPTIMIZATION-STRATEGY.md, auto-memory files.

CANNOT edit: Application source code, schema files, configuration files, generated files.
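One way to enforce such an allowlist is a path check run before every write. The sketch below assumes repo-relative paths with forward slashes, and it leaves out the auto-memory files because the text gives no path for them; `EDITABLE_GLOBS`, `matches_glob`, and `can_edit` are illustrative names.

```python
from fnmatch import fnmatchcase

# Globs copied from the CAN edit list (auto-memory files omitted: no path is given).
EDITABLE_GLOBS = [
    ".claude/skills/*/SKILL.md",
    ".claude/commands/*.md",
    ".claude/hooks/*.py",
    "CLAUDE.md",
    "OPTIMIZATION-STRATEGY.md",
]


def matches_glob(path: str, pattern: str) -> bool:
    """Match segment by segment so that '*' never crosses a '/' boundary."""
    path_parts = path.split("/")
    pattern_parts = pattern.split("/")
    if len(path_parts) != len(pattern_parts):
        return False
    return all(fnmatchcase(part, glob) for part, glob in zip(path_parts, pattern_parts))


def can_edit(path: str) -> bool:
    """Allow a write only if the path is inside the allowlist and has no '..' segment."""
    if ".." in path.split("/"):
        return False
    return any(matches_glob(path, glob) for glob in EDITABLE_GLOBS)
```

Matching per segment is a deliberate choice here: plain `fnmatch` lets `*` match across `/`, which would quietly widen the allowlist to nested directories.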

This allowlist is what enforces the scope boundary described earlier: anything outside it is off limits to the Skill Evolver, so its worst-case output stays within what the Verification Gate can catch.

CASTER: Self-Optimization via Negative Feedback

CASTER (Liu et al., 2026) is a self-optimizing router that uses negative feedback loops to improve its task routing over time. When a routed task fails, CASTER adjusts its routing strategy based on the failure signal. The paper shows that this negative-feedback approach converges faster than positive-feedback methods because failures are more informative than successes.

How we used it: The Apply Changes phase implements CASTER's principle. The agent follows these priorities when editing:

  1. Minimal diff — change as little as possible
  2. Additive over destructive — add clarifications rather than removing instructions
  3. Specific over general — concrete examples rather than abstract rules
  4. Testable — every instruction should be verifiable by the scoring system

The negative feedback connection: the Skill Evolver only acts when scores drop. It doesn't optimize what's already working — it fixes what's failing. Each evolution targets the specific failure mode (hallucination, wrong_tool, out_of_role) identified in the mining report.

REprompt: Requirements-Guided Prompt Optimization

REprompt (Shi et al., 2026) formalizes prompt optimization as a requirements satisfaction problem. Rather than iteratively tweaking prompts, it starts with explicit requirements (what the prompt must achieve) and systematically generates prompts that satisfy them. The paper demonstrates that requirements-guided optimization produces more robust prompts than iterative refinement.

How we used it: Every evolution has explicit expected_impact.dimensions — the score dimensions it should improve. This transforms prompt editing from "this sounds better" into "this should increase task_completion scores by addressing the root cause of pattern P-003." The Verification Gate later checks whether the expected improvement materialized.
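The later check on expected impact can be pictured as a per-dimension comparison of scores before and after the evolution. This is a sketch with assumed inputs: `before` and `after` are hypothetical dictionaries mapping dimension names to scores, and a dimension missing from either side is treated as not improved.

```python
def impact_materialized(expected_dimensions: list[str], before: dict, after: dict) -> dict:
    """For each expected dimension, report whether its score actually went up."""
    results = {}
    for dim in expected_dimensions:
        if dim not in before or dim not in after:
            results[dim] = False  # no data means the claim cannot be confirmed
            continue
        results[dim] = after[dim] > before[dim]
    return results
```

Only the dimensions named in expected_impact are checked, so an evolution is judged against its own stated requirement rather than against any score that happened to move.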

Autonomous Question Formation

"Autonomous Question Formation" (Wang et al., 2025) argues that agents should ask themselves targeted questions before acting, rather than relying solely on the information provided. The paper shows that self-questioning reduces errors on complex tasks by forcing the agent to make implicit assumptions explicit.

How we used it: The Skill Evolver includes a mandatory Self-Questioning step before any edit:

  • Will this change help the specific failure pattern, or is it too broad?
  • Could this change cause regression in other dimensions?
  • Is the root cause in the skill instructions, or in the code the skill operates on?
  • Am I adding complexity that will make the skill harder to follow?
  • Is there a simpler fix (e.g., one line added to CLAUDE.md)?

The last question is the most powerful. We found that early versions of the Skill Evolver would rewrite entire skill files when a single line in CLAUDE.md would have fixed the issue. The self-questioning forces it to consider the simplest intervention first.

The Five Anti-Patterns

Through iterative testing, we identified five failure modes that the Skill Evolver must avoid. These are documented in the skill file as explicit anti-patterns:

  1. Instruction bloat — continuously adding rules without removing old ones. Sometimes the fix is to simplify, not add. If a skill file grows past a certain size, the agents reading it may truncate or skip instructions.

  2. Contradictory instructions — adding a new rule that conflicts with an existing one. The agent must check for conflicts before writing. "Always use DataLoaders" and "query directly for single-item lookups" coexist fine — but only if both are stated.

  3. Over-specificity — adding rules for one-off incidents. The minimum threshold is frequency >= 2 (inherited from the Trajectory Miner). A single bad session doesn't justify a permanent instruction change.

  4. Prompt engineering theater — plastering "IMPORTANT:" and "CRITICAL:" on everything. These markers lose their meaning when overused. The skill file explicitly says: "Be precise instead."

  5. Cargo cult — copying patterns from research papers without understanding why they work in this specific context. A paper showing that chain-of-thought improves math reasoning doesn't mean we should add chain-of-thought to every skill file.

The Feedback Loop

The Skill Evolver doesn't operate in isolation. Its changes are validated by the Verification Gate, which checks:

  • Do modified skill files still make internal sense? (Coherence check)
  • Do changes conflict with other skills? (Cross-skill check)
  • Are CLAUDE.md changes consistent? (Consistency check)
  • Do hook modifications preserve fail-open design? (Hook verification)

If the Verification Gate rejects a skill evolution, the Meta-Optimizer records the failure and adjusts future priorities. Over time, the system learns which types of skill changes work and which don't — CASTER's negative feedback loop applied to the improvement pipeline itself.

The Safety Boundary

The Skill Evolver represents something genuinely novel: a system that improves its own instructions based on measured evidence, with safety constraints that prevent runaway self-modification. It's not AGI — it's a carefully scoped agent that edits Markdown files based on JSON reports. But the principle it implements — evidence-based self-modification with verification gates — is the foundation for autonomous systems that get better over time without human intervention.

The maximum of 5 evolutions per run, the requirement that every change links to evidence, the mandatory self-questioning, and the anti-pattern awareness all serve the same goal: making self-improvement boring and reliable rather than exciting and dangerous.

That's how we built an agent that edits its own instructions — and how we made sure it doesn't edit them into nonsense.

References

  1. Ye, H., et al. "Meta Context Engineering via Agentic Skill Evolution." arXiv preprint, 2026. https://arxiv.org/abs/2601.21557

  2. Guo, X., et al. "EvoConfig: Self-Evolving Multi-Agent Systems for Efficient Autonomous Environment Configuration." arXiv preprint, 2026. https://arxiv.org/abs/2601.16489

  3. Liu, S., et al. "CASTER: Context-Aware Strategy for Task Efficient Routing." arXiv preprint, 2026. https://arxiv.org/abs/2601.19793

  4. Shi, J., et al. "REprompt: Prompt Generation for Intelligent Software Development Guided by Requirements Engineering." arXiv preprint, 2026. https://arxiv.org/abs/2601.16507

  5. "Autonomous Question Formation for Large Language Model-Driven AI Systems." arXiv preprint, 2026. https://arxiv.org/abs/2602.01556



Why Do AI Agents Keep Making the Same Mistakes?

· 8 min read
Vadim Nicolai
Senior Software Engineer

Every Claude Code session leaves a trace — tool calls made, files read, edits applied, errors encountered, and ultimately a score reflecting how well the task was completed. Most systems discard this history. We built an agent that mines it.

The Trajectory Miner is the first agent in our six-agent autonomous self-improvement pipeline for nomadically.work, a remote EU job board aggregator. Its job: analyze past sessions, extract recurring patterns and reusable skills, and feed structured intelligence to the rest of the team. It writes no code. It produces raw material that other agents — the Codebase Auditor, Skill Evolver, and Code Improver — consume.

The design draws from four research papers, curated from the VoltAgent/awesome-ai-agent-papers collection. Here is what each paper contributes and how we translated academic ideas into a working system.

Note: The implementation has since evolved from a generic trajectory mining agent into a goal-driven "Pipeline Monitor" focused on job search pipeline health. The research principles described here still underpin the architecture, but the agent's focus has shifted to domain-specific priorities. The data structures and patterns below reflect the original design that these papers informed.

The Stateless Agent Problem

Devin, SWE-agent, OpenHands, Cursor — every major AI coding agent starts each session with a blank slate. They have no memory of what worked yesterday, no record of which approaches failed last week, no institutional knowledge accumulated over hundreds of sessions. Gartner reported a 1,445% surge in multi-agent system inquiries from Q1 2024 to Q2 2025, yet almost none of these systems learn from their own history.

The result is predictable: agents repeat the same mistakes. They grep for patterns when they should trace imports. They edit files they haven't read. They propose fixes that were already tried and rejected. Research on trajectory reduction (AgentDiet) shows that "useless, redundant, and expired information is widespread in agent trajectories" — but the solution isn't just trimming waste. It's extracting what worked and making it available for next time.

Four Papers That Solved Pieces of the Puzzle

AutoRefine: Extracting Reusable Expertise from Trajectories

AutoRefine (Cao et al., 2025) addresses a fundamental inefficiency in LLM agents: they solve similar problems from scratch every time. The paper proposes extracting "reusable expertise" from successful agent trajectories — essentially distilling what worked into transferable knowledge.

The key insight is that agent trajectories contain implicit expertise that can be made explicit through structured extraction. Rather than replaying entire trajectories, AutoRefine identifies the decision points that mattered and the reasoning patterns that led to success.

How we used it: Our Trajectory Miner's Pattern Extraction phase directly implements AutoRefine's approach. When the agent reads past improvement suggestions from ~/.claude/state/improvements/, it clusters them into recurring patterns:

PATTERN: {
id: "P-001",
frequency: N,
dimensions: [...],
failure_types: [...],
root_cause_cluster: "...",
affected_targets: [...],
example_sessions: [...],
severity: "critical|high|medium|low",
suggested_fix_type: "skill_update|prompt_edit|code_fix|architecture|config"
}

Each pattern must appear in at least two sessions to qualify as "recurring" — single occurrences are tracked as "incidents" but don't drive fixes. This threshold prevents overreacting to one-off anomalies, a practical constraint AutoRefine's paper doesn't explicitly address but that we found essential in production.
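The two-session threshold can be illustrated with a small grouping function. It is a simplification: grouping by an exact `root_cause` string stands in for whatever clustering the agent really performs, and the field names `session_id` and `root_cause` are assumptions for the example.

```python
from collections import defaultdict


def split_by_recurrence(suggestions: list[dict], min_sessions: int = 2):
    """Separate recurring patterns from one-off incidents by distinct session count."""
    sessions_by_cause = defaultdict(set)
    for item in suggestions:
        sessions_by_cause[item["root_cause"]].add(item["session_id"])
    recurring, incidents = {}, {}
    for cause, sessions in sessions_by_cause.items():
        target = recurring if len(sessions) >= min_sessions else incidents
        target[cause] = sorted(sessions)
    return recurring, incidents
```

Counting distinct sessions, not raw occurrences, matters: the same failure logged twice inside one bad session should still be an incident.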

ProcMEM: Procedural Memory for LLM Agents

ProcMEM (Xu et al., 2025) tackles agent memory from a different angle. Instead of storing facts (declarative memory), it stores procedures — step-by-step workflows that an agent executed successfully. The paper demonstrates that agents with procedural memory significantly outperform those with only declarative memory on repeated tasks.

The paper's core mechanism is a memory system that saves successful action sequences in a structured format, indexed by the type of task they solved. When the agent encounters a similar task, it retrieves the relevant procedure and adapts it.

How we used it: The Trajectory Miner's Procedural Skill Extraction phase implements ProcMEM's idea. For sessions that scored above 0.85 on all dimensions, the agent extracts what worked:

SKILL: {
id: "S-001",
description: "What the agent did well",
trigger: "When to apply this skill",
steps: [...],
tools_used: [...],
context_requirements: [...]
}

The trigger field is critical — it defines when a future agent should recall this skill. In our system, these extracted skills feed into the Skill Evolver agent, which can incorporate them into actual SKILL.md files that all agents read. This closes the loop: good behavior gets codified into instructions.
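Selecting the sessions worth extracting skills from comes down to the 0.85 cutoff mentioned above. A minimal sketch, assuming each session record carries a hypothetical `scores` dictionary keyed by dimension:

```python
def high_scoring_sessions(sessions: list[dict], threshold: float = 0.85) -> list[str]:
    """Return IDs of sessions whose every dimension scored strictly above the threshold."""
    return [
        s["session_id"]
        for s in sessions
        # Guard against empty score sets: all() over nothing would be True.
        if s["scores"] and all(value > threshold for value in s["scores"].values())
    ]
```

Requiring every dimension to clear the bar keeps a session with one excellent score and one mediocre score from being codified as a skill.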

SWE-Replay: Recycling Trajectories at Critical Decision Points

SWE-Replay (Ning et al., 2025) focuses specifically on software engineering agents. Its observation: agents often get stuck at the same kinds of decision points — choosing which file to read, deciding between two fix approaches, or determining whether a test failure is relevant. The paper proposes identifying these "critical steps" and replaying successful trajectory fragments from prior sessions.

The innovation is not just replay but selective replay — knowing which moments in a trajectory are the high-leverage decision points where the right choice cascades into success and the wrong choice cascades into failure.

How we used it: The Trajectory Miner identifies Replay Candidates:

REPLAY: {
stuck_session: "session_id",
stuck_at: "description of the critical step",
successful_pattern: "S-xxx",
expected_improvement: "What would change"
}

This connects failing sessions to successful patterns. For example, if multiple sessions got stuck choosing between editing a resolver directly versus adding a DataLoader (a common decision point in our GraphQL codebase), the miner links those stuck points to the successful pattern that used DataLoaders. The downstream agents then know: when you hit this decision point, here's what worked before.

Beyond Static Summarization: Proactive Self-Questioning

"Beyond Static Summarization" (Li et al., 2025) challenges the common practice of having agents produce flat summaries of their findings. Instead, it proposes that agents should ask themselves probing questions about their own analysis — a form of epistemic self-awareness.

The paper shows that agents that question their own conclusions produce more reliable analysis, catch their own biases, and flag genuine uncertainty rather than presenting everything with false confidence.

How we used it: The Trajectory Miner includes a mandatory Self-Questioning phase. For every pattern discovered, the agent must ask:

  • Is this a symptom or root cause?
  • Could this be caused by missing context rather than bad logic?
  • Is the fix in the skill instructions, the code, or the architecture?
  • Would this pattern disappear if a different model were used?
  • Is there a simpler explanation (e.g., truncated context)?

This prevents the most common failure mode we observed in early versions: the miner would identify a "pattern" that was actually just a side effect of context window truncation. The self-questioning catches this by forcing the agent to consider simpler explanations before proposing complex ones.

How It Fits in a Six-Agent Pipeline

The Trajectory Miner is the first agent in the improvement pipeline:

mine → audit → evolve/apply (parallel) → verify

It reads from ~/.claude/state/improvements/ — JSON files generated by our stop_hook scoring system, which evaluates every Claude Code session on dimensions like task completion, tool efficiency, skill adherence, and routing accuracy. Sessions scoring below threshold get queued for analysis.

The miner's output — a structured mining report at ~/.claude/state/mining-report.json — becomes the input for two downstream agents:

  1. Codebase Auditor receives pattern IDs to investigate in the actual code
  2. Skill Evolver receives extracted skills to incorporate into agent instructions

The Meta-Optimizer coordinates this flow, deciding when to mine, what to prioritize, and whether the system is in an improvement phase or approaching saturation.

What We Learned Building It

Most autonomous coding systems are stateless across sessions. Each invocation starts fresh, repeating mistakes and rediscovering solutions. The Trajectory Miner breaks this pattern by creating institutional memory — not as a monolithic knowledge base, but as structured patterns, procedures, and replay candidates that other agents can act on.

The key design choice was making the miner a pure analyst. It never writes code, never edits prompts, never makes decisions about what to fix. It only produces intelligence. This separation of concerns means it can be aggressive in its analysis without risk — the worst case is a false pattern that gets filtered out by downstream agents.

Seven rules govern its behavior, but the most important is rule 7: "Be skeptical — correlation is not causation." In a system designed to improve itself, the biggest risk is false positives that trigger unnecessary changes, creating churn instead of improvement. The miner's job is not to find everything — it's to find the patterns that are real.

The answer to "why do AI agents keep making the same mistakes" turns out to be simple: nobody built the memory system. The hard part isn't the mining — it's the discipline to only act on patterns that are real.

References

  1. "AutoRefine: From Trajectories to Reusable Expertise for Continual LLM Agent Refinement." arXiv preprint, 2026. https://arxiv.org/abs/2601.22758

  2. Fang, R., et al. "Mem^p: Exploring Agent Procedural Memory." arXiv preprint, 2025. https://arxiv.org/abs/2508.06433

  3. "SWE-Replay: Efficient Test-Time Scaling for Software Engineering Agents." arXiv preprint, 2026. https://arxiv.org/abs/2601.22129

  4. Yang, C., et al. "Beyond Static Summarization: Proactive Memory Extraction for LLM Agents." arXiv preprint, 2026. https://arxiv.org/abs/2601.04463



The Agent That Says No: Why Verification Beats Generation

· 8 min read
Vadim Nicolai
Senior Software Engineer

An autonomous improvement system without verification is just autonomous damage. The Code Improver can write fixes. The Skill Evolver can edit prompts. But neither should be trusted to judge its own work. That's why the Verification Gate exists.

The Verification Gate is the fifth agent in our six-agent autonomous self-improvement pipeline for nomadically.work. It validates every change made by the Skill Evolver and Code Improver before those changes are accepted. It never modifies code or skills — it only reads, checks, and reports a verdict.

Five research papers shaped its design, curated from the VoltAgent/awesome-ai-agent-papers collection. The common thread: autonomous systems need calibrated self-awareness about the quality of their own outputs.

Note: The implementation has since evolved from a generic verification gate into a goal-driven "Application Coach" focused on learning from application patterns and improving interview preparation. The research principles described here still underpin the architecture. The verification checks and verdict system below reflect the original design that these papers informed.

Generation Is Easy. Verification Is Hard.

Every AI coding tool in 2026 can generate code. Claude Code, Cursor, Devin, and OpenHands all produce edits. But Google's 2025 DORA Report found that a 90% increase in AI adoption correlates with a 91% increase in code review time. More generation without better verification is a net negative.

The problem compounds in autonomous systems. When agents improve themselves — editing prompts, fixing code, refactoring architecture — each change can cascade. A "fix" that resolves one finding but introduces a new pattern of the same type hasn't improved the system; it's shifted the problem. Without a dedicated verification agent, autonomous improvement becomes autonomous churn.

Five Papers on Calibrated Self-Awareness

Agentic Uncertainty: Agents Predicting Their Own Success

"Agentic Uncertainty" (Xu et al., 2025) addresses a critical gap in LLM agents: they generate outputs with uniform confidence regardless of actual quality. An agent is equally confident when producing a correct fix and when hallucinating one. The paper proposes mechanisms for agents to estimate their own success probability, enabling them to flag uncertain outputs rather than presenting everything as definitive.

How we used it: Every verification produces a confidence score from 0.0 to 1.0:

VERIFICATION: {
change_id: "E-xxx or F-xxx",
status: "PASS|WARN|FAIL",
confidence: 0.0-1.0,
checks_performed: ["coherence", "convention", "regression", "build"],
issues: [...]
}

The confidence score reflects the Verification Gate's assessment of its own verification quality. A confidence of 0.95 means "I thoroughly checked this and found no issues." A confidence of 0.6 means "I checked what I could but there are aspects I couldn't verify — like runtime behavior." This self-awareness propagates to the Meta-Optimizer, which treats low-confidence verifications differently from high-confidence ones.

TrajAD: Trajectory Anomaly Detection

TrajAD (Li et al., 2025) detects anomalous trajectories in LLM agent execution — sequences of actions that deviate from expected patterns in ways that predict failure. Rather than evaluating individual actions, TrajAD evaluates the trajectory as a whole.

How we used it: The Verification Gate doesn't just check individual changes — it checks the trajectory of the entire improvement cycle. When reviewing the implementation log, it traces:

  1. Was the finding correctly identified by the auditor?
  2. Did the Code Improver's fix actually address the finding's root cause?
  3. Does the fix introduce patterns that the auditor would flag in a future cycle?

Point 3 is the TrajAD insight: a fix that resolves one finding but creates a new pattern of the same type hasn't actually improved the system — it's shifted the problem.

Agentic Confidence Calibration

"Agentic Confidence Calibration" (Wang et al., 2025) extends uncertainty estimation from individual predictions to entire agent trajectories. The paper shows that holistic calibration — considering the full sequence of decisions — produces more reliable confidence estimates than calibrating individual steps.

How we used it: The Verification Gate performs five checks on every change, and the overall confidence is a holistic assessment across all of them — not an average of individual check scores. A change might pass coherence, convention, and build checks (each at 0.9+) but fail the regression check (0.4), and the overall confidence should reflect the weakest link.

The five checks form a verification trajectory:

  1. Coherence Check — Does the modified file still make internal sense?
  2. Cross-Skill Check — Does this change conflict with other skills?
  3. Convention Check — Does the code follow CLAUDE.md conventions?
  4. Regression Check — Did the fix break neighboring code?
  5. Build Check — Do pnpm lint and pnpm build pass?
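To make the "weakest link" idea concrete, the sketch below contrasts an average with a minimum over per-check scores. Taking the minimum is an assumption made for illustration; the Verification Gate's holistic assessment is a judgment, not necessarily this formula.

```python
from statistics import mean


def overall_confidence(check_scores: dict[str, float]) -> float:
    """Weakest-link aggregation: the overall score is the lowest individual check."""
    if not check_scores:
        raise ValueError("at least one check is required")
    return min(check_scores.values())


# The example from the text: three strong checks and one weak regression check.
scores = {"coherence": 0.9, "convention": 0.9, "build": 0.9, "regression": 0.4}
averaged = mean(scores.values())       # about 0.775, which hides the regression risk
weakest = overall_confidence(scores)   # 0.4, which surfaces it
```

An average would report this change as fairly safe, while the weakest-link value keeps the failed regression check visible to the Meta-Optimizer.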

LUMINA: Counterfactual Verification

LUMINA (Zhang et al., 2025) introduces an oracle counterfactual framework for evaluating agent actions. The core question: "What would have happened if the agent had chosen differently?" By constructing counterfactual scenarios, LUMINA evaluates whether an agent's choices were actually beneficial or merely not harmful.

How we used it: The Verification Gate includes a mandatory Counterfactual Analysis phase. For every change, the agent must consider:

  • What would happen if this change were NOT made? This tests whether the change is actually needed. If removing the change wouldn't degrade anything, it's probably unnecessary complexity.

  • What would happen if this change were applied incorrectly? This assesses blast radius. A type annotation fix has low blast radius. A DataLoader refactor could cascade through every resolver.

  • Is there a simpler alternative that achieves the same goal? This prevents over-engineering. If a one-line WHERE clause fix achieves the same result as a three-file refactor, the simpler change is better.

The counterfactual analysis is especially powerful for skill evolutions. When the Skill Evolver adds a new instruction to a SKILL.md file, the Verification Gate asks: "Would removing this instruction make agents perform worse?" If the answer isn't clearly "yes," the instruction probably shouldn't exist.

Determinism-Faithfulness Harness

The "Determinism-Faithfulness Harness" (Khatchadourian, 2026) measures two properties of agent trajectories: determinism (does the agent produce the same trajectory given the same inputs?) and faithfulness (does the agent follow its instructions?). The paper argues that these properties are prerequisites for trustworthy autonomous systems.

How we used it: Faithfulness checking maps directly to the Convention Check. The Verification Gate reads CLAUDE.md and checks whether code changes follow the documented conventions:

  • Drizzle ORM, not raw SQL?
  • Admin guards on mutations?
  • DataLoaders for N+1?
  • Generated types, not hand-written?
  • @/* path aliases, not relative imports?

This is literal faithfulness verification — checking whether the Code Improver followed the instructions it was given. Determinism is harder to verify in a single pass, but the Meta-Optimizer tracks whether similar findings receive similar fixes across cycles.

The Four Verdicts

The Verification Gate produces one of four verdicts:

| Verdict | Meaning | Pipeline Action |
| --- | --- | --- |
| ACCEPT | All changes pass | Proceed to commit |
| ACCEPT_WITH_WARNINGS | Minor issues, not blocking | Proceed, but log warnings |
| REJECT | Critical issues found | Changes need revision |
| PARTIAL | Some pass, some fail | Accept passing changes, revise failures |

A single critical issue forces a REJECT verdict. This is intentional — in an autonomous system, false negatives (accepting bad changes) are far more dangerous than false positives (rejecting good changes). A rejected good change can be retried. An accepted bad change propagates through the system.

The PARTIAL verdict is the most common in practice. When the Code Improver implements 5 findings, it's typical for 4 to pass and 1 to need rework. The pipeline accepts the 4 and queues the 1 for the next cycle.
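The verdict rules can be sketched as a fold over per-change results. The precedence below is one possible reading, stated as an assumption: a hypothetical `critical` flag on any result forces REJECT, ordinary failures alongside passing changes give PARTIAL, and a batch where everything failed is also rejected.

```python
def pipeline_verdict(results: list[dict]) -> str:
    """Combine per-change statuses (PASS, WARN, FAIL) into one pipeline verdict."""
    if not results:
        raise ValueError("nothing to verify")
    # A single critical issue rejects the whole batch.
    if any(r.get("critical", False) for r in results):
        return "REJECT"
    statuses = [r["status"] for r in results]
    failed = statuses.count("FAIL")
    if failed == len(statuses):
        return "REJECT"
    if failed:
        return "PARTIAL"
    if "WARN" in statuses:
        return "ACCEPT_WITH_WARNINGS"
    return "ACCEPT"
```

With five findings where four pass and one fails without being critical, this returns PARTIAL, which is the common case described above.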

Build Verification

Beyond logical checks, the Verification Gate runs concrete build steps:

{
  "build_result": {
    "lint": "pass|fail",
    "build": "pass|fail|skipped",
    "errors": ["any error messages"]
  }
}

pnpm lint runs for every verification. pnpm build runs only when TypeScript source in src/ was changed. The project sets ignoreBuildErrors: true in next.config.ts (a known issue), so a Next.js production build completes even when type errors are present. The build step is still useful for catching failures that break compilation, such as syntax errors or unresolved imports.
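The rule for when each step runs can be sketched as a function that produces the `build_result` shape shown above. This is a minimal sketch under stated assumptions: `verifyBuild`, `needsBuild`, the `Runner` type, and the file-matching pattern are hypothetical names introduced here, and the command runner is injected so that the decision logic can be exercised without invoking pnpm.

```typescript
// Hypothetical sketch: function names, the Runner type, and the src/ pattern
// are assumptions made for illustration, not the project's implementation.
type StepStatus = "pass" | "fail";

interface BuildResult {
  lint: StepStatus;
  build: StepStatus | "skipped";
  errors: string[];
}

interface StepOutcome {
  ok: boolean;
  output: string;
}

// Executes a shell command and reports whether it succeeded.
type Runner = (command: string) => StepOutcome;

// The build is only needed when a .ts or .tsx file under src/ changed.
function needsBuild(changedFiles: string[]): boolean {
  return changedFiles.some((f) => /^src\/.+\.tsx?$/.test(f));
}

function verifyBuild(changedFiles: string[], run: Runner): BuildResult {
  const errors: string[] = [];

  // Lint runs for every verification.
  const lint = run("pnpm lint");
  if (!lint.ok) errors.push(lint.output);

  // Build stays "skipped" unless TypeScript source in src/ was touched.
  let build: BuildResult["build"] = "skipped";
  if (needsBuild(changedFiles)) {
    const outcome = run("pnpm build");
    build = outcome.ok ? "pass" : "fail";
    if (!outcome.ok) errors.push(outcome.output);
  }

  return { lint: lint.ok ? "pass" : "fail", build, errors };
}
```

In a real pipeline the `Runner` could wrap something like Node's `child_process.spawnSync`. Keeping it as a parameter is a design choice for this sketch: the skip logic stays a pure function of the changed file list.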

Why Saying No Is the Most Important Feature

The Verification Gate embodies a principle that the autonomous AI community is still learning: the quality of an autonomous system is determined not by its generation capabilities but by its verification capabilities. A system that generates mediocre changes but catches every bad one is more valuable than a system that generates brilliant changes but occasionally lets through a catastrophic one.

Rule 5 says it directly: "Be honest about confidence — don't rubber-stamp changes." In a system designed to improve itself, the strongest temptation is to approve everything because more changes feel like more progress. The Verification Gate's job is to resist that temptation — to be the agent that says "no" and means it.

References

  1. Kaddour, J., et al. "Agentic Uncertainty Reveals Agentic Overconfidence." arXiv preprint, 2026. https://arxiv.org/abs/2602.06948

  2. Pathak, D., et al. "Detecting Silent Failures in Multi-Agentic AI Trajectories." arXiv preprint, 2025. https://arxiv.org/abs/2511.04032

  3. "Agentic Confidence Calibration." arXiv preprint, 2026. https://arxiv.org/abs/2601.15778

  4. "LUMINA: Long-horizon Understanding for Multi-turn Interactive Agents." arXiv preprint, 2026. https://arxiv.org/abs/2601.16649

  5. Khatchadourian, R. "Replayable Financial Agents: A Determinism-Faithfulness Assurance Harness for Tool-Using LLM Agents." arXiv preprint, 2026. https://arxiv.org/abs/2601.15322


This article is part of a six-part series on building autonomous self-improvement agents, grounded in research from VoltAgent/awesome-ai-agent-papers. Data and implementation details from nomadically.work.