Skip to main content

Langfuse Features: Prompts, Tracing, Scores, Usage

· 11 min read
Vadim Nicolai
Senior Software Engineer

A comprehensive guide to implementing Langfuse features for production-ready AI applications, covering prompt management, tracing, evaluation, and observability.

Overview

This guide covers:

  • Prompt management with caching and versioning
  • Distributed tracing with OpenTelemetry
  • User feedback and scoring
  • Usage tracking and analytics
  • A/B testing and experimentation

OpenRouter Integration with DeepSeek

· 9 min read
Vadim Nicolai
Senior Software Engineer

This article documents the complete OpenRouter integration implemented in Nomadically.work, using DeepSeek models exclusively through a unified API.

Architecture Overview

Module Structure

Core Features

1. Provider Configuration

The provider layer handles OpenRouter API communication using the OpenAI SDK compatibility layer.

Implementation Details:

  • Uses @ai-sdk/openai package for API compatibility
  • Lazy-loaded provider instance to support testing without API key
  • Configurable reasoning tokens (default: 10,000 max_tokens)
  • Custom headers for analytics and tracking

2. DeepSeek Model Access

Five DeepSeek models are available through the integration:

Model Selection Guide:

  • DeepSeek Chat: General-purpose conversations, Q&A, text generation
  • DeepSeek R1: Complex reasoning, multi-step analysis, decision-making
  • DeepSeek Coder: Code generation, debugging, technical documentation
  • R1 Distill Qwen 32B: Faster inference for reasoning tasks
  • R1 Distill Llama 70B: High-quality reasoning with better performance

3. Agent Creation Patterns

Three patterns for creating agents with different levels of abstraction:

Pattern Comparison:

PatternUse CaseFlexibilitySetup Time
TemplatesQuick prototyping, demosLowSeconds
HelpersStandard agents with custom configMediumMinutes
DirectAdvanced use cases, full controlHighMinutes

4. Agent Template Flow

5. Configuration System

Usage Examples

Basic Agent Creation

import { agentTemplates } from "@/openrouter";

// Quick start with template
const assistant = agentTemplates.assistant();

const response = await assistant.generate([
{ role: "user", content: "What are remote work benefits?" },
]);

Custom Agent with Specific Model

import { createChatAgent, deepseekModels } from "@/openrouter";

// Using helper function
const jobClassifier = createChatAgent({
id: "job-classifier",
name: "Job Classifier",
instructions: "You are an expert at classifying job postings.",
model: "chat",
});

// Or using model directly
import { Agent } from "@mastra/core/agent";

const reasoningAgent = new Agent({
model: deepseekModels.r1(),
name: "Reasoning Agent",
instructions: "Think step by step about complex problems.",
});

Advanced Configuration

import { createOpenRouter, DEEPSEEK_MODELS } from "@/openrouter";

const customProvider = createOpenRouter({
reasoning: {
max_tokens: 15000,
},
headers: {
"HTTP-Referer": "https://nomadically.work",
"X-Title": "Job Platform AI",
},
});

const model = customProvider(DEEPSEEK_MODELS.R1);

Data Flow

Request Flow

Error Handling Flow

Integration Points

Mastra Agent Integration

Environment Configuration

Required Variables

# Core configuration
OPENROUTER_API_KEY=sk-or-v1-your-api-key-here

# Optional configuration
OPENROUTER_SITE_NAME="Nomadically.work"
OPENROUTER_SITE_URL="https://nomadically.work"

Deployment Flow

Performance Characteristics

Model Comparison

Benefits

OpenRouter Advantages

Testing Strategy

Test Coverage

Run tests with:

pnpm test:openrouter

Type Safety

TypeScript Types

Migration Path

From Direct DeepSeek SDK

Resources

Summary

This OpenRouter integration provides:

  • Unified API Access - Single interface for all DeepSeek models
  • Type-Safe - Full TypeScript support with compile-time validation
  • Flexible - Three levels of abstraction for different use cases
  • Production-Ready - Error handling, fallbacks, and monitoring
  • Well-Tested - Comprehensive test suite with live API validation
  • Well-Documented - Complete examples and migration guides

The module is designed for scalability, maintainability, and developer experience while providing reliable access to state-of-the-art AI models through OpenRouter's infrastructure.

AI-Driven Company Enrichment with DeepSeek via Cloudflare Browser Rendering

· 4 min read
Vadim Nicolai
Senior Software Engineer

This page documents an AI-first enrichment pipeline that turns a company website into a clean, structured company profile you can safely persist into your database and expose through GraphQL.

The core idea is simple:

  • Use Cloudflare Browser Rendering /json to load a real rendered page (including JavaScript-heavy sites).
  • Use DeepSeek to convert the rendered page into a strict JSON-only object (no markdown, no prose).

High-level architecture

This pipeline has five clear layers, each with a single responsibility:

  • Entry: GraphQL mutation identifies the target company.
  • Acquisition: Browser Rendering fetches a fully rendered page.
  • Extraction: DeepSeek converts HTML into JSON-only structure.
  • Governance: validation, normalization, and audit snapshot.
  • Persistence: upserts for company + ATS boards, then return.

Classification

A single enum-like category so downstream logic can branch cleanly:

  • company.category is one of:
    • CONSULTANCY | AGENCY | STAFFING | DIRECTORY | PRODUCT | OTHER | UNKNOWN

UNKNOWN is intentionally allowed to prevent “forced certainty”.

Two links that unlock most automation:

  • company.careers_url — best official careers entrypoint (prefer internal)
  • company.linkedin_url — best LinkedIn company page (/company/...)

Hiring infrastructure

Detect ATS/job boards (useful for job syncing, vendor analytics, integrations):

  • ats_boards[] entries containing:
    • url
    • vendor
    • board_type (ats | careers_page | jobs_board)
    • confidence (0..1)
    • is_active

Provenance and uncertainty

To keep AI outputs accountable:

  • evidence — where it came from (URL) + any known fetch metadata
  • notes[] — uncertainty/caveats without polluting structured fields

Top-down architecture


Why Cloudflare Browser Rendering /json is the right AI boundary

Many company websites are JS-heavy (SPAs), and the key links you want (Careers, LinkedIn, ATS) often live in:

  • global navigation/header
  • footer “social” section
  • content that only appears after JS renders

The /json endpoint is designed to extract structured JSON from the rendered page, using:

  • url (or html)
  • a prompt (and optionally response_format for JSON Schema depending on provider support)
  • custom_ai to route extraction through your chosen model

For JS-heavy pages, waiting for rendering to finish matters. This is why the extractor uses:

  • gotoOptions.waitUntil = "networkidle0"

AI contract: JSON-only output

When you route through custom_ai with BYO providers, schema-enforced responses can be provider-dependent. The safest universal strategy is:

  • treat the prompt as a strict contract
  • demand ONLY valid JSON
  • define the expected shape explicitly
  • instruct null/[] for unknown values
  • push uncertainty into notes[]

This turns an LLM into a bounded parser.


Implementation: Cloudflare-first with a direct DeepSeek fallback

Below is the same flow, expressed as architecture instead of code:

  • Inputs: company id/key and target URL.
  • Acquisition: Browser Rendering /json fetches a rendered page.
  • Extraction: DeepSeek produces a JSON-only record.
  • Governance: validate, normalize, and snapshot the output.
  • Persistence: upsert company + ATS boards, then return result.

Persistence guardrails (keep the AI safe)

Even with JSON-only output, the DB write must remain your code’s responsibility.

1) Validate shape before persistence

At minimum, verify:

  • company.name exists and is non-empty
  • any present URLs are absolute (https://...)
  • arrays are arrays
  • category is one of the allowed values

If validation fails, either retry extraction (stricter prompt) or fall back.

2) Canonicalize URLs before upserts

To avoid duplicates, normalize:

  • remove #fragment
  • normalize trailing slash
  • lowercase host
  • optionally strip tracking params

3) Treat vendor and board_type as hints

LLMs can emit vendor variants (e.g., Smart Recruiters, smartrecruiters). Normalize before mapping to enums.

4) Always snapshot the raw extraction

Saving the full ExtractionResult into companySnapshots.extracted buys you:

  • debugging (“why did this change?”)
  • regression detection
  • prompt iteration without losing history

References

https://github.com/nicolad/nomadically.work

https://nomadically.work/

Agent Skills spec + Mastra integration

· 9 min read
Vadim Nicolai
Senior Software Engineer

Agent Skills Specification

Source: https://agentskills.io/specification

This document defines the Agent Skills format.

Directory structure

A skill is a directory containing at minimum a SKILL.md file:

skill-name/
└── SKILL.md # Required

Tip: You can optionally include additional directories such as scripts/, references/, and assets/ to support your skill.

SKILL.md format

The SKILL.md file must contain YAML frontmatter followed by Markdown content.

Frontmatter (required)

Minimal example:

---
name: skill-name
description: A description of what this skill does and when to use it.
---

With optional fields:

---
name: pdf-processing
description: Extract text and tables from PDF files, fill forms, merge documents.
license: Apache-2.0
metadata:
author: example-org
version: "1.0"
---
FieldRequiredNotes
nameYesMax 64 characters. Lowercase letters, numbers, and hyphens only. Must not start or end with a hyphen.
descriptionYesMax 1024 characters. Non-empty. Describes what the skill does and when to use it.
licenseNoLicense name or reference to a bundled license file.
compatibilityNoMax 500 characters. Indicates environment requirements (intended product, system packages, network access, etc.).
metadataNoArbitrary key-value mapping for additional metadata.
allowed-toolsNoSpace-delimited list of pre-approved tools the skill may use. (Experimental)
name field

The required name field:

  • Must be 1-64 characters
  • May only contain unicode lowercase alphanumeric characters and hyphens (a-z and -)
  • Must not start or end with -
  • Must not contain consecutive hyphens (--)
  • Must match the parent directory name

Valid examples:

name: pdf-processing
name: data-analysis
name: code-review

Invalid examples:

name: PDF-Processing  # uppercase not allowed
name: -pdf  # cannot start with hyphen
name: pdf--processing  # consecutive hyphens not allowed
description field

The required description field:

  • Must be 1-1024 characters
  • Should describe both what the skill does and when to use it
  • Should include specific keywords that help agents identify relevant tasks

Good example:

description: Extracts text and tables from PDF files, fills PDF forms, and merges multiple PDFs. Use when working with PDF documents or when the user mentions PDFs, forms, or document extraction.

Poor example:

description: Helps with PDFs.
license field

The optional license field:

  • Specifies the license applied to the skill
  • We recommend keeping it short (either the name of a license or the name of a bundled license file)

Example:

license: Proprietary. LICENSE.txt has complete terms
compatibility field

The optional compatibility field:

  • Must be 1-500 characters if provided
  • Should only be included if your skill has specific environment requirements
  • Can indicate intended product, required system packages, network access needs, etc.

Examples:

compatibility: Designed for Claude Code (or similar products)
compatibility: Requires git, docker, jq, and access to the internet

Note: Most skills do not need the compatibility field.

metadata field

The optional metadata field:

  • A map from string keys to string values
  • Clients can use this to store additional properties not defined by the Agent Skills spec
  • We recommend making your key names reasonably unique to avoid accidental conflicts

Example:

metadata:
author: example-org
version: "1.0"
allowed-tools field

The optional allowed-tools field:

  • A space-delimited list of tools that are pre-approved to run
  • Experimental. Support for this field may vary between agent implementations

Example:

allowed-tools: Bash(git:*) Bash(jq:*) Read

Body content

The Markdown body after the frontmatter contains the skill instructions. There are no format restrictions. Write whatever helps agents perform the task effectively.

Recommended sections:

  • Step-by-step instructions
  • Examples of inputs and outputs
  • Common edge cases

Note: The agent will load this entire file once it's decided to activate a skill. Consider splitting longer SKILL.md content into referenced files.

Optional directories

scripts/

Contains executable code that agents can run. Scripts should:

  • Be self-contained or clearly document dependencies
  • Include helpful error messages
  • Handle edge cases gracefully

Supported languages depend on the agent implementation. Common options include Python, Bash, and JavaScript.

references/

Contains additional documentation that agents can read when needed:

  • REFERENCE.md - Detailed technical reference
  • FORMS.md - Form templates or structured data formats
  • Domain-specific files (finance.md, legal.md, etc.)

Keep individual reference files focused. Agents load these on demand, so smaller files mean less use of context.

assets/

Contains static resources:

  • Templates (document templates, configuration templates)
  • Images (diagrams, examples)
  • Data files (lookup tables, schemas)

Progressive disclosure

Skills should be structured for efficient use of context:

  1. Metadata (~100 tokens): The name and description fields are loaded at startup for all skills
  2. Instructions (< 5000 tokens recommended): The full SKILL.md body is loaded when the skill is activated
  3. Resources (as needed): Files (e.g. those in scripts/, references/, or assets/) are loaded only when required

Keep your main SKILL.md under 500 lines. Move detailed reference material to separate files.

File references

When referencing other files in your skill, use relative paths from the skill root:

See [the reference guide](references/REFERENCE.md) for details.

Run the extraction script:
scripts/extract.py

Keep file references one level deep from SKILL.md. Avoid deeply nested reference chains.

Validation

Use the skills-ref reference library to validate your skills:

skills-ref validate ./my-skill

This checks that your SKILL.md frontmatter is valid and follows all naming conventions.


Documentation index first

The Agent Skills docs are designed to be discovered via a single index file (llms.txt). Use that as the entrypoint whenever you’re exploring the spec surface area.


What are skills?

Agent Skills are a lightweight, file-based format for packaging reusable agent instructions and workflows (plus optional scripts/assets). Agents use progressive disclosure:

  1. Discovery: load only name + description metadata
  2. Activation: load the full SKILL.md body for a matching task
  3. Execution: read references / run scripts as needed
    oaicite:1

Skill directory structure

Minimum required:

skill-name/
└── SKILL.md

Common optional directories (same convention is used by Mastra workspaces):

skill-name/
├── SKILL.md
├── references/ # extra docs (optional)
├── scripts/ # executable code (optional)
└── assets/ # templates/images/etc. (optional)

SKILL.md specification essentials

Frontmatter requirements

SKILL.md must start with YAML frontmatter with at least:

  • name (strict naming constraints; should match the folder name)
  • description (non-empty; should say what + when; include “trigger keywords”)

Optional fields defined by the spec include license, compatibility, metadata, and experimental allowed-tools.

Body content

After frontmatter: normal Markdown instructions. The spec recommends practical steps, examples, and edge cases (and keeping SKILL.md reasonably small to support progressive disclosure).

A spec-friendly template

---
name: code-review
description: Reviews code for quality, style, and potential issues. Use when asked to review PRs, diffs, TypeScript/Node projects, or linting failures.
license: Apache-2.0
compatibility: Requires node and access to repository files
metadata:
version: "1.0.0"
tags: "development review"
---

# Code Review

## When to use this skill
- Trigger phrases: "review this PR", "code review", "lint errors", "style guide"

## Procedure
1. Identify the change scope and risk.
2. Check for correctness, edge cases, and error handling.
3. Verify style rules in references/style-guide.md.
4. If available, run scripts/lint.ts and summarize results.

## Output format
- Summary
- Issues (by severity)
- Suggested diffs
- Follow-ups/tests

Note: Mastra’s docs show version and tags as top-level keys in frontmatter. Depending on your validator/tooling, the safest cross-implementation choice is to store extras under metadata. (mastra.ai)


Mastra integration

Mastra workspaces support skills starting in @mastra/core@1.1.0. (mastra.ai)

1) Place skills under your workspace filesystem basePath

Mastra treats skill paths as relative to the workspace filesystem basePath. (mastra.ai)

In your repo, the main workspace is configured with:

  • basePath: "./src/workspace"
  • skills: ["/skills"]

That means the actual on-disk skills folder should be:

./src/workspace/skills/
/your-skill-name/
SKILL.md

2) Configure skills on a workspace

Mastra enables discovery by setting skills on the workspace. (mastra.ai)

import { Workspace, LocalFilesystem } from "@mastra/core/workspace";

export const workspace = new Workspace({
filesystem: new LocalFilesystem({ basePath: "./src/workspace" }),
skills: ["/skills"],
});

You can provide multiple skill directories (still relative to basePath). (mastra.ai)

skills: [
"/skills", // Project skills
"/team-skills", // Shared team skills
],

3) Dynamic skill directories (context-aware)

Mastra also supports a function form for skills, so you can vary skill sets by user role, tenant, environment, etc. (mastra.ai)

skills: (context) => {
const paths = ["/skills"];
if (context.user?.role === "developer") paths.push("/dev-skills");
return paths;
},

4) What Mastra does “under the hood”

When a skill is activated, its instructions are added to the conversation context and the agent can access references/scripts in that skill folder. Mastra describes the runtime flow as: (mastra.ai)

  1. List available skills in the system message
  2. Allow agents to activate skills during conversation
  3. Provide access to skill references and scripts

This maps cleanly onto the Agent Skills “discovery → activation → execution” model. (agentskills.io)

5) Skill search and indexing in Mastra

Mastra workspaces support BM25, vector, and hybrid search. (mastra.ai)

If BM25 or vector search is enabled, Mastra will automatically index skills so agents can search within skill content to find relevant instructions. (mastra.ai)

Example (BM25-only):

const workspace = new Workspace({
filesystem: new LocalFilesystem({ basePath: "./src/workspace" }),
skills: ["/skills"],
bm25: true,
});

If you enable vector or hybrid search, indexing uses your embedder and vector store (and BM25 uses tokenization + term statistics). (mastra.ai)


Repo conventions that work well

  • One skill per folder, folder name matches frontmatter.name.

  • Keep SKILL.md focused on the “operator manual”; push deep theory to references/.

  • Put runnable helpers in scripts/ and make them deterministic (clear inputs/outputs).

  • Treat destructive actions as opt-in:

    • Use workspace tool gating (approval required, delete disabled) for enforcement.
    • Optionally declare allowed-tools in SKILL.md for portability across other skill runtimes. (agentskills.io)

AI-Powered Skill Extraction with Cloudflare Embeddings and a Vector Taxonomy

· 4 min read
Vadim Nicolai
Senior Software Engineer

This bulk processor extracts structured skill tags for job postings using an AI pipeline that combines:

  • Embedding generation via Cloudflare Workers AI (@cf/baai/bge-small-en-v1.5, 384-dim)
  • Vector retrieval over a skills taxonomy (Turso/libSQL index skills_taxonomy) for candidate narrowing
  • Mastra workflow orchestration for LLM-based structured extraction + validation + persistence
  • Production-grade run controls: robust logging, progress metrics, graceful shutdown, and per-item failure isolation

It’s designed for real-world runs where you expect rate limits, transient failures, and safe restarts.


Core constraint: embedding dimension ↔ vector index schema

The taxonomy retrieval layer is backed by a Turso/libSQL vector index:

  • Index name: skills_taxonomy
  • Embedding dimension (required): 384
  • Embedding model: @cf/baai/bge-small-en-v1.5 (384-dim)

If the index dimension isn’t 384, vector search can fail or degrade into meaningless similarity scores.
The script prevents this by validating stats.dimension === 384 before processing.


Architecture overview (pipeline flow)


Retrieval + extraction: what happens per job

  • Convert relevant job text to embeddings using Cloudflare Workers AI.
  • Use vector similarity search in skills_taxonomy to retrieve top-N candidate skills.
  • Candidates constrain the downstream LLM step (better precision, lower cost).

2) Extraction: structured inference via Mastra workflow

A cached Mastra workflow (extractJobSkillsWorkflow) performs:

  • prompt + schema-driven extraction
  • normalization (matching to taxonomy terms/ids)
  • validation (reject malformed outputs)
  • persistence into job_skill_tags

On failure, the script logs workflow status and step details for debugging.


Cloudflare Workers AI embeddings

Model contract and hardening

  • Model: @cf/baai/bge-small-en-v1.5
  • Vectors: 384 dimensions
  • Input contract: strict array of strings
  • Timeout: 45s (AbortController)
  • Output contract: explicit response shape checks (fail early on unexpected payloads)

This is important because embedding pipelines can silently drift if the response shape changes or inputs are malformed.

Dimension enforcement (non-negotiable)

If skills_taxonomy was created/seeded with a different dimension:

  • similarity search becomes invalid (best case: errors; worst case: plausible-but-wrong matches)

The script enforces stats.dimension === 384 to keep retrieval semantically meaningful.


Turso/libSQL vector taxonomy index

  • Storage: Turso (libSQL)
  • Index: skills_taxonomy
  • Schema dimension: 384
  • Role: retrieval layer for skills ontology/taxonomy

The script also ensures the index is populated (count > 0), otherwise it fails fast and directs you to seed.


Reliability and operational controls

Observability: console + file tee logs

  • tees console.log/warn/error to a timestamped file and the terminal
  • log naming: extract-job-skills-<ISO timestamp>-<pid>.log
  • degrades to console-only logging if file IO fails

Graceful termination

  • SIGINT / SIGTERM sets a shouldStop flag
  • the loop exits after the current job completes
  • avoids interrupting in-flight workflow steps (embedding/LLM/DB writes)

Idempotency / restart safety

Even after selecting jobs without tags, the script re-checks:

  • jobAlreadyHasSkills(jobId)

This avoids duplicate inference when:

  • you restart mid-run
  • multiple workers run concurrently
  • the initial query snapshot becomes stale

Throughput shaping

  • sequential processing
  • a fixed 1s backoff between jobs (simple, reliable rate-limit mitigation)

Failure modes

Retrieval layer failures (index health)

Triggers:

  • index missing
  • dimension mismatch (not 384)
  • empty index (count === 0)

Behavior: fail fast with actionable logs (recreate index / re-seed / verify DB target).

Embedding timeouts

Symptom: embedding call exceeds 45s and aborts. Behavior: job fails; run continues.

Mitigations:

  • chunk long descriptions upstream
  • add retry/backoff on transient 429/5xx
  • monitor Workers AI service health

Workflow failures

Behavior: job is marked failed; run continues. Logs include step trace and error payload to accelerate debugging.


Quick reference

  • Embeddings: Cloudflare Workers AI @cf/baai/bge-small-en-v1.5 (384-dim)
  • Retrieval: Turso/libSQL vector index skills_taxonomy (384-dim)
  • Orchestration: Mastra workflow extractJobSkillsWorkflow
  • Persistence: job_skill_tags
  • Embedding timeout: 45s
  • Stop behavior: graceful after current job (SIGINT / SIGTERM)

AI Observability for LLM Evals with Langfuse

· 10 min read
Vadim Nicolai
Senior Software Engineer

This article documents an evaluation harness for a Remote EU job classifier—but the real focus is AI observability: how to design traces, spans, metadata, scoring, and run-level grouping so you can debug, compare, and govern LLM behavior over time.

The script runs a batch of curated test cases, loads the latest production prompt from Langfuse (with a safe fallback), executes a structured LLM call, scores results, and publishes metrics back into Langfuse. That gives you:

  • Reproducibility (prompt versions + test set + session IDs)
  • Debuggability (one trace per test case; inspect inputs/outputs)
  • Comparability (run-level aggregation; trend metrics across changes)
  • Operational safety (flush guarantees, CI thresholds, rate-limit control)

Why "observability-first" evals matter

A typical eval script prints expected vs actual and calls it a day. That's not enough once you:

  • iterate prompts weekly,
  • swap models,
  • add guardrails,
  • change schemas,
  • tune scoring rubrics,
  • and need to explain regressions to humans.

Observability-first evals answer questions like:

  • Which prompt version produced the regression?
  • Is accuracy stable but confidence becoming overconfident?
  • Are failures clustered by location phrasing ("EMEA", "EU timezone", "Worldwide")?
  • Did we increase tokens/latency without improving correctness?
  • Can we click from CI logs straight into the trace of the failing example?

Langfuse becomes your "flight recorder": the trace is the unit of truth for what happened.


End-to-end architecture


Observability design: what gets traced and why

Trace strategy: one trace per test case

Principle: if you can't click into an individual example, you can't debug.

Each test case produces a Langfuse trace (think "request-level unit"), tagged with:

  • sessionId: groups a full run (critical for comparisons)
  • testCaseId, description: anchors the trace to your dataset
  • prompt metadata: name/label/version/hash (ideal)
  • model metadata: provider, model name, parameters (ideal)

This makes failures navigable and filterable.

Span strategy: one generation per model call

Inside each trace, you create a generation span for the model call:

  • captures input (prompt + job posting)
  • captures output (structured object + reason)
  • captures usage (token counts)
  • optionally captures latency (recommended)
  • optionally captures model params (temperature, top_p, etc.)

Even if the script is "just evals," treat each example like production traffic. That's how you build a reliable debugging workflow.


Prompt governance: Langfuse prompts + fallback behavior

Your harness fetches a prompt by name and label:

  • name: job-classifier
  • label: production

If prompt retrieval fails or is disabled (e.g., SKIP_LANGFUSE_PROMPTS=true), it uses a local fallback prompt.

Observability tip: always record the effective prompt identity

To compare runs, you want "which exact prompt did this use?" in trace metadata. If your prompt fetch returns versioning info, store:

  • promptName
  • promptLabel
  • promptVersion or promptId or promptHash

If it does not return version info, you can compute a stable hash of the prompt text and store that (lightweight, extremely useful).


Structured output: Zod as an observability contract

The classifier returns:

  • isRemoteEU: boolean
  • confidence: "high" | "medium" | "low"
  • reason: string

Why structured output is observability, not just "parsing"

A strict schema:

  • removes ambiguity ("was that JSON-ish text or valid data?")
  • enables stable scoring and aggregation
  • prevents downstream drift as prompts change
  • improves triage because the same fields are always present

If you ever add fields like region, countryHints, remotePolicy, do it via schema extension and keep historical compatibility in your scorer.


The full eval lifecycle as a trace model

This is what you want stored per test case:

When a case fails, you should be able to answer in one click:

  • Which prompt version?
  • What input text exactly?
  • What output object exactly?
  • What scoring decision and why?
  • Was the model "confidently wrong"?

Scoring and metrics: accuracy is necessary, insufficient

Your harness logs two scores:

  1. remote-eu-accuracy A numeric score from your scorer. This can be binary (0/1) or continuous (0..1). Continuous is often better because it supports partial credit and more informative trend analysis.

  2. confidence-match A binary score (1/0) tracking whether the model's confidence matches expected confidence.

Observability tip: store scorer metadata as the comment (or trace metadata)

A score without context is hard to debug. For incorrect cases, write comments like:

  • expected vs actual isRemoteEU
  • expected vs actual confidence
  • a short reason ("Predicted EU-only due to 'EMEA' but posting says US time zones")

Also consider storing structured metadata (if your Langfuse SDK supports it) so you can filter/group later.


Run-level grouping: session IDs as your "eval run" primitive

A sessionId = eval-${Date.now()} groups the whole batch. This enables:

  • "show me all traces from the last run"
  • comparisons across runs
  • slicing by prompt version across sessions
  • CI links that land you on the run dashboard

Recommendation: include additional stable tags:

  • gitSha, branch, ciBuildId (if running in CI)
  • model and promptVersion (for quick comparisons)

Even if you don't have them now, design the metadata schema so adding them later doesn't break anything.


Mermaid: evaluation flow, sequence, and data model (together)

1) Flow: control plane of the batch run

2) Sequence: what actually happens per case

3) Data model: eval artifacts


How to run (and make it debuggable in one click)

Environment variables

Required:

  • LANGFUSE_SECRET_KEY
  • LANGFUSE_PUBLIC_KEY
  • LANGFUSE_BASE_URL
  • DEEPSEEK_API_KEY

Optional:

  • SKIP_LANGFUSE_PROMPTS=true (use local prompt fallback)

Run:

pnpm tsx scripts/eval-remote-eu-langfuse.ts

Local prompt fallback:

SKIP_LANGFUSE_PROMPTS=true pnpm tsx scripts/eval-remote-eu-langfuse.ts

Observability tip: print a stable "run header"

In console output (and CI logs), it helps to print:

  • sessionId
  • model name
  • prompt version/hash
  • total test cases

That turns logs into an index into Langfuse.


Debugging workflow: from CI failure to root cause

When accuracy drops below threshold and CI fails, you want a deterministic workflow:

  1. Open the Langfuse session for the run (grouped by sessionId)

  2. Filter traces where remote-eu-accuracy = 0 (or below some threshold)

  3. For each failing trace:

    • check prompt version/hash
    • check job posting input text (location phrasing is often the culprit)
    • inspect structured output (especially confidence)
    • read the reason for the scorer's decision

Practical tips & gotchas (observability edition)

1) Always flush telemetry

If you exit early, you can lose the most important traces. Ensure flushAsync() happens even on errors (e.g., in a finally block) and only exit after flush completes.

2) Don't parallelize blindly

Parallel execution improves speed but can:

  • amplify rate limits
  • introduce noisy latency
  • create non-deterministic output ordering in logs

If you do parallelize, use bounded concurrency and capture per-case timing.

3) Track prompt identity, not just prompt text

Prompt text alone is hard to compare across runs. Record version/hash so you can correlate changes with performance.

4) Separate "correctness" from "calibration"

A model can get higher accuracy while becoming confidently wrong on edge cases. Keeping confidence-match (or richer calibration metrics later) prevents hidden regressions.

5) Add slice metrics before you add more test cases

Instead of only "overall accuracy," compute accuracy by category:

  • "EU-only"
  • "Worldwide remote"
  • "EMEA" phrasing
  • "Hybrid" / "On-site"
  • "Contractor / employer-of-record constraints"

This reveals what's actually breaking when a prompt changes.


Suggested next upgrades (high leverage)

A) Add latency and cost proxies

Record:

  • duration per generation span (ms)
  • token totals per case

Then you can chart:

  • cost/latency vs accuracy
  • regressions where prompt got longer but not better

B) Add a "reason quality" score (optional, small rubric)

Create a third score like reason-quality to detect when explanations degrade (too vague, irrelevant, or missing key constraints). Keep it light—don't overfit to phrasing.

C) Prompt A/B within the same run

Evaluate production vs candidate prompts on the same test set:

  • two sessions (or two labels within one session)
  • compare metrics side-by-side in Langfuse

Docusaurus note: Mermaid support

If Mermaid isn't rendering, enable it in Docusaurus:

// docusaurus.config.js
const config = {
markdown: { mermaid: true },
themes: ["@docusaurus/theme-mermaid"],
};
module.exports = config;

The takeaway: observability is the eval superpower

A well-instrumented eval harness makes improvements measurable and regressions explainable:

  • traces turn examples into clickable evidence
  • structured outputs stabilize scoring
  • session IDs make runs comparable
  • multiple metrics prevent hidden failure modes

If you treat evals like production requests—with traces, spans, and scores—you'll iterate faster and break less.

Schema-First RAG with Eval-Gated Grounding and Claim-Card Provenance

· 7 min read
Vadim Nicolai
Senior Software Engineer

This article documents a production-grade architecture for generating research-grounded therapeutic content. The system prioritizes verifiable artifacts (papers → structured extracts → scored outputs → claim cards) over unstructured text.

You can treat this as a “trust pipeline”: retrieve → normalize → extract → score → repair → persist → generate.

Evals for Workflow-First Production LLMs: Contracts, Rubrics, Sampling, and Observability

· 12 min read
Vadim Nicolai
Senior Software Engineer

Building Production Evals for LLM Systems

Building LLM systems you can measure, monitor, and improve

Large language models feel like software, but they don’t behave like software.

With conventional programs, behavior is mostly deterministic: if tests pass, you ship, and nothing changes until you change the code. With LLM systems, behavior can drift without touching a line—model updates, prompt edits, temperature changes, tool availability, retrieval results, context truncation, and shifts in real-world inputs all move the output distribution.

So “it seems to work” isn’t a strategy. Evals are how you turn an LLM feature from a demo into an engineered system you can:

  • Measure (quantify quality across dimensions)
  • Monitor (detect drift and regressions early)
  • Improve (pinpoint failure modes and iterate)

This doc builds evals from first principles and anchors everything in a concrete example: a workflow that classifies job postings as Remote EU (or not), outputs a structured JSON contract, and attaches multiple scorers (deterministic + LLM-as-judge) to generate reliable evaluation signals.


1) The core idea: make quality observable

An eval is a function:

Eval(input, output, context?, ground_truth?) → score + reason + metadata

A single scalar score is rarely enough. You want:

  • Score: for trendlines, comparisons, and gating
  • Reason: for debugging and iteration
  • Metadata: to reproduce and slice results (model version, prompt version, retrieval config, toolset, sampling rate, time)

When you do this consistently, evals become the LLM equivalent of:

  • unit tests + integration tests,
  • observability (logs/metrics/traces),
  • QA plus post-release monitoring.

2) “Correct” is multi-dimensional

In LLM systems, quality is a vector.

Even if the final label is right, the output can still be unacceptable if:

  • it invents support in the explanation (hallucination),
  • it violates the rubric (misalignment),
  • it fails formatting constraints (schema noncompliance),
  • it’s unhelpful or vague (low completeness),
  • it includes unsafe content (safety).

So you don’t build one eval. You build a panel of scorers that measure different axes.


3) Deterministic vs model-judged evals

3.1 Deterministic evals (cheap, stable, strict)

No model involved. Examples:

  • schema validation
  • required fields present (e.g., reason non-empty)
  • bounds checks (confidence ∈ {low, medium, high})
  • regex checks (must not include disallowed fields)

Strengths: fast, repeatable, low variance Limitations: shallow; can’t grade nuance like “is this reason actually supported?”

3.2 LLM-as-judge evals (powerful, fuzzy, variable)

Use a second model (the judge) to grade output against a rubric and evidence.

Strengths: can evaluate nuanced properties like grounding, rubric adherence, and relevance Limitations: cost/latency, judge variance, judge drift, and susceptibility to prompt hacking if unconstrained

In production, the winning pattern is: deterministic guardrails + rubric-based judge scoring + sampling.


4) The “Remote EU” running example

4.1 Task

Input:

  • title
  • location
  • description

Output contract:

{
"isRemoteEU": true,
"confidence": "high",
"reason": "Short evidence-based justification."
}

4.2 Why this is a great eval example

Job posts are full of ambiguous and misleading phrases:

  • “EMEA” is not EU-only
  • “CET/CEST” is a timezone, not eligibility
  • UK is not in the EU
  • Switzerland/Norway are in Europe but not EU
  • “Hybrid” is not fully remote
  • Multi-location lists can mix EU and non-EU constraints

This creates exactly the kind of environment where “vibes” fail and measurement matters.


5) Workflow-first evaluation architecture

A practical production architecture separates:

  • serving (fast path that returns a result),
  • measurement (scoring and diagnostics, often sampled).

Why this split matters

If your most expensive scorers run inline on every request, your feature inherits their cost and latency. A workflow-first approach gives you options:

  • always-on “must-have” scoring,
  • sampled deep diagnostics,
  • offline golden-set evaluation in CI.

6) Contracts make evaluation reliable: rubric + schema

6.1 Rubric is the spec

If you can’t state what “correct” means, you can’t measure it consistently.

Your rubric should define:

  • positive criteria (what qualifies),
  • explicit negatives (what disqualifies),
  • ambiguous cases and how to resolve them,
  • precedence rules (what overrides what).

6.2 Schema is the contract

Structured output makes evaluation composable:

  • score isRemoteEU separately from reason,
  • validate confidence vocabulary,
  • enforce required fields deterministically.

7) Design the scorer suite as a “sensor panel”

A robust suite typically includes:

7.1 Always-on core

  • Domain correctness judge (rubric-based)
  • Deterministic sanity (schema + hasReason)
  • Optionally: lightweight grounding check (if user-facing)

7.2 Sampled diagnostics

  • Faithfulness / hallucination (judge-based)
  • Prompt alignment
  • Answer relevancy
  • Completeness / keyword coverage (careful: can be gamed)

7.3 Low-rate tail-risk

  • Toxicity
  • Bias
  • (Domain-dependent) policy checks

8) The anchor metric: domain correctness as a strict judge

Generic “relevance” is not enough. You need:

“Is isRemoteEU correct under this rubric for this job text?”

8.1 What a good judge returns

A strong judge returns structured, actionable feedback:

  • score ∈ [0, 1]
  • isCorrect boolean
  • mainIssues[] (typed failure modes)
  • reasoning (short justification)
  • optional evidenceQuotes[] (snippets that support the judgment)

8.2 The “use only evidence” constraint

The most important instruction to judges:

Use ONLY the job text + rubric. Do not infer missing facts.

Without this, your judge will “helpfully” hallucinate implied constraints, and your metric becomes untrustworthy.


9) Deterministic sanity checks: tiny effort, huge payoff

Even with a schema, add simple checks:

  • reason.trim().length > 0
  • confidence in an allowed set
  • optional length bounds for reason (prevents rambling)

These are cheap, stable, and catch silent regressions early.


10) Grounding: the trust layer

In many real products, the worst failure is not “wrong label.” It’s unsupported justification.

A model can guess the right label but invent a reason. Users trust the reason more than the label. When the reason lies, trust is gone.

Useful grounding dimensions:

  • Faithfulness: does the reason match the job text?
  • Non-hallucination: does it avoid adding unsupported claims?
  • Context relevance: does it actually use provided context?

Normalize score direction

If a scorer returns “lower is better” (hallucination/toxicity), invert it so higher is always better:

This prevents endless mistakes in dashboards and thresholds.


11) Aggregation: how many metrics become decisions

You typically want three layers:

11.1 Hard gates (binary invariants)

Examples:

  • schema valid
  • hasReason = 1
  • correctness score ≥ threshold
  • non-hallucination ≥ threshold (if user-facing)

11.2 Soft composite score (trend tracking)

A weighted score helps compare versions, but should not hide hard failures.

11.3 Diagnostics (why it failed)

Store mainIssues[] and judge reasons so you can cluster and fix.


12) Slicing: where the real insight lives

A single global average is rarely useful. You want to slice by meaningful features:

For Remote EU:

  • contains “EMEA”
  • contains “CET/CEST”
  • mentions UK
  • mentions hybrid/on-site
  • mentions “Europe” (ambiguous)
  • multi-location list present
  • mentions “EU work authorization”

This turns “accuracy dropped” into “accuracy dropped specifically on CET-only job posts.”


13) The Remote EU rubric as a decision tree

A rubric becomes much easier to debug when you can visualize precedence rules.

Here’s an example decision tree (adapt to your policy):

This makes edge cases explicit and makes judge behavior easier to audit.


14) Sampling strategy: cost-aware measurement

A practical scoring policy:

  • Always-on: correctness + sanity
  • 25% sampled: grounding + alignment + completeness
  • 10% sampled: safety canaries
  • 0%: tool-call accuracy until you actually use tools

This gives you statistical visibility with bounded cost.

If you want deeper rigor:

  • increase sampling on releases,
  • reduce sampling during stable periods,
  • bias sampling toward risky slices (e.g., posts containing “EMEA” or “CET”).

15) Calibration: make “confidence” mean something

If you output confidence: high|medium|low, treat it as a measurable claim.

Track:

  • P(correct | high)
  • P(correct | medium)
  • P(correct | low)

A healthy confidence signal produces a separation like:

  • high ≫ medium ≫ low

If “high” is only marginally better than “medium,” you’re emitting vibes, not confidence.


16) Turning evals into improvement: the feedback loop

Evals are not a report card. They’re a loop.

  1. Collect runs + eval artifacts
  2. Cluster failures by mainIssues[]
  3. Fix prompt/rubric/routing/post-processing
  4. Re-run evals (golden set + sampled prod)
  5. Gate release based on regressions

The key operational shift: you stop debating anecdotes and start shipping changes backed by measured deltas.


17) Golden sets: fast regression detection

A golden set is a curated collection of test cases representing:

  • core behavior,
  • common edge cases,
  • historical failures.

Even 50–200 examples catch a shocking amount of regression.

For Remote EU, include cases mentioning:

  • “Remote EU only”
  • “Remote Europe” (ambiguous)
  • “EMEA only”
  • “CET/CEST only”
  • UK-only
  • Switzerland/Norway-only
  • hybrid-only (single city)
  • multi-location lists mixing EU and non-EU
  • “EU work authorization required” without explicit countries

Run the golden set:

  • on every prompt/model change (CI),
  • nightly as a drift canary.

18) Judge reliability: making LLM-as-judge dependable

Judge scoring is powerful, but you must treat the judge prompt like production code.

18.1 Techniques that reduce variance

  • force structured judge output (JSON schema)
  • use a clear rubric with precedence rules
  • include explicit negative examples
  • constrain the judge: “use only provided evidence”
  • keep judge temperature low
  • store judge prompt version + rubric version

18.2 Disagreement as signal

If you run multiple judges or compare judge vs deterministic heuristics, disagreement highlights ambiguous cases worth:

  • rubric refinement,
  • targeted prompt updates,
  • additional training data,
  • routing policies.

19) Production gating patterns

Not every system should block on evals, but you can safely gate high-risk cases.

Common gates:

  • schema invalid → retry
  • correctness below threshold → rerun with stronger model or request clarification (if user-facing)
  • low grounding score → regenerate explanation constrained to cite evidence
  • confidence low → route or mark uncertain

20) Beyond classifiers: evals for tool-using agents

Once your agent calls tools (search, databases, parsers, RAG), evals expand to include:

  • Tool selection correctness: did it call tools when needed?
  • Argument correctness: were tool parameters valid?
  • Faithful tool usage: did the model use tool outputs correctly?
  • Over-calling: did it waste calls?

This is where agentic systems often succeed or fail in production.


21) A practical checklist

Spec & contracts

  • Rubric defines positives, negatives, precedence, ambiguous cases
  • Output schema enforced
  • Prompt and rubric are versioned artifacts

Scorers

  • Always-on: correctness + sanity
  • Sampled: grounding + alignment + completeness
  • Low-rate: safety checks
  • Scores normalized so higher is better

Ops

  • Metrics stored with reasons + metadata
  • Slices defined for high-risk patterns
  • Golden set exists and runs in CI/nightly
  • Feedback loop ties evals directly to prompt/rubric/routing changes

Closing

Without evals, you can demo. With evals, you can ship—and keep shipping.

A workflow-first pattern—rubric + schema + domain correctness judge + grounding diagnostics + sampling + feedback loop—turns an LLM from a “text generator” into an engineered system you can measure, monitor, and improve like any serious production service.


Appendix: Reusable Mermaid snippets

A) System architecture

B) Eval taxonomy

C) Feedback loop

Agentic Job Pre-Screening with LangGraph + DeepSeek: Auto-Reject Fake “Remote” Roles

· 7 min read
Vadim Nicolai
Senior Software Engineer

Introduction

Remote job postings are noisy, inconsistent, and often misleading. A role is labeled “Remote”, but the actual constraints show up in one sentence buried halfway down the description:

  • “Remote (US only)”
  • “Must be authorized to work in the U.S. without sponsorship”
  • “EU/EEA only due to payroll constraints”
  • “Must overlap PST business hours”
  • “Hybrid, 2 days/week in-office”

This article breaks down a LangGraph System that pre-screens job postings using DeepSeek structured extraction, then applies deterministic rules to instantly decide:

✅ Apply
❌ Reject (with reasons + quotes)

The goal is simple: filter out non-viable jobs before you spend time applying.


The Problem: “Remote” Doesn’t Mean “Work From Anywhere”

Why Traditional Filters Fail

Keyword filters (“remote”, “anywhere”) fail because job descriptions are written inconsistently and constraints can be phrased in dozens of ways:

  1. Remote but country-restricted
  2. Remote but timezone-restricted
  3. Remote but payroll-limited
  4. Remote but no visa sponsorship
  5. Remote but actually hybrid

Instead of relying on fragile string matching, we use an LLM to read the description like a human, but output machine-usable constraints.


System Overview

This agent evaluates job postings in two phases:

  1. Analyze job text (DeepSeek + structured schema)
  2. Check eligibility (deterministic rules)

What It Detects

  • Location scope
    • US-only / EU-only / Global / Specific regions / Unknown
  • Remote status
    • fully-remote / remote-with-restrictions / hybrid / on-site / unknown
  • Visa sponsorship
    • explicit yes/no/unknown
  • Work authorization requirements
    • must be authorized in US/EU, or not specified
  • Timezone restrictions
    • PST overlap / CET overlap / etc.

Tech Stack

  • LangGraph: workflow orchestration and state transitions
  • DeepSeek: high-signal extraction from messy job text (deepseek-chat)
  • LangChain structured output: strict schema → stable parsing
  • Deterministic rules engine: eligibility enforcement without “LLM vibes”

Architecture Patterns

1) LangGraph Workflow

Instead of a linear script, the system is a graph-driven workflow:

This shape is production-friendly because the workflow can expand safely:

  • add salary checks
  • add tech stack fit scoring
  • add seniority mismatch detection
  • add contractor vs employee constraints

Typed State + Structured Extraction

State Model (TypedDict)

LangGraph becomes far more reliable when state is explicit:

class JobScreeningState(TypedDict):
job_title: str
company: str
description: str
location: str
url: str

# Candidate requirements
candidate_needs_visa_sponsorship: bool
requires_fully_remote: bool
requires_worldwide_remote: bool
candidate_locations: List[str]

# Output
is_eligible: bool
rejection_reasons: List[str]

# Extracted requirements
location_requirement: Optional[str]
specific_regions: List[str]
excluded_regions: List[str]
visa_sponsorship_available: Optional[bool]
work_authorization_required: Optional[str]
remote_status: str
timezone_restrictions: List[str]
confidence: str
key_phrases: List[str]
analysis_explanation: Optional[str]

DeepSeek Extraction: Converting Messy Text Into Policy Constraints

Why Structured Output Is Non-Negotiable

Freeform LLM output is fragile. A production system needs predictable extraction. This agent forces DeepSeek into a strict schema:

class JobAnalysisSchema(TypedDict):
location_requirement: Literal["US-only", "EU-only", "Global", "Specific-regions", "Unknown"]
specific_regions: List[str]
excluded_regions: List[str]
remote_status: Literal["fully-remote", "remote-with-restrictions", "hybrid", "on-site", "unknown"]
visa_sponsorship_available: Optional[bool]
work_authorization_required: Literal["US-only", "EU-only", "Any", "Unknown"]
timezone_restrictions: List[str]
confidence: Literal["high", "medium", "low"]
key_phrases: List[str]
explanation: str

With this contract, the agent can safely feed extracted requirements into deterministic logic.


Token Efficiency: Keep Only High-Signal Lines

Job descriptions are long. Constraints are usually short. To reduce tokens and improve extraction precision, the system trims input to keyword-adjacent lines:

KEYWORDS = (
"remote", "anywhere", "worldwide", "timezone", "sponsor", "visa",
"authorized", "work authorization", "must be located", "eligible to work",
"location", "region", "country", "overlap", "hours", "time zone"
)

def _keep_relevant(text: str, window: int = 2) -> str:
lines = text.splitlines()
keep = set()
for i, ln in enumerate(lines):
if any(k in ln.lower() for k in KEYWORDS):
for j in range(max(0, i - window), min(len(lines), i + window + 1)):
keep.add(j)
return "\n".join(lines[i] for i in sorted(keep)) or text

This improves the system in four ways:

  • lower inference cost
  • faster runtime
  • less noise
  • fewer hallucination opportunities

Heuristics + DeepSeek: Hybrid Extraction That Wins

Before invoking DeepSeek, the system runs a tiny heuristic pre-check:

  • detects obvious “Remote (Worldwide)”
  • detects “Remote (US only)”
  • detects “Hybrid / On-site”
def _fast_heuristic_precheck(state: JobScreeningState) -> Optional[Dict[str, Any]]:
loc = state.get("location", "") or ""
desc = state.get("description", "") or ""
seed: Dict[str, Any] = {}

if _looks_worldwide(loc) or _looks_worldwide(desc):
seed["location_requirement"] = "Global"
seed["remote_status"] = "fully-remote"

if (_looks_us_only(loc) or _looks_us_only(desc)) and not seed.get("location_requirement"):
seed["location_requirement"] = "US-only"

if _looks_hybrid_or_onsite(loc):
seed["remote_status"] = "hybrid"

return seed if seed else None

DeepSeek still performs the full extraction, but seeding improves resilience against incomplete metadata.


Eligibility Rules: Enforcing Worldwide Remote Strictly

The most valuable mode is strict worldwide remote filtering:

If requires_worldwide_remote=True, the job must satisfy ALL of the following:

  • remote_status == "fully-remote"
  • location_requirement == "Global"
  • no specific_regions
  • no timezone_restrictions
if state["requires_worldwide_remote"]:
if state["remote_status"] != "fully-remote":
rejection_reasons.append(
f"Not worldwide-remote: remote status is '{state['remote_status']}'"
)
if state["location_requirement"] != "Global":
rejection_reasons.append(
f"Not worldwide-remote: location requirement is '{state['location_requirement']}'"
)
if state["specific_regions"]:
rejection_reasons.append(
f"Not worldwide-remote: restricted to {state['specific_regions']}"
)
if state["timezone_restrictions"]:
rejection_reasons.append(
f"Not worldwide-remote: timezone restrictions {state['timezone_restrictions']}"
)

This instantly rejects “remote marketing” jobs like:

  • “Remote, EU only”
  • “Remote, US/Canada preferred”
  • “Remote, PST overlap required”

Visa Sponsorship Semantics: Correct and Safe

Sponsorship logic is easy to get wrong. The correct behavior:

  • reject only when sponsorship is explicitly not available (False)
  • do not reject on unknown (None)
if state["candidate_needs_visa_sponsorship"]:
if state["visa_sponsorship_available"] is False:
rejection_reasons.append(
"Job does not offer visa sponsorship, but candidate needs sponsorship"
)

This avoids dropping jobs that simply don’t mention sponsorship.


Explainability: Rejection Reasons + Key Phrases

Trust requires receipts. The system stores:

  • rejection_reasons (deterministic outcomes)
  • key_phrases (quotes that triggered the decision)
  • analysis_explanation (LLM summary for debugging)

That produces outputs like:

  • “Job requires US location; candidate is not in US”
  • “Not worldwide-remote: timezone restrictions ['US Pacific business hours']”
  • key phrases like “Must be authorized to work in the U.S. without sponsorship”

Real-World Test Scenarios

The included test suite covers the most common job board traps:

  1. US-only remote + no sponsorship
  2. Remote worldwide (work from anywhere)
  3. EU-only remote
  4. Remote with timezone overlap requirement

This validates both extraction quality and deterministic enforcement.


Production Enhancements

1) Add a Match Score (Not Only Pass/Fail)

Binary decisions are clean, but scoring improves ranking:

  • 100 = perfect match
  • 70 = acceptable
  • 30 = not worth it
  • 0 = reject

2) Cache Results by URL Hash

You already compute a stable thread_id from the job URL. Persist results keyed by:

  • url_hash
  • model version
  • rule version

This prevents re-analyzing duplicate postings.

3) Detect Payroll Constraints Explicitly

Add signals for:

  • “We can only hire in countries where we have an entity”
  • “Deel/Remote.com limited coverage”
  • “W2 only / no contractors”

This is one of the highest ROI improvements for global applicants.


Conclusion

This LangGraph System turns job descriptions into enforceable constraints:

  • DeepSeek extracts remote reality, location scope, and sponsorship signals
  • Structured output makes extraction stable and machine-safe
  • Deterministic rules enforce candidate requirements precisely
  • Worldwide-remote mode filters out fake “remote” listings instantly
  • Decisions are explainable with reasons and quotes

This is how you scale job hunting without wasting time: automate rejection early, apply only where it can actually work.

References

Building Long-Running TTS Pipelines with LangGraph: Orchestrating Long-Form Audio Generation

· 17 min read
Vadim Nicolai
Senior Software Engineer

Introduction

Generating long-form audio content—audiobooks spanning hours, educational courses, or extended podcasts—presents unique challenges: API rate limits, network failures, resource constraints, and the sheer duration of processing. This article explores a production-ready architecture for long-running TTS pipelines that can gracefully handle long-form generation tasks, resume after failures, and maintain state across distributed systems.

Built with LangGraph, the system orchestrates complex workflows involving AI content generation (DeepSeek), text-to-speech conversion (OpenAI TTS), and distributed storage (Cloudflare R2). The key innovation: PostgreSQL checkpointing enables resumable execution, making it possible to generate 5-30+ minute audio segments reliably, even when individual API calls or processing steps fail.

The Challenge: Long-Form Audio at Scale

Why Long-Running Pipelines Are Hard

Traditional TTS approaches fail at scale:

  1. Time Constraints: A 30-minute audio narrative requires ~4,500 words, chunked into 10-15 API calls, taking 2-5 minutes to generate
  2. Failure Points: Each step (text generation, chunking, TTS, storage) can fail independently
  3. Memory Pressure: Holding all audio segments in memory for hours is impractical
  4. Cost Management: Retrying from scratch wastes API credits and compute time
  5. State Loss: Without persistence, crashes mean starting over

Our Solution: Stateful Orchestration

  • LangGraph manages workflow state transitions
  • PostgreSQL persists checkpoints after each successful step
  • R2 provides durable storage for completed segments
  • Resumable execution using thread_id for job recovery

System Overview

The pipeline orchestrates three main workflows:

  1. Research Generation: Structured content research using DeepSeek
  2. Narrative Text Generation: Long-form content creation with context awareness
  3. Audio Synthesis: Text-to-speech conversion with OpenAI TTS and Cloudflare R2 storage

Tech Stack

  • LangGraph: State machine orchestration with built-in checkpointing
  • DeepSeek: Long-form text generation (deepseek-chat, 2500+ token outputs)
  • OpenAI TTS: Streaming audio synthesis (gpt-4o-mini-tts, 4096 char limit per request)
  • PostgreSQL: Durable checkpointing for long-running jobs (Neon serverless for production)
  • Cloudflare R2: S3-compatible storage with zero egress fees (critical for multi-GB audio)
  • FastAPI: Async REST API for non-blocking long operations
  • Docker: Containerized deployment with ffmpeg for audio merging

Why This Stack for Long-Running Jobs:

  • Postgres checkpointing: Resume from any point in the workflow (text generation → chunking → TTS → upload)
  • Streaming TTS: Memory-efficient direct-to-disk writes (no buffering entire audio in RAM)
  • R2 durability: Segments uploaded immediately, survive process crashes
  • Async execution: Non-blocking background processing for hours-long jobs

Architecture Patterns

1. Core LangGraph State Machine

The system implements three distinct LangGraph workflows, each optimized for specific tasks.

2. Research Generation Pipeline

The research pipeline generates structured research content using a focused LangGraph workflow.

Key Features:

  • Low temperature (0.3) for factual accuracy
  • Structured JSON output with validation
  • Evidence level classification (A/B/C)
  • Relevance scoring for topic matching

3. Long-Form Text Generation Pipeline

The most sophisticated workflow, supporting both full generation and audio-only modes.

Conditional Routing Logic:

def should_skip_text_generation(state: TextState) -> str:
"""Route to text generation or skip to audio."""
if state.get("existing_content") and state["existing_content"].get("text"):
return "chunk_text" # Audio-only mode
return "generate_text" # Full generation

def should_generate_audio(state: TextState) -> str:
"""Route to audio generation or end."""
if state.get("generate_audio", True):
return "chunk_text"
return END # Text-only mode

4. Audio Generation Pipeline (Standalone)

A simplified pipeline for generic long-form narration.

Iterative Chunk Processing:

The system uses a recursive edge pattern for processing chunks:

g.add_conditional_edges(
"tts_one_chunk",
edge_should_continue,
{
"tts_one_chunk": "tts_one_chunk", # Loop back
"finalize": "finalize", # Exit loop
},
)

def edge_should_continue(state: JobState) -> str:
if state["chunk_index"] < len(state["chunks"]):
return "tts_one_chunk"
return "finalize"

Deep Dive: Key Architectural Components

State Management

LangGraph uses typed state dictionaries for type safety and IDE support:

class TextState(TypedDict):
# Input metadata
content_id: int
title: str
content_type: str
language: str
target_duration_minutes: int | None

# Generation data
research_items: list[dict]
existing_content: dict | None
generated_text: str | None

# TTS fields
voice: str
chunks: List[str]
segment_urls: List[str]
manifest_url: Optional[str]
audio_url: Optional[str]

# Control flow
generate_audio: bool
database_saved: bool
error: str | None

Postgres Checkpointing: The Key to Long-Running Resilience

For long-running jobs, checkpointing is non-negotiable. Without it, a network glitch at minute 25 of a 30-minute generation means restarting from scratch.

How Checkpointing Works:

async def run_pipeline(state: TextState, thread_id: str):
db_url = os.getenv("DATABASE_URL")

async with AsyncPostgresSaver.from_conn_string(db_url) as checkpointer:
await checkpointer.setup() # Creates checkpoint tables
app = build_graph(checkpointer=checkpointer)
config = {"configurable": {"thread_id": thread_id}}

# LangGraph automatically saves state after each node execution
final_state = await app.ainvoke(state, config=config)
return final_state

What Gets Checkpointed:

  • Complete state dictionary after each node
  • Edge transitions and routing decisions
  • Timestamps and execution metadata
  • Partial results (generated text, uploaded segment URLs)

Recovery Example:

# Job crashes after generating 8 of 12 TTS segments
# Resume with same thread_id:
final_state = await run_pipeline(initial_state, thread_id="job-12345")

# LangGraph:
# 1. Loads last checkpoint from Postgres
# 2. Sees 8 segments already uploaded to R2
# 3. Continues from segment 9
# 4. Completes remaining 4 segments

Production Benefits:

  • Cost Savings: No wasted API calls on retry
  • Time Efficiency: Resume from 80% complete, not 0%
  • Reliability: Transient failures (rate limits, timeouts) don't kill long-form jobs
  • Observability: Query checkpoint table to monitor progress
  • Parallel Execution: Multiple jobs with different thread_id values

Text Chunking Algorithm: Optimizing for Long-Form Narration

For 30-minute audio (4,500+ words), naive chunking creates jarring transitions. Our algorithm balances API constraints with narrative flow:

Constraints:

  • OpenAI TTS: 4,096 character limit per request
  • Target: ~4,000 chars per chunk (safety margin)
  • Goal: Natural pauses at paragraph/sentence boundaries

Strategy:

def chunk_text(text: str, max_chars: int = 4000) -> List[str]:
"""
Multi-level chunking for long-form content:

1. Split by paragraphs (\n\n) - natural topic boundaries
2. Accumulate paragraphs until approaching 4K limit
3. If single paragraph > 4K, split by sentences
4. If single sentence > 4K, split mid-sentence (rare edge case)

Result: 10-15 chunks for 30-min audio, each ending at natural pause
"""
paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
chunks = []
buf = []

for p in paragraphs:
candidate = "\n\n".join(buf + [p]) if buf else p
if len(candidate) <= max_chars:
buf.append(p)
else:
if not buf:
# Paragraph too large - split by sentences
sentences = re.split(r"(?<=[.!?])\s+", p)
# Accumulate sentences with same logic...
else:
chunks.append("\n\n".join(buf))
buf = [p]

return chunks

Why This Matters for Long-Form:

  • Seamless Merging: Chunk boundaries at natural pauses prevent audio glitches
  • Even Distribution: Avoids tiny final chunks (better for progress tracking)
  • Memory Efficiency: Process one chunk at a time, not entire 4,500-word text
  • Resumability: Each chunk is independent; can resume mid-sequence

OpenAI TTS Streaming

Efficient audio generation using streaming responses:

async def node_tts_one_chunk(state: JobState) -> JobState:
chunk_text = state["chunks"][state["chunk_index"]]
segment_path = f"segment_{state['chunk_index']:04d}.mp3"

client = OpenAI()

# Stream directly to disk (memory efficient)
with client.audio.speech.with_streaming_response.create(
model="gpt-4o-mini-tts",
voice=state["voice"],
input=chunk_text,
response_format="mp3",
) as response:
response.stream_to_file(segment_path)

# Upload to R2
r2_url = upload_to_r2(segment_path, state["job_id"])

return {
**state,
"segment_urls": [*state["segment_urls"], r2_url],
"chunk_index": state["chunk_index"] + 1,
}

Audio Merging Strategy

The system uses ffmpeg for high-quality concatenation:

async def node_generate_audio(state: TextState) -> TextState:
# Generate all segments...

# Create concat list for ffmpeg
file_list_path.write_text(
"\n".join(f"file '{segment}'" for segment in segment_paths)
)

# Merge using ffmpeg (codec copy - fast and lossless)
subprocess.run([
"ffmpeg", "-f", "concat", "-safe", "0",
"-i", str(file_list_path),
"-c", "copy", # No re-encoding
str(merged_path)
])

# Fallback to binary concatenation if ffmpeg unavailable
if not merged_path.exists():
with open(merged_path, "wb") as merged:
for segment in segment_paths:
merged.write(segment.read_bytes())

Cloudflare R2 Integration

S3-compatible storage for globally distributed audio:

def get_r2_client():
return boto3.client(
's3',
endpoint_url=f'https://{R2_ACCOUNT_ID}.r2.cloudflarestorage.com',
aws_access_key_id=R2_ACCESS_KEY_ID,
aws_secret_access_key=R2_SECRET_ACCESS_KEY,
config=Config(signature_version='s3v4'),
)

def upload_to_r2(file_path: Path, job_id: str) -> str:
key = f"{job_id}/{file_path.name}"

client.put_object(
Bucket=R2_BUCKET_NAME,
Key=key,
Body=file_path.read_bytes(),
ContentType='audio/mpeg',
)

return f"{R2_PUBLIC_DOMAIN}/{key}"

Structured Content Generation

Narrative Architecture Framework

The system implements a flexible content framework with customizable sections:

Key Components:

  1. Introduction (2-3 min): Hook the listener and set expectations
  2. Context: Background information and relevance
  3. Core Content: Main topic introduction with clear structure
  4. Examples: Concrete illustrations and case studies
  5. Deep Dive: Detailed exploration of key concepts
  6. Applications: Practical use cases and implementation
  7. Advanced Topics: Nuanced discussion for engaged learners
  8. Synthesis: Connect all concepts together
  9. Takeaways: Summary of key points
  10. Conclusion: Clear closing and next steps

Dynamic Content Adaptation

def build_content_prompt(state: TextState) -> str:
minutes = state.get("target_duration_minutes") or 5
target_words = int(minutes * 150) # 150 words per minute narration

content_type = state.get("content_type")

# Select architecture based on content type
architecture = generate_content_architecture(content_type)

return f"""
Create a {state['language']} narrative for audio:

TOPIC: {state['title']}
TYPE: {content_type}
TARGET: {target_words} words ({minutes} minutes)

{architecture}

RESEARCH CONTEXT:
{format_research_summary(state['research_items'])}

Requirements:
- Plain text only (no markdown)
- Natural paragraph breaks
- Engaging, clear tone
- Appropriate language for audio listening
"""

API Endpoints

FastAPI Service Layer

Endpoint Implementations:

@app.post("/api/research/generate")
async def research_endpoint(req: ResearchRequest):
"""Generate research context using LangGraph + DeepSeek."""
return await generate_research(req)

@app.post("/api/text/generate")
async def text_endpoint(req: TextGenerationRequest):
"""Generate long-form text content (text-only mode)."""
return await generate_text(req)

@app.post("/api/audio/generate")
async def audio_endpoint(req: TextGenerationRequest):
"""Generate audio from existing content (audio-only mode)."""
return await generate_audio(req)

@app.post("/api/tts/generate")
async def tts_endpoint(req: TTSRequest, background_tasks: BackgroundTasks):
"""Generic TTS generation (fire-and-forget)."""
return await generate_tts(req, background_tasks)

Deployment Architecture

Docker Containerization

FROM python:3.12-slim

WORKDIR /app

# Install ffmpeg for audio merging
RUN apt-get update && apt-get install -y ffmpeg && rm -rf /var/lib/apt/lists/*

# Install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY . .

# Expose port
EXPOSE 8080

# Run FastAPI server
CMD ["uvicorn", "langgraph_server:app", "--host", "0.0.0.0", "--port", "8080"]

Environment Configuration

# AI Services
DEEPSEEK_API_KEY=sk-...
OPENAI_API_KEY=sk-...

# Database (Neon Postgres)
DATABASE_URL=postgresql://user:pass@host/db?sslmode=require

# Cloudflare R2
R2_ACCOUNT_ID=...
R2_ACCESS_KEY_ID=...
R2_SECRET_ACCESS_KEY=...
R2_BUCKET_NAME=longform-tts
R2_PUBLIC_DOMAIN=https://pub-longform-tts.r2.dev

Cloudflare Workers Deployment

# wrangler.toml
name = "langgraph-tts"
compatibility_date = "2024-01-01"

[build]
command = "docker build -t langgraph-tts ."

[[services]]
name = "langgraph-tts"
image = "langgraph-tts:latest"

[env]
PORT = "8080"

[[r2_buckets]]
binding = "LONGFORM_TTS"
bucket_name = "longform-tts"

Production Considerations

Performance Metrics for Long-Running Jobs

Benchmarks (30-minute audio generation):

StageDurationCheckpointedRetryable
Text Generation (DeepSeek)30-60s✅ After completion✅ Full retry
Text Chunking<1s✅ After completion✅ Instant
TTS Segments (1–12)10-20s each✅ After each segment✅ Per-segment
Audio Merging (ffmpeg)1–3s✅ After completion✅ Full retry
R2 Upload (merged)2-5s✅ After completion✅ Full retry
Total Pipeline3-5 minutes15+ checkpointsGranular recovery

Long-Running Job Profile:

# Example: 2-hour audiobook chapter
text_length = 18,000 words
chunks = 45 # ~4,000 chars each
tts_time = 45 * 15s = 11.25 minutes
text_gen_time = 2-3 minutes
total_time = ~15 minutes for 2-hour audio

# Checkpoint frequency:
# - 1 after text generation
# - 45 after each TTS segment
# - 1 after merge
# Total: 47 recovery points

Failure Recovery Times:

  • Crash at 80% complete → Resume in 1–2 seconds, continue from segment 36/45
  • Network timeout on segment 20 → Retry only segment 20, not segments 1–19
  • Database connection loss → Reconnect and load last checkpoint (<500ms)

Error Handling & Resilience

async def node_generate_text(state: TextState) -> TextState:
try:
llm = ChatDeepSeek(model="deepseek-chat", temperature=0.7, max_tokens=2500)
prompt = build_therapeutic_prompt(state)

resp = await llm.ainvoke([HumanMessage(content=prompt)])
text = clean_for_tts(resp.content)

return {**state, "generated_text": text, "error": None}
except Exception as e:
print(f"❌ Text generation failed: {e}")
return {**state, "error": str(e)}

Monitoring & Observability

Key metrics to track:

  1. Generation Metrics:

    • Text generation latency (DeepSeek)
    • TTS latency per chunk (OpenAI)
    • Total pipeline duration
  2. Quality Metrics:

    • Text length vs target duration
    • Chunk count and size distribution
    • Audio segment file sizes
  3. Infrastructure Metrics:

    • R2 upload success rate
    • Database checkpoint writes
    • ffmpeg merge success rate
  4. Cost Metrics:

    • DeepSeek token usage
    • OpenAI TTS character count
    • R2 storage and bandwidth

Scaling Patterns

Horizontal Scaling:

  • FastAPI instances behind load balancer
  • Stateless design (state in Postgres)
  • R2 for distributed storage

Batch Processing:

async def batch_generate_audio(goal_ids: List[int]):
"""Process multiple goals in parallel."""
tasks = [run_pipeline(build_state(id), f"batch-{id}") for id in goal_ids]
results = await asyncio.gather(*tasks, return_exceptions=True)
return results

Queue-Based Processing:

  • Use background tasks for long-running jobs
  • Celery/Redis for distributed task queue
  • Webhook callbacks for completion notifications

Performance Optimization

Chunking Optimization

# Optimize chunk size for TTS quality vs API limits
OPTIMAL_CHUNK_SIZE = 3000 # Sweet spot for natural pauses

# Parallel TTS generation (with rate limiting)
async def parallel_tts_generation(chunks: List[str], max_concurrent: int = 3):
semaphore = asyncio.Semaphore(max_concurrent)

async def generate_with_limit(chunk, index):
async with semaphore:
return await generate_tts_segment(chunk, index)

tasks = [generate_with_limit(c, i) for i, c in enumerate(chunks)]
return await asyncio.gather(*tasks)

Caching Strategy

# Cache research results for similar goals
@lru_cache(maxsize=100)
def get_research_for_goal_type(therapeutic_type: str, age: int):
"""Cache research by type + age bracket."""
return fetch_research(therapeutic_type, age)

# Cache text generation for re-use
async def get_or_generate_text(goal_id: int):
existing = await db.fetch_story(goal_id)
if existing and existing.created_at > datetime.now() - timedelta(days=7):
return existing.text
return await generate_new_text(goal_id)

Testing Strategy

Unit Tests

def test_chunk_text_respects_limit():
long_text = "word " * 2000
chunks = chunk_text(long_text, max_chars=4000)

for chunk in chunks:
assert len(chunk) <= 4000

def test_clean_for_tts_removes_markdown():
text = "# Title\n\n**bold** and `code`"
cleaned = clean_for_tts(text)
assert "#" not in cleaned
assert "**" not in cleaned
assert "`" not in cleaned

Integration Tests

@pytest.mark.asyncio
async def test_full_pipeline():
state = {
"goal_id": 1,
"goal_title": "Test anxiety reduction",
"therapeutic_goal_type": "anxiety_reduction",
"age": 8,
# ... other fields
}

result = await run_pipeline(state, "test-thread-1")

assert result["generated_text"] is not None
assert len(result["chunks"]) > 0
assert result["audio_url"] is not None
assert result["error"] is None

Lessons Learned

1. State Design is Critical

  • Use TypedDict for type safety
  • Keep state flat (avoid deep nesting)
  • Include metadata for debugging (timestamps, IDs)

2. Checkpoint Strategically

  • Not all workflows need checkpointing
  • Audio-only mode: disable checkpoints to avoid schema issues
  • Use thread_id conventions: {workflow}-{entity_id}-{timestamp}

3. Error Recovery

  • Graceful degradation (segments work even if merge fails)
  • Fallback strategies (binary concat if ffmpeg unavailable)
  • Preserve partial results (individual segments in R2)

4. Cost Management

  • Monitor token usage (DeepSeek is cost-effective at 0.14/0.14/0.28 per 1M tokens)
  • OpenAI TTS: $15 per 1M characters
  • R2 storage: $0.015/GB/month (much cheaper than S3)

5. Content Quality

  • Structured frameworks improve consistency
  • Repetition aids retention and comprehension
  • Audience-appropriate language is crucial for engagement

Future Enhancements

1. Multi-Voice Narratives

# Support character dialogue with different voices
voices = {
"narrator": "cedar",
"child_character": "nova",
"parent_character": "marin"
}

2. Emotion-Adaptive TTS

# Adjust voice parameters based on content emotion
def get_tts_params(text: str) -> dict:
sentiment = analyze_sentiment(text)

if sentiment == "calm":
return {"speed": 0.9, "pitch": 0}
elif sentiment == "energetic":
return {"speed": 1.1, "pitch": 2}

3. Real-Time Streaming

# Stream audio as it's generated (SSE)
async def stream_audio_generation(goal_id: int):
async for chunk_url in generate_audio_stream(goal_id):
yield f"data: {json.dumps({'chunk_url': chunk_url})}\n\n"

4. Multilingual Support

  • Expand beyond Romanian and English
  • Voice selection per language
  • Cultural adaptation of content frameworks

Conclusion

This LangGraph-based TTS architecture demonstrates several key patterns:

  1. Composable Workflows: Three distinct pipelines sharing common components
  2. Conditional Routing: Smart flow control based on state
  3. Durable Execution: PostgreSQL checkpointing for resilience
  4. Streaming Efficiency: Direct-to-disk TTS for memory optimization
  5. Distributed Storage: R2 for globally accessible audio

The system successfully processes 5-30+ minute long-form narratives (up to 7,000+ words), generating research-backed content, converting to high-quality audio, and delivering via CDN—all while maintaining resumability after failures and full observability.

Real-World Performance:

  • 30-minute generation: 12-15 TTS chunks, ~3-5 minutes total processing time
  • Failure recovery: Resume from any checkpoint in <1 second
  • Cost efficiency: 0.020.02-0.07 per 30-minute audio (DeepSeek + OpenAI TTS)
  • Throughput: 10+ concurrent jobs on single instance

Key Takeaways for Long-Running Pipelines:

  • LangGraph + Postgres checkpointing is essential for long-form workflows
  • Streaming TTS to disk prevents memory exhaustion on long generations
  • Smart chunking (4K chars) balances API limits with narrative coherence
  • Immediate R2 uploads ensure partial results survive crashes
  • Async architecture enables fire-and-forget long operations
  • Thread-based recovery makes interrupted jobs trivial to resume

The architecture scales to long-form audio generation: audiobooks (10+ hours), comprehensive courses, documentary narration, or serialized storytelling—any use case where reliability and resumability are non-negotiable.

References


This architecture powers long-form audio generation, combining LangGraph orchestration, OpenAI TTS streaming, and distributed storage for production-ready AI audio systems.