3 posts tagged with "deepeval"

Multi-Modal Evaluation for AI-Generated LEGO Parts: A Production DeepEval Pipeline

· 19 min read
Vadim Nicolai
Senior Software Engineer

Your AI pipeline generates a parts list for a LEGO castle MOC. It says you need 12x "Brick 2 x 4" in Light Bluish Gray, 8x "Arch 1 x 4" in Dark Tan, and 4x "Slope 45 2 x 1" in Sand Green. The text looks plausible. But does the part image next to "Arch 1 x 4" actually show an arch? Does the quantity make sense for a castle build? Would this list genuinely help someone source bricks for the build?

These are multi-modal evaluation questions — they span text accuracy, image-text coherence, and practical usefulness. Standard unit tests cannot answer them. This article walks through a production evaluation pipeline built with DeepEval that evaluates AI-generated LEGO parts lists across five axes, using image metrics that most teams haven't touched yet.

The system is real. It runs in Bricks, a LEGO MOC discovery platform built with Next.js 19, LangGraph, and Neon PostgreSQL. The evaluation judge is DeepSeek — not GPT-4o — because you don't need a frontier model to grade your outputs.
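The five-axis idea can be sketched without any library at all. The snippet below is a minimal, illustrative model of the scoring harness: the axis names, weights, and threshold are assumptions for this example, not the article's actual rubric, and in the real pipeline each score would come from a DeepEval metric graded by the DeepSeek judge rather than being hard-coded.

```python
from dataclasses import dataclass

# Illustrative axis names only; the article's actual five axes may differ.
AXES = ["text_accuracy", "image_text_coherence", "quantity_plausibility",
        "color_validity", "practical_usefulness"]

@dataclass
class PartsListEval:
    scores: dict[str, float]  # each axis scored 0.0-1.0 by an LLM judge

    def overall(self) -> float:
        # Unweighted mean across axes; a production rubric might weight them.
        return sum(self.scores[a] for a in AXES) / len(AXES)

    def passes(self, threshold: float = 0.7) -> bool:
        # Gate on every axis, not just the mean, so one bad axis fails the case.
        return all(self.scores[a] >= threshold for a in AXES)

result = PartsListEval(scores={
    "text_accuracy": 0.9,
    "image_text_coherence": 0.4,  # e.g. the image next to "Arch 1 x 4" shows a slope
    "quantity_plausibility": 0.8,
    "color_validity": 1.0,
    "practical_usefulness": 0.7,
})
print(round(result.overall(), 2))  # 0.76
print(result.passes())             # False: the coherence axis is below threshold
```

Gating on every axis rather than the mean is the important design choice: a parts list with a perfect text score and a wrong image should still fail.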

Synthetic Evaluation with DeepEval: A Production RAG Testing Framework

· 13 min read
Vadim Nicolai
Senior Software Engineer

Your RAG pipeline passes all 20 of your hand-written test questions. It retrieves the right context, generates grounded answers, and the demo looks great. Then it goes to production, and users start asking the 21st question: the one that exposes a retrieval gap, a hallucinated citation, or a context window that silently truncated the most relevant chunk. Twenty hand-written tests against a 55-document knowledge base covers only a sliver of the questions users will actually ask. The rest is untested surface area.

This guide shows how to close that gap. We walk through a production implementation that generates 330+ synthetic test cases from 55 AI engineering lessons, evaluates a LangGraph-based RAG pipeline across 10+ metrics, and runs hyperparameter sweeps to find optimal retrieval configurations — all automated with DeepEval and pytest.
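The core move is mechanical: multiply documents by question styles. In the article this is done by DeepEval's Synthesizer against real lesson content; the sketch below is a library-free stand-in where the documents and templates are made-up placeholders, just to show where a number like "330+ cases from 55 lessons" comes from.

```python
# 55 placeholder documents standing in for the 55 AI engineering lessons.
documents = [f"lesson-{i:02d}" for i in range(1, 56)]

# Hypothetical question styles; a real synthesizer derives questions
# from document content rather than from fixed templates.
TEMPLATES = [
    "What does {doc} say about its core concept?",
    "Summarize the key takeaway of {doc}.",
    "What common mistake does {doc} warn against?",
    "How does {doc} relate to the previous lesson?",
    "Give an example application of the idea in {doc}.",
    "What prerequisite knowledge does {doc} assume?",
]

synthetic_cases = [t.format(doc=d) for d in documents for t in TEMPLATES]
print(len(synthetic_cases))  # 330 cases: 55 docs x 6 question styles

hand_written = 20
print(f"{hand_written / len(synthetic_cases):.1%}")  # 6.1% of the synthetic suite
```

Even against this modest synthetic suite, the 20 hand-written questions cover a small fraction of the surface, and real user queries are broader still.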

DeepEval for Healthcare AI: Eval-Driven Compliance That Actually Catches PII Leakage Before the FDA Does

· 20 min read
Vadim Nicolai
Senior Software Engineer

The most dangerous failure mode for a healthcare AI isn't inaccuracy—it's a compliance breach you didn't test for. A model can generate a perfect clinical summary and still violate HIPAA by hallucinating a patient's name that never existed. Under the Breach Notification Rule, that fabricated yet plausible Protected Health Information (PHI) constitutes a reportable incident. Most teams discover these gaps during an audit or, worse, after a breach. The alternative is to treat compliance not as a post-hoc checklist, but as an integrated, automated evaluation layer that fails your CI pipeline before bad code ships. This is eval-driven compliance, and it's the only way to build healthcare AI that doesn't gamble with regulatory extinction.

Reference implementation: Every code example in this article is drawn from Agentic Healthcare, an open-source blood test intelligence app that tracks 7 clinical ratios over time using velocity-based trajectory analysis. The full eval suite, compliance architecture, and production code are available in the GitHub repository.
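The "fabricated name is still reportable PHI" failure mode lends itself to a simple automated gate. The sketch below is a deliberately naive heuristic, not the repository's actual detector (a production suite would use NER and DeepEval's PII metrics): it flags name-like tokens in a generated summary that never appear in the source record, since a hallucinated-but-plausible name must be treated as a leak. The function name, record, and summary are all hypothetical.

```python
import re

def hallucinated_names(summary: str, source_record: str) -> list[str]:
    """Flag title-prefixed, name-like tokens in the summary that never
    appear in the source record. A fabricated yet plausible patient name
    still counts as a reportable PHI incident, so any unmatched name is
    treated as a leak. Simplified heuristic for illustration only."""
    candidates = re.findall(r"\b(?:Mr\.|Mrs\.|Ms\.|Dr\.)\s+[A-Z][a-z]+", summary)
    return [name for name in candidates if name not in source_record]

source = "Patient presented with elevated ALT/AST ratio. Follow-up in 6 weeks."
summary = "Mr. Halloway's ALT/AST ratio is elevated; follow-up in 6 weeks."

leaks = hallucinated_names(summary, source)
print(leaks)  # ['Mr. Halloway']

# As a pytest-style CI gate, this fails the pipeline before bad code ships:
# assert not leaks, f"Hallucinated PHI detected: {leaks}"
```

Wiring a check like this into pytest is what makes the compliance layer eval-driven rather than a post-hoc checklist: the build fails the moment a model starts inventing patients.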