
Agent Evals (Golden Datasets, LLM-as-Judge)

Evals are systematic tests of agent quality across a representative input set. Different from unit tests: outputs are open-ended (not exact-match), so success is graded by automated rules, regex/JSON checks, or another LLM (LLM-as-judge). Without evals, every prompt change is a gut-feel gamble.
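A minimal sketch of the rule-based grading tier: one check for JSON structure and one regex check. The "answer" field name and the date pattern are illustrative assumptions, not fixed conventions.

```python
import json
import re

def grade_json(output: str) -> bool:
    # Rule-based check: valid JSON with a non-empty "answer" field (field name is illustrative).
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and bool(data.get("answer"))

def grade_regex(output: str) -> bool:
    # Regex check: output must contain an ISO-style date (an example rule, not a standard).
    return re.search(r"\b\d{4}-\d{2}-\d{2}\b", output) is not None

print(grade_json('{"answer": "refund approved"}'))    # True
print(grade_regex("Estimated delivery: 2024-11-03"))  # True
```

Checks like these cover the structured cases; the open-ended, subjective cases are where an LLM-as-judge comes in.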

Memory anchor

Evals are A/B testing for prompts — the dataset is your test panel, the grader is your survey instrument. Without that setup, prompt changes are the chef taste-testing his own cooking.

Expected depth

Components of an eval harness: (1) golden dataset — 50–500 representative inputs labeled with expected behavior; (2) runner — invokes the agent on each input; (3) grader — scores outputs (rule-based for structured tasks, LLM-judge for subjective; humans for the spot-check tier); (4) metrics — pass@k, accuracy, latency, cost per request; (5) regression tracking — store every run so you can diff prompt changes. Tools: Langfuse, Braintrust, PromptFoo, Inspect AI, OpenAI Evals. Workflow: change a prompt or model → re-run evals → compare metrics → ship if improved, revert if regressed. Treat evals like CI tests — block merges on regression.
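A minimal harness sketch tying these components together; `run_agent`, the golden cases, and the contains-based grader are illustrative placeholders rather than any specific tool's API.

```python
# Minimal eval harness: golden dataset -> runner -> grader -> metrics -> stored run.
import json
import time
from statistics import mean

GOLDEN = [  # golden dataset: representative inputs labeled with expected behavior
    {"input": "Cancel my order #123", "expect_contains": "cancel"},
    {"input": "What's your refund policy?", "expect_contains": "refund"},
]

def run_agent(user_input: str) -> str:
    # Placeholder: swap in your real agent / LLM call here.
    return f"Sure, I can help with that: {user_input}"

def grade(output: str, case: dict) -> bool:
    # Rule-based grader for this toy task; use an LLM judge for subjective outputs.
    return case["expect_contains"].lower() in output.lower()

def run_evals() -> dict:
    results = []
    for case in GOLDEN:
        start = time.time()
        output = run_agent(case["input"])
        results.append({
            "input": case["input"],
            "passed": grade(output, case),
            "latency_s": round(time.time() - start, 3),
        })
    summary = {"accuracy": mean(r["passed"] for r in results), "results": results}
    # Regression tracking: persist every run so prompt changes can be diffed later.
    with open(f"eval_run_{int(time.time())}.json", "w") as f:
        json.dump(summary, f, indent=2)
    return summary

if __name__ == "__main__":
    print(run_evals()["accuracy"])  # e.g. 1.0 on this toy dataset
```

In the workflow above, a run like this would execute in CI after every prompt or model change, and the stored run files are what you diff to decide ship or revert.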

Deep — senior internals

Eval failure modes: (a) golden dataset that misses edge cases — production inputs fail because nothing in the dataset resembles them; (b) LLM-judge bias — the same model that produces the output also grades it and favors its own style; mitigate by judging with a different model family or by anchoring judges with explicit rubrics; (c) overfitting — prompt iteration tuned to pass the eval set but breaking on real users; rotate the dataset and hold out a private test set; (d) eval cost — 500 inputs × 3 graders × every prompt change adds up; prefer rule-based grading where possible, use a smaller, cheaper judge model for the rest, and reserve the expensive LLM-judge for the subjective tier. Pass@k: run the same input k times and count success if any attempt succeeds — useful for stochastic tasks where the model only occasionally gets it right (see the sketch below). Production traces (Langfuse, Honeycomb) feed back into the dataset — sample real user inputs that failed, label them, add them to the golden set. Senior engineers ship eval harnesses alongside the agent; junior engineers ship prompts and pray.
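A short sketch of pass@k and a rubric-anchored judge; `run_agent` and `call_judge_model` are hypothetical callables you supply, and the judge is assumed to come from a different model family than the agent.

```python
# Pass@k: run the same input k times; count success if any attempt passes the grader.
def pass_at_k(run_agent, grade, user_input: str, k: int = 3) -> bool:
    return any(grade(run_agent(user_input)) for _ in range(k))

# Rubric-anchored LLM judge. `call_judge_model` should hit a different model family
# than the one that produced the answer, to reduce self-grading bias.
JUDGE_RUBRIC = (
    "Score the answer from 1 to 5 against this rubric:\n"
    "5 = correct, complete, and grounded in the question; 3 = partially correct; "
    "1 = wrong or off-topic.\nReturn only the number."
)

def judge(call_judge_model, question: str, answer: str) -> int:
    prompt = f"{JUDGE_RUBRIC}\n\nQuestion: {question}\nAnswer: {answer}"
    return int(call_judge_model(prompt).strip())
```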

🎤 Interview-ready answer

Evals are systematic quality tests for an agent — inputs in, outputs graded against expected behavior. Different from unit tests because outputs are open-ended, so grading uses rules, regex, or another LLM (LLM-as-judge). I treat them like CI tests: golden dataset of 50–500 representative inputs, runner that invokes the agent, grader that scores outputs, regression tracking so I can diff prompt changes. Tools I'd reach for: Langfuse, Braintrust, PromptFoo, Inspect AI. The biggest risks are overfitting to the eval set (so I keep a private holdout), LLM-judge bias (I use a different model family for grading or anchor judges with rubrics), and dataset gaps (I sample real production failures and label them back into the set). Without evals, every prompt change is a gut-feel gamble.

Common trap

Shipping prompts based on a few hand-tested inputs without a golden dataset. The agent might pass your 5 favorite test cases and fail on 30% of real production inputs you never thought to try.