
Agent Evals (Golden Datasets, LLM-as-Judge)

Evals are systematic tests of agent quality across a representative input set. Different from unit tests: outputs are open-ended (not exact-match), so success is graded by automated rules, regex/JSON checks, or another LLM (LLM-as-judge). Without evals, every prompt change is a gut-feel gamble.
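A minimal sketch of the rule-based grading tier: one check for JSON structure and one regex check. The "answer" field name and the date pattern are illustrative assumptions, not fixed conventions.

```python
import json
import re

def grade_json(output: str) -> bool:
    # Rule-based check: valid JSON with a non-empty "answer" field (field name is illustrative).
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and bool(data.get("answer"))

def grade_regex(output: str) -> bool:
    # Regex check: output must contain an ISO-style date (an example rule, not a standard).
    return re.search(r"\b\d{4}-\d{2}-\d{2}\b", output) is not None

print(grade_json('{"answer": "refund approved"}'))    # True
print(grade_regex("Estimated delivery: 2024-11-03"))  # True
```

Checks like these cover the structured cases; the open-ended, subjective cases are where an LLM-as-judge comes in.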

Memory anchor

Evals are A/B testing for prompts — the dataset is your test panel, the grader is your survey instrument. Without that setup, prompt changes are the chef taste-testing his own cooking.

Expected depth

Components of an eval harness: (1) golden dataset — 50–500 representative inputs labeled with expected behavior; (2) runner — invokes the agent on each input; (3) grader — scores outputs (rule-based for structured tasks, LLM-judge for subjective; humans for the spot-check tier); (4) metrics — pass@k, accuracy, latency, cost per request; (5) regression tracking — store every run so you can diff prompt changes. Tools: Langfuse, Braintrust, PromptFoo, Inspect AI, OpenAI Evals. Workflow: change a prompt or model → re-run evals → compare metrics → ship if improved, revert if regressed. Treat evals like CI tests — block merges on regression.
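A minimal harness sketch tying these components together; `run_agent`, the golden cases, and the contains-based grader are illustrative placeholders rather than any specific tool's API.

```python
# Minimal eval harness: golden dataset -> runner -> grader -> metrics -> stored run.
import json
import time
from statistics import mean

GOLDEN = [  # golden dataset: representative inputs labeled with expected behavior
    {"input": "Cancel my order #123", "expect_contains": "cancel"},
    {"input": "What's your refund policy?", "expect_contains": "refund"},
]

def run_agent(user_input: str) -> str:
    # Placeholder: swap in your real agent / LLM call here.
    return f"Sure, I can help with that: {user_input}"

def grade(output: str, case: dict) -> bool:
    # Rule-based grader for this toy task; use an LLM judge for subjective outputs.
    return case["expect_contains"].lower() in output.lower()

def run_evals() -> dict:
    results = []
    for case in GOLDEN:
        start = time.time()
        output = run_agent(case["input"])
        results.append({
            "input": case["input"],
            "passed": grade(output, case),
            "latency_s": round(time.time() - start, 3),
        })
    summary = {"accuracy": mean(r["passed"] for r in results), "results": results}
    # Regression tracking: persist every run so prompt changes can be diffed later.
    with open(f"eval_run_{int(time.time())}.json", "w") as f:
        json.dump(summary, f, indent=2)
    return summary

if __name__ == "__main__":
    print(run_evals()["accuracy"])  # e.g. 1.0 on this toy dataset
```

In the workflow above, a run like this would execute in CI after every prompt or model change, and the stored run files are what you diff to decide ship or revert.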

Deep — senior internals

Eval failure modes: (a) golden dataset that misses edge cases — production inputs fail because nothing in the dataset resembles them; (b) LLM-judge bias — the same model that produces the output also grades it and favors its own style; mitigate by judging with a different model family or by anchoring judges with explicit rubrics; (c) overfitting — prompt iteration tuned to pass the eval set but breaking on real users; rotate the dataset and hold out a private test set; (d) eval cost — 500 inputs × 3 graders × every prompt change adds up; prefer rule-based grading where possible, use a smaller, cheaper judge model for the rest, and reserve the expensive LLM-judge for the subjective tier. Pass@k: run the same input k times and count success if any attempt succeeds — useful for stochastic tasks where the model only occasionally gets it right (see the sketch below). Production traces (Langfuse, Honeycomb) feed back into the dataset — sample real user inputs that failed, label them, add them to the golden set. Senior engineers ship eval harnesses alongside the agent; junior engineers ship prompts and pray.
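A short sketch of pass@k and a rubric-anchored judge; `run_agent` and `call_judge_model` are hypothetical callables you supply, and the judge is assumed to come from a different model family than the agent.

```python
# Pass@k: run the same input k times; count success if any attempt passes the grader.
def pass_at_k(run_agent, grade, user_input: str, k: int = 3) -> bool:
    return any(grade(run_agent(user_input)) for _ in range(k))

# Rubric-anchored LLM judge. `call_judge_model` should hit a different model family
# than the one that produced the answer, to reduce self-grading bias.
JUDGE_RUBRIC = (
    "Score the answer from 1 to 5 against this rubric:\n"
    "5 = correct, complete, and grounded in the question; 3 = partially correct; "
    "1 = wrong or off-topic.\nReturn only the number."
)

def judge(call_judge_model, question: str, answer: str) -> int:
    prompt = f"{JUDGE_RUBRIC}\n\nQuestion: {question}\nAnswer: {answer}"
    return int(call_judge_model(prompt).strip())
```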

🎤 Interview-ready answer

Evals are systematic quality tests for an agent — inputs in, outputs graded against expected behavior. Different from unit tests because outputs are open-ended, so grading uses rules, regex, or another LLM (LLM-as-judge). I treat them like CI tests: golden dataset of 50–500 representative inputs, runner that invokes the agent, grader that scores outputs, regression tracking so I can diff prompt changes. Tools I'd reach for: Langfuse, Braintrust, PromptFoo, Inspect AI. The biggest risks are overfitting to the eval set (so I keep a private holdout), LLM-judge bias (I use a different model family for grading or anchor judges with rubrics), and dataset gaps (I sample real production failures and label them back into the set). Without evals, every prompt change is a gut-feel gamble.

Common trap

Shipping prompts based on a few hand-tested inputs without a golden dataset. The agent might pass your 5 favorite test cases and fail on 30% of real production inputs you never thought to try.