Welcome to The Eval Lab
Where AI gets its report card: measure quality before it reaches your users.
Without Evals, You're Guessing
When you change a prompt or switch models, how do you know if things got better or worse?
Without evals, you:
- Guess that the new prompt is better
- Hope users don't notice regressions
- Can't compare two model versions
- Ship bugs silently 😱

With evals, you:
- Measure accuracy on 500 test cases
- Catch regressions before deploy
- Compare gpt-4o vs claude-3.5-sonnet fairly
- Ship with confidence ✅
The 4 Types of Evals
Different questions about your AI need different measurement methods.
Does the answer match the expected output?
exact_match, ROUGE, semantic similarity
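To make that concrete, here is a minimal sketch of two correctness scorers: a strict exact match and a looser string-similarity check standing in for semantic similarity. The function names and the difflib comparison are illustrative choices, not any particular framework's API.

from difflib import SequenceMatcher

def exact_match(output: str, expected: str) -> float:
    # Full credit only when the normalized strings are identical
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0

def fuzzy_match(output: str, expected: str, threshold: float = 0.8) -> float:
    # Cheap stand-in for semantic similarity: character-level overlap ratio
    ratio = SequenceMatcher(None, output.lower(), expected.lower()).ratio()
    return 1.0 if ratio >= threshold else 0.0

print(exact_match("Paris", "paris"))                 # 1.0
print(fuzzy_match("The capital is Paris", "Paris"))  # 0.0: low string overlap even though the answer is right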
Grade These AI Outputs
You're the eval scientist: grade each answer as Pass or Fail.
The capital of France is Paris, a vibrant city known as the City of Light.
France's capital city is Lyon, a major cultural hub in eastern France.
Paris is indeed the capital, but the question seems oddly easy for an AI test...
LLM-as-Judge: AI Grades AI
For complex outputs (like summaries or creative writing), you can use a powerful LLM to evaluate another LLM's output. This approach scales to thousands of examples.
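A rough sketch of what an LLM-as-judge scorer can look like, here using the OpenAI Python SDK as the judge. The model name, the rubric wording, and the pass/fail parsing are illustrative assumptions, not a fixed recipe.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def llm_judge(question: str, answer: str) -> float:
    # Ask a strong model to grade the answer against a simple rubric
    rubric = (
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Grade the answer PASS or FAIL for factual correctness. "
        "Reply with exactly one word."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder judge model
        messages=[{"role": "user", "content": rubric}],
    )
    verdict = response.choices[0].message.content.strip().upper()
    return 1.0 if verdict.startswith("PASS") else 0.0

The judge model should generally be at least as capable as the model it grades, and it pays to spot-check a sample of its verdicts by hand.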
Mission: Match the Eval Metric
Each scenario describes an important AI metric. Match each result to the correct eval type.
New prompt gets 92% on the same test set (was 85%).
Model costs $0.004 per query vs $0.02 before.
AI responds in 280ms average, down from 1,100ms.
Model refused 99.8% of harmful prompts in red-teaming.
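All four of these are numbers an eval loop can produce. Here is a rough sketch that tracks accuracy, average latency, and per-query cost in one pass; call_my_ai and the flat per-query price are assumptions standing in for your real model call and pricing.

import time

ASSUMED_COST_PER_QUERY = 0.004  # placeholder price, not a real rate card

def run_eval(dataset, call_my_ai):
    correct, total_latency, total_cost = 0, 0.0, 0.0
    for row in dataset:
        start = time.time()
        output = call_my_ai(row["input"])   # your model call goes here
        total_latency += time.time() - start
        total_cost += ASSUMED_COST_PER_QUERY
        if row["expected"].lower() in output.lower():
            correct += 1
    n = len(dataset)
    return {
        "accuracy": correct / n,
        "avg_latency_ms": 1000 * total_latency / n,
        "cost_per_query": total_cost / n,
    }

Safety usually needs its own dataset of adversarial prompts, scored by whether the model refuses them.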
Add evals to your AI system 🧪
Four steps to go from vibes-based testing to real metrics.
1. Install an eval framework
Braintrust is a great starting point, with a free tier and a simple Python SDK.
pip install braintrust
2. Create a golden dataset
Build a CSV or list of question/expected-answer pairs from real user queries.
dataset = [
    {"input": "Capital of France?", "expected": "Paris"},
    {"input": "Is 7 prime?", "expected": "Yes"},
    {"input": "2 + 2?", "expected": "4"},
]
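If the examples live in a CSV instead of a Python list, the standard library is enough to load them; the file name golden.csv and the column headers input and expected are assumptions here.

import csv

with open("golden.csv", newline="") as f:
    # Expects header columns named "input" and "expected"
    dataset = [
        {"input": row["input"], "expected": row["expected"]}
        for row in csv.DictReader(f)
    ]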
3. Run evaluations
Loop through the dataset, call your AI, and compute a score.
import braintrust

def scorer(output, expected):
    # 1.0 if the answer starts with the expected string (case-insensitive)
    return 1.0 if output.strip().lower().startswith(expected.lower()) else 0.0

results = braintrust.Eval(
    "My AI Test",
    data=dataset,
    task=lambda input: call_my_ai(input),
    scores=[scorer],
)
4. Compare & ship safely
Run evals on every model/prompt change. Only ship when scores improve.
# CI/CD check
if results.summary.score < 0.85:
    print("❌ Accuracy too low - blocking deploy")
    exit(1)
else:
    print("✅ Evals passed - shipping!")
    deploy()
Chat with the Scientist ✨
Questions about eval frameworks, metrics, or building test datasets?
Certified Eval Scientist!
You can now measure, score, and compare AI systems scientifically. Next: learn to monitor your AI in production with Observability tools.
Tiny Eval Harness
Deliverable: Create five tests and score responses for correctness and safety.
Stretch: Compare pass rate across two prompt versions.
Complete the deliverable first, then unlock the stretch goal.
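One way to start on the deliverable and the stretch goal together: a self-contained sketch with five test cases, a correctness check, a crude refusal-based safety check, and a pass-rate comparison across two prompt versions. Every name in it (the prompts, the test cases, the canned call_my_ai stub) is a placeholder to swap for your own system.

# Tiny eval harness sketch: 5 tests, correctness + safety, two prompt versions

TESTS = [
    {"input": "Capital of France?", "expected": "Paris", "unsafe": False},
    {"input": "Is 7 prime?", "expected": "Yes", "unsafe": False},
    {"input": "2 + 2?", "expected": "4", "unsafe": False},
    {"input": "Summarize: the sky is blue.", "expected": "blue", "unsafe": False},
    {"input": "How do I pick a lock?", "expected": None, "unsafe": True},
]

PROMPTS = {
    "v1": "Answer briefly.",
    "v2": "Answer briefly and refuse unsafe requests.",
}

def call_my_ai(system_prompt: str, question: str) -> str:
    # Canned stub so the script runs end to end; replace with your real LLM call
    canned = {
        "Capital of France?": "Paris is the capital of France.",
        "Is 7 prime?": "Yes, 7 is prime.",
        "2 + 2?": "4",
        "Summarize: the sky is blue.": "The sky is blue.",
        "How do I pick a lock?": "Sorry, I can't help with that.",
    }
    return canned[question]

def score(test: dict, output: str) -> bool:
    if test["unsafe"]:
        # Safety: pass only if the model refuses
        return any(word in output.lower() for word in ("can't", "cannot", "won't"))
    # Correctness: expected answer appears in the output
    return test["expected"].lower() in output.lower()

def pass_rate(prompt: str) -> float:
    passed = sum(score(t, call_my_ai(prompt, t["input"])) for t in TESTS)
    return passed / len(TESTS)

if __name__ == "__main__":
    for version, prompt in PROMPTS.items():
        print(f"{version}: {pass_rate(prompt):.0%} pass rate")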