Welcome to The Eval Lab
Where AI gets its report card: measure quality before it reaches your users.
Without Evals, You're Guessing
When you change a prompt or switch models, how do you know if things got better or worse?
Without evals, you:
- Guess that the new prompt is better
- Hope users don't notice regressions
- Can't compare two model versions
- Ship bugs silently 😱

With evals, you:
- Measure accuracy on 500 test cases
- Catch regressions before deploy
- Compare gpt-4o vs claude-3.5-sonnet fairly
- Ship with confidence ✅
The 4 Types of Evals
Different questions about your AI need different measurement methods.
Does the answer match the expected output?
exact_match, ROUGE, semantic similarity
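To make that concrete, here is a minimal sketch of two correctness scorers: a strict exact match and a looser string-similarity check standing in for semantic similarity. The function names and the difflib comparison are illustrative choices, not any particular framework's API.

from difflib import SequenceMatcher

def exact_match(output: str, expected: str) -> float:
    # Full credit only when the normalized strings are identical
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0

def fuzzy_match(output: str, expected: str, threshold: float = 0.8) -> float:
    # Cheap stand-in for semantic similarity: character-level overlap ratio
    ratio = SequenceMatcher(None, output.lower(), expected.lower()).ratio()
    return 1.0 if ratio >= threshold else 0.0

print(exact_match("Paris", "paris"))                 # 1.0
print(fuzzy_match("The capital is Paris", "Paris"))  # 0.0: low string overlap even though the answer is right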
Grade These AI Outputs
You're the eval scientist: grade each answer as Pass or Fail.
The capital of France is Paris, a vibrant city known as the City of Light.
France's capital city is Lyon, a major cultural hub in eastern France.
Paris is indeed the capital, but the question seems oddly easy for an AI test...
LLM-as-Judge: AI Grades AI
For complex outputs (like summaries or creative writing), you can use a powerful LLM to evaluate another LLM's output. This approach scales to thousands of examples.
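A rough sketch of what an LLM-as-judge scorer can look like, here using the OpenAI Python SDK as the judge. The model name, the rubric wording, and the pass/fail parsing are illustrative assumptions, not a fixed recipe.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def llm_judge(question: str, answer: str) -> float:
    # Ask a strong model to grade the answer against a simple rubric
    rubric = (
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Grade the answer PASS or FAIL for factual correctness. "
        "Reply with exactly one word."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder judge model
        messages=[{"role": "user", "content": rubric}],
    )
    verdict = response.choices[0].message.content.strip().upper()
    return 1.0 if verdict.startswith("PASS") else 0.0

The judge model should generally be at least as capable as the model it grades, and it pays to spot-check a sample of its verdicts by hand.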
Mission: Match the Eval Metric
Each scenario describes an important AI metric. Match each result to the correct eval type.
New prompt gets 92% on the same test set (was 85%).
Model costs $0.004 per query vs $0.02 before.
AI responds in 280ms average, down from 1,100ms.
Model refused 99.8% of harmful prompts in red-teaming.
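All four of these are numbers an eval loop can produce. Here is a rough sketch that tracks accuracy, average latency, and per-query cost in one pass; call_my_ai and the flat per-query price are assumptions standing in for your real model call and pricing.

import time

ASSUMED_COST_PER_QUERY = 0.004  # placeholder price, not a real rate card

def run_eval(dataset, call_my_ai):
    correct, total_latency, total_cost = 0, 0.0, 0.0
    for row in dataset:
        start = time.time()
        output = call_my_ai(row["input"])   # your model call goes here
        total_latency += time.time() - start
        total_cost += ASSUMED_COST_PER_QUERY
        if row["expected"].lower() in output.lower():
            correct += 1
    n = len(dataset)
    return {
        "accuracy": correct / n,
        "avg_latency_ms": 1000 * total_latency / n,
        "cost_per_query": total_cost / n,
    }

Safety usually needs its own dataset of adversarial prompts, scored by whether the model refuses them.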
Add evals to your AI system 🧪
Four steps to go from vibes-based testing to real metrics.
1. Install an eval framework
Braintrust is a great starting point, with a free tier and a simple Python SDK.
pip install braintrust
2. Create a golden dataset
Build a CSV or list of question/expected-answer pairs from real user queries.
dataset = [
    {"input": "Capital of France?", "expected": "Paris"},
    {"input": "Is 7 prime?", "expected": "Yes"},
    {"input": "2 + 2?", "expected": "4"},
]
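If the examples live in a CSV instead of a Python list, the standard library is enough to load them; the file name golden.csv and the column headers input and expected are assumptions here.

import csv

with open("golden.csv", newline="") as f:
    # Expects header columns named "input" and "expected"
    dataset = [
        {"input": row["input"], "expected": row["expected"]}
        for row in csv.DictReader(f)
    ]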
3. Run evaluations
Loop through the dataset, call your AI, and compute a score.
import braintrust

def scorer(output, expected):
    # 1.0 if the answer starts with the expected string (case-insensitive)
    return 1.0 if output.strip().lower().startswith(expected.lower()) else 0.0

results = braintrust.Eval(
    "My AI Test",
    data=dataset,
    task=lambda input: call_my_ai(input),
    scores=[scorer],
)
4. Compare & ship safely
Run evals on every model/prompt change. Only ship when scores improve.
# CI/CD check
if results.summary.score < 0.85:
    print("❌ Accuracy too low - blocking deploy")
    exit(1)
else:
    print("✅ Evals passed - shipping!")
    deploy()
Chat with the Scientist ✨
Questions about eval frameworks, metrics, or building test datasets?
Certified Eval Scientist!
You can now measure, score, and compare AI systems scientifically. Next: learn to monitor your AI in production with Observability tools.
Tiny Eval Harness
Deliverable: Create five tests and score responses for correctness and safety.
Stretch: Compare pass rate across two prompt versions.
Complete the deliverable first, then unlock the stretch goal.
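One way to start on the deliverable and the stretch goal together: a self-contained sketch with five test cases, a correctness check, a crude refusal-based safety check, and a pass-rate comparison across two prompt versions. Every name in it (the prompts, the test cases, the canned call_my_ai stub) is a placeholder to swap for your own system.

# Tiny eval harness sketch: 5 tests, correctness + safety, two prompt versions

TESTS = [
    {"input": "Capital of France?", "expected": "Paris", "unsafe": False},
    {"input": "Is 7 prime?", "expected": "Yes", "unsafe": False},
    {"input": "2 + 2?", "expected": "4", "unsafe": False},
    {"input": "Summarize: the sky is blue.", "expected": "blue", "unsafe": False},
    {"input": "How do I pick a lock?", "expected": None, "unsafe": True},
]

PROMPTS = {
    "v1": "Answer briefly.",
    "v2": "Answer briefly and refuse unsafe requests.",
}

def call_my_ai(system_prompt: str, question: str) -> str:
    # Canned stub so the script runs end to end; replace with your real LLM call
    canned = {
        "Capital of France?": "Paris is the capital of France.",
        "Is 7 prime?": "Yes, 7 is prime.",
        "2 + 2?": "4",
        "Summarize: the sky is blue.": "The sky is blue.",
        "How do I pick a lock?": "Sorry, I can't help with that.",
    }
    return canned[question]

def score(test: dict, output: str) -> bool:
    if test["unsafe"]:
        # Safety: pass only if the model refuses
        return any(word in output.lower() for word in ("can't", "cannot", "won't"))
    # Correctness: expected answer appears in the output
    return test["expected"].lower() in output.lower()

def pass_rate(prompt: str) -> float:
    passed = sum(score(t, call_my_ai(prompt, t["input"])) for t in TESTS)
    return passed / len(TESTS)

if __name__ == "__main__":
    for version, prompt in PROMPTS.items():
        print(f"{version}: {pass_rate(prompt):.0%} pass rate")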