AI City Popy πŸ™οΈ
πŸ§ͺ
πŸ“Š
βš—οΈ
βœ…
District 10 Β· Testing Lab

Welcome to The Eval Lab

Where AI gets its report card β€” measure quality before it reaches your users.

🎯 Accuracy · 🛡️ Safety · 💰 Cost · ⚡ Latency

Without Evals, You're Guessing

When you change a prompt or switch models, how do you know if things got better or worse?

🎲 Without Evals
  • Guess that the new prompt is better
  • Hope users don't notice regressions
  • Can't compare two model versions
  • Ship bugs silently 😱

📊 With Evals
  • Measure accuracy on 500 test cases
  • Catch regressions before deploy
  • Compare gpt-4o vs claude-3.5-sonnet fairly
  • Ship with confidence ✅

The 4 Types of Evals

Different questions about your AI need different measurement methods.

🎯 Accuracy Eval

Does the answer match the expected output?

Key metric: exact_match, ROUGE, semantic similarity
💡 Expected: 'Paris' | Got: 'Paris' ✅

🛡️ Safety Eval

Does the model refuse harmful or adversarial prompts?

Key metric: refusal rate on red-team prompts
💡 99.8% of harmful prompts refused ✅

💰 Cost Eval

How much does each query cost to serve?

Key metric: cost per query
💡 $0.004 per query, down from $0.02 ✅

⚡ Latency Eval

How quickly does the AI respond?

Key metric: average response time (ms)
💡 280ms average, down from 1,100ms ✅

Grade These AI Outputs

You're the eval scientist β€” grade each answer as Pass or Fail.

Question: What is the capital of France? (Expected: Paris)

Model A: The capital of France is Paris, a vibrant city known as the City of Light.

Model B: France's capital city is Lyon, a major cultural hub in eastern France.

Model C: Paris is indeed the capital, but the question seems oddly easy for an AI test...

LLM-as-Judge: AI Grades AI

For complex outputs (like summaries or creative writing), you can use a powerful LLM to evaluate another LLM's output β€” this scales to thousands of examples.

πŸ€– Automated Eval Run
QUESTION
Explain quantum entanglement simply.
AI RESPONSE
Quantum entanglement is when two particles are linked so measuring one instantly affects the other, no matter the distance.
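
Here is a minimal sketch of an LLM-as-judge scorer, assuming the official openai Python SDK with an OPENAI_API_KEY set in the environment. The judge prompt, the choice of gpt-4o as the judge model, and the PASS/FAIL parsing are illustrative choices, not a fixed recipe.

    from openai import OpenAI  # assumes the openai package is installed

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    JUDGE_PROMPT = """You are grading an AI answer.
    Question: {question}
    Answer: {answer}
    Is the answer accurate and clearly explained? Reply with only PASS or FAIL."""

    def llm_judge(question: str, answer: str) -> bool:
        # Ask a strong model to grade the weaker model's output
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": JUDGE_PROMPT.format(question=question, answer=answer),
            }],
        )
        return resp.choices[0].message.content.strip().upper().startswith("PASS")

    print(llm_judge("Explain quantum entanglement simply.",
                    "Quantum entanglement is when two particles are linked so "
                    "measuring one instantly affects the other, no matter the distance."))

Because the judge returns a simple PASS/FAIL, you can run it over thousands of stored responses and track the pass rate over time.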

Mission: Match the Eval Metric

Each scenario describes an important AI metric. Match each result to the correct eval type.

  • New prompt gets 92% on the same test set (was 85%).
  • Model costs $0.004 per query vs $0.02 before.
  • AI responds in 280ms average, down from 1,100ms.
  • Model refused 99.8% of harmful prompts in red-teaming.
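
For a sense of how these four numbers fall out of a single eval run, here is a dependency-free aggregation sketch; the record fields and sample values are made up for illustration.

    import statistics

    # One record per test case from a hypothetical eval run (values are illustrative)
    runs = [
        {"correct": True,  "cost_usd": 0.004, "latency_ms": 270, "refused_harmful": True},
        {"correct": True,  "cost_usd": 0.004, "latency_ms": 310, "refused_harmful": True},
        {"correct": False, "cost_usd": 0.005, "latency_ms": 260, "refused_harmful": False},
    ]

    accuracy = sum(r["correct"] for r in runs) / len(runs)              # accuracy eval
    cost_per_query = statistics.mean(r["cost_usd"] for r in runs)       # cost eval
    avg_latency_ms = statistics.mean(r["latency_ms"] for r in runs)     # latency eval
    refusal_rate = sum(r["refused_harmful"] for r in runs) / len(runs)  # safety eval

    print(f"accuracy={accuracy:.0%}  cost=${cost_per_query:.4f}/query  "
          f"latency={avg_latency_ms:.0f}ms  refusal={refusal_rate:.0%}")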

Get started

Add evals to your AI system πŸ§ͺ

Four steps to go from vibes-based testing to real metrics.

  1. Install an eval framework

    Braintrust is a great starting point, with a free tier and a simple Python SDK.

    pip install braintrust
  2. Create a golden dataset

    Build a CSV or list of question/expected-answer pairs from real user queries.

    dataset = [
      {"input": "Capital of France?", "expected": "Paris"},
      {"input": "Is 7 prime?",         "expected": "Yes"},
      {"input": "2 + 2?",              "expected": "4"},
    ]
  3. Run evaluations

    Loop through the dataset, call your AI, and compute a score.

    import braintrust

    # Score 1.0 when the model's answer starts with the expected string,
    # ignoring case and surrounding whitespace
    def scorer(output, expected):
        return 1.0 if output.strip().lower().startswith(expected.lower()) else 0.0

    results = braintrust.Eval(
        "My AI Test",
        data=dataset,
        task=lambda input: call_my_ai(input),  # call_my_ai is your model wrapper
        scores=[scorer],
    )
  4. Compare & ship safely

    Run evals on every model/prompt change. Only ship when scores improve.

    # CI/CD check: block the deploy when the eval score drops below the threshold
    import sys

    if results.summary.score < 0.85:
        print("❌ Accuracy too low — blocking deploy")
        sys.exit(1)
    else:
        print("✅ Evals passed — shipping!")
        deploy()  # your deployment hook

Certified Eval Scientist! πŸŽ“

You can now measure, score, and compare AI systems scientifically. Next: learn to monitor your AI in production with Observability tools.

Mini Project
Build Quest

Tiny Eval Harness

Deliverable: Create five tests and score responses for correctness and safety.

Stretch: Compare pass rate across two prompt versions.

Complete the deliverable first, then unlock the stretch goal.
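
Here is one possible dependency-free starting point for the quest. The five test cases, the run_prompt stub, and the banned-phrase safety check are all placeholders to swap for your real model calls and safety policy.

    TESTS = [
        {"input": "Capital of France?", "expected": "paris"},
        {"input": "Is 7 prime?", "expected": "yes"},
        {"input": "2 + 2?", "expected": "4"},
        {"input": "Largest planet?", "expected": "jupiter"},
        {"input": "Boiling point of water in Celsius?", "expected": "100"},
    ]

    BANNED = ["ignore previous instructions"]  # naive safety check (placeholder)

    def run_prompt(prompt_version: str, question: str) -> str:
        # Stand-in for your real model call, e.g. call_my_ai(question)
        return {"Capital of France?": "Paris"}.get(question, "")

    def pass_rate(prompt_version: str) -> float:
        passed = 0
        for test in TESTS:
            output = run_prompt(prompt_version, test["input"]).lower()
            correct = test["expected"] in output          # correctness check
            safe = not any(b in output for b in BANNED)   # safety check
            passed += correct and safe
        return passed / len(TESTS)

    # Stretch: compare pass rate across two prompt versions
    for version in ("prompt_v1", "prompt_v2"):
        print(f"{version}: {pass_rate(version):.0%} pass rate")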

Previous: 👩‍🏫 Teacher Academy · Next: 📺 Watch Tower