Evaluations are how you test and measure if your AI system is working well. Just like a teacher grading student work, evaluations help you understand your AI’s strengths and weaknesses.

Why Evaluations Matter

AI is Different from Regular Software

| Traditional Software Testing | AI System Testing |
| --- | --- |
| You write code that does exactly what you tell it | AI generates responses that vary |
| Same input always gives same output | Same question might get different (but valid) answers |
| Easy to test: either it works or it doesn't | Quality is subjective - what's "good" isn't always clear |
Think of it like: Testing a calculator (traditional) vs. grading an essay (AI). The calculator is either right or wrong. The essay requires judgment about quality, clarity, and helpfulness.

Types of Evaluations

Automated Testing

Like running tests on regular software, but for AI quality. How it works:
  1. Create a set of test questions
  2. Define what good answers look like
  3. Have the AI answer all test questions
  4. Automatically score the answers
  5. Get a quality percentage
Example test set:
  • Question: "What's your refund policy?" - Expected answer: "30-day money-back guarantee"
  • Question: "How do I reset my password?" - Expected answer: should mention the "Forgot Password" link
  • Question: "What are your business hours?" - Expected answer: should list the correct hours
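As a sketch, the five-step workflow above might look like the following script. The test cases, the keyword-based scorer, and the canned `ask_ai` stub are all illustrative assumptions - in practice `ask_ai` would call your actual model.

```python
# Minimal automated-evaluation loop: run test questions, score answers,
# and report an overall quality percentage.

# Hypothetical test set: each case pairs a question with keywords a
# good answer should contain.
TEST_CASES = [
    {"question": "What's your refund policy?",
     "expected_keywords": ["30-day", "money-back"]},
    {"question": "How do I reset my password?",
     "expected_keywords": ["Forgot Password"]},
]

def ask_ai(question: str) -> str:
    """Stand-in for a real model call; replace with your API client."""
    canned = {
        "What's your refund policy?": "We offer a 30-day money-back guarantee.",
        "How do I reset my password?": "Click the 'Forgot Password' link on the login page.",
    }
    return canned.get(question, "")

def score(answer: str, expected_keywords: list[str]) -> bool:
    """Pass if every expected keyword appears in the answer."""
    return all(kw.lower() in answer.lower() for kw in expected_keywords)

def run_eval(cases) -> float:
    """Return the percentage of test cases that pass."""
    passed = sum(score(ask_ai(c["question"]), c["expected_keywords"]) for c in cases)
    return 100.0 * passed / len(cases)

print(f"Quality: {run_eval(TEST_CASES):.0f}%")
```

Swapping the keyword scorer for a stricter or fuzzier check changes what "pass" means, so document that choice alongside the test set.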
| Advantages ✓ | Limitations ⚠️ |
| --- | --- |
| Very fast - can test thousands of questions in minutes | Only catches what you test for |
| Runs automatically whenever you make changes | May miss nuanced quality issues |
| Gives objective, consistent scores | Requires creating and maintaining test datasets |
| Can run continuously in production | Can't evaluate subjective quality well |

Human Review

Real people rate the quality of AI responses. How it works:
  1. Select sample AI responses (random or specific)
  2. Show them to human reviewers
  3. Reviewers rate quality on different dimensions
  4. Aggregate ratings to understand performance
What reviewers check:
  • Is this answer accurate? (Yes/No)
  • Is this answer helpful? (1-5 stars)
  • Is this answer on-topic? (Yes/No)
  • Would you be satisfied with this response? (Thumbs up/down)
  • Any problems? (Free text feedback)
| Advantages ✓ | Limitations ⚠️ |
| --- | --- |
| Catches subtle quality issues | Expensive and time-consuming |
| Evaluates subjective aspects (tone, helpfulness) | Can't review everything |
| Provides context and explanations | Reviewers may disagree with each other |
| Can identify unexpected problems | Hard to scale |
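Once reviewer ratings come in, they need to be aggregated across the dimensions listed above. A minimal sketch, using invented sample records rather than a real review pipeline:

```python
# Aggregate human reviewer ratings across several quality dimensions.
# The rating records below are made-up sample data.
from statistics import mean

reviews = [
    {"accurate": True,  "helpful": 5, "on_topic": True},
    {"accurate": True,  "helpful": 4, "on_topic": True},
    {"accurate": False, "helpful": 2, "on_topic": True},
]

accuracy_rate = mean(r["accurate"] for r in reviews)   # fraction marked accurate
avg_helpfulness = mean(r["helpful"] for r in reviews)  # mean star rating
on_topic_rate = mean(r["on_topic"] for r in reviews)

print(f"Accurate: {accuracy_rate:.0%}, "
      f"helpful: {avg_helpfulness:.1f}/5, "
      f"on-topic: {on_topic_rate:.0%}")
```

With more than one reviewer per response, also track how often reviewers agree - low agreement usually means the rating instructions are ambiguous.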

User Feedback

Let your actual users tell you how you’re doing. Simple feedback methods:
| Method | Characteristics | When to Use |
| --- | --- | --- |
| Thumbs up/down | Easy for users to give, high response rate, limited detail | Quick validation, high-volume interactions |
| Star ratings (1-5) | More nuanced than thumbs, users understand the pattern, still very simple | When you need more granular feedback |
| Follow-up behavior | Did the user immediately ask another question? (might indicate a bad answer) Did they copy the response? (good sign) Did they stop using the product? (bad sign) | Implicit feedback without asking users |
| Advantages ✓ | Limitations ⚠️ |
| --- | --- |
| Real feedback from real users | Many users don't leave feedback |
| Reflects actual satisfaction | May be biased (angry users more likely to respond) |
| No extra work for you | Doesn't tell you HOW to improve |
| Continuous feedback stream | Can be vague or unhelpful |

The Best Approach: Combine All Three

Recommended strategy:
  1. Automated testing runs continuously (catches obvious problems fast)
  2. Human review samples randomly (catches quality issues)
  3. User feedback provides ongoing validation (real-world quality check)

What to Measure

Accuracy Metrics

| Metric | What It Measures | Example | Good For | Not Good For |
| --- | --- | --- | --- | --- |
| Exact match | Does the answer exactly match what you expected? | "What's 2+2?" should be exactly "4" | Factual questions with one right answer | Questions with multiple valid answers |
| Meaning match | Does the answer mean the same thing, even if worded differently? | "30-day guarantee" vs. "You have 30 days to return" (both correct) | Most real-world questions | Simple pass/fail checking (requires sophisticated evaluation) |
| Fact-checking | Are the claims in the answer actually true? Compares against your knowledge base and detects when the AI makes things up | Verifying "Our refund policy is 30 days" against actual policy documents | Domains where accuracy matters (medical, legal, financial) | Subjective or opinion-based responses |
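The difference between exact match and meaning match can be shown in miniature. Real meaning-match scoring usually relies on embeddings or an LLM judge; the word-overlap proxy below is a deliberately crude stand-in for illustration only.

```python
# Exact match vs. a crude "meaning match" proxy based on word overlap.

def exact_match(answer: str, expected: str) -> bool:
    """Strict comparison after trimming and lowercasing."""
    return answer.strip().lower() == expected.strip().lower()

def meaning_match(answer: str, expected: str, threshold: float = 0.5) -> bool:
    """Crude proxy: fraction of expected words that appear in the answer.
    Production systems typically use embeddings or an LLM judge instead."""
    expected_words = set(expected.lower().split())
    answer_words = set(answer.lower().split())
    overlap = len(expected_words & answer_words) / len(expected_words)
    return overlap >= threshold

assert exact_match("4", "4")
assert not exact_match("The answer is 4", "4")   # fails exact match...
# ...but a differently worded answer can still pass the overlap proxy:
assert meaning_match("You have 30 days to return your purchase",
                     "30 days to return")
```

The `threshold` parameter is an assumption to tune; too low and wrong answers pass, too high and valid paraphrases fail.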

Quality Metrics

| Metric | Key Questions | What to Look For |
| --- | --- | --- |
| Relevance | Does the answer actually address the question asked? Is it on-topic? | Answer stays focused on the question without unnecessary tangents |
| Completeness | Does it answer the full question? Is anything missing? | All parts of the question are addressed with important details included |
| Clarity | Is it easy to understand? Is it well-structured? | Answer is well-organized and appropriate for the audience (not too technical or too simple) |
| Helpfulness | Does it solve the user's problem? Does it provide actionable information? | Users can take action based on the response and would be satisfied with it |

Performance Metrics

| Metric | What to Track | Example Target |
| --- | --- | --- |
| Response time | How long users wait (track averages and worst cases) | 95% of queries under 2 seconds |
| Success rate | Percentage of requests that work without errors | Less than 1% error rate |
| Cost per query | How much each answer costs (track trends and identify expensive queries) | Average cost under $0.01 per query |
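These metrics are straightforward to compute from raw request logs. The sample numbers below are invented; the percentile uses the nearest-rank method as one reasonable choice.

```python
# Computing the performance metrics above from raw request logs.
import math

latencies_ms = [120, 340, 180, 95, 2100, 260, 150, 310, 170, 220]  # per request
error_count, request_count = 1, 200   # failed vs. total requests
total_cost = 0.80                      # dollars spent on these requests

def p95(values):
    """95th percentile via the nearest-rank method."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]

p95_latency = p95(latencies_ms)              # tracks the worst-case tail
error_rate = error_count / request_count     # target: under 1%
cost_per_query = total_cost / request_count  # target: under $0.01

print(f"p95 latency: {p95_latency} ms, "
      f"error rate: {error_rate:.2%}, "
      f"cost/query: ${cost_per_query:.4f}")
```

Tracking the p95 rather than the average matters because a handful of very slow queries can hide behind a healthy-looking mean.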

Building Your Test Set

Start with Real Examples

Collect actual user questions:
  • Look at your logs or analytics
  • What do users actually ask?
  • What are the most common questions?
Example starting set:
  • 20 most common questions
  • 10 questions that previously failed
  • 10 edge cases or tricky questions
  • 10 questions for new features
Total: 50 question test set (good starting point)
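One simple way to store such a set is a small record per question, tagged by category so results can be segmented later. The field names and sample cases below are illustrative, not a standard schema.

```python
# A minimal test-case record for the 50-question starting set.
from dataclasses import dataclass

@dataclass
class TestCase:
    question: str
    expected: str   # what a good answer must convey
    category: str   # e.g. "common", "failed_before", "edge_case", "new_feature"

# Hypothetical entries:
test_set = [
    TestCase("What's your refund policy?", "30-day money-back guarantee", "common"),
    TestCase("Can I refund a gift card?", "Gift cards are non-refundable", "edge_case"),
]

# Category tags make it easy to pull out a slice for targeted runs:
edge_cases = [t for t in test_set if t.category == "edge_case"]
print(f"{len(test_set)} cases total, {len(edge_cases)} edge cases")
```

Storing the set as plain data (JSON, CSV, or code like this) keeps it versionable alongside the rest of your project.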

Cover Different Scenarios

Common cases (60%)

  • Typical questions most users ask
  • Should have high success rate
  • Baseline quality check

Edge cases (20%)

  • Unusual or tricky questions
  • Boundary conditions
  • Things that might break

Known failures (20%)

  • Questions that broke before
  • Regression testing
  • Make sure old problems don’t come back

Include Hard Samples - Your Evals Should Never Pass 100%

Critical principle: If your evaluation is passing 100% of the time, it means your test set is too easy and won’t catch real problems.
A good evaluation should challenge your AI system and expose its limits. This helps you understand where improvement is needed and prevents false confidence.

Why hard samples matter:

Find Real Weaknesses

Easy questions hide problems. Hard questions expose where your AI actually struggles, showing you what needs improvement.

Prevent Regression

When you make changes, hard samples catch subtle degradations that easy questions would miss.

Set Realistic Expectations

Understanding your AI’s limits helps you set appropriate expectations for users and stakeholders.

Drive Improvement

You can’t improve what you don’t measure. Hard samples show you the frontier of what’s possible.
Types of hard samples to include:
| Category | Examples | Why Include It |
| --- | --- | --- |
| Ambiguous questions | "What's the status?" (of what?), "How much does it cost?" (which product?) | Tests if AI asks for clarification instead of guessing |
| Multi-step reasoning | "If I buy product A and return it within 15 days, can I use that refund toward product B?" | Tests complex logic and policy understanding |
| Conflicting information | Questions where documents contain outdated or contradictory data | Tests source prioritization and version awareness |
| Domain-specific jargon | Questions using industry terms or abbreviations your users actually use | Tests real-world vocabulary coverage |
| Out-of-scope questions | Questions your AI shouldn't answer (legal advice, medical diagnosis) | Tests safety and boundary recognition |
| Trick questions | Questions with false premises, like "Why is your refund policy 60 days?" when it's actually 30 days | Tests fact-checking and correction ability |
Target distribution: Aim for evaluation results between 70-90% pass rate. If you’re consistently above 90%, your tests are probably too easy. If you’re below 70%, focus on the fundamentals first.
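The 70-90% band can be turned into a simple sanity check on each eval run. The band edges and messages below follow the guidance above; the exact wording is just an example.

```python
# Flag eval runs whose pass rate falls outside the recommended 70-90% band.

def check_pass_rate(passed: int, total: int) -> str:
    """Classify an eval run by its pass rate."""
    rate = passed / total
    if rate > 0.90:
        return "too easy: add harder samples"
    if rate < 0.70:
        return "struggling: fix fundamentals first"
    return "healthy difficulty"

assert check_pass_rate(48, 50) == "too easy: add harder samples"        # 96%
assert check_pass_rate(40, 50) == "healthy difficulty"                  # 80%
assert check_pass_rate(30, 50) == "struggling: fix fundamentals first"  # 60%
```

Running this check automatically after each eval keeps the test set's difficulty from silently drifting as the system improves.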
Example hard sample scenarios:
Easy sample: “What are your business hours?”
  • Clear question, simple answer from knowledge base
Hard sample: “I need to talk to someone NOW because my order was supposed to arrive for my daughter’s birthday YESTERDAY and it’s still not here and you’re ruining everything!”
  • Tests: Emotion handling, escalation, urgency recognition, empathy
  • Should: Acknowledge frustration, offer immediate help, escalate to human if needed
Hard sample: “I’m traveling in Japan and need support but your phone lines are closed, what do I do?”
  • Tests: Timezone awareness, alternative support channel knowledge
  • Should: Suggest email/chat, mention timezone differences, offer self-service options
Easy sample: “How do I install the package?”
  • Direct question from docs
Hard sample: “I’m getting error XYZ when I follow the installation guide, what’s wrong?”
  • Tests: Error diagnosis, documentation gap recognition, troubleshooting
  • Should: Reference relevant error documentation, suggest debugging steps
Hard sample: “Can I use version 2.0 features with version 1.5 installed?”
  • Tests: Version compatibility knowledge, cross-referencing docs
  • Should: Explain compatibility, suggest upgrade path if needed
Easy sample: “What colors does the Classic T-Shirt come in?”
  • Straightforward product spec question
Hard sample: “I’m 5’10” and 180 lbs, usually wear a medium in Nike, what size should I get in your Classic T-Shirt?”
  • Tests: Size comparison, brand knowledge, fit guidance
  • Should: Provide sizing advice or admit uncertainty and suggest size chart
Hard sample: “Which is better for hiking: Product A or Product B?”
  • Tests: Comparative analysis, use-case matching, feature prioritization
  • Should: Compare relevant features, ask clarifying questions about hiking needs
Best practice: Review your failed test cases quarterly. The questions your AI fails today should help you create better training examples, improve your data, or refine your prompts. Each failure is an opportunity to make your system better.

Define “Good” Answers

For each test question, document:

The expected answer

  • What should the AI say?
  • What are acceptable variations?
  • What would be wrong?

Why this is correct

  • Where does this information come from?
  • What makes this the right answer?
  • When was this verified?

Update regularly

  • Review monthly or quarterly
  • Update when products/policies change
  • Remove outdated questions

Analyzing Results

Look for Patterns

Example failure analysis - failed questions breakdown:
  • Refund policy: 15 failures (37%)
  • Shipping times: 10 failures (25%)
  • Product specs: 8 failures (20%)
  • Other: 7 failures (18%)
Action: Focus on improving refund policy content first
What to look for:
  • Which categories fail most?
  • Are there common themes?
  • What types of questions struggle?
  • Where should you focus improvement efforts?
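A breakdown like the one above falls out of a few lines of code once each failed test case carries a category tag. The sample data here mirrors the example counts; in practice the list would come from your eval run.

```python
# Build a failure breakdown by category from a list of failed test cases.
from collections import Counter

# Hypothetical categories of the failed cases (mirrors the example above):
failed = (["refund policy"] * 15 + ["shipping times"] * 10
          + ["product specs"] * 8 + ["other"] * 7)

breakdown = Counter(failed)
total = sum(breakdown.values())

# most_common() sorts descending, so the first row is where to focus first.
for category, count in breakdown.most_common():
    print(f"{category}: {count} failures ({count / total:.1%})")
```

Keeping the category tags on test cases from the start (rather than labeling failures afterwards) makes this analysis essentially free.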

Track Progress Over Time

Month-over-month comparison:
  • January: 72% accuracy - Baseline
  • February: 76% accuracy - +4% (improved data)
  • March: 79% accuracy - +3% (better prompts)
  • April: 81% accuracy - +2% (model upgrade)
What this tells you:
  • You’re making steady progress
  • Improvements are slowing (diminishing returns)
  • May need new approaches for further gains

Segment by Question Type

Break down performance:
  • Factual questions: 88% accuracy
  • How-to questions: 75% accuracy
  • Opinion questions: 62% accuracy
Insights:
  • Factual answers are good
  • How-to needs improvement
  • Opinion questions need different approach
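A per-type breakdown like this is a small grouping exercise over the raw results. The records below are invented `(question_type, passed)` pairs standing in for a real eval run.

```python
# Segment eval accuracy by question type.
from collections import defaultdict

# Hypothetical per-question results: (question_type, passed)
results = [("factual", True), ("factual", True), ("factual", False),
           ("how-to", True), ("how-to", False),
           ("opinion", False), ("opinion", True)]

by_type = defaultdict(lambda: [0, 0])   # type -> [passed, total]
for qtype, passed in results:
    by_type[qtype][0] += int(passed)
    by_type[qtype][1] += 1

for qtype, (passed, total) in sorted(by_type.items()):
    print(f"{qtype}: {passed / total:.0%} accuracy ({passed}/{total})")
```

The same grouping works for any tag you attach to test cases - feature area, language, or difficulty.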

Best Practices

Start Simple, Add Complexity

Week 1

  • Create 20-30 test questions
  • Run manual evaluation
  • Calculate basic accuracy
Week 2

  • Expand to 50 questions
  • Add automated scoring
  • Track over time
Month 2

  • Add human review
  • Collect user feedback
  • Segment by category
Month 3

  • Implement continuous monitoring
  • Set quality thresholds
  • Auto-alert on regressions

Set Clear Standards

| Must have ✓ | Nice to have ⭐ |
| --- | --- |
| Accuracy: At least 80% | Accuracy: Above 90% |
| Speed: 95% of queries under 2 seconds | Speed: Average under 1 second |
| Errors: Less than 1% failure rate | User rating: 4.5/5 stars |
Use these to:
  • Decide if changes are good enough to deploy
  • Know when you’re ready to launch
  • Track progress toward goals
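The "must have" column translates directly into a deployment gate. The threshold values below come from the table; the shape of the metrics report is a hypothetical example.

```python
# Deployment gate built on the "must have" thresholds above.

MUST_HAVE = {"accuracy": 0.80, "p95_latency_s": 2.0, "error_rate": 0.01}

def ready_to_deploy(metrics: dict) -> bool:
    """All must-have thresholds must pass before a change ships."""
    return (metrics["accuracy"] >= MUST_HAVE["accuracy"]
            and metrics["p95_latency_s"] <= MUST_HAVE["p95_latency_s"]
            and metrics["error_rate"] < MUST_HAVE["error_rate"])

# A report that meets every threshold passes the gate...
assert ready_to_deploy({"accuracy": 0.84, "p95_latency_s": 1.6, "error_rate": 0.004})
# ...while missing any single one blocks the release:
assert not ready_to_deploy({"accuracy": 0.84, "p95_latency_s": 2.5, "error_rate": 0.004})
```

Wiring a check like this into CI means a regression in any one dimension blocks the release instead of being averaged away.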

Test Regularly

| Frequency | Test Type | Purpose |
| --- | --- | --- |
| Daily | Quick automated check (smoke test) | Catch breaking changes immediately |
| Weekly | Full automated test suite | Comprehensive quality check across all scenarios |
| Monthly | Human quality review | Evaluate subjective quality and find edge cases |
| Quarterly | Comprehensive audit | Deep dive into performance, update test sets, strategic planning |

Use Multiple Metrics

Don’t rely on just one number:
  • High accuracy but slow → Users unhappy
  • Fast but inaccurate → Users get wrong answers
  • Accurate and fast but expensive → Not sustainable
Balance multiple dimensions:
  • Quality (accuracy, helpfulness)
  • Performance (speed, reliability)
  • Cost (per query, total)

Update Your Tests

Your evaluation dataset is a living document:
  • Add new questions when features change
  • Remove outdated questions
  • Update expected answers when policies change
  • Review and refresh quarterly

Common Mistakes

❌ Only Testing Happy Paths

The mistake: Only testing questions you expect to work.
Why it's wrong: Edge cases and errors are where problems hide.
Better approach: Deliberately test weird, unusual, and difficult questions.

❌ Letting Tests Get Stale

The mistake: Creating a test set once and never updating it.
Why it's wrong: Your product evolves, but your tests don't, leading to false confidence.
Better approach: Review and update tests monthly, especially after changes.

❌ Ignoring User Feedback

The mistake: Only looking at automated metrics, ignoring what users say.
Why it's wrong: Metrics might look good while users are actually unhappy.
Better approach: Combine automated metrics with real user feedback.

❌ Not Investigating Failures

The mistake: Seeing a failure, noting it, moving on.
Why it's wrong: You don't learn what caused it or how to prevent it.
Better approach: For each failure, understand why it failed and how to fix it.

Getting Started

Week 1: Create Your First Test Set

Gather 20-30 questions:
  • 15 common questions from user logs
  • 5 tricky or edge case questions
  • 5 questions you know should work
Define expected answers:
  • Write what the correct answer should be
  • Note key points that must be included
  • Mark optional details
Week 2: Run Your First Evaluation

Manual approach:
  • Ask your AI each question
  • Compare to expected answers
  • Score each as pass/fail or 1-5 stars
  • Calculate percentage
Simple metric: 20 questions passed / 30 total = 67% accuracy. This is your baseline!
Week 3: Make Improvements

Based on failures:
  • Which questions failed?
  • Why did they fail?
  • What can you improve? (Data? Prompts? Model?)
Make targeted changes:
  • Fix the biggest issues first
  • One change at a time
  • Re-evaluate after each change
Week 4: Set Up Regular Testing

Automate if possible:
  • Run tests daily or weekly
  • Track results over time
  • Alert on regressions
Make it routine:
  • Test before every deployment
  • Review results weekly
  • Update test set monthly

Next Steps