Why Evaluations Matter
AI is Different from Regular Software
| Traditional Software Testing | AI System Testing |
|---|---|
| You write code that does exactly what you tell it | AI generates responses that vary |
| Same input always gives same output | Same question might get different (but valid) answers |
| Easy to test: either it works or it doesn’t | Quality is subjective - what’s “good” isn’t always clear |
Think of it as the difference between testing a calculator (traditional) and grading an essay (AI). The calculator is either right or wrong; the essay requires judgment about quality, clarity, and helpfulness.
Types of Evaluations
Automated Testing
Like running tests on regular software, but for AI quality. How it works:
- Create a set of test questions
- Define what good answers look like
- Have the AI answer all test questions
- Automatically score the answers
- Get a quality percentage
| Advantages ✓ | Limitations ⚠️ |
|---|---|
| Very fast - can test thousands of questions in minutes | Only catches what you test for |
| Runs automatically whenever you make changes | May miss nuanced quality issues |
| Gives objective, consistent scores | Requires creating and maintaining test datasets |
| Can run continuously in production | Can’t evaluate subjective quality well |
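The five steps above can be sketched as a tiny harness. Everything here, including `ask_ai` and the keyword-based scoring, is a placeholder for your own system, not a real API:

```python
# Minimal automated-eval harness: run test questions, score answers,
# report a quality percentage. `ask_ai` stands in for your model call.
def ask_ai(question: str) -> str:
    canned = {"What are your business hours?": "We're open 9am-5pm, Monday to Friday."}
    return canned.get(question, "I'm not sure.")

def passes(answer: str, required_keywords: list[str]) -> bool:
    # Pass only if every required keyword appears (case-insensitive).
    return all(k.lower() in answer.lower() for k in required_keywords)

def run_eval(test_set: list[dict]) -> float:
    passed = sum(passes(ask_ai(t["question"]), t["keywords"]) for t in test_set)
    return 100 * passed / len(test_set)

test_set = [
    {"question": "What are your business hours?", "keywords": ["9am", "5pm"]},
    {"question": "What is your refund policy?", "keywords": ["30 days"]},
]
quality_pct = run_eval(test_set)  # 50.0: the refund question fails
```

Keyword scoring is crude, but even this much, wired into CI, catches obvious regressions every time you make a change.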
Human Review
Real people rate the quality of AI responses. How it works:
- Select sample AI responses (random or specific)
- Show them to human reviewers
- Reviewers rate quality on different dimensions
- Aggregate ratings to understand performance
What reviewers check:
- Is this answer accurate? (Yes/No)
- Is this answer helpful? (1-5 stars)
- Is this answer on-topic? (Yes/No)
- Would you be satisfied with this response? (Thumbs up/down)
- Any problems? (Free text feedback)
| Advantages ✓ | Limitations ⚠️ |
|---|---|
| Catches subtle quality issues | Expensive and time-consuming |
| Evaluates subjective aspects (tone, helpfulness) | Can’t review everything |
| Provides context and explanations | Reviewers may disagree with each other |
| Can identify unexpected problems | Hard to scale |
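Aggregating reviewer ratings is straightforward once each review mirrors the checklist above. A sketch, with hypothetical review records:

```python
# Aggregate human review ratings across the dimensions listed above.
# Each record mirrors the reviewer checklist: accurate (yes/no),
# helpful (1-5 stars), satisfied (thumbs up/down).
from statistics import mean

reviews = [
    {"accurate": True, "helpful": 4, "on_topic": True, "satisfied": True},
    {"accurate": True, "helpful": 5, "on_topic": True, "satisfied": True},
    {"accurate": False, "helpful": 2, "on_topic": True, "satisfied": False},
]

summary = {
    "accuracy_rate": mean(r["accurate"] for r in reviews),
    "avg_helpfulness": mean(r["helpful"] for r in reviews),
    "satisfaction_rate": mean(r["satisfied"] for r in reviews),
}
```

Because reviewers disagree, collect at least two or three ratings per sampled response before trusting the averages.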
User Feedback
Let your actual users tell you how you’re doing. Simple feedback methods:
| Method | Characteristics | When to Use |
|---|---|---|
| Thumbs up/down | Easy for users to give, high response rate, limited detail | Quick validation, high-volume interactions |
| Star ratings (1-5) | More nuanced than thumbs, users understand the pattern, still very simple | When you need more granular feedback |
| Follow-up behavior | Did user immediately ask another question? (might indicate bad answer) Did user copy the response? (good sign) Did user stop using the product? (bad sign) | Implicit feedback without asking users |
| Advantages ✓ | Limitations ⚠️ |
|---|---|
| Real feedback from real users | Many users don’t leave feedback |
| Reflects actual satisfaction | May be biased (angry users more likely to respond) |
| No extra work for you | Doesn’t tell you HOW to improve |
| Continuous feedback stream | Can be vague or unhelpful |
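Two numbers worth separating when you tally thumbs up/down: satisfaction among users who voted, and how many users voted at all. A sketch with a hypothetical event stream:

```python
# Turn raw thumbs up/down events into a satisfaction rate and a
# response rate. In practice the events come from your feedback buttons;
# here they are hardcoded for illustration.
from collections import Counter

total_interactions = 20                        # conversations served
feedback_events = ["up", "up", "down", "up"]   # only 4 users clicked anything

counts = Counter(feedback_events)
satisfaction_rate = counts["up"] / len(feedback_events)    # among voters only
response_rate = len(feedback_events) / total_interactions  # most users stay silent
```

A low response rate is exactly the bias the table warns about: the 75% satisfaction above describes the 20% of users who bothered to click, not everyone.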
The Best Approach: Combine All Three
Recommended strategy:
- Automated testing runs continuously (catches obvious problems fast)
- Human review samples randomly (catches quality issues)
- User feedback provides ongoing validation (real-world quality check)
What to Measure
Accuracy Metrics
| Metric | What It Measures | Example | Good For | Not Good For |
|---|---|---|---|---|
| Exact match | Does the answer exactly match what you expected? | “What’s 2+2?” should be exactly “4” | Factual questions with one right answer | Questions with multiple valid answers |
| Meaning match | Does the answer mean the same thing, even if worded differently? | “30-day guarantee” vs. “You have 30 days to return” (both correct) | Most real-world questions | Simple pass/fail checking (requires sophisticated evaluation) |
| Fact-checking | Are the claims in the answer actually true? Compare against your knowledge base and detect when AI makes things up | Verifying “Our refund policy is 30 days” against actual policy documents | Domains where accuracy matters (medical, legal, financial) | Subjective or opinion-based responses |
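Exact match is trivial to implement; meaning match is not. Production systems typically use embeddings or an LLM judge for it. The token-overlap (Jaccard) proxy below is only a cheap stand-in to show the shape of the comparison:

```python
# Exact match vs. a crude "meaning match" proxy. Real meaning matching
# usually needs embeddings or an LLM judge; token overlap is a toy version.
def exact_match(answer: str, expected: str) -> bool:
    return answer.strip().lower() == expected.strip().lower()

def token_overlap(answer: str, expected: str) -> float:
    # Jaccard similarity over lowercase word sets.
    a, b = set(answer.lower().split()), set(expected.lower().split())
    return len(a & b) / len(a | b) if a | b else 1.0

def meaning_match(answer: str, expected: str, threshold: float = 0.5) -> bool:
    return token_overlap(answer, expected) >= threshold
```

Note the failure mode: “30-day guarantee” and “You have 30 days to return” share almost no tokens despite meaning the same thing, which is exactly why real meaning matching requires more sophisticated evaluation.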
Quality Metrics
| Metric | Key Questions | What to Look For |
|---|---|---|
| Relevance | Does the answer actually address the question asked? Is it on-topic? | Answer stays focused on the question without unnecessary tangents |
| Completeness | Does it answer the full question? Is anything missing? | All parts of the question are addressed with important details included |
| Clarity | Is it easy to understand? Is it well-structured? | Answer is well-organized and appropriate for the audience (not too technical or too simple) |
| Helpfulness | Does it solve the user’s problem? Does it provide actionable information? | Users can take action based on the response and would be satisfied with it |
Performance Metrics
| Metric | What to Track | Example Target |
|---|---|---|
| Response time | How long users wait (track averages and worst cases) | 95% of queries under 2 seconds |
| Success rate | Percentage of requests that work without errors | Less than 1% error rate |
| Cost per query | How much each answer costs (track trends and identify expensive queries) | Average cost under $0.01 per query |
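All three performance metrics fall out of a per-query log. A sketch, assuming each record carries latency, a success flag, and a cost (the field names and values are hypothetical):

```python
# Compute the table's metrics from a log of per-query records:
# p95 latency (nearest-rank), success rate, and average cost per query.
import math

def performance_summary(records: list[dict]) -> dict:
    latencies = sorted(r["latency_s"] for r in records)
    p95_rank = math.ceil(0.95 * len(latencies)) - 1  # nearest-rank percentile
    return {
        "p95_latency_s": latencies[p95_rank],
        "success_rate": sum(r["ok"] for r in records) / len(records),
        "avg_cost_usd": sum(r["cost_usd"] for r in records) / len(records),
    }

records = [
    {"latency_s": 0.8, "ok": True, "cost_usd": 0.004},
    {"latency_s": 1.2, "ok": True, "cost_usd": 0.006},
    {"latency_s": 3.5, "ok": False, "cost_usd": 0.012},
    {"latency_s": 0.9, "ok": True, "cost_usd": 0.005},
]
summary = performance_summary(records)
```

This is why the table says to track worst cases, not just averages: the single 3.5s outlier dominates the p95 even though the average looks fine.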
Building Your Test Set
Start with Real Examples
Collect actual user questions:
- Look at your logs or analytics
- What do users actually ask?
- What are the most common questions?
Cover Different Scenarios
Common cases (60%)
- Typical questions most users ask
- Should have high success rate
- Baseline quality check
Edge cases (20%)
- Unusual or tricky questions
- Boundary conditions
- Things that might break
Known failures (20%)
- Questions that broke before
- Regression testing
- Make sure old problems don’t come back
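A quick way to keep yourself honest about the 60/20/20 mix is to tag each test case with its scenario type and check the proportions. A sketch with dummy questions:

```python
# Assemble a test set with the scenario mix above (60% common, 20% edge,
# 20% known failures), then verify the actual proportions.
from collections import Counter

test_set = (
    [{"q": f"common question {i}", "kind": "common"} for i in range(12)]
    + [{"q": f"edge case {i}", "kind": "edge"} for i in range(4)]
    + [{"q": f"known failure {i}", "kind": "regression"} for i in range(4)]
)

mix = Counter(t["kind"] for t in test_set)
proportions = {kind: count / len(test_set) for kind, count in mix.items()}
```

The `kind` tag also lets you report pass rates per scenario type later, which is where the interesting patterns show up.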
Include Hard Samples - Your Evals Should Never Pass 100%
A good evaluation should challenge your AI system and expose its limits. This helps you understand where improvement is needed and prevents false confidence.
Why hard samples matter:
Find Real Weaknesses
Easy questions hide problems. Hard questions expose where your AI actually struggles, showing you what needs improvement.
Prevent Regression
When you make changes, hard samples catch subtle degradations that easy questions would miss.
Set Realistic Expectations
Understanding your AI’s limits helps you set appropriate expectations for users and stakeholders.
Drive Improvement
You can’t improve what you don’t measure. Hard samples show you the frontier of what’s possible.
| Category | Examples | Why Include It |
|---|---|---|
| Ambiguous questions | “What’s the status?” (of what?), “How much does it cost?” (which product?) | Tests if AI asks for clarification instead of guessing |
| Multi-step reasoning | “If I buy product A and return it within 15 days, can I use that refund toward product B?” | Tests complex logic and policy understanding |
| Conflicting information | Questions where documents contain outdated or contradictory data | Tests source prioritization and version awareness |
| Domain-specific jargon | Questions using industry terms or abbreviations your users actually use | Tests real-world vocabulary coverage |
| Out-of-scope questions | Questions your AI shouldn’t answer (legal advice, medical diagnosis) | Tests safety and boundary recognition |
| Trick questions | Questions with false premises like “Why is your refund policy 60 days?” when it’s actually 30 days | Tests fact-checking and correction ability |
Target distribution: Aim for an evaluation pass rate between 70% and 90%. If you’re consistently above 90%, your tests are probably too easy. If you’re below 70%, focus on the fundamentals first.
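That target band is easy to encode as a check at the end of every eval run:

```python
# Interpret an eval pass rate against the 70-90% target band above.
def interpret_pass_rate(rate: float) -> str:
    if rate > 0.90:
        return "tests may be too easy - add harder samples"
    if rate < 0.70:
        return "focus on fundamentals first"
    return "healthy difficulty - keep iterating"
```

Printing this alongside the raw percentage reminds everyone that a 95% score is a signal about the test set, not just the model.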
Customer Support Chatbot
Easy sample: “What are your business hours?”
- Clear question, simple answer from knowledge base
Medium sample: an angry or frustrated customer complaint
- Tests: Emotion handling, escalation, urgency recognition, empathy
- Should: Acknowledge frustration, offer immediate help, escalate to human if needed
Hard sample: a request for support outside business hours
- Tests: Timezone awareness, alternative support channel knowledge
- Should: Suggest email/chat, mention timezone differences, offer self-service options
Technical Documentation Assistant
Easy sample: “How do I install the package?”
- Direct question from docs
Medium sample: a cryptic error message encountered during setup
- Tests: Error diagnosis, documentation gap recognition, troubleshooting
- Should: Reference relevant error documentation, suggest debugging steps
Hard sample: a question involving older or mismatched versions
- Tests: Version compatibility knowledge, cross-referencing docs
- Should: Explain compatibility, suggest upgrade path if needed
E-commerce Product Assistant
Easy sample: “What colors does the Classic T-Shirt come in?”
- Straightforward product spec question
Medium sample: a fit or sizing question, such as whether a shirt runs true to size
- Tests: Size comparison, brand knowledge, fit guidance
- Should: Provide sizing advice or admit uncertainty and suggest size chart
Hard sample: a comparison question, such as which of two backpacks is better for hiking
- Tests: Comparative analysis, use-case matching, feature prioritization
- Should: Compare relevant features, ask clarifying questions about hiking needs
Best practice: Review your failed test cases quarterly. The questions your AI fails today should help you create better training examples, improve your data, or refine your prompts. Each failure is an opportunity to make your system better.
Define “Good” Answers
For each test question, document:
The expected answer
- What should the AI say?
- What are acceptable variations?
- What would be wrong?
Why this is correct
- Where does this information come from?
- What makes this the right answer?
- When was this verified?
Update regularly
- Review monthly or quarterly
- Update when products/policies change
- Remove outdated questions
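One way to capture the checklist above is a small record per test case. The field names and the example values are illustrative, not a prescribed schema:

```python
# Document each test case with its expected answer, acceptable
# variations, provenance, and last verification date.
from dataclasses import dataclass, field

@dataclass
class TestCase:
    question: str
    expected_answer: str
    acceptable_variations: list[str] = field(default_factory=list)
    source: str = ""       # where the answer comes from
    verified_on: str = ""  # last verification date (ISO format)

case = TestCase(
    question="What is the refund window?",
    expected_answer="30 days",
    acceptable_variations=["You have 30 days to return items."],
    source="refund-policy.md",        # hypothetical document name
    verified_on="2024-01-15",
)
```

The `source` and `verified_on` fields are what make the quarterly review practical: stale dates tell you exactly which questions to re-verify.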
Analyzing Results
Look for Patterns
Example failure analysis:
Failed questions breakdown:
- Refund policy: 15 failures (37%)
- Shipping times: 10 failures (25%)
- Product specs: 8 failures (20%)
- Other: 7 failures (18%)
Questions to ask:
- Which categories fail most?
- Are there common themes?
- What types of questions struggle?
- Where should you focus improvement efforts?
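The breakdown above can be recreated from a flat list of failed cases (the percentages in the text are rounded; the exact values appear below):

```python
# Recreate the failure breakdown above from a flat list of failed cases.
from collections import Counter

failures = (
    ["refund policy"] * 15 + ["shipping times"] * 10
    + ["product specs"] * 8 + ["other"] * 7
)
counts = Counter(failures)
breakdown_pct = {
    category: 100 * n / len(failures) for category, n in counts.most_common()
}
# Counter.most_common() sorts descending, so the top entry is where to focus.
```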
Track Progress Over Time
Segment by Question Type
Break down performance:
- Factual questions: 88% accuracy
- How-to questions: 75% accuracy
- Opinion questions: 62% accuracy
What this tells you:
- Factual answers are good
- How-to needs improvement
- Opinion questions need a different approach
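Segmenting is just a group-by over tagged eval results. A sketch with hypothetical results:

```python
# Group eval results by question type and compute per-segment accuracy.
from collections import defaultdict

results = [  # (question type, passed?) pairs from an eval run
    ("factual", True), ("factual", True), ("factual", True), ("factual", False),
    ("how-to", True), ("how-to", False),
    ("opinion", False), ("opinion", True),
]

by_type: dict[str, list[bool]] = defaultdict(list)
for qtype, passed in results:
    by_type[qtype].append(passed)

accuracy = {qtype: sum(v) / len(v) for qtype, v in by_type.items()}
```

An overall score of, say, 75% would hide the fact that one segment is dragging everything down; the per-type view points you at it directly.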
Best Practices
Start Simple, Add Complexity
Set Clear Standards
| Must have ✓ | Nice to have ⭐ |
|---|---|
| Accuracy: At least 80% | Accuracy: Above 90% |
| Speed: 95% of queries under 2 seconds | Speed: Average under 1 second |
| Errors: Less than 1% failure rate | User rating: 4.5/5 stars |
Use these standards to:
- Decide if changes are good enough to deploy
- Know when you’re ready to launch
- Track progress toward goals
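The “must have” column translates directly into a deploy gate. A minimal sketch, assuming your eval pipeline produces these three metrics:

```python
# Gate a deploy on the "must have" thresholds from the table above.
def ready_to_deploy(metrics: dict) -> bool:
    return (
        metrics["accuracy"] >= 0.80          # at least 80% accurate
        and metrics["p95_latency_s"] <= 2.0  # 95% of queries under 2 seconds
        and metrics["error_rate"] < 0.01     # under 1% failure rate
    )

ok = ready_to_deploy({"accuracy": 0.85, "p95_latency_s": 1.4, "error_rate": 0.005})
```

Treat the “nice to have” column as dashboard targets rather than gates, so a 4.4-star rating never blocks an otherwise solid release.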
Test Regularly
| Frequency | Test Type | Purpose |
|---|---|---|
| Daily | Quick automated check (smoke test) | Catch breaking changes immediately |
| Weekly | Full automated test suite | Comprehensive quality check across all scenarios |
| Monthly | Human quality review | Evaluate subjective quality and find edge cases |
| Quarterly | Comprehensive audit | Deep dive into performance, update test sets, strategic planning |
Use Multiple Metrics
Balance multiple dimensions:
- Quality (accuracy, helpfulness)
- Performance (speed, reliability)
- Cost (per query, total)
Update Your Tests
Your evaluation dataset is a living document:
- Add new questions when features change
- Remove outdated questions
- Update expected answers when policies change
- Review and refresh quarterly
Common Mistakes
❌ Only Testing Happy Paths
The mistake: Only testing questions you expect to work.
Why it’s wrong: Edge cases and errors are where problems hide.
Better approach: Deliberately test weird, unusual, and difficult questions.
❌ Letting Tests Get Stale
The mistake: Creating a test set once and never updating it.
Why it’s wrong: Your product evolves, but your tests don’t, leading to false confidence.
Better approach: Review and update tests monthly, especially after changes.
❌ Ignoring User Feedback
The mistake: Only looking at automated metrics, ignoring what users say.
Why it’s wrong: Metrics might look good while users are actually unhappy.
Better approach: Combine automated metrics with real user feedback.
❌ Not Investigating Failures
The mistake: Seeing a failure, noting it, moving on.
Why it’s wrong: You don’t learn what caused it or how to prevent it.
Better approach: For each failure, understand why it failed and how to fix it.
Getting Started
Week 1: Create Your First Test Set
Gather 20-30 questions:
- 15 common questions from user logs
- 5 tricky or edge case questions
- 5 questions you know should work
For each question, document the expected answer:
- Write what the correct answer should be
- Note key points that must be included
- Mark optional details
Week 2: Run Your First Evaluation
Manual approach:
- Ask your AI each question
- Compare to expected answers
- Score each as pass/fail or 1-5 stars
- Calculate percentage
Week 3: Make Improvements
Based on failures:
- Which questions failed?
- Why did they fail?
- What can you improve? (Data? Prompts? Model?)
- Fix the biggest issues first
- One change at a time
- Re-evaluate after each change
