Why Evaluations Matter
AI is Different from Regular Software
| Traditional Software Testing | AI System Testing |
|---|---|
| You write code that does exactly what you tell it | AI generates responses that vary |
| Same input always gives same output | Same question might get different (but valid) answers |
| Easy to test: either it works or it doesn’t | Quality is subjective - what’s “good” isn’t always clear |
Think of it as the difference between testing a calculator (traditional) and grading an essay (AI). The calculator is either right or wrong; the essay requires judgment about quality, clarity, and helpfulness.
Types of Evaluations
Automated Testing
Like running tests on regular software, but for AI quality. How it works:
- Create a set of test questions
- Define what good answers look like
- Have the AI answer all test questions
- Automatically score the answers
- Get a quality percentage
| Advantages ✓ | Limitations ⚠️ |
|---|---|
| Very fast - can test thousands of questions in minutes | Only catches what you test for |
| Runs automatically whenever you make changes | May miss nuanced quality issues |
| Gives objective, consistent scores | Requires creating and maintaining test datasets |
| Can run continuously in production | Can’t evaluate subjective quality well |
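The five steps above can be sketched as a tiny harness. Everything here, including `ask_ai` and the keyword-based scoring, is a placeholder for your own system, not a real API:

```python
# Minimal automated-eval harness: run test questions, score answers,
# report a quality percentage. `ask_ai` stands in for your model call.
def ask_ai(question: str) -> str:
    canned = {"What are your business hours?": "We're open 9am-5pm, Monday to Friday."}
    return canned.get(question, "I'm not sure.")

def passes(answer: str, required_keywords: list[str]) -> bool:
    # Pass only if every required keyword appears (case-insensitive).
    return all(k.lower() in answer.lower() for k in required_keywords)

def run_eval(test_set: list[dict]) -> float:
    passed = sum(passes(ask_ai(t["question"]), t["keywords"]) for t in test_set)
    return 100 * passed / len(test_set)

test_set = [
    {"question": "What are your business hours?", "keywords": ["9am", "5pm"]},
    {"question": "What is your refund policy?", "keywords": ["30 days"]},
]
quality_pct = run_eval(test_set)  # 50.0: the refund question fails
```

Keyword scoring is crude, but even this much, wired into CI, catches obvious regressions every time you make a change.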
Human Review
Real people rate the quality of AI responses. How it works:
- Select sample AI responses (random or specific)
- Show them to human reviewers
- Reviewers rate quality on different dimensions
- Aggregate ratings to understand performance
What reviewers check:
- Is this answer accurate? (Yes/No)
- Is this answer helpful? (1-5 stars)
- Is this answer on-topic? (Yes/No)
- Would you be satisfied with this response? (Thumbs up/down)
- Any problems? (Free text feedback)
| Advantages ✓ | Limitations ⚠️ |
|---|---|
| Catches subtle quality issues | Expensive and time-consuming |
| Evaluates subjective aspects (tone, helpfulness) | Can’t review everything |
| Provides context and explanations | Reviewers may disagree with each other |
| Can identify unexpected problems | Hard to scale |
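Aggregating reviewer ratings is straightforward once each review mirrors the checklist above. A sketch, with hypothetical review records:

```python
# Aggregate human review ratings across the dimensions listed above.
# Each record mirrors the reviewer checklist: accurate (yes/no),
# helpful (1-5 stars), satisfied (thumbs up/down).
from statistics import mean

reviews = [
    {"accurate": True, "helpful": 4, "on_topic": True, "satisfied": True},
    {"accurate": True, "helpful": 5, "on_topic": True, "satisfied": True},
    {"accurate": False, "helpful": 2, "on_topic": True, "satisfied": False},
]

summary = {
    "accuracy_rate": mean(r["accurate"] for r in reviews),
    "avg_helpfulness": mean(r["helpful"] for r in reviews),
    "satisfaction_rate": mean(r["satisfied"] for r in reviews),
}
```

Because reviewers disagree, collect at least two or three ratings per sampled response before trusting the averages.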
User Feedback
Let your actual users tell you how you’re doing. Simple feedback methods:
| Method | Characteristics | When to Use |
|---|---|---|
| Thumbs up/down | Easy for users to give, high response rate, limited detail | Quick validation, high-volume interactions |
| Star ratings (1-5) | More nuanced than thumbs, users understand the pattern, still very simple | When you need more granular feedback |
| Follow-up behavior | Did user immediately ask another question? (might indicate bad answer) Did user copy the response? (good sign) Did user stop using the product? (bad sign) | Implicit feedback without asking users |
| Advantages ✓ | Limitations ⚠️ |
|---|---|
| Real feedback from real users | Many users don’t leave feedback |
| Reflects actual satisfaction | May be biased (angry users more likely to respond) |
| No extra work for you | Doesn’t tell you HOW to improve |
| Continuous feedback stream | Can be vague or unhelpful |
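Two numbers worth separating when you tally thumbs up/down: satisfaction among users who voted, and how many users voted at all. A sketch with a hypothetical event stream:

```python
# Turn raw thumbs up/down events into a satisfaction rate and a
# response rate. In practice the events come from your feedback buttons;
# here they are hardcoded for illustration.
from collections import Counter

total_interactions = 20                        # conversations served
feedback_events = ["up", "up", "down", "up"]   # only 4 users clicked anything

counts = Counter(feedback_events)
satisfaction_rate = counts["up"] / len(feedback_events)    # among voters only
response_rate = len(feedback_events) / total_interactions  # most users stay silent
```

A low response rate is exactly the bias the table warns about: the 75% satisfaction above describes the 20% of users who bothered to click, not everyone.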
The Best Approach: Combine All Three
Recommended strategy:
- Automated testing runs continuously (catches obvious problems fast)
- Human review samples randomly (catches quality issues)
- User feedback provides ongoing validation (real-world quality check)
What to Measure
Accuracy Metrics
| Metric | What It Measures | Example | Good For | Not Good For |
|---|---|---|---|---|
| Exact match | Does the answer exactly match what you expected? | “What’s 2+2?” should be exactly “4” | Factual questions with one right answer | Questions with multiple valid answers |
| Meaning match | Does the answer mean the same thing, even if worded differently? | “30-day guarantee” vs. “You have 30 days to return” (both correct) | Most real-world questions | Simple pass/fail checking (requires sophisticated evaluation) |
| Fact-checking | Are the claims in the answer actually true? Compare against your knowledge base and detect when AI makes things up | Verifying “Our refund policy is 30 days” against actual policy documents | Domains where accuracy matters (medical, legal, financial) | Subjective or opinion-based responses |
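Exact match is trivial to implement; meaning match is not. Production systems typically use embeddings or an LLM judge for it. The token-overlap (Jaccard) proxy below is only a cheap stand-in to show the shape of the comparison:

```python
# Exact match vs. a crude "meaning match" proxy. Real meaning matching
# usually needs embeddings or an LLM judge; token overlap is a toy version.
def exact_match(answer: str, expected: str) -> bool:
    return answer.strip().lower() == expected.strip().lower()

def token_overlap(answer: str, expected: str) -> float:
    # Jaccard similarity over lowercase word sets.
    a, b = set(answer.lower().split()), set(expected.lower().split())
    return len(a & b) / len(a | b) if a | b else 1.0

def meaning_match(answer: str, expected: str, threshold: float = 0.5) -> bool:
    return token_overlap(answer, expected) >= threshold
```

Note the failure mode: “30-day guarantee” and “You have 30 days to return” share almost no tokens despite meaning the same thing, which is exactly why real meaning matching requires more sophisticated evaluation.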
Quality Metrics
| Metric | Key Questions | What to Look For |
|---|---|---|
| Relevance | Does the answer actually address the question asked? Is it on-topic? | Answer stays focused on the question without unnecessary tangents |
| Completeness | Does it answer the full question? Is anything missing? | All parts of the question are addressed with important details included |
| Clarity | Is it easy to understand? Is it well-structured? | Answer is well-organized and appropriate for the audience (not too technical or too simple) |
| Helpfulness | Does it solve the user’s problem? Does it provide actionable information? | Users can take action based on the response and would be satisfied with it |
Performance Metrics
| Metric | What to Track | Example Target |
|---|---|---|
| Response time | How long users wait (track averages and worst cases) | 95% of queries under 2 seconds |
| Success rate | Percentage of requests that work without errors | Less than 1% error rate |
| Cost per query | How much each answer costs (track trends and identify expensive queries) | Average cost under $0.01 per query |
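All three performance metrics fall out of a per-query log. A sketch, assuming each record carries latency, a success flag, and a cost (the field names and values are hypothetical):

```python
# Compute the table's metrics from a log of per-query records:
# p95 latency (nearest-rank), success rate, and average cost per query.
import math

def performance_summary(records: list[dict]) -> dict:
    latencies = sorted(r["latency_s"] for r in records)
    p95_rank = math.ceil(0.95 * len(latencies)) - 1  # nearest-rank percentile
    return {
        "p95_latency_s": latencies[p95_rank],
        "success_rate": sum(r["ok"] for r in records) / len(records),
        "avg_cost_usd": sum(r["cost_usd"] for r in records) / len(records),
    }

records = [
    {"latency_s": 0.8, "ok": True, "cost_usd": 0.004},
    {"latency_s": 1.2, "ok": True, "cost_usd": 0.006},
    {"latency_s": 3.5, "ok": False, "cost_usd": 0.012},
    {"latency_s": 0.9, "ok": True, "cost_usd": 0.005},
]
summary = performance_summary(records)
```

This is why the table says to track worst cases, not just averages: the single 3.5s outlier dominates the p95 even though the average looks fine.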
Building Your Test Set
Start with Real Examples
Collect actual user questions:
- Look at your logs or analytics
- What do users actually ask?
- What are the most common questions?
Cover Different Scenarios
Common cases (60%)
- Typical questions most users ask
- Should have high success rate
- Baseline quality check
Edge cases (20%)
- Unusual or tricky questions
- Boundary conditions
- Things that might break
Known failures (20%)
- Questions that broke before
- Regression testing
- Make sure old problems don’t come back
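A quick way to keep yourself honest about the 60/20/20 mix is to tag each test case with its scenario type and check the proportions. A sketch with dummy questions:

```python
# Assemble a test set with the scenario mix above (60% common, 20% edge,
# 20% known failures), then verify the actual proportions.
from collections import Counter

test_set = (
    [{"q": f"common question {i}", "kind": "common"} for i in range(12)]
    + [{"q": f"edge case {i}", "kind": "edge"} for i in range(4)]
    + [{"q": f"known failure {i}", "kind": "regression"} for i in range(4)]
)

mix = Counter(t["kind"] for t in test_set)
proportions = {kind: count / len(test_set) for kind, count in mix.items()}
```

The `kind` tag also lets you report pass rates per scenario type later, which is where the interesting patterns show up.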
Include Hard Samples - Your Evals Should Never Pass 100%
A good evaluation should challenge your AI system and expose its limits. This helps you understand where improvement is needed and prevents false confidence.
Why hard samples matter:
Find Real Weaknesses
Easy questions hide problems. Hard questions expose where your AI actually struggles, showing you what needs improvement.
Prevent Regression
When you make changes, hard samples catch subtle degradations that easy questions would miss.
Set Realistic Expectations
Understanding your AI’s limits helps you set appropriate expectations for users and stakeholders.
Drive Improvement
You can’t improve what you don’t measure. Hard samples show you the frontier of what’s possible.
| Category | Examples | Why Include It |
|---|---|---|
| Ambiguous questions | “What’s the status?” (of what?), “How much does it cost?” (which product?) | Tests if AI asks for clarification instead of guessing |
| Multi-step reasoning | “If I buy product A and return it within 15 days, can I use that refund toward product B?” | Tests complex logic and policy understanding |
| Conflicting information | Questions where documents contain outdated or contradictory data | Tests source prioritization and version awareness |
| Domain-specific jargon | Questions using industry terms or abbreviations your users actually use | Tests real-world vocabulary coverage |
| Out-of-scope questions | Questions your AI shouldn’t answer (legal advice, medical diagnosis) | Tests safety and boundary recognition |
| Trick questions | Questions with false premises like “Why is your refund policy 60 days?” when it’s actually 30 days | Tests fact-checking and correction ability |
Target distribution: Aim for an evaluation pass rate between 70% and 90%. If you’re consistently above 90%, your tests are probably too easy. If you’re below 70%, focus on the fundamentals first.
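That target band is easy to encode as a check at the end of every eval run:

```python
# Interpret an eval pass rate against the 70-90% target band above.
def interpret_pass_rate(rate: float) -> str:
    if rate > 0.90:
        return "tests may be too easy - add harder samples"
    if rate < 0.70:
        return "focus on fundamentals first"
    return "healthy difficulty - keep iterating"
```

Printing this alongside the raw percentage reminds everyone that a 95% score is a signal about the test set, not just the model.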
Customer Support Chatbot
Easy sample: “What are your business hours?”
- Clear question, simple answer from knowledge base
Medium sample: an angry or frustrated customer complaint
- Tests: Emotion handling, escalation, urgency recognition, empathy
- Should: Acknowledge frustration, offer immediate help, escalate to human if needed
Hard sample: a request for support outside business hours
- Tests: Timezone awareness, alternative support channel knowledge
- Should: Suggest email/chat, mention timezone differences, offer self-service options
Technical Documentation Assistant
Easy sample: “How do I install the package?”
- Direct question from docs
Medium sample: a cryptic error message encountered during setup
- Tests: Error diagnosis, documentation gap recognition, troubleshooting
- Should: Reference relevant error documentation, suggest debugging steps
Hard sample: a question involving older or mismatched versions
- Tests: Version compatibility knowledge, cross-referencing docs
- Should: Explain compatibility, suggest upgrade path if needed
E-commerce Product Assistant
Easy sample: “What colors does the Classic T-Shirt come in?”
- Straightforward product spec question
Medium sample: a fit or sizing question, such as whether a shirt runs true to size
- Tests: Size comparison, brand knowledge, fit guidance
- Should: Provide sizing advice or admit uncertainty and suggest size chart
Hard sample: a comparison question, such as which of two backpacks is better for hiking
- Tests: Comparative analysis, use-case matching, feature prioritization
- Should: Compare relevant features, ask clarifying questions about hiking needs
Best practice: Review your failed test cases quarterly. The questions your AI fails today should help you create better training examples, improve your data, or refine your prompts. Each failure is an opportunity to make your system better.
Define “Good” Answers
For each test question, document:
The expected answer
- What should the AI say?
- What are acceptable variations?
- What would be wrong?
Why this is correct
- Where does this information come from?
- What makes this the right answer?
- When was this verified?
Update regularly
- Review monthly or quarterly
- Update when products/policies change
- Remove outdated questions
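One way to capture the checklist above is a small record per test case. The field names and the example values are illustrative, not a prescribed schema:

```python
# Document each test case with its expected answer, acceptable
# variations, provenance, and last verification date.
from dataclasses import dataclass, field

@dataclass
class TestCase:
    question: str
    expected_answer: str
    acceptable_variations: list[str] = field(default_factory=list)
    source: str = ""       # where the answer comes from
    verified_on: str = ""  # last verification date (ISO format)

case = TestCase(
    question="What is the refund window?",
    expected_answer="30 days",
    acceptable_variations=["You have 30 days to return items."],
    source="refund-policy.md",        # hypothetical document name
    verified_on="2024-01-15",
)
```

The `source` and `verified_on` fields are what make the quarterly review practical: stale dates tell you exactly which questions to re-verify.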
Analyzing Results
Look for Patterns
Example failure analysis:
Failed questions breakdown:
- Refund policy: 15 failures (37%)
- Shipping times: 10 failures (25%)
- Product specs: 8 failures (20%)
- Other: 7 failures (18%)
Questions to ask:
- Which categories fail most?
- Are there common themes?
- What types of questions struggle?
- Where should you focus improvement efforts?
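The breakdown above can be recreated from a flat list of failed cases (the percentages in the text are rounded; the exact values appear below):

```python
# Recreate the failure breakdown above from a flat list of failed cases.
from collections import Counter

failures = (
    ["refund policy"] * 15 + ["shipping times"] * 10
    + ["product specs"] * 8 + ["other"] * 7
)
counts = Counter(failures)
breakdown_pct = {
    category: 100 * n / len(failures) for category, n in counts.most_common()
}
# Counter.most_common() sorts descending, so the top entry is where to focus.
```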
Track Progress Over Time
Segment by Question Type
Break down performance:
- Factual questions: 88% accuracy
- How-to questions: 75% accuracy
- Opinion questions: 62% accuracy
What this tells you:
- Factual answers are good
- How-to needs improvement
- Opinion questions need a different approach
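Segmenting is just a group-by over tagged eval results. A sketch with hypothetical results:

```python
# Group eval results by question type and compute per-segment accuracy.
from collections import defaultdict

results = [  # (question type, passed?) pairs from an eval run
    ("factual", True), ("factual", True), ("factual", True), ("factual", False),
    ("how-to", True), ("how-to", False),
    ("opinion", False), ("opinion", True),
]

by_type: dict[str, list[bool]] = defaultdict(list)
for qtype, passed in results:
    by_type[qtype].append(passed)

accuracy = {qtype: sum(v) / len(v) for qtype, v in by_type.items()}
```

An overall score of, say, 75% would hide the fact that one segment is dragging everything down; the per-type view points you at it directly.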
Best Practices
Start Simple, Add Complexity
Set Clear Standards
| Must have ✓ | Nice to have ⭐ |
|---|---|
| Accuracy: At least 80% | Accuracy: Above 90% |
| Speed: 95% of queries under 2 seconds | Speed: Average under 1 second |
| Errors: Less than 1% failure rate | User rating: 4.5/5 stars |
Use these standards to:
- Decide if changes are good enough to deploy
- Know when you’re ready to launch
- Track progress toward goals
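The “must have” column translates directly into a deploy gate. A minimal sketch, assuming your eval pipeline produces these three metrics:

```python
# Gate a deploy on the "must have" thresholds from the table above.
def ready_to_deploy(metrics: dict) -> bool:
    return (
        metrics["accuracy"] >= 0.80          # at least 80% accurate
        and metrics["p95_latency_s"] <= 2.0  # 95% of queries under 2 seconds
        and metrics["error_rate"] < 0.01     # under 1% failure rate
    )

ok = ready_to_deploy({"accuracy": 0.85, "p95_latency_s": 1.4, "error_rate": 0.005})
```

Treat the “nice to have” column as dashboard targets rather than gates, so a 4.4-star rating never blocks an otherwise solid release.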
Test Regularly
| Frequency | Test Type | Purpose |
|---|---|---|
| Daily | Quick automated check (smoke test) | Catch breaking changes immediately |
| Weekly | Full automated test suite | Comprehensive quality check across all scenarios |
| Monthly | Human quality review | Evaluate subjective quality and find edge cases |
| Quarterly | Comprehensive audit | Deep dive into performance, update test sets, strategic planning |
Use Multiple Metrics
Balance multiple dimensions:
- Quality (accuracy, helpfulness)
- Performance (speed, reliability)
- Cost (per query, total)
Update Your Tests
Your evaluation dataset is a living document:
- Add new questions when features change
- Remove outdated questions
- Update expected answers when policies change
- Review and refresh quarterly
Common Mistakes
❌ Only Testing Happy Paths
The mistake: Only testing questions you expect to work.
Why it’s wrong: Edge cases and errors are where problems hide.
Better approach: Deliberately test weird, unusual, and difficult questions.
❌ Letting Tests Get Stale
The mistake: Creating a test set once and never updating it.
Why it’s wrong: Your product evolves, but your tests don’t, leading to false confidence.
Better approach: Review and update tests monthly, especially after changes.
❌ Ignoring User Feedback
The mistake: Only looking at automated metrics, ignoring what users say.
Why it’s wrong: Metrics might look good while users are actually unhappy.
Better approach: Combine automated metrics with real user feedback.
❌ Not Investigating Failures
The mistake: Seeing a failure, noting it, moving on.
Why it’s wrong: You don’t learn what caused it or how to prevent it.
Better approach: For each failure, understand why it failed and how to fix it.
Getting Started
Week 1: Create Your First Test Set
Gather 20-30 questions:
- 15 common questions from user logs
- 5 tricky or edge case questions
- 5 questions you know should work
For each question, document the expected answer:
- Write what the correct answer should be
- Note key points that must be included
- Mark optional details
Week 2: Run Your First Evaluation
Manual approach:
- Ask your AI each question
- Compare to expected answers
- Score each as pass/fail or 1-5 stars
- Calculate percentage
Week 3: Make Improvements
Based on failures:
- Which questions failed?
- Why did they fail?
- What can you improve? (Data? Prompts? Model?)
- Fix the biggest issues first
- One change at a time
- Re-evaluate after each change
