## What to Compare
Use version comparison to evaluate:

- Different models (GPT-4 vs GPT-4o-mini, Claude vs GPT)
- Prompt variations (different instructions or examples)
- Retrieval strategies (different `k` values, embedding models)
- System configurations (temperature, max tokens, etc.)
- Feature changes (new vs old implementation)
## Setting Up Comparisons
### Tag Traces with Versions
Use different projects or environment tags to organize different versions. For example, you can set the environment tag when initializing the connector.

### Run A/B Tests
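Tagging a version at initialization and randomly splitting traffic might look like the following sketch. The `ObsClient` connector shown here is hypothetical; substitute your tracing SDK's client.

```python
import random

# Hypothetical connector: substitute your tracing SDK's client here.
class ObsClient:
    def __init__(self, environment: str):
        self.environment = environment  # e.g. "version-a" or "version-b"

def pick_version(split: float = 0.5) -> str:
    """Randomly assign an incoming request to version 'a' or 'b'."""
    return "a" if random.random() < split else "b"

version = pick_version()
client = ObsClient(environment=f"version-{version}")
```

All traces produced through a client initialized this way carry the version tag, which is what makes the later filtering and comparison steps possible.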
Split traffic between versions by using different environment tags.

## Comparing Results
### Create Collections
Create separate collections for each version:

- Filter traces by `version: "a"`
- Create a collection "Version A Results"
- Repeat for Version B
- Review side-by-side
### Compare Metrics
Key metrics to compare:

| Metric | Version A | Version B | Winner |
|---|---|---|---|
| Average Cost | $0.05 | $0.01 | B |
| Average Latency | 2.1s | 1.3s | B |
| Error Rate | 2% | 5% | A |
| Avg User Rating | 4.2/5 | 3.8/5 | A |
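A table like this can be computed from exported traces. The sketch below uses plain Python and assumes each exported trace carries `version`, `cost`, `latency`, and `error` fields; your tool's export format will differ.

```python
from statistics import mean

# Hypothetical exported traces; the field names are assumptions.
traces = [
    {"version": "a", "cost": 0.05, "latency": 2.0, "error": False},
    {"version": "a", "cost": 0.06, "latency": 2.2, "error": True},
    {"version": "b", "cost": 0.01, "latency": 1.3, "error": False},
    {"version": "b", "cost": 0.01, "latency": 1.4, "error": False},
]

def summarize(traces, version):
    """Aggregate cost, latency, and error rate for one version."""
    rows = [t for t in traces if t["version"] == version]
    return {
        "avg_cost": mean(t["cost"] for t in rows),
        "avg_latency": mean(t["latency"] for t in rows),
        "error_rate": sum(t["error"] for t in rows) / len(rows),
    }

for v in ("a", "b"):
    print(v, summarize(traces, v))
```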
### Statistical Significance
Ensure you have enough data before deciding:

- Run at least 100 traces per version
- Check whether differences are statistically meaningful
- Consider variance and outliers
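For error rates, a two-proportion z-test is one standard way to check whether a difference is meaningful. A sketch, using the error-rate numbers from the table above:

```python
from math import sqrt

def two_proportion_z(errors_a, n_a, errors_b, n_b):
    """z statistic for the difference between two error rates."""
    p_a, p_b = errors_a / n_a, errors_b / n_b
    p_pool = (errors_a + errors_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

# 2 errors in 100 traces (A) vs 5 errors in 100 traces (B)
z = two_proportion_z(2, 100, 5, 100)
print(round(z, 2))  # -1.15; |z| > 1.96 would suggest significance at the 5% level
```

Note that with only 100 traces per version, a 2% vs 5% difference is not yet significant, which is exactly why a minimum sample size matters.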
## Common Comparisons
### Model Comparison
Question: Should we use GPT-4 or GPT-4o-mini?

Steps:

- Run the same queries through both models
- Tag traces with `model: "gpt-4"` and `model: "gpt-4o-mini"`
- Compare:
  - Quality (user feedback, accuracy)
  - Cost (GPT-4 is more expensive)
  - Speed (GPT-4o-mini is faster)
- Decide based on requirements

Example outcome:

- GPT-4o-mini: 90% of GPT-4 quality at 10% of the cost
- Decision: Use GPT-4o-mini for most queries, GPT-4 for complex ones
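That routing decision can be sketched as a simple rule. The word-count heuristic below is a placeholder for a real complexity signal, not something from the source:

```python
def choose_model(query: str, word_threshold: int = 30) -> str:
    """Route to the cheaper model unless the query looks complex.

    The word-count test is an assumed stand-in for a real
    complexity classifier.
    """
    is_complex = len(query.split()) > word_threshold
    return "gpt-4" if is_complex else "gpt-4o-mini"

print(choose_model("What are your opening hours?"))  # gpt-4o-mini
```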
### Prompt Comparison
Question: Which prompt structure works better, e.g. Prompt A (concise) vs Prompt B?

Steps:

- Deploy both prompts (A/B split)
- Tag traces with `prompt_version`
- Compare user satisfaction and accuracy
- Choose the winner
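The A/B split can be made deterministic per user, so each user always sees the same prompt variant. A sketch, with hypothetical prompt texts:

```python
import hashlib

# Hypothetical prompt variants; replace with your actual prompts.
PROMPTS = {
    "a": "Answer concisely.",
    "b": "Answer step by step, citing sources.",
}

def prompt_version(user_id: str) -> str:
    """Deterministically assign each user to a prompt variant."""
    digest = hashlib.sha256(user_id.encode()).digest()
    return "a" if digest[0] % 2 == 0 else "b"

version = prompt_version("user-123")
print(version, PROMPTS[version])
```

Hashing the user id instead of calling a random generator keeps assignments stable across sessions, which avoids one user's feedback being split across both variants.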
### Retrieval Strategy Comparison
Question: Should we retrieve 3 or 5 documents?

Steps:

- Version A: `k=3`
- Version B: `k=5`
- Compare:
  - Answer quality
  - Token usage (more docs = more tokens)
  - Latency
- Find the optimal balance

Example outcome:

- `k=5` improved quality by 5%
- But increased cost by 25%
- Decision: Use `k=3` for simple queries, `k=5` for complex ones
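That decision rule might be sketched as follows; the word-count threshold is an assumed heuristic, not from the source:

```python
def choose_k(query: str, word_threshold: int = 15) -> int:
    """Use k=3 for simple queries, k=5 for complex ones.

    The word-count test is a placeholder for a real complexity
    signal such as a query classifier.
    """
    return 5 if len(query.split()) > word_threshold else 3

print(choose_k("What is the refund policy?"))  # 3
```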
### Embedding Model Comparison
Question: Which embedding model gives better retrieval?

Steps:

- Re-embed the dataset with different models, e.g. `text-embedding-ada-002` and `text-embedding-3-large`
- Run the same queries against both
- Compare relevance scores and answer quality
- Choose the best performer
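One simple way to compare two retrieval runs is the overlap of their top-k results. A sketch with hypothetical document ids:

```python
def overlap_at_k(results_a, results_b, k=5):
    """Fraction of top-k document ids shared by two retrieval runs."""
    a, b = set(results_a[:k]), set(results_b[:k])
    return len(a & b) / k

# Hypothetical top-5 ids returned by each embedding model for one query
ada = ["d1", "d2", "d3", "d4", "d5"]
large = ["d1", "d3", "d7", "d2", "d9"]
print(overlap_at_k(ada, large))  # 0.6
```

Low overlap means the two models retrieve genuinely different documents, so answer quality, not just relevance scores, should drive the final choice.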
## Analyzing Differences
### Side-by-Side Comparison
View traces from different versions together:

- Open a trace from Version A
- Find the corresponding trace from Version B (same input)
- Compare:
  - Outputs (quality, length, tone)
  - Retrieved documents
  - Costs
  - Timing
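Finding corresponding traces can be automated by matching on input. A sketch, assuming traces export `input`, `output`, and `cost` fields:

```python
def pair_by_input(traces_a, traces_b):
    """Match traces across versions that share the same input."""
    by_input = {t["input"]: t for t in traces_b}
    return [(t, by_input[t["input"]]) for t in traces_a if t["input"] in by_input]

# Hypothetical exported traces from each version
a = [{"input": "refund policy?", "output": "30 days", "cost": 0.05}]
b = [{"input": "refund policy?", "output": "Refunds within 30 days.", "cost": 0.01}]

pairs = pair_by_input(a, b)
print(len(pairs))  # 1
```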
### Aggregate Statistics
Look at overall patterns across all traces in each version, such as average cost, latency, and error rate.

### Qualitative Review
Numbers don't tell the whole story:

- Read actual outputs from both versions
- Check for subtle quality differences
- Look for edge cases where one version fails
- Get feedback from stakeholders
## Making Decisions
### Define Success Criteria
Before comparing, define what matters. For example, a customer support bot might prioritize:

- Quality (most important)
- Cost (important)
- Speed (nice to have)

while another application might instead prioritize:

- Cost (most important)
- Quality (important)
- Speed (less important)
### Calculate ROI
Consider trade-offs. Example:

- GPT-4o-mini saves $10,000/month
- But quality drops slightly (4.2 vs 4.5 rating)
- Question: Is the quality drop worth the $10k savings?
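One way to frame that question is savings per rating point given up, using the numbers from the example above:

```python
monthly_savings = 10_000   # from switching to GPT-4o-mini
rating_drop = 4.5 - 4.2    # average user rating difference

savings_per_rating_point = monthly_savings / rating_drop
print(f"${savings_per_rating_point:,.0f} saved per rating point given up")
```

Whether that ratio is acceptable is a business judgment, but putting a number on it makes the trade-off explicit.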
### Gradual Rollout
Don't switch 100% of traffic immediately:

- Start with 10% traffic to the new version
- Monitor for issues
- Gradually increase (25%, 50%, 75%)
- Full rollout only after validation
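A percentage-based rollout can be sketched as follows; the stage values mirror the steps above:

```python
import random

ROLLOUT_STAGES = [0.10, 0.25, 0.50, 0.75, 1.00]

def use_new_version(rollout_fraction: float) -> bool:
    """Send `rollout_fraction` of traffic to the new version."""
    return random.random() < rollout_fraction

# Stage 1: roughly 10% of requests go to the new version
sent_to_new = sum(use_new_version(ROLLOUT_STAGES[0]) for _ in range(10_000))
print(sent_to_new)
```

Advancing through `ROLLOUT_STAGES` only after monitoring each stage matches the validation-before-full-rollout practice described above.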
## Best Practices
### Use Consistent Test Queries
Compare apples to apples by running the same queries through different versions (using different environment tags or projects).

### Document Assumptions
Record what you're testing: the variable under test, the time period, and the traffic split.

### Avoid Contamination
Ensure a fair comparison:

- Use the same time period (avoid external factors)
- Use the same data sources
- Use the same user base (random split)
- Control all variables except the one you're testing
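A quick config diff helps verify that only the variable under test changed. A sketch with hypothetical experiment configs:

```python
# Hypothetical experiment configs for the two versions
config_a = {"model": "gpt-4", "temperature": 0.2, "k": 3}
config_b = {"model": "gpt-4o-mini", "temperature": 0.2, "k": 3}

def changed_variables(a: dict, b: dict) -> set:
    """Keys that differ between two configs.

    For a clean experiment this should contain exactly the one
    variable you intend to test.
    """
    return {key for key in a.keys() | b.keys() if a.get(key) != b.get(key)}

print(changed_variables(config_a, config_b))  # {'model'}
```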
### Set Time Limits
Don't run comparisons indefinitely:

- Small changes: 1-3 days
- Major changes: 1-2 weeks
- Collect enough data, then decide
