Compare different versions of your AI system to make data-driven decisions about what to deploy.

What to Compare

Use version comparison to evaluate:
  • Different models (GPT-4 vs GPT-4o-mini, Claude vs GPT)
  • Prompt variations (different instructions or examples)
  • Retrieval strategies (different k values, embedding models)
  • System configurations (temperature, max tokens, etc.)
  • Feature changes (new vs old implementation)

Setting Up Comparisons

Tag Traces with Versions

Use different projects or environment tags to organize different versions. For example, you can use the environment tag when initializing the connector:
from arcbeam_connector.langchain.connector import ArcbeamLangConnector

# Version A
connector = ArcbeamLangConnector(
    base_url="http://platform.arcbeam.ai",
    api_key="your-api-key-here",
    project_id="your-project-id-here",
    environment="v2.1.0-gpt4o-mini",
)
connector.init()
This allows filtering and comparing traces by environment later.

Run A/B Tests

Split traffic between versions by using different environment tags:
import random
from arcbeam_connector.langchain.connector import ArcbeamLangConnector
from langchain_openai import ChatOpenAI

if random.random() < 0.5:
    # Version A
    connector = ArcbeamLangConnector(
        base_url="http://platform.arcbeam.ai",
        api_key="your-api-key-here",
        project_id="your-project-id-here",
        environment="version-a-gpt4",
    )
    connector.init()
    model = ChatOpenAI(model="gpt-4")
else:
    # Version B
    connector = ArcbeamLangConnector(
        base_url="http://platform.arcbeam.ai",
        api_key="your-api-key-here",
        project_id="your-project-id-here",
        environment="version-b-gpt4o-mini",
    )
    connector.init()
    model = ChatOpenAI(model="gpt-4o-mini")
[INSERT SCREENSHOT]

Comparing Results

Create Collections

Create separate collections for each version:
  1. Filter traces by environment tag (e.g. environment: "version-a-gpt4")
  2. Create collection “Version A Results”
  3. Repeat for Version B
  4. Review side-by-side
Learn more about collections →

Compare Metrics

Key metrics to compare:
Metric            Version A   Version B   Winner
Average Cost      $0.05       $0.01       B
Average Latency   2.1s        1.3s        B
Error Rate        2%          5%          A
Avg User Rating   4.2/5       3.8/5       A
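A table like the one above can be produced from exported per-trace numbers. A minimal sketch, assuming you have already pulled each version's raw metric values out of the dashboard (no Arcbeam API is used here):

```python
from statistics import mean

def compare_metric(name, a_values, b_values, lower_is_better=True):
    """Average each version's per-trace values and pick a winner.

    Set lower_is_better=False for metrics like user rating, where
    higher values win.
    """
    avg_a, avg_b = mean(a_values), mean(b_values)
    if avg_a == avg_b:
        winner = "tie"
    elif (avg_a < avg_b) == lower_is_better:
        winner = "A"
    else:
        winner = "B"
    return {"metric": name, "A": avg_a, "B": avg_b, "winner": winner}

# Example: per-trace costs exported for each version
row = compare_metric("Average Cost", [0.04, 0.05, 0.06], [0.01, 0.01, 0.01])
```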

Statistical Significance

Ensure enough data before deciding:
  • Run at least 100 traces per version
  • Check if differences are meaningful
  • Consider variance and outliers
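One way to check whether a difference is meaningful rather than noise is a two-sample test. A stdlib-only sketch of Welch's t-statistic (for exact p-values you would use something like scipy.stats.ttest_ind; the "|t| > ~2" rule of thumb below assumes roughly 100+ traces per version):

```python
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t-statistic for two independent samples of per-trace metrics.

    As a rough guide, |t| > ~2 suggests the difference is significant at
    about the 95% level for samples of 100+ traces.
    """
    va, vb = variance(a), variance(b)
    se = (va / len(a) + vb / len(b)) ** 0.5
    return (mean(a) - mean(b)) / se
```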

Common Comparisons

Model Comparison

Question: Should we use GPT-4 or GPT-4o-mini?
Steps:
  1. Run same queries through both models
  2. Tag with model: "gpt-4" and model: "gpt-4o-mini"
  3. Compare:
    • Quality (user feedback, accuracy)
    • Cost (GPT-4 is more expensive)
    • Speed (GPT-4o-mini is faster)
  4. Decide based on requirements
Example Result:
  • GPT-4o-mini: 90% of GPT-4 quality at 10% of the cost
  • Decision: Use GPT-4o-mini for most queries, GPT-4 for complex ones
[INSERT SCREENSHOT]
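The "GPT-4 for complex queries only" decision can be implemented with a small router. A sketch in which the complexity heuristic (word count plus a few marker phrases) is purely illustrative and should be tuned on your own traffic:

```python
def pick_model(query: str) -> str:
    """Route queries: the cheap model by default, GPT-4 for complex ones.

    The heuristic below is an illustrative placeholder, not a recommendation.
    """
    complex_markers = ["compare", "explain why", "step by step"]
    is_complex = (
        len(query.split()) > 50
        or any(m in query.lower() for m in complex_markers)
    )
    return "gpt-4" if is_complex else "gpt-4o-mini"
```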

Prompt Comparison

Question: Which prompt structure works better?
Prompt A (concise):
Answer the user's question based on the provided context.
Prompt B (detailed):
You are a helpful assistant. Use only the provided context to answer.
If the context doesn't contain the answer, say so clearly.
Be concise and accurate.
Steps:
  1. Deploy both prompts (A/B split)
  2. Tag with prompt_version
  3. Compare user satisfaction and accuracy
  4. Choose winner
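Step 2 (tagging with prompt_version) can be as simple as picking a variant at random and recording its tag alongside the trace. A sketch; the tag names are made up for illustration:

```python
import random

PROMPTS = {
    "prompt-a-concise": "Answer the user's question based on the provided context.",
    "prompt-b-detailed": (
        "You are a helpful assistant. Use only the provided context to answer. "
        "If the context doesn't contain the answer, say so clearly. "
        "Be concise and accurate."
    ),
}

def choose_prompt():
    """Pick a prompt variant at random and return (tag, text).

    Record the tag on the trace (e.g. via the environment tag) so you can
    filter by prompt_version later.
    """
    tag = random.choice(list(PROMPTS))
    return tag, PROMPTS[tag]
```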

Retrieval Strategy Comparison

Question: Should we retrieve 3 or 5 documents?
Steps:
  1. Version A: k=3
  2. Version B: k=5
  3. Compare:
    • Answer quality
    • Token usage (more docs = more tokens)
    • Latency
  4. Find optimal balance
Example Result:
  • k=5 improved quality by 5%
  • But increased cost by 25%
  • Decision: Use k=3 for simple queries, k=5 for complex ones
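The "k=3 for simple queries, k=5 for complex ones" decision can be encoded as a tiny selector. A sketch; the word-count threshold is an illustrative assumption:

```python
def choose_k(query: str, default_k: int = 3, complex_k: int = 5) -> int:
    """Retrieve more documents only for queries that look complex.

    The 15-word threshold is a placeholder; calibrate it against
    your own query traffic.
    """
    return complex_k if len(query.split()) > 15 else default_k
```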

Embedding Model Comparison

Question: Which embedding model gives better retrieval?
Steps:
  1. Re-embed dataset with different models:
    • text-embedding-ada-002
    • text-embedding-3-large
  2. Run same queries against both
  3. Compare relevance scores and answer quality
  4. Choose best performer
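For step 3, one simple relevance score is the cosine similarity between a query vector and its best-matching retrieved document, averaged across queries; run it once per embedding model and compare. A pure-Python sketch (vectors are plain lists of floats here; in practice they would come from the embedding API):

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def avg_top1_similarity(query_vecs, doc_vecs_per_query):
    """For each query, score its best retrieved document; average across queries."""
    scores = [
        max(cosine(q, d) for d in docs)
        for q, docs in zip(query_vecs, doc_vecs_per_query)
    ]
    return sum(scores) / len(scores)
```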

Analyzing Differences

Side-by-Side Comparison

View traces from different versions together:
  1. Open trace from Version A
  2. Find corresponding trace from Version B (same input)
  3. Compare:
    • Outputs (quality, length, tone)
    • Retrieved documents
    • Costs
    • Timing
[INSERT SCREENSHOT]

Aggregate Statistics

Look at overall patterns:
Version A (GPT-4):
- Total traces: 500
- Success rate: 98%
- Avg cost: $0.045
- Avg latency: 2.3s
- Avg user rating: 4.5/5

Version B (GPT-4o-mini):
- Total traces: 500
- Success rate: 95%
- Avg cost: $0.008
- Avg latency: 1.1s
- Avg user rating: 4.2/5
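A summary block like the one above can be computed from exported trace records. A sketch that assumes a simple per-trace dict schema (the field names are illustrative, not an Arcbeam export format):

```python
from statistics import mean

def summarize(traces):
    """Aggregate per-version stats from exported trace records.

    Assumes each trace is a dict like:
    {"success": bool, "cost": float, "latency": float, "rating": float | None}.
    """
    rated = [t["rating"] for t in traces if t.get("rating") is not None]
    return {
        "total": len(traces),
        "success_rate": sum(t["success"] for t in traces) / len(traces),
        "avg_cost": mean(t["cost"] for t in traces),
        "avg_latency": mean(t["latency"] for t in traces),
        "avg_rating": mean(rated) if rated else None,
    }
```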

Qualitative Review

Numbers don’t tell the whole story:
  1. Read actual outputs from both versions
  2. Check for subtle quality differences
  3. Look for edge cases where one fails
  4. Get feedback from stakeholders

Making Decisions

Define Success Criteria

Before comparing, define what matters.
For a customer support bot:
  1. Quality (most important)
  2. Cost (important)
  3. Speed (nice to have)
For batch processing:
  1. Cost (most important)
  2. Quality (important)
  3. Speed (less important)

Calculate ROI

Consider the trade-offs.
Example:
  • GPT-4o-mini saves $10,000/month
  • But quality drops slightly (4.2 vs 4.5 rating)
  • Question: Is the quality drop worth $10k savings?
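The back-of-the-envelope version of that question: does the monthly saving exceed the business cost of the quality drop? A sketch in which value_per_rating_point (dollars per 1.0 of average rating, per month) is an assumption you have to estimate yourself:

```python
def switch_is_worth_it(monthly_savings, rating_old, rating_new,
                       value_per_rating_point):
    """Net value of switching = savings minus the cost of the quality drop.

    value_per_rating_point is a business assumption, not a measured quantity.
    """
    quality_cost = (rating_old - rating_new) * value_per_rating_point
    return monthly_savings - quality_cost > 0

# Example from above: $10k/month savings, rating drops 4.5 -> 4.2
```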

Gradual Rollout

Don’t switch 100% immediately:
  1. Start with 10% traffic to new version
  2. Monitor for issues
  3. Gradually increase (25%, 50%, 75%)
  4. Full rollout only after validation
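The rollout steps above can be gated with a single configurable percentage. A minimal sketch (per-request random assignment; see the note on contamination below if you need per-user stickiness):

```python
import random

def use_new_version(rollout_percent: float) -> bool:
    """Send roughly rollout_percent of traffic to the new version.

    Start at 10, then raise to 25/50/75/100 as monitoring stays clean.
    """
    return random.random() * 100 < rollout_percent
```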

Best Practices

Use Consistent Test Queries

Compare apples to apples by running the same queries through different versions (using different environment tags or projects):
test_queries = [
    "What is the refund policy?",
    "How do I cancel my subscription?",
    "What payment methods do you accept?"
]

for query in test_queries:
    # qa_chain is whichever chain you are evaluating (e.g. a retrieval chain
    # built in your app). Run through both versions and compare results in the
    # Arcbeam dashboard, filtering by environment tag to separate the versions.
    result = qa_chain.invoke(query)

Document Assumptions

Record what you’re testing:
Version A: GPT-4, temp=0, k=3, prompt v2
Version B: GPT-4o-mini, temp=0, k=5, prompt v3

Hypothesis: GPT-4o-mini with more context (k=5) will match
GPT-4 quality at lower cost.

Avoid Contamination

Ensure fair comparison:
  • Use same time period (avoid external factors)
  • Same data sources
  • Same user base (random split)
  • Control all variables except what you’re testing
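For a clean random split of the user base, a hash-based assignment keeps each user on the same version for the whole experiment, which avoids a single user seeing (and being influenced by) both variants. A sketch; the version labels are illustrative:

```python
import hashlib

def assign_version(user_id: str, split_percent: int = 50) -> str:
    """Deterministically bucket users into versions.

    The same user_id always lands in the same bucket, giving a stable
    random split without cross-contamination between variants.
    """
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "version-a" if bucket < split_percent else "version-b"
```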

Set Time Limits

Don’t run indefinitely:
  • Small changes: 1-3 days
  • Major changes: 1-2 weeks
  • Collect enough data, then decide

Next Steps