Observability means being able to see and understand what’s happening inside your AI system. It’s like having a flight recorder for every interaction - you can go back and see exactly what happened, why, and how long it took.
It’s the difference between “my car made a weird noise yesterday” (vague, hard to fix) and a mechanic’s diagnostic readout showing exactly which sensor failed at what time (specific, fixable).

Why Observability Matters

You Can’t Fix What You Can’t See

| Without Observability | With Observability |
| --- | --- |
| User complains “the AI gave a wrong answer” | You see the exact question |
| You have no idea what question they asked | You see the exact response |
| You don’t know what the AI actually said | You see what documents it searched |
| You can’t see what information it used | You see which ones it used |
| You’re guessing at what went wrong | You know exactly what to fix |

What Is a Trace?

A trace is a complete record of one interaction with your AI system.

What a Trace Captures

Every AI Call

  • Which model was used (GPT-4, Claude, etc.)
  • What you sent to it (the prompt)
  • What it responded with
  • How many words/tokens were processed
  • How much it cost
  • How long it took

Every Search/Retrieval

  • What question was searched for
  • Which documents were found
  • Which ones were actually used
  • Which database was searched
  • How long the search took

Every Tool Used

  • If agents used calculators, APIs, or other tools
  • What information was passed to each tool
  • What each tool returned
  • Whether tools succeeded or failed

Timing Information

  • When each step started and finished
  • How long each part took
  • Total time for the whole interaction

Context and Metadata

  • Which user made the request
  • What session or conversation it’s part of
  • Whether it’s production or testing
  • Any custom labels you’ve added
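
Concretely, a trace and its spans can be modeled as plain records. This is a minimal sketch with hypothetical field names, not any particular observability platform’s schema:

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    name: str            # e.g. "llm_call", "retrieval", "tool_call"
    start: float         # when the step began (epoch seconds)
    end: float           # when the step finished
    input: str           # what went in (prompt, query, tool arguments)
    output: str          # what came out (response, documents, tool result)
    cost: float = 0.0    # dollars spent on this step
    status: str = "success"

    @property
    def duration(self) -> float:
        return self.end - self.start

@dataclass
class Trace:
    user_id: str
    session_id: str
    environment: str     # "production", "staging", "testing"
    tags: dict = field(default_factory=dict)   # custom labels
    spans: list = field(default_factory=list)  # the steps, in order

    @property
    def total_time(self) -> float:
        return sum(s.duration for s in self.spans)

    @property
    def total_cost(self) -> float:
        return sum(s.cost for s in self.spans)
```

With this shape, one interaction is a Trace whose total time and cost are just the sums over its spans.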

A Simple Example

User asks: “What’s your refund policy?” The trace shows:
Total time: 1.6 seconds
Total cost: $0.003

Step 1: Convert question to search format (0.1s, $0.0001)
Step 2: Search knowledge base (0.3s)
  → Found 3 relevant documents
  → Used documents: "Refund Policy 2024", "Returns Guide"
Step 3: Generate answer using GPT-4o mini (1.2s, $0.003)
  → Read 850 words of context
  → Generated 120-word response
Status: Success ✓
This trace tells you everything you need to know about what happened.

Understanding Spans

A span is one step within a trace. If a trace is a recipe, each span is one instruction.

Common Types of Spans

AI Model Calls

  • Generating a response
  • Answering a question
  • Summarizing text

Database Searches

  • Finding relevant documents
  • Looking up information
  • Retrieving data

Tool Executions

  • Calling calculators
  • Accessing external APIs
  • Running functions

Agent Workflows

  • Multi-step reasoning
  • Planning and execution
  • Decision-making processes

What Each Span Contains

| Information | Details |
| --- | --- |
| What it did | “Generated AI response” or “Searched knowledge base” |
| When it happened | Start and end time |
| How long it took | Duration in seconds |
| What went in | The input data or query |
| What came out | The result or response |
| Cost | How much this step cost |
| Status | Did it succeed or fail? |
| Parent | What triggered this step? |

Why This Information Is Valuable

Debugging Problems

Scenario: Users report AI is giving wrong answers about shipping.
| Without Traces | With Traces |
| --- | --- |
| You guess which part is broken | Filter for “shipping” questions |
| You test different fixes randomly | See that it’s finding old documents from 2022 |
| Takes days to find the problem | Problem identified in 5 minutes: outdated data |
Fix: Update knowledge base with current shipping info

Improving Performance

Scenario: AI feels slow to users. Trace analysis shows:
Average response time: 4.2 seconds

Breakdown:
- Database search: 0.3s (7%)
- AI generation: 1.2s (29%)
- Loading context: 2.7s (64%) ← Bottleneck!
Solution: Focus on speeding up context loading, the actual problem. Optimizing the AI model wouldn’t help much.
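
A breakdown like this falls directly out of span durations. A sketch, assuming each span has been reduced to a (name, seconds) pair:

```python
def time_breakdown(spans):
    """Return each step's share of total time, as whole percentages."""
    total = sum(seconds for _, seconds in spans)
    return {name: round(100 * seconds / total) for name, seconds in spans}

# The 4.2-second example from above:
spans = [("database_search", 0.3), ("ai_generation", 1.2), ("context_loading", 2.7)]
shares = time_breakdown(spans)            # {'database_search': 7, 'ai_generation': 29, 'context_loading': 64}
bottleneck = max(shares, key=shares.get)  # 'context_loading'
```

Averaging these shares across many traces tells you which step to optimize first.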

Controlling Costs

Scenario: AI costs are higher than expected. Trace analysis shows:
Top 10 most expensive queries:
1. "Write a 5-page report..." - $0.45
2. "Summarize these 20 documents..." - $0.38
3. "Compare all our products..." - $0.32
Insight: Very long responses are driving costs. You can:
  • Set response length limits
  • Warn users about expensive queries
  • Optimize prompts to be more concise
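
Ranking queries by cost is a one-line sort once each trace records its total. A sketch over plain dicts (field names are illustrative):

```python
def most_expensive(traces, n=10):
    """The n costliest traces, most expensive first."""
    return sorted(traces, key=lambda t: t["cost"], reverse=True)[:n]

traces = [
    {"query": "What's your refund policy?", "cost": 0.003},
    {"query": "Write a 5-page report...", "cost": 0.45},
    {"query": "Summarize these 20 documents...", "cost": 0.38},
]
top = most_expensive(traces, n=2)  # the report and the summary, in that order
```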

Understanding Usage Patterns

Trace analysis reveals:
By category:
- Product questions: 45% of traces
- Support questions: 30% of traces
- Billing questions: 15% of traces
- Other: 10% of traces

Success rates:
- Product questions: 92% success
- Support questions: 87% success
- Billing questions: 68% success ← Needs work
Action: Focus on improving billing question documentation.
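
Numbers like these come from grouping traces by category. A sketch, assuming each trace carries a category label and a success flag:

```python
from collections import defaultdict

def success_rates(traces):
    """Percent of successful traces per category, as whole numbers."""
    total = defaultdict(int)
    ok = defaultdict(int)
    for t in traces:
        total[t["category"]] += 1
        if t["success"]:
            ok[t["category"]] += 1
    return {cat: round(100 * ok[cat] / total[cat]) for cat in total}
```

The category with the lowest rate (billing, in the example above) is where to focus next.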

What You Can Do with Traces

View Individual Traces

See exactly what happened in any interaction:
  • Click on any trace to open it
  • See the step-by-step breakdown of all operations
  • View inputs and outputs at each step
  • Check which documents were retrieved and used
  • Identify what went wrong (if anything)
Find specific traces that need attention:
  • Show me all errors from last week
  • Find traces that cost more than $0.10
  • Show slow responses (over 5 seconds)
  • Find traces for a specific user
  • See traces using GPT-4 vs Claude
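
Each of those searches is just a predicate over trace fields. A sketch (the keyword names are illustrative; a real platform exposes these as UI filters or query parameters):

```python
def find_traces(traces, status=None, min_cost=None, slower_than=None, user_id=None):
    """Keep traces matching every given criterion; None means 'don't filter'."""
    matches = []
    for t in traces:
        if status is not None and t["status"] != status:
            continue
        if min_cost is not None and t["cost"] < min_cost:
            continue
        if slower_than is not None and t["seconds"] <= slower_than:
            continue
        if user_id is not None and t["user_id"] != user_id:
            continue
        matches.append(t)
    return matches
```

So `find_traces(traces, status="error")` pulls the failures, `find_traces(traces, min_cost=0.10)` the expensive ones, and `find_traces(traces, slower_than=5.0)` the slow ones.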

Compare Performance

Make before and after comparisons:
  • Did the new model improve quality?
  • Are responses faster after optimization?
  • Did costs go up or down?
  • Is the new version better?
Analyze patterns over time:
  • Are errors increasing?
  • Are we getting faster or slower?
  • Are costs rising?
  • Is quality improving?

Practical Examples

Finding Why Something Failed

Problem: User reports “AI said it doesn’t know, but the answer is definitely in our documentation.”
1. Search for the user’s query in traces. Locate the specific interaction by searching for the user’s query or filtering by timestamp.
2. Find their trace from that time. Open the trace to see the complete interaction flow.
3. Look at the Retrieved Documents section. Check which documents were found and used during the search.
4. Identify the mismatch. See that search found different documents than expected - the right documentation wasn’t retrieved.
5. Diagnose the root cause. Realize the documentation wasn’t tagged correctly, preventing the search from finding it.
Fix: Update document tags so search finds the right content.

Optimizing for Cost

Observation: Monthly AI costs jumped 40% this month.
1. Filter traces by cost. Use the cost filter to focus on high-cost interactions.
2. Sort from most to least expensive. Identify which traces are consuming the most resources.
3. Notice the pattern. Very long output responses are driving the costs.
4. Analyze the findings. Key discoveries:
   • 5% of queries generate 60% of costs
   • These are all “write a detailed report” type queries
   • They generate 1000+ word responses
5. Implement solutions. Actions taken:
   • Limit response length to 500 words
   • Ask users to be more specific
   • Use cheaper model for long outputs
Result: Costs drop by 35% without impacting quality
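
A finding like “5% of queries generate 60% of costs” is a simple concentration check over per-trace costs. A sketch:

```python
def cost_share_of_top(costs, top_fraction=0.05):
    """Fraction of total spend attributable to the costliest top_fraction of queries."""
    ranked = sorted(costs, reverse=True)
    k = max(1, int(len(ranked) * top_fraction))
    return sum(ranked[:k]) / sum(ranked)

costs = [9.0] + [1.0] * 9                        # 10 queries; one dominates
share = cost_share_of_top(costs, top_fraction=0.1)  # top 10% (1 query) is half the spend
```

When this number is high, a handful of query types is driving your bill, and targeted limits beat across-the-board cuts.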

Improving Response Quality

Goal: Reduce errors in product recommendation questions.
1. Filter for product recommendation traces with errors. Focus on failed interactions in the product recommendation category.
2. Review what went wrong in each case. Examine the inputs, outputs, and retrieved information for each error.
3. Identify the pattern. The AI often gets confused between similar products - a recurring issue emerges.
4. Diagnose the root cause. Retrieval is finding both products, but not explaining the differences clearly enough for the AI to distinguish them.
5. Implement the fix. Improve product descriptions to highlight key differences between similar items.

Result: Error rate drops from 15% to 4%

Best Practices

Add Helpful Labels

Tag your traces with useful information:
  • User type (free vs. paid customer)
  • Feature name (chatbot, search, recommendations)
  • Environment (production, staging, testing)
  • Version number (v1.0, v2.0)
Why this helps:
  • Filter production issues from test issues
  • Compare performance across features
  • Identify problems affecting specific user types
  • Track improvements across versions

Protect Privacy

Don’t capture sensitive information:
  • Don’t log passwords
  • Be careful with personal information
  • Avoid storing customer secrets
  • Follow your privacy policies
What to capture:
  • User IDs (not names)
  • Session IDs
  • Transaction IDs
  • General query topics
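
One common safeguard is stripping sensitive fields before a trace ever leaves your application. A minimal sketch with an illustrative blocklist (real systems often also scan free-text prompts for personal data):

```python
SENSITIVE_KEYS = {"password", "email", "credit_card", "ssn"}  # example blocklist

def redact(metadata):
    """Drop sensitive keys so they never reach the observability platform."""
    return {k: v for k, v in metadata.items() if k not in SENSITIVE_KEYS}

clean = redact({"user_id": "u_123", "password": "hunter2", "session_id": "s_9"})
# keeps user_id and session_id, drops the password
```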

Review Regularly

Make trace review a habit:
  • Daily: Quick check for errors
  • Weekly: Review slow or expensive traces
  • Monthly: Look for trends and patterns
  • After changes: Verify improvements worked

Set Up Alerts

Get notified about problems:
  • Error rate above 5%
  • Average response time over 3 seconds
  • Costs spike by 50%
  • Specific feature failing
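
The thresholds above translate directly into a periodic check over recent stats. A sketch with the example values hard-coded (field names are illustrative):

```python
def check_alerts(stats, baseline_daily_cost):
    """Return a message for each threshold from the list above that is breached."""
    alerts = []
    if stats["error_rate"] > 0.05:
        alerts.append("error rate above 5%")
    if stats["avg_seconds"] > 3.0:
        alerts.append("average response time over 3 seconds")
    if stats["daily_cost"] > 1.5 * baseline_daily_cost:
        alerts.append("costs spiked by more than 50%")
    return alerts
```

Run something like this hourly and route the messages to wherever your team will actually see them.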

Common Use Cases

Customer Support Escalation

Scenario: Angry customer says “your AI is broken.”
1. Look up their recent traces. Search for the customer’s interactions using their user ID or session information.
2. See exactly what happened. Review the complete trace to understand the full context of their interaction.
3. Identify the specific issue. Pinpoint what went wrong - which step failed, what information was used, what the AI said.
4. Respond with specifics. Provide a detailed response: “I see the problem - on Tuesday at 2:15pm, our AI incorrectly said [X]. This happened because [Y]. We’re fixing it now.”

Result: Customer feels heard, problem gets fixed - not just vague apologies.

A/B Testing Different Approaches

Scenario: Testing two different prompts.
1. Set up the test with proper tagging:
   • Version A: Tag traces with “prompt_v1”
   • Version B: Tag traces with “prompt_v2”
2. Let the test run for 1 week. Collect enough data from real usage to make an informed decision.
3. Analyze the results.
   Prompt V1:
   • Average quality score: 3.8/5
   • Average cost: $0.02
   • Average time: 1.5s
   Prompt V2:
   • Average quality score: 4.2/5
   • Average cost: $0.03
   • Average time: 2.1s
4. Make a data-driven decision. V2 is better quality but 50% more expensive and slower. Use V2 for premium users, V1 for free users.
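
The analysis step is an average per tag value. A sketch, assuming each trace carries the prompt-version tag plus quality, cost, and timing fields:

```python
from statistics import mean

def compare_variants(traces, tag="prompt_version"):
    """Average quality, cost, and latency for each value of the given tag."""
    groups = {}
    for t in traces:
        groups.setdefault(t[tag], []).append(t)
    return {
        version: {
            "quality": round(mean(t["quality"] for t in ts), 2),
            "cost": round(mean(t["cost"] for t in ts), 3),
            "seconds": round(mean(t["seconds"] for t in ts), 2),
        }
        for version, ts in groups.items()
    }
```

Make sure each group has enough traces before drawing conclusions; a week of real traffic usually does.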

Training and Quality Review

Scenario: Training your team on what good AI interactions look like.
1. Filter for highly-rated traces. Find interactions with user satisfaction 5/5 - the success stories.
2. Review what made them successful. Identify common patterns: what information was used, how responses were structured, what made them effective.
3. Filter for poorly-rated traces. Find interactions with user satisfaction 1/5 - the failures.
4. Review what went wrong. Understand failure patterns: missing information, incorrect retrieval, poor response quality.
5. Share findings with the team. Present both success and failure patterns to help the team understand what works and what doesn’t.

Result: Team understands patterns of success and failure.

Monitoring Your AI System

Key Metrics to Track

Performance Metrics

  • Average response time
  • Percentage of slow responses (over X seconds)
  • Success rate vs. error rate

Cost Metrics

  • Average cost per query
  • Daily/weekly/monthly total costs
  • Cost per user or per feature

Quality Metrics

  • User satisfaction scores
  • Error rate by category
  • Successful completion rate

Usage Metrics

  • Number of queries per day
  • Most common query types
  • Peak usage times
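
All four groups reduce to simple aggregates over a window of traces. A sketch computing a few headline numbers for one day (field names are illustrative):

```python
def daily_metrics(traces, slow_threshold=5.0):
    """Headline performance, cost, and quality numbers for one day of traces."""
    n = len(traces)
    errors = sum(1 for t in traces if t["status"] == "error")
    slow = sum(1 for t in traces if t["seconds"] > slow_threshold)
    return {
        "queries": n,
        "error_rate": errors / n,
        "slow_rate": slow / n,
        "avg_seconds": sum(t["seconds"] for t in traces) / n,
        "avg_cost": sum(t["cost"] for t in traces) / n,
        "total_cost": sum(t["cost"] for t in traces),
    }
```

Tracking these day over day is what turns raw traces into the healthy/warning judgment below.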

What “Good” Looks Like

| Healthy AI System | Warning Signs |
| --- | --- |
| 95%+ success rate | Error rate increasing |
| Stable costs (not increasing unexpectedly) | Costs rising without explanation |
| Response times meeting targets | Response times getting slower |
| High user satisfaction | User complaints increasing |
| Few error spikes | Quality declining over time |

Troubleshooting Common Issues

Traces aren’t showing up

Check:
  • Is your application properly instrumented?
  • Are traces being sent to the right place?
  • Any network issues preventing transmission?
  • Check your observability platform’s status

Too many traces to review

Solution:
  • Use filters to narrow down
  • Focus on errors first
  • Sample randomly (review 1% of successful traces)
  • Set up automated quality checks

Can’t find a specific trace

Tips:
  • Add better labels/tags when creating traces
  • Use custom attributes for important context
  • Search by user ID, session ID, or date
  • Keep retention periods long enough

Getting Started

Week 1: Set Up Tracing

Basic setup:
  • Instrument your AI application
  • Verify traces are being captured
  • Check that key information is included
  • Test with a few queries

Week 2: Add Context

Improve trace usefulness:
  • Add user IDs
  • Tag by feature or environment
  • Include version numbers
  • Add business context

Week 3: Review and Analyze

Start using your traces:
  • Look at recent errors
  • Review slow traces
  • Check expensive queries
  • Identify patterns

Week 4: Establish Routine

Make it habitual:
  • Daily error check
  • Weekly performance review
  • Monthly cost analysis
  • Set up alerts for issues

Next Steps