Observability means being able to see and understand what’s happening inside your AI system. It’s like having a flight recorder for every interaction - you can go back and see exactly what happened, why, and how long it took.
It’s the difference between “my car made a weird noise yesterday” (vague, hard to fix) and a mechanic’s diagnostic readout showing exactly which sensor failed at what time (specific, fixable).

Why Observability Matters

You Can’t Fix What You Can’t See

| Without Observability | With Observability |
| --- | --- |
| User complains “the AI gave a wrong answer” | You see the exact question |
| You have no idea what question they asked | You see the exact response |
| You don’t know what the AI actually said | You see what documents it searched |
| You can’t see what information it used | You see which ones it used |
| You’re guessing at what went wrong | You know exactly what to fix |

What Is a Trace?

A trace is a complete record of one interaction with your AI system.

What a Trace Captures

Every AI Call

  • Which model was used (GPT-4, Claude, etc.)
  • What you sent to it (the prompt)
  • What it responded with
  • How many words/tokens were processed
  • How much it cost
  • How long it took

Every Search/Retrieval

  • What question was searched for
  • Which documents were found
  • Which ones were actually used
  • Which database was searched
  • How long the search took

Every Tool Used

  • If agents used calculators, APIs, or other tools
  • What information was passed to each tool
  • What each tool returned
  • Whether tools succeeded or failed

Timing Information

  • When each step started and finished
  • How long each part took
  • Total time for the whole interaction

Context and Metadata

  • Which user made the request
  • What session or conversation it’s part of
  • Whether it’s production or testing
  • Any custom labels you’ve added
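
Concretely, a trace and its spans can be modeled as plain records. This is a minimal sketch with hypothetical field names, not any particular observability platform’s schema:

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    name: str            # e.g. "llm_call", "retrieval", "tool_call"
    start: float         # when the step began (epoch seconds)
    end: float           # when the step finished
    input: str           # what went in (prompt, query, tool arguments)
    output: str          # what came out (response, documents, tool result)
    cost: float = 0.0    # dollars spent on this step
    status: str = "success"

    @property
    def duration(self) -> float:
        return self.end - self.start

@dataclass
class Trace:
    user_id: str
    session_id: str
    environment: str     # "production", "staging", "testing"
    tags: dict = field(default_factory=dict)   # custom labels
    spans: list = field(default_factory=list)  # the steps, in order

    @property
    def total_time(self) -> float:
        return sum(s.duration for s in self.spans)

    @property
    def total_cost(self) -> float:
        return sum(s.cost for s in self.spans)
```

With this shape, one interaction is a Trace whose total time and cost are just the sums over its spans.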

A Simple Example

User asks: “What’s your refund policy?” The trace shows:
Total time: 1.6 seconds
Total cost: $0.003

Step 1: Convert question to search format (0.1s, $0.0001)
Step 2: Search knowledge base (0.3s)
  → Found 3 relevant documents
  → Used documents: "Refund Policy 2024", "Returns Guide"
Step 3: Generate answer using GPT-4o mini (1.2s, $0.003)
  → Read 850 words of context
  → Generated 120-word response
Status: Success ✓
This trace tells you everything you need to know about what happened.

Understanding Spans

A span is one step within a trace. If a trace is a recipe, each span is one instruction.

Common Types of Spans

AI Model Calls

  • Generating a response
  • Answering a question
  • Summarizing text

Database Searches

  • Finding relevant documents
  • Looking up information
  • Retrieving data

Tool Executions

  • Calling calculators
  • Accessing external APIs
  • Running functions

Agent Workflows

  • Multi-step reasoning
  • Planning and execution
  • Decision-making processes

What Each Span Contains

| Information | Details |
| --- | --- |
| What it did | “Generated AI response” or “Searched knowledge base” |
| When it happened | Start and end time |
| How long it took | Duration in seconds |
| What went in | The input data or query |
| What came out | The result or response |
| Cost | How much this step cost |
| Status | Did it succeed or fail? |
| Parent | What triggered this step? |

Why This Information Is Valuable

Debugging Problems

Scenario: Users report AI is giving wrong answers about shipping.
| Without Traces | With Traces |
| --- | --- |
| You guess which part is broken | Filter for “shipping” questions |
| You test different fixes randomly | See that it’s finding old documents from 2022 |
| Takes days to find the problem | Problem identified in 5 minutes: outdated data |
Fix: Update knowledge base with current shipping info

Improving Performance

Scenario: AI feels slow to users. Trace analysis shows:
Average response time: 4.2 seconds

Breakdown:
- Database search: 0.3s (7%)
- AI generation: 1.2s (29%)
- Loading context: 2.7s (64%) ← Bottleneck!
Solution: Focus on speeding up context loading, the actual problem. Optimizing the AI model wouldn’t help much.
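
A breakdown like this falls directly out of span durations. A sketch, assuming each span has been reduced to a (name, seconds) pair:

```python
def time_breakdown(spans):
    """Return each step's share of total time, as whole percentages."""
    total = sum(seconds for _, seconds in spans)
    return {name: round(100 * seconds / total) for name, seconds in spans}

# The 4.2-second example from above:
spans = [("database_search", 0.3), ("ai_generation", 1.2), ("context_loading", 2.7)]
shares = time_breakdown(spans)            # {'database_search': 7, 'ai_generation': 29, 'context_loading': 64}
bottleneck = max(shares, key=shares.get)  # 'context_loading'
```

Averaging these shares across many traces tells you which step to optimize first.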

Controlling Costs

Scenario: AI costs are higher than expected. Trace analysis shows:
Top 10 most expensive queries:
1. "Write a 5-page report..." - $0.45
2. "Summarize these 20 documents..." - $0.38
3. "Compare all our products..." - $0.32
Insight: Very long responses are driving costs. You can:
  • Set response length limits
  • Warn users about expensive queries
  • Optimize prompts to be more concise
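
Ranking queries by cost is a one-line sort once each trace records its total. A sketch over plain dicts (field names are illustrative):

```python
def most_expensive(traces, n=10):
    """The n costliest traces, most expensive first."""
    return sorted(traces, key=lambda t: t["cost"], reverse=True)[:n]

traces = [
    {"query": "What's your refund policy?", "cost": 0.003},
    {"query": "Write a 5-page report...", "cost": 0.45},
    {"query": "Summarize these 20 documents...", "cost": 0.38},
]
top = most_expensive(traces, n=2)  # the report and the summary, in that order
```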

Understanding Usage Patterns

Trace analysis reveals:
By category:
- Product questions: 45% of traces
- Support questions: 30% of traces
- Billing questions: 15% of traces
- Other: 10% of traces

Success rates:
- Product questions: 92% success
- Support questions: 87% success
- Billing questions: 68% success ← Needs work
Action: Focus on improving billing question documentation.
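
Numbers like these come from grouping traces by category. A sketch, assuming each trace carries a category label and a success flag:

```python
from collections import defaultdict

def success_rates(traces):
    """Percent of successful traces per category, as whole numbers."""
    total = defaultdict(int)
    ok = defaultdict(int)
    for t in traces:
        total[t["category"]] += 1
        if t["success"]:
            ok[t["category"]] += 1
    return {cat: round(100 * ok[cat] / total[cat]) for cat in total}
```

The category with the lowest rate (billing, in the example above) is where to focus next.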

What You Can Do with Traces

View Individual Traces

See exactly what happened in any interaction:
  • Click on any trace to open it
  • See the step-by-step breakdown of all operations
  • View inputs and outputs at each step
  • Check which documents were retrieved and used
  • Identify what went wrong (if anything)
Find specific traces that need attention:
  • Show me all errors from last week
  • Find traces that cost more than $0.10
  • Show slow responses (over 5 seconds)
  • Find traces for a specific user
  • See traces using GPT-4 vs Claude
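
Each of those searches is just a predicate over trace fields. A sketch (the keyword names are illustrative; a real platform exposes these as UI filters or query parameters):

```python
def find_traces(traces, status=None, min_cost=None, slower_than=None, user_id=None):
    """Keep traces matching every given criterion; None means 'don't filter'."""
    matches = []
    for t in traces:
        if status is not None and t["status"] != status:
            continue
        if min_cost is not None and t["cost"] < min_cost:
            continue
        if slower_than is not None and t["seconds"] <= slower_than:
            continue
        if user_id is not None and t["user_id"] != user_id:
            continue
        matches.append(t)
    return matches
```

So `find_traces(traces, status="error")` pulls the failures, `find_traces(traces, min_cost=0.10)` the expensive ones, and `find_traces(traces, slower_than=5.0)` the slow ones.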

Compare Performance

Make before and after comparisons:
  • Did the new model improve quality?
  • Are responses faster after optimization?
  • Did costs go up or down?
  • Is the new version better?
Analyze patterns over time:
  • Are errors increasing?
  • Are we getting faster or slower?
  • Are costs rising?
  • Is quality improving?

Practical Examples

Finding Why Something Failed

Problem: User reports “AI said it doesn’t know, but the answer is definitely in our documentation.”
1. Search for the user’s query in traces. Locate the specific interaction by searching for the user’s query or filtering by timestamp.
2. Find their trace from that time. Open the trace to see the complete interaction flow.
3. Look at the Retrieved Documents section. Check which documents were found and used during the search.
4. Identify the mismatch. See that search found different documents than expected - the right documentation wasn’t retrieved.
5. Diagnose the root cause. Realize the documentation wasn’t tagged correctly, preventing the search from finding it.
Fix: Update document tags so search finds the right content.

Optimizing for Cost

Observation: Monthly AI costs jumped 40% this month.
1. Filter traces by cost. Use the cost filter to focus on high-cost interactions.
2. Sort from most to least expensive. Identify which traces are consuming the most resources.
3. Notice the pattern. Very long output responses are driving the costs.
4. Analyze the findings. Key discoveries:
   • 5% of queries generate 60% of costs
   • These are all “write a detailed report” type queries
   • They generate 1000+ word responses
5. Implement solutions. Actions taken:
   • Limit response length to 500 words
   • Ask users to be more specific
   • Use cheaper model for long outputs
Result: Costs drop by 35% without impacting quality
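
A finding like “5% of queries generate 60% of costs” is a simple concentration check over per-trace costs. A sketch:

```python
def cost_share_of_top(costs, top_fraction=0.05):
    """Fraction of total spend attributable to the costliest top_fraction of queries."""
    ranked = sorted(costs, reverse=True)
    k = max(1, int(len(ranked) * top_fraction))
    return sum(ranked[:k]) / sum(ranked)

costs = [9.0] + [1.0] * 9                        # 10 queries; one dominates
share = cost_share_of_top(costs, top_fraction=0.1)  # top 10% (1 query) is half the spend
```

When this number is high, a handful of query types is driving your bill, and targeted limits beat across-the-board cuts.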

Improving Response Quality

Goal: Reduce errors in product recommendation questions.
1. Filter for product recommendation traces with errors. Focus on failed interactions in the product recommendation category.
2. Review what went wrong in each case. Examine the inputs, outputs, and retrieved information for each error.
3. Identify the pattern. The AI often gets confused between similar products - a recurring issue emerges.
4. Diagnose the root cause. Retrieval is finding both products, but not explaining the differences clearly enough for the AI to distinguish them.
5. Implement the fix. Improve product descriptions to highlight key differences between similar items.

Result: Error rate drops from 15% to 4%

Best Practices

Add Helpful Labels

Tag your traces with useful information:
  • User type (free vs. paid customer)
  • Feature name (chatbot, search, recommendations)
  • Environment (production, staging, testing)
  • Version number (v1.0, v2.0)
Why this helps:
  • Filter production issues from test issues
  • Compare performance across features
  • Identify problems affecting specific user types
  • Track improvements across versions

Protect Privacy

Don’t capture sensitive information:
  • Don’t log passwords
  • Be careful with personal information
  • Avoid storing customer secrets
  • Follow your privacy policies
What to capture:
  • User IDs (not names)
  • Session IDs
  • Transaction IDs
  • General query topics
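
One common safeguard is stripping sensitive fields before a trace ever leaves your application. A minimal sketch with an illustrative blocklist (real systems often also scan free-text prompts for personal data):

```python
SENSITIVE_KEYS = {"password", "email", "credit_card", "ssn"}  # example blocklist

def redact(metadata):
    """Drop sensitive keys so they never reach the observability platform."""
    return {k: v for k, v in metadata.items() if k not in SENSITIVE_KEYS}

clean = redact({"user_id": "u_123", "password": "hunter2", "session_id": "s_9"})
# keeps user_id and session_id, drops the password
```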

Review Regularly

Make trace review a habit:
  • Daily: Quick check for errors
  • Weekly: Review slow or expensive traces
  • Monthly: Look for trends and patterns
  • After changes: Verify improvements worked

Set Up Alerts

Get notified about problems:
  • Error rate above 5%
  • Average response time over 3 seconds
  • Costs spike by 50%
  • Specific feature failing
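
The thresholds above translate directly into a periodic check over recent stats. A sketch with the example values hard-coded (field names are illustrative):

```python
def check_alerts(stats, baseline_daily_cost):
    """Return a message for each threshold from the list above that is breached."""
    alerts = []
    if stats["error_rate"] > 0.05:
        alerts.append("error rate above 5%")
    if stats["avg_seconds"] > 3.0:
        alerts.append("average response time over 3 seconds")
    if stats["daily_cost"] > 1.5 * baseline_daily_cost:
        alerts.append("costs spiked by more than 50%")
    return alerts
```

Run something like this hourly and route the messages to wherever your team will actually see them.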

Common Use Cases

Customer Support Escalation

Scenario: Angry customer says “your AI is broken.”
1. Look up their recent traces. Search for the customer’s interactions using their user ID or session information.
2. See exactly what happened. Review the complete trace to understand the full context of their interaction.
3. Identify the specific issue. Pinpoint what went wrong - which step failed, what information was used, what the AI said.
4. Respond with specifics. Provide a detailed response: “I see the problem - on Tuesday at 2:15pm, our AI incorrectly said [X]. This happened because [Y]. We’re fixing it now.”

Result: Customer feels heard, problem gets fixed - not just vague apologies.

A/B Testing Different Approaches

Scenario: Testing two different prompts.
1. Set up the test with proper tagging:
   • Version A: Tag traces with “prompt_v1”
   • Version B: Tag traces with “prompt_v2”
2. Let the test run for 1 week. Collect enough data from real usage to make an informed decision.
3. Analyze the results.
   Prompt V1:
   • Average quality score: 3.8/5
   • Average cost: $0.02
   • Average time: 1.5s
   Prompt V2:
   • Average quality score: 4.2/5
   • Average cost: $0.03
   • Average time: 2.1s
4. Make a data-driven decision. V2 is better quality but 50% more expensive and slower. Use V2 for premium users, V1 for free users.
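
The analysis step is an average per tag value. A sketch, assuming each trace carries the prompt-version tag plus quality, cost, and timing fields:

```python
from statistics import mean

def compare_variants(traces, tag="prompt_version"):
    """Average quality, cost, and latency for each value of the given tag."""
    groups = {}
    for t in traces:
        groups.setdefault(t[tag], []).append(t)
    return {
        version: {
            "quality": round(mean(t["quality"] for t in ts), 2),
            "cost": round(mean(t["cost"] for t in ts), 3),
            "seconds": round(mean(t["seconds"] for t in ts), 2),
        }
        for version, ts in groups.items()
    }
```

Make sure each group has enough traces before drawing conclusions; a week of real traffic usually does.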

Training and Quality Review

Scenario: Training your team on what good AI interactions look like.
1. Filter for highly-rated traces. Find interactions with user satisfaction 5/5 - the success stories.
2. Review what made them successful. Identify common patterns: what information was used, how responses were structured, what made them effective.
3. Filter for poorly-rated traces. Find interactions with user satisfaction 1/5 - the failures.
4. Review what went wrong. Understand failure patterns: missing information, incorrect retrieval, poor response quality.
5. Share findings with the team. Present both success and failure patterns to help the team understand what works and what doesn’t.

Result: Team understands patterns of success and failure.

Monitoring Your AI System

Key Metrics to Track

Performance Metrics

  • Average response time
  • Percentage of slow responses (over X seconds)
  • Success rate vs. error rate

Cost Metrics

  • Average cost per query
  • Daily/weekly/monthly total costs
  • Cost per user or per feature

Quality Metrics

  • User satisfaction scores
  • Error rate by category
  • Successful completion rate

Usage Metrics

  • Number of queries per day
  • Most common query types
  • Peak usage times
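
All four groups reduce to simple aggregates over a window of traces. A sketch computing a few headline numbers for one day (field names are illustrative):

```python
def daily_metrics(traces, slow_threshold=5.0):
    """Headline performance, cost, and quality numbers for one day of traces."""
    n = len(traces)
    errors = sum(1 for t in traces if t["status"] == "error")
    slow = sum(1 for t in traces if t["seconds"] > slow_threshold)
    return {
        "queries": n,
        "error_rate": errors / n,
        "slow_rate": slow / n,
        "avg_seconds": sum(t["seconds"] for t in traces) / n,
        "avg_cost": sum(t["cost"] for t in traces) / n,
        "total_cost": sum(t["cost"] for t in traces),
    }
```

Tracking these day over day is what turns raw traces into the healthy/warning judgment below.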

What “Good” Looks Like

| Healthy AI System | Warning Signs |
| --- | --- |
| 95%+ success rate | Error rate increasing |
| Stable costs (not increasing unexpectedly) | Costs rising without explanation |
| Response times meeting targets | Response times getting slower |
| High user satisfaction | User complaints increasing |
| Few error spikes | Quality declining over time |

Troubleshooting Common Issues

Traces aren’t showing up

Check:
  • Is your application properly instrumented?
  • Are traces being sent to the right place?
  • Any network issues preventing transmission?
  • Check your observability platform’s status

Too many traces to review

Solution:
  • Use filters to narrow down
  • Focus on errors first
  • Sample randomly (review 1% of successful traces)
  • Set up automated quality checks

Can’t find a specific trace

Tips:
  • Add better labels/tags when creating traces
  • Use custom attributes for important context
  • Search by user ID, session ID, or date
  • Keep retention periods long enough

Getting Started

Week 1: Set Up Tracing

Basic setup:
  • Instrument your AI application
  • Verify traces are being captured
  • Check that key information is included
  • Test with a few queries

Week 2: Add Context

Improve trace usefulness:
  • Add user IDs
  • Tag by feature or environment
  • Include version numbers
  • Add business context

Week 3: Review and Analyze

Start using your traces:
  • Look at recent errors
  • Review slow traces
  • Check expensive queries
  • Identify patterns

Week 4: Establish Routine

Make it habitual:
  • Daily error check
  • Weekly performance review
  • Monthly cost analysis
  • Set up alerts for issues

Next Steps