Why Observability Matters
You Can’t Fix What You Can’t See
| Without Observability | With Observability |
|---|---|
| A user complains “the AI gave a wrong answer,” but you have no idea what they asked | You see the exact question |
| You don’t know what the AI actually said | You see the exact response |
| You can’t see what information it used | You see which documents it searched and which ones it used |
| You’re guessing at what went wrong | You know exactly what to fix |
What Is a Trace?
A trace is a complete record of one interaction with your AI system.
What a Trace Captures
Every AI Call
- Which model was used (GPT-4, Claude, etc.)
- What you sent to it (the prompt)
- What it responded with
- How many words/tokens were processed
- How much it cost
- How long it took
Every Search/Retrieval
- What question was searched for
- Which documents were found
- Which ones were actually used
- Which database was searched
- How long the search took
Every Tool Used
- If agents used calculators, APIs, or other tools
- What information was passed to each tool
- What each tool returned
- Whether tools succeeded or failed
Timing Information
- When each step started and finished
- How long each part took
- Total time for the whole interaction
Context and Metadata
- Which user made the request
- What session or conversation it’s part of
- Whether it’s production or testing
- Any custom labels you’ve added
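The fields above can be pictured as a single record. Here is a minimal sketch of what one trace might look like as a plain Python dataclass; the field names are illustrative, not the schema of any particular observability platform.

```python
import uuid
from dataclasses import dataclass, field

# Hypothetical shape of a single trace record (illustrative field names).
@dataclass
class Trace:
    user_id: str                  # which user made the request
    environment: str              # "production" or "testing"
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    model: str = ""               # which model was used, e.g. "gpt-4"
    prompt: str = ""              # what you sent to it
    response: str = ""            # what it responded with
    tokens: int = 0               # how many tokens were processed
    cost_usd: float = 0.0         # how much it cost
    started_at: float = 0.0       # epoch seconds
    ended_at: float = 0.0

    @property
    def duration_s(self) -> float:
        return self.ended_at - self.started_at

t = Trace(user_id="u-123", environment="production",
          model="gpt-4", prompt="What's your refund policy?",
          response="Refunds are available within 30 days.",
          tokens=84, cost_usd=0.0021,
          started_at=1700000000.0, ended_at=1700000001.4)
print(f"{t.model} call took {t.duration_s:.1f}s and cost ${t.cost_usd}")
```

Real tracing SDKs capture the same categories of information automatically; the point of the sketch is only to show that a trace is structured data you can query, not free-form logs.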
A Simple Example
User asks: “What’s your refund policy?” The trace shows the search against your policy documents, which documents were retrieved and actually used, the model call with its exact prompt and response, and the cost and timing of every step.
Understanding Spans
A span is one step within a trace. If a trace is a recipe, each span is one instruction.
Common Types of Spans
AI Model Calls
- Generating a response
- Answering a question
- Summarizing text
Database Searches
- Finding relevant documents
- Looking up information
- Retrieving data
Tool Executions
- Calling calculators
- Accessing external APIs
- Running functions
Agent Workflows
- Multi-step reasoning
- Planning and execution
- Decision-making processes
What Each Span Contains
| Information | Details |
|---|---|
| What it did | “Generated AI response” or “Searched knowledge base” |
| When it happened | Start and end time |
| How long it took | Duration in seconds |
| What went in | The input data or query |
| What came out | The result or response |
| Cost | How much this step cost |
| Status | Did it succeed or fail? |
| Parent | What triggered this step? |
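The table above maps naturally onto a small data structure. The sketch below is illustrative, not a real SDK: it shows how spans nest under a parent and how a trace’s total time comes from its spans.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative span record mirroring the table above (not a real SDK).
@dataclass
class Span:
    name: str                     # what it did
    start: float                  # when it started (epoch seconds)
    end: float                    # when it finished
    input: str = ""               # what went in
    output: str = ""              # what came out
    cost_usd: float = 0.0         # how much this step cost
    status: str = "ok"            # did it succeed or fail?
    parent: Optional[str] = None  # what triggered this step

    @property
    def duration(self) -> float:
        return self.end - self.start

# One trace for "What's your refund policy?" broken into spans:
spans = [
    Span("handle_request", 0.0, 2.0, input="What's your refund policy?"),
    Span("search_knowledge_base", 0.1, 0.6, parent="handle_request"),
    Span("generate_response", 0.7, 1.9, cost_usd=0.002,
         parent="handle_request"),
]
total = max(s.end for s in spans) - min(s.start for s in spans)
print(f"trace total: {total:.1f}s across {len(spans)} spans")
```

Notice the gaps between child spans (0.6s to 0.7s here): in real traces those gaps are often where hidden latency lives.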
Why This Information Is Valuable
Debugging Problems
Scenario: Users report the AI is giving wrong answers about shipping.
| Without Traces | With Traces |
|---|---|
| You guess which part is broken | Filter for “shipping” questions |
| You test different fixes randomly | See that it’s finding old documents from 2022 |
| Takes days to find the problem | Problem identified in 5 minutes: outdated data |
Fix: Update the knowledge base with current shipping information.
Improving Performance
Scenario: The AI feels slow to users. Trace analysis shows that most of each interaction’s time is spent loading context, not generating the response.
Solution: Focus on speeding up context loading, the actual problem. Optimizing the AI model wouldn’t help much.
Controlling Costs
Scenario: AI costs are higher than expected. Trace analysis shows that a small number of very long responses account for most of the spend.
Insight: Very long responses are driving costs. You can:
- Set response length limits
- Warn users about expensive queries
- Optimize prompts to be more concise
Understanding Usage Patterns
Trace analysis reveals how your system is actually used: how many queries arrive per day, which query types are most common, and when usage peaks.
What You Can Do with Traces
View Individual Traces
See exactly what happened in any interaction:
- Click on any trace to open it
- See the step-by-step breakdown of all operations
- View inputs and outputs at each step
- Check which documents were retrieved and used
- Identify what went wrong (if anything)
Filter and Search
Find specific traces that need attention:
- Show me all errors from last week
- Find traces that cost more than $0.10
- Show slow responses (over 5 seconds)
- Find traces for a specific user
- See traces using GPT-4 vs Claude
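The filter queries above are easiest to see as predicates over trace records. This sketch uses plain Python over a list of dicts; the keys (`status`, `cost`, `latency_s`, `user`) are invented for illustration.

```python
# Hedged sketch: trace filters as plain predicates over trace dicts.
traces = [
    {"id": 1, "status": "error", "cost": 0.02, "latency_s": 1.2, "user": "a"},
    {"id": 2, "status": "ok",    "cost": 0.15, "latency_s": 6.3, "user": "b"},
    {"id": 3, "status": "ok",    "cost": 0.01, "latency_s": 0.8, "user": "a"},
]

errors     = [t for t in traces if t["status"] == "error"]     # all errors
expensive  = [t for t in traces if t["cost"] > 0.10]           # over $0.10
slow       = [t for t in traces if t["latency_s"] > 5]         # over 5 seconds
for_user_a = [t for t in traces if t["user"] == "a"]           # specific user

print([t["id"] for t in errors], [t["id"] for t in expensive])
```

Observability platforms expose the same filters through a UI or query language, but the mental model is identical: each filter narrows a set of structured records.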
Compare Performance
Make before-and-after comparisons:
- Did the new model improve quality?
- Are responses faster after optimization?
- Did costs go up or down?
- Is the new version better?
Track Trends
Analyze patterns over time:
- Are errors increasing?
- Are we getting faster or slower?
- Are costs rising?
- Is quality improving?
Practical Examples
Finding Why Something Failed
Problem: A user reports “the AI said it doesn’t know, but the answer is definitely in our documentation.”
Search for the user’s query in traces
Locate the specific interaction by searching for the user’s query or filtering by timestamp.
Look at the Retrieved Documents section
Check which documents were found and used during the search.
Identify the mismatch
See that the search returned different documents than expected: the right documentation wasn’t retrieved.
Optimizing for Cost
Observation: AI costs jumped 40% this month.
Analyze the findings
Key discoveries:
- 5% of queries generate 60% of costs
- These are all “write a detailed report” type queries
- They generate 1000+ word responses
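A finding like “5% of queries generate 60% of costs” comes from a simple cost-concentration check: sort traces by cost and measure what share of spend the top slice accounts for. The numbers below are made up for illustration.

```python
# Sketch of a cost-concentration check: what share of total spend
# comes from the top 5% most expensive queries? (Invented costs.)
costs = sorted([0.01] * 19 + [0.50], reverse=True)  # 20 queries, one outlier
top_n = max(1, int(len(costs) * 0.05))              # top 5% = 1 query here
share = sum(costs[:top_n]) / sum(costs)
print(f"top 5% of queries account for {share:.0%} of cost")
```

When the top slice dominates like this, targeting just those queries (length limits, prompt changes) moves the bill far more than broad optimizations.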
Improving Response Quality
Goal: Reduce errors in product recommendation questions.
Filter for product recommendation traces with errors
Focus on failed interactions in the product recommendation category.
Review what went wrong in each case
Examine the inputs, outputs, and retrieved information for each error.
Diagnose the root cause
Retrieval is finding the relevant products but isn’t surfacing their differences clearly enough for the AI to distinguish them.
Best Practices
Add Helpful Labels
Tag your traces with useful information:
- User type (free vs. paid customer)
- Feature name (chatbot, search, recommendations)
- Environment (production, staging, testing)
- Version number (v1.0, v2.0)
This lets you:
- Filter production issues from test issues
- Compare performance across features
- Identify problems affecting specific user types
- Track improvements across versions
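A tagging scheme like the one above can be as simple as attaching a labels dict at trace creation and slicing by label later. The keys and values here are examples, not a required schema.

```python
# Illustrative tagging scheme: attach labels at trace creation,
# then slice by label when analyzing. Keys/values are examples.
def make_trace(query: str, **labels) -> dict:
    return {"query": query, "labels": labels}

traces = [
    make_trace("What's your refund policy?", user_type="paid",
               feature="chatbot", env="production", version="v2.0"),
    make_trace("test query", user_type="free",
               feature="chatbot", env="testing", version="v2.0"),
]

# Filter production issues from test issues:
prod = [t for t in traces if t["labels"]["env"] == "production"]
print(len(prod), "production trace(s)")
```

The payoff comes later: a question like “did v2.0 regress for paid users in production?” becomes a three-label filter instead of a manual audit.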
Protect Privacy
Don’t capture sensitive information:
- Don’t log passwords
- Be careful with personal information
- Avoid storing customer secrets
- Follow your privacy policies
Safe to capture instead:
- User IDs (not names)
- Session IDs
- Transaction IDs
- General query topics
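One practical way to enforce this is to redact obvious secrets before a query is written into a trace. The sketch below uses two example regex patterns; a real deployment would need patterns matched to your data and privacy policies.

```python
import re

# Hedged sketch of pre-logging redaction. These two patterns are
# examples only, not an exhaustive or production-grade PII filter.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
CARD = re.compile(r"\b(?:\d[ -]?){13,16}\b")  # card-number-like digit runs

def redact(text: str) -> str:
    """Replace sensitive substrings before the text is stored in a trace."""
    text = EMAIL.sub("[EMAIL]", text)
    text = CARD.sub("[CARD]", text)
    return text

print(redact("Refund to jane@example.com, card 4111 1111 1111 1111"))
```

Redacting at capture time is safer than redacting at display time: once a secret is stored in your observability platform, it is already a liability.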
Review Regularly
Make trace review a habit:
- Daily: Quick check for errors
- Weekly: Review slow or expensive traces
- Monthly: Look for trends and patterns
- After changes: Verify improvements worked
Set Up Alerts
Get notified about problems:
- Error rate above 5%
- Average response time over 3 seconds
- Costs spike by 50%
- Specific feature failing
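The thresholds above can be expressed as a simple periodic check. The numbers mirror the bullets; the function and its parameters are made up for illustration, since real alerting lives in your observability platform.

```python
# Sketch of the alert thresholds above as one periodic check.
# Function name and parameters are illustrative, not a real API.
def check_alerts(error_rate: float, avg_latency_s: float,
                 cost_today: float, cost_baseline: float) -> list[str]:
    alerts = []
    if error_rate > 0.05:
        alerts.append("error rate above 5%")
    if avg_latency_s > 3.0:
        alerts.append("average response time over 3 seconds")
    if cost_today > 1.5 * cost_baseline:
        alerts.append("costs spiked by more than 50%")
    return alerts

print(check_alerts(error_rate=0.08, avg_latency_s=2.1,
                   cost_today=90.0, cost_baseline=50.0))
```

Whatever tool you use, the design point is the same: alerts compare an aggregate over recent traces against a fixed threshold, so bad thresholds produce either noise or silence.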
Common Use Cases
Customer Support Escalation
Scenario: An angry customer says “your AI is broken.”
Look up their recent traces
Search for the customer’s interactions using their user ID or session information.
See exactly what happened
Review the complete trace to understand the full context of their interaction.
Identify the specific issue
Pinpoint what went wrong - which step failed, what information was used, what the AI said.
A/B Testing Different Approaches
Scenario: Testing two different prompts.
Set up the test with proper tagging
- Version A: Tag traces with “prompt_v1”
- Version B: Tag traces with “prompt_v2”
Analyze the results
Prompt V1:
- Average quality score: 3.8/5
- Average cost: $0.02
- Average time: 1.5s
Prompt V2:
- Average quality score: 4.2/5
- Average cost: $0.03
- Average time: 2.1s
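Once traces carry the version tag, the comparison is just a group-by and average. The sketch below invents a few tagged trace records to show the shape of that analysis.

```python
from statistics import mean

# Sketch of the A/B comparison: group traces by their prompt tag and
# average each metric. Traces, scores, and costs here are invented.
traces = [
    {"tag": "prompt_v1", "quality": 3.8, "cost": 0.02, "time_s": 1.5},
    {"tag": "prompt_v2", "quality": 4.2, "cost": 0.03, "time_s": 2.1},
    {"tag": "prompt_v1", "quality": 3.8, "cost": 0.02, "time_s": 1.5},
]

def summary(tag: str) -> dict:
    group = [t for t in traces if t["tag"] == tag]
    return {k: round(mean(t[k] for t in group), 2)
            for k in ("quality", "cost", "time_s")}

print(summary("prompt_v1"), summary("prompt_v2"))
```

The numbers alone don’t decide the test: here V2 is higher quality but also slower and more expensive, and that trade-off is a product decision, not a metric.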
Training and Quality Review
Scenario: Training your team on what good AI interactions look like.
Review what made them successful
Identify common patterns: what information was used, how responses were structured, what made them effective.
Review what went wrong
Understand failure patterns: missing information, incorrect retrieval, poor response quality.
Monitoring Your AI System
Key Metrics to Track
Performance Metrics
- Average response time
- Percentage of slow responses (over X seconds)
- Success rate vs. error rate
Cost Metrics
- Average cost per query
- Daily/weekly/monthly total costs
- Cost per user or per feature
Quality Metrics
- User satisfaction scores
- Error rate by category
- Successful completion rate
Usage Metrics
- Number of queries per day
- Most common query types
- Peak usage times
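All four metric groups above reduce to aggregates over raw trace records. This sketch computes a few of them from invented data; the record keys are illustrative.

```python
from collections import Counter
from statistics import mean

# Illustrative computation of key metrics from raw trace records.
traces = [
    {"ok": True,  "latency_s": 1.2, "cost": 0.02, "topic": "refunds"},
    {"ok": True,  "latency_s": 0.9, "cost": 0.01, "topic": "shipping"},
    {"ok": False, "latency_s": 6.0, "cost": 0.04, "topic": "refunds"},
]

avg_latency    = mean(t["latency_s"] for t in traces)          # performance
slow_share     = sum(t["latency_s"] > 5 for t in traces) / len(traces)
success_rate   = sum(t["ok"] for t in traces) / len(traces)
cost_per_query = mean(t["cost"] for t in traces)               # cost
top_topic      = Counter(t["topic"] for t in traces).most_common(1)[0][0]

print(round(success_rate, 2), top_topic)
```

Dashboards compute exactly these aggregates on a rolling window; tracking them over time is what turns one-off debugging into monitoring.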
What “Good” Looks Like
| Healthy AI System | Warning Signs |
|---|---|
| 95%+ success rate | Error rate increasing |
| Stable costs (not increasing unexpectedly) | Costs rising without explanation |
| Response times meeting targets | Response times getting slower |
| High user satisfaction | User complaints increasing |
| Few error spikes | Quality declining over time |
Troubleshooting Common Issues
Traces Aren't Showing Up
Check:
- Is your application properly instrumented?
- Are traces being sent to the right place?
- Any network issues preventing transmission?
- Check your observability platform’s status
Too Many Traces to Review
Solution:
- Use filters to narrow down
- Focus on errors first
- Sample randomly (review 1% of successful traces)
- Set up automated quality checks
Can't Find Specific Traces
Tips:
- Add better labels/tags when creating traces
- Use custom attributes for important context
- Search by user ID, session ID, or date
- Keep retention periods long enough
Getting Started
Week 1: Set Up Tracing
Basic setup:
- Instrument your AI application
- Verify traces are being captured
- Check that key information is included
- Test with a few queries
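If you want a feel for what “instrument your AI application” means before adopting an SDK, a decorator that records one trace per call captures the idea. This is a minimal sketch: `TRACES` stands in for a real observability backend, and the record fields are invented.

```python
import functools
import time

# Minimal instrumentation sketch: one trace record per decorated call.
# TRACES stands in for a real tracing backend; fields are illustrative.
TRACES = []

def traced(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        record = {"name": fn.__name__, "input": args, "status": "ok"}
        start = time.perf_counter()
        try:
            record["output"] = fn(*args, **kwargs)
            return record["output"]
        except Exception as e:
            record["status"] = "error"
            record["error"] = repr(e)
            raise
        finally:
            # Record duration and store the trace whether the call
            # succeeded or failed.
            record["duration_s"] = time.perf_counter() - start
            TRACES.append(record)
    return wrapper

@traced
def answer(question: str) -> str:
    # Placeholder for a real model call.
    return f"(model response to: {question})"

answer("What's your refund policy?")
print(TRACES[0]["name"], TRACES[0]["status"])
```

Production SDKs (OpenTelemetry-style tracing, for example) do this automatically with proper span nesting, sampling, and export, but verifying that records like these appear for a few test queries is exactly the Week 1 goal.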
Week 2: Add Context
Improve trace usefulness:
- Add user IDs
- Tag by feature or environment
- Include version numbers
- Add business context
Week 3: Review and Analyze
Start using your traces:
- Look at recent errors
- Review slow traces
- Check expensive queries
- Identify patterns
