Learn how to evaluate and optimize the quality of document retrieval in your RAG applications.

What Is Retrieval Quality?

Retrieval quality measures how well your vector database returns relevant documents for user queries. High-quality retrieval means:
  • Documents are semantically relevant to the query
  • Top results contain the information needed
  • Relevance scores accurately reflect usefulness
  • Retrieved documents lead to good AI responses

Why Retrieval Quality Matters

Poor retrieval is often the root cause of bad AI outputs:
  • Wrong documents → AI generates incorrect answers
  • Missing documents → AI can’t answer or hallucinates
  • Low relevance → AI struggles to extract useful information
  • Too many documents → Context window wasted on noise

Measuring Retrieval Quality

Relevance Scores

Check the relevance scores for retrieved documents:
  1. View a trace with retrieved documents
  2. Check relevance scores (0.0 to 1.0)
  3. Evaluate score distribution:
    • High (>0.8): Strong semantic match
    • Medium (0.6-0.8): Moderate match
    • Low (<0.6): Weak match, likely not useful
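You can also spot-check score distributions directly against the vector store. A minimal sketch, assuming a LangChain vectorstore whose similarity scores are normalized to 0.0–1.0 (not every store implements this method):
# Returns (document, score) pairs with scores normalized to 0.0-1.0
results = vectorstore.similarity_search_with_relevance_scores(
    "How do I reset my password?", k=5
)

for doc, score in results:
    bucket = "high" if score > 0.8 else "medium" if score >= 0.6 else "low"
    print(f"{bucket:<6} {score:.2f}  {doc.page_content[:60]}")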

User Feedback Correlation

Compare retrieval quality to user feedback:
  1. Filter traces by user feedback (thumbs up/down)
  2. Check average relevance scores for each group
  3. If positive feedback correlates with higher scores, retrieval is working well
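This check can also be scripted against exported trace data. A rough sketch with pandas, assuming a hypothetical CSV export with feedback and avg_relevance_score columns (adjust to your actual export schema):
import pandas as pd

# Hypothetical export format -- column names will vary with your setup
traces = pd.read_csv("traces_export.csv")  # trace_id, feedback, avg_relevance_score

# Average relevance for thumbs-up vs thumbs-down traces
print(traces.groupby("feedback")["avg_relevance_score"].mean())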

Retrieved vs Used

Analyze how many retrieved documents are actually used in responses:
  • Are all retrieved documents relevant?
  • Or does the AI ignore some in the final response?
  • This indicates whether you’re retrieving too many documents
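A quick approximation is a lexical-overlap check between each retrieved chunk and the final response. This is only a heuristic (paraphrased content won’t match), assuming you have the retrieved docs and the response text in hand:
def chunk_used(chunk_text: str, response: str, threshold: float = 0.3) -> bool:
    """Heuristic: fraction of a chunk's words that reappear in the response."""
    chunk_words = set(chunk_text.lower().split())
    if not chunk_words:
        return False
    overlap = chunk_words & set(response.lower().split())
    return len(overlap) / len(chunk_words) >= threshold

used = [doc for doc in docs if chunk_used(doc.page_content, response)]
print(f"{len(used)}/{len(docs)} retrieved documents appear in the response")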

Common Retrieval Issues

Issue: Low Relevance Scores Across the Board

Symptoms: All documents have scores <0.6
Possible causes:
  • Embedding model mismatch (query vs documents)
  • Poor document chunking strategy
  • Documents don’t cover user queries
Solutions:
  • Use same embedding model for queries and documents
  • Improve chunking (better size, overlap)
  • Add more relevant documents to knowledge base

Issue: Right Documents, Wrong Order

Symptoms: Relevant docs have low scores, irrelevant ones rank higher
Possible causes:
  • Distance metric not optimal for your data
  • Embeddings not capturing semantic meaning well
Solutions:
  • Try different distance metrics (cosine vs euclidean vs dot product)
  • Experiment with different embedding models
  • Add metadata filters to narrow results
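How you switch metrics depends on the vector store. As one example (Chroma shown here; other stores expose this differently), the distance function is set when the collection is created:
from langchain.vectorstores import Chroma

# Chroma-specific: "cosine", "l2" (euclidean), or "ip" (dot product)
vectorstore = Chroma.from_documents(
    documents,
    embeddings,
    collection_metadata={"hnsw:space": "cosine"},
)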

Issue: No Relevant Documents Found

Symptoms: Retrieved documents completely miss the topic
Possible causes:
  • Content gap in knowledge base
  • Query phrasing doesn’t match document style
  • Chunk size too small or too large
Solutions:
  • Identify missing topics and add content
  • Implement query expansion or rewriting
  • Adjust chunk size and overlap
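For query expansion and rewriting, one option in LangChain is MultiQueryRetriever, which asks an LLM to generate several rephrasings of the user’s query and merges the results. A minimal sketch, assuming an OpenAI chat model:
from langchain.chat_models import ChatOpenAI
from langchain.retrievers.multi_query import MultiQueryRetriever

# Generates multiple phrasings of each query, retrieves for all of them,
# and de-duplicates the combined results
retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(search_kwargs={"k": 5}),
    llm=ChatOpenAI(temperature=0),
)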

Improving Retrieval Quality

Optimize Embedding Models

Choose the right embedding model for your use case:
from langchain.embeddings import OpenAIEmbeddings

# Try different models
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")  # More dimensions
# vs
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")  # Faster, cheaper
Test retrieval quality with each model in Arcbeam.

Tune Search Parameters

Adjust retrieval parameters.

Number of results (k):
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})  # Retrieve top 5
  • Too few → might miss relevant docs
  • Too many → adds noise to context
Relevance threshold:
retriever = vectorstore.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"score_threshold": 0.7}  # Only return docs with score >0.7
)

Improve Document Chunking

Better chunks lead to better retrieval.

Chunk size:
  • Too small (< 200 tokens): Lacks context
  • Too large (> 1000 tokens): Too generic
  • Optimal: 300-600 tokens
Overlap:
  • Add 10-20% overlap between chunks
  • Ensures important info isn’t split across boundaries
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,      # measured in characters by default, not tokens
    chunk_overlap=100,   # 20% overlap so key info isn’t split across boundaries
    separators=["\n\n", "\n", ". ", " ", ""]  # prefer paragraph, then sentence breaks
)

Add Metadata Filters

Narrow retrieval with metadata:
retriever = vectorstore.as_retriever(
    search_kwargs={
        "k": 5,
        "filter": {"source_type": "documentation"}  # Only docs, not FAQs
    }
)
Reduces noise by limiting the search space.

Use Hybrid Search

Combine vector search with keyword search:
from langchain.retrievers import BM25Retriever, EnsembleRetriever

# Vector retriever
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# Keyword retriever (BM25)
keyword_retriever = BM25Retriever.from_documents(documents)
keyword_retriever.k = 5

# Combine both
ensemble_retriever = EnsembleRetriever(
    retrievers=[vector_retriever, keyword_retriever],
    weights=[0.7, 0.3]  # 70% vector, 30% keyword
)
Best for queries with specific terms or names.
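Usage is the same as for any other retriever, for example:
docs = ensemble_retriever.get_relevant_documents("error E1234 when connecting")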

Analyzing Retrieval Patterns

By Query Type

Group traces by query type to see patterns:
  1. Create collections for different query types (factual, procedural, troubleshooting)
  2. Compare average relevance scores across types
  3. Identify which types have poor retrieval
  4. Improve those specific areas

Over Time

Track retrieval quality trends:
  1. Filter traces by date range
  2. Plot average relevance scores over time
  3. Look for degradation (might indicate stale data)
  4. Correlate with data updates

By Dataset

If using multiple datasets:
  1. Compare retrieval quality across datasets
  2. Identify which datasets perform well
  3. Learn from high-performing datasets
  4. Improve or remove low-performing ones

Best Practices

Monitor Continuously

  • Check retrieval metrics weekly
  • Set up alerts for drops in average relevance
  • Review low-scoring traces regularly

Test Before Deploying

  • Create test collections with known queries
  • Measure retrieval quality on test set
  • Only deploy changes that improve metrics

Balance Precision and Recall

  • Precision: Are retrieved docs relevant?
  • Recall: Are all relevant docs retrieved?
  • Adjust k and threshold to optimize both (see the sketch below)
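To put numbers on this trade-off, score a labeled test set with precision@k and recall@k. A minimal sketch, assuming you maintain a set of known-relevant document IDs per test query:
def precision_recall_at_k(retrieved_ids, relevant_ids, k=5):
    """Precision@k and recall@k for a single query."""
    top_k = retrieved_ids[:k]
    hits = len(set(top_k) & set(relevant_ids))
    precision = hits / k
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall

# Example: 2 of the top 5 results are relevant, out of 3 relevant docs total
p, r = precision_recall_at_k(["d1", "d7", "d3", "d9", "d2"], {"d1", "d3", "d4"})
print(f"precision@5={p:.2f} recall@5={r:.2f}")  # precision@5=0.40 recall@5=0.67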

Document Your Findings

  • Note what works and what doesn’t
  • Track changes to embedding models, chunk size, etc.
  • Share insights with team

Next Steps