While document-level analytics show how individual documents perform, dataset analytics reveal patterns across your entire knowledge base, helping you assess overall quality and spot systemic issues.

What Is a Dataset?

A dataset is a collection of related documents, typically grouped by source. Datasets are created automatically when you sync a data source, grouped by the source field in your vector database.

Common examples:
  • Product Documentation: all docs from your product guide
  • Support Articles: knowledge base articles
  • Engineering Wiki: internal technical docs
  • API Reference: API documentation

Viewing Dataset Analytics

1. Navigate to Datasets: Go to Data → Datasets in the navigation menu.
2. Browse Your Datasets: You’ll see a list of all datasets with their key metrics at a glance.
3. View Detailed Analytics: Click on any dataset to see comprehensive analytics and insights into how it is being used by your AI systems.

Dataset Overview Metrics

Total Documents

How many documents are in this dataset.

Total Retrievals

How many times any document in this dataset was retrieved.
| Dataset Size | Document Count | Meaning |
| --- | --- | --- |
| Large | 1000+ | Comprehensive coverage |
| Medium | 100-1000 | Moderate coverage |
| Small | < 100 | Focused or incomplete coverage |

| Activity Level | Meaning |
| --- | --- |
| High | Core dataset, frequently referenced |
| Medium | Regularly used |
| Low | Rarely needed |

Usage Rate

Percentage of documents that have been retrieved at least once.
| Usage Rate | Meaning |
| --- | --- |
| > 50% | Excellent - Most documents are useful |
| 30-50% | Good - Majority of docs are relevant |
| 10-30% | Fair - Many unused documents |
| < 10% | Poor - Dataset mostly unused |
Low usage rate? Consider these actions:
  • Many documents are irrelevant → Prune the dataset
  • Embeddings are poor → Re-embed with better model
  • Documents aren’t discoverable → Improve metadata/chunking
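To make the metric concrete: usage rate is simply the share of documents retrieved at least once. A minimal sketch in Python (the `retrieval_counts` mapping of document name to retrieval count is a hypothetical input for illustration, not an Arcbeam API):

```python
def usage_rate(retrieval_counts):
    """Percentage of documents retrieved at least once."""
    if not retrieval_counts:
        return 0.0
    used = sum(1 for n in retrieval_counts.values() if n > 0)
    return 100.0 * used / len(retrieval_counts)

# 3 of 4 documents have been retrieved at least once
counts = {"intro.md": 12, "setup.md": 3, "faq.md": 0, "api.md": 1}
print(usage_rate(counts))  # 75.0
```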

Average Relevance (Coming Soon)

Mean relevance score across all retrievals in this dataset.
| Score | Quality |
| --- | --- |
| > 0.75 | Excellent - Strong semantic matches |
| 0.60-0.75 | Good - Mostly relevant |
| 0.45-0.60 | Fair - Some weak matches |
| < 0.45 | Poor - Retrieval quality issues |
Low average relevance? Indicates:
  • Poor embeddings
  • Documents not well-written for retrieval
  • Retrieval parameters too loose

Document Distribution

See how retrievals are distributed across documents to understand which content is most valuable and which might need attention.

Top Documents

The most-retrieved documents in this dataset.
| Information Shown | Description |
| --- | --- |
| Document name | Title of the document |
| Retrieval count | Number of times retrieved |
| Percentage of total retrievals | Share of all retrievals in this dataset |

Unused Documents

Documents with zero retrievals that may need attention.
| Information Shown | Description |
| --- | --- |
| Document name | Title of the document |
| Last updated | When it was last modified |
| Reason | Why it might be unused (if determinable) |
Click on any unused document to read its content, check if it should be deleted or improved, and verify embeddings are working.

Retrieval Distribution Graph (Coming Soon)

Histogram showing how many documents fall into each retrieval count bucket.
| Bucket | Retrieval Count | Example Document Count |
| --- | --- | --- |
| Bucket 0 | 0 retrievals | 45 documents |
| Bucket 1 | 1-10 retrievals | 30 documents |
| Bucket 2 | 11-50 retrievals | 15 documents |
| Bucket 3 | 51-100 retrievals | 8 documents |
| Bucket 4 | 100+ retrievals | 2 documents |
This helps you visualize how many documents are truly unused and whether usage is concentrated or spread evenly.
Highly concentrated retrievals? If top 10 documents account for >80% of retrievals:
| Aspect | Interpretation |
| --- | --- |
| Good | You know which docs are critical |
| Bad | The rest of the dataset might be irrelevant |
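The bucket thresholds and the >80% concentration check can both be sketched in plain Python (the input list of per-document retrieval counts is an assumed shape for illustration, not an Arcbeam API):

```python
def bucket_retrievals(counts):
    """Group per-document retrieval counts into the buckets shown above."""
    buckets = {"0": 0, "1-10": 0, "11-50": 0, "51-100": 0, "100+": 0}
    for n in counts:
        if n == 0:
            buckets["0"] += 1
        elif n <= 10:
            buckets["1-10"] += 1
        elif n <= 50:
            buckets["11-50"] += 1
        elif n <= 100:
            buckets["51-100"] += 1
        else:
            buckets["100+"] += 1
    return buckets

def top_n_share(counts, n=10):
    """Fraction of all retrievals accounted for by the n most-retrieved docs."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return sum(sorted(counts, reverse=True)[:n]) / total

counts = [0, 0, 3, 7, 12, 48, 60, 150]
print(bucket_retrievals(counts))
# {'0': 2, '1-10': 2, '11-50': 2, '51-100': 1, '100+': 1}
print(top_n_share(counts, n=2))  # 0.75
```

If `top_n_share(counts, n=10)` exceeds 0.8, retrievals are highly concentrated in the sense described above.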
A time-series graph shows retrieval activity for this dataset, revealing how often it gets pulled into your AI system over time.
| Trend Type | What It Means |
| --- | --- |
| Growth Trends | Increasing usage as more queries come in |
| Declining Trends | Dataset becoming less relevant over time |
| Usage Spikes | Sudden interest in this topic or content area |
| Seasonal Patterns | Certain times of year show predictable patterns |
Example: Tax documentation dataset spikes in March-April during tax season.

Quality Indicators (Coming Soon)

User Feedback Correlation

How users rate traces that used documents from this dataset:
| Metric | Description |
| --- | --- |
| Thumbs up count | Positive feedback on traces using this dataset |
| Thumbs down count | Negative feedback |
| Satisfaction rate | Percentage positive |

Coverage Score

What percentage of queries in your traces find relevant documents (relevance > 0.7) from this dataset.
| Coverage Level | Percentage | Meaning |
| --- | --- | --- |
| High | >70% | Dataset answers most questions in its domain |
| Medium | 40-70% | Some gaps exist |
| Low | <40% | Significant gaps, many queries unanswered |
Low satisfaction? This dataset may:
  • Contain outdated information
  • Have poor quality documents
  • Not cover topics users expect
Example: Support articles dataset has 45% coverage → 55% of support queries find no good documents.
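As a rough sketch of the arithmetic: coverage takes the best relevance score each query achieved against this dataset and counts how many clear the 0.7 threshold (the input list is hypothetical; this is an illustration, not Arcbeam's exact implementation):

```python
def coverage_score(best_scores, threshold=0.7):
    """Percentage of queries whose best match in this dataset
    clears the relevance threshold."""
    if not best_scores:
        return 0.0
    covered = sum(1 for s in best_scores if s > threshold)
    return 100.0 * covered / len(best_scores)

# Two of four queries found a document above 0.7
print(coverage_score([0.9, 0.5, 0.8, 0.2]))  # 50.0
```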

Comparing Datasets

View multiple datasets side-by-side to make informed decisions about where to focus your efforts.
| Comparison Metric | What It Reveals |
| --- | --- |
| Usage Rates | Which datasets are most utilized by your AI system |
| Average Relevance | Which have best quality and strongest semantic matches |
| User Satisfaction | Which lead to good responses and positive feedback |
| Growth Trends | Which are growing or declining in importance |
Use comparison view to:
  • Prioritize which datasets to improve
  • Allocate resources (focus on high-usage, low-quality datasets)
  • Identify which datasets can be archived or removed

Use Cases

Identify Low-Quality Datasets

Goal: Find datasets that need improvement.
1. Sort by Average Relevance: Sort datasets by Average Relevance (low to high) to surface the lowest-quality datasets.
2. Check Bottom Datasets: Review the bottom 3 datasets and examine sample documents from each.
3. Determine Action: Decide what to do based on the root cause:
  • Documents are poorly written → Rewrite
  • Embeddings are bad → Re-embed
  • Dataset is irrelevant → Archive or remove
Result: Higher overall retrieval quality across your system.

Prioritize Dataset Updates

Goal: Focus updates on high-impact datasets.
1. Sort by Total Retrievals: Sort by Total Retrievals (high to low) to identify your most-used datasets.
2. Check Last Updated Date: Review the “Last Updated” date for top datasets.
3. Prioritize Old, High-Traffic Datasets: Focus your efforts on updating old, high-traffic datasets first; deprioritize low-traffic datasets.
Result: Maximum impact from limited resources.

Find Coverage Gaps

Goal: Discover topics where you need more content.
1. Identify Low Coverage Datasets: Look at datasets with low coverage scores to find areas with content gaps.
2. Analyze Failed Retrievals: Check traces that found no relevant documents and group them by topic or query type.
3. Fill the Gaps: Identify missing content areas and add new documents to fill those gaps.
Result: Better coverage, fewer unanswered queries.

Measure Dataset Improvement

Goal: Track progress after improving a dataset.
1. Record Baseline Metrics: Document current metrics (usage rate, average relevance, and user satisfaction).
2. Update and Re-sync: Update documents in the dataset and re-sync your data source.
3. Wait for Data: Wait 2-4 weeks for enough usage data to accumulate.
4. Compare Results: Check metrics again and compare before/after to measure improvement.
Result: Data-driven proof of improvement.
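The before/after comparison in the final step is just a per-metric delta. A small sketch (the metric names are illustrative, matching the baseline you recorded):

```python
def compare_metrics(baseline, current):
    """Per-metric change between a baseline snapshot and the latest one."""
    return {name: round(current[name] - baseline[name], 3)
            for name in baseline if name in current}

before = {"usage_rate": 42.0, "avg_relevance": 0.68}
after = {"usage_rate": 55.0, "avg_relevance": 0.74}
print(compare_metrics(before, after))
# {'usage_rate': 13.0, 'avg_relevance': 0.06}
```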

Retire Unused Datasets

Goal: Clean up datasets no one uses.
1. Filter Low Usage: Filter to datasets with <5% usage rate over 90 days.
2. Review Content: Review what’s in these datasets to understand why they’re unused.
3. Determine Root Cause: Decide whether the dataset is truly irrelevant or just poorly embedded.
4. Clean Up: Archive or delete unused datasets to keep your vector database lean.
Result: Faster retrievals, reduced storage costs.

Dataset Health Score

Arcbeam calculates an overall health score for each dataset based on multiple factors:

Usage Rate

Higher is better - more documents being retrieved

Average Relevance

Higher is better - stronger semantic matches

User Satisfaction

Higher is better - positive user feedback

Coverage

Higher is better - fewer gaps in content

Recency of Updates

More recent is better - fresh content
| Score | Health |
| --- | --- |
| 80-100 | Excellent - Well-maintained, high-quality dataset |
| 60-79 | Good - Solid dataset, minor improvements possible |
| 40-59 | Fair - Needs attention, several issues |
| < 40 | Poor - Major issues, requires immediate work |
Use health score to:
  • Quickly assess all datasets at a glance
  • Prioritize which datasets need work
  • Track improvements over time
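Arcbeam's exact formula isn't documented here, but a weighted blend of the five factors gives the flavor. The weights below are hypothetical, chosen only to illustrate, and each input is assumed to be normalized to a 0-1 scale:

```python
def health_score(metrics, weights=None):
    """Illustrative 0-100 health score; the weights are hypothetical,
    not Arcbeam's actual formula."""
    if weights is None:
        weights = {"usage_rate": 0.25, "avg_relevance": 0.25,
                   "satisfaction": 0.20, "coverage": 0.20, "recency": 0.10}
    # Each metric is expected on a 0-1 scale; missing metrics count as 0.
    return round(100 * sum(weights[k] * metrics.get(k, 0.0) for k in weights), 1)

m = {"usage_rate": 0.42, "avg_relevance": 0.68, "satisfaction": 0.80,
     "coverage": 0.45, "recency": 0.90}
print(health_score(m))  # 61.5, i.e. the "Good" band
```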

Setting Goals

Set improvement targets for your datasets to drive measurable progress.

Example Goals

Product Documentation Dataset

Current State:
  • 42% usage rate
  • 0.68 avg relevance
Target:
  • 60% usage rate
  • 0.75 avg relevance
Action Plan:
  • Update top 20 docs
  • Remove 15 unused docs
  • Re-embed all documents

Support Articles Dataset

Current State:
  • 35% user satisfaction
Target:
  • 70% user satisfaction
Action Plan:
  • Rewrite top 10 most-used articles
  • Add 20 new articles for gaps

Best Practices

Review Dataset Health Monthly

Set a recurring task to monitor and improve your datasets.
| Task | Description |
| --- | --- |
| Check health scores | Review health scores for all datasets |
| Investigate drops | Look into any scores that decreased |
| Celebrate improvements | Acknowledge progress and wins |

Focus on High-Usage Datasets First

Limited time? Prioritize based on this decision matrix:
| Usage Level | Quality Level | Action Priority |
| --- | --- | --- |
| High | Low | Fix these first - maximum impact |
| High | High | Maintain current quality |
| Low | Low | Archive or improve later |
| Low | High | Monitor for future relevance |

Track Metrics Over Time

Create a spreadsheet to monitor trends and patterns.
| What to Track | Why It Matters |
| --- | --- |
| Record key metrics monthly | Establish baseline and track progress |
| Track trends (improving, stable, declining) | Identify which datasets need attention |
| Identify seasonal patterns | Plan for predictable usage spikes |

Correlate with Business Goals

Align dataset priorities with business needs for maximum value.
| Business Situation | Dataset Focus |
| --- | --- |
| Launching new product | Ensure product docs dataset is excellent |
| Customer support issues | Focus on support articles dataset |
| Onboarding problems | Improve getting-started dataset |
| Feature adoption low | Enhance feature documentation dataset |

Re-embed Periodically

Every 6-12 months, refresh your embeddings to maintain quality.
| Action | Benefit |
| --- | --- |
| Re-embed datasets with latest embedding models | Newer models often improve retrieval quality |
| Track if average relevance increases | Measure ROI of re-embedding effort |
| Compare before/after metrics | Validate improvement and inform future decisions |

Next Steps