Data Sources - Arcbeam Documentation

Data sources are where your AI gets the information it needs to answer questions and help users. Think of them as the AI’s library of knowledge.

Just like a person needs to learn information before they can answer questions, your AI needs access to information. Data sources are the places where this information lives.

Understanding Data Sources

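Company Documentation

Product Manuals
Internal Wikis
FAQs
Policy Documents

Customer Information

Support Tickets
CRM Data
Purchase History
Customer Preferences

External Information

Real-time APIs
Public Databases
Third-party Services
News & Updates

File Storage

PDF Documents
Office Files
Cloud Storage
File Servers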

Why Data Sources Matter

Your AI is Only as Good as its Data

The quality of your AI’s responses directly depends on the quality of the information you give it.

Accurate Information

Your AI gives correct answers that users can trust

Up-to-date Information

Your AI doesn’t provide outdated information

Relevant Information

Your AI stays on topic and focused

Complete Information

Your AI can answer more questions comprehensively

Example: If your AI helps customers with product questions but your product documentation is two years old, it will give outdated answers about features and prices.

Different Sources Enable Different Features

The type of data source determines what your AI can do:

Documentation

Answer questions about your products or services with comprehensive knowledge base access

Real-time APIs

Provide current information like order status, weather, or live inventory

Customer Data

Personalize responses based on user history, preferences, and past interactions

File Systems

Search through and summarize large documents stored across your organization

What is a Vector Database?

Vector databases are specialized systems designed for storing and searching semantic information. They’re the backbone of modern RAG (Retrieval-Augmented Generation) systems, enabling AI applications to find relevant information based on meaning rather than just keywords.

Why Traditional Databases Don’t Work for AI

Traditional databases excel at exact matches—finding “iPhone 15” when you search for “iPhone 15”. But they struggle with semantic searches like:

“smartphone with the best camera” (should find iPhone 15, Pixel 8, etc.)
“how do I reset my password?” (should find password reset documentation)
“troubleshooting connection issues” (should find network, wifi, and connectivity docs)

The Problem: AI needs to understand meaning and context, not just match words.

How Vector Databases Work

Vector databases solve this by converting text into numerical representations that capture semantic meaning:

Chunking Documents

Your documents are broken into smaller, meaningful pieces (typically 100-500 words). This could be paragraphs, sections, or logical units of information.

Why chunking matters: Smaller chunks give more precise retrieval. If someone asks about “pricing”, you want the pricing section, not the entire 50-page manual.
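As a rough illustration of chunking, here is a minimal sketch that splits text by word count. The function name `chunk_words`, the default sizes, and the small overlap between neighboring chunks are assumptions made for this example; real pipelines often split on paragraphs or headings instead.

```python
def chunk_words(text: str, max_words: int = 200, overlap: int = 20) -> list[str]:
    """Split text into chunks of at most max_words words.

    Each chunk repeats the last `overlap` words of the previous one
    (an assumption for this sketch) so that a sentence cut at a
    boundary still appears whole in one of the chunks.
    """
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once this chunk has reached the end of the text
        if start + max_words >= len(words):
            break
    return chunks
```

With `max_words=4` and `overlap=1`, a ten-word text produces three chunks, each starting with the last word of the one before it.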

Creating Embeddings

Each chunk is converted into a vector embedding: a list of numbers (often between 384 and 3,072 dimensions, depending on the model) that represents its semantic meaning.

The magic: Similar concepts have similar vectors. “refund policy” and “return process” will have vectors close together in mathematical space, even though they use different words.

Embeddings are created using machine learning models like OpenAI’s text-embedding-3-large or open-source models like all-MiniLM-L6-v2. These models have learned semantic relationships from billions of documents.
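To make "close together in mathematical space" concrete, the sketch below compares vectors with cosine similarity. The three-dimensional vectors are made up for illustration; real embeddings come from an embedding model and have far more dimensions.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors, NOT real model output
refund_policy = [0.9, 0.1, 0.2]
return_process = [0.8, 0.2, 0.3]
office_hours = [0.1, 0.9, 0.1]

related = cosine_similarity(refund_policy, return_process)
unrelated = cosine_similarity(refund_policy, office_hours)
```

Here `related` comes out much higher than `unrelated`, which is the behavior a good embedding model aims for with phrases like “refund policy” and “return process”.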

Indexing for Fast Retrieval

Vector databases use approximate nearest-neighbor index structures (such as HNSW or IVF, available in libraries like FAISS) to organize vectors for fast similarity search, even across millions of documents.

Performance: Finding the top 10 relevant documents from 10 million can happen in milliseconds.

Semantic Search

When someone asks a question:

The question is converted to a vector using the same embedding model
The database finds vectors closest to the question vector (using cosine similarity or other distance metrics)
The corresponding text chunks are retrieved and ranked by relevance

Result: You get semantically relevant content, not just keyword matches.
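The three steps above can be sketched as an exhaustive search over a tiny in-memory index. This is only a teaching example: the vectors are invented, the `search` function is hypothetical, and a real vector database would use an approximate index rather than scoring every chunk.

```python
import heapq
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0

def search(query_vector, index, k=3):
    """Return the k (score, text) pairs most similar to the query.

    `index` is a list of (chunk_text, vector) pairs.
    """
    scored = ((cosine_similarity(query_vector, vec), text) for text, vec in index)
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])

# Hypothetical toy index; real vectors would come from an embedding model
index = [
    ("How to reset your password", [0.9, 0.1, 0.0]),
    ("Shipping times by region", [0.0, 0.2, 0.9]),
    ("Recovering a locked account", [0.6, 0.6, 0.1]),
]
query = [0.85, 0.2, 0.05]  # stands in for the embedding of "I forgot my login"
results = search(query, index, k=2)
```

The query shares no words with the password chunk, yet that chunk ranks first because its vector points in nearly the same direction.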

Retrieval-Augmented Generation (RAG)

The retrieved chunks are provided as context to your AI model (like GPT-4 or Claude), which uses them to generate an informed, accurate response.

Why this works: The AI can reference specific information from your knowledge base instead of relying solely on its training data.
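One simple way to pass retrieved chunks to a model is to place them in the prompt ahead of the question. The sketch below only builds the prompt text; the `build_prompt` name, the dictionary keys, and the instruction wording are assumptions for this example, and the call to the model itself is left out.

```python
def build_prompt(question: str, chunks: list[dict]) -> str:
    """Combine retrieved chunks and the user's question into one prompt.

    Each chunk is a dict with "text" and "source" keys (hypothetical shape).
    """
    context = "\n\n".join(
        f"[{i}] (source: {chunk['source']})\n{chunk['text']}"
        for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

prompt = build_prompt(
    "Can I return an opened item?",
    [{"text": "Opened items can be returned within 30 days.", "source": "returns.md"}],
)
```

Numbering the chunks and including their sources makes it easier to ask the model to cite where an answer came from.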

What Gets Stored

For each document chunk in a vector database, you typically store:

Vector Embedding

The numerical representation (e.g., 1536 floating-point numbers) capturing the semantic meaning

Original Text

The actual content to be retrieved and shown to users or passed to the AI

Source Metadata

Which document, page, section, or URL it came from for attribution

Timestamps

When the content was added, last updated, or verified for freshness tracking

Custom Metadata

Categories, departments, access levels, product names, or any filterable attributes

Chunk Position

Information about where this chunk appears in the original document for context
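Put together, one stored chunk might look like the record below. This is a sketch of a possible shape, not the schema of any particular vector database; every field name here is an assumption.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ChunkRecord:
    embedding: list[float]   # vector embedding of the chunk text
    text: str                # original text passed to the AI or shown to users
    source: str              # document, page, section, or URL for attribution
    chunk_index: int         # position of this chunk within the source document
    updated_at: datetime     # timestamp used for freshness tracking
    metadata: dict[str, str] = field(default_factory=dict)  # filterable attributes

record = ChunkRecord(
    embedding=[0.12, -0.03, 0.88],  # shortened; real embeddings are much longer
    text="Opened items can be returned within 30 days.",
    source="policies/returns.md#opened-items",
    chunk_index=3,
    updated_at=datetime(2024, 3, 1, tzinfo=timezone.utc),
    metadata={"department": "support", "access_level": "public"},
)
```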

Vector Database vs. Traditional Database

Feature	Traditional Database	Vector Database
Search Type	Exact keyword matching	Semantic similarity search
Query	“password reset” finds only the exact phrase	“I forgot my login” finds password reset docs
Storage	Structured data in rows/columns	Vectors + unstructured text + metadata
Primary Use	Transactions, records, business data	AI/ML applications, semantic search, recommendations
Performance Metric	Query speed, transactions/sec	Similarity search speed, recall@k
Examples	PostgreSQL, MySQL, MongoDB	Pinecone, Weaviate, Qdrant, Chroma, pgvector

Hybrid Search: Best of Both Worlds

Many modern applications combine vector search with traditional filters for even better results.

Example scenario: You want to find documents about “troubleshooting network issues” but only from:

The “Enterprise Router” product line (not all products)
The Support department (not Sales or Marketing docs)
Documents updated after January 1, 2024 (only recent information)

With hybrid search, you get:

Semantic search finding documents about network problems, connectivity issues, wifi troubleshooting, etc. (even if they don’t use the exact words “network issues”)
Traditional filters narrowing results to exactly the product, department, and date range you need

This gives you semantic understanding plus the precision of traditional database filters—the best of both approaches.
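The scenario above can be sketched by applying the exact-match filters first and ranking what remains by similarity. The record fields, the `hybrid_search` name, and the two-dimensional vectors are assumptions for this example; real databases may apply filters before, during, or after the vector search.

```python
import math
from datetime import date

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0

def hybrid_search(query_vector, records, product, department, updated_after, k=5):
    """Filter by metadata, then rank the remaining records by similarity."""
    candidates = [
        r for r in records
        if r["product"] == product
        and r["department"] == department
        and r["updated"] > updated_after
    ]
    candidates.sort(key=lambda r: cosine_similarity(query_vector, r["vector"]), reverse=True)
    return candidates[:k]

# Hypothetical records with made-up vectors
records = [
    {"title": "Fixing dropped connections", "product": "Enterprise Router",
     "department": "Support", "updated": date(2024, 3, 10), "vector": [0.9, 0.1]},
    {"title": "Old wifi guide", "product": "Enterprise Router",
     "department": "Support", "updated": date(2023, 11, 2), "vector": [0.95, 0.05]},
    {"title": "Router pricing sheet", "product": "Enterprise Router",
     "department": "Sales", "updated": date(2024, 2, 1), "vector": [0.9, 0.2]},
    {"title": "Warranty terms", "product": "Enterprise Router",
     "department": "Support", "updated": date(2024, 5, 20), "vector": [0.1, 0.9]},
]
hits = hybrid_search([1.0, 0.0], records, "Enterprise Router", "Support", date(2024, 1, 1))
```

The old wifi guide is the closest semantic match but is dropped by the date filter, which is exactly the precision the filters are meant to add.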

Connecting Your Data

Identify what information your AI needs

Ask yourself these questions:

What questions will users ask?
What knowledge does the AI need to answer those questions?
Where does that knowledge currently live?

Choose where to store it

For most AI applications, you’ll use a vector database

Popular options:

Pinecone

Weaviate

Qdrant

pgvector

Prepare your information

Gather your documents, FAQs, and other content
Make sure the information is accurate and current
Organize it in a logical way

Load it into the database

Your documents get processed and stored through an automated process. This can be done once or set up to update regularly.

Test and refine

Testing: Ask test questions to see if the AI finds the right information
Adjusting: Adjust what information is included based on results
Monitoring: Monitor which content gets used most

Keeping Your Data Fresh

Why Updates Matter

Imagine your AI is still retrieving last year’s pricing after you’ve updated your prices. Customers would get wrong information, leading to confusion and potentially lost sales.

How to Keep Data Current

Set update schedules:

Daily Updates

For information that changes frequently (like inventory)

Weekly Updates

For moderately changing content (like blog posts)

Monthly Updates

For stable content (like company policies)

Track versions:

Know which version of each document is currently being used
Keep a history of changes
Be able to roll back if needed

Monitor for staleness:

Get alerts when information hasn’t been updated in a while
Regularly review what’s in your database
Remove outdated content
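A staleness check can be as simple as comparing each document's last update against a maximum age for its category. The sketch below reuses the daily, weekly, and monthly schedules above; the category names, the 30-day reading of "monthly", and the document shape are assumptions.

```python
from datetime import datetime, timedelta, timezone

# Maximum allowed age per content category (hypothetical categories)
MAX_AGE = {
    "inventory": timedelta(days=1),   # daily updates
    "blog": timedelta(days=7),        # weekly updates
    "policy": timedelta(days=30),     # monthly updates
}

def find_stale(documents, now=None):
    """Return the names of documents older than their category allows."""
    now = now or datetime.now(timezone.utc)
    return [
        doc["name"]
        for doc in documents
        if now - doc["updated_at"] > MAX_AGE[doc["category"]]
    ]

documents = [
    {"name": "stock-levels", "category": "inventory",
     "updated_at": datetime(2024, 6, 14, 12, 0, tzinfo=timezone.utc)},
    {"name": "launch-post", "category": "blog",
     "updated_at": datetime(2024, 6, 1, tzinfo=timezone.utc)},
    {"name": "refund-policy", "category": "policy",
     "updated_at": datetime(2024, 6, 1, tzinfo=timezone.utc)},
]
stale = find_stale(documents, now=datetime(2024, 6, 15, tzinfo=timezone.utc))
```

A scheduled job could run a check like this and send an alert whenever the returned list is not empty.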

Automate when possible:

Auto Imports

Set up automatic imports from your documentation system

Scheduled Jobs

Schedule regular refresh jobs

Auto Sync

Sync with source systems automatically

Data Quality Best Practices

Keep Your Sources Organized

Use clear naming:

Good Example

Product_Manual_2024.pdf

Bad Example

doc_final_v2_FINAL.pdf

Include dates in file names when relevant
Use consistent naming conventions

Add helpful tags and categories:

Tag by product, department, or topic
Include categories like “support”, “sales”, “technical”
Add keywords that users might search for

Track where information came from:

Note the original source file or system
Include the author or owner
Record when it was last updated

Maintain Quality Standards

Regular Audits

Review your content quarterly
Check for outdated information
Verify accuracy of key facts

User Feedback

Pay attention to when users say “that’s not right”
Track which answers get poor ratings
Use this to identify content that needs updating

Remove What Doesn't Work

Delete duplicate content
Remove irrelevant information
Archive old versions
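Exact and near-exact duplicates can be caught by hashing a normalized copy of each chunk. This sketch only handles differences in case and whitespace; the `deduplicate` helper is hypothetical, and detecting reworded duplicates would need a similarity-based approach instead.

```python
import hashlib

def normalize(text: str) -> str:
    """Lowercase the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())

def deduplicate(chunks: list[str]) -> list[str]:
    """Keep the first occurrence of each chunk, ignoring case and spacing."""
    seen = set()
    unique = []
    for chunk in chunks:
        digest = hashlib.sha256(normalize(chunk).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(chunk)
    return unique
```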

Monitoring Your Data Sources

What to Track

How fresh is your data?

When was each piece of content last updated?
How long since your last sync or refresh?
Are there warnings about stale content?

How well is it working?

User Success

Are users finding what they need?

Document Usage

Which documents get used most?

Coverage Gaps

Where are the gaps in coverage?
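If you log which document each retrieved chunk came from, usage questions like these can be answered with a simple count. The log format and the `usage_report` function are assumptions for this sketch.

```python
from collections import Counter

def usage_report(retrieval_log, all_documents):
    """Return (documents ranked by retrieval count, documents never retrieved)."""
    counts = Counter(entry["document"] for entry in retrieval_log)
    never_used = sorted(set(all_documents) - set(counts))
    return counts.most_common(), never_used

# Hypothetical retrieval log: one entry per chunk returned to the AI
retrieval_log = [
    {"query": "reset password", "document": "account-help.md"},
    {"query": "forgot login", "document": "account-help.md"},
    {"query": "return an item", "document": "returns.md"},
]
ranked, never_used = usage_report(
    retrieval_log, ["account-help.md", "returns.md", "warranty.md"]
)
```

Documents that are never retrieved may be irrelevant, poorly chunked, or simply about topics users have not asked about yet, so treat the list as a prompt for review rather than a deletion list.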

What’s it costing?

Storage costs for your database
Costs for processing and updating content
API costs for external data sources

Signs of Problems

Your AI gives outdated answers

Problem: Your data isn’t being updated frequently enough
Solution: Increase update frequency and set up automated syncs

Your AI can't answer basic questions

Problem: You’re missing important content
Solution: Review your data sources and add missing documentation

Your AI often says 'I don't know'

Problem: You need more comprehensive coverage
Solution: Expand your knowledge base with additional sources

Responses are slow

Problem: Your database might need optimization
Solution: Review database indexing and consider scaling options

Data Security and Privacy

Protecting Sensitive Information

Control who has access:

Permissions

Not everyone should see all data - set up proper permissions

Audit Logs

Log who accesses what for security tracking

Role-Based Access

Define different access levels for different roles
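One way to apply role-based access to retrieval is to drop chunks the current user is not allowed to see before they reach the AI. The roles, their ordering, and the `min_role` field below are assumptions for this sketch; a production system would usually enforce this inside the database query as well.

```python
# Hypothetical roles, ordered from least to most privileged
ROLE_LEVEL = {"public": 0, "employee": 1, "admin": 2}

def visible_chunks(chunks: list[dict], role: str) -> list[dict]:
    """Return only the chunks whose minimum required role the user meets."""
    level = ROLE_LEVEL[role]
    return [c for c in chunks if ROLE_LEVEL[c["min_role"]] <= level]

chunks = [
    {"text": "Our return window is 30 days.", "min_role": "public"},
    {"text": "Escalation contacts for the support team.", "min_role": "employee"},
    {"text": "Database credentials rotation schedule.", "min_role": "admin"},
]
employee_view = visible_chunks(chunks, "employee")
```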

Handle personal information carefully:

Be aware of privacy regulations (GDPR, HIPAA, etc.)

Remove or anonymize personal details when possible
Have clear policies for data retention
Document your data handling procedures

Compliance requirements to consider:

GDPR for EU data
HIPAA for healthcare information
CCPA for California residents
Industry-specific regulations

Keep credentials secure:

No Hardcoded Secrets

Never put passwords or API keys in your documents

Secure Storage

Use secure methods to store access credentials

Regular Rotation

Rotate keys and passwords regularly
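The same rule applies to the code that loads your data: read credentials from the environment or a secrets manager rather than writing them into source files. A minimal sketch follows, in which the variable name `VECTOR_DB_API_KEY` is an assumption.

```python
import os

def load_api_key(name: str = "VECTOR_DB_API_KEY") -> str:
    """Read an API key from an environment variable, failing loudly if absent."""
    key = os.environ.get(name)
    if not key:
        raise RuntimeError(f"environment variable {name} is not set")
    return key

# Usage, once the variable has been set outside the code:
#   api_key = load_api_key()
```

Failing with a clear error is usually better than falling back to a default key, because a missing credential is then noticed at startup instead of during a sync.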

Next Steps

Data Processing

Learn how to prepare your data for AI

Data Lineage

Track where information comes from

Setup Guide

Step-by-step instructions to connect your data

Observability

Monitor how your data is being used
