Data sources are where your AI gets the information it needs to answer questions and help users. Think of them as the AI’s library of knowledge.
Just like a person needs to learn information before they can answer questions, your AI needs access to information. Data sources are the places where this information lives.

Understanding Data Sources

Product Manuals

Internal Wikis

FAQs

Policy Documents

Why Data Sources Matter

Your AI is Only as Good as its Data

The quality of your AI’s responses directly depends on the quality of the information you give it.

Accurate Information

Your AI gives correct answers that users can trust

Up-to-date Information

Your AI doesn’t provide outdated information

Relevant Information

Your AI stays on topic and focused

Complete Information

Your AI can answer more questions comprehensively
Example: If your AI is helping customers with product questions, but your product documentation is from 2 years ago, it will give outdated answers about features and prices.

Different Sources Enable Different Features

The type of data source determines what your AI can do:

Documentation

Answer questions about your products or services with comprehensive knowledge base access

Real-time APIs

Provide current information like order status, weather, or live inventory

Customer Data

Personalize responses based on user history, preferences, and past interactions

File Systems

Search through and summarize large documents stored across your organization

What is a Vector Database?

Vector databases are specialized systems designed for storing and searching semantic information. They’re the backbone of modern RAG (Retrieval-Augmented Generation) systems, enabling AI applications to find relevant information based on meaning rather than just keywords.

Why Traditional Databases Don’t Work for AI

Traditional databases excel at exact matches—finding “iPhone 15” when you search for “iPhone 15”. But they struggle with semantic searches like:
  • “smartphone with the best camera” (should find iPhone 15, Pixel 8, etc.)
  • “how do I reset my password?” (should find password reset documentation)
  • “troubleshooting connection issues” (should find network, wifi, and connectivity docs)
The Problem: AI needs to understand meaning and context, not just match words.

How Vector Databases Work

Vector databases solve this by converting text into numerical representations that capture semantic meaning:
1

Chunking Documents

Your documents are broken into smaller, meaningful pieces (typically 100-500 words). This could be paragraphs, sections, or logical units of information. Why chunking matters: smaller chunks give more precise retrieval. If someone asks about “pricing”, you want the pricing section, not the entire 50-page manual.
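The chunking step can be sketched in a few lines of Python. This is an illustrative sketch, not a production splitter; real pipelines often split on sentence or token boundaries and add overlap between chunks:

```python
def chunk_text(text, max_words=120):
    """Split text into chunks of roughly max_words, keeping paragraphs together.

    A paragraph longer than max_words still becomes its own (oversized) chunk;
    production splitters usually handle that case with sentence-level splitting.
    """
    chunks, current = [], []
    for paragraph in text.split("\n\n"):
        words = paragraph.split()
        # Flush the current chunk if adding this paragraph would overflow it.
        if current and len(current) + len(words) > max_words:
            chunks.append(" ".join(current))
            current = []
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    return chunks
```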
2

Creating Embeddings

Each chunk is converted into a vector embedding—a list of numbers (typically 768-1536 dimensions) that represents its semantic meaning. The magic: similar concepts have similar vectors. “refund policy” and “return process” will have vectors close together in mathematical space, even though they use different words.
Embeddings are created using machine learning models like OpenAI’s text-embedding-3-large or open-source models like all-MiniLM-L6-v2. These models have learned semantic relationships from billions of documents.
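To make “similar vectors” concrete, here is a minimal cosine-similarity check on hand-made toy vectors. Real embeddings come from a model and have hundreds of dimensions; these 3-number vectors are invented purely for illustration:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (hypothetical values for illustration).
refund_policy = [0.9, 0.1, 0.2]
return_process = [0.85, 0.15, 0.25]
weather_report = [0.1, 0.9, 0.3]

# Related concepts score high; unrelated ones score much lower.
print(cosine_similarity(refund_policy, return_process))
print(cosine_similarity(refund_policy, weather_report))
```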
3

Indexing for Fast Retrieval

Vector databases use specialized indexing algorithms (like HNSW and IVF, implemented in libraries such as FAISS) to organize vectors for lightning-fast similarity searches—even across millions of documents. Performance: finding the top 10 relevant documents from 10 million can happen in milliseconds.
4

Semantic Search

When someone asks a question:
  1. The question is converted to a vector using the same embedding model
  2. The database finds vectors closest to the question vector (using cosine similarity or other distance metrics)
  3. The corresponding text chunks are retrieved and ranked by relevance
Result: You get semantically relevant content, not just keyword matches.
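The steps above amount to a nearest-neighbor search. A brute-force version (what indexes like HNSW approximate at scale) might look like this sketch; the chunk texts and vectors are invented for illustration:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query_vec, index, k=2):
    """index is a list of (text, vector) pairs; return the k most similar texts."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

index = [
    ("To reset your password, open Settings.", [0.9, 0.1]),
    ("Our refund policy allows returns within 30 days.", [0.2, 0.9]),
    ("If you forgot your login, use the recovery link.", [0.85, 0.2]),
]
query = [0.92, 0.12]  # would come from embedding "how do I reset my password?"
print(top_k(query, index))
```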
5

Retrieval-Augmented Generation (RAG)

The retrieved chunks are provided as context to your AI model (like GPT-4 or Claude), which uses them to generate an informed, accurate response. Why this works: the AI can reference specific information from your knowledge base instead of relying solely on its training data.
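Putting it together, a RAG prompt is just the retrieved chunks prepended to the user’s question. A minimal sketch (the prompt wording here is an assumption, not a fixed format):

```python
def build_rag_prompt(question, retrieved_chunks):
    """Assemble retrieved chunks and the user's question into one model prompt."""
    context = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(retrieved_chunks))
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

prompt = build_rag_prompt(
    "What is the refund window?",
    ["Our refund policy allows returns within 30 days."],
)
print(prompt)
```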

What Gets Stored

For each document chunk in a vector database, you typically store:

Vector Embedding

The numerical representation (e.g., 1536 floating-point numbers) capturing the semantic meaning

Original Text

The actual content to be retrieved and shown to users or passed to the AI

Source Metadata

Which document, page, section, or URL it came from for attribution

Timestamps

When the content was added, last updated, or verified for freshness tracking

Custom Metadata

Categories, departments, access levels, product names, or any filterable attributes

Chunk Position

Information about where this chunk appears in the original document for context
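A stored chunk, then, is simply a record combining all of these fields. A hypothetical example (every value is invented for illustration; field names vary by database):

```python
chunk_record = {
    "id": "returns-policy-chunk-3",        # hypothetical ID scheme
    "embedding": [0.12, -0.07, 0.33],      # truncated; real vectors have 768-1536 dims
    "text": "Refunds are processed within 5 business days of receiving the item.",
    "source": {"document": "Returns_Policy_2024.pdf", "page": 2, "section": "Refunds"},
    "created_at": "2024-03-01T09:00:00Z",  # freshness tracking
    "updated_at": "2024-06-15T12:30:00Z",
    "metadata": {"department": "support", "category": "policy"},
    "chunk_index": 3,                      # position within the source document
}
```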

Vector Database vs. Traditional Database

| Feature | Traditional Database | Vector Database |
| --- | --- | --- |
| Search Type | Exact keyword matching | Semantic similarity search |
| Query | “password reset” finds only the exact phrase | “I forgot my login” finds password reset docs |
| Storage | Structured data in rows/columns | Vectors + unstructured text + metadata |
| Primary Use | Transactions, records, business data | AI/ML applications, semantic search, recommendations |
| Performance Metric | Query speed, transactions/sec | Similarity search speed, recall@k |
| Examples | PostgreSQL, MySQL, MongoDB | Pinecone, Weaviate, Qdrant, Chroma, pgvector |

Hybrid Search: Best of Both Worlds

Many modern applications combine vector search with traditional filters for even better results.
Example scenario: You want to find documents about “troubleshooting network issues” but only from:
  • The “Enterprise Router” product line (not all products)
  • The Support department (not Sales or Marketing docs)
  • Documents updated after January 1, 2024 (only recent information)
With hybrid search, you get:
  • Semantic search finding documents about network problems, connectivity issues, wifi troubleshooting, etc. (even if they don’t use the exact words “network issues”)
  • Traditional filters narrowing results to exactly the product, department, and date range you need
This gives you semantic understanding plus the precision of traditional database filters—the best of both approaches.
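A hybrid query can be sketched as exact filtering followed by similarity ranking. This toy version does the filtering in Python with invented records; real vector databases apply metadata filters inside the index itself:

```python
import math
from datetime import date

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def hybrid_search(query_vec, records, k=3, updated_after=None, **filters):
    """Apply exact metadata filters first, then rank the survivors by similarity."""
    survivors = [
        r for r in records
        if all(r["metadata"].get(key) == value for key, value in filters.items())
        and (updated_after is None or r["updated_at"] >= updated_after)
    ]
    ranked = sorted(survivors, key=lambda r: cosine(query_vec, r["embedding"]), reverse=True)
    return [r["text"] for r in ranked[:k]]

records = [  # invented example records
    {"text": "Router wifi troubleshooting guide", "embedding": [0.9, 0.1],
     "metadata": {"product": "Enterprise Router", "department": "support"},
     "updated_at": date(2024, 3, 1)},
    {"text": "Enterprise Router sales brochure", "embedding": [0.88, 0.12],
     "metadata": {"product": "Enterprise Router", "department": "sales"},
     "updated_at": date(2024, 4, 1)},
    {"text": "Old connectivity FAQ", "embedding": [0.87, 0.13],
     "metadata": {"product": "Enterprise Router", "department": "support"},
     "updated_at": date(2023, 5, 1)},
]

print(hybrid_search([0.92, 0.1], records,
                    product="Enterprise Router", department="support",
                    updated_after=date(2024, 1, 1)))
```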

Connecting Your Data

1

Identify what information your AI needs

Ask yourself these questions:
  • What questions will users ask?
  • What knowledge does the AI need to answer those questions?
  • Where does that knowledge currently live?
2

Choose where to store it

For most AI applications, you’ll use a vector database.
Popular options:

Pinecone

Weaviate

Qdrant

pgvector

3

Prepare your information

  • Gather your documents, FAQs, and other content
  • Make sure the information is accurate and current
  • Organize it in a logical way
4

Load it into the database

Your documents get processed and stored through an automated process. This can be done once or set up to update regularly.
5

Test and refine

  • Testing: Ask test questions to see if the AI finds the right information
  • Adjusting: Adjust what information is included based on results
  • Monitoring: Monitor which content gets used most

Keeping Your Data Fresh

Why Updates Matter

Imagine if your AI was trained on last year’s pricing but you’ve since updated your prices. Customers would get wrong information, leading to confusion and potentially lost sales.

How to Keep Data Current

Set update schedules:

Daily Updates

For information that changes frequently (like inventory)

Weekly Updates

For moderately changing content (like blog posts)

Monthly Updates

For stable content (like company policies)
Track versions:
  • Know which version of each document is currently being used
  • Keep a history of changes
  • Be able to roll back if needed
Monitor for staleness:
  • Get alerts when information hasn’t been updated in a while
  • Regularly review what’s in your database
  • Remove outdated content
Automate when possible:

Auto Imports

Set up automatic imports from your documentation system

Scheduled Jobs

Schedule regular refresh jobs

Auto Sync

Sync with source systems automatically

Data Quality Best Practices

Keep Your Sources Organized

Use clear naming:

Good Example

Product_Manual_2024.pdf

Bad Example

doc_final_v2_FINAL.pdf
  • Include dates in file names when relevant
  • Use consistent naming conventions
Add helpful tags and categories:
  • Tag by product, department, or topic
  • Include categories like “support”, “sales”, “technical”
  • Add keywords that users might search for
Track where information came from:
  • Note the original source file or system
  • Include the author or owner
  • Record when it was last updated

Maintain Quality Standards

Regular Audits

  • Review your content quarterly
  • Check for outdated information
  • Verify accuracy of key facts

User Feedback

  • Pay attention to when users say “that’s not right”
  • Track which answers get poor ratings
  • Use this to identify content that needs updating

Remove What Doesn't Work

  • Delete duplicate content
  • Remove irrelevant information
  • Archive old versions

Monitoring Your Data Sources

What to Track

How fresh is your data?
  • When was each piece of content last updated?
  • How long since your last sync or refresh?
  • Are there warnings about stale content?
How well is it working?

User Success

Are users finding what they need?

Document Usage

Which documents get used most?

Coverage Gaps

Where are the gaps in coverage?
What’s it costing?
  • Storage costs for your database
  • Costs for processing and updating content
  • API costs for external data sources

Signs of Problems

Problem: Your data isn’t being updated frequently enough
Solution: Increase update frequency and set up automated syncs

Problem: You’re missing important content
Solution: Review your data sources and add missing documentation

Problem: You need more comprehensive coverage
Solution: Expand your knowledge base with additional sources

Problem: Your database might need optimization
Solution: Review database indexing and consider scaling options

Data Security and Privacy

Protecting Sensitive Information

Control who has access:

Permissions

Not everyone should see all data; set up proper permissions

Audit Logs

Log who accesses what for security tracking

Role-Based Access

Define different access levels for different roles
Handle personal information carefully:
  • Be aware of privacy regulations (GDPR, HIPAA, etc.)
  • Remove or anonymize personal details when possible
  • Have clear policies for data retention
  • Document your data handling procedures
Compliance requirements to consider:
  • GDPR for EU data
  • HIPAA for healthcare information
  • CCPA for California residents
  • Industry-specific regulations
Keep credentials secure:

No Hardcoded Secrets

Never put passwords or API keys in your documents

Secure Storage

Use secure methods to store access credentials

Regular Rotation

Rotate keys and passwords regularly
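The credential rules above boil down to: read secrets from the environment (or a secrets manager), never from your documents or source code. A minimal sketch; `VECTOR_DB_API_KEY` is a hypothetical variable name:

```python
import os

def load_api_key(var_name="VECTOR_DB_API_KEY"):
    """Fetch a credential from the environment; fail loudly if it is missing."""
    key = os.environ.get(var_name)
    if not key:
        raise RuntimeError(f"Set {var_name} before starting the service")
    return key
```

In production you would typically layer a secrets manager and a rotation policy on top of this.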

Next Steps