Data processing is the work of transforming your documents, files, and information into a format that AI can understand and use effectively. Think of it like organizing a library so people can find exactly what they need.

Why Data Processing Matters

Imagine you have a thousand-page employee handbook as one giant PDF. When someone asks “What’s the vacation policy?”, the AI needs to find just the relevant section about vacations, not the entire handbook.

Breaking Down Information

Large documents get split into smaller, manageable pieces that are easier to search and retrieve

Fast Searching

Organized content allows the AI to find exactly the right information quickly without searching through everything

Smart Organization

Content is structured logically so related information can be found together

Enhanced Searchability

Labels and categories help filter and locate the most relevant information for each question

The Data Processing Pipeline

1. Gathering Your Information

This is where you collect all the information your AI needs to work with.

What you might collect:

Documents

PDF documents like manuals, guides, and reports

Office Files

Word documents, presentations, and spreadsheets

Web Content

Pages from your website, help center, or internal wiki
Things to consider:

Currency

Is this information current and accurate?

Permissions

Do you have permission to use this data?

Quality

Is the quality good enough (not blurry scans, readable text)?
2. Cleaning Your Information

Just like you’d remove coffee stains from a book before scanning it, you need to clean your data.

What gets cleaned:

Page Elements

Page numbers, headers, and footers that aren’t useful for answering questions

Formatting Issues

Extra spaces, line breaks, and special characters that don’t add meaning

Duplicate Content

Repeated information that could confuse the AI or waste storage

Irrelevant Sections

Copyright notices, legal boilerplate, and other non-essential content
Cleaner data means better, more relevant answers from your AI
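
As a rough illustration of the cleaning step, here is a minimal sketch in Python. The function name `clean_text` and the rules it applies (dropping lines that contain only a page marker, collapsing extra whitespace, and skipping repeated paragraphs) are assumptions made for this example, not the behavior of any particular tool. Note that the page-marker rule would also remove a legitimate line consisting only of a number, so real pipelines usually need rules tuned to their own documents.

```python
import re

def clean_text(raw: str) -> str:
    """Drop page-marker lines, collapse extra whitespace, and skip repeated paragraphs."""
    cleaned_paragraphs = []
    seen = set()
    # Treat blank lines as paragraph separators.
    for paragraph in re.split(r"\n\s*\n", raw):
        lines = []
        for line in paragraph.splitlines():
            line = re.sub(r"\s+", " ", line).strip()
            # Skip empty lines and lines that are only a page marker ("12" or "Page 12").
            if not line or re.fullmatch(r"(page\s+)?\d+", line, flags=re.IGNORECASE):
                continue
            lines.append(line)
        text = " ".join(lines)
        key = text.lower()
        # Keep only the first occurrence of each paragraph.
        if text and key not in seen:
            seen.add(key)
            cleaned_paragraphs.append(text)
    return "\n\n".join(cleaned_paragraphs)

# Made-up sample text with a page marker, extra spaces, and a duplicate paragraph.
sample = (
    "Vacation   Policy\nPage 3\n\n"
    "Employees accrue  leave monthly.\n\n"
    "Employees accrue leave monthly.\n\n"
    "4"
)
print(clean_text(sample))
```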
3. Breaking Into Smaller Pieces (Chunking)

This is one of the most important steps. Large documents get split into smaller, meaningful sections.

Think of it like this:

Books to Chapters

Instead of having one book, you have individual chapters

Pages to Sections

Instead of a whole webpage, you have distinct paragraphs or sections

Standalone Pieces

Each piece is small enough to be useful on its own
How big should pieces be?
Chunk Size | Good For | Benefit | Drawback
Smaller pieces (a few paragraphs) | Specific facts, product specs, FAQ answers | Very precise answers | Might miss surrounding context
Larger pieces (several paragraphs) | Explanations, how-to guides, policies | More complete information | More text to read through, so matches can be less precise
Most teams start with sections that are about 2-3 paragraphs long, then adjust based on results
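
To make the idea of chunking concrete, here is a minimal sketch that groups paragraphs into pieces of a fixed size. The function name `chunk_paragraphs` and the assumption that paragraphs are separated by blank lines are illustrative choices, and the default of three paragraphs simply mirrors the starting point mentioned above.

```python
def chunk_paragraphs(text: str, paragraphs_per_chunk: int = 3) -> list[str]:
    """Group consecutive paragraphs into chunks of a fixed paragraph count."""
    if paragraphs_per_chunk < 1:
        raise ValueError("paragraphs_per_chunk must be at least 1")
    # Assumes paragraphs are separated by blank lines.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks = []
    for start in range(0, len(paragraphs), paragraphs_per_chunk):
        group = paragraphs[start:start + paragraphs_per_chunk]
        chunks.append("\n\n".join(group))
    return chunks

# Seven short made-up paragraphs become three chunks (3 + 3 + 1 paragraphs).
document = "\n\n".join(f"Paragraph {n}." for n in range(1, 8))
chunks = chunk_paragraphs(document)
print(len(chunks))
```

Changing `paragraphs_per_chunk` is the knob to turn when results look too narrow or too broad.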
4. Converting to Searchable Format (Embedding)

This step converts your text into a special format that computers can search through very quickly.

The simple explanation:

Text to Numbers

Each piece of text gets converted into a series of numbers (a mathematical representation)

Similarity Matching

Similar content gets similar numbers, making it easy to find related information

Question Conversion

When someone asks a question, the AI converts that question to numbers too

Smart Matching

The system finds the content with the most similar numbers to the question
It’s like creating a fingerprint for each piece of content. Questions about “vacation time” will match content about “paid time off” even though the words are different.
What you need to know:

Low Cost

This process has a small cost (usually pennies per thousand pieces)

Automated

It happens automatically without manual intervention

Quality Impact

The quality of this step affects how well your AI finds relevant information
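
The matching idea can be sketched with a few lines of Python. The three-number vectors below are hand-made for illustration only; in practice an embedding model produces much longer vectors, and the chunk names and values here are assumptions. Cosine similarity is one common way to measure how close two vectors are, though a given system may use a different measure.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made stand-ins for embeddings; a real model would generate these.
chunk_vectors = {
    "Paid time off policy": [0.9, 0.1, 0.0],
    "Password reset steps": [0.0, 0.2, 0.9],
    "Office parking rules": [0.1, 0.9, 0.1],
}

# Stand-in for the embedded question "How much vacation time do I get?"
question_vector = [0.8, 0.2, 0.1]

# Pick the chunk whose vector points in the most similar direction.
best_match = max(
    chunk_vectors,
    key=lambda name: cosine_similarity(question_vector, chunk_vectors[name]),
)
print(best_match)
```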
5. Adding Labels and Organization (Metadata)

This is like adding sticky notes to your content with helpful information.

What you might add:

Source Information

Source: “Product Manual v2.0” - Tracks which document the content came from

Date Tracking

Date: “Last updated: January 2024” - Helps identify freshness of information

Content Category

Category: “Installation Guide” - Groups similar types of content together

Product Association

Product: “Widget Pro” - Links content to specific products or services

Department Owner

Department: “Engineering” - Identifies which team owns this content
Why this helps:

Smart Filtering

You can filter results to only show recent content or specific categories

Better Organization

You can organize by product, topic, or department for easier navigation

Freshness Awareness

You know when content might be outdated and needs updating

Source Tracing

You can trace answers back to their original source for verification
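
Here is a small sketch of how metadata might travel with each piece and be used for filtering. The `Chunk` class, the `filter_chunks` helper, the field names, and the sample texts and dates are all hypothetical; they borrow the example labels above but do not reflect any specific product's schema.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)

def filter_chunks(chunks, category: Optional[str] = None,
                  updated_since: Optional[date] = None):
    """Keep chunks that match the category and are at least as new as updated_since."""
    results = []
    for chunk in chunks:
        meta = chunk.metadata
        if category is not None and meta.get("category") != category:
            continue
        # Chunks without an "updated" date are treated as very old.
        if updated_since is not None and meta.get("updated", date.min) < updated_since:
            continue
        results.append(chunk)
    return results

# Illustrative records; the texts and exact dates are made up.
chunks = [
    Chunk("Mount the bracket before wiring.", {
        "source": "Product Manual v2.0", "category": "Installation Guide",
        "product": "Widget Pro", "department": "Engineering",
        "updated": date(2024, 1, 1),
    }),
    Chunk("Older wiring instructions.", {
        "source": "Product Manual v1.0", "category": "Installation Guide",
        "product": "Widget Pro", "department": "Engineering",
        "updated": date(2021, 6, 1),
    }),
]

recent = filter_chunks(chunks, category="Installation Guide",
                       updated_since=date(2023, 1, 1))
print([c.metadata["source"] for c in recent])
```

Because the source label is stored on every piece, any answer built from `recent` can be traced back to the document it came from.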
6. Storing Everything (Indexing)

Finally, all your processed content gets stored in a vector database where it can be quickly searched.

What happens:

Complete Storage

Each piece is stored with its converted format and all metadata labels

Optimized Organization

The database organizes everything for lightning-fast searching

Easy Updates

You can easily add new content or update existing content without disruption
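
To show what storing and updating might look like, here is a toy in-memory index. It is only a sketch: the class name `TinyVectorIndex` and its methods are invented for this example, and a real vector database uses specialized data structures rather than comparing the query against every record as this code does.

```python
import math

class TinyVectorIndex:
    """Toy stand-in for a vector database: stores vector, text, and metadata per ID."""

    def __init__(self):
        self._records = {}

    def __len__(self):
        return len(self._records)

    def upsert(self, chunk_id, vector, text, metadata=None):
        # Adding with an existing ID replaces the old record (an update).
        self._records[chunk_id] = (vector, text, metadata or {})

    def search(self, query_vector, top_k=3):
        # Brute-force scan: score every record, then return the best top_k.
        scored = [
            (self._cosine(query_vector, vector), chunk_id, text)
            for chunk_id, (vector, text, _meta) in self._records.items()
        ]
        scored.sort(reverse=True)
        return scored[:top_k]

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norms

index = TinyVectorIndex()
# Hand-made two-number vectors stand in for real embeddings.
index.upsert("pto", [0.9, 0.1], "Paid time off policy", {"category": "HR"})
index.upsert("reset", [0.1, 0.9], "Password reset steps", {"category": "IT"})

# Updating existing content reuses the same ID, so nothing is duplicated.
index.upsert("pto", [0.9, 0.1], "Paid time off policy (revised)", {"category": "HR"})

top = index.search([1.0, 0.0], top_k=1)
print(top[0][2])
```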

Common Processing Decisions

How to Split Your Documents

FAQ Documents

Best Practice: Keep each question and answer together as one piece.
Don’t split them apart. Example: “Q: How do I reset my password?” + “A: Click the forgot password link…” should stay together.
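
One way to keep each question with its answer is to split only where a new question begins. The sketch below assumes every question starts a line with "Q:", which is a formatting assumption for this example; the second question and answer are made up, and the function name `split_faq` is hypothetical.

```python
import re

def split_faq(text: str) -> list[str]:
    """Split FAQ text so each chunk holds one question together with its answer."""
    # Split just before any line that starts with "Q:" (zero-width match).
    parts = re.split(r"(?m)^(?=Q:)", text)
    return [part.strip() for part in parts if part.strip()]

faq_text = (
    "Q: How do I reset my password?\n"
    "A: Click the forgot password link…\n"
    "\n"
    "Q: How do I change my email?\n"
    "A: Open your account settings.\n"
)

faq_chunks = split_faq(faq_text)
for chunk in faq_chunks:
    print(chunk)
    print("---")
```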

Manuals and Guides

Best Practice: Split by sections or headings.
Keep related paragraphs together and make sure each piece can stand alone with enough context.

Policies and Legal Documents

Best Practice: Split by policy topic or clause.
Include the policy title with each piece and keep numbered sections together for coherence.

Product Specifications

Best Practice: Keep all specs for one product together.
Include the product name in each piece and don’t split feature lists across multiple pieces.

Handling Different File Types

PDFs

  • Text PDFs: Usually process easily with standard extraction
  • Scanned PDFs: May need extra processing (OCR) to extract text from images
  • Forms and tables: Might need special handling to preserve structure

Word Documents

  • Generally easy to process with good text extraction
  • Headers and styles help with automatically splitting into logical sections
  • Images and charts may need separate handling or description

Web Pages

  • Remove navigation, ads, and boilerplate content that doesn’t add value
  • Keep the main content that answers questions or provides information
  • Preserve links that add context or point to related information
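
As a rough sketch of stripping web page boilerplate, the example below uses Python's standard `html.parser` module to skip text inside navigation, header, footer, and similar tags. The class name `MainTextExtractor`, the list of skipped tags, and the sample page are assumptions for illustration. This simplified version keeps only text and does not preserve links, so it would need extending to cover that part of the guidance.

```python
from html.parser import HTMLParser

class MainTextExtractor(HTMLParser):
    """Collect page text while ignoring common boilerplate containers."""

    SKIP_TAGS = {"nav", "header", "footer", "aside", "script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside a skipped tag
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.parts.append(text)

# Made-up sample page.
page = (
    "<nav>Home | Pricing</nav>"
    "<main><h1>Returns</h1><p>Items can be returned within 30 days.</p></main>"
    "<footer>Copyright notice</footer>"
)

extractor = MainTextExtractor()
extractor.feed(page)
main_text = "\n".join(extractor.parts)
print(main_text)
```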

Spreadsheets

  • Convert to a readable format first (like CSV or structured text)
  • Each row might become one piece of information to search
  • Include column headers for context so the data makes sense
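
For spreadsheets, a minimal sketch of the row-per-piece approach might look like this. It uses Python's standard `csv` module; the function name `rows_to_chunks`, the column names, and the sample values are made up for the example.

```python
import csv
import io

def rows_to_chunks(csv_text: str) -> list[str]:
    """Turn each CSV row into one text chunk, labeling every value with its column header."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        "; ".join(f"{column}: {value}" for column, value in row.items())
        for row in reader
    ]

# Illustrative spreadsheet exported as CSV.
csv_text = (
    "Product,Weight,Warranty\n"
    "Widget Pro,1.2 kg,2 years\n"
    "Widget Mini,0.4 kg,1 year\n"
)

row_chunks = rows_to_chunks(csv_text)
for chunk in row_chunks:
    print(chunk)
```

Repeating the column headers in every piece means a row still makes sense when it is retrieved on its own.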

Best Practices

Start Simple

1. Pick Your Core Content

Select your most important 10-20 documents that will answer the majority of common questions
2. Use Default Settings

Start with default settings for splitting - your tool will have recommendations based on best practices
3. Process and Test

Process your documents and test with real questions people actually ask
4. Evaluate Results

See what works well and what doesn’t - gather feedback from actual usage
5. Adjust and Iterate

Make targeted adjustments based on what you learned, then repeat
Don’t try to be perfect: Get something working first, then improve it. Iteration beats perfection.

Test and Iterate

After processing your data:

Test with Known Answers

Ask questions you already know the answers to - this helps verify accuracy

Check Information Retrieval

See if the AI finds the right information from your content

Evaluate Context

Check if answers have enough surrounding context to be useful

Identify Errors

Look for cases where it finds the wrong or irrelevant information
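
Testing with known answers can be turned into a simple score. The sketch below counts how often the expected source appears among the top results for each test question. Everything here is hypothetical: the `hit_rate` helper, the `fake_retrieve` function standing in for your real search, and the sample questions and source names.

```python
def hit_rate(test_cases, retrieve, top_k=3):
    """Return the fraction of questions whose expected source is in the top_k results."""
    if not test_cases:
        raise ValueError("test_cases must not be empty")
    hits = 0
    for question, expected_source in test_cases:
        results = retrieve(question)[:top_k]
        if expected_source in results:
            hits += 1
    return hits / len(test_cases)

# Canned results standing in for a real retrieval system.
canned_results = {
    "What's the vacation policy?": ["Employee Handbook: Vacation", "Employee Handbook: Sick Leave"],
    "How do I reset my password?": ["IT Guide: VPN Setup"],
}

def fake_retrieve(question):
    return canned_results.get(question, [])

# Each test case pairs a question with the source you already know is correct.
test_cases = [
    ("What's the vacation policy?", "Employee Handbook: Vacation"),
    ("How do I reset my password?", "IT Guide: Passwords"),
]

score = hit_rate(test_cases, fake_retrieve)
print(f"Hit rate: {score:.0%}")
```

Re-running the same test cases after each adjustment gives a consistent way to tell whether a change helped.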
Common fixes:

AI Can't Find Information

Problem: Information exists but isn’t being retrieved
Solution: Add more content or check your metadata labels for accuracy

Answers Lack Context

Problem: Responses are too brief or missing important details
Solution: Make your content pieces bigger to include more surrounding information

Answers Too General

Problem: Responses aren’t specific enough to be actionable
Solution: Make your pieces smaller for more precise, focused answers

Wrong Information Retrieved

Problem: AI returns content from the wrong section or document
Solution: Check your categories and labels, then improve organization and tagging

Monitoring Processing Quality

Signs Your Processing is Working Well

Accurate Answers

Users consistently get accurate, relevant answers to their questions

Fast Retrieval

The AI finds information quickly without long search times

Right Detail Level

Answers include the right amount of detail - not too brief, not too verbose

High Satisfaction

Users rarely say “that’s wrong” or “I can’t find what I need”

Signs You Need to Adjust

Can't Find Information

AI often says “I don’t know” even though you have the information in your content

Missing Context

Answers are missing important context or background information to be useful

Wrong Results

AI returns information from the wrong section or document instead of relevant content

Performance Issues

Responses are slow or processing costs are higher than expected

Troubleshooting Common Issues

Likely cause: Your pieces are too small and don’t include enough context
Fix: Make your pieces bigger or add overlap between pieces so related information appears together

Likely cause: Your categories and tags aren’t specific enough
Fix: Add more specific tags and improve your organization with better metadata to help filter results

Likely cause: You’re processing too much at once or pieces are too numerous
Fix: Process in batches, or make pieces slightly larger to reduce total count and improve efficiency

Likely cause: You haven’t updated your processed data recently
Fix: Set up a regular update schedule and version your content to ensure freshness
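
The first fix above mentions adding overlap between pieces. A minimal sketch of that idea follows: each chunk repeats the last paragraph or so of the previous chunk, so information near a boundary appears in both. The function name `chunk_with_overlap` and its default sizes are assumptions for illustration.

```python
def chunk_with_overlap(paragraphs, size=3, overlap=1):
    """Group paragraphs into chunks of `size`, repeating `overlap` paragraphs between neighbors."""
    if size < 1 or not 0 <= overlap < size:
        raise ValueError("need size >= 1 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(paragraphs), step):
        chunks.append(paragraphs[start:start + size])
        # Stop once a chunk has reached the end of the document.
        if start + size >= len(paragraphs):
            break
    return chunks

paragraphs = ["P1", "P2", "P3", "P4", "P5"]
overlapped = chunk_with_overlap(paragraphs, size=3, overlap=1)
print(overlapped)
```

In this example "P3" lands in both chunks, so a question that depends on P3 and P4 together can still find them in one piece. The trade-off is that overlap increases the total amount of stored text.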

Getting Started

Your First Processing Project

Week 1: Prepare

  • Identify your 10 most important documents
  • Make sure they’re current and accurate
  • Gather them in one location

Week 2: Process

  • Choose a vector database
  • Run your first processing job
  • Review the results

Week 3: Test

  • Ask 20-30 test questions
  • Evaluate the answers
  • Note what works and what doesn’t

Week 4: Refine

  • Adjust your settings based on results
  • Re-process if needed
  • Add more documents

Questions to Ask Your Technical Team

If you’re working with engineers or technical staff:

Question 1: Chunk Size

“How big are our chunks? Should we try different sizes based on our content type?”

Question 2: Update Frequency

“How often will we refresh the data? What’s our update schedule?”

Question 3: Automation

“Can we set up automated updates to keep content fresh without manual work?”

Question 4: Metadata Strategy

“What categories or tags should we use to organize our content effectively?”

Question 5: Success Metrics

“How will we know if processing is working well? What should we measure?”

Next Steps