Data Processing - Arcbeam Documentation

Data processing is the work of transforming your documents, files, and information into a format that AI can understand and use effectively. Think of it like organizing a library so people can find exactly what they need.

Why Data Processing Matters

Imagine you have a thousand-page employee handbook as one giant PDF. When someone asks “What’s the vacation policy?”, the AI needs to find just the relevant section about vacations, not the entire handbook.

Breaking Down Information

Large documents get split into smaller, manageable pieces that are easier to search and retrieve

Fast Searching

Organized content allows the AI to find exactly the right information quickly without searching through everything

Smart Organization

Content is structured logically so related information can be found together

Enhanced Searchability

Labels and categories help filter and locate the most relevant information for each question

The Data Processing Pipeline

Gathering Your Information

This is where you collect all the information your AI needs to work with.

What you might collect:

Documents

PDF documents like manuals, guides, and reports

Office Files

Word documents, presentations, and spreadsheets

Web Content

Pages from your website, help center, or internal wiki

Things to consider:

Currency

Is this information current and accurate?

Permissions

Do you have permission to use this data?

Quality

Is the quality good enough (not blurry scans, readable text)?

Cleaning Your Information

Just like you’d remove coffee stains from a book before scanning it, you need to clean your data.

What gets cleaned:

Page Elements

Page numbers, headers, and footers that aren’t useful for answering questions

Formatting Issues

Extra spaces, line breaks, and special characters that don’t add meaning

Duplicate Content

Repeated information that could confuse the AI or waste storage

Irrelevant Sections

Cleaner data means better, more relevant answers from your AI
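The cleaning steps above can be sketched in a few lines of Python. This is a minimal illustration, not a full cleaner: the function name `clean_text` is hypothetical, and it assumes page markers look like "Page 3" or "Page 3 of 10" and that paragraphs are separated by blank lines. Real documents usually need rules tuned to their own layout.

```python
import re


def clean_text(raw: str) -> str:
    """Drop page-number lines, collapse extra whitespace, skip repeated paragraphs."""
    paragraphs = []
    seen = set()
    # Assumption: paragraphs are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", raw):
        lines = []
        for line in block.splitlines():
            # Collapse runs of spaces and tabs into a single space.
            line = re.sub(r"\s+", " ", line).strip()
            # Assumption: page markers look like "Page 3" or "Page 3 of 10".
            if not line or re.fullmatch(r"(?i)page \d+( of \d+)?", line):
                continue
            lines.append(line)
        paragraph = " ".join(lines)
        key = paragraph.lower()
        # Keep only the first copy of a repeated paragraph.
        if paragraph and key not in seen:
            seen.add(key)
            paragraphs.append(paragraph)
    return "\n\n".join(paragraphs)


sample = (
    "Vacation   policy\napplies to all staff.\n\n"
    "Page 3 of 10\n\n"
    "Vacation policy applies to all staff.\n\n"
    "Sick leave is separate."
)
print(clean_text(sample))
```

In the sample, the page marker and the repeated vacation paragraph are dropped, leaving two clean paragraphs.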

Breaking Into Smaller Pieces (Chunking)

This is one of the most important steps. Large documents get split into smaller, meaningful sections.

Think of it like this:

Books to Chapters

Instead of having one book, you have individual chapters

Pages to Sections

Instead of a whole webpage, you have distinct paragraphs or sections

Standalone Pieces

Each piece is small enough to be useful on its own

How big should pieces be?

Chunk Size	Good For	Benefit	Drawback
Smaller pieces (a few paragraphs)	Specific facts, product specs, FAQ answers	Very precise answers	Might miss surrounding context
Larger pieces (several paragraphs)	Explanations, how-to guides, policies	More complete information	Takes longer to search through

Most teams start with sections that are about 2-3 paragraphs long, then adjust based on results
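As a rough illustration of the 2-3 paragraph starting point, the sketch below groups paragraphs into chunks and repeats one paragraph between neighbors so context carries over. The function name `chunk_paragraphs` and its `size` and `overlap` parameters are assumptions made for this example, not settings from any particular tool.

```python
def chunk_paragraphs(text: str, size: int = 3, overlap: int = 1) -> list[str]:
    """Group paragraphs into chunks of `size`, sharing `overlap` paragraphs between neighbors."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and overlap must be smaller than size")
    # Assumption: paragraphs are separated by a blank line.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    step = size - overlap
    chunks = []
    for start in range(0, len(paragraphs), step):
        chunks.append("\n\n".join(paragraphs[start:start + size]))
        # Stop once the last paragraph has been included.
        if start + size >= len(paragraphs):
            break
    return chunks


document = "p1\n\np2\n\np3\n\np4\n\np5"
for chunk in chunk_paragraphs(document, size=3, overlap=1):
    print(repr(chunk))
```

With five paragraphs, a size of 3, and an overlap of 1, this yields two chunks that both contain the third paragraph. Raising `size` trades precision for context; lowering it does the reverse.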

Converting to Searchable Format (Embedding)

This step converts your text into a special format that computers can search through very quickly.

The simple explanation:

Text to Numbers

Each piece of text gets converted into a series of numbers (a mathematical representation)

Similarity Matching

Similar content gets similar numbers, making it easy to find related information

Question Conversion

When someone asks a question, the AI converts that question to numbers too

Smart Matching

The system finds the content with the most similar numbers to the question

It’s like creating a fingerprint for each piece of content. Questions about “vacation time” will match content about “paid time off” even though the words are different.
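To make "similar numbers" concrete, here is a small sketch of the comparison step using cosine similarity, one common way to compare such number lists. The three-number vectors are hand-written stand-ins chosen for illustration; a real system would get much longer vectors from an embedding model, and the names `EMBEDDINGS` and `best_match` are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return a score near 1.0 for vectors pointing the same way, near 0.0 for unrelated ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical vectors: in practice an embedding model produces these.
EMBEDDINGS = {
    "paid time off policy": [0.9, 0.1, 0.0],
    "password reset steps": [0.0, 0.2, 0.9],
}

# Stand-in vector for the question "How much vacation time do I get?"
question_vector = [0.8, 0.2, 0.1]


def best_match(query_vector: list[float], embeddings: dict[str, list[float]]) -> str:
    """Return the stored text whose vector is most similar to the query vector."""
    return max(embeddings, key=lambda text: cosine_similarity(query_vector, embeddings[text]))


print(best_match(question_vector, EMBEDDINGS))
```

The question vector sits closest to the "paid time off policy" vector, so that content is returned even though the wording differs.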

What you need to know:

Low Cost

This process has a small cost (usually pennies per thousand pieces)

Automated

It happens automatically without manual intervention

Quality Impact

The quality of this step affects how well your AI finds relevant information

Adding Labels and Organization (Metadata)

This is like adding sticky notes to your content with helpful information.

What you might add:

Source Information

Source: “Product Manual v2.0” - Tracks which document the content came from

Date Tracking

Date: “Last updated: January 2024” - Helps identify freshness of information

Content Category

Category: “Installation Guide” - Groups similar types of content together

Product Association

Product: “Widget Pro” - Links content to specific products or services

Department Owner

Department: “Engineering” - Identifies which team owns this content

Why this helps:

Smart Filtering

You can filter results to only show recent content or specific categories

Better Organization

You can organize by product, topic, or department for easier navigation

Freshness Awareness

You know when content might be outdated and needs updating

Source Tracing

You can trace answers back to their original source for verification
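One way to picture metadata is as extra fields stored next to each piece of text. The sketch below uses the example labels from this section; the `Chunk` class, the `filter_chunks` function, and the specific dates are hypothetical, made up for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Chunk:
    text: str
    source: str       # which document the content came from
    updated: date     # when the content was last updated
    category: str     # type of content
    product: str = ""
    department: str = ""


def filter_chunks(chunks: list[Chunk], category: Optional[str] = None,
                  updated_since: Optional[date] = None) -> list[Chunk]:
    """Keep only chunks that match the requested category and freshness."""
    results = []
    for chunk in chunks:
        if category is not None and chunk.category != category:
            continue
        if updated_since is not None and chunk.updated < updated_since:
            continue
        results.append(chunk)
    return results


chunks = [
    Chunk("Mount the bracket first.", "Product Manual v2.0", date(2024, 1, 15),
          "Installation Guide", "Widget Pro", "Engineering"),
    Chunk("Old mounting steps.", "Product Manual v1.0", date(2021, 6, 1),
          "Installation Guide", "Widget Pro", "Engineering"),
]

recent = filter_chunks(chunks, category="Installation Guide", updated_since=date(2023, 1, 1))
print([c.source for c in recent])
```

Because each chunk carries its `source`, an answer built from it can be traced back to the original document.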

Storing Everything (Indexing)

Finally, all your processed content gets stored in a vector database where it can be quickly searched.

What happens:

Complete Storage

Each piece is stored with its converted format and all metadata labels

Optimized Organization

The database organizes everything for lightning-fast searching

Easy Updates

You can easily add new content or update existing content without disruption
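The toy class below shows the idea of storing each piece with its vector and metadata, then searching by similarity. It is an in-memory sketch for illustration only: `TinyVectorIndex` is a hypothetical name, it scans every record on each search, and a real vector database uses specialized index structures to stay fast at scale.

```python
import math


class TinyVectorIndex:
    """In-memory stand-in for a vector database (illustration only)."""

    def __init__(self):
        self._records = {}

    def upsert(self, chunk_id, vector, text, metadata=None):
        # Adding with an existing id replaces the old record, which is how updates work here.
        self._records[chunk_id] = (list(vector), text, dict(metadata or {}))

    def search(self, query_vector, top_k=3):
        """Return up to top_k (score, chunk_id, text) tuples, best match first."""
        scored = [
            (self._cosine(query_vector, vector), chunk_id, text)
            for chunk_id, (vector, text, _metadata) in self._records.items()
        ]
        scored.sort(key=lambda item: item[0], reverse=True)
        return scored[:top_k]

    def __len__(self):
        return len(self._records)

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0


index = TinyVectorIndex()
# Hypothetical hand-written vectors; a real pipeline would use an embedding model.
index.upsert("pto-1", [0.9, 0.1, 0.0], "Paid time off policy", {"category": "Policy"})
index.upsert("pw-1", [0.0, 0.2, 0.9], "Password reset steps", {"category": "FAQ"})
index.upsert("pto-1", [0.9, 0.1, 0.0], "Paid time off policy (revised)", {"category": "Policy"})

top = index.search([0.8, 0.2, 0.1], top_k=1)
print(top[0][1], top[0][2])
```

The second `upsert` for `pto-1` replaces the earlier record rather than adding a duplicate, so the index still holds two pieces.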

Common Processing Decisions

How to Split Your Documents

FAQ Documents

Best Practice: Keep each question and answer together as one piece. Don’t split them apart. Example: “Q: How do I reset my password?” + “A: Click the forgot password link…” should stay together.

Manuals and Guides

Best Practice: Split by sections or headings. Keep related paragraphs together and make sure each piece can stand alone with enough context.

Policies and Legal Documents

Best Practice: Split by policy topic or clause. Include the policy title with each piece and keep numbered sections together for coherence.

Product Specifications

Best Practice: Keep all specs for one product together. Include the product name in each piece and don’t split feature lists across multiple pieces.
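For the FAQ case described above, a splitter can keep each question with its answer. This sketch assumes every question line starts with "Q:" and every answer line with "A:", which will not hold for every FAQ; the function name `split_faq` is hypothetical.

```python
def split_faq(text: str) -> list[str]:
    """Return one piece per question, with its answer attached."""
    pieces = []
    current = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Assumption: a line starting with "Q:" begins a new question.
        if line.startswith("Q:") and current:
            pieces.append(" ".join(current))
            current = []
        current.append(line)
    if current:
        pieces.append(" ".join(current))
    return pieces


faq = """Q: How do I reset my password?
A: Click the forgot password link.

Q: How do I change my email?
A: Open account settings.
"""
for piece in split_faq(faq):
    print(piece)
```

Each printed piece contains both the question and its answer, so a search that matches the question also retrieves the answer.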

Handling Different File Types

PDFs

Text PDFs: Usually process easily with standard extraction
Scanned PDFs: May need extra processing (OCR) to extract text from images
Forms and tables: Might need special handling to preserve structure

Word Documents

Generally easy to process with good text extraction
Headers and styles help with automatically splitting into logical sections
Images and charts may need separate handling or description

Web Pages

Remove navigation, ads, and boilerplate content that doesn’t add value
Keep the main content that answers questions or provides information
Preserve links that add context or point to related information

Spreadsheets

Convert to a readable format first (like CSV or structured text)
Each row might become one piece of information to search
Include column headers for context so the data makes sense
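The spreadsheet advice above can be sketched with Python's standard `csv` module: each row becomes one piece, and each value is paired with its column header. The function name `rows_to_pieces` and the sample columns are made up for illustration, and the sketch assumes the first row of the CSV holds the headers.

```python
import csv
import io


def rows_to_pieces(csv_text: str) -> list[str]:
    """Turn each spreadsheet row into one searchable line that includes the column headers."""
    # Assumption: the first row of the CSV contains the column headers.
    reader = csv.DictReader(io.StringIO(csv_text))
    pieces = []
    for row in reader:
        # Skip empty cells so pieces do not contain dangling headers.
        parts = [f"{header}: {value}" for header, value in row.items() if header and value]
        if parts:
            pieces.append("; ".join(parts))
    return pieces


sheet = "Product,Weight,Color\nWidget Pro,2 kg,Blue\nWidget Mini,1 kg,\n"
for piece in rows_to_pieces(sheet):
    print(piece)
```

Keeping the header next to each value means a piece such as "Weight: 2 kg" still makes sense when it is read on its own.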

Best Practices

Start Simple

Pick Your Core Content

Select your most important 10-20 documents that will answer the majority of common questions

Use Default Settings

Start with default settings for splitting - your tool will have recommendations based on best practices

Process and Test

Process your documents and test with real questions people actually ask

Evaluate Results

See what works well and what doesn’t - gather feedback from actual usage

Adjust and Iterate

Make targeted adjustments based on what you learned, then repeat

Don’t try to be perfect: Get something working first, then improve it. Iteration beats perfection.

Test and Iterate

After processing your data:

Test with Known Answers

Ask questions you already know the answers to - this helps verify accuracy

Check Information Retrieval

See if the AI finds the right information from your content

Evaluate Context

Check if answers have enough surrounding context to be useful

Identify Errors

Look for cases where it finds the wrong or irrelevant information
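The checks above can be turned into a simple score: for each question with a known answer, see whether the expected source shows up among the retrieved results. In this sketch `hit_rate` and `fake_retrieve` are hypothetical names, and `fake_retrieve` is a keyword lookup standing in for whatever search your real system performs.

```python
def hit_rate(test_cases, retrieve, top_k=3):
    """Fraction of questions whose expected source appears in the top_k retrieved sources."""
    if not test_cases:
        return 0.0
    hits = 0
    for question, expected_source in test_cases:
        if expected_source in retrieve(question)[:top_k]:
            hits += 1
    return hits / len(test_cases)


def fake_retrieve(question):
    """Stand-in retriever for illustration: returns source names by simple keyword lookup."""
    if "vacation" in question.lower():
        return ["Employee Handbook", "Benefits Guide"]
    return ["Product Manual v2.0"]


test_cases = [
    ("What's the vacation policy?", "Employee Handbook"),
    ("How do I install Widget Pro?", "Product Manual v2.0"),
    ("How do I reset my password?", "IT Help Center"),
]
print(f"Hit rate: {hit_rate(test_cases, fake_retrieve):.2f}")
```

Tracking this number before and after each adjustment gives a quick signal of whether a change to chunk size or metadata helped.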

Common fixes:

AI Can't Find Information

Problem: Information exists but isn’t being retrieved
Solution: Add more content or check your metadata labels for accuracy

Answers Lack Context

Problem: Responses are too brief or missing important details
Solution: Make your content pieces bigger to include more surrounding information

Answers Too General

Problem: Responses aren’t specific enough to be actionable
Solution: Make your pieces smaller for more precise, focused answers

Wrong Information Retrieved

Problem: AI returns content from the wrong section or document
Solution: Check your categories and labels, then improve organization and tagging

Monitoring Processing Quality

Signs Your Processing is Working Well

Accurate Answers

Users consistently get accurate, relevant answers to their questions

Fast Retrieval

The AI finds information quickly without long search times

Right Detail Level

Answers include the right amount of detail - not too brief, not too verbose

High Satisfaction

Users rarely say “that’s wrong” or “I can’t find what I need”

Signs You Need to Adjust

Can't Find Information

AI often says “I don’t know” even though you have the information in your content

Missing Context

Answers are missing important context or background information to be useful

Wrong Results

AI returns information from the wrong section or document instead of relevant content

Performance Issues

Responses are slow or processing costs are higher than expected

Troubleshooting Common Issues

"The AI gives incomplete answers"

Likely cause: Your pieces are too small and don’t include enough context
Fix: Make your pieces bigger or add overlap between pieces so related information appears together

"The AI finds irrelevant information"

Likely cause: Your categories and tags aren’t specific enough
Fix: Add more specific tags and improve your organization with better metadata to help filter results

"Processing is taking too long"

Likely cause: You’re processing too much at once or there are too many pieces
Fix: Process in batches, or make pieces slightly larger to reduce the total count and improve efficiency

"Answers are outdated"

Likely cause: You haven’t updated your processed data recently
Fix: Set up a regular update schedule and version your content to ensure freshness
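For the "process in batches" fix mentioned above, a small helper can hand pieces to the processing step a group at a time. This is a generic sketch: the name `batches` and the batch size of 2 are arbitrary choices for the example, and the right batch size depends on your tool's limits.

```python
from itertools import islice


def batches(items, batch_size):
    """Yield consecutive lists of up to batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


pieces = ["piece 1", "piece 2", "piece 3", "piece 4", "piece 5"]
for group in batches(pieces, 2):
    # In a real pipeline, each group would be sent to the embedding step here.
    print(group)
```

Five pieces with a batch size of 2 produce three groups, the last holding the single leftover piece.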

Getting Started

Your First Processing Project

Week 1: Prepare

Identify your 10 most important documents
Make sure they’re current and accurate
Gather them in one location

Week 2: Process

Choose a vector database
Run your first processing job
Review the results

Week 3: Test

Ask 20-30 test questions
Evaluate the answers
Note what works and what doesn’t

Week 4: Refine

Adjust your settings based on results
Re-process if needed
Add more documents

Questions to Ask Your Technical Team

If you’re working with engineers or technical staff:

Question 1: Chunk Size

“How big are our chunks? Should we try different sizes based on our content type?”

Question 2: Update Frequency

“How often will we refresh the data? What’s our update schedule?”

Question 3: Automation

“Can we set up automated updates to keep content fresh without manual work?”

Question 4: Metadata Strategy

“What categories or tags should we use to organize our content effectively?”

Question 5: Success Metrics

“How will we know if processing is working well? What should we measure?”

Next Steps

Data Sources

Understand where your data comes from

Model Selection

Choose the right AI model for your needs

Evaluations

Test if your processing is working well

Data Lineage

Track where your information comes from
