PGvector Integration - Arcbeam Documentation

Connect your PostgreSQL database to Arcbeam to enable full tracking. See which documents influence each AI response and track data usage over time.

What Is PGvector?

PGvector is an open-source PostgreSQL extension for vector similarity search. It is a widely used option for vector storage, especially in self-hosted applications. PGvector allows you to:

Store vector embeddings alongside your data
Perform similarity searches using various distance metrics
Scale to millions of vectors
Keep everything in PostgreSQL (no additional infrastructure)

Why Connect PGvector to Arcbeam?

Data Lineage

Link AI traces to the exact documents that influenced them. Trace wrong answers back to outdated documents.

Usage Analytics

Track which documents are retrieved most often, which are never used, and retrieval patterns over time.

Compliance

Demonstrate data provenance with a full audit trail from output to source.

Prerequisites

Before connecting PGvector to Arcbeam, you need:

1. PostgreSQL with PGvector

Enable the PGvector extension in your database (the extension must already be installed on the PostgreSQL server):

CREATE EXTENSION IF NOT EXISTS vector;

2. Table with Vector Data

A table containing your documents and embeddings:

CREATE TABLE documents (
  id UUID PRIMARY KEY,
  content TEXT,
  embedding VECTOR(1536),  -- Dimension depends on your model
  source TEXT,
  title TEXT,
  updated_at TIMESTAMP DEFAULT NOW()
);
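Because the embedding column is declared with a fixed dimension, rows whose embeddings have a different length are rejected on insert. The following is a minimal sketch of a client-side check, assuming embeddings arrive as Python lists of floats. The helper name `to_pgvector_literal` is hypothetical; the bracketed, comma-separated text form is the input format PGvector accepts for vector values.

```python
import math
from typing import Sequence


def to_pgvector_literal(embedding: Sequence[float], expected_dim: int = 1536) -> str:
    """Validate an embedding and render it in PGvector's text form, e.g. "[0.1,0.2]"."""
    # The column is VECTOR(expected_dim); a length mismatch would fail at insert time.
    if len(embedding) != expected_dim:
        raise ValueError(f"expected {expected_dim} dimensions, got {len(embedding)}")
    # NaN and infinite components are not valid vector elements.
    if not all(math.isfinite(x) for x in embedding):
        raise ValueError("embedding contains NaN or infinite values")
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"
```

The returned string can be passed as a query parameter and cast to the vector type by the database driver or by an explicit `::vector` cast.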

3. Database Credentials

Read access to the database:

postgresql://username:password@host:port/database
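Usernames and passwords that contain characters such as @, / or : need to be percent-encoded before they are placed in a connection URI. The following is a minimal sketch using only Python's standard library; the function name and the sample values are illustrative and not part of Arcbeam.

```python
from urllib.parse import quote


def build_connection_string(user: str, password: str, host: str,
                            port: int, database: str) -> str:
    """Assemble a postgresql:// URI, percent-encoding user, password and database."""
    # safe="" makes quote() encode "/" as well, which it leaves alone by default.
    return (
        f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{quote(database, safe='')}"
    )
```

For example, a password of `p@ss/word` becomes `p%40ss%2Fword` in the resulting URI.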

4. Arcbeam Project

An existing project where you’ll connect the dataset.

Connecting PGvector

Via UI

Navigate to Datasets

Go to the Datasets page in Arcbeam and click New Dataset.

Select PGvector

Choose PGvector as the dataset type.

Enter connection details

Connection String: postgresql://user:pass@host:port/db
Table Name: Name of your documents table (e.g., documents)

Map schema columns

ID Column: Unique identifier (e.g., id)
Content Column: Document text (e.g., content)
Embedding Column: Vector field (e.g., embedding)
Metadata Columns: Optional fields like source, title, updated_at

Test and Create

Click Test Connection to verify, then click Create Dataset.

Your PGvector database is now connected to Arcbeam!

Via API

curl -X POST https://api.arcbeam.ai/v1/vectordb-integrations \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Product Documentation",
    "description": "PGvector integration for product docs",
    "type": "pgvector",
    "hostUrl": "postgresql://user:pass@host:5432/db",
    "indexName": "documents",
    "schemaMapping": {
      "idField": "id",
      "documentField": "content",
      "sourceField": "source",
      "embeddingField": "embedding",
      "lastUpdatedField": "updated_at"
    },
    "isActive": true
  }'

Use the API method for programmatic setup or infrastructure-as-code deployments.
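For programmatic setup, the same request can be assembled in Python. The sketch below uses only the standard library. The endpoint and field names are copied from the curl example above, while the function name `build_integration_request` and its parameters are hypothetical. It only builds the request; sending it is left to the caller.

```python
import json
import urllib.request

# Endpoint taken from the curl example above.
API_URL = "https://api.arcbeam.ai/v1/vectordb-integrations"


def build_integration_request(api_key, connection_string, table_name,
                              name="Product Documentation",
                              description="PGvector integration for product docs"):
    """Build (but do not send) the POST request that registers a PGvector dataset."""
    payload = {
        "name": name,
        "description": description,
        "type": "pgvector",
        "hostUrl": connection_string,
        "indexName": table_name,
        # Column names match the example documents table in this guide.
        "schemaMapping": {
            "idField": "id",
            "documentField": "content",
            "sourceField": "source",
            "embeddingField": "embedding",
            "lastUpdatedField": "updated_at",
        },
        "isActive": True,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

To send the request, pass the result to `urllib.request.urlopen`. Reading the API key from an environment variable rather than hard-coding it keeps the key out of source control.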

Syncing Data

Arcbeam periodically syncs metadata from your PGvector database.

Manual Sync

Trigger a sync anytime:

Go to the Datasets page
Find your PGvector dataset
Click Sync Now
Wait for completion (usually under 1 minute for small datasets)

What Gets Synced

Data Type	Synced?	Description
Document IDs	✓ Yes	To match retrieved docs in traces
Content	✓ Yes	For display in trace details
Metadata columns	✓ Yes	Only fields you explicitly map
Embeddings	✗ Never	Not needed, saves storage and bandwidth
Other tables	✗ Never	Only the table you specify
Unmapped fields	✗ Never	Sensitive data stays in your database unless you explicitly map it
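To make the table concrete, the sketch below shows the kind of query these rules imply: only mapped columns of the specified table are selected, and the embedding column is left out. This illustrates the rules and is not Arcbeam's actual implementation. The function names are hypothetical, and the mapping keys mirror the schemaMapping fields used in the API example.

```python
from typing import Mapping


def quote_ident(name: str) -> str:
    """Quote a SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_sync_query(table: str, schema_mapping: Mapping[str, str]) -> str:
    """Select only the mapped columns, skipping the embedding column."""
    columns = [
        column
        for field, column in schema_mapping.items()
        if field != "embeddingField"  # embeddings are never synced
    ]
    column_list = ", ".join(quote_ident(c) for c in columns)
    return f"SELECT {column_list} FROM {quote_ident(table)}"
```

Any column that does not appear in the mapping never appears in the generated query, which is the property the last table row describes.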

Sync Performance

Typical sync times by dataset size:

< 10k docs: Syncs in seconds
10k-100k docs: Syncs in minutes
100k+ docs: Syncs may take longer

Security & Privacy

Connection Security

Connections use SSL/TLS encryption
Credentials stored encrypted at rest
Access limited to read-only queries
No write access to your database

Data Access

Arcbeam queries:

Only the table you specify
Only columns you map
Read-only SELECT queries

Arcbeam does not:

Modify your data
Access other tables
Store full embeddings
Keep entire documents (only excerpts for display)

Credentials Management

Best practices:

Use a read-only database user
Limit access to the specific table
Rotate credentials periodically
Set connection limits for the user (for example, through your connection pooler)

Example read-only user:

CREATE USER arcbeam_readonly WITH PASSWORD 'secure_password';
GRANT CONNECT ON DATABASE mydb TO arcbeam_readonly;
GRANT USAGE ON SCHEMA public TO arcbeam_readonly;
GRANT SELECT ON documents TO arcbeam_readonly;

What You Can Do

Once your PGvector database is connected, you can:

Debug Wrong Answers

Trace incorrect AI responses back to the source documents and fix them

Optimize Your Knowledge Base

Find and remove unused documents to improve retrieval speed and quality

Track Content Updates

Identify high-traffic documents that need updating

Compliance Reporting

Generate audit trails showing complete data provenance

See detailed use cases and workflows →

Best Practices

Keep Source Attribution

Always include a source field:

source TEXT  -- "refund_policy_v2024.pdf"

Track Update Timestamps

Maintain updated_at for tracking:

updated_at TIMESTAMP DEFAULT NOW()

Update on changes:

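UPDATE documents
SET content = $1,          -- new document text, bound as a query parameter
    updated_at = NOW()     -- record when the change happened
WHERE id = $2;             -- ID of the document being updated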

Use Meaningful Source Names

Consistent, descriptive names:

✅ Good: "user_manual_2024.pdf", "api_docs_v3.html"
❌ Bad: "file1.pdf", "doc_final_FINAL_v2.pdf"
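A naming rule is easier to keep consistent when it is checked automatically at ingestion time. The sketch below encodes one hypothetical convention (lowercase words or numbers joined by underscores, at least two parts, plus a file extension). The pattern and function name are assumptions chosen for illustration, so adapt them to your own naming scheme.

```python
import re

# Hypothetical convention: lowercase tokens joined by "_", at least two tokens,
# followed by a file extension, e.g. "user_manual_2024.pdf".
SOURCE_NAME_PATTERN = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)+\.[a-z0-9]+$")


def is_valid_source_name(name: str) -> bool:
    """Return True if the source name follows the convention above."""
    return SOURCE_NAME_PATTERN.fullmatch(name) is not None
```

Under this rule the two good examples above pass, while "file1.pdf" (a single generic token) and "doc_final_FINAL_v2.pdf" (mixed case) are rejected.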

Index for Performance

Create indexes for faster searches:

-- Vector similarity index
CREATE INDEX ON documents USING ivfflat (embedding vector_cosine_ops);

-- Metadata indexes
CREATE INDEX ON documents(source);
CREATE INDEX ON documents(updated_at);

Troubleshooting

Connection Failed

Check connection string format:

postgresql://username:password@host:port/database
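A quick structural check can catch a malformed connection string before any network troubleshooting. The following is a minimal sketch using Python's standard library. The function name and the messages are hypothetical, and the check covers only the URI shape shown above, not whether the server is reachable.

```python
from typing import List
from urllib.parse import urlsplit


def check_connection_string(uri: str) -> List[str]:
    """Return a list of problems found in a postgresql:// URI (empty if none)."""
    try:
        parts = urlsplit(uri)
        parts.port  # accessing the port raises ValueError if it is not a number
    except ValueError as exc:
        return [f"could not parse URI: {exc}"]

    problems = []
    if parts.scheme not in ("postgresql", "postgres"):
        problems.append(f"unexpected scheme: {parts.scheme!r}")
    if not parts.username:
        problems.append("missing username")
    if not parts.hostname:
        problems.append("missing host")
    if not parts.path.strip("/"):
        problems.append("missing database name")
    return problems
```

A common cause of failures is an unencoded special character in the password: a raw "/" ends the host portion early, so the string no longer parses as intended and the function reports a problem.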

Verify network access:

Can Arcbeam reach your database?
Is your database behind a firewall?
Consider using a VPN or allowlisting Arcbeam's IP addresses

Test connection manually:

psql "postgresql://user:pass@host:port/db"

Documents Not Linking

Check ID column mapping:

Ensure ID column name is correct
Verify IDs in traces match IDs in database
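A frequent cause of mismatches is formatting: the same UUID may appear in uppercase in traces and lowercase in the database. The sketch below compares two ID collections after normalizing them. It assumes you have already exported the IDs from your traces and from your table as strings, and the function names are hypothetical.

```python
import uuid
from typing import Iterable, Set


def normalize_id(raw: str) -> str:
    """Canonicalize UUIDs (lowercase, hyphenated); leave other IDs trimmed as-is."""
    text = raw.strip()
    try:
        return str(uuid.UUID(text))
    except ValueError:
        return text  # not a UUID, compare as a plain string


def find_unmatched_ids(trace_ids: Iterable[str], db_ids: Iterable[str]) -> Set[str]:
    """Return trace IDs that have no counterpart among the database IDs."""
    known = {normalize_id(i) for i in db_ids}
    return {normalize_id(i) for i in trace_ids} - known
```

An empty result means every traced ID exists in the table. Any ID that is returned points to a document that was never stored, was deleted, or is being logged under a different identifier.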

Trigger manual sync:

Go to Datasets → Sync Now
Wait for completion
Check if documents now appear

Verify data exists:

SELECT COUNT(*) FROM documents;

Slow Syncs

Optimize table:

VACUUM ANALYZE documents;

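Add an index on the ID column (not needed if the column is already a primary key, which PostgreSQL indexes automatically):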

CREATE INDEX ON documents(id);

Limit metadata columns:

Map only the fields you need; fewer columns reduce sync time

Next Steps

Add Data Sources

Step-by-step guide to connecting PGvector

Data Lineage

Understand how lineage tracking works

See What Data Is Used

Explore document usage analytics

Trace Issues to Source Data

Debug problems using lineage