Connect your PostgreSQL database to Arcbeam to enable full tracking. See which documents influence each AI response and track data usage over time.

What Is PGvector?

PGvector is a PostgreSQL extension for vector similarity search. It is a widely used open-source option for storing and querying embeddings, especially in self-hosted applications. PGvector allows you to:
  • Store vector embeddings alongside your data
  • Perform similarity searches using various distance metrics
  • Scale to millions of vectors
  • Keep everything in PostgreSQL (no additional infrastructure)

Why Connect PGvector to Arcbeam?

Data Lineage

Link AI traces to the exact documents that influenced them. Trace wrong answers back to outdated documents.

Usage Analytics

Track which documents are retrieved most often, which are never used, and retrieval patterns over time.

Compliance

Demonstrate data provenance with a full audit trail from output to source.

Prerequisites

Before connecting PGvector to Arcbeam, you need:

1. PostgreSQL with PGvector

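Enable the PGvector extension in your database (the extension itself must already be installed on the PostgreSQL server):
CREATE EXTENSION IF NOT EXISTS vector;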

2. Table with Vector Data

A table containing your documents and embeddings:
CREATE TABLE documents (
  id UUID PRIMARY KEY,
  content TEXT,
  embedding VECTOR(1536),  -- Dimension depends on your model
  source TEXT,
  title TEXT,
  updated_at TIMESTAMP DEFAULT NOW()
);
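
The declared dimension has to match the length of the vectors your embedding model produces; PGvector rejects a vector of a different length. A minimal Python sketch of a pre-insert check that also formats the vector in PGvector's bracketed text form. The helper name `to_vector_literal` is a hypothetical example, not part of Arcbeam or PGvector:

```python
def to_vector_literal(embedding, expected_dim=1536):
    """Format a list of numbers as a pgvector text literal such as '[0.1,0.2]'.

    Raises ValueError when the length does not match the column's declared dimension.
    """
    if len(embedding) != expected_dim:
        raise ValueError(
            f"expected {expected_dim} dimensions, got {len(embedding)}"
        )
    # repr(float(x)) keeps full precision and avoids locale-specific formatting.
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


# Tiny 3-dimensional example so the output is easy to read.
print(to_vector_literal([0.25, -1.0, 3.0], expected_dim=3))
```

The resulting string can be passed as a query parameter and cast to the vector type on the database side.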

3. Database Credentials

Read access to the database:
postgresql://username:password@host:port/database
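
Passwords that contain characters such as @, : or / need to be percent-encoded before they go into the connection URL. A minimal Python sketch using only the standard library; the function name `build_connection_url` and the sample values are illustrative assumptions:

```python
from urllib.parse import quote, unquote, urlsplit


def build_connection_url(user, password, host, port, database):
    """Assemble a postgresql:// URL, percent-encoding the user and password."""
    # safe="" makes quote() encode "/" as well, so it cannot be read as a path separator.
    return (
        f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{database}"
    )


# Hypothetical credentials, chosen to include characters that need encoding.
url = build_connection_url("arcbeam_readonly", "p@ss:w/rd", "db.example.com", 5432, "mydb")
print(url)

# Round-trip check: parsing the URL recovers the original pieces.
parts = urlsplit(url)
assert parts.hostname == "db.example.com"
assert parts.port == 5432
assert unquote(parts.password) == "p@ss:w/rd"
```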

4. Arcbeam Project

An existing project where you’ll connect the dataset.

Connecting PGvector

1. Navigate to Datasets

Go to the Datasets page in Arcbeam and click New Dataset.

2. Select PGvector

Choose PGvector as the dataset type.

3. Enter connection details

  • Connection String: postgresql://user:pass@host:port/db
  • Table Name: Name of your documents table (e.g., documents)

4. Map schema columns

  • ID Column: Unique identifier (e.g., id)
  • Content Column: Document text (e.g., content)
  • Embedding Column: Vector field (e.g., embedding)
  • Metadata Columns: Optional fields like source, title, updated_at

5. Test and Create

Click Test Connection to verify, then click Create Dataset.

Your PGvector database is now connected to Arcbeam!

Syncing Data

Arcbeam periodically syncs metadata from your PGvector database.

Manual Sync

Trigger a sync anytime:
  1. Go to the Datasets page
  2. Find your PGvector dataset
  3. Click Sync Now
  4. Wait for completion (usually < 1 minute for small datasets)

What Gets Synced

Data Type        | Synced? | Description
Document IDs     | ✓ Yes   | To match retrieved docs in traces
Content          | ✓ Yes   | For display in trace details
Metadata columns | ✓ Yes   | Only fields you explicitly map
Embeddings       | Never   | Not needed; saves storage and bandwidth
Other tables     | Never   | Only the table you specify is read
Unmapped fields  | Never   | Stay in your database unless explicitly mapped

Sync Performance

Sync time grows with dataset size:
  • < 10k docs: Syncs in seconds
  • 10k-100k docs: Syncs in minutes
  • 100k+ docs: May take longer

Security & Privacy

Connection Security

  • Connections use SSL/TLS encryption
  • Credentials stored encrypted at rest
  • Access limited to read-only queries
  • No write access to your database

Data Access

Arcbeam queries:
  • Only the table you specify
  • Only columns you map
  • Read-only SELECT queries
Arcbeam does not:
  • Modify your data
  • Access other tables
  • Store full embeddings
  • Keep entire documents (only excerpts for display)
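
To make this scope concrete, a column mapping like the one configured during setup corresponds to a query that names only the mapped columns. The Python sketch below is illustrative only: `build_sync_query` and `quote_ident` are hypothetical helpers, and the queries Arcbeam actually issues are not documented here.

```python
def quote_ident(name):
    """Quote a PostgreSQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_sync_query(table, id_column, content_column, metadata_columns=()):
    """Build a read-only SELECT that names only the mapped columns.

    The embedding column is deliberately not a parameter, so it is never selected.
    """
    columns = [id_column, content_column, *metadata_columns]
    column_list = ", ".join(quote_ident(c) for c in columns)
    return f"SELECT {column_list} FROM {quote_ident(table)}"


query = build_sync_query("documents", "id", "content", ["source", "title", "updated_at"])
print(query)
```

Columns that are not listed in the mapping never appear in the statement, which is the property the lists above describe.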

Credentials Management

Best practices:
  • Use a dedicated read-only database user
  • Limit that user's access to the specific table
  • Rotate credentials periodically
  • Set connection limits for the user or connection pool
Example read-only user:
CREATE USER arcbeam_readonly WITH PASSWORD 'secure_password';
GRANT CONNECT ON DATABASE mydb TO arcbeam_readonly;
GRANT USAGE ON SCHEMA public TO arcbeam_readonly;
GRANT SELECT ON documents TO arcbeam_readonly;

What You Can Do

Once your PGvector database is connected, you can:

Debug Wrong Answers

Trace incorrect AI responses back to the source documents and fix them

Optimize Your Knowledge Base

Find and remove unused documents to improve retrieval speed and quality

Track Content Updates

Identify high-traffic documents that need updating

Compliance Reporting

Generate audit trails showing complete data provenance
See detailed use cases and workflows →

Best Practices

Keep Source Attribution

Always include a source field:
source TEXT  -- "refund_policy_v2024.pdf"

Track Update Timestamps

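Maintain updated_at for tracking:
updated_at TIMESTAMP DEFAULT NOW()
DEFAULT NOW() only applies when a row is inserted, so set the timestamp explicitly whenever a row changes:
-- $1 and $2 are placeholders for the new content and the document's id
UPDATE documents
SET content = $1, updated_at = NOW()
WHERE id = $2;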

Use Meaningful Source Names

Consistent, descriptive names:
✅ Good: "user_manual_2024.pdf", "api_docs_v3.html"
❌ Bad: "file1.pdf", "doc_final_FINAL_v2.pdf"

Index for Performance

Create indexes for faster searches:
-- Vector similarity index (create it after loading data, since ivfflat builds its lists from existing rows)
CREATE INDEX ON documents USING ivfflat (embedding vector_cosine_ops);

-- Metadata indexes
CREATE INDEX ON documents(source);
CREATE INDEX ON documents(updated_at);

Troubleshooting

Connection Failed

Check connection string format:
postgresql://username:password@host:port/database
Verify network access:
  • Can Arcbeam reach your database?
  • Is your database behind a firewall?
  • Consider using a VPN or allowlisting Arcbeam's IP addresses
Test connection manually:
psql "postgresql://user:pass@host:port/db"
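
If psql is not available on the machine you are testing from, a plain TCP check can help separate network problems from authentication problems. A minimal Python sketch using the standard library; `host_and_port` and `can_reach` are hypothetical helpers, and a successful check only shows that the port is open from where the script runs, not that Arcbeam can reach it.

```python
import socket
from urllib.parse import urlsplit


def host_and_port(connection_url, default_port=5432):
    """Extract (host, port) from a postgresql:// URL; 5432 is PostgreSQL's default port."""
    parts = urlsplit(connection_url)
    return parts.hostname, parts.port or default_port


def can_reach(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


# The URL below is a placeholder; substitute your own connection string.
host, port = host_and_port("postgresql://user:pass@db.example.com/mydb")
print(host, port)
```

Calling `can_reach(host, port)` with your real host then reports whether the port accepts connections. If it returns True but the connection test still fails, the credentials or database name are the more likely cause.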

Documents Not Linking

Check ID column mapping:
  • Ensure ID column name is correct
  • Verify IDs in traces match IDs in database
Trigger manual sync:
  • Go to Datasets → Sync Now
  • Wait for completion
  • Check if documents now appear
Verify data exists:
SELECT COUNT(*) FROM documents;
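
One possible cause of missing links is an ID format mismatch, for example a trace that logs an uppercase or unhyphenated UUID while PostgreSQL returns the lowercase hyphenated form. The Python sketch below normalizes both sides before comparing. `find_unmatched_ids` and the sample IDs are hypothetical, and it assumes the ID column is a UUID as in the example schema.

```python
import uuid


def normalize_id(value):
    """Return the canonical lowercase, hyphenated form of a UUID string."""
    return str(uuid.UUID(value))


def find_unmatched_ids(trace_ids, database_ids):
    """Return normalized trace IDs that have no counterpart in the database."""
    known = {normalize_id(i) for i in database_ids}
    return sorted({normalize_id(i) for i in trace_ids} - known)


# Made-up sample data for illustration.
database_ids = ["7f9c24e8-3b12-4fef-91e0-5a6c1d2b3e4f"]
trace_ids = [
    "7F9C24E8-3B12-4FEF-91E0-5A6C1D2B3E4F",  # same document, uppercase
    "7f9c24e83b124fef91e05a6c1d2b3e4f",      # same document, no hyphens
    "00000000-0000-4000-8000-000000000001",  # not in the table
]
unmatched = find_unmatched_ids(trace_ids, database_ids)
print(unmatched)
```

Any ID left in the result has no matching row once formats are normalized, which points to a missing document or a wrong ID column mapping instead of a formatting difference.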

Slow Syncs

Optimize table:
VACUUM ANALYZE documents;
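Make sure the ID column is indexed. A PRIMARY KEY column, as in the example schema, already has an index; otherwise add one:
CREATE INDEX ON documents(id);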
Limit metadata columns:
  • Only sync fields you need
  • Reduces sync time

Next Steps