What Is PGvector?
PGvector is a PostgreSQL extension for vector similarity search. It is one of the most widely used open-source options for vector search, especially for self-hosted applications. PGvector allows you to:
- Store vector embeddings alongside your data
- Perform similarity searches using various distance metrics
- Scale to millions of vectors
- Keep everything in PostgreSQL (no additional infrastructure)
Why Connect PGvector to Arcbeam?
Data Lineage
Link AI traces to the exact documents that influenced them. Trace wrong answers back to outdated documents.
Usage Analytics
Track which documents are retrieved most often, which are never used, and retrieval patterns over time.
Compliance
Demonstrate data provenance with full audit trail from output to source.
Prerequisites
Before connecting PGvector to Arcbeam, you need:
1. PostgreSQL with PGvector
Install the PGvector extension.
2. Table with Vector Data
A table containing your documents and embeddings.
3. Database Credentials
Read access to the database.
4. Arcbeam Project
An existing project where you’ll connect the dataset.
Connecting PGvector
- Via UI
- Via API
Enter connection details
- Connection String: postgresql://user:pass@host:port/db
- Table Name: Name of your documents table (e.g., documents)
Map schema columns
- ID Column: Unique identifier (e.g., id)
- Content Column: Document text (e.g., content)
- Embedding Column: Vector field (e.g., embedding)
- Metadata Columns: Optional fields like source, title, updated_at
Syncing Data
Arcbeam periodically syncs metadata from your PGvector database.
Manual Sync
Trigger a sync anytime:
- Go to the Datasets page
- Find your PGvector dataset
- Click Sync Now
- Wait for completion (usually < 1 minute for small datasets)
What Gets Synced
| Data Type | Synced? | Description |
|---|---|---|
| Document IDs | ✓ Yes | To match retrieved docs in traces |
| Content | ✓ Yes | For display in trace details |
| Metadata columns | ✓ Yes | Only fields you explicitly map |
| Embeddings | ✗ Never | Not needed, saves storage and bandwidth |
| Other tables | ✗ Never | Only the table you specify |
| Unmapped fields | ✗ Never | Sensitive data stays in your database unless you explicitly map it |
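The table above can be read as a rule for the shape of a sync query: select the ID, content, and mapped metadata columns, and leave the embedding out. The sketch below illustrates that rule only. `ColumnMapping` and `build_sync_query` are hypothetical names, not part of Arcbeam, and the query Arcbeam actually issues may differ.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnMapping:
    """Hypothetical description of a mapped table (not an Arcbeam API)."""
    table: str
    id_column: str
    content_column: str
    embedding_column: str
    metadata_columns: list[str] = field(default_factory=list)


def quote_ident(name: str) -> str:
    """Quote a PostgreSQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_sync_query(mapping: ColumnMapping) -> str:
    """Build a read-only SELECT over mapped columns, never the embedding."""
    columns = [mapping.id_column, mapping.content_column, *mapping.metadata_columns]
    # Drop the embedding column even if it was listed as metadata by mistake.
    columns = [c for c in columns if c != mapping.embedding_column]
    select_list = ", ".join(quote_ident(c) for c in columns)
    return f"SELECT {select_list} FROM {quote_ident(mapping.table)}"


mapping = ColumnMapping(
    table="documents",
    id_column="id",
    content_column="content",
    embedding_column="embedding",
    metadata_columns=["source", "title", "updated_at"],
)
query = build_sync_query(mapping)
print(query)
```

Any column that is not named in the mapping never appears in the select list, which is the behavior the "Unmapped fields" row describes.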
Sync Performance
Sync time depends on dataset size:
- < 10k docs: Sync in seconds
- 10k-100k docs: Sync in minutes
- 100k+ docs: Sync may take longer
Security & Privacy
Connection Security
- Connections use SSL/TLS encryption
- Credentials stored encrypted at rest
- Access limited to read-only queries
- No write access to your database
Data Access
Arcbeam queries:
- Only the table you specify
- Only columns you map
- Read-only SELECT queries
Arcbeam does not:
- Modify your data
- Access other tables
- Store full embeddings
- Keep entire documents (only excerpts for display)
Credentials Management
Best practices:
- Use a read-only database user
- Limit access to the specific table
- Rotate credentials periodically
- Use connection pooling limits
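One way to apply the first, second, and fourth practices is a dedicated PostgreSQL role with a connection limit and column-level SELECT grants. The sketch below only generates the SQL statements. The role name `arcbeam_reader`, the column list, and the limit of 5 are assumptions for illustration, and the helper `read_only_role_sql` is hypothetical.

```python
import re

# Accept only simple lowercase identifiers so names can be embedded safely.
_IDENT = re.compile(r"^[a-z_][a-z0-9_]*$")


def read_only_role_sql(role, table, columns, schema="public", connection_limit=5):
    """Return SQL statements that create a SELECT-only role for one table."""
    for name in (role, table, schema, *columns):
        if not _IDENT.match(name):
            raise ValueError(f"unexpected identifier: {name!r}")
    column_list = ", ".join(columns)
    return [
        # No password is set here; assign one separately (e.g. \password in psql).
        f"CREATE ROLE {role} LOGIN CONNECTION LIMIT {int(connection_limit)};",
        f"GRANT USAGE ON SCHEMA {schema} TO {role};",
        # Column-level grant: the role can read only the listed columns.
        f"GRANT SELECT ({column_list}) ON {schema}.{table} TO {role};",
    ]


statements = read_only_role_sql(
    "arcbeam_reader",
    "documents",
    ["id", "content", "source", "title", "updated_at"],
)
for statement in statements:
    print(statement)
```

Run the printed statements as a database administrator. This page does not say whether Arcbeam needs SELECT on the embedding column to validate the mapping, so add that column to the list if the connection check fails.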
What You Can Do
Once your PGvector database is connected, you can:
Debug Wrong Answers
Trace incorrect AI responses back to the source documents and fix them
Optimize Your Knowledge Base
Find and remove unused documents to improve retrieval speed and quality
Track Content Updates
Identify high-traffic documents that need updating
Compliance Reporting
Generate audit trails showing complete data provenance
Best Practices
Keep Source Attribution
Always include a source field.
Track Update Timestamps
Maintain an updated_at column for tracking.
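The two practices above (a source field and an updated_at timestamp) can be built into the table definition. The sketch below generates DDL for such a table plus a trigger that refreshes updated_at on every update. The column layout, the 1536-dimension vector, and the helper name `documents_table_sql` are assumptions for illustration; match the dimension to your embedding model. `EXECUTE FUNCTION` requires PostgreSQL 11 or later.

```python
def documents_table_sql(table: str = "documents", dimensions: int = 1536) -> str:
    """Return DDL for a documents table with source and updated_at columns."""
    if not table.isidentifier() or dimensions <= 0:
        raise ValueError("table must be a simple identifier and dimensions positive")
    return f"""
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS {table} (
    id          BIGSERIAL PRIMARY KEY,
    content     TEXT NOT NULL,
    embedding   vector({dimensions}),
    source      TEXT NOT NULL,  -- where the document came from
    title       TEXT,
    updated_at  TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Refresh updated_at on every UPDATE so stale documents are easy to spot.
CREATE OR REPLACE FUNCTION set_updated_at() RETURNS trigger AS $$
BEGIN
    NEW.updated_at = now();
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER {table}_set_updated_at
    BEFORE UPDATE ON {table}
    FOR EACH ROW EXECUTE FUNCTION set_updated_at();
""".strip()


ddl = documents_table_sql()
print(ddl)
```

Declaring source as NOT NULL means a document cannot be inserted without attribution, and the trigger keeps updated_at accurate even when the application forgets to set it.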
Use Meaningful Source Names
Use consistent, descriptive names.
Index for Performance
Create an index on the embedding column for faster searches; PGvector supports HNSW and IVFFlat indexes.
Troubleshooting
Connection Failed
Check the connection string format, then check network access:
- Can Arcbeam reach your database?
- Is your database behind a firewall?
- Consider using a VPN or allowlisting Arcbeam IPs
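Before looking at the network, it can help to confirm that the connection string parses as expected. The sketch below uses only the Python standard library. The function name `check_connection_string` and the example host are hypothetical, and the checks follow the `postgresql://user:pass@host:port/db` format shown earlier rather than any validation Arcbeam performs.

```python
from urllib.parse import urlsplit


def check_connection_string(dsn: str) -> list[str]:
    """Return a list of format problems found in a PostgreSQL connection URL."""
    try:
        parts = urlsplit(dsn)
    except ValueError as exc:
        return [f"could not parse URL: {exc}"]

    problems = []
    if parts.scheme not in ("postgresql", "postgres"):
        problems.append("scheme should be postgresql://")
    if not parts.username:
        problems.append("user is missing")
    if not parts.hostname:
        problems.append("host is missing")
    try:
        parts.port  # raises ValueError when the port is not a valid number
    except ValueError:
        problems.append("port is not a valid number")
    if not parts.path.strip("/"):
        problems.append("database name is missing")
    return problems


print(check_connection_string("postgresql://reader:secret@db.example.com:5432/app"))
print(check_connection_string("postgresql://reader:secret@db.example.com:5432/"))
```

A password that contains characters such as `@` or `/` must be percent-encoded (for example with `urllib.parse.quote(password, safe="")`), otherwise the host portion is parsed incorrectly.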
Documents Not Linking
Check ID column mapping:
- Ensure the ID column name is correct
- Verify that IDs in traces match IDs in the database
Then re-sync the dataset:
- Go to Datasets → Sync Now
- Wait for completion
- Check if documents now appear
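A common cause of unlinked documents is an ID that looks the same but differs in type or whitespace, such as the integer 42 in the database and the string "42 " in a trace. The sketch below compares the two sets after normalizing to trimmed strings. `find_unlinked_ids` is a hypothetical helper, and how you export IDs from traces and from the table is left to you.

```python
def find_unlinked_ids(trace_doc_ids, database_ids):
    """Return trace document IDs that have no match among the database IDs.

    Both sides are compared as trimmed strings so that 42 and "42 " match.
    """
    known = {str(value).strip() for value in database_ids}
    seen = {str(value).strip() for value in trace_doc_ids}
    return sorted(seen - known)


# IDs reported in traces (strings) versus IDs in the table (integers).
missing = find_unlinked_ids(["1", "2 ", "doc-9"], [1, 2, 3])
print(missing)  # ['doc-9']
```

An empty result means every traced ID exists in the table, so the problem is more likely the column mapping or a sync that has not run yet.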
Slow Syncs
Optimize the table and limit what is synced:
- Only map the fields you need; fewer mapped columns reduce sync time
