Proposal: Multi-Engine Support for Guppy Preparation Repository
1. Executive Summary
As Guppy targets enterprise-level data ingestion (10TB+ uploads), the current reliance on SQLite for prep-state metadata introduces significant operational risks and performance bottlenecks. I propose adding experimental support for PostgreSQL to provide the concurrency, reliability, and backup capabilities required for large-scale production environments.
2. Problem Statement
The current `sqlrepo` implementation is optimized for the "Developer Experience" of a CLI tool: it is self-contained and requires zero configuration. However, for "VERY large uploads," three major root causes have been identified:
2.1 Concurrency Bottlenecks
SQLite is fundamentally a single-writer database. To avoid "database is locked" errors during parallel preparation/sharding, the current implementation restricts the connection pool to a single open connection (`MaxOpenConns = 1`).
This forces all metadata writes—tracking millions of IPLD nodes, CIDs, and shard offsets—through a single connection, creating a serial bottleneck that limits the throughput of multi-threaded data preparation.
2.2 Filesystem Lock Contention (NAS/NFS)
Enterprise datasets often reside on Network Attached Storage (NAS) or Distributed Filesystems (NFS/SMB).
- Root Cause: SQLite’s file-locking mechanism relies on consistent filesystem-level locking implementations. Many network filesystems have buggy or high-latency lock management.
- Result: This leads to "Disk I/O errors" or extremely degraded performance when the database file is stored on the same NAS as the data being scanned.
2.3 Operational Backup Risks
A 10TB upload may take several days to complete.
- Problem: You cannot safely back up a SQLite database while it is being actively written to without risking corruption or stopping the process.
- Postgres Advantage: Postgres supports online hot backups (WAL archiving), allowing operators to secure the progress of a massive ingestion task without interruption.
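For reference, enabling continuous WAL archiving on the Postgres side takes only a few settings; this is a minimal sketch, and the archive destination path is illustrative:

```
# postgresql.conf — minimal WAL-archiving sketch (paths are illustrative)
wal_level = replica
archive_mode = on
archive_command = 'cp %p /backups/wal/%f'
```

Combined with a base backup (e.g. `pg_basebackup`), this lets an operator restore prep-state metadata to any point during a multi-day ingestion run.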
3. Proposed Solution: Dialect-Agnostic SQL Repository
I propose refactoring pkg/preparation/sqlrepo to support multiple database engines while keeping SQLite as the default for standard users.
3.1 Architectural Changes
- Dialect Abstraction: Implement a lightweight query transformer that handles the syntax difference between SQLite (`?`) and Postgres (`$1`, `$2`) placeholders.
- Swappable Drivers: Integrate `pgx` (PostgreSQL) alongside the existing CGO-free `modernc.org/sqlite` driver.
- Config-Driven Initialization: Add `database_type` and `database_url` to the Guppy configuration.
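The placeholder transformer described above could be as simple as the following sketch. The function name is illustrative, and a production version would also need to skip `?` characters inside string literals:

```go
package main

import (
	"fmt"
	"strings"
)

// rewritePlaceholders converts SQLite-style '?' placeholders to
// Postgres-style '$1', '$2', ... numbered placeholders. Illustrative
// sketch: it does not handle '?' inside quoted string literals.
func rewritePlaceholders(query string) string {
	var b strings.Builder
	n := 0
	for _, r := range query {
		if r == '?' {
			n++
			fmt.Fprintf(&b, "$%d", n)
		} else {
			b.WriteRune(r)
		}
	}
	return b.String()
}

func main() {
	q := "INSERT INTO nodes (cid, size) VALUES (?, ?)"
	fmt.Println(rewritePlaceholders(q))
}
```

Wrapping `PrepareContext` with this transformer keeps every call site writing plain `?` placeholders regardless of the active engine.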
3.2 Migration Strategy
- Goose Dialects: Leverage the existing `goose` migration system to handle dialect-specific Up and Down migrations.
- Schema Standardization: Refactor the current `schema.sql` to avoid SQLite-only keywords like `STRICT` and `PRAGMA` in the shared path, moving engine-specific tuning to the initialization phase.
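A dialect-neutral migration under this strategy might look like the sketch below. Table and column names are hypothetical, not taken from the actual `schema.sql`; the point is the absence of SQLite-only keywords such as `STRICT`, keeping the DDL valid on both engines:

```sql
-- Hypothetical goose migration (names are illustrative).

-- +goose Up
CREATE TABLE shard_offsets (
    shard_id   BIGINT NOT NULL,
    cid        TEXT   NOT NULL,
    byte_start BIGINT NOT NULL,
    PRIMARY KEY (shard_id, cid)
);

-- +goose Down
DROP TABLE shard_offsets;
```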
4. Implementation Plan (Experimental)
- Phase 1: Configuration: Update `pkg/config` to allow users to opt in to Postgres via environment variables or a configuration file.
- Phase 2: Repo Refactor: Introduce a `dialect` field to the `Repo` struct and wrap `PrepareContext` calls in a rewriter.
- Phase 3: Connection Management: Update `OpenRepo` in `pkg/preparation` to branch logic based on the configured driver.
- Phase 4: Verification: Test against standard Postgres Docker containers to ensure CID and DID scanning behavior remains identical to SQLite.
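The Phase 1–3 branching might look like the following sketch. The `Config` field names and `openRepo` helper are illustrative; the actual `OpenRepo` signature in `pkg/preparation` may differ. The driver names are the ones registered by `modernc.org/sqlite` ("sqlite") and the `pgx` stdlib adapter ("pgx"):

```go
package main

import (
	"database/sql"
	"fmt"
)

// Config holds the proposed new settings (field names are illustrative).
type Config struct {
	DatabaseType string // "sqlite" (default) or "postgres"
	DatabaseURL  string
}

// driverName maps the configured engine to a database/sql driver name.
func driverName(databaseType string) (string, error) {
	switch databaseType {
	case "", "sqlite":
		return "sqlite", nil // registered by modernc.org/sqlite
	case "postgres":
		return "pgx", nil // registered by github.com/jackc/pgx/v5/stdlib
	default:
		return "", fmt.Errorf("unsupported database_type %q", databaseType)
	}
}

// openRepo sketches the connection-management branch: one code path,
// two engines, selected entirely by configuration.
func openRepo(cfg Config) (*sql.DB, error) {
	name, err := driverName(cfg.DatabaseType)
	if err != nil {
		return nil, err
	}
	return sql.Open(name, cfg.DatabaseURL)
}

func main() {
	name, _ := driverName("postgres")
	fmt.Println(name)
}
```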
5. Benefits for Enterprise
- High Concurrency: Enable `MaxOpenConns > 1` to scale preparation speed with CPU core count.
- Distributed Architecture: Decouple the "Brain" (metadata) from the "Worker" (CLI). The CLI can run on a storage node while the DB runs on a managed database service (e.g., RDS, Cloud SQL).
- Data Integrity: Offload locking and transaction management to a robust server-client database model, eliminating NAS-related locking issues.
Status: Experimental / Proposed
Target Package: pkg/preparation/sqlrepo