Skip to content

Proposal: Multi-Engine Support for Guppy Preparation Repository #320

@laciferin2024

Description

@laciferin2024

Proposal: Multi-Engine Support for Guppy Preparation Repository

1. Executive Summary

As Guppy targets enterprise-level data ingestion (10TB+ uploads), the current reliance on SQLite for prep-state metadata introduces significant operational risks and performance bottlenecks. I propose adding experimental support for PostgreSQL to provide the concurrency, reliability, and backup capabilities required for large-scale production environments.

2. Problem Statement

The current sqlrepo implementation is optimized for the "Developer Experience" of a CLI tool—it is self-contained and requires zero configuration. However, for "VERY large uploads," two major root causes have been identified:

2.1 Concurrency Bottlenecks

SQLite is fundamentally a single-writer database. To avoid "Database is locked" errors during parallel preparation/sharding, the current implementation enforces:

db.SetMaxOpenConns(1)

This forces all metadata writes—tracking millions of IPLD nodes, CIDs, and shard offsets—through a single connection, creating a serial bottleneck that limits the throughput of multi-threaded data preparation.

2.2 Filesystem Lock Contention (NAS/NFS)

Enterprise datasets often reside on Network Attached Storage (NAS) or Distributed Filesystems (NFS/SMB).

  • Root Cause: SQLite’s file-locking mechanism relies on consistent filesystem-level locking implementations. Many network filesystems have buggy or high-latency lock management.
  • Result: This leads to "Disk I/O errors" or extremely degraded performance when the database file is stored on the same NAS as the data being scanned.

2.3 Operational Backup Risks

A 10TB upload may take several days to complete.

  • Problem: You cannot safely backup a SQLite database while it is being actively written to without risk of corruption or needing to stop the process.
  • Postgres Advantage: Postgres supports online hot-backups (WAL archiving), allowing operators to secure the progress of a massive ingestion task without interruption.

3. Proposed Solution: Dialect-Agnostic SQL Repository

I propose refactoring pkg/preparation/sqlrepo to support multiple database engines while keeping SQLite as the default for standard users.

3.1 Architectural Changes

  • Dialect Abstraction: Implement a lightweight query transformer that handles the syntax difference between SQLite (?) and Postgres ($1, $2) placeholders.
  • Swappable Drivers: Integrate pgx (PostgreSQL) alongside the existing CGO-free modernc.org/sqlite driver.
  • Config-Driven Initialization: Add database_type and database_url to the Guppy configuration.

3.2 Migration Strategy

  • Goose Dialects: Leverage the existing goose migration system to handle dialect-specific Up and Down migrations.
  • Schema Standardization: Refactor the current schema.sql to avoid SQLite-only keywords like STRICT and PRAGMA in the shared path, moving engine-specific tuning to the initialization phase.

4. Implementation Plan (Experimental)

  1. Phase 1: Configuration: Update pkg/config to allow users to opt-in to Postgres via environment variables or a configuration file.
  2. Phase 2: Repo Refactor: Introduce the dialect field to the Repo struct and wrap PrepareContext calls in a rewriter.
  3. Phase 3: Connection Management: Update OpenRepo in pkg/preparation to branch logic based on the driver.
  4. Phase 4: Verification: Test against standard Postgres Docker containers to ensure CID and DID scanning remains identical to SQLite behavior.

5. Benefits for Enterprise

  • High Concurrency: Enable MaxOpenConns > 1 to scale preparation speed with CPU core count.
  • Distributed Architecture: Decouple the "Brain" (metadata) from the "Worker" (CLI). The CLI can run on a storage node while the DB runs on a managed database service (e.g., RDS, Cloud SQL).
  • Data Integrity: Offload locking and transaction management to a robust server-client database model, eliminating NAS-related locking issues.

Status: Experimental / Proposed
Target Package: pkg/preparation/sqlrepo

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions