Proposal: Multi-Engine Support for Guppy Preparation Repository
1. Executive Summary
As Guppy targets enterprise-level data ingestion (10TB+ uploads), the current reliance on SQLite for prep-state metadata introduces significant operational risks and performance bottlenecks. I propose adding experimental support for PostgreSQL to provide the concurrency, reliability, and backup capabilities required for large-scale production environments.
2. Problem Statement
The current `sqlrepo` implementation is optimized for the "Developer Experience" of a CLI tool: it is self-contained and requires zero configuration. However, for "VERY large uploads," three major root causes have been identified:
2.1 Concurrency Bottlenecks
SQLite is fundamentally a single-writer database. To avoid "database is locked" errors during parallel preparation/sharding, the current implementation restricts the connection pool to a single open connection (`MaxOpenConns = 1`).
This forces all metadata writes—tracking millions of IPLD nodes, CIDs, and shard offsets—through a single connection, creating a serial bottleneck that limits the throughput of multi-threaded data preparation.
2.2 Filesystem Lock Contention (NAS/NFS)
Enterprise datasets often reside on Network Attached Storage (NAS) or Distributed Filesystems (NFS/SMB).
- Root Cause: SQLite’s file-locking mechanism relies on consistent filesystem-level locking implementations. Many network filesystems have buggy or high-latency lock management.
- Result: This leads to "Disk I/O errors" or extremely degraded performance when the database file is stored on the same NAS as the data being scanned.
2.3 Operational Backup Risks
A 10TB upload may take several days to complete.
- Problem: You cannot safely back up a SQLite database while it is being actively written to without risking corruption or stopping the process.
- Postgres Advantage: Postgres supports online hot backups (WAL archiving), allowing operators to secure the progress of a massive ingestion task without interruption.
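For reference, enabling continuous WAL archiving on the Postgres side takes only a few settings; this is a minimal sketch, and the archive destination path is illustrative:

```
# postgresql.conf — minimal WAL-archiving sketch (paths are illustrative)
wal_level = replica
archive_mode = on
archive_command = 'cp %p /backups/wal/%f'
```

Combined with a base backup (e.g. `pg_basebackup`), this lets an operator restore prep-state metadata to any point during a multi-day ingestion run.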
3. Proposed Solution: Dialect-Agnostic SQL Repository
I propose refactoring pkg/preparation/sqlrepo to support multiple database engines while keeping SQLite as the default for standard users.
3.1 Architectural Changes
- Dialect Abstraction: Implement a lightweight query transformer that handles the syntax difference between SQLite (`?`) and Postgres (`$1`, `$2`) placeholders.
- Swappable Drivers: Integrate `pgx` (PostgreSQL) alongside the existing CGO-free `modernc.org/sqlite` driver.
- Config-Driven Initialization: Add `database_type` and `database_url` to the Guppy configuration.
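The placeholder transformer described above could be as simple as the following sketch. The function name is illustrative, and a production version would also need to skip `?` characters inside string literals:

```go
package main

import (
	"fmt"
	"strings"
)

// rewritePlaceholders converts SQLite-style '?' placeholders to
// Postgres-style '$1', '$2', ... numbered placeholders. Illustrative
// sketch: it does not handle '?' inside quoted string literals.
func rewritePlaceholders(query string) string {
	var b strings.Builder
	n := 0
	for _, r := range query {
		if r == '?' {
			n++
			fmt.Fprintf(&b, "$%d", n)
		} else {
			b.WriteRune(r)
		}
	}
	return b.String()
}

func main() {
	q := "INSERT INTO nodes (cid, size) VALUES (?, ?)"
	fmt.Println(rewritePlaceholders(q))
}
```

Wrapping `PrepareContext` with this transformer keeps every call site writing plain `?` placeholders regardless of the active engine.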
3.2 Migration Strategy
- Goose Dialects: Leverage the existing `goose` migration system to handle dialect-specific Up and Down migrations.
- Schema Standardization: Refactor the current `schema.sql` to avoid SQLite-only keywords like `STRICT` and `PRAGMA` in the shared path, moving engine-specific tuning to the initialization phase.
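A dialect-neutral migration under this strategy might look like the sketch below. Table and column names are hypothetical, not taken from the actual `schema.sql`; the point is the absence of SQLite-only keywords such as `STRICT`, keeping the DDL valid on both engines:

```sql
-- Hypothetical goose migration (names are illustrative).

-- +goose Up
CREATE TABLE shard_offsets (
    shard_id   BIGINT NOT NULL,
    cid        TEXT   NOT NULL,
    byte_start BIGINT NOT NULL,
    PRIMARY KEY (shard_id, cid)
);

-- +goose Down
DROP TABLE shard_offsets;
```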
4. Implementation Plan (Experimental)
- Phase 1: Configuration: Update `pkg/config` to allow users to opt in to Postgres via environment variables or a configuration file.
- Phase 2: Repo Refactor: Introduce a `dialect` field to the `Repo` struct and wrap `PrepareContext` calls in a rewriter.
- Phase 3: Connection Management: Update `OpenRepo` in `pkg/preparation` to branch logic based on the configured driver.
- Phase 4: Verification: Test against standard Postgres Docker containers to ensure CID and DID scanning behavior remains identical to SQLite.
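The Phase 1–3 branching might look like the following sketch. The `Config` field names and `openRepo` helper are illustrative; the actual `OpenRepo` signature in `pkg/preparation` may differ. The driver names are the ones registered by `modernc.org/sqlite` ("sqlite") and the `pgx` stdlib adapter ("pgx"):

```go
package main

import (
	"database/sql"
	"fmt"
)

// Config holds the proposed new settings (field names are illustrative).
type Config struct {
	DatabaseType string // "sqlite" (default) or "postgres"
	DatabaseURL  string
}

// driverName maps the configured engine to a database/sql driver name.
func driverName(databaseType string) (string, error) {
	switch databaseType {
	case "", "sqlite":
		return "sqlite", nil // registered by modernc.org/sqlite
	case "postgres":
		return "pgx", nil // registered by github.com/jackc/pgx/v5/stdlib
	default:
		return "", fmt.Errorf("unsupported database_type %q", databaseType)
	}
}

// openRepo sketches the connection-management branch: one code path,
// two engines, selected entirely by configuration.
func openRepo(cfg Config) (*sql.DB, error) {
	name, err := driverName(cfg.DatabaseType)
	if err != nil {
		return nil, err
	}
	return sql.Open(name, cfg.DatabaseURL)
}

func main() {
	name, _ := driverName("postgres")
	fmt.Println(name)
}
```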
5. Benefits for Enterprise
- High Concurrency: Enable `MaxOpenConns > 1` to scale preparation speed with CPU core count.
- Distributed Architecture: Decouple the "Brain" (metadata) from the "Worker" (CLI). The CLI can run on a storage node while the DB runs on a managed database service (e.g., RDS, Cloud SQL).
- Data Integrity: Offload locking and transaction management to a robust server-client database model, eliminating NAS-related locking issues.
Status: Experimental / Proposed
Target Package: pkg/preparation/sqlrepo