Skip to content

burnside-project/pg-cdc

pg-cdc — A frustratingly simple way to create governed operational data from PostgreSQL for AI applications.

Build AI agents from PostgreSQL operational data — no production credentials.

AWS Marketplace
Coming Soon on AWS Marketplace

+-------------------------------+        +----------------------------------------------------+
|       PRODUCTION ZONE         |        |        GOVERNED DATA ZONE                          |
|                               |        |                                                    |
|  PostgreSQL                   |        |  S3  · immutable Iceberg/Parquet                   |
|  self-managed · RDS · Aurora  |        |  Catalog + Governance tags                         |
|         |                     |        |            audit trail                             |
|         |  WAL  (one-way)     |        |        |              |                            |
|         v                     |        |        v              v                            |
|      pg-cdc ------------------|------->|   MCP server     REST API / pg-warehouse(dev Tools)|
|   (CDC engine)                |        |   (AI agents)                                      |
+-------------------------------+        +----------------------------------------------------+

Build ai agent from Postgres Operational Data — no prod credentials.

CI Release License Go Stars

PostgreSQL change data capture → governed Apache Iceberg / Parquet on AWS S3 — built for AI agents.

📅 Book a Meeting

Core Features

  • pg-cdc is not just replication. pg-cdc streams Postgres Write Ahead Logs(WAL) out of production Postgres into typed, immutable, time-travelable Iceberg tables on S3
  • Registers each entities in the AWS Glue Catalog
  • Gates every read with AWS Lake Formation tags — so AI agents, analysts, and query engines consume governed data without ever touching the source database, and without database credentials.

No JVM. One binary.

  • No return path — the WAL is one-way; Parquet is immutable. Agents physically cannot write to production.
  • No database credentials — consumers authenticate via AWS IAM + Lake Formation, never a connection string.
  • Governed by default — untagged data is invisible; Lake Formation tags gate every read, down to the column.
  • Time travel built in — every flush is an Iceberg snapshot; CDC epochs + immutable raw@<ts> tags give historical queries with no database branching.

Security Architecture

See docs/10-security-architecture.md for the full five-property model and docs/11-ai-agent-consumption.md for the two consumption paths.


What makes it unique

A feature-by-feature catalog. The exhaustive, link-rich version lives in docs/00-features.md.

1. CDC engine — Postgres → typed columnar, no triggers

Capability What it does
Logical replication Reads a publication + replication slot, decodes pgoutput v2 (I/U/D). No triggers, no app changes.
Consistent initial snapshot init exports a snapshot at one LSN via CREATE_REPLICATION_SLOT; per-table COPY under SET TRANSACTION SNAPSHOT. Cross-table parallelism (init.parallel_workers).
Streaming WAL → delta Parquet start daemon buffers events and flushes typed Parquet per table on interval / row / byte budget. Per-table parallel flush.
Schema evolution Detects ADD/DROP COLUMN, propagates to manifest + Glue + Iceberg live — no restart. reconcile --table one-shot for hot tables.
Compaction compact merges delta epochs into a new base, applies I/U/D, soft-deletes on a tombstone TTL. CAS-safe alongside the live streamer.
TOAST partial-row preservation Honors TupleDataType 'u' — unchanged TOAST columns keep their value instead of silently becoming NULL.
Versioned manifest A single JSON source of truth (active tables, schema versions, epoch markers, source LSN), CAS-committed on S3.

2. Storage & formats — Iceberg-first on S3

Capability What it does
Apache Iceberg writer catalog.type: iceberg — every flush is an Iceberg snapshot; in-tree schema evolution + background snapshot expiration (default retain 100/table).
Hive/Parquet on Glue catalog.type: glue — register active tables so Athena / Spark / DuckDB read the Parquet directly.
S3 sink Multipart upload (aws-sdk-go-v2 transfer manager), SDK retry middleware, S3-compatible endpoints (MinIO, R2). Configurable bucket/prefix/region/endpoint.
Filesystem sink Zero-dependency local output for dev / on-prem / air-gapped. (GCS sink is a scaffolded stub — not implemented.)
Time-travel refs refs.json: staging advances every flush, main advances by promotion.mode (auto/manual), immutable raw@<ts> tags per flush. Git-shaped consumer workflow via promote.

3. Governance for AI — the part that isn't replication

The differentiator. pg-cdc doesn't just land data — it makes it default-deny governed so it's safe to point an autonomous agent at.

Capability What it does
AWS Glue catalog Tables registered at init and after every schema evolution; idempotent catalog register re-sync; drift metrics.
Lake Formation tag enforcement Tag-policy grants gate reads. Untagged = invisible. Column-level tags supported. The bucket policy + IAMAllowedPrincipals removal make it load-bearing, not advisory.
DynamoDB ACL intent registry Versioned classification per resource (db.schema.table[.column]) with tag inheritance (db→table→column), monotonic versions, and full audit provenance (who / when / why).
acl reconciliation acl register | get | set | list | diff | sync — diffs intent vs live LF and applies AddLFTagsToResource / RemoveLFTagsFromResource. CI-friendly exit codes (0 clean · 3 drift-healed · 2 AWS error).
Query-time enforcement The serve layer resolves each caller's LF grants (ListPermissions, 5-min cache) — columns a role can't see never leave storage. Live, not manifest-based.
Break-glass + audit Time-boxed emergency grants with TTL auto-revoke (GitHub Environment gate, OIDC role) and a 365-day CloudWatch audit log group.

4. AI & agent consumption — MCP, REST, time travel

Capability What it does
MCP server (serve --mcp) Model Context Protocol stdio server for Claude Desktop / Cursor / any MCP client. 8 tools: list_tables, describe_table, query, get_changes, get_manifest, get_freshness, get_snapshot, get_diff. Governed reads via DuckDB over the Parquet.
REST API (serve --http) Same query semantics + governance over HTTP. Data plane: /v1/tables*, /v1/query, /v1/manifest, /v1/diff. Console/ops: /v1/status, /v1/catalog, /v1/audit/summary, SSE /v1/stream/{status,events,audit}. Bearer auth, body caps, per-principal rate limiting.
Freshness & snapshot signals get_freshness (is_stale, lag, schema/policy versions) and get_snapshot (Iceberg snapshot_id / Parquet base path) so agents decide whether to re-pull and read object storage with their own identity — pg-cdc doesn't proxy bytes.
Contract diff get_diff / /v1/diff classifies ref-to-ref change as safe / schema_change / policy_change / breaking with a promote_safe gate.
Ad-hoc query CLI pg-cdc query — project, filter (col:op:value), limit, role-scope, --changes-since <epoch> for the CDC changelog.
Live policy refresh acl set bumps policy_version; the new access surface appears with no server restart.

5. Managed Postgres — RDS & Aurora, first-class

Capability What it does
Provider detection source.postgres.provider: auto|self-managed|rds|aurora, logged at startup; same pgoutput path for all three.
preflight Readiness check (wal_level, replication privilege, CREATE-on-db, slot headroom, replica identity) with provider-specific remediation.
IAM-token auth iam_auth: true mints + rotates short-lived RDS/Aurora tokens per connect — no static password — via a BeforeConnect re-resolution hook.
Slot-safety guardrail guardrail.max_retained_wal_bytes alarms before a lagging slot pins enough WAL to threaten the source.
Idle heartbeat replication.heartbeat_interval_sec keeps an idle slot advancing on Aurora Serverless v2.
Ephemeral test bed Zero-dormant-cost RDS/Aurora validation harness — standalone burnside-project-pg-cdc-testbed.

6. Reliability & durability — at-least-once, crash-safe

Capability What it does
At-least-once delivery The slot's confirmed_flush_lsn advances only after Parquet + Iceberg + manifest are durable. A crash replays from the last durable flush — no silent data loss. (docs/12-durability.md)
Manifest CAS + circuit breaker If-Match ETag CAS on S3; schema-aware merge on conflict; a breaker trips on suspected dual-writers (default 10 conflicts / 60 s).
Graceful SIGTERM drain Drains the in-flight buffer to the sink (fresh timeout) before exit — no partial Parquet, no orphan epochs.
Byte-budgeted batches flush.max_bytes (default 128 MiB) guards against OOM on wide-row / large-JSONB workloads.
Backpressure visibility Channel-full counters + block-seconds surface a slow sink before it OOMs the daemon.
Transient-error retry Bounded exponential backoff on PG reconnect (1 s→30 s, 10 attempts) and native AWS SDK retry on every client.

7. Observability — 29 metrics, streams, agent-readable artifacts

Capability What it does
Prometheus metrics 29 pgcdc_* metrics on :9090/metrics — replication lag, slot safety, throughput, sink latency/errors, compaction, catalog, Iceberg snapshots, CAS conflicts, backpressure, ACL drift, audit. (docs/metrics.md)
CloudWatch dashboard + alarms Terraform-managed 11-panel dashboard + 9 alarms, parameterized by deployment_name.
SSE event streams /v1/stream/{status,events,audit} for live consoles/dashboards without polling.
Agent-consumable run artifacts Per-run run.json · events.ndjson (stable schema) · state.json (minute-cadence LSN/lag/epoch) · summary.md — designed for an LLM to crawl.

8. Security & operability

Capability What it does
4 auth methods Password (${VAR} substitution, re-resolved on reconnect), mTLS, AWS RDS IAM, GCP Cloud SQL IAM. (docs/08-authentication.md)
Least-privilege posture Documented minimal grants; default-deny LF + S3 bucket policy; HIPAA overlay; on-file producer-side security audit.
systemd integration Hardened pg-cdc.service + compaction/soak timers; documented memory-cap drop-ins.
14 runbooks + SLOs Symptom→diagnosis→remediation runbooks, SLO definitions, on-call playbook. (docs/runbooks/)
Validation harness Dockerized integration suite, a 5-minute e2e-soak.sh, and failure-injection scripts (OOM, slot-full, S3 5xx, SIGTERM-loop).

OSS / commercial: the single Apache-2.0 binary built from this repo contains every feature above — there is no in-binary edition gate. The public OSS mirror at github.com/burnside-project/pg-cdc holds back the governance / ACL / audit / REST / refs plane at the repository level. Authoritative split: docs/internal/plans/oss-adoption.md.


Commands

# Core CDC  (every command accepts --config <path>, default pg-cdc.yml)
pg-cdc preflight                  # Verify the source is ready for logical replication (managed-aware)
pg-cdc init                       # Snapshot tables → base Iceberg/Parquet + manifest + catalog + ACL registration
pg-cdc start                      # Stream WAL → delta epochs (the long-running daemon)
pg-cdc compact                    # Merge deltas → new base (applies I/U/D, soft-deletes on tombstone TTL)
pg-cdc status                     # Health: slot, lag, LSN, epochs, tables, oldest source txn
pg-cdc discover [--acl] [--dry-run]      # List tables (+ role→table→column access map with --acl)
pg-cdc reconcile [--table <q>] --force   # One-shot schema reconcile after an ALTER on a hot table
pg-cdc teardown                   # Drop slot + publication + storage + state

# Servers  (full endpoint list: pg-cdc serve --help)
pg-cdc serve --mcp                # MCP stdio server — 8 tools (Claude Desktop & other MCP clients)
pg-cdc serve --http               # REST API on 127.0.0.1:8080 (bearer auth required for non-loopback)
#   data plane:  /v1/tables* · /v1/query · /v1/manifest · /v1/diff           (docs/spec/data-plane-api.md)
#   console:     /v1/status · /v1/catalog · /v1/audit/summary · /v1/stream/* (docs/spec/console-api.md)
#   always on:   /healthz · /metrics   ·   --pprof <loopback> for /debug/pprof

# Query  (flags: -t -c --role -f -n -w --no-deltas --changes-since)
pg-cdc query                                                              # list active tables
pg-cdc query -t public.orders -c id,name -w amount:gt:100 -n 10 -f json   # project + filter + limit
pg-cdc promote --from <ref> --to main [--force --dry-run --json --reason "..."]   # manual ref promotion

# Catalog (Glue / Iceberg)
pg-cdc catalog register           # Register manifest tables in the catalog without re-snapshotting

# ACL (DynamoDB-backed Layer-2 tag intent → Lake Formation)
pg-cdc acl register <resource>    # Idempotently register at v=0 with default tags
pg-cdc acl get <resource>         # Resolved + direct tags
pg-cdc acl set <resource> --tag k=v [--tag k2=v2 ...] --reason "..."      # Bump version, set intent (audited)
pg-cdc acl list [--unclassified]  # List classifications
pg-cdc acl diff                   # Drift between ACL intent and Lake Formation
pg-cdc acl sync                   # Apply intent → Lake Formation (exits 3 when drift healed)

pg-cdc version                    # Print version + build info

Full operations guide: docs/07-operations.md. Smoke-test runbook: docs/09-smoke-test.md.


Quickstart

# 1. Install (Linux amd64, from the private repo via gh)
gh release download --repo dataalgebra-engineering/burnside-project-pg-cdc \
  --pattern 'pg-cdc_*_linux_amd64.tar.gz' --dir /tmp
tar xzf /tmp/pg-cdc_*_linux_amd64.tar.gz -C /tmp
sudo install -m 0755 /tmp/pg-cdc-linux-amd64 /usr/local/bin/pg-cdc
# OSS mirror: curl -fsSL https://github.com/burnside-project/pg-cdc/releases/latest/download/pg-cdc_linux_amd64.tar.gz | tar xz

# 2. Verify the source, snapshot, stream
pg-cdc preflight
pg-cdc init
pg-cdc start

# 3. Serve to AI agents
pg-cdc serve --mcp

Configuration

Minimal (filesystem sink, no governance) — full reference in docs/02-configuration.md:

# pg-cdc.yml
source:
  postgres:
    url: "postgresql://cdc_user:${PGCDC_PASSWORD}@host:5432/db"
    schemas: ["public"]
    provider: auto                 # auto | self-managed | rds | aurora

storage:
  type: filesystem                 # filesystem | s3   (gcs is a stub)
  path: /var/lib/pg-cdc/output/

replication:
  publication: pg_cdc_pub
  slot: pg_cdc_slot

flush:
  interval_sec: 10
  max_rows: 1000

Governed S3 + Iceberg + Lake Formation (RDS/Aurora with IAM auth):

source:
  postgres:
    url: "postgresql://cdc_user@my-db.abc.us-west-2.rds.amazonaws.com:5432/db?sslmode=require"
    schemas: ["public"]
    provider: aurora
    iam_auth: true                 # short-lived RDS/Aurora IAM tokens, no static password

storage:
  type: s3
  bucket: my-governed-warehouse
  prefix: cdc/
  region: us-west-2

catalog:
  type: iceberg                    # glue (Hive/Parquet) | iceberg
  warehouse: s3://my-governed-warehouse/cdc/
  database: my_db
  region: us-west-2
  acl_table: my-acl-table          # DynamoDB ACL intent table

governance:
  catalog: glue
  mode: strict                     # refuse to publish untagged tables
  require_tags: ["sensitivity"]

replication:
  publication: pg_cdc_pub
  slot: pg_cdc_slot
  heartbeat_interval_sec: 30       # keep an idle slot advancing (Aurora Serverless v2)

guardrail:
  max_retained_wal_bytes: 10737418240   # alarm if the slot pins > 10 GiB of WAL

The Terraform for the Glue DB, DynamoDB ACL table, Lake Formation default-deny, audit pipeline, and per-role IAM scopes is in deploy/terraform/governance/.


Consumption paths

Consumer How Reads
AI agents pg-cdc serve --mcp (8 MCP tools) Governed Parquet via DuckDB; column-filtered by LF grants
HTTP clients / dashboards pg-cdc serve --http (REST + SSE) Same query/governance over HTTP
Developers / analytics pg-warehouse Reads object storage directly with its own AWS identity
Athena / Spark / Iceberg engines AWS Glue + Lake Formation The catalog tables, gated by LF tags

pg-cdc produces the governed zone; the contract with consumers is defined in burnside-go.


Tech stack

Layer Technology
Language Go 1.25 (pure Go, no CGO)
CLI Cobra
PostgreSQL pgx/v5 + pglogrepl (self-managed, AWS RDS, Aurora)
Columnar parquet-go · iceberg-go (Apache Iceberg)
Query DuckDB (read path)
State SQLite (modernc.org/sqlite)
Storage S3 · filesystem (GCS stub)
Catalog / governance AWS Glue · Lake Formation · DynamoDB
Platforms Linux amd64/arm64 · macOS amd64/arm64 · Windows amd64

Related repos

Repo Role
pg-cdc (this repo) CDC server — WAL streaming, Iceberg/Parquet, compaction, governance, MCP/REST
pg-warehouse Analytics client — refresh, model, version, export
burnside-go Shared types — manifest spec, storage interface
pg-cdc-testbed Ephemeral RDS/Aurora validation beds

Versioning

Release candidates auto-increment on push to main (v0.1.0-rc1, -rc2, …); stable releases are promoted from an RC via workflow dispatch. See CHANGELOG.md and docs/about/stability.md for what's GA vs beta.

Documentation

Start at docs/README.md — getting started, configuration, security architecture, governance, AI-agent consumption, durability, runbooks, and the full feature index.

License

Apache License 2.0

Community

License

Apache License 2.0 — Copyright 2025-2026 Burnside Project

About

PostgreSQL CDC server. Streams WAL changes into typed, compacted Parquet files in cloud storage. Follows the native Postgres replica pattern.

Topics

Resources

License

Code of conduct

Contributing

Security policy

Stars

Watchers

Forks

Packages

 
 
 

Contributors