pg-cdc — A frustratingly simple way to create governed operational data from PostgreSQL for AI applications.
Build AI agents from PostgreSQL operational data — no production credentials.
Coming Soon on AWS Marketplace
+-------------------------------+ +----------------------------------------------------+
| PRODUCTION ZONE | | GOVERNED DATA ZONE |
| | | |
| PostgreSQL | | S3 · immutable Iceberg/Parquet |
| self-managed · RDS · Aurora | | Catalog + Governance tags |
| | | | audit trail |
| | WAL (one-way) | | | | |
| v | | v v |
| pg-cdc ------------------|------->| MCP server REST API / pg-warehouse(dev Tools)|
| (CDC engine) | | (AI agents) |
+-------------------------------+ +----------------------------------------------------+
Build ai agent from Postgres Operational Data — no prod credentials.
PostgreSQL change data capture → governed Apache Iceberg / Parquet on AWS S3 — built for AI agents.
- pg-cdc is not just replication. pg-cdc streams Postgres Write Ahead Logs(WAL) out of production Postgres into typed, immutable, time-travelable Iceberg tables on S3
- Registers each entities in the AWS Glue Catalog
- Gates every read with AWS Lake Formation tags — so AI agents, analysts, and query engines consume governed data without ever touching the source database, and without database credentials.
No JVM. One binary.
- No return path — the WAL is one-way; Parquet is immutable. Agents physically cannot write to production.
- No database credentials — consumers authenticate via AWS IAM + Lake Formation, never a connection string.
- Governed by default — untagged data is invisible; Lake Formation tags gate every read, down to the column.
- Time travel built in — every flush is an Iceberg snapshot; CDC epochs + immutable
raw@<ts>tags give historical queries with no database branching.
See docs/10-security-architecture.md for the full five-property model and docs/11-ai-agent-consumption.md for the two consumption paths.
A feature-by-feature catalog. The exhaustive, link-rich version lives in docs/00-features.md.
| Capability | What it does |
|---|---|
| Logical replication | Reads a publication + replication slot, decodes pgoutput v2 (I/U/D). No triggers, no app changes. |
| Consistent initial snapshot | init exports a snapshot at one LSN via CREATE_REPLICATION_SLOT; per-table COPY under SET TRANSACTION SNAPSHOT. Cross-table parallelism (init.parallel_workers). |
| Streaming WAL → delta Parquet | start daemon buffers events and flushes typed Parquet per table on interval / row / byte budget. Per-table parallel flush. |
| Schema evolution | Detects ADD/DROP COLUMN, propagates to manifest + Glue + Iceberg live — no restart. reconcile --table one-shot for hot tables. |
| Compaction | compact merges delta epochs into a new base, applies I/U/D, soft-deletes on a tombstone TTL. CAS-safe alongside the live streamer. |
| TOAST partial-row preservation | Honors TupleDataType 'u' — unchanged TOAST columns keep their value instead of silently becoming NULL. |
| Versioned manifest | A single JSON source of truth (active tables, schema versions, epoch markers, source LSN), CAS-committed on S3. |
| Capability | What it does |
|---|---|
| Apache Iceberg writer | catalog.type: iceberg — every flush is an Iceberg snapshot; in-tree schema evolution + background snapshot expiration (default retain 100/table). |
| Hive/Parquet on Glue | catalog.type: glue — register active tables so Athena / Spark / DuckDB read the Parquet directly. |
| S3 sink | Multipart upload (aws-sdk-go-v2 transfer manager), SDK retry middleware, S3-compatible endpoints (MinIO, R2). Configurable bucket/prefix/region/endpoint. |
| Filesystem sink | Zero-dependency local output for dev / on-prem / air-gapped. (GCS sink is a scaffolded stub — not implemented.) |
| Time-travel refs | refs.json: staging advances every flush, main advances by promotion.mode (auto/manual), immutable raw@<ts> tags per flush. Git-shaped consumer workflow via promote. |
The differentiator. pg-cdc doesn't just land data — it makes it default-deny governed so it's safe to point an autonomous agent at.
| Capability | What it does |
|---|---|
| AWS Glue catalog | Tables registered at init and after every schema evolution; idempotent catalog register re-sync; drift metrics. |
| Lake Formation tag enforcement | Tag-policy grants gate reads. Untagged = invisible. Column-level tags supported. The bucket policy + IAMAllowedPrincipals removal make it load-bearing, not advisory. |
| DynamoDB ACL intent registry | Versioned classification per resource (db.schema.table[.column]) with tag inheritance (db→table→column), monotonic versions, and full audit provenance (who / when / why). |
acl reconciliation |
acl register | get | set | list | diff | sync — diffs intent vs live LF and applies AddLFTagsToResource / RemoveLFTagsFromResource. CI-friendly exit codes (0 clean · 3 drift-healed · 2 AWS error). |
| Query-time enforcement | The serve layer resolves each caller's LF grants (ListPermissions, 5-min cache) — columns a role can't see never leave storage. Live, not manifest-based. |
| Break-glass + audit | Time-boxed emergency grants with TTL auto-revoke (GitHub Environment gate, OIDC role) and a 365-day CloudWatch audit log group. |
| Capability | What it does |
|---|---|
MCP server (serve --mcp) |
Model Context Protocol stdio server for Claude Desktop / Cursor / any MCP client. 8 tools: list_tables, describe_table, query, get_changes, get_manifest, get_freshness, get_snapshot, get_diff. Governed reads via DuckDB over the Parquet. |
REST API (serve --http) |
Same query semantics + governance over HTTP. Data plane: /v1/tables*, /v1/query, /v1/manifest, /v1/diff. Console/ops: /v1/status, /v1/catalog, /v1/audit/summary, SSE /v1/stream/{status,events,audit}. Bearer auth, body caps, per-principal rate limiting. |
| Freshness & snapshot signals | get_freshness (is_stale, lag, schema/policy versions) and get_snapshot (Iceberg snapshot_id / Parquet base path) so agents decide whether to re-pull and read object storage with their own identity — pg-cdc doesn't proxy bytes. |
| Contract diff | get_diff / /v1/diff classifies ref-to-ref change as safe / schema_change / policy_change / breaking with a promote_safe gate. |
| Ad-hoc query CLI | pg-cdc query — project, filter (col:op:value), limit, role-scope, --changes-since <epoch> for the CDC changelog. |
| Live policy refresh | acl set bumps policy_version; the new access surface appears with no server restart. |
| Capability | What it does |
|---|---|
| Provider detection | source.postgres.provider: auto|self-managed|rds|aurora, logged at startup; same pgoutput path for all three. |
preflight |
Readiness check (wal_level, replication privilege, CREATE-on-db, slot headroom, replica identity) with provider-specific remediation. |
| IAM-token auth | iam_auth: true mints + rotates short-lived RDS/Aurora tokens per connect — no static password — via a BeforeConnect re-resolution hook. |
| Slot-safety guardrail | guardrail.max_retained_wal_bytes alarms before a lagging slot pins enough WAL to threaten the source. |
| Idle heartbeat | replication.heartbeat_interval_sec keeps an idle slot advancing on Aurora Serverless v2. |
| Ephemeral test bed | Zero-dormant-cost RDS/Aurora validation harness — standalone burnside-project-pg-cdc-testbed. |
| Capability | What it does |
|---|---|
| At-least-once delivery | The slot's confirmed_flush_lsn advances only after Parquet + Iceberg + manifest are durable. A crash replays from the last durable flush — no silent data loss. (docs/12-durability.md) |
| Manifest CAS + circuit breaker | If-Match ETag CAS on S3; schema-aware merge on conflict; a breaker trips on suspected dual-writers (default 10 conflicts / 60 s). |
| Graceful SIGTERM drain | Drains the in-flight buffer to the sink (fresh timeout) before exit — no partial Parquet, no orphan epochs. |
| Byte-budgeted batches | flush.max_bytes (default 128 MiB) guards against OOM on wide-row / large-JSONB workloads. |
| Backpressure visibility | Channel-full counters + block-seconds surface a slow sink before it OOMs the daemon. |
| Transient-error retry | Bounded exponential backoff on PG reconnect (1 s→30 s, 10 attempts) and native AWS SDK retry on every client. |
| Capability | What it does |
|---|---|
| Prometheus metrics | 29 pgcdc_* metrics on :9090/metrics — replication lag, slot safety, throughput, sink latency/errors, compaction, catalog, Iceberg snapshots, CAS conflicts, backpressure, ACL drift, audit. (docs/metrics.md) |
| CloudWatch dashboard + alarms | Terraform-managed 11-panel dashboard + 9 alarms, parameterized by deployment_name. |
| SSE event streams | /v1/stream/{status,events,audit} for live consoles/dashboards without polling. |
| Agent-consumable run artifacts | Per-run run.json · events.ndjson (stable schema) · state.json (minute-cadence LSN/lag/epoch) · summary.md — designed for an LLM to crawl. |
| Capability | What it does |
|---|---|
| 4 auth methods | Password (${VAR} substitution, re-resolved on reconnect), mTLS, AWS RDS IAM, GCP Cloud SQL IAM. (docs/08-authentication.md) |
| Least-privilege posture | Documented minimal grants; default-deny LF + S3 bucket policy; HIPAA overlay; on-file producer-side security audit. |
| systemd integration | Hardened pg-cdc.service + compaction/soak timers; documented memory-cap drop-ins. |
| 14 runbooks + SLOs | Symptom→diagnosis→remediation runbooks, SLO definitions, on-call playbook. (docs/runbooks/) |
| Validation harness | Dockerized integration suite, a 5-minute e2e-soak.sh, and failure-injection scripts (OOM, slot-full, S3 5xx, SIGTERM-loop). |
OSS / commercial: the single Apache-2.0 binary built from this repo contains every feature above — there is no in-binary edition gate. The public OSS mirror at
github.com/burnside-project/pg-cdcholds back the governance / ACL / audit / REST / refs plane at the repository level. Authoritative split:docs/internal/plans/oss-adoption.md.
# Core CDC (every command accepts --config <path>, default pg-cdc.yml)
pg-cdc preflight # Verify the source is ready for logical replication (managed-aware)
pg-cdc init # Snapshot tables → base Iceberg/Parquet + manifest + catalog + ACL registration
pg-cdc start # Stream WAL → delta epochs (the long-running daemon)
pg-cdc compact # Merge deltas → new base (applies I/U/D, soft-deletes on tombstone TTL)
pg-cdc status # Health: slot, lag, LSN, epochs, tables, oldest source txn
pg-cdc discover [--acl] [--dry-run] # List tables (+ role→table→column access map with --acl)
pg-cdc reconcile [--table <q>] --force # One-shot schema reconcile after an ALTER on a hot table
pg-cdc teardown # Drop slot + publication + storage + state
# Servers (full endpoint list: pg-cdc serve --help)
pg-cdc serve --mcp # MCP stdio server — 8 tools (Claude Desktop & other MCP clients)
pg-cdc serve --http # REST API on 127.0.0.1:8080 (bearer auth required for non-loopback)
# data plane: /v1/tables* · /v1/query · /v1/manifest · /v1/diff (docs/spec/data-plane-api.md)
# console: /v1/status · /v1/catalog · /v1/audit/summary · /v1/stream/* (docs/spec/console-api.md)
# always on: /healthz · /metrics · --pprof <loopback> for /debug/pprof
# Query (flags: -t -c --role -f -n -w --no-deltas --changes-since)
pg-cdc query # list active tables
pg-cdc query -t public.orders -c id,name -w amount:gt:100 -n 10 -f json # project + filter + limit
pg-cdc promote --from <ref> --to main [--force --dry-run --json --reason "..."] # manual ref promotion
# Catalog (Glue / Iceberg)
pg-cdc catalog register # Register manifest tables in the catalog without re-snapshotting
# ACL (DynamoDB-backed Layer-2 tag intent → Lake Formation)
pg-cdc acl register <resource> # Idempotently register at v=0 with default tags
pg-cdc acl get <resource> # Resolved + direct tags
pg-cdc acl set <resource> --tag k=v [--tag k2=v2 ...] --reason "..." # Bump version, set intent (audited)
pg-cdc acl list [--unclassified] # List classifications
pg-cdc acl diff # Drift between ACL intent and Lake Formation
pg-cdc acl sync # Apply intent → Lake Formation (exits 3 when drift healed)
pg-cdc version # Print version + build infoFull operations guide: docs/07-operations.md. Smoke-test runbook: docs/09-smoke-test.md.
# 1. Install (Linux amd64, from the private repo via gh)
gh release download --repo dataalgebra-engineering/burnside-project-pg-cdc \
--pattern 'pg-cdc_*_linux_amd64.tar.gz' --dir /tmp
tar xzf /tmp/pg-cdc_*_linux_amd64.tar.gz -C /tmp
sudo install -m 0755 /tmp/pg-cdc-linux-amd64 /usr/local/bin/pg-cdc
# OSS mirror: curl -fsSL https://github.com/burnside-project/pg-cdc/releases/latest/download/pg-cdc_linux_amd64.tar.gz | tar xz
# 2. Verify the source, snapshot, stream
pg-cdc preflight
pg-cdc init
pg-cdc start
# 3. Serve to AI agents
pg-cdc serve --mcpMinimal (filesystem sink, no governance) — full reference in docs/02-configuration.md:
# pg-cdc.yml
source:
postgres:
url: "postgresql://cdc_user:${PGCDC_PASSWORD}@host:5432/db"
schemas: ["public"]
provider: auto # auto | self-managed | rds | aurora
storage:
type: filesystem # filesystem | s3 (gcs is a stub)
path: /var/lib/pg-cdc/output/
replication:
publication: pg_cdc_pub
slot: pg_cdc_slot
flush:
interval_sec: 10
max_rows: 1000Governed S3 + Iceberg + Lake Formation (RDS/Aurora with IAM auth):
source:
postgres:
url: "postgresql://cdc_user@my-db.abc.us-west-2.rds.amazonaws.com:5432/db?sslmode=require"
schemas: ["public"]
provider: aurora
iam_auth: true # short-lived RDS/Aurora IAM tokens, no static password
storage:
type: s3
bucket: my-governed-warehouse
prefix: cdc/
region: us-west-2
catalog:
type: iceberg # glue (Hive/Parquet) | iceberg
warehouse: s3://my-governed-warehouse/cdc/
database: my_db
region: us-west-2
acl_table: my-acl-table # DynamoDB ACL intent table
governance:
catalog: glue
mode: strict # refuse to publish untagged tables
require_tags: ["sensitivity"]
replication:
publication: pg_cdc_pub
slot: pg_cdc_slot
heartbeat_interval_sec: 30 # keep an idle slot advancing (Aurora Serverless v2)
guardrail:
max_retained_wal_bytes: 10737418240 # alarm if the slot pins > 10 GiB of WALThe Terraform for the Glue DB, DynamoDB ACL table, Lake Formation default-deny, audit pipeline, and per-role IAM scopes is in deploy/terraform/governance/.
| Consumer | How | Reads |
|---|---|---|
| AI agents | pg-cdc serve --mcp (8 MCP tools) |
Governed Parquet via DuckDB; column-filtered by LF grants |
| HTTP clients / dashboards | pg-cdc serve --http (REST + SSE) |
Same query/governance over HTTP |
| Developers / analytics | pg-warehouse | Reads object storage directly with its own AWS identity |
| Athena / Spark / Iceberg engines | AWS Glue + Lake Formation | The catalog tables, gated by LF tags |
pg-cdc produces the governed zone; the contract with consumers is defined in burnside-go.
| Layer | Technology |
|---|---|
| Language | Go 1.25 (pure Go, no CGO) |
| CLI | Cobra |
| PostgreSQL | pgx/v5 + pglogrepl (self-managed, AWS RDS, Aurora) |
| Columnar | parquet-go · iceberg-go (Apache Iceberg) |
| Query | DuckDB (read path) |
| State | SQLite (modernc.org/sqlite) |
| Storage | S3 · filesystem (GCS stub) |
| Catalog / governance | AWS Glue · Lake Formation · DynamoDB |
| Platforms | Linux amd64/arm64 · macOS amd64/arm64 · Windows amd64 |
| Repo | Role |
|---|---|
| pg-cdc (this repo) | CDC server — WAL streaming, Iceberg/Parquet, compaction, governance, MCP/REST |
| pg-warehouse | Analytics client — refresh, model, version, export |
| burnside-go | Shared types — manifest spec, storage interface |
| pg-cdc-testbed | Ephemeral RDS/Aurora validation beds |
Release candidates auto-increment on push to main (v0.1.0-rc1, -rc2, …); stable releases are promoted from an RC via workflow dispatch. See CHANGELOG.md and docs/about/stability.md for what's GA vs beta.
Start at docs/README.md — getting started, configuration, security architecture, governance, AI-agent consumption, durability, runbooks, and the full feature index.
- GitHub Issues — bugs and feature requests
- GitHub Discussions — questions and ideas
- Contributing — development setup and guidelines
- Code of Conduct
- Security Policy
Apache License 2.0 — Copyright 2025-2026 Burnside Project