
feat(routerreplay): default store_backend to postgres for durable replay #1683

Merged
rootfs merged 1 commit into vllm-project:main from yehuditkerido:durable_router_replay
Mar 31, 2026

Conversation

@yehuditkerido
Collaborator

@yehuditkerido yehuditkerido commented Mar 29, 2026

Summary

Router Replay records were lost on every restart because store_backend defaulted to memory. This PR changes the default to postgres for durable, SQL-queryable audit storage, logs a warning when the memory backend is selected, and adds an E2E restart-recovery test.

Changes

  • Default RouterReplayConfig.StoreBackend from "memory" to "postgres" in canonical_defaults.go
  • Emit a logging.Warnf warning when the operator selects the memory backend
  • Add Go doc to RouterReplayConfig and a unit test for the new default
  • Update website docs with Postgres config example and backend comparison table
  • Update state-taxonomy-and-inventory.md to reflect new default
  • New E2E profile router-replay-postgres with Postgres 16 deployment, restart-recovery test, and CI wiring
  • Add DoGETRequest fixture helper and register profile in imports.go
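The default change and the memory warning listed above can be modelled in a few lines. This is only a behavioural sketch in Python, not the Go code in canonical_defaults.go or router_replay_setup.go: the function name `resolve_store_backend`, the logger name, and the warning text are hypothetical, and the TTL unit (seconds) is an assumption, since the PR states only the value 2592000.

```python
import logging

logger = logging.getLogger("router_replay_sketch")  # hypothetical logger name

# New default described in this PR (previously "memory").
DEFAULT_STORE_BACKEND = "postgres"

# Value asserted by the PR's unit test. Assumption: the unit is seconds,
# in which case it equals 30 days.
DEFAULT_TTL = 2592000


def resolve_store_backend(configured):
    """Return the effective backend, falling back to the default when unset.

    Logs a warning for the non-durable memory backend, mirroring the
    behaviour the PR describes for operators who select it explicitly.
    """
    backend = configured or DEFAULT_STORE_BACKEND
    if backend == "memory":
        logger.warning(
            "router_replay store_backend is 'memory'; records will not survive a restart"
        )
    return backend
```

With no backend configured, the sketch resolves to postgres; an explicit "memory" is still honoured, but produces a warning.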

CLI auto-provisioning of storage backends

  • storage_backends.py: detect_required_backends reads global.services.<key>.store_backend from the loaded config and returns which backends (redis, postgres) need provisioning
  • docker_services.py: docker_start_redis and docker_start_postgres now call _reuse_running_storage_container before _replace_existing_container. If the storage container is already running, it is kept as-is, preserving data across router restarts
  • config/vllm-sr-config-cli.yaml: migrated to v0.3 canonical format; sets response_api → redis, router_replay → postgres, adds router_replay plugin to default_route
  • docs/durable-router-replay-guide-he.html: end-to-end manual test guide covering auto-provisioning, replay verification in Postgres, and restart durability
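The backend detection described in the list above amounts to walking global.services.<key>.store_backend and collecting the values the CLI can provision. A minimal sketch follows, assuming the loaded config is a plain nested dict. The function name `sketch_detect_required_backends` is hypothetical and is not the signature of the real detect_required_backends in storage_backends.py; the sketch also ignores any defaults the real loader may apply when store_backend is omitted.

```python
from typing import Any, Dict, Set

# Backends the PR says the CLI can provision.
PROVISIONABLE_BACKENDS = {"redis", "postgres"}


def sketch_detect_required_backends(config: Dict[str, Any]) -> Set[str]:
    """Collect provisionable store_backend values under global.services."""
    services = (config.get("global") or {}).get("services") or {}
    required: Set[str] = set()
    for service in services.values():
        if not isinstance(service, dict):
            continue  # skip malformed service entries
        backend = service.get("store_backend")
        if isinstance(backend, str) and backend in PROVISIONABLE_BACKENDS:
            required.add(backend)
    return required


# Example mirroring the backend choices named for config/vllm-sr-config-cli.yaml.
example_config = {
    "global": {
        "services": {
            "response_api": {"store_backend": "redis"},
            "router_replay": {"store_backend": "postgres"},
        }
    }
}
print(sorted(sketch_detect_required_backends(example_config)))
```

A service configured with memory, or with no store_backend key at all, contributes nothing to the result in this sketch.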

Testing

  • Unit: default backend assertion passes (postgres, TTL 2592000)
  • E2E: router-replay-restart-recovery passes; the record survives a pod restart
  • go vet and gofmt clean
  • Manual (CLI): vllm-sr serve with canonical v0.3 config auto-provisions Redis + Postgres, replay record persists after docker stop/rm of router containers and re-running vllm-sr serve
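The manual result above depends on the reuse-before-replace ordering described for docker_services.py: a running storage container is left untouched, and only a missing or stopped one is (re)created. The sketch below models that decision with an in-memory dict instead of real Docker calls. The function name `ensure_storage_container`, the state strings, and the container name are all hypothetical and are not taken from the PR's code.

```python
from typing import Callable, Dict


def ensure_storage_container(
    name: str,
    states: Dict[str, str],
    start: Callable[[str], None],
) -> str:
    """Decide what to do with a storage container.

    states maps container name -> "running" or "exited" (a stand-in for
    what a Docker query would report); start creates a fresh container.
    """
    # Reuse path: checked first, so a running container keeps its data.
    if states.get(name) == "running":
        return "reused"

    # Replace path: drop a stale, stopped container before starting anew.
    existed = name in states
    if existed:
        del states[name]

    start(name)
    states[name] = "running"
    return "replaced" if existed else "started"


started = []
states = {"example-postgres": "running"}  # hypothetical container name
print(ensure_storage_container("example-postgres", states, started.append))
```

Because the reuse check comes first, restarting only the router containers never reaches the replace path for a storage container that is still running.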

Note: router-replay-postgres is temporarily added to the CI baseline profiles
so the new E2E test runs on this PR. Happy to remove it in a follow-up commit once
reviewers see it pass — just let me know.

Related Issues

Resolves the Router Replay portion of #1608
Follows #1661 (Response API → Redis)

@netlify

netlify bot commented Mar 29, 2026

Deploy Preview for vllm-semantic-router ready!

Name Link
🔨 Latest commit 652382a
🔍 Latest deploy log https://app.netlify.com/projects/vllm-semantic-router/deploys/69cba8b16706b50008528409
😎 Deploy Preview https://deploy-preview-1683--vllm-semantic-router.netlify.app

@github-actions
Contributor

github-actions bot commented Mar 29, 2026

👥 vLLM Semantic Team Notification

The following members have been identified for the changed files in this PR and have been automatically assigned:

📁 Root Directory

Owners: @rootfs, @Xunzhuo
Files changed:

  • .github/workflows/ci-changes.yml
  • .github/workflows/integration-test-k8s.yml
  • docs/agent/state-taxonomy-and-inventory.md

📁 deploy

Owners: @rootfs, @Xunzhuo
Files changed:

  • deploy/kubernetes/router-replay/postgres.yaml

📁 e2e

Owners: @Xunzhuo
Files changed:

  • e2e/README.md
  • e2e/pkg/fixtures/http.go
  • e2e/profiles/all/imports.go
  • e2e/profiles/router-replay-postgres/profile.go
  • e2e/profiles/router-replay-postgres/values.yaml
  • e2e/testcases/router_replay_restart_recovery.go

📁 src

Owners: @rootfs, @Xunzhuo, @wangchen615
Files changed:

  • src/semantic-router/pkg/config/canonical_defaults.go
  • src/semantic-router/pkg/config/canonical_loader_test.go
  • src/semantic-router/pkg/config/runtime_config.go
  • src/semantic-router/pkg/extproc/router_replay_setup.go
  • src/vllm-sr/cli/core.py
  • src/vllm-sr/cli/docker_cli.py
  • src/vllm-sr/cli/docker_services.py
  • src/vllm-sr/cli/runtime_lifecycle.py
  • src/vllm-sr/cli/runtime_stack.py
  • src/vllm-sr/cli/storage_backends.py

📁 tools

Owners: @yuluo-yx, @rootfs, @Xunzhuo
Files changed:

  • tools/agent/e2e-profile-map.yaml

📁 website

Owners: @Xunzhuo, @rootfs, @yuluo-yx
Files changed:

  • website/docs/tutorials/global/api-and-observability.md


🎉 Thanks for your contributions!

This comment was automatically generated based on the OWNER files in the repository.

@github-actions
Contributor

github-actions bot commented Mar 29, 2026

✅ Supply Chain Security Report — All Clear

Scanner | Status | Findings
AST Codebase Scan (Py, Go, JS/TS, Rust) | | 29 finding(s) (MEDIUM: 22 · LOW: 7)
AST PR Diff Scan | | No issues detected
Regex Fallback Scan | | No issues detected

Scanned at 2026-03-31T10:59:47.532Z · View full workflow logs

@yehuditkerido yehuditkerido force-pushed the durable_router_replay branch from 0f836f1 to ce3429b, March 29, 2026 12:22
Member

@Xunzhuo Xunzhuo left a comment


There are multiple places where we need a database when we run vllm-sr serve. Can we unify this, and make sure that when we start vllm-sr the default environment contains the relevant resources?

@yehuditkerido yehuditkerido marked this pull request as draft March 29, 2026 12:49
@yehuditkerido yehuditkerido force-pushed the durable_router_replay branch from ce3429b to e5d8af3, March 30, 2026 07:08
@yehuditkerido
Collaborator Author

> There are multiple places where we need a database when we run vllm-sr serve. Can we unify this, and make sure that when we start vllm-sr the default environment contains the relevant resources?

Hi @Xunzhuo just want to make sure I understand correctly — you'd like the vllm-sr serve CLI to detect which storage backends the config requires (Redis, Postgres, etc.) and automatically start them as part of the environment, right?

If so, should I address that in this PR or open a follow-up issue for it?

@yehuditkerido yehuditkerido force-pushed the durable_router_replay branch from e5d8af3 to c4b01e5, March 30, 2026 08:12
@yehuditkerido yehuditkerido marked this pull request as ready for review March 30, 2026 08:28
@Xunzhuo
Member

Xunzhuo commented Mar 30, 2026

@yehuditkerido yes, this PR will now break the vllm-sr serve installation process, since you changed the default for replay storage without adding a storage setup process to vllm-sr.

@yehuditkerido yehuditkerido marked this pull request as draft March 30, 2026 13:51
@yehuditkerido yehuditkerido force-pushed the durable_router_replay branch from c4b01e5 to 8bcf9e1, March 31, 2026 10:18
@yehuditkerido yehuditkerido marked this pull request as ready for review March 31, 2026 10:20
@yehuditkerido yehuditkerido force-pushed the durable_router_replay branch from 8bcf9e1 to ea0c268, March 31, 2026 10:29
… replay

Router Replay records (routing decisions, model selections, guardrail
results) were lost on every restart because the default was memory.
Change the default to postgres — the right tool for structured audit
data that needs SQL queryability and long-term retention. Warn
operators who explicitly choose memory. Add a dedicated E2E profile
with Postgres to validate restart-recovery.

Signed-off-by: Yehudit Kerido <ykerido@ykerido-thinkpadp1gen7.raanaii.csb>
@yehuditkerido yehuditkerido force-pushed the durable_router_replay branch from ea0c268 to 652382a, March 31, 2026 10:57
@rootfs rootfs merged commit 6e91324 into vllm-project:main Mar 31, 2026
35 checks passed


5 participants