feat(networking): native mTLS with subject-name authorization for fabric inter-node communication #4681

Open
rushabhvaria wants to merge 13 commits into restatedev:main from rushabhvaria:main

@rushabhvaria rushabhvaria commented Apr 30, 2026

Closes #3306
Related: #3583

Summary

  • Add optional TLS/mTLS configuration for the fabric port (5122), securing inter-node communication at the application layer without requiring Kubernetes NetworkPolicy or external service meshes
  • Support strict mode (TLS only) and optional mode (accepts both plaintext and TLS) for zero-downtime rolling upgrades
  • Periodic certificate hot-reload from disk (configurable interval, default 1h)
  • Subject-name authorization (allowed-subject-names): after mTLS authentication, verify the peer's Subject CN and SANs match allowed patterns — prevents unauthorized services from connecting when using a shared CA
  • Fail-safe config validation: allowed-subject-names is required when require-client-auth is true — prevents accidental fail-open. Use ["*"] to explicitly opt into CA-only trust

Motivation

Restate's security docs state: "You are expected to secure access to [the fabric port] using the network and proxy layers available in your deployment environment." The recommended approach is Kubernetes NetworkPolicy — but many production environments don't support it (shared clusters, certain CNI plugins, platform constraints). Most distributed systems (etcd, CockroachDB, Consul) offer built-in inter-node TLS — this brings Restate to parity, especially for enterprise environments.

The authorization layer addresses feedback that mTLS alone is insufficient when using a shared CA (e.g., SPIFFE). Without identity checking, any service holding a cert from the same CA could connect to the fabric port.

Configuration

[networking.tls]
mode = "strict"                          # or "optional" for rolling upgrades
cert-file = "/certs/node.crt"
key-file = "/certs/node.key"
ca-files = ["/certs/ca.crt"]
require-client-auth = true               # default: mTLS enabled
refresh-interval = "1h"                  # hot-reload certs from disk

# Authorization: required when require-client-auth is true
# Use ["*"] for CA-only trust, or specify identity patterns
allowed-subject-names = [
    "spiffe://svc.example.com/restate/*",
]

# Optional: separate client certs for outbound (inherits from above if omitted)
[networking.tls.client]
cert-file = "/certs/client.crt"
key-file = "/certs/client.key"
root-ca-files = ["/certs/client-ca.crt"]

Without [networking.tls], behavior is identical to today (plaintext).

Authorization behavior

require-client-auth | allowed-subject-names | Result
true | omitted/empty | Startup error (must specify patterns)
true | ["*"] | CA-only trust (explicit opt-in)
true | ["spiffe://domain/*"] | Identity-based authorization
false | omitted/empty | OK (no client auth, no subject check)
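
The table above can be read as a small decision function. The following is a minimal sketch of that logic, not the PR's actual FabricTlsOptions::validate(): the struct TlsOptions, the enum Authorization, and the error text are hypothetical names chosen for illustration. It also assumes that only the exact list ["*"] counts as CA-only trust, and that a non-empty list is ignored when client auth is off; neither case is specified in the table.

```rust
/// Hypothetical, simplified stand-in for the PR's FabricTlsOptions.
#[derive(Debug)]
struct TlsOptions {
    require_client_auth: bool,
    allowed_subject_names: Vec<String>,
}

/// Hypothetical result type describing which check applies to peers.
#[derive(Debug, PartialEq)]
enum Authorization {
    /// No client certificate is requested, so there is nothing to check.
    NoClientAuth,
    /// ["*"]: any certificate that chains to a trusted CA is accepted.
    CaOnly,
    /// The peer's CN or SANs must match at least one of these patterns.
    SubjectNames(Vec<String>),
}

impl TlsOptions {
    /// Mirrors the rows of the table: an empty list with client auth
    /// enabled is an error instead of silently allowing every peer.
    fn validate(&self) -> Result<Authorization, String> {
        if !self.require_client_auth {
            return Ok(Authorization::NoClientAuth);
        }
        if self.allowed_subject_names.is_empty() {
            return Err(String::from(
                "allowed-subject-names is required when require-client-auth is true; \
                 use [\"*\"] for CA-only trust",
            ));
        }
        // Assumption: only the exact list ["*"] opts into CA-only trust.
        if self.allowed_subject_names.len() == 1 && self.allowed_subject_names[0] == "*" {
            return Ok(Authorization::CaOnly);
        }
        Ok(Authorization::SubjectNames(self.allowed_subject_names.clone()))
    }
}
```

Returning an enum rather than a bare Ok(()) lets the caller decide at startup whether a subject-name verifier needs to be installed at all, which matches the note that ["*"] skips identity checking.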

Design

Encryption and Authentication (mTLS):

  • TLS termination at the tonic/hyper layer using rustls (already a workspace dep)
  • Inbound strict: tokio_rustls::TlsAcceptor wraps TcpStream before hyper
  • Inbound optional: peek first byte — 0x16 (TLS ClientHello) routes to TLS, else plaintext
  • Outbound: custom tower::service_fn connector using tokio_rustls::TlsConnector, reads latest certs from ArcSwap per-connection
  • Cert rotation: background tokio task reloads PEM files on interval, swaps via ArcSwap (lock-free)
  • Scheme signaling: TLS-enabled nodes advertise https:// — peers use the scheme to decide connection type
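
The optional-mode sniffing step can be illustrated with a small pure function. This is a sketch under assumptions, not the code in net_util.rs: the names Route and route_for are hypothetical, and the real implementation peeks at a socket rather than receiving a byte slice. It relies on the fact that a TLS record carrying a ClientHello starts with content type 0x16 (handshake), whereas a plaintext HTTP/2 connection with prior knowledge starts with the ASCII preface "PRI * HTTP/2.0".

```rust
/// TLS record content type for handshake records; a ClientHello starts with it.
const TLS_HANDSHAKE: u8 = 0x16;

/// Hypothetical routing decision for an inbound connection in optional mode.
#[derive(Debug, PartialEq)]
enum Route {
    Tls,
    Plaintext,
}

/// Classify a connection from the bytes peeked so far (the bytes are not
/// consumed by peeking). Returns None when nothing has been received yet,
/// in which case the caller has to wait for more data.
fn route_for(peeked: &[u8]) -> Option<Route> {
    match peeked.first() {
        Some(&TLS_HANDSHAKE) => Some(Route::Tls),
        Some(_) => Some(Route::Plaintext),
        None => None,
    }
}
```

Because only the first byte is inspected, the check is cheap, but it is a heuristic: it distinguishes TLS from plaintext gRPC, not TLS from arbitrary binary protocols.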

Authorization (subject-name verification):

  • SubjectNameVerifier wraps WebPkiClientVerifier — delegates chain validation, then checks identity
  • Checks Subject CN first, then SANs (DNS names and URIs) against glob patterns
  • Uses x509-parser for DER certificate parsing
  • ["*"] explicitly skips identity checking (CA-only trust, no SubjectNameVerifier overhead)
  • Config validation at startup prevents empty allowed-subject-names when client auth is enabled
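
To make the CN-then-SAN matching concrete, here is a minimal sketch of glob matching and the authorization check. It is not the PR's glob_match or SubjectNameVerifier: the function is_authorized is a hypothetical name, the names are passed in as plain strings instead of being extracted from a DER certificate, and this sketch lets `*` match any run of characters, including `/` and `.`, which may differ from the PR's semantics.

```rust
/// Match `text` against `pattern`, where '*' matches any (possibly empty)
/// run of bytes. Assumption: '*' also matches '/' and '.'.
fn glob_match(pattern: &str, text: &str) -> bool {
    let (p, t) = (pattern.as_bytes(), text.as_bytes());
    let (mut pi, mut ti) = (0usize, 0usize);
    // Pattern index of the last '*' seen, and the text index it currently ends at.
    let mut star: Option<(usize, usize)> = None;
    while ti < t.len() {
        if pi < p.len() && p[pi] == b'*' {
            star = Some((pi, ti));
            pi += 1;
        } else if pi < p.len() && p[pi] == t[ti] {
            pi += 1;
            ti += 1;
        } else if let Some((sp, st)) = star {
            // Backtrack: let the last '*' absorb one more byte of the text.
            pi = sp + 1;
            ti = st + 1;
            star = Some((sp, st + 1));
        } else {
            return false;
        }
    }
    // Whatever is left of the pattern must consist of '*' only.
    p[pi..].iter().all(|&b| b == b'*')
}

/// Hypothetical authorization check: CN first, then SANs (DNS names and
/// URIs), accepting the peer if any name matches any allowed pattern.
fn is_authorized(patterns: &[&str], common_name: Option<&str>, sans: &[&str]) -> bool {
    let allowed = |name: &str| patterns.iter().any(|&p| glob_match(p, name));
    common_name.map_or(false, |cn| allowed(cn)) || sans.iter().any(|&san| allowed(san))
}
```

With the pattern from the configuration example, a peer presenting the SAN URI spiffe://svc.example.com/restate/node-1 would be accepted, while a certificate from the same CA with only spiffe://svc.example.com/billing/api would be rejected.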

Rolling upgrade path:

  1. Deploy all nodes with mode = "optional" and TLS certs — nodes advertise https://, accept both
  2. Verify all nodes communicate via TLS
  3. Switch to mode = "strict" — plaintext rejected
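
The upgrade path depends on two decisions: what a node accepts inbound in each mode, and how it dials a peer based on the advertised scheme. A minimal sketch of both follows; the TlsMode variants mirror the mode values in the configuration, while the functions accepts_inbound and dial_with_tls and the example addresses are hypothetical and only illustrate the described behavior.

```rust
/// Mirrors the `mode` setting under [networking.tls].
#[derive(Clone, Copy, Debug, PartialEq)]
enum TlsMode {
    Strict,
    Optional,
}

/// Whether an inbound connection is accepted, given the node's mode and
/// whether the connection turned out to be TLS.
fn accepts_inbound(mode: TlsMode, is_tls: bool) -> bool {
    match mode {
        // Step 1 of the upgrade: both plaintext and TLS peers get through.
        TlsMode::Optional => true,
        // Step 3 of the upgrade: plaintext is rejected.
        TlsMode::Strict => is_tls,
    }
}

/// Outbound side: the peer's advertised scheme decides the connection type.
fn dial_with_tls(advertised_address: &str) -> bool {
    advertised_address.starts_with("https://")
}
```

Since nodes in optional mode already advertise https://, inter-node traffic moves to TLS during step 1, and the later switch to strict only removes the plaintext fallback.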

Note on restatectl compatibility (related to #3583):
Port 5122 currently serves both internal (CoreNodeSvc) and external (ClusterCtrlSvc, NodeCtlSvc) gRPC services. In optional mode, restatectl connects via plaintext while inter-node traffic uses TLS. Once #3583 splits these into separate ports, strict mode can be applied to the internal port without affecting restatectl.

Changes

File | Change
crates/types/src/config/networking.rs | FabricTlsOptions, TlsMode, allowed-subject-names, validate() + 11 unit tests
crates/types/src/net/address.rs | PeerNetAddress::is_tls(), derive_from_bind_address_with_tls() + 2 tests
crates/core/src/network/tls.rs | NEW: TlsCertResolver, SubjectNameVerifier, cert loading, hot-reload, glob_match + 19 unit tests
crates/core/src/network/net_util.rs | TLS accept (strict) + protocol sniff (optional)
crates/core/src/network/grpc/connector.rs | Custom TLS connector for outbound https:// peers
crates/core/src/network/server_builder.rs | Thread TlsCertResolver to listener
crates/core/src/network/networking.rs | Accept TLS resolver in constructor
crates/node/src/lib.rs | TLS init + config validation at startup, spawn reloader, wire to server + connector
crates/admin/src/service.rs | Pass None for admin port (no TLS on admin)
server/tests/fabric_tls.rs | NEW: 2 integration tests (strict cluster, optional mode)

Verification

  • cargo check — all modified crates compile
  • cargo clippy -D warnings — zero warnings
  • cargo fmt --check — clean
  • 32 unit tests pass (11 config + 2 address + 19 TLS)
  • 2 integration tests compile (strict cluster, optional mode)
  • No regressions in existing tests

Test plan

Config and validation (11 tests):

  • TOML parsing: defaults, modes, client inheritance, allowed-subject-names
  • Validation: empty + client auth = error, ["*"] = OK, no client auth = skip, specific patterns = OK

TLS core (19 tests):

  • PEM cert/key loading: valid, missing, empty, invalid
  • is_tls() detection: https, http, bare host, UDS
  • derive_from_bind_address_with_tls(): http:// vs https:// scheme
  • Glob matching: exact, trailing *, middle *, prefix, multiple wildcards
  • Subject-name verifier with real X.509 certs (via rcgen):
    • Accept matching SAN URI / SAN DNS / CN
    • Reject non-matching, reject no match anywhere
    • Multi-pattern authorization, CN fallback without SANs

Integration (2 tests):

  • fabric_tls_strict_cluster: 3-node cluster with strict mTLS
  • fabric_tls_optional_mode: 3-node cluster with optional TLS

…nication

Add optional TLS/mTLS configuration for Restate's fabric port (5122).
This enables securing inter-node communication at the application layer
without relying on Kubernetes NetworkPolicy or external service meshes.

Configuration lives under [networking.tls] with support for:
- Strict mode (TLS only) and optional mode (accepts both plaintext and TLS)
- Mutual TLS with configurable client certificate requirements
- Periodic certificate hot-reload from disk (default: 1h)
- Client config inheritance from server config when not specified separately
- Scheme-based signaling (https:// in advertised-address)

Key changes:
- Add FabricTlsOptions, FabricTlsClientOptions, TlsMode config structs
- Add TlsCertResolver with ArcSwap-based lock-free cert rotation
- Modify run_hyper_server to support TLS accept and protocol sniffing
- Modify GrpcConnector to use ClientTlsConfig for https:// peers
- Extend PeerNetAddress with is_tls() and derive_from_bind_address_with_tls()
- Add tokio-rustls, rustls-pemfile workspace dependencies

Without [networking.tls] configuration, behavior is identical to today.
- Config parsing tests: TOML deserialization, defaults, mode parsing,
  client inheritance fallback, client override
- TLS resolver tests: cert loading from PEM, missing file errors,
  empty cert file errors, invalid key handling, mismatched cert/key rejection
- Address tests: is_tls() for https/http/UDS, derive_from_bind_address_with_tls()

Also restores inline comments in derive_from_bind_address_with_tls that
were inadvertently dropped during refactoring.
feat(networking): native mTLS for fabric inter-node communication
github-actions Bot commented Apr 30, 2026

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

@rushabhvaria
Author

I have read the CLA Document and I hereby sign the CLA

Add cluster-level integration tests that verify multi-node Restate
clusters form correctly with TLS-secured fabric communication.

Tests:
- fabric_tls_strict_cluster: 3-node cluster with strict mTLS, verifies
  all nodes connect and cluster becomes healthy
- fabric_tls_optional_mode: 3-node cluster with optional TLS mode,
  verifies nodes form cluster accepting both TLS and plaintext

Uses rcgen to generate test CA + per-node certificates at runtime.
Nodes use random TCP ports (not UDS) since TLS applies to TCP only.
@rushabhvaria rushabhvaria marked this pull request as ready for review April 30, 2026 21:48
@claude claude Bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

test(networking): add integration tests for fabric mTLS
mTLS authenticates the peer but doesn't authorize them. In environments
where a shared CA issues certs to many services (e.g., SPIFFE), any
service could connect to the fabric port. This adds an optional
`allowed-sans` config that checks the peer certificate's Subject
Alternative Names (DNS names and URIs) against glob patterns after
the TLS handshake succeeds.

Config example:
  [networking.tls]
  allowed-sans = ["spiffe://svc.pin220.com/restate-agents/*"]

Implementation:
- SanCheckingVerifier wraps WebPkiClientVerifier, adding SAN check
  after chain validation passes
- Uses x509-parser to extract SANs from DER certificates
- Supports * glob wildcards for flexible pattern matching
- When allowed-sans is empty (default), behavior is unchanged

Tests:
- glob_match: exact, trailing wildcard, middle wildcard, prefix, multi
- Config parsing with allowed-sans field
…d add CN matching

Rename `allowed-sans` to `allowed-subject-names` to better reflect that
both the Subject Common Name (CN) and Subject Alternative Names (DNS/URI)
are checked against the allowed patterns.

The verifier now checks CN first, then SANs. This handles certs that use
CN alone (without SANs) and provides a more complete authorization model.

Tests added:
- test_subject_verifier_accepts_matching_cn: CN-only cert accepted
- test_subject_verifier_cn_fallback_when_no_san: CN match when no SANs present
- test_subject_verifier_rejects_no_match_anywhere: neither CN nor SANs match
feat(networking): add SAN-based authorization for fabric mTLS
@rushabhvaria rushabhvaria changed the title feat(networking): native mTLS for fabric inter-node communication feat(networking): native mTLS with subject-name authorization for fabric inter-node communication May 1, 2026
… is enabled

Prevent accidental fail-open: when require-client-auth is true,
allowed-subject-names must be explicitly set. Operators who want
CA-only trust (no identity checking) set allowed-subject-names = ["*"]
to make the choice explicit. An empty list with client auth enabled
is now a configuration error that prevents node startup.

This addresses feedback that the previous default (empty = allow all)
could lead to unintended access when using a shared CA.

Changes:
- Add FabricTlsOptions::validate() with startup-time check
- Call validate() during node initialization before TLS setup
- Treat ["*"] as explicit CA-only trust (skip SubjectNameVerifier)
- Update integration tests to use allowed-subject-names = ["*"]
- 4 new validation unit tests

Config that now fails:
  [networking.tls]
  require-client-auth = true
  # missing allowed-subject-names → startup error

Config that works:
  [networking.tls]
  require-client-auth = true
  allowed-subject-names = ["*"]               # explicit CA-only trust
  # OR
  allowed-subject-names = ["spiffe://dom/*"]   # identity-based authz
…subject-names

feat(networking): require allowed-subject-names when mTLS client auth…
Comment thread crates/core/src/network/net_util.rs Outdated
@nickpan47 nickpan47 left a comment

Overall lgtm. Minor comment on duplicated code section.

@tillrohrmann
Contributor

Thanks a lot for adding mTLS support to Restate @rushabhvaria. It looks like a great contribution.

Right now the team is a little bit busy with finalizing the 1.7 release and that's why we probably need a bit of time to give your PR the deserved attention. So please bear with us.

@tillrohrmann tillrohrmann self-requested a review May 4, 2026 07:37
rushabhvaria added a commit to rushabhvaria/restate that referenced this pull request May 4, 2026
Extract serve_connection() helper to eliminate repeated connection
error-handling blocks across TLS, plaintext, and UDS code paths.
Also simplify the TLS/plaintext branching by resolving the TLS
acceptor first, then handling the connection in two clean branches
instead of five duplicated blocks.

Addresses review feedback from nickpan47 on PR restatedev#4681.
@AhmedSoliman AhmedSoliman self-requested a review May 5, 2026 09:30

Development

Successfully merging this pull request may close these issues.

Add support for encrypting cross Restate node traffic

3 participants