feat(networking): native mTLS with subject-name authorization for fabric inter-node communication #4681
Open
rushabhvaria wants to merge 13 commits into restatedev:main
…nication
Add optional TLS/mTLS configuration for Restate's fabric port (5122). This enables securing inter-node communication at the application layer without relying on Kubernetes NetworkPolicy or external service meshes.
Configuration lives under [networking.tls] with support for:
- Strict mode (TLS only) and optional mode (accepts both plaintext and TLS)
- Mutual TLS with configurable client certificate requirements
- Periodic certificate hot-reload from disk (default: 1h)
- Client config inheritance from server config when not specified separately
- Scheme-based signaling (https:// in advertised-address)
Key changes:
- Add FabricTlsOptions, FabricTlsClientOptions, TlsMode config structs
- Add TlsCertResolver with ArcSwap-based lock-free cert rotation
- Modify run_hyper_server to support TLS accept and protocol sniffing
- Modify GrpcConnector to use ClientTlsConfig for https:// peers
- Extend PeerNetAddress with is_tls() and derive_from_bind_address_with_tls()
- Add tokio-rustls, rustls-pemfile workspace dependencies
Without [networking.tls] configuration, behavior is identical to today.
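The interaction between the two modes and protocol sniffing can be sketched as a pure decision function. This is a minimal illustration, not the PR's code: `Route`, `route_connection`, and the local `TlsMode` enum are hypothetical names, and it assumes strict mode always attempts a TLS handshake so that plaintext clients fail there. The only fact taken from the PR description is that a first byte of `0x16` (a TLS handshake record, such as a ClientHello) routes to TLS in optional mode.

```rust
/// Hypothetical mirror of the PR's `TlsMode` setting.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TlsMode {
    Strict,
    Optional,
}

/// What the listener does with a freshly accepted TCP connection.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Route {
    Tls,
    Plaintext,
}

/// Content type of a TLS record that carries a handshake message.
const TLS_HANDSHAKE_RECORD: u8 = 0x16;

/// Decide how to treat a connection from the first byte the client sent.
/// `first_byte` is `None` when nothing could be peeked.
fn route_connection(mode: TlsMode, first_byte: Option<u8>) -> Route {
    match mode {
        // Assumption: strict mode always runs the TLS handshake, so a
        // plaintext client is rejected by the handshake itself.
        TlsMode::Strict => Route::Tls,
        TlsMode::Optional => match first_byte {
            Some(TLS_HANDSHAKE_RECORD) => Route::Tls,
            _ => Route::Plaintext,
        },
    }
}
```

A plaintext HTTP/2 client starts its connection preface with the ASCII bytes `PRI`, so its first byte is never `0x16` and the two protocols cannot be confused by this check.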
- Config parsing tests: TOML deserialization, defaults, mode parsing, client inheritance fallback, client override
- TLS resolver tests: cert loading from PEM, missing file errors, empty cert file errors, invalid key handling, mismatched cert/key rejection
- Address tests: is_tls() for https/http/UDS, derive_from_bind_address_with_tls()
Also restores inline comments in derive_from_bind_address_with_tls that were inadvertently dropped during refactoring.
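The scheme-based signaling exercised by the address tests can be illustrated with plain strings. This is a simplified sketch under stated assumptions: the real `PeerNetAddress` is a structured type, while `is_tls` and `advertised_from_bind` below are hypothetical free functions that only look at the textual form of an address.

```rust
/// Hypothetical stand-in for `PeerNetAddress::is_tls()`: only an explicit
/// `https://` scheme signals that peers should connect with TLS.
fn is_tls(advertised: &str) -> bool {
    advertised.starts_with("https://")
}

/// Hypothetical stand-in for `derive_from_bind_address_with_tls()`: pick the
/// advertised scheme from whether TLS is configured on the listener.
fn advertised_from_bind(host_port: &str, tls_enabled: bool) -> String {
    let scheme = if tls_enabled { "https" } else { "http" };
    format!("{scheme}://{host_port}")
}
```

Under these assumptions a bare `host:port` and a Unix domain socket path are both treated as non-TLS, which matches the cases listed in the test descriptions (https, http, bare host, UDS).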
feat(networking): native mTLS for fabric inter-node communication
All contributors have signed the CLA ✍️ ✅
Author
I have read the CLA Document and I hereby sign the CLA
Add cluster-level integration tests that verify multi-node Restate clusters form correctly with TLS-secured fabric communication.
Tests:
- fabric_tls_strict_cluster: 3-node cluster with strict mTLS, verifies all nodes connect and cluster becomes healthy
- fabric_tls_optional_mode: 3-node cluster with optional TLS mode, verifies nodes form cluster accepting both TLS and plaintext
Uses rcgen to generate test CA + per-node certificates at runtime. Nodes use random TCP ports (not UDS) since TLS applies to TCP only.
test(networking): add integration tests for fabric mTLS
mTLS authenticates the peer but doesn't authorize them. In environments where a shared CA issues certs to many services (e.g., SPIFFE), any service could connect to the fabric port. This adds an optional `allowed-sans` config that checks the peer certificate's Subject Alternative Names (DNS names and URIs) against glob patterns after the TLS handshake succeeds.
Config example:
    [networking.tls]
    allowed-sans = ["spiffe://svc.pin220.com/restate-agents/*"]
Implementation:
- SanCheckingVerifier wraps WebPkiClientVerifier, adding SAN check after chain validation passes
- Uses x509-parser to extract SANs from DER certificates
- Supports * glob wildcards for flexible pattern matching
- When allowed-sans is empty (default), behavior is unchanged
Tests:
- glob_match: exact, trailing wildcard, middle wildcard, prefix, multi
- Config parsing with allowed-sans field
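For readers unfamiliar with `*`-only glob matching, here is a minimal sketch of one way to implement it. It is not the PR's `glob_match`; the name is reused only for illustration, and the behavior shown (a `*` matches any run of bytes, including `/`) is an assumption about the intended semantics.

```rust
/// Match `text` against `pattern`, where `*` matches any (possibly empty)
/// run of bytes and every other byte must match literally.
fn glob_match(pattern: &str, text: &str) -> bool {
    let (p, t) = (pattern.as_bytes(), text.as_bytes());
    let (mut pi, mut ti) = (0, 0);
    // Pattern index of the last `*` seen and the text index it was tried at.
    let mut backtrack: Option<(usize, usize)> = None;

    while ti < t.len() {
        if pi < p.len() && p[pi] == b'*' {
            // Start by letting `*` match nothing; remember where to retry.
            backtrack = Some((pi, ti));
            pi += 1;
        } else if pi < p.len() && p[pi] == t[ti] {
            pi += 1;
            ti += 1;
        } else if let Some((star_pi, star_ti)) = backtrack {
            // Mismatch: let the last `*` absorb one more byte and retry.
            backtrack = Some((star_pi, star_ti + 1));
            pi = star_pi + 1;
            ti = star_ti + 1;
        } else {
            return false;
        }
    }
    // Text is consumed; whatever remains of the pattern must be only `*`.
    p[pi..].iter().all(|&b| b == b'*')
}
```

The backtracking keeps only the most recent `*`, which is sufficient for patterns without other metacharacters and avoids recursion.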
…d add CN matching
Rename `allowed-sans` to `allowed-subject-names` to better reflect that both the Subject Common Name (CN) and Subject Alternative Names (DNS/URI) are checked against the allowed patterns.
The verifier now checks CN first, then SANs. This handles certs that use CN alone (without SANs) and provides a more complete authorization model.
Tests added:
- test_subject_verifier_accepts_matching_cn: CN-only cert accepted
- test_subject_verifier_cn_fallback_when_no_san: CN match when no SANs present
- test_subject_verifier_rejects_no_match_anywhere: neither CN nor SANs match
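The "CN first, then SANs" order can be sketched independently of any TLS library. The types and functions below are hypothetical: `PeerNames` stands in for whatever the verifier extracts from the DER certificate, and `pattern_matches` is a deliberately simplified matcher (exact match or trailing `*`) rather than the PR's glob implementation.

```rust
/// Names extracted from a peer certificate (hypothetical shape).
struct PeerNames {
    common_name: Option<String>,
    subject_alt_names: Vec<String>,
}

/// Simplified stand-in for a glob matcher: exact match, or prefix match
/// when the pattern ends in `*`.
fn pattern_matches(pattern: &str, name: &str) -> bool {
    match pattern.strip_suffix('*') {
        Some(prefix) => name.starts_with(prefix),
        None => pattern == name,
    }
}

/// The CN (if present) is tried first, then each SAN in order; the first
/// name that matches any allowed pattern authorizes the peer.
fn is_authorized(peer: &PeerNames, allowed: &[&str]) -> bool {
    peer.common_name
        .iter()
        .chain(peer.subject_alt_names.iter())
        .any(|name| allowed.iter().any(|pattern| pattern_matches(pattern, name)))
}
```

Because the check is a plain "any name matches any pattern", the order only affects which name is found first, not the outcome.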
feat(networking): add SAN-based authorization for fabric mTLS
… is enabled
Prevent accidental fail-open: when require-client-auth is true, allowed-subject-names must be explicitly set. Operators who want CA-only trust (no identity checking) set allowed-subject-names = ["*"] to make the choice explicit.
An empty list with client auth enabled is now a configuration error that prevents node startup. This addresses feedback that the previous default (empty = allow all) could lead to unintended access when using a shared CA.
Changes:
- Add FabricTlsOptions::validate() with startup-time check
- Call validate() during node initialization before TLS setup
- Treat ["*"] as explicit CA-only trust (skip SubjectNameVerifier)
- Update integration tests to use allowed-subject-names = ["*"]
- 4 new validation unit tests
Config that now fails:
    [networking.tls]
    require-client-auth = true
    # missing allowed-subject-names → startup error
Config that works:
    [networking.tls]
    require-client-auth = true
    allowed-subject-names = ["*"]  # explicit CA-only trust
    # OR
    allowed-subject-names = ["spiffe://dom/*"]  # identity-based authz
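The fail-closed rule described in this commit can be sketched as a small validation function. Everything here is illustrative: `FabricTlsSketch`, `Authorization`, and `TlsConfigError` are hypothetical names, and returning the resolved authorization mode from `validate()` is a design choice made for the example, not something the commit message states about `FabricTlsOptions::validate()`.

```rust
#[derive(Debug, PartialEq, Eq)]
enum TlsConfigError {
    /// Client auth is on but no subject-name patterns were given.
    MissingAllowedSubjectNames,
}

#[derive(Debug, PartialEq, Eq)]
enum Authorization {
    /// Client certificates are not required, so there is nothing to check.
    NotApplicable,
    /// `["*"]`: trust any certificate that chains to the configured CA.
    CaOnly,
    /// Check the peer's CN/SANs against these patterns.
    SubjectNames(Vec<String>),
}

/// Hypothetical subset of the fabric TLS options.
struct FabricTlsSketch {
    require_client_auth: bool,
    allowed_subject_names: Vec<String>,
}

impl FabricTlsSketch {
    /// Startup-time check: an empty list with client auth enabled is an error.
    fn validate(&self) -> Result<Authorization, TlsConfigError> {
        if !self.require_client_auth {
            return Ok(Authorization::NotApplicable);
        }
        if self.allowed_subject_names.is_empty() {
            return Err(TlsConfigError::MissingAllowedSubjectNames);
        }
        // Assumption: only the exact single-entry list ["*"] means CA-only.
        if self.allowed_subject_names.len() == 1 && self.allowed_subject_names[0] == "*" {
            return Ok(Authorization::CaOnly);
        }
        Ok(Authorization::SubjectNames(self.allowed_subject_names.clone()))
    }
}
```

Resolving the mode once at startup keeps the per-connection path free of repeated checks for the `["*"]` special case.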
…subject-names feat(networking): require allowed-subject-names when mTLS client auth…
nickpan47 reviewed and approved these changes on May 1, 2026, and left a comment:
Overall lgtm. Minor comment on duplicated code section.
Contributor
Thanks a lot for adding mTLS support to Restate @rushabhvaria. It looks like a great contribution. Right now the team is a little bit busy with finalizing the 1.7 release and that's why we probably need a bit of time to give your PR the deserved attention. So please bear with us.
rushabhvaria added a commit to rushabhvaria/restate that referenced this pull request on May 4, 2026
Extract serve_connection() helper to eliminate repeated connection error-handling blocks across TLS, plaintext, and UDS code paths. Also simplify the TLS/plaintext branching by resolving the TLS acceptor first, then handling the connection in two clean branches instead of five duplicated blocks. Addresses review feedback from nickpan47 on PR restatedev#4681.
…dler Fix/mtls dedup connection handler
Closes #3306
Related: #3583
Summary
- … (`allowed-subject-names`): after mTLS authentication, verify that the peer's Subject CN and SANs match the allowed patterns. This prevents unauthorized services from connecting when using a shared CA.
- `allowed-subject-names` is required when `require-client-auth` is true, which prevents accidental fail-open. Use `["*"]` to explicitly opt into CA-only trust.
Motivation
Restate's security docs state: "You are expected to secure access to [the fabric port] using the network and proxy layers available in your deployment environment." The recommended approach is Kubernetes NetworkPolicy — but many production environments don't support it (shared clusters, certain CNI plugins, platform constraints). Most distributed systems (etcd, CockroachDB, Consul) offer built-in inter-node TLS — this brings Restate to parity, especially for enterprise environments.
The authorization layer addresses feedback that mTLS alone is insufficient when using a shared CA (e.g., SPIFFE). Without identity checking, any service holding a cert from the same CA could connect to the fabric port.
Configuration
Without `[networking.tls]`, behavior is identical to today (plaintext).
Authorization behavior
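| `require-client-auth` | `allowed-subject-names` | Behavior |
| --- | --- | --- |
| `true` | (empty) | Configuration error; node startup fails |
| `true` | `["*"]` | Explicit CA-only trust, no identity check |
| `true` | `["spiffe://domain/*"]` | Identity-based authorization against the patterns |
| `false` | (any) | No client auth; subject-name check is skipped |
Design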
Encryption and Authentication (mTLS):
- `tokio-rustls::TlsAcceptor` wraps `TcpStream` before hyper
- Protocol sniffing: a first byte of `0x16` (TLS ClientHello) routes to TLS, else plaintext
- `tower::service_fn` connector using `tokio_rustls::TlsConnector`, reads latest certs from `ArcSwap` per-connection
- Certificate rotation via `ArcSwap` (lock-free)
- Advertised address uses `https://`: peers use the scheme to decide connection type
Authorization (subject-name verification):
- `SubjectNameVerifier` wraps `WebPkiClientVerifier`: delegates chain validation, then checks identity
- Uses `x509-parser` for DER certificate parsing
- `["*"]` explicitly skips identity checking (CA-only trust, no `SubjectNameVerifier` overhead)
- Startup validation requires `allowed-subject-names` when client auth is enabled
Rolling upgrade path:
1. Roll out `mode = "optional"` and TLS certs: nodes advertise `https://` and accept both TLS and plaintext
2. Switch to `mode = "strict"`: plaintext is rejected
Note on restatectl compatibility (related to #3583):
Port 5122 currently serves both internal (`CoreNodeSvc`) and external (`ClusterCtrlSvc`, `NodeCtlSvc`) gRPC services. In `optional` mode, `restatectl` connects via plaintext while inter-node traffic uses TLS. Once #3583 splits these into separate ports, `strict` mode can be applied to the internal port without affecting `restatectl`.
Changes
| File | Change |
| --- | --- |
| `crates/types/src/config/networking.rs` | `FabricTlsOptions`, `TlsMode`, `allowed-subject-names`, `validate()` + 11 unit tests |
| `crates/types/src/net/address.rs` | `PeerNetAddress::is_tls()`, `derive_from_bind_address_with_tls()` + 2 tests |
| `crates/core/src/network/tls.rs` | `TlsCertResolver`, `SubjectNameVerifier`, cert loading, hot-reload, `glob_match` + 19 unit tests |
| `crates/core/src/network/net_util.rs` | … |
| `crates/core/src/network/grpc/connector.rs` | … `https://` peers |
| `crates/core/src/network/server_builder.rs` | … `TlsCertResolver` to listener |
| `crates/core/src/network/networking.rs` | … |
| `crates/node/src/lib.rs` | … |
| `crates/admin/src/service.rs` | … `None` for admin port (no TLS on admin) |
| `server/tests/fabric_tls.rs` | … |
Verification
- `cargo check`: all modified crates compile
- `cargo clippy -- -D warnings`: zero warnings
- `cargo fmt --check`: clean
Test plan
Config and validation (11 tests):
- `allowed-subject-names` … `["*"]` = OK, no client auth = skip, specific patterns = OK
TLS core (19 tests):
- `is_tls()` detection: https, http, bare host, UDS
- `derive_from_bind_address_with_tls()`: http:// vs https:// scheme
- … `rcgen`): …
Integration (2 tests):
- `fabric_tls_strict_cluster`: 3-node cluster with strict mTLS
- `fabric_tls_optional_mode`: 3-node cluster with optional TLS