Enabling zero-trust GPU inference without host RAM exposure #135

@hixichen

Description

Title: [Feature Request] Enclave key generation + OIDC discovery for NRAS — enabling zero-trust GPU inference without host RAM exposure

Body:

Summary

I’m fairly new to the GPU world and have been spending time researching confidential inference — specifically protecting proprietary model weights on multi-tenant H100/B200 infrastructure. I’ve been reading through the NVIDIA CC docs, nvtrust repo, and this forum, and I’m really impressed with what the CC stack provides (VRAM encryption + NRAS attestation).

That said, as I’ve been trying to piece together a practical deployment, I’ve run into a few areas where I’m either missing something or there might be genuine gaps. I’d really appreciate any guidance from folks who’ve been working with this longer than I have.

Context

The current CC implementation provides VRAM encryption and NRAS attestation — both work well for proving GPU identity and protecting weights at rest in GPU memory. The gap is what happens between the host CPU and the GPU: during envelope decryption, the plaintext DEK and decrypted model weights must briefly exist in host RAM before DMA transfer to VRAM. On infrastructure without a CPU TEE (SEV-SNP/TDX), this is observable by a privileged host operator.
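To make the exposure window concrete, here is a toy sketch of the standard envelope flow — the KMS call and the "AES-GCM" here are hashlib-based stand-ins, not real cryptography — showing that both the plaintext DEK and the decrypted weights are ordinary host allocations before anything is copied to VRAM:

```python
# Toy illustration (NOT real crypto) of standard envelope decryption:
# the plaintext DEK and decrypted weights pass through host RAM before
# any DMA transfer to VRAM. All names here are hypothetical.
import hashlib


def kms_decrypt(wrapped_dek: bytes) -> bytes:
    # Stand-in for AWS KMS Decrypt: returns the plaintext DEK *to the host*.
    return hashlib.sha256(wrapped_dek).digest()


def xor_cipher(data: bytes, dek: bytes) -> bytes:
    # Stand-in for AES-GCM: deterministic XOR keystream, purely illustrative.
    blocks = (hashlib.sha256(dek + bytes([i])).digest()
              for i in range(len(data) // 32 + 1))
    keystream = b"".join(blocks)[: len(data)]
    return bytes(d ^ k for d, k in zip(data, keystream))


wrapped_dek = b"wrapped-by-kms"
dek = kms_decrypt(wrapped_dek)              # plaintext DEK now in host RAM
ciphertext = xor_cipher(b"model-weights", dek)   # (XOR is symmetric)
plaintext = xor_cipher(ciphertext, dek)     # plaintext weights in host RAM
# ...only at this point would cudaMemcpy/DMA move `plaintext` into
# (encrypted) VRAM — a privileged host operator can observe both buffers.
assert plaintext == b"model-weights"
```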

Feature Requests

1. In-GPU Ephemeral Key Generation and Unwrap

Problem: The plaintext DEK and decrypted weights briefly exist in host RAM during VRAM loading. Host-side hardening (mlock, MADV_DONTDUMP, PR_SET_DUMPABLE(0), explicit_bzero) raises the attack bar but cannot provide cryptographic guarantees without a CPU TEE.

Proposed APIs:

GenerateEphemeralKeyPair()
  → Creates asymmetric keypair inside GPU secure enclave
  → Embeds public key in attestation report (bound to hardware state)
  → Private key never leaves the GPU

UnwrapKey(wrapped_dek)
  → Unwraps a DEK inside the enclave using the ephemeral private key
  → Plaintext DEK available only within VRAM

Alternatively, expose AES-GCM unwrap as a secure enclave primitive invocable from CUDA kernels.

Why this matters: This would allow the external KMS/broker to wrap the DEK to the GPU's ephemeral public key. The GPU unwraps and decrypts entirely in VRAM — host CPU never sees plaintext. This eliminates the host RAM exposure window without requiring the operator to provision CPU TEE infrastructure (which breaks standard observability/orchestration tooling).

Analogy: Intel SGX sealing/unsealing APIs; AWS Nitro Enclaves exposing kms:Decrypt inside the enclave boundary.
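From the host's perspective, the requested flow might look like the sketch below. Everything here is hypothetical: MockGpuEnclave stands in for driver APIs that do not exist today, and the "wrapping" is a placeholder blob rather than real HPKE/RSA-OAEP.

```python
# Hypothetical end-to-end flow for the proposed GenerateEphemeralKeyPair /
# UnwrapKey APIs. MockGpuEnclave is a stand-in for a driver binding that
# does not exist; the "crypto" is a toy placeholder.
import hashlib
import hmac
import os


class MockGpuEnclave:
    def __init__(self):
        # GenerateEphemeralKeyPair(): the private half never leaves the "GPU".
        self._priv = os.urandom(32)
        self.public_key = hashlib.sha256(self._priv).hexdigest()
        self._vram_dek = None

    def attestation_report(self) -> dict:
        # Proposed: the public key is bound into the signed NRAS report.
        return {"gpu_pubkey": self.public_key, "measurements": "..."}

    def unwrap_key(self, wrapped_dek: bytes) -> None:
        # Proposed UnwrapKey(): the plaintext DEK materializes only "in VRAM".
        self._vram_dek = hmac.new(self._priv, wrapped_dek,
                                  hashlib.sha256).digest()


def kms_wrap_to_gpu(gpu_pubkey: str) -> bytes:
    # KMS/broker side: verify the attestation report, then wrap the DEK to
    # the GPU's ephemeral public key. Placeholder blob for illustration.
    return b"opaque-wrapped-dek-for:" + gpu_pubkey.encode()


gpu = MockGpuEnclave()
report = gpu.attestation_report()             # host forwards report to KMS
blob = kms_wrap_to_gpu(report["gpu_pubkey"])  # host only ever sees this blob
gpu.unwrap_key(blob)                          # plaintext DEK stays on-GPU
```

The point of the sketch is the trust boundary: the host relays an opaque blob and an attestation report, and at no step does a plaintext DEK exist in a host-readable buffer.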

2. OIDC Discovery Document for NRAS

Problem: NRAS publishes a JWKS endpoint but lacks /.well-known/openid-configuration. This blocks native AWS AssumeRoleWithWebIdentity federation — the cleanest credential-free architecture where the NRAS JWT itself becomes the cloud credential.

Proposed fix: Publish one static JSON file at https://nras.attestation.nvidia.com/.well-known/openid-configuration containing the standard OIDC discovery fields (issuer, jwks_uri, id_token_signing_alg_values_supported, etc.).
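Concretely, the requested file could be as small as the sketch below. The exact jwks_uri path and the signing algorithm are my assumptions — whatever NRAS actually serves and signs with would go here:

```json
{
  "issuer": "https://nras.attestation.nvidia.com",
  "jwks_uri": "https://nras.attestation.nvidia.com/.well-known/jwks.json",
  "response_types_supported": ["id_token"],
  "subject_types_supported": ["public"],
  "id_token_signing_alg_values_supported": ["ES384"]
}
```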

Why this matters: Without OIDC discovery, every organization deploying confidential inference must either (a) run a custom attestation broker (Lambda/Cloud Run) to bridge NRAS JWTs to KMS, or (b) configure HashiCorp Vault JWT Auth with a direct JWKS URI. Adding one file would unlock direct OIDC federation for the entire GPU CC ecosystem — eliminating the broker entirely for AWS/GCP/Azure users.
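For illustration, direct federation on AWS would then reduce to registering the issuer as an IAM OIDC provider plus a role trust policy along these lines — the account ID, audience, and claim names below are hypothetical:

```json
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Federated": "arn:aws:iam::123456789012:oidc-provider/nras.attestation.nvidia.com"
    },
    "Action": "sts:AssumeRoleWithWebIdentity",
    "Condition": {
      "StringEquals": {
        "nras.attestation.nvidia.com:aud": "sts.amazonaws.com"
      }
    }
  }]
}
```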

3. Offline / Local NRAS Verification Support

Problem: Every pod startup currently requires a synchronous NRAS API call. At scale, this creates latency bottlenecks and a hard runtime dependency on NRAS availability.

Current state: The local GPU verifier in nvtrust exists but requires manually managing CRLs, RIMs, and certificate chains. There's no documented caching strategy or recommended refresh interval.

Ask: Document a supported offline verification workflow: recommended CRL/RIM cache refresh intervals, cache invalidation strategy, and trust model guidance (i.e., who should control the cache to prevent stale-CRL attacks). Even a reference implementation of a caching proxy would help.
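As a starting point for discussion, a minimal TTL cache over CRL/RIM fetches might look like the sketch below — the six-hour interval is a placeholder, not NVIDIA guidance, and a production version would need the stale-data hard cap noted in the comments:

```python
# Minimal sketch of a CRL/RIM caching layer for local verification.
# `fetch` is assumed to pull artifacts over HTTPS; the TTL is a placeholder.
import time


class AttestationArtifactCache:
    def __init__(self, fetch, ttl_seconds=6 * 3600):
        self._fetch = fetch          # callable: url -> bytes
        self._ttl = ttl_seconds
        self._store = {}             # url -> (fetched_at, payload)

    def get(self, url: str) -> bytes:
        now = time.time()
        hit = self._store.get(url)
        if hit and now - hit[0] < self._ttl:
            return hit[1]            # fresh enough: no network round trip
        payload = self._fetch(url)   # a real impl might serve stale data on
        self._store[url] = (now, payload)  # fetch failure, up to a hard cap
        return payload


calls = []
cache = AttestationArtifactCache(lambda u: calls.append(u) or b"crl-bytes")
cache.get("https://example.invalid/crl")
cache.get("https://example.invalid/crl")   # served from cache; one fetch total
```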

4. Cloud KMS Native Attestation Conditions

Problem: Cloud KMS providers (AWS, GCP, Azure) don't natively validate NRAS JWT claims as key policy conditions. AWS KMS already supports Nitro Enclave attestation conditions (kms:RecipientAttestation:PCR0) — a similar integration for GPU attestation would eliminate all intermediary code.

This likely requires NVIDIA + cloud provider partnership, but flagging here as the long-term endgame that would make GPU confidential computing as simple as Nitro Enclaves.
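By analogy with the existing Nitro Enclave condition keys, a GPU-aware KMS key policy statement might read like the sketch below. The kms:RecipientAttestation:GpuMeasurement key is invented here purely to illustrate the ask — it does not exist today:

```json
{
  "Effect": "Allow",
  "Principal": {"AWS": "arn:aws:iam::123456789012:role/inference-pod"},
  "Action": "kms:Decrypt",
  "Resource": "*",
  "Condition": {
    "StringEquals": {
      "kms:RecipientAttestation:GpuMeasurement": "<expected H100/B200 measurement>"
    }
  }
}
```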

Environment

  • GPUs: H100 (testing), B200 (target)
  • CC Mode: GPU-only CC (no CPU TEE on current infrastructure)
  • Attestation: NRAS with JWT verification
  • KMS: AWS KMS with envelope encryption
