nydus-snapshotter: Helm deployment fails with FailedPostStartHook - containerd restart kills pod #466

@eveningcafe

Issue: Helm chart fails to deploy - postStart hook restarts containerd causing CrashLoopBackOff

Environment

  • Kubernetes: v1.30+
  • Containerd: v1.7.25
  • Helm chart: dragonfly/nydus-snapshotter (latest version from the dragonfly repo)

Problem Description

The nydus-snapshotter DaemonSet fails to start with FailedPostStartHook errors, followed by CrashLoopBackOff. The root cause is the postStart lifecycle hook that restarts containerd, which creates a circular failure.

Root Cause

The postStart hook executes:

nsenter -t 1 -m systemctl -- restart containerd.service

When this command runs:

  1. Container starts
  2. postStart hook executes and restarts containerd
  3. Containerd restart kills all running containers, including the pod that triggered the restart
  4. Pod never completes startup successfully
  5. Kubernetes marks it as failed and restarts the pod
  6. Loop continues indefinitely
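To confirm that the hook is present in the manifests Helm would apply, the rendered chart can be searched for postStart blocks. This is a minimal sketch: print_post_start_hooks is a hypothetical helper name, and the release and chart names are taken from the install command in this report.

```shell
# print_post_start_hooks is a hypothetical helper, not part of the chart.
# It reads rendered manifests on stdin and prints every postStart block
# with line numbers and up to 8 lines of trailing context.
print_post_start_hooks() {
  grep -n -A 8 'postStart:' || echo "no postStart hook found"
}

# Example use (requires helm and the dragonfly repo to be added):
#   helm template nydus-snapshotter dragonfly/nydus-snapshotter | print_post_start_hooks
```

If the hook is rendered, the output should include the nsenter command quoted above.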

Steps to Reproduce

  1. Deploy nydus-snapshotter using Helm:
helm repo add dragonfly https://dragonflyoss.github.io/helm-charts/
helm repo update
helm install nydus-snapshotter dragonfly/nydus-snapshotter \
  --namespace nydus-snapshotter \
  --create-namespace \
  --wait
  2. Observe pod status:
kubectl get pods -n nydus-snapshotter

Expected Behavior

Pods should start successfully and run the nydus-snapshotter service.

Actual Behavior

NAME                      READY   STATUS               RESTARTS      AGE
nydus-snapshotter-xxxxx   0/1     CrashLoopBackOff     2 (17s ago)   87s

Events show:

Warning  FailedPostStartHook  PostStartHook failed
Normal   Killing              FailedPostStartHook
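The events above can be pulled directly from the cluster by filtering on the event reason. A small sketch, assuming the namespace from the install command; show_hook_failures is a hypothetical helper name.

```shell
# show_hook_failures is a hypothetical helper, not part of the chart.
# $1 = namespace to query. Lists only FailedPostStartHook events, oldest first.
show_hook_failures() {
  kubectl get events -n "$1" \
    --field-selector reason=FailedPostStartHook \
    --sort-by=.lastTimestamp
}

# Example use:
#   show_hook_failures nydus-snapshotter
```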

Analysis

The postStart hook is attempting to reload containerd configuration after the init container modifies /etc/containerd/config.toml. However, restarting containerd terminates the very pod that's performing the restart, preventing successful startup.

The containerd systemd unit does not support systemctl reload (CanReload=no), so replacing restart with reload is not viable.
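The CanReload property can be checked on a node with systemctl show. The sketch below wraps that check; unit_can_reload is a hypothetical helper name, and it has to run on the host (or inside the host mount namespace), not in an ordinary container.

```shell
# unit_can_reload is a hypothetical helper, not part of the chart.
# Returns 0 if the given systemd unit advertises reload support, 1 otherwise.
unit_can_reload() {
  # `systemctl show UNIT -p CanReload` prints one line such as "CanReload=no".
  value="$(systemctl show "$1" -p CanReload | cut -d= -f2)"
  [ "$value" = "yes" ]
}

# Example use on a node:
#   unit_can_reload containerd.service || echo "reload unsupported, restart required"
```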

Proposed Solutions

  1. Remove postStart hook entirely - Document that users must manually restart containerd after deployment
  2. Use a separate Job - Create a pre-install Job that configures containerd and restarts it before the DaemonSet starts
  3. Background restart with delay - Fork the restart into background with delay (hacky but works):
    (sleep 3 && nsenter -t 1 -m -- systemctl restart containerd.service) &
  4. Add configuration option - Allow users to disable the hook via values.yaml:
    containerRuntime:
      containerd:
        enable: true
        autoRestart: false  # new option
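A variation on option 3, offered only as an untested sketch: instead of backgrounding a sleep inside the container, the hook could ask systemd on the host to run the restart from a transient timer, so the restart does not depend on the hook's own process surviving. This swaps in systemd-run, which the original hook does not use; schedule_containerd_restart and the 3-second default are assumptions for illustration.

```shell
# schedule_containerd_restart is a hypothetical helper, not part of the chart.
# $1 = delay in seconds before systemd restarts containerd (default 3).
schedule_containerd_restart() {
  delay="${1:-3}"
  # Enter the host mount namespace via PID 1, as the existing hook does, then
  # let systemd run the restart from a transient timer unit on the host.
  nsenter -t 1 -m -- systemd-run --on-active="${delay}s" \
    systemctl restart containerd.service
}

# Example use inside the postStart hook:
#   schedule_containerd_restart 3
```

The hook returns as soon as the timer is registered. Whether the pod then survives the containerd restart would still need to be verified on a real node.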

Workaround

Temporarily disable containerd configuration injection and configure manually:

containerRuntime:
  containerd:
    enable: false

Then manually add to /etc/containerd/config.toml on each node:

[proxy_plugins]
  [proxy_plugins.nydus]
    type = "snapshot"
    address = "/run/containerd-nydus/containerd-nydus-grpc.sock"

And restart containerd once manually.
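Before restarting, it may help to confirm that the proxy plugin section actually landed in the config file on each node. A minimal sketch, assuming the config path above; has_nydus_proxy_plugin is a hypothetical helper name.

```shell
# has_nydus_proxy_plugin is a hypothetical helper, not part of the chart.
# $1 = path to a containerd config file.
# Returns 0 if the file contains a [proxy_plugins.nydus] table header.
has_nydus_proxy_plugin() {
  grep -Eq '^[[:space:]]*\[proxy_plugins\.nydus\]' "$1"
}

# Example use on a node (run as root):
#   has_nydus_proxy_plugin /etc/containerd/config.toml \
#     && systemctl restart containerd.service
```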
