Runtime required is a special mode that packages can be run in. This mode is for when a set of Packages must complete before any other workloads are allowed to run on the node.
- A node MUST join the cluster with a pre define taint
- That same taint must be set as the chart value
controllerManager.manager.env.runtimeRequiredTaint- The default value for this taint is
skyhook.nvidia.com=runtime-required:NoSchedule
- The default value for this taint is
Once the pre-requisites are satisfied any Skyhook Custom Resource (SCR) may be marked with runtimeRequired: true. This flag indicates that all packages within this SCR must complete
before the nodes that it targets are considered available for general use.
Recommended: The preferred approach is to taint nodes as they are added to the cluster (e.g., via your infrastructure provisioner or node bootstrap configuration). Pre-tainting eliminates any race condition between a node becoming schedulable and the operator applying the taint.
If you cannot control node tainting at provisioning time, autoTaintNewNodes provides a fallback. Note that there is a small window between when a node joins the cluster and when the operator's reconcile loop applies the taint, during which workloads could theoretically be scheduled on the node.
To enable, set autoTaintNewNodes: true alongside runtimeRequired: true on the Skyhook CR:
spec:
runtimeRequired: true
autoTaintNewNodes: trueWhen enabled, the operator automatically applies the runtime-required taint to nodes that:
- Match the Skyhook's node selector
- Do not already have the runtime-required taint
- Have no
skyhook.nvidia.com/*annotations (i.e., have never been touched by the Skyhook operator)
A node is considered "new" if it has no Skyhook annotations. This works for both initial cluster setup (day 0) and nodes joining an existing cluster (day 2+). Nodes that have already been processed by Skyhook (and had their taint removed after completion) will not be re-tainted because they retain their Skyhook annotations.
- Without
autoTaintNewNodes: true, it will NOT add the taint to any nodes targeted by a SCR withruntimeRequired: true
The taint is removed from a node when all SCRs with runtimeRequired: true that target that node are complete on that specific node.
Important: Taint removal is per-node, not per-skyhook. This means:
- Node A's taint is removed when all runtime-required skyhooks complete on Node A
- Node A does NOT wait for Node B to complete those same skyhooks
- If Node B is stuck or failing, Node A can still have its taint removed and become available
This per-node behavior prevents deadlocks where a few bad nodes would block all other healthy nodes from becoming available.
- The node becomes available for general workload scheduling (pods without the runtime-required toleration can now be scheduled on it).
This is useful when you want to gate other work behind the successful completion of some set of Skyhook Packages. This can be for security reasons or for scheduling.
NOTE: No additional toleration is required, skyhook auto tolerates this (env:runtimeRequiredTaint) taint.