Description
When autoUpgrade is set to true in the GPU Operator's ClusterPolicy, the isAutoDrainEnabled() and isGPUPodEvictionEnabled() functions incorrectly return false, which disables auto-drain and GPU pod eviction. This is the opposite of the expected behavior.
Expected Behavior
When autoUpgrade: true is configured, auto-drain and GPU pod eviction should be enabled to facilitate seamless driver upgrades.
Actual Behavior
When autoUpgrade: true is configured, auto-drain and GPU pod eviction are disabled, preventing the driver manager from properly managing node drain and pod eviction during upgrades.
Root Cause
In cmd/driver-manager/main.go, the logic is inverted:
// Lines 828-834
func (dm *DriverManager) isAutoDrainEnabled() bool {
	if dm.isDriverAutoUpgradePolicyEnabled() {
		dm.log.Info("Auto drain of the node is disabled by the upgrade policy")
		return false // BUG: Should return dm.config.enableAutoDrain when autoUpgrade is true
	}
	return dm.config.enableAutoDrain
}

// Lines 836-842
func (dm *DriverManager) isGPUPodEvictionEnabled() bool {
	if dm.isDriverAutoUpgradePolicyEnabled() {
		dm.log.Infof("Auto eviction of GPU pods on node %s is disabled by the upgrade policy", dm.config.nodeName)
		return false // BUG: Should return dm.config.enableGPUPodEviction when autoUpgrade is true
	}
	return dm.config.enableGPUPodEviction
}

When isDriverAutoUpgradePolicyEnabled() returns true (meaning autoUpgrade is enabled), these functions return false, disabling the drain/eviction features.
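To make the reported behavior easy to check, here is a minimal standalone sketch of the same branching. The `config` struct and the `currentAutoDrain` function are hypothetical stand-ins written for this issue, not code from k8s-driver-manager; only the if/return structure is taken from the snippet above.

```go
package main

import "fmt"

// config is a hypothetical stand-in for the driver manager's settings.
// The field names mirror the identifiers quoted in the issue.
type config struct {
	autoUpgrade     bool // corresponds to driver.upgradePolicy.autoUpgrade
	enableAutoDrain bool
}

// currentAutoDrain mirrors the branching quoted above: when the upgrade
// policy is enabled, the configured value is ignored and false is returned.
func currentAutoDrain(c config) bool {
	if c.autoUpgrade {
		return false
	}
	return c.enableAutoDrain
}

func main() {
	// Print the result for every combination of the two inputs.
	for _, au := range []bool{false, true} {
		for _, drain := range []bool{false, true} {
			c := config{autoUpgrade: au, enableAutoDrain: drain}
			fmt.Printf("autoUpgrade=%-5t enableAutoDrain=%-5t -> %t\n",
				au, drain, currentAutoDrain(c))
		}
	}
}
```

Under these assumptions, the only combination that yields true is autoUpgrade=false with enableAutoDrain=true, which matches the workaround described below.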
Workaround
Set autoUpgrade: false in the GPU Operator values to enable auto-drain and GPU pod eviction:
gpu-operator:
  driver:
    upgradePolicy:
      autoUpgrade: false # Workaround for inverted logic bug
      drain:
        enable: true
        force: true
Environment
k8s-driver-manager version: v0.9.0
GPU Operator version: v25.10.0
Kubernetes version: 1.32
Suggested Fix
The condition should likely be inverted, or the functions should return the config values when autoUpgrade is enabled:
func (dm *DriverManager) isAutoDrainEnabled() bool {
	if !dm.isDriverAutoUpgradePolicyEnabled() {
		dm.log.Info("Auto drain of the node is disabled by the upgrade policy")
		return false
	}
	return dm.config.enableAutoDrain
}
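The two options mentioned above behave differently, so it may help to compare them side by side. The sketch below models each as a pure function; `invertedCondition` and `configOnly` are hypothetical names used only for illustration and are not part of k8s-driver-manager.

```go
package main

import "fmt"

// invertedCondition follows the snippet above: drain stays off unless
// autoUpgrade is on, in which case the configured value applies.
func invertedCondition(autoUpgrade, enableAutoDrain bool) bool {
	if !autoUpgrade {
		return false
	}
	return enableAutoDrain
}

// configOnly models the other option: the upgrade policy is ignored and
// the configured value is always honored.
func configOnly(autoUpgrade, enableAutoDrain bool) bool {
	return enableAutoDrain
}

func main() {
	fmt.Println("autoUpgrade enableAutoDrain inverted configOnly")
	for _, au := range []bool{false, true} {
		for _, drain := range []bool{false, true} {
			fmt.Printf("%-11t %-15t %-8t %t\n",
				au, drain, invertedCondition(au, drain), configOnly(au, drain))
		}
	}
}
```

One difference worth noting: with the inverted condition, setting autoUpgrade: false would always disable auto-drain, so the current workaround would stop working. Returning the config value unconditionally keeps both configurations usable. The same choice would apply to isGPUPodEvictionEnabled().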