isAutoDrainEnabled() and isGPUPodEvictionEnabled() return false when autoUpgrade is true (inverted logic) #138

@tejag5

Description

When autoUpgrade is set to true in the GPU Operator's ClusterPolicy, the isAutoDrainEnabled() and isGPUPodEvictionEnabled() functions incorrectly return false, which disables auto-drain and GPU pod eviction. This is the opposite of the expected behavior.

Expected Behavior

When autoUpgrade: true is configured, auto-drain and GPU pod eviction should be enabled to facilitate seamless driver upgrades.

Actual Behavior

When autoUpgrade: true is configured, auto-drain and GPU pod eviction are disabled, preventing the driver manager from properly managing node drain and pod eviction during upgrades.

Root Cause

In cmd/driver-manager/main.go, the logic is inverted:

// Lines 828-834
func (dm *DriverManager) isAutoDrainEnabled() bool {
    if dm.isDriverAutoUpgradePolicyEnabled() {
        dm.log.Info("Auto drain of the node is disabled by the upgrade policy")
        return false  // BUG: Should return dm.config.enableAutoDrain when autoUpgrade is true
    }
    return dm.config.enableAutoDrain
}

// Lines 836-842
func (dm *DriverManager) isGPUPodEvictionEnabled() bool {
    if dm.isDriverAutoUpgradePolicyEnabled() {
        dm.log.Infof("Auto eviction of GPU pods on node %s is disabled by the upgrade policy", dm.config.nodeName)
        return false  // BUG: Should return dm.config.enableGPUPodEviction when autoUpgrade is true
    }
    return dm.config.enableGPUPodEviction
}

When isDriverAutoUpgradePolicyEnabled() returns true (meaning autoUpgrade is enabled), these functions return false, disabling the drain/eviction features.
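The reported behavior can be reproduced in isolation with a minimal sketch. The `config` and `driverManager` types below are simplified stand-ins, and the `autoUpgradePolicy` field is a hypothetical replacement for `isDriverAutoUpgradePolicyEnabled()`; only the control flow is copied from the snippet above, with logging omitted.

```go
package main

import "fmt"

// config is a stand-in for the real driver manager config;
// only the field needed for this illustration is modeled.
type config struct {
	enableAutoDrain bool
}

// driverManager is a stand-in for the real DriverManager type.
type driverManager struct {
	config            config
	autoUpgradePolicy bool // hypothetical: replaces isDriverAutoUpgradePolicyEnabled()
}

// isAutoDrainEnabled mirrors the control flow quoted in this issue.
func (dm *driverManager) isAutoDrainEnabled() bool {
	if dm.autoUpgradePolicy {
		return false
	}
	return dm.config.enableAutoDrain
}

func main() {
	// Print the result for every combination of the two inputs.
	for _, autoUpgrade := range []bool{false, true} {
		for _, drain := range []bool{false, true} {
			dm := &driverManager{
				config:            config{enableAutoDrain: drain},
				autoUpgradePolicy: autoUpgrade,
			}
			fmt.Printf("autoUpgrade=%t enableAutoDrain=%t -> isAutoDrainEnabled()=%t\n",
				autoUpgrade, drain, dm.isAutoDrainEnabled())
		}
	}
}
```

With `autoUpgrade=true`, the function returns false regardless of `enableAutoDrain`, which matches the behavior described in this report.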

Workaround

Set autoUpgrade: false in the GPU Operator values to enable auto-drain and GPU pod eviction:

gpu-operator:
  driver:
    upgradePolicy:
      autoUpgrade: false  # Workaround for inverted logic bug
      drain:
        enable: true
        force: true

Environment

k8s-driver-manager version: v0.9.0
GPU Operator version: v25.10.0
Kubernetes version: 1.32

Suggested Fix

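The condition should likely be inverted, or the functions should return the config values when autoUpgrade is enabled. The first option, shown for isAutoDrainEnabled() (isGPUPodEvictionEnabled() would change the same way):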

func (dm *DriverManager) isAutoDrainEnabled() bool {
    if !dm.isDriverAutoUpgradePolicyEnabled() {
        dm.log.Info("Auto drain of the node is disabled by the upgrade policy")
        return false
    }
    return dm.config.enableAutoDrain
}
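The second option (returning the config values when autoUpgrade is enabled) might look like the sketch below. Because the non-autoUpgrade branch already returns the config value, both branches end up identical and the policy check collapses away. The `fixedConfig` and `fixedManager` types and the `autoUpgradePolicy` field are hypothetical stand-ins, not the real types from `cmd/driver-manager/main.go`.

```go
package main

import "fmt"

// fixedConfig is a stand-in for the real driver manager config.
type fixedConfig struct {
	enableAutoDrain      bool
	enableGPUPodEviction bool
}

// fixedManager is a stand-in for the real DriverManager type.
type fixedManager struct {
	config            fixedConfig
	autoUpgradePolicy bool // hypothetical: kept only to show it no longer affects the result
}

// isAutoDrainEnabled returns the configured value whether or not
// the auto-upgrade policy is enabled.
func (dm *fixedManager) isAutoDrainEnabled() bool {
	return dm.config.enableAutoDrain
}

// isGPUPodEvictionEnabled follows the same pattern.
func (dm *fixedManager) isGPUPodEvictionEnabled() bool {
	return dm.config.enableGPUPodEviction
}

func main() {
	dm := &fixedManager{
		config:            fixedConfig{enableAutoDrain: true, enableGPUPodEviction: true},
		autoUpgradePolicy: true,
	}
	fmt.Println("auto drain enabled:", dm.isAutoDrainEnabled())
	fmt.Println("GPU pod eviction enabled:", dm.isGPUPodEvictionEnabled())
}
```

Under this variant the workaround configuration (autoUpgrade: false) keeps behaving as it does today, since the result depends only on the drain and eviction settings.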
