🐛 Bypass PDBs during drain when node is unreachable #13509
rafael-azevedo wants to merge 1 commit into kubernetes-sigs:main
Fixes #13508
What this PR does / why we need it:
When a node becomes unreachable (e.g. the underlying instance is stopped), CAPI's machine drain gets stuck indefinitely. The drain uses the Kubernetes Eviction API, which respects PodDisruptionBudgets (PDBs). When pods on the unreachable node are covered by a PDB with minAvailable: 1 and currentHealthy: 0, the Eviction API returns 429 TooManyRequests and the drain retries forever.
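To make the 429 concrete, here is a minimal sketch of the budget arithmetic. It is a simplified model, not the disruption controller's code: `disruptionsAllowed` and `evictStatus` are hypothetical names, and the model assumes the default handling of unhealthy pods and ignores percentage-based budgets.

```go
package main

import (
	"fmt"
	"net/http"
)

// disruptionsAllowed is a simplified, hypothetical model of a PDB budget:
// how many pods may be evicted without dropping below minAvailable.
func disruptionsAllowed(currentHealthy, minAvailable int) int {
	if currentHealthy <= minAvailable {
		return 0
	}
	return currentHealthy - minAvailable
}

// evictStatus models the HTTP status an eviction request would receive
// under this simplified budget: 429 when no disruption is allowed.
func evictStatus(currentHealthy, minAvailable int) int {
	if disruptionsAllowed(currentHealthy, minAvailable) == 0 {
		return http.StatusTooManyRequests
	}
	return http.StatusOK
}

func main() {
	// The case from the description: minAvailable: 1, currentHealthy: 0.
	fmt.Println(evictStatus(0, 1)) // 429
	// A healthy workload with spare capacity can be evicted.
	fmt.Println(evictStatus(3, 1)) // 200
}
```

Because the pods on a stopped instance never become healthy again, the budget in this model stays at zero, so retrying the eviction cannot succeed.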
The existing code correctly detects unreachable nodes but relies on the taint manager to bypass PDBs. When the instance is stopped or terminated, the kubelet is not running and cannot complete the deletions the taint manager issues, so the pods linger indefinitely.
This PR adds a DisableEviction option to the drain helper that deletes pods directly instead of going through the Eviction API. The option is set when the node is unreachable: the pods are not actually running there, so PDB protection is not meaningful.
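The decision can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's diff: `drainOptions`, `nodeCondition`, `nodeIsUnreachable`, `optionsFor`, and `removalMethod` are hypothetical stand-ins, and only the `DisableEviction` name comes from the description. Modelling "unreachable" as a Ready condition with status Unknown is also an assumption about how the existing detection works.

```go
package main

import "fmt"

// nodeCondition is a hypothetical, trimmed-down stand-in for a node condition.
type nodeCondition struct {
	Type   string
	Status string
}

// drainOptions is a hypothetical stand-in for the drain helper's settings.
type drainOptions struct {
	// DisableEviction switches from the Eviction API to direct pod deletion.
	DisableEviction bool
}

// nodeIsUnreachable assumes an unreachable node reports Ready=Unknown.
// A node without a Ready condition is treated as reachable here.
func nodeIsUnreachable(conds []nodeCondition) bool {
	for _, c := range conds {
		if c.Type == "Ready" {
			return c.Status == "Unknown"
		}
	}
	return false
}

// optionsFor disables eviction only when the node is unreachable.
func optionsFor(conds []nodeCondition) drainOptions {
	return drainOptions{DisableEviction: nodeIsUnreachable(conds)}
}

// removalMethod reports which path a pod removal would take.
func removalMethod(o drainOptions) string {
	if o.DisableEviction {
		return "delete" // bypasses PDB checks
	}
	return "evict" // honours PDBs
}

func main() {
	stopped := []nodeCondition{{Type: "Ready", Status: "Unknown"}}
	healthy := []nodeCondition{{Type: "Ready", Status: "True"}}
	fmt.Println(removalMethod(optionsFor(stopped))) // delete
	fmt.Println(removalMethod(optionsFor(healthy))) // evict
}
```

Keeping eviction as the default means reachable nodes still honour PDBs; only the unreachable case takes the direct-deletion path.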
/area machine