Fix pod network rollback issue 911#1032
Fix pod network rollback issue 911#1032Lokesh2Arvind wants to merge 2 commits intokrkn-chaos:mainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This PR addresses pod network rollback issue #911 by implementing a stateless rollback mechanism for pod network chaos scenarios. The changes enable automatic cleanup of network configurations when chaos experiments fail or complete.
Key Changes
- Adds
rollback_pod_network_outagefunction to clean up OpenFlow rules, IFB interfaces, and modtools pods - Integrates rollback registration into three network chaos scenarios: pod_network_outage, pod_egress_shaping, and pod_ingress_shaping
- Adds comprehensive unit tests for the rollback functionality
- Fixes package directory configuration from "kraken" to "krkn" in setup.cfg
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 17 comments.
| File | Description |
|---|---|
krkn/scenario_plugins/native/pod_network_outage/pod_network_outage_plugin.py |
Implements stateless rollback function and integrates rollback handlers into three network chaos scenarios; adds helper functions for UUID resolution and handler initialization |
tests/rollback_scenario_plugins/test_pod_network_outage_rollback.py |
New test file validating rollback cleanup operations including pod creation, command execution, and modtools pod deletion |
setup.cfg |
Corrects package directory from "kraken" to "krkn" |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
krkn/scenario_plugins/native/pod_network_outage/pod_network_outage_plugin.py
Outdated
Show resolved
Hide resolved
krkn/scenario_plugins/native/pod_network_outage/pod_network_outage_plugin.py
Outdated
Show resolved
Hide resolved
krkn/scenario_plugins/native/pod_network_outage/pod_network_outage_plugin.py
Outdated
Show resolved
Hide resolved
krkn/scenario_plugins/native/pod_network_outage/pod_network_outage_plugin.py
Outdated
Show resolved
Hide resolved
krkn/scenario_plugins/native/pod_network_outage/pod_network_outage_plugin.py
Outdated
Show resolved
Hide resolved
tests/rollback_scenario_plugins/test_pod_network_outage_rollback.py
Outdated
Show resolved
Hide resolved
krkn/scenario_plugins/native/pod_network_outage/pod_network_outage_plugin.py
Outdated
Show resolved
Hide resolved
krkn/scenario_plugins/native/pod_network_outage/pod_network_outage_plugin.py
Outdated
Show resolved
Hide resolved
krkn/scenario_plugins/native/pod_network_outage/pod_network_outage_plugin.py
Outdated
Show resolved
Hide resolved
krkn/scenario_plugins/native/pod_network_outage/pod_network_outage_plugin.py
Outdated
Show resolved
Hide resolved
|
@Lokesh2Arvind Thank you very much for your contributions, could you please solve the issues identified by copilot? Thanks a lot! |
|
Sure @tsebastiani , I will try to fix the pending errors as soon as possible. |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
krkn/scenario_plugins/native/pod_network_outage/pod_network_outage_plugin.py
Outdated
Show resolved
Hide resolved
krkn/scenario_plugins/native/pod_network_outage/pod_network_outage_plugin.py
Outdated
Show resolved
Hide resolved
krkn/scenario_plugins/native/pod_network_outage/pod_network_outage_plugin.py
Outdated
Show resolved
Hide resolved
4a59a9d to
b2a81f1
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 7 out of 8 changed files in this pull request and generated 14 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # Register rollback before applying chaos | ||
| run_uuid = _resolve_run_uuid(params.kraken_config) | ||
| rollback_handler = _init_rollback_handler(run_uuid, SCENARIO_TYPE) | ||
| rollback_handler.set_rollback_callable( | ||
| rollback_pod_network_outage, | ||
| RollbackContent( | ||
| namespace=test_namespace, | ||
| resource_identifier=ROLLBACK_RESOURCE_IDENTIFIER, | ||
| ), | ||
| ) |
There was a problem hiding this comment.
The test creates 3 rollback handlers (one per scenario function: pod_outage, pod_egress_shaping, pod_ingress_shaping) with the same run_uuid and scenario_type. This could lead to duplicate rollback registrations if multiple scenario types are executed in the same run. Each invocation creates a new handler and registers the same rollback function, potentially causing the rollback to be executed multiple times. Consider registering the rollback once per run, or ensuring the handler doesn't register duplicate callbacks.
| # Register rollback before applying chaos | ||
| run_uuid = _resolve_run_uuid(params.kraken_config) | ||
| rollback_handler = _init_rollback_handler(run_uuid, SCENARIO_TYPE) | ||
| rollback_handler.set_rollback_callable( | ||
| rollback_pod_network_outage, | ||
| RollbackContent( | ||
| namespace=test_namespace, | ||
| resource_identifier=ROLLBACK_RESOURCE_IDENTIFIER, | ||
| ), | ||
| ) |
There was a problem hiding this comment.
The test creates 3 rollback handlers (one per scenario function: pod_outage, pod_egress_shaping, pod_ingress_shaping) with the same run_uuid and scenario_type. This could lead to duplicate rollback registrations if multiple scenario types are executed in the same run. Each invocation creates a new handler and registers the same rollback function, potentially causing the rollback to be executed multiple times. Consider registering the rollback once per run, or ensuring the handler doesn't register duplicate callbacks.
krkn/scenario_plugins/native/pod_network_outage/pod_network_outage_plugin.py
Show resolved
Hide resolved
krkn/scenario_plugins/native/pod_network_outage/pod_network_outage_plugin.py
Show resolved
Hide resolved
krkn/scenario_plugins/native/pod_network_outage/pod_network_outage_plugin.py
Show resolved
Hide resolved
| except Exception: | ||
| logging.exception(f"[Rollback] Failed during cleanup operations on node '{node}'") | ||
|
|
||
| finally: | ||
| # Delete cleanup pod only if it was created | ||
| if pod_created: | ||
| try: | ||
| logging.info(f"[Rollback] Deleting cleanup pod: {cleanup_pod}") | ||
| kubecli.delete_pod(cleanup_pod, namespace="default") | ||
| except Exception: | ||
| logging.exception(f"[Rollback] Failed to delete cleanup pod '{cleanup_pod}'") |
There was a problem hiding this comment.
The rollback function swallows exceptions with broad 'except Exception' clauses and only logs them. While this prevents the rollback from stopping on individual node failures, it may hide important errors. Consider collecting and returning a summary of failures so callers can determine if the rollback was partially successful and take appropriate action.
431cdf7 to
1f7d74d
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 21 out of 23 changed files in this pull request and generated 6 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
krkn/scenario_plugins/native/pod_network_outage/pod_network_outage_plugin.py
Show resolved
Hide resolved
1f7d74d to
fd554e9
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 3 out of 4 changed files in this pull request and generated 10 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
krkn/scenario_plugins/native/pod_network_outage/pod_network_outage_plugin.py
Outdated
Show resolved
Hide resolved
krkn/scenario_plugins/native/pod_network_outage/pod_network_outage_plugin.py
Outdated
Show resolved
Hide resolved
krkn/scenario_plugins/native/pod_network_outage/pod_network_outage_plugin.py
Outdated
Show resolved
Hide resolved
krkn/scenario_plugins/native/pod_network_outage/pod_network_outage_plugin.py
Show resolved
Hide resolved
krkn/scenario_plugins/native/pod_network_outage/pod_network_outage_plugin.py
Show resolved
Hide resolved
krkn/scenario_plugins/native/pod_network_outage/pod_network_outage_plugin.py
Show resolved
Hide resolved
krkn/scenario_plugins/native/pod_network_outage/pod_network_outage_plugin.py
Outdated
Show resolved
Hide resolved
fd554e9 to
fa3ebf4
Compare
…h testing - Add rollback handler caching to prevent duplicate registrations across pod_outage, pod_egress_shaping, and pod_ingress_shaping - Implement stateless rollback_pod_network_outage function with pod readiness checks (60s timeout) - Validate node names (DNS-1123) before creating privileged cleanup pods - Clean OpenFlow rules (priority=65535) on br-int and br0 bridges - Unload IFB kernel module to remove ingress shaping virtual interfaces - Delete leftover modtools pods created during chaos execution - Add UUID-based fallback for run_uuid generation - Add missing image argument to apply_net_policy function - Improve error handling with exception logging and tracebacks - Add comprehensive unit tests for rollback logic with mock pod readiness - Update .gitignore to ignore generated CI/config YAML files (keep only common_test_config.yaml) Fixes krkn-chaos#911 Signed-off-by: Lokesh2Arvind <lokiarvstud.gn@gmail.com>
fa3ebf4 to
649bbff
Compare
Type of change
Description
This PR adds a stateless rollback mechanism for the
pod_network_outagescenario.Previously, this scenario had no rollback support, which could leave behind:
ovs-ofctl) rulesmodtools-*podsSince the engine does not track detailed flow metadata, the rollback does not attempt to reverse the exact rules created.
Instead, it performs a safe best-effort cleanup that restores cluster networking to a clean baseline.
✔ Key Improvements
rollback_pod_network_outagefunction65535modprobe -r ifb)modtools-*podsThis significantly improves the reliability of the chaos scenario and prevents networking side effects after failures.
Related Tickets & Documents
Documentation
If checked, a documentation PR must be created and merged in the website repository.
Related Documentation PR (if applicable)
Checklist before requesting a review