Skip to content

Fix pod network rollback issue 911#1032

Open
Lokesh2Arvind wants to merge 2 commits intokrkn-chaos:mainfrom
Lokesh2Arvind:fix-pod-network-rollback-issue-911
Open

Fix pod network rollback issue 911#1032
Lokesh2Arvind wants to merge 2 commits intokrkn-chaos:mainfrom
Lokesh2Arvind:fix-pod-network-rollback-issue-911

Conversation

@Lokesh2Arvind
Copy link

@Lokesh2Arvind Lokesh2Arvind commented Dec 27, 2025

Type of change

  • Refactor
  • New feature
  • Bug fix
  • Optimization

Description

This PR adds a stateless rollback mechanism for the pod_network_outage scenario.
Previously, this scenario had no rollback support, which could leave behind:

  • OpenFlow (ovs-ofctl) rules
  • IFB interfaces
  • Temporary modtools-* pods

Since the engine does not track detailed flow metadata, the rollback does not attempt to reverse the exact rules created.
Instead, it performs a safe best-effort cleanup that restores cluster networking to a clean baseline.

✔ Key Improvements

  • Added rollback_pod_network_outage function
  • Removes all OpenFlow rules with priority 65535
  • Removes IFB interfaces (modprobe -r ifb)
  • Cleans up leftover modtools-* pods
  • Uses temporary privileged cleanup pods to operate on nodes
  • Added unit test for rollback behavior
  • Addressed Copilot/IDE suggestions (unused imports, misplaced rollback calls, duplicate registration, formatting fixes)

This significantly improves the reliability of the chaos scenario and prevents networking side effects after failures.

Related Tickets & Documents

Documentation

  • Is documentation needed for this update?

If checked, a documentation PR must be created and merged in the website repository.

Related Documentation PR (if applicable)

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses pod network rollback issue #911 by implementing a stateless rollback mechanism for pod network chaos scenarios. The changes enable automatic cleanup of network configurations when chaos experiments fail or complete.

Key Changes

  • Adds rollback_pod_network_outage function to clean up OpenFlow rules, IFB interfaces, and modtools pods
  • Integrates rollback registration into three network chaos scenarios: pod_network_outage, pod_egress_shaping, and pod_ingress_shaping
  • Adds comprehensive unit tests for the rollback functionality
  • Fixes package directory configuration from "kraken" to "krkn" in setup.cfg

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 17 comments.

File Description
krkn/scenario_plugins/native/pod_network_outage/pod_network_outage_plugin.py Implements stateless rollback function and integrates rollback handlers into three network chaos scenarios; adds helper functions for UUID resolution and handler initialization
tests/rollback_scenario_plugins/test_pod_network_outage_rollback.py New test file validating rollback cleanup operations including pod creation, command execution, and modtools pod deletion
setup.cfg Corrects package directory from "kraken" to "krkn"

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@tsebastiani
Copy link
Collaborator

@Lokesh2Arvind Thank you very much for your contributions, could you please solve the issues identified by copilot? Thanks a lot!

@Lokesh2Arvind
Copy link
Author

Sure @tsebastiani , I will try to fix the pending errors as soon as possible.
Thank you

Copilot AI review requested due to automatic review settings January 3, 2026 06:48
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 8 changed files in this pull request and generated 14 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +1422 to +1431
# Register rollback before applying chaos
run_uuid = _resolve_run_uuid(params.kraken_config)
rollback_handler = _init_rollback_handler(run_uuid, SCENARIO_TYPE)
rollback_handler.set_rollback_callable(
rollback_pod_network_outage,
RollbackContent(
namespace=test_namespace,
resource_identifier=ROLLBACK_RESOURCE_IDENTIFIER,
),
)
Copy link

Copilot AI Jan 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test creates 3 rollback handlers (one per scenario function: pod_outage, pod_egress_shaping, pod_ingress_shaping) with the same run_uuid and scenario_type. This could lead to duplicate rollback registrations if multiple scenario types are executed in the same run. Each invocation creates a new handler and registers the same rollback function, potentially causing the rollback to be executed multiple times. Consider registering the rollback once per run, or ensuring the handler doesn't register duplicate callbacks.

Copilot uses AI. Check for mistakes.
Comment on lines +1719 to +1728
# Register rollback before applying chaos
run_uuid = _resolve_run_uuid(params.kraken_config)
rollback_handler = _init_rollback_handler(run_uuid, SCENARIO_TYPE)
rollback_handler.set_rollback_callable(
rollback_pod_network_outage,
RollbackContent(
namespace=test_namespace,
resource_identifier=ROLLBACK_RESOURCE_IDENTIFIER,
),
)
Copy link

Copilot AI Jan 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test creates 3 rollback handlers (one per scenario function: pod_outage, pod_egress_shaping, pod_ingress_shaping) with the same run_uuid and scenario_type. This could lead to duplicate rollback registrations if multiple scenario types are executed in the same run. Each invocation creates a new handler and registers the same rollback function, potentially causing the rollback to be executed multiple times. Consider registering the rollback once per run, or ensuring the handler doesn't register duplicate callbacks.

Copilot uses AI. Check for mistakes.
Comment on lines +1917 to +1927
except Exception:
logging.exception(f"[Rollback] Failed during cleanup operations on node '{node}'")

finally:
# Delete cleanup pod only if it was created
if pod_created:
try:
logging.info(f"[Rollback] Deleting cleanup pod: {cleanup_pod}")
kubecli.delete_pod(cleanup_pod, namespace="default")
except Exception:
logging.exception(f"[Rollback] Failed to delete cleanup pod '{cleanup_pod}'")
Copy link

Copilot AI Jan 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rollback function swallows exceptions with broad 'except Exception' clauses and only logs them. While this prevents the rollback from stopping on individual node failures, it may hide important errors. Consider collecting and returning a summary of failures so callers can determine if the rollback was partially successful and take appropriate action.

Copilot uses AI. Check for mistakes.
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 21 out of 23 changed files in this pull request and generated 6 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 4 changed files in this pull request and generated 10 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@Lokesh2Arvind Lokesh2Arvind force-pushed the fix-pod-network-rollback-issue-911 branch from fd554e9 to fa3ebf4 Compare January 7, 2026 20:27
…h testing

- Add rollback handler caching to prevent duplicate registrations across pod_outage, pod_egress_shaping, and pod_ingress_shaping
- Implement stateless rollback_pod_network_outage function with pod readiness checks (60s timeout)
- Validate node names (DNS-1123) before creating privileged cleanup pods
- Clean OpenFlow rules (priority=65535) on br-int and br0 bridges
- Unload IFB kernel module to remove ingress shaping virtual interfaces
- Delete leftover modtools pods created during chaos execution
- Add UUID-based fallback for run_uuid generation
- Add missing image argument to apply_net_policy function
- Improve error handling with exception logging and tracebacks
- Add comprehensive unit tests for rollback logic with mock pod readiness
- Update .gitignore to ignore generated CI/config YAML files (keep only common_test_config.yaml)

Fixes krkn-chaos#911

Signed-off-by: Lokesh2Arvind <lokiarvstud.gn@gmail.com>
@Lokesh2Arvind Lokesh2Arvind force-pushed the fix-pod-network-rollback-issue-911 branch from fa3ebf4 to 649bbff Compare January 25, 2026 14:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Pod Network Scenario Rollback

4 participants