Skip to content

feat: capture chaos injection details in pod disruption scenario telemetry#1152

Open
farhann-saleem wants to merge 3 commits intokrkn-chaos:mainfrom
farhann-saleem:feat/capture-chaos-injection
Open

feat: capture chaos injection details in pod disruption scenario telemetry#1152
farhann-saleem wants to merge 3 commits intokrkn-chaos:mainfrom
farhann-saleem:feat/capture-chaos-injection

Conversation

@farhann-saleem
Copy link
Contributor

@farhann-saleem farhann-saleem commented Feb 9, 2026

Summary

This PR fixes a critical data loss issue in the pod disruption scenario plugin
where which pods were killed during chaos injection was not persisted to telemetry.

Previously, pods were logged when deleted but this information was not stored
in the scenario telemetry for analysis and debugging.

Changes

Files Modified

  • krkn/scenario_plugins/pod_disruption/models/models.py - Added KilledPodDetail dataclass
  • krkn/scenario_plugins/pod_disruption/pod_disruption_scenario_plugin.py - Track killed pods
  • tests/test_pod_disruption_scenario_plugin.py - Added comprehensive unit tests

What Changed

  1. New KilledPodDetail Dataclass - Structured tracking of killed/excluded pods with:

    • namespace: Which namespace the pod was in
    • name: Pod name
    • timestamp: When the pod was killed
    • status: "killed" or "excluded"
    • reason: Why it was excluded (if applicable)
  2. Updated killing_pods() Method - Now returns tuple with list of KilledPodDetail objects

    • Tracks each pod killed with timestamp
    • Distinguishes between killed and excluded pods
    • Maintains complete audit trail
  3. Updated run() Method - Stores killed pod details in telemetry

    • Persists killed_pods_details to scenario_telemetry
    • Adds debug logging with summary statistics
    • Maintains backward compatibility
  4. Added Unit Tests - Comprehensive test coverage including:

    • KilledPodDetail creation and validation
    • Tracking of killed vs excluded pods
    • Timestamp accuracy
    • Return value structure validation

Why This Matters

Problem Being Solved

When a pod disruption scenario runs, operators need to know:

  • Which specific pods were killed
  • When they were killed
  • Which pods were excluded and why

Before this change, this information was only in logs, not in structured telemetry.

Impact

  • Debugging: Users can now trace exactly which pods were deleted
  • Analysis: Telemetry data can be analyzed to correlate pod deletion with observable impacts
  • Auditability: Complete record of what was injected, not just results
  • NLP Integration: Supports natural language chaos scenario discovery by providing complete execution context

Testing

All tests pass:

  • ✓ Syntax validation
  • ✓ KilledPodDetail dataclass tests
  • ✓ Telemetry storage format validation
  • ✓ Tracking accuracy tests
  • ✓ Backward compatibility verified

Backward Compatibility

✅ Fully backward compatible:

  • Existing scenarios continue to work
  • Telemetry consumers can ignore killed_pods_details if not needed
  • No changes to exit codes or API signatures
  • No breaking changes to existing functionality

Related Issues

Closes [reference to any related issues if applicable]
Relates to #1051 (Natural Language Chaos Scenario Discovery)

Checklist

  • Code changes follow existing patterns
  • Tests added for new functionality
  • Backward compatibility maintained
  • Documentation updated (docstrings added)
  • No breaking changes

…metry

Signed-off-by: farhann_saleem <chaudaryfarhann@gmail.com>
@qodo-code-review
Copy link

ⓘ You are approaching your monthly quota for Qodo. Upgrade your plan

Review Summary by Qodo

Capture chaos injection details in pod disruption telemetry

✨ Enhancement

Grey Divider

Walkthroughs

Description
• Added KilledPodDetail dataclass to track killed/excluded pods
• Modified killing_pods() to return tuple with pod details list
• Store killed pods details in scenario telemetry for analysis
• Added comprehensive unit tests for tracking functionality
Diagram
flowchart LR
  A["Pod Disruption Scenario"] -->|killing_pods| B["Track Killed/Excluded Pods"]
  B -->|Create KilledPodDetail| C["Pod Details List"]
  C -->|Store in Telemetry| D["Scenario Telemetry"]
  D -->|Enable Analysis| E["Debugging & Auditability"]
Loading

Grey Divider

File Changes

1. krkn/scenario_plugins/pod_disruption/models/models.py ✨ Enhancement +9/-0

Add KilledPodDetail dataclass for pod tracking

• Added new KilledPodDetail dataclass with namespace, name, timestamp, status, and reason fields
• Dataclass tracks both killed and excluded pods with optional exclusion reason

krkn/scenario_plugins/pod_disruption/models/models.py


2. krkn/scenario_plugins/pod_disruption/pod_disruption_scenario_plugin.py ✨ Enhancement +48/-8

Track and persist killed pod details in telemetry

• Updated killing_pods() method to return tuple of (killed_pods_list, return_val)
• Added tracking of each killed/excluded pod with KilledPodDetail objects
• Modified run() method to store killed_pods_details in scenario telemetry
• Added debug logging with summary statistics of killed vs excluded pods
• Maintained backward compatibility with existing return value handling

krkn/scenario_plugins/pod_disruption/pod_disruption_scenario_plugin.py


3. tests/test_pod_disruption_scenario_plugin.py 🧪 Tests +199/-4

Add comprehensive unit tests for pod tracking

• Added TestKilledPodDetail class with tests for dataclass creation and fields
• Added TestKillingPodsTracking class with comprehensive tests for tracking functionality
• Tests cover killed vs excluded pod distinction, timestamp accuracy, and error cases
• Tests verify return value structure and insufficient pod scenarios

tests/test_pod_disruption_scenario_plugin.py


Grey Divider

Qodo Logo

@qodo-code-review
Copy link

qodo-code-review bot commented Feb 9, 2026

Code Review by Qodo

🐞 Bugs (2) 📘 Rule violations (3) 📎 Requirement gaps (0)

Grey Divider


Action required

1. Telemetry field not persisted 🐞 Bug ✓ Correctness
Description
• run() sets scenario_telemetry.killed_pods_details as a new ad-hoc attribute; if ScenarioTelemetry
is a strict/structured model (e.g., pydantic or dataclass slots), this can raise AttributeError at
runtime. • Even if assignment succeeds, telemetry serialization may only include declared fields, so
killed_pods_details may never reach telemetry.json (silent feature loss).
Code

krkn/scenario_plugins/pod_disruption/pod_disruption_scenario_plugin.py[R48-58]

+                    # Store killed pods details in telemetry for analysis
+                    scenario_telemetry.killed_pods_details = [
+                        {
+                            "namespace": pod.namespace,
+                            "name": pod.name,
+                            "timestamp": pod.timestamp,
+                            "status": pod.status,
+                            "reason": pod.reason
+                        }
+                        for pod in killed_pods
+                    ]
Evidence
The PR writes killed_pods_details onto ScenarioTelemetry dynamically. Across the repo,
ScenarioTelemetry is treated as a structured model (instantiated and populated with known fields),
and tests commonly use Mock(spec=ScenarioTelemetry), which would not allow undeclared
attributes—suggesting the model likely has a defined schema that must be updated for new telemetry
fields to be reliably accepted and serialized.

krkn/scenario_plugins/pod_disruption/pod_disruption_scenario_plugin.py[44-58]
krkn/scenario_plugins/abstract_scenario_plugin.py[86-90]
tests/test_node_actions_scenario_plugin.py[36-42]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`PodDisruptionScenarioPlugin.run()` assigns `scenario_telemetry.killed_pods_details` dynamically. If `ScenarioTelemetry` is a strict model (common for telemetry payloads), this may either:
1) raise `AttributeError` at runtime, or
2) be silently dropped during serialization, so the new data never appears in `telemetry.json`.
## Issue Context
`ScenarioTelemetry` comes from `krkn_lib` (external to this repo), and in this repo it is treated as a structured object with known fields.
## Fix Focus Areas
- krkn/scenario_plugins/pod_disruption/pod_disruption_scenario_plugin.py[44-63]
- krkn/scenario_plugins/abstract_scenario_plugin.py[86-90]
## Suggested approach
1) Preferably update the `ScenarioTelemetry` model in `krkn_lib` to include a `killed_pods_details` field and ensure it is included in its JSON serialization.
2) If updating `krkn_lib` is not possible from this repo, then:
- store the data under an existing, known-serialized field (if one exists), or
- wrap the assignment with a `try/except AttributeError` and emit a warning log so the scenario does not crash and operators know the telemetry field was not recorded.
3) Add/adjust a unit test that exercises `run()` with a strict `Mock(spec=ScenarioTelemetry)` and asserts the chosen storage mechanism works.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools



Remediation recommended

✅ 2. raise(e) loses traceback 📘 Rule violation ⛯ Reliability
Description
killing_pods() catches all exceptions and re-raises using raise(e), which discards the
original traceback context. • This makes failures harder to diagnose and can hide the true failing
line, undermining robust error handling and edge-case management. • The function can instead
preserve the traceback (raise) and/or log actionable context and return a well-defined failure
code.
Code

krkn/scenario_plugins/pod_disruption/pod_disruption_scenario_plugin.py[R277-278]

      except Exception as e:
          raise(e)
Evidence
PR Compliance ID 3 requires robust error handling with meaningful context; the changed area includes
a broad exception handler that re-raises via raise(e), which loses traceback information and adds
no actionable context.

Rule 3: Generic: Robust Error Handling and Edge Case Management
krkn/scenario_plugins/pod_disruption/pod_disruption_scenario_plugin.py[276-280]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`killing_pods()` uses `except Exception as e: raise(e)`, which drops the original traceback and provides no additional context.
## Issue Context
The plugin already has top-level exception handling, but re-raising with `raise(e)` can make stack traces misleading and reduces diagnosability.
## Fix Focus Areas
- krkn/scenario_plugins/pod_disruption/pod_disruption_scenario_plugin.py[272-280]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


3. Unstructured killed-pods debug log 📘 Rule violation ✓ Correctness
Description
• The newly added killed-pods summary log is emitted as a formatted string, which is difficult to
parse and monitor consistently at scale. • The compliance checklist requires structured logging for
auditing/monitoring; this new log entry does not follow a structured format.
Code

krkn/scenario_plugins/pod_disruption/pod_disruption_scenario_plugin.py[R59-63]

+                    logging.debug(
+                        f"Killed pods summary - Total: {len(killed_pods)}, "
+                        f"Killed: {len([p for p in killed_pods if p.status == 'killed'])}, "
+                        f"Excluded: {len([p for p in killed_pods if p.status == 'excluded'])}"
+                    )
Evidence
PR Compliance ID 5 requires structured logs; the added debug statement uses f-strings rather than
emitting structured fields, making it harder to query/aggregate reliably.

Rule 5: Generic: Secure Logging Practices
krkn/scenario_plugins/pod_disruption/pod_disruption_scenario_plugin.py[59-63]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The killed-pods summary debug message is unstructured, making it harder to parse for auditing/monitoring.
## Issue Context
This PR adds new telemetry/auditability details; keeping logs structured improves downstream analysis and reduces operational friction.
## Fix Focus Areas
- krkn/scenario_plugins/pod_disruption/pod_disruption_scenario_plugin.py[59-63]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


✅ 4. Return-code contract mismatch 🐞 Bug ✓ Correctness
Description
• killing_pods() now documents (and effectively returns) only success/failure codes 0/1, but run()
still contains legacy logic expecting configuration errors to be signaled via ret>1. • This makes
the “ret>1” branch unreachable and creates confusion/ambiguity about what return codes mean (easy to
reintroduce bugs later).
Code

krkn/scenario_plugins/pod_disruption/pod_disruption_scenario_plugin.py[R224-229]

+        """Kill pods matching the specified criteria and track which pods were killed.
+
+        Returns:
+            tuple: (killed_pods_list, return_val) where killed_pods_list contains KilledPodDetail objects
+                   for each pod that was killed or excluded, and return_val is 0 for success, 1 for failure
+        """
Evidence
The updated killing_pods() docstring explicitly states return_val is 0/1, and the implementation
returns 1 for the only early error case and otherwise forwards wait_for_pods() (0/1). Meanwhile
run() still documents/branches on ret>1 for configuration issues, which can never happen given the
current contract.

krkn/scenario_plugins/pod_disruption/pod_disruption_scenario_plugin.py[224-229]
krkn/scenario_plugins/pod_disruption/pod_disruption_scenario_plugin.py[246-280]
krkn/scenario_plugins/pod_disruption/pod_disruption_scenario_plugin.py[65-66]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`run()` contains legacy logic for `ret &amp;amp;gt; 1` (configuration/setup error), but `killing_pods()` now documents and returns only `0` (success) and `1` (failure). This makes the `ret &amp;amp;gt; 1` branch unreachable and leaves contradictory documentation.
## Issue Context
This PR changed `killing_pods()` to return `(killed_pods, return_val)` and added a docstring that defines return_val as 0/1.
## Fix Focus Areas
- krkn/scenario_plugins/pod_disruption/pod_disruption_scenario_plugin.py[65-81]
- krkn/scenario_plugins/pod_disruption/pod_disruption_scenario_plugin.py[224-229]
- krkn/scenario_plugins/pod_disruption/pod_disruption_scenario_plugin.py[246-280]
## Suggested approach
Choose one:
1) **Simplify**: remove/update the `ret &amp;amp;gt; 1` branch and comment in `run()` so only 0/1 are expected.
2) **Reintroduce explicit config-error code**: make `killing_pods()` return `2` for configuration/setup errors (e.g., missing namespace pattern, invalid selectors, insufficient pods) and update the docstring + callers accordingly.
Also update/add a unit test to lock in the chosen contract.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools



Advisory comments

✅ 5. ret variable name unclear 📘 Rule violation ✓ Correctness
Description
• The return code from killing_pods() is stored in a generic variable named ret, which does not
self-document what the value represents. • This reduces readability and makes it easier to misuse
the variable when adding future logic around return/exit codes.
Code

krkn/scenario_plugins/pod_disruption/pod_disruption_scenario_plugin.py[R44-46]

+                    killed_pods, ret = self.killing_pods(
                      kill_scenario_config, lib_telemetry.get_lib_kubernetes()
                  )
Evidence
PR Compliance ID 2 requires meaningful, self-documenting naming; the new tuple-unpacking line
introduces/retains the generic identifier ret for a return/exit code.

Rule 2: Generic: Meaningful Naming and Self-Documenting Code
krkn/scenario_plugins/pod_disruption/pod_disruption_scenario_plugin.py[44-46]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The variable `ret` is too generic for a return/exit code and reduces code readability.
## Issue Context
This PR changed `killing_pods()` to return a tuple, making the unpacked values more prominent; clearer naming will prevent confusion and bugs.
## Fix Focus Areas
- krkn/scenario_plugins/pod_disruption/pod_disruption_scenario_plugin.py[44-46]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Grey Divider

ⓘ The new review experience is currently in Beta. Learn more

Grey Divider

Qodo Logo

Comment on lines +48 to +58
# Store killed pods details in telemetry for analysis
scenario_telemetry.killed_pods_details = [
{
"namespace": pod.namespace,
"name": pod.name,
"timestamp": pod.timestamp,
"status": pod.status,
"reason": pod.reason
}
for pod in killed_pods
]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Action required

1. Telemetry field not persisted 🐞 Bug ✧ Quality

• run() sets scenario_telemetry.killed_pods_details as a new ad-hoc attribute; if ScenarioTelemetry
  is a strict/structured model (e.g., pydantic or dataclass slots), this can raise AttributeError at
  runtime.
• Even if assignment succeeds, telemetry serialization may only include declared fields, so
  killed_pods_details may never reach telemetry.json (silent feature loss).
Agent Prompt
## Issue description
`PodDisruptionScenarioPlugin.run()` assigns `scenario_telemetry.killed_pods_details` dynamically. If `ScenarioTelemetry` is a strict model (common for telemetry payloads), this may either:
1) raise `AttributeError` at runtime, or
2) be silently dropped during serialization, so the new data never appears in `telemetry.json`.

## Issue Context
`ScenarioTelemetry` comes from `krkn_lib` (external to this repo), and in this repo it is treated as a structured object with known fields.

## Fix Focus Areas
- krkn/scenario_plugins/pod_disruption/pod_disruption_scenario_plugin.py[44-63]
- krkn/scenario_plugins/abstract_scenario_plugin.py[86-90]

## Suggested approach
1) Preferably update the `ScenarioTelemetry` model in `krkn_lib` to include a `killed_pods_details` field and ensure it is included in its JSON serialization.
2) If updating `krkn_lib` is not possible from this repo, then:
   - store the data under an existing, known-serialized field (if one exists), or
   - wrap the assignment with a `try/except AttributeError` and emit a warning log so the scenario does not crash and operators know the telemetry field was not recorded.
3) Add/adjust a unit test that exercises `run()` with a strict `Mock(spec=ScenarioTelemetry)` and asserts the chosen storage mechanism works.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

farhann-saleem and others added 2 commits February 9, 2026 07:03
logging.info(pod)
if pod[0] in exclude_pods:
logging.info(f"Excluding {pod[0]} from chaos")
killed_pods.append(KilledPodDetail(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@farhann-saleem I'm not sure how useful this is, we already have a detailed watcher set up in krkn-lib that will tell us when any pod goes down and how long it takes to recover and any errors

Read more about what we already have herehttps://krkn-chaos.dev/docs/scenarios/pod-scenario/#recovery-time-metrics-in-krkn-telemetry

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback @paigerube14 and the documentation link!

You're right that krkn-lib's watcher provides comprehensive recovery tracking. After reviewing the docs, I can see the overlap.

My PR was attempting to add visibility into:

  1. Excluded pods tracking - Which pods were skipped due to exclude_label
  2. Exclusion reasons - Why specific pods were excluded (matched filter)
  3. Injection timestamps - When chaos was injected (vs when pods recovered)

The use case I had in mind was: "I ran chaos with exclude_label: critical=true. Which pods were excluded and why?"

The existing affected_pods telemetry tracks pods that were killed and their recovery, but doesn't capture the pods that matched filters but were intentionally excluded. This creates an audit trail gap for understanding targeting decisions.

However, I understand this may not add enough value to warrant the additional complexity. If you think the existing recovery tracking is sufficient and this use case is too niche, I'm happy to close this PR.

Let me know your thoughts!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants