feat: capture chaos injection details in pod disruption scenario telemetry by farhann-saleem · Pull Request #1152 · krkn-chaos/krkn

farhann-saleem · 2026-02-09T01:42:06Z

Summary

This PR fixes a critical data loss issue in the pod disruption scenario plugin
where which pods were killed during chaos injection was not persisted to telemetry.

Previously, pods were logged when deleted but this information was not stored
in the scenario telemetry for analysis and debugging.

Changes

Files Modified

krkn/scenario_plugins/pod_disruption/models/models.py - Added KilledPodDetail dataclass
krkn/scenario_plugins/pod_disruption/pod_disruption_scenario_plugin.py - Track killed pods
tests/test_pod_disruption_scenario_plugin.py - Added comprehensive unit tests

What Changed

New KilledPodDetail Dataclass - Structured tracking of killed/excluded pods with:
- namespace: Which namespace the pod was in
- name: Pod name
- timestamp: When the pod was killed
- status: "killed" or "excluded"
- reason: Why it was excluded (if applicable)
Updated killing_pods() Method - Now returns tuple with list of KilledPodDetail objects
- Tracks each pod killed with timestamp
- Distinguishes between killed and excluded pods
- Maintains complete audit trail
Updated run() Method - Stores killed pod details in telemetry
- Persists killed_pods_details to scenario_telemetry
- Adds debug logging with summary statistics
- Maintains backward compatibility
Added Unit Tests - Comprehensive test coverage including:
- KilledPodDetail creation and validation
- Tracking of killed vs excluded pods
- Timestamp accuracy
- Return value structure validation

Why This Matters

Problem Being Solved

When a pod disruption scenario runs, operators need to know:

Which specific pods were killed
When they were killed
Which pods were excluded and why

Before this change, this information was only in logs, not in structured telemetry.

Impact

Debugging: Users can now trace exactly which pods were deleted
Analysis: Telemetry data can be analyzed to correlate pod deletion with observable impacts
Auditability: Complete record of what was injected, not just results
NLP Integration: Supports natural language chaos scenario discovery by providing complete execution context

Testing

All tests pass:

✓ Syntax validation
✓ KilledPodDetail dataclass tests
✓ Telemetry storage format validation
✓ Tracking accuracy tests
✓ Backward compatibility verified

Backward Compatibility

✅ Fully backward compatible:

Existing scenarios continue to work
Telemetry consumers can ignore killed_pods_details if not needed
No changes to exit codes or API signatures
No breaking changes to existing functionality

Related Issues

Closes [reference to any related issues if applicable]
Relates to #1051 (Natural Language Chaos Scenario Discovery)

Checklist

Code changes follow existing patterns
Tests added for new functionality
Backward compatibility maintained
Documentation updated (docstrings added)
No breaking changes

…metry Signed-off-by: farhann_saleem <chaudaryfarhann@gmail.com>

qodo-code-review · 2026-02-09T01:42:20Z

ⓘ You are approaching your monthly quota for Qodo. Upgrade your plan

Review Summary by Qodo

Capture chaos injection details in pod disruption telemetry

✨ Enhancement

Walkthroughs

Description

• Added KilledPodDetail dataclass to track killed/excluded pods
• Modified killing_pods() to return tuple with pod details list
• Store killed pods details in scenario telemetry for analysis
• Added comprehensive unit tests for tracking functionality

Diagram

flowchart LR
  A["Pod Disruption Scenario"] -->|killing_pods| B["Track Killed/Excluded Pods"]
  B -->|Create KilledPodDetail| C["Pod Details List"]
  C -->|Store in Telemetry| D["Scenario Telemetry"]
  D -->|Enable Analysis| E["Debugging & Auditability"]

File Changes

1. krkn/scenario_plugins/pod_disruption/models/models.py ✨ Enhancement +9/-0

Add KilledPodDetail dataclass for pod tracking
• Added new KilledPodDetail dataclass with namespace, name, timestamp, status, and reason fields
• Dataclass tracks both killed and excluded pods with optional exclusion reason
krkn/scenario_plugins/pod_disruption/models/models.py

2. krkn/scenario_plugins/pod_disruption/pod_disruption_scenario_plugin.py ✨ Enhancement +48/-8

Track and persist killed pod details in telemetry

• Updated killing_pods() method to return tuple of (killed_pods_list, return_val)
• Added tracking of each killed/excluded pod with KilledPodDetail objects
• Modified run() method to store killed_pods_details in scenario telemetry
• Added debug logging with summary statistics of killed vs excluded pods
• Maintained backward compatibility with existing return value handling

krkn/scenario_plugins/pod_disruption/pod_disruption_scenario_plugin.py

3. tests/test_pod_disruption_scenario_plugin.py 🧪 Tests +199/-4

Add comprehensive unit tests for pod tracking

• Added TestKilledPodDetail class with tests for dataclass creation and fields
• Added TestKillingPodsTracking class with comprehensive tests for tracking functionality
• Tests cover killed vs excluded pod distinction, timestamp accuracy, and error cases
• Tests verify return value structure and insufficient pod scenarios

tests/test_pod_disruption_scenario_plugin.py

qodo-code-review · 2026-02-09T01:50:15Z

Code Review by Qodo

🐞 Bugs (2) 📘 Rule violations (3) 📎 Requirement gaps (0)

1. Telemetry field not persisted 🐞 Bug ✓ Correctness

Description

• run() sets scenario_telemetry.killed_pods_details as a new ad-hoc attribute; if ScenarioTelemetry
is a strict/structured model (e.g., pydantic or dataclass slots), this can raise AttributeError at
runtime. • Even if assignment succeeds, telemetry serialization may only include declared fields, so
killed_pods_details may never reach telemetry.json (silent feature loss).

Code

krkn/scenario_plugins/pod_disruption/pod_disruption_scenario_plugin.py[R48-58]

+                    # Store killed pods details in telemetry for analysis
+                    scenario_telemetry.killed_pods_details = [
+                        {
+                            "namespace": pod.namespace,
+                            "name": pod.name,
+                            "timestamp": pod.timestamp,
+                            "status": pod.status,
+                            "reason": pod.reason
+                        }
+                        for pod in killed_pods
+                    ]

Evidence
The PR writes killed_pods_details onto ScenarioTelemetry dynamically. Across the repo,
ScenarioTelemetry is treated as a structured model (instantiated and populated with known fields),
and tests commonly use Mock(spec=ScenarioTelemetry), which would not allow undeclared
attributes—suggesting the model likely has a defined schema that must be updated for new telemetry
fields to be reliably accepted and serialized.
krkn/scenario_plugins/pod_disruption/pod_disruption_scenario_plugin.py[44-58]
krkn/scenario_plugins/abstract_scenario_plugin.py[86-90]
tests/test_node_actions_scenario_plugin.py[36-42]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`PodDisruptionScenarioPlugin.run()` assigns `scenario_telemetry.killed_pods_details` dynamically. If `ScenarioTelemetry` is a strict model (common for telemetry payloads), this may either:
1) raise `AttributeError` at runtime, or
2) be silently dropped during serialization, so the new data never appears in `telemetry.json`.
## Issue Context
`ScenarioTelemetry` comes from `krkn_lib` (external to this repo), and in this repo it is treated as a structured object with known fields.
## Fix Focus Areas
- krkn/scenario_plugins/pod_disruption/pod_disruption_scenario_plugin.py[44-63]
- krkn/scenario_plugins/abstract_scenario_plugin.py[86-90]
## Suggested approach
1) Preferably update the `ScenarioTelemetry` model in `krkn_lib` to include a `killed_pods_details` field and ensure it is included in its JSON serialization.
2) If updating `krkn_lib` is not possible from this repo, then:
- store the data under an existing, known-serialized field (if one exists), or
- wrap the assignment with a `try/except AttributeError` and emit a warning log so the scenario does not crash and operators know the telemetry field was not recorded.
3) Add/adjust a unit test that exercises `run()` with a strict `Mock(spec=ScenarioTelemetry)` and asserts the chosen storage mechanism works.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

✅ 2. raise(e) loses traceback 📘 Rule violation ⛯ Reliability

Description

• killing_pods() catches all exceptions and re-raises using raise(e), which discards the
original traceback context. • This makes failures harder to diagnose and can hide the true failing
line, undermining robust error handling and edge-case management. • The function can instead
preserve the traceback (raise) and/or log actionable context and return a well-defined failure
code.

Code

krkn/scenario_plugins/pod_disruption/pod_disruption_scenario_plugin.py[R277-278]
      except Exception as e:
          raise(e)

Evidence
PR Compliance ID 3 requires robust error handling with meaningful context; the changed area includes
a broad exception handler that re-raises via raise(e), which loses traceback information and adds
no actionable context.
Rule 3: Generic: Robust Error Handling and Edge Case Management
krkn/scenario_plugins/pod_disruption/pod_disruption_scenario_plugin.py[276-280]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`killing_pods()` uses `except Exception as e: raise(e)`, which drops the original traceback and provides no additional context.
## Issue Context
The plugin already has top-level exception handling, but re-raising with `raise(e)` can make stack traces misleading and reduces diagnosability.
## Fix Focus Areas
- krkn/scenario_plugins/pod_disruption/pod_disruption_scenario_plugin.py[272-280]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

3. Unstructured killed-pods debug log 📘 Rule violation ✓ Correctness

Description

• The newly added killed-pods summary log is emitted as a formatted string, which is difficult to
parse and monitor consistently at scale. • The compliance checklist requires structured logging for
auditing/monitoring; this new log entry does not follow a structured format.

Code

krkn/scenario_plugins/pod_disruption/pod_disruption_scenario_plugin.py[R59-63]

+                    logging.debug(
+                        f"Killed pods summary - Total: {len(killed_pods)}, "
+                        f"Killed: {len([p for p in killed_pods if p.status == 'killed'])}, "
+                        f"Excluded: {len([p for p in killed_pods if p.status == 'excluded'])}"
+                    )

Evidence
PR Compliance ID 5 requires structured logs; the added debug statement uses f-strings rather than
emitting structured fields, making it harder to query/aggregate reliably.
Rule 5: Generic: Secure Logging Practices
krkn/scenario_plugins/pod_disruption/pod_disruption_scenario_plugin.py[59-63]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The killed-pods summary debug message is unstructured, making it harder to parse for auditing/monitoring.
## Issue Context
This PR adds new telemetry/auditability details; keeping logs structured improves downstream analysis and reduces operational friction.
## Fix Focus Areas
- krkn/scenario_plugins/pod_disruption/pod_disruption_scenario_plugin.py[59-63]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

✅ 4. ~~Return-code contract mismatch~~ 🐞 Bug ✓ Correctness

Description

• killing_pods() now documents (and effectively returns) only success/failure codes 0/1, but run()
still contains legacy logic expecting configuration errors to be signaled via ret>1. • This makes
the “ret>1” branch unreachable and creates confusion/ambiguity about what return codes mean (easy to
reintroduce bugs later).

Code

krkn/scenario_plugins/pod_disruption/pod_disruption_scenario_plugin.py[R224-229]

+        """Kill pods matching the specified criteria and track which pods were killed.
+
+        Returns:
+            tuple: (killed_pods_list, return_val) where killed_pods_list contains KilledPodDetail objects
+                   for each pod that was killed or excluded, and return_val is 0 for success, 1 for failure
+        """

Evidence
The updated killing_pods() docstring explicitly states return_val is 0/1, and the implementation
returns 1 for the only early error case and otherwise forwards wait_for_pods() (0/1). Meanwhile
run() still documents/branches on ret>1 for configuration issues, which can never happen given the
current contract.
krkn/scenario_plugins/pod_disruption/pod_disruption_scenario_plugin.py[224-229]
krkn/scenario_plugins/pod_disruption/pod_disruption_scenario_plugin.py[246-280]
krkn/scenario_plugins/pod_disruption/pod_disruption_scenario_plugin.py[65-66]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`run()` contains legacy logic for `ret &amp;amp;gt; 1` (configuration/setup error), but `killing_pods()` now documents and returns only `0` (success) and `1` (failure). This makes the `ret &amp;amp;gt; 1` branch unreachable and leaves contradictory documentation.
## Issue Context
This PR changed `killing_pods()` to return `(killed_pods, return_val)` and added a docstring that defines return_val as 0/1.
## Fix Focus Areas
- krkn/scenario_plugins/pod_disruption/pod_disruption_scenario_plugin.py[65-81]
- krkn/scenario_plugins/pod_disruption/pod_disruption_scenario_plugin.py[224-229]
- krkn/scenario_plugins/pod_disruption/pod_disruption_scenario_plugin.py[246-280]
## Suggested approach
Choose one:
1) **Simplify**: remove/update the `ret &amp;amp;gt; 1` branch and comment in `run()` so only 0/1 are expected.
2) **Reintroduce explicit config-error code**: make `killing_pods()` return `2` for configuration/setup errors (e.g., missing namespace pattern, invalid selectors, insufficient pods) and update the docstring + callers accordingly.
Also update/add a unit test to lock in the chosen contract.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

✅ 5. ret variable name unclear 📘 Rule violation ✓ Correctness

Description

• The return code from killing_pods() is stored in a generic variable named ret, which does not
self-document what the value represents. • This reduces readability and makes it easier to misuse
the variable when adding future logic around return/exit codes.

Code

krkn/scenario_plugins/pod_disruption/pod_disruption_scenario_plugin.py[R44-46]

+                    killed_pods, ret = self.killing_pods(
                      kill_scenario_config, lib_telemetry.get_lib_kubernetes()
                  )

Evidence
PR Compliance ID 2 requires meaningful, self-documenting naming; the new tuple-unpacking line
introduces/retains the generic identifier ret for a return/exit code.
Rule 2: Generic: Meaningful Naming and Self-Documenting Code
krkn/scenario_plugins/pod_disruption/pod_disruption_scenario_plugin.py[44-46]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The variable `ret` is too generic for a return/exit code and reduces code readability.
## Issue Context
This PR changed `killing_pods()` to return a tuple, making the unpacked values more prominent; clearer naming will prevent confusion and bugs.
## Fix Focus Areas
- krkn/scenario_plugins/pod_disruption/pod_disruption_scenario_plugin.py[44-46]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

ⓘ The new review experience is currently in Beta. Learn more

qodo-code-review · 2026-02-09T01:50:16Z

krkn/scenario_plugins/pod_disruption/pod_disruption_scenario_plugin.py

+                    # Store killed pods details in telemetry for analysis
+                    scenario_telemetry.killed_pods_details = [
+                        {
+                            "namespace": pod.namespace,
+                            "name": pod.name,
+                            "timestamp": pod.timestamp,
+                            "status": pod.status,
+                            "reason": pod.reason
+                        }
+                        for pod in killed_pods
+                    ]


1. Telemetry field not persisted 🐞 Bug ✧ Quality

• run() sets scenario_telemetry.killed_pods_details as a new ad-hoc attribute; if ScenarioTelemetry is a strict/structured model (e.g., pydantic or dataclass slots), this can raise AttributeError at runtime. • Even if assignment succeeds, telemetry serialization may only include declared fields, so killed_pods_details may never reach telemetry.json (silent feature loss).

Agent Prompt

## Issue description `PodDisruptionScenarioPlugin.run()` assigns `scenario_telemetry.killed_pods_details` dynamically. If `ScenarioTelemetry` is a strict model (common for telemetry payloads), this may either: 1) raise `AttributeError` at runtime, or 2) be silently dropped during serialization, so the new data never appears in `telemetry.json`. ## Issue Context `ScenarioTelemetry` comes from `krkn_lib` (external to this repo), and in this repo it is treated as a structured object with known fields. ## Fix Focus Areas - krkn/scenario_plugins/pod_disruption/pod_disruption_scenario_plugin.py[44-63] - krkn/scenario_plugins/abstract_scenario_plugin.py[86-90] ## Suggested approach 1) Preferably update the `ScenarioTelemetry` model in `krkn_lib` to include a `killed_pods_details` field and ensure it is included in its JSON serialization. 2) If updating `krkn_lib` is not possible from this repo, then: - store the data under an existing, known-serialized field (if one exists), or - wrap the assignment with a `try/except AttributeError` and emit a warning log so the scenario does not crash and operators know the telemetry field was not recorded. 3) Add/adjust a unit test that exercises `run()` with a strict `Mock(spec=ScenarioTelemetry)` and asserts the chosen storage mechanism works.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

… . unstructed kill-pods Signed-off-by: farhann_saleem <chaudaryfarhann@gmail.com>

paigerube14 · 2026-02-17T19:49:55Z

krkn/scenario_plugins/pod_disruption/pod_disruption_scenario_plugin.py

                logging.info(pod)
                if pod[0] in exclude_pods:
                    logging.info(f"Excluding {pod[0]} from chaos")
+                    killed_pods.append(KilledPodDetail(


@farhann-saleem I'm not sure how useful this is, we already have a detailed watcher set up in krkn-lib that will tell us when any pod goes down and how long it takes to recover and any errors

Read more about what we already have herehttps://krkn-chaos.dev/docs/scenarios/pod-scenario/#recovery-time-metrics-in-krkn-telemetry

Thanks for the feedback @paigerube14 and the documentation link!

You're right that krkn-lib's watcher provides comprehensive recovery tracking. After reviewing the docs, I can see the overlap.

My PR was attempting to add visibility into:

Excluded pods tracking - Which pods were skipped due to exclude_label

Exclusion reasons - Why specific pods were excluded (matched filter)

Injection timestamps - When chaos was injected (vs when pods recovered)

The use case I had in mind was: "I ran chaos with exclude_label: critical=true. Which pods were excluded and why?"

The existing affected_pods telemetry tracks pods that were killed and their recovery, but doesn't capture the pods that matched filters but were intentionally excluded. This creates an audit trail gap for understanding targeting decisions.

However, I understand this may not add enough value to warrant the additional complexity. If you think the existing recovery tracking is sufficient and this use case is too niche, I'm happy to close this PR.

Let me know your thoughts!

feat: capture chaos injection details in pod disruption scenario tele…

03efc3d

…metry Signed-off-by: farhann_saleem <chaudaryfarhann@gmail.com>

farhann-saleem requested review from chaitanyaenr, paigerube14 and tsebastiani as code owners February 9, 2026 01:42

qodo-code-review bot reviewed Feb 9, 2026

View reviewed changes

farhann-saleem and others added 2 commits February 9, 2026 07:03

Fixed CI issues-telemetry field-return code contract . lose traceback…

2b71a10

… . unstructed kill-pods Signed-off-by: farhann_saleem <chaudaryfarhann@gmail.com>

Merge branch 'main' into feat/capture-chaos-injection

2b666a8

paigerube14 reviewed Feb 17, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: capture chaos injection details in pod disruption scenario telemetry#1152

feat: capture chaos injection details in pod disruption scenario telemetry#1152
farhann-saleem wants to merge 3 commits intokrkn-chaos:mainfrom
farhann-saleem:feat/capture-chaos-injection

farhann-saleem commented Feb 9, 2026 •

edited

Loading

Uh oh!

qodo-code-review bot commented Feb 9, 2026

Uh oh!

qodo-code-review bot commented Feb 9, 2026 •

edited

Loading

Uh oh!

qodo-code-review bot Feb 9, 2026

Uh oh!

paigerube14 Feb 17, 2026

Uh oh!

farhann-saleem Feb 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

farhann-saleem commented Feb 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Files Modified

What Changed

Why This Matters

Problem Being Solved

Impact

Testing

Backward Compatibility

Related Issues

Checklist

Uh oh!

qodo-code-review bot commented Feb 9, 2026

Review Summary by Qodo

Walkthroughs

File Changes

Uh oh!

qodo-code-review bot commented Feb 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review by Qodo

Uh oh!

qodo-code-review bot Feb 9, 2026

Choose a reason for hiding this comment

Uh oh!

paigerube14 Feb 17, 2026

Choose a reason for hiding this comment

Uh oh!

farhann-saleem Feb 17, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

farhann-saleem commented Feb 9, 2026 •

edited

Loading

qodo-code-review bot commented Feb 9, 2026 •

edited

Loading