feat: capture chaos injection details in pod disruption scenario telemetry #1152

farhann-saleem wants to merge 3 commits into krkn-chaos:main
Conversation
…metry

Signed-off-by: farhann_saleem <chaudaryfarhann@gmail.com>
Review Summary by Qodo: Capture chaos injection details in pod disruption telemetry
Walkthrough

Description:
- Added KilledPodDetail dataclass to track killed/excluded pods
- Modified killing_pods() to return a tuple with the pod details list
- Store killed pod details in scenario telemetry for analysis
- Added comprehensive unit tests for the tracking functionality

Diagram:

```mermaid
flowchart LR
    A["Pod Disruption Scenario"] -->|killing_pods| B["Track Killed/Excluded Pods"]
    B -->|Create KilledPodDetail| C["Pod Details List"]
    C -->|Store in Telemetry| D["Scenario Telemetry"]
    D -->|Enable Analysis| E["Debugging & Auditability"]
```
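The walkthrough above centers on the new dataclass. A minimal sketch of what KilledPodDetail might look like — field names are inferred from the telemetry dict built later in run(), so the exact types and defaults in the real models.py may differ:

```python
from dataclasses import dataclass


@dataclass
class KilledPodDetail:
    """Illustrative reconstruction of the dataclass added in models.py."""
    namespace: str
    name: str
    timestamp: float
    status: str  # e.g. "killed" or "excluded"
    reason: str  # why the pod was targeted or skipped


# Example record for a pod skipped by an exclude filter.
detail = KilledPodDetail("payments", "api-7f9c", 1700000000.0,
                         "excluded", "matched exclude filter")
print(detail.status)
```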
File Changes
1. krkn/scenario_plugins/pod_disruption/models/models.py
Code Review by Qodo
1. Telemetry field not persisted
```python
# Store killed pods details in telemetry for analysis
scenario_telemetry.killed_pods_details = [
    {
        "namespace": pod.namespace,
        "name": pod.name,
        "timestamp": pod.timestamp,
        "status": pod.status,
        "reason": pod.reason
    }
    for pod in killed_pods
]
```
1. Telemetry field not persisted 🐞 Bug ✧ Quality
• run() sets scenario_telemetry.killed_pods_details as a new ad-hoc attribute; if ScenarioTelemetry is a strict/structured model (e.g., pydantic or dataclass slots), this can raise AttributeError at runtime.
• Even if the assignment succeeds, telemetry serialization may only include declared fields, so killed_pods_details may never reach telemetry.json (silent feature loss).
Agent Prompt
## Issue description
`PodDisruptionScenarioPlugin.run()` assigns `scenario_telemetry.killed_pods_details` dynamically. If `ScenarioTelemetry` is a strict model (common for telemetry payloads), this may either:
1) raise `AttributeError` at runtime, or
2) be silently dropped during serialization, so the new data never appears in `telemetry.json`.
## Issue Context
`ScenarioTelemetry` comes from `krkn_lib` (external to this repo), and in this repo it is treated as a structured object with known fields.
## Fix Focus Areas
- krkn/scenario_plugins/pod_disruption/pod_disruption_scenario_plugin.py[44-63]
- krkn/scenario_plugins/abstract_scenario_plugin.py[86-90]
## Suggested approach
1) Preferably update the `ScenarioTelemetry` model in `krkn_lib` to include a `killed_pods_details` field and ensure it is included in its JSON serialization.
2) If updating `krkn_lib` is not possible from this repo, then:
- store the data under an existing, known-serialized field (if one exists), or
- wrap the assignment with a `try/except AttributeError` and emit a warning log so the scenario does not crash and operators know the telemetry field was not recorded.
3) Add/adjust a unit test that exercises `run()` with a strict `Mock(spec=ScenarioTelemetry)` and asserts the chosen storage mechanism works.
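The fallback in step 2 could look like the following; a hedged sketch where record_killed_pods is an illustrative helper, not the plugin's actual code, and the pod objects only need the five attributes the PR already serializes:

```python
import logging
from types import SimpleNamespace


def record_killed_pods(scenario_telemetry, killed_pods) -> bool:
    """Best-effort persistence of killed-pod details on the telemetry object."""
    details = [
        {
            "namespace": p.namespace,
            "name": p.name,
            "timestamp": p.timestamp,
            "status": p.status,
            "reason": p.reason,
        }
        for p in killed_pods
    ]
    try:
        scenario_telemetry.killed_pods_details = details
    except AttributeError:
        # Strict models reject unknown attributes; warn instead of crashing
        # so operators know the field was not recorded.
        logging.warning(
            "ScenarioTelemetry rejected killed_pods_details; "
            "pod details will not appear in telemetry.json"
        )
        return False
    return True


# A plain namespace accepts the attribute, so the helper reports success.
pods = [SimpleNamespace(namespace="ns", name="p1", timestamp=0.0,
                        status="killed", reason="selected")]
telemetry = SimpleNamespace()
print(record_killed_pods(telemetry, pods))  # True
```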
… . unstructed kill-pods Signed-off-by: farhann_saleem <chaudaryfarhann@gmail.com>
```python
logging.info(pod)
if pod[0] in exclude_pods:
    logging.info(f"Excluding {pod[0]} from chaos")
    killed_pods.append(KilledPodDetail(
    # (diff excerpt truncated in the review view)
```
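The exclusion branch shown in this excerpt can be sketched end to end. This is an illustrative reconstruction using plain dicts, assuming pods arrive as (name, namespace) pairs — the actual killing_pods() signature and return shape belong to the plugin:

```python
import time


def partition_pods(pods, exclude_pods):
    """Split candidates into kill targets plus an audit trail that records
    a decision for every pod, killed or excluded."""
    targets, audit = [], []
    for name, namespace in pods:
        excluded = name in exclude_pods
        audit.append({
            "namespace": namespace,
            "name": name,
            "timestamp": time.time(),
            "status": "excluded" if excluded else "killed",
            "reason": "matched exclude filter" if excluded else "matched selection",
        })
        if not excluded:
            targets.append((name, namespace))
    return targets, audit


targets, audit = partition_pods(
    [("api-1", "ns"), ("critical-1", "ns")], exclude_pods={"critical-1"})
print([d["status"] for d in audit])  # ['killed', 'excluded']
```

Returning the audit list alongside the targets mirrors the PR's change of killing_pods() to a tuple, so callers keep the original targets while gaining the per-pod record.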
@farhann-saleem I'm not sure how useful this is, we already have a detailed watcher set up in krkn-lib that will tell us when any pod goes down and how long it takes to recover and any errors
Read more about what we already have here: https://krkn-chaos.dev/docs/scenarios/pod-scenario/#recovery-time-metrics-in-krkn-telemetry
Thanks for the feedback @paigerube14 and the documentation link!
You're right that krkn-lib's watcher provides comprehensive recovery tracking. After reviewing the docs, I can see the overlap.
My PR was attempting to add visibility into:
- Excluded pods tracking: which pods were skipped due to exclude_label
- Exclusion reasons: why specific pods were excluded (matched filter)
- Injection timestamps: when chaos was injected (vs. when pods recovered)

The use case I had in mind was: "I ran chaos with exclude_label: critical=true. Which pods were excluded and why?"

The existing affected_pods telemetry tracks pods that were killed and their recovery, but doesn't capture the pods that matched filters yet were intentionally excluded. This creates an audit-trail gap for understanding targeting decisions.
However, I understand this may not add enough value to warrant the additional complexity. If you think the existing recovery tracking is sufficient and this use case is too niche, I'm happy to close this PR.
Let me know your thoughts!
Summary
This PR fixes a critical data loss issue in the pod disruption scenario plugin:
the record of which pods were killed during chaos injection was not persisted to telemetry.
Previously, pods were logged when deleted, but this information was not stored
in the scenario telemetry for analysis and debugging.
Changes
Files Modified
- krkn/scenario_plugins/pod_disruption/models/models.py - Added KilledPodDetail dataclass
- krkn/scenario_plugins/pod_disruption/pod_disruption_scenario_plugin.py - Track killed pods
- tests/test_pod_disruption_scenario_plugin.py - Added comprehensive unit tests

What Changed
- New KilledPodDetail Dataclass - Structured tracking of killed/excluded pods with:
- Updated killing_pods() Method - Now returns a tuple with a list of KilledPodDetail objects
- Updated run() Method - Stores killed pod details in telemetry
- Added Unit Tests - Comprehensive test coverage including:
Why This Matters
Problem Being Solved
When a pod disruption scenario runs, operators need to know:
Before this change, this information was only in logs, not in structured telemetry.
Impact
Testing
All tests pass:
Backward Compatibility
✅ Fully backward compatible:
Related Issues
Closes [reference to any related issues if applicable]
Relates to #1051 (Natural Language Chaos Scenario Discovery)
Checklist