Energy calculation uses stale resource utilization due to refresh ordering #2446
Description
Kepler Version
0.10.0 or later (Current/Supported)
Bug Description
Hi, while working on energy measurement and validating results, I noticed an inconsistency in how resource utilization is accounted for during power calculation.
From my measurements and testing, the current implementation computes energy based on stale resource utilization data.
In firstReading, resources are refreshed after the initial node read:
func (pm *PowerMonitor) firstReading(newSnapshot *Snapshot) error {
    // First read for node
    if err := pm.firstNodeRead(newSnapshot.Node); err != nil {
        return fmt.Errorf(nodePowerError, err)
    }
    // Resources are refreshed AFTER the initial node read
    if err := pm.resources.Refresh(); err != nil {
        pm.logger.Error("snapshot rebuild failed to refresh resources", "error", err)
        return err
    }
    // ...
}
However, in calculatePower, node power is computed before refreshing resources:
func (pm *PowerMonitor) calculatePower(prev, newSnapshot *Snapshot) error {
    // Node power is calculated BEFORE resources are refreshed
    if err := pm.calculateNodePower(prev.Node, newSnapshot.Node); err != nil {
        return fmt.Errorf(nodePowerError, err)
    }
    if err := pm.resources.Refresh(); err != nil {
        pm.logger.Error("snapshot rebuild failed to refresh resources", "error", err)
        return err
    }
    // ...
}
Problem
Because pm.resources.Refresh() happens after calculateNodePower, the energy calculation is effectively based on the previous snapshot’s resource utilization, not the current one.
In practice, this introduces a one-snapshot lag in energy accounting, which becomes a measurable inaccuracy whenever resource usage changes between snapshots.
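A minimal, self-contained sketch of one possible fix: refresh resources before computing node power, mirroring the ordering that firstReading already uses. All types below are illustrative stubs, not Kepler's actual code; the toy calculateNodePower errors out when utilization is stale just to make the required ordering explicit.

```go
package main

import (
	"errors"
	"fmt"
)

// Illustrative stubs standing in for the real Kepler types.
type Node struct{}
type Snapshot struct{ Node *Node }

type resourceTracker struct{ refreshed bool }

func (r *resourceTracker) Refresh() error { r.refreshed = true; return nil }

type PowerMonitor struct{ resources *resourceTracker }

// In this toy model, node power can only be computed from refreshed
// utilization; otherwise the ordering bug surfaces as an error.
func (pm *PowerMonitor) calculateNodePower(prev, cur *Node) error {
	if !pm.resources.refreshed {
		return errors.New("stale resource utilization")
	}
	return nil
}

// Proposed ordering: Refresh first, then calculate node power, so the
// energy calculation sees the current interval's utilization.
func (pm *PowerMonitor) calculatePower(prev, cur *Snapshot) error {
	if err := pm.resources.Refresh(); err != nil {
		return err
	}
	return pm.calculateNodePower(prev.Node, cur.Node)
}

func main() {
	pm := &PowerMonitor{resources: &resourceTracker{}}
	err := pm.calculatePower(&Snapshot{Node: &Node{}}, &Snapshot{Node: &Node{}})
	fmt.Println("calculatePower error:", err)
}
```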
Steps to Reproduce
- Start the system with power monitoring enabled.
- Establish a baseline: let the system run under low or idle resource utilization and record a few consecutive energy/power measurements.
- Introduce a sudden change in resource utilization, e.g., start a CPU-intensive workload (stress test).
- Observe power/energy measurements across snapshots: compare the timestamp when resource utilization increases with the timestamp when the energy measurement reflects the increase.
- Identify the lag. You will notice that:
  - Resource utilization increases at snapshot N
  - Energy measurement reflects this increase only at snapshot N+1

Energy measurements lag behind actual resource utilization by one snapshot cycle.
Expected Behavior
Energy measurements should reflect resource utilization changes within the same snapshot cycle.
Environment
- OS: Ubuntu 22.04.5 LTS (Jammy Jellyfish)
- Kubernetes Version: v1.33.5+k3s1 (k3s single-node setup)
- Container Runtime: containerd 2.1.4 (k3s)
- Hardware: Intel Xeon Silver 4309Y CPU (RAPL supported), 32 CPUs
- Deployment Method: Kubernetes DaemonSet (Kepler, 1 node)
Logs and Error Messages
From the collected data, node-active-energy appears to be computed using the previous interval's node-CPUUsageRatio rather than the current one.
For example:
- At 2026-03-23T18:33:30.960Z: node-CPUUsageRatio ≈ 0.007805
- At 2026-03-23T18:33:35.978Z: node-CPUUsageRatio ≈ 0.020210 and node-rapl-delta-energy ≈ 607.13
If energy were computed using the current ratio: Expected ≈ 607.13 × 0.020210 ≈ 12.27
However, the observed node-active-energy is ≈ 4.74, which matches: 607.13 × 0.007805 ≈ 4.74
This indicates that node-active-energy at 2026-03-23T18:33:35.978Z is derived from node-CPUUsageRatio at 2026-03-23T18:33:30.960Z, i.e., from the previous interval.
Additionally, a request was issued at:
2026-03-23 18:33:32.408387 UTC
which falls between these two samples. This causes the CPU usage increase to appear in the later snapshot, while the energy attribution is still based on the previous (lower) CPU usage, leading to stale accounting.