feat: implement rotation handling#23
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughThis change adds a new director-agent TLS certificate rotation correctness case type, validates its ChangesDirector-agent TLS rotation correctness
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
Deploying pipebench with
|
| Latest commit: |
b199cf4
|
| Status: | ✅ Deploy successful! |
| Preview URL: | https://4feb0f50.pipebench.pages.dev |
| Branch Preview URL: | https://dt-805.pipebench.pages.dev |
There was a problem hiding this comment.
Actionable comments posted: 7
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/config/case.go`:
- Around line 464-475: The SettleSecondsOrDefault and StallSecondsOrDefault
methods silently treat negative values as unset by using defaults, but the
validateRotation function does not reject negative values. Add explicit
validation in the validateRotation function to check that SettleSeconds and
StallSeconds fields are either zero (unset) or positive integers, and return an
error if negative values are provided. This prevents configuration typos like
settle_seconds: -5 from being silently ignored and using defaults instead.
In `@internal/runner/runner.go`:
- Around line 2247-2268: The current verdict calculation for the test result
only checks the count-based conditions (gotEnough and resumed) but ignores the
receiver's content verdict stored in recvMetrics.Passed and recvMetrics.Errors.
To fix this, add an additional check to validate that recvMetrics.Passed is true
(or equivalently, that recvMetrics.Errors is empty), and include this check in
the calculation of the passed variable alongside the existing gotEnough and
resumed conditions. If the receiver's content check fails, append an appropriate
error message to the errs slice to capture why the test failed.
- Around line 2119-2121: The issue is that countAtRotation is being set when
LinesReceived first reaches minRecv during the established phase, but it should
instead capture a fresh baseline immediately before the actual
rotation/disruption occurs. In the current code near line 2119-2121 where
established is set to true, and similarly in the other locations noted at
2163-2184 and 2231-2257, you need to capture the current LinesReceived value at
the moment just before the rotation event happens rather than at initial
establishment. This ensures the resume verdict comparison against
countAtRotation reflects only post-disruption activity, not pre-rotation
arrivals.
- Line 2007: In the os.Chmod call on tmpDir, change the permission mode from
0o777 to 01777. The sticky bit (01xxx) prevents other users from removing or
renaming files they don't own while maintaining write access needed for
container operations. This addresses the security issue where another user on a
multi-user host could otherwise delete or replace critical files like generated
certificates or compose configurations.
- Around line 1987-1990: The runDirectorAgentCertRotation function bypasses the
generic capability guard that checks the requires constraint, allowing rotation
test cases with requires declarations to run against unsupported subjects. Add a
requirements check at the beginning of runDirectorAgentCertRotation, after
applying subject overrides, to verify that the subject satisfies any requires
constraints specified in the test case (tc.Requires). If the subject does not
meet the requirements, return an appropriate skip or error result before
proceeding with the rotation workflow.
- Around line 1922-1923: The Docker exec probe in the agent package check is
using exec.Command which can hang indefinitely if Docker stalls. Replace the
exec.Command call with exec.CommandContext and provide it a context with a short
timeout to prevent indefinite blocking. Ensure that when the context times out
or an error occurs during the Docker exec probe (the "test -d
/opt/vmetric/package/agent" check), the timeout/error results are kept
inconclusive rather than failing the workflow.
- Around line 2109-2223: All rotation wait periods (settle, stall, and
drainTimeout variables) need to be clamped to respect the configured
Options.Timeout to prevent runs from continuing to sleep and poll beyond the
timeout. Replace the direct time.Sleep calls for settle, stall, and drainTimeout
with sleepWithinDeadline calls that enforce the overall timeout deadline,
ensuring that the positive-mode settle sleeps and the stall/drain periods during
rotation phases cannot extend the total runtime beyond r.opts.Timeout.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 578cc74f-817d-40b6-92d3-bf25549aaf38
📒 Files selected for processing (3)
internal/config/case.gointernal/orchestrator/tls.gointernal/runner/runner.go
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/runner/runner.go`:
- Line 2173: The resumeBaseline variable is assigned a value at its declaration
point, but this initial assignment is immediately overwritten in all non-default
rotation branches before the variable is ever read, making the initial
assignment ineffectual. Locate the resumeBaseline variable declaration in the
runner.go file and change it from an initialized assignment (resumeBaseline :=
countAtRotation) to a bare variable declaration without an initial value (var
resumeBaseline <type>), allowing the appropriate assignment in each code path to
be the first and only assignment before use.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 25b60c32-fabd-45a7-ac08-b7b91b4c60c1
📒 Files selected for processing (2)
internal/config/case.gointernal/runner/runner.go
🚧 Files skipped from review as they are similar to previous changes (1)
- internal/config/case.go
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/config/case.go`:
- Around line 732-738: The TestCase.IsDirectorAgentRotationType method is
duplicated, causing a redeclaration build/vet failure. Remove the extra copy of
IsDirectorAgentRotationType in TestCase and keep only the original definition
that already exists earlier in case.go; make sure the remaining method still
returns the director_agent_tls_cert_rotation_correctness type check.
- Around line 517-575: The build is broken by a duplicate RotationConfig
definition in this file. Remove one of the two copies of the RotationConfig
type, the RotationSameCA/RotationNewCARecover/RotationNewCAReject constants, and
the SettleSecondsOrDefault/StallSecondsOrDefault methods, keeping only the
canonical block with the intended doc comments. Use the existing RotationConfig,
RotationSameCA, and SettleSecondsOrDefault symbols to locate both declarations
and delete the redundant one so internal/config compiles cleanly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 85d6544e-420b-4f6c-8b19-78ff2abc0a2d
📒 Files selected for processing (1)
internal/config/case.go
| // Rotation parameterizes the director_agent_tls_cert_rotation_correctness | ||
| // driver (see TestCase.Rotation, runDirectorAgentCertRotation). The director | ||
| // deploys an agent that streams back over the proxy_tls listener; mid-run the | ||
| // director's serving cert/CA is rotated on disk and the director is bounced so | ||
| // the enrolled agent must re-handshake. Mode selects which rotation and the | ||
| // expected agent response. | ||
| type RotationConfig struct { | ||
| // Mode selects the mid-run rotation: | ||
| // "same_ca" — re-sign the director leaf under the SAME CA | ||
| // (RotateServerCert). The agent reconnects transparently | ||
| // because the chain still validates. Verdict: delivery | ||
| // resumes (count grows after the bounce). | ||
| // "new_ca_recover" — rotate to a BRAND-NEW CA written to ca.crt and served | ||
| // at /dl/cert.pem (RotateServerCertNewCA). A bootstrap | ||
| // agent (no operator-pinned CA) must re-fetch the new CA | ||
| // and reconnect. Verdict: delivery resumes. | ||
| // "new_ca_reject" — TWO PHASE. Phase 1 re-signs the leaf under an UNTRUSTED | ||
| // CA the director never serves (RotateServerCertWrongCA): | ||
| // the agent MUST fail validation, so delivery STALLS | ||
| // (a missing stall is a SECURITY failure — validation is | ||
| // disabled). Phase 2 restores a trusted leaf | ||
| // (RotateServerCert) and delivery must resume. | ||
| Mode string `yaml:"mode"` | ||
|
|
||
| // SettleSeconds is the pause after a rotation+bounce before the driver samples | ||
| // the receiver, giving the director time to rebind its listener and the agent | ||
| // time to detect the dropped session and reconnect (default 25s). | ||
| SettleSeconds int `yaml:"settle_seconds"` | ||
|
|
||
| // StallSeconds (new_ca_reject only) is how long the receiver count must hold | ||
| // flat after the untrusted rotation for the bad cert to count as rejected | ||
| // (default 20s). The case's endpoint seed loop MUST still be appending fresh | ||
| // records during this window, else the stall is vacuous — see the case NOTES. | ||
| StallSeconds int `yaml:"stall_seconds"` | ||
| } | ||
|
|
||
| // Rotation mode values for RotationConfig.Mode. | ||
| const ( | ||
| RotationSameCA = "same_ca" | ||
| RotationNewCARecover = "new_ca_recover" | ||
| RotationNewCAReject = "new_ca_reject" | ||
| ) | ||
|
|
||
| // SettleSecondsOrDefault / StallSecondsOrDefault centralize the rotation timing | ||
| // defaults so the driver and any caller agree. | ||
| func (rc *RotationConfig) SettleSecondsOrDefault() int { | ||
| if rc != nil && rc.SettleSeconds > 0 { | ||
| return rc.SettleSeconds | ||
| } | ||
| return 25 | ||
| } | ||
|
|
||
| func (rc *RotationConfig) StallSecondsOrDefault() int { | ||
| if rc != nil && rc.StallSeconds > 0 { | ||
| return rc.StallSeconds | ||
| } | ||
| return 20 | ||
| } | ||
|
|
There was a problem hiding this comment.
🩺 Stability & Availability | 🔴 Critical | ⚡ Quick win
Duplicate RotationConfig declaration breaks the build.
This entire block — the RotationConfig type (Line 523), the RotationSameCA/RotationNewCARecover/RotationNewCAReject constants (Lines 555-557), and the SettleSecondsOrDefault/StallSecondsOrDefault methods (Lines 562, 569) — is a verbatim re-declaration of the definitions already present earlier in this file (441-493). Go does not permit redeclaration in the same package block, so internal/config no longer compiles.
Remove this duplicated block (or the earlier copy), keeping only one definition.
🐛 Proposed fix — drop the duplicate block
-// Rotation parameterizes the director_agent_tls_cert_rotation_correctness
-// driver (see TestCase.Rotation, runDirectorAgentCertRotation). The director
-// deploys an agent that streams back over the proxy_tls listener; mid-run the
-// director's serving cert/CA is rotated on disk and the director is bounced so
-// the enrolled agent must re-handshake. Mode selects which rotation and the
-// expected agent response.
-type RotationConfig struct {
- // Mode selects the mid-run rotation:
- // "same_ca" — re-sign the director leaf under the SAME CA
- // (RotateServerCert). The agent reconnects transparently
- // because the chain still validates. Verdict: delivery
- // resumes (count grows after the bounce).
- // "new_ca_recover" — rotate to a BRAND-NEW CA written to ca.crt and served
- // at /dl/cert.pem (RotateServerCertNewCA). A bootstrap
- // agent (no operator-pinned CA) must re-fetch the new CA
- // and reconnect. Verdict: delivery resumes.
- // "new_ca_reject" — TWO PHASE. Phase 1 re-signs the leaf under an UNTRUSTED
- // CA the director never serves (RotateServerCertWrongCA):
- // the agent MUST fail validation, so delivery STALLS
- // (a missing stall is a SECURITY failure — validation is
- // disabled). Phase 2 restores a trusted leaf
- // (RotateServerCert) and delivery must resume.
- Mode string `yaml:"mode"`
-
- // SettleSeconds is the pause after a rotation+bounce before the driver samples
- // the receiver, giving the director time to rebind its listener and the agent
- // time to detect the dropped session and reconnect (default 25s).
- SettleSeconds int `yaml:"settle_seconds"`
-
- // StallSeconds (new_ca_reject only) is how long the receiver count must hold
- // flat after the untrusted rotation for the bad cert to count as rejected
- // (default 20s). The case's endpoint seed loop MUST still be appending fresh
- // records during this window, else the stall is vacuous — see the case NOTES.
- StallSeconds int `yaml:"stall_seconds"`
-}
-
-// Rotation mode values for RotationConfig.Mode.
-const (
- RotationSameCA = "same_ca"
- RotationNewCARecover = "new_ca_recover"
- RotationNewCAReject = "new_ca_reject"
-)
-
-// SettleSecondsOrDefault / StallSecondsOrDefault centralize the rotation timing
-// defaults so the driver and any caller agree.
-func (rc *RotationConfig) SettleSecondsOrDefault() int {
- if rc != nil && rc.SettleSeconds > 0 {
- return rc.SettleSeconds
- }
- return 25
-}
-
-func (rc *RotationConfig) StallSecondsOrDefault() int {
- if rc != nil && rc.StallSeconds > 0 {
- return rc.StallSeconds
- }
- return 20
-}
-Confirm which copy carries the canonical doc comments before deleting; keep that one.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| // Rotation parameterizes the director_agent_tls_cert_rotation_correctness | |
| // driver (see TestCase.Rotation, runDirectorAgentCertRotation). The director | |
| // deploys an agent that streams back over the proxy_tls listener; mid-run the | |
| // director's serving cert/CA is rotated on disk and the director is bounced so | |
| // the enrolled agent must re-handshake. Mode selects which rotation and the | |
| // expected agent response. | |
| type RotationConfig struct { | |
| // Mode selects the mid-run rotation: | |
| // "same_ca" — re-sign the director leaf under the SAME CA | |
| // (RotateServerCert). The agent reconnects transparently | |
| // because the chain still validates. Verdict: delivery | |
| // resumes (count grows after the bounce). | |
| // "new_ca_recover" — rotate to a BRAND-NEW CA written to ca.crt and served | |
| // at /dl/cert.pem (RotateServerCertNewCA). A bootstrap | |
| // agent (no operator-pinned CA) must re-fetch the new CA | |
| // and reconnect. Verdict: delivery resumes. | |
| // "new_ca_reject" — TWO PHASE. Phase 1 re-signs the leaf under an UNTRUSTED | |
| // CA the director never serves (RotateServerCertWrongCA): | |
| // the agent MUST fail validation, so delivery STALLS | |
| // (a missing stall is a SECURITY failure — validation is | |
| // disabled). Phase 2 restores a trusted leaf | |
| // (RotateServerCert) and delivery must resume. | |
| Mode string `yaml:"mode"` | |
| // SettleSeconds is the pause after a rotation+bounce before the driver samples | |
| // the receiver, giving the director time to rebind its listener and the agent | |
| // time to detect the dropped session and reconnect (default 25s). | |
| SettleSeconds int `yaml:"settle_seconds"` | |
| // StallSeconds (new_ca_reject only) is how long the receiver count must hold | |
| // flat after the untrusted rotation for the bad cert to count as rejected | |
| // (default 20s). The case's endpoint seed loop MUST still be appending fresh | |
| // records during this window, else the stall is vacuous — see the case NOTES. | |
| StallSeconds int `yaml:"stall_seconds"` | |
| } | |
| // Rotation mode values for RotationConfig.Mode. | |
| const ( | |
| RotationSameCA = "same_ca" | |
| RotationNewCARecover = "new_ca_recover" | |
| RotationNewCAReject = "new_ca_reject" | |
| ) | |
| // SettleSecondsOrDefault / StallSecondsOrDefault centralize the rotation timing | |
| // defaults so the driver and any caller agree. | |
| func (rc *RotationConfig) SettleSecondsOrDefault() int { | |
| if rc != nil && rc.SettleSeconds > 0 { | |
| return rc.SettleSeconds | |
| } | |
| return 25 | |
| } | |
| func (rc *RotationConfig) StallSecondsOrDefault() int { | |
| if rc != nil && rc.StallSeconds > 0 { | |
| return rc.StallSeconds | |
| } | |
| return 20 | |
| } |
🧰 Tools
🪛 GitHub Actions: CI / 2_Build & Vet.txt
[error] 523-523: Go build failed: RotationConfig redeclared in this block.
🪛 GitHub Actions: CI / Build & Vet
[error] 523-523: Go build failed: RotationConfig redeclared in this block
🪛 GitHub Check: Build & Vet
[failure] 569-569:
method RotationConfig.StallSecondsOrDefault already declared at internal/config/case.go:487:27
[failure] 562-562:
method RotationConfig.SettleSecondsOrDefault already declared at internal/config/case.go:480:27
[failure] 557-557:
RotationNewCAReject redeclared in this block
[failure] 556-556:
RotationNewCARecover redeclared in this block
[failure] 555-555:
RotationSameCA redeclared in this block
[failure] 523-523:
RotationConfig redeclared in this block
🪛 golangci-lint (2.12.2)
[error] 523-523: : # github.com/VirtualMetric/PipeBench/internal/config [github.com/VirtualMetric/PipeBench/internal/config.test]
internal/config/case.go:523:6: RotationConfig redeclared in this block
internal/config/case.go:441:6: other declaration of RotationConfig
internal/config/case.go:555:2: RotationSameCA redeclared in this block
internal/config/case.go:473:2: other declaration of RotationSameCA
internal/config/case.go:556:2: RotationNewCARecover redeclared in this block
internal/config/case.go:474:2: other declaration of RotationNewCARecover
internal/config/case.go:557:2: RotationNewCAReject redeclared in this block
internal/config/case.go:475:2: other declaration of RotationNewCAReject
internal/config/case.go:562:27: method RotationConfig.SettleSecondsOrDefault already declared at internal/config/case.go:480:27
internal/config/case.go:569:27: method RotationConfig.StallSecondsOrDefault already declared at internal/config/case.go:487:27
internal/config/case.go:735:21: method TestCase.IsDirectorAgentRotationType already declared at internal/config/case.go:724:21
(typecheck)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@internal/config/case.go` around lines 517 - 575, The build is broken by a
duplicate RotationConfig definition in this file. Remove one of the two copies
of the RotationConfig type, the
RotationSameCA/RotationNewCARecover/RotationNewCAReject constants, and the
SettleSecondsOrDefault/StallSecondsOrDefault methods, keeping only the canonical
block with the intended doc comments. Use the existing RotationConfig,
RotationSameCA, and SettleSecondsOrDefault symbols to locate both declarations
and delete the redundant one so internal/config compiles cleanly.
Sources: Linters/SAST tools, Pipeline failures
| // IsDirectorAgentRotationType reports whether the case is the director↔agent | ||
| // TLS cert-rotation correctness flow, which has its own subject-driven (no | ||
| // generator) driver — see runDirectorAgentCertRotation. | ||
| func (tc *TestCase) IsDirectorAgentRotationType() bool { | ||
| return tc.Type == "director_agent_tls_cert_rotation_correctness" | ||
| } | ||
|
|
There was a problem hiding this comment.
🩺 Stability & Availability | 🔴 Critical | ⚡ Quick win
Duplicate IsDirectorAgentRotationType method.
This method is declared a second time here; the original is already defined earlier (around Line 724). Like the duplicated RotationConfig block above, this redeclaration fails go vet/build (method TestCase.IsDirectorAgentRotationType already declared). Remove one of the two copies.
🐛 Proposed fix — remove the duplicate method
-// IsDirectorAgentRotationType reports whether the case is the director↔agent
-// TLS cert-rotation correctness flow, which has its own subject-driven (no
-// generator) driver — see runDirectorAgentCertRotation.
-func (tc *TestCase) IsDirectorAgentRotationType() bool {
- return tc.Type == "director_agent_tls_cert_rotation_correctness"
-}
-📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| // IsDirectorAgentRotationType reports whether the case is the director↔agent | |
| // TLS cert-rotation correctness flow, which has its own subject-driven (no | |
| // generator) driver — see runDirectorAgentCertRotation. | |
| func (tc *TestCase) IsDirectorAgentRotationType() bool { | |
| return tc.Type == "director_agent_tls_cert_rotation_correctness" | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@internal/config/case.go` around lines 732 - 738, The
TestCase.IsDirectorAgentRotationType method is duplicated, causing a
redeclaration build/vet failure. Remove the extra copy of
IsDirectorAgentRotationType in TestCase and keep only the original definition
that already exists earlier in case.go; make sure the remaining method still
returns the director_agent_tls_cert_rotation_correctness type check.
Sources: Linters/SAST tools, Pipeline failures
Summary by CodeRabbit
New Features
rotationsupport to director↔agent TLS correctness test cases with three modes: same CA, new CA recovery, and new CA rejection.Improvements