Overview
Code review of branch_controller.rs revealed several reliability issues that could lead to resource inconsistency, race conditions, and poor error recovery. This issue tracks the systematic improvements needed.
Critical Issues Identified
1. Race Conditions in timeline_id Generation
Problem: Multiple reconcile loops can generate different timeline IDs, causing last-writer-wins corruption.
Impact: Branches could end up with incorrect timeline references.
2. Infinite Pageserver API Calls
Problem: Timeline creation HTTP call happens every reconcile regardless of success.
Impact: API spam, unnecessary load, and poor handling of 409 conflicts.
3. Resource Drift Issues
Problem: Deployments and ConfigMaps are created once but never updated when specs change.
Impact: Running resources become out of sync with desired state.
4. Poor Error Recovery
Problem: Failed phase never clears when conditions improve; missing project causes permanent failure.
Impact: Branches stuck in failed state even after conditions resolve.
Implementation Plan
Phase 1: Status Condition Framework
- Add explicit status conditions for each reconciliation step:
  - `TimelineCreated` - Timeline exists on pageserver
  - `ProjectFound` - Referenced NeonProject exists
  - `DeploymentReady` - Compute deployment is ready
  - `DefaultUserCreated` - Default user provisioned
  - `DefaultDatabaseCreated` - Default database created
- Implement deterministic phase derivation from conditions
- Update `BranchStatusManager` to handle the new conditions
Phase 2: Idempotent Operations
- Timeline ID Generation: Use server-side apply with conflict detection
  - Use `PatchParams::apply` with the controller's field manager (non-forcing by default) and handle 409 conflicts gracefully
- Pageserver Integration: Only call when the `TimelineCreated` condition is false
  - Add request timeouts (5s)
  - Treat 409 (Conflict) as success
  - Implement exponential backoff
- Resource Management: Switch from create-only to apply-based reconciliation
  - Use server-side apply for Deployments
  - Use server-side apply for ConfigMaps
  - Include `ownerReferences` for garbage collection
Phase 3: Error Handling Improvements
- Replace panic-prone `unwrap()` calls with proper error handling:
  - `namespace().unwrap()` → return a meaningful error
  - `timeline_id.unwrap()` → check and handle gracefully
  - `tenant_id.unwrap()` → validate and report the missing field
- Implement progressive backoff in error policy
- Add proper cleanup in finalizer (delete timeline, deployment, configmap)
Phase 4: Robustness Enhancements
- Deployment Readiness: Check `observedGeneration` and the `Available` condition
- Project Namespace: Make the namespace configurable instead of hardcoding "neon"
- Concurrent Processing: Consider single-threaded per-object processing
- Resource Validation: Add spec validation with clear error messages
Phase 5: Observability & Testing
- Add structured logging with span fields (branch, project, timeline, tenant_id)
- Add Prometheus metrics for each reconciliation step
- Include HTTP status/response body in error logs
- Unit tests for status transitions and patch operations
- Integration tests for common failure scenarios
Technical Implementation Notes
Server-Side Apply Example
```rust
// Timeline ID generation with conflict detection.
// PatchParams::apply is non-forcing by default, so a concurrent writer
// that already set a different timeline_id surfaces as a 409 conflict.
let pp = PatchParams::apply(FIELD_MANAGER);
let patch = Patch::Apply(json!({
    "apiVersion": NeonBranch::api_version(&()),
    "kind": NeonBranch::kind(&()),
    "metadata": { "name": name, "namespace": namespace },
    "spec": { "timeline_id": timeline_id },
}));
// On a 409, another reconcile already claimed the field: re-read the
// object and adopt its timeline_id instead of retrying with ours.
api.patch(&name, &pp, &patch).await?;
```
Idempotent Pageserver Calls
```rust
// Only call the pageserver while the condition is not yet true.
if !is_status_condition_true(&conditions, "TimelineCreated") {
    let resp = http_client
        .post(&pageserver_url)
        .timeout(Duration::from_secs(5))
        .json(&payload)
        .send()
        .await?;
    match resp.status() {
        // 409 means the timeline already exists: treat it as success.
        StatusCode::OK | StatusCode::CREATED | StatusCode::CONFLICT => {
            status_manager.update_condition("TimelineCreated", true).await?;
        }
        _ => return Ok(Action::requeue(Duration::from_secs(30))),
    }
}
```
Success Criteria
- No race conditions in timeline ID assignment
- Pageserver creation API called at most once per timeline (re-attempted only on failure, never after success)
- Deployments stay in sync with branch specs
- Branches recover from transient failures
- All `unwrap()` calls removed from hot paths
- Comprehensive test coverage for failure scenarios
Priority
High - These issues affect production reliability and could cause data consistency problems.
Estimated Effort
- Phase 1-2: 2-3 days
- Phase 3-4: 2-3 days
- Phase 5: 1-2 days
Dependencies
- No external dependencies
- May need coordination with storage-controller API team for timeline creation semantics