Skip to content

Improve Branch Controller Reliability - Address Race Conditions and Resource Drift #27

@bittermandel

Description

@bittermandel

Overview

Code review of branch_controller.rs revealed several reliability issues that could lead to resource inconsistency, race conditions, and poor error recovery. This issue tracks the systematic improvements needed.

Critical Issues Identified

1. Race Conditions in timeline_id Generation

Problem: Multiple reconcile loops can generate different timeline IDs, causing last-writer-wins corruption.
Impact: Branches could end up with incorrect timeline references.

2. Infinite Pageserver API Calls

Problem: Timeline creation HTTP call happens every reconcile regardless of success.
Impact: API spam, unnecessary load, and poor handling of 409 conflicts.

3. Resource Drift Issues

Problem: Deployments and ConfigMaps are created once but never updated when specs change.
Impact: Running resources become out of sync with desired state.

4. Poor Error Recovery

Problem: Failed phase never clears when conditions improve; missing project causes permanent failure.
Impact: Branches stuck in failed state even after conditions resolve.

Implementation Plan

Phase 1: Status Condition Framework

  • Add explicit status conditions for each reconciliation step:
    • TimelineCreated - Timeline exists on pageserver
    • ProjectFound - Referenced NeonProject exists
    • DeploymentReady - Compute deployment is ready
    • DefaultUserCreated - Default user provisioned
    • DefaultDatabaseCreated - Default database created
  • Implement deterministic phase derivation from conditions
  • Update BranchStatusManager to handle new conditions

Phase 2: Idempotent Operations

  • Timeline ID Generation: Use server-side apply with conflict detection
    // Use PatchParams::apply with force_conflicts(false)
    // Handle 409 conflicts gracefully
  • Pageserver Integration: Only call when TimelineCreated condition is false
    • Add request timeouts (5s)
    • Treat 409 as success
    • Implement exponential backoff
  • Resource Management: Switch from create-only to apply-based reconciliation
    • Use server-side apply for Deployments
    • Use server-side apply for ConfigMaps
    • Include ownerReferences for garbage collection

Phase 3: Error Handling Improvements

  • Replace panic-prone unwrap() calls with proper error handling:
    • namespace().unwrap() → return meaningful error
    • timeline_id.unwrap() → check and handle gracefully
    • tenant_id.unwrap() → validate and report missing field
  • Implement progressive backoff in error policy
  • Add proper cleanup in finalizer (delete timeline, deployment, configmap)

Phase 4: Robustness Enhancements

  • Deployment Readiness: Check observedGeneration and Available condition
  • Project Namespace: Make namespace configurable instead of hardcoded "neon"
  • Concurrent Processing: Consider single-threaded per-object processing
  • Resource Validation: Add spec validation with clear error messages

Phase 5: Observability & Testing

  • Add structured logging with span fields (branch, project, timeline, tenant_id)
  • Add Prometheus metrics for each reconciliation step
  • Include HTTP status/response body in error logs
  • Unit tests for status transitions and patch operations
  • Integration tests for common failure scenarios

Technical Implementation Notes

Server-Side Apply Example

// Timeline ID generation with conflict detection
let pp = PatchParams::apply(FIELD_MANAGER)
    .force_conflicts(false);
let patch = Patch::Apply(json!({
    "apiVersion": NeonBranch::api_version(&()),
    "kind": NeonBranch::kind(&()),
    "metadata": { "name": name, "namespace": namespace },
    "spec": { "timeline_id": timeline_id },
}));

Idempotent Pageserver Calls

// Only call pageserver when needed
if !is_status_condition_true(&conditions, "TimelineCreated") {
    let resp = http_client
        .post(&pageserver_url)
        .timeout(Duration::from_secs(5))
        .json(&payload)
        .send().await?;
    
    match resp.status() {
        StatusCode::OK | StatusCode::CREATED | StatusCode::CONFLICT => {
            status_manager.update_condition("TimelineCreated", true).await?;
        }
        _ => return Ok(Action::requeue(Duration::from_secs(30)))
    }
}

Success Criteria

  • No race conditions in timeline ID assignment
  • Pageserver API called exactly once per timeline
  • Deployments stay in sync with branch specs
  • Branches recover from transient failures
  • All unwrap() calls removed from hot paths
  • Comprehensive test coverage for failure scenarios

Priority

High - These issues affect production reliability and could cause data consistency problems.

Estimated Effort

  • Phase 1-2: 2-3 days
  • Phase 3-4: 2-3 days
  • Phase 5: 1-2 days

Dependencies

  • No external dependencies
  • May need coordination with storage-controller API team for timeline creation semantics

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions