
Prefix-Cache Coordinator Follow-ups: Protocol Hardening, Metrics Integration, TTL Policy and Churn Validation #4176

@DhineshPonnarasan


The scope that overlapped with #4145 is now tracked there; this issue tracks only follow-up enhancements.

Update

This issue overlaps with #4145, which now covers the core prefix-cache routing improvements and has PR #4148.

The following items have been addressed in PR #4148:

  • Active-rank-aware routing (filtering stale/disconnected identities)
  • Metadata cleanup on engine removal
  • Bounded metadata growth via cap-based eviction
  • Basic observability counters for routing behavior
  • Unit tests for routing correctness and cleanup

Remaining Scope (Follow-up Work)

The following improvements are not covered in PR #4148 and can be addressed incrementally:

1. Protocol Robustness

  • Strict validation of coordinator message payloads
  • Graceful handling of malformed inputs and unknown headers
  • Clear distinction between client errors and internal failures

2. Observability Enhancements

  • Integration with standardized metrics/monitoring pipeline
  • More detailed routing-quality and latency metrics

3. Metadata Policy Extensions

  • Optional TTL-based expiration for prefix metadata (in addition to existing cap-based eviction)

4. Integration and Performance Validation

  • Engine churn and rolling restart scenarios
  • Long-running, high-cardinality prompt workloads
  • Tail latency (p95/p99) validation under fault conditions

Follow-up Implementation Details

1. Protocol Robustness

Goal:
Ensure coordinator communication is resilient to malformed or unexpected inputs without affecting stability.

Scope:

  • Introduce validation layer for inbound messages (schema, required fields, types)
  • Handle malformed payloads, unknown headers, and invalid fields gracefully
  • Distinguish client errors from internal failures in logs and metrics
  • Avoid hard failures (e.g., assertions) in runtime message handling

Done when:

  • Invalid messages do not crash the coordinator
  • Errors are logged with clear reasons and context
  • Unit tests cover malformed and edge-case inputs
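
As a starting point for discussion, here is a minimal sketch of what the validation layer could look like. It is not the coordinator's actual protocol: the JSON encoding, the field names (`type`, `engine_id`, `prefix_hash`), the message types, and the `parse_message` / `ParseResult` names are all assumptions made for illustration. The idea is that validation returns a result instead of raising or asserting, and tags each failure as a client error or an internal failure so logs and metrics can tell them apart.

```python
import json
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ErrorKind(Enum):
    CLIENT = "client_error"      # sender supplied a bad payload
    INTERNAL = "internal_error"  # unexpected failure inside the coordinator


@dataclass(frozen=True)
class ParseResult:
    ok: bool
    message: Optional[dict] = None
    error_kind: Optional[ErrorKind] = None
    reason: str = ""


# Hypothetical schema: required field name -> expected Python type.
REQUIRED_FIELDS = {"type": str, "engine_id": str, "prefix_hash": str}
# Hypothetical set of message types the coordinator understands.
KNOWN_TYPES = {"register", "lookup", "evict"}


def _client_error(reason: str) -> ParseResult:
    return ParseResult(False, error_kind=ErrorKind.CLIENT, reason=reason)


def parse_message(raw: bytes) -> ParseResult:
    """Validate one inbound payload; never raises, never asserts."""
    try:
        try:
            payload = json.loads(raw)
        except ValueError as exc:  # covers JSONDecodeError and bad encodings
            return _client_error(f"malformed JSON: {exc}")
        if not isinstance(payload, dict):
            return _client_error("payload must be a JSON object")
        for name, expected in REQUIRED_FIELDS.items():
            if name not in payload:
                return _client_error(f"missing field: {name}")
            if not isinstance(payload[name], expected):
                return _client_error(f"field {name} must be {expected.__name__}")
        if payload["type"] not in KNOWN_TYPES:
            return _client_error(f"unknown message type: {payload['type']!r}")
        # Extra, unrecognized fields are tolerated and passed through.
        return ParseResult(True, message=payload)
    except Exception as exc:
        # Anything reaching this point is a coordinator bug, not bad input.
        return ParseResult(False, error_kind=ErrorKind.INTERNAL, reason=repr(exc))
```

A caller would log `reason` together with `error_kind.value` and drop the message when `ok` is false, so a single bad sender cannot take the coordinator down.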

2. Observability Enhancements

Goal:
Expose routing and reliability metrics for production monitoring and debugging.

Scope:

  • Integrate coordinator metrics with the standard monitoring pipeline
  • Track routing quality (cache hits, stale filtering, fallback rates)
  • Track reliability signals (invalid messages, unreachable engines)
  • Add latency-related metrics for coordinator overhead

Done when:

  • Metrics are visible in monitoring dashboards
  • Error and routing counters are clearly categorized
  • Latency distributions are measurable
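
To make the metric categories concrete, here is a rough sketch of the counters and latency samples the coordinator could record. It uses an in-memory stand-in rather than any real monitoring client, because this issue does not pin down which pipeline will be used. The class name `CoordinatorMetrics`, the category and counter names, and the hit-rate definition (hits over hits plus fallbacks) are assumptions for illustration only.

```python
import time
from collections import Counter
from contextlib import contextmanager


class CoordinatorMetrics:
    """In-memory stand-in for a real metrics client (all names hypothetical)."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock          # injectable so tests can fake time
        self.counters = Counter()    # keyed by (category, name)
        self.latencies_s = []        # raw routing-overhead samples, seconds

    def inc(self, category: str, name: str, value: int = 1) -> None:
        """Bump a counter under an explicit category, e.g. routing/reliability."""
        self.counters[(category, name)] += value

    @contextmanager
    def time_routing(self):
        """Record how long one routing decision takes, even if it raises."""
        start = self._clock()
        try:
            yield
        finally:
            self.latencies_s.append(self._clock() - start)

    def cache_hit_rate(self) -> float:
        """Assumed definition: hits / (hits + fallbacks); 0.0 with no traffic."""
        hits = self.counters[("routing", "cache_hit")]
        total = hits + self.counters[("routing", "fallback")]
        return hits / total if total else 0.0
```

Usage would look like `with metrics.time_routing(): ...` around the routing decision, plus `metrics.inc("reliability", "invalid_message")` on a rejected payload. Keeping the category in the key is one way to satisfy the "clearly categorized" criterion; a real integration would map these onto whatever counter and histogram types the chosen pipeline provides.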

3. Metadata Policy Extensions (TTL)

Goal:
Add time-based expiration to complement cap-based eviction for prefix metadata.

Scope:

  • Support optional TTL for prefix entries
  • Combine TTL with existing cap-based eviction
  • Use low-overhead cleanup (lazy or periodic)
  • Add configuration for TTL duration and behavior

Done when:

  • TTL can be enabled/disabled via config
  • Metadata remains bounded in long-running workloads
  • No regression in steady-state routing behavior
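
One possible shape for combining TTL with a cap is sketched below. This is not the eviction code from PR #4148; `PrefixMetadataStore`, its parameters, and the choice to evict by oldest write are assumptions for illustration. TTL is optional (`ttl_s=None` disables it), expiry is applied lazily on lookup, and `sweep()` offers a cheap periodic cleanup that stops at the first live entry.

```python
import time
from collections import OrderedDict
from typing import Callable, Optional


class PrefixMetadataStore:
    """Prefix -> engine map bounded by a size cap and an optional TTL."""

    def __init__(self, max_entries: int, ttl_s: Optional[float] = None,
                 clock: Callable[[], float] = time.monotonic):
        if max_entries <= 0:
            raise ValueError("max_entries must be positive")
        self._max = max_entries
        self._ttl = ttl_s            # None means TTL is disabled
        self._clock = clock          # injectable so tests can fake time
        self._entries = OrderedDict()  # prefix_hash -> (engine_id, written_at)

    def put(self, prefix_hash: str, engine_id: str) -> None:
        self._entries[prefix_hash] = (engine_id, self._clock())
        self._entries.move_to_end(prefix_hash)  # keep order == write order
        while len(self._entries) > self._max:
            self._entries.popitem(last=False)   # cap: drop the oldest write

    def get(self, prefix_hash: str) -> Optional[str]:
        item = self._entries.get(prefix_hash)
        if item is None:
            return None
        engine_id, written_at = item
        if self._ttl is not None and self._clock() - written_at > self._ttl:
            del self._entries[prefix_hash]      # lazy expiry on lookup
            return None
        return engine_id

    def sweep(self) -> int:
        """Periodic cleanup; returns how many expired entries were removed."""
        if self._ttl is None:
            return 0
        now, removed = self._clock(), 0
        # Entries are ordered by write time, so expired ones sit at the front.
        while self._entries:
            key, (_, written_at) = next(iter(self._entries.items()))
            if now - written_at <= self._ttl:
                break
            del self._entries[key]
            removed += 1
        return removed

    def __len__(self) -> int:
        return len(self._entries)
```

Because reads do not reorder entries, the front of the map is always the oldest write, which keeps both cap eviction and `sweep()` proportional to the number of entries actually removed. If the existing cap is LRU-by-read instead, the TTL check would need a separate ordering.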

4. Integration and Performance Validation

Goal:
Validate correctness and performance under realistic workload and failure scenarios.

Scope:

  • Test engine churn (disconnect/restart scenarios)
  • Simulate long-running, high-cardinality workloads
  • Benchmark latency (p50/p95/p99) before vs after
  • Measure fallback rate and memory behavior

Done when:

  • No regression in baseline performance
  • Improved stability under churn scenarios
  • Reproducible test and benchmark setup
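
For the before/after latency comparison, a small helper along these lines might be enough to make "no regression" checkable in CI. It is only a sketch: the function names, the nearest-rank percentile method, and the 5% default tolerance are assumptions, not agreed thresholds.

```python
import math
from typing import Dict, List, Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile for p in (0, 100]."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def summarize(samples: Sequence[float]) -> Dict[str, float]:
    """p50/p95/p99 summary of one benchmark run."""
    return {f"p{p}": percentile(samples, p) for p in (50, 95, 99)}


def regressions(baseline: Sequence[float], candidate: Sequence[float],
                tolerance: float = 0.05) -> List[str]:
    """Percentiles where the candidate is slower than baseline by > tolerance."""
    base, cand = summarize(baseline), summarize(candidate)
    return [name for name in base if cand[name] > base[name] * (1 + tolerance)]
```

Running the same workload (steady state, then with engines disconnecting and restarting) against the baseline and the candidate build, and asserting `regressions(baseline, candidate) == []`, would give a reproducible pass/fail signal for the tail-latency criterion.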

Note

To keep tracking clean, #4145 is the primary issue for the implemented routing improvements.
This issue focuses only on follow-up enhancements.

If preferred, I can split the remaining scope into separate issues (e.g., protocol validation, metrics integration, TTL policy, and churn/performance validation).
