The scope that overlapped with #4145 is now tracked there; this issue tracks only follow-up enhancements.
Update
This issue overlaps with #4145, which now covers the core prefix-cache routing improvements and has PR #4148.
The following items have been addressed in PR #4148:
- Active-rank-aware routing (filtering stale/disconnected identities)
- Metadata cleanup on engine removal
- Bounded metadata growth via cap-based eviction
- Basic observability counters for routing behavior
- Unit tests for routing correctness and cleanup
Remaining Scope (Follow-up Work)
The following improvements are not covered in PR #4148 and can be addressed incrementally:
1. Protocol Robustness
- Strict validation of coordinator message payloads
- Graceful handling of malformed inputs and unknown headers
- Clear distinction between client errors and internal failures
2. Observability Enhancements
- Integration with standardized metrics/monitoring pipeline
- More detailed routing-quality and latency metrics
3. Metadata Policy Extensions
- Optional TTL-based expiration for prefix metadata (in addition to existing cap-based eviction)
4. Integration and Performance Validation
- Engine churn and rolling restart scenarios
- Long-running, high-cardinality prompt workloads
- Tail latency (p95/p99) validation under fault conditions
Follow-up Implementation Details
1. Protocol Robustness
Goal:
Ensure coordinator communication is resilient to malformed or unexpected inputs without affecting stability.
Scope:
- Introduce validation layer for inbound messages (schema, required fields, types)
- Handle malformed payloads, unknown headers, and invalid fields gracefully
- Distinguish client errors from internal failures in logs and metrics
- Avoid hard failures (e.g., assertions) in runtime message handling
Done when:
- Invalid messages do not crash the coordinator
- Errors are logged with clear reasons and context
- Unit tests cover malformed and edge-case inputs
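The validation layer described above could be sketched roughly as follows. This is a minimal illustration and not the coordinator's actual message format: the field names (`engine_rank`, `prefix_hash`), the `PrefixUpdate` type, and the `handle_message` helper are all hypothetical. It shows one way to reject malformed payloads with a dedicated client-error type, ignore unknown fields, and count client errors separately from internal failures without asserting.

```python
import logging
from collections import Counter
from dataclasses import dataclass
from typing import Any, Callable

logger = logging.getLogger("coordinator.validation")


class ClientError(Exception):
    """The sender supplied a malformed message (not a coordinator bug)."""


@dataclass(frozen=True)
class PrefixUpdate:
    # Hypothetical message shape, used only for illustration.
    engine_rank: int
    prefix_hash: str


# Hypothetical schema: required field name -> expected type.
REQUIRED_FIELDS = {"engine_rank": int, "prefix_hash": str}


def parse_prefix_update(payload: Any) -> PrefixUpdate:
    """Validate structure, required fields, and types; raise ClientError on failure."""
    if not isinstance(payload, dict):
        raise ClientError(f"payload must be a dict, got {type(payload).__name__}")
    for name, expected in REQUIRED_FIELDS.items():
        if name not in payload:
            raise ClientError(f"missing required field: {name}")
        value = payload[name]
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            raise ClientError(f"field {name} must be {expected.__name__}")
    if payload["engine_rank"] < 0:
        raise ClientError("engine_rank must be non-negative")
    # Unknown extra fields are ignored rather than treated as errors.
    return PrefixUpdate(payload["engine_rank"], payload["prefix_hash"])


def handle_message(payload: Any,
                   apply: Callable[[PrefixUpdate], None],
                   counters: Counter) -> bool:
    """Process one message; never raises. Returns True if it was applied."""
    try:
        apply(parse_prefix_update(payload))
        return True
    except ClientError as exc:
        counters["client_error"] += 1
        logger.warning("rejected message: %s", exc)
    except Exception:
        counters["internal_error"] += 1
        logger.exception("internal failure while handling message")
    return False
```

Keeping `ClientError` separate from the catch-all branch is what lets logs and metrics tell a bad sender apart from a coordinator bug.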
2. Observability Enhancements
Goal:
Expose routing and reliability metrics for production monitoring and debugging.
Scope:
- Integrate coordinator metrics with the standard monitoring pipeline
- Track routing quality (cache hits, stale filtering, fallback rates)
- Track reliability signals (invalid messages, unreachable engines)
- Add latency-related metrics for coordinator overhead
Done when:
- Metrics are visible in monitoring dashboards
- Error and routing counters are clearly categorized
- Latency distributions are measurable
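As a rough sketch of what categorized counters and a measurable latency distribution might look like before they are wired into the monitoring pipeline, the following uses only the standard library. The class name, metric names, and bucket boundaries are assumptions chosen for illustration; the real integration would use whatever metrics client the project standardizes on.

```python
import bisect
from collections import Counter


class CoordinatorMetrics:
    """In-process counters plus a fixed-bucket latency histogram (illustrative)."""

    # Hypothetical bucket upper bounds in milliseconds; the last bucket is overflow.
    BUCKETS_MS = (0.1, 0.5, 1.0, 5.0, 10.0, 50.0)

    def __init__(self) -> None:
        self.counters = Counter()
        self.latency_buckets = [0] * (len(self.BUCKETS_MS) + 1)

    def inc(self, category: str, name: str) -> None:
        # Category prefixes (e.g. "routing", "reliability") keep counters grouped.
        self.counters[f"{category}.{name}"] += 1

    def observe_latency_ms(self, value: float) -> None:
        # bisect_left picks the first bucket whose upper bound is >= value.
        self.latency_buckets[bisect.bisect_left(self.BUCKETS_MS, value)] += 1

    def fallback_rate(self) -> float:
        """Fraction of routing decisions that fell back instead of hitting the cache."""
        hits = self.counters["routing.cache_hit"]
        fallbacks = self.counters["routing.fallback"]
        total = hits + fallbacks
        return fallbacks / total if total else 0.0


metrics = CoordinatorMetrics()
metrics.inc("routing", "cache_hit")
metrics.inc("routing", "fallback")
metrics.inc("reliability", "invalid_message")
metrics.observe_latency_ms(0.3)
```

Fixed buckets keep the per-request cost to one binary search and one increment, which matters if the counters sit on the routing hot path.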
3. Metadata Policy Extensions (TTL)
Goal:
Add time-based expiration to complement cap-based eviction for prefix metadata.
Scope:
- Support optional TTL for prefix entries
- Combine TTL with existing cap-based eviction
- Use low-overhead cleanup (lazy or periodic)
- Add configuration for TTL duration and behavior
Done when:
- TTL can be enabled/disabled via config
- Metadata remains bounded in long-running workloads
- No regression in steady-state routing behavior
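One possible shape for combining an optional TTL with cap-based eviction is sketched below. It is an assumption-laden illustration: `PrefixMetadataStore`, its parameters, and the oldest-written-first cap policy are hypothetical and may differ from the eviction order implemented in PR #4148. Expiration is lazy (checked on lookup), and the clock is injectable so the behavior can be tested without sleeping.

```python
import time
from collections import OrderedDict
from typing import Callable, Optional


class PrefixMetadataStore:
    """Maps prefix hashes to engine ranks with a size cap and an optional TTL."""

    def __init__(self,
                 max_entries: int,
                 ttl_seconds: Optional[float] = None,
                 clock: Callable[[], float] = time.monotonic) -> None:
        if max_entries <= 0:
            raise ValueError("max_entries must be positive")
        self._max_entries = max_entries
        self._ttl = ttl_seconds          # None disables TTL expiration.
        self._clock = clock
        self._entries = OrderedDict()    # prefix_hash -> (engine_rank, written_at)

    def put(self, prefix_hash: str, engine_rank: int) -> None:
        # Re-inserting moves the entry to the newest position.
        self._entries.pop(prefix_hash, None)
        self._entries[prefix_hash] = (engine_rank, self._clock())
        # Cap-based eviction: drop the oldest-written entries first.
        while len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)

    def get(self, prefix_hash: str) -> Optional[int]:
        item = self._entries.get(prefix_hash)
        if item is None:
            return None
        engine_rank, written_at = item
        # Lazy TTL check: expired entries are removed when they are looked up.
        if self._ttl is not None and self._clock() - written_at > self._ttl:
            del self._entries[prefix_hash]
            return None
        return engine_rank

    def __len__(self) -> int:
        return len(self._entries)
```

With lazy expiration alone, entries that are never looked up again stay in memory until the cap evicts them, so the cap remains the hard bound; a periodic sweep would be the alternative if stale entries need to be reclaimed sooner.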
4. Integration and Performance Validation
Goal:
Validate correctness and performance under realistic workload and failure scenarios.
Scope:
- Test engine churn (disconnect/restart scenarios)
- Simulate long-running, high-cardinality workloads
- Benchmark latency (p50/p95/p99) before and after each change
- Measure fallback rate and memory behavior
Done when:
- No regression in baseline performance
- Improved stability under churn scenarios
- Reproducible test and benchmark setup
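For the before/after latency comparison, a small helper along these lines could keep the benchmark check reproducible. It is only a sketch: the function names and the 5% tolerance are assumptions, and it uses the nearest-rank percentile definition, which is one of several common conventions.

```python
import math
from typing import Dict, List, Sequence


def percentile(samples: Sequence[float], p: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples <= it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(samples: Sequence[float]) -> Dict[str, float]:
    return {f"p{p}": percentile(samples, p) for p in (50, 95, 99)}


def regressions(baseline: Sequence[float],
                candidate: Sequence[float],
                tolerance: float = 0.05) -> List[str]:
    """Return the percentiles where the candidate is slower than baseline beyond tolerance."""
    before, after = summarize(baseline), summarize(candidate)
    return [name for name in before if after[name] > before[name] * (1 + tolerance)]
```

A churn or fault-injection run would feed its per-request latencies in as `candidate`, with a clean run as `baseline`; an empty result corresponds to the "no regression in baseline performance" criterion.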
Note
To keep tracking clean, #4145 is the primary issue for the implemented routing improvements.
This issue focuses only on follow-up enhancements. If preferred, I can split the remaining scope into separate issues (e.g., protocol validation, metrics integration, TTL policies, and validation/testing).