Prefix-Cache Coordinator Follow-ups: Protocol Hardening, Metrics Integration, TTL Policy and Churn Validation

The scope that overlapped with #4145 is now tracked there; this issue tracks only follow-up enhancements.

## Update

This issue overlaps with #4145, which now covers the core prefix-cache routing improvements and has PR #4148.

The following items have been addressed in PR #4148:
- Active-rank-aware routing (filtering stale/disconnected identities)
- Metadata cleanup on engine removal
- Bounded metadata growth via cap-based eviction
- Basic observability counters for routing behavior
- Unit tests for routing correctness and cleanup

---

## Remaining Scope (Follow-up Work)

The following improvements are not covered in PR #4148 and can be addressed incrementally:

### 1. Protocol Robustness
- Strict validation of coordinator message payloads
- Graceful handling of malformed inputs and unknown headers
- Clear distinction between client errors and internal failures

### 2. Observability Enhancements
- Integration with standardized metrics/monitoring pipeline
- More detailed routing-quality and latency metrics

### 3. Metadata Policy Extensions
- Optional TTL-based expiration for prefix metadata (in addition to existing cap-based eviction)

### 4. Integration and Performance Validation
- Engine churn and rolling restart scenarios
- Long-running, high-cardinality prompt workloads
- Tail latency (p95/p99) validation under fault conditions

---

## Follow-up Implementation Details

### 1. Protocol Robustness

**Goal:**  
Ensure coordinator communication is resilient to malformed or unexpected inputs without affecting stability.

**Scope:**
- Introduce validation layer for inbound messages (schema, required fields, types)
- Handle malformed payloads, unknown headers, and invalid fields gracefully
- Distinguish client errors from internal failures in logs and metrics
- Avoid hard failures (e.g., assertions) in runtime message handling

**Done when:**
- Invalid messages do not crash the coordinator
- Errors are logged with clear reasons and context
- Unit tests cover malformed and edge-case inputs
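
The validation layer described above could look roughly like the sketch below. It is a minimal illustration only: the JSON encoding, the field names (`type`, `engine_id`, `prefix_hash`), the message types, and the `ClientError` / `handle_message` names are all assumptions made for the example, not the coordinator's actual wire format or API.

```python
import json
import logging

logger = logging.getLogger("coordinator")


class ClientError(ValueError):
    """The sender supplied a malformed or invalid message."""


# Hypothetical schema: required field name -> expected type.
REQUIRED_FIELDS = {"type": str, "engine_id": str, "prefix_hash": str}
# Hypothetical set of message types the coordinator understands.
KNOWN_TYPES = {"register", "lookup", "evict"}


def validate_message(raw: bytes) -> dict:
    """Parse and validate one inbound message, raising ClientError on bad input."""
    try:
        payload = json.loads(raw)
    except (UnicodeDecodeError, json.JSONDecodeError) as exc:
        raise ClientError(f"payload is not valid JSON: {exc}") from exc
    if not isinstance(payload, dict):
        raise ClientError("payload must be a JSON object")
    for name, expected in REQUIRED_FIELDS.items():
        if name not in payload:
            raise ClientError(f"missing required field: {name}")
        if not isinstance(payload[name], expected):
            raise ClientError(f"field {name} must be {expected.__name__}")
    if payload["type"] not in KNOWN_TYPES:
        raise ClientError(f"unknown message type: {payload['type']!r}")
    # Unknown extra fields are dropped rather than rejected.
    return {name: payload[name] for name in REQUIRED_FIELDS}


def handle_message(raw: bytes, process) -> str:
    """Return a status string instead of letting any exception escape."""
    try:
        message = validate_message(raw)
    except ClientError as exc:
        # Client error: the sender's fault, logged at warning level.
        logger.warning("rejected message: %s", exc)
        return "client_error"
    try:
        process(message)
    except Exception:
        # Internal failure: our fault, logged with a traceback.
        logger.exception("internal failure while processing %s", message["type"])
        return "internal_error"
    return "ok"
```

Returning a status string instead of raising keeps the receive loop alive and gives logs and metrics a single place to separate client errors from internal failures.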

---

### 2. Observability Enhancements

**Goal:**  
Expose routing and reliability metrics for production monitoring and debugging.

**Scope:**
- Integrate coordinator metrics with the standard monitoring pipeline
- Track routing quality (cache hits, stale filtering, fallback rates)
- Track reliability signals (invalid messages, unreachable engines)
- Add latency-related metrics for coordinator overhead

**Done when:**
- Metrics are visible in monitoring dashboards
- Error and routing counters are clearly categorized
- Latency distributions are measurable
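
As a rough illustration of categorized counters plus a measurable latency distribution, here is a standard-library-only sketch. The `CoordinatorMetrics` class, the category and counter names, and the bucket boundaries are all hypothetical. Export to the actual monitoring pipeline is deliberately left out, since that pipeline is not specified here.

```python
import bisect
import time
from collections import Counter
from contextlib import contextmanager

# Hypothetical histogram bucket upper bounds, in seconds.
LATENCY_BUCKETS = (0.001, 0.005, 0.025, 0.1, 0.5)


class CoordinatorMetrics:
    """In-process registry; a real setup would export these values."""

    def __init__(self):
        # Keys are (category, name), e.g. ("routing", "cache_hit").
        self.counters = Counter()
        # One slot per bucket plus a final overflow slot.
        self.latency_counts = [0] * (len(LATENCY_BUCKETS) + 1)

    def inc(self, category, name, amount=1):
        self.counters[(category, name)] += amount

    def observe_latency(self, seconds):
        # Slot i counts observations <= LATENCY_BUCKETS[i].
        self.latency_counts[bisect.bisect_left(LATENCY_BUCKETS, seconds)] += 1

    @contextmanager
    def time_routing(self):
        """Record the wall-clock duration of the wrapped block."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.observe_latency(time.perf_counter() - start)

    def fallback_rate(self):
        hits = self.counters[("routing", "cache_hit")]
        fallbacks = self.counters[("routing", "fallback")]
        total = hits + fallbacks
        return fallbacks / total if total else 0.0
```

Keying counters by a `(category, name)` pair is one simple way to keep routing signals (`cache_hit`, `stale_filtered`, `fallback`) separate from reliability signals (`invalid_message`, `engine_unreachable`) when they are later mapped onto labels or metric prefixes.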

---

### 3. Metadata Policy Extensions (TTL)

**Goal:**  
Add time-based expiration to complement cap-based eviction for prefix metadata.

**Scope:**
- Support optional TTL for prefix entries
- Combine TTL with existing cap-based eviction
- Use low-overhead cleanup (lazy or periodic)
- Add configuration for TTL duration and behavior

**Done when:**
- TTL can be enabled/disabled via config
- Metadata remains bounded in long-running workloads
- No regression in steady-state routing behavior
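
One possible way to combine an optional TTL with cap-based eviction is sketched below. The `PrefixMetadataStore` name, its parameters, and the oldest-write-first eviction order are assumptions for illustration; the eviction policy implemented in PR #4148 may differ. Expiry is checked lazily on lookup, with an optional periodic sweep.

```python
import time
from collections import OrderedDict


class PrefixMetadataStore:
    """Prefix -> engine mapping bounded by entry count and, optionally, age."""

    def __init__(self, max_entries, ttl_seconds=None, clock=time.monotonic):
        self.max_entries = max_entries
        self.ttl_seconds = ttl_seconds  # None disables TTL entirely.
        self._clock = clock  # Injectable so tests can control time.
        # prefix -> (engine_id, written_at), ordered oldest write first.
        self._entries = OrderedDict()

    def put(self, prefix, engine_id):
        # Re-inserting moves the prefix to the newest position.
        self._entries.pop(prefix, None)
        self._entries[prefix] = (engine_id, self._clock())
        # Cap-based eviction: drop the oldest writes first.
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)

    def get(self, prefix):
        item = self._entries.get(prefix)
        if item is None:
            return None
        engine_id, written_at = item
        expired = (
            self.ttl_seconds is not None
            and self._clock() - written_at >= self.ttl_seconds
        )
        if expired:
            # Lazy cleanup: remove the entry at the moment it is looked up.
            del self._entries[prefix]
            return None
        return engine_id

    def purge_expired(self):
        """Periodic cleanup; stops at the first entry that is still fresh."""
        if self.ttl_seconds is None:
            return 0
        cutoff = self._clock() - self.ttl_seconds
        removed = 0
        while self._entries:
            _, (_, written_at) = next(iter(self._entries.items()))
            if written_at > cutoff:
                break
            self._entries.popitem(last=False)
            removed += 1
        return removed

    def __len__(self):
        return len(self._entries)
```

Because entries stay ordered by write time, the periodic sweep only touches expired entries at the front, and leaving `ttl_seconds=None` keeps the existing cap-only behavior unchanged.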

---

### 4. Integration and Performance Validation

**Goal:**  
Validate correctness and performance under realistic workload and failure scenarios.

**Scope:**
- Test engine churn (disconnect/restart scenarios)
- Simulate long-running, high-cardinality workloads
- Benchmark latency (p50/p95/p99) before vs after
- Measure fallback rate and memory behavior

**Done when:**
- No regression in baseline performance
- Improved stability under churn scenarios
- Reproducible test and benchmark setup
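
For the before/after latency comparison, a small helper along these lines could make the "no regression" check reproducible. It uses the nearest-rank percentile definition; the function names and the 5% tolerance are illustrative assumptions, not agreed acceptance criteria.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty sequence of latencies."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(samples):
    """Return the p50/p95/p99 latencies for one benchmark run."""
    return {f"p{p}": percentile(samples, p) for p in (50, 95, 99)}


def regressions(baseline, candidate, tolerance=0.05):
    """Percentiles where the candidate is slower than baseline by more than tolerance."""
    base, cand = summarize(baseline), summarize(candidate)
    return {
        key: (base[key], cand[key])
        for key in base
        if cand[key] > base[key] * (1 + tolerance)
    }
```

An empty result from `regressions(before, after)` means none of the three percentiles got worse by more than the tolerance; a non-empty result names the affected percentiles with their baseline and candidate values, which is useful for spotting tail-only regressions under churn.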

---

## Note

To keep tracking clean, #4145 remains the primary issue for the implemented routing improvements.  
This issue covers only the follow-up enhancements.

If preferred, I can split the remaining scope into separate issues (e.g., protocol validation, metrics integration, TTL policies, and validation/testing).
