What would you like to be added:
Add an optional integration of BLIS (Blackbox Inference Simulator; https://github.com/inference-sim/inference-sim) as a high-fidelity vLLM endpoint performance model inside llm-d-inference-sim.
Specifically:
- Use BLIS as a drop-in replacement for, or an augmentation of, the current vLLM endpoint simulation logic.
- The inference scheduler and llm-d-inference-sim continue to run unmodified, as real components; BLIS is responsible only for modeling:
- request-level latency (TTFT, TPOT, end-to-end)
- throughput under batching and contention
- saturation throughput
- KV cache allocation, reuse, and eviction dynamics.
- BLIS operates as a discrete-event simulator that predicts vLLM behavior given:
- request arrival streams
- model type (gpt-oss-120b, llama-70b, etc.)
- hardware configuration (GPU type, memory)
- vLLM version
This enables llm-d-inference-sim to evaluate real scheduler logic against a much more accurate model of a vLLM instance, without running vLLM itself.
Why is this needed:
llm-d-inference-sim today provides a functional abstraction of a vLLM endpoint, but many of the most critical behaviors that affect scheduler correctness and stability are second-order effects:
- non-linear batching dynamics,
- KV cache pressure and eviction,
- prefill/decode phase coupling,
- tail-latency amplification under bursty load.
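To make the first of these concrete, here is a toy sketch of non-linear batching dynamics (the curve shape and numbers are purely illustrative, not calibrated data): per-token latency stays roughly flat until the batch saturates the endpoint's compute, after which latency grows with batch size while throughput flattens. An idealized constant-service-time model misses exactly this knee.

```go
package main

import "fmt"

// tpotMs returns an illustrative per-output-token latency (ms) as a
// function of batch size: flat below a saturation knee, then growing
// roughly linearly. Not real vLLM data.
func tpotMs(batch int) float64 {
	const base = 10.0 // ms per token at small batch sizes
	const knee = 8    // batch size where the endpoint saturates
	if batch <= knee {
		return base
	}
	return base * float64(batch) / float64(knee)
}

func main() {
	for _, b := range []int{1, 8, 16, 32} {
		// Throughput in tokens/s across the whole batch: it plateaus at
		// the knee even as per-token latency keeps climbing.
		fmt.Printf("batch=%2d tpot=%5.1fms throughput=%6.1f tok/s\n",
			b, tpotMs(b), float64(b)*1000/tpotMs(b))
	}
}
```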
BLIS is designed to model these dynamics explicitly and can be calibrated offline for each:
- vLLM version,
- LLM model,
- GPU type / memory configuration.
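One way an integration might consume those offline calibrations is a lookup keyed by the (vLLM version, model, GPU) tuple, loaded at startup. The following is a hypothetical sketch; the key fields, parameter names, and numbers are placeholders, not real calibration data or BLIS's schema.

```go
package main

import "fmt"

// CalibKey identifies one offline calibration run: BLIS parameters are
// fit per (vLLM version, model, GPU) tuple. All names and numbers here
// are hypothetical placeholders.
type CalibKey struct {
	VLLMVersion, Model, GPU string
}

// Calib holds the fitted parameters an integration might load at startup.
type Calib struct {
	TTFTBaseMs, TPOTBaseMs float64
}

var calibrations = map[CalibKey]Calib{
	{VLLMVersion: "v0.x", Model: "llama-70b", GPU: "80GiB-class"}: {TTFTBaseMs: 45, TPOTBaseMs: 9},
}

// lookup returns the calibration for a tuple and whether one exists;
// a missing entry means that combination was never calibrated offline,
// which the integration should surface rather than silently guess.
func lookup(k CalibKey) (Calib, bool) {
	c, ok := calibrations[k]
	return c, ok
}

func main() {
	k := CalibKey{VLLMVersion: "v0.x", Model: "llama-70b", GPU: "80GiB-class"}
	if c, ok := lookup(k); ok {
		fmt.Printf("TTFT base %.0f ms, TPOT base %.0f ms\n", c.TTFTBaseMs, c.TPOTBaseMs)
	}
}
```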
By integrating BLIS:
- llm-d-inference-sim gains predictive accuracy for latency and throughput while preserving fast simulation.
- Active-active and HA scheduler designs can be evaluated under realistic endpoint behavior, rather than idealized service times.
- Scheduler regressions and instability modes can be detected before production rollout, especially under bursty, skewed, or failure-injection workloads.
This is particularly important for HA active-active inference configurations, where small modeling errors in endpoint behavior can lead to large emergent effects at the cluster level.