What would you like to be added:
Add an optional integration of BLIS (Blackbox Inference Simulator; https://github.com/inference-sim/inference-sim) as a high-fidelity vLLM endpoint performance model inside llm-d-inference-sim.
Specifically:
- Use BLIS as a drop-in replacement for, or an augmentation of, the current vLLM endpoint simulation logic.
- The inference scheduler and llm-d-inference-sim continue to run unmodified, as real components; BLIS is responsible only for modeling:
- request-level latency (TTFT, TPOT, end-to-end)
- throughput under batching and contention
- saturation throughput
- KV cache allocation, reuse, and eviction dynamics.
- BLIS operates as a discrete-event simulator that predicts vLLM behavior given:
- request arrival streams
- model type (gpt-oss-120b, llama-70b, etc.)
- hardware configuration (GPU type, memory)
- vLLM version
This enables llm-d-inference-sim to evaluate real scheduler logic against a much more accurate model of a vLLM instance, without running vLLM itself.
Why is this needed:
llm-d-inference-sim today provides a functional abstraction of a vLLM endpoint, but many of the most critical behaviors that affect scheduler correctness and stability are second-order effects:
- non-linear batching dynamics,
- KV cache pressure and eviction,
- prefill/decode phase coupling,
- tail-latency amplification under bursty load.
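To make the first of these concrete, here is a toy sketch of non-linear batching dynamics (the curve shape and numbers are purely illustrative, not calibrated data): per-token latency stays roughly flat until the batch saturates the endpoint's compute, after which latency grows with batch size while throughput flattens. An idealized constant-service-time model misses exactly this knee.

```go
package main

import "fmt"

// tpotMs returns an illustrative per-output-token latency (ms) as a
// function of batch size: flat below a saturation knee, then growing
// roughly linearly. Not real vLLM data.
func tpotMs(batch int) float64 {
	const base = 10.0 // ms per token at small batch sizes
	const knee = 8    // batch size where the endpoint saturates
	if batch <= knee {
		return base
	}
	return base * float64(batch) / float64(knee)
}

func main() {
	for _, b := range []int{1, 8, 16, 32} {
		// Throughput in tokens/s across the whole batch: it plateaus at
		// the knee even as per-token latency keeps climbing.
		fmt.Printf("batch=%2d tpot=%5.1fms throughput=%6.1f tok/s\n",
			b, tpotMs(b), float64(b)*1000/tpotMs(b))
	}
}
```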
BLIS is designed to model these dynamics explicitly and can be calibrated offline for each:
- vLLM version,
- LLM model,
- GPU type / memory configuration.
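One way an integration might consume those offline calibrations is a lookup keyed by the (vLLM version, model, GPU) tuple, loaded at startup. The following is a hypothetical sketch; the key fields, parameter names, and numbers are placeholders, not real calibration data or BLIS's schema.

```go
package main

import "fmt"

// CalibKey identifies one offline calibration run: BLIS parameters are
// fit per (vLLM version, model, GPU) tuple. All names and numbers here
// are hypothetical placeholders.
type CalibKey struct {
	VLLMVersion, Model, GPU string
}

// Calib holds the fitted parameters an integration might load at startup.
type Calib struct {
	TTFTBaseMs, TPOTBaseMs float64
}

var calibrations = map[CalibKey]Calib{
	{VLLMVersion: "v0.x", Model: "llama-70b", GPU: "80GiB-class"}: {TTFTBaseMs: 45, TPOTBaseMs: 9},
}

// lookup returns the calibration for a tuple and whether one exists;
// a missing entry means that combination was never calibrated offline,
// which the integration should surface rather than silently guess.
func lookup(k CalibKey) (Calib, bool) {
	c, ok := calibrations[k]
	return c, ok
}

func main() {
	k := CalibKey{VLLMVersion: "v0.x", Model: "llama-70b", GPU: "80GiB-class"}
	if c, ok := lookup(k); ok {
		fmt.Printf("TTFT base %.0f ms, TPOT base %.0f ms\n", c.TTFTBaseMs, c.TPOTBaseMs)
	}
}
```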
By integrating BLIS:
- llm-d-inference-sim gains predictive accuracy for latency and throughput while preserving fast simulation.
- Active-active and HA scheduler designs can be evaluated under realistic endpoint behavior, rather than idealized service times.
- Scheduler regressions and instability modes can be detected before production rollout, especially under bursty, skewed, or failure-injection workloads.
This is particularly important for HA active-active inference configurations, where small modeling errors in endpoint behavior can lead to large emergent effects at the cluster level.