Integrate BLIS as a High-Fidelity vLLM Performance Model in llm-d-inference-sim #285

@sriumcp

Description

What would you like to be added:

Add an optional integration of BLIS (Blackbox Inference Simulator; https://github.com/inference-sim/inference-sim) as a high-fidelity vLLM endpoint performance model inside llm-d-inference-sim.

Specifically:

  • Use BLIS as a drop-in replacement or augmentation for the current vLLM endpoint simulation logic.

  • The inference scheduler and llm-d-inference-sim continue to run unmodified, executing their real logic; BLIS is responsible only for modeling:

    • request-level latency (TTFT, TPOT, end-to-end)
    • throughput under batching and contention
    • saturation throughput
    • KV cache allocation, reuse, and eviction dynamics
  • BLIS operates as a discrete-event simulator that predicts vLLM behavior given:

    • request arrival streams
    • model type (gpt-oss-120b, llama-70b, etc.)
    • hardware configuration (GPU type, memory)
    • vLLM version

This enables llm-d-inference-sim to evaluate real scheduler logic against a much more accurate model of a vLLM instance, without running vLLM itself.
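To make the proposed modeling boundary concrete, here is a minimal sketch of the kind of request-level prediction a calibrated endpoint model would return: a first-come-first-served, single-endpoint discrete-event loop that maps an arrival stream to per-request TTFT and end-to-end latency. This is not BLIS's actual API; the function names, the FCFS/no-batching simplification, and the per-token cost constants are all illustrative assumptions standing in for parameters that BLIS would calibrate per (vLLM version, model, GPU) triple.

```python
from dataclasses import dataclass

@dataclass
class Request:
    arrival: float        # arrival time in seconds
    prompt_tokens: int    # prefill length
    output_tokens: int    # decode length

# Illustrative per-token costs only; a real model would draw these from
# offline BLIS calibration for a specific (vLLM version, model, GPU) triple.
PREFILL_S_PER_TOKEN = 0.0002   # assumed prefill cost per prompt token
DECODE_S_PER_TOKEN = 0.02      # assumed time per output token (TPOT)

def simulate(requests):
    """Serve requests FCFS on one endpoint; return (ttft, e2e) per request.

    Deliberately ignores batching, KV cache pressure, and prefill/decode
    coupling -- exactly the second-order effects BLIS is meant to capture.
    """
    free_at = 0.0  # time at which the endpoint next becomes free
    results = []
    for r in sorted(requests, key=lambda r: r.arrival):
        start = max(r.arrival, free_at)  # queueing delay under contention
        ttft = (start - r.arrival) + r.prompt_tokens * PREFILL_S_PER_TOKEN
        e2e = ttft + r.output_tokens * DECODE_S_PER_TOKEN
        free_at = r.arrival + e2e        # endpoint busy until request drains
        results.append((ttft, e2e))
    return results
```

Even this toy model shows queueing-induced TTFT inflation under back-to-back arrivals; the point of the integration is to replace these idealized service times with BLIS's calibrated dynamics while the scheduler under test stays unchanged.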

Why is this needed:

llm-d-inference-sim today provides a functional abstraction of a vLLM endpoint, but many of the most critical behaviors that affect scheduler correctness and stability are second-order effects:

  • non-linear batching dynamics,
  • KV cache pressure and eviction,
  • prefill/decode phase coupling,
  • tail-latency amplification under bursty load.

BLIS is designed to model these dynamics explicitly and can be calibrated offline for each:

  • vLLM version,
  • LLM model,
  • GPU type / memory configuration.

By integrating BLIS:

  • llm-d-inference-sim gains predictive accuracy for latency and throughput while preserving fast simulation.
  • Active-active and HA scheduler designs can be evaluated under realistic endpoint behavior, rather than idealized service times.
  • Scheduler regressions and instability modes can be detected before production rollout, especially under bursty, skewed, or failure-injection workloads.

This is particularly important for HA active-active inference configurations, where small modeling errors in endpoint behavior can lead to large emergent effects at the cluster level.
