Description
Problem
We currently have two repositories each producing an EPP image: the upstream inference gateway and the llm-d-scheduler. The llm-d inference scheduler extends the IGW with a set of plugins, essentially using it as a library.
Advanced plugins (such as P/D disaggregation and precise KV-cache-aware scheduling), as well as recent scheduling-framework enhancements, have mostly been driven by the llm-d project, with most new plugins hosted in the llm-d repo. This has created a suboptimal situation: fragmented code plus duplicated maintenance and documentation.
It causes confusion among developers about what belongs where, and among end users as well.
Proposal
Move the Endpoint Picker code from its current repository under the Kubernetes org to the llm-d scheduler repo. The InferencePool API and the EPP protocol definition, along with their conformance tests, will stay in Kubernetes, ideally moving eventually to the Gateway API repository.
The InferenceObjective and InferenceModelRewrite APIs would also move with the EPP, since they do not specify any behavior of the Gateway itself, only of the EPP.
Rationale
The Inference Gateway (IGW) project is split into three distinct functional parts:
- An API (InferencePool) and protocol (endpoint picker protocol) that extends the Kubernetes Gateway API. These spec how a Gateway should configure an external callout service—the endpoint picker—for inference-optimized request scheduling. Crucially, the API and protocols remain agnostic of the selection logic, leaving the specific endpoint selection algorithm to the picker implementation.
- A conformant Gateway ecosystem that supports the InferencePool API and Endpoint Picker protocol. As of this proposal, this includes GKE Gateway, Istio, kGateway, and NGINX Gateway Fabric.
- An implementation of the endpoint picker service, known as epp.
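To make the split between the API/protocol and the picker implementation concrete, here is a minimal sketch of an InferencePool manifest wiring a Gateway to an external endpoint picker. The API group, version, field names, and resource names below are illustrative assumptions based on the alpha `x-k8s.io` API and may differ from current releases:

```yaml
# Hypothetical sketch: an InferencePool selecting a set of model-server pods
# and delegating endpoint selection to an external picker (the epp).
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: vllm-llama3-pool          # illustrative name
spec:
  selector:
    app: vllm-llama3              # pods serving the model
  targetPortNumber: 8000          # port the model server listens on
  extensionRef:
    name: vllm-llama3-epp         # the endpoint picker callout service
```

The key design point this illustrates: the InferencePool only names the callout service via `extensionRef`; the selection algorithm lives entirely in the picker implementation, which is why the API can stay in Kubernetes while the picker moves.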
Beyond standard protocol implementation (primarily ExtProc gRPC handling), the core value of the epp lies in its inference-specific optimizations. These include routing requests based on LoRA adapter affinity, lowest KV-cache utilization, and predicted latency, picking the endpoint best suited to serve each request.
Advancing these optimizations requires deep AI/ML domain expertise and tight integration with model server implementations. Consequently, the ideal community to drive the endpoint picker forward is one that actively involves model server developers (such as the vLLM and sglang communities) and model developers, ensuring the scheduler evolves alongside the underlying serving technology.
As such, llm-d (currently applying for CNCF sandbox status) offers this community and is the ideal home for the epp to continue to evolve and adapt to new innovations in GenAI serving.