Proposal to Move EPP to the llm-d Repo #2430

@ahg-g

Description

Problem

We currently have two repositories, each producing an EPP image: the upstream Inference Gateway (IGW) and the llm-d inference scheduler. The llm-d inference scheduler extends the IGW with a set of plugins, essentially using it as a library.
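To make the "library plus plugins" relationship concrete, here is a minimal sketch of the pattern in Go, the project's language. All names here (`Scorer`, `Register`, `prefixCacheScorer`) are illustrative assumptions for this sketch, not the actual plugin API of the upstream repo:

```go
package main

import "fmt"

// Endpoint is a candidate model-server pod (illustrative type).
type Endpoint struct {
	Address string
}

// Scorer is a hypothetical plugin hook the upstream EPP library might
// expose: score candidate endpoints for a request.
type Scorer interface {
	Name() string
	Score(prompt string, endpoints []Endpoint) []Endpoint
}

// registry holds the plugins compiled into a custom EPP build.
var registry []Scorer

func Register(s Scorer) { registry = append(registry, s) }

// prefixCacheScorer stands in for an llm-d plugin such as precise
// KV-cache-aware scheduling; real logic would consult a cache index.
type prefixCacheScorer struct{}

func (prefixCacheScorer) Name() string { return "prefix-cache" }
func (prefixCacheScorer) Score(prompt string, eps []Endpoint) []Endpoint {
	return eps // pass-through placeholder
}

func main() {
	// An llm-d build of the EPP would register its extra plugins like this.
	Register(prefixCacheScorer{})
	fmt.Println("plugins registered:", len(registry))
}
```

The point of the pattern is that the upstream repo owns the interfaces and the serving loop, while downstream builds only supply plugin implementations.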

Advanced plugins (like P/D and precise KV-cache-aware scheduling) as well as recent scheduling framework enhancements have mostly been driven by the llm-d project, with most new plugins hosted in the llm-d repo. This has created a suboptimal situation: fragmented code, and duplicated maintenance and documentation.

This has confused not only developers about what code belongs where, but end users as well.

Proposal

Move the Endpoint Picker code from its current repo under the kubernetes org to the llm-d scheduler repo. The InferencePool API and the EPP protocol definition, along with their conformance tests, will stay under the kubernetes org and ideally move eventually to the Gateway API repository.

The InferenceObjective and InferenceModelRewrite APIs would also move with the EPP, since they do not spec any behavior of the Gateway itself, only of the EPP.

Rationale

The Inference Gateway (IGW) project is split into three distinct functional parts:

  • An API (InferencePool) and protocol (endpoint picker protocol) that extends the Kubernetes Gateway API. These spec how a Gateway should configure an external callout service—the endpoint picker—for inference-optimized request scheduling. Crucially, the API and protocols remain agnostic of the selection logic, leaving the specific endpoint selection algorithm to the picker implementation.
  • A conformant Gateway ecosystem that supports the InferencePool API and Endpoint Picker protocol. As of this proposal, this includes GKE Gateway, Istio, kGateway, and NGINX Gateway Fabric.
  • An implementation of the endpoint picker service, known as epp.

Beyond standard protocol implementation (primarily ExtProc gRPC handling), the core value of the epp lies in its inference-specific optimizations. These include routing requests based on LoRA adapter affinity, KV-cache utilization, and predicted latency to pick the most suitable endpoint for each request.
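As an illustration of how such signals might combine, here is a hedged Go sketch that scores candidate endpoints on the three signals named above. The struct fields, weights, and scoring formula are invented for this example and do not reflect the real epp scoring logic:

```go
package main

import "fmt"

// endpoint captures the (hypothetical) signals the text mentions.
type endpoint struct {
	addr        string
	loraLoaded  bool    // requested LoRA adapter already resident
	kvCacheUtil float64 // fraction of KV cache in use, 0..1
	predLatency float64 // predicted latency in milliseconds
}

// score combines the signals; the weights here are arbitrary.
func score(e endpoint) float64 {
	s := 0.0
	if e.loraLoaded {
		s += 1.0 // avoid the cost of loading the adapter
	}
	s += 1.0 - e.kvCacheUtil            // prefer spare KV-cache capacity
	s += 1.0 / (1.0 + e.predLatency/100) // prefer lower predicted latency
	return s
}

// pick returns the highest-scoring endpoint.
func pick(eps []endpoint) endpoint {
	best := eps[0]
	for _, e := range eps[1:] {
		if score(e) > score(best) {
			best = e
		}
	}
	return best
}

func main() {
	eps := []endpoint{
		{"pod-a", false, 0.9, 120},
		{"pod-b", true, 0.4, 80},
	}
	fmt.Println("picked:", pick(eps).addr) // pod-b wins on every signal
}
```

The real scheduler framework is plugin-based rather than a single hard-coded formula, but the sketch shows why these optimizations demand serving-stack expertise: the signals come from the model servers themselves.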

Advancing these optimizations requires deep AI/ML domain expertise and tight integration with model server implementations. Consequently, the ideal community to drive the endpoint picker forward is one that actively involves model server developers (such as the vLLM and sglang communities) and model developers, ensuring the scheduler evolves alongside the underlying serving technology.

As such, llm-d (which has applied for CNCF sandbox status) offers this community and is the ideal home for the epp to continue evolving and adapting to new innovations in GenAI serving.
