Conversation
Signed-off-by: Maroon Ayoub <maroon.ayoub@ibm.com>
Pull request overview
This PR updates the repository documentation to reflect that the project has evolved beyond just KV-cache indexing to encompass a broader suite of libraries, services, and connectors. The documentation is reorganized to better present the current scope: KV-Cache indexing, tokenization, KV-event processing, and KV-offloading.
Changes:
- Restructured main README.md to present the repository as a complete KV-Cache subsystem with multiple components (indexer, events, tokenization, offloading)
- Created two new detailed documentation files, `docs/indexer.md` and `docs/tokenization.md`, to provide component-specific deep dives
- Refactored `docs/architecture.md` to be a high-level overview with links to detailed component docs
- Updated `docs/configuration.md` to clarify scope and reference component-specific configuration
- Updated `docs/README.md` to serve as a documentation index with component-specific links
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| README.md | Transformed from KV-indexer-focused to comprehensive subsystem overview with components table, architecture diagram, and quick start guide |
| docs/README.md | Converted to documentation index with links to component docs and examples |
| docs/architecture.md | Simplified to high-level overview, moved detailed content to component-specific docs |
| docs/configuration.md | Updated introduction to clarify scope and add cross-references to other component configs |
| docs/indexer.md | New file providing detailed documentation for KV-Cache Indexer, block hashing, backends, and event processing |
| docs/tokenization.md | New file documenting tokenization pool, backends, UDS service, and chat template preprocessing |
| examples/valkey_example/README.md | Updated link text from "KV-Cache Architecture" to "Architecture" for consistency |
> The **KV-Cache Indexer** is a high-performance library that keeps a global, near-real-time view of KV-Cache block locality across a fleet of vLLM pods.
> Its purpose is the enablement of smart routing and scheduling by exposing a fast, intelligent scoring mechanism for vLLM pods based on their cached KV-blocks.
> This document gives a high-level overview of the Go library system: how the KV-Cache Indexer, KV-Event processing, tokenization, and block index fit together.
Since the specific library components haven't been introduced yet, could we start with a more general overview of the doc's purpose? Something like: "This document provides a high-level overview of the KV-cache library, its data path, and core components, including..."
> The Write Path keeps the index up-to-date by processing a constant stream of events from the vLLM fleet.

> Each `KVEvent` (e.g., `BlockStored`) carries the engine's own block hashes, the tokens stored in the block, and metadata (device tier, LoRA ID, etc.).
> The library does **not** use the engine hashes as index keys directly. Instead, it recomputes its own **request keys** from the tokens using the same deterministic hashing scheme used in the read path (FNV-64a over CBOR-encoded tuples).
Can you briefly describe what an engine hash is, and where/how it is computed?
> The system has two primary data flows: the **Read Path** for scoring pods and the **Write Path** for ingesting cache events.
> ### Read Path: Scoring a Prompt
I know it is unrelated to this PR, but can you highlight that the read path relates to incoming inference requests?
> Efficiently handling tokenization is critical for performance. The system is designed to tokenize prompts quickly using a worker pool that supports both asynchronous and synchronous operations.
> 1. **Event Publication**: A vLLM pod emits an event when its cache changes, published to a ZMQ topic. `BlockStored` events include the engine block hashes, parent hash, token IDs, and metadata.
> 2. **Message Reception**: The `zmqSubscriber` receives the message and parses the topic to get `podIdentifier` and `modelName`.
> 3. **Sharded Queuing**: The `kvevents.Pool` hashes the pod identifier (FNV-1a) to select a worker queue, guaranteeing in-order processing per pod.
> hashes the pod identifier (FNV-1a) to select a worker queue

How this happens is an implementation detail and might change later; I think we can omit it.
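For context, the per-pod sharding described in the quoted step could look like the minimal Go sketch below. This is an illustration only, not the library's actual code; the function name `shardFor` is hypothetical.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor picks a worker queue for a pod. Because the same pod
// identifier always hashes to the same queue, all events from one pod
// are processed in order by a single worker.
func shardFor(podIdentifier string, numWorkers int) int {
	h := fnv.New32a() // FNV-1a, as described in the doc excerpt
	h.Write([]byte(podIdentifier))
	return int(h.Sum32()) % numWorkers
}

func main() {
	// The mapping is stable: repeated calls for the same pod agree.
	fmt.Println(shardFor("pod-a", 4) == shardFor("pod-a", 4))
}
```

The key property is stability, not the particular hash function, which is presumably why the reviewer suggests leaving the algorithm out of the doc.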
> ## Main Configuration

> This package consists of two components:
> The Go library has three top-level configuration areas:
nit: why "The Go library" over "This package"?
> ## Block Hashing & Key Generation

> To guarantee compatibility, the block key generation logic exactly matches vLLM's content-addressing scheme.
It no longer "exactly matches" :)
> * **Hash Algorithm**: A chained hash is computed. Each block's key is an **FNV-64a hash** generated from the CBOR-encoded `[parentHash, tokenChunk, extra]` tuple.
> * **Initialization**: The hash chain starts with a configurable `HashSeed`.
> * **Extra Parameter**: The third component of the hash tuple enables cache differentiation:
>   - **nil** (default): Standard prompts without LoRA or multi-modal content
I think we always include the model name (even for base model requests)
>   - **nil** (default): Standard prompts without LoRA or multi-modal content
>   - **int**: LoRA adapter ID (e.g., 42)
>   - **string**: Adapter name or content-affecting identifier (e.g., "lora-v2")
>   - **map**: Structured metadata (e.g., `{"lora_id": 42, "medium": "gpu"}`)
Do you describe `TokensToKVBlockKeys`? Don't we always hash the base model name/LoRA adapter name strings? I don't think we accept structured metadata at this point (`chunkedTokenDatabase.hash` is private)
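To make the chained scheme under discussion concrete, here is a simplified sketch. It is not the library's implementation: the real scheme CBOR-encodes the tuple, while this sketch substitutes JSON so it runs with only the Go standard library, and the `key` helper name is hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"hash/fnv"
)

// key hashes one [parentHash, tokenChunk, extra] tuple with FNV-64a.
// NOTE: the actual library CBOR-encodes the tuple; JSON is used here
// purely so the example is self-contained.
func key(parent uint64, tokens []int, extra any) uint64 {
	payload, _ := json.Marshal([]any{parent, tokens, extra})
	h := fnv.New64a()
	h.Write(payload)
	return h.Sum64()
}

func main() {
	const hashSeed uint64 = 0 // the configurable HashSeed starts the chain
	chunks := [][]int{{1, 2, 3}, {4, 5, 6}}

	// Each block's key folds in its parent's key, so identical token
	// chunks at different positions in a prompt get different keys.
	parent := hashSeed
	for _, chunk := range chunks {
		parent = key(parent, chunk, nil) // nil extra: base-model prompt
		fmt.Println(parent)
	}
}
```

The chaining via `parentHash` is what makes the keys prefix-sensitive, and the `extra` slot is where LoRA/model identity would differentiate otherwise-identical caches.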
> **Write Path:**
> - A: **Event Ingestion**: As vLLM pods create or evict KV-blocks, they emit `KVEvents` containing metadata about these changes
> - B: **Index Update**: The **Event Subscriber** consumes these events and updates the **KV-Block Index** in near-real-time
> **KV-Event ingestion** - vLLM pods emit `KVEvents` as blocks are stored or evicted. The KV Index consumes these events to maintain a near-real-time global view of block locality.
It is reasonable to assume that most people don’t have a deep understanding of what KV events are. We should probably start with why they are needed and how they are generated.
> ### KV-Offloading

> A vLLM offloading connector that enables KV-Cache block transfers between GPU and shared file-system storage.
It's also worth noting that, by default, KV cache blocks are stored in the available GPU memory.
> **Write Path:**
> - A: **Event Ingestion**: As vLLM pods create or evict KV-blocks, they emit `KVEvents` containing metadata about these changes
> - B: **Index Update**: The **Event Subscriber** consumes these events and updates the **KV-Block Index** in near-real-time
> **KV-Event ingestion** - vLLM pods emit `KVEvents` as blocks are stored or evicted. The KV Index consumes these events to maintain a near-real-time global view of block locality.
Suggested change: replace "vLLM pods" with "model server pods":

> **KV-Event ingestion** - model server pods emit `KVEvents` as blocks are stored or evicted. The KV Index consumes these events to maintain a near-real-time global view of block locality.
> #### Tokenization Subsystem

> Efficiently handling tokenization is critical for performance. The system is designed to tokenize prompts quickly using a worker pool that supports both asynchronous and synchronous operations.
> 1. **Event Publication**: A vLLM pod emits an event when its cache changes, published to a ZMQ topic. `BlockStored` events include the engine block hashes, parent hash, token IDs, and metadata.
Let's use general terms like "model server" instead of vLLM. And at the beginning of the doc, explain that any model server implementing the KV event protocol is supported.
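The sync/async worker pool described in the quoted excerpt could be sketched as below. This is a hypothetical illustration of the pattern, not the library's actual API; the `pool`, `Submit`, and `Tokenize` names are assumptions.

```go
package main

import (
	"fmt"
	"sync"
)

// job carries a prompt to tokenize; out is nil for fire-and-forget work.
type job struct {
	prompt string
	out    chan []int
}

// pool fans jobs out to a fixed set of worker goroutines.
type pool struct {
	jobs chan job
	wg   sync.WaitGroup
}

func newPool(workers int, tokenize func(string) []int) *pool {
	p := &pool{jobs: make(chan job, 64)}
	for i := 0; i < workers; i++ {
		p.wg.Add(1)
		go func() {
			defer p.wg.Done()
			for j := range p.jobs {
				ids := tokenize(j.prompt)
				if j.out != nil {
					j.out <- ids // synchronous caller is waiting
				}
			}
		}()
	}
	return p
}

// Submit enqueues a prompt without waiting for the result (async path).
func (p *pool) Submit(prompt string) { p.jobs <- job{prompt: prompt} }

// Tokenize enqueues a prompt and blocks for the token IDs (sync path).
func (p *pool) Tokenize(prompt string) []int {
	out := make(chan []int, 1)
	p.jobs <- job{prompt: prompt, out: out}
	return <-out
}

func main() {
	// Stand-in tokenizer for the sketch: one "token" per prompt.
	fake := func(s string) []int { return []int{len(s)} }
	p := newPool(2, fake)
	p.Submit("warm the cache")       // async: returns immediately
	fmt.Println(p.Tokenize("hello")) // sync: waits for the result
}
```

The same worker loop serves both modes; only whether the caller waits on the result channel differs.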
This README feels redundant; the top-level README already provides an overview of each component and links to the docs.
Summary
The current repository README and documentation are still focused on KV-cache indexing, which has for some time been only one component of the libraries provided here. This PR updates documentation across the repository to better reflect the current scope and state.