Conversation
Signed-off-by: Maroon Ayoub <maroon.ayoub@ibm.com>
Pull request overview
This PR updates the repository documentation to reflect that the project has evolved beyond just KV-cache indexing to encompass a broader suite of libraries, services, and connectors. The documentation is reorganized to better present the current scope: KV-Cache indexing, tokenization, KV-event processing, and KV-offloading.
Changes:
- Restructured main README.md to present the repository as a complete KV-Cache subsystem with multiple components (indexer, events, tokenization, offloading)
- Created two new detailed documentation files, `docs/indexer.md` and `docs/tokenization.md`, to provide component-specific deep dives
- Refactored `docs/architecture.md` to be a high-level overview with links to detailed component docs
- Updated `docs/configuration.md` to clarify scope and reference component-specific configuration
- Updated `docs/README.md` to serve as a documentation index with component-specific links
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| README.md | Transformed from KV-indexer-focused to comprehensive subsystem overview with components table, architecture diagram, and quick start guide |
| docs/README.md | Converted to documentation index with links to component docs and examples |
| docs/architecture.md | Simplified to high-level overview, moved detailed content to component-specific docs |
| docs/configuration.md | Updated introduction to clarify scope and add cross-references to other component configs |
| docs/indexer.md | New file providing detailed documentation for KV-Cache Indexer, block hashing, backends, and event processing |
| docs/tokenization.md | New file documenting tokenization pool, backends, UDS service, and chat template preprocessing |
| examples/valkey_example/README.md | Updated link text from "KV-Cache Architecture" to "Architecture" for consistency |
> The **KV-Cache Indexer** is a high-performance library that keeps a global, near-real-time view of KV-Cache block locality across a fleet of vLLM pods.
> Its purpose is the enablement of smart routing and scheduling by exposing a fast, intelligent scoring mechanism for vLLM pods based on their cached KV-blocks.
> This document gives a high-level overview of the Go library system: how the KV-Cache Indexer, KV-Event processing, tokenization, and block index fit together.
Since the specific library components haven't been introduced yet, could we start with a more general overview of the doc's purpose? Something like: "This document provides a high-level overview of the KV-cache library, its data path, and core components, including..."
> The Write Path keeps the index up-to-date by processing a constant stream of events from the vLLM fleet.

> Each `KVEvent` (e.g., `BlockStored`) carries the engine's own block hashes, the tokens stored in the block, and metadata (device tier, LoRA ID, etc.).
> The library does **not** use the engine hashes as index keys directly. Instead, it recomputes its own **request keys** from the tokens using the same deterministic hashing scheme used in the read path (FNV-64a over CBOR-encoded tuples).
Can you briefly describe what an engine hash is, and where/how it is computed?
> The system has two primary data flows: the **Read Path** for scoring pods and the **Write Path** for ingesting cache events.
> ### Read Path: Scoring a Prompt
I know it is unrelated to this PR, but can you highlight that the read path relates to incoming inference requests?
> Efficiently handling tokenization is critical for performance. The system is designed to tokenize prompts quickly using a worker pool that supports both asynchronous and synchronous operations.
> 1. **Event Publication**: A vLLM pod emits an event when its cache changes, published to a ZMQ topic. `BlockStored` events include the engine block hashes, parent hash, token IDs, and metadata.
> 2. **Message Reception**: The `zmqSubscriber` receives the message and parses the topic to get `podIdentifier` and `modelName`.
> 3. **Sharded Queuing**: The `kvevents.Pool` hashes the pod identifier (FNV-1a) to select a worker queue, guaranteeing in-order processing per pod.
> hashes the pod identifier (FNV-1a) to select a worker queue

How this happens is an implementation detail and might change later; I think we can omit it.
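For context, the per-pod sharding described in the quoted step could look like the minimal Go sketch below. This is an illustration only, not the library's actual code; the function name `shardFor` is hypothetical.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor picks a worker queue for a pod. Because the same pod
// identifier always hashes to the same queue, all events from one pod
// are processed in order by a single worker.
func shardFor(podIdentifier string, numWorkers int) int {
	h := fnv.New32a() // FNV-1a, as described in the doc excerpt
	h.Write([]byte(podIdentifier))
	return int(h.Sum32()) % numWorkers
}

func main() {
	// The mapping is stable: repeated calls for the same pod agree.
	fmt.Println(shardFor("pod-a", 4) == shardFor("pod-a", 4))
}
```

The key property is stability, not the particular hash function, which is presumably why the reviewer suggests leaving the algorithm out of the doc.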
> ## Main Configuration

> This package consists of two components:
> The Go library has three top-level configuration areas:
nit: why "The Go library" over "This package"?
> ## Block Hashing & Key Generation

> To guarantee compatibility, the block key generation logic exactly matches vLLM's content-addressing scheme.
It no longer "exactly matches" :)
> * **Hash Algorithm**: A chained hash is computed. Each block's key is an **FNV-64a hash** generated from the CBOR-encoded `[parentHash, tokenChunk, extra]` tuple.
> * **Initialization**: The hash chain starts with a configurable `HashSeed`.
> * **Extra Parameter**: The third component of the hash tuple enables cache differentiation:
>   - **nil** (default): Standard prompts without LoRA or multi-modal content
I think we always include the model name (even for base model requests)
>   - **nil** (default): Standard prompts without LoRA or multi-modal content
>   - **int**: LoRA adapter ID (e.g., 42)
>   - **string**: Adapter name or content-affecting identifier (e.g., "lora-v2")
>   - **map**: Structured metadata (e.g., `{"lora_id": 42, "medium": "gpu"}`)
Do you describe `TokensToKVBlockKeys`? Don't we always hash the base model name/LoRA adapter name strings? I don't think we accept structured metadata at this point (`chunkedTokenDatabase.hash` is private)
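To make the chained scheme under discussion concrete, here is a simplified sketch. It is not the library's implementation: the real scheme CBOR-encodes the tuple, while this sketch substitutes JSON so it runs with only the Go standard library, and the `key` helper name is hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"hash/fnv"
)

// key hashes one [parentHash, tokenChunk, extra] tuple with FNV-64a.
// NOTE: the actual library CBOR-encodes the tuple; JSON is used here
// purely so the example is self-contained.
func key(parent uint64, tokens []int, extra any) uint64 {
	payload, _ := json.Marshal([]any{parent, tokens, extra})
	h := fnv.New64a()
	h.Write(payload)
	return h.Sum64()
}

func main() {
	const hashSeed uint64 = 0 // the configurable HashSeed starts the chain
	chunks := [][]int{{1, 2, 3}, {4, 5, 6}}

	// Each block's key folds in its parent's key, so identical token
	// chunks at different positions in a prompt get different keys.
	parent := hashSeed
	for _, chunk := range chunks {
		parent = key(parent, chunk, nil) // nil extra: base-model prompt
		fmt.Println(parent)
	}
}
```

The chaining via `parentHash` is what makes the keys prefix-sensitive, and the `extra` slot is where LoRA/model identity would differentiate otherwise-identical caches.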
> **Write Path:**
> - A: **Event Ingestion**: As vLLM pods create or evict KV-blocks, they emit `KVEvents` containing metadata about these changes
> - B: **Index Update**: The **Event Subscriber** consumes these events and updates the **KV-Block Index** in near-real-time
> **KV-Event ingestion** - vLLM pods emit `KVEvents` as blocks are stored or evicted. The KV Index consumes these events to maintain a near-real-time global view of block locality.
It is reasonable to assume that most people don’t have a deep understanding of what KV events are. We should probably start with why they are needed and how they are generated.
> ### KV-Offloading

> A vLLM offloading connector that enables KV-Cache block transfers between GPU and shared file-system storage.
It's also worth noting that, by default, KV cache blocks are stored in the available GPU memory.
> **Write Path:**
> - A: **Event Ingestion**: As vLLM pods create or evict KV-blocks, they emit `KVEvents` containing metadata about these changes
> - B: **Index Update**: The **Event Subscriber** consumes these events and updates the **KV-Block Index** in near-real-time
> **KV-Event ingestion** - vLLM pods emit `KVEvents` as blocks are stored or evicted. The KV Index consumes these events to maintain a near-real-time global view of block locality.
Suggested change: replace "vLLM pods" with "model server pods":

> **KV-Event ingestion** - model server pods emit `KVEvents` as blocks are stored or evicted. The KV Index consumes these events to maintain a near-real-time global view of block locality.
> #### Tokenization Subsystem

> Efficiently handling tokenization is critical for performance. The system is designed to tokenize prompts quickly using a worker pool that supports both asynchronous and synchronous operations.
> 1. **Event Publication**: A vLLM pod emits an event when its cache changes, published to a ZMQ topic. `BlockStored` events include the engine block hashes, parent hash, token IDs, and metadata.
Let's use general terms like "model server" instead of vLLM. And at the beginning of the doc, explain that any model server implementing the KV event protocol is supported.
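The sync/async worker pool described in the quoted excerpt could be sketched as below. This is a hypothetical illustration of the pattern, not the library's actual API; the `pool`, `Submit`, and `Tokenize` names are assumptions.

```go
package main

import (
	"fmt"
	"sync"
)

// job carries a prompt to tokenize; out is nil for fire-and-forget work.
type job struct {
	prompt string
	out    chan []int
}

// pool fans jobs out to a fixed set of worker goroutines.
type pool struct {
	jobs chan job
	wg   sync.WaitGroup
}

func newPool(workers int, tokenize func(string) []int) *pool {
	p := &pool{jobs: make(chan job, 64)}
	for i := 0; i < workers; i++ {
		p.wg.Add(1)
		go func() {
			defer p.wg.Done()
			for j := range p.jobs {
				ids := tokenize(j.prompt)
				if j.out != nil {
					j.out <- ids // synchronous caller is waiting
				}
			}
		}()
	}
	return p
}

// Submit enqueues a prompt without waiting for the result (async path).
func (p *pool) Submit(prompt string) { p.jobs <- job{prompt: prompt} }

// Tokenize enqueues a prompt and blocks for the token IDs (sync path).
func (p *pool) Tokenize(prompt string) []int {
	out := make(chan []int, 1)
	p.jobs <- job{prompt: prompt, out: out}
	return <-out
}

func main() {
	// Stand-in tokenizer for the sketch: one "token" per prompt.
	fake := func(s string) []int { return []int{len(s)} }
	p := newPool(2, fake)
	p.Submit("warm the cache")       // async: returns immediately
	fmt.Println(p.Tokenize("hello")) // sync: waits for the result
}
```

The same worker loop serves both modes; only whether the caller waits on the result channel differs.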
This README feels redundant; the top-level README already provides an overview of each component and links to the docs.
Summary
The current repository README and documentation are still focused on KV-cache indexing, which has for some time been only one component of the libraries provided here. This PR updates documentation across the repository to better reflect the current scope and state.