
update repo docs and scope#312

Open
vMaroon wants to merge 1 commit into `main` from `scope`

Conversation

@vMaroon (Member) commented on Feb 13, 2026

Summary

The repository README and documentation are still focused on KV-cache indexing, which has for some time been only one component of the libraries provided here. This PR updates the documentation across the repository to better reflect the current scope and state.

Signed-off-by: Maroon Ayoub <maroon.ayoub@ibm.com>
@vMaroon vMaroon requested a review from dannyharnik as a code owner February 13, 2026 16:10
Copilot AI review requested due to automatic review settings February 13, 2026 16:10
@vMaroon vMaroon requested a review from kfirtoledo as a code owner February 13, 2026 16:10
Copilot AI (Contributor) left a comment


Pull request overview

This PR updates the repository documentation to reflect that the project has evolved beyond just KV-cache indexing to encompass a broader suite of libraries, services, and connectors. The documentation is reorganized to better present the current scope: KV-Cache indexing, tokenization, KV-event processing, and KV-offloading.

Changes:

  • Restructured main README.md to present the repository as a complete KV-Cache subsystem with multiple components (indexer, events, tokenization, offloading)
  • Created two new detailed documentation files: docs/indexer.md and docs/tokenization.md to provide component-specific deep dives
  • Refactored docs/architecture.md to be a high-level overview with links to detailed component docs
  • Updated docs/configuration.md to clarify scope and reference component-specific configuration
  • Updated docs/README.md to serve as a documentation index with component-specific links

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.

Summary per file:

| File | Description |
| --- | --- |
| README.md | Transformed from KV-indexer-focused to a comprehensive subsystem overview with a components table, architecture diagram, and quick-start guide |
| docs/README.md | Converted to a documentation index with links to component docs and examples |
| docs/architecture.md | Simplified to a high-level overview; detailed content moved to component-specific docs |
| docs/configuration.md | Updated introduction to clarify scope and add cross-references to other component configs |
| docs/indexer.md | New file providing detailed documentation for the KV-Cache Indexer, block hashing, backends, and event processing |
| docs/tokenization.md | New file documenting the tokenization pool, backends, UDS service, and chat template preprocessing |
| examples/valkey_example/README.md | Updated link text from "KV-Cache Architecture" to "Architecture" for consistency |


@sagearc (Collaborator) left a comment


Great work @vMaroon ! :)


The **KV-Cache Indexer** is a high-performance library that keeps a global, near-real-time view of KV-Cache block locality across a fleet of vLLM pods.
Its purpose is the enablement of smart routing and scheduling by exposing a fast, intelligent scoring mechanism for vLLM pods based on their cached KV-blocks.
This document gives a high-level overview of the Go library system: how the KV-Cache Indexer, KV-Event processing, tokenization, and block index fit together.
Collaborator:

Since the specific library components haven't been introduced yet, could we start with a more general overview of the doc's purpose? Something like: "This document provides a high-level overview of the KV-cache library, its data path, and core components, including..."

The Write Path keeps the index up-to-date by processing a constant stream of events from the vLLM fleet.

Each `KVEvent` (e.g., `BlockStored`) carries the engine's own block hashes, the tokens stored in the block, and metadata (device tier, LoRA ID, etc.).
The library does **not** use the engine hashes as index keys directly. Instead, it recomputes its own **request keys** from the tokens using the same deterministic hashing scheme used in the read path (FNV-64a over CBOR-encoded tuples).
Collaborator:

Can you briefly describe what an engine hash is, and where/how it is computed?


The system has two primary data flows: the **Read Path** for scoring pods and the **Write Path** for ingesting cache events.

### Read Path: Scoring a Prompt
Collaborator:

I know it is unrelated to this PR, but can you highlight that the read path serves incoming inference requests?

Efficiently handling tokenization is critical for performance. The system is designed to tokenize prompts quickly using a worker pool that supports both asynchronous and synchronous operations.
1. **Event Publication**: A vLLM pod emits an event when its cache changes, published to a ZMQ topic. `BlockStored` events include the engine block hashes, parent hash, token IDs, and metadata.
2. **Message Reception**: The `zmqSubscriber` receives the message and parses the topic to get `podIdentifier` and `modelName`.
3. **Sharded Queuing**: The `kvevents.Pool` hashes the pod identifier (FNV-1a) to select a worker queue, guaranteeing in-order processing per pod.
Collaborator:

> hashes the pod identifier (FNV-1a) to select a worker queue

How this happens is an implementation detail that might change later. I think we can omit it.
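For reference, the per-pod ordering guarantee described in the excerpt can be sketched in a few lines of Go. This is a minimal illustration, not the library's actual API: the function name `shardFor` and the modulo scheme are assumptions; the point is only that a stable hash of the pod identifier always selects the same queue, so events from one pod are processed in order.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor picks a worker queue for a pod by hashing its identifier with
// FNV-1a. Because the hash is deterministic, all events from the same pod
// land on the same queue, which preserves per-pod ordering.
func shardFor(podIdentifier string, numWorkers int) int {
	h := fnv.New32a()
	h.Write([]byte(podIdentifier))
	return int(h.Sum32() % uint32(numWorkers))
}

func main() {
	// The same pod identifier always maps to the same queue.
	for _, pod := range []string{"pod-a", "pod-b", "pod-a"} {
		fmt.Printf("%s -> queue %d\n", pod, shardFor(pod, 4))
	}
}
```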

## Main Configuration

This package consists of two components:
The Go library has three top-level configuration areas:
Collaborator:

nit: why "The Go library" over "This package"?


## Block Hashing & Key Generation

To guarantee compatibility, the block key generation logic exactly matches vLLM's content-addressing scheme.
Collaborator:

Not anymore "exactly matches" :)

* **Hash Algorithm**: A chained hash is computed. Each block's key is an **FNV-64a hash** generated from the CBOR-encoded `[parentHash, tokenChunk, extra]` tuple.
* **Initialization**: The hash chain starts with a configurable `HashSeed`.
* **Extra Parameter**: The third component of the hash tuple enables cache differentiation:
- **nil** (default): Standard prompts without LoRA or multi-modal content
Collaborator:

I think we always include the model name (even for base model requests)

- **nil** (default): Standard prompts without LoRA or multi-modal content
- **int**: LoRA adapter ID (e.g., 42)
- **string**: Adapter name or content-affecting identifier (e.g., "lora-v2")
- **map**: Structured metadata (e.g., `{"lora_id": 42, "medium": "gpu"}`)
Collaborator:

Do you describe TokensToKVBlockKeys? Don't we always hash the base model name/lora adapter name strings? I don't think we accept structured metadata at this point (chunkedTokenDatabase.hash is private)
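The chained-hash scheme the excerpt describes can be sketched as follows. This is a simplified illustration, not the library's implementation: the real scheme CBOR-encodes the `[parentHash, tokenChunk, extra]` tuple before hashing, whereas this sketch substitutes a plain big-endian serialization, and `blockKey` is a hypothetical function name.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/fnv"
)

// blockKey computes one link of the chained hash: an FNV-64a digest over
// (parentHash, tokenChunk, extra). The real library serializes the tuple
// with CBOR; binary encoding here is a stand-in for illustration only.
func blockKey(parent uint64, tokens []uint32, extra string) uint64 {
	h := fnv.New64a()
	binary.Write(h, binary.BigEndian, parent)
	for _, t := range tokens {
		binary.Write(h, binary.BigEndian, t)
	}
	h.Write([]byte(extra)) // extra differentiates e.g. LoRA variants
	return h.Sum64()
}

func main() {
	// The chain starts from a configurable seed; each block's key then
	// depends on its parent's key, so identical chunks at different
	// positions in a prompt get different keys.
	seed := blockKey(0, nil, "hash-seed")
	k1 := blockKey(seed, []uint32{101, 102, 103}, "")
	k2 := blockKey(k1, []uint32{104, 105, 106}, "")
	fmt.Printf("block 1: %x\nblock 2: %x\n", k1, k2)
}
```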

**Write Path:**
- A: **Event Ingestion**: As vLLM pods create or evict KV-blocks, they emit `KVEvents` containing metadata about these changes
- B: **Index Update**: The **Event Subscriber** consumes these events and updates the **KV-Block Index** in near-real-time
**KV-Event ingestion** - vLLM pods emit `KVEvents` as blocks are stored or evicted. The KV Index consumes these events to maintain a near-real-time global view of block locality.
Collaborator:

It is reasonable to assume that most people don’t have a deep understanding of what KV events are. We should probably start with why they are needed and how they are generated.


### KV-Offloading

A vLLM offloading connector that enables KV-Cache block transfers between GPU and shared file-system storage.
Collaborator:

It's also worth noting that, by default, KV cache blocks are stored in the available GPU memory.

**Write Path:**
- A: **Event Ingestion**: As vLLM pods create or evict KV-blocks, they emit `KVEvents` containing metadata about these changes
- B: **Index Update**: The **Event Subscriber** consumes these events and updates the **KV-Block Index** in near-real-time
**KV-Event ingestion** - vLLM pods emit `KVEvents` as blocks are stored or evicted. The KV Index consumes these events to maintain a near-real-time global view of block locality.
Collaborator:

Suggested change:

```diff
- **KV-Event ingestion** - vLLM pods emit `KVEvents` as blocks are stored or evicted. The KV Index consumes these events to maintain a near-real-time global view of block locality.
+ **KV-Event ingestion** - model server pods emit `KVEvents` as blocks are stored or evicted. The KV Index consumes these events to maintain a near-real-time global view of block locality.
```

#### Tokenization Subsystem

Efficiently handling tokenization is critical for performance. The system is designed to tokenize prompts quickly using a worker pool that supports both asynchronous and synchronous operations.
1. **Event Publication**: A vLLM pod emits an event when its cache changes, published to a ZMQ topic. `BlockStored` events include the engine block hashes, parent hash, token IDs, and metadata.
Collaborator:

Let's use general terms like "model server" instead of vLLM. And in the beginning of the doc, explain that any model server that implements the kv event protocol is supported.
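The worker pool the tokenization excerpt describes, supporting both asynchronous submission and a synchronous blocking call, could be sketched like this. All names here (`tokenizeJob`, `startPool`, `tokenizeSync`) are illustrative assumptions, and the tokenizer is a trivial fake; only the pool shape is the point.

```go
package main

import (
	"fmt"
	"sync"
)

// tokenizeJob carries a prompt in and a result channel out, so callers
// can choose to block on the result (sync) or collect it later (async).
type tokenizeJob struct {
	prompt string
	result chan int // token count stands in for real token IDs
}

// startPool launches a fixed set of workers draining a shared job channel.
func startPool(workers int, jobs <-chan tokenizeJob) *sync.WaitGroup {
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for job := range jobs {
				// Fake tokenizer: one "token" per rune.
				job.result <- len([]rune(job.prompt))
			}
		}()
	}
	return &wg
}

// tokenizeSync submits a job and blocks until its result is ready.
func tokenizeSync(jobs chan<- tokenizeJob, prompt string) int {
	res := make(chan int, 1)
	jobs <- tokenizeJob{prompt: prompt, result: res}
	return <-res
}

func main() {
	jobs := make(chan tokenizeJob)
	wg := startPool(4, jobs)
	fmt.Println(tokenizeSync(jobs, "hello")) // 5
	close(jobs)
	wg.Wait()
}
```

An asynchronous caller would instead keep the `result` channel and read from it later, letting many tokenizations proceed in parallel across the pool.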

Collaborator:

This README feels redundant; the top-level README already provides an overview of each component and links to the docs.



4 participants