xxrjun/gb200-kvcache-offload-study

| Paper | Poster |

This is an empirical study benchmarking LLM inference with KV cache offloading using vLLM and LMCache on NVIDIA GB200 with high-bandwidth NVLink-C2C.

Keywords: GB200, HBM3e, NVLink5, NVLink-C2C, Large Model Inference, Offloading, Parallelism, MoE

Environment Setup

We recommend using uv, an extremely fast Python package and project manager written in Rust.

# Method 1. Install uv via pip
pip install uv

# Method 2. Install uv via shell script
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env

After installing uv, create and activate a virtual environment, then install vLLM with PyTorch backend:

bash requirements/install_vllm_cu128.sh
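
The install script wraps the uv-based setup. As a rough sketch only (the exact commands live in the script itself; the flags and version pins below are assumptions, not verified against it), the equivalent manual steps would look like:

```shell
# Create and activate a virtual environment with uv (Python version is an assumption)
uv venv .venv --python 3.13
source .venv/bin/activate

# Install vLLM with a CUDA-enabled PyTorch backend; uv's --torch-backend flag
# selects the matching PyTorch wheel index (cu128 here, per the script's name)
uv pip install vllm --torch-backend=cu128
uv pip install lmcache
```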

Tested versions:

  • Python: 3.13.7
  • vLLM: 0.11.2
  • PyTorch: 2.9.0+cu130
  • LMCache: 0.3.9.post2

Usage

Run serve-benchmark sweeps for different models, with and without KV cache offloading. For openai/gpt-oss-120b, use the long sequence-length settings; for meta-llama/Llama-3.3-70B-Instruct, use the short sequence-length settings.

# Example: run sweep serve for a model
# bash scripts/bench_serve/bench_sweep_serve.sh <MODEL_NAME> <MODE>

# Example 1: openai/gpt-oss-120b
bash scripts/bench_serve/bench_sweep_serve.sh openai/gpt-oss-120b long

# Example 2: meta-llama/Llama-3.3-70B-Instruct
bash scripts/bench_serve/bench_sweep_serve.sh meta-llama/Llama-3.3-70B-Instruct short
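
Under the hood, a sweep run with offloading enabled boils down to serving the model through LMCache's vLLM connector. The following is a hypothetical sketch of one such configuration (the actual flags and environment variables used per run live in scripts/bench_serve/bench_sweep_serve.sh; the values below are illustrative assumptions):

```shell
# Enable LMCache's CPU-side cache and cap its size (GB); values are examples only
export LMCACHE_LOCAL_CPU=True
export LMCACHE_MAX_LOCAL_CPU_SIZE=200

# Serve the model with KV cache transfer routed through the LMCache v1 connector
vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_both"}'
```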

Tip

For generating summary tables and plots, refer to the scripts and README.md in scripts/bench_serve/.

Acknowledgement

We gratefully acknowledge the support of Taiwanese enterprises and academic institutions for providing access to the H200 and GB200 computing resources used in this work. We also thank the open-source community for developing and maintaining vLLM and LMCache.

Limitation

The script currently only supports single-node multi-GPU setups. Multi-node support is experimental and may require additional configuration.

Troubleshooting

openai_harmony.HarmonyError: error downloading or loading vocab file: failed to download or load vocab file

Reference: vllm-project/vllm#22525 (comment)

mkdir -p ~/.cache/harmony/encodings
curl -o ~/.cache/harmony/encodings/o200k_base.tiktoken https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken
curl -o ~/.cache/harmony/encodings/cl100k_base.tiktoken https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken

Then export TIKTOKEN_ENCODINGS_BASE=~/.cache/harmony/encodings before running your script.
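That is:

```shell
# Point the harmony tokenizer at the locally downloaded vocab files
export TIKTOKEN_ENCODINGS_BASE=~/.cache/harmony/encodings
echo "$TIKTOKEN_ENCODINGS_BASE"
```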

Citation

@misc{tsou2025gb200kvcacheoffloading,
  title={High-Bandwidth KV Cache Offloading for MoE Inference on NVIDIA GB200},
  author={Hsiang-Yu Tsou and Zih-Heng Ma and Yung-Hsiang Yang},
  year={2025}
}
