This is an empirical study benchmarking LLM inference with KV cache offloading, using vLLM and LMCache on NVIDIA GB200 with its high-bandwidth NVLink-C2C interconnect.
Keywords: GB200, HBM3e, NVLink5, NVLink-C2C, Large Model Inference, Offloading, Parallelism, MoE
We recommend using uv, an extremely fast Python package and project manager written in Rust.
```shell
# Method 1. Install uv via pip
pip install uv

# Method 2. Install uv via shell script
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env
```

After installing uv, create and activate a virtual environment, then install vLLM with the PyTorch backend:
```shell
bash requirements/install_vllm_cu128.sh
```

Tested versions:
- Python: 3.13.7
- vLLM: 0.11.2
- PyTorch: 2.9.0+cu130
- LMCache: 0.3.9.post2
Run the serving sweep experiments for different models, with and without KV cache offloading. For openai/gpt-oss-120b, use the long sequence-length settings; for meta-llama/Llama-3.3-70B-Instruct, use the short sequence-length settings.
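As an illustration only, the long/short mode argument could map to sequence-length settings along these lines; the numbers below are assumptions, not the values actually defined in scripts/bench_serve/bench_sweep_serve.sh:

```python
# Hypothetical mapping from sweep mode to input/output sequence lengths
# (ISL/OSL). These numbers are illustrative assumptions, not the values
# used by scripts/bench_serve/bench_sweep_serve.sh.
SWEEP_MODES = {
    "long": {"input_len": 32768, "output_len": 1024},   # long-context runs, e.g. gpt-oss-120b
    "short": {"input_len": 2048, "output_len": 256},    # short-context runs, e.g. Llama-3.3-70B
}

def settings_for(mode: str) -> dict:
    """Return the sequence-length settings for a sweep mode."""
    if mode not in SWEEP_MODES:
        raise ValueError(f"unknown mode {mode!r}; expected one of {sorted(SWEEP_MODES)}")
    return SWEEP_MODES[mode]
```

Consult the sweep script itself for the authoritative values.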
```shell
# Usage: bash scripts/bench_serve/bench_sweep_serve.sh <MODEL_NAME> <MODE>

# Example 1: openai/gpt-oss-120b (long sequence lengths)
bash scripts/bench_serve/bench_sweep_serve.sh openai/gpt-oss-120b long

# Example 2: meta-llama/Llama-3.3-70B-Instruct (short sequence lengths)
bash scripts/bench_serve/bench_sweep_serve.sh meta-llama/Llama-3.3-70B-Instruct short
```

Tip: for generating summary tables and plots, refer to the scripts and README.md in `scripts/bench_serve/`.
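As a sketch of what the summary step computes, per-run benchmark records could be aggregated into a single summary row like this; the field names (`ttft_ms`, `throughput_tok_s`) are assumptions for illustration, not the actual schema emitted by the benchmark scripts:

```python
from statistics import mean

def summarize(runs: list[dict]) -> dict:
    """Aggregate per-run benchmark records into one summary row.

    Field names are hypothetical; adapt them to the JSON actually
    produced by the serving benchmark.
    """
    return {
        "runs": len(runs),
        "mean_ttft_ms": round(mean(r["ttft_ms"] for r in runs), 2),
        "mean_throughput_tok_s": round(mean(r["throughput_tok_s"] for r in runs), 2),
    }
```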
We gratefully acknowledge the support of Taiwanese enterprises and academic institutions for providing access to the H200 and GB200 computing resources used in this work. We also thank the open-source community for developing and maintaining vLLM and LMCache.
The script currently only supports single-node multi-GPU setups. Multi-node support is experimental and may require additional configuration.
If you hit the following error:

```
openai_harmony.HarmonyError: error downloading or loading vocab file: failed to download or load vocab file
```
Reference: vllm-project/vllm#22525 (comment)
pre-download the tiktoken encoding files:

```shell
mkdir -p ~/.cache/harmony/encodings
curl -o ~/.cache/harmony/encodings/o200k_base.tiktoken https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken
curl -o ~/.cache/harmony/encodings/cl100k_base.tiktoken https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken
```

Then export `TIKTOKEN_ENCODINGS_BASE=~/.cache/harmony/encodings` before running your script.
```bibtex
@misc{tsou2025gb200kvcacheoffloading,
  title={High-Bandwidth KV Cache Offloading for MoE Inference on NVIDIA GB200},
  author={Hsiang-Yu Tsou and Zih-Heng Ma and Yung-Hsiang Yang},
  year={2025}
}
```