Conversation
- Fix headSizeV in attentionOp.cpp: use v_head_dim instead of qk_nope_head_dim for the non-sparse MLA context path. GLM5 has v_head_dim=256 vs qk_nope_head_dim=192; the old assumption (v_head_dim == qk_nope_head_dim) only held for DeepSeek V3.
- Relax the FMHA assertion for context MLA to a warning instead of a fatal error, allowing graceful fallback when kernels are unavailable for a given head-size combination.
- Add examples/glm5_nvfp4_serving.sh: step-by-step guide to serve GLM5 NVFP4 on 8xB200 using DeepseekV32ForCausalLM (model_type=deepseek_v32) with the DSA indexer, FP8 KV cache, and PerkzZheng's FMHA kernel branch. Verified working on 8xB200 (SM100), TRT-LLM 1.3.0rc3/rc4.

Co-authored-by: Cursor <cursoragent@cursor.com>
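A hedged sketch of the headSizeV change described in this commit. The surrounding control flow and member names (mMLAParams and its fields) are illustrative assumptions, not the actual attentionOp.cpp code:

```cpp
// Sketch only: selecting the value head size for the non-sparse MLA context
// (prefill) path. Names are assumed for illustration.
int headSizeV = mHeadSize;
if (mIsMLAEnabled && !mIsGenerationMLA)
{
    // The old code used qk_nope_head_dim here, which is correct for
    // DeepSeek V3 (v_head_dim == qk_nope_head_dim == 128) but wrong for
    // GLM5 (v_head_dim = 256, qk_nope_head_dim = 192).
    headSizeV = mMLAParams.v_head_dim; // was: mMLAParams.qk_nope_head_dim
}
```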
Markdown guide covering all prerequisites, config changes, C++ fixes, build, flashinfer patch, and serve commands. Co-authored-by: Cursor <cursoragent@cursor.com>
deepseek_v32 model type routes context through absorption mode using existing 576/512 FMHA kernels. No C++ fixes, no PerkzZheng kernel branch, no flashinfer patch required. Only needs: config.json + tokenizer_config.json edits, serve YAML. Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
deepseek_v32 routes context through absorption mode (mqa with mIsGenerationMLA=true), using 576/512 kernels. The non-sparse context path that uses headSizeV = qk_nope_head_dim is never hit. Co-authored-by: Cursor <cursoragent@cursor.com>
📝 Walkthrough

Two new example files are introduced documenting a complete workflow for serving GLM-5 NVFP4 models via TensorRT-LLM with DeepSeek V3.2 compatibility, including model configuration, tokenizer setup, Docker deployment, sparse attention serving configuration, and testing instructions.
Estimated code review effort: 🎯 1 (Trivial) | ⏱️ ~5 minutes

🚥 Pre-merge checks: ✅ 3 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 6
🤖 Fix all issues with AI agents
In `@examples/GLM5_NVFP4_SERVING.md`:
- Around line 57-71: The doc contains an unresolved note in Step 2 saying the
tokenizer_config changes "may not be required"; resolve this before merging by
verifying the tokenizer works with and without the changes and then updating the
doc: if the changes are required, remove the uncertainty note and keep the edits
to tokenizer_config.json (replace "tokenizer_class": "TokenizersBackend" with
"PreTrainedTokenizerFast" and rename "extra_special_tokens" to
"additional_special_tokens"); if they are not required, remove the suggested
edits and the note. Ensure the final text clearly states the required action for
the tokenizer_config.json keys ("tokenizer_class" and
"additional_special_tokens") with no hedging.
- Around line 32-55: The config uses rope_scaling: {"type": "none", "factor":
1.0} which is not a valid PositionEmbeddingType and will raise a KeyError;
update the model config (the JSON object containing "architectures",
"model_type", "rope_theta" and "rope_scaling") by either removing the
rope_scaling field (or setting it to null) or replacing it with a valid type
such as {"type": "linear", "factor": 1.0} (which is equivalent to no scaling);
keep "rope_theta": 1000000 and ensure any code that maps the rope_scaling.type
string to PositionEmbeddingType uses a valid enum member name.
In `@examples/glm5_nvfp4_serving.sh`:
- Around line 43-44: Replace the hardcoded SLURM partition string used in the
salloc command ("salloc -N 1 --gres=gpu:8 --time=04:00:00
--partition=b200@500/None@cr+mp/8gpu-224cpu-2048gb") and the specific host in
the ssh command ("ssh umb-b200-236") with clear placeholders (e.g.,
--partition=<YOUR_PARTITION> and ssh <HOSTNAME>) and add a brief comment above
them explaining to fill in cluster-specific values or refer to the README;
update the example lines so they mirror the README placeholders and avoid
leaking internal cluster identifiers.
- Around line 47-49: The docker run line in examples/glm5_nvfp4_serving.sh
currently mounts a hardcoded personal path (/home/scratch.asteiner); change that
mount to a generic placeholder (e.g., /path/to/scratch or an environment
variable like ${SCRATCH_DIR}) so the script doesn't leak internal directories.
Update the volume flag in the docker run command (the "-v
/home/scratch.asteiner:/workspace/scratch" segment) to use the placeholder and,
if desired, add a short comment or README pointer that users should set
SCRATCH_DIR before running.
- Around line 78-80: The curl example currently hardcodes the Docker bridge IP
"172.17.0.2" which is fragile; update the curl invocation in
examples/glm5_nvfp4_serving.sh to use a variable or command substitution instead
of a fixed IP (e.g., replace the literal IP in the curl command with a
placeholder like SERVER_IP or use the docker inspect command to programmatically
obtain the container IP), and include a short inline comment showing the docker
inspect command to discover the container IP so users can set SERVER_IP before
running the curl line.
- Line 9: The script references the container tag string
"nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc3" which is not documented; update
every occurrence of that exact string (found in the top comment and later docker
image pulls/variables) to "nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc2" so all
container tag references match the official NVIDIA NGC documentation.
> ## Step 1 — Prepare the model config
>
> Edit `GLM-5-nvfp4-v1/config.json` (back up original first):
>
> ```bash
> cd /path/to/GLM-5-nvfp4-v1
> cp config.json config.json.orig
> ```
>
> Make these changes:
>
> ```jsonc
> {
>   // Change these two:
>   "architectures": ["DeepseekV32ForCausalLM"],  // was GlmMoeDsaForCausalLM
>   "model_type": "deepseek_v32",                 // was glm_moe_dsa
>
>   // Add these (GLM5 uses rope_theta=1M, no yarn scaling):
>   "rope_theta": 1000000,
>   "rope_scaling": {"type": "none", "factor": 1.0},
>
>   // Everything else stays the same
> }
> ```
🧩 Analysis chain

(Collapsed verification scripts: the repository was searched for `rope_scaling` handling, the `RotaryScalingType` and `PositionEmbeddingType` enums and their `from_string` methods, and the `deepseek_v32` / `DeepseekV3Config` model code; no mapping for a `"none"` scaling type was found. Script outputs omitted.)
Remove `rope_scaling: {"type": "none", "factor": 1.0}` — it will cause a KeyError.

The `rope_scaling` type field must map to a `PositionEmbeddingType` enum member. Valid types are: `rope_gptj`, `rope_gpt_neox`, `long_rope`, `yarn`, `mrope`, or others. The string `"none"` is not a recognized type and will fail at runtime when the model attempts to look it up.

For GLM-5 with no scaling, either:

- Omit `rope_scaling` entirely (set to `null`), or
- Use `"rope_scaling": {"type": "linear", "factor": 1.0}` (linear with factor 1.0 is equivalent to no scaling)
> ## Step 2 — Fix the tokenizer config
>
> > **Note:** These changes may not be required — needs confirmation. Try without them first.
>
> Edit `GLM-5-nvfp4-v1/tokenizer_config.json`:
>
> ```jsonc
> {
>   // Change:
>   "tokenizer_class": "PreTrainedTokenizerFast",  // was TokenizersBackend
>
>   // Rename key:
>   "additional_special_tokens": [...]  // was extra_special_tokens
> }
> ```
Uncertainty in Step 2 should be resolved before merging.
Line 59 states "These changes may not be required — needs confirmation." Shipping documentation with unconfirmed steps may confuse users. Consider either:
- Testing without these changes and updating accordingly, or
- Removing the note if they are indeed required.
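One quick way to confirm is to load the tokenizer with and without the edits; a minimal sketch, assuming the model directory path is a placeholder:

```python
from transformers import AutoTokenizer

# Run once against the original tokenizer_config.json and once against the
# edited one; if both load and round-trip text identically, the edits are optional.
tok = AutoTokenizer.from_pretrained("/path/to/GLM-5-nvfp4-v1", trust_remote_code=True)
print(type(tok).__name__)

text = "The capital of France is"
ids = tok.encode(text)
print(ids, tok.decode(ids))
```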
> ```bash
> # VERIFIED WORKING: 2026-02-14 on 8xB200 (SM100), TRT-LLM 1.3.0rc3/rc4
> #
> # DIRECTIONS: To serve GLM5 NVFP4 in TRT-LLM, you need the following
> # changes on top of the base container (nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc3):
> ```
🧩 Analysis chain

🌐 Web query: nvcr.io nvidia tensorrt-llm release 1.3.0rc3 NGC container

💡 Result: NVIDIA's official TensorRT-LLM Quick Start currently shows the NGC "release" container tag as:

```bash
docker run --rm -it --ipc host --gpus all \
  --ulimit memlock=-1 --ulimit stack=67108864 -p 8000:8000 \
  nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc2
```

No NVIDIA-published documentation or release page references an NGC container tag 1.3.0rc3 for nvcr.io/nvidia/tensorrt-llm/release; only 1.3.0rc2 is explicitly documented. To verify whether 1.3.0rc3 exists on NGC anyway, try:

```bash
docker login nvcr.io
docker pull nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc3
```

Source: TensorRT-LLM Quick Start Guide (container launch command shows `:1.3.0rc2`), nvidia.github.io

(Collapsed verification scripts: located `examples/glm5_nvfp4_serving.sh` and grepped it for `1.3.0rc3`; outputs omitted.)
Update references from 1.3.0rc3 to 1.3.0rc2 to align with official NVIDIA NGC documentation.
Official NVIDIA TensorRT-LLM documentation currently documents only the 1.3.0rc2 container tag. The script references 1.3.0rc3, which is not reflected in published NGC resources. To ensure users pull the correct image, update all container tag references to the officially documented 1.3.0rc2.
Applies to lines 9, 27, 49, 86.
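A one-liner sketch for applying this, assuming every occurrence of the full image tag string in the script should be rewritten:

```bash
# Rewrite all container image tag references in the example script in place.
sed -i 's#nvcr.io/nvidia/tensorrt-llm/release:1\.3\.0rc3#nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc2#g' \
  examples/glm5_nvfp4_serving.sh
```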
> ```bash
> salloc -N 1 --gres=gpu:8 --time=04:00:00 --partition=b200@500/None@cr+mp/8gpu-224cpu-2048gb
> ssh umb-b200-236
> ```
Hardcoded site-specific SLURM partition and hostname.
`b200@500/None@cr+mp/8gpu-224cpu-2048gb` and `umb-b200-236` are internal to your cluster. The README version (lines 77–80) correctly uses placeholders. Consider replacing these with placeholders and a comment, e.g.:
Suggested diff:

```diff
-salloc -N 1 --gres=gpu:8 --time=04:00:00 --partition=b200@500/None@cr+mp/8gpu-224cpu-2048gb
-ssh umb-b200-236
+salloc -N 1 --gres=gpu:8 --time=04:00:00 --partition=<your-partition>
+ssh <node-name>
```
> ```bash
> docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
>   -v /home/scratch.asteiner:/workspace/scratch \
>   -it nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc3 bash
> ```
Hardcoded personal mount path leaks internal directory structure.
/home/scratch.asteiner is a personal/internal path. Replace with a generic placeholder consistent with the README.
Suggested diff:

```diff
 docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
-  -v /home/scratch.asteiner:/workspace/scratch \
+  -v /path/to/your/models:/workspace/scratch \
   -it nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc3 bash
```
> ```bash
> curl http://172.17.0.2:8000/v1/completions \
>   -H "Content-Type: application/json" \
>   -d '{"model":"GLM-5-nvfp4-v1","prompt":"The capital of France is","max_tokens":50}'
> ```
Hardcoded Docker bridge IP 172.17.0.2 is fragile.
This IP is not guaranteed. The README's approach of using docker inspect to discover the IP is better. Replace with a placeholder or show the inspect command here too.
Suggested diff:

```diff
 # --- Test (from host node, not inside container) ---
-curl http://172.17.0.2:8000/v1/completions \
+# Find container IP: docker inspect -f '{{range.NetworkSettings.Networks}}{{.IPAddress}}{{end}}' <container-id>
+curl http://<container-ip>:8000/v1/completions \
   -H "Content-Type: application/json" \
   -d '{"model":"GLM-5-nvfp4-v1","prompt":"The capital of France is","max_tokens":50}'
```
[None][doc] Add GLM-5 NVFP4 serving guide using DeepSeek V3.2 model mapping
Description
GLM-5 (MLA architecture, `v_head_dim=256`, `qk_nope_head_dim=192`) can be served in TRT-LLM by remapping to `deepseek_v32` (`DeepseekV32ForCausalLM`).

Key insight: `DeepseekV32Attention` has a built-in DSA indexer that routes context attention through absorption mode (576/512 FMHA kernels). This avoids the 256/256 separate-QKV context path in `DeepseekV3Attention`, which falls back to unfused MHA that does not support MLA (produces garbage output).

No source build or C++ changes required — works on the stock `1.3.0rc3` container with only:

- `config.json`: remap `model_type` → `deepseek_v32`, `architectures` → `DeepseekV32ForCausalLM`, add `rope_theta` and `rope_scaling`
- `tokenizer_config.json`: fix tokenizer class and special-tokens key name (needs reconfirmation)
- `sparse_attention_config` with DSA algorithm and GLM-5 indexer params (a hedged serve sketch follows this list)

Files added:

- `examples/GLM5_NVFP4_SERVING.md` — step-by-step README
- `examples/glm5_nvfp4_serving.sh` — copy-paste script version

Verified working on 8×B200 (SM100), TRT-LLM 1.3.0rc3, producing coherent output.
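For orientation, a minimal serve sketch is below. The `trtllm-serve` flags are standard, but the exact `sparse_attention_config` keys for the GLM-5 indexer are not reproduced here and the YAML field names should be treated as assumptions; take the real values from `examples/GLM5_NVFP4_SERVING.md`.

```bash
# Hedged sketch: extra LLM API options plus the serve command.
# The DSA indexer parameters are documented in the README in this PR; the keys
# shown here are placeholders, not a verified schema.
cat > glm5_serve.yml <<'EOF'
kv_cache_config:
  dtype: fp8
sparse_attention_config:
  algorithm: DSA
  # GLM-5 indexer parameters go here (see examples/GLM5_NVFP4_SERVING.md)
EOF

trtllm-serve /path/to/GLM-5-nvfp4-v1 \
  --backend pytorch \
  --tp_size 8 \
  --extra_llm_api_options glm5_serve.yml
```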
Test Coverage

- Requests via the OpenAI-compatible API; verified coherent English output.
PR Checklist
Summary by CodeRabbit
New Features
Documentation