
Glm5 nvfp4 serving #11522

Open
pst2154 wants to merge 6 commits into NVIDIA:main from pst2154:glm5-nvfp4-serving

Conversation


@pst2154 pst2154 commented Feb 14, 2026

[None][doc] Add GLM-5 NVFP4 serving guide using DeepSeek V3.2 model mapping

Description

GLM-5 (MLA architecture, v_head_dim=256, qk_nope_head_dim=192) can be served
in TRT-LLM by remapping to deepseek_v32 (DeepseekV32ForCausalLM).

Key insight: DeepseekV32Attention has a built-in DSA indexer that routes context
attention through absorption mode (576/512 FMHA kernels). This avoids the 256/256
separate QKV context path in DeepseekV3Attention, which falls back to unfused MHA
that does not support MLA (produces garbage output).

No source build or C++ changes required — works on the stock 1.3.0rc3 container
with only:

  • config.json: remap model_type → deepseek_v32 and architectures → DeepseekV32ForCausalLM; add rope_theta and rope_scaling
  • tokenizer_config.json: fix the tokenizer class and the special-tokens key name (needs reconfirmation)
  • Serve YAML: sparse_attention_config with the DSA algorithm and GLM-5 indexer params (sketched below)
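For orientation, here is a minimal sketch of the serve-side pieces the bullets above describe. The kv_cache values mirror those noted in the review walkthrough (fp8, 0.90 memory fraction); the sparse_attention_config field layout and the GLM-5 indexer parameters are assumptions, and the checkpoint path and TP size follow the verified 8×B200 setup. The README added by this PR remains the authoritative reference.

```bash
# Sketch only: the sparse_attention_config field names below are assumptions;
# see examples/GLM5_NVFP4_SERVING.md from this PR for the exact values.
cat > /workspace/glm5_serve.yaml <<'EOF'
kv_cache_config:
  dtype: fp8
  free_gpu_memory_fraction: 0.90
sparse_attention_config:
  algorithm: DSA
  # GLM-5 indexer parameters go here (documented in the added README)
EOF

trtllm-serve /workspace/scratch/GLM-5-nvfp4-v1 \
  --tp_size 8 \
  --extra_llm_api_options /workspace/glm5_serve.yaml
```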

Files added:

  • examples/GLM5_NVFP4_SERVING.md — step-by-step README
  • examples/glm5_nvfp4_serving.sh — copy-paste script version

Verified working on 8×B200 (SM100), TRT-LLM 1.3.0rc3, producing coherent output.

Test Coverage

  • Manual end-to-end test: served GLM-5 NVFP4 on 8×B200 with TP=8, sent completion
    requests via the OpenAI-compatible API (example below), and verified coherent English output.
  • No automated tests added (documentation-only PR).
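For reference, the manual check amounts to a single completion request of this shape (a sketch; it assumes the server is reachable at localhost:8000 and the model is registered under the checkpoint directory name GLM-5-nvfp4-v1):

```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "GLM-5-nvfp4-v1", "prompt": "The capital of France is", "max_tokens": 50}'
```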

PR Checklist

  • Please check this after reviewing the above items as appropriate for this PR.

Summary by CodeRabbit

  • New Features

    • Added sparse attention optimization configuration enabling more efficient GPU memory utilization and enhanced serving throughput for large language models.
  • Documentation

    • Introduced comprehensive end-to-end setup guide and automated deployment script for GLM-5 NVFP4 serving via TensorRT-LLM, including model and tokenizer configuration updates, Docker environment preparation, and complete testing workflow.

Alex Steiner and others added 6 commits February 13, 2026 17:28
- Fix headSizeV in attentionOp.cpp: use v_head_dim instead of
  qk_nope_head_dim for non-sparse MLA context path. GLM5 has
  v_head_dim=256 vs qk_nope_head_dim=192; the old assumption
  (v_head_dim == qk_nope_head_dim) only held for DeepSeek V3.

- Relax FMHA assertion for context MLA to a warning instead of
  a fatal error, allowing graceful fallback when kernels are
  unavailable for a given head size combination.

- Add examples/glm5_nvfp4_serving.sh: step-by-step guide to
  serve GLM5 NVFP4 on 8xB200 using DeepseekV32ForCausalLM
  (model_type=deepseek_v32) with DSA indexer, FP8 KV cache,
  and PerkzZheng's FMHA kernel branch.

Verified working on 8xB200 (SM100), TRT-LLM 1.3.0rc3/rc4.

Co-authored-by: Cursor <cursoragent@cursor.com>

Markdown guide covering all prerequisites, config changes,
C++ fixes, build, flashinfer patch, and serve commands.

Co-authored-by: Cursor <cursoragent@cursor.com>

deepseek_v32 model type routes context through absorption mode
using existing 576/512 FMHA kernels. No C++ fixes, no PerkzZheng
kernel branch, no flashinfer patch required.

Only needs: config.json + tokenizer_config.json edits, serve YAML.

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>

deepseek_v32 routes context through absorption mode (mqa with
mIsGenerationMLA=true), using 576/512 kernels. The non-sparse
context path that uses headSizeV = qk_nope_head_dim is never hit.

Co-authored-by: Cursor <cursoragent@cursor.com>
@pst2154 pst2154 requested a review from a team as a code owner February 14, 2026 03:20
@pst2154 pst2154 requested review from chang-l and kaiyux February 14, 2026 03:20

coderabbitai bot commented Feb 14, 2026

📝 Walkthrough


Two new example files are introduced documenting a complete workflow for serving GLM-5 NVFP4 models via TensorRT-LLM with DeepSeek V3.2 compatibility, including model configuration, tokenizer setup, Docker deployment, sparse attention serving configuration, and testing instructions.

Changes

  • GLM5 NVFP4 Serving Documentation (examples/GLM5_NVFP4_SERVING.md): Comprehensive guide documenting prerequisite setup, model config (architecture: DeepseekV32ForCausalLM, rope_theta: 1000000, sparse attention via DSA indexer), tokenizer config (PreTrainedTokenizerFast), serve configuration with sparse_attention_config and kv_cache_config (fp8, 0.90 memory), and the end-to-end workflow with expected output.
  • GLM5 NVFP4 Serving Script (examples/glm5_nvfp4_serving.sh): Automation script providing step-by-step environment setup, Docker container execution with GPU access, in-container YAML configuration for sparse attention and kv_cache, trtllm-serve invocation with tensor parallelism and token limits, and curl test validation.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~5 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 1
❌ Failed checks (1 warning)

Merge Conflict Detection (⚠️ Warning): Merge conflicts detected (34 files):

⚔️ README.md (content)
⚔️ cpp/tensorrt_llm/executor/serialization.cpp (content)
⚔️ cpp/tensorrt_llm/kernels/helixKernels.cu (content)
⚔️ cpp/tensorrt_llm/kernels/helixKernels.h (content)
⚔️ cpp/tensorrt_llm/thop/helixPostProcessOp.cpp (content)
⚔️ requirements.txt (content)
⚔️ tensorrt_llm/_torch/auto_deploy/export/export.py (content)
⚔️ tensorrt_llm/_torch/auto_deploy/transform/library/export_to_gm.py (content)
⚔️ tensorrt_llm/_torch/distributed/__init__.py (content)
⚔️ tensorrt_llm/_torch/distributed/ops.py (content)
⚔️ tensorrt_llm/_torch/models/checkpoints/hf/nemotron_h_weight_mapper.py (content)
⚔️ tensorrt_llm/_torch/modules/attention.py (content)
⚔️ tensorrt_llm/_torch/modules/fused_moe/quantization.py (content)
⚔️ tensorrt_llm/_torch/modules/linear.py (content)
⚔️ tensorrt_llm/commands/serve.py (content)
⚔️ tensorrt_llm/executor/ipc.py (content)
⚔️ tensorrt_llm/inputs/data.py (content)
⚔️ tensorrt_llm/inputs/utils.py (content)
⚔️ tensorrt_llm/llmapi/__init__.py (content)
⚔️ tensorrt_llm/llmapi/disagg_utils.py (content)
⚔️ tensorrt_llm/llmapi/utils.py (content)
⚔️ tensorrt_llm/ray_stub.py (content)
⚔️ tensorrt_llm/serve/chat_utils.py (content)
⚔️ tensorrt_llm/serve/openai_protocol.py (content)
⚔️ tensorrt_llm/serve/openai_server.py (content)
⚔️ tests/integration/defs/accuracy/test_disaggregated_serving.py (content)
⚔️ tests/integration/test_lists/qa/llm_function_core.txt (content)
⚔️ tests/integration/test_lists/test-db/l0_b200.yml (content)
⚔️ tests/integration/test_lists/test-db/l0_dgx_b200.yml (content)
⚔️ tests/integration/test_lists/waives.txt (content)
⚔️ tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/test_export.py (content)
⚔️ tests/unittest/_torch/modules/test_mla_helix.py (content)
⚔️ tests/unittest/_torch/thop/parallel/test_helix_postprocess.py (content)
⚔️ tests/unittest/llmapi/apps/test_chat_utils.py (content)

These conflicts must be resolved before merging into main.
Resolve conflicts locally and push changes to this branch.
✅ Passed checks (3 passed)

  • Title check ✅ Passed: The title 'Glm5 nvfp4 serving' is directly related to the main change of adding documentation and scripts for serving GLM-5 NVFP4 models.
  • Description check ✅ Passed: The description follows the template structure, includes the required ticket format [None][doc], clearly explains the GLM-5 mapping approach, lists the added files, and documents the manual testing performed.
  • Docstring Coverage ✅ Passed: No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

🤖 Fix all issues with AI agents
In `@examples/GLM5_NVFP4_SERVING.md`:
- Around line 57-71: The doc contains an unresolved note in Step 2 saying the
tokenizer_config changes "may not be required"; resolve this before merging by
verifying the tokenizer works with and without the changes and then updating the
doc: if the changes are required, remove the uncertainty note and keep the edits
to tokenizer_config.json (replace "tokenizer_class": "TokenizersBackend" with
"PreTrainedTokenizerFast" and rename "extra_special_tokens" to
"additional_special_tokens"); if they are not required, remove the suggested
edits and the note. Ensure the final text clearly states the required action for
the tokenizer_config.json keys ("tokenizer_class" and
"additional_special_tokens") with no hedging.
- Around line 32-55: The config uses rope_scaling: {"type": "none", "factor":
1.0} which is not a valid PositionEmbeddingType and will raise a KeyError;
update the model config (the JSON object containing "architectures",
"model_type", "rope_theta" and "rope_scaling") by either removing the
rope_scaling field (or setting it to null) or replacing it with a valid type
such as {"type": "linear", "factor": 1.0} (which is equivalent to no scaling);
keep "rope_theta": 1000000 and ensure any code that maps the rope_scaling.type
string to PositionEmbeddingType uses a valid enum member name.

In `@examples/glm5_nvfp4_serving.sh`:
- Around line 43-44: Replace the hardcoded SLURM partition string used in the
salloc command ("salloc -N 1 --gres=gpu:8 --time=04:00:00
--partition=b200@500/None@cr+mp/8gpu-224cpu-2048gb") and the specific host in
the ssh command ("ssh umb-b200-236") with clear placeholders (e.g.,
--partition=<YOUR_PARTITION> and ssh <HOSTNAME>) and add a brief comment above
them explaining to fill in cluster-specific values or refer to the README;
update the example lines so they mirror the README placeholders and avoid
leaking internal cluster identifiers.
- Around line 47-49: The docker run line in examples/glm5_nvfp4_serving.sh
currently mounts a hardcoded personal path (/home/scratch.asteiner); change that
mount to a generic placeholder (e.g., /path/to/scratch or an environment
variable like ${SCRATCH_DIR}) so the script doesn't leak internal directories.
Update the volume flag in the docker run command (the "-v
/home/scratch.asteiner:/workspace/scratch" segment) to use the placeholder and,
if desired, add a short comment or README pointer that users should set
SCRATCH_DIR before running.
- Around line 78-80: The curl example currently hardcodes the Docker bridge IP
"172.17.0.2" which is fragile; update the curl invocation in
examples/glm5_nvfp4_serving.sh to use a variable or command substitution instead
of a fixed IP (e.g., replace the literal IP in the curl command with a
placeholder like SERVER_IP or use the docker inspect command to programmatically
obtain the container IP), and include a short inline comment showing the docker
inspect command to discover the container IP so users can set SERVER_IP before
running the curl line.
- Line 9: The script references the container tag string
"nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc3" which is not documented; update
every occurrence of that exact string (found in the top comment and later docker
image pulls/variables) to "nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc2" so all
container tag references match the official NVIDIA NGC documentation.

Comment on lines +32 to +55
## Step 1 — Prepare the model config

Edit `GLM-5-nvfp4-v1/config.json` (back up original first):

```bash
cd /path/to/GLM-5-nvfp4-v1
cp config.json config.json.orig
```

Make these changes:

```jsonc
{
  // Change these two:
  "architectures": ["DeepseekV32ForCausalLM"],   // was GlmMoeDsaForCausalLM
  "model_type": "deepseek_v32",                  // was glm_moe_dsa

  // Add these (GLM5 uses rope_theta=1M, no yarn scaling):
  "rope_theta": 1000000,
  "rope_scaling": {"type": "none", "factor": 1.0},

  // Everything else stays the same
}
```

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, let's understand the repo structure
find . -type f -name "*.md" | grep -i "glm\|deepseek" | head -10

Repository: NVIDIA/TensorRT-LLM

Length of output: 787


🏁 Script executed:

# Search for rope_scaling handling in the codebase
rg -n "rope_scaling" --type py -C 2 | head -50

Repository: NVIDIA/TensorRT-LLM

Length of output: 5196


🏁 Script executed:

# Search for deepseek_v32 model references
rg -n "deepseek_v32\|DeepseekV32" --type py -C 2 | head -50

Repository: NVIDIA/TensorRT-LLM

Length of output: 45


🏁 Script executed:

# Search for deepseek model files
find . -type f \( -name "*.py" -o -name "*.json" \) | xargs grep -l "deepseek_v32\|DeepseekV32" | head -20

Repository: NVIDIA/TensorRT-LLM

Length of output: 828


🏁 Script executed:

# Search for how rope_scaling is validated/handled in model configs
rg -n "rope_scaling.*type" --type py -B 2 -A 2 | head -60

Repository: NVIDIA/TensorRT-LLM

Length of output: 5462


🏁 Script executed:

# Look for GLM5 specific model handling
rg -n "glm.*rope\|rope.*glm" --type py -i -C 2 | head -50

Repository: NVIDIA/TensorRT-LLM

Length of output: 45


🏁 Script executed:

# Search for rope_scaling type validation or mapping
rg -n "RotaryScalingType\|rope.*type.*none" --type py -C 3 | head -60

Repository: NVIDIA/TensorRT-LLM

Length of output: 45


🏁 Script executed:

# Check the deepseek model config
cat tensorrt_llm/models/deepseek_v2/config.py | head -120

Repository: NVIDIA/TensorRT-LLM

Length of output: 4977


🏁 Script executed:

# Look for how rope_scaling with "none" type would be handled
rg -n "rope_scaling\[.*type.*\]" --type py -C 5 | head -80

Repository: NVIDIA/TensorRT-LLM

Length of output: 8034


🏁 Script executed:

# Search for RotaryScalingType enum definition
rg -n "class RotaryScalingType\|RotaryScalingType.*=" --type py -C 5 | head -50

Repository: NVIDIA/TensorRT-LLM

Length of output: 45


🏁 Script executed:

# Check if there's a way to disable rope_scaling (set to None)
rg -n "rope_scaling.*=.*None\|if.*rope_scaling.*is.*None" --type py -C 2 | head -60

Repository: NVIDIA/TensorRT-LLM

Length of output: 45


🏁 Script executed:

# Find RotaryScalingType enum definition
rg -n "class RotaryScalingType" --type py -A 10

Repository: NVIDIA/TensorRT-LLM

Length of output: 583


🏁 Script executed:

# Search for where "none" rope type is defined
rg -n '"none".*rope\|rope.*"none"' --type py -C 2 | head -80

Repository: NVIDIA/TensorRT-LLM

Length of output: 45


🏁 Script executed:

# Check the GLM5_NVFP4_SERVING.md file to see the full context
cat examples/GLM5_NVFP4_SERVING.md | head -80

Repository: NVIDIA/TensorRT-LLM

Length of output: 2471


🏁 Script executed:

# Look for deepseek_v32 model implementation
find . -path ./tests -prune -o -type f -name "*.py" -print | xargs grep -l "class.*DeepSeekV32\|deepseek_v32" | grep -v test | head -10

Repository: NVIDIA/TensorRT-LLM

Length of output: 492


🏁 Script executed:

# Check the full RotaryScalingType enum and its from_string method
rg -n "class RotaryScalingType" --type py -A 20

Repository: NVIDIA/TensorRT-LLM

Length of output: 1146


🏁 Script executed:

# Look for how rope_scaling is validated when it's passed through
rg -n "rope_scaling\[.*type" --type py -B 3 -A 3 | head -100

Repository: NVIDIA/TensorRT-LLM

Length of output: 9530


🏁 Script executed:

# Check if deepseek_v32 model has its own config handling
find . -name "*deepseek*v32*" -o -name "*deepseekv32*" | grep -v __pycache__

Repository: NVIDIA/TensorRT-LLM

Length of output: 2366


🏁 Script executed:

# Look for how rope_scaling "none" is actually used in model conversion or loading
rg -n "rope_scaling.*none\|type.*none" --type py -i -B 2 -A 2 | head -80

Repository: NVIDIA/TensorRT-LLM

Length of output: 45


🏁 Script executed:

# Look for deepseek_v32 model files in the models directory
find ./tensorrt_llm/models -name "*deepseek*" -type d

Repository: NVIDIA/TensorRT-LLM

Length of output: 130


🏁 Script executed:

# Check if there's a deepseek_v32 config or model file
find ./tensorrt_llm/models -name "*deepseek*" -type f | head -20

Repository: NVIDIA/TensorRT-LLM

Length of output: 45


🏁 Script executed:

# Search for how rope_scaling is handled when type is "none"
rg -n "PositionEmbeddingType.from_string" --type py -B 2 -A 2 | head -80

Repository: NVIDIA/TensorRT-LLM

Length of output: 3774


🏁 Script executed:

# Check how deepseek models handle rope_scaling configuration
rg -n "rope_scaling\|rotary_scaling" tensorrt_llm/models/deepseek* --type py -C 2 | head -100

Repository: NVIDIA/TensorRT-LLM

Length of output: 45


🏁 Script executed:

# Find PositionEmbeddingType enum and its from_string method
rg -n "class PositionEmbeddingType" --type py -A 30

Repository: NVIDIA/TensorRT-LLM

Length of output: 1869


🏁 Script executed:

# Check if there are any config files for deepseek_v32 that show rope_scaling examples
find . -name "config.json" -o -name "*deepseek*config*" | xargs grep -l "rope_scaling" 2>/dev/null | head -10

Repository: NVIDIA/TensorRT-LLM

Length of output: 45


🏁 Script executed:

# Look for actual deepseek_v32 model implementation
find ./tensorrt_llm -name "*deepseek_v32*" -o -name "*deepseekv32*" | grep -v __pycache__

Repository: NVIDIA/TensorRT-LLM

Length of output: 155


🏁 Script executed:

# Check if there's a deepseek_v32 model in the models directory or if it uses deepseek_v2/v1
ls -la ./tensorrt_llm/models/ | grep deepseek

Repository: NVIDIA/TensorRT-LLM

Length of output: 188


🏁 Script executed:

# Search for deepseek_v32 model registration
rg -n "deepseek_v32\|DeepseekV32" --type py -C 3 | head -100

Repository: NVIDIA/TensorRT-LLM

Length of output: 45


🏁 Script executed:

# Look for how rope_scaling type is converted for models
rg -n "from_string.*rope_scaling\|rope_scaling.*from_string" --type py -B 3 -A 3 | head -80

Repository: NVIDIA/TensorRT-LLM

Length of output: 45


🏁 Script executed:

# Check how model types are mapped
rg -n "model_type.*deepseek" --type py -C 2 | head -50

Repository: NVIDIA/TensorRT-LLM

Length of output: 3908


🏁 Script executed:

# Look for deepseek model in the __init__ or registry
rg -n "deepseek_v32" --type py | head -30

Repository: NVIDIA/TensorRT-LLM

Length of output: 2561


🏁 Script executed:

# Look at the modeling_deepseekv3.py file to see how it handles rope_scaling
sed -n '1,150p' tensorrt_llm/_torch/models/modeling_deepseekv3.py

Repository: NVIDIA/TensorRT-LLM

Length of output: 6252


🏁 Script executed:

# Search for rope_scaling handling in deepseekv3
rg -n "rope_scaling\|rotary" tensorrt_llm/_torch/models/modeling_deepseekv3.py -C 2 | head -100

Repository: NVIDIA/TensorRT-LLM

Length of output: 45


🏁 Script executed:

# Check how the config is loaded from HuggingFace for deepseek_v32
rg -n "from_hugging_face\|from_pretrained" tensorrt_llm/_torch/pyexecutor/config_utils.py -B 2 -A 2

Repository: NVIDIA/TensorRT-LLM

Length of output: 45


🏁 Script executed:

# Search more specifically for rope_scaling and rope in deepseekv3
rg -n "rope_scaling" tensorrt_llm/_torch/models/modeling_deepseekv3.py

Repository: NVIDIA/TensorRT-LLM

Length of output: 45


🏁 Script executed:

# Check how PositionEmbeddingType.from_string handles different types
rg -n "PositionEmbeddingType.from_string" --type py -B 5 -A 5

Repository: NVIDIA/TensorRT-LLM

Length of output: 7511


🏁 Script executed:

# Look for the actual implementation of from_string method
sed -n '689,750p' tensorrt_llm/functional.py

Repository: NVIDIA/TensorRT-LLM

Length of output: 1553


🏁 Script executed:

# Search for how rope_scaling type "none" is actually used
rg -n "rope_scaling\[.*type.*\]\|\.from_string\(config.rope_scaling" --type py -C 5 | head -150

Repository: NVIDIA/TensorRT-LLM

Length of output: 45


🏁 Script executed:

# Check the pyexecutor config for deepseek_v32
cat tensorrt_llm/_torch/pyexecutor/config_utils.py

Repository: NVIDIA/TensorRT-LLM

Length of output: 2298


🏁 Script executed:

# Look for how rope_scaling is loaded/processed from config
rg -n "rope_scaling\|rotary_scaling" tensorrt_llm/_torch/pyexecutor/config_utils.py -C 3

Repository: NVIDIA/TensorRT-LLM

Length of output: 45


🏁 Script executed:

# Check if deepseek_v32 uses a different config class
rg -n "DeepseekV3Config" tensorrt_llm/_torch/models/checkpoints/ -C 2 | head -50

Repository: NVIDIA/TensorRT-LLM

Length of output: 45


🏁 Script executed:

# Search for actual HF config loading for deepseek models
find tensorrt_llm -name "*config*.py" -path "*/deepseek*" -o -name "*deepseek*config*.py" | head -10

Repository: NVIDIA/TensorRT-LLM

Length of output: 146


🏁 Script executed:

# Find where DeepseekV3Config is defined
find . -name "*.py" | xargs grep -l "class DeepseekV3Config" | head -5

Repository: NVIDIA/TensorRT-LLM

Length of output: 107


🏁 Script executed:

# Look for the config in tensorrt_llm._torch.configs
find tensorrt_llm/_torch/configs -name "*.py" | head -10

Repository: NVIDIA/TensorRT-LLM

Length of output: 145


🏁 Script executed:

# Check if there's a DeepseekV3Config in tensorrt_llm._torch.configs
rg -n "class DeepseekV3Config" --type py -A 30 | head -100

Repository: NVIDIA/TensorRT-LLM

Length of output: 2342


🏁 Script executed:

# Get the full DeepseekV3Config class
cat tensorrt_llm/_torch/configs/deepseek_v3.py | head -200

Repository: NVIDIA/TensorRT-LLM

Length of output: 3663


🏁 Script executed:

# Check if there's rope_scaling handling in the config init or elsewhere
rg -n "rope_scaling\|rope_theta" tensorrt_llm/_torch/configs/deepseek_v3.py -C 3

Repository: NVIDIA/TensorRT-LLM

Length of output: 45


Remove rope_scaling: {"type": "none", "factor": 1.0} — it will cause a KeyError.

The rope_scaling type field must map to a PositionEmbeddingType enum member. Valid types are: rope_gptj, rope_gpt_neox, long_rope, yarn, mrope, or others. The string "none" is not a recognized type and will fail at runtime when the model attempts to look it up.

For GLM-5 with no scaling, either:

  • Omit rope_scaling entirely (set to null), or
  • Use "rope_scaling": {"type": "linear", "factor": 1.0} (linear with factor 1.0 is equivalent to no scaling)
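Applied concretely, the first option (drop rope_scaling) combined with the Step 1 remap might look like the following sketch, which assumes jq is available on the host:

```bash
cd /path/to/GLM-5-nvfp4-v1
cp config.json config.json.orig
# Remap to deepseek_v32 and drop rope_scaling entirely (the first option above).
jq '.architectures = ["DeepseekV32ForCausalLM"]
    | .model_type = "deepseek_v32"
    | .rope_theta = 1000000
    | del(.rope_scaling)' config.json.orig > config.json
```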
🤖 Prompt for AI Agents
In `@examples/GLM5_NVFP4_SERVING.md` around lines 32 - 55, The config uses
rope_scaling: {"type": "none", "factor": 1.0} which is not a valid
PositionEmbeddingType and will raise a KeyError; update the model config (the
JSON object containing "architectures", "model_type", "rope_theta" and
"rope_scaling") by either removing the rope_scaling field (or setting it to
null) or replacing it with a valid type such as {"type": "linear", "factor":
1.0} (which is equivalent to no scaling); keep "rope_theta": 1000000 and ensure
any code that maps the rope_scaling.type string to PositionEmbeddingType uses a
valid enum member name.

Comment on lines +57 to +71
## Step 2 — Fix the tokenizer config

> **Note:** These changes may not be required — needs confirmation. Try without them first.

Edit `GLM-5-nvfp4-v1/tokenizer_config.json`:

```jsonc
{
  // Change:
  "tokenizer_class": "PreTrainedTokenizerFast",  // was TokenizersBackend

  // Rename key:
  "additional_special_tokens": [...]  // was extra_special_tokens
}
```

⚠️ Potential issue | 🟡 Minor

Uncertainty in Step 2 should be resolved before merging.

Line 59 states "These changes may not be required — needs confirmation." Shipping documentation with unconfirmed steps may confuse users. Consider either:

  1. Testing without these changes and updating accordingly, or
  2. Removing the note if they are indeed required.
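If testing shows the edits are in fact required, the quoted Step 2 change could be scripted along these lines (a sketch, assuming jq is available and that extra_special_tokens is present in the original file):

```bash
cd /path/to/GLM-5-nvfp4-v1
cp tokenizer_config.json tokenizer_config.json.orig
# Swap the tokenizer class and rename extra_special_tokens to additional_special_tokens.
jq '.tokenizer_class = "PreTrainedTokenizerFast"
    | .additional_special_tokens = .extra_special_tokens
    | del(.extra_special_tokens)' tokenizer_config.json.orig > tokenizer_config.json
```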
🤖 Prompt for AI Agents
In `@examples/GLM5_NVFP4_SERVING.md` around lines 57 - 71, The doc contains an
unresolved note in Step 2 saying the tokenizer_config changes "may not be
required"; resolve this before merging by verifying the tokenizer works with and
without the changes and then updating the doc: if the changes are required,
remove the uncertainty note and keep the edits to tokenizer_config.json (replace
"tokenizer_class": "TokenizersBackend" with "PreTrainedTokenizerFast" and rename
"extra_special_tokens" to "additional_special_tokens"); if they are not
required, remove the suggested edits and the note. Ensure the final text clearly
states the required action for the tokenizer_config.json keys ("tokenizer_class"
and "additional_special_tokens") with no hedging.

# VERIFIED WORKING: 2026-02-14 on 8xB200 (SM100), TRT-LLM 1.3.0rc3/rc4
#
# DIRECTIONS: To serve GLM5 NVFP4 in TRT-LLM, you need the following
# changes on top of the base container (nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc3):

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

nvcr.io nvidia tensorrt-llm release 1.3.0rc3 NGC container

💡 Result:

NVIDIA’s official TensorRT‑LLM Quick Start currently shows the NGC “release” container tag as:

docker run --rm -it --ipc host --gpus all \
  --ulimit memlock=-1 --ulimit stack=67108864 -p 8000:8000 \
  nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc2

[1]

I could not find any NVIDIA-published documentation or release page that references an NGC container tag 1.3.0rc3 for nvcr.io/nvidia/tensorrt-llm/release (only 1.3.0rc2 is explicitly documented) [1].

To verify whether 1.3.0rc3 exists on NGC anyway, try:

docker login nvcr.io
docker pull nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc3

If you want, tell me whether docker pull succeeds or fails (and the exact error), and I’ll help you pick the correct available tag.

Sources:
[1] TensorRT‑LLM Quick Start Guide (container launch command shows :1.3.0rc2) (nvidia.github.io)


🏁 Script executed:

# First, let's find and check the file mentioned in the review
find . -name "glm5_nvfp4_serving.sh" -type f

Repository: NVIDIA/TensorRT-LLM

Length of output: 95


🏁 Script executed:

# Search for all references to 1.3.0rc3 in the repository
rg "1\.3\.0rc3" --type sh -A 2 -B 2

Repository: NVIDIA/TensorRT-LLM

Length of output: 1063


Update references from 1.3.0rc3 to 1.3.0rc2 to align with official NVIDIA NGC documentation.

Official NVIDIA TensorRT-LLM documentation currently documents only the 1.3.0rc2 container tag. The script references 1.3.0rc3, which is not reflected in published NGC resources. To ensure users pull the correct image, update all container tag references to the officially documented 1.3.0rc2.

Applies to lines 9, 27, 49, 86.
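One way to apply this across the script is a single sed pass (GNU sed syntax; review the result before committing):

```bash
# Replace every occurrence of the rc3 tag with the documented rc2 tag.
sed -i 's#nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc3#nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc2#g' \
  examples/glm5_nvfp4_serving.sh
```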

🤖 Prompt for AI Agents
In `@examples/glm5_nvfp4_serving.sh` at line 9, The script references the
container tag string "nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc3" which is not
documented; update every occurrence of that exact string (found in the top
comment and later docker image pulls/variables) to
"nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc2" so all container tag references
match the official NVIDIA NGC documentation.

Comment on lines +43 to +44
salloc -N 1 --gres=gpu:8 --time=04:00:00 --partition=b200@500/None@cr+mp/8gpu-224cpu-2048gb
ssh umb-b200-236

⚠️ Potential issue | 🟡 Minor

Hardcoded site-specific SLURM partition and hostname.

b200@500/None@cr+mp/8gpu-224cpu-2048gb and umb-b200-236 are internal to your cluster. The README version (lines 77–80) correctly uses placeholders. Consider replacing these with placeholders and a comment, e.g.:

Suggested diff
-salloc -N 1 --gres=gpu:8 --time=04:00:00 --partition=b200@500/None@cr+mp/8gpu-224cpu-2048gb
-ssh umb-b200-236
+salloc -N 1 --gres=gpu:8 --time=04:00:00 --partition=<your-partition>
+ssh <node-name>
🤖 Prompt for AI Agents
In `@examples/glm5_nvfp4_serving.sh` around lines 43 - 44, Replace the hardcoded
SLURM partition string used in the salloc command ("salloc -N 1 --gres=gpu:8
--time=04:00:00 --partition=b200@500/None@cr+mp/8gpu-224cpu-2048gb") and the
specific host in the ssh command ("ssh umb-b200-236") with clear placeholders
(e.g., --partition=<YOUR_PARTITION> and ssh <HOSTNAME>) and add a brief comment
above them explaining to fill in cluster-specific values or refer to the README;
update the example lines so they mirror the README placeholders and avoid
leaking internal cluster identifiers.

Comment on lines +47 to +49
docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
    -v /home/scratch.asteiner:/workspace/scratch \
    -it nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc3 bash

⚠️ Potential issue | 🟡 Minor

Hardcoded personal mount path leaks internal directory structure.

/home/scratch.asteiner is a personal/internal path. Replace with a generic placeholder consistent with the README.

Suggested diff
 docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
-    -v /home/scratch.asteiner:/workspace/scratch \
+    -v /path/to/your/models:/workspace/scratch \
     -it nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc3 bash
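An equivalent variant using an environment variable, as the agent prompt below also suggests (SCRATCH_DIR is a placeholder the user sets to the host directory holding the checkpoint):

```bash
# Set SCRATCH_DIR to the host directory that holds the GLM-5 checkpoint.
SCRATCH_DIR=${SCRATCH_DIR:-/path/to/scratch}
docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
    -v "${SCRATCH_DIR}:/workspace/scratch" \
    -it nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc3 bash
```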
🤖 Prompt for AI Agents
In `@examples/glm5_nvfp4_serving.sh` around lines 47 - 49, The docker run line in
examples/glm5_nvfp4_serving.sh currently mounts a hardcoded personal path
(/home/scratch.asteiner); change that mount to a generic placeholder (e.g.,
/path/to/scratch or an environment variable like ${SCRATCH_DIR}) so the script
doesn't leak internal directories. Update the volume flag in the docker run
command (the "-v /home/scratch.asteiner:/workspace/scratch" segment) to use the
placeholder and, if desired, add a short comment or README pointer that users
should set SCRATCH_DIR before running.

Comment on lines +78 to +80
curl http://172.17.0.2:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"GLM-5-nvfp4-v1","prompt":"The capital of France is","max_tokens":50}'

⚠️ Potential issue | 🟡 Minor

Hardcoded Docker bridge IP 172.17.0.2 is fragile.

This IP is not guaranteed. The README's approach of using docker inspect to discover the IP is better. Replace with a placeholder or show the inspect command here too.

Suggested diff
 # --- Test (from host node, not inside container) ---
-curl http://172.17.0.2:8000/v1/completions \
+# Find container IP: docker inspect -f '{{range.NetworkSettings.Networks}}{{.IPAddress}}{{end}}' <container-id>
+curl http://<container-ip>:8000/v1/completions \
   -H "Content-Type: application/json" \
   -d '{"model":"GLM-5-nvfp4-v1","prompt":"The capital of France is","max_tokens":50}'
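A runnable variant of the same idea, resolving the container IP programmatically (a sketch; <container-id> is a placeholder for the actual container name or ID):

```bash
# Discover the container's bridge IP, then point the test request at it.
SERVER_IP=$(docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' <container-id>)
curl "http://${SERVER_IP}:8000/v1/completions" \
  -H "Content-Type: application/json" \
  -d '{"model":"GLM-5-nvfp4-v1","prompt":"The capital of France is","max_tokens":50}'
```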
🤖 Prompt for AI Agents
In `@examples/glm5_nvfp4_serving.sh` around lines 78 - 80, The curl example
currently hardcodes the Docker bridge IP "172.17.0.2" which is fragile; update
the curl invocation in examples/glm5_nvfp4_serving.sh to use a variable or
command substitution instead of a fixed IP (e.g., replace the literal IP in the
curl command with a placeholder like SERVER_IP or use the docker inspect command
to programmatically obtain the container IP), and include a short inline comment
showing the docker inspect command to discover the container IP so users can set
SERVER_IP before running the curl line.

