Conversation
- Fix headSizeV in attentionOp.cpp: use v_head_dim instead of qk_nope_head_dim for the non-sparse MLA context path. GLM5 has v_head_dim=256 vs qk_nope_head_dim=192; the old assumption (v_head_dim == qk_nope_head_dim) only held for DeepSeek V3.
- Relax the FMHA assertion for context MLA to a warning instead of a fatal error, allowing graceful fallback when kernels are unavailable for a given head-size combination.
- Add examples/glm5_nvfp4_serving.sh: step-by-step guide to serve GLM5 NVFP4 on 8xB200 using DeepseekV32ForCausalLM (model_type=deepseek_v32) with the DSA indexer, FP8 KV cache, and PerkzZheng's FMHA kernel branch. Verified working on 8xB200 (SM100), TRT-LLM 1.3.0rc3/rc4.

Co-authored-by: Cursor <cursoragent@cursor.com>
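A hedged sketch of the headSizeV change described in this commit. The surrounding control flow and member names (mMLAParams and its fields) are illustrative assumptions, not the actual attentionOp.cpp code:

```cpp
// Sketch only: selecting the value head size for the non-sparse MLA context
// (prefill) path. Names are assumed for illustration.
int headSizeV = mHeadSize;
if (mIsMLAEnabled && !mIsGenerationMLA)
{
    // The old code used qk_nope_head_dim here, which is correct for
    // DeepSeek V3 (v_head_dim == qk_nope_head_dim == 128) but wrong for
    // GLM5 (v_head_dim = 256, qk_nope_head_dim = 192).
    headSizeV = mMLAParams.v_head_dim; // was: mMLAParams.qk_nope_head_dim
}
```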
Markdown guide covering all prerequisites, config changes, C++ fixes, build, flashinfer patch, and serve commands. Co-authored-by: Cursor <cursoragent@cursor.com>
deepseek_v32 model type routes context through absorption mode using existing 576/512 FMHA kernels. No C++ fixes, no PerkzZheng kernel branch, no flashinfer patch required. Only needs: config.json + tokenizer_config.json edits, serve YAML. Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
deepseek_v32 routes context through absorption mode (mqa with mIsGenerationMLA=true), using 576/512 kernels. The non-sparse context path that uses headSizeV = qk_nope_head_dim is never hit. Co-authored-by: Cursor <cursoragent@cursor.com>
📝 Walkthrough

Two new example files are introduced documenting a complete workflow for serving GLM-5 NVFP4 models via TensorRT-LLM with DeepSeek V3.2 compatibility, including model configuration, tokenizer setup, Docker deployment, sparse attention serving configuration, and testing instructions.
Estimated code review effort: 🎯 1 (Trivial) | ⏱️ ~5 minutes

🚥 Pre-merge checks: ✅ 3 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 6
🤖 Fix all issues with AI agents
In `@examples/GLM5_NVFP4_SERVING.md`:
- Around line 57-71: The doc contains an unresolved note in Step 2 saying the
tokenizer_config changes "may not be required"; resolve this before merging by
verifying the tokenizer works with and without the changes and then updating the
doc: if the changes are required, remove the uncertainty note and keep the edits
to tokenizer_config.json (replace "tokenizer_class": "TokenizersBackend" with
"PreTrainedTokenizerFast" and rename "extra_special_tokens" to
"additional_special_tokens"); if they are not required, remove the suggested
edits and the note. Ensure the final text clearly states the required action for
the tokenizer_config.json keys ("tokenizer_class" and
"additional_special_tokens") with no hedging.
- Around line 32-55: The config uses rope_scaling: {"type": "none", "factor":
1.0} which is not a valid PositionEmbeddingType and will raise a KeyError;
update the model config (the JSON object containing "architectures",
"model_type", "rope_theta" and "rope_scaling") by either removing the
rope_scaling field (or setting it to null) or replacing it with a valid type
such as {"type": "linear", "factor": 1.0} (which is equivalent to no scaling);
keep "rope_theta": 1000000 and ensure any code that maps the rope_scaling.type
string to PositionEmbeddingType uses a valid enum member name.
In `@examples/glm5_nvfp4_serving.sh`:
- Around line 43-44: Replace the hardcoded SLURM partition string used in the
salloc command ("salloc -N 1 --gres=gpu:8 --time=04:00:00
--partition=b200@500/None@cr+mp/8gpu-224cpu-2048gb") and the specific host in
the ssh command ("ssh umb-b200-236") with clear placeholders (e.g.,
--partition=<YOUR_PARTITION> and ssh <HOSTNAME>) and add a brief comment above
them explaining to fill in cluster-specific values or refer to the README;
update the example lines so they mirror the README placeholders and avoid
leaking internal cluster identifiers.
- Around line 47-49: The docker run line in examples/glm5_nvfp4_serving.sh
currently mounts a hardcoded personal path (/home/scratch.asteiner); change that
mount to a generic placeholder (e.g., /path/to/scratch or an environment
variable like ${SCRATCH_DIR}) so the script doesn't leak internal directories.
Update the volume flag in the docker run command (the "-v
/home/scratch.asteiner:/workspace/scratch" segment) to use the placeholder and,
if desired, add a short comment or README pointer that users should set
SCRATCH_DIR before running.
- Around line 78-80: The curl example currently hardcodes the Docker bridge IP
"172.17.0.2" which is fragile; update the curl invocation in
examples/glm5_nvfp4_serving.sh to use a variable or command substitution instead
of a fixed IP (e.g., replace the literal IP in the curl command with a
placeholder like SERVER_IP or use the docker inspect command to programmatically
obtain the container IP), and include a short inline comment showing the docker
inspect command to discover the container IP so users can set SERVER_IP before
running the curl line.
- Line 9: The script references the container tag string
"nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc3" which is not documented; update
every occurrence of that exact string (found in the top comment and later docker
image pulls/variables) to "nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc2" so all
container tag references match the official NVIDIA NGC documentation.
> ## Step 1 — Prepare the model config
>
> Edit `GLM-5-nvfp4-v1/config.json` (back up original first):
>
> ```bash
> cd /path/to/GLM-5-nvfp4-v1
> cp config.json config.json.orig
> ```
>
> Make these changes:
>
> ```jsonc
> {
>   // Change these two:
>   "architectures": ["DeepseekV32ForCausalLM"],  // was GlmMoeDsaForCausalLM
>   "model_type": "deepseek_v32",                 // was glm_moe_dsa
>
>   // Add these (GLM5 uses rope_theta=1M, no yarn scaling):
>   "rope_theta": 1000000,
>   "rope_scaling": {"type": "none", "factor": 1.0},
>
>   // Everything else stays the same
> }
> ```
🧩 Analysis chain

(Collapsed verification scripts: the repository was searched for `rope_scaling` handling, the `RotaryScalingType` and `PositionEmbeddingType` enums and their `from_string` methods, and the `deepseek_v32` / `DeepseekV3Config` model code; no mapping for a `"none"` scaling type was found. Script outputs omitted.)
Remove `rope_scaling: {"type": "none", "factor": 1.0}` — it will cause a KeyError.

The `rope_scaling` type field must map to a `PositionEmbeddingType` enum member. Valid types are: `rope_gptj`, `rope_gpt_neox`, `long_rope`, `yarn`, `mrope`, or others. The string `"none"` is not a recognized type and will fail at runtime when the model attempts to look it up.

For GLM-5 with no scaling, either:

- Omit `rope_scaling` entirely (set to `null`), or
- Use `"rope_scaling": {"type": "linear", "factor": 1.0}` (linear with factor 1.0 is equivalent to no scaling)
> ## Step 2 — Fix the tokenizer config
>
> > **Note:** These changes may not be required — needs confirmation. Try without them first.
>
> Edit `GLM-5-nvfp4-v1/tokenizer_config.json`:
>
> ```jsonc
> {
>   // Change:
>   "tokenizer_class": "PreTrainedTokenizerFast",  // was TokenizersBackend
>
>   // Rename key:
>   "additional_special_tokens": [...]  // was extra_special_tokens
> }
> ```
Uncertainty in Step 2 should be resolved before merging.
Line 59 states "These changes may not be required — needs confirmation." Shipping documentation with unconfirmed steps may confuse users. Consider either:
- Testing without these changes and updating accordingly, or
- Removing the note if they are indeed required.
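One quick way to confirm is to load the tokenizer with and without the edits; a minimal sketch, assuming the model directory path is a placeholder:

```python
from transformers import AutoTokenizer

# Run once against the original tokenizer_config.json and once against the
# edited one; if both load and round-trip text identically, the edits are optional.
tok = AutoTokenizer.from_pretrained("/path/to/GLM-5-nvfp4-v1", trust_remote_code=True)
print(type(tok).__name__)

text = "The capital of France is"
ids = tok.encode(text)
print(ids, tok.decode(ids))
```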
> ```bash
> # VERIFIED WORKING: 2026-02-14 on 8xB200 (SM100), TRT-LLM 1.3.0rc3/rc4
> #
> # DIRECTIONS: To serve GLM5 NVFP4 in TRT-LLM, you need the following
> # changes on top of the base container (nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc3):
> ```
🧩 Analysis chain

🌐 Web query: nvcr.io nvidia tensorrt-llm release 1.3.0rc3 NGC container

💡 Result: NVIDIA's official TensorRT-LLM Quick Start currently shows the NGC "release" container tag as:

```bash
docker run --rm -it --ipc host --gpus all \
  --ulimit memlock=-1 --ulimit stack=67108864 -p 8000:8000 \
  nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc2
```

No NVIDIA-published documentation or release page references an NGC container tag 1.3.0rc3 for nvcr.io/nvidia/tensorrt-llm/release; only 1.3.0rc2 is explicitly documented. To verify whether 1.3.0rc3 exists on NGC anyway, try:

```bash
docker login nvcr.io
docker pull nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc3
```

Source: TensorRT-LLM Quick Start Guide (container launch command shows `:1.3.0rc2`), nvidia.github.io

(Collapsed verification scripts: located `examples/glm5_nvfp4_serving.sh` and grepped it for `1.3.0rc3`; outputs omitted.)
Update references from 1.3.0rc3 to 1.3.0rc2 to align with official NVIDIA NGC documentation.
Official NVIDIA TensorRT-LLM documentation currently documents only the 1.3.0rc2 container tag. The script references 1.3.0rc3, which is not reflected in published NGC resources. To ensure users pull the correct image, update all container tag references to the officially documented 1.3.0rc2.
Applies to lines 9, 27, 49, 86.
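A one-liner sketch for applying this, assuming every occurrence of the full image tag string in the script should be rewritten:

```bash
# Rewrite all container image tag references in the example script in place.
sed -i 's#nvcr.io/nvidia/tensorrt-llm/release:1\.3\.0rc3#nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc2#g' \
  examples/glm5_nvfp4_serving.sh
```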
> ```bash
> salloc -N 1 --gres=gpu:8 --time=04:00:00 --partition=b200@500/None@cr+mp/8gpu-224cpu-2048gb
> ssh umb-b200-236
> ```
Hardcoded site-specific SLURM partition and hostname.
`b200@500/None@cr+mp/8gpu-224cpu-2048gb` and `umb-b200-236` are internal to your cluster. The README version (lines 77–80) correctly uses placeholders. Consider replacing these with placeholders and a comment, e.g.:
Suggested diff:

```diff
-salloc -N 1 --gres=gpu:8 --time=04:00:00 --partition=b200@500/None@cr+mp/8gpu-224cpu-2048gb
-ssh umb-b200-236
+salloc -N 1 --gres=gpu:8 --time=04:00:00 --partition=<your-partition>
+ssh <node-name>
```
> ```bash
> docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
>   -v /home/scratch.asteiner:/workspace/scratch \
>   -it nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc3 bash
> ```
Hardcoded personal mount path leaks internal directory structure.
/home/scratch.asteiner is a personal/internal path. Replace with a generic placeholder consistent with the README.
Suggested diff:

```diff
 docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
-  -v /home/scratch.asteiner:/workspace/scratch \
+  -v /path/to/your/models:/workspace/scratch \
   -it nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc3 bash
```
> ```bash
> curl http://172.17.0.2:8000/v1/completions \
>   -H "Content-Type: application/json" \
>   -d '{"model":"GLM-5-nvfp4-v1","prompt":"The capital of France is","max_tokens":50}'
> ```
Hardcoded Docker bridge IP 172.17.0.2 is fragile.
This IP is not guaranteed. The README's approach of using docker inspect to discover the IP is better. Replace with a placeholder or show the inspect command here too.
Suggested diff:

```diff
 # --- Test (from host node, not inside container) ---
-curl http://172.17.0.2:8000/v1/completions \
+# Find container IP: docker inspect -f '{{range.NetworkSettings.Networks}}{{.IPAddress}}{{end}}' <container-id>
+curl http://<container-ip>:8000/v1/completions \
   -H "Content-Type: application/json" \
   -d '{"model":"GLM-5-nvfp4-v1","prompt":"The capital of France is","max_tokens":50}'
```
[None][doc] Add GLM-5 NVFP4 serving guide using DeepSeek V3.2 model mapping
Description
GLM-5 (MLA architecture, `v_head_dim=256`, `qk_nope_head_dim=192`) can be served in TRT-LLM by remapping to `deepseek_v32` (`DeepseekV32ForCausalLM`).

Key insight: `DeepseekV32Attention` has a built-in DSA indexer that routes context attention through absorption mode (576/512 FMHA kernels). This avoids the 256/256 separate-QKV context path in `DeepseekV3Attention`, which falls back to unfused MHA that does not support MLA (produces garbage output).

No source build or C++ changes required — works on the stock `1.3.0rc3` container with only:

- `config.json`: remap `model_type` → `deepseek_v32`, `architectures` → `DeepseekV32ForCausalLM`, add `rope_theta` and `rope_scaling`
- `tokenizer_config.json`: fix tokenizer class and special-tokens key name (needs reconfirmation)
- `sparse_attention_config` with DSA algorithm and GLM-5 indexer params (a hedged serve sketch follows this list)

Files added:

- `examples/GLM5_NVFP4_SERVING.md` — step-by-step README
- `examples/glm5_nvfp4_serving.sh` — copy-paste script version

Verified working on 8×B200 (SM100), TRT-LLM 1.3.0rc3, producing coherent output.
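For orientation, a minimal serve sketch is below. The `trtllm-serve` flags are standard, but the exact `sparse_attention_config` keys for the GLM-5 indexer are not reproduced here and the YAML field names should be treated as assumptions; take the real values from `examples/GLM5_NVFP4_SERVING.md`.

```bash
# Hedged sketch: extra LLM API options plus the serve command.
# The DSA indexer parameters are documented in the README in this PR; the keys
# shown here are placeholders, not a verified schema.
cat > glm5_serve.yml <<'EOF'
kv_cache_config:
  dtype: fp8
sparse_attention_config:
  algorithm: DSA
  # GLM-5 indexer parameters go here (see examples/GLM5_NVFP4_SERVING.md)
EOF

trtllm-serve /path/to/GLM-5-nvfp4-v1 \
  --backend pytorch \
  --tp_size 8 \
  --extra_llm_api_options glm5_serve.yml
```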
Test Coverage

- Requests via the OpenAI-compatible API; verified coherent English output.
PR Checklist
Summary by CodeRabbit
New Features
Documentation