
v1.3.0rc4

Pre-release

@pcastonguay pcastonguay released this 17 Feb 21:04
· 37 commits to main since this release
26901e4

Highlights

  • Model Support

    • Add EPD disagg support for Qwen3 VL MoE (#10962)
    • MLA revisited and GLM 4.7 Flash support (#11324)
    • Initial support of AIGV models in TRTLLM (#11462)
    • Fix weight loading for Nemotron 3 models on DGX Spark (#11405)
  • API

    • Add user-provided UUID support for multimodal KV cache identification (#11075)
  • Feature

    • Support GB200 and increase disagg test timeout (#11019)
    • Avoid syncs in beam search and other improvements (#11349)
    • Implement disaggregated harmony chat (#11336)
    • Support different KV cache layout for one-model spec dec (#10502)
    • Reduce attention module repeated warnings (#11335)
    • Make update_weights compatible with CUDA Graph (#11267)
    • Fully non-blocking pipeline parallelism executor loop (#10349)
    • Move MambaCacheManager from Python to C++ (#10540)
    • Pin host memory and batch sampler setup in beam search (#11390)
    • Initial PR for trtllm-gen attention backend (#10784)
    • Remove hard-coded activation type definition in the TRTLLM MoE backend (#11164)
    • Merge residual+hidden into layer norm at the end of each NemotronH MTP, and remove a % operation (#11406)
    • Introduce an abstract WaitingQueue interface to decouple the request scheduling logic from specific queue implementations (#11330)
    • Add BOLT compatible build flags for further experimental usage (#11297)
    • Multi-image support for EPD disagg (#11264)
    • Optimize NemotronH model with elementwise and nvfp4 fusion (#11273)
    • TorchSampler general host time optimization (#11141)
  • Fix

    • Disaggregated serving: Only send finished context requests to the KV cache transceiver (#11354)
    • Replace etcd3 with etcd-sdk-python (#10886)
    • Fix offset calculation in _are_stop_words when using speculative decoding (#10854)
    • Fix hang issue by avoiding exposing UB buf… (#10842)
    • WAR for popen in QA env (#10989)
    • Fix Eagle3 draft model weight loading for throughput checkpoint (#11010)
    • Respect CUDA_LAUNCH_BLOCKING by fixing doCheckError (#11261)
    • Avoid reserved filename on Windows (#11382)
    • Fix tinygemm accuracy (#11411)
    • Disable cutedsl argmax kernel to fix perf regression (#11403)
    • Fix deepeplowlatency with trtllm moe backend running fp8 DS_R1 (#11266)
    • Gracefully terminate disagg serving servers to prevent leftover subprocess warnings (#11395)
    • Update lm_eval to 4.9.10 and re-enable Skip Softmax Attention tests on CI (#11176)
    • Remove overlap scheduler adjustment for max sequence length in create_py_executor function (#9229)
    • Fix out-of-bounds array access in kernel factory Get() methods (#11373)
    • Fix a bug in PR11336 (#11439)
    • Fix GLM engine build dtype (#11246)
    • Enable warmup for Helix CP (#11460)
    • Pre-Allocation for Auto-Tuning NCCL_SYMMETRIC (#11326)
    • Make NVML work with older CUDA driver versions (#11465)
    • Fall back to triton_ssm for nvfp4 quantization (#11456, #11455)
    • Fix CUDA OOM error (#11219)
  • Documentation

    • Add CLAUDE.md and AGENTS.md (#11358)
    • Add multiple-instances section in disaggregated serving doc (#11412)
    • Update Skip Softmax attention blog (#11443)
    • Add SECURITY.md file to TensorRT-LLM GitHub (#11484)
    • Enable Deepwiki docs (#11492)
  • Benchmark

    • Add microbench for MoE Comm methods (#10317)
    • Enhance multi-GPU tests for IFB stats (#11239)
    • Add DGX-Spark multinode perf cases including eagle3 (#11184)
    • Add gpt-oss-120b-Eagle3-throughput case on DGX-Spark (#11419)
  • Test & Infra

    • Move test_trtllm_flashinfer_symbol_collision.py to tests/unittest/_torch (#11168)
    • Fix missing test cases (#10881)
    • Update test constraint (#11054)
    • Add CODEOWNERS coverage for serve/ and commands/ directories (#11359)
    • Update model list (#11364)
    • Unit test for disagg gen cancellation (#11108)
    • Disable spark stages due to migration of spark cloud (#11401)
    • Enable Spark CI since Spark cloud migration is done (#11407)
    • Upload unittest sub results in slurm (#10834)
    • Remove obsolete code (#11388)
    • Fix the testcase name in timeout xml (#9781)
    • Use frontend dgx-h100 and b200 slurm platforms (#11251)
    • Update allowlist 2026-02-10 (#11426)
    • Lock FI version to 0.6.3 (#11371)
    • Pin the torchao version (#11444)
    • Refactor finish reasons tests (#11445)
    • Move DeepEPLowLatency tests to machines that support IBGDA with GPU handles (#11178)
    • Refactor MoE unit tests: add unified ConfigurableMoE test framework (#11437)
    • Use weakref in atexit handler (#11476)
    • Improve assert in sampler (#11475)
    • Update allowlist 2026-02-13 (#11512)
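
One of the fixes above, "Respect CUDA_LAUNCH_BLOCKING by fixing doCheckError" (#11261), relates to a standard CUDA runtime debugging switch: when `CUDA_LAUNCH_BLOCKING=1` is set before the CUDA context is created, every kernel launch runs synchronously, so an asynchronous CUDA error is reported at the launch that caused it rather than at some later, unrelated call. A minimal sketch of how a process would opt in (the `is_launch_blocking_enabled` helper is hypothetical, for illustration only, and not TensorRT-LLM API):

```python
import os

# Must be set before the CUDA runtime is initialized
# (i.e. before importing torch or creating any CUDA context);
# setting it afterwards has no effect on the existing context.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

def is_launch_blocking_enabled() -> bool:
    # Hypothetical helper mirroring the kind of check a runtime
    # might do: treat the string "1" as enabled, anything else as off.
    return os.environ.get("CUDA_LAUNCH_BLOCKING", "0") == "1"
```

Synchronous launches are much slower, so this is a debugging aid only, not a production setting.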

What's Changed

  • [None][infra] Waive failed case for main branch on 02/09 by @EmmaQiaoCh in #11369
  • [None][chore] Move test_trtllm_flashinfer_symbol_collision.py to tests/unittest/_torch by @yihwang-nv in #11168
  • [None][chore] Add microbench for MoE Comm methods. by @bobboli in #10317
  • [https://nvbugs/5829097][fix] Disaggregated serving: Only send finished context requests to the KV cache transceiver by @Funatiq in #11354
  • [None][test] Enhance multi-GPU tests for IFB stats by @Funatiq in #11239
  • [https://nvbugs/5834212][chore] unwaive test_disaggregated_mixed by @reasonsolo in #11372
  • [#10780][feat] AutoDeploy: Support per-expert scales in FP8 and NVFP4 MoE by @galagam in #11322
  • [TRTLLM-10030][perf] avoid syncs in beam search + other improvements by @ixlmar in #11349
  • [None][chore] Mass integration of release/1.2 - 3rd by @dominicshanshan in #11308
  • [None][fix] Respect CUDA_LAUNCH_BLOCKING by fixing doCheckError by @hnover-nv in #11261
  • [TRTLLM-10866][feat] implement disaggregated harmony chat by @reasonsolo in #11336
  • [None][infra] AutoDeploy: Dump graph IR after every transform by @bmarimuthu-nv in #11045
  • [None][chore] update model list by @tcherckez-nvidia in #11364
  • [None][chore] Unit test for disagg gen cancellation by @pcastonguay in #11108
  • [https://nvbugs/5853997][chore] Unwaive gpt-oss test by @mikeiovine in #11287
  • [TRTLLM-10321][feat] Support different KV cache layout for one-model spec dec by @ziyixiong-nv in #10502
  • [https://nvbugs/5855540][fix] AutoDeploy: thread cleanup of eagle test by @lucaslie in #11289
  • [None][chore] Reduce attention module repeated warnings. by @yuxianq in #11335
  • [https://nvbugs/5843112][chore] Unwaive ngram test by @mikeiovine in #11320
  • [None][test] Add DGX-Spark multinode perf cases by @JennyLiu-nv in #11184
  • [None][fix] Avoid reserved filename on Windows by @tongyuantongyu in #11382
  • [None][infra] Disable spark stages due to migration of spark cloud by @EmmaQiaoCh in #11401
  • [TRTC-265][chore] Add CODEOWNERS coverage for serve/ and commands/ directories by @venkywonka in #11359
  • [#11032][feat] MLA revisited and GLM 4.7 Flash support by @lucaslie in #11324
  • [TRTC-264][doc] Add CLAUDE.md and AGENTS.md by @venkywonka in #11358
  • [None][chore] Mass merge commits from release/1.2.0rc6.post1 branch by @longlee0622 in #11384
  • [TRTLLM-9771][feat] Make update_weights compatible with CUDA Graph by @shuyixiong in #11267
  • [None][infra] Enable Spark CI since Spark cloud migration is done by @EmmaQiaoCh in #11407
  • [None][doc] add multiple-instances section in disaggregated serving doc by @reasonsolo in #11412
  • [None][feat] Fully non-blocking pipeline parallelism executor loop. by @yuxianq in #10349
  • [None][infra] Waive failed cases for main branch on 02/10 by @EmmaQiaoCh in #11413
  • [None][chore] Unwaive tests after last MI by @dominicshanshan in #11400
  • [TRTLLM-10331][infra] Upload unittest sub results in slurm by @yiqingy0 in #10834
  • [https://nvbugs/5791242][chore] remove obsolete code by @ixlmar in #11388
  • [None][fix] fix tinygemm accuracy by @bo-nv in #11411
  • [https://nvbugs/5853720][fix] Disable cutedsl argmax kernel to fix perf regression by @chenfeiz0326 in #11403
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #11363
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #11396
  • [TRTLLM-9711][infra] Fix the testcase name in timeout xml by @yiqingy0 in #9781
  • [https://nvbugs/5848377][fix] fix deepeplowlatency with trtllm moe backend running fp8 DS_R1 by @leslie-fang25 in #11266
  • [TRTLLM-10273][feat] Move MambaCacheManager from Python to C++ by @Tabrizian in #10540
  • [TRTLLM-10030][perf] pin host memory and batch sampler setup in beam search by @ixlmar in #11390
  • [None][infra] Use frontend dgx-h100 and b200 slurm platforms by @mlefeb01 in #11251
  • [None][chore] Update allowlist 2026-02-10 by @tburt-nv in #11426
  • [#11203][feat] AutoDeploy: Refactor node caching and improve engine build time by @taylor-yb-lee in #11250
  • [https://nvbugs/5868038][fix] Gracefully terminate disagg serving servers to prevent leftover subprocess warnings by @peihu-nv in #11395
  • [https://nvbugs/5810940][fix] Update lm_eval to 4.9.10 and re-enable Skip Softmax Attention tests on CI. by @bobboli in #11176
  • [None][chore] Lock FI version to 0.6.3 by @rosong11 in #11371
  • [None][infra] Waive failed cases for main on 2/11 by @EmmaQiaoCh in #11441
  • [None][doc] Update Skip Softmax attention blog. by @bobboli in #11443
  • [None][feat] Initial PR for trtllm-gen attention backend by @yihwang-nv in #10784
  • [None][infra] Pin the torchao version by @EmmaQiaoCh in #11444
  • [None][feat] Remove the hard code for activation type definition in T… by @nv-guomingz in #11164
  • [None][fix] Remove overlap scheduler adjustment for max sequence length in create_py_executor function by @Funatiq in #9229
  • [None][chore] Merge residual+hidden into layer norm at the end of each NemotronH MTP, and remove a % operation by @hnover-nv in #11406
  • [None][fix] Fix out-of-bounds array access in kernel factory Get() methods by @hnover-nv in #11373
  • [None][chore] Introducing an abstract WaitingQueue interface to decouple the request scheduling logic from specific queue implementations by @lancelly in #11330
  • [TRTLLM-10793][feat] Add BOLT compatible build flags for further experimental usage. by @hyukn in #11297
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #11392
  • [TRTLLM-10858][feat] Multi-image support for EPD disagg by @2ez4bz in #11264
  • [https://nvbugs/5804923][none] unwaive test by @PerkzZheng in #11005
  • [None][fix] glm engine build dtype by @mandroid6 in #11246
  • [TRTLLM-10487][feat] Add user-provided UUID support for multimodal KV cache identification. by @SimengLiu-nv in #11075
  • [None][chore] fix a bug in PR11336 by @reasonsolo in #11439
  • [None][chore] added AutoDeploy nano_v3_multi_device.yaml by @MrGeva in #10845
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #11451
  • [TRTLLM-10030][chore] refactor finish reasons tests by @ixlmar in #11445
  • [https://nvbugs/5808500][chore] Move DeepEPLowLatency tests to machines that support IBGDA with GPU handles by @yuantailing in #11178
  • [https://nvbugs/5832481][test] Add gpt-oss-120b-Eagle3-throughput case on DGX-Spark by @JennyLiu-nv in #11419
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #11447
  • [None][feat] Optimize super-v3 nvfp4 for better perf by @Wanli-Jiang in #11273
  • [https://nvbugs/5810935][test] unwaive RTX 6000 pro tests by @pamelap-nvidia in #11452
  • [TRTLLM-10791][feat] TorchSampler general host time optimization by @hyukn in #11141
  • [None][chore] Bump version to 1.3.0rc4 by @tburt-nv in #11485
  • [https://nvbugs/5888410][fix] Enable warmup for Helix CP by @brb-nv in #11460
  • [None][fix] Pre-Allocation for Auto-Tuning NCCL_SYMMETRIC by @nv-lschneider in #11326
  • [https://nvbugs/5887893][fix] Make NVML work with older CUDA driver versions by @Tabrizian in #11465
  • [TRTINFRA-7648][chore] Add SECURITY.md file to TensorRT-LLM GitHub by @dpitman-nvda in #11484
  • [TRTLLM-9108][feat] refactor MoE unit tests: add unified ConfigurableMoE test framework by @xxi-nv in #11437
  • [None][chore] Waive test blocking pre-merge by @brb-nv in #11498
  • [#11455][fix] Fallback to triton_ssm for nvfp4 quantization by @galagam in #11456
  • [None][infra] Waive failed test in Post-Merge by @yuanjingx87 in #11491
  • [TRTLLM-10030][chore] use weakref in atexit handler by @ixlmar in #11476
  • [https://nvbugs/5847284][fix] fix cuda oom error by @reasonsolo in #11219
  • [None][docs] enable Deepwiki docs by @venkywonka in #11492
  • [TRTLLM-10030][chore] improve assert in sampler by @ixlmar in #11475
  • [None][chore] Update allowlist 2026-02-13 by @dpitman-nvda in #11512
  • [TRTLLM-10329][feat] Fix weight loading for Nemotron 3 models on DGX Spark by @pamelap-nvidia in #11405
  • [TRTLLM-10612][feat] Initial support of AIGV models in TRTLLM by @chang-l in #11462

New Contributors

Full Changelog: v1.3.0rc3...v1.3.0rc4