v1.3.0rc4
Pre-release
Highlights
Model Support
API
- Add user-provided UUID support for multimodal KV cache identification (#11075)
Feature
- Support GB200 and increase disagg test timeout (#11019)
- Avoid syncs in beam search and other improvements (#11349)
- Implement disaggregated harmony chat (#11336)
- Support different KV cache layout for one-model spec dec (#10502)
- Reduce repeated warnings from the attention module (#11335)
- Make update_weights compatible with CUDA Graph (#11267)
- Fully non-blocking pipeline parallelism executor loop (#10349)
- Move MambaCacheManager from Python to C++ (#10540)
- Pin host memory and batch sampler setup in beam search (#11390)
- Initial PR for trtllm-gen attention backend (#10784)
- Remove hard-coded activation type definition in the TRTLLM MoE backend (#11164)
- Merge residual+hidden into the layer norm at the end of each NemotronH MTP, and remove a modulo (%) operation (#11406)
- Introduce an abstract WaitingQueue interface to decouple the request scheduling logic from specific queue implementations (#11330)
- Add BOLT compatible build flags for further experimental usage (#11297)
- Multi-image support for EPD disagg (#11264)
- Optimize NemotronH model with elementwise and nvfp4 fusion (#11273)
- TorchSampler general host time optimization (#11141)
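The WaitingQueue change (#11330) is a classic decoupling: the scheduler talks to an abstract queue interface so the queuing policy can be swapped without touching scheduling code. A minimal Python sketch of that pattern, with illustrative names only, not TensorRT-LLM's actual interface:

```python
from abc import ABC, abstractmethod
from collections import deque


class WaitingQueue(ABC):
    """Illustrative abstract interface: the scheduler sees only these methods."""

    @abstractmethod
    def push(self, request) -> None: ...

    @abstractmethod
    def pop(self):
        """Remove and return the next request to schedule."""

    @abstractmethod
    def __len__(self) -> int: ...


class FifoWaitingQueue(WaitingQueue):
    """One concrete policy; a priority queue could be swapped in unchanged."""

    def __init__(self):
        self._q = deque()

    def push(self, request) -> None:
        self._q.append(request)

    def pop(self):
        return self._q.popleft()

    def __len__(self) -> int:
        return len(self._q)
```

Any scheduler written against `WaitingQueue` works with either policy, which is the point of the refactor.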
Fix
- Disaggregated serving: Only send finished context requests to the KV cache transceiver (#11354)
- Replace etcd3 with etcd-sdk-python (#10886)
- Fix offset calculation in _are_stop_words when using speculative decoding (#10854)
- Fix hang issue by avoiding exposing UB buf… (#10842)
- Workaround (WAR) for popen in the QA environment (#10989)
- Fix Eagle3 draft model weight loading for throughput checkpoint (#11010)
- Respect CUDA_LAUNCH_BLOCKING by fixing doCheckError (#11261)
- Avoid reserved filename on Windows (#11382)
- Fix tinygemm accuracy (#11411)
- Disable cutedsl argmax kernel to fix perf regression (#11403)
- Fix deepeplowlatency with trtllm moe backend running fp8 DS_R1 (#11266)
- Gracefully terminate disagg serving servers to prevent leftover subprocess warnings (#11395)
- Update lm_eval to 4.9.10 and re-enable Skip Softmax Attention tests on CI (#11176)
- Remove overlap scheduler adjustment for max sequence length in create_py_executor function (#9229)
- Fix out-of-bounds array access in kernel factory Get() methods (#11373)
- Fix a bug in PR11336 (#11439)
- Fix GLM engine build dtype (#11246)
- Enable warmup for Helix CP (#11460)
- Pre-Allocation for Auto-Tuning NCCL_SYMMETRIC (#11326)
- Make NVML work with older CUDA driver versions (#11465)
- Fall back to triton_ssm for nvfp4 quantization (#11456, #11455)
- Fix CUDA OOM error (#11219)
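The CUDA_LAUNCH_BLOCKING fix (#11261) comes down to an error check that honors the environment variable: with launch blocking enabled, kernel launches are serialized, so a synchronous check pinpoints the exact failing launch. A hedged Python sketch of that dispatch pattern; the helper names are hypothetical stand-ins, not the actual doCheckError code (which is C++):

```python
import os


def check_launch_error(poll_async, poll_sync):
    """Route error checking based on CUDA_LAUNCH_BLOCKING.

    poll_async / poll_sync are hypothetical callables standing in for a
    cheap asynchronous error poll and a device-synchronizing check.
    """
    if os.environ.get("CUDA_LAUNCH_BLOCKING", "0") == "1":
        # Launches are serialized, so a synchronous check surfaces the
        # error at the offending launch rather than some later call.
        return poll_sync()
    return poll_async()
```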
Documentation
Benchmark
Test & Infra
- Move test_trtllm_flashinfer_symbol_collision.py to tests/unittest/_torch (#11168)
- Fix missing test cases (#10881)
- Update test constraint (#11054)
- Add CODEOWNERS coverage for serve/ and commands/ directories (#11359)
- Update model list (#11364)
- Unit test for disagg gen cancellation (#11108)
- Disable Spark stages due to the Spark cloud migration (#11401)
- Re-enable Spark CI since the Spark cloud migration is done (#11407)
- Upload unittest sub results in slurm (#10834)
- Remove obsolete code (#11388)
- Fix the test case name in the timeout XML (#9781)
- Use frontend dgx-h100 and b200 slurm platforms (#11251)
- Update allowlist 2026-02-10 (#11426)
- Lock FI version to 0.6.3 (#11371)
- Pin the torchao version (#11444)
- Refactor finish reasons tests (#11445)
- Move DeepEPLowLatency tests to machines that support IBGDA with GPU handles (#11178)
- Refactor MoE unit tests: add unified ConfigurableMoE test framework (#11437)
- Use weakref in atexit handler (#11476)
- Improve assert in sampler (#11475)
- Update allowlist 2026-02-13 (#11512)
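The weakref-in-atexit change (#11476) follows a standard Python pattern: registering a bound method with atexit keeps its object alive until interpreter shutdown, while holding only a weak reference lets the object be collected normally. A minimal sketch of the pattern (not the actual sampler code):

```python
import atexit
import weakref


class Resource:
    def close(self):
        pass  # release whatever the resource holds


def register_cleanup(resource):
    # Hold only a weak reference so the atexit handler does not pin
    # the resource for the lifetime of the interpreter.
    ref = weakref.ref(resource)

    def _cleanup():
        obj = ref()  # None if the resource was already collected
        if obj is not None:
            obj.close()

    atexit.register(_cleanup)
    return ref
```

Registering `resource.close` directly would keep `resource` alive until exit; with the weak reference, the handler simply becomes a no-op once the object is gone.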
What's Changed
- [None][infra] Waive failed case for main branch on 02/09 by @EmmaQiaoCh in #11369
- [None][chore] Move test_trtllm_flashinfer_symbol_collision.py to tests/unittest/_torch by @yihwang-nv in #11168
- [None][chore] Add microbench for MoE Comm methods. by @bobboli in #10317
- [https://nvbugs/5829097][fix] Disaggregated serving: Only send finished context requests to the KV cache transceiver by @Funatiq in #11354
- [None][test] Enhance multi-GPU tests for IFB stats by @Funatiq in #11239
- [https://nvbugs/5834212][chore] unwaive test_disaggregated_mixed by @reasonsolo in #11372
- [#10780][feat] AutoDeploy: Support per-expert scales in FP8 and NVFP4 MoE by @galagam in #11322
- [TRTLLM-10030][perf] avoid syncs in beam search + other improvements by @ixlmar in #11349
- [None][chore] Mass integration of release/1.2 - 3rd by @dominicshanshan in #11308
- [None][fix] Respect CUDA_LAUNCH_BLOCKING by fixing doCheckError by @hnover-nv in #11261
- [TRTLLM-10866][feat] implement disaggregated harmony chat by @reasonsolo in #11336
- [None][infra] AutoDeploy: Dump graph IR after every transform by @bmarimuthu-nv in #11045
- [None][chore] update model list by @tcherckez-nvidia in #11364
- [None][chore] Unit test for disagg gen cancellation by @pcastonguay in #11108
- [https://nvbugs/5853997][chore] Unwaive gpt-oss test by @mikeiovine in #11287
- [TRTLLM-10321][feat] Support different KV cache layout for one-model spec dec by @ziyixiong-nv in #10502
- [https://nvbugs/5855540][fix] AutoDeploy: thread cleanup of eagle test by @lucaslie in #11289
- [None][chore] Reduce attention module repeated warnings. by @yuxianq in #11335
- [https://nvbugs/5843112][chore] Unwaive ngram test by @mikeiovine in #11320
- [None][test] Add DGX-Spark multinode perf cases by @JennyLiu-nv in #11184
- [None][fix] Avoid reserved filename on Windows by @tongyuantongyu in #11382
- [None][infra] Disable spark stages due to migration of spark cloud by @EmmaQiaoCh in #11401
- [TRTC-265][chore] Add CODEOWNERS coverage for serve/ and commands/ directories by @venkywonka in #11359
- [#11032][feat] MLA revisited and GLM 4.7 Flash support by @lucaslie in #11324
- [TRTC-264][doc] Add CLAUDE.md and AGENTS.md by @venkywonka in #11358
- [None][chore] Mass merge commits from release/1.2.0rc6.post1 branch by @longlee0622 in #11384
- [TRTLLM-9771][feat] Make update_weights compatible with CUDA Graph by @shuyixiong in #11267
- [None][infra] Enable spark ci since spark cloud migration is done by @EmmaQiaoCh in #11407
- [None][doc] add multiple-instances section in disaggregated serving doc by @reasonsolo in #11412
- [None][feat] Fully non-blocking pipeline parallelism executor loop. by @yuxianq in #10349
- [None][infra] Waive failed cases for main branch on 02/10 by @EmmaQiaoCh in #11413
- [None][chore] Unwaive tests after last MI by @dominicshanshan in #11400
- [TRTLLM-10331][infra] Upload unittest sub results in slurm by @yiqingy0 in #10834
- [https://nvbugs/5791242][chore] remove obsolete code by @ixlmar in #11388
- [None][fix] fix tinygemm accuracy by @bo-nv in #11411
- [https://nvbugs/5853720][fix] Disable cutedsl argmax kernel to fix perf regression by @chenfeiz0326 in #11403
- [None][chore] Add failed cases into waives.txt by @xinhe-nv in #11363
- [None][chore] Add failed cases into waives.txt by @xinhe-nv in #11396
- [TRTLLM-9711][infra] Fix the testcase name in timeout xml by @yiqingy0 in #9781
- [https://nvbugs/5848377][fix] fix deepeplowlatency with trtllm moe backend running fp8 DS_R1 by @leslie-fang25 in #11266
- [TRTLLM-10273][feat] Move MambaCacheManager from Python to C++ by @Tabrizian in #10540
- [TRTLLM-10030][perf] pin host memory and batch sampler setup in beam search by @ixlmar in #11390
- [None][infra] Use frontend dgx-h100 and b200 slurm platforms by @mlefeb01 in #11251
- [None][chore] Update allowlist 2026-02-10 by @tburt-nv in #11426
- [#11203][feat] AutoDeploy: Refactor node caching and improve engine build time by @taylor-yb-lee in #11250
- [https://nvbugs/5868038][fix] Gracefully terminate disagg serving servers to prevent leftover subprocess warnings by @peihu-nv in #11395
- [https://nvbugs/5810940][fix] Update lm_eval to 4.9.10 and re-enable Skip Softmax Attention tests on CI. by @bobboli in #11176
- [None][chore] Lock FI version to 0.6.3 by @rosong11 in #11371
- [None][infra] Waive failed cases for main on 2/11 by @EmmaQiaoCh in #11441
- [None][doc] Update Skip Softmax attention blog. by @bobboli in #11443
- [None][feat] Initial PR for trtllm-gen attention backend by @yihwang-nv in #10784
- [None][infra] Pin the torchao version by @EmmaQiaoCh in #11444
- [None][feat] Remove the hard code for activation type definition in T… by @nv-guomingz in #11164
- [None][fix] Remove overlap scheduler adjustment for max sequence length in create_py_executor function by @Funatiq in #9229
- [None][chore] Merge residual+hidden into layer norm at the end of each NemotronH MTP, and remove a % operation by @hnover-nv in #11406
- [None][fix] Fix out-of-bounds array access in kernel factory Get() methods by @hnover-nv in #11373
- [None][chore] Introducing an abstract WaitingQueue interface to decouple the request scheduling logic from specific queue implementations by @lancelly in #11330
- [TRTLLM-10793][feat] Add BOLT compatible build flags for further experimental usage. by @hyukn in #11297
- [None][chore] Add failed cases into waives.txt by @xinhe-nv in #11392
- [TRTLLM-10858][feat] Multi-image support for EPD disagg by @2ez4bz in #11264
- [https://nvbugs/5804923][none] unwaive test by @PerkzZheng in #11005
- [None][fix] glm engine build dtype by @mandroid6 in #11246
- [TRTLLM-10487][feat] Add user-provided UUID support for multimodal KV cache identification. by @SimengLiu-nv in #11075
- [None][chore] fix a bug in PR11336 by @reasonsolo in #11439
- [None][chore] added AutoDeploy nano_v3_multi_device.yaml by @MrGeva in #10845
- [None][chore] Add failed cases into waives.txt by @xinhe-nv in #11451
- [TRTLLM-10030][chore] refactor finish reasons tests by @ixlmar in #11445
- [https://nvbugs/5808500][chore] Move DeepEPLowLatency tests to machines that support IBGDA with GPU handles by @yuantailing in #11178
- [https://nvbugs/5832481][test] Add gpt-oss-120b-Eagle3-throughput case on DGX-Spark by @JennyLiu-nv in #11419
- [None][chore] Add failed cases into waives.txt by @xinhe-nv in #11447
- [None][feat] Optimize super-v3 nvfp4 for better perf by @Wanli-Jiang in #11273
- [https://nvbugs/5810935][test] unwaive RTX 6000 pro tests by @pamelap-nvidia in #11452
- [TRTLLM-10791][feat] TorchSampler general host time optimization by @hyukn in #11141
- [None][chore] Bump version to 1.3.0rc4 by @tburt-nv in #11485
- [https://nvbugs/5888410][fix] Enable warmup for Helix CP by @brb-nv in #11460
- [None][fix] Pre-Allocation for Auto-Tuning NCCL_SYMMETRIC by @nv-lschneider in #11326
- [https://nvbugs/5887893][fix] Make NVML work with older CUDA driver versions by @Tabrizian in #11465
- [TRTINFRA-7648][chore] Add SECURITY.md file to TensorRT-LLM GitHub by @dpitman-nvda in #11484
- [TRTLLM-9108][feat] refactor MoE unit tests: add unified ConfigurableMoE test framework by @xxi-nv in #11437
- [None][chore] Waive test blocking pre-merge by @brb-nv in #11498
- [#11455][fix] Fallback to triton_ssm for nvfp4 quantization by @galagam in #11456
- [None][infra] Waive failed test in Post-Merge by @yuanjingx87 in #11491
- [TRTLLM-10030][chore] use weakref in atexit handler by @ixlmar in #11476
- [https://nvbugs/5847284][fix] fix cuda oom error by @reasonsolo in #11219
- [None][docs] enable Deepwiki docs by @venkywonka in #11492
- [TRTLLM-10030][chore] improve assert in sampler by @ixlmar in #11475
- [None][chore] Update allowlist 2026-02-13 by @dpitman-nvda in #11512
- [TRTLLM-10329][feat] Fix weight loading for Nemotron 3 models on DGX Spark by @pamelap-nvidia in #11405
- [TRTLLM-10612][feat] Initial support of AIGV models in TRTLLM by @chang-l in #11462
New Contributors
- @peihu-nv made their first contribution in #11395
- @rosong11 made their first contribution in #11371
- @mandroid6 made their first contribution in #11246
- @dpitman-nvda made their first contribution in #11484
Full Changelog: v1.3.0rc3...v1.3.0rc4