
v1.3.0rc4

Pre-release

@pcastonguay pcastonguay released this 17 Feb 21:04
· 37 commits to main since this release
26901e4

Highlights

  • Model Support

    • Add EPD disagg support for Qwen3 VL MoE (#10962)
    • MLA revisited and GLM 4.7 Flash support (#11324)
    • Initial support of AIGV models in TRTLLM (#11462)
    • Fix weight loading for Nemotron 3 models on DGX Spark (#11405)
  • API

    • Add user-provided UUID support for multimodal KV cache identification (#11075)
  • Feature

    • Support GB200 and increase disagg test timeout (#11019)
    • Avoid syncs in beam search and other improvements (#11349)
    • Implement disaggregated harmony chat (#11336)
    • Support different KV cache layout for one-model spec dec (#10502)
    • Reduce attention module repeated warnings (#11335)
    • Make update_weights compatible with CUDA Graph (#11267)
    • Fully non-blocking pipeline parallelism executor loop (#10349)
    • Move MambaCacheManager from Python to C++ (#10540)
    • Pin host memory and batch sampler setup in beam search (#11390)
    • Initial PR for trtllm-gen attention backend (#10784)
    • Remove hard-coded activation type definition in the TRTLLM MoE backend (#11164)
    • Merge residual+hidden into layer norm at the end of each NemotronH MTP, and remove a % operation (#11406)
    • Introduce an abstract WaitingQueue interface to decouple the request scheduling logic from specific queue implementations (#11330)
    • Add BOLT compatible build flags for further experimental usage (#11297)
    • Multi-image support for EPD disagg (#11264)
    • Optimize NemotronH model with elementwise and nvfp4 fusion (#11273)
    • TorchSampler general host time optimization (#11141)
  • Fix

    • Disaggregated serving: Only send finished context requests to the KV cache transceiver (#11354)
    • Replace etcd3 with etcd-sdk-python (#10886)
    • Fix offset calculation in _are_stop_words when using speculative decoding (#10854)
    • Fix hang issue by avoiding exposing UB buf… (#10842)
    • WAR for popen in QA env (#10989)
    • Fix Eagle3 draft model weight loading for throughput checkpoint (#11010)
    • Respect CUDA_LAUNCH_BLOCKING by fixing doCheckError (#11261)
    • Avoid reserved filename on Windows (#11382)
    • Fix tinygemm accuracy (#11411)
    • Disable cutedsl argmax kernel to fix perf regression (#11403)
    • Fix deepeplowlatency with trtllm moe backend running fp8 DS_R1 (#11266)
    • Gracefully terminate disagg serving servers to prevent leftover subprocess warnings (#11395)
    • Update lm_eval to 4.9.10 and re-enable Skip Softmax Attention tests on CI (#11176)
    • Remove overlap scheduler adjustment for max sequence length in create_py_executor function (#9229)
    • Fix out-of-bounds array access in kernel factory Get() methods (#11373)
    • Fix a bug in PR11336 (#11439)
    • Fix GLM engine build dtype (#11246)
    • Enable warmup for Helix CP (#11460)
    • Pre-Allocation for Auto-Tuning NCCL_SYMMETRIC (#11326)
    • Make NVML work with older CUDA driver versions (#11465)
    • Fall back to triton_ssm for nvfp4 quantization (#11456, #11455)
    • Fix CUDA OOM error (#11219)
  • Documentation

    • Add CLAUDE.md and AGENTS.md (#11358)
    • Add multiple-instances section in disaggregated serving doc (#11412)
    • Update Skip Softmax attention blog (#11443)
    • Add SECURITY.md file to TensorRT-LLM GitHub (#11484)
    • Enable Deepwiki docs (#11492)
  • Benchmark

    • Add microbench for MoE Comm methods (#10317)
    • Enhance multi-GPU tests for IFB stats (#11239)
    • Add DGX-Spark multinode perf cases including eagle3 (#11184)
    • Add gpt-oss-120b-Eagle3-throughput case on DGX-Spark (#11419)
  • Test & Infra

    • Move test_trtllm_flashinfer_symbol_collision.py to tests/unittest/_torch (#11168)
    • Fix missing test cases (#10881)
    • Update test constraint (#11054)
    • Add CODEOWNERS coverage for serve/ and commands/ directories (#11359)
    • Update model list (#11364)
    • Unit test for disagg gen cancellation (#11108)
    • Disable spark stages due to migration of spark cloud (#11401)
    • Enable Spark CI since Spark cloud migration is done (#11407)
    • Upload unittest sub results in slurm (#10834)
    • Remove obsolete code (#11388)
    • Fix the testcase name in timeout xml (#9781)
    • Use frontend dgx-h100 and b200 slurm platforms (#11251)
    • Update allowlist 2026-02-10 (#11426)
    • Lock FI version to 0.6.3 (#11371)
    • Pin the torchao version (#11444)
    • Refactor finish reasons tests (#11445)
    • Move DeepEPLowLatency tests to machines that support IBGDA with GPU handles (#11178)
    • Refactor MoE unit tests: add unified ConfigurableMoE test framework (#11437)
    • Use weakref in atexit handler (#11476)
    • Improve assert in sampler (#11475)
    • Update allowlist 2026-02-13 (#11512)
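
One of the fixes above, "Respect CUDA_LAUNCH_BLOCKING by fixing doCheckError" (#11261), relates to a standard CUDA runtime debugging switch: when `CUDA_LAUNCH_BLOCKING=1` is set before the CUDA context is created, every kernel launch runs synchronously, so an asynchronous CUDA error is reported at the launch that caused it rather than at some later, unrelated call. A minimal sketch of how a process would opt in (the `is_launch_blocking_enabled` helper is hypothetical, for illustration only, and not TensorRT-LLM API):

```python
import os

# Must be set before the CUDA runtime is initialized
# (i.e. before importing torch or creating any CUDA context);
# setting it afterwards has no effect on the existing context.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

def is_launch_blocking_enabled() -> bool:
    # Hypothetical helper mirroring the kind of check a runtime
    # might do: treat the string "1" as enabled, anything else as off.
    return os.environ.get("CUDA_LAUNCH_BLOCKING", "0") == "1"
```

Synchronous launches are much slower, so this is a debugging aid only, not a production setting.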

What's Changed

  • [None][infra] Waive failed case for main branch on 02/09 by @EmmaQiaoCh in #11369
  • [None][chore] Move test_trtllm_flashinfer_symbol_collision.py to tests/unittest/_torch by @yihwang-nv in #11168
  • [None][chore] Add microbench for MoE Comm methods. by @bobboli in #10317
  • [https://nvbugs/5829097][fix] Disaggregated serving: Only send finished context requests to the KV cache transceiver by @Funatiq in #11354
  • [None][test] Enhance multi-GPU tests for IFB stats by @Funatiq in #11239
  • [https://nvbugs/5834212][chore] unwaive test_disaggregated_mixed by @reasonsolo in #11372
  • [#10780][feat] AutoDeploy: Support per-expert scales in FP8 and NVFP4 MoE by @galagam in #11322
  • [TRTLLM-10030][perf] avoid syncs in beam search + other improvements by @ixlmar in #11349
  • [None][chore] Mass integration of release/1.2 - 3rd by @dominicshanshan in #11308
  • [None][fix] Respect CUDA_LAUNCH_BLOCKING by fixing doCheckError by @hnover-nv in #11261
  • [TRTLLM-10866][feat] implement disaggregated harmony chat by @reasonsolo in #11336
  • [None][infra] AutoDeploy: Dump graph IR after every transform by @bmarimuthu-nv in #11045
  • [None][chore] update model list by @tcherckez-nvidia in #11364
  • [None][chore] Unit test for disagg gen cancellation by @pcastonguay in #11108
  • [https://nvbugs/5853997][chore] Unwaive gpt-oss test by @mikeiovine in #11287
  • [TRTLLM-10321][feat] Support different KV cache layout for one-model spec dec by @ziyixiong-nv in #10502
  • [https://nvbugs/5855540][fix] AutoDeploy: thread cleanup of eagle test by @lucaslie in #11289
  • [None][chore] Reduce attention module repeated warnings. by @yuxianq in #11335
  • [https://nvbugs/5843112][chore] Unwaive ngram test by @mikeiovine in #11320
  • [None][test] Add DGX-Spark multinode perf cases by @JennyLiu-nv in #11184
  • [None][fix] Avoid reserved filename on Windows by @tongyuantongyu in #11382
  • [None][infra] Disable spark stages due to migration of spark cloud by @EmmaQiaoCh in #11401
  • [TRTC-265][chore] Add CODEOWNERS coverage for serve/ and commands/ directories by @venkywonka in #11359
  • [#11032][feat] MLA revisited and GLM 4.7 Flash support by @lucaslie in #11324
  • [TRTC-264][doc] Add CLAUDE.md and AGENTS.md by @venkywonka in #11358
  • [None][chore] Mass merge commits from release/1.2.0rc6.post1 branch by @longlee0622 in #11384
  • [TRTLLM-9771][feat] Make update_weights compatible with CUDA Graph by @shuyixiong in #11267
  • [None][infra] Enable Spark CI since Spark cloud migration is done by @EmmaQiaoCh in #11407
  • [None][doc] add multiple-instances section in disaggregated serving doc by @reasonsolo in #11412
  • [None][feat] Fully non-blocking pipeline parallelism executor loop. by @yuxianq in #10349
  • [None][infra] Waive failed cases for main branch on 02/10 by @EmmaQiaoCh in #11413
  • [None][chore] Unwaive tests after last MI by @dominicshanshan in #11400
  • [TRTLLM-10331][infra] Upload unittest sub results in slurm by @yiqingy0 in #10834
  • [https://nvbugs/5791242][chore] remove obsolete code by @ixlmar in #11388
  • [None][fix] fix tinygemm accuracy by @bo-nv in #11411
  • [https://nvbugs/5853720][fix] Disable cutedsl argmax kernel to fix perf regression by @chenfeiz0326 in #11403
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #11363
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #11396
  • [TRTLLM-9711][infra] Fix the testcase name in timeout xml by @yiqingy0 in #9781
  • [https://nvbugs/5848377][fix] fix deepeplowlatency with trtllm moe backend running fp8 DS_R1 by @leslie-fang25 in #11266
  • [TRTLLM-10273][feat] Move MambaCacheManager from Python to C++ by @Tabrizian in #10540
  • [TRTLLM-10030][perf] pin host memory and batch sampler setup in beam search by @ixlmar in #11390
  • [None][infra] Use frontend dgx-h100 and b200 slurm platforms by @mlefeb01 in #11251
  • [None][chore] Update allowlist 2026-02-10 by @tburt-nv in #11426
  • [#11203][feat] AutoDeploy: Refactor node caching and improve engine build time by @taylor-yb-lee in #11250
  • [https://nvbugs/5868038][fix] Gracefully terminate disagg serving servers to prevent leftover subprocess warnings by @peihu-nv in #11395
  • [https://nvbugs/5810940][fix] Update lm_eval to 4.9.10 and re-enable Skip Softmax Attention tests on CI. by @bobboli in #11176
  • [None][chore] Lock FI version to 0.6.3 by @rosong11 in #11371
  • [None][infra] Waive failed cases for main on 2/11 by @EmmaQiaoCh in #11441
  • [None][doc] Update Skip Softmax attention blog. by @bobboli in #11443
  • [None][feat] Initial PR for trtllm-gen attention backend by @yihwang-nv in #10784
  • [None][infra] Pin the torchao version by @EmmaQiaoCh in #11444
  • [None][feat] Remove the hard code for activation type definition in T… by @nv-guomingz in #11164
  • [None][fix] Remove overlap scheduler adjustment for max sequence length in create_py_executor function by @Funatiq in #9229
  • [None][chore] Merge residual+hidden into layer norm at the end of each NemotronH MTP, and remove a % operation by @hnover-nv in #11406
  • [None][fix] Fix out-of-bounds array access in kernel factory Get() methods by @hnover-nv in #11373
  • [None][chore] Introducing an abstract WaitingQueue interface to decouple the request scheduling logic from specific queue implementations by @lancelly in #11330
  • [TRTLLM-10793][feat] Add BOLT compatible build flags for further experimental usage. by @hyukn in #11297
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #11392
  • [TRTLLM-10858][feat] Multi-image support for EPD disagg by @2ez4bz in #11264
  • [https://nvbugs/5804923][none] unwaive test by @PerkzZheng in #11005
  • [None][fix] glm engine build dtype by @mandroid6 in #11246
  • [TRTLLM-10487][feat] Add user-provided UUID support for multimodal KV cache identification. by @SimengLiu-nv in #11075
  • [None][chore] fix a bug in PR11336 by @reasonsolo in #11439
  • [None][chore] added AutoDeploy nano_v3_multi_device.yaml by @MrGeva in #10845
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #11451
  • [TRTLLM-10030][chore] refactor finish reasons tests by @ixlmar in #11445
  • [https://nvbugs/5808500][chore] Move DeepEPLowLatency tests to machines that support IBGDA with GPU handles by @yuantailing in #11178
  • [https://nvbugs/5832481][test] Add gpt-oss-120b-Eagle3-throughput case on DGX-Spark by @JennyLiu-nv in #11419
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #11447
  • [None][feat] Optimize super-v3 nvfp4 for better perf by @Wanli-Jiang in #11273
  • [https://nvbugs/5810935][test] unwaive RTX 6000 pro tests by @pamelap-nvidia in #11452
  • [TRTLLM-10791][feat] TorchSampler general host time optimization by @hyukn in #11141
  • [None][chore] Bump version to 1.3.0rc4 by @tburt-nv in #11485
  • [https://nvbugs/5888410][fix] Enable warmup for Helix CP by @brb-nv in #11460
  • [None][fix] Pre-Allocation for Auto-Tuning NCCL_SYMMETRIC by @nv-lschneider in #11326
  • [https://nvbugs/5887893][fix] Make NVML work with older CUDA driver versions by @Tabrizian in #11465
  • [TRTINFRA-7648][chore] Add SECURITY.md file to TensorRT-LLM GitHub by @dpitman-nvda in #11484
  • [TRTLLM-9108][feat] refactor MoE unit tests: add unified ConfigurableMoE test framework by @xxi-nv in #11437
  • [None][chore] Waive test blocking pre-merge by @brb-nv in #11498
  • [#11455][fix] Fallback to triton_ssm for nvfp4 quantization by @galagam in #11456
  • [None][infra] Waive failed test in Post-Merge by @yuanjingx87 in #11491
  • [TRTLLM-10030][chore] use weakref in atexit handler by @ixlmar in #11476
  • [https://nvbugs/5847284][fix] fix cuda oom error by @reasonsolo in #11219
  • [None][docs] enable Deepwiki docs by @venkywonka in #11492
  • [TRTLLM-10030][chore] improve assert in sampler by @ixlmar in #11475
  • [None][chore] Update allowlist 2026-02-13 by @dpitman-nvda in #11512
  • [TRTLLM-10329][feat] Fix weight loading for Nemotron 3 models on DGX Spark by @pamelap-nvidia in #11405
  • [TRTLLM-10612][feat] Initial support of AIGV models in TRTLLM by @chang-l in #11462

New Contributors

Full Changelog: v1.3.0rc3...v1.3.0rc4