Releases: NVIDIA/TensorRT-LLM

v1.3.0rc4

17 Feb 21:04
26901e4

Pre-release

Highlights

  • Model Support

    • Add EPD disagg support for Qwen3 VL MoE (#10962)
    • MLA revisited and GLM 4.7 Flash support (#11324)
    • Initial support for AIGV models in TRTLLM (#11462)
    • Fix weight loading for Nemotron 3 models on DGX Spark (#11405)
  • API

    • Add user-provided UUID support for multimodal KV cache identification (#11075; a hedged usage sketch follows this list)
  • Feature

    • Support GB200 and increase disagg test timeout (#11019)
    • Avoid syncs in beam search and other improvements (#11349)
    • Implement disaggregated harmony chat (#11336)
    • Support different KV cache layout for one-model spec dec (#10502)
    • Reduce attention module repeated warnings (#11335)
    • Make update_weights compatible with CUDA Graph (#11267)
    • Fully non-blocking pipeline parallelism executor loop (#10349)
    • Move MambaCacheManager from Python to C++ (#10540)
    • Pin host memory and batch sampler setup in beam search (#11390)
    • Initial PR for trtllm-gen attention backend (#10784)
    • Remove hard-coded activation type definition in the TRTLLM MoE backend (#11164)
    • Merge residual+hidden into layer norm at the end of each NemotronH MTP, and remove a % operation (#11406)
    • Introduce an abstract WaitingQueue interface to decouple the request scheduling logic from specific queue implementations (#11330)
    • Add BOLT compatible build flags for further experimental usage (#11297)
    • Multi-image support for EPD disagg (#11264)
    • Optimize NemotronH model with elementwise and nvfp4 fusion (#11273)
    • TorchSampler general host time optimization (#11141)
  • Fix

    • Disaggregated serving: Only send finished context requests to the KV cache transceiver (#11354)
    • Replace etcd3 with etcd-sdk-python (#10886)
    • Fix offset calculation in _are_stop_words when using speculative decoding (#10854)
    • Fix hang issue by avoiding exposing UB buf… (#10842)
    • WAR for popen in QA env (#10989)
    • Fix Eagle3 draft model weight loading for throughput checkpoint (#11010)
    • Respect CUDA_LAUNCH_BLOCKING by fixing doCheckError (#11261)
    • Avoid reserved filename on Windows (#11382)
    • Fix tinygemm accuracy (#11411)
    • Disable cutedsl argmax kernel to fix perf regression (#11403)
    • Fix DeepEPLowLatency with the TRTLLM MoE backend running FP8 DS_R1 (#11266)
    • Gracefully terminate disagg serving servers to prevent leftover subprocess warnings (#11395)
    • Update lm_eval to 4.9.10 and re-enable Skip Softmax Attention tests on CI (#11176)
    • Remove overlap scheduler adjustment for max sequence length in create_py_executor function (#9229)
    • Fix out-of-bounds array access in kernel factory Get() methods (#11373)
    • Fix a bug in PR11336 (#11439)
    • Fix GLM engine build dtype (#11246)
    • Enable warmup for Helix CP (#11460)
    • Pre-Allocation for Auto-Tuning NCCL_SYMMETRIC (#11326)
    • Make NVML work with older CUDA driver versions (#11465)
    • Fallback to triton_ssm for nvfp4 quantization (#11456, #11455)
    • Fix CUDA OOM error (#11219)
  • Documentation

    • Add CLAUDE.md and AGENTS.md (#11358)
    • Add multiple-instances section in disaggregated serving doc (#11412)
    • Update Skip Softmax attention blog (#11443)
    • Add SECURITY.md file to TensorRT-LLM GitHub (#11484)
    • Enable Deepwiki docs (#11492)
  • Benchmark

    • Add microbench for MoE Comm methods (#10317)
    • Enhance multi-GPU tests for IFB stats (#11239)
    • Add DGX-Spark multinode perf cases including eagle3 (#11184)
    • Add gpt-oss-120b-Eagle3-throughput case on DGX-Spark (#11419)
  • Test & Infra

    • Move test_trtllm_flashinfer_symbol_collision.py to tests/unittest/_torch (#11168)
    • Fix missing test cases (#10881)
    • Update test constraint (#11054)
    • Add CODEOWNERS coverage for serve/ and commands/ directories (#11359)
    • Update model list (#11364)
    • Unit test for disagg gen cancellation (#11108)
    • Disable spark stages due to migration of spark cloud (#11401)
    • Enable spark CI since spark cloud migration is done (#11407)
    • Upload unittest sub results in slurm (#10834)
    • Remove obsolete code (#11388)
    • Fix the testcase name in timeout xml (#9781)
    • Use frontend dgx-h100 and b200 slurm platforms (#11251)
    • Update allowlist 2026-02-10 (#11426)
    • Lock FI version to 0.6.3 (#11371)
    • Pin the torchao version (#11444)
    • Refactor finish reasons tests (#11445)
    • Move DeepEPLowLatency tests to machines that support IBGDA with GPU handles (#11178)
    • Refactor MoE unit tests: add unified ConfigurableMoE test framework (#11437)
    • Use weakref in atexit handler (#11476)
    • Improve assert in sampler (#11475)
    • Update allowlist 2026-02-13 (#11512)
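
A hedged sketch of the user-provided UUID support for multimodal KV cache identification (#11075), referenced above. This is an illustration only: the multi_modal_uuids field name and input shape are assumptions, so consult the PR for the actual interface.

```python
# Hypothetical sketch of #11075: tagging a multimodal input with a stable,
# user-provided UUID so the KV cache can recognize the same image across
# requests instead of rehashing its contents. The "multi_modal_uuids"
# field name is an assumption for illustration.
from PIL import Image

from tensorrt_llm import LLM

llm = LLM(model="Qwen/Qwen2.5-VL-7B-Instruct")  # placeholder model
image = Image.open("photo.png")

output = llm.generate({
    "prompt": "Describe the image.",
    "multi_modal_data": {"image": [image]},
    # Reusing the same identifier across requests should allow KV cache
    # reuse for this image (hypothetical field).
    "multi_modal_uuids": {"image": ["photo-uuid-0001"]},
})
print(output.outputs[0].text)
```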

What's Changed

  • [None][infra] Waive failed case for main branch on 02/09 by @EmmaQiaoCh in #11369
  • [None][chore] Move test_trtllm_flashinfer_symbol_collision.py to tests/unittest/_torch by @yihwang-nv in #11168
  • [None][chore] Add microbench for MoE Comm methods. by @bobboli in #10317
  • [https://nvbugs/5829097][fix] Disaggregated serving: Only send finished context requests to the KV cache transceiver by @Funatiq in #11354
  • [None][test] Enhance multi-GPU tests for IFB stats by @Funatiq in #11239
  • [https://nvbugs/5834212][chore] unwaive test_disaggregated_mixed by @reasonsolo in #11372
  • [#10780][feat] AutoDeploy: Support per-expert scales in FP8 and NVFP4 MoE by @galagam in #11322
  • [TRTLLM-10030][perf] avoid syncs in beam search + other improvements by @ixlmar in #11349
  • [None][chore] Mass integration of release/1.2 - 3rd by @dominicshanshan in #11308
  • [None][fix] Respect CUDA_LAUNCH_BLOCKING by fixing doCheckError by @hnover-nv in #11261
  • [TRTLLM-10866][feat] implement disaggregated harmony chat by @reasonsolo in #11336
  • [None][infra] AutoDeploy: Dump graph IR after every transform by @bmarimuthu-nv in #11045
  • [None][chore] update model list by @tcherckez-nvidia in #11364
  • [None][chore] Unit test for disagg gen cancellation by @pcastonguay in #11108
  • [https://nvbugs/5853997][chore] Unwaive gpt-oss test by @mikeiovine in #11287
  • [TRTLLM-10321][feat] Support different KV cache layout for one-model spec dec by @ziyixiong-nv in #10502
  • [https://nvbugs/5855540][fix] AutoDeploy: thread cleanup of eagle test by @lucaslie in #11289
  • [None][chore] Reduce attention module repeated warnings. by @yuxianq in #11335
  • [https://nvbugs/5843112][chore] Unwaive ngram test by @mikeiovine in #11320
  • [None][test] Add DGX-Spark multinode perf cases by @JennyLiu-nv in #11184
  • [None][fix] Avoid reserved filename on Windows by @tongyuantongyu in #11382
  • [None][infra] Disable spark stages due to migration of spark cloud by @EmmaQiaoCh in #11401
  • [TRTC-265][chore] Add CODEOWNERS coverage for serve/ and commands/ directories by @venkywonka in #11359
  • [#11032][feat] MLA revisited and GLM 4.7 Flash support by @lucaslie in #11324
  • [TRTC-264][doc] Add CLAUDE.md and AGENTS.md by @venkywonka in #11358
  • [None][chore] Mass merge commits from release/1.2.0rc6.post1 branch by @longlee0622 in #11384
  • [TRTLLM-9771][feat] Make update_weights compatible with CUDA Graph by @shuyixiong in #11267
  • [None][infra] Enable spark ci since spark cloud migration is done by @EmmaQiaoCh in #11407
  • [None][doc] add multiple-instances section in disaggregated serving doc by @reasonsolo in #11412
  • [None][feat] Fully non-blocking pipeline parallelism executor loop. by @yuxianq in #10349
  • [None][infra] Waive failed cases for main branch on 02/10 by @EmmaQiaoCh in #11413
  • [None][chore] Unwaive tests after last MI by @dominicshanshan in #11400
  • [TRTLLM-10331][infra] Upload unittest sub results in slurm by @yiqingy0 in #10834
  • [https://nvbugs/5791242][chore] remove obsolete code by @ixlmar in #11388
  • [None][fix] fix tinygemm accuracy by @bo-nv in #11411
  • [https://nvbugs/5853720][fix] Disable cutedsl argmax kernel to fix perf regression by @chenfeiz0326 in #11403
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #11363
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #11396
  • [TRTLLM-9711][infra] Fix the testcase name in timeout xml by @yiqingy0 in #9781
  • [https://nvbugs/5848377][fix] fix deepeplowlatency with trtllm moe backend running fp8 DS_R1 by @leslie-fang25 in #11266
  • [TRTLLM-10273][feat] Move MambaCa...

v1.3.0rc3

12 Feb 19:48
b464c75

Pre-release

Highlights

Model Support
  - Support LoRA BF16 checkpoints with Llama 3.3-70B FP8 (#9808)
  - Add Eagle3 support for Nemotron H (#11131)
  - Enhance support for complex models (#11254)

API
  - Allow overriding quantization configs (#11062)
  - Set continuous_usage_stats default to False to follow OpenAI protocol (#10644; see the opt-in sketch after this list)
  - Set max_num_tokens_in_buffer default based on max_seq_len/max_input_len (#11082)

Feature
  - Export ONNX for DriveOS LLM (#10117)
  - Add L2 norm pattern matcher and fusion transform (#10767)
  - Add PDL support for moeAlltoAllKernels (#10591)
  - Integrate KVCacheManager V2 into TRTLLM runtime (#10659)
  - Integrate cuda.tile RMS norm kernels (#9725)
  - Refactor request fetching logic for better separation of concerns (#10988)
  - Implement gen-first disagg_service (#11020)
  - Support disagg SLURM job rescheduling (#11218)
  - Improve layer classification for sharding (#10718)
  - Add priority-based KV cache offload filtering (#10751)
  - Optimize beam search performance (remove GPU sync, fix batching, refactor) (#11276)
  - Avoid sync in PyTorchModelEngine when using beam search (#11341)
  - Adjust DeepGEMM tuning buckets for larger num_tokens scope (#11259)
  - Add CuteDSL FP8 GEMM for Blackwell (#10130)
  - Reduce host memory usage during model loading (#11119)
  - Perfect routing for Deepseek models (#11127)
  - Modularize transceiver for KV manager v2 (step 4) (#11225)

Fix
  - Fix AttributeError with return_perf_metrics on TensorRT backend (#10662)
  - Prevent routing context and generation requests to the same worker; document unique disagg ID (#11095)
  - Prevent out-of-bounds read (#10868)
  - Add __syncthreads() to TinyGEMM to resolve intermittent accuracy issues (#10873)
  - Fix PD disaggregation for VLMs that use mrope (#10865)
  - Always reset drafting states for GuidedDecoder (#10899)
  - Use NCCL as fallback to avoid crash due to insufficient memory (#10928)
  - Fix llama sm120 spec decoding (#10765)
  - Fix MTP one-model sampler (#10369)
  - Align kv_scales with ModelOpt HF checkpoint (#10745)
  - Fix selective_state_update perf regression for T=1 decode path (#11194)
  - Make health_generate work with beam search (#11097)
  - Work around accuracy issue by enforcing paged_context_fmha on Hopper for fmha_v2 (#11192)
  - Fix CuteDSL argmax on sm120 (#11181)
  - Fix amax to avoid NaN issue in fp8_blockscale_gemm_kernel (#11256)
  - Fix VSWA initialization with spec-dec and boundary condition in context input preparation (#10798)
  - Fix partial reuse disabled for disagg (#11247)
  - Retake ownership of mrope tensors in prefill worker (#11217)
  - Fix proto-to-SamplingParams conversion bugs and add gRPC tests (#11292)
  - Fix accuracy drop in VSWA with KV cache block reuse (#10875)

Documentation
  - Add Glm4MoeForCausalLM to model support matrix (#11156)
  - Fix GLM4-MoE Eagle support documentation (#11198)
  - Add CUDA Graph + LoRA to feature combination matrix (#11187)
  - Fix comments for KV cache manager v2 (#11207)
  - Skip Softmax Attention blog and docs (#10592)
  - Add sparse attention docs to index (#11342)

Test & Infra
  - Update GB200 test configs to use frontend SLURM platforms (#11085)
  - Fix jaraco-context and wheel vulnerability (#10901)
  - Add --high-priority in bot help message (#11133)
  - Print memory usage before/after accuracy test in CI (#11155)
  - Fix mocking of HuggingFace downloads in with_mocked_hf_download (#11200)
  - Set rerun report stage UNSTABLE and pipeline SUCCESS when rerun tests pass (#11210)
  - Move 6x H100 test stage to AIHub platform (#11039)
  - Add disagg perf tests (#10912)
  - Provide uniform test framework to test all MoE backends (#11128)
  - Move disagg scripts env configs from bash to submit.py (#10223)
  - Use free port for serve test (#10878)
  - Fix test_auto_scaling for 2 GPUs (#10866)
  - Update test list (#10883)
  - Fix an invalid test name (#11195)
  - Refine QA test list for SM120 (#11248)
  - Fix multimodal serve test (#11296)
  - Pass without_comm to Cutlass and DeepGEMM (#11229)
  - Promote SampleState to TypeVar and fix typing (#11281)
  - Fix bench script test (#10483)
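
Note on the continuous_usage_stats change (#10644) highlighted above: streaming clients that relied on per-chunk usage must now opt in explicitly. A minimal sketch, assuming a trtllm-serve instance on localhost:8000; the stream_options field name follows the PR title and may differ in the released API.

```python
# Sketch: opting back into per-chunk usage stats after #10644 changed the
# default to False to match the OpenAI protocol. Assumes trtllm-serve is
# running on localhost:8000; field names may differ in the released API.
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "my-model",  # placeholder
        "messages": [{"role": "user", "content": "Hello"}],
        "stream": True,
        # Usage is no longer attached to every chunk by default; request
        # it explicitly if your client relies on it.
        "stream_options": {"continuous_usage_stats": True},
    },
    stream=True,
)
for line in resp.iter_lines():
    if line:
        print(line.decode())
```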

What's Changed

  • [None][feat] Export ONNX for DriveOS LLM by @nvyocox in #10117
  • [#9525][feat] add L2 norm pattern matcher and fusion transform by @karthikvetrivel in #10767
  • [TRTINFRA-7548][infra] Update GB200 test configs to use frontend SLURM platforms by @mlefeb01 in #11085
  • [None][doc] Add Glm4MoeForCausalLM to model support matrix by @venkywonka in #11156
  • [None][feat] Perfect routing for Deepseek models by @brb-nv in #11127
  • [TRTLLM-10398][feat] Enable TRTLLM moe backend for Nemotron Super by @nv-guomingz in #10791
  • [#8242][feat] Add int4 GPTQ support for AutoDeploy by @Fridah-nv in #8248
  • [https://nvbugs/5804683][infra] unwaive Mistral Large3 test by @byshiue in #10680
  • [TRTLLM-9771][feat] Allow overriding quantization configs by @shuyixiong in #11062
  • [None][ci] Waive a flaky test on A10 by @chzblych in #11163
  • [None][infra] Waive failed cases for main on 1/30 by @EmmaQiaoCh in #11142
  • [None][fix] AttributeError with return_perf_metrics on tensorrt backend by @riZZZhik in #10662
  • [https://nvbugs/5834212][fix] prevent routing ctx and gen requests to the same worker; update doc for unique disagg ID by @reasonsolo in #11095
  • [TRTLLM-10666][chore] Refactor request fetching logic for better separation of concerns by @lancelly in #10988
  • [https://nvbugs/5823284][fix] Unwaive no repro hang issue by @liji-nv in #11138
  • [None] [feat] Add PDL support for moeAlltoAllKernels by @kaiyux in #10591
  • [None][infra] Waive failed cases and disable a stage on 02/02 by @EmmaQiaoCh in #11177
  • [TRTLLM-9766][feat] Integration of the KVCacheManager V2 to TRTLLM Runtime by @yizhang-nv in #10659
  • [None][chore] Mass integration of release/1.2 - 2nd by @dominicshanshan in #11088
  • [None][feat] Integrate cuda.tile RMS norm kernels by @lirundong in #9725
  • [None][test] Fix an invalid test name by @chzblych in #11195
  • [None][feat] Nemotron H: Eagle3 support by @IzzyPutterman in #11131
  • [#10826][feat] AutoDeploy: Eagle One-Model [2/n]: Prefill-Only Implementation by @govind-ramnarayan in #11073
  • [None][doc] Fix GLM4-MoE Eagle support documentation by @venkywonka in #11198
  • [TRTLLM-10561][infra] Fix jaraco-context and wheel vulnerability by @yiqingy0 in #10901
  • [TRTLLM-10307][infra] Add --high-priority in bot help message by @mzweilz in #11133
  • [None][chore] Print memory usage before/after accuracy test in CI by @taylor-yb-lee in #11155
  • [TRTLLM-10803][fix] Fix mocking of HuggingFace downloads in with_mocked_hf_download by @anish-shanbhag in #11200
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #11193
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #11202
  • [TRTLLM-10839][infra] Set rerun report stage UNSTABLE and pipeline SUCCESS in post-merge when there are passed rerun tests by @yiqingy0 in #11210
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #11216
  • [None][fix] Align kv_scales with modelopt HF checkpoint by @cjluo-nv in #10745
  • [https://nvbugs/5739981][fix] unwaive tests using opt-125M by @ixlmar in #11100
  • [TRTLLM-10019][infra] Move 6 h100 test stage to aihub platform by @yuanjingx87 in #11039
  • [TRTLLM-8921][feat] implement gen-first disagg_service by @reasonsolo in #11020
  • [#11086][feat] Optimize Auto Deploy weight loading by preloading weights to CPU by @taylor-yb-lee in #11059
  • [None][fix] Set continuous_usage_stats default to False to follow OpenAI protocol by @riZZZhik in #10644
  • [None][chore] bump version to 1.3.0rc3 by @tburt-nv in #11238
  • [TRTLLM-8263][feat] Add Disagg Perf Tests by @chenfeiz0326 in #10912
  • [None][fix] Fix selective_state_update perf regression for T=1 decode path by @galagam in #11194
  • [TRTLLM-9111][feat] provide the uniform test framework to test all MoE backends by @xxi-nv in #11128
  • [None][fix] make health_generate work with beam search by @ixlmar in https://github.com/NVIDIA/TensorRT...

v1.2.0rc6.post3

05 Feb 02:36
7c6df0e

Pre-release

What's Changed

  • [https://nvbugs/5850094][fix] Fix MoE cost estimation for auto multi-stream scheduling by @yizhang-nv in #11160
  • [None][feat] update TRT-LLM Gen DS FP8 MoE cubins and optimize finalize kernel by @nekorobov in #11104
  • [None][chore] Bump version to 1.2.0rc6.post3 by @yiqingy0 in #11224
  • [None][fix] Fallback to NCCL instead of NCCL symmetric by @Tabrizian in #11174
  • [None][feat] fuse shared to sparse experts in TRT-LLM Gen MoE by @nekorobov in #11143

Full Changelog: v1.2.0rc6.post2...v1.2.0rc6.post3

v1.2.0rc2.post2

05 Feb 02:25
910c070

Pre-release

What's Changed

Full Changelog: v1.2.0rc2.post1...v1.2.0rc2.post2

v1.3.0rc2

03 Feb 19:31
f42a6cb

Pre-release

Highlights

  • Known Issues

    • On RTX6000D, one might encounter an "Instruction 'redux.f32' not supported" error. This issue will be resolved in the next release.
  • Model Support

    • Enable MTP for Nemotron Super (#10754)
    • Make TRTLLM MoE the default for GPTOSS on Blackwell (#11074)
    • Add missing absolute position embeddings in Qwen3-VL vision encoder (#11065)
  • API

    • Change context params and disagg params (#10495)
    • Add KVCacheManagerV2 APIs for Transceiver (#11003)
  • Feature

    • Add Skip Softmax MLA kernels for Blackwell and fix NVFP4 KV accuracy bug (#10813)
    • Fuse AllGather for expert statistics required by EPLB (#10885)
    • Add first-iteration streaming for GPT-OSS in trtllm-serve (#10808)
    • Integrate CuteDSL argmax kernel (#10476)
    • Update Mamba decode kernel to FlashInfer (#10757)
    • Improve effective memory bandwidth with TMA.RED (#10987)
    • Reorganize AutoTuner cache file for distributed tuning (#10956)
    • Support attention DP + Helix CP (#10477)
    • Improve performance of _write_finish_reasons in TorchSampler (#10459)
    • Add gRPC server for high-performance external router integration (#11037)
    • Prepare for future KVCacheV2 MTP support (#11029)
  • Fix

    • Fix CuteDSL MoE unit test (#10983)
    • Fix overlap scheduler pause() timing (#10943)
    • Fix Pydantic deepcopy bug (#11004)
    • Restore IPv6 support in serve.py (#10929)
    • Fix conditional compilation for sm10x cubins (#10839)
    • Add graceful fallbacks for NCCL symmetric mode (#11042)
    • Fix enable_alltoall passed to CutlassFusedMoE (#11016)
    • Fix kvCacheManager isLeaf() assertion failure (#10922)
    • Add null pointer check to parseNpyHeader (#10944)
    • Fix attention DP scheduling sort order to prioritize non-relaxed requests (#11106)
  • Documentation

    • Update Qwen2/3-VL models in supported_models.md (#10797)
  • Benchmark

    • Add performance alignment to layer-wise benchmarks (#11018)
    • Clean up layer-wise benchmarks code (#11092)
    • Add DGX-Spark VLM gemma3-12b bf16/fp4/fp8 accuracy and perf cases (#11096)
  • Test & Infra

    • Add 250K-token NVFP4 MoE + PDL regression tests (#10911)
    • Add timeout for SeedOSS test (#8683)
    • Add Fake Ops for one-sided AlltoAll (#11002)
    • Refactor setup for RNN cache transceiver (#10957)
    • Change SLURM config access to use resolvePlatform (#11006)
    • Update CI allowList (#11040)
    • Add Mamba and MLA layers to sharding tests (#10364)
    • Remove pybind11 bindings and references (#10550, #11026)
    • Add multi-acc and Lyris GB200 test support (#11024)
    • Package triton-kernels as a dependency (#10471)
    • Fix Qwen3 Eagle test (#11030)
    • Dump thread stacks for hanging tests before timeout (#10708)
    • Remove -ccache from build_wheel.py args (#11064)
    • Fix trtllm-serve guided decoding test (#11101)
    • Remove invalid account for Blossom CI (#11126)
    • Add source code pulse scan to PLC nightly pipeline (#10961)


v1.3.0rc1

27 Jan 09:32
45d7022

Pre-release

Highlights

  • Model Support

    • GLM-4.5-Air support (#10653)
    • K-EXAONE MTP support (#10796)
  • API

    • Refactor AutoDeployConfig into LlmArgs (#10613)
    • Support model_kwargs for pytorch backend (#10351)
  • Feature

    • Update disagg slurm scripts (#10712)
    • Re-implement MicroBatchScheduler and CapacityScheduler in Python (#10273)
    • Fix sharding dashboard errors (#10786)
    • Async Transfer Manager (#9891)
    • Speculative One Model: FlashInfer sampling (#10284)
    • Refactor speculative decoding workers (#10768)
    • Use global unique id as disagg request id (#10187)
    • Enable guided decoding with reasoning parsers (#10890)
    • Support partial update weight for fp8 (#10456)
    • Multi-LoRA serving with CUDA Graph (#8279)
    • Support logprobs for Completions API (#10809; see the client sketch after this list)
    • Eagle3 Specdec UX improvements (#10124)
    • Python transceiver components (step 2) (#10494)
    • Upgrade NIXL to v0.9.0 (#10896)
    • KV Connector Support for MTP (#10932)
    • Support overlap scheduler for disagg ctx instances (#10755)
    • Adding implementation of KVCacheManagerV2 (#10736)
    • Switch to ConfigurableMoE as the default path (#10792)
  • Fix

    • Enable system memory to transfer active message in NIXL ucx (#10602)
    • Fix the potential misaligned access due to vectorized ld/st instructions in NVLinkOneSided A2A (#10539)
    • Default disable gemm+allreduce fusion (#10656)
    • Fix urllib3 and nbconvert vulnerabilities (#10551)
    • Fix overlap scheduler race condition (#10610)
    • Replace pickle.load with restricted Unpickler (#10622)
    • Fix copy start_logs in disagg slurm scripts (#10840)
    • Cherry-pick: Disable short profile for tunable ops with MERGE strategy (#10844, #10715)
    • Lock resource to fix potential access to released data (#10827)
    • Cherry-pick: Fix accuracy issue of TWO-SHOT AllReduce kernel (#10841, #10654)
    • Remove weight tensor holder to release memory earlier (#10876)
    • Add missing dist strategy param and fix typo for ad_logger (#10892)
    • Update RMSNorm custom op plumbing (#10843)
    • Fix hmac launch (#10434)
    • Avoid Double update for previous batch (#9888)
    • Re-init TRTLLM sampler to use sample stream in multi-stream cases (#10918)
    • Fix MTP with async scheduler (#10941)
    • Fix buffer reuse (#10716)
    • Cherry-pick: Fix hanging issue for MNNVL Allreduce under PP (#10750, #10633)
    • Workaround for flashinfer.sampling.sampling_from_logits (#10713)
    • Fix port 8000 being used issue in stress test (#10756)
  • Documentation

    • Clarify LoRA is not supported with --use_fp8_rowwise in Fp8RowwiseAttention (see #2603) (#10320)
    • Add NIXL as a Python attribution (step 4) (#10910)
    • 1.2 Release Notes Headers (#10722)
  • Test & Infra

    • Upload regression info to artifactory (#10599)
    • Add sonarqube scanning in lockfile generation pipeline (#10700)
    • Add Nemotron Nano v3 FP8 autodeploy perf test (#10603)
    • Remove trt flow tests in NIM (#10731)
    • Update config.yaml of slurm scripts to align with submit.py change (#10802)
    • Add a timeout in MNNVL throughput to prevent hangs if one rank crashes (#9532)
    • Trigger multi-gpu tests when install_nixl/ucx.sh is modified (#10624)
    • Add DGX-Spark VLM accuracy and perf spec dec cases (#10804)
    • Fix test list llm_spark_func.txt (#10921)
    • Add test configurable moe module multi gpu (#10699)
    • NVFP4 MoE - Move weights transformation to fusion phase (#10803)
    • Update flashinfer-python to 0.6.1 (#10872)
    • Improve disagg acc tests (#10833)
    • Refine placement group in ray executor (#10235)
    • Regenerate outdated lock file (#10940)
    • Remove long-running sanity check tests on GH200 (#10924, #10969)
    • Add dgx-spark beta notes (#10766)
    • Modify ctx config in 128k8k disagg cases (#10779)
    • Balanced random MoE workload generator for CuteDSL kernel UT, autotuner and layerwise benchmark (#10279)
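
For the Completions API logprobs support (#10809) noted above, a minimal client sketch using the standard OpenAI Python client; the endpoint URL and model name are placeholders.

```python
# Sketch: requesting per-token log-probabilities through the
# OpenAI-compatible Completions endpoint (#10809).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
completion = client.completions.create(
    model="my-model",  # placeholder
    prompt="The capital of France is",
    max_tokens=8,
    logprobs=2,  # ask for the top-2 token log-probabilities per position
)
print(completion.choices[0].logprobs)
```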

What's Changed

  • [#10696][fix] AutoDeploy prevent torch.export from specializing batch dimension when max_batch_size=1 by @MrGeva in #10697
  • [None][infra] Add sonarqube scanning in lockfile generation pipeline by @yuanjingx87 in #10700
  • [https://nvbugs/5769712][fix] fix timeout in AutoDeploy llama accuracy test by @lucaslie in #10461
  • [#10688][fix] AutoDeploy Fix CUDA graph batch sizes exceeding max_batch_size by @MrGeva in #10687
  • [#10642][feat] AutoDeploy: optimized canonicalize_graph utilities [1/2] by @lucaslie in #10675
  • [https://nvbugs/5769890][fix] enable system memory to transfer active message in NIXL ucx by @chuangz0 in #10602
  • [https://nvbugs/5814247][fix] unwaive AutoDeploy multi-gpu unit tests by @lucaslie in #10769
  • [TRTLLM-10300][feat] Upload regression info to artifactory by @chenfeiz0326 in #10599
  • [None][chore] Add release/1.2 branch into lockfile generation schedule by @yiqingy0 in #10790
  • [TRTLLM-9581][infra] Use /home/scratch.trt_llm_data_ci in computelab by @ZhanruiSunCh in #10616
  • [None][infra] Waive failed cases for main on 01/19 by @EmmaQiaoCh in #10794
  • [#10607][chore] Add Nemotron Nano v3 FP8 autodeploy perf test by @MrGeva in #10603
  • [None][feat] Update disagg slurm scripts by @qiaoxj07 in #10712
  • [None][test] adjust the dis-agg test timeout threshold by @Shixiaowei02 in #10800
  • [None][chore] docs: clarify LoRA is not supported with --use_fp8_rowwise in Fp8RowwiseAttention (see #2603) by @ssam18 in #10320
  • [None][chore] Remove trt flow tests in NIM by @jieli-matrix in #10731
  • [None][chore] update config.yaml of slurm scripts to align with submit.py change by @dc3671 in #10802
  • [https://nvbugs/5776445][chore] unwaive test by @reasonsolo in #10667
  • [TRTLLM-10029][scheduler] Re-implement MicroBatchScheduler and CapacityScheduler in Python by @lancelly in #10273
  • [TRTLLM-10296][fix] Fix the potential misaligned access due to vectorized ld/st instructions in NVLinkOneSided A2A. by @bobboli in #10539
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #10776
  • [None][fix] default disable gemm+allreduce fusion by @benzh-2025 in #10656
  • [TRTLLM-8638][fix] Add failed cases into waives.txt by @xinhe-nv in #10787
  • [None][fix] Fix vulnerability urllib3 and nbconvert by @yiqingy0 in #10551
  • [None][test] Update sanity test list by @xinhe-nv in #10825
  • [None][fix] Remove unused params in attn by @yizhang-nv in #10652
  • [TRTLLM-10785][feat] Fix sharding dashboard errors by @greg-kwasniewski1 in #10786
  • [https://nvbugs/5701445][chore] unwaive test. by @yuxianq in #10806
  • [None][infra] trigger multi-gpu tests when install_nixl/ucx.sh is mod… by @bo-nv in #10624
  • [None][infra] Waive failed cases for main branch on 01/20 by @EmmaQiaoCh in #10829
  • [None][chore] Reduce tedious logs by @chzblych in #10847
  • [#10707][fix] AutoDeploy: Super accuracy test fixes by @galagam in #10717
  • [None][chore] Async Transfer Manager by @jthomson04 in #9891
  • [None][fix] fix duplicate entry in waives.txt by @lucaslie in #10853
  • [None][feat] Speculative One Model: FlashInfer sampling by @IzzyPutterman in #10284
  • [https://nvbugs/5670108][fix] Fix overlap scheduler race condition in… by @SimengLiu-nv in #10610
  • [https://nvbugs/5760737][test] only skip mooncake+indexerkcache test by @zhengd-nv in #10266
  • [https://nvbugs/5759698][fix] unwaive test_base_worker by @Superjomn in #10669
  • [None][fix] Add a timeout in MNNVL throughput to prevent hangs if one rank crashes by @djns99 in #9532
  • [https://nvbugs/5670458][chore] Unwaive reward model test by @shuyixiong in #10831
  • [None][chore] Revert #10847 by @chzblych in #10869
  • [https://nvbugs/5775021] [fix] Replace pickle.load with restricted Unpickler by @yibinl-nvidia in #10622
  • [None][fix] Fix copy start_logs in disagg slurm scripts by @qiaoxj07 in #10840
  • [None][fix] Cherry-pick #10715: Disable short profile for tunable ops with MERGE strategy by @hyukn in #10844
  • [https://nvbugs/5740377][fix] Lock resource to fix potential access to released data by @HuiGao-NV in #10827
  • [https://nvbugs/5814253][fix] unwaive test_autotuner_di...

v1.2.0rc6.post2

22 Jan 16:50
50379d0

Pre-release

What's Changed

  • [None][fix] enable EPLB for DEEPGEMM by @xxi-nv in #10618
  • [https://nvbugs/5811697][fix] Fix buffer reuse for release/1.2.0rc6.post1 by @yuxianq in #10734
  • [None][fix] impl fused triton kernel for e8m0 resmooth (target release/1.2.0rc6.post1, cherry-pick from #10327 and #10770) by @yuxianq in #10771
  • [None][chore] Bump version to 1.2.0rc6.post2 by @yiqingy0 in #10907

Full Changelog: v1.2.0rc6.post1...v1.2.0rc6.post2

v1.3.0rc0

22 Jan 08:04
0af1a0e

Pre-release

Highlights

  • Model Support

    • Added support for K-EXAONE models (#10355)
    • Integrated MiniMax M2 model (#10532)
    • Added Spark QA functional and performance test cases (#10564)
    • Added support for new Transformers RoPE configuration format (#10636)
    • Support customized sequence length larger than model config (#10600)
  • API Improvements

    • Added processed logprobs functionality to TorchSampler (#9675)
    • Added support for image_embeds in OpenAI API (#9715)
    • Covered LLM API multi_modal_embeddings (#9963)
    • Implemented GET/DELETE v1/responses/{response_id} endpoints (#9937; see the sketch after this list)
    • Use RequestError for validation errors to prevent engine shutdown (#9761)
  • Performance Optimizations

    • Added Hopper XQA decode support for skip softmax attention (#10264)
    • Enabled attention data parallelism for Nemotron Super v3 (#10347)
    • Added fp4 GEMM with AllReduce support (#9729)
    • Use XQA JIT implementation by default with sliding window perf optimization (#10335)
    • Reduced host overhead for unified nvfp4 GEMM tuning path (#10503)
    • Implemented fused Triton kernel for e8m0 resmooth to reduce memory footprint (#10327)
  • MoE (Mixture of Experts) Enhancements

    • Added ExpertStatistic and DUMMY_ALLREDUCE for configurable MoE (#10401)
    • Added test configurable MoE module (#10575)
    • Implemented padding empty chunk for configurable MoE (#10451)
    • Enabled EPLB for DEEPGEMM (#10617)
    • Extended MoE quantization test utilities with comprehensive quant algorithm support (#10691)
  • Disaggregation Features

    • New request states and KV cache transceiver APIs in generation-first disaggregation (#10406)
    • Fixed cancellation with chunked prefill and disaggregation (#10111)
  • Auto Deploy

    • Refactored memory usage logging in AutoDeploy (#8505)
    • Separated RMS pattern detection from fusion (#9969)
    • Auto download speculative models from HuggingFace for PyTorch backend (#10099)
  • Fixes

    • Fixed PP loop hang caused by isend-ing new requests (#10665)
    • Avoided write-write race for async PP send (#10488)
    • Fixed hang issue when enabling skip softmax on Blackwell (#10490)
    • Fixed hanging issue for MNNVL Allreduce under PP (#10633)
    • Implemented PP skip forward for all spec workers (#10578)
    • Added warning for gen-only paused state (#10664)
    • Used uint64_t as dtype of lamport_buffer_size to avoid overflow (#10499)
    • Fixed HelixCpMnnvlMemory initialization with PP (#10533)
    • Fixed regression in KV cache resize memory estimation (#10726)
    • Prevented out-of-bounds read (#9879)
    • Solved pillow version conflict (#10537)
    • Support parsing the modules_to_not_convert keyword of the HF model config (#10527)
    • Used correct model names for config database regression tests (#10192)
    • Support GuidedDecoder with sharded logits (#10698)
    • Fixed Piecewise CUDA Graph for GPTOSS (#10631)
    • Fixed AutoDeploy EP sharding test (#10460)
    • Fixed the nvfp4 fused_moe in AutoDeploy (#10727)
    • Added quantization check for DeepEP LL low precision combine in new MoE comm API (#10072)
    • Fixed AIPerf issue (#10666)
    • Disabled TinyGEMM PDL due to accuracy issues (WAR) (#10619)
    • Keep only a limited amount of performance statistics data (#10569)
    • Convert to CUDA tensor before calling _resmooth_kernel (#10770)
  • Test & Infra

    • Added hang detection for executor loop and worker (#10480)
    • Implemented bot to send performance regression messages to Slack channel (#10489)
    • Made model initialization more general and support weights loading in layer-wise benchmarks (#10562)
    • Updated trtllm-gen to support groupsTokensHeadsQ (#10261)
    • Added support to export data in trtllm-eval (#10075)
    • Added Torch extension API for FusedAddRMSNormQuant kernel (#9905)
    • Enabled ray tests (#10272)
    • Prevented flaky failures in C++ test_e2e.py by using local cached datasets (#10638)
    • Enabled partial reuse in Gemma and GPT OSS test (#10559)
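
A short sketch of the new response retrieval endpoints (#9937) mentioned above, assuming a trtllm-serve instance that exposes the Responses API on localhost:8000; the response id is a placeholder.

```python
# Sketch: retrieving and deleting a stored response by id via the new
# GET/DELETE v1/responses/{response_id} endpoints (#9937).
import requests

base = "http://localhost:8000/v1/responses"
response_id = "resp_123"  # hypothetical id returned by a prior request

# Fetch the stored response.
print(requests.get(f"{base}/{response_id}").json())

# Delete it when no longer needed.
print(requests.delete(f"{base}/{response_id}").status_code)
```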

What's Changed

  • [TRTLLM-10195][feat] K-EXAONE support by @yechank-nvidia in #10355
  • [None][test] update core test list by @crazydemo in #10538
  • [#8391][chore] removed llama and added deepseek to AutoDeploy's L0 perf test by @MrGeva in #10585
  • [TRTLLM-10022][feat] Add hopper xqa decode support for skip softmax attention by @pengbowang-nv in #10264
  • [None][chore] update waive list by @jieli-matrix in #10577
  • [None][feat] Add ExpertStatistic and DUMMY_ALLREDUCE for configurable_moe by @qiaoxj07 in #10401
  • [TRTLLM-10248][feat] Support Bot to Send Perf Regression Msg to Slack Channel by @chenfeiz0326 in #10489
  • [None][chore] update deepseekv3.2 test parameter by @yingguo-trt in #10595
  • [None][test] Remove most TRT-backend test cases in llm_perf_nim.yml by @yufeiwu-nv in #10572
  • [https://nvbugs/5794796][chore] waive test blocking premerge by @dc3671 in #10593
  • [None][fix] Solve pillow version conflict by @Wanli-Jiang in #10537
  • [TRTLLM-9522][test] cover LLM API multi_modal_embeddings by @ixlmar in #9963
  • [None][infra] Waive failed tests for main 01/12 by @EmmaQiaoCh in #10604
  • [#10580][fix] re-enable NemotronH MOE MMLU test by @suyoggupta in #10594
  • [https://nvbugs/5761391][fix] Use correct model names for config database regression tests by @anish-shanbhag in #10192
  • [None][chore] Print correct backend name in benchmark report by @galagam in #10597
  • [https://nvbugs/5689235][fix] Fix cancellation+chunked prefill+disagg by @Tabrizian in #10111
  • [https://nvbugs/5762336][fix] support to parse the keyword modules_to_not_convert of the HF model config by @xxi-nv in #10527
  • [None][chore] Fix disagg assert by @fredricz-20070104 in #10596
  • [TRTLLM-10271][test] Add Spark QA functional and performance cases by @JennyLiu-nv in #10564
  • [None][infra] try removing shared cache dir mount by @tburt-nv in #10609
  • [None][infra] Update allowlist 2026.01.08 by @niukuo in #10535
  • [None][feat] Hang detection for executor loop and worker. by @yuxianq in #10480
  • [TRTLLM-8462][feat] Support GET/DELETE v1/responses/{response_id} by @JunyiXu-nv in #9937
  • [TRTLLM-10060][feat] Enable attention dp for Nemotron Super v3. by @nv-guomingz in #10347
  • [https://nvbugs/5788127][fix] Use uint64_t as the dtype of lamport_buffer_size to avoid overflow by @yilin-void in #10499
  • [NVBUG-5670458][chore] Unwaive lp tests by @hchings in #10524
  • [TRTLLM-8425][doc] document Torch Sampler details by @ixlmar in #10606
  • [None][feat] Layer-wise benchmarks: make model init more general and support weights loading by @yuantailing in #10562
  • [None][test] Unwaive qwen3 next test case. by @nv-guomingz in #9877
  • [None][feat] add fp4 gemm + allreduce by @benzh-2025 in #9729
  • [None][infra] support overriding nspect version by @niukuo in #10402
  • [https://nvbugs/5772396][fix] WAR: Disable TinyGEMM PDL due to accuracy issues by @dongfengy in #10619
  • [None][feat] AutoDeploy: refactor memory usage logging by @nzmora-nvidia in #8505
  • [#9283][feat] AutoDeploy: separate rms pattern detection from fusion by @Fridah-nv in #9969
  • [https://nvbugs/5791900][fix] Fix HelixCpMnnvlMemory init with PP by @brb-nv in #10533
  • [None][chore] Add test configurable moe module by @leslie-fang25 in #10575
  • [https://nvbugs/5781589][fix] Implement pp skip forward for all spec workers. by @yuxianq in #10578
  • [None][fix] Avoid write-write race for async pp send. by @yuxianq in #10488
  • [https://nvbugs/5753788][chore] Padding empty chunk for configurable moe by @leslie-fang25 in #10451
  • [TRTLLM-8638][fix] Add failed cases into waives.txt by @xinhe-nv in #10589
  • [None][chore] update allowlist 2026-01-13 by @tburt-nv in #10645
  • [None][test] add test into qa test list by @xinhe-nv in #10627
  • [None][test] Spark - Change testlist name and perf yml format by @JennyLiu-nv in #10626
  • [None][chore] waive the CI failure by @xxi-nv in #10655
  • [None][refactor] Unify the usage of MPIDist and TorchDist. by @yuxianq in #10380
  • [None][fix] Reduce host over...

v1.2.0rc8

15 Jan 05:40
80649a8

Pre-release

Highlights

  • Model Support

    • Add export patch for GraniteMoe MoE models to enable torch.export compatibility (#10169)
    • Eagle: qwen2 capture hidden states (#10091)
    • Add pp support for DeepSeek-v3.2 (#10449)
    • Pass lora_params through Qwen2/3 model forward (#10174)
    • Fix export for microsoft/Phi-3-medium-128k-instruct (#10455)
    • A few code refinements for Mistral Large 3 (#10405)
    • EPD for Qwen3 VL (#10470)
    • Remove some model support; add device constraint (#10563)
    • Enable AttentionDP on Qwen3-VL and fix test (#10435)
  • API

    • Add stability tags for serve subcommand (#10012)
  • Feature

    • Better align MLA chunking with indexer chunking when chunked prefill is enabled for DSV32 (#10552)
    • Sm100 weight-only kernel (#10190)
    • AutoTuner Cache: Support cache file lock and merge all ranks into one (#10336)
    • Apply AutoTuner to AllReduce Op for strategy tuning (#8531)
    • Add transferAgent binding (step 1) (#10113)
    • Add the EOS tokens in the generation config to the sampler's stop words (#10389; see the sketch after this list)
    • Apply fusion for W4AFP8_AWQ MoE (#9838)
    • Further reduce tuning time for cuteDSL nvFP4 dense gemm (#10339)
    • Run sample_async on extra stream (#10215)
    • Optimize qk rope/nope concat for DSA (#10571)
  • Fix

    • Fix bug of Mistral-Small-3.1-24B-Instruct-2503 (#10394)
    • Use 0 port as arbitrary port when disagg service discovery is enabled (#10383)
    • Fix buffer reuse for CUDA graph attention metadata (#10393)
    • Force release torch memory when LLM is destroyed (#10314)
    • Swap TP-CP grouping order (#10350)
    • TRTLLM MoE maps to lower tuning buckets when ep>1 (#9998)
    • Fix draft token tree chain crash and depth=1 corner case (#10386, #10385)
    • Fixed recursive node traversals (#10379)
    • Fix undefined tokens_per_block (#10438)
    • Skip spec dec for non-last rank (#10445)
    • Setup dist before using autotuner (#10491)
    • Fix broken cast (#9975)
    • Fix sm120 speculation (#10049)
    • Fix mamba_cache_manager when enabling cuda_graph_padding and let test cover this case (#9873)
    • Choose the registered model config over the root config for VLM (#10553)
  • Documentation

    • Update SWA + spec dec support matrix (#10421)
    • Add --config preference over --extra_llm_api_options in CODING_GUIDELINES.md (#10426)
    • Adding parallelism types in feature combination matrix (#9849)
    • Update GPTOSS Doc (#10536)
    • Blog: Optimizing DeepSeek-V3.2 on NVIDIA Blackwell GPUs (#10565)
    • Update Qwen3-Next doc by adding known issues section (#10582)
  • Test & Infra

    • Add tests for DeepSeek v3.2 (#10561)
    • Add accuracy tests for super-v3 with multiple-gpus (#10234)
    • Layer-wise benchmarks: support TEP balance, polish slurm scripts (#10237)
    • Add disag-serving kimi k2 thinking tests (#10357)
    • Partition test_llm_pytorch.py for parallel execution (#10400)
    • Only Use Throughput Metrics to Check Regression (#10404)
    • Add vswa test cases coverage (#10146)
    • Use random port in container port section (#10432)
    • Remove redundant retries while binding to arbitrary port (#10452)
    • Add qwen3-4b accuracy test case (#10382)
    • Update kimi-k2-1k1k dataset (#10473)
    • Fix concurrency list in Wide-EP perf tests (#10529)
    • Restrict max_num_tokens in disagg mtp config (#10442)
    • Add kimi_k2 single node perf test (#10436)
    • Add MMMU test for mistral small (#10530)
    • Workaround OCI-NRT slowdown issue (#10587)
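
Regarding the sampler change in #10389 above: EOS tokens declared in the model's generation config are now treated as stop words automatically, and user-supplied stop strings compose with that default. A minimal, hedged sketch with the LLM API; the model name is a placeholder.

```python
# After #10389, EOS tokens from the model's generation config feed the
# sampler's stop-word list automatically; explicit stop strings are
# applied on top of that default.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder
params = SamplingParams(max_tokens=64, stop=["\n\n"])
output = llm.generate("Write one sentence about GPUs.", params)
print(output.outputs[0].text)
```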

What's Changed

  • [#8391][chore] added deepseek_r1_distill_qwen_32b AutoDeploy perf test to L0 by @MrGeva in #10377
  • [https://nvbugs/5670469][fix] Filter 0s and choose min of kv_head for Nemotron model by @farazkh80 in #10206
  • [https://nvbugs/5772363][fix] fix bug of Mistral-Small-3.1-24B-Instruct-2503 by @byshiue in #10394
  • [https://nvbugs/5649010][fix] use 0 port as arbitrary port when disagg service discovery is enabled by @reasonsolo in #10383
  • [TRTLLM-10065][feat] Add accuracy tests for super-v3 with multiple-gpus by @Wanli-Jiang in #10234
  • [https://nvbugs/5779534][fix] fix buffer reuse for CUDA graph attention metadata by @lfr-0531 in #10393
  • [None][feat] sm100 weight-only kernel by @Njuapp in #10190
  • [https://nvbugs/5701425][chore] Unwaive tests. by @yuxianq in #10269
  • [None][feat] Layer-wise benchmarks: support TEP balance, polish slurm scripts by @yuantailing in #10237
  • [None][infra] Waive failed cases in post-merge on 1/5 by @EmmaQiaoCh in #10399
  • [TRTLLM-10185][feat] AutoTuner Cache: Support cache file lock and merge all ranks into one by @hyukn in #10336
  • [TRTLLM-8242][feat] Add stability tags for serve subcommand by @LinPoly in #10012
  • [https://nvbugs/5752521][fix] Unwaive test_trtllm_flashinfer_symbol_collision.py by @yihwang-nv in #10227
  • [None][infra] Waive failed cases again on 1/5 by @EmmaQiaoCh in #10403
  • [https://nvbugs/5715568][fix] Force to release torch memory when LLM is destroyed by @HuiGao-NV in #10314
  • [TRTLLM-8821][feat] Apply AutoTuner to AllReduce Op for strategy tuning. by @hyukn in #8531
  • [None][feat] update deepgemm to the DeepGEMM/nv_dev branch by @lfr-0531 in #9898
  • [TRTLLM-9381][test] add disag-serving kimi k2 thinking tests by @xinhe-nv in #10357
  • [#10374][fix] fixed race condition in AutoDeploy's mp tests port acquisition by @MrGeva in #10366
  • [TRTLLM-9465][fix] Swap TP-CP grouping order by @brb-nv in #10350
  • [None][perf] TRTLLM MoE maps to lower tuning buckets when ep>1 by @rosenrodt in #9998
  • [TRTLLM-10053][feat] AutoDeploy: Add Super v3 config file, improve test runtime by @galagam in #10397
  • [https://nvbugs/5772521][fix] Fix draft token tree chain crash by @mikeiovine in #10386
  • [https://nvbugs/5772414][fix] Fix draft token tree depth=1 corner case by @mikeiovine in #10385
  • [TRTLLM-9767][feat] Fixed recursive node traversals by @greg-kwasniewski1 in #10379
  • [TRTLLM-9551][infra] Partition test_llm_pytorch.py for parallel execution by @Superjomn in #10400
  • [https://nvbugs/5695984][fix] Unwaive llama3 eagle test by @mikeiovine in #10092
  • [https://nvbugs/5745152][fix] Unwaive gpt oss spec decode test by @mikeiovine in #10370
  • [#10170][fix] Add export patch for GraniteMoe MoE models to enable torch.export compatibility by @karthikvetrivel in #10169
  • [https://nvbugs/5777044][chore] Remove solved bugs from waives.txt by @SimengLiu-nv in #10422
  • [None][feat] precompiled installation from local src dir by @lucaslie in #10419
  • [TRTLLM-9527][feat] Add transferAgent binding (step 1) by @chuangz0 in #10113
  • [None][fix] Only Use Throughput Metrics to Check Regression by @chenfeiz0326 in #10404
  • [None][feat] add the eos tokens in generation config to stop words in the sampler by @JadoTu in #10389
  • [None][chore] Update SWA + spec dec support matrix by @mikeiovine in #10421
  • [None][feat] CuteDSL MOE FC1 Enhancement by @liyuhannnnn in #10088
  • [https://nvbugs/5726962][feat] Apply fusion for W4AFP8_AWQ MoE by @yumin066 in #9838
  • [#2511][fix] eagle: qwen2 capture hidden states by @XiaoXuan42 in #10091
  • [None][docs] Add --config preference over --extra_llm_api_options in CODING_GUIDELINES.md by @venkywonka in #10426
  • [#8460][feat] Revive and simplify Model Explorer visualization integration by @karthikvetrivel in #10150
  • [None][chore] unwaive qwen3 30b test by @kris1025 in #10115
  • [TRTLLM-8638][fix] Add failed cases into waives.txt by @xinhe-nv in #10384
  • [None][test] update test case constraint by @crazydemo in #10381
  • [https://nvbugs/5769926] [fix] Add no container mount home WAR by @kaiyux in #10431
  • [TRTLLM-8638][fix] Add failed cases into waives.txt by @xinhe-nv in #10367
  • [TRTLLM-9622][infra] Enable DGX_B300 multi-gpu testing in pre-merge pipeline by @yiqingy0 in #9699
  • [TRTLLM-9896][test] add vswa test cases coverage by @crazydemo in #10146
  • [None] [fix] Fix undefined tokens_per_block by @kaiyux in #10438
  • [https://nvbugs/5772361][ci] Unwaive tests that have been fixed by @2ez4bz...

v1.2.0rc6.post1

08 Jan 05:52

Pre-release

Security Vulnerabilities

GnuPG Vulnerability

A security vulnerability has been identified in GnuPG versions prior to 2.4.9; the affected package ships in the Ubuntu 24.04 LTS image used as the TensorRT LLM base image. For details, refer to the official Ubuntu advisory for CVE-2025-68973. An official patched package for Ubuntu is currently pending, and the fix will be included in the next release once the updated package is published and incorporated. To mitigate potential risks immediately, users are advised to manually upgrade GnuPG to version 2.4.9 or later.
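
As an interim check, a sketch like the following can verify whether a container's GnuPG already meets the advised minimum; it assumes gpg is on PATH and reports its version in the usual "gpg (GnuPG) x.y.z" format.

```python
# Sketch: check that the installed GnuPG is at or above the patched
# 2.4.9 release recommended above. Assumes `gpg` is on PATH.
import re
import subprocess

out = subprocess.run(["gpg", "--version"], capture_output=True, text=True).stdout
match = re.search(r"gpg \(GnuPG\) (\d+)\.(\d+)\.(\d+)", out)
if match is None:
    raise RuntimeError("could not parse `gpg --version` output")
version = tuple(int(x) for x in match.groups())
print("OK" if version >= (2, 4, 9) else "Upgrade GnuPG to 2.4.9 or later")
```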

Hugging Face Transformers Vulnerabilities

Several security vulnerabilities have been disclosed regarding the Hugging Face Transformers library used in TensorRT LLM. As these issues originate from an upstream dependency, remediation is dependent on the release of a patch by the Hugging Face team. We are actively monitoring the situation and will update TensorRT LLM to include the necessary fixes once a stable release of the Transformers library addressing these vulnerabilities becomes available. Affected CVEs: CVE-2025-14920, CVE-2025-14921, CVE-2025-14924, CVE-2025-14927, CVE-2025-14928, CVE-2025-14929, CVE-2025-14930

What's Changed

  • [https://nvbugs/5708810][fix] Fix TRTLLMSampler by @moraxu in #9710
  • [TRTLLM-9641][infra] Use public triton 3.5.0 in SBSA by @ZhanruiSunCh in #9652
  • [TRTLLM-8638][fix] Add failed cases into waives.txt by @xinhe-nv in #9979
  • [TRTLLM-9794][ci] move more test cases to gb200 by @QiJune in #9994
  • [None][feat] Add routing support for the new model for both cutlass and trtllm moe backend by @ChristinaZ in #9792
  • [TRTLLM-8310][feat] Add Qwen3-VL-MoE by @yechank-nvidia in #9689
  • [https://nvbugs/5731717][fix] fixed flashinfer build race condition during test by @MrGeva in #9983
  • [FMDL-1222][feat] Support weight and weight_scale padding for NVFP4 MoE cutlass by @Wanli-Jiang in #9358
  • [None][chore] Update internal_cutlass_kernels artifacts by @yihwang-nv in #9992
  • [None][docs] Add README for Nemotron Nano v3 by @2ez4bz in #10017
  • [None][infra] Fixing credential loading in lockfile generation pipeline by @yuanjingx87 in #10020
  • [https://nvbugs/5727952][fix] a pdl bug in trtllm-gen fmha kernels by @PerkzZheng in #9913
  • [None][infra] Waive failed test for main branch on 12/16 by @EmmaQiaoCh in #10029
  • [None][doc] Update CONTRIBUTING.md by @syuoni in #10023
  • [None][fix] Fix Illegal Memory Access for CuteDSL Grouped GEMM by @syuoni in #10008
  • [TRTLLM-9181][feat] improve disagg-server prometheus metrics; synchronize workers' clocks when workers are dynamic by @reasonsolo in #9726
  • [None][chore] Final mass integration of release/1.1 by @mikeiovine in #9960
  • [None][fix] Fix iteration stats for spec-dec by @achartier in #9855
  • [https://nvbugs/5741060][fix] Fix pg op test by @shuyixiong in #9989
  • [https://nvbugs/5635153][chore] Remove responses tests from waive list by @JunyiXu-nv in #10026
  • [None] [feat] Enhancements to slurm scripts by @kaiyux in #10031
  • [None][infra] Waive failed tests due to llm model files by @EmmaQiaoCh in #10068
  • [None][fix] Enabled simultaneous support for low-precision combine and MTP. by @yilin-void in #9091
  • [https://nvbugs/5698434][test] Add Qwen3-4B-Eagle3 One-model perf test by @yufeiwu-nv in #10041
  • [TRTLLM-9998][fix] Change trtllm-gen MoE distributed tuning strategy back to INDEPENDENT by @hyukn in #10036
  • [TRTLLM-9989][fix] Disable tvm_ffi for CuteDSL nvFP4 dense GEMM. by @hyukn in #10040
  • [None][chore] Remove unnecessary warning log for tuning. by @hyukn in #10077
  • [TRTLLM-9680][perf] Optimize TRTLLMSampler log_probs performance (Core fix has been merged via #9353) by @tongyuantongyu in #9655
  • [None][chore] Bump version to 1.2.0rc6.post1 by @yiqingy0 in #10484

Full Changelog: v1.2.0rc6...v1.2.0rc6.post1