Onnx memory management and onnx-trt optimisation #2307
borg323 merged 31 commits into LeelaChessZero:master from
Conversation
Very small batches require a separate optimisation; optimising for batch size 1 costs too much performance. Adding special optimisation for very small batches won't be a simple change and should be left for a future change.
It would be nice to somehow have the CUDA part isolated from the pure ONNX, but I don't immediately see a good way to do it. I'll give it a thought.
#2328 tries to separate EP-specific code using a provider template type. It could probably be improved.
Pull Request Overview
This pull request adds CUDA runtime optimizations for the ONNX backend, improving performance for CUDA and TensorRT execution providers. The changes introduce custom CUDA kernels for input plane expansion, implement resource pooling for InputsOutputs objects, and add proper stream synchronization for asynchronous GPU operations.
- Implements custom CUDA kernels for expanding input planes on GPU instead of on CPU
- Adds resource pooling mechanism to reuse InputsOutputs allocations across computations
- Introduces proper CUDA stream management with separate streams for upload, compute, and download operations
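As a rough illustration of the pooling idea described above (the names and structure here are assumptions for illustration, not the PR's actual code), a pool that recycles `InputsOutputs` allocations instead of freeing them might look like:

```cpp
#include <memory>
#include <mutex>
#include <vector>

// Hypothetical sketch of the InputsOutputs pooling pattern: the buffers are
// expensive to allocate (pinned host/GPU memory in the real backend), so a
// finished computation returns its object to a free list for reuse.
struct InputsOutputs {
  std::vector<float> input_planes;   // stands in for pinned host buffers
  std::vector<float> output_values;  // stands in for result buffers
};

class InputsOutputsPool {
 public:
  std::unique_ptr<InputsOutputs> Get() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (free_list_.empty()) return std::make_unique<InputsOutputs>();
    auto io = std::move(free_list_.back());
    free_list_.pop_back();
    return io;
  }
  void Release(std::unique_ptr<InputsOutputs> io) {
    std::lock_guard<std::mutex> lock(mutex_);
    free_list_.push_back(std::move(io));
  }

 private:
  std::mutex mutex_;
  std::vector<std::unique_ptr<InputsOutputs>> free_list_;
};
```

The point of the pattern is that a `Get()` after a `Release()` hands back the previously allocated object rather than allocating a new one.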
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 13 comments.
| File | Description |
|---|---|
| src/neural/backends/network_onnx.cc | Main implementation changes including InputsOutputs struct, CUDA stream management, resource pooling, and async GPU operations |
| src/neural/backends/cuda/onnx_kernels.h | Header declaring the expandPlanesOnnx CUDA kernel template |
| src/neural/backends/cuda/onnx.cu | CUDA kernel implementation for plane expansion |
| meson.build | Build configuration to enable CUDA runtime support for ONNX backend |
Comments suppressed due to low confidence (1)
src/neural/backends/network_onnx.cc:1
- Missing parentheses around the logical OR expression. This condition will be parsed as `(get_option('cudnn')) or (get_option('plain_cuda') and cu_dart.found() and nvcc.found())` due to operator precedence. It should be `(get_option('cudnn') or get_option('plain_cuda')) and cu_dart.found() and nvcc.found()` to ensure both conditions require the CUDA runtime and nvcc.
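Assuming the comment refers to a meson.build condition of roughly this shape (the body shown is only a placeholder), the corrected grouping would be:

```meson
# Parenthesise the OR so that both the cudnn and plain_cuda cases
# additionally require the CUDA runtime and nvcc to be present.
if (get_option('cudnn') or get_option('plain_cuda')) and cu_dart.found() and nvcc.found()
  # build the CUDA kernels for the ONNX backend
endif
```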
```cpp
(batch_size < 0 ? std::to_string(batch_size)
                : std::to_string(batch_size - batch_size_ + 1) + "-" +
                  std::to_string(batch_size)) +
```
The ternary expression is confusing. When batch_size < 0, it just converts it to a string, but when positive, it creates a range string. Consider extracting this into a helper function or adding a comment explaining the logic (e.g., negative means variable batch size, positive means a fixed range).
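One possible shape for the suggested helper (the function name and the `batch_step` parameter are illustrative, standing in for the `batch_size_` member used in the original expression):

```cpp
#include <string>

// Hypothetical helper extracting the ternary above: a negative batch_size
// denotes a variable batch size and is emitted as-is; a positive one is the
// upper end of a fixed range of width batch_step, emitted as "lo-hi".
std::string BatchRangeString(int batch_size, int batch_step) {
  if (batch_size < 0) return std::to_string(batch_size);
  return std::to_string(batch_size - batch_step + 1) + "-" +
         std::to_string(batch_size);
}
```

For example, `BatchRangeString(16, 8)` yields `"9-16"`, while `BatchRangeString(-1, 8)` yields `"-1"`.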
* reuse onnx input/output tensor memory
* bind cuda mem
* print some cuda device info
* use fixed device addresses in preparation for cuda graph
* Allow concurrent compute and DMA memory transfer
* Delay network evaluation until previous download has completed
* Add expandPlanes cuda kernel to onnx
* Remove extra waits from multi step evaluation when using onnx-trt
* Let Ort::IoBinding manage tensor object lifetime
* Improve onnx-trt optimiser for fixed size inputs
* Add GPU and board ID print to onnx-trt
* Check if CUDA version supports PCI information.
* Add warnings if CUDA support isn't enabled.
* Always optimise to the largest batch sizes. Very small batches require a separate optimisation; optimising for batch size 1 costs too much performance. Adding special optimisation for very small batches won't be a simple change and should be left for a future change.
No description provided.
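The expandPlanes kernel in the commit list above moves input-plane expansion from the CPU to the GPU. As a rough CPU-side illustration of the expansion itself (the mask-plus-value plane encoding is an assumption about the input format, and the names are invented, not the PR's actual kernel):

```cpp
#include <cstdint>

// CPU sketch of the plane-expansion step: each input plane is assumed to be
// a 64-bit occupancy mask plus one float value, expanded into 64 floats
// (the value where the corresponding bit is set, 0 elsewhere). The real
// change performs this per-element work in a CUDA kernel instead.
void ExpandPlanesCpu(float* output, const uint64_t* masks,
                     const float* values, int num_planes) {
  for (int p = 0; p < num_planes; ++p) {
    for (int sq = 0; sq < 64; ++sq) {
      output[p * 64 + sq] = ((masks[p] >> sq) & 1) ? values[p] : 0.0f;
    }
  }
}
```

On the GPU, each output element is independent, which is what makes this loop a natural fit for a one-thread-per-element CUDA kernel.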