Onnx memory management and onnx-trt optimisation #2307
borg323 merged 31 commits into LeelaChessZero:master from
Conversation
Very small batches require a separate optimisation; optimising for batch size 1 costs too much performance. Adding special optimisation for very small batches won't be a simple change and should be left for a future change.
It would be nice to somehow have the CUDA part isolated from the pure ONNX, but I don't immediately see a good way to do it. I'll give it a thought.
#2328 tries to separate EP-specific code using a provider template type. It could probably be improved.
Pull Request Overview
This pull request adds CUDA runtime optimizations for the ONNX backend, improving performance for CUDA and TensorRT execution providers. The changes introduce custom CUDA kernels for input plane expansion, implement resource pooling for InputsOutputs objects, and add proper stream synchronization for asynchronous GPU operations.
- Implements custom CUDA kernels for expanding input planes on GPU instead of on CPU
- Adds resource pooling mechanism to reuse InputsOutputs allocations across computations
- Introduces proper CUDA stream management with separate streams for upload, compute, and download operations
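As a rough illustration of the pooling idea described above (the names and structure here are assumptions for illustration, not the PR's actual code), a pool that recycles `InputsOutputs` allocations instead of freeing them might look like:

```cpp
#include <memory>
#include <mutex>
#include <vector>

// Hypothetical sketch of the InputsOutputs pooling pattern: the buffers are
// expensive to allocate (pinned host/GPU memory in the real backend), so a
// finished computation returns its object to a free list for reuse.
struct InputsOutputs {
  std::vector<float> input_planes;   // stands in for pinned host buffers
  std::vector<float> output_values;  // stands in for result buffers
};

class InputsOutputsPool {
 public:
  std::unique_ptr<InputsOutputs> Get() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (free_list_.empty()) return std::make_unique<InputsOutputs>();
    auto io = std::move(free_list_.back());
    free_list_.pop_back();
    return io;
  }
  void Release(std::unique_ptr<InputsOutputs> io) {
    std::lock_guard<std::mutex> lock(mutex_);
    free_list_.push_back(std::move(io));
  }

 private:
  std::mutex mutex_;
  std::vector<std::unique_ptr<InputsOutputs>> free_list_;
};
```

The point of the pattern is that a `Get()` after a `Release()` hands back the previously allocated object rather than allocating a new one.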
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 13 comments.
| File | Description |
|---|---|
| src/neural/backends/network_onnx.cc | Main implementation changes including InputsOutputs struct, CUDA stream management, resource pooling, and async GPU operations |
| src/neural/backends/cuda/onnx_kernels.h | Header declaring the expandPlanesOnnx CUDA kernel template |
| src/neural/backends/cuda/onnx.cu | CUDA kernel implementation for plane expansion |
| meson.build | Build configuration to enable CUDA runtime support for ONNX backend |
Comments suppressed due to low confidence (1)
src/neural/backends/network_onnx.cc:1
- Missing parentheses around the logical OR expression. This condition will be parsed as `(get_option('cudnn')) or (get_option('plain_cuda') and cu_dart.found() and nvcc.found())` due to operator precedence. It should be `(get_option('cudnn') or get_option('plain_cuda')) and cu_dart.found() and nvcc.found()` to ensure both conditions require the CUDA runtime and nvcc.
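Assuming the comment refers to a meson.build condition of roughly this shape (the body shown is only a placeholder), the corrected grouping would be:

```meson
# Parenthesise the OR so that both the cudnn and plain_cuda cases
# additionally require the CUDA runtime and nvcc to be present.
if (get_option('cudnn') or get_option('plain_cuda')) and cu_dart.found() and nvcc.found()
  # build the CUDA kernels for the ONNX backend
endif
```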
```cpp
(batch_size < 0 ? std::to_string(batch_size)
                : std::to_string(batch_size - batch_size_ + 1) + "-" +
                  std::to_string(batch_size)) +
```
The ternary expression is confusing. When batch_size < 0, it just converts it to a string, but when positive, it creates a range string. Consider extracting this into a helper function or adding a comment explaining the logic (e.g., negative means variable batch size, positive means a fixed range).
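One possible shape for the suggested helper (the function name and the `batch_step` parameter are illustrative, standing in for the `batch_size_` member used in the original expression):

```cpp
#include <string>

// Hypothetical helper extracting the ternary above: a negative batch_size
// denotes a variable batch size and is emitted as-is; a positive one is the
// upper end of a fixed range of width batch_step, emitted as "lo-hi".
std::string BatchRangeString(int batch_size, int batch_step) {
  if (batch_size < 0) return std::to_string(batch_size);
  return std::to_string(batch_size - batch_step + 1) + "-" +
         std::to_string(batch_size);
}
```

For example, `BatchRangeString(16, 8)` yields `"9-16"`, while `BatchRangeString(-1, 8)` yields `"-1"`.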
* reuse onnx input/output tensor memory
* bind cuda mem
* print some cuda device info
* use fixed device addresses in preparation for cuda graph
* Allow concurrent compute and DMA memory transfer
* Delay network evaluation until previous download has completed
* Add expandPlanes cuda kernel to onnx
* Remove extra waits from multi step evaluation when using onnx-trt
* Let Ort::IoBinding manage tensor object lifetime
* Improve onnx-trt optimiser for fixed size inputs
* Add GPU and board ID print to onnx-trt
* Check if CUDA version supports PCI information.
* Add warnings if CUDA support isn't enabled.
* Always optimise to the largest batch sizes. Very small batches require a separate optimisation; optimising for batch size 1 costs too much performance. Adding special optimisation for very small batches won't be a simple change and should be left for a future change.
No description provided.
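The expandPlanes kernel in the commit list above moves input-plane expansion from the CPU to the GPU. As a rough CPU-side illustration of the expansion itself (the mask-plus-value plane encoding is an assumption about the input format, and the names are invented, not the PR's actual kernel):

```cpp
#include <cstdint>

// CPU sketch of the plane-expansion step: each input plane is assumed to be
// a 64-bit occupancy mask plus one float value, expanded into 64 floats
// (the value where the corresponding bit is set, 0 elsewhere). The real
// change performs this per-element work in a CUDA kernel instead.
void ExpandPlanesCpu(float* output, const uint64_t* masks,
                     const float* values, int num_planes) {
  for (int p = 0; p < num_planes; ++p) {
    for (int sq = 0; sq < 64; ++sq) {
      output[p * 64 + sq] = ((masks[p] >> sq) & 1) ? values[p] : 0.0f;
    }
  }
}
```

On the GPU, each output element is independent, which is what makes this loop a natural fit for a one-thread-per-element CUDA kernel.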