[QDP] Add zero-copy amplitude batch encoding from float32 GPU tensors #1029

viiccwen wants to merge 2 commits into apache:main
Conversation
Please resolve the conflict. Thanks for the contribution.

Solved!
Sorry for the late reply.
```diff
-    launch_amplitude_encode, launch_amplitude_encode_batch, launch_l2_norm, launch_l2_norm_batch,
-    launch_l2_norm_f32,
+    launch_amplitude_encode, launch_amplitude_encode_batch, launch_amplitude_encode_batch_f32,
+    launch_l2_norm, launch_l2_norm_batch, launch_l2_norm_batch_f32, launch_l2_norm_f32,
```
In this file around lines 251–276 and 351–376, amplitude_encode_batch_kernel / _f32 compute input_base = sample_idx * input_len and then do reinterpret_cast<const double2*>(input_batch + input_base) + elem_pair / float2. For odd input_len and sample_idx > 0 this base pointer is only 8‑byte (double) / 4‑byte (float) aligned, not 16‑byte, so the double2/float2 loads are potentially misaligned. This alignment pattern already existed in the original f64 batch kernel and this PR copies it into the new f32 batch path; please either enforce even input_len at the Rust call‑site or rework the kernels to index from a properly aligned double2* / float2* base pointer with a scalar fallback.
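One way to apply the suggested Rust call-site fix is a pre-launch guard that rejects odd `input_len` before the kernel is dispatched. The sketch below is illustrative only; `check_batch_alignment` and `EncodeError` are hypothetical names, not the actual qdp-core API:

```rust
// Hedged sketch of a call-site guard for the alignment hazard described in
// the review comment. With a flattened batch, sample bases sit at
// sample_idx * input_len elements from the batch start; when input_len is
// odd, those bases are only sizeof(T)-aligned, which breaks vectorized
// float2/double2 loads. An even input_len keeps every sample base aligned
// to 2 * sizeof(T), matching the float2/double2 alignment requirement
// (assuming the batch base pointer itself is suitably aligned).

#[derive(Debug, PartialEq)]
enum EncodeError {
    MisalignedBatch { input_len: usize },
}

fn check_batch_alignment(input_len: usize) -> Result<(), EncodeError> {
    if input_len % 2 != 0 {
        return Err(EncodeError::MisalignedBatch { input_len });
    }
    Ok(())
}

fn main() {
    assert!(check_batch_alignment(1024).is_ok());
    assert_eq!(
        check_batch_alignment(1023),
        Err(EncodeError::MisalignedBatch { input_len: 1023 })
    );
    println!("alignment checks passed");
}
```

The alternative the comment mentions, indexing from an aligned `float2*`/`double2*` base with a scalar fallback for the leading/trailing elements, avoids restricting callers but complicates the kernels.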
Please help resolve conflicts.
Pull request overview
This PR adds comprehensive float32 amplitude batch encoding support to the QDP engine, enabling zero-copy GPU encoding from PyTorch float32 CUDA tensors.
Changes:
- Added f32 batch amplitude encoding kernels and GPU pointer APIs in core
- Refactored `GpuStateVector::new_batch` to accept a precision parameter for flexible buffer allocation
- Updated all batch encoding call sites to specify the output precision
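The refactored allocation can be pictured as buffer sizing driven by an explicit precision value. This is a minimal sketch assuming complex amplitudes with two scalar components each; `batch_buffer_bytes` is a hypothetical helper, not the real `GpuStateVector::new_batch` implementation:

```rust
// Illustrative sketch: batch state-vector buffer size now depends on an
// explicit Precision instead of being hard-coded to f64.

#[derive(Clone, Copy, Debug, PartialEq)]
enum Precision {
    Float32,
    Float64,
}

impl Precision {
    fn bytes_per_amplitude(self) -> usize {
        match self {
            // Complex amplitudes: 2 scalar components each.
            Precision::Float32 => 2 * 4,
            Precision::Float64 => 2 * 8,
        }
    }
}

/// Bytes needed for a batch of state vectors over `num_qubits` qubits.
fn batch_buffer_bytes(num_samples: usize, num_qubits: u32, precision: Precision) -> usize {
    let amplitudes_per_state = 1usize << num_qubits;
    num_samples * amplitudes_per_state * precision.bytes_per_amplitude()
}

fn main() {
    // 8 samples of 10-qubit states: f32 buffers are half the size of f64.
    let f32_bytes = batch_buffer_bytes(8, 10, Precision::Float32);
    let f64_bytes = batch_buffer_bytes(8, 10, Precision::Float64);
    assert_eq!(f32_bytes * 2, f64_bytes);
    println!("f32: {f32_bytes} bytes, f64: {f64_bytes} bytes");
}
```

Selecting precision at allocation time means every call site states the output format explicitly, which is why the non-amplitude encoders in this PR now pass `Precision::Float64`.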
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| qdp/qdp-kernels/src/amplitude.cu | Added f32 batch amplitude kernel (amplitude_encode_batch_kernel_f32) and launcher, plus f32 batch L2 norm support |
| qdp/qdp-kernels/src/lib.rs | Added FFI declarations for f32 batch kernels and stubs for non-CUDA builds |
| qdp/qdp-core/src/lib.rs | Added encode_batch_from_gpu_ptr_f32 public API with validation and added precision() accessor |
| qdp/qdp-core/src/gpu/memory.rs | Refactored new_batch to accept Precision parameter for f32 or f64 buffer allocation |
| qdp/qdp-core/src/gpu/encodings/mod.rs | Added default encode_batch_from_gpu_ptr_f32 trait method |
| qdp/qdp-core/src/gpu/encodings/amplitude.rs | Implemented f32 batch encoding with norm validation |
| qdp/qdp-core/src/gpu/encodings/{angle,basis,iqp}.rs | Updated new_batch calls to pass Precision::Float64 |
| qdp/qdp-core/src/encoding/mod.rs | Updated streaming pipeline to use engine.precision() |
| qdp/qdp-python/src/lib.rs | Added validation for f32/f64 amplitude tensors; 1D f32 supported, 2D f32 returns clear error |
| qdp/qdp-core/tests/gpu_ptr_encoding.rs | Comprehensive f32 batch tests covering success and error paths |
| qdp/qdp-core/tests/dlpack.rs | Updated to pass precision to new_batch |
| testing/qdp/test_bindings.py | Added tests for f32 input with both f32 and f64 engine precision |
| qdp/qdp-python/tests/test_dlpack_validation.py | Updated to verify 1D f32 works and 2D f32 gives clear error |
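The Python-binding validation described in the table (1D f32 accepted, 2D f32 rejected with a clear error) can be sketched as a dtype/rank check. The names, error text, and `Dtype` enum below are assumptions for illustration, not the actual qdp-python code:

```rust
// Hedged sketch of the tensor validation described for the Python bindings:
// f64 tensors are accepted in 1D and 2D form, f32 tensors only in 1D form,
// and a 2D f32 tensor produces a descriptive error rather than a panic.

#[derive(Clone, Copy, Debug, PartialEq)]
enum Dtype {
    F32,
    F64,
}

fn validate_amplitude_tensor(dtype: Dtype, ndim: usize) -> Result<(), String> {
    match (dtype, ndim) {
        (Dtype::F64, 1) | (Dtype::F64, 2) | (Dtype::F32, 1) => Ok(()),
        (Dtype::F32, 2) => Err(
            "2D float32 tensors are not supported by this entry point".to_string(),
        ),
        (_, n) => Err(format!("expected a 1D or 2D tensor, got {n}D")),
    }
}

fn main() {
    assert!(validate_amplitude_tensor(Dtype::F32, 1).is_ok());
    assert!(validate_amplitude_tensor(Dtype::F64, 2).is_ok());
    assert!(validate_amplitude_tensor(Dtype::F32, 2).is_err());
    println!("validation checks passed");
}
```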
Purpose of PR
This PR adds batch float32 amplitude encoding support in QDP core and kernels.
It extends the existing float32 GPU-pointer amplitude path from single-sample encoding to batched encoding, and refactors batch state allocation so the output precision is selected explicitly at allocation time.
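For reference, the operation the batched kernel performs can be expressed as a small CPU sketch: each sample in the flattened batch is L2-normalized independently so its squared amplitudes sum to 1. This is an illustration of the math, not the CUDA implementation:

```rust
// CPU reference for batched amplitude encoding: every `input_len`-element
// sample in the flattened batch is divided by its own L2 norm.

fn amplitude_encode_batch_ref(batch: &[f32], input_len: usize) -> Vec<f32> {
    assert!(input_len > 0 && batch.len() % input_len == 0);
    let mut out = Vec::with_capacity(batch.len());
    for sample in batch.chunks(input_len) {
        let norm = sample.iter().map(|x| x * x).sum::<f32>().sqrt();
        assert!(norm > 0.0, "zero-norm sample cannot be amplitude-encoded");
        out.extend(sample.iter().map(|x| x / norm));
    }
    out
}

fn main() {
    // Two samples of length 2: [3, 4] and [1, 0].
    let encoded = amplitude_encode_batch_ref(&[3.0, 4.0, 1.0, 0.0], 2);
    assert!((encoded[0] - 0.6).abs() < 1e-6);
    assert!((encoded[1] - 0.8).abs() < 1e-6);
    assert!((encoded[2] - 1.0).abs() < 1e-6);
    println!("{encoded:?}");
}
```

The GPU path does the same per-sample normalization directly on device memory, which is what makes the zero-copy claim meaningful: the float32 tensor never round-trips through the host or through an f64 conversion.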
What changed

Core
- `QdpEngine::encode_batch_from_gpu_ptr_f32`
- `QdpEngine::encode_batch_from_gpu_ptr_f32_with_stream`
- `AmplitudeEncoder` gains the f32 batch encoding path

Kernels
- `launch_amplitude_encode_batch_f32` in `qdp-kernels`

Allocation
- `GpuStateVector::new_batch` now takes an explicit `Precision` (`Float32` or `Float64` as appropriate); existing call sites pass `Float64`

Python
- Validation for f32/f64 amplitude tensors: 1D f32 is supported, 2D f32 returns a clear error

Tests
- f32 batch tests in `qdp-core` covering success and error paths, plus Python binding tests for f32 input under both engine precisions
Related Issues or PRs
closes #1028
Breaking Changes
- `GpuStateVector::new_batch` now requires a `Precision` argument; all in-tree call sites were updated.