
[QDP] Add zero-copy amplitude batch encoding from float32 GPU tensors#1029

Open
viiccwen wants to merge 2 commits into apache:main from viiccwen:add-batch-f32-amplitude-encoding

Conversation

@viiccwen
Contributor

@viiccwen viiccwen commented Feb 7, 2026

Purpose of PR

This PR adds batch float32 amplitude encoding support in QDP core and kernels.

It extends the existing float32 GPU-pointer amplitude path from single-sample encoding to batched encoding, and refactors batch state allocation so the output precision is selected explicitly at allocation time.

What changed

Core

  • Added QdpEngine::encode_batch_from_gpu_ptr_f32
  • Added QdpEngine::encode_batch_from_gpu_ptr_f32_with_stream
  • Added the corresponding float32 batch amplitude entry point in AmplitudeEncoder
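
To make the semantics of the new batch entry points concrete, here is a minimal CPU reference sketch of what batched amplitude encoding computes: each sample (row) of the input is scaled to unit L2 norm and written into a state vector of 2^n_qubits amplitudes, zero-padded past the input length. The function name is hypothetical; the actual GPU implementation lives in the Rust/CUDA code this PR touches.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// CPU reference sketch (not the PR's implementation): batched amplitude
// encoding normalizes each sample to unit L2 norm and zero-pads it into a
// 2^n_qubits-amplitude state vector.
std::vector<double> amplitude_encode_batch_ref(const std::vector<double>& input,
                                               size_t batch, size_t input_len,
                                               size_t n_qubits) {
    const size_t state_len = size_t{1} << n_qubits;
    assert(input.size() == batch * input_len && input_len <= state_len);
    std::vector<double> out(batch * state_len, 0.0);
    for (size_t s = 0; s < batch; ++s) {
        double sq = 0.0;
        for (size_t i = 0; i < input_len; ++i) {
            const double v = input[s * input_len + i];
            sq += v * v;  // per-sample L2 norm accumulation
        }
        assert(sq > 0.0);  // zero-norm samples are invalid inputs here
        const double inv_norm = 1.0 / std::sqrt(sq);
        for (size_t i = 0; i < input_len; ++i)
            out[s * state_len + i] = input[s * input_len + i] * inv_norm;
    }
    return out;
}
```

The float32 path described in this PR follows the same shape, with `float` inputs and a float32 output buffer selected at allocation time.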

Kernels

  • Added launch_amplitude_encode_batch_f32
  • Added the float32 batch amplitude CUDA kernel wiring in qdp-kernels
  • Reused the existing float32 batched L2 norm reduction for batch normalization

Allocation

  • Refactored GpuStateVector::new_batch to accept an explicit Precision
  • Updated all batch allocation call sites to pass the correct output precision
    • amplitude batch uses Float64 or Float32 as appropriate
    • angle, basis, IQP, and the streaming pipeline continue to request Float64
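
The core decision the refactored allocation has to make is simple: the byte size of the batch state buffer now depends on the requested output precision rather than being hard-coded to double. A minimal sketch, with names mirroring the PR's `Precision` and `new_batch` but not taken from its actual code:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical sketch of precision-aware batch buffer sizing. The real
// allocation happens in GpuStateVector::new_batch on the GPU; this only
// illustrates the size computation an explicit Precision parameter drives.
enum class Precision { Float32, Float64 };

size_t batch_state_bytes(size_t batch, size_t n_qubits, Precision p) {
    const size_t elem = (p == Precision::Float32) ? sizeof(float)
                                                  : sizeof(double);
    const size_t amplitudes_per_sample = size_t{1} << n_qubits;
    return batch * amplitudes_per_sample * elem;
}
```

Passing the precision explicitly at every call site (rather than inferring it) is what lets the amplitude batch path request Float32 while angle, basis, IQP, and the streaming pipeline keep requesting Float64.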

Python

  • Kept float32 + amplitude batch encoding unsupported in Python bindings for now
  • Existing Python-facing error behavior remains unchanged and can be handled in a follow-up PR

Tests

  • Added Float32 batch DLPack shape coverage
  • Added core GPU-pointer tests for float32 batch amplitude success and validation cases

Related Issues or PRs

closes #1028

Changes Made

  • Bug fix
  • New feature
  • Refactoring
  • Documentation
  • Test
  • CI/CD pipeline
  • Other

Breaking Changes

  • Yes
  • No

Checklist

  • Added or updated unit tests for all changes
  • Added or updated documentation for all changes
  • Successfully built and ran all unit tests or manual tests locally
  • PR title follows "MAHOUT-XXX: Brief Description" format (if related to an issue)
  • Code follows ASF guidelines

@ryankert01
Member

Please resolve the conflict. Thanks for the contribution.

@ryankert01 ryankert01 added this to the Qumat 0.6.0 milestone Feb 8, 2026
@viiccwen
Contributor Author

viiccwen commented Feb 9, 2026

Solved!

@viiccwen viiccwen force-pushed the add-batch-f32-amplitude-encoding branch from db93ede to b6a4f4f Compare February 9, 2026 06:42
@rich7420
Contributor

Sorry for the late reply.
Overall LGTM; I'll review more deeply later.


@rich7420 rich7420 left a comment


@viiccwen thanks for the patch!
left some comments:
I think this new f32 batch GPU‑pointer API and related pipelines are not fully covered by tests.

Existing launchers: launch_amplitude_encode, launch_amplitude_encode_batch, launch_l2_norm, launch_l2_norm_batch, launch_l2_norm_f32

After this PR: launch_amplitude_encode, launch_amplitude_encode_batch, launch_amplitude_encode_batch_f32, launch_l2_norm, launch_l2_norm_batch, launch_l2_norm_batch_f32, launch_l2_norm_f32


In this file around lines 251–276 and 351–376, amplitude_encode_batch_kernel and its _f32 variant compute input_base = sample_idx * input_len and then dereference reinterpret_cast<const double2*>(input_batch + input_base) + elem_pair (float2 in the f32 kernel). For odd input_len and sample_idx > 0, this base pointer is only 8-byte (double) or 4-byte (float) aligned, not 16-byte aligned, so the double2/float2 vector loads are potentially misaligned. This alignment pattern already existed in the original f64 batch kernel, and this PR copies it into the new f32 batch path. Please either enforce even input_len at the Rust call site, or rework the kernels to index from a properly aligned double2*/float2* base pointer with a scalar fallback.
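
The arithmetic behind this hazard can be checked on the CPU. The helper below (hypothetical, for illustration only) mirrors the kernel's offset computation and tests whether a sample's base pointer would satisfy the 16-byte alignment that double2 vector loads require:

```cpp
#include <cassert>
#include <cstddef>

// Sketch of the alignment check implied by the review comment: the batch
// kernel computes input_base = sample_idx * input_len (an element index) and
// reinterprets that position as a double2*. double2 loads require 16-byte
// alignment, so the byte offset from the (aligned) buffer start must be a
// multiple of 16.
bool double2_base_aligned(size_t sample_idx, size_t input_len) {
    const size_t input_base  = sample_idx * input_len;        // element index
    const size_t byte_offset = input_base * sizeof(double);   // 8 bytes/elem
    return byte_offset % 16 == 0;
}
```

With odd input_len, every odd-indexed sample lands 8 bytes off a 16-byte boundary, which is exactly the case an even-input_len check at the Rust call site (or a scalar fallback in the kernel) would have to cover.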

@viiccwen viiccwen force-pushed the add-batch-f32-amplitude-encoding branch from b6a4f4f to ce9d876 Compare February 12, 2026 03:59
@ryankert01
Copy link
Member

ryankert01 commented Feb 21, 2026

Please help resolve conflicts


Copilot AI left a comment


Pull request overview

This PR adds comprehensive float32 amplitude batch encoding support to the QDP engine, enabling zero-copy GPU encoding from PyTorch float32 CUDA tensors.

Changes:

  • Added f32 batch amplitude encoding kernels and GPU pointer APIs in core
  • Refactored GpuStateVector::new_batch to accept precision parameter for flexible buffer allocation
  • Updated all batch encoding call sites to specify output precision

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 1 comment.

Summary per file:

  • qdp/qdp-kernels/src/amplitude.cu — Added f32 batch amplitude kernel (amplitude_encode_batch_kernel_f32) and launcher, plus f32 batch L2 norm support
  • qdp/qdp-kernels/src/lib.rs — Added FFI declarations for f32 batch kernels and stubs for non-CUDA builds
  • qdp/qdp-core/src/lib.rs — Added encode_batch_from_gpu_ptr_f32 public API with validation and added precision() accessor
  • qdp/qdp-core/src/gpu/memory.rs — Refactored new_batch to accept Precision parameter for f32 or f64 buffer allocation
  • qdp/qdp-core/src/gpu/encodings/mod.rs — Added default encode_batch_from_gpu_ptr_f32 trait method
  • qdp/qdp-core/src/gpu/encodings/amplitude.rs — Implemented f32 batch encoding with norm validation
  • qdp/qdp-core/src/gpu/encodings/{angle,basis,iqp}.rs — Updated new_batch calls to pass Precision::Float64
  • qdp/qdp-core/src/encoding/mod.rs — Updated streaming pipeline to use engine.precision()
  • qdp/qdp-python/src/lib.rs — Added validation for f32/f64 amplitude tensors; 1D f32 supported, 2D f32 returns clear error
  • qdp/qdp-core/tests/gpu_ptr_encoding.rs — Comprehensive f32 batch tests covering success and error paths
  • qdp/qdp-core/tests/dlpack.rs — Updated to pass precision to new_batch
  • testing/qdp/test_bindings.py — Added tests for f32 input with both f32 and f64 engine precision
  • qdp/qdp-python/tests/test_dlpack_validation.py — Updated to verify 1D f32 works and 2D f32 gives clear error


@viiccwen viiccwen force-pushed the add-batch-f32-amplitude-encoding branch from ce9d876 to 3d8485f Compare March 2, 2026 17:38

Development

Successfully merging this pull request may close issue #1028.

4 participants