[Feat] Add pipelined pointer-return kernel for find* at high load factor#251

Open
rhdong wants to merge 1 commit into NVIDIA-Merlin:master from rhdong:feat/pipelined-find-ptr

Conversation


@rhdong rhdong commented Feb 27, 2026


## Problem

`find*` (pointer-return find) shows a ~61% throughput cliff at λ=1.0,
dropping from ~7.0 to ~2.7 B-KV/s, while value-copy `find` stays stable
at ~3.9 B-KV/s. This is logically wrong — `find*` does strictly less
work than `find`, so it should never be slower.

**Root cause:** `find*` dispatches to `tlp_lookup_ptr_kernel_with_filter`,
a 1-thread-per-key kernel that scans bucket digests serially in 9
iterations of 16 digests each, relying on empty-slot early termination.
At λ=1.0, all 128 slots are occupied, so every miss scans all 9
iterations instead of ~2.
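
To make the cliff concrete, here is a hypothetical CPU-side sketch of the early-terminating digest scan (constants and the empty-slot marker are illustrative, not the actual kernel code; the PR counts 9 iterations in the real kernel, while this flat 128/16 sketch caps at 8):

```cpp
#include <array>
#include <cstdint>

// Illustrative constants; kEmptyDigest is an assumed "slot empty" marker.
constexpr int kBucketSize = 128;
constexpr int kDigestsPerIt = 16;
constexpr uint8_t kEmptyDigest = 0xFF;

// How many 16-digest iterations a missing key scans before an empty
// slot lets it bail out. With a full bucket, no empty slot exists and
// every iteration runs.
int iterations_scanned(const std::array<uint8_t, kBucketSize>& digests) {
  int it = 0;
  for (int base = 0; base < kBucketSize; base += kDigestsPerIt) {
    ++it;
    for (int i = 0; i < kDigestsPerIt; ++i) {
      if (digests[base + i] == kEmptyDigest) return it;  // early exit
    }
  }
  return it;  // λ = 1.0 path: full scan on every miss
}
```

At λ=1.0 no slot is empty, so every miss pays the full scan serially on a single thread; at lower load factors an empty digest typically appears within the first couple of iterations.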

## Solution

Add `lookup_ptr_kernel_with_pipeline` — a stripped-down version of the
value-copy `lookup_kernel_with_io_pipeline_v1` that returns `V*` pointers
instead of copying values.

- **Thread model:** 32 threads per key (cooperative group), 4 keys/block
- **Pipeline:** 3-stage double-buffered (vs 4-stage in value-copy):
  1. Digest prefetch (all 128 digests in 1 parallel step)
  2. Digest match + key prefetch
  3. Key verify + write pointer (no value copy)
- **Dispatch:** `load_factor > 0.875 && max_bucket_size == 128` selects
  the pipelined kernel; otherwise the fast TLP kernel is used as before.
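
A minimal sketch of the dispatch predicate, assuming it reads exactly as stated above (the function name is hypothetical; the kernel names are from this PR):

```cpp
#include <cstddef>

// Hypothetical dispatch predicate: select lookup_ptr_kernel_with_pipeline
// only when the bucket layout matches and occupancy is high enough that
// the TLP kernel's empty-slot early exit stops paying off.
inline bool use_pipelined_ptr_kernel(float load_factor,
                                     std::size_t max_bucket_size) {
  return load_factor > 0.875f && max_bucket_size == 128;
}
```

Below the threshold the TLP kernel keeps its ~2-iteration average scan and remains the faster choice, which matches the flat λ=0.50/0.75 rows in the benchmark.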

## Benchmark (H100 NVL, pure HBM, dim=8, capacity=128M; throughput in B-KV/s)

| λ    | find* before | find* after | find (ref) | Δ (find* after vs before) |
|------|--------------|-------------|------------|---------------------------|
| 0.50 | 6.977        | 6.942       | 3.910      | -0.5%                     |
| 0.75 | 5.954        | 5.963       | 3.881      | +0.2%                     |
| 1.00 | 2.688        | **4.366**   | 3.929      | **+62%**                  |

`find*` at λ=1.0 is now 11% faster than `find` (was 32% slower).
No regression at low load factors.
