[Feat] Add pipelined pointer-return kernel for find* at high load factor by rhdong · Pull Request #251 · NVIDIA-Merlin/HierarchicalKV

rhdong · 2026-02-27T00:51:58Z

Problem

find* (pointer-return find) shows a ~61% throughput cliff at λ=1.0, dropping from ~7.0 to ~2.7 B-KV/s, while value-copy find stays stable at ~3.9 B-KV/s. This is logically wrong — find* does strictly less work than find, so it should never be slower.

Root cause: find* dispatches to tlp_lookup_ptr_kernel_with_filter, a 1-thread-per-key kernel that scans bucket digests serially in 9 iterations of 16 digests each, relying on empty-slot early termination. At λ=1.0, all 128 slots are occupied, so every miss scans all 9 iterations instead of ~2.
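To make the early-termination effect concrete, here is a minimal host-side C++ model of a serial digest scan. It is a sketch, not the kernel from the repository: `serial_scan`, `ScanResult`, and the empty-digest marker `kEmptyDigest = 0` are hypothetical names and values chosen for illustration. Only the 128-slot bucket and the 16-digests-per-iteration step come from the description above. The model starts at slot 0, so a full bucket takes 128/16 = 8 steps, whereas the real kernel is reported to take 9.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kBucketSize = 128;  // slots per bucket (from the PR text)
constexpr std::size_t kChunk = 16;        // digests examined per iteration (from the PR text)
constexpr std::uint8_t kEmptyDigest = 0;  // hypothetical empty-slot marker

struct ScanResult {
  int slot;        // index of the first matching digest, or -1 on a miss
  int iterations;  // number of 16-digest steps that were needed
};

// One "thread" scans the bucket chunk by chunk. It stops at the first
// matching digest or at the first empty slot, whichever comes first.
ScanResult serial_scan(const std::array<std::uint8_t, kBucketSize>& digests,
                       std::uint8_t target) {
  int iterations = 0;
  for (std::size_t base = 0; base < kBucketSize; base += kChunk) {
    ++iterations;
    for (std::size_t i = base; i < base + kChunk; ++i) {
      if (digests[i] == target) return {static_cast<int>(i), iterations};
      if (digests[i] == kEmptyDigest) return {-1, iterations};  // early exit
    }
  }
  return {-1, iterations};  // full bucket and no match: every chunk was read
}
```

In this model a miss on a bucket with 20 occupied slots exits in the second step, while a miss on a full bucket has no empty slot to stop at and reads every chunk.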

Solution

Add lookup_ptr_kernel_with_pipeline — a stripped-down version of the value-copy lookup_kernel_with_io_pipeline_v1 that returns V* pointers instead of copying values.

- Thread model: 32 threads per key (cooperative group), 4 keys/block
- Pipeline: 3-stage double-buffered (vs 4-stage in value-copy):
  1. Digest prefetch (all 128 digests in 1 parallel step)
  2. Digest match + key prefetch
  3. Key verify + write pointer (no value copy)
- Dispatch: `load_factor > 0.875 && max_bucket_size == 128` selects the pipelined kernel; otherwise the fast TLP kernel is used as before.
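The dispatch rule can be sketched as a small host-side predicate. This is an illustration under stated assumptions, not the code in the PR: the enum `FindPtrKernel` and the function `select_find_ptr_kernel` are hypothetical names. The threshold 0.875 and the bucket size 128 are taken from the description above.

```cpp
#include <cstddef>

// Hypothetical labels for the two kernels named in the description.
enum class FindPtrKernel {
  kTlpWithFilter,  // stands for tlp_lookup_ptr_kernel_with_filter
  kPipelined       // stands for lookup_ptr_kernel_with_pipeline
};

// Picks the pipelined kernel only for nearly full tables with 128-slot
// buckets. The comparison is strict, so a load factor of exactly 0.875
// still takes the TLP path.
FindPtrKernel select_find_ptr_kernel(float load_factor,
                                     std::size_t max_bucket_size) {
  const bool high_load = load_factor > 0.875f;
  const bool supported_bucket = (max_bucket_size == 128);
  return (high_load && supported_bucket) ? FindPtrKernel::kPipelined
                                         : FindPtrKernel::kTlpWithFilter;
}
```

Keeping the TLP kernel as the default is consistent with the benchmark, where it is the faster option at λ=0.50 and λ=0.75.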

Benchmark (H100 NVL, pure HBM, dim=8, capacity=128M)

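| λ | find* before (B-KV/s) | find* after (B-KV/s) | find, reference (B-KV/s) | Δ (find* after vs. before) |
|------|-------|-------|-------|-------|
| 0.50 | 6.977 | 6.942 | 3.910 | -0.5% |
| 0.75 | 5.954 | 5.963 | 3.881 | +0.2% |
| 1.00 | 2.688 | 4.366 | 3.929 | +62% |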

find* at λ=1.0 is now 11% faster than find (was 32% slower). No meaningful regression at lower load factors: the changes at λ=0.50 and λ=0.75 are within ±0.5%.
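The percentages quoted here follow directly from the table. A minimal C++ sketch of the arithmetic is shown below; `relative_change_percent` is a hypothetical helper, and the figures in the comments are the throughput values from the table in B-KV/s.

```cpp
#include <cmath>

// Relative change of `after` with respect to `before`, in percent.
double relative_change_percent(double before, double after) {
  return (after / before - 1.0) * 100.0;
}

// Same quantity rounded to the nearest whole percent.
long rounded_change_percent(double before, double after) {
  return std::lround(relative_change_percent(before, after));
}

// Inputs taken from the benchmark table:
//   rounded_change_percent(2.688, 4.366)  find* at λ=1.0, after vs. before
//   rounded_change_percent(3.929, 4.366)  find* after vs. find at λ=1.0
//   rounded_change_percent(3.929, 2.688)  find* before vs. find at λ=1.0
//   rounded_change_percent(6.977, 2.688)  find* before, λ=0.50 vs. λ=1.0
```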


github-actions · 2026-02-27T00:53:37Z

Documentation preview

https://nvidia-merlin.github.io/HierarchicalKV/review/pr-251

rhdong wants to merge 1 commit into NVIDIA-Merlin:master from rhdong:feat/pipelined-find-ptr
