[WIP: Do not review] NanoVDB: VoxelBlockManager CPU port of decodeInverseMaps #2186
sifakis wants to merge 26 commits into AcademySoftwareFoundation:master
Conversation
Add WenoLeafPtrs<BuildT>, resolveWenoLeafPtrs, and computeWenoStencil as static __device__ members of VoxelBlockManager<Log2BlockWidth>.

These implement the first phase of a two-function WENO5 stencil gather: resolveWenoLeafPtrs performs exactly 3 probeLeaf calls (one per axis) to resolve neighbor leaf pointers; computeWenoStencil fills a caller-provided array with the 19 global sequential indices using WenoPt<i,j,k>::idx.

voxelOffset arithmetic uses octal notation: the NanoVDB leaf layout encodes (x,y,z) as x*64+y*8+z, so the x/y/z strides are 0100/010/1 in octal. WenoPt<i,j,k>::idx is used throughout to remain independent of any future re-alignment with OpenVDB's NineteenPt (which uses a different convention).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
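The octal-stride convention above can be checked with a few lines. `voxelOffset` here is an illustrative stand-in, not the library's actual helper:

```cpp
#include <cstdint>

// An 8^3 NanoVDB leaf linearizes local (x, y, z) as x*64 + y*8 + z,
// so the per-axis strides are 0100, 010, and 1 in octal.
constexpr uint32_t voxelOffset(uint32_t x, uint32_t y, uint32_t z)
{
    return x * 0100 + y * 010 + z; // octal strides: 64, 8, 1
}

// Moving one voxel along an axis adds exactly one stride.
static_assert(voxelOffset(1, 0, 0) - voxelOffset(0, 0, 0) == 0100);
static_assert(voxelOffset(0, 1, 0) - voxelOffset(0, 0, 0) == 010);
static_assert(voxelOffset(0, 0, 1) - voxelOffset(0, 0, 0) == 1);
static_assert(voxelOffset(4, 4, 4) == 292); // the leaf-center offset
```

The octal literals make the stride-per-axis relationship visually explicit, which is the stated motivation for the notation.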
Adds the ex_voxelBlockManager_host_cuda example demonstrating the CPU and CUDA VoxelBlockManager implementations, along with design documentation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Remove DecodeInverseMapsCPUPlan.md (implementation complete) and distill its non-obvious design decisions into the knowledge base:
- §11: decodeInverseMaps is intentionally single-threaded/stateless; the caller distributes blocks; contrast with the cooperative GPU version.
- §12: mPrefixSum is bypassed for bulk access — recomputing from raw mask words via buildMaskPrefixSums is cheaper than unpacking 9-bit fields; mPrefixSum is still used for the cross-word offset in Step 5.
- §13: the output fill is a range-fill plus contiguous copy (not a scatter) because shuffleDownMask produces a sorted compacted array; std::fill/std::copy alignment caveat when output arrays come from TLS or stack pointers.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
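The §12 recompute-from-raw-words idea can be sketched as an exclusive prefix of per-word popcounts over the 8 uint64_t words of a 512-bit leaf mask. The names and signature below are assumptions for illustration; the real helper may differ:

```cpp
#include <array>
#include <cstdint>

// Kernighan-style popcount: clears the lowest set bit per iteration.
inline uint32_t popcountWord(uint64_t v)
{
    uint32_t n = 0;
    while (v) { v &= v - 1; ++n; }
    return n;
}

// Exclusive prefix of active-voxel counts: prefix[w] = number of active
// voxels in words 0..w-1. Cheaper than unpacking packed 9-bit fields.
inline std::array<uint32_t, 8> buildMaskPrefixSums(const std::array<uint64_t, 8>& words)
{
    std::array<uint32_t, 8> prefix{};
    uint32_t running = 0;
    for (int w = 0; w < 8; ++w) {
        prefix[w] = running;
        running += popcountWord(words[w]);
    }
    return prefix;
}
```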
Implementation complete; design rationale distilled into VBMImplementationKnowledge.md.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Design reference for the per-block stencil gather kernel: decodes inverse maps into block-local scratch, then resolves neighbor leaf pointers and fills N-point stencil index arrays for all active voxels in the block. WENO5 (N=19, R=3) is the motivating instance; the architecture is stencil-agnostic.

Covers the GPU inner loop, the CPU SIMD batch design (SIMDw=16, probeLeaf dedup), the unified StencilLeafPtrs template, and reach-R generalization considerations.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
- §3: A stencil type used as a template parameter needs the index→offsets direction (a for-each-slot gather loop), not the offsets→index direction of WenoPt. Clarify the relationship to BaseStencil/WenoStencil: geometry-only descriptor, no accessor coupling.
- §4: Kernel lambda signature std::array<ValueType,K> kernel(const ValueType* u); output is a homogeneous std::array (not a tuple); K=1 degenerates to a scalar; SoA output layout results[k][BlockWidth] for SIMD efficiency.
- Renumber §4–§8 → §5–§9; update open questions accordingly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
…notes
Adds a self-contained test (lift_test.cpp) exploring a generic SIMD-lifting
abstraction: given a scalar tuple→tuple kernel, liftToSimd<W> produces an
SoA-wide version that loops over W lanes and is the auto-vectorization target.
The motivating kernel is WENO5 normSqGrad (19-point stencil, matching
WenoStencil::normSqGrad from Stencils.h). The six weno5() calls vectorize
cleanly; godunovsNormSqrd() blocks vectorization in two distinct ways
depending on how it is written:
1. std::max / bool isOutside ternaries → "control flow in loop"
2. float sign + fmaxf (no ternaries) → "no vectype for stmt" due to
GCC's inability to see through std::tuple's recursive-inheritance
struct layout in GIMPLE alias analysis
INVESTIGATION.md documents all experiments, findings, current blockers,
and proposed next steps (pointer-cache approach, Clang comparison, etc.).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Evangelos Sifakis <esifakis@gmail.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
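The liftToSimd abstraction described above can be sketched as a higher-order function that wraps a scalar kernel in a W-lane SoA loop, which then becomes the auto-vectorization target. The shape below is illustrative, reconstructed from the commit message, not the actual lift_test.cpp code:

```cpp
#include <array>
#include <cstddef>

// Given a scalar (a, b) -> out kernel, produce a W-wide SoA version.
// The fixed-count inner loop is what the compiler is expected to vectorize.
template<std::size_t W, class F>
auto liftToSimd(F scalarKernel)
{
    return [scalarKernel](const std::array<float, W>& a,
                          const std::array<float, W>& b) {
        std::array<float, W> out{};
        for (std::size_t i = 0; i < W; ++i)
            out[i] = scalarKernel(a[i], b[i]);
        return out;
    };
}
```

A kernel with branchless lane bodies (like the six weno5() calls) vectorizes under this scheme; per-lane control flow (the std::max / ternary forms cited above) defeats it, which is exactly the failure mode the commit documents.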
Introduce nanovdb::util::Simd<T,W> (simd_test/Simd.h) — a minimal header-only SIMD abstraction backed by std::array<T,W> with arithmetic operators, SimdMask, min/max, and where(). Mirrors the C++26 std::simd interface for forward compatibility.

Rewrite the WENO5 normSqGrad kernel as a template on T:
- T=float : scalar __hostdev__ path for GPU (one thread per voxel)
- T=Simd<float,W> : W-wide CPU path (one call per batch)

A single templated godunovsNormSqrd + normSqGrad definition serves both execution contexts with no #ifdef, structurally matching Stencils.h. Clang 18 vectorizes the Simd<float,16> instantiation (691 ymm instructions in the hot function, assembly-verified); GCC 13 does not.

Update INVESTIGATION.md with the full scoreboard, both approaches, and next steps (GCC intrinsics path, benchmarking, nanovdb/util/ integration).

Signed-off-by: Efstathios Sifakis <esifakis@cs.wisc.edu>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
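A minimal skeleton of the std::array-backed design is sketched below: a broadcast constructor, lane-wise operators, a mask type, and the 3-arg select-style where() the kernels expect. The real Simd.h has a larger operator set; everything here is illustrative:

```cpp
#include <array>
#include <cstddef>

template<class T, std::size_t W>
struct Simd {
    std::array<T, W> lane{};
    Simd() = default;
    explicit Simd(T s) { lane.fill(s); } // broadcast
    friend Simd operator+(Simd a, const Simd& b)
    { for (std::size_t i = 0; i < W; ++i) a.lane[i] += b.lane[i]; return a; }
    friend Simd operator*(Simd a, const Simd& b)
    { for (std::size_t i = 0; i < W; ++i) a.lane[i] *= b.lane[i]; return a; }
};

template<class T, std::size_t W>
struct SimdMask { std::array<bool, W> bit{}; };

template<class T, std::size_t W>
SimdMask<T, W> operator<(const Simd<T, W>& a, const Simd<T, W>& b)
{
    SimdMask<T, W> m;
    for (std::size_t i = 0; i < W; ++i) m.bit[i] = a.lane[i] < b.lane[i];
    return m;
}

// 3-arg select form: lane-wise m ? a : b.
template<class T, std::size_t W>
Simd<T, W> where(const SimdMask<T, W>& m, const Simd<T, W>& a, const Simd<T, W>& b)
{
    Simd<T, W> r;
    for (std::size_t i = 0; i < W; ++i) r.lane[i] = m.bit[i] ? a.lane[i] : b.lane[i];
    return r;
}
```

The fixed trip count W and the trivial array layout are what allow a vectorizer to collapse each loop into a handful of vector instructions.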
Auto-detect <experimental/simd> (Parallelism TS v2) via __has_include and __cpp_lib_experimental_parallel_simd. When available, Simd<T,W> and SimdMask<T,W> become thin wrappers around fixed_size_simd / fixed_size_simd_mask, delegating all arithmetic to the standard type.

The TS v2 where(mask, v) is a 2-arg masked-assignment proxy; wrap it into the 3-arg select(mask, a, b) form expected by the kernels.

Verified with clang++-18 -std=c++26: both paths produce identical assembly (1275 ymm instructions, PASS on all 16 lanes), confirming Clang optimizes through the wrapper completely.

Signed-off-by: Efstathios Sifakis <esifakis@cs.wisc.edu>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Document the std::experimental::simd backend alongside the std::array default, including the TS v2 where() adaptation, the auto-detection mechanism, and the assembly comparison showing byte-for-byte identical output between the two backends under Clang 18. Update the vectorization results table and open questions accordingly.

Signed-off-by: Efstathios Sifakis <esifakis@cs.wisc.edu>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
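The where() adaptation described above turns a 2-arg masked-assignment proxy (the TS v2 shape) into the 3-arg select the kernels use. To keep the sketch self-contained, the proxy below mimics the TS shape over plain arrays instead of pulling in <experimental/simd>; all names are illustrative:

```cpp
#include <array>
#include <cstddef>

// 2-arg form: where(mask, v) = rhs assigns rhs only to masked lanes of v.
template<class T, std::size_t W>
struct WhereProxy {
    const std::array<bool, W>& mask;
    std::array<T, W>&          value;
    void operator=(const std::array<T, W>& rhs)
    { for (std::size_t i = 0; i < W; ++i) if (mask[i]) value[i] = rhs[i]; }
};

template<class T, std::size_t W>
WhereProxy<T, W> where(const std::array<bool, W>& m, std::array<T, W>& v)
{
    return WhereProxy<T, W>{m, v};
}

// 3-arg select built on the 2-arg proxy:
// start from b, then overwrite the masked lanes with a.
template<class T, std::size_t W>
std::array<T, W> select(const std::array<bool, W>& m,
                        const std::array<T, W>& a, std::array<T, W> b)
{
    where(m, b) = a;
    return b;
}
```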
StencilKernel.h — new prototype header:
- BaseStencilKernel<T, SIZE>: owns mValues[], mDx2, mInvDx2; no grid coupling
- WenoStencilKernel<T>: derives from above, provides normSqGrad()
- WENO5<T> and GodunovsNormSqrd<T, MaskT>: free functions mirroring Stencils.h
- T=float for GPU scalar path, T=Simd<float,W> for CPU batch path
lift_test.cpp — rewritten to use WenoStencilKernel<T> directly:
- SIMD and scalar reference paths both instantiate the same class
- dx passed to constructor; mValues populated via operator[]
Simd.h — refinements:
- Simd<T,W> and SimdMask<T,W> in Backend A are now pure type aliases for
stdx::fixed_size_simd / fixed_size_simd_mask (no wrapper struct)
- element_aligned_tag / element_aligned: portable load/store tag, always
present; aliases stdx::element_aligned_tag in Backend A, dummy struct in B
- Backend B load constructor and store() accept element_aligned_tag (defaulted)
- NANOVDB_NO_STD_SIMD opt-out flag to force Backend B
INVESTIGATION.md — updated:
- Approach B section updated to reflect class hierarchy instead of free functions
- Backend B GCC note: the struct-access failure was specific to Approach A's
liftToSimd outer-lane loop; Backend B's fixed-count operator loops do vectorize
on GCC when used with the Generic-T class hierarchy
- New ymm tables for both backends under GCC (Backend A: 1267 total,
Backend B: 619 total); both pass correctness
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
…chPtrs)

Add the full CPU batch neighbor-leaf resolution design to the planning doc:
- §6: Replace the StencilLeafPtrs struct with a layered design — a shared 3×3×3 bit encoding for probedMask/ptrs[27], stencil-specific batchPtrs population (batchPtrs[4][SIMDw] for WENO5, batchPtrs[3][3][3][SIMDw] for the box stencil), and the GPU scalar design note kept separate.
- §8d: Update the lazy-probe section to reference ptrs[27] and the 27-bit probedMask; add the batchPtrs population step (Phase 2) after the probeLeaf loop.
- §8e: Update the computeNeededDirs direction table to use 3×3×3 bit positions (bits 4, 10, 12, 14, 16, 22 for the WENO5 face neighbors).
- §8f/§8g: Minor notation updates to match the ptrs[27] naming.
- §9: Resolve the ptrs-layout and nExtraLeaves open questions; add prototype scope.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
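The shared 3×3×3 bit encoding can be reconstructed from the face-neighbor bit positions quoted above (4, 10, 12, 14, 16, 22): each offset component in {-1, 0, +1} is biased by one and packed base-3. The function name is illustrative:

```cpp
// Map a neighbor-leaf offset (di, dj, dk) in {-1, 0, +1}^3 to its bit
// position in the 27-bit probedMask / its slot in ptrs[27].
constexpr int neighborBit(int di, int dj, int dk)
{
    return (di + 1) * 9 + (dj + 1) * 3 + (dk + 1); // 0..26, center = 13
}

static_assert(neighborBit( 0,  0,  0) == 13); // center leaf
static_assert(neighborBit(-1,  0,  0) ==  4); // the six WENO5 face
static_assert(neighborBit( 0, -1,  0) == 10); // neighbors land on bits
static_assert(neighborBit( 0,  0, -1) == 12); // 4, 10, 12, 14, 16, 22,
static_assert(neighborBit( 0,  0, +1) == 14); // matching the direction
static_assert(neighborBit( 0, +1,  0) == 16); // table in §8e
static_assert(neighborBit(+1,  0,  0) == 22);
```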
Standalone CPU-only executable that verifies the neighbor leaf resolution design from StencilGather.md §8d–§8f:
- For each VBM block: calls decodeInverseMaps, recomputes nLeaves from jumpMap, then processes SIMDw=16 batches with the full probedMask / lazy-probeLeaf / batchPtrs[4][SIMDw] pipeline.
- Does not call computeStencil. Instead verifies batchPtrs against a direct probeLeaf reference for all 18 non-center WENO5 stencil offsets that cross leaf boundaries.
- Passes at 0.1, 0.25, 0.5, and 0.9 occupancy (2.3M–2.9M lane checks each).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Refactor computeNeededDirs to accept a pre-expanded Simd<uint32_t,SIMDw> vector, moving sentinel/masking responsibility to the single gather site where leafMask is known:
- kSentinelExpanded = expandVoxelOffset(292) = 0x41044104 (constexpr)
- The caller broadcasts the sentinel to all lanes, then overwrites leafMask lanes with real expandVoxelOffset() values before calling computeNeededDirs
- computeNeededDirs is now a pure add+reduce with no masking or cross-check

Carry trick (§8e): expandVoxelOffset packs lz/lx/ly into 6 guarded 3-bit groups; a single vpaddd ymm × 2 + vpor + vpand + shuffle-tree detects all six WENO5 directions simultaneously. kExpandCarryK = 0x514530C3.

AVX2 codegen confirmed via objdump:
- computeNeededDirs: vpbroadcastd + 2×vpaddd ymm + vpor/vpand ymm + vextracti128/vpsrldq shuffle-tree, no branches or calls in the hot path
- activeMask/leafMask in runPrototype: vpcmpeqd ymm × 4 + vmovmskps ymm × 2
- Sentinel broadcast: 0x41044104 literal → vpbroadcastd → 2×vmovdqa ymm

An always-on scalar cross-check runs at every computeNeededDirs call site. verifyComputeNeededDirsSentinel() tests both the sentinel carry property and the straddle-lane non-pollution scenario before runPrototype().

StencilGather.md §8e and §8f updated to match the new API and codegen notes. Phase 1 prototype marked complete in §9.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Documents the BatchAccessor — the SIMD-batch analog of ValueAccessor — developed from the ex_stencil_gather_cpu Phase 1 prototype discussion. Core concept: instead of caching the path to one leaf, cache the full 3×3×3 neighborhood of 27 leaf pointers around the current center leaf, serving SIMDw voxels per call.

Key design elements documented:

Eviction policy: fires on none_of(leafMask) only — straddle lanes do not evict. leafMask is the "partial-hit" signal with no scalar-accessor analog.

Prefetch coverage argument:
- WENO5 (R=3): the 6 extremal taps (±R,0,0), (0,±R,0), (0,0,±R) are necessary and sufficient — equivalent to the computeNeededDirs carry trick
- Box stencil (R=1): the 8 corner taps (±1,±1,±1) collectively cover all 26 non-center directions for any voxel position in the batch

Three-tier API:
- prefetch<di,dj,dk>(vo, leafMask, treeAcc)
- cachedGetValue<di,dj,dk>(vo, leafMask) — no treeAcc, cache assumed warm
- getValue<di,dj,dk>(vo, leafMask, treeAcc) — lazy combined (vanilla style)

Template <di,dj,dk> rationale vs a runtime Coord: compile-time direction bit, dead-axis elimination, VDB convention alignment; a runtime Coord overload is provided for generic stencil adapters.

AVX2 profile: offset arithmetic (vpaddd ymm), lane split (vpcmpgtd ymm), gather from ≤2 leaf pointers (vgatherdps×2 + vpblendvb) — both scalar bottlenecks from the Phase 1 prototype are eliminated.

StencilGather.md: add a cross-reference to BatchAccessor.md.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
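The box-stencil (R=1) coverage claim above is small enough to verify by brute force: for every voxel position in an 8³ leaf and every radius-1 tap, the neighbor-leaf direction the tap reaches must also be reached by one of the 8 corner taps. A standalone sketch, independent of the real headers:

```cpp
// Which neighbor leaf does local coordinate p + o fall in? (-1, 0, or +1)
inline int leafDir(int p, int o)
{
    const int q = p + o;
    return q < 0 ? -1 : (q > 7 ? 1 : 0);
}

// True iff every radius-1 tap direction is matched by some corner tap.
inline bool cornersCoverAllTaps()
{
    for (int px = 0; px < 8; ++px)
    for (int py = 0; py < 8; ++py)
    for (int pz = 0; pz < 8; ++pz)
    for (int ox = -1; ox <= 1; ++ox)
    for (int oy = -1; oy <= 1; ++oy)
    for (int oz = -1; oz <= 1; ++oz) {
        if (!ox && !oy && !oz) continue; // center tap never leaves the leaf
        bool covered = false;
        for (int sx = -1; sx <= 1 && !covered; sx += 2)
        for (int sy = -1; sy <= 1 && !covered; sy += 2)
        for (int sz = -1; sz <= 1 && !covered; sz += 2)
            covered = leafDir(px, sx) == leafDir(px, ox)
                   && leafDir(py, sy) == leafDir(py, oy)
                   && leafDir(pz, sz) == leafDir(pz, oz);
        if (!covered) return false;
    }
    return true;
}
```

The reason it holds: on an axis where the tap does not cross, the voxel cannot sit at both boundaries at once, so a corner sign exists that also stays inside; on a crossing axis, the matching corner sign crosses the same way.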
Introduces BatchAccessor<BuildT,ValueT,VoxelOffsetT,LeafIDT,PredicateT>, the SIMD analog of NanoVDB's ValueAccessor. Instead of caching the path to one leaf, it caches the 27-entry 3×3×3 neighbor pointer table around the current center leaf and serves a SIMD batch of LaneWidth voxels per call.

Key design decisions
--------------------
- Eager center: the constructor and advance() populate mLeafNeighbors[13] directly (O(1), no probeLeaf), so cachedGetValue<0,0,0> is valid immediately and the probe loop never needs a center special-case.
- SWAR neededMask: prefetch<di,dj,dk> expands the 9-bit voxel offsets into a 15-bit packed form (lz@[0:2], lx@[6:8], ly@[12:14]) using SIMD bitwise ops, then adds a compile-time packed stencil offset and checks the carry bits for crossing detection. One vpaddw YMM instruction covers all 16 lanes; Clang folds packed_d into the blend constants at compile time, reducing the expand+blend+add to 5 SIMD instructions.
- Heterogeneous where: Simd.h gains where<T,U,W>(SimdMask<U,W>, ...) so a PredicateT=SimdMask<float,W> can gate a VoxelOffsetT=Simd<uint16_t,W> blend without explicit casting. The array backend uses a trivial bool loop; the stdx backend converts via a bool[] round-trip.
- Correctness verified in-process: stencil_gather_cpu.cpp integrates BatchAccessor as an alternate execution path and cross-checks all 18 WENO5 tap directions against direct tree references (12.3M lane checks).

Simd.h additions (array backend)
--------------------------------
- SimdMask<T,W>: converting constructor from SimdMask<U,W>
- Simd<T,W>: operator|, &, ^, <<(Simd), >>(Simd)
- where<T,U,W>: heterogeneous mask overload (both backends)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
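The carry-based crossing test behind the SWAR neededMask can be shown per axis with scalars. The packing below follows the field layout quoted above (lz@[0:2], lx@[6:8], ly@[12:14], with 3-bit guard gaps); the helper names are illustrative:

```cpp
#include <cstdint>

// Pack the three 3-bit local coordinates with guard gaps between fields.
constexpr uint32_t expandLocal(uint32_t lx, uint32_t ly, uint32_t lz)
{
    return lz | (lx << 6) | (ly << 12);
}

// For one axis with local in [0,8) and stencil offset d in [-7,7]:
// positive d: a carry out of the 3-bit field means local+d crossed the leaf;
// negative d: add (8 + d) instead; a MISSING carry means it crossed.
constexpr bool crossesAxis(uint32_t local, int d)
{
    const uint32_t add   = uint32_t(d >= 0 ? d : 8 + d);
    const bool     carry = ((local + add) >> 3) & 1u;
    return d >= 0 ? carry : !carry;
}

static_assert( crossesAxis(7, +1)); // 7+1 = 8 leaves the leaf
static_assert(!crossesAxis(4, +3)); // 4+3 = 7 stays inside
static_assert( crossesAxis(5, +3)); // 5+3 = 8 crosses
static_assert( crossesAxis(0, -1)); // 0-1 crosses downward
static_assert(!crossesAxis(3, -3)); // 3-3 = 0 stays inside

// With the guard gaps, one packed add tests all three axes at once:
// the x-axis carry of expandLocal(7,0,0) + (+1 in x) lands on guard bit 9.
static_assert(((expandLocal(7, 0, 0) + expandLocal(1, 0, 0)) >> 9) & 1u);
```

In the vector version, one vpaddw over 16 packed lanes performs all 48 of these per-axis tests simultaneously, which is the whole point of the encoding.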
Simd.h: - Add scalar_traits<T> / scalar_traits_t<T>: extracts element type from plain scalars (identity) and Simd<T,W> (T); used by BatchAccessor static_asserts and as the shift-count type for the new uniform-shift ops. - Add Simd::operator<<(T) / operator>>(T): uniform scalar shift (all lanes by the same immediate). Maps to vpsllw imm8 / vpsrlw imm8 on x86 — distinguished from the existing per-lane Simd<<Simd overload which would require the nonexistent vpsllvw instruction for 16-bit lanes. BatchAccessor::prefetch SWAR section: - Replace hard-coded uint16_t/uint32_t casts with VoxelOffsetScalarT (= scalar_traits_t<VoxelOffsetT>) so the code is correct for any unsigned 16+-bit instantiation, not just Simd<uint16_t,W>. - Add class-level static_asserts (unsigned + sizeof >= 2) with explanatory messages referencing the SWAR carry-detection contract. - Remove static_assert(LaneWidth >= 16): was a performance aspiration, not a correctness requirement; SWAR works for any LaneWidth >= 1. - Use 'auto' for expanded / packed_lc / packed_sum (type already expressed by the initializer); keep explicit uint32_t for hor_or/hor_and/s where the width is a deliberate semantic choice. - Replace kMask15 hex literal 0x71C7u with 0b111'000'111'000'111u (binary makes the three 3-bit mask fields and three 3-bit gaps visually explicit). - Use vo << VoxelOffsetScalarT(9) (uniform shift) instead of vo << VoxelOffsetT(VoxelOffsetScalarT(9)) (broadcast-then-per-lane). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
I know that this is marked as not for review yet, but I just wanted to comment and say that there is one aspect of this PR which is something I've wanted in OpenVDB for a long time; I brought it up in our TSC meetings again recently. A lot of the time when we use the ValueAccessor, it is not true random access but stencil access patterns, and the fact that the ValueAccessor can only store one leaf node means you end up thrashing its cache when accessing different neighboring leaf nodes along leaf boundaries.

I have done a little experimentation in this area, and the lazy leaf-neighbor pattern you use here is what I was envisioning, though I was contemplating whether a templated stencil argument (i.e. 6-neighbor vs 27-neighbor) might provide improved performance depending on the use case. I also hadn't settled on a name - NeighborAccessor, StencilAccessor, etc.

I would be very interested in comparing the performance of a similar structure for OpenVDB that doesn't include the SIMD logic against the ValueAccessor (also without SIMD). If it outperforms the ValueAccessor, I think we should look to switch some of our neighbor use cases across. The other big issue with the ValueAccessor IMO is the tight coupling with the Grid and Tree (@kmuseth - a good topic for a future TSC meeting I think).
…lingZeros
- Add 2-argument where(mask, target) = value proxy (stdx and array backends):
stdx-style masked assignment; encourages GCC to emit vpblendvb
- Add util::reduce(v, op) to both backends: tree-reduces to a scalar with
std::bit_or<>{} / std::bit_and<>{} etc.; replaces scalar horizontal loop
- Add scalar reduce(T, BinaryOp) identity overload for W=1 path
- Add util::countTrailingZeros(uint32_t) to nanovdb/util/Util.h: __hostdev__,
CUDA/HIP/__builtin_ctz/MSVC/De Bruijn dispatch; removes ad-hoc ctz from Simd.h
- BatchAccessor: use 2-arg where for packed_lc blend, util::reduce for hor_or/and,
util::countTrailingZeros in toProbe loop, hoist root ref out of loop body
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
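The De Bruijn branch of the countTrailingZeros dispatch mentioned above is the classic portable fallback: (v & -v) isolates the lowest set bit, and the multiply maps each power of two to a unique top-5-bit pattern. A standalone rendering (the real util::countTrailingZeros also has CUDA/HIP/builtin/MSVC branches):

```cpp
#include <cstdint>

// Portable De Bruijn count-trailing-zeros for nonzero 32-bit inputs.
inline uint32_t countTrailingZeros(uint32_t v)
{
    static const uint32_t table[32] = {
         0,  1, 28,  2, 29, 14, 24,  3, 30, 22, 20, 15, 25, 17,  4,  8,
        31, 27, 13, 23, 21, 19, 16,  7, 26, 12, 18,  6, 11,  5, 10,  9};
    // (v & -v) keeps only the lowest set bit; 0x077CB531 is a De Bruijn
    // sequence, so the top 5 bits of the product index uniquely.
    return table[((v & (0u - v)) * 0x077CB531u) >> 27];
}
```

This is the fallback used when neither a hardware instruction nor a compiler builtin is available; the dispatch described in the commit picks the cheapest variant per platform.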
Replace the per-lane scalar loop in cachedGetValue with a fully SIMD
gather chain that populates offsets, prefixSums, and maskWords without
any scalar iteration over lanes.
Pipeline (all in Simd<T,W>):
packed_sum (uint16_t)
→ ×1129 → >>10 → &31 : d_vec (0..26), stays uint16_t —
bits [10:14] of product lie below bit 16
so modular uint16_t multiply is exact
→ gather(mNeighborLeafIDs) : leaf_id_vec (uint32_t)
→ ×kStride : raw_idx (int32_t, null lanes → 0)
→ gather(offset_base) : mOffset per lane
→ gather(prefix_base) : mPrefixSum packed, then shift-extract field w
→ gather(mask_word_base + w) : valueMask().words()[w] per lane
Switch mLeafNeighbors[27] (const LeafT*) to mNeighborLeafIDs[27] (uint32_t)
with kNullLeafID = ~uint32_t(0) sentinel, enabling the flat-base SIMD gather
pattern. prefetch and advance updated accordingly.
Add simd_cast<DstT>(Simd<SrcT,W>) to Simd.h for widening (uint16_t → int32_t,
uint16_t → uint64_t, uint32_t → int32_t) used in gather index construction.
Debug cross-check (#ifndef NDEBUG) validates all three vectors against the
scalar reference path; 12M+ lane checks pass.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
§8a was stale ("prefetch uses a scalar loop") — prefetch has no per-lane
loop. Update with accurate description of its two-phase structure:
- SWAR expansion + vpblendvb + vpaddw: fully in YMM across all LaneWidth
lanes (vpsllw, vpor, vpand, vpblendvb, vpaddw)
- Horizontal reduction: unavoidable vextracti128 + vpand/vpor tree to
produce scalar hor_and / hor_or for the per-axis crossing decision
Include the actual Release assembly (ex_stencil_gather_cpu, -O3 -mavx2)
confirming the YMM path survived the mLeafNeighbors → mNeighborLeafIDs
encoding change.
§8c: update nullptr sentinel language to reflect kNullLeafID / valid_u32
mask in the SIMD gather chain.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
- Promote the SWAR encoding literals to class-scope uint16_t constants: kSwarXZMask (0x1C07), kSwarYMask (0x00E0), kSwarSentinel (4|4<<5|4<<10). Shared between prefetch and cachedGetValue; implicit conversion to VoxelOffsetT at Simd construction time.
- Add direction-extraction local constants in cachedGetValue: kSwarCarryMask (0x6318), kDirMul (1129), kDirMask (31).
- Rename w_vec -> wordIndex for clarity.
- Move the U64Traits alias inside #ifndef NDEBUG where it is only used.
- Replace non-ASCII characters (em dashes, arrows, element-of, comparison operators) with ASCII equivalents throughout.
- Add explicit parentheses around the d_u16 shift-then-mask chain to make operator precedence unambiguous.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
…liases

Simd.h:
- Add a scalar simd_cast<DstT>(SrcT) overload (degrades to static_cast).
- Add a scalar 2-arg where(bool, T&) masked-assignment proxy matching the SIMD WhereExpression form.
- Add scalar gather(const T*, int32_t) and gather_if(T&, bool, ...) to complete the scalar overload set alongside the existing where(bool,T,T).

BatchAccessor.h:
- Remove the LeafIDT template parameter (was reserved/unused; now derived).
- Add private class-scope LeafIDVecT and LeafDataVecT using conditional_t: plain uint32_t/uint64_t when LaneWidth==1, Simd<T,W> otherwise. This upholds the convention that scalar instantiations use underlying types directly rather than Simd<T,1> wrappers.
- Replace the local U32T/U64T aliases in cachedGetValue with the class-scope names.

stencil_gather_cpu.cpp:
- Drop the now-removed LeafIDT argument from the BAccT instantiation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
cachedGetValue is now fully vectorized end-to-end for ValueOnIndex grids. The scalar leaf->getValue(offset) loop is removed; the result is filled via a 2-arg where(isActive, result) = ... directly on the output argument so that leafMask-clear and inactive-voxel lanes are never touched.

Key design decisions recorded in BatchAccessor.md §8e:
- tapLeafOffset_i64 is widened to int64_t before *= kStride to avoid uint32_t overflow (kNullLeafID = 0xFFFFFFFF causes wild gather indices in uint32_t). simd_cast_if<int64_t>(dst, valid_u32, src) writes 0 for invalid lanes, keeping gather indices non-negative for vpgatherqq (signed int64_t).
- gather_if gains a MaskElemT template parameter to support heterogeneous masks: valid_u32 (SimdMask<uint32_t,W>) applied to uint64_t data fields.
- Activity check: ValueOnIndex::getValue returns 0 for inactive voxels. Detected as isActive = (maskWords & (1<<dest_yz)) != 0. Null-leaf lanes have maskWords=0 and are therefore implicitly handled by the same check.
- popcount uses a SWAR shift-and-add tree (popcount64 in Simd.h) rather than __builtin_popcountll: AVX2 has no 64-bit lane-wise popcount (VPOPCNTQ is AVX-512DQ only), and the scalar popcnt instruction is not vectorizable.
- mOffsetBase, mPrefixBase, mMaskWordBase are promoted to class-level const pointers, computed once in the constructor and shared across all 18 cachedGetValue instantiations in a WENO5 stencil gather.

Verified: debug build (-O0) + release build (-O3), 12,321,275 lane checks across all 18 WENO5 non-center taps, zero mismatches.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
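The SWAR shift-and-add popcount referenced above is standard bit-twiddling; a plain-C rendering is below (the vector version applies the same steps per 64-bit lane, which is why it needs no cross-lane instruction):

```cpp
#include <cstdint>

// Branch-free 64-bit popcount: sum bits pairwise, then in nibbles and bytes,
// then sum all bytes with one multiply.
inline uint64_t popcount64(uint64_t v)
{
    v = v - ((v >> 1) & 0x5555555555555555ull);                           // 2-bit sums
    v = (v & 0x3333333333333333ull) + ((v >> 2) & 0x3333333333333333ull); // 4-bit sums
    v = (v + (v >> 4)) & 0x0F0F0F0F0F0F0F0Full;                           // byte sums
    return (v * 0x0101010101010101ull) >> 56;                             // add the bytes
}
```

Every step is an add, shift, and mask, all of which exist as AVX2 64-bit-lane operations; this is what makes the tree preferable to the scalar popcnt instruction in a vector context, as the commit notes.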
Add §8f with an assembly comparison between GCC 13 and Clang 18 (-O3 -DNDEBUG -march=native, cachedGetValue<1,0,0>, W=16):
- GCC does not inline any Simd.h helper (gather_if, simd_cast, where, popcount); it emits 14 out-of-line calls and 13 vzeroupper transitions. The gather_if body uses scalar vmovq/vpinsrq/vinserti128 — no hardware gathers.
- Clang inlines everything except popcount; it emits 2 vpgatherdd + 12 vpgatherqq hardware gather instructions and only 2 vzeroupper transitions. The popcount body (88 ymm instrs, pure SWAR vpsrlq/vpand/vpaddq) is also fully vectorized but remains out-of-line.
- The 43 vpinsrb in Clang's output are the mask-widening cost of the heterogeneous gather_if (SimdMask<uint32_t,16> → 4× SimdMask<uint64_t,4>).

Action item added to §10: [[gnu::always_inline]] on the Simd.h helpers would eliminate the GCC regression and fold popcount inline in both compilers.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
…adeoffs

Expand §8f with an analysis of three popcount strategies for the 16-lane uint64_t case:
- SWAR shift-and-add tree (current): 88 instructions, all AVX2-friendly but port-intensive; stays in vector registers throughout.
- Scalar popcnt with extract/reassemble: ~56 instructions; popcnt is pipelined (1/cycle throughput on port 1, 3-cycle latency) so 16 independent lanes retire in ~16 cycles, but the ymm↔GPR domain crossing adds ~2 cycles of bypass latency per extraction and port 1 serializes all 16 popcnts.
- vpshufb nibble-LUT + vpsadbw (recommended): ~40 instructions, no domain crossing, uses ports 0/5 and 5 (orthogonal to SWAR ports); the standard compiler-generated AVX2 popcount pattern. Added as an action item in §10.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Adds StencilAccessor<BuildT,W,StencilT> — a compile-time-parameterized SIMD
wrapper around BatchAccessor that owns the straddling loop, hull-prefetch
sequencing, and per-tap cachedGetValue blending for one VBM block. Output
is mIndices[SIZE] of Simd<uint64_t,W>, one vector per stencil tap.
Includes Weno5Stencil (18 taps, 6-tap hull) and the findIndex constexpr
fold for compile-time getValue<DI,DJ,DK>() inverse-map lookup.
Also:
- BatchAccessor.h: add centerLeafID() getter used by StencilAccessor
- BatchAccessor.md: expand §8f assembly matrix (compiler × backend × ISA)
- stencil_gather_cpu.cpp: wire StencilAccessor into runPrototype with
verifyStencilAccessor correctness checks; add rdtsc-based runPerf
- Simd.h: remove = {} default argument from element_aligned_tag overloads
- CLAUDE.md: add project build/architecture reference for Claude Code
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Draft for CI and diff review. WIP.