Open
Description
In the CUDA model's reduction, we omit __syncthreads() at the warp level:
Line 220 in 3e10ff9
if (offset > 16) __syncthreads(); // only need to sync if not working within a warp
I suspect that this is incorrect on Volta and later GPUs: with independent thread scheduling, the threads of a warp are no longer guaranteed to execute in lockstep, so at least a __syncwarp() might still be needed. We would probably need something like
if (offset > warpSize / 2)
    __syncthreads();
else
    __syncwarp();

This would also solve portability issues when running the code with, e.g., AdaptiveCpp PCUDA on OpenCL or CPU devices, where the warp size might be smaller than 32.
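To make the proposal concrete, here is a minimal sketch of how the check could sit inside a generic shared-memory tree reduction. It is not the code from the repository: the kernel name `block_sum`, the host helper `block_sum_on_device`, and the single-block launch with 256 threads are all assumptions made up for this illustration, and CUDA error checking is left out for brevity.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel (not the repository's code): sums up to blockDim.x
// values in shared memory. Assumes a 1D block whose size is a power of two.
__global__ void block_sum(const float* in, float* out, unsigned n) {
    extern __shared__ float scratch[];
    const unsigned tid = threadIdx.x;
    const unsigned idx = blockIdx.x * blockDim.x + tid;

    // Pad with zeros so every slot of the scratch buffer is defined.
    scratch[tid] = (idx < n) ? in[idx] : 0.0f;
    __syncthreads();

    for (unsigned offset = blockDim.x / 2; offset > 0; offset /= 2) {
        if (tid < offset) {
            scratch[tid] += scratch[tid + offset];
        }
        // The next step reads values written by other threads. While those
        // writers may sit in a different warp, a block-wide barrier is used;
        // once they are all in the same warp, __syncwarp() replaces it
        // instead of dropping synchronization altogether.
        if (offset > warpSize / 2) {
            __syncthreads();
        } else {
            __syncwarp();
        }
    }

    if (tid == 0) {
        out[blockIdx.x] = scratch[0];
    }
}

// Hypothetical host helper: reduces at most 256 values with a single block.
float block_sum_on_device(const std::vector<float>& host_in) {
    const unsigned threads = 256;  // assumed block size, power of two
    const unsigned n = static_cast<unsigned>(host_in.size());
    assert(n <= threads);  // this sketch launches one block only

    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, threads * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, host_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // Dynamic shared memory: one float per thread.
    block_sum<<<1, threads, threads * sizeof(float)>>>(d_in, d_out, n);

    float result = 0.0f;
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```

Both branch conditions depend only on `offset`, `warpSize`, and `blockDim.x`, so every thread in the block takes the same path and the barrier is never reached divergently. Using `warpSize` rather than the literal 16 is what keeps the threshold correct on targets whose warp size is not 32.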