Potential bug for CUDA reduction #16

@illuhad

In the CUDA model's reduction, we skip __syncthreads() once the reduction is working within a single warp:

if (offset > 16) __syncthreads(); // only need to sync if not working within a warp

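I suspect that this is incorrect on Volta and later GPUs, where independent thread scheduling means the threads of a warp are no longer guaranteed to execute in lockstep, so at least a __syncwarp() might still be needed. We would probably need something like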

if(offset > warpSize/2)
  __syncthreads();
else
  __syncwarp();

Using warpSize instead of the hard-coded 16 would also solve portability issues when running the code, e.g., using AdaptiveCpp PCUDA on OpenCL devices or CPU devices, where the warp size might be smaller than 32.
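
To make the proposal concrete, here is a minimal sketch of a shared-memory tree reduction using the barrier selection above. It is not the repository's actual kernel: the names `block_reduce_sum` and `reduce_sum_on_device` are hypothetical, and it assumes the block size is a power of two and a multiple of the warp size, that the input is non-empty, and that every thread of the block reaches the barrier in each iteration.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

// Hypothetical kernel (not the code from the repository).
// Assumes BLOCK_SIZE is a power of two and a multiple of warpSize.
template <int BLOCK_SIZE>
__global__ void block_reduce_sum(const float* in, float* out, int n) {
  __shared__ float scratch[BLOCK_SIZE];

  const int tid = threadIdx.x;
  const int gid = blockIdx.x * BLOCK_SIZE + tid;
  scratch[tid] = (gid < n) ? in[gid] : 0.0f;
  __syncthreads();

  for (int offset = BLOCK_SIZE / 2; offset > 0; offset /= 2) {
    if (tid < offset)
      scratch[tid] += scratch[tid + offset];

    // The barrier sits outside the divergent branch, so all threads reach it.
    // While partial sums still cross warp boundaries, sync the whole block.
    // Afterwards a warp-level barrier is enough, but it is still required:
    // since Volta, threads of a warp are not guaranteed to run in lockstep.
    if (offset > warpSize / 2)
      __syncthreads();
    else
      __syncwarp();
  }

  if (tid == 0)
    out[blockIdx.x] = scratch[0];
}

// Hypothetical host helper: one partial sum per block, final sum on the host.
// Error checking is omitted for brevity; assumes host_in is non-empty.
float reduce_sum_on_device(const std::vector<float>& host_in) {
  constexpr int kBlock = 256;
  const int n = static_cast<int>(host_in.size());
  const int blocks = (n + kBlock - 1) / kBlock;

  float* d_in = nullptr;
  float* d_out = nullptr;
  cudaMalloc(&d_in, n * sizeof(float));
  cudaMalloc(&d_out, blocks * sizeof(float));
  cudaMemcpy(d_in, host_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

  block_reduce_sum<kBlock><<<blocks, kBlock>>>(d_in, d_out, n);

  std::vector<float> partial(blocks);
  cudaMemcpy(partial.data(), d_out, blocks * sizeof(float),
             cudaMemcpyDeviceToHost);
  cudaFree(d_in);
  cudaFree(d_out);

  float total = 0.0f;
  for (float p : partial) total += p;
  return total;
}
```

Because the condition depends on `warpSize` rather than a literal 16, the same loop should pick the right barrier on backends that report a smaller warp size.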
