Question about synchronization guarantee in DeepEP dispatch low latency kernel #577

@Yi-sir

Hi,
I've been reading the DeepEP low-latency dispatch kernel and have a concern about its synchronization mechanism.

My understanding of the issue:
In the dispatch kernel, each rank can only guarantee that the tokens it sends have been received by the target expert (through the `atomic_finish_counter_per_expert` mechanism). However, I don't see a synchronization mechanism that ensures the kernel exits only after it has received all expected tokens from the other ranks.

Specifically:
- The sending side uses `atomic_finish_counter_per_expert` to track local send completion.
- The receiving side uses `while (ld_acquire_sys_global(rdma_recv_count...) == 0)` to wait for notifications from specific ranks.
- But there seems to be no barrier or counter that aggregates "total expected receives from all ranks" to ensure the kernel waits for all of them before returning.
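
To make the question concrete, the receive-side wait described above can be modeled on the host roughly as follows. This is a hypothetical C++ sketch, not DeepEP code: `RecvCounts`, `publish_count`, `wait_for_rank`, and `wait_for_all_ranks` are names invented for illustration, `std::atomic` stands in for the system-scope acquire load, and the `-n - 1` encoding is an assumption that lets a sender signal "zero tokens" with a nonzero value.

```cpp
#include <atomic>
#include <cassert>
#include <vector>

// Hypothetical model: one notification slot per source rank.
// A slot value of 0 means "no notification has arrived yet".
struct RecvCounts {
    std::vector<std::atomic<int>> slots;
    explicit RecvCounts(int num_ranks) : slots(num_ranks) {
        for (auto& s : slots) s.store(0, std::memory_order_relaxed);
    }
};

// Sender side: publish how many tokens were sent from src_rank.
// Assumed encoding: -n - 1, so that n == 0 still produces a nonzero value.
void publish_count(RecvCounts& rc, int src_rank, int num_tokens) {
    assert(num_tokens >= 0);
    rc.slots[src_rank].store(-num_tokens - 1, std::memory_order_release);
}

// Receiver side: spin until the slot for one specific rank becomes nonzero,
// then decode the token count. This mirrors the shape of the
// `while (load_acquire(...) == 0)` loop quoted above.
int wait_for_rank(RecvCounts& rc, int src_rank) {
    int v;
    while ((v = rc.slots[src_rank].load(std::memory_order_acquire)) == 0) {
        // busy-wait, as a GPU thread would
    }
    return -v - 1;
}

// If every source rank's slot is waited on before returning, the per-rank
// waits together behave like a wait for all ranks, without any separate
// aggregate counter. Returns the total number of tokens received.
int wait_for_all_ranks(RecvCounts& rc) {
    int total = 0;
    for (int r = 0; r < static_cast<int>(rc.slots.size()); ++r) {
        total += wait_for_rank(rc, r);
    }
    return total;
}
```

In this model, no aggregate counter is needed as long as some waiter polls every slot before returning. Whether the kernel guarantees that every (expert, source rank) slot is polled before it exits is exactly what I am unsure about.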

My question:
Is there a mechanism I'm missing that guarantees a rank has received tokens from all other ranks before the kernel exits? Or does the kernel actually exit without ensuring full reception, relying on external synchronization (e.g., nvshmem_barrier_all() or a second kernel launch) to guarantee consistency?

I would appreciate any clarification on this. Thank you very much!
