Question about synchronization guarantee in DeepEP dispatch low latency kernel

Hi,
I've been reading the DeepEP low-latency dispatch kernel code, and I have a concern about its synchronization mechanism.

My understanding of the issue:
In the dispatch kernel, each rank can only guarantee that the tokens it sends have been received by the target expert (through the `atomic_finish_counter_per_expert` mechanism). However, I don't see a synchronization mechanism that ensures the kernel exits only after it has received all expected tokens from the other ranks.

Specifically:
- The sending side uses `atomic_finish_counter_per_expert` to track local send completion.
- The receiving side uses `while (ld_acquire_sys_global(rdma_recv_count...) == 0)` to wait for notifications from specific ranks.
- But there seems to be no barrier or counter that aggregates "total expected receives from all ranks" to ensure the kernel waits for all of them before returning.
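
To make the question concrete, here is a small host-side C++ model of the pattern as I understand it. It is not DeepEP code: threads stand in for ranks, `std::atomic` release/acquire stands in for the system-scope store and `ld_acquire_sys_global`, and the names `RankBuffers`, `send_to`, `receive_all` and `run_model` are hypothetical. The `-num_tokens - 1` flag encoding is my assumption for the sketch, chosen so that a zero-token message still produces a nonzero flag.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

constexpr int kNumRanks = 4;  // hypothetical world size for the model

// One receive slot per source rank, standing in for rdma_recv_count plus data.
struct RankBuffers {
    std::atomic<int> recv_count[kNumRanks];
    int payload[kNumRanks];
    RankBuffers() {
        for (auto& c : recv_count) c.store(0, std::memory_order_relaxed);
        for (auto& p : payload) p = 0;
    }
};

// Sender: write the data first, then publish a nonzero flag with release order.
void send_to(RankBuffers& dst, int src_rank, int num_tokens) {
    dst.payload[src_rank] = num_tokens;  // stands in for the token data
    dst.recv_count[src_rank].store(-num_tokens - 1, std::memory_order_release);
}

// Receiver: spin on each source's flag in turn. There is no aggregate counter,
// but the function cannot return until every per-source flag has been seen.
int receive_all(RankBuffers& self) {
    int total = 0;
    for (int src = 0; src < kNumRanks; ++src) {
        int flag;
        while ((flag = self.recv_count[src].load(std::memory_order_acquire)) == 0) {
            std::this_thread::yield();
        }
        int num_tokens = -flag - 1;               // decode the assumed encoding
        assert(self.payload[src] == num_tokens);  // data is visible after acquire
        total += num_tokens;
    }
    return total;
}

// Each "rank" sends (src + dst) tokens to every rank, then waits on all sources.
std::vector<int> run_model() {
    RankBuffers bufs[kNumRanks];
    std::vector<int> totals(kNumRanks, 0);
    std::vector<std::thread> threads;
    for (int r = 0; r < kNumRanks; ++r) {
        threads.emplace_back([&, r] {
            for (int dst = 0; dst < kNumRanks; ++dst) send_to(bufs[dst], r, r + dst);
            totals[r] = receive_all(bufs[r]);
        });
    }
    for (auto& t : threads) t.join();
    return totals;
}
```

In this model, waiting on every per-source flag already amounts to "received from everyone", so no separate aggregate counter is needed; calling `run_model()` from `main` yields totals of 6, 10, 14 and 18 for ranks 0 to 3. What I cannot tell from the kernel is whether every (expert, source rank) flag is in fact waited on before the kernel returns.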

My question:
Is there a mechanism I'm missing that guarantees a rank has received tokens from all other ranks before the kernel exits? Or does the kernel actually exit without ensuring full reception, relying on external synchronization (e.g., nvshmem_barrier_all() or a second kernel launch) to guarantee consistency?

I would appreciate any clarification on this. Thank you very much!

Question about synchronization guarantee in DeepEP dispatch low latency kernel #577

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Question about synchronization guarantee in DeepEP dispatch low latency kernel #577

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions