Hi,
I've been reading the DeepEP low-latency dispatch kernel code, and I have a concern about its synchronization mechanism.
My understanding of the issue:
In the dispatch kernel, each rank can only guarantee that the tokens it sends have been received by the target expert (through the atomic_finish_counter_per_expert mechanism). However, I don't see a synchronization mechanism that ensures the kernel exits only after receiving all expected tokens from other ranks.
Specifically:
- The sending side uses atomic_finish_counter_per_expert to track local send completion.
- The receiving side uses while (ld_acquire_sys_global(rdma_recv_count...) == 0) to wait for notifications from specific ranks.
- But there seems to be no barrier or counter that aggregates "total expected receives from all ranks" to ensure the kernel waits for all of them before returning.
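To make the concern concrete, here is a minimal host-side C++ model of the exit condition I would expect. This is my own sketch, not DeepEP code: `Slots`, `publish`, and `all_sources_arrived` are hypothetical names, and the -(n + 1) encoding is an assumption I made so that a zero-token notification is still distinguishable from "not yet arrived". The point is only that the receiver should return once every per-source slot is nonzero, not just one of them.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side model of the receive-side wait; NOT DeepEP code.
// Slot `src` holds 0 until source rank `src` publishes its token count.
using Slots = std::vector<std::atomic<int>>;

// Sender side: publish the number of tokens sent to this receiver.
// Assumption for this sketch: the count n is stored as -(n + 1),
// so that "0 tokens" is still a nonzero value.
void publish(Slots& recv_count, std::size_t src, int num_tokens) {
    assert(num_tokens >= 0);
    recv_count[src].store(-(num_tokens + 1), std::memory_order_release);
}

// Receiver side: one polling pass over all source ranks.
// Returns true only if every slot has been published; in that case
// `total_tokens` is set to the sum of the decoded counts.
bool all_sources_arrived(const Slots& recv_count, int& total_tokens) {
    int total = 0;
    for (const auto& slot : recv_count) {
        const int v = slot.load(std::memory_order_acquire);
        if (v == 0) {
            return false;  // at least one source has not notified yet
        }
        total += -v - 1;   // decode -(n + 1) back to n
    }
    total_tokens = total;
    return true;
}
```

In this model, a receiver that loops on `all_sources_arrived` until it returns true cannot exit early. If the kernel instead assigns one waiter per (expert, source rank) slot, the same guarantee would hold collectively without an explicit aggregate counter, which is the part I would like confirmed.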
My question:
Is there a mechanism I'm missing that guarantees a rank has received tokens from all other ranks before the kernel exits? Or does the kernel actually exit without ensuring full reception, relying on external synchronization (e.g., nvshmem_barrier_all() or a second kernel launch) to guarantee consistency?
I would appreciate any clarification on this. Thank you very much!