I'm experiencing poor performance with HybridEP on a dual-node setup with 16 B300 GPUs (8 GPUs per node). The RoCE bandwidth is significantly lower than expected, which degrades cross-node communication.
Environment
Hardware: Dual-node setup, 16x NVIDIA B300 GPUs (8 GPUs per node)
Network: RoCE + NVLink (NVL)
Test Configuration:
Processes per node: 8
Token size: 4096
Local experts: 8
Hidden size: 7168
Node num: 2
Nranks: 16
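For reference, a rough sketch of the per-rank payload this configuration implies. It assumes "Token size" means the number of tokens each rank sends, and it ignores top-k duplication and FP8 scaling factors, so the figures are only a lower-bound estimate. The helper name `payload_bytes` is made up for this illustration.

```python
# Bytes per element for the two data types benchmarked in this issue.
BYTES_PER_ELEMENT = {"bf16": 2, "fp8": 1}

def payload_bytes(num_tokens: int, hidden: int, dtype: str) -> int:
    """Raw activation bytes for one rank (assumption: no top-k copies, no FP8 scales)."""
    return num_tokens * hidden * BYTES_PER_ELEMENT[dtype]

num_tokens = 4096  # "Token size" from the configuration, assumed to be tokens per rank
hidden = 7168      # "Hidden size" from the configuration

for dtype in ("bf16", "fp8"):
    size = payload_bytes(num_tokens, hidden, dtype)
    print(f"{dtype}: {size} bytes = {size / 2**20:.0f} MiB")
```

Under these assumptions the BF16 payload is 56 MiB per rank and the FP8 payload is 28 MiB.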
Performance Data
BF16 performance
| Operation | NVL Bandwidth | IB Bandwidth | Ratio (IB/NVL) |
|---|---|---|---|
| Dispatch (torch API) | ~13.6 GB/s | ~4.98 GB/s | 36.6% |
| Dispatch+permute | ~34.5 GB/s | ~12.6 GB/s | 36.5% |
| Combine (torch API) | ~133 GB/s | ~49 GB/s | 36.8% |
| Combine+unpermute | ~196 GB/s | ~71.8 GB/s | 36.6% |
| Dispatch kernel | ~160-282 GB/s | ~58-103 GB/s | ~36% |
| Combine kernel | ~234-245 GB/s | ~86-89 GB/s | ~37% |
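The IB/NVL ratios in the BF16 table can be re-derived from the two bandwidth columns. The sketch below does that for the four rows that report a single value; the numbers are copied from the table, and the function name `ib_over_nvl` is only illustrative.

```python
# (NVL GB/s, IB GB/s) pairs copied from the single-valued BF16 rows.
bf16_rows = {
    "Dispatch (torch API)": (13.6, 4.98),
    "Dispatch+permute": (34.5, 12.6),
    "Combine (torch API)": (133.0, 49.0),
    "Combine+unpermute": (196.0, 71.8),
}

def ib_over_nvl(nvl_gbps: float, ib_gbps: float) -> float:
    """IB bandwidth as a percentage of NVL bandwidth."""
    return 100.0 * ib_gbps / nvl_gbps

# Round to one decimal place, matching the precision used in the table.
ratios = {name: round(ib_over_nvl(nvl, ib), 1) for name, (nvl, ib) in bf16_rows.items()}

for name, pct in ratios.items():
    print(f"{name}: {pct}%")
```

The ratio comes out between 36.5% and 36.8% in every one of these rows.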
FP8 performance
| Operation | NVL Bandwidth | IB Bandwidth | Ratio (IB/NVL) |
|---|---|---|---|
| Dispatch (torch API) | ~66-69 GB/s | ~25 GB/s | ~37% |
| Dispatch+permute | ~65-69 GB/s | ~24.9 GB/s | ~37% |
| Combine (torch API) | ~201-212 GB/s | ~76.4 GB/s | ~37% |
| Combine+unpermute | ~101-107 GB/s | ~38.5 GB/s | ~37% |
| Dispatch kernel | ~78-133 GB/s | ~28-48 GB/s | ~37% |
| Combine kernel | ~264-286 GB/s | ~97-103 GB/s | ~37% |
HybridEP performance for EP 16 is much worse than DeepEP. While HybridEP is running, it prints nothing except the performance log. How can I debug this issue?