Poor HybridEP performance on dual-node 16x B300 GPUs - IB bandwidth significantly lower than expected #590

@sssssakuraaaaa

Description

I'm seeing poor HybridEP performance on a dual-node setup with 16 B300 GPUs (8 GPUs per node). The RoCE bandwidth is significantly lower than expected, which degrades cross-node communication performance.

Environment

- Hardware: Dual-node setup, 16x NVIDIA B300 GPUs (8 GPUs per node)
- Network: RoCE + NVLink (NVL)
- Test Configuration:
  - Processes per node: 8
  - Token size: 4096
  - Local experts: 8
  - Hidden size: 7168
  - Node num: 2
  - Nranks: 16

Performance Data

BF16 performance

| Operation | NVL Bandwidth | IB Bandwidth | Ratio (IB/NVL) |
|---|---|---|---|
| Dispatch (torch API) | ~13.6 GB/s | ~4.98 GB/s | 36.6% |
| Dispatch+permute | ~34.5 GB/s | ~12.6 GB/s | 36.5% |
| Combine (torch API) | ~133 GB/s | ~49 GB/s | 36.8% |
| Combine+unpermute | ~196 GB/s | ~71.8 GB/s | 36.6% |
| Dispatch kernel | ~160-282 GB/s | ~58-103 GB/s | ~36% |
| Combine kernel | ~234-245 GB/s | ~86-89 GB/s | ~37% |
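
For reference, the Ratio column can be recomputed from the reported bandwidths. The snippet below is only a minimal sketch: the numbers are copied from the single-valued rows of the BF16 results above, and the names `bf16_rows` and `ib_to_nvl_ratio` are made up for illustration (they are not part of HybridEP or DeepEP).

```python
# (NVL GB/s, IB GB/s) copied from the single-valued rows of the BF16 results.
# The dictionary and function names are illustrative, not a HybridEP API.
bf16_rows = {
    "Dispatch (torch API)": (13.6, 4.98),
    "Dispatch+permute": (34.5, 12.6),
    "Combine (torch API)": (133.0, 49.0),
    "Combine+unpermute": (196.0, 71.8),
}


def ib_to_nvl_ratio(nvl_gbps: float, ib_gbps: float) -> float:
    """Return the IB bandwidth as a percentage of the NVL bandwidth."""
    return 100.0 * ib_gbps / nvl_gbps


ratios = {op: ib_to_nvl_ratio(nvl, ib) for op, (nvl, ib) in bf16_rows.items()}

for op, pct in ratios.items():
    print(f"{op:<22} {pct:5.1f}%")
```

The ratio stays within a narrow band (roughly 36.5% to 36.8%) even though the absolute bandwidths of these operations differ by more than 10x.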

FP8 performance

| Operation | NVL Bandwidth | IB Bandwidth | Ratio (IB/NVL) |
|---|---|---|---|
| Dispatch (torch API) | ~66-69 GB/s | ~25 GB/s | ~37% |
| Dispatch+permute | ~65-69 GB/s | ~24.9 GB/s | ~37% |
| Combine (torch API) | ~201-212 GB/s | ~76.4 GB/s | ~37% |
| Combine+unpermute | ~101-107 GB/s | ~38.5 GB/s | ~37% |
| Dispatch kernel | ~78-133 GB/s | ~28-48 GB/s | ~37% |
| Combine kernel | ~264-286 GB/s | ~97-103 GB/s | ~37% |

HybridEP performance at EP 16 is much worse than DeepEP. While HybridEP is running, it produces no logs other than the performance log. How can I debug this issue?
