Hi~
Question: Is there a reasonable path in MVAPICH that uses the same algorithms as SST-Macro? In other words, when both benchmarks are described by the same parameter file (same hardware information), can SST-Macro produce results similar to MVAPICH's?
Take MPI_Allreduce and MPI_Barrier as an example:
Algorithm in SST-Macro:
- MPI_Barrier: Bruck algorithm
- MPI_Allreduce: Wilke-Halving (the Wilke algorithm is a variation of the binary-blocks algorithm)
2.1 First, reduce rounds (similar to the recursive-halving algorithm)
2.2 Second, receive rounds (similar to the Bruck algorithm)
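To make the barrier pattern concrete, here is a minimal Python sketch of a dissemination-style schedule. It assumes the Bruck barrier follows the usual dissemination pattern (this post treats the two names as the same algorithm); it is not taken from the SST-Macro source, and the function names are hypothetical.

```python
def dissemination_rounds(nranks):
    """Return one list of (sender, receiver) pairs per barrier round.

    In round k every rank r sends a token to (r + 2**k) % nranks.
    """
    if nranks < 1:
        raise ValueError("nranks must be positive")
    # ceil(log2(nranks)) computed with integers to avoid float rounding
    nrounds = (nranks - 1).bit_length()
    schedule = []
    for k in range(nrounds):
        dist = 1 << k
        schedule.append([(r, (r + dist) % nranks) for r in range(nranks)])
    return schedule


def barrier_completes(nranks):
    """Check that every rank has transitively heard from every other rank."""
    known = [{r} for r in range(nranks)]
    for pairs in dissemination_rounds(nranks):
        snapshot = [set(s) for s in known]  # all sends in a round are concurrent
        for src, dst in pairs:
            known[dst] |= snapshot[src]
    return all(len(s) == nranks for s in known)


for k, pairs in enumerate(dissemination_rounds(4)):
    print("round", k, pairs)
```

With 4 ranks, as in the `aprun -n 4` launch used here, the schedule has two rounds.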
Algorithm in MVAPICH :
- MPI_Barrier:
1.1: if mv2_use_osu_collectives is set (the default): pairwise exchange with the recursive-doubling algorithm
1.2: else: dissemination algorithm (the Bruck algorithm)
- MPI_Allreduce:
2.1: if mv2_use_osu_collectives is set (the default): I have not analyzed which algorithm is used
2.2: else:
short messages: size <= MPIR_CVAR_ALLREDUCE_LONG_MSG_SIZE
long messages: size > MPIR_CVAR_ALLREDUCE_LONG_MSG_SIZE
2.2.1 For long messages, Rabenseifner's algorithm is used: first recursive halving (reduce-scatter), then recursive doubling (allgather).
2.2.2 For short messages, a recursive-doubling algorithm is used.
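The two non-OSU allreduce variants listed above can be illustrated with a small Python simulation. This is only a sketch under simplifying assumptions: the number of ranks is a power of two, the vector length is divisible by the number of ranks, and the reduction is a sum. It is not MVAPICH code, and the function names are hypothetical.

```python
def choose_allreduce(nbytes, long_msg_size):
    """Selection rule from the list above; long_msg_size stands in for
    MPIR_CVAR_ALLREDUCE_LONG_MSG_SIZE (its value is not assumed here)."""
    return "rabenseifner" if nbytes > long_msg_size else "recursive_doubling"


def _check(values):
    p, n = len(values), len(values[0])
    if p & (p - 1) or n % p:
        raise ValueError("sketch needs power-of-two ranks and len % ranks == 0")
    return p, n


def recursive_doubling_allreduce(values):
    """Each round, rank r exchanges its whole buffer with rank r XOR dist."""
    p, _ = _check(values)
    bufs = [list(v) for v in values]
    dist = 1
    while dist < p:
        snap = [list(b) for b in bufs]
        for r in range(p):
            bufs[r] = [a + b for a, b in zip(snap[r], snap[r ^ dist])]
        dist <<= 1
    return bufs


def rabenseifner_allreduce(values):
    """Reduce-scatter by recursive halving, then allgather by recursive doubling."""
    p, n = _check(values)
    bufs = [list(v) for v in values]
    lo, hi = [0] * p, [n] * p
    dist = p // 2
    while dist >= 1:  # phase 1: each rank keeps and reduces half of its range
        snap = [list(b) for b in bufs]
        for r in range(p):
            mid = (lo[r] + hi[r]) // 2
            lo[r], hi[r] = (lo[r], mid) if r & dist == 0 else (mid, hi[r])
            for i in range(lo[r], hi[r]):
                bufs[r][i] = snap[r][i] + snap[r ^ dist][i]
        dist //= 2
    dist = 1
    while dist < p:  # phase 2: exchange the reduced blocks back
        snap = [list(b) for b in bufs]
        old = list(zip(lo, hi))
        for r in range(p):
            plo, phi = old[r ^ dist]
            bufs[r][plo:phi] = snap[r ^ dist][plo:phi]
            lo[r], hi[r] = min(old[r][0], plo), max(old[r][1], phi)
        dist <<= 1
    return bufs
```

Both functions return the same reduced vector on every rank; they differ in how much data each rank moves per round, which is why the long-message path prefers the halving and doubling combination.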
Comparing the implementations of MPI_Allreduce and MPI_Barrier shows that SST-Macro and MVAPICH do not use the same algorithms by default.
I ran the osu_allreduce and osu_barrier benchmarks in both SST-Macro and MVAPICH, and the results are quite different, as shown in the figure below.
The configuration is given in parameters.ini, which mirrors the hardware information of the real system.
parameters.ini (all benchmarks use the same one)
node {
  name = simple
  app1 {
    launch_cmd = aprun -n 4 -N 1
    exe = ./osu_allreduce_sst
    allocation = node_id
    node_id_allocation_file = andy-node_id_allocation_topo1_4.txt
    mpi {
      max_vshort_msg_size = 16384
      max_eager_msg_size = 16384
      post_header_delay = 0.81us
      post_rdma_delay = 0.13us
      rdma_pin_latency = 0.9us
      rdma_page_delay = 1ns
      eager_cutoff = 524288
      allgather = ring
    }
  }
  proc {
    frequency = 2.6 GHz
    ncores = 8
    parallelism = 16
  }
  memory {
    name = pisces
    total_bandwidth = 12.8GB/s
    latency = 12.5ns
    arbitrator = cut_through
  }
  nic {
    name = pisces
    negligible_size = 0
    injection {
      mtu = 4096
      arbitrator = cut_through
      bandwidth = 100Gb/s
      latency = 300ns
      credits = 64KB
    }
    ejection {
      mtu = 4096
      arbitrator = cut_through
      bandwidth = 100Gb/s
      latency = 300ns
      credits = 64KB
    }
  }
  os {
    compute_scheduler = simple
    stack_size = 128KB
    stack_chunk_size = 2MB
  }
}
switch {
  router {
    name = table
  }
  name = pisces
  arbitrator = cut_through
  mtu = 512
  link {
    bandwidth = 200Gb/s
    latency = 130ns
    credits = 64KB
  }
  xbar {
    bandwidth = 16Tb/s
  }
  logp {
    bandwidth = 200Gb/s
    hop_latency = 116ns
    out_in_latency = 60ns
  }
}
topology {
  name = file
  filename = topology.json
  routing_tables = routing-table.json
}
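When checking the simulated parameters against the real cluster, it can help to load a file of this shape into a nested dictionary. The following is a minimal Python sketch, not SST-Macro's own parser: it assumes only the `name {`, `key = value` and `}` line forms that appear in the file above, and `parse_params` is a hypothetical helper name.

```python
def parse_params(text):
    """Parse nested 'name { key = value }' blocks into nested dicts.

    Values are kept as raw strings (for example '100Gb/s'); no unit
    conversion is attempted.
    """
    root = {}
    stack = [root]
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue
        if line.endswith("{"):
            # Open a new namespace such as 'nic {' or 'ejection{'
            child = {}
            stack[-1][line[:-1].strip()] = child
            stack.append(child)
        elif line == "}":
            if len(stack) == 1:
                raise ValueError(f"line {lineno}: unmatched '}}'")
            stack.pop()
        elif "=" in line:
            key, value = line.split("=", 1)
            stack[-1][key.strip()] = value.strip()
        else:
            raise ValueError(f"line {lineno}: cannot parse {line!r}")
    if len(stack) != 1:
        raise ValueError("missing closing '}'")
    return root


sample = """
switch {
  name = pisces
  link {
    bandwidth = 200Gb/s
    latency = 130ns
  }
}
"""
params = parse_params(sample)
print(params["switch"]["link"]["bandwidth"])
```

Keeping values as strings avoids guessing how units are interpreted; any conversion would be a separate, explicit step.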
Measured with a performance KPI, the osu_allreduce and osu_barrier results from SST-Macro are only about 60% and 70% similar to those from MVAPICH.
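For readers who want to reproduce such a comparison, one possible similarity measure is sketched below in Python. The KPI actually used for the numbers above is not specified, so this relative-error formula is purely an assumption, and the function names are hypothetical.

```python
def similarity(simulated_us, measured_us):
    """Hypothetical KPI: 1.0 for identical latencies, 0.0 when the
    simulated value is off by 100% or more of the measured value."""
    if measured_us <= 0:
        raise ValueError("measured latency must be positive")
    return max(0.0, 1.0 - abs(simulated_us - measured_us) / measured_us)


def mean_similarity(pairs):
    """Average the KPI over (simulated, measured) pairs, e.g. one per message size."""
    scores = [similarity(s, m) for s, m in pairs]
    if not scores:
        raise ValueError("no data points")
    return sum(scores) / len(scores)
```

The inputs would be per-message-size latencies from the two osu_allreduce (or osu_barrier) runs; no measured values are assumed here.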
Hence the question: is there a reasonable path in MVAPICH that uses the same algorithms as SST-Macro? In other words, when both benchmarks are described by the same parameter file (same hardware information), can SST-Macro produce results similar to MVAPICH's?
Thanks a lot,