I've now benchmarked the OpenMP task, OpenMP loop and serial versions of the code on SFP. All results were obtained with:
```
-O3 -g -xCORE_AVX512 -fno-omit-frame-pointer -no-inline-min-size -no-inline-max-per-compile -no-inline-factor -qopt-report=5 -qopt-report-phase=loop,vec
```
The full-node results are mostly uninteresting/expected (OpenMP loop scales better than OpenMP task, but full-node performance is similar enough, and with MPI the picture changes again), but I get very different results at low/serial thread counts. Results for 1 thread, a 2048x2048 grid and 100 iterations:
| Parallel Option | Runtime (s) |
|---|---|
| OpenMP loop | 62.3520 |
| OpenMP task | 31.711 |
| Serial | ~60 |
The checksums are identical (so I assume the results are correct).
My original conclusion was that one of the "additional" transformations applied in the OpenMP task version (InlineTrans, ChunkLoopTrans) must be responsible, so I applied both of those transformations to the serial version as well, but the runtime didn't improve.
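For concreteness, this is roughly the loop structure the serial version ends up with after those two transformations (a minimal sketch only: plain arrays instead of the field type, a hypothetical subroutine name and a stand-in kernel body, not the actual generated psy.f90):

```fortran
! Minimal sketch (not the generated psy.f90): the j loop blocked in chunks
! of 32 by ChunkLoopTrans, with a stand-in for the inlined kernel body.
subroutine momentum_chunked_serial(ua, un, xstart, xstop, ystart, ystop)
  use, intrinsic :: iso_fortran_env, only: real64
  implicit none
  real(real64), intent(inout) :: ua(:,:)
  real(real64), intent(in)    :: un(:,:)
  integer, intent(in) :: xstart, xstop, ystart, ystop
  integer :: i, j, j_out_var, j_el_inner

  do j_out_var = ystart, ystop, 32
    j_el_inner = min(j_out_var + (32 - 1), ystop)
    do j = j_out_var, j_el_inner
      do i = xstart, xstop
        ua(i, j) = ua(i, j) + 0.5_real64 * un(i, j)  ! stand-in kernel body
      end do
    end do
  end do
end subroutine momentum_chunked_serial
```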
I then checked both versions with VTune, and what appears to happen is that the compiler decides to vectorise all (or most) of the loops in the OpenMP task version (VTune reports 99.6% of FP operations as "Packed"), whereas in the other versions only 0.9% of FP operations are packed.
Looking at the opt report for the task version, this is reflected there too: the compiler complains about all the unaligned accesses, but then determines that vectorising these loops is both possible and profitable (although it only expects ~1.3x from the two momentum loops, and more from the other loops).
For the (chunked) momentum loops in the other versions, the opt report instead says:

```
loop was not vectorized: loop with multiple exits cannot be vectorized unless it meets search loop idiom criteria [ psy.f90(148,11) ]
```
This is strange, because the code in the vectorising, task-based version is identical apart from a giant task directive inserted after the j_el_inner computation in this code:
```fortran
DO j_out_var = ua%internal%ystart, ua%internal%ystop, 32
  j_el_inner = MIN(j_out_var + (32 - 1), ua%internal%ystop)
  ! TASK DIRECTIVE GOES HERE
  DO j = j_out_var, j_el_inner, 1
    DO i = ua%internal%xstart, ua%internal%xstop, 1
```
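To isolate whether the directive alone is what changes the compiler's decision, it might be worth compiling a stripped-down reproducer with and without the construct and diffing the opt reports. A sketch of the task-wrapped variant (same caveats as the serial sketch above, and the data-sharing clauses here are my guess, not what PSyclone actually emits):

```fortran
! Sketch of the task-wrapped chunk loop; the clause list is an assumption,
! not the directive PSyclone actually generates.
subroutine momentum_chunked_task(ua, un, xstart, xstop, ystart, ystop)
  use, intrinsic :: iso_fortran_env, only: real64
  implicit none
  real(real64), intent(inout) :: ua(:,:)
  real(real64), intent(in)    :: un(:,:)
  integer, intent(in) :: xstart, xstop, ystart, ystop
  integer :: i, j, j_out_var, j_el_inner

  !$omp parallel default(shared) private(i, j, j_out_var, j_el_inner)
  !$omp single
  do j_out_var = ystart, ystop, 32
    j_el_inner = min(j_out_var + (32 - 1), ystop)
    !$omp task firstprivate(j_out_var, j_el_inner) private(i, j)
    do j = j_out_var, j_el_inner
      do i = xstart, xstop
        ua(i, j) = ua(i, j) + 0.5_real64 * un(i, j)  ! stand-in kernel body
      end do
    end do
    !$omp end task
  end do
  !$omp end single
  !$omp end parallel
end subroutine momentum_chunked_task
```

If the directive-free reproducer also gets the "multiple exits" message while the task-wrapped one vectorises, that would confirm it is the construct itself (rather than something else in the generated code) that changes the compiler's analysis.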
Essentially, we're failing to vectorise code that the compiler sometimes believes it is able to vectorise, depending on the surrounding statements? It would be interesting to see whether an OpenMP loop directive also frees the compiler up to make this choice, but alas I cannot try that yet.
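One cheap experiment in the meantime (just an idea, nothing I've run yet): put an explicit `!$omp simd` on the inner loop of the loop/serial versions, compiled with `-qopenmp-simd` (or full `-qopenmp`), and see whether the "multiple exits" diagnostic disappears and the packed-FP fraction recovers. On the serial sketch above that would be:

```fortran
! Inner part of the serial sketch above, with an explicit simd construct
! asking the compiler to vectorise the i loop rather than relying on its
! own cost model.
    do j = j_out_var, j_el_inner
      !$omp simd
      do i = xstart, xstop
        ua(i, j) = ua(i, j) + 0.5_real64 * un(i, j)  ! stand-in kernel body
      end do
    end do
```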
@arporter @sergisiso any ideas?