Confused about results on SFP due to vectorisation #98

@LonelyCat124

Description

I've now benchmarked the OpenMP task, OpenMP loop and serial versions of the code on SFP.
All results were obtained with:
-O3 -g -xCORE-AVX512 -fno-omit-frame-pointer -no-inline-min-size -no-inline-max-per-compile -no-inline-factor -qopt-report=5 -qopt-report-phase=loop,vec

The full-node results are mostly uninteresting/expected (OpenMP loop scales better than OpenMP task, but full-node performance is similar enough, and with MPI the story changes more), but I get very different results at low/serial thread counts. Results for 1 thread, 2048x2048, 100 iterations:

Parallel Option    Runtime (s)
---------------    -----------
OpenMP loop        62.3520
OpenMP task        31.711
Serial             ~60

The checksums are identical (so I assume the results are correct).

My original hypothesis was that one of the "additional" transformations (InlineTrans, ChunkLoopTrans) applied in the OpenMP task version must be responsible, so I applied both of them to the Serial version, but the runtime didn't improve.
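
For reference, ChunkLoopTrans rewrites a single loop into an outer loop over chunk starts plus an inner loop whose upper bound is clamped with MIN. A minimal, self-contained sketch (the array a, the bounds and the chunk size of 32 here are hypothetical, not taken from the benchmark):

      PROGRAM chunk_demo
        IMPLICIT NONE
        INTEGER, PARAMETER :: ystart = 1, ystop = 2048, chunk = 32
        INTEGER :: j, j_out_var, j_el_inner
        REAL :: a(ystop)
        a = 0.0
        ! Outer loop steps through chunk starts; MIN clamps the final
        ! (possibly partial) chunk to the original upper bound.
        DO j_out_var = ystart, ystop, chunk
          j_el_inner = MIN(j_out_var + (chunk - 1), ystop)
          DO j = j_out_var, j_el_inner
            a(j) = a(j) + 1.0
          END DO
        END DO
        PRINT *, SUM(a)   ! expect 2048.0
      END PROGRAM chunk_demo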

I then checked both versions with VTune, and what appears to happen is that the compiler vectorises all (or most) of the loops in the OpenMP task version (VTune reports 99.6% of FP operations as "Packed"), whereas in the other versions only 0.9% of FP operations are packed.

Looking at the opt report for the task version, this is reflected there: the compiler complains about all the unaligned accesses, but then determines that vectorising these loops is both possible and profitable (although it only expects ~1.3x speedup from the two momentum loops, and more from the other loops).

For the same (chunked) momentum loops in the non-task versions, the compiler output instead says:
loop was not vectorized: loop with multiple exits cannot be vectorized unless it meets search loop idiom criteria [ psy.f90(148,11) ]
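
For context, that diagnostic normally refers to a loop with an explicit early exit path, e.g. this (hypothetical) search loop:

      INTEGER :: i
      INTEGER, PARAMETER :: n = 1024
      REAL :: a(n), needle
      ! The EXIT gives the loop a second exit path; the compiler will only
      ! vectorise such a loop if it matches its search-loop idiom.
      DO i = 1, n
        IF (a(i) == needle) EXIT
      END DO

The chunked momentum loops contain no EXIT or RETURN, so it is not obvious where the compiler sees a second exit.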

This is strange, because the code in the vectorising task-based version is identical apart from a giant task directive inserted after the j_el_inner computation:

      DO j_out_var = ua%internal%ystart, ua%internal%ystop, 32
        j_el_inner = MIN(j_out_var + (32 - 1), ua%internal%ystop)
        ! TASK DIRECTIVE GOES HERE
        DO j = j_out_var, j_el_inner, 1
          DO i = ua%internal%xstart, ua%internal%xstop, 1
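
For reference, a sketch of where such a directive sits, reusing the hypothetical chunk_demo arrays from above; the clause list is illustrative only, not the exact PSyclone output (task constructs need an enclosing parallel/single region):

      !$omp parallel default(shared)
      !$omp single
      DO j_out_var = ystart, ystop, chunk
        j_el_inner = MIN(j_out_var + (chunk - 1), ystop)
        ! One task per chunk; firstprivate captures the chunk bounds at
        ! task-creation time, so later iterations cannot race on them.
        !$omp task firstprivate(j_out_var, j_el_inner) shared(a)
        DO j = j_out_var, j_el_inner
          a(j) = a(j) + 1.0
        END DO
        !$omp end task
      END DO
      !$omp end single
      !$omp end parallel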

Essentially, we're failing to vectorise code that the compiler sometimes believes it can vectorise, depending on the surrounding statements? It would be interesting to see whether an OpenMP loop directive also frees the compiler to make this choice, but alas I cannot test this yet.
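
By hand, that experiment would look something like this on the hypothetical chunked loop above (again a sketch, not PSyclone output):

      ! Does a worksharing loop directive (plus, optionally, an explicit
      ! SIMD hint on the inner loop) change the vectorisation decision?
      !$omp parallel do private(j_el_inner)
      DO j_out_var = ystart, ystop, chunk
        j_el_inner = MIN(j_out_var + (chunk - 1), ystop)
        !$omp simd
        DO j = j_out_var, j_el_inner
          a(j) = a(j) + 1.0
        END DO
      END DO
      !$omp end parallel do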

@arporter @sergisiso any ideas?
