Changes to OpenMP scripts to extract arguments from iom_put #3373
LonelyCat124 wants to merge 23 commits into master from
Conversation
Codecov Report

❌ Patch coverage is

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #3373      +/-   ##
==========================================
- Coverage   99.95%   99.94%   -0.02%
==========================================
  Files         387      387
  Lines       54317    54339      +22
==========================================
+ Hits        54295    54309      +14
- Misses         22       30       +8
==========================================
Transformation failing on NEMO5, and maybe not for structure-of-arrays:
The integration tests all pass now; the remaining question is whether there's any performance degradation. I don't think there should be, but it's possible this is being done "too widely". One for either @sergisiso or @arporter to review.
sergisiso
left a comment
@LonelyCat124 The changes look good, but I would like the transformation to provide more feedback so we can understand why it hasn't given a performance improvement. Also, see if it can be made more generic.
examples/nemo/scripts/utils.py
Outdated
if call.symbol.name == "iom_put":
    arg = call.arguments[1]
    dtype = arg.datatype
    if isinstance(dtype, ArrayType) and isinstance(arg, Operation):
        try:
            DataNodeToTempTrans().apply(arg)
        except TransformationError:
            pass
The integration tests don't show any performance advantage, which is not what we expected (we are changing from multiple gpu->cpu reads to one, and maybe preventing data touched by the gpu from being brought back):
- I can check with a grep how many more loops are offloaded, but for the places where it was not applied, could you add a preceding comment giving the reason why not? (If not all transformation errors provide useful information, a verbose option like other transformations have can help.)
- There is nothing specific to iom_put, other than that we know it is a common pattern. We want to avoid touching things from the CPU as much as possible; could this be applied to all subroutine calls (not functions)?
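A minimal sketch of the generalisation suggested above: visit every subroutine call rather than matching iom_put by name, try to hoist each array-valued Operation argument into a temporary, and record why a site was skipped so the lack of a performance change can be diagnosed. The PSyclone names from the diff (`Call`, `Operation`, `ArrayType`, `DataNodeToTempTrans`, `TransformationError`) are replaced here by toy stand-ins, so this only illustrates the intended control flow, not the real API.

```python
class TransformationError(Exception):
    """Stand-in for PSyclone's TransformationError."""


class DataNodeToTempTrans:
    """Toy stand-in: 'hoists' an argument or refuses with a reason."""

    def apply(self, node):
        if not node.hoistable:
            raise TransformationError(f"cannot hoist {node.name}")
        node.hoisted = True


class Arg:
    """Toy stand-in for a call argument in the tree."""

    def __init__(self, name, is_array_op, hoistable=True):
        self.name = name
        self.is_array_op = is_array_op  # array-valued Operation?
        self.hoistable = hoistable
        self.hoisted = False


def extract_call_args(calls, verbose=False):
    """Hoist array-valued operation arguments of every call (not just
    iom_put) and collect the reasons for any skipped sites."""
    skipped = []
    trans = DataNodeToTempTrans()
    for call in calls:          # each call is a sequence of arguments
        for arg in call:
            if not arg.is_array_op:
                continue        # only array-valued Operations are hoisted
            try:
                trans.apply(arg)
            except TransformationError as err:
                skipped.append(str(err))
                if verbose:
                    print(f"Not extracted: {err}")
    return skipped
```

With a verbose flag the skip reasons could also be emitted as preceding comments in the generated code, as requested above.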
I've added a preceding comment now. I'll try generalising it as well.
This currently causes things to fail. I'll see if I can get my VPN working again and try manually building NEMO5 to find the cause.
This has shown a few issues with the DataNodeToTempTrans (partly because some things are Statements that I hadn't expected, e.g. an IfBlock's condition).
Fixed the bugs now
The DataNode2Temp has a: This may be a problem, as walk returns self and iom_put is not pure. I suppose this was meant for the calls in the arguments and we just forgot to skip self?
@sergisiso I don't think so. We're calling the transformation on the argument, not the call, so iom_put is the parent.
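A side note on the walk-returns-self point above: a typical PSyIR-style `walk` yields the starting node first, so a purity check written over `node.walk(...)` also tests the node itself unless the first result is skipped. The `Node` class below is an assumption (not PSyclone's API) used purely to make the behaviour concrete.

```python
class Node:
    """Toy tree node; `walk` yields self first, as PSyIR's walk does."""

    def __init__(self, name, pure=True, children=()):
        self.name = name
        self.pure = pure
        self.children = list(children)

    def walk(self):
        yield self                      # the starting node is included
        for child in self.children:
            yield from child.walk()


def args_all_pure(call):
    """Check purity of everything *below* the call by dropping the first
    yielded node, so an impure call with pure argument calls still passes."""
    nodes = list(call.walk())
    return all(node.pure for node in nodes[1:])
```

So whether the impure iom_put trips the check depends entirely on whether self is skipped, which matches the question raised above.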
Ok, so there are some issues with applying it to all calls where we potentially lose performance. The failing cases hit allocated after allocate because I just didn't consider this case. Could be resolved by The original loop is: So it's not as straightforward as just applying it everywhere. I'll look at adding the
I think adding the
…op parallelisation wherever possible.
So I have added the allocation check, but the actual cases where this occurs inherently make the loops non-parallelisable, as I found when trying to construct a test for the extraction. For example, the loop in this code snippet is parallelisable by default. If we extract PSyclone no longer sees this loop as parallelisable, as
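A rough Python analogue of the problem described above (the original Fortran snippet is not reproduced here): extracting an array-valued expression into a temporary inside the loop body introduces a fresh per-iteration array, which corresponds to the "ALLOCATE in loop" condition that stops the loop being parallelised. Both function names below are illustrative, not from the patch.

```python
def consume_directly(field, n):
    """Original form: the array expression is consumed in place, so no
    per-iteration temporary is materialised."""
    return [sum(field[i]) for i in range(n)]


def consume_via_temp(field, n):
    """After extraction: `tmp` is a new array allocated on every
    iteration, i.e. an allocation inside the loop body."""
    out = []
    for i in range(n):
        tmp = list(field[i])   # extracted temporary: fresh array each time
        out.append(sum(tmp))
    return out
```

The results are identical; only the placement of the (implicit) allocation changes, and that placement is what the parallelisation check objects to.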
I wouldn't worry about making this particular case parallel if we already make it valid (no unprotected allocation). This particular case is better solved by Sum2CodeTrans, which just needs a temporary of the elemental type. My plan was to replace the maxval2code trans in utils.py with a metatransformation that expands all the intrinsics. To make it parallel we need to redo the logic that infers private variables to consider arrays; right now it only works for scalars.
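A hedged Python stand-in for what a Sum2CodeTrans-style expansion of the intrinsic would produce: the reduction becomes an explicit loop accumulating into a scalar temporary of the elemental type, so no array allocation appears in the loop and the loop can later be parallelised as a reduction. Python stands in for the generated Fortran here; the function name is illustrative.

```python
def expanded_sum(a):
    """Explicit-loop form of SUM(a): a scalar temporary replaces the
    intrinsic, avoiding any array temporary (and hence any ALLOCATE)
    inside the loop."""
    tmp = 0.0          # scalar temporary of the elemental type
    for x in a:        # explicit loop replacing the SUM intrinsic
        tmp += x
    return tmp
```

Because the only loop-carried state is the scalar `tmp`, the private-variable inference mentioned above already handles it, unlike an array temporary.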
This was just a simple example that I could construct in PSyclone easily, but the reason for it was what I saw from NEMO: where it was extracting an array result from a function call (e.g. a pure call), it was now stopping parallelism that may previously have been possible (at least, the only comment in the resulting code was the "ALLOCATE in loop" error message as the reason why the loops weren't parallelised).
Yes, I understand. What I am saying is that we need to solve the generic correctness issue regarding allocates that produces the error message. But if there is a performance regression in this particular case (due to not being parallel) it may not matter, because the other transformation will get rid of these intrinsics anyway.
Oh ok - I'm not sure all the cases are with intrinsics (but I can't remember). I'm happy to remove the code that tries to move the allocation outside of parent loops for now (as I can't construct code to test it anyway) and see where we are up to with the ITs.
This still causes an issue for GPU NEMO5, which I'll look into. Good news is there are no issues with passthrough at least.
I worked out the fail case for NEMO5 with GPUs. Essentially we end up with an ordering: While looking at why this happens, it's due to the
This still results in NEMO5 accuracy failures after 5 timesteps for GPU Bench: @sergisiso At this point I think it's best if we just move back to iom_put for now and look at this later in more detail?
Sure, this already spots and fixes quite a few things. We can revert to only iom_put and open a separate issue to continue the more generic application.