Changes to OpenMP scripts to extract arguments from iom_put #3373
LonelyCat124 wants to merge 23 commits into master from
Conversation
Codecov Report

❌ Patch coverage is

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #3373      +/-   ##
==========================================
- Coverage   99.95%   99.94%   -0.02%
==========================================
  Files         387      387
  Lines       54317    54339      +22
==========================================
+ Hits        54295    54309      +14
- Misses         22       30       +8
==========================================
Transformation failing on NEMO5, and maybe not for structure-of-arrays:
The integration tests all pass now; the remaining question is whether there's any performance degradation. I don't think there should be, but it's possible this is being done "too widely". One for either @sergisiso or @arporter to review.
sergisiso
left a comment
@LonelyCat124 The changes look good, but I would like the transformation to provide more feedback so we can understand why it hasn't given a performance improvement. Also, see if it can be made more generic.
examples/nemo/scripts/utils.py
Outdated
if call.symbol.name == "iom_put":
    arg = call.arguments[1]
    dtype = arg.datatype
    if isinstance(dtype, ArrayType) and isinstance(arg, Operation):
        try:
            DataNodeToTempTrans().apply(arg)
        except TransformationError:
            pass
The integration tests don't show any performance advantage, which is not what we expected (we are changing from multiple gpu->cpu reads to one, and maybe preventing data touched by the gpu from being brought back):
- I can check with a grep how many more loops are offloaded, but for the places where it was not applied, could you add a preceding comment giving the reason why not? (If not all transformation errors provide useful information, a verbose option like other transformations have can help.)
- There is nothing specific to iom_put, other than that we know it is a common pattern. We want to avoid touching things from the CPU as much as possible; could this be applied to all subroutine calls (not functions)?
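A minimal sketch of the generalisation suggested above: visit every subroutine call rather than matching iom_put by name, try to hoist each array-valued Operation argument into a temporary, and record why a site was skipped so the lack of a performance change can be diagnosed. The PSyclone names from the diff (`Call`, `Operation`, `ArrayType`, `DataNodeToTempTrans`, `TransformationError`) are replaced here by toy stand-ins, so this only illustrates the intended control flow, not the real API.

```python
class TransformationError(Exception):
    """Stand-in for PSyclone's TransformationError."""


class DataNodeToTempTrans:
    """Toy stand-in: 'hoists' an argument or refuses with a reason."""

    def apply(self, node):
        if not node.hoistable:
            raise TransformationError(f"cannot hoist {node.name}")
        node.hoisted = True


class Arg:
    """Toy stand-in for a call argument in the tree."""

    def __init__(self, name, is_array_op, hoistable=True):
        self.name = name
        self.is_array_op = is_array_op  # array-valued Operation?
        self.hoistable = hoistable
        self.hoisted = False


def extract_call_args(calls, verbose=False):
    """Hoist array-valued operation arguments of every call (not just
    iom_put) and collect the reasons for any skipped sites."""
    skipped = []
    trans = DataNodeToTempTrans()
    for call in calls:          # each call is a sequence of arguments
        for arg in call:
            if not arg.is_array_op:
                continue        # only array-valued Operations are hoisted
            try:
                trans.apply(arg)
            except TransformationError as err:
                skipped.append(str(err))
                if verbose:
                    print(f"Not extracted: {err}")
    return skipped
```

With a verbose flag the skip reasons could also be emitted as preceding comments in the generated code, as requested above.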
I've added a preceding comment now. I'll try generalising it as well.
This currently causes things to fail. I'll see if I can get my VPN working again and try manually building NEMO5 to find the cause.
This has shown a few issues with the DataNodeToTempTrans (partly because some things are Statements that I hadn't expected, e.g. an IfBlock's condition).
Fixed the bugs now
The DataNode2Temp has a: This may be a problem, as walk returns self and iom_put is not pure. I suppose this was meant for the calls in the arguments and we just forgot to skip self?
@sergisiso I don't think so. We're calling the transformation on the argument, not the call, so iom_put is the parent.
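A side note on the walk-returns-self point above: a typical PSyIR-style `walk` yields the starting node first, so a purity check written over `node.walk(...)` also tests the node itself unless the first result is skipped. The `Node` class below is an assumption (not PSyclone's API) used purely to make the behaviour concrete.

```python
class Node:
    """Toy tree node; `walk` yields self first, as PSyIR's walk does."""

    def __init__(self, name, pure=True, children=()):
        self.name = name
        self.pure = pure
        self.children = list(children)

    def walk(self):
        yield self                      # the starting node is included
        for child in self.children:
            yield from child.walk()


def args_all_pure(call):
    """Check purity of everything *below* the call by dropping the first
    yielded node, so an impure call with pure argument calls still passes."""
    nodes = list(call.walk())
    return all(node.pure for node in nodes[1:])
```

So whether the impure iom_put trips the check depends entirely on whether self is skipped, which matches the question raised above.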
Ok, so there are some issues with applying it to all calls where we potentially lose performance. The failing cases hit allocated after allocate because I just didn't consider this case. Could be resolved by The original loop is: So it's not as straightforward as just applying it everywhere. I'll look at adding the
I think adding the
…op parallelisation wherever possible.
So I have added the allocation check, but the actual cases where this occurs inherently make the loops non-parallelisable, as I found when trying to construct a test for the extraction. For example, the loop in this code snippet is parallelisable by default. If we extract PSyclone no longer sees this loop as parallelisable, as
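A rough Python analogue of the problem described above (the original Fortran snippet is not reproduced here): extracting an array-valued expression into a temporary inside the loop body introduces a fresh per-iteration array, which corresponds to the "ALLOCATE in loop" condition that stops the loop being parallelised. Both function names below are illustrative, not from the patch.

```python
def consume_directly(field, n):
    """Original form: the array expression is consumed in place, so no
    per-iteration temporary is materialised."""
    return [sum(field[i]) for i in range(n)]


def consume_via_temp(field, n):
    """After extraction: `tmp` is a new array allocated on every
    iteration, i.e. an allocation inside the loop body."""
    out = []
    for i in range(n):
        tmp = list(field[i])   # extracted temporary: fresh array each time
        out.append(sum(tmp))
    return out
```

The results are identical; only the placement of the (implicit) allocation changes, and that placement is what the parallelisation check objects to.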
I wouldn't worry about making this particular case parallel if we already make it valid (no unprotected allocation). This particular case is better solved by Sum2CodeTrans, which just needs a temporary of the elemental type. My plan was to replace the maxval2code trans in utils.py with a metatransformation that expands all the intrinsics. To make it parallel we need to redo the logic that infers private variables to consider arrays; right now it only works for scalars.
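A hedged Python stand-in for what a Sum2CodeTrans-style expansion of the intrinsic would produce: the reduction becomes an explicit loop accumulating into a scalar temporary of the elemental type, so no array allocation appears in the loop and the loop can later be parallelised as a reduction. Python stands in for the generated Fortran here; the function name is illustrative.

```python
def expanded_sum(a):
    """Explicit-loop form of SUM(a): a scalar temporary replaces the
    intrinsic, avoiding any array temporary (and hence any ALLOCATE)
    inside the loop."""
    tmp = 0.0          # scalar temporary of the elemental type
    for x in a:        # explicit loop replacing the SUM intrinsic
        tmp += x
    return tmp
```

Because the only loop-carried state is the scalar `tmp`, the private-variable inference mentioned above already handles it, unlike an array temporary.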
This was just a simple example that I could construct in PSyclone easily, but the reason for it was what I saw from NEMO: where it was extracting an array result from a function call (e.g. a pure call), it was now stopping parallelism that may previously have been possible (at least, the only comment in the resulting code was the "ALLOCATE in loop" error message as the reason why the loops weren't parallelised).
Yes, I understand. What I am saying is that we need to solve the generic correctness issue regarding allocates that produces the error message. But if there is a performance regression in this particular case (due to not being parallel) it may not matter, because the other transformation will get rid of these intrinsics anyway.
Oh ok - I'm not sure all the cases are with intrinsics (but I can't remember). I'm happy to remove the code that tries to move the allocation outside of parent loops for now (as I can't construct code to test it anyway) and see where we are up to with the ITs.
This still causes an issue for GPU NEMO5, which I'll look into. Good news is there are no issues with passthrough at least.
I worked out the fail case for NEMO5 with GPUs. Essentially we end up with an ordering: While looking at why this happens, it's due to the
This still results in NEMO5 accuracy failures after 5 timesteps for GPU Bench: @sergisiso At this point I think it's best if we just move back to iom_put for now and look at this later in more detail?
Sure, this already spots and fixes quite a few things. We can revert to only iom_put and open a separate issue to continue the more generic application.