Scalability analysis of observation-related steps in the fv3-jedi variational app #489

@TingLei-NOAA

In recent static 3DVar fv3-jedi runs for the 3 km North American domain, we found that the run using 480 MPI tasks is significantly faster than the run using 1936 MPI tasks. One major source of the slowdown appears to be the observation-related steps, where the default round-robin distribution is used.
YAML files are provided for the 480-task run and for the 1936-task run. Both jobs use ppn=4.
The results extracted from JEDI's own parallel timing statistics are shown below:

Image

The most salient behavior is the increase in wall-clock time (maximum across all MPI ranks) for most of the observation-related steps. It appears that using more MPI ranks caused worse load imbalance and longer maximum wall-clock times (and hence longer total step times).
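One simple way to quantify the imbalance described above is the ratio of the maximum to the mean per-rank wall time for a step. The following is only a minimal Python sketch, not JEDI's own timing code: the function name `imbalance_ratio` and the sample timings are hypothetical and are not taken from the runs in this issue.

```python
from statistics import mean


def imbalance_ratio(per_rank_seconds):
    """Return max/mean of per-rank wall times (1.0 means perfectly balanced)."""
    times = list(per_rank_seconds)
    if not times:
        raise ValueError("need at least one per-rank timing")
    avg = mean(times)
    if avg <= 0:
        raise ValueError("mean wall time must be positive")
    return max(times) / avg


# Hypothetical timings (seconds) for one step on four ranks; not measured data.
balanced = [2.0, 2.0, 2.0, 2.0]
skewed = [1.0, 1.0, 1.0, 5.0]

print(imbalance_ratio(balanced))  # 1.0
print(imbalance_ratio(skewed))    # 2.5
```

Because a step finishes only when its slowest rank finishes, a ratio well above 1 means most ranks sit idle waiting for a few heavily loaded ones.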
This issue is intended to facilitate collaborative investigation of this behavior and to find a quick solution or mitigation strategy.
Update 1: It seems the 1936-task run unintentionally used the Halo distribution. I will correct this and give an update on the scalability behavior after rerunning with exactly the same obs, as @delippi suggested.
Update 2: An apples-to-apples comparison is given below, in which the same obs and obs distribution were used. The differences come solely from the different MPI task counts (and the system status).
The degradation shown in the previous table is greatly reduced. However, the remaining degradation in the obs-related steps still makes the 1936-task run slower than the 480-task run, even though the SABER-related steps were faster.
Image
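To put the 480-task versus 1936-task comparison on a common footing, the strong-scaling efficiency of a step can be computed from its wall times in the two runs. This is a hedged sketch: the helper `parallel_efficiency` is a hypothetical name, and the timings passed to it are placeholders rather than values from the tables in this issue. Only the task counts come from the issue.

```python
def parallel_efficiency(t_small, n_small, t_large, n_large):
    """Strong-scaling efficiency of the larger run relative to the smaller one.

    1.0 means the step sped up in proportion to the added tasks.
    """
    if min(t_small, t_large) <= 0 or min(n_small, n_large) <= 0:
        raise ValueError("timings and task counts must be positive")
    speedup = t_small / t_large   # achieved speedup
    ideal = n_large / n_small     # speedup expected from perfect scaling
    return speedup / ideal


# Task counts from this issue; the 100 s timings are placeholders, not measurements.
eff = parallel_efficiency(t_small=100.0, n_small=480, t_large=100.0, n_large=1936)
print(f"efficiency with identical wall times: {eff:.3f}")
```

With these task counts, an efficiency below 480/1936 (about 0.25) means the step is slower in absolute wall time on 1936 tasks than on 480, which is the situation reported for the obs-related steps.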

The current finding is that the obs-related steps (round-robin distribution) do show good load balance but do not scale well when the total number of MPI tasks reaches 1936. The Halo distribution performs worse (though the reason is still not clear to me). The next step is to explore the Atlas obs distribution, as @shlyaeva recommended.
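For readers unfamiliar with why round robin balances well, here is a toy Python sketch of the idea. It assumes the simplest possible rule, observation `i` goes to rank `i % n_ranks`, which may differ from the actual IODA implementation (for example, distribution by record rather than by location). The observation count is hypothetical.

```python
from collections import Counter


def round_robin_owner(obs_index, n_ranks):
    """Toy rule: observation i is owned by rank i mod n_ranks (an assumption)."""
    return obs_index % n_ranks


def counts_per_rank(n_obs, n_ranks):
    """Number of observations owned by each rank under the toy rule."""
    counts = Counter(round_robin_owner(i, n_ranks) for i in range(n_obs))
    return [counts.get(rank, 0) for rank in range(n_ranks)]


n_obs = 100_000  # hypothetical observation count, not from the runs above
for n_ranks in (480, 1936):
    counts = counts_per_rank(n_obs, n_ranks)
    print(n_ranks, "ranks: min", min(counts), "max", max(counts))
```

Per-rank counts differ by at most one, so the obs work itself is evenly split. The sketch says nothing about communication or other fixed per-step costs, which are possible but unconfirmed reasons for the poor scaling at 1936 tasks.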
