Skip to content

b4b: Move the changes to initialize the task decomposition from mpi_scan to main development#3666

Open
ekluzek wants to merge 100 commits into
ESCOMP:b4b-devfrom
ekluzek:decomp_mpi_scan_move_to_b4b
Open

b4b: Move the changes to initialize the task decomposition from mpi_scan to main development#3666
ekluzek wants to merge 100 commits into
ESCOMP:b4b-devfrom
ekluzek:decomp_mpi_scan_move_to_b4b

Conversation

@ekluzek
Copy link
Copy Markdown
Contributor

@ekluzek ekluzek commented Dec 16, 2025

Description of changes

This moves the core code changes to initialize the processor decomposition from the mpi_scan branch in #3469 to b4b-dev. This removes some of the changes for memory checking and additional self testing as well as some of the additional timers that don't look useful now.

I created two previous branches before I created the process in #3665 where I worked out the details to NOT make this have too many commits and be hard to do. I also figured out how to remove merge commits as they need special handling, and usually aren't wanted in a case like this. Another way to do this would be to do this outside of git, which might have been similar length as the final version, but could've missed some important changes.

Specific notes

Contributors other than yourself, if any: John Dennis

CTSM Issues Fixed (include github issue #):
Fixes #3370
Fixes #3368
Fixes #3672
Some work on #3448
Some work on #3476

Are answers expected to change (and if so in what way)? No (the determination of the decomposition is identical as well)

Any User Interface Changes (namelist or namelist defaults changes)? No

Does this create a need to change or add documentation? Did you do so? No No

Testing performed, if any: Will run standard testing
The mpi_scan testing branch has had all the test lists run for it: aux_clm, ctsm_sci, decomp_init, decomp_init_uhr, and fates

 Conflicts:
	src/cpl/share_esmf/lnd_set_decomp_and_domain.F90
…e destroyed so remove the destroy for the distgrid, and the two meshes, this runs but doesn't seem to lower memory
…e about leaving the distgrid around, and also delete the meshes as it seems to work with this in place
 Conflicts:
	src/main/decompInitMod.F90
 Conflicts:
	src/cpl/share_esmf/lnd_set_decomp_and_domain.F90
…ted error messaging

 Conflicts:
	src/main/decompInitMod.F90
…he subname, create new internal subroutines in decompInit_lnd for allocate, clean, and check errors, move the check errors part to the first thing done

 Conflicts:
	src/main/decompInitMod.F90
…array sizes are set before allocates, initialize some decompMod values to invalid for error checking, add error checking to get_proc_bounds/get_proc_clumps, seperate out allocate for gindex to own allocate method, as it has be be done later after decomp is done, these are all improvements in ESCOMP#3448

 Conflicts:
	src/main/decompInitMod.F90
… add error handling of nsegspc, don't check endCohort in get_proc_bounds and get_clump_bounds as doesn't seem to be set

 Conflicts:
	src/main/decompInitMod.F90
…etup/clean for each DecompInit test, move the decomp_mod_clean to decompMod and use it for the decompInit tests

 Conflicts:
	src/main/decompInitMod.F90
	src/self_tests/TestDecompInit.F90 --- removed
…re to the regular operation

 Conflicts:
	src/main/decompInitMod.F90
…mpi-serial

 Conflicts:
	src/main/decompInitMod.F90
 Conflicts:
	src/main/decompInitMod.F90
…sor_type structure, and start adding a couple methods to help get them set
…for mpiscan and verify it, allocate the new procinfo gi and gj indices, make sure they are set, compiles but fails at run

 Conflicts:
	src/main/decompInitMod.F90
…the call expects all subgrid levels to be set
…ocate for the local task, this works for the serial case

 Conflicts:
	src/main/decompInitMod.F90
… clump_pproc for the setting of ggidx

 Conflicts:
	src/main/decompInitMod.F90
…moved for the final version

 Conflicts:
	src/main/decompInitMod.F90
…s for serial mode

 Conflicts:
	src/main/decompInitMod.F90
…global clumps

 Conflicts:
	src/self_tests/TestDecompInit.F90
… the log that aren't needed anymore

 Conflicts:
	src/main/decompInitMod.F90
ekluzek and others added 21 commits May 4, 2026 00:38
… gi and gj and add tests for calc_global_index_fromij
…the unit tests were failing, but now work with the fix in place
… fail, this is in advance of removing ggidx to make sure removing it will be OK
…od can't check for abort because it has to use shr_sys_abort, tests now work
…iculty or be slower for PE layouts that aren't a power of 2
@ekluzek ekluzek marked this pull request as ready for review May 10, 2026 20:54
! The test_decompMod.pf unit tests test the 2D calculation of global indices however.
! Other unit tests, exercise endrun calls in various ways so test actual endrun calls.
!
! WORK THAT COULD BE DONE:
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@briandobbins and @samsrabin this is ready to review now. This is the one thing that I wonder about doing before it's merged in. I think the test covered is pretty reasonable between the PF unit tests and the system tests. Co-pilot also agrees with that sentiment. I have unit tests that cover what the AI found when @briandobbins asked for a review. And there was another similar error in a test that Co-Pilot found for me.

I could add some testing to verify the point_context log output. The catch there is that I'd need to figure out how to capture log output in our unit testing. There must be ways of doing this, but I'm not familiar with it, so would need to investigate. I could also just do a little bit more in this area and limit the time I spend on doing it. Spending a few hours might be reasonable, but not spending many days. But, I figured I'd ask both your opinion on that before just doing it.

Outside of that, this is good to go. I'm running system tests and barring problems there I'd like to see it merged to b4b-dev.

Let me know if you spot anything else I should pay attention to. Thanks in advance.

@ekluzek
Copy link
Copy Markdown
Contributor Author

ekluzek commented May 10, 2026

Running aux_clm on Derecho and Izumi. Will also run decomp_init and uhr_decomp_init before merging in.

@ekluzek ekluzek moved this from In progress - b4b-dev to Stalled (needs review, blocked etc.) in CTSM: Upcoming tags May 10, 2026
@ekluzek
Copy link
Copy Markdown
Contributor Author

ekluzek commented May 10, 2026

A few tests fail for odd grids/PE layouts that I'll need to look into:

ERS_D_Mmpi-serial_Ld5.5x5_amazon.I2000Clm50FatesRs.derecho_gnu.clm-FatesCold (RUN)
FSURDATMODIFYCTSM_D_Mmpi-serial_Ld1.5x5_amazon.I2000Clm50SpRs.derecho_gnu (RUN)
SMS_D_Mmpi-serial_Ld5.5x5_amazon.I2000Clm60Bgc.derecho_gnu.clm-HillslopeC (RUN)
SMS_Ld1_PS.nldas2_rnldas2_mnldas2.I2000Ctsm50NwpSpNldas.derecho_gnu.clm-default--clm-nofireemis (RUN)
SMS_Ld1_PS.nldas2_rnldas2_mnldas2.I2000Ctsm50NwpSpNldasRs.derecho_gnu.clm-default--clm-nofireemis (RUN)
SMS_D_Mmpi-serial_Ld5.5x5_amazon.I2000Clm60FatesCrujraRs.izumi_nag.clm-FatesCold (NLCOMP RUN)
SMS_D_Mmpi-serial_Ld5.5x5_amazon.I2000Clm60FatesRs.izumi_nag.clm-FatesCold (NLCOMP RUN)
The 25 point amazon grid for mpi-serial, and the NLDAS 104k point grid on one node

@ekluzek
Copy link
Copy Markdown
Contributor Author

ekluzek commented May 11, 2026

0.8 for Sprint 31

@samsrabin
Copy link
Copy Markdown
Member

@ekluzek Please ping me again to review once you have those tests working.

deallocate(gindex_ocn)
deallocate(gindex_ctsm)
! Destroy or release all of the ESMF objects
call ESMF_FieldRedistRelease( rhandle_lnd2ctsm, noGarbage=no_esmf_garbage, rc=rc)
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like the problem with the failing tests is that some things are being deallocated for cases where they aren't necessarily created first. So this deallocate method either needs to only be called for cases that need it, or it has to check that things were allocated first.

This won't be hard to do. But, will take a little fiddling around...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

b4b bit-for-bit code health improving internal code structure to make easier to maintain (sustainability) enhancement new capability or improved behavior of existing capability performance idea or PR to improve performance (e.g. throughput, memory)

Projects

Status: Stalled (needs review, blocked etc.)
Status: In Progress

Development

Successfully merging this pull request may close these issues.

3 participants