PiControl HadGEM3-GC3.1 suite

Hello,

I am trying to run my piControl HadGEM3-GC3.1-LL suite u-dm681 on puma2. My suite is a copy of the standard job u-as037. I have changed the suite to run for 1 month (down from 500 years) and turned off a lot of the STASH requests. Any other differences between the suites are simply my attempts to get the suite running (project names, user names, etc.).

When I run the job, all of the fcm_make_* tasks and the recon succeed, but when it launches the coupled step the job fails. The job logs did not reveal much to help me identify the problem. The end of job.err has the following:

???
??? WARNING ???
? Warning code: -100
? Warning from routine: CHECK_RUN_DIFFUSION
? Warning message:
? cldbase_opt_sh set to sh_wstar_closure since
? Smagorinsky diffusion not chosen.
? Warning from processor: 0
? Warning number: 9
???

srun: error: nid003203: tasks 0-1,3-127: Aborted
srun: launch/slurm: _step_signal: Terminating StepId=8589187.0+0
srun: launch/slurm: _step_signal: Terminating StepId=8589187.0+1
srun: launch/slurm: _step_signal: Terminating StepId=8589187.0+2
slurmstepd: error: *** STEP 8589187.0+0 ON nid003199 CANCELLED AT 2025-01-30T13:34:22 ***
slurmstepd: error: *** STEP 8589187.0+2 ON nid003204 CANCELLED AT 2025-01-30T13:34:22 ***
slurmstepd: error: *** STEP 8589187.0+1 ON nid003203 CANCELLED AT 2025-01-30T13:34:22 ***
srun: error: nid003204: tasks 0-5: Terminated
srun: Force Terminated StepId=8589187.0+2
srun: error: nid003199: tasks 0-49: Terminated
srun: error: nid003202: tasks 149-197: Terminated
srun: error: nid003201: tasks 100-148: Terminated
srun: error: nid003200: tasks 50-99: Terminated
srun: Force Terminated StepId=8589187.0+0
srun: error: nid003203: task 2: Aborted (core dumped)
srun: Force Terminated StepId=8589187.0+1
[FAIL] run_model <<'STDIN
[FAIL]
[FAIL] 'STDIN' # return-code=143
2025-01-30T13:34:24Z CRITICAL - failed/EXIT

Above this, there was a stream of errors of the form:

MPICH ERROR [Rank 200] [job id 8589187.0] [Thu Jan 30 13:34:21 2025] [nid003203] - Abort(1) (rank 200 in comm 480): application called MPI_Abort(comm=0x84000003, 1) - process 200

I have not yet successfully run the job. It appears to me that it can't submit the job, but I am not sure why. Any ideas on what might be causing this would be much appreciated,

Penny

Please allow us read permission on your home and work spaces on ARCHER2 (and your space on puma2):

chmod -R g+rX /home/n02/n02/<your-username>
chmod -R g+rX /work/n02/n02/<your-username>

(revert permissions on your .ssh directory & files)
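For example, to tighten .ssh back up afterwards (standard OpenSSH permissions; adjust if your setup differs):

chmod 700 /home/n02/n02/<your-username>/.ssh
chmod 600 /home/n02/n02/<your-username>/.ssh/*   # keys and config must not be group/world readable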

Grenville

Sorry. Permissions updated.

Penny

u-as037 has not been run for some time and has a pre-puma2 NEMO working-copy source. It looks like you changed some NEMO settings and that caused the problem. We will commit the nemo_sources branch (you can use /home/n02/n02/ros/nemo/branches/dev_r5518_GO6_package for now) and revert nemo_path_excl to NEMOGCM/NEMO/OPA_SRC/TRD/trdtrc.F90.
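In rose-suite.conf that would look something like this (a sketch; the exact quoting, and any name other than nemo_path_excl, may differ in your copy):

nemo_sources='/home/n02/n02/ros/nemo/branches/dev_r5518_GO6_package'   # working copy to use for now
nemo_path_excl='NEMOGCM/NEMO/OPA_SRC/TRD/trdtrc.F90'                   # reverted exclusion path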

My copy of u-as037 with /home/n02/n02/ros/nemo/branches/dev_r5518_GO6_package is running okay.

Grenville

Excellent. That was the problem and it is running now.

I have decided to attempt to convert the piControl run to an SSP245 run. By comparing suites on Trac I can see what changes are needed for the Met Office supercomputer. Is there a suite ported over to ARCHER2 with ozone redistribution which you could suggest I look at? I was told this might be tricky, so it would be good to see how it is implemented. I would also be interested to hear how the porting went. For example, was it hardware dependent? Or did the fix work much like on Monsoon?

Many thanks,

Penny

Hi Penny,

We have instructions for adding ozone redistribution to a GC3.1 suite on ARCHER2: Ozone redistribution

Have a go with that and let me know if you need any help.

Annette

Thanks for the link Annette.

In terms of the ozone redistribution:
There are two files I don't have which I was hoping someone might have access to:

  1. The ozone file, which has the following path in the suite I am trying to recreate (see https://code.metoffice.gov.uk/trac/roses-u/browser/b/j/6/1/6/trunk/rose-suite.conf):

n96e/ssp245/Ozone/v1/historic_interpolated_3d_ozone_n96e_2015_2099_ants.anc

  2. Orography files: in /work/y07/shared/umshared/hadgem3/ancil/atmos/ there is no N96 resolution.

Other changes (not ozone related):
Would it be possible to port over the following file as well, please?

/data/d01/ukcmip6/ssp585_N96O1_ensemble1_dumps/bg466o_20150101_restart_trc.nc

Thank you!

Penny

Hi Penny,

The ozone file is in the CMIP6_ANCILS directory on ARCHER2 here:
/work/y07/shared/umshared/cmip6/ancils
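If you want to locate it quickly, something like this should find it (the filename pattern is taken from the rose-suite.conf you linked; the sub-directory layout is not assumed):

find /work/y07/shared/umshared/cmip6/ancils -name '*3d_ozone_n96e_2015_2099_ants.anc'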

The other files I should be able to grab for you tomorrow.

Regards,
Ros

Hi Penny,

Orography files are now in /work/y07/shared/umshared/hadgem3/ancil/atmos/. There were a couple of N96 directories; I’ve guessed at n96e_orca025_go6. If that is not the correct directory please let me know.

After a bit of a hunt I’ve managed to find the start dump directory on the old Met Office XCE/F. The files are on ARCHER2 under /work/y07/shared/umshared/ukcmip6/ssp585_N96O1_ensemble1_dumps

Regards,
Ros.

Amazing. Thank you!

I am still working on getting the piControl suite working. Lots of my questions have already been asked and answered in other channels, which has been very handy.

The suite is running up until pptransfer. I have followed the steps for setting up the Globus transfer and that went okay. There is limited information in the job.err about why the transfer is failing. It fails not long after starting the run. I have run it twice to make sure it is repeatable. Here is the gist of the error:

[WARN] Failed to generate checksums.
[ERROR] Checksum generation failed.
[FAIL] Command Terminated
[FAIL] Terminating PostProc…

The full path on ARCHER2 for the error is /work/n02/n02/penmaher/cylc-run/u-dm681/log.20250204T172957Z/job/18500101T0000Z/pptransfer/03/job.err

Thank you!

Penny

Hi Penny,

If you are wanting to automatically transfer data over to JASMIN you will need to go through the Globus setup instructions here: Configuring PPTransfer using Globus

And the suite will also need to be modified as it hasn’t been updated for use with Globus.

  • In panel fcm_make_pp → Config → pp_sources
    Change the revision number of the postproc_2.3_pptransfer_gridftp_nopw branch from 4557 to 5411.

  • In file app/postproc/rose-app.conf add the following variables to the [namelist:pptransfer] section and set gridftp=false:

    globus_cli=true
    globus_default_colls=true
    globus_notify='off'
    

I think that should then work.
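For reference, after those edits the section should end up looking roughly like this (a sketch; any other settings already in your [namelist:pptransfer] section stay as they are):

[namelist:pptransfer]
gridftp=false
globus_cli=true
globus_default_colls=true
globus_notify='off'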

Regards,
Ros.

Hi Ros,
Thanks for the response. I have done these steps already. I am running with postproc_2.4 as I plan on including ozone redistribution.
Penny

Ok. I'll take a look at your suite and get back to you. The separate checksumming step is not used for Globus, so there is some switch not set properly.

But one quick question: when you went from postproc_2.3 to postproc_2.4, how did you do the upgrade? Did you run the rose app-upgrade command or just change the branch names?
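For reference, the app-upgrade route is run from inside the app directory, something like this (a sketch; the working-copy path is just an example, and it assumes the postproc upgrade macros can be found on your metadata search path, e.g. via --meta-path):

cd ~/roses/u-dm681/app/postproc   # your suite working copy on puma2 (adjust as needed)
rose app-upgrade postproc_2.4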

Cheers,
Ros.

Hi Penny,

Try turning verify_checksums off in postproc. That can only be used for rsync transfers.

In fcm_make_pp, set the meta to archive_and_meaning/fcm_make/postproc_2.4 so that it picks up the fcm_make_pp metadata rather than the postproc app metadata.
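In other words, roughly (the namelist section shown for verify_checksums is my best guess; check where it actually sits in your rose-app.conf):

In app/postproc/rose-app.conf:
[namelist:pptransfer]
verify_checksums=false

In app/fcm_make_pp/rose-app.conf (top of the file):
meta=archive_and_meaning/fcm_make/postproc_2.4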

Cheers,
Ros.

Hi Ros,

Thanks. The suite is now running. I am now implementing the ozone redistribution changes.

Penny

Hi Ros, the orography files in n96e_orca025_go6 → n96 are what I was looking for; I assume orca025 means the 0.25° grid. Could you tell me what go6 means, please? I am looking to recreate the HadGEM3-GC3.1-N96ORCA1 set-up.

Hi Ros,
The orography files are not quite what I am looking for. Would it be possible to get the N96 ORCA1 GO6.0 ones, please?
Penny

Hi Penny,

Do you have the full Met Office path to the file please? There isn’t an ORCA1 directory where the other files are.

GO6 is Global Ocean 6.

Regards,
Ros.

Hi Ros,

Thanks for the response. I am trying to port suite u-bj616, which runs at the Met Office. In that suite the orography files are set as follows:

OZONE_SHARE = $CYLC_SUITE_SHARE_DIR/ozone_redistribution
OROGRAPHY_INPUT=$OZONE_SHARE/qrparm.orog

REMOTE_PROJECTS_LINK = /hpc/common_xcs
source=$REMOTE_PROJECTS_LINK/um1/ancil/atmos/n96e/orca1/orography/globe30/v6/qrparm.orog

I can't find where CYLC_SUITE_SHARE_DIR is defined, but I am guessing the source line, if it still exists, is where they are.

Penny