Puma2: job failing at ozone

Failing at install_ozone_stream and recon

Would you be able to help with the following:

My job u-da314 is failing at recon and install_ozone_stream

It seemed to be running yesterday – for a month or so at least.

With many thanks,
Jeremy

cat /work/n02/n02/jgrist02/cylc-run/u-da314/log/job/19501001T0000Z/install_ozone_stream/01/job.err

Lmod is automatically replacing "cce/15.0.0" with "gcc/11.2.0".

Due to MODULEPATH changes, the following have been reloaded:

  1. cray-mpich/8.1.23

Traceback (most recent call last):
  File "/work/y07/shared/umshared/bin/mule-convpp", line 33, in <module>
    sys.exit(load_entry_point('um-utils==2022.7.1', 'console_scripts', 'mule-convpp')())
  File "/work/y07/shared/umshared/lib/python3.9/um_utils/convpp.py", line 133, in _main
    if mule.pp.file_is_pp_file(input_file):
  File "/work/y07/shared/umshared/lib/python3.9/mule/pp.py", line 100, in file_is_pp_file
    first_word = np.fromfile(file_path, dtype=">i4", count=1)
FileNotFoundError: [Errno 2] No such file or directory: '/work/n02/n02/jgrist02/cylc-run/u-da314/share/data/History_Data/da314a.po19500101'
[FAIL] convpp_ozone.sh <<'__STDIN__'
[FAIL]
[FAIL] '__STDIN__' # return-code=1
2023-10-11T11:26:05Z CRITICAL - failed/EXIT

Hi Jeremy,

What changes have you made since you had it running ok yesterday?

Regards,
Ros.

Actually I can see one problem - you’re starting the suite from October 1950. You need to start this suite from the beginning of the year for the ozone redistribution to work.
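
For anyone wanting to check a start point before submitting, here is a minimal sketch of that test in Python. It assumes "beginning of the year" means 1 January at 00:00Z and that the cycle point is written in the same form as in the log paths above (e.g. 19501001T0000Z). The function name is made up for illustration and is not part of Cylc, Rose or the suite.

```python
from datetime import datetime


def starts_at_beginning_of_year(cycle_point: str) -> bool:
    """Return True if a cycle point such as '19500101T0000Z' is 1 January, 00:00.

    Hypothetical helper: the format string below assumes the compact
    YYYYMMDDThhmmZ form seen in the log directory names in this thread.
    """
    stamp = datetime.strptime(cycle_point, "%Y%m%dT%H%MZ")
    return (stamp.month, stamp.day, stamp.hour, stamp.minute) == (1, 1, 0, 0)


# The start point used in this thread, followed by a start-of-year point.
print(starts_at_beginning_of_year("19501001T0000Z"))
print(starts_at_beginning_of_year("19500101T0000Z"))
```

Note that the failed task was looking for da314a.po19500101, a file stamped January 1950, which an October start would not have produced.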

Hi Ros,

That's the thing, I'm not aware of any changes since yesterday, when it ran.

It is a generic CANARI suite. The changes I have tried to apply are changing the start date to October 1st and changing the specified ocean, ice, iceberg and atmos start files. This was initially unsuccessful, but after a few attempts it seemed to be working yesterday.

many thanks,

Jeremy

Hi Jeremy,

Without having the output from the successful attempt I can’t say what’s changed.

All I can say is you definitely have to start the CANARI suite from the beginning of a year. The ozone redistribution will not work starting the suite from October 1st.

Also in the suite.rc file remove the line:

hold after point = 19500101T0000Z

as this is CANARI specific.

Regards,
Ros.

Many thanks - I removed the line from suite.rc and turned off
redistribute ozone under suite conf > ozone redistribution.

It (u-da314) now runs, but only for 1 month. Is it possible to see why this is?

Jeremy

from /work/n02/n02/jgrist02/cylc-run/u-da314/log/job/19501001T0000Z/coupled/01/job.err:

[983] exceptions: Program is "/mnt/lustre/a2fs-work2/work/n02/n02/jgrist02/cylc-run/u-da314/work/19501001T0000Z/coupled/./atmos.exe"
Warning in umPrintMgr: umPrintExceptionHandler : Handler Invoked
MPICH ERROR [Rank 983] [job id 4635046.0] [Wed Oct 11 22:18:35 2023] [nid001047] - Abort(9) (rank 983 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 9) - process 983

???
???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!
? Error code: 24
? Error from routine: WRITHEAD
? Error message: WRITHEAD: Addressing conflict
? Error from processor: 1003
? Error number: 69
???

[1003] exceptions: An non-exception application exit occured.
[1003] exceptions: whilst in a serial region
[1003] exceptions: Task had pid=217136 on host nid001047
[1003] exceptions: Program is "/mnt/lustre/a2fs-work2/work/n02/n02/jgrist02/cylc-run/u-da314/work/19501001T0000Z/coupled/./atmos.exe"
Warning in umPrintMgr: umPrintExceptionHandler : Handler Invoked
MPICH ERROR [Rank 1003] [job id 4635046.0] [Wed Oct 11 22:18:35 2023] [nid001047] - Abort(9) (rank 1003 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 9) - process 1003

srun: error: nid001017: task 205: Aborted
srun: launch/slurm: _step_signal: Terminating StepId=4635046.0+0
srun: launch/slurm: _step_signal: Terminating StepId=4635046.0+2
slurmstepd: error: *** STEP 4635046.0+0 ON nid001001 CANCELLED AT 2023-10-11T23:18:36 ***
slurmstepd: error: *** STEP 4635046.0+2 ON nid001061 CANCELLED AT 2023-10-11T23:18:36 ***
slurmstepd: error: *** STEP 4635046.0+1 ON nid001048 CANCELLED AT 2023-10-11T23:18:36 ***
srun: launch/slurm: _step_signal: Terminating StepId=4635046.0+1
srun: error: nid001047: task 961: Aborted
srun: error: nid001038: task 487: Aborted
srun: error: nid001002: tasks 64-72,74-119,121-127: Aborted
srun: error: nid001037: tasks 384,386-438,440-447: Aborted
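
The useful part of a job.err like this is the "?" banner, which is easy to lose among the srun and MPICH lines. As a rough sketch, the Python below pulls the banner fields out of the log text. It assumes each field appears on its own line in the form "? Error ...: value", as in the excerpt above. The function name and the regular expression are illustrative only and not part of the UM or any NCAS tooling.

```python
import re

# Matches banner lines such as "? Error code: 24". The field name stops at the
# first colon, so a value like "WRITHEAD: Addressing conflict" stays intact.
BANNER_FIELD = re.compile(r"^\?\s*(Error [^:]+):\s*(.*)$")


def parse_um_error_banner(text: str) -> dict:
    """Collect the 'Error ...' fields from the first error banner in the text."""
    fields = {}
    for line in text.splitlines():
        match = BANNER_FIELD.match(line.strip())
        if match:
            # Keep the first value seen for each field name.
            fields.setdefault(match.group(1), match.group(2).strip())
    return fields


sample = """\
???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!
? Error code: 24
? Error from routine: WRITHEAD
? Error message: WRITHEAD: Addressing conflict
? Error from processor: 1003
? Error number: 69
"""

for name, value in parse_um_error_banner(sample).items():
    print(f"{name}: {value}")
```

To run it on the real log, read the job.err file into a string and pass it in place of sample.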