Failing at ‘Coupled’

Hello, could you help with this one?
My submission (u-cw692) fails at the coupled task.

The only changes from a successful suite are the restart files and the model basis time in the GUI, which I changed to reflect an October start.

(Does the message indicate a problem with those files?)

Part of the error file reads:

/work/n02/n02/jgrist02/cylc-run/u-cw692/log/job/19501001T0000Z/coupled/01/job.err

???

??? WARNING ???

? Warning code: -1

? Warning from routine: eg_SISL_setcon

? Warning message: Constant gravity enforced

? Warning from processor: 0

? Warning number: 34

???

MPICH ERROR [Rank 1024] [job id 3617338.0] [Mon May 8 08:33:10 2023] [nid004973] - Abort(32765) (rank 1024 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 32765) - process 1024

srun: error: nid004973: task 0: Aborted

srun: launch/slurm: _step_signal: Terminating StepId=3617338.0+0

srun: launch/slurm: _step_signal: Terminating StepId=3617338.0+1

slurmstepd: error: *** STEP 3617338.0+1 ON nid004973 CANCELLED AT 2023-05-08T09:33:12 ***

slurmstepd: error: *** STEP 3617338.0+0 ON nid001443 CANCELLED AT 2023-05-08T09:33:12 ***

slurmstepd: error: *** STEP 3617338.0+2 ON nid005759 CANCELLED AT 2023-05-08T09:33:12 ***

srun: launch/slurm: _step_signal: Terminating StepId=3617338.0+2

srun: error: nid005781: tasks 8-11: Terminated

srun: error: nid005778: tasks 4-7: Terminated

srun: error: nid005861: tasks 16-19: Terminated

srun: error: nid005759: tasks 0-3: Terminated

srun: error: nid005859: tasks 12-15: Terminated

srun: Force Terminated StepId=3617338.0+2

srun: error: nid001443: tasks 0-63: Terminated

srun: error: nid001764: tasks 512-575: Terminated

srun: error: nid001511: tasks 384-447: Terminated

srun: error: nid001484: tasks 320-383: Terminated

srun: error: nid001463: tasks 256-319: Terminated

srun: error: nid001710: tasks 448-511: Terminated

srun: error: nid001462: tasks 192-255: Terminated

srun: error: nid001461: tasks 128-191: Terminated

srun: error: nid003954: tasks 576-639: Terminated

srun: error: nid003998: tasks 704-767: Terminated

srun: error: nid004011: tasks 832-895: Terminated

srun: error: nid004012: tasks 896-959: Terminated

srun: error: nid003984: tasks 640-703: Terminated

srun: error: nid001459: tasks 64-127: Terminated

srun: error: nid004010: tasks 768-831: Terminated

srun: error: nid004070: tasks 960-1023: Terminated

srun: Force Terminated StepId=3617338.0+0

srun: error: nid005755: tasks 896-1023: Terminated

srun: error: nid005751: tasks 768-895: Terminated

srun: error: nid004975: tasks 256-383: Terminated

srun: error: nid005009: tasks 512-639: Terminated

srun: error: nid004992: tasks 384-511: Terminated

srun: error: nid005756: tasks 1024-1151: Terminated

srun: error: nid005010: tasks 640-767: Terminated

srun: error: nid004973: tasks 1-127: Terminated

srun: error: nid004974: tasks 128-255: Terminated

srun: Force Terminated StepId=3617338.0+1

[FAIL] run_model <<'__STDIN__'

[FAIL]

[FAIL] '__STDIN__' # return-code=143

2023-05-08T08:33:13Z CRITICAL - failed/EXIT
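
Most of the excerpt above is the scheduler reporting that it tore down the remaining tasks after the abort. A minimal sketch of how a job.err like this could be reduced to the lines that might carry a clue; the function name `informative_lines` is hypothetical, and the patterns are only those seen in this excerpt, so other logs may need more:

```python
import re

# Lines that only report the scheduler terminating the remaining tasks.
# These patterns are taken from the excerpt above and are not exhaustive.
NOISE = re.compile(
    r"^(srun: (error: \S+: tasks? [\d-]+: Terminated"
    r"|launch/slurm: _step_signal: Terminating"
    r"|Force Terminated)"
    r"|slurmstepd: error: \*\*\* STEP .* CANCELLED)"
)


def informative_lines(lines):
    """Return the non-blank lines that are not scheduler teardown noise."""
    return [ln.rstrip("\n") for ln in lines if ln.strip() and not NOISE.match(ln)]


if __name__ == "__main__":
    import sys

    # Usage: python filter_joberr.py /path/to/job.err
    if len(sys.argv) > 1:
        with open(sys.argv[1], errors="replace") as fh:
            for line in informative_lines(fh):
                print(line)
```

Lines such as the `MPI_Abort` message and `task 0: Aborted` are kept, because they identify the rank and node that failed first.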

I cannot see any helpful messages. Please set PRINT_STATUS to PrStatus_Diag and rerun the first cycle; that may give better clues.

Grenville

Hi Grenville,
I have changed PRINT_STATUS to PrStatus_Diag and rerun; to me the message looks the same.

Jeremy

Hi Jeremy

I think the penny has dropped – this is a derivative of the CANARI suite, so it needs to start from Jan for the ozone redistribution to work correctly. I suspect that’s the source of the error.

Grenville

Hi Grenville,

Thanks for that. I tried setting suite conf > Ozone redistribution > ‘Redistribute ozone’ to ‘false’, with no success. Do you think that is insufficient to override the error? Might there be another way of working around it?

Many thanks,
Jeremy

Jeremy

You have /work/n02/n02/jgrist02/cylc-run/restartsCW544/ocean/cw544o_icebergs_19501001_restart.nc as both the NEMO iceberg file and the CICE restart file. Fix the CICE file and try again.

Grenville
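
A slip like the one described above (one path entered for two different restart settings) can be caught before resubmitting. A minimal sketch, assuming the restart paths have already been copied into a dictionary by hand; the setting names and the `shared_paths` helper are hypothetical and are not the suite's real variable names:

```python
from collections import defaultdict


def shared_paths(settings):
    """Map each path used by more than one setting to the sorted setting names."""
    users = defaultdict(list)
    for name, path in settings.items():
        users[path.strip()].append(name)
    return {path: sorted(names) for path, names in users.items() if len(names) > 1}


# Hypothetical setting names. The iceberg path is the one quoted in the thread;
# the NEMO restart path is a placeholder.
ICEBERG = (
    "/work/n02/n02/jgrist02/cylc-run/restartsCW544/ocean/"
    "cw544o_icebergs_19501001_restart.nc"
)
settings = {
    "nemo_iceberg_restart": ICEBERG,
    "cice_restart": ICEBERG,
    "nemo_restart": "/path/to/nemo_restart.nc",
}

for path, names in shared_paths(settings).items():
    print(f"{path} is used by: {', '.join(names)}")
```

This only flags identical paths; it cannot tell which of the settings holds the wrong file.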

Many thanks,

Jeremy