Job failing on Archer2

Hi.
I am running a suite on Archer2 (u-db799). It completed one month successfully, but it now fails when trying to run atmos_main, even though I haven’t changed anything. I am getting this error message:

srun: error: nid004463: tasks 0-63: Segmentation fault
srun: launch/slurm: _step_signal: Terminating StepId=5035364.0+0
srun: launch/slurm: _step_signal: Terminating StepId=5035364.0+1
srun: launch/slurm: _step_signal: Terminating StepId=5035364.0+7
slurmstepd: error: *** STEP 5035364.0+1 ON nid004499 CANCELLED AT 2023-12-08T01:40:29 ***
slurmstepd: error: *** STEP 5035364.0+4 ON nid006248 CANCELLED AT 2023-12-08T01:40:29 ***
etc.

I have tried rerunning it several times and get the same srun error each time. Where could this error be coming from?

Many thanks,
Paul-Arthur

Hi Paul-Arthur

Please change the permissions on
/work/n02/n02/pmonerie/cylc-run/u-db799/work/19881001T0000Z/atmos_ens0/core

so that I can read it.

Grenville
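
For anyone following along, here is a minimal sketch of one way to grant read access to a file such as the core dump above. The function name make_world_readable is hypothetical and not part of any suite tooling; it only adds read permission for group and others and leaves the other permission bits as they are.

```python
import os
import stat


def make_world_readable(path):
    """Add read permission for group and others, keeping existing bits.

    Hypothetical helper for illustration; returns the resulting mode.
    """
    current = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, current | stat.S_IRGRP | stat.S_IROTH)
    return stat.S_IMODE(os.stat(path).st_mode)
```

It could be called with the path of the core file quoted above. Note that the reader also needs search (execute) permission on each parent directory, which this sketch does not change.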

You should be able to read it now.
Best,
Paul-Arthur

Hi Paul-Arthur

I just noticed that something has gone wrong and left you with a zero-length XML file:

-rw-r--r-- 1 pmonerie n02 0 Dec 1 11:55 um-atmos-file_ens_def.xml

I have a copy in /home/n02/n02/grenvill/xml-paul-arthur; its file name will need to be changed.
It looks like something went wrong when your disk quota was exceeded, and the suite wasn’t clever enough to correct it.

Grenville
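
Since a zero-length file was the culprit here, it may be worth checking whether other files were left empty by the same quota problem. The following is only a sketch: find_empty_files is a hypothetical helper, and the default "*.xml" pattern is an assumption based on the file named above.

```python
from pathlib import Path


def find_empty_files(root, pattern="*.xml"):
    """Return sorted paths under root matching pattern whose size is 0 bytes.

    Hypothetical helper for illustration; searches root recursively.
    """
    return sorted(
        p for p in Path(root).rglob(pattern)
        if p.is_file() and p.stat().st_size == 0
    )
```

For example, find_empty_files("/work/n02/n02/pmonerie/cylc-run/u-db799") would list any empty XML files under the suite's run directory. Passing pattern="*" widens the search to all files, although some files are legitimately empty, so the results need checking by hand.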
