Failure of postproc_nemo

Hi,

Sorry to contact you with another question, but after resolving the postproc_atmos problem by increasing the time limit (see my previous topic), the suite has now failed at the postproc_nemo stage, giving me the following error:

[SUBPROCESS]: Error = -9:
WARNING: libu: memcpy AMD cpuid detection failed
[ERROR] /work/y07/shared/umshared/nemo/utils/src/REBUILD_NEMO/BLD/bin/rebuild_nemo.exe: Error=-9
→ Failed to rebuild file: /work/n02/n02/cjrw09/cylc-run/u-df570/share/data/History_Data/NEMOhist/rebuilding_DF570O_18501201_RESTART/df570o_18501201_restart.nc
[FAIL] Command Terminated
[FAIL] Terminating PostProc...
[FAIL] main_pp.py nemo <<'STDIN'
[FAIL]
[FAIL] 'STDIN' # return-code=1
2024-05-15T00:26:54Z CRITICAL - failed/EXIT
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=6574852.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.

Is this a memory/storage problem (as the first line might suggest), or is it because it cannot rebuild that particular file? If the latter, what is that file? I haven’t come across it before.

Thank you,

Charlie

Hi Charlie,

I would first try increasing the memory by adding e.g.

        [[[directives]]]
            --mem=25G

to the [[POSTPROC_RESOURCE]] section in the site/archer2.rc file. In context it would look something like the sketch below.
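Assuming your [[POSTPROC_RESOURCE]] currently has the standard HPC_SERIAL-based definition in site/archer2.rc, the edited section would look roughly like this (25G is just a starting value, so increase it if the task is OOM-killed again):

    [[POSTPROC_RESOURCE]]
        inherit = HPC_SERIAL
        pre-script = """
                     module load postproc
                     module list 2>&1
                     ulimit -s unlimited
                     """
        [[[directives]]]
            --mem=25G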

If it still OOMs, then I would put the task on a compute node (standard queue) instead.

Cheers,
Ros.

Hi Charlie

If Ros’s suggestion doesn’t work, try running postproc on a compute node. In site/archer2.rc

change

    [[POSTPROC_RESOURCE]]
        inherit = HPC_SERIAL
        pre-script = """
                     module load postproc
                     module list 2>&1
                     ulimit -s unlimited
                     """

to

    [[POSTPROC_RESOURCE]]
        inherit = HPC
        pre-script = """
                     module load postproc
                     module list 2>&1
                     ulimit -s unlimited
                     """

        [[[directives]]]
            --nodes = 1
            --ntasks = 1
            --tasks-per-node = 1
            --cpus-per-task = 4

reload the suite and retrigger the task.
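For example, from puma2 (the cycle point below is a placeholder; use the failed instance shown in the GUI or by cylc scan):

puma2$ cd ~/roses/u-df570
puma2$ rose suite-run --reload
puma2$ cylc trigger u-df570 'postproc_nemo.<cycle-point>'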

Grenville

Okay, many thanks. Increasing the memory seems to have worked without needing Grenville’s additional suggestion; in fact the task completed in just under 15 minutes.

By the way, when I do a rose suite-run --reload (after shutting down my computer, i.e. starting from a fresh window), is the suite GUI meant to reappear? If so, it doesn’t. The only way I can see the progress of my suite is by running cylc gscan &. Is this right?

It has now, perhaps unsurprisingly, failed at the next stage, pptransfer, giving me the following error:

[SUBPROCESS]: Error = 1:
Error loading source credential: GSS failure:
GSS Major Status: General failure
GSS Minor Status Error Chain:
globus_sysconfig: File does not exist: /work/n02/n02/cjrw09/cred.jasmin is not a valid file

But would you like me to open a new ticket for this?

Charlie

Hi Charlie,

To relaunch the cylc gui for a particular suite:

puma2$ cd ~/roses/<suiteid>
puma2$ rose sgc

rose suite-run --reload just reloads the suite definition for a running suite; it doesn’t relaunch the GUI.

In order for pptransfer to work using GridFTP you need to follow the pptransfer setup instructions first, which include generating cred.jasmin, the JASMIN short-lived credential.
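Once the credential has been generated, a quick sanity check (just a sketch; the prompts only indicate which machine to run each command on) is to confirm the file exists at the path gridftp complained about, and then retrigger the failed task:

archer2$ ls -l /work/n02/n02/cjrw09/cred.jasmin
puma2$ cylc trigger u-df570 'pptransfer.<cycle-point>'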

Cheers,
Ros.

Thanks very much Ros, I have done that. Do we really need to refresh our short-lived credential every 30 days? It seems a bit of a faff!

Anyway, much to my surprise, it has worked and my test has finally completed. All of the data appears to have been successfully transferred to JASMIN.

Very quick question: do I need to manually remove the data on ARCHER2 (i.e. at /work/n02/n02/cjrw09/archive), as it still appears to be there, or will it be deleted automatically once pptransfer has run?

Thanks for your help,

Charlie

Hi Charlie,

Yes, the short-lived credential needs to be refreshed every 30 days. I usually stick the expiry date in my calendar to remind myself to regenerate it, which only takes a minute.

With regard to removing the data from the /work/n02/n02/cjrw09/archive directory: at the moment, yes, you’ll need to remove it manually. I’m in the process of changing the postproc code so that the data is put in a different directory, after which the housekeeping task can automatically remove the transferred data. I’ll let you know when this is ready.
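For the manual clean-up, something along these lines on ARCHER2 would do it, but only run it after you have checked the data really is all on JASMIN (the u-df570 subdirectory name is an assumption, so list the archive directory first to see what is actually there):

archer2$ ls /work/n02/n02/cjrw09/archive
archer2$ rm -r /work/n02/n02/cjrw09/archive/u-df570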

Regards,
Ros.

Great, many thanks.

Charlie
