Failure of postproc_nemo

Hi,

Sorry to contact you with another question, but after resolving the postproc_atmos problem by increasing the time limit (see my previous topic), the suite has now failed at the postproc_nemo stage, giving me the following error:

[SUBPROCESS]: Error = -9:
WARNING: libu: memcpy AMD cpuid detection failed
[ERROR] /work/y07/shared/umshared/nemo/utils/src/REBUILD_NEMO/BLD/bin/rebuild_nemo.exe: Error=-9
→ Failed to rebuild file: /work/n02/n02/cjrw09/cylc-run/u-df570/share/data/History_Data/NEMOhist/rebuilding_DF570O_18501201_RESTART/df570o_18501201_restart.nc
[FAIL] Command Terminated
[FAIL] Terminating PostProc...
[FAIL] main_pp.py nemo <<'STDIN'
[FAIL]
[FAIL] 'STDIN' # return-code=1
2024-05-15T00:26:54Z CRITICAL - failed/EXIT
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=6574852.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.

Is this a memory/storage problem (as the first line might suggest), or is it because it cannot rebuild that particular file? If the latter, what is that file? I haven’t come across it before.

Thank you,

Charlie

Hi Charlie,

I would first try increasing the memory by adding e.g.

        [[[directives]]]
            --mem=25G

to the [[POSTPROC_RESOURCE]] section in the site/archer2.rc file. In context it would look something like the sketch below.
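Assuming your [[POSTPROC_RESOURCE]] currently has the standard HPC_SERIAL-based definition in site/archer2.rc, the edited section would look roughly like this (25G is just a starting value, so increase it if the task is OOM-killed again):

    [[POSTPROC_RESOURCE]]
        inherit = HPC_SERIAL
        pre-script = """
                     module load postproc
                     module list 2>&1
                     ulimit -s unlimited
                     """
        [[[directives]]]
            --mem=25G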

If it still OOMs, then I would put the task on a compute node (standard queue) instead.

Cheers,
Ros.

Hi Charlie

If Ros’s suggestion doesn’t work, try running postproc on a compute node. In site/archer2.rc

change

    [[POSTPROC_RESOURCE]]
        inherit = HPC_SERIAL
        pre-script = """
                     module load postproc
                     module list 2>&1
                     ulimit -s unlimited
                     """

to

    [[POSTPROC_RESOURCE]]
        inherit = HPC
        pre-script = """
                     module load postproc
                     module list 2>&1
                     ulimit -s unlimited
                     """

        [[[directives]]]
            --nodes = 1
            --ntasks = 1
            --tasks-per-node = 1
            --cpus-per-task = 4

reload the suite and retrigger the task.
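For example, from puma2 (the cycle point below is a placeholder; use the failed instance shown in the GUI or by cylc scan):

puma2$ cd ~/roses/u-df570
puma2$ rose suite-run --reload
puma2$ cylc trigger u-df570 'postproc_nemo.<cycle-point>'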

Grenville

Okay, many thanks. Increasing the memory seems to have worked without needing Grenville’s additional suggestion; in fact the task completed in just under 15 minutes.

By the way, when I do a rose suite-run --reload (after shutting down my computer, i.e. starting from a fresh window), is the suite GUI meant to reappear? If so, it doesn’t. The only way I can see the progress of my suite is by running cylc gscan &. Is this right?

It has now, perhaps unsurprisingly, failed at the next stage, pptransfer, giving me the following error:

[SUBPROCESS]: Error = 1:
Error loading source credential: GSS failure:
GSS Major Status: General failure
GSS Minor Status Error Chain:
globus_sysconfig: File does not exist: /work/n02/n02/cjrw09/cred.jasmin is not a valid file

But would you like me to open a new ticket for this?

Charlie

Hi Charlie,

To relaunch the cylc gui for a particular suite:

puma2$ cd ~/roses/<suiteid>
puma2$ rose sgc

rose suite-run --reload just reloads the suite definition for a running suite; it doesn’t relaunch the GUI.

In order for pptransfer to work using GridFTP you need to follow the pptransfer setup instructions first, which include generating cred.jasmin, the JASMIN short-lived credential.
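Once the credential has been generated, a quick sanity check (just a sketch; the prompts only indicate which machine to run each command on) is to confirm the file exists at the path gridftp complained about, and then retrigger the failed task:

archer2$ ls -l /work/n02/n02/cjrw09/cred.jasmin
puma2$ cylc trigger u-df570 'pptransfer.<cycle-point>'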

Cheers,
Ros.

Thanks very much Ros, I have done that. Do we really need to refresh our short-lived credential every 30 days? It seems a bit of a faff!

Anyway, much to my surprise, it has worked and my test has finally completed. All of the data appears to have been successfully transferred to JASMIN.

Very quick question: do I need to manually remove the data on ARCHER2 (i.e. at /work/n02/n02/cjrw09/archive), as it still appears to be there, or will it be deleted automatically once pptransfer has run?

Thanks for your help,

Charlie

Hi Charlie,

Yes, the short-lived credential needs to be refreshed every 30 days. I usually stick the expiry date in my calendar to remind myself to regenerate it, which only takes a minute.

With regard to removing the data from the /work/n02/n02/cjrw09/archive directory: at the moment, yes, you’ll need to remove it manually. I’m in the process of changing the postproc code so that the data is put in a different directory, after which the housekeeping task can automatically remove the transferred data. I’ll let you know when this is ready.
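For the manual clean-up, something along these lines on ARCHER2 would do it, but only run it after you have checked the data really is all on JASMIN (the u-df570 subdirectory name is an assumption, so list the archive directory first to see what is actually there):

archer2$ ls /work/n02/n02/cjrw09/archive
archer2$ rm -r /work/n02/n02/cjrw09/archive/u-df570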

Regards,
Ros.

Great, many thanks.

Charlie
