Hi,
Sorry to contact you with another question, but after resolving the postproc_atmos problem by increasing the time limit (see my previous topic), it has now failed at the postproc_nemo stage, giving me the following error:
[SUBPROCESS]: Error = -9:
WARNING: libu: memcpy AMD cpuid detection failed
[ERROR] /work/y07/shared/umshared/nemo/utils/src/REBUILD_NEMO/BLD/bin/rebuild_nemo.exe: Error=-9
→ Failed to rebuild file: /work/n02/n02/cjrw09/cylc-run/u-df570/share/data/History_Data/NEMOhist/rebuilding_DF570O_18501201_RESTART/df570o_18501201_restart.nc
[FAIL] Command Terminated
[FAIL] Terminating PostProc…
[FAIL] main_pp.py nemo <<'STDIN'
[FAIL]
[FAIL] 'STDIN' # return-code=1
2024-05-15T00:26:54Z CRITICAL - failed/EXIT
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=6574852.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.
Is this a memory/storage problem (as the first line might suggest), or is it failing because it cannot rebuild that particular file? If the latter, what is that file? I haven't come across it before.
Thank you,
Charlie
Hi Charlie,
I would first try increasing the memory by adding e.g.
[[[directives]]]
--mem=25G
to the [[POSTPROC_RESOURCE]] section in the site/archer2.rc file.
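Putting that together, the relevant part of site/archer2.rc would look something like this (a sketch only; the inherit line is assumed to match what is already in your file, and 25G is just a starting value to tune if the task still runs out of memory):

```ini
[[POSTPROC_RESOURCE]]
    inherit = HPC_SERIAL
    [[[directives]]]
        --mem=25G
```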
If it still OOMs, then I would put the task on a compute node (standard queue) instead.
Cheers,
Ros.
Hi Charlie
If Ros's suggestion doesn't work, try running postproc on a compute node. In site/archer2.rc,
change
[[POSTPROC_RESOURCE]]
inherit = HPC_SERIAL
pre-script = """
module load postproc
module list 2>&1
ulimit -s unlimited
"""
to
[[POSTPROC_RESOURCE]]
inherit = HPC
pre-script = """
module load postproc
module list 2>&1
ulimit -s unlimited
"""
[[[directives]]]
--nodes = 1
--ntasks = 1
--ntasks-per-node = 1
--cpus-per-task = 4
then reload the suite and retrigger the task.
Grenville
Okay, many thanks, that seems to have worked without Grenville’s additional suggestion. In fact it completed in just under 15 minutes.
By the way, when I do a rose suite-run --reload (after shutting down my computer, i.e. starting from a fresh window), is the suite GUI meant to reappear? If so, it doesn't. The only way I can see the progress of my suite is by doing cylc gscan &. Is this right?
It has now, perhaps unsurprisingly, failed at the next stage, pptransfer, giving me the following error:
[SUBPROCESS]: Error = 1:
Error loading source credential: GSS failure:
GSS Major Status: General failure
GSS Minor Status Error Chain:
globus_sysconfig: File does not exist: /work/n02/n02/cjrw09/cred.jasmin is not a valid file
But would you like me to open a new ticket for this?
Charlie
Hi Charlie,
To relaunch the cylc gui for a particular suite:
puma2$ cd ~/roses/<suiteid>
puma2$ rose sgc
rose suite-run --reload reloads the suite definition for a running suite; it does not relaunch the GUI.
In order for pptransfer to work using GridFTP, you need to follow the pptransfer setup instructions first, which include generating cred.jasmin, the JASMIN short-lived credential.
Cheers,
Ros.
Thanks very much Ros, I have done that. Do we really need to refresh our short-lived credential every 30 days? It seems a bit of a faff!
Anyway, much to my surprise, it has worked and finally my test has completed. All of the data appears to have been successfully transferred to JASMIN.
Very quick question: do I need to manually remove the data on ARCHER2 (i.e. at /work/n02/n02/cjrw09/archive), as it still appears to be there, or is it automatically deleted once pptransfer has run?
Thanks for your help,
Charlie
Hi Charlie,
Yes, the short-lived credential needs to be refreshed every 30 days. I usually put the expiry date in my calendar to remind myself to regenerate it, which only takes a minute.
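If a calendar reminder helps, the expiry date is easy to work out. A minimal Python sketch, assuming only the 30-day lifetime mentioned above (the function name is illustrative, not part of any tool):

```python
from datetime import date, timedelta

CRED_LIFETIME_DAYS = 30  # JASMIN short-lived credential lifetime, per this thread

def credential_expiry(generated_on: date) -> date:
    """Return the date on which a credential generated on `generated_on` expires."""
    return generated_on + timedelta(days=CRED_LIFETIME_DAYS)

# e.g. a credential generated on 15 May 2024 expires on 14 June 2024
print(credential_expiry(date(2024, 5, 15)))  # → 2024-06-14
```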
With regard to removing the data from the /work/n02/n02/cjrw09/archive directory: at the moment, yes, you'll need to remove it manually. I'm in the process of trying to change the postproc code so that the data is put in a different directory, after which you can use the housekeeping task to automatically remove the transferred data. I'll let you know when this is ready.
Regards,
Ros.