Sorry to contact you with another question, but after resolving the postproc_atmos problem by increasing the time limit (see my previous topic), my suite has now failed at the postproc_nemo stage with the following error:
[SUBPROCESS]: Error = -9:
WARNING: libu: memcpy AMD cpuid detection failed
[ERROR] /work/y07/shared/umshared/nemo/utils/src/REBUILD_NEMO/BLD/bin/rebuild_nemo.exe: Error=-9
→ Failed to rebuild file: /work/n02/n02/cjrw09/cylc-run/u-df570/share/data/History_Data/NEMOhist/rebuilding_DF570O_18501201_RESTART/df570o_18501201_restart.nc
[FAIL] Command Terminated
[FAIL] Terminating PostProc…
[FAIL] main_pp.py nemo <<'STDIN'
[FAIL]
[FAIL] 'STDIN' # return-code=1
2024-05-15T00:26:54Z CRITICAL - failed/EXIT
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=6574852.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.
Is this a memory/storage problem (as the first line might suggest), or is it because it cannot rebuild that particular file? If the latter, what is that file? I haven’t come across it before.
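On the memory-versus-file question, the two clues in the log can be read together. A negative return code conventionally means the process was killed by that signal number (9 is SIGKILL), and the slurmstepd line reports an out-of-memory kill. The sketch below uses a hypothetical helper, `diagnose`, that simply scans a log for those two patterns. It assumes the `Error = -N` text follows the usual negative-return-code convention, which the log itself does not confirm.

```python
import re
import signal


def diagnose(log_text):
    """Return a list of findings for a failed task log (illustrative helper only)."""
    findings = []

    # Slurm reports cgroup out-of-memory kills with an "oom-kill event" line.
    if re.search(r"oom-kill event", log_text):
        findings.append("out-of-memory kill reported by Slurm")

    # Assumption: "Error = -N" means the subprocess was killed by signal N.
    match = re.search(r"Error\s*=\s*-(\d+)", log_text)
    if match:
        number = int(match.group(1))
        try:
            name = signal.Signals(number).name
        except ValueError:
            # Signal number not known on this platform.
            name = f"signal {number}"
        findings.append(f"subprocess killed by {name}")

    return findings
```

If both findings come back, the rebuild most likely ran out of memory rather than hitting a problem with the restart file itself.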
Okay, many thanks, that seems to have worked without Grenville’s additional suggestion. In fact it completed in just under 15 minutes.
By the way, when I do a rose suite-run --reload (after shutting down my computer, i.e. starting from a fresh window), is the suite GUI meant to reappear? If so, it doesn’t. The only way I can see the progress of my suite is by doing cylc gscan&. Is this right?
It has now, perhaps unsurprisingly, failed at the next stage, pptransfer, giving me the following error:
[SUBPROCESS]: Error = 1:
Error loading source credential: GSS failure:
GSS Major Status: General failure
GSS Minor Status Error Chain:
globus_sysconfig: File does not exist: /work/n02/n02/cjrw09/cred.jasmin is not a valid file
But would you like me to open a new ticket for this?
rose suite-run --reload reloads the suite definition for a running suite.
For pptransfer to work using GridFTP, you first need to follow the pptransfer setup instructions, which include generating cred.jasmin, the JASMIN short-lived credential.
Thanks very much Ros, I have done that. Do we really need to refresh our short-lived credential every 30 days? It seems a bit of a faff!
Anyway, much to my surprise, it has worked and finally my test has completed. All of the data appears to have been successfully transferred to JASMIN.
Very quick question: do I need to manually remove the data on ARCHER2 (i.e. at /work/n02/n02/cjrw09/archive), as it still appears to be there, or can it be automatically deleted once pptransfer has run?
Yes, the short-lived credential needs to be refreshed every 30 days. I usually put the expiry date in my calendar to remind myself to regenerate it, which only takes a minute.
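As an alternative to the calendar reminder, a rough check of how long the credential has left could look like the sketch below. It is only an estimate: it assumes the file's modification time is when the credential was generated, and the helper names `credential_expiry` and `days_remaining` are made up for illustration. The 30-day lifetime is the figure quoted above.

```python
import os
from datetime import datetime, timedelta

CRED_LIFETIME_DAYS = 30  # lifetime quoted in this thread


def credential_expiry(cred_path, lifetime_days=CRED_LIFETIME_DAYS):
    """Estimate expiry as (file modification time + lifetime). Assumption, not read from the credential."""
    generated = datetime.fromtimestamp(os.path.getmtime(cred_path))
    return generated + timedelta(days=lifetime_days)


def days_remaining(cred_path, now=None, lifetime_days=CRED_LIFETIME_DAYS):
    """Return the estimated number of days left; negative means probably expired."""
    if now is None:
        now = datetime.now()
    remaining = credential_expiry(cred_path, lifetime_days) - now
    return remaining.total_seconds() / 86400
```

For example, `days_remaining("/work/n02/n02/cjrw09/cred.jasmin")` could be run before submitting a long suite, and the credential regenerated if the result is close to zero.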
With regard to removing the data from the /work/n02/n02/cjrw09/archive directory: at the moment, yes, you’ll need to remove it manually. I’m in the process of changing the postproc code so that the data is put in a different directory, after which you can use the housekeeping task to remove the transferred data automatically. I’ll let you know when this is ready.
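Until then, it may help to review what is left in the archive directory before removing anything by hand. The sketch below is a report-only helper (the name `summarise_tree` is hypothetical): it counts files and adds up their sizes, and deliberately deletes nothing. Checking that the data really has arrived on JASMIN is still a separate, manual step.

```python
import os


def summarise_tree(root):
    """Return (file_count, total_bytes) for regular files under root. Deletes nothing."""
    count = 0
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symbolic links so linked data is not counted.
            if os.path.islink(path):
                continue
            count += 1
            total += os.path.getsize(path)
    return count, total
```

Calling `summarise_tree("/work/n02/n02/cjrw09/archive")` gives a quick picture of how much space the already-transferred data is still using.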