I am running the suite u-cg162 on the ECMWF machine. After post-processing, the outputs should be moved to MASS, but the transfer failed because of an undefined environment variable:
u-cg162 is a copy of u-bv806 with minor changes to the UM physical settings. u-bu806 had no trouble executing the file transfer with moose a few months ago.
I found that ticket #3055 reported a similar error on Archer, which was caused by an update to the rose version. I am not sure how to check that the rose version I am using is consistent.
The issue with that ticket was a mismatch in rose version between the machine where you submit the suite (puma) and the HPC (Archer), so it might be a similar issue for you. I can’t seem to log in to ECMWF right now. Can you try checking the rose versions on ecgate and the HPC by running the following on both machines:
I have had a look on ECMWF and I’m not quite sure what is going on. Can you try adding a couple of debugging lines to that job script and re-running it directly?
Go to the directory ~ukbv/cylc-run/u-cg162/log/job/20160801T0000Z/moose_only/04
And edit the job file to add some lines just before the rose task-run line, as follows:
# SCRIPT:
env
rose -V
rose task-run ...
Then just submit the job script: qsub job. It should overwrite the previous job.out and job.err when it’s done.
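If you would rather not edit the file by hand, the same change can be scripted. This is only a minimal sketch, assuming the job file contains a line that begins with rose task-run; the helper name add_debug_lines is made up for illustration and is not part of rose or cylc.

```shell
# Hypothetical helper: print a copy of a job script with two debugging
# commands inserted just before the first "rose task-run" line.
# The input file itself is not modified.
add_debug_lines() {
    awk '
        # Insert the debugging commands once, ahead of the first match.
        !inserted && /^[ \t]*rose task-run/ {
            print "env"
            print "rose -V"
            inserted = 1
        }
        # Pass every original line through unchanged.
        { print }
    ' "$1"
}
```

For example, keep a backup with cp job job.orig, then run add_debug_lines job.orig > job before resubmitting. If no line matches, the output is identical to the input.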
I think the job is failing before it reaches # SCRIPT, as nothing changed when I added the extra commands after # SCRIPT. So I added “env” and “rose -V” after # ENV-SCRIPT too.
The environment variables listed include:
ROSE_VERSION=2019.01.3
However, ROSE_TASK_CYCLE_TIME is not listed.
rose -V gives:
Rose 2019.01.3 (/perm/ms/gb/frmi/rose-2019.01.3)
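Rather than scanning the full env output by eye, a short check placed at the same point in the job script can report which variables are missing. This is a sketch only: the function name report_missing_vars is hypothetical, and the two variable names are simply the ones mentioned in this thread.

```shell
# Hypothetical helper: report which of the named variables are absent
# from the exported environment. Returns non-zero if any are missing.
report_missing_vars() {
    status=0
    for name in "$@"; do
        # printenv NAME exits non-zero when NAME is not exported.
        if ! printenv "$name" >/dev/null; then
            echo "missing: $name"
            status=1
        fi
    done
    return "$status"
}

# Example call with the variables discussed above:
#   report_missing_vars ROSE_VERSION ROSE_TASK_CYCLE_TIME
```

A variable that is exported but empty counts as present here, which matches what env itself would show.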
After a lot of investigating, the problem seems to be due to a recent upgrade to the default python2 module. The new version works OK on the normal nodes, but not on the moose nodes, which have a different architecture.
The simplest solution would just be to revert to the older module. Edit your .user_profile to specify: