Suite u-cr117 ensemble members on Monsoon

Hi CMS helpdesk,

I am trying to set up the suite u-cr117 with ensembles (please see the link for changes I made https://code.metoffice.gov.uk/trac/roses-u/changeset?reponame=&new=238461%40c%2Fr%2F1%2F1%2F7%2Ftrunk&old=237475%40c%2Fq%2F6%2F5%2F4%2Ftrunk#file7). The suite u-cr117 runs on Monsoon.

I tried to run the suite. The job passed fcm…, recon and atmos_main, but crashed at postproc with the error message:
[FAIL] postproc_1 (key=postproc_1): task has no associated application.

It also crashed at rose_arch_wallclock with the following message:
[FAIL] [arch:job_info.file/]source=cr117_wallclock.list: configuration value error: [Errno 2] No such file or directory: ‘/home/d00/wyzhang/cylc-run/u-cr117/share/data/cr117_wallclock.list’
[FAIL] ! moose:/ens/u-cr117/1/job_info.file/ [compress=None]
[FAIL] ! cr117_iterations.list (cr117_iterations.list)

I have permission to archive data in ‘ens’ on moose. Thanks in advance for any help with this.

Thanks,
Weiyu

Weiyu

I think the postproc_XXX tasks need to specify ROSE_TASK_APP = postproc
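
As a minimal sketch, assuming the ensemble postproc tasks are defined under [runtime] in suite.rc (postproc_1 is taken from the error message; the rest of the task definition is omitted here):

```
[runtime]
    [[postproc_1]]
        [[[environment]]]
            # Run the shared "postproc" app instead of looking for an app named postproc_1
            ROSE_TASK_APP = postproc
```

The same setting would be needed for each postproc_XXX task, or once in a family that they all inherit from, if the suite has one.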

Grenville

Thanks Grenville! It fixed the issue with postproc!

But the job now still fails at ‘rose_arch_wallclock’:
[FAIL] [arch:job_info.file/]source=cr117_wallclock.list: configuration value error: [Errno 2] No such file or directory: ‘/home/d00/wyzhang/cylc-run/u-cr117/share/data/cr117_wallclock.list’
[FAIL] ! moose:/ens/u-cr117/1/job_info.file/ [compress=None]
[FAIL] ! cr117_iterations.list (cr117_iterations.list)

Could you please also advise on this?

Thanks,
Weiyu

Hi Grenville,

I'm not sure if this is the right place to ask, but is this problem related to moose?

Thanks,
Weiyu

Hi Weiyu

I don’t think so. Do you really need to archive the wall clock times? If not, I suggest setting TASK_ARCH_WALL to false.
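
For example, assuming TASK_ARCH_WALL is one of the Jinja2 switches in the suite's rose-suite.conf (the section name below is the usual one for suite.rc variables, but check where your suite actually defines it):

```
[jinja2:suite.rc]
TASK_ARCH_WALL=false
```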

Grenville

Thanks Grenville! I changed it to false and then got the following error message:

[FAIL] moo put -f /scratch/jtmp/pbs.8274152.xcs00.x8z/tmpCQhzXI/tmpTNfKgP.tar.gz moose:/ens/u-cr117/1/job_info.file/19880901T0000Z-1_logs.tar.gz # return-code=3, stderr=

[FAIL] put (attempt 1 of 10): (failed with code ERROR_SYSTEM_UNAVAILABLE) system is currently unavailable.

[FAIL] ////////////////////////////////////////////////////////////////////////

[FAIL] uk.gov.meto.moose.business.requesthandler.service.exceptions.SystemUnavailableException: org.springframework.remoting.RemoteConnectFailureException: Could not connect to HTTP invoker remote service at [https://exxmobsslprd:8442/controller/moose.service]; nested exception is java.net.ConnectException: Connection timed out

[FAIL] ////////////////////////////////////////////////////////////////////////

[FAIL] put (attempt 2 of 10): (failed with code ERROR_SYSTEM_UNAVAILABLE) system is currently unavailable.

So is this now a problem with moose?

Thanks,
Weiyu

Weiyu

Looks like a moose/MASS problem. Is it repeatable?

Grenville

Hi Grenville,

Yes, it is. I just tried again and it failed with the same error. Do you have any suggestions?

Thanks,
Weiyu