Hi all
I have run several simulations with GOML3, and it works well. However, I sometimes hit a time-limit error, and in one suite I get it systematically for a month. The problem occurs when running the coupled task.
I increased the time limit in archer2.rc (HPC_SERIAL > JOB), but I still get the error. Is there something else I could do for GOML3? I am having this problem with the suite u-di545.
Best
Paul-Arthur
Hi Paul-Arthur
I think the time limit is set in [[COUPLE_RESOURCE]]
thus:
execution time limit = PT{{'%02d' % (CLOCK[0])}}H{{'%02d' % (CLOCK[1])}}M{{'%02d' % (CLOCK[2])}}S
(seems like an overly complex way of doing it)
try changing to
execution time limit = PT5H
(it might be best to revert the HPC_SERIAL time to get better throughput on serial tasks)
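To see what the templated line expands to, here is a minimal Python sketch of the same `%02d` formatting that the Jinja2 expression applies. The `CLOCK` value `[4, 0, 0]` and the function names are assumptions for illustration only; the suite sets its own `CLOCK` elsewhere.

```python
def render_time_limit(clock):
    """Mimic PT{{'%02d' % CLOCK[0]}}H{{'%02d' % CLOCK[1]}}M{{'%02d' % CLOCK[2]}}S."""
    hours, minutes, seconds = clock
    # Each component is zero-padded to two digits, as in the Jinja2 template.
    return "PT%02dH%02dM%02dS" % (hours, minutes, seconds)


def render_setting(clock):
    """Build the full suite.rc line that the template is meant to produce."""
    return "execution time limit = " + render_time_limit(clock)


if __name__ == "__main__":
    CLOCK = [4, 0, 0]  # hypothetical value, not taken from the suite
    print(render_setting(CLOCK))
```

A `CLOCK` of `[5, 0, 0]` would render as `PT05H00M00S`, which is the same duration as the hard-coded `PT5H`.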
Grenville
Hi Grenville
Thank you. I reverted my modifications and changed the time limit in the [[COUPLE_RESOURCE]] section. However, I am still getting the same error. So far I have only re-run the coupled task. Should I reload the suite?
Best,
Paul-Arthur
Hi Paul-Arthur
Yes, you need to reload the suite.
Grenville
Hi Grenville
I changed the time limit and reloaded the suite, but I still have the same issue. Most of the suites are running fine, so I wonder if something is wrong with this one (u-di545), but the error message I saw is not helping much.
Best
Paul-Arthur
Hi Paul-Arthur
I wonder if the disk quota error that occurred in try 01 back on Aug 10th left the suite in a bad state. The simplest thing to do might be to rerun from 206409.
Grenville
Hi Grenville
I restarted the run as a CRUN, using the KPP.restart and start dumps created from the previous months. It worked.
I have another question regarding GOML3: one of my suites (u-di981) seems to be stuck (nothing’s happening and I cannot stop the suite). Is there a way to kill it properly, or to reload the GUI?
Best
Paul-Arthur
Hi Paul-Arthur
That’s great - you know more about KPP than I do.
On puma2
[grenvill@puma2 ~]$ ps -flu pmonerie | grep u-di981
1 S pmonerie 1447730 1 0 80 0 - 275396 - Aug23 ? 00:06:01 python2 /home4/home/n02-puma/fcm/metomi/cylc-7.8.12/bin/cylc-run u-di981
then
kill -9 1447730
then
rose suite-run --restart
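If picking the PID out of the listing by eye is error-prone, the lookup can be sketched in Python. This is only an illustration: it assumes the `ps -fl` column layout shown above (PID in the fourth column, command in the fifteenth), `find_suite_pids` is a hypothetical helper, and it only reports PIDs without killing anything.

```python
def find_suite_pids(ps_output, suite):
    """Return PIDs of cylc-run processes for `suite` found in `ps -fl` output."""
    pids = []
    for line in ps_output.splitlines():
        # 14 leading columns (F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME),
        # then the full command line as the final field.
        fields = line.split(None, 14)
        if len(fields) < 15:
            continue
        pid, command = fields[3], fields[14]
        if not pid.isdigit():
            continue  # skip the header row
        tokens = command.split()
        runs_cylc = any(tok.endswith("cylc-run") for tok in tokens)
        if runs_cylc and suite in tokens:
            pids.append(int(pid))
    return pids


if __name__ == "__main__":
    sample = (
        "1 S pmonerie 1447730 1 0 80 0 - 275396 - Aug23 ? 00:06:01 "
        "python2 /home4/home/n02-puma/fcm/metomi/cylc-7.8.12/bin/cylc-run u-di981"
    )
    print(find_suite_pids(sample, "u-di981"))
```

Matching on the `cylc-run` token rather than on the suite name alone avoids picking up unrelated processes, such as the `grep` itself, that merely mention the suite.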
Grenville
Hi Grenville
Thank you for your help.
I got a new error message. I slightly changed two of my suites (u-dh954 and u-dh895) to increase the time allowed for building the UM, and committed the changes (fcm commit).
I then made a copy of both suites to run an extra simulation. u-dj428 (a copy of u-dh954) is working well, but u-dj430 (a copy of u-dh895) is not. I am getting this error message when trying to run the suite (rose suite-run):
[FAIL] cylc validate -o /tmp/tmp8H7iYr --strict u-dj430 # return-code=1, stderr=
[FAIL] FileParseError:
[FAIL] Invalid line 152: 04H00M00S
Should I try to revert my committed changes?
Paul-Arthur
Hi Paul-Arthur
It looks like /home/n02/n02/pmonerie/roses/u-dj430/site/archer2.rc
has got corrupted - the first line is {{'%02d' % (CLOCK[0])}}H{{'%02d' % (CLOCK[1])}}M{{'%02d' % (CLOCK[2])}}S{# On Archer2 there are 8 (128/16) NUMA regions per node #}
Why not just copy what’s in /home/n02/n02/pmonerie/roses/u-dj428/site/archer2.rc?
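The validation error looks consistent with that first line: once the Jinja2 expression is rendered and the `{# ... #}` comment is dropped, what remains is a bare `04H00M00S` with no `execution time limit = PT` in front of it. A rough Python sketch of that kind of line-shape check is below; it is a simplification for illustration, not cylc's actual parser, and `classify` is a hypothetical helper.

```python
import re

# Simplified line shapes for a suite.rc-style file (an assumption, not cylc's grammar).
SECTION = re.compile(r"^\[+[^\[\]]+\]+$")   # e.g. [[COUPLE_RESOURCE]]
SETTING = re.compile(r"^[^=\s][^=]*=.*$")   # e.g. key = value


def classify(line):
    """Return 'blank-or-comment', 'section', 'setting' or 'invalid' for one line."""
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        return "blank-or-comment"
    if SECTION.match(stripped):
        return "section"
    if SETTING.match(stripped):
        return "setting"
    # Neither a header nor key = value: the kind of line a parser would reject.
    return "invalid"


if __name__ == "__main__":
    for text in ("[[COUPLE_RESOURCE]]",
                 "execution time limit = PT04H00M00S",
                 "04H00M00S"):
        print(text, "->", classify(text))
```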
Grenville