Having now managed to submit a suite successfully (many thanks for the previous response), and having identified and corrected a couple of failures, I find it is now failing at the fcm_make2_um stage and I don’t know why. There is nothing obvious in any of the output files, or at least nothing similar to any of the other comments here. What have I done wrong?
This is a bit odd. The fcm_make2_um task appeared to succeed on its first try but left a lock file behind. The second try failed because the lock file was present.
The err file says: [FAIL] /work/n02/n02/cjrw09/cylc-run/u-df570/share/fcm_make_um/fcm-make2.lock: lock exists at the destination
Remove the lock file (it is a directory) and retrigger.
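As a minimal sketch of the advice above: the helper name `clear_stale_lock` is hypothetical, and it assumes the lock directory is empty, which is why `rmdir` is used rather than `rm -r`. If `rmdir` refuses, inspect the directory before removing it by other means. The retrigger line in the comments assumes Cylc 7 task naming (`NAME.CYCLE_POINT`) and takes the cycle point from the work path quoted later in this thread.

```shell
#!/usr/bin/env bash
# Remove a stale fcm make lock. The lock is a directory, so rmdir is used;
# rmdir refuses to delete a non-empty directory, which guards against typos.
clear_stale_lock() {
    local lock_dir="$1"
    if [ -d "$lock_dir" ]; then
        rmdir -- "$lock_dir" && echo "removed $lock_dir"
    else
        echo "no lock at $lock_dir"
    fi
}

# Possible usage on ARCHER2 (path taken from the error message above):
#   clear_stale_lock /work/n02/n02/cjrw09/cylc-run/u-df570/share/fcm_make_um/fcm-make2.lock
# Then retrigger the failed task (Cylc 7 syntax, an assumption about this suite):
#   cylc trigger u-df570 'fcm_make2_um.18500101T0000Z'
```

If the function reports that no lock exists, the lock has already been cleared and the task can simply be retriggered.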
Thanks Grenville, and sorry for the delay in getting back to you. I don’t know why I always get to this on a Friday afternoon - probably because it takes me that long to work through all my other jobs during the week, all of which seem to require immediate responses!
Anyway, I don’t understand this, because if I look in that directory on Archer2, there is no obvious lock file:
cjrw09@ln01:~> ls /work/n02/n02/cjrw09/cylc-run/u-df570/share/fcm_make_um/
I certainly haven’t removed anything since we spoke.
But I have now tried running again, and this time it got past that stage with no problem. So it seems to have now built okay, and the reconfiguration is queueing. When (because knowing my luck, it won’t be if) this fails as well, I will open a new ticket if I can’t solve the error myself.
As I feared, it failed at the coupled stage. But at least this error is potentially easy to fix, if indeed this is the error and not a red herring:
BUFFOUT: Write Failed: Disk quota exceeded
FLUSH_UNIT_BUFFER: Error Flushing Buffered Data on PE 0
FLUSH_UNIT_BUFFER: Status is 1.0
FLUSH_UNIT_BUFFER: Length Requested was 524288
FLUSH_UNIT_BUFFER: Length written was 1024
It didn’t even get to the postprocessing or pptransfer stage, so this isn’t a JASMIN problem I guess? Do I not have enough temporary space on ARCHER2?
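One way to investigate a "Disk quota exceeded" error is to see which directories take up the most space and then compare that with the quota. The sketch below is only illustrative: the helper name `largest_entries` is hypothetical, and the `lfs quota` line in the comments assumes /work is a Lustre file system queried per user. Quotas may also be set per project or group, in which case the helpdesk can confirm which one applies.

```shell
#!/usr/bin/env bash
# List the N largest entries directly under a directory, sizes in KiB.
largest_entries() {
    local dir="$1" count="${2:-5}"
    du -sk -- "$dir"/* 2>/dev/null | sort -rn | head -n "$count"
}

# Possible usage for the suite in this thread:
#   largest_entries /work/n02/n02/cjrw09/cylc-run/u-df570 10
# The user quota on a Lustre file system can be queried with the lfs tool
# (not run here because it needs a Lustre client):
#   lfs quota -hu "$USER" /work
```

Walking a large run directory with `du` can take a while on a parallel file system, so it is worth pointing it at the suite directory rather than the whole of /work.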
Thanks very much indeed. However, it crashed again last night, after queueing for ages, with what I think is the following error:
slurmstepd: error: *** STEP 6532490.0+0 ON nid003986 CANCELLED AT 2024-05-10T09:33:38 DUE TO TIME LIMIT ***
slurmstepd: error: *** STEP 6532490.0+2 ON nid004033 CANCELLED AT 2024-05-10T09:33:38 DUE TO TIME LIMIT ***
slurmstepd: error: *** STEP 6532490.0+1 ON nid004024 CANCELLED AT 2024-05-10T09:33:38 DUE TO TIME LIMIT ***
slurmstepd: error: *** JOB 6532490 ON nid003986 CANCELLED AT 2024-05-10T09:33:38 DUE TO TIME LIMIT ***
I don’t think it had actually written anything out.
You need to increase the time limit. The model has got as far as 13/11/1850 in 6 hours, so it just needs a bit more time to complete the year you have requested. Alternatively, you could reduce the cycling period.
I’ve found this information in /home/n02/n02/cjrw09/cylc-run/u-df570/work/18500101T0000Z/coupled/pe_output/df570.fort6.pe000
All the model output is under /home/n02/n02/cjrw09/cylc-run/u-df570/share/data/History_Data
Thanks very much indeed. That’s very odd, though, because I modelled my suite on the PI from Seb S, so presumed it would run at the same speed. Given that it almost finished my test year, how much do you think I should increase the time to? 8 hours? 12 hours? I’m a bit unsure how the queueing system works now. Do we still have different queues with different priorities?
Figuring out the best wallclock is trial and error, I’m afraid. Try 8 or 9 hours. You don’t want to cut the time too fine, as you need to allow for system jitter. Equally, you don’t want to allow too much excess time, as it may increase the time spent queueing.
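The suggestion of 8 or 9 hours is consistent with a rough linear extrapolation from the failed run. The sketch below rests on assumptions: a 360-day model calendar (so reaching 13/11/1850 from 01/01/1850 is 312 days), constant throughput, and an arbitrary 20% safety margin. The function name `estimate_wallclock_hours` is hypothetical.

```shell
#!/usr/bin/env bash
# Extrapolate the wallclock needed for a full cycle from a partial run.
# Arguments: model days completed, model days per cycle,
#            wallclock hours used, safety margin in percent.
estimate_wallclock_hours() {
    LC_ALL=C awk -v elapsed="$1" -v total="$2" -v used="$3" -v margin="$4" \
        'BEGIN { printf "%.1f\n", used * total / elapsed * (1 + margin / 100) }'
}

# 312 of 360 model days in 6 hours, with a 20% margin:
estimate_wallclock_hours 312 360 6 20
```

Without any margin the same figures give about 6.9 hours, which leaves little room for variation between runs, so rounding up to 8 or 9 hours is a reasonable starting point.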
Sorry Ros, it has failed again at the coupled stage, despite (as far as I can see) completing the year correctly i.e. writing out everything it should. I can’t see any obvious error this time, so what has happened?
Thank you very much indeed. That’s odd, though, because Grenville only gave me some extra space a couple of days ago. Or was that for /home?
What is a reasonable amount of room to have on /work? At the moment, apart from my current run on Friday (at /work/n02/n02/cjrw09/cylc-run/u-df570) - which itself is a surprisingly large 937 G - the only other data I have on there are 65 G (which is everything I copied over from NEXCS, when we transitioned to ARCHER2). Is that amount okay?
Sorry, but although it has now finished the coupled stage, it has now crashed at the postproc_atmos and postproc_nemo stages. I can see another time limit error, is this the problem?
Many thanks,
Charlie
PS. Let me know if I should start a new ticket with this, as it is technically a new problem?
Yes, postproc_atmos and postproc_nemo have run out of time. 2 hours won’t be enough time to process a year’s worth of data. You’ll need to increase the time limit.
Change the value of “execution time limit” in site/archer2.rc within section [[POSTPROC_RESOURCE]]
Regards,
Ros.
P.S. If you still have problems, yes please do start a new ticket.
Thanks very much, I will do that now. Do you have any feeling for how long it needs to do a year? Once I have increased the time, can I just retrigger it, or do I need to start all over again?
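For locating the setting mentioned above, a small helper such as the following may be useful. This is a sketch under assumptions: the function name `show_time_limits` is hypothetical, the suite is assumed to be a Cylc 7 / Rose suite whose working copy lives under ~/roses/u-df570, and the value `PT4H` in the comments is purely illustrative rather than a recommendation. The reload and retrigger commands in the comments are the usual Cylc 7 / Rose way to pick up an edited time limit without restarting the suite, but whether that applies here depends on the suite still running.

```shell
#!/usr/bin/env bash
# Print each [[SECTION]] header and each "execution time limit" line,
# with line numbers, so the limit to edit is easy to find.
show_time_limits() {
    grep -n -e '^ *\[\[[A-Z_]*\]\]' -e 'execution time limit' "$1"
}

# Possible workflow (all paths and values are assumptions):
#   cd ~/roses/u-df570
#   show_time_limits site/archer2.rc
#   # edit the line under [[POSTPROC_RESOURCE]], e.g.
#   #     execution time limit = PT4H
#   rose suite-run --reload
#   cylc trigger u-df570 'postproc_atmos.18500101T0000Z'
#   cylc trigger u-df570 'postproc_nemo.18500101T0000Z'
```

Cylc expresses the limit as an ISO 8601 duration, so two hours is written `PT2H`.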