Fcm_make2_um failing

Hi,

I have now managed to submit a suite successfully (many thanks for the previous response) and have identified and corrected a couple of failures, but it is now failing at the fcm_make2_um stage and I don’t know why. There is nothing obvious in any of the output files, or at least nothing similar to any of the other comments here. What have I done wrong?

My suite is u-df570.

Thanks,

Charlie

Hi Charlie

This is a bit odd. The fcm_make2_um task appeared to succeed on its first try but left a lock file behind. The second try then failed because the lock file was present.

The err file says:
[FAIL] /work/n02/n02/cjrw09/cylc-run/u-df570/share/fcm_make_um/fcm-make2.lock: lock exists at the destination

Remove the lock file (it is a directory) and retrigger the task.
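In case it helps, a minimal sketch of the clean-up, run here against a throwaway directory rather than the real suite (the retrigger command is Cylc 7 syntax and the task ID is an example - adjust for your Cylc version):

```shell
# Simulate the leftover lock directory in a throwaway location,
# then remove it the same way you would on ARCHER2.
SHARE_DIR=$(mktemp -d)             # stand-in for .../share/fcm_make_um
mkdir "$SHARE_DIR/fcm-make2.lock"  # the stale lock left behind by the task

if [ -d "$SHARE_DIR/fcm-make2.lock" ]; then
    rm -rf "$SHARE_DIR/fcm-make2.lock"
fi

# On the real system the path is the one from the err file:
#   /work/n02/n02/cjrw09/cylc-run/u-df570/share/fcm_make_um/fcm-make2.lock
# and you would then retrigger, e.g.:
#   cylc trigger u-df570 fcm_make2_um.1
```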

Grenville

Thanks Grenville, and sorry for the delay in getting back to you. I don’t know why I always get to this on a Friday afternoon - probably because it takes me that long to work through all my other jobs during the week, all of which seem to require immediate responses!

Anyway, I don’t understand this, because if I look in that directory on Archer2, there is no obvious lock file:

cjrw09@ln01:~> ls /work/n02/n02/cjrw09/cylc-run/u-df570/share/fcm_make_um/

build-atmos extract fcm-make2.cfg fcm-make2.log preprocess-atmos
build-recon fcm-make2-as-parsed.cfg fcm-make2.cfg.orig fcm-make2-on-success.cfg preprocess-recon

I certainly haven’t removed anything, since we spoke.

But I have now tried running again, and this time it got past that stage with no problem. So it seems to have built okay, and the reconfiguration is queueing. When (because, knowing my luck, it won’t be “if”) this fails as well, I will open a new ticket if I can’t solve the error myself.

Charlie

Hi again,

As I feared, it failed at the coupled stage. But at least this error is potentially easy to fix, if indeed this is the error and not a red herring:

BUFFOUT: Write Failed: Disk quota exceeded

FLUSH_UNIT_BUFFER: Error Flushing Buffered Data on PE 0
FLUSH_UNIT_BUFFER: Status is 1.0
FLUSH_UNIT_BUFFER: Length Requested was 524288
FLUSH_UNIT_BUFFER: Length written was 1024

It didn’t even get to the postprocessing or pptransfer stage, so this isn’t a JASMIN problem I guess? Do I not have enough temporary space on ARCHER2?

Charlie

Hi Charlie

I increased your quota to 1TB - there’s more, but as usual n02 is filling up.
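If you want to keep an eye on usage yourself, a small sketch (measured here on a throwaway directory with stand-in data; on ARCHER2 you would point du at your own /work area instead):

```shell
# Create a throwaway directory with some stand-in data, then measure it.
RUN_DIR=$(mktemp -d)
dd if=/dev/zero of="$RUN_DIR/dump" bs=1024 count=64 2>/dev/null

# du -sk reports total usage of the tree in kilobytes.
USAGE_KB=$(du -sk "$RUN_DIR" | awk '{print $1}')
echo "usage: ${USAGE_KB} KB"
rm -rf "$RUN_DIR"
```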

Grenville

Hi,

Thanks very much indeed. However, it crashed again last night, after queueing for ages, with what I think is the following error:

slurmstepd: error: *** STEP 6532490.0+0 ON nid003986 CANCELLED AT 2024-05-10T09:33:38 DUE TO TIME LIMIT ***
slurmstepd: error: *** STEP 6532490.0+2 ON nid004033 CANCELLED AT 2024-05-10T09:33:38 DUE TO TIME LIMIT ***
slurmstepd: error: *** STEP 6532490.0+1 ON nid004024 CANCELLED AT 2024-05-10T09:33:38 DUE TO TIME LIMIT ***
slurmstepd: error: *** JOB 6532490 ON nid003986 CANCELLED AT 2024-05-10T09:33:38 DUE TO TIME LIMIT ***

I don’t think it had actually written anything out.

Charlie

Hi Charlie,

You need to increase the time limit. The model got as far as 13/11/1850 in 6 hours, so it just needs a bit more time to complete the year you have requested. Alternatively, you could reduce the cycling period.
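For a rough sense of the shortfall: assuming the suite runs on a 360-day model calendar and throughput stays roughly constant (both assumptions on my part), the 6-hour figure can be scaled up like this:

```python
# Rough wallclock estimate from how far the model got before the limit.
days_done = 10 * 30 + 13          # 1 Jan to 13 Nov on a 360-day calendar
hours_used = 6.0                  # wallclock consumed when it was cancelled
rate = days_done / hours_used     # model days per wallclock hour
full_year_hours = 360 / rate      # estimate for a complete model year
print(round(full_year_hours, 1))  # prints 6.9, so 8-9 hours gives headroom
```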

I’ve found this information in /home/n02/n02/cjrw09/cylc-run/u-df570/work/18500101T0000Z/coupled/pe_output/df570.fort6.pe000

All the model output is under /home/n02/n02/cjrw09/cylc-run/u-df570/share/data/History_Data

Regards,
Ros.

Hi Ros,

Thanks very much indeed. That’s very odd, though, because I modelled my suite on the PI from Seb S, so presumed it would run at the same speed. Given that it almost finished my test year, how much do you think I should increase the time to? 8 hours? 12 hours? I’m also a bit unsure how the queueing system works now - do we still have different queues with different priorities?

Thanks,

Charlie

Hi Charlie,

Figuring out the best wallclock is trial and error, I’m afraid. Try 8 or 9 hours. You don’t want to cut the time too finely, to allow for system jitter, yet you don’t want too much excess time either, as it may increase the time spent queueing.

The main compute queue is the standard queue (QoS).
Details of all the ARCHER2 queues are available in the “Running jobs” section of the ARCHER2 User Documentation.

Cheers,
Ros

Okay, thanks, all understood. I am just running now with 8 hours, so fingers crossed.

Charlie

Sorry Ros, it has failed again at the coupled stage, despite (as far as I can see) completing the year correctly, i.e. writing out everything it should. I can’t see any obvious error this time, so what has happened?

Charlie

Hi Charlie,

You hit your /work disk quota - see job.err file.

I’ve just given you a bit more space, but n02 is currently creaking at the seams. Please resubmit and it should be fine this time.

Cheers
Ros.

Thank you very much indeed. That’s odd, though, because Grenville only gave me some extra space a couple of days ago. Or was that for /home?

What is a reasonable amount of room to have on /work? At the moment, apart from my current run from Friday (at /work/n02/n02/cjrw09/cylc-run/u-df570), which itself is a surprisingly large 937 G, the only other data I have on there is 65 G (everything I copied over from NEXCS when we transitioned to ARCHER2). Is that amount okay?

Charlie

Hi again,

Sorry, but although it has now finished the coupled stage, it has crashed at the postproc_atmos and postproc_nemo stages. I can see another time limit error - is this the problem?

Many thanks,

Charlie

PS. Let me know if I should start a new ticket with this, as it is technically a new problem?

Hi Charlie,

Yes, postproc_atmos and postproc_nemo have run out of time. 2 hours won’t be enough time to process a year’s worth of data. You’ll need to increase the time limit.

Change the value of “execution time limit” within the [[POSTPROC_RESOURCE]] section of site/archer2.rc.
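For illustration, the fragment would look something like this (Cylc 7 syntax; PT4H is just an example value, not a recommendation - pick a limit to suit your data volume):

```ini
# site/archer2.rc (fragment)
[[POSTPROC_RESOURCE]]
    [[[job]]]
        execution time limit = PT4H
```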

Regards,
Ros.

P.S. If you still have problems, yes please do start a new ticket.

Thanks very much, I will do that now. Do you have any feeling for how long it needs to do a year? Once I have increased the time, can I just retrigger it, or do I need to start all over again?

Charlie

Hi Charlie,

The files that have so far been processed are in:
/work/n02/n02/cjrw09/archive/u-df570/18500101T0000Z.

So, depending on which streams you have active, you might be able to get a rough idea of how much extra time is needed; otherwise, again, it’s just trial and error.

Change the time limit, reload the suite and then retrigger the failed tasks.
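For reference, the sequence is roughly as follows (Rose/Cylc 7 commands as used on ARCHER2; the commands are echoed rather than run, since they need a live suite environment - drop the echo to run them for real):

```shell
# Sketch only: print the reload-and-retrigger commands rather than run them.
SUITE=u-df570
CYCLE=18500101T0000Z  # cycle point of the failed tasks
echo "rose suite-run --reload --name=$SUITE"
echo "cylc trigger $SUITE postproc_atmos.$CYCLE"
echo "cylc trigger $SUITE postproc_nemo.$CYCLE"
```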

Cheers,
Ros.