Job failure

Hi,

Sorry to bother you, but I submitted two jobs last week (u-dw696 and u-dw704), both of which should have run for 20 years. However, both seem to have failed at exactly the same point, in year 4, at the coupled stage. I have checked the usual suspects, e.g. job.err, but found nothing obvious. Please can you help?

Charlie

Charlie

The job.err file says:

BUFFOUT: Write Failed: Disk quota exceeded

Grenville

Sorry Grenville, I don’t know how I missed that.

So does that mean I have exceeded the disk quota on ARCHER2, or via the file transfer (using Globus) on JASMIN? If the latter, can you tell me how I can set up automatic archiving onto the elastic tape?

Thanks,

Charlie

The disk quota is exceeded on ARCHER2.

Is that because I was trying to run two simulations in parallel? Should I not do this? Or can I have a little bit more temporary space on ARCHER?

Charlie

Charlie

increased work quota to 6TB

Grenville

Thanks very much indeed, really appreciated. But, going back to my previous question, is it not considered good practice to run a couple of suites at the same time, in parallel?

And also (apologies for the basic question), am I able to just restart my suites (using a rose-suite restart) from where they failed, now that I have more space?

Thanks a lot,

Charlie

Hi Grenville,

Really sorry, but I have got the same problem with my 2 simulations (u-dw696 and u-dw704): after you increased my quota, they ran for another couple of years and then again crashed at the coupled stage. Like before, I can’t see anything obvious in the job.err for either, but then again I didn’t see anything before when you said the quota had been exceeded. However, looking at SAFE, I can see that I have again breached my quota on /work (currently 6,001 GB out of 6,000 GB).

If I look at my output (taking u-dw696 as an example, but the same is true for the other), the main culprit is /work/n02/n02/cjrw09/cylc-run/u-dw696/share, which holds about 3 TB. Within that, the main culprits are the individual cycles, e.g. cycle/18530101T0000Z/u-dw696/18530101T0000Z/. Each of these years appears to contain all of the output for that year, so it’s no wonder /work is filling up quickly. But why is this happening? Have I not set up my housekeeping correctly, such that it is not deleting previous cycles once they have been transferred to JASMIN (via Globus)? How can I resolve this?

Charlie

Charlie

Because of the autonomous, asynchronous nature of Globus data transfers, housekeeping cannot know when Globus has finished moving data (housekeeping is doing nothing in your suites). Removing staged data at the appropriate time is the user's responsibility for suites running Globus transfers.

Grenville

Okay, thank you. But is there no way of turning housekeeping (or whatever equivalent now exists) on? Having to manually delete the output in /work/n02/n02/cjrw09/cylc-run/u-dw696/share/cycle/* approximately every 2 model years (basically twice a day, every day) is going to be tricky! Has nobody else had this problem? Presumably this is because we changed to Globus, as I am sure this was an option before. Is there no workaround?

Charlie

Charlie

Hang on - I’ve confused myself! A successful pptransfer should indicate that Globus has completed; an unsuccessful pptransfer does not necessarily indicate a failed transfer.

More thought required

Grenville

Thank you. If it helps, I can confirm that for all of my previous cycles, e.g. 1855, pptransfer succeeded and the data have indeed been transferred to JASMIN via Globus. Then the coupled task for 1856 failed because it ran out of disk quota on /work. And checking /work reveals that all of the transferred data (1855, 1854, and so on all the way back to 1850) is still there.

Charlie

Please see UM Post-processing App

section 5 Automatically delete data from ARCHER2 after successful transfer to JASMIN

(delete data by hand that has been transferred successfully)
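The "delete data by hand" step above can be scripted to make the clean-up less error-prone. The following is a minimal sketch only: the suite name and cycle-directory layout are taken from this thread, `prune_transferred` is a hypothetical helper, and you should only list cycle points you have personally verified are safely on JASMIN.

```python
# Hypothetical helper: delete staged share/cycle directories, but only for
# cycle points that have been confirmed transferred to JASMIN.
import shutil
from pathlib import Path

def prune_transferred(cycle_root, transferred_cycles):
    """Remove each confirmed-transferred cycle directory under cycle_root.

    Returns the list of cycle points actually removed.
    """
    removed = []
    for cycle in transferred_cycles:
        target = Path(cycle_root) / cycle
        if target.is_dir():
            shutil.rmtree(target)  # irreversible: check the transfers first!
            removed.append(cycle)
    return removed

# Example call (paths from this thread; do not run without checking JASMIN):
# prune_transferred("/work/n02/n02/cjrw09/cylc-run/u-dw696/share/cycle",
#                   ["18500101T0000Z", "18510101T0000Z"])
```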

Grenville

Okay, thank you very much. So just to be clear (so I understand exactly what this is doing), the line

prune{share/cycle}=-P6M

means that, after each forthcoming cycle, and provided pptransfer has succeeded, it will delete the data on /work, with a timeout of 6 minutes. Presumably that is enough time to delete the required amount of data? Is that right?

But for all of my data that has not been deleted up till now (i.e. 1850-1855), I should delete this manually? But going forward, I shouldn’t need to do this?

Charlie

And sorry, 1 more question is:

Once I have made that change in housekeeping, do I need to reload it using a suite run-reload?

Charlie

Hi Charlie,

prune{share/cycle}=-P6M

It means: remove the share/cycle/<cycle-point> directories that are 6 months or more behind the current cycle point.

FYI P6M is 6 months. PT6M is 6 minutes :grin:

Yes, run rose suite-run --reload after making the changes to housekeeping.

Cheers,
Ros.
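Ros's -P6M rule above can be illustrated with a small, hypothetical sketch (the cycle-point format is taken from this thread; the helper names are invented for illustration): a cycle directory becomes eligible for pruning once its cycle point is six or more calendar months behind the current one.

```python
# Hypothetical illustration of prune{share/cycle}=-P6M: prune cycle points
# that are 6 or more calendar months behind the current cycle point.
from datetime import datetime

FMT = "%Y%m%dT%H%MZ"  # e.g. 18530101T0000Z

def months_behind(cycle, current):
    """Whole calendar months from `cycle` back to `current`."""
    a = datetime.strptime(cycle, FMT)
    b = datetime.strptime(current, FMT)
    return (b.year - a.year) * 12 + (b.month - a.month)

def cycles_to_prune(cycles, current, offset_months=6):
    """Cycle points at least `offset_months` months behind `current`."""
    return [c for c in cycles if months_behind(c, current) >= offset_months]

print(cycles_to_prune(
    ["18530101T0000Z", "18540101T0000Z", "18550101T0000Z"],
    "18550101T0000Z"))
```

With annual cycles, as in this suite, everything a year or more old falls outside the 6-month window, so only the most recent cycle's staged data survives.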

Perfect, thank you very much!

Charlie

Hi again,

Very sorry about this, but the coupled stage has again failed at exactly the same point in my 2 suites (u-dw696 and u-dw704). It isn’t a disk quota issue this time, because I have turned on housekeeping and it is indeed working. Looking at job.err, there is again no obvious error: several warning messages, but nothing that looks fatal.

Can you help?

Charlie

Charlie

What files are you looking at? The job.err file says:

BUFFOUT: Write Failed: Disk quota exceeded

You have exceeded your 6TB /work quota.

Grenville

Very sorry Grenville, honestly I think I am going mad. I have just checked job.err (both in ~/cylc-run/u-dw696/log/job/18620101T0000Z/coupled/NN/job.err and by right-clicking on the failed task and selecting View job logs (Viewer) > job.err), and I can now see that line. But I can guarantee, absolutely 100%, that it wasn’t there when I checked the same file yesterday. Unless I have completely lost all sense of reality.

But I don’t understand why it is filling up so quickly and subsequently failing. On PUMA2:

[cjrw09@puma2 ~]$ pwd
/home/n02/n02/cjrw09
[cjrw09@puma2 ~]$ du -skh *
1.8G cylc-run
38M roses

On ARCHER2:

cjrw09@ln02:/home/n02/n02> du -skh cjrw09/
6.9G cjrw09/
cjrw09@ln02:/home/n02/n02> cd /work/n02/n02/cjrw09
cjrw09@ln02:/work/n02/n02/cjrw09> du -skh *
4.0K archive
8.0K cred.jasmin
5.8T cylc-run
82G gc31
20K gl-ancil_topographic_index.py

So clearly the problem is cylc-run on /work. Looking at this, housekeeping is still not deleting output from previous cycles: in /work/n02/n02/cjrw09/cylc-run/u-dw696/share/cycle/, the directories I deleted manually (1850-1855) are indeed empty, but everything after them still contains data, despite my turning on housekeeping according to the instructions and reloading. It’s almost as if housekeeping is working on PUMA2 (e.g. at /home/n02/n02/cjrw09/cylc-run/u-dw696/share/cycle, where all years are empty) but not on ARCHER2.
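As a quick way to see whether pruning is actually running on ARCHER2, something like this hypothetical sketch would list which cycle directories still contain files (path taken from this thread; `nonempty_cycles` is an invented helper, not part of any suite tooling):

```python
# Hypothetical check: list share/cycle directories that still contain
# anything, i.e. cycles housekeeping has not yet cleaned up.
from pathlib import Path

def nonempty_cycles(cycle_root):
    """Return sorted names of non-empty cycle directories under cycle_root."""
    root = Path(cycle_root)
    return sorted(p.name for p in root.iterdir()
                  if p.is_dir() and any(p.iterdir()))

# Example (path from this thread):
# print(nonempty_cycles("/work/n02/n02/cjrw09/cylc-run/u-dw696/share/cycle"))
```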

Please can you advise? In the meantime I will delete manually again and restart.

Charlie