Hi,
Sorry to bother you, but I submitted 2 jobs last week (u-dw696 and u-dw704), both of which should have run for 20 years. However, both seem to have failed at exactly the same point, in year 4, at the coupled stage. I have checked the usual suspects, e.g. job.err, but found nothing obvious. Please can you help?
Charlie
Charlie
The job.err file says:
BUFFOUT: Write Failed: Disk quota exceeded
Grenville
Sorry Grenville, I don’t know how I missed that.
So does that mean I have exceeded the disk quota on ARCHER2, or via the file transfer (using Globus) on JASMIN? If the latter, can you therefore tell me how I can set up automatic archiving onto the elastic tape?
Thanks,
Charlie
The disk quota is exceeded on ARCHER2.
Is that because I was trying to run two simulations in parallel? Should I not do this? Or can I have a little bit more temporary space on ARCHER?
Charlie
Charlie
Increased /work quota to 6TB.
Grenville
Thanks very much indeed, really appreciated. But, going back to my previous question, is it not considered good practice to run a couple of suites at the same time, in parallel?
And also (apologies for the basic question), am I able to just restart my suites from where they failed (using rose suite-run --restart), now I have more space?
Thanks a lot,
Charlie
Hi Grenville,
Really sorry, but I have got the same problem with my 2 simulations (u-dw696 and u-dw704) - after you increased my quota, they ran another couple of years and then crashed again at the coupled stage. Like before, I can't see anything obvious in the job.err for either, but then again I didn't see anything before when you said I had breached the quota. However, looking at SAFE, I can see that I have again breached my quota on /work (currently 6,001 GB out of 6,000 GB).
If I look at my output (taking u-dw696 as an example, but the same is true for the other), then the main culprit is /work/n02/n02/cjrw09/cylc-run/u-dw696/share, which holds about 3 TB. And if I look at this, the main culprits are the individual cycles, e.g. cycle/18530101T0000Z/u-dw696/18530101T0000Z/. Each of these years appears to contain all of the output for that year, so it's no wonder /work is filling up quickly. But why is this happening? Have I not set up my housekeeping correctly, such that it is not deleting previous cycles once they have been transferred to JASMIN (via Globus)? How can I resolve this?
Charlie
Charlie
Because of the autonomous, asynchronous nature of Globus data transfers, housekeeping cannot know when Globus has finished moving data (housekeeping is doing nothing in your suites). Removing staged data at the appropriate time is a user responsibility for suites running Globus transfers.
Grenville
Okay, thank you. But is there no way of turning housekeeping (or whatever the equivalent is now) on? Having to manually delete the output in /work/n02/n02/cjrw09/cylc-run/u-dw696/share/cycle/* approximately every 2 years of model run (basically twice a day, every day) is going to be tricky! Has nobody else had this problem? Presumably this is because we changed to Globus, as I am sure this was an option before. Is there no workaround?
Charlie
Charlie
Hang on - I've confused myself! A successful pptransfer should indicate that Globus has completed; an unsuccessful pptransfer does not necessarily indicate a failed Globus transfer.
More thought required
Grenville
Thank you. If it helps, I confirm that for all of my previous cycles, e.g. 1855, pptransfer succeeded and indeed the data have been transferred to JASMIN via Globus. Then the coupled task for 1856 failed, because it ran out of disk quota on /work. And checking /work reveals that all of the transferred data (e.g. 1855, 1854 and so on all the way to 1850) is still on /work.
Charlie
Please see the UM Post-processing App documentation, section 5: Automatically delete data from ARCHER2 after successful transfer to JASMIN.
(In the meantime, delete by hand any data that has already been transferred successfully.)
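By way of illustration, a hand cleanup along these lines is what I mean (hypothetical sketch only - the demo below builds a throwaway copy of the share/cycle layout in a temp directory so it is safe to run anywhere; adapt the path and the cycle list to your suite before doing it for real):

```shell
#!/bin/sh
# Hypothetical sketch: empty cycle directories whose data has already
# reached JASMIN. A temp tree stands in for the real
# /work/n02/n02/cjrw09/cylc-run/u-dw696/share/cycle, so this demo is safe.
SHARE_CYCLE=$(mktemp -d)

# Fake six cycles, each holding some "model output" (demo only).
for year in 1850 1851 1852 1853 1854 1855; do
    mkdir -p "$SHARE_CYCLE/${year}0101T0000Z"
    echo "model output" > "$SHARE_CYCLE/${year}0101T0000Z/atmos.pp"
done

# Empty the cycles confirmed transferred (1850-1854 here), keeping the
# directories themselves so cylc's bookkeeping is untouched.
for year in 1850 1851 1852 1853 1854; do
    rm -rf "${SHARE_CYCLE:?}/${year}0101T0000Z"/*
done

# Only the most recent cycle should still hold files.
find "$SHARE_CYCLE" -type f
```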
Grenville
Okay, thank you very much. So just to be clear (so I understand exactly what this is doing), the line
prune{share/cycle}=-P6M
means that, after each forthcoming cycle and if pptransfer has succeeded, it will delete the data on /work. With a timeout of 6 minutes. Presumably that is enough to delete the required amount of data? Is that right?
But for all of my data that has not been deleted up till now (i.e. 1850-1855), I should delete this manually? But going forward, I shouldn’t need to do this?
Charlie
And sorry, 1 more question is:
Once I have made that change in housekeeping, do I need to reload it using a suite run-reload?
Charlie
Hi Charlie,
prune{share/cycle}=-P6M
Means remove the share/cycle/<cycle-point> directories that are 6 months or more from the current cycle point.
FYI P6M is 6 months. PT6M is 6 minutes 
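In case it helps to have the pattern written out, here are some ISO 8601 durations of the kind the housekeeping config accepts (the prune line is the one you already quoted; the other durations are illustrative values only, not recommendations):

```
# P  = date-based period, PT = time-based period
#   P6M  -> 6 months      PT6M  -> 6 minutes
#   P2Y  -> 2 years       PT12H -> 12 hours
#   P1D  -> 1 day         PT30S -> 30 seconds
prune{share/cycle}=-P6M   # remove cycle dirs 6+ months behind the current cycle point
```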
Yes, run rose suite-run --reload after making the changes to housekeeping.
Cheers,
Ros.
Perfect, thank you very much!
Charlie
Hi again,
Very sorry about this, but the coupled stage has again failed at exactly the same point in my 2 suites (u-dw696 and u-dw704). It isn't a disk quota issue this time, because I have turned on housekeeping and it is indeed working. Looking at the job.err, there is again no obvious error - several warning messages, but no actual error.
Can you help?
Charlie
Charlie
What files are you looking at? - the job.err file says:
BUFFOUT: Write Failed: Disk quota exceeded
You have exceeded your 6TB /work quota.
Grenville
Very sorry Grenville, honestly I think I am going mad. I have just checked the job.err (either in ~/cylc-run/u-dw696/log/job/18620101T0000Z/coupled/NN/job.err, or by right-clicking on the failed task and selecting View job logs (Viewer) > job.err), and I can now see that line - but I can guarantee, absolutely 100%, that it wasn't there when I checked the same file yesterday. Unless I have completely lost all sense of reality.
But in terms of why it is filling up so quickly and subsequently failing, I don’t understand. On PUMA2:
[cjrw09@puma2 ~]$ pwd
/home/n02/n02/cjrw09
[cjrw09@puma2 ~]$ du -skh *
1.8G cylc-run
38M roses
On ARCHER2:
cjrw09@ln02:/home/n02/n02> du -skh cjrw09/
6.9G cjrw09/
cjrw09@ln02:/home/n02/n02> cd /work/n02/n02/cjrw09
cjrw09@ln02:/work/n02/n02/cjrw09> du -skh *
4.0K archive
8.0K cred.jasmin
5.8T cylc-run
82G gc31
20K gl-ancil_topographic_index.py
So clearly the problem is in cylc-run on /work. Looking at this, it is still not deleting output from previous cycles: in /work/n02/n02/cjrw09/cylc-run/u-dw696/share/cycle/, the ones I deleted manually are indeed empty (1850-1855), but the ones after that still contain data, despite my turning on housekeeping according to the instructions and reloading. It's almost like housekeeping is working on PUMA2 (e.g. at /home/n02/n02/cjrw09/cylc-run/u-dw696/share/cycle, where all years are empty) but not on ARCHER2.
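For what it's worth, this is how I'm checking which cycles still hold data (shown here as a self-contained demo on a temp tree, so the paths below are stand-ins for the real /work ones, and the fake cycles just mimic my situation):

```shell
#!/bin/sh
# Demo only: a temp tree stands in for
# /work/n02/n02/cjrw09/cylc-run/u-dw696/share/cycle.
CYCLE_DIR=$(mktemp -d)

# Mimic the state above: 1850-1855 emptied by hand, later cycles full.
for year in 1850 1851 1852 1853 1854 1855; do
    mkdir -p "$CYCLE_DIR/${year}0101T0000Z"
done
for year in 1856 1857; do
    mkdir -p "$CYCLE_DIR/${year}0101T0000Z"
    echo "output" > "$CYCLE_DIR/${year}0101T0000Z/atmos.pp"
done

# List only the cycle directories that still contain files.
find "$CYCLE_DIR" -type f -exec dirname {} \; | sort -u
```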
Please can you advise? In the meantime I will delete manually again and restart.
Charlie