All suites stopped after "Disk quota exceeded"

Hello.

I’m quite frustrated now. I have been running around 20 suites since Monday, and now they are all gone. They have all stopped, and I have no idea why. The only clue I have is a “Disk quota exceeded” message on Pumatest. So I have some questions.

Can you increase my quota in Pumatest?

Is there a way to retrieve the suites in their current status? Or do I have to start from scratch again? This is the last list I got from cylc scan.

-bash-4.1$ cylc scan
u-bo026-n512-ens6 luciana@pumatest.nerc.ac.uk:43082
u-bo026-n512-ens0-r2 luciana@pumatest.nerc.ac.uk:43030
u-bo026-n512-ens0-orig luciana@pumatest.nerc.ac.uk:43085
u-bo026-n512-ens3-r2 luciana@pumatest.nerc.ac.uk:43066
u-bo026-n216-ens12-orig luciana@pumatest.nerc.ac.uk:43070
u-bo026-n512-ens3-r3 luciana@pumatest.nerc.ac.uk:43091
u-bo026-n512-ens0 luciana@pumatest.nerc.ac.uk:43098
u-bo026-n512-ens3-orig luciana@pumatest.nerc.ac.uk:43069
u-bo026-n512-t7-3 luciana@pumatest.nerc.ac.uk:43014
u-bo026-n512-ens6-r1 luciana@pumatest.nerc.ac.uk:43055
u-bo026-n512-ens3 luciana@pumatest.nerc.ac.uk:43017
u-bo026-n512-ens0-r1 luciana@pumatest.nerc.ac.uk:43054
u-bo026-n512-t8-t3 luciana@pumatest.nerc.ac.uk:43096
u-bo026-n512-ens6-r2 luciana@pumatest.nerc.ac.uk:43012
u-bo026-n512-ens6-orig luciana@pumatest.nerc.ac.uk:43034
u-bo026-n512-ens0-r3 luciana@pumatest.nerc.ac.uk:43073
u-bo026-n512-ens6-r3 luciana@pumatest.nerc.ac.uk:43021
u-bo026-n512-ens3-r1 luciana@pumatest.nerc.ac.uk:43033

To solve the quota problem, I removed the cylc-run directory on Pumatest. Thinking about it now, I believe that reduced the chances of getting the suites running again. I didn’t change the cylc-run directory on Archer2, so there is still a small chance.

In Archer2, I still have jobs pending:

lrpedro@uan01:~> squeue -u lrpedro
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
619669 standard u-bo026- lrpedro PD 0:00 1 (Priority)
619670 standard u-bo026- lrpedro PD 0:00 1 (Priority)
619816 standard u-bo026- lrpedro PD 0:00 1 (Priority)
619924 standard u-bo026- lrpedro PD 0:00 10 (Priority)
619937 standard u-bo026- lrpedro PD 0:00 1 (Priority)
619940 standard u-bo026- lrpedro PD 0:00 1 (Priority)
619941 standard u-bo026- lrpedro PD 0:00 1 (Priority)
619942 standard u-bo026- lrpedro PD 0:00 1 (Priority)
619945 standard u-bo026- lrpedro PD 0:00 6 (Priority)
620079 standard u-bo026- lrpedro PD 0:00 10 (Priority)
620096 standard u-bo026- lrpedro PD 0:00 10 (Priority)
620111 standard u-bo026- lrpedro PD 0:00 10 (Priority)
620120 standard u-bo026- lrpedro PD 0:00 10 (Priority)
620122 standard u-bo026- lrpedro PD 0:00 10 (Priority)
620123 standard u-bo026- lrpedro PD 0:00 10 (Priority)
620192 standard u-bo026- lrpedro PD 0:00 10 (Priority)
620194 standard u-bo026- lrpedro PD 0:00 10 (Priority)
620195 standard u-bo026- lrpedro PD 0:00 10 (Priority)
621226 standard u-bo026- lrpedro PD 0:00 1 (Priority)
621229 standard u-bo026- lrpedro PD 0:00 1 (Priority)
621231 standard u-bo026- lrpedro PD 0:00 1 (Priority)

What can I do, and what have I done wrong to put myself in this situation?

Kind regards, Luciana.

Hi Luciana,
I have increased your pumatest quota from 8GB to 20GB. Someone else from the CMS team will be able to comment on restarting your jobs.
Cheers
Andy

Hi Luciana,

If you run out of disk space on pumatest, the suites can’t continue to run, so they will all stop. You then need to sort out the disk space issue and restart them all.

If you delete the cylc-run directories, you can’t just restart the suites with a simple rose suite-restart, because you have deleted all the suite status information, etc., that cylc needs.

From ARCHER2 it looks like they were all on the first cycle. Is that correct? If so, you’ll have to start all the runs again from scratch.

Resources are finite, so you need to stay aware of how much disk space you are using on both pumatest and ARCHER2, and manually tidy up as required to stop the runs falling over. For example, on pumatest, delete the cylc-run/suite directories of suites that have finished running, and remove old log directories. If it’s the log files that are filling up the space, which I suspect it was, we recommend configuring the suites not to pull over all the log files from ARCHER2.

E.g., on a per-suite basis, in the archer2.rc file:

[[HPC]]
    ...
    [[[remote]]]
        host = login-4c.archer2.ac.uk
        retrieve job logs = False

Or you can configure it globally for all suites.
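
To help with the manual tidy-up described above, here is a minimal sketch of a script that lists how much space each suite directory is using, largest first. It assumes Python 3.6 or later is available and that the suites live under ~/cylc-run. The function name suite_sizes is made up for this illustration. It adds up file sizes, which may differ slightly from what the quota system counts, so treat the numbers as a guide to which suites to clean up first.

```python
import os
from pathlib import Path


def suite_sizes(root):
    """Return (directory_name, total_bytes) pairs for each directory
    directly under root, sorted with the largest first.

    suite_sizes is a hypothetical helper name, not part of cylc or rose.
    """
    sizes = []
    for entry in Path(root).iterdir():
        if not entry.is_dir():
            continue
        total = 0
        for dirpath, _dirnames, filenames in os.walk(entry):
            for name in filenames:
                path = os.path.join(dirpath, name)
                # Skip symlinks so linked files are not counted here.
                if not os.path.islink(path):
                    total += os.path.getsize(path)
        sizes.append((entry.name, total))
    return sorted(sizes, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Assumption: suites are stored under ~/cylc-run.
    root = Path.home() / "cylc-run"
    if root.is_dir():
        for name, size in suite_sizes(root):
            print(f"{size / 1e6:10.1f} MB  {name}")
```

Running it before and after deleting finished suites or old log directories shows whether the clean-up has freed enough space.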

Regards,
Ros.

Dear Ros.

Thank you for the clarifications.

All the suites I run have only one cycle, so I have restarted them from scratch. Hopefully, with the new quota, that won’t happen again. As far as I know, there is nothing else on Pumatest other than the suites I’m running. I’ll check the log files too; I’ve never used them.

Kind regards, Luciana.