Postproc failure

Hi Ros,

I had another question I was wondering if you could help with.

A model run repeatedly crashed in postproc with an error “grid-T Seasonal mean for Autumn 1978 not possible as only got 2 file(s)”. Would the best option here be to manually trigger a rerun of this month’s ‘coupled’ task?

The model continued to run the subsequent ‘coupled’ tasks successfully, so I am hoping that if I manually trigger one ‘coupled’ task, it won’t automatically re-run the following ones?

Many thanks,
Tarkan

This email and any attachments are intended solely for the use of the named recipients. If you are not the intended recipient you must not use, disclose, copy or distribute this email or any of its attachments and should notify the sender immediately and delete this email from your system. UK Research and Innovation (UKRI) has taken every reasonable precaution to minimise risk of this email or any attachments containing viruses or malware but the recipient should carry out its own virus and malware checks before opening the attachments. UKRI does not accept any liability for any losses or damages which the recipient may sustain due to presence of any viruses.

Hi Tarkan,

I’d need to take a look at the job. What’s the suite id?

You can’t just trigger coupled tasks out of order. The model knows where it has got to and needs to have the appropriate atmos and nemo/ice dumps available.

Cheers,
Ros.

Hi Ros,

Thanks for taking a look, the suite ID is u-cz107.

The model ran until 1981/02 but I held the run because I wanted to do some postproc, and move files to JASMIN, was that a mistake?

The postproc then failed in 1978/11.

Cheers,
Tarkan

Hi Tarkan,

It looks like you’ve run the first cycle (19780901) postproc out of order. The 03 try ran at 11:40 but the files in the coupled model work directory (/work/n02/n02/tarlge/cylc-run/u-cz107/work/19780901T0000Z/coupled) weren’t created until afterwards at 11:44 when the 02 try of coupled finished.

Please leave the suite to run the tasks when it determines they are ready to run. Also, running postproc so far behind the model has the potential to cause problems.

I suggest you retrigger postproc.19780901T0000Z. When that has succeeded, the 19781101 postproc can be retriggered and should run ok. Once you’ve got that far, if the next postproc (19781201) is still held (pink), right click it and select release. Let the rest run in their own time and wait until they have all caught up with the model before releasing and letting the suite continue.

Cheers,
Ros.

If that doesn’t work, you will need to start the suite again.

Regards,
Ros.

Hi Ros,

Apologies - it must have been my mistake to run them out of order.

Regarding running postproc, do these tasks run automatically after each coupled task? I ran it over the weekend, and the coupled tasks just seemed to run ahead without the postproc running, I hadn’t meant for them to be so far apart. Perhaps this is because there was a crash in the postproc.

I am not sure how to trigger the postproc.197809, as it is not appearing in the Cylc GUI, can it be manually triggered from somewhere?

Appreciate your help,
Tarkan

Hi Tarkan,

Yes, postproc does run automatically after each coupled task. Then the housekeeping will run automatically after that. Subsequent postprocs won’t run until the previous one has completed successfully.

Sorry, the model running so far ahead is down to a typo on my part. 🙁 In ~/roses/u-cz107/suite.rc please change max active cycle points = 33 to max active cycle points = 3. Then do a rose suite-run --reload to pick up the change. That will only let coupled run ahead 2 cycles if something in a cycle fails (e.g. postproc).
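For anyone wanting to check or change that setting from the command line rather than in an editor, here is a minimal sketch. The function names get_max_active and set_max_active are hypothetical helpers, not part of Rose or Cylc, and the sketch assumes the setting sits on a single line of the form "max active cycle points = N".

```shell
# Print the current value of "max active cycle points" from a suite.rc file.
get_max_active() {
    sed -n 's/^[[:space:]]*max active cycle points[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*$/\1/p' "$1"
}

# Rewrite the setting to a new value, keeping a backup copy next to the file.
set_max_active() {
    rc_file=$1
    new_value=$2
    cp "$rc_file" "$rc_file.bak" &&
    sed "s/^\([[:space:]]*max active cycle points[[:space:]]*=[[:space:]]*\)[0-9][0-9]*/\1$new_value/" \
        "$rc_file.bak" > "$rc_file"
}
```

For example, set_max_active ~/roses/u-cz107/suite.rc 3 followed by get_max_active ~/roses/u-cz107/suite.rc should confirm the new value before the reload is done.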

I’ll fix the standard suite too.

With regard to rerunning postproc.19780901, try running on puma:

cylc insert u-cz107 postproc.19780901T0000Z

Hopefully that will re-insert the cycle/task in the cylc GUI; you can then trigger it to run, if it doesn’t automatically start running.

Regards,
Ros.

Hi Ros,

Thanks for explaining. And thanks for the fix on the coupled running ahead, I’ve made that change in my suite now.

I have inserted the postproc.19780901T0000Z task, but it seems to be stuck in ‘waiting’ for the prerequisite ‘coupled’ task, which was completed successfully but is still appearing as an unmet prerequisite. Perhaps I need to insert this too somehow?

Cheers,
Tarkan

Hi Tarkan,

You can manually trigger the postproc.19780901 task. Right click on it and select ‘trigger (Run Now)’.
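The insert-then-trigger sequence can also be sketched on the command line. This assumes that cylc trigger is the command-line counterpart of the GUI's trigger option in this version of Cylc; the run and reinsert_and_trigger helpers are hypothetical and only print the commands unless DRY_RUN=0 is set.

```shell
# Echo commands instead of running them unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Re-insert a task into the running suite, then trigger it to run now.
# Assumption: "cylc trigger SUITE TASK.CYCLE" matches the GUI trigger action.
reinsert_and_trigger() {
    suite=$1
    task=$2
    run cylc insert "$suite" "$task" &&
    run cylc trigger "$suite" "$task"
}
```

Calling reinsert_and_trigger u-cz107 postproc.19780901T0000Z prints the two commands for review; setting DRY_RUN=0 first would run them for real.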

Cheers,
Ros.

Hi Ros,

This worked! The postproc caught up with the model now and is ready to continue, thanks!

In terms of leaving the model running and having it automatically transfer output to JASMIN, I tried to set up pptransfer (https://cms.ncas.ac.uk/unified-model/pptransfer/), but it doesn’t seem to be copying over. My “JASMIN short-lived credential” is valid, and the files are staging correctly to my ARCHER2 archive directory, but I can’t seem to get them to copy over automatically. Do you have any recommendations for this?

Many thanks,
Tarkan

Hi Tarkan,

Glad to hear that worked.

Regarding transfer to JASMIN: in suite conf -> Build and Run, switch on PP Transfer. Then do rose suite-run --reload to load the change. The pptransfer task will then appear and start running from the current cycle.

For all previous cycles you will need to manually run rsync or gridftp to copy the directories under /work/n02/n02/tarlge/archive/u-cz107 to JASMIN.
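As a rough sketch of the manual copy with rsync, the helper below only prints a command for review. The function name archive_rsync_cmd is hypothetical, and the destination (user, transfer host and path on JASMIN) is a placeholder you would need to fill in yourself.

```shell
# Print (not run) an rsync command that copies the contents of one suite's
# archive directory to a remote destination.
# DEST is a placeholder of the form user@host:/path and is not a real location.
archive_rsync_cmd() {
    src=$1
    dest=$2
    # -a preserves timestamps and permissions, -v lists files as they go,
    # --partial keeps partly transferred files so an interrupted copy can resume.
    echo "rsync -av --partial ${src%/}/ $dest"
}
```

The trailing slash on the source means the cycle directories under it are copied rather than the u-cz107 directory itself. Adding rsync's --dry-run option to the printed command is a cheap way to see what would be transferred first.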

Regards,
Ros.

Hi Ros,

Two more questions, sorry!

  1. The output that is accumulating in my ARCHER2 archive folder: /work/n02/n02/tarlge/archive/u-cz107 seems to vary a lot in size depending on the month, with some directories being 6.7G and some being 50G, is this normal?

  2. Now that I have turned pptransfer on I can see a ‘housekeeping’ task for each month, but even though the prerequisites for these are met, they just seem to stay in the ‘waiting’ state. Is this supposed to happen automatically, and to include the pptransfer?

Thanks,
Tarkan

Hi Tarkan,

  1. It depends on what model output is being kept and how frequently: seasonal means will only be archived at the end of a year, and archiving of the atmos, ocean and ice restart dumps is set to yearly. You’ll also see more variation in sizes with monthly cycling. The December cycle (50G) for instance is much bigger than the others as it contains the atmosphere dump and all the seasonal means whereas the October cycle (6.5G) only contains the September files.

  2. In u-cz107, I can see that the housekeeping tasks have run up to and including 19810801T0000Z and succeeded. Please send me a screenshot of the cylc GUI with all the groupings expanded (Down arrow next to View 1 box: Deselect “Group” and also select “Expand”). pptransfer is a completely separate task; it should be visible, labelled pptransfer, and run before housekeeping. What cycle did you insert the pptransfer task into (i.e. from what cycle did you start running pptransfer)? I can’t see it at all.

Regards,
Ros.

Hi Ros,

Thanks for your answer to my first question, that makes sense to me now.

I believe that I tried to insert the pptransfer task in at 198102. I changed it in the suite conf, and ran a suite-run --reload, but I also never saw the pptransfer task after that, only housekeeping. I could try to continue the run and see if it appears. I was away for a few days and the model has disappeared from my cylc GUI, is the best way to continue it to run a rosie suite-run --restart? Or to insert the next task manually and go from there?

Thanks,
Tarkan

Hi Tarkan,

From looking at the suite log files on ARCHER2 it doesn’t look like pptransfer has run for any cycle; there’s no log file for it under 19810201T0000Z. When you did the cylc insert command did you see the task pop up in the cylc GUI?

You restart the suite with rose suite-run --restart

Can you then please send me a screenshot of the cylc GUI as per my previous response so I can see what state the suite is in and advise more precisely.

Regards,
Ros.

Hi Ros,

I don’t remember ever seeing a pptransfer task, perhaps I didn’t insert it correctly.
I tried to take the screenshot like you said, below. This is what I meant about the housekeeping, it seems to stay on ‘waiting’ until I manually trigger it. Once I’ve done this then the model continues running for a month, and then the next housekeeping task stays on ‘waiting’ again and the model stops running.

Cheers,
Tarkan

Hi Tarkan,

If you didn’t see the pptransfer task pop up in the cylc GUI then the insert wasn’t successful and the task would never have run. However, the housekeeping task will now be waiting as it is expecting to run after pptransfer.

Please run:

cylc insert --no-check u-cz107 pptransfer.19810801T0000Z

The pptransfer task should then appear in the GUI. If it doesn’t automatically run, manually trigger it. Once it has run successfully, I hope the next one will appear in the 19810901T0000Z cycle. If it doesn’t, repeat the above command for the next cycle, and probably the 2 after that; it should then be spawned automatically when the Dec 1981 cycle appears.
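If the command does need repeating for several cycles, the cycle points can be worked out with a small helper rather than typed by hand. This is only a sketch: next_monthly_cycle and pptransfer_insert_cmds are hypothetical names, it assumes monthly cycling with cycle points of the form YYYYMM01T0000Z, and it prints the commands rather than running them.

```shell
# Given a monthly cycle point such as 19810801T0000Z, print the next month's.
next_monthly_cycle() {
    point=$1
    year=$(printf '%s' "$point" | cut -c1-4)
    month=$(printf '%s' "$point" | cut -c5-6)
    rest=$(printf '%s' "$point" | cut -c7-)
    month=${month#0}               # strip a leading zero so 08 and 09 are not read as octal
    month=$((month + 1))
    if [ "$month" -gt 12 ]; then   # roll over from December to January
        month=1
        year=$((year + 1))
    fi
    printf '%04d%02d%s\n' "$year" "$month" "$rest"
}

# Print the insert command for COUNT consecutive monthly cycles from START.
pptransfer_insert_cmds() {
    suite=$1
    point=$2
    count=$3
    while [ "$count" -gt 0 ]; do
        echo "cylc insert --no-check $suite pptransfer.$point"
        point=$(next_monthly_cycle "$point")
        count=$((count - 1))
    done
}
```

For instance, pptransfer_insert_cmds u-cz107 19810801T0000Z 3 prints the commands for the August, September and October 1981 cycles. It is probably safest to run them one at a time, and only for a cycle where the task has not appeared by itself.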

Regards,
Ros.

Hi Ros,

Thanks for this, this seemed to work!

I was also wondering, is there a way for the ARCHER2 archive data to automatically be removed as it moves to JASMIN? So that I don’t need to manually clear it out (to prevent my disk filling up when I leave things running).

Also, do you know who I can contact to extend my PUMA disk allocation a little bit? I’m planning on running quite a few suites and it seems that it is only big enough for 6 or so suites.

Cheers,
Tarkan

A post was split to a new topic: Pptransfer - no archive directory to transfer

Hi Tarkan,

The suite u-cz107 is already set up to delete the staged data from ARCHER2 following successful transfer to JASMIN. See panel postproc → Post processing → ARCHER2-JASMIN → Data Transfer where you will find an option called delete_staged.

Regarding disk space: I will ask Andy if he can increase your PUMA space a bit. However, you will need to regularly tidy up the cylc-run directory on PUMA to delete old files. Any cylc-run directories for suites that you have finished running should be deleted. You can also delete old tar’ed up log files under e.g. /home/tarlge/cylc-run/u-cz107/log for suites still running. If that still isn’t enough then we advise configuring the suite to not pull back the log files from ARCHER2. If you get to that point let me know and I’ll tell you how to do that.
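One way to find candidates for deletion is sketched below. The helper name list_old_log_archives is hypothetical, and the file name patterns (*.tar and *.tar.gz) are an assumption about how the archived logs are named, so check the listing before removing anything.

```shell
# List tar archives under a suite's log directory last modified more than DAYS days ago.
# Assumption: the tar'ed up log files end in .tar or .tar.gz.
list_old_log_archives() {
    log_dir=$1
    days=$2
    find "$log_dir" -type f \( -name '*.tar' -o -name '*.tar.gz' \) -mtime +"$days"
}
```

For example, list_old_log_archives /home/tarlge/cylc-run/u-cz107/log 30 lists archives more than 30 days old, and du -sh ~/cylc-run/* gives a quick view of which suites are taking up the most space.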

Regards,
Ros.