Postproc timing out

Hello,

I am trying to run a suite (u-dk877) on Monsoon, but I have been running into an issue with postproc.

I currently have the suite set up with a cycling frequency of 1 month. The postproc task completed quickly after the first month; however, after the second month the process seems to get stuck on “command: main_pp.py atmos”.

I received error messages about two files; here is an example for one of them:
"[WARN] [SUBPROCESS]: Command: moo put -f -vv /home/d00/esands/cylc-run/u-dk877/share/data/History_Data/dk877a.p41850jan.pp moose:crum/u-dk877/ap4.pp
[SUBPROCESS]: Error = 11:
### put, command-id=1968016247, estimated-cost=0byte(s), files=1

task-id=0, estimated-cost=0byte(s), resource=/home/d00/esands/cylc-run/u-dk877/share/data/History_Data/dk877a.p41850jan.pp

/home/d00/esands/cylc-run/u-dk877/share/data/History_Data/dk877a.p41850jan.pp: (ERROR_CLIENT_ZERO_LENGTH_FILE) attempted to archive a zero-length file.

2025-01-17 02:29:00 UTC: polled server for ready tasks: #0

put: failed (11)".

After reaching the walltime limit (3 hours), the task resubmits and the same thing happens.

I have tried updating the suite configuration to match some previous work that runs successfully, but I’m not sure this has helped.

I am unsure what else to try.

Thank you for your help,
Emma

Hi Emma,

As the error indicates, the ap4 file is being generated but is empty. Your suite seems to have deactivated or removed all STASH output to that stream (usage=UP4), but the stream itself is still ‘active’.
In app/um/rose-app.conf, search for the nlstcall_pp entry that contains filename_base=$DATAM/$RUNIDa.p4% and deactivate it.
Also look for other zero-size files in ~/cylc-run/u-dk877/share/data/History_Data (in case you have deactivated any other usage profiles), and comment out the corresponding nlstcall_pp entries as well, in addition to removing the files.
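If it helps, a quick way to pick up any remaining zero-length files and to locate the corresponding output stream definitions is, for example (the suite working-copy path is illustrative; adjust it to wherever your copy of the suite lives, e.g. ~/roses/u-dk877):

find ~/cylc-run/u-dk877/share/data/History_Data -type f -size 0
grep -n "filename_base" ~/roses/u-dk877/app/um/rose-app.conf

Any file listed by the first command should have its stream switched off (and the file itself removed) before the task is retriggered.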

Note that there is normally no archiving step for January, as the postprocessing waits to create the DJF seasonal mean from the monthly files and then archives all three.

Thank you for your advice!
Deactivating the stream has resolved the error messages; however, I am still struggling with the task timing out.

I have tried setting up another suite with fewer changes in an attempt to isolate the issue (u-dm596). As with my previous issue, the suite runs until the postproc step, at which point it fails on the time limit.

I include a screenshot of job.err (/home/d00/esands/cylc-run/u-dm596/log/job/18500101T0000Z/postproc/03/job.err). For most attempts there was no specific error, but on the 3rd attempt there were ‘system is currently unavailable’ and ‘path already exists’ messages.

Do you have any further advice on how to solve the timeout issue? I was expecting 3 hours to be plenty of time for this step.

Hi Emma,

The ‘dataset already exists’ message can be ignored: the mkset command runs every time, and this will not stop further archiving.
Note that there has been a wider issue with MASS since the New Year due to heavy load and a backlog of ‘put’ commands actually going onto the tapes. GETs and SELECTs (and occasionally PUTs) are being disabled for a few hours almost every day.

I also understand that there is a specific issue, possibly related to the above, with moo commands on Monsoon not returning promptly or appearing to hang, hence the timeouts even though the transfer may have started successfully in the background. Try e-mailing the Monsoon team to see if there is any update on this issue. I will in any case raise this at the Monsoon Management meeting later this week.
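If it would help to check whether MASS is actually accepting commands at a given time, the MOOSE client provides a system-information command (this is my understanding of the standard client; worth confirming against the moo documentation on Monsoon):

moo si -v

which should report the current system status, including whether PUT/GET/SELECT command classes are currently enabled. That can at least distinguish a genuine MASS outage from the hanging-command problem above.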

Thank you! I will reach out to the Monsoon team.