Postproc timing out

Hello,

I am trying to run a suite (u-dk877) on Monsoon, but I have been running into an issue with postproc.

I currently have the suite set up with a cycling frequency of 1 month. The postproc task completed quickly after the first month; however, after the second month the process seems to get stuck on “command: main_pp.py atmos”.

I received error messages about two files; here is an example for one of them:
"[WARN] [SUBPROCESS]: Command: moo put -f -vv /home/d00/esands/cylc-run/u-dk877/share/data/History_Data/dk877a.p41850jan.pp moose:crum/u-dk877/ap4.pp
[SUBPROCESS]: Error = 11:
### put, command-id=1968016247, estimated-cost=0byte(s), files=1

task-id=0, estimated-cost=0byte(s), resource=/home/d00/esands/cylc-run/u-dk877/share/data/History_Data/dk877a.p41850jan.pp

/home/d00/esands/cylc-run/u-dk877/share/data/History_Data/dk877a.p41850jan.pp: (ERROR_CLIENT_ZERO_LENGTH_FILE) attempted to archive a zero-length file.

2025-01-17 02:29:00 UTC: polled server for ready tasks: #0

put: failed (11)".

After reaching the walltime limit (3 hours), the task resubmits and the same thing happens.

I have tried updating the suite configuration to match some previous work that runs successfully, but I’m not sure this has helped.

I am unsure what else to try.

Thank you for your help,
Emma

Hi Emma,

As the error indicates, the ap4 file is being generated but is empty. Your suite seems to have deactivated or removed all STASH output to that stream (usage=UP4), but the stream itself is still ‘active’.
In app/um/rose-app.conf, search for the nlstcall_pp entry that contains filename_base=$DATAM/$RUNIDa.p4% and deactivate it.
Also look for other zero-size files in ~/cylc-run/u-dk877/share/data/History_Data (in case you have deactivated any other usage profiles), and comment out the corresponding nlstcall_pp entries as well, in addition to removing the files.
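If it helps, a quick way to pick up any remaining zero-length files and to locate the corresponding output stream definitions is, for example (the suite working-copy path is illustrative; adjust it to wherever your copy of the suite lives, e.g. ~/roses/u-dk877):

find ~/cylc-run/u-dk877/share/data/History_Data -type f -size 0
grep -n "filename_base" ~/roses/u-dk877/app/um/rose-app.conf

Any file listed by the first command should have its stream switched off (and the file itself removed) before the task is retriggered.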

Note that there is normally no archiving step for January, as the postprocessing waits to create the DJF seasonal mean from the monthly files and then archives all three.

Thank you for your advice!
Deactivating the stream has resolved the error messages; however, I am still struggling with the task timing out.

I have tried setting up another suite with fewer changes in an attempt to isolate the issue (u-dm596). As with my previous issue, the suite runs until the postproc step, at which point it fails on the time limit.

I include a screenshot of job.err (/home/d00/esands/cylc-run/u-dm596/log/job/18500101T0000Z/postproc/03/job.err). For most attempts there was no specific error, but on the 3rd attempt there were ‘system is currently unavailable’ and ‘path already exists’ messages.

Do you have any further advice on how to solve the timeout issue? I was expecting 3 hours to be plenty of time for this step.

Hi Emma,

The ‘dataset already exists’ message can be ignored: the mkset command runs every time, and this will not stop further archiving.
Note that there has been a wider issue with MASS since the New Year due to heavy load and a backlog of ‘put’ commands actually going onto the tapes. GETs and SELECTs (and occasionally PUTs) are being disabled for a few hours almost every day.

I also understand that there is a specific issue, possibly related to the above, with moo commands on Monsoon not returning promptly or appearing to hang, hence the timeouts even though the transfer may have started successfully in the background. Try e-mailing the Monsoon team to see if there is any update on this issue. I will in any case raise this at the Monsoon Management meeting later this week.
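If it would help to check whether MASS is actually accepting commands at a given time, the MOOSE client provides a system-information command (this is my understanding of the standard client; worth confirming against the moo documentation on Monsoon):

moo si -v

which should report the current system status, including whether PUT/GET/SELECT command classes are currently enabled. That can at least distinguish a genuine MASS outage from the hanging-command problem above.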

Thank you! I will reach out to the Monsoon team.