I submitted a bunch of short atmosphere simulations last night. Two failed:
opt_dfols4/d4002 – blew up at timestep 3775 (model time 2010-10-23 10:20:00).
opt_dfols4/d4007 – ran out of time. Last timestep in the pe000 file was 370…
I reran both cases and both are now running normally – d4002 is at timestep 5636, and d4007 happily ran three months and is now running the next three months in the workflow.
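For checking how far a run got before it died, I pull the last timestep line out of the pe000 file. Here is a minimal sketch of that, assuming the timestep lines contain "Atm_Step: Timestep" as in standard UM pe output – the path and pattern are illustrative, not copied from my suite:

```python
import re
import sys

# Sketch: report the last completed timestep recorded in a UM pe output file.
# Assumes timestep lines look like "Atm_Step: Timestep     3775";
# adjust the pattern if your output differs.
TIMESTEP_RE = re.compile(r"Atm_Step:\s+Timestep\s+(\d+)")

def last_timestep(pe_file):
    last = None
    with open(pe_file, errors="replace") as fh:
        for line in fh:
            m = TIMESTEP_RE.search(line)
            if m:
                last = int(m.group(1))
    return last

if __name__ == "__main__":
    path = sys.argv[1]  # e.g. a .../pe_output/...pe000 file (illustrative)
    print(f"{path}: last timestep {last_timestep(path)}")
```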
I can understand that simulations can run slowly if ARCHER2's I/O system is heavily loaded. The blow-up is more worrying, and it is worth noting that my simulations do not output very much data.
Given that I want to run O(200) of these simulations, are there any changes I can make to make them more reliable? Or should I complain to the ARCHER2 helpdesk…
It’s a bit worrying that it failed in the solver the first time and then succeeded the second time; that seems to imply the model does not run the same way twice in a row. We have seen this behaviour on rare occasions but have never found an explanation (and have never really known how to start looking for one), so I have nothing to add to help with such problems. We recommend that model tasks do not automatically resubmit on failure – that can leave a suite in a mess if Slurm misbehaves (which it does).
It’s always worth reporting such behaviour to Archer.
And I’ve had a bunch of files which are zero-sized on JASMIN (and deleted on ARCHER2), but when I look at the Globus output the transfer seems fine. See Globus for one example.
Here are all the directories on JASMIN which have zero-length files, along with the number of files in each…
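In case it’s useful, a quick way to generate that sort of listing is a small Python walk over the transfer destination. This is a sketch only, and the root path below is a placeholder rather than my actual JASMIN directory:

```python
from collections import Counter
from pathlib import Path

# Sketch: count zero-length files per directory under a destination root.
# The root path below is a placeholder, not the real JASMIN path.
root = Path("/gws/nopw/j04/somegws/opt_dfols4")

zero_counts = Counter()
for f in root.rglob("*"):
    if f.is_file() and f.stat().st_size == 0:
        zero_counts[f.parent] += 1

for directory, n in sorted(zero_counts.items()):
    print(f"{directory}: {n} zero-length files")
```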
We can’t see people’s Globus logs. Can you send a screenshot of the error logs so I can see whether it was a transfer that stop-started, etc., and confirm that the checksums were verified?
Can you send me the Globus task ID for the latest failure above, please? You’ll see it listed in the Overview tab for the task, and it will look something like this: 3111ea66-7eab-11f0-a2d0-0affef98d17b. I want to send it to JASMIN to see if they can see anything in their Globus logs. It should have run a checksum.
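If it helps while you gather that, the same task details can be pulled programmatically with the Globus Python SDK. This is a rough sketch only: it assumes you already have a valid Globus Transfer access token (obtaining one is not shown), and the field names printed are the ones I’d expect in the task document.

```python
import globus_sdk

# Sketch: inspect a Globus transfer task by ID.
TASK_ID = "3111ea66-7eab-11f0-a2d0-0affef98d17b"  # example ID from above
TRANSFER_TOKEN = "..."  # placeholder: a valid Transfer access token

tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(TRANSFER_TOKEN)
)

task = tc.get_task(TASK_ID)
# Print a few fields of interest; .get() is used in case a field is absent.
for key in ("status", "files", "files_transferred", "faults", "verify_checksum"):
    print(key, "=", task.get(key))

# The event list often shows stop-start behaviour (faults, retries).
for event in tc.task_event_list(TASK_ID):
    print(event.get("time"), event.get("code"), event.get("description"))
```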
I had a bunch of failures in the early AM – small start-up jobs for two models, all of which ran out of time.
install_ancil and fcm_make2pp both failed because they ran out of time. More painful was opt_dfols4/d400l/atmos_main running out of time having managed to run only a few days. I’ve increased the requested time for the small jobs, rerun them, and resubmitted the failed atmos_main – it looks to be running quite slowly… I also have opt_dfols4/d400k/atmos_main running, and it seems to be running at the rate I expect.