Model failures

I submitted a bunch of short atmosphere simulations last night. Two failed:

  • opt_dfols4/d4002 – blew up at timestep 3775 (model time: 2010-10-23 10:20:00).
  • opt_dfols4/d4007 – ran out of time. The last timestep in the pe000 file was 370…

I reran both cases and both are now running normally – d4002 is at timestep 5636, and d4007 happily ran 3 months and is now running the next three months in the workflow.

I can understand that if archer2’s I/O system is heavily loaded then simulations can run slowly. The blow-up is more worrying. It is worth noting that my simulations do not output very much data.

Given that I want to run O(200) of these simulations, are there any changes I can make that will make them more reliable? Or should I complain to the archer2 helpdesk…

Simon

Simon,

It’s a bit worrying that it failed in the solver the first time, then succeeded the second time. That seems to imply that the model does not run the same way twice in a row. We have seen this behaviour on rare occasions but have never found an explanation (we’ve never really known how to start looking for one), so I have nothing to add to help with such problems. We recommend that model tasks do not automatically resubmit on failure – that can leave a suite in a mess if Slurm misbehaves (which it does).

It’s always worth reporting such behaviour to Archer.

Grenville

Thanks @grenville

And I’ve had a bunch of files which are zero-sized on JASMIN (and deleted on archer2). But when I look at the Globus output the transfer seems fine. See Globus for one example.

Here are all the directories on JASMIN which have zero-length files, along with the number of such files in each…

find /gws/nopw/j04/terrafirma/tetts/optclim/opt_cases/opt_dfols4/ -type f -size 0 -exec dirname {} \; | uniq -c
3 /gws/nopw/j04/terrafirma/tetts/optclim/opt_cases/opt_dfols4/d4000/20101201T0000Z
3 /gws/nopw/j04/terrafirma/tetts/optclim/opt_cases/opt_dfols4/d4000/20110301T0000Z
1 /gws/nopw/j04/terrafirma/tetts/optclim/opt_cases/opt_dfols4/d4000/20110601T0000Z
1 /gws/nopw/j04/terrafirma/tetts/optclim/opt_cases/opt_dfols4/d4000/20111201T0000Z
2 /gws/nopw/j04/terrafirma/tetts/optclim/opt_cases/opt_dfols4/d4001/20101201T0000Z
2 /gws/nopw/j04/terrafirma/tetts/optclim/opt_cases/opt_dfols4/d4001/20110301T0000Z
2 /gws/nopw/j04/terrafirma/tetts/optclim/opt_cases/opt_dfols4/d4001/20110601T0000Z
1 /gws/nopw/j04/terrafirma/tetts/optclim/opt_cases/opt_dfols4/d4004/20100901T0000Z
2 /gws/nopw/j04/terrafirma/tetts/optclim/opt_cases/opt_dfols4/d4004/20101201T0000Z
1 /gws/nopw/j04/terrafirma/tetts/optclim/opt_cases/opt_dfols4/d4004/20110301T0000Z
3 /gws/nopw/j04/terrafirma/tetts/optclim/opt_cases/opt_dfols4/d4006/20101201T0000Z
1 /gws/nopw/j04/terrafirma/tetts/optclim/opt_cases/opt_dfols4/d4006/20110301T0000Z
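
If it’s useful, a slight variant of the same find lists the individual files (with their modification times) rather than just the per-directory counts – a sketch using GNU find’s -printf, with GWS just a shell variable for the path above:

# List each zero-length file with its modification time, so the affected
# cycles can be identified for re-transfer.
GWS=/gws/nopw/j04/terrafirma/tetts/optclim/opt_cases/opt_dfols4
find "$GWS" -type f -size 0 -printf '%TY-%Tm-%Td %TH:%TM  %p\n' | sort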

I’ll complain to archer2 about the model failures but I wonder if there was some general I/O problem today.

Simon

Hi Simon,

We can’t see people’s Globus logs. Can you send a screenshot of the error logs so I can see if it was a transfer that stop-started, etc., and confirm that the checksums verified?

Ta.
Cheers,
Ros.

Hi Ros,

Thanks – here you are.

[Screenshot: Globus job summary]

Here is the debug data:

{
  "DATA_TYPE": "task",
  "bytes_checksummed": 0,
  "bytes_transferred": 179756352,
  "command": "API 0.10",
  "completion_time": "2025-10-29T04:30:39.000Z",
  "deadline": "2025-11-01T04:11:01.000Z",
  "delete_destination_extra": false,
  "destination_endpoint_display_name": "JASMIN Default Collection",
  "destination_endpoint_id": "a2f53b7f-1b4e-4dce-9b7c-349ae760fee0",
  "directories": 1,
  "duration_at_last_fetch": 2198000,
  "effective_bytes_per_second": 81772,
  "encrypt_data": false,
  "fail_on_quota_errors": true,
  "faults": 6,
  "files": 3,
  "files_skipped": 0,
  "files_transferred": 3,
  "history_deleted": false,
  "is_delete": false,
  "is_paused": false,
  "is_transfer": true,
  "label": "opt_dfols4/d4000/20101201T0000Z",
  "owner_id": "5c4dcee1-2ece-4a42-9e6c-9fb2c08dc9c4",
  "preserve_timestamp": false,
  "request_time": "2025-10-29T03:54:01.000Z",
  "skip_source_errors": false,
  "source_endpoint_display_name": "Archer2 file systems",
  "source_endpoint_id": "3e90d018-0d05-461a-bbaf-aab605283d21",
  "status": "SUCCEEDED",
  "sync_level": 3,
  "task_id": "e63437c0-b47a-11f0-af29-0affdd0cd947",
  "type": "TRANSFER",
  "username": "u_lrg45yjozzfefhtmt6zmbdojyq",
  "verify_checksum": true
}

Is there anything listed in the event log tab?
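
If it’s easier from the command line, the same task record and the per-file events (the CLI view of the Event Log tab) can also be pulled with the Globus CLI – a sketch, using the task_id from the debug data above:

# Assumes the globus CLI is installed and logged in; the task id is copied
# from the debug data above.
TASK_ID=e63437c0-b47a-11f0-af29-0affdd0cd947

# Overall task record, including the "faults" count
globus task show "$TASK_ID"

# Per-event log for the task – any stop/starts or checksum problems show up here
globus task event-list "$TASK_ID"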

And to confirm, I’ve had another one. Globus thinks it has transferred 60 Mbytes to JASMIN; JASMIN thinks the file size is 0…

{
  "DATA_TYPE": "successful_transfer",
  "checksum": "1570d31f5278da9cb6b8bad45fd01c89be55046b",
  "checksum_algorithm": "SHA1",
  "destination_path": "/gws/nopw/j04/terrafirma/tetts/optclim/opt_cases/opt_dfols4/d4008/20110301T0000Z/d4008a.pz2011apr.pp",
  "dynamic": false,
  "size": 59918784,
  "source_path": "/mnt/lustre/a2fs-work2/work/n02/shared/tetts/opt_dfols4/d4008/output/opt_dfols4/d4008/20110301T0000Z/d4008a.pz2011apr.pp"
}

[tetts@sci-vm-02 ~]$ ls -ltr /gws/nopw/j04/terrafirma/tetts/optclim/opt_cases/opt_dfols4/d4008/20110301T0000Z/d4008a.pz2011apr.pp
-rw-r--r-- 1 tetts gws_terrafirma 0 Oct 30 11:53 /gws/nopw/j04/terrafirma/tetts/optclim/opt_cases/opt_dfols4/d4008/20110301T0000Z/d4008a.pz2011apr.pp
[tetts@sci-vm-02 ~]$ 
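
For what it’s worth, the mismatch is easy to confirm directly against the transfer record – a quick sketch, with the expected size and SHA1 copied from the successful_transfer record above:

# Compare what Globus says it transferred with what is actually on disk.
f=/gws/nopw/j04/terrafirma/tetts/optclim/opt_cases/opt_dfols4/d4008/20110301T0000Z/d4008a.pz2011apr.pp
expected_size=59918784
expected_sha1=1570d31f5278da9cb6b8bad45fd01c89be55046b

actual_size=$(stat -c %s "$f")
actual_sha1=$(sha1sum "$f" | awk '{print $1}')

echo "size: expected $expected_size, got $actual_size"
echo "sha1: expected $expected_sha1, got $actual_sha1"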

Hi Simon,

Can you send me the Globus task id for the latest failure above, please? You’ll see it listed in the Overview tab for the task; it will look something like this: 3111ea66-7eab-11f0-a2d0-0affef98d17b. I want to send it to JASMIN to see if they can see anything in their Globus logs. It should have run a checksum.
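
If it’s easier than digging through the web UI, recent tasks and their ids are also visible from the Globus CLI – a sketch, grepping on the directory name in the label:

# Recent Globus transfer tasks for your account; the one labelled with d4008 is the one I'm after
globus task list | grep d4008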

Cheers,
Ros

Here you are:

f5d490ae-b586-11f0-9781-027493648695

Simon

I had a bunch of failures in the early hours – small startup jobs for two models, which all ran out of time.

install_ancil and fcm_make2pp both failed by running out of time. More painful was opt_dfols4/d400l/atmos_main running out of time having managed to run only a few days. I’ve increased the time limit for the small jobs, reran them, and resubmitted the failed atmos_main – it looks to be running quite slowly… I also have opt_dfols4/d400k/atmos_main running and it seems to be running at the rate I expect.

Simon

And opt_dfols4/d400l/atmos_main failed again – running to Timestep 354 in 1 hour 55 mins before being culled…

Simon

atmos_main did eventually work.

On the next run I got some Globus errors:

Shell debugging restarted
[WARN] file:atmospp.nl: skip missing optional source: namelist:moose_arch
[WARN] file:atmospp.nl: skip missing optional source: namelist:script_arch
/bin/sh: BASH_XTRACEFD: 19: invalid value for trace file descriptor
[WARN]  [SUBPROCESS]: Command: globus transfer --format unix --jmespath task_id --recursive --fail-on-quota-errors --sync-level checksum --label opt_dfols4/d400m/20100901T0000Z --verify-checksum --notify off 3e90d018-0d05-461a-bbaf-aab605283d21:/mnt/lustre/a2fs-work2/work/n02/shared/tetts/opt_dfols4/d400m/output/opt_dfols4/d400m/20100901T0000Z a2f53b7f-1b4e-4dce-9b7c-349ae760fee0:/gws/nopw/j04/terrafirma/tetts/optclim/opt_cases/opt_dfols4/d400m/20100901T0000Z
[SUBPROCESS]: Error = 1:
	Globus CLI Error: A Transfer API Error Occurred.
HTTP status:      502
request_id:       1MdHZbJtm
code:             ExternalError
message:
                  Error validating login to endpoint 'JASMIN Default Collection (a2f53b7f-1b4e-4dce-9b7c-349ae760fee0)', Error (connect)
                  Endpoint: JASMIN Default Collection (a2f53b7f-1b4e-4dce-9b7c-349ae760fee0)
                  Server: 130.246.1.15:443
                  Message: The operation timed out
                  

[WARN]  Transfer command failed: globus transfer --format unix --jmespath 'task_id' --recursive --fail-on-quota-errors --sync-level checksum --label opt_dfols4/d400m/20100901T0000Z --verify-checksum --notify off 3e90d018-0d05-461a-bbaf-aab605283d21:/mnt/lustre/a2fs-work2/work/n02/shared/tetts/opt_dfols4/d400m/output/opt_dfols4/d400m/20100901T0000Z a2f53b7f-1b4e-4dce-9b7c-349ae760fee0:/gws/nopw/j04/terrafirma/tetts/optclim/opt_cases/opt_dfols4/d400m/20100901T0000Z
[ERROR]  transfer.py: Globus Error: Network or server error occurred (Globus ReturnCode=1)
[FAIL]  Command Terminated
[FAIL] Terminating PostProc...
[FAIL] transfer.py # return-code=1
2025-11-04T11:44:01Z CRITICAL - failed/ERR

I ran the globus task interactively and it worked. Data transferred fine to JASMIN.
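
In case it’s useful to anyone else, one crude workaround for transient endpoint errors like this would be to wrap the same transfer command in a retry loop – just a sketch, not what the suite’s transfer.py actually does; the endpoints, paths and options are copied from the failed command above:

# Retry submitting the same globus transfer a few times before giving up.
SRC=3e90d018-0d05-461a-bbaf-aab605283d21:/mnt/lustre/a2fs-work2/work/n02/shared/tetts/opt_dfols4/d400m/output/opt_dfols4/d400m/20100901T0000Z
DST=a2f53b7f-1b4e-4dce-9b7c-349ae760fee0:/gws/nopw/j04/terrafirma/tetts/optclim/opt_cases/opt_dfols4/d400m/20100901T0000Z

for attempt in 1 2 3; do
    if globus transfer --format unix --jmespath 'task_id' --recursive \
           --fail-on-quota-errors --sync-level checksum \
           --label opt_dfols4/d400m/20100901T0000Z \
           --verify-checksum --notify off "$SRC" "$DST"; then
        break
    fi
    echo "transfer submission failed (attempt $attempt), retrying in 5 minutes" >&2
    sleep 300
done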

JASMIN have a maintenance day today. Globus endpoints have consequently been unavailable at times.

Cheers,
Ros
