I submitted a bunch of short atmosphere simulations last night. Two failed:
opt_dfols4/d4002 – blew up at timestep 3775 (model time 2010-10-23 10:20:00).
opt_dfols4/d4007 – ran out of time. Last timestep in the pe000 file was 370…
I reran both cases and both are now running normally – d4002 is at timestep 5636, and d4007 happily ran three months and is now running the next three months in the workflow.
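For checking how far a run got before it died, I pull the last timestep line out of the pe000 file. Here is a minimal sketch of that, assuming the timestep lines contain "Atm_Step: Timestep" as in standard UM pe output – the path and pattern are illustrative, not copied from my suite:

```python
import re
import sys

# Sketch: report the last completed timestep recorded in a UM pe output file.
# Assumes timestep lines look like "Atm_Step: Timestep     3775";
# adjust the pattern if your output differs.
TIMESTEP_RE = re.compile(r"Atm_Step:\s+Timestep\s+(\d+)")

def last_timestep(pe_file):
    last = None
    with open(pe_file, errors="replace") as fh:
        for line in fh:
            m = TIMESTEP_RE.search(line)
            if m:
                last = int(m.group(1))
    return last

if __name__ == "__main__":
    path = sys.argv[1]  # e.g. a .../pe_output/...pe000 file (illustrative)
    print(f"{path}: last timestep {last_timestep(path)}")
```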
I can understand that simulations can run slowly if ARCHER2's I/O system is heavily loaded. The blow-up is more worrying, and it is worth noting that my simulations do not output very much data.
Given that I want to run O(200) of these simulations, are there any changes I can make to make them more reliable? Or should I complain to the ARCHER2 helpdesk…
It’s a bit worrying that it failed in the solver the first time and then succeeded the second time; that seems to imply the model does not run the same way twice in a row. We have seen this behaviour on rare occasions but have never found an explanation (and have never really known how to start looking for one), so I have nothing to add to help with such problems. We recommend that model tasks do not automatically resubmit on failure – that can leave a suite in a mess if Slurm misbehaves (which it does).
It’s always worth reporting such behaviour to Archer.
And I’ve had a bunch of files which are zero-sized on JASMIN (and deleted on ARCHER2), but when I look at the Globus output the transfer seems fine. See Globus for one example.
Here are all the directories on JASMIN which have zero-length files, along with the number of files in each…
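In case it’s useful, a quick way to generate that sort of listing is a small Python walk over the transfer destination. This is a sketch only, and the root path below is a placeholder rather than my actual JASMIN directory:

```python
from collections import Counter
from pathlib import Path

# Sketch: count zero-length files per directory under a destination root.
# The root path below is a placeholder, not the real JASMIN path.
root = Path("/gws/nopw/j04/somegws/opt_dfols4")

zero_counts = Counter()
for f in root.rglob("*"):
    if f.is_file() and f.stat().st_size == 0:
        zero_counts[f.parent] += 1

for directory, n in sorted(zero_counts.items()):
    print(f"{directory}: {n} zero-length files")
```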
We can’t see people’s Globus logs. Can you send a screenshot of the error logs so I can see whether it was a transfer that stop-started, etc., and confirm that the checksums were verified?
Can you send me the Globus task ID for the latest failure above, please? You’ll see it listed in the Overview tab for the task, and it will look something like this: 3111ea66-7eab-11f0-a2d0-0affef98d17b. I want to send it to JASMIN to see if they can see anything in their Globus logs. It should have run a checksum.
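If it helps while you gather that, the same task details can be pulled programmatically with the Globus Python SDK. This is a rough sketch only: it assumes you already have a valid Globus Transfer access token (obtaining one is not shown), and the field names printed are the ones I’d expect in the task document.

```python
import globus_sdk

# Sketch: inspect a Globus transfer task by ID.
TASK_ID = "3111ea66-7eab-11f0-a2d0-0affef98d17b"  # example ID from above
TRANSFER_TOKEN = "..."  # placeholder: a valid Transfer access token

tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(TRANSFER_TOKEN)
)

task = tc.get_task(TASK_ID)
# Print a few fields of interest; .get() is used in case a field is absent.
for key in ("status", "files", "files_transferred", "faults", "verify_checksum"):
    print(key, "=", task.get(key))

# The event list often shows stop-start behaviour (faults, retries).
for event in tc.task_event_list(TASK_ID):
    print(event.get("time"), event.get("code"), event.get("description"))
```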
I had a bunch of failures in the early AM – small start-up jobs for two models, all of which ran out of time.
install_ancil and fcm_make2pp both failed because they ran out of time. More painful was opt_dfols4/d400l/atmos_main running out of time having managed to run only a few days. I’ve increased the requested time for the small jobs, rerun them, and resubmitted the failed atmos_main – it looks to be running quite slowly… I also have opt_dfols4/d400k/atmos_main running, and it seems to be running at the rate I expect.