Previous working-job now seg-faulting on 1st timestep after OS upgrade (ARCHER2 v8.4 GA4 UM-UKCA)

gmann · 18 September 2023 12:38

The GLOMAP-v8.2-upgraded copy of the NCAS-CMS ported v8.4 GA4 UM-UKCA job
(copy of Ros’s xoxta job) is giving a seg-fault on timestep 1 since the resumption from
the May/June OS upgrade.

I’ve checked the ARCHER2 OS upgrade page in detail, and tried a number of different ways
to explore the cause of the seg-fault, including adding compiler over-ride files with
1) an updated cce compiler (v15.0.0), and 2) as 1), but also with the -g debug flag added to the compile lines.

I can see that the compile-lines are affected on ARCHER2, so these options are propagating
through successfully to upgrade the compiler version and to add the debug flag.

The job I am testing is a copy of the xplt-o job, which ran fine on 256 cores prior to the OS upgrade,
and was one of the ensemble-members we ran within the two HErSEA-Pinatubo GA4 UM-UKCA ensembles we submitted as the UK contribution to the ISA-MIP Pinatubo intercomparison paper (Quaglia et al., 2023, ACP, https://acp.copernicus.org/articles/23/921/2023/acp-23-921-2023.pdf )

The post-OS-upgrade re-run of the xplt-o job is for a paper in preparation from the ACSIS research to compare UK Agung model simulations to searchlight, lidar and balloon
measurements from the 1963-65 period.

The observations suggest a higher-altitude SO2 emission would give better agreement
to the measurements, and we are then keen to be able to re-run with the “obs-calibrated”
emission source parameter settings.

The simulations in Dhomse et al. (2020) were all with the same best-estimate mid-level
injection height, and then it’s straightforward to re-run with the higher injection-height.

Basically the Pinatubo base-case is giving a seg-fault on the 1st timestep, with the
128-core (xprb-r) and 256-core (xprb-s) debug-settings-compiled jobs both giving
near-identical behaviour.

With the info re: the change in behaviour of the slurm system to no-longer pass-on the
ntasks-per-node settings, I also tried re-running both with OpenMP switched off, this
requiring a different compiler over-ride file. Although the no-OMP equivalent jobs
do get slightly further, both still fail on the 1st timestep; see xprb-p (128-core) and
xprb-q (256-core).

I did try to run the gdb4hpc de-bugger, following the instructions at
https://docs.archer2.ac.uk/user-guide/debug/#gdb4hpc

But trying both methods, either “attaching” the de-bugger to the running job or
running the debugger directly, I couldn’t get the gdb4hpc system to co-run with
the executable.

I had hoped to run the totalview de-bugger, and this worked well at v7.3
(from modifying the qsexecute script, to add the totalview syntax around the aprun
command on HECToR phase 3).

Unfortunately, I could not see any info on totalview on the ARCHER2 system, but
I guess it will be possible to install this locally to try it out (with help from ARCHER2 support).

I’m just raising this query here to ask whether it might simply be that I need to add in some
additional flags, either within the compiler over-ride, or another machine over-ride or a hand-edit.

I checked-back Ros Hatcher’s job xoxta, and I could see there this was run in June, but also
failed at run-time, although it looks like it had not yet reached the run-stage.

Note that I had to do the re-configuration step interactively from ARCHER2, via a
sbatch umuisubmit_rcf from the umui_runs directory, because the re-configuration
step launched when submitting from the UMUI did not complete.

I also had to increase the number of nodes to 4 for the reconfiguration (for the 256-core job)
to get the reconfiguration to work.

If you look in the /work/n02/n02/gmann/output directory you will see the sequence of test-runs,
and the corresponding error messages etc. at the various stages of testing re: the above
test jobs xprbt, xprbs, xprbr, xprbq and xprbp.

When checking back to the original xoxta job from Ros, I noticed the revision number for
the FCM “container” had changed, since I originally copied the job.

I have updated the sequence of jobs xprb-s, -r, -q and -p for that updated container revision number, but the earlier job xprbt, and the Agung test jobs xprb-g, xprb-d etc.,
will still have the original container FCM revision number.

Please can someone in the team try running a copy of the reference GA4 UM-UKCA job (xoxta),
refer to the differences in the test jobs xprb-r etc. re: the compiler over-ride file, and see
if the same error is given, so that we can then seek assistance from the ARCHER2 team if needed.

I realise the ARCHER2 system is down for maintenance today, and am having to prioritise
teaching matters for the next 2 weeks (after the suspension of the UCU
marking & assessment boycott).

However, I will be able to reply to messages either here or via email.

Thanks a lot for your help with this,

Best regards,

Dr. Graham Mann
Lecturer in Atmospheric Science
University of Leeds

RosalynHatcher · 19 September 2023 09:34

Hi Graham,

Yes, I did update the 2 reference jobs following the Archer2 OS Upgrade to change compiler, etc. I thought I did get it to run ok, but I do remember hitting some temporary Archer2 issue at some point which, hopefully, is why you see my failed output.

I’m in the middle of doing some UMUI work, so once ARCHER2 is back up next week, I’ll let you know how I get on running xoxta/b.

Cheers,
Ros.

gmann · 19 September 2023 12:45

Hi Ros,

OK, that’s great, thanks a lot,

Cheers
Graham

RosalynHatcher · 29 September 2023 08:34

A quick update:
xoxta doesn’t work on Archer2 anymore as one of the input files has gone walkies.
I’ve started looking at your job xprbs.

RosalynHatcher · 29 September 2023 10:05

Hi Graham,

Can you please change the permissions on /work/n02/n02/gmann/UMdumps/xmvxha.da20021201_00 so that I can read it please?

Cheers,
Ros.

gmann · 29 September 2023 13:27

Thanks Ros – I’ve done a "chmod a+r " on those files so you should be able to access that file now.

Cheers
Graham

xphroa.da19910116_00 xphrsa.da19910116_00 xpllca.da20220114_00 xpltca.da19911201_00
gmann@ln01:/work/n02/n02/gmann/UMdumps> ls -lrt
total 316100676
-rw-r--r-- 1 gmann n02 8617328640 Dec 3 2016 xmvxha.da20030601_00
-rw-r--r-- 1 gmann n02 11589529600 Jan 12 2020 xnbeka.da19630301_00
-rw-r--r-- 1 gmann n02 11589529600 Jan 12 2020 xneoka.da19630301_00
-rw-r--r-- 1 gmann n02 10586243072 Jan 19 2022 xlazsa.da19921201_00
-rw-r--r-- 1 gmann n02 10906017792 Jan 22 2022 xpdbwa.da20021201_00
-rw-r--r-- 1 gmann n02 10906017792 Jan 22 2022 xpdbwa.da20030601_00
-rw-r--r-- 1 gmann n02 10906017792 Jan 23 2022 xpdbya.da20000601_00
-rw-r--r-- 1 gmann n02 11141574656 Jan 27 2022 xmbcla.da19981201_00
-rw-r--r-- 1 gmann n02 11141574656 Jan 27 2022 xmbcla.da19971201_00
-rw-r--r-- 1 gmann n02 11404312576 Jan 27 2022 xmotoa.da20091201_00
-rw-r--r-- 1 gmann n02 10634555392 Jan 28 2022 xphrta.da19910611_00
-rw-r--r-- 1 gmann n02 10634555392 Jan 28 2022 xphrua.da19910611_00
-rw-r--r-- 1 gmann n02 10694479872 Jan 30 2022 xphroa.da19910116_00
-rw-r--r-- 1 gmann n02 10694479872 Jan 30 2022 xphrpa.da19910116_00
-rw-r--r-- 1 gmann n02 10694479872 Jan 30 2022 xphrqa.da19910116_00
-rw-r--r-- 1 gmann n02 10694479872 Jan 30 2022 xphrra.da19910116_00
-rw-r--r-- 1 gmann n02 10694479872 Jan 30 2022 xphrsa.da19910116_00
-rw-r--r-- 1 gmann n02 11404312576 Aug 29 2022 xmotoa.da20080901_00
-rw-r--r-- 1 gmann n02 9720463360 Aug 31 2022 xpllca.da20220101_00
-rw-r--r-- 1 gmann n02 9720463360 Aug 31 2022 xpllca.da20220114_00
-rw------- 1 gmann n02 8617328640 Sep 2 2022 xmvxha.da19991201_00
-rw------- 1 gmann n02 8617328640 Sep 2 2022 xmvxha.da20021201_00
-rw------- 1 gmann n02 8617328640 Sep 2 2022 xmvxha.da20051201_00
-rw------- 1 gmann n02 8617328640 Sep 3 2022 xmvxha.da20031201_00
-rw-r--r-- 1 gmann n02 10694479872 Sep 4 2022 xploaa.da19910601_00
-rw-r--r-- 1 gmann n02 10694479872 Sep 16 2022 xpltca.da19911201_00
-rw-r--r-- 1 gmann n02 10694479872 Sep 16 2022 xpltba.da19930601_00
-rw-r--r-- 1 gmann n02 10694479872 Sep 16 2022 xpltaa.da19930601_00
-rw-r--r-- 1 gmann n02 10694479872 Sep 16 2022 xpltda.da19930601_00
-rw-r--r-- 1 gmann n02 10694479872 Sep 16 2022 xpltga.da19930601_00
-rw-r--r-- 1 gmann n02 10694479872 Sep 16 2022 xpltea.da19930601_00
gmann@ln01:/work/n02/n02/gmann/UMdumps> chmod a+r *
gmann@ln01:/work/n02/n02/gmann/UMdumps> ls -lrt
total 316100676
-rw-r--r-- 1 gmann n02 8617328640 Dec 3 2016 xmvxha.da20030601_00
-rw-r--r-- 1 gmann n02 11589529600 Jan 12 2020 xnbeka.da19630301_00
-rw-r--r-- 1 gmann n02 11589529600 Jan 12 2020 xneoka.da19630301_00
-rw-r--r-- 1 gmann n02 10586243072 Jan 19 2022 xlazsa.da19921201_00
-rw-r--r-- 1 gmann n02 10906017792 Jan 22 2022 xpdbwa.da20021201_00
-rw-r--r-- 1 gmann n02 10906017792 Jan 22 2022 xpdbwa.da20030601_00
-rw-r--r-- 1 gmann n02 10906017792 Jan 23 2022 xpdbya.da20000601_00
-rw-r--r-- 1 gmann n02 11141574656 Jan 27 2022 xmbcla.da19981201_00
-rw-r--r-- 1 gmann n02 11141574656 Jan 27 2022 xmbcla.da19971201_00
-rw-r--r-- 1 gmann n02 11404312576 Jan 27 2022 xmotoa.da20091201_00
-rw-r--r-- 1 gmann n02 10634555392 Jan 28 2022 xphrta.da19910611_00
-rw-r--r-- 1 gmann n02 10634555392 Jan 28 2022 xphrua.da19910611_00
-rw-r--r-- 1 gmann n02 10694479872 Jan 30 2022 xphroa.da19910116_00
-rw-r--r-- 1 gmann n02 10694479872 Jan 30 2022 xphrpa.da19910116_00
-rw-r--r-- 1 gmann n02 10694479872 Jan 30 2022 xphrqa.da19910116_00
-rw-r--r-- 1 gmann n02 10694479872 Jan 30 2022 xphrra.da19910116_00
-rw-r--r-- 1 gmann n02 10694479872 Jan 30 2022 xphrsa.da19910116_00
-rw-r--r-- 1 gmann n02 11404312576 Aug 29 2022 xmotoa.da20080901_00
-rw-r--r-- 1 gmann n02 9720463360 Aug 31 2022 xpllca.da20220101_00
-rw-r--r-- 1 gmann n02 9720463360 Aug 31 2022 xpllca.da20220114_00
-rw-r--r-- 1 gmann n02 8617328640 Sep 2 2022 xmvxha.da19991201_00
-rw-r--r-- 1 gmann n02 8617328640 Sep 2 2022 xmvxha.da20021201_00
-rw-r--r-- 1 gmann n02 8617328640 Sep 2 2022 xmvxha.da20051201_00
-rw-r--r-- 1 gmann n02 8617328640 Sep 3 2022 xmvxha.da20031201_00
-rw-r--r-- 1 gmann n02 10694479872 Sep 4 2022 xploaa.da19910601_00
-rw-r--r-- 1 gmann n02 10694479872 Sep 16 2022 xpltca.da19911201_00
-rw-r--r-- 1 gmann n02 10694479872 Sep 16 2022 xpltba.da19930601_00
-rw-r--r-- 1 gmann n02 10694479872 Sep 16 2022 xpltaa.da19930601_00
-rw-r--r-- 1 gmann n02 10694479872 Sep 16 2022 xpltda.da19930601_00
-rw-r--r-- 1 gmann n02 10694479872 Sep 16 2022 xpltga.da19930601_00
-rw-r--r-- 1 gmann n02 10694479872 Sep 16 2022 xpltea.da19930601_00
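
For reference, the permission change above could also be limited to the one dump that was requested, rather than applied to every file in the directory. The following is a minimal sketch run against a scratch file; the temporary directory and the empty stand-in file are assumptions for illustration, and only the file name is taken from this thread.

```shell
# Scratch directory so the sketch does not touch any real dumps (assumption for illustration).
scratch=$(mktemp -d)
dump="$scratch/xmvxha.da20021201_00"

# Create an empty stand-in file that is owner-only, like the -rw------- entries in the listing.
: > "$dump"
chmod 600 "$dump"

# Add read permission for user, group and others on this one file only.
chmod a+r "$dump"

# Keep just the ten-character mode string from the long listing and show it.
perms=$(ls -l "$dump" | cut -c1-10)
echo "$perms"
```

Note that read permission on the file is only part of the picture: the other user also needs execute (search) permission on every directory in the path leading to it.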

grenville · 17 October 2023 15:57

Hi Graham

Please can you set up the umui job to reference files/hand edits on puma2 - then we can try to fix the seg fault.

Please also allow us read permissions on puma2 home space.

Grenville

gmann · 25 October 2023 16:35

Hi Grenville,

I’ve just gone to pumanew to check re: the copying over the data to my new PUMA2 account on ARCHER2.

And I’ve just seen that I’ve missed the deadline, with the switch-off of the pumanew machine happening yesterday.

Sorry about this, I’ve been away at a workshop in Switzerland, then hosting a US visitor at Leeds, and then sorting out a whole series of urgent tasks re: teaching with the suspension of the marking and assessment boycott.

I saw the email below, which says that all home directories from pumanew will be archived (for a period of time), and that these archived directories will be made available, to enable people who didn’t manage to copy the data across before the deadline to do so afterwards.

I see the email-message for this says there will be an email sent with info on this archiving of the pumanew directories, but the email will only be sent out after the JASMIN down-time.

Perhaps I’m too early after the pumanew switch-off, but pls can you reply to clarify when we can expect to be able to access the archiving of the pumanew home directories.

Are these already available on JASMIN for people like me who missed the pumanew-switchoff deadline to still be able to copy these across to PUMA2 on ARCHER2?

Thanks for your help with this,

Best regards,

Cheers
Graham

From: Email list for NCAS-PUMA-UMUI NCAS-PUMA-UMUI@MAILLISTS.READING.AC.UK on behalf of Annette Osprey a.osprey@READING.AC.UK
Date: Thursday, 19 October 2023 at 14:06
To: NCAS-PUMA-UMUI@MAILLISTS.READING.AC.UK NCAS-PUMA-UMUI@MAILLISTS.READING.AC.UK
Subject: Final reminder: PUMA shutdown 24 October
Dear PUMA users,

A final reminder that the current PUMA server (pumanew) will be retired at lunchtime on 24 October, and will not be accessible to users after this time.

If there are any remaining users that have not moved over to PUMA2, please do so immediately.

The PUMA2 transition page gives detailed instructions on applying for an account, copying your data, and moving over your Rose/cylc suites and UMUI jobs.

Note that we will be archiving all pumanew home directories, and will make this data available on ARCHER2 for a period of time. This will not include cylc-run directories. For non-active users who are unable to access pumanew, we recommend retrieving their data from this archive. The archiving will be done after the Jasmin downtime has completed. We will make an announcement when the data is available.

Please do contact the CMS helpdesk with any queries.

Best wishes,

Annette

NCAS-CMS

To unsubscribe from the NCAS-PUMA-UMUI list, click the following link:

From: Email list for NCAS-PUMA-UMUI NCAS-PUMA-UMUI@MAILLISTS.READING.AC.UK on behalf of Rosalyn Hatcher rosalyn.hatcher@NCAS.AC.UK
Date: Thursday, 12 October 2023 at 10:59
To: NCAS-PUMA-UMUI@MAILLISTS.READING.AC.UK NCAS-PUMA-UMUI@MAILLISTS.READING.AC.UK
Subject: Transition to PUMA2 for UMUI users
Dear All,

If you DO NOT use the UMUI on PUMA then you can stop reading now.

For all UMUI users, the UMUI will move to PUMA2 on Tuesday 17th October (next week).

Please follow the instructions on how to apply for a PUMA2 account ahead of 17th October, which include instructions for getting set up and copying your data across.
We suggest that if you don’t already have an ARCHER2/PUMA2 account, you register using the same username as you already have on PUMA - this will make transition to PUMA2 easier.

Timeline:
• Monday 16th October 16:00hrs UMUI switched off on current PUMA server (pumanew)
• Tuesday 17th October UMUI will be brought back up on PUMA2.

Important Changes:
• Your PUMA2 username may or may not be the same as the username you had on PUMA. If your username is different you will not be able to edit your existing UMUI jobs. You will need to first copy them to a new Expt/Job Id.
• For submission of jobs to ARCHER2 (vn8.4) the “SUBMIT” mechanism will change. You will run the UMSUBMIT_ARCHER2 script on the command line.

Full details on these and any other changes can be found on the CMS Website: Using the UMUI on PUMA2

If you have any issues, please contact the CMS helpdesk

Best Regards,
Ros


grenville · 25 October 2023 16:49

Hi Graham

pumanew data will be available soon after JASMIN comes back - which should be a couple of weeks from now.

Grenville

gmann · 25 October 2023 18:53

Hi Grenville,

Thanks for the fast reply – and ah, OK, it won’t be until after the return from the JASMIN outage.

Cheers
Graham

gmann · 10 November 2023 13:21

Hi Grenville or Ros,

JASMIN came back up last week, so please can one of you reply with the file-path on JASMIN where the pumanew directories being copied across are stored?

Thanks a lot for your help with this – much appreciated,

Cheers
Graham

grenville · 10 November 2023 13:42

Graham

pumanew data has been copied to ARCHER2 at /work/n02/n02/ajh/pumanew-23Oct2023

Grenville

gmann · 10 November 2023 14:47

Hi Grenville,

That’s great – thanks.

I’m just rsync’ing my gmann directory across, and then all the files will be in equivalent file-locations on PUMA2, to where they were on pumanew…

And then it should simply be a case of changing the file-path to match the ARCHER2 syntax
(or I’ll probably add an ARCHER2_HOME environment-variable in the UMUI to do that).

I’ll confirm when I’ve updated the ARCHER2-job, but the files re: the seg-fault testing will all be there once the rsync has completed.

Cheers
Graham
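
The path change described above could be sketched as follows. The old home prefix /home/gmann and the example file name are assumptions for illustration (the pumanew home path is not given in this thread), and ARCHER2_HOME is shown here as a plain shell variable rather than as a UMUI setting.

```shell
# Hypothetical old-style path, as it might appear in a hand-edit entry (assumed, not from the thread).
old_path="/home/gmann/stashfiles/example_handedit.ed"

# Assumed home prefix on the old server.
old_home="/home/gmann"

# Variable name suggested in the post; the value is the PUMA2 home directory quoted in this thread.
ARCHER2_HOME="/home/n02/n02/gmann"

# Strip the old home prefix, then prepend the new one.
suffix=${old_path#$old_home}
new_path="$ARCHER2_HOME$suffix"
echo "$new_path"
```

Keeping the home prefix in one variable means a later move only needs that single value changed, not every file-path entry.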

gmann · 10 November 2023 14:54

The rsync has completed now, and the hand-edits I tended mostly to keep in
the same “stashfiles” directory as the user STASHmaster files we developed:

/home/n02/n02/gmann/stashfiles/

gmann · 10 November 2023 14:57

However, whereas the UMUI was working OK when I tried it a few weeks ago, today it’s giving
the following error message when opening a UM job:

Error in startup script: couldn’t read file “/home/annette/bin/umui/setcols.tcl”: no such file or directory

Please can you point me to where I can change the file-path for this?

Is this set in the .UMUI file on PUMA2?

Thanks
Graham

RosalynHatcher · 10 November 2023 15:48

Try your ~/.umui_appearance file, but I don’t think Annette has those files anymore.

Probably best to keep your own copy of those umui files. I do have a copy under my ~ros/bin directory which you can copy, though I can’t guarantee they will work; I’ve not tested them since we moved to puma2.

Cheers,
Ros

gmann · 10 November 2023 16:35

Hi Ros,

Thanks for this – ah, OK yes I can see the path there

[gmann@puma2 ~]$ more ~/.umui_appearance
source /home/annette/bin/umui/setcols.tcl
source /home/annette/bin/umui/navigation-override.tcl

I have scp’d the .tcl files from Annette’s directory within the ARCHER2-ported pumanew directory-tree to a new bin/umui/ directory off my home-space on PUMA2, e.g.:

/home/n02/n02/gmann/bin/umui/setcols.tcl

I then updated the .umui_appearance file to point to these 2 file-paths instead, and the UMUI then works fine now (opening the umui jobs with the colours etc. as before).
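
The update to .umui_appearance described above can be sketched as a single substitution. This is a minimal sketch that works on a scratch copy of the file (the scratch file is an assumption for illustration; the two source lines and the new home path are those quoted in this thread), and it relies on the in-place option of GNU sed.

```shell
# Work on a scratch copy rather than the real ~/.umui_appearance (assumption for illustration).
appearance=$(mktemp)
cat > "$appearance" <<'EOF'
source /home/annette/bin/umui/setcols.tcl
source /home/annette/bin/umui/navigation-override.tcl
EOF

# Point both source lines at the local copies of the .tcl files.
# '|' is used as the sed delimiter because the paths themselves contain '/'.
sed -i 's|/home/annette/bin/umui|/home/n02/n02/gmann/bin/umui|' "$appearance"

# Show the result.
cat "$appearance"
```

Taking a backup of the real file before editing it in place would be a sensible precaution.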

As I say, the files should now all be in the stashfiles directory on PUMA2 at:

/home/n02/n02/gmann/stashfiles/

So re: the hand-edits (and user-STASH-master files) re: testing the seg-faulting GA4 UM-UKCA job, it’s just a case of updating the file-paths (or adding a UMUI environment-variable).

Thanks again for your help with this,

Best regards,

Cheers
Graham

gmann · 17 November 2023 11:02

Hi Grenville & Ros,

As I noted above, I’ve recovered all my file-space from pumanew to the PUMA2 filesystem.

And this morning, I’ve secured time to do what I said I’d do, re: Grenville asking me to setup the UMUI job to reference files/hand-edits on PUMA2.

Unfortunately, the UMUI doesn’t seem to be displaying from PUMA2 when I’m trying this morning.

As I say, all the hand-edits are there, in exactly the same place as on pumanew
(but with the /home/n02/n02/gmann rather than the slightly-different home-dir-path on pumanew)

So whilst I’ve tried to do that, I need to prioritise my teaching duties with my lectureship at Leeds.

As I say, simply changing the home-directory will then find the hand-edits, so all should be
in place now to resume the work-flow Ros started on 29th September.

I’ll try again if I get the chance, and update this post – but all should be there and accessible now on PUMA2.

Thanks again for your help with this,

Cheers
Graham

gmann · 17 November 2023 11:04

Literally after clicking “send” the UMUI then appeared OK
(maybe the system came back-up at 11.00am?)

So I will complete what I said I’d do there
(hopefully it will only take me about 15 mins, and will give a clean hand-over, also checking I can submit jobs OK to ARCHER2 from PUMA2).

Thanks
Graham

gmann · 17 November 2023 12:55

OK, I’ve done that now – to add the paths for the job that Ros has referred to already – xprb-s.
(That is the ARCHER2-ported GA4 UM-UKCA job xplt-o with GLOMAP-v8.2 configured for transient atmos-only Pinatubo 14Tg run xnbec, 1 of the 3 ensemble-members from the 14Tg@21-24km SMURPHS runs we published in Dhomse et al., 2020)

What I tried to do (a few weeks ago) was to check whether the various hand-edits and user-STASH-master files
from the ukca home-directory on PUMA were already in other user-directories (Luke’s, Mohit’s).

It turned-out that only a few of the hand-edits were already available on PUMA2, but those that I did track-down, I’d already implemented into an “interim job”.

If you refer to experiment xpsq, you see there that there are 3 UM jobs in that experiment-folder.

The xpsq-s job is an exact copy of the xprb-s job that Ros referred to, which had previously run fine on ARCHER2 (see job xplt-o, prior to the OS upgrade in May/June).

Then xpsq-t is that “interim job” where I’d found some of the hand-edits were already present in Mohit and Luke’s directories.

What I’ve done this morning, is the job xpsq-u, and that has all the hand-edits and user pre-STASH-master files from directories on PUMA2.
