As the tasks have failed to submit, there are no job.err or job.out files yet. The actual error message when tasks “submit-fail” can be found in the job-activity.log. Your jobs are failing to submit because they are trying to use qsub, the job submission command for the scheduler on the old ARCHER system.
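If it helps, here is a minimal Python sketch for locating the job-activity.log files of a suite so you can read the submission errors in one go. It assumes the usual run-directory layout of log/job/<cycle>/<task>/<submit-number>/ under the suite's cylc-run directory; the function names are made up for illustration and are not part of Cylc or Rose.

```python
from pathlib import Path


def find_activity_logs(suite_run_dir):
    """Return every job-activity.log found under a suite run directory."""
    root = Path(suite_run_dir)
    # Assumed layout: log/job/<cycle>/<task>/<submit-number>/job-activity.log
    return sorted(root.glob("log/job/*/*/*/job-activity.log"))


def show_activity_logs(suite_run_dir):
    """Print each job-activity.log, preceded by its path."""
    for log in find_activity_logs(suite_run_dir):
        print(f"==> {log}")
        print(log.read_text(errors="replace"))
```

For example, `show_activity_logs("/work/n02/n02/tfrancis/cylc-run/u-cq224")` would print whatever activity logs exist for that suite.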
From a quick look it seems that the site/archer2.rc file is just a copy of the old archer.rc file with the hostnames changed. Unfortunately, porting a suite from ARCHER to ARCHER2 is not that simple.
Full instructions on how to make the transition from ARCHER to ARCHER2 can be found here: Porting a suite to ARCHER2
A similar suite that I have already ported, which should help you, is u-ca634. This is the vn11.7 equivalent of the suite you are trying to run.
Many thanks for your kind guidance. I think I have now ported the vn12.0 suite u-cq224 to ARCHER2. The latest version of this suite, with the ARCHER2-related modifications, has been committed back to the repository as well.
There are, however, some minor issues to be sorted out:
Install_cold is failing with the following error message.
[FAIL] file:/work/n02/n02/tfrancis/cylc-run/u-cq224/share/data/etc/um_ancils_gl=source=/work/y07/shared/umshared/ancil/data/ancil_versions/n96e_orca025/GA8.0CMIP6_AMIP/v1/ancils: bad or missing value
2022-08-17T09:49:55Z CRITICAL - failed/EXIT
I realise this is because the GA8.0-specific files are not available at this path. Could you suggest an alternative path, if one is available?
While fcm_make_um completes successfully, the fcm_make2_um fails after running for almost 15 minutes, with the following error message:
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=2188085.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.
The ancils that were originally set in this suite were: fcm:ancil_data.xm-br/dev/paulearnshaw/r9491_ga8_ancilvn/ancil_versions/n96e_orca025/GA8.0_AMIP @10880 and I already have those under my area at: /work/n02/n02/ros/ancil/paulearnshaw/r9491_ga8_ancilvn - with a GA8.0_AMIP directory if that helps.
I don’t know anything about a GA8.0CMIP6_AMIP/v1/ancils directory; that doesn’t exist under $UMDIR/ancil/data on the Met Office XCS either. Who told you about the GA8.0CMIP6_AMIP directory? Are they able to help locate it?
You’ll need to increase the available memory. In the site/archer2.rc file, in the [[UMBUILD_RESOURCE]] section, add:
Yes, I have now pointed the ancillaries to your area: /work/n02/n02/ros/ancil/paulearnshaw/r9491_ga8_ancilvn, and it solved the Install_cold problem. Also increasing the memory helped avoid the issue in fcm_make2_um.
The suite now runs successfully up to atmos_main.
The atmos_main task is failing with the following error. Please refer to the log files in my area for the suite u-cq224 for more detail on the errors.
Please could you advise on what could be causing the errors below?
[16] exceptions: An exception was raised:11 (Segmentation fault)
[16] exceptions: the exception reports the extra information: Address not mapped to object.
[16] exceptions: whilst in a serial region
[16] exceptions: Task had pid=217581 on host nid004283
[16] exceptions: Program is “/work/n02/n02/tfrancis/cylc-run/u-cq224/share/fcm_make_um/build-atmos/bin/um-atmos.exe”
Warning in umPrintMgr: umPrintExceptionHandler : Handler Invoked
[16] exceptions: Data address (si_addr): 0x00000000; rip: 0x2ae3dabac32f
[16] exceptions: [backtrace]: has 5 elements:
[16] exceptions: [backtrace]: ( 1) : Address: [0x2ae3dabac32f]
[1] exceptions: An exception was raised:11 (Segmentation fault)
[1] exceptions: the exception reports the extra information: Address not mapped to object.
[1] exceptions: whilst in a serial region
[1] exceptions: Task had pid=217566 on host nid004283
[1] exceptions: Program is “/work/n02/n02/tfrancis/cylc-run/u-cq224/share/fcm_make_um/build-atmos/bin/um-atmos.exe”
Warning in umPrintMgr: umPrintExceptionHandler : Handler Invoked
[1] exceptions: Data address (si_addr): 0x00000000; rip: 0x2b5fc70cc32f
[1] exceptions: [backtrace]: has 5 elements:
[1] exceptions: [backtrace]: ( 1) : Address: [0x2b5fc70cc32f]
[15] exceptions: An exception was raised:11 (Segmentation fault)
[15] exceptions: the exception reports the extra information: Address not mapped to object.
[15] exceptions: whilst in a serial region
[15] exceptions: Task had pid=217580 on host nid004283
[15] exceptions: Program is “/work/n02/n02/tfrancis/cylc-run/u-cq224/share/fcm_make_um/build-atmos/bin/um-atmos.exe”
Warning in umPrintMgr: umPrintExceptionHandler : Handler Invoked
[15] exceptions: Data address (si_addr): 0x00000000; rip: 0x2ac264f1a32f
[15] exceptions: [backtrace]: has 5 elements:
[15] exceptions: [backtrace]: ( 1) : Address: [0x2ac264f1a32f]
[2] exceptions: An exception was raised:11 (Segmentation fault)
[2] exceptions: the exception reports the extra information: Address not mapped to object.
[2] exceptions: whilst in a serial region
[2] exceptions: Task had pid=217567 on host nid004283
[2] exceptions: Program is “/work/n02/n02/tfrancis/cylc-run/u-cq224/share/fcm_make_um/build-atmos/bin/um-atmos.exe”
Warning in umPrintMgr: umPrintExceptionHandler : Handler Invoked
[2] exceptions: Data address (si_addr): 0x00000000; rip: 0x2b6adaeca32f
[2] exceptions: [backtrace]: has 5 elements:
[2] exceptions: [backtrace]: ( 1) : Address: [0x2b6adaeca32f]
[3] exceptions: An exception was raised:11 (Segmentation fault)
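With output like this repeated for every process, it can help to tabulate which processes reported an exception and on which host before digging further. The following is a rough Python sketch that parses only the two line shapes visible in the log above; it assumes the bracketed number at the start of each line identifies the reporting process, and the function name is made up for illustration. It summarises the log and does not diagnose the cause.

```python
import re
from collections import defaultdict

# Matches e.g. "[16] exceptions: An exception was raised:11 (Segmentation fault)"
RAISED_RE = re.compile(r"^\[(\d+)\] exceptions: An exception was raised:(\d+) \((.+)\)")
# Matches e.g. "[16] exceptions: Task had pid=217581 on host nid004283"
TASK_RE = re.compile(r"^\[(\d+)\] exceptions: Task had pid=(\d+) on host (\S+)")


def summarise_exceptions(lines):
    """Map each process number to the signal it reported and where it ran."""
    summary = defaultdict(dict)
    for line in lines:
        line = line.strip()
        raised = RAISED_RE.match(line)
        if raised:
            proc, signum, name = raised.groups()
            summary[int(proc)].update(signal=int(signum), name=name)
            continue
        task = TASK_RE.match(line)
        if task:
            proc, pid, host = task.groups()
            summary[int(proc)].update(pid=int(pid), host=host)
    return dict(summary)
```

Feeding it the lines of the job.err file (for example via `open(path)`) gives one dictionary entry per process, which makes it easy to see whether every process failed in the same way or only a few did.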
I suspect that pointing to the wrong ancillary version could be a possible cause. So I have now reverted my changes to the ancillary paths in /home/eartfr/roses/u-cq224/app/install_cold/rose-app.conf
But now, when I submit the suite, I get the following error in install_cold:
The following have been reloaded with a version change:
cce/11.0.4 => cce/12.0.3
[FAIL] file:/work/n02/n02/tfrancis/cylc-run/u-cq224/share/data/etc/um_ancils_gl=source=fcm:ancil_data.xm-br/dev/paulearnshaw/r9491_ga8_ancilvn/ancil_versions/n96e_orca025/GA8.0_AMIP/v1/ancils@10880: bad or missing value
2022-08-19T15:59:38Z CRITICAL - failed/EXIT
It is worth noting that the /home/eartfr/roses/u-cq224/app/install_cold/rose-app.conf file is identical to the one I have been using successfully on the Met Office Monsoon HPC. So I was wondering why it should fail on ARCHER2?
Yes, there was a fix we tried by pointing the ancillary path to /work/n02/n02/ros/ancil/paulearnshaw/r9491_ga8_ancilvn.
But then I got the segmentation faults reported later in the posts. I suspected that these segmentation faults could be linked to this alternative ancillary path, so I reverted the path to see if that would resolve them.
But /work/n02/n02/ros/ancil/paulearnshaw/r9491_ga8_ancilvn is the same as fcm:ancil_data.xm-br/dev/paulearnshaw/r9491_ga8_ancilvn/ancil_versions/n96e_orca025/GA8.0_AMIP @10880 (said Ros)
Did you ever find out about GA8.0CMIP6_AMIP/v1/ancils?
On ARCHER2 you cannot extract directly from a MOSRS repository, so the fcm:ancil_data.xm-br/.... link won’t work. The directory I pointed you to is an extracted copy of this URL. I have used it successfully in my equivalent UM13.0 GA suite. You only need to specify a revision number when you extract from the repository; no revision number is required for a working copy.
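To make the distinction concrete, here is a small Python sketch that classifies an ancil source string as either a repository URL or a local path and flags the two mismatches described above. The list of prefixes and the function name are assumptions made for illustration; this is not how Rose or FCM validate the setting.

```python
# Prefixes treated as repository locations in this sketch (an assumption, not a full list).
REPOSITORY_PREFIXES = ("fcm:", "svn:", "svn+ssh:", "http:", "https:")


def check_ancil_source(source):
    """Classify an ancil source string and flag an unusable or redundant form."""
    # Anything after the first "@" is taken to be the revision number.
    location, _, revision = source.partition("@")
    is_repository = location.startswith(REPOSITORY_PREFIXES)
    problems = []
    if is_repository:
        problems.append("repository URL: cannot be extracted directly on ARCHER2")
    elif revision:
        problems.append("revision given for a local path: not needed for a working copy")
    return {
        "location": location,
        "revision": revision or None,
        "is_repository": is_repository,
        "problems": problems,
    }
```

Passing it the fcm:ancil_data.xm-br/... setting from the failing install_cold would flag the repository URL, whereas the extracted directory under /work/n02/n02/ros/ancil/ comes back with no problems.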