Submit-failed

Hi,

I was trying to run suite u-cp479 on ARCHER2 from Pumatest.

The tasks submit-failed with:

ERROR: file not found: /home/eartfr/cylc-run/u-cp479/log/job/19880901T0000Z/fcm_make_um/01/job.err

ERROR: file not found: /home/eartfr/cylc-run/u-cp479/log/job/19880901T0000Z/install_cold/01/job.err

Could you please advise on the possible cause?

Regards,
Timmy

Hi Timmy,

As the tasks have failed to submit there are no job.err or job.out files yet. The actual error message when tasks “submit-fail” can be found in the job-activity.log. Your jobs are failing to submit because they are trying to use qsub which was the scheduler for the old ARCHER system.
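
As a minimal sketch of where to look: job-activity.log sits in the same job log directory as the job.err that was reported missing. The path below is taken from the error in the first post; the variable names are only illustrative.

```shell
# Path from the "file not found" error reported above.
job_err=/home/eartfr/cylc-run/u-cp479/log/job/19880901T0000Z/fcm_make_um/01/job.err

# job-activity.log lives alongside job.err and job.out in the job log directory.
activity_log="$(dirname "$job_err")/job-activity.log"
echo "$activity_log"

# Print the submission error if the log exists on this machine.
if [ -f "$activity_log" ]; then
    cat "$activity_log"
fi
```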

From a quick look it seems that the site/archer2.rc file is just a copy of the old archer.rc file with the hostnames changed. Unfortunately, porting a suite from ARCHER to ARCHER2 is not that simple.

Full instructions on how to make the transition from ARCHER to ARCHER2 can be found here: Porting a suite to ARCHER2

A similar suite that I have already ported, which should help you, is u-ca634; this is the vn11.7 equivalent of the suite you are trying to run.

Regards,
Ros.

Hi Ros,

Many thanks for your kind guidance. I think I have now ported a vn12.0 suite, u-cq224, to ARCHER2. The latest version of this suite, with the ARCHER2-related modifications, has been committed back to the repository as well.

There are, however, some minor issues to be sorted out:

  1. Install_cold is failing with the following error message.

[FAIL] file:/work/n02/n02/tfrancis/cylc-run/u-cq224/share/data/etc/um_ancils_gl=source=/work/y07/shared/umshared/ancil/data/ancil_versions/n96e_orca025/GA8.0CMIP6_AMIP/v1/ancils: bad or missing value
2022-08-17T09:49:55Z CRITICAL - failed/EXIT

I realise this is because the GA8.0-specific files are not available in this path. Could you suggest an alternative path, if one is available?

  2. While fcm_make_um completes successfully, fcm_make2_um fails after running for almost 15 minutes, with the following error message:

slurmstepd: error: Detected 1 oom-kill event(s) in StepId=2188085.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.

Please could you help me on these.

Cheers,
Timmy

Hi Timmy,

  1. The ancils that were originally set in this suite were:
    fcm:ancil_data.xm-br/dev/paulearnshaw/r9491_ga8_ancilvn/ancil_versions/n96e_orca025/GA8.0_AMIP @10880 and I already have those under my area at:
    /work/n02/n02/ros/ancil/paulearnshaw/r9491_ga8_ancilvn - with a GA8.0_AMIP directory if that helps.

I don’t know anything about a GA8.0CMIP6_AMIP/v1/ancils directory; that doesn’t exist under $UMDIR/ancil/data on the Met Office XCS either. Who told you about the GA8.0CMIP6_AMIP directory? Are they able to help locate it?

  2. You’ll need to increase the available memory. In the site/archer2.rc file, in the [[UMBUILD_RESOURCE]] section, add:
        [[[directives]]]
                --mem=20Gb
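
To check how much memory the build actually used, before or after raising the limit, Slurm accounting can be queried for the job. This is only a sketch: it pulls the job ID out of the slurmstepd message quoted earlier in the thread, and assumes sacct is available on the login node (it is skipped otherwise).

```shell
# The out-of-memory message reported for fcm_make2_um, shortened.
msg='slurmstepd: error: Detected 1 oom-kill event(s) in StepId=2188085.batch.'

# Extract the numeric job ID that precedes ".batch".
job_id=$(printf '%s\n' "$msg" | sed -n 's/.*StepId=\([0-9][0-9]*\)\..*/\1/p')
echo "job id: $job_id"

# Compare requested memory (ReqMem) with peak usage (MaxRSS), if sacct exists here.
if command -v sacct >/dev/null 2>&1; then
    sacct -j "$job_id" --format=JobID,JobName,ReqMem,MaxRSS,State
else
    echo "sacct not available on this host"
fi
```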

Regards,
Ros.

Hi Ros,

Yes, I have now pointed the ancillaries to your area: /work/n02/n02/ros/ancil/paulearnshaw/r9491_ga8_ancilvn, and it solved the Install_cold problem. Also increasing the memory helped avoid the issue in fcm_make2_um.

The suite now runs successfully up to atmos_main.

atmos_main is failing with the errors below. Please refer to the log files in my area for suite u-cq224 for more detail.

Could you please advise on what could be causing these errors?

Cheers,
Timmy


? Warning from routine: eg_SISL_setcon
? Warning message: Constant gravity enforced
? Warning from processor: 0
? Warning number: 27
???

[16] exceptions: An exception was raised:11 (Segmentation fault)
[16] exceptions: the exception reports the extra information: Address not mapped to object.
[16] exceptions: whilst in a serial region
[16] exceptions: Task had pid=217581 on host nid004283
[16] exceptions: Program is “/work/n02/n02/tfrancis/cylc-run/u-cq224/share/fcm_make_um/build-atmos/bin/um-atmos.exe”
Warning in umPrintMgr: umPrintExceptionHandler : Handler Invoked
[16] exceptions: Data address (si_addr): 0x00000000; rip: 0x2ae3dabac32f
[16] exceptions: [backtrace]: has 5 elements:
[16] exceptions: [backtrace]: ( 1) : Address: [0x2ae3dabac32f]
[1] exceptions: An exception was raised:11 (Segmentation fault)
[1] exceptions: the exception reports the extra information: Address not mapped to object.
[1] exceptions: whilst in a serial region
[1] exceptions: Task had pid=217566 on host nid004283
[1] exceptions: Program is “/work/n02/n02/tfrancis/cylc-run/u-cq224/share/fcm_make_um/build-atmos/bin/um-atmos.exe”
Warning in umPrintMgr: umPrintExceptionHandler : Handler Invoked
[1] exceptions: Data address (si_addr): 0x00000000; rip: 0x2b5fc70cc32f
[1] exceptions: [backtrace]: has 5 elements:
[1] exceptions: [backtrace]: ( 1) : Address: [0x2b5fc70cc32f]
[15] exceptions: An exception was raised:11 (Segmentation fault)
[15] exceptions: the exception reports the extra information: Address not mapped to object.
[15] exceptions: whilst in a serial region
[15] exceptions: Task had pid=217580 on host nid004283
[15] exceptions: Program is “/work/n02/n02/tfrancis/cylc-run/u-cq224/share/fcm_make_um/build-atmos/bin/um-atmos.exe”
Warning in umPrintMgr: umPrintExceptionHandler : Handler Invoked
[15] exceptions: Data address (si_addr): 0x00000000; rip: 0x2ac264f1a32f
[15] exceptions: [backtrace]: has 5 elements:
[15] exceptions: [backtrace]: ( 1) : Address: [0x2ac264f1a32f]
[2] exceptions: An exception was raised:11 (Segmentation fault)
[2] exceptions: the exception reports the extra information: Address not mapped to object.
[2] exceptions: whilst in a serial region
[2] exceptions: Task had pid=217567 on host nid004283
[2] exceptions: Program is “/work/n02/n02/tfrancis/cylc-run/u-cq224/share/fcm_make_um/build-atmos/bin/um-atmos.exe”
Warning in umPrintMgr: umPrintExceptionHandler : Handler Invoked
[2] exceptions: Data address (si_addr): 0x00000000; rip: 0x2b6adaeca32f
[2] exceptions: [backtrace]: has 5 elements:
[2] exceptions: [backtrace]: ( 1) : Address: [0x2b6adaeca32f]
[3] exceptions: An exception was raised:11 (Segmentation fault)

Please set permissions on /work and /home so that we can read files:

chmod -R g+rX /home/n02/n02/<username>
chmod -R g+rX /work/n02/n02/<username>

Grenville

Hi Grenville,

I have run these commands now to change the permissions. Please let me know if it’s still not accessible.

Cheers,
Timmy

Timmy

In the rose gui, search for PRINT_STATUS and set it to Extra diagnostic messages and run again - that might provide more clues.

Grenville

Hi Grenville,

I have now set the PRINT_STATUS to ‘Extra diagnostic messages’ and submitted the job again.

Cheers,
Timmy

Hi Grenville,

I suspect that pointing to the wrong ancillary path or version could be a cause, so I have now reverted my changes to the ancillary paths in /home/eartfr/roses/u-cq224/app/install_cold/rose-app.conf

But when I submit the suite, I get the following error:

[INFO] symlink: rose-conf/20220819T155251-run.conf <= log/rose-suite-run.conf

[INFO] symlink: rose-conf/20220819T155251-run.version <= log/rose-suite-run.version

[INFO] install: app

[INFO] source: /home/eartfr/roses/u-cq224/app

[INFO] REGISTERED u-cq224 → /home/eartfr/cylc-run/u-cq224

[FAIL] rsync -a --exclude=.* --timeout=1800 --rsh=ssh\ -oBatchMode=yes --exclude=d91d41de-7059-4501-812b-3968a8022640 --exclude=log/d91d41de-7059-4501-812b-3968a8022640 --exclude=share/d91d41de-7059-4501-812b-3968a8022640 --exclude=share/cycle/d91d41de-7059-4501-812b-3968a8022640 --exclude=work/d91d41de-7059-4501-812b-3968a8022640 --exclude=/.* --exclude=/cylc-suite.db --exclude=/log --exclude=/log.* --exclude=/state --exclude=/share --exclude=/work ./ tfrancis@login3.archer2.ac.uk:cylc-run/u-cq224 # return-code=12, stderr=

[FAIL] rsync: mkdir “/home1/home/n02/n02/tfrancis/cylc-run/u-cq224” failed: No such file or directory (2)

[FAIL] rsync error: error in file IO (code 11) at main.c(664) [Receiver=3.1.3]

[FAIL] rsync: connection unexpectedly closed (9 bytes received so far) [sender]

[FAIL] rsync error: error in rsync protocol data stream (code 12) at io.c(600) [sender=3.0.6]

I was wondering why the path starts with /home1/home/. Is this an error?

Cheers,
Timmy

I assume you have tried several times - in which case try rose suite-run --new to kick start the whole suite

Grenville

Hi Grenville,

Yes, the rose suite-run --new helped.

But now I get the following error, in install_cold.

The following have been reloaded with a version change:

  1. cce/11.0.4 => cce/12.0.3

[FAIL] file:/work/n02/n02/tfrancis/cylc-run/u-cq224/share/data/etc/um_ancils_gl=source=fcm:ancil_data.xm-br/dev/paulearnshaw/r9491_ga8_ancilvn/ancil_versions/n96e_orca025/GA8.0_AMIP/v1/ancils@10880: bad or missing value
2022-08-19T15:59:38Z CRITICAL - failed/EXIT

It is worth noting that the /home/eartfr/roses/u-cq224/app/install_cold/rose-app.conf file is identical to the one I have been using successfully on the Met Office Monsoon HPC, so I was wondering why it should fail on ARCHER2.

Cheers,
Timmy

Did Ros not fix this in an earlier post in this thread?

Grenville

Hi Grenville,

Yes, there was a fix we tried by pointing the ancillary path to /work/n02/n02/ros/ancil/paulearnshaw/r9491_ga8_ancilvn.

But then I got the segmentation faults reported later in the thread. I suspected these could be linked to the alternative ancillary path, so I reverted the path to see whether that would resolve them.

Cheers,
Timmy

But /work/n02/n02/ros/ancil/paulearnshaw/r9491_ga8_ancilvn is the same as fcm:ancil_data.xm-br/dev/paulearnshaw/r9491_ga8_ancilvn/ancil_versions/n96e_orca025/GA8.0_AMIP @10880 (said Ros)

Did you ever find out about GA8.0CMIP6_AMIP/v1/ancils ?

GA8.0CMIP6_AMIP was a typo that somehow made it into my local version of the suite. The correct name is GA8.0_AMIP.

Also, the path /work/n02/n02/ros/ancil/paulearnshaw/r9491_ga8_ancilvn/ancil_versions/n96e_orca025/GA8.0_AMIP/v1/ancils

does not mention a revision number.

I think it should be:

/work/n02/n02/ros/ancil/paulearnshaw/r9491_ga8_ancilvn/ancil_versions/n96e_orca025/GA8.0_AMIP/v1/ancils@10880

In Ros’s area there is no ancils@10880.

I think this lack of a revision number is the cause of the segmentation faults.

Timmy

Hi Timmy,

On ARCHER2 you cannot extract directly from a MOSRS repository, so the fcm:ancil_data.xm-br/.... link won’t work. The directory I pointed you to is an extracted copy of that URL. I have used it successfully in my equivalent UM13.0 GA suite. You only need to specify a revision number when you extract from the repository; no revision number is required for a working copy.
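
To make the mapping concrete, here is a small sketch that turns the repository-style source from the install_cold failure into the matching path under the extracted copy: drop the @revision suffix and swap the repository prefix for the local root. The variable names are illustrative only.

```shell
# Repository-style source, as it appears in the install_cold failure message.
repo_src='fcm:ancil_data.xm-br/dev/paulearnshaw/r9491_ga8_ancilvn/ancil_versions/n96e_orca025/GA8.0_AMIP/v1/ancils@10880'

# Root of the extracted copy on ARCHER2.
local_root=/work/n02/n02/ros/ancil/paulearnshaw/r9491_ga8_ancilvn

# A working copy takes no revision number, so strip the "@10880" suffix.
no_rev=${repo_src%@*}

# Keep only the part of the path below the branch name.
sub_path=${no_rev#*r9491_ga8_ancilvn/}

local_src="$local_root/$sub_path"
echo "source=$local_src"

# Confirm the directory is readable when run on ARCHER2.
if [ -d "$local_src" ]; then
    ls "$local_src"
fi
```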

Regards
Ros

Hi Ros,

That’s great. So we can now rule out this possibility, and look for other reasons for the segmentation fault error messages.

I will get back to you after retesting some other aspects in the suite.

Cheers,
Timmy

Hi Timmy,

What’s the suite id of the suite you had been running successfully on Monsoon? u-cm696?

Cheers,
Ros.

Hi Ros,

Yes, that’s right. u-cm696

Please make sure you take the latest version, @234646.

I made a minor update a few minutes ago, and I have reconfirmed that this version is now running successfully on Monsoon.

Cheers,
Timmy