Continuing our email conversation…
You are trying to upgrade an N1280 suite from vn11.6 to vn13.1. We advised using the rose upgrade macros to step through the versions one at a time. See: Upgrading suite to vn12.0 - #4 by dcase
Once you get to vn13.1 the fcm_make app will need a config branch. In the rose_app.conf file, set the following:
You also need to upgrade one of the code branches. To do this:
Create a new branch:
fcm bc remove_defensive_checks fcm:firstname.lastname@example.org
Check out the branch:
fcm co fcm:um.x-br/dev/torstenauerswald/vn13.1_remove_defensive_checks
Merge in the vn11.6 branch and commit the changes:
fcm merge fcm:um.x-br/dev/annetteosprey/vn11.6_remove_defensive_checks
I should add: before you commit the branch, check that there aren’t any merge errors and that the changes have been copied in correctly!
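That pre-commit check could look something like the sequence below, run from inside the checked-out vn13.1 working copy once the merge has finished. It is only a suggested order using standard FCM subcommands, not a required procedure.

```shell
# Run inside the working copy, after `fcm merge` has finished.
fcm status      # files flagged "C" are in conflict
fcm conflicts   # step through and resolve any conflicts interactively
fcm diff        # review the merged changes before committing
fcm commit      # commit only once the diff looks right
```

If `fcm status` shows no "C" entries, there is nothing to resolve and you can go straight to reviewing the diff.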
Thanks for the suggestions. I followed your instructions and everything seemed to work. My suite is now at version 13.1 and I was able to create the branch for the defensive checks for version 13.1. However, when I try to run my suite the fcm_make2_um task fails with the following error:
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=3183164.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.
Since the error indicates that the job ran out of memory, I tried increasing the number of tasks in archer2.rc, hoping that this would give me access to more RAM. I set ntasks=4 and ROSE_TASK_N_JOBS = 4 (both were set to 1 before) in the [[HPC_SERIAL]] section.
The compilation ran 20 min longer but then still crashed with the same error. I also did a test where I upgraded the suite to vn11.7, and I didn’t have any problems there.
Do you have any suggestions for what I could do to fix this?
and search for
# Define memory required for this jobs. By default, you would
# get just under 2 GB, but you can ask for up to 125 GB.
We’d be interested to know what memory is required.
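One way to find out, assuming Slurm accounting is available to your account on ARCHER2, is to query a finished job with `sacct`. The job ID below is the one from the error message earlier in this thread; substitute the ID of whichever fcm_make2_um job you want to inspect.

```shell
# Job ID taken from the oom-kill message above; replace with your own job's ID.
JOBID=3183164
# ReqMem is the memory requested; MaxRSS is the peak memory recorded per job step.
sacct -j "$JOBID" --format=JobID,JobName,ReqMem,MaxRSS,Elapsed,State
```

MaxRSS is sampled periodically, so for a job that was killed it is best read as a lower bound on what the compile needs.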
Thanks Grenville. That solved my problem. I tried 8 GB but it still failed with the same error. With 16 GB it worked.
The relevant part in my archer2.rc file looks like this now:
inherit = HPC_SERIAL
execution time limit = PT90M
After successfully compiling the UM, I ran into a problem with the UM run itself. The atmos_main task now runs very slowly. In vn11.7 one time step took about 1s or so; in vn13.1 it takes about 50s. A 1-month simulation in vn11.7 took about 4.5h (10800 time steps). With vn13.1 I completed 892 time steps in 12h.
Is there any experience with performance issues after upgrading to newer UM versions? I thought maybe some compile options need to be changed? I did make sure to set the config_root_path the way Annette suggested above. The config_revision field is empty.
I am running suite u-cv063.
I hope you can point me in the right direction.
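As a quick sanity check on the figures quoted above, the wall-clock cost per time step can be worked out from the run lengths. The `secs_per_step` helper is just an illustrative name, not something from the suite.

```shell
# Seconds of wall-clock time per model time step.
# Usage: secs_per_step <hours> <steps>
secs_per_step() {
    awk -v hours="$1" -v steps="$2" 'BEGIN { printf "%.1f\n", hours * 3600 / steps }'
}

secs_per_step 4.5 10800   # vn11.7, full month: prints 1.5
secs_per_step 12 892      # vn13.1, partial run: prints 48.4
```

So the vn13.1 run was roughly 30 times slower per step than the vn11.7 run, which is in line with the per-step timings reported.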
Can we rule out ARCHER2 being very slow? Is the slow performance repeatable?
Thanks Grenville. I re-submitted the run and this time it ran at normal speed (slightly faster than in vn11.7).