fcm_make starts the build even though it should be postponed to fcm_make2

Hi,

I am trying to configure user suite u-ck730 (https://code.metoffice.gov.uk/trac/roses-u/browser/c/k/7/3/0/trunk?rev=215323) to do the extract without building in fcm_make, and postpone the build step to fcm_make2. The reason is that I would like to trigger the build of the fcm_make2 phase interactively on a GPU node of the CSD3 cluster in Cambridge.

However, the fcm_make phase also triggers a build on our front end (we want to deactivate the build in fcm_make). Can you tell me why the build is triggered in fcm_make?

Thanks,
Kiril Dichev

Hi Kiril,

I’ve just graphed your suite and it correctly shows two tasks for the build: fcm_make_um and fcm_make2_um. The first does the extract and mirror of the source code, and fcm_make2_um then does the build.

However, in the site/csd3.rc all the remote hosts are set to localhost, so it will effectively run the extract and build on localhost in two steps.

I’d suggest changing the host = localhost line in the [[HPC]] family to switch the build to your HPC.
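
For illustration only, a rough sketch of the sort of change I mean in site/csd3.rc (the hostname is a placeholder, not a tested setting):

[[HPC]]
    [[[remote]]]
        # send these tasks to the HPC rather than running them locally
        host = <your-hpc-hostname>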

Hope I have understood correctly.

Cheers,
Ros.

Hi Ros,

Thanks for the reply. Yes, it is probably not correct that I have host = localhost in all settings. Let me just clarify how I wish to proceed:

  • I want fcm_make to do the extract and mirror of the source code on the localhost (the front-end node).
  • I then want fcm_make2 to generate a ‘job’ script in a cylc-run/…/fcm_make2/… directory. That way I can finish the fcm_make2 build interactively, by starting an interactive session on a GPU node and manually running ‘job’ in the right directory (see the sketch below).
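
For example (a sketch - the hostname, timestamp, and cycle point are illustrative):

ssh gpu-q-1                          # interactive session on a GPU node
cd ~/cylc-run/u-ck730/log.<timestamp>/job/<cycle>/fcm_make2_um/01
sh ./job                             # run the generated job script by hand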

This approach was proposed to me by a user of ARCHER, who has proceeded in a similar fashion during training courses requiring compilation on worker nodes. I need the linking of my binaries to happen on a worker node, as the front-end compilation fails with the compiler I need to use: it searches for newer GCC libraries than the front end has.

So far I seem unable to generate the ‘job’ script.

Do you think changing the ‘host = localhost’ line will help me in this case? Any hint on how I should set it? I am a bit short on existing examples of what I want to do. If it helps, the front ends all have hostnames matching the pattern login-e-*, and all worker nodes I wish to use match the pattern gpu-q-*.

Thanks,
Kiril

Hi Kiril,

Thanks for the explanation; I think I understand now. You shouldn’t need to change the host = localhost line after all. I had assumed you were submitting from a different machine to the one you’re running on.

So the problem is that Cylc only generates the job script when it’s ready to submit the task to run; the two are inextricably linked. Cylc doesn’t have an option to generate the job script without submitting the job.

So the training you refer to got around this by adding a manual-build option to the suite, so that the fcm_make2_um task fails to submit to the queues due to an invalid account code. You can then log on to the build platform and manually run the job script.
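
Roughly, the idea is something like this in the suite.rc (illustrative only - I’m assuming a Slurm-style scheduler here; the exact directive name depends on your batch system):

[[fcm_make2_um]]
    [[[directives]]]
        # deliberately invalid account code: submission fails, but the
        # generated job script is left behind to be run by hand
        --account = set_this_by_hand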

Not knowing how the Cambridge cluster is set up - is the use of an invalid scheduler option something that could be done there too?

Regards,
Ros.

Hi Ros,

Thanks for your help with this. We’ve been trying to replicate how I set up the UKCA training suite, and I know just enough cylc to know it should be possible, but not enough to know where it’s going wrong.

I’m confused as to why fcm_make_um is doing the build at all, as the suite.rc doesn’t include UMBUILD_RESOURCE there - it is only placed in fcm_make2_um - so I can’t work out where that job is actually being given the instructions to do the build.

“All” we want is for fcm_make_um to do just the extract and then (at least) generate the job script for the build in fcm_make2_um (roughly as sketched below). Due to different environments we can’t compile for the A100 nodes on the login node, and we can’t authenticate to MOSRS on the A100 nodes (so we can’t do the extract there). Once we have the build-only job script we could then run it manually on the A100 nodes, as in the training. It’s not a sustainable solution for lots of users, but it should be acceptable enough to then allow us to test running UKCA on these nodes.
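
For reference, the split we’re after looks roughly like this (a sketch using the suite’s task and family names, not the literal suite.rc):

[runtime]
    [[fcm_make_um]]
        # extract (and mirror) only, on the login node
        [[[remote]]]
            host = localhost
    [[fcm_make2_um]]
        # build step, aimed at the A100 nodes
        inherit = UMBUILD_RESOURCE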

Many thanks and best wishes,
Luke

Hi Luke, Kiril,

I’d guess it’s to do with incorrect build steps being passed through to FCM. Can you commit your cluster config branch (vn11.8_cambs_config_11_8) so I can try running locally to see what’s going on?

Cheers,
Ros.

Flying blind… so apologies if you already have this, but make sure you have these lines in the platform config file (e.g. in fcm-make/csd3-x86-nvhpc/um-atmos-safe.cfg):

$extract{?} = extract
$mirror{?} = mirror
$steplist{?} = $extract $mirror

These should cause the steplist to be correctly set to extract and mirror only for the fcm_make_um step.

Cheers,
Ros.

Just to add: you may get away without the mirror, as the extract and build are on the same platform, but I have never tried splitting into a two-step fcm_make without the mirror step, so I have no idea whether that would work or not…

Good news - the um-atmos-safe.cfg did not have the steplist option. Once I added it, the fcm_make step succeeded, and I think it is no longer trying to build.
Also, I see that we finally have the job script here:

./u-ck730/log.20220119T120912Z/job/19880901T0000Z/fcm_make2_um/01/job

Which is what we wanted.

The branch is https://code.metoffice.gov.uk/svn/um/main/branches/dev/arjentamerus/vn11.8_cambs_config_11_8/ by the way. But I think we might have figured this out.

Excellent. That sounds very encouraging. :grinning:

Due to different environments we can’t compile for the A100 nodes on the login node, and we can’t authenticate to MOSRS on the A100 nodes (so we can’t do the extract there). Once we have the build-only job script we could then run it manually on the A100 nodes, as in the training.

Completely out of curiosity: I totally understand not being able to compile on the login nodes or do the extract from the A100 nodes, which is exactly the same situation as we have on ARCHER, but what’s the reasoning for not being able to automatically submit the build step to run on the A100 nodes?

I believe our thinking is that by interactively re-running the job script, we can also do development on the fly, which we can then commit. The automated process is something we might adopt at a coarser-grained level, once we have large portions of our code changes working.

Worth mentioning: The development we do needs to access GPUs at all times.

As Kiril says, we’re developing how this works and so the initial aim is to get it compiling on the GPU nodes - once this works we can investigate batch submission, but it may not actually be necessary for this work. We’ll probably be tinkering a lot with the code and so having a bit more control would be useful at this stage.

At the moment the fcm_make2 job script is not happy; it does not seem to find a configuration:

[FAIL] make 2 config-parse # 0.1s
[FAIL] make 2 # 0.1s
[FAIL] no configuration specified or found

Any pointers where to look?

If I am not mistaken, the following file is read before the error message:

cat /home/kd486/cylc-run/u-ck730/app/fcm_make_um/file/fcm-make.cfg
use = $prebuild

include = $config_root_path/fcm-make/$platform_config_dir/um-$config_type-$optimisation_level.cfg$config_revision

extract.location{diff}[um] = $um_sources
extract.location{diff}[shumlib] = $shumlib_sources
extract.location{diff}[casim] = $casim_sources
extract.location{diff}[jules] = $jules_sources
extract.location{diff}[socrates] = $socrates_sources

Hmmmm, I think this is to do with the fact that you’re doing the fcm_make and fcm_make2 on the same host.

If I run your suite as-is, I get the same error as you, but if I switch fcm_make2_um to run on another machine it generates and finds the config fine. This might sound odd, but can you by any chance ssh from the login node to itself without any authentication prompt? We do this on Monsoon2 and the two-step then works fine. It would at least allow you to progress, even if it is a bit hacky.
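
In case it’s useful, this is roughly what I mean (standard OpenSSH; whether your site policy allows a passphrase-less key is another question):

ssh-keygen -t ed25519 -f ~/.ssh/id_localhost -N ''    # key with no passphrase
cat ~/.ssh/id_localhost.pub >> ~/.ssh/authorized_keys
ssh -i ~/.ssh/id_localhost localhost hostname         # should return without prompting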

I am afraid I can’t do that on the Cambridge login node. We are strongly discouraged from using keys without at least a passphrase - so even though I just set up a passwordless ‘ssh localhost’ from the login node to itself, I still need to enter my passphrase. I’m not sure whether a password-less and passphrase-less login would have made a difference - but so far the result is the same.

I thought it was probably a forlorn hope that you’d be able to, but worth a shot! In case you haven’t found it, the config file that should be used for the next step of the fcm_make is in the share/fcm_make_um/mirror directory.

It still looks like it’s not fully realising it’s a two-step make, as the file should be named fcm-make2.cfg, have name = 2 within it, and be copied into the directory above (at least that is what you get when the hosts are different). It’s a total hack, but it might allow you to do a proof of concept: you could try manually copying it into the share/fcm_make_um directory, naming it fcm-make2.cfg, and then running the fcm_make2_um task.
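
A sketch of that hack, from the top of the suite’s run directory (untested here - the name = 2 check is based on what I see when the hosts are different):

cd ~/cylc-run/u-ck730
cp share/fcm_make_um/mirror/fcm-make.cfg share/fcm_make_um/fcm-make2.cfg
grep 'name = 2' share/fcm_make_um/fcm-make2.cfg    # add the line if it's missing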

I’ve also tried turning the mirror off and that didn’t help either. :unamused:

If I don’t have any inspiration by tomorrow morning, I’d probably suggest asking the FCM/Rose folks at the Met Office.

Cheers,
Ros.

Not to worry. We made good progress, but the entire toolchain is pretty complex :frowning:

Yes, unfortunately it is complicated, and not helped by you having a not entirely straightforward setup too. :slightly_frowning_face: I’m sure there will be a way around it - it can’t be that uncommon a setup.

For my own sanity, I just took a Monsoon job that does a two-step make with both steps on the same machine (but with the fcm_make2 step set to log in to itself), switched both steps to localhost, and it bails out with the exact same issue you’ve currently got.

Hi Ros,

I did try manually copying

cp ./share/fcm_make_um/mirror/fcm-make.cfg share/fcm_make_um/fcm-make2.cfg

And then I did (on a GPU node):
sh ./log.20220124T084613Z/job/19880901T0000Z/fcm_make2_um/01/job

I can confirm that the compilation was successful after that. It is not quite as expected, as the compilation seems to happen in the fcm_make_um build directory, and the binary is generated under

./share/fcm_make_um/build-atmos/bin/um-atmos.exe

But I suppose I can temporarily work with this.