Memory space error at recon stage

mfleg · 10 June 2024 11:03

Hi, I’ve been trying to run a UKESM1.0 AMIP version of the UM with some changed inputs, working in u-de347. For now, I am just trying to run one month of the model to check it works with my changes. However, I have been getting this error for recon:

lib-4205 : UNRECOVERABLE library error
The program was unable to request more memory space.
tcmalloc: large alloc 1441714830712012800 bytes == (nil)

My jobs have also been stuck in queues for a long time; last time it took six days just to reach the (failed) recon stage, and the wait seems to get longer with every iteration I try. I don’t think I’ve been overusing the queues (I am just running this one simple job every few days), so I don’t think I am being deprioritised for that reason. It’s really frustrating because it’s now taking weeks just to find out whether my model setup works.

Any advice would be appreciated!

grenville · 10 June 2024 11:24

It looks like the reconfiguration is trying to read a file that has the wrong endianness. What inputs did you change?
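As a rough illustration of why wrong endianness can lead to an absurd allocation request: if a size or length field is written in one byte order and read in the other, a small number turns into a huge one. The sketch below uses only Python's `struct` module and an arbitrary example value (1024); it is not an attempt to decode the number in the log above, and the function name is made up for this example.

```python
import struct


def read_int64_both_ways(raw: bytes) -> tuple[int, int]:
    """Interpret the same 8 bytes as a big-endian and a little-endian int64."""
    big = struct.unpack(">q", raw)[0]
    little = struct.unpack("<q", raw)[0]
    return big, little


# A length field of 1024 written big-endian (arbitrary example value).
raw = struct.pack(">q", 1024)
big, little = read_int64_both_ways(raw)

print("read as big-endian:   ", big)
print("read as little-endian:", little)
```

A reader that trusts the second value as an array length would ask for far more memory than any node has.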

Grenville

mfleg · 10 June 2024 13:02

Hi Grenville, thanks for your reply.

I am trying to use my own values for surface albedo (stash requests 244 and 245). The original albedo file used by the model when no ancil file is specified is a .land file. I haven’t been able to work out exactly what format that is, but I am inputting a NetCDF file with a manually added .land extension. I’m not sure this is the way to go, to be honest; I had a discussion with Patrick about it here: Changing albedo climatology in JULES.

The file I am inputting is a copy of one of the albedo files in /work/y07/shared/umshared/ancil/atmos - I believe that is the destination the files are normally taken from when not manually specified - with some changes to the values.

grenville · 10 June 2024 14:33

Please give us read permission on your /work and /home spaces on ARCHER2.

On an ARCHER2 login node, run:

chmod -R g+rX /home/n02/n02/mfleg
chmod -R g+rX /work/n02/n02/mfleg
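To confirm the change took effect, the group permission bits can be inspected. This is a minimal sketch using only the Python standard library; the helper name is invented for this example, and it checks a single path rather than the whole tree.

```python
import os
import stat


def group_can_read(path: str) -> bool:
    """Return True if the group has read access (plus search access for directories)."""
    mode = os.stat(path).st_mode
    ok = bool(mode & stat.S_IRGRP)
    if stat.S_ISDIR(mode):
        # A directory also needs the group execute bit so it can be entered.
        ok = ok and bool(mode & stat.S_IXGRP)
    return ok


if __name__ == "__main__":
    for p in ("/home/n02/n02/mfleg", "/work/n02/n02/mfleg"):
        if os.path.exists(p):
            print(p, group_can_read(p))
```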

mfleg · 10 June 2024 15:10

Thanks Grenville, I’ve done that now

grenville · 11 June 2024 15:29

Michaela

Ancillary files are in the proprietary UM fields-file format (the .land extension is just a convenience and has no bearing on the file format).
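Because the extension says nothing about the contents, one quick way to see what a file really is, is to look at its first few bytes. NetCDF classic files start with the bytes `CDF`, and netCDF-4 files are HDF5 files that start with the HDF5 signature. The sketch below checks only for those two signatures; the function name and return labels are made up for this example, and an "unknown" result does not prove that a file is a valid UM fields file.

```python
import sys

HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def sniff_format(path: str) -> str:
    """Guess whether a file is NetCDF from its leading bytes."""
    with open(path, "rb") as f:
        head = f.read(8)
    if head[:3] == b"CDF":
        return "netcdf-classic"
    if head == HDF5_SIGNATURE:
        return "hdf5 (possibly netCDF-4)"
    return "unknown (not NetCDF by signature)"


if __name__ == "__main__":
    for p in sys.argv[1:]:
        print(p, "->", sniff_format(p))
```

Run against a NetCDF file that has simply been renamed to .land, this would still report NetCDF, which is what the reconfiguration then trips over.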

I think there is an easier way to create your own ancillary file for stash items 244 and 245.

1. Use xconv to convert the currently used albedo ancil file into a NetCDF file.
2. Use netcdf4 to change the data but preserve all metadata, variable names, etc.
3. Use xancil to convert the modified NetCDF file back to ancillary file format.

I ran a test case, without step 2, starting from (arbitrarily) /work/y07/shared/umshared/ancil/atmos/n96e/general_land/GlobAlbedo/v2/qrclim.land that worked without any problems.

xconv (very intuitive) and xancil (Xancil 0.58 documentation) are both installed on ARCHER2.

Grenville

mfleg · 1 July 2024 12:37

Hi Grenville,

Thanks for this - this seems to have solved my issue as far as I can tell. I’ve been trying to test it and I am now running into another problem, getting this error:

? Error code: 1
? Error from routine: portio2a:flush_unit_buffer
? Error message: Failed in output_buffer()
? Error from processor: 0
? Error number: 72

It seems to be a disk space issue. Would it be possible to increase my /work quota, please?
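When judging how much quota a run needs, it can help to total up what a directory tree already holds. The sketch below sums apparent file sizes using only the standard library; the function name is invented here, and the figure may differ from what the filesystem's own quota accounting reports.

```python
import os


def tree_size_bytes(root: str) -> int:
    """Sum the apparent sizes of all regular files and links under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                # lstat so that symlinks are counted as links, not followed.
                total += os.lstat(full).st_size
            except OSError:
                # File vanished or is unreadable; skip it.
                pass
    return total


if __name__ == "__main__":
    root = "/work/n02/n02/mfleg"
    if os.path.isdir(root):
        print(f"{tree_size_bytes(root) / 1e12:.3f} TB used under {root}")
```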

grenville · 1 July 2024 13:17

increased to 1TB (enough?)

Grenville

mfleg · 1 July 2024 13:29

Should be enough, thank you!

