CESM2.X on ARC4

The Community Earth System Model (CESM) is a coupled climate model. It has seven components: atmosphere, ocean, land, sea ice, land ice, river runoff and ocean waves. The Common Infrastructure for Modeling the Earth (CIME) is a Python-based tool for coupling these components together. The latest release is CESM 2.1.3, which uses CIME 5.6. The centralised install has been configured to use the Intel compiler and Intel MPI.

Advanced Research Computing node 4 (ARC4) is the high performance computing facility provided by the Research Computing group at the University of Leeds.

Getting started on ARC4

All postgraduate researchers are entitled to an account on ARC4. You can request an account from IT, as detailed on the ARC4 docs page. Taught students may also be entitled, but must speak to their supervisor before submitting a request.

Once logged on, you can set up the correct module environment using

$ module use /nobackup/CESM/module_env
$ module load cesm

By default, this will load cesm2.1.3, but you can specify a version using cesm/<version>. Each time you log off ARC4 the environment is reset, so you will need to repeat these commands at the start of each new session. Loading the cesm module also loads the Earth System Modeling Framework (ESMF) module, a high-performance, flexible software infrastructure for building and coupling Earth system models. Both modules are found within the module_env directory.
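
For example, to request the 2.1.3 installation explicitly (assuming the module is named cesm/2.1.3 to match the installed version), you would use

$ module load cesm/2.1.3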

Batch Jobs

ARC4 uses a batch scheduler called Sun Grid Engine (SGE), which allocates resources and prioritises jobs that are submitted to the compute nodes through the queue. There is a fair share policy in place, so a user's priority decreases as they use more resources. More information on writing job submission scripts (job scripts) can be found here: ARC4 job scripts.
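
Although CESM jobs are normally submitted through the CIME scripts (see Submitting a job below), a minimal SGE job script for a standalone program might look like the following sketch. The resource values and the smp parallel environment name are illustrative only; check the ARC4 docs for the options available on the system.

#!/bin/bash
# Run in the current working directory and export the current environment
#$ -cwd -V
# Request one hour of wallclock time (the ARC4 maximum is 24 hours)
#$ -l h_rt=01:00:00
# Request 4 cores in a shared-memory parallel environment (illustrative)
#$ -pe smp 4
# The program to run (placeholder)
./my_program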

First time usage

The home directory for CESM2 code is located at /nobackup/CESM on ARC4, where you can find the ported versions (currently 2.1.3), module environments and input data directories. The ported versions are configured to run on ARC4, with the correct batch and compiler information, and these are the versions available for use. You do not need to download your own copy of the model code unless you want to use a different version, or make changes to the code. There are a few things you need to do to prepare your account for using these versions.

To use the case scripts, such as create_newcase, you will need to be added to the UNIX group arc-cesm2, which will give you the correct group permissions. Please contact CEMAC.

Make a directory in your $HOME directory to store case setups.

$ mkdir $HOME/cesm_prep

Scripts and configuration files from the central installation (/nobackup/CESM/cesm2.1.3) are copied here for use with a specific case, e.g. case.setup and env_run.xml. Your run submission outputs, run.*.o* and run.*.e*, will also be created and updated here.

Data output and detailed log files are written to a directory in /nobackup. If you do not already have a user directory there, you need to make one using

$ mkdir /nobackup/<username>

and replace <username> with your ARC4 username.

It is important to note that any files written to your /nobackup directory will not be backed up, as the name implies. This is temporary storage: any output that you wish to keep must be moved to appropriate storage, or it will be deleted after 90 days without access. Your $HOME directory is backed up once a week, but it is limited to 10 GB per user.
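
For example, output you want to keep could be copied from /nobackup to backed-up storage with rsync; both paths below are placeholders for your own directories:

$ rsync -av /nobackup/<username>/<directory_to_keep>/ <path_to_backed_up_storage>/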

Preparing a case

Change directory to your cesm_prep directory and use the create_newcase script, located in the central installation, to create your case:

$ cd $HOME/cesm_prep
$ $CESM_HOME/cime/scripts/create_newcase --case $HOME/cesm_prep/<case_name> --compset <compset> --res <grid_resolution>
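
For example, a fully coupled pre-industrial control case at nominal 1-degree resolution could be created with something like the following, where the case name b1850_test is arbitrary and B1850 / f09_g17 are just one valid compset and resolution pairing:

$ $CESM_HOME/cime/scripts/create_newcase --case $HOME/cesm_prep/b1850_test --compset B1850 --res f09_g17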

Change working directory into the case directory

$ cd $HOME/cesm_prep/<case_name>

Run setup and build

$ ./case.setup
$ ./case.build

After running ./case.setup, you can use ./preview_run to see the information that will be used at run time. This includes the number of nodes requested, the commands for setting the environment and the run commands. It is useful to check the run command is submitting your job to the correct queue, with the correct resources, and whether archiving is set up correctly.

Input data

The model is configured to look for the input data in a central directory, /nobackup/CESM/inputdata, so that all users can share downloaded files. This directory is currently read-only for all users except CEMAC staff. If you find your case is missing files, please contact CEMAC (cemac-support@leeds.ac.uk) and include the compset and resolution configuration that you want.
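
CIME also provides a check_input_data script in the case directory, which should report any files that your case requires but cannot find; a quick check before getting in touch might look like

$ ./check_input_data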

Archiving

The CIME framework allows for short-term and long-term archiving of model output. This is particularly useful when the model is configured to output to a small storage space and large files may need to be moved during larger simulations. On ARC4, the model is configured to use short-term archiving, but not yet long-term archiving.

Short-term archiving is on by default for compsets and can be toggled on and off by setting the DOUT_S parameter to True or False (see Making changes to a case). When DOUT_S=TRUE, calling ./case.submit will automatically submit a “st_archive” job to the batch system, which is held in the queue until the main job is complete. This job can be configured in the same way as the main job (different queue, wallclock time, etc.), although the defaults should be appropriate in most cases. Note that the main job and the archiving job share some parameter names, so a flag (--subgroup) specifying which one you want to change, if not both, should be used.

The archive is currently set up to move .nc files and logs from /nobackup/<username>/case_sims/<case_root> to /nobackup/<username>/case_sims/archive. As such, the quota being used is the communal /nobackup space whether archiving is switched on or off. There is a lot of storage available in this space, however it is not backed up and output should only be left there for short periods, as it will be removed after 90 days. If you want to archive your files directly to a different location, this can be set using the DOUT_S_ROOT parameter.
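
For example, from within the case directory you could switch archiving off for an individual case, or point it at a different location (the path shown is a placeholder), using

$ ./xmlchange DOUT_S=FALSE
$ ./xmlchange DOUT_S_ROOT=/nobackup/<username>/<my_archive>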

Making changes to a case

After creating a new case, the CIME functions can be used to make changes to the case setup, such as the wallclock time, the number of cores, and so on. ARC4 has a maximum job time limit of 24 hours and has 40 cores per node.

You can query settings using the function

$ ./xmlquery <name_of_setting>

Adding -p as a flag allows you to look up partial names, e.g.

$ ./xmlquery -p JOB

Output:
Results in group case.run
        JOB_QUEUE: 40core-192G.q
        JOB_WALLCLOCK_TIME: 01:30:00

Results in group case.st_archive
        JOB_QUEUE: 40core-192G.q
        JOB_WALLCLOCK_TIME: 0:20:00

When you know which setting you want to change, you can do so using

$ ./xmlchange <name_of_setting>=<new_value>

For example, to change the wallclock time of the main job to 30 minutes without knowing the exact setting name, you could do

$ ./xmlquery -p WALLCLOCK

Output:
Results in group case.run
        JOB_WALLCLOCK_TIME: 01:30:00

Results in group case.st_archive
        JOB_WALLCLOCK_TIME: 0:20:00

$ ./xmlchange JOB_WALLCLOCK_TIME=00:30:00 --subgroup case.run

$ ./xmlquery JOB_WALLCLOCK_TIME

Output:
Results in group case.run
        JOB_WALLCLOCK_TIME: 00:30:00

Results in group case.st_archive
        JOB_WALLCLOCK_TIME: 0:20:00

Note

The flag --subgroup case.run is used to change only the main job wallclock without affecting the st_archive wallclock.

Note

If you try to set a parameter equal to a value that is not known to the program, it might suggest using a --force flag. This may be useful, for example, in the case of using a queue that has not been configured yet, but use with care!

Some changes to the case must be done before calling ./case.setup or ./case.build, otherwise the case will need to be reset or cleaned, using ./case.setup --reset and ./case.build --clean-all. These are as follows:

  • Before calling ./case.setup, changes to NTASKS, NTHRDS, ROOTPE, PSTRID and NINST must be made, as well as any changes to the env_mach_specific.xml file, which contains some configuration for the module environment and environment variables.

  • Before calling ./case.build, ./case.setup must have been called and any changes to env_build.xml and Macros.make must have been made. This applies whether you edit the files directly or use ./xmlchange to alter the variables.

Many of the namelist variables can be changed just before calling ./case.submit.
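
These namelist changes are typically made by appending to the user_nl_<component> files in the case directory. For example, assuming the atmosphere component is CAM, switching its history output to daily frequency might look like

$ echo "nhtfrq = -24" >> user_nl_cam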

Submitting a job

To use ARC4 compute nodes, you submit a job to the queue through the batch scheduler. This requires you to request the number of cores (or nodes), memory and time needed, and to specify the program you wish to run. The batch scheduler takes your resource requests into account and allocates your job a place in the queue. More information on writing job submission scripts (job scripts) can be found here: ARC4 job scripts.

For CESM2, the CIME framework has been configured for ARC4 so that you can use the package functions to submit jobs. Before you do so, you can preview commands that will be run at submission time using

$ ./preview_run

You can submit the job using

$ ./case.submit

The default queue is 40core-192G, which has 5960 cores available for use (ARC4 has 40 cores per node). There are other queues available, though most are privately owned. If you have access to another queue through a specific project, you can choose it when preparing your case using ./xmlchange. On ARC4 you would normally specify just the project, and the scheduler would route your job to the queue assigned to that project. With the way CESM2 is set up, however, if only the project is specified the job defaults to the main queue, which can cause it to hang. Both the project and the associated queue must therefore be specified, e.g.

$ ./xmlchange PROJECT=<name_of_project>
$ ./xmlchange JOB_QUEUE=<name_of_queue>

Note

This will change the PROJECT and JOB_QUEUE for the short-term archive job as well (st_archive). If you only want to change the queue of your main job, you can add the --subgroup case.run flag as in Making changes to a case.

Note

If you try to set a queue name that is not known, you will need to use the --force flag.
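
Putting the two notes together, a queue change restricted to the main job, with an unrecognised queue name forced through, might look like this (the queue name is a placeholder):

$ ./xmlchange JOB_QUEUE=<name_of_queue> --subgroup case.run --force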

You can check the qsub command again using ./preview_run and the case can then be run using the submit command, as before with ./case.submit.

Monitoring jobs

You can check the status of all your jobs using the command

$ qstat -u $USER

Also, if you want to check the demand on the queues you can use

$ qstat -g c

to see a table of node availability.

Jobs can be cancelled using

$ qdel <job_id>

For further information, see Monitoring jobs on ARC4 in the ARC4 docs.

Troubleshooting

If a run fails, the first place to check is your run submission outputs, run.*.o* and run.*.e*, within $HOME/cesm_prep/<case_name>. In particular, the .e file (for error output) may give some indication of the problem and may point you to the cesm log files in /nobackup/<username>/cesm_sims/<case_name>/run/ for more information. The same directory also contains log files for each of the coupled components (atm, lnd, etc.), which you can check for errors.
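
For instance, the tail end of the cesm log (where a run-time error usually appears) could be inspected with something like

$ tail -n 50 /nobackup/<username>/cesm_sims/<case_name>/run/cesm.log.*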

Note

If you are using short term archiving (DOUT_S=True), these log files will be located at /nobackup/<username>/cesm_sims/archive/<case_name>/logs/.

For more information on the job from the batch system, you can use qacct -j <job_id> at any time, or qstat -j <job_id> while the job is queued or running.