Our Experiments

Overview

Teaching: min
Exercises: min

Questions

How can I confirm if everything ran correctly?

How can I troubleshoot errors?

Objectives

We have three cases (or experiments) we are working with. We will take a look at what happened with each one of them.

b.day1.0

This was our first case, which initially ran for 5 days. We then set CONTINUE_RUN=TRUE and ran it again for another 5 days. This run should have completed successfully. How can we confirm this?

Check your CaseStatus file

Go to your case directory for this case and look at the end of the CaseStatus file
tail CaseStatus
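The same check can be scripted. The following is a minimal sketch, assuming CaseStatus entries follow the "timestamp: step status" pattern seen in the CaseStatus listing later in this lesson. The function name last_status and the regular expression are illustrative, not part of CESM.

```python
import re

# Matches status lines such as "2025-03-14 09:16:04: case.setup success"
STATUS_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}): (.+?) (starting|success|error)\b"
)


def last_status(lines):
    """Return (timestamp, step, status) from the last status line, or None."""
    result = None
    for line in lines:
        match = STATUS_RE.match(line.strip())
        if match:
            result = match.groups()
    return result


# Sample lines in the CaseStatus format. In practice, read the real file:
#     with open("CaseStatus") as handle:
#         lines = handle.readlines()
sample = [
    "2025-03-14 09:16:01: case.setup starting",
    " ---------------------------------------------------",
    "2025-03-14 09:16:04: case.setup success",
    " ---------------------------------------------------",
]

print(last_status(sample))
```

If the last entry reports "error" rather than "success", the next step is to open the log file named in the error message.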

Now let’s check out the output from this case. Remember, it is located in the DOUT_S_ROOT directory.

cases/b.day1.0> ./xmlquery DOUT_S_ROOT


	DOUT_S_ROOT: /glade/derecho/scratch/cstan/archive/b.day1.0

We will go there and see if we now have 10 days of data.

cd /glade/derecho/scratch/cstan/archive/b.day1.0
cd ocn/hist

We now have two output files for the ocean: b.day1.0.pop.h.nday1.0001-01-01.nc and b.day1.0.pop.h.nday1.0001-01-06.nc.

b.day1.0.pop.h.nday1.0001-01-01.nc is the same file we looked at last week, which has the first 5 days.
b.day1.0.pop.h.nday1.0001-01-06.nc contains the next 5 days that we ran by setting CONTINUE_RUN=TRUE.

Use ncview to look at these files.

Second case

This is the case that we are running for 4 years with daily precipitation and standard monthly output to use for Assignment #3. Assuming the configuration and namelist changes were entered correctly, this run should have completed successfully.

Check your CaseStatus file.
If there are errors, check your log file.
If no errors, check your output:

There should be monthly and daily output for the atmosphere. Let’s confirm:

cd /glade/derecho/scratch/cstan/archive/run.2/atm/hist
ls

The run.2.cam.h0.*.nc files contain monthly averaged data.

The run.2.cam.h1.*.nc files contain daily averaged data.

What is in these files?

We will look at each file using ncdump -h to understand what is in the files.

What variables are in the h0 files? What variables are in the h1 files?

We set this with the namelist options

fincl2 = 'PRECC', 'PRECL'

nhtfrq = 0, -24

How many time samples are in each h0 file?

How many time samples are in each h1 file?

We set this with the mfilt = 1,1 namelist option.

Remember, you can look up namelist options

mfilt

Array containing the maximum number of time samples written to a history file. The first value applies to the primary history file, the second through tenth to the auxiliary history files. Default: 1,30,30,30,30,30,30,30,30,30
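To see how nhtfrq and mfilt together determine the number of history files, here is a rough sketch of the arithmetic. It assumes a 365-day (no-leap) calendar and treats nhtfrq = 0 as monthly averages and negative values as hours, as in the settings above. The function names are hypothetical, and the counts are estimates rather than a guarantee of what the model writes.

```python
import math


def samples_per_run(nhtfrq, n_years, days_per_year=365):
    """Estimate the number of time samples written over the whole run."""
    if nhtfrq == 0:
        # 0 means monthly averages: 12 samples per year
        return n_years * 12
    if nhtfrq < 0:
        # Negative values give the output interval in hours
        hours = -nhtfrq
        return n_years * days_per_year * 24 // hours
    raise ValueError("positive nhtfrq (model timesteps) is not handled in this sketch")


def files_per_run(nhtfrq, mfilt, n_years):
    """Estimate the number of history files, with mfilt samples per file."""
    return math.ceil(samples_per_run(nhtfrq, n_years) / mfilt)


# Our 4-year case: nhtfrq = 0, -24 and mfilt = 1, 1
print("h0 files:", files_per_run(0, 1, 4))      # 48
print("h1 files:", files_per_run(-24, 1, 4))    # 1460
# With the default mfilt of 30 for the auxiliary file instead
print("h1 files with mfilt=30:", files_per_run(-24, 30, 4))  # 49
```

Setting mfilt to 1 for the daily file therefore produces one file per day, which explains the large number of h1 files in the history directory.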

We have lots of history files and we can look at each of them using ncview, but that is not very useful.

We can read in all the files using Python xarray, but it's a lot of data.

There are some useful tools for postprocessing the data to get timeseries files and easily take a look at some common diagnostics. We will learn about those later in this class.

BRANCH case:

This is the branch case we ran with lots of configuration changes and namelist changes. This run produces an error with the configuration provided.

We can review how we created and setup our run by looking at the first line of the README.case file:

cases/branchwrong> head -n1 README.case

2023-03-05 17:55:30: ./create_newcase --case /glade/u/home/cstan/cases/branchwrong --res f19_g17 --compset B1850 --project UGMU0035

You can see that I initially made a mistake in my create_newcase by mistyping the project number.

We can review any changes we made to the configuration of the run by looking at CaseStatus:

cases/branchwrong> more CaseStatus

2025-03-14 09:15:36: xmlchange success <command> ./xmlchange RUN_TYPE=branch,RUN_REFCASE=b.day1.0,RUN_REFDATE=0001-01-05,CLM_NAMELIST_OPTS=,GET_REFCASE=FALSE,STOP_OPTION=nmonths,STOP_N=1,RESUBMIT=1,CCSM_CO2_PPMV=569.4  </command>
 ---------------------------------------------------
2025-03-14 09:16:01: case.setup starting
 ---------------------------------------------------
2025-03-14 09:16:04: case.setup success
 ---------------------------------------------------
2025-03-14 09:16:30: case.build starting
 ---------------------------------------------------
2025-03-14 09:16:35: case.build error
ERROR: Missing required pointer_file /glade/derecho/scratch/cstan/branchwrong/run/rpointer.ocn.restart ---has pop initial data been prestaged to /glade/derecho/scratch/cstan/branchwrong/run?
 ---------------------------------------------------
2025-03-14 09:24:24: case.build starting
 ---------------------------------------------------
CESM version is release-cesm2.1.5
Processing externals description file : Externals.cfg (/glade/work/cstan/cesm2.1.5)
Processing externals description file : Externals_CAM.cfg (/glade/work/cstan/cesm2.1.5/components/cam)
Processing externals description file : Externals_CISM.cfg (/glade/work/cstan/cesm2.1.5/components/cism)
Processing externals description file : Externals_CLM.cfg (/glade/work/cstan/cesm2.1.5/components/clm)
Processing externals description file : Externals_POP.cfg (/glade/work/cstan/cesm2.1.5/components/pop)
Checking local status of required & optional components: cam, chem_proc, carma, clubb, cosp2, cice, cime, cism, source_cism, clm, fates, mosart, pop, cvmix, marbl, rtm, ww3,
    ./cime
        clean sandbox, on cime5.6.49
    ./components/cam
        clean sandbox, on cam_cesm2_1_rel_60
    ./components/cam/chem_proc
        clean sandbox, on tools/proc_atm/chem_proc/release_tags/chem_proc5_0_03_rel
    ./components/cam/src/physics/carma/base
        clean sandbox, on carma/release_tags/carma3_49_rel
    ./components/cam/src/physics/clubb
        clean sandbox, on vendor_clubb_r8099_n03
    ./components/cam/src/physics/cosp2/src
        clean sandbox, on v2.1.4cesm
    ./components/cice
        clean sandbox, on cice5_cesm2_1_1_20231220
    ./components/cism
        clean sandbox, on cism-release-cesm2.1.2_04
    ./components/cism/source_cism
        clean sandbox, on release-cism2.1.04
    ./components/clm
        clean sandbox, on release-clm5.0.37
    ./components/clm/src/fates
        clean sandbox, on sci.1.30.0_api.8.0.0
    ./components/mosart
        clean sandbox, on release-cesm2.0.04
    ./components/pop
        clean sandbox, on pop2_cesm2_1_rel_n15
    ./components/pop/externals/CVMix
        clean sandbox, on v0.93-beta
    ./components/pop/externals/MARBL
        clean sandbox, on cesm2.1-n00
    ./components/rtm
        clean sandbox, on release-cesm2.0.04
    ./components/ww3
        clean sandbox, on ww3_181001
2025-03-14 09:32:46: case.build success
 ---------------------------------------------------
2025-03-14 09:35:32: case.submit starting
 ---------------------------------------------------
2025-03-14 09:35:48: case.submit error
ERROR: Command: 'qsub -q main -l walltime=12:00:00 -A UGMU0035 -l job_priority=regular -v ARGS_FOR_SCRIPT='--resubmit' .case.run' failed with error 'b'qsub: Invalid account for CPU usage, available accounts:\nProject, Status, Active\nP06010014, Normal, True\nUGMU0049, Normal, True'' from dir '/glade/u/home/cstan/cases/branchwrong'
 ---------------------------------------------------
2025-03-14 09:38:07: xmlchange success <command> ./xmlchange PROJECT=UGMU0049  </command>
 ---------------------------------------------------
2025-03-14 09:45:56: case.submit starting
 ---------------------------------------------------
2025-03-14 09:46:11: case.submit success case.run:8798603.desched1, case.st_archive:8798604.desched1
 ---------------------------------------------------
2025-03-14 10:28:11: case.run starting
 ---------------------------------------------------
2025-03-14 10:28:21: model execution starting
 ---------------------------------------------------
2025-03-14 10:28:29: model execution success
 ---------------------------------------------------
2025-03-14 10:28:29: case.run error
ERROR: RUN FAIL: Command 'mpiexec  --label  --line-buffer  -n 768 /glade/derecho/scratch/cstan/branchwrong/bld/cesm.exe  >> cesm.log.$LID 2>&1 ' failed
See log file for details: /glade/derecho/scratch/cstan/branchwrong/run/cesm.log.8800215.desched1.250314-112041
 ---------------------------------------------------

Another thing we did that is not documented automatically is to copy the restart files from our b.day1.0 case to our new run directory. This was so the model has a set of restart files to start the run from.

cp /glade/derecho/scratch/cstan/archive/b.day1.0/rest/0001-01-16-00000/* /glade/derecho/scratch/cstan/branchwrong/run/

How do we figure out what went wrong?

Look at your log file and use grep -i to find errors.

cases/branchwrong> grep -i error /glade/derecho/scratch/cstan/branchwrong/run/cesm.log.8800215.desched1.250314-112041

ERROR: GETFIL: FAILED to get b.day1.0.cam.r.0001-01-05-00000.nc
ERROR: GETFIL: FAILED to get b.day1.0.cam.r.0001-01-05-00000.nc
ERROR: GETFIL: FAILED to get b.day1.0.cam.r.0001-01-05-00000.nc
ERROR: GETFIL: FAILED to get b.day1.0.cam.r.0001-01-05-00000.nc
ERROR: GETFIL: FAILED to get b.day1.0.cam.r.0001-01-05-00000.nc
ERROR: GETFIL: FAILED to get b.day1.0.cam.r.0001-01-05-00000.nc
ERROR: GETFIL: FAILED to get b.day1.0.cam.r.0001-01-05-00000.nc
ERROR: GETFIL: FAILED to get b.day1.0.cam.r.0001-01-05-00000.nc
...

Why is the error repeated many times?

The model runs on many processors. Each one is reporting the error.
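Because every processor writes the same message, it can help to collapse the duplicates and count them. A minimal sketch, assuming the log lines have already been read into a list; the function name unique_errors is hypothetical.

```python
from collections import Counter


def unique_errors(lines):
    """Return (message, count) pairs for lines containing 'error', most frequent first."""
    counts = Counter(
        line.strip() for line in lines if "error" in line.lower()
    )
    return counts.most_common()


# Sample lines taken from the grep output above. In practice, read the log file:
#     with open(log_path) as handle:
#         lines = handle.readlines()
sample = [
    "ERROR: GETFIL: FAILED to get b.day1.0.cam.r.0001-01-05-00000.nc",
    "ERROR: GETFIL: FAILED to get b.day1.0.cam.r.0001-01-05-00000.nc",
    "ERROR: GETFIL: FAILED to get b.day1.0.cam.r.0001-01-05-00000.nc",
]

for message, count in unique_errors(sample):
    print(f"{count:5d}  {message}")
```

Seeing a single distinct message repeated many times suggests one underlying problem rather than many separate ones.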

What does the error mean?

This tells us that the model is trying to get a file called b.day1.0.cam.r.0001-01-05-00000.nc and cannot find it.

Let’s go back to our configuration and think back about how we set up this experiment. What do all the configuration changes mean? We can take a look in env_run.xml to confirm what each setting means.

RUN_TYPE=branch: This is a branch run
RUN_REFCASE=b.day1.0: Reference case for hybrid or branch runs
RUN_REFDATE=0001-01-05: Reference date for hybrid or branch runs (yyyy-mm-dd)
CLM_NAMELIST_OPTS=’’: CLM-specific namelist settings for -namelist option in the CLM build-namelist. CLM_NAMELIST_OPTS is normally set as a compset variable and in general should not be modified for supported compsets. It is recommended that if you want to modify this value for your experiment, you should use your own user-defined component sets via using create_newcase with a compset_file argument. This is an advanced flag and should only be used by expert users.

It seems this option was provided in the NCAR tutorial example, but is not necessary.

GET_REFCASE=FALSE: Flag for automatically prestaging the refcase restart dataset. If TRUE, then the refcase data is prestaged into the executable directory
STOP_OPTION=nmonths: Sets the run length along with STOP_N and STOP_DATE
STOP_N=1: Provides a numerical count for $STOP_OPTION.
RESUBMIT=1: If RESUBMIT is greater than 0, then the case will automatically resubmit. Since we later set our queue time to only 2 hours, there may be a need to resubmit to complete the run.
CCSM_CO2_PPMV=569.4: Mechanism for setting the CO2 value in ppmv for CLM if CLM_CO2_TYPE is constant or for POP if OCN_CO2_TYPE is constant. This is the CO2 value that gets propagated to the ocean and land models.
JOB_WALLCLOCK_TIME=2:00:00: The machine wallclock setting. This means how long we tell it to run in the queue. The maximum and default are 12:00:00, but we can get our run in more quickly if we tell it we need less time.

Do you see anything in the configuration that could have led to our error?

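Look at the RUN_REFDATE and the date we used for our restart file
Solution
The restart files we copied are dated 0001-01-16, but RUN_REFDATE is set to 0001-01-05, so the model looks for b.day1.0.cam.r.0001-01-05-00000.nc, which is not in the run directory. RUN_REFDATE should be 0001-01-16.
Fix it!
./xmlchange RUN_REFDATE='0001-01-16'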
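This kind of mismatch can be caught before submitting. The sketch below builds the CAM restart filename the model will ask for and checks it against the files staged in the run directory. The filename pattern is inferred from the error message above, the list of staged files is an illustrative assumption, and both function names are hypothetical.

```python
def expected_cam_restart(refcase, refdate, tod="00000"):
    """Build the CAM restart filename implied by RUN_REFCASE and RUN_REFDATE."""
    return f"{refcase}.cam.r.{refdate}-{tod}.nc"


def refdate_matches(refcase, refdate, filenames):
    """Return True if the expected CAM restart file is among the staged files."""
    return expected_cam_restart(refcase, refdate) in set(filenames)


# Assumed contents of the run directory after copying from rest/0001-01-16-00000.
# In practice, use os.listdir() on the run directory instead.
staged = ["b.day1.0.cam.r.0001-01-16-00000.nc", "rpointer.atm"]

for refdate in ("0001-01-05", "0001-01-16"):
    ok = refdate_matches("b.day1.0", refdate, staged)
    print(refdate, "found" if ok else "MISSING", expected_cam_restart("b.day1.0", refdate))
```

Only the CAM restart file is checked here; the other components have their own restart and rpointer files, which a fuller check would also cover.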

What did those namelist changes do?

We can look them up

In user_nl_cam

co2vmr=569.4e-6: CO2 volume mixing ratio. This is used as the time invariant surface value of CO2 if no time varying values are specified. Default: set by build-namelist.
ch4vmr = 1583.2e-9: CH4 volume mixing ratio. This is used as the time invariant surface value of CH4 if no time varying values are specified. Default: set by build-namelist.
inithist=’MONTHLY’: Frequency that initial files will be output. This produces initial condition files monthly.

Nothing looks questionable there.

Resubmit your case!

Key Points

History and Setup for Diagnostics

Overview

Teaching: 0 min
Exercises: 0 min

Questions

What is in our history files?

How do I set up the postprocessing and diagnostics packages?

Objectives

We will now return to the output from our 4-year case. Let’s go to the atmospheric history directory for our case. If your 4-year case did not run to completion, you are welcome to look at mine.

cd /glade/derecho/scratch/cstan/archive/run.2/atm/hist

History vs. Timeseries Files

History files: contain all of the variables for a component for a particular frequency and are output directly by the model.
Timeseries files: usually span a number of timesteps and contain only one major variable. They are created offline.

When NCAR provides output from their model simulations publicly, they typically provide timeseries files for a select set of variables.

Examples:

A history file: f40_test.cam.h0.1993-11.nc

1 monthly timestep (Nov 1993)
200+ CAM variables

A timeseries file: f40_test.cam.h0.PSL.199001-199912.nc

120 monthly timesteps (Jan 1990-Dec 1999)
1 CAM variable (PSL), along with coordinate variables like time, lat, lon, etc.
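The two kinds of files can usually be told apart by name alone. Here is a small sketch, assuming CAM filenames follow the two patterns in the examples above; the regular expressions and the classify function are illustrative and may not cover every naming variant.

```python
import re

# <case>.cam.h<N>.<date>.nc, where the date is YYYY-MM, optionally with -DD and -SSSSS
HISTORY_RE = re.compile(
    r"^(?P<case>.+)\.cam\.(?P<stream>h\d)\."
    r"(?P<date>\d{4}-\d{2}(?:-\d{2}(?:-\d{5})?)?)\.nc$"
)
# <case>.cam.h<N>.<variable>.<start>-<end>.nc
TIMESERIES_RE = re.compile(
    r"^(?P<case>.+)\.cam\.(?P<stream>h\d)\.(?P<var>\w+)\."
    r"(?P<start>\d{6,8})-(?P<end>\d{6,8})\.nc$"
)


def classify(filename):
    """Return ('history' | 'timeseries' | 'unknown', dict of parsed fields)."""
    match = HISTORY_RE.match(filename)
    if match:
        return "history", match.groupdict()
    match = TIMESERIES_RE.match(filename)
    if match:
        return "timeseries", match.groupdict()
    return "unknown", {}


for name in ("f40_test.cam.h0.1993-11.nc", "f40_test.cam.h0.PSL.199001-199912.nc"):
    print(name, "->", classify(name))
```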

CESM Time Variable

The time coordinate variable in CESM history and timeseries files represents the end of the averaging period for variables that are averages. As a result, the time that is resolved when the data are read in does not match the date in the filename. For monthly averaged data, the date in the filename is the correct one. This can be a source of much confusion.

Example: run.2.cam.h0.0001-05.nc

This is a history file for May of year 1 of our run.
When you read in this file, the first time is resolved as: 0001-06-01. This means that Jun 1 of year 0001 is the end of the averaging period and the data contains the average for May of year 1.
To verify the averaging period in the files, consult the time_bnds, time_bound, or time_bounds variable. Always check!
This is a convention used by CESM to allow averaged and instantaneous variables to be stored in the same file.
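A common workaround is to label each average with the midpoint of its time bounds rather than with the end of the averaging period. The following sketch works through the May example by hand. It assumes a no-leap calendar and time units of days since 0001-01-01, which should be confirmed against the attributes of the time variable in the actual files; the function name is hypothetical.

```python
# Month lengths for a no-leap (365-day) calendar
DAYS_IN_MONTH = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]


def day_to_year_month(days_since_start):
    """Convert days since 0001-01-01 (no-leap calendar) to (year, month)."""
    year, day_of_year = divmod(int(days_since_start), 365)
    month = 0
    while day_of_year >= DAYS_IN_MONTH[month]:
        day_of_year -= DAYS_IN_MONTH[month]
        month += 1
    return year + 1, month + 1


# Assumed values for the May average of year 1:
# May starts 120 days after 0001-01-01 and ends 151 days after it.
time_bnds = (120.0, 151.0)
time_value = time_bnds[1]          # the stored time is the end of the period
midpoint = sum(time_bnds) / 2      # 135.5

print("stored time resolves to (year, month):", day_to_year_month(time_value))
print("midpoint resolves to (year, month):   ", day_to_year_month(midpoint))
```

The stored time lands in June, while the midpoint lands in May, which matches the date in the filename.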

Postprocessing

Postprocessing is the process of converting history files to timeseries files and of interpolating 3D atmospheric data from the model coordinate system to selected pressure levels. We will learn how to use the CESM Postprocessing Tools, which are primarily written in NCAR Command Language (NCL) and PyNGL, the Python version of NCL. We will use the prepared NCL scripts without needing to know much NCL.

Diagnostics Packages

There is a large suite of postprocessing and diagnostics packages developed by NCAR. They use Python scripts to automatically generate a variety of plots from model output files, which are then used to evaluate a simulation. They all compute a series of pre-defined metrics and display the plots via a website. These packages are under development.

There are two main diagnostics packages:

The Atmosphere Model Working Group (AMWG) Diagnostics Framework (ADF)
- Climate Variability and Diagnostics Package (CVDP)
CESM Unified Postprocessing and Diagnostics (CUPiD)
- ADF
- Climate Variability and Diagnostics Package (CVDP)

Postprocessing and Diagnostics Packages Setup

We will set up everything necessary for you to run the postprocessing and diagnostics packages on the NCAR computers. We will work on Casper, the system of specialized data analysis and visualization resources; large-memory, multi-GPU nodes; and high-throughput computing nodes.

Login to Casper:

$ ssh -XY username@casper.hpc.ucar.edu

After running the ssh command, you will be asked to finish logging in. Casper has full access to glade/.

Checkout ADF and activate the conda environment:

$ git clone --recursive https://github.com/NCAR/ADF.git
$ module load conda
$ conda activate npl-2024a

In addition to these Python requirements, the ncrcat tool from the NetCDF Operators (NCO) is needed. It can be loaded, along with NCL, by running:

$ module load nco
$ module load ncl

Configuration files:

The ADF requires 2 different yaml configuration files:

config_amwg_default_plots.yaml and adf_variable_defaults.yaml

Do not modify either of these files!

It is recommended to make a copy of each file, make modifications in those copies, and then run them with the ADF.

$ cd ADF

Run-time yaml

config_amwg_default_plots.yaml

This is the most important file for the ADF. It stores all the information that the ADF needs to run, including the relevant information about the case and the baseline/observation/cmip runs.

Make a copy of this file that you will edit:

$ cp config_amwg_default_plots.yaml config_amwg_myCopy_plots.yaml

Open the copied file. The main sections you will want to change are:

user - use your NCAR username

compare_obs - set true if you want to compare your run with observations or false if you want to compare two runs

hist_str - [cam.h0, cam.h1]

cam_case_name - the name of the case run (no path included)

cam_hist_loc - where the h# history files live (example for my run.2 case: /glade/derecho/scratch/cstan/archive/${diag_cam_climo.cam_case_name}/atm/hist)

start_year, end_year - climo years desired

Key Points

Model Diagnostics Packages

Overview

Teaching: 0 min
Exercises: 0 min

Questions

How do I run the model diagnostics packages?

Objectives

ADF can also be used to run the AMWG diagnostics package and the CVDP package. AMWG is designed to compare CAM or CAM-like simulations against other CAM simulations, observations, reanalysis, or model comparison sets. This can be run in a Jupyter Notebook:

Start NCAR JupyterHub and open a Jupyter Notebook with Kernel NPL2024a

Change to the ADF directory and make a copy of the config_cam_baseline_example.yaml file. Open that copied file and make the following changes:

user - use your NCAR username

compare_obs - set true if you want to compare your run with observations or false if you want to compare two runs

hist_str - [cam.h0, cam.h1]

cam_case_name - the name of the case run (no path included)

cam_hist_loc - where the h# history files live (example for my run.2 case: /glade/derecho/scratch/cstan/archive/${diag_cam_climo.cam_case_name}/atm/hist)

start_year, end_year - climo years desired

Save the yaml file and open the jupyter_sample.ipynb notebook. Modify the line

config_file=os.path.join(adf_code,"config_cam_baseline_example.yaml")

to reflect the name of your yaml file.

Key Points

Postprocessing

Overview

Teaching: 0 min
Exercises: 0 min

Questions

What are some common run time configuration changes?

How do I make these changes?

Objectives

Key Points

CESM Troubleshooting
