Our Experiments
Overview
Teaching: min
Exercises: minQuestions
How can I confirm if everything ran correctly?
How can I troubleshoot errors?
Objectives
We have three cases (or experiments) we are working with. We will take a look at what happened with each one of them.
-
- b.day1.0
- This was our first case that ran initially for 5-days. Let’s set
CONTINUE_RUN=TRUE
and ran it again for another 5-days. This run should have completed successfully. How can we confirm this?
Check your CaseStatus file
Go to your case directory for this case and look at the end of the
CaseStatus
filetail CaseStatus
Now let’s check out the output from this case. Remember, it is located in the DOUT_S_ROOT directory.
cases/b.day1.0> ./xmlquery DOUT_S_ROOT
DOUT_S_ROOT: /glade/scratch/cstan/archive/b.day1.0
We will go there and see if we now have 10 days of data.
cd /glade/scratch/cstan/b.day1.0
cd ocn/hist
We now have two output files for the ocean: b.day1.0.pop.h.nday1.0001-01-01.nc
and b.day1.0.pop.h.nday1.0001-01-06.nc
.
b.day1.0.pop.h.nday1.0001-01-01.nc
is the same file we looked at last week which has the first 5-days.
b.day1.0.pop.h.nday1.0001-01-06.nc
contains the next 5-days that we ran by setting CONTINUE_RUN=TRUE
Use ncview to look at these files.
-
- Second case
- This is our case that we are running for 4-years with daily precip and standard monthly output to use for Assignment #3. Assuming the configuration and namelist changes were entered correctly, this run should have completed successfully.
- Check your
CaseStatus
file. - If errors, check your log file
- If no errors, check your output:
There should be monthly and daily output for the atmosphere. Let’s confirm:
cd /glade/scratch/cstan/archive/run.2/atm/hist
ls
The test1.cam.h0.*.nc
files contain monthly averaged data.
The test1.cam.h1.*.nc
contain daily averaged data.
What is in these files?
We will look at each file using
ncdump -h
to understand what is in the files.What variables are in the h0 files? What variables are in the h1 files?
We set this with the namelist options
fincl2 = 'PRECC', 'PRECL'
nhtfrq = 0, -24
How many times are in the h0 files?
How many times are in the h1 files?
We set this with the
mfilt = 1,1
namelist option.Remember, you can look up namelist options
- mfilt
- Array containing the maximum number of time samples written to a history file. The first value applies to the primary history file, the second through tenth to the auxillary history files. Default: 1,30,30,30,30,30,30,30,30,30
We have lots of history files and we can look at each of them using ncview
, but that is not very useful.
We can read in all the files using Python xarray
, but its a lot of data.
There are some useful tools for postprocessing the data to get timeseries
files and easily take a look at some common diagnostics. We will learn about those later in this class.
-
- BRANCH case:
- This is the branch case we ran with lots of configuration changes and namelist changes. This run produces an error with the configuration provided.
We can review how we created and setup our run by looking at the first line of the README.case
file:
cases/branchwrong> head -n1 README.case
2023-03-05 17:55:30: ./create_newcase --case /glade/u/home/cstan/cases/branchwrong --res f19_g17 --compset B1850 --project UGMU0035
You can see that I initially made a mistake in my create_newcase
by mistyping the project number.
We can review any changes we made to the configuration of the run by looking at CaseStatus
cases/branchwrong> more CaseStatus
2023-03-05 18:01:16: xmlchange success <command> ./xmlchange RUN_TYPE=branch,RUN_REFCASE=b.day1.0,RUN_REFDATE=0001-01-05,CLM_NAMELIST_OPTS=,GET_REFCASE=FALSE,STOP_OPTION=nmonths,STOP_N=1,RESUBMIT=1,CCSM_CO2_PPMV=569.4 </command>
---------------------------------------------------
2023-03-05 18:01:46: xmlchange success <command> ./xmlchange JOB_WALLCLOCK_TIME=2:00:00 </command>
---------------------------------------------------
2023-03-05 18:02:13: case.setup starting
---------------------------------------------------
2023-03-05 18:02:16: case.setup success
---------------------------------------------------
2023-03-05 18:03:08: case.build starting
---------------------------------------------------
2023-03-05 18:03:10: case.build error
ERROR: Missing required pointer_file /glade/scratch/cstan/branchwrong/run/rpointer.ocn.restart ---has pop initial data been prestaged to /glade/scratch/cstan/branchwrong/run?
---------------------------------------------------
2023-03-05 18:08:31: case.build starting
---------------------------------------------------
CESM version is cesm2.1.3-rc.01
Processing externals description file : Externals.cfg
Processing externals description file : Externals_CLM.cfg
Processing externals description file : Externals_POP.cfg
Processing externals description file : Externals_CISM.cfg
Processing externals description file : Externals_CAM.cfg
Checking status of externals: clm, fates, ptclm, mosart, ww3, cime, cice, pop, cvmix, marbl, cism, source_cism, rtm, cam, clubb, carma, cosp2, chem_proc,
./cime
clean sandbox, on cime5.6.32
./components/cam
clean sandbox, on cam_cesm2_1_rel_41
./components/cam/chem_proc
clean sandbox, on tools/proc_atm/chem_proc/release_tags/chem_proc5_0_03_rel
./components/cam/src/physics/carma/base
clean sandbox, on carma/release_tags/carma3_49_rel
./components/cam/src/physics/clubb
clean sandbox, on vendor_clubb_r8099_n03
./components/cam/src/physics/cosp2/src
clean sandbox, on CFMIP/COSPv2.0/tags/v2.1.4cesm/src
./components/cice
clean sandbox, on cice5_cesm2_1_1_20190321
./components/cism
clean sandbox, on cism-release-cesm2.1.2_02
./components/cism/source_cism
clean sandbox, on release-cism2.1.03
./components/clm
clean sandbox, on release-clm5.0.30
./components/clm/src/fates
clean sandbox, on sci.1.30.0_api.8.0.0
./components/clm/tools/PTCLM
clean sandbox, on PTCLM2_20200121
./components/mosart
clean sandbox, on release-cesm2.0.04
./components/pop
clean sandbox, on pop2_cesm2_1_rel_n09
./components/pop/externals/CVMix
clean sandbox, on v0.93-beta
./components/pop/externals/MARBL
clean sandbox, on cesm2.1-n00
./components/rtm
clean sandbox, on release-cesm2.0.04
./components/ww3
clean sandbox, on ww3_181001
2023-03-05 18:19:23: case.build success
---------------------------------------------------
2023-03-05 18:20:23: case.submit starting
---------------------------------------------------
2023-03-05 18:20:29: case.submit error
ERROR: Command: 'qsub -q regular -l walltime=2:00:00 -A UGMU0035 -v ARGS_FOR_SCRIPT='--resubmit' .case.run' failed with error 'qsub: Invalid account, available accounts:
Project, Status, Active
UGMU0041, Normal, True' from dir '/glade/u/home/cstan/cases/branchwrong'
---------------------------------------------------
2023-03-05 18:25:52: xmlchange success <command> ./xmlchange PROJECT=UGMU0041 </command>
---------------------------------------------------
2023-03-05 18:26:38: case.submit starting
---------------------------------------------------
2023-03-05 18:26:45: case.submit success case.run:8818830.chadmin1.ib0.cheyenne.ucar.edu, case.st_archive:8818831.chadmin1.ib0.cheyenne.ucar.edu
---------------------------------------------------
2023-03-05 18:26:50: case.run starting
---------------------------------------------------
2023-03-05 18:26:57: model execution starting
---------------------------------------------------
2023-03-05 18:27:00: model execution success
---------------------------------------------------
2023-03-05 18:27:00: case.run error
ERROR: RUN FAIL: Command 'mpiexec_mpt -p "%g:" -np 576 omplace -tm open64 /glade/scratch/cstan/branchwrong/bld/cesm.exe >> cesm.log.$LID 2>&1 ' failed
See log file for details: /glade/scratch/cstan/branchwrong/run/cesm.log.8818830.chadmin1.ib0.cheyenne.ucar.edu.230305-182650
---------------------------------------------------
Another thing we did that is not documented automatically is to copy the restart files from our b.day1.0
case to our new run directory. This was so the model has a set of restart files to start the run from.
cp /glade/scratch/cstan/archive/b.day1.0/rest/0001-01-06-00000/* /glade/scratch/cstan/branchwrong/run/
How do we figure out what went wrong?
Look at your log file and use grep -i
to find errors.
cases/branchwrong> grep -i error /glade/scratch/cstan/branchwrong/run/cesm.log.8818830.chadmin1.ib0.cheyenne.ucar.edu.230305-182650
16: ERROR: GETFIL: FAILED to get b.day1.0.cam.r.0001-01-05-00000.nc
4: ERROR: GETFIL: FAILED to get b.day1.0.cam.r.0001-01-05-00000.nc
29: ERROR: GETFIL: FAILED to get b.day1.0.cam.r.0001-01-05-00000.nc
33: ERROR: GETFIL: FAILED to get b.day1.0.cam.r.0001-01-05-00000.nc
34: ERROR: GETFIL: FAILED to get b.day1.0.cam.r.0001-01-05-00000.nc
3: ERROR: GETFIL: FAILED to get b.day1.0.cam.r.0001-01-05-00000.nc
22: ERROR: GETFIL: FAILED to get b.day1.0.cam.r.0001-01-05-00000.nc
18: ERROR: GETFIL: FAILED to get b.day1.0.cam.r.0001-01-05-00000.nc
...
Why is the error repeated many times?
The model runs on many processors. Each one is reporting the error.
What does the error mean?
This is telling that the model is trying to get a file called b.day1.0.cam.r.0001-01-05-00000.nc
and it is unable to get it.
Let’s go back to our configuration and think back about how we set up this experiment. What do all the configuration changes mean?
We can take a look in env_run.xml
to confirm what each setting means.
- RUN_TYPE=branch
- This is a branch run
- RUN_REFCASE=b.day1.0
- Reference directory containing RUN_REFCASE data - used for hybrid or branch runs
- RUN_REFDATE=0001-01-05
- Reference date for hybrid or branch runs (yyyy-mm-dd)
- CLM_NAMELIST_OPTS=’’
- CLM-specific namelist settings for -namelist option in the CLM build-namelist. CLM_NAMELIST_OPTS is normally set as a compset variable and in general should not be modified for supported compsets. It is recommended that if you want to modify this value for your experiment, you should use your own user-defined component sets via using create_newcase with a compset_file argument. This is an advanced flag and should only be used by expert users.
It seems this option was provided in the NCAR tutorial example, but is not necessary.
- GET_REFCASE=FALSE
- Flag for automatically prestaging the refcase restart dataset. If TRUE, then the refcase data is prestaged into the executable directory
- STOP_OPTION=nmonths
- Sets the run length along with STOP_N and STOP_DATE
- STOP_N=1
- Provides a numerical count for $STOP_OPTION.
- RESUBMIT=1
- If RESUBMIT is greater than 0, then case will automatically resubmit Since we later set our queue time to only 2 hours, there may be a need to resubmit to complete the run.
- CCSM_CO2_PPMV=569.4
- Mechanism for setting the CO2 value in ppmv for CLM if CLM_CO2_TYPE is constant or for POP if OCN_CO2_TYPE is constant. This is the CO2 value that gets propogated to the ocean and land models.
- JOB_WALLCLOCK_TIME=2:00:00
- The machine wallclock setting. This means how long we tell it to run in the queue. The maximum and default are 12:00:00, but we can get our run in more quickly if we tell it we need less time.
Do you see anything in the configuration that could have led to our error?
Look at the
RUN_REFDATE
and the date we used for our restart fileSolution
RUN_REFDATE=0001-01-06 ./xmlchange RUN_REFDATE='0001-01-06'
Fix it!
What did those namelist changes do?
In user_nl_cam
- co2vmr=569.4e-6
- CO2 volume mixing ratio. This is used as the time invariant surface value of CO2 if no time varying values are specified. Default: set by build-namelist.
- ch4vmr = 1583.2e-9
- CH4 volume mixing ratio. This is used as the time invariant surface value of CH4 if no time varying values are specified. Default: set by build-namelist.
- inithist=’MONTHLY’
- Frequency that initial files will be output This produces initial condition files monthly.
Nothing looks questionable there.
Resubmit your case!
Key Points