This file is part of the Data Assimilation Research Testbed (DART).
DART is free software; you can redistribute it and/or modify it and are expected to follow the terms of the GNU General Public License as published by the Free Software Foundation.
DART is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with DART; if not, write to:
Free Software Foundation, Inc.
The Data Assimilation Research Testbed (DART) is designed to facilitate the combination of assimilation algorithms, models, and observation sets to allow increased understanding of all three. For the ASP colloquium on Data Assimilation, a subset of the complete DART facility was used to examine ensemble filter assimilation algorithms using synthetic observations. The DART programs were compiled with the Intel 7.1 Fortran compiler and run on a Linux compute server, while analysis was performed with Matlab on a DEC workstation. If your system is different, you will definitely need to read the Customizations section.
DART programs can require three different types of input. First, some of the DART programs, those for creating synthetic observational datasets, require interactive input. For simple cases, this input can be typed directly at the keyboard. In more complicated cases, a file containing the appropriate responses can be created and redirected to the standard input of the DART program. Second, many DART programs expect one or more input files in DART-specific formats to be available. For instance, perfect_model_obs, which creates a synthetic observation set given a particular model and a description of a sequence of observations, requires an input file that describes this observation sequence. At present, all DART-specific input files are inefficient but human-readable ASCII files. Third, many DART modules (including main programs) make use of the Fortran90 namelist facility to obtain values of certain parameters at run-time. All programs look for a namelist input file called input.nml in the directory in which the program is executed. The input.nml file can contain a sequence of individual Fortran90 namelists which specify values of particular parameters for modules that compose the executable program. Unfortunately, the Fortran90 namelist interface is poorly defined in the language standard, leaving considerable leeway to compiler developers in implementing the facility. The Intel 7.1 compiler has some particularly unpleasant behavior when a namelist file contains an entry that is NOT defined in the program reading the namelist. Error behavior is unpredictable, but often results in read errors for other input files opened by DART programs. If you encounter run-time read errors, the first course of action should be to ensure that every entry in the namelist file is actually declared in the corresponding namelist in the code. Renaming namelist entries by hand is a reliable way to create this kind of trouble.
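For example, the Lorenz '63 model reads its parameters from a namelist along these lines (a sketch: the group name &model_nml is an assumption, but the variables match what the programs echo at startup in the transcripts later in this document):

   &model_nml
      sigma  = 10.0,
      r      = 28.0,
      b      = 2.6666666666666667,
      deltat = 0.01,
      output_state_vector = .true.
   /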
DART uses the netCDF self-describing data format, with a particular metadata convention, to describe output that is used to analyze the results of assimilation experiments. These files have the extension .nc and can be read by a number of standard data analysis tools. Three sets of tools are available to work with netCDF files for the ASP colloquium. First, the simple tool ncview is provided for rudimentary graphical display of slices of output data fields. ncview will be of most use for the output of the more comprehensive models at the end of the exercise set. Second, the NCO tools (the netCDF Operators) are available to do operations like concatenating, slicing, and dicing of netCDF files. Finally, a set of Matlab scripts, designed to produce graphical diagnostics from DART netCDF output files, is available.
This document outlines the installation of the DART software and the system requirements. For convenience, some of the original colloquium exercises are repeated here, mostly just to check the installation. The entire installation process is summarized in the following steps:
We have tried to make the code as portable as possible, but we do not have access to all compilers on all platforms, so there are no guarantees. We are interested in your experience building the system, so please email me (Tim Hoar) at thoar@ucar.edu.
After the installation, you might want to peruse the following.
The DART software has been successfully built on several Linux/x86 platforms with the Intel Fortran Compiler 7.1 for Linux, which is free for individual scientific use. It has also been built and successfully run with the Portland Group Fortran Compiler. Since the code must be recompiled to experiment with different models, no binaries are distributed.
DART uses the netCDF self-describing data format for the results of assimilation experiments. These files have the extension .nc and can be read by a number of standard data analysis tools. In particular, DART also makes use of the F90 interface to the library, which is available through the netcdf.mod and typesizes.mod modules. IMPORTANT: different compilers create these modules with different "case" filenames, and sometimes they are not both installed into the expected directory. Both modules must be present. The normal place is the netcdf/include directory, as opposed to the netcdf/lib directory.
If the netCDF library does not exist on your system, you must build it (as well as the F90 interface modules). The library and instructions for building the library or installing from an RPM may be found at the netCDF home page: http://www.unidata.ucar.edu/packages/netcdf/. Pay particular attention to the compiler-specific patches that must be applied for the Intel Fortran Compiler (or the PG compiler, for that matter).
The location of the netCDF library, libnetcdf.a, and the locations of both netcdf.mod and typesizes.mod will be needed by the makefile template, as described in the compiling section.
DART also uses the very common udunits library for manipulating units of physical quantities. If, somehow, it is not installed on your system, you will need to install it (instructions are available from Unidata's Downloads page).
The location of the udunits library, libudunits.a, will be needed by the makefile template, as described in the compiling section.
The DART source code is distributed as a compressed tar file, DART_ASP_SUMMER_2003.tar.gz [8307414 bytes]. When untarred, the source tree will begin with a directory named DART and will occupy approximately 23.5 MB. Compiling the code in this tree (as is usually the case) will require considerably more space.
The code tree is very "bushy"; there are many directories of support routines, but only a few directories are involved in the customization and installation of the DART software. If you can compile and run ONE of the low-order models, you should be able to compile and run ANY of the low-order models. For this reason, we focus on the Lorenz '63 model. Consequently, the only directories with files to be modified to check the installation are: DART/mkmf, DART/models/lorenz_63/work, and DART/matlab (the last only for analysis).
DART executable programs are constructed using two tools: make and mkmf. The make utility is a relatively common piece of software that requires a user-defined input file that records dependencies between different source files. make then performs a hierarchy of actions when one or more of the source files is modified. The mkmf utility is a custom preprocessor that generates a make input file (named Makefile) and is designed specifically to work with object-oriented Fortran90 (and other languages) for systems like DART.
mkmf requires two separate input files. The first is a `template' file which specifies details of the commands required for a specific Fortran90 compiler and may also contain pointers to directories containing pre-compiled utilities required by the DART system. This template file will need to be modified to reflect your system. The second input file is a `path_names' file which includes a complete list of the locations (either relative or absolute) of all Fortran90 source files that are required to produce a particular DART program. Each 'path_names' file must contain a path for exactly one Fortran90 file containing a main program, but may contain any number of additional paths pointing to files containing Fortran90 modules. An mkmf command is executed which uses the 'path_names' file and the mkmf template file to produce a Makefile which is subsequently used by the standard make utility.
Shell scripts that execute the mkmf command for all standard DART executables are provided as part of the standard DART software. For more information on mkmf, see the FMS mkmf description.
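Under the hood, each of these scripts is little more than a call to mkmf with a template and a 'path_names' file. A hand-run invocation might look like the following sketch (the -t and -p options are standard FMS mkmf options; the relative paths assume you are in DART/models/lorenz_63/work):

   $ ../../../mkmf/mkmf -t ../../../mkmf/mkmf.template.ifc \
         -p create_obs_set_def path_names_create_obs_set_def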
A series of templates for different compilers/architectures exists in the DART/mkmf/ directory; their names have extensions that identify the compiler, the architecture, or both. This is how you inform the build process of the specifics of your system. For the discussion that follows, some familiarity with the contents of one of these templates (e.g., mkmf.template.pgi) is needed; only the first few lines matter, and they look something like this:
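(A sketch, reconstructed from the discussion that follows; the exact contents of your template may differ.)

   # fragment of mkmf.template.pgi
   FC = pgf90
   FFLAGS = -r8 -I/usr/local/netcdf/include
   LIBS = -L/usr/local/netcdf/lib -lnetcdf -L/usr/local/udunits-1.11.7/lib -ludunits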
Essentially, each of these lines defines some part of the resulting Makefile. Since make is particularly good at sorting out dependencies, the order of the lines makes no difference. The FC = pgf90 line ultimately defines the Fortran90 compiler to use, etc. The lines most likely to need site-specific changes start with FFLAGS and LIBS, which indicate where to look for the netCDF F90 modules and the locations of the netCDF and udunits libraries.
Each compiler has different compile flags, so there is really no way to cover this exhaustively, other than to say that the templates as supplied should work, provided the location of the netCDF modules netcdf.mod and typesizes.mod is correct. Change the /usr/local/netcdf/include string to reflect the location of your modules. The low-order models can be compiled without the -r8 switch, but the bgrid_solo model cannot.
Modifying the LIBS value should be relatively straightforward. Change the /usr/local/netcdf/lib string to reflect the location of your libnetcdf.a, and change the /usr/local/udunits-1.11.7/lib string to reflect the location of your libudunits.a.
Several path_names_* files are provided in the work directory for each specific model, in this case: DART/models/lorenz_63/work.
Currently, DART executables are constructed in a work subdirectory under the directory containing code for the given model. From the top-level DART directory, change to the L63 work directory and list the contents:
filter_ics                     path_names_create_obs_sequence
input.nml                      path_names_create_obs_set_def
mkmf_create_obs_sequence       path_names_filter
mkmf_create_obs_set_def        path_names_perfect_model_obs
mkmf_filter                    perfect_ics
mkmf_perfect_model_obs
There are four mkmf_xxxxxx files for the programs create_obs_set_def, create_obs_sequence, perfect_model_obs, and filter along with the corresponding path_names_xxxxxx files. You can examine the contents of one of the path_names_xxxxxx files, for instance path_names_filter, to see a list of the relative paths of all files that contain Fortran90 modules required for the program filter for the L63 model. All of these paths are relative to the DART directory that you copied into your local storage. The first path is the main program (filter.f90) and is followed by all the Fortran90 modules used by this program.
The mkmf_xxxxxx scripts are considerably more cryptic and need to be modified to use the appropriate DART/mkmf/mkmf.template.xxx file containing the site-specific customizations of the previous section (i.e. the template customization section). For example, suppose you modified DART/mkmf/mkmf.template.ifc and want to create the create_obs_set_def program. Simply make sure the appropriate template file (mkmf.template.ifc) is referenced by mkmf_create_obs_set_def, then generate the Makefile and build:
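A sketch of the two commands (assuming the mkmf_* scripts are csh scripts, as distributed):

   $ csh mkmf_create_obs_set_def
   $ make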
The first command generates an appropriate Makefile, and the second compiles a series of Fortran90 modules, ultimately producing the executable file create_obs_set_def. Should you make any changes to DART/mkmf/mkmf.template.ifc, you will need to regenerate the Makefile. A series of .o and .mod files for each compiled module will also be left in the work directory. You can proceed to create the other three programs needed to work with L63 in DART the same way; their purposes are summarized in the table below, and a build sketch follows it:
program | purpose
---|---
create_obs_set_def | specify a set of observation characteristics taken by a particular set of instruments
create_obs_sequence | specify the temporal attributes of the observation sets
perfect_model_obs | spin-up, generate the "true state" for synthetic observation experiments, ...
filter | perform assimilation experiments
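One way to build the three remaining programs is sketched below as a Bourne-shell loop (each mkmf_* script overwrites Makefile, so make must follow each mkmf step):

   for PROG in create_obs_sequence perfect_model_obs filter; do
       csh mkmf_${PROG}    # regenerate the Makefile for this program
       make                # compile; produces the executable named $PROG
   done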
Create an observation set definition. create_obs_set_def creates an observation set definition, the time-independent part of an observation sequence. An observation set definition file only contains the location, type, and observational error characteristics (normally just the diagonal observational error variance) for a related set of observations. There are no actual observations, nor are there any times associated with the definition. For spin-up, we are only interested in integrating the L63 model, not in generating any particular synthetic observations. Begin by creating a minimal observation set definition.
In general, for the low-order models, only a single observation set need be defined. Next, the number of individual scalar observations (like a single surface pressure observation) in the set is needed. To spin up an initial condition for the L63 model, only a single observation is needed. Next, the error variance for this observation must be entered. Since we are not interested in this observation having any impact on an assimilation (it will only be used for spinning up the model and the ensemble), enter a very large value for the error variance. An observation with a very large error variance has essentially no impact on deterministic filter assimilations like the default variety implemented in DART. Finally, the location and type of the observation need to be defined. For all types of models, the most elementary form of synthetic observation is the 'identity' observation. These observations are generated simply by adding a random sample from a specified observational error distribution directly to the value of one of the state variables. Entering the integer index 1 defines the observation as an identity observation of the first state variable in the L63 model. The program then terminates after generating a file (generally named set_def.out) that defines the single identity observation of the first state variable of the L63 model. The following transcript shows the full interactive session; the user's responses appear after each prompt.
[unixprompt]$ ./create_obs_set_def
 create_obs_set_def attributes:
 $Source: /fs/cgd/home0/thoar/CVS.REPOS/CISL/IMAGe/DAI/DART/ASP_DART_exercise.html,v $
 $Revision: 1.1 $
 $Date: 2005/02/08 23:10:23 $

Input the filename for output of observation set_def_list? [set_def.out]
set_def.out

 assim_model attributes:
 $Source: /fs/cgd/home0/thoar/CVS.REPOS/CISL/IMAGe/DAI/DART/ASP_DART_exercise.html,v $
 $Revision: 1.1 $
 $Date: 2005/02/08 23:10:23 $

 namelist read; values are
 sigma is   10.00000000000000
 r is   28.00000000000000
 b is   2.666666666666667
 deltat is  1.0000000000000000E-002
 output_state_vector is T

 model attributes:
 $Source: /fs/cgd/home0/thoar/CVS.REPOS/CISL/IMAGe/DAI/DART/ASP_DART_exercise.html,v $
 $Revision: 1.1 $
 $Date: 2005/02/08 23:10:23 $

 model size is   3

Input the number of unique observation sets you might define
1
How many observations in set  1
1
Defining observation  1
Input error variance for this observation definition
1000000
Input an integer index if this is identity observation, else -1
1

set_def.out successfully created.
Terminating normally.
Create an observation sequence definition. create_obs_sequence creates an 'observation sequence definition' by extending the 'observation set definition' with the temporal attributes of the observations.
The first input is the name of the file created in the previous step, i.e. the name of the observation set definition that you've just created. It is possible to create sequences in which the observation sets are observed at regular intervals or irregularly in time. Here, all we need is a sequence that takes observations over a long period of time, indicated by entering a 1. Although the L63 system is normally defined as having a non-dimensional time step, the DART system arbitrarily defines the model timestep as being 3600 seconds. By declaring we have 1000 observations taken once per day, we create an observation sequence definition spanning 24,000 'model' timesteps, sufficient to spin the model up onto the attractor. Finally, enter a name for the 'observation sequence definition' file. Note again: there are no observation values present in this file, just an observation type, location, time, and the error characteristics. We are going to populate the observation sequence with the perfect_model_obs program.
[thoar@ghotiol work]$ ./create_obs_sequence
 create_obs_sequence attributes:
 $Source: /fs/cgd/home0/thoar/CVS.REPOS/CISL/IMAGe/DAI/DART/ASP_DART_exercise.html,v $
 $Revision: 1.1 $
 $Date: 2005/02/08 23:10:23 $

What is name of set_def_list? [set_def.out]
set_def.out

Setting times for obs_def  1
To input a regularly repeating time sequence enter 1
To enter an irregular list of times enter 2
1
Input number of observations in sequence
1000
Input time of initial ob in sequence in days and seconds
1, 0
Input period of obs in days and seconds
1, 0
 time    1 is    0 1
 time    2 is    0 2
 time    3 is    0 3
...
 time  998 is    0 998
 time  999 is    0 999
 time 1000 is    0 1000
Input file name for output of obs_sequence? [obs_seq.in]
obs_seq.in
Initialize the model onto the attractor. perfect_model_obs can now advance the arbitrary initial state for 24,000 timesteps to move it onto the attractor.
perfect_model_obs uses the Fortran90 namelist input mechanism instead of the (admittedly gory, but temporary) interactive input. In input.nml, in the namelist for perfect_model_obs, the following values should be set:
namelist variable | description |
---|---|
async | Simply ignore this. Leave it set to '.false.' |
obs_seq_in_file_name | specifies the file name that results from running create_obs_sequence, i.e. the 'observation sequence definition' file. |
obs_seq_out_file_name | specifies the output file name containing the 'observation sequence', finally populated with (perfect?) 'observations'. |
start_from_restart | When set to 'false', perfect_model_obs generates an arbitrary initial condition (which cannot be guaranteed to be on the L63 attractor). |
output_restart | When set to 'true', perfect_model_obs will record the model state at the end of this integration in the file named by restart_out_file_name. |
restart_in_file_name | is ignored when 'start_from_restart' is 'false'. |
restart_out_file_name | if output_restart is 'true', this specifies the name of the file containing the model state at the end of the integration. |
init_time_xxxx | the start time of the integration. |
output_interval | interval at which to save the model state. |
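Putting these together, the perfect_model_obs namelist for the spin-up might look like the following sketch (the group name &perfect_model_obs_nml and the obs_seq.out file name are assumptions; the variable names come from the table above, with init_time_xxxx expanded to days and seconds for illustration):

   &perfect_model_obs_nml
      async                 = .false.,
      obs_seq_in_file_name  = "obs_seq.in",
      obs_seq_out_file_name = "obs_seq.out",
      start_from_restart    = .false.,
      output_restart        = .true.,
      restart_in_file_name  = "perfect_ics",
      restart_out_file_name = "perfect_restart",
      init_time_days        = 0,
      init_time_seconds     = 0,
      output_interval       = 1
   /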
Executing perfect_model_obs will integrate the model 24,000 steps and output the resulting state in the file perfect_restart.
Generating ensemble initial conditions is achieved by changing a perfect_model_obs namelist parameter, copying perfect_restart to perfect_ics, and rerunning perfect_model_obs. This execution of perfect_model_obs will advance the model state from the end of the first 24,000 steps to the end of an additional 24,000 steps and place the final state in perfect_restart.
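A sketch of this sequence (inferring from the table above that the namelist parameter to change is start_from_restart, so that the integration starts from the state in perfect_ics):

   $ cp perfect_restart perfect_ics      # use the spun-up state as the new initial condition
   # edit input.nml: set start_from_restart = .true.
   $ ./perfect_model_obs                 # advance another 24,000 steps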
A True_State.nc file is also created. It contains the 'true' state of the integration.
Generating the ensemble is done with the program filter, which also uses the Fortran90 namelist mechanism for input.
Only the non-obvious(?) entries will be discussed.
namelist variable | description
---|---
ens_size | Number of ensemble members. 20 is sufficient for most of the exercises.
cutoff | Limits the impact of an observation; set to 0.0 for spin-up, so the observations have essentially no impact.
cov_inflate | A value of 1.0 results in no inflation (appropriate for spin-up).
The filter is told to generate its own ensemble initial conditions, since start_from_restart is '.false.'. However, it is important to note that filter still makes use of perfect_ics, which is set as the restart_in_file_name. This is the model state generated from the first 24,000-step model integration by perfect_model_obs. filter generates its ensemble initial conditions by randomly perturbing the state variables of this state.
The arguments output_state_ens_mean and output_state_ens_spread are '.true.', so these quantities are output at every time for which there are observations (once a day here), and num_output_ens_members means that the same diagnostic files, Posterior_Diag.nc and Prior_Diag.nc, also contain values for all 20 ensemble members once a day. Once the namelist is set, execute filter to integrate the ensemble forward for 24,000 steps, with the final ensemble state written to filter_restart. Copy the perfect_model_obs restart file perfect_restart (the `true state') to perfect_ics, and the filter restart file filter_restart to filter_ics, so that future assimilation experiments can be initialized from these spun-up states.
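Collecting the values discussed above, the filter namelist for the ensemble spin-up might look like this sketch (the group name &filter_nml is an assumption; the variable names and values come from the surrounding text):

   &filter_nml
      ens_size                = 20,
      cutoff                  = 0.0,
      cov_inflate             = 1.0,
      start_from_restart      = .false.,
      output_restart          = .true.,
      restart_in_file_name    = "perfect_ics",
      restart_out_file_name   = "filter_restart",
      output_state_ens_mean   = .true.,
      output_state_ens_spread = .true.,
      num_output_ens_members  = 20
   /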
The spin-up of the ensemble can be viewed by examining the output in the netCDF files: True_State.nc, generated by perfect_model_obs, and Posterior_Diag.nc and Prior_Diag.nc, generated by filter. To do this, see the detailed discussion of Matlab diagnostics in Appendix I.
Begin by using create_obs_set_def to generate an observation set in which each of the 3 state variables of L63 is observed with an observational error variance of 1.0 for each observation. To do this, use the following input sequence (the text including and after # is a comment and does not need to be entered):
set_def.out | # Output file name |
1 | # Number of sets |
3 | # Number of observations in set (x, y, and z) |
1.0 | # Variance of first observation |
1 | # First ob is identity observation of state variable 1 (x) |
1.0 | # Variance of second observation |
2 | # Second is identity observation of state variable 2 (y) |
1.0 | # Variance of third ob |
3 | # Identity ob of third state variable (z) |
Now, generate an observation sequence definition by running create_obs_sequence with the following input sequence:
set_def.out | # Input observation set definition file |
1 | # Regular spaced observation interval in time |
1000 | # 1000 observation times |
0, 43200 | # First observation after 12 hours (0 days, 3600 * 12 seconds) |
0, 43200 | # Observations every 12 hours |
obs_seq.in | # Output file for observation sequence definition |
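As noted in the introduction, these responses can also be saved to files (one response per line, without the # comments) and redirected to standard input; the file names here are arbitrary:

   $ ./create_obs_set_def  < obs_set_def.responses
   $ ./create_obs_sequence < obs_sequence.responses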
An observation sequence file is now generated by running perfect_model_obs with the namelist values unchanged from step 2.
This integrates the model starting from the state in perfect_ics for 1000 12-hour intervals, outputting synthetic observations of the three state variables every 12 hours, and producing a netCDF diagnostic file, True_State.nc.
Finally, filter can be run with its namelist set to:
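A sketch of the relevant entries, under the same assumptions as before (the magnitude of the 'large' cutoff is arbitrary, as long as it is big enough that every observation can impact every state variable; keep the rest of your &filter_nml as in the spin-up run, except as shown):

   &filter_nml
      ens_size             = 20,
      cutoff               = 1000000.0,
      cov_inflate          = 1.0,
      start_from_restart   = .true.,
      restart_in_file_name = "filter_ics"
   /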
The large value for the cutoff allows each observation to impact all other state variables (see Appendix V for localization). filter produces two output diagnostic files: Prior_Diag.nc, which contains values of the ensemble members, ensemble mean, and ensemble spread for 12-hour lead forecasts before assimilation is applied, and Posterior_Diag.nc, which contains similar data for after the assimilation is applied (sometimes referred to as analysis values).
Now try applying all of the Matlab diagnostic functions described in the Matlab Diagnostics section. The output files are netCDF files and may be examined with many different software packages. We happen to use Matlab, and provide our diagnostic scripts in the hope that they are useful.
The Matlab diagnostic scripts and underlying functions reside in the DART/matlab directory. They rely on the public-domain netCDF toolbox from http://woodshole.er.usgs.gov/staffpages/cdenham/public_html/MexCDF/nc4ml5.html as well as the public-domain CSIRO matlab/netCDF interface from http://www.marine.csiro.au/sw/matlab-netcdf.html. If you do not have them installed on your system and want to use Matlab to peruse netCDF, you must follow their installation instructions.
Once you can access the getnc function from within Matlab, you can use our diagnostic scripts. It is necessary to prepend the location of the DART/matlab scripts to the matlabpath. Keep in mind that the location of the netCDF operators on your system WILL be different from ours, and that's OK.
The scripts are designed to do the "obvious" thing for the low-order models and will prompt for additional information if needed. The philosophy is that anything that starts with a lower-case plot_some_specific_task is intended to be user-callable and should handle any of the models; all the other routines in DART/matlab are called BY those high-level routines. A typical session looks like this:

ghotiol:/<5>models/lorenz_63/work]$ matlab -nojvm

                            < M A T L A B >
                Copyright 1984-2002 The MathWorks, Inc.
                Version 6.5.0.180913a Release 13
                            Jun 18 2002

  Using Toolbox Path Cache.  Type "help toolbox_path_cache" for more info.

  To get started, type one of these: helpwin, helpdesk, or demo.
  For product information, visit www.mathworks.com.

>> which getnc
/contrib/matlab/matlab_netcdf_5_0/getnc.m
>> ls *.nc
ans =
Posterior_Diag.nc  Prior_Diag.nc  True_State.nc
>> path('../../../matlab',path)
>> which plot_ens_err_spread
../../../matlab/plot_ens_err_spread.m
>> help plot_ens_err_spread
 DART : Plots summary plots of the ensemble error and ensemble spread.
 Interactively queries for the needed information.
 Since different models potentially need different pieces of information
 ... the model types are determined and additional user input
 may be queried.
 Ultimately, plot_ens_err_spread will be replaced by a GUI.
 All the heavy lifting is done by PlotEnsErrSpread.

 Example 1 (for low-order models)
 truth_file = 'True_State.nc';
 diagn_file = 'Prior_Diag.nc';
 plot_ens_err_spread
>> plot_ens_err_spread

The Matlab graphics window will then display the spread of the ensemble error for each state variable.
Matlab script | description
---|---
plot_bins | Plots ensemble rank histograms.
plot_correl | Plots the space-time series of correlation between a given variable at a given time and other variables at all times in an ensemble time sequence.
plot_ens_err_spread | Plots summary plots of the ensemble error and ensemble spread. Interactively queries for the needed information. Since different models potentially need different pieces of information, the model type is determined and additional user input may be queried.
plot_ens_mean_time_series | Queries for the state variables to plot.
plot_ens_time_series | Queries for the state variables to plot.
plot_phase_space | Plots a 3D trajectory of (3 state variables of) a single ensemble member. Additional trajectories may be superimposed.
plot_total_err | Summary plots of global error and spread.
plot_var_var_correl | Plots the time series of correlation between a given variable at a given time and another variable at all times in an ensemble time sequence.
A simple, but surprisingly effective, way of dealing with filter divergence is known as covariance inflation. In this method, the prior ensemble estimate of the state is expanded around its mean by a constant factor, effectively increasing the prior estimate of uncertainty while leaving the prior mean estimate unchanged. The program filter has a namelist parameter, cov_inflate, that controls the application of covariance inflation. Up to this point, cov_inflate has been set to 1.0, indicating that the prior ensemble is left unchanged. Increasing cov_inflate to values greater than 1.0 inflates the ensemble before assimilating observations at each time they are available. Values smaller than 1.0 contract (reduce the spread of) the prior ensembles before assimilating.
You can do this by modifying the value of cov_inflate in the namelist (try 1.05 and 1.10, and other values at your discretion) and running filter as above. In each case, use the Matlab diagnostic tools to examine the resulting changes to the error, the ensemble spread (via rank histogram bins, too), etc. What kind of relation between spread and error is seen in this model?
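Only the one line needs to change; a minimal sketch, under the same assumed group name as above (leave the other entries in your input.nml exactly as in the assimilation run):

   &filter_nml
      cov_inflate = 1.05
   /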
Synthetic observations are generated from a `perfect' model integration, which is often referred to as the `truth' or a `nature run'. A model is integrated forward from some set of initial conditions, and observations are generated as y = H(x) + e, where H is an operator on the model state vector x that gives the expected value of a set of observations y, and e is a random variable with a distribution describing the error characteristics of the observing instrument(s) being simulated. Using synthetic observations in this way allows students to learn about assimilation algorithms while being isolated from the additional (extreme) complexity associated with model error and unknown observational error characteristics. In other words, for the real-world assimilation problem, the model has (often substantial) differences from what happens in the real system, and the observational error distribution may be very complicated and is certainly not well known. Be careful to keep these issues in mind while exploring the capabilities of the ensemble filters with synthetic observations.