# DART software - Copyright UCAR. This open source software is provided
# by UCAR, "as is", without charge, subject to all terms of use at
# http://www.image.ucar.edu/DAReS/DART/DART_download
#
# DART $Id$

Greetings.

This README file describes the programs in this directory and offers
troubleshooting help if you are trying to get the MPI option in DART
to compile and run.

INTRODUCTION:

Starting with the Jamaica release of DART there is an option to use MPI
in the filter program, so the data assimilation step can run in parallel
on a parallel machine.  There is also the option to advance the model in
parallel.  This is controlled by the 'async' setting in the &filter_nml
namelist section of the input.nml file.

The default settings compile DART without MPI.  If you want to enable
the MPI option you need a working MPI library and run-time system.
Generally this means simply compiling with mpif90 instead of the
underlying Fortran 90 compiler, but it can be more complicated if there
are multiple Fortran compilers on the system or if MPI is not installed
in a standard directory.

This directory contains some small test programs which use both the MPI
and netCDF libraries.  It is usually simpler to debug build problems
here, and if you need to submit a problem report to your system admin
people, these single executables are much simpler than the entire DART
build tree.

Be sure that the file $DART/mkmf/mkmf.template has been updated for your
particular system and compiler.  We supply a default file, but generally
you should find the mkmf.template.compiler.system file which is closest
to your setup, copy it over mkmf.template, and then update the location
of the netCDF libraries (since they are often not installed in a
location searched by default).  If your system uses the 'module' command
to select between different software versions, make sure you have loaded
consistent versions of the Fortran compiler, the netCDF libraries, and
the MPI libraries.

There are 2 places where you may need to customize files in the
$DART/mpi_utilities directory.  First, some systems come with only an
include file for the MPI parameters; others come with only a Fortran 90
module.  Second, some systems require an interface block to declare the
use of the system() function; others will not compile if it is
specified.  There are large block comments about both at the top of the
test file in this directory - ftest_mpi.f90.  If you can get it compiled
here, then you know what changes (if any) you have to make in the
mpi_utilities_mod.f90 and null_mpi_utilities_mod.f90 files in the
$DART/mpi_utilities directory.

HOW TO VERIFY AN INSTALLATION:

This directory contains a small set of test programs which use the MPI
(Message Passing Interface) communications library.  They may compile
the first time with no problem, but especially on Linux clusters there
can be an almost infinite number of permutations of batch queue systems,
compilers, and MPI libraries.

 Examples of batch queue systems:  PBS, LSF, LoadLeveler
 Examples of compiler vendors:     Intel (ifort), PGI (pgf90), Absoft (f95)
 Examples of MPI libraries:        mpich, LAM, OpenMPI, MPT

Make sure that the closest mkmf.template.compiler.system file has been
copied over to $DART/mkmf/mkmf.template.  Type:  make  to compile the
programs.  Type:  make check  to run the programs interactively and look
for errors.  Note that on some larger systems it is prohibited to run
MPI programs on the login nodes; they can only be submitted to a batch
system.  If the f90 and nc programs run but the mpi program fails, you
still might be ok; move on to 'make batch' before declaring an
emergency.

Edit the Makefile and select the proper submission line (bsub, qsub, or
neither), then type:  make batch  to submit the test programs to the
batch queue.  This will almost certainly not work without modifications
to the 'runme' script if you have LSF or PBS.  (There are #BSUB and #PBS
directives in that script which select the specific queue, specify the
maximum runtime, give an accounting charge code, etc, and these are very
system specific.)  Once you have a working script here you can transfer
your changes to the model-specific files under
$DART/models/your-model/work.

WHAT YOU GET:

ftest_mpi.f90 is a Fortran 90 program which calls a few basic MPI
library functions.  If it compiles and runs interactively, you have
cleared one of the 2 large hurdles in running with MPI.  If you can
submit this executable to the batch queue and have it run, you are done.
Go have a beverage of your choice.  After that, you can start to do
actual science instead of system-wrangling.
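If you are curious what a program like this involves, here is a minimal
sketch of an MPI "hello world" in Fortran 90.  It is illustrative only
(it is not the shipped ftest_mpi.f90 source, which does more), but it
shows the basic init/query/finalize calls any working MPI installation
must support:

    program hello_mpi
    ! A minimal sketch, not the actual ftest_mpi.f90 test.
    ! If your MPI supplies a Fortran 90 module, uncomment the 'use'
    ! line and remove the 'include' line (see TROUBLESHOOTING item 1).
    !use mpi
    implicit none
    include 'mpif.h'

    integer :: ierror, myrank, ntasks

    call MPI_Init(ierror)
    call MPI_Comm_rank(MPI_COMM_WORLD, myrank, ierror)
    call MPI_Comm_size(MPI_COMM_WORLD, ntasks, ierror)

    print *, 'hello from task ', myrank, ' of ', ntasks

    call MPI_Finalize(ierror)
    end program hello_mpi

Compile it with the same wrapper the tests use, for example
'mpif90 hello_mpi.f90 -o hello_mpi'.  If this compiles and runs, the
larger tests should too.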
ftest_nc.f90 is a (non-MPI) Fortran 90 program which uses the netCDF
libraries.  It can be used to test that the netCDF libs are installed
and that you have the proper setting for the NETCDF variable in your
mkmf.template.

'make check' will try to build and run both of these programs
interactively.  'make batch' will submit them to the batch system for
execution.  If you have problems, keep reading below for more help in
diagnosing exactly where things are going wrong.

TROUBLESHOOTING:

If the ftest_mpi.f90 program does not compile, here are a few things to
check.  You must be able to compile and run this simple program before
anything else is going to work.

1. Include file vs module

Some MPI installations supply a header file (a .h or .inc file) which
defines the parameters for the MPI library.  Others supply a Fortran 90
module which contains the parameters and the subroutine prototypes.  Use
one or the other, not both.  The code contains a commented-out 'use'
statement and comes by default expecting to use the include file (the
sketch above shows how the two alternatives look).

2. Interface block for system() vs not

While this isn't strictly an MPI issue, the system() function is used by
some of the DART code and is called from the MPI module, so if you need
to comment this block in or out, you are editing the same files.  If you
get an error trying to link your program and the message seems related
to 'undefined external _system_' (or some close permutation of that
message), go into BOTH mpi_utilities_mod.f90 and
null_mpi_utilities_mod.f90 and comment in (or out) the interface block
near the comment which has 'BUILD TIP 2'.
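The interface block in question looks roughly like the sketch below.
This is a paraphrase of the general style, not a verbatim copy; check
the actual text near the 'BUILD TIP 2' comment in those two files:

    ! Sketch of an interface block declaring the system() function.
    ! Some compilers need this declaration in order to link;
    ! others refuse to compile if it is present.
    interface
     function system(string)
      character(len=*), intent(in) :: string
      integer                      :: system
     end function system
    end interface

The rule of thumb: if the link fails with an undefined 'system' symbol,
comment the block in; if the compile fails complaining about a duplicate
or conflicting declaration of system(), comment it out.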
3. Compiler wrappers

Most MPI installations include compiler "wrapper" programs which you
call instead of the actual compiler.  They add any needed compiler flags
and they add the MPI libraries to the link line.  But they are usually
built for one particular compiler, so if your system has multiple
Fortran compilers available you will need to find the matching set of
MPI wrappers.  The Fortran 90 wrapper is generally called 'mpif90'.  Try
to go this route if at all possible.  It might mean adding a new
directory to your shell search path, or loading a new module with the
'module' command.

Once you have a compiled executable, you have to run it.  This may mean
dealing with the batch system.

4. Batch systems

Most clusters have some form of batch control.  You log in to one node
on the cluster, but to execute a compute job you must run a command
which adds the job to a list of waiting jobs.  Especially for MPI jobs,
which expect to use multiple processors at the same time, a batch
control system ensures that each job is started on the right number of
processors and does not conflict with other running jobs.  The batch
control system knows how many nodes are available for jobs, whether some
queues have higher or lower priority, the maximum time a job can run,
and the maximum number of processors a job can request, and it schedules
the use of the nodes based on the jobs in the execution queues.

The two most common batch systems currently are PBS and LSF.  They are
complicated, but don't despair.  This directory comes with a script
which has settings for the most commonly required options.  If they do
not work on your system, the simplest way to proceed is to find a
colleague with a working script and copy it; to check your local support
web pages or ask your support people; or, for the independent-minded, to
google for examples out on the web.  Queue names tend to differ between
systems, and many larger systems which charge by the job require an
account code specified on a #BSUB or #PBS line.  These values will have
to come from a locally knowledgeable person.

If you have a small or exclusive-use cluster you may not have a batch
system.  In that case you can probably start your job directly with the
'mpirun' command (typically something like 'mpirun -np 4 ./ftest_mpi',
although the exact name and flags vary between MPI installations).  But
even in this case, you may need to supply some system-specific options,
like a 'machine file' which lists the node names that are part of this
cluster and available to run your job.

OTHER THINGS:

A few other programs are included in this directory to help diagnose
non-working setups.  To compile and run everything:  make everything
It will echo messages as things pass or fail.

ftest_f90.f90 is a simple Fortran 90 program without MPI.  It confirms
you have a working F90 compiler.  Try:  make ftest_f90  to compile only
this program.

ftest_nc.f90 is a non-MPI Fortran 90 program which opens and writes a
netCDF file (ftestdata.nc).  If your netCDF library is not installed
where the NETCDF variable points in your mkmf.template file, you can use
this program to debug netCDF problems.  Try:  make ftest_nc  (to
compile), then:  ./ftest_nc  to run.  It should create a small netCDF
file called ftestdata.nc which can be dumped with
"ncdump ftestdata.nc".
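For reference, the core of a netCDF link test like this is small.  The
following is a minimal sketch (not the shipped ftest_nc.f90 source),
assuming the Fortran 90 netCDF interface.  If it compiles with your
NETCDF setting and runs, the library and module paths are correct:

    program hello_nc
    ! A minimal sketch, not the actual ftest_nc.f90 test.
    use netcdf
    implicit none

    integer :: istat, ncid

    ! create a tiny file; NF90_CLOBBER overwrites any existing copy
    istat = nf90_create('sketchdata.nc', NF90_CLOBBER, ncid)
    if (istat /= NF90_NOERR) stop 'nf90_create failed'

    istat = nf90_close(ncid)
    if (istat /= NF90_NOERR) stop 'nf90_close failed'

    print *, 'netCDF library appears to be working'
    end program hello_nc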
ftest_stop.f90 and ftest_go.f90 are MPI test programs for the async=4
MPI model advance option.  If you are using async = 4 for filter and are
having problems, try running this pair.  In particular, pay attention to
the hostname printed by the script and compare it to the Process 0
message from ftest_stop.  If they are *not* the same, you will need a
patch to your runme script.  Email us for more help.  Try:  make async4
to run this test.  You may have to edit the Makefile to select the
proper batch system.

ctest.c is a C language program.  The Fortran executables have no
dependency on C, but if there is any question about whether there is a
working C compiler on the system, try:  make ctest

ctest_mpi.c is a C language program which uses the C versions of the MPI
library calls.  DART does not have any C code in it, but it is possible
that the MPI libraries were compiled without the Fortran interfaces.  If
this routine compiles and runs but the Fortran ones do not, that might
be a useful clue.  Try:  make ctest_mpi  (to compile), then:
make run_c  (to execute).  You may need to select the proper batch
system submit command in the Makefile (bsub, qsub, or neither).

ctest_nc.c is a non-MPI C language program which uses the C versions of
the netCDF library calls.  Again, DART has no C code in it, but it is
possible that the netCDF libraries were compiled without the Fortran
interfaces.  If this routine compiles and runs but the Fortran ones do
not, that might be a useful clue.  Try:  make ctest_nc  (to compile),
then:  ./ctest_nc  to run.  It should create a small netCDF file called
ctestdata.nc which can be dumped with "ncdump ctestdata.nc".

Any questions, email me at:  nancy@ucar.edu

Good luck -
nancy collins

#
# $URL$
# $Revision$
# $Date$