Contact: | Nancy Collins |
Revision: | $Revision: 2801 $ |
Source: | $URL: http://subversion.ucar.edu/DAReS/DART/trunk/mpi_utilities/mpi_utilities_mod.html $ |
Change Date: | $Date: 2007-04-04 22:11:48 -0600 (Wed, 04 Apr 2007) $ |
Change history: | try "svn log" or "svn diff" |
This module provides subroutines which access the MPI (Message Passing Interface) parallel communications library. To compile without using MPI, substitute null_mpi_utilities_mod.f90 for this one.
types_mod
utilities_mod
time_manager_mod
mpi (or mpif.h if mpi module not available)
character(len=*), intent(in), optional :: progname
character(len=*), intent(in), optional :: alternatename
Initializes the MPI library, creates a private communicator, stores the total number of tasks and the local task number for later use, and registers this module. This routine calls initialize_utilities() internally before returning, so the calling program need only call this one routine to initialize the DART internals.
On some implementations of MPI (in particular some variants of MPICH) it is best to initialize MPI before any I/O is done from any of the parallel tasks, so this routine should be called as close to the process startup as possible.
It is not an error to try to initialize the MPI library more than once. It is still necessary to call this routine even if the application itself has already initialized the MPI library. This routine creates a private communicator so that internal communications are shielded from any other communication outside the DART libraries.
It is an error to call any of the other routines in this file before calling this routine.
progname | If given, written to the log file to document which program is being started. |
alternatename | If given, use this name as the log file instead of the default dart_log.out. |
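A minimal usage sketch is shown below; the program name 'my_prog' is purely illustrative.

   program my_prog
   use mpi_utilities_mod, only : initialize_mpi_utilities, finalize_mpi_utilities
   implicit none
   ! Initialize MPI and the DART internals as early as possible.
   call initialize_mpi_utilities(progname='my_prog')
   ! ... the rest of the program runs with MPI available ...
   call finalize_mpi_utilities()
   end program my_prog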
logical, intent(in), optional :: callfinalize
integer, intent(in), optional :: async
Frees the local communicator, and shuts down the MPI library unless callfinalize is specified and is .FALSE.. On some hardware platforms it is problematic to try to call print or write from the parallel tasks after finalize has been executed, so this should only be called immediately before the process is ready to exit. This routine does an MPI_Barrier() call before calling MPI_Finalize() to ensure all tasks are finished writing.
If the application itself is using MPI the callfinalize argument can be used to defer closing the MPI library until the application does it itself. This routine does close the DART log file and releases the local communicator even if not calling MPI_Finalize, so no other DART routines which might generate output can be used after calling this routine.
It is an error to call any of the other routines in this file after calling this routine.
callfinalize | If false, do not call the MPI_Finalize() routine. |
async | If the model advance mode (selected by the async namelist value in the filter_nml section) requires any synchronization or actions at shutdown, this is done. Currently async=4 requires an additional set of actions at shutdown time. |
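As a hedged sketch, an application which manages the MPI library itself might defer the final shutdown like this (ierror is a local variable introduced only for this example):

   integer :: ierror
   ! Close the DART log file and release the private communicator,
   ! but leave the MPI library running for the application's own use.
   call finalize_mpi_utilities(callfinalize=.false.)
   ! The application is now responsible for shutting MPI down itself.
   call MPI_Finalize(ierror)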
integer :: task_count
Returns the total number of MPI tasks this job was started with. Note that MPI task numbers start at 0, but this is a count. So a 4-task job will return 4 here, but the actual task numbers will be from 0 to 3.
var | Total number of MPI tasks in this job. |
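For example (ntasks is an illustrative local variable):

   integer :: ntasks
   ntasks = task_count()   ! returns 4 for a job started with 4 MPI tasks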
integer :: my_task_id
Returns the local MPI task number. This is one of the routines in which all tasks can make the same function call but each returns a different value. The return can be useful in creating unique filenames or otherwise distinguishing resources which are not shared amongst tasks. MPI task numbers start at 0, so valid task id numbers for a 4-task job will be 0 to 3.
var | My unique MPI task id number. |
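A common use is building a per-task filename; the name pattern below is only an illustration:

   character(len=64) :: fname
   ! Produces member_0000 on task 0, member_0001 on task 1, and so on.
   write(fname, '(A,I4.4)') 'member_', my_task_id()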
Synchronize tasks. This call does not return until all tasks have called this routine. This ensures all tasks have reached the same place in the code before proceeding. All tasks must make this call or the program will hang.
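A hedged sketch of a typical use, making sure all tasks have finished their local work before task 0 reports:

   ! ... each task completes its local work here ...
   call task_sync()
   if (iam_task0()) write(*,*) 'all tasks have reached the synchronization point'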
integer, intent(in) :: dest_id
real(r8), dimension(:), intent(in) :: srcarray
type(time_type), intent(in), optional :: time
Use the MPI library to send a copy of an array of data from one task to another task. The sending task makes this call; the receiving task must make a corresponding call to receive_from().
If time is specified, it is also sent to the receiving task. The receiving call must match this sending call regarding this argument; if time is specified here it must also be specified in the receive; if not given here it cannot be given in the receive.
The current implementation uses MPI_Ssend() which does a synchronous send. That means this routine will not return until the receiving task has called the receive routine to accept the data. This may be subject to change; MPI has several other non-blocking options for send and receive.
dest_id | The MPI task id of the receiver. |
srcarray | The data to be copied to the receiver. |
time | If specified, send the time as well. |
The send and receive subroutines must be used with care. These calls must be used in pairs; the sending task and the receiving task must make corresponding calls or the tasks will hang. Calling them with different array sizes will result in either a run-time error or a core dump. The optional time argument must either be given in both calls or in neither or one of the tasks will hang. (Executive summary: There are lots of ways to go wrong here.)
integer, intent(in) :: src_id
real(r8), dimension(:), intent(out) :: destarray
type(time_type), intent(out), optional :: time
Use the MPI library to receive a copy of an array of data from another task. The receiving task makes this call; the sending task must make a corresponding call to send_to(). Unpaired calls to these routines will result in the tasks hanging.
If time is specified, it is also received from the sending task. The sending call must match this receiving call regarding this argument; if time is specified here it must also be specified in the send; if not given here it cannot be given in the send.
The current implementation uses MPI_Recv() which does a synchronous receive. That means this routine will not return until the data has arrived in this task. This may be subject to change; MPI has several other non-blocking options for send and receive.
src_id | The MPI task id of the sender. |
destarray | The location where the data from the sender is to be placed. |
time | If specified, receive the time as well. |
See the notes section of send_to().
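A hedged sketch of a matched pair of calls between tasks 0 and 1 follows; the array size, variable names, and the use of set_time() from time_manager_mod are illustrative only. Note that the optional time argument appears on both sides or on neither.

   real(r8)        :: work(100)
   type(time_type) :: t

   work = 0.0_r8
   t    = set_time(0, 0)
   if (my_task_id() == 0) then
      ! Task 0 sends its array and a time to task 1 ...
      call send_to(1, work, t)
   else if (my_task_id() == 1) then
      ! ... and task 1 makes the corresponding receive call.
      call receive_from(0, work, t)
   endif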
integer, intent(in) :: exit_code
A replacement for calling the Fortran intrinsic exit. This routine calls MPI_Abort() to kill all MPI tasks associated with this job. This ensures one task does not exit silently and leave the rest hanging. This is not the same as calling finalize_mpi_utilities() which waits for the other tasks to finish, flushes all messages, closes log files cleanly, etc. This call immediately and abruptly halts all tasks associated with this job.
Depending on the MPI implementation and job control system, the exit code may or may not be passed back to the calling job script.
exit_code | A numeric exit code. |
It would generally be helpful to write out some kind of message before calling this routine, to indicate where in the code it is dying. The eventual intent is that this routine will be called by the standard error handler in the utilities module after writing out the pending error message, but right now that would create a circular reference in the modules since the mpi module uses the utilities module already for other services.
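For example, a hedged sketch in which a task reports where it failed before aborting everything (istatus and the exit code 99 are illustrative):

   if (istatus /= 0) then
      write(*,*) 'fatal: could not read input file on task ', my_task_id()
      call exit_all(99)
   endif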
real(r8), dimension(:), intent(inout) :: array
integer, intent(in) :: root
All tasks must make this call together, but the behavior in each task differs depending on whether it is the root or not. On the task which has a task id equal to root the contents of the array will be sent to all other tasks. On any task which has a task id not equal to root the array is the location where the data is to be received into. Thus array is intent(in) on root, and intent(out) on all other tasks.
When this routine returns, all tasks will have the contents of the root array in their own arrays.
array | Array containing data to send to all other tasks, or the location in which to receive data. |
root | Task ID which will be the data source. All others are destinations. |
This is another of the routines which must be called by all tasks. The MPI call used here is synchronous, so all tasks block here until everyone has called this routine.
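A hedged sketch in which task 0 supplies the data and every task ends up with a copy (the array size and contents are illustrative):

   real(r8) :: values(20)

   if (my_task_id() == 0) values = 1.0_r8   ! only the root's contents matter
   ! Every task makes this call; on return all tasks hold the root's data.
   call array_broadcast(values, 0)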
logical :: iam_task0
Returns .TRUE. if called from the task with MPI task id 0. Returns .FALSE. in all other tasks. It is frequently the case that some code should execute only on a single task. This allows one to easily write a block surrounded by if (iam_task0()) then ... .
var | Convenience function to easily test and execute code blocks on task 0 only. |
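For example:

   if (iam_task0()) then
      ! Only task 0 prints the startup banner.
      write(*,*) 'running with ', task_count(), ' MPI tasks'
   endif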
integer, intent(in) :: from
real(r8), dimension(:), intent(inout) :: array1
real(r8), dimension(:), intent(inout), optional :: array2
real(r8), dimension(:), intent(inout), optional :: array3
real(r8), dimension(:), intent(inout), optional :: array4
real(r8), dimension(:), intent(inout), optional :: array5
real(r8), intent(inout), optional :: scalar1
real(r8), intent(inout), optional :: scalar2
real(r8), intent(inout), optional :: scalar3
real(r8), intent(inout), optional :: scalar4
real(r8), intent(inout), optional :: scalar5
Cover routine for array_broadcast(). This call must be matched with the companion call broadcast_recv(). This routine should only be called on the task which is the root of the broadcast; it will be the data source. All other tasks must call broadcast_recv(). This routine sends up to 5 data arrays and 5 scalars in a single call. A common pattern in the DART filter code is sending 2 arrays, but other combinations exist. This routine ensures that from is the same as the current task ID. The arguments to this call must be matched exactly in number and type with the companion call to broadcast_recv() or an error (or hang) will occur.
In reality the data here are intent(in) only but this routine will be calling array_broadcast() internally and so must be intent(inout) to match.
from | Current task ID; the root task for the data broadcast. |
array1 | First data array to be broadcast. |
array2 | If given, second data array to be broadcast. |
array3 | If given, third data array to be broadcast. |
array4 | If given, fourth data array to be broadcast. |
array5 | If given, fifth data array to be broadcast. |
scalar1 | If given, first data scalar to be broadcast. |
scalar2 | If given, second data scalar to be broadcast. |
scalar3 | If given, third data scalar to be broadcast. |
scalar4 | If given, fourth data scalar to be broadcast. |
scalar5 | If given, fifth data scalar to be broadcast. |
This is another of the routines which must be called consistently; only one task makes this call and all other tasks call the companion broadcast_recv routine. The MPI call used here is synchronous, so all tasks block until everyone has called one of these two routines.
integer, intent(in) :: from
real(r8), dimension(:), intent(inout) :: array1
real(r8), dimension(:), intent(inout), optional :: array2
real(r8), dimension(:), intent(inout), optional :: array3
real(r8), dimension(:), intent(inout), optional :: array4
real(r8), dimension(:), intent(inout), optional :: array5
real(r8), intent(inout), optional :: scalar1
real(r8), intent(inout), optional :: scalar2
real(r8), intent(inout), optional :: scalar3
real(r8), intent(inout), optional :: scalar4
real(r8), intent(inout), optional :: scalar5
Cover routine for array_broadcast(). This call must be matched with the companion call broadcast_send(). This routine must be called on all tasks which are not the root of the broadcast; the arguments specify the location in which to receive data from the root. (The root task should call broadcast_send().) This routine receives up to 5 data arrays and 5 scalars in a single call. A common pattern in the DART filter code is receiving 2 arrays, but other combinations exist. This routine ensures that from is not the same as the current task ID. The arguments to this call must be matched exactly in number and type with the companion call to broadcast_send() or an error (or hang) will occur.
In reality the data arrays here are intent(out) only but this routine will be calling array_broadcast() internally and so must be intent(inout) to match.
from | The task ID for the data broadcast source. |
array1 | First array location to receive data into. |
array2 | If given, second data array to receive data into. |
array3 | If given, third data array to receive data into. |
array4 | If given, fourth data array to receive data into. |
array5 | If given, fifth data array to receive data into. |
scalar1 | If given, first data scalar to receive data into. |
scalar2 | If given, second data scalar to receive data into. |
scalar3 | If given, third data scalar to receive data into. |
scalar4 | If given, fourth data scalar to receive data into. |
scalar5 | If given, fifth data scalar to receive data into. |
This is another of the routines which must be called consistently; all tasks but one make this call and exactly one other task calls the companion broadcast_send routine. The MPI call used here is synchronous, so all tasks block until everyone has called one of these two routines.
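A hedged sketch of the usual pairing follows; owner, ens_mean, and ens_sd are illustrative names, not the variables used in the real filter code.

   integer  :: owner                 ! id of the broadcasting task, set elsewhere
   real(r8) :: ens_mean(100), ens_sd(100)

   if (my_task_id() == owner) then
      ! Exactly one task is the source of the broadcast ...
      call broadcast_send(owner, ens_mean, ens_sd)
   else
      ! ... and every other task receives into matching arrays.
      call broadcast_recv(owner, ens_mean, ens_sd)
   endif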
integer, intent(in) :: addend
integer, intent(out) :: sum
All tasks call this routine, each with their own different addend. The returned value in sum is the total of the values summed across all tasks, and is the same for each task.
addend | Single input value per task to be summed up. |
sum | The sum of the addend values across all tasks; the same on every task. |
This is another of those calls which must be made from each task, and the calls block until this is so.
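For example (my_count and total_count are illustrative names):

   integer :: my_count, total_count

   my_count = 10                          ! each task contributes its own value
   call sum_across_tasks(my_count, total_count)
   ! total_count now holds the same grand total on every task.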
Create a named pipe (fifo) and read from it to block the process in such a way that it consumes no CPU time. Beware that once you put yourself to sleep you cannot wake yourself up. Some other MPI program must call restart_task() on the same set of processors the original program was distributed over.
Even though fifos appear to be files, in reality they are implemented in the kernel. The write into the fifo must be executed on the same node as the read is pending on. See the man pages for the mkfifo(1) command for more details.
Write into the pipe to restart the reading task. Note that this must be an entirely separate executable from the one which called block_task(), because it is asleep like Sleeping Beauty and cannot wake itself. See filter and wakeup_filter for examples of a program pair which uses these calls in async=4 mode.
Even though fifos appear to be files, in reality they are implemented in the kernel. The write into the fifo must be executed on the same node as the read is pending on. See the man pages for the mkfifo(1) command for more details.
integer, intent(in) :: async
For async=4 and task id = 0, write into the main filter-to-script fifo to tell the run script that filter is exiting. Does nothing else otherwise.
Even though fifos appear to be files, in reality they are implemented in the kernel. The write into the fifo must be executed on the same node as the read is pending on. See the man pages for the mkfifo(1) command for more details.
integer :: shell_execute
character(len=*), intent(in) :: execute_string
logical, intent(in), optional :: serialize
Wrapper routine around the system() library function to execute shell-level commands from inside the Fortran program. It waits for the command to complete and returns the exit code; 0 means success, and any other value indicates an error.
rc | Return code from the shell exit after the command has been executed. |
execute_string | Command to be executed by the shell. |
serialize | If specified and if .TRUE. run the command from each PE in turn, waiting for each to complete before beginning the next. The default is .FALSE. and does not require that all tasks call this routine. If given and .TRUE. then all tasks must make this call. |
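A hedged sketch; the script name is illustrative:

   integer :: rc

   ! Run the command from each task in turn, one at a time.
   rc = shell_execute('./advance_model.csh', serialize=.true.)
   if (rc /= 0) call exit_all(rc)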
real(r8), intent(in) :: naplength
Wrapper routine for the sleep command. The argument is a real value in seconds. Some systems have a coarser minimum resolution for how long they will sleep; this routine may round up to whole seconds if a value smaller than 1.0 is given.
naplength | Number of seconds to sleep as a real value. |
The amount of time this routine will sleep is not precise and might be in units of whole seconds on some platforms.
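For example:

   ! Pause roughly half a second between polling attempts; on some
   ! platforms this may round up to a whole second.
   call sleep_seconds(0.5_r8)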
We adhere to the F90 standard of starting a namelist with an ampersand '&' and terminating with a slash '/'.
No namelist interfaces are currently defined for this module, but at some point in the future an optional namelist interface &mpi_utilities_nml may be supported. It would be read from file input.nml.
One anticipated addition would be a list of task numbers in which the default option is to print out all informational messages; the current default is that messages are only printed from task 0 and all others suppressed. (Warnings and Errors print in all cases.)
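If such a namelist were ever added, the entry in input.nml would follow the usual form; this is a purely hypothetical illustration with no entries:

   &mpi_utilities_nml
   /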
Depending on the implementation of MPI, the library routines are either defined in an include file (mpif.h) or in a proper Fortran 90 module (use mpi). If it is available the module is preferred; it allows for better argument checking and optional arguments support in the MPI library calls.
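For reference, the two forms look like this in the source; only one of them is compiled in, selected when the module is built:

   use mpi              ! preferred: F90 module, allows argument checking
   ! or, when no module is available:
   ! include 'mpif.h'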
If MPI returns an error, the DART error handler is called with the numeric error code it received from MPI. See any of the MPI references for an up-to-date list of error codes.