Institute for Mathematics Applied to Geosciences (IMAGe)

GSP's guide to netCDF format data and the 'R' package 'ncdf'.

netCDF is a common, self-describing, portable binary format for geophysical data. GSP made an executive decision earlier this year (i.e. Tim and Doug talked after lunch) to use this format as much as possible when creating or manipulating data sets. For the statistical readership we should note that there are contributed packages for R that allow for the efficient reading and writing of netCDF files and part of the intent of this web page is to provide some simple examples to get users started. Some advantages of this format are:



A brief description of netCDF

With regards to netCDF, a little philosophy goes a long way. A netCDF file is intended to provide all the information needed to interpret the data, as well as the data itself. The information about the length of each dimension, how many dimensions, the units of the quantity, etc. are all contained in a netCDF file. They are intended to be self-describing. The netCDF format is flexible enough to allow for a tremendous variety of incantations. Some netCDF files contain the data for (a set of) radiosonde instruments, some represent the output of a General Circulation Model (GCM), or a set of observations taken at a weather station, for example. All three cases will be explored in the following discussion.

The netCDF file can be broken down into logical parts. To that end, lets take a look at the header of a very simple netCDF file.

netCDF example {
dimensions:
        EW = 87 ;
        SN = 61 ;
variables:
        double EW(EW) ;
                EW:units = "meters" ;
        double SN(SN) ;
                SN:units = "meters" ;
        float Elevation(SN, EW) ;
                Elevation:units = "meters" ;
                Elevation:missing_value = -1.f ;
}

There are a set of integer scalars called dimensions that declare the lengths of the following variables. Each of the dimensions has a name. In this case, there are two dimensions, 'EW' and 'SN'. There is one special dimension: the unlimited dimension. This is a dynamically adjustable dimension that can change throughout the creation of the file with no performance penalty. Most of the time, the unlimited dimension is 'time' and as the results for the GCM are created, they are appended to the existing variables (for example).

There are three variables in this netCDF file, but two of them have the same name as the dimensions! Those two are precisely referred to as coordinate variables (as opposed to just 'variable'). These are the variables that will aid in the interpretation of the dataset. Ever get a matrix from someone who says it is written 'lat by lon'? Well -- WHICH lats? Which lons? The coordinate variables uniquely describe this sort of information. You can be assured that any variable with a dimension 'EW' (for example) is ordered the same as the coordinate variable 'EW'. Put another way, the coordinate variables are great for labelling the axes.

Finally, we get to a variable that is not named the same as a dimension. This is simply called a variable, and may be nearly any 'regular' shape. The big restriction in netCDF is that there are no 'ragged' objects. In order to regularize the dimension, extensive use of special 'missing values' may be needed. The station data netCDF files will explain further. In this case, 'Elevation' is a two-dimensional variable, but netCDF supports much higher dimensional objects. I think the max is up around 8 or 9, well beyond my mental grasp.

There's more. variables can have attributes. In this case both 'EW' and 'NS' have units of 'meters'. Usually a good thing to know -- and annotate! The netCDF interface allows you to query and recover the string 'meters' for just that purpose. Furthermore, the variables can have different storage modes or types -- char, 16 bit integers, 32 bit integers, 32 bit reals (float), 64 bit reals (double) ... The storage mode is entirely separate from the shape of the variable. In fact, usually you just ignore the storage mode altogether unless you are trying to create a compact representation of some data.

netCDF 3.x vs. netCDF 4.x and ncdf vs. ncdf4

There are two fundamentally different libraries for netCDF, with a set of very similar user interface routines. The user interface routines would be identical except that netCDF 4 has added functionality.

From here on, I'm just going to refer to the 'ncdf' package. I really have not used the 'ncdf4' package enough to be able to write useful examples.

Installing the ncdf package -- WINDOWS

We tried to install ncdf 1.5 with R 2.0.1 to no avail but were successful with other combinations. You do not need to install the netcdf libraries from UNIDATA before installing the ncdf package under Windows.

  1. Note your R version in case there are troubles later
  2. From within R, click "Packages" from the top menu bar
  3. Scroll down to "install package(s)"
  4. Select "ncdf" from the list of available packages
  5. Click to install

Installing the ncdf package -- Linux, UNIX, OS X, ...

Is a two-step process under these operating systems. First, you have to instal the netCDF libraries -- and then you can install the ncdf package. The ncdf package needs to know the location of the local netCDF installation to get the proper libraries and include files, etc.

Installing the netCDF libraries

The best guide is already written and is available at UNIDATA. You MUST have the netCDF libraries available for the R ncdf package to install correctly. There is no getting around it.

As a bonus, when you install the libraries, you also get 'ncdump' -- the command to dump the header and/or contents of a netCDF file in ASCII. If you really want to go crazy, download ncview, a great visualization tool for netCDF objects.

Installing the ncdp package

You must know where the netCDF libraries are installed on your system in order to install the ncdf package. They might be in a nice default location, but I don't live like that. There are two parts you need to find, the include files and the actual netCDF library. Once you find them (with the linux commands "locate netcdf.h" and "locate libnetcdf", for example) you can supply some optional arguments to the "R CMD INSTALL" command:

R CMD INSTALL --configure-args="-with-netcdf_incdir=/usr/local/netcdf/include -with-netcdf_libdir=/usr/local/netcdf/lib" ncdf_xx.yy.tar.gz

Examples for creating and reading netCDF format using R

The following examples assume that you have installed the netCDF libraries correctly as well as the ncdf package in R. You might also want to add our useful summary function for ncdf objects in R: summary.ncdf

Simple 2D example

Obtain the (binary) netCDF file: example.nc   [22kb -- tiny]. Most netCDF files are substantially more complicated than this simple example. From the unix prompt, you can dump the entire file (not particularly suggested) or the header of the file (suggested) in the following way:

ncdump -h example.nc

netCDF example {
dimensions:
        EW = 87 ;
        SN = 61 ;
variables:
        double EW(EW) ;
                EW:units = "meters" ;
        double SN(SN) ;
                SN:units = "meters" ;
        float Elevation(SN, EW) ;
                Elevation:units = "meters" ;
                Elevation:missing_value = -1.f ;
}

Dumping the entire file to ASCII can only be described as a bad habit. netCDF files are generally BIG and converting to ASCII makes things HUGE. The data can be imported directly into R using the ncdf package which has two main benefits -- it is faster and maintains the original precision. Converting to/from ASCII generally results in roundoff errors for anything not in integer format.

Reading a netCDF file into R

library(ncdf)
ex.nc = open.ncdf("example.nc")

This 'ex.nc' object has the information about the netCDF file example.nc but it does not contain the data! Retrieving various subsets of the data is done in separate steps. From within R you can get some information about this fil

print(ex.nc) 

[1] "file example.nc has 2 dimensions:"
[1] "EW 87"
[1] "SN 61"
[1] "------------------------"
[1] "file example.nc has 1 variables:"
[1] "Elevation" "EW"        "SN"        "Elevation" "-1"     

# or the more concise 
summary(ex.nc)

Elevation ( variable 1 ) has shape 87 61

These data consists of an 87X61 matrix of the variable 'Elevation' and the coordinate variables associated with the 2 dimensions: SN ( south - north axis ) and EW ( east-west axis). The elevations are from the R data set volcano. To retrieve the relevant variables and to make a surface plot:


y = get.var.ncdf( ex.nc, "SN")          # coordinate variable
x = get.var.ncdf( ex.nc, "EW")          # coordinate variable
z = get.var.ncdf( ex.nc, "Elevation")   # variable

# image plot of terrain cribbed from help(volcano)
filled.contour(x,y,z, color = terrain.colors, asp = 1)

Here are some examples of extracting only part of the variable. It would be wise to assume you will not be able to read the entire variable into R and then subset. The hottest topic in the netCDF user community right now is "large file" support. In the GCM community, a "large file" is over 2GB, anything less that that is "just a file" ...

z1 = get.var.ncdf( ex.nc, "Elevation", start=c(11,26), count=c( 5,5))
# compare to 
z[11:15, 26:30]

z2 = get.var.ncdf( ex.nc, "Elevation", start=c(11,1), count=c( 5,-1))
# compare to 
z[11:15,]

Creating a netCDF file from within R

Large projects create netCDF files using the low-level Unidata library routines. However, there are also functions for creating netCDF files within R. The following code creates the example.nc file used above. The reader should also refer to the help on put.var.ncdf from the ncdf R package for a good set of examples of creating multiple arrays.

# 
library(ncdf)
data(volcano)

# put the data in a handy form 
z = 10*volcano         # matrix of elevations
x = 100* (1:nrow(z))   # meter spacing (S to N)
y = 100* (1:ncol(z))   # meter spacing (E to W)

# define the netcdf coordinate variables -- note these have values!

dim1 = dim.def.ncdf( "EW","meters", as.double(x))
dim2 = dim.def.ncdf( "SN","meters", as.double(y))

# define the EMPTY (elevation) netcdf variable

varz = var.def.ncdf("Elevation","meters", list(dim1,dim2), -1, 
          longname="The Classic R New Zealand Volcano")

# associate the netcdf variable with a netcdf file   
# put the variable into the file, and
# close

nc.ex = create.ncdf( "example.nc", varz )
put.var.ncdf(nc.ex, varz, z)
close.ncdf(nc.ex)

note: I believe there is a bug in the ncdf library which is causing the longname attribute to be ignored. As far as I can tell -- it should be there.

Reading a variable with a 'time' coordinate.

Most temporal variables are recorded as a "time since" some start date. In accordance with several conventions, the time origin is recorded as an attribute for the time variable, as are the units. For example, if the header dump is as follows:

 double synTime(recNum) ;
                synTime:long_name = "Synoptic Time" ;
                synTime:units = "Seconds since (1970-1-1 00:00:0.0)" ;

You have to add Jan 1, 1970 to the values of "synTime" to be able to convert between synTime and some calendar date. In 'R' the chron library has a function (also called cron) to determine the number of days since some origin. If you want to be really clever, you can grab the synTime:units attribute (if it is POSIX-compliant) and use it as the origin for the chron command. Be sure to take a look at the R commands for POSIX manipulation - POSIXlt, POSIXct

library(chron)
snytime = get.var.ncdf(ex.nc,'synTime') 
a = chron(syntime/86400, origin=c(month=1,day=1,year=1970))

More complicated access of multidimensional arrays and time variables

Obtain the netCDF file: Daily_b06_45.nc   [193kb -- still tiny].

library(fields)
library(ncdf)
library(chron)

# This dataset happens to be the control experiment
control = open.ncdf(con="Daily_b06_45.nc", write=FALSE, readunlim=FALSE)

cat(paste(control$filename,"has",control$nvars,"variables"), fill=TRUE)

lonmat  = get.var.ncdf(nc=control,varid="longitude")   # reads entire matrix
latmat  = get.var.ncdf(nc=control,varid="latitude")    # ditto
timearr = get.var.ncdf(nc=control,varid="time")        # reads entire time array

plot( lonmat, latmat, main='The (MM5) grid locations')
US( add=T )

# source summary function look at this for the gory details of accessing
# details of ncdf object

source("summary.ncdf.r")

# this prints out sensible information
summary(control)

#----------------------------------------------------------------------
# Find a day of interest and read the other variables for JUST that day.
# netCDF has some "philosophy" associated with this
# We need to make a "start" and "count" array
#----------------------------------------------------------------------

# It is very common for netCDF files to be created with a monotonic time array.
# Enough information should be given in the time attributes to decode the time 
# array into a calendar date. R has several ways to convert calendar dates 
# into monotonic time arrays -- 

# the metadata in the netcdf file indicates the first time is 1 July 1995
# and the base time is in units such that noon on 1 Jan 1601 is t = 0.5.

t1 = julian(x=7, d=1, y=1995,  origin=c(month = 1, day = 1, year = 1601))
t1
timearr[1]

# given that, there's lots of ways to find a specific date

targettime = julian(x=7,d=4,y=1995,origin=c(month = 1, day = 1, year = 1601))
inds       = (1:dim(timearr))
tind       = inds[targettime == timearr]

# tind = 4,  the 4th time point in this case.

ndims    = control$var[['PRCP']]$ndims      # also works (3 in this case)
varsize  = control$var[['PRCP']]$varsize    # 61 49 7

start = c(    1,          1,      tind)     # knowing time is the last dimension
count = c(varsize[1], varsize[2],    1)
# read in this slice of the data

cntlprcp1 = get.var.ncdf(nc=control,varid="PRCP",start,count)

x = 1:nrow(cntlprcp1)   # R plots rows along X axis!
y = 1:ncol(cntlprcp1)
image.plot(x,y,cntlprcp1,col=tim.colors())

# If you want to get 3 timesteps (starting from a date), for example
# This also illustrates a special case if you want ENTIRE dimensions

start = c(   1,  1, tind)
count = c(  -1, -1,   3)
cntlprcp2 = get.var.ncdf(nc=control,varid="PRCP",start,count)
# plot 'em up
set.panel(3,1)
for ( k in 1:3){
image.plot( x,y,cntlprcp2[,,k],col=tim.colors())
}

# an example of subsetting on the other space dimensions.
# Lets assume for a moment that you want to ignore the edges
# i=1,2, N-2,N-1,N ...
# (for just one timestep, you get the picture)

start = c(      3,            3,        tind)
count = c(  varsize[1]-5, varsize[2]-5,    1)
cntlprcp3 = get.var.ncdf(nc=control,varid="PRCP",start,count)

#
# Grabbing a time series for just one location (25,30)
# lon,lat location: lonmat[25,30],latmat[25,30]

start = c(      25,         30,       1)
count = c(  1, 1, -1)
cntltmax = get.var.ncdf(nc=control,varid="TMAX",start,count)
plot( cntltmax, type="h")

# The result of this matches the comments above ...
month.day.year(t1, origin=c(month = 1, day = 1, year = 1601) )

Station Data

Obtain the netCDF file: STN_050258.nc   [28kb -- miniscule]. PLEASE dump the header of this file! This is a 'real world' file. For the purpose of demonstration, I will assume you've done it and know there are the following variables:

station = open.ncdf(con="STN_050258.nc", write=FALSE, readunlim=FALSE)

cat(paste(station$filename,"has",station$nvars,"variables"), fill=TRUE)

lons    = get.var.ncdf(nc=station,varid="longitude")   # reads entire coordinate variable
lats    = get.var.ncdf(nc=station,varid="latitude")    # kinda boring, since the station
elev    = get.var.ncdf(nc=station,varid="elevation")   # didn't move. They usually do.
timearr = get.var.ncdf(nc=station,varid="time")        # reads entire time array
prcp    = get.var.ncdf(nc=station,varid="PRCP")        # reads entire precip array
tmin    = get.var.ncdf(nc=station,varid="TMIN")
tmax    = get.var.ncdf(nc=station,varid="TMAX")
tobs    = get.var.ncdf(nc=station,varid="TOBS")
snow    = get.var.ncdf(nc=station,varid="SNOW")

# plot 'em up
set.panel(5,1)
plot( timearr, snow, main='SNOW' )
plot( timearr, prcp, main='PRCP')
plot( timearr, tmin, main='TMIN')
plot( timearr, tmax, main='TMAX')
plot( timearr, tobs, main='TOBS')

# note that there are many more observations of snow and precip than temperature.
# The unobserved temperature data must be coded as 'missing' to use the same time
# coordinate variable.