GSP's guide to netCDF format data and the 'R' package 'ncdf'.
netCDF
is a common, self-describing, portable binary format for geophysical data.
GSP made an executive decision earlier this year (i.e. Tim and Doug talked
after lunch) to use this format as much as possible when creating or
manipulating data sets. For the statistical readership we
should note that there are contributed packages for R that allow for the
efficient reading and writing of netCDF files and part of the intent of
this web page is to provide some simple
examples to get users started.
Some advantages of this format are:
- The netCDF libraries to create and access files for many (all?) architectures
are free and available through
UNIDATA.
- A netCDF file not only contains the "data" but also a description
of the variables, the creation history, and any other important
attributes about the data set.
- A netCDF file is a reasonably compact portable binary format.
i.e. You can make one on a 'supercomputer' and read it on a PC.
- The netCDF interface can extract parts of a large data file without
having to read the entire record. Since many netCDF files are o~ Gigabytes,
this is important.
- netCDF is a standard format not only for geophysical observational
data but also for numerical model output, such as the NCAR community
climate system model.
- A contributed package in
R, ncdf is free,
readily available, and simplifies the interface to the otherwise
gory low-level routines available in the Unidata library.
A brief description of netCDF
With regards to netCDF, a little philosophy goes a long way.
A netCDF file is intended to provide all the information needed to interpret
the data, as well as the data itself. The information about the length of
each dimension, how many dimensions, the units of the quantity, etc. are all
contained in a netCDF file. They are intended to be self-describing.
The netCDF format is flexible enough to allow for a tremendous variety of
incantations. Some netCDF files contain the data for (a set of) radiosonde
instruments, some represent the output of a General Circulation Model (GCM),
or a set of observations taken at a weather station, for example. All three
cases will be explored in the following discussion.
The netCDF file can be broken down into logical parts. To that end, lets take
a look at the header of a very simple netCDF file.
netCDF example {
dimensions:
EW = 87 ;
SN = 61 ;
variables:
double EW(EW) ;
EW:units = "meters" ;
double SN(SN) ;
SN:units = "meters" ;
float Elevation(SN, EW) ;
Elevation:units = "meters" ;
Elevation:missing_value = -1.f ;
}
There are a set of integer scalars called dimensions that declare the
lengths of the following variables. Each of the dimensions has a name.
In this case, there are two dimensions, 'EW' and 'SN'.
There is one special dimension: the unlimited dimension.
This is a dynamically adjustable dimension that can change throughout the
creation of the file with no performance penalty. Most of the time, the unlimited
dimension is 'time' and as the results for the GCM are created, they are appended
to the existing variables (for example).
There are three variables in this netCDF file, but two of them
have the same name as the dimensions! Those two are precisely referred
to as coordinate variables (as opposed to just 'variable'). These
are the variables that will aid in the interpretation of the dataset.
Ever get a matrix from someone who says it is written 'lat by lon'?
Well -- WHICH lats? Which lons? The coordinate variables uniquely describe
this sort of information. You can be assured that any variable with
a dimension 'EW' (for example) is ordered the same as the
coordinate variable 'EW'. Put another way, the coordinate
variables are great for labelling the axes.
Finally, we get to a variable that is not named the same as a dimension.
This is simply called a variable, and may be nearly any 'regular' shape.
The big restriction in netCDF is that there are no 'ragged' objects. In order
to regularize the dimension, extensive use of special 'missing values' may be needed.
The station data netCDF files will explain further.
In this case, 'Elevation' is a two-dimensional variable, but netCDF supports much higher
dimensional objects. I think the max is up around 8 or 9, well beyond my mental grasp.
There's more. variables can have attributes. In this case both
'EW' and 'NS' have units of 'meters'. Usually a good thing to know -- and annotate!
The netCDF interface allows you to query and recover the string 'meters' for just that purpose.
Furthermore, the variables can have different storage modes or types -- char,
16 bit integers, 32 bit integers, 32 bit reals (float), 64 bit reals (double) ...
The storage mode is entirely separate from the shape of the variable.
In fact, usually you just ignore the storage mode altogether unless you are trying
to create a compact representation of some data.
netCDF 3.x vs. netCDF 4.x and ncdf vs. ncdf4
There are two fundamentally different libraries for netCDF, with a set of
very similar user interface routines. The user interface routines would be
identical except that netCDF 4 has added functionality.
- ncdf installs with the netCDF 3.x libraries and reads netCDF 3 format files.
- ncdf4 installs with the netCDF 4.x libraries and works with all netCDF files.
From here on, I'm just going to refer to the 'ncdf' package. I really have not
used the 'ncdf4' package enough to be able to write useful examples.
Installing the ncdf package -- WINDOWS
We tried to install ncdf 1.5 with R 2.0.1 to no avail but were successful with
other combinations. You do not need to install the netcdf libraries from
UNIDATA before installing the ncdf package under Windows.
- Note your R version in case there are troubles later
- From within R, click "Packages" from the top menu bar
- Scroll down to "install package(s)"
- Select "ncdf" from the list of available packages
- Click to install
Installing the ncdf package -- Linux, UNIX, OS X, ...
Is a two-step process under these operating systems.
First, you have to instal the netCDF libraries -- and then
you can install the ncdf package. The ncdf package needs to know
the location of the local netCDF installation to get the proper libraries and include
files, etc.
Installing the netCDF libraries
The best guide is already written and is available at
UNIDATA.
You MUST have the netCDF libraries available for the
R ncdf package to install correctly. There is no getting around it.
As a bonus, when you install the libraries, you also get 'ncdump' -- the command
to dump the header and/or contents of a netCDF file in ASCII. If you really want
to go crazy, download ncview,
a great visualization tool for netCDF objects.
Installing the ncdp package
You must know where the netCDF libraries are installed on your system in
order to install the ncdf package. They might be in a nice default location,
but I don't live like that. There are two parts you need to find, the include files
and the actual netCDF library. Once you find them (with the linux commands
"locate netcdf.h" and "locate libnetcdf", for example) you can supply some
optional arguments to the "R CMD INSTALL" command:
R CMD INSTALL --configure-args="-with-netcdf_incdir=/usr/local/netcdf/include -with-netcdf_libdir=/usr/local/netcdf/lib" ncdf_xx.yy.tar.gz
Examples for creating and reading netCDF format using R
The following examples assume that you have installed the netCDF
libraries correctly as well as the ncdf package in R.
You might also want to add our useful summary function for ncdf objects in R:
summary.ncdf
Simple 2D example
Obtain the (binary) netCDF file: example.nc [22kb -- tiny].
Most netCDF files are substantially more complicated than this simple example.
From the unix prompt, you can dump the entire file (not particularly suggested) or
the header of the file (suggested) in the following way:
ncdump -h example.nc
netCDF example {
dimensions:
EW = 87 ;
SN = 61 ;
variables:
double EW(EW) ;
EW:units = "meters" ;
double SN(SN) ;
SN:units = "meters" ;
float Elevation(SN, EW) ;
Elevation:units = "meters" ;
Elevation:missing_value = -1.f ;
}
Dumping the entire file to ASCII can only be described as a bad habit. netCDF files
are generally BIG and converting to ASCII makes things HUGE.
The data can be imported directly into R using the ncdf package which has two main
benefits -- it is faster and maintains the original precision. Converting to/from
ASCII generally results in roundoff errors for anything not in integer format.
Reading a netCDF file into R
library(ncdf)
ex.nc = open.ncdf("example.nc")
This 'ex.nc' object has the information about the netCDF file example.nc
but it does not contain the data! Retrieving various subsets of the data is done
in separate steps. From within R you can get some information about this fil
print(ex.nc)
[1] "file example.nc has 2 dimensions:"
[1] "EW 87"
[1] "SN 61"
[1] "------------------------"
[1] "file example.nc has 1 variables:"
[1] "Elevation" "EW" "SN" "Elevation" "-1"
# or the more concise
summary(ex.nc)
Elevation ( variable 1 ) has shape 87 61
These data consists of an 87X61 matrix of the variable 'Elevation' and
the coordinate variables associated with the 2 dimensions:
SN ( south - north axis ) and EW ( east-west axis).
The elevations are from the R data set volcano.
To retrieve the relevant variables and to make a surface plot:
y = get.var.ncdf( ex.nc, "SN") # coordinate variable
x = get.var.ncdf( ex.nc, "EW") # coordinate variable
z = get.var.ncdf( ex.nc, "Elevation") # variable
# image plot of terrain cribbed from help(volcano)
filled.contour(x,y,z, color = terrain.colors, asp = 1)
Here are some examples of extracting only part of the variable.
It would be wise to assume you will not be able to read the entire variable
into R and then subset. The hottest topic in the netCDF user community right
now is "large file" support. In the GCM community, a "large file" is over 2GB,
anything less that that is "just a file" ...
z1 = get.var.ncdf( ex.nc, "Elevation", start=c(11,26), count=c( 5,5))
# compare to
z[11:15, 26:30]
z2 = get.var.ncdf( ex.nc, "Elevation", start=c(11,1), count=c( 5,-1))
# compare to
z[11:15,]
Creating a netCDF file from within R
Large projects create netCDF files using the low-level Unidata library routines.
However, there are also functions for creating netCDF files within R.
The following code creates the example.nc file used above.
The reader should also refer to the help on put.var.ncdf from the ncdf R
package for a good set of examples of creating multiple arrays.
#
library(ncdf)
data(volcano)
# put the data in a handy form
z = 10*volcano # matrix of elevations
x = 100* (1:nrow(z)) # meter spacing (S to N)
y = 100* (1:ncol(z)) # meter spacing (E to W)
# define the netcdf coordinate variables -- note these have values!
dim1 = dim.def.ncdf( "EW","meters", as.double(x))
dim2 = dim.def.ncdf( "SN","meters", as.double(y))
# define the EMPTY (elevation) netcdf variable
varz = var.def.ncdf("Elevation","meters", list(dim1,dim2), -1,
longname="The Classic R New Zealand Volcano")
# associate the netcdf variable with a netcdf file
# put the variable into the file, and
# close
nc.ex = create.ncdf( "example.nc", varz )
put.var.ncdf(nc.ex, varz, z)
close.ncdf(nc.ex)
note: I believe there is a bug in the ncdf library which is causing the
longname attribute to be ignored. As far as I can tell -- it should be there.
Reading a variable with a 'time' coordinate.
Most temporal variables are recorded as a "time since" some start date.
In accordance with several conventions, the time origin is recorded as
an attribute for the time variable, as are the units.
For example, if the header dump is as follows:
double synTime(recNum) ;
synTime:long_name = "Synoptic Time" ;
synTime:units = "Seconds since (1970-1-1 00:00:0.0)" ;
You have to add Jan 1, 1970 to the values of "synTime" to be able to
convert between synTime and some calendar date. In 'R' the chron
library has a function (also called cron) to determine the number of
days since some origin. If you want to be really clever, you can grab
the synTime:units attribute (if it is POSIX-compliant) and use it as
the origin for the chron command. Be sure to take a look at the R
commands for POSIX manipulation - POSIXlt, POSIXct
library(chron)
snytime = get.var.ncdf(ex.nc,'synTime')
a = chron(syntime/86400, origin=c(month=1,day=1,year=1970))
More complicated access of multidimensional arrays and time variables
Obtain the netCDF file: Daily_b06_45.nc [193kb -- still tiny].
library(fields)
library(ncdf)
library(chron)
# This dataset happens to be the control experiment
control = open.ncdf(con="Daily_b06_45.nc", write=FALSE, readunlim=FALSE)
cat(paste(control$filename,"has",control$nvars,"variables"), fill=TRUE)
lonmat = get.var.ncdf(nc=control,varid="longitude") # reads entire matrix
latmat = get.var.ncdf(nc=control,varid="latitude") # ditto
timearr = get.var.ncdf(nc=control,varid="time") # reads entire time array
plot( lonmat, latmat, main='The (MM5) grid locations')
US( add=T )
# source summary function look at this for the gory details of accessing
# details of ncdf object
source("summary.ncdf.r")
# this prints out sensible information
summary(control)
#----------------------------------------------------------------------
# Find a day of interest and read the other variables for JUST that day.
# netCDF has some "philosophy" associated with this
# We need to make a "start" and "count" array
#----------------------------------------------------------------------
# It is very common for netCDF files to be created with a monotonic time array.
# Enough information should be given in the time attributes to decode the time
# array into a calendar date. R has several ways to convert calendar dates
# into monotonic time arrays --
# the metadata in the netcdf file indicates the first time is 1 July 1995
# and the base time is in units such that noon on 1 Jan 1601 is t = 0.5.
t1 = julian(x=7, d=1, y=1995, origin=c(month = 1, day = 1, year = 1601))
t1
timearr[1]
# given that, there's lots of ways to find a specific date
targettime = julian(x=7,d=4,y=1995,origin=c(month = 1, day = 1, year = 1601))
inds = (1:dim(timearr))
tind = inds[targettime == timearr]
# tind = 4, the 4th time point in this case.
ndims = control$var[['PRCP']]$ndims # also works (3 in this case)
varsize = control$var[['PRCP']]$varsize # 61 49 7
start = c( 1, 1, tind) # knowing time is the last dimension
count = c(varsize[1], varsize[2], 1)
# read in this slice of the data
cntlprcp1 = get.var.ncdf(nc=control,varid="PRCP",start,count)
x = 1:nrow(cntlprcp1) # R plots rows along X axis!
y = 1:ncol(cntlprcp1)
image.plot(x,y,cntlprcp1,col=tim.colors())
# If you want to get 3 timesteps (starting from a date), for example
# This also illustrates a special case if you want ENTIRE dimensions
start = c( 1, 1, tind)
count = c( -1, -1, 3)
cntlprcp2 = get.var.ncdf(nc=control,varid="PRCP",start,count)
# plot 'em up
set.panel(3,1)
for ( k in 1:3){
image.plot( x,y,cntlprcp2[,,k],col=tim.colors())
}
# an example of subsetting on the other space dimensions.
# Lets assume for a moment that you want to ignore the edges
# i=1,2, N-2,N-1,N ...
# (for just one timestep, you get the picture)
start = c( 3, 3, tind)
count = c( varsize[1]-5, varsize[2]-5, 1)
cntlprcp3 = get.var.ncdf(nc=control,varid="PRCP",start,count)
#
# Grabbing a time series for just one location (25,30)
# lon,lat location: lonmat[25,30],latmat[25,30]
start = c( 25, 30, 1)
count = c( 1, 1, -1)
cntltmax = get.var.ncdf(nc=control,varid="TMAX",start,count)
plot( cntltmax, type="h")
# The result of this matches the comments above ...
month.day.year(t1, origin=c(month = 1, day = 1, year = 1601) )
Station Data
Obtain the netCDF file: STN_050258.nc [28kb -- miniscule].
PLEASE dump the header of this file! This is a 'real world' file. For the purpose
of demonstration, I will assume you've done it and know there are the following variables:
station = open.ncdf(con="STN_050258.nc", write=FALSE, readunlim=FALSE)
cat(paste(station$filename,"has",station$nvars,"variables"), fill=TRUE)
lons = get.var.ncdf(nc=station,varid="longitude") # reads entire coordinate variable
lats = get.var.ncdf(nc=station,varid="latitude") # kinda boring, since the station
elev = get.var.ncdf(nc=station,varid="elevation") # didn't move. They usually do.
timearr = get.var.ncdf(nc=station,varid="time") # reads entire time array
prcp = get.var.ncdf(nc=station,varid="PRCP") # reads entire precip array
tmin = get.var.ncdf(nc=station,varid="TMIN")
tmax = get.var.ncdf(nc=station,varid="TMAX")
tobs = get.var.ncdf(nc=station,varid="TOBS")
snow = get.var.ncdf(nc=station,varid="SNOW")
# plot 'em up
set.panel(5,1)
plot( timearr, snow, main='SNOW' )
plot( timearr, prcp, main='PRCP')
plot( timearr, tmin, main='TMIN')
plot( timearr, tmax, main='TMAX')
plot( timearr, tobs, main='TOBS')
# note that there are many more observations of snow and precip than temperature.
# The unobserved temperature data must be coded as 'missing' to use the same time
# coordinate variable.