The use of covariance matrices in dimension reduction for space-time data
Ian Jolliffe
Universities of Exeter, Kent, Aberdeen
ian@sandloch.fsnet.co.uk

Outline of talk
How is dimensionality reduced? Concentrate on principal component analysis (EOF analysis) which uses covariance or correlation matrices
Definitions
Implementation
Interpretation
Choices
Simplification
Extensions (if time permits)
 to two (or more) groups of variables
 to three or more modes
Some concluding remarks

PCA and EOFs
Principal component analysis (Hotelling, 1933)
Empirical orthogonal functions (Lorenz, 1956)
Other names too
Reduces dimensionality by finding linear combinations of a large set of variables that successively maximise variance
Limitations
Can be more difficult to interpret than using a subset of the original variables, but typically not for space-time data
 Linearity. Non-linear versions exist – not discussed here.
Uses only covariances, not higher-order moments – see independent component analysis (ICA)

PCA – some definitions, terminology
If x is a vector of p variables, then the principal components (PCs) are the linear combinations a_1^T x, a_2^T x, …, a_p^T x
Although we can find p PCs, and sometimes the last few are useful (e.g. in finding outliers), for dimension reduction purposes we usually keep only the first few
In the kth PC, a_k, the vector of coefficients or loadings, is chosen so that the variance of a_k^T x is maximised, subject to the normalisation constraint a_k^T a_k = 1 and to successive PCs being uncorrelated

Finding PCs/EOFs
The optimisation problem which defines PCs turns out, like many in multivariate analysis, to be an eigenvalue problem
The variances of the PCs are eigenvalues of the covariance (or correlation) matrix of x, in descending order, and the vectors of coefficients ak are the corresponding eigenvectors
This is the usual way of finding PCs, though other algorithms exist, e.g. using the singular value decomposition of the column-centred data matrix
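As a numerical sketch (synthetic data, not from the talk), both routes can be checked with NumPy: the eigendecomposition of the covariance matrix and the SVD of the column-centred data matrix give the same EOFs, with λ_k = σ_k² / (n − 1).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # 100 observations of 5 variables
Xc = X - X.mean(axis=0)                  # column-centre the data matrix
n = Xc.shape[0]

# Route 1: eigendecomposition of the sample covariance matrix
S = Xc.T @ Xc / (n - 1)
evals, evecs = np.linalg.eigh(S)         # eigh returns ascending order
evals, evecs = evals[::-1], evecs[:, ::-1]

# Route 2: SVD of the centred data matrix
U, sing, Vt = np.linalg.svd(Xc, full_matrices=False)

# Singular values relate to eigenvalues via lambda_k = sigma_k^2 / (n - 1)
assert np.allclose(sing**2 / (n - 1), evals)
for k in range(5):                       # eigenvectors agree up to sign
    assert np.isclose(abs(Vt[k] @ evecs[:, k]), 1.0)
```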

PCA in atmospheric science (geoscience)
Most common format is 'variables = stations or gridpoints; observations = different times'
Eigenvectors are known as empirical orthogonal functions (EOFs) and the technique as EOF analysis
Elements of the EOFs are often plotted as contours on a map
Note that the EOFs are vectors of loadings in PCA, not the PCs themselves, which are time series in this context

Example – northern hemisphere sea level pressure (NH SLP)
The data are monthly mean SLP for winter, from 1948 to 2000, on a 2.5° x 2.5° grid for the NH north of 20°N from the NCEP/NCAR reanalysis
Some preprocessing has taken place (removal of the annual cycle), and area weighting based on the square root of the cosine of latitude has been used, but these details need not concern us here

Example 2 – NH 850hPa streamfunction
The data are monthly mean streamfunction for extended winter (includes March), from 1979 to 1997, for the whole NH from the ECMWF reanalysis (ERA)
Some preprocessing has taken place – removing annual cycle
EOFs are often interpreted as 'physical modes' and there is considerable argument over which EOFs correspond to such modes, and whether EOFs can find them, or even whether they exist

Choices in PCA
There are a number of decisions to be made in PCA
Covariances or correlations?
How many PCs/EOFs?
Which normalisation constraint?

Covariance or correlation
PCs and their variances may be found by calculating eigenvalues and eigenvectors of either (a) a covariance matrix or (b) a correlation matrix
(a) corresponds to successively maximising variances of linear combinations of the raw variables
(b) corresponds to standardising each variable to have unit variance before the successive maximisation
NOTE: there is no simple relationship between the PCs found from the two types of matrix

Covariance or correlation II
It is important to use correlations, not covariances, if variables are measured in different units (pressure, temperature) to avoid effects of arbitrary scaling
Most geoscience applications use only one type of variable – the choice between covariance and correlation then depends on whether it is desirable for all variables (spatial locations) to have the same weight, or to allow those with greater variances an increased chance of dominating the first few PCs
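An illustrative sketch (synthetic data): with three independent variables of very different variances, the covariance-based leading EOF is dominated by the high-variance variable, whereas the correlation-based analysis standardises first so all variables get equal weight.

```python
import numpy as np

rng = np.random.default_rng(1)
# Three independent variables with standard deviations 1, 5 and 0.1
X = rng.normal(size=(200, 3)) * np.array([1.0, 5.0, 0.1])
Xc = X - X.mean(axis=0)

# Covariance-based PCA: the high-variance variable dominates the leading EOF
_, vec_cov = np.linalg.eigh(np.cov(Xc, rowvar=False))
leading_cov = vec_cov[:, -1]             # eigenvector of largest eigenvalue

# Correlation-based PCA: equivalent to standardising to unit variance first
_, vec_corr = np.linalg.eigh(np.corrcoef(Xc, rowvar=False))
leading_corr = vec_corr[:, -1]

# Column 1 (variance 25) dominates the covariance-based leading EOF
assert np.argmax(np.abs(leading_cov)) == 1
```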

How many PCs/EOFs?
There are numerous rules (see Jolliffe, 2002a, Chapter 6) based on:
Size of individual variances (eigenvalues, λ_k)
Cumulative sum of variances
Changes in successive variances
Physical interpretability
More complicated techniques
Types 2 & 4 are probably the most often used in geoscience, but rules of type 5 have been suggested, and type 3 (gaps between eigenvalues) is also sometimes important (e.g. if EOFs are to be rotated)
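A rule of type 2 can be sketched in a few lines (function name and 90% threshold are illustrative, not from the talk): keep the smallest number of EOFs whose cumulative variance fraction reaches a chosen threshold.

```python
import numpy as np

def n_components_for(eigenvalues, threshold=0.9):
    """Smallest k such that the first k eigenvalues account for at
    least `threshold` of the total variance (cumulative-sum rule)."""
    frac = np.cumsum(np.sort(eigenvalues)[::-1]) / np.sum(eigenvalues)
    return int(np.searchsorted(frac, threshold) + 1)

# Variances 4, 3, 2, 1: the first three explain 90% of the total
assert n_components_for([4.0, 3.0, 2.0, 1.0], 0.9) == 3
assert n_components_for([5.0, 3.0, 1.0, 1.0], 0.8) == 2
```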

Choice of normalisation constraint
In the kth PC, a_k, the vector of coefficients or loadings, is chosen so that the variance of a_k^T x is maximised, subject to the normalisation constraint a_k^T a_k = 1 and to successive PCs being uncorrelated
The results presented may have a_k^T a_k = 1, but alternatives are a_k^T a_k = λ_k or a_k^T a_k = 1/λ_k, where λ_k is the eigenvalue (variance) associated with the kth PC
In interpreting what a PC represents in terms of the original variables, the normalisation is unimportant – the maps look exactly the same. It is the relative values of the akj within ak that are important.
However, there are differences in interpretation of  individual loadings which need not concern us here
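A small sketch of the three normalisations (variable names are illustrative): all three loading vectors are scalar multiples of one another, so contour maps of their elements look exactly the same, only the scale changes.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 4))
Xc = X - X.mean(axis=0)
lams, A = np.linalg.eigh(np.cov(Xc, rowvar=False))
lam, a1 = lams[-1], A[:, -1]          # leading eigenvalue, unit-norm eigenvector

a_unit = a1                           # a'a = 1
a_lam = a1 * np.sqrt(lam)             # a'a = lambda
a_inv = a1 / np.sqrt(lam)             # a'a = 1/lambda

assert np.isclose(a_lam @ a_lam, lam)
assert np.isclose(a_inv @ a_inv, 1.0 / lam)
# Relative values within each vector are identical
assert np.allclose(a_lam, np.sqrt(lam) * a_unit)
```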

Simplification
PCs can be difficult to interpret, though often less so for space-time data than other types of data. To aid interpretation, various simplification techniques have been proposed.

Simplification II
Rotation (orthogonal or oblique)
Restriction of loadings to discrete set of values
LASSO-based approach
Others
Combining variance maximisation and simplification criteria
Truncation of loadings
Empirical orthogonal teleconnections
etc.

Rotation
Well-known and widely-used but controversial (Richman, 1986, 1987; Jolliffe, 1987, 1995; Mestas-Nuñez, 2000). Among the questions to be addressed are
Orthogonal or oblique
Choice of simplicity criterion e.g. varimax
How many EOFs to rotate
Choice of normalisation constraint

Example of rotation – USA summer precipitation
402 stations (variables)
1312 times (observations) = 41 (3-day periods in May-Aug) x 32 (years)

Other Simplification Methods
We give no details here, but show the results of applying one of them (LASSO-based) to the earlier NH SLP example

Relationships between variables in two (or more) groups
We may wish to relate two sets of variables e.g. sea surface temperatures and mean sea level pressure. A variety of techniques is available
Canonical correlation analysis
Maximum covariance analysis (SVD)
Many others

Canonical correlation analysis (CCA)
To find relationships between two groups of variables, find pairs of linear functions of variables, one from each group, that have maximum correlation, subject to being uncorrelated with previously found pairs
Turns out to be another eigenvalue problem, involving covariance matrices between (S_xy) and within (S_xx, S_yy) the groups of variables
Solve S_xy S_yy^-1 S_yx a_k^x = λ_k S_xx a_k^x (a_k^x = vector of loadings for the x variables; a similar equation holds for the y variables)
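A numerical sketch (synthetic data, names illustrative): form the sample covariance matrices, solve the generalised eigenproblem above, and recover the leading canonical correlation as the square root of the largest eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
z = rng.normal(size=(n, 1))                     # signal shared by both groups
X = z + 0.5 * rng.normal(size=(n, 3))
Y = z + 0.5 * rng.normal(size=(n, 2))
Xc, Yc = X - X.mean(0), Y - Y.mean(0)

Sxx = Xc.T @ Xc / (n - 1)
Syy = Yc.T @ Yc / (n - 1)
Sxy = Xc.T @ Yc / (n - 1)

# Eigenvalues of Sxx^-1 Sxy Syy^-1 Syx are squared canonical correlations
M = Sxy @ np.linalg.solve(Syy, Sxy.T)           # Sxy Syy^-1 Syx
lams = np.linalg.eigvals(np.linalg.solve(Sxx, M)).real
rho = np.sqrt(lams.max())                       # leading canonical correlation

assert 0.5 < rho < 1.0                          # strong shared signal recovered
```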

Maximum covariance analysis
Also
inter-battery factor analysis (Tucker, 1958)
SVD (Bretherton et al., 1992)
Similar to CCA except
It successively maximises covariance rather than correlation
Vectors of loadings are orthogonal, rather than derived variables uncorrelated
Solves S_xy S_yx a_k^x = λ_k a_k^x
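Equivalently, the loading vectors are the singular vectors of S_xy (hence the name "SVD analysis"), as this sketch on synthetic data checks:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 400
z = rng.normal(size=(n, 1))
X = z + rng.normal(size=(n, 3))
Y = z + rng.normal(size=(n, 2))
Xc, Yc = X - X.mean(0), Y - Y.mean(0)
Sxy = Xc.T @ Yc / (n - 1)

# Left/right singular vectors of Sxy are the x and y loading vectors;
# the eigenvalues in Sxy Syx a = lambda a are the squared singular values
U, sing, Vt = np.linalg.svd(Sxy, full_matrices=False)
ax, ay = U[:, 0], Vt[0]                  # leading pair of loading vectors

# The covariance of the leading pair of derived variables equals sigma_1
u, v = Xc @ ax, Yc @ ay
assert np.isclose(u @ v / (n - 1), sing[0])
```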

Maximum covariance analysis: Pacific SST vs. hemispheric 500mb height (Wallace et al., 1992)

Extensions to 3 (or more) modes
By 'modes' here I mean 'time', 'space'; extras might be different climate variables, different levels in the atmosphere. Some extensions:
O-mode, P-mode, …, T-mode analyses
Extended EOF analysis
Three-mode PCA

O-mode, P-mode, …, T-mode
Not really an extension – given 3 modes, most often space, time, climate variable, choose one as 'variable', one as 'observation', ignore the third, and do PCA. 6 possibilities.
S-mode most usual: space = variables, time = observations.
T-mode, not uncommon: time = variables, space = observations.
Other 4 used occasionally

Extended EOF analysis
n times, s spatial locations, p climate variables. Combine locations and variables to give (n x sp) data matrix and carry out EOF analysis on it.
Can also incorporate different time lags to give multivariate EEOF (MEEOF) analysis (Mote et al., 2000).
The latter also extends MSSA (Plaut & Vautard, 1994).

MEEOF example – 5 variables averaged over 0-10°S for various longitudes
200mb velocity potential
Outgoing radiation
215mb water vapour
100mb temperature
100mb water vapour

Concluding remarks
All the techniques discussed have an underlying objective of dimension reduction and all use covariance or correlation matrices. There is often a desire to physically interpret the new dimensions – this can be controversial
For example, oblique rotation is advocated by some because 'physical modes' (NB different meaning of 'mode') are often correlated. Other techniques are advocated (independent component analysis – ICA; Aires et al., 2000) on the premise that 'modes' are not just uncorrelated, but independent

More concluding remarks
We have mentioned some EOF-related techniques very briefly and others not at all
A large missing class consists of techniques explicitly designed for time series data, e.g. SSA (Golyandina et al., 2001), MSSA (Plaut & Vautard, 1994), POP analysis (von Storch et al., 1988), MTM-SVD (Mann & Park, 1999), …
Why not consult Jolliffe (2002a) for further details and references?

Discrete set of values for loadings
Hausmann (1982): -1, 0, +1
Vines (2000), Jolliffe, Uddin & Vines (2002): more integers – gives so-called simple components
Chipman & Gu (2004): find ordinary EOFs, then truncate to -1, 0, +1 or -c_1, 0, c_2
Rousson & Gasser (2004): a technique that produces blocks of zeros and blocks of equal non-zero loadings

LASSO-based approach
LASSO (Least Absolute Shrinkage and Selection Operator) developed in multiple regression to deal with multicollinearity.
A compromise between variable selection and biased regression. Shrinks some regression coefficients exactly to zero.
Adaptation to PCA: to the usual optimisation problem add an extra constraint (Jolliffe, Trendafilov & Uddin, 2003).
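The shrinkage-to-zero behaviour can be illustrated in the simplest setting (values are made up): for an orthonormal design matrix, the LASSO solution is the soft-thresholded least-squares estimate, so small coefficients are driven exactly to zero.

```python
import numpy as np

# Least-squares coefficients (illustrative values)
beta_ols = np.array([3.0, -0.4, 0.1, -2.0])
penalty = 0.5

# Soft-thresholding: shrink towards zero, clipping at zero
beta_lasso = np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - penalty, 0.0)

assert np.count_nonzero(beta_lasso) == 2   # two coefficients driven exactly to 0
```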

LASSO II
Constraint is Σ_j |a_jk| ≤ t,
    where a_jk is the jth element in the kth EOF, and t is a 'tuning parameter'. As t → 0, an increasing number of loadings are driven to 0.
The technique is named SCoTLASS (Simplified Component Technique – LASSO).
Zou et al. (2006) have an 'improved' version of SCoTLASS, with an implementation in R

Mediterranean sea surface temperature example
16 variables corresponding to average seasonal sea surface temperature in 16 areas of the Mediterranean, 1946-1988
Original source Bartzokas et al. (1994)

SST example
Explain the dark/light red/blue shading
Compared to the PCs
Rotation (using varimax) gives separate regions in the first 2 PCs, rather than overall temperature and a contrast between regions
SCoTLASS gives a simpler version of the rotated PCs
SCoT (a technique that maximises a criterion combining variance and simplicity) gives a simpler version of the unrotated PCs; so do simple components, but simplicity in a different sense

Other techniques for two groups of variables
Redundancy analysis (van den Wollenberg, 1977): R_xy R_yx a_k^x = λ_k R_xx a_k^x (R matrices contain correlations)
Related to PCA of instrumental variables and reduced rank regression
Unlike CCA, MCA, one set of variables is treated as predictors, one as responses
Principal predictors (Thacker 1999)
S_xy [diag(S_yy)]^-1 S_yx a_k^x = λ_k S_xx a_k^x
Conditional MCA (An, 2003)
Multivariate regression, combined PCA of x and y variables, separate PCAs of x and y followed by CCA, partial least squares and a number of others

Three mode PCA
x_ijk, i=1,2,…,n; j=1,2,…,p; k=1,2,…,t is approximated by
x_ijk ≈ Σ_{u=1}^{m} Σ_{v=1}^{q} Σ_{w=1}^{s} a_iu b_jv c_kw g_uvw
with m < n, q < p, s < t (Kroonenberg, 1983). Other varieties exist in the psychometric literature.