

CORRESPONDENCE ANALYSIS


Introduction

Correspondence analysis (CA) is a method commonly used in studies of modern ecology and vegetational succession (Gauch 1982; Ter Braak 1992). CA produces two-dimensional plots (one set for taxa and the other for localities) that show the variance within a data set along a series of axes. Taxa that frequently co-occur plot closest together, whilst those that rarely co-occur plot furthest apart. The same applies to the localities plot: localities that share many taxa plot closest to one another, whilst those with little in common plot furthest apart. The greatest variation is shown on the first axis, with subsequent axes accounting for progressively less.

We use CA as a means of arranging all of the elements (whether taxa or localities) relative to axes in multidimensional space according to their similarity to each other. The advantages of CA are that it provides the same scaling for the sample (locality) and character (taxon) plots, enabling direct comparison, and that it can accommodate 'incomplete' data matrices where some information is missing (Hill 1979; Gauch 1982), as always occurs with the fossil record (e.g. Rees and Ziegler 1999; Rees et al. 2000).

The version used is one of the programs in the CANOnical Community Ordination (CANOCO) package compiled by Ter Braak (1992), an extension of the Cornell Ecology Program DECORANA of Hill (1979). The general procedure has been described by Shi (1993): "Geometrically, ordination involves rotation and transformation of the original multidimensional co-ordinate system and reduction of high dimensionality so that major directions of variation within the data set can be found and more readily comprehended than by looking at the original data alone".


[Figure: CAsimplematrix3.gif. Panels A, B and C show a simple data matrix before and after rearrangement.]

An initial data matrix (Fig. A), comprising a vertical locality axis and a horizontal taxon (e.g. genus, species) axis, may appear to have no structure, but a pattern emerges when first the locality axis (Fig. B) and then the taxon axis (Fig. C) are rearranged. The paleontologist may be faced with a data array like Fig. A (or Fig. B if, for example, the paleolatitude is known for each locality), and the computer effectively rearranges the matrix to produce the Fig. C plot. This could of course be done by hand, but with data matrices containing hundreds of columns and rows it becomes impractical.
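The rearrangement from Fig. A to Fig. C can be sketched in a few lines of Python. This is only an illustration, not the Canoco implementation: it uses reciprocal averaging, a classical way of obtaining the first CA axis, and the matrix and the locality and taxon names (L0 to L5, T0 to T5) are invented for the example. Each locality score is the average score of the taxa present there, each taxon score is the average score of the localities where it occurs, and the two steps are repeated until the scores settle.

```python
def reciprocal_averaging(matrix, iterations=200):
    """Return first-axis (row_scores, col_scores) for a presence-absence matrix.

    Every row and every column must contain at least one presence.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    row_tot = [sum(row) for row in matrix]
    col_tot = [sum(matrix[i][j] for i in range(n_rows)) for j in range(n_cols)]
    grand = sum(row_tot)

    def row_average(col_scores):
        # Locality score = mean score of the taxa present at that locality.
        return [sum(matrix[i][j] * col_scores[j] for j in range(n_cols)) / row_tot[i]
                for i in range(n_rows)]

    col_scores = [float(j) for j in range(n_cols)]  # arbitrary starting values
    for _ in range(iterations):
        row_scores = row_average(col_scores)
        # Taxon score = mean score of the localities where the taxon occurs.
        col_scores = [sum(matrix[i][j] * row_scores[i] for i in range(n_rows)) / col_tot[j]
                      for j in range(n_cols)]
        # Centre and rescale (weighted by column totals) so the scores neither
        # collapse to a constant nor shrink towards zero.
        mean = sum(w * s for w, s in zip(col_tot, col_scores)) / grand
        col_scores = [s - mean for s in col_scores]
        spread = (sum(w * s * s for w, s in zip(col_tot, col_scores)) / grand) ** 0.5
        col_scores = [s / spread for s in col_scores]
    return row_average(col_scores), col_scores


# Invented example: a matrix with hidden structure, rows and columns shuffled.
localities = ["L4", "L1", "L5", "L0", "L3", "L2"]
taxa = ["T2", "T5", "T0", "T3", "T1", "T4"]
scrambled = [
    [0, 1, 0, 1, 0, 1],
    [1, 0, 1, 0, 1, 0],
    [0, 1, 0, 0, 0, 1],
    [0, 0, 1, 0, 1, 0],
    [1, 0, 0, 1, 0, 1],
    [1, 0, 0, 1, 1, 0],
]

row_scores, col_scores = reciprocal_averaging(scrambled)
row_order = sorted(range(len(localities)), key=lambda i: row_scores[i])
col_order = sorted(range(len(taxa)), key=lambda j: col_scores[j])
reordered = [[scrambled[i][j] for j in col_order] for i in row_order]

print("      " + " ".join(taxa[j] for j in col_order))
for i, row in zip(row_order, reordered):
    print(localities[i].ljust(6) + "  ".join(str(v) for v in row))
```

Sorting both axes by their first-axis scores gathers the presences along the diagonal. The direction of the axis is arbitrary, so the same pattern may appear reversed.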

Figure C shows a very simple matrix, reflecting perhaps latitude, whereas in reality there may be more than one source of variance in the data. CA serves to identify the degree of variance and can ordinate the various influences on the data array, but it cannot specify the sources of the variance (examples of which include temperature, precipitation, geography, and ecological succession). This is the job of the ecologist or paleontologist. In our work, we use the physiognomy implicit in the names of individual fossil leaf genera to ultimately enable the determination of global paleoclimates. CA of fossil leaf genera and localities, combined with distributional patterns of climate-sensitive sediments, enables global climate zones (biomes) to be drawn on paleogeographic maps.



SHORTCUTS FOR CANOCO


This quick guide is designed to walk you through Canoco by the simplest path to getting correspondence analysis axis scores. As such, it ignores many of the options available in Canoco for specific data analysis problems. These issues are probably easier to deal with after grasping the basic layout of Canoco, and the manual may be of use at a later point. This version of Canoco is a Macintosh version, and all other programs referred to are also Macintosh.


I. THE DATA MATRIX

Stay away from Excel: it doesn't deal well with the spacing issues that Canoco needs defined. Microsoft Word is your best choice.


EXAMPLE PRESENCE-ABSENCE MATRIX
(periods show where spaces should be; each line ends with [Return]):

[Table image: Canocohelptable.gif. Example presence-absence matrix in Canoco format.]

Step by step, these are the things to keep in mind as you're setting up a matrix:

1. Canoco is a Fortran program with a very strict input format; if the format is not followed, it won't know how to read the files.

2. The horizontal numbers are the species identifiers. They do not need to be in increasing order (but it's not a bad habit). Only the species present at a locality need be included; Canoco does not accept a "0.0" occurrence. The species are then listed in the first list after the matrix, which has 10 identifiers of 8 characters each per line, followed by a [Return]. For example, species 20 = Betul.

3. The vertical numbers (1, 2, 3, 4, 5, 8, 9, 13) are the locality numbers. They need to be in increasing order but need not be consecutive. Their identifiers are in the second list, and are also 10 per line of 8 characters each.

4. The overall format is the following:

1st line: This is where you say what the data matrix is. Any text is fine, but keep it under 80 characters.

2nd line: This defines the layout of each line of the data sheet. In this example, "I2" refers to the 2 characters allotted to the vertical (locality) numbers (e.g. .1). "7" is the number of presence-absence couplets per line (the sequence "6 1.0" is a couplet). "I5" is the number of characters per species number (e.g. ..149 is correct, as is ...20). "F5.0" defines the number of characters per presence value (..1.0 is the correct 5 characters). Remember to close all parentheses.

3rd line: State again the number of couplets per line.

4th line-19th line (or more of course): This is your data matrix. Again, include only the species present at a locality. It is fine to extend into additional lines if you have more than 7 (in this example) species for a locality, but you need to start each additional line with the "I2" locality identifier (see example, localities 8, 9, 13). Remember to end each line with a [Return]. I recommend using a fixed-width font such as Courier, in which the characters line up perfectly; it is then easier to see if you have made spacing mistakes.

20th line: Always end your data matrix with a 0 (zero) then a [Return].

lines 21-40: This is where you define your species; count across to find out what each number refers to. You can include extra definitions, but make sure your highest species number agrees with the highest position in the matrix.

lines 41-42: This is where you define your localities; in this example, locality 8 is actually climate station 551. For both definition matrices remember to use 8 characters and only 10 identifiers per line (ending each line with a [Return]).

5. Remember: Canoco is picky! If you are one space off, it will try to run the analysis with completely wrong numbers and sometimes succeed, giving you an ordination of something completely bizarre. Take the time to do this part correctly, because it's not so easy to see the mistakes after they've been made.

6. Save the file as "Text".
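Because a single misplaced space can corrupt the analysis, it can help to generate the file with a script rather than by hand. The following Python sketch assembles a file from the pieces described above. The function names, the demo data, the placeholder names such as "sp001", the exact format line "(I2,7(I5,F5.0))" and the right-aligned closing " 0" are assumptions pieced together from the description, since the original example table is not reproduced here; check the result against a file that Canoco is known to accept.

```python
def format_data_lines(locality, species, per_line=7):
    """Fixed-width data lines for one locality.

    Each line starts with the locality number (I2), followed by up to
    `per_line` couplets of species number (I5) and presence value (F5.0).
    """
    lines = []
    for start in range(0, len(species), per_line):
        chunk = species[start:start + per_line]
        couplets = "".join(f"{sp:5d}{1.0:5.1f}" for sp in chunk)
        lines.append(f"{locality:2d}{couplets}")
    return lines


def format_names(names, per_line=10, width=8):
    """Name list: 10 identifiers of 8 characters each per line."""
    padded = [f"{name[:width]:<{width}}" for name in names]
    return ["".join(padded[i:i + per_line]) for i in range(0, len(padded), per_line)]


def build_canoco_file(title, presences, species_names, locality_names, per_line=7):
    """Assemble the whole file as text (assumed layout, see the steps above)."""
    lines = [
        title[:79],                      # 1st line: title, under 80 characters
        f"(I2,{per_line}(I5,F5.0))",     # 2nd line: assumed format statement
        str(per_line),                   # 3rd line: couplets per line again
    ]
    for locality in sorted(presences):   # locality numbers in increasing order
        lines += format_data_lines(locality, presences[locality], per_line)
    lines.append(" 0")                   # a zero ends the data matrix
    lines += format_names(species_names)
    lines += format_names(locality_names)
    return "\n".join(lines) + "\n"       # every line ends with a return


# Hypothetical demo data: species 20 is "Betul", locality 8 is station "551".
species_names = [f"sp{n:03d}" for n in range(1, 21)]
species_names[19] = "Betul"
locality_names = [f"loc{n}" for n in range(1, 9)]
locality_names[7] = "551"
presences = {1: [6, 20], 2: [1, 2, 3, 4, 5, 6, 7, 8, 9], 8: [3, 20]}

text = build_canoco_file("Example presence-absence matrix", presences,
                         species_names, locality_names)
print(text)
```

Locality 2 has nine species in this demo, so it spills onto a second line that again begins with the locality number. Names are stored by position, so the entry for species 20 is the twentieth name in the list.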


II. RUNNING CANOCO

Canoco has a dialogue format, in which it asks various questions and you give the answers. For the most part it's fairly straightforward, but I have provided below the best answers to give for a correspondence analysis. Each question is followed on the next line by its answer, with occasional notes from me.

Type 0 for input from screen
RETURN

Type 1 for long dialogue
1

Type 1 for changing maximum data size
RETURN

Type name of file with species data
CANOCO.SPE
hit RETURN which pulls up a menu, then open your file as you normally do in a Mac program

Type name of file with covariables
RETURN

Type name of file with environmental data
RETURN

Type name of print file
CANOCO.OUT
type in "your choice of names".out

Type name of solution file
CANOCO.SOL
type in "your choice of names".sol

Type of analysis:
4

Scaling of ordination axes
-1

Species and sample diagnostics
0

Enter number of samples to be omitted
RETURN

Transformation of species data
RETURN

Type weight to be given to species
RETURN

Type weight to be given to samples
RETURN

Weighting of species required?
RETURN

it then prints out lists of numbers to your files and then says...

Press RETURN for more, S to skip...
RETURN

Output option for
spec-scor      samp-scor
2                    2
RETURN

Type 0=stop
RETURN

It should say "successful completion of run", so hit RETURN to get out of the program. If you have any problems during the run, "Apple key" + "." gets you out.


III. USING THE OUTPUT

The ??.out file gives you a log of the dialogue you've just had and tells you the variance per axis; it's worth keeping on record.

The ??.sol file gives you the first 4 axes' scores for both species and localities; I've mostly worked with the locality plots but the species scores may be very useful in plots of fossil data.

Open both of these files in Microsoft Word and use a small font so that the columns line up and make sense. Stay away from the Canoplot program because it's really unwieldy.

What I tend to do at this point is to take the axis scores into a spreadsheet like Excel and manipulate them from there. We have various ways of graphing the axes, or you can find your own method.
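If you prefer not to retype or paste the scores by hand, a short script can turn them into a comma-separated file that any spreadsheet will open. The Python sketch below is only that, a sketch: it assumes each score row consists of an item number, a name and four axis scores separated by spaces, which is a guess at the layout and should be checked against your own ??.sol file. The function name and the sample rows are invented for illustration.

```python
import csv
import io


def scores_to_csv(rows_text):
    """Convert whitespace-separated score rows to CSV text.

    Assumed row layout: number, name, axis 1, axis 2, axis 3, axis 4.
    Lines that do not fit this layout are skipped.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["number", "name", "axis1", "axis2", "axis3", "axis4"])
    for line in rows_text.splitlines():
        parts = line.split()
        if len(parts) != 6:
            continue
        try:
            number = int(parts[0])
            scores = [float(p) for p in parts[2:]]
        except ValueError:
            continue  # not a score row
        writer.writerow([number, parts[1], *scores])
    return out.getvalue()


# Invented sample rows, not real Canoco output.
sample = """\
some other line
1 loc_a -1.25 0.5 0.0 2.0
2 loc_b 0.75 -0.5 1.5 0.25
"""
csv_text = scores_to_csv(sample)
print(csv_text)
```

Save the returned text with a .csv extension and open it in the spreadsheet; each axis then sits in its own column, ready for plotting.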



References:

Gauch, Jr., H.G. 1982. Multivariate analysis in community ecology. In: Beck, E., Birks, H.J.B. & Connor, E.F. (eds) Cambridge studies in ecology. Cambridge University Press, New York, 298 p.

Hill, M.O. 1979. Correspondence analysis: a neglected multivariate method. Applied Statistics, 23, 340-354.

Shi, G.R. 1993. Multivariate data analysis in paleoecology and paleobiogeography - a review. Palaeogeography, Palaeoclimatology, Palaeoecology, 105, 199-234.

Ter Braak, C.J.F. 1992. CANOCO - a FORTRAN program for canonical community ordination. Microcomputer Power, Ithaca, N.Y., 95 p. (plus software version 3.11, Nov. 1990).


