Swiss Federal Institute of Forestry Research
(Eidgenössische Anstalt für das forstliche Versuchswesen, Institut fédéral de recherches forestières, Istituto federale di ricerche forestali)
CH 8903 Birmensdorf

Reports (Berichte, Rapports, Rapporti)
No. 215, June 1980

Otto Wildi
László Orlóci

Management and Multivariate Analysis of Vegetation Data

The purpose of the Swiss Federal Institute of Forestry Research is to furnish sound principles for all aspects of forestry in Switzerland, through scientific research, investigation and observation (governmental decree on the founding of the SFIFR).

Its findings are made available for application in practice and research, mainly through publications. Texts of limited application, addressed to a narrower readership, are generally presented in the "Reports" (Berichte), while more extensive works of longer-term interest mostly appear in the "Mitteilungen".

Publications of the Institute that are supplied free of charge to holders of Swiss forestry offices are to be regarded as official copies.

Oxf.: 182 : DK 519.237 : 519.61

Publisher:

Address of the second author:
László Orlóci, Department of Plant Sciences, University of Western Ontario, London, Ontario, Canada N6A 5B7

Composition: Kurt Rauber
Graphics: Mirek Sebek
Manuscript handed in: January 13, 1980
Reference: Eidg. Anst. forstl. Versuchswes., Ber.

ABSTRACTS

Management and Multivariate Analysis of Vegetation Data

This publication of the SFIFR is a specialized handbook for phytosociologists and plant ecologists. It gives a brief, easily understandable introduction to the mathematical treatment of vegetational and ecological data, and offers guidance in the application of a new package of computer programs. The book aims to provide research workers and students with a versatile, easy-to-use instrument which exploits to the full the potential of a fast computer.

This is attempted in a system of programs giving maximum flexibility in choice of method and automating as far as possible otherwise time-consuming manipulations.

Verwaltung und multivariate Analyse von Vegetationsdaten (Management and multivariate analysis of vegetation data)

This report of the EAFV is a specialized handbook addressed to plant sociologists and plant ecologists. It offers the reader a brief, easily understandable introduction to the way the mathematical treatment of vegetation and site data works, and gives instructions for the use of a package of computer programs.

The book aims to give researchers and students a versatile, easy-to-operate instrument with which they can profit from the capabilities of large computers. This has been realized as a system of programs that leaves the user much freedom in the choice of method and, at the same time, automates time-consuming manipulations of the data as far as possible.

Traitement et analyse à plusieurs variables de données de végétation (Processing and multivariate analysis of vegetation data)

This report of the IFRF is a specialized manual intended for phytosociologists and plant ecologists. The reader will find a brief introduction to the computer-based mathematical treatment of vegetation and site data, and to the use of a package of computer programs. The work aims to provide the researcher or student with a good, easy-to-use instrument, so that the advantages of high-performance computers can be put to use. The authors propose a system of programs that leaves the user great freedom in the choice of method, while tedious manipulations of the data are automated as far as possible.

TABLE OF CONTENTS

Abstracts
Preface
1 Introduction
2 Numerical methods in vegetation surveys
  2.1 Basic steps
  2.2 The systems structure
  2.3 Short reference to numerical methods
    2.3.1 Characterization
    2.3.2 Transformations
    2.3.3 Resemblance measures
    2.3.4 Ordination
    2.3.5 Classification
    2.3.6 Identification
    2.3.7 Ranking
    2.3.8 Evaluation of clustering results
3 Planning the analysis
  3.1 The operational pathway
  3.2 Transformation of cover/abundance values
  3.3 Resemblance measures and vector transformations
  3.4 Ordinating the data by component analysis
  3.5 Clustering methods
    3.5.1 Grid analysis
    3.5.2 Other clustering methods
  3.6 Managing large data sets
    3.6.1 General considerations
    3.6.2 Reducing sample size
4 Operating the computer programs
  4.1 General considerations
  4.2 The standard input file
  4.3 Implementation of the programs into computing systems
    4.3.1 Interactive versions (DEC System 1090)
    4.3.2 File handling in the batch programs
  4.4 Brief review of programs with examples
    4.4.1 Program INIT
    4.4.2 Program EDIT
    4.4.3 Program TABS
    4.4.4 Program RESE
    4.4.5 Program IDEN
    4.4.6 Program RANK
    4.4.7 Program CLTR
    4.4.8 Program PCAB
    4.4.9 Program GRID
    4.4.10 Program ORDB
    4.4.11 Program CIAC

PREFACE

This manual is a revised and enlarged version of "Management and Multivariate Analysis of Large Data Sets in Vegetation Research", a handbook for the package of computer programs produced at the University of Western Ontario in 1978. The package has been used extensively by vegetation scientists in the analysis of different data sets. Experience has been gained, leading to revisions of its structure and of the individual programs. Following the requests of several users, we decided to include some introductory information on numerical methods and also suggestions on how to choose between the different options. These should facilitate the use of the package by newcomers in this field who have already had some training in multivariate analysis. The latter is still necessary to operate the programs and to properly interpret the results.

The operational structure in the new version of the package is considerably simpler than that in the previous version. The number of programs has been reduced from 17 to 11 and the data files from over 20 to only 6 different types. This simplifies batch processing, even though some new algorithms have been incorporated into the package. The present version is not compatible with the previous one, except for the original input files which can be used without any changes.

Several users gave us helpful hints and suggestions for revisions. Mrs. R. Kuhn-Garrod and Mrs. M. J. Sieber assisted with the manuscript. Dr. W. Bosshard (Publisher), the Director of the Swiss Federal Institute of Forestry Research, accepted the manual for publication and made it possible to produce an edition to fulfil a demand comparable to that of the first version; we would like to thank him for this.


1 INTRODUCTION

The processing system was designed to make use of the capabilities of a fast computer to manage large amounts of data and to perform multivariate analyses. Its purpose is to serve the users of multivariate methods, and also those who prefer to make use of automated data handling in subjective classifications. The design of the system has been influenced by special requirements, with a view to its use in regions such as Central Europe with large amounts of scarcely processed vegetation data.

The existing program packages available to the vegetation scientist rarely satisfy his objectives, or include all those aspects which we consider important: a sufficient number of options for the methods of data analysis, flexibility in application under different requirements, and possibilities for the user to interact with the system.

The system which we present has several advantages:

1. The majority of the algorithms incorporated in it can handle large data sets with the help of peripheral storage units.

2. The structure of the processing system is such that it can accept any method and accommodate new algorithms with great ease. To satisfy the requirements of the users, their different data structures and objectives, the system incorporates not one but several different methods.

3. Feedback to the user is achieved by displaying intermediate results. These can make the analysis more flexible, so one can proceed in steps and continue in new directions if necessary.

4. The system incorporates service-oriented manipulations, such as the printing of tables and graphs and the reducing or combining of data sets. (The original identification numbers of the species and releves are of course retained.)

The system consists of different programs not all of which are needed at any one time. The size of the computer core required is therefore variable. An advantage of this is that the analysis may use a large number of different program combinations to serve different objectives. But to take full advantage of the system's operating potentials careful planning is advisable. For this, a good understanding of the system and the methods is required. The manual therefore presents an introduction not only to the operation of the processing system, but also to the different methods offered by the programs.

To achieve these objectives, Chapter 2 summarises data handling, numerical analysis and the processing system. It introduces the structure of the latter and gives brief explanations of the methods.

Chapter 3 is concerned with the planning of pathways in the analysis. It also describes some typical applications as examples, since the selection of the methods depends not only on the objectives of the investigator, but also on the actual data structure.

Chapter 4 contains the descriptions of the algorithms and the instructions needed to operate the processing system. These short reviews are followed by illustrative examples.

A minimum number of references are included, mainly textbooks and reviews on numerical and other methods, which in themselves are rich sources for other references.

Program listings are given in the back folder of the booklet in the form of microphotographs.

Copies of the programs are available on request from O. Wildi.

2 NUMERICAL METHODS IN VEGETATION SURVEYS

2.1 Basic steps

In any given survey, data processing involves just a few of several possible methodological steps that are logically related. Fig. 1 illustrates this point. Assuming that the purpose of the survey is given, the survey area has to be delimited. In most cases, only a portion of the area can be sampled and only a few of the possible states of the variables can be measured and included in the analysis. Before the sampling begins, the sampling criteria have to be defined, including sampling design and variables. The former may be random, systematic, preferential or a combination of these. Examples of variables include cover, biomass, and frequency of plant species, families, or life form. The sampling yields the data to be subjected to numerical analysis. In the course of this, as will be seen, the investigator's conclusions may be strongly influenced by the resemblance measure actually applied to the raw data. The results of data analysis may directly serve some defined practical goal, or may lead to more field work and analysis.

In data analysis, three basic processes are of interest: (i) Specific subsets of the original data set may be identified and displayed before attempting to extract information or to draw general conclusions. (ii) The data set may be tested for fit to statistical, geometrical or dynamic models suggested by practical or theoretical considerations. (iii) Comparisons may be made between the subsets or between the data set and other sets from external sources, such as data from literature. It follows from these that the management of data must involve analyses as well as data handling processes (Fig. 2). There are also processes referred to as response programs, which have to be considered because of economic reasons, as will be seen below.

[Figure: flow diagram with the elements Survey area, Sample, Survey space, State of variables, Sampling criteria, Measuring, Transformation, Model building, Data space, Measuring resemblance and Resemblance space.]

Fig. 1 Methodological steps in vegetation surveys. Explanations are given in the main text.

Handling programs
  Reading a data set, storing subsets on separate files for further analysis
  Reducing the number of species or releves, or dividing data sets for analysis and storage
  Condensing data sets by forming frequency matrices for storage
  Combining data sets for analysis and storage

Processing programs
  Resemblance matrix for input in multivariate analysis
  Ranking species, releves or environmental factors
  Multidimensional scaling to produce ordination coordinates
  Classification to form groups
  Identification to evaluate results, reallocate new observations
  Tests (statistical)

Response programs
  Scatter diagrams
  Dendrograms
  Tables

Fig. 2 Flow diagram of data management.

Initial data handling comprises four processes listed in the left-hand column in Fig. 2.

These involve data manipulations considered important to store or retrieve information and to give fast access to any portion of the information in the system using the processing programs. The entire data set - including measurements, identification labels such as row and column numbers, the names of species or site factors, and remarks about the origin and nature of the data - is exposed to initial manipulations. The reduction or recombination of sets allows the user to process a portion of a set or to perform an analysis of two or more sets simultaneously. Condensation achieves reduction of the species number, the formation of groups, the computation of group centroids, or indeed, condensation of the data in the frequency tables in which the plant community rather than the releve represents the fundamental unit. Frequency tables facilitate quick analyses of very large data sets. They represent a useful device for managing vegetation data banks, since they can be processed with ease by the programs.
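The condensation of releves into a frequency table can be sketched as follows. This is only a minimal illustration in Python, not the algorithm used in the package; the function name `frequency_table`, the data layout (one dictionary of species scores per releve) and all example values are assumptions made for the illustration.

```python
from collections import defaultdict


def frequency_table(releves, groups):
    """Condense releves into one frequency column per group.

    releves: dict mapping releve id -> dict of species -> score
    groups:  dict mapping releve id -> group label (hypothetical classification)
    Returns dict: group -> dict of species -> fraction of the group's
    releves in which the species is present (score > 0).
    """
    counts = defaultdict(lambda: defaultdict(int))
    sizes = defaultdict(int)
    for rid, scores in releves.items():
        g = groups[rid]
        sizes[g] += 1
        for species, score in scores.items():
            if score > 0:
                counts[g][species] += 1
    return {
        g: {sp: n / sizes[g] for sp, n in sorted(counts[g].items())}
        for g in sorted(sizes)
    }


# Hypothetical example: four releves assigned to two community types.
releves = {
    1: {"A": 3, "B": 0, "C": 1},
    2: {"A": 2, "B": 0, "C": 0},
    3: {"A": 0, "B": 4, "C": 2},
    4: {"A": 0, "B": 1, "C": 1},
}
groups = {1: "I", 2: "I", 3: "II", 4: "II"}
table = frequency_table(releves, groups)
print(table)
```

In the condensed table the community type, rather than the single releve, is the unit of description, which is why such tables stay small even when the number of releves is very large.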


Data analysis often starts with the generation of a resemblance matrix (middle column, Fig. 2; Section 2.3.3). We regard this as an important step in the analyses performed by this package, as it is known that the resemblance matrix defines completely the sample space of the analysis, while constraining the possible results. Many kinds of resemblance measures can be used, but the type of raw data and the objectives of the analysis will determine the choice.

The resemblance may represent input in ranking (Section 2.3.7) to reveal information about the importance of the descriptors, and through this to assist the user to reduce the data set with a minimum loss of information. This is important in vegetation surveys where the number of species is usually very large and the data sets are difficult to analyse.

The resemblance matrix can represent an input in ordinations (Section 2.3.4). This can illuminate the ecological significance of the data structure and supply coordinates to be used as input in further processing such as classification or trend seeking. Classification as well as identification (Sections 2.3.5 and 2.3.6) enables the user to reduce the information content of data by the formation of groups or by assignments to groups. After classification or ordination, in specific cases, statistical tests may be performed on the results in accordance with specific hypotheses (Section 2.3.8).

An efficient analysis should be fast in yielding results. The response algorithms (right column, Fig. 2) can aid this. They are not restricted to a specific kind of analysis, as will be apparent from the structure of the processing system described in the next section.

2.2 The systems structure

Each program in the package (Fig. 3) does one step in the analysis. Several programs may have to be run in order to get the desired results. Information flow from program to program, i.e. the transmission of the intermediate results, is achieved by input and output disk files (Fig. 4). The evaluation of the options available and the display of the results require local input, from cards or teletype, followed by a local output to the printer or the teletype, displaying the results. The interpretation of the output suggests which program should be run next, whenever an alternative pathway exists. Fig. 5 is a flow chart of operations, indicating logically connected steps in the analysis. These are to be interpreted as follows:

The analysis, to get started, requires a standard input. Its structure is explained in Chapter 4.2. Programs EDIT and TABS may be used to derive new input files from those already existing. Programs INIT, EDIT and TABS can process these files directly. However, the numerical analysis of the input data always assumes preprocessing by INIT (Fig. 5).

After this, program RESE is able to compute resemblances. The complete resemblance matrix, containing all possible comparisons of pairs of data rows or columns, is the input in other analyses, such as identification (IDEN), clustering (CLTR), ranking (RANK) and component analysis (PCAB). These programs yield descriptions of groups, or coordinates on gradients. These may be processed by EDIT. The coordinates from PCAB represent a special case when used in program GRID for finding noda (centers of groups) and in program ORDB for printing scatter diagrams.

Program TABS uses information about the group membership of species or releves for printing vegetation tables. The latter are condensed into contingency tables. A statistical test is performed on them in program CIAC to measure the success of the tabular sorting procedure. Since this also yields coordinates for releves and species groups, scatter diagrams can be printed, again with program ORDB.

No.  Name  Function                   Purpose
1    INIT  Handling                   Initializes data files, does transformations and computes frequencies
2    EDIT  Handling                   Stores classifications from other programs or from the user, reduces data sets
3    TABS  Handling and presentation  Prints vegetation tables, computes and prints frequency and contingency tables
4    RESE  Analysis                   Computes resemblance vectors or matrices
5    IDEN  Analysis                   Reallocates new releves or species to existing groups
6    RANK  Analysis                   Ranks species on coordinates of orthogonal functions
7    CLTR  Analysis and presentation  Does clustering, prints dendrograms and forms groups
8    PCAB  Analysis                   Computes component analysis (R type) or correspondence analysis (reciprocal ordering)
9    GRID  Analysis                   Finds groups in ordinations
10   ORDB  Presentation               Prints scatter diagrams
11   CIAC  Analysis                   Measures success in tabular sorting

Fig. 3 List of programs. See the main text for explanations.

[Figure: a processing program receives local input by the user and input from disk files of preceding runs; it produces local output and output to disk files.]

Fig. 4 General structure of the programs.

It is obvious from what has been said that a typical analysis will not require the use of all the programs listed in Fig. 3. Some useful pathways will therefore be traced and explained in Section 3.1.

[Figure: flow chart connecting the standard input with the numbered programs (1 INIT, 2 EDIT, 3 TABS, 4 RESE, 5 IDEN, 6 RANK, 7 CLTR, 8 PCAB, 11 CIAC, among others). The boxes are marked according to the function of the programs: handling, handling and presentation, analysis, analysis and presentation, presentation.]

Fig. 5 Flow chart of operations. The arrows indicate the logical steps in proceeding through the analysis.

2.3 Short reference to numerical methods

2.3.1 Characterization

The objectives and methods of data analysis, with relevance to vegetation surveys, have been reviewed in great detail by Orlóci (1978). Data analysis endeavours to reduce the information content of data to a simpler form that can be interpreted. The process usually involves fitting constructions of different complexity to the data, such as a regression line, an ordination structure, or a group structure. In all cases, the vegetation analyst may benefit from automated multivariate procedures. These operate on the basis of explicitly defined algorithms. There is a great number of worked algorithms available to the users, some of which have been widely used. The examples range from the method of Braun-Blanquet (Mueller-Dombois and Ellenberg, 1974) to automated clustering and ordination, which can claim more objectivity. However, the choice between subjective or objective procedures is hardly ever affected by broad differences in their purpose, but is more influenced by local tradition and the personal preference of the investigator.

Why use multivariate analysis? One advantage is efficiency. This results from the variables being analysed simultaneously and not individually. Another is increased power. This arises from the fact that the variables are analysed as correlated entities. Yet another is the broad relevance which can be derived from the results, reflecting the collective influence of all variables and being applicable to all variables in a simultaneous analysis.

The data analyst has to keep in mind that aberrant observations can have an unduly strong influence on the results or even dominate the analysis. It is advisable to remove the aberrant observations prior to the analysis and to interpret them separately from the main body of the data. Relevant points are made about this in Chapter 3.6.2.

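To illustrate what has been said so far, a set of vegetational data, presented in symbolic terms, will now be considered, with the p species as rows and the n releves as columns:

$$
\begin{pmatrix}
x_{11} & \cdots & x_{1n} \\
\vdots &        & \vdots \\
x_{p1} & \cdots & x_{pn}
\end{pmatrix}
$$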

In this, the n releves (interpreted as individuals) are described based on the abundance of p species. The number of species in vegetation surveys is generally large and it may even exceed the number of releves.

Information in the data is carried by species correlations (or, equivalently, by releve similarities). Fig. 6, graph (a), illustrates the problem of indeterminacy when the species correlations are low; groups could be formed with members 1 and 2, 2 and 3 or 1 and 3, none of which is more justified than the others. The set of releves here can be determined in different ways according to the species scores. Graph (b) in Fig. 6 illustrates a normal case where the species are highly correlated. The group structure is strong, and the hierarchy is unique, $R_1 + (R_2 + R_3)$. Simultaneous analysis of all the species would suggest a possible simplification (graph (c) of Fig. 6) in that the number of the species can be reduced with minimum loss of information. Analysis of this new, reduced data set may be much more economical than the analysis of the original set by many automated methods.
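How a high species correlation signals redundancy can be illustrated with a small calculation. The sketch below uses the ordinary product-moment correlation as an assumed measure; the species names and scores are hypothetical and are not taken from Fig. 6.

```python
import math


def pearson(x, y):
    """Product-moment correlation of two equally long score vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical scores of three species (rows) in four releves (columns).
species = {
    "sp1": [1, 2, 3, 4],
    "sp2": [2, 4, 6, 8],  # proportional to sp1, so it carries no new information
    "sp3": [4, 1, 3, 2],
}
r12 = pearson(species["sp1"], species["sp2"])
r13 = pearson(species["sp1"], species["sp3"])
print(r12, r13)
```

In this hypothetical set, one of the two perfectly correlated species could be dropped without changing the relations among the releves, whereas the third species contributes independent information.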

It is conceivable that often the correlations analysed are linear. It is however a well-known fact that species hardly ever respond linearly, but rather, the response graphs are bell-shaped or even more complex (Fig. 7). Because of this, non-linear correlations can be expected. Linear correlations are indeed assumed in some of the programs of this package. The user, however, can avoid some undesirable effects from non-linearity by not mixing releves from very different sites in the same sample. Fig. 7 illustrates this point, and also the effect of subdividing the data into more homogeneous sets. Further relevant aspects will be discussed in Chapter 3.6, dealing with the analysis of very large data sets.

[Figure: three graphs, (a), (b) and (c).]

Fig. 6 Sets of releves, overdetermined by a great number of species (graphs (a) and (b)). Graph (c) demonstrates an efficient solution for reducing the number of species in (b).

[Figure: performance of species 1 and species 2 along a pH gradient from low to high, with the positions A, B and C marked.]

Fig. 7 Non-linear correlation characterises two hypothetical species within the range A to C. The correlation is almost linear within the smaller range B to C.

The aims and properties of numerical methods often lead to very rigid constraints affecting their application. This will be explained briefly in the following sections.


2.3.2 Transformations

This term is used for any systematic adjustment of elements within a row or column vector of the data. The following cases are frequently used in plant ecology:

1. Replacing a code by numerical values, i.e., transforming the mixed scale of Braun-Blanquet to a metric scale:

   empty → 0.0
   +     → 1.0
   1     → 2.0
   2     → 3.0 etc.

2. Signum transformation (quantitative data changed to presence-absence). This disregards quantity in the species scores. The vector

   [1 23 0 5 17]

   after signum transformation becomes [1 1 0 1 1].

3. Weighting given scores by some specified quantity, such as the total or the length of a vector. The vector

   [1 2 0 1 1]

   after weighting with the total becomes [1/5 2/5 0 1/5 1/5].

Some transformations allow one to manipulate the influence of different species on the analysis by weighting them on the basis of performance measurements (frequency, biomass, cover percentage, etc.). Other transformations are needed if the measurements have different scales to make them commensurable.
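The three transformations listed above can be sketched in a few lines of Python. This is an illustration only and not the code of program INIT; the function names are assumptions, and the code table contains just the four values quoted in the text.

```python
# Assumed mapping of Braun-Blanquet codes to a metric scale. Only the values
# for "empty", "+", "1" and "2" are given in the text, so no others are listed.
BB_SCALE = {"": 0.0, "+": 1.0, "1": 2.0, "2": 3.0}


def recode(codes, scale=BB_SCALE):
    """Replace cover/abundance codes by numerical values."""
    return [scale[c] for c in codes]


def signum(vector):
    """Reduce quantitative scores to presence (1) or absence (0)."""
    return [1 if x > 0 else 0 for x in vector]


def weight_by_total(vector):
    """Divide each score by the vector total (the total must not be zero)."""
    total = sum(vector)
    return [x / total for x in vector]


print(recode(["", "+", "1", "2"]))       # [0.0, 1.0, 2.0, 3.0]
print(signum([1, 23, 0, 5, 17]))         # [1, 1, 0, 1, 1]
print(weight_by_total([1, 2, 0, 1, 1]))  # [0.2, 0.4, 0.0, 0.2, 0.2]
```

Weighting by the length of the vector would follow the same pattern, with the square root of the sum of squared scores in place of the total.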

2.3.3 Resemblance measures

There are quantities, known as resemblance measures, which express the similarity or dissimilarity of objects. Most resemblance measures incorporate data transformations and differ from each other. To describe the resemblance structure of a set of n individuals, n(n - 1)/2 resemblances between different pairs of individuals have to be computed. The different measures may express differently the degree of resemblance of given vectors. Some will stress the qualitative aspects, while others are more sensitive to quantitative differences. There are many different resemblance measures. Two examples:

A dissimilarity (distance) may be defined as the sum of all absolute differences between the elements of two vectors:

Releve 1:                    1  3  0  3
Releve 2:                    2  0  0  1
Absolute differences |d|:    1  3  0  2,   Σ|d| = 6

6 represents a measure for the dissimilarity of releves 1 and 2.

A similarity may be defined as the sum of all products of the pairs of elements in two vectors:

Releve 1:      1  3  0  3
Releve 2:      2  0  0  1
Products s:    2  0  0  3,   Σs = 5

5 represents a measure for the similarity of releves 1 and 2. Should the vectors be normalized, the upper limit of this similarity measure would be 1.
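The two examples can be checked with a short Python sketch. The function names are assumptions made for the illustration and do not refer to options of program RESE; the last lines count the distinct pairs among n = 4 individuals, which is n(n - 1)/2.

```python
import math
from itertools import combinations


def absolute_distance(x, y):
    """Sum of absolute differences between two vectors."""
    return sum(abs(a - b) for a, b in zip(x, y))


def scalar_product(x, y):
    """Sum of the products of paired elements."""
    return sum(a * b for a, b in zip(x, y))


def normalize(x):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(a * a for a in x))
    return [a / length for a in x]


releve_1 = [1, 3, 0, 3]
releve_2 = [2, 0, 0, 1]

print(absolute_distance(releve_1, releve_2))  # 6
print(scalar_product(releve_1, releve_2))     # 5

# After normalization the scalar product cannot exceed 1.
s = scalar_product(normalize(releve_1), normalize(releve_2))
print(s)

# Distinct pairs among n individuals.
n = 4
pairs = list(combinations(range(1, n + 1), 2))
print(len(pairs))  # 6
```

The normalized scalar product reaches 1 only when the two vectors are proportional, so it responds to the proportions of the species rather than to their absolute quantities.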

2.3.4 Ordination

Ordering data points on axes, and explaining the observed trends are the main aspects which characterize ordinations. Scatter diagrams are plots of points in two- or three-dimensional graphs. In some ordinations, the ordering of species or releves is based on measurements of site factors, e.g.,

Releves          1      2      3
pH               4.5    6.0    5.5
Altitude, m      130    230    270

[Scatter diagram: altitude (m a.s.l., scale 100 to 250) plotted against pH (scale 4 to 6), with one point for each of the releves 1, 2 and 3.]

Releves 2 and 3 are more similar than 1 and 2 or 1 and 3 with regard to pH and altitude, conditional on the linear scales used.

Some ordinations use coordinates (scores) derived from the data by multidimensional scaling procedures, such as that in principal component analysis (PCA), reciprocal ordering, polar ordination, etc., as the ordering criteria for species or releves, e.g.,

Releves                        1      2      3
Scores on component axis 1     0.8    1.9    5.1
Scores on component axis 2     1.4    2.0   -1.0

[Scatter diagram: the releves 1, 2 and 3 plotted on Axis 1 and Axis 2.]

The conclusion that releves 1 and 2 are more similar than 1 and 3 or 2 and 3, based on floristic composition, is substantiated.

Ordinations can improve the user's understanding and reveal aspects of complex resemblance structure in a data set which would otherwise go undetected. Ordination coordinates are new descriptors of vegetational variation. When they are correlated with site factors, the ones affecting the vegetation can be identified. Ordinations may reveal groups, but they are more efficient at displaying the gradient structures.
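The reading of a scatter diagram can be verified numerically by computing the distances between the points. The sketch below uses the component scores of the second example; the Euclidean distance is an assumption of the illustration, chosen because it corresponds to the visual distance in the diagram.

```python
import math
from itertools import combinations

# Component scores (axis 1, axis 2) of releves 1 to 3 from the second example.
scores = {1: (0.8, 1.4), 2: (1.9, 2.0), 3: (5.1, -1.0)}

# Euclidean distance for every pair of releves.
distances = {
    (i, j): math.dist(scores[i], scores[j])
    for i, j in combinations(sorted(scores), 2)
}
closest = min(distances, key=distances.get)

for pair, d in distances.items():
    print(pair, round(d, 2))
print("closest pair:", closest)  # closest pair: (1, 2)
```

The distances are about 1.25 for releves 1 and 2, 4.92 for 1 and 3, and 4.39 for 2 and 3, in agreement with the conclusion drawn from the diagram.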

2.3.5 Classification

This is the process which divides a sample into groups. The algorithms which classify objects are often referred to as cluster analyses. The strategies may be agglomerative or divisive. In many algorithms, clustering is based on resemblance measures.

Agglomerative clustering starts with finding the pair with most similar data points (individuals). This is then united with others in subsequent steps to form increasingly larger clusters. The results can be presented in the form of a dendrogram:

[Dendrogram of the releves 1, 3 and 2; the vertical axis gives the distance level for fusion, from 0 to 3.]

Releves 1 and 3 are more similar than 1 and 2 or 2 and 3. The dendrogram gives no information about whether releve 1 or 3 is closest to releve 2.

In divisive clustering, the sample of data points is subdivided successively into groups to optimize (or come closer to optimizing) given criteria. In the example above, a divisive clustering algorithm would most likely assign releves 1 and 3 to one group and releve 2 to a second group. Generally, agglomerative and divisive clustering need not lead to the same result. Classifications represent potentially efficient methods for summarization of data sets within a group structure. The resulting groups can be subjected to further analysis by using suitable methods to reveal finer trends or structures.
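An agglomerative strategy can be sketched as follows. The example uses nearest-neighbour (single) linkage, which is only one of many possible fusion rules and is not claimed to be the one implemented in program CLTR; the dissimilarities are hypothetical values chosen so that releves 1 and 3 fuse first, as in the dendrogram described above.

```python
from itertools import combinations


def linkage(a, b, dist):
    """Single linkage: smallest dissimilarity between members of two clusters."""
    return min(dist[frozenset((i, j))] for i in a for j in b)


def single_linkage(dist):
    """Agglomerative clustering; returns the fusions as (cluster, cluster, level)."""
    clusters = sorted({(i,) for pair in dist for i in pair})
    fusions = []
    while len(clusters) > 1:
        # Find the two clusters that are least dissimilar and fuse them.
        a, b = min(
            combinations(clusters, 2),
            key=lambda ab: linkage(ab[0], ab[1], dist),
        )
        level = linkage(a, b, dist)
        rest = [c for c in clusters if c not in (a, b)]
        clusters = sorted(rest + [tuple(sorted(a + b))])
        fusions.append((a, b, level))
    return fusions


# Hypothetical dissimilarities between three releves.
dist = {
    frozenset((1, 3)): 1.0,
    frozenset((1, 2)): 3.0,
    frozenset((2, 3)): 2.5,
}
fusions = single_linkage(dist)
print(fusions)
```

The list of fusions holds exactly the information drawn in a dendrogram: which clusters join, and at what level. It does not record which member of the first cluster was nearest to releve 2.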

2.3.6 Identification

This is the process of finding the most likely parent group for an individual. Unlike classification, identification requires the existence of established reference groups, e.g.,

Parent populations: p₁, p₂

Candidates for joining the parent populations: c₁, c₂

[Scatter diagram: parent populations p₁ and p₂ and candidates c₁ and c₂ plotted on Axis 1 and Axis 2]

An identification process would assign c₂ to p₁ and c₁ to p₂. Note that a cluster analysis would most likely lead to a first group with members c₁ and c₂.
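The difference between identification and clustering can be illustrated with a short Python sketch. It is not part of the package: the rule used here (assignment to the parent group with the nearest centroid) is only one possible identification criterion and is assumed for illustration, as are the coordinates and the names `centroid` and `identify`.

```python
from math import dist

def centroid(points):
    """Mean position of a group of points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def identify(candidate, groups):
    """Name of the parent group whose centroid lies closest to the candidate."""
    return min(groups, key=lambda name: dist(candidate, centroid(groups[name])))

# Hypothetical ordination coordinates (axis 1, axis 2), not taken from the report.
groups = {
    "p1": [(1.0, 4.0), (1.5, 4.5), (2.0, 4.0)],
    "p2": [(6.0, 1.0), (6.5, 1.5), (7.0, 1.0)],
}
c1 = (4.5, 2.0)
c2 = (3.5, 3.0)

print(identify(c1, groups), identify(c2, groups))
```

In this constructed case the two candidates lie closer to each other than to any member of a parent group, so a clustering of all points would tend to unite them first, whereas identification allocates each of them to a different parent group.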

Identification can serve useful purposes in synsystematics, for instance, by reallocating new releves to categories in an already existing system. Identification algorithms also represent a tool for refining already existing classifications.

2.3.7 Ranking

The purpose of ranking is to order a list of species according to their potential in accounting for variation within a given set of releves. Ranking can be based on the properties of single vectors, such as species frequency:

            Releves            Frequency   Rank order
            1    2    3

Species 1   1    3    5        3           1
Species 2   0    6    0        1           3
Species 3   0    1    ?        2           2

A more powerful algorithm would rank species not only in terms of individual vectors but of the correlation of vectors, e.g.,

Releves Rank order

2 3

Species 1 1 0 1

Species 2 1 0 3

Species 3 0 2

Since species 1 and 2 are functionally correlated, if species 1 is declared to have rank order 1, species 2 has to be ranked last. Since species 3 accounts for less of the correlations than 1, it is given rank 2.

Ranking allows the reduction of the species number. The algorithms minimize information loss within the constraints of given criteria. Some ranking algorithms can detect differentiating species. These can be used in the keys to vegetation types. Ranking may even be used to determine the sample size necessary to describe a given survey area. To do this, it is repeated on increased numbers of releves until the rank order of the species is stabilized.
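Ranking by a property of single vectors can be sketched in a few lines of Python. The example below ranks species by frequency, i.e., by the number of releves in which they occur. It is an illustration only: the scores are hypothetical, the function names are assumptions, and the ranking algorithms of the package itself are not reproduced.

```python
def frequency(scores):
    """Number of releves in which the species is present."""
    return sum(1 for s in scores if s > 0)

def rank_by_frequency(table):
    """Map each species to its rank; rank 1 is the most frequent species."""
    ordered = sorted(table, key=lambda name: frequency(table[name]), reverse=True)
    return {name: rank for rank, name in enumerate(ordered, start=1)}

# Hypothetical species scores in three releves.
table = {
    "species 1": [2, 3, 5],
    "species 2": [0, 6, 0],
    "species 3": [0, 1, 4],
}

print(rank_by_frequency(table))
```

Species with equal frequency keep their input order here, which is an arbitrary choice. A ranking based on the correlation of vectors would require additional information and is not covered by this sketch.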

2.3.8 Evaluation of clustering results

To measure the success of classifications, tests are performed on hypotheses concerned with individual properties of the structured tables, e.g.,

Solution 1            Releves
            Species   1    2    3    4

            1         1    1    0    0
            2         0    0    1    1
            3         0    0    1    0

Solution 2            Releves
            Species   1    2    3    4

            1         1    1    0    0
            2         0    0    1    1
            3         0    0    1    0

If the objective of a clustering process is to form species groups which are as specific to the releve groups as possible, then the group structure in solution 2 is better than that in solution 1. Statistical tests are available to determine which clustering algorithm serves the objective best. In some cases, the test procedures are used without the restrictions which have to be observed in statistics, simply in a deterministic context.


Bibliography to Chapter 2

Green, R.H., 1979: Sampling Design and Statistical Methods for Environmental Biologists. 257 p., New York, Wiley.

Mueller-Dombois, D., and Ellenberg, H., 1974: Aims and Methods of Vegetation Ecology. 547 p., New York, London, Sydney, Toronto, Wiley.

Orlóci, L., 1978: Multivariate Analysis in Vegetation Research. 2nd ed., 451 p., The Hague, Junk.

Whittaker, R.H., 1967: Gradient Analysis of Vegetation. Biol. Rev. 42: 207-264.

Whittaker, R.H., 1973: Ordination and Classification of Communities. Handbook of Vegetation Science V. 750 p., The Hague, Junk.


3 PLANNING THE ANALYSIS

3.1 The operational pathway

It has already been mentioned that vegetation surveys have different objectives which then require different methodological steps. These can of course limit the relevance of the results.

The same holds true for the analysis of data: no standard procedure exists that could fulfil all the needs. The present package attempts to overcome this problem by offering considerable freedom in the choice of combining methods of vegetation analysis (Section 2.2).

Although the flow chart of operations in Fig. 5 indicates all the technically possible pathways in the analysis that can be performed by the package, not all of them may be desirable or even meaningful. The choice of options will be recommended in the following sections but first three typical pathways will be described. These are likely to cover the needs of many analyses. However, they do not represent standard methods.

The first, indicated by a dotted line in Fig. 8, is preferred by users of non-automated classifications. The analysis starts with program EDIT, which stores the arrangement and the group membership records of releves and species. Program TABS then prints a vegetation table, a frequency table (Stetigkeitstabelle), and a contingency table. Program CIAC (canonical analysis of the contingency table) measures the success of the classification. The results are presented to program ORDB which prints a scatter diagram of the releve and species groups.

In the second case (dashed line in Fig. 8), program INIT initializes the computation. Here, a code, such as Braun-Blanquet, is replaced by numerical values and some transformations may be applied. Program RESE computes a resemblance matrix and program PCAB computes ordination coordinates for both releves and species in component analysis or reciprocal ordering. GRID analysis finds noda (clusters) in the ordinations, which are picked up and stored by running program EDIT. A vegetation table is printed and tested in the same way as described in the first case.

In the third case (solid line in Fig. 8), the analysis begins with program INIT. The resemblance matrix (program RESE) is subjected to cluster analysis (program CLTR). This yields a dendrogram which, after inspection, is subdivided into groups. Program EDIT stores the classification, TABS prints the vegetation and summary tables and CIAC measures the success of the classification.

In any of these three pathways, the appropriate options have to be chosen: the transformations, the resemblance measure, the methods of ordination and clustering, and the method of editing the data files. This is discussed next.

3.2 Transformation of cover/abundance values

The description of vegetation stands is often based on estimates on a cover/abundance scale. The data are of a mixed type which consists of codes with specific meanings. Since the majority of mathematical methods can only handle metric data, a transformation of the scale is needed that yields the desired numerical values. There should be no doubt about the influence of such transformations on the results.

[Flow chart connecting the standard input with the programs, including 1 INIT, 2 EDIT, 3 TABS, 4 RESE, 10 ORDB and 11 CIAC]

1. Dotted line: Pathway for non-automated classification followed by measurement of success in tabular sorting

2. Dashed line: Pathway for detecting groups by grid analysis

3. Solid line: Pathway for detecting groups by other clustering methods

Fig. 8 Three typical pathways in a vegetation analysis.

Different solutions have been suggested in the past. We prefer van der Maarel's (1979) solution. He reviewed the previous attempts, drew some general conclusions, and showed that almost all the transformations previously suggested can easily be derived when the Braun-Blanquet scale is replaced by scores from 1 to 9 and transformed according to the power function

$y = x^{w}$   (1)

(Fig. 9). In this, x is the ordinal score (1-9), y is the transformed score and w is a user-defined coefficient. The influence of w on y is shown in Fig. 9:


Braun-Blanquet   Cover   Ordinal     Power transformations (y = x^w)
scale            %       scale (x)   w = 0   w = 0.25   w = 0.5   w = 1   w = 2   w = 4

(blank)           0.0    0           0.0     0.00       0.00      0       0       0
r                        1           1.0     1.00       1.00      1       1       1
+                 0.1    2           1.0     1.19       1.41      2       4       16
1                 5.0    3           1.0     1.32       1.73      3       9       81
2   2m                   4           1.0     1.41       2.00      4       16      256
    2a           17.5    5           1.0     1.50       2.24      5       25      625
    2b                   6           1.0     1.57       2.45      6       36      1296
3                37.5    7           1.0     1.63       2.65      7       49      2401
4                62.5    8           1.0     1.68       2.83      8       64      4096
5                87.5    9           1.0     1.73       3.00      9       81      6561

Fig. 9 Transformations of the Braun-Blanquet cover/abundance scale (van der Maarel, 1979).

When w = 0, the power transformation will yield presence/absence data (with the convention 0^0 = 0). This disregards species quantity and thus implies considerable loss of information. However, it has been frequently observed that some large samples can be successfully analysed in terms of presence/absence. Frequency tables (Stetigkeitstabellen) and contingency tables are examples. The latter can be analysed by program CIAC.

Transformations based on w = 0.25 to w = 1.0 represent a compromise in that they account for differences in cover percentages, and at the same time, they give extra weight to low values. Van der Maarel (1979) found that results from numerical analyses were easiest to interpret at w = 0.5 or w = 1.0.

When w is set equal to 2, the y values approximate reasonably well to the original cover percentages. It is however doubtful if cover is a good measure of species importance. Many phytosociologists prefer to give extra weight to the dominant species. This is the case when w = 4 or more.
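The power transformation is easily reproduced. The following Python sketch is given for illustration and is not the code of the package; the names `ORDINAL`, `power_transform` and `transform_codes` are assumptions. The mapping of codes to ordinal scores follows Fig. 9, and the convention 0^0 = 0 is applied explicitly, since Python would otherwise evaluate 0 ** 0 as 1.

```python
# Ordinal scores for the Braun-Blanquet codes as listed in Fig. 9.
ORDINAL = {"": 0, "r": 1, "+": 2, "1": 3, "2m": 4,
           "2a": 5, "2b": 6, "3": 7, "4": 8, "5": 9}

def power_transform(x, w):
    """Return y = x ** w, with the convention that 0 ** 0 = 0."""
    if x == 0:
        return 0.0
    return float(x) ** w

def transform_codes(codes, w):
    """Replace cover/abundance codes by transformed ordinal scores."""
    return [power_transform(ORDINAL[code], w) for code in codes]

releve = ["", "+", "2a", "5"]
for w in (0, 0.5, 1, 2):
    print(w, [round(y, 2) for y in transform_codes(releve, w)])
```

With w = 0 the result is presence/absence data, and with increasing w the dominant species receive more and more weight.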

Transformation of cover/abundance values is part of the original input file, described in Section 4.2, unlike the vector transformations which are part of the resemblance measures. These are discussed next.

3.3 Resemblance measures and vector transformations

The term resemblance means the degree of similarity of objects according to the properties by which they are described. The resemblance of objects is measurable by different functions. Resemblance functions generate the resemblance matrix and in turn impose a structure on the samples. The resemblance matrix is the direct input for different methods of multivariate analysis. The role of a resemblance function is fundamental, in that the result will depend on it. In some methods the choice between resemblance functions is a matter of personal preference. In other cases, the choice is dictated by the method of analysis.


Resemblance measure            Option in      Type         Value of     Value of     Intrinsic
                               program RESE                highest      lowest       transformation
                                                           similarity   similarity

Cross product                  1              similarity   not fixed    not fixed    centering
Covariance                     2              similarity   not fixed    not fixed    centering
Correlation coefficient        3              similarity   +1           -1           standardisation
Euclidean distance             4              distance     0            not fixed    none
Chord distance                 5              distance     0            √2           normalisation
Ochiai's coefficient           6              similarity   1            0
Van der Maarel's coefficient   7              similarity   1            0

Fig. 10 Characteristics of the resemblance measures in program RESE

The resemblance functions, computed in program RESE, differ widely (Fig. 10). They measure similarities or distances. They may or may not have a fixed upper or lower limit. The most characteristic feature, however, is the transformation of the species or releve vectors which the resemblance measure can induce. An example of such a transformation widely used in numerical analysis is standardization. The correlation coefficient (option 3 in RESE) incorporates this, allowing comparisons to be made when the measurements are based on different scales. The correlation coefficient, in fact, may often be the only acceptable resemblance measure for site variables.

Resemblance measures without a fixed upper limit usually have the disadvantage that they rely on absolute species quantity. The graphs on the left of Fig. 11 illustrate this point. Without transformation, releve 1 appears to be more similar to releve 3, even though all its species are in common with releve 2. Normalizing the releves, i.e., weighting the species scores by the inverse of the total length of the releve vectors (a transformation incorporated by the chord distance), compensates for their differences in diversity (right graphs in Fig. 11). This transformation is also available as a separate option in program INIT. Whenever a transformation is applied, further computations rely entirely on the transformed data. These may differ substantially from the untransformed data. The lower graphs in Fig. 11 are examples of this.
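The effect of normalization can be checked numerically. The Python sketch below is an illustration only and not the code of program RESE; the function name `normalize` is an assumption, and the three releve vectors are assumed values chosen so that releve 2 contains the same species as releve 1 in doubled quantities, while releve 3 shares a single species with it.

```python
from math import dist, sqrt

def normalize(vector):
    """Divide all scores by the Euclidean length of the vector."""
    length = sqrt(sum(x * x for x in vector))
    if length == 0:
        raise ValueError("an empty releve cannot be normalized")
    return [x / length for x in vector]

# Assumed species scores of three releves.
releve_1 = [1, 2, 2]
releve_2 = [2, 4, 4]
releve_3 = [0, 0, 3]

# Euclidean distance on the raw data.
raw_12 = dist(releve_1, releve_2)
raw_13 = dist(releve_1, releve_3)

# Chord distance: Euclidean distance between the normalized vectors.
chord_12 = dist(normalize(releve_1), normalize(releve_2))
chord_13 = dist(normalize(releve_1), normalize(releve_3))

print(raw_12, raw_13, chord_12, chord_13)
```

On the raw data releve 1 lies closer to releve 3 than to releve 2. After normalization the order is reversed, and the chord distance between releves 1 and 2 vanishes because their species proportions are identical.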

Another commonly used transformation, which is available as an option in program INIT (option 3), converts the data elements into deviations from the expectations. Fig. 12 shows this in an example of contingency tables. The lowest graph on the right indicates that such a transformation may bring about fundamental changes in the data: the observed quantities disappear completely and the new values express the degree to which occurrence of the species deviates from what would be expected in a smooth vegetation table exhibiting no associations between releves and species.
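The computation of such deviations can be sketched in Python. The sketch assumes that a deviation is the observed relative frequency minus the product of the corresponding row and column proportions, which agrees with the values legible in Fig. 12; whether program INIT applies any further scaling is not stated here. The function name `deviations` and the small table are chosen for illustration.

```python
from fractions import Fraction

def deviations(table):
    """Observed relative frequencies minus their expectations.

    table: rows are species, columns are releves.
    """
    total = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    result = []
    for i, row in enumerate(table):
        result.append([
            Fraction(value, total)
            - Fraction(row_sums[i] * col_sums[j], total * total)
            for j, value in enumerate(row)
        ])
    return result

# Presence/absence of three species (rows) in three releves (columns).
raw = [
    [1, 1, 0],
    [0, 1, 1],
    [0, 1, 0],
]
dev = deviations(raw)
for row in dev:
    print([str(d) for d in row])
```

Exact fractions are used so that the result can be compared directly with a hand calculation. Each row and each column of the deviations sums to zero, which shows that the observed quantities themselves are no longer present in the transformed data.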

The resemblance measured may be influenced by the species correlations. Such an influence is desirable if trends are sought, since the information about trends is carried in the correlations. If the effect of correlations is undesirable, the data must be orthogonalized. There is no such transformation provided in program INIT and also no resemblance measure in RESE that would do so. However, orthogonalisation may be achieved based on orthogonal functions in program RANK or in the components of program PCAB. Note that GRID works in a perfectly orthogonalized space (Section 3.5.1).


Raw data                                  Releve vectors of unit length

            Releves                                   Releves
            1    2    3    4                          1     2     3     4

Species 1   1    2    -    -              Species 1   1/3   1/3   -     -
Species 2   2    4    -    -              Species 2   2/3   2/3   -     -
Species 3   2    4    3    1              Species 3   2/3   2/3   3/3   3/3

Vector                                    Vector
length      3    6    3    1              length      1     1     1     1

[Lower graph: the same example drawn with plant symbols instead of numbers]

Fig. 11 Transformation of releve vectors to unit length (upper graph): Species scores x are divided by the Euclidean length l of the related releve vectors, where $l = (\sum x^{2})^{1/2}$. The lower graph illustrates the same example in non-numerical terms.

Raw data
            Releves
            1       2       3       Σ

Species 1   1       1       -       2
Species 2   -       1       1       2
Species 3   -       1       -       1
Σ           1       3       1       5

Frequencies
            Releves
            1       2       3       Σ

Species 1   1/5     1/5     -       2/5
Species 2   -       1/5     1/5     2/5
Species 3   -       1/5     -       1/5
Σ           1/5     3/5     1/5     1

Expectations
            Releves
            1       2       3       Σ

Species 1   2/25    6/25    2/25    2/5
Species 2   2/25    6/25    2/25    2/5
Species 3   1/25    3/25    1/25    1/5
Σ           1/5     3/5     1/5     1

Deviations
            Releves
            1       2       3

Species 1   +3/25   -1/25   -2/25
Species 2   -2/25   -1/25   +3/25
Species 3   -1/25   +2/25   -1/25

[Bottom graphs: the same example drawn with plant symbols instead of numbers]

Fig. 12 Transformation of raw data to deviation scores. The graphs on the bottom illustrate the same example in non-numerical terms.


3.4 Ordinating the data by component analysis

Component analysis seeks representation of linear continuous variation on axes which are parsimonious, and preferably, indicative of ecological trends. It produces new descriptions of objects, such as releves, by way of a reduced number of uncorrelated variables, known as components. The component scores represent coordinates on straight axes. From these, two- or three-dimensional scatter diagrams (ordinations) are constructed in program ORDB. These are graphical representations of the resemblance structure.

Whereas program PCAB represents a single algorithm, it offers different options: When the input matrix consists of resemblances between releves (species), scores for species (releves) are derived by setting the local input option in PCAB to "NO". The classical versions of PCA use the correlation coefficient as a resemblance measure. This requires option 3 in RESE. But for vegetation data the covariance may represent a better choice, since it does not incorporate standardization. Other similarity measures may be used, but they are handled as covariances. Euclidean distance (option 4 in RESE) is not compatible with this program.

PCAB includes an option to store the adjusted component coefficients when the local input is set to "YES". The program yields adjusted component coefficients for species and component scores for releves. The component coefficients measure the effect of the species vectors on the components in ordinations of releves. The component scores are ordination coordinates.

If the data are transformed into deviations from expectation (option 3 for transformations in INIT), PCAB will yield the species or releve scores, which after adjustment, are the bases for reciprocal ordering. Since the species and releve scores are related through simple averaging, the computational difficulty can be reduced by using the smaller resemblance matrix of either the releves or the species as the input.

The results of component analysis can be most conveniently interpreted from the scatter diagrams and from the inspection of the eigenvalues which (as proportions) measure the relative resolving power of the axes. While there is no rule about the number of axes which should be considered, four dimensions often prove sufficient to interpret an ordination. Problems can arise in interpreting higher-dimensional ordinations, since only selected projections can be inspected, one at a time (program ORDB). Grid analysis was designed to deal with exactly this problem.
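The role of the eigenvalues can be made concrete with a small Python sketch. It is not the algorithm of program PCAB: it extracts only the first component from a covariance matrix between species, using power iteration as a simple stand-in for a full eigenanalysis. The data and all function names are hypothetical.

```python
from math import sqrt

def covariance_matrix(releves):
    """Covariances between species; each releve is a list of species scores."""
    n = len(releves)
    p = len(releves[0])
    means = [sum(r[j] for r in releves) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in releves]
    cov = [[sum(c[j] * c[k] for c in centered) / (n - 1) for k in range(p)]
           for j in range(p)]
    return cov, centered

def first_component(cov, iterations=100):
    """Largest eigenvalue and its unit eigenvector, found by power iteration."""
    p = len(cov)
    v = [1.0] * p
    for _ in range(iterations):
        w = [sum(cov[j][k] * v[k] for k in range(p)) for j in range(p)]
        length = sqrt(sum(x * x for x in w))
        v = [x / length for x in w]
    eigenvalue = sum(v[j] * cov[j][k] * v[k]
                     for j in range(p) for k in range(p))
    return eigenvalue, v

# Hypothetical scores of two species in four releves.
releves = [[2, 1], [3, 3], [5, 4], [6, 6]]
cov, centered = covariance_matrix(releves)
eigenvalue, coefficients = first_component(cov)

# Proportion of the total variance accounted for by the first axis.
total_variance = sum(cov[j][j] for j in range(len(cov)))
proportion = eigenvalue / total_variance

# Component scores: coordinates of the releves on the first axis.
scores = [sum(c[j] * coefficients[j] for j in range(len(c))) for c in centered]
print(round(proportion, 3), [round(s, 2) for s in scores])
```

In this constructed example the two species are strongly correlated, so the first axis accounts for nearly all of the variance and a single coordinate per releve suffices.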

3.5 Clustering methods

3.5.1 Grid analysis

Grid analysis finds nodal groups within an ordination space. The resolving power of the algorithm can be adjusted by changing the grid-width, i.e., the number of cells per axis. However, the number of noda found very much depends on the group structure of the data. Continuously dispersed data points lead to a low number of noda and discontinuities to a higher number.

The present version of GRID is limited to four dimensions. Fewer dimensions can be used. For this, one has to specify the number of cells on the axes to be neglected. Three or two dimensions may suffice in small and homogeneous data sets. In any case, it is advisable to start with a low resolution, say 4 cells per axis for about 100 data points, and to increase the resolution in steps up to 8 cells or so per axis. In this way, the reoccurring groups are identifiable and conclusions can be drawn about discontinuities and the presence of natural groups. The group structure found by GRID can be interpreted in an ordination (ORDB).
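The idea of inspecting an ordination at several resolutions can be illustrated in Python. The sketch below is not the GRID algorithm, whose rules for delimiting noda are not reproduced here. It merely divides each axis into a given number of cells and counts the data points per cell; the coordinates and function names are assumptions.

```python
from collections import Counter

def cell_index(value, low, high, cells):
    """Index of the cell (0 .. cells - 1) into which a coordinate falls."""
    if high == low:
        return 0
    index = int((value - low) / (high - low) * cells)
    return min(index, cells - 1)  # the maximum belongs to the last cell

def grid_counts(points, cells):
    """Number of data points in each occupied cell of the grid."""
    dims = len(points[0])
    lows = [min(p[d] for p in points) for d in range(dims)]
    highs = [max(p[d] for p in points) for d in range(dims)]
    return Counter(
        tuple(cell_index(p[d], lows[d], highs[d], cells) for d in range(dims))
        for p in points
    )

# Hypothetical ordination coordinates forming two separate clumps.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
          (3.0, 3.0), (2.9, 3.1), (3.1, 2.8)]

for cells in (2, 4):
    print(cells, dict(grid_counts(points, cells)))
```

Here the same two occupied cells, each with three points, reappear at both resolutions. Groups that persist in this way when the resolution is increased point to a discontinuity in the data.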

3.5.2 Other clustering methods

Clustering methods, such as grid, single linkage, complete linkage, and minimum variance, impose a group structure on the data. If discontinuities exist in the structure, the groups will be natural. Otherwise clustering will only produce dissections. Clustering must be distinguished from identification (program IDEN) which finds a parent group for new data points in an already classified data set. The clustering algorithms differ in the way they process a given resemblance matrix (Section 4.4.7).

Single Linkage Analysis (option 1 in CLTR) is one of the methods offered. The dendrograms produced often reveal chaining in the fusions (Fig. 13, upper graph). While this would clearly be a disadvantage when a distinct group structure is present, it could be advantageous when the sample points are continuously dispersed with pronounced gradients.

[Upper graph: dendrogram of data points 1 to 16. Lower graph: dendrogram of data points 1 to 15.]

Fig. 13 Typical dendrograms from Single Linkage Analysis (upper graph) and Complete Linkage Analysis (lower graph). Dotted lines indicate choices for forming groups.


Complete Linkage Analysis (option 2 in CLTR) tends to yield very compact groups (lower graph in Fig. 13). This property is considered an advantage if the data structure is discontinuous. The same is true with Minimum Variance Clustering, the third method offered (option 3 in CLTR). This minimizes the variance within the groups. While single and complete linkage analysis are not restricted to any specific resemblance measure, minimum variance clustering assumes that squared Euclidean distances are the input (options 4 and 5 in RESE). This demonstrates explicitly that the method is of limited scope and should be used only if the variance is considered a meaningful measure.

Clustering yields a dendrogram. To find groups, the dendrogram has to be subdivided. There are many possible ways to do this. The subdivision is usually at a level of similarity where easily recognizable groups exist. Examples are given in Fig. 13. The dendrograms are dismembered at levels indicated by the dotted line. In program CLTR, the user would have to ask for 6 groups to get the subdivision shown in the upper graph, and 4 groups for that of the lower graph.
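The different behaviour of single and complete linkage, and the effect of asking for a given number of groups, can be reproduced with a short Python sketch. It is a plain agglomerative procedure on Euclidean distances written for illustration, not the code of program CLTR; the function name `agglomerate` and the data are assumptions.

```python
from math import dist

def agglomerate(points, n_groups, linkage):
    """Fuse the closest clusters until n_groups remain.

    linkage=min gives single linkage, linkage=max complete linkage.
    Clusters are returned as sorted lists of point indices.
    """
    clusters = [[i] for i in range(len(points))]

    def between(a, b):
        # Distance between two clusters under the chosen linkage rule.
        return linkage(dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_groups:
        pairs = [(between(clusters[a], clusters[b]), a, b)
                 for a in range(len(clusters))
                 for b in range(a + 1, len(clusters))]
        _, a, b = min(pairs)
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]

# Hypothetical data points along a gradient with slowly widening gaps.
points = [(0.0,), (1.0,), (2.1,), (3.3,), (4.6,), (6.0,)]

single = agglomerate(points, 2, min)
complete = agglomerate(points, 2, max)
print(single, complete)
```

With these continuously dispersed points, single linkage chains the first five points together and leaves the last one on its own, whereas complete linkage forms two more compact groups.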

3.6 Managing large data sets

3.6.1 General considerations

Formal procedures, such as the automated methods presented in this package, are most suitable to classify sets of data which are visually heterogeneous. Many of the programs in the package can handle data sets of relatively large size (500 releves and 500 species). Excessive size is in fact typical in vegetation surveys. While data handling (Section 2.1) is not really a great problem when the size of the data set is large, sophisticated numerical methods such as component analysis can only be done economically when the data sets are relatively small. The problem of reducing the size of the sample may thus arise.

3.6.2 Reducing sample size

A solution should be based on practical as well as theoretical considerations. Firstly, the individual releves and species which are needed for the study must of course be retained. Secondly, loss of information when reducing the size of a sample should be kept to a minimum. A suggested operational pathway could be like this:

Preparation of the data starts by removing the aberrant observations, such as releves that are clearly not a part of the type of vegetation intended to be sampled. For this, use option 12 in EDIT. The remaining sample may then be subdivided into subsets based on some internal or external criteria to reduce the computation cost and to improve the efficiency of further manipulations. Alternatively, the releves may be resampled. In this way, the major features of the compositional spectrum of the data are likely to be retained, and only the sample size is reduced. As long as the original data do not exhibit periodicity, option 9 can be used in program EDIT to perform the extraction of a subsample, retaining every second, third or fourth releve. The same program will produce the reduced data files (file REOR). These have to be preprocessed in program INIT before the species list can be reduced.
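The systematic extraction of every second, third or fourth releve amounts to the following operation, shown here as a Python sketch for illustration. It is not the code of program EDIT; the function name `every_kth` and its `start` parameter are assumptions.

```python
def every_kth(releves, k, start=0):
    """Retain every k-th releve, beginning with the one at position start."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return releves[start::k]

# Hypothetical releve numbers in their original order.
releve_numbers = list(range(1, 11))
print(every_kth(releve_numbers, 3))
```

If the releves were recorded in a periodic order, for instance along regularly alternating site types, such a subsample could systematically miss part of the variation, which is why the absence of periodicity is required.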
