

In the document Data Management Multivariate • 2A (pages 17-25)

2 Numerical methods in vegetation surveys

2.3 Short reference to numerical methods

2.3.1 Characterization

The objectives and methods of data analysis, with relevance to vegetation surveys, have been reviewed in great detail by Orlóci (1978). Data analysis endeavours to reduce the information content of data to a simpler form that can be interpreted. The process usually involves fitting constructions of different complexity to the data, such as a regression line, an ordination structure, or a group structure. In all cases, the vegetation analyst may benefit from automated multivariate procedures. These operate on the basis of explicitly defined algorithms. A great number of worked algorithms are available to users, some of which have been widely used. The examples range from the method of Braun-Blanquet (Mueller-Dombois and Ellenberg, 1974) to automated clustering and ordination, which can claim more objectivity. However, the choice between subjective and objective procedures is hardly ever determined by broad differences in their purpose; it is influenced more by local tradition and the personal preference of the investigator.

Why use multivariate analysis? One advantage is efficiency. This results from the variables being analysed simultaneously and not individually. Another is increased power. This arises from the fact that the variables are analysed as correlated entities. Yet another is the broad relevance of the results, which reflect the collective influence of all variables and apply to all variables in a simultaneous analysis.

The data analyst has to keep in mind that aberrant observations can have an unduly strong influence on the results or even dominate the analysis. It is advisable to remove the aberrant observations prior to the analysis and to interpret them separately from the main body of the data. Relevant points are made about this in Chapter 3.6.2.

To illustrate what has been said so far, a set of vegetational data, presented in symbolic terms, will now be considered:

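$$
\begin{pmatrix}
X_{11} & \cdots & X_{1n} \\
\vdots &        & \vdots \\
X_{p1} & \cdots & X_{pn}
\end{pmatrix}
\qquad \text{rows: } p \text{ species}, \quad \text{columns: } n \text{ releves}
$$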

In this, the n releves (interpreted as individuals) are described based on the abundance of p species. The number of species in vegetation surveys is generally large and it may even exceed the number of releves.

Information in the data is carried by species correlations (or, equivalently, by releve similarities). Fig. 6, graph (a), illustrates the problem of indeterminacy when the species correlations are low; groups could be formed with members 1 and 2, 2 and 3 or 1 and 3, none of which is more justified than the others. The set of releves here can be determined in different ways according to the species scores. Graph (b) in Fig. 6 illustrates a normal case where the species are highly correlated. The group structure is strong, and the hierarchy is unique, R1 + (R2 + R3). Simultaneous analysis of all the species would suggest a possible simplification (graph (c) of Fig. 6) in that the number of species can be reduced with minimum loss of information. Analysis of this new, reduced data set by many automated methods may be much more economical than analysis of the original set.

Fig. 6 Sets of releves, overdetermined by a great number of species (graphs (a) and (b)). Graph (c) demonstrates an efficient solution for reducing the number of species in (b).

It is conceivable that the correlations analysed are often linear. It is, however, well known that species hardly ever respond linearly; rather, the response graphs are bell-shaped or even more complex (Fig. 7). Because of this, non-linear correlations can be expected. Linear correlations are indeed assumed in some of the programs of this package. The user, however, can avoid some undesirable effects of non-linearity by not mixing releves from very different sites in the same sample. Fig. 7 illustrates this point, and also the effect of subdividing the data into more homogeneous sets. Further relevant aspects will be discussed in Chapter 3.6, dealing with the analysis of very large data sets.

Fig. 7 Non-linear correlation characterises two hypothetical species within the range A to C. The correlation is almost linear within the smaller range B to C. (Vertical axis: species performance; horizontal axis: pH from low to high, with points A, B and C; one response curve each for species 1 and species 2.)
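The effect of sample heterogeneity on a linear correlation can be sketched numerically. The example below is only an illustration under stated assumptions: the Gaussian response curves, the optima (pH 4.5 and 6.5), the tolerance and the gradient limits are all hypothetical and are not taken from Fig. 7; `pearson` and `bell_response` are helper names introduced here.

```python
import math

def pearson(x, y):
    """Pearson product-moment correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def bell_response(ph, optimum, tolerance=1.0):
    """Hypothetical bell-shaped (Gaussian) species response along a pH gradient."""
    return math.exp(-((ph - optimum) ** 2) / (2 * tolerance ** 2))

# Assumed gradient: pH 2.5 to 8.5 in steps of 0.25 (a wide, heterogeneous sample).
gradient = [2.5 + 0.25 * i for i in range(25)]
species_1 = [bell_response(ph, optimum=4.5) for ph in gradient]  # favours low pH
species_2 = [bell_response(ph, optimum=6.5) for ph in gradient]  # favours high pH

# Correlation over the whole gradient.
r_full = pearson(species_1, species_2)

# Correlation over the stretch between the two assumed optima only
# (a narrower, more homogeneous sample).
subset = [i for i, ph in enumerate(gradient) if 4.5 <= ph <= 6.5]
r_sub = pearson([species_1[i] for i in subset], [species_2[i] for i in subset])

print(f"whole gradient: r = {r_full:.2f}")
print(f"sub-range:      r = {r_sub:.2f}")
```

Under these assumptions the linear correlation is much closer to -1 in the narrow sub-range than over the whole gradient, which is the reason for not mixing releves from very different sites.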

The aims and properties of numerical methods often lead to very rigid constraints affecting their application. This will be explained briefly in the following sections.

2.3.2 Transformations

This term is used for any systematic adjustment of elements within a row or column vector of the data. The following cases are frequently used in plant ecology:

1. Replacing a code by numerical values, i.e., transforming the mixed scale of Braun-Blanquet to a metric scale:

   Code     Value
   empty    0.0
   +        1.0
   1        2.0
   2        3.0
   etc.

2. Signum transformation (quantitative data changed to presence-absence). This disregards quantity in the species scores. The vector

   [ 1 23 0 5 17 ]

   after signum transformation becomes [ 1 1 0 1 1 ].

3. Weighting given scores by some specified quantity, such as the total or the length of a vector. The vector

   [ 1 2 0 1 1 ]

   after weighting with the total, becomes [ 1/5 2/5 0 1/5 1/5 ].

Some transformations allow one to manipulate the influence of different species on the analysis by weighting them on the basis of performance measurements (frequency, biomass, cover percentage, etc.). Other transformations are needed if the measurements have different scales to make them commensurable.
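The three transformations listed in this section can be sketched in a few lines. This is a minimal illustration, not the code of the package: the function names are introduced here, and the code table contains only the four Braun-Blanquet entries given in the text (the remaining codes are left out because the text only says "etc.").

```python
import math

# Partial code table as given in the text; further codes are deliberately omitted.
BRAUN_BLANQUET = {"": 0.0, "+": 1.0, "1": 2.0, "2": 3.0}

def decode(codes):
    """Replace Braun-Blanquet codes by numerical values."""
    return [BRAUN_BLANQUET[code] for code in codes]

def signum(vector):
    """Presence-absence transformation: any positive score becomes 1."""
    return [1 if value > 0 else 0 for value in vector]

def weight_by_total(vector):
    """Divide each score by the vector total (assumes a non-zero total)."""
    total = sum(vector)
    return [value / total for value in vector]

def weight_by_length(vector):
    """Divide each score by the Euclidean length (assumes a non-zero vector)."""
    length = math.sqrt(sum(value * value for value in vector))
    return [value / length for value in vector]

print(decode(["+", "", "2", "1"]))       # [1.0, 0.0, 3.0, 2.0]
print(signum([1, 23, 0, 5, 17]))         # [1, 1, 0, 1, 1]
print(weight_by_total([1, 2, 0, 1, 1]))  # [0.2, 0.4, 0.0, 0.2, 0.2]
```

After `weight_by_total` the scores of a vector sum to 1; after `weight_by_length` the sum of their squares is 1.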

2.3.3 Resemblance measures

There are quantities, known as resemblance measures, which express the similarity or dissimilarity of objects. Most resemblance measures incorporate data transformations and differ from each other. To describe the resemblance structure of a set of n individuals, n(n - 1)/2 resemblances between different pairs of elements have to be computed. The different measures may express differently the degree of resemblance of given vectors. Some will stress the qualitative aspects, while others are more sensitive to quantitative differences. There are many different resemblance measures. Two examples:

A dissimilarity (distance) may be defined as the sum of all absolute differences between the elements of two vectors:

   Releve 1:                   1  3  0  3
   Releve 2:                   2  0  0  1
   Absolute differences |d|:   1  3  0  2      Σ|d| = 6

The value 6 represents a measure of the dissimilarity of releves 1 and 2.

A similarity may be defined as the sum of all products of the pairs of elements in two vectors:

   Releve 1:      1  3  0  3
   Releve 2:      2  0  0  1
   Products s:    2  0  0  3      Σs = 5

The value 5 represents a measure of the similarity of releves 1 and 2. Should the vectors be normalized, the upper limit of this similarity measure would be 1.
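The two worked examples can be checked with a short sketch. The function names are introduced here for illustration; the normalized variant divides the product sum by the lengths of both vectors, which is one common way of obtaining the upper limit of 1 mentioned above. The last helper lists the n(n - 1)/2 distinct pairs for which a resemblance has to be computed.

```python
import math
from itertools import combinations

def absolute_distance(a, b):
    """Sum of absolute differences between corresponding elements."""
    return sum(abs(x - y) for x, y in zip(a, b))

def product_similarity(a, b):
    """Sum of products of corresponding elements (scalar product)."""
    return sum(x * y for x, y in zip(a, b))

def normalized_similarity(a, b):
    """Scalar product of the two vectors after scaling each to unit length."""
    norm = math.sqrt(product_similarity(a, a) * product_similarity(b, b))
    return product_similarity(a, b) / norm

def resemblance_pairs(n):
    """All distinct pairs among n individuals, numbered from 1."""
    return list(combinations(range(1, n + 1), 2))

releve_1 = [1, 3, 0, 3]
releve_2 = [2, 0, 0, 1]

print(absolute_distance(releve_1, releve_2))                # 6
print(product_similarity(releve_1, releve_2))               # 5
print(round(normalized_similarity(releve_1, releve_2), 3))  # 0.513
print(len(resemblance_pairs(4)))                            # 6
```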

2.3.4 Ordination

Ordering data points on axes and explaining the observed trends are the main aspects which characterize ordinations. Scatter diagrams are plots of points in two- or three-dimensional graphs. In some ordinations, the ordering of species or releves is based on measurements of site factors, e.g.,

   Releves        1     2     3
   pH             4.5   6.0   5.5
   Altitude, m    130   230   270

[Scatter diagram: altitude (m a.s.l., 100 to 250) plotted against pH (4 to 6) for releves 1, 2 and 3.]

Releves 2 and 3 are more similar than 1 and 2 or 1 and 3 with regard to pH and altitude, conditional on the linear scales used.
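The comparison can be reproduced by computing distances between the three releves in the plane of the two site factors. The sketch below uses the pH and altitude values of the example; the rescaling of each factor to the range 0 to 1 is an assumption made here to show one possible choice of linear scale, and the helper names are hypothetical.

```python
import math
from itertools import combinations

releves = [1, 2, 3]
ph = [4.5, 6.0, 5.5]
altitude = [130.0, 230.0, 270.0]

def rescale(values):
    """Map values linearly onto the range 0 to 1 (assumes max > min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def pairwise_distances(x, y):
    """Euclidean distance between every pair of releves in the (x, y) plane."""
    points = dict(zip(releves, zip(x, y)))
    return {(a, b): math.dist(points[a], points[b])
            for a, b in combinations(releves, 2)}

raw = pairwise_distances(ph, altitude)                      # original units
scaled = pairwise_distances(rescale(ph), rescale(altitude)) # both factors 0..1

for pair in raw:
    print(pair, round(raw[pair], 2), round(scaled[pair], 2))
```

With both scalings the pair (2, 3) has the smallest distance. On the raw scale the result is driven almost entirely by altitude, because its numerical range is far larger than that of pH; this is what "conditional on the linear scales used" refers to.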

Some ordinations use coordinates (scores) derived from the data by multidimensional scaling procedures, such as principal component analysis (PCA), reciprocal ordering, polar ordination, etc., as the ordering criteria for species or releves, e.g.,

   Releves                       1     2     3
   Scores on component axis 1    0.8   1.9   5.1
   Scores on component axis 2    1.4   2.0  -1.0

[Scatter diagram: releves 1, 2 and 3 plotted by their scores on axis 1 (horizontal) and axis 2 (vertical).]

The conclusion that releves 1 and 2 are more similar than 1 and 3 or 2 and 3, based on floristic composition, is substantiated.

Ordinations can improve the user's understanding and reveal aspects of complex resemblance structure in a data set which would otherwise go undetected. Ordination coordinates are new descriptors of vegetational variation. When they are correlated with site factors, the ones affecting the vegetation can be identified. Ordinations may reveal groups, but they are more efficient at displaying the gradient structures.
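How component scores arise from the data can be sketched for the simplest possible case. The example below is restricted to two species so that the eigen-decomposition of the covariance matrix has a closed form; it is not the procedure of the package, and the abundance values and the name `pca_two_variables` are hypothetical.

```python
import math

def pca_two_variables(x, y):
    """PCA of two variables via the closed-form eigen-decomposition
    of their 2x2 covariance matrix. Returns (eigenvalues, axis_1, axis_2)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [v - mean_x for v in x]          # centred scores of species a
    dy = [v - mean_y for v in y]          # centred scores of species b
    sxx = sum(a * a for a in dx) / (n - 1)
    syy = sum(b * b for b in dy) / (n - 1)
    sxy = sum(a * b for a, b in zip(dx, dy)) / (n - 1)
    centre = (sxx + syy) / 2
    spread = math.hypot((sxx - syy) / 2, sxy)
    eigenvalues = (centre + spread, centre - spread)
    # Direction of the first component axis in the species plane.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    c, s = math.cos(theta), math.sin(theta)
    axis_1 = [a * c + b * s for a, b in zip(dx, dy)]    # releve scores, axis 1
    axis_2 = [-a * s + b * c for a, b in zip(dx, dy)]   # releve scores, axis 2
    return eigenvalues, axis_1, axis_2

# Hypothetical abundances of two species in five releves.
species_a = [1.0, 2.0, 3.0, 4.0, 5.0]
species_b = [2.0, 1.0, 4.0, 3.0, 5.0]

eigenvalues, axis_1, axis_2 = pca_two_variables(species_a, species_b)
share = eigenvalues[0] / sum(eigenvalues)
print(f"variance accounted for by axis 1: {share:.0%}")
print([round(score, 2) for score in axis_1])
```

In this hypothetical case the first axis accounts for most of the variation, so the releves could be ordered along a single new descriptor in place of the two species.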

2.3.5 Classification

This is the process which divides a sample into groups. The algorithms which classify objects are often referred to as cluster analyses. The strategies may be agglomerative or divisive. In many algorithms, clustering is based on resemblance measures.

Agglomerative clustering starts with finding the pair of most similar data points (individuals). This is then united with others in subsequent steps to form increasingly larger clusters. The results can be presented in the form of a dendrogram:

[Dendrogram: vertical axis: distance level for fusion; horizontal axis: releves.]

Releves 1 and 3 are more similar than 1 and 2 or 2 and 3. The dendrogram gives no information about whether releve 1 or 3 is closest to releve 2.

In divisive clustering, the sample of data points is subdivided successively into groups to optimize (or come closer to optimizing) given criteria. In the example above, a divisive clustering algorithm would most likely assign releves 1 and 3 to one group and releve 2 to a second group. Generally, agglomerative and divisive clustering need not lead to the same result. Classifications represent potentially efficient methods for summarizing data sets within a group structure. The resulting groups can be subjected to further analysis by using suitable methods to reveal finer trends or structures.
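The agglomerative strategy can be sketched as follows. Single linkage (the distance between two clusters is the smallest distance between their members) is chosen here only because it is short to write down; the text does not prescribe a fusion rule. The distances are hypothetical and were picked so that releves 1 and 3 fuse first, as in the dendrogram example.

```python
def single_linkage(labels, dist):
    """Agglomerative clustering with single linkage.
    `dist` maps frozenset({i, j}) to the distance between individuals i and j.
    Returns the fusions as (members_a, members_b, fusion_level)."""
    clusters = [frozenset([label]) for label in labels]
    merges = []

    def cluster_distance(a, b):
        # Smallest distance between any member of a and any member of b.
        return min(dist[frozenset((i, j))] for i in a for j in b)

    while len(clusters) > 1:
        pairs = [(cluster_distance(a, b), a, b)
                 for index, a in enumerate(clusters)
                 for b in clusters[index + 1:]]
        level, a, b = min(pairs, key=lambda item: item[0])
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        merges.append((sorted(a), sorted(b), level))
    return merges

# Hypothetical distances between three releves.
distances = {
    frozenset((1, 2)): 3.0,
    frozenset((1, 3)): 1.0,
    frozenset((2, 3)): 2.5,
}

for members_a, members_b, level in single_linkage([1, 2, 3], distances):
    print(members_a, "+", members_b, "at level", level)
# [1] + [3] at level 1.0
# [2] + [1, 3] at level 2.5
```

The list of fusions and levels is exactly the information a dendrogram displays. It records that releve 2 joins the group (1, 3) at level 2.5, but not which of the two members it is closest to.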

2.3.6 Identification

This is the process of finding the most likely parent group for an individual. Unlike classification, identification requires the existence of established reference groups, e.g.,

   Parent populations:                               p1   p2
   Candidates for joining the parent populations:    c1   c2

[Scatter diagram: parent populations p1 and p2 and candidates c1 and c2 plotted on axis 1 (horizontal) and axis 2 (vertical).]

An identification process would assign c2 to p1 and c1 to p2. Note that a cluster analysis would most likely lead to a first group with members c1 and c2.

Identification can serve useful purposes in synsystematics, for instance, by reallocating new releves to categories in an already existing system. Identification algorithms also represent a tool for refining already existing classifications.
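One simple way to picture identification is to assign each candidate to the parent population whose centroid is nearest. This nearest-centroid rule is an assumption made here for illustration; the text does not specify the identification algorithm. All coordinates are hypothetical and were chosen so that the two candidates lie close to each other but on opposite sides of the midpoint between the groups.

```python
import math

def centroid(points):
    """Mean position of a group of points."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(len(points[0])))

def identify(candidate, groups):
    """Name of the group whose centroid is nearest to the candidate."""
    centres = {name: centroid(members) for name, members in groups.items()}
    return min(centres, key=lambda name: math.dist(candidate, centres[name]))

# Hypothetical ordination coordinates (axis 1, axis 2).
parent_populations = {
    "p1": [(0.5, 1.0), (1.5, 1.0), (1.0, 0.5), (1.0, 1.5)],
    "p2": [(6.5, 7.0), (7.5, 7.0), (7.0, 6.5), (7.0, 7.5)],
}
candidates = {"c1": (4.3, 4.3), "c2": (3.7, 3.7)}

for name, point in candidates.items():
    print(name, "->", identify(point, parent_populations))
# c1 -> p2
# c2 -> p1

# The candidates are closer to each other than to either parent population,
# which is why a cluster analysis would tend to fuse them first.
print(round(math.dist(candidates["c1"], candidates["c2"]), 2))  # 0.85
```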

2.3.7 Ranking

The purpose of ranking is to order a list of species according to their potential in accounting for variation within a given set of releves. Ranking can be based on the properties of single vectors, such as species frequency:

             Releves        Frequency   Rank order
             1    2    3
Species 1    1    3    5    3           1
Species 2    0    6    0    1           3
Species 3    1    0    1    2           2

A more powerful algorithm would rank species not only in terms of individual vectors but of the correlation of vectors, e.g.,

             Releves        Rank order
             1    2    3

Species 1 1 0 1

Species 2 1 0 3

Species 3 0 2

Since species 1 and 2 are functionally correlated, if species 1 is declared to have rank order 1, species 2 has to be ranked last. Since species 3 accounts for less of the correlations than 1, it is given rank 2.

Ranking allows the reduction of the species number. The algorithms minimize information loss within the constraints of given criteria. Some ranking algorithms can detect differentiating species. These can be used in the keys to vegetation types. Ranking may even be used to determine the sample size necessary to describe a given survey area. To do this, it is repeated on increased numbers of releves until the rank order of the species is stabilized.
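Ranking by a single-vector property such as frequency, followed by a reduction of the species list, can be sketched as below. The scores are illustrative assumptions chosen to give the frequencies 3, 1 and 2 of the first ranking example; the function names are introduced here. Ranking on the correlation of vectors, as in the second example, needs a more elaborate algorithm and is not attempted.

```python
def frequency(scores):
    """Number of releves in which the species is present."""
    return sum(1 for score in scores if score > 0)

def rank_by_frequency(table):
    """Rank 1 goes to the most frequent species; ties keep input order."""
    ordered = sorted(table, key=lambda name: frequency(table[name]), reverse=True)
    return {name: position for position, name in enumerate(ordered, start=1)}

def reduce_species(table, keep):
    """Retain only the `keep` highest-ranked species."""
    ranks = rank_by_frequency(table)
    return {name: scores for name, scores in table.items() if ranks[name] <= keep}

# Illustrative scores of three species in three releves (assumed values).
table = {
    "Species 1": [1, 3, 5],
    "Species 2": [0, 6, 0],
    "Species 3": [1, 0, 1],
}

print(rank_by_frequency(table))
# {'Species 1': 1, 'Species 3': 2, 'Species 2': 3}
print(list(reduce_species(table, keep=2)))
# ['Species 1', 'Species 3']
```

To judge sample size in the way described above, `rank_by_frequency` could be re-run on a growing number of releve columns until the returned order no longer changes.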

2.3.8 Evaluation of clustering results

To measure the success of classifications, tests are performed on hypotheses concerned with individual properties of the structured tables. e.g.,

Solution 1

   Species    Releves
              1    2    3    4
   1          1    1    0    0
   2          0    0    1    1
   3          0    0    1    0

Solution 2

   Species    Releves
              1    2    3    4
   1          1    1    0    0
   2          0    0    1    1
   3          0    0    1    0

If the objective of a clustering process is to form species groups which are as specific to the releve groups as possible, then the group structure in solution 2 is better than that in solution 1. Statistical tests are available to determine which clustering algorithm serves the objective best. In some cases, the test procedures are used in a purely deterministic context, without the restrictions which have to be observed in statistics.

Bibliography to Chapter 2

Green, R.H., 1979: Sampling Design and Statistical Methods for Environmental Biologists. 257 p., New York, Wiley.

Mueller-Dombois, D., and Ellenberg, H., 1974: Aims and Methods of Vegetation Ecology. 547 p., New York, London, Sydney, Toronto, Wiley.

Orlóci, L., 1978: Multivariate Analysis in Vegetation Research. 2nd ed., 451 p., The Hague, Junk.

Whittaker, R.H., 1967: Gradient Analysis of Vegetation. Biol. Rev. 42: 207-264.

Whittaker, R.H., 1973: Ordination and Classification of Communities. Handbook of Vegetation Science V. 750 p., The Hague, Junk.
