
2 Methodical Background

2.7 Considerations for the Assessment of Classification Accuracy

Thematic maps resulting from the classification of remotely sensed data should be accompanied by information about their accuracy, so that the user can estimate their reliability and applicability for the purpose at hand. Quantitative accuracy measures are also necessary to compare the relative accuracies resulting from different data processing methods.

Classification accuracy is assessed by measuring the agreement between the classified image and reference data which are assumed to be correct (Campbell 1996). Site-specific reference data (‘truth’) may consist of another map, if a high-accuracy map with a compatible classification of the area already exists, but usually the reference data consist of a number of sample points (or areas). The class of these testing points is identified on the ground, or through manual interpretation of high or very high resolution remotely sensed data. Neither method guarantees that the reference data are 100 % correct – even if the classes of the sample points are determined in the field, there may still be errors due to a time lag between the field work and the acquisition of the classified image, or due to a lack of positional accuracy. This means that the reference data usually represent only a ‘relative truth’, and errors in the reference data, or in the registration between reference and classified image, will negatively influence the accuracy values which are calculated for satellite data classifications (Langford & Bell 1997).

The degree of agreement between a classified image and reference data can be measured using several methods. In remote sensing studies, classification accuracy is usually determined using measures based on an error matrix (also called confusion matrix or contingency table). For n classes, this is an n × n array where the rows represent the classified data while the columns represent the reference data, or vice versa. A number of techniques can be used to derive statistical measures of accuracy from the error matrix. The simplest of these accuracy measures is the overall classification accuracy (percentage correct), which is computed by dividing the sum of the elements in the main diagonal of the matrix (representing the number of samples with complete agreement) by the total number of samples (Congalton 1991). Accuracies for individual classes can also be computed from the error matrix. For each class, two accuracy values (producer’s accuracy and user’s accuracy) have to be distinguished. The producer’s accuracy indicates the percentage of reference points of a certain class which have been assigned to the same class in the classified image; its complement reflects the errors of omission. The user’s accuracy indicates the percentage of pixels assigned to a certain class in the classification which agree with the reference data; its complement reflects the errors of commission.
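As a minimal sketch of these calculations, the following Python fragment derives the overall, producer’s and user’s accuracies from a small error matrix with invented counts (rows as classified data, columns as reference data, following the convention above):

import numpy as np

# Hypothetical 3 x 3 error matrix; rows = classified data, columns = reference data.
# All counts are invented purely for illustration.
cm = np.array([[50,  3,  2],
               [ 4, 40,  6],
               [ 1,  7, 30]])

# Overall accuracy: sum of the main diagonal divided by the total number of samples.
overall = np.trace(cm) / cm.sum()

# Producer's accuracy: correct samples of each class divided by the reference
# totals (column sums); its complement corresponds to the errors of omission.
producers = np.diag(cm) / cm.sum(axis=0)

# User's accuracy: correct samples of each class divided by the classified
# totals (row sums); its complement corresponds to the errors of commission.
users = np.diag(cm) / cm.sum(axis=1)

print(round(overall, 3), np.round(producers, 3), np.round(users, 3))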

The overall accuracy value (as described above) contains a proportion that is attributable to chance agreement between the two data sets. Cohen (1960) developed the Kappa coefficient as a measure of agreement which is adjusted for chance agreement. The expected chance agreement is calculated by adding the products of the row and column totals for each cell of the main diagonal of the error matrix, thus indirectly incorporating off-diagonal elements in the calculation (Congalton 1991). The Kappa index of agreement (KIA) is not as widespread as the overall accuracy (percentage correct) as a measure of classification accuracy in remote sensing studies, but it is also used by many authors (e.g. Dikshit & Roy 1996, Cingolani et al. 2004). Næsset (1996) describes the use of a weighted Kappa coefficient to evaluate the classification accuracy if not all errors are equally serious, i.e. if some informational classes are more closely related to each other than others.
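Written out (a standard formulation, not quoted from the sources above), with n_{ii} the diagonal elements, n_{i+} and n_{+i} the row and column totals of the error matrix, and N the total number of samples, the observed agreement, the expected chance agreement and the Kappa estimate are:

\hat{p}_o = \frac{1}{N} \sum_{i=1}^{n} n_{ii}, \qquad \hat{p}_e = \frac{1}{N^2} \sum_{i=1}^{n} n_{i+} \, n_{+i}, \qquad \hat{\kappa} = \frac{\hat{p}_o - \hat{p}_e}{1 - \hat{p}_e}

A Kappa value of 1 indicates perfect agreement, while a value of 0 indicates agreement no better than chance.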

The measured accuracy of a classification depends on many factors besides the classification method. As a classification becomes more detailed (containing more informational classes which are defined more precisely), the potential information content of the resulting map increases, but at the same time the opportunity for classification errors also increases (Campbell 1996). It is usually possible to increase the overall classification accuracy by decreasing the detail or precision of the classification, e.g. by mapping just one general forest class instead of differentiating between several forest types (Laba et al. 2002). However, this may result in highly accurate maps which do not provide the information necessary for a certain purpose (e.g. monitoring the distribution of a specific ecosystem type like cloud forest).

Mixed pixels and other edge effects can be a major source of classification errors, especially in a heterogeneous landscape. Many authors exclude areas near class boundaries from the accuracy assessment, with the argument that errors in these areas could be due to misregistration between the classified image and reference data. Some select testing pixels only in areas that clearly belong to the typical ‘core’ of classes, avoiding mixed pixels in border areas or areas of transition between defined classes (Hill 1999, Schlerf et al. 2003, Langford & Bell 1997). In some cases (where there are no independent testing points because of economic or temporal constraints) the same pixels that were already used to train the classifier are used for testing the accuracy (Strahler 1980). Such samples are obviously biased. All of these tactics lead to exaggerated accuracy values which do not represent the real classification accuracy for the whole image.

Both the sample size (number of samples) and the sampling method have to be considered for the creation of an appropriate testing set. The sample size determines the confidence intervals of the accuracy estimates (Brogaard & Ólafsdóttir 1997, Næsset 1996). If confident estimates of the classification accuracies for individual classes are required, the sample size needs to be larger than for an equally confident estimate of the overall accuracy. Larger samples than for a simple calculation of the overall accuracy are also needed in order to adequately represent the confusion between all pairs of classes in the error matrix (Congalton 1991). The sample size is restricted by the expense of determining the truth values for a large number of testing points (Congalton 1991, Langford & Bell 1997).
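One common rule of thumb for this trade-off (a standard binomial approximation, not taken from the sources cited here) relates the sample size n to the half-width d of the confidence interval around an expected proportion p at the confidence level corresponding to the standard normal quantile z:

n = \frac{z^2 \, p \, (1 - p)}{d^2}

For example, estimating an expected overall accuracy of p = 0.85 to within ±5 percentage points at the 95 % level (z = 1.96) requires n = 1.96² · 0.85 · 0.15 / 0.05² ≈ 196 testing samples; a comparably confident per-class estimate needs a sample of roughly this size for each class.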

Sampling should be random in order to meet the statistical requirements of the Kappa analysis, among other statistical techniques that are based on the error matrix (Congalton 1991). However, simple random sampling may mean that classes of small extent fail to be represented in the sample, so stratified random sampling is frequently used in order to ensure that samples from all classes are included in the accuracy assessment (e.g. Helmer et al. 2002, Qiu & Jensen 2004). Stratified random sampling is also recommended by Congalton (1991). The prerequisite for a stratification of the sample is an existing map depicting the distribution of the class areas, so this is usually done after classification. Random sampling may lead to problems if the resulting samples are located in inaccessible areas where the true class cannot be determined. Because of this, some authors use cluster sampling, involving the selection of samples in previously known areas. Sampling of this type also helps to reduce the field access cost per sample. However, this will lead to a biased testing sample and an overestimation of the classification accuracy (Arora & Mathur 2001). Cluster sampling, and also systematic sampling, can produce spatially autocorrelated data. These sampling methods do not ensure that every individual in the population has an equal chance of being included in the sample, thus violating the requirements for inferential statistics (Næsset 1996, Brogaard & Ólafsdóttir 1997, Arora & Mathur 2001).
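A minimal sketch of post-classification stratified random sampling is given below; the class map, class sizes and per-class sample size are hypothetical:

import numpy as np

rng = np.random.default_rng(seed=42)

def stratified_sample(class_map, n_per_class):
    """Draw n_per_class random pixel positions from each class of a
    classified map (2-D array of integer class labels)."""
    samples = {}
    for c in np.unique(class_map):
        rows, cols = np.nonzero(class_map == c)
        # Sample without replacement; very small classes contribute all their pixels.
        k = min(n_per_class, rows.size)
        idx = rng.choice(rows.size, size=k, replace=False)
        samples[int(c)] = list(zip(rows[idx], cols[idx]))
    return samples

# Hypothetical 100 x 100 classification with three classes of very unequal extent.
class_map = np.zeros((100, 100), dtype=int)
class_map[:, 60:] = 1
class_map[95:, 95:] = 2                          # a rare class of only 25 pixels
points = stratified_sample(class_map, n_per_class=50)
print({c: len(p) for c, p in points.items()})    # {0: 50, 1: 50, 2: 25}

Simple random sampling over the whole image would, on average, place only a handful of points in the rare class; the stratification guarantees it is represented.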

Even if the same testing sample is used, the measured accuracy still depends on whether the interpreters assigning the ‘truth’ values to the sampling locations do this independently (blindly) or whether they know the classification result to be assessed and can be influenced by it, giving the benefit of the doubt in ambiguous cases. The latter method can lead to a strong increase in the overall accuracy value (Langford & Bell 1997).

A per-pixel accuracy assessment with an error matrix is also possible for object-based classification results, and is used by some authors (e.g. Wang et al. 2004c). However, in this case a single large misclassified object can have a disproportionate impact on the overall accuracy (de Kok et al. 2000). This is why in some cases reference objects are used instead of reference pixels, but this may again lead to the problem that not enough samples can be found (Lobo 1997). Per-segment accuracy assessment also does not address the question of whether object borders match class borders – an object primitive covering two land cover classes could be assigned to the majority class in both classification and reference, so no error for the misclassified part of the segment would be noted.
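To illustrate the pixel-counting effect described above, the following fragment (with invented segment sizes and labels) contrasts per-pixel and per-object overall accuracy when a single large segment is misclassified:

# Hypothetical segments: (pixel_count, classified_label, reference_label).
# Nine small segments are correct; one very large segment is misclassified.
segments = [(20, "forest", "forest")] * 9 + [(820, "forest", "pasture")]

# Per-pixel overall accuracy: the one large wrong object dominates the result.
correct_px = sum(n for n, c, r in segments if c == r)
total_px = sum(n for n, c, r in segments)
print(correct_px / total_px)        # 180 / 1000 = 0.18

# Per-object accuracy: each segment counts once, regardless of its size.
correct_obj = sum(1 for n, c, r in segments if c == r)
print(correct_obj / len(segments))  # 9 / 10 = 0.90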

There is not yet any generally accepted standard for automatic accuracy assessment of object-oriented classifications, and the accuracy is sometimes evaluated only visually (de Kok et al. 2000).

The accuracy assessment of object-based classifications in the eCognition environment can be conducted on a pixel basis or on a segment basis (Baatz et al. 2002).

The accuracy assessment methods discussed above, which are based on the error matrix, assume a hard classification with mutually exclusive classes, where every classification decision is either completely correct or completely incorrect. As discussed at the end of chapter 2.6, this rigorous definition of accuracy is often not appropriate when land cover (natural vegetation) types are classified with remotely sensed data. Several approaches have been suggested to ‘soften’ accuracy assessments. When interpretations of aerial photographs are used to produce the reference data for a Landsat-based classification, some authors take account of the mixed pixel effect, and of uncertainties in locating each Landsat pixel on the aerial photographs, by recording several classes to be counted as ‘correct’ for each testing sample (Strahler 1980, Helmer et al. 2002). Gopal & Woodcock (1994, quoted in Woodcock & Gopal 2000) published a fuzzy accuracy assessment method for a hard classification with fuzzy ground truth. They use five linguistic grades to gradually differentiate between ‘absolutely right’ and ‘absolutely wrong’. So for one testing point, the ground truth could for example pronounce one class ‘absolutely right’, and several other classes ‘reasonable or acceptable’. This makes it possible to generate a number of accuracy measures beyond the ‘hard’ accuracy. For instance, in addition to the proportion of ‘best choice’ assignments, the proportion of at least ‘acceptable’ class assignments (the sum of ‘absolutely right’, ‘good’ and ‘reasonable or acceptable’ answers) can be calculated (Woodcock & Gopal 2000). This technique was adopted by Laba et al. (2002) and Ma et al. (2001). Ricotta (2004) proposes a method to evaluate the accuracy of fuzzy maps, where a fuzzy classification output is compared to the true fuzzy land cover. The weighted Kappa described by Næsset (1996) is also a way to ‘soften’ the accuracy assessment by taking account of the fact that not every classification error is equally serious.
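A minimal sketch of the two measures just mentioned, using invented grades on the five-point scale (5 = ‘absolutely right’, 4 = ‘good’, 3 = ‘reasonable or acceptable’, down to 1 = ‘absolutely wrong’); the class names and grades are hypothetical:

# Each test site records the class given by the map and the linguistic grades
# the interpreter assigned to every candidate class. All values are invented.
sites = [
    {"map": "cloud_forest", "grades": {"cloud_forest": 5, "broadleaf": 3}},
    {"map": "broadleaf",    "grades": {"cloud_forest": 4, "broadleaf": 3}},
    {"map": "pasture",      "grades": {"pasture": 2, "shrub": 5}},
]

# 'Best choice': the mapped class received the highest grade at the site.
best = sum(1 for s in sites
           if s["grades"].get(s["map"], 1) == max(s["grades"].values()))

# 'At least acceptable': the mapped class was graded 3 or better.
acceptable = sum(1 for s in sites if s["grades"].get(s["map"], 1) >= 3)

print(best / len(sites))        # 1/3, the 'hard' best-choice accuracy
print(acceptable / len(sites))  # 2/3, the softened accuracy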

On the whole, there are many variations in the methods which are used to measure classification accuracy. Some methods are more rigorous, while others can lead to an overestimation of accuracy. So even if widely used accuracy measures, like overall accuracy or Kappa, are reported, they are not directly comparable without information about how they were derived.

3 Forest Resources and Land Cover in the Dominican Republic, with Special Regard to the