
2 Methodical Background

2.6 Classification Issues

The partitioning of local observations into classes is a prerequisite for the creation of land cover maps. Digital image classification is the process of allocating image primitives (pixels or groups of pixels) to classes, usually according to their values in several spectral bands (Campbell 1996).

Additional (non-spectral) data layers or contextual information can also be used in the classification process. Unsupervised classification methods are used to identify natural homogeneous groups (clusters) within the data. Only afterwards are informational labels assigned to the resulting groups or clusters. These methods have the disadvantage that the interpreter cannot influence the identity and extent of the resulting categories, which may then not correspond to the informational classes of interest. Supervised classification, by contrast, makes use of ground knowledge introduced through training areas (samples of known class identity) and extends it to the entire image with statistical methods (Lobo 1997). Each image primitive is characterised by n observations (the values in n data channels). The training samples are vectors in an n-dimensional space (the feature space). A supervised classifier uses the distribution of the training samples for each class to estimate density functions in the feature space and to divide the space into class regions (Fukunaga 1990).

A basic step in supervised image classification and mapping is the design of a realistic classification scheme, i.e. the definition of discrete informational land cover units which are separable with the available data (Cingolani et al. 2004). Informational classes determined by the interpreter or required by the prospective map user may not correspond to consistent classes in the data. It may be necessary to subdivide an informational class into several subclasses if it corresponds to separate clusters in feature space, or to merge several informational classes if they are not separable with the data used.

Feature selection and class separability

A second basic step in image classification and mapping is the selection of an appropriate feature combination from the available data set (which may consist of spectral bands, textural features and other data layers) to be used for the separation of the classes. This step is not always necessary if only a small number of channels is available for classification. But when a number of spectral bands and texture features, and perhaps also ancillary data such as topographical variables, are integrated as additional channels, the resulting feature vector can become too large for the classification algorithm to deal with. The data stack may also contain redundant or unhelpful data layers, making the classification unnecessarily slow. A reduced number of data layers may still contain sufficient information for the classification. Discarding some of the channels can even lead to improved classification results if the classifier used is impaired by high dimensionality data (see the discussion of maximum likelihood classification below) or by the statistical properties of some of the data layers (Peddle 1993), or if the discarded data do not contain information helpful for the separation of the classes.

A simple method to reduce the number of channels before classification is to look at the correlation between pairs of data layers and, in the case of high correlation coefficients, to discard channels which are regarded as redundant. Liu et al. (2002b) use this method to select data layers with small correlation coefficients out of a number of spectral and topographical features. The large number of texture features which can be generated from the spectral data (see chapter 2.3) in particular requires the choice of a subset of features. One way to approach this task is correlation analysis. Gong et al. (1992), having generated ten GLCM texture features with in part very high correlation coefficients, choose three relatively weakly correlated ones for later combination with other features and classification. Correlation analysis is also employed by Narasimha Rao et al. (2002) as a first step during the selection of texture features.
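
As an illustration, the following sketch shows one possible way to screen a layer stack for highly correlated channels. It is a minimal, generic example and not the procedure of the cited studies; the array layout, the threshold of 0.9 and the greedy selection order are assumptions.

```python
import numpy as np

def prune_correlated_channels(stack, threshold=0.9):
    """Greedy channel selection: keep a channel only if its absolute
    correlation with every channel kept so far is below the threshold.

    stack     : array of shape (n_pixels, n_channels), one column per data layer
    threshold : assumed cut-off for 'high' correlation (illustrative value)
    """
    corr = np.abs(np.corrcoef(stack, rowvar=False))  # pairwise |r| between channels
    kept = []
    for ch in range(stack.shape[1]):
        if all(corr[ch, k] < threshold for k in kept):
            kept.append(ch)
    return kept  # indices of the retained, relatively uncorrelated channels

# Example with a random stack standing in for 10 spectral/texture layers
layers = np.random.rand(5000, 10)
print(prune_correlated_channels(layers))
```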

Simply finding a set of channels which are not closely correlated with each other is not always a sufficient criterion for feature selection. If the objective is to find an optimal subset of m out of n features (channels) for the discrimination of a given set of classes, one needs to apply feature selection methods based on separability measures, e.g. class divergence. The task is then to determine the set of m channels which maximises the divergence between the classes for this reduced number of channels. The pairwise class divergence is a measure of statistical separation between two classes as characterised by the means and covariance matrices of the class training samples (Hill & Foody 1994). The class divergence can be calculated relatively easily for Gaussian distributions (Landgrebe 2003). With increasing statistical separability between two classes, the divergence increases without bound. Classification accuracy, by contrast, cannot exceed 100 %. The transformed divergence was developed as a divergence measure which also reaches an asymptote (Singh 1987). It is based on an exponential transformation of the divergence and limits the values to a range of 0 to 2, with larger increments for differences between small divergence values. A divergence measure for not only two classes but all pairs of classes can be calculated using averaging methods. The average interclass divergence is simply the average of the divergence over all pairs of classes. This leads to a dominance of class pairs with very high divergence in the average value. Using the average transformed divergence instead reduces the dominance of the class pairs with the highest divergence and helps to select the features which are best at differentiating even between relatively similar classes.
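
For reference, with Gaussian class signatures (mean vectors $\mu_i$ and covariance matrices $\Sigma_i$ estimated from the training samples), the pairwise divergence and the transformed divergence are commonly written as follows; the notation here is generic and not taken from the cited sources:

$$ D_{ij} = \tfrac{1}{2}\,\mathrm{tr}\!\left[(\Sigma_i - \Sigma_j)(\Sigma_j^{-1} - \Sigma_i^{-1})\right] + \tfrac{1}{2}\,\mathrm{tr}\!\left[(\Sigma_i^{-1} + \Sigma_j^{-1})(\mu_i - \mu_j)(\mu_i - \mu_j)^{T}\right], \qquad TD_{ij} = 2\left(1 - e^{-D_{ij}/8}\right). $$

The exponential form bounds $TD_{ij}$ at 2, and the average transformed divergence is simply the mean of these pairwise values over all $c(c-1)/2$ class pairs.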

Dikshit & Roy (1996) use the average transformed divergence as a measure of the relative capability of feature sets with and without textural data for the discrimination of class pairs. Chen et al. (2004) use transformed divergence for the selection of texture features. Singh (1987) and Hill & Foody (1994) use matrices of pairwise transformed divergence to study the separability of tropical forest classes with the help of Landsat data.

Several authors have noted that there is no close correspondence between the average transformed divergence for a feature set and the accuracy achieved when they classified their image using this feature set (Chen et al. 2004, Gong et al. 1992). One of the reasons for that is the fact that separability measures are usually calculated from the training samples only. Therefore they cannot be expected to predict the exact classification accuracy for the whole image, if the training samples are not exactly representative of the whole image including areas of potential edge effects. Also, there is generally no one-to-one relationship between separability measures and classification accuracy. Instead, in the best case, a given value of a separability measure can determine a certain range of possible classification accuracies for the samples under examination (Landgrebe 2003).

Separability indices are thus not linear predictors of classification results, but they do give an indication of the potential of a certain set of data layers to discriminate between the classes as represented by their class signatures.

Another criterion for class separability is the Bhattacharyya distance. Like the divergence and transformed divergence, the Bhattacharyya distance in its common form is calculated from the class means and covariance matrices assuming normal distributions (Verbeke et al. 2004). It is possible to derive error bounds (upper and lower bounds on the probability of correct classification) for the Bhattacharyya distance and the related Jeffries-Matusita distance measure (Landgrebe 2003, Fukunaga 1990). The Bhattacharyya distance is recommended by Landgrebe (2003) as having a close relationship with the probability of correct classification. Wang et al. (2004c) use the Bhattacharyya distance to compare the class separability at the pixel level with the separability at the object level after segmentation with different scale parameters (using the mean spectral values of the resulting object in the separability calculation).
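
Under the same Gaussian assumption, the Bhattacharyya distance between classes $i$ and $j$ and the related Jeffries-Matusita distance are usually expressed as shown below (generic notation, not taken from the cited sources; the Jeffries-Matusita distance is also frequently reported in its squared form, which is bounded by 2):

$$ B_{ij} = \tfrac{1}{8}\,(\mu_i - \mu_j)^{T}\left[\tfrac{\Sigma_i + \Sigma_j}{2}\right]^{-1}(\mu_i - \mu_j) + \tfrac{1}{2}\,\ln\!\left(\frac{\left|\tfrac{1}{2}(\Sigma_i + \Sigma_j)\right|}{\sqrt{|\Sigma_i|\,|\Sigma_j|}}\right), \qquad JM_{ij} = \sqrt{2\left(1 - e^{-B_{ij}}\right)}. $$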

Comparing several separability measures, Gong et al. (1992) found that Jeffries-Matusita and average transformed divergence performed similarly in the feature selection; and Treitz and Howarth (2000b) found that in their case study, the Jeffries-Matusita distance and transformed divergence were better indicators for classification success than Bhattacharyya distance and simple divergence.

Benediktsson et al. (1990) do not apply the above-mentioned separability measures for a case of multisource data classification including topographic data, because the estimation of separability under a Gaussian assumption requires the computation of covariance matrices, and class-specific topographic data sometimes show so little variation that a covariance matrix cannot be computed. Without the Gaussian assumption, separability measures are difficult to compute.
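
The underlying problem can be illustrated with a simple check of the class covariance matrix. This is only a sketch; the function name and the tolerance value are assumptions for illustration and do not come from the cited study.

```python
import numpy as np

def covariance_is_usable(samples, tol=1e-10):
    """Return False if the class covariance matrix of the training samples
    is numerically singular, e.g. because a channel (such as a class-specific
    topographic layer) shows almost no variation within the class.

    samples : array of shape (n_samples, n_channels) for one class
    """
    cov = np.cov(samples, rowvar=False)
    # A determinant near zero means the matrix cannot be reliably inverted,
    # so Gaussian separability measures (divergence, Bhattacharyya) break down.
    return np.linalg.det(cov) > tol
```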

Maximum likelihood classification

Maximum likelihood classification (MLC) is the most widespread supervised classification technique in remote sensing. It has been widely used for the digital classification of satellite imagery since the 1970s (Strahler 1980). The term ‘maximum likelihood classifier’ is used here in the sense of the quadratic (Gaussian) classifier only. This is a statistical classifier which calculates the probability that an observation (pixel vector) belongs to a certain class according to a multivariate normal (Gaussian) density function. This involves a calculation of the distance between the observation and the mean vector of the class in feature space, corrected for the covariance of the class (modified Mahalanobis distance) (Liu et al. 2002b, Strahler 1980). The training samples are used to estimate the mean vector and covariance matrix for each class. The probability functions are calculated for all classes and the pixel is assigned to the class for which it has the maximum likelihood of membership. The region of a class in feature space can be limited by a so-called ‘Gaussian threshold’, which is the radius (in standard deviation units) of a hyperellipsoid around the mean of the class in feature space (PCI 2001). In this case, observations which do not lie within the hyperellipsoid of any class are assigned to a ‘null class’.
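
In the usual formulation (generic notation, not specific to any of the cited implementations), a pixel vector $x$ is assigned to the class $\omega_i$ that maximises the quadratic discriminant

$$ g_i(x) = \ln P(\omega_i) - \tfrac{1}{2}\ln\left|\Sigma_i\right| - \tfrac{1}{2}\,(x - \mu_i)^{T}\Sigma_i^{-1}(x - \mu_i), $$

where $P(\omega_i)$ is the prior probability of the class (often assumed equal for all classes) and the last term is half the squared Mahalanobis distance between $x$ and the class mean; a Gaussian threshold then rejects observations whose Mahalanobis distance to every class mean exceeds the chosen number of standard deviations.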

MLC assumes that the training data statistics are Gaussian in nature. If it is true that the class probability density functions are Gaussian, MLC is the optimal classifier which minimises the overall probability of error (Liu et al. 2002b). But experience shows that the maximum likelihood classifier is rather robust (Gong et al. 1992). It can tolerate moderate violations of its underlying assumptions and still yield classification results which may be superior to those of classifiers which do not employ the Gaussian model. Benediktsson et al. (1990) found that even for data which were not all normally distributed, the MLC yielded a better classification result than a minimum Euclidean distance classifier.

For a statistical classifier using the multivariate Gaussian model, the computing cost grows as the square of the number of features (channels used) (Benediktsson et al. 1990). The required number of training samples to estimate the class statistics for a Gaussian (quadratic) classifier is also related to the square of the number of features (Fukunaga 1990). This gives rise to the ‘Hughes effect’: for a finite number of training samples, the classification accuracy rises at first with the number of features (or measurement complexity), but then it reaches a maximum and starts to decrease when more features are added. While the class separability is always higher for higher dimensionality data (more measurements), the accuracy of the statistics estimation with a given number of training samples decreases when the dimensionality becomes too high, and at some point this leads to a lower classification accuracy in spite of better theoretical class separability (Landgrebe 2003).

Hay et al. (1996) used a maximum of seven channels for the MLC because they feared a reduction of classification accuracy for higher dimension data sets. Peddle (1993) found that MLC was not a good classification technique for multi-source data sets (consisting of spectral, textural and DEM extracted data). The accuracies of classifications including the texture and / or the DEM derived data were lower than those of classifications involving only the three spectral channels, except for the case when only the elevation was added as a fourth channel. The usability of MLC is limited for data sets with many channels, with data which are not normally distributed or with multi-source data with differing scales of measurement (Arora & Mathur 2001, Peddle 1993).

The high spatial resolution satellite data which are available today tend to exhibit increased within-class variability compared to the medium resolution satellite data which are traditionally classified with MLC. This means that the volume of feature space occupied by each class is expanded, leading to increased possibilities of class overlap in feature space (Qiu & Jensen 2004). Statistical approaches which work satisfactorily for relatively low spatial resolution imagery with a limited number of channels may not be suitable for the high resolution and / or high dimensionality data sets which are increasingly used in digital land cover classifications. Moreover, ancillary data included in a classification are in most cases unlikely to conform to the assumptions of MLC.

Alternative (non-parametric) classification methods

Non-parametric classifiers are distribution-free, i.e. they do not make any assumptions about the mathematical form of the density functions (Fukunaga 1990).

The k-nearest-neighbour classifier (k-NN) is a simple and robust non-parametric classification method (Debeir et al. 2002). It works by extending a local region around the vector of an unclassified image primitive until its kth nearest neighbour among the training sample vectors is found (according to the Euclidean distance in feature space). The class labels of these k nearest neighbours are determined and the unclassified vector is assigned to the majority class of its k nearest neighbours (Landgrebe 2003). The k-NN classifier is suitable for high dimensionality data.

However, the number of samples which are needed becomes very large for a large number of channels. Fukunaga (1990) shows that for normal distributions and k=5, 150 samples are needed for 8 channels, but for 16 channels the number of samples needed is already 34,000. Larger values of k are recommended for data with high variance (PCI 2001). Debeir et al. (2002) use a 5-NN classifier for the classification of a large data set containing spectral, textural and ancillary data. Tuominen & Pekkarinen (2005) use a k-NN classifier with k=5 for the classification of a large data set derived from orthophotos (three spectral channels, channel ratios, texture features), aiming for an estimation of forest attributes. They find that for a combination of more than 10 features, the accuracy is not significantly increased any more. Collins et al. (2004) successfully classify a stacked vector data set consisting of Landsat multispectral channels, topographic data (elevation, slope and aspect) and modelled environmental gradients using a k-NN classifier, also with k=5.
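
A k-NN classification of this kind can be sketched in a few lines. The scikit-learn call, the variable names and the input files below are assumptions for illustration only; they are not the software or data used in the cited studies.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical inputs:
# X_train    : training sample vectors, shape (n_samples, n_channels)
# y_train    : class labels of the training samples
# image_stack: full data stack reshaped to (n_pixels, n_channels)
X_train = np.load("training_vectors.npy")
y_train = np.load("training_labels.npy")
image_stack = np.load("image_stack.npy")

# k = 5 as in Debeir et al. (2002), Tuominen & Pekkarinen (2005), Collins et al. (2004)
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X_train, y_train)

# Assign each pixel vector to the majority class of its 5 nearest training samples
class_map = knn.predict(image_stack)
```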

Another non-parametric classifier is the Artificial Neural Network (ANN) classifier. Artificial neural networks are computational systems which imitate biological systems of information processing (artificial intelligence). They have been increasingly employed in remote sensing research since the 1990s (Franklin 1995, Sugumaran 2001). Neural networks are networks of interconnected artificial neurons (processing elements, also called nodes). The nodes are organised in layers (one input layer, usually one or several hidden layers, one output layer). Each node in one layer is connected to the nodes in the adjoining layers. The nodes in the input layer receive data from outside the network (e.g. one input node receiving the information from one remote sensing data channel) and pass on an output to the nodes in the next layer. The nodes in the other layers receive inputs from the nodes in the preceding layer and produce one output which they pass on. The output is produced by adding the weighted inputs from the nodes in the preceding layer and putting the sum through an activation function. In many cases this is the sigmoid function, which approaches one when the input sum goes to infinity and zero when the input sum goes to minus infinity. The output of the nodes in the output layer indicates the classification result (Paola & Schowengerdt 1995, PCI 2001).

All the connections between the nodes carry weights which are set to random values before the network is trained. Training consists of finding the appropriate weights through an adaptive training procedure so that the output error is minimised. There are several learning algorithms for neural networks. In remote sensing and image classification, the most widely used learning algorithm is a supervised algorithm called back-propagation (Arora & Mathur 2001), employing the ‘generalised delta rule’. During the first phase of training a back-propagation network, the training sample vectors (with known classes / target outputs) are used as input for the network and propagated forward to compute the output values for each output node. The error between the actual and desired output is calculated. (In the case where each output node represents one class, the desired output is a high value, e.g. 0.9, for the node of the correct class, and a low value, e.g. 0.1, for the other nodes.) The second training phase is a backward pass from the output nodes through the network, during which the weights are changed according to the learning rate and the error signal passed backwards to each node (Benediktsson et al. 1990). This process (entering the training data, calculating the output error, adjusting the weights of the connections) is repeated many times (Foody 2004) until some condition is fulfilled, preferably until the network has stabilised so that the error and the weight changes per cycle have become very small (iterative training). Once the network is trained, i.e. appropriate weights are found and fixed, all pixel vectors are fed into the network and classified.
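
The training procedure described above can be sketched as follows. This is a minimal, generic back-propagation implementation for one hidden layer, written purely for illustration; the network size, learning rate, number of epochs and initialisation range are arbitrary assumptions and the code is not that used in the cited studies.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_backprop(X, T, n_hidden=8, learning_rate=0.1, n_epochs=1000, seed=0):
    """Minimal one-hidden-layer back-propagation network (generalised delta rule).

    X : training vectors scaled to 0-1, shape (n_samples, n_inputs)
    T : target outputs, shape (n_samples, n_classes), e.g. 0.9 for the node
        of the correct class and 0.1 for the other nodes
    """
    rng = np.random.default_rng(seed)
    # Connection weights are set to small random values before training
    W1 = rng.uniform(-0.5, 0.5, (X.shape[1], n_hidden))
    W2 = rng.uniform(-0.5, 0.5, (n_hidden, T.shape[1]))
    for _ in range(n_epochs):
        # Forward pass: weighted sums put through the sigmoid activation
        H = sigmoid(X @ W1)          # hidden layer outputs
        O = sigmoid(H @ W2)          # output layer outputs
        # Backward pass: error signals (deltas) for output and hidden nodes
        delta_out = (T - O) * O * (1.0 - O)
        delta_hid = (delta_out @ W2.T) * H * (1.0 - H)
        # Weight updates scaled by the learning rate
        W2 += learning_rate * H.T @ delta_out
        W1 += learning_rate * X.T @ delta_hid
    return W1, W2

def classify(X, W1, W2):
    """Feed pixel vectors through the trained network; the output node with
    the highest activation indicates the class."""
    return np.argmax(sigmoid(sigmoid(X @ W1) @ W2), axis=1)
```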

ANN classification has the advantage that it makes no assumptions about the data set, so that different kinds of data can be used as input. Typically, all input channels are scaled to a common range (usually values between 0 and 1, like the node output values) before training and classification. Benediktsson et al. (1990) conclude that neural networks need to be trained with carefully selected representative samples. But according to Paola & Schowengerdt (1995) and Qiu & Jensen (2004), ANN classifiers are robust to noise in the training data and able to generalise.
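
The scaling to a common 0-1 range mentioned above can be achieved with a simple per-channel min-max normalisation, for example (a sketch, assuming the stack is organised as pixels x channels):

```python
import numpy as np

def scale_channels_0_1(stack):
    """Rescale every data layer (column) linearly to the range 0-1,
    leaving constant channels at zero instead of dividing by zero."""
    mins = stack.min(axis=0)
    maxs = stack.max(axis=0)
    return (stack - mins) / np.where(maxs > mins, maxs - mins, 1.0)
```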

It is often seen as a disadvantage of neural networks that they operate as a ‘black box’ (Franklin 1995, Qiu & Jensen 2004) without the capability to explain the relationship between input and output. Due to their heuristic nature and the element of random variations in the results (because the weights of the connections are randomised before training), performance prediction and the analysis of results are difficult. Another disadvantage is that iterative training requires much more computation than parametric discriminant functions (Landgrebe 2003, Paola & Schowengerdt 1995). Once the network is trained, however, the classification process as such is fast (Pal & Pal 1993).

Kimes et al. (1999) successfully employed neural networks to differentiate between primary and secondary tropical forest, and with less success to map secondary forest age, using SPOT HRV spectral and textural data. Linderman et al. (2004) use ANN to map understorey bamboo from Landsat data. Comparing ANN and MLC, Paola & Schowengerdt (1995) found that maps produced from Landsat TM data were similar for both methods, but ANN was better at classifying mixed pixels and for one image produced a higher overall accuracy. Sugumaran (2001), classifying tropical forest and plantation types using IRS-LISS-III channels 1-3 (23.5 m resolution), achieved somewhat better results with ANN than with MLC. Arora & Mathur (2001) find ANN superior for the multi-source classification of five broad land cover classes in a Himalayan region (using medium resolution spectral data and slope and aspect maps). Liu et al. (2002b) also achieve better results with ANN than with MLC for a land cover classification with medium resolution multispectral and ancillary GIS data layers. Most studies find that ANN classifications result in higher accuracy than MLC, but that training the neural networks takes a long time (Sunar Erbek et al. 2004).

To my knowledge, neural networks have not yet been used for tropical forest classification with high spatial resolution satellite data, but it does seem to be a promising approach especially for data sets which combine satellite and ancillary data.

Classification of segmented images (object-based classification)

In a segmented image (see chapter 2.4), each segment (image object primitive) is characterised by an integrated measurement vector, consisting, in the most basic case, of the mean values (in n spectral channels) of the pixels forming the segment. Using the mean spectral values of the image
