
ensemble of three realizations of the A1B scenario simulations is available. This makes it possible to account, to some extent, for climate variability (Schoetter et al., 2012).

GP, TH and RH are directly available in the model output. VO has to be calculated from the u- and v-components of the wind by using the definition of the vertical component of the vorticity vector:

$$\mathrm{VO} = \frac{\partial v}{\partial x} - \frac{\partial u}{\partial y} \qquad (4.2)$$

All variables are bilinearly interpolated to the 2.5° × 2.5° regular grid to match the grid used for the WPC.
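For illustration, a minimal NumPy sketch of Eq. (4.2), assuming u and v are available as 2-D arrays on a regular latitude-longitude grid; the function name and the centred-difference treatment of the spherical geometry are illustrative assumptions, not the exact procedure applied to the model output.

```python
import numpy as np

def relative_vorticity(u, v, lons_deg, lats_deg, radius=6.371e6):
    """Vertical component of the relative vorticity, VO = dv/dx - du/dy
    (Eq. 4.2), approximated with centred differences on a regular
    lat-lon grid. u, v are 2-D arrays shaped (lat, lon) in m/s."""
    lons = np.deg2rad(np.asarray(lons_deg))
    lats = np.deg2rad(np.asarray(lats_deg))
    # dx shrinks with the cosine of latitude; dy is constant on the sphere.
    dv_dx = np.gradient(v, lons, axis=1) / (radius * np.cos(lats)[:, None])
    du_dy = np.gradient(u, lats, axis=0) / radius
    return dv_dx - du_dy
```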

4.3.4 CLM

CLM is a non-hydrostatic RCM based on the Lokal-Modell (LM) of the DWD. The model is described by Böhm et al. (2006), and detailed information about its dynamics and physics is given by Steppeler et al. (2003). The prognostic variables of the model are temperature, the horizontal and vertical wind components, pressure perturbation, specific humidity and cloud water content. A detailed description of the IPCC simulations used in this study is given by Hollweg et al. (2008). As with the REMO simulations, the same ECHAM5 A1B scenario simulations are used as forcing at the lateral boundaries. However, instead of nesting the model twice, CLM is directly nested into the ECHAM5 results. The model domain covers nearly the same area as the 50 km REMO results, but at a finer horizontal resolution of 0.165° (~18 km). Only downscaled simulations of the first two realizations of the A1B scenario simulations are available. Again, the variables GP, TH and RH are directly available, and VO is calculated using Eq. (4.2).

4.4 Clustering Methods

With a k-means method, one can apply the same distance measures to both the clustering and the assignment of data from different models to the resulting WPs.

The results of the k-means method are highly dependent on the choice of the initial conditions. Several studies have sought to overcome this disadvantage, for example with the extended k-means method (e.g. Enke and Spekat, 1997; Philipp et al., 2007). Three different methods, k-means, dkmeans and SANDRA, are tested here to achieve the best results. The clustering is computed with the classification software developed during the COST733 action (Philipp et al., 2010). In the following, the three clustering methods are briefly described.

4.4.1 k-means

The k-means method is a non-hierarchical cluster algorithm that groups objects into mutually exclusive clusters. The number of clusters k has to be prescribed for this method. The similarity measure is the squared Euclidean distance SED between two data objects (vectors containing the fields of one or more variables) x1 and x2:

$$\mathrm{SED} = \left\| \mathbf{x}_1 - \mathbf{x}_2 \right\|^2 \qquad (4.3)$$

This measure is then used to define the so-called within-cluster sum of squares (WSS):

$$\mathrm{WSS} = \sum_{i=1}^{k} \sum_{\mathbf{x} \in C_i} \left\| \mathbf{x} - \mathbf{z}_i \right\|^2 \qquad (4.4)$$

where x denotes the data objects belonging to cluster Ci, zi is the corresponding ith cluster centroid (CC), and k is the number of clusters. Since several variables are used in this study, the data objects of each variable are normalized by subtracting the corresponding temporal-spatial mean and then dividing by the standard deviation.
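A one-line sketch of this normalization (the stacking of the four variables into one data object per day is a hypothetical usage example):

```python
import numpy as np

def normalize(field):
    """Remove the temporal-spatial mean of one variable and divide by its
    standard deviation, so that variables with different units can be
    combined in a single distance measure."""
    return (field - field.mean()) / field.std()

# Hypothetical usage: stack the normalized fields of GP, TH, RH and VO,
# each shaped (days, gridpoints), into one data object per day.
# data = np.hstack([normalize(f) for f in (gp, th, rh, vo)])
```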

The k-means algorithm tries to determine the minimum of the WSS in an iterative process. The first step is the so-called starting partition, in which k CCs are chosen. The SED is computed for each combination of CC and data object. All data objects are then assigned to their nearest CC to form the clusters. The data objects belonging to each cluster are averaged to yield the new CCs. Thereafter, the SED between all objects and the new CCs is computed, and the objects are reassigned to their nearest CC. This procedure is repeated until no data point needs to be reassigned to another CC. To achieve some independence of the starting partition, the k-means algorithm is repeated 1000 times using different starting partitions. The quality of a classification result is reflected by the explained cluster variance (ECV). The ECV is based on the ratio of the WSS and the total sum of squares (TSS):

ECV 1WSS

TSS (4.5)

The ECV ranges from 0 to the optimal value of 1. Therefore, the result with the highest ECV is chosen as the final classification result.
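The procedure can be illustrated with a generic NumPy sketch; this is a plain k-means with random starting partitions and ECV-based selection as described above, not the COST733 implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(data, k, max_iter=100):
    """Plain k-means: random starting partition, then alternate between
    assigning each object to its nearest CC (smallest SED, Eq. 4.3) and
    recomputing the CCs, until no object changes cluster."""
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    labels = np.full(len(data), -1)
    for _ in range(max_iter):
        sed = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = sed.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                          # converged: no reassignment needed
        labels = new_labels
        for i in range(k):
            members = data[labels == i]
            if len(members):               # keep the old CC if a cluster runs empty
                centroids[i] = members.mean(axis=0)
    return labels, centroids

def ecv(data, labels, centroids):
    """Explained cluster variance, ECV = 1 - WSS/TSS (Eqs. 4.4 and 4.5)."""
    wss = ((data - centroids[labels]) ** 2).sum()
    tss = ((data - data.mean(axis=0)) ** 2).sum()
    return 1.0 - wss / tss

def best_of_n(data, k, n_starts=1000):
    """Repeat k-means with different starting partitions and keep the
    result with the highest ECV."""
    runs = [kmeans(data, k) for _ in range(n_starts)]
    return max(runs, key=lambda run: ecv(data, *run))
```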

4.4.2 dkmeans

A crucial point when using k-means clustering is the starting partition. The so-called dkmeans method (Enke and Spekat, 1997; Philipp et al., 2010) uses the most dissimilar data objects as the starting partition. These objects are identified by an iterative algorithm. In contrast to the k-means method, the Euclidean distance ED is used as the similarity measure:

$$\mathrm{ED} = \left\| \mathbf{x}_1 - \mathbf{x}_2 \right\| \qquad (4.6)$$

Again, all objects are normalized before the ED is calculated. After the most dissimilar CCs have been found, all remaining objects are assigned to their most similar CC to form the clusters. As with the k-means method, the new CCs are calculated and all objects are reassigned to their most similar CC. The iterative process of reassigning the data objects and calculating the new CCs stops when no object needs to be reassigned during an iteration step.
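A possible greedy selection of such a starting partition is sketched below; the exact iterative rule of Enke and Spekat (1997) may differ in detail, and the full pairwise distance matrix is only practical for moderate sample sizes:

```python
import numpy as np

def dissimilar_seeds(data, k):
    """Greedily pick k mutually dissimilar data objects as starting CCs:
    begin with the pair separated by the largest ED (Eq. 4.6), then
    repeatedly add the object whose ED to its nearest seed is largest."""
    dist = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=2)
    i, j = np.unravel_index(dist.argmax(), dist.shape)
    seeds = [int(i), int(j)]
    while len(seeds) < k:
        nearest = dist[:, seeds].min(axis=1)  # ED of each object to its nearest seed
        nearest[seeds] = -1.0                 # exclude objects already chosen
        seeds.append(int(nearest.argmax()))
    return data[seeds]
```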

4.4.3 SANDRA

The conventional k-means clustering algorithm tends to get trapped in local minima of the sum of squares WSS. This problem can be avoided by using the simulated annealing and diversified randomization method (SANDRA) developed by Philipp et al. (2007). Diversified randomization means that the clustering is repeated several times with randomized starting partitions of the data, and that during the clustering the order of the data objects and of the cluster numbers is randomized. Simulated annealing allows data objects to be assigned to a 'wrong' cluster during the iteration process, meaning that an object is not necessarily assigned to its closest CC. At first, this causes the WSS to increase. However, it can prevent a local minimum from being misidentified as the global minimum. In practice, a data object is moved into a 'wrong' cluster if the acceptance probability P is larger than a random number between 0 and 1. P is given by:

$$P = \exp\!\left( \frac{\mathrm{ED}_{\mathrm{old}} - \mathrm{ED}_{\mathrm{new}}}{T} \right) \qquad (4.7)$$

where EDold is the Euclidean distance of the data object to the old cluster, EDnew is the Euclidean distance to the potentially new cluster, and T is a control parameter that is reduced after each iteration step by a constant factor CO:

$$T_{i+1} = \mathrm{CO} \cdot T_i \qquad (4.8)$$

CO is the so-called cooling rate, which is set to 0.999 in this study. The initial T is chosen empirically such that 99% of the objects are moved during the first iteration step. The clustering process is finished when no reassignment is possible and no 'wrong' reassignment has occurred during an iteration step. As for k-means and dkmeans, all data objects are normalized to ensure comparability between the different variables used for the WPC.

4.5 Optimal method for weather pattern classification based on