
3.2 Cluster Analysis of Wind Data

[Four windrose panels at P = 1000, 925, 850, and 700 mbar; radial axis in 2% frequency steps up to 8%; wind-speed bins in m/s up to 18, 25, 35, and 40 m/s, respectively.]

Figure 3.2: Windrose of geostrophic wind from NCEP/NCAR Reanalysis-I data (1960–2007)

The potential temperature θ is the temperature that a parcel of air with temperature T at pressure P attains when it is brought adiabatically to a standard reference pressure P0 of 1000 mbar; it can be calculated with Poisson's equation (see Eq. 3.3):

\theta = T \left( \frac{P_0}{P} \right)^{R/c_p} \qquad (3.3)

Here R is the gas constant and c_p is the specific heat capacity at constant pressure.

With the retrieved real temperature and geopotential heights, the potential temperature and the corresponding potential temperature gradients (PTGs) can be obtained. In this study, the temperature input option uses one real temperature plus two levels of PTGs.
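As a minimal sketch of how Eq. 3.3 and the PTG computation could look in code (the constant values and the function names are assumptions, not taken from this work):

```python
import numpy as np

# Standard dry-air constants (assumed values, not specified in the text)
R = 287.05    # gas constant for dry air [J kg^-1 K^-1]
CP = 1004.0   # specific heat capacity at constant pressure [J kg^-1 K^-1]
P0 = 1000.0   # standard reference pressure [mbar]

def potential_temperature(T, P):
    """Poisson's equation (Eq. 3.3): theta = T * (P0/P)**(R/cp)."""
    return np.asarray(T) * (P0 / np.asarray(P)) ** (R / CP)

def ptg(theta_lower, theta_upper, dz):
    """Potential temperature gradient [K/m] between two pressure levels,
    using the geopotential height difference dz [m]."""
    return (theta_upper - theta_lower) / dz

# Example: 280 K measured at 925 mbar gives theta of about 286 K
print(potential_temperature(280.0, 925.0))
```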

to 5×10⁻⁴ seconds per grid per time step for 200 MHz CPU power. The model can be run on a daily basis for short time periods. If long time periods or statistical results are desired, the input data have to be classified into representative cases. In this work, a two-step cluster analysis is used to classify the more than 40 years of data, with k-means clustering as the first step and Ward's clustering as the second step. Sensitivity tests of the METRAS PC model show that humidity is not a sensitive parameter. Humidity is therefore excluded from the cluster analysis; an arbitrary value is assigned for the simulation, which does not affect the wind simulation results.

Clustering, or grouping, is considered the most important unsupervised learning technique; it deals with finding structure in a collection of unlabeled high-dimensional data.

Clustering is based on a measure of association between clusters or original data objects, called similarity (or its counterpart, dissimilarity). Distance is the most popular such criterion: two or more objects belonging to the same cluster are "close" in terms of a given distance, e.g. the Minkowski metric,

d_p(x_i, x_j) = \left( \sum_{d=1}^{N} |x_{i,d} - x_{j,d}|^p \right)^{1/p} \qquad (3.4)

where N is the dimensionality of the data. The Euclidean distance is the special case of the Minkowski metric with p = 2.
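A direct transcription of Eq. 3.4 as an illustrative sketch:

```python
import numpy as np

def minkowski_distance(x_i, x_j, p=2):
    """Minkowski distance of order p (Eq. 3.4).
    p = 2 yields the Euclidean distance, p = 1 the Manhattan distance."""
    diff = np.abs(np.asarray(x_i, float) - np.asarray(x_j, float))
    return (diff ** p).sum() ** (1.0 / p)

print(minkowski_distance([0, 0], [3, 4], p=2))  # 5.0 (Euclidean)
print(minkowski_distance([0, 0], [3, 4], p=1))  # 7.0 (Manhattan)
```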

Clustering methods can be distinguished by their operational methodology into hierarchical and partitional methods. The affiliation of an object to the clusters can be exclusive, overlapping, or fuzzy-weighted. A complete clustering assigns every object to a cluster, whereas a partial clustering does not (Tan et al., 2005). In this work, complete exclusive clustering methods of both the hierarchical and the partitional type have been used; they are discussed in the following.

3.2.1 Ward’s Method

Hierarchical clustering creates a hierarchy of clusters in a tree structure called a dendrogram.

The root of the tree, i.e. the highest level of the hierarchy, is a single cluster containing all data objects, and the lowest level consists of the individual data objects. Hierarchical clustering can proceed "bottom-up", successively merging the two clusters with the smallest distance (agglomerative hierarchical clustering), or "top-down", recursively splitting the clusters that are farthest apart (divisive hierarchical clustering). Agglomerative clustering is easy to implement and is thus the most frequently used hierarchical method. For agglomerative hierarchical clustering, differences arise depending on how the distance between clusters is defined. Single linkage clustering defines the cluster distance as the minimum distance between the closest objects of two clusters, while complete linkage uses, on the contrary, the longest object distance. Average group linkage uses the mean of the distances of all object pairs formed by the two clusters. Ward's method specifies the linkage function as the increase in the "error sum of squares" (ESS) after fusing two clusters into a single cluster.

Ward's method seeks to merge the clusters while minimizing the increase in ESS at each step:

ESS_j = \sum_{i=1}^{n_j} \left| x_i^j - c_j \right|^2 \qquad (3.5)

D(j_1, j_2) = ESS(j_1, j_2) - \left( ESS(j_1) + ESS(j_2) \right) \qquad (3.6)

c_j = \frac{1}{n_j} \sum_{i=1}^{n_j} x_i^j \qquad (3.7)

Here ESS(j_1, j_2) is the error sum of squares once clusters j_1 and j_2 are merged into one cluster, c_j is the centroid of cluster j, and n_j is the number of its data objects.

For agglomerative hierarchical clustering, including of course Ward's method, all possible cluster combinations have to be calculated and stored for comparison, which leads to a large intermediate storage demand. These methods are therefore only appropriate for small data sets.
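A minimal sketch of the Ward linkage criterion built directly from Eqs. 3.5–3.7 (illustrative only, not the implementation used in this work):

```python
import numpy as np

def ess(cluster):
    """Error sum of squares of one cluster (Eq. 3.5);
    the centroid is the mean of the cluster members (Eq. 3.7)."""
    centroid = cluster.mean(axis=0)
    return ((cluster - centroid) ** 2).sum()

def ward_merge_cost(c1, c2):
    """Increase in ESS when fusing two clusters (Eq. 3.6)."""
    return ess(np.vstack([c1, c2])) - (ess(c1) + ess(c2))

# At each agglomerative step, the pair with the smallest cost is merged
a = np.array([[0.0, 0.0], [1.0, 0.0]])
b = np.array([[0.5, 0.2]])
print(ward_merge_cost(a, b))   # small cost: a and b are close
```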

3.2.2 k-means Clustering

k-means clustering is the most representative partitional clustering method; it divides the data into a number of clusters k that is fixed a priori. The algorithm proceeds in the following steps (a minimal sketch follows the list):

1. Define k centroids, one for each cluster c_j (j = 1, ..., k);

2. Associate each data point with the nearest centroid; a point assigned to cluster j is denoted x_i^j;

3. Recalculate the new centroid from the n_j data points of each cluster j: c_j = \frac{1}{n_j} \sum_{i=1}^{n_j} x_i^j;

4. Repeat steps 2 and 3 until the locations of the k centroids no longer change.
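An illustrative implementation of steps 1–4, assuming Euclidean distance:

```python
import numpy as np

def kmeans(data, k, max_iter=100, seed=0):
    """Plain k-means following steps 1-4 above (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Step 1: define k centroids, here initialized from random data points
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: associate each data point with the nearest centroid
        dists = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        # Step 3: recalculate each centroid as the mean of its n_j members
        new_centroids = np.array(
            [data[labels == j].mean(axis=0) if (labels == j).any()
             else centroids[j] for j in range(k)])
        # Step 4: stop when the centroid locations no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```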

Because a distance matrix does not have to be determined and stored in k-means clustering, it can be applied to much larger data sets. In this study, to classify the more than 40 years of daily data, a two-step clustering has been applied, with k-means in the first step and, in the second step, Ward's method refined by k-means.

3.2.3 Clustering Results

Before clustering, the different dimensions of the data have been standardized. In the first step, the daily real temperature at 1000 mbar and the lowest two potential temperature gradients (1000–925 mbar and 925–850 mbar levels) have been classified into 10 groups by k-means clustering. Fig. 3.3 shows an example scatter plot of the grouped data between 1960 and 1970.

In the next step, each temperature class is clustered with Ward's method and refined by the k-means method. The refinement with k-means improves the clustering accuracy and offers the possibility to automatically fix the number of sub-clusters (Hosking and Wallis, 1997).
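The two-step procedure could be sketched with standard SciPy routines as below; the array names, the synthetic data, and the choice of 20 sub-clusters per temperature class are illustrative assumptions (in this work the number of sub-clusters is fixed automatically, as noted above):

```python
import numpy as np
from scipy.cluster.vq import kmeans2
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
temp_data = rng.standard_normal((1000, 3))  # stand-in: standardized [T_1000, PTG_1, PTG_2]
wind_data = rng.standard_normal((1000, 4))  # stand-in: daily wind components

# Step 1: k-means on the temperature variables (10 classes, as in the text)
_, temp_labels = kmeans2(temp_data, 10, minit='++', seed=0)

# Step 2: within each temperature class, Ward's method on the wind data,
# refined by k-means started from the Ward centroids
for g in range(10):
    subset = wind_data[temp_labels == g]
    Z = linkage(subset, method='ward')
    ward_labels = fcluster(Z, t=20, criterion='maxclust')   # assumed 20 sub-clusters
    seeds = np.array([subset[ward_labels == c].mean(axis=0)
                      for c in np.unique(ward_labels)])
    _, refined_labels = kmeans2(subset, seeds, minit='matrix')  # k-means refinement
```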

Fig. 3.4 shows an example result of the different clustering methods for one temperature class.

Winds originating from the west (horizontal axis) and from the south (vertical axis) are denoted as positive.

[Scatter plot of the grouped data: real temperature [K] against first- and second-level PTG.]

Figure 3.3: Clustering of temperature data of 1960–1970

Fig. 3.4a shows the clustered wind data of one temperature subset by Ward's method without refinement by k-means clustering, and Fig. 3.4b shows the case with refinement. The refined result is more reasonable: for example, before refinement some of the data objects in cluster No. 8 lie closer to cluster No. 6. One problem noticeable in both Fig. 3.4a and 3.4b is that winds with very diverse, even completely opposite, directions are classified into one group, which is not appropriate. Therefore, constraints on the wind direction are imposed on the distance calculation in the clustering algorithm; the result is shown in Fig. 3.4c.
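The exact form of the direction constraint is not specified here; one possible sketch is to forbid merging winds whose directions differ by more than a threshold angle:

```python
import numpy as np

def constrained_wind_distance(w1, w2, max_angle_deg=90.0):
    """Euclidean distance between two wind vectors (u, v), set to infinity
    when their directions differ by more than max_angle_deg.
    A hypothetical form of the constraint, not the one used in this work."""
    w1, w2 = np.asarray(w1, float), np.asarray(w2, float)
    cos_a = np.dot(w1, w2) / (np.linalg.norm(w1) * np.linalg.norm(w2))
    angle = np.degrees(np.arccos(np.clip(cos_a, -1.0, 1.0)))
    return np.inf if angle > max_angle_deg else np.linalg.norm(w1 - w2)

print(constrained_wind_distance([1, 0], [0.9, 0.2]))  # similar directions: finite
print(constrained_wind_distance([1, 0], [-1, 0]))     # opposite directions: inf
```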

The final 200 temperature-wind clusters are shown in Fig. 3.5a, and the windrose of the 200 wind clusters is shown in Fig. 3.5b. As can be seen, by taking the centroid of all data objects, the maximum wind speed of the clusters is smaller than that of the original data in Fig. 3.2.

One may also notice that the windrose is a mirror image of the clusters in Fig. 3.5a; this is because the windrose denotes the direction from which the wind originates, whereas the cluster plot expresses the wind in the opposite sense, as specified before.