
3.2 Cluster Analysis of Wind Data

[Four windrose panels at P = 1000, 925, 850, and 700 mbar; radial axis in 2% frequency steps up to 8%; wind-speed bins in m/s up to 18, 25, 35, and 40 m/s, respectively.]

Figure 3.2: Windrose of geostrophic wind from NCEP/NCAR Reanalysis-I data (1960–2007)

The potential temperature θ is the temperature that a parcel of air with temperature T at pressure P attains when it is brought adiabatically to a standard reference pressure P0 of 1000 mbar; it can be calculated with Poisson's equation (see Eq. 3.3):

\theta = T \left( \frac{P_0}{P} \right)^{R/c_p} \qquad (3.3)

Here R is the gas constant and c_p is the specific heat capacity at constant pressure.

With the retrieved real temperature and geopotential heights, the potential temperature and the corresponding potential temperature gradients (PTGs) can be obtained. In this study, the temperature input option uses one real temperature plus two levels of PTGs.
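As a minimal sketch of how Eq. 3.3 and the PTG computation could look in code (the constant values and the function names are assumptions, not taken from this work):

```python
import numpy as np

# Standard dry-air constants (assumed values, not specified in the text)
R = 287.05    # gas constant for dry air [J kg^-1 K^-1]
CP = 1004.0   # specific heat capacity at constant pressure [J kg^-1 K^-1]
P0 = 1000.0   # standard reference pressure [mbar]

def potential_temperature(T, P):
    """Poisson's equation (Eq. 3.3): theta = T * (P0/P)**(R/cp)."""
    return np.asarray(T) * (P0 / np.asarray(P)) ** (R / CP)

def ptg(theta_lower, theta_upper, dz):
    """Potential temperature gradient [K/m] between two pressure levels,
    using the geopotential height difference dz [m]."""
    return (theta_upper - theta_lower) / dz

# Example: 280 K measured at 925 mbar gives theta of about 286 K
print(potential_temperature(280.0, 925.0))
```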

to 5×10⁻⁴ seconds per grid per time step for 200 MHz CPU power. The model can be run on a daily basis for short time periods. If long time periods or statistical results are desired, the input data have to be classified into representative cases. In this work, a two-step cluster analysis is used to classify the more than 40 years of data, with k-means clustering as the first step and Ward's clustering as the second step. Sensitivity tests of the METRAS PC model show that humidity is not a sensitive parameter. Humidity is therefore excluded from the cluster analysis; an arbitrary value is assigned for the simulation, which does not affect the wind simulation results.

Clustering, or grouping, is considered the most important unsupervised learning technique; it deals with finding structure in a collection of unlabeled high-dimensional data.

Clustering is based on a measure of association between clusters or original data objects, called similarity (or its counterpart, dissimilarity). Distance is the most popular such criterion: two or more objects belonging to the same cluster are "close" in terms of a given distance, e.g. the Minkowski metric,

d_p(x_i, x_j) = \left( \sum_{d=1}^{N} |x_{i,d} - x_{j,d}|^p \right)^{1/p} \qquad (3.4)

where N is the dimensionality of the data. The Euclidean distance is the special case of the Minkowski metric with p = 2.
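A direct transcription of Eq. 3.4 as an illustrative sketch:

```python
import numpy as np

def minkowski_distance(x_i, x_j, p=2):
    """Minkowski distance of order p (Eq. 3.4).
    p = 2 yields the Euclidean distance, p = 1 the Manhattan distance."""
    diff = np.abs(np.asarray(x_i, float) - np.asarray(x_j, float))
    return (diff ** p).sum() ** (1.0 / p)

print(minkowski_distance([0, 0], [3, 4], p=2))  # 5.0 (Euclidean)
print(minkowski_distance([0, 0], [3, 4], p=1))  # 7.0 (Manhattan)
```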

Clustering methods can be distinguished by their operational methodology into hierarchical and partitional methods. The affiliation of an object to the clusters can be exclusive, overlapping, or fuzzy-weighted. A complete clustering assigns every object to a cluster, whereas a partial clustering does not (Tan et al., 2005). In this work, complete exclusive clustering methods of both the hierarchical and the partitional type have been used; they are discussed in the following.

3.2.1 Ward’s Method

Hierarchical clustering creates a hierarchy of clusters in a tree structure called a dendrogram.

The root of the tree, i.e. the highest level of the hierarchy, is a single cluster containing all data objects, and the lowest level consists of the individual data objects. Hierarchical clustering can proceed "bottom-up", successively merging the two clusters with the smallest distance (agglomerative hierarchical clustering), or "top-down", recursively splitting the clusters that are farthest apart (divisive hierarchical clustering). Agglomerative clustering is easy to implement and is thus the most frequently used hierarchical method. For agglomerative hierarchical clustering, differences arise depending on how the distance between clusters is defined. Single linkage clustering defines the cluster distance as the minimum distance between the closest objects of two clusters, while complete linkage uses, on the contrary, the longest object distance. Average group linkage uses the mean of the distances of all object pairs formed by the two clusters. Ward's method specifies the linkage function as the increase in the "error sum of squares" (ESS) after fusing two clusters into a single cluster.

Ward's method seeks to merge the clusters while minimizing the increase in ESS at each step:

ESS_j = \sum_{i=1}^{n_j} \left| x_i^j - c_j \right|^2 \qquad (3.5)

D(j_1, j_2) = ESS(j_1, j_2) - \left( ESS(j_1) + ESS(j_2) \right) \qquad (3.6)

c_j = \frac{1}{n_j} \sum_{i=1}^{n_j} x_i^j \qquad (3.7)

Here ESS(j_1, j_2) is the error sum of squares once clusters j_1 and j_2 are merged into one cluster, c_j is the centroid of cluster j, and n_j is the number of its data objects.

For agglomerative hierarchical clustering, including of course Ward's method, all possible cluster combinations have to be calculated and stored for comparison, which leads to a large intermediate storage demand. These methods are therefore only appropriate for small data sets.
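A minimal sketch of the Ward linkage criterion built directly from Eqs. 3.5–3.7 (illustrative only, not the implementation used in this work):

```python
import numpy as np

def ess(cluster):
    """Error sum of squares of one cluster (Eq. 3.5);
    the centroid is the mean of the cluster members (Eq. 3.7)."""
    centroid = cluster.mean(axis=0)
    return ((cluster - centroid) ** 2).sum()

def ward_merge_cost(c1, c2):
    """Increase in ESS when fusing two clusters (Eq. 3.6)."""
    return ess(np.vstack([c1, c2])) - (ess(c1) + ess(c2))

# At each agglomerative step, the pair with the smallest cost is merged
a = np.array([[0.0, 0.0], [1.0, 0.0]])
b = np.array([[0.5, 0.2]])
print(ward_merge_cost(a, b))   # small cost: a and b are close
```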

3.2.2 k-means Clustering

k-means clustering is the most representative partitional clustering method; it divides the data into a number of clusters k that is fixed a priori. The algorithm proceeds in the following steps (a minimal sketch follows the list):

1. Define k centroids, one for each cluster c_j (j = 1, ..., k);

2. Associate each data point with the nearest centroid; a point assigned to cluster j is denoted x_i^j;

3. Recalculate the new centroid from the n_j data points of each cluster j: c_j = \frac{1}{n_j} \sum_{i=1}^{n_j} x_i^j;

4. Repeat steps 2 and 3 until the locations of the k centroids no longer change.
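An illustrative implementation of steps 1–4, assuming Euclidean distance:

```python
import numpy as np

def kmeans(data, k, max_iter=100, seed=0):
    """Plain k-means following steps 1-4 above (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Step 1: define k centroids, here initialized from random data points
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: associate each data point with the nearest centroid
        dists = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        # Step 3: recalculate each centroid as the mean of its n_j members
        new_centroids = np.array(
            [data[labels == j].mean(axis=0) if (labels == j).any()
             else centroids[j] for j in range(k)])
        # Step 4: stop when the centroid locations no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```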

Because a distance matrix does not have to be determined and stored in k-means clustering, it can be applied to much larger data sets. In this study, to classify the more than 40 years of daily data, a two-step clustering has been applied, with k-means in the first step and, in the second step, Ward's method refined by k-means.

3.2.3 Clustering Results

Before clustering, the different dimensions of the data have been standardized. In the first step, the daily real temperature at 1000 mbar and the lowest two potential temperature gradients (1000–925 mbar and 925–850 mbar levels) have been classified into 10 groups by k-means clustering. Fig. 3.3 shows an example scatter plot of the grouped data between 1960 and 1970.

In the next step, each temperature class is clustered with Ward's method and refined by the k-means method. The refinement with k-means improves the clustering accuracy and offers the possibility to automatically fix the number of sub-clusters (Hosking and Wallis, 1997).
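The two-step procedure could be sketched with standard SciPy routines as below; the array names, the synthetic data, and the choice of 20 sub-clusters per temperature class are illustrative assumptions (in this work the number of sub-clusters is fixed automatically, as noted above):

```python
import numpy as np
from scipy.cluster.vq import kmeans2
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
temp_data = rng.standard_normal((1000, 3))  # stand-in: standardized [T_1000, PTG_1, PTG_2]
wind_data = rng.standard_normal((1000, 4))  # stand-in: daily wind components

# Step 1: k-means on the temperature variables (10 classes, as in the text)
_, temp_labels = kmeans2(temp_data, 10, minit='++', seed=0)

# Step 2: within each temperature class, Ward's method on the wind data,
# refined by k-means started from the Ward centroids
for g in range(10):
    subset = wind_data[temp_labels == g]
    Z = linkage(subset, method='ward')
    ward_labels = fcluster(Z, t=20, criterion='maxclust')   # assumed 20 sub-clusters
    seeds = np.array([subset[ward_labels == c].mean(axis=0)
                      for c in np.unique(ward_labels)])
    _, refined_labels = kmeans2(subset, seeds, minit='matrix')  # k-means refinement
```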

Fig. 3.4 shows an example result of the different clustering methods for one temperature class.

Winds originating from the west (horizontal axis) and from the south (vertical axis) are denoted as positive.

[Scatter plot of the grouped data: real temperature [K] against first- and second-level PTG.]

Figure 3.3: Clustering of temperature data of 1960–1970

Fig. 3.4a shows the clustered wind data of one temperature subset by Ward's method without refinement by k-means clustering, and Fig. 3.4b shows the case with refinement. The refined result is more reasonable: for example, before refinement some of the data objects in cluster No. 8 lie closer to cluster No. 6. One problem noticeable in both Fig. 3.4a and 3.4b is that winds with very diverse, even completely opposite, directions are classified into one group, which is not appropriate. Therefore, constraints on the wind direction are imposed on the distance calculation in the clustering algorithm; the result is shown in Fig. 3.4c.
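The exact form of the direction constraint is not specified here; one possible sketch is to forbid merging winds whose directions differ by more than a threshold angle:

```python
import numpy as np

def constrained_wind_distance(w1, w2, max_angle_deg=90.0):
    """Euclidean distance between two wind vectors (u, v), set to infinity
    when their directions differ by more than max_angle_deg.
    A hypothetical form of the constraint, not the one used in this work."""
    w1, w2 = np.asarray(w1, float), np.asarray(w2, float)
    cos_a = np.dot(w1, w2) / (np.linalg.norm(w1) * np.linalg.norm(w2))
    angle = np.degrees(np.arccos(np.clip(cos_a, -1.0, 1.0)))
    return np.inf if angle > max_angle_deg else np.linalg.norm(w1 - w2)

print(constrained_wind_distance([1, 0], [0.9, 0.2]))  # similar directions: finite
print(constrained_wind_distance([1, 0], [-1, 0]))     # opposite directions: inf
```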

The final 200 temperature-wind clusters are shown in Fig. 3.5a, and the windrose of the 200 wind clusters is shown in Fig. 3.5b. As can be seen, by taking the centroid of all data objects, the maximum wind speed of the clusters is smaller than that of the original data in Fig. 3.2.

One may also notice that the windrose is a mirror image of the clusters in Fig. 3.5a; this is because the windrose denotes the direction from which the wind originates, whereas the cluster plot expresses the wind in the opposite sense, as specified before.