
4.2 ML-HARAM: ML-ARAM for Large Multi-label Datasets

4.2.1 HARAM

HARAM differs from the original algorithm by an additional preparation step that takes place after the training process is completed. In this step, higher layers (i.e. Fuzzy ART clusters) are built on top of the F2 prototypes, reducing the number of F2 prototypes that need to be activated.⁹ During the clustering of the prototypes, their identifiers are stored for rapid access from the top. To enable this process, the learned F2 prototypes are used as input for an unsupervised Fuzzy ART network. Similarly to Eq. (2.2), the winner is determined by choosing the cluster with the highest activation:

\[
T_k^c(W_j^a) = \frac{|W_j^a \wedge W_k^c|}{\alpha + |W_k^c|} \tag{4.2}
\]

where $W_k^c$ is the weight vector of the cluster and $W_j^a$ the prototype of the network.

An important difference here is that two points ($r_{min}$ and $r_{max}$) coded in $W_j^a$ are used rather than a single input point. Thus, in contrast to Eq. (2.2), both the distance of a prototype from the cluster C and the prototype size influence the activation

⁹We differentiate here between F2 prototypes, which are coded in the category nodes and connected to labels, and clusters of higher layers that are not part of the ARAM network.

value. This activation function therefore rewards the creation of compact clusters: among equidistant prototypes, the smaller ones produce larger activation values for a given cluster.
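The following minimal numpy sketch illustrates this choice function; the function name, the default value of $\alpha$, and the assumption that both prototypes and cluster weights are stored as complement-coded vectors are ours, not part of the original formulation.

```python
import numpy as np

def cluster_activation(prototype, cluster_weights, alpha=0.001):
    """Choice function of Eq. (4.2), evaluated for every cluster at once.

    prototype       : 1-D array, complement-coded F2 prototype (r_min, 1 - r_max)
    cluster_weights : 2-D array, one complement-coded cluster weight vector per row
    alpha           : small choice parameter (the default here is an assumption)
    """
    fuzzy_and = np.minimum(prototype, cluster_weights)          # component-wise fuzzy AND
    return fuzzy_and.sum(axis=1) / (alpha + cluster_weights.sum(axis=1))

# The winning cluster is the one with the highest activation:
# winner = int(np.argmax(cluster_activation(prototype, cluster_weights)))
```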

The winner is found in a similar manner to Eqs. (2.2) and (2.3), by choosing the cluster with the highest activation. The learning equation is as follows:

\[
W_K^{c(\mathrm{new})} = \beta^c \left( W_j^a \wedge W_K^{c(\mathrm{old})} \right) + (1 - \beta^c)\, W_K^{c(\mathrm{old})} \tag{4.3}
\]

The learning rate $\beta^c$ is set to unity in the fast-learning mode. The growth process of a cluster C after learning a prototype P is depicted in Figure 4.2. The corresponding hyperbox P with the corner points $(p_{min}, p_{max})$ becomes incorporated into the cluster $C = (c_{min}, c_{max})$, changing it to $C_{new} = (c_{min}^{new}, c_{max}^{new})$. This differs from the standard Fuzzy ART training algorithm, in which only a single point is learned by the network at a time. In HARAM, the two points describing a prototype are learned simultaneously in the higher layer, which allows more rapid clustering of the prototypes. If $C_{new}$ is the winner of the higher layer, the prototype P is activated at F2 along with all other prototypes that have been involved in the creation of C.
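With complement coding, where a cluster weight stores $(c_{min}, 1 - c_{max})$, the fast-learning case of Eq. (4.3) reduces to a simple update of the hyperbox corners. A minimal sketch of this update, with names chosen here for illustration:

```python
import numpy as np

def learn_prototype(c_min, c_max, p_min, p_max):
    """Fast-learning update of Eq. (4.3) with beta^c = 1, written directly on
    the hyperbox corners: the cluster C = (c_min, c_max) grows just enough to
    enclose the prototype hyperbox P = (p_min, p_max), as depicted in Fig. 4.2.
    """
    c_min_new = np.minimum(c_min, p_min)   # lower corner can only move outward
    c_max_new = np.maximum(c_max, p_max)   # upper corner can only move outward
    return c_min_new, c_max_new
```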

The Fuzzy ART network usually has a lower vigilance than the level chosen during ARAM training. Clustering can be performed recursively for several layers, using prototypes/clusters from a lower layer as input for the next layer and decreasing the vigilance for each new higher layer. Although this builds a hierarchy of prototypes, the gain in speed decreases significantly with each additional layer and accuracy scarcely changes at all. The experiments show that apart from the prototype layer, it is the first cluster layer that influences the accuracy of the classifier most. For this reason, only one cluster layer is used in this study.

This clustering process creates larger (and more importantly, fewer) clusters than prototypes and is therefore able to accelerate access to them. In order to maximize the gain in speed, one can use a simple rule of thumb as a guiding value for the optimal number of clusters: assuming the equal distribution of prototypes in clusters, this value can be obtained by minimizing the sum of the number of activated clusters and the number of activated prototypes, which gives the square root of the number of prototypes.
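The rule of thumb can be checked directly. Writing $n$ for the number of prototypes and $k$ for the number of clusters (symbols introduced here for the derivation), a query activates the $k$ clusters plus roughly $n/k$ prototypes of the winning cluster, so

\[
\frac{d}{dk}\left(k + \frac{n}{k}\right) = 1 - \frac{n}{k^2} = 0 \quad\Longrightarrow\quad k = \sqrt{n}.
\]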

The hierarchical activation process starts by activating the clusters of the higher layers and then proceeds downward, only activating the prototypes of the lower layer that belong to the winner at the higher layer. First, the clusters of the highest layer are activated by a test sample. The winner propagates the pointers to the prototypes to be activated in the layer below. After activating only the selected prototypes in this layer, new pointers are retrieved from the winning cluster, and the process continues until the lowest layer F2 is reached, from which the corresponding classification labels can be obtained.
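A minimal sketch of this top-down selection for the two-layer setup used in this study (a single cluster layer above the F2 prototypes); all names and the data layout are assumptions made for illustration:

```python
import numpy as np

def activate(sample, weights, alpha=0.001):
    # Choice function of Eq. (4.2) for a set of weight vectors (one per row).
    return np.minimum(sample, weights).sum(axis=1) / (alpha + weights.sum(axis=1))

def hierarchical_f2_selection(sample, cluster_weights, members, proto_weights):
    """Top-down selection of F2 prototypes for one test sample.

    cluster_weights : weight matrix of the cluster layer, one cluster per row
    members         : members[k] = indices of the F2 prototypes stored for cluster k
    proto_weights   : weight matrix of the F2 prototype layer
    """
    winner = int(np.argmax(activate(sample, cluster_weights)))   # activate the cluster layer
    selected = np.asarray(members[winner])                       # follow the stored pointers
    proto_acts = activate(sample, proto_weights[selected])       # activate only these prototypes
    return selected, proto_acts   # labels are then read from the most activated F2 node(s)
```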

A nice byproduct of HARAM is that hierarchical IF-THEN rules can be easily extracted from the network, simplifying the knowledge representation of the learned data.
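As an illustration of what such a rule could look like (the conversion below is a sketch of ours, not the extraction procedure of the thesis): each F2 prototype is a hyperbox connected to labels and can therefore be rendered as one rule, while a cluster of the higher layer groups several such rules.

```python
def hyperbox_to_rule(r_min, r_max, labels, feature_names):
    """Render one F2 prototype (a hyperbox plus its labels) as an IF-THEN rule."""
    conditions = " AND ".join(
        f"{name} in [{lo:.2f}, {hi:.2f}]"
        for name, lo, hi in zip(feature_names, r_min, r_max)
    )
    return f"IF {conditions} THEN labels = {sorted(labels)}"

# Example: hyperbox_to_rule([0.1, 0.4], [0.3, 0.9], {"sports"}, ["x1", "x2"])
# -> "IF x1 in [0.10, 0.30] AND x2 in [0.40, 0.90] THEN labels = ['sports']"
```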

Because HARAM creates the clusters from prototypes rather than from training samples, it has several advantages compared to other approaches to hierarchical ART:

Figure 4.3: Potential Problems with HARAM

a) HARAM can be much more efficient, since prototypes are a compressed representation of the training samples, and thus the calculations for cluster-building are rapid; b) the training set need not be presented multiple times (in other hierarchical approaches a training sample has to be presented to each layer); c) because our extension only needs prototypes, it can also be used without any training data as an add-on to an already trained ARAM network; and d) clusters are direct generalizations of the prototypes, and thus hierarchical fuzzy rules can be extracted from the network. The evaluation of the hierarchical rules and their use in the understanding of the classification process are left for future research.

Although the process described above is performed offline, it can be modified in order to retain the valuable online learning property of ART. It is also possible to train the clusters directly on the training samples at the same time as the prototypes; however, this would result in a much longer training time.

Two potential (related) problems with HARAM involve the neighborhood and overlapping clusters. In the first case, if a data point is not covered by any prototype, it can still be covered by a cluster. A simple way to visualize this is to start from a rectangle (cluster C) and divide it equally into four subrectangles, as depicted in Figure 4.3. Two of these subrectangles on opposite corners, $P_i$ and $P_j$, represent prototypes; the other two represent empty spaces that were aggregated through the training process. If a point A lies in these empty spaces farther away, near the border, there may be another prototype $P_k$ just outside of C that is closer to this point than either of the two prototypes in C. Thus, the cluster may create borders between the prototypes that may be suboptimal. The other issue, which can also be seen in Figure 4.5, concerns cases in which a point lies in a region of overlap, that is, where multiple clusters have the maximum activation value at the same time. To overcome both problems, we propose not using the WTA rule of Eq. (2.3), but rather its weaker form with more than one winning cluster to select the prototypes for activation.¹⁰ Several experiments were performed in order to determine the appropriate number of clusters for participation in the selection of prototypes. If not explicitly specified, the Number of Selected Clusters (NSC) that contribute to the prototype activation is set to three.

¹⁰One important issue with high-dimensional spaces is the consideration of the neighborhood. When only one winning cluster is activated, some prototypes might be close to it despite their membership in a neighboring cluster. The use of multiple winners increases the probability of activating all prototypes that are close to the presented test sample.

Figure 4.4: Schema for training and testing with HARAM

Figure 4.5: HARAM Schema: Circle-in-Square Problem
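A relaxed winner selection of this kind amounts to taking the NSC most activated clusters instead of a single winner and pooling their stored prototype identifiers. The helper below is illustrative only; its names and data layout are assumptions:

```python
import numpy as np

def select_prototypes(cluster_acts, members, nsc=3):
    """Relaxed WTA over the cluster layer.

    cluster_acts : activations of all clusters for the current test sample
    members      : members[k] = indices of the F2 prototypes stored for cluster k
    nsc          : Number of Selected Clusters contributing to the prototype activation
    """
    top_clusters = np.argsort(cluster_acts)[::-1][:nsc]            # NSC highest activations
    selected = np.unique(np.concatenate([members[k] for k in top_clusters]))
    return selected                                                # prototypes to activate at F2
```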

The schematic training and test procedures for HARAM are depicted in Figure 4.4.

The dashed arrows point in the direction of the training. The training samples are used for learning the prototypes; afterward, clusters are created using the learned prototypes.

The test procedure is depicted by the solid arrows: each test sample is passed to the cluster layer, which selects the prototypes (by their IDs) to be activated and to participate in the prediction.

Figure 4.5 shows HARAM handling the well-known “circle in the square” problem [CGM+92]. The ARAM network trained with a vigilance of 0.95 had 125 prototypes. In the upper layer, only five clusters were created with the clustering vigilance set to 0.8.

In the classification phase with the sample (0.5, 0.5), the middle cluster (number 3) in the higher layer won. It contained 30 prototypes to be activated, plus 42 from both its neighbors. In total, only 77 nodes across the two levels (the 5 clusters plus 30 + 42 = 72 prototypes) were activated while classifying the sample, requiring only about 60% of the effort of the standard ARAM network.

4.2.2 ML-HARAM

The second difference between HARAM and ARAM concerns their multi-label versions.

ML-ARAM [Sap09b] increases its precision by relaxing the WTA rule, allowing multiple prototypes to be involved in the label-ranking calculation. To that end, it first calculates the difference in activation between the most activated and the least activated prototype.

This difference is subsequently multiplied by a user-defined ML-threshold to select a fraction of highly activated prototypes for the creation of label-rankings. For more details on this process and the subsequent transformation from label-rankings to multi-labels, see [Sap09b].

In HARAM, instead of all prototypes being activated, an estimate of the lowest activation value is made: the least activated prototype from the selection made by the clustering process is used as the lower bound. The main difference is that the user-specified threshold must be adjusted accordingly. The variation resulting from this method is noticeable in the experiments only in the third or fourth decimal place.
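A sketch of this relaxed selection for the label ranking; the default threshold value, the comparison direction, and the function names are assumptions, since the thesis describes the computation only verbally and refers to [Sap09b] for details:

```python
import numpy as np

def prototypes_for_ranking(proto_acts, ml_threshold=0.05):
    """Select the highly activated prototypes that build the label ranking.

    proto_acts   : activations of the prototypes pre-selected by the cluster layer;
                   their minimum serves as the estimated lower bound in ML-HARAM
    ml_threshold : user-defined ML-threshold applied to the activation span
    """
    span = proto_acts.max() - proto_acts.min()
    keep = proto_acts >= proto_acts.max() - ml_threshold * span
    return np.nonzero(keep)[0]    # indices whose activations enter the label ranking
```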

A strategy more faithful to the original approach, which is associated with higher computational costs and is therefore disregarded here, involves adding the prototypes belonging to the least activated cluster to the selected prototypes and then using their lowest activation as the lower bound. With this method, the threshold remains in the same order of magnitude as in the original method.

One major advantage of this approach over RAkEL and HOMER, which are built by dividing the label space rather than the feature space, is that it does not assume that a labelset or sub-labelsets form a neighborhood, i.e. that similar labels occupy a limited, compact region of the feature space. ML-ARAM divides the feature space based on the prototypes, which may or may not have similar labels. This is a more promising strategy from a learning and generalization perspective, since label correlations may be beneficial or may interfere with the classification, and therefore the feature space, not the label space, should be used as the basis for the division.