
2.1.2 Requirements for the Classifier

As described above, a system addressing today's challenges must possess certain key properties. In the following, we discuss which requirements for the base classifier are most important for integration into our Data Mining System (DMS). The classifier should be able to handle MLC tasks and make high-quality predictions with respect to several performance measures; it should also possess a number of key properties for integration into the proposed DMS.

We have described the system as being capable of handling large datasets and of extracting rules from the classification model. This goal can be achieved in multiple ways, but we also regard online learning as a key requirement for a classifier of large datasets, since it facilitates knowledge extraction as well as many applications such as knowledge translation, process understanding and error search, as discussed above.

Online Learning

An MLC algorithm can learn online if, when an additional sample is presented after the initial training, the algorithm does not need to access any previously presented training samples [Opp]. In other words, in order to integrate a new example into the model, the learner requires only the example in question and the already learned model.
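To make this contract concrete, the following minimal Python sketch (an illustration only, not part of the proposed DMS; it assumes a Binary Relevance decomposition with one online perceptron per label, and all names are ours) shows a learner whose update step touches nothing but the current model state and the newly presented sample:

    import numpy as np

    class OnlineBRPerceptron:
        """Minimal online multi-label learner: one perceptron per label
        (Binary Relevance). Each update uses only the current weights and
        the new sample, so no previously seen data must be stored."""

        def __init__(self, n_features, n_labels, lr=0.1):
            self.W = np.zeros((n_labels, n_features))
            self.b = np.zeros(n_labels)
            self.lr = lr

        def update(self, x, y):
            # x: feature vector; y: binary label vector of length n_labels
            pred = (self.W @ x + self.b) > 0
            err = y.astype(float) - pred.astype(float)  # -1, 0 or +1 per label
            self.W += self.lr * np.outer(err, x)        # per-label perceptron step
            self.b += self.lr * err

        def predict(self, x):
            return (self.W @ x + self.b) > 0

Presenting the samples one by one through update() satisfies the definition above: at no point does the learner need access to a previously presented sample.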

The obvious advantage is that, because algorithms using online learning do not need all of the data at once, the memory consumption of the implementations can be kept within an acceptable range. This property is especially important for large datasets [ZGH10]. In such cases, the memory and time consumption of an MLC algorithm must be minimized.

Comparing the sample to a compressed model is a step in the right direction. In the special case of large data, it is also likely that the data will change or be extended, i.e. the process of data collection may not yet be finished or refined. Furthermore, labels may be added to the labelset, as in the case of gene function prediction with the Gene Ontology, where this ontology (labelset) has been growing steadily for years. For most classifiers, this would mean a complete retraining of the model; however, some models using online learning can address this issue (changing data) more elegantly.

Although MLPs allow online learning through recursive gradient descent, stability is only achieved with multiple iterations over the training samples (batch mode). This theoretically limits the convergence to the lowest error rate, since, as [BA98] states, convergence can only be achieved through simulated annealing (decreasing the learning rate), which increases the probability of being trapped in local minima.2 A fast-learning mode (high learning rate) would result in the loss of previously presented patterns, causing oscillations in the learning. However, several authors argue that the stochastic nature of online learning makes it possible to occasionally escape from local minima [WM03].3 In [WM03], empirical evidence was found indicating that convergence can be reached significantly faster using online learning than batch learning, with no apparent differences in accuracy.
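To make the annealing schedule mentioned above concrete, the online gradient-descent update can be written in its standard form (the symbols $\eta_0$ and $\tau$ are illustrative constants, not taken from [BA98]):

    $w_{t+1} = w_t - \eta_t \, \nabla E(w_t; x_t, y_t), \qquad \eta_t = \frac{\eta_0}{1 + t/\tau}$

Each step uses only the current sample $(x_t, y_t)$; the decaying rate $\eta_t$ stabilizes learning over time but, as argued above, increases the probability of settling into a local minimum.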

As discussed in [WM03], online learning for neural networks (which appears under several other names in the literature) is performed sample by sample, but generally over many epochs in order for the patterns stored in the network to converge. Here, we require a fast and stable online learning property such that the algorithm can perfectly learn a presented pattern after only one presentation (one epoch).

Rule Interpretability

Training a classification model to solve a complex task such as the one described in Section 1 is a costly enterprise. The data preparation, the choice of the classification algorithm and the exploration of the parameter space all take time, as does the study of the classification model should it not perform well. Classification rules are therefore an indispensable tool when dealing with large datasets. They provide insight not only into the classification process but also into the patterns the model extracts/learns from the training data, and thus allow knowledge to be extracted in the form of rules [CT95]. Such rules can also enhance the understanding of the problem itself, as many correlations between features and classes become visible through them, pointing to underlying processes.

Furthermore, this knowledge can be ported to other tasks and classification models, i.e. adapting the learned model instead of retraining can save resources. Especially in text classification, the extraction of classification rules allows the application of semantic methods, extending the features and creating meta-level relations between features and classes, thereby enhancing both the interpretability of the rules extracted from the classifier [BS14b] and possibly the prediction quality.

2Stochastic gradient descent, an advanced and highly recommended method for MLPs, also requires all the samples to be known in advance.

3Online learning is also referred to as online training in the literature.

With regard to human-understandable rules, trainable classifiers can be divided into three general groups. The first group comprises the models from which classification rules are not easily extracted, called black-box approaches; prominent examples are Multi-Layer Perceptrons (MLPs) with backpropagation [Wer74] and Support Vector Machines (SVMs) [CV95]. The second group contains the rule classifiers, such as decision trees and Fuzzy Rule Learners, but also Fuzzy Adaptive Resonance Theory (Fuzzy-ART) networks. The last group encompasses the lazy learners, such as k-Nearest Neighbors (kNN), where no model is extracted from the data.

Although there are several methods to extract rules from popular black-box classifiers, the resulting classification rules are generally difficult to describe. Some of these classifiers (e.g. MLPs and SVMs) are based on the simple idea of hyperplanes dividing the space and assigning a class to each slice of the space. The multiple layers of an MLP and the kernel trick (using a kernel to map a non-linear problem to a linear one, often seen in SVM classifiers) allow the two methods to handle non-linear problems at the price of a more elaborate model. Owing to this non-linearity, simple linear rules cannot be extracted easily; it may even be the case that a rule cannot be formulated in a simple form. For text mining, for large data and in MLC, however, linear kernels are normally used, facilitating the extraction of rules. Still, many factors make the rules difficult to process. BR is usually applied as the MLC method in these cases, i.e. the dependencies between labels are ignored, which makes analyzing the relation between rules and labels tedious. Furthermore, in MLC, the hyperplanes may be used in several ways, dividing not only classes but groups of classes (LP). The hyperplanes abstract away the topology of the sample structure; this usually aids generalization, but it makes rule analysis more difficult.
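Since a linear kernel leaves the hyperplane weights directly interpretable, a crude rule surrogate can at least be read off per label. The following sketch (purely illustrative; it assumes a BR-style linear model exposing a weight matrix W of shape labels x features, and all names are ours) lists the most strongly positively weighted features per label:

    import numpy as np

    def top_features(W, feature_names, label_names, k=3):
        """For each label's linear hyperplane, list the k features with the
        largest positive weights -- a crude, human-readable rule surrogate."""
        rules = {}
        for j, label in enumerate(label_names):
            idx = np.argsort(W[j])[::-1][:k]  # indices of the k largest weights
            rules[label] = [feature_names[i] for i in idx]
        return rules

Such weight lists are only a first approximation of rules; the label-dependency (BR) and topology issues described above remain.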

Regarding the third group, although it is easy to explain the classification of a single sample with lazy learners, the extraction of easy-to-read rules is much more complicated, particularly in the MLC case. This will be discussed further in the context of ML-kNN.

The classification algorithms that are not lazy learners, especially the ANNs, can also be divided into two major groups: Margin-based Learners (MbLs) and Prototype-based Learners (PbLs).4 The first seeks to separate samples belonging to different classes using a margin, based on the hyperplane separation theorem or on the concept of maximum-margin hyperplanes [BV04]; the second seeks to cluster similar patterns of the same class (pattern clustering, e.g. Fuzzy ART [KSPK15, CT95]).5

The MbL paradigm has been used to create a large number of successful neural networks and general classifiers. The main concept is that in higher-dimensional spaces or after a kernel modification (or a non-linear transformation), samples of different classes can be better separated (i.e. with less error) by a hyperplane than in the initial space. The greater the distance between the samples and the hyperplane, the greater the so-called “generalization” and the more accurate the estimated performance of the resulting learned pattern.
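For reference, the maximum-margin principle can be written as the standard hard-margin optimization problem (a textbook formulation, cf. [CV95, BV04]):

    $\min_{w,b} \; \tfrac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i (w^\top x_i + b) \ge 1 \;\; \forall i$

where the resulting margin has width $2/\|w\|$; maximizing this width is what is informally called “generalization” above.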

4We are interested here in a general categorization between the division and aggregation of features, represented by MbL and PbL respectively.

5However, some of the algorithms mentioned here are not so easy to classify: Bayes learners do not fall into either of these groups, yet they try to group data together, seeking similarities. Decision trees are also difficult to categorize into these two groups, but one main aspect is that they search for the differences between the samples.


PbL, on the other hand, is a class of classifiers based on the concept that several representatives or prototypes, each covering a cluster of patterns, are responsible for the classification. The most prominent representative of this class is the Self-Organizing Map (SOM). The main aspect of these approaches is that similar input patterns are grouped together, resulting in similar outputs. The similarity between the input sample and the prototypes is generally calculated using a distance measure, which is minimized during the learning phase.
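The following minimal sketch (an illustration of the nearest-prototype principle, not the Fuzzy-ART update rule; all names and parameters are ours) shows how prototypes are matched by distance, moved toward covered samples during learning, and used for prediction:

    import numpy as np

    class NearestPrototypeLearner:
        """Minimal prototype-based learner: each prototype covers a cluster
        of similar inputs and stores the label set of that cluster."""

        def __init__(self, radius=0.5, lr=0.1):
            self.protos, self.labels = [], []  # prototype vectors, label sets
            self.radius, self.lr = radius, lr

        def update(self, x, y):
            x = np.asarray(x, dtype=float)
            if self.protos:
                d = [np.linalg.norm(x - p) for p in self.protos]
                i = int(np.argmin(d))
                if d[i] <= self.radius:
                    # Move the winner toward the sample (distance minimization).
                    self.protos[i] += self.lr * (x - self.protos[i])
                    self.labels[i] |= set(y)
                    return
            self.protos.append(x.copy())  # no prototype close enough: open a new one
            self.labels.append(set(y))

        def predict(self, x):
            x = np.asarray(x, dtype=float)
            i = int(np.argmin([np.linalg.norm(x - p) for p in self.protos]))
            return self.labels[i]

Note that this learner also satisfies the one-epoch online property required above: each sample either refines an existing prototype or opens a new one, and is never needed again.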

The PbL approach has another important property: the knowledge gathered by the classifier is condensed into and represented by prototypes. Since each prototype is responsible for a defined region, a straightforward analytic rule extraction method is possible. This is particularly true for Fuzzy ARTMAP [CT95]. These rules can take IF-THEN form and are thus easy for humans to read. Rules can also be verified and checked for outliers and noise. With that in mind, users can debug the rule system they have created.
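For instance, a prototype covering a hyperbox in a normalized feature space translates directly into a rule of the following form (the features and labels here are invented purely for illustration):

    IF 0.2 <= tf("gene") <= 0.6 AND 0.0 <= tf("repair") <= 0.3
    THEN predict {DNA repair, cell cycle}

Each such rule can be inspected, verified against domain knowledge, or deleted if it turns out to have been induced by noise.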

Black-box approaches can learn and prioritize properties that are not the ones based on which humans would normally assign samples to a given class. Thus, if a black-box learner gets stuck in a local minimum and learns an odd property of the problem, it can be difficult to remove that property from the system. With prototypes, one can identify which training sample is problematic and remove it from the training collection. Rules may be extracted from MbLs and converted into the rules used in PbLs and rule classifiers; nonetheless, PbL approaches are better suited for online learning in one epoch, rule translation and error search.

These issues will serve as a basis for the examination of the selected classifiers presented below.