
To achieve the research goals and meet the requirements discussed above, the proposed approach focuses equally on two different research fields related to data mining: MLC and Knowledge Extraction (KE) by association analysis. In Figure 1.4, a schematic representation of the approach is shown. The research directions taken in this study were chosen in order to design a system which is capable of achieving the goals better than the state of the art.

The new MLC algorithm Multi-Label HARAM (ML-HARAM [BS15a]) was proposed to handle large datasets rapidly and accurately: it consists of a hierarchically modified Fuzzy ML-ARAM [Sap09a] network optimized for different scenarios, such as GPU parallelism for large sample sets or sparsity for large feature sets. In addition, the use of IMRARs between multiple ontologies was studied in depth. Rare ARs [ASR10] refer to connections between labels in which at least one of the labels does not often occur⁴ and which can thus be surprising. The improvement in classification performance was achieved through post-processing with cross-ontology ARs in the multi-label setup. The schema, focusing on the improvement, is presented in Figure 1.3: the rule classifier (ML-HARAM) produces predictions for two ontologies A and B (as a special case) for the incoming objects. The relations extracted from ontologies⁵ A and B (with IMRARs) are then used in the improvement module to output improved predicted multi-labels. As can be seen, the MLC and KE aspects are first separated and then merged in the improvement section.
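The notion of a rare cross-ontology AR can be illustrated with a minimal sketch. It is not the IMRAR procedure itself; it only applies the definition of support given in the text (occurrences divided by the total number of transactions) to pairs of labels from two ontologies. The function names, the label names, and the thresholds `max_support` and `min_confidence` are assumptions made for illustration.

```python
from itertools import product


def support(itemset, transactions):
    """Fraction of transactions that contain every label in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def rare_cross_ontology_rules(transactions, labels_a, labels_b,
                              max_support, min_confidence):
    """Return rules a -> b (a from ontology A, b from ontology B) in which
    at least one label is rare (support <= max_support) and the rule
    confidence is at least min_confidence. Thresholds are hypothetical."""
    rules = []
    for a, b in product(sorted(labels_a), sorted(labels_b)):
        s_a = support({a}, transactions)
        s_b = support({b}, transactions)
        s_ab = support({a, b}, transactions)
        if s_a == 0 or s_ab == 0:
            continue  # the pair never co-occurs, so no rule is formed
        confidence = s_ab / s_a
        is_rare = min(s_a, s_b) <= max_support
        if is_rare and confidence >= min_confidence:
            rules.append((a, b, s_ab, confidence))
    return rules


# Toy multi-label training set: each transaction is the label set of one object.
transactions = [
    {"A:food", "B:germany"},
    {"A:food", "B:germany"},
    {"A:food", "B:germany"},
    {"A:food", "A:cocoa", "B:ghana"},
    {"A:food"},
]
rules = rare_cross_ontology_rules(
    transactions,
    labels_a={"A:food", "A:cocoa"},
    labels_b={"B:germany", "B:ghana"},
    max_support=0.2,
    min_confidence=0.8,
)
print(rules)
```

In this toy set, only the pair of infrequent labels that always co-occur survives both filters; the frequent pair is discarded because neither of its labels is rare.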

A more detailed workflow of the entire system, with a division into training and test examples, is depicted in Figure 1.5. Again, the objects are the input for the rule classifier, which outputs predicted labels belonging to separate ontologies; alternatively, its output can be reviewed in the Rule Explorer. With the help of the ARs extracted from the training examples, the labels meeting certain criteria are set in a post-processing step, increasing the prediction quality. Multi-label classifiers that attempt to learn and predict such deep connections require much more time and memory; thus, they are unable to handle large datasets and further exacerbate the problem of overfitting. Beyond a certain depth, it is better to break off the search and rely on label correlation instead.

The key contributions of this study are the extended rule classifier, the post-processing strategies for the improvement of predicted multi-labels (of multiple ontologies) and the rule-exploration system which allows rules to be searched in the feature and label spaces.

The individual components of the system are described below in detail.

ARAM was chosen as the base classification algorithm because of the advantages listed below. In the improvement component, it could easily be replaced by other MLC methods. The algorithm can learn online (instance-incremental learning), and rules can be easily extracted⁶; moreover, the learning process is very intuitive. ML-HARAM is an extension of ML-ARAM that allows large datasets to be classified using only a small fraction of the time and memory needed by the original algorithm.

It also achieves comparable classification accuracy in the MLC setup. This modification also enables the extraction of hierarchical classification rules, which group multiple neighbouring classification rules into clusters. Through the hierarchical approach using the neighbourhood, we can identify commonalities between the rules and create subspaces, which is an important aspect for large, high-dimensional data. Analyzing such structures can further extend the KE.
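A minimal sketch may help to picture how a classification rule maps to an IF-THEN pattern and how neighbouring rules can be grouped. It assumes that each rule is represented as an axis-aligned hyperbox in feature space (a lower and an upper bound per feature) with an attached label set, and that a cluster is the bounding box of its member rules. This representation, the class `Rule`, and both helper functions are illustrative assumptions, not the ML-HARAM implementation.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    lower: list        # per-feature lower bounds of the hyperbox
    upper: list        # per-feature upper bounds of the hyperbox
    labels: frozenset  # labels predicted when the rule fires


def to_if_then(rule, feature_names):
    """Render a hyperbox rule as an IF-THEN string."""
    conditions = [
        f"{lo} <= {name} <= {hi}"
        for name, lo, hi in zip(feature_names, rule.lower, rule.upper)
    ]
    consequent = ", ".join(sorted(rule.labels))
    return "IF " + " AND ".join(conditions) + " THEN " + consequent


def bounding_cluster(rules):
    """Group rules into one parent rule: the smallest hyperbox that
    contains all member boxes, carrying the union of their labels."""
    lower = [min(values) for values in zip(*(r.lower for r in rules))]
    upper = [max(values) for values in zip(*(r.upper for r in rules))]
    labels = frozenset().union(*(r.labels for r in rules))
    return Rule(lower, upper, labels)


r1 = Rule([0.1, 0.2], [0.3, 0.4], frozenset({"cocoa"}))
r2 = Rule([0.2, 0.1], [0.5, 0.3], frozenset({"food"}))
cluster = bounding_cluster([r1, r2])

print(to_if_then(r1, ["f1", "f2"]))
print(to_if_then(cluster, ["f1", "f2"]))
```

Under these assumptions, a sample only needs to be compared against the member rules of clusters whose bounding box is close to it, which is one way to read the time savings attributed to the hierarchy.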

⁴ Specifically, at least one label does not have high support (i.e. the number of occurrences divided by the total number of transactions).

⁵ Although two ontologies are used consistently throughout this study, the extension to more than two is easily implemented.

⁶ Such rules can be ported with little effort to the well-known IF-THEN rule patterns.

Figure 1.5: Implemented System: The input and data components are green and yellow, processing steps are blue and the output components are orange. Study accomplishments are marked in bold.

Another important part of the developed system is the extraction of ARs by IMRARs used in a post-processing method for MLC improvement. In the proposed method, the predictions of the classifier are compared to the selected rare ARs. The main idea is to rely on good-quality predictions (emerging from one ontology), utilizing them as antecedents of the selected ARs, and to enforce weak predictions if they are the consequents.
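The post-processing idea described above can be sketched as follows. The sketch assumes that the classifier returns a score in [0, 1] per label and that a rule is a pair (antecedent, consequent); the function `improve` and the cut-offs `strong`, `weak` and `threshold` are hypothetical and are not taken from the study.

```python
def improve(scores, rules, strong=0.8, weak=0.3, threshold=0.5):
    """Enforce weakly predicted consequents of rules whose antecedent
    was predicted with high confidence.

    scores: dict mapping label -> classifier score in [0, 1]
    rules:  iterable of (antecedent_label, consequent_label) pairs
    Returns the original and the improved predicted label sets.
    """
    predicted = {label for label, s in scores.items() if s >= threshold}
    improved = set(predicted)
    for antecedent, consequent in rules:
        s_ante = scores.get(antecedent, 0.0)
        s_cons = scores.get(consequent, 0.0)
        # Good-quality antecedent, and a consequent that narrowly missed
        # the decision threshold: set the consequent label.
        if s_ante >= strong and weak <= s_cons < threshold:
            improved.add(consequent)
    return predicted, improved


scores = {"A:cocoa": 0.9, "B:ghana": 0.4, "B:germany": 0.1}
rules = [("A:cocoa", "B:ghana"), ("A:cocoa", "B:germany")]
predicted, improved = improve(scores, rules)
print(sorted(predicted))
print(sorted(improved))
```

Requiring a minimum score for the consequent is one possible design choice: it keeps a rule from forcing a label that the classifier rejected outright.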

The system allows the extraction of ARs between the labels from the training set (and, for debugging, from the test set). It also allows the extraction of hierarchical classification rules from ML-HARAM. These are combined in a rule-exploration system that enables the analysis of the classification process in a hierarchical fashion, grouping rules and searching different levels of the hierarchies. The Rule Explorer assists users in the analysis and understanding of the process.

The improvement of predictions with ARs has produced better results than the state of the art [PF08, BS15c]. The Rule Explorer goes a step further to create an explorative analysis system. To the best of the author’s knowledge, there is still no comparable data mining system that can extract hierarchical classification rules from multi-ontology problems, scale well with the number of dimensions in the feature/label space in terms of accuracy and classification time, and enable KE with classification rules and association analysis of the multiple ontologies. The system integrates all of these aspects, each enriching the others. It is designed to make big data MLC more precise, more efficient and more understandable. The workflow implemented in KNIME is depicted in Figure 1.6, with the components of Figure 1.4 outlined in the respective colors.

Figure 1.6: KNIME Workflow Example (node labels in the diagram: Rcv1-v2, Meka Predictor, ARFF Reader, Row Splitter, GAR+AR Extraction, ML-HARAM, Learner, Prediction, Improver, Fuzzy Rule Extractor, Rule Explorer)

<text><p>Germany’s cocoa grind figure for the number quarter of XXXX will probably be ready for release on XXXX, the confectionery industry association XXX.</p><p>The data, a key pointer to chocolate demand, showed an XX.XX percent year-on-year XXXX in the number quarter to XX,XXX.X tonnes.</p><p>–German City newsroom </p></text>

true: AGRICULT. AND HORT. + AGRICULT., FORESTRY... + AGRICULT. + COCOA GROWING + CONFECTIONERY + FOOD, DRINK... + PROCESSING INDUSTRIES

predicted: FOOD, DRINK.. + PROCESSING INDUSTRIES

ranks: PROCESSING INDUSTRIES + FOOD, DRINK ... + CONFECTIONERY + COCOA GROWING + FINANCIAL AND BUSINESS SERVICES + AGRICULTURE + AGRICULTURE, FORESTRY AND FISHING + AGRICULTURE AND HORTICULTURE + METAL MANUFACTURING + METALS AND MINERALS + ...

improved: PROCESSING INDUSTRIES + FOOD, DRINK ... + CONFECTIONERY + COCOA GROWING + FINANCIAL AND BUSINESS SERVICES + AGRICULTURE + AGRICULTURE, FORESTRY AND FISHING + AGRICULTURE AND HORTICULTURE + METAL MANUFACTURING + METALS AND MINERALS + ...

Table 1.1: Example: green = true positive, brown = false negative, red = false positive, orange = true negative; the text was anonymized for copyright reasons.