

6.2 Use Cases

6.2.1 Use Case 1: Misprediction Analysis

The first use case shows how classification predictions can be examined in the Rule Explorer. The first sample of the first cross-validation slice will be examined; it was

Figure 6.2: Rule activations for the first sample of the first cross-validation slice

a news article about the yearly report of a company (Glenealy Plantations). The Topics assigned to it (true labels) were “PERFORMANCE”, “ACCOUNTS/EARNINGS”,

“ANNUAL RESULTS” and “CORPORATE/INDUSTRIAL”.

The rule with the highest activation had all the labels except “ANNUAL RESULTS”

and had very similar news articles associated with it, yet none of them were yearly reports; they were quarterly or half-year results. This first rule was selected mainly because one of the articles involved was about a report from a Malaysian company engaged in gaming, hotels, property investment and development, plantations and stockbroking.

It covered most of the words used in the presented samples with very similar TF-IDF values. The other four selected rules did have this missing label, so that in the end the full prediction was correct. However, in three of these prototypes/prediction rules the word plant was missing. In the one prediction rule that contained plant, the words kuala and lumpur did not have a high enough TF-IDF value to achieve the highest activation. This was a consequence of the normalization, since the TF and IDF values were the same in both news articles. Interestingly, the rule with the highest activation did not have the highest |A∧W|: its value was only 4874.575, while the second prototype had the highest with 4874.970. This was because its |W| was only 4875.037, while that of the second was 4875.584. This means that the second was either composed of fewer words or its words had a lower TF-IDF value1. As we can also see, the first one had more news articles involved in its creation. Therefore, the smaller hyperbox (prototype) was selected, a behaviour that is incorporated in the design of ML-ARAM. Yet for text classification this feature should be taken into consideration: although the column-wise normalization chosen in the preprocessing gives higher prediction quality, it indirectly encodes the document length. A cosine normalization would be a solution, yet sometimes the length of a document points to its format, and this can also be an important feature.
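The selection mechanism described above can be sketched with a minimal fuzzy-ART-style activation function. This is an illustration of the |A∧W|/|W| trade-off only, not the exact ML-ARAM implementation; the vectors below are invented:

```python
import numpy as np

def activation(sample, prototype, alpha=1e-6):
    """Fuzzy-ART-style activation: |A ∧ W| / (alpha + |W|),
    where ∧ is the component-wise minimum and |.| the L1 norm."""
    match = np.minimum(sample, prototype).sum()  # |A ∧ W|
    norm = prototype.sum()                       # |W|
    return match / (alpha + norm), match, norm

# Two hypothetical prototypes: the match value |A ∧ W| is identical,
# but the smaller hyperbox (smaller |W|) wins the activation comparison.
A  = np.array([0.9, 0.8, 0.0, 0.7])
W1 = np.array([0.9, 0.8, 0.1, 0.7])  # smaller |W| = 2.5
W2 = np.array([0.9, 0.8, 0.4, 0.7])  # larger  |W| = 2.8

act1, m1, n1 = activation(A, W1)
act2, m2, n2 = activation(A, W2)
assert m1 == m2 and n1 < n2 and act1 > act2
```

This mirrors the situation above: a prototype with a slightly lower |A∧W| can still attain the highest activation when its |W| is small enough.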

The predictions regarding the industries deviate greatly from the true labels. First of all, the keywords malay, kual and lumpur, which are the stemmed words for Kuala Lumpur in Malaysia, are normally connected to the words compan, million, net and profit when the results of financial holding companies are published. One of them is situated in Malaysia, named Mycom Bhd (23522, rule 3818 and 17079, rule 2769). In rule 1253 there is a news article about the results of Malaysia’s Sime Darby Bhd, a diversified conglomerate with interests including plantations and banking, among others. There we also have the same style: an article about the financial results of a plantation company in Malaysia. Still, it is labelled here with “WHOLESALE DISTRIBUTION”, which is part of “DISTRIBUTION, HOTELS AND CATERING”, so we also find a partial mislabelling here. The fourth activated rule has Timber Processing as the sample, and most of the news articles associated with it are about timber and financial results. But the word plantat does not appear; the words are garden, timber and wood, but no plantations.

Thus, its activation is lower than that of the others (this is responsible for a gap of 0.569 out of a total of about 1.00).

1A point (a collapsed hyperbox, i.e. a hyperbox of one sample point) would have a value equivalent to the number of features, in this case 5000.

Figure 6.3: Different activations for sample 0 for neurons 2733 and 2753 (panels (a) and (b))

Although the label “TIMBER PROCESSING” is assigned alongside many others, they are not linked to it. In particular, the news article was too short and covered only the financial results. Background information about the companies might have helped the classifier solve the task better. Also, “TIMBER PROCESSING” has no connections via JacDif which could be helpful in improving the prediction; the rule with the highest value comes from “EQUITY MARKETS” with only 0.00021. Still, a semantic use of the text might have improved the results here, linking plantation and timber and thus helping prototypes with closer semantic meaning achieve higher activations.

6.2.2 Use Case 2: Improvement Analysis

We will now examine a sample (number 307 of slice 1) with a successful improvement.

Interestingly, not all Topics labels were correctly predicted. We have three wrong predictions, two correct predictions which were not set because of their low ranking, and one false negative. The problem is that the sample is about the El Niño impact on Philippine farm growth. Thus, it is indirectly about “SOFT COMMODITIES MARKETS”, yet this label is not used. “WEATHER” is normally an important issue in plantation- and seed-related news, so many of its words are typical of news about harvest results.

In the Industries label set the prediction should be easy: only three labels, all relevant to agriculture, should be set (see the activation results for Industries in Figure 6.4). Yet, since farms are mainly related to the label “GRAIN, CEREALS FARMING”, many highly activated rules carry this label alongside “AGRICULTURE”. Further, some rules do not carry it, causing the “AGRICULTURE” label to have a ranking not high enough to be set. In particular, rule 263, referring to soya growing, kept the prediction from setting the label “AGRICULTURE”. The activation of this rule is higher than that of the next rule (3143) (Act(W) = |A∧W|: Act(W263) = 4873.721 and Act(W3143) = 4873.094), but the norm of its prototype is lower (|W263| = 4875.834 and |W3143| = 4876.425). So

Figure 6.4: Rule Explorer: Sample 306 Activation

Figure 6.5: Rule Explorer AR connection examination: example for “PRODUCTION/SERVICES” connections to other nodes, in particular to “AGRICULTURE”. Line thickness indicates connection strength.

mainly the style of the news diverged enough to yield a wrong classification (the same words appearing under different labels with similar TF-IDF values); also the number of news articles assigned to each rule (the ARTb weights), 24 for rule 263 and 19 for rule 3143, might have played a role in the ranking value of the rule.

Still, the rule “PRODUCTION/SERVICES → AGRICULTURE” with a Kulc value of 0.33 can be applied to this sample. Also, the ratio between the scores of the consequent and the antecedent is 0.83, relatively high, allowing strategyHCb to be applied and increasing the recall of the sample. We can see from Figure 6.5 that many strong relations connect “PRODUCTION/SERVICES” to the tree of “AGRICULTURE, FORESTRY AND FISHING”.
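The Kulc value used here can be computed directly from label co-occurrence counts; a minimal sketch, with invented counts rather than values from the dataset:

```python
def kulc(n_ab, n_a, n_b):
    """Kulczynski measure for a rule A -> B: the mean of the two
    conditional probabilities P(B|A) and P(A|B), estimated from
    co-occurrence counts."""
    return 0.5 * (n_ab / n_a + n_ab / n_b)

# Hypothetical counts: samples labelled with the antecedent A
# ("PRODUCTION/SERVICES"), the consequent B ("AGRICULTURE"),
# and both labels together.
n_a, n_b, n_ab = 120, 60, 30
print(kulc(n_ab, n_a, n_b))  # 0.5 * (30/120 + 30/60) = 0.375
```

A strategy such as strategyHCb would additionally check the classifier's score ratio between consequent and antecedent before setting the consequent label.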

6.3 Conclusion

The Rule Explorer allows a precise and fast examination of a prediction in a cross-ontology setup. Further, the text mining adaptation helps to find important words, indicating actionable options such as increasing the weight of some prediction rule for a certain word.

It also allows examining why an improvement rule was applied and why it resulted in better prediction quality. An examination of the rules involved and the values of the interestingness measures is easy and fast.

The Rule Explorer can be a valuable asset for data mining and text mining tools. The examination of the cross-ontologies is efficient and can rapidly lead to prediction model improvements. In combination with ML-ARAM it also allows reconstructing the creation of rules, visualizing the traceability of the model.

One of the greatest challenges in multi-label classification is that considering the combinations of labels (label co-occurrences) in the algorithms can significantly increase the complexity of the whole classification task. In order to postpone this combinatorial challenge, the direct application of standard data mining methods to the predicted data, improving the predictions, can diminish the problem. However, with growing amounts of data it becomes increasingly difficult both to perform such improvements successfully and to analyze the built model so as to ensure the quality of the solution and the coverage of the problem by the model. Overcoming both challenges is a paramount goal for the following reasons:

The former issue, improving the predictions, relates to the fact that in large datasets prediction algorithms cannot analyze every co-occurrence of labels and attributes.

A generalization is imperative, and labels with small support will suffer from it, often being treated as noise. Post-processing the predictions becomes a logical step in which these special cases can be considered, lowering the complexity of the main classification task.
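Such a rule-based post-processing step can be sketched as follows. This is an illustrative implementation assuming a simple score-ratio condition in the spirit of the strategies discussed in the use cases; the rule, labels, scores and threshold are invented:

```python
def apply_rules(predicted, scores, rules, min_ratio=0.5):
    """Post-process a predicted label set: for every rule A -> B whose
    antecedent A was predicted, add the consequent B if the classifier's
    score ratio score(B)/score(A) is high enough."""
    improved = set(predicted)
    for antecedent, consequent in rules:
        if antecedent in improved and consequent not in improved:
            ratio = scores.get(consequent, 0.0) / scores.get(antecedent, 1.0)
            if ratio >= min_ratio:
                improved.add(consequent)
    return improved

# Invented example: the consequent label scored too low to be set
# directly, but the rule and the score ratio (0.5/0.6 ≈ 0.83) add it.
scores = {"PRODUCTION/SERVICES": 0.6, "AGRICULTURE": 0.5}
rules = [("PRODUCTION/SERVICES", "AGRICULTURE")]
out = apply_rules({"PRODUCTION/SERVICES"}, scores, rules)
assert out == {"PRODUCTION/SERVICES", "AGRICULTURE"}
```

The main classifier stays untouched; only the rare-label special cases are handled in this cheap second pass.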

The latter issue, analyzing the built model, relates to the general question of knowledge extraction and of optimizing the model for the given tasks, either by adapting the model or by examining attributes and optimizing the input space. The solutions to both issues can be combined and reinforce each other.

Also, the accumulated data require methods adapted to handling large amounts, often called Big Data methods. Such mountains of data can hide important information: connections that are not obvious and even surprising. A key objective of data mining is unveiling these connections in the data.

We engaged with these questions and developed methods to combine them into a single data mining system which can successfully manage these challenges. We distilled two main issues that we tackled in this study and in this context: Multi-Label Classification (MLC) and MLC improvement.

7.1 Contributions

One important issue when dealing with large data is the task of organizing and discovering relations within the data. The organization of data objects is usually realized with an ontology of classes, each data object receiving at least one label. When different perspectives are used to label the data, different ontologies are needed, and the connections between these ontologies might help the understanding of the data.

An important challenge of large data classification is the traceability of the model.

ML-ARAM has this property, but it is too slow for large data. In this study I proposed the method ML-HARAM, which proved to work in the MLC setup. In some setups it even outperformed ML-ARAM on many MLC performance measures, indicating better prediction quality. The parameters of ML-HARAM can be changed to converge to the results of ML-ARAM, so a trade-off between speed and prediction quality can be made if necessary. Experiments on different implementation paradigms based on sparsity, parallelization and programming language were also performed, showing that the choice of an individual implementation depends heavily on the density of the data. The algorithm also performed well on different tasks and datasets, ranging from multi-class classification to MLC. The key advantage is that ML-HARAM divides the input space into subspaces, significantly diminishing the number of rules to be tested on the data. The learning of the subspace division is performed much more efficiently than with previous methods, while also integrating the clusters and prototypes in one network and allowing neighborhood activation.
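The subspace idea can be illustrated with a two-stage lookup: rank the clusters first, then test only the prototypes inside the best ones. This is a simplified sketch, not the actual ML-HARAM network; all names and data are invented:

```python
import numpy as np

def two_stage_activation(sample, clusters, k=1):
    """Hierarchical lookup: rank clusters by a fuzzy-ART-style
    activation of their centroids, then evaluate only the prototypes
    of the top-k clusters instead of all prototypes in a flat network."""
    def act(w):
        return np.minimum(sample, w).sum() / (1e-6 + w.sum())
    ranked = sorted(clusters, key=lambda c: act(c["centroid"]), reverse=True)
    best, best_act, tested = None, -1.0, 0
    for cluster in ranked[:k]:
        for i, proto in enumerate(cluster["prototypes"]):
            tested += 1
            a = act(proto)
            if a > best_act:
                best, best_act = (cluster["name"], i), a
    return best, best_act, tested

# 10 invented clusters with 50 prototypes each (500 total).
rng = np.random.default_rng(0)
clusters = [
    {"name": f"c{j}", "centroid": rng.random(8),
     "prototypes": [rng.random(8) for _ in range(50)]}
    for j in range(10)
]
sample = rng.random(8)
best, score, tested = two_stage_activation(sample, clusters, k=2)
assert tested == 100  # only 2 * 50 prototypes tested instead of 500
```

The speed-up grows with the number of clusters, at the cost of possibly missing the global best prototype when it lies outside the top-k clusters, which is the trade-off between speed and prediction quality mentioned above.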

The predictions given in large MLC will include many labels with low support. Many approaches ignore them and handle them as noise. We developed and examined improvement methods based on the assumption that the highest improvement would come from such labels. This is the key achievement of the study. Another important distilled assumption was that cross-ontology rules would be an important aspect. We focused on Interestingness Measures (IMs), and in particular on IMs for Rare Association Rules (IMRARs), in order to extract suitable rules to apply to the predictions. We applied a vast range of methods and IMs, more than any other study to the best of our knowledge, achieving good results. This also merges two different research fields, MLC and knowledge extraction, in this application. Furthermore, one of the most successful improvement methods considers the score assigned by the classifier to each label when applying a rule to the predictions. We found evidence that this can have a significant impact on the improvements and depends heavily on the classifiers used. We also examined the case where no predictions are available for the second labelset. The predictions obtained by applying the IMRARs were better than those of a naive-Bayes-based classifier, and compared with the results obtained by the classifier with the highest score, the IMRAR predictions still achieved 70% of that classifier’s performance measure results without any expensive training.

Furthermore, we developed a visualization for rule analysis, the Rule Explorer, in order to examine the cross-ontology rules and easily exploit the traceability of ML-ARAM.

We examined the RCV1-V2 dataset in detail, discovered several problems, and pointed out some solutions. The use cases examined with the Rule Explorer also demonstrate how the data mining system performs and helps find new insights about the problem as well as improve the predictions.

7.2 Outlook

The rule hierarchy created by ML-HARAM can be further used to cluster the input space and find commonalities between the labels belonging to such a cluster. We can connect the cross-ontology rules in the output space with the relations between the input and output spaces. This can be used to show how the ontologies relate in the presence of certain patterns in the input space, that is, whether a certain pattern in the attribute space can be a trigger for a certain rule between labels of different ontologies. Thus, co-occurrences (which may reveal causalities) can unveil new knowledge about the task at hand.

Also, the Rule Explorer can help find semantic links in text datasets. The use of semantic features and ontologies will be important for neural networks in the years to come. Current feature extraction methods are based on statistics, and therefore on how the language is used, not on how words connect to create meaning. Many approaches use N-grams, grasping for a stronger correlation between words and a more solid statistic, but the research on how to extract important features from text (e.g. how words are linked, which are synonyms and how the ontology graph connects them, pretraining datasets that allow better feature extraction) has to improve greatly for automatic classifiers to perform better than humans. When it comes to that point, such powerful feature extraction will require a suitable data mining visualization. This will be important in order to find the causes and relevance of a certain feature for the prediction, pointing to causalities in the data and rules.

Another issue to investigate would be applying cross-ontology AR methods to the ML-HARAM rules between the ontologies. Since each sample is classified with a hierarchical prototype and a set of base prototypes for each ontology, connections between prototypes across the ontologies can be established and examined with the methodology developed in this study. This could enable the linking of similar training samples through the use of test data, enabling a further analysis of the feature space across ontologies.

Multi-label methods will also have to incorporate deep neural networks in a fashion that bounds the combinatorial explosion. Input space subdivision is a key feature which new classification methods should have, especially in the MLC setup. Examination of convolution and N-grams for feature extraction for ML-ARAM will be a logical development of this work.
