
A current major challenge in computer science is how to deal with the increasing amount of available information, a phenomenon commonly referred to as “big data”. A significant part of this data is ordered and categorized into classes embedded in ontologies¹, i.e. each object is assigned a label (denoting membership in a class) in order to organize the data and facilitate searches. This system results from the need for more fine-grained and precise structures to classify objects, enabling users to locate objects more quickly and to organize them more accurately. Ontologies become necessary when the number of labels becomes excessive (i.e. in the thousands). At that point, labels are generally structured in an ontology, which is understood here as a set of classes/labels interconnected by rules; for example, a genealogy tree would be part of an ontology in which the labels are people and the rules are of the type is-a-child-of. We are interested primarily in the hierarchical structure of ontologies in which multiple labels are assigned to objects. A divide-and-conquer approach to embedding objects into the structure results in a hierarchical form,

¹ As understood in [BCM05] in terms of concepts, taxonomy and non-hierarchical relations.

Figure 1.1: Wikipedia page of Alan Turing linked to multiple ontologies

giving rise to an is-a relationship among labels (i.e. objects are grouped and assigned to labels at various levels of the structure, producing different granularities). Furthermore, many tasks require multiple perspectives, with each perspective featuring its own ontology. As Figure 1.1 demonstrates, a Wikipedia article may be connected to multiple ontologies and be assigned multiple labels from each of them. The article about Alan Turing can be connected to individual entities, such as Winston Churchill, and thereby to broader but still specific terms, such as World War II, but also to entirely different broad terms such as scientific fields (e.g. mathematics and computer science) or even the hierarchy node responsible for motion pictures. The motion pictures connected to the article are embedded in other hierarchies such as that of Netflix, opening another source of categorization that is likely to be independent of the Wikipedia hierarchy in that the general-specific relationship and granularity differ. Because the number of objects and perspectives is growing, the ontologies themselves are also becoming more and more complex. Another problem that arises when many labels can be assigned, depending on perspective and context, is that inconsistencies may be created. This is especially the case when labelling is performed manually, since different people may assign different labels to the same object based on their individual decisions.
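The is-a relationship among labels can be made concrete with a small sketch. The Python below is illustrative only: the two toy hierarchies, their label names, and the helper `with_ancestors` are assumptions made for this example and are not taken from Wikipedia, Netflix, or any other real ontology.

```python
# Hypothetical toy hierarchies: each label maps to its parent (an is-a edge).
# Labels and structure are invented for illustration.
SCIENCE = {
    "theoretical computer science": "computer science",
    "computer science": "science",
    "mathematics": "science",
}
HISTORY = {
    "Bletchley Park": "World War II",
    "World War II": "history",
}


def with_ancestors(labels, parent_of):
    """Return the labels together with every label reachable via is-a edges."""
    closed = set()
    for label in labels:
        # Walk upwards until the root or an already visited label is reached.
        while label is not None and label not in closed:
            closed.add(label)
            label = parent_of.get(label)
    return closed


# One object, labelled independently in each ontology (one per perspective).
turing = {
    "science": {"theoretical computer science", "mathematics"},
    "history": {"Bletchley Park"},
}
expanded = {
    "science": with_ancestors(turing["science"], SCIENCE),
    "history": with_ancestors(turing["history"], HISTORY),
}
```

Each ontology contributes its own label set at its own granularity, which is why the same object can carry a specific label in one perspective and only a broad one in another.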

For the labelling of an ever-growing number of objects in situations in which manual methods would be costly and slow (and occasionally inconsistent), scalable automatic methods are required. Useful algorithms in this regard come from the field of Multi-Label Classification (MLC). In MLC, samples are classified into classes, and a sample may be assigned multiple labels. The growth of ontologies has forced MLC into diversified label spaces of increasing complexity. The enormous number of objects and labels makes the MLC task even more difficult, and many MLC algorithms cannot complete categorization tasks within acceptable time and memory limits, creating a high barrier to engaging with this issue. As a result, the problem of how to connect multiple ontologies within the MLC framework is an under-researched question [LM10]. Recently, initial work on the improvement of MLC predictions through reductions in label space [CRdJH14] produced poor results. Thus, although the question of how to handle heterogeneous attributes in classification has been thoroughly discussed in the literature [Wis09], the next step, namely how to connect a diversified output space in a useful way, is, to the best of the author’s knowledge, yet to be investigated. We will focus on this research field, concentrating on the question of how to improve MLC quality and performance in the presence of multiple ontologies.

In brief, the central hypothesis of the proposed study is that discovering cross-ontology rules between the multi-labels of different ontologies will allow these connections to be used to improve the predictions of the classifier. These connections are expected to have an especially significant impact on large data, which is usually organized in large ontologies. Furthermore, the hierarchical nature of the ontologies will be exploited to elicit these rules and relate them to the classification rules, creating an easy-to-navigate rule base. Consequently, methods for both tasks, the improvement of predictions and the rule analysis, need to be developed. An important requirement for these methods is the ability to deal with large data efficiently. To that end, an MLC algorithm that can handle such data both rapidly and accurately must form the basis of the approach.

We are confident that these cross-ontology rules can improve classification results. It is generally easier to predict the labels of one ontology than those of another; therefore, finding relations between the ontologies and using the predictions for the ontology with greater accuracy will help predict the labels of the less accurate ontology. In the example depicted in Figure 1.1, it is widely known, through the various movies about Alan Turing, that he was a computer scientist. However, a link between Turing and the field of philosophy is probably less widely known, although many theoretical researchers engage in thought experiments, joining their disciplines with philosophy. Thus, if Alan Turing can be identified as a theoretical researcher in the field of computer science, and the information that theoretical researchers are linked to philosophy is used, linking Alan Turing to philosophy is within reach: a fact is combined with a known relationship to discover a new fact.
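Combining a predicted fact with a known relationship can be sketched in a few lines. The following Python is a minimal illustration under stated assumptions: the ontology names `profession` and `field`, the single rule in `CROSS_RULES`, and the function `infer` are hypothetical and do not describe the method developed in this work.

```python
# Hypothetical cross-ontology rules: a label in one ontology implies a label
# in another. Ontology and label names are invented for this example.
CROSS_RULES = {
    ("profession", "theoretical researcher"): ("field", "philosophy"),
}


def infer(predicted, rules):
    """Add every label implied by a rule whose premise was predicted."""
    # Copy the input so that the original predictions stay untouched.
    result = {ontology: set(labels) for ontology, labels in predicted.items()}
    for (src_onto, src_label), (dst_onto, dst_label) in rules.items():
        if src_label in predicted.get(src_onto, set()):
            result.setdefault(dst_onto, set()).add(dst_label)
    return result


# Predictions of a classifier per ontology (assumed, for illustration).
predicted = {
    "profession": {"theoretical researcher"},
    "field": {"computer science"},
}
enriched = infer(predicted, CROSS_RULES)
```

The sketch applies each rule once against the original predictions; it does not chain rules, which a fuller treatment would need to consider.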

A more concrete example is depicted in Figure 1.2². On the left side of the picture is the training data with the selected objects and their respective features and labels, as well as the sample to be classified. On the right side are the ontologies and the classification improvement. The arrows represent classification steps. From the training data, a classification rule for sports cars can be inferred, but for a new sample (a dump truck), the rule cannot be applied successfully. Judging by its features alone, the dump truck sample would satisfy the extracted sports car rule. However, the company Hitachi does not produce any sports cars, so this would be a misprediction. The reason is that the sports car rule is incomplete because of the restricted training set. But since the new sample is also classified as belonging to the Hitachi company, it is possible to infer that the closest thing to a sports car that this company makes is a dump truck. Thus, using the prediction

² This extends the idea of [Wis09].

Figure 1.2: Car example: classification rules with only feature space and with additional label space

of another ontology³ and a cross-ontology rule, the prediction is improved by labelling the sample with the correct class (Type). A schema of the classification workflow for this example is depicted in Figure 1.3. On the left is the standard classification with multiple ontologies. Rules are extracted from these ontologies in the top module, and in the final module both outputs are combined, improving the predictions. The approach can be distilled to the simple idea that scouring the training data will not always yield the best feature. Moreover, such an exhaustive search is very costly and not always possible. Identifying relations purely in the label space, between label sets of different nature, will probably be more fruitful, improving prediction quality with a relatively low increase in training cost. This will be even more important in large multiple-ontology datasets, since the number of labels will be high and cross-ontology relations with a significant impact will be more likely to occur. In such datasets, specific and rare labels will be more difficult to train and predict, so this is where our focus lies. By using an automatic classification system, new relations between the multiple ontologies might also emerge and provide new knowledge.

³ It is assumed that the classification into companies is easier to perform.
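The car example can be approximated by the following sketch, in which a feature-only prediction for the Type ontology is corrected with the help of the Company ontology. Everything in it is an assumption made for illustration: the scores in `type_scores`, the contents of `PRODUCES`, and the function `refine` are hypothetical and are not taken from Figure 1.2 or from real product data.

```python
# Hypothetical scores of a feature-only classifier for the Type ontology.
# The incomplete sports car rule gives the dump truck sample a high score.
type_scores = {"sports car": 0.7, "dump truck": 0.2, "tractor": 0.1}

# Hypothetical cross-ontology rule base: types associated with each company.
# The entries are invented for this sketch.
PRODUCES = {"Hitachi": {"dump truck", "tractor"}}


def refine(scores, company, produces):
    """Pick the best-scoring type that is consistent with the company label."""
    allowed = produces.get(company, set())
    candidates = {t: s for t, s in scores.items() if t in allowed}
    if not candidates:
        # No applicable rule: fall back to the feature-only prediction.
        candidates = scores
    return max(candidates, key=candidates.get)


feature_only = max(type_scores, key=type_scores.get)
corrected = refine(type_scores, "Hitachi", PRODUCES)
```

Here the company prediction acts as a filter on the label space of the other ontology, so no additional search over the features is needed; when no rule applies, the feature-only prediction is kept unchanged.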

Figure 1.3: Approach: Workflow Schema Improvement