
2.5 Community-Wide Evaluation Efforts

2.5.3 Community Evaluation Efforts

Another well-accepted approach for benchmarking systems is the shared task. In advance of a specific conference, the organizers propose a problem and provide training data. Participants are invited to develop a system that solves the given task. Test data (without gold-standard annotations) is usually published several months later, and participants must then submit their solutions within a given time frame. This time slot is usually only a few days long, to limit manual intervention by individual participants. Shared tasks have a long tradition in information extraction, for instance in the “message understanding conferences” held between 1987 and 1997. The advantage of shared tasks is that systems are evaluated on previously unseen test data, which provides a more robust evaluation than conventional cross-validation. After the end of a shared task, the gold-standard annotations are ideally not publicly released. Instead, researchers are allowed a limited number of submissions (e.g., one run per day) to evaluate systems developed later. This strategy increases the comparability of later approaches, as it avoids common pitfalls in evaluation. However, Kaufman et al. (2011) showed that shared tasks may still lead to overoptimistic estimates when subject to data leakage or cheating by participating teams. For instance, the authors mention the INFORMS 2010 data mining challenge, in which more than 30 teams were able to map “blinded” test instances to the corresponding stock data, leading to almost perfect predictions of “future” stock price movements.

Probably the most important community efforts in the biological domain are the BioCreative conferences and the BioNLP shared tasks. Over the years, BioCreative has covered several different topics, ranging from gene name recognition (Hirschman et al., 2005; Krallinger et al., 2008), through the identification of PPI-relevant articles and the extraction of protein-protein interactions (Leitner et al., 2010), to interactive demonstrations of text mining systems (Arighi et al., 2011).

The BioNLP shared task was introduced by Kim et al. (2009). A characteristic feature of the BioNLP shared task, compared to binary relationship extraction, is the definition of fine-grained event types. Events generally consist of a trigger word (expressing the relation) and an arbitrary number of arguments. A particularity is the definition of nested events, which take another event as an argument.
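To make this event structure concrete, the following sketch models BioNLP-style events as a small Python data structure. The example sentence, entity mentions, and argument roles are illustrative assumptions, not taken from the actual shared task data:

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union

@dataclass
class Event:
    trigger: str        # trigger word expressing the relation
    event_type: str     # fine-grained event type
    # Each argument is a (role, filler) pair; a filler is either an
    # entity mention (str) or, for nested events, another Event.
    arguments: List[Tuple[str, Union[str, "Event"]]] = field(default_factory=list)

# Hypothetical sentence: "Expression of p53 is negatively regulated by MDM2."
expression = Event(trigger="Expression", event_type="Gene_expression",
                   arguments=[("Theme", "p53")])
# Nested event: the regulation takes the expression event itself as its Theme.
regulation = Event(trigger="regulated", event_type="Negative_regulation",
                   arguments=[("Theme", expression), ("Cause", "MDM2")])
```

The nesting is what distinguishes this representation from a flat binary relation: an argument slot may be filled by a whole event rather than an entity mention.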

Due to its success, the challenge was repeated twice with different data (Tsujii et al., 2011; Nédellec et al., 2013). In 2011, the organizers prepared five main event extraction tasks, differing in text type, event types, and domain. The ambitious goal was to provide several different corpora in order to evaluate the domain adaptation capabilities of the participating systems. However, only one team successfully participated in all main tasks and subtasks (Björne et al., 2012). The theme of 2013 was to support the construction of knowledge bases; to this end, the organizers defined six different tasks relevant for knowledge base construction. The organizers also provided a larger body of supporting resources, encompassing syntactic parses and the results of multi-purpose named entity recognition tools.

Other shared tasks worth mentioning are the two drug-drug interaction extraction challenges (Segura-Bedmar et al., 2011a, 2013). This task shares several similarities with the protein-protein interaction extraction task, as drug-drug interactions are defined as undirected binary relations between two entities in the same sentence. We participated in both shared tasks using an ensemble of different classifiers. The drug-drug interaction task and our contribution will be described in more detail in Chapter 3.

3 Ensemble Methods for Relationship Extraction

This chapter discusses the usefulness of ensemble learning techniques for relationship extraction. Ensembles aggregate the output of several heterogeneous classifiers in order to reduce the risk of accidentally choosing an inappropriate single classifier. For this reason, ensemble methods are generally considered to increase robustness. We chose the domain of drug-drug interactions (DDIs) in order to compare ensemble methods against individual classifiers. The work presented in this chapter was originally developed in the context of the SemEval 2013 shared task1 and ranked second among eight participants on an unseen test corpus (Segura-Bedmar et al., 2013).

3.1 Ensemble Learning

Ensemble learning refers to the process of combining several individual classifiers in order to build a stronger classifier. The methodology is inspired by human decision making, where an individual asks several people for their opinions in order to come to a final decision. Previous community competitions showed that ensemble learning helps to achieve better performance than relying on one single method (Kim et al., 2009; Leitner et al., 2010).

An important property of ensembles is that they increase robustness by decreasing the risk of selecting a bad (or miscalibrated) classifier (Polikar, 2006). For instance, it is straightforward to obtain several different classifiers on a data set. This can be accomplished by training different classification algorithms (e.g., SVM, Naïve Bayes, or logistic regression) or by using different parameter settings (e.g., different soft-margin misclassification costs) on the same dataset. Assume that some of these classifiers exhibit similar F1 scores according to 10-fold cross-validation. Their performance on the unseen test set may nevertheless vary considerably. In such cases, it can be advantageous to combine the learned classifiers in order to reduce the risk of randomly choosing a particularly bad one. The aggregated result does not necessarily outperform all individual classifiers on the test data, but it is more robust in terms of performance.
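This effect can be illustrated with a toy simulation. Everything below — the one-dimensional data, the 10% label noise, and the threshold "stump" classifiers fit on bootstrap resamples — is an illustrative assumption, not part of our DDI experiments:

```python
import random

random.seed(0)

# Toy 1-D task: the true label is 1 iff x > 0.5, with 10% label noise.
def sample(n):
    data = []
    for _ in range(n):
        x = random.random()
        y = int(x > 0.5)
        if random.random() < 0.1:
            y = 1 - y
        data.append((x, y))
    return data

train, test = sample(200), sample(1000)

# Stand-in for "several different classifiers": threshold stumps fit on
# bootstrap resamples of the training set. In cross-validation they all
# look similarly good, yet their thresholds (and test accuracies) vary.
def fit_stump(data):
    return max((i / 100 for i in range(101)),
               key=lambda t: sum(int(x > t) == y for x, y in data))

stumps = [fit_stump(random.choices(train, k=len(train))) for _ in range(9)]

def accuracy(predict, data):
    return sum(predict(x) == y for x, y in data) / len(data)

individual = [accuracy(lambda x, t=t: int(x > t), test) for t in stumps]
# Majority vote over the nine stumps: more robust than picking one at random.
ensemble = accuracy(lambda x: int(sum(x > t for t in stumps) >= 5), test)
print(f"worst single: {min(individual):.3f}, ensemble: {ensemble:.3f}")
```

The point is not that the vote beats every stump, but that its accuracy tracks the better stumps without knowing in advance which individual classifier would have been a bad pick.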

Ensemble learning theory typically distinguishes two combination types:

1. In classifier selection, the goal is to train several classifiers (or experts) for different areas of the input space. During prediction, the algorithm selects the most suitable classifier for each test instance according to some formal criterion (e.g., depending on feature allocation). This approach can be compared with a general practitioner referring patients to the respective specialist according to the disease pattern.

1Joint work with M. Neves, T. Rocktäschel, I. Solt, and U. Leser

2. In classifier fusion, one is interested in combining the predictions of several “weak” classifiers into one “strong” prediction. This approach is also known as the “wisdom of the crowd”.
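The two combination types can be contrasted in a minimal sketch. The “experts”, their thresholds, and the gating rule on the input value are all hypothetical:

```python
# Two hypothetical experts, each tuned for one region of a 1-D input space,
# plus a generalist used only in the fusion variant.
def expert_low(x):
    return int(x > 0.3)    # specialist for inputs below 0.5

def expert_high(x):
    return int(x > 0.7)    # specialist for inputs of 0.5 and above

def expert_mid(x):
    return int(x > 0.5)    # generalist

# 1. Classifier selection: a gating rule picks exactly one expert per instance.
def select_and_predict(x):
    expert = expert_low if x < 0.5 else expert_high
    return expert(x)

# 2. Classifier fusion: every classifier votes on every instance.
def fuse_and_predict(x):
    votes = [expert_low(x), expert_mid(x), expert_high(x)]
    return int(sum(votes) >= 2)  # simple majority of three votes
```

Selection consults one specialist per instance (the general-practitioner analogy above), whereas fusion always aggregates all opinions.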

In this thesis, we examine two well-established classifier fusion techniques (majority voting and stacking) in order to achieve higher robustness on unlabeled test data. We focus on classifier fusion, as it has received much more attention than classifier selection (Kuncheva, 2004, Chapter 3.2.1).
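The two fusion techniques can be sketched as follows. The base-classifier votes are made-up inputs, and the lookup-table meta-classifier is a deliberately simple stand-in for a real meta-learner, not our actual implementation:

```python
from collections import Counter

def majority_vote(votes):
    """Plurality decision over one label per base classifier."""
    return Counter(votes).most_common(1)[0][0]

# Stacking: base-classifier outputs on held-out data become the feature
# vector of a meta-classifier. Here the "meta-classifier" simply memorizes,
# per pattern of base votes, the most frequent gold label.
def train_meta(base_outputs, gold):
    seen = {}
    for votes, label in zip(base_outputs, gold):
        seen.setdefault(tuple(votes), []).append(label)
    return {pattern: Counter(labels).most_common(1)[0][0]
            for pattern, labels in seen.items()}

def meta_predict(meta, votes):
    # Fall back to majority voting for vote patterns unseen during training.
    return meta.get(tuple(votes), majority_vote(votes))

# Votes of three base classifiers on four held-out instances:
held_out = [(1, 1, 0), (0, 0, 0), (1, 0, 1), (1, 1, 0)]
gold = [1, 0, 1, 1]
meta = train_meta(held_out, gold)
print(majority_vote((1, 0, 1)), meta_predict(meta, (1, 1, 0)))  # → 1 1
```

In practice the meta-classifier would be a proper learner (e.g., logistic regression over the vote vector), trained on out-of-fold base predictions to avoid leaking training labels into the second stage.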