

4.1.1 Self-training

In this chapter, we explore the use of self-training for PPI extraction to increase performance on test corpora whose properties potentially differ from those of the training corpus. Note that this essentially includes all current application scenarios in biomedical text mining. Self-training is a special case of semi-supervised learning which, in principle, tries to exploit the large amount of available unlabeled data (Zhu, 2008). It is divided into a number of consecutive steps (see the sketch following this list):


1. Learn a model from manually annotated data.

2. Use this model to label a large pool of unannotated data.

3. Combine the manually and the automatically annotated datasets to create the training data for the final model.
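The following minimal sketch illustrates these three steps. It assumes feature vectors and a linear SVM from scikit-learn; the experiments in this chapter instead use an SVM with the shallow linguistic kernel, so this is an illustration of the general procedure, not our implementation.

```python
# Minimal self-training sketch (illustrative only): a linear SVM stands in for
# the kernel-based classifier used in this chapter.
import numpy as np
from sklearn.svm import LinearSVC

def self_train(X_labeled, y_labeled, X_unlabeled, n_selftrained=1000):
    # Step 1: learn a model from manually annotated data.
    base = LinearSVC().fit(X_labeled, y_labeled)

    # Step 2: use this model to label a large pool of unannotated data.
    pseudo_labels = base.predict(X_unlabeled)

    # Step 3: combine the manually and the automatically annotated data
    # and train the final model on the combined set.
    X_combined = np.vstack([X_labeled, X_unlabeled[:n_selftrained]])
    y_combined = np.concatenate([y_labeled, pseudo_labels[:n_selftrained]])
    return LinearSVC().fit(X_combined, y_combined)
```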

Self-training has been successfully used to improve performance in a number of tasks, including parsing (McClosky et al., 2006b; Reichart and Rappoport, 2007; McClosky and Charniak, 2008), word sense disambiguation (Jimeno-Yepes and Aronson, 2011), and subjectivity classification (Wang et al., 2008). Here, we apply self-training to PPI extraction in the following manner. First, we train a model using some gold standard PPI corpora. Second, we apply the model to all sentences from MEDLINE containing at least two protein mentions (excluding sentences present in the evaluation corpora). In a second training phase, we augment our original training set with a subset of these classified instances (termed self-trained). Finally, the predictive performance of the augmented model is evaluated on previously unseen gold standard corpora.
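The selection of unlabeled candidates in the second step can be sketched as follows; the data structures and function names are assumptions made for illustration only.

```python
# Keep MEDLINE sentences with at least two protein mentions, excluding
# sentences that also occur in the evaluation corpora (placeholder structures).
def select_unlabeled_candidates(medline_sentences, evaluation_sentences):
    """medline_sentences: iterable of (sentence_text, protein_mentions) tuples."""
    held_out = set(evaluation_sentences)
    return [(text, mentions) for text, mentions in medline_sentences
            if len(mentions) >= 2 and text not in held_out]
```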

We compare two strategies for adding self-trained instances and show that we achieve consistent performance improvements across different corpora. The gap in F1 between CV and CL evaluation is almost halved. We show that self-training is more robust than a fully supervised approach (in terms of the standard deviation of the gap size), making it better suited for assessing performance on unlabeled text.

4.2 Methods

For evaluation we use the five benchmark corpora introduced in Subsection 2.5. Since the ultimate goal of PPI extraction is the identification of PPIs in biomedical texts with unknown characteristics, we focus on corpus-wise extrinsic experiments by learning from one or more training corpora. The baseline for our experiments is the CL evaluation scenario, where a classifier is trained on the ensemble of four corpora and evaluated on the fifth. This evaluation is performed exhaustively for all combinations of training and test sets. We also perform CC evaluations on the two largest corpora by training a classifier on AIMed and testing on BioInfer, and vice versa.
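The CL protocol can be sketched as a simple leave-one-corpus-out loop. The corpus representation and the train_model callable below are hypothetical placeholders, not part of our implementation.

```python
# Cross-learning (CL) sketch: for every corpus, train on the union of the other
# four and evaluate on the held-out one. train_model is a hypothetical callable
# that returns a fitted classifier with a predict() method.
from sklearn.metrics import f1_score

def cross_learning(corpora, train_model):
    """corpora: dict mapping corpus name -> (instances, labels)."""
    results = {}
    for held_out, (X_test, y_test) in corpora.items():
        X_train = [x for name, (X, _) in corpora.items() if name != held_out for x in X]
        y_train = [l for name, (_, y) in corpora.items() if name != held_out for l in y]
        model = train_model(X_train, y_train)
        results[held_out] = f1_score(y_test, model.predict(X_test))
    return results
```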

For all experiments we used the shallow linguistic kernel (SL), as it is one of the best-performing kernels and produces fairly robust results in extrinsic evaluation (see Figure 4.2 and Table 4.1). In contrast to other methods, SL does not rely on dependency parse trees, which are computationally costly to generate. SVM parameters (i.e., the class-dependent soft-margin costs C+1 and C−1) were set to their default values.
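For illustration only (this is not the SVM package we used), class-dependent soft-margin costs can be expressed in scikit-learn via per-class weights on a precomputed kernel matrix, since scikit-learn scales C per class.

```python
# Class-dependent soft-margin costs with a precomputed kernel:
# C_{+1} = C * w_pos and C_{-1} = C * w_neg (assuming labels 1 and 0).
from sklearn.svm import SVC

def train_precomputed_svm(K_train, y_train, C=1.0, w_pos=1.0, w_neg=1.0):
    """K_train: n x n kernel matrix, e.g. computed by the shallow linguistic kernel."""
    clf = SVC(kernel="precomputed", C=C, class_weight={1: w_pos, 0: w_neg})
    return clf.fit(K_train, y_train)
```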

4.2.1 Self-training

To augment the training set with automatically created instances, we implemented the following workflow (see Figure 4.3): First, we extracted sentence boundaries from MEDLINE citations using the sentence segmentation model of Buyko et al. (2006) and scanned these for gene mentions using GNAT (Hakenberg et al., 2011). We found 879,928 sentences containing more than one gene mention, totaling 3,415,624 co-occurrences.

Figure 4.3: Data flow in our self-training setting. The path represented by a dashed line is used in the “self-enriched” strategy but omitted in “self-only.”
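The co-occurrence total is simply the number of unordered mention pairs summed over all retained sentences, as the small sketch below illustrates (the data structure is an assumption).

```python
# Each unordered pair of gene mentions within a sentence yields one candidate
# PPI instance (co-occurrence).
from math import comb

def count_cooccurrences(mentions_per_sentence):
    """mentions_per_sentence: iterable of gene-mention lists, one per sentence."""
    return sum(comb(len(m), 2) for m in mentions_per_sentence if len(m) >= 2)
```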

For cross-learning, models are trained on the union of four corpora and applied to the unlabeled PPI pairs from MEDLINE. Classified instances are then used to retrain a refined model (termed the self-trained model). We evaluated two self-training strategies (see the sketch following this list):

• In the “self-enriched” strategy, we added self-trained instances to the manually annotated training corpora and learned a new model. This setting reflects the common self-training strategy.

• In the “self-only” strategy, we trained solely on the self-trained instances derived from MEDLINE. This setting allows us to investigate the contribution of self-trained instances separately from the manually annotated gold standard data.
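The two training-set compositions differ only in whether the gold standard instances are retained, as this sketch shows (variable names are placeholders):

```python
# "self-enriched": gold standard plus self-trained instances;
# "self-only": self-trained instances alone.
def build_training_set(gold_instances, selftrained_instances, strategy):
    if strategy == "self-enriched":
        return list(gold_instances) + list(selftrained_instances)
    if strategy == "self-only":
        return list(selftrained_instances)
    raise ValueError(f"unknown strategy: {strategy}")
```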

Self-trained instances are selected by stratified sampling, keeping the class ratio equal to that of the respective training corpus. This strategy reduces the influence of other parameters and allows us to assess the core contribution of self-training. Note that the class ratio of the evaluation corpus may well differ from that of the (augmented) training set; however, it is considered unknown at training time to avoid information leakage. The particular instances were selected with respect to their distance to the SVM decision hyperplane, such that the most confidently classified data points were added first. We iteratively increased the number of self-training examples up to a limit of 700,000 training instances. Training on 700,000 instances required 32 GB of main memory and about 32 hours of wall-clock time on an Intel Xeon CPU (X5560 @ 2.80 GHz), while applying the trained model took only 7 ms per instance.
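The selection step can be sketched as follows: rank the classified MEDLINE instances by their absolute distance to the decision hyperplane and fill per-class quotas derived from the class ratio of the gold training corpus. Function and variable names are assumptions made for illustration.

```python
# Confidence-ranked, class-ratio-preserving selection of self-trained instances.
import numpy as np

def select_selftrained(decision_values, n_total, train_pos_ratio):
    """decision_values: SVM decision-function outputs for the unlabeled pairs."""
    decision_values = np.asarray(decision_values)
    predicted_pos = decision_values > 0                 # predicted class per instance
    order = np.argsort(-np.abs(decision_values))        # most confident first

    n_pos = int(round(n_total * train_pos_ratio))       # stratified per-class quotas
    n_neg = n_total - n_pos
    pos_idx = [i for i in order if predicted_pos[i]][:n_pos]
    neg_idx = [i for i in order if not predicted_pos[i]][:n_neg]
    return np.array(pos_idx + neg_idx)                  # indices of selected instances
```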

We assessed the statistical significance of our results as follows. As advised by Dietterich (1998), we evaluate whether one classifier outperforms another using McNemar’s test with continuity correction (McNemar, 1947); the null hypothesis is that both classifiers have the same error rate. The significance of Kendall’s correlation coefficients was determined using the algorithm of Best and Gipps (1974); the null hypothesis is that the correlation coefficients equal zero. For all tests, we used a significance level of α = 0.05. Neither of these tests makes an assumption about the underlying distribution.
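Both tests are available in standard Python libraries; the sketch below uses statsmodels’ McNemar test with continuity correction and scipy’s Kendall tau. Note that scipy does not implement the Best and Gipps (1974) algorithm, so the second function only approximates our procedure.

```python
# Significance testing sketch: McNemar's test (continuity-corrected) for
# comparing two classifiers, and Kendall's tau for correlation coefficients.
import numpy as np
from scipy.stats import kendalltau
from statsmodels.stats.contingency_tables import mcnemar

def classifiers_differ(y_true, pred_a, pred_b, alpha=0.05):
    y_true, pred_a, pred_b = map(np.asarray, (y_true, pred_a, pred_b))
    a_ok, b_ok = pred_a == y_true, pred_b == y_true
    # 2x2 table over correct/incorrect decisions of the two classifiers.
    table = [[np.sum(a_ok & b_ok),  np.sum(a_ok & ~b_ok)],
             [np.sum(~a_ok & b_ok), np.sum(~a_ok & ~b_ok)]]
    result = mcnemar(table, exact=False, correction=True)
    return result.pvalue < alpha   # reject "equal error rates" at alpha = 0.05

def correlation_differs_from_zero(x, y, alpha=0.05):
    tau, p = kendalltau(x, y)
    return tau, p < alpha          # reject "correlation equals zero" at alpha = 0.05
```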

4.3 Results

We show results for the standard (self-enriched) and our custom (self-only) self-training strategies.