
5.2 Using Support-Vector Machines

5.2.2 Classification

As classification algorithm we use a support vector machine (SVM) employing the shallow linguistic (SL) kernel, which achieves state-of-the-art performance and requires no parse-tree information. This allows us to collect large amounts of training data without the need for time-consuming parse-tree generation.
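To make the classification step concrete, the following minimal sketch trains an SVM on candidate protein pairs represented by token n-grams drawn from the context before, between, and after the two mentions. It is a simplified stand-in for the SL kernel rather than the actual jSRE implementation; the sentence format, window size, and feature choices are illustrative assumptions.

    # Minimal sketch of an SVM-based pair classifier in the spirit of the
    # shallow linguistic kernel: each candidate pair is represented by token
    # n-grams from the text before, between, and after the two protein
    # mentions. This is a simplified stand-in, not the jSRE implementation
    # of the SL kernel; sentence format and feature choices are assumptions.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import SVC

    def pair_context(tokens, p1, p2, window=3):
        """Concatenate fore / between / after token windows around the two mentions."""
        i, j = sorted((tokens.index(p1), tokens.index(p2)))
        fore = tokens[max(0, i - window):i]
        between = tokens[i + 1:j]
        after = tokens[j + 1:j + 1 + window]
        return " ".join(fore + ["PROT1"] + between + ["PROT2"] + after)

    # Toy distantly labeled examples: tokenized sentence, the two mentions, label.
    train = [
        (["ProtA", "binds", "to", "ProtB", "in", "vitro"], "ProtA", "ProtB", 1),
        (["ProtC", "and", "ProtD", "were", "measured", "separately"], "ProtC", "ProtD", 0),
    ]
    texts = [pair_context(s, a, b) for s, a, b, _ in train]
    labels = [y for _, _, _, y in train]

    clf = make_pipeline(CountVectorizer(ngram_range=(1, 3)), SVC(kernel="linear"))
    clf.fit(texts, labels)
    print(clf.predict([pair_context(["ProtX", "interacts", "with", "ProtY"], "ProtX", "ProtY")]))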

Using all distantly labeled instances during training proved to be computationally too expensive.

Classifiers are therefore trained with a small subset of all 8.3 million pairs, using 50,000 instances in all experiments unless stated otherwise.

3 As of March 24, 2010.

4 http://www2.informatik.hu-berlin.de/~thomas/pub/iwords.txt


Setting           Interaction word count            Pairs in sentence
                  pos.: ≥ 1       neg.: = 0         pos.: = 1       neg.: = 1
baseline
pos-iword         •
neg-iword                         •
pos/neg-iword     •               •
pos-pair                                            •
neg-pair                                                            •
pos/neg-pair                                        •               •

Table 5.3: All seven experimental settings. Based on the number of interaction words and protein mention pairs in the containing sentence, we filter out automatically generated positive or negative example pairs that do not meet the indicated heuristic condition. The dots indicate which filter is applied for which setting. For instance, no filtering takes place for the baseline setting.
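The following sketch shows how the filter conditions of Table 5.3 could be applied to a pool of distantly labeled instances; the Instance fields and the select helper are illustrative names, not taken from the original pipeline.

    # Sketch of the instance filters from Table 5.3. Each distantly labeled
    # instance is assumed to carry its label, the number of interaction words
    # in the sentence, and the number of protein mention pairs in the sentence;
    # the field names are illustrative.
    from dataclasses import dataclass

    @dataclass
    class Instance:
        label: int        # 1 = distantly labeled positive, 0 = negative
        iword_count: int  # interaction words in the containing sentence
        pair_count: int   # protein mention pairs in the containing sentence

    FILTERS = {
        "baseline":      lambda x: True,
        "pos-iword":     lambda x: x.iword_count >= 1 if x.label == 1 else True,
        "neg-iword":     lambda x: x.iword_count == 0 if x.label == 0 else True,
        "pos/neg-iword": lambda x: (x.iword_count >= 1) if x.label == 1 else (x.iword_count == 0),
        "pos-pair":      lambda x: x.pair_count == 1 if x.label == 1 else True,
        "neg-pair":      lambda x: x.pair_count == 1 if x.label == 0 else True,
        "pos/neg-pair":  lambda x: x.pair_count == 1,
    }

    def select(instances, setting):
        """Keep only instances that satisfy the heuristic of the given setting."""
        keep = FILTERS[setting]
        return [x for x in instances if keep(x)]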

We also evaluate how much training data is required to successfully train a classifier and whether the classifier reaches a stable performance after a certain number of training instances.

Another well-known problem is that classifiers tend to reproduce the positive to negative ratio seen during training (Chawla et al., 2004). This raises the question of how the training class distribution should be set. In our first experiments, we set the class ratio according to the class distribution averaged over all corpora excluding the evaluation corpus. This allows us to directly compare classifier performance with cross-learning results using the same classifier (see also Chapter 4). The influence of class imbalance is evaluated separately by varying positive to negative ratios from 0.001 to 1,000 using the best filtering strategy from the previous experiment.
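A minimal sketch of this sampling step is given below: a training set of fixed size is drawn from the pool of distantly labeled positives and negatives such that a target positive to negative ratio is met. The function name, the default arguments, and the use of a fixed seed are illustrative assumptions.

    # Sketch of drawing a fixed-size training set with a target positive to
    # negative ratio from the pool of distantly labeled instances (8.3 million
    # pairs in our setting). Names and defaults are illustrative.
    import random

    def sample_training_set(positives, negatives, size=50_000, pos_neg_ratio=0.25, seed=42):
        """Sample `size` instances such that #pos / #neg approximates `pos_neg_ratio`."""
        rng = random.Random(seed)
        n_pos = min(len(positives), round(size * pos_neg_ratio / (1 + pos_neg_ratio)))
        n_neg = min(len(negatives), size - n_pos)
        return rng.sample(positives, n_pos) + rng.sample(negatives, n_neg)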

Negative instances are generated using the closed world assumption, stating that two co-occurring protein mentions are annotated as negative when the pair is not contained in the knowledge base. To estimate the impact of the closed world assumption, we experimented with another technique, using the Negatome database5 (Smialowski et al., 2010) to infer negative examples. Negatome provides a reference set of non-interacting protein pairs and is thus better suited to infer negative examples than the closed world assumption. Unfortunately, reliable information about non-interaction is difficult to obtain, and the database therefore contains far fewer entries than IntAct. From our 8 million co-occurring protein pairs, only 6,005 could be labeled as certainly negative by Negatome.

This is insufficient to build a reasonable set of negative training instances for our experiments. Additional negative training instances required for training are therefore inferred using the closed world assumption.
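As an illustration of this labeling logic, the sketch below assigns a label to a co-occurring pair by first consulting the interaction knowledge base (IntAct in our setup), then Negatome, and falling back to the closed world assumption otherwise. The data structures and placeholder identifiers are assumptions for illustration only.

    # Sketch of the distant labeling step: a co-occurring protein pair is labeled
    # positive if it is contained in the interaction knowledge base, certainly
    # negative if listed in Negatome, and negative under the closed world
    # assumption otherwise. Identifiers below are placeholders, not real entries.
    def label_pair(pair, intact_pairs, negatome_pairs):
        key = frozenset(pair)                  # interaction is symmetric
        if key in intact_pairs:
            return "positive"
        if key in negatome_pairs:
            return "negative (Negatome)"
        return "negative (closed world)"       # not in the knowledge base

    intact = {frozenset(("PROT_A", "PROT_B"))}      # toy knowledge base content
    negatome = {frozenset(("PROT_C", "PROT_D"))}
    print(label_pair(("PROT_B", "PROT_A"), intact, negatome))   # -> positive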

5 As of April 30, 2011.


Finally, we evaluate whether a majority voting ensemble of 11 classifiers trained on randomly drawn training instances can further improve extraction quality. This approach loosely follows a bagging strategy (Breiman, 1996, see also Section 3.1). However, the training instances overlap less than with the regular bagging strategy. Bagging generates new training sets by sampling instances from the original dataset with replacement. In contrast, we sample instances from the original dataset without replacement.

The latter strategy can be applied due to the huge number of available training instances.
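A minimal sketch of this sampling variant is shown below, assuming a list of distantly labeled instances as the pool; the number of sets follows the text, while the set size and seed are illustrative.

    # Sketch of building the ensemble's training sets: classic bagging resamples
    # the original data with replacement, whereas here each set is drawn without
    # replacement from the large pool of distantly labeled instances.
    import random

    def draw_training_sets(pool, n_sets=11, set_size=15_000, seed=0):
        rng = random.Random(seed)
        # random.sample draws without replacement, so no duplicates occur within
        # a set; different sets may still overlap slightly, but with millions of
        # candidate instances this overlap is negligible.
        return [rng.sample(pool, set_size) for _ in range(n_sets)]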

5.2.3 Evaluation

For evaluation, we use the five PPI corpora AIMed, BioInfer, HPRD50, IEPA, and LLL introduced in Subsection 2.5. Each experiment is repeated 10 times with randomly sampled training instances. This strategy results in 10 independent estimates for precision, recall, F1, and AUC and allows us to robustly estimate each evaluation metric. p-values between different experiments are derived using a one-sided Mann–Whitney U test (Mann and Whitney, 1947), with the null hypothesis that the medians of the two samples are equal. Significance of Kendall correlation (Kendall, 1938) is determined following Best and Gipps (1974), with the null hypothesis that the correlation equals zero. For all tests we use a significance level of α = 0.01.
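The significance tests described above can be reproduced with standard library routines; the sketch below uses SciPy's implementations on made-up F1 estimates, so the exact p-value computation (e.g., the Best and Gipps method for Kendall's tau) may differ from the one used here.

    # Sketch of the significance testing: a one-sided Mann-Whitney U test
    # comparing the 10 F1 estimates of two experimental settings, and a Kendall
    # tau correlation test. All values below are invented for illustration.
    from scipy.stats import mannwhitneyu, kendalltau

    f1_setting_a = [0.48, 0.50, 0.49, 0.51, 0.47, 0.50, 0.52, 0.49, 0.48, 0.50]
    f1_setting_b = [0.44, 0.45, 0.43, 0.46, 0.44, 0.45, 0.47, 0.43, 0.44, 0.46]

    # One-sided test: is setting A's F1 distribution stochastically greater than B's?
    u_stat, p_value = mannwhitneyu(f1_setting_a, f1_setting_b, alternative="greater")
    print(f"Mann-Whitney U: U={u_stat:.1f}, p={p_value:.4f} (alpha = 0.01)")

    ratios = [0.001, 0.01, 0.1, 1, 10, 100, 1000]
    recalls = [0.01, 0.12, 0.45, 0.66, 0.74, 0.80, 0.86]   # illustrative values
    tau, p_tau = kendalltau(ratios, recalls)
    print(f"Kendall tau = {tau:.2f}, p = {p_tau:.4f}")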

5.2.4 Results

Mean values for the seven different instance selection strategies (see Table 5.3) are displayed in Table 5.4. All strategies except neg-pair filtering obtain an AUC higher than 0.5. The AUC is generally significantly higher than 0.5, except for three experiments using the smallest corpus (LLL). AUC is identical to the probability that a classifier ranks a randomly chosen negative instance lower than a randomly chosen positive instance. Therefore, AUC scores above 0.5 show that the distant supervision assumption holds, at least to some extent, for PPI extraction.
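This ranking interpretation of AUC can be verified directly on toy scores: the fraction of (positive, negative) pairs in which the positive instance receives the higher score equals the area under the ROC curve. The scores and labels below are invented for illustration.

    # Worked illustration of the ranking interpretation of AUC: the AUC equals
    # the fraction of (positive, negative) pairs in which the positive instance
    # is scored higher (ties counted as 0.5).
    from itertools import product
    from sklearn.metrics import roc_auc_score

    labels = [1, 1, 0, 0, 0]
    scores = [0.9, 0.4, 0.6, 0.3, 0.2]

    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    pairwise = sum((p > n) + 0.5 * (p == n) for p, n in product(pos, neg)) / (len(pos) * len(neg))

    print(pairwise)                        # 0.8333...
    print(roc_auc_score(labels, scores))   # identical: 0.8333...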

The various settings introduced to filter out likely noisy training instances improved precision, recall, or both over the baseline of using automatically labeled instances without applying any filters. Many instance selection strategies for AIMed, BioInfer, and HPRD50 significantly outperform co-occurrence in terms of F1. However, co-occurrence significantly outperforms all seven settings in F1 for the two remaining corpora, IEPA and LLL. There might be several reasons for this: First, these two corpora have the highest fraction of positive instances, so co-occurrence is a stronger baseline.

Second, IEPA describes chemical relations instead of PPIs, thus our training corpus might not properly reflect the syntactic properties of such relations.

It is encouraging that on two corpora (BioInfer and HPRD50) the best setting performs about on par with the best cross-learning results from Tikk et al. (2010), which have been generated using manually annotated data and are therefore expected to produce superior results. Distant supervision, on the other hand, labels text corpora without human intervention, thus reducing the cost of generating training corpora.

For each corpus we calculate the average rank in F1 for the seven different instance filtering strategies (see Figure 5.2).



Figure 5.2: Average rank in F1 for each experimental setting on the five evaluation corpora.

Figure 5.3 shows how often a selection strategy significantly supersedes the remaining six strategies in terms of F1 (according to the Mann–Whitney U test). For instance, pos/neg-iword significantly outperforms all six other strategies on the IEPA corpus. The same strategy outperforms only four other strategies on AIMed. Figure 5.2 and Figure 5.3 indicate that the filters pos/neg-iword and neg-iword perform well across all five corpora, suggesting superior robustness for these two settings. These strategies are outperformed only once across all five corpora: On AIMed, the filtering strategy pos/neg-pairs significantly outperforms all other strategies.

However, it achieves mediocre results on the remaining four corpora, which indicates comparably lower robustness. This filtering strategy is therefore not advisable, as it provides superior results on only one corpus. In the following, we analyze and compare the different instance selection strategies in more detail.

Interaction word based settings

All experiments using our interaction words for instance selection lead to an increase in F1 and AUC. In comparison to distant supervision without filtering (baseline), we observe the highest increase in AUC (3.8 pp) as well as F1 (11.8 pp) when filtering positive and negative instances together (pos/neg-iword). This strategy is closely followed by filtering only negative instances (neg-iword), with an average F1 improvement of 11.3 pp. Finally, we observe only a marginal improvement of 1.3 pp in F1 when exclusively filtering positive instances (pos-iword).


                              AIMed                     BioInfer                  HPRD50                    IEPA                      LLL
Method                        AUC   P     R     F1      AUC   P     R     F1      AUC   P     R     F1      AUC   P     R     F1      AUC   P     R     F1
co-occurrence                       17.8  (100) 30.1          26.6  (100) 41.7          38.9  (100) 55.4          40.8  (100) 57.6          55.9  (100) 70.3
cross-learning (Tikk et al.)  77.5  28.3  86.6  42.6    74.9  62.8  36.5  46.2    78.0  56.9  68.7  62.2    75.6  71.0  52.5  60.4    79.5  79.0  57.3  66.4

Setting
baseline                      65.1  21.0  82.8  33.5    63.2  33.3  64.2  43.8    64.4  42.8  75.4  54.6    52.2  40.9  11.6  18.0    51.8  51.3  39.2  44.4
pos-iword                     66.6  21.8  82.6  34.5    67.5  38.4  60.8  47.1    67.5  45.5  76.5  57.1    53.8  48.6  12.3  19.6    51.6  50.0  37.0  42.2
neg-iword                     65.3  21.1  91.1  34.2    68.1  37.3  70.9  48.9    73.4  43.9  93.6  59.8    54.7  43.9  49.9  46.7    53.9  49.9  77.4  60.7
pos/neg-iword                 65.1  21.4  89.8  34.6    68.6  38.6  67.0  49.0    73.3  44.8  93.2  60.5    54.6  43.8  53.2  48.0    53.5  50.7  75.8  60.8
pos-pairs                     64.2  29.3  33.4  31.2    69.8  57.8  18.0  27.5    62.7  47.9  35.6  40.8    66.6  54.9  26.3  35.5    63.2  68.2  27.8  39.5
neg-pairs                     46.9  17.2  85.5  28.6    37.3  24.4  85.6  37.9    50.8  39.0  80.9  52.6    36.5  22.4  18.6  20.3    38.2  44.7  66.2  53.3
pos/neg-pairs                 69.7  23.6  82.3  36.6    62.0  32.8  60.6  42.5    69.2  46.5  75.2  57.5    56.0  43.4  13.3  20.3    54.3  54.5  37.9  44.6

Setting (+Negatome)
baseline                      65.9  22.2  79.6  34.7    65.7  36.8  58.6  45.2    67.6  46.7  74.0  57.3    54.9  47.5  12.7  20.0    54.8  53.6  36.3  43.2
pos-iword                     67.4  22.9  81.4  35.8    69.1  41.1  56.3  47.5    69.2  47.9  75.4  58.5    57.4  52.6  12.9  20.6    52.3  51.2  37.5  43.1
neg-iword                     65.3  21.1  90.7  34.3    68.8  38.1  69.6  49.2    73.6  44.6  92.1  60.1    55.6  44.4  51.7  47.8    55.2  51.3  78.9  62.2
pos/neg-iword                 65.1  21.4  89.4  34.6    68.8  38.8  66.9  49.1    73.2  44.8  92.2  60.3    55.3  44.2  53.8  48.5    54.9  52.2  77.9  62.5
pos-pairs                     64.6  29.6  33.7  31.5    69.7  58.2  18.3  27.8    62.2  48.5  35.5  41.0    66.9  56.6  30.7  39.7    63.4  68.8  28.1  39.9
neg-pairs                     47.0  17.2  84.9  28.6    37.0  24.3  85.0  37.8    50.9  38.4  79.8  51.9    36.0  22.4  18.5  20.3    38.5  45.1  66.0  53.5
pos/neg-pairs                 69.8  23.8  81.1  36.8    63.9  34.6  58.6  43.5    69.5  47.5  74.2  57.9    57.0  44.3  13.9  21.1    54.7  53.2  34.5  41.7

Train pos/neg ratio
1,000                         60.6  19.0  89.8  31.3    64.2  31.3  84.6  45.7    62.5  41.1  92.9  57.0    57.9  42.6  88.3  57.3    61.2  53.7  93.3  68.1
100                           63.9  20.0  88.7  32.7    69.0  35.5  77.8  48.7    71.5  44.2  91.9  59.6    58.9  45.6  65.6  53.7    61.5  53.1  85.8  65.6
10                            65.5  20.9  91.0  33.9    71.2  38.7  76.0  51.2    74.1  44.2  95.8  60.5    57.9  45.7  55.5  50.1    57.9  51.8  80.7  63.1
1                             65.6  21.4  91.1  34.7    70.0  38.6  71.3  50.1    74.5  44.3  95.5  60.6    56.1  45.0  55.5  49.7    55.7  51.6  79.3  62.5
0.1                           65.4  22.3  81.3  35.0    67.9  40.9  57.9  48.0    72.1  46.9  84.7  60.4    53.5  43.1  37.9  40.3    51.0  50.0  58.7  53.9
0.01                          66.0  26.9  46.7  34.1    66.5  46.9  24.7  32.4    70.4  59.7  48.5  53.4    52.8  48.2   8.3  14.2    52.2  54.3  12.3  19.7
0.001                         61.5  41.4   0.9   1.8    63.2  63.0   0.3   0.6    67.8  72.5   1.3   2.6    53.0  30.0   0.1   0.2    54.1  10.0   0.1   0.1

Train set size (total)
500                           63.4  21.8  71.5  33.4    65.9  39.8  44.6  41.9    67.6  48.4  67.4  56.2    55.5  45.4  31.5  36.7    54.0  52.6  53.4  52.7
5,000                         65.3  21.4  84.3  34.2    69.0  39.9  63.5  48.9    72.6  45.7  89.0  60.4    56.8  46.1  41.9  43.8    54.5  51.3  66.3  57.8
15,000                        65.5  21.6  87.9  34.6    69.1  39.7  65.1  49.3    74.2  45.6  92.9  61.2    55.8  44.5  47.4  45.9    55.7  51.9  75.1  61.3
30,000                        65.3  21.5  89.4  34.6    68.8  39.2  66.5  49.3    73.0  44.6  93.1  60.3    55.0  44.0  50.7  47.1    53.8  50.9  75.2  60.7
70,000                        65.1  21.3  90.7  34.6    68.6  38.1  67.4  48.7    73.2  44.2  92.1  59.8    54.2  43.7  55.0  48.7    53.9  50.9  78.6  61.8
150,000                       64.7  21.3  91.3  34.5    68.2  37.5  68.1  48.4    73.1  44.1  92.8  59.8    53.0  43.0  57.1  49.1    52.7  51.1  81.3  62.7

Table 5.4: Results of different instance selection strategies, employing Negatome as negative knowledge base, different positive to negative ratios in the training set, and total sample size.



Figure 5.3: Comparison of all seven instance selection strategies. Individual points represent how often a specific instance selection strategy significantly outperforms the remaining six strategies for a given corpus. For instance, pos/neg-iword significantly outperforms all six other strategies on the IEPA corpus, but only four strategies on AIMed. Strategies (i.e., pos-pair, neg-pair, baseline, . . . ) are ranked by the total number of times the strategy significantly superseded others across all corpora.


Negatome

We repeated the previously introduced instance filtering techniques, but inferred negative training instances by using the Negatome knowledge base. From all 8,324,763 co-occurring protein pairs found in MEDLINE, Negatome allows us to label 6,005 as negative. To account for the relatively small number of negative instances, additional instances are drawn from the set of instances derived by the closed world assumption.

Using Negatome leads to a small increase of 0.5 percentage points in F1, due to an average increase of 1.1 pp in precision over all five corpora and seven settings. We also observe a tendency for increased AUC (0.9 pp). The largest gain in precision (3.5 pp) is observed between the two baseline results when no instance filtering is applied. Altogether, the experiments with Negatome indicate that knowledge bases for non-interacting protein pairs provide a better source to infer negative instances than the closed world assumption.

A clear drawback of Negatome is its comparably small size. On our dataset we could only generate 6,005 negative instances using Negatome. The number of negative training instances could be increased by generalizing proteins across species using information about homologous genes (e.g., using the Homologene database). Using this approach on our data set, we could infer approximately 4,200 additional training instances. However, it is unclear if these derived instances are of the same quality as the original 6,005 negative instances.
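The following sketch illustrates what such a homolog-based generalization could look like: every Negatome pair is expanded with pairs of homologous protein identifiers. The mapping format (a dict from identifier to homologous identifiers) and the placeholder names are assumptions; an actual implementation would derive the mapping from Homologene.

    # Sketch of a homolog-based expansion of negative pairs: each Negatome pair
    # is generalized by also labeling combinations of homologous proteins as
    # negative. Identifiers and the mapping are placeholders for illustration.
    from itertools import product

    def expand_negatives(negatome_pairs, homologs):
        expanded = set()
        for a, b in negatome_pairs:
            for ha, hb in product(homologs.get(a, set()) | {a}, homologs.get(b, set()) | {b}):
                expanded.add(frozenset((ha, hb)))
        return expanded

    homologs = {"PROT_C": {"PROT_C_mouse"}, "PROT_D": {"PROT_D_rat"}}
    print(len(expand_negatives([("PROT_C", "PROT_D")], homologs)))   # 4 generalized pairs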

Effect of the positive to negative ratio

Results for varied positive to negative training ratios are shown in Figure 5.4(a) and Table 5.4 (see Page 86). As expected, the results show that the positive to negative ratio of the training data affects the performance of a classifier. Precision and recall strongly correlate with the pos/neg ratio seen in the training set. The strong correlation between recall and pos/neg ratio (Kendall's tau ranging from 0.524 to 1 for all five corpora) is expected, as the classifier tends to assign more test instances to the majority class.

Oversampling of positive training instances works best for corpora with high fractions of positive examples. A strong correlation (Kendall’s tau ranging from −0.9 to −1.0) between precision and class ratio can be observed for AIMed, BioInfer, and HPRD50.

Correlation for IEPA is close to zero and for LLL the correlation is even positive, but not significant (p-value of 0.13). Overall, the observed influence of class imbalance is less pronounced than expected. For instance, F1 remains comparably robust with an average standard deviation of 2.6 pp for ratios between 0.1 and 10, whereas in a range between 1 and 100 the average standard deviation increases to about 11 pp. With more pronounced differences in the training ratio, a strong impact on F1 can be observed.

Effect of training set size

The impact of training set size on different corpora is shown in Table 5.4 (see Page 86).

Results aggregated over all corpora are shown in Figure 5.4(b). With increasing training set size, a monotonic increase in recall (Kendall's tau of 1; p-value < 0.01) can be observed for all corpora except HPRD50.


The negative correlation between precision and sample size is less pronounced but still observable for all five corpora, as Kendall's tau ranges between −0.552 and −1. For this reason, F1 increases for corpora with many positive instances.


Figure 5.4: Distribution of mean precision, recall, F1, and AUC aggregated over all five corpora for (a) different positive to negative class ratios and (b) different training set sizes.

Bagging

Based on the previous experiments we determined the best individual strategies: filtering of positive and negative instances for interaction words, a positive to negative class ratio of 1, and a training size of 15,000 instances. We sampled 11 individual training sets exhibiting these properties from the population of all distantly labeled MEDLINE instances. For each training set we trained an SVM employing the SL kernel. The minimum, average, and maximum performance of all individual classifiers are shown in Table 5.5, together with the results for majority voting (bagging).
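For completeness, a minimal sketch of the majority vote itself: a pair is predicted as interacting if more than half of the 11 classifiers vote for it. The classifiers are assumed to expose a scikit-learn-style predict() returning 0/1 labels; this assumption is for illustration only.

    # Sketch of the majority vote over the ensemble: a pair counts as a
    # predicted interaction if more than half of the classifiers say so.
    import numpy as np

    def majority_vote(classifiers, X):
        votes = np.stack([clf.predict(X) for clf in classifiers])   # shape: (n_classifiers, n_pairs)
        return (votes.sum(axis=0) > len(classifiers) / 2).astype(int)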

Bagging performs about on par with the mean of the individual classifiers, and we observe, according to the Mann–Whitney U test, no significant difference between bagging and the 11 classifiers. However, a single classifier sometimes performs better or worse, whereas bagging always performs close to the average. Individual classifier performance can deviate by between 0.4 pp on AIMed and 4.0 pp on IEPA. Thus, bagging can be successfully applied to improve the robustness of a classifier.

5.2.5 Conclusion

We investigated the use of distant supervision combined with a machine learning approach to detect protein-protein interactions.


                              AIMed                     BioInfer                  HPRD50                    IEPA                      LLL
Method                        AUC   P     R     F1      AUC   P     R     F1      AUC   P     R     F1      AUC   P     R     F1      AUC   P     R     F1
co-occurrence                       17.8  (100) 30.1          26.6  (100) 41.7          38.9  (100) 55.4          40.8  (100) 57.6          55.9  (100) 70.3
cross-learning (Tikk et al.)  77.5  28.3  86.6  42.6    74.9  62.8  36.5  46.2    78.0  56.9  68.7  62.2    75.6  71.0  52.5  60.4    79.5  79.0  57.3  66.4
min of 11 runs                64.7  21.1  90.3  34.3    69.2  68.7  38.2  49.8    73.0  43.4  93.2  59.4    54.3  43.3  51.0  46.9    53.8  49.2  75.6  59.8
mean of 11 runs               65.5  21.4  90.9  34.6    69.9  70.7  38.9  50.2    74.0  44.4  94.7  60.4    55.5  44.7  54.6  49.1    55.2  50.6  78.0  61.4
bagging over 11 runs                21.4  91.3  34.7          70.9  39.3  50.6          44.3  95.1  60.4          44.4  53.1  48.3          49.8  77.4  60.6
max of 11 runs                66.0  21.6  91.8  34.9    71.3  72.2  39.6  51.0    75.3  45.8  96.3  62.1    57.1  46.0  57.3  50.9    58.1  52.2  80.5  63.3

Table 5.5: Results of bagging over 11 classifiers trained on different distantly labeled sets. For comparison, we show the minimum, average, and maximum results for these 11 runs.

We demonstrated that distant supervision can be successfully adopted for domains where named entity recognition and normalization are still unsolved issues and the closed world assumption might be an unsupported stretch. This is important, as named entity recognition and normalization are a key requirement for distant supervision. Distant supervision is therefore an extremely valuable method and allows training classifiers for virtually all kinds of relationships for which a database exists. We have shown that results obtained without a manually annotated corpus are competitive with purely supervised methods. Thus, the tedious task of annotating a training corpus can be avoided.

Five benchmark evaluation corpora – having diverse properties, annotated by different researchers adhering to differing annotation guidelines – provide a perfect opportunity to evaluate the robustness and usability of distant supervision. Our analysis reveals that domain knowledge such as interaction words or "negative" knowledge bases consistently improves results across all five corpora. Two instance filtering techniques (pos/neg-iword and neg-iword) perform comparably well on all five corpora and are therefore recommended for robust relationship extraction. Filtering of positive and negative instances by pair count (pos/neg-pairs) is not recommended, as it achieves superior results on only one corpus. Ensemble strategies such as bagging do not improve overall performance but have a positive impact on classifier robustness by decreasing the risk of selecting an under-performing single classifier.

Surprisingly, class imbalance seems to be a less pronounced problem in distant supervision than is often observed in supervised settings. One possible explanation might be that, due to the noisy data, a classifier is less prone to over-fitting.