
3.4 Results

3.4.1 Cross-Validation

In order to compare the different approaches, we performed document-wise 10-fold cross-validation on the training corpora. All approaches use identical cross-validation splits to ensure comparability of the different classifiers. For APG, ST, SST, and SpT we followed the parameter optimization strategy defined by Tikk et al. (2010). For TEES and Moara, we selected parameters using a coarse parameter selection strategy. For SL and SLW, we used the default SVM parameters.

The organizers provide two source corpora annotated using the same annotation guidelines. The MEDLINE corpus consists of scientific abstracts, whereas the DrugBank corpus consists of specific paragraphs extracted from the DrugBank website. By reading some of the annotations of both corpora, we observed recurring phrases in DrugBank such as "Co-administration of DrugX with DrugY leads to . . . ", indicating that DrugBank is more homogeneous than MEDLINE. Similarly, Chowdhury and Lavelli (2013b) reported a much stronger use of the cue words "increase" and "decrease" in DrugBank, indicating that DrugBank has a more typed vocabulary.

We therefore compared corpus-specific aspects such as the average sentence length or the number of entities per sentence. We also calculated the normalized Shannon entropy (Shannon, 1948), which is defined for an arbitrary probability distribution over N tokens. The normalized Shannon entropy quantifies the asymmetry of the observed probability distribution, where H(X) = 1 represents a uniform probability distribution (i.e., all tokens are equally probable) and H(X) = 0 indicates a fully localized probability distribution (i.e., only a single token is observed).
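A minimal sketch of how such a normalized entropy can be computed over a token sequence, assuming normalization by the logarithm of the number of distinct tokens:

```python
import math
from collections import Counter

def normalized_entropy(tokens):
    """Normalized Shannon entropy H(X) of a token sequence.

    H(X) = -sum(p_i * log p_i) / log N, with N distinct tokens.
    H(X) = 1 for a uniform distribution over the vocabulary;
    H(X) = 0 when only a single distinct token occurs.
    """
    counts = Counter(tokens)
    n_distinct = len(counts)
    if n_distinct <= 1:
        return 0.0  # fully localized distribution
    total = sum(counts.values())
    h = -sum((c / total) * math.log(c / total) for c in counts.values())
    return h / math.log(n_distinct)
```

Applied to all tokens of a corpus, a lower value (as for DrugBank, 0.56 vs. 0.66) indicates a more skewed, less varied vocabulary.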

Results of this analysis are shown in Table 3.3. Significance between all sentence-wise characteristics is assessed using a two-sided Mann-Whitney U-test (Mann and Whitney, 1947). The null hypothesis is that the median difference between the two characteristics is zero. Differences for all characteristics, except the length of the shortest dependency path, are significant (significance level α = 0.05). The difference is most pronounced for the number of entities per sentence, resulting in a higher number of co-occurring drug pairs in DrugBank compared to MEDLINE (7.4 pairs vs. 3.5 pairs).
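The U statistic underlying this test counts pairwise comparisons between the two samples; a stdlib-only sketch (a full implementation with p-values is available as `scipy.stats.mannwhitneyu`):

```python
def mann_whitney_u(xs, ys):
    """Mann-Whitney U statistic for two independent samples.

    U counts, over all pairs (x, y), how often x > y; ties
    contribute 0.5. The p-value is then obtained from the
    distribution of U under the null hypothesis.
    """
    u = 0.0
    for x in xs:
        for y in ys:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u
```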

This analysis raises the question whether the two corpora are similar enough to be considered the same domain. We investigated this question by following two different cross-validation strategies: First, the performance of relationship extraction is estimated for each corpus individually (DrugBank and MEDLINE). This is implemented by following a regular document-wise 10-fold cross-validation for each corpus. In the second experiment, the cross-validation data is complemented by data from the other corpus. For instance, we perform regular cross-validation on DrugBank, but add the whole MEDLINE corpus to the training instances. This strategy allows us to estimate the impact of additional, but potentially different, text sources for both corpora. Performance of individual methods and different majority-voting ensembles for DrugBank and MEDLINE is shown in Tables 3.4 and 3.5, respectively.
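The two strategies can be sketched as follows; the function name and the flat document representation are illustrative, not taken from the thesis:

```python
def cross_validation_splits(target_docs, other_docs, n_folds=10, combined=False):
    """Yield (train_docs, test_docs) per fold, document-wise.

    Regular CV:  train on the remaining k-1 folds of the target corpus.
    Combined CV: additionally add *all* documents of the other corpus
                 to the training data; the test fold stays untouched.
    """
    # Document-wise split: a document never occurs in two folds.
    folds = [target_docs[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        train_docs = [d for j, fold in enumerate(folds) if j != i for d in fold]
        if combined:
            # e.g. add the whole MEDLINE corpus when testing on DrugBank
            train_docs = train_docs + other_docs
        yield train_docs, folds[i]
```

Note that only the training data differs between the two settings, so fold-wise test scores remain directly comparable.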


Characteristic                              DrugBank   MEDLINE    p-value
Avg. no. of tokens per sentence                 25.5      26.9    5.0·10⁻³
Avg. no. of entities per sentence                3.5       2.8    5.9·10⁻¹⁶
Avg. no. of tokens between two entities          9.5       8.8    4.6·10⁻⁶
Avg. no. of tokens on shortest path              3.8       4.0    0.2671
Normalized entropy H(X)                         0.56      0.66    —

Table 3.3: Statistics of different characteristics for both DDI training corpora. Only sentences with at least one entity pair are considered. p-values are derived using the Mann-Whitney U-test.

                                 Regular CV               Combined CV
Method                      P     R    F1   AUC       P     R    F1   AUC

Individual classifier
SL                        61.5  79.0  69.1  92.8    62.1  78.4  69.2  93.0
APG                       77.2  62.6  69.0  91.5    75.9  59.8  66.7  91.6
TEES                      77.2  62.0  68.6  87.3    75.5  60.9  67.3  86.9
SLW                       73.7  60.0  65.9  91.3    73.4  61.2  66.6  91.3
Moara                     72.1  55.2  62.5   —      72.0  54.7  62.1   —
SpT                       51.4  73.4  60.3  87.3    52.7  71.4  60.6  87.7
SST                       51.9  61.2  56.0  85.4    55.1  57.1  56.0  86.1
ST                        47.3  64.2  54.2  82.3    48.3  64.3  54.9  82.7

Majority voting
SL+SLW+TEES               76.1  69.9  72.7   —      75.9  65.3  70.1   —
APG+SL+TEES               79.3  69.9  74.2   —      79.2  65.4  71.5   —
Moara+SL+TEES             79.9  69.6  74.2   —      79.6  65.1  71.6   —
Moara+SL+APG              81.4  70.6  75.5   —      81.3  70.3  75.3   —
APG+Moara+SL+SLW+TEES     84.0  68.1  75.1   —      83.7  64.2  72.6   —
APG+SpT+TEES              76.8  68.0  72.1   —      77.1  63.4  69.6   —
APG+SpT+SL                68.7  74.8  71.5   —      69.7  73.8  71.6   —

Table 3.4: Cross-validation results for DrugBank. Regular CV is training and evaluation on DrugBank only. Combined CV refers to supplementing DrugBank with instances from MEDLINE. The higher F1 of the two settings is indicated in boldface for each method. Single methods are ranked by F1.

3 Ensemble Methods for Relationship Extraction

                                 Regular CV               Combined CV
Method                      P     R    F1   AUC       P     R    F1   AUC

Individual classifier
TEES                      70.7  36.0  44.5  82.2    59.6  46.5  51.4  84.9
SpT                       37.8  38.6  34.6  78.6    42.3  55.3  47.1  80.4
APG                       46.5  44.3  42.4  82.3    38.1  62.2  46.4  82.8
SST                       31.3  37.7  31.8  74.1    36.7  61.7  44.9  79.5
SL                        43.7  40.1  38.7  78.9    34.7  67.1  44.7  81.1
SLW                       58.0  14.3  20.4  73.4    50.1  38.0  42.0  82.4
Moara                     49.8  31.9  37.6   —      45.6  43.2  41.9   —
ST                        25.2  43.8  30.1  70.5    36.1  48.3  39.8  74.2

Majority voting
SL+SLW+TEES               73.6  29.0  37.6   —      55.2  52.7  53.1   —
APG+SL+TEES               60.7  37.9  43.4   —      49.9  62.4  54.3   —
Moara+SL+TEES             68.0  33.0  42.2   —      62.1  55.5  57.4   —
Moara+SL+APG              57.7  36.7  42.4   —      48.3  60.9  52.8   —
APG+Moara+SL+SLW+TEES     73.3  28.3  36.8   —      60.6  54.4  56.5   —
APG+SpT+TEES              58.5  37.4  41.7   —      57.5  59.2  57.1   —
APG+SpT+SL                48.3  39.9  40.0   —      43.6  64.3  51.0   —

Table 3.5: Cross-validation results for MEDLINE. Regular CV is training and evaluation on MEDLINE only. Combined CV refers to supplementing MEDLINE with instances from DrugBank. The higher F1 of the two settings is indicated in boldface for each method. Single methods are ranked by F1.

CV results for the DrugBank corpus (Table 3.4) show no clear effect when using MEDLINE as additional training data. By adding MEDLINE instances during the training phase, we observe an average decrease of 0.3 percentage points (pp) in F1 and an average increase of 0.7 pp in AUC. The small increase in AUC indicates that additional data helps to learn a slightly better discrimination between the two classes, but most classifiers are unable to select the optimal threshold value. This is reflected by the minor decrease in F1. The strongest impact of additional MEDLINE training data on DrugBank can be observed for APG, with a decrease of 2.3 pp in F1. For almost all ensembles (with the exception of APG+SpT+SL), we observe superior results when using only DrugBank as training data. Interestingly, this effect can mostly be attributed to an average increase of 3.3 pp in recall, whereas precision remains fairly stable between ensembles using DrugBank alone and those with additional training data from MEDLINE.

In contrast, for MEDLINE all algorithms clearly benefit from additional training data, with an average increase of 9.8 pp and 3.6 pp in F1 and AUC, respectively. For the ensemble-based approaches, we observe an average increase of 13.8 pp in F1 when using the additional annotations from DrugBank. These results indicate that MEDLINE gains from additional out-domain data, whereas the effect on DrugBank is unclear.
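The majority-voting ensembles evaluated above can be sketched as follows; the classifier names match the tables, but the example predictions are made up for illustration:

```python
from collections import Counter

def majority_vote(predictions):
    """Combine binary predictions of several classifiers by majority vote.

    `predictions` maps classifier name -> list of 0/1 labels per
    candidate drug pair. With an odd number of classifiers there
    is always a strict majority.
    """
    names = list(predictions)
    n_instances = len(predictions[names[0]])
    combined = []
    for i in range(n_instances):
        votes = Counter(predictions[name][i] for name in names)
        combined.append(votes.most_common(1)[0][0])
    return combined

# e.g. a three-classifier ensemble such as Moara+SL+APG
# (labels here are hypothetical):
preds = {
    "Moara": [1, 0, 0, 1],
    "SL":    [1, 1, 0, 0],
    "APG":   [1, 0, 1, 0],
}
# majority_vote(preds) -> [1, 0, 0, 0]
```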

One possible explanation is the difference in corpus size: MEDLINE contains almost 15 times fewer training instances than DrugBank. It is possible that corpora with sufficient training instances are more likely to be distracted by out-domain information than small corpora with few annotations.

Cross-validation results for both corpora show significantly better F1-estimates for DrugBank in comparison to MEDLINE (Wilcoxon signed-rank test; p = 0.003906).

Differences in the efficiency of the relationship extraction algorithms can also be observed. When ranking the different methods by F1 and calculating the rank correlation between the two corpora, we observe a very weak correlation (Kendall's τ = 0.286, p = 0.4). In other words, the machine learning methods show varying performance ranks between the two corpora. This difference is most pronounced for SL and SpT, with four ranks difference between DrugBank and MEDLINE. Additionally, documents come from different sources and it is tempting to speculate that there might be a certain amount of domain specificity between DrugBank and MEDLINE sentences. Without further experiments it remains unclear if differences in overall performance and performance rank are due to domain-specific effects or due to different amounts of training instances.
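Kendall's τ (here the tie-free τ-a variant, which coincides with τ-b when all ranks are distinct) compares concordant and discordant rank pairs; a minimal stdlib sketch (`scipy.stats.kendalltau` additionally provides the p-value):

```python
def kendall_tau(ranks_a, ranks_b):
    """Kendall's tau-a between two rankings of the same items.

    tau = (concordant - discordant) / (n choose 2), where a pair
    (i, j) is concordant when both rankings order it the same way.
    """
    n = len(ranks_a)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (ranks_a[i] - ranks_a[j]) * (ranks_b[i] - ranks_b[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

Applied to the per-corpus F1 ranks of the eight individual methods, values near 0 (as observed) indicate that a method's rank on one corpus says little about its rank on the other.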

3.4.2 Relabeling

Performance of relabeling is evaluated by performing 10-fold CV on the training set using the same splits as in the previous experiments. Note that this experiment is performed solely on positive instances in order to estimate the separability of the four different DDI subtypes. Results are shown in Table 3.6.

Type         Pairs   Precision   Recall     F1
total        3,119        78.6     78.6   78.6
effect       1,633        79.8     79.1   79.4
mechanism    1,319        79.8     79.2   79.4
advise         826        77.3     76.4   76.9
int            188        68.5     80.9   74.1

Table 3.6: Performance estimation for relabeling DDIs. Pairs denotes the number of instances of this type in the training corpus.

The DDI-relabeling capability of TEES is very balanced, with F1 measures ranging from 74.1 % to 79.4 % across all four DDI subclasses. This is unexpected, since classes like "effect" occur almost ten times more often than classes like "int", and classifiers often have problems with predicting minority classes.