
3.4 Evaluation and Discussion

3.4.2 Evaluation of COV4SWS.KOM

were not available on interface and operation levels; as a result, a relatively low weighting of these levels considerably improves matchmaking results (Version 2), but if these levels are weighted too heavily, the evaluation numbers start to decrease (Version 3).

Apart from the evaluation regarding IR performance measures, the runtime performance in terms of AQRT has also been evaluated. The AQRT significantly depends on optional preprocessing steps, which are presented in Section B.5. It is important to note that the derivation of weightings based on OLS estimation is conducted at runtime, because the cross-validation requires knowledge of the respective current query. An overview of the macro-averaged AQRT (median of the test runs conducted) for the different versions and variants of LOG4SWS.KOM can be found in Table 3.4. As can be seen, Variants A–D feature relatively similar median AQRTs. Hence, it can be assumed that the identification of numerical equivalents based on OLS, which is conducted at runtime, does not increase the runtime to a large degree.
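The OLS-based derivation of level weights can be sketched as follows. This is a minimal illustration with made-up per-level similarity scores and relevance grades; in the actual matchmaker, the grades stem from the cross-validation over the current query mentioned above.

```python
import numpy as np

# Hypothetical per-level similarity scores for five query-service pairs
# (columns: interface, operation, parameter level).
X = np.array([
    [0.9, 0.8, 0.95],
    [0.4, 0.5, 0.30],
    [0.7, 0.6, 0.80],
    [0.2, 0.3, 0.10],
    [0.8, 0.9, 0.85],
])
# Relevance grades for those pairs; here constructed so that the
# illustrative "true" weights 0.2/0.3/0.5 are exactly recoverable.
y = np.array([0.895, 0.38, 0.72, 0.18, 0.855])

# OLS estimate of the level weights: argmin_w ||X w - y||^2.
weights, *_ = np.linalg.lstsq(X, y, rcond=None)

# The aggregated similarity of a new query-service pair is the
# weighted sum of its per-level similarities.
def aggregate(levels):
    return float(np.dot(levels, weights))
```

Since the weight estimation only solves a small least-squares problem per query, its runtime contribution stays modest, which matches the small AQRT differences reported above.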

However, Table C.8 shows that the mean AQRT of Variant D is indeed slightly higher than those of Variants A–C. Regarding the different versions, Version 1 performs slightly better, as it does not require computations on the interface and operation levels, while Versions 2 and 3 perform slightly worse. However, the differences are too small to be of practical relevance. As the processing of SAWSDL service descriptions into ATSM models is the most time-consuming part of the query processing (cp. Section B.5), taking about 80% of the overall time, the runtime performance could be further improved if this processing was accelerated. A more detailed presentation of the AQRTs can be found in Appendix C.4. A comparison of these results with the results of COV4SWS.KOM and other matchmakers can be found in Section 3.4.3.

Figure 3.8: Performance of COV4SWS.KOM (Recall-Precision Curves) – Versions
[Four recall-precision plots (precision over recall, both axes from 0 to 1): (a) Version 1, (b) Version 2, (c) Version 3, (d) Version 4; each plot shows curves for Variants A, C, E, and F.]

As can be seen from Figure 3.8, the differences in the evaluation results are obvious, especially for Versions 1, 2, and 4. In contrast, Version 3 shows relatively similar recall-precision curves. Again, Friedman tests have been conducted to detect statistical differences between the results of different evaluation runs (cp. Appendix C.3). These tests confirm the heterogeneity visible in the curves: While for Version 3 there is no statistical difference between the evaluation results of all six variants, there is such a difference for all other versions. As has been the case for LOG4SWS.KOM, not every pair of variants from a certain version features statistically significant differences. This makes it generally difficult to draw definite conclusions from the data. The results of the Friedman tests indicate that for Versions 1 and 2, the differences for simResnik and simLin are rather small, while for Version 3, there is a significant difference between the MC-based Variants B and D. For Version 4, the results differ statistically significantly, as already indicated by the large gap between the evaluation results of Variants 4a/4c and Variants 4b/4d.
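The repeated-measures comparison performed by a Friedman test can be sketched as follows. The per-query AP values are hypothetical; the test itself is available in scipy.stats.

```python
from scipy.stats import friedmanchisquare

# Hypothetical per-query average precision values for three variants,
# measured on the same six queries (a repeated-measures design, as in
# the evaluation runs compared here).
variant_a = [0.62, 0.71, 0.55, 0.80, 0.66, 0.59]
variant_b = [0.60, 0.69, 0.54, 0.78, 0.65, 0.57]
variant_c = [0.45, 0.50, 0.40, 0.60, 0.48, 0.44]

# The Friedman test ranks the variants within each query and checks
# whether the mean ranks differ more than expected by chance.
stat, p = friedmanchisquare(variant_a, variant_b, variant_c)

# With the usual alpha = 0.05, a difference between the variants is
# considered statistically significant only if p < alpha.
significant = p < 0.05
```

Because the test only uses within-query ranks, it is robust against queries that are uniformly harder or easier than others, which is why it suits the per-query evaluation results discussed here.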

The pure ontology-based Version 1 possesses the worst evaluation results, as for every single performance metric, the other three versions feature better results. The only negligible exception is the RP of Variant 1f (0.634), which is 0.001 higher than its equivalent for Variant 4f. As can be seen from Figure 3.8a as well as from the relatively low P(5) and P(10) values, this can primarily be traced back to the low precision results for low recall levels, while for Variants D, E, and F, the precision values are acceptable

Figure 3.9: Performance of COV4SWS.KOM (Recall-Precision Curves) – Variants
[Six recall-precision plots (precision over recall, both axes from 0 to 1): (a) Variant A, (b) Variant B, (c) Variant C, (d) Variant D, (e) Variant E, (f) Variant F; each plot shows curves for Versions 1–4.]

Table 3.5: Evaluation Results for Different Versions/Variants of COV4SWS.KOM

#  | Similarity Metric Applied | Weights (Interface/Operation/Parameter Levels) | AP    | RP    | P(5)  | P(10) | AQRT (in ms)
1a | simResnik(Corp)           | 0.0/0.0/0.5                                    | 0.616 | 0.552 | 0.715 | 0.673 | 2322.19
1b | simResnik(MC)             | 0.0/0.0/0.5                                    | 0.586 | 0.558 | 0.715 | 0.650 | 2303.15
1c | simLin(Corp)              | 0.0/0.0/0.5                                    | 0.605 | 0.571 | 0.692 | 0.685 | 2366.33
1d | simLin(MC)                | 0.0/0.0/0.5                                    | 0.596 | 0.558 | 0.723 | 0.662 | 2389.38
1e | simPL                     | 0.0/0.0/0.5                                    | 0.682 | 0.608 | 0.823 | 0.754 | 2211.67
1f | simMC                     | 0.0/0.0/0.5                                    | 0.687 | 0.634 | 0.815 | 0.765 | 2099.77
2a | simResnik(Corp)           | 0.1/0.1/0.4                                    | 0.672 | 0.599 | 0.785 | 0.719 | 2496.46
2b | simResnik(MC)             | 0.1/0.1/0.4                                    | 0.642 | 0.593 | 0.769 | 0.700 | 2642.29
2c | simLin(Corp)              | 0.1/0.1/0.4                                    | 0.687 | 0.626 | 0.885 | 0.785 | 2562.00
2d | simLin(MC)                | 0.1/0.1/0.4                                    | 0.683 | 0.618 | 0.885 | 0.796 | 2729.56
2e | simPL                     | 0.1/0.1/0.4                                    | 0.728 | 0.655 | 0.946 | 0.804 | 2445.50
2f | simMC                     | 0.1/0.1/0.4                                    | 0.730 | 0.667 | 0.931 | 0.823 | 2376.75
3a | simResnik(Corp)           | 0.25/0.25/0.25                                 | 0.679 | 0.611 | 0.785 | 0.719 | 2592.50
3b | simResnik(MC)             | 0.25/0.25/0.25                                 | 0.648 | 0.589 | 0.777 | 0.696 | 2528.52
3c | simLin(Corp)              | 0.25/0.25/0.25                                 | 0.711 | 0.673 | 0.900 | 0.823 | 2714.12
3d | simLin(MC)                | 0.25/0.25/0.25                                 | 0.714 | 0.671 | 0.885 | 0.812 | 2696.08
3e | simPL                     | 0.25/0.25/0.25                                 | 0.726 | 0.658 | 0.915 | 0.842 | 2325.12
3f | simMC                     | 0.25/0.25/0.25                                 | 0.713 | 0.657 | 0.915 | 0.823 | 2352.90
4a | simResnik(Corp)           | OLS applied                                    | 0.746 | 0.686 | 0.923 | 0.835 | 2726.54
4b | simResnik(MC)             | OLS applied                                    | 0.733 | 0.678 | 0.931 | 0.815 | 2662.79
4c | simLin(Corp)              | OLS applied                                    | 0.704 | 0.655 | 0.915 | 0.819 | 2810.60
4d | simLin(MC)                | OLS applied                                    | 0.707 | 0.651 | 0.931 | 0.819 | 2786.58
4e | simPL                     | OLS applied                                    | 0.706 | 0.628 | 0.915 | 0.796 | 2185.35
4f | simMC                     | OLS applied                                    | 0.702 | 0.633 | 0.915 | 0.831 | 2446.00

for higher recall levels (cp. Figure 3.9). For Variants A to C, Version 1 features the worst precision values for nearly all recall levels. The precision values for low recall levels for Version 1 confirm the finding for LOG4SWS.KOM that for these recall levels, a pure semantic description of service components is not sufficient – even though COV4SWS.KOM applies different similarity measures for semantics.
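The precision measures referred to here and in Table 3.5 can be sketched as follows; the ranking and the relevance set are made up for illustration. P(k) is the fraction of relevant services among the top-k results, and R-precision (RP) is P(R), where R is the number of relevant services for the query.

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k ranked services that are relevant."""
    return sum(1 for s in ranking[:k] if s in relevant) / k

def r_precision(ranking, relevant):
    """P(R) with R = number of relevant services for this query."""
    return precision_at_k(ranking, relevant, len(relevant))

# Hypothetical ranked result list of a matchmaker and the set of
# services judged relevant for the query.
ranking = ["s1", "s4", "s2", "s7", "s3", "s9", "s5"]
relevant = {"s1", "s2", "s3", "s5"}

p5 = precision_at_k(ranking, relevant, 5)  # 3 of the top 5 are relevant
rp = r_precision(ranking, relevant)        # 2 of the top 4 are relevant
```

Low P(5) and P(10) values thus directly indicate that irrelevant services appear near the top of the ranking, which is the behavior attributed to Version 1 above.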

As already observed in the evaluation of LOG4SWS.KOM, the consideration of further service abstraction levels in Versions 2 and 3 leads to an improvement of the evaluation results. However, in contrast to LOG4SWS.KOM, the higher weights for the interface and operation levels in Version 3 (compared to Version 2) do not generally lead to a decline in evaluation results: Regarding AP, RP, P(5), and P(10), Variants 3a–d generally outperform Variants 2a–d, with RP and P(10) for Variants 2b/3b being the only (insignificant) exceptions. For Variant E, the values vary without very large differences; for Variant F, Version 2 clearly provides better results than Version 3.

Regarding the OLS-based weighting of service abstraction levels in Version 4, the evaluation has led to mixed results. While very good results have been achieved for Variant 4a, the integration of automatic level weighting leads to a degradation of the evaluation results for Variants E and F. For Variants C and D, OLS leads to mediocre results, which are not as good as the results for Variants 3c and 3d.

For Version 1, simResnik and simLin feature very similar evaluation results if the corpus-based and MC-based values are compared pairwise. There is no clear pattern regarding which variants perform better, while for Versions 2 and 3, simLin outperforms simResnik for all numbers observed. While there is no observable improvement of the evaluation results from Version 2 to 3 for simResnik, the values of simLin are clearly better for Version 3 compared to Version 2 and also outperform the results of OLS-based level weighting in Version 4 (cp. Table 3.5). The comparatively weak performance of simResnik can be traced back to the fact that these versions incorporate different service abstraction levels; as simResnik is not normalized to the range [0..1], this metric is given a disproportionately high weight, leading to overall worse evaluation results. However, this issue is solved by the OLS-based weighting of abstraction levels in Version 4; here, the missing normalization is compensated by the automatic weighting. Hence, simResnik is adjusted to the similarity measures from the other service abstraction levels. This leads to the overall best AP and RP values for Variant 4a – even if compared to LOG4SWS.KOM. As can be seen from Figure 3.8d, these values are the result of an extraordinarily good performance for low and middle recall levels.
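The normalization difference between the two metrics can be illustrated with toy information-content values; all concept names and numbers below are hypothetical. Resnik's similarity is the information content (IC) of the least common subsumer and is unbounded, whereas Lin's similarity divides by the concepts' own IC and therefore always falls into [0..1].

```python
# Toy information-content values (IC = -log p(concept)) for a tiny
# taxonomy; names and numbers are illustrative only.
ic = {"thing": 0.0, "vehicle": 1.2, "car": 2.5, "bike": 2.3}

def sim_resnik(lcs):
    # Resnik: IC of the least common subsumer -- not bounded by 1.
    return ic[lcs]

def sim_lin(a, b, lcs):
    # Lin: normalized by the compared concepts' IC, always in [0, 1].
    return 2 * ic[lcs] / (ic[a] + ic[b])

resnik = sim_resnik("vehicle")           # IC value, may exceed 1
lin = sim_lin("car", "bike", "vehicle")  # normalized score
```

When per-level similarities are summed with fixed weights, an unbounded metric on one level can dominate the bounded metrics on the other levels, which is exactly the distortion that the OLS-based weighting in Version 4 compensates for.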

The variety and number of evaluation results make it difficult to derive generally valid conclusions. However, there are certain notable outcomes: First of all, the heterogeneity of the results shows that there is no generally best matchmaking method for Web services – if the evaluation results are reduced to the similarity measures applied, path length, mutual coverage, and the similarity measure of Resnik all lead to the best values under certain conditions. As a second conclusion, we deduce that the OLS-based optimization of service abstraction level weights does not necessarily lead to an improvement of matchmaking results. In fact, only Variants A and B generally benefit in Version 4. Interestingly, OLS-based weights lead to an improvement of P(5) in every case, i.e., for low recall levels. Third, pure service signature matching as applied in Version 1 leads to the worst evaluation results, confirming the findings for LOG4SWS.KOM.

Finally, we would like to dwell on simMC, which is – in contrast to the other applied similarity metrics – not a general similarity metric but depends on certain assumptions typical of service matchmaking. Variant F, i.e., simMC, leads to generally good results and features the best AP and RP values for Version 2. Hence, we deduce that this similarity metric is worth pursuing.

All things considered, COV4SWS.KOM performs very well if the fallback strategy is applied. In our opinion, the results show that the usage of OLS and of metrics that are usually used to determine semantic relatedness might be a strategy to improve ontology-based matchmaking results and to overcome the issue that current ontologies (like those from SAWSDL-TC) are often only simple taxonomies that do not rely on advanced features provided by, e.g., OWL DL, and that offer only coarse-grained concept descriptions [132].

Again, the runtime performance of the matchmaker at hand in terms of AQRT has been evaluated. An overview of the macro-averaged AQRT (median of the tests conducted) for the different versions and variants of COV4SWS.KOM can be found in Table 3.5. As can be seen, Version 1 generally features the lowest AQRT values, Versions 2 and 3 feature similar values, and Version 4 features the highest AQRTs. Similar to the values observed for LOG4SWS.KOM, the differences can be attributed to the weightings of service abstraction levels. While Version 1 only incorporates the service signature, Versions 2 and 3 also compute similarity values for the interface and operation levels, causing a higher matchmaking runtime. In Version 4, the weights first have to be determined using OLS before the actual weighting is conducted, leading to the highest AQRT values. Regarding the different variants, Variants A and B perform quite similarly, as do Variants C and D. This is not surprising, as these pairs of variants make use of exactly the same methods but operate on different probability values. The latter variants perform slightly worse, as their similarity computation is somewhat more complex (cp. Equations 3.18 and 3.20). Variants E and F possess the best AQRT values, as they make use of caches. Again, the differences are quite small. The same conclusions can be drawn as for LOG4SWS.KOM – as the largest part of the matchmaking process duration is attributable to the transformation of the query from SAWSDL to ATSM, the runtime could primarily be accelerated by improving this transformation. Further runtime evaluation results can be found in Appendix C.4.
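The kind of caching that benefits Variants E and F can be sketched with simple memoization; the similarity function and its return values below are placeholders, not the actual implementation.

```python
from functools import lru_cache

# Counter to make the caching effect observable.
CALLS = {"count": 0}

@lru_cache(maxsize=None)
def concept_similarity(a, b):
    """Stand-in for an expensive ontology-based similarity lookup."""
    CALLS["count"] += 1
    return 1.0 if a == b else 0.4  # placeholder similarity value

# During matchmaking, the same concept pair is typically compared many
# times (e.g., across operations and interfaces); with the cache, the
# expensive computation runs only once per distinct pair.
for _ in range(1000):
    concept_similarity("car", "vehicle")
```

Since the cache only pays off for repeated pairs, its effect on the AQRT is modest, which is consistent with the small differences between the variants reported above.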