
3.4 Evaluation and Discussion

3.4.2 Evaluation of COV4SWS.KOM

were not available on interface and operation levels; as a result, a relatively low weighting of these levels considerably improves matchmaking results (Version 2), but if these levels are weighted too heavily, the evaluation numbers start to decrease (Version 3).

Apart from the evaluation regarding IR performance measures, the runtime performance in terms of AQRT has also been evaluated. The AQRT significantly depends on optional preprocessing steps, which are presented in Section B.5. It is important to note that the derivation of weightings based on OLS estimation is conducted at runtime, because the cross-validation requires knowledge of the respective current query. An overview of the macro-averaged AQRT (median of the test runs conducted) for the different versions and variants of LOG4SWS.KOM can be found in Table 3.4. As can be seen, Variants A–D feature relatively similar median AQRTs. Hence, it can be assumed that the identification of numerical equivalents based on OLS, which is conducted at runtime, does not increase the runtime to a large degree.
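The OLS-based derivation of level weights can be sketched as follows. This is a minimal illustration with made-up per-level similarity scores and relevance grades; in the actual matchmaker, the grades stem from the cross-validation over the current query mentioned above.

```python
import numpy as np

# Hypothetical per-level similarity scores for five query-service pairs
# (columns: interface, operation, parameter level).
X = np.array([
    [0.9, 0.8, 0.95],
    [0.4, 0.5, 0.30],
    [0.7, 0.6, 0.80],
    [0.2, 0.3, 0.10],
    [0.8, 0.9, 0.85],
])
# Relevance grades for those pairs; here constructed so that the
# illustrative "true" weights 0.2/0.3/0.5 are exactly recoverable.
y = np.array([0.895, 0.38, 0.72, 0.18, 0.855])

# OLS estimate of the level weights: argmin_w ||X w - y||^2.
weights, *_ = np.linalg.lstsq(X, y, rcond=None)

# The aggregated similarity of a new query-service pair is the
# weighted sum of its per-level similarities.
def aggregate(levels):
    return float(np.dot(levels, weights))
```

Since the weight estimation only solves a small least-squares problem per query, its runtime contribution stays modest, which matches the small AQRT differences reported above.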

However, Table C.8 shows that the mean AQRT of Variant D is indeed slightly higher than those of Variants A–C. Regarding the different versions, Version 1 performs slightly better, as it does not require computations on the interface and operation levels, while Versions 2 and 3 perform slightly worse. However, the differences are too small to be of practical relevance. As the processing of SAWSDL service descriptions into ATSM models is the most time-consuming part of the query processing (cp. Section B.5), taking about 80% of the overall time, the runtime performance could be further improved if this processing was accelerated. A more detailed presentation of the AQRTs can be found in Appendix C.4. A comparison of these results with the results of COV4SWS.KOM and other matchmakers can be found in Section 3.4.3.

Figure 3.8: Performance of COV4SWS.KOM (Recall-Precision Curves) – Versions
[Four recall-precision plots (precision over recall, both axes from 0 to 1): (a) Version 1, (b) Version 2, (c) Version 3, (d) Version 4; each plot shows curves for Variants A, C, E, and F.]

As can be seen from Figure 3.8, the differences in the evaluation results are obvious, especially for Versions 1, 2, and 4. In contrast, Version 3 shows relatively similar recall-precision curves. Again, Friedman tests have been conducted to detect statistical differences between the results of different evaluation runs (cp. Appendix C.3). These tests confirm the heterogeneity visible in the curves: While for Version 3 there is no statistical difference between the evaluation results of all six variants, there is such a difference for all other versions. As has been the case for LOG4SWS.KOM, not every pair of variants from a certain version features statistically significant differences. This makes it generally difficult to draw definite conclusions from the data. The results of the Friedman tests indicate that for Versions 1 and 2, the differences for simResnik and simLin are rather small, while for Version 3, there is a significant difference between the MC-based Variants B and D. For Version 4, the results differ statistically significantly, as already indicated by the large gap between the evaluation results of Variants 4a/4c and Variants 4b/4d.
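The repeated-measures comparison performed by a Friedman test can be sketched as follows. The per-query AP values are hypothetical; the test itself is available in scipy.stats.

```python
from scipy.stats import friedmanchisquare

# Hypothetical per-query average precision values for three variants,
# measured on the same six queries (a repeated-measures design, as in
# the evaluation runs compared here).
variant_a = [0.62, 0.71, 0.55, 0.80, 0.66, 0.59]
variant_b = [0.60, 0.69, 0.54, 0.78, 0.65, 0.57]
variant_c = [0.45, 0.50, 0.40, 0.60, 0.48, 0.44]

# The Friedman test ranks the variants within each query and checks
# whether the mean ranks differ more than expected by chance.
stat, p = friedmanchisquare(variant_a, variant_b, variant_c)

# With the usual alpha = 0.05, a difference between the variants is
# considered statistically significant only if p < alpha.
significant = p < 0.05
```

Because the test only uses within-query ranks, it is robust against queries that are uniformly harder or easier than others, which is why it suits the per-query evaluation results discussed here.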

The pure ontology-based Version 1 possesses the worst evaluation results, as for every single performance metric, the other three versions feature better results. The only negligible exception is the RP of Variant 1f (0.634), which is 0.001 higher than its equivalent for Variant 4f. As can be seen from Figure 3.8a as well as from the relatively low P(5) and P(10) values, this can primarily be traced back to the low precision results for low recall levels, while for Variants D, E, and F, the precision values are acceptable

Figure 3.9: Performance of COV4SWS.KOM (Recall-Precision Curves) – Variants
[Six recall-precision plots (precision over recall, both axes from 0 to 1): (a) Variant A, (b) Variant B, (c) Variant C, (d) Variant D, (e) Variant E, (f) Variant F; each plot shows curves for Versions 1–4.]

Table 3.5: Evaluation Results for Different Versions/Variants of COV4SWS.KOM

#  | Similarity Metric Applied | Weights (Interface/Operation/Parameter Levels) | AP    | RP    | P(5)  | P(10) | AQRT (in ms)
1a | simResnik(Corp)           | 0.0/0.0/0.5                                    | 0.616 | 0.552 | 0.715 | 0.673 | 2322.19
1b | simResnik(MC)             | 0.0/0.0/0.5                                    | 0.586 | 0.558 | 0.715 | 0.650 | 2303.15
1c | simLin(Corp)              | 0.0/0.0/0.5                                    | 0.605 | 0.571 | 0.692 | 0.685 | 2366.33
1d | simLin(MC)                | 0.0/0.0/0.5                                    | 0.596 | 0.558 | 0.723 | 0.662 | 2389.38
1e | simPL                     | 0.0/0.0/0.5                                    | 0.682 | 0.608 | 0.823 | 0.754 | 2211.67
1f | simMC                     | 0.0/0.0/0.5                                    | 0.687 | 0.634 | 0.815 | 0.765 | 2099.77
2a | simResnik(Corp)           | 0.1/0.1/0.4                                    | 0.672 | 0.599 | 0.785 | 0.719 | 2496.46
2b | simResnik(MC)             | 0.1/0.1/0.4                                    | 0.642 | 0.593 | 0.769 | 0.700 | 2642.29
2c | simLin(Corp)              | 0.1/0.1/0.4                                    | 0.687 | 0.626 | 0.885 | 0.785 | 2562.00
2d | simLin(MC)                | 0.1/0.1/0.4                                    | 0.683 | 0.618 | 0.885 | 0.796 | 2729.56
2e | simPL                     | 0.1/0.1/0.4                                    | 0.728 | 0.655 | 0.946 | 0.804 | 2445.50
2f | simMC                     | 0.1/0.1/0.4                                    | 0.730 | 0.667 | 0.931 | 0.823 | 2376.75
3a | simResnik(Corp)           | 0.25/0.25/0.25                                 | 0.679 | 0.611 | 0.785 | 0.719 | 2592.50
3b | simResnik(MC)             | 0.25/0.25/0.25                                 | 0.648 | 0.589 | 0.777 | 0.696 | 2528.52
3c | simLin(Corp)              | 0.25/0.25/0.25                                 | 0.711 | 0.673 | 0.900 | 0.823 | 2714.12
3d | simLin(MC)                | 0.25/0.25/0.25                                 | 0.714 | 0.671 | 0.885 | 0.812 | 2696.08
3e | simPL                     | 0.25/0.25/0.25                                 | 0.726 | 0.658 | 0.915 | 0.842 | 2325.12
3f | simMC                     | 0.25/0.25/0.25                                 | 0.713 | 0.657 | 0.915 | 0.823 | 2352.90
4a | simResnik(Corp)           | OLS applied                                    | 0.746 | 0.686 | 0.923 | 0.835 | 2726.54
4b | simResnik(MC)             | OLS applied                                    | 0.733 | 0.678 | 0.931 | 0.815 | 2662.79
4c | simLin(Corp)              | OLS applied                                    | 0.704 | 0.655 | 0.915 | 0.819 | 2810.60
4d | simLin(MC)                | OLS applied                                    | 0.707 | 0.651 | 0.931 | 0.819 | 2786.58
4e | simPL                     | OLS applied                                    | 0.706 | 0.628 | 0.915 | 0.796 | 2185.35
4f | simMC                     | OLS applied                                    | 0.702 | 0.633 | 0.915 | 0.831 | 2446.00

for higher recall levels (cp. Figure 3.9). For Variants A to C, Version 1 features the worst precision values for nearly all recall levels. The precision values for low recall levels for Version 1 confirm the finding for LOG4SWS.KOM that for these recall levels, a pure semantic description of service components is not sufficient – even though COV4SWS.KOM applies different similarity measures for semantics.
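The precision measures referred to here and in Table 3.5 can be sketched as follows; the ranking and the relevance set are made up for illustration. P(k) is the fraction of relevant services among the top-k results, and R-precision (RP) is P(R), where R is the number of relevant services for the query.

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k ranked services that are relevant."""
    return sum(1 for s in ranking[:k] if s in relevant) / k

def r_precision(ranking, relevant):
    """P(R) with R = number of relevant services for this query."""
    return precision_at_k(ranking, relevant, len(relevant))

# Hypothetical ranked result list of a matchmaker and the set of
# services judged relevant for the query.
ranking = ["s1", "s4", "s2", "s7", "s3", "s9", "s5"]
relevant = {"s1", "s2", "s3", "s5"}

p5 = precision_at_k(ranking, relevant, 5)  # 3 of the top 5 are relevant
rp = r_precision(ranking, relevant)        # 2 of the top 4 are relevant
```

Low P(5) and P(10) values thus directly indicate that irrelevant services appear near the top of the ranking, which is the behavior attributed to Version 1 above.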

As already observed in the evaluation of LOG4SWS.KOM, the consideration of further service abstraction levels in Versions 2 and 3 leads to an improvement of the evaluation results. However, in contrast to LOG4SWS.KOM, the higher weights for the interface and operation levels in Version 3 (compared to Version 2) do not generally lead to a decline in evaluation results: Regarding AP, RP, P(5), and P(10), Variants 3a–d generally outperform Variants 2a–d, with RP and P(10) for Variants 2b/3b being the only (insignificant) exceptions. For Variant E, the values vary without very large differences; for Variant F, Version 2 clearly provides better results than Version 3.

Regarding the OLS-based weighting of service abstraction levels in Version 4, the evaluation has led to mixed results. While very good results have been achieved for Variant 4a, the integration of automatic level weighting leads to a degradation of the evaluation results for Variants E and F. For Variants C and D, OLS leads to mediocre results, which are not as good as the results for Variants 3c and 3d.

For Version 1, simResnik and simLin feature very similar evaluation results if the corpus-based and MC-based values are compared pairwise. There is no clear pattern regarding which variants perform better, while for Versions 2 and 3, simLin outperforms simResnik for all numbers observed. While there is no observable improvement of the evaluation results from Version 2 to 3 for simResnik, the values of simLin are clearly better for Version 3 compared to Version 2 and also outperform the results of OLS-based level weighting in Version 4 (cp. Table 3.5). The comparatively weak performance of simResnik can be traced back to the fact that these versions incorporate different service abstraction levels; as simResnik is not normalized to the range [0..1], this metric is given a disproportionately high weight, leading to overall worse evaluation results. However, this issue is solved by the OLS-based weighting of abstraction levels in Version 4; here, the missing normalization is compensated by the automatic weighting. Hence, simResnik is adjusted to the similarity measures from the other service abstraction levels. This leads to the overall best AP and RP values for Variant 4a – even if compared to LOG4SWS.KOM. As can be seen from Figure 3.8d, these values are the result of an extraordinarily good performance for low and middle recall levels.
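The normalization difference between the two metrics can be illustrated with toy information-content values; all concept names and numbers below are hypothetical. Resnik's similarity is the information content (IC) of the least common subsumer and is unbounded, whereas Lin's similarity divides by the concepts' own IC and therefore always falls into [0..1].

```python
# Toy information-content values (IC = -log p(concept)) for a tiny
# taxonomy; names and numbers are illustrative only.
ic = {"thing": 0.0, "vehicle": 1.2, "car": 2.5, "bike": 2.3}

def sim_resnik(lcs):
    # Resnik: IC of the least common subsumer -- not bounded by 1.
    return ic[lcs]

def sim_lin(a, b, lcs):
    # Lin: normalized by the compared concepts' IC, always in [0, 1].
    return 2 * ic[lcs] / (ic[a] + ic[b])

resnik = sim_resnik("vehicle")           # IC value, may exceed 1
lin = sim_lin("car", "bike", "vehicle")  # normalized score
```

When per-level similarities are summed with fixed weights, an unbounded metric on one level can dominate the bounded metrics on the other levels, which is exactly the distortion that the OLS-based weighting in Version 4 compensates for.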

The variety and number of evaluation results make it difficult to derive generally valid conclusions. However, there are certain notable outcomes: First of all, the heterogeneity of the results shows that there is no generally best matchmaking method for Web services – if the evaluation results are reduced to the similarity measures applied, path length, mutual coverage, and the similarity measure of Resnik all lead to the best values under certain conditions. As a second conclusion, we deduce that the OLS-based optimization of service abstraction level weights does not necessarily lead to an improvement of matchmaking results. In fact, only Variants A and B generally benefit in Version 4. Interestingly, OLS-based weights lead to an improvement of P(5) in every case, i.e., for low recall levels. Third, pure service signature matching as applied in Version 1 leads to the worst evaluation results, confirming the findings for LOG4SWS.KOM.

Finally, we would like to dwell on simMC, which is – in contrast to the other applied similarity metrics – not a general similarity metric but depends on certain assumptions typical of service matchmaking. Variant F, i.e., simMC, leads to generally good results and features the best AP and RP values for Version 2. Hence, we deduce that this similarity metric is worth pursuing.

All things considered, COV4SWS.KOM performs very well if the fallback strategy is applied. In our opinion, the results show that the usage of OLS and of metrics that are usually used to determine semantic relatedness might be a strategy to improve ontology-based matchmaking results and to overcome the issue that current ontologies (like those from SAWSDL-TC) are often only simple taxonomies that do not rely on advanced features provided by, e.g., OWL DL, and that offer only coarse-grained concept descriptions [132].

Again, the runtime performance of the matchmaker at hand in terms of AQRT has been evaluated. An overview of the macro-averaged AQRT (median of the tests conducted) for the different versions and variants of COV4SWS.KOM can be found in Table 3.5. As can be seen, Version 1 generally features the lowest AQRT values, Versions 2 and 3 feature similar values, and Version 4 features the highest AQRTs. Similar to the values observed for LOG4SWS.KOM, the differences can be attributed to the weightings of service abstraction levels. While Version 1 only incorporates the service signature, Versions 2 and 3 also compute similarity values for the interface and operation levels, causing a higher matchmaking runtime. In Version 4, the weights first have to be determined using OLS before the actual weighting is conducted, leading to the highest AQRT values. Regarding the different variants, Variants A and B perform quite similarly, as do Variants C and D. This is not surprising, as these pairs of variants make use of exactly the same methods but operate on different probability values. The latter variants perform slightly worse, as their similarity computation is somewhat more complex (cp. Equations 3.18 and 3.20). Variants E and F possess the best AQRT values, as they make use of caches. Again, the differences are quite small. The same conclusions can be drawn as for LOG4SWS.KOM – as the largest part of the matchmaking process duration is attributable to the transformation of the query from SAWSDL to ATSM, the runtime could primarily be accelerated by improving this transformation. Further runtime evaluation results can be found in Appendix C.4.
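The kind of caching that benefits Variants E and F can be sketched with simple memoization; the similarity function and its return values below are placeholders, not the actual implementation.

```python
from functools import lru_cache

# Counter to make the caching effect observable.
CALLS = {"count": 0}

@lru_cache(maxsize=None)
def concept_similarity(a, b):
    """Stand-in for an expensive ontology-based similarity lookup."""
    CALLS["count"] += 1
    return 1.0 if a == b else 0.4  # placeholder similarity value

# During matchmaking, the same concept pair is typically compared many
# times (e.g., across operations and interfaces); with the cache, the
# expensive computation runs only once per distinct pair.
for _ in range(1000):
    concept_similarity("car", "vehicle")
```

Since the cache only pays off for repeated pairs, its effect on the AQRT is modest, which is consistent with the small differences between the variants reported above.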