
3.4 Evaluation and Discussion

3.4.3 Discussion

In addition to the improvement of evaluation results from Version 2 to 3 for simResnik, the values of simLin are clearly better for Version 3 compared to Version 2 and also outperform the results of OLS-based level weighting in Version 4 (cp. Table 3.5). The suboptimal performance of simResnik can be traced back to the fact that these versions incorporate different service abstraction levels; as simResnik is not normalized to the range [0..1], this metric is given a disproportionately high weight, leading to overall worse evaluation results. However, this issue is solved by the OLS-based weighting of abstraction levels in Version 4; here, the missing normalization is counterbalanced by the automatic weighting. Hence, simResnik is adjusted to the similarity measures from the other service abstraction levels. This leads to the overall best AP and RP values for Variant 4a – even if compared to LOG4SWS.KOM. As can be seen from Figure 3.8d, these values are the result of an extraordinarily good performance for low and middle recall levels.
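The normalization issue follows directly from the standard definitions of the two measures by Resnik and Lin (restated here for reference; the notation may differ slightly from the one used earlier in this chapter), where lcs(c1, c2) denotes the least common subsumer of two concepts and IC(c) = -log p(c) the information content of a concept:

\[
\mathit{sim}_{Resnik}(c_1, c_2) = IC(lcs(c_1, c_2)), \qquad
\mathit{sim}_{Lin}(c_1, c_2) = \frac{2 \cdot IC(lcs(c_1, c_2))}{IC(c_1) + IC(c_2)}
\]

While simLin is bounded to [0..1] by construction, simResnik is only bounded from below by 0 and can grow arbitrarily large, which explains the disproportionately high weight it receives when combined with normalized measures.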

The variety and number of evaluation results make it difficult to derive generally valid conclusions.

However, there are certain notable outcomes: First of all, the heterogeneity of the results shows that there is no generally best matchmaking method for Web services – when reducing the evaluation results to the similarity measures applied, path length, mutual coverage, and the similarity measure of Resnik each lead to the best values under certain conditions. As a second conclusion, we deduce that the OLS-based optimization of service abstraction level weights does not necessarily lead to an improvement of matchmaking results.

In fact, only Variants A and B generally benefit in Version 4. Interestingly, OLS-based weights lead to an improvement of P(5) in every case, i.e., for low recall levels. Third, purely signature-based matching as applied in Version 1 leads to the worst evaluation results, confirming the findings from LOG4SWS.KOM.
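To illustrate the OLS-based determination of level weights, consider the following minimal sketch in Python (with purely illustrative numbers; the actual feature setup and training data of COV4SWS.KOM are not reproduced here), which regresses graded relevance judgments on per-level similarity values:

```python
# Minimal sketch: OLS-based estimation of service abstraction level
# weights. All numbers are illustrative, not taken from the evaluation.
import numpy as np

# Per-level similarities (service, interface, operation level) for a
# small set of request/offer pairs from a hypothetical training set.
X = np.array([
    [0.9, 0.8, 0.7],
    [0.4, 0.5, 0.3],
    [0.7, 0.6, 0.9],
    [0.2, 0.1, 0.2],
])
# Graded relevance judgments for the same pairs.
y = np.array([1.0, 0.4, 0.9, 0.1])

# Ordinary least squares: find w minimizing ||Xw - y||^2.
w, *_ = np.linalg.lstsq(X, y, rcond=None)

def overall_similarity(level_sims):
    """Weighted combination of per-level similarity values."""
    return float(np.dot(level_sims, w))

print("estimated level weights:", w)
print("combined similarity:", overall_similarity([0.8, 0.7, 0.6]))
```

A non-normalized measure such as simResnik simply receives a correspondingly smaller weight in this scheme, which is the adjustment effect described above.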

Finally, we would like to dwell on simMC, which is – in contrast to the other applied similarity metrics – not a general similarity metric, but depends on certain assumptions typical of service matchmaking.

Variant F, i.e., simMC, leads to generally good results and features the best AP and RP values for Version 2. Hence, we conclude that this similarity metric is worth pursuing.
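As an illustration of the general idea behind a mutual-coverage-style measure (a simplified sketch, not the exact definition of simMC used in COV4SWS.KOM), both sides' annotations are matched against the best counterpart on the other side, and the two directed coverage scores are averaged:

```python
# Simplified sketch of a mutual-coverage-style similarity; the exact
# definition of sim_MC in COV4SWS.KOM is not reproduced here.

def mutual_coverage(request_concepts, offer_concepts, concept_sim):
    """Average of the two directed coverage scores.

    concept_sim(a, b) may be any pairwise concept similarity in [0, 1].
    """
    def coverage(source, target):
        # Degree to which each source concept is matched by its best
        # counterpart in the target set.
        if not source or not target:
            return 0.0
        return sum(max(concept_sim(s, t) for t in target)
                   for s in source) / len(source)

    return 0.5 * (coverage(request_concepts, offer_concepts)
                  + coverage(offer_concepts, request_concepts))

# Toy usage with exact-match concept similarity.
exact = lambda a, b: 1.0 if a == b else 0.0
print(mutual_coverage({"Book", "Price"}, {"Book", "Author"}, exact))  # 0.5
```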

All things considered, COV4SWS.KOM performs very well if the fallback strategy is applied. In our opinion, the results show that the usage of OLS and of metrics usually applied to determine semantic relatedness might be a strategy to improve ontology-based matchmaking results and to overcome the issue that current ontologies (like those from SAWSDL-TC) are often only simple taxonomies that do not rely on advanced features provided by, e.g., OWL DL, and that only coarse-grained concept descriptions are available [132].

Again, the runtime performance of the matchmaker at hand has been evaluated in terms of the AQRT.

An overview of the macro-averaged AQRT (median of the tests conducted) for the different versions and variants of COV4SWS.KOM can be found in Table 3.5. As can be seen, Version 1 generally features the lowest AQRT values, Versions 2 and 3 feature similar values, and Version 4 features the highest AQRTs.

Similar to the values observed for LOG4SWS.KOM, the differences can be attributed to the weightings of the service abstraction levels. While Version 1 only incorporates the service signature, Versions 2 and 3 also compute similarity values for the interface and operation levels, causing a higher matchmaking runtime.

In Version 4, it is first necessary to determine the weights using OLS before the actual weighting is conducted, leading to the highest AQRT values. Regarding the different variants, Variants A and B perform quite similarly, as do Variants C and D. This is not surprising, as these pairs of variants make use of exactly the same methods but operate on different probability values. The latter pair performs slightly worse, as the similarity computation is slightly more complex (cp. Equations 3.18 and 3.20). Variants E and F possess the best AQRT values, as they make use of caches. Again, the differences are quite small. The same conclusions can be drawn as for LOG4SWS.KOM – as the largest part of the matchmaking process duration is attributable to the transformation of the query from SAWSDL to ATSM, the runtime could primarily be accelerated by improving this transformation. Further runtime evaluation results can be found in Appendix C.4.
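The effect of the caches in Variants E and F can be illustrated by a minimal sketch (hypothetical helper names and cost model; the actual implementation is not reproduced here): pairwise concept similarities are expensive to compute, but identical concept pairs recur across queries, so memoizing them avoids repeated ontology traversals:

```python
# Minimal sketch of memoizing pairwise concept similarities; the helper
# names and the cost model are hypothetical.
from functools import lru_cache
import time

def expensive_similarity(a: str, b: str) -> float:
    """Stand-in for a costly ontology traversal or reasoning step."""
    time.sleep(0.01)  # simulate computation cost
    return 1.0 if a == b else 0.5

@lru_cache(maxsize=None)
def cached_similarity(a: str, b: str) -> float:
    # Normalize the pair order so that (x, y) and (y, x) share one
    # cache entry, assuming the underlying measure is symmetric.
    if a > b:
        a, b = b, a
    return expensive_similarity(a, b)

# Repeated queries for the same concept pair hit the cache.
cached_similarity("Book", "Novel")
cached_similarity("Novel", "Book")  # served from the cache
```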

Table 3.6: Comparison of AP of Matchmakers for SAWSDL

Matchmaker                  Adaptive       AP     RP     P(5)   P(10)
SAWSDL-M0 [132, 140]        no             0.400  N/A    N/A    N/A
SAWSDL-MX2 [140, 142]       automatically  0.679  N/A    N/A    N/A
COV4SWS.KOM (Variant 1f)    manually       0.687  0.634  0.815  0.765
LOG4SWS.KOM (Variant 1d)    automatically  0.724  0.660  0.846  0.765
URBE [142, 212]             manually       0.727  0.651  0.867  0.796
COV4SWS.KOM (Variant 2f)    manually       0.730  0.667  0.931  0.823
LOG4SWS.KOM (Variant 2d)    automatically  0.739  0.681  0.962  0.835
COV4SWS.KOM (Variant 4a)    automatically  0.746  0.686  0.923  0.835

Regarding Versions 1–3, LOG4SWS.KOM generally outperforms COV4SWS.KOM. However, COV4SWS.KOM Variant 4a is overall the best matchmaker regarding AP and RP – at least for the applied test data set. While in COV4SWS.KOM, "only" Variants 4a and 4b benefit from the application of OLS, for LOG4SWS.KOM, OLS generally leads to improved matchmaking results, even though the AP of Variant 3c is slightly worse than that of Variant 3d.

On the one hand, the best values for COV4SWS.KOM Variants C–F have been achieved when the service abstraction level weights were manually tuned and set. On the other hand, such a manual task could entail an extensive amount of work, especially if the regarded service landscape changes quickly. Even though COV4SWS.KOM sometimes provides worse matchmaking results, it comes with the benefit of a direct adaptation to differing degrees of semantic and syntactic description "richness" on the individual service abstraction levels. Such an adaptation is also conducted in LOG4SWS.KOM – however, not directly.

All things considered, both matchmakers perform very well, regardless of whether the fallback strategy or the adaptation mechanisms are applied. To the best of our knowledge and compared with the results of the S3 Contest 2009 [142], LOG4SWS.KOM and COV4SWS.KOM provide the best matchmaking results of any SAWSDL matchmaker so far. In contrast to URBE and SAWSDL-MX2, our matchmakers make use of text similarities only as a substitute if an ontology-based matchmaking cannot be carried out. In particular, the logic-based version of LOG4SWS.KOM features a higher certainty than the IR methods used in URBE and SAWSDL-MX2. Furthermore, the purely ontology-based matchmakers, namely LOG4SWS.KOM Variant 1d and COV4SWS.KOM Variant 1f, provide much better results than SAWSDL-M0, which is, to the best of our knowledge, the only other purely ontology-based SAWSDL matchmaker with comparable evaluation results [132, 140].

When comparing the recall-precision curves of LOG4SWS.KOM and COV4SWS.KOM with those of URBE and SAWSDL-MX2 [142], it is striking that our matchmakers provide much better precision results for high recall levels. For example, SAWSDL-MX and URBE feature a precision of about 0.20 at recall = 1.0, while the best variant of LOG4SWS.KOM (2d) features a value of 0.430 and Variant 4a of COV4SWS.KOM still features a value of 0.404. Nevertheless, the RP, P(5), and P(10) values show that the matchmakers presented in this thesis are also able to improve matchmaking results for low and middle recall levels (compared to URBE).

In addition to the evaluation results, we want to highlight and discuss the impact and consequences of applying numerical subsumption DoMs and numerical values based on semantic relatedness, respectively, instead of the usually applied subsumption matching similarity values as presented by Paolucci et al. and adapted by, e.g., [45, 140, 164, 238]. Usually, subsumption matching applies DoMs to discrete elements in a service description and defines the minimum DoM found as the overall service (or operation) DoM.

This leads to a quite coarse-grained, discrete scale of possible service DoMs. As Fernández et al. state, these DoMs are not sufficiently fine-grained [82]. The discrete scale of possible service DoMs further implies that, in order to further rank service offers based on a service request, additional techniques like, e.g., text similarity need to be applied. A continuous, numerical measure like the ones applied in LOG4SWS.KOM and COV4SWS.KOM allows for a more precise ranking of services. In COV4SWS.KOM, the numerical similarity is directly computed, which permits an easy combination with other measures.

LOG4SWS.KOM and COV4SWS.KOM implicitly account for differing path lengths between concepts and thereby effectively penalize overly generic semantic annotations, which may otherwise lead to wrong search results [17]. Furthermore, we do not have to explicitly treat inputs and outputs differently. Related work often makes such a distinction, which also requires somewhat arbitrary assumptions regarding a ranking, depending on the type of subsumption relation (cp. Section 3.1.2). In COV4SWS.KOM, semantic relatedness is applied, which supersedes the distinction between inputs and outputs and generally assigns the same similarity value to a pair of concepts, regardless of the matching level they stem from. In LOG4SWS.KOM, we also use a generic definition of subsumption matching types. The decision whether more generic or more specific concepts should be ranked higher on certain levels is automatically derived through the OLS estimator and may well differ between, e.g., inputs and outputs. Additionally, to account for user preferences, such rankings may be manually specified in LOG4SWS.KOM by assigning the appropriate weights.

We would also like to discuss a potential shortcoming of our matchmakers: When following the approach presented by Paolucci et al. [201], the computed DoM for any operation can be assumed to be a guaranteed lower bound of similarity for the request [140]. With the average-based DoM computed by our matchmakers, such a lower bound is not guaranteed. However, the pros and cons of having such a lower bound have to be weighed – on the one hand, it guarantees a certain degree of similarity for the request.

On the other hand, this approach is quite prone to outliers, i.e., one very low DoM has a very large impact on the overall DoM. Here, a non-discrete scale which makes it possible to derive an average DoM is certainly helpful. Hence, it is worth discussing whether a certain degree of similarity needs to be guaranteed at all – notably, URBE, which currently provides the best SAWSDL matchmaking results in terms of AP (apart from LOG4SWS.KOM and COV4SWS.KOM), also does not make use of such a lower bound [212]. Another approach which makes use of implicit semantics and does not guarantee a global DoM is iMatcher by Kiefer and Bernstein [123]. Sheth et al. argue that implicit semantics should generally be regarded and that a restriction to formal semantics and DL will limit the potential of the Semantic Web [227].
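The difference between the two aggregation strategies can be made concrete with a small numerical sketch (illustrative values only): a single poorly matched parameter dominates the minimum-based DoM, while the average-based DoM remains informative:

```python
# Illustrative values only: per-parameter DoMs of one operation match,
# including a single outlier.
parameter_doms = [0.9, 0.95, 0.85, 0.1]

min_dom = min(parameter_doms)                        # 0.1, guaranteed lower bound
avg_dom = sum(parameter_doms) / len(parameter_doms)  # 0.7, robust to the outlier

print(f"minimum-based DoM: {min_dom}, average-based DoM: {avg_dom}")
```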

In our opinion, in a context where no minimal DoM is given in a service request – which is always the case if queries are formulated using a "query by example" approach, as is currently done in the service matchmaking research community (cp. Section 4.1) – it is legitimate to abstain from guaranteeing such a lower bound of the DoM. If a minimal DoM is defined in a query, it is necessary to consider this during matchmaking: In all approaches presented in this thesis, this can be done by adding corresponding constraints to the matchmaker. However, this is not required in common evaluation approaches for (SAWSDL-based) service matchmakers, where the service request is given as a query by example.
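A minimal sketch of such a constraint (hypothetical interface; any of the presented matchmakers could apply it as a post-processing step on the ranked result list):

```python
# Hypothetical post-filter enforcing a minimal DoM given in a query.
def filter_by_min_dom(ranked_results, min_dom):
    """ranked_results: iterable of (service, dom) pairs, dom in [0, 1]."""
    return [(service, dom) for service, dom in ranked_results
            if dom >= min_dom]

results = [("S1", 0.92), ("S2", 0.74), ("S3", 0.41)]
print(filter_by_min_dom(results, 0.7))  # keeps S1 and S2
```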

Furthermore, we would like to dwell on the runtime performance of our matchmakers. Even though only little attention has been paid to runtime performance when designing the matchmakers, the evaluation results are promising. A comparison of the runtime performance with that of other matchmakers like URBE is difficult, as we were not able to conduct a runtime evaluation of the other matchmakers on the same machine; here, the 2010 edition of the S3 Contest will deliver reliable values. However, COM4SWS, a preliminary version of LOG4SWS.KOM and COV4SWS.KOM without, e.g., the caching mechanisms, had an AQRT of 6.14 seconds in the S3 Contest 2009. In comparison, SAWSDL-MX2 featured an AQRT of 7.9 seconds and URBE of 19.96 seconds [142]. Through some changes in the design – especially the introduction of caches and the representation of SAWSDL files as ATSM models – we were able to reduce the AQRT considerably. Thus, COV4SWS.KOM and LOG4SWS.KOM also provide competitive evaluation results regarding the runtime performance. However, these results have to be traced back to the serialization of services into ATSM and the caching mechanisms, not to the matchmaking algorithms themselves. In an application scenario where services are registered or stored in a service catalogue – so that transformation and caching can be performed in advance – this is a valid assumption.

As a final consideration, it should be discussed for which domains COV4SWS.KOM or LOG4SWS.KOM might be better suited. The biggest advantage of COV4SWS.KOM is the direct adaptation to the different usefulness of service component descriptions on different service abstraction levels. Regarding LOG4SWS.KOM, such an adaptation is indirectly possible, as the OLS adaptation is conducted independently for each abstraction level. On the one hand, the weighting of service abstraction levels still has to be adapted manually; on the other hand, LOG4SWS.KOM will nevertheless be adapted to the regarded service domain. Furthermore, Variant 4a of COV4SWS.KOM features very good results for low recall levels, which are most likely more important to the user than a high AP.

As a conclusion, we recommend making use of COV4SWS.KOM if it is possible to clearly distinguish between different service domains or to determine which service ontology is used in (part of) a repository. As an example, in the energy domain, the Common Information Model (CIM) presents an extensive ontology available in OWL which could be used to semantically describe corresponding services [248]. In the e-commerce domain, GoodRelations by Hepp or eClassOWL could be used [102, 104]. In other domains, such an ontology is missing. Differences between services do not necessarily stem from the consideration of a certain business domain, but might also arise from the usage of different service ontologies in different domains (with one popular example being WSMO-Lite [251]). If services from clearly distinguishable domains were available in one (potentially distributed) service registry, it would seem reasonable to make use of differently adapted/configured matchmakers based on COV4SWS.KOM. Still, it is a requirement that services in such a domain possess comparably useful component descriptions on all levels.

Regarding the evaluation results, it should be mentioned that the results will most certainly vary if a different test data set is applied. However, SAWSDL-TC is, to the best of our knowledge, the only comprehensive and publicly available test data set for SAWSDL and is therefore also used in the annual S3 Contest [142] and by most researchers in this field (e.g., [123, 140, 212]).