
The evaluation environment for the matchmaking approaches presented in Chapter 3 consists of two parts: the actual evaluation framework SME2 (Section B.2.1) and the SAWSDL-TC test data set (Sections B.2.2 and B.2.3).

B.2.1 SME2 Framework

For the evaluation of our matchmaking approaches, we have used the SME2 framework¹ (Version 2.1).

SME2 is also applied in the S3 Contest², which aims at evaluating the retrieval performance of matchmakers for SWS [142]. Thus, it was possible to compare the results of LOG4SWS.KOM and COV4SWS.KOM with the evaluation results of state-of-the-art matchmakers for SAWSDL.

The SME2 framework consists of a GUI-based tool for the evaluation of Web service matchmakers and libraries that facilitate the integration of matchmakers into the framework. The GUI can be utilized in conjunction with OWL-S and SAWSDL test collections (see below). It automatically calculates a number of numerical result quality measures. Most importantly, this includes the “classic” IR metrics recall and precision (cp. Section B.3). Additionally, the tool computes the Average Precision (AP), which corresponds to the mean precision rate over all recall levels. The statistical figures are complemented by performance measures, such as the Overall and Average Query Response Time. Furthermore, a Friedman test is provided and can be used to evaluate whether the differences between evaluation results are statistically significant. The threshold value for p, which indicates the level of significance (or probability of error), is 0.05.
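To make these measures concrete, the following minimal sketch computes precision, recall, and AP for a single query, given a ranked result list and a binary relevance set. It illustrates the measures themselves and is not SME2’s actual implementation; all service identifiers are hypothetical.

```python
from typing import List, Set

def precision_recall_ap(ranking: List[str], relevant: Set[str]):
    """Precision, recall, and Average Precision (AP) of a ranked
    result list with respect to a binary relevance set."""
    hits = 0
    precisions_at_hits = []  # precision at each rank holding a relevant service
    for rank, service in enumerate(ranking, start=1):
        if service in relevant:
            hits += 1
            precisions_at_hits.append(hits / rank)
    precision = hits / len(ranking) if ranking else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    ap = sum(precisions_at_hits) / len(relevant) if relevant else 0.0
    return precision, recall, ap

# Hypothetical ranked result set for one query and its relevance set:
print(precision_recall_ap(["s1", "s4", "s2", "s7", "s3"], {"s1", "s2", "s3"}))
# -> (0.6, 1.0, 0.7555...)

# A Friedman test over the per-query AP values of several matchmakers could
# then be run, e.g., with scipy.stats.friedmanchisquare(ap_a, ap_b, ap_c),
# rejecting the null hypothesis of equal performance if p < 0.05.
```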

1 http://projects.semwebcentral.org/projects/sme2/

2 http://www-ags.dfki.uni-sb.de/~klusch/s3/

SME2’s evaluation for SAWSDL is based on the notion of binary relevance. That is, the framework assumes that for a given service request, a set of relevant (and accordingly, irrelevant) service offers can be identified. Relevant, in that sense, means that the service offer does not completely fail to satisfy the service request. SME2 expects a ranked result set. Ideally, it should solely consist of relevant services, or, at least, all relevant services should be ranked very high. The specific similarity of each service offer with the service request is not taken into account. However, the most recent versions of SME2 also support graded relevance. As the name implies, in contrast to binary relevance, graded relevance incorporates different degrees of relevance. So, instead of expecting that service offers either completely fail or perfectly match a service request, finer-grained grades like “possible match” or “partial match” are defined in the relevance sets [152, 154, 246]. Unfortunately, there is currently no corresponding, generally accepted test data collection for SAWSDL that could have been applied in order to evaluate the matchmakers at hand with respect to graded relevance (cp. the description of SAWSDL-TC in the following section).
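To illustrate how graded relevance changes the evaluation, the sketch below computes Normalized Discounted Cumulative Gain (NDCG), one common graded-relevance measure; it is given purely as an illustration and is not necessarily the measure implemented in SME2. The grade scale and service identifiers are assumptions.

```python
import math
from typing import Dict, List

def ndcg(ranking: List[str], grades: Dict[str, int]) -> float:
    """Normalized Discounted Cumulative Gain: graded relevance gains
    with a logarithmic discount for lower ranks."""
    def dcg(gains):
        return sum(g / math.log2(rank + 1)
                   for rank, g in enumerate(gains, start=1))
    actual = dcg(grades.get(s, 0) for s in ranking)
    ideal = dcg(sorted(grades.values(), reverse=True))
    return actual / ideal if ideal > 0 else 0.0

# Hypothetical grades: 3 = exact, 2 = possible match, 1 = partial match, 0 = fail.
grades = {"s1": 3, "s2": 2, "s3": 1}
print(ndcg(["s1", "s4", "s2", "s3"], grades))  # ~0.93
```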

B.2.2 SAWSDL-TC Test Collection

To the best of our knowledge, the test data collections applied in the S3 Contest are today the largest and most widely accepted sets of SWS applied for testing and evaluating algorithms for SWS discovery and matchmaking. In the S3 Contest, two test collections from www.semwebcentral.org are used – with OWLS-TC version 3.0 Revision 1 and SAWSDL-TC1 as the current versions of service retrieval test collections for OWL-S and SAWSDL [122]. As SAWSDL is the SWS standard applied in this thesis, we apply SAWSDL-TC as test collection.

SAWSDL-TC was released in 2008 and consists of 894 semantically annotated WSDL 1.1-based Web services, which cover various domains ranging from education and medical care to food and travel. The set contains 26 queries; for each query, a relevance set is provided, which can be used to compute IR measures. The queries are defined as “queries by example” (cp. Section 4.1). For the evaluation of matchmaking approaches, we apply these queries directly. For the evaluation of the query languages, we have mapped the original queries as described in Chapter 4. The test suite is completed by 24 ontologies, which are referenced by the semantic annotations of the services.

As SAWSDL-TC is WSDL 1.1-based, it was necessary to convert the test collection to WSDL 2.0, which is the designated service format for the matchmakers presented in this thesis. This is done using the XSLT stylesheet for converting WSDL 1.1 to 2.0 introduced in Appendix A.1.2. As no semantic annotations have been added to or removed from the service descriptions, and the structure of the service descriptions remains essentially the same, it is possible to compare the performance of the matchmakers at hand with that of other matchmakers which make use of SAWSDL-TC.
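The conversion itself is a straightforward batch transformation of all service files. A minimal sketch of how such a conversion could be scripted with Python and lxml is shown below; the stylesheet and directory names are placeholders, not the actual file names from Appendix A.1.2.

```python
from pathlib import Path
from lxml import etree

# Placeholder paths: the actual XSLT from Appendix A.1.2 and the unpacked
# SAWSDL-TC service directory would be substituted here.
transform = etree.XSLT(etree.parse("wsdl11to20.xslt"))
out_dir = Path("sawsdl-tc-wsdl20")
out_dir.mkdir(exist_ok=True)

for wsdl11_file in Path("sawsdl-tc/services").glob("*.wsdl"):
    wsdl20_tree = transform(etree.parse(str(wsdl11_file)))
    (out_dir / wsdl11_file.name).write_bytes(
        etree.tostring(wsdl20_tree, pretty_print=True,
                       xml_declaration=True, encoding="UTF-8"))
```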

While SAWSDL-TC is a de facto standard in the evaluation of Web service matchmakers, it suffers from a number of limitations. First of all, the scope of SAWSDL-TC is clearly on the evaluation of IR measures; other desirable dimensions of evaluation, such as the incorporation of different viewpoints on what a valid result to a service request might be [153], are not regarded. Second, from a more technical point of view, Web services in SAWSDL-TC are solely annotated at the parameter level. Interface and operation components are not annotated at all. Generally, each service contains exactly one interface with exactly one operation. Thus, more sophisticated matchmaking approaches operating on levels beyond the parameters cannot be fully evaluated. Furthermore, some obvious inconsistencies can be found in the compilation of the relevance sets. For example, at least one pair of services with an identical set of parameters and semantic annotations is not contained in the same relevance set. Moreover, six services that are contained in the relevance sets are not included in the directory of service offers, which leads to a slight reduction in recall for the corresponding requests. Yet, due to the lack of a more suitable test collection and for the ease of comparison, we have opted to utilize the WSDL 2.0-based version of SAWSDL-TC in the course of our evaluation. As the SAWSDL-TC service collection is completely independent of the work at hand, we meet the requirement of “fair testing”: we do not make use of an artificially created test data set tailored to the matchmakers presented in this thesis [156, 261].

An overview of the queries and relevance sets of SAWSDL-TC is provided in the next section.

B.2.3 SAWSDL-TC: Queries and Relevance Sets (Overview)

Table B.1 shows the queries from SAWSDL-TC and the number of services in the respective relevance set; the size of each query file is given in KByte.

Table B.1: SAWSDL-TC: Queries and Relevance Sets (Overview)

No.   Size (KByte)   Query                                                 Services in Relevance Set
 1.   5              book_price_service                                    37
 2.   5              bookpersoncreditcardaccount__service                  16
 3.   6              bookpersoncreditcardaccount_price_service             21
 4.   4              car_price_service                                     40
 5.   6              citycountry_hotel_service                             23
 6.   5              country_skilledoccupation_service                     74
 7.   5              dvdplayermp3player_price_service                      14
 8.   5              geographical-regiongeographical-region_map_service    15
 9.   5              geopolitical-entity_weatherprocess_service            23
10.   5              governmentdegree_scholarship_service                  40
11.   5              governmentmissile_funding_service                     37
12.   4              grocerystore_food_service                             27
13.   4              hospital_investigating_service                        19
14.   4              maxprice_cola_service                                 13
15.   4              novel_author_service                                  22
16.   4              preparedfood_price_service                            25
17.   5              recommendedprice_coffeewhiskey_service                18
18.   7              researcher-in-academia_address_service                16
19.   7              shoppingmall_cameraprice_service                      18
20.   4              surfing_destination_service                           32
21.   4              surfinghiking_destination_service                     39
22.   7              surfingorganization_destination_service               13
23.   4              title_comedyfilm_service                              14
24.   4              title_videomedia_service                              12
25.   5              university_lecturer-in-academia_service               20
26.   5              userscience-fiction-novel_price_service               28

The queries and relevance sets from the SAWSDL-TC test data set shown in Table B.2 were utilized in order to manually determine optimal numerical equivalents for the DoM levels presented in Variants B and C of LOG4SWS.KOM (cp. Section 3.4.1).

The optimal numerical equivalents have been determined by testing different combinations of weightings. As in Section 3.4, the numerical equivalents for exact and fail matches have been set to 1 and 0, respectively. The numerical equivalents for super and sub matches have been varied from 0 to 1 in steps of 0.1. The best evaluation results have been observed for {1, 0.8, 0.6, 0}.
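This search can be expressed as a simple exhaustive sweep over the two free parameters. The following sketch outlines the procedure; mean_ap_for is a hypothetical hook that would run LOG4SWS.KOM over the queries of Table B.2 with the given equivalents and return the mean AP.

```python
from itertools import product

def mean_ap_for(equivalents):
    """Hypothetical hook: run LOG4SWS.KOM with the given DoM-to-number
    mapping over the queries of Table B.2 and return the mean AP."""
    return 0.0  # placeholder; to be wired to the actual matchmaker

steps = [round(0.1 * i, 1) for i in range(11)]  # 0.0, 0.1, ..., 1.0

best_score, best_weights = -1.0, None
for super_eq, sub_eq in product(steps, repeat=2):
    # exact and fail are fixed at 1 and 0, as in Section 3.4
    score = mean_ap_for({"exact": 1.0, "super": super_eq,
                         "sub": sub_eq, "fail": 0.0})
    if score > best_score:
        best_score, best_weights = score, (1.0, super_eq, sub_eq, 0.0)

# The optimum reported above was {1, 0.8, 0.6, 0}.
print(best_weights)
```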

Table B.2: Service Collection for the Determination of Numerical DoM Equivalents in LOG4SWS.KOM

No.   Query                                            Services in Relevance Set
 4.   car_price_service.wsdl                           40
 5.   citycountry_hotel_service.wsdl                   23
11.   governmentmissile_funding_service.wsdl           37
15.   novel_author_service.wsdl                        22
16.   preparedfood_price_service.wsdl                  25
19.   shoppingmall_cameraprice_service.wsdl            18
20.   surfing_destination_service.wsdl                 32
23.   title_comedyfilm_service.wsdl                    14
24.   title_videomedia_service.wsdl                    12
25.   university_lecturer-in-academia_service.wsdl     20
Total (10 queries)                                     243