Data Source Description Model
4.3 Experimental Study
4.3.1 RDF-MT based Characterization of Benchmarks
Num of nodes              8
Num of edges              8
Graph density             0.285
Avg. num of neighbors     2
Connected components      1
Avg. node connectivity    1.0
Transitivity              0.0
Clustering coefficient    0.0
Table 4.1: BSBM RDF-MT Graph Metrics. The clustering coefficient (0.0) indicates that the neighbors of an RDF-MT are not linked to one another.
[Figure 4.6 node labels: Review, Producer, ProductType, ProductFeature, Person, Offer, Vendor, Product]
Figure 4.6: Analysis of RDF-MTs of BSBM. The graph comprises 8 RDF-MTs and 8 inter-dataset links. Each dot represents the RDF-MT stored in one endpoint, and a line between dots corresponds to an inter-dataset link. Each endpoint holds only one RDF-MT, hence there are no intra-dataset links.
Figure 4.7: Frequency of BSBM RDF-MTs Per Number of Properties. The majority of Molecule Templates contain from five to seven properties.
and 1.0 average node connectivity. In particular, the connections are concentrated on a single RDF-MT (hence, a single dataset), Product: 6 out of 8 links lead to or from this RDF-MT. A histogram of the frequencies of RDF-MTs per number of properties, which range from six (two RDF-MTs) to 18 (one RDF-MT), is shown in Fig. 4.7.
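To illustrate why a graph whose links concentrate on one hub yields a clustering coefficient and transitivity of 0.0, the following is a minimal sketch in plain Python. The edge list is a hypothetical toy graph that only reuses the BSBM RDF-MT names; it is not the actual set of links in Figure 4.6. The helper names (`adjacency`, `local_clustering`, `average_clustering`, `transitivity`) are illustrative assumptions, and the text does not state which tool produced the reported metrics.

```python
from itertools import combinations


def adjacency(edges):
    """Build an undirected adjacency map from a list of (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def local_clustering(adj, node):
    """Fraction of pairs of neighbors of `node` that are linked to each other."""
    neighbors = adj[node]
    k = len(neighbors)
    if k < 2:
        return 0.0
    linked = sum(1 for a, b in combinations(neighbors, 2) if b in adj[a])
    return 2.0 * linked / (k * (k - 1))


def average_clustering(adj):
    """Mean of the local clustering coefficients over all nodes."""
    return sum(local_clustering(adj, n) for n in adj) / len(adj)


def transitivity(adj):
    """Closed connected triples divided by all connected triples."""
    closed = 0
    triples = 0
    for center in adj:
        for a, b in combinations(adj[center], 2):
            triples += 1
            if b in adj[a]:
                closed += 1
    return closed / triples if triples else 0.0


# Hypothetical hub-shaped graph: most links touch "Product".
hub_edges = [
    ("Product", "Review"),
    ("Product", "Producer"),
    ("Product", "ProductType"),
    ("Product", "ProductFeature"),
    ("Product", "Offer"),
    ("Product", "Vendor"),
    ("Review", "Person"),
]
hub = adjacency(hub_edges)

# Adding one link between two neighbors of the hub closes a triangle.
with_triangle = adjacency(hub_edges + [("Offer", "Vendor")])

print(average_clustering(hub), transitivity(hub))
print(average_clustering(with_triangle), transitivity(with_triangle))
```

In the toy graph no two neighbors of any node are linked, so both metrics are zero; a single additional link between two neighbors of the hub makes both positive.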
Figure 4.8: Analysis of RDF-MTs of LSLOD. The graph comprises 56 RDF-MTs and 197 intra- and inter-dataset links; dots in each circle represent RDF-MTs. A line between dots in the same circle shows an intra-dataset link, while a line between dots in different circles corresponds to an inter-dataset link. There are nine datasets: Drugbank, Dailymed, Sider, Affymetrix, KEGG, LinkedCT, TCGA-A, ChEBI, and Medicare; they have six, three, two, three, four, 13, 23, one, and one RDF-MTs, respectively.
Num of nodes              57
Num of edges              197
Graph density             0.205
Avg. num of neighbors     11.474
Connected components      3
Avg. node connectivity    1.648
Transitivity              0.634
Clustering coefficient    0.375
Table 4.2: LSLOD RDF-MT Graph Metrics. The clustering coefficient (0.375) suggests a high number of intra- and inter-dataset links.
LSLOD: Life Science Linked Open Data
LSLOD [93] is a benchmark composed of 10 real-world datasets of the Linked Open Data (LOD) cloud from the life sciences domain. The federation includes ChEBI (Chemical Entities of Biological Interest), KEGG (Kyoto Encyclopedia of Genes and Genomes), DrugBank, TCGA-A (a subset of The Cancer Genome Atlas), LinkedCT (Linked Clinical Trials), Sider (Side Effects Resource), Affymetrix, Diseasome, DailyMed, and Medicare. Compared to FedBench, the LSLOD datasets contain a rather small number of RDF-MTs. Figure 4.8 shows the connectivity of all RDF-MTs associated with the LSLOD datasets. In total, there are 57 RDF-MTs with 197 links between them. The TCGA-A dataset contains the most RDF-MTs (23). There are no shared RDF-MTs between the LSLOD datasets. Figure 4.9 shows that most of the RDF-MTs have between three and 55 properties. Some RDF-MTs from TCGA-A have a large number of properties, e.g., tcga:clinical_omf has 197 properties; tcga:normal_control, tcga:tumor_sample, and tcga:clinical_nte have 246 properties; and tcga:clinical_cqcf, tcga:biospecimen_cqcf, and tcga:patient have 247 properties. The graph analysis in Table 4.2 shows medium connectivity of RDF-MTs (stronger than in FedBench), with 0.123 density, 6.912 average number of neighbors, and 3 connected components.
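The density and average number of neighbors quoted for the RDF-MT graphs can be related to the node and edge counts with the standard formulas for an undirected simple graph, density = 2m / (n(n - 1)) and average neighbors = 2m / n. The sketch below assumes that this is how the metrics were derived, which the text does not state explicitly; the function names are illustrative, and the (nodes, edges) pairs are the counts reported for each benchmark.

```python
def graph_density(num_nodes, num_edges):
    """Density of an undirected simple graph: 2m / (n * (n - 1))."""
    if num_nodes < 2:
        return 0.0
    return 2.0 * num_edges / (num_nodes * (num_nodes - 1))


def average_neighbors(num_nodes, num_edges):
    """Average degree of an undirected graph: 2m / n."""
    if num_nodes == 0:
        return 0.0
    return 2.0 * num_edges / num_nodes


# (number of RDF-MTs, number of links) as reported for each benchmark.
benchmarks = {
    "BSBM": (8, 8),
    "LSLOD": (57, 197),
    "FedBench": (396, 6317),
}

summary = {
    name: (graph_density(n, m), average_neighbors(n, m))
    for name, (n, m) in benchmarks.items()
}

for name, (density, neighbors) in summary.items():
    print(f"{name}: density={density:.4f}, avg. neighbors={neighbors:.3f}")
```

Under these assumptions, the LSLOD counts give roughly 0.123 and 6.912, and the FedBench counts give roughly 0.081 and 31.904, in line with the values quoted in the running text.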
Figure 4.9: Frequency of LSLOD RDF-MTs Per Number of Properties. The majority of Molecule Templates contain from three to 30 properties.
FedBench
FedBench [94] is a benchmark suite for analyzing both the efficiency and effectiveness of federated query processing techniques for different use cases on semantic data. It includes three collections of datasets: the cross-domain, life-science, and SP2Bench collections. The cross-domain collection is composed of datasets from different domains: DBpedia has linked structured data extracted from Wikipedia; Geonames is composed of geospatial entities such as countries and cities; Jamendo includes music data such as artists and records; LinkedMDB maintains linked structured data about movies and actors; the New York Times dataset contains about 10,000 subject headings about people, organizations, and locations; finally, the Semantic Web Dog Food (SWDF) dataset includes data about Semantic Web conferences, papers, and authors. The life-science collection contains datasets from the life sciences domain: Kyoto Encyclopedia of Genes and Genomes (KEGG) has chemical compounds and reactions data in Drug, Enzyme Reaction and Compound modules; the Chemical Entities of Biological Interest (ChEBI) contains information about molecular entities focused on “small” chemical compounds, such as atoms, molecules, and ions; and DrugBank maintains drug data with drug target information. In addition to these three datasets, the life-science collection includes a subset of the DBpedia dataset that contains data about drugs. Finally, the SP2Bench collection contains a synthetic dataset generated by the SP2Bench data generator [95], which mirrors characteristics observed in the DBLP database. For our experiments, we used datasets from the first two collections of this benchmark, i.e., the cross-domain and life-science collections, which contain real-world datasets.
In FedBench, RDF-MTs that have more than 100 properties correspond to classes with multiple predicates and subclasses, such as dbo:Person, dbo:Organisation, and dbo:Place. In addition, to study the characteristics of the generated FedBench RDF-MT graph, we report a graph analysis, documented in Table 4.3. In particular, we observe a rather medium connectivity of the graph nodes (i.e., RDF-MTs), with a graph density of 0.081, 31.9 average number of neighbors, and 9 connected components (see footnote 5). Finally, the clustering coefficient (0.602) indicates that links exist not only between RDF-MTs that come from the same dataset, but also between RDF-MTs of different datasets. Figure 4.10 illustrates all RDF-MTs in FedBench, grouped by the dataset they come from, with all intra-dataset and inter-dataset connections. In total, 387 RDF-MTs (396 including shared RDF-MTs) with 6,317 links
5 A lower number of connected components indicates stronger connectivity.
[Figure 4.10 labels: DBpedia, Drugbank, KEGG, Chebi, Jamendo, Geonames, NYTimes, SWDF, LinkedMDB, Shared]
Figure 4.10: Analysis of RDF-MTs of FedBench. The graph comprises 387 RDF-MTs and 6,317 intra- and inter-dataset links. The dots in each circle represent RDF-MTs. A line between dots in the same circle shows an intra-dataset link, while a line between dots in different circles corresponds to an inter-dataset link. In numbers, there is only one RDF-MT in ChEBI, 234 in DBpedia, six in Drugbank, one in Geonames, 11 in Jamendo, four in KEGG, 53 in LinkedMDB, two in NYTimes, and 80 in the SWDF dataset. Four of these RDF-MTs belong to at least two FedBench datasets and are modeled as separate circular dots.
Figure 4.11: Frequency of FedBench RDF-MTs Per Number of Properties. The majority of Molecule Templates contain from one to 20 properties.
are generated. While the majority of the RDF-MTs (230) are related to a single dataset, a few (4) are shared between two or more datasets. Most of the RDF-MTs have between three and 20 properties, as can be seen in the histogram of Figure 4.11.
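A frequency histogram such as the one in Figure 4.11 can be obtained by binning RDF-MTs by their number of properties. The following is a minimal sketch under stated assumptions: the input is a dictionary from RDF-MT name to property count, the function names and the bin width of 10 are illustrative choices, and the only data used are the seven TCGA-A property counts quoted in the LSLOD discussion, since per-RDF-MT counts for FedBench are not listed in the text.

```python
from collections import Counter


def bin_label(num_properties, width):
    """Return the label of the fixed-width bin containing num_properties."""
    low = (num_properties // width) * width
    return f"{low}-{low + width - 1}"


def property_histogram(property_counts, width=10):
    """Count how many RDF-MTs fall into each bin of property counts."""
    return Counter(bin_label(n, width) for n in property_counts.values())


# Property counts quoted in the text for some TCGA-A RDF-MTs.
tcga_property_counts = {
    "tcga:clinical_omf": 197,
    "tcga:normal_control": 246,
    "tcga:tumor_sample": 246,
    "tcga:clinical_nte": 246,
    "tcga:clinical_cqcf": 247,
    "tcga:biospecimen_cqcf": 247,
    "tcga:patient": 247,
}

histogram = property_histogram(tcga_property_counts, width=10)

# Print bins in ascending order of their lower bound.
for label, frequency in sorted(
    histogram.items(), key=lambda item: int(item[0].split("-")[0])
):
    print(label, frequency)
```

A narrower bin width would separate the 246-property and 247-property RDF-MTs; the width of 10 is only one possible choice.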
From the reported analysis, it can be observed that RDF-MTs can be used to describe characteristics of datasets in terms of connectivity between RDF types represented in each dataset with other datasets
Num of nodes              396
Num of edges              6,317
Graph density             0.081
Avg. num of neighbors     31.904
Connected components      9
Avg. node connectivity    10.624
Transitivity              0.395
Clustering coefficient    0.602
Table 4.3: FedBench RDF-MT Graph Metrics. The clustering coefficient (0.602) suggests a high number of intra- and inter-dataset links.
in the federation. This answers Q1 positively, in the sense that datasets can be characterized not only by ontology types (RDF types) and predicates, but also by the characteristics of the network between ontology types within the same dataset and with other datasets in a federation.
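As an illustration of characterizing a federation by the network between its RDF-MTs, the sketch below separates intra-dataset from inter-dataset links and counts connected components. It is a hedged example rather than the procedure used in the study: the RDF-MT names, their dataset assignments, and the links are hypothetical, and the function names are assumptions introduced here.

```python
from collections import deque


def classify_links(mt_dataset, links):
    """Split links into intra-dataset and inter-dataset lists."""
    intra, inter = [], []
    for a, b in links:
        if mt_dataset[a] == mt_dataset[b]:
            intra.append((a, b))
        else:
            inter.append((a, b))
    return intra, inter


def count_connected_components(nodes, links):
    """Count connected components with a breadth-first search."""
    adj = {n: set() for n in nodes}
    for a, b in links:
        adj[a].add(b)
        adj[b].add(a)
    seen = set()
    components = 0
    for start in nodes:
        if start in seen:
            continue
        components += 1
        seen.add(start)
        queue = deque([start])
        while queue:
            current = queue.popleft()
            for neighbor in adj[current]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
    return components


# Hypothetical RDF-MTs and the dataset each one belongs to.
mt_dataset = {
    "ex:Drug": "DrugBank",
    "ex:Target": "DrugBank",
    "ex:KeggCompound": "KEGG",
    "ex:ChebiCompound": "ChEBI",
    "ex:Artist": "Jamendo",
}

# Hypothetical links between RDF-MTs.
links = [
    ("ex:Drug", "ex:Target"),
    ("ex:Drug", "ex:KeggCompound"),
    ("ex:KeggCompound", "ex:ChebiCompound"),
]

intra_links, inter_links = classify_links(mt_dataset, links)
num_components = count_connected_components(list(mt_dataset), links)

print(len(intra_links), len(inter_links), num_components)
```

In this toy federation the isolated RDF-MT forms its own component, which mirrors the observation that fewer connected components indicate stronger connectivity.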