Managing Co-evolution - Privacy-aware Query Processing

8.4 Managing Co-evolution

[Figure placeholder: a dataset provider holds the source dataset S (e.g., DBpedia) and a client holds the target dataset T, a slice (subgraph of S) extracted from DBpedia. Both datasets evolve over time. The co-evolution manager performs the numbered steps: 1. a dataset slice, 2. pull changes, 3. apply strategy, 4. apply changes, 5. propagate changes.]

Figure 8.4: Co-evolution of linked datasets

at time point $t_j$, the co-evolution manager identifies the conflicts and resolves them; the final changes are then merged into both datasets.

8.4.1 Conflict

Our co-evolution strategy aims at dealing with changesets from either the source or the target dataset and at providing a suitable reconciliation strategy. Various strategies can be employed for synchronizing datasets.

When we synchronize the target $T_{t_i}$ with the source $S_{t_i}$, there may exist triples which have been changed in both datasets. These changed triples may be conflicting.

Definition 45 (Potential Conflict) Let us assume that a synchronization is required for a given time slot $t_i - t_j$. $\Delta(S_{t_j - t_i})$ is the changeset of the source dataset and $\Delta(T_{t_j - t_i})$ is the changeset of the target dataset. A potential conflict is observed when there are triples $x_1 = (s, p, o_1) \in S_{t_j} \wedge x_2 = (s, p, o_2) \in \Delta(T_{t_j - t_i}) \wedge x_2 \notin S_{t_j} = S_{t_i} \cup \Delta(S_{t_j - t_i})$ with $o_1 \neq o_2$.
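Definition 45 can be read as a set computation. The following is a minimal sketch, assuming triples are plain (subject, predicate, object) tuples held in Python sets; the names `Triple` and `potential_conflicts` are illustrative and do not come from the original work.

```python
from typing import NamedTuple, Set, Tuple


class Triple(NamedTuple):
    s: str  # subject
    p: str  # predicate
    o: str  # object


def potential_conflicts(
    source_tj: Set[Triple], target_changes: Set[Triple]
) -> Set[Tuple[Triple, Triple]]:
    """Return pairs (x1, x2) with x1 in S_tj, x2 in Delta(T), x2 not in S_tj,
    sharing subject and predicate but differing in the object."""
    # Index the source by (subject, predicate) to avoid a full pairwise scan.
    by_sp = {}
    for x1 in source_tj:
        by_sp.setdefault((x1.s, x1.p), set()).add(x1)

    pairs = set()
    for x2 in target_changes:
        if x2 in source_tj:
            continue  # identical triple already in the source: no conflict
        for x1 in by_sp.get((x2.s, x2.p), ()):
            if x1.o != x2.o:
                pairs.add((x1, x2))
    return pairs
```

This only finds potential conflicts; whether a pair is an actual conflict depends on the property involved, as discussed next.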

Taking $o_1 \neq o_2$ as an indication for a conflict is subjective, in the sense that the characteristics of the involved property $p$ influence the decision. Consider two triples $(s, p, o_1)$ and $(s, p, o_2)$. If $p$ is a functional datatype property, the two triples are conflicting iff the object values $o_1$ and $o_2$ are not equal.

However, if the property $p$ is a functional object property, these two triples are conflicting if the objects are, or can be inferred to be, different (e.g., via owl:differentFrom). Another property which needs special consideration is rdf:type. For this property it is necessary to check whether $o_1$ and $o_2$ belong to disjoint classes; only then are the triples conflicting. For example, s1 rdf:type Person and s1 rdf:type Athlete are not conflicting if Athlete is a subclass of Person (i.e., not disjoint). Thus, conflict detection must consider the inherent characteristics of the involved property.
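The property-dependent rules above can be sketched as a single predicate. This is an illustration only: the schema knowledge (which properties are functional, which individuals are known to be different, which classes are disjoint) is passed in as hypothetical lookup sets, and treating any other property as non-conflicting is an assumption of this sketch, not a rule stated in the text.

```python
RDF_TYPE = "rdf:type"


def is_conflict(p, o1, o2, *,
                functional_datatype=frozenset(),
                functional_object=frozenset(),
                different_from=frozenset(),
                disjoint_classes=frozenset()):
    """Decide whether (s, p, o1) and (s, p, o2) conflict, given schema hints.

    different_from and disjoint_classes hold unordered pairs as frozensets.
    """
    if o1 == o2:
        return False
    pair = frozenset((o1, o2))
    if p == RDF_TYPE:
        # Two types conflict only if their classes are declared disjoint.
        return pair in disjoint_classes
    if p in functional_datatype:
        # Functional datatype property: unequal values always conflict.
        return True
    if p in functional_object:
        # Functional object property: conflict only if known to be different.
        return pair in different_from
    # Assumption of this sketch: other properties may hold several values.
    return False
```

For instance, with `disjoint_classes={frozenset(("Person", "Organisation"))}`, the types Person and Athlete are not reported as conflicting, while Person and Organisation are.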

8.4.2 Synchronization Strategies

In the following, we list possible strategies for synchronization. We consider the time frame $t_i - t_j$, where at time $t_i$ the source and target datasets are synchronised and, until time $t_j$, both source and target datasets have been evolving independently. Before applying synchronization, the state of the source dataset is $S_{t_j} = S_{t_i} \cup \Delta(S_{t_j - t_i})$ and the state of the target dataset is $T_{t_j} = T_{t_i} \cup \Delta(T_{t_j - t_i})$.

Strategy I: This synchronization strategy prefers the source dataset and ignores all local changes on the target dataset; thus, the inclusion requirement $T_{t_j} \subseteq S_{t_j}$ is necessary. Therefore, the target dataset ignores all triples $\{x \mid x \notin \Delta(S_{t_j - t_i}) \wedge x \in \Delta(T_{t_j - t_i})\}$ and adds only the triples $\{y \mid y \in \Delta(S_{t_j - t_i})\}$. After synchronization, the state of the source dataset is $S_{t_j} = S_{t_i} \cup \Delta(S_{t_j - t_i})$ and the state of the target dataset is $T_{t_j} = T_{t_i} \cup \Delta(S_{t_j - t_i})$. Thus, the inclusion requirement is met and $T_{t_j} \subseteq S_{t_j}$. A special case of this strategy is when the target is not evolving.

Strategy II: With this strategy, the target dataset is not synchronized with the source dataset and keeps all its local changes. Thus, the target dataset is not influenced by any change from the source dataset and evolves locally. After synchronization, at time $t_j$, the state of the target dataset is $T_{t_j} = T_{t_i} \cup \Delta(T_{t_j - t_i})$, and the state of the source dataset is $S_{t_j} = S_{t_i} \cup \Delta(S_{t_j - t_i})$. It allows for synchronized replicas only if data is deleted. There is no synchronization if triples in the target dataset are updated or new triples are included.
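Strategies I and II reduce to plain set unions. The sketch below assumes datasets and changesets are Python sets of (s, p, o) tuples and, like the formulas above, treats a changeset as a set of triples to be united with the previous state; the function names are illustrative.

```python
def strategy_i(s_ti, t_ti, delta_s, delta_t):
    """Prefer the source: the target's local changes (delta_t) are discarded."""
    s_tj = s_ti | delta_s
    t_tj = t_ti | delta_s
    return s_tj, t_tj


def strategy_ii(s_ti, t_ti, delta_s, delta_t):
    """No synchronization: each dataset keeps only its own changes."""
    s_tj = s_ti | delta_s
    t_tj = t_ti | delta_t
    return s_tj, t_tj
```

With Strategy I, the result satisfies `t_tj <= s_tj` whenever `t_ti <= s_ti` held before synchronization.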

Strategy III: This synchronization strategy respects the changesets of both source and target datasets except that it ignores conflicting triples. Here, the set of triples in which conflicts occur is $X = \{x_1 = (s, p, o_1) \in S_{t_j} \wedge x_2 = (s, p, o_2) \in \Delta(T_{t_j - t_i}) \wedge x_2 \notin S_{t_j} \text{ with } o_1 \neq o_2\}$⁹. With Strategy III, the set of conflicting triples $X$ is removed from the target dataset while the source changeset $\Delta(S_{t_j - t_i})$ and the target changeset $\Delta(T_{t_j - t_i})$ are added. After synchronization, the state of the source dataset is $S_{t_j} = (S_{t_i} \cup \Delta(S_{t_j - t_i}) \cup \Delta(T_{t_j - t_i})) \setminus X$ and the state of the target dataset is $T_{t_j} = (T_{t_i} \cup \Delta(T_{t_j - t_i}) \cup \Delta(S_{t_j - t_i})) \setminus X$.

Thus, the inclusion requirement is met.
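A minimal sketch of Strategy III follows, under two assumptions that the text leaves open: $X$ is taken to contain both triples of every conflicting pair, and a conflict is detected purely by differing objects (the property-specific checks are omitted). Triples are (s, p, o) tuples in Python sets, and the function names are illustrative.

```python
def conflicting_triples(s_tj, delta_t):
    """Collect X: every triple taking part in a conflict between S_tj and Delta(T)."""
    objects_by_sp = {}
    for s, p, o in s_tj:
        objects_by_sp.setdefault((s, p), set()).add(o)

    x = set()
    for s, p, o2 in delta_t:
        if (s, p, o2) in s_tj:
            continue  # same triple on both sides: nothing to reconcile
        for o1 in objects_by_sp.get((s, p), ()):
            if o1 != o2:
                x.add((s, p, o1))
                x.add((s, p, o2))
    return x


def strategy_iii(s_ti, t_ti, delta_s, delta_t):
    """Merge both changesets, then drop every conflicting triple."""
    x = conflicting_triples(s_ti | delta_s, delta_t)
    s_tj = (s_ti | delta_s | delta_t) - x
    t_tj = (t_ti | delta_t | delta_s) - x
    return s_tj, t_tj
```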

Strategy IV: This synchronization strategy also respects the changesets of both source and target datasets. In addition, it includes conflicting triples after resolving the conflicts. Here, we consider the set of triples in which conflicts occur as $X = \{x_1 = (s, p, o_1) \in S_{t_j} \wedge x_2 = (s, p, o_2) \in \Delta(T_{t_j - t_i}) \wedge x_2 \notin S_{t_j} \text{ with } o_1 \neq o_2\}$. The conflicts over these triples should be resolved, which can be done using a resolution policy as described in [87]. Table 8.1 shows a list of various policies for resolving the conflicts. Conflict resolution results in a new set of triples called $Y$, whose triples originate from $X$ but whose conflicts have been resolved. Then, this new set (i.e., $Y$) is added to both the source and target datasets. After synchronization, the state of the source dataset is $S_{t_j} = ((S_{t_i} \cup \Delta(S_{t_j - t_i}) \cup \Delta(T_{t_j - t_i})) \setminus X) \cup Y$ and the state of the target dataset is $T_{t_j} = ((T_{t_i} \cup \Delta(T_{t_j - t_i}) \cup \Delta(S_{t_j - t_i})) \setminus X) \cup Y$. Thus, the inclusion requirement is met.
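Strategy IV can be sketched by extending the merge with a resolution step. In this illustration the resolution policy is a caller-supplied function `resolve(x1, x2)` that returns the surviving triple; that interface, the function names, and the simple pairwise conflict detection (differing objects only, quadratic scan) are assumptions of the sketch.

```python
def conflict_pairs(s_tj, delta_t):
    """Pairs (x1, x2): x1 in S_tj, x2 in Delta(T) but not in S_tj,
    same subject and predicate, different object."""
    return {
        (x1, x2)
        for x2 in delta_t if x2 not in s_tj
        for x1 in s_tj if x1[:2] == x2[:2] and x1[2] != x2[2]
    }


def strategy_iv(s_ti, t_ti, delta_s, delta_t, resolve):
    """Merge both changesets, remove conflicting triples X, add resolved set Y."""
    pairs = conflict_pairs(s_ti | delta_s, delta_t)
    x = {triple for pair in pairs for triple in pair}
    y = {resolve(x1, x2) for x1, x2 in pairs}
    s_tj = ((s_ti | delta_s | delta_t) - x) | y
    t_tj = ((t_ti | delta_t | delta_s) - x) | y
    return s_tj, t_tj


def prefer_source(x1, x2):
    """Example policy: always keep the source triple."""
    return x1
```

Passing `prefer_source` corresponds to a "Best Source" style policy with the source as the preferred dataset; any other function with the same shape can be substituted.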

8.4.3 Co-evolution Approach

Our approach allows a user to choose a synchronization strategy defined in Section 8.4.2. Below, we describe the status of the source and target datasets after applying each synchronization strategy.

We define a function CDR (Conflict Detection and Resolution), which (i) identifies conflicts in the case of Strategy III and Strategy IV, and then (ii) resolves conflicts only in the case of Strategy IV. Our approach considers triple-based operations, explained below using seven cases, to identify conflicts. Consider three triples $x_1 = (s, p, o_1)$, $x_2 = (s, p, o_2)$, and $x_3 = (s, p, o_3)$ which are in conflict with each other: $x_1 \in \Delta(S_{t_j - t_i}) \wedge x_2 \in \Delta(T_{t_j - t_i}) \wedge x_3 \in \{\Delta(S_{t_j - t_i}) \wedge \Delta(T_{t_j - t_i})\} \wedge o_1 \neq o_2 \neq o_3$. In the following we present seven cases of evolution causing conflicts. For the first three cases (I-III), the conflict resolution is straightforward. For cases IV-VII, however, we have to employ a conflict resolution policy to decide between the triples $x_1$ and $x_2$ ($D^S$ and $A^S$ refer to the deleted and added triples from the source dataset, respectively. Similarly, $D^T$ and $A^T$ refer to the deleted and added triples from the target dataset):

9 Set of conflicting triples selected after considering the inherent characteristics of the involved property. In the rest of the chapter, we refer to a potential conflict simply as a conflict, unless otherwise specified.

| Category | Policy | Function | Type | Description |
|---|---|---|---|---|
| Deciding | Roll the dice | Any | A | Pick random value. |
| Deciding | Reputation | Best Source | A | Select the value from the preferred dataset. |
| Deciding | Cry with the wolves | Global vote | A | Select the frequently occurring value for the respective attribute among all entities. |
| Deciding | Keep up-to-date | First* | A | Select the first value in order. |
| Deciding | Keep up-to-date | Latest* | A | Select the most recent value. |
| Deciding | Filter | Threshold* | A | Select the value with a quality score higher than a given threshold. |
| Deciding | Filter | Best* | A | Select the value with highest quality score. |
| Deciding | Filter | TopN* | A | Select the N best values. |
| Mediating | Meet in the Middle | Standard deviation, variance | N | Apply the corresponding function to get value. |
| Mediating | Meet in the Middle | Average, median | N | Apply the corresponding function to get value. |
| Mediating | Meet in the Middle | Sum | N | Select the sum of all values as the resultant. |
| Conflict Ignorance | Pass It On | Concatenation | A | Concatenate all the values to get the resultant. |
| Conflict Avoidance | Take the Information | Longest | S, C, T | Select the longest (non-NULL) value. |
| Conflict Avoidance | Take the Information | Shortest | S, C, T | Select the shortest (non-NULL) value. |
| Conflict Avoidance | Take the Information | Max | N | Select the maximum value from all. |
| Conflict Avoidance | Take the Information | Min | N | Select the minimum value from all. |
| Conflict Avoidance | Trust Your Friends | Choose Depending* | A | Select the value that belongs to a triple having a specific given value for another given attribute. |
| Conflict Avoidance | Trust Your Friends | Choose Corresponding | A | Select the value that belongs to a triple whose value is already chosen for another given attribute. |
| Conflict Avoidance | Trust Your Friends | Most Complete* | A | Select the value from the dataset (source or target) that has fewest NULLs across all entities for the respective attribute. |

* - requires metadata

Table 8.1: Conflict resolution policies and functions: A - All, S - String, C - Category (i.e., domain values have no order), T - Taxonomy (i.e., domain values have semi-order), N - Numeric.
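Several of the functions in Table 8.1 map directly onto small Python callables. The registry below is a hedged sketch covering only those that need no external metadata beyond ordering; the registry name, the `resolve` helper, the ", " separator for concatenation, and the assumption that values arrive ordered oldest-first (used for First and Latest) are all choices of this example.

```python
import random
from statistics import mean, median

# Each function takes a non-empty list of conflicting object values.
RESOLUTION_FUNCTIONS = {
    "Any": lambda values: random.choice(values),
    "First": lambda values: values[0],    # assumes oldest-first ordering
    "Latest": lambda values: values[-1],  # assumes oldest-first ordering
    "Average": mean,
    "Median": median,
    "Sum": sum,
    "Max": max,
    "Min": min,
    "Longest": lambda values: max(values, key=len),
    "Shortest": lambda values: min(values, key=len),
    "Concatenation": lambda values: ", ".join(str(v) for v in values),
}


def resolve(function_name, values):
    """Apply the named resolution function to the conflicting values."""
    return RESOLUTION_FUNCTIONS[function_name](list(values))
```

For example, `resolve("Max", [3, 7, 5])` keeps 7, and `resolve("Longest", ["Bonn", "Bonn, Germany"])` keeps the longer string. Functions marked with * in the table (such as Threshold or Best) would additionally need quality scores or similar metadata, which this sketch does not model.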

• Case I: $x_1$ is added to $T_{t_j}$ if $x_1$ is added by the source dataset and $x_2$ is deleted from the target dataset: $x_1 \in \Delta(A^S_{t_j - t_i}) \wedge x_2 \in \Delta(D^T_{t_j - t_i})$.

• Case II: $x_1$ is added to $T_{t_j}$ if $x_1$ is modified by the source dataset and $x_2$ is deleted from the target dataset: $x_1 \in \Delta(A^S_{t_j - t_i}) \wedge x_2 \in \Delta(D^S_{t_j - t_i}) \wedge x_2 \in \Delta(D^T_{t_j - t_i})$.

• Case III: $x_2$ is added to $S_{t_j}$ if $x_1$ is deleted from the source dataset and $x_2$ is modified in the target dataset: $x_1 \in \Delta(D^S_{t_j - t_i}) \wedge x_2 \in \Delta(A^T_{t_j - t_i}) \wedge x_1 \in \Delta(D^T_{t_j - t_i})$.

• Case IV: if the triple $x_1$ is added to the source dataset and $x_2$ is added to the target dataset: $x_1 \in \Delta(A^S_{t_j - t_i}) \wedge x_2 \in \Delta(A^T_{t_j - t_i})$.

• Case V: if $x_3$ is modified by both source and target datasets: $x_2 \in \Delta(A^S_{t_j - t_i}) \wedge x_3 \in \Delta(D^S_{t_j - t_i}) \wedge x_1 \in \Delta(A^T_{t_j - t_i}) \wedge x_3 \in \Delta(D^T_{t_j - t_i})$.

• Case VI: if $x_1$ is modified by the target dataset: $x_1 \in \Delta(A^S_{t_j - t_i}) \wedge x_2 \in \Delta(A^T_{t_j - t_i}) \wedge x_1 \in \Delta(D^T_{t_j - t_i})$.

• Case VII: if $x_1$ is modified by the source dataset: $x_2 \in \Delta(A^S_{t_j - t_i}) \wedge x_1 \in \Delta(D^S_{t_j - t_i}) \wedge x_1 \in \Delta(A^T_{t_j - t_i})$.
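The seven conditions can be checked mechanically once the added and deleted triples of each dataset are available as sets. The sketch below is one possible encoding: the `Changes` container and `matching_cases` function are hypothetical names, the Case IV condition is read as a conjunction (as its prose suggests), and because some conditions subsume others (e.g., Case II implies Case I) the function reports every matching case rather than imposing a priority.

```python
from dataclasses import dataclass, field


@dataclass
class Changes:
    added_s: set = field(default_factory=set)    # A^S: triples added in the source
    deleted_s: set = field(default_factory=set)  # D^S: triples deleted in the source
    added_t: set = field(default_factory=set)    # A^T: triples added in the target
    deleted_t: set = field(default_factory=set)  # D^T: triples deleted in the target


def matching_cases(x1, x2, x3, c):
    """Return the labels of all cases whose membership condition holds."""
    conditions = {
        "I": x1 in c.added_s and x2 in c.deleted_t,
        "II": x1 in c.added_s and x2 in c.deleted_s and x2 in c.deleted_t,
        "III": x1 in c.deleted_s and x2 in c.added_t and x1 in c.deleted_t,
        "IV": x1 in c.added_s and x2 in c.added_t,
        "V": (x2 in c.added_s and x3 in c.deleted_s
              and x1 in c.added_t and x3 in c.deleted_t),
        "VI": x1 in c.added_s and x2 in c.added_t and x1 in c.deleted_t,
        "VII": x2 in c.added_s and x1 in c.deleted_s and x1 in c.added_t,
    }
    return [label for label, holds in conditions.items() if holds]
```

A caller would then apply the straightforward action for cases I-III and hand cases IV-VII to a resolution policy.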

As we discussed earlier, whether a conflict between two triples exists depends heavily on the type of property. Consider two triples $(s, p, o_1)$ and $(s, p, o_2)$: if $p$ is rdfs:label, we measure the similarity between $o_1$ and $o_2$ using the Levenshtein distance. We keep both values of rdfs:label if their similarity is below a certain threshold; otherwise we treat them as conflicting.
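The rdfs:label check can be sketched as follows. The text does not say how the Levenshtein distance is turned into a similarity or which threshold is used, so the normalisation by the longer string's length and the default threshold of 0.8 are assumptions of this example.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Normalise the distance to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def labels_conflict(o1, o2, threshold=0.8):
    """Labels at or above the threshold are treated as conflicting variants;
    labels below it are considered distinct and both are kept."""
    return similarity(o1, o2) >= threshold
```

Under these assumptions, "Berlin" and "Berlim" would be flagged as conflicting, while "Berlin" and "Capital of Germany" would both be kept.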

In the document Federated Query Processing over Heterogeneous Data Sources in a Semantic Data Lake (pages 121-125)