
4 Homogeneous Team Formation

even heuristics. In Chapter 5, we develop an exact integer linear programming approach and a heuristic based on the (slightly adapted) scheme and successfully test them on empirical data.

In Theorem 4.3, we showed that HOMOGENEOUS TEAM FORMATION is fixed-parameter tractable with respect to the combined parameter (t,p) by using a brute-force approach for the hint enumeration. The corresponding running time is rather impractical. However, combining the brute-force approach for the hint enumeration with the ROW ASSIGNMENT approach for the hint checking allows for an integer linear programming formulation (see Figure 4.5), which might be relevant in practice. As will be reported in Section 5.3, we tested a closely related integer linear program for combinatorial data anonymization, where test data and software implementations are publicly available. The performance on empirical standard benchmark data for combinatorial data anonymization is surprisingly good, much better than the theoretical running time bounds indicate (see Section 5.3 for details).

For Theorem 4.5, we developed an algorithm which, given a permutation of the pattern vectors, computes a solution hint. We showed that applying this algorithm to all possible permutations yields a correct hint, which allows us to find an overall solution. Relaxing this to use only a few (promising) permutations might lead to interesting heuristics or polynomial-time approximation algorithms. In Subsection 5.2.3, we develop a heuristic which only takes the most promising permutations. This leads to a fast algorithm with surprisingly good anonymization quality (see Section 5.3 for details).

Open Questions. Besides the general quest for improving our worst-case upper bounds, several concrete questions remain open. For example, the parameterized complexity (fixed-parameter tractability vs. W[1]-hardness) of HOMOGENEOUS TEAM FORMATION for the parameters cost bound s and the combined parameter (m,p), where m is the number of columns and p is the number of pattern vectors, remains open. A particularly interesting open question is whether HOMOGENEOUS TEAM FORMATION for parameter p is fixed-parameter tractable or W[1]-hard when s is bounded. It also seems worth investigating HOMOGENEOUS TEAM FORMATION from the viewpoint of polynomial-time approximation or approximation with fixed-parameter algorithms.

5 Pattern-Guided k-Anonymity

In this chapter, using approaches which were successful for the homogeneous team formation task from Chapter 4, we suggest a user-oriented approach to combinatorial data anonymization, best known through the NP-hard k-ANONYMITY problem. A data matrix is called k-anonymous if every row appears at least k times; the goal of the NP-hard k-ANONYMITY problem then is to make a given matrix k-anonymous by suppressing (blanking out) as few entries as possible. Building on approaches and ideas from Chapter 4, we describe an enhanced k-anonymization problem called PATTERN-GUIDED k-ANONYMITY where the users specify in which combinations suppressions may occur. In this way, the user of the anonymized data can express the differing importance of various data features (that is, columns of the input matrix) and combinations thereof. We show that PATTERN-GUIDED k-ANONYMITY is NP-hard. We complement this by a fixed-parameter tractability result based on a “data-driven parameterization” and, based on this, develop an exact ILP-based solution method as well as a simple but very effective greedy heuristic. Experiments on several real-world datasets show that our heuristic easily matches the established “Mondrian” algorithm [LDR06] for k-ANONYMITY in terms of anonymization quality and outperforms it in terms of running time.

5.1 Motivation and Model

Making a matrix k-anonymous, that is, ensuring that each row occurs at least k times, is a classic model for (combinatorial) data privacy [Fun+10;Nav+12]. The underlying idea is that each row of the matrix represents an individual, and the k-fold appearance of the corresponding row is meant to prevent the person or object behind it from being identified. To reach this goal, clearly some information loss has to be accepted, that is, some entries of the matrix have to be suppressed (blanked out); in this way, information about certain attributes (represented by the columns of the matrix) is lost. Thus, the natural goal is to minimize this loss of information when transforming an arbitrary data matrix into a k-anonymous one. The corresponding optimization problem k-ANONYMITY is NP-hard (even in special cases) and hard to approximate [BDD11;Bon+11;BW10;CPS10;MW04]. Nevertheless, it plays a significant role in many applications, which mostly rely on heuristic approaches for k-ANONYMITY [CT09;GKV10;Nav+12].

It was observed that care has to be taken concerning the “usefulness” (also in terms of expressiveness) of the anonymized data [LS07;RSH07]. Indeed, depending on the application that has to work on the k-anonymized data, certain entry suppressions may “hurt” less than others. For instance, considering medical data records, the information about eye color may be less informative than information about the blood pressure. Hence, it would be useful for the user of the anonymized data to specify information that helps to carry out the anonymization process in a more sophisticated way. A promising approach is to allow the user to specify which combinations of attributes are less harmful to suppress than others. Observe that doing this means partitioning records about individuals into groups where the attribute combinations which may be suppressed (and thus also those attributes that have to be homogeneous) are prespecified. This is very close to our model of HOMOGENEOUS TEAM FORMATION from Chapter 4. The main difference is that for the team formation task we ask for a one-to-one matching between given patterns and specific output groups (teams). The user specifies a set of possible teams and the homogeneity requirement for each team. There may be several possible teams having the same homogeneity requirement, but each of them is associated with one individual pattern vector. For anonymization purposes, whenever a specific attribute combination to suppress (pattern) is allowed, then any number of instantiations is allowed; there is no reason to allow only a limited number of instantiations. Hence, defining the model for anonymization purposes requires subtly different formalities, which will be introduced next. We emphasize that both models are still similar enough that we can adapt some of the algorithmic approaches from Chapter 4 in this chapter. However, the hardness proofs from Subsection 4.2.1 cannot be easily transferred, since all of them rely on the one-to-one mapping between pattern vectors and output groups.

Altogether, with our new model we can improve both on k-ANONYMITY, by letting the data user influence the anonymization process, and on a previous model [Bre+11a] for user-guided data anonymization, by allowing the data user full flexibility to influence the anonymization process.


5.1.1 The Model

In this subsection, we introduce and recall some concepts necessary to mathematically define our model of pattern-guided data anonymization.

A row type is a maximal set of identical rows of a matrix.

Definition 5.1 ([Sam01;SS98;Swe02b]). A matrix is k-anonymous if every row type contains at least k rows in the matrix, that is, for every row in the matrix one can find at least k−1 other identical rows.
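Definition 5.1 can be made concrete with a minimal Python sketch. The representation of a matrix as a list of rows and the function names `row_types` and `is_k_anonymous` are assumptions made for illustration; they do not appear in the text.

```python
from collections import Counter


def row_types(matrix):
    """Map each distinct row (as a tuple) to the number of rows in its row type."""
    return Counter(tuple(row) for row in matrix)


def is_k_anonymous(matrix, k):
    """True if every row type of the matrix contains at least k rows."""
    return all(count >= k for count in row_types(matrix).values())


# Hypothetical toy matrix: two row types, of sizes 2 and 1.
example = [
    ["a", "x"],
    ["a", "x"],
    ["b", "y"],
]

print(is_k_anonymous(example, 1))  # True
print(is_k_anonymous(example, 2))  # False: the row type ("b", "y") has one row
```

Counting row types takes one pass over the matrix, so checking k-anonymity itself is easy; the difficulty lies in choosing which entries to suppress.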

Matrices are made k-anonymous by suppressing some of their entries. Formally, suppressing an entry M[i,j] of an n×m matrix M over the alphabet Σ, with 1 ≤ i ≤ n and 1 ≤ j ≤ m, means simply replacing M[i,j] ∈ Σ by the new symbol “?”, ending up with a matrix over the alphabet Σ∪{?}.

Our central enhancement of the k-ANONYMITY model lies in the user-specified pattern mask guiding the anonymization process: Every row in the k-anonymous output matrix has to conform to one of the given pattern vectors. Note that both the input table and the given patterns are, mathematically, matrices, but we use different terms to distinguish them more easily: The “pattern mask” consists of “pattern vectors” and the “input matrix” consists of “rows”.

Definition 5.2. A row r in a matrix M ∈ (Σ∪{?})^{n×m} matches a pattern vector v ∈ {□,?}^m if and only if ∀ 1 ≤ i ≤ m: r[i] = ? ⇔ v[i] = ?, that is, r and v have ?-symbols at the same positions.
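The matching condition of Definition 5.2 translates directly into a position-wise comparison. The following is a minimal sketch; the function name `matches` and the use of `"□"` for the non-suppressed pattern symbol are assumptions made for illustration, and only the suppression symbol `"?"` influences the result.

```python
SUPPRESSED = "?"  # suppression symbol used in the text


def matches(row, pattern):
    """True if row and pattern have the suppression symbol at the same positions."""
    return len(row) == len(pattern) and all(
        (r == SUPPRESSED) == (v == SUPPRESSED) for r, v in zip(row, pattern)
    )


# Hypothetical pattern vector: only the middle column is suppressed.
pattern = ["□", "?", "□"]

print(matches(["a", "?", "c"], pattern))  # True: "?" exactly in the middle
print(matches(["a", "b", "c"], pattern))  # False: the middle entry is not suppressed
print(matches(["?", "?", "c"], pattern))  # False: an extra suppressed entry
```

Note that matching is an equivalence, not an implication: a row with fewer suppressed entries than the pattern prescribes does not match it either.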

With these definitions we can now formally define our central computational problem. The decisive difference to our model for homogeneous team formation (Chapter 4) is that in our new model two non-identical output rows can match the same pattern vector.

PATTERN-GUIDED k-ANONYMITY

Input: A matrix M ∈ Σ^{n×m}, a pattern mask P ∈ {□,?}^{p×m}, and two positive integers k and s.

Question: Can one suppress at most s entries of M in order to obtain a k-anonymous matrix M′ such that each row type of M′ matches at least one pattern vector of P?
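To illustrate the problem definition, the sketch below checks whether a given candidate output matrix is a valid solution for an instance. It is only a certificate checker under stated assumptions, not a solving algorithm: the function names, the list-of-rows representation, the symbol `"□"` for non-suppressed pattern positions, and the convention that row i of the output corresponds to row i of the input are all choices made for this example.

```python
from collections import Counter

SUPPRESSED = "?"  # suppression symbol used in the text


def matches(row, pattern):
    """Row and pattern carry the suppression symbol at exactly the same positions."""
    return len(row) == len(pattern) and all(
        (r == SUPPRESSED) == (v == SUPPRESSED) for r, v in zip(row, pattern)
    )


def is_valid_solution(M, M_out, P, k, s):
    """Check a candidate output M_out for the instance (M, P, k, s).

    Assumes row i of M_out is the anonymized version of row i of M.
    """
    if len(M) != len(M_out):
        return False
    suppressed = 0
    for row_in, row_out in zip(M, M_out):
        if len(row_in) != len(row_out):
            return False
        for a, b in zip(row_in, row_out):
            if b == SUPPRESSED:
                suppressed += 1  # this entry was blanked out
            elif a != b:
                return False  # entries may only be suppressed, never altered
    if suppressed > s:
        return False  # more than s suppressions
    types = Counter(tuple(row) for row in M_out)
    if any(count < k for count in types.values()):
        return False  # some row type has fewer than k rows
    # Every row type must match at least one pattern vector.
    return all(any(matches(rt, v) for v in P) for rt in types)


# Hypothetical instance with four rows, two columns, and two pattern vectors.
M = [["a", "x"], ["a", "y"], ["b", "z"], ["b", "z"]]
P = [["□", "?"], ["□", "□"]]
M_out = [["a", "?"], ["a", "?"], ["b", "z"], ["b", "z"]]

print(is_valid_solution(M, M_out, P, k=2, s=2))  # True
print(is_valid_solution(M, M_out, P, k=2, s=1))  # False: two suppressions needed
```

In the example, the two rows starting with "a" share the first pattern vector, while the two identical rows starting with "b" match the all-□ pattern vector and need no suppression.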


5.1.2 Our Contributions

Describing a polynomial-time many-to-one reduction from the NP-hard 3-SET COVER problem, we show that PATTERN-GUIDED k-ANONYMITY is NP-complete, even if the input matrix consists of only three columns, there are only two pattern vectors, and k = 3. Notably, our reduction is completely different from the one for the closely related problem HOMOGENEOUS TEAM FORMATION in Chapter 4. There, the NP-hardness also holds if we do not have any lower bound on the team size (which would translate to k = 1 for PATTERN-GUIDED k-ANONYMITY). In contrast, we show polynomial-time solvability for PATTERN-GUIDED k-ANONYMITY if k ≤ 2. Motivated by the computational intractability result, we develop an exact algorithm that solves PATTERN-GUIDED k-ANONYMITY in O(2^{tp} · t^6 p^5 m + nm) time for an n×m input matrix M, p pattern vectors, and the number of different rows in M being t. This shows that PATTERN-GUIDED k-ANONYMITY is fixed-parameter tractable for the combined parameter (t,p). This result appears to be of practical interest only in special cases (“small” values for t and p are needed). It nevertheless paves the way to a formulation of an integer linear program for PATTERN-GUIDED k-ANONYMITY that exactly solves moderate-size instances in reasonable time. Furthermore, our fixed-parameter tractability result also leads to a simple and efficient greedy heuristic whose practical competitiveness is underlined by a set of experiments with real-world data, in which it also compares favorably with the Mondrian algorithm for k-ANONYMITY [LDR06]. In particular, our empirical findings strongly indicate that, even when neglecting the potentially stronger expressiveness that PATTERN-GUIDED k-ANONYMITY offers the data user, the model in combination with the greedy algorithm allows for high-quality and very fast data anonymization: it is comparable in terms of anonymization quality with the established Mondrian algorithm [LDR06] but significantly outperforms it in terms of efficiency.
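To get a feeling for why only “small” values of t and p are practical, one can evaluate the expression inside the running-time bound, read here as 2^{tp} · t^6 · p^5 · m + nm. The sketch below is purely illustrative: it ignores the constants hidden in the O-notation, and the function names and the sample values are assumptions rather than figures from the experiments.

```python
def count_row_types(matrix):
    """The parameter t: the number of different rows in the input matrix."""
    return len({tuple(row) for row in matrix})


def bound_value(t, p, m, n):
    """Value of 2^(t*p) * t^6 * p^5 * m + n*m, without hidden constants."""
    return 2 ** (t * p) * t ** 6 * p ** 5 * m + n * m


# Hypothetical instance sizes: the exponential term dominates quickly as t grows.
for t in (3, 5, 10):
    print(t, bound_value(t=t, p=2, m=3, n=1000))
```

The number of rows n only enters through the additive term nm, so large matrices with few distinct rows remain within reach of the exact approach, whereas a modest increase in t or p makes the first term explode.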