5.3 Experimental Results

5.3.6 Heuristic vs. Exact Solution

      Greedy Heuristic             |  Mondrian
  k    u     r     #h   havg  hmax |    u       r     #h   havg  hmax
  2  1.4  13.0  63140    5.1  2448 |  1.5  3504.6  15984    2.4     9
  3  1.4  13.6  41408    7.8  2448 |  1.6  2196.7  10233    3.8    11
  4  1.4  14.7  31652   10.3  2448 |  1.7  1600.5   7458    5.2    12
  5  1.4  14.9  25852   12.6  2448 |  1.7  1252.1   5887    6.5    13
  6  1.5  15.3  22150   14.7  2448 |  1.8  1040.8   4856    7.9    15
  7  1.5  15.4  19399   16.7  2448 |  1.8   894.5   4139    9.3    17
  8  1.5  16.1  17276   18.8  2448 |  1.8   783.1   3618   10.6    19
  9  1.5  16.2  15651   20.7  2448 |  1.9   694.6   3191   12.1    22
 10  1.5  16.6  14248   22.8  2448 |  1.9   622.4   2840   13.6    27
 25  1.6  20.2   6167   52.6  2448 |  2.1   272.7   1120   34.4    57
 50  1.8  23.4   2988  108.6  2448 |  2.4   158.4    563   68.4   103
 75  1.9  25.7   1917  169.3  2448 |  2.5   117.0    356  108.2   154
100  1.9  27.6   1393  232.9  2838 |  2.6   103.4    279  138.0   201

Table 5.5: Heuristic vs. Mondrian: Results for the Canada data set.

our heuristic always produces one block of at least 2448 identical rows. See Table 5.5 for details.

Conclusions for Classical k-Anonymity. We showed that our greedy heuristic is very efficient even for real-world datasets with more than 100,000 records and k ≤ 100. Especially for smaller degrees of anonymity k ≤ 10, Mondrian is at least ten times slower. Altogether, our heuristic outperforms Mondrian for all datasets except Nursery in terms of quality of the solution. There is no clear winner for the Nursery dataset. Hence, we demonstrated that even when neglecting the feature of pattern-guidedness and simply specifying all possible pattern vectors, our heuristic already produces useful solutions that can at least compete with Mondrian's solutions.

5 Pattern-Guided k-Anonymity

answer the question how far the produced solutions are typically away from the optimum. We evaluate our experiments using the following criteria:

1. number s of suppressions,

2. usefulness value u,

3. running time r in seconds,

4. number #h of output row types,

5. average size havg of the output row types, and

6. maximum size hmax of the output row types.
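The counting criteria above can be read off an anonymized table directly. The following is a minimal sketch, not the implementation used in the experiments: it assumes that suppressed entries are marked with "?", that rows are given as tuples, and that identical rows form one output row type. The function name `table_metrics` and the toy table are hypothetical. The usefulness value u and the running time r are not covered, since their computation is not described in this section.

```python
from collections import Counter

SUPPRESSED = "?"  # assumption: suppressed entries are marked with "?"


def table_metrics(rows):
    """Return (s, num_types, h_avg, h_max) for a non-empty anonymized table.

    rows: list of equal-length tuples; identical tuples form one output row type.
    """
    # s: total number of suppressed entries in the table
    s = sum(cell == SUPPRESSED for row in rows for cell in row)
    # map each output row type to the number of rows of that type
    type_sizes = Counter(rows)
    num_types = len(type_sizes)          # #h
    h_avg = len(rows) / num_types        # average size of an output row type
    h_max = max(type_sizes.values())     # size of the largest output row type
    return s, num_types, h_avg, h_max


# Hypothetical toy table with two output row types (sizes 2 and 3).
toy = [
    ("a", "?", "x"),
    ("a", "?", "x"),
    ("?", "b", "y"),
    ("?", "b", "y"),
    ("?", "b", "y"),
]
print(table_metrics(toy))
```

Under this reading, havg is simply the number of rows divided by #h, which matches the tables in this section (for example, 1473 rows in 533 output row types give roughly 2.8).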

Nursery data set. Our ILP implementation k-anonymized the Nursery dataset for k ∈ {2, . . . , 10, 25, 50, 75, 100} within two minutes for each input, that is, we solved k-ANONYMITY with a minimum number of suppressions.

In contrast, the ILP formulation could not k-anonymize the other datasets within 30 minutes for many values of k.

Surprisingly, the results computed by the heuristic were optimal (in terms of number of suppressed entries) for all tested k and many results are better in terms of the other quality measures. The reason seems to be that the ILP implementation tends to find, for a fixed number of suppressions, solutions with high degree of anonymity. For example, the result of the ILP for k = 6 is already 15-anonymous, whereas the result of the heuristic is 9-anonymous, yielding more and smaller output row types. Summarizing, the heuristic is at least 25 times faster than the ILP implementation and also produces solutions with a minimum number of suppressions which have a better quality concerning #h, havg, and hmax values. See Table 5.6 for details.
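The speed-up factor stated above can be rechecked from the running times reported in Table 5.6. The short sketch below only repeats this arithmetic; the variable names are chosen here for illustration, and the running times are the rounded values from the table.

```python
# Running times r in seconds for the Nursery dataset, copied from Table 5.6.
ks = [2, 3, 4, 5, 6, 7, 8, 9, 10, 25, 50, 75, 100]
r_heuristic = [0.8, 0.3, 0.3, 0.3, 0.3, 0.3, 0.4, 0.3, 0.4, 0.8, 1.2, 1.3, 1.6]
r_ilp = [63, 63, 61, 61, 61, 48, 59, 59, 59, 56, 50, 46, 44]

# Speed-up of the heuristic over the ILP implementation for each k.
speedups = {k: ilp / heur for k, heur, ilp in zip(ks, r_heuristic, r_ilp)}

# The k with the smallest speed-up determines the "at least" factor.
slowest_k = min(speedups, key=speedups.get)
print(slowest_k, round(speedups[slowest_k], 1))
```

The smallest factor occurs for k = 100 (44 seconds against 1.6 seconds), which is consistent with the stated bound of 25.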

CMC data set. Consider the scenario where the user is interested in a k-anonymized version of the CMC dataset where each row has at most two suppressed entries. To fulfill these constraints, we specified all possible pattern vectors with at most two ?-symbols (plus the all-?-vector to remove outliers) and applied our greedy heuristic and the ILP implementation for k ∈ {2, . . . , 10, 25, 50, 75, 100}.
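Enumerating such a pattern set is straightforward. The following is a minimal sketch under stated assumptions: a pattern vector is represented as a tuple in which "?" marks a suppressed position and "_" (a placeholder chosen here) marks a position that is kept; the function name `patterns_with_max_cost` is hypothetical. The value m = 10 in the usage line is inferred from Table 5.7, where suppressing all entries of the 1473 rows yields 14730 suppressions.

```python
from itertools import combinations


def patterns_with_max_cost(m, max_cost):
    """All pattern vectors of length m with at most max_cost '?' entries,
    plus the all-'?' vector used to remove outliers."""
    patterns = []
    for cost in range(max_cost + 1):
        # choose which positions are suppressed for this cost
        for suppressed in combinations(range(m), cost):
            patterns.append(
                tuple("?" if i in suppressed else "_" for i in range(m))
            )
    all_suppressed = ("?",) * m
    if all_suppressed not in patterns:  # only relevant when max_cost >= m
        patterns.append(all_suppressed)
    return patterns


vectors = patterns_with_max_cost(m=10, max_cost=2)
print(len(vectors))  # 1 + 10 + 45 patterns of cost at most 2, plus the all-? vector
```

The pattern set grows only quadratically in the number of columns for a cost bound of two, so specifying it explicitly stays cheap.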

As expected, the heuristic works much faster than the ILP implementation (at least by a factor of ten). The solution quality depends on the

      Greedy Heuristic                  |  ILP implementation
  k      s    u    r    #h  havg  hmax |    u   r    #h  havg  hmax
  2  12690  3.2  0.8  4320     3     3 |  3.2  63  4320     3     3
  3  12690  3.2  0.3  4320     3     3 |  3.2  63  4320     3     3
  4  12690  3.3  0.3  3240     4     4 |  3.3  61  2592     5     5
  5  12690  3.3  0.3  2592     5     5 |  3.3  61  2592     5     5
  6  25920  3.9  0.3  1440     9     9 |  4.0  61   864    15    15
  7  25920  3.9  0.3  1440     9     9 |  4.0  48   864    15    15
  8  25920  3.9  0.4  1440     9     9 |  4.0  59   864    15    15
  9  25920  3.9  0.3  1440     9     9 |  4.0  59   864    15    15
 10  25920  3.9  0.4  1080    12    12 |  4.0  59   864    15    15
 25  38880  4.5  0.8   480    27    27 |  4.8  56   216    60    60
 50  38880  4.8  1.2   216    60    60 |  4.8  50   216    60    60
 75  38880  4.8  1.3   162    80    80 |  4.8  46   162    80    80
100  51840  5.4  1.6   120   108   108 |  5.5  44    54   240   240

Table 5.6: Heuristic vs. ILP: Results for the Nursery dataset specifying all pattern vectors.

anonymity degree k. The results of the heuristic get closer to the optimum with increasing k. Whereas for k = 2 the number of suppressions in the heuristic solution is 1.4 times the optimum, for k > 10 the heuristic produces results with a minimum number of suppressions. Most other quality measures behave similarly, but the differences are less pronounced. The usefulness values of the heuristic results are at least as good as those of the ILP results for k ≤ 7 and k ≥ 50. See Table 5.7 for details.
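The ratio between the heuristic and the optimal number of suppressions can be recomputed from the s columns of Table 5.7. The sketch below only repeats this arithmetic on the reported values; the variable names are chosen here for illustration.

```python
# Numbers of suppressions s for the CMC dataset, copied from Table 5.7.
ks = [2, 3, 4, 5, 6, 7, 8, 9, 10, 25, 50, 75, 100]
s_heuristic = [4112, 6564, 8252, 8952, 9821, 10339, 10878,
               11486, 11678, 13722, 14314, 14730, 14730]
s_optimal = [2932, 5216, 7024, 8065, 9012, 9751, 10254,
             11051, 11462, 13722, 14314, 14730, 14730]

# Ratio of heuristic to optimal (ILP) suppressions for each k.
ratios = {k: h / o for k, h, o in zip(ks, s_heuristic, s_optimal)}

for k, ratio in ratios.items():
    print(f"k={k:>3}: {ratio:.2f}")
```

The largest ratio is the one for k = 2 (4112 against 2932, about 1.4), and for k ∈ {25, 50, 75, 100} the ratio is exactly 1.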

Adult-2 data set. Consider a user who is interested in the Adult-2 dataset. Her main goal is to analyze correlations between income of the individuals and the other attributes (to detect discrimination). To get useful data, she specifies four constraints for an anonymized record:

1. Each record should contain at most two suppressed entries.

2. The attributes “education” and “salary class” should not be suppressed because she assumes a strong relation between them.

3. One of the attributes “work class” and “occupation” alone is useless for her, so either both should be suppressed or none of them.

4. Since she assumes discrimination because of age, sex, and race, at most one of these attributes should be suppressed.

      Greedy Heuristic                    |  ILP implementation
  k      s     u    r   #h   havg  hmax |      s     u    r   #h   havg  hmax
  2   4112  3.18  0.1  533    2.8   249 |   2932  3.22  2.2  653    2.3   100
  3   6564  3.42  0.1  264    5.6   501 |   5216  3.46  3.7  349    4.2   320
  4   8252  3.57  0.1  153    9.6   696 |   7024  3.60  2.2  208    7.1   528
  5   8952  3.69  0.1  109   13.5   771 |   8065  3.73  3.9  146   10.1   646
  6   9821  3.76  0.1   78   18.9   874 |   9012  3.80  1.7  103   14.3   765
  7  10339  3.84  0.1   61   24.1   935 |   9751  3.84  1.5   76   19.4   856
  8  10878  3.95  0.1   47   31.3   998 |  10254  3.91  1.5   60   24.6   918
  9  11486  4.06  0.1   32   46.0  1074 |  11051  4.00  1.3   44   33.4  1016
 10  11678  4.08  0.1   28   52.6  1098 |  11462  4.05  1.4   35   42.1  1066
 25  13722  5.69  0.1    4  368.3  1347 |  13722  5.37  1.1    5  294.6  1347
 50  14314  7.12  0.1    2  736.5  1421 |  14314  7.12  1.2    2  736.5  1421
 75  14730  10.0  0.1    1   1473  1473 |  14730  10.0  1.2    1   1473  1473
100  14730  10.0  0.1    1   1473  1473 |  14730  10.0  1.1    1   1473  1473

Table 5.7: Heuristic vs. ILP: Results for the CMC dataset specifying all pattern vectors with costs at most 2.

We generated the set of pattern vectors fulfilling her constraints (plus the all-?-vector to remove outliers) and applied our greedy heuristic and the ILP implementation for k ∈ {2, 3, . . . , 10, 25, 50, 75, 100}.
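The four constraints translate into a simple filter over candidate pattern vectors. The following is a minimal sketch, not the generator used in the experiments. It is an assumption that only the seven attributes named in the constraints are modeled (the actual Adult-2 dataset may contain further columns, which would enlarge the pattern set), and the column order, the "_" placeholder for kept positions, and the function names are chosen here for illustration.

```python
from itertools import product

# Assumption: only the attributes named in the constraints, in an arbitrary order.
ATTRIBUTES = ["age", "work class", "education", "occupation",
              "race", "sex", "salary class"]


def satisfies_constraints(suppressed):
    """Check the four user constraints for a set of suppressed attributes."""
    if len(suppressed) > 2:                                   # constraint 1
        return False
    if "education" in suppressed or "salary class" in suppressed:  # constraint 2
        return False
    if ("work class" in suppressed) != ("occupation" in suppressed):  # constraint 3
        return False
    if len(suppressed & {"age", "sex", "race"}) > 1:          # constraint 4
        return False
    return True


def adult2_patterns():
    """All pattern vectors fulfilling the constraints, plus the all-? vector."""
    patterns = []
    for mask in product([False, True], repeat=len(ATTRIBUTES)):
        suppressed = {a for a, m in zip(ATTRIBUTES, mask) if m}
        if satisfies_constraints(suppressed):
            patterns.append(tuple("?" if m else "_" for m in mask))
    # The all-? vector violates the constraints, so it is added separately.
    patterns.append(("?",) * len(ATTRIBUTES))
    return patterns


for pattern in adult2_patterns():
    print(" ".join(pattern))
```

With these seven attributes the filter leaves only a handful of patterns: nothing suppressed, exactly one of age, sex, or race suppressed, or work class and occupation suppressed together.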

The ILP implementation took up to 6 minutes to compute one single instance, whereas the greedy heuristic always needs less than one second. Moreover, the solution quality of the heuristic results is surprisingly good. The number of suppressed entries is at most 1.31 times the optimum. The ILP is slightly better concerning the measures #h and havg. Only the maximum size of the output row types of the heuristic results is, for some k, more than twice that of the ILP results. Surprisingly, the usefulness values are always slightly better for the heuristic results. See Table 5.8 for details.

Conclusion. In three scenarios with real-world datasets, we showed that our greedy heuristic performs well in terms of solution quality compared with the optimal solution produced by the ILP implementation. The results of the heuristic are relatively close to the optimum and in fact for many cases they were optimal, although our heuristic is much more efficient than the exact algorithm (the ILP was, on average, more than 1000 times slower). The heuristic results tend to get closer to the optimal number of suppressions with increasing degree k of anonymity.