
5.3 Experimental Results

5.3.5 Heuristic vs. Mondrian

In this subsection we evaluate our experiments. Observe that the Mondrian algorithm does not suppress entries but replaces them with more general ones. Hence, the number of suppressions is not a suitable quality criterion for the comparison; instead we use the usefulness as defined in Subsection 5.3.3. Overall, we use the following criteria to assess the quality of an anonymization algorithm:

1. Usefulness value u,
2. Running time r in seconds,
3. Number #h of output row types,
4. Average size h_avg of the output row types, and
5. Maximum size h_max of the output row types.

Except for #h, lower values indicate better solutions.

[Figure 5.1: two plots showing running time in seconds and usefulness, each against the degree k of anonymity, with curves for the heuristic and Mondrian.]

Figure 5.1: Heuristic vs. Mondrian: Diagrams comparing running time and usefulness for the Adult dataset.

7 http://cs.utdallas.edu/dspl/cgi-bin/toolbox/index.php?go=home
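The three criteria concerning output row types (#h, h_avg, h_max) can be read directly off an anonymized table. The following is a minimal sketch, assuming that an output row type is a maximal group of identical output rows; the function name `row_type_stats` and the toy data are hypothetical and serve only as illustration.

```python
from collections import Counter


def row_type_stats(rows):
    """Return (#h, h_avg, h_max) for a non-empty list of anonymized rows.

    Assumption: an output row type is a group of identical output rows.
    """
    sizes = Counter(tuple(row) for row in rows).values()
    num_types = len(sizes)
    return num_types, len(rows) / num_types, max(sizes)


# Toy 2-anonymous table; "*" marks a suppressed entry.
anonymized = [
    ("*", "30-39", "A"),
    ("*", "30-39", "A"),
    ("M", "*", "B"),
    ("M", "*", "B"),
    ("M", "*", "B"),
]
stats = row_type_stats(anonymized)
print(stats)  # (2, 2.5, 3)
```

Under this reading, h_avg is simply the number of rows divided by #h, so a larger #h goes hand in hand with a smaller h_avg on the same dataset.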

For each dataset, we computed k-anonymous datasets with our greedy heuristic and Mondrian for k ∈ {2, 3, . . . , 10, 25, 50, 75, 100}. In the tables comparing the results of the greedy heuristic and Mondrian, we highlight the best obtained values with a light gray background.

General Observations. The running time behavior of the tested algorithms is somewhat unexpected. Whereas Mondrian gets faster with increasing k, our greedy heuristic gets faster with decreasing k. The reason why the greedy heuristic is faster for small values of k is that usually the cheap pattern vectors are used and, hence, the number of remaining input rows decreases quickly. In contrast, when k is large, the cheap pattern vectors cannot be used and, hence, the greedy heuristic tests many pattern vectors before it actually starts removing rows from the input matrix. Thus, for larger values of k the greedy heuristic comes closer to its worst-case running time of O(pnm) with p = 2^m.
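To see why the bound grows quickly when all pattern vectors are specified, the sketch below enumerates them for a small number m of columns. It assumes that a pattern vector marks each column as either kept or suppressed (which yields 2^m vectors) and that the cost of a vector is its number of suppressed columns. The encoding, the symbols, and the helper name `all_pattern_vectors` are illustrative assumptions, not the implementation used in the experiments.

```python
from itertools import product

KEEP, SUPPRESS = "o", "*"  # hypothetical symbols for kept / suppressed columns


def all_pattern_vectors(m):
    """Enumerate all 2**m pattern vectors, cheapest (fewest suppressions) first."""
    vectors = product((KEEP, SUPPRESS), repeat=m)
    return sorted(vectors, key=lambda v: v.count(SUPPRESS))


vectors = all_pattern_vectors(3)
print(len(vectors))  # 8
print(vectors[0])    # ('o', 'o', 'o')
print(vectors[-1])   # ('*', '*', '*')
```

With m = 20 columns this enumeration already contains more than a million candidates, which illustrates why having to test many expensive pattern vectors can dominate the running time for large k.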

5 Pattern-Guided k-Anonymity

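| k | Heuristic u | Heuristic r | Heuristic #h | Heuristic h_avg | Heuristic h_max | Mondrian u | Mondrian r | Mondrian #h | Mondrian h_avg | Mondrian h_max |
|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 2.1 | 5.5 | 14589 | 2.2 | 16 | 3.5 | 2789.4 | 11136 | 2.7 | 61 |
| 3 | 2.3 | 13.2 | 9208 | 3.5 | 18 | 3.8 | 1803.5 | 7306 | 4.1 | 61 |
| 4 | 2.5 | 19.5 | 6670 | 4.9 | 25 | 4.0 | 1337.9 | 5432 | 5.6 | 61 |
| 5 | 2.6 | 24.9 | 5199 | 6.3 | 31 | 4.2 | 1062.0 | 4325 | 7.0 | 61 |
| 6 | 2.7 | 29.7 | 4315 | 7.5 | 42 | 4.4 | 885.9 | 3597 | 8.4 | 61 |
| 7 | 2.9 | 34.1 | 3669 | 8.9 | 53 | 4.5 | 754.7 | 3053 | 9.9 | 61 |
| 8 | 2.9 | 37.6 | 3193 | 10.2 | 53 | 4.6 | 659.2 | 2663 | 11.3 | 61 |
| 9 | 3.0 | 41.2 | 2832 | 11.5 | 52 | 4.8 | 588.3 | 2368 | 12.7 | 69 |
| 10 | 3.1 | 44.8 | 2559 | 12.7 | 56 | 4.9 | 535.9 | 2145 | 14.1 | 69 |
| 25 | 3.8 | 79.3 | 1046 | 31.1 | 161 | 6.0 | 229.2 | 850 | 35.5 | 90 |
| 50 | 4.5 | 117.0 | 537 | 60.6 | 317 | 6.7 | 127.4 | 430 | 70.1 | 135 |
| 75 | 4.9 | 144.5 | 354 | 92.0 | 317 | 7.3 | 93.6 | 287 | 105.1 | 242 |
| 100 | 5.2 | 163.6 | 274 | 118.8 | 317 | 7.8 | 76.0 | 209 | 144.3 | 242 |

Table 5.1: Heuristic vs. Mondrian: Results for the Adult dataset.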

Adult dataset. Our greedy heuristic anonymized the Adult dataset in less than three minutes for all tested values of k. For k = 2 and k = 3, Mondrian took more than half an hour to anonymize the dataset. However, in contrast to all other values of k, Mondrian was faster than our heuristic for k = 75 and k = 100. Except for h_max with k ≥ 25, all quality measures indicate that our heuristic produces the better solution. The usefulness value of the Mondrian solutions is between 1.5 and 1.7 times the usefulness value of our heuristic for all tested k, which indicates a significantly better quality of the results of our heuristic. See Table 5.1 for details and Figure 5.1 for an illustration.
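The stated range of usefulness ratios can be recomputed from the u columns of Table 5.1. The snippet below is a small sanity check using those published values; because they are rounded to one decimal place, the resulting ratios are approximate as well. The variable names are chosen here for illustration.

```python
# Usefulness values u from Table 5.1 (Adult dataset), indexed by k.
ks = [2, 3, 4, 5, 6, 7, 8, 9, 10, 25, 50, 75, 100]
u_heuristic = [2.1, 2.3, 2.5, 2.6, 2.7, 2.9, 2.9, 3.0, 3.1, 3.8, 4.5, 4.9, 5.2]
u_mondrian = [3.5, 3.8, 4.0, 4.2, 4.4, 4.5, 4.6, 4.8, 4.9, 6.0, 6.7, 7.3, 7.8]

# Ratio of Mondrian's usefulness value to the heuristic's (lower u is better).
ratios = {k: m / h for k, h, m in zip(ks, u_heuristic, u_mondrian)}
lowest, highest = min(ratios.values()), max(ratios.values())
print(f"{lowest:.1f} to {highest:.1f}")  # 1.5 to 1.7
```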

Adult-2 dataset. The solutions for Adult-2 behave similarly to those for Adult. Our greedy heuristic, with a maximum running time of five seconds, is significantly faster than Mondrian, with a maximum running time of about 21 minutes (at least 10 times faster for all tested k). However, the usefulness is quite similar for both algorithms. Mondrian beats the heuristic by less than 1 % for k = 50; the heuristic is slightly better for every other tested k. See Table 5.2 for details.
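The speedup claim can be checked against the running times r reported in Table 5.2. The following sketch divides Mondrian's running time by the heuristic's for each k and reports the smallest factor; the variable names are illustrative, and the factors inherit the rounding of the published values.

```python
# Running times r in seconds from Table 5.2 (Adult-2 dataset), indexed by k.
ks = [2, 3, 4, 5, 6, 7, 8, 9, 10, 25, 50, 75, 100]
r_heuristic = [1.0, 1.2, 1.3, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0, 2.9, 3.9, 4.4, 4.9]
r_mondrian = [1278.4, 887.7, 693.4, 565.4, 484.2, 418.0, 372.5,
              339.0, 308.1, 139.0, 79.3, 59.8, 49.6]

# Speedup factor of the heuristic over Mondrian for each k.
speedup = {k: m / h for k, h, m in zip(ks, r_heuristic, r_mondrian)}
k_min = min(speedup, key=speedup.get)
print(k_min, round(speedup[k_min], 1))  # 100 10.1
```

The smallest factor occurs at the largest tested k, in line with the observation that Mondrian gets faster and the heuristic slower as k grows.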

| k | Heuristic u | Heuristic r | Heuristic #h | Heuristic h_avg | Heuristic h_max | Mondrian u | Mondrian r | Mondrian #h | Mondrian h_avg | Mondrian h_max |
|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 1.8 | 1.0 | 12022 | 2.7 | 45 | 1.9 | 1278.4 | 7971 | 3.8 | 113 |
| 3 | 1.9 | 1.2 | 7971 | 4.1 | 45 | 2.0 | 887.7 | 5543 | 5.4 | 113 |
| 4 | 2.0 | 1.3 | 5890 | 5.5 | 45 | 2.1 | 693.4 | 4319 | 7.0 | 113 |
| 5 | 2.0 | 1.5 | 4609 | 7.1 | 45 | 2.1 | 565.4 | 3525 | 8.6 | 113 |
| 6 | 2.1 | 1.6 | 3836 | 8.5 | 45 | 2.2 | 484.2 | 3020 | 10.0 | 113 |
| 7 | 2.2 | 1.7 | 3266 | 10.0 | 52 | 2.3 | 418.0 | 2596 | 11.6 | 113 |
| 8 | 2.2 | 1.8 | 2837 | 11.5 | 63 | 2.3 | 372.5 | 2308 | 13.1 | 113 |
| 9 | 2.3 | 1.9 | 2518 | 12.9 | 63 | 2.3 | 339.0 | 2095 | 14.4 | 113 |
| 10 | 2.3 | 2.0 | 2273 | 14.3 | 66 | 2.4 | 308.1 | 1890 | 16.0 | 113 |
| 25 | 2.7 | 2.9 | 914 | 35.6 | 164 | 2.7 | 139.0 | 801 | 37.7 | 113 |
| 50 | 3.1 | 3.9 | 460 | 70.8 | 349 | 3.1 | 79.3 | 414 | 72.9 | 145 |
| 75 | 3.3 | 4.4 | 310 | 105.0 | 552 | 3.4 | 59.8 | 277 | 108.9 | 200 |
| 100 | 3.4 | 4.9 | 245 | 132.9 | 552 | 3.6 | 49.6 | 210 | 143.6 | 279 |

Table 5.2: Heuristic vs. Mondrian: Results for the Adult-2 dataset.

Nursery dataset. For the Nursery dataset, the heuristic is at least eight times faster than Mondrian. Concerning solution quality, this dataset is the most ambiguous one. Except for k = 5, Mondrian produces better solutions in terms of usefulness, whereas our heuristic performs better in terms of maximum and average output row type size. For the number of output row types, there is no clear winner. See Table 5.3 for details.

CMC dataset. For the CMC dataset, both algorithms computed k-anonymous datasets very quickly for every tested k. Mondrian took at most 10 seconds; our greedy heuristic took at most 1.2 seconds and was always faster than Mondrian. As for solution quality, the heuristic can compete with Mondrian. The usefulness of the heuristic results is always slightly better, the Mondrian results always have at least 20 % fewer output row types, and the average output row type size of the heuristic results is always smaller. Only for k = 5, 6, 7, and 8 do the Mondrian results have a lower maximum size of the output row types. See Table 5.4 for details.

| k | Heuristic u | Heuristic r | Heuristic #h | Heuristic h_avg | Heuristic h_max | Mondrian u | Mondrian r | Mondrian #h | Mondrian h_avg | Mondrian h_max |
|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 3.2 | 0.8 | 4320 | 3.0 | 3 | 2.8 | 484.572 | 6468 | 2.0 | 3 |
| 3 | 3.2 | 0.3 | 4320 | 3.0 | 3 | 3.1 | 233.710 | 3294 | 3.9 | 4 |
| 4 | 3.3 | 0.3 | 3240 | 4.0 | 4 | 3.1 | 221.731 | 3186 | 4.1 | 6 |
| 5 | 3.3 | 0.3 | 2592 | 5.0 | 5 | 3.3 | 122.665 | 1722 | 7.5 | 8 |
| 6 | 3.9 | 0.3 | 1440 | 9.0 | 9 | 3.3 | 122.713 | 1722 | 7.5 | 8 |
| 7 | 3.9 | 0.3 | 1440 | 9.0 | 9 | 3.4 | 104.568 | 1518 | 8.5 | 12 |
| 8 | 3.9 | 0.4 | 1440 | 9.0 | 9 | 3.4 | 104.638 | 1518 | 8.5 | 12 |
| 9 | 3.9 | 0.3 | 1440 | 9.0 | 9 | 3.6 | 67.630 | 922 | 14.1 | 16 |
| 10 | 4.0 | 0.4 | 1080 | 12.0 | 12 | 3.6 | 68.079 | 922 | 14.1 | 16 |
| 25 | 4.5 | 0.8 | 480 | 27.0 | 27 | 4.1 | 28.229 | 334 | 38.8 | 48 |
| 50 | 4.8 | 1.2 | 216 | 60.0 | 60 | 4.5 | 18.330 | 176 | 73.6 | 96 |
| 75 | 4.8 | 1.3 | 162 | 80.0 | 80 | 4.7 | 13.638 | 116 | 111.7 | 144 |
| 100 | 5.3 | 1.6 | 120 | 108.0 | 108 | 4.9 | 13.179 | 100 | 129.6 | 144 |

Table 5.3: Heuristic vs. Mondrian: Results for the Nursery dataset.

| k | Heuristic u | Heuristic r | Heuristic #h | Heuristic h_avg | Heuristic h_max | Mondrian u | Mondrian r | Mondrian #h | Mondrian h_avg | Mondrian h_max |
|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 3.3 | 0.1 | 718 | 2.0 | 4 | 3.5 | 9.134 | 599 | 2.5 | 7 |
| 3 | 3.5 | 0.1 | 461 | 3.2 | 7 | 3.8 | 6.375 | 391 | 3.8 | 8 |
| 4 | 3.7 | 0.2 | 334 | 4.4 | 9 | 4.1 | 4.848 | 273 | 5.4 | 10 |
| 5 | 3.9 | 0.2 | 258 | 5.7 | 15 | 4.3 | 4.205 | 223 | 6.6 | 11 |
| 6 | 4.1 | 0.2 | 216 | 6.8 | 17 | 4.5 | 3.729 | 184 | 8.0 | 13 |
| 7 | 4.2 | 0.2 | 183 | 8.1 | 17 | 4.7 | 3.328 | 155 | 9.5 | 16 |
| 8 | 4.4 | 0.3 | 158 | 9.3 | 18 | 4.9 | 3.085 | 135 | 10.9 | 17 |
| 9 | 4.5 | 0.3 | 139 | 10.6 | 18 | 5.0 | 2.914 | 122 | 12.1 | 21 |
| 10 | 4.5 | 0.3 | 127 | 11.6 | 18 | 5.2 | 2.717 | 108 | 13.6 | 21 |
| 25 | 5.6 | 0.4 | 48 | 30.7 | 53 | 6.4 | 1.862 | 43 | 34.3 | 57 |
| 50 | 6.3 | 0.6 | 27 | 54.6 | 77 | 7.3 | 1.556 | 22 | 67.0 | 95 |
| 75 | 6.9 | 0.7 | 17 | 86.6 | 148 | 7.8 | 1.404 | 13 | 113.3 | 148 |
| 100 | 7.3 | 0.8 | 13 | 113.3 | 167 | 8.0 | 1.368 | 10 | 147.3 | 204 |

Table 5.4: Heuristic vs. Mondrian: Results for the CMC dataset.

Canada dataset. Again, our heuristic outperforms Mondrian in terms of efficiency (at least three times faster). However, for this dataset the quality measures are contradictory. Whereas the usefulness of the heuristic results is always slightly better and the number of output row types of the heuristic results is at least four times the number of output row types of the Mondrian results, the measures concerning the size of the output row types are significantly better for Mondrian. The reason seems to be that our heuristic always produces one block of at least 2448 identical rows. See Table 5.5 for details.

| k | Heuristic u | Heuristic r | Heuristic #h | Heuristic h_avg | Heuristic h_max | Mondrian u | Mondrian r | Mondrian #h | Mondrian h_avg | Mondrian h_max |
|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 1.4 | 13.0 | 63140 | 5.1 | 2448 | 1.5 | 3504.6 | 15984 | 2.4 | 9 |
| 3 | 1.4 | 13.6 | 41408 | 7.8 | 2448 | 1.6 | 2196.7 | 10233 | 3.8 | 11 |
| 4 | 1.4 | 14.7 | 31652 | 10.3 | 2448 | 1.7 | 1600.5 | 7458 | 5.2 | 12 |
| 5 | 1.4 | 14.9 | 25852 | 12.6 | 2448 | 1.7 | 1252.1 | 5887 | 6.5 | 13 |
| 6 | 1.5 | 15.3 | 22150 | 14.7 | 2448 | 1.8 | 1040.8 | 4856 | 7.9 | 15 |
| 7 | 1.5 | 15.4 | 19399 | 16.7 | 2448 | 1.8 | 894.5 | 4139 | 9.3 | 17 |
| 8 | 1.5 | 16.1 | 17276 | 18.8 | 2448 | 1.8 | 783.1 | 3618 | 10.6 | 19 |
| 9 | 1.5 | 16.2 | 15651 | 20.7 | 2448 | 1.9 | 694.6 | 3191 | 12.1 | 22 |
| 10 | 1.5 | 16.6 | 14248 | 22.8 | 2448 | 1.9 | 622.4 | 2840 | 13.6 | 27 |
| 25 | 1.6 | 20.2 | 6167 | 52.6 | 2448 | 2.1 | 272.7 | 1120 | 34.4 | 57 |
| 50 | 1.8 | 23.4 | 2988 | 108.6 | 2448 | 2.4 | 158.4 | 563 | 68.4 | 103 |
| 75 | 1.9 | 25.7 | 1917 | 169.3 | 2448 | 2.5 | 117.0 | 356 | 108.2 | 154 |
| 100 | 1.9 | 27.6 | 1393 | 232.9 | 2838 | 2.6 | 103.4 | 279 | 138.0 | 201 |

Table 5.5: Heuristic vs. Mondrian: Results for the Canada dataset.

Conclusions for Classical k-Anonymity. We showed that our greedy heuristic is very efficient even for real-world datasets with more than 100,000 records and k ≤ 100. Especially for smaller degrees of anonymity k ≤ 10, Mondrian is at least ten times slower. Altogether, our heuristic outperforms Mondrian in terms of solution quality for all datasets except Nursery, for which there is no clear winner. Hence, we demonstrated that even when neglecting the feature of pattern-guidedness and simply specifying all possible pattern vectors, our heuristic already produces useful solutions that can at least compete with Mondrian's solutions.