Academic year: 2021


14. Hashing

Hash Tables, Birthday Paradox, Hash Functions, Perfect and Universal Hashing, Resolving Collisions with Chaining, Open Addressing, Probing

[Ottman/Widmayer, Kap. 4.1–4.3.2, 4.3.4; Cormen et al., Kap. 11–11.4]

Motivation

Goal: a table of all n students of this course. Requirement: fast access by name.

Naive Ideas

Map a name s = s_1 s_2 … s_{l_s} to the key

k(s) = ∑_{i=1}^{l_s} s_i · b^i

with b large enough that different names map to different keys.

Store each data set at its index in a huge array.

Example with b = 100 and ASCII values s_i: Anna ↦ 71111065, Jacqueline ↦ 102110609021813999774.

Unrealistic: this requires far too large arrays.
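The naive scheme can be sketched as follows. This is an illustration, not the slides' code: the base b = 1000 and the helper name naive_key are our choices, with b chosen larger than every character code so the mapping stays injective.

```python
def naive_key(name: str, b: int = 1000) -> int:
    # k(s) = sum over i of s_i * b^i, where s_i are the character codes.
    # b must exceed every character code (1000 > 127 for ASCII) so that
    # different names are guaranteed to get different keys.
    return sum(ord(c) * b ** (i + 1) for i, c in enumerate(name))
```

Even short names produce astronomically large keys, which is exactly why a direct array indexed by k(s) is unrealistic.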

Better idea?

Allocate an array of size m (m > n). Map a name s to

k_m(s) = ( ∑_{i=1}^{l_s} s_i · b^i ) mod m.

Different names can now map to the same key (a “collision”). And then?
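Reducing the naive key modulo the table size can be sketched as follows (again with our illustrative base b = 1000). The deliberately degenerate choice m = 10 below, which divides b, forces every name into slot 0 and shows how badly m can interact with b:

```python
def key_mod(name: str, m: int, b: int = 1000) -> int:
    # k_m(s): the naive key reduced modulo the table size m.
    return sum(ord(c) * b ** (i + 1) for i, c in enumerate(name)) % m

# If m divides b, every term s_i * b^i is divisible by m, so *all*
# names collide in slot 0 -- one reason to prefer a prime m.
```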

Estimation

Maybe collisions do not really occur in practice? Let us make an estimate ...

Estimation

Assumption: m urns, n balls (w.l.o.g. n ≤ m). The n balls are placed into the urns uniformly at random.

What is the probability of a collision?

A very similar question: for how many people (n) is the probability that two of them share the same birthday (m = 365) larger than 50%?

Estimation

P(no collision) = m/m · (m−1)/m ··· (m−n+1)/m = m! / ((m−n)! · m^n).

Let a ≪ m. With e^x = 1 + x + x²/2! + … we approximate 1 − a/m ≈ e^{−a/m}. This yields

(1 − 1/m)(1 − 2/m) ··· (1 − (n−1)/m) ≈ e^{−(1+2+···+(n−1))/m} = e^{−n(n−1)/(2m)}.

Thus P(collision) ≈ 1 − e^{−n(n−1)/(2m)}.

Puzzle answer: with 23 people the probability of a birthday collision is 50.7%, derived from the slightly more accurate Stirling formula.
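The derivation is easy to check numerically; a small sketch (the function names are our own):

```python
import math

def p_collision_exact(n: int, m: int = 365) -> float:
    # Exact: 1 - m/m * (m-1)/m * ... * (m-n+1)/m, as a running product.
    p_no = 1.0
    for i in range(n):
        p_no *= (m - i) / m
    return 1.0 - p_no

def p_collision_approx(n: int, m: int = 365) -> float:
    # Slide approximation: 1 - e^{-n(n-1)/(2m)}.
    return 1.0 - math.exp(-n * (n - 1) / (2 * m))
```

For n = 23 the exact value is about 0.507, matching the puzzle answer; the exponential approximation gives about 0.500.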

With filling degree α := n/m it holds (simplified further) that

P(collision) ≈ 1 − e^{−α²·m/2}.

[Figure: P(collision) as a function of m for filling degrees α = 5%, 10%, 20%.]

The maximal filling degree should be chosen according to the ratio n²/m.

Nomenclature

Hash function h: a mapping from the set of keys K to the index set {0, 1, …, m−1} of an array (the hash table):

h : K → {0, 1, …, m−1}.

Normally |K| ≫ m, so there are k₁, k₂ ∈ K with h(k₁) = h(k₂) (a collision).

A hash function should map the set of keys as uniformly as possible to the hash table.

Examples of Good Hash Functions

h(k) = k mod m, with m prime

h(k) = ⌊m(k·r − ⌊k·r⌋)⌋, with r irrational; particularly good: r = (√5 − 1)/2 (the golden ratio).
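The second (multiplicative) scheme can be sketched as follows, assuming r = (√5 − 1)/2; the helper name mult_hash is our own:

```python
import math

def mult_hash(k: int, m: int) -> int:
    # h(k) = floor(m * (k*r - floor(k*r))): take the fractional part
    # of k*r and scale it to a slot in {0, ..., m-1}.
    r = (math.sqrt(5) - 1) / 2
    return int(m * ((k * r) % 1.0))
```

Floating-point rounding makes this a sketch only; practical implementations of multiplicative hashing typically use fixed-point integer arithmetic.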

Perfect Hashing

Is the set of used keys known up front? Then the hash function can be chosen perfectly. The practical construction is non-trivial.

Example: table of key words of a compiler.


Universal Hashing

|K| > m ⇒ a set of “similar” keys can be chosen such that a large number of collisions occurs.

It is impossible to select a “best” hash function for all cases. Possible, however: randomize!¹⁴

A universal hash class H ⊆ {h : K → {0, 1, …, m−1}} is a family of hash functions such that

∀ k₁ ≠ k₂ ∈ K : |{h ∈ H | h(k₁) = h(k₂)}| ≤ (1/m)·|H|.

¹⁴ Similar to quicksort.


Universal Hashing

Theorem

A function h chosen uniformly at random from a universal class H of hash functions distributes an arbitrary sequence of keys from K as uniformly as possible over the available slots.

Universal Hashing

Initial remark for the proof of the theorem:

Define, for x, y ∈ K, h ∈ H, Y ⊆ K:

δ(x, y, h) = 1 if h(x) = h(y) and x ≠ y, and 0 otherwise,

δ(x, Y, h) = ∑_{y∈Y} δ(x, y, h),    δ(x, y, H) = ∑_{h∈H} δ(x, y, h).

H is universal if for all x, y ∈ K with x ≠ y: δ(x, y, H) ≤ |H|/m.

Universal Hashing

Proof of the theorem

S ⊆ K: the keys stored so far. x is added now:

E_H(δ(x, S, h)) = ∑_{h∈H} δ(x, S, h) / |H|
= (1/|H|) ∑_{h∈H} ∑_{y∈S} δ(x, y, h)
= (1/|H|) ∑_{y∈S} ∑_{h∈H} δ(x, y, h)
= (1/|H|) ∑_{y∈S} δ(x, y, H)
≤ (1/|H|) ∑_{y∈S} |H|/m
= |S|/m.

Universal Hashing is Relevant!

Let p be prime and K = {0, …, p−1}. For a ∈ K \ {0} and b ∈ K define

h_{ab} : K → {0, …, m−1},   h_{ab}(x) = ((a·x + b) mod p) mod m.

Then the following theorem holds:

Theorem

The class H = {h_{ab} | a, b ∈ K, a ≠ 0} is a universal class of hash functions.
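The universality bound from the definition can be checked empirically for this class. A small sketch with illustrative parameters of our choosing (p = 17, m = 4, keys 3 and 11):

```python
# For one fixed pair of distinct keys, count how many functions
# h_ab(x) = ((a*x + b) mod p) mod m of the class collide on them.
p, m = 17, 4
k1, k2 = 3, 11

H = [(a, b) for a in range(1, p) for b in range(p)]  # a != 0
collisions = sum(
    1 for (a, b) in H
    if ((a * k1 + b) % p) % m == ((a * k2 + b) % p) % m
)
# Universality demands: collisions <= |H| / m.
```

The same check passes for every other pair k1 ≠ k2 in K, which is what the theorem asserts.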

Resolving Collisions

Example: m = 7, K = {0, …, 500}, h(k) = k mod m. Keys: 12, 53, 5, 15, 2, 19, 43.

Chaining the collisions: each hash-table slot points to a list of the colliding entries.

0: –
1: 15 → 43
2: 2
3: –
4: 53
5: 12 → 5 → 19
6: –

Resolving Collisions

Example: m = 7, K = {0, …, 500}, h(k) = k mod m. Keys: 12, 53, 5, 15, 2, 19, 43.

Direct chaining of the colliding entries: the first entry of each chain is stored in the hash table itself.

0: –
1: 15 → 43
2: 2
3: –
4: 53
5: 12 → 5 → 19
6: –

Algorithm for Hashing with Chaining

search(k): Search the list at position h(k) for k. Return true if found, otherwise false.

insert(k): Check if k is in the list at position h(k). If not, append k to the end of the list.

delete(k): Search the list at position h(k) for k. If successful, remove the list element.
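The three operations can be sketched as follows: a minimal illustration with Python lists as chains and h(k) = k mod m; the class name is our own.

```python
class ChainedHashTable:
    # Hashing with chaining: each slot holds a list of the keys
    # that hash to it.
    def __init__(self, m: int):
        self.m = m
        self.slots = [[] for _ in range(m)]

    def _h(self, k: int) -> int:
        return k % self.m            # h(k) = k mod m

    def search(self, k: int) -> bool:
        return k in self.slots[self._h(k)]

    def insert(self, k: int) -> None:
        chain = self.slots[self._h(k)]
        if k not in chain:           # only append if not yet present
            chain.append(k)

    def delete(self, k: int) -> None:
        chain = self.slots[self._h(k)]
        if k in chain:
            chain.remove(k)
```

Inserting the example keys 12, 53, 5, 15, 2, 19, 43 with m = 7 puts the chain 12 → 5 → 19 into slot 5 and 15 → 43 into slot 1, as in the figure above.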

Analysis (directly chained list)

1. Unsuccessful search. The average list length is α = n/m. The list has to be traversed completely.
⇒ Average number of entries considered: C′_n = α.

2. Successful search. Consider the insertion history: key j sees an average list length of (j−1)/m.
⇒ Average number of entries considered:

C_n = (1/n) ∑_{j=1}^{n} (1 + (j−1)/m) = 1 + (1/n) · n(n−1)/(2m) ≈ 1 + α/2.

Advantages and Disadvantages

Advantages

Possible to overcommit: α > 1 is allowed.
Easy to remove keys.

Disadvantages

Memory consumption of the chains.

Open Addressing

Store the colliding entries directly in the hash table using a probing function s(j, k) (0 ≤ j < m, k ∈ K).

Key table positions along the probing sequence:

S(k) := ((h(k) − s(0, k)) mod m, (h(k) − s(1, k)) mod m, …, (h(k) − s(m−1, k)) mod m)

Algorithms for open addressing

search(k): Traverse the table entries according to S(k). If k is found, return true. If the probing sequence is exhausted or an empty position is reached, return false.

insert(k): Search for k in the table according to S(k). If k is not present, insert k at the first free¹⁵ position in the probing sequence.

delete(k): Search for k in the table according to S(k). If k is found, mark the position of k with a deleted flag.

¹⁵ A position also counts as free when it contains a deleted flag.

Linear Probing

s(j, k) = j ⇒

S(k) = (h(k) mod m, (h(k) − 1) mod m, …, (h(k) + 1) mod m)

Example: m = 7, K = {0, …, 500}, h(k) = k mod m. Keys: 12, 53, 5, 15, 2, 19.

index: 0   1   2   3   4   5   6
key:   19  15  2   5   53  12  –
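Open addressing with the three operations above and the linear probe sequence (h(k) − j) mod m can be sketched as follows. The class and sentinel names are our own; this is a minimal illustration, not a production design.

```python
class LinearProbingTable:
    # Open addressing with linear probing and deleted flags.
    EMPTY, DELETED = object(), object()

    def __init__(self, m: int):
        self.m = m
        self.table = [self.EMPTY] * m

    def _probe(self, k: int):
        # Probe sequence S(k): h(k), h(k)-1, ..., h(k)+1 (mod m).
        h = k % self.m
        for j in range(self.m):
            yield (h - j) % self.m

    def search(self, k: int) -> bool:
        for pos in self._probe(k):
            if self.table[pos] is self.EMPTY:
                return False         # empty slot ends the search
            if self.table[pos] == k:
                return True          # deleted slots are skipped over
        return False

    def insert(self, k: int) -> None:
        if self.search(k):
            return
        for pos in self._probe(k):
            # A slot is free if empty or marked deleted.
            if self.table[pos] is self.EMPTY or self.table[pos] is self.DELETED:
                self.table[pos] = k
                return
        raise RuntimeError("table full")

    def delete(self, k: int) -> None:
        for pos in self._probe(k):
            if self.table[pos] is self.EMPTY:
                return
            if self.table[pos] == k:
                self.table[pos] = self.DELETED  # keep probe chains intact
                return
```

Inserting 12, 53, 5, 15, 2, 19 with m = 7 fills slots 5, 4, 3, 1, 2, 0 in that order, reproducing the example table; the deleted flag keeps later keys in the same probe chain findable.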

Analysis linear probing (without proof)

1. Unsuccessful search. Average number of entries considered:
C′_n ≈ (1/2) · (1 + 1/(1−α)²)

2. Successful search. Average number of entries considered:
C_n ≈ (1/2) · (1 + 1/(1−α)).

Discussion

Example α = 0.95:

The unsuccessful search considers 200 table entries on average!

? Disadvantage of the method?

! Primary clustering: similar hash addresses have similar probing sequences ⇒ long contiguous runs of occupied entries.

Quadratic Probing

s(j, k) = ⌈j/2⌉² · (−1)^j ⇒

S(k) = (h(k), h(k) + 1, h(k) − 1, h(k) + 4, h(k) − 4, …) mod m

Example: m = 7, K = {0, …, 500}, h(k) = k mod m. Keys: 12, 53, 5, 15, 2, 19.

index: 0   1   2   3   4   5   6
key:   19  15  2   –   53  12  5

Analysis Quadratic Probing (without Proof)

1. Unsuccessful search. Average number of entries considered:
C′_n ≈ 1/(1−α) − α + ln(1/(1−α))

2. Successful search. Average number of entries considered:
C_n ≈ 1 + ln(1/(1−α)) − α/2.

Discussion

Example α = 0.95:

The unsuccessful search considers 22 entries on average.

? Problems of this method?

! Secondary clustering: synonyms k and k′ (with h(k) = h(k′)) traverse the same probing sequence.

Double Hashing

Two hash functions h(k) and h′(k). s(j, k) = j · h′(k) ⇒

S(k) = (h(k), h(k) − h′(k), h(k) − 2h′(k), …, h(k) − (m−1)·h′(k)) mod m

Example: m = 7, K = {0, …, 500}, h(k) = k mod 7, h′(k) = 1 + k mod 5. Keys: 12, 53, 5, 15, 2, 19.

index: 0   1   2   3   4   5   6
key:   19  15  2   5   53  12  –

Double Hashing

The probing sequence must permute all hash addresses. Thus h′(k) ≠ 0, and h′(k) may not divide m; this is guaranteed, for example, when m is prime.

h′ should be as independent of h as possible (to avoid secondary clustering).

Independence: P((h(k) = h(k′)) ∧ (h′(k) = h′(k′))) = P(h(k) = h(k′)) · P(h′(k) = h′(k′)).

Independence is fulfilled, for example, by h(k) = k mod m and h′(k) = 1 + k mod (m−2) (m prime).
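The permutation requirement is easy to check for the example functions; a small sketch (the helper name is our own):

```python
def double_hash_probe(k: int, m: int = 7):
    # Probe sequence S(k) = (h(k) - j*h'(k)) mod m for j = 0..m-1,
    # with the example functions h(k) = k mod 7, h'(k) = 1 + k mod 5.
    h, h2 = k % m, 1 + k % 5
    return [(h - j * h2) % m for j in range(m)]
```

Because m = 7 is prime, every step width h′(k) ∈ {1, …, 5} is coprime to m, so each probe sequence visits all 7 slots exactly once.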

Analysis Double Hashing

Let h and h′ be independent; then:

1. Unsuccessful search. Average number of entries considered:
C′_n ≈ 1/(1−α)

2. Successful search. Average number of entries considered:
C_n ≈ 1 + α/2 + α³/4 + α⁴/15 − α⁵/18 + ⋯ < 2.5.

Overview

                    α = 0.50          α = 0.90          α = 0.95
                    Cn      C′n       Cn      C′n       Cn      C′n
Separate Chaining   1.250   1.110     1.450   1.307     1.475   1.337
Direct Chaining     1.250   0.500     1.450   0.900     1.475   0.950
Linear Probing      1.500   2.500     5.500   50.500    10.500  200.500
Quadratic Probing   1.440   2.190     2.850   11.400    3.520   22.050
Double Hashing      1.390   2.000     2.560   10.000    3.150   20.000
