Task: Implementation of Weighted AllPairs

December 21, 2018

1 Task

In Task 1, you implemented a specialized version of the AllPairs algorithm.

Each token contributed an equal amount to the total similarity, i.e., each token had a weight of 1. In Task 2, we generalize this such that each token may be associated with a different weight between 0 and 1.

1.1 Task 2

AllPairs can be modified to allow weighted similarity functions. Each token has a particular weight associated with it. The weight is the same for each occurrence of this token, e.g., token 14 in the input example above could have weight 0.1. It has this weight in all three sets it occurs in.

Your binary or script will therefore accept another input parameter:

./binary_or_script input_file weight_file jaccard_threshold

The weight file consists of a mapping of tokens to their weight. There is one mapping per line. Each line contains the token and the weight, separated by a colon. If a token is not mapped, 1 should be assumed as its weight. It may look like this:

3:0.1
2:0.2
14:0.1
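A minimal sketch of reading such a weight file, assuming tokens are integers (as the example token 14 suggests). The helper names `parse_weights` and `weight_of` are hypothetical and not part of the task description.

```python
def parse_weights(lines):
    """Parse 'token:weight' lines into a dict; blank lines are skipped."""
    weights = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        token, weight = line.split(":", 1)
        # Assumption: tokens are integers, weights are decimal numbers.
        weights[int(token)] = float(weight)
    return weights


def weight_of(token, weights):
    # Tokens without a mapping get weight 1, as the task specifies.
    return weights.get(token, 1.0)


# The three example mappings from the text, one per line.
weights = parse_weights(["3:0.1", "2:0.2", "14:0.1"])
print(weight_of(14, weights))  # 0.1
print(weight_of(7, weights))   # 1.0 (token 7 is not mapped)
```

For a real weight file, the lines could come from `open(weight_file)` instead of the in-memory list used here.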

The similarity function to implement is weighted Jaccard, which is defined as

$$J_W(r, s) = \frac{\sum_{w \in r \cap s} \mathrm{weight}(w)}{\sum_{w \in r \cup s} \mathrm{weight}(w)}$$

There are a few things to rethink:

1. The input sets have to be processed in a different order. In Task 1, the sets were processed in increasing order of their sizes. For the weighted AllPairs, this has to be changed. One option is to process sets in increasing order of their total weight.

2. The token ordering in each set has to be changed. Here, one option is to sort the tokens of a set by decreasing order of weight.

3. The lengths of probing and indexing prefix need to be adapted to take the token weights into account.
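The three points above, together with the weighted Jaccard function, can be sketched as follows. This is an illustration under stated assumptions, not a reference solution: all function names are hypothetical, and the two overlap bounds used for the prefixes (t · W(r) for probing, 2t/(1+t) · W(r) for indexing, where W(r) is the total weight of r and t the threshold) follow the usual prefix-filter argument and should be checked against [1]. The indexing bound additionally assumes that sets are processed in increasing order of total weight.

```python
def weight_of(token, weights):
    # Unmapped tokens default to weight 1.
    return weights.get(token, 1.0)


def total_weight(tokens, weights):
    return sum(weight_of(t, weights) for t in tokens)


def weighted_jaccard(r, s, weights):
    """Weight of the intersection divided by weight of the union."""
    r, s = set(r), set(s)
    union = total_weight(r | s, weights)
    return total_weight(r & s, weights) / union if union > 0 else 0.0


def order_tokens(tokens, weights):
    # Decreasing weight; ties are broken by token id so that every set
    # follows the same global token order (needed by the prefix filter).
    return sorted(set(tokens), key=lambda t: (-weight_of(t, weights), t))


def prefix_length(ordered, weights, min_overlap):
    """Smallest prefix whose remaining suffix weighs less than min_overlap."""
    remaining = total_weight(ordered, weights)
    length = 0
    for t in ordered:
        if remaining < min_overlap:
            break
        remaining -= weight_of(t, weights)
        length += 1
    return length


weights = {3: 0.1, 2: 0.2, 14: 0.1}
t = 0.5  # Jaccard threshold

# Point 1: process sets in increasing order of total weight.
sets = [[1, 2], [2, 3, 14], [5, 6, 7]]
by_weight = sorted(sets, key=lambda s: total_weight(s, weights))

# Point 2: order the tokens of a set by decreasing weight.
print(order_tokens([14, 3, 2, 1], weights))  # [1, 2, 3, 14]

# Point 3: prefix lengths for an unweighted set of four tokens.
r = order_tokens([5, 6, 7, 8], weights)
w_r = total_weight(r, weights)
print(prefix_length(r, weights, t * w_r))                # 3 (probing)
print(prefix_length(r, weights, 2 * t / (1 + t) * w_r))  # 2 (indexing)
```

With all weights equal to 1, the two prefix lengths coincide with the unweighted formulas |r| - ⌈t·|r|⌉ + 1 and |r| - ⌈2t/(1+t)·|r|⌉ + 1, which is a quick sanity check. Floating-point rounding can matter when a suffix weight lands exactly on the bound, so a production version may want a small tolerance.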

2 Further Readings

The weighted set similarity join is discussed in [1, p. 17].

References

[1] C. Xiao, W. Wang, X. Lin, J. X. Yu, and G. Wang. Efficient similarity joins for near-duplicate detection. TODS, 36(3):15, Aug. 2011.
