PS Ähnlichkeitssuche in großen Datenbanken (PS Similarity Search in Large Databases)
Academic year: 2022
Thomas Hütter, WS 2020

Task 1

Motivation

Task 1

All users are stored in a database with the following schema:

ID   Name    Hobbies
1    Peter   {guitar, biking, swimming}
2    Sarah   {skiing, hiking, singing}
3    John    {singing, hiking}
4    Kate    {guitar, skiing, swimming, running}

For example, we could define “similar hobbies” as follows:

• consider hobbies as sets of words,

• define a threshold t = 0.75,

• use Jaccard similarity.

→ How can we compute the set similarity self join?

Given: a collection of sets, a similarity function, and a similarity threshold.

Task: find a basic algorithm that computes the set similarity self join.

Set Similarity Self Join Algorithm

Exercise

Algorithm: Set Similarity Self Join

Task 0

Compute Jaccard for each pair of sets.

Algorithm 0: setsimjoin(R, tJ):

input: R collection of sets, tJ similarity threshold

output: res set of result pairs (similarity at least tJ)

res = ∅; // stores the result pairs

for i = 0 to |R| - 1:

    for j = i + 1 to |R| - 1:

        if |R[i] ∩ R[j]| / |R[i] ∪ R[j]| ≥ tJ: res = res ∪ {(i, j)};

return res;
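
Algorithm 0 can be sketched in Python as follows. This is only one possible transcription, not a reference solution; the names `jaccard` and `setsimjoin_naive` are chosen for illustration, and the example data is the hobby table from the motivation slide.

```python
def jaccard(r, s):
    """Jaccard similarity |r ∩ s| / |r ∪ s| of two Python sets."""
    union = len(r | s)
    # Convention assumed here: two empty sets are treated as identical.
    return len(r & s) / union if union else 1.0


def setsimjoin_naive(R, t):
    """Compare every pair (i, j) with i < j and keep those with J >= t."""
    res = []  # stores the result pairs as index tuples
    for i in range(len(R)):
        for j in range(i + 1, len(R)):
            if jaccard(R[i], R[j]) >= t:
                res.append((i, j))
    return res


hobbies = [
    {"guitar", "biking", "swimming"},             # Peter
    {"skiing", "hiking", "singing"},              # Sarah
    {"singing", "hiking"},                        # John
    {"guitar", "skiing", "swimming", "running"},  # Kate
]

print(setsimjoin_naive(hobbies, 0.75))  # no pair reaches 0.75
print(setsimjoin_naive(hobbies, 0.6))   # Sarah and John: J = 2/3
```

With t = 0.75 the result is empty, because the most similar pair (Sarah, John) only reaches 2/3.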

Given: a collection of sets R, a similarity function, and a similarity threshold.

Task: using Algorithm 0, how many pairs are considered to compute R ⨝_sim R?

Exercise

Set Similarity Self Join Algorithm
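
One way to approach the count, assuming n = |R| and the loop bounds of Algorithm 0 (the inner loop starts at j = i + 1), is to sum the inner-loop iterations over all values of i:

```latex
\sum_{i=0}^{n-1} (n - 1 - i) \;=\; \frac{n(n-1)}{2} \;=\; \binom{n}{2}
```

For the four users of the motivation example this gives 4 · 3 / 2 = 6 pairs; the number grows quadratically with n.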

Length Filter

Optimization

Motivation: considering a quadratic number of pairs may not be feasible for large datasets.

Idea: many pairs are irrelevant because the two sets are too different.

Length filter: if the sizes of two sets differ too much, the threshold cannot be reached even if every element of the smaller set also occurs in the larger one.
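
The bound behind the length filter can be written out as follows, assuming without loss of generality that |r| ≤ |s| and writing t for the Jaccard threshold. The intersection has at most |r| elements and the union at least |s|:

```latex
J(r, s) \;=\; \frac{|r \cap s|}{|r \cup s|} \;\le\; \frac{|r|}{|s|}
\qquad\Longrightarrow\qquad
J(r, s) \ge t \ \text{ is only possible if } \ |r| \ge t \cdot |s|
```

Equivalently, a set r only needs to be compared with sets s of size at most |r| / t.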

Given: two sets r and s with |r| = 4, |s| = 6 and Jaccard similarity with threshold 0.7.

Task: do we need to consider (r, s), i.e., is there a possibility that J(r, s) ≥ 0.7?

Exercise

Length Filter
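
The exercise can be checked numerically with a small sketch. The helper names `max_jaccard` and `passes_length_filter` are hypothetical and only serve this illustration; they use the fact that the Jaccard similarity of two sets is at most the smaller size divided by the larger size.

```python
def max_jaccard(len_r, len_s):
    """Upper bound on J(r, s) given only the two set sizes."""
    small, large = sorted((len_r, len_s))
    # Convention assumed here: two empty sets are treated as identical.
    return small / large if large else 1.0


def passes_length_filter(len_r, len_s, t):
    """True if a pair with these sizes could still reach threshold t."""
    return max_jaccard(len_r, len_s) >= t


print(max_jaccard(4, 6))                # best case: 4/6
print(passes_length_filter(4, 6, 0.7))  # False
```

Since 4/6 is below 0.7, the pair (r, s) from the exercise can be skipped without looking at any elements.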

Given: a collection of sets R ordered by size, Jaccard as the similarity function, and a similarity threshold of 0.75.

Idea: stop the inner loop of Algorithm 0 as soon as the pair (ri, rj) can no longer reach the threshold. For example, r1 is only compared to r2 and r3.

R:

r1 = {3, 8, 9}

r2 = {2, 4, 7, 9}

r3 = {1, 3, 6, 8}

r4 = {1, 2, 4, 7, 9}

r5 = {2, 4, 6, 7, 9}

Apply Length Filter

Optimization
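
To see how many pairs survive the length filter in this example, the following sketch only looks at the set sizes 3, 4, 4, 5, 5 of r1 to r5. The function name `candidate_pairs` is an assumption made for this illustration, and indices are 0-based, so r1 corresponds to index 0.

```python
def candidate_pairs(sizes, t):
    """Pairs (i, j), i < j, that pass the length filter.

    `sizes` must be sorted in ascending order, so the inner loop can stop
    at the first set that is too large for sizes[i].
    """
    pairs = []
    for i in range(len(sizes)):
        for j in range(i + 1, len(sizes)):
            if sizes[i] < t * sizes[j]:
                break  # all later sets are at least as large
            pairs.append((i, j))
    return pairs


sizes = [3, 4, 4, 5, 5]  # |r1|, ..., |r5|
cands = candidate_pairs(sizes, 0.75)
print(cands)
print(len(cands), "of", len(sizes) * (len(sizes) - 1) // 2, "pairs remain")
```

Here r1 is paired only with r2 and r3, and 8 of the 10 possible pairs remain as candidates.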

Algorithm: Set Similarity Self Join

Task 1

Consider a collection of sets R ordered by size.

Algorithm 1: setsimjoin(R, tJ):

input: R collection of sets ordered by size, tJ similarity threshold

output: res set of result pairs (similarity at least tJ)

res = ∅; // stores the result pairs

for i = 0 to |R| - 1:

    for j = i + 1 to |R| - 1:

        if |R[j]| is too big for |R[i]| to reach tJ: break;

        if |R[i] ∩ R[j]| / |R[i] ∪ R[j]| ≥ tJ: res = res ∪ {(i, j)};

return res;
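
A minimal Python sketch of Algorithm 1 could look as follows. It is one possible reading of the pseudocode, not an official solution: the "too big" test is implemented as |R[i]| < tJ · |R[j]|, and the example collection is r1 to r5 from the length-filter slide as far as it can be read from the source, which should be treated as an assumption.

```python
def jaccard(r, s):
    """Jaccard similarity of two Python sets (empty/empty counts as 1.0)."""
    union = len(r | s)
    return len(r & s) / union if union else 1.0


def setsimjoin(R, t):
    """Set similarity self join with length filter.

    R must be ordered by ascending set size; result pairs are 0-based
    index tuples (i, j) with i < j.
    """
    res = []
    for i in range(len(R)):
        for j in range(i + 1, len(R)):
            if len(R[i]) < t * len(R[j]):
                break  # R[j] and all later sets are too large for R[i]
            if jaccard(R[i], R[j]) >= t:
                res.append((i, j))
    return res


R = [
    {3, 8, 9},        # r1
    {2, 4, 7, 9},     # r2
    {1, 3, 6, 8},     # r3
    {1, 2, 4, 7, 9},  # r4
    {2, 4, 6, 7, 9},  # r5
]
result = setsimjoin(R, 0.75)
print(result)  # r2 matches both r4 and r5 with J = 4/5
```

The `break` is only correct because R is sorted by size; on an unsorted collection it would have to be a `continue`, and most of the benefit would be lost.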

Pre-processing

Set Similarity Self Join Algorithm

Goal: introduce a common input format that can be processed efficiently and leveraged to optimize the algorithm.

Idea: represent each set element by a single integer, then sort the elements of each set.

Original Dataset (Hobbies)              Assign ascending integers    Sort each set (Pre-processed Dataset)
{guitar, swimming}                      {1, 2}                       {1, 2}
{hiking, guitar, singing}               {3, 1, 4}                    {1, 3, 4}
{singing, guitar, skiing}               {4, 1, 5}                    {1, 4, 5}
{guitar, skiing, swimming, hiking}      {1, 5, 2, 3}                 {1, 2, 3, 5}
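
The two pre-processing steps can be sketched as follows. This is only an illustration under stated assumptions: integers are assigned in order of first appearance, the function name `preprocess` is hypothetical, and the input is the hobby table from the motivation slide given as lists so that the token order is well defined. Sorting the collection by size afterwards is an extra step added here because Algorithm 1 expects it.

```python
def preprocess(records):
    """Map each token to an ascending integer ID and sort every set.

    Returns the encoded sets (sorted lists of ints) and the token-to-ID map.
    """
    ids = {}
    encoded = []
    for tokens in records:
        # setdefault hands out the next free ID on first appearance.
        ints = sorted({ids.setdefault(tok, len(ids) + 1) for tok in tokens})
        encoded.append(ints)
    return encoded, ids


records = [
    ["guitar", "biking", "swimming"],
    ["skiing", "hiking", "singing"],
    ["singing", "hiking"],
    ["guitar", "skiing", "swimming", "running"],
]
encoded, ids = preprocess(records)
print(encoded)

# Extra step for Algorithm 1: order the collection by set size.
R = sorted(encoded, key=len)
print(R)
```

Working on sorted integer lists instead of word sets makes comparisons cheap and allows merge-style intersection of two sets in a single pass.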

Implement Algorithm 1: setsimjoin(R, tJ).

• Include the length filter (input collection is ordered by size). (1 point)

• Find a smart way to verify J(r, s) ≥ tJ. (0.5 points)
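
One possible direction for the verification step, given here as a hedged sketch rather than the expected solution: on pre-processed input (sorted lists of distinct integers), the overlap can be counted with a merge-style scan, and the scan can stop early once the required overlap is out of reach. The sketch relies on the equivalence J(r, s) ≥ tJ ⟺ |r ∩ s| ≥ tJ / (1 + tJ) · (|r| + |s|); the function name `verify` and the small slack constant are assumptions of this illustration.

```python
def verify(r, s, t):
    """Check J(r, s) >= t for two sorted lists of distinct integers."""
    # Minimum overlap needed; the slack keeps rounding errors from
    # causing a wrong early exit.
    required = t / (1 + t) * (len(r) + len(s)) - 1e-9
    i = j = overlap = 0
    while i < len(r) and j < len(s):
        # Even if all remaining elements matched, would it be enough?
        if overlap + min(len(r) - i, len(s) - j) < required:
            return False
        if r[i] == s[j]:
            overlap += 1
            i += 1
            j += 1
        elif r[i] < s[j]:
            i += 1
        else:
            j += 1
    union = len(r) + len(s) - overlap
    # Convention assumed here: two empty sets are treated as identical.
    return union == 0 or overlap / union >= t


print(verify([2, 4, 7, 9], [1, 2, 4, 7, 9], 0.75))     # J = 4/5
print(verify([3, 8, 9], [1, 3, 6, 8], 0.75))           # J = 2/5
print(verify([1, 2, 4, 7, 9], [2, 4, 6, 7, 9], 0.75))  # J = 4/6
```

Compared with building the intersection and union explicitly, this avoids allocating new sets and often rejects dissimilar pairs after a few comparisons.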

Verify your implementation with given datasets on the teaching website.

Further datasets can be found at http://ssjoin.dbresearch.uni-salzburg.at/datasets.html.

Follow the submission guidelines written on the teaching website.

Submit via abgaben.cosy.sbg.ac.at.

Deadline: 03.11.2020, 23:55

Homework

Task 1
