Similarity Search in Large Databases


Academic year: 2022


Introduction to Similarity Search

Nikolaus Augsten

nikolaus.augsten@sbg.ac.at

Department of Computer Sciences, University of Salzburg

http://dbresearch.uni-salzburg.at

WS 2021/22

Version October 26, 2021

Augsten (Univ. Salzburg) Similarity Search in Large Databases WS 2021/22 1 / 18

Similarity Search

Outline

1. Similarity Search
    - Intuition
    - Applications
    - Framework


Similarity Search Intuition

What is Similarity Search?

Similarity search deals with the question:

How similar are two objects?

“Objects” may be
- strings (Augsten ↔ Augusten)
- tuples in a relational database:
    (Augsten | Dominikanerplatz 3 | 204 | 70188)
    (N. Augsten | Dominikanerpl. 3 | @ | 70188)
- documents (e.g., HTML or XML)
- ...

“Similar” is application-dependent.

Similarity Search Applications

Application I: Object Identification

Problem:

Two data items represent the same real-world object (e.g., the same person), but they are represented differently in the database(s).

How can this happen?
- different coding conventions (e.g., Gilmstrasse, Hermann-von-Gilm-Str.)
- spelling mistakes (e.g., Untervigil, Untervigli)
- outdated values (e.g., Siegesplatz used to be Friedensplatz)
- incomplete/incorrect values (e.g., missing or wrong apartment number in residential address)

Focus in this course!


Application I: Flavors of Object Identification

Duplicate Detection (one table)
- find all tuples in the table that represent the same thing in the real world
- Example: Two companies merge and must build a single customer database.

Similarity Join (two tables)
- join all tuples with similar values in the join attributes
- Example: In order to detect tax fraud, data from different databases need to be linked.

Similarity Lookup (one table, one tuple)
- find the tuple in the table that best matches the given tuple
- Example: Do we already have customer X in the database?


Application II: Computational Biology

DNA and protein sequences
- modelled as text over an alphabet (e.g., {A, C, G, T} in DNA)

Application: Search for a pattern in the text
- look for a given feature in DNA
- compare two DNAs
- decode DNA

Problem: Exact matches fail
- experimental measures have errors
- small changes that are not relevant
- mutations

Solution: Similarity search
- Search for similar patterns
- How similar are the patterns that you found?


Application III: Error Correction in Signal Processing

Application: Transmit text signal over physical channel

Problem: Transmission may introduce errors

Goal: Restore original (sent) message

Solution: Find correct text that is closest to received message.
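The idea of restoring the closest correct text can be illustrated with a minimal sketch. It assumes a small, hypothetical set of valid messages (`CODEBOOK`) and uses the Hamming distance as the measure of closeness; neither choice comes from the slides.

```python
def hamming(x: str, y: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(cx != cy for cx, cy in zip(x, y))


def correct(received: str, codebook: list[str]) -> str:
    """Return the valid message that is closest to the received one."""
    return min(codebook, key=lambda word: hamming(received, word))


# Hypothetical set of valid messages (an assumption for illustration only).
CODEBOOK = ["00000", "11100", "00111", "11011"]

restored = correct("10100", CODEBOOK)
print(restored)  # 11100 (differs from the received message in one position)
```

A real channel code would choose the valid messages so that they are far apart from each other, which makes the closest message unambiguous for a small number of errors.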

Similarity Search Framework

Framework for Similarity Search

1. Preprocessing (e.g., lowercase: Augsten → augsten)
2. Search Space Reduction
    - Blocking
    - Sorted-Neighborhood
    - Filtering (Pruning)
3. Compute Distances
4. Find Matches


Search Space Reduction: Brute Force

We consider the example of similarity join.

Similarity Join: Find all pairs of similar tuples in tables A and B.
- Search space: A × B (all possible pairs of tuples)
- Complexity: compute |A||B| distances → expensive!
  (|A| = 30k, |B| = 40k, 1 ms per distance ⇒ join runs 2 weeks)

Example: 16 distance computations!

A           B
Tim   m     Bil   m
Bill  m     Jane  f
Jane  f     Tim   m
Mary  f     Marie f

Goal: Reduce search space!
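The numbers on this slide can be checked with a short sketch. It enumerates the full search space A × B for the toy tables and redoes the runtime estimate; the variable names are chosen here for illustration.

```python
from itertools import product

# Toy tables from the slide: (name, gender).
A = [("Tim", "m"), ("Bill", "m"), ("Jane", "f"), ("Mary", "f")]
B = [("Bil", "m"), ("Jane", "f"), ("Tim", "m"), ("Marie", "f")]

# Brute force: every tuple of A is paired with every tuple of B.
brute_force_pairs = list(product(A, B))
print(len(brute_force_pairs))  # 16

# Runtime estimate for |A| = 30k, |B| = 40k at 1 ms per distance.
num_distances = 30_000 * 40_000
runtime_days = num_distances * 0.001 / (60 * 60 * 24)
print(num_distances, round(runtime_days, 1))  # 1200000000 13.9
```

About 13.9 days is the "2 weeks" quoted on the slide.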


Search Space Reduction: Blocking

Blocking

- Partition A and B into blocks (e.g., group by chosen attribute).
- Compare only tuples within blocks.

Example: Block by gender (m/f):

Block   A                  B
m       Tim m, Bill m      Bil m, Tim m
f       Mary f, Jane f     Jane f, Marie f

Improvement: 8 distance computations (instead of 16)!
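A minimal sketch of blocking on the toy tables, assuming the blocking key is the gender attribute as in the example. The helper names `block_by` and `blocking_pairs` are hypothetical.

```python
from collections import defaultdict
from itertools import product

# Toy tables from the slide: (name, gender).
A = [("Tim", "m"), ("Bill", "m"), ("Jane", "f"), ("Mary", "f")]
B = [("Bil", "m"), ("Jane", "f"), ("Tim", "m"), ("Marie", "f")]


def block_by(table, key):
    """Group the tuples of a table by the value of the blocking key."""
    blocks = defaultdict(list)
    for row in table:
        blocks[key(row)].append(row)
    return blocks


def blocking_pairs(a, b, key):
    """Candidate pairs: only tuples that fall into the same block."""
    blocks_a = block_by(a, key)
    blocks_b = block_by(b, key)
    pairs = []
    for k in blocks_a.keys() & blocks_b.keys():
        pairs.extend(product(blocks_a[k], blocks_b[k]))
    return pairs


candidates = blocking_pairs(A, B, key=lambda row: row[1])
print(len(candidates))  # 8
```

Each block contributes 2 × 2 pairs. The price of the reduction is that pairs with differing key values (for example, a wrongly recorded gender) are never compared.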


Search Space Reduction: Sorted Neighborhood

Sorted Neighborhood

- Sort A and B (e.g., by one of the attributes).
- Move a window of fixed size over A and B:
    - move the A-window if the sort attribute of the next tuple in A is smaller than in B
    - otherwise move the B-window
- Compare only tuples within the windows.

Example: Sort by name, use window of size 2:

A          B
Bill m     Bil m
Jane f     Jane f
Mary f     Marie f
Tim m      Tim m

Improvement: 12 distance computations (instead of 16)!
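The window movement can be sketched as follows. Two details are assumptions that the slide does not state: both windows start at the top of their sorted table, and once one table has no further tuple, only the other window keeps moving.

```python
# Toy tables from the slide: (name, gender).
A = [("Tim", "m"), ("Bill", "m"), ("Jane", "f"), ("Mary", "f")]
B = [("Bil", "m"), ("Jane", "f"), ("Tim", "m"), ("Marie", "f")]


def sorted_neighborhood_pairs(a, b, key, window=2):
    """Collect all pairs that ever share the two sliding windows."""
    a = sorted(a, key=key)
    b = sorted(b, key=key)
    i = j = 0  # start positions of the A-window and the B-window
    pairs = set()
    while True:
        win_a = a[i:i + window]
        win_b = b[j:j + window]
        pairs.update((x, y) for x in win_a for y in win_b)
        next_a, next_b = i + window, j + window  # tuples that would enter next
        a_left, b_left = next_a < len(a), next_b < len(b)
        if not a_left and not b_left:
            break
        # Move the A-window if its next tuple is smaller, otherwise the B-window.
        # Assumption: if B is exhausted, the A-window moves (and vice versa).
        if a_left and (not b_left or key(a[next_a]) < key(b[next_b])):
            i += 1
        else:
            j += 1
    return pairs


candidates = sorted_neighborhood_pairs(A, B, key=lambda row: row[0])
print(len(candidates))  # 12
```

Under these assumptions the sketch reproduces the 12 comparisons of the example: 4 pairs for the initial windows, plus 2 new pairs for each of the 4 window moves.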


Search Space Reduction: Filtering

Filtering (Pruning)

- Remove (filter) tuples that cannot match, then compute the distances.
- Idea: the filter is faster than the distance function.

Example: Do not match names that have no character in common (ignoring case, which is what the count of 11 implies):

Tuple in A    Remaining tuples in B        Filtered out
Tim m         Bil m, Tim m, Marie f        Jane f
Bill m        Bil m, Tim m, Marie f        Jane f
Mary f        Tim m, Jane f, Marie f       Bil m
Jane f        Jane f, Marie f              Bil m, Tim m

Improvement: 11 distance computations (instead of 16)!
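A minimal sketch of this filter on the toy tables. It assumes that names are compared case-insensitively (for example, after the lowercase preprocessing step), which is what makes the count of 11 come out; the helper name `share_character` is hypothetical.

```python
from itertools import product

# Toy tables from the slide: (name, gender).
A = [("Tim", "m"), ("Bill", "m"), ("Jane", "f"), ("Mary", "f")]
B = [("Bil", "m"), ("Jane", "f"), ("Tim", "m"), ("Marie", "f")]


def share_character(x: str, y: str) -> bool:
    """Cheap filter: do the two names have at least one character in common?"""
    return bool(set(x.lower()) & set(y.lower()))


# Only pairs that pass the filter go on to the (expensive) distance function.
candidates = [(a, b) for a, b in product(A, B) if share_character(a[0], b[0])]
print(len(candidates))  # 11
```

With a case-sensitive comparison, Mary and Tim would share no character ("M" versus "m") and only 10 pairs would remain.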


Distance Computation

Definition (Distance Function)

Given two sets of objects, $A$ and $B$, a distance function for $A$ and $B$ maps each pair $(a, b) \in A \times B$ to a non-negative real number.

$\delta : A \times B \to \mathbb{R}_0^+$

We will define distance functions for
- sets
- strings
- ordered, labeled trees
- unordered, labeled trees
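As an illustration of a distance function on strings, here is a minimal sketch of the unit-cost edit distance (Levenshtein distance). This particular choice is an assumption made for the example; the distance functions used in the course are defined in later units and may differ.

```python
def edit_distance(s: str, t: str) -> int:
    """Minimum number of character insertions, deletions and renames
    that transform s into t (unit cost per operation)."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]  # transforming s[:i] into the empty string takes i deletions
        for j, ct in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cs
                curr[j - 1] + 1,           # insert ct
                prev[j - 1] + (cs != ct),  # rename cs to ct, or keep it
            ))
        prev = curr
    return prev[-1]


print(edit_distance("Augsten", "Augusten"))       # 1
print(edit_distance("Untervigil", "Untervigli"))  # 2
```

The result is always a non-negative integer, so the function fits the definition above with $A$ and $B$ both being sets of strings.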


Distance Matrix

Definition (Distance Matrix)

Given a distance function $\delta$ for two sets of objects, $A = \{a_1, \dots, a_n\}$ and $B = \{b_1, \dots, b_m\}$, the distance matrix $D$ is an $n \times m$ matrix with

$d_{ij} = \delta(a_i, b_j)$,

where $d_{ij}$ is the element in the $i$-th row and the $j$-th column of $D$.

Example distance matrix, $A = \{a_1, a_2, a_3\}$, $B = \{b_1, b_2, b_3\}$:

      b1  b2  b3
a1     6   5   4
a2     2   2   1
a3     1   3   0
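Building a distance matrix from a distance function can be sketched in a few lines. The distance used here, `char_set_distance`, is a hypothetical placeholder (size of the symmetric difference of the character sets) and is not taken from the slides; any function that returns non-negative numbers could be passed instead.

```python
def char_set_distance(x: str, y: str) -> int:
    """Placeholder distance: number of characters that occur in exactly
    one of the two strings (case-insensitive)."""
    return len(set(x.lower()) ^ set(y.lower()))


def distance_matrix(a, b, delta):
    """Return D as a list of rows, with D[i][j] = delta(a[i], b[j])."""
    return [[delta(a_i, b_j) for b_j in b] for a_i in a]


A = ["Tim", "Bill"]
B = ["Bil", "Tim"]
D = distance_matrix(A, B, char_set_distance)
print(D)  # [[4, 0], [0, 4]]
```

Note that Python indexes rows and columns from 0, whereas the definition counts from 1.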


Finding Matches: Threshold

      b1  b2  b3
a1     6   5   4
a2     2   2   1
a3     1   3   0

Once we know the distances, which objects match?

Threshold Approach:
- fix threshold $\tau$
- algorithm:
    foreach $d_{ij} \in D$ do
        if $d_{ij} < \tau$ then match $(a_i, b_j)$
- produces n:m-matches

Example with $\tau = 3$: $\{(a_2, b_1), (a_2, b_2), (a_2, b_3), (a_3, b_1), (a_3, b_3)\}$
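The threshold approach can be sketched directly on the example matrix. Pairs are reported as 1-based index pairs (i, j) to match the slide's notation; the function name is chosen here for illustration.

```python
# Example distance matrix from the slide (rows a1..a3, columns b1..b3).
D = [
    [6, 5, 4],
    [2, 2, 1],
    [1, 3, 0],
]


def threshold_matches(d, tau):
    """All pairs (i, j) whose distance is strictly below the threshold."""
    return [
        (i, j)
        for i, row in enumerate(d, start=1)
        for j, d_ij in enumerate(row, start=1)
        if d_ij < tau
    ]


matches = threshold_matches(D, tau=3)
print(matches)  # [(2, 1), (2, 2), (2, 3), (3, 1), (3, 3)]
```

The pair (a3, b2) has distance exactly 3 and is therefore not a match under the strict comparison. An object can appear in several matches (a2 matches all three b's), which is what n:m means here.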


Finding Matches: Global Greedy

Global Greedy Approach:

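- algorithm:
    $M \leftarrow \emptyset$
    $A \leftarrow \{a_1, a_2, \dots, a_n\}$; $B \leftarrow \{b_1, b_2, \dots, b_m\}$
    create sorted list $L$ with all $d_{ij} \in D$
    while $A \neq \emptyset$ and $B \neq \emptyset$ do
        $d_{ij} \leftarrow$ dequeue smallest element from $L$
        if $a_i \in A$ and $b_j \in B$ then
            $M \leftarrow M \cup \{(a_i, b_j)\}$
            remove $a_i$ from $A$ and $b_j$ from $B$
    return $M$
- produces 1:1-matches
- must deal with tie distances when sorting $L$ (e.g., sort randomly, sort by $i$ and $j$)

Example (sort ties by $i$, $j$): $\{(a_3, b_3), (a_2, b_1), (a_1, b_2)\}$

      b1  b2  b3
a1     6   5   4
a2     2   2   1
a3     1   3   0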
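A minimal sketch of the global greedy approach on the example matrix. Ties are broken by i and then j, as in the example, because the entries are sorted as tuples (distance, i, j). The function name and the use of index sets for the unmatched objects are choices made here for illustration.

```python
# Example distance matrix from the slide (rows a1..a3, columns b1..b3).
D = [
    [6, 5, 4],
    [2, 2, 1],
    [1, 3, 0],
]


def global_greedy(d):
    """Repeatedly match the closest pair whose objects are both unmatched."""
    # Sorted list L of all distances; ties are ordered by i, then j.
    entries = sorted(
        (d_ij, i, j)
        for i, row in enumerate(d, start=1)
        for j, d_ij in enumerate(row, start=1)
    )
    free_a = set(range(1, len(d) + 1))
    free_b = set(range(1, len(d[0]) + 1)) if d else set()
    matches = []
    for _, i, j in entries:
        if not free_a or not free_b:
            break  # one side is fully matched
        if i in free_a and j in free_b:
            matches.append((i, j))
            free_a.remove(i)
            free_b.remove(j)
    return matches


print(global_greedy(D))  # [(3, 3), (2, 1), (1, 2)]
```

After (a3, b3) is matched, the two pairs with distance 1 are skipped because a3 or b3 is already taken, so the next match is (a2, b1) with distance 2.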


Overview: Finding Matches

      b1  b2  b3
a1     6   5   4
a2     2   2   1
a3     1   3   0

Threshold Approach:
- all objects with distance below $\tau$ match
- produces n:m-matches
- threshold approach for our example with $\tau = 3$:
  $\{(a_2, b_1), (a_2, b_2), (a_2, b_3), (a_3, b_1), (a_3, b_3)\}$

Global Greedy Approach:
- pair with smallest distance is chosen first
- produces 1:1-matches
- global greedy approach for our example:
  $\{(a_3, b_3), (a_2, b_1), (a_1, b_2)\}$


Conclusion

Framework for similarity queries:

1. preprocessing
2. search space reduction
    - blocking
    - sorted-neighborhood
    - filtering (pruning)
3. compute distances: when are two objects similar?
4. find matches: threshold, global greedy
