The Challenge of Heterogeneity
Overview
Heterogeneity in Data
Distributed Data
Web 2.0
Heterogeneity of Users
Structuring music collections
Structuring tag collections
Heterogeneity in Data
Databases
Fixed set of attributes
Declared data types
Multi-relational
Very large number of records
Preparation for mining
Extract, Transform, Load
Select attributes
Declare label for learning
Handle missing values
Compose new attributes
Schema mapping for re-use of DM
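The preparation steps above can be sketched in plain Python. This is only a minimal illustration: the records, the attribute names, the label, and the choice of mean imputation are assumptions made for the example and do not come from the slides.

```python
from statistics import mean

# Hypothetical raw records, as they might look after extraction from a database.
records = [
    {"id": 1, "age": 34, "income": 52000.0, "visits": 4, "churned": "no"},
    {"id": 2, "age": None, "income": 61000.0, "visits": 2, "churned": "yes"},
    {"id": 3, "age": 51, "income": None, "visits": 8, "churned": "no"},
]

FEATURES = ["age", "income", "visits"]  # select attributes (the "id" is dropped)
LABEL = "churned"                       # declare the label for learning


def prepare(rows):
    # Handle missing values: replace None by the mean of the observed values
    # (one possible strategy among many, assumed here for illustration).
    fill = {
        name: mean(r[name] for r in rows if r[name] is not None)
        for name in FEATURES
    }
    examples = []
    for r in rows:
        x = {
            name: (r[name] if r[name] is not None else fill[name])
            for name in FEATURES
        }
        # Compose a new attribute from existing ones.
        x["income_per_visit"] = x["income"] / x["visits"]
        examples.append((x, r[LABEL]))
    return examples


examples = prepare(records)
```

Each element of `examples` is a pair of an attribute dictionary and its label, which is the shape most learners expect after preparation.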
Heterogeneity in Data
Time series data
Measurements over time
Business
Medicine
Production
Handwriting
Pictures
Music
Prediction
Classification
Clustering
Signal to Symbol
Heterogeneity in Data
Texts
High dimensional vectors
Sparse word vectors
Texts of the same class need not share a word!
Syntactic, semantic structures
Classification
Clustering
Named Entity Recognition,
Information Extraction
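The remark that texts of the same class need not share a word can be made concrete with sparse bag-of-words vectors. The sketch below is an illustration under assumptions: the two example texts and the whitespace tokenization are invented for this purpose.

```python
import math
from collections import Counter


def word_vector(text):
    # Sparse bag-of-words vector: only words that actually occur are stored.
    return Counter(text.lower().split())


def cosine(u, v):
    # Cosine similarity of two sparse vectors given as word -> count mappings.
    dot = sum(count * v[word] for word, count in u.items() if word in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Two hypothetical texts that a reader would both file under "sports".
doc_a = word_vector("the striker scored twice")
doc_b = word_vector("a late goal decided this match")

similarity = cosine(doc_a, doc_b)  # the texts have no word in common
```

Because the two vectors have no common dimension, their similarity is zero even though both texts belong to the same class, which is why purely word-based similarity can mislead.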
Distributed Data
Distributed databases of the same schema
Distributed databases of different schemas
Low-level, low capacity sensors
Peer-to-peer networks
Heterogeneity of Users
The same label name does not necessarily mean the same concept.
Different names may refer to the same set of items.
Users apply diverse aspects, e.g., genre, time of day, episodes (summer 99), ...
Users share some set of items (possibly under different names).
[Figure: tag hierarchies of several users. Tags shown: hip hop, pop, metal, alternative, death metal, true metal, piano, classic, guitar, jazz, favourites, blues, modern, home, work, office, plane]
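One way to see why tag names alone are unreliable is to compare tags by their extensions, i.e., by the sets of items they are attached to. The following sketch assumes hypothetical users, tags, and item IDs, and uses the Jaccard coefficient as one possible overlap measure.

```python
def jaccard(a, b):
    # Overlap of two item sets: 1.0 means identical extensions, 0.0 means disjoint.
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical taggings: tag name -> set of item IDs.
user_1 = {"classic": {"t1", "t2", "t3"}, "work": {"t7", "t8"}}
user_2 = {"classic": {"t4", "t5"}, "office": {"t7", "t8"}}

# Same label name, but not the same concept.
same_name = jaccard(user_1["classic"], user_2["classic"])

# Different names, but the same set of items.
different_names = jaccard(user_1["work"], user_2["office"])
```

Here the two "classic" tags do not overlap at all, while "work" and "office" describe exactly the same items.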
Web 2.0
Organizing large data collections requires semantic annotations.
Users annotate items with arbitrary tags.
No common ontology is required (“folksonomies”).
Users want to keep their own tags, but would like to benefit from the efforts of others.
Structuring Music Collections
A concept’s meaning is its extension, e.g., some music.
A concept’s meaning can be expressed by a classifier.
A concept hierarchy for each aspect --> hierarchical classification.
Acquiring the hierarchy by clustering under the assumption that user-given taggings are kept.
[Figure: concept hierarchies for different aspects, with the tags pop, rock, metal, blues, bad, good, aggressive and the items a, b, d, e, f]
Localized Alternative Cluster Ensembles (ECML 2006)
Acquiring hierarchical clusterings from
Own partial clusterings
Clusterings of other peers
Preserve taggings of users
Produce several alternatives
Exploit input clusterings
Consider locality instead of global consensus
LACE Algorithm
[Figure: a set of items a to g next to a tag hierarchy with the nodes alternative, metal, true metal, death metal, hip hop, pop]
Items are represented by IDs.
The best matching cluster node is selected by F-measure.
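The selection step can be sketched as follows. The slides do not spell out the exact definition, so this example assumes the balanced F-measure (harmonic mean of precision and recall) between the query set and the items below each node. The function names, the flat `nodes` mapping, and the item sets are hypothetical.

```python
def f_measure(query, cluster):
    # Balanced F-measure between a query item set and a cluster's item set.
    overlap = len(query & cluster)
    if overlap == 0:
        return 0.0
    precision = overlap / len(cluster)  # share of the cluster that is queried
    recall = overlap / len(query)       # share of the query that is covered
    return 2 * precision * recall / (precision + recall)


def best_matching_node(query, nodes):
    # nodes: mapping from node name to the set of item IDs below that node.
    name = max(nodes, key=lambda n: f_measure(query, nodes[n]))
    return name, f_measure(query, nodes[name])


# Hypothetical query and cluster nodes, with items given by their IDs.
query = {"a", "b", "c", "d"}
nodes = {
    "metal": {"a", "c", "x"},
    "pop": {"d", "y", "z"},
    "hip hop": {"q"},
}
best, score = best_matching_node(query, nodes)
```

In this toy setting "metal" wins because it covers two query items with only one foreign item, whereas "pop" covers a single query item.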
Items that are sufficiently similar to items in the best matching clustering are deleted from the query set.
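A minimal sketch of this deletion step is given below. The slides do not say how similarity is measured, so the example assumes that every item has a numeric feature vector and that "sufficiently similar" means lying within a fixed Euclidean distance. The feature values and the threshold `max_distance` are hypothetical.

```python
import math


def covered(item, cluster_items, features, max_distance):
    # An item counts as covered if some cluster item lies within max_distance.
    return any(
        math.dist(features[item], features[other]) <= max_distance
        for other in cluster_items
    )


def remove_covered(query, cluster_items, features, max_distance=1.0):
    # Keep only the query items that are not close to any cluster item.
    return {
        q for q in query
        if not covered(q, cluster_items, features, max_distance)
    }


# Hypothetical two-dimensional feature vectors for items identified by ID.
features = {
    "a": (0.0, 0.0), "b": (0.5, 0.0), "c": (0.0, 0.4),
    "e": (5.0, 5.0), "g": (6.0, 5.0),
}
remaining = remove_covered({"a", "b", "c", "e", "g"}, {"a", "c"}, features)
```

Note that item "b" is removed although it is not itself part of the matched cluster, because it lies close to items that are.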
A new query is posed containing the remaining items. Only tags not used yet are considered.
The process continues until all items are covered, no additional match is possible, or a maximal number of rounds is reached.
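The three termination conditions can be sketched as a loop. This is a simplified illustration, not the published algorithm: it assumes a balanced F-measure for matching, a flat mapping of tag names to item sets, and it treats an item as covered only if it is a member of the matched node. The names `cover_query` and `max_rounds` are hypothetical.

```python
def f_measure(query, cluster):
    # Balanced F-measure between a query item set and a cluster's item set.
    overlap = len(query & cluster)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(cluster), overlap / len(query)
    return 2 * precision * recall / (precision + recall)


def cover_query(query, nodes, max_rounds=10):
    remaining, used, matches = set(query), set(), []
    for _ in range(max_rounds):                  # maximal number of rounds
        if not remaining:                        # all items are covered
            break
        # Only tags that have not been used yet are considered.
        scored = {
            name: f_measure(remaining, items)
            for name, items in nodes.items() if name not in used
        }
        best = max(scored, key=scored.get, default=None)
        if best is None or scored[best] == 0.0:  # no additional match
            break
        used.add(best)
        matches.append(best)
        # Simplification: covered means being a member of the matched node.
        remaining -= nodes[best]
    return matches, remaining


# Hypothetical query items and tagged clusters of another peer.
query = {"a", "b", "c", "d", "e", "f", "g"}
nodes = {"metal": {"a", "b", "c"}, "pop": {"d", "f"}, "jazz": {"x"}}
matches, remaining = cover_query(query, nodes)
```

In this example the loop stops after two matches because the only unused tag shares no item with the remaining query.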
Remaining items are added by classification (kNN).
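The kNN step can be illustrated with a small majority-vote classifier. The sketch assumes numeric feature vectors, Euclidean distance, and k = 3. None of these settings is stated on the slides, and all feature values are invented.

```python
import math
from collections import Counter


def knn_assign(item, assigned, features, k=3):
    # assigned: item ID -> cluster name for items that are already covered.
    neighbours = sorted(
        assigned, key=lambda other: math.dist(features[item], features[other])
    )[:k]
    # Majority vote among the k nearest covered items.
    votes = Counter(assigned[n] for n in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical two-dimensional feature vectors for items identified by ID.
features = {
    "a": (0.0, 0.0), "b": (0.2, 0.1), "c": (0.1, 0.3),
    "d": (4.0, 4.0), "f": (4.2, 3.9),
    "e": (3.8, 4.1), "g": (0.3, 0.2),
}
assigned = {"a": "metal", "b": "metal", "c": "metal", "d": "pop", "f": "pop"}

# Items "e" and "g" were left uncovered and are placed by their neighbours.
placed = {item: knn_assign(item, assigned, features) for item in ("e", "g")}
```

Every remaining item thus ends up in some cluster, so the resulting clustering covers the complete query set.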
The process starts anew until no more matches are possible or the maximal number of results is reached.
[Figure: tag hierarchies of different users connected through a P2P network]