The Challenge of Heterogeneity

Academic year: 2022

(1)

The Challenge of Heterogeneity

(2)

Overview

 Heterogeneity in Data

 Distributed Data

 Web 2.0

 Heterogeneity of Users

 Structuring music collections

 Structuring tag collections

(3)

Heterogeneity in Data

 Databases

 Fixed set of attributes

 Declared data types

 Multi-relational

 Very large number of records

 Preparation for mining

 Extract, Transform, Load

 Select attributes

 Declare label for learning

 Handle missing values

 Compose new attributes

 Schema-mapping for re-use of DM
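The preparation steps listed above (select attributes, declare a label, handle missing values, compose new attributes) might look roughly like the following sketch. The function `prepare`, the record fields, and the use of mean imputation are assumptions made for illustration; the slides do not prescribe a particular method.

```python
def prepare(records, attributes, label, derived):
    """Hypothetical preparation step for a list of dict records.

    attributes: names of the selected numeric attributes
    label:      name of the attribute declared as the learning label
    derived:    mapping from new attribute name to a function of the row
    """
    # Handle missing values: mean imputation per attribute (an assumption).
    means = {}
    for a in attributes:
        known = [r[a] for r in records if r.get(a) is not None]
        means[a] = sum(known) / len(known) if known else 0.0

    rows, labels = [], []
    for r in records:
        # Select attributes, filling gaps with the attribute mean.
        row = {a: (r[a] if r.get(a) is not None else means[a]) for a in attributes}
        # Compose new attributes from the selected ones.
        for name, fn in derived.items():
            row[name] = fn(row)
        rows.append(row)
        labels.append(r[label])
    return rows, labels


# Hypothetical example data.
records = [
    {"id": 1, "height": 2.0, "weight": 80.0, "risk": "low"},
    {"id": 2, "height": 1.0, "weight": None, "risk": "high"},
    {"id": 3, "height": 2.0, "weight": 100.0, "risk": "low"},
]
rows, labels = prepare(
    records,
    attributes=["height", "weight"],
    label="risk",
    derived={"bmi": lambda row: row["weight"] / row["height"] ** 2},
)
```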

(4)

Heterogeneity in Data

 Time series data

 Measurements over time

 Business

 Medicine

 Production

 Hand writing

 Pictures

 Music

 Prediction

 Classification

 Clustering

 Signal to Symbol

(5)

Heterogeneity in Data

 Texts

 High dimensional vectors

 Sparse word vectors

 Texts of the same class need not share a word!

 Syntactic, semantic structures

 Classification

 Clustering

 Named Entity Recognition, Information Extraction
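The remark that texts of the same class need not share a word can be made concrete with sparse word vectors. The sketch below is illustrative only: the two example texts are invented, and the dictionary-based bag-of-words representation is one possible encoding, not necessarily the one used in the lecture.

```python
import math


def bag_of_words(text):
    """Sparse word vector: only words that occur are stored, with their counts."""
    counts = {}
    for word in text.lower().split():
        counts[word] = counts.get(word, 0) + 1
    return counts


def cosine(a, b):
    """Cosine similarity of two sparse vectors given as dicts."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Two hypothetical texts that a reader would both file under "football".
doc1 = bag_of_words("striker scores late goal")
doc2 = bag_of_words("keeper saves penalty kick")
similarity = cosine(doc1, doc2)  # no shared word, so the similarity is zero
```

Both texts belong to the same class for a human reader, yet their word vectors are orthogonal, which is why purely word-based similarity can fail.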

(6)

Distributed Data

 Distributed databases of the same schema

 Distributed databases of different schemas

 Low-level, low capacity sensors

 Peer-to-peer networks

(7)

Heterogeneity of Users

 The same label name does not necessarily mean the same concept.

 Different names may refer to the same set of items.

 Users apply diverse aspects, e.g., genre, time of day, episodes (summer 99), ...

 Users share some set of items (possibly under different names).

[Figure: example tag hierarchies of different users, with tags such as hip hop, pop, metal, alternative, death metal, true metal; piano, classic, guitar; classic, jazz; classic, pop; jazz, favourites; blues, modern; home, work, office, plane.]

(8)

Web 2.0

 Organizing large data collections requires semantic annotations.

 Users annotate items with arbitrary tags.

 No common ontology is required (“folksonomies”).

 Users want to keep their tags, but like to benefit from efforts of others.

(9)

Structuring Music Collections

 A concept's meaning is its extension, e.g., some music.

 A concept's meaning can be expressed by a classifier.

 A concept hierarchy for each aspect --> hierarchical classification.

 Acquiring the hierarchy by clustering under the assumption that user-given taggings are kept.

[Figure: example concept hierarchies with the tags pop, rock, metal, blues, bad, good, aggressive over items a, b, d, e, f.]

(10)

Localized Alternative Cluster Ensembles (ECML 2006)

 Acquiring hierarchical clusterings from

 Own partial clusterings

 Clusterings of other peers

 Preserve taggings of users

 Produce several alternative clusterings

 Exploit input clusterings

 Consider locality instead of global consensus

[Figure: the example user tag hierarchies from the "Heterogeneity of Users" slide.]

(11)

LACE Algorithm

[Figure: LACE example with tag hierarchies (alternative, metal, true metal, death metal, hip hop, pop) and item sets over a-g.]

Items are represented by IDs.

(12)

LACE Algorithm

[Figure: LACE example with tag hierarchies (alternative, metal, true metal, death metal, hip hop, pop) and item sets over a-g.]

The best matching cluster node is selected by F-measure.
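Selecting a node by F-measure can be sketched as follows. This is a minimal illustration, assuming that a query and a cluster node are both plain sets of item IDs and that the balanced F1 variant is meant; the node names and item sets are made up for the example.

```python
def f_measure(query, extension):
    """Balanced F-measure between a query item set and a node's item set."""
    query, extension = set(query), set(extension)
    overlap = len(query & extension)
    if overlap == 0:
        return 0.0
    precision = overlap / len(extension)  # share of the node that is in the query
    recall = overlap / len(query)         # share of the query covered by the node
    return 2 * precision * recall / (precision + recall)


def best_matching_node(query, nodes):
    """Return the tag of the node whose extension fits the query best."""
    return max(nodes, key=lambda tag: f_measure(query, nodes[tag]))


# Hypothetical query and cluster nodes (tag -> item IDs).
query = {"a", "b", "c", "d"}
nodes = {"metal": {"a", "c"}, "pop": {"d", "f"}, "hip hop": {"e", "g"}}
best = best_matching_node(query, nodes)
```

Because F1 is symmetric in precision and recall, it does not matter which of the two sets is treated as the reference.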

(13)

LACE Algorithm

[Figure: LACE example; the best matching clustering is shown next to the query, and the covered items are removed from the query set.]

Items that are sufficiently similar to items in the best matching clustering are deleted from the query set.

(14)

LACE Algorithm

[Figure: LACE example; the query set now contains only the remaining items.]

A new query is posed containing the remaining items. Only tags not used yet are considered.

(15)

LACE Algorithm

[Figure: LACE example; a further cluster node tagged hip hop is matched for the remaining items.]

The process continues until all items are covered, no additional match is possible, or a maximal number of rounds is reached.
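The rounds described on the preceding slides can be sketched as a simple loop. This is a simplified illustration, not the published LACE algorithm: "sufficiently similar" is replaced by exact membership in the matched node, tag hierarchies are flattened to a dictionary from tag to item IDs, and the names `cover_query`, `max_rounds` and `min_f` are hypothetical.

```python
def f_measure(query, extension):
    """Balanced F-measure, written as 2|A & B| / (|A| + |B|)."""
    overlap = len(set(query) & set(extension))
    total = len(set(query)) + len(set(extension))
    return 2 * overlap / total if total else 0.0


def cover_query(query, nodes, max_rounds=10, min_f=0.0):
    """Greedily cover the query items with the best matching unused nodes."""
    remaining = set(query)
    unused = dict(nodes)  # only tags not used yet are considered
    selected = []
    for _ in range(max_rounds):
        if not remaining or not unused:
            break  # all items covered, or nothing left to match
        best = max(unused, key=lambda tag: f_measure(remaining, unused[tag]))
        if f_measure(remaining, unused[best]) <= min_f:
            break  # no additional match is possible
        selected.append(best)
        remaining -= unused.pop(best)  # delete covered items from the query
    return selected, remaining


# Hypothetical query and peer cluster nodes.
query = {"a", "b", "c", "d", "e", "f", "g"}
nodes = {"metal": {"a", "c"}, "pop": {"d", "f"}, "hip hop": {"e", "g"}}
selected, remaining = cover_query(query, nodes)
```

In this example item b is matched by no node and stays in `remaining`.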

(16)

LACE Algorithm

[Figure: LACE example; the items left over after matching are inserted into the resulting hierarchy.]

Remaining items are added by classification (kNN).
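A k-nearest-neighbour assignment of a leftover item could look like the sketch below. It assumes that every item has a numeric feature vector and that Euclidean distance is appropriate; both the features and the function name `knn_assign` are illustrative assumptions.

```python
import math
from collections import Counter


def knn_assign(item, features, assigned, k=3):
    """Assign `item` to the majority cluster among its k nearest assigned items."""
    neighbours = sorted(
        assigned,
        key=lambda other: math.dist(features[item], features[other]),
    )[:k]
    votes = Counter(assigned[n] for n in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical two-dimensional audio features per item ID.
features = {
    "a": (0.9, 0.1),
    "c": (0.8, 0.2),
    "d": (0.1, 0.9),
    "f": (0.2, 0.8),
    "b": (0.85, 0.15),
}
assigned = {"a": "metal", "c": "metal", "d": "pop", "f": "pop"}
cluster_of_b = knn_assign("b", features, assigned, k=3)
```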

(17)

LACE Algorithm

[Figure: LACE example; the first result (1) is a hierarchy with the tags hip hop, pop, metal, alternative, death metal, true metal.]

Process starts anew until no more matches are possible or the maximal number of results is reached.

(18)

LACE Algorithm

[Figure: LACE example with results 1, 2, 3, …, k; result 1 uses the tags hip hop, pop, metal, alternative, death metal, true metal, and result 2 the tags home, work, office, plane.]

Process starts anew until no more matches are possible or the maximal number of results is reached.

(19)

LACE Algorithm

[Figure: LACE example with results 1, 2, 3, …, k obtained via a P2P network.]

Ad hoc peer-to-peer network.

(20)

Structuring Music Collections

Challenge of music data:

 There is no perfect feature set for all mining tasks.

 Learning feature extraction for a classification task (Mierswa/Morik, MLJ 2005)

 Structuring music collections (Wurst/Morik/Mierswa, ECML 2006)

User views are local models - no global consensus wanted!

Mierswa/Morik/Wurst, in: Masseglia, Poncelet, and Teisseire (editors), Successes and New Directions in Data Mining, 2007

(21)

Structuring Tag Collections

 Users annotate resources with arbitrary tags.

 Frequency of tags is shown by the tag cloud.

 Tags structure the collection.

(22)

Navigation

 User may select a tag and sees the resources.

 User may follow related tags.

 Problem:

 No hierarchical structure.

 Restricted navigation to given tags.

 No navigation according to subsets.

 Photography and art cannot be found!

(23)

Given: Folksonomy

 A folksonomy $(U, T, R, Y)$, with

 $U$: users

 $T$: tags

 $R$: resources

 $Y \subseteq U \times T \times R$

 A record $(u, t, r) \in Y$ means that user $u$ has annotated resource $r$ with tag $t$.
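As a data structure, a folksonomy can be held as a set of (user, tag, resource) triples from which U, T and R are derived. The sketch below is one possible representation; the class and function names and the example records are hypothetical.

```python
from typing import NamedTuple


class Folksonomy(NamedTuple):
    users: frozenset        # U
    tags: frozenset         # T
    resources: frozenset    # R
    assignments: frozenset  # Y, a set of (u, t, r) triples


def make_folksonomy(records):
    """Build (U, T, R, Y) from an iterable of (user, tag, resource) records."""
    y = frozenset(records)
    return Folksonomy(
        users=frozenset(u for u, _, _ in y),
        tags=frozenset(t for _, t, _ in y),
        resources=frozenset(r for _, _, r in y),
        assignments=y,
    )


def tags_of(folksonomy, user, resource):
    """All tags that `user` attached to `resource`."""
    return {t for u, t, r in folksonomy.assignments if u == user and r == resource}


# Hypothetical annotation records.
f = make_folksonomy([
    ("alice", "sun", "photo1"),
    ("alice", "beach", "photo1"),
    ("bob", "sun", "photo2"),
])
```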

(24)

Wanted: Tagset clustering

 Hierarchical clustering of tags for navigation,

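 based on frequency: how many users used tag $t$?

$\mathrm{supp}: \mathcal{P}(T) \to \mathbb{N}$, where for a tag set $T' \subseteq T$

$\mathrm{supp}_U(T') = |\{u \in U \mid \forall t \in T' : \exists r \in R : (u, t, r) \in Y\}|$

 Subset of the lattice of frequent tag sets that optimizes clustering criteria.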
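The user-based support can be computed directly from the triples in Y. The following sketch counts, for a tag set, the users who used every tag in it on some resource, and enumerates frequent tag sets by brute force. The brute-force enumeration and the `max_size` bound are simplifications chosen for brevity; a real system would more likely use a level-wise frequent itemset miner.

```python
from itertools import combinations


def user_support(tag_set, y):
    """Number of users who used every tag in tag_set on some resource."""
    tags_by_user = {}
    for user, tag, _resource in y:
        tags_by_user.setdefault(user, set()).add(tag)
    wanted = set(tag_set)
    return sum(1 for tags in tags_by_user.values() if wanted <= tags)


def frequent_tag_sets(y, min_support, max_size=3):
    """Brute-force enumeration of tag sets with support >= min_support."""
    tags = sorted({t for _, t, _ in y})
    frequent = {}
    for size in range(1, max_size + 1):
        for candidate in combinations(tags, size):
            support = user_support(candidate, y)
            if support >= min_support:
                frequent[frozenset(candidate)] = support
    return frequent


# Hypothetical folksonomy records (user, tag, resource).
Y = {
    ("u1", "sun", "r1"), ("u1", "beach", "r2"),
    ("u2", "sun", "r1"), ("u2", "beach", "r1"),
    ("u3", "sun", "r3"), ("u3", "fun", "r3"),
}
frequent = frequent_tag_sets(Y, min_support=2)
```

Note that user u1 supports {sun, beach} although the two tags were attached to different resources, which follows from the existential quantifier over resources in the definition.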

(25)

Clustering

 Termset clustering: how many resources support a term?

 Given frequent term sets, form a clustering with small overlap and large coverage.

Beil, Ester, Xu (2002): Frequent Term-Based Text Clustering, in KDD 2002

Fung, Wang, Ester (2003): Hierarchical Document Clustering Using Frequent Itemsets, in SDM 2003

 Heuristics for minimizing overlap, maximizing coverage.

[Figure: lattice of frequent term sets over documents D1, ..., D16, with nodes { }, {sun}, {beach}, {sun, fun}, {sun, beach}, {fun, beach}, and {sun, fun, beach}, each annotated with its supporting documents.]
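Coverage and overlap of a candidate clustering can be measured in several ways. The sketch below uses two simple definitions that are assumptions for illustration, not necessarily those of the cited papers: coverage is the fraction of documents that fall into at least one cluster, and overlap is the average number of additional clusters a covered document belongs to.

```python
def coverage(clusters, all_docs):
    """Fraction of all documents contained in at least one cluster."""
    covered = set().union(*clusters.values())
    return len(covered & set(all_docs)) / len(all_docs)


def average_overlap(clusters):
    """Average number of extra clusters per covered document (0 = disjoint)."""
    membership = {}
    for docs in clusters.values():
        for doc in docs:
            membership[doc] = membership.get(doc, 0) + 1
    if not membership:
        return 0.0
    return sum(count - 1 for count in membership.values()) / len(membership)


# Hypothetical clustering: term set label -> supporting documents.
all_docs = {f"D{i}" for i in range(1, 9)}
clusters = {
    "sun": {"D1", "D2", "D3", "D4"},
    "beach": {"D3", "D4", "D5", "D6"},
}
cov = coverage(clusters, all_docs)
ovl = average_overlap(clusters)
```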

(26)

Heterogeneous Preferences

Child-count vs. completeness (left); coverage vs. overlap (right)

(27)

Multi-objective Optimization

 Given frequent tag sets

 Find all optimal clusterings according to two orthogonal criteria.

 Orthogonal criteria can only be determined empirically.

 Childcount: number of successors of a cluster

 Overlap: average overlap of clusters at each level.

 Completeness: how much of the lattice is


(28)

GA for Optimization

 NSGA-II algorithm: Deb, Agrawal, Pratap, Meyarivan (2000), in Procs. Parallel Problem Solving from Nature

 Delivers all Pareto-optimal clusterings to a partial lattice of frequent tag sets.

[Figure: GA loop with the steps initial population, fitness, stop?, selection, crossover, mutation, and output.]
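What "Pareto-optimal" means for two criteria can be illustrated with a plain non-dominated filter. This is not NSGA-II itself, only the dominance test that such algorithms build on; it assumes both objectives are to be maximized, and the example points are invented.

```python
def dominates(a, b):
    """a dominates b: at least as good in every objective, better in one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points):
    """Keep the points that no other point dominates."""
    return [
        p for p in points
        if not any(dominates(q, p) for q in points if q != p)
    ]


# Hypothetical clusterings scored on two criteria, both maximized.
points = [(0.2, 0.9), (0.5, 0.5), (0.9, 0.2), (0.4, 0.4), (0.1, 0.8)]
front = pareto_front(points)
```

The three surviving points are mutually incomparable, so the choice among them is left to the user.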

(29)

Encoding Frequent Tag Sets

 Given the lattice of possibly frequent tag sets,

 a binary vector indicates the inclusion of a tag set into the clustering.

 A vector can be mutated by flipping bits.

 Two vectors can be combined to a new one by crossover.
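The encoding and the two genetic operators can be sketched in a few lines. The slide only says "crossover", so the one-point variant used here is an assumption, as are the function names and the mutation rate parameter.

```python
import random


def decode(bits, tag_sets):
    """Tag sets selected for the clustering by a binary vector."""
    return [tag_set for bit, tag_set in zip(bits, tag_sets) if bit]


def mutate(bits, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [1 - b if rng.random() < rate else b for b in bits]


def one_point_crossover(parent_a, parent_b, rng):
    """Child takes a prefix of parent_a and the matching suffix of parent_b."""
    point = rng.randrange(1, len(parent_a))  # requires at least two bits
    return parent_a[:point] + parent_b[point:]


# Hypothetical lattice nodes, one bit per tag set.
tag_sets = [("sun",), ("beach",), ("sun", "beach"), ("fun",)]
rng = random.Random(0)
individual = [1, 0, 1, 0]
chosen = decode(individual, tag_sets)
child = one_point_crossover([0, 0, 0, 0], [1, 1, 1, 1], rng)
```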

(30)

Result: Points of Pareto-front

 Childcount vs. Completeness

 Pareto-front for different minimal support

 Instances

(31)

Application

 Bibsonomy social bookmark system: Hotho, Jäschke, Schmitz, Stumme 2006

 780 users, 59,000 resources, 25,000 tags

 4000 frequent tag sets

 Optimization according to Childcount vs. Completeness and

(32)

Clustering

 Multi-objective optimization allows the user to select among equally good clusterings --> heterogeneity of users is respected

 High scalability, high dimensionality

 Understandable labels (tags)

 Hierarchical structure for navigation.

(33)

Challenges for Data Mining

 High dimensional data

 High throughput data

 Distributed Data

 P2P networks

 Web 2.0

 Diverse user preferences

 Service for end-user systems, e.g., mobile “phones”
