5.1.1 Hierarchical Term Clustering

As described in chapter 4, a clustering for sibling relations can be conducted either as a tagpath clustering, which yields labelled clusters, or as a term clustering, in which the terms themselves constitute the clusters. The best results regarding sibling relations were achieved by means of term clustering. Our first experiment is, therefore, to apply agglomerative hierarchical clustering and Bi-Secting-K-Means to term clustering. Terms, each represented by a vector of its occurrences in sibling sets, are clustered into a binary tree whose leaves are the terms. Bi-Secting-K-Means can be applied to yield a fixed number K of clusters. In our scenario, however, the number of clusters is not even roughly known, nor do we know which strategy is suitable for choosing the next cluster to split until the desired number of clusters is reached. We therefore apply Bi-Secting-K-Means so that it produces a complete hierarchy, which avoids having to decide on a K and on a strategy for choosing the next cluster to split.
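The following is a minimal sketch of how Bi-Secting-K-Means can be run down to single terms so that no K has to be chosen. It is not the implementation used in the experiments: the sibling sets are made-up toy data, and the Euclidean distance, the deterministic seeding and the helper names (`two_means`, `bisect`) are assumptions made for illustration.

```python
from math import sqrt

def distance(a, b):
    """Euclidean distance between two equal-length vectors (assumed measure)."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def two_means(terms, vectors, iterations=10):
    """Split `terms` (at least two) into two non-empty groups with 2-means.

    Seeding is deterministic (an assumption of this sketch): the first term
    and the term farthest away from it serve as the initial centres.
    """
    first = terms[0]
    far = max(terms[1:], key=lambda t: distance(vectors[first], vectors[t]))
    centres = [vectors[first], vectors[far]]
    groups = [terms[:1], terms[1:]]  # fallback split, always valid
    for _ in range(iterations):
        new_groups = [[], []]
        for t in terms:
            d0 = distance(vectors[t], centres[0])
            d1 = distance(vectors[t], centres[1])
            new_groups[0 if d0 <= d1 else 1].append(t)
        if not new_groups[0] or not new_groups[1]:
            break  # degenerate assignment: keep the last valid split
        if new_groups == groups:
            break  # converged
        groups = new_groups
        centres = [centroid([vectors[t] for t in g]) for g in groups]
    return groups

def bisect(terms, vectors):
    """Split recursively until every leaf is a single term.

    A leaf is a term string, an inner node is a pair (left, right).
    """
    if len(terms) == 1:
        return terms[0]
    left, right = two_means(terms, vectors)
    return (bisect(left, vectors), bisect(right, vectors))

# Hypothetical sibling sets; each term becomes a 0/1 occurrence vector.
sibling_sets = [
    {"sauna", "solarium", "whirlpool"},
    {"sauna", "whirlpool"},
    {"summer", "winter", "spring"},
    {"summer", "winter"},
]
terms = sorted(set().union(*sibling_sets))
vectors = {t: [1 if t in s else 0 for s in sibling_sets] for t in terms}
tree = bisect(terms, vectors)
print(tree)
```

Because every cluster is split until it contains a single term, neither a target number of clusters nor a rule for choosing the next cluster to split is needed; the price is that the result is a tree rather than a flat partition.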

For the obtained cluster hierarchies it is not possible to measure the achieved quality according to FMASO. The clusters are not partitions depicting sibling groups in such a way that one set of sibling groups can be compared to the reference set of sibling groups. Such a comparison would require additional "heuristics": deciding where to cut the hierarchy so that groups of siblings become explicit in the case of agglomerative hierarchical clustering, or running Bi-Secting-K-Means with a fixed number K of clusters. In both cases the hierarchical characteristic would be lost.
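To illustrate why cutting a hierarchy is a heuristic decision, the sketch below flattens a small binary tree at two different depths. The tree, its nested-tuple encoding and the function names are hypothetical and serve only as an example; cutting at a fixed depth is just one of several conceivable cut rules.

```python
def leaves(node):
    """All terms below a node of a nested-tuple binary tree."""
    if isinstance(node, str):
        return [node]
    left, right = node
    return leaves(left) + leaves(right)

def cut_at_depth(node, depth):
    """Flatten the tree into groups by cutting every branch at `depth`."""
    if depth == 0 or isinstance(node, str):
        return [sorted(leaves(node))]
    left, right = node
    return cut_at_depth(left, depth - 1) + cut_at_depth(right, depth - 1)

# Hypothetical hierarchy over six terms.
tree = ((("sauna", "whirlpool"), "solarium"), ("spring", ("summer", "winter")))

groups_1 = cut_at_depth(tree, 1)  # two groups of three terms each
groups_2 = cut_at_depth(tree, 2)  # four smaller groups
print(groups_1)
print(groups_2)
```

Both cuts yield valid partitions that could be compared with a reference set of sibling groups, but they differ, and each of them discards the nesting information that the hierarchy was meant to provide.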

Next we take a brief look at the results of hierarchical term clustering. Figures 5.1 and 5.2 show the resulting cluster hierarchies for agglomerative hierarchical clustering and Bi-Secting-K-Means clustering. Figure 5.3 shows a fraction of the hierarchy produced by Bi-Secting-K-Means clustering in which details can be observed. We applied hierarchical agglomerative clustering with the Unweighted Pair Group Method with Arithmetic Mean (UPGMA) [Jain and Dubes, 1988] heuristic, which is supposed to prevent the creation of chains of clusters at the cost of a relatively high complexity of O(n² log n). For the relatively small number of terms to be clustered, this complexity is acceptable. The hierarchical agglomerative clustering shows only a relatively small chaining effect, whereas in the Bi-Secting-K-Means clustering chaining is a more serious issue.
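As a rough illustration of the UPGMA heuristic, the sketch below merges, in each step, the two clusters with the smallest average pairwise distance between their members. It is a naive version that recomputes all distances in every step and therefore does not reach the O(n² log n) bound mentioned above. The toy vectors, the Euclidean distance and the function names are assumptions.

```python
from itertools import combinations
from math import sqrt

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (assumed measure)."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_distance(a, b, vectors):
    """UPGMA criterion: mean distance over all cross-cluster member pairs."""
    total = sum(euclidean(vectors[x], vectors[y]) for x in a for y in b)
    return total / (len(a) * len(b))

def upgma(vectors):
    """Naive average-linkage clustering.

    `vectors` maps each term to its feature vector. Returns the merge
    history as a list of (members_a, members_b, distance) tuples.
    """
    clusters = [(t,) for t in sorted(vectors)]
    merges = []
    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest average distance.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: average_distance(pair[0], pair[1], vectors),
        )
        merges.append((a, b, average_distance(a, b, vectors)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

# Hypothetical occurrence vectors of six terms over four sibling sets.
vectors = {
    "sauna":     [1, 1, 0, 0],
    "solarium":  [1, 0, 0, 0],
    "whirlpool": [1, 1, 0, 0],
    "summer":    [0, 0, 1, 1],
    "winter":    [0, 0, 1, 1],
    "spring":    [0, 0, 1, 0],
}
merges = upgma(vectors)
for a, b, d in merges:
    print(a, "+", b, "at distance", round(d, 3))
```

Since the criterion averages over all member pairs, a single close pair of terms is not enough to attach a cluster to a growing chain, which is the intuition behind the reduced chaining effect.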

On manual inspection of the hierarchies, the disadvantage of the exclusive clustering approaches applied as term clustering becomes apparent again. In an exclusive clustering approach such as K-Means, hierarchical agglomerative clustering or Bi-Secting-K-Means, cluster membership is exclusive: an instance (term) can belong to only one cluster. This disadvantage was already pointed out when K-Means was applied for term clustering. Since the quality of the clustering cannot be measured according to FMASO, we cannot quantify to what extent the clusterings represent siblings that agree with the sibling relations of the reference ontologies. The rough impression is that a term clustering with exclusive cluster membership is too restrictive to provide the ontology engineer with sufficient suggestions of plausible siblings.

Concepts can have many plausible siblings, and if one recognizes an error where the siblings depicted by the cluster hierarchy are not plausible, one has no alternative suggestion. A term clustering yields only a small amount of information to inspect, but this cannot compensate for the disadvantage of the limited number of siblings that can be observed for a concept. The idea is, therefore, to perform hierarchical clustering as tagpath clustering. In a tagpath clustering, clusters need to be labelled and, therefore, the terms (features, concepts) can take part in more than one sibling constellation. The fact that tagpath clustering is more appealing here is also the reason why we did not further investigate whether Bi-Secting-K-Means can be applied to produce a fixed number of clusters that could then be compared. The disadvantage of exclusive clustering would still be present.

[Dendrogram image omitted. Internal nodes carry numeric identifiers; the leaves are the clustered terms, e.g. accommodation, airport, beach, hotel, sauna, youth_hostel.]

Figure 5.1: Dendrogram of an agglomerative hierarchical clustering with UPGMA metric on a GBP dataset (term clustering, GSO1)

5.1 Hierarchical clustering for Sibling Groups Hierarchies