4.4 Conclusion
5.1.1 Hierarchical Term Clustering
As described in chapter 4, clustering for sibling relations can be conducted either as tagpath clustering, where labelled clusters are obtained, or as term clustering, where the terms themselves constitute the clusters. The best results regarding sibling relations were achieved by means of term clustering. Our first experiment is, therefore, to apply agglomerative hierarchical clustering and Bi-Secting-K-Means to term clustering. Terms, each represented by a vector of its occurrences in sibling sets, are clustered, yielding a binary tree whose leaves are the terms. Bi-Secting-K-Means can be applied to yield a fixed number of clusters, but in our scenario the number of clusters is not even roughly known, nor do we know which strategy is suited to select the clusters to be split until the desired number of clusters is reached. We therefore apply Bi-Secting-K-Means in a manner that produces a complete hierarchy, which avoids the need to decide on a K and on a strategy for choosing the next cluster to split.
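The procedure described above can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the sibling sets are invented, the function names (`occurrence_vectors`, `two_means`, `bisect`) are hypothetical, and the deterministic seeding of the two centroids is an assumption made only to keep the example reproducible.

```python
from math import sqrt

def occurrence_vectors(sibling_sets):
    """Map each term to a 0/1 vector; component i is 1 if the term occurs in sibling set i."""
    terms = sorted(set().union(*sibling_sets))
    return {t: [1 if t in s else 0 for s in sibling_sets] for t in terms}

def dist(a, b):
    """Euclidean distance between two vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def two_means(items, vectors, iterations=20):
    """Split at least two items into two non-empty groups with 2-means."""
    # Assumed seeding: the first item and the item farthest away from it.
    first = items[0]
    far = max(items[1:], key=lambda t: dist(vectors[t], vectors[first]))
    centroids = [vectors[first], vectors[far]]
    for _ in range(iterations):
        groups = ([], [])
        for t in items:
            d0 = dist(vectors[t], centroids[0])
            d1 = dist(vectors[t], centroids[1])
            groups[0 if d0 <= d1 else 1].append(t)
        if not groups[0] or not groups[1]:
            break
        new_centroids = [mean([vectors[t] for t in g]) for g in groups]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    if not groups[0] or not groups[1]:
        # Degenerate case (e.g. identical vectors): split off the first item.
        return [items[0]], items[1:]
    return groups

def bisect(items, vectors):
    """Bisect recursively until every leaf is a single term (a complete hierarchy)."""
    if len(items) == 1:
        return items[0]
    left, right = two_means(items, vectors)
    return (bisect(left, vectors), bisect(right, vectors))

# Invented sibling sets, for illustration only.
sibling_sets = [{"sauna", "steam_bath"}, {"sauna"}, {"tennis", "golf"}, {"tennis"}]
vectors = occurrence_vectors(sibling_sets)
tree = bisect(sorted(vectors), vectors)
print(tree)
```

Because every cluster is split until only single terms remain, neither a value for K nor a rule for selecting the next cluster to split is required; the price is that the result is a nested binary tree rather than a flat set of sibling groups.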
For the obtained cluster hierarchies it is not possible to measure the achieved quality according to FMASO. The clusters are not partitions depicting sibling groups in such a way that one set of sibling groups can be compared to the reference set of sibling groups. A comparison would only be possible by incorporating further “heuristics”, such as deciding where the hierarchy is to be cut so that groups of siblings are made explicit for agglomerative hierarchical clustering, or using Bi-Secting-K-Means with a fixed number K of clusters to be produced. In both cases, however, the hierarchical characteristic would be lost.
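To illustrate why such a cut discards the hierarchical characteristic, the following sketch flattens a binary cluster tree at a chosen depth. The tree, the terms in it, and the function names `leaves` and `cut_at_depth` are hypothetical; cutting at a fixed depth is only one conceivable heuristic and is not claimed to be one considered in this work.

```python
def leaves(tree):
    """Return the terms below a node; a leaf is a plain string, an inner node a pair."""
    if isinstance(tree, str):
        return [tree]
    left, right = tree
    return leaves(left) + leaves(right)

def cut_at_depth(tree, depth):
    """Flatten the hierarchy into disjoint groups by cutting it at the given depth."""
    if depth == 0 or isinstance(tree, str):
        return [set(leaves(tree))]
    left, right = tree
    return cut_at_depth(left, depth - 1) + cut_at_depth(right, depth - 1)

# Hypothetical hierarchy over five terms.
tree = ((("sauna", "steam_bath"), "solarium"), ("tennis", "golf"))

groups_shallow = cut_at_depth(tree, 1)
groups_deep = cut_at_depth(tree, 2)
print(groups_shallow)
print(groups_deep)
```

Each cut yields a partition that could in principle be compared to reference sibling groups, but the partition depends entirely on the chosen depth, and the nesting of "sauna" and "steam_bath" below "solarium" is no longer visible in either result.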
Next we will have a brief look at the results obtained from hierarchical term clustering. Figures 5.1 and 5.2 show the resulting cluster hierarchies for agglomerative hierarchical clustering and Bi-Secting-K-Means clustering. Figure 5.3 shows a fraction of the hierarchy produced by Bi-Secting-K-Means clustering in which details can be observed. We applied hierarchical agglomerative clustering with the Unweighted Pair Group Method with Arithmetic Mean (UPGMA) [Jain and Dubes, 1988] heuristic, which is supposed to prevent the creation of chains of clusters at the cost of a relatively high complexity of O(n² log n). For the relatively small number of terms to be clustered, however, this complexity is acceptable. The hierarchical agglomerative clustering shows only a relatively small chaining effect, whereas the Bi-Secting-K-Means clustering reveals that chaining is a more serious issue.
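As a minimal sketch of the UPGMA heuristic, the following code merges, in every step, the two clusters with the smallest average pairwise distance between their members. The term vectors are invented, the function names are hypothetical, and the naive recomputation of all linkages is chosen for readability; it is not the O(n² log n) variant and not the implementation used for Figure 5.1.

```python
from itertools import combinations
from math import sqrt

def dist(a, b):
    """Euclidean distance between two vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_linkage(members_a, members_b, vectors):
    """UPGMA distance: arithmetic mean over all cross-cluster pairs of members."""
    total = sum(dist(vectors[a], vectors[b]) for a in members_a for b in members_b)
    return total / (len(members_a) * len(members_b))

def upgma(vectors):
    """Agglomerate single terms into one binary tree of nested pairs."""
    # Each cluster is a pair (subtree, list of member terms).
    clusters = [(term, [term]) for term in vectors]
    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: average_linkage(clusters[p[0]][1], clusters[p[1]][1], vectors),
        )
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]

def depth(tree):
    """Height of the tree; a fully chained hierarchy over n terms has height n - 1."""
    if isinstance(tree, str):
        return 0
    return 1 + max(depth(tree[0]), depth(tree[1]))

# Invented occurrence vectors, for illustration only.
vectors = {
    "sauna": [1, 1, 0, 0],
    "steam_bath": [1, 0, 0, 0],
    "tennis": [0, 0, 1, 1],
    "golf": [0, 0, 1, 0],
}
tree = upgma(vectors)
print(tree, depth(tree))
```

The height of the resulting tree gives a rough indication of chaining: the closer it is to the number of terms minus one, the more the hierarchy degenerates into a chain in which single terms are attached one after another.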
When one manually inspects the hierarchies, the disadvantage of the two exclusive clustering approaches applied as term clustering becomes apparent again. In an exclusive clustering approach such as K-Means (and likewise hierarchical agglomerative clustering and Bi-Secting-K-Means), cluster membership is exclusive: an instance (term) can belong to only one cluster. This disadvantage was already pointed out when K-Means was applied to term clustering. Since the quality of the clustering cannot be measured according to FMASO, we cannot quantify to what extent the clusterings represent siblings that accord with the sibling relations of the reference ontologies. The rough impression is that a term clustering with exclusive cluster membership is too restrictive to provide the ontology engineer with sufficient suggestions of plausible siblings.
Concepts can have many plausible siblings, and if one recognizes an error where the siblings depicted by the cluster hierarchy are not plausible, one has no alternative suggestion. That a term clustering yields only a small amount of information to be inspected cannot compensate for the disadvantage of the limited number of siblings that can be observed for a concept. The idea is therefore to perform hierarchical clustering as tagpath clustering. In a tagpath clustering, clusters need to be labelled and, therefore, the terms/features/concepts involved can take part in more than one sibling constellation. The circumstance that tagpath clustering is more appealing here is also the reason why we did not further investigate whether Bi-Secting-K-Means can be applied to produce a fixed number of clusters which could thus be compared. The disadvantage of exclusive clustering would still prevail.
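The difference between exclusive term clusters and labelled clusters in which a term may recur can be made concrete with a small sketch. The cluster contents and the function name `sibling_suggestions` are invented for illustration; the code only counts which sibling candidates each term receives and makes no claim about how cluster labels are actually derived.

```python
from collections import defaultdict

def sibling_suggestions(clusters):
    """Collect, for every term, all terms that share at least one cluster with it."""
    suggestions = defaultdict(set)
    for cluster in clusters:
        for term in cluster:
            suggestions[term] |= cluster - {term}
    return dict(suggestions)

# Exclusive term clustering: every term belongs to exactly one cluster.
exclusive = [
    {"sauna", "solarium", "steam_bath"},
    {"swimming_pool", "whirlpool"},
]

# Labelled clusters (hypothetical): the same term may label several clusters.
overlapping = [
    {"sauna", "solarium", "steam_bath"},
    {"sauna", "swimming_pool", "whirlpool"},
]

print(sorted(sibling_suggestions(exclusive)["sauna"]))
print(sorted(sibling_suggestions(overlapping)["sauna"]))
```

With exclusive membership, "sauna" receives only the siblings of its single cluster; once it may appear in a second cluster, the ontology engineer sees an additional sibling constellation and thus has an alternative if the first one is implausible.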
[Dendrogram image not reproduced here: inner nodes carry numeric cluster identifiers, and the leaves are the clustered terms, listed alphabetically from accommodation to youth_hostel.]
Figure 5.1: Dendrogram of an agglomerative hierarchical clustering with UPGMA metric on a GBP dataset (term clustering, GSO1)
5.1 Hierarchical clustering for Sibling Groups Hierarchies