The recall achieved at an acceptable precision is too low to be presented in a diagram. This is common to automatic approaches for finding synonyms: they can only find a small fraction of the synonym relations.

8.5 Conclusion

Although Group-By-Path achieved better results in finding synonymous terms than a traditional Bag-Of-Words approach, even these improved results are inadequate for finding all synonyms which are expected to exist in a vocabulary. One can only state that it is possible to find a rather small number of synonyms with acceptable precision.

9 Domain Relevance enhanced Term Weighting for Learning Sibling Groups - XTREEM-SG T,DR

In this chapter we present the TF×IDF×DR term weighting scheme. It is derived from TF×IDF term weighting and additionally incorporates a domain relevance (DR) factor, which reflects the degree to which a term is considered characteristic of the dataset in comparison to an external comparison ground.

The newly proposed DR enhanced term weighting scheme is applied to a variant of the XTREEM-SG procedure in which, in contrast to chapter 4, the vocabulary that defines the feature space is obtained automatically by the XTREEM-T approach described in chapter 7. In such an automatically obtained feature space, terms of little importance to the domain of interest can occur, but they should have only little influence on the results.

Term weighting is often performed when processing textual data represented in the vector space model. The most prominent weighting scheme is TF×IDF [Salton and Buckley, 1987]. In the next section we describe the motivation for creating a new term weighting scheme. DR enhanced term weighting is expected to yield sibling groups, given by cluster labels, that are more domain relevant than those obtained without DR term weighting; we investigate this in the evaluation experiments.

9.1 Motivation

The motivation for extending the existing TF×IDF term weighting is twofold. First, the frequency distribution of terms in datasets obtained by Group-By-Path, which is used to derive IDF scores, is “different” or “distorted” compared to that of Bag-Of-Words text document datasets. This aspect is described further in section 9.1.1. Second, as described in section 9.1.2, in the context of clustering based ontology learning the obtained results (sibling group clusters in particular, and otherwise motivated clusters in general) are to be consumed by ontology engineers. An ontology engineer who intends to conceptualize a domain can be expected to be more interested in “domain specific” concepts than in general world patterns.

9.1.1 Distorted Occurrence Distributions

TF×IDF is intended for regular vector space models obtained by vectorizing textual documents. The Group-By-Path approach presented in chapter 3 allows us to “access” and represent semi-structured Web documents in a different way than traditional text document vectorizations do. A vectorization of a Web document collection with the aim of finding semantic sibling relations is described in chapter 4. For the vectorization performed in the XTREEM-SG processing procedure of chapter 4, a vocabulary of terms is a required input. There, the vocabulary is manually crafted and therefore rarely contains terms which are not of interest to the user. In practice, one cannot always expect the input vocabulary to be of high quality. The feature space might be obtained automatically without the support of any terminology acquisition method at all. Even when terms are acquired automatically by a terminology acquisition method, such as those described in chapter 7, the obtained vocabularies can be expected to contain errors.

Text document vectorizations with TF×IDF term weighting can cope with noisy vocabularies. Terms referred to as stop words are handled well by TF×IDF on traditional Bag-Of-Words vectorizations, but this can be different for Group-By-Path vectorizations. If a feature space contains the term “the”, then for traditional Bag-Of-Words vectorizations there is likely to be a non-zero term score in each vector. The term “the” has nearly no “separation strength” and is weighted down by TF×IDF term weighting. In contrast, when Web documents are accessed according to the Group-By-Path approach described in chapter 3, the term “the” might be captured as a candidate sibling term. Since this happens rather seldom, and “the” then occurs together with other sibling terms, TF×IDF can score it as if it were a reasonably good candidate term; TF×IDF does not penalize this term as strongly as it does for Bag-Of-Words vectorizations. We refer to this circumstance as “distorted frequency distributions”. The “uninformative” terms which have a high frequency according to Zipf’s law [Zipf, 1949] are not necessarily the terms with high frequency in Group-By-Path vectorizations. The proposed TF×IDF×DR is intended to cope better with distorted occurrence distributions by incorporating a measure of term relevance influenced by external evidence. Terms which are captured in a certain fraction of paths, such as “home”, “top”, “feedback” and so on, might be those which lead to the establishment of clusters. By potentially penalizing such terms, which are not characteristic of a domain Web document collection, their influence is reduced.
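The distortion described above can be illustrated with a small sketch. It assumes the plain IDF definition log(N / df) rather than the exact variant used in the experiments, and the toy vectors, the function name idf and the variable names are hypothetical, chosen only to mirror the example of the term “the”.

```python
import math

def idf(term, vectors):
    """Plain IDF, log(N / df); df is the number of vectors containing the term.
    This textbook variant is an assumption, not necessarily the one used here."""
    df = sum(1 for v in vectors if term in v)
    return math.log(len(vectors) / df) if df else 0.0

# Hypothetical Bag-Of-Words data: one term set per document.
# The stop word "the" occurs in every document.
bag_of_words = [
    {"the", "hotel", "offers", "rooms"},
    {"the", "camping", "site", "is", "open"},
    {"the", "hostel", "is", "near", "station"},
]

# Hypothetical Group-By-Path data: one term set per group of sibling text spans.
# "the" is captured only occasionally as a candidate sibling term.
group_by_path = [
    {"hotel", "hostel", "camping"},
    {"home", "top", "feedback"},
    {"home", "top", "the"},
    {"bed", "breakfast", "hotel"},
]

idf_the_bow = idf("the", bag_of_words)      # df == N, so the score is zero
idf_the_gbp = idf("the", group_by_path)     # df == 1, so the score is high
idf_hotel_gbp = idf("hotel", group_by_path) # df == 2

print(f"IDF('the'),   Bag-Of-Words:  {idf_the_bow:.3f}")
print(f"IDF('the'),   Group-By-Path: {idf_the_gbp:.3f}")
print(f"IDF('hotel'), Group-By-Path: {idf_hotel_gbp:.3f}")
```

In this toy setting “the” receives an IDF of zero on the Bag-Of-Words vectors, but on the Group-By-Path vectors it receives a higher IDF than the domain term “hotel”. IDF alone therefore no longer suppresses it, which is the situation the additional DR factor is meant to address.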

9.1.2 Interest towards Domain Relevant Terms

From our experiments on mining semantic sibling relations from Web documents on an open vocabulary by means of the XTREEM Group-By-Path approach [Brunzel and Spiliopoulou, 2006a], it became desirable to reduce the influence of non domain relevant terms on the results. Though correct