Dimensions and Hierarchies in the Multidimensional Data Model

4.1 State of the Art of Dimensional Modeling
4.1.1 Rigidness of OLAP Dimensions
4.1.2 Related Work on Modeling Dimension Hierarchies
4.2 Academic Management as the Motivating Case Study
4.3 Categorization of Dimension and Hierarchy Types
4.3.1 Refining the Formal Framework
4.3.2 Dimension Types
4.4 Classification of Hierarchy Types
4.4.1 Strict vs. Non-Strict Hierarchies
4.4.2 Types of Homogeneous Hierarchies
4.4.3 Types of Heterogeneous Hierarchies
4.5 Classification of Multiple Hierarchies

4.1 State of the Art of Dimensional Modeling

The dimension hierarchy is a central concept in OLAP, as it specifies valid aggregation paths for exploring the facts in a data cube at different levels of detail and in a hierarchical fashion: typically, data analysis evolves from a more abstract view of coarse-grained aggregates to a more precise view obtained through a sequence of drill-down and slice-and-dice operations. In OLAP tools, dimension schemes and instances are presented in the form of hierarchical data navigation structures, which enable purely visual specification of analytical queries: the user simply navigates to the desired categories and their members. Abstraction of a database query into a navigation interface raises the issue of reachability: each lower-level member must be reachable from each of its containing members at the upper levels. Another important issue is correct summation: subtotals produced by drilling down should sum up to the value of the decomposed total. These issues are well investigated in the literature and are formalized as the concepts of aggregate navigation and summarizability.
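The requirement of correct summation can be illustrated with a minimal sketch. The code below is not taken from any of the cited works; the function names, the mapping-based representation of a hierarchy, and the sample figures are assumptions made purely for illustration. It rolls leaf-level measures up to the parent level and checks whether the resulting subtotals add up to the grand total. A member with two parents is counted twice, and a member without a parent is lost.

```python
from collections import defaultdict


def roll_up(facts, parents):
    """Aggregate leaf-level measures to the parent level.

    facts:   {leaf_member: measure}
    parents: {leaf_member: [parent_member, ...]}
    """
    totals = defaultdict(float)
    for leaf, value in facts.items():
        for parent in parents.get(leaf, []):
            totals[parent] += value
    return dict(totals)


def subtotals_match_total(facts, parents):
    """True if the parent-level subtotals add up to the grand total."""
    grand_total = sum(facts.values())
    subtotal_sum = sum(roll_up(facts, parents).values())
    return abs(grand_total - subtotal_sum) < 1e-9


# Hypothetical data: enrolled students per degree program.
facts = {"CS-BSc": 120, "Math-BSc": 80, "Bioinformatics-MSc": 40}

# Every program belongs to exactly one department.
strict = {
    "CS-BSc": ["Computer Science"],
    "Math-BSc": ["Mathematics"],
    "Bioinformatics-MSc": ["Computer Science"],
}
# One program is shared by two departments and is counted twice.
non_strict = {**strict, "Bioinformatics-MSc": ["Computer Science", "Biology"]}
# One program has no parent at all and drops out of the subtotals.
incomplete = {k: v for k, v in strict.items() if k != "Math-BSc"}

print(subtotals_match_total(facts, strict))      # True
print(subtotals_match_total(facts, non_strict))  # False
print(subtotals_match_total(facts, incomplete))  # False
```

The check only compares sums at a single level; it is a necessary symptom test rather than a full decision procedure for summarizability.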

4.1.1 Rigidness of OLAP Dimensions

The rigidness of standard OLAP technology is primarily due to the requirement of summarizability for all dimension hierarchies. The concepts of summarizability and well-formedness were coined by Rafanelli and Shoshani [152] in the context of statistical databases and redefined by Lenz and Shoshani [98] for OLAP.

Summarizability guarantees correct aggregation and optimized performance, as any aggregate view is obtainable from a set of precomputed views of finer granularity. However, the hierarchies in many real-world applications are not summarizable and, therefore, may not be used as dimensions in their original form. In the case of minor irregularities, the data tree can be balanced by filling the “gaps” with artificial nodes. In highly unbalanced hierarchies, however, such transformations may be undesirable. In yet other scenarios, e.g., in taxonomy-based classifications, it is crucial to preserve the original state of the hierarchical structures.
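Balancing by means of artificial nodes can be sketched as follows. This is an illustrative assumption rather than a procedure prescribed by the cited literature: the hierarchy is represented as a list of non-empty root-to-leaf paths, and the function name, the filler naming scheme, and the sample members are hypothetical. Shorter paths are padded directly above the leaf, so that every leaf ends up at the bottom level.

```python
def balance_paths(paths, filler="n/a"):
    """Pad root-to-leaf paths so that all of them have the same length.

    Artificial nodes are inserted directly above the leaf and are named
    after the leaf's original parent, which keeps fillers under different
    parents distinct. Each path is assumed to be non-empty.
    """
    depth = max((len(path) for path in paths), default=0)
    balanced = []
    for path in paths:
        gap = depth - len(path)
        parent = path[-2] if len(path) > 1 else "root"
        padding = [f"{parent} ({filler} {i + 1})" for i in range(gap)]
        balanced.append(path[:-1] + padding + [path[-1]])
    return balanced


# Hypothetical organizational hierarchy with one level missing in the
# second branch.
paths = [
    ["University", "Faculty of Science", "Dept. of CS", "Database Group"],
    ["University", "Library", "Reading Room"],
]

for path in balance_paths(paths):
    print(" > ".join(path))
# University > Faculty of Science > Dept. of CS > Database Group
# University > Library > Library (n/a 1) > Reading Room
```

The sketch also shows why the transformation becomes unattractive for highly unbalanced hierarchies: the number of artificial nodes grows with every level a branch falls short.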

Another limitation of OLAP comes from enforcing uniform granularity, or precision, in the members of the same category type. As stated in [145], in some applications the data may be prone to naturally varying precision, e.g., the diagnosis of a patient, whereas in other scenarios mixed granularity arises as a result of combining data from different sources. These variations are eliminated by “cleansing” the data prior to loading it into a data warehouse.

The requirement of completeness prohibits missing (NULL) values in facts and dimensions, as such values complicate the invocation of aggregate functions and the correct interpretation of computed aggregates.

Correct aggregation is also enforced via the requirement of homogeneity. Even though it is admissible to define multiple hierarchies within the same dimension, each of those hierarchies must be homogeneous, i.e., each level of the tree corresponds to a single category and all members of a given category have ancestors in the same set of categories [61].
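The second part of the homogeneity condition, namely that all members of a category have ancestors in the same set of categories, lends itself to a simple check. The following sketch is an assumption for illustration only: the dictionaries `parents` and `category`, the function names, and the sample members are hypothetical and do not reproduce the formalization in [61].

```python
from collections import defaultdict


def ancestor_categories(member, parents, category):
    """Return the set of categories of all ancestors of `member`."""
    found, seen = set(), set()
    stack = list(parents.get(member, []))
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        found.add(category[current])
        stack.extend(parents.get(current, []))
    return frozenset(found)


def heterogeneous_categories(parents, category):
    """List the categories whose members disagree on ancestor categories."""
    signatures = defaultdict(set)
    for member, cat in category.items():
        signatures[cat].add(ancestor_categories(member, parents, category))
    return sorted(cat for cat, sigs in signatures.items() if len(sigs) > 1)


# Hypothetical staff dimension: Bob is attached to a department directly,
# whereas Alice belongs to a group within a department.
category = {
    "Alice": "Person", "Bob": "Person",
    "Database Group": "Group",
    "Dept. of CS": "Department", "Library": "Department",
}
parents = {
    "Alice": ["Database Group"],
    "Database Group": ["Dept. of CS"],
    "Bob": ["Library"],
}

print(heterogeneous_categories(parents, category))  # ['Person']
```

In this example the category Person is reported because its members roll up to {Group, Department} in one case and to {Department} alone in the other.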

Analysts are frequently confronted with data that violate the above restrictions and therefore cannot be adequately supported by standard OLAP systems.

4.1.2 Related Work on Modeling Dimension Hierarchies

Despite the overall maturity and successful establishment of OLAP technology, the domain of conceptual design still faces many research challenges. New issues arise when this technology, tailored primarily to the needs of business performance analysis, is applied in novel contexts that deal with rather complex data that cannot be trivially rearranged into homogeneous cubes of uniformly grained facts and perfectly balanced dimension hierarchies.

Early works on multidimensional data models, such as Kimball’s star schema model [81], the cube operator of Gray et al. [51], the conceptual model for OLAP of Gyssens and Lakshmanan [52], and the cube data model of Datta and Thomas [36], tightly coupled the conceptual model with the logical one, especially the relational one, considering dimension hierarchies as mere collections of attributes used as grouping criteria for aggregating measures. The underlying star schema approach trades semantic richness off against simplicity by not capturing hierarchical relationships at the schema level.

A step in the right direction was taken by a concurrent branch of research on so-called structured cube models, in which dimension hierarchies are explicit in the scheme.

Li and Wang [99] introduced the notion of a grouping relation to reflect hierarchical relationships between the attributes of a dimension and defined grouping algebra as an extension of relational algebra that includes order-oriented and aggregation operators.

Agrawal et al. [5] also proposed a model along with a set of algebraic operators for multidimensional databases. Their model provides support for multiple hierarchies in a dimension and symmetric treatment of dimensions and measures by means of PUSH and PULL operators. However, the model does not distinguish between structure and contents, which results in the necessity of encoding structural and functional information into the query.

Cabibbo and Torlone [16] proposed a formal multidimensional model that is truly implementation-independent and thus provides a clear distinction between practical and conceptual aspects. Complex structured dimensions are modeled by specifying a partial order on dimension levels, thus accounting for the possibility of multiple aggregation paths. In the follow-up work [17], the authors extend their formal framework by presenting an algebraic and a graphical language for querying multidimensional databases.
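How a partial order on levels gives rise to multiple aggregation paths can be shown with a small sketch. It does not reproduce the formalism of [16]; the adjacency mapping `rolls_up_to`, the function name, and the levels of the sample time dimension are assumptions chosen for illustration. The level graph is assumed to be acyclic.

```python
def aggregation_paths(rolls_up_to, start, end):
    """Enumerate all roll-up paths from level `start` to level `end`.

    rolls_up_to maps each level to the levels directly above it.
    """
    if start == end:
        return [[start]]
    paths = []
    for upper in rolls_up_to.get(start, []):
        for tail in aggregation_paths(rolls_up_to, upper, end):
            paths.append([start] + tail)
    return paths


# Hypothetical time dimension: weeks do not nest into months, so the
# levels form a partial order rather than a single chain.
rolls_up_to = {
    "day": ["month", "week"],
    "month": ["quarter"],
    "quarter": ["year"],
    "year": ["ALL"],
    "week": ["ALL"],
}

for path in aggregation_paths(rolls_up_to, "day", "ALL"):
    print(" -> ".join(path))
# day -> month -> quarter -> year -> ALL
# day -> week -> ALL
```

Each enumerated path is a totally ordered list of levels, whereas the dimension as a whole is only partially ordered: month and week are incomparable.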

Vassiliadis [179] provides a formal model that defines dimensions as lattices of levels, in which each path is a linear, totally ordered list of levels. The model also includes a set of useful cube operations. Furthermore, the work presents mappings of the conceptual model and its operators to an extended relational model and to multidimensional arrays.

A powerful approach to modeling dimension hierarchies, along with SQL query language extensions called SQL(H), was presented by Jagadish et al. [70]. The notion of a dimension is formalized by introducing the concepts of a hierarchical domain, hierarchy, hierarchy scheme, and dimension scheme. SQL(H) does not require data hierarchies to be balanced or homogeneous. The data model allows structural heterogeneity to be captured at the schema level.

The necessity of dropping the restriction of homogeneity has been recognized by researchers, who proposed respective extensions in the form of dimension constraints [61], multidimensional normal forms [94, 96], transformation techniques [144], and mapping algorithms [109].

Hurtado and Mendelzon contributed a series of works on summarizability for heterogeneous hierarchies. In [60], a class of constraints for inferring summarizability in a particular class of heterogeneous dimensions is presented. A follow-up work [61] proposes a class of integrity constraints and schemes that enable reasoning about summarizability in general heterogeneous dimensions. A more recent work [59] summarizes previous contributions of the authors and provides a survey of related work on heterogeneity in OLAP.

Lehner et al. [96] relaxed the condition of summarizability to enable modeling of generalization hierarchies by defining a Generalized Multidimensional Normal Form (GMNF) as a yardstick for the quality of multidimensional schemata. Lechtenbörger and Vossen [94] pointed out the methodological deficiency in deriving a multidimensional schema from a relational one and extended the framework of normal forms proposed in [96] to provide more guidance in data warehouse design.

In spite of numerous contributions and competing multidimensional models, there is still no consensus in the community about modeling standards, especially with respect to the different hierarchy types.

Pedersen et al. [145] proposed an extended multidimensional data model for handling complex data and a corresponding algebra for multidimensional objects. The model extensions concerning dimension hierarchies address variable granularity due to imprecision, missing values, and non-covering, non-strict, and non-onto hierarchies. The issue of heterogeneity is not considered by this model. Innovative features, such as handling imprecision and incompleteness, are supported by means of the proprietary algebra, which has not been implemented in any OLAP system.

Some researchers propose to handle complex hierarchies at the logical schema level. Bauer et al. [9] exploit the flexibility of the relational and, alternatively, the object-relational database constructs to obtain normalized OLAP schemes. Lin et al. [101] describe an object-relational modeling approach for warehousing complex data. Complexity is handled by employing inheritance and complex objects.

The conceptual design approach of Hüsemann et al. [64] builds upon the multidimensional normal form framework of [96] and investigates the properties of multiple hierarchies, subdividing them into optional (i.e., generalized) and alternative ones. Non-balanced hierarchies are not considered by this framework.

Pourabbas and Rafanelli [148] propose a characterization of hierarchy types, distinguishing between total and partial hierarchies, derived hierarchies, and different types of multiple hierarchies within a dimension. The model admits the existence of irregular hierarchies but does not discriminate between different kinds of irregularity.

Abelló et al. [1] proposed a conceptual multidimensional model YAM² based on the UML notation. The model provides a sound framework for classifying various hierarchy types, e.g., symmetric, non-strict, derived, multiple alternative, and generalized hierarchies. Besides, the model also formalizes the types of relationships between correlated dimensions, distinguishing between generalization, aggregation, and derivation. A conceptual model with very rich multidimensional semantics was proposed by Luján-Mora et al. in [104]: dimension hierarchies are classified using the properties of strictness, completeness, degeneration, optionality, multiplicity, derivation, etc. However, the model focuses on the graphical presentation of the extended set of constructs using UML rather than on formal aspects.

The works of Malinowski and Zimányi focus on extending the conceptual model to handle complex hierarchies that are not addressed by standard OLAP systems. In [108], they present a conceptual classification of hierarchies and propose a graphical notation based on the E/R model. In [109], the authors formalize and extend their framework, presenting it as MultiDimER, a conceptual multidimensional model, together with a relational mapping of all defined hierarchy types. The authors evaluate various approaches to enforcing summarizability by transforming complex hierarchies and propose their own mappings, which overcome the deficiencies of previously proposed solutions. However, the authors’ own conceptual model is presented in a rather informal fashion compared to other state-of-the-art models. Besides, this work does not consider the implications of extending the data model for the applicability of OLAP operators.