• Keine Ergebnisse gefunden

Mapping the Multidimensional Model to the Relational Model

Relational Implementation of the Multidimensional Data Model

7.1 Mapping the Multidimensional Model to the Relational Model

In Section 2.2.3 we highlighted the major data warehouse implementation alternatives and explained why we favor the relational OLAP architecture as the target platform for implementing our extended conceptual model. We also discussed various logical design options for the relational implementation, such as star, snowflake, galaxy, fact constellation, and star cluster schemata. We adopt the galaxy schema based on nor-malized storage of dimension hierarchies, as in the snowflake schema, with centralized, i.e., non-redundant, maintenance of semantically related categories. Advantages of this schema can be summarized as follows:

Normalized storage of dimension hierarchies enforces consistent maintenance of data hierarchies by using foreign key constraints for mapping child-parent relationships.

Hierarchical structure of a dimension is explicit in the logical schema as each dimension level is mapped to a separate dimension table referencing its parent level(s).

Non-redundant storage guarantees anomaly-free maintenance of dimension hierarchies.

Since each category is mapped to a separate dimension table, unification of the multidimensional space becomes trivial: each set of conform or compatible categories is managed in a centralized dimension table of the respective category type.

Properties of a particular dimension level can be easily associated with that level by adding them as attributes to the respective dimension table.

Semantically related fact schemes and dimensions are mapped to related logical structures. Valid roll-up paths can be derived simply by “tracing” the foreign key relationships between tables.

Interchangeability of fact and dimension roles is given as both are mapped to relational tables.

Interchangeability of dimension category, property attribute, and measure role is given as all character-istics are stored as attributes in relational tables.

Non-strict mappings can be handled using an established data warehousing practice of bridge tables.

Inheritance in fact and dimension schemes can be handled using (materialized) views and/or integrity constraints (e.g., triggers).

A conceptual-to-relational mapping of the multidimensional data model is a thoroughly investigated field of data warehousing, with many established methodologies and designer tools, which foster automated ac-quisition of logical and physical schemes from the conceptual ones. However, existing approaches perform poorly or even fail when it comes to handling extensions and modifications of the multidimensional model.

To identify the actual challenges of providing a logical mapping of our extended model, we proceed by “re-viving” the general rules of the relational data warehouse design according to the galaxy schema [7, 8, 92]:

The two types of tables are afact tableand adimension table.

Each fact type is mapped to a relation of type fact table that includes all measure characteristics of the fact type as well as a foreign key reference to each associated dimension.

Each dimension level is mapped to a separate dimension table that includes all attributes of that level as well as a foreign key reference to each associated parent dimension level.

If a roll-up relationship between a pair of dimension levels is non-strict, it may not be referenced via a foreign key from within the child level table. Instead, such relationship is extracted into a so-called

“bridge table” that includes a foreign key reference for each of the two affected dimension levels.

The fact table is associated with each of its dimensions by referencing the dimension’s bottom level, i.e., each dimension is represented in the fact by the primary key attribute of its bottom category.

A partial roll-up relationship is expressed by declaring the respective parent key reference as nullable.

Each set of semantically related categories (conform or compatible ones) is maintained centralized in the same dimension table.

7.1 : Mapping the Multidimensional Model to the Relational Model 133

The above relational model with “snowflaked” dimension hierarchies, arranged into galaxies by sharing semantically related categories, produces a rather self-describing logical scheme. Fact and dimension cate-gory types are mapped to fact and dimension tables, respectively, and roll-up relationships are represented by foreign keys. Related fact schemes are also evident as their logical schemata share dimension tables.

However, the relational model is incapable of capturing multidimensional semantics in its entirety. For in-stance, it provides no constraints for specifying measure additivity, exclusivity or compatibility of multiple hierarchies, degenerate fact types, etc. Semantic description of the logical schema in a data warehouse sys-tem is handled by the metadata layer. The metadata repository is even considered the key component in the data warehouse architecture [41] as it manages all the information necessary for successful interplay of all system’s components. Metadata management is the subject of the last section in this chapter.

Application of the relational data warehouse design principles is rather straightforward for most of the multidimensional constructs. Ambiguities or complications arise when dealing with non-conventional prop-erties, such as degenerated and derived elements or irregular dimension hierarchies. In the remainder of this section we discuss the overall process of obtaining logical representations of all types of fact and dimen-sion schemes supported by our extended model. Logical schemes are constructed using the UML profile for logical data warehouse design proposed in [102] and introduced in Section 2.2.4.

7.1.1 Mapping Fact Schemes

Each non-derived fact type is mapped to a separate fact table consisting of the fact’s measures and their di-mensional characteristics. The relational model does not distinguish between normal and degenerate facts.

Fact tables of non-measurable facts consist solely of the respective dimensional characteristics, which, to-gether, build a fact entry. If a fact identifier dimension exists in the scheme, it is used as a primary key of the fact table. Otherwise, the entire set of the dimensional attributes form a composite key or, alternatively, a sur-rogate (i.e., system-generated) fact identifier attribute can be added to the scheme. As an example, consider a multidimensional fragment consisting of a non-measurable factSURGERYand its degenerate measurable factSURGERY-PARTICIPANTin Figure 7.1a with the corresponding logical implementation shown in Figure 7.1b. As expected, each fact scheme maps to a fact table of its own. The fact table of a measurable fact

<<PK>> SurgeryID

(a) FactsSURGERYand SURGERY-PARTICIPANTinX-DFM

(b) Fact tablesSURGERYandSURGERY-PARTICIPANTin the UML profile

Figure 7.1: Example of conceptual fact schemes and their logical representations

134 Chapter 7 : Relational Implementation of the Multidimensional Data Model

contains each measure as an attribute (e.g.,FeeinSURGERY-PARTICIPANT), whereas the fact table schema of a non-measurable fact consists solely of dimension attributes.

Dimension attributes of a fact table correspond to fact-dimensional roll-up relationships and are specified as foreign key references to the respective dimension tables (e.g.,PatientIDinSURGERYreferences dimen-sion tablePATIENT). Multiple roles of the same dimension type in a fact scheme result in multiple foreign key references targeting the same dimension table. For example, dimension tableTIMESTAMPSis referenced asStartTimeand asEndTimecharacteristic in fact tableSURGERY.

The only kind of dimension attribute not referencing another table is the one representing a degenerated dimension, which exists only in the context of its containing fact scheme and, therefore, is not extracted into a dimension table of its own. In our example, SurgeryIDis a degenerated dimension ofSURGERY.

Moreover, SurgeryIDis a fact identifier and, as such, it qualifies as a primary key inSURGERY. In fact schemes with no fact identifier, the primary key is composed of the whole set of dimension attributes, as in SURGERY-PARTICIPANT, whereSurgeryID,ParticipantID, andRoleIDbuild the key.

Notice that there is no evident distinction between fact tables of fully-fledged and degenerate facts. An indication of degeneration is given, if the fact table contains a dimension attribute referencing another fact table. For example, the schema ofSURGERY-PARTICIPANTcontains the attributeSurgeryID, i.e., references SURGERYas a dimension table. Thereby, the relational representation inherently resolves the duality of facts and dimensions: both types are mapped to relations and both kinds of roll-up relationships – fact-dimensional and intra-dimensional ones – are implemented as foreign keys. Subsequently, metadata is used to specify the role of each table in a particular fact’s context.

7.1.2 Handling Derived Elements

According to the rules of “good” database design, derived elements – attributes or relations – should not be materialized to avoid redundancy and update anomalies. Since the instance of a derived element can be calculated by querying its base element(s), views (virtual relations) should be employed for storing the definition of such elements. In the data warehouse design, however, materialization of calculated data views is a common practice, aimed primarily at boosting the performance.

DERIVED FACTS

Typically, materialized views are constructed as projections of primary facts by aggregating their measures to a subset of the fact’s dimensions and/or to a desired granularity within a dimension. Such aggregated views are commonly referred to as “cube computations” or “summary tables” and represent derived multi-dimensional structures of typefact. Further examples of derived facts are those obtained as an outcome of scheme-transforming OLAP operators, such as drill-across, push, and pull. In the UML notation, derived facts are modeled using theviewelement, linking the former to the fact’s base tables. As an example, consider the derivation of factDISCIPLINE-PARTICIPANTdefined as a drill-across ofSURGERY’s two degenerate facts SURGERY-PARTICIPANTandSURGERY-DISCIPLINE(the conceptual scheme of the base facts is depicted in Figure 6.11). The resulting fact schemeDISCIPLINE-PARTICIPANTshould include dimensionsdisciplineand participantand a measuresurgeriesenumerating the number of surgeries a participant had in the respective discipline. Figure 7.2 shows the logical schema fragment of this derived fact and its input fact tables.

At the logical design stage, it is not mandatory to distinguish between standard and materialized views since the actual implementation decision can be taken during the physical design phase depending on storage availability and other constraints.

7.1 : Mapping the Multidimensional Model to the Relational Model 135

<<PK>> SurgeryID, ParticipantID, RoleID

<<FK>> SurgeryID (SURGERY)

<<FK>> ParticipantID (PARTICIPANT)

<<FK>> RoleID (ROLE) SurgeryID: int ParticipantID: int RoleID: int Fee: numeric

<<Table>>

SURGERY-PARTICIPANT

PARTICIPANT

ROLE

<<PK>> SurgeryID, DiagnosisID, DisciplineID, TherapyID

<<FK>> SurgeryID (SURGERY)

<<FK>> DiagnosisID (DIAGNOSIS)

<<FK>> TherapyID (THERAPY)

<<FK>> DisciplineID (DISCIPLINE) SurgeryID: int

DiagnosisID: int TherapyID: int DisciplineID: int

<<Table>>

SURGERY-DISCIPLINE

SURGERY

DIAGNOSIS DisciplineID: int

ParticipantID: int Surgeries: int

<<View>>

DISCIPLINE-PARTICIPANT THERAPY

DISCIPLINE

Figure 7.2: Derived fact modeled as a view linked to the base fact tables

DERIVED MEASURES AND DIMENSION CATEGORIES

Besides derived facts, our conceptual model supports derived elements of typemeasureanddimension cat-egory. To our knowledge, there exist no established guidelines in the data warehouse design for handling these kinds of derivation. We have come to realize that in most cases it is unfeasible to implement derived measures and dimension categories using the (materialized) view mechanism.

In case of a measure, materialized storage of its values in the fact table is crucial for query performance.

The question is how and where exactly derived measures should be stored. We distinguish betweensimple andcomplexderivation.

In the simple case, a new measure is computed from one or more measure attributes of the same fact scheme and, therefore, it has the same set of dimensional characteristics as its input measure(s). For example, net-profitcan be computed as a difference betweenrevenueandexpenditure. If the new measure were to be realized using a materialized view, the view would have to include the entire set of dimensional attributes and the same number of fact entries as the base fact table. Obviously, it is much cheaper to materialize just the measure attribute itself by adding it as a column to the base fact table. Since existing fact entries are not subject to modification, there is no overhead for guarding the consistency of derived measures.

Derivation is complex, if the granularity of the derived measure differs from that of its input measure(s).

In such a case, it would be incorrect to materialize the derived attribute in its base fact scheme. Therefore, complex measure derivation is considered to be a special case of fact derivation and is handled accordingly, i.e., using materialized views.

Derived dimension categories represent an additional challenge of the relational design as dimension tables need to have full-fledged schemata in order to be usable as targets of foreign key relationships and to reference other dimension tables. Therefore, derived categories are mapped to tables rather than to views.

Initial data loading is performed using an appropriate query while subsequent maintenance relies on triggers (updates of the base dimension tables are propagated to the table of the derived category).

7.1.3 Mapping Dimension Hierarchies

“Snowflaked” dimension storage is achieved by structuring and normalizing dimension hierarchies according to the Third Normal Form (3NF) [41]. As a result, each category (or category type) is mapped to a dimension table of its own, containing the respective dimension level attribute along with all property attributes as well as a foreign key attribute for each outgoing roll-up relationship of that category. Surrogate keys are commonly used to achieve efficient implementation of foreign key references.

Normalized hierarchy maintenance has a number of advantages when dealing with complex dimensions.

For example, it provides a more stable context for adding new hierarchies: new roll-up relationships of an

136 Chapter 7 : Relational Implementation of the Multidimensional Data Model

existing category can be introduced without affecting other categories in the dimension. Snowflake schema appears to be the most appropriate solution for dimensions with complex schemes, e.g., with a large number of levels or containing multiple related and/or independent hierarchies. However, the 3NF appears too restrictive or provides no guidelines for modeling partial related roll-up relationships and irregular hierarchies. For instance, consider a non-covering hierarchy ofprojectin Figure 7.6a. Intuitively, its relational mapping should consist of the dimension tablesPROJECT,OFFICE,BUILDING, andCITY. As for foreign key relationships, the bottom levelprojecthas a twofold roll-up relationship withcity, a direct and a transitive (viaofficeand building) one. Consequently, two conflicting mapping options arise: a reference toCITYshould be stored withinPROJECT, on the one hand, and within its upper levelBUILDING, on the other hand.

Non-strict roll-up relationships represent another challenge: existence of multiple parent-level values for the same element makes it impossible to map this relationship to an atomic-valued parent column.

Apparently, major challenges of mapping dimension schemes to relations arise when dealing with hetero-geneous, ragged, non-balanced, and non-strict hierarchies. Our approach to handling such non-summarizable mappings is based on the concepts ofcontrolled heterogeneityandcontrolled non-strictness, which restore summarizable aggregation behavior within a hierarchy by forcing the latter to undergo a series if transforma-tions at the scheme level and, subsequently, at the instance level. The first phase is concerned with scheme transformation of heterogeneous hierarchies: a dimension scheme is normalized as to eliminate incomplete and overlapping specialization, mixed-grain, and non-covering roll-up relationships. In the second phase, instance normalization is conducted: multiple and generalized hierarchies are decomposed into constituent homogeneous subtrees in order to identify non-onto and non-strict mappings and eliminate those by ap-plying instance normalization techniques. In the next two sections, the above two phases of normalizing non-summarizable hierarchies and obtaining their relational mappings are described in detail.