• Keine Ergebnisse gefunden

Elements of the Multidimensional Data Model

Background and Related Work

2.2 OLAP and the Multidimensional Data Model

2.2.1 Elements of the Multidimensional Data Model

The multidimensional data model is aggregation-centric, i.e., it uses numeric measures as its analysis objects [24]. A fact entry is identified at the finest available granularity and normally corresponds to a single transac-tion or an event. Parameters that determine the values of a measure are referred to as dimensions. Consider an example from the domain of academic administration, with thenumber of enrollmentsas the measure of interest. The associated dimensions areSemester,Degree, andCountry. Often, the members of a dimension are organized in a containment-type hierarchy to enable additional aggregation levels. For instance, single countries inCountrydimension can be grouped into subcontinents and, subsequently, into continents.

DATA CUBES

A natural representation of a set of fact entries along with the associated dimensions and their hierarchies is known as amultidimensional data cube, or ahypercube. The number of dimensions in a cube corresponds to the dimensionality of its contained measure(s). Each dimension can be interpreted as an axis in a multidi-mensional space with the dimension’s member values as its coordinates. Finally, each cell contains a value of the measure defined by that cell’s dimensional coordinates.

Depending on the application, cubes may range from dense (each cell has a non-null measure value) to sparse (significant portion of empty cells) ones. The sparsity tends to increase with the increasing number of dimensions and with the increasing granularity within a dimension. Figure 2.3 shows a sample 3-dimensional cube (fragment) with student enrollment numbers. In addition to the original fact entries (white cells), slices of subaggregates, computed for each combination of dimensions (colored cells) as well as the absolute total value (a dark-grey cell), are shown. Despite the implied 3-dimensionality of the termcube, in OLAP there is no limitation on the number of dimensions in a cube.

DIMENSIONS

Dimensions represent a crucial concept in OLAP as they provide the context for analyzing the facts – analysts use them to filter the data and aggregate it to the desired level of detail. Each dimension of a cube corresponds to an axis in a multidimensional data space. The values of a dimension are calleddimension elements, or members. Members may be arranged into aclassification hierarchy, composed of multiple levels, with each level representing a distinct granularity within the dimension. Each member is mapped to aclassification nodein the hierarchy, with the members of the finest grain as the leaf nodes [8].

2.2 : OLAP and the Multidimensional Data Model 15

Figure 2.3: A sample 3-dimensional cube (fragment) storing student enrollment numbers (white cells) along with its 2-dimensional, 1-dimensional, and 0-dimensional projections (colored cells)

In the multidimensional data model dimensions are represented via theirclassification schemes, in which each level of the hierarchy, denoted aclassification level, orcategory, is a node connected by an edge to its parent level. In some literature, the classification scheme is termedintension, with the actual classification hierarchy building itsinstance, orextension[142]. Fact entries are linked to a dimension exclusively at the bottom level of the latter, whereas upper levels are used for querying the respective aggregates. Distinct paths in a classification scheme are referred to asconsolidation paths. A classification level is allowed to be composed of multiple attributes.

The analysis may be further enhanced by defining multiple classification hierarchies within the same dimension. ConsiderDegree dimension in the aforementioned student enrollment example. The bottom-level members are the degrees into which students are enrolled. A degree is composed of the attributesStudy Subject(e.g., “Physics”) andDegree Type(e.g., “Bachelor”). Intuitively, single degree values can be grouped byStudy Subject, on the one hand, and byDegree Type, on the other hand. Figure 2.4 shows the graph of the resulting aggregation paths withinDegreeas well as the actual data hierarchy behind each aggregation path.

The attribute upon which the hierarchy is defined is called theanalysis criterion. Hierarchies within a dimension may refer to the same or to different analysis criteria. Hierarchies depicted in Figure 2.4 refer to various criteria: one classification is based onDegree Type, while the other draws uponSubject. Attributes holding non-hierarchical characteristics of the category’s members are calledproperties. In our example, Department category inDegree dimension may have properties such as Dean, Location, andFoundation Date. Property role of any attribute is not global but is limited to the context of a given hierarchy. For example,Locationattribute ofDepartmentis a property in the context of the subject hierarchy inDegree. However,Locationmay be used as an analysis criterion for defining a hierarchy of department locations, such asLocationÕDistrictÕCity.

Notice that dimension hierarchies are normally defined in terms of the partial ordering, i.e., as parent-child relationships between its members with no ordering between the members of the same level. However, for some dimensions, such as time, the total ordering on its members is given. The conventional OLAP approach requires dimension hierarchies to be strict, regular and balanced to ensure correct aggregation.

16 Chapter 2 : Background and Related Work

Degree Faculty Degree Type Department

Subject

ALL

Diploma Master

Bachelor

Comp. Science, B.Sc. Comp. Science, M. Sc. Mathematics, Dipl.

Physics, B. Sc.

Physics & Chemistry Maths & Comp. Science

Mathematics Comp. Science

Physics

ALL

Mathematics Comp. Science

Physics

Comp. Science, B.Sc. Comp. Science, M. Sc. Mathematics, Dipl.

Physics, B. Sc.

Degree Faculty Degree Type Department

Subject

Figure 2.4: DimensionDegreewith multiple hierarchies: scheme (left) and instances (right)

FACTS

Facts represent the subjects of the analysis. Thegranularity, or level of detail, of the facts corresponds to the atomic level of the modeled business subject. For example, the granularity of the student enrollment facts from the cube depicted in Figure 2.3 is given by degree, country, and semester. The finest granularity and homogeneity are ensured by requiring each fact entry to map to the bottom level in each of its dimensions.

Kimball identifies three fundamental types of facts [81]:

Transactionalfacts track the occurrence of events, with each detailed event captured into a fact entry and the measures being additive across all or most of the dimensions.

Periodic snapshotscapture the states of an entity at given points in time, such as inventory levels or account balance. The same entity (e.g., a product) and its state is registered as a new fact for each point in time. The measured states are not additive across time but additive across other dimensions.

Accumulating (cumulative) snapshotscapture the states of an entity up to a certain point in time with respect to some initial point, common for all facts. An example of cumulative snapshots is the total number of website visitors recorded at daily basis. The measures are non-additive over time (as they already hold accumulated values) but are additive across other dimensions.

Apart from the measurable fact types listed above, there exist useful fact types that do not contain any measures, i.e., consist of nothing but a set of dimensions. Kimball denotes such factsfactlessand names their common usage scenarios [83]:

Event-trackingfacts simply store the occurrence of an event itself (e.g., every single student enrollment case) without any specific measure of interest.

Coveragefacts are useful for tracking whether something has or has not happened. An example of a coverage fact type is the entirety of individual student application records that have or have not resulted

2.2 : OLAP and the Multidimensional Data Model 17

in an enrollment. Such facts are useful for computing the acceptance rate.

Anymany-to-many relationshipis a fact by definition.

Many real-world scenarios combine various types of facts to support complementary analysis tasks.

MEASURES

Measure is a fact property that the users want to analyze, predict, or optimize. The “holy grail” of the multidimensional design is that the most useful measures are numeric,continuously valued, and additive [81]. A measure of a query is defined by specifying a formula, usually a simple aggregate function, such as sum, that combines several measure values into one [143]. The measure attribute(s) and the formula should be chosen as to provide a meaningful value for all aggregation levels. Calculated, or derived, measures result from applying a function to one or more measure values pertaining to the same fact entry. An example of a derived measure is amount, obtained by multiplying price with quantity.

With respect to its aggregation behavior, each measure falls into one of the following three classes:

Fully additivemeasures can be summed up through all of the fact’s dimensions. The number of enroll-ments in the cube in Figure 2.3 is an example of such a measure.

Semi-additivemeasures may be totalled along some of the dimensions but not all of them. This kind of measures is typically encountered in snapshot facts, in which the values may not be aggregated over time.

Non-additivemeasures may be added up along no dimension at all. Values of type measurement or metric, such as age, height, or density are examples of numeric values non-additive in many contexts.

Measure additivity is determined at design time and stored in the data warehouse as metadata to be used for checking the validity of attempted queries.

Aggregate functions can be classified into three categories [51]:

Distributivefunctions, such asCOUNT(),MIN(),MAX(), andSUM()can be computed by partitioning their input into disjoint sets, aggregating each set individually, and then aggregating individual subag-gregates into the final result.

Algebraicfunctions can be expressed as a scalar function withM arguments (whereM is a bounded integer), each of which is obtained by applying a distributive aggregate function. For example,AVG() can be expressed asSUM()/COUNT().

Holisticfunctions cannot be computed by partitioning as the whole input is required for computing each output value. Common examples of such functions areMEDIAN(),MostFrequent(), andRANK().

Characterization of aggregate functions is necessary for determining theirself-maintainability, i.e., whether new aggregates can be computed directly from the old output of the function and from the changes to the base data [100].