• Keine Ergebnisse gefunden

Background and Related Work

2.1 Business Intelligence and its Components

2.1.1 Business Intelligence . . . 9 2.1.2 Data Warehousing . . . 11 2.2 OLAP and the Multidimensional Data Model . . . 14 2.2.1 Elements of the Multidimensional Data Model . . . 14 2.2.2 OLAP Operations . . . 17 2.2.3 OLAP Implementation Alternatives . . . 20 2.2.4 Data Warehouse Design Methodology . . . 24 2.3 Visual Analysis and Exploration . . . 29 2.3.1 Visual OLAP as an Emerging Trend . . . 30 2.3.2 Visual Exploration Framework . . . 31

2.1 Business Intelligence and its Components

Due to the fact that data warehousing and its related concepts have been influenced by the technical as well as by the business application experts, there exist discrepancies in the definition of some terms. As this thesis focuses on the database aspects of data warehousing research, we pursue a technical definition perspective.

2.1.1 Business Intelligence

Coined as a term by the Gartner Group analyst Howard Dresner in 1992 [10],Business Intelligence(BI) is a popularized umbrella term encompassing a set of concepts and methods to improve business decision making

9

10 Chapter 2 : Background and Related Work

by using fact-based support systems. Though often used synonymously withdecision support, the former is technically much broader, potentially encompassing knowledge management, enterprise resource planning, and data mining, among other practices. In the abundance of definitions emphasizing various aspects of BI, such as information processing, logistics, assessment, alerting, etc., we adopt the following definition:

“Business intelligence (BI) is a broad category of applications and technologies for gathering, storing, analyzing, and providing access to data to help enterprise users make better business decisions. BI applica-tions include the activities of decision support systems, querying and reporting, online analytical processing (OLAP), statistical analysis, forecasting, and data mining”[171].

Figure 2.1 shows an attempt to structure diverse BI perspectives by arranging them on a two-dimensional plane, as proposed by Gluchowski [46] and modified in [80]. The phases of the analytical data processing are ordered along the vertical axis, whereas the horizontal axis differentiates between the technology and the application focus. Based on the positioning of the application classes, Kemper et al. [80] propose to distinguish between three prevalent definitional levels:

Narrow BI definition is limited to a few core applications with a straightforward decision support function, such as OLAP, Management (MIS) and Executive (EIS) Information Systems.

Analysis-oriented BI definitionencompasses the entirety of end-user tools and applications that en-able interactive analysis. Such tools include data mining, reporting, balanced scorecards, etc.

Broad BI definition goes beyond the tools of the presentation layer by comprising all applications employed for decision support, directly or indirectly. ETL software falls into this category.

OLAP Management IS Executive IS

Technology Application

Data Analysis

Data Acquisition

DataMining Text Mining

Reporting Data Warehouse ETL (Extract, Transform, Load)

standard

ad-hoc

Performance Management / Balanced Scorecards Systems Analytical CRM

Planing / Consolidation

Analysis-oriented BI definition PROCESS

PHASE

FOCUS Narrow BI definition Broad BI definition

Figure 2.1: Definitional levels of Business Intelligence

2.1 : Business Intelligence and its Components 11

2.1.2 Data Warehousing

Data warehousing is a field that has emerged from the integration of a number of different technologies and experiences over the last two decades. W. H. Inmon coined the term“data warehouse”as early as in 1990 with the following definition:

“A data warehouse is a subject-oriented, integrated, non-volatile, and time-variant collection of data in support of management’s decisions. The data warehouse contains granular corporate data”[68].

The four salient characteristics enumerated in the above definition are explicated in Table 2.1.

Table 2.1: Data warehouse characteristics according to W. H. Inmon Characteristic Explanation

Subject-oriented The data is modeled according to the subject area of the respective enterprise, and not according to the application needs of operational systems. The topics of the analysis are enterprise-specific. Thereby, a proper perspective on the data from the decision-maker’s point of view is provided.

Integrated The data fed from multiple sources has to undergo extensive transformations to be brought into a coherent state. The main challenges here are to ensure consistent formatting, naming, data coding, and measurement units.

Non-volatile The data is loaded in a snapshot, static format; existing entries are not supposed to be further manipulated or deleted. Analytical operations areread-only.

Time-variant Each data unit is accurate with respect to some point or period in time. Thetime dimension is used to characterize the validity of the facts. Aggregation along time and evolution in time are the core analysis types in data warehouses.

As the concept of data warehousing matured over time, other definitions were proposed. R. Kimball provided a rather simple but accurate definition of a data warehouse as “a copy of transaction data specifically structured for query and analysis” [81]. Finally, the end user perspective is stressed by Jarke et al. who define a data warehouse as a “collection of technologies aimed at enabling the knowledge worker (executive, manager, and analyst) to make better and faster decisions” [72].

The term“data warehouse system”comprises the data warehouse itself as well as its accompanying com-ponents, such as design and ETL tools, Operational Data Store, metadata repository, analysis and presentation tools of the end-user. A classical reference architecture of a data warehouse system [8, 24, 89, 92, 138, 151, 161] depicted in Figure 2.2 is a refinement of the simplified version from Figure 1.1. We use the 5-layer model proposed in [151], in which each layer encapsulates a different stage of the data flow in the system.

TheData Sources Layerencompasses all information sources, primarily the company’s own operational databases, which function as data suppliers for a warehouse. In addition to the internal data, external data sources, such as third-party demographic and statistical databases, market research reports, and web docu-ments are frequently used to enrich the analysis base.

The task of the ETL Layer is to extract data from heterogeneous sources, cleanse it into a consistent state, transform it according to the target schema, and, finally, load it into the data warehouse. A set of activities required to populate data warehouses and OLAP applications with cleansed, consistent, integrated,

12 Chapter 2 : Background and Related Work

3rd layer: DATA WAREHOUSE

Data Warehouse

Metadata

Extractor

Enterprise ERM Resource

Management legacy systems operational DBs external sources 2nd layer: ETL

1st layer: DATA SOURCES

unstructured data cleansed raw data

Extractor

Staging area Data Mart Data Mart

Monitoring Administration

Archiving system Data Mart

Operational Data Store

4th layer: ANALYSIS

OLAP Data Mining DSS methods

5th layer: PRESENTATION

web frontend OLAP frontend Data Mining tool DSS frontend spreadsheet

Figure 2.2: A multi-layer data warehousing system architecture

and probably summarized data is described by the termETL(Extract, Transform, Load). The ETL process is subdivided into two phases: i)definition and ii) execution. The definition phase can be interpreted as a specification of the data warehouse metadata objects, whereas the execution phase uses the framework defined in the first phase to carry out the actual data loading routine (on a time-driven or even-driven basis) [89]. The entire latter phase takes place in a designated storage system calledstaging area, which is known as the “back room” portion of the data warehouse environment that lies out of bounds for end-users.

From the ETL layer, the transformed data is transferred to theData Warehouse Layer, which is a special purpose database along with its metadata repository. Ideally, there exists a single centralized data ware-house consolidating the entire company’s data. In practice, however, organizations frequently switch to a decentralized data mart architecture. Adata martis a departmental data warehouse subset focused on a spe-cific subject. As for the relationship between the data warehouse itself and the data marts, there exists two paradigms: Inmon suggests that data marts source their information from the enterprise-wide data warehouse, whereas Kimball defines the data warehouse to be the union of all data marts within the enterprise [67]. While current practices tend to be closer to Kimball’s approach, there is an emerging trend towards Inmon’s vision.

Data marts can coexist with the main data warehouse or be fully decentralized, each disposing of its own ETL tools. The disadvantage of the latter approach is the danger of facing integration problems if the initial design does not reflect a complete business model [24, 151].

Another integral component of the data warehouse layer is the metadata repository that enables shared

2.1 : Business Intelligence and its Components 13

access to metadata by various tools and processes. Metadata refers to the data required for managing the data warehouse and comprises administrative (setup, configuration, database objects and rules), business (definitions, ownership, access rights), and operational (origin, currency, usage statistics) metadata [24]. In the world of data warehousing, the administrative metadata, which describes the structure of the available data, functions as an index to the actual contents as the former allows the end-user to navigate through the data and analyze it interactively [68].

Operational Data Store(ODS) is a storage structure residing outside of the data warehouse environment and providing integrated real-time detailed data obtained from operational systems. In contrast to a data warehouse, which contains historical and summarized data, an ODS stores current operational data to support the demand for near-real-time data, e.g., for operational reporting or tactical decision making. Considering the growing analytical interest for fine-grained real-time data, Kimball proposes to relocate the ODS by coupling it tightly with the data warehouse as the “front edge” of the latter [84].

TheAnalysis Layerencompasses analysis methodologies and techniques, such as OLAP and data mining, which form the basis for the end-user BI tools. The termOLAP(On-Line Analytical Processing), synonymous tomultidimensional data analysis, was coined in 1993 by the inventor of the relational data model E. F. Codd to describe a kind of software that analyzes business data in a top-down hierarchical fashion:

“OLAP is the name given to the dynamic enterprise analysis required to create, manipulate, animate, and synthesize information from exegetical, contemplative, and formulaic data analysis models. . . This includes the ability to discern new or unanticipated relationships between variables, the ability to identify the param-eters necessary to handle large amounts of data, to create an unlimited number of dimensions (consolidation paths), and to specify cross-dimensional conditions and expressions”[30].

Data miningtechniques provide advanced predictive and analytical functionality by identifying distribu-tion patterns (segmentadistribu-tion), characteristic behaviors (classificadistribu-tion) and reladistribu-tionships (associadistribu-tion) within a dataset [24, 151].

Finally, thePresentation Layerconsists solely of the frontend analytical applications, commonly referred to as BI tools, for reporting, querying, and mining the data. Presentation tools differ in the degree of freedom (pre-defined vs. ad hoc queries), complexity and customizability to individual requirements. With the recent advancements in the Internet technology, there is a clear trend towards browser-based frontend solutions.

Another trendy development is a unification of diverse analysis toolkits in a comprehensive BI platform, based on an organization-wide business model. Examples of mature BI platform solutions are Oracle BI Suite [133] and BusinessObjects Enterprise [14]. A pioneering initiative in developing an open-source BI suite is Pentaho BI Platform [139], which is entirely web-based and highly customizable.

Obviously, each data warehouse solution has to be designed and implemented individually and in accor-dance with specific requirements of a given organization and the application domain. However, there are certain general characteristics that make out the advantages of the data warehousing approach [92, 151]:

availability of an organization-wide unified and consistent data model, ability to benefit from external data sources,

maintenance of historic and summarized data in a separate database, no interference with operational sources,

optimized performance for complex queries, accessibility to a wide range of users, user-friendly frontend tools.

Subject-orientation guarantees applicability of data warehousing to virtually any application domain.

Data warehouse solutions are employed in trade, finance, banking, insurance, production etc. Initially de-signed for satisfying decision support needs of business enterprises, the performance-oriented BI approach has recently found its way to a multitude of non-conventional applications, such as life sciences, health-care,

14 Chapter 2 : Background and Related Work

academia, and government, to name a few distinguished fields. In the next section we take a closer look at the OLAP technology and investigate what factors contribute to its nearly universal applicability.