Motivation - Flexible processing of streamed context data in a distributed environment

The amount of data we face every day grows steadily, and the velocity at which new data is produced increases as well. Eric Schmidt, CEO of Google, once estimated the size of the World Wide Web (WWW) at about 5 EB (1 EB = 10¹⁸ bytes = 1,000,000,000,000,000,000 bytes) [27]. This figure covers only WWW-related data; non-WWW-related data is also being produced, and the total amount grows every day. In 2009, Andreas Weigend, former chief scientist of Amazon, predicted "that human beings would generate more data in 2009 than in all prior human history" [135]. According to Eric Schmidt, in 2011 we were adding this amount of data to the human database (human DB) every two days. This human DB is composed of many different data sources, including the WWW with all its publicly available unstructured and structured data, rather private structured data held in scientific or medical DBs, and sensors producing a continuous stream of data, e.g. to sense the environment. This can be summed up by the term big data. Big data characterizes the acknowledgment of "the exponential growth, availability and use of information in the data-rich landscape of tomorrow" [56]. According to Gartner, solving the big data challenge involves more than just managing high volumes of data [56]. This becomes evident if we look at the huge number of different sources from which data is available: the public data present in the WWW, the rather private scientific data stored in DBs, and the streams of highly volatile and time-critical data.

Thus, not only the volume of data but also its variety and velocity must be taken into account in order to keep pace with the highly dynamic and manifold nature of data [56].

While research has proposed many interesting and efficient solutions for processing unstructured as well as structured data of a static nature, for the dynamic nature of streamed data there are still open questions w.r.t. the big data issue for context-aware applications. This especially includes providing concepts for, and coping with, the integration of highly domain-specific functionality for applications relying on the data stream processing paradigm. The term static here means that the update frequency is low compared to data streams, which we define as dynamic since updates most likely occur with high frequency. A data stream is characterized as a potentially infinite flow of data elements from one or more data sources. The processing of such data streams is typically done in two steps: data elements are collected from the data sources and are then processed according to a processing definition consisting of a defined set and ordering of operators, which are interconnected to form a processing pipeline. Over the past few years, data stream processing has been a focus of research all over the world. Research ranges from proposals for Data Stream Processing System (DSPS) architectures [2, 14, 57] through adjusted stream processing techniques [112, 126, 153] to query distribution and re-use, resulting in increasingly sophisticated techniques [10, 153].

The term DSPS will be used as a synonym for Data Stream Management System (DSMS) throughout this thesis. Centralized approaches such as [14] also exist; however, this thesis will not differentiate further between the two DSPS variants and will assume that DSPSs are distributed, as this variant constitutes the current state of the art.

This thesis covers the Database Management System (DBMS)-oriented perspective on data stream processing. In the DB context, this was first mentioned by Babcock et al. [16]. They discuss the question of what a management system for the domain of data stream processing, i.e. a DSPS, should look like. Conversely, this means transferring the management functionalities of DBMSs and adapting them to meet the specific requirements of DSPSs. However, DSPSs never reached as large a community as DBMSs did. Depending on the domain of interest, e.g. context-aware visualization, the processing of such data is often related to highly domain-specific functionality. This domain-specific functionality is, among other things, specified in terms of highly specialized operators that may require specialized hardware to run smoothly. In the context of visualization, for example, an operator that renders a scenery requires a Graphics Processing Unit (GPU). The seamless integration of these highly specialized operators into DSPSs is a key feature for addressing and adequately supporting a wide range of applications relying on the data stream processing paradigm. This is because the potential infiniteness of data streams prohibits their storage and postponed processing. In addition, data transfers must be reduced to a minimum to permit efficient processing of data streams, since high data volumes are assumed. This creates a strong dependency between application requirements on the one side and system capabilities on the other. This fact must be taken into account by DSPSs [43].

It could reasonably be contended that this application-specific functionality has to be developed anyway in order to make the application finally work. However, as described in [43], an adaptation problem still persists. Usually, DSPSs provide a generic querying and processing mechanism to process the streamed data in an application-independent manner. This especially means that operators are rather generic and resemble those of DBMSs. But context-aware applications rely on models of the physical world, which often have different data formats. Such a model of the physical world is given by static context information such as map data and 3D models as well as dynamic information from billions of sensors located in our physical environment, e.g. Global Positioning System (GPS) sensors in mobile devices. Context-aware applications also rely heavily on highly specialized data operations, as discussed in Chapter 2. Sensors and, more generally, data sources such as position data of moving objects continuously produce streamed data that is consumed by context-aware applications. A prominent example of context-aware applications are location-aware applications, which rely on context information regarding their surroundings with respect to the current position. For these reasons, the model of providing rather fixed querying and processing mechanisms does not hold for the domain of stream processing as regards context-aware applications.

Figure 1.1: Application scenario of a mobile context-aware application tracing friends.

As depicted in Figure 1.1, context-aware applications may produce data streams (denoted by B^m) and at the same time consume data streams (denoted by A^m). In this scenario, a user running a mobile context-aware application A^m wants to visualize a map of their surroundings. The map displays the user's friends pinned to their current locations. The friends have mobile devices B^m with numerous integrated sensors that allow them to sense the environment. For example, they provide their current position originating from a GPS sensor producing a continuous stream of position data elements. However, to get a nicely displayed map of its surroundings, the mobile application also needs additional data originating from third-party servers C^m located in the WWW. These could be servers providing map data or personal data of the friends of interest.

The mobile context-aware application is designed to receive the resulting image of the surrounding map. Therefore, the position data originating from the mobile devices' GPS sensors (denoted by the source operator S1) and the personal data originating from third-party servers (denoted by the source operator S2) must be processed according to a stream processing graph (SP graph), which is executed within a DSPS. An SP graph is usually not processed on a single machine but distributed across different machines. These are denoted by the different dashed white boxes. An SP graph consists of a number of sources S, operators O, and sinks T that are interconnected, building a network of operators¹. The sources S1 and S2 provide the data streams for the operators O1 and O2. O1 combines the data originating from the two sources. To this end, each mobile device (representing a friend) is connected to a social media profile on third-party servers. In order to augment the position of a mobile device with additional data, the respective data must be extracted from third-party servers for each device. O2 receives the preprocessed data and renders an image of the scenery. This operator relies heavily on specialized hardware, i.e. a GPU, thus making a seamless and flexible integration of specialized operators mandatory. The result is sent to the sink T, representing the mobile context-aware application.
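The SP graph described above can be sketched as a chain of Python generators. This is only a minimal illustration under stated assumptions, not the DSPS discussed in this thesis: the names `Position`, `source_s1`, `operator_o1`, `operator_o2`, `sink_t`, and `run_sp_graph` are hypothetical, the source S2 is modelled as a simple dictionary lookup, the rendering step is replaced by a text label instead of a GPU-rendered image, and everything runs on one machine rather than being distributed.

```python
from typing import Dict, Iterable, Iterator, List, NamedTuple, Tuple


class Position(NamedTuple):
    """One data element of the position stream (hypothetical layout)."""
    device_id: str
    lat: float
    lon: float


def source_s1(readings: Iterable[Position]) -> Iterator[Position]:
    """S1: emits position elements one at a time; the input may be unbounded."""
    for reading in readings:
        yield reading


def operator_o1(positions: Iterable[Position],
                profiles: Dict[str, str]) -> Iterator[Tuple[str, Position]]:
    """O1: combines each position with profile data (S2 modelled as a lookup)."""
    for pos in positions:
        name = profiles.get(pos.device_id)
        if name is not None:  # drop devices without a known profile
            yield name, pos


def operator_o2(labelled: Iterable[Tuple[str, Position]]) -> Iterator[str]:
    """O2: stand-in for rendering; produces a text label, not an image."""
    for name, pos in labelled:
        yield f"{name}@({pos.lat:.3f},{pos.lon:.3f})"


def sink_t(frames: Iterable[str]) -> List[str]:
    """T: consumes the results; here it simply collects them in a list."""
    return list(frames)


def run_sp_graph(readings: Iterable[Position],
                 profiles: Dict[str, str]) -> List[str]:
    """Wires S1 -> O1 -> O2 -> T into one pipeline."""
    return sink_t(operator_o2(operator_o1(source_s1(readings), profiles)))
```

Because each stage pulls elements lazily from its predecessor, no stage has to store the whole stream, which mirrors the observation that potentially infinite streams cannot be stored and processed later. Only the sink in this sketch materializes its input, so it must be fed a finite sequence.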

Knowing the bandwidth requirements of this application, an application developer can clearly identify the specific QoS requirements for the correct distribution of the SP graph. These requirements are a good indicator for the DSPS when deciding how best to distribute the SP graph to meet the application requirements. It is important to note that many other applications might exist within the same DSPS, and these applications might have different requirements. For example, an interactive stream-based game application first and foremost needs a fast and reactive SP graph, i.e. it needs latency to be minimized.

Moreover, users participating in the process might not want to expose their current location to potentially unknown parties, restricting, for example, data access to known or trusted parties only. Therefore, in addition to the flexible integration of specialized operators, security aspects must also be considered, limiting access to data as well as the granularity at which data is made available. In the sample scenario, the resulting data for the third-party server might be less accurate, indicating only the state in which a user currently is and thereby obfuscating his or her actual position.
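The idea of limiting the granularity at which position data is made available can be illustrated with a small sketch. It rests on assumptions that are not part of the scenario above: the consumer classes in `GRANULARITY`, the function name `coarsen`, and the use of coordinate rounding as the obfuscation technique are all hypothetical. Mapping a position to a named region such as a state would additionally require geographic boundary data, which is omitted here.

```python
from typing import Tuple

# Hypothetical consumer classes mapped to the number of decimal places
# of latitude/longitude that each class is allowed to see.
GRANULARITY = {"trusted": 5, "known": 2, "third_party": 0}


def coarsen(lat: float, lon: float, consumer: str) -> Tuple[float, float]:
    """Rounds a position to the precision permitted for the given consumer.

    Unknown consumer classes fall back to the coarsest level (0 places).
    """
    places = GRANULARITY.get(consumer, 0)
    return round(lat, places), round(lon, places)
```

With these assumed levels, `coarsen(48.7758, 9.1829, "known")` exposes the position only to two decimal places, while an unlisted consumer receives whole degrees.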

In the scenario sketched here, it seems reasonable not to perform the combination and rendering of the data on the mobile device. Doing so is not advisable, since potentially high-volume data transfers over mobile networks are problematic. Moreover, the processing power of mobile devices is often insufficient. Thus, these operations should be performed close to the original data and, furthermore, in an adequate environment, i.e. a DSPS. But many DSPSs lack the integration of specialized operators. Moreover, the particular requirement characteristics of the operators are not considered either.

To avoid context-aware applications consisting of many isolated application parts "knitted together in a hurry", DSPSs should provide an adequate integration mechanism for such applications. This reduces redundancy and increases reuse of existing functionality. That is, for

¹ Sources and sinks are special operators. Sources exclusively produce data whereas sinks exclusively consume data.

