Characteristics of Big Data

2.3 Validation and enrichment of the Big Data dimensions using Topic Models

2.3.4 Analysis on the dimensional level

Table 2.5: Results of the Topic Model application on a randomly generated corpus

Topic 1 Topic 2 Topic 3 Topic 4 Topic 5

programs process algorithms performance data

code management queries memories mined

analysis development results system behavior

semantics creation complexity caches image

programming governance set disk digital

language services class control research

language studies - monitoring application

verification knowledge - drivers analysis

compiler shared - power techniques

In this section, the topic models have been applied to the overall corpus in order to validate the identified dimensions. In the next section, the topic models will be applied to dimension-specific publications in order to enrich these dimensions.

recent publications entirely.

Within the analyzed corpus, several papers focused on the use of specific Big Data technologies in an industry context, with technological or methodological aspects playing a subordinate role. The overarching category for those publications has been named application. The author chose this term in accordance with its established use in two of the analyzed definitions [Chen et al., 2012; Cuzzocrea et al., 2011].

Furthermore, ii) it became clear that assigning each paper to only one dimension is not suitable, due to the breadth of topics covered by recent publications on Big Data. The results of the assignments can be found in table 2.6.

According to the results, recent publications have focused on the infrastructure aspect, followed by methods and applications. Only twelve publications targeting data-relevant topics were identified manually, which explains why data did not come up as a separate topic when the topic models were applied.

Table 2.6: Number of publications per dimension after the manual assignment

Dimension Number of publications

IT infrastructure 112

Method 99

Application 84

Data 12
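Because a paper may be assigned to more than one dimension, the counts per dimension can add up to more than the number of papers. The following minimal sketch illustrates such a tally; the paper identifiers and assignments are hypothetical placeholders, not data from the study.

```python
from collections import Counter

# Hypothetical manual assignments: paper id -> set of dimensions.
# A paper may belong to several dimensions at once.
assignments = {
    "paper_a": {"IT infrastructure", "Method"},
    "paper_b": {"Application"},
    "paper_c": {"IT infrastructure", "Application"},
    "paper_d": {"Data", "Method"},
}


def count_per_dimension(assignments):
    """Count how many papers were assigned to each dimension."""
    counts = Counter()
    for dimensions in assignments.values():
        counts.update(dimensions)
    return counts


counts = count_per_dimension(assignments)
for dimension, number in counts.most_common():
    print(dimension, number)
```

In this toy example, four papers yield seven assignments, which mirrors why the numbers in a table such as table 2.6 need not sum to the corpus size.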

After the publications have been pre-assigned to the individual dimensions in a manual process, the topic models are applied in the next step to the dimension-specific publications, except for the data dimension due to its low number of related papers.¹⁶ In the following section, the results are discussed with respect to whether and how they account for the Big Data concept.

16 The screening of the literature revealed that data-relevant topics, e.g. data quality management, metadata management, data modelling etc., have not yet been the subject of publications in the field of Big Data.
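The per-dimension step described above can be sketched as follows. This section does not state which topic model variant or implementation was used, so the sketch assumes Latent Dirichlet Allocation fitted by collapsed Gibbs sampling; the toy documents, the hyperparameters, and the MIN_DOCS threshold are hypothetical and serve only to show the control flow, including the exclusion of a dimension with too few papers.

```python
import random
from collections import Counter


def fit_lda(docs, num_topics, iters=100, alpha=0.1, beta=0.01, seed=0):
    """Assumed technique: LDA via collapsed Gibbs sampling.

    Returns one Counter of word counts per topic.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [Counter() for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assign = []
    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token, then resample its topic.
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return topic_word


def top_words(topic_word, n=3):
    """Most frequent words per topic, ignoring zero counts."""
    return [[w for w, c in tw.most_common(n) if c > 0] for tw in topic_word]


# Hypothetical tokenized documents per dimension (not the study's corpus).
corpora = {
    "IT infrastructure": [
        ["cloud", "computing", "cluster"],
        ["hadoop", "mapreduce", "cluster"],
        ["queries", "database", "index"],
        ["database", "stores", "queries"],
    ],
    "Method": [
        ["graph", "network", "social"],
        ["mapreduce", "hadoop", "processing"],
        ["network", "users", "online"],
    ],
    "Data": [["data", "quality"]],
}

MIN_DOCS = 3  # hypothetical cut-off for "too few papers"

topics = {
    dim: top_words(fit_lda(docs, num_topics=2))
    for dim, docs in corpora.items()
    if len(docs) >= MIN_DOCS
}
for dim, words in topics.items():
    print(dim, words)
```

With these placeholder inputs, the data dimension is skipped and each remaining dimension receives its own set of topics, each listed by its most frequent words.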

Table 2.7: Results of the Topic Model application on the publications belonging to the IT infrastructure dimension

Topic 1 Topic 2 Topic 3

cloud queries network

computing database social

cluster stores results

mapreduce search latency

processing analysis traffic

parallel research -

hadoop index -

distributed processing -

platform prototype -

- framework -

2.3.4.1 IT infrastructure dimension

The results of the topic model application on the 112 infrastructure-related publications (mp = 0.86) show a distinction between words related to hardware (Topic 1) and software (Topic 2) (table 2.7). Cloud computing plays a dominant role within the hardware topic.

Although the words cloud and computing do not solely account for a Big Data infrastructure, the increasing amount of data led to a rise in cloud applications, and vice versa; therefore, cloud computing, both as a driver and as an enabling technology, is relevant within a Big Data hardware topic [Argawal et al., 2011]. Furthermore, the MapReduce framework in combination with Hadoop clusters as a platform for the distributed, parallel processing of data plays a major role within the field of Big Data-related hardware.

This dominance is emphasized by the fact that the words come up both in the analysis of the overall corpus and in the dimension-specific corpus.

The words in Topic 2 represent a software-oriented perspective on Big Data, which includes queries processed on databases or stores as subjects of data handling [Feng et al., 2012]. Furthermore, the word combination search, indexing, analysis, and research reflects an application- or task-oriented perspective on the IT infrastructure. Prototype mainly refers to the frameworks developed for test scenarios.

In contrast to several words in Topic 1, the words in Topic 2 are not connected explicitly with Big Data, as they are general infrastructural terms. Topic 3 has lower information value than Topics 1 and 2. Apart from the first two words, network and social, which fit into the aspect of (social) network analysis, none of the remaining words fit into a specific category. The aspect of network analysis will be discussed in the next section.

2.3.4.2 Method dimension

The results of the publications on methods (table 2.8) (mp = 0.71) offer two insights:

i) MapReduce/Hadoop play a major role in the method-related publications (Topic 2), which is plausible because MapReduce is a methodological approach and Hadoop is its implementation. Therefore, whereas publications in the IT infrastructure dimension target the development of clusters, the focus in the methods dimension is on fitting the MapReduce algorithm to requirements related to data characteristics. Reducing the concept of Big Data to MapReduce/Hadoop would not do it justice, but given their open source availability and the comprehensive developer community and support, MapReduce/Hadoop hold an outstanding position within the Big Data concept [McAfee and Brynjolfsson, 2012]. Furthermore, ii) the results of the methods dimension highlight the aspect of networks. This can be found in Topics 3 and 5, supplemented by the word graphs, which demonstrates the increasing relevance of the analysis of online social networks, named as a Big Data-specific method by Chen et al. [2012]. The study and analysis of user behavior targets the analysis of online behavior in social networks and platforms such as Twitter or Facebook, but these are still general, research-related words.

In contrast to Topics 2, 3, and 5, Topics 1 and 4 do not represent distinctly identifiable topics. The words in Topic 1 are generic and do not allow a conclusion to be drawn about an underlying topic. Topic 4 holds general methodological words such as algorithm or cluster and points to machine learning-based classification, which is not necessarily a Big Data-specific topic. Therefore, Topics 1 and 4 have not been considered any further.

Table 2.8: Results of the Topic Model application on the publications belonging to the method dimension

Topic 1 Topic 2 Topic 3 Topic 4 Topic 5

signal mapreduce graph time rate

event processing search machine network

code hadoop studies virtual effective

resolution implementation research results service

parameters computing results cluster systems

- distributed users algorithm users

- systems online mean social

- queries prediction classification temporal

- efficiency network complex method

- cloud engines method analysis

2.3.4.3 Application dimension

The resulting topics in table 2.9 (mp = 0.64) cover a broad range of applications in the Big Data context, but none of them is distinctly identifiable. Topic 1 partly targets the analysis of social networks, which corresponds to Topics 3 and 5 from the methods dimension (table 2.8). This finding emphasizes the relevance of the network topic and its analysis within the Big Data context [Chen et al., 2012]. Topic 2 offers words such as business, challenges, and market, which are generic application-related words that are not distinctive for Big Data; therefore, they do not contribute to a clarification of the application dimension. The same applies to Topic 3. In contrast, Topic 4 contains words such as storage and system, which represent an IT infrastructure aspect within the application dimension, but these words are again generic.

Table 2.9: Results of the Topic Model application on the publications belonging to the application dimension

Topic 1 Topic 2 Topic 3 Topic 4

social business search model

emerging challenges information cloud

internet science time service

bias classification processing storage

information speech results application

research research user requirements

platforms - text systems

processing - emotional evaluation

user - - -

network - - -
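The correspondence between Topic 1 of the application dimension and Topics 3 and 5 of the methods dimension can be made explicit by comparing the word lists. The sketch below uses the words as read column-wise from tables 2.8 and 2.9; exact string matching and the Jaccard index are assumptions chosen for illustration and are not the comparison method used in the study.

```python
# Word lists read column-wise from tables 2.8 and 2.9.
method_topics = {
    "Method Topic 3": {
        "graph", "search", "studies", "research", "results",
        "users", "online", "prediction", "network", "engines",
    },
    "Method Topic 5": {
        "rate", "network", "effective", "service", "systems",
        "users", "social", "temporal", "method", "analysis",
    },
}
application_topic_1 = {
    "social", "emerging", "internet", "bias", "information",
    "research", "platforms", "processing", "user", "network",
}


def shared_words(a, b):
    """Words that occur in both topics (exact match only)."""
    return a & b


def jaccard(a, b):
    """Share of common words relative to all words of both topics."""
    return len(a & b) / len(a | b)


for name, words in method_topics.items():
    common = sorted(shared_words(application_topic_1, words))
    print(name, common, round(jaccard(application_topic_1, words), 2))
```

Under exact matching, user and users count as different words, so the overlap shown here is a lower bound; the word network is shared with both method topics.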

In document OPUS 4 | Empirical development and evaluation of a maturity model for big data applications (pages 46-51)