Characteristics of Big Data
2.3 Validation and enrichment of the Big Data dimensions using Topic Models
2.3.4 Analysis on the dimensional level
Table 2.5: Results of the Topic Model application on a randomly generated corpus
Topic 1 Topic 2 Topic 3 Topic 4 Topic 5
programs process algorithms performance data
code management queries memories mined
analysis development results system behavior
semantics creation complexity caches image
programming governance set disk digital
language services class control research
language studies - monitoring application
verification knowledge - drivers analysis
compiler shared - power techniques
In this section, the topic models have been applied on the overall corpus in order to validate the identified dimensions. In the next section, the topic models will be applied to dimension-specific publications in order to enrich those.
recent publications entirely.
Within the analyzed corpus, several papers focused on the utilization of specific Big Data technologies in an industry context, with only subordinate consideration of technological or methodological aspects. The overarching category for those publications has been named application. The author chose this notion in accordance with its already established use in two of the analyzed definitions [Chen et al., 2012; Cuzzocrea et al., 2011].
Furthermore, ii) it became clear that assigning each paper to only one dimension is not suitable, due to the breadth of topics covered by the recent publications on Big Data. The results of the assignments can be found in table 2.6.
Following the results, recent publications have focused on the infrastructure aspect, followed by methods and applications. Only twelve publications, identified manually, target data-relevant topics, which explains why this dimension did not emerge as a separate topic when the topic models were applied.
Table 2.6: Number of publications per dimension after the manual assignment
Dimension Number of publications
IT infrastructure 112
Method 99
Application 84
Data 12
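The relative weight of the dimensions in table 2.6 can be made explicit with a small calculation. The following is only an illustrative sketch: the counts are copied from the table, while the function name `assignment_shares` and the rounding to one decimal place are assumptions made for this example. Because a paper may be assigned to more than one dimension, the total counts assignments, not distinct publications.

```python
# Counts copied from Table 2.6. A paper may appear in more than one
# dimension, so the sum is the number of assignments, not of papers.
assignments = {
    "IT infrastructure": 112,
    "Method": 99,
    "Application": 84,
    "Data": 12,
}


def assignment_shares(counts):
    """Return each dimension's share of all assignments, in percent."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


total_assignments = sum(assignments.values())
shares = assignment_shares(assignments)

# Print the dimensions from the largest to the smallest share.
for name, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<18}{assignments[name]:>4}  {share:>5.1f} %")
```

Under these assumptions, the data dimension accounts for roughly 4 % of the 307 assignments, which illustrates how small the data-related share is compared with the other three dimensions.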
After the publications have been pre-assigned to the individual dimensions in a manual process, the topic models are applied in the next step to the dimension-specific publications, except for the data dimension due to its low number of related papers (see footnote 16). In the following section, the results are discussed with respect to the extent to which and how they account for the Big Data concept.
16 The screening of the literature revealed that data-relevant topics, e.g. data quality management, metadata management, and data modelling, have not yet been the subject of publications in the field of Big Data.
Table 2.7: Results of the Topic Model application on the publications belonging to the IT infrastructure dimension
Topic 1 Topic 2 Topic 3
cloud queries network
computing database social
cluster stores results
mapreduce search latency
processing analysis traffic
parallel research -
hadoop index -
distributed processing -
platform prototype -
- framework -
2.3.4.1 IT infrastructure dimension
The results of the topic model application on the 112 infrastructure-related publications (mp = 0.86) show a distinction between words related to hardware (Topic 1) and software (Topic 2) (table 2.7). Cloud computing plays a dominant role within the hardware topic.
Although the words cloud and computing do not solely account for a Big Data infrastructure, the increasing amount of data led to a rise in cloud applications, and vice versa; therefore, cloud computing, both as a driver and as an enabling technology, is relevant within a Big Data hardware topic [Argawal et al., 2011]. Furthermore, the MapReduce framework in combination with Hadoop clusters as a platform for the distributed, parallel processing of data plays a major role within the field of Big Data-related hardware.
This dominance is emphasized by the fact that these words come up both in the analysis of the overall corpus and in the dimension-specific corpus.
The words in Topic 2 represent a software-oriented perspective on Big Data, which includes queries processed on databases or stores as subjects of data handling [Feng et al., 2012]. Furthermore, the word combination search, indexing, analysis, and research reflects an application/task-oriented perspective on the IT infrastructure. Prototype mainly refers to the frameworks developed for test scenarios.
In contrast to several words in Topic 1, the words in Topic 2 are not connected explicitly with Big Data, as they are general infrastructural topics. Topic 3 has lower information value than Topics 1 and 2. Apart from the first two words, network and social, which fit into the aspect of (social) network analysis, none of the remaining words fits into a specific category. The aspect of network analysis will be discussed in the next section.
2.3.4.2 Method dimension
The results of the publications on methods (table 2.8) (mp = 0.71) offer two insights:
i) MapReduce/Hadoop play a major role in the method-related publications (Topic 2), which is plausible because MapReduce is a methodological approach and Hadoop its implementation. Whereas publications in the IT infrastructure dimension target the development of clusters, the focus in the methods dimension is on fitting the MapReduce algorithm to requirements that stem from data characteristics. Reducing the concept of Big Data to MapReduce/Hadoop would not do it justice, but with regard to their open source availability and comprehensive developer community and support, MapReduce/Hadoop hold an outstanding position within the Big Data concept [McAfee and Brynjolfsson, 2012]. Furthermore, ii) the results of the methods dimension highlight the aspect of networks. This can be found in Topics 3 and 5, supplemented by the word graph, which demonstrates the increasing relevance of the analysis of online social networks, named as a Big Data-specific method by Chen et al. [2012]. The study and analysis of user behavior targets online behavior in social networks and on platforms such as Twitter or Facebook, but these are still general, research-related words.
In contrast to Topics 2, 3, and 5, Topics 1 and 4 do not represent distinctly identifiable topics. The words in Topic 1 are generic and do not allow a conclusion to be drawn about an underlying topic. Topic 4 holds general methodological words such as algorithm or cluster and points to machine learning-based classification, which is not necessarily a Big Data-specific topic. Therefore, Topics 1 and 4 have not been considered any further.
Table 2.8: Results of the Topic Model application on the publications belonging to the method dimension
Topic 1 Topic 2 Topic 3 Topic 4 Topic 5
signal mapreduce graph time rate
event processing search machine network
code hadoop studies virtual effective
resolution implementation research results service
parameters computing results cluster systems
- distributed users algorithm users
- systems online mean social
- queries prediction classification temporal
- efficiency network complex method
- cloud engines method analysis
2.3.4.3 Application dimension
The resulting topics in table 2.9 (mp = 0.64) cover a broad range of applications in the Big Data context, but none is distinctly identifiable. Topic 1 partly targets the analysis of social networks, which corresponds to Topics 3 and 5 from the methods dimension (table 2.8). This finding emphasizes the relevance of the network topic and its analysis within the Big Data context [Chen et al., 2012]. Topic 2 offers words such as business, challenges, and market, which are generic application-related words that are not distinctive for Big Data; therefore, they do not contribute to a clarification of the application dimension. The same holds for Topic 3. In contrast, Topic 4 contains words such as storage and system, which represent an IT infrastructure aspect within the application dimension, but again these words are generic.
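The correspondence between Topic 1 of the application dimension and Topics 3 and 5 of the methods dimension can be checked directly on the word lists. The sketch below is an illustration only, not the procedure used in this study: the word sets are copied from tables 2.8 and 2.9, while the helper `shared_words` and the choice of exact string matching are assumptions made for this example.

```python
# Top words copied from Table 2.9 (application dimension, Topic 1).
application_topic_1 = {
    "social", "emerging", "internet", "bias", "information",
    "research", "platforms", "processing", "user", "network",
}

# Top words copied from Table 2.8 (method dimension, Topics 3 and 5).
method_topics = {
    "Topic 3": {
        "graph", "search", "studies", "research", "results",
        "users", "online", "prediction", "network", "engines",
    },
    "Topic 5": {
        "rate", "network", "effective", "service", "systems",
        "users", "social", "temporal", "method", "analysis",
    },
}


def shared_words(a, b):
    """Return the words that two topics have in common, sorted alphabetically.

    Matching is by exact string (an assumption of this sketch), so
    'user' and 'users' are treated as different words.
    """
    return sorted(a & b)


overlap = {
    name: shared_words(application_topic_1, words)
    for name, words in method_topics.items()
}

for name, words in overlap.items():
    print(f"Application Topic 1 vs. method {name}: {words}")
```

With exact matching, network is the only word shared with both method topics; research is shared with Topic 3 and social with Topic 5. Stemming the words first (so that user and users coincide) would be a natural refinement, but it is not applied here.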
Table 2.9: Results of the Topic Model application on the publications belonging to the application dimension
Topic 1 Topic 2 Topic 3 Topic 4
social business search model
emerging challenges information cloud
internet science time service
bias classification processing storage
information speech results application
research research user requirements
platforms - text systems
processing - emotional evaluation
user - - -
network - - -