Start: 08:15 am CET
Daniel Kocher
Salzburg, Summer term 2021
Department of Computer Sciences University of Salzburg
Final Exam
Regular date/time: June 28, 2021, 08:00 - 09:30 am CET
Partially overlaps with another exam (starting at 09:30 am CET).
Options:
1. Find another time, e.g., start at 07:30 am CET.
2. Find another date, e.g., June 29, 2021, 3:00 pm CET.
3. Start a poll in the PLUS Umfragetool1.
1https://umfrage.sbg.ac.at
Data Processing
Literature, Sources, and Credits
Literature:
• Silberschatz et al. Database System Concepts. McGraw Hill, Sixth Edition, 2010. In particular Chapter 10 – Big Data.
Credits: These slides are partially based on slides of other lectures.
• Slides of Silberschatz et al. Database System Concepts. McGraw Hill, Sixth Edition, 2010. In particular Chapter 10 – Big Data.
Motivation
“Big data” that does not fit on a single machine needs to be processed
⇒ Higher degree of distribution and parallelism
Volume: Thousands of machines (nodes) are required to store and process the data.
Velocity: Data arrives at a very high pace and needs to be processed immediately to respond to certain events.
Variety: Different data formats are used for different purposes and may need to be processed collectively (e.g., logs and the actual data of an application).
Data Sources
The web and its various applications ⇒ Web logs.
• Recommendations
• User interaction patterns
• Advertisement
• ...
Smartphone apps and data about the user interactions.
Sensors report data continuously and at a very high pace ⇒ “Internet of things”.
Data from social media platforms.
Metadata in communication networks to predict/prevent problems.
Motivation
Similar to database systems, many companies developed their own solutions to process these large amounts of data.
Problems?
• Satisfying the performance requirements is not easy.
• Parallelism.
• Load balancing.
• ...
• Dealing with failures in the distributed environment is not trivial.
• ...
Goal: A framework that implements this functionality transparently.
• Allow complex data processing tasks.
• Transparent and automatic parallelization of the tasks.
• Built-in and transparent fault tolerance.
Data Storage
In Part I – Data Management, we have covered:
• Different models and systems to store complex data.
• Parallel and distributed database systems.
• Fragmentation (aka sharding) and replication.
We have not yet heard about distributed file systems.
Every computer/operating system has a local file system (FS). Effectively, the file system takes care of how the data is stored on your hard disk and how the user can retrieve it. Furthermore, the FS implements a common interface to access the files.
A distributed file system (DFS) provides the same functionality across a cluster of nodes transparently, i.e., the user interacts with the distributed FS as if it were a local FS.
Examples: The Google File System (GFS) and the Hadoop Distributed File System (HDFS).
Distributed File Systems
Designed to store very large files (up to hundreds of gigabytes).
A file is split into multiple blocks, which are then distributed across multiple nodes.
Techniques like fragmentation and replication are often used in combination to provide high availability.
Functionality:
• Hierarchical organization (i.e., directory structures).
• File reconstruction (i.e., mapping a file name to the distributed blocks).
• Access to a distributed file (through the file name).
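This functionality can be illustrated with a small sketch in plain Python (the block size, node names, and all functions are illustrative assumptions; real DFSs use block sizes of tens to hundreds of megabytes and replicate blocks across nodes):

```python
# Toy sketch of how a DFS could split a file into blocks, place them on
# nodes, and reconstruct the file from the block mapping.
BLOCK_SIZE = 4  # bytes per block; purely illustrative (real: e.g. 64-128 MB)
NODES = ["node1", "node2", "node3"]

def split_into_blocks(data: bytes):
    """Split file content into fixed-size blocks."""
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]

def place_blocks(blocks):
    """Assign blocks to nodes round-robin; a real DFS stores this mapping
    in its metadata (file name -> distributed blocks)."""
    return {i: NODES[i % len(NODES)] for i in range(len(blocks))}

def reconstruct(blocks):
    """File reconstruction: concatenate the blocks in order."""
    return b"".join(blocks)

blocks = split_into_blocks(b"hello world!")
placement = place_blocks(blocks)
assert reconstruct(blocks) == b"hello world!"
```

The mapping returned by place_blocks stands in for the metadata a DFS keeps to answer "which nodes hold the blocks of this file?".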
Introduction
A generic framework (or paradigm) for a common situation in parallel computing:
Apply a function to each of our data items.
Specifically, we want to apply two functions one after another:
1. Apply a first function, the map() function, to each data item.
2. Apply a second function, the reduce() function, to each result item of (1).
Input data → Read → map → Write → Interm. result → Read → reduce → Write → Final result
Example – WordCount
Task: Count the occurrence of each word in a collection of files.
1. Single file on a single machine (node) ⇒ Straightforward.
2. Multiple files on multiple nodes ⇒ Not that easy ...
Input file:
There is only one Lord of the Ring, only one who can bend it to his will.
Desired Result:
Word   Count   Word   Count   Word   Count   Word   Count
There  1       is     1       only   2       one    2
Lord   1       of     1       the    1       Ring   1
who    1       can    1       bend   1       it     1
to     1       his    1       will.  1
Specify the core logic through two complementary functions, map() and reduce().
Example – WordCount with MapReduce
Step 1: The map() function is invoked on each input record, and produces one or more intermediate data items. Each intermediate data item is a key-value pair (rkey, value).
The map() Function:

# Pseudocode in Python-like syntax.
def map(line):
    # We consider each line a record and split it by whitespace.
    for word in line.split():
        # Output the intermediate data item.
        # emit(x, y) is a pseudo function that outputs a pair (x, y).
        emit(word, 1)
Output:
("There", 1), ("is", 1), ("only", 1), ("one", 1), ("Lord", 1), ("of", 1), ("the", 1), ("Ring", 1), ("only", 1), ("one", 1), ("who", 1), ("can", 1), ("bend", 1), ("it", 1), ("to", 1), ("his", 1), ("will.", 1)
Example – WordCount with MapReduce
Step 2: (rkey, value) pairs are grouped based on the key, i.e., data items with the same key are grouped together. This results in one list per key, (rkey, valuelist).
(rkey, value) Pairs:
("There", 1), ("is", 1), ("only", 1), ("one", 1), ("Lord", 1), ("of", 1), ("the", 1), ("Ring", 1), ("only", 1), ("one", 1), ("who", 1), ("can", 1), ("bend", 1), ("it", 1), ("to", 1), ("his", 1), ("will.", 1)
(rkey, valuelist) Pairs:
("There", [1]), ("is", [1]), ("only", [1,1]), ("one", [1,1]), ("Lord", [1]), ("of", [1]), ("the", [1]), ("Ring", [1]), ("who", [1]), ("can", [1]), ("bend", [1]), ("it", [1]), ("to", [1]), ("his", [1]), ("will.", [1])
Example – WordCount with MapReduce
Step 3: The reduce() function is invoked on each (rkey, valuelist) pair and typically aggregates the results for a specific rkey (i.e., word).
The reduce() Function:

# Pseudocode in Python-like syntax.
def reduce(rkey, valuelist):
    count = 0  # total number of occurrences
    for value in valuelist:
        count = count + value
    # Output the final word count.
    # emit(x, y) is a pseudo function that outputs a pair (x, y).
    emit(rkey, count)
Final Result:
("There", 1), ("is", 1), ("only", 2), ("one", 2), ("Lord", 1), ("of", 1), ("the", 1), ("Ring", 1), ("who", 1), ("can", 1), ("bend", 1), ("it", 1), ("to", 1), ("his", 1), ("will.", 1)
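The three steps can be wired together in plain Python as a single-machine sketch (the pseudo function emit is replaced by collecting pairs; note that splitting by whitespace keeps punctuation attached, so "Ring," and "will." are counted as written, a small deviation from the table above):

```python
from collections import defaultdict

def map_phase(line):
    # Step 1: emit one (rkey, value) pair per word.
    return [(word, 1) for word in line.split()]

def group_phase(pairs):
    # Step 2: group values by key into (rkey, valuelist).
    groups = defaultdict(list)
    for rkey, value in pairs:
        groups[rkey].append(value)
    return groups

def reduce_phase(rkey, valuelist):
    # Step 3: aggregate the values for one key.
    return (rkey, sum(valuelist))

line = "There is only one Lord of the Ring, only one who can bend it to his will."
pairs = map_phase(line)
counts = dict(reduce_phase(k, v) for k, v in group_phase(pairs).items())
```

Running this yields counts["only"] == 2 and counts["one"] == 2, matching the grouped lists ("only", [1,1]) and ("one", [1,1]) above.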
The MapReduce Framework
What about multiple files on multiple machines?
What about parallelism?
The MapReduce Framework
[Figure: parallel MapReduce data flow. The input data on the DFS is split into Parts 1..n; map tasks map1..mapm read the parts and write their intermediate results (Parts 1..m) to local disk; the intermediate results are remotely read and shuffled to reduce tasks reduce1..reduces, which write the final result (Parts 1..s). Phases: Read, map, Local Write, Remote Read/Shuffle, reduce, Write.]
Each task (map/reduce) runs on a node, i.e., a node can be mapper and reducer.
Traditionally, MapReduce is disk-based, i.e., the input data for a map/reduce task is read from hard disk and the (intermediate) result is flushed back onto hard disk.
Disclaimer: MapReduce is not the solution to all problems.
• Other systems (incl. DBSs) may be beneficial for particular problems.
• MapReduce is stateless, i.e., mappers/reducers are unaware of other mappers/reducers ⇒ Not ideal for iterative algorithms.
Many parallel programming frameworks are based on the idea of MapReduce2, e.g., Apache Hadoop, Apache Spark, Apache Flink, ...
2https://research.google/pubs/pub62/
Q&A
Distributed Information Management
Daniel Kocher
Salzburg, Summer term 2021
Department of Computer Sciences University of Salzburg
• Introduction to data processing (data sources, motivation).
• Distributed file systems (DFS).
• The MapReduce framework (3 phases, WordCount example).
WordCount with Parallel MapReduce
[Figure: parallel WordCount. The input lines "There is only / one Lord of / the Ring, only / one who can / bend it to / his will." are stored on the DFS and read by map tasks map1..map4, which emit (word, 1) pairs, e.g., ("There", 1), ("is", 1), ("only", 1), ("the", 1), ("Ring", 1), ("only", 1) and ("one", 1), ("Lord", 1), ("of", 1). The shuffle assigns key ranges to reduce tasks: reduce1 handles "bend", "can", "his", "is", "Lord"; reduce2 handles "one", "only", "of", "Ring"; reduce3 handles "the", "There", "will.", "who". The reduce tasks emit ("bend", 1), ("can", 1), ("his", 1), ("is", 1), ("Lord", 1); ("one", 2), ("only", 2), ("of", 1), ("Ring", 1); ("the", 1), ("There", 1), ("will.", 1), ("who", 1) as the final result. Phases: Read, map, Local Write, Remote Read/Shuffle, reduce, Write.]
MapReduce in MongoDB
Batches vs. Streams
Batch Data: A batch is a large but bounded static dataset. Before data can be processed, all data must be completely available (e.g., on hard disk).
Streaming Data: A stream is an unbounded evolving dataset. Data items are processed as they stream into the system one after another, i.e., the data does not have to be completely available.
Stateless Processing: The current operation processes the input data independently, i.e., without considering preceding executions. The independence of the state makes it easier to scale.
Stateful Processing: Preceding executions may influence the outcome of the current execution, i.e., processing history is taken into account. Recording and respecting the state makes it harder to scale.
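The distinction can be sketched in a few lines of Python (function names are illustrative): the stateless operation looks only at the current item, while the stateful one carries an accumulator across items.

```python
def stateless_double(item):
    # Stateless: the output depends on the current item only.
    return item * 2

def make_stateful_running_sum():
    # Stateful: the output depends on all preceding items (accumulated state).
    total = 0
    def step(item):
        nonlocal total
        total += item
        return total
    return step

items = [1, 2, 3]
doubled = [stateless_double(x) for x in items]   # [2, 4, 6]
running = make_stateful_running_sum()
sums = [running(x) for x in items]               # [1, 3, 6]
```

The stateless function can be applied to the items in any order and on any node; the running sum must see its items in order and needs its state co-located, which is why stateful processing is harder to scale.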
Batch Processing
We wait until a batch of data (i.e., a block of data) is accumulated and then we process the data in the batch all at once. For example, we could analyze the data that accumulates over one day.
Data is stored but not processed at arrival. In some scenarios, we must rely on these batches, e.g., when the “full” batch provides more insights.
A state is often transferred from one batch to the next.
Batch Processing
[Figure: batch processing. A data source continuously produces data, which is accumulated into batches (Batch 1, Batch 2, ...); an operation processes one full batch at a time, and the output may serve as input again.]
Stream Processing
We do not wait for the data to accumulate but process each single data item continuously (at arrival). This allows a real-time response and typically involves simple transformations.
Stream processing is used if the data naturally arrives in a continuous stream (e.g., Twitter) or if we build a data-driven system that needs to respond quickly (e.g., fraud detection).
Traditional stream processing is stateless, but modern systems (e.g., Apache Flink) also implement stateful stream processing.
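The item-at-a-time model can be sketched with a Python generator standing in for the source (the toy threshold rule and all names are illustrative; a real stream would be unbounded):

```python
def stream():
    # Stands in for an unbounded source; bounded here for illustration.
    for item in [3, 18, 5, 42, 7]:
        yield item

def process(item):
    # Stateless per-item transformation with an immediate response,
    # e.g. flagging suspicious values (toy fraud-detection rule).
    return ("ALERT" if item > 10 else "ok", item)

# Each item is processed at arrival; no accumulation, no waiting.
responses = [process(item) for item in stream()]
```

Each response is available as soon as its item arrives, which is exactly what batch processing cannot offer.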
Stream Processing
[Figure: stream processing. A data source produces one data item at a time (Item1, Item2, ..., Item6); an operation processes each item at arrival, and the output may serve as input again.]
Micro-Batch Processing
Mixes batch and stream processing: Processes the data in tiny accumulations, so-called micro-batches. For example, we can accumulate data for 10 s and then process this micro-batch.
Allows a system to provide near real-time responses. Often called “pseudo stream processing” (in contrast to “native stream processing”).
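A micro-batch loop can be sketched as follows (a count-based trigger for simplicity; systems like Spark Streaming typically use time-based windows, and all names here are illustrative):

```python
def micro_batches(items, batch_size):
    # Accumulate items into tiny batches and hand over each full batch.
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the last, possibly partial batch
        yield batch

# Process each micro-batch all at once (here: sum it).
processed = [sum(b) for b in micro_batches([1, 2, 3, 4, 5], batch_size=2)]
```

Responses arrive after at most one batch of delay, hence "near real-time" rather than real-time.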
Apache Hadoop4
Open-source implementation of the MapReduce paradigm that is designed as a batch processing system.
• Supports a linear data flow but does not support iterative processing (i.e., loops).
• Is a disk-based system (HDFS), thus typically slower than in-memory systems.
• Scales to tens of thousands of machines (with commodity hardware).
• The Hadoop ecosystem3 is quite large.
3https://hadoopecosystemtable.github.io/
4https://hadoop.apache.org/
Apache Spark5
Open-source parallel processing system that is designed as a micro-batch processing system, mainly for analytics operations.
During computation, the data is kept in main memory (RAM), thus Spark is typically faster than Apache Hadoop. If the data does not fit into RAM, it falls back to disk storage (e.g., using HDFS) and provides similar performance to disk-based systems.
• Supports iterative processing (e.g., machine learning).
• Generalizes MapReduce and integrates into the Scala programming language.
• Supports stream processing with micro-batches (time-based windows).
• Performance heavily relies on main memory.
5https://spark.apache.org/
Apache Spark
Spark implements the concept of so-called resilient distributed datasets (RDDs). An RDD is an immutable distributed collection of data elements that is partitioned across multiple nodes (for fault tolerance).
RDDs allow in-memory transformations and actions. Transformations are applied in a lazy fashion, i.e., they are not executed immediately but tracked in a lineage graph. This improves performance and implements the fault tolerance.
For the interested reader, we refer to the official publications on Apache Spark6 7.
6https://people.csail.mit.edu/matei/papers/2010/hotcloud_spark.pdf
7https://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf
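The idea of lazy transformations recorded in a lineage can be sketched with a toy class in plain Python (this is not the Spark API; ToyRDD, its methods, and the example data are illustrative):

```python
class ToyRDD:
    """Toy illustration of lazy transformations; not the Spark API."""
    def __init__(self, source, ops=()):
        self.source = source  # base data (would come from stable storage)
        self.ops = ops        # lineage: recorded but unexecuted transformations

    def map(self, f):
        # Transformation: only recorded, nothing is computed yet.
        return ToyRDD(self.source, self.ops + (("map", f),))

    def filter(self, p):
        # Transformation: only recorded, nothing is computed yet.
        return ToyRDD(self.source, self.ops + (("filter", p),))

    def collect(self):
        # Action: only now is the recorded lineage replayed over the data.
        data = list(self.source)
        for kind, fn in self.ops:
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return data

rdd = ToyRDD(range(6)).filter(lambda x: x % 2 == 0).map(lambda x: x * 10)
# Nothing has executed yet; rdd.ops holds the lineage ("filter", then "map").
result = rdd.collect()  # [0, 20, 40]
```

Because the full lineage is known before execution, an engine can optimize across transformations, and a lost partition can be recomputed by replaying the same lineage.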
Apache Spark
Lineage Graph: A graph that encodes how an RDD was derived (usually from stable storage), e.g., RDD 2 was derived from RDD 1 (which may represent some input file).
[Figure: a lineage graph over RDD 1..RDD 5, connected by the recorded transformations filter, map, and union.]
Lazy Evaluation: Only actions trigger execution, transformations are recorded in the lineage graph ⇒ More potential for optimizations (all transformations are known).
Fault Tolerance: Lost RDDs can be recomputed from other RDDs using the lineage graph, i.e., lost data is recovered without replication.
Apache Spark
For a table-like abstraction, Spark implements dataframes. A dataframe is an immutable distributed collection of data elements (like RDDs), but the data is organized in columns (RDDs store unstructured data).
Dataframes are a higher level of abstraction and support powerful APIs (the Dataframe and the SparkSQL API). Datasets can be seen as type-safe dataframes.

df = spark.read.json("example.json")
df.show()
# Prints the "schema", i.e., keys + value types.
df.printSchema()
# Lazy evaluation: df.select("name") is tracked in the lineage graph,
# only .show() triggers execution.
df.select("name").show()
Apache Flink8
Open-source parallel processing system that is designed as a native stream processing system.
• The streaming architecture supports iterative processing (e.g., machine learning).
• Unified framework for processing batches and streams.
• Can operate in a stateful or stateless computation mode.
• Implements fault tolerance through checkpoints/snapshots.
8https://flink.apache.org/
Apache Flink
Distributed Stream Processing: Data items in the streams are grouped and distributed based on some key, and each node is responsible for some key range.
[Figure: incoming streams are partitioned by key across Node 1, Node 2, and Node 3; each node runs its own instance of the operation on its key range.]
Stateful Distributed Stream Processing9: The state is accumulated and maintained over time in a distributed manner by co-locating it (i.e., storing it on the node that runs the operation).
[Figure: Node 1, Node 2, and Node 3 each run an operation; each node stores its own state locally, e.g., the state of node 1 resides on Node 1.]
9This is a very simplified description. For a detailed description, please check https://flink.apache.org/features/2017/07/04/flink-rescalable-state.html
Apache Flink
Fault Tolerance10: Special items called barriers are injected into the streams and force the nodes to write a checkpoint of data and state onto (distributed) durable storage (e.g., HDFS). Node i records its data and state since the last barrier was processed.
[Figure: a barrier travels through the streams across Node 1, Node 2, and Node 3; Node 3 writes its items and state since the last barrier to HDFS.]
10This is a very simplified description. For a detailed description, please check https://ci.apache.org/projects/flink/flink-docs-release-1.1/internals/stream_checkpointing.html
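The barrier mechanism can be sketched in miniature (a single node with a running sum as its state; BARRIER and all names are illustrative, not the Flink API):

```python
BARRIER = object()  # special marker item injected into the stream

def process_with_checkpoints(stream):
    """Toy sketch: sum the items; whenever a barrier arrives, write a
    checkpoint of the current state (here: append it to a list standing
    in for durable storage such as HDFS)."""
    state, checkpoints = 0, []
    for item in stream:
        if item is BARRIER:
            checkpoints.append(state)  # checkpoint state since last barrier
        else:
            state += item
    return state, checkpoints

stream = [1, 2, BARRIER, 3, BARRIER, 4]
final_state, checkpoints = process_with_checkpoints(stream)
# On failure, the node would restart from the last checkpoint (6)
# and replay only the items that arrived after it.
```

A real system additionally aligns barriers across multiple input streams and snapshots the in-flight items, which this sketch omits.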
Apache SystemDS13
Formerly known as SystemML (developed by IBM). Apache SystemDS is a distributed machine-learning (ML) system that scales to large clusters. Its focus is on the integration of the entire data science lifecycle (i.e., data integration/cleaning/preparation, ML model training, serving the data).
SystemDS11 12 bridges the gap from simple ML algorithms written in R/Python to executing the ML algorithm at scale on a large cluster. It provides a declarative language for ML and can execute in-memory on a single machine or on a large Spark cluster.
11https://www.youtube.com/watch?v=n3JJP6UbH6Q
12http://cidrdb.org/cidr2020/papers/p22-boehm-cidr20.pdf
13https://systemds.apache.org/
Initially developed by LinkedIn, Apache Kafka14 is a powerful building block in many large-scale data processing pipelines. At its core, it is a distributed and fault-tolerant logging system.
The internals are comparable to the logging mechanism in a database system (i.e., log entries are stored in order in an append-only fashion).
Producer-Consumer Paradigm: Applications send (produce) messages to a Kafka node, which are then processed (consumed) by another application. Messages are stored by “topic” and a consumer subscribes to a topic to retrieve the corresp. messages.
14https://www.microsoft.com/en-us/research/wp-content/uploads/2017/09/Kafka.pdf
15https://kafka.apache.org/
Apache Kafka15
[Figure: an Apache Kafka cluster stores Topic 1, Topic 2, and Topic 3, each an ordered sequence of messages Mi,1, Mi,2, Mi,3, ...; producers Producer1, Producer2, ..., Producern publish to their topics (Topic 1; Topics 1 + 2; Topics 2 + 3), and consumers Consumer1, Consumer2, ..., Consumerm subscribe to their topics (Topics 1 - 3; Topic 1; Topic 3) to retrieve the corresponding messages. Mi,j ... j-th message of topic i.]
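The producer-consumer paradigm above can be sketched as an append-only log per topic with per-consumer read offsets (a toy model; ToyKafka and its methods are illustrative, not the Kafka client API):

```python
class ToyKafka:
    """Toy broker: an append-only log per topic, plus per-consumer offsets."""
    def __init__(self):
        self.topics = {}    # topic -> list of messages, in arrival order
        self.offsets = {}   # (consumer, topic) -> next position to read

    def produce(self, topic, message):
        # Log entries are stored in order, append-only.
        self.topics.setdefault(topic, []).append(message)

    def consume(self, consumer, topic):
        # Deliver all messages this consumer has not seen yet,
        # then advance its offset.
        pos = self.offsets.get((consumer, topic), 0)
        messages = self.topics.get(topic, [])[pos:]
        self.offsets[(consumer, topic)] = pos + len(messages)
        return messages

broker = ToyKafka()
broker.produce("topic1", "M1,1")
broker.produce("topic1", "M1,2")
first = broker.consume("consumer1", "topic1")   # ["M1,1", "M1,2"]
broker.produce("topic1", "M1,3")
second = broker.consume("consumer1", "topic1")  # ["M1,3"]
```

Because each consumer tracks its own offset, several consumers can read the same topic independently, and the ordered append-only log is what makes the broker comparable to a database log.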