Start: 08:15 am CET
Daniel Kocher
Salzburg, Summer term 2021
Department of Computer Sciences University of Salzburg
Final Exam
Regular date/time: June 28, 2021, 08:00 - 09:30 am CET
Partially overlaps with another exam (starting at 09:30 am CET).
Options:
1. Find another time, e.g., start at 07:30 am CET.
2. Find another date, e.g., June 29, 2021, 3:00 pm CET.
3. Start a poll in the PLUS Umfragetool1.
1https://umfrage.sbg.ac.at
Data Processing
Literature, Sources, and Credits
Literature:
• Silberschatz et al. Database System Concepts. McGraw Hill, Sixth Edition, 2010. In particular Chapter 10 – Big Data.
Credits: These slides are partially based on slides of other lectures.
• Slides of Silberschatz et al. Database System Concepts. McGraw Hill, Sixth Edition, 2010. In particular Chapter 10 – Big Data.
Motivation
“Big data” that does not fit on a single machine needs to be processed
⇒ Higher degree of distribution and parallelism
Volume: Thousands of machines (nodes) are required to store and process the data.
Velocity: Data arrives at a very high pace and needs to be processed immediately to respond to certain events.
Variety: Different data formats are used for different purposes and may need to be processed collectively (e.g., logs and the actual data of an application).
Data Sources
The web and its various applications ⇒ Web logs.
• Recommendations
• User interaction patterns
• Advertisement
• ...
Smartphone apps and data about the user interactions.
Sensors report data continuously and at a very high pace ⇒ “Internet of things”.
Data from social media platforms.
Metadata in communication networks to predict/prevent problems.
Motivation
Similar to database systems, many companies developed their own solutions to process these large amounts of data.
Problems?
• Satisfying the performance requirements is not easy.
• Parallelism.
• Load balancing.
• ...
• Dealing with failures in the distributed environment is not trivial.
• ...
Goal: A framework that implements this functionality transparently.
• Allow complex data processing tasks.
• Transparent and automatic parallelization of the tasks.
• Built-in and transparent fault tolerance.
Data Storage
In Part I – Data Management, we have covered:
• Different models and systems to store complex data.
• Parallel and distributed database systems.
• Fragmentation (aka sharding) and replication.
We have not yet heard about distributed file systems.
Every computer/operating system has a local file system (FS). Effectively, the file system takes care of how the data is stored on your hard disk and how the user can retrieve it. Furthermore, the FS implements a common interface to access the files.
A distributed file system (DFS) provides the same functionality across a cluster of nodes transparently, i.e., the user interacts with the distributed FS as if it were a local FS.
Examples: The Google File System (GFS) and the Hadoop Distributed File System (HDFS).
Distributed File Systems
Designed to store very large files (up to hundreds of gigabytes).
A file is split into multiple blocks, which are then distributed across multiple nodes.
Techniques like fragmentation and replication are often used in combination to provide high availability.
Functionality:
• Hierarchical organization (i.e., directory structures).
• File reconstruction (i.e., mapping a file name to the distributed blocks).
• Access to a distributed file (through the file name).
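This functionality can be illustrated with a small sketch in plain Python (the block size, node names, and all functions are illustrative assumptions; real DFSs use block sizes of tens to hundreds of megabytes and replicate blocks across nodes):

```python
# Toy sketch of how a DFS could split a file into blocks, place them on
# nodes, and reconstruct the file from the block mapping.
BLOCK_SIZE = 4  # bytes per block; purely illustrative (real: e.g. 64-128 MB)
NODES = ["node1", "node2", "node3"]

def split_into_blocks(data: bytes):
    """Split file content into fixed-size blocks."""
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]

def place_blocks(blocks):
    """Assign blocks to nodes round-robin; a real DFS stores this mapping
    in its metadata (file name -> distributed blocks)."""
    return {i: NODES[i % len(NODES)] for i in range(len(blocks))}

def reconstruct(blocks):
    """File reconstruction: concatenate the blocks in order."""
    return b"".join(blocks)

blocks = split_into_blocks(b"hello world!")
placement = place_blocks(blocks)
assert reconstruct(blocks) == b"hello world!"
```

The mapping returned by place_blocks stands in for the metadata a DFS keeps to answer "which nodes hold the blocks of this file?".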
Introduction
A generic framework (or paradigm) for a common situation in parallel computing:
Apply a function to each of our data items.
Specifically, we want to apply two functions one after another:
1. Apply a first function, the map() function, to each data item.
2. Apply a second function, the reduce() function, to each result item of (1).
Input data → Read → map → Write → Interm. result → Read → reduce → Write → Final result
Example – WordCount
Task: Count the occurrence of each word in a collection of files.
1. Single file on a single machine (node) ⇒ Straightforward.
2. Multiple files on multiple nodes ⇒ Not that easy ...
Input file:
There is only one Lord of the Ring, only one who can bend it to his will.
Desired Result:
Word   Count   Word   Count   Word   Count   Word   Count
There  1       is     1       only   2       one    2
Lord   1       of     1       the    1       Ring   1
who    1       can    1       bend   1       it     1
to     1       his    1       will.  1
Specify the core logic through two complementary functions, map() and reduce().
Example – WordCount with MapReduce
Step 1: The map() function is invoked on each input record, and produces one or more intermediate data items. Each intermediate data item is a key-value pair (rkey, value).
The map() Function:

# Pseudocode in Python-like syntax.
def map(line):
    # We consider each line a record and split it by whitespace.
    for word in line.split():
        # Output the intermediate data item.
        # emit(x, y) is a pseudo function that outputs a pair (x, y).
        emit(word, 1)
Output:
("There", 1), ("is", 1), ("only", 1), ("one", 1), ("Lord", 1), ("of", 1), ("the", 1), ("Ring", 1), ("only", 1), ("one", 1), ("who", 1), ("can", 1), ("bend", 1), ("it", 1), ("to", 1), ("his", 1), ("will.", 1)
Example – WordCount with MapReduce
Step 2: (rkey, value) pairs are grouped based on the key, i.e., data items with the same key are grouped together. This results in one list per key, (rkey, valuelist).
(rkey, value) Pairs:
("There", 1), ("is", 1), ("only", 1), ("one", 1), ("Lord", 1), ("of", 1), ("the", 1), ("Ring", 1), ("only", 1), ("one", 1), ("who", 1), ("can", 1), ("bend", 1), ("it", 1), ("to", 1), ("his", 1), ("will.", 1)
(rkey, valuelist) Pairs:
("There", [1]), ("is", [1]), ("only", [1,1]), ("one", [1,1]), ("Lord", [1]), ("of", [1]), ("the", [1]), ("Ring", [1]), ("who", [1]), ("can", [1]), ("bend", [1]), ("it", [1]), ("to", [1]), ("his", [1]), ("will.", [1])
Example – WordCount with MapReduce
Step 3: The reduce() function is invoked on each (rkey, valuelist) pair and typically aggregates the results for a specific rkey (i.e., word).
The reduce() Function:

# Pseudocode in Python-like syntax.
def reduce(rkey, valuelist):
    count = 0  # total number of occurrences
    for value in valuelist:
        count = count + value
    # Output the final word count.
    # emit(x, y) is a pseudo function that outputs a pair (x, y).
    emit(rkey, count)
Final Result:
("There", 1), ("is", 1), ("only", 2), ("one", 2), ("Lord", 1), ("of", 1), ("the", 1), ("Ring", 1), ("who", 1), ("can", 1), ("bend", 1), ("it", 1), ("to", 1), ("his", 1), ("will.", 1)
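The three steps can be wired together in plain Python as a single-machine sketch (the pseudo function emit is replaced by collecting pairs; note that splitting by whitespace keeps punctuation attached, so "Ring," and "will." are counted as written, a small deviation from the table above):

```python
from collections import defaultdict

def map_phase(line):
    # Step 1: emit one (rkey, value) pair per word.
    return [(word, 1) for word in line.split()]

def group_phase(pairs):
    # Step 2: group values by key into (rkey, valuelist).
    groups = defaultdict(list)
    for rkey, value in pairs:
        groups[rkey].append(value)
    return groups

def reduce_phase(rkey, valuelist):
    # Step 3: aggregate the values for one key.
    return (rkey, sum(valuelist))

line = "There is only one Lord of the Ring, only one who can bend it to his will."
pairs = map_phase(line)
counts = dict(reduce_phase(k, v) for k, v in group_phase(pairs).items())
```

Running this yields counts["only"] == 2 and counts["one"] == 2, matching the grouped lists ("only", [1,1]) and ("one", [1,1]) above.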
The MapReduce Framework
What about multiple files on multiple machines?
What about parallelism?
The MapReduce Framework
[Figure: parallel MapReduce data flow. The input data on the DFS is split into Parts 1..n; map tasks map1..mapm read the parts and write their intermediate results (Parts 1..m) to local disk; the intermediate results are remotely read and shuffled to reduce tasks reduce1..reduces, which write the final result (Parts 1..s). Phases: Read, map, Local Write, Remote Read/Shuffle, reduce, Write.]
Each task (map/reduce) runs on a node, i.e., a node can be mapper and reducer.
Traditionally, MapReduce is disk-based, i.e., the input data for a map/reduce task is read from hard disk and the (intermediate) result is flushed back onto hard disk.
Disclaimer: MapReduce is not the solution to all problems.
• Other systems (incl. DBSs) may be beneficial for particular problems.
• MapReduce is stateless, i.e., mappers/reducers are unaware of other mappers/reducers ⇒ Not ideal for iterative algorithms.
Many parallel programming frameworks are based on the idea of MapReduce2, e.g., Apache Hadoop, Apache Spark, Apache Flink, ...
2https://research.google/pubs/pub62/
Q&A
Distributed Information Management
Daniel Kocher
Salzburg, Summer term 2021
Department of Computer Sciences University of Salzburg
• Introduction to data processing (data sources, motivation).
• Distributed file systems (DFS).
• The MapReduce framework (3 phases, WordCount example).
WordCount with Parallel MapReduce
[Figure: parallel WordCount. The input lines "There is only / one Lord of / the Ring, only / one who can / bend it to / his will." are stored on the DFS and read by map tasks map1..map4, which emit (word, 1) pairs, e.g., ("There", 1), ("is", 1), ("only", 1), ("the", 1), ("Ring", 1), ("only", 1) and ("one", 1), ("Lord", 1), ("of", 1). The shuffle assigns key ranges to reduce tasks: reduce1 handles "bend", "can", "his", "is", "Lord"; reduce2 handles "one", "only", "of", "Ring"; reduce3 handles "the", "There", "will.", "who". The reduce tasks emit ("bend", 1), ("can", 1), ("his", 1), ("is", 1), ("Lord", 1); ("one", 2), ("only", 2), ("of", 1), ("Ring", 1); ("the", 1), ("There", 1), ("will.", 1), ("who", 1) as the final result. Phases: Read, map, Local Write, Remote Read/Shuffle, reduce, Write.]
MapReduce in MongoDB
Batches vs. Streams
Batch Data: A batch is a large but bounded static dataset. Before data can be processed, all data must be completely available (e.g., on hard disk).
Streaming Data: A stream is an unbounded evolving dataset. Data items are processed as they stream into the system one after another, i.e., the data does not have to be completely available.
Stateless Processing: The current operation processes the input data independently, i.e., without considering preceding executions. The independence of the state makes it easier to scale.
Stateful Processing: Preceding executions may influence the outcome of the current execution, i.e., processing history is taken into account. Recording and respecting the state makes it harder to scale.
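The distinction can be sketched in a few lines of Python (function names are illustrative): the stateless operation looks only at the current item, while the stateful one carries an accumulator across items.

```python
def stateless_double(item):
    # Stateless: the output depends on the current item only.
    return item * 2

def make_stateful_running_sum():
    # Stateful: the output depends on all preceding items (accumulated state).
    total = 0
    def step(item):
        nonlocal total
        total += item
        return total
    return step

items = [1, 2, 3]
doubled = [stateless_double(x) for x in items]   # [2, 4, 6]
running = make_stateful_running_sum()
sums = [running(x) for x in items]               # [1, 3, 6]
```

The stateless function can be applied to the items in any order and on any node; the running sum must see its items in order and needs its state co-located, which is why stateful processing is harder to scale.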
Batch Processing
We wait until a batch of data (i.e., a block of data) is accumulated and then we process the data in the batch all at once. For example, we could analyze the data that accumulates over one day.
Data is stored but not processed at arrival. In some scenarios, we must rely on these batches, e.g., when the “full” batch provides more insights.
A state is often transferred from one batch to the next.
Batch Processing
[Figure: batch processing. A data source continuously produces data, which is accumulated into batches (Batch 1, Batch 2, ...); an operation processes one full batch at a time, and the output may serve as input again.]
Stream Processing
We do not wait for the data to accumulate but process each single data item continuously (at arrival). This allows a real-time response and typically involves simple transformations.
Stream processing is used if the data naturally arrives in a continuous stream (e.g., Twitter) or if we build a data-driven system that needs to respond quickly (e.g., fraud detection).
Traditional stream processing is stateless, but modern systems (e.g., Apache Flink) also implement stateful stream processing.
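The item-at-a-time model can be sketched with a Python generator standing in for the source (the toy threshold rule and all names are illustrative; a real stream would be unbounded):

```python
def stream():
    # Stands in for an unbounded source; bounded here for illustration.
    for item in [3, 18, 5, 42, 7]:
        yield item

def process(item):
    # Stateless per-item transformation with an immediate response,
    # e.g. flagging suspicious values (toy fraud-detection rule).
    return ("ALERT" if item > 10 else "ok", item)

# Each item is processed at arrival; no accumulation, no waiting.
responses = [process(item) for item in stream()]
```

Each response is available as soon as its item arrives, which is exactly what batch processing cannot offer.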
Stream Processing
[Figure: stream processing. A data source produces one data item at a time (Item1, Item2, ..., Item6); an operation processes each item at arrival, and the output may serve as input again.]
Micro-Batch Processing
Mixes batch and stream processing: Processes the data in tiny accumulations, so-called micro-batches. For example, we can accumulate data for 10 s and then process this micro-batch.
Allows a system to provide near real-time responses. Often called “pseudo stream processing” (in contrast to “native stream processing”).
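A micro-batch loop can be sketched as follows (a count-based trigger for simplicity; systems like Spark Streaming typically use time-based windows, and all names here are illustrative):

```python
def micro_batches(items, batch_size):
    # Accumulate items into tiny batches and hand over each full batch.
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the last, possibly partial batch
        yield batch

# Process each micro-batch all at once (here: sum it).
processed = [sum(b) for b in micro_batches([1, 2, 3, 4, 5], batch_size=2)]
```

Responses arrive after at most one batch of delay, hence "near real-time" rather than real-time.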
Apache Hadoop4
Open-source implementation of the MapReduce paradigm that is designed as a batch processing system.
• Supports a linear data flow but does not support iterative processing (i.e., loops).
• Is a disk-based system (HDFS), thus typically slower than in-memory systems.
• Scales to tens of thousands of machines (with commodity hardware).
• The Hadoop ecosystem3 is quite large.
3https://hadoopecosystemtable.github.io/
4https://hadoop.apache.org/
Apache Spark5
Open-source parallel processing system that is designed as a micro-batch processing system, mainly for analytics operations.
During computation, the data is kept in main memory (RAM), thus Spark is typically faster than Apache Hadoop. If the data does not fit into RAM, it falls back to disk storage (e.g., using HDFS) and provides similar performance to disk-based systems.
• Supports iterative processing (e.g., machine learning).
• Generalizes MapReduce and integrates into the Scala programming language.
• Supports stream processing with micro-batches (time-based windows).
• Performance heavily relies on main memory.
5https://spark.apache.org/
Apache Spark
Spark implements the concept of so-called resilient distributed datasets (RDDs). An RDD is an immutable distributed collection of data elements that is partitioned across multiple nodes (for fault tolerance).
RDDs allow in-memory transformations and actions. Transformations are applied in a lazy fashion, i.e., they are not executed immediately but tracked in a lineage graph. This improves performance and implements the fault tolerance.
For the interested reader, we refer to the official publications on Apache Spark6 7.
6https://people.csail.mit.edu/matei/papers/2010/hotcloud_spark.pdf
7https://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf
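The idea of lazy transformations recorded in a lineage can be sketched with a toy class in plain Python (this is not the Spark API; ToyRDD, its methods, and the example data are illustrative):

```python
class ToyRDD:
    """Toy illustration of lazy transformations; not the Spark API."""
    def __init__(self, source, ops=()):
        self.source = source  # base data (would come from stable storage)
        self.ops = ops        # lineage: recorded but unexecuted transformations

    def map(self, f):
        # Transformation: only recorded, nothing is computed yet.
        return ToyRDD(self.source, self.ops + (("map", f),))

    def filter(self, p):
        # Transformation: only recorded, nothing is computed yet.
        return ToyRDD(self.source, self.ops + (("filter", p),))

    def collect(self):
        # Action: only now is the recorded lineage replayed over the data.
        data = list(self.source)
        for kind, fn in self.ops:
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return data

rdd = ToyRDD(range(6)).filter(lambda x: x % 2 == 0).map(lambda x: x * 10)
# Nothing has executed yet; rdd.ops holds the lineage ("filter", then "map").
result = rdd.collect()  # [0, 20, 40]
```

Because the full lineage is known before execution, an engine can optimize across transformations, and a lost partition can be recomputed by replaying the same lineage.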
Apache Spark
Lineage Graph: A graph that encodes how an RDD was derived (usually from stable storage), e.g., RDD 2 was derived from RDD 1 (which may represent some input file).
[Figure: a lineage graph over RDD 1..RDD 5, connected by the recorded transformations filter, map, and union.]
Lazy Evaluation: Only actions trigger execution, transformations are recorded in the lineage graph ⇒ More potential for optimizations (all transformations are known).
Fault Tolerance: Lost RDDs can be recomputed from other RDDs using the lineage graph, i.e., lost data is recovered without replication.
Apache Spark
For a table-like abstraction, Spark implements dataframes. A dataframe is an immutable distributed collection of data elements (like RDDs), but the data is organized in columns (RDDs store unstructured data).
Dataframes are a higher level of abstraction and support powerful APIs (the Dataframe and the SparkSQL API). Datasets can be seen as type-safe dataframes.

df = spark.read.json("example.json")
df.show()
# Prints the "schema", i.e., keys + value types.
df.printSchema()
# Lazy evaluation: df.select("name") is tracked in the lineage graph,
# only .show() triggers execution.
df.select("name").show()
Apache Flink8
Open-source parallel processing system that is designed as a native stream processing system.
• The streaming architecture supports iterative processing (e.g., machine learning).
• Unified framework for processing batches and streams.
• Can operate in a stateful or stateless computation mode.
• Implements fault tolerance through checkpoints/snapshots.
8https://flink.apache.org/
Apache Flink
Distributed Stream Processing: Data items in the streams are grouped and distributed based on some key, and each node is responsible for some key range.
[Figure: incoming streams are partitioned by key across Node 1, Node 2, and Node 3; each node runs its own instance of the operation on its key range.]
Stateful Distributed Stream Processing9: The state is accumulated and maintained over time in a distributed manner by co-locating it (i.e., storing it on the node that runs the operation).
[Figure: Node 1, Node 2, and Node 3 each run an operation; each node stores its own state locally, e.g., the state of node 1 resides on Node 1.]
9This is a very simplified description. For a detailed description, please check https://flink.apache.org/features/2017/07/04/flink-rescalable-state.html
Apache Flink
Fault Tolerance10: Special items called barriers are injected into the streams and force the nodes to write a checkpoint of data and state onto (distributed) durable storage (e.g., HDFS). Node i records its data and state since the last barrier was processed.
[Figure: a barrier travels through the streams across Node 1, Node 2, and Node 3; Node 3 writes its items and state since the last barrier to HDFS.]
10This is a very simplified description. For a detailed description, please check https://ci.apache.org/projects/flink/flink-docs-release-1.1/internals/stream_checkpointing.html
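The barrier mechanism can be sketched in miniature (a single node with a running sum as its state; BARRIER and all names are illustrative, not the Flink API):

```python
BARRIER = object()  # special marker item injected into the stream

def process_with_checkpoints(stream):
    """Toy sketch: sum the items; whenever a barrier arrives, write a
    checkpoint of the current state (here: append it to a list standing
    in for durable storage such as HDFS)."""
    state, checkpoints = 0, []
    for item in stream:
        if item is BARRIER:
            checkpoints.append(state)  # checkpoint state since last barrier
        else:
            state += item
    return state, checkpoints

stream = [1, 2, BARRIER, 3, BARRIER, 4]
final_state, checkpoints = process_with_checkpoints(stream)
# On failure, the node would restart from the last checkpoint (6)
# and replay only the items that arrived after it.
```

A real system additionally aligns barriers across multiple input streams and snapshots the in-flight items, which this sketch omits.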
Apache SystemDS13
Formerly known as SystemML (developed by IBM). Apache SystemDS is a distributed machine-learning (ML) system that scales to large clusters. Its focus is on the integration of the entire data science lifecycle (i.e., data integration/cleaning/preparation, ML model training, serving the data).
SystemDS11 12 bridges the gap from simple ML algorithms written in R/Python to executing the ML algorithm at scale on a large cluster. It provides a declarative language for ML and can execute in-memory on a single machine or on a large Spark cluster.
11https://www.youtube.com/watch?v=n3JJP6UbH6Q
12http://cidrdb.org/cidr2020/papers/p22-boehm-cidr20.pdf
13https://systemds.apache.org/
Initially developed by LinkedIn, Apache Kafka14 is a powerful building block in many large-scale data processing pipelines. At its core, it is a distributed and fault-tolerant logging system.
The internals are comparable to the logging mechanism in a database system (i.e., log entries are stored in order in an append-only fashion).
Producer-Consumer Paradigm: Applications send (produce) messages to a Kafka node, which are then processed (consumed) by another application. Messages are stored by “topic” and a consumer subscribes to a topic to retrieve the corresp. messages.
14https://www.microsoft.com/en-us/research/wp-content/uploads/2017/09/Kafka.pdf
15https://kafka.apache.org/
Apache Kafka15
[Figure: an Apache Kafka cluster stores Topic 1, Topic 2, and Topic 3, each an ordered sequence of messages Mi,1, Mi,2, Mi,3, ...; producers Producer1, Producer2, ..., Producern publish to their topics (Topic 1; Topics 1 + 2; Topics 2 + 3), and consumers Consumer1, Consumer2, ..., Consumerm subscribe to their topics (Topics 1 - 3; Topic 1; Topic 3) to retrieve the corresponding messages. Mi,j ... j-th message of topic i.]
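The producer-consumer paradigm above can be sketched as an append-only log per topic with per-consumer read offsets (a toy model; ToyKafka and its methods are illustrative, not the Kafka client API):

```python
class ToyKafka:
    """Toy broker: an append-only log per topic, plus per-consumer offsets."""
    def __init__(self):
        self.topics = {}    # topic -> list of messages, in arrival order
        self.offsets = {}   # (consumer, topic) -> next position to read

    def produce(self, topic, message):
        # Log entries are stored in order, append-only.
        self.topics.setdefault(topic, []).append(message)

    def consume(self, consumer, topic):
        # Deliver all messages this consumer has not seen yet,
        # then advance its offset.
        pos = self.offsets.get((consumer, topic), 0)
        messages = self.topics.get(topic, [])[pos:]
        self.offsets[(consumer, topic)] = pos + len(messages)
        return messages

broker = ToyKafka()
broker.produce("topic1", "M1,1")
broker.produce("topic1", "M1,2")
first = broker.consume("consumer1", "topic1")   # ["M1,1", "M1,2"]
broker.produce("topic1", "M1,3")
second = broker.consume("consumer1", "topic1")  # ["M1,3"]
```

Because each consumer tracks its own offset, several consumers can read the same topic independently, and the ordered append-only log is what makes the broker comparable to a database log.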