Non-Standard Database Systems
Parallel Databases
Nikolaus Augsten
nikolaus.augsten@sbg.ac.at Department of Computer Sciences
University of Salzburg
http://dbresearch.uni-salzburg.at
Sommersemester 2019
Version 19. März 2019
Augsten (Univ. Salzburg) NSDB – Parallel Databases Sommersemester 2019 1 / 47
Outline
1 I/O Parallelism
2 Interquery Parallelism
3 Intraquery Parallelism
Intraoperation Parallelism
Interoperation Parallelism
4 Query Optimization and System Design
I/O Parallelism
Introduction
Parallel machines are becoming quite common and affordable
prices of microprocessors, memory, and disks have dropped sharply
recent desktop computers feature multiple processors, and this trend is projected to accelerate
Databases are growing
large volumes of transaction data are collected and stored for later analysis
large objects like multimedia data are increasingly stored in databases
Large-scale parallel database systems are increasingly used for:
storing large volumes of data
processing time-consuming decision-support queries
providing high throughput for transaction processing
Parallelism in Databases
Databases naturally lend themselves to parallelism:
Data can be partitioned across multiple disks for parallel I/O.
Individual relational operations (e.g., sort, join, aggregation) can be executed in parallel
each processor can work independently on its own data partition
Queries are expressed in a high-level language (SQL, translated to relational algebra)
makes parallelization easier
Different queries can be run in parallel with each other. Concurrency control takes care of conflicts.
I/O Parallelism/1
Reduce the time required to retrieve relations from disk by partitioning the relations on multiple disks.
Horizontal partitioning — tuples of a relation are divided among many disks such that each tuple resides on one disk.
Partitioning techniques (number of disks = n):
Round-robin:
send the i-th tuple inserted in the relation to disk i mod n.
Hash partitioning:
choose one or more attributes as the partitioning attributes.
choose a hash function h with range 0 . . . n − 1.
let i denote the result of hash function h applied to the partitioning attribute value of a tuple. Send the tuple to disk i.
I/O Parallelism/2
Range partitioning
Choose an attribute as the partitioning attribute.
A partitioning vector [v0, v1, . . . , vn−2] is chosen.
Partitioning: let v be the partitioning attribute value of a tuple. Tuples with vi ≤ v < vi+1 go to disk i + 1; tuples with v < v0 go to disk 0, and tuples with v ≥ vn−2 go to disk n − 1.
Example: with partitioning vector [5, 11], a tuple with partitioning attribute value 2 goes to disk 0, a tuple with value 8 goes to disk 1, and a tuple with value 20 goes to disk 2.
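The three schemes can be sketched as plain functions mapping a tuple's insertion position or partitioning attribute value to a disk number (a minimal illustration; the disk count n and the partitioning vector are the only assumed inputs):

```python
def round_robin_disk(i, n):
    """Disk for the i-th tuple inserted into the relation."""
    return i % n

def hash_disk(value, n):
    """Disk chosen by hashing the partitioning attribute value."""
    return hash(value) % n   # stands in for a hash function with range 0..n-1

def range_disk(value, vector):
    """Disk chosen by a partitioning vector [v0, ..., v_{n-2}]."""
    for i, v in enumerate(vector):
        if value < v:
            return i
    return len(vector)       # value >= v_{n-2}: last disk

# The example above: vector [5, 11] sends value 2 to disk 0,
# value 8 to disk 1, and value 20 to disk 2.
```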
Comparison of Partitioning Techniques/1
Evaluate how well partitioning techniques support the following types of data access:
1. scanning the entire relation.
2. locating a tuple associatively — point queries.
E.g., r.A = 25.
3. locating all tuples such that the value of a given attribute lies within a specified range — range queries.
E.g., 10 ≤ r.A < 25.
Comparison of Partitioning Techniques/2
Round robin:
Advantages
best suited for sequential scan of the entire relation
all disks have almost an equal number of tuples; retrieval work is thus well balanced between disks.
Point queries and range queries are difficult to process
no clustering — relevant tuples are scattered across all disks
Comparison of Partitioning Techniques/3
Hash partitioning:
Good for sequential access
assuming the hash function is good and the partitioning attributes form a key, tuples will be equally distributed between disks
retrieval work is then well balanced between disks.
Good for point queries on the partitioning attribute
look up a single disk, leaving others available for answering other queries.
No clustering, so difficult to answer range queries
Comparison of Partitioning Techniques/4
Range partitioning:
Provides data clustering by partitioning attribute value.
Good for sequential access
Good for point queries on the partitioning attribute: only one disk needs to be accessed.
For range queries on the partitioning attribute, one to a few disks may need to be accessed
remaining disks are available for other queries.
good if result tuples are from one to a few blocks.
if many blocks are to be fetched, they are still fetched from one to a few disks: potential parallelism in disk access is wasted.
Example: if we partition by order date, tuples with recent order dates will be accessed more frequently, leading to so-called execution skew
Partitioning a Relation across Disks
If a relation contains only a few tuples which will fit into a single disk block, then assign the relation to a single disk.
Large relations are preferably partitioned across all the available disks.
If a relation consists of m disk blocks and there are n disks available, then the relation should be allocated to min(m, n) disks.
Handling of Data Skew
Distribution of tuples to disks may be skewed: some disks have many tuples, while others have fewer tuples.
Skew limits speedup. Example:
relation with 1000 tuples is partitioned across 100 disks (10 tuples/disk)
expected speedup for scan: ×100
skew: one disk has 40 tuples ⇒ max. speedup is ×25
Types of skew:
Attribute-value skew:
Some values appear in the partitioning attributes of many tuples; all the tuples with the same value for the partitioning attribute end up in the same partition.
Can occur with range partitioning and hash partitioning.
Partition skew:
With range partitioning, a badly chosen partition vector may assign too many tuples to some partitions and too few to others.
Less likely with hash partitioning if a good hash function is chosen.
Handling Skew using Histograms
A balanced partitioning vector can be constructed from a histogram in a relatively straightforward fashion
Assume a uniform distribution within each range of the histogram
The histogram can be constructed by scanning the relation, or by sampling (blocks containing) tuples of the relation
[Figure: histogram of the partitioning attribute, with value ranges 1-5, 6-10, 11-15, 16-20, 21-25 on the x-axis and frequency (10 to 50) on the y-axis.]
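Under the uniformity assumption, a balanced vector can be computed by walking the histogram and cutting wherever a 1/n share of the total frequency is reached (a sketch; the bucket format (low, high, frequency) and the example data are invented for illustration):

```python
def balanced_vector(buckets, n):
    """Build an (n-1)-entry range-partitioning vector from a histogram.

    buckets: list of (low, high, frequency), sorted by value range.
    A uniform value distribution is assumed inside each bucket.
    """
    total = sum(f for _, _, f in buckets)
    targets = iter(total * k / n for k in range(1, n))
    target = next(targets, None)
    vector, seen = [], 0
    for low, high, freq in buckets:
        while target is not None and seen + freq >= target:
            frac = (target - seen) / freq          # interpolate inside bucket
            vector.append(low + frac * (high - low))
            target = next(targets, None)
        seen += freq
    return vector

# Two equally heavy buckets split evenly into four ranges:
vec = balanced_vector([(0, 10, 50), (10, 20, 50)], 4)
```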
Handling Skew Using Virtual Processor Partitioning
Skew in range partitioning can be handled elegantly using virtual processor partitioning:
create a large number of partitions (say 10 to 20 times the number of processors)
assign virtual partitions to processors either in round-robin fashion or based on the estimated cost of processing each virtual partition
Basic idea:
If any normal partition would have been skewed, it is very likely that the skew is spread over a number of virtual partitions
Skewed virtual partitions get spread across a number of processors, so work gets distributed evenly!
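The effect can be illustrated by comparing one partition per processor with many virtual partitions assigned round-robin (the partition sizes are invented; the skewed range is split across several fine-grained virtual partitions):

```python
def assign_virtual(partition_sizes, n_procs):
    """Assign (virtual) partitions to processors round-robin; return
    the resulting tuple load per processor."""
    loads = [0] * n_procs
    for k, size in enumerate(partition_sizes):
        loads[k % n_procs] += size
    return loads

# One partition per processor: the skewed range lands on one processor.
coarse = assign_virtual([40, 10, 10, 10], 4)
# 16 virtual partitions over the same data: the skewed range is split
# across several virtual partitions, which round-robin spreads evenly.
fine = assign_virtual([10, 10, 10, 10, 2, 3, 2, 3, 3, 2, 3, 2, 2, 3, 2, 3], 4)
```

Here the maximum per-processor load drops from 40 tuples to 18 although the data is identical.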
Interquery Parallelism
Outline
1 I/O Parallelism
2 Interquery Parallelism
3 Intraquery Parallelism Intraoperation Parallelism Interoperation Parallelism
4 Query Optimization and System Design
Interquery Parallelism
Interquery Parallelism
Queries/transactions execute in parallel with one another.
Increases transaction throughput; used primarily to scale up a transaction-processing system to support a larger number of transactions per second.
Easiest form of parallelism to support, particularly in a shared-memory parallel database, because even sequential database systems support concurrent processing.
More complicated to implement on shared-disk or shared-nothing architectures
Locking and logging must be coordinated by passing messages between processors.
Data in a local buffer may have been updated at another processor.
Cache coherency has to be maintained — reads and writes of data in buffer must find latest version of data.
Cache Coherency Protocol
Example of a cache coherency protocol for shared-disk systems:
Before reading/writing to a page, the page must be locked in shared/exclusive mode.
On locking a page, the page must be read from disk
Before unlocking a page, the page must be written to disk if it was modified.
More complex protocols with fewer disk reads/writes exist.
Cache coherency protocols for shared-nothing systems are similar.
Each database page is assigned a home processor. Requests to fetch the page or write it to disk are sent to the home processor.
Intraquery Parallelism
Intraquery Parallelism
Execution of a single query in parallel on multiple processors/disks;
important for speeding up long-running queries.
Two complementary forms of intraquery parallelism:
Intraoperation Parallelism — parallelize the execution of each individual operation in the query.
Interoperation Parallelism — execute the different operations in a query expression in parallel.
Intraoperation parallelism scales betterwith increasing parallelism because the number of tuples processed by each operation is typically more than the number of operations in a query.
Intraquery Parallelism Intraoperation Parallelism
Parallel Processing of Relational Operations
Our discussion of parallel algorithms assumes:
read-only queries
shared-nothing architecture
n processors, P0, . . . , Pn−1, and n disks D0, . . . , Dn−1, where disk Di is associated with processor Pi.
If a processor has multiple disks: simulate a single disk Di.
Shared-nothing architectures can be efficiently simulated on shared-memory and shared-disk systems.
Algorithms for shared-nothing systems can thus be run on shared-memory and shared-disk systems.
However, some optimizations may be possible.
Parallel Sort/1
Range-Partitioning Sort
Choose processors P0, . . . , Pm−1, where m ≤ n, to do the sorting.
Create a range-partition vector with m ranges on the sorting attributes
Redistribute the relation using range partitioning
all tuples that lie in the i-th range are sent to processor Pi
Pi stores the tuples it receives temporarily on disk Di
this step requires I/O and communication overhead
Each processor Pi sorts its partition of the relation locally.
Each processor executes the same operation (sort) in parallel with the other processors, without any interaction with them (data parallelism).
Final merge operation is trivial: range partitioning ensures that, for 0 ≤ i < j < m, the key values in processor Pi are all less than the key values in Pj.
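A sequential sketch of the algorithm, with lists standing in for disks and the per-partition sorts representing the parallel local sorts (a simulation of the data flow, not a parallel implementation):

```python
def parallel_range_sort(relation, vector):
    """Range-partitioning sort, simulated sequentially: redistribute
    tuples by the partitioning vector, sort each partition locally,
    then simply concatenate the partitions (the trivial merge)."""
    partitions = [[] for _ in range(len(vector) + 1)]
    for t in relation:                       # redistribution step
        i = sum(1 for v in vector if t >= v)
        partitions[i].append(t)
    for p in partitions:                     # local sorts (run in parallel)
        p.sort()
    result = []
    for p in partitions:                     # trivial merge: concatenation
        result += p
    return result
```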
Parallel Sort/2
Parallel External Sort-Merge
Assume the relation has already been partitioned among disks D0, . . . , Dn−1 (in whatever manner).
Each processor Pi locally sorts the data on disk Di.
The sorted runs of the processors are merged to get the final sorted output.
Parallelize the merging of sorted runs as follows:
The sorted partitions at each processor Pi are range-partitioned across the processors P0, . . . , Pm−1.
Each processor Pi performs a merge on the streams as they are received, to get a single sorted run.
The sorted runs on processors P0, . . . , Pm−1 are concatenated to get the final result.
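The same simulation style for parallel external sort-merge: each initial disk partition is sorted locally, the sorted runs are range-partitioned, and each receiving processor merges its incoming sorted streams (here with heapq.merge):

```python
import heapq

def parallel_sort_merge(disk_partitions, vector):
    """Parallel external sort-merge, simulated sequentially."""
    runs = [sorted(p) for p in disk_partitions]        # local sorts
    m = len(vector) + 1
    # fragments[i][j]: the part of sorted run j destined for processor i
    fragments = [[[] for _ in runs] for _ in range(m)]
    for j, run in enumerate(runs):                     # range-partition runs
        for t in run:
            i = sum(1 for v in vector if t >= v)
            fragments[i][j].append(t)
    # each processor merges its incoming sorted streams into one run
    merged = [list(heapq.merge(*frags)) for frags in fragments]
    result = []
    for part in merged:                                # concatenate
        result += part
    return result
```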
Parallel Join
The join operation requires pairs of tuples to be tested to see if they satisfy the join condition; if they do, the pair is added to the join output.
Parallel join algorithms attempt to split the pairs to be tested over several processors. Each processor then computes part of the join locally.
In a final step, the results from each processor can be collected together to produce the final result.
Partitioned Join/1
For equi-joins and natural joins, it is possible to partition the two input relations across the processors and compute the join locally at each processor.
Let r and s be the input relations; we want to compute r ⋈r.A=s.B s.
r and s are each partitioned into n partitions, denoted r0, r1, . . . , rn−1 and s0, s1, . . . , sn−1.
Can use either range partitioning or hash partitioning.
r and s must be partitioned on their join attributes (r.A and s.B), using the same range-partitioning vector or hash function.
Partitions ri and si are sent to processor Pi.
Each processor Pi locally computes ri ⋈ri.A=si.B si. Any of the standard join methods can be used.
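A sketch of a partitioned equi-join using hash partitioning, with a simple hash index as the local join method (the key_r/key_s extractors and the example tuples are illustrative):

```python
def partitioned_hash_join(r, s, n, key_r, key_s):
    """Partitioned equi-join: hash-partition both inputs on the join
    attribute, then join each partition pair locally (dict lookup)."""
    r_parts = [[] for _ in range(n)]
    s_parts = [[] for _ in range(n)]
    for t in r:
        r_parts[hash(key_r(t)) % n].append(t)
    for t in s:
        s_parts[hash(key_s(t)) % n].append(t)
    out = []
    for ri, si in zip(r_parts, s_parts):     # this loop body runs on Pi
        index = {}
        for t in si:
            index.setdefault(key_s(t), []).append(t)
        for t in ri:
            for u in index.get(key_r(t), []):
                out.append((t, u))
    return out

r = [(1, "a"), (2, "b"), (3, "c")]
s = [(2, "x"), (3, "y"), (4, "z")]
result = partitioned_hash_join(r, s, 4, lambda t: t[0], lambda t: t[0])
```

Matching tuples are guaranteed to land in the same partition pair because both relations use the same hash function on the join attribute.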
Partitioned Join/2
[Figure: r and s are each partitioned into r0, . . . , r3 and s0, . . . , s3; partition pair (ri, si) is sent to processor Pi.]
Partitioned Parallel Hash-Join/1
Parallelizing partitioned hash join:
Assume s is smaller than r; therefore s is chosen as the build relation.
A hash function h1 takes the join attribute value of each tuple in s and maps this tuple to one of the n processors.
Each processor Pi reads the tuples of s that are on its disk Di, and sends each tuple to the appropriate processor based on hash function h1.
Let si denote the tuples of relation s that are sent to processor Pi.
As tuples of relation s are received at the destination processors, they are partitioned further using another hash function, h2, which is used to compute the hash-join locally.
Partitioned Parallel Hash-Join/2
Once the tuples of s have been distributed, the larger relation r is redistributed across the n processors using the hash function h1.
Let ri denote the tuples of relation r that are sent to processor Pi.
As the r tuples are received at the destination processors, they are repartitioned using the function h2
(just as the probe relation is partitioned in the sequential hash-join algorithm).
Each processor Pi executes the build and probe phases of the hash-join algorithm on the local partitions ri and si to produce a partition of the final result of the hash-join.
Note: Hash-join optimizations can be applied to the parallel case
e.g., the hybrid hash-join algorithm can be used to cache some of the incoming tuples in memory and avoid the cost of writing them and reading them back in.
Fragment-and-Replicate Join/1
Partitioning is not possible for some join conditions
E.g., non-equijoin conditions, such as r.A > s.B.
For joins where partitioning is not applicable, parallelization can be accomplished by the fragment-and-replicate technique
Special case – asymmetric fragment-and-replicate:
One of the relations, say r, is partitioned; any partitioning technique can be used.
The other relation, s, is replicated across all the processors.
Processor Pi then locally computes the join of ri with all of s using any join technique.
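A sketch of the asymmetric variant for an arbitrary join predicate (round-robin fragmentation and a nested loop stand in for "any partitioning technique" and "any join technique"):

```python
def asym_fragment_replicate_join(r, s, n, pred):
    """Asymmetric fragment-and-replicate: partition r (round-robin here,
    any technique works), replicate s to every processor, and evaluate
    an arbitrary join predicate locally."""
    r_parts = [r[i::n] for i in range(n)]    # fragment r
    out = []
    for ri in r_parts:                       # Pi joins ri with all of s
        for t in ri:
            for u in s:                      # replicated copy of s
                if pred(t, u):
                    out.append((t, u))
    return out

# Non-equijoin r.A > s.B on two processors:
result = asym_fragment_replicate_join([1, 5, 9], [2, 6], 2, lambda a, b: a > b)
```

Because every ri tuple meets every s tuple, the predicate can be anything, not just equality.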
Parallel Nested-Loop Join
Assume that
relation s is much smaller than relation r
r is stored by partitioning (partitioning technique irrelevant)
there is an index on a join attribute of relation r at each of the partitions of relation r.
Use asymmetric fragment-and-replicate, with relation s being replicated and the existing partitioning of relation r being used.
Each processor Pj where a partition of relation s is stored reads the tuples of relation s stored in Dj, and replicates the tuples to every other processor Pi.
At the end of this phase, relation s is replicated at all sites that store tuples of relation r.
Each processor Pi performs an indexed nested-loop join of relation s with the i-th partition of relation r.
Fragment-and-Replicate Join/2
[Figure: (left) asymmetric fragment and replicate: r is partitioned into r0, . . . , rn−1 across processors P0, . . . , Pn−1 while s is replicated to all of them; (right) general fragment and replicate: processors Pi,j form an n × m grid, ri is replicated to all processors in row i, and sj is replicated to all processors in column j.]
Fragment-and-Replicate Join/3
General case: reduces the sizes of the relations at each processor.
r is partitioned into n partitions r0, r1, . . . , rn−1; s is partitioned into m partitions s0, s1, . . . , sm−1.
Any partitioning technique may be used.
There must be at least m × n processors.
Label the processors as
P0,0, P0,1, . . . , P0,m−1, P1,0, . . . , Pn−1,m−1.
Pi,j computes the join of ri with sj. In order to do so, ri is replicated to Pi,0, Pi,1, . . . , Pi,m−1, while si is replicated to P0,i, P1,i, . . . , Pn−1,i.
Any join technique can be used at each processor Pi,j.
Fragment-and-Replicate Join/4
Both versions of fragment-and-replicate work with any join condition, since every tuple in r can be tested with every tuple in s.
Usually has a higher cost than partitioning, since one of the relations (for asymmetric fragment-and-replicate) or both relations (for general fragment-and-replicate) have to be replicated.
Sometimes asymmetric fragment-and-replicate is preferable even though partitioning could be used, e.g., when s is small and replicating it is cheaper than partitioning both relations.
Other Relational Operations/1
Selection σθ(r)
If θ is of the form ai = v, where ai is an attribute and v a value:
If r is partitioned on ai, the selection is performed at a single processor.
If θ is of the form l ≤ ai ≤ u (i.e., θ is a range selection) and the relation has been range-partitioned on ai:
The selection is performed at each processor whose partition overlaps with the specified range of values.
In all other cases: the selection is performed in parallel at all the processors.
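For a range-partitioned relation, the set of processors that must participate in a range selection l ≤ ai ≤ u can be determined from the partitioning vector alone (a sketch with inclusive bounds; a point query is the case lo == hi):

```python
def disks_for_range(lo, hi, vector):
    """Indices of the range partitions that overlap the query range
    [lo, hi], given a range-partitioning vector."""
    first = sum(1 for v in vector if lo >= v)   # partition holding lo
    last = sum(1 for v in vector if hi >= v)    # partition holding hi
    return list(range(first, last + 1))
```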
Other Relational Operations/2
Duplicate elimination
Perform by using either of the parallel sort techniques
eliminate duplicates as soon as they are found during sorting.
Can also partition the tuples (using either range or hash partitioning) and perform duplicate elimination locally at each processor.
Projection
Projection without duplicate elimination can be performed as tuples are read in from disk in parallel.
If duplicate elimination is required, any of the above duplicate elimination techniques can be used.
Grouping/Aggregation
Partition the relation on the grouping attributes and then compute the aggregate values locally at each processor.
Can reduce the cost of transferring tuples during partitioning by partly computing aggregate values before partitioning.
Consider the sum aggregation operation:
Perform the aggregation operation at each processor Pi on those tuples stored on disk Di
results in tuples with partial sums at each processor.
The result of the local aggregation is partitioned on the grouping attributes, and the aggregation is performed again at each processor Pi to get the final result.
Fewer tuples need to be sent to other processors during partitioning.
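A sketch of the two-phase scheme for SUM: each processor pre-aggregates its local tuples, then only the partial sums are repartitioned on the grouping attribute and combined (the key/value extractors and example data are illustrative):

```python
def parallel_sum(local_partitions, n, key, value):
    """Two-phase parallel SUM: pre-aggregate locally on each disk,
    repartition only the partial sums on the grouping attribute,
    then combine at the destination processors."""
    partials = []
    for part in local_partitions:            # phase 1: local partial sums
        acc = {}
        for t in part:
            acc[key(t)] = acc.get(key(t), 0) + value(t)
        partials.append(acc)
    final = [{} for _ in range(n)]           # phase 2: repartition + combine
    for acc in partials:
        for g, v in acc.items():
            dest = hash(g) % n               # same group -> same processor
            final[dest][g] = final[dest].get(g, 0) + v
    result = {}
    for d in final:
        result.update(d)
    return result

parts = [[("a", 1), ("b", 2)], [("a", 3), ("b", 4), ("a", 5)]]
sums = parallel_sum(parts, 2, key=lambda t: t[0], value=lambda t: t[1])
```

Only one partial sum per group and partition crosses the network, instead of every input tuple.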
Cost of Parallel Evaluation of Operations
If there is no skew in the partitioning, and no overhead due to the parallel evaluation, the expected speedup will be n.
If skew and overheads are taken into account, the time taken by a parallel operation can be estimated as
Tpart + Tasm + max(T0, T1, . . . , Tn−1)
Tpart is the time for partitioning the relations
Tasm is the time for assembling the results
Ti is the time taken for the operation at processor Pi
this needs to be estimated taking into account the skew, and the time wasted in contentions.
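The estimate itself is a one-liner; the point of the max term is that the slowest (most skewed) processor determines the parallel portion:

```python
def parallel_op_time(t_part, t_asm, local_times):
    """Tpart + Tasm + max(T0, ..., Tn−1)."""
    return t_part + t_asm + max(local_times)

balanced = parallel_op_time(2, 1, [5, 5, 5, 5])   # no skew
skewed = parallel_op_time(2, 1, [5, 5, 5, 9])     # one slow processor dominates
```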
Intraquery Parallelism Interoperation Parallelism
Interoperator Parallelism
Two types of interoperation parallelism:
pipelined parallelism
independent parallelism
Pipelined Parallelism
Example: Consider a join of four relations r1 ⋈ r2 ⋈ r3 ⋈ r4
Set up a pipeline that computes the three joins in parallel:
Let P1 be assigned the computation of temp1 = r1 ⋈ r2
And P2 be assigned the computation of temp2 = temp1 ⋈ r3
And P3 be assigned the computation of temp2 ⋈ r4
Each operation can execute in parallel, sending result tuples to the next operation even while it is computing further results
Requires a pipelineable (non-blocking) join evaluation algorithm (e.g., indexed nested-loop join)
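Pipelining can be mimicked with generators: each join stage yields result tuples as soon as they are produced, and the next stage consumes them immediately. A two-stage sketch with pre-built hash indexes (the relations and keys are invented; an indexed nested-loop join is non-blocking in exactly this way):

```python
def join_stream(left, right_index, key):
    """Non-blocking indexed join: emits a result as soon as a left
    tuple arrives, instead of materializing the whole input first."""
    for t in left:
        for u in right_index.get(key(t), []):
            yield t + u

r1 = [(1,), (2,)]
idx2 = {1: [(10,)], 2: [(20,)]}       # r2, indexed on its join key
idx3 = {10: [(100,)], 20: [(200,)]}   # r3, indexed on its join key
temp1 = join_stream(r1, idx2, key=lambda t: t[0])     # runs "on P1"
temp2 = join_stream(temp1, idx3, key=lambda t: t[1])  # runs "on P2"
result = list(temp2)   # tuples flow through both stages one at a time
```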
Factors Limiting Utility of Pipeline Parallelism
Pipeline parallelism is useful since it avoids writing intermediate results to disk
Useful with a small number of processors, but does not scale up well with more processors. One reason is that pipeline chains do not attain sufficient length.
Cannot pipeline operators which do not produce output until all inputs have been accessed (e.g., aggregate and sort)
Little speedup is obtained for the frequent cases of execution skew in which one operator's execution cost is much higher than the others.
Independent Parallelism
Example: Consider a join of four relations r1 ⋈ r2 ⋈ r3 ⋈ r4
Independent parallelism:
Let P1 be assigned the computation of temp1 = r1 ⋈ r2
And P2 be assigned the computation of temp2 = r3 ⋈ r4
And P3 be assigned the computation of temp1 ⋈ temp2
P1 and P2 can work independently in parallel
P3 has to wait for input from P1 and P2
Can pipeline the output of P1 and P2 to P3, combining independent parallelism and pipelined parallelism
Does not provide a high degree of parallelism:
useful with a lower degree of parallelism
less useful in a highly parallel system.
Query Optimization and System Design
Query Optimization/1
Query optimization in parallel databases is significantly more complex than query optimization in sequential databases.
Cost models are more complicated, since we must take into account partitioning costs and issues such as skew and resource contention.
When scheduling an execution tree in a parallel system, we must decide:
How to parallelize each operation and how many processors to use for it.
What operations to pipeline, what operations to execute independently in parallel, and what operations to execute sequentially, one after the other.
Determining the amount of resources to allocate for each operation is a problem.
E.g., allocating more processors than optimal can result in high communication overhead.
Long pipelines should be avoided, as the final operation may wait a long time for inputs while holding precious resources
Query Optimization/2
Use heuristics: the number of parallel evaluation plans is much larger than the number of sequential evaluation plans.
Heuristic 1: No pipelining, only intra-operation parallelism:
Parallelize every operation on all processors
Use standard optimization techniques, but with a new cost model
Heuristic 2: First choose the most efficient sequential plan, and then choose how best to parallelize the operations in that plan.
The Volcano parallel database popularized the exchange-operator model
the exchange operator is introduced into query plans to partition and distribute tuples
each operation works independently on local data on each processor, in parallel with other copies of the operation
Choosing a good physical storage organization (partitioning technique) is important to speed up queries.
Design of Parallel Systems/1
Some issues in the design of parallel systems:
Parallel loading of data from external sources is needed in order to handle large volumes of incoming data.
Resilience to failureof some processors or disks.
Probability of some disk or processor failing is higher in a parallel system.
Operation (perhaps with degraded performance) should be possible in spite of failure.
Redundancy is achieved by storing an extra copy of every data item at another processor.
Design of Parallel Systems/2
On-line reorganization of data and schema changes must be supported.
For example, index construction on terabyte databases can take hours or days even on a parallel system.
Need to allow other processing (insertions/deletions/updates) to be performed on relation even as index is being constructed.
Basic idea: index constructiontracks changesand “catches up” on changes at the end.
Also need support foron-line repartitioning and schema changes (executed concurrently with other processing).
Examples of Parallel Database Systems
Teradata (1979), appliance, still large market share
IBM Netezza (1999), appliance
Microsoft DATAllegro / Parallel Data Warehouse (2003), appliance
Greenplum (2005), Pivotal, open source
Vertica Analytic Database (2005), commodity hardware
Oracle Exadata (2008), appliance
SAP Hana (2010), main memory, appliance