Non-Standard Database Systems
Parallel Databases
Nikolaus Augsten
nikolaus.augsten@sbg.ac.at Department of Computer Sciences
University of Salzburg
http://dbresearch.uni-salzburg.at
Sommersemester 2019
Version 19. März 2019
Augsten (Univ. Salzburg) NSDB – Parallel Databases Sommersemester 2019 1 / 47
Outline
1 I/O Parallelism
2 Interquery Parallelism
3 Intraquery Parallelism
Intraoperation Parallelism
Interoperation Parallelism
4 Query Optimization and System Design
I/O Parallelism
Introduction
Parallel machines are becoming quite common and affordable
prices of microprocessors, memory, and disks have dropped sharply
recent desktop computers feature multiple processors, and this trend is projected to accelerate
Databases are growing
large volumes of transaction data are collected and stored for later analysis
large objects like multimedia data are increasingly stored in databases
Large-scale parallel database systems are increasingly used for:
storing large volumes of data
processing time-consuming decision-support queries
providing high throughput for transaction processing
Parallelism in Databases
Databases naturally lend themselves to parallelism:
Data can be partitioned across multiple disks for parallel I/O.
Individual relational operations (e.g., sort, join, aggregation) can be executed in parallel
each processor can work independently on its own data partition
Queries are expressed in a high-level language (SQL, translated to relational algebra)
makes parallelization easier
Different queries can be run in parallel with each other. Concurrency control takes care of conflicts.
I/O Parallelism/1
Reduce the time required to retrieve relations from disk by partitioning the relations on multiple disks.
Horizontal partitioning — tuples of a relation are divided among many disks such that each tuple resides on one disk.
Partitioning techniques (number of disks = n):
Round-robin:
send the i-th tuple inserted in the relation to disk i mod n.
Hash partitioning:
choose one or more attributes as the partitioning attributes.
choose a hash function h with range 0 . . . n − 1.
let i denote the result of hash function h applied to the partitioning attribute value of a tuple. Send the tuple to disk i.
I/O Parallelism/2
Range partitioning
Choose an attribute as the partitioning attribute.
A partitioning vector [v0, v1, . . . , vn−2] is chosen.
Partitioning: let v be the partitioning attribute value of a tuple. Tuples with vi ≤ v < vi+1 go to disk i + 1; tuples with v < v0 go to disk 0, and tuples with v ≥ vn−2 go to disk n − 1.
Example: with partitioning vector [5, 11], a tuple with partitioning attribute value 2 goes to disk 0, a tuple with value 8 goes to disk 1, and a tuple with value 20 goes to disk 2.
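The three schemes can be sketched as plain functions mapping a tuple's insertion position or partitioning attribute value to a disk number (a minimal illustration; the disk count n and the partitioning vector are the only assumed inputs):

```python
def round_robin_disk(i, n):
    """Disk for the i-th tuple inserted into the relation."""
    return i % n

def hash_disk(value, n):
    """Disk chosen by hashing the partitioning attribute value."""
    return hash(value) % n   # stands in for a hash function with range 0..n-1

def range_disk(value, vector):
    """Disk chosen by a partitioning vector [v0, ..., v_{n-2}]."""
    for i, v in enumerate(vector):
        if value < v:
            return i
    return len(vector)       # value >= v_{n-2}: last disk

# The example above: vector [5, 11] sends value 2 to disk 0,
# value 8 to disk 1, and value 20 to disk 2.
```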
Comparison of Partitioning Techniques/1
Evaluate how well partitioning techniques support the following types of data access:
1. scanning the entire relation.
2. locating a tuple associatively — point queries.
E.g., r.A = 25.
3. locating all tuples such that the value of a given attribute lies within a specified range — range queries.
E.g., 10 ≤ r.A < 25.
Comparison of Partitioning Techniques/2
Round robin:
Advantages
best suited for sequential scan of the entire relation
all disks have almost an equal number of tuples; retrieval work is thus well balanced between disks.
Point queries and range queries are difficult to process
no clustering — relevant tuples are scattered across all disks
Comparison of Partitioning Techniques/3
Hash partitioning:
Good for sequential access
assuming the hash function is good and the partitioning attributes form a key, tuples will be equally distributed between disks
retrieval work is then well balanced between disks.
Good for point queries on the partitioning attribute
look up a single disk, leaving others available for answering other queries.
No clustering, so difficult to answer range queries
Comparison of Partitioning Techniques/4
Range partitioning:
Provides data clustering by partitioning attribute value.
Good for sequential access
Good for point queries on the partitioning attribute: only one disk needs to be accessed.
For range queries on the partitioning attribute, one to a few disks may need to be accessed
remaining disks are available for other queries.
good if result tuples are from one to a few blocks.
if many blocks are to be fetched, they are still fetched from one to a few disks: potential parallelism in disk access is wasted.
Example: if we partition by order date, tuples with recent order dates will be accessed more frequently, leading to so-called execution skew
Partitioning a Relation across Disks
If a relation contains only a few tuples which will fit into a single disk block, then assign the relation to a single disk.
Large relations are preferably partitioned across all the available disks.
If a relation consists of m disk blocks and there are n disks available, then the relation should be allocated to min(m, n) disks.
Handling of Data Skew
Distribution of tuples to disks may be skewed: some disks have many tuples, while others have fewer tuples.
Skew limits speedup. Example:
relation with 1000 tuples is partitioned across 100 disks (10 tuples/disk)
expected speedup for scan: ×100
skew: one disk has 40 tuples ⇒ max. speedup is ×25
Types of skew:
Attribute-value skew:
Some values appear in the partitioning attributes of many tuples; all the tuples with the same value for the partitioning attribute end up in the same partition.
Can occur with range partitioning and hash partitioning.
Partition skew:
With range partitioning, a badly chosen partition vector may assign too many tuples to some partitions and too few to others.
Less likely with hash partitioning if a good hash function is chosen.
Handling Skew using Histograms
A balanced partitioning vector can be constructed from a histogram in a relatively straightforward fashion
Assume a uniform distribution within each range of the histogram
The histogram can be constructed by scanning the relation, or by sampling (blocks containing) tuples of the relation
[Figure: histogram of the partitioning attribute, with value ranges 1-5, 6-10, 11-15, 16-20, 21-25 on the x-axis and frequency (10 to 50) on the y-axis.]
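Under the uniformity assumption, a balanced vector can be computed by walking the histogram and cutting wherever a 1/n share of the total frequency is reached (a sketch; the bucket format (low, high, frequency) and the example data are invented for illustration):

```python
def balanced_vector(buckets, n):
    """Build an (n-1)-entry range-partitioning vector from a histogram.

    buckets: list of (low, high, frequency), sorted by value range.
    A uniform value distribution is assumed inside each bucket.
    """
    total = sum(f for _, _, f in buckets)
    targets = iter(total * k / n for k in range(1, n))
    target = next(targets, None)
    vector, seen = [], 0
    for low, high, freq in buckets:
        while target is not None and seen + freq >= target:
            frac = (target - seen) / freq          # interpolate inside bucket
            vector.append(low + frac * (high - low))
            target = next(targets, None)
        seen += freq
    return vector

# Two equally heavy buckets split evenly into four ranges:
vec = balanced_vector([(0, 10, 50), (10, 20, 50)], 4)
```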
Handling Skew Using Virtual Processor Partitioning
Skew in range partitioning can be handled elegantly using virtual processor partitioning:
create a large number of partitions (say 10 to 20 times the number of processors)
assign virtual partitions to processors either in round-robin fashion or based on the estimated cost of processing each virtual partition
Basic idea:
If any normal partition would have been skewed, it is very likely that the skew is spread over a number of virtual partitions
Skewed virtual partitions get spread across a number of processors, so work gets distributed evenly!
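The effect can be illustrated by comparing one partition per processor with many virtual partitions assigned round-robin (the partition sizes are invented; the skewed range is split across several fine-grained virtual partitions):

```python
def assign_virtual(partition_sizes, n_procs):
    """Assign (virtual) partitions to processors round-robin; return
    the resulting tuple load per processor."""
    loads = [0] * n_procs
    for k, size in enumerate(partition_sizes):
        loads[k % n_procs] += size
    return loads

# One partition per processor: the skewed range lands on one processor.
coarse = assign_virtual([40, 10, 10, 10], 4)
# 16 virtual partitions over the same data: the skewed range is split
# across several virtual partitions, which round-robin spreads evenly.
fine = assign_virtual([10, 10, 10, 10, 2, 3, 2, 3, 3, 2, 3, 2, 2, 3, 2, 3], 4)
```

Here the maximum per-processor load drops from 40 tuples to 18 although the data is identical.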
Interquery Parallelism
Outline
1 I/O Parallelism
2 Interquery Parallelism
3 Intraquery Parallelism Intraoperation Parallelism Interoperation Parallelism
4 Query Optimization and System Design
Interquery Parallelism
Interquery Parallelism
Queries/transactions execute in parallel with one another.
Increases transaction throughput; used primarily to scale up a transaction-processing system to support a larger number of transactions per second.
Easiest form of parallelism to support, particularly in a shared-memory parallel database, because even sequential database systems support concurrent processing.
More complicated to implement on shared-disk or shared-nothing architectures
Locking and logging must be coordinated by passing messages between processors.
Data in a local buffer may have been updated at another processor.
Cache coherency has to be maintained — reads and writes of data in buffer must find latest version of data.
Cache Coherency Protocol
Example of a cache coherency protocol for shared-disk systems:
Before reading/writing to a page, the page must be locked in shared/exclusive mode.
On locking a page, the page must be read from disk
Before unlocking a page, the page must be written to disk if it was modified.
More complex protocols with fewer disk reads/writes exist.
Cache coherency protocols for shared-nothing systems are similar.
Each database page is assigned a home processor. Requests to fetch the page or write it to disk are sent to the home processor.
Intraquery Parallelism
Intraquery Parallelism
Execution of a single query in parallel on multiple processors/disks;
important for speeding up long-running queries.
Two complementary forms of intraquery parallelism:
Intraoperation Parallelism — parallelize the execution of each individual operation in the query.
Interoperation Parallelism — execute the different operations in a query expression in parallel.
Intraoperation parallelism scales betterwith increasing parallelism because the number of tuples processed by each operation is typically more than the number of operations in a query.
Intraquery Parallelism Intraoperation Parallelism
Parallel Processing of Relational Operations
Our discussion of parallel algorithms assumes:
read-only queries
shared-nothing architecture
n processors, P0, . . . , Pn−1, and n disks D0, . . . , Dn−1, where disk Di is associated with processor Pi.
If a processor has multiple disks: simulate a single disk Di.
Shared-nothing architectures can be efficiently simulated on shared-memory and shared-disk systems.
Algorithms for shared-nothing systems can thus be run on shared-memory and shared-disk systems.
However, some optimizations may be possible.
Parallel Sort/1
Range-Partitioning Sort
Choose processors P0, . . . , Pm−1, where m ≤ n, to do the sorting.
Create a range-partition vector with m ranges on the sorting attributes
Redistribute the relation using range partitioning
all tuples that lie in the i-th range are sent to processor Pi
Pi stores the tuples it receives temporarily on disk Di
this step requires I/O and communication overhead
Each processor Pi sorts its partition of the relation locally.
Each processor executes the same operation (sort) in parallel with the other processors, without any interaction with them (data parallelism).
Final merge operation is trivial: range partitioning ensures that, for 0 ≤ i < j < m, the key values in processor Pi are all less than the key values in Pj.
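A sequential sketch of the algorithm, with lists standing in for disks and the per-partition sorts representing the parallel local sorts (a simulation of the data flow, not a parallel implementation):

```python
def parallel_range_sort(relation, vector):
    """Range-partitioning sort, simulated sequentially: redistribute
    tuples by the partitioning vector, sort each partition locally,
    then simply concatenate the partitions (the trivial merge)."""
    partitions = [[] for _ in range(len(vector) + 1)]
    for t in relation:                       # redistribution step
        i = sum(1 for v in vector if t >= v)
        partitions[i].append(t)
    for p in partitions:                     # local sorts (run in parallel)
        p.sort()
    result = []
    for p in partitions:                     # trivial merge: concatenation
        result += p
    return result
```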
Parallel Sort/2
Parallel External Sort-Merge
Assume the relation has already been partitioned among disks D0, . . . , Dn−1 (in whatever manner).
Each processor Pi locally sorts the data on disk Di.
The sorted runs of the processors are merged to get the final sorted output.
Parallelize the merging of sorted runs as follows:
The sorted partitions at each processor Pi are range-partitioned across the processors P0, . . . , Pm−1.
Each processor Pi performs a merge on the streams as they are received, to get a single sorted run.
The sorted runs on processors P0, . . . , Pm−1 are concatenated to get the final result.
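The same simulation style for parallel external sort-merge: each initial disk partition is sorted locally, the sorted runs are range-partitioned, and each receiving processor merges its incoming sorted streams (here with heapq.merge):

```python
import heapq

def parallel_sort_merge(disk_partitions, vector):
    """Parallel external sort-merge, simulated sequentially."""
    runs = [sorted(p) for p in disk_partitions]        # local sorts
    m = len(vector) + 1
    # fragments[i][j]: the part of sorted run j destined for processor i
    fragments = [[[] for _ in runs] for _ in range(m)]
    for j, run in enumerate(runs):                     # range-partition runs
        for t in run:
            i = sum(1 for v in vector if t >= v)
            fragments[i][j].append(t)
    # each processor merges its incoming sorted streams into one run
    merged = [list(heapq.merge(*frags)) for frags in fragments]
    result = []
    for part in merged:                                # concatenate
        result += part
    return result
```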
Parallel Join
The join operation requires pairs of tuples to be tested to see if they satisfy the join condition; if they do, the pair is added to the join output.
Parallel join algorithms attempt to split the pairs to be tested over several processors. Each processor then computes part of the join locally.
In a final step, the results from each processor can be collected together to produce the final result.
Partitioned Join/1
For equi-joins and natural joins, it is possible to partition the two input relations across the processors and compute the join locally at each processor.
Let r and s be the input relations; we want to compute r ⋈r.A=s.B s.
r and s are each partitioned into n partitions, denoted r0, r1, . . . , rn−1 and s0, s1, . . . , sn−1.
Can use either range partitioning or hash partitioning.
r and s must be partitioned on their join attributes (r.A and s.B), using the same range-partitioning vector or hash function.
Partitions ri and si are sent to processor Pi.
Each processor Pi locally computes ri ⋈ri.A=si.B si. Any of the standard join methods can be used.
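A sketch of a partitioned equi-join using hash partitioning, with a simple hash index as the local join method (the key_r/key_s extractors and the example tuples are illustrative):

```python
def partitioned_hash_join(r, s, n, key_r, key_s):
    """Partitioned equi-join: hash-partition both inputs on the join
    attribute, then join each partition pair locally (dict lookup)."""
    r_parts = [[] for _ in range(n)]
    s_parts = [[] for _ in range(n)]
    for t in r:
        r_parts[hash(key_r(t)) % n].append(t)
    for t in s:
        s_parts[hash(key_s(t)) % n].append(t)
    out = []
    for ri, si in zip(r_parts, s_parts):     # this loop body runs on Pi
        index = {}
        for t in si:
            index.setdefault(key_s(t), []).append(t)
        for t in ri:
            for u in index.get(key_r(t), []):
                out.append((t, u))
    return out

r = [(1, "a"), (2, "b"), (3, "c")]
s = [(2, "x"), (3, "y"), (4, "z")]
result = partitioned_hash_join(r, s, 4, lambda t: t[0], lambda t: t[0])
```

Matching tuples are guaranteed to land in the same partition pair because both relations use the same hash function on the join attribute.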
Partitioned Join/2
[Figure: r and s are each partitioned into r0, . . . , r3 and s0, . . . , s3; partition pair (ri, si) is sent to processor Pi.]
Partitioned Parallel Hash-Join/1
Parallelizing partitioned hash join:
Assume s is smaller than r; therefore s is chosen as the build relation.
A hash function h1 takes the join attribute value of each tuple in s and maps this tuple to one of the n processors.
Each processor Pi reads the tuples of s that are on its disk Di, and sends each tuple to the appropriate processor based on hash function h1.
Let si denote the tuples of relation s that are sent to processor Pi.
As tuples of relation s are received at the destination processors, they are partitioned further using another hash function, h2, which is used to compute the hash-join locally.
Partitioned Parallel Hash-Join/2
Once the tuples of s have been distributed, the larger relation r is redistributed across the n processors using the hash function h1.
Let ri denote the tuples of relation r that are sent to processor Pi.
As the r tuples are received at the destination processors, they are repartitioned using the function h2
(just as the probe relation is partitioned in the sequential hash-join algorithm).
Each processor Pi executes the build and probe phases of the hash-join algorithm on the local partitions ri and si to produce a partition of the final result of the hash-join.
Note: Hash-join optimizations can be applied to the parallel case
e.g., the hybrid hash-join algorithm can be used to cache some of the incoming tuples in memory and avoid the cost of writing them and reading them back in.
Fragment-and-Replicate Join/1
Partitioning is not possible for some join conditions
E.g., non-equijoin conditions, such as r.A > s.B.
For joins where partitioning is not applicable, parallelization can be accomplished by the fragment-and-replicate technique
Special case – asymmetric fragment-and-replicate:
One of the relations, say r, is partitioned; any partitioning technique can be used.
The other relation, s, is replicated across all the processors.
Processor Pi then locally computes the join of ri with all of s using any join technique.
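A sketch of the asymmetric variant for an arbitrary join predicate (round-robin fragmentation and a nested loop stand in for "any partitioning technique" and "any join technique"):

```python
def asym_fragment_replicate_join(r, s, n, pred):
    """Asymmetric fragment-and-replicate: partition r (round-robin here,
    any technique works), replicate s to every processor, and evaluate
    an arbitrary join predicate locally."""
    r_parts = [r[i::n] for i in range(n)]    # fragment r
    out = []
    for ri in r_parts:                       # Pi joins ri with all of s
        for t in ri:
            for u in s:                      # replicated copy of s
                if pred(t, u):
                    out.append((t, u))
    return out

# Non-equijoin r.A > s.B on two processors:
result = asym_fragment_replicate_join([1, 5, 9], [2, 6], 2, lambda a, b: a > b)
```

Because every ri tuple meets every s tuple, the predicate can be anything, not just equality.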
Parallel Nested-Loop Join
Assume that
relation s is much smaller than relation r
r is stored by partitioning (partitioning technique irrelevant)
there is an index on a join attribute of relation r at each of the partitions of relation r.
Use asymmetric fragment-and-replicate, with relation s being replicated and the existing partitioning of relation r being used.
Each processor Pj where a partition of relation s is stored reads the tuples of relation s stored in Dj, and replicates the tuples to every other processor Pi.
At the end of this phase, relation s is replicated at all sites that store tuples of relation r.
Each processor Pi performs an indexed nested-loop join of relation s with the i-th partition of relation r.
Fragment-and-Replicate Join/2
[Figure: (left) asymmetric fragment and replicate: r is partitioned into r0, . . . , rn−1 across processors P0, . . . , Pn−1 while s is replicated to all of them; (right) general fragment and replicate: processors Pi,j form an n × m grid, ri is replicated to all processors in row i, and sj is replicated to all processors in column j.]
Fragment-and-Replicate Join/3
General case: reduces the sizes of the relations at each processor.
r is partitioned into n partitions r0, r1, . . . , rn−1; s is partitioned into m partitions s0, s1, . . . , sm−1.
Any partitioning technique may be used.
There must be at least m × n processors.
Label the processors as
P0,0, P0,1, . . . , P0,m−1, P1,0, . . . , Pn−1,m−1.
Pi,j computes the join of ri with sj. In order to do so, ri is replicated to Pi,0, Pi,1, . . . , Pi,m−1, while si is replicated to P0,i, P1,i, . . . , Pn−1,i.
Any join technique can be used at each processor Pi,j.
Fragment-and-Replicate Join/4
Both versions of fragment-and-replicate work with any join condition, since every tuple in r can be tested with every tuple in s.
Usually has a higher cost than partitioning, since one of the relations (for asymmetric fragment-and-replicate) or both relations (for general fragment-and-replicate) have to be replicated.
Sometimes asymmetric fragment-and-replicate is preferable even though partitioning could be used, e.g., when s is small and replicating it is cheaper than partitioning both relations.
Other Relational Operations/1
Selection σθ(r)
If θ is of the form ai = v, where ai is an attribute and v a value:
If r is partitioned on ai, the selection is performed at a single processor.
If θ is of the form l ≤ ai ≤ u (i.e., θ is a range selection) and the relation has been range-partitioned on ai:
The selection is performed at each processor whose partition overlaps with the specified range of values.
In all other cases: the selection is performed in parallel at all the processors.
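For a range-partitioned relation, the set of processors that must participate in a range selection l ≤ ai ≤ u can be determined from the partitioning vector alone (a sketch with inclusive bounds; a point query is the case lo == hi):

```python
def disks_for_range(lo, hi, vector):
    """Indices of the range partitions that overlap the query range
    [lo, hi], given a range-partitioning vector."""
    first = sum(1 for v in vector if lo >= v)   # partition holding lo
    last = sum(1 for v in vector if hi >= v)    # partition holding hi
    return list(range(first, last + 1))
```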
Other Relational Operations/2
Duplicate elimination
Perform by using either of the parallel sort techniques
eliminate duplicates as soon as they are found during sorting.
Can also partition the tuples (using either range or hash partitioning) and perform duplicate elimination locally at each processor.
Projection
Projection without duplicate elimination can be performed as tuples are read in from disk in parallel.
If duplicate elimination is required, any of the above duplicate elimination techniques can be used.
Grouping/Aggregation
Partition the relation on the grouping attributes and then compute the aggregate values locally at each processor.
Can reduce the cost of transferring tuples during partitioning by partly computing aggregate values before partitioning.
Consider the sum aggregation operation:
Perform the aggregation operation at each processor Pi on those tuples stored on disk Di
results in tuples with partial sums at each processor.
The result of the local aggregation is partitioned on the grouping attributes, and the aggregation is performed again at each processor Pi to get the final result.
Fewer tuples need to be sent to other processors during partitioning.
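A sketch of the two-phase scheme for SUM: each processor pre-aggregates its local tuples, then only the partial sums are repartitioned on the grouping attribute and combined (the key/value extractors and example data are illustrative):

```python
def parallel_sum(local_partitions, n, key, value):
    """Two-phase parallel SUM: pre-aggregate locally on each disk,
    repartition only the partial sums on the grouping attribute,
    then combine at the destination processors."""
    partials = []
    for part in local_partitions:            # phase 1: local partial sums
        acc = {}
        for t in part:
            acc[key(t)] = acc.get(key(t), 0) + value(t)
        partials.append(acc)
    final = [{} for _ in range(n)]           # phase 2: repartition + combine
    for acc in partials:
        for g, v in acc.items():
            dest = hash(g) % n               # same group -> same processor
            final[dest][g] = final[dest].get(g, 0) + v
    result = {}
    for d in final:
        result.update(d)
    return result

parts = [[("a", 1), ("b", 2)], [("a", 3), ("b", 4), ("a", 5)]]
sums = parallel_sum(parts, 2, key=lambda t: t[0], value=lambda t: t[1])
```

Only one partial sum per group and partition crosses the network, instead of every input tuple.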
Cost of Parallel Evaluation of Operations
If there is no skew in the partitioning, and no overhead due to the parallel evaluation, the expected speedup will be n.
If skew and overheads are taken into account, the time taken by a parallel operation can be estimated as
Tpart + Tasm + max(T0, T1, . . . , Tn−1)
Tpart is the time for partitioning the relations
Tasm is the time for assembling the results
Ti is the time taken for the operation at processor Pi
this needs to be estimated taking into account the skew, and the time wasted in contentions.
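The estimate itself is a one-liner; the point of the max term is that the slowest (most skewed) processor determines the parallel portion:

```python
def parallel_op_time(t_part, t_asm, local_times):
    """Tpart + Tasm + max(T0, ..., Tn−1)."""
    return t_part + t_asm + max(local_times)

balanced = parallel_op_time(2, 1, [5, 5, 5, 5])   # no skew
skewed = parallel_op_time(2, 1, [5, 5, 5, 9])     # one slow processor dominates
```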
Intraquery Parallelism Interoperation Parallelism
Interoperator Parallelism
Two types of interoperation parallelism:
pipelined parallelism
independent parallelism
Pipelined Parallelism
Example: Consider a join of four relations r1 ⋈ r2 ⋈ r3 ⋈ r4
Set up a pipeline that computes the three joins in parallel:
Let P1 be assigned the computation of temp1 = r1 ⋈ r2
And P2 be assigned the computation of temp2 = temp1 ⋈ r3
And P3 be assigned the computation of temp2 ⋈ r4
Each operation can execute in parallel, sending result tuples to the next operation even while it is computing further results
Requires a pipelineable (non-blocking) join evaluation algorithm (e.g., indexed nested-loop join)
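Pipelining can be mimicked with generators: each join stage yields result tuples as soon as they are produced, and the next stage consumes them immediately. A two-stage sketch with pre-built hash indexes (the relations and keys are invented; an indexed nested-loop join is non-blocking in exactly this way):

```python
def join_stream(left, right_index, key):
    """Non-blocking indexed join: emits a result as soon as a left
    tuple arrives, instead of materializing the whole input first."""
    for t in left:
        for u in right_index.get(key(t), []):
            yield t + u

r1 = [(1,), (2,)]
idx2 = {1: [(10,)], 2: [(20,)]}       # r2, indexed on its join key
idx3 = {10: [(100,)], 20: [(200,)]}   # r3, indexed on its join key
temp1 = join_stream(r1, idx2, key=lambda t: t[0])     # runs "on P1"
temp2 = join_stream(temp1, idx3, key=lambda t: t[1])  # runs "on P2"
result = list(temp2)   # tuples flow through both stages one at a time
```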
Factors Limiting Utility of Pipeline Parallelism
Pipeline parallelism is useful since it avoids writing intermediate results to disk
Useful with a small number of processors, but does not scale up well with more processors. One reason is that pipeline chains do not attain sufficient length.
Cannot pipeline operators which do not produce output until all inputs have been accessed (e.g., aggregate and sort)
Little speedup is obtained for the frequent cases of execution skew in which one operator's execution cost is much higher than the others.
Independent Parallelism
Example: Consider a join of four relations r1 ⋈ r2 ⋈ r3 ⋈ r4
Independent parallelism:
Let P1 be assigned the computation of temp1 = r1 ⋈ r2
And P2 be assigned the computation of temp2 = r3 ⋈ r4
And P3 be assigned the computation of temp1 ⋈ temp2
P1 and P2 can work independently in parallel
P3 has to wait for input from P1 and P2
Can pipeline the output of P1 and P2 to P3, combining independent parallelism and pipelined parallelism
Does not provide a high degree of parallelism:
useful with a lower degree of parallelism
less useful in a highly parallel system.
Query Optimization and System Design
Query Optimization/1
Query optimization in parallel databases is significantly more complex than query optimization in sequential databases.
Cost models are more complicated, since we must take into account partitioning costs and issues such as skew and resource contention.
When scheduling an execution tree in a parallel system, we must decide:
How to parallelize each operation and how many processors to use for it.
What operations to pipeline, what operations to execute independently in parallel, and what operations to execute sequentially, one after the other.
Determining the amount of resources to allocate for each operation is a problem.
E.g., allocating more processors than optimal can result in high communication overhead.
Long pipelines should be avoided, as the final operation may wait a long time for inputs while holding precious resources
Query Optimization/2
Use heuristics: the number of parallel evaluation plans is much larger than the number of sequential evaluation plans.
Heuristic 1: No pipelining, only intra-operation parallelism:
Parallelize every operation on all processors
Use standard optimization techniques, but with a new cost model
Heuristic 2: First choose the most efficient sequential plan, and then choose how best to parallelize the operations in that plan.
The Volcano parallel database popularized the exchange-operator model
the exchange operator is introduced into query plans to partition and distribute tuples
each operation works independently on local data on each processor, in parallel with other copies of the operation
Choosing a good physical storage organization (partitioning technique) is important to speed up queries.
Design of Parallel Systems/1
Some issues in the design of parallel systems:
Parallel loading of data from external sources is needed in order to handle large volumes of incoming data.
Resilience to failureof some processors or disks.
Probability of some disk or processor failing is higher in a parallel system.
Operation (perhaps with degraded performance) should be possible in spite of failure.
Redundancy is achieved by storing an extra copy of every data item at another processor.
Design of Parallel Systems/2
On-line reorganization of data and schema changes must be supported.
For example, index construction on terabyte databases can take hours or days even on a parallel system.
Need to allow other processing (insertions/deletions/updates) to be performed on relation even as index is being constructed.
Basic idea: index constructiontracks changesand “catches up” on changes at the end.
Also need support foron-line repartitioning and schema changes (executed concurrently with other processing).
Examples of Parallel Database Systems
Teradata (1979), appliance, still large market share
IBM Netezza (1999), appliance
Microsoft DATAllegro / Parallel Data Warehouse (2003), appliance
Greenplum (2005), Pivotal, open source
Vertica Analytic Database (2005), commodity hardware
Oracle Exadata (2008), appliance
SAP Hana (2010), main memory, appliance