
(1)

Relational Database Systems 2

Silke Eckstein

Benjamin Köhncke

Institut für Informationssysteme

Technische Universität Braunschweig http://www.ifis.cs.tu-bs.de

2. Physical Data Storage

(2)

1 Architecture

[Figure: DBMS reference architecture — application programs, DDL programs, and direct queries enter through the application interfaces used by application programmers and DB administrators; the query processor (DDL interpreter, DML compiler, embedded-DML precompiler, query evaluation engine producing object code programs) and the transaction manager operate on top of the data storage manager (buffer manager, file manager), which maintains the DB scheme, catalog/dictionary, indices, and statistics.]

(3)

2 Physical Data Storage

2.1 Introduction
2.2 Hard Disks
2.3 RAIDs
2.4 SANs and NAS
2.5 Case Study

(4)

2.1 Physical Storage Introduction

• DBMS needs to retrieve, update, and process persistently stored data

– Storage consideration is an important factor in planning a database system (physical layer)

– Remember: the data has to be securely stored, but access to the data should be declarative!

(5)

2.1 Physical Storage Introduction

• Data is stored on storage media. Media differ greatly in terms of

– Random access speed
– Random/sequential read/write speed
– Capacity
– Cost per capacity

(6)

2.1 Relevant Media Characteristics

• Capacity: quantifies the amount of data that can be stored

– Base units: 1 Bit; 1 Byte = 2^3 Bit = 8 Bit

– Capacity units according to IEC, IEEE, NIST, etc.:
  Usually used for file sizes and primary storage (for a higher degree of confusion, sometimes used with SI abbreviations…)
  1 KiB = 1024^1 Byte; 1 MiB = 1024^2 Byte; 1 GiB = 1024^3 Byte; …

– Capacity units according to SI:
  Usually used for advertising secondary/tertiary storage (see the conversion sketch below)
  1 KB = 1000^1 Byte ≈ 0.976 KiB; 1 MB = 1000^2 Byte ≈ 0.954 MiB;
  1 GB = 1000^3 Byte ≈ 0.931 GiB; …

– Especially used by the networking community:
  1 Kb = 1000^1 Bit = 0.125 KB ≈ 0.122 KiB; 1 Mb = 1000^2 Bit = 0.125 MB ≈ 0.119 MiB
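To make the SI/IEC difference concrete, here is a minimal Python sketch (illustrative only, not from the slides) converting between the two unit systems:

```python
# Illustrative sketch: SI (decimal) vs. IEC (binary) capacity units.
SI = {"KB": 1000**1, "MB": 1000**2, "GB": 1000**3, "TB": 1000**4}
IEC = {"KiB": 1024**1, "MiB": 1024**2, "GiB": 1024**3, "TiB": 1024**4}

def si_to_iec(value, si_unit, iec_unit):
    """Convert e.g. advertised (decimal) gigabytes into binary gibibytes."""
    return value * SI[si_unit] / IEC[iec_unit]

# An advertised "2 TB" drive holds only about 1.82 TiB:
print(si_to_iec(2, "TB", "TiB"))   # ~1.819
```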

(7)

2.1 A Kilo-Joke

[Comic: http://xkcd.com/]

(8)

2.1 Characteristic Parameters

• Random Access Time: average time to access a random piece of data at a known media position

– Usually measured in ms or ns
– Within some media, access time can vary depending on the position (e.g. hard disks)

• Transfer Rate: average amount of consecutive data that can be transferred per time unit

– Usually measured in KB/sec, MB/sec, GB/sec, …
– Sometimes also in Kb/sec, Mb/sec, Gb/sec

(9)

2.1 Other Characteristics

• Volatile: memory needs constant power to keep data
– Dynamic: dynamic volatile memory needs to be "refreshed" regularly to keep data
– Static: no refresh necessary

• Access Modes
– Random Access: any piece of data can be accessed in approximately the same time
– Sequential Access: data can only be accessed in sequential order

• Write Mode
– Mutable Storage: can be read and written arbitrarily
– Write Once Read Many (WORM): interesting for legal issues, e.g. Sarbanes-Oxley Act (2002)

(10)

2.1 Online, Nearline, Offline

• Online media ("always on")
– Each single piece of data can be accessed fast
– e.g. hard drives, main memory

• Nearline media
– Compromise between online and offline
– Offline media that can automatically be put "online"
– e.g. juke boxes, robot libraries

• Offline media (disconnected media)
– Not under direct control of the processing unit
– Have to be connected manually
– e.g. box of backup tapes in the basement

(11)

2.1 The Storage Hierarchy

• Media characteristics result in a storage hierarchy
• DBMS optimize data distribution among the storage levels

– Primary Storage: fast, limited capacity, high price, usually volatile electronic storage
  – Frequently used data / current work data
– Secondary Storage: slower, large capacity, lower price
  – Main stored data
– Tertiary Storage: even slower, huge capacity, even lower price, usually offline
  – Backup and long-term storage of infrequently used data

(12)

2.1 The Storage Hierarchy

[Figure: storage pyramid — cost and speed decrease from top to bottom: Primary (cache, RAM; ~100 ns), Secondary (flash, magnetic disks; ~10 ms), Tertiary (optical disks, tape; > 1 s)]

(13)

2.1 Storage Media – Examples

Type | Media | Size | Random Acc. Speed | Transfer Speed | Characteristics | Price | Price/GB
Pri | L1 Processor Cache (Intel QX9000) | 32 KiB | 0.0008 ms | 6200 MB/sec | Vol, Stat, RA, OL | – | –
Pri | DDR3-RAM (Corsair 1600C7DHX) | 2 GiB | 0.004 ms | 8000 MB/sec | Vol, Dyn, RA, OL | €38 | €19
Sec | Harddrive SSD (OCZ Vertex2) | 160 GB | < 1 ms | 285 MB/sec | Stat, RA, OL | €239 | €1.50
Sec | Harddrive Magnetic (Seagate ST32000641AS) | 2000 GB | 8.5 ms | 138 MB/sec | Stat, RA, OL | €143 | €0.07
Ter | DVD+R (Verbatim DVD+R) | 4.7 GB | 98 ms | 11 MB/sec | Stat, RA, OF, WORM | €0.36/disk | €0.07
Ter | LTO Streamer (Freecom LTO-920i) | 800 GB | 58 sec | 120 MB/sec | Stat, SA, OF | €80/tape | €0.10

Last updated April 2011

Pri = Primary, Sec = Secondary, Ter = Tertiary
Vol = Volatile, Stat = Static, Dyn = Dynamic, RA = Random Access, SA = Sequential Access, OL = Online, OF = Offline, WORM = Write Once Read Many

(14)

2.2 Magnetic Disk Storage – HDs

• Hard drives are currently the standard for large, cheap, and persistent storage
– Usually used as the main storage medium for most data in a DB

• DBMS need to be optimized for efficient disk storage and access
– Data access needs to be as fast as possible
– Often-used data should be accessible with highest speed; rarely needed data may take longer
– Different data items needed for certain recurring tasks should also be stored/accessed together

(15)

2.2 HD – How does it work?

• Directional magnetization of a ferromagnetic material, realized on hard disk platters
– Base platter made of non-magnetic aluminum or glass substrate
– Magnetic grains worked into the base platter form magnetic regions
– Each region represents 1 Bit
• The read head detects the magnetization direction of each region
• The write head may change the direction

(16)

2.2 HD – Notable Technology Advances

• Giant Magnetoresistance Effect (GMR)
– Discovered in 1988 simultaneously and independently by Peter Grünberg and Albert Fert
– Both honored with the 2007 Nobel Prize in Physics
– Allows the construction of efficient read heads: the electric resistance of alternating ferromagnetic and non-magnetic layers changes "giantly" with changing magnetic field direction

http://www.research.ibm.com/research/demos/gmr/cyberdemo1.htm

(17)

2.2 HD – Notable Technology Advances

• Perpendicular Recording (used since 2005)
– Longitudinal recording is limited to ~200 Gb/inch² due to the superparamagnetic effect
  – Thermal energy may spontaneously change the magnetic direction
– Perpendicular recording allows for up to 1000 Gb/inch²
– Very simplified: align the magnetic field orthogonally to the surface instead of parallel
  – Magnetic regions can be smaller

(18)

2.2 HD – Notable Technology Advances

• Usage of magnetic grains instead of continuous magnetic material
– Between magnetic direction transitions, Néel spikes are formed
  – Areas of unsure magnetic direction
– Néel spikes are larger for continuous materials
– Magnetic regions can be smaller as the transition width can be reduced

(19)

2.2 HD – Basic Architecture

• A hard disk is made up of multiple double-sided platters
– Platter sides are called surfaces
– Platters are fixed on the main spindle and rotate at equal and constant speed (common: 5400 rpm / 7200 rpm)
• Each surface has its own read and write head
– Heads are attached to arms
– Arms can position heads along the surface
– Heads cannot move independently
– Heads have no contact with the surface and hover on top of an air bearing

(20)

2.2 HD – Basic Architecture

• Each surface is divided into circular tracks
– Some disks may use spirals
• All tracks of all surfaces with the same diameter are called a cylinder
– Data within the same cylinder can be accessed very efficiently

(21)

2.2 HD – Basic Architecture

• Each track is subdivided into sectors of equal capacity
a) Fixed angle sector subdivision
– Same number of sectors per track, changing density, constant speed
b) Fixed data density
– Outer tracks have more sectors than inner tracks
– Transfer speed higher on outer tracks
• Adjacent sectors can be grouped into clusters

(22)

2.2 HD – Reliability

• Hard drives are not completely reliable!
– Drives do fail
– Means for physical failure recovery are necessary: backups, redundancy

• Hard drives age and wear down. Wear significantly increases with:
– Contact cycles (head parking)
– Spindle start-stops
– Power-on hours
– Operation outside the ideal environment
  – Temperature too low/high
  – Unstable voltage

(23)

2.2 HD – Reliability

• Reliability measures are statistical values assuming certain usage patterns
– Desktop usage (all per year): 2,400 hours, 10,000 motor start/stops, 25 °C temperature
– Server usage (all per year): 8,760 hours, 250 motor start/stops, 40 °C temperature

– Non-recoverable read errors: a sector on the surface cannot be read anymore – the data is lost
  – Desktop disk: 1 per 10^14 read bits; Server: 1 per 10^15 read bits
  – The disk can detect this!
– Maximum contact cycles: maximum number of allowed head contacts (parking)
  – Usually around 50,000 cycles

(24)

2.2 HD – Reliability

• Mean Time Between Failures (MTBF): statistically anticipated time until half of a large disk population has failed
– Drive manufacturers usually use optimistic simulations to estimate the MTBF
– Desktop: 0.7 million hours (80 years); Server: 1.2 million hours (137 years) – manufacturers' values

• Annualized Failure Rate (AFR): probability of a failure per year, based on the MTBF
– AFR = OperatingHoursPerYear / MTBF_hours (worked example below)
– Desktop: 0.34%; Server: 0.73%
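Plugging the usage patterns from the previous slide into the AFR formula (a small sketch; all values as given on the slides):

```python
# AFR = operating hours per year / MTBF in hours (slide values).
def afr(operating_hours_per_year, mtbf_hours):
    return operating_hours_per_year / mtbf_hours

print(afr(2_400, 700_000))    # desktop: ~0.0034 -> 0.34% per year
print(afr(8_760, 1_200_000))  # server:  ~0.0073 -> 0.73% per year
```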

(25)

2.2 HD – Reliability

• The failure rate during a hard disk's lifespan is not constant
• It can be better modeled by the "bathtub curve", which has 3 components:
– Infant mortality failures
– Wear-out failures
– Random failures

(26)

2.2 Real World Failure Rates

• Report by Google
– 100,000 consumer-grade disks (80-400 GB, ATA interface, 5400-7200 RPM)

• Results (among others)
– Drives fail often!
– There is infant mortality
– High usage increases infant mortality, but not later failure rates
– Observed AFR is around 7% and MTBF 16.6 years!
– Careful: 2+ year results are biased; see the reference.

E. Pinheiro, W.-D. Weber, L. A. Barroso: Failure Trends in a Large Disk Drive Population. 5th USENIX Conference on File and Storage Technologies (FAST), 2007.

(27)

2.2 HD – Example Specs

• Seagate ST32000641AS 2 TB (Desktop Harddrive, 2011)
• Manufacturer's specifications:

Specification | Value
Capacity | 2 TB
Platters | 4
Heads | 8
Cylinders | 16,383
Sectors per track | 63
Bytes per sector | 512
Spindle Speed | 7200 RPM
MTBF | 85 years
AFR | 0.34%
Random Seek | 8.5 ms
Average latency | 4.2 ms

(28)

2.2 Reliability – Considerations

• Assume a storage need of 100 TB; only the following HDs are available
– Capacity: 1 TB each
– MTBF: 100,000 hours each (ca. 11 years)

• Consider using 100 of these disks independently (w/o RAID); a quick sanity check follows below
– Total storage: 100,000 GB = 100 TB
– MTBF: 1,000 hours (ca. 42 days)
– THIS IS BAD!

• More sophisticated ways of using multiple disks are needed
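A minimal sketch of that sanity check: with independent disks and a constant failure rate, the expected time until the first of N disks fails is MTBF/N.

```python
# Expected time until the FIRST of N independent disks fails: MTBF / N.
def mtbf_first_failure(mtbf_disk_hours, n_disks):
    return mtbf_disk_hours / n_disks

hours = mtbf_first_failure(100_000, 100)
print(hours, hours / 24)   # 1000.0 hours -> ~41.7 days
```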

(29)

Solid State Disk (SSD)

• Alternative to hard drives: SSD
– Uses microchips which retain data in non-volatile memory chips
– Contains no moving parts
• Uses the same interface as hard disk drives
– Easy replacement possible in most applications
• Key components
– Memory
– Controller

(30)

Memory

• Flash memory
– Most SSDs use NAND-based flash memory
– Retains data even without power
– Slower than DRAM solutions
– Single-level cell versus multi-level cell
– Wears down!

• DRAM
– Uses volatile random access memory
– Ultrafast data access (< 10 microseconds)
– Sometimes uses an internal battery or external power device to ensure data persistence
– Only for applications that require even faster access, but do not need data persistence after power loss

(31)

Controller

• The controller is an embedded processor
• It incorporates the electronics that bridge the NAND memory components to the host computer
• Some of its functions: error correction, wear leveling, bad block mapping, read and write caching, encryption, garbage collection

(32)

SSD – Summary

• Advantages
– Low access time and latency
– No moving parts: shock resistant
– Silent
– Lighter and more energy-efficient than HDDs

• Disadvantages
– Divided into blocks; if one byte is changed, the whole block has to be rewritten (write amplification)
– 10% of the storage capacity is reserved (spare area)
– Limited ability to be rewritten (between 3,000 and 100,000 cycles per cell)
  – Wear-leveling algorithms ensure that write operations are equally distributed over the cells

(33)

2.2 HD – Controller

• The disk controller organizes low-level access to the disk
– e.g. head positioning, error checking, signal processing
– Usually integrated into the disk
– Provides a unified and abstracted interface to access the disk (e.g. LBA)
– Connects the disk to a peripheral bus (e.g. IDE, SCSI, FibreChannel, SAS)

• The host bus adapter (HBA) bridges between the peripheral bus and the system's internal bus (like PCIe, PCI)
– The internal bus is usually integrated into the system's main board
– The HBA is often confused with the disk controller

• This setup is called DAS (Directly Attached Storage)

[Figure: disk (mechanics + disk controller) – peripheral bus – host bus adapter – internal bus / inner system / mainboard]

(34)

2.2 HD – Controller

• Sectors can be logically grouped into blocks by the operating system
– Sectors in a block do not necessarily need to be adjacent
– e.g. NTFS defaults to 4 KiB per block = 8 sectors on a modern disk

• The hardware address of a block is a combination of cylinder number, surface number, and block number within the track
– The controller maps the hardware address to a logical block address (LBA); an illustrative mapping is sketched below
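As an illustration of such a mapping, here is a sketch using the classic CHS-to-LBA formula; the geometry values are taken from the Seagate example-spec slide, and real drives remap sectors internally in far more complex ways:

```python
# Classic CHS -> LBA mapping (a sketch; real drives remap internally).
HEADS, SECTORS_PER_TRACK = 8, 63   # geometry from the Seagate example specs

def chs_to_lba(cylinder, head, sector):
    # Sectors are traditionally numbered from 1; cylinders and heads from 0.
    return (cylinder * HEADS + head) * SECTORS_PER_TRACK + (sector - 1)

print(chs_to_lba(0, 0, 1))   # 0   -> first block on the disk
print(chs_to_lba(1, 0, 1))   # 504 -> first sector of cylinder 1
```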

(35)

2.2 HD – Controller

• The disk controller transfers the content of whole blocks to a buffer
– The buffer resides in primary storage and can be accessed efficiently

• Time needed to transfer a random block (4 KiB/block on ST3100034AS): < 10 msec
– Seek Time: time needed to position the head on the correct cylinder (< 8 msec)
– Latency (Rotational Delay): time until the correct block arrives below the head (< 0.14 msec)
– Block Transfer Time: time to read all sectors of a block (< 0.01 msec)

• Bulk transfer time for n adjacent blocks (< 20 msec for n = 10), sketched below with the Seagate example values:
  Seek Time + Rotational Delay + n * Block Transfer Time
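A small sketch of this cost model, using the parameters from the Seagate example-spec slide (8.5 ms seek, 4.2 ms average latency, 138 MB/sec transfer) instead of the ST3100034AS values above:

```python
# Random vs. bulk block access cost (Seagate example-spec values).
SEEK_MS, LATENCY_MS = 8.5, 4.2      # average seek / rotational delay
TRANSFER_MB_PER_S = 138.0
BLOCK_KIB = 4.0

block_transfer_ms = BLOCK_KIB / 1024 / TRANSFER_MB_PER_S * 1000  # ~0.028 ms

def access_ms(n_blocks, adjacent):
    if adjacent:  # one seek + one rotational delay, then stream all blocks
        return SEEK_MS + LATENCY_MS + n_blocks * block_transfer_ms
    # worst case: every block pays its own seek and rotational delay
    return n_blocks * (SEEK_MS + LATENCY_MS + block_transfer_ms)

print(access_ms(10, adjacent=True))   # ~13.0 ms
print(access_ms(10, adjacent=False))  # ~127.3 ms -> avoid random transfers!
```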

(36)

2.2 HD – Controller

• Locating data on a disk is a major bottleneck
– Try operating on data already in the buffer
– Aim for bulk transfers, avoid random block transfers

(37)

2.3 RAID

• A single HD is often not sufficient
– Limited capacity
– Limited speed
– Limited reliability

• Idea: combine multiple HDs into a RAID array (Redundant Array of Independent Disks)
– A RAID array treats multiple hardware disks as a single logical disk
– More HDs for increased capacity
– Parallel access for increased speed
– Controlled redundancy for increased reliability

(38)

2.3 RAID Controller

• The RAID controller connects to multiple hard disks
– The disks are virtualized and appear as just one single logical disk
• The RAID controller acts as an extended, specialized HBA (Host Bus Adapter)
– This is still DAS (Directly Attached Storage)

[Figure: disks on the peripheral bus behind a RAID controller on the internal bus, represented to the system as a single logical disk]

(39)

2.3 RAID Principles – Mirroring

• Mirroring (or shadowing): increases reliability by complete redundancy
• Idea: mirror disks are exact copies of the original disk
– Not space efficient
• Read speed can be n times as fast; write speed does not increase
• Increases reliability. Assume:
– Two disks with an MTBF of 11 years each
– One original disk, one mirror disk
– Disk failures are independent of each other (unrealistic)
– Disk replacement time of 10 hours
– The MTBF of the mirrored system is then > 57,000 years!

(40)

2.3 RAID Principles – Striping

• Striping: improve performance by parallelism
• Idea: distribute data among all disks for increased performance
• Bit-Level Striping: split all bits of a byte across the disks
– e.g. for 8 disks, write the i-th bit to disk i
– The number of disks needs to be a power of 2
– Each disk is involved in each access
  – Access rate does not increase
  – Read and write transfer speed increases linearly
  – Simultaneous accesses are not possible
– Good for speeding up few, sequential, and large accesses

(41)

2.3 RAID Principles – Striping

• Block-Level Striping: distribute blocks among the disks
– Only one disk is involved in reading a specific block
  – Read and write speed of a single block is not increased
– Other disks are still free to read/write other blocks
  – Read and write speed of multiple accesses increases
– Good for a large number of parallel accesses

(42)

2.3 RAID Principles – Error Correction Codes

• Error Correction Codes: increase reliability with computed redundancy
• Hamming Codes (~1940)
– Can detect and repair 1-bit errors within a set of n data bits by computing k parity bits
– n = 2^k - k - 1 (checked in the sketch below)
  – n=1, k=2; n=4, k=3; n=11, k=4; n=26, k=5; …
– Especially used for in-memory and tape error correction
  – These media cannot detect errors autonomously
– Not really used for hard drives anymore
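The n = 2^k - k - 1 relation, checked in a trivial sketch:

```python
# Data bits n protected by k parity bits in a Hamming code: n = 2^k - k - 1.
def hamming_data_bits(k):
    return 2**k - k - 1

print([(k, hamming_data_bits(k)) for k in range(2, 6)])
# [(2, 1), (3, 4), (4, 11), (5, 26)]
```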

(43)

2.3 RAID Principles – Error Correction Codes

• Interleaved Parity (Reed-Solomon algorithm on the Galois field GF(2))
– Can repair 1-bit errors (when the error position is known)
– Hard disks can detect read errors themselves, so there is no need for complete Hamming codes
– Basic idea:
  – From n data pieces D_1, …, D_n compute parity data D_p by combining the data using logical XOR (eXclusive OR)
  – XOR is associative and commutative; important: A XOR B XOR B = A
  – i.e. D_p = D_1 XOR D_2 XOR … XOR D_n
  – Assume D_2 was lost; it can be reconstructed as D_2 = D_p XOR D_1 XOR D_3 XOR … XOR D_n

(44)

2.3 RAID Principles – Interleaved Parity

• Interleaved parity, example:
• A = 0101, B = 1100, C = 1011
• P = A XOR B XOR C = 0010

• C is lost. Reconstruction (shown as code below):
– P = A XOR B XOR C
– C = P XOR A XOR B
– C = A XOR B XOR C XOR A XOR B
– C = A XOR A XOR B XOR B XOR C
– C = 0 XOR C
– C = 1011

  0101 XOR 1100 XOR 1011 = 0010 (parity P)
  0010 XOR 0101 XOR 1100 = 1011 (reconstructed C)
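The same computation as a minimal Python sketch (integers stand in for the 4-bit blocks):

```python
# XOR parity over n data blocks; any single lost block is recoverable.
from functools import reduce

A, B, C = 0b0101, 0b1100, 0b1011
P = A ^ B ^ C                      # parity block: 0b0010

# Block C is lost: XOR the parity with all surviving blocks.
recovered_C = reduce(lambda x, y: x ^ y, [P, A, B])
print(f"{P:04b} {recovered_C:04b}")   # 0010 1011
```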

(45)

2.3 RAID in practical applications

• The 3 RAID principles can be combined in multiple ways
– Not every combination is useful
• This led to the definition of 7 core RAID levels: RAID 0 – RAID 6
– The most dominant levels are RAID 0, RAID 1, RAID 1+0, and RAID 5

• In the following examples, assume:
– An MTBF of 100,000 hours (11.42 years) per disk
– A Mean Time To Repair (MTTR) of 6 hours
– The failure rate is constant and failures between disks are independent
– MTBF_raid is the mean time to data loss within the RAID if each failing disk is replaced within the MTTR
– D is the number of drives in the RAID set
– C = 200 GB is the capacity of one disk, C_raid the capacity of the whole RAID

(46)

2.3 RAID in practical applications

• Mean Time To Repair (MTTR)
– MTTR = TimeToNotice + RebuildTime
– Assume a time to notice of 0.5 hours
– The rebuild time is the time for completely writing back the lost data
  – Assume a disk capacity of 200 GB
  – Write-back speed of 10 MB/sec
  – Consists of reading the remaining disks, computing parity / reconstructing data
  – Rebuild time is around 5.5 hours
– During a rebuild, a RAID is especially vulnerable
– MTTR = 6 hours (spelled out in the sketch below)
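The rebuild-time estimate, spelled out (assumptions exactly as on the slide):

```python
# MTTR = time to notice + rebuild time (write back 200 GB at 10 MB/sec).
CAPACITY_MB = 200 * 1000        # 200 GB disk, decimal units
WRITE_BACK_MB_PER_S = 10.0
TIME_TO_NOTICE_H = 0.5

rebuild_h = CAPACITY_MB / WRITE_BACK_MB_PER_S / 3600   # ~5.56 hours
print(TIME_TO_NOTICE_H + rebuild_h)                    # ~6.06 -> MTTR = 6 h
```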

(47)

2.3 RAID Levels

• RAID 0
– Block-level striping only
– Increased parallel access and transfer speeds, reduced reliability
– All disks contain data (0% overhead)
– Works with any number of disks
– MTBF_raid = MTBF_disk / D

– 4 disks:
  MTBF_raid = 2.86 years
  C_raid = 800 GB (0 GB wasted (0%))
– Common size: 2 disks
  MTBF_raid = 5.72 years
  C_raid = 400 GB (0 GB wasted (0%))

[Figure: File A (A1-Ax), File B (B1-Bx), File C (C1-Cx) striped block-wise across the disks]

(48)

2.3 RAID Levels

• RAID 1
– Mirroring only
– Increased reliability, increased read transfer speed, low space efficiency
– MTBF_raid = MTBF_disk^D / (D! * MTTR^(D-1)) (reproduced in the sketch below)

– 4 disks:
  MTBF_raid = 2.2 trillion years
  C_raid = 200 GB (600 GB wasted (75%))
  (The age of the universe may be around 15 billion years…)
– Common size: 2 disks
  MTBF_raid = 95,130 years
  C_raid = 200 GB (200 GB wasted (50%))
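A sketch that reproduces these numbers from the formulas (slide assumptions: MTBF_disk = 100,000 h, MTTR = 6 h):

```python
# MTBF formulas for RAID 0 and RAID 1 under the slide's assumptions.
from math import factorial

MTBF, MTTR, H_PER_Y = 100_000, 6, 8_760   # hours, hours, hours per year

def raid0_years(d):
    return MTBF / d / H_PER_Y

def raid1_years(d):
    return MTBF**d / (factorial(d) * MTTR**(d - 1)) / H_PER_Y

print(raid0_years(4))   # ~2.85 years
print(raid1_years(2))   # ~95,129 years
print(raid1_years(4))   # ~2.2e12 years (2.2 trillion)
```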

(49)

2.3 RAID Levels

• RAID 2
– Not used in practice anymore; was used in old mainframes
– Bit-level striping
– Uses Hamming codes, usually Hamming code (7,4) – 4 data bits, 3 parity bits
– Reliable 1-bit error recovery (i.e. one disk may fail)
– 3 redundant disks per 4 data disks (75% overhead)
  – The ratio is better for a larger number of disks
– MTBF_raid = MTBF_disk^2 / (D * (D-1) * MTTR)

– 7 disks (does not really make sense for 4 – not comparable to the other values):
  MTBF_raid = 4,530 years
  C_raid = 800 GB (600 GB wasted (43%))

(50)

2.3 RAID Levels

• RAID 3
– Interleaved parity, byte-level striping, dedicated parity disk
– Bottleneck! Every write operation needs to update the parity disk
  – No parallel writes
– 1 redundant disk per n data disks
  – Overhead decreases with the number of disks, while reliability also decreases
  – 25% overhead for 4 disks
– MTBF_raid = MTBF_disk^2 / (D * (D-1) * MTTR)

– 4 disks:
  MTBF_raid = 15,854 years
  C_raid = 600 GB (200 GB wasted (25%))

(51)

2.3 RAID Levels

• RAID 4
– Block-level striping
– As RAID 3 otherwise

– 4 disks (common size):
  MTBF_raid = 15,854 years
  C_raid = 600 GB (200 GB wasted (25%))
– 5 disks (also common size):
  MTBF_raid = 9,513 years
  C_raid = 800 GB (200 GB wasted (20%))

(52)

2.3 RAID Levels

• RAID 5
– Parity is distributed among the hard disks
  – May allow for parallel block writes
– As RAID 4 otherwise
– Bottleneck when writing many files smaller than a block
  – The whole parity block has to be read and re-written for each minor write
– Can recover from a single disk failure
– MTBF_raid and C_raid as for RAID 3 & 4

(53)

2.3 RAID Levels

• RAID 6
– Two independent parity blocks distributed among the disks
– May be implemented by parity on orthogonal data or by using Reed-Solomon codes on GF(2^8)
– As RAID 5 otherwise
– 2 redundant disks per n data disks
– Can recover from a double disk failure
  – No vulnerability during a single-failure rebuild
– Very suitable for larger arrays
– Write overhead due to the more complicated parity computation
– MTBF_raid = MTBF_disk^3 / (D * (D-1) * (D-2) * MTTR^2) (see the sketch below)

– 4 disks:
  MTBF_raid = 132 million years
  C_raid = 400 GB (400 GB wasted (50%))
– 8 disks (common):
  MTBF_raid = 9.44 million years
  C_raid = 1,200 GB (400 GB wasted (25%))
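A companion sketch for the parity levels, which reproduces the values above (same assumptions as before):

```python
# MTBF formulas for single-parity RAID (2/3/4/5) and double-parity RAID 6.
MTBF, MTTR, H_PER_Y = 100_000, 6, 8_760

def raid5_years(d):                      # also RAID 2, 3, 4
    return MTBF**2 / (d * (d - 1) * MTTR) / H_PER_Y

def raid6_years(d):
    return MTBF**3 / (d * (d - 1) * (d - 2) * MTTR**2) / H_PER_Y

print(raid5_years(4))   # ~15,854 years (RAID 3/4 with 4 disks)
print(raid5_years(5))   # ~9,513 years  (RAID 4/5 with 5 disks)
print(raid6_years(4))   # ~132 million years
print(raid6_years(8))   # ~9.44 million years
```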

(54)

2.3 Practical use of RAIDs

• Additionally, there are hybrid levels combining the core levels
– RAID 0+1, RAID 1+0, RAID 5+0, RAID 5+1, RAID 6+6, …

• RAID 1+0
– Mirrored sets nested in a striped set: RAID 0 on top of sets of RAID 1 sets
– Very high read and write transfer speeds, increased reliability, low space efficiency, limited maximum size
– The most performant RAID combination
– D1 = drives per RAID 1 set, D0 = number of RAID 1 sets
– MTBF_raid = MTBF_disk^D1 / (D1! * MTTR^(D1-1)) / D0

– 4 disks (D1 = 2, D0 = 2):
  MTBF_raid = 47,565 years
  C_raid = 400 GB (400 GB wasted (50%))
– 6 disks (D1 = 2, D0 = 3):
  MTBF_raid = 31,706 years
  C_raid = 600 GB (600 GB wasted (50%))

(55)

2.4 Beyond RAID

• RAID controllers directly connect storage to the system bus
– Storage is available to only one system / server / application
• The number of disks is limited
– Consumer-grade RAID: 2-4 disks
– Enterprise-grade RAID: 8-24+ disks

• Solutions
– NAS (Network Attached Storage): provides abstracted file systems via network (software solution)
– SAN (Storage Area Network): virtualized logical storage within a specialized network on block level (hardware solution)

(56)

2.4 File Systems vs. Raw Devices

• Before discussing NAS, we need file systems
• A file system is software for abstracting file operations on a logical storage device
– Files are collections of binary data
– Creating, reading, writing, deleting, finding, organizing
– How does a file access translate into top-level operations on a logical storage device?
  – e.g. which blocks have to be read/written?
– The file system is the bridge between application software and the (abstracted) hardware

[Figure: layers — application software, file system, logical storage]

(57)

2.4 File Systems vs. Raw Devices

• Raw device access allows applications to bypass the OS and the file system
• The application may directly tune aspects of the physical storage
• May lead to very efficient implementations
– Used for e.g. high-performance databases, system virtualization, etc.

[Figure: layers — application software directly on logical storage, no file system in between]

(58)

2.4 NAS – Network Attached Storage

• Idea: provide a remote file system using already available network infrastructure
– NAS: Network Attached Storage
– Uses specialized network protocols (e.g. CIFS, NFS, FTP, etc.)
– Easiest case: a file server (e.g. Linux + Samba)

• Advantages:
– Easy to set up, easy to use, cheap infrastructure
– Allows sharing of storage among several systems
– Abstracts on file system level (easy for most applications)

• Disadvantages:
– Inefficient and slow: large protocol and processing overhead
– Abstracts on file system level (not suitable for special purposes like raw devices or storage virtualization)

[Figure: application software on the client — network — NAS server with file system and logical storage]

(59)

2.4 SAN – Storage Area Network

• SANs offer specialized high-speed networks for storage devices
– Usually use local FibreChannel networks
– Remote locations may be connected via Ethernet or IP-WAN (Internet)

• The network uses specialized storage protocols
– iFCP (SCSI on FibreChannel)
– iSCSI (SCSI on TCP/IP)
– HyperSCSI (SCSI on raw Ethernet)

• SANs provide raw block-level access to logical storage devices
– Logical disks of any size can be offered by the SAN
– To a client system using a logical disk, it appears like a local disk or RAID
– The client system has full control over the file systems on its logical disks

[Figure: application software and file system on the client — SAN — logical storage]

(60)

2.4 SAN – Storage Area Network

[Figure: SAN topology — servers with SAN HBAs connected through SAN switches; storage arrays attached via SAN/RAID HBAs and peripheral buses (SCSI, SAS, etc.); a NAS head exporting the SAN via a NAS protocol (CIFS) over an Ethernet network; remote sites linked by a WAN-SAN bus (HyperSCSI) and SAN buses (iFCP)]

(61)

2.4 SAN – Storage Area Network

• Advantages:
– Very efficient
  – Highly optimized local network infrastructure
  – Optimized protocols with low overhead
– Very flexible (any number of systems may use any number of disks at any location)
– Helps with disaster protection
  – A SAN can transparently span even remote locations
– May also employ NAS heads for NAS-like behavior

• Disadvantages:
– Expensive

(62)

2.5 Case Study

• How much storage and bandwidth is needed by YouTube, and how might it be organized?
• All top secret, but there are educated guesses and some (older) leaked data…

(63)

2.5 Case Study

• A Google video search restricted to YouTube.com reveals 187,397,091 indexed videos
– 3.35 min/video: based on the TOP-100 all-time videos
– 2.3 MB/min: based on a sample (very low variation)
– ≈ 8.3 MB/video
• The guessed size of all videos on YouTube is thus 1.56 PB (see the sketch below)

• Assume 160 GB/disk with MTBF = 16.6 years
– Based on the Google reliability study
• 9,800 hard disks are needed to store all videos just once, without any redundancy
– MTBF = 14 hours ...
• Using 1,960 5+1 RAID 5 arrays, 11,760 disks are needed
– MTBF = 6.84 years – not too great…
– Still, each video is only available once
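The back-of-the-envelope numbers above, reproduced in a short sketch (all inputs are the slide's guesses, decimal units):

```python
# YouTube storage estimate from the slide's guesses.
videos = 187_397_091
mb_per_video = 8.3                       # ~3.35 min at ~2.3 MB/min

total_pb = videos * mb_per_video / 1e9   # MB -> PB
print(total_pb)                          # ~1.56 PB

disks = videos * mb_per_video / 1e3 / 160   # 160 GB per disk
print(disks)                                # ~9,700 -> ~9,800 disks

print(16.6 * 8_760 / 9_800)   # first failure among 9,800 disks: ~14.8 h
```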

(64)

2.5 Case Study

• Using 196 (9+1)(5+1) RAID 55 arrays, 13,066 disks are needed
– RAID 5 arrays with 6 disks each; 10 of these arrays form an overlaying RAID 5
– MTBF = 14 million years (finally, data is "safe" at one location)
• Still, each video is only available once
– No global disaster safety
– No global load balancing
• How might this look?

(65)

2.5 Case Study

• YouTube grows fast
– Currently, around 200,000 new videos per day (1.66 TB/day)
• A large number of disks has to be added per month
– Around 440 disks/month for new videos
– Around 80 disks/month to replace broken ones
– Growing exponentially

(66)

2.5 Case Study

• It gets even worse…
• YouTube serves 200 million videos per day (as of mid 2007)
– 30 PB of data EVERY MONTH
– 154 Gbps (read: 154 Gigabit per second; see the sketch below)
– This averages to 586,000 concurrent streams
– Popular videos reach around 250,000 views per day
  – 600 concurrent streams per FILE (25 MB/sec)
• This bandwidth is insanely expensive: 600,000 USD/month
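The 154 Gbps figure follows directly from the earlier per-video size guess (a sketch; 8.3 MB/video is the assumption carried over from the storage estimate):

```python
# Serving bandwidth: 200M videos/day at ~8.3 MB each (slide guesses).
videos_per_day = 200_000_000
mb_per_video = 8.3

pb_per_day = videos_per_day * mb_per_video / 1e9
gbps = videos_per_day * mb_per_video * 8 / 86_400 / 1e3   # Mbit/s -> Gbit/s
print(pb_per_day, gbps)   # ~1.66 PB/day, ~154 Gbps
```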

(67)

2.5 Case Study

• This massive amount of data cannot be hosted and served from a single location…
• Data needs to be distributed and globally load balanced

(68)

2.5 Case Study

• YouTube does not host and serve the videos themselves
– They hire Limelight Networks for that
• Limelight Networks
– Large CDN (Content Delivery Network) provider
– Owns 25 POPs (Points Of Presence) connected with their own backbone
– Each POP has up to thousands of storage servers
– Can serve up to 1 Tbps!

(69)

2.5 Case Study

• Limelight automatically distributes content among all POPs
– Data is massively redundant
  – More popular data is replicated more, less popular data less
– Each file is served from the closest location with bandwidth to spare
  – Global load balancing
– Data is disaster proof!

• What to learn? Large-scale data storage and serving is
– Very resource intensive
– Very expensive

(70)

Physical Storage

• There are different types of storage
– Usually, there is a storage hierarchy
  – Faster, smaller, more expensive storage
  – Slower, bigger, less expensive storage
• Hard drives are currently the most popular media
– Mechanical devices
– High sequential transfer rates
– Bad random access times, low random transfer rates
– Prone to failure
• DBMS must be optimized for the storage devices used!

(71)

Next Lecture

• Access Paths
– Physical Data Access
– Index Structures
– Physical Tuning
