Relational Database Systems 2
Silke Eckstein
Benjamin Köhncke
Institut für Informationssysteme
Technische Universität Braunschweig http://www.ifis.cs.tu-bs.de
2. Physical Data Storage
1 Architecture
[Recap diagram from Chapter 1: application programmers, DB administrators, and direct queries reach the DBMS through application interfaces; DDL goes to the DDL interpreter, embedded DML through the DML precompiler/compiler into application object code; queries are handled by the query engine / query evaluation engine; the transaction manager, buffer manager, and file manager of the data storage manager operate on the stored data, indices, statistics, and the catalog/dictionary]
2 Physical Data Storage
2.1 Introduction
2.2 Hard Disks
2.3 RAIDs
2.4 SANs and NAS
2.5 Case Study
2.1 Physical Storage Introduction
• A DBMS needs to retrieve, update, and process persistently stored data
– Storage considerations are an important factor when planning a database system (physical layer)
– Remember: the data has to be stored securely, but access to the data should be declarative!
• Data is stored on storage media. Media differ greatly in terms of
– Random access speed
– Random/sequential read/write speed
– Capacity
– Cost per capacity
EN 13.1
2.1 Relevant Media Characteristics
• Capacity: quantifies the amount of data that can be stored
– Base units: 1 Bit; 1 Byte = 2³ Bit = 8 Bit
– Capacity units according to IEC, IEEE, NIST, etc.:
• Usually used for file sizes and primary storage (for a higher degree of confusion, sometimes used with SI abbreviations…)
• 1 KiB = 1024¹ Byte; 1 MiB = 1024² Byte; 1 GiB = 1024³ Byte; …
– Capacity units according to SI:
• Usually used for advertising secondary/tertiary storage
• 1 KB = 1000¹ Byte ≈ 0.976 KiB; 1 MB = 1000² Byte ≈ 0.954 MiB; 1 GB = 1000³ Byte ≈ 0.931 GiB; …
– Especially used by the networking community:
• 1 Kb = 1000¹ Bit = 0.125 KB ≈ 0.122 KiB; 1 Mb = 1000² Bit = 0.125 MB ≈ 0.119 MiB
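A minimal Python sketch (not part of the original slides) illustrating the IEC vs. SI unit difference; the constant names are chosen here only for illustration.

```python
# Sketch: IEC (base-1024) vs. SI (base-1000) capacity units
KIB, MIB, GIB = 1024**1, 1024**2, 1024**3   # IEC units
KB, MB, GB = 1000**1, 1000**2, 1000**3      # SI units

capacity_bytes = 2 * 10**12  # a "2 TB" disk as advertised (SI)

print(capacity_bytes / GB)   # 2000.0 GB (SI)
print(capacity_bytes / GIB)  # ~1862.6 GiB (IEC) -- the "missing" space
print(1 * MB / MIB)          # 1 MB is only ~0.954 MiB
```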
2.1 A Kilo-Joke
http://xkcd.com/
2.1 Characteristic Parameters
• Random Access Time: average time to access a random piece of data at a known media position
– Usually measured in ms or ns
– Within some media, access time can vary depending on position (e.g. hard disks)
• Transfer Rate: average amount of consecutive data that can be transferred per time unit
– Usually measured in KB/sec, MB/sec, GB/sec, …
– Sometimes also in Kb/sec, Mb/sec, Gb/sec
2.1 Other Characteristics
• Volatile: memory needs constant power to keep data
– Dynamic: dynamic volatile memory needs to be “refreshed” regularly to keep data
– Static: no refresh necessary
• Access Modes
– Random Access: any piece of data can be accessed in approximately the same time
– Sequential Access: data can only be accessed in sequential order
• Write Mode
– Mutable Storage: can be read and written arbitrarily
– Write Once Read Many (WORM)
• Interesting for legal issues, e.g. the Sarbanes-Oxley Act (2002)
2.1 Online, Nearline, Offline
• Online media
– „always on“
– Each single piece of data can be accessed fast
– e.g. hard drives, main memory
• Nearline media
– Compromise between online and offline
– Offline media can automatically be put “online”
– e.g. juke boxes, robot libraries
• Offline media (disconnected media)
– Not under direct control of the processing unit
– Have to be connected manually
– e.g. box of backup tapes in the basement
2.1 The Storage Hierarchy
• Media characteristics result in a storage hierarchy
• DBMS optimize data distribution among the storage levels
– Primary Storage: fast, limited capacity, high price, usually volatile electronic storage
• Frequently used data / current working data
– Secondary Storage: slower, large capacity, lower price
• Main stored data
– Tertiary Storage: even slower, huge capacity, even lower price, usually offline
• Backup and long-term storage of infrequently used data
[Storage hierarchy pyramid: Primary (cache, RAM, ~100 ns) → Secondary (flash, magnetic disks, ~10 ms) → Tertiary (optical disks, tape, > 1 s); cost per capacity decreases and access time increases down the hierarchy]
2.1 Storage Media – Examples

Type | Media | Size | Random Acc. Speed | Transfer Speed | Characteristics | Price | Price/GB
Pri | L1 Processor Cache (Intel QX9000) | 32 KiB | 0.0008 ms | 6200 MB/sec | Vol, Stat, RA, OL | – | –
Pri | DDR3-RAM (Corsair 1600C7DHX) | 2 GiB | 0.004 ms | 8000 MB/sec | Vol, Dyn, RA, OL | €38 | €19
Sec | Harddrive SSD (OCZ Vertex2) | 160 GB | < 1 ms | 285 MB/sec | Stat, RA, OL | €239 | €1.50
Sec | Harddrive Magnetic (Seagate ST32000641AS) | 2000 GB | 8.5 ms | 138 MB/sec | Stat, RA, OL | €143 | €0.07
Ter | DVD+R (Verbatim DVD+R) | 4.7 GB | 98 ms | 11 MB/sec | Stat, RA, OF, WORM | €0.36/Disk | €0.07
Ter | LTO Streamer (Freecom LTO-920i) | 800 GB | 58 sec | 120 MB/sec | Stat, SA, OF | €80/Tape | €0.10

Last updated April 2011
Pri = Primary, Sec = Secondary, Ter = Tertiary
Vol = Volatile, Stat = Static, Dyn = Dynamic, RA = Random Access, SA = Sequential Access, OL = Online, OF = Offline, WORM = Write Once Read Many
2.2 Magnetic Disk Storage – HDs
• Hard drives are currently the standard for large, cheap, and persistent storage
– Usually used as the main storage media for most data in a DB
• DBMS need to be optimized for efficient disk storage and access
– Data access needs to be as fast as possible
– Often-used data should be accessible with highest speed; rarely needed data may take longer
– Different data items needed for certain recurring tasks should also be stored/accessed together
2.2 HD – How does it work?
• Directional magnetization of a ferromagnetic material
• Realized on hard disk platters
– Base platter made of non-magnetic aluminum or glass substrate
– Magnetic grains worked into the base platter to form magnetic regions
• Each region represents 1 Bit
– The read head can detect the magnetization direction of each region
– The write head may change the direction
2.2 HD – Notable Technology Advances
• Giant Magnetoresistance Effect (GMR)
– Discovered in 1988 simultaneously by Peter Grünberg and Albert Fert
• Both honored with the 2007 Nobel Prize in Physics
– Allows the construction of efficient read heads:
• The electric resistance of alternating ferromagnetic and non-magnetic layers changes dramatically (“giantly”) with changing magnetic field directions
– http://www.research.ibm.com/research/demos/gmr/cyberdemo1.htm
• Perpendicular Recording (used since 2005)
– Longitudinal recording is limited to ~200 Gb/inch² due to the superparamagnetic effect
• Thermal energy may spontaneously change the magnetic direction
– Perpendicular recording allows for up to 1000 Gb/inch²
– Very simplified: align the magnetic field orthogonal to the surface instead of parallel
• Magnetic regions can be smaller
• Usage of magnetic grains instead of continuous magnetic material
– Between magnetic direction transitions, Néel spikes are formed
• Areas of uncertain magnetic direction
– Néel spikes are larger for continuous materials
– Magnetic regions can be smaller as the transition width can be reduced
2.2 HD – Basic Architecture
• A hard disk is made up of multiple double-sided platters
– Platter sides are called surfaces
– Platters are fixed on the main spindle and rotate at equal and constant speed (common: 5400 rpm / 7200 rpm)
– Each surface has its own read and write head
– Heads are attached to arms
• Arms can position heads along the surface
• Heads cannot move independently
– Heads have no contact with the surface and hover on top of an air bearing
• Each surface is divided into circular tracks
– Some disks may use spirals
• All tracks of all surfaces with the same diameter are called a cylinder
– Data within the same cylinder can be accessed very efficiently
EN 13.2
• Each track is subdivided into sectors of equal capacity
a) Fixed angle sector subdivision
• Same number of sectors per track, changing density, constant speed
b) Fixed data density
• Outer tracks have more sectors than inner tracks
• Transfer speed higher on outer tracks
• Adjacent sectors can be grouped into clusters
2.2 HD – Reliability
• Hard drives are not completely reliable!
– Drives do fail
– Means for physical failure recovery are necessary
• Backups
• Redundancy
• Hard drives age and wear down. Wear is significantly increased by:
– Contact cycles (head parking)
– Spindle start-stops
– Power-on hours
– Operation outside the ideal environment
• Temperature too low/high
• Unstable voltage
• Reliability measures are statistical values assuming certain usage patterns
• Desktop usage (all per year): 2,400 hours, 10,000 motor start/stops, 25°C temperature
• Server usage (all per year): 8,760 hours, 250 motor start/stops, 40°C temperature
– Non-recoverable read errors: a sector on the surface cannot be read anymore – the data is lost
• Desktop disk: 1 per 10¹⁴ read bits, Server: 1 per 10¹⁵ read bits
• The disk can detect this!
– Maximum contact cycles: maximum number of allowed head contacts (parking)
• Usually around 50,000 cycles
– Mean Time Between Failures (MTBF): statistically anticipated time after which half of a large disk population has failed
• Drive manufacturers usually use optimistic simulations to guess the MTBF
• Desktop: 0.7 million hours (80 years), Server: 1.2 million hours (137 years) – manufacturers' values
– Annualized Failure Rate (AFR): probability of a failure per year, based on the MTBF
• AFR = OperatingHoursPerYear / MTBF_hours
• Desktop: 0.34%, Server: 0.73%
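A small sketch of the AFR formula above, reproducing the desktop and server numbers; the operating hours per year are taken from the usage patterns stated on the previous slide.

```python
# AFR = OperatingHoursPerYear / MTBF_hours (formula from the slide)
def afr(operating_hours_per_year, mtbf_hours):
    return operating_hours_per_year / mtbf_hours

print(f"Desktop AFR: {afr(2_400, 700_000):.2%}")    # ~0.34 %
print(f"Server  AFR: {afr(8_760, 1_200_000):.2%}")  # ~0.73 %
```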
• The failure rate during a hard disk's lifespan is not constant
• It can be better modeled by the “bathtub curve” having 3 components
– Infant Mortality Rate
– Wear-Out Failures
– Random Failures
2.2 Real World Failure Rates
• Report by Google
– 100,000 consumer-grade disks (80–400 GB, ATA interface, 5,400–7,200 RPM)
• Results (among others)
– Drives fail often!
– There is infant mortality
– High usage increases infant mortality, but not later failure rates
– Observed AFR is around 7% and MTBF 16.6 years!
– Careful: 2+ year results are biased, see the reference
• Reference: E. Pinheiro, W.-D. Weber, L. A. Barroso: Failure Trends in a Large Disk Drive Population. 5th USENIX Conference on File and Storage Technologies (FAST), 2007
2.2 HD – Example Specs
• Seagate ST32000641AS 2 TB (Desktop Harddrive, 2011)
– Manufacturer’s specifications

Specification | Value
Capacity | 2 TB
Platters | 4
Heads | 8
Cylinders | 16,383
Sectors per track | 63
Bytes per sector | 512
Spindle Speed | 7200 RPM
MTBF | 85 years
AFR | 0.34 %
Random Seek | 8.5 ms
Average latency | 4.2 ms
2.2 Reliability – Considerations
• Assume a storage need of 100 TB. Only the following HDs are available
– Capacity: 1 TB each
– MTBF: 100,000 hours each (ca. 11 years)
• Consider using 100 of these disks independently (w/o RAID)
– Total storage: 100,000 GB = 100 TB
– MTBF: 1,000 hours (ca. 42 days)
– THIS IS BAD!
• More sophisticated ways of using multiple disks are needed
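A minimal sketch of the reasoning above: with independent disks and constant failure rates, the mean time until the first of N disks fails is the single-disk MTBF divided by N.

```python
# First-failure MTBF of N independent disks (constant failure rate assumption)
def mtbf_array(mtbf_disk_hours, n_disks):
    return mtbf_disk_hours / n_disks

hours = mtbf_array(100_000, 100)
print(hours, "hours ≈", round(hours / 24), "days")  # 1000.0 hours ≈ 42 days
```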
Solid State Disk (SSD)
• Alternative to hard drives: SSD
– Uses microchips which retain data in non-volatile memory chips and contain no moving parts
• Uses the same interface as hard disk drives
– Easy replacement in most applications possible
• Key components
– Memory
– Controller
Memory
• Flash memory
– Most SSDs use NAND-based flash memory
– Retains data even without power
– Slower than DRAM solutions
– Single-level cell versus multi-level cell
– Wears down!
• DRAM
– Uses volatile random access memory
– Ultrafast data access (< 10 microseconds)
– Sometimes uses an internal battery or external power device to ensure data persistence
– Only for applications that require even faster access, but do not need data persistence after power loss
Controller
• The controller is an embedded processor
• It incorporates the electronics that bridge the NAND memory components to the host computer
• Some of its functions
– Error correction, wear leveling, bad block mapping, read and write caching, encryption, garbage collection
SSD – Summary
• Advantages
– Low access time and latency
– No moving parts, shock resistant
– Silent
– Lighter and more energy-efficient than HDDs
• Disadvantages
– Divided into blocks; if one byte is changed, the whole block has to be rewritten (write amplification)
– About 10% of the storage capacity is reserved (spare area)
– Limited number of rewrites (between 3,000 and 100,000 cycles per cell)
• Wear-leveling algorithms ensure that write operations are distributed equally across the cells
2.2 HD – Controller
• The disk controller organizes low-level access to the disk
– e.g. head positioning, error checking, signal processing
– Usually integrated into the disk
– Provides a unified and abstracted interface to access the disk (e.g. LBA)
– Connects the disk to a peripheral bus (e.g. IDE, SCSI, FibreChannel, SAS)
• The host bus adapter (HBA) bridges between the peripheral bus and the system's internal bus (like PCIe, PCI)
– The internal bus is usually integrated into the system's main board
– Often confused with the disk controller
• DAS (Directly Attached Storage)
[Diagram: mechanics and disk controller inside the disk, connected via the peripheral bus to the host bus adapter on the internal bus of the system/mainboard]
• Sectors can be logically grouped into blocks by the operating system
– Sectors in a block do not necessarily need to be adjacent
– e.g. NTFS defaults to 4 KiB per block
• 8 sectors on a modern disk
• The hardware address of a block is a combination of
– Cylinder number, surface number, block number within the track
– The controller maps the hardware address to a logical block address (LBA)
• The disk controller transfers the content of whole blocks to a buffer
– The buffer resides in primary storage and can be accessed efficiently
– Time needed to transfer a random block (4 KiB/block on a ST3100034AS): < 10 msec
• Seek Time: time needed to position the head on the correct cylinder (< 8 msec)
• Latency (Rotational Delay): time until the correct block arrives below the head (< 0.14 msec)
• Block Transfer Time: time to read all sectors of a block (< 0.01 msec)
– Bulk transfer rate for n adjacent blocks (< 20 msec for n = 10)
• Seek time + rotational delay + n × block transfer time
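A small sketch with the rough upper-bound timing values from the slide, comparing n random block accesses with a bulk transfer of n adjacent blocks.

```python
# Rough access-time model from the slide (all values in milliseconds)
SEEK, ROTATIONAL_DELAY, BLOCK_TRANSFER = 8.0, 0.14, 0.01

def random_blocks_ms(n_blocks):
    # every random block pays seek + rotational delay again
    return n_blocks * (SEEK + ROTATIONAL_DELAY + BLOCK_TRANSFER)

def bulk_transfer_ms(n_blocks):
    # adjacent blocks pay seek + rotational delay only once
    return SEEK + ROTATIONAL_DELAY + n_blocks * BLOCK_TRANSFER

print(random_blocks_ms(10))  # ~81.5 ms for 10 random blocks
print(bulk_transfer_ms(10))  # ~8.2 ms for 10 adjacent blocks
```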
• Locating data on a disk is a major bottleneck
– Try operating on data already in the buffer
– Aim for bulk transfers, avoid random block transfers
2.3 RAID
• A single HD is often not sufficient
– Limited capacity
– Limited speed
– Limited reliability
• Idea: combine multiple HDs into a RAID array (Redundant Array of Independent Disks)
– A RAID array treats multiple hardware disks as a single logical disk
• More HDs for increased capacity
• Parallel access for increased speed
• Controlled redundancy for increased reliability
Silber 11.3
2.3 RAID Controller
• The RAID controller connects to multiple hard disks
– The disks are virtualized and appear to be just one single logical disk
– The RAID controller acts as an extended, specialized HBA (Host Bus Adapter)
– Still DAS (Directly Attached Storage)
[Diagram: RAID controller on the internal bus, attached via the peripheral bus to multiple disks that are represented as a single logical disk]
2.3 RAID Principles – Mirroring
• Mirroring (or shadowing): increases reliability by complete redundancy
• Idea: mirror disks are exact copies of the original disk
– Not space efficient
• Read speed can be n times as fast, write speed does not increase
• Increases reliability. Assume
– Two disks with an MTBF of 11 years each
• One original disk, one mirror disk
• Assume disk failures are independent of each other (unrealistic)
– Disk replacement time of 10 hours
– ► MTBF of the mirrored system is > 57,000 years!
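A minimal sketch of the mirrored-pair estimate above: data is lost only if the second disk fails while the first one is being replaced, giving the standard approximation MTBF² / (2 · MTTR) under the independence assumption stated on the slide.

```python
# MTBF of a 2-disk mirror: MTBF_disk^2 / (2 * MTTR), independence assumed
HOURS_PER_YEAR = 8_760

def mtbf_mirror_hours(mtbf_disk_hours, mttr_hours):
    return mtbf_disk_hours**2 / (2 * mttr_hours)

mtbf_disk = 100_000                           # hours, roughly 11 years
mirror = mtbf_mirror_hours(mtbf_disk, 10)     # 10 hours replacement time
print(round(mirror / HOURS_PER_YEAR), "years")  # ~57,000 years, as on the slide
```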
2.3 RAID Principles – Striping
• Striping: improves performance by parallelism
• Idea: distribute data among all disks for increased performance
• Bit-Level Striping: split all bits of a byte across the disks
– e.g. for 8 disks, write the i-th bit to disk i
– Number of disks needs to be a power of 2
– Each disk is involved in each access
• Access rate does not increase
• Read and write transfer speed increases linearly
• Simultaneous accesses are not possible
– Good for speeding up few, sequential, and large accesses
• Block-Level Striping: distribute blocks among the disks
– Only one disk is involved in reading a specific block
• Read and write speed of a single block is not increased
• Other disks are still free to read/write other blocks
• Read and write speed of multiple accesses increases
– Good for a large number of parallel accesses
2.3 RAID Principles – Error Correction Codes
• Error Correction Codes: increase reliability with computed redundancy
• Hamming Codes (~1940)
– Can detect and repair 1-bit errors within a set of n data bits by computing k parity bits
• n = 2^k − k − 1
• n = 1, k = 2; n = 4, k = 3; n = 11, k = 4; n = 26, k = 5; …
– Especially used for in-memory and tape error correction
• Media cannot detect errors autonomously
• Not really used for hard drives anymore
• Interleaved Parity (Reed-Solomon algorithm on the GF(2) Galois field)
– Can repair 1-bit errors (when the error position is known)
– Hard disks can detect read errors themselves, no need for complete Hamming codes
– Basic idea:
• From n data pieces D1, …, Dn compute parity data Dp by combining the data using logical XOR (eXclusive OR)
– XOR is associative and commutative
– Important: A XOR B XOR B = A
• i.e. Dp = D1 XOR D2 XOR … XOR Dn
• Assume D2 was lost. It can be reconstructed by D2 = Dp XOR D1 XOR D3 XOR … XOR Dn
2.3 RAID Principles – Interleaved Parity
• Interleaved parity. Example:
• A = 0101, B = 1100, C = 1011
• P = A XOR B XOR C = 0101 XOR 1100 XOR 1011 = 0010
• C is lost.
– P = A XOR B XOR C
– C = P XOR A XOR B
– C = A XOR B XOR C XOR A XOR B
– C = A XOR A XOR B XOR B XOR C
– C = 0 XOR C
– C = P XOR A XOR B = 0010 XOR 0101 XOR 1100 = 1011
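A small sketch of the interleaved-parity example above, using Python integers as bit patterns; XOR-ing the parity with the surviving pieces reconstructs the lost piece.

```python
# Interleaved parity via XOR, reproducing the example A=0101, B=1100, C=1011
A, B, C = 0b0101, 0b1100, 0b1011

P = A ^ B ^ C                      # parity block
print(format(P, "04b"))            # 0010

# C is lost -- reconstruct it from the parity and the remaining data
C_recovered = P ^ A ^ B
print(format(C_recovered, "04b"))  # 1011
assert C_recovered == C
```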
2.3 RAID in practical applications
• The 3 RAID principles can be combined in multiple ways
– Not every combination is useful
• This led to the definition of 7 core RAID levels
– RAID 0 – RAID 6
– The most dominant levels are RAID 0, RAID 1, RAID 1+0, RAID 5
• In the following examples, assume
– An MTBF of 100,000 hours (11.42 years) per disk
– A Mean Time To Repair (MTTR) of 6 hours
– The failure rate is constant and failures between disks are independent
– MTBF_raid is the mean time to data loss within the RAID if each failing disk is replaced within the MTTR
– D is the number of drives in the RAID set
– C = 200 GB is the capacity of one disk, C_raid the capacity of the whole RAID
• Mean Time To Repair (MTTR)
– MTTR = TimeToNotice + RebuildTime
– Assume a time to notice of 0.5 hours
– The rebuild time is the time for completely writing back the lost data
• Assume a disk capacity of 200 GB
• Write-back speed of 10 MB/sec
– Consists of reading the remaining disks
– Computing parity / reconstructing data
• Rebuild time around 5.5 hours
– During a rebuild, a RAID is especially vulnerable
– MTTR = 6 hours
2.3 RAID Levels
• File A (A1–Ax), File B (B1–Bx), File C (C1–Cx)
• RAID 0
– Block-level striping only
– Increased parallel access and transfer speeds, reduced reliability
– All disks contain data (0% overhead)
– Works with any number of disks
– MTBF_raid = MTBF_disk / D
– 4 disks:
• MTBF_raid = 2.86 years
• C_raid = 800 GB (0 GB wasted (0%))
– Common size: 2 disks
• MTBF_raid = 5.72 years
• C_raid = 400 GB (0 GB wasted (0%))
• RAID 1
– Mirroring only
– Increased reliability, increased read transfer speed, low space efficiency
– MTBF_raid = MTBF_disk^D / (D! · MTTR^(D−1))
– 4 disks:
• MTBF_raid = 2.2 trillion years
• C_raid = 200 GB (600 GB wasted (75%))
• The age of the universe may be around 15 billion years…
– Common size: 2 disks
• MTBF_raid = 95,130 years
• C_raid = 200 GB (200 GB wasted (50%))
• RAID 2
– Not used anymore in practice
• Was used in old mainframes
– Bit-level striping
– Uses Hamming codes
• Usually Hamming Code(7,4) – 4 data bits, 3 parity bits
• Reliable 1-bit error recovery (i.e. one disk may fail)
– 3 redundant disks per 4 data disks (75% overhead)
• Ratio is better for larger numbers of disks
– MTBF_raid = MTBF_disk² / (D · (D−1) · MTTR)
– 7 disks (does not really make sense for 4 – not comparable to the other values)
• MTBF_raid = 4,530 years
• C_raid = 800 GB (600 GB wasted (43%))
• RAID 3
– Interleaved parity
– Byte-level striping
– Dedicated parity disk
• Bottleneck! Every write operation needs to update the parity disk
• No parallel writes
– 1 redundant disk per n data disks
• Overhead decreases with the number of disks, while reliability also decreases
• 25% overhead for 4 data disks
– MTBF_raid = MTBF_disk² / (D · (D−1) · MTTR)
– 4 disks
• MTBF_raid = 15,854 years
• C_raid = 600 GB (200 GB wasted (25%))
• RAID 4
– Block-level striping
– As RAID 3 otherwise
– 4 disks (common size)
• MTBF_raid = 15,854 years
• C_raid = 600 GB (200 GB wasted (25%))
– 5 disks (also a common size)
• MTBF_raid = 9,513 years
• C_raid = 800 GB (200 GB wasted (20%))
• RAID 5
– Parity is distributed among the hard disks
• May allow for parallel block writes
– As RAID 4 otherwise
– Bottleneck when writing many files smaller than a block
• The whole parity block has to be read and rewritten for each minor write
– Can recover from a single disk failure
– MTBF_raid and C_raid as for RAID 3 & 4
• RAID 6
– Two independent parity blocks distributed among the disks
• May be implemented by parity on orthogonal data or by using Reed-Solomon on GF(2⁸)
– As RAID 5 otherwise
– 2 redundant disks per n data disks
• Can recover from a double disk failure
• No vulnerability during a single-failure rebuild
• Very suitable for larger arrays
• Write overhead due to the more complicated parity computation
– MTBF_raid = MTBF_disk³ / (D · (D−1) · (D−2) · MTTR²)
– 4 disks
• MTBF_raid = 132 million years
• C_raid = 400 GB (400 GB wasted (50%))
– 8 disks (common)
• MTBF_raid = 9,437 years (~RAID 5 w. D = 5)
• C_raid = 1,200 GB (400 GB wasted (25%))
2.3 Practical use of RAIDs
• Additionally, there are hybrid levels combining the core levels
– RAID 0+1, RAID 1+0, RAID 5+0, RAID 5+1, RAID 6+6, …
• RAID 1+0
– Mirrored sets nested in a striped set
• RAID 0 on sets of RAID 1 sets
– Very high read and write transfer speeds, increased reliability, low space efficiency, limited maximum size
– Most performant RAID combination
– D1 = drives per RAID 1 set, D0 = number of RAID 1 sets
– MTBF_raid = MTBF_disk^D1 / (D1! · MTTR^(D1−1)) / D0
– 4 disks: D1 = 2, D0 = 2
• MTBF_raid = 47,565 years
• C_raid = 400 GB (400 GB wasted (50%))
– 6 disks: D1 = 2, D0 = 3
• MTBF_raid = 31,706 years
• C_raid = 600 GB (600 GB wasted (50%))
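A small sketch evaluating the MTBF formulas quoted on the RAID-level slides for the 4-disk examples (MTBF_disk = 100,000 h, MTTR = 6 h); this only re-computes the slide values under the stated independence assumption, it is not a general RAID reliability model.

```python
from math import factorial

MTBF, MTTR, HOURS_PER_YEAR = 100_000, 6, 8_760

def raid0(d):        return MTBF / d
def raid1(d):        return MTBF**d / (factorial(d) * MTTR**(d - 1))
def raid5(d):        return MTBF**2 / (d * (d - 1) * MTTR)
def raid6(d):        return MTBF**3 / (d * (d - 1) * (d - 2) * MTTR**2)
def raid10(d1, d0):  return raid1(d1) / d0

for name, hours in [("RAID 0", raid0(4)), ("RAID 1", raid1(4)),
                    ("RAID 5", raid5(4)), ("RAID 6", raid6(4)),
                    ("RAID 1+0", raid10(2, 2))]:
    print(f"{name:9s} {hours / HOURS_PER_YEAR:,.2f} years")
# RAID 0 ~2.85, RAID 5 ~15,855, RAID 1+0 ~47,565 years;
# RAID 1 (~2.2 trillion) and RAID 6 (~132 million) are astronomically large
```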
2.4 Beyond RAID
• RAID controllers directly connect storage to the system bus
– Storage is available to only one system/server/application
• Number of disks is limited
– Consumer-grade RAID: 2–4 disks
– Enterprise-grade RAID: 8–24+ disks
• Solutions
– NAS (Network Attached Storage): provides abstracted file systems via network (software solution)
– SAN (Storage Area Network): virtualized logical storage within a specialized network on block level (hardware solution)
2.4 File Systems vs. Raw Devices
• Before discussing NAS, we need file systems
• A file system is software for abstracting file operations on a logical storage device
– Files are a collection of binary data
• Creating, reading, writing, deleting, finding, organizing
– How does a file access translate into top-level operations on a logical storage device?
• e.g. which blocks have to be read/written?
• Bridge between application software and (abstracted) hardware
[Diagram: application software → file system → logical storage]
• Raw device access allows applications to bypass the OS and the file system
• The application may directly tune aspects of physical storage
• May lead to very efficient implementations
– Used e.g. for high-performance databases, system virtualization, etc.
[Diagram: application software → logical storage, bypassing the file system]
2.4 NAS – Network Attached Storage
• Idea: provide a remote file system using already available network infrastructure
– NAS: Network Attached Storage
– Uses specialized network protocols (e.g. CIFS, NFS, FTP, etc.)
– Easiest case: file server (e.g. Linux + Samba)
• Advantages:
– Easy to set up, easy to use, cheap infrastructure
– Allows sharing of storage among several systems
– Abstracts on file system level (easy for most applications)
• Disadvantages
– Inefficient and slow
• Large protocol and processing overhead
– Abstracts on file system level (not suitable for special purposes like raw devices or storage virtualization)
[Diagram: application software → network → NAS server (file system + logical storage)]
2.4 SAN – Storage Area Network
• SANs offer specialized high-speed networks for storage devices
– Usually use local FibreChannel networks
– Remote locations may be connected via Ethernet or IP-WAN (Internet)
– The network uses specialized storage protocols
• iFCP (SCSI on FibreChannel)
• iSCSI (SCSI on TCP/IP)
• HyperSCSI (SCSI on raw Ethernet)
• SANs provide raw block-level access to logical storage devices
– Logical disks of any size can be offered by the SAN
– For a client system using a logical disk, it appears like a local disk or RAID
– The client system has full control over the file systems on its logical disks
[Diagram: application software and file system on the client, logical storage provided via the SAN]
[Diagram: SAN topology – servers with SAN HBAs connected via SAN switches over a SAN bus (iFCP); NAS heads bridge the SAN to an Ethernet network via a NAS protocol (CIFS); remote sites attached via a WAN-SAN bus (HyperSCSI); storage arrays attached via SAN/RAID HBAs and peripheral buses (SCSI, SAS, etc.)]
• Advantages:
– Very efficient
• Highly optimized local network infrastructure
• Optimized protocols with low overhead
– Very flexible (any number of systems may use any number of disks at any location)
– Helps with disaster protection
• A SAN can transparently span even remote locations
– May also employ NAS heads for NAS-like behavior
• Disadvantages
– Expensive
2.5 Case Study
• How much storage and bandwidth is needed by YouTube, and how might it be organized?
• It is all top secret, but there are educated guesses and some (older) leaked data…
• A Google video search restricted to YouTube.com reveals 187,397,091 indexed videos
– 3.35 min/video: based on the TOP-100 all-time videos
– 2.3 MB/min: based on a sample (very low variation)
– 8.3 MB/video
• Guessed size of all videos on YouTube is 1.56 PB
– Assume 160 GB/disk with MTBF = 16.6 years
• Based on the Google reliability study
– 9,800 hard disks are needed to store all videos just once without any redundancy
• MTBF = 14 hours ...
– Using 1,960 5+1 RAID 5s, 11,760 disks are needed
• MTBF = 6.84 years – not too great…
• Still, each video is only available once
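A minimal sketch of the back-of-the-envelope estimate above; all input numbers are the guesses from the slide.

```python
# Back-of-the-envelope estimate with the numbers guessed on the slide
videos        = 187_397_091
mb_per_video  = 8.3            # guessed average size per video
disk_gb       = 160
disk_mtbf_yrs = 16.6           # from the Google reliability study

total_pb = videos * mb_per_video / 10**9
disks    = videos * mb_per_video / (disk_gb * 1000)

print(f"{total_pb:.2f} PB")                          # ~1.56 PB
print(f"{disks:,.0f} disks without redundancy")      # ~9,700 (slide rounds to 9,800)
print(f"first-failure MTBF ~ {disk_mtbf_yrs * 8760 / disks:.0f} hours")  # ~15 h (slide: 14 h)
```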
– Using 196 (9+1)(5+1) RAID 55 arrays, 13,066 disks are needed
• RAID 5 arrays with 6 disks each; 10 of these arrays form an overlaying RAID 5
• MTBF = 14 million years (finally, the data is “safe” at one location)
• Still, each video is only available once
– No global disaster safety
– No global load balancing
• How might this look?
• YouTube grows fast
– Currently, around 200,000 new videos per day (1.66 TB/day)
• A larger number of disks has to be added per month
– Around 440 disks/month for new videos
– Around 80 disks/month to replace broken ones
• Growing exponentially
• It gets even worse…
• YouTube serves 200 million videos per day (as of mid-2007)
– 30 PB of data EVERY MONTH
– 154 Gbps (read: 154 Gigabit per second)
– This amounts to an average of 586,000 concurrent streams
– Popular videos get around 250,000 views per day
• 600 concurrent streams per FILE (25 MB/sec)
– This bandwidth is insanely expensive: 600,000 USD/month
• This massive amount of data cannot be hosted and served from a single location…
• The data needs to be distributed and globally load-balanced
• YouTube does not host and serve the videos themselves
– They hire Limelight Networks for that
• Limelight Networks
– Large CDN (Content Delivery Network) provider
– Owns 25 POPs (Points Of Presence) connected with their own backbone
• Each POP with up to 1000s of storage servers
• Can serve up to 1 Tbps!
• Limelight automatically distributes content among all POPs
– Data is massively redundant
– More popular data is replicated more, less popular data less
– Each file is served from the closest location with bandwidth to spare
• Global load balancing
– The data is disaster-proof!
• What to learn?
• Large-scale data storage and serving is
– Very resource intensive
– Very expensive
Physical Storage
• There are different types of storage
– Usually, there is a storage hierarchy
• Faster, smaller, more expensive storage
• Slower, bigger, less expensive storage
• Hard drives are currently the most popular media
– Mechanical devices
• High sequential transfer rates
• Bad random access times, low random transfer rates
• Prone to failure
– DBMS must be optimized for the storage devices used!
Next Lecture
• Access Paths
– Physical Data Access
– Index Structures
– Physical Tuning