Relational Database Systems 2
Silke Eckstein
Benjamin Köhncke
Institut für Informationssysteme
Technische Universität Braunschweig http://www.ifis.cs.tu-bs.de
2. Physical Data Storage
1 Architecture
[Recap diagram from Chapter 1: application programmers, DB administrators, and direct queries reach the DBMS through application interfaces; DDL goes to the DDL interpreter, embedded DML through the DML precompiler/compiler into application object code; queries are handled by the query engine / query evaluation engine; the transaction manager, buffer manager, and file manager of the data storage manager operate on the stored data, indices, statistics, and the catalog/dictionary]
2 Physical Data Storage
2.1 Introduction
2.2 Hard Disks
2.3 RAIDs
2.4 SANs and NAS
2.5 Case Study
2.1 Physical Storage Introduction
• A DBMS needs to retrieve, update, and process persistently stored data
– Storage considerations are an important factor when planning a database system (physical layer)
– Remember: the data has to be stored securely, but access to the data should be declarative!
• Data is stored on storage media. Media differ greatly in terms of
– Random access speed
– Random/sequential read/write speed
– Capacity
– Cost per capacity
EN 13.1
2.1 Relevant Media Characteristics
• Capacity: quantifies the amount of data that can be stored
– Base units: 1 Bit; 1 Byte = 2³ Bit = 8 Bit
– Capacity units according to IEC, IEEE, NIST, etc.:
• Usually used for file sizes and primary storage (for a higher degree of confusion, sometimes used with SI abbreviations…)
• 1 KiB = 1024¹ Byte; 1 MiB = 1024² Byte; 1 GiB = 1024³ Byte; …
– Capacity units according to SI:
• Usually used for advertising secondary/tertiary storage
• 1 KB = 1000¹ Byte ≈ 0.976 KiB; 1 MB = 1000² Byte ≈ 0.954 MiB; 1 GB = 1000³ Byte ≈ 0.931 GiB; …
– Especially used by the networking community:
• 1 Kb = 1000¹ Bit = 0.125 KB ≈ 0.122 KiB; 1 Mb = 1000² Bit = 0.125 MB ≈ 0.119 MiB
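A minimal Python sketch (not part of the original slides) illustrating the IEC vs. SI unit difference; the constant names are chosen here only for illustration.

```python
# Sketch: IEC (base-1024) vs. SI (base-1000) capacity units
KIB, MIB, GIB = 1024**1, 1024**2, 1024**3   # IEC units
KB, MB, GB = 1000**1, 1000**2, 1000**3      # SI units

capacity_bytes = 2 * 10**12  # a "2 TB" disk as advertised (SI)

print(capacity_bytes / GB)   # 2000.0 GB (SI)
print(capacity_bytes / GIB)  # ~1862.6 GiB (IEC) -- the "missing" space
print(1 * MB / MIB)          # 1 MB is only ~0.954 MiB
```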
2.1 A Kilo-Joke
http://xkcd.com/
2.1 Characteristic Parameters
• Random Access Time: average time to access a random piece of data at a known media position
– Usually measured in ms or ns
– Within some media, access time can vary depending on position (e.g. hard disks)
• Transfer Rate: average amount of consecutive data that can be transferred per time unit
– Usually measured in KB/sec, MB/sec, GB/sec, …
– Sometimes also in Kb/sec, Mb/sec, Gb/sec
2.1 Other Characteristics
• Volatile: memory needs constant power to keep data
– Dynamic: dynamic volatile memory needs to be “refreshed” regularly to keep data
– Static: no refresh necessary
• Access Modes
– Random Access: any piece of data can be accessed in approximately the same time
– Sequential Access: data can only be accessed in sequential order
• Write Mode
– Mutable Storage: can be read and written arbitrarily
– Write Once Read Many (WORM)
• Interesting for legal issues, e.g. the Sarbanes-Oxley Act (2002)
2.1 Online, Nearline, Offline
• Online media
– „always on“
– Each single piece of data can be accessed fast
– e.g. hard drives, main memory
• Nearline media
– Compromise between online and offline
– Offline media can automatically be put “online”
– e.g. juke boxes, robot libraries
• Offline media (disconnected media)
– Not under direct control of the processing unit
– Have to be connected manually
– e.g. box of backup tapes in the basement
2.1 The Storage Hierarchy
• Media characteristics result in a storage hierarchy
• DBMS optimize data distribution among the storage levels
– Primary Storage: fast, limited capacity, high price, usually volatile electronic storage
• Frequently used data / current working data
– Secondary Storage: slower, large capacity, lower price
• Main stored data
– Tertiary Storage: even slower, huge capacity, even lower price, usually offline
• Backup and long-term storage of infrequently used data
[Storage hierarchy pyramid: Primary (cache, RAM, ~100 ns) → Secondary (flash, magnetic disks, ~10 ms) → Tertiary (optical disks, tape, > 1 s); cost per capacity decreases and access time increases down the hierarchy]
2.1 Storage Media – Examples

Type | Media | Size | Random Acc. Speed | Transfer Speed | Characteristics | Price | Price/GB
Pri | L1 Processor Cache (Intel QX9000) | 32 KiB | 0.0008 ms | 6200 MB/sec | Vol, Stat, RA, OL | – | –
Pri | DDR3-RAM (Corsair 1600C7DHX) | 2 GiB | 0.004 ms | 8000 MB/sec | Vol, Dyn, RA, OL | €38 | €19
Sec | Harddrive SSD (OCZ Vertex2) | 160 GB | < 1 ms | 285 MB/sec | Stat, RA, OL | €239 | €1.50
Sec | Harddrive Magnetic (Seagate ST32000641AS) | 2000 GB | 8.5 ms | 138 MB/sec | Stat, RA, OL | €143 | €0.07
Ter | DVD+R (Verbatim DVD+R) | 4.7 GB | 98 ms | 11 MB/sec | Stat, RA, OF, WORM | €0.36/Disk | €0.07
Ter | LTO Streamer (Freecom LTO-920i) | 800 GB | 58 sec | 120 MB/sec | Stat, SA, OF | €80/Tape | €0.10

Last updated April 2011
Pri = Primary, Sec = Secondary, Ter = Tertiary
Vol = Volatile, Stat = Static, Dyn = Dynamic, RA = Random Access, SA = Sequential Access, OL = Online, OF = Offline, WORM = Write Once Read Many
2.2 Magnetic Disk Storage – HDs
• Hard drives are currently the standard for large, cheap, and persistent storage
– Usually used as the main storage media for most data in a DB
• DBMS need to be optimized for efficient disk storage and access
– Data access needs to be as fast as possible
– Often-used data should be accessible with highest speed; rarely needed data may take longer
– Different data items needed for certain recurring tasks should also be stored/accessed together
2.2 HD – How does it work?
• Directional magnetization of a ferromagnetic material
• Realized on hard disk platters
– Base platter made of non-magnetic aluminum or glass substrate
– Magnetic grains worked into the base platter to form magnetic regions
• Each region represents 1 Bit
– The read head can detect the magnetization direction of each region
– The write head may change the direction
2.2 HD – Notable Technology Advances
• Giant Magnetoresistance Effect (GMR)
– Discovered in 1988 simultaneously by Peter Grünberg and Albert Fert
• Both honored with the 2007 Nobel Prize in Physics
– Allows the construction of efficient read heads:
• The electric resistance of alternating ferromagnetic and non-magnetic layers changes dramatically (“giantly”) with changing magnetic field directions
– http://www.research.ibm.com/research/demos/gmr/cyberdemo1.htm
• Perpendicular Recording (used since 2005)
– Longitudinal recording is limited to ~200 Gb/inch² due to the superparamagnetic effect
• Thermal energy may spontaneously change the magnetic direction
– Perpendicular recording allows for up to 1000 Gb/inch²
– Very simplified: align the magnetic field orthogonal to the surface instead of parallel
• Magnetic regions can be smaller
• Usage of magnetic grains instead of continuous magnetic material
– Between magnetic direction transitions, Néel spikes are formed
• Areas of uncertain magnetic direction
– Néel spikes are larger for continuous materials
– Magnetic regions can be smaller as the transition width can be reduced
2.2 HD – Basic Architecture
• A hard disk is made up of multiple double-sided platters
– Platter sides are called surfaces
– Platters are fixed on the main spindle and rotate at equal and constant speed (common: 5400 rpm / 7200 rpm)
– Each surface has its own read and write head
– Heads are attached to arms
• Arms can position heads along the surface
• Heads cannot move independently
– Heads have no contact with the surface and hover on top of an air bearing
• Each surface is divided into circular tracks
– Some disks may use spirals
• All tracks of all surfaces with the same diameter are called a cylinder
– Data within the same cylinder can be accessed very efficiently
EN 13.2
• Each track is subdivided into sectors of equal capacity
a) Fixed angle sector subdivision
• Same number of sectors per track, changing density, constant speed
b) Fixed data density
• Outer tracks have more sectors than inner tracks
• Transfer speed higher on outer tracks
• Adjacent sectors can be grouped into clusters
2.2 HD – Reliability
• Hard drives are not completely reliable!
– Drives do fail
– Means for physical failure recovery are necessary
• Backups
• Redundancy
• Hard drives age and wear down. Wear is significantly increased by:
– Contact cycles (head parking)
– Spindle start-stops
– Power-on hours
– Operation outside the ideal environment
• Temperature too low/high
• Unstable voltage
• Reliability measures are statistical values assuming certain usage patterns
• Desktop usage (all per year): 2,400 hours, 10,000 motor start/stops, 25°C temperature
• Server usage (all per year): 8,760 hours, 250 motor start/stops, 40°C temperature
– Non-recoverable read errors: a sector on the surface cannot be read anymore – the data is lost
• Desktop disk: 1 per 10¹⁴ read bits, Server: 1 per 10¹⁵ read bits
• The disk can detect this!
– Maximum contact cycles: maximum number of allowed head contacts (parking)
• Usually around 50,000 cycles
– Mean Time Between Failures (MTBF): statistically anticipated time after which half of a large disk population has failed
• Drive manufacturers usually use optimistic simulations to guess the MTBF
• Desktop: 0.7 million hours (80 years), Server: 1.2 million hours (137 years) – manufacturers' values
– Annualized Failure Rate (AFR): probability of a failure per year, based on the MTBF
• AFR = OperatingHoursPerYear / MTBF_hours
• Desktop: 0.34%, Server: 0.73%
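A small sketch of the AFR formula above, reproducing the desktop and server numbers; the operating hours per year are taken from the usage patterns stated on the previous slide.

```python
# AFR = OperatingHoursPerYear / MTBF_hours (formula from the slide)
def afr(operating_hours_per_year, mtbf_hours):
    return operating_hours_per_year / mtbf_hours

print(f"Desktop AFR: {afr(2_400, 700_000):.2%}")    # ~0.34 %
print(f"Server  AFR: {afr(8_760, 1_200_000):.2%}")  # ~0.73 %
```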
• The failure rate during a hard disk's lifespan is not constant
• It can be better modeled by the “bathtub curve” having 3 components
– Infant Mortality Rate
– Wear-Out Failures
– Random Failures
2.2 Real World Failure Rates
• Report by Google
– 100,000 consumer-grade disks (80–400 GB, ATA interface, 5,400–7,200 RPM)
• Results (among others)
– Drives fail often!
– There is infant mortality
– High usage increases infant mortality, but not later failure rates
– Observed AFR is around 7% and MTBF 16.6 years!
– Careful: 2+ year results are biased, see the reference
• Reference: E. Pinheiro, W.-D. Weber, L. A. Barroso: Failure Trends in a Large Disk Drive Population. 5th USENIX Conference on File and Storage Technologies (FAST), 2007
2.2 HD – Example Specs
• Seagate ST32000641AS 2 TB (Desktop Harddrive, 2011)
– Manufacturer’s specifications

Specification | Value
Capacity | 2 TB
Platters | 4
Heads | 8
Cylinders | 16,383
Sectors per track | 63
Bytes per sector | 512
Spindle Speed | 7200 RPM
MTBF | 85 years
AFR | 0.34 %
Random Seek | 8.5 ms
Average latency | 4.2 ms
2.2 Reliability – Considerations
• Assume a storage need of 100 TB. Only the following HDs are available
– Capacity: 1 TB each
– MTBF: 100,000 hours each (ca. 11 years)
• Consider using 100 of these disks independently (w/o RAID)
– Total storage: 100,000 GB = 100 TB
– MTBF: 1,000 hours (ca. 42 days)
– THIS IS BAD!
• More sophisticated ways of using multiple disks are needed
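A minimal sketch of the reasoning above: with independent disks and constant failure rates, the mean time until the first of N disks fails is the single-disk MTBF divided by N.

```python
# First-failure MTBF of N independent disks (constant failure rate assumption)
def mtbf_array(mtbf_disk_hours, n_disks):
    return mtbf_disk_hours / n_disks

hours = mtbf_array(100_000, 100)
print(hours, "hours ≈", round(hours / 24), "days")  # 1000.0 hours ≈ 42 days
```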
Solid State Disk (SSD)
• Alternative to hard drives: SSD
– Uses microchips which retain data in non-volatile memory chips and contain no moving parts
• Uses the same interface as hard disk drives
– Easy replacement in most applications possible
• Key components
– Memory
– Controller
Memory
• Flash memory
– Most SSDs use NAND-based flash memory
– Retains data even without power
– Slower than DRAM solutions
– Single-level cell versus multi-level cell
– Wears down!
• DRAM
– Uses volatile random access memory
– Ultrafast data access (< 10 microseconds)
– Sometimes uses an internal battery or external power device to ensure data persistence
– Only for applications that require even faster access, but do not need data persistence after power loss
Controller
• The controller is an embedded processor
• It incorporates the electronics that bridge the NAND memory components to the host computer
• Some of its functions
– Error correction, wear leveling, bad block mapping, read and write caching, encryption, garbage collection
SSD – Summary
• Advantages
– Low access time and latency
– No moving parts, shock resistant
– Silent
– Lighter and more energy-efficient than HDDs
• Disadvantages
– Divided into blocks; if one byte is changed, the whole block has to be rewritten (write amplification)
– About 10% of the storage capacity is reserved (spare area)
– Limited number of rewrites (between 3,000 and 100,000 cycles per cell)
• Wear-leveling algorithms ensure that write operations are distributed equally across the cells
2.2 HD – Controller
• The disk controller organizes low-level access to the disk
– e.g. head positioning, error checking, signal processing
– Usually integrated into the disk
– Provides a unified and abstracted interface to access the disk (e.g. LBA)
– Connects the disk to a peripheral bus (e.g. IDE, SCSI, FibreChannel, SAS)
• The host bus adapter (HBA) bridges between the peripheral bus and the system's internal bus (like PCIe, PCI)
– The internal bus is usually integrated into the system's main board
– Often confused with the disk controller
• DAS (Directly Attached Storage)
[Diagram: mechanics and disk controller inside the disk, connected via the peripheral bus to the host bus adapter on the internal bus of the system/mainboard]
• Sectors can be logically grouped into blocks by the operating system
– Sectors in a block do not necessarily need to be adjacent
– e.g. NTFS defaults to 4 KiB per block
• 8 sectors on a modern disk
• The hardware address of a block is a combination of
– Cylinder number, surface number, block number within the track
– The controller maps the hardware address to a logical block address (LBA)
• The disk controller transfers the content of whole blocks to a buffer
– The buffer resides in primary storage and can be accessed efficiently
– Time needed to transfer a random block (4 KiB/block on a ST3100034AS): < 10 msec
• Seek Time: time needed to position the head on the correct cylinder (< 8 msec)
• Latency (Rotational Delay): time until the correct block arrives below the head (< 0.14 msec)
• Block Transfer Time: time to read all sectors of a block (< 0.01 msec)
– Bulk transfer rate for n adjacent blocks (< 20 msec for n = 10)
• Seek time + rotational delay + n × block transfer time
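A small sketch with the rough upper-bound timing values from the slide, comparing n random block accesses with a bulk transfer of n adjacent blocks.

```python
# Rough access-time model from the slide (all values in milliseconds)
SEEK, ROTATIONAL_DELAY, BLOCK_TRANSFER = 8.0, 0.14, 0.01

def random_blocks_ms(n_blocks):
    # every random block pays seek + rotational delay again
    return n_blocks * (SEEK + ROTATIONAL_DELAY + BLOCK_TRANSFER)

def bulk_transfer_ms(n_blocks):
    # adjacent blocks pay seek + rotational delay only once
    return SEEK + ROTATIONAL_DELAY + n_blocks * BLOCK_TRANSFER

print(random_blocks_ms(10))  # ~81.5 ms for 10 random blocks
print(bulk_transfer_ms(10))  # ~8.2 ms for 10 adjacent blocks
```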
• Locating data on a disk is a major bottleneck
– Try operating on data already in the buffer
– Aim for bulk transfers, avoid random block transfers
2.3 RAID
• A single HD is often not sufficient
– Limited capacity
– Limited speed
– Limited reliability
• Idea: combine multiple HDs into a RAID array (Redundant Array of Independent Disks)
– A RAID array treats multiple hardware disks as a single logical disk
• More HDs for increased capacity
• Parallel access for increased speed
• Controlled redundancy for increased reliability
Silber 11.3
2.3 RAID Controller
• The RAID controller connects to multiple hard disks
– The disks are virtualized and appear to be just one single logical disk
– The RAID controller acts as an extended, specialized HBA (Host Bus Adapter)
– Still DAS (Directly Attached Storage)
[Diagram: RAID controller on the internal bus, attached via the peripheral bus to multiple disks that are represented as a single logical disk]
2.3 RAID Principles – Mirroring
• Mirroring (or shadowing): increases reliability by complete redundancy
• Idea: mirror disks are exact copies of the original disk
– Not space efficient
• Read speed can be n times as fast, write speed does not increase
• Increases reliability. Assume
– Two disks with an MTBF of 11 years each
• One original disk, one mirror disk
• Assume disk failures are independent of each other (unrealistic)
– Disk replacement time of 10 hours
– ► MTBF of the mirrored system is > 57,000 years!
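A minimal sketch of the mirrored-pair estimate above: data is lost only if the second disk fails while the first one is being replaced, giving the standard approximation MTBF² / (2 · MTTR) under the independence assumption stated on the slide.

```python
# MTBF of a 2-disk mirror: MTBF_disk^2 / (2 * MTTR), independence assumed
HOURS_PER_YEAR = 8_760

def mtbf_mirror_hours(mtbf_disk_hours, mttr_hours):
    return mtbf_disk_hours**2 / (2 * mttr_hours)

mtbf_disk = 100_000                           # hours, roughly 11 years
mirror = mtbf_mirror_hours(mtbf_disk, 10)     # 10 hours replacement time
print(round(mirror / HOURS_PER_YEAR), "years")  # ~57,000 years, as on the slide
```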
2.3 RAID Principles – Striping
• Striping: improves performance by parallelism
• Idea: distribute data among all disks for increased performance
• Bit-Level Striping: split all bits of a byte across the disks
– e.g. for 8 disks, write the i-th bit to disk i
– Number of disks needs to be a power of 2
– Each disk is involved in each access
• Access rate does not increase
• Read and write transfer speed increases linearly
• Simultaneous accesses are not possible
– Good for speeding up few, sequential, and large accesses
• Block-Level Striping: distribute blocks among the disks
– Only one disk is involved in reading a specific block
• Read and write speed of a single block is not increased
• Other disks are still free to read/write other blocks
• Read and write speed of multiple accesses increases
– Good for a large number of parallel accesses
2.3 RAID Principles – Error Correction Codes
• Error Correction Codes: increase reliability with computed redundancy
• Hamming Codes (~1940)
– Can detect and repair 1-bit errors within a set of n data bits by computing k parity bits
• n = 2^k − k − 1
• n = 1, k = 2; n = 4, k = 3; n = 11, k = 4; n = 26, k = 5; …
– Especially used for in-memory and tape error correction
• Media cannot detect errors autonomously
• Not really used for hard drives anymore
• Interleaved Parity (Reed-Solomon algorithm on the GF(2) Galois field)
– Can repair 1-bit errors (when the error position is known)
– Hard disks can detect read errors themselves, no need for complete Hamming codes
– Basic idea:
• From n data pieces D1, …, Dn compute parity data Dp by combining the data using logical XOR (eXclusive OR)
– XOR is associative and commutative
– Important: A XOR B XOR B = A
• i.e. Dp = D1 XOR D2 XOR … XOR Dn
• Assume D2 was lost. It can be reconstructed by D2 = Dp XOR D1 XOR D3 XOR … XOR Dn
2.3 RAID Principles – Interleaved Parity
• Interleaved parity. Example:
• A = 0101, B = 1100, C = 1011
• P = A XOR B XOR C = 0101 XOR 1100 XOR 1011 = 0010
• C is lost.
– P = A XOR B XOR C
– C = P XOR A XOR B
– C = A XOR B XOR C XOR A XOR B
– C = A XOR A XOR B XOR B XOR C
– C = 0 XOR C
– C = P XOR A XOR B = 0010 XOR 0101 XOR 1100 = 1011
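A small sketch of the interleaved-parity example above, using Python integers as bit patterns; XOR-ing the parity with the surviving pieces reconstructs the lost piece.

```python
# Interleaved parity via XOR, reproducing the example A=0101, B=1100, C=1011
A, B, C = 0b0101, 0b1100, 0b1011

P = A ^ B ^ C                      # parity block
print(format(P, "04b"))            # 0010

# C is lost -- reconstruct it from the parity and the remaining data
C_recovered = P ^ A ^ B
print(format(C_recovered, "04b"))  # 1011
assert C_recovered == C
```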
2.3 RAID in practical applications
• The 3 RAID principles can be combined in multiple ways
– Not every combination is useful
• This led to the definition of 7 core RAID levels
– RAID 0 – RAID 6
– The most dominant levels are RAID 0, RAID 1, RAID 1+0, RAID 5
• In the following examples, assume
– An MTBF of 100,000 hours (11.42 years) per disk
– A Mean Time To Repair (MTTR) of 6 hours
– The failure rate is constant and failures between disks are independent
– MTBF_raid is the mean time to data loss within the RAID if each failing disk is replaced within the MTTR
– D is the number of drives in the RAID set
– C = 200 GB is the capacity of one disk, C_raid the capacity of the whole RAID
• Mean Time To Repair (MTTR)
– MTTR = TimeToNotice + RebuildTime
– Assume a time to notice of 0.5 hours
– The rebuild time is the time for completely writing back the lost data
• Assume a disk capacity of 200 GB
• Write-back speed of 10 MB/sec
– Consists of reading the remaining disks
– Computing parity / reconstructing data
• Rebuild time around 5.5 hours
– During a rebuild, a RAID is especially vulnerable
– MTTR = 6 hours
2.3 RAID Levels
• File A (A1–Ax), File B (B1–Bx), File C (C1–Cx)
• RAID 0
– Block-level striping only
– Increased parallel access and transfer speeds, reduced reliability
– All disks contain data (0% overhead)
– Works with any number of disks
– MTBF_raid = MTBF_disk / D
– 4 disks:
• MTBF_raid = 2.86 years
• C_raid = 800 GB (0 GB wasted (0%))
– Common size: 2 disks
• MTBF_raid = 5.72 years
• C_raid = 400 GB (0 GB wasted (0%))
• RAID 1
– Mirroring only
– Increased reliability, increased read transfer speed, low space efficiency
– MTBF_raid = MTBF_disk^D / (D! · MTTR^(D−1))
– 4 disks:
• MTBF_raid = 2.2 trillion years
• C_raid = 200 GB (600 GB wasted (75%))
• The age of the universe may be around 15 billion years…
– Common size: 2 disks
• MTBF_raid = 95,130 years
• C_raid = 200 GB (200 GB wasted (50%))
• RAID 2
– Not used anymore in practice
• Was used in old mainframes
– Bit-level striping
– Uses Hamming codes
• Usually Hamming Code(7,4) – 4 data bits, 3 parity bits
• Reliable 1-bit error recovery (i.e. one disk may fail)
– 3 redundant disks per 4 data disks (75% overhead)
• Ratio is better for larger numbers of disks
– MTBF_raid = MTBF_disk² / (D · (D−1) · MTTR)
– 7 disks (does not really make sense for 4 – not comparable to the other values)
• MTBF_raid = 4,530 years
• C_raid = 800 GB (600 GB wasted (43%))
• RAID 3
– Interleaved parity
– Byte-level striping
– Dedicated parity disk
• Bottleneck! Every write operation needs to update the parity disk
• No parallel writes
– 1 redundant disk per n data disks
• Overhead decreases with the number of disks, while reliability also decreases
• 25% overhead for 4 data disks
– MTBF_raid = MTBF_disk² / (D · (D−1) · MTTR)
– 4 disks
• MTBF_raid = 15,854 years
• C_raid = 600 GB (200 GB wasted (25%))
• RAID 4
– Block-level striping
– As RAID 3 otherwise
– 4 disks (common size)
• MTBF_raid = 15,854 years
• C_raid = 600 GB (200 GB wasted (25%))
– 5 disks (also a common size)
• MTBF_raid = 9,513 years
• C_raid = 800 GB (200 GB wasted (20%))
• RAID 5
– Parity is distributed among the hard disks
• May allow for parallel block writes
– As RAID 4 otherwise
– Bottleneck when writing many files smaller than a block
• The whole parity block has to be read and rewritten for each minor write
– Can recover from a single disk failure
– MTBF_raid and C_raid as for RAID 3 & 4
• RAID 6
– Two independent parity blocks distributed among the disks
• May be implemented by parity on orthogonal data or by using Reed-Solomon on GF(2⁸)
– As RAID 5 otherwise
– 2 redundant disks per n data disks
• Can recover from a double disk failure
• No vulnerability during a single-failure rebuild
• Very suitable for larger arrays
• Write overhead due to the more complicated parity computation
– MTBF_raid = MTBF_disk³ / (D · (D−1) · (D−2) · MTTR²)
– 4 disks
• MTBF_raid = 132 million years
• C_raid = 400 GB (400 GB wasted (50%))
– 8 disks (common)
• MTBF_raid = 9,437 years (~RAID 5 w. D = 5)
• C_raid = 1,200 GB (400 GB wasted (25%))
2.3 Practical use of RAIDs
• Additionally, there are hybrid levels combining the core levels
– RAID 0+1, RAID 1+0, RAID 5+0, RAID 5+1, RAID 6+6, …
• RAID 1+0
– Mirrored sets nested in a striped set
• RAID 0 on sets of RAID 1 sets
– Very high read and write transfer speeds, increased reliability, low space efficiency, limited maximum size
– Most performant RAID combination
– D1 = drives per RAID 1 set, D0 = number of RAID 1 sets
– MTBF_raid = MTBF_disk^D1 / (D1! · MTTR^(D1−1)) / D0
– 4 disks: D1 = 2, D0 = 2
• MTBF_raid = 47,565 years
• C_raid = 400 GB (400 GB wasted (50%))
– 6 disks: D1 = 2, D0 = 3
• MTBF_raid = 31,706 years
• C_raid = 600 GB (600 GB wasted (50%))
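A small sketch evaluating the MTBF formulas quoted on the RAID-level slides for the 4-disk examples (MTBF_disk = 100,000 h, MTTR = 6 h); this only re-computes the slide values under the stated independence assumption, it is not a general RAID reliability model.

```python
from math import factorial

MTBF, MTTR, HOURS_PER_YEAR = 100_000, 6, 8_760

def raid0(d):        return MTBF / d
def raid1(d):        return MTBF**d / (factorial(d) * MTTR**(d - 1))
def raid5(d):        return MTBF**2 / (d * (d - 1) * MTTR)
def raid6(d):        return MTBF**3 / (d * (d - 1) * (d - 2) * MTTR**2)
def raid10(d1, d0):  return raid1(d1) / d0

for name, hours in [("RAID 0", raid0(4)), ("RAID 1", raid1(4)),
                    ("RAID 5", raid5(4)), ("RAID 6", raid6(4)),
                    ("RAID 1+0", raid10(2, 2))]:
    print(f"{name:9s} {hours / HOURS_PER_YEAR:,.2f} years")
# RAID 0 ~2.85, RAID 5 ~15,855, RAID 1+0 ~47,565 years;
# RAID 1 (~2.2 trillion) and RAID 6 (~132 million) are astronomically large
```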
2.4 Beyond RAID
• RAID controllers directly connect storage to the system bus
– Storage is available to only one system/server/application
• Number of disks is limited
– Consumer-grade RAID: 2–4 disks
– Enterprise-grade RAID: 8–24+ disks
• Solutions
– NAS (Network Attached Storage): provides abstracted file systems via network (software solution)
– SAN (Storage Area Network): virtualized logical storage within a specialized network on block level (hardware solution)
2.4 File Systems vs. Raw Devices
• Before discussing NAS, we need file systems
• A file system is software for abstracting file operations on a logical storage device
– Files are a collection of binary data
• Creating, reading, writing, deleting, finding, organizing
– How does a file access translate into top-level operations on a logical storage device?
• e.g. which blocks have to be read/written?
• Bridge between application software and (abstracted) hardware
[Diagram: application software → file system → logical storage]
• Raw device access allows applications to bypass the OS and the file system
• The application may directly tune aspects of physical storage
• May lead to very efficient implementations
– Used e.g. for high-performance databases, system virtualization, etc.
[Diagram: application software → logical storage, bypassing the file system]
2.4 NAS – Network Attached Storage
• Idea: provide a remote file system using already available network infrastructure
– NAS: Network Attached Storage
– Uses specialized network protocols (e.g. CIFS, NFS, FTP, etc.)
– Easiest case: file server (e.g. Linux + Samba)
• Advantages:
– Easy to set up, easy to use, cheap infrastructure
– Allows sharing of storage among several systems
– Abstracts on file system level (easy for most applications)
• Disadvantages
– Inefficient and slow
• Large protocol and processing overhead
– Abstracts on file system level (not suitable for special purposes like raw devices or storage virtualization)
[Diagram: application software → network → NAS server (file system + logical storage)]
2.4 SAN – Storage Area Network
• SANs offer specialized high-speed networks for storage devices
– Usually use local FibreChannel networks
– Remote locations may be connected via Ethernet or IP-WAN (Internet)
– The network uses specialized storage protocols
• iFCP (SCSI on FibreChannel)
• iSCSI (SCSI on TCP/IP)
• HyperSCSI (SCSI on raw Ethernet)
• SANs provide raw block-level access to logical storage devices
– Logical disks of any size can be offered by the SAN
– For a client system using a logical disk, it appears like a local disk or RAID
– The client system has full control over the file systems on its logical disks
[Diagram: application software and file system on the client, logical storage provided via the SAN]
[Diagram: SAN topology – servers with SAN HBAs connected via SAN switches over a SAN bus (iFCP); NAS heads bridge the SAN to an Ethernet network via a NAS protocol (CIFS); remote sites attached via a WAN-SAN bus (HyperSCSI); storage arrays attached via SAN/RAID HBAs and peripheral buses (SCSI, SAS, etc.)]
• Advantages:
– Very efficient
• Highly optimized local network infrastructure
• Optimized protocols with low overhead
– Very flexible (any number of systems may use any number of disks at any location)
– Helps with disaster protection
• A SAN can transparently span even remote locations
– May also employ NAS heads for NAS-like behavior
• Disadvantages
– Expensive
2.5 Case Study
• How much storage and bandwidth is needed by YouTube, and how might it be organized?
• It is all top secret, but there are educated guesses and some (older) leaked data…
• A Google video search restricted to YouTube.com reveals 187,397,091 indexed videos
– 3.35 min/video: based on the TOP-100 all-time videos
– 2.3 MB/min: based on a sample (very low variation)
– 8.3 MB/video
• Guessed size of all videos on YouTube is 1.56 PB
– Assume 160 GB/disk with MTBF = 16.6 years
• Based on the Google reliability study
– 9,800 hard disks are needed to store all videos just once without any redundancy
• MTBF = 14 hours ...
– Using 1,960 5+1 RAID 5s, 11,760 disks are needed
• MTBF = 6.84 years – not too great…
• Still, each video is only available once
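A minimal sketch of the back-of-the-envelope estimate above; all input numbers are the guesses from the slide.

```python
# Back-of-the-envelope estimate with the numbers guessed on the slide
videos        = 187_397_091
mb_per_video  = 8.3            # guessed average size per video
disk_gb       = 160
disk_mtbf_yrs = 16.6           # from the Google reliability study

total_pb = videos * mb_per_video / 10**9
disks    = videos * mb_per_video / (disk_gb * 1000)

print(f"{total_pb:.2f} PB")                          # ~1.56 PB
print(f"{disks:,.0f} disks without redundancy")      # ~9,700 (slide rounds to 9,800)
print(f"first-failure MTBF ~ {disk_mtbf_yrs * 8760 / disks:.0f} hours")  # ~15 h (slide: 14 h)
```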
– Using 196 (9+1)(5+1) RAID 55 arrays, 13,066 disks are needed
• RAID 5 arrays with 6 disks each; 10 of these arrays form an overlaying RAID 5
• MTBF = 14 million years (finally, the data is “safe” at one location)
• Still, each video is only available once
– No global disaster safety
– No global load balancing
• How might this look?
• YouTube grows fast
– Currently, around 200,000 new videos per day (1.66 TB/day)
• A larger number of disks has to be added per month
– Around 440 disks/month for new videos
– Around 80 disks/month to replace broken ones
• Growing exponentially
• It gets even worse…
• YouTube serves 200 million videos per day (as of mid-2007)
– 30 PB of data EVERY MONTH
– 154 Gbps (read: 154 Gigabit per second)
– This amounts to an average of 586,000 concurrent streams
– Popular videos get around 250,000 views per day
• 600 concurrent streams per FILE (25 MB/sec)
– This bandwidth is insanely expensive: 600,000 USD/month
• This massive amount of data cannot be hosted and served from a single location…
• The data needs to be distributed and globally load-balanced
• YouTube does not host and serve the videos themselves
– They hire Limelight Networks for that
• Limelight Networks
– Large CDN (Content Delivery Network) provider
– Owns 25 POPs (Points Of Presence) connected with their own backbone
• Each POP with up to 1000s of storage servers
• Can serve up to 1 Tbps!
• Limelight automatically distributes content among all POPs
– Data is massively redundant
– More popular data is replicated more, less popular data less
– Each file is served from the closest location with bandwidth to spare
• Global load balancing
– The data is disaster-proof!
• What to learn?
• Large-scale data storage and serving is
– Very resource intensive
– Very expensive
Physical Storage
• There are different types of storage
– Usually, there is a storage hierarchy
• Faster, smaller, more expensive storage
• Slower, bigger, less expensive storage
• Hard drives are currently the most popular media
– Mechanical devices
• High sequential transfer rates
• Bad random access times, low random transfer rates
• Prone to failure
– DBMS must be optimized for the storage devices used!
Next Lecture
• Access Paths
– Physical Data Access
– Index Structures
– Physical Tuning