Wolf-Tilo Balke
Benjamin Köhncke
Institut für Informationssysteme
Technische Universität Braunschweig http://www.ifis.cs.tu-bs.de
Relational Database Systems 2
2. Physical Data Storage
2.1 Introduction
2.2 Hard Disks
2.3 RAIDs
2.4 SANs and NAS
2.5 Case Study
Relational Database Systems 2 – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 2
2 Physical Data Storage
• DBMS needs to retrieve, update and process persistently stored data
– Storage consideration is an important factor in planning a database system (physical layer)
– Remember: The data has to be securely stored, but access to the data should be declarative!
2.1 Physical Storage Introduction
• Data is stored on storage media. Media differ highly in terms of
– Random Access Speed
– Random/Sequential Read/Write Speed
– Capacity
– Cost per Capacity
EN 13.1
2.1 Physical Storage Introduction
• Capacity: Quantifies the amount of data which can be stored
– Base Units: 1 Bit, 1 Byte = 2³ Bit = 8 Bit
– Capacity units according to IEC, IEEE, NIST, etc:
• Usually used for file sizes and primary storage (for higher
degree of confusion, sometimes used with SI abbreviations…)
• 1 KiB = 1024¹ Byte; 1 MiB = 1024² Byte; 1 GiB = 1024³ Byte; …
– Capacity units according to SI:
• Usually used for advertising secondary/tertiary storage
• 1 KB = 1000¹ Byte ≈ 0.976 KiB; 1 MB = 1000² Byte ≈ 0.954 MiB;
1 GB = 1000³ Byte ≈ 0.931 GiB; …
– Especially used by the networking community:
• 1 Kb = 1000¹ Bit = 0.125 KB ≈ 0.122 KiB; 1 Mb = 1000² Bit = 0.125 MB ≈ 0.119 MiB
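The IEC-vs-SI confusion above is easy to check numerically. A minimal sketch (constant and function names are my own):

```python
# Binary (IEC) vs. decimal (SI) capacity units, as defined on the slide.
KIB, MIB, GIB = 1024**1, 1024**2, 1024**3  # bytes per KiB, MiB, GiB
KB, MB, GB = 1000**1, 1000**2, 1000**3     # bytes per KB, MB, GB

def in_iec(n_bytes, iec_unit):
    """Express a byte count in an IEC unit (KiB/MiB/GiB)."""
    return n_bytes / iec_unit

print(round(in_iec(1 * KB, KIB), 3))  # 1 KB -> 0.977 KiB
print(round(in_iec(1 * MB, MIB), 3))  # 1 MB -> 0.954 MiB
print(round(in_iec(1 * GB, GIB), 3))  # 1 GB -> 0.931 GiB
```

This is why an advertised "1 TB" disk shows up as roughly 0.909 TiB in the operating system.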
2.1 Relevant Media Characteristics
2.1 A Kilo-Joke
(Comic from http://xkcd.com/)
• Random Access Time: Average time to access a random piece of data at a known media position
– Usually measured in ms or ns
– Within some media, access time can vary depending on position (e.g. hard disks)
• Transfer Rate: Average amount of consecutive data which can be transferred per time unit
– Usually measured in KB/sec, MB/sec, GB/sec,…
– Sometimes also in Kb/sec, Mb/sec, Gb/sec
2.1 Characteristic Parameters
• Volatile: Memory needs constant power to keep data
– Dynamic: Dynamic volatile memory needs to be “refreshed” regularly to keep data
– Static: No refresh necessary
• Access Modes
– Random Access: Any piece of data can be accessed in approximately the same time
– Sequential Access: Data can only be accessed in sequential order
• Write Mode
– Mutable Storage: Can be read and written arbitrarily
– Write Once Read Many (WORM)
• Interesting for legal issues, e.g. the Sarbanes-Oxley Act (2002)
2.1 Other characteristics
• Online media
– “always on”
– Each single piece of data can be accessed fast
– e.g. hard drives, main memory
• Nearline media
– Compromise between online and offline
– Offline media can automatically be put online
– e.g. juke boxes, robot libraries
• Offline media (disconnected media)
– Not under direct control of the processing unit
– Have to be connected manually
– e.g. box of backup tapes in basement
2.1 Online, Nearline, Offline
• Media characteristics result in a storage hierarchy
• DBMS optimize data distribution among the storage levels
– Primary Storage: Fast, limited capacity, high price, usually volatile electronic storage
• Frequently used data / current work data
– Secondary Storage: Slower, large capacity, lower price
• Main stored data
– Tertiary Storage: Even slower, huge capacity, even lower price, usually offline
• Backup and long term storage of not frequently used data
2.1 The Storage Hierarchy
[Figure: storage pyramid, cost and speed decrease from top to bottom]
– Primary: Cache, RAM (~100 ns)
– Secondary: Flash, Magnetic Disks (~10 ms)
– Tertiary: Optical Disks, Tape (> 1 s)
2.1 Storage Media – Examples (last updated March 2008)

Type | Media                                    | Size    | Random Acc. | Transfer Speed | Characteristics    | Price      | Price/GB
Pri  | L1 Processor Cache (Intel QX9000)        | 32 KiB  | 0.0008 ms   | 6200 MB/sec    | Vol, Stat, RA, OL  | –          | –
Pri  | DDR3-RAM (Corsair 1600C7DHX)             | 2 GiB   | 0.004 ms    | 8000 MB/sec    | Vol, Dyn, RA, OL   | €200       | €93
Sec  | Harddrive SSD (MTRON SSD MOBI64)         | 64 GB   | 0.1 ms      | 95 MB/sec      | Stat, RA, OL       | €1050      | €16
Sec  | Harddrive Magnetic (Seagate ST3100034AS) | 1000 GB | 12 ms       | 80 MB/sec      | Stat, RA, OL       | €200       | €0.20
Ter  | DVD+R (Sony DRU-810A + Fuji disks)       | 4.7 GB  | 98 ms       | 11 MB/sec      | Stat, RA, OF, WORM | €0.60/disk | €0.12
Ter  | LTO Streamer (Freecom LTO-920i)          | 800 GB  | 58 sec      | 120 MB/sec     | Stat, SA, OF       | €80/tape   | €0.10

Pri=Primary, Sec=Secondary, Ter=Tertiary
Vol=Volatile, Stat=Static, Dyn=Dynamic, RA=Random Access, SA=Sequential Access, OL=Online, OF=Offline, WORM=Write Once Read Many
• Hard drives are currently the standard for large, cheap and persistent storage
– Usually used as the main storage media for most data in a DB
• DBMS need to be optimized for efficient disk storage and access
– Data access needs to be as fast as possible
– Often used data should be accessible with highest speed, rarely needed data may take longer
– Different data items needed for certain reoccurring tasks should also be stored/accessed together
2.2 Magnetic Disk Storage – HDs
• Directional magnetization of a ferromagnetic material
• Realized on hard disk platters
– Base platter made of non-magnetic aluminum or glass substrate
– Magnetic grains worked into the base platter to form magnetic regions
• Each region represents 1 Bit
– Read head can detect magnetization direction of each region
– Write head may change direction
2.2 HD – How does it work?
• Giant MagnetoResistance Effect (GMR)
– Discovered 1988 simultaneously by Peter Grünberg and Albert Fert
• Both honored with the 2007 Nobel Prize in Physics
– Allows the construction of efficient read heads:
• The electric resistance of a stack of alternating ferromagnetic and non-magnetic layers changes strongly (“giantly”) with changing magnetic field directions
2.2 HD – Notable Technology Advances
• Perpendicular Recording (used since 2005)
– Longitudinal recording is limited to ~200 Gb/inch² due to the superparamagnetic effect
• Thermal energy may spontaneously change the magnetic direction
– Perpendicular recording allows for up to 1000 Gb/inch²
– Very simplified: align the magnetic field orthogonal to the surface instead of parallel
• Magnetic regions can be smaller
Relational Database Systems 2 – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 16
2.2 HD – Notable Technology Advances
• Usage of magnetic grains instead of continuous magnetic material
– Between magnetic direction transitions, Néel spikes are formed
• Areas of uncertain magnetic direction
– Néel spikes are larger for continuous materials
– Magnetic regions can be smaller as the transition width can be reduced
2.2 HD – Notable Technology Advances
• A hard disk is made up of multiple double-sided platters
– Platter sides are called surfaces
– Platters are fixed on the main spindle and rotate at equal and constant speed (common: 5400 rpm / 7200 rpm)
– Each surface has its own read and write head
– Heads are attached to arms
• Arms can position the heads along the surface
• Heads cannot move independently
– Heads have no contact with the surface and hover on top of an air bearing
EN 13.2
2.2 HD – Basic Architecture
• Each surface is divided into circular tracks
– Some disks may use spirals
• All tracks of all surfaces with the same diameter are called a cylinder
– Data within the same cylinder can be accessed very efficiently
EN 13.2
2.2 HD – Basic Architecture
• Each track is subdivided into sectors of equal capacity
a) Fixed angle sector subdivision
• Same number of sectors per track, changing density, constant speed
b) Fixed data density
• Outer tracks have more sectors than inner tracks
• Transfer speed higher on outer tracks
• Adjacent sectors can be grouped into clusters
EN 13.2
2.2 HD – Basic Architecture
• Hard drives are not completely reliable!
– Drives do fail
– Means for physical failure recovery are necessary
• Backups
• Redundancy
• Hard drives age and wear down. Wear increases significantly with:
– Contact cycles (head parking)
– Spindle start/stops
– Power-on hours
– Operation outside ideal environment
• Temperature too low/high
• Unstable voltage
2.2 HD - Reliability
• Reliability measures are statistical values assuming certain usage patterns
• Desktop usage (all per year): 2,400 hours, 10,000 motor start/stops, 25°C temperature
• Server usage (all per year): 8,760 hours, 250 motor start/stops, 40°C temperature
– Non-Recoverable read errors: A sector on the surface cannot be read anymore – the data is lost
• Desktop disk: 1 per 10¹⁴ read bits, Server: 1 per 10¹⁵ read bits
• Disk can detect this!
– Maximum contact cycles: Maximum number of allowed head contacts (parking)
• Usually around 50 000 cycles
2.2 HD - Reliability
– Mean Time Between Failure (MTBF): Statistically expected time after which 50% of a large disk population has failed
• Drive manufacturers usually use optimistic simulations to estimate the MTBF
• Desktop: 0.7 million hours (80 years), Server: 1.2 million hours (137 years) – manufacturers’ values
– Annualized Failure Rate (AFR): Probability of a failure per year based on MTBF
• AFR = OperatingHoursPerYear / MTBFhours
• Desktop: 0.34%, Server: 0.73%
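The slide’s AFR numbers can be reproduced directly from its formula. A small sketch (helper name is my own):

```python
# AFR = OperatingHoursPerYear / MTBF_hours, as defined on the slide.
def afr(operating_hours_per_year, mtbf_hours):
    return operating_hours_per_year / mtbf_hours

desktop = afr(2_400, 700_000)    # desktop profile, manufacturer MTBF
server  = afr(8_760, 1_200_000)  # server profile, manufacturer MTBF
print(f"desktop: {desktop:.2%}, server: {server:.2%}")
# -> desktop: 0.34%, server: 0.73%
```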
2.2 HD - Reliability
• Failure rate during a hard disks lifespan is not constant
• Can be better modeled by the “bathtub curve”
having 3 components
– Infant Mortality Rate – Wear Out Failures – Random Failures
2.2 HD - Reliability
• Report by Google
– 100,000 consumer grade disks (80–400 GB, ATA interface, 5400–7200 RPM)
• Results (among others)
– Drives fail often!
– There is an infant mortality effect
– High usage increases infant mortality, but not later failure rates
– Observed AFR is around 7% and MTBF 16.6 years!
Reference: E. Pinheiro, W.-D. Weber, L. A. Barroso: Failure Trends in a Large Disk Drive Population. 5th USENIX Conference on File and Storage Technologies (FAST), 2007.
2.2 Real World Failure Rates
Careful: 2+ year results are biased. See reference.
• Seagate ST3100034AS (Desktop Harddrive, 2008)
– Manufacturer’s specifications
2.2 HD - Example Specs
Specification Value
Capacity 1 TB
Platters 4
Heads 8
Cylinders 16,383
Sectors per track 63
Bytes per sector 512
Spindle Speed 7200 RPM
MTBF 80 years
AFR 0.34 %
• Assume a storage need of 10 TB. Only the following HDs are available:
– Capacity: 100 GB capacity each
– MTBF: 100,000 hours each (ca. 11 years)
• Consider using 100 of these disks independently (w/o RAID)
– Total Storage: 10,000 GB = 10 TB
– MTBF: 1,000 hours (ca. 42 days) – THIS IS BAD!
• More sophisticated ways of using multiple disks are needed
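The 42-day figure follows because, with constant and independent failure rates, the array’s failure rate is the sum of the disks’ rates. A quick check (variable names are my own):

```python
# The array fails as soon as ANY disk fails, so with constant, independent
# failure rates the MTBF divides by the number of disks.
mtbf_disk_h = 100_000  # per disk, ca. 11 years
disks = 100

mtbf_array_h = mtbf_disk_h / disks
print(mtbf_array_h, round(mtbf_array_h / 24))  # -> 1000.0 hours, ~42 days
```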
2.2 Reliability – Considerations
• The disk controller organizes low-level access to the disk
– e.g. head positioning, error checking, signal processing
– Usually integrated into the disk
– Provides a unified and abstracted interface to access the disk (e.g. LBA)
– Connects the disk to a peripheral bus (e.g. IDE, SCSI, FibreChannel, SAS)
• The host bus adapter (HBA) bridges between the peripheral bus and the system’s internal bus (like PCIe, PCI)
– The internal bus is usually integrated into the system’s main board
– Often confused with the disk controller
• DAS (Directly Attached Storage)
2.2 HD – Controller
[Figure: mechanics and disk controller inside the disk, connected via the peripheral bus to the host bus adapter, which attaches to the internal bus of the inner system / mainboard]
• Sectors can be logically grouped to blocks by the operating system
– Sectors in a block do not necessarily need to be adjacent
– e.g. NTFS defaults to 4 KiB per block
• 8 sectors on a modern disk
• The hardware address of a block is a combination of
– cylinder number, surface number, and block number within the track
– The controller maps the hardware address to a logical block address (LBA)
2.2 HD – Controller
• Disk controller transfers content of whole blocks to buffer
– Buffer resides in a primary storage and can be accessed efficiently
– Time needed to transfer a random block (4 KiB/block on ST3100034AS): < 10 msec
• Seek Time: Time needed to position head to correct cylinder (<8 msec)
• Latency (Rotational Delay): Time until the correct block arrives below the head (<0.14 msec)
• Block Transfer Time: Time to read all sectors of block (<0.01 msec)
– Bulk Transfer Time for n adjacent blocks (< 20 msec for n = 10)
• Seek Time + Rotational Delay + n · Block Transfer Time
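The decomposition above can be sketched as follows. The component values here are illustrative, not the slide’s exact figures; the rotational delay is derived from the fact that the average wait at 7200 RPM is half a revolution:

```python
# Access time for n adjacent blocks: pay seek + rotational delay once,
# then transfer the blocks back to back (illustrative values).
seek_ms = 8.0                    # position head on the right cylinder
latency_ms = 60_000 / 7200 / 2   # half a revolution at 7200 RPM, ~4.17 ms
block_transfer_ms = 0.01         # read the sectors of one block

def access_ms(n_blocks):
    return seek_ms + latency_ms + n_blocks * block_transfer_ms

print(round(access_ms(1), 2), round(access_ms(10), 2))  # -> 12.18 12.27
```

Note how little the per-block transfer contributes: reading 10 adjacent blocks costs barely more than reading one, which is exactly why bulk transfer is preferred over random block transfer.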
2.2 HD – Controller
• Locating data on a disk is a major bottleneck
– Try operating on data already in buffer
– Aim for bulk transfer, avoid random block transfer
2.2 HD – Controller
• A single HD is often not sufficient
– Limited capacity
– Limited speed
– Limited reliability
• Idea: Combine multiple HDs into a RAID (Redundant Array of Independent Disks)
– RAID Array treats multiple hardware disks as a single logical disk
• More HDs for increased capacity
• Parallel access for increased speed
• Controlled redundancy for increased reliability
Silber 11.3
2.3 RAID
• The RAID controller connects to multiple hard disks
– Disks are virtualized and appear to be just one single logical disk
– The RAID controller acts as an extended specialized HBA (Host Bus Adapter)
– Still DAS (Directly Attached Storage)
2.3 RAID Controller
[Figure: RAID controller attached to the internal bus, connecting multiple disks via the peripheral bus; the array is represented as a single logical disk]
• Mirroring (or shadowing): Increases reliability by complete redundancy
• Idea: Mirror Disks are exact copies of original disk
– Not space efficient
• Read speed can be n times as fast, write speed does not increase
• Increases reliability. Assume
– Two disks with an MTBF of 11 years each
• One original disk, one mirror disk
• Assume disk failures are independent of each other (unrealistic)
– Disk replacement time of 10 hours
– ► MTBF of mirror system is >57,000 years!
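The mirror result uses the standard approximation that data is lost only if the second disk fails during the repair window of the first. A sketch with the slide’s assumptions (variable names are my own; 100,000 h ≈ 11 years):

```python
# Mean time to data loss of a 2-disk mirror, assuming independent failures:
# MTTDL ~ MTBF^2 / (2 * MTTR)  (standard approximation).
mtbf_h = 100_000  # per disk, ca. 11 years
mttr_h = 10       # disk replacement time

mttdl_h = mtbf_h**2 / (2 * mttr_h)
print(round(mttdl_h / 8760))  # -> 57078, i.e. > 57,000 years
```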
Silber 11.3
2.3 RAID Principles - Mirroring
• Striping: Improve performance by parallelism
• Idea: Distribute data among all disks for increased performance
• Bit Level Striping: Split the bits of each byte across the disks
– e.g. for 8 disks, write the i-th bit to disk i
– Number of disks needs to be a power of 2
– Each disk is involved in each access
• Access rate does not increase
• Read and write transfer speed linearly increases with each disk
• Simultaneous accesses not possible
– Good for speeding up few, sequential and large accesses
2.3 RAID Principles - Striping
• Block Level Striping: Distribute blocks among the disks
– Only one disk is involved reading a specific block
• Read and write speed of a single block not increased
• Other disks still free to read/write other blocks
• Read and write speed of multiple accesses increase
– Good for large number of parallel accesses
Silber 11.3
2.3 RAID Principles – Striping
• Error Correction Codes: Increase reliability with computed redundancy
• Hamming Codes
– Can detect and repair 1-bit errors within a set of n data bits by computing k parity bits
• n = 2^k − k − 1
• n=1, k=2; n=4, k=3; n=11, k=4; n=26, k=5; …
– Especially used for in-memory and tape error correction
• Not really used for hard drives anymore
– Not further elaborated in this lecture
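The relation n = 2^k − k − 1 between data bits and parity bits is easy to verify (function name is my own):

```python
# Number of data bits a Hamming code with k parity bits can protect.
def hamming_data_bits(k):
    return 2**k - k - 1

print([(k, hamming_data_bits(k)) for k in range(2, 6)])
# -> [(2, 1), (3, 4), (4, 11), (5, 26)]
```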
2.3 RAID Principles - Error Correction Codes
• Interleaved Parity (Reed-Solomon algorithm on the GF(2) Galois field)
– Can repair 1-bit errors (when the error position is known)
– Hard disks can detect read errors themselves, no need for complete Hamming codes
– Basic Idea:
• From n data pieces D1,…,Dn compute a parity piece Dp by combining the data using logical XOR (eXclusive OR)
– XOR is associative and commutative
– Important: A XOR B XOR B = A
• i.e. Dp = D1 XOR D2 XOR … XOR Dn
• Assume D2 was lost. It can be reconstructed by D2 = Dp XOR D1 XOR D3 XOR … XOR Dn
2.3 RAID Principles - Error Correction Codes
• Interleaved Parity. Example:
• A = 0101, B = 1100, C = 1011
• P = 0010 = A XOR B XOR C
• C is lost.
– P = A XOR B XOR C
– C = P XOR A XOR B
– C = A XOR B XOR C XOR A XOR B
– C = A XOR A XOR B XOR B XOR C
– C = 0 XOR C
– C = 1011
2.3 RAID Principles – Interleaved Parity

    0101 (A)
XOR 1100 (B)
XOR 1011 (C)
 =  0010 (P)

    0010 (P)
XOR 0101 (A)
XOR 1100 (B)
 =  1011 (C)
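The parity example above, executed with Python’s bitwise XOR (variable names are my own):

```python
# Interleaved parity with the slide's values: A, B, C are 4-bit data pieces.
A, B, C = 0b0101, 0b1100, 0b1011

P = A ^ B ^ C            # parity piece Dp
assert P == 0b0010       # matches the slide

C_rec = P ^ A ^ B        # reconstruct the lost piece C
assert C_rec == C        # A^A and B^B cancel out, leaving C
print(format(C_rec, "04b"))  # -> 1011
```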
• The 3 RAID principles can be combined in multiple ways
– Not every combination is useful
• This led to the definition of 7 core RAID levels
– RAID 0 through RAID 6
– The most dominant levels are RAID 0, RAID 1, RAID 1+0, RAID 5
• In following examples, assume
– An MTBF of 100,000 hours (11.42 years) per disk
– A Mean Time to Repair (MTTR) of 6 hours
– Failure rate is constant and failures between disks are independent
– MTBFraid is the mean time to data loss within the RAID if each failing disk is replaced within the MTTR
– D is the number of drives in the RAID set
– C = 200 GB is the capacity of one disk, Craid the capacity of the whole RAID
2.3 RAID in practical applications
• Mean Time to Repair (MTTR)
– MTTR = TimeToNotice + RebuildTime
– Assume a time to notice of 0.5 hours
– Rebuild time is the time for completely writing back lost data
• Assume disk capacity of 200GB
• Write-back speed of 10 MB/sec
– Consisting of reading the remaining disks
– Computing parity / reconstructing the data
• Rebuild time around 5.5 hours
– During rebuild, a RAID is especially vulnerable
– MTTR = 6 hours
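The MTTR assumption above can be recomputed from its parts (variable names are my own):

```python
# MTTR = time to notice + rebuild time, with the slide's assumptions.
notice_h = 0.5
capacity_mb = 200_000   # 200 GB disk
write_back_mb_s = 10    # rebuild write-back speed

rebuild_h = capacity_mb / write_back_mb_s / 3600
mttr_h = notice_h + rebuild_h
print(round(rebuild_h, 1), round(mttr_h))  # -> 5.6 6
```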
2.3 RAID in practical applications
• File A (A1-Ax), File B (B1-Bx), File C (C1-Cx)
• Raid 0
– Block-Level-Striping only
– Increased parallel access and transfer speeds, reduced reliability
– All disks contain data (0% overhead)
– Works with any number of disks
– MTBFraid = MTBFdisk / D
– 4 disks:
• MTBFraid= 2.86 years
• Craid = 800 GB (0 GB wasted (0%))
– Common size: 2 disks
• MTBFraid= 5.72 years
• Craid = 400 GB (0 GB wasted (0%))
2.3 RAID Levels
• Raid 1
– Mirroring only
– Increased reliability, increased read transfer speed, low space efficiency
– MTBFraid = MTBFdisk^D / (D! · MTTR^(D−1))
– 4 disks:
• MTBFraid= 2.2 trillion years
• Craid = 200 GB (600 GB wasted (75%))
• Age of universe may be around 15 billion years…
– Common size: 2 disks
• MTBFraid= 95,130 years
• Craid = 200 GB (200 GB wasted (50%))
2.3 RAID Levels
• RAID 2
– Not used anymore in practice
• was used in old mainframes
– Bit-Level-Striping
– Use Hamming Codes
• Usually Hamming Code(7,4) – 4 data bits, 3 parity bits
• Reliable 1-Bit error recovery (i.e. one disk may fail)
– 3 redundant disks per 4 data disks (75% overhead)
• Ratio better for larger number of disks
– MTBFraid = MTBFdisk² / (D · (D−1) · MTTR)
– 7 disks (4 disks does not really make sense here – not comparable to the other values)
• MTBFraid= 4,530 years
• Craid= 800 GB (600 GB wasted (43%))
2.3 RAID Levels
• RAID 3
– Interleaved Parity
– Byte-Level Striping
– Dedicated parity disk
• Bottleneck! Every write operation needs to update the parity disk.
• No parallel writes
– 1 redundant disk per n data disks
• Overhead decreases with number of disks while reliability decreases
• 25% overhead for 4 data disks
– MTBFraid = MTBFdisk² / (D · (D−1) · MTTR)
– 4 disks
• MTBFraid= 15,854 years
• Craid= 600 GB (200 GB wasted (25%))
2.3 RAID Levels
• RAID 4
– Block-Level Striping
– As RAID 3 otherwise
– 4 disks (common size)
• MTBFraid = 15,854 years
• Craid = 600 GB (200 GB wasted (25%))
– 5 disks (also common size)
• MTBFraid = 9,513 years
• Craid = 800 GB (200 GB wasted (20%))
2.3 RAID Levels
• RAID 5
– Parity is distributed among the hard disks
• May allow for parallel block writes
– As RAID 4 otherwise
– Bottleneck when writing many files smaller than a block
• Whole parity block has to be read and re-written for each minor write
– Can recover from a single disk failure
– MTBFraid and Craid as for RAID 3 & 4
2.3 RAID Levels
• RAID 6
– Two independent parity blocks distributed among the disks
• May be implemented by parity on orthogonal data or by using Reed-Solomon on GF(2⁸)
– As RAID 5 otherwise
– 2 redundant disks per n data disks
• Can recover from a double disk failure
• No vulnerability during single failure rebuild
• Very suitable for larger arrays
• Write overhead due to more complicated parity computation
– MTBFraid = MTBFdisk³ / (D · (D−1) · (D−2) · MTTR²)
– 4 disks
• MTBFraid= 132 million years
• Craid= 400 GB (400 GB wasted (50%))
– 8 disks (common)
• MTBFraid= 9,437 years (~RAID 5 w. D=5)
• Craid= 1,200 GB (400 GB wasted (25%))
2.3 RAID Levels
• Additionally, there are hybrid levels combining the core levels
– RAID 0+1, RAID 1+0, RAID 5+0, RAID 5+1, RAID 6+6, …
• Raid 1+0
– Mirrored sets nested in a striped set
• RAID 0 on sets of RAID 1 sets
– Very high read and write transfer speeds, increased reliability, low space efficiency, limited maximum size
– Most performant RAID combination
– D1 = drives per RAID 1 set, D0 = number of RAID 1 sets
– MTBFraid = MTBFdisk^D1 / (D1! · MTTR^(D1−1)) / D0
– 4 disks: D1 = 2, D0 = 2
• MTBFraid = 47,565 years
• Craid = 400 GB (400 GB wasted (50%))
– 6 disks: D1 = 2, D0 = 3
• MTBFraid= 31,706 years
• Craid= 600 GB (600 GB wasted (50%))
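The MTBF figures quoted for the individual levels all follow from the formulas above. A sketch evaluating them under the stated assumptions (function names are my own):

```python
# Mean time to data loss for the core RAID levels, with the slide's
# assumptions: MTBF 100,000 h per disk, MTTR 6 h, D drives in the set.
from math import factorial

MTBF, MTTR, YEAR_H = 100_000, 6, 8760

def raid0(d):        return MTBF / d
def raid1(d):        return MTBF**d / (factorial(d) * MTTR**(d - 1))
def raid5(d):        return MTBF**2 / (d * (d - 1) * MTTR)
def raid6(d):        return MTBF**3 / (d * (d - 1) * (d - 2) * MTTR**2)
def raid10(d1, d0):  return raid1(d1) / d0

for name, hours in [("RAID 0 (4 disks)", raid0(4)),
                    ("RAID 1 (2 disks)", raid1(2)),
                    ("RAID 5 (4 disks)", raid5(4)),
                    ("RAID 1+0 (2x2 disks)", raid10(2, 2))]:
    print(f"{name}: {hours / YEAR_H:,.1f} years")
```

The printed values match the slides: roughly 2.9 years for RAID 0, 95,129 years for RAID 1, 15,855 years for RAID 5, and 47,565 years for RAID 1+0 (small differences come from rounding of hours per year).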
2.3 Practical Use of RAIDs
• RAID controllers directly connect storage to the system bus
– Storage is available to only one system / server / application
• Number of disks is limited
– Consumer grade RAID: 2–4 disks
– Enterprise grade RAID: 8–24+ disks
• Solutions
– NAS (Network Attached Storage): Provide abstracted file systems via network (software solution)
– SAN (Storage Area Network): Virtualized logical storage within a specialized network on block level (hardware solution)
2.4 Beyond RAID
• Before discussing NAS, we need file systems
• A file system is software abstracting file operations on a logical storage device
– Files are collections of binary data
• Creating, reading, writing, deleting, finding, organizing
– How does a file access translate into operations on the logical storage device?
• e.g. which blocks have to be read/written?
• Bridge between application software and (abstracted) hardware
2.4 File Systems vs. Raw Devices
[Figure: Application Software → File System → Logical Storage]
• Raw device access allows applications to bypass the OS and the file system
• The application may directly tune aspects of physical storage
• May lead to very efficient implementations
– Used e.g. for high-performance databases, system virtualization, etc.
2.4 File Systems vs. Raw Devices
[Figure: Application Software → Logical Storage (file system bypassed)]
• Idea: Provide a remote file system using already available network infrastructure
– NAS: Network Attached Storage
– Use specialized network protocols (e.g. CIFS, NFS, FTP, etc)
– Easiest case: File Server (e.g. Linux+Samba)
• Advantages:
– Easy to set up, easy to use, cheap infrastructure
– Allows sharing of storage among several systems
– Abstracts on file system level (easy for most applications)
• Disadvantages:
– Inefficient and slow
• Large protocol and processing overhead
– Abstracts on file system level (not suitable for special purposes like raw devices or storage virtualization)
2.4 NAS – Network Attached Storage
[Figure: Application Software → File System → Logical Storage → network → NAS Server]
• SANs offer specialized high-speed networks for storage devices
– Usually uses local FibreChannel networks
– Remote location may be connected via Ethernet or IP-WAN (Internet)
– Network uses specialized storage protocols
• iFCP (SCSI on FibreChannel)
• iSCSI (SCSI on TCP/IP)
• HyperSCSI (SCSI on raw ethernet)
• SANs provide raw block level access to logical storage devices
– Logical disks of any size can be offered by the SAN
– To a client system using a logical disk, it appears like a local disk or RAID
– The client system has full control over file systems on logical disks
2.4 SAN – Storage Area Network
[Figure: Application Software → File System → Logical Storage → SAN]
2.4 SAN – Storage Area Network
[Figure: SAN topology – servers attach via SAN HBAs to SAN switches on a SAN bus (iFCP); disks attach via a SAN/RAID HBA and a peripheral bus (SCSI, SAS, etc.); a NAS head exports a NAS protocol (CIFS) to an Ethernet network; remote sites connect via a WAN-SAN bus (HyperSCSI)]
• Advantages:
– Very efficient
• Highly optimized local network infrastructure
• Optimized protocols with low overhead
– Very flexible (any number of systems may use any number of disks at any location)
– Helps for disaster protection
• SAN can transparently span to even remote locations
– May also employ NAS heads for NAS-like behavior
• Disadvantages
– Expensive
2.4 SAN – Storage Area Network
• How much storage and bandwidth is needed by YouTube, and how might it be organized?
• All top secret, but there are educated guesses and some (older) leaked data…
2.5 Case Study
• A Google video search restricted to YouTube.com reveals 187,397,091 indexed videos
– 3.35 min/video: Based on the TOP-100 all-time videos
– 2.3 MB/min: Based on a sample (very low variation)
– 8.3 MB/video
• Guessed size of all videos on YouTube is 1.56 PB
– Assume 160 GB/disk with MTBF=16.6 years
• Based on the Google reliability study
– 9,800 hard disks are needed to store all videos just once without any redundancy
• MTBF = 14 hours ...
– Using 1,960 5+1 RAID 5’s, 11,760 disks are needed
• MTBF = 6.84 years - not too great…
• Still, each video only available once
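The corpus size and raw disk count above can be reproduced from the slide’s estimates (variable names are my own):

```python
# Back-of-the-envelope: total video volume and disk count (slide estimates).
videos = 187_397_091
mb_per_video = 8.3   # slide's per-video size estimate
disk_gb = 160

total_pb = videos * mb_per_video / 1000**3        # MB -> PB (SI units)
disks = videos * mb_per_video / 1000 / disk_gb    # without any redundancy
print(round(total_pb, 2), round(disks))
# -> ~1.56 PB on ~9,700 disks (the slide rounds to 9,800)
```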
2.5 Case Study
– Using 196 (9+1)(5+1) RAID 55 arrays, 13,066 disks are needed
• RAID 5 arrays with 6 disks each; 10 of these arrays form an overlaying RAID 5
• MTBF = 14 million years (finally, data is “safe” at one location)
• Still, each video only available once
– No global disaster safety
– No global load balancing
• How might this look?
2.5 Case Study
• YouTube grows fast
– Currently, around 200,000 new videos per day (1.66 TB/day)
• Larger number of disks have to be added per month
– Around 440 disks/month for new videos
– Around 80 disks/month to replace broken ones
• Growing exponentially
2.5 Case Study
• It gets even worse…
• YouTube serves 200 million videos per day (as of mid 2007)
– 30 PB of data EVERY MONTH
– 154 Gbps (read: 154 gigabit per second)
– Results in an average of 586,000 concurrent streams
– Popular videos see around 250,000 views per day
• 600 concurrent streams per FILE (25 MB/sec)
– This bandwidth is insanely expensive: 600,000 USD/month
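The 154 Gbps figure follows directly from the daily serving volume (variable names are my own):

```python
# Daily serving volume -> sustained bandwidth (slide estimates).
videos_per_day = 200_000_000
mb_per_video = 8.3

bytes_per_s = videos_per_day * mb_per_video * 1e6 / 86_400
gbps = bytes_per_s * 8 / 1e9
print(round(gbps))  # -> 154
```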
2.5 Case Study
• This massive amount of data cannot be hosted and served from a single location…
• Data needs to be distributed and globally load balanced
2.5 Case Study
• YouTube does not host and provide videos themselves
– They hire Limelight Networks for that
• Limelight Networks
– Large CDN (Content Delivery Network) provider
– Owns 25 POPs (Points Of Presence) connected with its own backbone
• Each POP with up to thousands of storage servers
• Can serve up to 1 Tbps!
2.5 Case Study
• Limelight automatically distributes content among all POPs
– Data is massively redundant
– More popular data replicated more, less popular replicated less
– Each file is served from the closest location with bandwidth to spare
• Global load balancing
– Data is disaster proof!
• What to learn?
• Large scale data storage and serving
– Very resource intensive
– Very expensive