Wolf-Tilo Balke
Benjamin Köhncke
Institut für Informationssysteme
Technische Universität Braunschweig http://www.ifis.cs.tu-bs.de
Relational Database Systems 2
2. Physical Data Storage
2.1 Introduction
2.2 Hard Disks
2.3 RAIDs
2.4 SANs and NAS
2.5 Case Study
Relational Database Systems 2 – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 2
2 Physical Data Storage
• DBMS needs to retrieve, update and process persistently stored data
– Storage consideration is an important factor in planning a database system (physical layer)
– Remember: The data has to be securely stored, but access to the data should be declarative!
2.1 Physical Storage Introduction
• Data is stored on storage media. Media differ highly in terms of
– Random Access Speed
– Random/Sequential Read/Write Speed
– Capacity
– Cost per Capacity
EN 13.1
2.1 Physical Storage Introduction
• Capacity: Quantifies the amount of data which can be stored
– Base Units: 1 Bit, 1 Byte = 2³ Bit = 8 Bit
– Capacity units according to IEC, IEEE, NIST, etc:
• Usually used for file sizes and primary storage (for higher
degree of confusion, sometimes used with SI abbreviations…)
• 1 KiB = 1024¹ Byte; 1 MiB = 1024² Byte; 1 GiB = 1024³ Byte; …
– Capacity units according to SI:
• Usually used for advertising secondary/tertiary storage
• 1 KB = 1000¹ Byte ≈ 0.976 KiB; 1 MB = 1000² Byte ≈ 0.954 MiB;
1 GB = 1000³ Byte ≈ 0.931 GiB; …
– Especially used by the networking community:
• 1 Kb = 1000¹ Bit = 0.125 KB ≈ 0.122 KiB; 1 Mb = 1000² Bit = 0.125 MB ≈ 0.119 MiB
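The IEC-vs-SI confusion above is easy to check numerically. A minimal sketch (constant and function names are my own):

```python
# Binary (IEC) vs. decimal (SI) capacity units, as defined on the slide.
KIB, MIB, GIB = 1024**1, 1024**2, 1024**3  # bytes per KiB, MiB, GiB
KB, MB, GB = 1000**1, 1000**2, 1000**3     # bytes per KB, MB, GB

def in_iec(n_bytes, iec_unit):
    """Express a byte count in an IEC unit (KiB/MiB/GiB)."""
    return n_bytes / iec_unit

print(round(in_iec(1 * KB, KIB), 3))  # 1 KB -> 0.977 KiB
print(round(in_iec(1 * MB, MIB), 3))  # 1 MB -> 0.954 MiB
print(round(in_iec(1 * GB, GIB), 3))  # 1 GB -> 0.931 GiB
```

This is why an advertised "1 TB" disk shows up as roughly 0.909 TiB in the operating system.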
2.1 Relevant Media Characteristics
2.1 A Kilo-Joke
(Comic from http://xkcd.com/)
• Random Access Time: Average time to access a random piece of data at a known media position
– Usually measured in ms or ns
– Within some media, access time can vary depending on position (e.g. hard disks)
• Transfer Rate: Average amount of consecutive data which can be transferred per time unit
– Usually measured in KB/sec, MB/sec, GB/sec,…
– Sometimes also in Kb/sec, Mb/sec, Gb/sec
2.1 Characteristic Parameters
• Volatile: Memory needs constant power to keep data
– Dynamic: Dynamic volatile memory needs to be “refreshed” regularly to keep data
– Static: No refresh necessary
• Access Modes
– Random Access: Any piece of data can be accessed in approximately the same time
– Sequential Access: Data can only be accessed in sequential order
• Write Mode
– Mutable Storage: Can be read and written arbitrarily
– Write Once Read Many (WORM)
• Interesting for legal issues, e.g. the Sarbanes-Oxley Act (2002)
2.1 Other characteristics
• Online media
– “always on”
– Each single piece of data can be accessed fast
– e.g. hard drives, main memory
• Nearline media
– Compromise between online and offline
– Offline media can automatically be put online
– e.g. juke boxes, robot libraries
• Offline media (disconnected media)
– Not under direct control of the processing unit
– Have to be connected manually
– e.g. box of backup tapes in basement
2.1 Online, Nearline, Offline
• Media characteristics result in a storage hierarchy
• DBMS optimize data distribution among the storage levels
– Primary Storage: Fast, limited capacity, high price, usually volatile electronic storage
• Frequently used data / current work data
– Secondary Storage: Slower, large capacity, lower price
• Main stored data
– Tertiary Storage: Even slower, huge capacity, even lower price, usually offline
• Backup and long term storage of not frequently used data
2.1 The Storage Hierarchy
[Figure: storage pyramid, cost and speed decrease from top to bottom]
– Primary: Cache, RAM (~100 ns)
– Secondary: Flash, Magnetic Disks (~10 ms)
– Tertiary: Optical Disks, Tape (> 1 s)
2.1 Storage Media – Examples (last updated March 2008)

Type | Media                                    | Size    | Random Acc. | Transfer Speed | Characteristics    | Price      | Price/GB
Pri  | L1 Processor Cache (Intel QX9000)        | 32 KiB  | 0.0008 ms   | 6200 MB/sec    | Vol, Stat, RA, OL  | –          | –
Pri  | DDR3-RAM (Corsair 1600C7DHX)             | 2 GiB   | 0.004 ms    | 8000 MB/sec    | Vol, Dyn, RA, OL   | €200       | €93
Sec  | Harddrive SSD (MTRON SSD MOBI64)         | 64 GB   | 0.1 ms      | 95 MB/sec      | Stat, RA, OL       | €1050      | €16
Sec  | Harddrive Magnetic (Seagate ST3100034AS) | 1000 GB | 12 ms       | 80 MB/sec      | Stat, RA, OL       | €200       | €0.20
Ter  | DVD+R (Sony DRU-810A + Fuji disks)       | 4.7 GB  | 98 ms       | 11 MB/sec      | Stat, RA, OF, WORM | €0.60/disk | €0.12
Ter  | LTO Streamer (Freecom LTO-920i)          | 800 GB  | 58 sec      | 120 MB/sec     | Stat, SA, OF       | €80/tape   | €0.10

Pri=Primary, Sec=Secondary, Ter=Tertiary
Vol=Volatile, Stat=Static, Dyn=Dynamic, RA=Random Access, SA=Sequential Access, OL=Online, OF=Offline, WORM=Write Once Read Many
• Hard drives are currently the standard for large, cheap and persistent storage
– Usually used as the main storage media for most data in a DB
• DBMS need to be optimized for efficient disk storage and access
– Data access needs to be as fast as possible
– Often used data should be accessible with highest speed, rarely needed data may take longer
– Different data items needed for certain reoccurring tasks should also be stored/accessed together
2.2 Magnetic Disk Storage – HDs
• Directional magnetization of a ferromagnetic material
• Realized on hard disk platters
– Base platter made of non-magnetic aluminum or glass substrate
– Magnetic grains worked into the base platter to form magnetic regions
• Each region represents 1 Bit
– Read head can detect magnetization direction of each region
– Write head may change direction
2.2 HD – How does it work?
• Giant MagnetoResistance Effect (GMR)
– Discovered 1988 simultaneously by Peter Grünberg and Albert Fert
• Both honored with the 2007 Nobel Prize in Physics
– Allows the construction of efficient read heads:
• The electric resistance of a stack of alternating ferromagnetic and non-magnetic layers changes strongly (“giantly”) with changing magnetic field directions
2.2 HD – Notable Technology Advances
• Perpendicular Recording (used since 2005)
– Longitudinal recording is limited to ~200 Gb/inch² due to the superparamagnetic effect
• Thermal energy may spontaneously change the magnetic direction
– Perpendicular recording allows for up to 1000 Gb/inch²
– Very simplified: align the magnetic field orthogonal to the surface instead of parallel
• Magnetic regions can be smaller
Relational Database Systems 2 – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 16
2.2 HD – Notable Technology Advances
• Usage of magnetic grains instead of continuous magnetic material
– Between magnetic direction transitions, Néel spikes are formed
• Areas of uncertain magnetic direction
– Néel spikes are larger for continuous materials
– Magnetic regions can be smaller as the transition width can be reduced
2.2 HD – Notable Technology Advances
• A hard disk is made up of multiple double-sided platters
– Platter sides are called surfaces
– Platters are fixed on the main spindle and rotate at equal and constant speed (common: 5400 rpm / 7200 rpm)
– Each surface has its own read and write head
– Heads are attached to arms
• Arms can position the heads along the surface
• Heads cannot move independently
– Heads have no contact with the surface and hover on top of an air bearing
EN 13.2
2.2 HD – Basic Architecture
• Each surface is divided into circular tracks
– Some disks may use spirals
• All tracks of all surfaces with the same diameter are called a cylinder
– Data within the same cylinder can be accessed very efficiently
EN 13.2
2.2 HD – Basic Architecture
• Each track is subdivided into sectors of equal capacity
a) Fixed angle sector subdivision
• Same number of sectors per track, changing density, constant speed
b) Fixed data density
• Outer tracks have more sectors than inner tracks
• Transfer speed higher on outer tracks
• Adjacent sectors can be grouped into clusters
EN 13.2
2.2 HD – Basic Architecture
• Hard drives are not completely reliable!
– Drives do fail
– Means for physical failure recovery are necessary
• Backups
• Redundancy
• Hard drives age and wear down. Wear increases significantly with:
– Contact cycles (head parking)
– Spindle start/stops
– Power-on hours
– Operation outside ideal environment
• Temperature too low/high
• Unstable voltage
2.2 HD - Reliability
• Reliability measures are statistical values assuming certain usage patterns
• Desktop usage (all per year): 2,400 hours, 10,000 motor start/stops, 25°C temperature
• Server usage (all per year): 8,760 hours, 250 motor start/stops, 40°C temperature
– Non-Recoverable read errors: A sector on the surface cannot be read anymore – the data is lost
• Desktop disk: 1 per 10¹⁴ read bits, Server: 1 per 10¹⁵ read bits
• Disk can detect this!
– Maximum contact cycles: Maximum number of allowed head contacts (parking)
• Usually around 50 000 cycles
2.2 HD - Reliability
– Mean Time Between Failure (MTBF): Statistically expected time after which 50% of a large disk population has failed
• Drive manufacturers usually use optimistic simulations to estimate the MTBF
• Desktop: 0.7 million hours (80 years), Server: 1.2 million hours (137 years) – manufacturers’ values
– Annualized Failure Rate (AFR): Probability of a failure per year based on MTBF
• AFR = OperatingHoursPerYear / MTBFhours
• Desktop: 0.34%, Server: 0.73%
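The slide’s AFR numbers can be reproduced directly from its formula. A small sketch (helper name is my own):

```python
# AFR = OperatingHoursPerYear / MTBF_hours, as defined on the slide.
def afr(operating_hours_per_year, mtbf_hours):
    return operating_hours_per_year / mtbf_hours

desktop = afr(2_400, 700_000)    # desktop profile, manufacturer MTBF
server  = afr(8_760, 1_200_000)  # server profile, manufacturer MTBF
print(f"desktop: {desktop:.2%}, server: {server:.2%}")
# -> desktop: 0.34%, server: 0.73%
```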
2.2 HD - Reliability
• Failure rate during a hard disks lifespan is not constant
• Can be better modeled by the “bathtub curve”
having 3 components
– Infant Mortality Rate – Wear Out Failures – Random Failures
2.2 HD - Reliability
• Report by Google
– 100,000 consumer grade disks (80–400 GB, ATA interface, 5400–7200 RPM)
• Results (among others)
– Drives fail often!
– There is an infant mortality effect
– High usage increases infant mortality, but not later failure rates
– Observed AFR is around 7% and MTBF 16.6 years!
Reference: E. Pinheiro, W.-D. Weber, L. A. Barroso: Failure Trends in a Large Disk Drive Population. 5th USENIX Conference on File and Storage Technologies (FAST), 2007.
2.2 Real World Failure Rates
Careful: 2+ year results are biased. See reference.
• Seagate ST3100034AS (Desktop Harddrive, 2008)
– Manufacturer’s specifications
2.2 HD - Example Specs
Specification Value
Capacity 1 TB
Platters 4
Heads 8
Cylinders 16,383
Sectors per track 63
Bytes per sector 512
Spindle Speed 7200 RPM
MTBF 80 years
AFR 0.34 %
• Assume a storage need of 10 TB. Only the following HDs are available:
– Capacity: 100 GB capacity each
– MTBF: 100,000 hours each (ca. 11 years)
• Consider using 100 of these disks independently (w/o RAID)
– Total Storage: 10,000 GB = 10 TB
– MTBF: 1,000 hours (ca. 42 days) – THIS IS BAD!
• More sophisticated ways of using multiple disks are needed
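The 42-day figure follows because, with constant and independent failure rates, the array’s failure rate is the sum of the disks’ rates. A quick check (variable names are my own):

```python
# The array fails as soon as ANY disk fails, so with constant, independent
# failure rates the MTBF divides by the number of disks.
mtbf_disk_h = 100_000  # per disk, ca. 11 years
disks = 100

mtbf_array_h = mtbf_disk_h / disks
print(mtbf_array_h, round(mtbf_array_h / 24))  # -> 1000.0 hours, ~42 days
```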
2.2 Reliability – Considerations
• The disk controller organizes low-level access to the disk
– e.g. head positioning, error checking, signal processing
– Usually integrated into the disk
– Provides a unified and abstracted interface to access the disk (e.g. LBA)
– Connects the disk to a peripheral bus (e.g. IDE, SCSI, FibreChannel, SAS)
• The host bus adapter (HBA) bridges between the peripheral bus and the system’s internal bus (like PCIe, PCI)
– The internal bus is usually integrated into the system’s main board
– Often confused with the disk controller
• DAS (Directly Attached Storage)
2.2 HD – Controller
[Figure: mechanics and disk controller inside the disk, connected via the peripheral bus to the host bus adapter, which attaches to the internal bus of the inner system / mainboard]
• Sectors can be logically grouped to blocks by the operating system
– Sectors in a block do not necessarily need to be adjacent
– e.g. NTFS defaults to 4 KiB per block
• 8 sectors on a modern disk
• The hardware address of a block is a combination of
– cylinder number, surface number, and block number within the track
– The controller maps the hardware address to a logical block address (LBA)
2.2 HD – Controller
• Disk controller transfers content of whole blocks to buffer
– Buffer resides in a primary storage and can be accessed efficiently
– Time needed to transfer a random block (4 KiB/block on ST3100034AS): < 10 msec
• Seek Time: Time needed to position head to correct cylinder (<8 msec)
• Latency (Rotational Delay): Time until the correct block arrives below the head (<0.14 msec)
• Block Transfer Time: Time to read all sectors of block (<0.01 msec)
– Bulk Transfer Time for n adjacent blocks (< 20 msec for n = 10)
• Seek Time + Rotational Delay + n · Block Transfer Time
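The decomposition above can be sketched as follows. The component values here are illustrative, not the slide’s exact figures; the rotational delay is derived from the fact that the average wait at 7200 RPM is half a revolution:

```python
# Access time for n adjacent blocks: pay seek + rotational delay once,
# then transfer the blocks back to back (illustrative values).
seek_ms = 8.0                    # position head on the right cylinder
latency_ms = 60_000 / 7200 / 2   # half a revolution at 7200 RPM, ~4.17 ms
block_transfer_ms = 0.01         # read the sectors of one block

def access_ms(n_blocks):
    return seek_ms + latency_ms + n_blocks * block_transfer_ms

print(round(access_ms(1), 2), round(access_ms(10), 2))  # -> 12.18 12.27
```

Note how little the per-block transfer contributes: reading 10 adjacent blocks costs barely more than reading one, which is exactly why bulk transfer is preferred over random block transfer.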
2.2 HD – Controller
• Locating data on a disk is a major bottleneck
– Try operating on data already in buffer
– Aim for bulk transfer, avoid random block transfer
2.2 HD – Controller
• A single HD is often not sufficient
– Limited capacity
– Limited speed
– Limited reliability
• Idea: Combine multiple HDs into a RAID (Redundant Array of Independent Disks)
– RAID Array treats multiple hardware disks as a single logical disk
• More HDs for increased capacity
• Parallel access for increased speed
• Controlled redundancy for increased reliability
Silber 11.3
2.3 RAID
• The RAID controller connects to multiple hard disks
– Disks are virtualized and appear to be just one single logical disk
– The RAID controller acts as an extended specialized HBA (Host Bus Adapter)
– Still DAS (Directly Attached Storage)
2.3 RAID Controller
[Figure: RAID controller attached to the internal bus, connecting multiple disks via the peripheral bus; the array is represented as a single logical disk]
• Mirroring (or shadowing): Increases reliability by complete redundancy
• Idea: Mirror Disks are exact copies of original disk
– Not space efficient
• Read speed can be n times as fast, write speed does not increase
• Increases reliability. Assume
– Two disks with an MTBF of 11 years each
• One original disk, one mirror disk
• Assume disk failures are independent of each other (unrealistic)
– Disk replacement time of 10 hours
– ► MTBF of mirror system is >57,000 years!
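The mirror result uses the standard approximation that data is lost only if the second disk fails during the repair window of the first. A sketch with the slide’s assumptions (variable names are my own; 100,000 h ≈ 11 years):

```python
# Mean time to data loss of a 2-disk mirror, assuming independent failures:
# MTTDL ~ MTBF^2 / (2 * MTTR)  (standard approximation).
mtbf_h = 100_000  # per disk, ca. 11 years
mttr_h = 10       # disk replacement time

mttdl_h = mtbf_h**2 / (2 * mttr_h)
print(round(mttdl_h / 8760))  # -> 57078, i.e. > 57,000 years
```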
Silber 11.3
2.3 RAID Principles - Mirroring
• Striping: Improve performance by parallelism
• Idea: Distribute data among all disks for increased performance
• Bit Level Striping: Split the bits of each byte across the disks
– e.g. for 8 disks, write the i-th bit to disk i
– Number of disks needs to be a power of 2
– Each disk is involved in each access
• Access rate does not increase
• Read and write transfer speed linearly increases with each disk
• Simultaneous accesses not possible
– Good for speeding up few, sequential and large accesses
2.3 RAID Principles - Striping
• Block Level Striping: Distribute blocks among the disks
– Only one disk is involved reading a specific block
• Read and write speed of a single block not increased
• Other disks still free to read/write other blocks
• Read and write speed of multiple accesses increase
– Good for large number of parallel accesses
Silber 11.3
2.3 RAID Principles – Striping
• Error Correction Codes: Increase reliability with computed redundancy
• Hamming Codes
– Can detect and repair 1-bit errors within a set of n data bits by computing k parity bits
• n = 2^k − k − 1
• n=1, k=2; n=4, k=3; n=11, k=4; n=26, k=5; …
– Especially used for in-memory and tape error correction
• Not really used for hard drives anymore
– Not further elaborated in this lecture
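The relation n = 2^k − k − 1 between data bits and parity bits is easy to verify (function name is my own):

```python
# Number of data bits a Hamming code with k parity bits can protect.
def hamming_data_bits(k):
    return 2**k - k - 1

print([(k, hamming_data_bits(k)) for k in range(2, 6)])
# -> [(2, 1), (3, 4), (4, 11), (5, 26)]
```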
2.3 RAID Principles - Error Correction Codes
• Interleaved Parity (Reed-Solomon algorithm on the GF(2) Galois field)
– Can repair 1-bit errors (when the error position is known)
– Hard disks can detect read errors themselves, no need for complete Hamming codes
– Basic Idea:
• From n data pieces D1,…,Dn compute a parity piece Dp by combining the data using logical XOR (eXclusive OR)
– XOR is associative and commutative
– Important: A XOR B XOR B = A
• i.e. Dp = D1 XOR D2 XOR … XOR Dn
• Assume D2 was lost. It can be reconstructed by D2 = Dp XOR D1 XOR D3 XOR … XOR Dn
2.3 RAID Principles - Error Correction Codes
• Interleaved Parity. Example:
• A = 0101, B = 1100, C = 1011
• P = 0010 = A XOR B XOR C
• C is lost.
– P = A XOR B XOR C
– C = P XOR A XOR B
– C = A XOR B XOR C XOR A XOR B
– C = A XOR A XOR B XOR B XOR C
– C = 0 XOR C
– C = 1011
2.3 RAID Principles – Interleaved Parity

    0101 (A)
XOR 1100 (B)
XOR 1011 (C)
 =  0010 (P)

    0010 (P)
XOR 0101 (A)
XOR 1100 (B)
 =  1011 (C)
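The parity example above, executed with Python’s bitwise XOR (variable names are my own):

```python
# Interleaved parity with the slide's values: A, B, C are 4-bit data pieces.
A, B, C = 0b0101, 0b1100, 0b1011

P = A ^ B ^ C            # parity piece Dp
assert P == 0b0010       # matches the slide

C_rec = P ^ A ^ B        # reconstruct the lost piece C
assert C_rec == C        # A^A and B^B cancel out, leaving C
print(format(C_rec, "04b"))  # -> 1011
```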
• The 3 RAID principles can be combined in multiple ways
– Not every combination is useful
• This led to the definition of 7 core RAID levels
– RAID 0 through RAID 6
– The most dominant levels are RAID 0, RAID 1, RAID 1+0, RAID 5
• In following examples, assume
– An MTBF of 100,000 hours (11.42 years) per disk
– A Mean Time to Repair (MTTR) of 6 hours
– Failure rate is constant and failures between disks are independent
– MTBFraid is the mean time to data loss within the RAID if each failing disk is replaced within the MTTR
– D is the number of drives in the RAID set
– C = 200 GB is the capacity of one disk, Craid the capacity of the whole RAID
2.3 RAID in practical applications
• Mean Time to Repair (MTTR)
– MTTR = TimeToNotice + RebuildTime
– Assume a time to notice of 0.5 hours
– Rebuild time is the time for completely writing back lost data
• Assume disk capacity of 200GB
• Write-back speed of 10 MB/sec
– Consisting of reading the remaining disks
– Computing parity / reconstructing the data
• Rebuild time around 5.5 hours
– During rebuild, a RAID is especially vulnerable
– MTTR = 6 hours
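The MTTR assumption above can be recomputed from its parts (variable names are my own):

```python
# MTTR = time to notice + rebuild time, with the slide's assumptions.
notice_h = 0.5
capacity_mb = 200_000   # 200 GB disk
write_back_mb_s = 10    # rebuild write-back speed

rebuild_h = capacity_mb / write_back_mb_s / 3600
mttr_h = notice_h + rebuild_h
print(round(rebuild_h, 1), round(mttr_h))  # -> 5.6 6
```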
2.3 RAID in practical applications
• File A (A1-Ax), File B (B1-Bx), File C (C1-Cx)
• Raid 0
– Block-Level-Striping only
– Increased parallel access and transfer speeds, reduced reliability
– All disks contain data (0% overhead)
– Works with any number of disks
– MTBFraid = MTBFdisk / D
– 4 disks:
• MTBFraid= 2.86 years
• Craid = 800 GB (0 GB wasted (0%))
– Common size: 2 disks
• MTBFraid= 5.72 years
• Craid = 400 GB (0 GB wasted (0%))
2.3 RAID Levels
• Raid 1
– Mirroring only
– Increased reliability, increased read transfer speed, low space efficiency
– MTBFraid = MTBFdisk^D / (D! · MTTR^(D−1))
– 4 disks:
• MTBFraid= 2.2 trillion years
• Craid = 200 GB (600 GB wasted (75%))
• Age of universe may be around 15 billion years…
– Common size: 2 disks
• MTBFraid= 95,130 years
• Craid = 200 GB (200 GB wasted (50%))
2.3 RAID Levels
• RAID 2
– Not used anymore in practice
• was used in old mainframes
– Bit-Level-Striping
– Use Hamming Codes
• Usually Hamming Code(7,4) – 4 data bits, 3 parity bits
• Reliable 1-Bit error recovery (i.e. one disk may fail)
– 3 redundant disks per 4 data disks (75% overhead)
• Ratio better for larger number of disks
– MTBFraid = MTBFdisk² / (D · (D−1) · MTTR)
– 7 disks (4 disks does not really make sense here – not comparable to the other values)
• MTBFraid= 4,530 years
• Craid= 800 GB (600 GB wasted (43%))
2.3 RAID Levels
• RAID 3
– Interleaved Parity
– Byte-Level Striping
– Dedicated parity disk
• Bottleneck! Every write operation needs to update the parity disk.
• No parallel writes
– 1 redundant disk per n data disks
• Overhead decreases with number of disks while reliability decreases
• 25% overhead for 4 data disks
– MTBFraid = MTBFdisk² / (D · (D−1) · MTTR)
– 4 disks
• MTBFraid= 15,854 years
• Craid= 600 GB (200 GB wasted (25%))
2.3 RAID Levels
• RAID 4
– Block-Level Striping
– As RAID 3 otherwise
– 4 disks (common size)
• MTBFraid = 15,854 years
• Craid = 600 GB (200 GB wasted (25%))
– 5 disks (also common size)
• MTBFraid = 9,513 years
• Craid = 800 GB (200 GB wasted (20%))
2.3 RAID Levels
• RAID 5
– Parity is distributed among the hard disks
• May allow for parallel block writes
– As RAID 4 otherwise
– Bottleneck when writing many files smaller than a block
• Whole parity block has to be read and re-written for each minor write
– Can recover from a single disk failure
– MTBFraid and Craid as for RAID 3 & 4
2.3 RAID Levels
• RAID 6
– Two independent parity blocks distributed among the disks
• May be implemented by parity on orthogonal data or by using Reed-Solomon on GF(2⁸)
– As RAID 5 otherwise
– 2 redundant disks per n data disks
• Can recover from a double disk failure
• No vulnerability during single failure rebuild
• Very suitable for larger arrays
• Write overhead due to more complicated parity computation
– MTBFraid = MTBFdisk³ / (D · (D−1) · (D−2) · MTTR²)
– 4 disks
• MTBFraid= 132 million years
• Craid= 400 GB (400 GB wasted (50%))
– 8 disks (common)
• MTBFraid= 9,437 years (~RAID 5 w. D=5)
• Craid= 1,200 GB (400 GB wasted (25%))
2.3 RAID Levels
• Additionally, there are hybrid levels combining the core levels
– RAID 0+1, RAID 1+0, RAID 5+0, RAID 5+1, RAID 6+6, …
• Raid 1+0
– Mirrored sets nested in a striped set
• RAID 0 on sets of RAID 1 sets
– Very high read and write transfer speeds, increased reliability, low space efficiency, limited maximum size
– Most performant RAID combination
– D1 = drives per RAID 1 set, D0 = number of RAID 1 sets
– MTBFraid = MTBFdisk^D1 / (D1! · MTTR^(D1−1)) / D0
– 4 disks: D1 = 2, D0 = 2
• MTBFraid = 47,565 years
• Craid = 400 GB (400 GB wasted (50%))
– 6 disks: D1 = 2, D0 = 3
• MTBFraid= 31,706 years
• Craid= 600 GB (600 GB wasted (50%))
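The MTBF figures quoted for the individual levels all follow from the formulas above. A sketch evaluating them under the stated assumptions (function names are my own):

```python
# Mean time to data loss for the core RAID levels, with the slide's
# assumptions: MTBF 100,000 h per disk, MTTR 6 h, D drives in the set.
from math import factorial

MTBF, MTTR, YEAR_H = 100_000, 6, 8760

def raid0(d):        return MTBF / d
def raid1(d):        return MTBF**d / (factorial(d) * MTTR**(d - 1))
def raid5(d):        return MTBF**2 / (d * (d - 1) * MTTR)
def raid6(d):        return MTBF**3 / (d * (d - 1) * (d - 2) * MTTR**2)
def raid10(d1, d0):  return raid1(d1) / d0

for name, hours in [("RAID 0 (4 disks)", raid0(4)),
                    ("RAID 1 (2 disks)", raid1(2)),
                    ("RAID 5 (4 disks)", raid5(4)),
                    ("RAID 1+0 (2x2 disks)", raid10(2, 2))]:
    print(f"{name}: {hours / YEAR_H:,.1f} years")
```

The printed values match the slides: roughly 2.9 years for RAID 0, 95,129 years for RAID 1, 15,855 years for RAID 5, and 47,565 years for RAID 1+0 (small differences come from rounding of hours per year).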
2.3 Practical Use of RAIDs
• RAID controllers directly connect storage to the system bus
– Storage is available to only one system / server / application
• Number of disks is limited
– Consumer grade RAID: 2–4 disks
– Enterprise grade RAID: 8–24+ disks
• Solutions
– NAS (Network Attached Storage): Provide abstracted file systems via network (software solution)
– SAN (Storage Area Network): Virtualized logical storage within a specialized network on block level (hardware solution)
2.4 Beyond RAID
• Before discussing NAS, we need file systems
• A file system is software abstracting file operations on a logical storage device
– Files are collections of binary data
• Creating, reading, writing, deleting, finding, organizing
– How does a file access translate into operations on the logical storage device?
• e.g. which blocks have to be read/written?
• Bridge between application software and (abstracted) hardware
2.4 File Systems vs. Raw Devices
[Figure: Application Software → File System → Logical Storage]
• Raw device access allows applications to bypass the OS and the file system
• The application may directly tune aspects of physical storage
• May lead to very efficient implementations
– Used e.g. for high-performance databases, system virtualization, etc.
2.4 File Systems vs. Raw Devices
[Figure: Application Software → Logical Storage (file system bypassed)]
• Idea: Provide a remote file system using already available network infrastructure
– NAS: Network Attached Storage
– Use specialized network protocols (e.g. CIFS, NFS, FTP, etc)
– Easiest case: File Server (e.g. Linux+Samba)
• Advantages:
– Easy to set up, easy to use, cheap infrastructure
– Allows sharing of storage among several systems
– Abstracts on file system level (easy for most applications)
• Disadvantages:
– Inefficient and slow
• Large protocol and processing overhead
– Abstracts on file system level (not suitable for special purposes like raw devices or storage virtualization)
2.4 NAS – Network Attached Storage
[Figure: Application Software → File System → Logical Storage → network → NAS Server]
• SANs offer specialized high-speed networks for storage devices
– Usually uses local FibreChannel networks
– Remote location may be connected via Ethernet or IP-WAN (Internet)
– Network uses specialized storage protocols
• iFCP (SCSI on FibreChannel)
• iSCSI (SCSI on TCP/IP)
• HyperSCSI (SCSI on raw ethernet)
• SANs provide raw block level access to logical storage devices
– Logical disks of any size can be offered by the SAN
– To a client system using a logical disk, it appears like a local disk or RAID
– The client system has full control over file systems on logical disks
2.4 SAN – Storage Area Network
[Figure: Application Software → File System → Logical Storage → SAN]
2.4 SAN – Storage Area Network
[Figure: SAN topology – servers attach via SAN HBAs to SAN switches on a SAN bus (iFCP); disks attach via a SAN/RAID HBA and a peripheral bus (SCSI, SAS, etc.); a NAS head exports a NAS protocol (CIFS) to an Ethernet network; remote sites connect via a WAN-SAN bus (HyperSCSI)]
• Advantages:
– Very efficient
• Highly optimized local network infrastructure
• Optimized protocols with low overhead
– Very flexible (any number of systems may use any number of disks at any location)
– Helps for disaster protection
• SAN can transparently span to even remote locations
– May also employ NAS heads for NAS-like behavior
• Disadvantages
– Expensive
2.4 SAN – Storage Area Network
• How much storage and bandwidth is needed by YouTube, and how might it be organized?
• All top secret, but there are educated guesses and some (older) leaked data…
2.5 Case Study
• A Google video search restricted to YouTube.com reveals 187,397,091 indexed videos
– 3.35 min/video: Based on the TOP-100 all-time videos
– 2.3 MB/min: Based on a sample (very low variation)
– 8.3 MB/video
• Guessed size of all videos on YouTube is 1.56 PB
– Assume 160 GB/disk with MTBF=16.6 years
• Based on the Google reliability study
– 9,800 hard disks are needed to store all videos just once without any redundancy
• MTBF = 14 hours ...
– Using 1,960 5+1 RAID 5’s, 11,760 disks are needed
• MTBF = 6.84 years - not too great…
• Still, each video only available once
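The corpus size and raw disk count above can be reproduced from the slide’s estimates (variable names are my own):

```python
# Back-of-the-envelope: total video volume and disk count (slide estimates).
videos = 187_397_091
mb_per_video = 8.3   # slide's per-video size estimate
disk_gb = 160

total_pb = videos * mb_per_video / 1000**3        # MB -> PB (SI units)
disks = videos * mb_per_video / 1000 / disk_gb    # without any redundancy
print(round(total_pb, 2), round(disks))
# -> ~1.56 PB on ~9,700 disks (the slide rounds to 9,800)
```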
2.5 Case Study
– Using 196 (9+1)(5+1) RAID 55 arrays, 13,066 disks are needed
• RAID 5 arrays with 6 disks each; 10 of these arrays form an overlaying RAID 5
• MTBF = 14 million years (finally, data is “safe” at one location)
• Still, each video only available once
– No global disaster safety
– No global load balancing
• How might this look?
2.5 Case Study
• YouTube grows fast
– Currently, around 200,000 new videos per day (1.66 TB/day)
• Larger number of disks have to be added per month
– Around 440 disks/month for new videos
– Around 80 disks/month to replace broken ones
• Growing exponentially
2.5 Case Study
• It gets even worse…
• YouTube serves 200 million videos per day (as of mid 2007)
– 30 PB of data EVERY MONTH
– 154 Gbps (read: 154 gigabit per second)
– Results in an average of 586,000 concurrent streams
– Popular videos see around 250,000 views per day
• 600 concurrent streams per FILE (25 MB/sec)
– This bandwidth is insanely expensive: 600,000 USD/month
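The 154 Gbps figure follows directly from the daily serving volume (variable names are my own):

```python
# Daily serving volume -> sustained bandwidth (slide estimates).
videos_per_day = 200_000_000
mb_per_video = 8.3

bytes_per_s = videos_per_day * mb_per_video * 1e6 / 86_400
gbps = bytes_per_s * 8 / 1e9
print(round(gbps))  # -> 154
```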
2.5 Case Study
• This massive amount of data cannot be hosted and served from a single location…
• Data needs to be distributed and globally load balanced
2.5 Case Study
• YouTube does not host and provide videos themselves
– They hire Limelight Networks for that
• Limelight Networks
– Large CDN (Content Delivery Network) provider
– Owns 25 POPs (Points Of Presence) connected with its own backbone
• Each POP with up to thousands of storage servers
• Can serve up to 1 Tbps!
2.5 Case Study
• Limelight automatically distributes content among all POPs
– Data is massively redundant
– More popular data replicated more, less popular replicated less
– Each file is served from the closest location with bandwidth to spare
• Global load balancing
– Data is disaster proof!
• What to learn?
• Large scale data storage and serving
– Very resource intensive
– Very expensive