Relational Database Systems 2


(1)

Wolf-Tilo Balke

Benjamin Köhncke

Institut für Informationssysteme

Technische Universität Braunschweig http://www.ifis.cs.tu-bs.de

Relational Database Systems 2

2. Physical Data Storage

(2)

2.1 Introduction

2.2 Hard Disks

2.3 RAIDs

2.4 SANs and NAS

2.5 Case Study


2 Physical Data Storage

(3)

• DBMS needs to retrieve, update and process persistently stored data

– Storage consideration is an important factor in planning a database system (physical layer)

Remember:

The data has to be securely stored, but access to the data should be declarative!

2.1 Physical Storage Introduction


(4)

• Data is stored on storage media. Media differ highly in terms of

– Random Access Speed

– Random/Sequential Read/Write Speed

– Capacity

– Cost per Capacity

EN 13.1

2.1 Physical Storage Introduction

(5)

Capacity: Quantifies the amount of data which can be stored

– Base Units: 1 Bit; 1 Byte = 2³ Bit = 8 Bit

– Capacity units according to IEC, IEEE, NIST, etc:

• Usually used for file sizes and primary storage (for a higher degree of confusion, sometimes used with SI abbreviations…)

• 1 KiB = 1024¹ Byte; 1 MiB = 1024² Byte; 1 GiB = 1024³ Byte; …

– Capacity units according to SI:

• Usually used for advertising secondary/tertiary storage

• 1 KB = 1000¹ Byte ≈ 0.976 KiB; 1 MB = 1000² Byte ≈ 0.954 MiB; 1 GB = 1000³ Byte ≈ 0.931 GiB; …

– Especially used by the networking community:

• 1 Kb = 1000¹ Bit = 0.125 KB ≈ 0.122 KiB; 1 Mb = 1000² Bit = 0.125 MB ≈ 0.119 MiB
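For illustration, the relationship between the SI and IEC units can be checked in a few lines of Python (a small sketch added for clarity; not part of the original slide set):

```python
# Base-2 (IEC) vs. base-10 (SI) capacity units, as defined above.
KIB, MIB, GIB = 1024**1, 1024**2, 1024**3   # KiB, MiB, GiB
KB, MB, GB = 1000**1, 1000**2, 1000**3      # KB, MB, GB

print(f"1 KB = {KB / KIB:.3f} KiB")          # ≈ 0.977 KiB
print(f"1 MB = {MB / MIB:.3f} MiB")          # ≈ 0.954 MiB
print(f"1 GB = {GB / GIB:.3f} GiB")          # ≈ 0.931 GiB
print(f"1 Kb = {1000 / 8 / KIB:.3f} KiB")    # 1 kilobit ≈ 0.122 KiB
```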

2.1 Relevant Media Characteristics

(6)

Comic: http://xkcd.com/

2.1 A Kilo-Joke

(7)

Random Access Time: Average time to access a random piece of data at a known media position

– Usually measured in ms or ns

– Within some media, access time can vary depending on position (e.g. hard disks)

Transfer Rate: Average amount of consecutive data that can be transferred per time unit

– Usually measured in KB/sec, MB/sec, GB/sec,…

– Sometimes also in Kb/sec, Mb/sec, Gb/sec

2.1 Characteristic Parameters

(8)

Volatile: Memory needs constant power to keep data

Dynamic: Dynamic volatile memory needs to be “refreshed” regularly to keep data

Static: No refresh necessary

• Access Modes

Random Access: Any piece of data can be accessed in approximately the same time

Sequential Access: Data can only be accessed in sequential order

• Write Mode

Mutable Storage: Can be read and written arbitrarily

– Write Once Read Many (WORM)

• Interesting for legal issues → Sarbanes-Oxley Act (2002)


2.1 Other characteristics

(9)

Online media

– „always on“

– Each single piece of data can be accessed fast

– e.g. hard drives, main memory

Nearline media

– Compromise between online and offline

– Offline media can be automatically put "online"

– e.g. juke boxes, robot libraries

Offline media (disconnected media)

– Not under direct control of processing unit

– Have to be connected manually

– e.g. box of backup tapes in basement

2.1 Online, Nearline, Offline

(10)

• Media characteristics result in a storage hierarchy

• DBMS optimize data distribution among the storage levels

Primary Storage: Fast, limited capacity, high price, usually volatile electronic storage

• Frequently used data / current work data

Secondary Storage: Slower, large capacity, lower price

• Main stored data

Tertiary Storage: Even slower, huge capacity, even lower price, usually offline

• Backup and long term storage of not frequently used data


2.1 The Storage Hierarchy

(11)

2.1 The Storage Hierarchy

(Hierarchy diagram: cost per capacity and speed decrease from top to bottom of the hierarchy)

Level | Media | Typical access time
Primary | Cache, RAM | ~100 ns
Secondary | Flash, Magnetic Disks | ~10 ms
Tertiary | Optical Disks, Tape | > 1 s

(12)

Type | Media | Size | Random Acc. Speed | Transfer Speed | Characteristics | Price | Price/GB
Pri | L1 Processor Cache (Intel QX9000) | 32 KiB | 0.0008 ms | 6200 MB/sec | Vol, Stat, RA, OL | – | –
Pri | DDR3-RAM (Corsair 1600C7DHX) | 2 GiB | 0.004 ms | 8000 MB/sec | Vol, Dyn, RA, OL | €200 | €93
Sec | Harddrive SSD (MTRON SSD MOBI64) | 64 GB | 0.1 ms | 95 MB/sec | Stat, RA, OL | €1050 | €16
Sec | Harddrive Magnetic (Seagate ST3100034AS) | 1000 GB | 12 ms | 80 MB/sec | Stat, RA, OL | €200 | €0.20
Ter | DVD+R (Sony DRU-810A + Fuji Disks) | 4.7 GB | 98 ms | 11 MB/sec | Stat, RA, OF, WORM | €0.60/Disk | €0.12
Ter | LTO Streamer (Freecom LTO-920i) | 800 GB | 58 sec | 120 MB/sec | Stat, SA, OF | €80/Tape | €0.10


2.1 Storage Media – Examples

Last updated March 2008

Pri= Primary, Sec=Secondary, Ter=Tertiary

Vol=Volatile, Stat=Static, Dyn=Dynamic, RA=Random Access, SA=Sequential Access OL=Online, OF=Offline, WORM=Write Once Read Many

(13)

• Hard drives are currently the standard for large, cheap and persistent storage

– Usually used as the main storage media for most data in a DB

• DBMS need to be optimized for efficient disk storage and access

– Data access needs to be as fast as possible

– Often used data should be accessible with highest speed, rarely needed data may take longer

– Different data items needed for certain reoccurring tasks should also be stored/accessed together

2.2 Magnetic Disk Storage – HDs

(14)

Directional magnetization of a ferromagnetic material

• Realized on hard disk platters

– Base platter made of non-magnetic aluminum or glass substrate

– Magnetic grains worked into base platter to form magnetic regions

• Each region represents 1 Bit

Read head can detect magnetization direction of each region

Write head may change direction


2.2 HD – How does it work?

(15)

• Giant MagnetoResistance Effect (GMR)

– Discovered 1988 simultaneously by Peter Grünberg and Albert Fert

• Both honored with the 2007 Nobel Prize in Physics

– Allows the construction of efficient read heads:

• The electric resistance of alternating ferromagnetic and non-magnetic layers changes strongly ("giantly") with the direction of the applied magnetic field

2.2 HD – Notable Technology Advances

(16)

Perpendicular Recording (used since 2005)

– Longitudinal Recording limited to ~200 Gb/inch² due to the superparamagnetic effect

• Thermal energy may spontaneously change magnetic direction

– Perpendicular recording allows for up to 1000 Gb/inch²

– Very simplified: align the magnetic field orthogonal to the surface instead of parallel to it

• Magnetic regions can be smaller


2.2 HD – Notable Technology Advances

(17)

• Usage of magnetic grains instead of continuous magnetic material

– Between magnetic direction transitions, Néel spikes are formed

• Areas of uncertain magnetic direction

– Néel spikes are larger for continuous materials

– Magnetic regions can be smaller as the transition width can be reduced

2.2 HD – Notable Technology Advances

(18)

• A hard disk is made up of multiple double-sided platters

– Platter sides are called surfaces

– Platters are fixed on main spindle and rotate at equal and constant speed (common: 5400 rpm / 7200 rpm)

– Each surface has its own read and write head

– Heads are attached to arms

• Arms can position heads along the surface

• Heads cannot move independently

– Heads have no contact with the surface and hover on top of an air bearing

EN 13.2

2.2 HD – Basic Architecture

(19)

• Each surface is divided into circular tracks

– Some disks may use spirals

• All tracks of all surfaces with the same diameter are called cylinder

– Data within the same cylinder can be accessed very efficiently

EN 13.2

2.2 HD – Basic Architecture

(20)

• Each track is subdivided into sectors of equal capacity

a) Fixed angle sector subdivision

• Same number of sectors per track, changing density, constant speed

b) Fixed data density

• Outer tracks have more sectors than inner tracks

• Transfer speed higher on outer tracks

• Adjacent sectors can be grouped into clusters

EN 13.2

2.2 HD – Basic Architecture

(21)

• Hard drives are not completely reliable!

– Drives do fail

– Means for physical failure recovery are necessary

Backups

Redundancy

• Hard drives age and wear down.

Wear increases significantly with:

– Contact cycles (head parking)

– Spindle start-stops

– Power-on hours

– Operation outside ideal environment

• Temperature too low/high

• Unstable voltage

2.2 HD - Reliability

(22)

Reliability measures are statistical values assuming certain usage patterns

• Desktop usage (all per year): 2 400 hours, 10 000 motor start/stops, 25°C temperature

• Server usage (all per year): 8 760 hours, 250 motor start/stops, 40°C temperature

Non-Recoverable read errors: A sector on the surface cannot be read anymore – the data is lost

• Desktop disk: 1 per 10¹⁴ read bits, Server: 1 per 10¹⁵ read bits

• Disk can detect this!

– Maximum contact cycles: Maximum number of allowed head contacts (parking)

• Usually around 50 000 cycles


2.2 HD - Reliability

(23)

– Mean Time Between Failure (MTBF): Statistically anticipated time until 50% of a large disk population has failed

• Drive manufactures usually use optimistic simulations to guess the MTBF

• Desktop: 0.7 million hours (80 years), Server: 1.2 million hours (137 years) – manufacturers' values

– Annualized Failure Rate (AFR): Probability of a failure per year based on MTBF

• AFR = OperatingHoursPerYear / MTBFhours

• Desktop: 0.34%, Server: 0.73%
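As a quick check of the AFR formula against the usage patterns and MTBF values quoted above (an illustrative sketch, not part of the slides):

```python
# AFR = operating hours per year / MTBF in hours (formula from the slide).
def annualized_failure_rate(operating_hours_per_year, mtbf_hours):
    return operating_hours_per_year / mtbf_hours

print(f"Desktop: {annualized_failure_rate(2_400, 700_000):.2%}")    # 0.34%
print(f"Server:  {annualized_failure_rate(8_760, 1_200_000):.2%}")  # 0.73%
```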

2.2 HD - Reliability

(24)

• Failure rate during a hard disk's lifespan is not constant

• Can be better modeled by the “bathtub curve”

having 3 components

– Infant Mortality Rate

– Wear-Out Failures

– Random Failures


2.2 HD - Reliability

(25)

• Report by Google

– 100,000 consumer grade disks (80–400 GB, ATA interface, 5400–7200 RPM)

• Results (among others)

Drives fail often!

– There is an infant mortality

– High usage increases infant mortality, but not later failure rates

– Observed AFR is around 7% and MTBF 16.6 years!

E. Pinheiro, W.-D. Weber, L. A. Barroso: Failure Trends in a Large Disk Drive Population. 5th USENIX Conference on File and Storage Technologies (FAST 2007)

2.2 Real World Failure Rates

Careful: 2+ year results are biased. See reference.

(26)

• Seagate ST3100034AS (Desktop Harddrive, 2008)

– Manufacturer’s specifications


2.2 HD - Example Specs

Specification | Value
Capacity | 1 TB
Platters | 4
Heads | 8
Cylinders | 16,383
Sectors per track | 63
Bytes per sector | 512
Spindle Speed | 7200 RPM
MTBF | 80 years
AFR | 0.34 %

(27)

• Assume a storage need of 10 TB. Only the following HDs are available

– Capacity: 100 GB each

– MTBF: 100,000 hours each (ca. 11 years)

• Consider using 100 of these disks independently (w/o RAID).

– Total Storage: 10,000 GB = 10 TB

– MTBF: 1,000 hours (ca. 42 days) – THIS IS BAD!

• More sophisticated ways of using multiple disks are needed
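The 42-day figure follows from the first failure among many independent disks; a minimal sketch of the calculation (assuming a constant, independent failure rate per disk):

```python
# With N independent disks, the expected time until the first failure
# shrinks to roughly MTBF_single / N.
mtbf_single_hours = 100_000
n_disks = 10_000 // 100            # 10 TB needed / 100 GB per disk = 100 disks

mtbf_array_hours = mtbf_single_hours / n_disks
print(n_disks, "disks ->", mtbf_array_hours, "hours ≈",
      round(mtbf_array_hours / 24), "days")   # 100 disks -> 1000 hours ≈ 42 days
```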

2.2 Reliability – Considerations

(28)

• The disk controller organizes low level access to the disk

– e.g. head positioning, error checking, signal processing

– Usually integrated into the disk

– Provides a unified and abstracted interface to access the disk (e.g. LBA)

– Connects the disk to a peripheral bus (e.g. IDE, SCSI, FibreChannel, SAS)

• The host bus adapter (HBA) bridges between the peripheral bus and the system's internal bus (like PCIe, PCI)

– Internal bus usually integrated into the system's main board

– Often confused with the disk controller

DAS (Directly Attached Storage)


2.2 HD – Controller

(Diagram: disk mechanics → Disk Controller → Peripheral Bus → Host Bus Adapter → Internal Bus → inner system / mainboard)

(29)

• Sectors can be logically grouped to blocks by the operating system

– Sectors in a block do not necessarily need to be adjacent

– e.g. NTFS defaults to 4 KiB per block

• 8 sectors on a modern disk

Hardware address of a block is a combination of

– Cylinder number, surface number, block number within track

– Controller maps hardware address to logical block address (LBA)
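For illustration, a sketch of the classic mapping from a cylinder/head/sector address to a linear LBA (geometry numbers borrowed from the Seagate example slide; real controllers hide a zoned geometry behind the LBA interface, so this is only the nominal scheme):

```python
# Classic CHS-to-LBA mapping: enumerate sectors track by track, surface by
# surface, cylinder by cylinder. Sectors are traditionally numbered from 1.
HEADS = 8                  # surfaces (nominal value from the example drive)
SECTORS_PER_TRACK = 63     # nominal value from the example drive

def chs_to_lba(cylinder, head, sector):
    return (cylinder * HEADS + head) * SECTORS_PER_TRACK + (sector - 1)

print(chs_to_lba(0, 0, 1))   # 0   -> very first sector of the disk
print(chs_to_lba(0, 1, 1))   # 63  -> first sector on the second surface
print(chs_to_lba(1, 0, 1))   # 504 -> first sector of the second cylinder (8 * 63)
```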

2.2 HD – Controller

(30)

• Disk controller transfers content of whole blocks to buffer

– Buffer resides in a primary storage and can be accessed efficiently

– Time needed to transfer a random block (4KiB/Block on ST3100034AS): (<10 msec)

Seek Time: Time needed to position head to correct cylinder (<8 msec)

Latency (Rotational Delay): Time until the correct block arrives below the head (<0.14 msec)

Block Transfer Time: Time to read all sectors of block (<0.01 msec)

– Bulk transfer time for n adjacent blocks (<20 msec for n=10)

• Seek Time + Rotational Delay + n * Block Transfer Time
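Plugging the upper-bound figures quoted above into the formula gives a feel for why bulk transfers pay off (a small illustrative sketch):

```python
# Time for n adjacent blocks = seek time + rotational delay + n * block transfer time.
# Upper-bound values as quoted on the slide (ST3100034AS, 4 KiB blocks).
SEEK_MS = 8.0
ROTATIONAL_DELAY_MS = 0.14
BLOCK_TRANSFER_MS = 0.01

def bulk_transfer_ms(n_blocks):
    return SEEK_MS + ROTATIONAL_DELAY_MS + n_blocks * BLOCK_TRANSFER_MS

print(bulk_transfer_ms(1))        # ≈ 8.15 ms for one random block
print(bulk_transfer_ms(10))       # ≈ 8.24 ms for 10 adjacent blocks
print(10 * bulk_transfer_ms(1))   # ≈ 81.5 ms for 10 separate random blocks
```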


2.2 HD – Controller

(31)

Locating data on a disk is a major bottleneck

– Try operating on data already in buffer

– Aim for bulk transfer, avoid random block transfer

2.2 HD – Controller

(32)

• A single HD is often not sufficient

– Limited capacity

– Limited speed

– Limited reliability

• Idea: Combine multiple HDs into a RAID Array (Redundant Array of Independent Disks)

– RAID Array treats multiple hardware disks as a single logical disk

• More HDs for increased capacity

• Parallel access for increased speed

• Controlled redundancy for increased reliability

Silber 11.3

2.3 RAID

(33)

• The RAID controller connects to multiple hard disks

– Disks are virtualized and appear to be just one single logical disk

– The RAID controller acts as an extended specialized HBA (Host Bus Adapter)

– Still DAS (Directly Attached Storage)

2.3 RAID Controller

(Diagram: multiple hard disks → Peripheral Bus → RAID Controller → Internal Bus; represented as a single logical disk)

(34)

Mirroring (or shadowing): Increases reliability by complete redundancy

• Idea: Mirror Disks are exact copies of original disk

– Not space efficient

• Read speed can be n times as fast, write speed does not increase

• Increases reliability. Assume

– Two disks with an MTBF of 11 years each

• One original disk, one mirror disk

• Assume disk failures are independent of each other (unrealistic)

– Disk replacement time of 10 hours

– ► MTBF of the mirror system is >57,000 years!

Silber 11.3

2.3 RAID Principles - Mirroring

(35)

Striping: Improve performance by parallelism

• Idea: Distribute data among all disks for increased performance

Bit Level Striping: Split all bits of a byte to the disks

– e.g. for 8 disks, write the i-th bit of each byte to disk i

– Number of disks needs to be a power of 2

– Each disk is involved in each access

• Access rate does not increase

• Read and write transfer speed linearly increases with each disk

• Simultaneous accesses not possible

– Good for speeding up few, sequential and large accesses

2.3 RAID Principles - Striping

(36)

Block Level Striping: Distribute blocks among the disks

– Only one disk is involved in reading a specific block

• Read and write speed of a single block not increased

• Other disks still free to read/write other blocks

• Read and write speed of multiple accesses increase

– Good for large number of parallel accesses

Silber 11.3

2.3 RAID Principles – Striping

(37)

Error Correction Codes: Increase reliability with computed redundancy

Hamming Codes

– Can detect and repair 1-bit errors within a set of n data bits by computing k parity bits

• n = 2^k − k − 1

• n=1, k=2; n=4, k=3; n = 11, k=4; n = 26, k=5; …

– Especially used for in-memory and tape error correction

• Not really used for hard drives anymore

– Not further elaborated in this lecture

2.3 RAID Principles - Error Correction Codes

(38)

Interleaved Parity (Reed-Solomon algorithm on the Galois field GF(2))

– Can repair 1-bit errors (when the error position is known)

– Hard disks can detect read errors themselves, no need for complete Hamming codes

– Basic Idea:

• From n data pieces D1,…,Dn compute a parity data Dp by combining data using logical XOR (eXclusive OR)

XOR is associative and commutative

– Important: A XOR B XOR B = A

• i.e. Dp= D1 XOR D2 XOR … XOR Dn

• Assume D2 was lost. It can be reconstructed by D2= Dp XOR D1 XOR D3 XOR … XOR Dn
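The same idea in a few lines of Python, using the values from the worked example on the next slide (an illustrative sketch):

```python
# Interleaved parity: the parity word is the XOR of all data words;
# any single lost word can be reconstructed from the rest plus the parity.
from functools import reduce

def xor_parity(words):
    return reduce(lambda a, b: a ^ b, words)

data = [0b0101, 0b1100, 0b1011]            # A, B, C
parity = xor_parity(data)                  # Dp = A XOR B XOR C
print(format(parity, "04b"))               # 0010

lost = 2                                   # pretend C was lost
survivors = [w for i, w in enumerate(data) if i != lost]
recovered = xor_parity(survivors + [parity])
print(format(recovered, "04b"))            # 1011 -> C is recovered
```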


2.3 RAID Principles - Error Correction Codes

(39)

• Interleaved Parity. Example:

• A = 0101, B = 1100, C = 1011

• P = 0010 = A XOR B XOR C

• C is lost.

– P = A XOR B XOR C ⇒ C = P XOR A XOR B

– C = A XOR B XOR C XOR A XOR B

– C = A XOR A XOR B XOR B XOR C

– C = 0 XOR C

– C = 1011

2.3 RAID Principles : Interleaved Parity

    0101 (A)
XOR 1100 (B)
XOR 1011 (C)
  = 0010 (P)

    0010 (P)
XOR 0101 (A)
XOR 1100 (B)
  = 1011 (C)

(40)

• The 3 RAID principles can be combined in multiple ways

– Not every combination is useful

• This led to the definition of 7 core RAID levels

– RAID 0 through RAID 6

– The most dominant levels are RAID 0, RAID 1, RAID 1+0, RAID 5

• In following examples, assume

– An MTBF of 100,000 hours (11.42 years) per disk

– A Mean Time to Repair (MTTR) of 6 hours

– Failure rate is constant and failures between disks are independent

MTBFraid is the mean time to data loss within the RAID if each failing disk is replaced within the MTTR

D is the number of drives in the RAID set

C = 200 GB is the capacity of one disk, Craid the capacity of the whole RAID


2.3 RAID in practical applications

(41)

• Mean Time to Repair (MTTR)

MTTR = TimeToNotice + RebuildTime

– Assume time to notice of 0.5 hours

Rebuild time is the time for completely writing back lost data

• Assume disk capacity of 200GB

• Write back speed of 10 MB/sec

– Consisting of reading remaining disks

– Computing parity / reconstructing data

• Rebuild time around 5.5 hours

– During rebuild, a RAID is especially vulnerable

MTTR = 6 hours
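The rebuild arithmetic behind the 6-hour MTTR, under the stated assumptions (illustrative sketch):

```python
# MTTR = time to notice + rebuild time (reading the remaining disks,
# recomputing parity, and writing back the lost 200 GB at 10 MB/sec).
disk_capacity_mb = 200 * 1000     # 200 GB
write_back_mb_per_sec = 10
time_to_notice_h = 0.5

rebuild_h = disk_capacity_mb / write_back_mb_per_sec / 3600
mttr_h = time_to_notice_h + rebuild_h
print(f"rebuild ≈ {rebuild_h:.2f} h, MTTR ≈ {mttr_h:.1f} h")   # ≈ 5.56 h rebuild, ≈ 6 h MTTR
```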

2.3 RAID in practical applications

(42)

• File A (A1-Ax), File B (B1-Bx), File C (C1-Cx)

Raid 0

– Block-Level-Striping only

– Increased parallel access and transfer speeds, reduced reliability

– All disks contain data (0% overhead)

– Works with any number of disks

– MTBFraid = MTBFdisk / D

– 4 disks:

MTBFraid= 2.86 years

Craid = 800 GB (0 GB wasted (0%))

– Common size: 2 disks

MTBFraid= 5.72 years

Craid = 400 GB (0 GB wasted (0%))


2.3 RAID Levels

(43)

Raid 1

– Mirroring only

– Increased reliability, increased read transfer speed, low space efficiency

– MTBFraid = MTBFdisk^D / (D! * MTTR^(D-1))

– 4 disks:

MTBFraid= 2.2 trillion years

Craid = 200 GB (600 GB wasted (75%))

• Age of universe may be around 15 billion years…

– Common size: 2 disks

MTBFraid= 95,130 years

Craid = 200 GB (200 GB wasted (50%))
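The mirroring formula can be evaluated directly; this sketch reproduces the two figures above using the lecture's assumptions (MTBF 100,000 h, MTTR 6 h):

```python
# MTBF of a D-way mirror: MTBF_disk^D / (D! * MTTR^(D-1)), in hours.
from math import factorial

MTBF_DISK_H, MTTR_H, HOURS_PER_YEAR = 100_000, 6, 8760

def raid1_mtbf_years(d):
    hours = MTBF_DISK_H**d / (factorial(d) * MTTR_H**(d - 1))
    return hours / HOURS_PER_YEAR

print(f"{raid1_mtbf_years(2):,.0f} years")   # ≈ 95,130 years (2-disk mirror)
print(f"{raid1_mtbf_years(4):.2e} years")    # ≈ 2.2e12, i.e. ~2.2 trillion years
```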

2.3 RAID Levels

(44)

RAID 2

– Not used anymore in practice

• was used in old mainframes

– Bit-Level-Striping

– Use Hamming Codes

• Usually Hamming Code(7,4) – 4 data bits, 3 parity bits

• Reliable 1-Bit error recovery (i.e. one disk may fail)

– 3 redundant disks per 4 data disks (75% overhead)

• Ratio better for larger number of disks

– MTBFraid = MTBFdisk^2 / (D * (D-1) * MTTR)

– 7 disks (does not really make sense for 4 – not comparable to other values)

MTBFraid= 4,530 years

Craid= 800 GB (600 GB wasted (43%))


2.3 RAID Levels

(45)

RAID 3

– Interleaved Parity

– Byte-Level Striping

– Dedicated Parity Disk

• Bottleneck! Every write operation needs to update the parity disk.

• No parallel writes

– 1 redundant disk per n data disks

• Overhead decreases with number of disks while reliability decreases

• 25% overhead for 4 data disks

– MTBFraid = MTBFdisk^2 / (D * (D-1) * MTTR)

– 4 disks

MTBFraid= 15,854 years

Craid= 600 GB (200 GB wasted (25%))

2.3 RAID Levels

(46)

RAID 4

– Block-Level Striping

– As RAID 3 otherwise

– 4 disks (common size)

MTBFraid = 15,854 years

Craid = 600 GB (200 GB wasted (25%))

– 5 disks (also common size)

MTBFraid = 9,513 years

Craid = 800 GB (200 GB wasted (20%))


2.3 RAID Levels

(47)

RAID 5

– Parity is distributed among the hard disks

• May allow for parallel block writes

– As RAID 4 otherwise

– Bottleneck when writing many files smaller than a block

• Whole parity block has to be read and re-written for each minor write

– Can recover from a single disk failure

– MTBFraid and Craid as for RAID 3 & 4

2.3 RAID Levels

(48)

RAID 6

– Two independent parity blocks distributed among the disks

• May be implemented by parity on orthogonal data or by using Reed-Solomon on GF(2^8)

– As RAID 5 otherwise

– 2 redundant disks per n data disks

• Can recover from a double disk failure

• No vulnerability during single failure rebuild

• Very suitable for larger arrays

• Write overhead due to more complicated parity computation

– MTBFraid = MTBFdisk^3 / (D * (D-1) * (D-2) * MTTR^2)

– 4 disks

MTBFraid= 132 million years

Craid= 400 GB (400 GB wasted (50%))

– 8 disks (common)

MTBFraid= 9,437 years (~RAID 5 w. D=5)

Craid= 1,200 GB (400 GB wasted (25%))
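For comparison, the single- and double-parity formulas from these slides evaluated with the same assumptions (illustrative sketch):

```python
# RAID 5: MTBF_disk^2 / (D * (D-1) * MTTR)
# RAID 6: MTBF_disk^3 / (D * (D-1) * (D-2) * MTTR^2)
MTBF_H, MTTR_H, HOURS_PER_YEAR = 100_000, 6, 8760

def raid5_years(d):
    return MTBF_H**2 / (d * (d - 1) * MTTR_H) / HOURS_PER_YEAR

def raid6_years(d):
    return MTBF_H**3 / (d * (d - 1) * (d - 2) * MTTR_H**2) / HOURS_PER_YEAR

print(f"RAID 5, 4 disks: {raid5_years(4):,.0f} years")   # ≈ 15,855 years
print(f"RAID 5, 5 disks: {raid5_years(5):,.0f} years")   # ≈ 9,513 years
print(f"RAID 6, 4 disks: {raid6_years(4):,.0f} years")   # ≈ 132 million years
```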


2.3 RAID Levels

(49)

Additionally, there are hybrid levels combining the core levels

– RAID 0+1, RAID 1+0, RAID 5+0, RAID 5+1, RAID 6+6, …

Raid 1+0

– Mirrored sets nested in a striped set

• RAID 0 on sets of RAID 1 sets

Very high read and write transfer speeds, increased reliability, low space efficiency, limited maximum size

– Most performant RAID combination

– D1 = drives per RAID 1 set, D0 = number of RAID 1 sets

– MTBFraid = MTBFdisk^D1 / (D1! * MTTR^(D1-1)) / D0

– 4 disks: D1 = 2, D0= 2

MTBFraid= 47,565 years

Craid= 400 GB (400 GB wasted (50%))

– 6 disks: D1 = 2, D0 = 3

MTBFraid= 31,706 years

Craid= 600 GB (600 GB wasted (50%))

2.3 Practical use of RAIDS

(50)

• RAID controllers directly connect storage to the system bus

– Storage available to only one system / server / application

• Number of disks is limited

– Consumer grade RAID: 2-4 disks

– Enterprise grade RAID: 8-24+ disks

• Solutions

NAS (Network Attached Storage): Provide abstracted file systems via network (software solution)

SAN (Storage Area Network): Virtualized logical storage within a specialized network on block level (hardware solution)


2.4 Beyond RAID

(51)

• Before discussing NAS, we need file systems

• A file system is software for abstracting file operations on a logical storage device

– Files are a collection of binary data

• Creating, reading, writing, deleting, finding, organizing

– How does a file access translate into top-level operations on a logical storage device?

• e.g. which blocks have to be read/written?

• Bridge between application software and (abstracted) hardware

2.4 File Systems vs. Raw Devices

(Layer diagram: Application Software → File System → Logical Storage)

(52)

• Raw Devices access allows applications to bypass the OS and the file system

• Application may directly tune aspects of physical storage

• May lead to very efficient implementations

– Used e.g. for high-performance databases, system virtualization, etc.


2.4 File Systems vs. Raw Devices

(Layer diagram: Application Software → Logical Storage, bypassing the file system)

(53)

• Idea: Provide a remote file system using already available network infrastructure

NAS: Network Attached Storage

– Use specialized network protocols (e.g. CIFS, NFS, FTP, etc)

– Easiest case: File Server (e.g. Linux+Samba)

• Advantages:

– Easy to set up, easy to use, cheap infrastructure

– Allows sharing of storage among several systems

– Abstracts on file system level (easy for most applications)

• Disadvantages

Inefficient and slow

• large protocol and processing overhead

– Abstracts on file system level (not suitable for special purposes like raw devices or storage virtualization)

2.4 NAS – Network Attached Storage

(Diagram: Application Software accessing a File System and Logical Storage on a NAS Server over the network)

(54)

• SANs offer specialized high-speed networks for storage devices

– Usually uses local FibreChannel networks

Remote location may be connected via Ethernet or IP-WAN (Internet)

– Network uses specialized storage protocols

• iFCP (SCSI on FiberChannel)

• iSCSI (SCSI on TCP/IP)

• HyperSCSI (SCSI on raw ethernet)

• SANs provide raw block level access to logical storage devices

– Logical disks of any size can be offered by the SAN

– For a client system using a logical disk, it appears like a local disk or RAID

– Client system has full control over file systems on logical disks


2.4 SAN – Storage Area Network

(Layer diagram: Application Software → File System → Logical Storage provided by the SAN)

(55)

2.4 SAN – Storage Area Network

(Diagram: example SAN topology – hosts with SAN HBAs connected via SAN switches over a SAN bus (iFCP); storage attached via SAN/RAID HBAs and a peripheral bus (SCSI, SAS, etc.); a remote site reachable over a WAN-SAN bus (HyperSCSI); a NAS head offering a NAS protocol (CIFS) over an Ethernet network)

(56)

• Advantages:

Very efficient

• Highly optimized local network infrastructure

• Optimized protocols with low overhead

Very flexible (any number of systems may use any number of disks at any location)

– Helps for disaster protection

• SAN can transparently span to even remote locations

– May also employ NAS heads for NAS-like behavior

• Disadvantages

Expensive


2.4 SAN – Storage Area Network

(57)

• How much storage and bandwidth is needed by YouTube, and how might it be organized?

• All top secret, but there are educated guesses and some (older) leaked data…

2.5 Case Study

(58)

• A Google video search restricted to YouTube.com reveals 187,397,091 indexed videos

– 3.35 min/movie: based on TOP-100 all-time videos

– 2.3 MB/min: based on a sample (very low variation)

– ⇒ 8.3 MB/video

• Guessed size of all videos on YouTube is 1.56 PB

– Assume 160 GB/disk with MTBF=16.6 years

• Based on the Google reliability study

9,800 hard disks are needed to store all videos just once without any redundancy

• MTBF = 14 hours ...

– Using 1,960 5+1 RAID 5’s, 11,760 disks are needed

• MTBF = 6.84 years - not too great…

• Still, each video only available once
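A rough re-computation of these estimates under the stated assumptions (illustrative sketch; the slide's figures involve some rounding):

```python
# Back-of-envelope sizing for the YouTube case study.
videos = 187_397_091
mb_per_video = 8.3
disk_gb, mtbf_disk_years, mttr_h = 160, 16.6, 6

total_pb = videos * mb_per_video / 1e9
print(f"total ≈ {total_pb:.2f} PB")                   # ≈ 1.56 PB

disks_plain = round(videos * mb_per_video / (disk_gb * 1000))
print(f"disks without redundancy ≈ {disks_plain:,}")  # ≈ 9,700-9,800 disks

mtbf_disk_h = mtbf_disk_years * 8760
print(f"first failure ≈ {mtbf_disk_h / disks_plain:.0f} h")   # ≈ 14-15 hours

# 1,960 RAID 5 sets of 5+1 disks each (11,760 disks in total)
arrays = 1_960
mtbf_array_h = mtbf_disk_h**2 / (6 * 5 * mttr_h)
print(f"first data loss ≈ {mtbf_array_h / arrays / 8760:.2f} years")   # ≈ 6.8 years
```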


2.5 Case Study

(59)

– Using 196 (9+1)(5+1) RAID 55 arrays, 13,066 disks are needed

• RAID 5 Arrays with 6 disks each. 10 of these arrays form an overlaying RAID 5.

• MTBF = 14 million years (finally, data is “safe” at one location)

• Still, each video only available once

– No global disaster safety

– No global load balancing

• How might this look?

2.5 Case Study


(60)

• YouTube grows fast

– Currently, around 200,000 new videos per day (1.66 TB/day)

• Larger number of disks have to be added per month

– Around 440 disks/month for new videos

– Around 80 disks/month to replace broken ones

• Growing exponentially


2.5 Case Study

(61)

It gets even worse…

• YouTube serves 200 million videos per day (as of mid 2007)

– 30 PB of data EVERY MONTH

154 Gbps

(read: 154 Gigabit per second)

– Results in an average of 586,000 concurrent streams

– Popular videos have around 250,000 views per day

600 concurrent streams per FILE (25 MB/sec)

– This bandwidth is insanely expensive:

600,000 USD/month
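The 154 Gbps figure is consistent with the per-video size estimated earlier; a quick back-of-envelope check (illustrative only):

```python
# 200 million served videos per day at roughly 8.3 MB each.
videos_per_day = 200_000_000
mb_per_video = 8.3

bits_per_day = videos_per_day * mb_per_video * 1e6 * 8
avg_gbps = bits_per_day / 86_400 / 1e9
print(f"≈ {avg_gbps:.0f} Gbps average outbound bandwidth")   # ≈ 154 Gbps
```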

2.5 Case Study

(62)

• This massive amount of data cannot be hosted and served from a single location…

• Data needs to be distributed and globally load balanced


2.5 Case Study

(63)

• YouTube does not host and provide videos themselves

– They hire Limelight Networks for that

Limelight Networks

– Large CDN (Content Delivery Network) provider

– Owns 25 POPs (Points Of Presence) connected with its own backbone

• Each POP with up to 1000s of storage servers

• Can serve up to 1 Tbps!

2.5 Case Study

(64)

• Limelight automatically distributes content among all POP

– Data is massively redundant

– More popular data replicated more, less popular replicated less

– Each file is served from the closest location with bandwidth to spare

Global load balancing

– Data is disaster proof!

• What to learn?

• Large scale data storage and serving

– Very resource intensive

– Very expensive


2.5 Case Study
