Christoph Lofi, Philipp Wille
Institut für Informationssysteme
Relational Database Systems 2
2. Physical Data Storage
Relational Database Systems 2 – Wolf-Tilo Balke – Institut für Informationssysteme 2
2 Architecture
[Figure: DBMS architecture – the Query Processor (DML Compiler, Embedded DML Precompiler, DDL Interpreter, Query Evaluation Engine) and the Data Storage Manager (Transaction Manager, Buffer Manager, File Manager) operate on Indices, Statistics, Catalog/Dictionary, and the DB Scheme; accessed via Application Interfaces, Application Programs / Object Code, and Direct Query by Application Programmers and DB Administrators]
2.1 Introduction
2.2 Hard Disks
2.3 RAIDs
2.4 SANs and NAS
2.5 Case Study
2 Physical Data Storage
• A DBMS needs to retrieve, update, and process persistently stored data
– Storage considerations are an important factor in planning a database system (physical layer)
– Remember: the data has to be stored securely, but access to the data should be declarative!
2.1 Physical Storage Introduction
• Data is stored on storage media. Media differ greatly in terms of
– Random access speed
– Random/sequential read/write speed
– Capacity
– Cost per capacity
2.1 Physical Storage Introduction
• Capacity: quantifies the amount of data which can be stored
– Base units: 1 Bit, 1 Byte = 2^3 Bit = 8 Bit
– Capacity units according to IEC, IEEE, NIST, etc.:
• Usually used for file sizes and primary storage (for a higher degree of confusion, sometimes used with SI abbreviations…)
• 1 KiB = 1024^1 Byte; 1 MiB = 1024^2 Byte; 1 GiB = 1024^3 Byte; …
– Capacity units according to SI:
• Usually used for advertising secondary/tertiary storage
• 1 KB = 1000^1 Byte ≈ 0.976 KiB; 1 MB = 1000^2 Byte ≈ 0.954 MiB; 1 GB = 1000^3 Byte ≈ 0.931 GiB; …
– Especially used by the networking community:
• 1 Kb = 1000^1 Bit = 0.125 KB ≈ 0.122 KiB; 1 Mb = 1000^2 Bit = 0.125 MB ≈ 0.119 MiB
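Because the two unit systems drift further apart with every prefix, a small sketch (Python chosen here purely for illustration; the constant names are mine) makes the conversion factors above explicit:

```python
# IEC (binary) vs. SI (decimal) capacity units, as listed above
KIB, MIB, GIB = 1024**1, 1024**2, 1024**3   # KiB, MiB, GiB
KB, MB, GB = 1000**1, 1000**2, 1000**3      # KB, MB, GB

# An advertised SI unit holds fewer bytes than its IEC counterpart:
print(round(KB / KIB, 4))  # 0.9766 -> 1 KB ~ 0.976 KiB
print(round(MB / MIB, 4))  # 0.9537 -> 1 MB ~ 0.954 MiB
print(round(GB / GIB, 4))  # 0.9313 -> 1 GB ~ 0.931 GiB

# Networking units count bits, not bytes:
KBIT = 1000                # 1 Kb = 1000 Bit
print(KBIT / 8 / KB)       # 0.125 -> 1 Kb = 0.125 KB
```

Note that the gap grows with the prefix: about 2.4% at kilo, 4.6% at mega, and 6.9% at giga.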
2.1 Relevant Media Characteristics
2.1 A Kilo-Joke
• Random Access Time: average time to access a random piece of data at a known media position
– Usually measured in ms or ns
– Within some media, the access time can vary depending on the position (e.g., hard disks)
• Transfer Rate: average amount of consecutive data which can be transferred per time unit
– Usually measured in KB/sec, MB/sec, GB/sec, …
– Sometimes also in Kb/sec, Mb/sec, Gb/sec
2.1 Characteristic Parameters
• Volatile: memory needs constant power to keep data
– Dynamic: dynamic volatile memory needs to be “refreshed” regularly to keep data
– Static: no refresh necessary
• Access modes
– Random access: any piece of data can be accessed in approximately the same time
– Sequential access: data can only be accessed in sequential order
• Write mode
– Mutable storage: can be read and written arbitrarily
– Write Once Read Many (WORM): can be written only once, but read arbitrarily often
2.1 Other Characteristics
• Online media
– “always on”
– Each single piece of data can be accessed fast
– e.g., hard drives, main memory
• Nearline media
– Compromise between online and offline
– Offline media that can automatically be put “online”
– e.g., jukeboxes, robotic tape libraries
• Offline media (disconnected media)
– Not under direct control of the processing unit
– Have to be connected manually
– e.g., a box of backup tapes in the basement
2.1 Online, Nearline, Offline
• Media characteristics result in a storage hierarchy
• DBMS optimize data distribution among the storage levels
– Primary Storage: Fast, limited capacity, high price, usually volatile electronic storage
• Frequently used data / current work data
– Secondary Storage: Slower, large capacity, lower price
• Main stored data
– Tertiary Storage: Even slower, huge capacity, even lower price, usually offline
2.1 The Storage Hierarchy
2.1 The Storage Hierarchy
[Figure: storage hierarchy pyramid – Primary: Cache, RAM (~100 ns); Secondary: Flash, Magnetic Disks (~10 ms); Tertiary: Optical Disks, Tape (> 1 s); cost per capacity and speed decrease from top to bottom]
2.1 Storage Media – Examples

Type  Media                                      Size     Random Acc.    Transfer     Characteristics     Price        Price/GB
Pri   L1 Processor Cache                         32 KB    5 x 10^-10 s   15.4 GB/sec  Vol, Stat, RA, OL
Pri   DDR3-RAM (Corsair Dominator Platinum)      8 GB     2.6 x 10^-8 s  12.3 GB/sec  Vol, Dyn, RA, OL    € 160        € 20
Sec   Harddrive SSD (Samsung 840 PRO)            256 GB   4 x 10^-6 s    513 MB/sec   Stat, RA, OL        € 187        € 0.73
Sec   Harddrive Magnetic (Seagate ST2000DM001)   2000 GB  5.7 x 10^-4 s  153 MB/sec   Stat, RA, OL        € 100        € 0.05
Ter   Blank recordable DVD-R disk                4.7 GB   9.8 x 10^-2 s  11 MB/sec    Stat, RA, OF, WORM  € 0.15/Disk  € 0.03
Ter   LTO-5 tape (TDK LTO Ultrium 5 Cartridge)   1500 GB  58 s           280 MB/sec   Stat, SA, OF        € 15/Tape    € 0.01

Pri=Primary, Sec=Secondary, Ter=Tertiary
Vol=Volatile, Stat=Static, Dyn=Dynamic, RA=Random Access, SA=Sequential Access, OL=Online, OF=Offline
• Hard drives are currently the standard for large, cheap, and persistent storage
– Usually used as the main storage media for most data in a DB
• DBMS need to be optimized for efficient disk storage and access
– Data access needs to be as fast as possible
– Frequently used data should be accessible at the highest speed; rarely needed data may take longer
– Different data items needed for certain recurring tasks should also be stored/accessed together
2.2 Magnetic Disk Storage – HDs
• Data is stored by directional magnetization of a ferromagnetic material
• Realized on hard disk platters
– Base platter made of a non-magnetic aluminum or glass substrate
– Magnetic grains are worked into the base platter to form magnetic regions
• Each region represents 1 Bit
– The read head can detect the magnetization direction of each region
– The write head may change the direction
2.2 HD – How does it work?
• Giant Magnetoresistance Effect (GMR)
– Discovered in 1988 simultaneously by Peter Grünberg and Albert Fert
• Both were honored with the 2007 Nobel Prize in Physics
– Allows the construction of efficient read heads:
• The electric resistance of alternating ferromagnetic and non-magnetic layers changes “giantly” with the direction of the applied magnetic field
– http://www.research.ibm.com/research/demos/gmr/cyberdemo1.htm
2.2 HD – Notable Technology Advances
• Perpendicular recording (used since 2005)
– Longitudinal recording is limited to ~200 Gb/inch^2 due to the superparamagnetic effect
• Thermal energy may spontaneously change the magnetic direction
– Perpendicular recording allows for up to 1000 Gb/inch^2
– Very simplified: align the magnetic field orthogonally to the surface instead of parallel
• Magnetic regions can be smaller
2.2 HD – Notable Technology Advances
• Usage of magnetic grains instead of continuous magnetic material
– Between magnetic direction transitions, Néel spikes are formed
• Areas of uncertain magnetic direction
– Néel spikes are larger for continuous materials
– Magnetic regions can be smaller as the transition width can be reduced
2.2 HD – Notable Technology Advances
• A hard disk is made up of multiple double-sided platters
– Platter sides are called surfaces
– Platters are fixed on the main spindle and rotate at equal and constant speed (common: 5400 rpm / 7200 rpm)
– Each surface has its own read and write head
– Heads are attached to arms
• Arms can position the heads along the surface
• Heads cannot move independently
– Heads have no contact with the surface and hover on top of an air bearing
2.2 HD – Basic Architecture
• Each surface is divided into circular tracks
– Some disks may use spirals
• All tracks of all surfaces with the same diameter are called a cylinder
– Data within the same cylinder can be accessed very efficiently
EN 13.2
2.2 HD – Basic Architecture
• Each track is subdivided into sectors of equal capacity
a) Fixed angle sector subdivision
• Same number of sectors per track, varying data density, constant rotational speed
b) Fixed data density
• Outer tracks have more sectors than inner tracks
• Transfer speed is higher on outer tracks
• Adjacent sectors can be
2.2 HD – Basic Architecture
• Hard drives are not completely reliable!
– Drives do fail
– Means for physical failure recovery are necessary
• Backups
• Redundancy
• Hard drives age and wear down. Wear is significantly increased by:
– Contact cycles (head parking)
– Spindle start-stops
– Power-on hours
– Operation outside the ideal environment
• Temperature too low/high
• Unstable voltage
2.2 HD – Reliability
• Reliability measures are statistical values assuming certain usage patterns
– Desktop usage (all per year): 2,400 hours, 10,000 motor start/stops, 25°C temperature
– Server usage (all per year): 8,760 hours, 250 motor start/stops, 40°C temperature
– Non-recoverable read errors: a sector on the surface cannot be read anymore – the data is lost
• Desktop disk: 1 per 10^14 read bits, server disk: 1 per 10^15 read bits
• The disk can detect this!
– Maximum contact cycles: maximum number of allowed head contacts (parking)
2.2 HD – Reliability
– Mean Time Between Failures (MTBF): statistically expected time until half of a large disk population has failed
• Drive manufacturers usually use optimistic simulations to estimate the MTBF
• Desktop: 0.7 million hours (80 years), Server: 1.2 million hours (137 years) – manufacturers' values
– Annualized Failure Rate (AFR): probability of a failure per year, based on the MTBF
• AFR = OperatingHoursPerYear / MTBF_hours
• Desktop: 0.34%, Server: 0.73%
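The AFR formula can be evaluated directly against the manufacturer figures above (a quick sketch; the function name is mine):

```python
def afr_percent(operating_hours_per_year: float, mtbf_hours: float) -> float:
    """Annualized Failure Rate in percent, derived from the MTBF as defined above."""
    return 100 * operating_hours_per_year / mtbf_hours

# Desktop: 2,400 operating hours/year, MTBF 0.7 million hours
print(round(afr_percent(2_400, 700_000), 2))    # 0.34 (%)
# Server: 8,760 operating hours/year (24/7), MTBF 1.2 million hours
print(round(afr_percent(8_760, 1_200_000), 2))  # 0.73 (%)
```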
2.2 HD – Reliability
• The failure rate during a hard disk's lifespan is not constant
• It can be better modeled by the “bathtub curve”, which has 3 components
– Infant mortality rate
– Wear-out failures
– Random failures
2.2 HD – Reliability
• Report by Google
– 100,000 consumer-grade disks (80-400 GB, ATA interface, 5,400-7,200 RPM)
• Results (among others)
– Drives fail often!
– There is infant mortality
– High usage increases infant mortality, but not later failure rates
– Observed AFR is around 7% and MTBF 16.6 years!
E. Pinheiro, W.-D. Weber, L. A. Barroso: “Failure Trends in a Large Disk Drive Population”, 5th USENIX Conference on File and Storage Technologies (FAST), 2007
2.2 Real World Failure Rates
Careful: 2+ year results are biased. See reference.
• Seagate ST32000641AS, 2 TB (desktop hard drive, 2011)
– Manufacturer's specifications:
2.2 HD – Example Specs
Specification       Value
Capacity            2 TB
Platters            4
Heads               8
Cylinders           16,383
Sectors per track   63
Bytes per sector    512
Spindle speed       7,200 RPM
MTBF                85 years
AFR                 0.34 %
• Assume a storage need of 100 TB. Only the following HDs are available:
– Capacity: 1 TB each
– MTBF: 100,000 hours each (ca. 11 years)
• Consider using 100 of these disks independently (w/o RAID)
– Total storage: 100,000 GB = 100 TB
– MTBF until the first disk fails: 1,000 hours (ca. 42 days)
– THIS IS BAD!
• More sophisticated ways of using multiple disks are needed
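The arithmetic behind the 42-day figure: with independent disks and a constant failure rate, the expected time until the first of D disks fails is the single-disk MTBF divided by D. A minimal sketch:

```python
def hours_to_first_failure(mtbf_disk_hours: float, num_disks: int) -> float:
    """Expected time until the FIRST of num_disks independent disks fails,
    assuming a constant (exponential) failure rate per disk."""
    return mtbf_disk_hours / num_disks

hours = hours_to_first_failure(100_000, 100)
print(hours)              # 1000.0 hours
print(round(hours / 24))  # 42 days
```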
2.2 Reliability – Considerations
• Alternative to hard drives: SSD
– Uses microchips which retain data in non-volatile memory chips and contains no moving parts
• Uses the same interface as hard disk drives
– Easy replacement possible in most applications
• Key components
– Memory
– Controller
2.2 Solid State Disk (SSD)
• Flash memory
– Most SSDs use NAND-based flash memory
– Retains data even without power
– Slower than DRAM solutions
– Single-level cells versus multi-level cells
– Wears down!
• DRAM
– Uses volatile random access memory
– Ultra-fast data access (< 10 microseconds)
– Sometimes uses an internal battery or external power device to ensure data persistence
– Only for applications that require even faster access, but do not need data persistence after power loss
2.2 Memory
• The controller is an embedded processor
• It incorporates the electronics that bridge the NAND memory components to the host computer
• Some of its functions:
– Error correction, wear leveling, bad-block mapping, read and write caching, encryption, garbage collection
2.2 Controller
• Advantages
– Low access time and latency
– No moving parts, hence shock resistant
• MTBF of about 2 million hours
– Lighter and more energy-efficient than HDDs
• Disadvantages
– Storage is divided into blocks/pages
• If one byte changes, the whole page has to be written
• The old page is marked as stale
• Only whole blocks can be deleted
– Limited number of rewrite cycles (between 3,000 and 100,000 per page)
• Wear-leveling algorithms ensure that write operations are distributed equally across the pages
2.2 SSD – Summary
• The disk controller organizes low-level access to the disk
– e.g., head positioning, error checking, signal processing
– Usually integrated into the disk
– Provides a unified and abstracted interface to access the disks (e.g., LBA)
– Connects the disk to a peripheral bus (e.g., IDE, SCSI, FibreChannel, SAS)
• The host bus adapter (HBA) bridges between the peripheral bus and the system's internal bus (like PCIe, PCI)
– The internal bus is usually integrated into the system's main board
– Often confused with the disk controller
• DAS (Directly Attached Storage)
2.2 HD – Controller
[Figure: the disk controller sits on the disk and connects to the peripheral bus; the host bus adapter bridges to the internal bus inside the system / on the mainboard]
• Sectors can be logically grouped into blocks by the operating system
– Sectors in a block do not necessarily need to be adjacent
– e.g., NTFS defaults to 4 KiB per block
• 8 sectors on a modern disk
• The hardware address of a block is a combination of
– cylinder number, surface number, and block number within the track
– The controller maps hardware addresses to logical block addresses (LBA)
2.2 HD – Controller
• The disk controller transfers the content of whole blocks to a buffer
– The buffer resides in primary storage and can be accessed efficiently
– Time needed to transfer a random block (4 KiB/block on ST3100034AS): < 10 msec
• Seek time: time needed to position the head on the correct cylinder (< 8 msec)
• Latency (rotational delay): time until the correct block arrives below the head (< 0.14 msec)
• Block transfer time: time to read all sectors of the block (< 0.01 msec)
– Bulk transfer rate for n adjacent blocks (< 20 msec for n = 10)
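Summing the three components (a sketch using the slide's upper-bound figures for the example drive) shows why bulk transfers pay off: seek and rotational latency dominate, and they are paid once per random access:

```python
# Upper-bound timings from above (4 KiB blocks), in milliseconds
SEEK = 8.0       # position the head on the correct cylinder
LATENCY = 0.14   # wait until the block arrives below the head
TRANSFER = 0.01  # read all sectors of one block

def random_ms(n_blocks: int) -> float:
    """n random blocks: seek + latency is paid for every single block."""
    return n_blocks * (SEEK + LATENCY + TRANSFER)

def bulk_ms(n_blocks: int) -> float:
    """n adjacent blocks: seek + latency once, then stream the blocks."""
    return SEEK + LATENCY + n_blocks * TRANSFER

print(round(random_ms(1), 2))   # 8.15 (< 10 msec, as stated above)
print(round(bulk_ms(10), 2))    # 8.24 (< 20 msec, as stated above)
print(round(random_ms(10), 1))  # 81.5 -> ~10x slower than the bulk transfer
```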
2.2 HD – Controller
• Locating data on a disk is a major bottleneck
– Try operating on data already in the buffer
– Aim for bulk transfers, avoid random block transfers
2.2 HD – Controller
• A single HD is often not sufficient
– Limited capacity
– Limited speed
– Limited reliability
• Idea: combine multiple HDs into a RAID array (Redundant Array of Independent Disks)
– A RAID array treats multiple hardware disks as a single logical disk
• More HDs for increased capacity
2.3 RAID
• The RAID controller connects to multiple hard disks
– Disks are virtualized and appear to be just one single logical disk
– The RAID controller acts as an extended, specialized HBA (Host Bus Adapter)
– Still DAS (Directly Attached Storage)
2.3 RAID Controller
[Figure: the RAID controller connects via the internal bus and a peripheral bus to several disks, which are represented as a single logical disk]
• Mirroring (or shadowing): increases reliability by complete redundancy
• Idea: mirror disks are exact copies of the original disk
– Not space efficient
• Read speed can be n times as fast; write speed does not increase
• Increases reliability. Assume:
– Two disks with an MTBF of 11 years each
• One original disk, one mirror disk
• Disk failures are independent of each other (unrealistic)
– Disk replacement time of 10 hours
2.3 RAID Principles – Mirroring
• Striping: improves performance by parallelism
• Idea: distribute the data among all disks for increased performance
• Bit-level striping: split the bits of each byte across the disks
– e.g., for 8 disks, write the i-th bit to disk i
– The number of disks needs to be a power of 2
– Each disk is involved in each access
• Access rate does not increase
• Read and write transfer speed increases linearly
• Simultaneous accesses are not possible
– Good for speeding up few, sequential, and large accesses
Silber 11.3
2.3 RAID Principles – Striping
• Block-level striping: distribute blocks among the disks
– Only one disk is involved in reading a specific block
• Read and write speed of a single block is not increased
• Other disks are still free to read/write other blocks
• Read and write speed of multiple accesses increases
– Good for large numbers of parallel accesses
2.3 RAID Principles – Striping
• Error-correcting codes: increase reliability with computed redundancy
• Hamming codes (~1940)
– Can detect and repair 1-bit errors within a set of n data bits by computing k parity bits
• n = 2^k - k - 1
• k=2: n=1; k=3: n=4; k=4: n=11; k=5: n=26; …
– Especially used for in-memory and tape error correction
• These media cannot detect errors autonomously
• Not really used for hard drives anymore
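The relation n = 2^k - k - 1 between data bits and parity bits can be tabulated directly, reproducing the number pairs above (a minimal sketch):

```python
def hamming_data_bits(k: int) -> int:
    """Maximum number n of data bits protected by k parity bits in a Hamming code."""
    return 2**k - k - 1

for k in range(2, 6):
    print(f"k={k}: n={hamming_data_bits(k)}")
# k=2: n=1, k=3: n=4, k=4: n=11, k=5: n=26
```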
2.3 RAID Principles – Error Correction Codes
• Interleaved parity (Reed-Solomon algorithm on the Galois field GF(2))
– Can repair 1-bit errors (when the error position is known)
– Hard disks can detect read errors themselves, so no complete Hamming codes are needed
– Basic idea:
• From n data pieces D1, …, Dn compute parity data Dp by combining the data using logical XOR (eXclusive OR)
– XOR is associative and commutative
– Important: A XOR B XOR B = A
• i.e., Dp = D1 XOR D2 XOR … XOR Dn
2.3 RAID Principles – Error Correction Codes
• Interleaved parity. Example:
– A = 0101, B = 1100, C = 1011
– P = A XOR B XOR C = 0010
• C is lost. Recover it:
– P = A XOR B XOR C
– C = P XOR A XOR B
– C = A XOR B XOR C XOR A XOR B
– C = (A XOR A) XOR (B XOR B) XOR C
– C = 0 XOR C
– C = 1011
2.3 RAID Principles – Interleaved Parity
Computing P:          Recovering C:
     0101                  0010
 XOR 1100              XOR 0101
 XOR 1011              XOR 1100
 P = 0010              C = 1011
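The worked example is mechanical enough to script; a minimal sketch of interleaved-parity computation and recovery on the same bit patterns (the function names are mine):

```python
from functools import reduce

def parity(blocks):
    """Dp = D1 XOR D2 XOR ... XOR Dn (XOR is associative and commutative)."""
    return reduce(lambda a, b: a ^ b, blocks)

def recover(p, surviving):
    """Rebuild a lost block: C = P XOR A XOR B, since A XOR A and B XOR B cancel."""
    return parity(surviving + [p])

A, B, C = 0b0101, 0b1100, 0b1011
P = parity([A, B, C])
print(f"P = {P:04b}")                   # P = 0010
print(f"C = {recover(P, [A, B]):04b}")  # C = 1011, reconstructed from P, A, B
```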
• The 3 RAID principles can be combined in multiple ways
– Not every combination is useful
• This led to the definition of 7 core RAID levels
– RAID 0 – RAID 6
– The most dominant levels are RAID 0, RAID 1, RAID 1+0, and RAID 5
• In the following examples, assume:
– An MTBF of 100,000 hours (11.42 years) per disk
– A Mean Time To Repair (MTTR) of 6 hours
– The failure rate is constant, and failures of different disks are independent
– MTBFraid is the mean time to data loss within the RAID if each failing disk is replaced within the MTTR
– D is the number of drives in the RAID set
– C = 200 GB is the capacity of one disk, Craid the capacity of the whole RAID
2.3 RAID in practical applications
• Mean Time To Repair (MTTR)
– MTTR = TimeToNotice + RebuildTime
– Assume a time to notice of 0.5 hours
– The rebuild time is the time for completely writing back the lost data
• Assume a disk capacity of 200 GB
• Write-back speed of 10 MB/sec
– Consisting of reading the remaining disks and computing parity / reconstructing data
• Rebuild time is around 5.5 hours
– During a rebuild, a RAID is especially vulnerable
– MTTR = 6 hours
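The 6-hour MTTR used in all the following examples follows directly from these assumptions; a quick sketch of the arithmetic:

```python
def rebuild_hours(capacity_gb: float, writeback_mb_per_sec: float) -> float:
    """Time to completely write back the lost data of one disk."""
    seconds = capacity_gb * 1000 / writeback_mb_per_sec  # GB -> MB
    return seconds / 3600

TIME_TO_NOTICE = 0.5                    # hours, assumed above
rebuild = rebuild_hours(200, 10)        # 200 GB at 10 MB/sec
print(round(rebuild, 1))                # 5.6 -> "around 5.5 hours"
print(round(TIME_TO_NOTICE + rebuild))  # 6 -> MTTR = 6 hours
```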
2.3 RAID in practical applications
• File A (A1-Ax), File B (B1-Bx), File C (C1-Cx)
• RAID 0
– Block-level striping only
– Increased parallel access and transfer speeds, reduced reliability
– All disks contain data (0% overhead)
– Works with any number of disks
– MTBFraid = MTBFdisk / D
– 4 disks:
• MTBFraid = 2.86 years
• Craid = 800 GB (0 GB wasted (0%))
– Common size: 2 disks
2.3 RAID Levels
• RAID 1
– Mirroring only
– Increased reliability, increased read transfer speed, low space efficiency
– MTBFraid = MTBFdisk^D / (D! * MTTR^(D-1))
– 4 disks:
• MTBFraid = 2.2 trillion years
• Craid = 200 GB (600 GB wasted (75%))
• The age of the universe may be around 15 billion years…
– Common size: 2 disks
• MTBFraid = 95,130 years
• Craid = 200 GB (200 GB wasted (50%))
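Evaluating the mirroring formula with the standing assumptions (MTBF 100,000 h, MTTR 6 h) reproduces the quoted values up to rounding; a sketch:

```python
from math import factorial

HOURS_PER_YEAR = 8760
MTBF_DISK = 100_000  # hours, standing assumption
MTTR = 6             # hours, standing assumption

def raid1_mtbf_years(d: int) -> float:
    """MTBFraid = MTBFdisk^D / (D! * MTTR^(D-1)), converted to years."""
    return MTBF_DISK**d / (factorial(d) * MTTR**(d - 1)) / HOURS_PER_YEAR

print(round(raid1_mtbf_years(2)))  # ~95,129 years for a 2-disk mirror (slide: 95,130)
print(raid1_mtbf_years(4))         # ~2.2e12 years for a 4-way mirror
```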
2.3 RAID Levels
• RAID 2
– Not used in practice anymore
• Was used in old mainframes
– Bit-level striping
– Uses Hamming codes
• Usually Hamming code (7,4) – 4 data bits, 3 parity bits
• Reliable 1-bit error recovery (i.e., one disk may fail)
– 3 redundant disks per 4 data disks (75% overhead)
• The ratio is better for larger numbers of disks
– MTBFraid = MTBFdisk^2 / (D * (D-1) * MTTR)
– 7 disks (does not really make sense for 4 – not comparable to the other values):
• MTBFraid = 4,530 years
2.3 RAID Levels
• RAID 3
– Interleaved parity
– Byte-level striping
– Dedicated parity disk
• Bottleneck! Every write operation needs to update the parity disk
• No parallel writes
– 1 redundant disk per n data disks
• The overhead decreases with the number of disks, while reliability also decreases
• 25% overhead for 4 disks (3 data disks + 1 parity disk)
– MTBFraid = MTBFdisk^2 / (D * (D-1) * MTTR)
– 4 disks:
• MTBFraid = 15,854 years
• Craid = 600 GB (200 GB wasted (25%))
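The single-parity formula shared by RAID 3 and RAID 4 can be checked the same way (a sketch; results match the quoted values up to rounding):

```python
HOURS_PER_YEAR = 8760
MTBF_DISK = 100_000  # hours, standing assumption
MTTR = 6             # hours, standing assumption

def single_parity_mtbf_years(d: int) -> float:
    """MTBFraid = MTBFdisk^2 / (D * (D-1) * MTTR): any of the D disks fails first,
    then any of the remaining D-1 fails within the MTTR window."""
    return MTBF_DISK**2 / (d * (d - 1) * MTTR) / HOURS_PER_YEAR

print(round(single_parity_mtbf_years(4)))  # ~15,855 years (slide: 15,854) -- 4 disks
print(round(single_parity_mtbf_years(5)))  # ~9,513 years -- 5 disks
```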
2.3 RAID Levels
• RAID 4
– Block-level striping
– As RAID 3 otherwise
– 4 disks (common size):
• MTBFraid = 15,854 years
• Craid = 600 GB (200 GB wasted (25%))
– 5 disks (also common size):
• MTBFraid = 9,513 years
• Craid = 800 GB (200 GB wasted (20%))
2.3 RAID Levels
• RAID 5
– Parity is distributed among the hard disks
• May allow for parallel block writes
– As RAID 4 otherwise
– Bottleneck when writing many files smaller than a block
• The whole parity block has to be read and re-written for each small write
– Can recover from a single disk failure
– MTBFraid and Craid as for RAID 3 & 4
2.3 RAID Levels
• RAID 6
– Two independent parity blocks distributed among the disks
• May be implemented by parity on orthogonal data or by using Reed-Solomon codes on GF(2^8)
– As RAID 5 otherwise
– 2 redundant disks per n data disks
• Can recover from a double disk failure
• No vulnerability during a single-failure rebuild
• Very suitable for larger arrays
• Write overhead due to the more complicated parity computation
– MTBFraid = MTBFdisk^3 / (D * (D-1) * (D-2) * MTTR^2)
– 4 disks:
• MTBFraid = 132 million years
• Craid = 400 GB (400 GB wasted (50%))
– 8 disks (common)
2.3 RAID Levels
• Additionally, there are hybrid levels combining the core levels
– RAID 0+1, RAID 1+0, RAID 5+0, RAID 5+1, RAID 6+6, …
• RAID 1+0
– Mirrored sets nested in a striped set
• RAID 0 on top of RAID 1 sets
– Very high read and write transfer speeds, increased reliability, low space efficiency, limited maximum size
– Most performant RAID combination
– D1 = drives per RAID 1 set, D0 = number of RAID 1 sets
– MTBFraid = MTBFdisk^D1 / (D1! * MTTR^(D1-1)) / D0
– 4 disks: D1 = 2, D0 = 2
• MTBFraid = 47,565 years
• Craid = 400 GB (400 GB wasted (50%))
– 6 disks: D1 = 2, D0 = 3
• MTBFraid = 31,706 years
• Craid = 600 GB (600 GB wasted (50%))
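The nested formula is just the RAID 1 term divided by the number of mirrored sets D0; evaluating both configurations (a sketch, results match the quoted values up to rounding):

```python
from math import factorial

HOURS_PER_YEAR = 8760
MTBF_DISK = 100_000  # hours, standing assumption
MTTR = 6             # hours, standing assumption

def raid10_mtbf_years(d1: int, d0: int) -> float:
    """MTBFraid = MTBFdisk^D1 / (D1! * MTTR^(D1-1)) / D0:
    D0 striped sets, each a RAID 1 mirror of D1 drives."""
    per_set = MTBF_DISK**d1 / (factorial(d1) * MTTR**(d1 - 1))
    return per_set / d0 / HOURS_PER_YEAR

print(round(raid10_mtbf_years(2, 2)))  # ~47,565 years -- 4 disks
print(round(raid10_mtbf_years(2, 3)))  # ~31,710 years -- 6 disks (slide: 31,706)
```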
2.3 Practical Use of RAIDs
• RAID controllers directly connect storage to the system bus
– Storage is available to only one system / server / application
• The number of disks is limited
– Consumer-grade RAID: 2-4 disks
– Enterprise-grade RAID: 8-24+ disks
• Solutions
– NAS (Network Attached Storage): provides abstracted file systems via the network (software solution)
– SAN (Storage Area Network): virtualized logical storage within a specialized network on block level (hardware solution)
2.4 Beyond RAID
• Before discussing NAS, we need file systems
• A file system is software for abstracting file operations on a logical storage device
– Files are collections of binary data
• Creating, reading, writing, deleting, finding, organizing
– How does a file access translate into operations on a logical storage device?
• e.g., which blocks have to be read/written?
• Bridge between application software and the (abstracted) hardware
2.4 File Systems vs. Raw Devices
[Figure: layering – Application Software on top of the File System on top of Logical Storage]
• Raw device access allows applications to bypass the OS and the file system
• The application may directly tune aspects of physical storage
– There is still the hard drive controller… so it's not really direct
• May lead to very efficient implementations
2.4 File Systems vs. Raw Devices
[Figure: raw device access – Application Software operates directly on Logical Storage, bypassing the file system]
• Idea: provide a remote file system using already available network infrastructure
– NAS: Network Attached Storage
– Uses specialized network protocols (e.g., CIFS, NFS, FTP, etc.)
– Easiest case: a file server (e.g., Linux + Samba)
• Advantages:
– Easy to set up, easy to use, cheap infrastructure
– Allows sharing of storage among several systems
– Abstracts on the file system level (easy for most applications)
• Disadvantages:
– Inefficient and slow
• Large protocol and processing overhead
– Abstracts on the file system level (not suitable for special purposes like raw devices or storage virtualization)
2.4 NAS – Network Attached Storage
[Figure: NAS – Application Software talks to a File System whose Logical Storage is provided over the Network by a NAS Server]
• SANs offer specialized high-speed networks for storage devices
– Usually use local FibreChannel networks
– Remote locations may be connected via Ethernet or IP-WAN (Internet)
– The network uses specialized storage protocols
• iFCP (SCSI on FibreChannel)
• iSCSI (SCSI on TCP/IP)
• HyperSCSI (SCSI on raw Ethernet)
• SANs provide raw block-level access to logical storage devices
– Logical disks of any size can be offered by the SAN
– For a client system using a logical disk, it appears like a local disk or RAID
– The client system has full control over the file systems on its logical disks
2.4 SAN – Storage Area Network
[Figure: SAN – Application Software and File System reside on the client; the SAN provides the block storage underneath]
2.4 SAN – Storage Area Network
[Figure: SAN topology – servers with SAN HBAs connect through SAN switches over a SAN bus (iFCP); a SAN/RAID HBA attaches disks via a peripheral bus (SCSI, SAS, etc.); remote sites connect over a WAN-SAN bus (HyperSCSI); a NAS head exports the SAN via a NAS protocol (CIFS) to an Ethernet network]
• Advantages:
– Very efficient
• Highly optimized local network infrastructure
• Optimized protocols with low overhead
– Very flexible (any number of systems may use any number of disks at any location)
– Helps with disaster protection
• A SAN can transparently span even remote locations
– May also employ NAS heads for NAS-like behavior
• Disadvantages
2.4 SAN – Storage Area Network
• There are different types of storage
– Usually, there is a storage hierarchy
• Faster, smaller, more expensive storage
• Slower, bigger, less expensive storage
• Hard drives are currently the most popular media
– Mechanical devices
• High sequential transfer rates
• Bad random access times, low random transfer rates
• Prone to failure
– DBMS must be optimized for the storage devices used!
2 Physical Storage
• Access Paths
– Physical Data Access
– Index Structures