Christoph Lofi, Philipp Wille
Institut für Informationssysteme
Relational Database Systems 2
2. Physical Data Storage
Relational Database Systems 2 – Wolf-Tilo Balke – Institut für Informationssysteme 2
2 Architecture
[Figure: DBMS architecture – the Query Processor (DML Compiler, Embedded DML Precompiler, DDL Interpreter, Query Evaluation Engine) and the Data Storage Manager (Transaction Manager, Buffer Manager, File Manager) operate on Indices, Statistics, Catalog/Dictionary, and the DB Scheme; accessed via Application Interfaces, Application Programs / Object Code, and Direct Query by Application Programmers and DB Administrators]
2.1 Introduction
2.2 Hard Disks
2.3 RAIDs
2.4 SANs and NAS
2.5 Case Study
2 Physical Data Storage
• A DBMS needs to retrieve, update, and process persistently stored data
– Storage considerations are an important factor in planning a database system (physical layer)
– Remember: the data has to be stored securely, but access to the data should be declarative!
2.1 Physical Storage Introduction
• Data is stored on storage media. Media differ greatly in terms of
– Random access speed
– Random/sequential read/write speed
– Capacity
– Cost per capacity
2.1 Physical Storage Introduction
• Capacity: quantifies the amount of data which can be stored
– Base units: 1 Bit, 1 Byte = 2^3 Bit = 8 Bit
– Capacity units according to IEC, IEEE, NIST, etc.:
• Usually used for file sizes and primary storage (for a higher degree of confusion, sometimes used with SI abbreviations…)
• 1 KiB = 1024^1 Byte; 1 MiB = 1024^2 Byte; 1 GiB = 1024^3 Byte; …
– Capacity units according to SI:
• Usually used for advertising secondary/tertiary storage
• 1 KB = 1000^1 Byte ≈ 0.976 KiB; 1 MB = 1000^2 Byte ≈ 0.954 MiB; 1 GB = 1000^3 Byte ≈ 0.931 GiB; …
– Especially used by the networking community:
• 1 Kb = 1000^1 Bit = 0.125 KB ≈ 0.122 KiB; 1 Mb = 1000^2 Bit = 0.125 MB ≈ 0.119 MiB
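Because the two unit systems drift further apart with every prefix, a small sketch (Python chosen here purely for illustration; the constant names are mine) makes the conversion factors above explicit:

```python
# IEC (binary) vs. SI (decimal) capacity units, as listed above
KIB, MIB, GIB = 1024**1, 1024**2, 1024**3   # KiB, MiB, GiB
KB, MB, GB = 1000**1, 1000**2, 1000**3      # KB, MB, GB

# An advertised SI unit holds fewer bytes than its IEC counterpart:
print(round(KB / KIB, 4))  # 0.9766 -> 1 KB ~ 0.976 KiB
print(round(MB / MIB, 4))  # 0.9537 -> 1 MB ~ 0.954 MiB
print(round(GB / GIB, 4))  # 0.9313 -> 1 GB ~ 0.931 GiB

# Networking units count bits, not bytes:
KBIT = 1000                # 1 Kb = 1000 Bit
print(KBIT / 8 / KB)       # 0.125 -> 1 Kb = 0.125 KB
```

Note that the gap grows with the prefix: about 2.4% at kilo, 4.6% at mega, and 6.9% at giga.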
2.1 Relevant Media Characteristics
2.1 A Kilo-Joke
• Random Access Time: average time to access a random piece of data at a known media position
– Usually measured in ms or ns
– Within some media, the access time can vary depending on the position (e.g., hard disks)
• Transfer Rate: average amount of consecutive data which can be transferred per time unit
– Usually measured in KB/sec, MB/sec, GB/sec, …
– Sometimes also in Kb/sec, Mb/sec, Gb/sec
2.1 Characteristic Parameters
• Volatile: memory needs constant power to keep data
– Dynamic: dynamic volatile memory needs to be “refreshed” regularly to keep data
– Static: no refresh necessary
• Access modes
– Random access: any piece of data can be accessed in approximately the same time
– Sequential access: data can only be accessed in sequential order
• Write mode
– Mutable storage: can be read and written arbitrarily
– Write Once Read Many (WORM): can be written only once, but read arbitrarily often
2.1 Other Characteristics
• Online media
– “always on”
– Each single piece of data can be accessed fast
– e.g., hard drives, main memory
• Nearline media
– Compromise between online and offline
– Offline media that can automatically be put “online”
– e.g., jukeboxes, robotic tape libraries
• Offline media (disconnected media)
– Not under direct control of the processing unit
– Have to be connected manually
– e.g., a box of backup tapes in the basement
2.1 Online, Nearline, Offline
• Media characteristics result in a storage hierarchy
• DBMS optimize data distribution among the storage levels
– Primary Storage: Fast, limited capacity, high price, usually volatile electronic storage
• Frequently used data / current work data
– Secondary Storage: Slower, large capacity, lower price
• Main stored data
– Tertiary Storage: Even slower, huge capacity, even lower price, usually offline
2.1 The Storage Hierarchy
2.1 The Storage Hierarchy
[Figure: storage hierarchy pyramid – Primary: Cache, RAM (~100 ns); Secondary: Flash, Magnetic Disks (~10 ms); Tertiary: Optical Disks, Tape (> 1 s); cost per capacity and speed decrease from top to bottom]
2.1 Storage Media – Examples

Type  Media                                      Size     Random Acc.    Transfer     Characteristics     Price        Price/GB
Pri   L1 Processor Cache                         32 KB    5 x 10^-10 s   15.4 GB/sec  Vol, Stat, RA, OL
Pri   DDR3-RAM (Corsair Dominator Platinum)      8 GB     2.6 x 10^-8 s  12.3 GB/sec  Vol, Dyn, RA, OL    € 160        € 20
Sec   Harddrive SSD (Samsung 840 PRO)            256 GB   4 x 10^-6 s    513 MB/sec   Stat, RA, OL        € 187        € 0.73
Sec   Harddrive Magnetic (Seagate ST2000DM001)   2000 GB  5.7 x 10^-4 s  153 MB/sec   Stat, RA, OL        € 100        € 0.05
Ter   Blank recordable DVD-R disk                4.7 GB   9.8 x 10^-2 s  11 MB/sec    Stat, RA, OF, WORM  € 0.15/Disk  € 0.03
Ter   LTO-5 tape (TDK LTO Ultrium 5 Cartridge)   1500 GB  58 s           280 MB/sec   Stat, SA, OF        € 15/Tape    € 0.01

Pri=Primary, Sec=Secondary, Ter=Tertiary
Vol=Volatile, Stat=Static, Dyn=Dynamic, RA=Random Access, SA=Sequential Access, OL=Online, OF=Offline
• Hard drives are currently the standard for large, cheap, and persistent storage
– Usually used as the main storage media for most data in a DB
• DBMS need to be optimized for efficient disk storage and access
– Data access needs to be as fast as possible
– Frequently used data should be accessible at the highest speed; rarely needed data may take longer
– Different data items needed for certain recurring tasks should also be stored/accessed together
2.2 Magnetic Disk Storage – HDs
• Data is stored by directional magnetization of a ferromagnetic material
• Realized on hard disk platters
– Base platter made of a non-magnetic aluminum or glass substrate
– Magnetic grains are worked into the base platter to form magnetic regions
• Each region represents 1 Bit
– The read head can detect the magnetization direction of each region
– The write head may change the direction
2.2 HD – How does it work?
• Giant Magnetoresistance Effect (GMR)
– Discovered in 1988 simultaneously by Peter Grünberg and Albert Fert
• Both were honored with the 2007 Nobel Prize in Physics
– Allows the construction of efficient read heads:
• The electric resistance of alternating ferromagnetic and non-magnetic layers changes “giantly” with the direction of the applied magnetic field
– http://www.research.ibm.com/research/demos/gmr/cyberdemo1.htm
2.2 HD – Notable Technology Advances
• Perpendicular recording (used since 2005)
– Longitudinal recording is limited to ~200 Gb/inch^2 due to the superparamagnetic effect
• Thermal energy may spontaneously change the magnetic direction
– Perpendicular recording allows for up to 1000 Gb/inch^2
– Very simplified: align the magnetic field orthogonally to the surface instead of parallel
• Magnetic regions can be smaller
2.2 HD – Notable Technology Advances
• Usage of magnetic grains instead of continuous magnetic material
– Between magnetic direction transitions, Néel spikes are formed
• Areas of uncertain magnetic direction
– Néel spikes are larger for continuous materials
– Magnetic regions can be smaller as the transition width can be reduced
2.2 HD – Notable Technology Advances
• A hard disk is made up of multiple double-sided platters
– Platter sides are called surfaces
– Platters are fixed on the main spindle and rotate at equal and constant speed (common: 5400 rpm / 7200 rpm)
– Each surface has its own read and write head
– Heads are attached to arms
• Arms can position the heads along the surface
• Heads cannot move independently
– Heads have no contact with the surface and hover on top of an air bearing
2.2 HD – Basic Architecture
• Each surface is divided into circular tracks
– Some disks may use spirals
• All tracks of all surfaces with the same diameter are called a cylinder
– Data within the same cylinder can be accessed very efficiently
EN 13.2
2.2 HD – Basic Architecture
• Each track is subdivided into sectors of equal capacity
a) Fixed angle sector subdivision
• Same number of sectors per track, varying data density, constant rotational speed
b) Fixed data density
• Outer tracks have more sectors than inner tracks
• Transfer speed is higher on outer tracks
• Adjacent sectors can be
2.2 HD – Basic Architecture
• Hard drives are not completely reliable!
– Drives do fail
– Means for physical failure recovery are necessary
• Backups
• Redundancy
• Hard drives age and wear down. Wear is significantly increased by:
– Contact cycles (head parking)
– Spindle start-stops
– Power-on hours
– Operation outside the ideal environment
• Temperature too low/high
• Unstable voltage
2.2 HD – Reliability
• Reliability measures are statistical values assuming certain usage patterns
– Desktop usage (all per year): 2,400 hours, 10,000 motor start/stops, 25°C temperature
– Server usage (all per year): 8,760 hours, 250 motor start/stops, 40°C temperature
– Non-recoverable read errors: a sector on the surface cannot be read anymore – the data is lost
• Desktop disk: 1 per 10^14 read bits, server disk: 1 per 10^15 read bits
• The disk can detect this!
– Maximum contact cycles: maximum number of allowed head contacts (parking)
2.2 HD – Reliability
– Mean Time Between Failures (MTBF): statistically expected time until half of a large disk population has failed
• Drive manufacturers usually use optimistic simulations to estimate the MTBF
• Desktop: 0.7 million hours (80 years), Server: 1.2 million hours (137 years) – manufacturers' values
– Annualized Failure Rate (AFR): probability of a failure per year, based on the MTBF
• AFR = OperatingHoursPerYear / MTBF_hours
• Desktop: 0.34%, Server: 0.73%
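The AFR formula can be evaluated directly against the manufacturer figures above (a quick sketch; the function name is mine):

```python
def afr_percent(operating_hours_per_year: float, mtbf_hours: float) -> float:
    """Annualized Failure Rate in percent, derived from the MTBF as defined above."""
    return 100 * operating_hours_per_year / mtbf_hours

# Desktop: 2,400 operating hours/year, MTBF 0.7 million hours
print(round(afr_percent(2_400, 700_000), 2))    # 0.34 (%)
# Server: 8,760 operating hours/year (24/7), MTBF 1.2 million hours
print(round(afr_percent(8_760, 1_200_000), 2))  # 0.73 (%)
```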
2.2 HD – Reliability
• The failure rate during a hard disk's lifespan is not constant
• It can be better modeled by the “bathtub curve”, which has 3 components
– Infant mortality rate
– Wear-out failures
– Random failures
2.2 HD – Reliability
• Report by Google
– 100,000 consumer-grade disks (80-400 GB, ATA interface, 5,400-7,200 RPM)
• Results (among others)
– Drives fail often!
– There is infant mortality
– High usage increases infant mortality, but not later failure rates
– Observed AFR is around 7% and MTBF 16.6 years!
E. Pinheiro, W.-D. Weber, L. A. Barroso: “Failure Trends in a Large Disk Drive Population”, 5th USENIX Conference on File and Storage Technologies (FAST), 2007
2.2 Real World Failure Rates
Careful: 2+ year results are biased. See reference.
• Seagate ST32000641AS, 2 TB (desktop hard drive, 2011)
– Manufacturer's specifications:
2.2 HD – Example Specs
Specification       Value
Capacity            2 TB
Platters            4
Heads               8
Cylinders           16,383
Sectors per track   63
Bytes per sector    512
Spindle speed       7,200 RPM
MTBF                85 years
AFR                 0.34 %
• Assume a storage need of 100 TB. Only the following HDs are available:
– Capacity: 1 TB each
– MTBF: 100,000 hours each (ca. 11 years)
• Consider using 100 of these disks independently (w/o RAID)
– Total storage: 100,000 GB = 100 TB
– MTBF until the first disk fails: 1,000 hours (ca. 42 days)
– THIS IS BAD!
• More sophisticated ways of using multiple disks are needed
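The arithmetic behind the 42-day figure: with independent disks and a constant failure rate, the expected time until the first of D disks fails is the single-disk MTBF divided by D. A minimal sketch:

```python
def hours_to_first_failure(mtbf_disk_hours: float, num_disks: int) -> float:
    """Expected time until the FIRST of num_disks independent disks fails,
    assuming a constant (exponential) failure rate per disk."""
    return mtbf_disk_hours / num_disks

hours = hours_to_first_failure(100_000, 100)
print(hours)              # 1000.0 hours
print(round(hours / 24))  # 42 days
```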
2.2 Reliability – Considerations
• Alternative to hard drives: SSD
– Uses microchips which retain data in non-volatile memory chips and contains no moving parts
• Uses the same interface as hard disk drives
– Easy replacement possible in most applications
• Key components
– Memory
– Controller
2.2 Solid State Disk (SSD)
• Flash memory
– Most SSDs use NAND-based flash memory
– Retains data even without power
– Slower than DRAM solutions
– Single-level cells versus multi-level cells
– Wears down!
• DRAM
– Uses volatile random access memory
– Ultra-fast data access (< 10 microseconds)
– Sometimes uses an internal battery or external power device to ensure data persistence
– Only for applications that require even faster access, but do not need data persistence after power loss
2.2 Memory
• The controller is an embedded processor
• It incorporates the electronics that bridge the NAND memory components to the host computer
• Some of its functions:
– Error correction, wear leveling, bad-block mapping, read and write caching, encryption, garbage collection
2.2 Controller
• Advantages
– Low access time and latency
– No moving parts, hence shock resistant
• MTBF of about 2 million hours
– Lighter and more energy-efficient than HDDs
• Disadvantages
– Storage is divided into blocks/pages
• If one byte changes, the whole page has to be written
• The old page is marked as stale
• Only whole blocks can be deleted
– Limited number of rewrite cycles (between 3,000 and 100,000 per page)
• Wear-leveling algorithms ensure that write operations are distributed equally across the pages
2.2 SSD – Summary
• The disk controller organizes low-level access to the disk
– e.g., head positioning, error checking, signal processing
– Usually integrated into the disk
– Provides a unified and abstracted interface to access the disks (e.g., LBA)
– Connects the disk to a peripheral bus (e.g., IDE, SCSI, FibreChannel, SAS)
• The host bus adapter (HBA) bridges between the peripheral bus and the system's internal bus (like PCIe, PCI)
– The internal bus is usually integrated into the system's main board
– Often confused with the disk controller
• DAS (Directly Attached Storage)
2.2 HD – Controller
[Figure: the disk controller sits on the disk and connects to the peripheral bus; the host bus adapter bridges to the internal bus inside the system / on the mainboard]
• Sectors can be logically grouped into blocks by the operating system
– Sectors in a block do not necessarily need to be adjacent
– e.g., NTFS defaults to 4 KiB per block
• 8 sectors on a modern disk
• The hardware address of a block is a combination of
– cylinder number, surface number, and block number within the track
– The controller maps hardware addresses to logical block addresses (LBA)
2.2 HD – Controller
• The disk controller transfers the content of whole blocks to a buffer
– The buffer resides in primary storage and can be accessed efficiently
– Time needed to transfer a random block (4 KiB/block on ST3100034AS): < 10 msec
• Seek time: time needed to position the head on the correct cylinder (< 8 msec)
• Latency (rotational delay): time until the correct block arrives below the head (< 0.14 msec)
• Block transfer time: time to read all sectors of the block (< 0.01 msec)
– Bulk transfer rate for n adjacent blocks (< 20 msec for n = 10)
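Summing the three components (a sketch using the slide's upper-bound figures for the example drive) shows why bulk transfers pay off: seek and rotational latency dominate, and they are paid once per random access:

```python
# Upper-bound timings from above (4 KiB blocks), in milliseconds
SEEK = 8.0       # position the head on the correct cylinder
LATENCY = 0.14   # wait until the block arrives below the head
TRANSFER = 0.01  # read all sectors of one block

def random_ms(n_blocks: int) -> float:
    """n random blocks: seek + latency is paid for every single block."""
    return n_blocks * (SEEK + LATENCY + TRANSFER)

def bulk_ms(n_blocks: int) -> float:
    """n adjacent blocks: seek + latency once, then stream the blocks."""
    return SEEK + LATENCY + n_blocks * TRANSFER

print(round(random_ms(1), 2))   # 8.15 (< 10 msec, as stated above)
print(round(bulk_ms(10), 2))    # 8.24 (< 20 msec, as stated above)
print(round(random_ms(10), 1))  # 81.5 -> ~10x slower than the bulk transfer
```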
2.2 HD – Controller
• Locating data on a disk is a major bottleneck
– Try operating on data already in the buffer
– Aim for bulk transfers, avoid random block transfers
2.2 HD – Controller
• A single HD is often not sufficient
– Limited capacity
– Limited speed
– Limited reliability
• Idea: combine multiple HDs into a RAID array (Redundant Array of Independent Disks)
– A RAID array treats multiple hardware disks as a single logical disk
• More HDs for increased capacity
2.3 RAID
• The RAID controller connects to multiple hard disks
– Disks are virtualized and appear to be just one single logical disk
– The RAID controller acts as an extended, specialized HBA (Host Bus Adapter)
– Still DAS (Directly Attached Storage)
2.3 RAID Controller
[Figure: the RAID controller connects via the internal bus and a peripheral bus to several disks, which are represented as a single logical disk]
• Mirroring (or shadowing): increases reliability by complete redundancy
• Idea: mirror disks are exact copies of the original disk
– Not space efficient
• Read speed can be n times as fast; write speed does not increase
• Increases reliability. Assume:
– Two disks with an MTBF of 11 years each
• One original disk, one mirror disk
• Disk failures are independent of each other (unrealistic)
– Disk replacement time of 10 hours
2.3 RAID Principles – Mirroring
• Striping: improves performance by parallelism
• Idea: distribute the data among all disks for increased performance
• Bit-level striping: split the bits of each byte across the disks
– e.g., for 8 disks, write the i-th bit to disk i
– The number of disks needs to be a power of 2
– Each disk is involved in each access
• Access rate does not increase
• Read and write transfer speed increases linearly
• Simultaneous accesses are not possible
– Good for speeding up few, sequential, and large accesses
Silber 11.3
2.3 RAID Principles – Striping
• Block-level striping: distribute blocks among the disks
– Only one disk is involved in reading a specific block
• Read and write speed of a single block is not increased
• Other disks are still free to read/write other blocks
• Read and write speed of multiple accesses increases
– Good for large numbers of parallel accesses
2.3 RAID Principles – Striping
• Error-correcting codes: increase reliability with computed redundancy
• Hamming codes (~1940)
– Can detect and repair 1-bit errors within a set of n data bits by computing k parity bits
• n = 2^k - k - 1
• k=2: n=1; k=3: n=4; k=4: n=11; k=5: n=26; …
– Especially used for in-memory and tape error correction
• These media cannot detect errors autonomously
• Not really used for hard drives anymore
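The relation n = 2^k - k - 1 between data bits and parity bits can be tabulated directly, reproducing the number pairs above (a minimal sketch):

```python
def hamming_data_bits(k: int) -> int:
    """Maximum number n of data bits protected by k parity bits in a Hamming code."""
    return 2**k - k - 1

for k in range(2, 6):
    print(f"k={k}: n={hamming_data_bits(k)}")
# k=2: n=1, k=3: n=4, k=4: n=11, k=5: n=26
```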
2.3 RAID Principles – Error Correction Codes
• Interleaved parity (Reed-Solomon algorithm on the Galois field GF(2))
– Can repair 1-bit errors (when the error position is known)
– Hard disks can detect read errors themselves, so no complete Hamming codes are needed
– Basic idea:
• From n data pieces D1, …, Dn compute parity data Dp by combining the data using logical XOR (eXclusive OR)
– XOR is associative and commutative
– Important: A XOR B XOR B = A
• i.e., Dp = D1 XOR D2 XOR … XOR Dn
2.3 RAID Principles – Error Correction Codes
• Interleaved parity. Example:
– A = 0101, B = 1100, C = 1011
– P = A XOR B XOR C = 0010
• C is lost. Recover it:
– P = A XOR B XOR C
– C = P XOR A XOR B
– C = A XOR B XOR C XOR A XOR B
– C = (A XOR A) XOR (B XOR B) XOR C
– C = 0 XOR C
– C = 1011
2.3 RAID Principles – Interleaved Parity
Computing P:          Recovering C:
     0101                  0010
 XOR 1100              XOR 0101
 XOR 1011              XOR 1100
 P = 0010              C = 1011
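The worked example is mechanical enough to script; a minimal sketch of interleaved-parity computation and recovery on the same bit patterns (the function names are mine):

```python
from functools import reduce

def parity(blocks):
    """Dp = D1 XOR D2 XOR ... XOR Dn (XOR is associative and commutative)."""
    return reduce(lambda a, b: a ^ b, blocks)

def recover(p, surviving):
    """Rebuild a lost block: C = P XOR A XOR B, since A XOR A and B XOR B cancel."""
    return parity(surviving + [p])

A, B, C = 0b0101, 0b1100, 0b1011
P = parity([A, B, C])
print(f"P = {P:04b}")                   # P = 0010
print(f"C = {recover(P, [A, B]):04b}")  # C = 1011, reconstructed from P, A, B
```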
• The 3 RAID principles can be combined in multiple ways
– Not every combination is useful
• This led to the definition of 7 core RAID levels
– RAID 0 – RAID 6
– The most dominant levels are RAID 0, RAID 1, RAID 1+0, and RAID 5
• In the following examples, assume:
– An MTBF of 100,000 hours (11.42 years) per disk
– A Mean Time To Repair (MTTR) of 6 hours
– The failure rate is constant, and failures of different disks are independent
– MTBFraid is the mean time to data loss within the RAID if each failing disk is replaced within the MTTR
– D is the number of drives in the RAID set
– C = 200 GB is the capacity of one disk, Craid the capacity of the whole RAID
2.3 RAID in practical applications
• Mean Time To Repair (MTTR)
– MTTR = TimeToNotice + RebuildTime
– Assume a time to notice of 0.5 hours
– The rebuild time is the time for completely writing back the lost data
• Assume a disk capacity of 200 GB
• Write-back speed of 10 MB/sec
– Consisting of reading the remaining disks and computing parity / reconstructing data
• Rebuild time is around 5.5 hours
– During a rebuild, a RAID is especially vulnerable
– MTTR = 6 hours
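The 6-hour MTTR used in all the following examples follows directly from these assumptions; a quick sketch of the arithmetic:

```python
def rebuild_hours(capacity_gb: float, writeback_mb_per_sec: float) -> float:
    """Time to completely write back the lost data of one disk."""
    seconds = capacity_gb * 1000 / writeback_mb_per_sec  # GB -> MB
    return seconds / 3600

TIME_TO_NOTICE = 0.5                    # hours, assumed above
rebuild = rebuild_hours(200, 10)        # 200 GB at 10 MB/sec
print(round(rebuild, 1))                # 5.6 -> "around 5.5 hours"
print(round(TIME_TO_NOTICE + rebuild))  # 6 -> MTTR = 6 hours
```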
2.3 RAID in practical applications
• File A (A1-Ax), File B (B1-Bx), File C (C1-Cx)
• RAID 0
– Block-level striping only
– Increased parallel access and transfer speeds, reduced reliability
– All disks contain data (0% overhead)
– Works with any number of disks
– MTBFraid = MTBFdisk / D
– 4 disks:
• MTBFraid = 2.86 years
• Craid = 800 GB (0 GB wasted (0%))
– Common size: 2 disks
2.3 RAID Levels
• RAID 1
– Mirroring only
– Increased reliability, increased read transfer speed, low space efficiency
– MTBFraid = MTBFdisk^D / (D! * MTTR^(D-1))
– 4 disks:
• MTBFraid = 2.2 trillion years
• Craid = 200 GB (600 GB wasted (75%))
• The age of the universe may be around 15 billion years…
– Common size: 2 disks
• MTBFraid = 95,130 years
• Craid = 200 GB (200 GB wasted (50%))
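Evaluating the mirroring formula with the standing assumptions (MTBF 100,000 h, MTTR 6 h) reproduces the quoted values up to rounding; a sketch:

```python
from math import factorial

HOURS_PER_YEAR = 8760
MTBF_DISK = 100_000  # hours, standing assumption
MTTR = 6             # hours, standing assumption

def raid1_mtbf_years(d: int) -> float:
    """MTBFraid = MTBFdisk^D / (D! * MTTR^(D-1)), converted to years."""
    return MTBF_DISK**d / (factorial(d) * MTTR**(d - 1)) / HOURS_PER_YEAR

print(round(raid1_mtbf_years(2)))  # ~95,129 years for a 2-disk mirror (slide: 95,130)
print(raid1_mtbf_years(4))         # ~2.2e12 years for a 4-way mirror
```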
2.3 RAID Levels
• RAID 2
– Not used in practice anymore
• Was used in old mainframes
– Bit-level striping
– Uses Hamming codes
• Usually Hamming code (7,4) – 4 data bits, 3 parity bits
• Reliable 1-bit error recovery (i.e., one disk may fail)
– 3 redundant disks per 4 data disks (75% overhead)
• The ratio is better for larger numbers of disks
– MTBFraid = MTBFdisk^2 / (D * (D-1) * MTTR)
– 7 disks (does not really make sense for 4 – not comparable to the other values):
• MTBFraid = 4,530 years
2.3 RAID Levels
• RAID 3
– Interleaved parity
– Byte-level striping
– Dedicated parity disk
• Bottleneck! Every write operation needs to update the parity disk
• No parallel writes
– 1 redundant disk per n data disks
• The overhead decreases with the number of disks, while reliability also decreases
• 25% overhead for 4 disks (3 data disks + 1 parity disk)
– MTBFraid = MTBFdisk^2 / (D * (D-1) * MTTR)
– 4 disks:
• MTBFraid = 15,854 years
• Craid = 600 GB (200 GB wasted (25%))
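The single-parity formula shared by RAID 3 and RAID 4 can be checked the same way (a sketch; results match the quoted values up to rounding):

```python
HOURS_PER_YEAR = 8760
MTBF_DISK = 100_000  # hours, standing assumption
MTTR = 6             # hours, standing assumption

def single_parity_mtbf_years(d: int) -> float:
    """MTBFraid = MTBFdisk^2 / (D * (D-1) * MTTR): any of the D disks fails first,
    then any of the remaining D-1 fails within the MTTR window."""
    return MTBF_DISK**2 / (d * (d - 1) * MTTR) / HOURS_PER_YEAR

print(round(single_parity_mtbf_years(4)))  # ~15,855 years (slide: 15,854) -- 4 disks
print(round(single_parity_mtbf_years(5)))  # ~9,513 years -- 5 disks
```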
2.3 RAID Levels
• RAID 4
– Block-level striping
– As RAID 3 otherwise
– 4 disks (common size):
• MTBFraid = 15,854 years
• Craid = 600 GB (200 GB wasted (25%))
– 5 disks (also common size):
• MTBFraid = 9,513 years
• Craid = 800 GB (200 GB wasted (20%))
2.3 RAID Levels
• RAID 5
– Parity is distributed among the hard disks
• May allow for parallel block writes
– As RAID 4 otherwise
– Bottleneck when writing many files smaller than a block
• The whole parity block has to be read and re-written for each small write
– Can recover from a single disk failure
– MTBFraid and Craid as for RAID 3 & 4
2.3 RAID Levels
• RAID 6
– Two independent parity blocks distributed among the disks
• May be implemented by parity on orthogonal data or by using Reed-Solomon codes on GF(2^8)
– As RAID 5 otherwise
– 2 redundant disks per n data disks
• Can recover from a double disk failure
• No vulnerability during a single-failure rebuild
• Very suitable for larger arrays
• Write overhead due to the more complicated parity computation
– MTBFraid = MTBFdisk^3 / (D * (D-1) * (D-2) * MTTR^2)
– 4 disks:
• MTBFraid = 132 million years
• Craid = 400 GB (400 GB wasted (50%))
– 8 disks (common)
2.3 RAID Levels
• Additionally, there are hybrid levels combining the core levels
– RAID 0+1, RAID 1+0, RAID 5+0, RAID 5+1, RAID 6+6, …
• RAID 1+0
– Mirrored sets nested in a striped set
• RAID 0 on top of RAID 1 sets
– Very high read and write transfer speeds, increased reliability, low space efficiency, limited maximum size
– Most performant RAID combination
– D1 = drives per RAID 1 set, D0 = number of RAID 1 sets
– MTBFraid = MTBFdisk^D1 / (D1! * MTTR^(D1-1)) / D0
– 4 disks: D1 = 2, D0 = 2
• MTBFraid = 47,565 years
• Craid = 400 GB (400 GB wasted (50%))
– 6 disks: D1 = 2, D0 = 3
• MTBFraid = 31,706 years
• Craid = 600 GB (600 GB wasted (50%))
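The nested formula is just the RAID 1 term divided by the number of mirrored sets D0; evaluating both configurations (a sketch, results match the quoted values up to rounding):

```python
from math import factorial

HOURS_PER_YEAR = 8760
MTBF_DISK = 100_000  # hours, standing assumption
MTTR = 6             # hours, standing assumption

def raid10_mtbf_years(d1: int, d0: int) -> float:
    """MTBFraid = MTBFdisk^D1 / (D1! * MTTR^(D1-1)) / D0:
    D0 striped sets, each a RAID 1 mirror of D1 drives."""
    per_set = MTBF_DISK**d1 / (factorial(d1) * MTTR**(d1 - 1))
    return per_set / d0 / HOURS_PER_YEAR

print(round(raid10_mtbf_years(2, 2)))  # ~47,565 years -- 4 disks
print(round(raid10_mtbf_years(2, 3)))  # ~31,710 years -- 6 disks (slide: 31,706)
```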
2.3 Practical Use of RAIDs
• RAID controllers directly connect storage to the system bus
– Storage is available to only one system / server / application
• The number of disks is limited
– Consumer-grade RAID: 2-4 disks
– Enterprise-grade RAID: 8-24+ disks
• Solutions
– NAS (Network Attached Storage): provides abstracted file systems via the network (software solution)
– SAN (Storage Area Network): virtualized logical storage within a specialized network on block level (hardware solution)
2.4 Beyond RAID
• Before discussing NAS, we need file systems
• A file system is software for abstracting file operations on a logical storage device
– Files are collections of binary data
• Creating, reading, writing, deleting, finding, organizing
– How does a file access translate into operations on a logical storage device?
• e.g., which blocks have to be read/written?
• Bridge between application software and the (abstracted) hardware
2.4 File Systems vs. Raw Devices
[Figure: layering – Application Software on top of the File System on top of Logical Storage]
• Raw device access allows applications to bypass the OS and the file system
• The application may directly tune aspects of physical storage
– There is still the hard drive controller… so it's not really direct
• May lead to very efficient implementations
2.4 File Systems vs. Raw Devices
[Figure: raw device access – Application Software operates directly on Logical Storage, bypassing the file system]
• Idea: provide a remote file system using already available network infrastructure
– NAS: Network Attached Storage
– Uses specialized network protocols (e.g., CIFS, NFS, FTP, etc.)
– Easiest case: a file server (e.g., Linux + Samba)
• Advantages:
– Easy to set up, easy to use, cheap infrastructure
– Allows sharing of storage among several systems
– Abstracts on the file system level (easy for most applications)
• Disadvantages:
– Inefficient and slow
• Large protocol and processing overhead
– Abstracts on the file system level (not suitable for special purposes like raw devices or storage virtualization)
2.4 NAS – Network Attached Storage
[Figure: NAS – Application Software talks to a File System whose Logical Storage is provided over the Network by a NAS Server]
• SANs offer specialized high-speed networks for storage devices
– Usually use local FibreChannel networks
– Remote locations may be connected via Ethernet or IP-WAN (Internet)
– The network uses specialized storage protocols
• iFCP (SCSI on FibreChannel)
• iSCSI (SCSI on TCP/IP)
• HyperSCSI (SCSI on raw Ethernet)
• SANs provide raw block-level access to logical storage devices
– Logical disks of any size can be offered by the SAN
– For a client system using a logical disk, it appears like a local disk or RAID
– The client system has full control over the file systems on its logical disks
2.4 SAN – Storage Area Network
[Figure: SAN – Application Software and File System reside on the client; the SAN provides the block storage underneath]
2.4 SAN – Storage Area Network
[Figure: SAN topology – servers with SAN HBAs connect through SAN switches over a SAN bus (iFCP); a SAN/RAID HBA attaches disks via a peripheral bus (SCSI, SAS, etc.); remote sites connect over a WAN-SAN bus (HyperSCSI); a NAS head exports the SAN via a NAS protocol (CIFS) to an Ethernet network]
• Advantages:
– Very efficient
• Highly optimized local network infrastructure
• Optimized protocols with low overhead
– Very flexible (any number of systems may use any number of disks at any location)
– Helps with disaster protection
• A SAN can transparently span even remote locations
– May also employ NAS heads for NAS-like behavior
• Disadvantages
2.4 SAN – Storage Area Network
• There are different types of storage
– Usually, there is a storage hierarchy
• Faster, smaller, more expensive storage
• Slower, bigger, less expensive storage
• Hard drives are currently the most popular media
– Mechanical devices
• High sequential transfer rates
• Bad random access times, low random transfer rates
• Prone to failure
– DBMS must be optimized for the storage devices used!
2 Physical Storage
• Access Paths
– Physical Data Access
– Index Structures