Digital Technical Journal

SPIRALOG LOG-STRUCTURED FILE SYSTEM
OPENVMS FOR 64-BIT ADDRESSABLE VIRTUAL MEMORY
HIGH-PERFORMANCE MESSAGE PASSING FOR CLUSTERS
SPEECH RECOGNITION SOFTWARE

Volume 8 Number 2
1996
Editorial
Jane C. Blake, Managing Editor
Kathleen M. Stetson, Editor
Helen L. Patterson, Editor

Circulation
Catherine M. Phillips, Administrator
Dorothea B. Cassady, Secretary

Production
Terri Autieri, Production Editor
Anne S. Katzeff, Typographer
Peter R. Woodbury, Illustrator

Advisory Board
Samuel H. Fuller, Chairman
Richard W. Beane
Donald Z. Harbert
William R. Hawe
Richard J. Hollingsworth
William A. Laing
Richard F. Lary
Alan G. Nemeth
Pauline A. Nist
Robert M. Supnik
Cover Design
Digital's new Spiralog file system, a featured topic in this issue, supports full 64-bit system capability and fast backup and is integrated with the OpenVMS 64-bit version 7.0 operating system. The cover graphic captures the inspired character of the Spiralog design effort and illustrates a concept taken from University of California research in which the whole disk is treated as a single, sequential log and all file system modifications are appended to the tail of the log.
The cover was designed by Lucinda O'Neill of Digital's Design Group using images from Photo Disc, Inc., copyright 1996.
The Digital Technical Journal is a refereed journal published quarterly by Digital Equipment Corporation, 30 Porter Road LJO2/D10, Littleton, MA 01460.
Subscriptions can be ordered by sending a check in U.S. funds (made payable to Digital Equipment Corporation) to the published-by address. General subscription rates are $40.00 (non-U.S. $60) for four issues and $75.00 (non-U.S. $115) for eight issues. University and college professors and Ph.D. students in the electrical engineering and computer science fields receive complimentary subscriptions upon request. Digital's customers may qualify for gift subscriptions and are encouraged to contact their account representatives.
Single copies and back issues are available for $16.00 (non-U.S. $18) each and can be ordered by sending the requested issue's volume and number and a check to the published-by address. See the Further Readings section in the back of this issue for a complete listing. Recent issues are also available on the Internet at http://www.digital.com/info/dtj.
Digital employees may order subscriptions through Readers Choice at URL http://webrc.das.dec.com or by entering VTX PROFILE at the system prompt.
Inquiries, address changes, and complimentary subscription orders can be sent to the Digital Technical Journal at the published-by address or the electronic mail address, dtj@digital.com. Inquiries can also be made by calling the Journal office at 508-486-2538.
Comments on the content of any paper are welcomed and may be sent to the managing editor at the published-by or electronic mail address.
Copyright © 1996 Digital Equipment Corporation. Copying without fee is permitted provided that such copies are made for use in educational institutions by faculty members and are not distributed for commercial advantage. Abstracting with credit of Digital Equipment Corporation's authorship is permitted.
The information in the Journal is subject to change without notice and should not be construed as a commitment by Digital Equipment Corporation or by the companies herein represented. Digital Equipment Corporation assumes no responsibility for any errors that may appear in the Journal.
ISSN 0898-901X
Documentation Number EC-N6992-18
Book production was done by Quantic Communications, Inc.
The following are trademarks of Digital Equipment Corporation: AlphaServer, DEC, DECtalk, Digital, the DIGITAL logo, HSC, OpenVMS, PATHWORKS, POLYCENTER, RZ, TruCluster, VAX, and VAXcluster.
BBN Hark is a trademark of Bolt Beranek and Newman Inc.
Encore is a registered trademark and MEMORY CHANNEL is a trademark of Encore Computer Corporation.
FAServer is a trademark of Network Appliance Corporation.
Listen for Windows is a trademark of Verbex Voice Systems, Inc.
Microsoft and Win32 are registered trademarks and Windows and Windows NT are trademarks of Microsoft Corporation.
MIPSpro is a trademark of MIPS Technologies, Inc., a wholly owned subsidiary of Silicon Graphics, Inc.
Netscape Navigator is a trademark of Netscape Communications Corporation.
PAL is a registered trademark of Advanced Micro Devices, Inc.
UNIX is a registered trademark in the United States and in other countries, licensed exclusively through X/Open Company Ltd.
VoiceAssist is a trademark of Creative Labs, Inc.
X Window System is a trademark of the Massachusetts Institute of Technology.
Contents

Foreword
Rich Marcello  3

SPIRALOG LOG-STRUCTURED FILE SYSTEM

Overview of the Spiralog File System
James E. Johnson and William A. Laing  5

Design of the Server for the Spiralog File System
Christopher Whitaker, J. Stuart Bayley, and Rod D. W. Widdowson  15

Designing a Fast, On-line Backup System for a Log-structured File System
Russell J. Green, Alasdair C. Baird, and J. Christopher Davies  32

Integrating the Spiralog File System into the OpenVMS Operating System
Mark A. Howell and Julian M. Palmer  46

OPENVMS FOR 64-BIT ADDRESSABLE VIRTUAL MEMORY

Extending OpenVMS for 64-bit Addressable Virtual Memory
Michael S. Harvey and Leonard S. Szubowicz  57

The OpenVMS Mixed Pointer Size Environment
Thomas R. Benson, Karen L. Noel, and Richard E. Peterson  72

Adding 64-bit Pointer Support to a 32-bit Run-time Library
Duane A. Smith  83

HIGH-PERFORMANCE MESSAGE PASSING FOR CLUSTERS

Building a High-performance Message-passing System for MEMORY CHANNEL Clusters
James V. Lawton, John J. Brosnan, Morgan P. Doyle, Seosamh D. Ó Riordáin, and Timothy G. Reddin  96

SPEECH RECOGNITION SOFTWARE

The Design of User Interfaces for Digital Speech Recognition Software
Bernard A. Rozmovits  117

Digital Technical Journal  Vol. 8 No. 2  1996
Editor's Introduction
This past spring when we surveyed Journal subscribers, readers took the time to comment on the particular value of the issues featuring Digital's 64-bit Alpha technology. The engineering described in those two issues continues, with ever higher levels of performance in Alpha microprocessors, servers, clusters, and systems software. This issue presents recent developments: a log-structured file system, called Spiralog; the OpenVMS operating system extended to take full advantage of 64-bit addressing; high-performance computing software for Alpha clusters; and speech recognition software for Alpha workstations.

Spiralog is a wholly new clusterwide file system integrated with the new 64-bit OpenVMS version 7.0 operating system and is designed for high data availability and high performance. The first of four papers about Spiralog is written by Jim Johnson and Bill Laing, who introduce log-structured file system (LFS) concepts, the university research behind the design, and design innovations.

The advantages of LFS technology over conventional "update-in-place" technology are explained by Chris Whitaker, Stuart Bayley, and Rod Widdowson. In their paper about the file server design, they compare the Spiralog implementation of the LFS technology with others and describe the novel combination of the technology with a B-tree mapping mechanism to provide the system with needed data recovery guarantees.

A third paper about Spiralog, written by Russ Green, Alasdair Baird, and Chris Davies, addresses a critical customer requirement: fast, application-consistent, on-line backup. Exploiting the features of log-structured storage, designers were able to combine the flexibility of file-based backup and the high performance of physically oriented backup. Consistent copies of the file system are created while applications modify data.

The Spiralog integration into the OpenVMS file system required that existing applications be able to run unchanged. Mark Howell and Julian Palmer describe the integration of the write-back caching used in Spiralog into the write-through environment used in the existing Files-11 file system.

The importance of compatibility for existing 32-bit applications in a 64-bit environment is stressed again in the set of three papers about the latest step in the evolution of the OpenVMS operating system. Digital first ported the 32-bit OpenVMS operating system to the Alpha architecture in 1992. The extension of the system to exploit 64-bit virtual addressing is presented by Mike Harvey and Lenny Szubowicz. Their discussion includes the team's solution to significant scaling issues that involved a new approach to page-table residency.

The OpenVMS team anticipated that applications would mix 32- and 64-bit addresses, or pointers, in the new environment. Tom Benson, Karen Noel, and Rich Peterson explain why this mixing of pointer sizes is expected and the DEC C compiler solution they developed to support the practice. In a related discussion, Duane Smith's paper reviews new techniques the team used to analyze and modify the C run-time library interfaces that accommodate applications using 32-bit, 64-bit, or both address sizes.

Designed for scientific users, the parallel-programming tool next described does not run on the OpenVMS Alpha system but instead on UNIX clusters connected with MEMORY CHANNEL technology. Jim Lawton, John Brosnan, Morgan Doyle, Seosamh Ó Riordáin, and Tim Reddin review the challenges in designing the TruCluster MEMORY CHANNEL Software product, which is a message-passing system intended for builders of parallel software libraries and implementers of parallel compilers. The product reduces communications latency to less than 10 μs in shared memory systems.

Finally, Bernie Rozmovits presents the design of user interfaces for the Digital Speech Recognition Software (DSRS) product. Although DSRS is targeted for Digital's Alpha workstations running UNIX, the implementation issues examined and the team's efforts to ensure the product's ease-of-use can be generally applied to speech recognition product development.

Coming up are papers on a variety of topics, including the internet protocol, collaborative software for the internet, and high-performance servers. These topics reflect areas of interest Journal readers rated near the top in last spring's survey. Our sincere thanks go to everyone who responded to that survey.

Jane C. Blake
Managing Editor
Foreword
Rich Marcello
Vice President, OpenVMS Systems Software Group
The papers you will read in this issue of the Journal describe how we in the OpenVMS engineering community set out to bring the OpenVMS operating system and our loyal customer base into the twenty-first century. The papers present both the development issues and the technical challenges faced by the engineers who delivered the OpenVMS operating system version 7.0 and the Spiralog file system, a new log-structured file system for OpenVMS.

We are extremely proud of the results of these efforts. In December 1995 at U.S. Fall DECUS (Digital Equipment Computer Users Society), Digital announced OpenVMS version 7.0 and the Spiralog file system as part of a first wave of product deliveries for the OpenVMS Windows NT Affinity Program. OpenVMS version 7.0 provides the "unlimited high end" on which our customers can build their distributed computing environments and move toward the next millennium.

The release of OpenVMS version 7.0 in January of this year represents the most significant engineering enhancement to the OpenVMS operating system since Digital released the VAXcluster system in 1983. OpenVMS version 7.0 extends the 32-bit architecture of OpenVMS to a 64-bit architecture, allowing OpenVMS Alpha users to fully exploit the 64-bit virtual address capacity of the Alpha architecture. As you will read in some of the papers in this issue, however, our design goal for OpenVMS version 7.0 went beyond just delivering 64-bit virtual address capability to OpenVMS users. It was essential to us that OpenVMS users be able to upgrade to version 7.0 with full compatibility for their existing 32-bit applications.

In addition to achieving the significant goals of 64-bit addressing and compatibility for 32-bit applications, version 7.0 includes very large memory (VLM), very large database (VLDB), fast I/O, fast path, and symmetric multiprocessing (SMP) enhancements. These new features recently combined with the power of the Alpha architecture to earn OpenVMS a world record for performance. In May of this year, OpenVMS version 7.0 on an AlphaServer 8400 system configured with eight processors and 8 gigabytes of memory, running Oracle's Rdb7 database and using the ACMS transaction processing monitor, set a new world record for TPC-C performance on a single SMP system. Audited performance was 14,227 tpmC at $269 per tpmC. Just this past August, the combination of OpenVMS version 7.0, Oracle's Rdb7 database, the ACMS monitor, and the AlphaServer 4100 system achieved world-record departmental server performance. The new world record was set on an AlphaServer 4100 5/400 system configured with four processors and 4 gigabytes of memory. In audited benchmarks, the performance results were 7,985 tpmC at $173 per tpmC. Such outstanding results are achievable in a full 64-bit environment: hardware architecture, operating systems, and applications such as Oracle's Rdb database. No other vendor today can deliver this power.
In fact, Digital has two 64-bit operating systems with this power: the OpenVMS and the Digital UNIX operating systems.

As noted above, Digital introduced the OpenVMS operating system with support for full 64-bit virtual addressing at the same time it introduced the Spiralog file system, in December 1995. The Spiralog design is based on the Sprite log-structured file system from the University of California, Berkeley. With its use of this log-structured approach, Spiralog offers major new performance features, including fast, application-consistent, on-line backup. Further, it is fully compatible with customers' existing Files-11 file systems, and applications that run on Files-11 will run on Spiralog with no modification. To deliver all of the features we felt were essential to meet the needs of our loyal customer base, the Spiralog team examined and resolved a number of technical issues. The papers in this issue describe some of the challenges they faced, including the decision to design a Files-11 file system emulation.

The delivery of the OpenVMS version 7.0 operating system and the Spiralog file system are part of Digital's continued commitment to the OpenVMS customer base. These products represent the work of dedicated, talented engineering teams that have deployed state-of-the-art technology in products that will help our customers remain competitive for years to come.

In the OpenVMS group as elsewhere in Digital, we are committed to excellence in the development and delivery of business computing solutions. We will continue to maintain and enhance a product portfolio that meets our customers' need for true 24-hour by 365-day access to their data, full integration with Microsoft Windows NT environments, and the full complement of network solutions and application software for today and well into the next millennium.
Overview of the Spiralog File System
The OpenVMS Alpha environment requires a file system that supports its full 64-bit capabilities. The Spiralog file system was developed to increase the capabilities of Digital's Files-11 file system for OpenVMS. It incorporates ideas from a log-structured file system and an ordered write-back model. The Spiralog file system provides improvements in data availability, scaling of the amount of storage easily managed, support for very large volume sizes, support for applications that are either write-operation or file-system-operation intensive, and support for heterogeneous file system client types. The Spiralog technology, which matches or exceeds the reliability and device independence of the Files-11 system, was then integrated into the OpenVMS operating system.

James E. Johnson
William A. Laing
Digital's Spiralog product is a log-structured, clusterwide file system with integrated, on-line backup and restore capability and support for multiple file system personalities. It incorporates a number of recent ideas from the research community, including the log-structured file system (LFS) from the Sprite file system and the ordered write back from the Echo file system.1,2

The Spiralog file system is fully integrated into the OpenVMS operating system, providing compatibility with the current OpenVMS file system, Files-11. It supports a coherent, clusterwide write-behind cache and provides high-performance, on-line backup and per-file and per-volume restore functions.

In this paper, we first discuss the evolution of file systems and the requirements for many of the basic designs in the Spiralog file system. Next we describe the overall architecture of the Spiralog file system, identifying its major components and outlining their designs. Then we discuss the project's results: what worked well and what did not work so well. Finally, we present some conclusions and ideas for future work.

Some of the major components, i.e., the backup and restore facility, the LFS server, and OpenVMS integration, are described in greater detail in companion papers in this issue.3-5

The Evolution of File Systems

File systems have existed throughout much of the history of computing. The need for libraries or services that help to manage the collection of data on long-term storage devices was recognized many years ago. The early support libraries have evolved into the file systems of today. During their evolution, they have responded to the industry's improved hardware capabilities and to users' increased expectations. Hardware has continued to decrease in price and improve in its price/performance ratio. Consequently, ever larger amounts of data are stored and manipulated by users in ever more sophisticated ways. As more and more data are stored on-line, the need to access that data 24 hours a day, 365 days a year has also escalated.
Significant improvements to file systems have been made in the following areas:

• Directory structures to ease locating data
• Device independence of data access through the file system
• Accessibility of the data to users on other systems
• Availability of the data, despite either planned or unplanned service outages
• Reliability of the stored data and the performance of the data access
Requirements of the OpenVMS File System

Since 1977, the OpenVMS operating system has offered a stable, robust file system known as Files-11. This file system is considered to be very successful in the areas of reliability and device independence. Recent customer feedback, however, indicated that the areas of data availability, scaling of the amount of storage easily managed, support for very large volume sizes, and support for heterogeneous file system client types were in need of improvement.

The Spiralog project was initiated in response to customers' needs. We designed the Spiralog file system to match or somewhat exceed the Files-11 system in its reliability and device independence. The focus of the Spiralog project was on those areas that were due for improvement, notably:

• Data availability, especially during planned operations, such as backup.
If the storage device needs to be taken offline to perform a backup, even at a very high backup rate of 20 megabytes per second (MB/s), almost 14 hours are needed to back up 1 terabyte. This length of service outage is clearly unacceptable. More typical backup rates of 1 to 2 MB/s can take several days, which, of course, is not acceptable.

• Greatly increased scaling in total amount of on-line storage, without greatly increasing the cost to manage that storage.
For example, 1 terabyte of disk storage currently costs approximately $250,000, which is well within the budget of many large computing centers. However, the cost in staff and time to manage such amounts of storage can be many times that of the storage. The cost of storage continues to fall, while the cost of managing it continues to rise.

• Effective scaling as more processing and storage resources become available.
For example, OpenVMS Cluster systems allow processing power and storage capacity to be added incrementally. It is crucial that the software supporting the file system scale as the processing power, bandwidth to storage, and storage capacity increase.

• Improved performance for applications that are either write-operation or file-system-operation intensive.
As file system caches in main memory have increased in capacity, data reads and file system read operations have become satisfied more and more from the cache. At the same time, many applications write large amounts of data or create and manipulate large numbers of files. The use of redundant arrays of inexpensive disks (RAID) storage has increased the available bandwidth for data writes and file system writes. Most file system operations, on the other hand, are small writes and are spread across the disk at random, often negating the benefits of RAID storage.

• Improved ability to transparently access the stored data across several dissimilar client types.
Computing environments have become increasingly heterogeneous. Different client systems, such as the Windows or the UNIX operating system, store their files on and share their files with server systems such as the OpenVMS server. It has become necessary to support the syntax and semantics of several different file system personalities on a common file server.

These needs were central to many design decisions we made for the Spiralog file system.
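The backup-window figures quoted above are simple throughput arithmetic; the small check below (a sketch, using the decimal units and the 1-terabyte volume from the text) reproduces them:

```python
def backup_hours(volume_bytes: int, rate_bytes_per_sec: int) -> float:
    """Hours needed to stream an entire volume at a fixed backup rate."""
    return volume_bytes / rate_bytes_per_sec / 3600

TB = 10**12  # decimal terabyte
MB = 10**6   # decimal megabyte

# At a very high 20-MB/s rate, 1 terabyte needs almost 14 hours offline.
print(f"{backup_hours(TB, 20 * MB):.1f} hours")    # 13.9 hours

# At a more typical 1-MB/s rate, the same volume needs well over a week.
print(f"{backup_hours(TB, 1 * MB) / 24:.1f} days")  # 11.6 days
```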
The members of the Spiralog project evaluated much of the ongoing work in file systems, databases, and storage architectures. RAID storage makes high bandwidth available to disk storage, but it requires large writes to be effective. Databases have exploited logs and the grouping of writes together to minimize the number of disk I/Os and disk seeks required. Databases and transaction systems have also exploited the technique of copying the tail of the log to effect backups or data replication. The Sprite project at Berkeley had brought together a log-structured file system and RAID storage to good effect.1

By drawing from the above ideas, particularly the insight of how a log structure could support on-line, high-performance backup, we began our development effort. We designed and built a distributed file system that made extensive use of the processor and memory near the application and used log-structured storage in the server.
Spiralog File System Design

The main execution stack of the Spiralog file system consists of three distinct layers. Figure 1 shows the overall structure. At the top, nearest the user, is the file system client layer. It consists of a number of file system personalities and the underlying personality-independent services, which we call the VPI.

[Figure 1: Spiralog Structure Overview. The file system client comprises the F64 and FSLIB personalities and the backup user interface, layered over the VPI services.]
Two file system personalities dominate the Spiralog design. The F64 personality is an emulation of the Files-11 file system. The file system library (FSLIB) personality is an implementation of Microsoft's New Technology Advanced Server (NTAS) file services for use by the PATHWORKS for OpenVMS file server.

The next layer, present on all systems, is the clerk layer. It supports a distributed cache and ordered write back to the LFS server, giving single-system semantics in a cluster configuration.

The LFS server, the third layer, is present on all designated server systems. This component is responsible for maintaining the on-disk log structure; it includes the cleaner, and it is accessed by multiple clerks. Disks can be connected to more than one LFS server, but they are served only by one LFS server at a time. Transparent failover, from the point of view of the file system client layer, is achieved by cooperation between the clerks and the surviving LFS servers.

The backup engine is present on a system with an active LFS server. It uses the LFS server to access the on-disk data, and it interfaces to the clerk to ensure that the backup or restore operations are consistent with the clerk's cache.

Figure 2 shows a typical Spiralog cluster configuration. In this cluster, the clerks on nodes A and B are accessing the Spiralog volumes. Normally, they use the LFS server on node C to access their data. If node C should fail, the LFS server on node D would immediately provide access to the volumes. The clerks on nodes A and B would use the LFS server on node D, retrying all their outstanding operations. Neither user application would detect any failure. Once node C had recovered, it would become the standby LFS server.
[Figure 2: Spiralog Cluster Configuration. User applications and Spiralog clerks run on nodes A and B; LFS servers run on nodes C and D; the nodes are connected by Ethernet to the Spiralog volumes.]
File System Client Design

The file system client is responsible for the traditional file system functions. This layer provides files, directories, access arbitration, and file naming rules. It also provides the services that the user calls to access the file system.
VPI Services Layer  The VPI layer provides an underlying primitive file system interface, based on the UNIX VFS switch. The VPI layer has two overall goals:

1. To support multiple file system personalities
2. To effectively scale to very large volumes of data and very large numbers of files

To meet the first goal, the VPI layer provides

• File names of 256 Unicode characters, with no reserved characters
• No restriction on directory depth
• Up to 255 sparse data streams per file, each with 64-bit addressing
• Attributes with 255 Unicode character names, containing values of up to 1,024 bytes
• Files and directories that are freely shared among file system personality modules

To meet the second goal, the VPI layer provides

• File identifiers stored as 64-bit integers
• Directories through a B-tree, rather than a simple linear structure, for log(n) file name lookup time

The VPI layer is only a base for file system personalities. Therefore it requires that such personalities are trusted components of the operating system. Moreover, it requires them to implement file access security (although there is a convention for storing access control list information) and to perform all necessary cleanup when a process or image terminates.
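The log(n) lookup claim for B-tree directories can be illustrated with a toy comparison. In this sketch, Python's bisect over a sorted list stands in for the on-disk B-tree (the entry names are invented; the real VPI structures are not shown in the paper):

```python
import bisect

def linear_lookup(entries, name):
    """Simple linear directory scan: O(n) comparisons."""
    for entry in entries:
        if entry == name:
            return True
    return False

def sorted_lookup(sorted_entries, name):
    """Binary search over sorted entries: O(log n) comparisons,
    analogous to descending a B-tree of directory pages."""
    i = bisect.bisect_left(sorted_entries, name)
    return i < len(sorted_entries) and sorted_entries[i] == name

# A directory of 100,000 invented file names.
directory = sorted(f"file{n:06d}.dat" for n in range(100_000))

assert linear_lookup(directory, "file054321.dat")
assert sorted_lookup(directory, "file054321.dat")
assert not sorted_lookup(directory, "missing.dat")
```

For the directory above, the linear scan can touch all 100,000 entries while the binary search needs at most about 17 comparisons, which is the difference the VPI design targets at very large scale.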
F64 File System Personality  As previously stated, the Spiralog product includes two file system personalities, F64 and FSLIB. The F64 personality provides a service that emulates the Files-11 file system. Its functions, services, available file attributes, and execution behaviors are similar to those in the Files-11 file system. Minor differences are isolated into areas that receive little use from most applications.

For instance, the Spiralog file system supports the various Files-11 queued I/O ($QIO) parameters for returning file attribute information, because they are used implicitly or explicitly by most user applications. On the other hand, the Files-11 method of reading the file header information directly through a file called INDEXF.SYS is not commonly used by applications and is not supported.

The F64 file system personality demonstrates that the VPI layer contains sufficient flexibility to support a complex file system interface. In a number of cases, however, several VPI calls are needed to implement a single, complex Files-11 operation. For instance, to do a file open operation, the F64 personality performs the tasks listed below. The items that end with (VPI) are tasks that use VPI service calls to complete.

• Access the file's parent directory (VPI)
• Read the directory's file attributes (VPI)
• Verify authorization to read the directory
• Loop, searching for the file name, by
  - Reading some directory entries (VPI)
  - Searching the directory buffer for the file name
  - Exiting the loop, if the match is found
• Access the target file (VPI)
• Read the file's attributes (VPI)
• Audit the file open attempt
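The open sequence above can be sketched as a composition of VPI primitives. Everything below is a hypothetical stand-in: the paper does not give the actual VPI call names or signatures, so `access`, `read_attributes`, and `read_entries` are invented for illustration.

```python
class ToyVPI:
    """In-memory stand-in for the VPI primitives (all names hypothetical)."""
    def __init__(self, objects, entries):
        self.objects = objects      # object id -> attribute dict
        self.entries = entries      # directory id -> [(name, file id), ...]

    def access(self, obj_id):                     # access an object (VPI)
        return obj_id if obj_id in self.objects else None

    def read_attributes(self, obj_id):            # read attributes (VPI)
        return self.objects[obj_id]

    def read_entries(self, dir_id, start, n):     # read a batch of entries (VPI)
        return self.entries[dir_id][start:start + n]

audit_log = []

def f64_open(vpi, dir_id, file_name, user):
    """Sketch of the F64 open sequence; (VPI) marks calls into the layer."""
    directory = vpi.access(dir_id)                 # access parent directory (VPI)
    dir_attrs = vpi.read_attributes(directory)     # read its attributes (VPI)
    if user not in dir_attrs["readers"]:           # verify authorization (local)
        raise PermissionError(file_name)
    start, file_id = 0, None
    while file_id is None:                         # loop, searching for the name:
        batch = vpi.read_entries(directory, start, 8)  # read some entries (VPI)
        if not batch:
            raise FileNotFoundError(file_name)
        for name, fid in batch:                    # search the buffer (local)
            if name == file_name:
                file_id = fid                      # exit loop on a match
        start += 8
    target = vpi.access(file_id)                   # access the target file (VPI)
    attrs = vpi.read_attributes(target)            # read the file's attributes (VPI)
    audit_log.append(("open", user, file_name))    # audit the open attempt (local)
    return attrs

vpi = ToyVPI(
    objects={"dir1": {"readers": {"alice"}}, "fid9": {"size": 512}},
    entries={"dir1": [("notes.txt", "fid9")]},
)
assert f64_open(vpi, "dir1", "notes.txt", "alice") == {"size": 512}
```

The point of the sketch is the shape, not the details: one Files-11-level open fans out into five or more VPI calls plus local checks, which is the cost of keeping VPI primitive and personality-neutral.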
FSLIB File System Personality  The FSLIB file system personality is a specialized file system to support the PATHWORKS for OpenVMS file server. Its two major goals are to support the file names, attributes, and behaviors found in Microsoft's NTAS file access protocols, and to provide low run-time cost for processing NTAS file system requests.

The PATHWORKS server implements a file service for personal computer (PC) clients layered on top of the Files-11 file system services. When NTAS service behaviors or attributes do not match those of Files-11, the PATHWORKS server has to emulate them. This can lead to checking security access permissions twice, mapping file names, and emulating file attributes.

Many of these problems can be avoided if the VPI interface is used directly. For instance, because the FSLIB personality does not layer on top of a Files-11 personality, security access checks do not need to be performed twice. Furthermore, in a straightforward design, there is no need to map across different file naming or attribute rules. For reasons we describe later, in the VPI Results section, we chose not to pursue this design to its conclusion.
Clerk Design

The clerks are responsible for managing the caches, determining the order of writes out of the cache to the LFS server, and maintaining cache coherency within a cluster. The caches are write behind in a manner that preserves the order of dependent operations.

The clerk-server protocol controls the transfer of data to and from stable storage. Data can be sent as a multiblock atomic write, and operations that change multiple data items, such as a file rename, can be made atomically. If a server fails during a request, the clerk treats the request as if it were lost and retries the request. The clerk-server protocol is idempotent. Idempotent operations can be applied repeatedly with no effects other than the desired one. Thus, after any number of server failures or server failovers, it is always safe to reissue an operation. Clerk-to-server write operations always leave the file system state consistent.
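The recovery logic that idempotence permits can be modeled in a few lines. This is an illustrative sketch, not the Spiralog clerk-server protocol itself; the class and function names are invented, and absolute "set key to value" writes stand in for the real operations:

```python
class ToyServer:
    """Server that applies writes; re-applying a write is harmless
    because the write states the final value, not a delta."""
    def __init__(self, fail_first=0):
        self.state = {}
        self.failures_left = fail_first

    def apply(self, key, value):
        if self.failures_left > 0:            # simulate a crash or timeout
            self.failures_left -= 1
            raise ConnectionError("server failed")
        self.state[key] = value               # idempotent: same op, same state
        return "ack"

def clerk_write(servers, key, value):
    """Treat any failure as a lost request and reissue it; with idempotent
    operations this is always safe, even across a failover to a standby."""
    for server in servers:                    # each retry may reach the
        try:                                  # surviving (standby) server
            return server.apply(key, value)
        except ConnectionError:
            continue                          # request presumed lost; retry
    raise RuntimeError("no LFS server available")

primary = ToyServer(fail_first=1)             # fails once, like node C
standby = ToyServer()                         # takes over, like node D
assert clerk_write([primary, standby], "block7", b"data") == "ack"
assert standby.state["block7"] == b"data"
```

A non-idempotent operation (say, "append to file") could not be retried this blindly, because the clerk cannot tell whether the lost request was applied before the failure; that is why the protocol is designed so every reissue is safe.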
The c lerk-clerk protocol protects the user data :1nd ti.Jc svstcm mctadatJ cached . lw the . cluks. C:�chc
coherency i n r(Jrmarion, rather th�111 <.bra, is passed d i rect!\' between clerks.
The tile svsrcm caches a1-c kept in the clerks . iVI u l tiple clerks can have copies ot'sta bi l i t.cd data , i.e. , <.bra rhat has been ll'ritten to the scn'Cr 11·ith the IITitc acknowledged . Onlv one c l er k can h�li'C unsL1bilizcd, volatile data.
Data is exchanged between clerks by stabilizing it. When a clerk needs to write a block of data to the server from its cache, it uses a token interface that is layered on the clerk-clerk protocol. The writes from the cache to the server are deferred as long as possible within the constraints of the cache protocol and the dependency guarantees. Dirty data remains in the cache as long as 30 seconds. During that time, overwrites are combined within the constraints of the dependency guarantees. Furthermore, operations that are known to cancel one another, such as freeing a file identifier and allocating a file identifier, are fully combined within the cache. Eventually, some trigger causes the dirty data to be written to the server. At this point, several writes are grouped together. Write operations to adjacent, or overlapping, file locations are combined to form a smaller number of larger writes. The resulting write operations are then grouped into messages to the LFS server.
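The combining of adjacent or overlapping writes can be illustrated with a small sketch. This is hypothetical code, not the clerk's implementation; it treats each pending write as a half-open byte range and merges ranges that touch or overlap into a smaller number of larger writes:

```python
def combine_writes(writes):
    """Combine adjacent or overlapping (start, end) write ranges.

    Returns the minimal set of larger writes covering the same bytes,
    as the clerk does before grouping cache writes into messages for
    the LFS server.
    """
    merged = []
    for start, end in sorted(writes):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous range: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

For example, writes to bytes [0,4), [4,8), and [6,10) collapse into a single write of [0,10).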
The clerks perform write behind for four reasons:
• To spread the I/O load over time
• To remove occluded data, which can result from repeated overwrites of a data block, from being transferred to the server
• To avoid writing data that is quickly deleted, such as temporary files
• To combine multiple small writes into larger transfers
The clerks order dependent writes from the cache to the server; consequently, other clerks never see
"impossible" states, and rel ated writes never overtake each other. For i nstance, the deletion of a tile cannot happen beti:>rc a ren:�me that was previously issued to the same ti le. Related d:�ta writes arc caused by a partial overwrite, or an expl icit linking of operations passed into the clerk by the V PI layer, or an i m plicit linking due to the clerk-clerk coherency protocol.
The ordering between writes is kept as a directed graph. As the clerks trave rse these graphs, they issue the writes in order or collapse the graph when writes can be sately combined or elim inated .
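As a rough illustration of issuing writes from such a dependency graph, the sketch below uses Python's standard `graphlib` topological sorter. The operation names are invented for the example, and the real clerks additionally collapse the graph when writes can be combined:

```python
from graphlib import TopologicalSorter


def issue_order(depends_on):
    """Return an issue order in which every write follows all the
    writes it depends on (a topological order of the directed graph),
    so related writes never overtake each other.

    `depends_on` maps each write to the set of writes that must reach
    the server first.
    """
    return list(TopologicalSorter(depends_on).static_order())


# Hypothetical example: a delete issued after a rename of the same
# file must never reach the server before that rename.
order = issue_order({"delete a.txt": {"rename tmp.txt -> a.txt"}})
```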
LFS Server Design
The Spiralog file system uses a log-structured, on-disk format for storing data within a volume, yet presents a traditional, update-in-place file system to its users.
Figure 3
Spiralog Address Mapping. User I/Os on file virtual blocks pass through the VPI clerk (file address space), the LFS B-tree (log address space), and the LFS log driver layer (physical address space) down to the disk.
Recently, log-structured file systems, such as Sprite, have been an area of active research.
Within the LFS server, support is provided for the log-structured, on-disk format and for mapping that format to an update-in-place model. Specifically, this component is responsible for
• Mapping the incoming read and write operations from their simple address space to positions in an open-ended log
• Mapping the open-ended log onto a finite amount of disk space
• Reclaiming disk space by cleaning (garbage collecting) the obsolete (overwritten) sections of the log
Figure 3 shows the various mapping layers in the Spiralog file system, including those handled by the LFS server.
Incoming read and write operations are based on a single, large address space. Initially, the LFS server transforms the address ranges in the incoming operations into equivalent address ranges in an open-ended log. This log supports a very large, write-once address space.
A read operation looks up its location in the open-ended log and proceeds. On the other hand, a write operation makes obsolete its current address range and appends its new value to the tail of the log.
In turn, locations in the open-ended log are then mapped into locations on the (finite-sized) disk. This additional mapping allows disk blocks to be reused once their original contents have become obsolete.
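The read and write behavior described above can be sketched with a toy append-only log. This is a hypothetical illustration, greatly simplified from the real server's mapping structures: a write appends at the tail and obsoletes the block's previous log location, a read looks up the current location, and any unreferenced position is a candidate for the cleaner.

```python
class OpenEndedLog:
    """Minimal sketch of a write-once log plus an address map."""

    def __init__(self):
        self.records = []  # append-only (block, data) records
        self.where = {}    # block number -> current index in the log

    def write(self, block, data):
        # The block's old log position (if any) becomes obsolete.
        self.where[block] = len(self.records)
        self.records.append((block, data))

    def read(self, block):
        # Look up the block's current position in the log.
        return self.records[self.where[block]][1]

    def live_positions(self):
        # Positions still referenced by some block; all other
        # positions can be reclaimed by the cleaner.
        return set(self.where.values())
```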
Physically, the log is divided into log segments, each of which is 256 kilobytes (KB) in length. The log segment is used as the transfer unit for the backup engine. It is also used by the cleaner for reclaiming obsolete log space.
More information about the LFS server can be found in this issue.
On-line Backup Design
The design goals for the backup engine arose from higher storage management costs and greater data availability needs. Investigations with a number of customers revealed their requirements for a backup engine:
• Consistent save operations without stopping any applications or locking out data modifications
• Very fast save operations
• Both full and incremental save operations
• Restores of a full volume and of individual files
Our response to these needs influenced many decisions concerning the Spiralog file system design. The need for a high-performance, on-line backup led to a search for an on-disk structure that could support it. Again, we chose the log-structured design as the most suitable one.
A log-structured organization allows the backup facility to easily demarcate snapshots of the file system at any point in time, simply by marking a point in the log. Such a mark represents a version of the file system and prevents disk blocks that compose that version from being cleaned. In turn, this allows the backup to run against a low level of the file system, that of the logical log, and therefore to operate close to the spiral transfer rate of the underlying disk.
The difference between a partial, or incremental, and a full save operation is only the starting point in the log. An incremental save need not copy data back to the beginning of the log. Therefore, both incremental and full save operations transfer data at very high speed.
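This property can be shown with a trivial sketch, in which a save is just a sequential slice of the log between a starting point and a snapshot mark. The function and record names are invented; the point is only that a full save and an incremental save differ in nothing but `since`:

```python
def save(log, snapshot_mark, since=0):
    """Copy the log records in [since, snapshot_mark).

    A full save uses since=0; an incremental save passes the mark
    left by the previous save, so only newer records are transferred.
    Either way the copy is a sequential sweep of the log.
    """
    return log[since:snapshot_mark]


log = ["rec0", "rec1", "rec2", "rec3", "rec4"]
full = save(log, 3)             # full save up to the first mark
incr = save(log, 5, since=3)    # incremental save since that mark
```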
By implementing these features in the Spiralog file system, we fulfilled our customers' requirements for high-performance, on-line backup save operations. We also met their needs for per-file and per-volume restores and an ongoing need for simplicity and reduction in operating costs.
To provide per-file restore capabilities, the backup utility and the LFS server ensure that the appropriate file header information is stored during the save operation. The saved file system data, including file headers, log mapping information, and user data, are stored in a file known as a saveset. Each saveset, regardless of the number of tapes it requires, represents a single save operation.
To reduce the complexity of file restore operations, the Spiralog file system provides an on-line saveset merge feature. This allows the system manager to merge several savesets, either full or incremental, to form a new, single saveset. With this feature, system managers can have a workable backup save plan that never calls for an on-line full backup, thus further reducing the load on their production systems. Also, this feature can be used to ensure that file restore operations can be accomplished with a small, bounded set of savesets.
The Spiralog backup facility is described in detail in this issue.
Project Results
The Spiralog file system contains a number of innovations in the areas of on-line backup, log-structured storage, clusterwide ordered write-behind caching, and multiple-file-system client support.
The use of log structuring as an on-disk format is very effective in supporting high-performance, on-line backup. The Spiralog file system retains the previously documented benefits of LFS, such as fast write performance that scales with the disk size and throughput that increases as large read caches are used to offset disk reads.
It should also be noted that the Files-11 file system sets a high standard for data reliability and robustness. The Spiralog technology met this challenge very well: as a result of the idempotent protocol, the cluster failover design, and the recovery capability of the log, we encountered few data reliability problems during development.
In any large, complex project, many technical decisions are necessary to convert research technology into a product. In this section, we discuss why certain decisions were made during the development of the Spiralog subsystems.
VPI Results
The VPI file system was generally successful in providing the underlying support necessary for different file system personalities. We found that it was possible to construct a set of primitive operations that could be used to build complex, user-level, file system operations.