Digital Technical Journal
Digital Equipment Corporation
Volume 3 Number 1 Winter 1991
Cover Design
Transaction processing is the common theme for papers in this issue. The automatic teller machine on our cover represents one of the many businesses that rely on TP systems. If we could look behind the familiar machine, we would see the products and technologies - here symbolized by linked databases - that support reliable and speedy processing of transactions worldwide.
The cover was designed by Dave Bryant of Digital's Media Communications Group.
Editorial
Jane C. Blake, Editor
Kathleen M. Stetson, Associate Editor
Circulation
Catherine M. Phillips, Administrator Suzanne J. Babineau, Secretary
Production
Helen L. Patterson, Production Editor Nancy Jones, Typographer
Peter Woodbury, Illustrator
Advisory Board
Samuel H. Fuller, Chairman Richard W. Beane
Robert M. Glorioso Richard J. Hollingsworth John W. McCredie Alan G. Nemeth Mahendra R. Patel
F. Grant Saviers Robert K. Spitz Victor A. Vyssotsky Gayn B. Winters
The Digital Technical Journal is published quarterly by Digital Equipment Corporation, 146 Main Street MLO1-3/B68, Maynard, Massachusetts 01754-2571. Subscriptions to the Journal are $40.00 for four issues and must be prepaid in U.S. funds. University and college professors and Ph.D. students in the electrical engineering and computer science fields receive complimentary subscriptions upon request. Orders, inquiries, and address changes should be sent to The Digital Technical Journal at the published-by address.
Inquiries can also be sent electronically to DTJ@CRL.DEC.COM.
Single copies and back issues are available for $16.00 each from Digital Press of Digital Equipment Corporation, 12 Crosby Drive, Bedford, MA 01730-1493.
Digital employees may send subscription orders on the ENET to RDVAX::JOURNAL or by interoffice mail to mailstop MLO1-3/B68.
Orders should include badge number, cost center, site location code and address. All employees must advise of changes of address.
Comments on the content of any paper are welcomed and may be sent to the editor at the published-by or network address.
Copyright © 1991 Digital Equipment Corporation. Copying without fee is permitted provided that such copies are made for use in educational institutions by faculty members and are not distributed for commercial advantage. Abstracting with credit of Digital Equipment Corporation's authorship is permitted. All rights reserved.
The information in this journal is subject to change without notice and should not be construed as a commitment by Digital Equipment Corporation. Digital Equipment Corporation assumes no responsibility for any errors that may appear in this journal.
ISSN 0898-901X
Documentation Number EY-F588E-DP
The following are trademarks of Digital Equipment Corporation:
DEC, DECforms, DECintact, DECnet, DECserver, DECtp, Digital, the Digital logo, LAT, Rdb/VMS, TA, VAX ACMS, VAX CDD, VAX COBOL, VAX DBMS, VAX Performance Advisor, VAX RALLY, VAX Rdb/VMS, VAX RMS, VAX SPM, VAX SQL, VAX 6000, VAX 9000, VAXcluster, VAXft, VAXserver, VMS.
IBM is a registered trademark of International Business Machines Corporation.
TPC Benchmark is a trademark of the Transaction Processing Performance Council.
Book production was done by Digital's Educational Services Media Communications Group in Bedford, MA.
Contents
Foreword
Carlos G. Borgialli
Transaction Processing, Databases, and Fault-tolerant Systems
DECdta - Digital's Distributed Transaction Processing Architecture
Philip A. Bernstein, William T. Emberton, and Vijay Trehan
Digital's Transaction Processing Monitors
Thomas G. Speer and Mark W. Storm
Transaction Management Support in the VMS Operating System Kernel
William A. Laing, James E. Johnson, and Robert V. Landau
Performance Evaluation of Transaction Processing Systems
Walter H. Kohler, Yun-Ping Hsu, Thomas K. Rogers, and Wael H. Bahaa-El-Din
Tools and Techniques for Preliminary Sizing of Transaction Processing Applications
William Z. Zahavi, Frances A. Habib, and Kenneth J. Omahen
Database Availability for Transaction Processing
Ananth Raghavan and T. K. Rengarajan
Designing an Optimized Transaction Commit Protocol
Peter M. Spiro, Ashok M. Joshi, and T. K. Rengarajan
Verification of the First Fault-tolerant VAX System
William F. Bruckert, Carlos Alonso, and James M. Melvin
Editor's Introduction
Jane C. Blake, Editor
Digital's transaction processing systems are integrated hardware and software products that operate in a distributed environment to support commercial applications, such as bank cash withdrawals, credit card transactions, and global trading. For these applications, data integrity and continuous access to shared resources are necessary system characteristics; anything less would jeopardize the revenues of business operations that depend on these applications. Papers in this issue of the Journal look at some of Digital's technologies and products that provide these system characteristics in three areas: distributed transaction processing, database access, and system fault tolerance.
Opening the issue is a discussion of the architecture, DECdta, which ensures reliable interoperation in a distributed environment. Phil Bernstein, Bill Emberton, and Vijay Trehan define some transaction processing terminology and analyze a TP application to illustrate the need for separate architectural components. They then present overviews of each of the components and interfaces of the distributed transaction processing architecture, giving particular attention to transaction management.
Two products, the ACMS and DECintact monitors, implement several of the functions defined by the DECdta architecture and are the twin topics of a paper by Tom Speer and Mark Storm. Although based on different implementation strategies, both ACMS and DECintact provide TP-specific services for developing, executing, and managing TP applications. Tom and Mark discuss the two strategies and then highlight the functional similarities and differences of the two monitor products.
The ACMS and DECintact monitors are layered on the VMS operating system, which provides base services for distributed transaction management. Described by Bill Laing, Jim Johnson, and Bob Landau, these VMS services, called DECdtm, are an addition to the operating system kernel and address the problem of integrating data from multiple systems and databases. The authors describe the three DECdtm components, an optimized implementation of the two-phase commit protocol, and some VAXcluster-specific optimizations.
The next two papers turn to the issues of measuring TP system performance and of sizing a system to ensure a TP application will run efficiently. Walt Kohler, Yun-Ping Hsu, Tom Rogers, and Wael Bahaa-El-Din discuss how Digital measures and models TP system performance. They present an overview of the industry-standard TPC Benchmark A and Digital's implementation, and then describe an alternative to benchmark measurement: a multilevel analytical model of TP system performance that simplifies the system's complex behavior to a manageable set of parameters. The discussion of performance continues but takes a different perspective in the paper on sizing TP systems. Bill Zahavi, Fran Habib, and Ken Omahen have written about a methodology for estimating the appropriate system size for a TP application. The tools, techniques, and algorithms they describe are used when an application is still in its early stages of development.
High performance must extend to the database system. In their paper on database availability, Ananth Raghavan and T. K. Rengarajan examine strategies and novel techniques that minimize the effects of downtime situations. The two databases referenced in their discussion are the VAX Rdb/VMS and VAX DBMS systems. Both systems use a database kernel called KODA, which provides transaction capabilities and commit processing. Peter Spiro, Ashok Joshi, and T. K. Rengarajan explain the importance of commit processing relative to throughput and describe new designs for improving the performance of group commit processing. These designs were tested, and the results of these tests and the authors' observations are presented.
As important in TP systems as database availability is system availability. The topic of the final paper in this issue is a system designed to be continuously available, the VAXft 3000 fault-tolerant system. Authors Bill Bruckert, Carlos Alonso, and Jim Melvin give an overview of the system and then focus on the four-phase verification strategy devised to ensure transparent system recovery from errors.
I thank Carlos Borgialli for his help in preparing this issue and for writing the issue's Foreword.
Biographies
Carlos Alonso A principal software engineer, Carlos Alonso is a team leader for the project to port the System-V operating system to the VAXft 3000. Previously, he was the project leader for various VAXft 3000 system validation development efforts. As a member of the research group, Carlos developed the test bed for evaluating concurrency control algorithms using the VMS Distributed Lock Manager, and he designed the prototype alternate lock rebuild algorithm for cluster transitions. He holds a B.S.E.E. (1979) from Tulane University and an M.S.C.S. (1980) from Boston University.
Wael Hilal Bahaa-El-Din Wael Bahaa-El-Din joined Digital in 1987 as a senior consultant to the Systems Performance Group, Database Systems. He has led a number of studies to evaluate the performance of database and transaction processing systems under response time constraints. After receiving his Ph.D. (1984) in computer and information science from Ohio State University, Wael spent three years as an assistant professor at the University of Houston. He is a member of ACM and IEEE, and he has written numerous articles for professional journals and conferences.
Philip A. Bernstein As a senior consultant engineer, Philip Bernstein is both an architectural consultant in the Transaction Processing Systems Group and a researcher at the Cambridge Research Laboratory. Prior to joining Digital in 1987, he was a professor at Wang Institute of Graduate Studies and at Harvard University, a vice president at Sequoia Systems, and a researcher at the Computer Corporation of America. He has published over 60 papers and coauthored two books. Phil received a B.S. (1971) in engineering from Cornell University and a Ph.D. (1975) in computer science from the University of Toronto.
William F. Bruckert William Bruckert is a consulting engineer who joined Digital in 1969 after receiving a B.S.E.E. degree from the University of Massachusetts. He received an M.S.E.E./C.E. degree from the same university in 1981. Beginning as a worldwide product support engineer, Bill later worked on a number of DECsystem-10/20 designs. He developed the cache, memory, and I/O subsystem for the VAX 8600 processor and was the system architect of the VAX 8650 processor. His most recent role was as the architect of the VAXft 3000 system. Bill currently holds seven patents.
William T. Emberton As a principal software engineer, William Emberton is currently involved in the development of Queue Management Architecture. He is also involved in X/Open and POSIX TP Standards work and is a member of the team that is developing the overall DECtp product architecture. Previously, he worked on the initial versions of the DECdta architecture. Before coming to Digital in 1987, Bill held positions as Director of Software Development at National Semiconductor and Manager of Systems Development for International Retail Systems at NCR. He was educated at London University.
Frances A. Habib Fran Habib is a principal software engineer involved with the development of transaction processing workload characterization and sizing tools. Previously, Fran worked at Data General and GTE Laboratories as a management science consultant. She holds an M.S. in operations research from MIT and a B.S. in engineering and applied science from Harvard. Fran is a full member of ORSA and belongs to ACM, IEEE, and the ACM SIGMETRICS special interest group on modeling and performance evaluation of computer systems.
Yun-Ping Hsu Yun-Ping is currently a principal software engineer in the Transaction Processing Systems Performance and Characterization Group. He joined Digital in October 1987, after receiving his master's degree in electrical and computer engineering from the University of Massachusetts at Amherst. In his position, Yun-Ping is responsible for performance modeling and benchmark measurement of both ACMS- and DECintact-based TP systems. He also participated in the TPC Benchmark A standardization activity during 1989. He is a member of ACM and IEEE.
James E. Johnson A consulting software engineer, Jim Johnson has worked for the VMS Engineering Group since joining Digital in 1984. He is currently a project leader for VMS Engineering in Europe. Prior to this work, Jim led the RMS project, and after relocating to the UK three years ago, he was responsible for much of the design and implementation of the DECdtm services. At the same time, Jim was an active participant in the transaction management architecture review group. He has applied for a patent pertaining to the two-phase commit protocol optimization currently used in DECdtm services.
Ashok M. Joshi Ashok Joshi is a principal software engineer interested in database systems, transaction processing, and object-based programming. He is presently working on the KODA subsystem, which provides record storage for Rdb/VMS and DBMS software. For the Rdb/VMS project, he developed hash indexing and record placement features, and he worked on optimizing the lock protocols. Ashok came to Digital after receiving a bachelor's degree in electrical engineering from the Indian Institute of Technology, Bombay, and a master's degree in computer science from the University of Wisconsin, Madison.
TP benchmark standards activities. Before joining Digital in 1988, Walt was a visiting scientist and technical consultant to Digital and a professor of electrical and computer engineering at the University of Massachusetts at Amherst. He holds B.S., M.S., and Ph.D. degrees in electrical engineering, all from Princeton University. Walt recently received the IEEE/CS Meritorious Service Award, and he has published over 25 technical articles.
William A. Laing William Laing is a senior consultant engineer based in Newbury, England. He is the technical leader for production systems support for the VMS operating system. During five years spent in the U.S., Bill was responsible for the design and initial development of symmetrical multiprocessing support in the VMS system. He joined Digital in 1981, after doing research on operating systems at Edinburgh University for nine years. Bill holds a B.Sc. (1972) in mathematics and computer science and an M.Phil. (1976) in computer science, both from Edinburgh University.
Robert V. Landau Principal software engineer Robert Landau is a member of the VMS Engineering Group, based in Newbury, England. He is currently the project leader of a VMS advanced development team investigating a high-performance, transaction-based, flat file system. Before joining Digital in 1987, Bob worked for a variety of software houses specializing in database-related products. He studied botany at London University and, subsequently, obtained a teaching qualification from Hereford College.
James M. Melvin As a principal design engineer, Jim was responsible for the specification of hardware error-handling mechanisms in the VAXft system and is presently an engineering project leader for future VAXft systems. He also specified and led the implementation of the hardware system simulation platform and the hardware design verification test plan. Jim joined Digital in 1984 and holds a B.S.E.E. (1984) and an M.S. (1989) in engineering management from Worcester Polytechnic Institute. He holds three patents on the VAXft 3000 system, all related to error handling in a fault-tolerant system.
Kenneth J. Omahen A principal engineer, Kenneth Omahen is developing object-oriented queuing network solvers. He designed a variety of performance tools and performed design support studies which influenced a number of Digital products. Prior to joining Digital, Ken worked at Bell Telephone Laboratories, lectured at the University of Newcastle-Upon-Tyne, and was a faculty member at Purdue University. He received a B.S. degree in science engineering from Northwestern University and M.S. and Ph.D. degrees in information sciences from the University of Chicago.
Ananth Raghavan Since joining Digital in 1988, Ananth Raghavan has been a software engineer who has led projects for the KODA/Rdb Group. Previous to this position, he was a teaching assistant in the computer science department of the University of Wisconsin. Ananth holds a B.S. (1985) degree in mechanical engineering from the Indian Institute of Technology, Madras, and an M.S. (1987) degree in computer science from the University of Wisconsin, Madison. He has two patent applications pending for his work on undo and undo/redo database algorithms.
T. K. Rengarajan T. K. Rengarajan has been a member of the Database Systems Group since 1987 and works on the KODA software kernel for database management systems. He is involved in the support for WORM devices and global buffer management in the VAXcluster environment. His work in the areas of boundary element methods and database management systems is reported in several published papers and patent applications. Ranga holds an M.S. degree in computer-aided design from the University of Kentucky and an M.S. in computer science from the University of Wisconsin.
Thomas K. Rogers Thomas Rogers is a project leader for the Transaction Processing Systems Performance and Characterization Group. He is responsible for testing the VAX 9000 Model 210 system using the TPC Benchmark A standard. Prior to joining Digital in January 1988, Tom worked for Sperry Corporation as a technical specialist for the Northeast region. He received a bachelor of science degree in mathematical sciences in 1979 from Johns Hopkins University.
Thomas G. Speer As a principal software engineer in the DECtp/East Engineering Group, Thomas Speer is currently leading the DECintact V2.0 project. In this position, his major responsibility is defining the requirements for DECintact support of DECdtm services, client/server database access, and support for the DECforms product. Since joining Digital in 1981, Tom has worked on several development projects, including FORTRAN-10/20 and RMS-20. He holds degrees from Harvard University, Rutgers University, and Simmons College. He is a member of Phi Beta Kappa.
Peter M. Spiro Peter Spiro, a principal software engineer, is currently involved in optimizing database technology for RISC machines. He has worked on database facilities such as access methods, journaling and recovery, transaction protocols, and buffer management. Peter joined Digital in 1985, after receiving M.S. degrees in forest science and computer science from the University of Wisconsin. He has a patent pending for a method of database journaling and recovery, and he authored a paper for an earlier issue of the Digital Technical Journal. In addition, Peter enjoys the game of Ping-Pong.
TP products for more than ten years. Currently, he is acting technical director for the East Coast Transaction Processing Engineering Group, as well as managing a small advanced development group. After joining Digital in 1976, Mark worked on COBOL compilers for the PDP-11 systems and developed the first native COBOL compiler for the VAX computer. He holds a B.S. (with honors) in computer science from the University of Southern Mississippi.
Vijay Trehan Since joining Digital in 1978, Vijay Trehan has contributed to several architecture projects. He is the technical director responsible for DECtp architecture, design, and standards work. Prior to this assignment, Vijay was the architect for the DECdtm protocol, architect for the DDIS data interchange format, and initiator of work on the DDIF document interchange format and compound document strategy. He holds a B.S. (1972) in mechanical engineering from the Indian Institute of Technology and an M.S. (1974) in operations research from Syracuse University.
William Z. Zahavi As an engineering manager, Bill is responsible for the design and development of predictive sizing tools for transaction processing applications. Before joining Digital in 1987, he was a technical consultant for Sperry Corporation, specializing in systems performance analysis and capacity planning. Bill received an M.B.A. from Northeastern University and a B.S. in mathematics from the University of Virginia. He is an active member of the Computer Measurement Group, and frequently presents at CMG conferences.
Foreword
Carlos G. Borgialli
Senior Manager, DECtp Software Engineering
Transaction processing is one of the largest, most rapidly growing segments of the computer industry. Digital's strategy is to be a leader in transaction processing, and toward that end we are making technological advances and delivering products to meet the evolving needs of businesses that rely on transaction processing systems.
Because of the speed and reliability with which transaction processing systems capture and display up-to-date information, they enable businesses to make well-informed, timely decisions. Industries for which transaction processing systems are a significant asset include banking, laboratory automation, manufacturing, government, and insurance. For these industries and others, transaction processing is an information lifeline that supports the achievement of daily business objectives and in many instances provides a competitive advantage.
Many older transaction processing systems on which businesses rely are centralized and tied to a particular vendor. A great deal of money and time has been invested in these systems to keep pace with business expansion. As expansion continues beyond geographic boundaries, however, the centralized, single-vendor transaction processing systems are less and less likely to offer the flexibility needed for round-the-clock, reliable, business operations conducted worldwide. Transaction processing technology therefore must evolve to respond to the new business environment and at the same time protect the investment made in existing systems.
Our research efforts and innovative products provide the transaction processing systems that businesses need today. The demand for distributed rather than centralized systems has focused attention on system management. Queuing services, highly available systems, heterogeneous environments, security services, and computer-aided software engineering (CASE) are a few examples of areas in which research and advanced development efforts have had and will continue to have a major impact on the capabilities of transaction processing systems.
Transaction processing solutions require the application of a wide range of technology and the integration of multiple software and hardware products: from desktop to mainframe; from presentation services and user interfaces to TP monitors, database systems, and computer-aided software engineering tools; from optimization of system performance to optimization of availability. Making all of this technology work well together is a great challenge, but a challenge Digital is uniquely positioned to meet.
Digital ensures broad application of its transaction processing technology by defining an architecture, the Digital Distributed Transaction Architecture (DECdta). DECdta, about which you will read in this issue, defines the major components of a Digital TP system and the way those components can form an integrated transaction processing system. The DECdta architecture describes how data and processing are easily distributed among multiple VAX processors, as well as how the components can interoperate in a heterogeneous environment.
The DECdta architecture is based on the client/server computing model, which allows Digital to apply its traditional strengths in networking and expandability to transaction processing system solutions. In the DECdta client/server computing model, the client portion interacts with the user to create processing requests, and the server portion performs the data manipulation and computation to execute the processing request. This computing model facilitates the division of a TP system into small components in three ways. It allows for distribution of functions among VAX processors; it partitions the work performed by one or more of the components to allow for parallel processing; and it replicates functions to achieve higher availability goals. These options permit the customer to purchase the configuration that meets present needs, confident that the system will allow smooth expansion in the future.
Further, the DECdta architecture sets a direction for its evolution through different products in a coordinated manner. It provides for the cooperation and interoperation of components implemented on different platforms, and it supports the expansion of customer applications to meet growth requirements. The DECdta architecture is designed to work with other Digital architectures such as the Digital Network Architecture (DNA), the network application services (NAS), and the Digital database architecture (DDA). Moreover, the DECdta architecture supports industry standards that enable the portability of applications and their interoperation in a heterogeneous environment, such as the standard application programming interfaces being developed by the X/Open Transaction Processing Working Group and the IEEE POSIX. Standard wire protocols that provide for systems interoperation in a multivendor, heterogeneous environment are being developed by the International Standards Organization as part of the Open System Interconnection activities.
Among the products Digital has developed specifically for TP systems are the TP monitors. These monitors provide the system integration "glue," if you will. Rather than act as their own systems integrators, customers who use Digital's TP monitors are able to spend more time on solving business problems and less time on solving software integration problems, such as how to make forms and database products work together smoothly.
Digital's TP monitors run on all types of hardware configurations, including local area networks (LANs), wide area networks (WANs), and VAXcluster systems. The DECdta client/server computing model provides the necessary flexibility to change hardware configurations, thus allowing reconfiguration without the need for any source code changes.
The two TP monitors, DECintact and VAX ACMS, integrate vital Digital technologies such as the Digital Distributed Transaction Manager (DECdtm) and products such as Digital's forms systems (DECforms) and our Rdb/VMS or VAX DBMS database products. DECdtm uses the two-phase commit protocol to solve the complex problem of coordinating updates to multiple data resources or databases.
Major developments in Digital's database products have enhanced the strengths of its overall product offerings. The two mainstream database products noted above, Rdb/VMS and VAX DBMS, layer on top of a database kernel called KODA, thus providing data access independent of any data model. The services made available by KODA, besides its high performance, allow Digital's database products to efficiently support TP applications as well as to provide rich functionality for general-purpose database applications.
For those TP systems that require user interfaces, DECforms provides a device-independent, easy-to-use human interface and permits the support of multiple devices and users within a single application.
TP systems that require high availability or continuous operations are supported by the VAX family of hardware and software. The introduction of the fault-tolerant VAXft 3000 system, added to the successful VAXcluster system, allows for a high level of system availability. Performance needs also are being met by a combination of hardware resources, including the VAX 9000 system.
This combination of architecture, software, and hardware technology, and support for emerging industry standards places Digital in an excellent position to become the industry leader for distributed, portable transaction processing systems.
The papers in this issue of the Journal provide a view of the key elements of Digital's distributed transaction processing technologies.
Many individuals, teams, organizations, and business partners are responsible for bringing Digital's TP vision to fruition. Their dedication, hard work, and creativity will continue to drive the development of new technologies that enhance our family of products and services.
Philip A. Bernstein William T. Emberton Vijay Trehan
DECdta - Digital's Distributed Transaction Processing Architecture
Digital's Distributed Transaction Processing Architecture (DECdta) describes the modules and interfaces that are common to Digital's transaction processing (DECtp) products. The architecture allows easy distribution of DECtp products. In particular, it supports client/server style applications. Distributed transaction management is the main function that ties DECdta modules together; it ensures that application programs, database systems, and other resource managers interoperate reliably in a distributed system.
Transaction processing (TP) is the activity of executing requests to access shared resources, typically databases. A computer system that is configured to execute TP applications is called a TP system.
A transaction is an execution of a set of operations on shared resources that has the following properties:
• Atomicity. Either all of the transaction's operations execute, or the transaction has no effect at all.
• Serializability. The set of all operations that execute on behalf of the transaction appears to execute serially with respect to the set of operations executed by every other transaction.
• Durability. The effects of the transaction's operations are resistant to failures.
A transaction terminates by executing the commit or abort operation. Commit tells the system to install the effect of the transaction's operations permanently. Abort tells the system to undo the effects of the transaction's operations.
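The commit and abort operations can be illustrated with a minimal Python sketch. It is not drawn from any DECtp product: the `Transaction` class, the `transfer` function, and the dictionary used as the shared resource are hypothetical names introduced only for illustration. The sketch buffers updates and installs them at commit, which is one simple way to obtain atomicity; it does not address serializability (there is no concurrency control) or durability (nothing is written to stable storage).

```python
class Transaction:
    """Hypothetical sketch: buffers writes so they take effect as a unit."""

    def __init__(self, store):
        self._store = store   # shared resource, assumed here to be a plain dict
        self._writes = {}     # pending updates, not yet visible in the store

    def read(self, key):
        # A transaction sees its own pending writes before the shared store.
        if key in self._writes:
            return self._writes[key]
        return self._store[key]

    def write(self, key, value):
        self._writes[key] = value

    def commit(self):
        # Commit: install the effects of all operations in the shared store.
        self._store.update(self._writes)
        self._writes = {}

    def abort(self):
        # Abort: discard the pending updates so the transaction has no effect.
        self._writes = {}


def transfer(store, src, dst, amount):
    """Move `amount` from src to dst; either both updates happen or neither."""
    txn = Transaction(store)
    try:
        if txn.read(src) < amount:
            raise ValueError("insufficient funds")
        txn.write(src, txn.read(src) - amount)
        txn.write(dst, txn.read(dst) + amount)
    except (KeyError, ValueError):
        txn.abort()
        return False
    txn.commit()
    return True
```

Calling `transfer(accounts, "A", "B", 30)` on `{"A": 100, "B": 50}` commits both updates, while a transfer that exceeds the balance aborts and leaves the dictionary unchanged.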
For enhanced reliability and availability, a TP application uses transactions to execute requests. That is, the application receives a request message (from a display, computer, or other device), executes one or more transactions to process the request, and possibly sends a reply to the originator of the request or to some other party specified by the originator.
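The request flow described above (receive a request message, run a transaction, send a reply) can be sketched in Python as follows. All names here (`run_transaction`, `withdraw`, `handle_request`, and the message fields `origin`, `reply_to`, `account`, `amount`) are assumptions made for illustration, and the in-memory dictionary stands in for a real database.

```python
import copy


def run_transaction(database, operations):
    """Apply every operation to a private copy; install it only if all succeed."""
    working = copy.deepcopy(database)
    try:
        for operation in operations:
            operation(working)
    except (KeyError, ValueError) as exc:
        return False, str(exc)        # abort: the shared database is untouched
    database.clear()
    database.update(working)          # commit: install the effects
    return True, "ok"


def withdraw(account, amount):
    """Build an operation that debits `account` by `amount`."""
    def operation(db):
        if db[account] < amount:
            raise ValueError("insufficient funds")
        db[account] -= amount
    return operation


def handle_request(database, request):
    """Process one request message and build the reply message."""
    committed, detail = run_transaction(
        database, [withdraw(request["account"], request["amount"])]
    )
    return {
        # Reply to the originator unless the request names another party.
        "reply_to": request.get("reply_to") or request["origin"],
        "status": "committed" if committed else "aborted",
        "detail": detail,
    }
```

A request such as `{"origin": "atm-7", "account": "acct-1", "amount": 40}` would produce a reply addressed to `atm-7`, whereas a request carrying a `reply_to` field would have its reply addressed to that party instead.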
TP applications are essential to the operation of many industries, such as finance, retail, health care, transportation, government, communications, and manufacturing. Given the broad range of applications of TP, Digital offers a wide variety of products with which to build TP systems.
DECtp is an umbrella term that refers to Digital's TP products. The goal of DECtp is to offer an integrated set of hardware and software products that supports the development, execution, and management of TP applications for enterprises of all sizes.
DECtp systems include software components that are specialized for TP, notably TP monitors such as the ACMS and DECintact TP monitors, and transaction managers such as the DECdtm transaction manager [1, 2]. DECtp systems also require the integration of general-purpose hardware products (processors, storage, communications, and terminals) and software products (operating systems, database systems, and communication gateways). These products are typically integrated as shown in Figure 1.
Figure 1 Layering of Products to Support a TP Application (layers, top to bottom: TP application; TP monitor, database systems, and forms manager; operating system and communication system)
Applications on DECtp systems can be designed using a client/server paradigm. This paradigm is especially useful for separating the work of preparing a request from that of running transactions. Request preparation can be done by a front-end system, that is, one that is close to the user, in which processor cycles are inexpensive and interactive feedback is easy to obtain. Transaction execution can be done by a larger back-end system, that is, one that manages large databases and may be far from the user. Back-end systems may themselves be distributed. Each back-end system manages a portion of the enterprise database and executes applications, usually ones that make heavy use of the database on that back end. DECtp products are modularized to allow easy distribution across front ends and back ends, which enables them to support client/server style applications. DECtp systems thereby simplify programming and reconfiguration in a distributed system.
Digital's Distributed Transaction Processing Architecture (DECdta) defines the modularization and distribution structure that is common to DECtp products. Distributed transaction management is the main function that ties this structure together. This paper describes the DECdta structure and explains how DECdta components are integrated by distributed transaction management.
Current versions of DECtp products implement most, but not all, modules and interfaces in the DECdta architecture. Gaps between the architecture and products will be filled over time. DECtp products that currently implement DECdta components are referenced throughout the paper.
TP Application Structure
By analyzing TP applications, we can see where the need arises for separate DECdta components. A typical TP application is structured as follows:
Step 1: The client application interacts with a user (a person or machine) to gather input, e.g., using a forms manager.
Step 2: The client maps the user's input into a request, that is, a message that asks the system to perform some work. The client sends the request to a server application to process the request.
A request may be direct or queued. If direct, the client expects a server to process the request right away. If queued, the client deposits the request in a queue from which a server can dequeue the request later.
Step 3: A server processes the request by executing one or more transactions. Each transaction may
a. Access multiple resources
b. Call programs, some of which may be remote
c. Generate requests to execute other transactions
d. Interact with a user
e. Return a reply when the transaction finishes
Step 4: If the transaction produces a reply, then the client interacts with the user to display that reply, e.g., using a forms manager.
Each of the above steps involves the interaction of two or more programs. In many cases, it is desirable that these programs be distributed. To distribute them conveniently, it is important that the programs be in separate components. For example, consider the following:
• The presentation service that operates the display and the application that controls which form to display may be distributed.
One may want to off-load presentation services and related functions to front ends, while allowing programs on back ends to control which forms are displayed to users. This capability is useful in Steps 1, 3d, and 4 above to gather input and display output. To ensure that the presentation service and application can be distributed, the presentation service should correspond to a separate DECdta component.
• The client application that sends a request and the server application that processes the request may be distributed. The applications may communicate through a network or a queue.
In Step 2, front-end applications may want to send requests directly to back-end applications or to place requests in queues that are managed on back ends. Similarly, in Step 3c, a transaction, T, may enqueue a request to run another transaction, where the queue resides on a different system than T. To maximize the flexibility of distributing request management, request management should correspond to a separate DECdta component.
• Two transaction managers that want to run a commit protocol may be distributed.
For a transaction to be distributed across different systems, as in Step 3b, the transaction management services must be distributed. To ensure that each transaction is atomic, the transaction managers on these systems must control transaction commitment using a common commit protocol. To complicate matters, there is more than one widely used protocol for transaction commitment. To the extent possible, a system should allow interoperation of these protocols.
To ensure that transaction managers can be distributed, the transaction manager should be a component of DECdta. To ensure that they can interoperate, their transaction protocol should also be in DECdta. To ensure that different commit protocols can be supported, the part of transaction management that defines the protocol for interaction with remote transaction managers should be separated from the part that coordinates transaction execution across local resources. In the DECdta architecture, the former is called a communication manager, and the latter is called a transaction manager.
Interoperation of transaction managers and resource managers, such as database systems, also affects the modularization of DECdta components. A transaction may involve different types of resources, as in Step 3a. For example, it may update data that is managed by different database systems. To control transaction commitment, the transaction manager must interact with different resource managers, possibly supplied by different vendors. This requires that resource managers be separate components of DECdta.
The DECdta Architecture
Having seen where the need for DECdta components arises, we are now ready to describe the DECdta architecture as a whole, including the functions of and interfaces to each component.
Most DECdta interfaces are public. Some of the public interfaces are controlled by official standards bodies and industry consortia; i.e., they are "open" interfaces. Others are controlled solely by Digital. DECdta interfaces and protocols will be published and aligned with industry standards, as appropriate.
DECdta components are abstract entities. They do not necessarily map one-to-one to hardware components, software components (e.g., programs or products), or execution environments (e.g., a single-threaded process, a multithreaded process, or an operating system service). Rather, a DECdta component may be implemented as multiple software components, for example, as several processes. Alternatively, several DECdta components may be implemented as a single software component. For example, an operating system or TP monitor typically offers the facilities of more than one DECdta component.
The following are the components of DECdta:
• An application program is any program that uses services of DECdta components.
• A resource manager manages resources that support transaction semantics.
• A transaction manager coordinates transaction termination (i.e., commit and abort).
• A communication manager supports a transaction communication protocol between TP systems.
• A presentation manager supports device-independent interactions with a presentation device.
• A request manager facilitates the submission of requests to execute transactions.
DECdta components are layered on services that are provided by the underlying operating system and distributed system platform, and are not specific to TP, as shown in Figure 2.
Application Program
We use the term application program to mean a program that uses the services provided by other DECdta components. An application program could be a customer-written program, a layered product, or a DECdta component.
In the DECdta architecture, we distinguish two special types of application program: request initiators and transaction servers. A request initiator is a DECdta component that prepares and submits a request for the execution of a transaction. To create a request, the request initiator usually interacts with a presentation manager that provides an interface to a device, such as a terminal, a workstation, a digital private branch exchange, or an automated teller machine.
A transaction server can demarcate a transaction, interact with one or more resource managers to access recoverable resources on behalf of the transaction, invoke other transaction servers, and respond to calls from request initiators.
For a simple request, a transaction server receives the request, processes it, and optionally returns a reply to the request initiator. A conversational request is like a simple request, except that while processing the request, the transaction server exchanges one or more messages with the user, usually through the request initiator.
Figure 2 DECdta Components and Interfaces (application programs: request initiator, transaction server; TP services: request manager, presentation manager, resource manager, transaction manager, other communication managers; operating system and distributed system services: distributed name service, distributed time service, thread management service, UID service, authentication service)
In principle, a request initiator could also execute transactions (not shown in Figure 2). That is, the distinction between request initiators and transaction servers is for clarity only, and does not restrict an application from performing request initiation functions in a transaction. Architecturally, this amounts to saying that request initiation functions can execute in a transaction server.
Resource Manager
A resource manager performs operations on shared resources. We are especially interested in recoverable resource managers, those that obey transaction semantics. In particular, a recoverable resource manager undoes a transaction's updates to the resources if the transaction aborts. Other recoverable resource manager activities in support of transactions are described in the next section. In the rest of this paper, we use "resource manager" to mean "recoverable resource manager."
In a TP system, the most common kind of resource manager is a database system. Some presentation managers and communication managers may also be resource managers. A resource manager may be written by a customer, a third party, or Digital.
Each resource manager type offers a resource manager-specific interface that is used by application programs to access and modify recoverable resources managed by the resource manager. A description of these resource manager interfaces is outside the scope of DECdta. However, many of these resource manager interfaces have architectures defined by industry standards, such as SQL (e.g., the VAX Rdb/VMS product), CODASYL data manipulation language (e.g., the VAX DBMS product), and COBOL file operations (e.g., RMS in the VMS system).
One type of resource manager that plays a special role in TP systems is a queue resource manager. It manages recoverable queues, which are often used to store requests [4]. It allows application programs to place elements into queues and retrieve them, so that application programs can communicate even though they execute independently and asynchronously. For example, an application program that sends elements can communicate with one that receives elements even if the two application programs are not operational simultaneously. This communication arrangement improves availability and facilitates batch input of elements.
A queue resource manager interface supports such operations as open-queue, close-queue, enqueue, dequeue, and read-element. The ACMS and DECintact TP monitors both have queue resource managers as components.
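The behavior of a recoverable queue under commit and abort can be sketched as follows. This is an illustration only: the RecoverableQueue class, its method names, and the use of a transaction identifier argument are assumptions, not the ACMS or DECintact interfaces, and the queue lives in memory rather than on stable storage.

```python
from collections import deque


class RecoverableQueue:
    """Hypothetical in-memory queue whose operations obey transaction semantics."""

    def __init__(self):
        self._items = deque()
        self._pending_enqueues = {}   # transaction id -> elements not yet visible
        self._pending_dequeues = {}   # transaction id -> elements removed tentatively

    def enqueue(self, tid, element):
        # The element becomes visible to other transactions only on commit.
        self._pending_enqueues.setdefault(tid, []).append(element)

    def dequeue(self, tid):
        # Raises IndexError if the queue is empty.
        element = self._items.popleft()
        self._pending_dequeues.setdefault(tid, []).append(element)
        return element

    def read_elements(self):
        return list(self._items)

    def commit(self, tid):
        self._items.extend(self._pending_enqueues.pop(tid, []))
        self._pending_dequeues.pop(tid, None)

    def abort(self, tid):
        # Drop tentative enqueues and put dequeued elements back in order.
        self._pending_enqueues.pop(tid, None)
        for element in reversed(self._pending_dequeues.pop(tid, [])):
            self._items.appendleft(element)
```

Because an aborted dequeue returns the element to the head of the queue, a request is not lost when the transaction that was processing it fails.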
Transaction Manager
A transaction manager supports the transaction abstraction. It is responsible for ensuring the atomicity of each transaction by telling each resource manager in a transaction when to commit. It uses a two-phase commit protocol to ensure that either all resource managers accessed by a transaction commit the transaction or they all abort the transaction [3]. To support transaction atomicity, a transaction manager provides the following functions:
• Transaction demarcation operations allow application programs or resource managers to start and commit or abort a transaction. (Resource managers sometimes start a transaction to execute a resource operation if the caller is not executing a transaction. The SQL standard requires this.)
• Transaction execution operations allow resource managers and communication managers to declare themselves part of an existing transaction.
• Two-phase commit operations allow resource managers and communication managers to change a transaction's state (to "prepared," "committed," or "aborted").
The serializability of transactions is primarily the responsibility of the resource managers. Usually, a resource manager ensures serializability by setting locks on resources accessed by each transaction, and by releasing the locks after the transaction manager tells the resource manager to commit. (The latter activity makes serializability partly the responsibility of the transaction manager.) If transactions become deadlocked, a resource manager may detect the deadlock and abort one of the deadlocked transactions.
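The locking discipline described above can be sketched with a small lock table. The sketch assumes exclusive locks only and omits waiting and deadlock detection; the LockTable class and its methods are hypothetical names introduced for illustration.

```python
class LockTable:
    """Hypothetical exclusive-lock table kept by a resource manager."""

    def __init__(self):
        self._owner = {}   # resource -> transaction id holding the lock

    def acquire(self, tid, resource):
        # Grant the lock if it is free or already held by the same transaction.
        holder = self._owner.setdefault(resource, tid)
        return holder == tid

    def release_all(self, tid):
        # Called only after the transaction manager says to commit (or abort).
        for resource in [r for r, owner in self._owner.items() if owner == tid]:
            del self._owner[resource]
```

Holding every lock until the commit decision arrives is what prevents another transaction from observing effects that might still be undone.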
The durability of transactions is a responsibility of transaction managers and resource managers. The transaction manager is responsible for the durability of the commit or abort decision. A resource manager is responsible for the durability of operations of committed transactions. Usually, it ensures durability by storing a description of each transaction's resource operations and state changes in a stable (e.g., disk-resident) log. It can later use the log to reconstruct transactions' states while recovering from a failure.
A detailed description of the DECdta transaction manager component appears in the Transaction Manager Architecture section.
Communication Manager
A communication manager provides services for communication between named objects in a TP system, such as application programs and transaction managers. Some communication managers participate in coordinating the termination of a transaction by propagating the transaction manager's two-phase commit operations as messages to remote communication managers. Other communication managers propagate application data and transaction context, such as a transaction identifier, from one node to another. Some do both.
A TP system can support multiple communication managers. These communication managers can interact with other nodes using different commit protocols or message-passing protocols, and may be part of different name spaces, security domains, system management domains, etc. Examples are an IBM SNA LU6.2 communication manager or an ISO-TP communication manager.
By supporting multiple communication managers, the DECdta architecture enhances the interoperability of TP systems. Different TP systems can interoperate by executing a transaction using different commit protocols.
A communication manager offers an interface for application programs to communicate with other application programs. Different communication managers may offer different communication paradigms, such as remote procedure call or peer-to-peer message passing.
A communication manager also has an interface to its local transaction manager. It uses this interface to tell the transaction manager when a transaction has spread to a new node and to obtain information about transaction commitment, which it exchanges with communication managers on remote nodes.
Presentation Manager
A presentation manager provides an application program with a record-oriented interface to a presentation device. Its services are used by application programs, usually request initiators. By using presentation manager services, instead of directly accessing a presentation device, application programs become device independent.
A forms manager is one type of presentation manager. Just as a database system supports operations to define, open, close, and access databases, a forms manager supports operations to define, enable, disable, and access forms. A form includes the definition of the fields (with different attributes) that make up the form. It also includes services to map the fields into device-independent application records, to perform data validation, and to perform data conversion to map fields onto device-specific frames.
One presentation manager is Digital's DECforms forms management product. The DECforms product is the first implementation of the ANSI/ISO Forms Interface Management Systems standard (CODASYL FIMS) [5].
Request Manager
A request manager provides services to authenticate the source of requests (a user and/or a presentation device), to submit requests, and to receive replies from the execution of requests. It supports such operations as send-request and receive-reply. Send-request must provide the identity of the source device, the identity of the user who entered the request, the identity of the application program to be invoked, and the input data to the program.
A request manager can either pass the request directly to an application program, or it can store requests in a queue. In the latter case, another request manager can subsequently schedule the request by dequeuing the request and invoking an application program. The ACMS System Interface is an example of an existing request manager interface for direct requests. The ACMS Queued Transaction Initiator is an example of a request manager that schedules queued requests [1].
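The contents of a request and the two submission paths can be sketched as follows. The Request fields mirror the four items that send-request must provide; everything else (the class names, the queued flag, the schedule_next method, and the use of one object for both submission and scheduling) is an assumption made for brevity, and the queue here is an ordinary in-memory queue rather than a recoverable one.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    source_device: str   # identity of the source device
    user: str            # identity of the user who entered the request
    application: str     # identity of the application program to invoke
    input_data: dict     # input data to the program


class RequestManager:
    """Hypothetical request manager supporting direct and queued requests."""

    def __init__(self, applications):
        self.applications = applications   # application name -> callable
        self.queue = deque()

    def send_request(self, request, queued=False):
        if queued:
            # Queued: deposit the request for later scheduling; no reply yet.
            self.queue.append(request)
            return None
        # Direct: invoke the application program right away and return its reply.
        return self.applications[request.application](request)

    def schedule_next(self):
        # Dequeue the oldest request and invoke its application program.
        request = self.queue.popleft()
        return self.applications[request.application](request)
```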
Transaction Manager Architecture
DECdta components are tied together by the transaction abstraction. Transactions allow application programs, resource managers, request managers (indirectly through queue resource managers), and communication managers to interoperate reliably. Since transactions play an especially important role in the DECdta architecture, we describe the transaction management functions in more detail.
The DECdta architecture includes interfaces between transaction managers and application programs, resource managers, and communication managers, as shown in Figure 3. It also includes a transaction manager protocol, whose messages are propagated by communication managers. This protocol is used by Digital's DECdtm distributed transaction manager [2].
Figure 3 Transaction Manager Architecture (legible labels: application program, other communication managers)
From a transaction manager's viewpoint, a transaction consists of transaction demarcation operations, transaction execution operations, two-phase commit operations, and recovery operations.
• The transaction demarcation operations are issued by an application program to a transaction manager and include operations to start and either end or abort a transaction.
• Transaction execution operations are issued by resource managers and communication managers to a transaction manager. They include operations
- For a resource manager or communication manager to join an existing transaction
- For a communication manager to tell a transaction manager to start a new branch of a transaction that already exists at another node
• Two-phase commit operations are issued by a transaction manager to resource managers, communication managers, and through communication managers to other transaction managers, and vice-versa. They include operations
- For a transaction manager to ask a resource manager or communication manager to prepare, commit, or abort a transaction
- For a resource manager or communication manager to tell a transaction manager whether it has prepared, committed, or aborted a transaction
- For a communication manager to ask a transaction manager to prepare, commit, or abort a transaction
- For a transaction manager to tell a communication manager whether it has prepared, committed, or aborted a transaction
• Recovery operations are issued by a resource manager to its transaction manager to determine the state of a transaction (i.e., committed or aborted).
In response to a start operation invoked by an application program, the transaction manager dispenses a unique transaction identifier for the transaction. The transaction manager that processes the start operation is that transaction's home transaction manager.
When an application program invokes an operation supported by a resource manager, the resource manager must find out the transaction identifier of the application program's transaction. This can happen in different ways. For example, the application program may tag the operation with the transaction identifier, or the resource manager may look up the transaction identifier in the application program's context. When a resource manager receives its first operation on behalf of a transaction, T, it must join T, meaning that it must tell a transaction manager that it is a subordinate for T. Alternatively, the DECdta architecture supports a model in which a resource manager may ask to be joined automatically to all transactions managed by its transaction manager, rather than asking to join each transaction separately.
A transaction, T, spreads from one node, Node 1, to another node, Node 2, by sending a message (through a communication manager) from an application program that is executing T at Node 1 to an application program at Node 2. When T sends a message from Node 1 to Node 2 for the first time, the communication managers at Node 1 and Node 2 must perform branch registration. This function may be performed automatically by the communication managers. Or, it may be done manually by the application program, which tells the communication managers at Node 1 and Node 2 that the transaction has spread to Node 2. In either case, the result is as follows: the communication manager at Node 1 becomes the subordinate of the transaction manager at Node 1 for T and the superior of the communication manager at Node 2 for T; and the communication manager at Node 2 becomes the superior of the transaction manager at Node 2 for T. This arrangement allows the commit protocol between transaction managers to be propagated properly by communication managers.
After the transaction is done with its application work, the application program that started transaction T may invoke an "end" operation at the home transaction manager to commit T. This causes the home transaction manager to ask its subordinate resource managers and communication managers to try to commit T. The transaction manager does this by using a two-phase commit protocol. The protocol ensures that either all subordinate resource managers commit the transaction or they all abort the transaction.
In phase 1, the home transaction manager asks its subordinates for T to prepare T. A subordinate prepares T by doing what is necessary to guarantee that it can either commit T or abort T if asked to do so by its superior; this guarantee is valid even if it fails immediately after becoming prepared. To prepare T,
• Each subordinate for T recursively propagates the prepare request to its subordinates for T
• Each resource manager subordinate writes all of T's updates to stable storage
• Each resource manager and transaction manager subordinate writes a prepare-record to stable storage
A subordinate for T replies with a "yes" vote if and when it has completed its stable writes and all of its subordinates for T have voted "yes"; otherwise, it votes "no." If any subordinate for T does not acknowledge the request to prepare within the timeout period, then the home transaction manager aborts T; the effect is the same as issuing an abort operation.
In phase 2, when the home transaction manager has received "yes" votes from all of its subordinates for T, it decides to commit T. It writes a commit record for T to stable storage and tells its subordinates for T to commit T. Each subordinate for T writes a commit record for T to stable storage and recursively propagates the commit request to its subordinates for T. A subordinate for T replies with an acknowledgment if and when it has committed the transaction (in the case of a resource manager subordinate) and has received acknowledgments from all subordinates for T. When the home transaction manager receives acknowledgments from all of its subordinates for T, the transaction commitment is complete.
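The two phases can be summarized in a compact sketch. It is a simplification under stated assumptions: participants are local objects rather than remote nodes, a Python list stands in for each participant's stable log, and timeouts and acknowledgments are omitted. The Participant class and the end_transaction function are hypothetical names, not DECdtm interfaces.

```python
class Participant:
    """Hypothetical subordinate (resource manager or communication manager)."""

    def __init__(self, name, subordinates=(), can_prepare=True):
        self.name = name
        self.subordinates = list(subordinates)
        self.can_prepare = can_prepare
        self.log = []   # stands in for a log on stable storage

    def prepare(self):
        # Phase 1: propagate the prepare request, then vote.
        votes = [sub.prepare() for sub in self.subordinates]
        if self.can_prepare and all(votes):
            self.log.append("prepare")
            return True    # "yes" vote
        return False       # "no" vote

    def commit(self):
        # Phase 2: record the decision, then propagate it.
        self.log.append("commit")
        for sub in self.subordinates:
            sub.commit()

    def abort(self):
        self.log.append("abort")
        for sub in self.subordinates:
            sub.abort()


def end_transaction(home_log, subordinates):
    """Run both phases from the home transaction manager's point of view."""
    votes = [sub.prepare() for sub in subordinates]
    if all(votes):
        home_log.append("commit")   # the decision is made durable first
        for sub in subordinates:
            sub.commit()
        return "committed"
    for sub in subordinates:
        sub.abort()
    return "aborted"
```

A single "no" vote anywhere in the tree causes every participant to abort, which is the all-or-nothing outcome the protocol is meant to guarantee.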
To recover from a failure, all resource managers that participated in a transaction must examine their logs on stable storage to determine what to do. If the log contains a commit or abort record for T, then T completed. No action is required. If the log contains no prepare, commit, or abort record for T, then T was active. T must be aborted. If the log contains a prepare record for T, but no commit or abort record for T, T was between phases 1 and 2. The resource manager must ask its superior transaction manager whether to commit or abort the transaction.
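The three recovery cases can be expressed as a small decision function. The function name, the record names, and the returned strings are assumptions for illustration; the input is the collection of log records found for one transaction T.

```python
def recovery_action(records_for_t):
    """Decide what a resource manager does for T, given T's records in its stable log."""
    if "commit" in records_for_t or "abort" in records_for_t:
        return "no action"        # T completed before the failure
    if "prepare" in records_for_t:
        return "ask superior"     # T was between phases 1 and 2
    return "abort"                # T was still active
```

For example, a log holding only a prepare record for T yields "ask superior", the case in which the resource manager cannot decide on its own.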
An inherent problem in all two-phase commit protocols is that a resource manager is blocked between phases 1 and 2, that is, after voting "yes" and before receiving the commit or abort decision. It cannot commit or abort the transaction until the transaction manager tells it which to do. If its transaction manager fails, the resource manager may be blocked indefinitely, until either the transaction manager recovers or an external agent, such as a system manager, steps in to tell the resource manager whether to commit or abort.
A transaction T may spontaneously abort due to system errors at any time during its execution. Or, an application program (prior to completing its work) or a resource manager (prior to voting "yes") may tell its transaction manager to abort T. In either case, the transaction manager then tells all of its subordinates for T to undo the effects of T's resource manager operations. Subordinate resource managers abort T, and subordinate communication managers recursively propagate the abort request to their subordinates for T.
The two-phase commit protocol is optimized for those cases in which the number of messages exchanged can be reduced below that of the general case (e.g., if there is only one subordinate resource manager, if a resource manager did not modify resources, or if the presumed-abort protocol was used to save acknowledgments) [6].
Summary
We have presented an overview of the DECdta architecture. As part of this overview, we introduced the components and explained the function of each interface. We also described the DECdta transaction management architecture in some detail. Over time, many interfaces of the DECdta model will be made public via product offerings or architecture publications.
Acknowledgments
This architecture grew from discussions with many colleagues. We thank them all for their help, especially Dieter Gawlick, Bill Laing, Dave Lomet, Bruce Mann, Barry Rubinson, Diogenes Torres, and the TP architecture group, including Edward Braginsky, Tony DellaFera, George Gajnak, Per Gyllstrom, and Yoav Raz.
References
1. T. Speer and M. Storm, "Digital's Transaction Processing Monitors," Digital Technical Journal, vol. 3, no. 1 (Winter 1991, this issue): 18-32.
2. W. Laing, J. Johnson, and R. Landau, "Transaction Management Support in the VMS Operating System Kernel," Digital Technical Journal, vol. 3, no. 1 (Winter 1991, this issue): 33-44.
3. P. Bernstein, V. Hadzilacos, and N. Goodman, Concurrency Control and Recovery in Database Systems (Reading, MA: Addison-Wesley, 1987).
4. P. Bernstein, M. Hsu, and B. Mann, "Implementing Recoverable Requests Using Queues," Proceedings 1990 ACM SIGMOD Conference on Management of Data (May 1990).
5. FIMS Journal of Development (Norfolk, VA: CODASYL FIMS Committee, July 1990).
6. C. Mohan, B. Lindsay, and R. Obermarck, "Transaction Management in the R* Distributed Database Management System," ACM Transactions on Database Systems, vol. 11, no. 4 (December 1986).