Uncorrectable memory errors experienced by the CPU are reported as machine checks. These machine checks are synchronous with the PC making the reference. Uncorrectable memory errors occur when data is lost by the memory controller and cannot be re-created by its ECC circuitry; fortunately, these errors seldom occur. Uncorrectable memory errors represent a serious problem to the execution thread that experiences them. The hardware cannot assist in the recovery of this type of error; recovery is totally a software function.
Vol. 4 No. 3 Summer 1992 Digital Technical Journal
If the page that experiences an uncorrectable error is a process private page that has not been modified, and the code thread currently executing is at pageable priority, the error is not considered fatal. The error-handling routines arrange for the page to be re-created in a different physical page in memory by invalidating the necessary memory management structures. As a result, a translation-not-valid exception occurs when the instruction that experienced the exception is retried. The page fault mechanisms of the VMS system do the actual re-creation. The original page with the error is put on a list of bad pages internal to the VMS system. If
and verify had to be built into error handling to produce a predictable, robust, and quality product.
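As a rough illustration of the recovery rule for uncorrectable memory errors described above, the following Python sketch models the decision and its effect on the memory management structures. It is not VMS code: the `Page` and `MemoryManager` classes, the frame numbers, and the method names are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    pfn: int                # physical frame number currently backing the page
    process_private: bool   # True for a process private page
    modified: bool          # True if the page has been written since it was read in


@dataclass
class MemoryManager:
    """Hypothetical stand-in for the memory management structures."""
    mappings: dict = field(default_factory=dict)    # virtual page -> Page
    bad_pages: list = field(default_factory=list)   # retired physical frames
    next_free_pfn: int = 1000                       # arbitrary starting frame

    def handle_uncorrectable_error(self, vpn, at_pageable_priority):
        """Return True if the error is recoverable, False if it is fatal."""
        page = self.mappings[vpn]
        recoverable = (page.process_private
                       and not page.modified
                       and at_pageable_priority)
        if not recoverable:
            return False
        # Invalidate the translation so that the retried instruction
        # takes a translation-not-valid exception.
        del self.mappings[vpn]
        # Put the failing physical page on the list of bad pages.
        self.bad_pages.append(page.pfn)
        return True

    def page_fault(self, vpn):
        """Stand-in for the page fault path: re-create the page elsewhere."""
        page = Page(self.next_free_pfn, process_private=True, modified=False)
        self.next_free_pfn += 1
        self.mappings[vpn] = page
        return page
```

In this model, a retried reference to the invalidated virtual page would call `page_fault`, which backs the page with a different physical frame while the failing frame stays on `bad_pages`.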
Although the VAX 6000 family and CPUs in general have a number of features that allow errors to be generated, they tend not to be general-purpose. In most cases, they are designed for use by special diagnostic software that does not operate in the context of an operating system, e.g., the VMS operating system. We chose to implement a scheme whereby errors would be simulated in software on the target hardware. This approach gave us several clear advantages. The most important was that the approach could be extended as the power and complexity of CPU models increased and that complete control was with the designers. No special hardware equipment or CPU feature would be required. The only precondition was that certain software implementation guidelines had to be followed to make use of the simulator.
Machine check test (MTEST) consists of two parts, a utility and an error-handling implementation methodology. The methodology consists of using main memory storage as the primary agent that is acted upon by error handling. This method also fit into our model of retaining data in memory. The other requirement was the strategic placement of the DEBUG_TRANSFER macro. DEBUG_TRANSFER expands to produce a code segment that determines if the current error being serviced is an error simulation or not. If it is, data that resides in memory that is being interrogated is modified, in concert with MTEST, to reflect the error condition being simulated. DEBUG_TRANSFER code segments represent synchronization points between an error-handling execution thread and the MTEST simulator.
VAX 6000 Error Handling: A Pragmatic Approach
The MTEST simulator is a privileged image and consists of a user interface, a number of nonpageable internal buffers, and simulator routines. The user interface allows the internal buffers to be selected and loaded with data patterns of the user's choice. The user interface also allows the user to
would determine that this was an error being simulated and return control to MTEST. MTEST would then decide if the synchronization point was one for which the user has data. The data would be transferred from the buffer named in the DEBUG_TRANSFER code segment to the address also declared in the segment. By judiciously placing the DEBUG_TRANSFER synchronization points and carefully selecting an appropriate data pattern, we were able to simulate any and all error conditions for the appropriate CPU.
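The interaction between a synchronization point and the simulator can be pictured with a small Python model. This is only a sketch of the idea under stated assumptions: the real DEBUG_TRANSFER is a macro in VMS error-handling code, and the class, function, buffer, and synchronization point names below, as well as the status bit layout, are invented for illustration.

```python
class MTestSimulator:
    """Toy model of the simulator: named buffers plus armed sync points."""

    def __init__(self):
        self.buffers = {}        # buffer name -> data pattern chosen by the user
        self.armed = set()       # sync points for which the user supplied data
        self.simulating = False  # True while an error simulation is in progress

    def load_buffer(self, name, pattern):
        self.buffers[name] = bytes(pattern)

    def arm(self, sync_point):
        self.armed.add(sync_point)

    def transfer(self, sync_point, buffer_name, memory, address):
        """Copy the named buffer into memory if this sync point is armed."""
        data = self.buffers.get(buffer_name)
        if sync_point not in self.armed or data is None:
            return False
        memory[address:address + len(data)] = data
        return True


def debug_transfer(sim, sync_point, buffer_name, memory, address):
    """Stand-in for the code segment that DEBUG_TRANSFER expands to."""
    if sim.simulating:  # a real error leaves memory untouched
        sim.transfer(sync_point, buffer_name, memory, address)


def handle_machine_check(sim, memory):
    """Hypothetical handler that interrogates error state held in memory."""
    status_addr = 0
    # Synchronization point: the simulator may overwrite the saved status.
    debug_transfer(sim, "status_saved", "status_buf", memory, status_addr)
    status = memory[status_addr]
    return "uncorrectable memory error" if status & 0x01 else "no error"
```

Because the handler reads its error state from main memory rather than directly from hardware registers, loading `status_buf` with a chosen pattern and arming `status_saved` is enough to drive the handler down the path for that error in this model.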
In this way, we were able to verify many complex algorithms and code paths that would have been difficult to exercise. We were also able to verify error handling and error logging from the point of error to the error log file. MTEST can be either interactive or procedure-driven. This aspect allowed us to maintain a library of procedures that could be used at any time to verify that operational characteristics for individual errors had not changed when code paths that affected many error types were modified.
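The procedure library amounts to a regression suite for error handling. A minimal Python sketch of that idea follows; the status bits, error names, and the `classify` handler are hypothetical and stand in for a shared code path whose behavior must not change for individual errors.

```python
def classify(status_byte):
    """Stand-in for a code path shared by several error types."""
    if status_byte & 0x01:
        return "uncorrectable memory error"
    if status_byte & 0x02:
        return "cache parity error"
    return "no error"


# Library of procedures: name -> (data pattern to inject, expected log entry).
# Both the patterns and the entries are invented for this example.
PROCEDURES = {
    "mem_uncorrectable": (0x01, "uncorrectable memory error"),
    "cache_parity": (0x02, "cache parity error"),
    "clean": (0x00, "no error"),
}


def run_library(handler, procedures):
    """Replay every procedure and return the names whose result changed."""
    failures = []
    for name, (pattern, expected) in sorted(procedures.items()):
        if handler(pattern) != expected:
            failures.append(name)
    return failures
```

Calling `run_library(classify, PROCEDURES)` after a change to `classify` reports any individual error whose logged result no longer matches the recorded expectation.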
MTEST was the primary tool we used for testing.
During the test and verification phase, prototype hardware that had real error conditions became available, and we used these prototypes.
Conclusions
The VAX 6000 family now has a robust and complete set of error-handling routines that accomplished our project goals. In fact, many routines were never before part of the VMS system. These routines include the ability to report complete error context to the system console and the ability to group failures occurring across the system to a single error log entry. An important SMP feature is the ability to recognize and retire failing processors from the active set of a VMS session and allow the session to continue.
NVAX-microprocessor VAX Systems
These routines and others support the entire range of VAX 6000 CPU models. The object-oriented approach to error conditions not on the CPU module has made support and introduction of newer routines easier. The ability to test at will any or all error-handling routines has been a tremendous advantage.
Acknowledgments
Our success resulted from a number of factors, including the advantages of designing the ability to test into the product. There is no substitute for actually executing a code thread to determine the effectiveness of its design goal. The various engineering groups involved in designing the many 6000 CPUs showed great discipline in producing engineering specifications that met the needs of both hardware and software engineering groups. The many hours spent painstakingly describing intricate details of error conditions and the production of parse trees allowed the structured approach we set out to achieve. Special thanks to Mike Uhler for his parse trees and to Nick Carr, who suggested this paper be written.
Reference
1. G. Uhler et al., "The NVAX and NVAX+ High-performance VAX Microprocessors," Digital Technical Journal, vol. 4, no. 3 (Summer 1992, this issue): 11-23.
ISSN 0898-901X
Printed in U.S.A. EY-J884E-DP/92 11 02 19.0 Copyright © Digital Equipment Corporation. All Rights Reserved.