
Uncorrectable Memory Errors

Uncorrectable memory errors experienced by the CPU are reported as machine checks. These machine checks are synchronous with the PC making the reference. Uncorrectable memory errors occur when data is lost by the memory controller and cannot be re-created by its ECC circuitry; fortunately, these errors seldom occur. Uncorrectable memory errors represent a serious problem to the execution thread that experiences them. The hardware cannot assist in the recovery of this type of error; recovery is totally a software function.

If the page that experiences an uncorrectable error is a process private page that has not been modified, and the code thread currently executing is at pageable priority, the error is not considered fatal. The error-handling routines arrange for the page to be re-created in a different physical page in memory by invalidating the necessary memory management structures. As a result, a translation-not-valid exception occurs when the instruction that experienced the exception is retried. The page fault mechanisms of the VMS system do the actual re-creation. The original page with the error is put on a list of bad pages internal to the VMS system. If [...]

Vol. 4 No. 3 Summer 1992 Digital Technical Journal

[...] and verify had to be built into error handling to produce a predictable, robust, and quality product.
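The recovery rule for an uncorrectable memory error on an unmodified process private page, described earlier in this section, can be illustrated with a small model. This is only a sketch under stated assumptions: the names `Page`, `MemoryManager`, `handle_uncorrectable_error`, and `page_fault` are hypothetical and are not taken from the VMS sources, and a Python dictionary stands in for the real memory management structures.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class Outcome(Enum):
    RECOVERED = auto()  # mapping invalidated; page fault path will re-create the page
    FATAL = auto()      # software cannot recover the lost data


@dataclass
class Page:
    pfn: int               # physical page frame number (hypothetical field)
    process_private: bool
    modified: bool


@dataclass
class MemoryManager:
    # Virtual page number -> Page; a stand-in for the real mapping structures.
    valid_mappings: dict = field(default_factory=dict)
    # Physical pages retired after an uncorrectable error.
    bad_pages: list = field(default_factory=list)

    def handle_uncorrectable_error(self, vpn, at_pageable_priority):
        page = self.valid_mappings[vpn]
        recoverable = (
            page.process_private
            and not page.modified
            and at_pageable_priority
        )
        if not recoverable:
            return Outcome.FATAL
        # Invalidate the mapping so the retried instruction takes a
        # translation-not-valid exception instead of touching bad memory.
        del self.valid_mappings[vpn]
        # Keep the failing physical page out of further use.
        self.bad_pages.append(page.pfn)
        return Outcome.RECOVERED

    def page_fault(self, vpn, free_pfn):
        # Re-create the page contents in a different physical page.
        self.valid_mappings[vpn] = Page(free_pfn, process_private=True, modified=False)
```

In this model, a modified page or a thread that is not at pageable priority yields `Outcome.FATAL`, which mirrors the conditions stated in the text; everything else about the real handler is omitted.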

Although the VAX 6000 family and CPUs in general have a number of features that allow errors to be generated, they tend not to be general-purpose. In most cases, they are designed for use by special diagnostic software that does not operate in the context of an operating system, e.g., the VMS operating system. We chose to implement a scheme whereby errors would be simulated in software on the target hardware. This approach gave us several clear advantages. The most important was that the approach could be extended as the power and complexity of CPU models increased and that complete control was with the designers. No special hardware equipment or CPU feature would be required. The only precondition was that certain software implementation guidelines had to be followed to make use of the simulator.

Machine check test (MTEST) consists of two parts, a utility and an error-handling implementation methodology. The methodology consists of using main memory storage as the primary agent that is acted upon by error handling. This method also fit into our model of retaining data in memory.

The other requirement was the strategic placement of the DEBUG_TRANSFER macro. DEBUG_TRANSFER expands to produce a code segment that determines if the current error being serviced is an error simulation or not. If it is, data that resides in memory that is being interrogated is modified, in concert with MTEST, to reflect the error condition being simulated. DEBUG_TRANSFER code segments represent synchronization points between an error-handling execution thread and the MTEST simulator.

VAX 6000 Error Handling: A Pragmatic Approach

The MTEST simulator is a privileged image and consists of a user interface, a number of nonpageable internal buffers, and simulator routines. The user interface allows the internal buffers to be selected and loaded with data patterns of the user's choice. The user interface also allows the user to [...]

[...] would determine that this was an error being simulated and return control to MTEST. MTEST would then decide if the synchronization point was one for which the user has data. The data would be transferred from the buffer named in the DEBUG_TRANSFER code segment to the address also declared in the segment. By judiciously placing the DEBUG_TRANSFER synchronization points and carefully selecting an appropriate data pattern, we were able to simulate any and all error conditions for the appropriate CPU.
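The interaction between a DEBUG_TRANSFER synchronization point and the simulator can be sketched as follows. This is a minimal model, not the actual macro expansion: the class `MtestSimulator`, the function `debug_transfer`, the `arm` call, and the buffer and synchronization point names are all hypothetical, and a `bytearray` stands in for main memory.

```python
class MtestSimulator:
    """Hypothetical stand-in for the simulator side of MTEST."""

    def __init__(self):
        self.buffers = {}          # buffer name -> data pattern chosen by the user
        self.armed_points = set()  # synchronization points the user has data for
        self.simulating = False    # True while the error being serviced is simulated

    def load_buffer(self, name, pattern):
        self.buffers[name] = bytes(pattern)

    def arm(self, sync_point):
        self.armed_points.add(sync_point)

    def transfer(self, sync_point, buffer_name, memory, address):
        # Copy only when the user supplied data for this synchronization point.
        if sync_point not in self.armed_points or buffer_name not in self.buffers:
            return False
        data = self.buffers[buffer_name]
        if address + len(data) > len(memory):
            raise ValueError("pattern does not fit at the declared address")
        memory[address:address + len(data)] = data
        return True


def debug_transfer(sim, sync_point, buffer_name, memory, address):
    """Model of one synchronization point placed in an error-handling routine."""
    if not sim.simulating:
        return False  # a real error: leave the data in memory untouched
    return sim.transfer(sync_point, buffer_name, memory, address)
```

An error handler in this model would call `debug_transfer` just before it interrogates a location in memory, so that a simulated run sees the user's pattern there while a real error sees the data the hardware left behind.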

In this way, we were able to verify many complex algorithms and code paths that would have been difficult to exercise. We were also able to verify error handling and error logging from the point of error to the error log file. MTEST can be either interactive or procedure-driven. This aspect allowed us to maintain a library of procedures that could be used at any time to verify that operational characteristics for individual errors had not changed when code paths that affected many error types were modified.
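The idea of a procedure library used for regression checks can be illustrated with a toy example. Everything here is an assumption made for illustration: the status bit meanings, the log entry strings, and the names `log_entry_for`, `PROCEDURES`, and `run_library` are hypothetical and do not describe the real MTEST procedures or the VMS error log format.

```python
def log_entry_for(status):
    """Toy error handler: classify a status word into an error log entry."""
    if status & 0x80:
        return "uncorrectable memory error"
    if status & 0x01:
        return "corrected memory error"
    return "no error"


# Each procedure records the pattern to simulate and the log entry that
# was observed when the procedure was last verified.
PROCEDURES = [
    ("uncorrectable", 0x80, "uncorrectable memory error"),
    ("corrected", 0x01, "corrected memory error"),
    ("clean", 0x00, "no error"),
]


def run_library(procedures, handler):
    """Return the names of procedures whose log entry no longer matches."""
    return [
        name
        for name, pattern, expected in procedures
        if handler(pattern) != expected
    ]
```

Running `run_library(PROCEDURES, log_entry_for)` after a change to shared code gives an empty list when behavior for each individual error is unchanged, and otherwise names the procedures that regressed.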

MTEST was the primary tool we used for testing. During the test and verification phase, prototype hardware that had real error conditions became available, and we used these prototypes.

Conclusions

The VAX 6000 family now has a robust and complete set of error-handling routines that accomplished our project goals. In fact, many routines were never before part of the VMS system. These routines include the ability to report complete error context to the system console and the ability to group failures occurring across the system to a single error log entry. An important SMP feature is the ability to recognize and retire failing processors from the active set of a VMS session and allow the session to continue. These routines and others support the entire range of VAX 6000 CPU models. The object-oriented approach to error conditions not on the CPU module has made support and introduction of newer routines easier. The ability to test at will any or all error-handling routines has been a tremendous advantage.

NVAX-microprocessor VAX Systems

Acknowledgments

Our success resulted from a number of factors, including the advantages of designing the ability to test into the product. There is no substitute for actually executing a code thread to determine the effectiveness of its design goal. The various engineering groups involved in designing the many 6000 CPUs showed great discipline in producing engineering specifications that met the needs of both hardware and software engineering groups. The many hours spent painstakingly describing intricate details of error conditions and the production of parse trees allowed the structured approach we set out to achieve. Special thanks to Mike Uhler for his parse trees and to Nick Carr, who suggested this paper be written.

Reference

1. G. Uhler et al., "The NVAX and NVAX+ High-performance VAX Microprocessors," Digital Technical Journal, vol. 4, no. 3 (Summer 1992, this issue): 11-23.


ISSN 0898-901X

Printed in U.S.A. EY-J884E-DP/92 11 02 19.0 Copyright © Digital Equipment Corporation. All Rights Reserved.