DISK ERRORS REQUIRING RECOVERY - DCU-4 DISK ERROR RECOVERY

Step IOP Description

20. MIOP ACOM releases Buffer Memory space for the DAL and any Local Memory space

3.3 DCU-4 DISK ERROR RECOVERY

3.3.1 DISK ERRORS REQUIRING RECOVERY

Disk storage units signal an error condition by setting the done bit and

Angular position counter failure Lost function


3.3.1.1 Data error

Data errors are detected on read and write functions when the hardware senses that the correct data has not been transferred as requested. The Kernel disk interrupt answering routine senses that both Done and Busy flags are set and reads the disk status. If any of the bits between 3 and 6 are set in the status parcel, an error occurred in transferring data from the disk. If any of the bits between 9 and 12 are set, an error occurred when trying to transfer data to the disk.
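The status test described above can be sketched as a pair of bit masks. This is an illustration only: it assumes that bit n of the status parcel carries the value 2**n and that the ranges "between 3 and 6" and "between 9 and 12" are inclusive. Neither assumption is stated in the text, and the names bit_range_mask and classify_data_error are hypothetical.

```python
# Assumption: bit n of the 16-bit status parcel has the value 1 << n,
# and both bit ranges quoted in the text are inclusive.

def bit_range_mask(lo, hi):
    """Return a mask with bits lo..hi (inclusive) set."""
    return ((1 << (hi - lo + 1)) - 1) << lo

READ_DATA_ERROR_MASK = bit_range_mask(3, 6)     # bits 3, 4, 5, 6
WRITE_DATA_ERROR_MASK = bit_range_mask(9, 12)   # bits 9, 10, 11, 12

def classify_data_error(status):
    """Return 'read', 'write', or None for a status parcel value."""
    if status & READ_DATA_ERROR_MASK:
        return "read"    # error transferring data from the disk
    if status & WRITE_DATA_ERROR_MASK:
        return "write"   # error transferring data to the disk
    return None          # not a data error; other classes apply
```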

3.3.1.1.1 Recovery for data errors on read operations - When a data error is encountered, the Kernel tries to recover the data with a series of operations. The recovery sequence occurs in the following order:

1. Error recovery repeats the read operation a fixed number of times to determine if the error is transient.

2. If the function repetition fails, recovery is attempted through cylinder margin selection, read early/late selection, or combinations of the two. The READSEQ table in ERRECK controls the sequence of events and contains margin and read early/late parameters.

3. Disk error correction is attempted for data errors if cylinder margin and read early/late selection retries are unsuccessful.

Error recovery reads the data and the associated error correction code without cylinder offset or read early/late selection. The overlay FIRECODE is called to generate correction vectors and correct the data, if possible. The error correction algorithm corrects data in a single burst of 11 bits or less for each of the four read heads.

The disk error correction feature can be disabled if desired (see the I@IOSECC parameter description in the COS Operational Procedures Reference Manual, SM-0043, or the UNICOS System Administrator's Guide for CRAY Y-MP, CRAY X-MP, and CRAY-1 Computer Systems, publication SG-2018).

4. If none of the preceding procedures is successful, error recovery sends the sector of data containing the error to the mainframe along with a status indicating the unsuccessful data request.

The remainder of the current disk request is thrown away, and the Kernel continues processing any subsequent requests.
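The four-step read recovery sequence above can be sketched as follows. This is a minimal illustration, not the ERRECK code: the retry count and the contents of READSEQ shown here are hypothetical placeholders (the text gives neither), and the callables read, read_raw_with_ecc, and correct stand in for the disk function, the uncorrected read, and the FIRECODE overlay.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical values: the text says only "a fixed number of times"
# and does not list the contents of the READSEQ table.
MAX_PLAIN_RETRIES = 3
READSEQ: Sequence[Tuple[int, int]] = [  # (cylinder margin, read early/late)
    (+1, 0), (-1, 0), (0, +1), (0, -1), (+1, +1), (-1, -1),
]

@dataclass
class ReadResult:
    ok: bool      # True if good data was obtained
    data: bytes   # good data, or the sector containing the error
    stage: str    # recovery stage that produced the result

def recover_read(
    read: Callable[[int, int], Optional[bytes]],
    read_raw_with_ecc: Callable[[], Tuple[bytes, bytes]],
    correct: Callable[[bytes, bytes], Optional[bytes]],
    ecc_enabled: bool = True,
) -> ReadResult:
    # Step 1: repeat the read to see whether the error is transient.
    for _ in range(MAX_PLAIN_RETRIES):
        data = read(0, 0)
        if data is not None:
            return ReadResult(True, data, "retry")

    # Step 2: walk the table of margin and early/late combinations.
    for margin, early_late in READSEQ:
        data = read(margin, early_late)
        if data is not None:
            return ReadResult(True, data, "margin")

    # Step 3: read data plus correction code with no offset, then
    # attempt correction unless the feature has been disabled.
    raw, ecc = read_raw_with_ecc()
    if ecc_enabled:
        fixed = correct(raw, ecc)
        if fixed is not None:
            return ReadResult(True, fixed, "ecc")

    # Step 4: hand back the bad sector with an unsuccessful status.
    return ReadResult(False, raw, "failed")
```

Passing the disk operations in as callables keeps the ordering of the stages separate from the hardware details, so the sequence can be exercised with stand-in functions.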

3.3.1.1.2 Recovery for data errors on write operations - If the disk hardware detects an error while attempting to write data to disk, the error recovery routine repeats the function a set number of times to determine if the error is transient. If the retries are not successful, the IOS returns a status to the mainframe indicating unsuccessful completion of the operation.

3.3.1.2 Lost data errors

When the status parcel has a 1 set in bit 2, the hardware has detected that Local Memory was unable to keep up with the disk transfer on a read operation. In this case, the data transfer was not completed, and error recovery attempts to complete the function by repeating it a set number of times.

ERRECK then clears fault flags, does a seek to cylinder 0, and attempts to repeat the disk function. This operation is repeated a set number of times. If the data is not successfully transferred by these repeated operations, the IOS returns a status to the mainframe indicating unsuccessful completion of the operation.
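The two phases of lost data recovery can be sketched as shown below. This is an illustration under stated assumptions: MAX_RETRIES is a hypothetical value, the drive object and its clear_faults and seek methods are invented for the example, and function is assumed to reposition to the target cylinder itself before transferring data.

```python
MAX_RETRIES = 3  # hypothetical; the text says only "a set number of times"

def recover_lost_data(drive, function):
    """Return True if the disk function eventually completes.

    drive    -- hypothetical object providing clear_faults() and seek(cyl)
    function -- callable that reissues the disk function and returns
                True on success (assumed to include its own seek)
    """
    # Phase 1: simply repeat the function.
    for _ in range(MAX_RETRIES):
        if function():
            return True

    # Phase 2: clear fault flags and return to cylinder 0 before
    # each further attempt.
    for _ in range(MAX_RETRIES):
        drive.clear_faults()
        drive.seek(0)
        if function():
            return True

    # Caller reports unsuccessful completion to the mainframe.
    return False
```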

3.3.1.3 Seek errors

Seek errors are detected by the hardware and are indicated when bit 8 is set in the status parcel. The recovery procedure is to return to cylinder 0, then attempt to do the seek again. This sequence is repeated a set number of times. If the seek cannot be completed successfully, an error status is returned to the mainframe.

3.3.1.4 ID errors

Following a normal disk seek operation, the hardware returns the cylinder number from the disk ID field in the Status Response register. If this cylinder number does not agree with the cylinder that software is trying to select, error recovery is invoked. The error recovery procedure is to return to cylinder 0, then attempt to do the seek again. This sequence is repeated a set number of times. Before the final retry, the head group is switched in an effort to determine if the correct cylinder is being selected. If all retries fail, an error status is returned to the mainframe.
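The retry loop for this case, including the head group switch before the last attempt, can be sketched as follows. The retry limit and the drive methods (seek, switch_head_group, read_id_cylinder) are hypothetical names chosen for the illustration; they are not taken from the IOS software.

```python
MAX_SEEK_RETRIES = 4  # hypothetical; the text says only "a set number of times"

def recover_id_error(drive, target_cylinder):
    """Return True once the cylinder reported by the drive matches the target.

    drive -- hypothetical object providing seek(cyl), switch_head_group(),
             and read_id_cylinder()
    """
    for attempt in range(1, MAX_SEEK_RETRIES + 1):
        # Before the final retry, switch head groups to help determine
        # whether the correct cylinder is being selected.
        if attempt == MAX_SEEK_RETRIES:
            drive.switch_head_group()

        drive.seek(0)                  # return to cylinder 0
        drive.seek(target_cylinder)    # then attempt the seek again

        if drive.read_id_cylinder() == target_cylinder:
            return True

    # All retries failed; caller returns an error status to the mainframe.
    return False
```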

3.3.1.5 Interlock status

When error recovery finds no bits set in the status parcel after detecting an error condition, it knows that the disk referenced is not in a condition to perform the I/O. To determine the cause of the condition, the error recovery overlay reads the interlock status into the status response register and then into the A register with an IOB:11 instruction. Error recovery checks to see whether the IOS has reserved bit set. If reserved, the status word is checked to see if a real interlock condition is set. If not set, the recovery routine considers the interlock false and tries to recover as though it were a miscellaneous type. Otherwise, error recovery displays a message indicating the type of error so the operator can correct physical interlocks. An interlock status (irrecoverable error) is returned to the mainframe.

Conditions considered interlocks, along with their bit positions in the status response register, are indicated in table 3-2. In all cases, a 1 in the bit position indicates that the corresponding condition is true.

Table 3-2. Interlock Error Conditions

Bit Error

8 Positive voltage supply for the DSU is below normal

9 Negative voltage supply for the DSU is below normal

11 DSU start switch is off

12 DSU brush cycle is in process

13 Disk heads are not loaded on the disk surface

14 Disk surface is not up to speed

15 Disk drive cabinet is over the normal temperature range
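Table 3-2 can be expressed as a lookup that turns an interlock status word into operator messages. This is a sketch only: it assumes bit n has the value 2**n, and the function name decode_interlocks is hypothetical. An empty result corresponds to the false interlock case, which is handled as a miscellaneous error.

```python
# Bit positions and descriptions are taken from table 3-2.
# Assumption: bit n of the status response register has the value 1 << n.
INTERLOCK_CONDITIONS = {
    8: "Positive voltage supply for the DSU is below normal",
    9: "Negative voltage supply for the DSU is below normal",
    11: "DSU start switch is off",
    12: "DSU brush cycle is in process",
    13: "Disk heads are not loaded on the disk surface",
    14: "Disk surface is not up to speed",
    15: "Disk drive cabinet is over the normal temperature range",
}

def decode_interlocks(status):
    """Return the descriptions of all interlock conditions set in status."""
    return [
        text
        for bit, text in sorted(INTERLOCK_CONDITIONS.items())
        if status & (1 << bit)
    ]
```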

3.3.1.6 Miscellaneous disk errors

Certain disk errors do not fit neatly into any of the previous classifications. When these errors occur, they are treated as transient conditions that may disappear on retry, and the last function executed on the channel is reexecuted up to a set maximum number of times. If the error continues to occur, the condition is processed as though it were an interlock condition, causing a message to be sent to the operator and a status response to the mainframe.

Miscellaneous errors, along with their bit positions in the status response register, are given in table 3-3. A 1 in the bit position indicates that the condition is true.

Table 3-3. Miscellaneous Error Conditions

Bit Error

0 Angular position counter failure

7 Address error

13 Multiple head select

14 Read and write conflict

15 Read/write off cylinder
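The retry-then-escalate handling of miscellaneous errors can be sketched as follows. The bit assignments come from table 3-3. Everything else is an assumption made for the illustration: bit n is taken to have the value 2**n, the default retry limit is a placeholder, and reissue and notify_operator are hypothetical callables standing in for reexecuting the last channel function and for the operator message.

```python
# Bit positions and descriptions are taken from table 3-3.
# Assumption: bit n of the status response register has the value 1 << n.
MISC_CONDITIONS = {
    0: "Angular position counter failure",
    7: "Address error",
    13: "Multiple head select",
    14: "Read and write conflict",
    15: "Read/write off cylinder",
}

def decode_misc(status):
    """Return the descriptions of all miscellaneous conditions set in status."""
    return [
        text
        for bit, text in sorted(MISC_CONDITIONS.items())
        if status & (1 << bit)
    ]

def recover_misc(reissue, notify_operator, max_retries=3):
    """Reexecute the last function; escalate if the error persists.

    reissue         -- hypothetical callable that reexecutes the last
                       channel function and returns the new status value
    notify_operator -- hypothetical callable that takes one message string
    max_retries     -- hypothetical limit; the text gives no number
    """
    status = 0
    for _ in range(max_retries):
        status = reissue()
        if not decode_misc(status):
            return "recovered"   # the condition was transient

    # Persistent error: process it as an interlock condition.
    for message in decode_misc(status):
        notify_operator(message)
    return "interlock"
```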
