Single Error Detection (SED) - Fault tolerance infrastructure and its reuse for offline testing

[Figure: three register configurations. a) Unprotected R_i; b) Detection & Recomputation; c) Detection & Correction (Bit-Flipping R_i, Decoder). The diagram marks Blocks I, II and III and the signals C_i, C_i' and S_i.]

Figure 5.1.: Presented Concurrent Online Configurations.

Throughout this chapter, soft errors are confined to Single Bit Upsets (SBUs) (single error assumption). The detection and correction in the presence of multiple errors will be discussed in Chapter 6.

The remainder of this chapter starts with the Single Error Detection in Section 5.2. It briefly depicts the application of the modulo-2 address characteristic to individual registers (Block I), as well as the protection of the stored error condition in order to eliminate false detections (Block II). Subsequently, Section 5.3 details the extension to Online Correction with a focus on the Bit-Flipping Latch (Block III). Prior to the summary, both parts are experimentally evaluated in Section 5.4.

5.2. Single Error Detection (SED)

When an error is detected during operation, a correction by recomputation requires many clock cycles during which the module is not available. Avoiding non-essential corrections therefore has the potential to reduce the performance impact associated with the time-consuming restart of a computation. In the course of this section, the basis for such fine-grained correction decisions is laid by

▷ an improved localization granularity to spot errors within registers and

▷ the avoidance of false detections whenever the data path is correct.

5.2.1. Derivation of a Register Specific Error Condition

The modulo-2 address characteristic discussed in Section 4.3 has been shown to be an effective solution for the localization of SEUs across multiple registers of a circuit.

The need for an online detection and localization of SEUs within the registers can be fulfilled by deriving a characteristic for each individual register.

In contrast to the non-concurrent localization, the register characteristic C of a single register R with n bits is derived as C = M · R. The characteristic is computed by attaching the characteristic tree directly to the register, as shown in Figure 5.2, and is stored in the l-bit reference characteristic register C. The syndrome S is derived by comparing C and C′ and is compacted into the one-bit syndrome fail signal fail_S, which indicates any discrepancy between the register content R and its characteristic C. The implementation for each register profits from the low number of connections and gates enabled by the optimal characteristic tree organization introduced in Section 4.3.2. As a register is locally confined to a relatively small area in the final layout, only a few routing resources are occupied and a small area overhead of the protected register is expected. The characteristic tree only requires exclusive OR standard cells, and the associated area is identical to the non-concurrent localization (Eq. (4.12)). As all remaining infrastructure parts depend solely on the logarithmic characteristic, the area associated with the derivation of the error condition (EC) within a single register is low (Eq. (5.1)).

\[
\mathrm{Area}_{\mathrm{EC}} = n \cdot A_{\mathrm{Latch}}
+ \underbrace{(2^{l+1} - 2l - 2) \cdot A_{\mathrm{XOR2}}}_{\text{Characteristic Tree (Eq. (4.12))}}
+ l \cdot A_{\mathrm{Latch}}
+ \underbrace{l \cdot A_{\mathrm{XOR2}}}_{\text{Syndrome } S}
+ \underbrace{(l - 1) \cdot A_{\mathrm{OR2}}}_{\mathit{fail}_S}
\tag{5.1}
\]

The application of the modulo-2 address characteristic to individual registers provides the localization granularity necessary to identify single failing bits and will later allow the correct bit value at the erroneous location to be reconstructed.
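The derivation of the characteristic and the syndrome for a single register can be illustrated with a minimal Python sketch. It rests on an assumption that is not stated in this section: bit j of the register (counted from 1) is assigned the address j, so that column j of M is the binary encoding of j. The actual matrix M is defined in Section 4.3 and may differ. The function names and the 7-bit example register are hypothetical.

```python
def characteristic(bits):
    """Modulo-2 address characteristic: XOR of the addresses of all set bits.

    Assumption: bit j (1-based) has address j, i.e. column j of M is the
    binary encoding of j. The actual M is defined in Section 4.3.
    """
    c = 0
    for address, bit in enumerate(bits, start=1):
        if bit:
            c ^= address
    return c


def syndrome(bits, reference):
    """S = C xor C': stored reference versus the recomputed characteristic."""
    return reference ^ characteristic(bits)


# Hypothetical 7-bit register (n = 7, l = 3) with addresses 1..7.
register = [1, 0, 1, 1, 0, 0, 1]
reference = characteristic(register)  # stored in the reference register C

# Inject a single bit upset at address 5.
faulty = list(register)
faulty[5 - 1] ^= 1

s = syndrome(faulty, reference)
fail_s = int(s != 0)  # corresponds to the OR over the l syndrome bits
print(f"reference={reference:03b} syndrome={s:03b} fail_S={fail_s}")
```

Under the stated address assumption, the syndrome of a single upset equals the address of the flipped bit (here 5), which is the localization information referred to above.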

[Figure: register with attached characteristic tree. Signals shown: C_i', C_i, S_i (l bits each), the parity latch PC_i, and the fail signals fail_S and fail_P.]

Figure 5.2.: Block I and Block II: Deriving and Protecting the Error Condition.


5.2.2. Protected Storage of the Error Condition

The modulo-2 address characteristic is able to detect and localize single errors in the protected register. The syndrome fail signal fail_S will indicate any Single Event Upset (SEU) in R. Unfortunately, a single upset in the stored reference characteristic C also raises the syndrome fail signal (Figure 5.2). Hence, a false detection is triggered although the register content, and thereby the data on the data path, is unaltered and valid. To decide whether an SEU affected the original register, and thereby the data path, or whether it manifested in the stored reference characteristic, an additional protection of the error condition is necessary to distinguish both cases under a single error assumption.

This can be achieved by protecting either the register R or the characteristic C with an additional parity bit. The parity of the characteristic C is chosen due to a smaller area overhead resulting from the logarithmic relationship between the register size n and the characteristic size l (Eq. (4.6)). The parity is then computed from the reference characteristic register C by a small XOR tree and stored in one additional latch P_C, as shown in the gray part of Figure 5.2. In the following, the difference between the reference parity P and the actual parity P′ is called the parity fail signal fail_P. Any Single Bit Upset in the register R, its characteristic C or the parity P thereof will be visible in at least one of the two fail signals.

With two fail signals at hand that indicate a discrepancy between either R and C (fail_S) or C and P (fail_P), the location can be derived for an error affecting the

▷ Original Register (R): The syndrome fail signal is ’1’ (as C ≠ C′), the parity fail signal is ’0’ (P = P′). As the data on the data path is affected, a correction is triggered.

▷ Reference Characteristic (C): Both the syndrome and the parity fail signal are ’1’ (C ≠ C′ and P ≠ P′). The error changed C, not the data path; no correction is needed.

▷ Characteristic Parity (P): The syndrome fail signal is ’0’ while the parity fail signal is ’1’ (C = C′ and P ≠ P′). The reference parity P was altered. The data path is correct; no correction is needed.
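The three cases can be reproduced by injecting one upset at a time into a small software model. The sketch below is only an illustration: it assumes that bit j of the register has the address j (the actual characteristic is defined in Section 4.3), and all function and variable names are hypothetical.

```python
def characteristic(bits):
    """XOR of the 1-based addresses of all set bits (assumed address scheme)."""
    c = 0
    for address, bit in enumerate(bits, start=1):
        if bit:
            c ^= address
    return c


def parity(value):
    """Parity over the bits of the characteristic (models the small XOR tree)."""
    return bin(value).count("1") % 2


def fail_signals(r_bits, c_stored, p_stored):
    """Return (fail_S, fail_P) for the current register, stored C and stored P."""
    fail_s = int(characteristic(r_bits) != c_stored)  # C' versus stored C
    fail_p = int(parity(c_stored) != p_stored)        # P' versus stored P
    return fail_s, fail_p


# Fault-free state of a hypothetical 7-bit register.
register = [1, 0, 1, 1, 0, 0, 1]
c_ref = characteristic(register)
p_ref = parity(c_ref)

# One single upset per case.
upset_r = list(register)
upset_r[2] ^= 1  # flip the register bit at address 3

cases = {
    "R": fail_signals(upset_r, c_ref, p_ref),
    "C": fail_signals(register, c_ref ^ 0b010, p_ref),  # flip one bit of C
    "P": fail_signals(register, c_ref, p_ref ^ 1),      # flip the parity latch
    "none": fail_signals(register, c_ref, p_ref),
}

for location, (fail_s, fail_p) in cases.items():
    print(f"{location}: fail_S={fail_s} fail_P={fail_p}")
```

Only the upset in R yields fail_S = 1 together with fail_P = 0; the upsets in C and P raise fail_P, which is what distinguishes them from an error on the data path.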

Hence, the correction signal is defined as correct = fail_S ∧ ¬fail_P. The negation of the parity fail signal is implemented efficiently by merging it into the XOR2 gate that generates fail_P, i.e. by using an XNOR2 gate instead. The additional area to protect the error condition is small and grows logarithmically with the register size (Eq. (5.2)).

\[
\mathrm{Area}_{\mathrm{SED}} = \underbrace{\mathrm{Area}_{\mathrm{EC}}}_{\text{Eq. (5.1)}}
+ \underbrace{(l - 1) \cdot A_{\mathrm{XOR2}}}_{\text{Parity Tree of } C}
+ A_{\mathrm{Latch}}
+ \underbrace{A_{\mathrm{XNOR2}}}_{\neg \mathit{fail}_P}
+ \underbrace{A_{\mathrm{AND2}}}_{\mathit{correct}}
\tag{5.2}
\]

Detection Capability: All single faults can be detected, correctly localized and corrected. Double faults can be detected, but not corrected. The detection of other multiple faults cannot be guaranteed. In general, the Hamming distance of the used code can be increased to allow the detection, localization and correction of multiple faults.² Consequently, Single Error Detection (SED) is achieved, which completely avoids false detections. In case of a detection, detailed localization information within a register is supplied. Thus, fine-grained decisions are permitted for correction by recomputation.
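To get a feeling for the magnitudes in Eqs. (5.1) and (5.2), both area expressions can be evaluated numerically. The sketch below transcribes the two equations term by term. The cell areas are placeholders set to one unit each and do not come from any standard cell library; the pairing n = 7, l = 3 is likewise only an example, since the exact relation between n and l is given by Eq. (4.6), which is not reproduced here.

```python
def area_ec(n, l, a_latch, a_xor2, a_or2):
    """Area of the error condition, transcribed from Eq. (5.1)."""
    return (
        n * a_latch                            # register latches
        + (2 ** (l + 1) - 2 * l - 2) * a_xor2  # characteristic tree (Eq. (4.12))
        + l * a_latch                          # l-bit reference characteristic
        + l * a_xor2                           # syndrome S
        + (l - 1) * a_or2                      # fail_S compaction
    )


def area_sed(n, l, a_latch, a_xor2, a_or2, a_xnor2, a_and2):
    """Area with protected error condition, transcribed from Eq. (5.2)."""
    return (
        area_ec(n, l, a_latch, a_xor2, a_or2)
        + (l - 1) * a_xor2  # parity tree of C
        + a_latch           # parity latch
        + a_xnor2           # negated fail_P
        + a_and2            # correct signal
    )


# Placeholder unit areas (assumption): every cell counts as 1.
unit = dict(a_latch=1, a_xor2=1, a_or2=1)
ec = area_ec(n=7, l=3, **unit)
sed = area_sed(n=7, l=3, a_xnor2=1, a_and2=1, **unit)
print(f"Area_EC={ec} Area_SED={sed} protection overhead={sed - ec}")
```

With unit areas, the protection of the error condition adds (l − 1) + 3 cells on top of Area_EC, which reflects the logarithmic growth stated in the text.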

In document: Fault tolerance infrastructure and its reuse for offline testing: synergies of a unified architecture to cope with soft errors and hard faults (pages 113-116)