
appear and disappear as a function of time at a fixed location. They relate to marginal or unstable hardware, are activated by environmental conditions, and may evolve into permanent faults. Transient faults affect a circuit at random timepoints and random locations. They are caused by environmental conditions such as dynamic parameter variations that lead to violations of timing safety margins [Con03; Bor05] or the charge induced by ionizing radiation [Bau05]. Thus, they are often more precisely denoted as transient errors to accentuate the absence of a physical defect, or soft errors for particle-induced transients.

2.2. Soft Errors

Definition 2.2.3 (Single Bit Upset (SBU)) A single event that induces a single bit in an IC to fail at one time.

If the energy of a single particle is high enough, it can affect multiple bits. With sequential elements organized into registers or memory words, it is important to distinguish the amount of upsets affecting a single word.

Definition 2.2.4 (Multiple Cell Upset (MCU) [JED06]) A single event that induces several bits in an IC to fail at one time.

Definition 2.2.5 (Multiple Bit Upset (MBU) [JED06]) A multiple cell upset in which two or more error bits occur in the same word.

2.2.2. Soft Error Quantification

The frequency at which soft errors occur is denoted by the Soft Error Rate (SER). It is measured in units of Failures in Time (FIT), where one FIT is defined as the number of failures per 10^9 device-hours. For memories or sequential elements, the soft error rate is often expressed in FIT/device or FIT/Mbit.

The soft error rate of the system components is additive, hence

SER_circuit = ∑_{i=0}^{n} (SER_{nominal, component i} × AVF_i × TVF_i) .    (2.1)

Not all radiation events will finally result in a failure of a digital circuit or system, due to logic derating and time derating [NY03]. The Architectural Vulnerability Factor (AVF) comprises electrical and logical masking effects and denotes the probability that a single event transient affecting a node or device will be observed by the system or user [MWE+03]. The Time Vulnerability Factor (TVF) considers latch window masking effects of sequential elements and denotes the fraction of time a node or device is susceptible to upsets [ST04].
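Equation (2.1) can be sketched as a short computation; the component names and FIT/AVF/TVF values below are hypothetical illustration data, not taken from the text:

```python
# A minimal sketch of Equation (2.1): the circuit-level soft error rate is the
# sum of the derated component rates. All component values are hypothetical.
def circuit_ser(components):
    """components: iterable of (nominal_ser_fit, avf, tvf) tuples."""
    return sum(ser * avf * tvf for ser, avf, tvf in components)

# Three hypothetical components: (nominal SER in FIT, AVF, TVF).
parts = [
    (1000.0, 0.30, 0.50),  # e.g. a register file
    (2000.0, 0.10, 0.40),  # e.g. logic feeding a latch
    (500.0,  0.80, 1.00),  # e.g. an unprotected SRAM block
]
print(circuit_ser(parts))  # 150 + 80 + 400 = 630 FIT
```

The derating factors shrink each nominal rate before summation, so a component with a low AVF contributes little to the circuit-level SER even if its raw rate is high.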

Reliability denotes the probability that a device will perform its intended function during a specified time under stated conditions [ALRL04]. This probability of survival beyond a specified time (timepoint 0 to timepoint t) is commonly designated by the term survival function, while in the technical domain the term reliability function R(t) is widely used. It can be calculated as

R(t) = e^{−λt}    (2.2)

where λ denotes a constant failure rate measured in FIT, such as the soft error rate SER. The term Mean Time To Failure (MTTF) is often used to denote reliability and is inversely related to the soft error rate.

MTTF in years = 10^9 / (λ × 24 hours × 365 days)    (2.3)
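Equations (2.2) and (2.3) translate directly into code; the 1000 FIT example rate below is a hypothetical value chosen for illustration:

```python
import math

# A small sketch of Equations (2.2) and (2.3), assuming a constant failure
# rate lambda given in FIT (failures per 10^9 device-hours).
HOURS_PER_YEAR = 24 * 365

def reliability(lam_fit, t_hours):
    """R(t) = exp(-lambda * t), with lambda converted from FIT to failures/hour."""
    lam_per_hour = lam_fit / 1e9
    return math.exp(-lam_per_hour * t_hours)

def mttf_years(lam_fit):
    """MTTF in years = 10^9 / (lambda_FIT * 24 * 365)."""
    return 1e9 / (lam_fit * HOURS_PER_YEAR)

# Hypothetical device with an SER of 1000 FIT:
print(mttf_years(1000.0))                    # ~114 years
print(reliability(1000.0, HOURS_PER_YEAR))   # ~0.991 survival over one year
```

Note how a seemingly large MTTF still leaves a measurable probability of failure within the first year, which is why per-year reliability rather than MTTF alone is often the more meaningful figure.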

2.2.3. Fault Tolerance

Fault tolerance, the ability of a circuit to continue its intended operation in the presence of faults, can be achieved by means of redundancy, the provisioning of functional capabilities that would be unnecessary in an error-free environment [Lap85]. Common to most solutions, three goals are essential to cope with soft errors:

▷ Error Detection: The presence of a soft error is recognized.

▷ Error Localization: The location of the soft error is revealed.

▷ Error Correction: The original, error-free data is reconstructed.

The addition of redundancy to a circuit in order to increase its robustness in the presence of soft errors can be performed at different domains and abstraction levels. Temporal redundancy is achieved by repeating an operation multiple times and comparing the results, or by sampling the result of a single execution multiple times.

Structural redundancy includes diversification and n-modular redundancy, where basic building blocks are replicated structurally and compared. The detection of errors is possible with two replicas in duplication with comparison, while majority voting with at least three replicas allows for error correction. Information redundancy in the form of error detecting and correcting codes will be discussed in more detail in the following.
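The two structural schemes can be sketched in a few lines; the function names are illustrative, not from the text:

```python
# A minimal sketch of structural redundancy: duplication with comparison
# detects a mismatch, while majority voting over three replicas (TMR)
# corrects a single faulty replica.
def duplicate_and_compare(a, b):
    """Error detection only: two replicas, disagreement flags an error."""
    return (a, a == b)  # (result, ok-flag)

def majority_vote(a, b, c):
    """Error correction: the value held by at least two of three replicas wins."""
    return b if b == c else a

# One replica delivers a corrupted result:
value, ok = duplicate_and_compare(0b1011, 0b1010)
print(ok)                                      # False: detected, not correctable
print(majority_vote(0b1011, 0b1011, 0b1010))   # 11 (0b1011): corrected
```

With two replicas the voter cannot tell which copy is wrong, which is exactly why detection needs two replicas but correction needs at least three.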


2.2.3.1. Error Detecting and Correcting Codes

In coding theory, error detection and correction (EDAC) are techniques that enable the reliable delivery of digital data over unreliable communication channels. To this end, redundancy is added to a message in order to check the consistency of the delivered message and to recover corrupted data.

In a systematic code, the original data is transmitted in conjunction with a fixed number of additional check bits that are derived deterministically from the data bits.

In order to detect an error, the same algorithm is applied to the received data bits and the result is compared with the received check bits. A common way of detecting errors is the use of a parity bit that denotes whether the number of 1 bits in the data word is even or odd.

Definition 2.2.6 (Even Parity) Let d⃗ be a vector with n binary values [d_n, . . . , d_1], then the even parity bit p(d⃗) is defined as

p(d⃗) = ⊕_{a=1}^{n} d_a .

The introduction of a parity bit increases the Hamming distance between any two valid code words to 2, thereby allowing the detection of single bit errors (and any odd number of errors) while a correction is not possible.
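A short sketch of Definition 2.2.6 and its detection capability; the helper names are illustrative:

```python
from functools import reduce

# A small sketch of even parity: the parity bit is the XOR of all data bits,
# so the extended word always carries an even number of 1s.
def even_parity(bits):
    """XOR of all data bits."""
    return reduce(lambda x, y: x ^ y, bits, 0)

def check(word_with_parity):
    """True if the received word (data bits + parity bit) is consistent."""
    return even_parity(word_with_parity) == 0

data = [1, 0, 1, 1]
word = data + [even_parity(data)]  # parity bit = 1 ^ 0 ^ 1 ^ 1 = 1
print(check(word))                 # True

word[2] ^= 1                       # flip a single bit
print(check(word))                 # False: single-bit error detected
```

Flipping any two bits would restore even parity, which illustrates why a single parity bit detects single (and any odd number of) errors but neither detects double errors nor locates the flipped bit.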

In general, the number of bit errors that can be detected or corrected is determined by the minimum Hamming distance of a code, defined as the minimum Hamming distance between any two valid (fault-free) code words.

Definition 2.2.7 (Minimum Hamming Distance) Let c⃗_1, c⃗_2 ∈ C ⊂ B^{n+k} be two code words and let ∆H(c⃗_1, c⃗_2) denote the Hamming distance between c⃗_1 and c⃗_2. Then, the minimum Hamming distance ∆H(C) of the code C is defined as

∆H(C) := min_{c⃗_1, c⃗_2 ∈ C; c⃗_1 ≠ c⃗_2} ∆H(c⃗_1, c⃗_2) .

In order to detect errors in a or fewer bits, the minimum Hamming distance is required to be at least a + 1. To correct all errors in b or fewer bits, a minimum Hamming distance of at least 2b + 1 is required. If errors should be detected in a or fewer bits and corrected in b or fewer bits (where b ≤ a), the minimum Hamming distance needs to be at least a + b + 1 [Muk08].
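Definition 2.2.7 lends itself to a brute-force computation; the 5-bit example code below is hypothetical:

```python
from itertools import combinations

# A brute-force sketch of Definition 2.2.7: the minimum Hamming distance of a
# code is the smallest pairwise distance between distinct code words.
def hamming_distance(c1, c2):
    return sum(b1 != b2 for b1, b2 in zip(c1, c2))

def min_hamming_distance(code):
    return min(hamming_distance(c1, c2) for c1, c2 in combinations(code, 2))

# Hypothetical 5-bit code with four code words:
code = ["00000", "01011", "10101", "11110"]
print(min_hamming_distance(code))  # 3: detects 2-bit errors or corrects 1-bit errors
```

With ∆H(C) = 3, the bounds above give a = 2 for detection (a + 1 ≤ 3) or b = 1 for correction (2b + 1 ≤ 3), but not both at once.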

Error detecting and correcting Hamming codes employ multiple check bits to ensure an adequate distance between code words [Ham50]. The minimum number of check bits required for sole Single Error Correction (SEC) is given by the Hamming relationship [Ham50]:

2^c ≥ d + c + 1    (2.4)

where d is the number of data bits and c is the number of check bits. To correct a single bit error, the 2^c combinations of the c check bits must be able to localize the error in the d + c code bits. In addition, they need to represent the fault-free case where no error occurred. Hamming codes are commonly expressed as Hamming(n, d), with n = d + c code bits and d data bits. They are perfect codes, as the Hamming relationship is satisfied with a minimum number of check bits. Extended Hamming codes that allow Single Error Correction and Double Error Detection (SECDED) employ one additional check bit to increase the minimum Hamming distance.
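The Hamming relationship (2.4) can be evaluated with a small helper, sketched here under the assumption that one simply searches for the smallest c satisfying the inequality:

```python
# A small helper illustrating the Hamming relationship (2.4): the smallest
# number of check bits c with 2^c >= d + c + 1 for d data bits.
def min_check_bits(d):
    c = 1
    while 2 ** c < d + c + 1:
        c += 1
    return c

for d in (4, 8, 16, 32, 64):
    print(d, min_check_bits(d))
# 4 data bits need 3 check bits (Hamming(7,4)); 8 need 4; 64 need 7.
```

The check-bit overhead grows only logarithmically with the word width, which is why SEC and SECDED codes are attractive for wide memory words.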

Example Let Hamming(7,4) be the Hamming code defined by the generator matrix G and the check matrix H with mutually distinct columns.

G =
⎡ 1 1 0 1 ⎤
⎢ 1 0 1 1 ⎥
⎢ 1 0 0 0 ⎥
⎢ 0 1 1 1 ⎥
⎢ 0 1 0 0 ⎥
⎢ 0 0 1 0 ⎥
⎣ 0 0 0 1 ⎦

H =
⎡ 1 0 1 0 1 0 1 ⎤
⎢ 0 1 1 0 0 1 1 ⎥
⎣ 0 0 0 1 1 1 1 ⎦    (2.5)

To determine the codeword X⃗ = (c_1, c_2, d_1, c_3, d_2, d_3, d_4) to be transmitted, the data vector D⃗ = (d_1, d_2, d_3, d_4) with value [1010]^T is pre-multiplied by G.

X⃗ = G · D⃗ = [1011010]^T

To check a received codeword R⃗, it is pre-multiplied by H in order to obtain the syndrome S⃗. For an error-free codeword R⃗ (:= X⃗), the syndrome S⃗ is the null vector:

S⃗ = H · R⃗ = [000]^T

For a codeword R⃗ with a bit error at bit 5 (R⃗ = [1011110]^T), the syndrome S⃗ is not the null vector and its value indicates which bit has been flipped:

S⃗ = H · R⃗ = [101]^T
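The example above can be reproduced end to end; the helper names are illustrative, and all arithmetic is carried out mod 2:

```python
# A sketch reproducing the Hamming(7,4) example: encoding with G, syndrome
# computation with H, and single-error correction.
G = [
    [1, 1, 0, 1],
    [1, 0, 1, 1],
    [1, 0, 0, 0],
    [0, 1, 1, 1],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]
H = [
    [1, 0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]

def matvec_mod2(M, v):
    """Matrix-vector product over GF(2)."""
    return [sum(m * x for m, x in zip(row, v)) % 2 for row in M]

def syndrome_to_position(s):
    """H's columns are the binary numbers 1..7, so the syndrome (s1, s2, s3)
    read as s1*1 + s2*2 + s3*4 gives the 1-based position of the flipped bit."""
    return s[0] * 1 + s[1] * 2 + s[2] * 4

D = [1, 0, 1, 0]
X = matvec_mod2(G, D)
print(X)                              # [1, 0, 1, 1, 0, 1, 0]

R = X[:]
R[4] ^= 1                             # flip bit 5 (index 4)
S = matvec_mod2(H, R)
print(S)                              # [1, 0, 1] -> bit position 5
R[syndrome_to_position(S) - 1] ^= 1   # flip it back
print(R == X)                         # True: single error corrected
```

Because the columns of H enumerate the binary numbers 1 through 7, the syndrome directly names the erroneous bit position, which is what makes Hamming codes single-error-correcting rather than merely error-detecting.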