Redundancy - Radiation Mitigation for the FPGA

5.2. Radiation Mitigation for the FPGA

5.2.2. Redundancy

For some components, such as the local timestamp counter, it was clear from the beginning that TMR would be used as the radiation mitigation technique. For the implementation of finite state machines (FSMs), two competing techniques are available (see also paragraph FSM Encoding, page 52), and the better one had to be identified first.

The results of this selection process are presented first, followed by implementation details of the selected TMR.

Hamming Coded versus TMR’ed State Machines

To compare the efficiency of Hamming-coded FSMs with TMR’ed FSMs, an SEU-injection test was performed. The logic used in the 2012 in-beam test was chosen as the basis. Two radiation-mitigated firmwares were built, one that implements Hamming-coded FSMs and another that implements TMR’ed FSMs. In addition, a firmware without redundancy was built for comparison.

In this test, the probability that an injected SEU produces an observable error is very low, for two reasons:

• To keep the test simple, the functionality of only one FSM in the logic was tested. The FSM that was selected for the test is considered to be highly critical; however, the probability for an injected SEU to actually affect this particular FSM is still rather small.

• To avoid multi-bit upset effects, only one SEU was injected per iteration. A random bit in the partial bitfile was flipped and this modified bitstream was then loaded on the FPGA.

As a result, the test had to run for a significant time, but this was not a major obstacle, as the test could be performed over weekends and during other times when the hardware was not required for something else.
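The single-bit injection step described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `inject_seu`, the zero-filled stand-in for the partial bitfile, and the seed are hypothetical, and the step that loads the modified bitstream onto the FPGA is not modeled.

```python
import random
from typing import Tuple


def inject_seu(bitstream: bytes, rng: random.Random) -> Tuple[bytes, int]:
    """Return a copy of `bitstream` with exactly one randomly chosen bit flipped.

    Also returns the index of the flipped bit so the iteration can be logged.
    """
    if not bitstream:
        raise ValueError("bitstream must not be empty")
    bit_index = rng.randrange(len(bitstream) * 8)
    corrupted = bytearray(bitstream)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted), bit_index


# One iteration on a stand-in for the partial bitfile (hypothetical data).
rng = random.Random(2012)
golden = bytes(64)
faulty, flipped = inject_seu(golden, rng)
```

Flipping exactly one bit per iteration keeps each observed failure attributable to a single upset, which is the reason given above for avoiding multi-bit effects.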

Figure 5.4 shows the results of the test. No positive SEU mitigation effect of Hamming-coded FSMs can be seen, whereas TMR’ed FSMs show a significant mitigation effect.

One might object that Hamming-coded FSMs protect the state vector, which is part of the “dynamic” memory and is not affected by SEU injection tests at all.

From this point of view, it is no surprise that no effect in favor of Hamming-coded FSMs can be observed here. However, it has been shown that an SEU-induced failure is more likely to be caused by an error in the static memory than in the dynamic memory, simply because much more static memory contributes to the functionality of the design [Whi14, page 25]. Even though the FSM’s state vector is not directly affected by SEU injection tests, the FSM’s functionality is, and this is the relevant criterion to evaluate.

In fact, this is probably the very reason why Hamming-coded FSMs do not improve SEU resilience despite adding redundancy to the logic. Hamming codes only protect the dynamic state vector, at the cost of enlarging the static part. TMR increases the overall resource consumption, but it mitigates upsets in both the static and the dynamic part of the logic.
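To illustrate what protecting the state vector with a Hamming code means, the following Python sketch encodes a 4-bit state into a 7-bit codeword and corrects a single flipped bit. The Hamming(7,4) parameters and the function names are assumptions chosen for illustration; the text does not state which code the firmware uses.

```python
def hamming74_encode(state: int) -> int:
    """Encode a 4-bit state into a 7-bit codeword.

    Codeword positions 1..7 are stored in bits 0..6; parity bits sit at
    positions 1, 2 and 4, data bits at positions 3, 5, 6 and 7.
    """
    d = [(state >> i) & 1 for i in range(4)]
    p1 = d[0] ^ d[1] ^ d[3]  # covers positions 3, 5, 7
    p2 = d[0] ^ d[2] ^ d[3]  # covers positions 3, 6, 7
    p4 = d[1] ^ d[2] ^ d[3]  # covers positions 5, 6, 7
    bits = [p1, p2, d[0], p4, d[1], d[2], d[3]]
    return sum(b << i for i, b in enumerate(bits))


def hamming74_correct(codeword: int) -> int:
    """Correct up to one flipped bit and return the decoded 4-bit state."""
    bits = [(codeword >> i) & 1 for i in range(7)]
    syndrome = 0
    for pos in range(1, 8):
        if bits[pos - 1]:
            syndrome ^= pos  # XOR of the positions of all set bits
    if syndrome:
        bits[syndrome - 1] ^= 1  # the syndrome names the flipped position
    return bits[2] | (bits[4] << 1) | (bits[5] << 2) | (bits[6] << 3)


# A single upset in the stored state vector is corrected.
stored = hamming74_encode(0b1010) ^ (1 << 4)
recovered = hamming74_correct(stored)
```

In hardware, the encoder and the syndrome logic are additional combinational logic. On an SRAM-based FPGA this logic is defined by configuration memory, which matches the argument above that the code enlarges the static part while protecting only the dynamic state vector.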

[Figure 5.4: bar chart with the categories Hamming, TMR'ed, and No Red.; value axis in %, ranging from 0 to 0.04]

Figure 5.4.: Comparing the SEU susceptibility of Hamming Coded FSMs, TMR’ed FSMs, and FSMs without redundancy. Plotted is the percentage of all injected SEUs that caused erroneous design behavior. The error bars refer to the square root of the number of errors. TMR’ed FSMs show a clear improvement while Hamming Coded FSMs do not.
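The quantities plotted in Figure 5.4 follow directly from the counts of injected SEUs and observed errors. The sketch below shows the arithmetic in Python; the function name and the example counts are purely illustrative assumptions and are not measurement results from the test.

```python
import math
from typing import Tuple


def seu_error_rate(n_errors: int, n_injected: int) -> Tuple[float, float]:
    """Return (rate, error bar) in percent.

    The rate is the share of injected SEUs that caused erroneous behavior;
    the error bar is the square root of the number of errors, scaled the same way.
    """
    if n_injected <= 0:
        raise ValueError("n_injected must be positive")
    rate = 100.0 * n_errors / n_injected
    error_bar = 100.0 * math.sqrt(n_errors) / n_injected
    return rate, error_bar


# Hypothetical counts, chosen only to show the arithmetic.
rate, error_bar = seu_error_rate(n_errors=25, n_injected=100_000)
```

Because the error bar scales with the square root of the error count, the relative uncertainty shrinks only slowly with the number of iterations, which is consistent with the long run times mentioned for the test.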

Hamming-coded FSMs might perform well on Flash-based FPGAs and antifuse FPGAs, where static logic is not susceptible to radiation, but they do not perform well on SRAM-based FPGAs.

For that reason, all critical FSMs that had to be protected in the scope of this thesis were implemented with TMR.

Implementation of TMR

All critical parts of the design are implemented with TMR. The parts that are considered to be critical are:

• the control logic (e.g. the state machines implementing the GET/PUT protocol)

• the logic that handles time and synchronization (e.g. the local timestamp counters)

The basic idea of TMR as well as important design considerations have already been presented in section 3.2.2.

Figure 5.5 shows a VHDL example that implements a TMR’ed flip-flop with synchronous set, synchronous reset, and clock enable, as well as the voters for the outputs. The combinational logic that is the source for the inputs of such a flip-flop needs to be tripled as well.

It is best practice to connect the three output signals of one TMR entity to the corresponding three input signals of the next entity. In reality, this is not always possible, for example when a communication between two different clock domains has to be implemented. A situation that occurs frequently with Selective TMR is that the next entity is not hardened by TMR and only accepts one input signal. In such situations, only one voter was used. The voter as a single point of failure is tolerated.

Figure 5.5.: VHDL implementation of a TMR’ed flip-flop with synchronous set, synchronous reset, and clock enable (ce) inputs and data outputs (Q_out). The data validity on the outputs is hardened by majority voters. Not shown here is the triplication of the combinational logic that sources the tripled inputs (D) of the TMR’ed flip-flop.
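The behavior of a TMR’ed flip-flop with voted outputs can be modeled in software. The following Python sketch is a behavioral model only, not the VHDL of Figure 5.5. The class and method names are hypothetical, and the priority of reset over set over clock enable is an assumption, since the figure is not reproduced here.

```python
from typing import List, Sequence, Tuple


def majority(a: int, b: int, c: int) -> int:
    """Two-out-of-three majority vote on single bits."""
    return (a & b) | (a & c) | (b & c)


class TmrFlipFlop:
    """Three redundant flip-flops whose outputs pass through three voters."""

    def __init__(self) -> None:
        self.q: List[int] = [0, 0, 0]

    def clock(self, d: Sequence[int], set_: Sequence[int] = (0, 0, 0),
              reset: Sequence[int] = (0, 0, 0),
              ce: Sequence[int] = (1, 1, 1)) -> None:
        """Apply one rising clock edge; every input is tripled."""
        for i in range(3):
            if reset[i]:        # assumed priority: reset, then set, then ce
                self.q[i] = 0
            elif set_[i]:
                self.q[i] = 1
            elif ce[i]:
                self.q[i] = d[i] & 1

    def outputs(self) -> Tuple[int, int, int]:
        """Each of the three voters sees all three registers."""
        voted = majority(*self.q)
        return (voted, voted, voted)


ff = TmrFlipFlop()
ff.clock(d=(1, 1, 1))
ff.q[1] ^= 1  # emulate an upset in one of the three registers
masked = ff.outputs()
```

With three voters, the next TMR entity still receives three independent signals. The single-voter case described above corresponds to using only one of the three voted outputs.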

Design Restriction. As previously with scrubbing, an additional design restriction has to be taken into consideration in order to allow for a TMR’ed firmware implementation:

• The design tool xst has to be called with the option “-equivalent_register_removal no” so that the optimization process does not inadvertently remove the TMR feature.

Detection of Data Corruption with CRC. Since the logic that handles data readout and data transport (the green resources in figure 4.4) is not protected by TMR, data corruption is possible and cannot be neglected. To detect such corruption, a CRC checksum is calculated at the very first input stage, in parallel to the deserialization of the signal received from the GET4. CRC calculation can be implemented very efficiently in hardware; figure 5.6 gives the complete source for calculating the 8-bit-wide CRC checksum that is used to detect data corruption. The checksum implementation is based on the generator polynomial x⁸+x²+x+1 (ITU-T-conformant CRC-8 [Int13]).

Figure 5.6: VHDL implementation of the CRC calculation that is used to detect whether data was corrupted by SEU effects. It implements the ITU-T-conformant CRC-8, with the generator polynomial x⁸+x²+x+1. The 38 lines of code shown here represent a full VHDL entity that is synthesizable.

The CRC calculation uses 8 flip-flops and 2 LUTs per input channel. Although one CRC entity has to be instantiated for every input channel, it does not add much overhead to the overall design.
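A bit-serial CRC-8 with the generator polynomial x⁸+x²+x+1 can be modeled as follows. This Python sketch is not the VHDL entity of Figure 5.6. The function names, the MSB-first bit order, the initial value of zero, and the absence of a final XOR are assumptions, since these parameters are not stated in the text.

```python
POLY = 0x07  # x^8 + x^2 + x + 1; the x^8 term is implicit in the 8-bit register


def crc8_shift(crc: int, bit: int) -> int:
    """Advance the 8-bit CRC register by one incoming serial bit."""
    feedback = ((crc >> 7) & 1) ^ (bit & 1)
    crc = (crc << 1) & 0xFF
    if feedback:
        crc ^= POLY
    return crc


def crc8(data: bytes, init: int = 0x00) -> int:
    """CRC-8 over `data`, fed MSB-first one bit at a time (assumed order)."""
    crc = init
    for byte in data:
        for i in range(7, -1, -1):
            crc = crc8_shift(crc, (byte >> i) & 1)
    return crc


# A receiver recomputes the checksum; a flipped bit changes the result.
message = b"123456789"
checksum = crc8(message)
corrupted = bytes([message[0] ^ 0x10]) + message[1:]
corruption_detected = crc8(corrupted) != checksum
```

The one-bit-per-step update mirrors a checksum that is computed in parallel to deserialization: the whole state is a single 8-bit register, in line with the 8 flip-flops per channel quoted above.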

In the document Radiation mitigation for SRAM-Based FPGAs in the CBM experiment (pages 81-84)