Choice of Radiation Mitigation Techniques

be used to detect possible corruption. This is easy to implement and requires only a few resources. Only errors in the configuration logic and in the time-representing logic need special protection; this logic must be implemented with TMR. The logic components can therefore be grouped into two categories:

• data path related → only error detection required → use CRC

• control related → error correction required → use TMR
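The two categories can be illustrated with a minimal Python sketch. It is not the firmware implementation, which is written in VHDL; the use of CRC-32 from `zlib` and the function names `crc_protect`, `crc_check`, and `tmr_vote` are assumptions made for illustration, since the text does not state which CRC polynomial is used.

```python
import zlib

def crc_protect(payload: bytes) -> bytes:
    """Append a CRC-32 checksum (4 bytes, big-endian) to the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def crc_check(frame: bytes) -> bool:
    """Return True if the trailing CRC-32 matches the payload (detection only)."""
    payload, received = frame[:-4], frame[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == received

def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority vote over three redundant copies (correction)."""
    return (a & b) | (a & c) | (b & c)
```

The CRC can only flag a corrupted data word, which is sufficient for the data path. The voter masks a single corrupted copy, which is what the control logic requires.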

Figure 4.3 shows the resource consumption of the data path logic and control logic in the GET4 read-out design from a conceptual point of view. Because many input channels are connected, the data-related part of the read-out logic accounts for the major part of the overall logic. Control logic constitutes only a small part of the design.

[Figure 4.3: block diagram with the blocks Data In (four channels), Buffer, Combine, and Control]

Figure 4.3: Analysis of the GET4 read-out design. The combination of multiple input channels consumes the major part of the resources; the contribution of control logic is much smaller.

In the particular implementation of the GET4 read-out firmware, control logic contributes about 10% of the full design, based on the MAP report from the synthesis of the non-redundant design. Figure 4.4 illustrates the design with control and data resources highlighted.

It is evident that avoiding costly TMR implementations in the data path saves the major part of the overhead of a full TMR implementation. Automated tools do not allow for fine-grain TMR replication and hence they are not used. For the critical logic parts, TMR is implemented manually using VHDL (see section 5.2.2). Results are presented in section 6.2.2.
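The saving can be made concrete with a rough estimate. The following sketch assumes that replicated logic costs exactly three times its original size and ignores voters, CRC logic, and routing overhead; the function name `relative_cost` is hypothetical. The measured results are those presented in section 6.2.2, not the output of this estimate.

```python
def relative_cost(control_fraction: float, replication: int = 3) -> float:
    """Estimated resource use relative to the non-redundant design
    when only the given fraction of the logic is replicated."""
    data_fraction = 1.0 - control_fraction
    return data_fraction + replication * control_fraction

selective = relative_cost(0.10)  # only the ~10% control logic is triplicated
full = relative_cost(1.0)        # the whole design is triplicated
print(f"selective TMR: ~{selective:.1f}x, full TMR: ~{full:.1f}x")
```

Under these simplifying assumptions, triplicating only the control logic costs roughly 1.2 times the original resources, compared with roughly 3 times for full TMR.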

Figure 4.4: Illustration of the resource consumption of the GET4 read-out firmware without redundancy, created with the Xilinx design tool PlanAhead. Resources that belong to control and timing logic are highlighted in red; resources that are part of the data path logic are highlighted in green. The critical resources (red) contribute much less to the overall resource consumption than the uncritical (green) resources. According to the numbers in the MAP report, control and timing related resources make up only 10% of the utilized resources in this design. Only the red resources need TMR protection.

Hamming Coded FSMs. For hardening of finite state machines (FSMs), Hamming-coded state vectors were considered as an alternative to a TMR approach. As already mentioned in paragraph FSM Encoding (page 52), contradictory statements can be found about the performance of Hamming-coded FSMs compared to TMR’ed FSMs. To clarify the situation for the design in question, a test was performed that evaluates the performance of the two options. The test is described in section 5.2.2 and suggests that a TMR approach should also be chosen for FSM design.
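To illustrate what a Hamming-coded state vector means, the following sketch encodes a 4-bit state number into a Hamming(7,4) codeword and corrects a single flipped bit. The code length and the function names are assumptions for illustration; the text does not specify the encoding used in the test of section 5.2.2.

```python
def hamming74_encode(state: int) -> int:
    """Encode a 4-bit state into a 7-bit codeword.
    Bit i-1 of the result holds code position i (positions 1..7)."""
    d = [(state >> i) & 1 for i in range(4)]
    bits = [0] * 8                              # index 0 unused
    bits[3], bits[5], bits[6], bits[7] = d      # data positions
    bits[1] = bits[3] ^ bits[5] ^ bits[7]       # parity over 1,3,5,7
    bits[2] = bits[3] ^ bits[6] ^ bits[7]       # parity over 2,3,6,7
    bits[4] = bits[5] ^ bits[6] ^ bits[7]       # parity over 4,5,6,7
    return sum(bits[i] << (i - 1) for i in range(1, 8))

def hamming74_correct(word: int) -> int:
    """Repair at most one flipped bit and return the corrected codeword."""
    syndrome = 0
    for pos in range(1, 8):
        if (word >> (pos - 1)) & 1:
            syndrome ^= pos                     # XOR of set positions
    if syndrome:                                # non-zero: error position
        word ^= 1 << (syndrome - 1)
    return word
```

For a 4-bit state, this encoding needs 7 flip-flops plus correction logic, whereas triplication needs 12 flip-flops plus voters. Which option performs better in a given design is exactly the question the test addresses.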

4.2.3. Fault Tolerant Protocol

A decision for a soft radiation mitigation strategy that allows occasional failures can only succeed when higher layers of the system design can tolerate an error in the lower levels.

This means that the impact of the radiation mitigation strategy is not restricted to components in the radiation zone, but also has implications for the design of parts that are located outside the radiation zone.

An example is the protocol design for data transport. If the protocol supports retransmission but the retransmission is not carefully designed, a data word that gets corrupted in a buffer in the radiation zone can result in an infinite cycle of retransmissions. Because this one corrupted word is retransmitted over and over again, the data channel is completely blocked.

When communicating with unreliable components, such details have to be considered at higher design levels. A clever use of timeouts and watchdog functionality is required, alongside an intelligent treatment of responses that do not comply with the defined protocol.
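One way to avoid the infinite retransmission cycle is to bound the number of retries. The following is a minimal sketch under stated assumptions: `request` and `is_valid` are hypothetical callbacks standing in for the link access and the integrity check (for example a CRC comparison), and the retry limit of three is arbitrary. It does not describe the protocol actually implemented.

```python
from typing import Callable, Optional

def fetch_word(request: Callable[[], Optional[int]],
               is_valid: Callable[[int], bool],
               max_retries: int = 3) -> Optional[int]:
    """Request a data word, retransmitting at most max_retries times.

    request() returns a word, or None if the link timed out.
    Returns the first valid word, or None once the retry budget is
    exhausted, so that a permanently corrupted buffer entry cannot
    block the data channel forever.
    """
    for _attempt in range(1 + max_retries):
        word = request()
        if word is not None and is_valid(word):
            return word
    return None  # caller must skip or flag the word and move on
```

When `fetch_word` returns `None`, the higher layer can mark the word as lost and continue with the next one instead of stalling.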

Problems related to these higher levels only emerge when the higher levels are put into operation. This means that the whole system needs to be put together to see whether all possible design flaws are eliminated.

4.2.4. Fault Injection Tests

As already described in section 3.2.4, fault injection tests (also known as SEU injection tests) can be used, albeit with some limitations, to emulate SEUs in a firmware design.

Fault injection tests use the same hardware and interfaces as scrubbing does. A Xilinx Platform Cable USB programmer and the Xilinx iMPACT software can be used to write the configuration bitstream from the PC to the FPGA via a JTAG interface. Another possibility is to use the faster SelectMAP interface. The SysCore Version 2 connects the SelectMAP interface to the Actel configuration controller, which can read a configuration bitstream from an on-board flash memory.

The SelectMAP interface has the advantage of being much faster than JTAG. However, the JTAG solution is more versatile because the bitstream can be modified by software running on the PC. Since flexibility is important for SEU injection tests, the solution chosen for the tests described later (in sections 5.2.2 and 5.3.2) uses the JTAG interface.
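Modifying the bitstream in software on the PC amounts to inverting selected bits in a copy of the configuration data before it is written to the FPGA. The sketch below shows only this bit manipulation. It treats the bitstream as a flat byte array and ignores headers, frame addressing, and checksums that a real Xilinx bitstream contains; the function names are hypothetical, and the download step via iMPACT is not shown.

```python
import random

def inject_seu(bitstream: bytes, bit_index: int) -> bytes:
    """Return a copy of the bitstream with exactly one bit inverted."""
    if not 0 <= bit_index < 8 * len(bitstream):
        raise IndexError("bit index outside bitstream")
    corrupted = bytearray(bitstream)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)

def inject_random_seus(bitstream: bytes, count: int,
                       rng: random.Random) -> bytes:
    """Invert `count` distinct, randomly chosen bits."""
    for bit_index in rng.sample(range(8 * len(bitstream)), count):
        bitstream = inject_seu(bitstream, bit_index)
    return bitstream
```

Passing a seeded `random.Random` instance makes an injection sequence reproducible, which helps when a particular upset has to be investigated again.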

Weak Points. The SEU injection method is a valuable aid for the evaluation of SEU effects, as it emulates a real high-radiation situation very closely. However, some differences remain.

The weak points for the particular SEU injection implementation for this thesis are:

• Only the static configuration memory is targeted for SEU injection. SEUs are not injected into the “dynamic” memory such as flip-flops and BRAMs, although these memory cells are also SRAM-based and will suffer from SEUs in the real experiment.

• The injection of an SEU via JTAG is rather slow; one injection run takes about seven seconds. Recording a statistically significant set of samples therefore requires a long time, and the tests had to run overnight. This could have been improved by replacing the JTAG-based injection with a hardware-based injection using the on-board configuration controller. However, that would also add new logic to the system. New logic is always a likely candidate for new errors and hence complicates debugging of the overall system.

• For some tests with a tight time schedule, the slow recording of statistics had to be compensated. In these cases (e.g. in the test explained in section 5.3.2), not one SEU but 20 SEUs were injected per iteration. This allows for multi-bit upsets, and some radiation mitigation techniques might not perform optimally on multi-bit upsets (see paragraph TMR Needs Repair, page 49).
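The effect of the slow JTAG injection on the achievable statistics can be estimated with a short calculation. The seven seconds per run are taken from the text; the 12-hour night and the function name `injection_runs` are assumptions for illustration only.

```python
def injection_runs(duration_hours: float, seconds_per_run: float = 7.0) -> int:
    """Number of complete injection runs that fit into the given time."""
    return int(duration_hours * 3600 // seconds_per_run)

# Assumed 12-hour overnight test at about 7 s per run (from the text).
runs_per_night = injection_runs(12)
print(runs_per_night, "injection runs in an assumed 12-hour night")
```

With these numbers, a single night yields only a few thousand samples, which shows why injecting several SEUs per iteration was attractive despite the drawback of multi-bit upsets.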