5.2 HiLDE: HiL Design Environment
5.2.3 Communication and Performance
In order to realise FPGA-in-the-Loop simulations, the RAPTOR system has to be connected to a standard PC, as depicted in figure 5.9. The main board of a PC typically has a processor and a set of buses and bridges (i.e., a chipset) to interconnect peripheral components, such as memory, video cards, and external devices. The RAPTOR system uses the PCI-Bus to connect to the PC. To exchange data between RAPTOR and the host processor, PIO and DMA transmission methods can be used; both methods are described in the following sections.
For the experiments presented in this section, an Intel Pentium 4 processor with a clock frequency of 3.0 GHz and 1 GByte of PC400 Double Data Rate (DDR) RAM are used. The mainboard has an 865G chipset, whose connection to the RAPTOR system is depicted in figure 5.9. Although the results of the experiments are specific to this setup, they can be generalised to newer computer systems.
[Figure 5.9, block diagram. Host PC: CPU (Pentium 4), Front Side Bus (6.4 GB/s), system controller 865G (north bridge), DDR RAM on channels A and B (6.4 GB/s each), AGP bus (2 GB/s) with AGP graphics card, Gigabit Ethernet, hub interface (266 MB/s), peripheral bus controller 865G (south bridge) with HDD, USB, BIOS, audio, RAID and LAN, PCI bus (133 MB/s) with PCI slots. RAPTOR2000: PCI bus bridge (PLX PCI9054), local bus arbiter, FPGA modules.]
Figure 5.9: Coupling of host computer and RAPTOR. In this example, a Pentium 4 with an 865G chipset is presented.
The following section presents the different kinds of FPGA-iL simulations and relates them to the choice of a transmission method.
Open-Loop vs. Closed-Loop Simulations
In an open-loop simulation, the DUT does not have an implicit or explicit feedback loop to the simulated environment. A typical example of an open-loop simulation is a digital filter. In contrast, in a closed-loop simulation the DUT interacts closely with the simulated environment. Control systems typically require closed-loop simulations, because their outputs are computed based on the state of the controlled system.
The kind of simulation has a great influence on the kind of communication (e.g., PIO or DMA) that is best suited to the FPGA-iL simulation. In a closed-loop simulation, data has to be exchanged between the DUT and the simulation software at every integration step. Therefore, the kind of memory used to store the inputs and outputs of the DUT and the kind of communication have to be selected accordingly. In an open-loop simulation, the amount of data that can be sent to the DUT depends mainly on the speed of the simulation. Therefore, data can be sent to the DUT in a way that reduces the communication overhead, e.g., a burst of data can be sent at once.
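The difference in communication pattern can be illustrated with a minimal sketch. Everything in it is hypothetical: `CountingDut` is a software stand-in for the DUT (it merely doubles its inputs), the plant update is a toy equation, and one call to `exchange` is assumed to model one host-to-FPGA round trip regardless of its length.

```python
class CountingDut:
    """Hypothetical stand-in for the DUT on the FPGA.

    It doubles every input sample and counts how many host transfers
    (round trips) were needed to do so.
    """

    def __init__(self):
        self.transfers = 0

    def exchange(self, samples):
        # One call models one host<->FPGA round trip, whatever its length.
        self.transfers += 1
        return [2 * s for s in samples]


def run_closed_loop(dut, steps, state=1.0):
    """Closed loop: the next input depends on the previous output,
    so every integration step needs its own round trip."""
    for _ in range(steps):
        (u,) = dut.exchange([state])
        state = state - 0.25 * u  # toy plant update (assumption)
    return state


def run_open_loop(dut, samples, burst):
    """Open loop: all inputs are known in advance and can be sent in bursts."""
    outputs = []
    for i in range(0, len(samples), burst):
        outputs.extend(dut.exchange(samples[i:i + burst]))
    return outputs
```

With eight samples, the closed-loop run needs eight round trips, whereas the open-loop run with a burst length of four needs only two.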
PIO Communication
In PIO transmission mode, the processor loads the data to be transferred into one of its registers before the data is actually sent through the Front-Side-Bus, the PCI-Bus and finally the Local-Bus to a DUT running on the FPGA (cf. figure 5.9). Correspondingly, the data generated by the DUT (i.e., control signals) are sent from the registers of the RAPTOR to registers of the processor by a read command of the host processor. This transmission mode blocks the processor during the data transfer.
DMA Communication
Direct memory access (DMA) is a transmission mode where a peripheral device transfers information directly to or from memory, without the processor being required to perform the transaction. This has the advantage that the processor can execute other tasks while the transfer is taking place.
The PCI-Bridge of the RAPTOR system is able to operate as a DMA controller with two independent channels. This bridge is able to execute DMA transfers to the PCI-Bus as well as to the Local-Bus. Because a DMA transfer has to be initialised and both the PCI-Bus and the Local-Bus have to be arbitrated, DMA is worth using instead of PIO only if the amount of data to be transferred is above a certain threshold value, which is explored in the following section.
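The existence of such a threshold can be made plausible with a simple linear cost model. This model is an assumption for illustration, not the measured behaviour of the system: PIO is taken to cost a fixed time per word, while DMA costs a fixed setup time (initialisation plus arbitration) and a smaller time per word. The function name and all numbers below are hypothetical.

```python
def dma_break_even_words(setup_ns, pio_word_ns, dma_word_ns):
    """Smallest word count for which DMA beats PIO in a linear cost model.

    Assumed model (integer times in nanoseconds):
        PIO cost: n * pio_word_ns
        DMA cost: setup_ns + n * dma_word_ns
    Returns None if DMA never becomes faster.
    """
    gain_per_word = pio_word_ns - dma_word_ns
    if gain_per_word <= 0:
        return None  # DMA never catches up in this model
    # DMA is strictly faster once n > setup_ns / gain_per_word.
    return setup_ns // gain_per_word + 1


# Made-up example values, not measurements:
threshold = dma_break_even_words(setup_ns=10_000, pio_word_ns=600, dma_word_ns=100)
```

With these made-up values both methods take equally long for 20 words, so DMA is faster from 21 words onward. The actual threshold has to be measured on the target system.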
Simulation Performance
To estimate the maximum performance of the presented framework, several pre- and post-processing steps need to be considered, which have to be conducted in every simulation cycle. A maximum for the simulation frequency $F_{sim}$ is given by
$$F_{sim} = \frac{1}{T_{sw2hw} + T_{send} + T_{run} + T_{receive} + T_{hw2sw}} \qquad (5.1)$$
where $T_{sw2hw}$ and $T_{hw2sw}$ are the conversion times from a simulator-internal to a hardware-specific number representation and vice versa, $T_{send}$ and $T_{receive}$ are the transfer times from the main memory of the host to the prototyping system and back, and $T_{run}$ is the latency of the design itself. All values except $T_{run}$ depend on the interface between the simulation environment and the hardware design, while $T_{run}$ depends only on the speed of the hardware design. The delay of the simulator, which may be running a test bench, a data logger, or similar, cannot be estimated here, because it depends on the complexity of the simulation. As the interface latency is highly dependent on the underlying host architecture, the following measurements are presented as an example for transfer and conversion times.
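Equation 5.1 can be evaluated directly, as in the following minimal sketch. The function name and the choice of units (microseconds in, kHz out) are assumptions made for this example, and the values in the example call are placeholders rather than measurements.

```python
def max_sim_frequency_khz(t_sw2hw_us, t_send_us, t_run_us, t_receive_us, t_hw2sw_us):
    """Upper bound on the simulation frequency according to equation 5.1.

    All times are given in microseconds; the result is in kHz.
    """
    cycle_us = t_sw2hw_us + t_send_us + t_run_us + t_receive_us + t_hw2sw_us
    if cycle_us <= 0:
        raise ValueError("the cycle time must be positive")
    # 1 / microsecond = 1 MHz = 1000 kHz
    return 1000.0 / cycle_us


# Placeholder values: a 10 microsecond cycle gives a bound of 100 kHz.
f_max = max_sim_frequency_khz(1, 3, 1, 4, 1)
```

Since the delay of the simulator itself is not part of the sum, the result is an upper bound only.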
[Plot: round-trip frequency [kHz] over the number of 32-bit data words; curves: DMA SGL transfer, DMA block transfer, PIO]
Figure 5.10: Maximum simulation frequency for a given number of input/output pairs
Figure 5.10 shows the simulation frequencies of the different transfer modes against the number of I/O pairs (i.e., combinations of one input and one output). I/O pairs are used because the transfer times of write and read transfers differ, and assuming the same number of inputs and outputs is a good approximation of real scenarios. The figure shows that the transfer mode should be selected according to the number of I/Os: PIO is faster for up to 18 I/O pairs, whereas DMA block transfers are faster from 22 I/O pairs onward.
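The selection rule read off figure 5.10 can be written down as a small helper. This is only a sketch: the function name is hypothetical, the thresholds of 18 and 22 I/O pairs apply to the measured setup only, and the range between them, for which the text names no winner, is reported as undecided.

```python
def choose_transfer_mode(io_pairs, pio_max=18, dma_min=22):
    """Pick a transfer mode for a given number of I/O pairs.

    The default thresholds are the ones stated for the measured setup
    (Pentium 4, 865G chipset); they have to be re-measured on other hosts.
    """
    if io_pairs < 0:
        raise ValueError("the number of I/O pairs must not be negative")
    if io_pairs <= pio_max:
        return "PIO"
    if io_pairs >= dma_min:
        return "DMA block transfer"
    # Between the two thresholds the text does not name a faster mode.
    return "undecided"
```

For example, a DUT with 8 I/O pairs would be served by PIO, and one with 64 I/O pairs by DMA block transfers.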
Communication Optimization
In the simulation flow as described above, all I/O data have to be transferred at every clock cycle, resulting in redundant I/O operations when data has not changed. To decrease this overhead, two further concepts were integrated in HiLDE: event-based communication and transactors:
• Event-based communication: to reduce the number of redundant I/O operations, only data that actually changes has to be transferred. While this is straightforward to implement in software (Simulink provides appropriate functions), the hardware wrapper has to be extended. The register of every output port is extended with a mechanism to detect changes at the output. For $n_o$ output ports, an additional register with $n_o$ bits stores the results of these detectors and thus indicates which values must be read by the host computer. The number of additional read operations needed to retrieve this information depends on the bit width of the bus to the host computer, resulting in an overall number of read accesses $\tilde{n}_r$:
$$\tilde{n}_r = \Delta(\mathrm{out}) + \left\lceil \frac{n_o}{\mathrm{wordwidth}} \right\rceil \qquad (5.2)$$
where $\Delta(\mathrm{out})$ is the number of output ports with a new value. Given that $n_r$ denotes the number of read operations in the standard HiLDE wrapper, the benefit $n_r - \tilde{n}_r$ depends on the ratio of I/Os with regularly changing values to the overall number of I/Os in the DUT. In general, DUTs with irregularly changing I/Os will benefit from this technique.
• Transactors: whenever the sequence of events (value changes) is predefined, such as in communication protocols, the number of I/O operations can be reduced even further by implementing adaptors for the simulation and for the FPGA. The amount of savings depends on the complexity of the protocol: instead of transferring all control signals or control-signal changes, the adaptors detect protocol activity and transfer only the necessary information, such as address and data. The actual protocol handling is processed in the adaptors in the simulation environment and in the FPGA. While the functionality of the HiL simulation is not affected by this method, the number of I/O operations for a protocol as described in [Kal02] can be reduced by over 90%.
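The read count of the event-based scheme (equation 5.2) and the change-flag register can be modelled in software as follows. This is a sketch under stated assumptions, not the HiLDE wrapper itself: the function names are hypothetical, every output port is assumed to fit into one bus word, and the bit order of the flag register (lowest port in the least significant bit) is chosen for illustration.

```python
def event_based_reads(changed_outputs, n_outputs, word_width=32):
    """Number of read accesses according to equation 5.2.

    One read per changed output port, plus the reads needed to fetch the
    change-flag register of n_outputs bits over a bus of word_width bits.
    """
    flag_reads = -(-n_outputs // word_width)  # integer ceiling division
    return changed_outputs + flag_reads


def change_flag_words(previous, current, word_width=32):
    """Pack one change bit per output port into bus-width words.

    Software model of the detectors; the bit order is an assumption.
    """
    words = [0] * (-(-len(current) // word_width))
    for port, (old, new) in enumerate(zip(previous, current)):
        if old != new:
            words[port // word_width] |= 1 << (port % word_width)
    return words
```

As an example, for a DUT with 40 output ports on a 32-bit bus, of which 3 ports changed, 3 + 2 = 5 read accesses are needed. If the standard wrapper is assumed to read every output port once per cycle (40 reads), the benefit in that cycle is 35 read accesses.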