
5.2 HiLDE: HiL Design Environment

5.2.3 Communication and Performance

In order to realise FPGA-in-the-Loop simulations, the RAPTOR system has to be connected to a standard PC, as depicted in figure 5.9. The main board of a PC typically has a processor and a set of buses and bridges (i.e., a chipset) that interconnect peripheral components such as memory, video cards, and external devices. The RAPTOR system uses the PCI-Bus to connect to the PC. To exchange data between RAPTOR and the host processor, either PIO or DMA transmission can be used; both methods are described in the following sections.

For the experiments presented in this section, an Intel Pentium 4 processor with a clock frequency of 3.0 GHz and 1 GByte of PC400 Double Data Rate (DDR) RAM are used. The mainboard has an 865G chipset, whose connection to the RAPTOR system is depicted in figure 5.9. Although the results of the experiments are specific to this setup, they can be generalised to newer computer systems.

[Figure 5.9 (block diagram). Host PC: CPU (Pentium 4), Front Side Bus (6.4 GB/s), System Controller 865G (North-Bridge) with DDR RAM on Channel A and Channel B (6.4 GB/s), AGP graphics card (AGP Bus, 2 GB/s) and Gigabit Ethernet (266 MB/s); Hub Interface (266 MB/s); Peripheral Bus 865G (South-Bridge) with HDD, USB, BIOS, Audio, RAID, LAN and the PCI Bus (133 MB/s) with its PCI slots. RAPTOR2000: PCI Bus Bridge (PLX PCI9054), Local Bus Arbiter, FPGA modules.]

Figure 5.9: Coupling of host computer and RAPTOR. In this example, a Pentium 4 with an 865G chipset is shown.

In the following, the different kinds of FPGA-iL simulation are presented and related to the choice of a transmission method.

Open-Loop vs. Closed-Loop Simulations

In an open-loop simulation, the DUT has no implicit or explicit feedback loop to the simulated environment. A typical example of an open-loop simulation is a digital filter. In a closed-loop simulation, by contrast, the DUT interacts closely with the simulated environment. Control systems typically require closed-loop simulations, because their outputs are computed from the state of the controlled system.

The kind of simulation strongly influences which kind of communication (e.g., PIO or DMA) is best suited to the FPGA-iL simulation. In a closed-loop simulation, data has to be exchanged between the DUT and the simulation software at every integration step. Therefore, the kind of memory used to store the inputs and outputs of the DUT and the kind of communication have to be selected accordingly. In an open-loop simulation, the amount of data that can be sent to the DUT depends mainly on the speed of the simulation. Data can therefore be sent in a way that reduces the communication overhead, e.g., as a burst of data sent at once.
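The difference between the two communication patterns can be sketched as follows. This is only an illustration and not HiLDE code: `CountingLink`, `closed_loop` and `open_loop` are hypothetical names, the DUT is replaced by a plain Python function, and the plant model, the gains and the burst size are arbitrary assumptions.

```python
class CountingLink:
    """Hypothetical stand-in for the host-to-FPGA link; it only counts round trips."""

    def __init__(self, dut):
        self.dut = dut          # assumed DUT model: one output per input value
        self.round_trips = 0

    def exchange(self, inputs):
        """Send a list of inputs and read back one output per input (one round trip)."""
        self.round_trips += 1
        return [self.dut(u) for u in inputs]


def closed_loop(link, steps, x0=1.0, dt=0.1):
    """Closed loop: the next input depends on the last output, so every
    integration step needs its own round trip."""
    x = x0
    for _ in range(steps):
        (u,) = link.exchange([x])   # controller output for the current plant state
        x = x + dt * u              # explicit Euler step of the assumed plant dx/dt = u
    return x


def open_loop(link, samples, burst):
    """Open loop: all inputs are known in advance and can be sent in bursts."""
    outputs = []
    for i in range(0, len(samples), burst):
        outputs.extend(link.exchange(samples[i:i + burst]))
    return outputs


# Assumed proportional controller as closed-loop DUT.
ctrl_link = CountingLink(lambda x: -2.0 * x)
closed_loop(ctrl_link, steps=100)
print(ctrl_link.round_trips)    # 100

# Assumed stateless gain standing in for a filter as open-loop DUT.
filt_link = CountingLink(lambda u: 0.5 * u)
open_loop(filt_link, list(range(100)), burst=25)
print(filt_link.round_trips)    # 4
```

With the same number of samples, the closed-loop run pays the per-transfer overhead at every step, whereas the open-loop run pays it once per burst.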

PIO Communication

In PIO transmission mode, the processor loads the data to be transferred into one of its registers before the data is actually sent through the Front-Side-Bus, the PCI-Bus and finally the Local-Bus to a DUT running on the FPGA (cf. figure 5.9). Correspondingly, the data generated by the DUT (i.e., control signals) is transferred from the registers of the RAPTOR system to registers of the processor by a read command of the host processor. This transmission mode blocks the processor during the data transfer.

DMA Communication

Direct memory access (DMA) is a transmission mode where a peripheral device transfers information directly to or from memory, without the processor being required to perform the transaction. This has the advantage that the processor can execute other tasks while the transfer is taking place.

The PCI-Bridge of the RAPTOR system is able to operate as a DMA controller with two independent channels. It can execute DMA transfers to the PCI-Bus as well as to the Local-Bus. Because a DMA transfer has to be initialised and the PCI- and Local-Bus have to be arbitrated, DMA is worth using instead of PIO only if the amount of data to be transferred exceeds a certain threshold, which is explored in the following section.
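The existence of such a threshold follows from a simple linear cost model, sketched below. The model and all numbers are assumptions made for illustration only and are not measurements of the RAPTOR system: PIO is assumed to cost a fixed time per word, while DMA is assumed to cost a fixed setup time (initialisation plus arbitration) and a smaller time per word.

```python
import math


def pio_time(n_words, t_word):
    """Assumed PIO cost: every word is moved by the processor individually."""
    return n_words * t_word


def dma_time(n_words, t_setup, t_burst_word):
    """Assumed DMA cost: fixed setup overhead plus a per-word burst cost."""
    return t_setup + n_words * t_burst_word


def break_even_words(t_word, t_setup, t_burst_word):
    """Smallest word count for which DMA is strictly faster than PIO.

    Returns None if DMA never catches up (per-word cost not lower than PIO).
    """
    if t_burst_word >= t_word:
        return None
    # DMA wins when t_setup + n * t_burst_word < n * t_word,
    # i.e. n > t_setup / (t_word - t_burst_word).
    return math.floor(t_setup / (t_word - t_burst_word)) + 1


# Purely hypothetical timings in microseconds.
n = break_even_words(t_word=0.5, t_setup=8.0, t_burst_word=0.25)
print(n)  # 33
```

In this model the threshold grows with the setup overhead and shrinks as the per-word advantage of the burst transfer increases.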

Simulation Performance

To estimate the maximum performance of the presented framework, several pre- and post-processing steps need to be considered, which have to be conducted in every simulation cycle. A maximum for the simulation frequency $F_{\mathrm{sim}}$ is given by

$$F_{\mathrm{sim}} = \frac{1}{T_{\mathrm{sw2hw}} + T_{\mathrm{send}} + T_{\mathrm{run}} + T_{\mathrm{receive}} + T_{\mathrm{hw2sw}}} \qquad (5.1)$$

where $T_{\mathrm{sw2hw}}$ and $T_{\mathrm{hw2sw}}$ are the conversion times from a simulator-internal to a hardware-specific number representation and vice versa, $T_{\mathrm{send}}$ and $T_{\mathrm{receive}}$ are the transfer times from the main memory of the host to the prototyping system and back, and $T_{\mathrm{run}}$ is the latency of the design itself. All values except $T_{\mathrm{run}}$ depend on the interface between the simulation environment and the hardware design, while $T_{\mathrm{run}}$ depends on the speed of the hardware design only. The delay of the simulator, which may be running a test-bench, a data logger, or similar, cannot be estimated here, because it depends on the complexity of the simulation. As the interface latency is highly dependent on the underlying host architecture, the following measurements are presented as an example for transfer and conversion times.
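As a small worked example, equation (5.1) can be evaluated directly. The function name and the timing values below are hypothetical and chosen only to make the arithmetic easy to follow; they are not the measured values of the setup described here.

```python
def simulation_frequency(t_sw2hw, t_send, t_run, t_receive, t_hw2sw):
    """Upper bound on the simulation frequency according to equation (5.1).

    All arguments are times in seconds; the result is in Hz.
    """
    cycle_time = t_sw2hw + t_send + t_run + t_receive + t_hw2sw
    if cycle_time <= 0:
        raise ValueError("the sum of the cycle times must be positive")
    return 1.0 / cycle_time


# Hypothetical per-cycle times: 1 us conversion each way, 4 us send,
# 1 us DUT latency, 3 us receive, i.e. 10 us per simulation cycle.
f_sim = simulation_frequency(1e-6, 4e-6, 1e-6, 3e-6, 1e-6)
print(f"{f_sim / 1e3:.1f} kHz")  # 100.0 kHz
```

Because the terms add up in the denominator, the slowest of the five steps bounds the achievable frequency; in this example, shortening the DUT latency alone could raise the frequency to at most about 111 kHz.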

[Figure 5.10: plot of the round-trip frequency [kHz] against the number of 32-bit data words for DMA SGL transfer, DMA block transfer and PIO.]

Figure 5.10: Maximum simulation frequency for a given number of input/output pairs

Figure 5.10 shows the simulation frequencies of the different transfer modes against the number of I/O pairs (i.e., combinations of one input and one output). I/O pairs are used because the transfer times of write and read transfers differ, and assuming the same number of inputs and outputs is a good approximation of real scenarios. The figure shows that the transfer mode should be selected according to the number of I/Os: PIO is faster for up to 18 I/O pairs, while DMA block transfers are faster from 22 I/O pairs onwards.
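The selection rule stated above can be written down as a small helper. This is a sketch under stated assumptions: the function and constant names are hypothetical, the two limits are the ones quoted in the text for this particular host setup, and the range between them is reported as undetermined because the text makes no statement about it.

```python
PIO_MAX_PAIRS = 18   # PIO is reported to be faster up to this many I/O pairs
DMA_MIN_PAIRS = 22   # DMA block transfers are reported to be faster from here on


def select_transfer_mode(io_pairs):
    """Pick a transfer mode from the number of I/O pairs of the DUT."""
    if io_pairs < 0:
        raise ValueError("the number of I/O pairs must not be negative")
    if io_pairs <= PIO_MAX_PAIRS:
        return "PIO"
    if io_pairs >= DMA_MIN_PAIRS:
        return "DMA block"
    return "undetermined"   # 19 to 21 pairs: no statement in the text


for pairs in (4, 18, 20, 22, 64):
    print(pairs, select_transfer_mode(pairs))
```

On a different host architecture the two limits would have to be measured again before such a rule is applied.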

Communication Optimization

In the simulation flow described above, all I/O data has to be transferred at every clock cycle, which results in redundant I/O operations when data has not changed. To decrease this overhead, two further concepts were integrated in HiLDE, event-based communication and transactors:

• Event-based communication: to reduce the number of redundant I/O operations, only data that actually changes has to be transferred. While this is straightforward to implement in software (Simulink provides appropriate functions), the hardware wrapper has to be extended. The register of every output port is extended with a mechanism to detect changes at the output. For $n_o$ output ports, an additional register with $n_o$ bits stores the results of these detectors and thus indicates which values must be read by the host computer. The number of additional read operations needed to retrieve this information depends on the bit-width of the bus to the host computer, resulting in an overall number of read accesses $\tilde{n}_r$:

$$\tilde{n}_r = \Delta(\mathrm{out}) + \left\lceil \frac{n_o}{\mathrm{wordwidth}} \right\rceil \qquad (5.2)$$

where $\Delta(\mathrm{out})$ is the number of output ports with a new value. Given that $n_r$ denotes the number of read operations in the standard HiLDE wrapper, the benefit of $\tilde{n}_r$ over $n_r$ depends on the ratio of I/Os with regularly changing values to the overall number of I/Os in the DUT. In general, DUTs with irregularly changing I/Os will benefit from this technique.

• Transactors: whenever the sequence of events (value changes) is predefined, as in communication protocols, the number of I/O operations can be reduced even further by implementing adaptors for the simulation and for the FPGA. The savings depend on the complexity of the protocol: instead of transferring all control signals or control-signal changes, the adaptors detect protocol activity and transfer only the necessary information, such as address and data. The actual protocol handling is processed in the adaptors in the simulation environment and in the FPGA. While the functionality of the HiL simulation is not affected by this method, the number of I/O operations for a protocol as described in [Kal02] can be reduced by over 90%.
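The read count of equation (5.2) for event-based communication can be illustrated with a short sketch. It models the change detectors in software only; the function names are hypothetical, a 32-bit bus is assumed, and the standard wrapper is assumed to need one read per output port in every cycle.

```python
import math


def change_flags(previous, current):
    """Model of the change detectors: one flag per output port,
    True where the value differs from the previous cycle."""
    return [p != c for p, c in zip(previous, current)]


def event_based_reads(flags, word_width=32):
    """Equation (5.2): reads for the changed outputs plus the reads
    needed to fetch the flag register with one bit per output port."""
    n_o = len(flags)
    return sum(flags) + math.ceil(n_o / word_width)


def standard_reads(n_o):
    """Assumed cost of the standard wrapper: every output port is read."""
    return n_o


# Hypothetical DUT with 40 output ports, of which 3 change in this cycle.
previous = [0] * 40
current = list(previous)
for port in (0, 7, 39):
    current[port] = 1

flags = change_flags(previous, current)
print(event_based_reads(flags), standard_reads(len(flags)))  # 5 40
```

If every output changes in every cycle, the flag reads come on top of the regular reads (42 instead of 40 in this example), which is why the technique pays off mainly for DUTs whose outputs change irregularly.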