GCA multi-softcore architecture for agent systems simulation

(1)

GCA Multi-Softcore Architecture for Agent Systems Simulation

Christian Sch¨ack, Wolfgang Heenes, Rolf Hoffmann {schaeck,heenes,hoffmann}@ra.informatik.tu-darmstadt.de

Abstract:

The GCA (Global Cellular Automata) model consists of a collection of cells which change their states synchronously depending on the states of their neighbors like in the classical CA (Cellular Automata) model. In differentiation to the CA model the neighbors are not fixed and local, they are variable and global. The GCA model is applicable to a wide range of parallel algorithms. The application in this paper is multi-agent behavior simulation. By using the GCA model many multi-agent behaviors can easily be described and efficiently simulated. A general purpose multiprocessor architecture to accelerate the simulation of multi-agent behaviors based on the massively parallel GCA model is presented. In contrast to a special purpose implementation of a GCA al- gorithm the multiprocessor system allows the implementation in a flexible way through programming, thus simulating different behaviors on the same architecture. The implemented architecture mainly consists of a set of processors (Nios II) and a network.

The Nios II features a general-purpose RISC CPU architecture designed to address a wide range of applications. Three different networks have been implemented and evaluated with regard to the multi-agent simulation application. A system with up to 16 processors was implemented as a prototype on an FPGA.

1 Introduction

Simulation of multi-agent systems are used in order to

(A) observe interesting behaviors in systems modeled with agents (problems in physics, chemistry, biology, nano technology, etc.). [Kl¨u01, DJT01]

(B) to solve computational tasks modeled with agents. [FS05]

(C) evaluate the performance (fitness) of a multi-agent system during heuristic optimization procedures. Thereby the behavior of agents is optimized in order to solve a given global task. [KEFH09]

The simulation of multi-agent systems with many agents and complex behaviors of the agents is a time-consuming task. We will present different multi-softcore FPGA architectures to accelerate the simulation on a two dimensional grid. The aim is to find a flexible, customizable and programmable architecture to accelerate the simulation of different agent behaviors. There is a trade-off between a dedicated hardware implementation and a programmable architecture. To provide a programmable architecture without the need of

(2)

resynthesizing, a softcore processor such as the NIOS II is a good choice. The NIOS II processor comes in three different speed grades allowing to adjust the overall architecture in order to gain the best performance out of it. The processor can be easily programmed in C by using the Eclipse based NIOS II IDE. Loading the developed architecture with a new agent behavior is easy and transparent, meaning that the programmer does not need to be fully aware of the underlying hardware architecture. This also allows to integrate future innovations without changing the complete software code. Multiple NIOS II softcores are used to accelerate the simulation by distributing the whole task among the available processors.

GCA Model

For a general multi-agent simulation the GCA model (Global Cellular Automata)[HVW00, HVWH01] is used. The GCA model is a generalization of the CA model. The GCA model consists of a set of cells. Each cell can hold multiple data and link fields. For each cell a local cell rule is applied calculating the next data and link values. The different states of the cell values are called generations. In the GCA model each cell has access to any other cell using dynamic links. Figure 1 shows the principle for two link fields and one data field. The amount of data and link fields can be adjusted as necessary.

Figure 1: The operation principle of the GCA

Multi-Softcore Architectures and Interconnection Networks

The performance of multiprocessor SoC architectures is depending on the communication network and the processor architecture. Several contributions described and compared different bus architectures [RSM01]. Multiprocessor architectures with the Nios II softcore were presented in [KSH07].

2 Architecture Overview

The architecture supporting the GCA model consists ofpNios II processors each sup- plied with a program memory and a data memory. The program memory holds the programmable cell rule. The data memory holds the cell values. Each data memory holds a part of all cells. Each Nios II processor applies the cell rule to the cell data in its associated data memory. To be able to read neighbor data addressed by the links all processors are

(3)

connected to a network. Write accesses to neighbor cells are not allowed according to the GCA model. The data memories are implemented as dual port memories. The first port is used for read and write accesses by the associated processor, the second port is used for read accesses by other processors via the network. The data memories are capable of holding two generations at a time. The current generation is used for all read accesses while the next generation is used for all write accesses. Thereby data consistency is given at any time.

Figure 2: General system architecture with datapath

A Nios II processor and its associated data and program memory is called Processing Unit(PU). The design of the network is essential for the overall performance. The used networks will be explained in chapter 2.3. The performance of the architecture strongly dependents on the network implementation. The current architecture supports one link field L and one data field D.

2.1 The Nios II Softcore Processor

The Nios II processor [Alt09] is a general-purpose RISC processor core, providing:

• Full 32-bit instruction set, data path, and address space

• 32 general-purpose registers

• Dedicated instructions for computing 64-bit and 128-bit products of multiplication

• Floating-point instructions for single-precision floating-point operations

• Custom instructions

For the implementation of the multiprocessor architecture, the custom instructions [Alt09]

turned out to be very useful (Section 2.2).

(4)

2.2 Custom Instruction

We extended the Nios II instruction set with a parametrized custom instruction. The custom instruction can be specified by one of the following functions:

(1) AUTO GET L: automatic decision between internal/external read of the linkLbased on processor ID comparison

(2) AUTO GET D: automatic decision between internal/external read of the data D based on processor ID comparison

(3) INTERNAL GET L: internal read of the linkLfrom memory (4) INTERNAL GET D: internal read of the dataDfrom memory

(5) SET LS: write linkL to the internal memory and shift right by one for easier accesses arranged by the power of two

(6) SET L: write linkLto the internal memory

(7) EXTERNAL GET L: read linkLby using the network (8) SET D: write dataDto the internal memory

(9) EXTERNAL GET D: read dataDby using the network (10) NEXTGEN: processor synchronization command (Section 2.4)

The automatic read functions (1-2) decide between an internal or external read access by comparing the address with the processor ID. The actual read process is then mapped onto the right read function (2,3,7,9) named LOCAL GET{X}or GET{X}where X∈ {L,D}.

This process does not need an extra clock cycle and should therefore always be favored over the separated access functions.

2.3 Network

The interconnection between the processors can be implemented using different network topologies. We have previously used an omega network. [SHH09a, SHH09b] The omega network is good to use for regular accesses (power of two) e.g. for the bitonic sort algo- rithm. As the arbitration is done within the network all read accesses to the network have to be synchronized. For multi-agent simulation the use of such a network is not applicable because the access patterns are arbitrary. Therefore we have investigated other networks to gain the best performance for such an application.

(5)

The requirements for the network are:

• fast access to neighbor cells, reduce wait cycles

• reduce the amount of collisions/congestion

• fair and easy arbitration

• keep the system frequency high Registered Ring

The registered ring network contains a register stage for each processor. Each processor can only read from and write to its own register stage. At each clock cycle the data is forwarded from stageN to stageN+ 1 modp. The ring bus consists of a request flag, an acknowledge flag, an address and a data field (Figure 3). The address is further divided into a processor ID and the data memory address.

Figure 3: Register fields on the registered ring network

If the register stage of a processor is free (request and acknowledge flag is zero) the processor can perform a write access to the network. In that case the processor sets the request flag to one, writes the address of the requested cell onto the address part of the bus and its own ID onto the data part of the bus. The ID of the sending processor is needed for the reply packet. The requesting processor is stalled until the request has been successfully processed.

Eachprocessing unit listens for incoming requests on the ring. If the request flag is set and the address is within the processors address space, the request is removed from the ring by setting the request flag to zero. If a processing unit is executing a read request all following requests are forwarded on the ring. As soon as the processing unit has finished the execution a new read access can be executed. Note that there is a difference between the processor itself and the processing unit the processor belongs to (Figure 2). A processing unit can execute incoming read request from the ring in parallel to and without interrupting its processor.

After the request has been processed and the requested data is available to be send back, the processing unit checks if the register stage is free. The data is then written onto the data part of the bus, the ID of the requesting processor is written onto the address part of the bus and the acknowledgment flag is set to one.

The processor that initiated the request listens for acknowledgment packets with its own ID. If such a packet is received it is removed from the ring. The acknowledgment flag is set to zero and the data is copied from the ring and advanced to the processor.

The protocol handling is done transparent and does not require action from the Nios II processor nor does a request disturb another processor with its execution.

(6)

Figure 4: Registered ring network for eight processors (showing two PUs)

Deadlocks are not possible as there are as many register stages as processors. Each processor can only put one request on the ring at a time. In the worst case scenario all register stages contain requests from different processors. These will be removed from the ring as soon as they arrive at the destination processing unit. An acknowledgment can only be written onto the ring if the request has been removed in advance. The access time for each request lastspclock cycles. Accesses to the internal memory via the ring network last2·p clock cycles and should therefore be avoided which can be easily accomplished by using the automatic read functions (Section 2.2)

Bus

The bus network connects all processing units with a shared bus. As there are no tristate buffers in the FPGA the common bus is implemented using multiplexers. The challenging part is the implementation of an efficient arbiter. Two approaches have been implemented, a round robin arbiter and a dynamic prioritized arbiter.

Round Robin Arbiter (BRRA).The round robin arbiter performs a check on the request signal of each processor in a cyclic way. If there is a request on the request signalN the address of this request is forwarded onto the bus. After the data has been read from the bus the request signalN+ 1 modpis checked. This approach avoids starvation of other processors trying to read data via the bus.

Dynamic Prioritized Arbiter (BDPA).The dynamic prioritized arbiter checks the request signals by searching the processor with the highest ID that wants to read via the network.

The IDs of the processors are compared with a comparator tree (Figure 5). A processor with an active request signal advances its ID otherwise zero through the comparator tree.

The comparator tree finds the highest processor ID with an active request and forwards this request with its corresponding address signals onto the bus.

(7)

Figure 5: Comparator tree for eight processors

2.4 Generation Synchronization

All processors run independently from each other which leads to unsynchronized exe- cutions of the processors. To ensure data integrity a special synchronization point was introduced. The synchronization point, implemented with a custom instruction function as described in section 2.2 (function 10), ensures that at a certain time all processors have executed the current generation and are ready to start execution of the next generation. A barrier-synchronization technique [Ung97, S. 165] is implemented byANDgating.

3 Agent Behavior Rule (Example)

As an example the moving behavior of agents in a 2D grid is modelled. The agent’s behavior, generally called thecell rule, is implemented as C code (Listing 1) running on the Nios II processors. The cell rule which is applied to all cells differs in three cases. The three cases are: empty cell (EC), agent cell (AC), block cell (BC).

EC: An empty cell checks the four neighbors located north (E1), south (E5), east (E2) and west (E4) of it. If there is exactly one neighbor cell with an agent that wants to move to the empty cell the agent is copied to it. The neighbor cell can not be deleted by the empty cell as there is no write access to neighbor cells. The left agent in figure 6 shows this situation.

AC: An agent cell checks four neighbor cells (A1, A2, A3, A5) located in the moving direction of the agent. For the front cell (A3) an easy check is executed as this cell only needs to be empty. The remaining three cells (A1, A2, A3) are checked for an agent with a moving direction to the front cell (A3). If that is the case none of the agents can move to this cell to avoid a collision. If only the agent from the current

(8)

agent cell (A4) wants to move to the front cell (A3) it is deleted from the agent cell (A4). The right agent in figure 6 shows this situation.

BC: A block cell/obstacle (B1) does not need to check any neighbors, is not altered and copied to the next generation.

Figure 6: Agent world

Figure 6 also shows the distribution of the cells among the processors. The two dimensional grid or agent world is distributed among the processors in a line wise order. The distribution of the cells depends on the size of each processors data memory. The hori- zontal arrow shows an internal read access of processor one. The vertical arrow shows an external read access of processor 2 to the data memory of processor 1.

Each agent performs one of the following two actions:

• move forward and simultaneously turn

• do not move but turn

The implemented direction of rotation for the two cases is shown in figure 7. The dotted arrows show the transition between two given agent alignments.

(9)

Figure 7: Agent turn behavior

4 FPGA Prototype Implementation

The prototyping platform is a Cyclone FPGA with the Quartus II 8.1 synthesis software from Altera. The Cyclone II FPGA contains 68,416 logic elements (LE) and 1,152,000 RAM bits [Alt06]. The implementation language is Verilog HDL. The Nios II processor is built with the SOPC-Builder. We used the Nios II/f processor. This core has the highest execution performance and uses a 6 stage pipeline. The DMIPS rate is 218 at the maximum clock frequency of 185 MHz. The synthesis settings are optimized for speed. Table 1 shows the numbers of logic elements (LE), logic elements for network, the memory usage, the maximum clock frequency and the estimated max. DMIPS [Wei02] rate for all processors [Alt09, Chapter 5]. With the DMIPS rate the speed-up can be calculated. The max. DMIPS rate does not take network data exchanges into account.

p net total LEs register mem register max. max. DMIPS

type LEs for for kbits bits clock DMIPS speed

net net (MHz) up

1 - 2,853 - - 191 1,583 140.00 162 1.00

4 Ring 9,711 300 248 282 5,012 112.50 522 3.21

8 Ring 18,573 588 496 405 9,509 90.00 835 5.14

16 Ring 37,040 1,200 992 651 18,514 75.00 1,392 8.57

4 BRRA 9,474 94 3 282 4,767 110,00 510 3.14

8 BRRA 18,178 232 4 405 9,017 90.00 835 5.14

16 BRRA 36,135 490 5 651 17,527 70.00 1,299 8.00

4 BPDA 9,389 94 3 282 4,767 112.50 522 3.21

8 BPDA 18,200 234 4 405 9,017 93.75 870 5.36

16 BPDA 35,949 501 5 651 17,527 75.00 1,392 8.57

Table 1: Resources and clock frequency for the Cyclone II FPGA

The evaluation of the three prototype platforms shows that the bus network with dynamic arbitration is the best network for this application. The clock frequency is kept high, while the resource usage is low.

(10)

4.1 Agent Behavior Implementation

The implementation of the behavior is as described in section 3. First the data of each cell is loaded. The operation applied onto that cell is dependent on the cell data (agent, block or free). The new state of that cell for the next generation is calculated and then stored.

All processors are synchronized at the end of each generation.

The evaluateField function (Listing 1, line 0-18) calculates and returns the amount of agents that want to move to the cellfieldADR. The second parameter *srcpoints to the neighbor cell where an agent is located, that wants to move to thefieldADRcell. *srcis only valid, if there is only one agent that wants to move to the cellfieldADR. The function is used to check the front cell of a moving agent and to check the neighborhood of an empty cell. The parameters have to be set accordingly. At startup all processors are synchronized (line 23-24). Then all cells which are associated with the current processor are altered for 100 generations (line 26-27). The cell content of the current cell is loaded (line 28). The three cases described in section 3 can be seen in line 29, 48, 68. The neighborhood of every empty cell is checked (line 31) and if there is only one agent (line 33) that wants to move onto this empty cell it is copied (line 35-42). The agents movement (turning while moving forward) is done in line 35-39. For an agent cell (line 48-66) the front cell has first to be calculated out of the moving direction (line 50-55). With the front cells address theevaluatedFieldfunction can be called (line 57). The agent is deleted (line 59) from the current cell if it is the only agent that wants to move to the front cell. Otherwise a collision occurred. The agent does not move and turns clockwise (line 62-64). Processor synchronization is done at the end of every generation (line 71).

(11)

Listing 1: Example c program

0 i n t e v a l u a t e F i e l d (i n t fieldADR , i n t ∗s r c ){ / / r e t u r n amount o f n e i g h b o r a g e n t s m o v i n g t o c e l l f i e l d A D R 1 i n t c o u n t , N , S , E ,W; c o u n t = 0 ; / / c h a n g e p o i n t e r∗s r c t o d e s t i n a t i o n c e l l i f move i s p o s s i b l e 2 i f( ALT CI CUSTOM NETWORK EXT INST ( AUTO GET D , fieldADR , 0 ) ! = CELL FREE ){r e t u r n 9999;}

3 N= fieldADR−MAXY; / / c a l c u l a t e a d d r e s s o f t h e n e i g h b o r c e l l l o c a t e n o r t h o f f i e l d A D R 4 i f( ( ALT CI CUSTOM NETWORK EXT INST ( AUTO GET D , N, 0 ) = = CELL AGENT )

5 && ( ALT CI CUSTOM NETWORK EXT INST ( AUTO GET L , N, 0 ) = =SOUTH ) ){c o u n t + + ;∗s r c =N;}

6 S= f i e l d A D R +MAXY; / / c a l c u l a t e a d d r e s s o f t h e n e i g h b o r c e l l l o c a t e s o u t h o f f i e l d A D R 7 i f( ( ALT CI CUSTOM NETWORK EXT INST ( AUTO GET D , S , 0 ) = = CELL AGENT )

8 && ( ALT CI CUSTOM NETWORK EXT INST ( AUTO GET L , S , 0 ) = =NORTH ) ){c o u n t + + ;∗s r c =S;}

9 E= f i e l d A D R + 1 ; / / c a l c u l a t e a d d r e s s o f t h e n e i g h b o r c e l l l o c a t e e a s t o f f i e l d A D R 10 i f( ( ALT CI CUSTOM NETWORK EXT INST ( AUTO GET D , E , 0 ) = = CELL AGENT )

11 && ( ALT CI CUSTOM NETWORK EXT INST ( AUTO GET L , E , 0 ) = =WEST ) ){c o u n t + + ;∗s r c =E;}

12 W= fieldADR−1; / / c a l c u l a t e a d d r e s s o f t h e n e i g h b o r c e l l l o c a t e w e s t o f f i e l d A D R 13 i f( ( ALT CI CUSTOM NETWORK EXT INST ( AUTO GET D ,W, 0 ) = = CELL AGENT )

14 && ( ALT CI CUSTOM NETWORK EXT INST ( AUTO GET L ,W, 0 ) = = EAST ) ) 15 {i f( c o u n t>=1)r e t u r n 9 9 9 9 ; c o u n t + + ;∗s r c =W;}

16

17 r e t u r n c o u n t ;

18 }

19

20 i n t main ( ){

21 i n t g , i i , newADR , d i r e c t i o n , s o u r c e , c , C e l l C o n t e n t D ; 22

23 ALT CI CUSTOM NETWORK EXT INST (NEXTGEN, 0 , 0 ) ; / / s y n c h r o n i z a t i o n a t s t a r t u p

24 ALT CI CUSTOM NETWORK EXT INST (NEXTGEN, 0 , 0 ) ; 25

26 f o r( g = 0 ; g<GENS ; g ++){ / / r u n f o r GENS g e n e r a t i o n s

27 f o r( i i =SC ; i i<LOCL BLKS+SC ; i i ++){ / / r u n f o r a l l c e l l s a s s i g n e d t o t h i s p r o c e s s o r

28 C e l l C o n t e n t D =ALT CI CUSTOM NETWORK EXT INST ( INTERNAL GET D , i i , 0 ) ;

29 i f( C e l l C o n t e n t D ==CELL FREE ) / / i f c u r r e n t c e l l i s f r e e

30 {

31 c = e v a l u a t e F i e l d ( i i , &s o u r c e ) ; / / t h e n c h e c k n e i g h b o r c e l l s f o r a g e n t s 32 d i r e c t i o n =ALT CI CUSTOM NETWORK EXT INST ( AUTO GET L , s o u r c e , 0 ) ;

33 i f( c ==1) / / i f e x a c t l y o n e a g e n t i n t h e n e i g h b o r h o o d

34 {

35 s w i t c h( d i r e c t i o n ){ / / c a l c u l a t e new d i r e c t i o n

36 c a s eNORTH: d i r e c t i o n =EAST ;break;

37 c a s eSOUTH : d i r e c t i o n =WEST ;break;

38 c a s e EAST : d i r e c t i o n =NORTH;break;

39 c a s eWEST : d i r e c t i o n =SOUTH ;break;}

40

41 ALT CI CUSTOM NETWORK EXT INST ( SET D , i i , CELL AGENT ) ; / / s a v e a g e n t and d i r e c t i o n on c u r r e n t l y f r e e c e l l 42 ALT CI CUSTOM NETWORK EXT INST ( SET L , i i , d i r e c t i o n ) ;

43 }

44 e l s e

45 ALT CI CUSTOM NETWORK EXT INST ( SET D , i i , CELL FREE ) ; / / i f c o l l i s i o n k e e p c e l l e m p t y

46 }

47 e l s e{

48 i f( C e l l C o n t e n t D ==CELL AGENT ) / / i f c u r r e n t c e l l i s an a g e n t

49 {

50 d i r e c t i o n =ALT CI CUSTOM NETWORK EXT INST ( INTERNAL GET L , i i , 0 ) ; / / c a l c u l a t e m o v i n g d i r e c t i o n

51 s w i t c h( d i r e c t i o n ){

52 c a s eNORTH: newADR= i i−MAXY;break;

53 c a s eSOUTH : newADR= i i +MAXY;break;

54 c a s e EAST : newADR= i i + 1 ;break;

55 c a s eWEST : newADR= i i−1;break;}

56

57 c = e v a l u a t e F i e l d ( newADR , &s o u r c e ) ; / / c h e c k f r o n t c e l l ( f r e e and no c o l l i s i o n )

58 i f( c ==1) / / i f o n l y t h i s a g e n t i s m o v i n g t o t h e f r o n t c e l l

59 ALT CI CUSTOM NETWORK EXT INST ( SET D , i i , CELL FREE ) ; / / t h e n d e l e t e a g e n t f r o m c u r r e n t c e l l

60 e l s e

61 {

62 ALT CI CUSTOM NETWORK EXT INST ( SET D , i i , CELL AGENT ) ; / / i f c o l l i s i o n k e e p a g e n t and t u r n 63 i f( d i r e c t i o n ==WEST) d i r e c t i o n =NORTH; e l s e d i r e c t i o n + = 2 ;

64 ALT CI CUSTOM NETWORK EXT INST ( SET L , i i , d i r e c t i o n ) ;

65 }

66 }

67 e l s e

68 ALT CI CUSTOM NETWORK EXT INST ( SET D , i i , CELL BLOCK ) ; / / i f c u r r e n t c e l l i s a b l o c k k e e p i t

69 }

70 }

71 ALT CI CUSTOM NETWORK EXT INST (NEXTGEN, 0 , 0 ) ; / / g e n e r a t i o n s y n c h r o n i z a t i o n

72 }

73 ALT CI CUSTOM NETWORK EXT INST ( DONE EXE , 0 , 0 ) ; / / e x e c u t i o n f i n i s h e d

74 r e t u r n 0 ;

75 }

(12)

To evaluate the performance of the architecture with the three different networks a world of 45·45 = 2025cells with 185 randomly placed agents was executed for 100 generations.

The implemented architecture supplies a maximum of2¹¹ = 2048 cells due to limited resources of the FPGA. Empty cells are initialized as block cells. The maximum performance was reached with the bus network using the dynamic arbitration scheme. Table 2 shows the execution time and the real speed-up that was reached for the three different networks.

p network cycles cycles execution real

type (autoread) time speed-up

1 - 18,874,706 - 134.82 ms 1.0

4 Ring 8,227,386 5,040,248 44.80 ms 3.01

8 Ring 5,541,620 2,677,569 29.75 ms 4.53

16 Ring 4,195,332 1,544,556 20.59 ms 6.55

4 BRRA 5,767,479 5,004,627 45.50 ms 2.96

8 BRRA 3,401,685 2,616,640 29.07 ms 4.64

16 BRRA 2,680,747 1,442,007 20.60 ms 6.54

4 BDPA 5,802,696 4,997,752 44.42 ms 3.03

8 BDPA 4,315,046 2,586,247 27.59 ms 4.89

16 BDPA 3,961,527 1,364,675 18.20 ms 7.41 Table 2: Cycles, execution time and real speed-up for different networks

(13)

4.2 Network Evaluation

To evaluate the three implemented networks the network access pattern has been recorded out of the FPGA. The prototype implementation was enriched with a second clock domain.

The first clock domain was driving the previously presented system while the second clock domain was driving an ethernet core. The clock driving the system was dependent on an enable signal generated within the ethernet clock domain. This process is needed as the ethernet adapter is not capable of sending the huge amount of generated data. An extract of the captured data is presented in figure 8. Starting with the first access to the network at clock cycle 47 and ending at clock cycles 180, 234 and 191. It shows the request signals of four processors and the corresponding acknowledgment signals. Empty data lines have been discarded. The clock cycle position is printed alongside each log.

Figure 8: Network access pattern. Bus with dynamic arbitration, bus with round robin arbitration and registered ring

As can be seen in figure 8 the dynamic arbitration for the bus network provides a constant access time for each network access (assuming one access at a time). There are multiple accesses at the beginning which are distributed along the time axis. This pattern is kept constant pretty well which can be seen on the full captured data log. The round robin arbitration for the bus network has varying request times. This can be seen by comparing the lines at clock cycle position 90 and 173. At clock cycle position 90 the access is executed as fast as an internal memory access using only two clock cycles. But the access at clock cycle position 173 takes a total of five clock cycles to execute the memory access

(14)

adding three wait cycles for the arbitration. This shows clearly the round robin execution behavior. As can be seen from this short data log extract the dynamic arbitration is faster for single accesses to the bus network. Comparing the first lines also shows the different priorities of the arbitration, although the round robin arbitration could show a different pattern depending on the time of the network access and the status of the arbiter. Taking a look at the registered ring network shows that it is faster than the round robin but slower than the dynamic arbitration on the bus network. This does not apply to the overall performance. In contrast to the bus network the ring network executes multiple requests at a time. The acknowledgment data at clock cycle position 52 shows this behavior. The first accesses need six clock cycles which is the best performance that can be reached. Four clock cycles are needed for the request and the acknowledgment to move around the ring and two clock cycles are needed for the actual request. The following requests take more clock cycles indicating additional wait cycles due to an occupied ring.

5 Conclusion

The implemented multi-core architecture is very flexible because different agent behaviors can be programmed and run on the same architecture. The network has a large impact on the performance and the bus system with dynamic arbitration showed the best performance. This performance can further be improved by new custom instructions separating the network read request from the network reply. This gives the processor the ability to continue with the program execution instead of being stalled. Although the present architecture is very flexible only one link and one data field of the GCA model are supported.

Future implementations will support multiple link and data fields providing the ability to store multiple data for each cell. This can be used to run intelligent agents more efficiently by storing additional values in separate data fields.

References

[Alt06] Altera, Datasheet Cyclone II.

http://www.altera.com/literature/hb/cyc2/cyc2_cii5v1.pdf, 2006.

[Alt09] Altera, NIOS II Website.

http://www.altera.com/products/ip/processors/nios2/

ni2-index.html, 2009.

[DJT01] J. Dijkstra, J. Jessurun, and Harry J. P. Timmermans. A Multi-Agent Cellular Au- tomata Model of Pedestrian Movement. In M. Schreckenberg and S. D. Sharma, editors,Pedestrian and Evacuation Dynamics, pages 173–181. Springer, 2001.

[FS05] Dietmar Fey and Daniel Schmidt. Marching-pixels: a new organic computing paradigm for smart sensor processor arrays. In Nader Bagherzadeh, Mateo Valero, and Alex Ram´ırez, editors,Conf. Computing Frontiers, pages 1–9. ACM, 2005.

(15)

[HVW00] Rolf Hoffmann, Klaus-Peter V¨olkmann, and Stefan Waldschmidt. Global cellular automata GCA: an universal extension of the CA model, 2000.

[HVWH01] Rolf Hoffmann, Klaus-Peter V¨olkmann, Stefan Waldschmidt, and Wolfgang Heenes.

GCA: Global Cellular Automata, A Flexible Parallel Model, 2001.

[KEFH09] Marcus Komann, Patrick Ediger, Dietmar Fey, and Rolf Hoffmann. On the Effective- ness of Evolution Compared to Time-Consuming Full Search of Optimal 6-State Au- tomata. In Leonardo Vanneschi, Steven Gustafson, Alberto Moraglio, Ivanoe De Falco, and Marc Ebner, editors,Genetic Programming, pages 280–291. Springer, 2009.

[Kl¨u01] Franziska Kl¨ugl. Multiagentensimulation: Konzepte, Werkzeuge, Anwendungen.

Addison-Wesley Verlag, 2001.

[KSH07] Ari Kulmala, Erno Salminen, and Timo D. Hämäläinen. Evaluating Large System- on-Chip on Multi-FPGA Platform. In S. Vassiliadis et al., editor,International Work- shop on Systems, Architectures, Modeling and Simulation (SAMOS), pages 179–189.

Springer, 2007.

[RSM01] Kyeong Keol Ryu, Eung Shin, and Vincent J. Mooney. A Comparison of Five Different Multiprocessor SoC Bus Architectures. InDSD ’01: Proceedings of the Euromicro Symposium on Digital Systems Design, pages 202–209, Washington, DC, USA, 2001.

IEEE Computer Society.

[SHH09a] Christian Sch¨ack, Wolfgang Heenes, and Rolf Hoffmann. A Multiprocessor Archi- tecture with an Omega Network for the Massively Parallel Model GCA accepted on SAMOS-Workshop, 2009.

[SHH09b] Christian Sch¨ack, Wolfgang Heenes, and Rolf Hoffmann. Network Optimization of a Multiprocessor Architecture for the Massively Parallel Model GCA. In22. PARS Workshop, Gesellschaft f¨ur Informatik (GI), 2009. ISSN 0177-0454.

[Ung97] Theo Ungerer. Parallelrechner und parallele Programmierung. Spektrum Akademis- cher Verlag, Heidelberg, Berlin, 1997.

[Wei02] Alan R. Weiss. Dhrystone Benchmark - History, Analysis, ”Scores” and Recommendations. http://www.ebenchmarks.com/download/

ECLDhrystoneWhitePaper.pdf, 2002.