
5.4 Methods for Improving Scalability

5.4.1 Optimizing the Implemented Timed-automata Templates

implementation capturing the MoP of our considered MPSoC (which were presented and evaluated in [Fakih et al., 2013a]). However, a closer look at these templates reveals that the following optimizations/abstractions can be made. These optimizations were applied, where possible, in the evaluation chapter (see Chap. 7) and lead to major improvements in terms of state space.

Abstracting Shared/Private FIFO Buffers We have described the TA template of the shared FIFO buffer in Sect. 5.2.6. However, according to our system-model definitions, no parallel accesses can be issued on the shared FIFO buffers, since all accesses are sequentialized in the interconnect, where the one with the highest priority wins the earliest access. Taking this into consideration, we can abstract the shared FIFO buffers (i.e., not model them explicitly) and model them as local queues (with their specific methods) within the interconnect, adding their timing delays to those of the interconnect without distorting the timing semantics.
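To make the abstraction concrete, the following minimal C-style sketch (the data layout, names, and delay model are illustrative assumptions, not the actual UPPAAL declarations of our templates) keeps a shared FIFO as a plain bounded queue owned by the interconnect model, while the buffer's access latency is simply folded into the interconnect's own BCET/WCET window:

```c
#include <stdbool.h>

#define FIFO_CAP 16          /* illustrative token capacity */

/* Shared FIFO kept as a local queue inside the interconnect model:
 * no own TA, no own clock, no synchronization channels.            */
typedef struct {
    int tokens[FIFO_CAP];
    int head, count;
} LocalFifo;

/* Interconnect delay window into which the FIFO access cost is folded. */
typedef struct {
    int bcet;                /* best-case interconnect delay  */
    int wcet;                /* worst-case interconnect delay */
} BusDelay;

static bool fifo_put(LocalFifo *f, int token) {
    if (f->count == FIFO_CAP) return false;           /* write blocks */
    f->tokens[(f->head + f->count++) % FIFO_CAP] = token;
    return true;
}

static bool fifo_get(LocalFifo *f, int *token) {
    if (f->count == 0) return false;                   /* read blocks */
    *token = f->tokens[f->head];
    f->head = (f->head + 1) % FIFO_CAP;
    f->count--;
    return true;
}

/* The FIFO's access latency is not modeled separately but added to the
 * interconnect's BCET/WCET bounds (illustrative additive cost model).  */
static BusDelay with_fifo_cost(BusDelay bus, int fifo_bcet, int fifo_wcet) {
    bus.bcet += fifo_bcet;
    bus.wcet += fifo_wcet;
    return bus;
}
```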

The private FIFO buffers can be abstracted in a similar way. The idea here is to include the time needed to access these FIFO buffers (since they are mapped to local private memories) in the BCET/WCET access calculation of the communication driver and to model them as local queues (with their specific methods) within the communication driver TA template of the corresponding tile.
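A corresponding hedged sketch for the private FIFO buffers, assuming a simple additive per-token cost model (names and cost model are illustrative, not taken from the actual driver templates):

```c
/* BCET/WCET window of the communication driver once the private FIFO
 * (located in local private memory) is abstracted away: its per-token
 * access cost is simply charged to the driver's bounds.               */
typedef struct { int bcet, wcet; } Bounds;

static Bounds driver_bounds_with_fifo(Bounds driver, Bounds fifo_access,
                                       int tokens_per_call)
{
    driver.bcet += tokens_per_call * fifo_access.bcet;
    driver.wcet += tokens_per_call * fifo_access.wcet;
    return driver;
}
```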

³Equation 5.2 assumes that UPPAAL does not perform further optimization on the initialized multi-dimensional channel arrays.

After applying the above abstractions, state-space savings are obtained in terms of synchronization channels (since these are now implemented as local methods either in the interconnect or in the communication driver templates), and Eq. 5.2 becomes:

\[
\begin{aligned}
Ch_{total} ={} & Ch_{runActor} + Ch_{finishActor\_Ok} + Ch_{finishActor\_Block} + Ch_{finishComm\_Ok} \\
& + Ch_{finishComm\_Block} + Ch_{read} + Ch_{write} + Ch_{readInterconnect} + Ch_{writeInterconnect} \\
& + Ch_{finishInterconnect\_Ok} + Ch_{finishInterconnect\_Block} + Ch_{event} \\
={} & (P_{target} + P_{initiator})\bigl((A \times T) + (T \times I)\bigr) + 2(T \times I) + 5A + E
\end{aligned}
\qquad (5.3)
\]

which, for the example described above (3 actors, 4 ports, and 2 tiles), leads to a large reduction of the total number of channels to 55 (85 in Eq. 5.2), and to 87 channels (141 in Eq. 5.2) when the number of ports is increased to 8. In addition, the number of timed automata in Eq. 5.1 becomes independent of the number of the SDFGs' FIFO buffers:

\[
TA_{total} = (2 \times T) + A + I + E + 1 \qquad (5.4)
\]

and we are able to spare one TA and one clock for every FIFO buffer, which significantly improves the scalability of our method.
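For a quick sanity check of these savings, the formulas can be evaluated directly. In the following sketch the parameter names mirror the symbols of Eq. 5.3 and Eq. 5.4; the split of the running example's ports into target/initiator ports and its number of event channels are placeholder assumptions, so the printed totals only illustrate how the formulas are applied:

```c
#include <stdio.h>

/* Eq. 5.3: total number of synchronization channels after abstracting
 * the FIFO buffers:
 *   (P_target + P_initiator) * ((A*T) + (T*I)) + 2*(T*I) + 5*A + E     */
static int channels_total(int p_target, int p_initiator,
                          int A, int T, int I, int E)
{
    return (p_target + p_initiator) * ((A * T) + (T * I))
           + 2 * (T * I) + 5 * A + E;
}

/* Eq. 5.4: total number of timed automata after abstracting the FIFOs:
 *   (2*T) + A + I + E + 1                                               */
static int automata_total(int A, int T, int I, int E)
{
    return (2 * T) + A + I + E + 1;
}

int main(void)
{
    /* Placeholder model sizes; substitute the sizes of the analyzed SUA
     * (actors A, tiles T, interconnects I, events E, port counts).      */
    int A = 3, T = 2, I = 1, E = 0, p_target = 2, p_initiator = 2;
    printf("channels (Eq. 5.3): %d\n",
           channels_total(p_target, p_initiator, A, T, I, E));
    printf("automata (Eq. 5.4): %d\n", automata_total(A, T, I, E));
    return 0;
}
```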

Merging Scheduler TA Template and Actors' TA Templates For the case in which we have a static-order SDFG scheduler, additional optimizations can be made. Here, the order of the actors defines the execution priority and no extra scheduling mechanism is needed (for implementation details please refer to [Schaumont, 2013]). Taking this into consideration, we can spare the scheduler timed automaton with its primitives for every tile. Additionally, the following optimization can be made. Let so be an ordered list of actors (see Def. 4.2.10), possibly including actors belonging to several SDFGs, which are executed in a fixed static order on a specific tile. Instead of paying the cost of one timed automaton for modeling every actor, we utilize only one automaton, called SOonTile (depicted in Fig. 5.11), to capture the behavior of all actors (belonging to the sorted list so) mapped to a tile, and this is done for every tile in the SUA. Obviously, this optimization leads to significant savings in terms of the number of instantiated TAs, clocks and synchronization channels, and Eq. 5.4 becomes:

\[
TA_{total} = T + I + E + 1 \qquad (5.5)
\]

which, for the example described above (3 actors, 4 ports, and 2 tiles), leads to a large reduction of the total number of TAs to 4 (9 in Eq. 5.4), allowing a greater number of actors to be analyzed without drastically increasing the state space.

Figure 5.11: Optimization of TA templates in case of SO SDFG scheduler

Notice that the number of channels (Eq. 5.3) cannot be reduced, since the scheduler channels Ch_runActor and Ch_finishActor are still needed by the observer templates (see Sect. 5.3). Fig. 5.11 shows the optimized TA template of SOonTile. The TA starts by choosing the first actor in the ordered list (in state GetActor) and then checks whether that actor is sensitive to an event trigger (in state CheckActor, depending on the flag startingActor), where it either proceeds or waits for an event (in state WaitEvent). Depending on the type of the actor (whether it is a Producer or not, decided in state ActorType), the TA either begins with consuming tokens on its ports (in the case of a Consumer or Transporter actor, see the upper part of Fig. 5.11) or delays ∆A, modeling the computation time of the actor. In the latter case, after producing tokens on all its ports, the actor finishes (similar to the Transporter TA implementation in Fig. 5.5). In the former case, after finishing consuming on all ports (in state FinishCons), the actor either finishes (if it is a Consumer actor) or continues (if it is a Transporter actor), delaying ∆A and afterwards producing tokens on its ports, to finally reach the finish state (state FinishProd).
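The control flow captured by SOonTile can also be summarized as sequential pseudocode. The following C sketch is only an illustration of that flow under stated assumptions: the hypothetical helper functions stand in for the template's synchronization channels and delay invariants, and the actor list in main is a placeholder.

```c
#include <stdio.h>

typedef enum { PRODUCER, CONSUMER, TRANSPORTER } ActorKind;

typedef struct {
    const char *name;
    ActorKind   kind;
    int         starting;   /* sensitive to an event trigger?     */
    int         bcet, wcet; /* computation-delay window (Delta_A) */
} Actor;

/* Illustrative stand-ins for the template's event/read/write channels
 * and for the non-deterministic delay between BCET and WCET.          */
static void wait_event(const Actor *a)        { printf("%s: wait event\n", a->name); }
static void consume_all_ports(const Actor *a) { printf("%s: read all ports\n", a->name); }
static void produce_all_ports(const Actor *a) { printf("%s: write all ports\n", a->name); }
static void delay_between(const Actor *a)     { printf("%s: delay in [%d,%d]\n", a->name, a->bcet, a->wcet); }

/* One pass over the static-order list `so` of all actors mapped to a tile,
 * mirroring the control flow of the SOonTile template.                     */
static void so_on_tile(const Actor so[], int n)
{
    for (int i = 0; i < n; ++i) {            /* GetActor / CheckActor */
        const Actor *a = &so[i];
        if (a->starting)                     /* -> WaitEvent          */
            wait_event(a);
        if (a->kind == PRODUCER) {           /* compute, then write   */
            delay_between(a);
            produce_all_ports(a);
        } else {                             /* consume first         */
            consume_all_ports(a);
            if (a->kind == TRANSPORTER) {    /* compute, then write   */
                delay_between(a);
                produce_all_ports(a);
            }                                /* a pure CONSUMER finishes */
        }
    }
}

int main(void)
{
    /* Hypothetical ordered list for one tile (names and times are placeholders). */
    Actor so[] = {
        { "A", PRODUCER,    1, 2, 2 },
        { "B", TRANSPORTER, 0, 5, 5 },
        { "C", CONSUMER,    0, 1, 1 },
    };
    so_on_tile(so, 3);
    return 0;
}
```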

5.4.2 Applying Clustering Method

In Sect. 2.2.1.4, we described a clustering method for SDFGs known from the literature [Bhattacharyya et al., 1997]. This method can obviously improve the scalability of our RT analysis method when analyzing SDFGs with a large number of actors. With the help of clustering, the number of actors in an SDFG can be reduced, leading to a reduction in the number of TAs which must be explored by the model checker. In the following, we describe how this method can be applied taking into consideration our system model properties.

Figure 5.12: Example of clustering timing violation by RR SDFG scheduler

The clustering method, as already stated in Sect. 2.2.1.4, assumes connected and consistent SDFGs. Furthermore, the part of the SDFG to be clustered must be acyclic. In order to apply this clustering method to the SDFG(s) in our system model, the following additional conditions must be satisfied:

D1 A static-order SDFG scheduler is assumed.

D2 The mapping of the SDFG(s)' actors to the MPSoC must be known, since only actors mapped to the same tile and which do not engage in inter-processor communication can be clustered.

It is obvious that the clustering method can only be applied when a static-order SDFG scheduler is used (D1). If, for example, the round-robin SDFG scheduler is used, the timing semantics could be violated when applying the clustering method in its general form, as shown in the example in Fig. 5.12. Here, we assume that both homogeneous SDFGs (SDFG1 and SDFG2) are mapped to the same tile (except for the gray-shaded actors, which are mapped to other tiles) and are scheduled according to RR scheduling (see Sect. 4.2.4.2), with SDFG1 being executed before SDFG2. In addition, we assume that the BCET and the WCET are identical for every actor. This execution time is annotated at every single actor, as seen in Fig. 5.12. The resulting SDFGs after applying the clustering method are shown in Fig. 5.12 (to the right). Notice that

in the unclustered SDFG1 (to the left of Fig. 5.12), scheduled by the RR SDFG scheduler, actor C comes to execution after 21 time units (under the assumption that no blocking on the FIFO buffers occurs), while in the clustered version it comes to execution only after 23 time units. This obviously proves that the timing semantics can be violated when applying the clustering method, in its general form, to SDFGs scheduled according to round-robin.

Mapping information (see D2) is needed, since only actors mapped to the same tile and which do not engage in inter-processor communication can be clustered. The reason is that actors of an SDFG which are engaged in inter-processor communication show different timing semantics when clustered (because of possible changes in the rates of the ports of the resulting hierarchical actor). This, in turn, could lead to a distorted access pattern on the shared interconnect and thus to false real-time results.
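Condition D2 can be expressed as a simple predicate over the mapping and the SDFG topology. The following sketch (data layout and names are assumptions, not the thesis' data structures) accepts a candidate cluster only if all of its actors are mapped to one tile and none of them communicates with an actor on another tile:

```c
#include <stdbool.h>

typedef struct { int src_actor; int dst_actor; } Edge;   /* SDFG channel */

/* Is an actor a member of the candidate cluster? */
static bool in_cluster(const int cluster[], int n, int actor) {
    for (int i = 0; i < n; ++i)
        if (cluster[i] == actor) return true;
    return false;
}

/* D2: all actors of the candidate cluster are mapped to the same tile and
 * none of them takes part in inter-processor (inter-tile) communication.
 * tile_of[a] gives the tile actor a is mapped to.                          */
static bool can_cluster(const int cluster[], int n,
                        const int tile_of[],
                        const Edge edges[], int n_edges)
{
    for (int i = 1; i < n; ++i)
        if (tile_of[cluster[i]] != tile_of[cluster[0]])
            return false;                      /* spans more than one tile */

    for (int e = 0; e < n_edges; ++e) {
        bool src_in = in_cluster(cluster, n, edges[e].src_actor);
        bool dst_in = in_cluster(cluster, n, edges[e].dst_actor);
        if ((src_in || dst_in) &&
            tile_of[edges[e].src_actor] != tile_of[edges[e].dst_actor])
            return false;         /* a cluster member communicates off-tile */
    }
    return true;
}
```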

If the above conditions hold, clustering can be applied. After clustering, one issue remains: how to calculate the WCET/BCET of the resulting hierarchical actor Ω. If n denotes the number of actors in Z and γ(a) is the repetition-vector value of actor a (for details on the notation refer to Sect. 2.2.1.4), then the new wcet of the hierarchical actor Ω can be calculated as follows:

\[
\gamma(\Omega) \times wcet(\Omega) = \sum_{i=1}^{n} \bigl(\gamma(a_i) \times wcet(a_i)\bigr)
\]

\[
wcet(\Omega) = \frac{\sum_{i=1}^{n} \bigl(\gamma(a_i) \times wcet(a_i)\bigr)}{\gamma(\Omega)}
\]

Similarly, the bcet of the hierarchical actor Ω can be calculated as follows:

\[
bcet(\Omega) = \frac{\sum_{i=1}^{n} \bigl(\gamma(a_i) \times bcet(a_i)\bigr)}{\gamma(\Omega)}
\]
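Both formulas translate directly into code; the following is a minimal sketch, assuming the repetition-vector entries γ(a_i) and the per-actor WCET/BCET values are already known (function and parameter names are illustrative):

```c
/* wcet(Omega) = ( sum_{i=1..n} gamma(a_i) * wcet(a_i) ) / gamma(Omega) */
static double hierarchical_wcet(const int gamma[], const double wcet[],
                                int n, int gamma_omega)
{
    double sum = 0.0;
    for (int i = 0; i < n; ++i)
        sum += gamma[i] * wcet[i];
    return sum / gamma_omega;
}

/* bcet(Omega) is computed analogously from the actors' BCETs. */
static double hierarchical_bcet(const int gamma[], const double bcet[],
                                int n, int gamma_omega)
{
    double sum = 0.0;
    for (int i = 0; i < n; ++i)
        sum += gamma[i] * bcet[i];
    return sum / gamma_omega;
}
```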

It is important to note that, apart from the estimated WCET/BCET of single actors when executed on a target processor, the clustering technique is fully independent of the target architecture of the MPSoC.

5.4.3 Temporal and Spatial Segregation for a Composable and