EXPLOITING APPLICATION-INHERENT REDUNDANCY

5.3.1 Functional Redundancy Through Anytime Algorithms

Functional redundancy within the instances of the main parts is the first kind of application-inherent redundancy we consider. Since it is known at design time that main parts may be aborted at runtime, it is wise to design them so that partly executed instances are not completely useless. The exception part provides a means of making intermediate results available, provided that the main part is designed so that preliminary results exist before the task completes. Anytime algorithms are the means to achieve this goal.

The basic idea of anytime algorithms is to compute some first results as soon as possible and then iteratively improve these results until the best possible result of the algorithm is achieved (Dean and Boddy 1988, Boddy and Dean 1989). Realizing an algorithm in this way has the advantage that it may be stopped at any time and always has some preliminary result to deliver. The quality of preliminary results can be measured as follows (cf. Russell and Zilberstein 1991):

• Certainty. This means that the result represents a kind of “first guess” of the algorithm. The result has a certain probability of being correct, though it may not yet have the highest probability the algorithm can achieve. Other possible solutions still have to be evaluated, and it may turn out that they have a higher likelihood of being correct. For example, the arc filter in the distributed sensor fusion successively evaluates a sequence of estimated arc positions. If it is aborted prematurely, the estimate with the highest evaluation may not yet have been found.

• Accuracy. This means that the algorithm is still reducing the error of the result, which refers to the distance (in a general sense) between the result and the real-world entity it represents. For example, the algorithm may still be able to reduce the spatial distance between the estimated position of some object and its real position.

• Completeness. This means that the algorithm delivers results that only represent a part of the real-world entity they relate to. For example, if the expected result is a perception of the environment represented by a set of edges, an incomplete result may contain only some of the edges.

• Specificity. This means that the results, while representing the whole real-world entity they relate to, are not yet as detailed as possible.

Despite being preliminary in the above senses, the results delivered by an aborted anytime algorithm may still prove useful or even be sufficient. If the latter is the case, all computations the algorithm performs after achieving the sufficient result can be considered functional redundancy. Anytime algorithms allow exploiting this redundancy. All-or-nothing algorithms, by contrast, provide either no result at all or the best possible result (including all the redundancy); a sufficient preliminary result is not available. When the value of an algorithm is plotted against its execution time, an all-or-nothing algorithm exhibits a single step at the moment it completes, whereas an anytime algorithm exhibits a continuously increasing curve or a number of small steps. With some consideration at design time, many algorithms can be designed as anytime algorithms. For example, all filters in the distributed sensor fusion adopt this paradigm. Furthermore, Bade (2003) and Herms (2004) show how it can be applied in stereovision and complex planning applications.
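The anytime behavior described above can be illustrated with a minimal sketch. It is not one of the filters of the distributed sensor fusion; it uses plain bisection as a stand-in, and the names `anytime_bisect_root` and `run_with_budget` as well as the step-based budget are assumptions made for illustration. The quality measure improved here is accuracy: every step halves the bound on the error of the estimate.

```python
from typing import Callable, Iterator, Optional, Tuple


def anytime_bisect_root(
    f: Callable[[float], float], lo: float, hi: float
) -> Iterator[Tuple[float, float]]:
    """Yield (estimate, error_bound) pairs for a root of f in [lo, hi].

    Assumes f(lo) and f(hi) have opposite signs. A first estimate is
    available after one step; each further step halves the error bound.
    """
    while True:
        mid = (lo + hi) / 2.0
        yield mid, (hi - lo) / 2.0
        if f(lo) * f(mid) <= 0:
            hi = mid  # the root lies in the lower half
        else:
            lo = mid  # the root lies in the upper half


def run_with_budget(
    algorithm: Iterator[Tuple[float, float]], steps: int
) -> Optional[Tuple[float, float]]:
    """Run the algorithm for at most `steps` steps and return its latest result.

    The step budget is a hypothetical stand-in for the execution time
    the algorithm is granted before it is stopped.
    """
    result = None
    for _, result in zip(range(steps), algorithm):
        pass
    return result


def f(x: float) -> float:
    return x * x - 2.0  # root at sqrt(2)


early = run_with_budget(anytime_bisect_root(f, 0.0, 2.0), 3)
late = run_with_budget(anytime_bisect_root(f, 0.0, 2.0), 30)
```

Stopping after three steps yields a coarse but usable estimate, while thirty steps yield a much tighter one. Whether the early estimate is sufficient depends on the application, which is exactly the functional redundancy discussed above.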

When scheduling anytime algorithms with TAFT, the main part consists of the anytime algorithm, whereas the exception part delivers the results computed so far when the main part is aborted. In this approach, the anytime algorithm is automatically stopped and its results are automatically delivered when it cannot be completed by its deadline. Thus, the scheduler decides how long the anytime algorithm is run and ensures that it delivers its results in time.
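The division of labor between main part and exception part might be sketched as follows. This is a simplified simulation, not the TAFT implementation: the scheduler's decision is reduced to a hypothetical step budget, abortion is modeled by simply not resuming the main part, and all names (`main_part`, `exception_part`, `run_task_pair`, `Outcome`) are illustrative assumptions. The preliminary results here are incomplete rather than inaccurate: the main part extracts edges one by one.

```python
from dataclasses import dataclass
from typing import Iterator, List, Sequence, Tuple


@dataclass
class Outcome:
    result: List[str]
    aborted: bool


def main_part(segments: Sequence[str]) -> Iterator[Tuple[List[str], bool]]:
    """Anytime main part: yield (edges_so_far, done) after each step."""
    edges: List[str] = []
    for i, segment in enumerate(segments):
        edges.append(segment)  # placeholder for real edge extraction
        yield list(edges), i == len(segments) - 1


def exception_part(partial: List[str]) -> Outcome:
    """Deliver whatever the aborted main part has computed so far."""
    return Outcome(result=partial, aborted=True)


def run_task_pair(segments: Sequence[str], budget_steps: int) -> Outcome:
    """Run the main part for at most budget_steps steps (simulated deadline)."""
    latest: List[str] = []
    main = main_part(segments)
    for _ in range(budget_steps):
        # An exhausted main part (e.g. empty input) counts as completed.
        latest, done = next(main, (latest, True))
        if done:
            return Outcome(result=latest, aborted=False)
    # Budget exhausted before completion: the exception part takes over.
    return exception_part(latest)


partial = run_task_pair(["e1", "e2", "e3"], budget_steps=2)
complete = run_task_pair(["e1", "e2", "e3"], budget_steps=5)
```

In the sketch, `partial` carries two of the three edges and is flagged as aborted, while `complete` carries all three. The caller never has to decide when to stop; that decision stays with the (simulated) scheduler.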

Using this approach, it is easy for application designers to employ anytime algorithms. They only need to design the algorithm and provide the deadline and expected-case execution time (ECET) to the scheduler, which makes the decisions at runtime. The ECET allows application designers to specify a minimum time for the algorithm to be executed. This ensures that it is not terminated arbitrarily early, in which case it might deliver no results or only very low-quality results. Instead, a high probability of delivering sufficient results can be achieved. In fact, if a WCET required to compute at least a sufficient result were known (as is assumed in Lin et al. 1987 and Liu et al. 1994), the ECET of the main part could be set to this WCET and the main part would always yield sufficient results.

Realized as anytime algorithms, main parts that have to be aborted may still deliver sufficient results. If so, the task pair exhibits no failure, since it provides a service according to its specification. The resource fault of the main part, therefore, has been tolerated. Thus, the approach exploits the functional redundancy within the main parts to tolerate resource faults. Considering the task pair as a component of the application, one can say that the approach increases the reliability of this component, for it increases the ratio between the number of task pair instances providing sufficient results and the number of all released instances. For example, measurements conducted in the prototype of the distributed sensor fusion show how using anytime algorithms allows increasing the reliability of the arc filter (cf. Section 6.2).

The increased reliability notwithstanding, there may be task pair instances that do not provide sufficient results to meet their specification. This is unavoidable unless WCETs for a logical mandatory part are assumed. The following subsection shows how such component failures can be tolerated.

5.3.2 Spatial and Timing Redundancy

When an instance of a main part is aborted without providing sufficient results, this represents a failure of the corresponding component of the application. From the perspective of the overall application, this failure constitutes a fault. In this subsection, we consider how such faults can be tolerated.

Figure 5-8. Structural redundancy in the distributed sensor fusion

Spatial redundancy is the first kind of redundancy being exploited to tolerate such faults. In a group of embedded systems observing some real-world entity in their environment, if a component on one of the systems fails to deliver an observation, the others may still provide the missing information in time. The distributed sensor fusion, for example, combines results from several laser scanners that observe the environment from different perspectives. The fusion automatically tolerates component failures that result in data from one of the sensors being missing as long as two or more sensors are observing the same part of the environment. Moreover, incomplete results from a faulty module may still represent a valuable contribution to the results of the fusion. Figure 5-8 shows an example: Figure 5-8 (a) and (b) depict two incomplete observations from two laser scanners. The scan depicted in Figure 5-8 (a) contains edges from only one of the rectangular objects in the scene, while the scan depicted in Figure 5-8 (b) only contains edges of two other rectangular objects. The fusion of both scans (Figure 5-8 (c)), however, comprises edges of all three rectangular objects so that they can all be detected in the fused scan.

Timing redundancy is the second kind of redundancy being exploited. In many control applications, control loops are executed well above their stability criteria. This means that the corresponding tasks are performed with a period significantly smaller than necessary to achieve stable control of the system. Such applications exhibit redundancy in the number of task instances that must provide sufficient results. This is because two considerations guide the selection of the period: It must be sufficiently small so that the controller (i) can react to changes in the environment before the controlled system is damaged (a safety constraint) and (ii) exhibits a smooth reaction to the changes in the environment (a quality goal). While being less critical regarding system safety, (ii) implies the more stringent timing requirements. For example, there is a minimum frequency at which motion planning, and hence the sensor fusion, must be executed to avoid collisions. It depends on the sensor range, the speed of the robot, and the speed of the surrounding objects. Usually, motion planning and sensor fusion are performed at a much higher frequency to achieve smooth driving.

Thus, when scheduling controllers, frequently only m out of n scheduled task instances are really hard, while the remaining n - m instances can be considered timing redundancy, which can be exploited in overload situations. Systems with such constraints, called $\binom{m}{n}$ constraints, are also known as weakly-hard real-time systems (Hamdaoui and Ramanathan 1995, Koren and Shasha 1995, Bernat and Burns 2001, Bernat and Cayssials 2001, Wang et al. 2002).
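A check of such a constraint over a recorded history of task instances might look as follows. The sketch assumes one particular reading, namely that every window of n consecutive instances must contain at least m instances with sufficient results; the cited works differ in their exact definitions, and the function name `satisfies_m_out_of_n` is hypothetical.

```python
from typing import Sequence


def satisfies_m_out_of_n(outcomes: Sequence[bool], m: int, n: int) -> bool:
    """Return True if every window of n consecutive instances contains
    at least m instances that delivered sufficient results.

    outcomes[i] is True if instance i delivered sufficient results.
    Histories shorter than n contain no full window and pass trivially.
    """
    if n < 1 or not 0 <= m <= n:
        raise ValueError("require n >= 1 and 0 <= m <= n")
    if len(outcomes) < n:
        return True
    met = sum(outcomes[:n])  # sufficient results in the first window
    if met < m:
        return False
    for i in range(n, len(outcomes)):
        # Slide the window by one instance: add the new, drop the oldest.
        met += outcomes[i] - outcomes[i - n]
        if met < m:
            return False
    return True


tolerable = satisfies_m_out_of_n([True, True, False, True, True, False], m=2, n=3)
violated = satisfies_m_out_of_n([True, False, False, True], m=2, n=3)
```

In the first history, every window of three instances contains two sufficient results, so the single failures are absorbed by the timing redundancy. In the second, two consecutive failures leave a window with only one sufficient result.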

By exploiting the kinds of application-inherent redundancy discussed so far, transient overloads can be tolerated without explicit fault treatment on the part of the application. In the following subsection, we consider what can be done if an overload situation is more persistent.

5.3.3 Signaling Persistent Overload

If the execution times of a main part persistently exceed its ECET such that it experiences faults over some extended period of time, relying on the kinds of redundancy described above no longer suffices and we are faced with a persistent overload situation. There are two ways to address the problem:

• One can adapt the specified resource demand of the main part (that is, its ECET) to its actual resource demand so that the scheduler allocates more resources for executing this task.

• One can adapt the resource demands of the application in order to reduce the system load.

Gergeleit (2001) shows how the first approach can be realized. He combines the TAFT scheduler with an online-monitoring component, which provides execution time statistics to the scheduler. Using these statistics, the scheduler can adapt the resource allocation to changing execution times. This approach allows adapting the guarantee the scheduler provides to the demands of the tasks as long as the task set with the increased resource demands passes the acceptance test.

If the latter is no longer the case, the second approach can be used. This approach has two prerequisites: First, the application must be able to adapt its resource demand by gracefully degrading its service; second, the middleware must signal the overload situation to the application. The exception handling of TAFT provides the means to detect and signal persistent overload situations. To exploit it, the ratio between the number of aborted instances of the main part and the number of all instances of the main part is computed in the exception part of a task pair. More precisely, in line with the kind of constraint presented in the preceding subsection, this value is computed over an interval of n released instances, where n is specified by the application. The number k of aborted instances within this interval is counted. If k exceeds an application-specific threshold, a persistent overload is detected. In this case, the exception part signals the overload and thus triggers the adaptation of the application. Thus, the exception handling that TAFT guarantees for each task pair instance can be used to trigger a further level of exception handling if a $\binom{m}{n}$ constraint is violated or about to be violated.
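The counting scheme described above can be sketched as a small detector that the exception part would update once per released instance. The class name `OverloadDetector`, the callback `on_overload`, and the use of a sliding window over the last n instances (rather than, say, disjoint intervals) are assumptions made for illustration; the source only fixes that k aborted instances are counted over n released instances and compared with an application-specific threshold.

```python
from collections import deque
from typing import Callable, Deque, Optional


class OverloadDetector:
    """Count aborted instances among the last n released instances."""

    def __init__(
        self,
        n: int,
        threshold: int,
        on_overload: Optional[Callable[[int], None]] = None,
    ) -> None:
        if n < 1:
            raise ValueError("n must be at least 1")
        self.threshold = threshold
        self.on_overload = on_overload  # hypothetical signal to the application
        self._window: Deque[bool] = deque(maxlen=n)  # drops the oldest entry itself

    def record(self, aborted: bool) -> bool:
        """Record one released instance; return True if overload is detected."""
        self._window.append(aborted)
        k = sum(self._window)  # aborted instances in the current window
        overloaded = k > self.threshold
        if overloaded and self.on_overload is not None:
            self.on_overload(k)
        return overloaded


signals = []
detector = OverloadDetector(n=5, threshold=2, on_overload=signals.append)
history = [True, False, True, True, False, False]
detected = [detector.record(aborted) for aborted in history]
```

With n = 5 and a threshold of 2, the third abort within the window triggers the signal; once the oldest abort leaves the window, the detector falls silent again. An application that wants to react before a $\binom{m}{n}$ constraint is actually violated could choose a threshold below n - m.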

The distributed sensor fusion scenario allows for adaptation under overload. As explained in Section 3.2, performing the fusion on a higher level of abstraction allows adapting the resource demands of the distributed sensor fusion at the expense of its accuracy. In a persistent overload situation, the sensor fusion is switched to a higher level of abstraction. This reduces the system load and, as a result, the ratio of aborted task instances.

From the document: A middleware for cooperating mobile embedded systems (pages 121-126)