
All algorithms are implemented in C++, and the industrial timing engine IBM EinsTimer is used for all delay computations. Tests on the microprocessor instances and the ISPD 2013 benchmarks were performed on Intel Xeon machines with clock frequencies of 3.1 GHz and 2.9 GHz, respectively.

Table 9.1 shows our industrial testbed consisting of 8 microprocessor units.

Design  Technology  # Circuits  Cycle time (ps)  Edges    Pins
Unit1   22nm        50520       208              218507   170630
Unit2   22nm        54709       174              277141   259207
Unit3   14nm        66843       240              168093   148929
Unit4   14nm        74136       174              263197   228221
Unit5   14nm        169327      174              637107   501126
Unit6   14nm        219749      264              917523   726164
Unit7   22nm        474312      208              2745919  2166512
Unit8   22nm        542544      208              2222887  1790123

Table 9.1: Microprocessor units used in our experiments. Columns two and three indicate the technology of the units and the number of circuits. The last three columns show the cycle time, and the number of edges and pins in the timing graph, respectively.

In the previous chapters we considered algorithms for gate sizing only, as latches (registers) are usually fixed along with the clock net routing in earlier design stages.

On the ISPD 2013 benchmarks, latch sizes are fixed, but on the microprocessor designs we include non-fixed latches in our optimization. We will treat latch sizing again in Chapter 10.

The oracle algorithm is called once in each iteration of the LR and RS algorithms, in parallel with 8 threads. To this end, the netlist is partitioned into logical regions with an algorithm provided by IBM, and the threads process the regions iteratively. Circuits that lie on the boundary of two regions, that is, circuits connected to circuits in another region by an edge in the timing graph, are not optimized in this pass. For this reason the netlist is partitioned a second time after the first processing of all regions, in such a way that each former boundary circuit now lies within a single region. Afterwards, each circuit has been processed at least once. Within the regions, circuits are traversed in reverse topological order.
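The following sketch illustrates this two-pass scheme; it is a minimal illustration only, and Netlist, Region, partitionIntoRegions(), isBoundaryCircuit() and refineCircuit() are hypothetical placeholders for the IBM-internal partitioner and our oracle call, not the actual interfaces.

    // Minimal sketch of the two-pass, 8-thread region processing described above.
    // All identifiers are hypothetical placeholders, not the actual interfaces.
    #include <atomic>
    #include <cstddef>
    #include <thread>
    #include <vector>

    struct Circuit { /* gate data */ };
    struct Region  { std::vector<Circuit*> circuitsRevTopological; };
    struct Netlist { /* netlist data */ };

    // Placeholders for the IBM-internal partitioner and the oracle call.
    std::vector<Region> partitionIntoRegions(Netlist&, bool avoidOldBoundaries);
    bool isBoundaryCircuit(const Circuit&);
    void refineCircuit(Circuit&);  // oracle: evaluate candidates, choose an assignment

    void processRegions(std::vector<Region>& regions, int numThreads, bool skipBoundary) {
        std::atomic<std::size_t> next{0};
        auto worker = [&] {
            for (std::size_t i = next++; i < regions.size(); i = next++)
                for (Circuit* c : regions[i].circuitsRevTopological) {
                    if (skipBoundary && isBoundaryCircuit(*c)) continue;  // handled in pass 2
                    refineCircuit(*c);
                }
        };
        std::vector<std::thread> pool;
        for (int t = 0; t < numThreads; ++t) pool.emplace_back(worker);
        for (auto& t : pool) t.join();
    }

    void oracleIteration(Netlist& netlist) {
        // Pass 1: boundary circuits are skipped.
        auto pass1 = partitionIntoRegions(netlist, /*avoidOldBoundaries=*/false);
        processRegions(pass1, /*numThreads=*/8, /*skipBoundary=*/true);
        // Pass 2: repartition so that every former boundary circuit lies strictly
        // inside a region, then process the regions again.
        auto pass2 = partitionIntoRegions(netlist, /*avoidOldBoundaries=*/true);
        processRegions(pass2, /*numThreads=*/8, /*skipBoundary=*/false);
    }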

We performed 25 iterations of each algorithm, as the RS algorithm runs a prespecified number of iterations in theory. Placement locations of the circuits on the ISPD 2013 benchmarks are unknown, and the designs cannot be legalized. We also did not run a legalization algorithm on the microprocessor designs, in order to evaluate the raw behavior of the LR and RS algorithms without effects caused by legalization.

9.4.1 Starting Solutions

Algorithms were run incrementally on optimized instances instead of starting from the solution with the smallest sizes and highest Vt levels, as is usually done on the ISPD 2013 benchmarks. The microprocessor designs are placed and already timing optimized with tools for repeater insertion, layer assignment, sizing, Vt optimization, etc.

For the ISPD 2013 benchmarks, we generated a starting solution similar to Li et al. [Li+12a] and Flach et al. [Fla+14]: starting with the smallest available sizes and highest Vt levels, all gates were traversed in topological order and the smallest solution that incurred no load violations was chosen for each gate.
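A minimal sketch of this construction is given below; Gate, Candidate, candidatesSmallestFirst(), assign() and hasLoadViolation() are hypothetical placeholders for the corresponding engine calls.

    // Sketch of the starting-solution generation for the ISPD 2013 benchmarks.
    // All identifiers below are hypothetical placeholders for the engine calls.
    #include <vector>

    struct Gate;
    struct Candidate { int size; int vtLevel; };  // a (size, Vt level) assignment

    std::vector<Candidate> candidatesSmallestFirst(const Gate&);  // smallest size / highest Vt first
    void assign(Gate&, const Candidate&);
    bool hasLoadViolation(const Gate&);

    void buildStartingSolution(const std::vector<Gate*>& gatesTopological) {
        for (Gate* g : gatesTopological) {                 // topological order
            for (const Candidate& cand : candidatesSmallestFirst(*g)) {
                assign(*g, cand);
                if (!hasLoadViolation(*g)) break;          // keep the smallest feasible choice
            }
        }
    }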

9.4.2 Evaluation Metrics

Existing convergence guarantees of the LR algorithm refer to the solution computed in the last iteration of the algorithm, while for the RS algorithm the estimates refer to the weighted average of the solutions computed in the course of the algorithm.

This weighted average is not necessarily a feasible discrete solution. It is therefore not clear which solution from our experiments to analyze, because rounding can be arbitrarily bad (cf. Section 8.7). We can choose, for example, the solution that minimizes the maximum resource usage. Note that under our simplifying assumptions (cf. Section 9.3) the worst path resource usage is induced by a path that attains the worst slack WS of the design. However, the sum of negative endpoint slacks SNS and the sum of negative slacks of all subpaths SLS in the timing graph can be poor in this solution (see pages 27 and 28 for definitions of these metrics). In our evaluation we consider the solutions returned after the last iteration of the algorithms. We will see that this is usually a good choice.
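For illustration, the sketch below shows how the iterate with the smallest maximum resource usage could be kept alongside the final iterate; Solution, runIteration(), maxResourceUsage() and currentSolution() are hypothetical placeholders.

    // Sketch: remember the iterate with the smallest maximum resource usage.
    #include <limits>
    #include <utility>

    struct Solution { /* gate assignments */ };

    void runIteration(int t);          // one LR/RS iteration
    double maxResourceUsage();         // worst usage over all resources
    Solution currentSolution();

    std::pair<Solution, Solution> optimize(int numIterations) {
        double bestUsage = std::numeric_limits<double>::infinity();
        Solution best, last;
        for (int t = 0; t < numIterations; ++t) {
            runIteration(t);
            last = currentSolution();
            if (double u = maxResourceUsage(); u < bestUsage) {
                bestUsage = u;
                best = last;
            }
        }
        // We evaluate 'last' in our experiments; 'best' minimizes the maximum usage.
        return {best, last};
    }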

We briefly describe the metrics we employ to evaluate the solutions returned by our algorithms. These metrics are also shown in our result tables (Tables 9.3 to 9.6).

We measured the worst slack of the design WS, the sum of negative endpoint slacks SNS, and the sum of negative slacks of all subpaths SLS, which was also proposed as an evaluation metric in Reimann et al. [RSR15]. We refer to these metrics as timing metrics.
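For reference, these metrics can be sketched in formulas as follows, where T denotes the set of timing endpoints, E the edges (subpaths) of the timing graph and slk(.) the slack; this notation is an assumption, and the authoritative definitions are those on pages 27 and 28.

    % Sketch only; the precise definitions are given on pages 27 and 28.
    WS  = \min_{p \in T} \operatorname{slk}(p), \qquad
    SNS = \sum_{p \in T} \min\{0, \operatorname{slk}(p)\}, \qquad
    SLS = \sum_{e \in E} \min\{0, \operatorname{slk}(e)\}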

In the literature, SLS is often not considered. We use it as an additional measure for the following reason: in an industrial design flow, gate sizing is often applied at a stage where the design cannot become timing clean by sizing and Vt optimization alone, and further optimizations like repeater insertion are necessary. Using edge weights in the oracle algorithm, regardless of whether they are Lagrange multipliers or resource weights, sometimes leads to delay improvements on timing critical paths that do not appear in the SNS, which is most commonly used as a measure for the timing quality of a design. Consider for example a gate with two inputs a and b, and suppose the edge from input a to the gate output pin c lies on a very critical path that cannot be significantly improved by gate sizing. If the edge from input b to c is less timing critical than the edge from a to c, a delay improvement of this edge will not change the worst slack at c. Thus, improving the delay of the edge from b to c will only be visible in the SLS.

Additionally, a WS and SNS improvement is not preferable if it comes at the cost of a large SLS degradation that needs to be recovered by later algorithms to achieve timing closure.

“Pstatic” denotes the static power consumption of all gates. Minimizing the static power consumption is the objective for the ISPD 2013 benchmarks. On the microprocessor instances, the ratio between the total power consumptions of the solutions returned by our algorithms was usually of the same magnitude as the ratio between the static power consumptions, except for one design. Therefore we omit the total power consumption in our result tables and point out the discrepancy in Section 9.5.

The column headed by “Pusage” shows the usage of the static power resource, i.e. the ratio of the static power consumption and its budget. Due to the exponential number of paths in the timing graph, we do not evaluate the usage of all path resources. Instead we report the worst usage, which under our simplifying assumptions is induced by a path that attains the worst slack, in the column “WSusage”. Both metrics are also evaluated for the LR algorithm to provide a better comparison.
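In formulas, the two usage columns can be sketched as follows, where P_budget denotes the static power budget and usg(Q) the usage of the path resource belonging to path Q in the model of Section 9.3; both names are notational assumptions.

    % Sketch only; under our simplifying assumptions the maximum is attained
    % by a path realizing the worst slack WS.
    P_{usage} = \frac{P_{static}}{P_{budget}}, \qquad
    WS_{usage} = \max_{Q} \operatorname{usg}(Q)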

As the performance guarantees for the RS algorithm refer to the convex combination of resource usages (and the convex combination of the solutions) computed in the course of the algorithm, the two columns headed by “∅ Usages” show the convex combinations of the power usage and of the worst path resource usage over all iterations of this algorithm, respectively.

The column headed by “RT” shows the running times of the algorithms in minutes. Most of the running time is spent on delay computations during the solution evaluation in the oracle, and only a minor part on updating the resource weights. The running time serves as an indicator for the number of solutions that have been evaluated in the course of the algorithm, and the number of gates that have been changed usually correlates with it.

However, updating the resource weights takes approximately three times as long as updating the Lagrange multipliers. This can be attributed to the fact that the timing graph needs to be traversed twice in order to compute the new edge weights. Additionally, the worst path resource usage needs to be determined first to compute the resource usage vector y(t) ∈ ℝ^{|R|} in each iteration t of the RS algorithm, which is also needed in the weight computation (cf. equation (9.1)).

As the sums of load and slew violations were of the same magnitude for the LR and RS algorithms, and never degraded significantly except on two microprocessor designs, we omit these numbers in our tables and indicate the exceptions in Section 9.5.

9.4.3 Optimization Modes

Mode  Weight update rule     Local oracle objective                Power weight
Init  -                      -                                     -
LR    Lagrangian relaxation  Sum of weighted delays                No
RS    Resource sharing       Sum of weighted delays                Yes
LRM   Multiplicative         Sum of weighted delays                No
LRH   Lagrangian relaxation  Sum of weighted delays and local SNS  No
RSH   Resource sharing       Sum of weighted delays and local SNS  Yes
APPR  Resource sharing       Weighted local SNS                    Yes

Table 9.2: Optimization modes.

Table 9.2 shows our optimization modes, which are explained in the following. The modes are classified by the update rule for the edge weights and multipliers, by the objective that is optimized locally in the oracle algorithm (the refine costs), and by whether a power weight is considered.

Mode “Init” refers to the initial values before optimization. Modes “LR” and “RS” indicate the base LR and RS algorithms whose implementations we have described in Section 9.2 and Section 9.3, respectively.

Recall that our analysis of the heuristic modifications of the LR algorithm in practice led us to the multiplicative weights algorithm (cf. Section 7.2.4) and further to the new model as a min-max resource sharing problem. Naturally, the question arises how the LR algorithm with heuristic modifications performs compared to the base LR algorithm, and in particular to the RS algorithm. We divide these modifications into two classes: firstly, changes proposed for the projected gradient method, and secondly, changes to the oracle algorithm described in Section 9.1. As various heuristics exist and it is impossible to implement and test all combinations, we implemented only a selection that we deem reasonable.

Among these is a multiplicative multiplier update proposed in Flach et al. [Fla+14] that was employed in the winning algorithm of the ISPD 2013 Discrete Gate Sizing Contest. This mode is indicated by “LRM” in our tables. Here the Lagrange multipliers are updated multiplicatively based on their local timing criticality (cf. Section 7.2.4). Although the weight update rule in the RS algorithm is also multiplicative, it is based on the resource usages and differs significantly from this heuristic rule.
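As an illustration only, the sketch below shows one plausible form of such a criticality-based multiplicative update; the normalization by the required arrival time and the damping exponent k are assumptions made here, and the concrete rule of Flach et al. [Fla+14] differs in its details.

    // Sketch of a multiplicative, criticality-driven multiplier update (mode LRM).
    // The normalization and the exponent k are illustrative assumptions only.
    #include <algorithm>
    #include <cmath>

    void updateMultiplier(double& lambda, double slack, double requiredArrivalTime,
                          double k /* damping, k >= 1 */) {
        // Edges with negative slack are critical: factor > 1 increases lambda.
        // Edges with positive slack are uncritical: factor < 1 decreases lambda.
        double factor = 1.0 - slack / std::max(requiredArrivalTime, 1e-9);
        lambda *= std::pow(std::max(factor, 0.0), 1.0 / k);
    }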

Similar update rules were proposed, for example, by Tennakoon and Sechen [TS02] and by Livramento et al. [Liv+14].

It is also an interesting question how other heuristic oracles influence the performance of the algorithms. As mentioned before, our oracle algorithm is a heuristic and no approximation guarantees are known. Based on our observations in practice, we believe that better results can be obtained when the relevant timing metrics WS, SNS and SLS are considered more directly in the oracle algorithm instead of letting the weights alone guide the optimization.

Modes “LRH” and “RSH” refer to the LR and RS algorithm with such a heuristic oracle algorithm, which we now briefly describe. Let g ∈ G. We refer to the local SNS of g as the sum of negative slacks at the sibling and predecessor pins of the input pins of g and at the successor pins of the data output pin of g (cf. Section 2.5.2). Based on our practical observation that the LR and RS algorithms tend to use larger sizes and lower Vt levels unnecessarily, we included the negative slacks at both predecessor and sibling pins in this metric to put more emphasis on signals entering g. We use as refine costs the sum of weighted delays in the neighborhood graph as before, but reject solutions that degrade the local SNS of the initial solution by more than a factor of Φ > 0, similar to Flach et al. [Fla+14]. For the ISPD 2013 benchmarks, the value of Φ is updated after each iteration based on the criticality of the design: if the worst slack improves, Φ decreases. For the microprocessor designs the worst slack does not necessarily improve by sizing and Vt optimization.

On designs with a strongly negative worst slack, Φ would then remain large and hardly any solution would be rejected due to a local SNS degradation. For this reason we set Φ to a fixed value of 1.05. This proved to be efficient in practice, but could possibly be improved by fine-tuning. We also prohibit degradations of the 1% most critical slacks of the design. This oracle algorithm is a variant of Algorithm 5.4.
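A minimal sketch of this acceptance test under the stated assumptions is given below; Gate, localSNS() and degradesMostCriticalSlacks() are hypothetical placeholders, and local SNS values are non-positive by definition.

    // Sketch of the acceptance test of the heuristic oracle (modes LRH and RSH).
    // localSNS() and degradesMostCriticalSlacks() are hypothetical placeholders.
    struct Gate;

    double localSNS(const Gate& g);                       // sum of negative slacks, <= 0
    bool degradesMostCriticalSlacks(double percentile);   // checks the 1% most critical slacks

    bool acceptCandidate(const Gate& g, double localSnsBefore, double phi /* e.g. 1.05 */) {
        // Reject if the (non-positive) local SNS degrades by more than a factor phi.
        if (localSNS(g) < phi * localSnsBefore)
            return false;
        // Reject if any of the 1% most critical slacks of the design degrades.
        if (degradesMostCriticalSlacks(/*percentile=*/0.01))
            return false;
        return true;
    }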

Additionally, we evaluate an optimization mode called “APPR” with another heuristic oracle algorithm that optimizes the weighted local SNS for each gate, similar to Daboul [Dab15]. Here the slack at v ∈ V is weighted with ω_sv · ω_vt, and these pin weights can be computed simultaneously with the edge weights by equation (9.2). In each iteration, the refine costs of g ∈ G are the sum of the weighted negative slacks at the pins that are considered in the local SNS of g. The differences to minimizing the weighted sum of delays are as follows: firstly, the delays of a few edges are not “captured”, which is illustrated in Figure 9.1.
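Written out, the APPR refine costs can be sketched as follows, where N(g) denotes the pins considered in the local SNS of g and slk(v) the slack at pin v; this notation, and the use of max{0, -slk(v)}, are assumptions for illustration.

    % Sketch only; the pin weights are those obtained from equation (9.2).
    c_{\mathrm{APPR}}(g) \;=\; \sum_{v \in N(g)} \omega_{sv}\, \omega_{vt}\,
                               \max\{0,\, -\operatorname{slk}(v)\}

The oracle then minimizes this quantity over the candidate assignments of g.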

Figure 9.1 shows the neighborhood graph of gate g in the center. Edges whose delay contributes to the local SNS are indicated in green and blue; the remaining edges are colored red. The reason why the delays of some edges are not considered is that at predecessor pins and output pins of g only the criticalities, and thus the delays, of the most timing critical edges are propagated further. We conclude that the weighted sum of negative slacks at sibling and successor pins equals the sum of weighted delays when the currently sized gate and the predecessor gates are