When the NVAX microprocessor is run at maximum speed, it draws a direct current of about 5 amperes.
Due to CMOS switching transients, the alternating current peaks are considerably higher. Distributing power (VDD) and ground (VSS) across the chip while keeping power grid voltage drops (IR) under 300 millivolts (10 percent of minimum VDD) was a major challenge. To address this constraint and meet interconnect reliability goals, we used the low-resistance M3 layer extensively to distribute VDD and VSS. As shown in Figure 4, we designed the right-hand side of the chip to be covered with an interdigitated array of alternating VDD and VSS
[Figure 4 Metal 3 Routing. Chip floorplan showing the main clock routing (M3), super bit lines (M3), power or clock lines (M3), M3 power straps, M3 power ring, M3 power spine, and power or clock M2 straps across the VIC, F-box, I-box, E-box, M-box, C-box, and control store ROM.]
Digital Technical Journal Vol. 4 No. 3 Summer 1992
NVAX-microprocessor VAX systems
lines, each 17 micrometers wide. Vertical metal two (M2) lines are used to strap the power lines and form a VDD grid and a VSS grid. The VDD and VSS distribution of the left-hand side of the chip was different from that on the right because of the special layout requirements of the cache arrays and the F-box.
Individual cell layout did not contain M3. The power, ground, and clock connections for a cell were routed by short vertical M2 lines inside each cell. These M2 lines were connected to the M3 grids automatically by a CAD tool.
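As a rough illustration of why the power grid needed so much wide, strapped M3, the figures quoted in this article can be combined in a back-of-the-envelope calculation. The sketch below is a DC-only simplification that treats the grid as one lumped resistance; the function names are hypothetical, and the 17 milliohms per square is the M3 sheet resistance quoted in the clock distribution discussion.

```python
# Back-of-the-envelope power grid budget from figures quoted in the article.
# Assumption: DC only, with all current flowing through one lumped resistance.

I_DC_AMPS = 5.0               # direct current at maximum speed
IR_BUDGET_VOLTS = 0.300       # allowed power grid voltage drop
M3_SHEET_OHMS_PER_SQ = 0.017  # M3 sheet resistance (17 milliohms per square)
M3_LINE_WIDTH_UM = 17.0       # width of each M3 power line


def max_grid_resistance(budget_volts, current_amps):
    """Largest lumped resistance that keeps the IR drop within budget."""
    return budget_volts / current_amps


def line_resistance(sheet_ohms_per_sq, length_um, width_um):
    """Resistance of a uniform wire: sheet resistance times number of squares."""
    return sheet_ohms_per_sq * (length_um / width_um)


r_max = max_grid_resistance(IR_BUDGET_VOLTS, I_DC_AMPS)
r_per_mm = line_resistance(M3_SHEET_OHMS_PER_SQ, 1000.0, M3_LINE_WIDTH_UM)

print(f"max lumped grid resistance: {r_max * 1000:.0f} milliohms")
print(f"one 17-um M3 line, per mm:  {r_per_mm:.2f} ohms")
```

Under these assumptions the whole grid may present only about 60 milliohms, while a single 17-micrometer M3 line contributes about 1 ohm per millimeter of length, which is consistent with the use of many parallel lines strapped into a grid.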
On-chip Clock Distribution
In order for us to meet the performance goals, it was critical to keep clock skews small and edge rates sharp across the chip. As shown in Figure 5, special attention was given to the clock distribution scheme. Differential outputs from an off-chip oscillator were supplied to a receiver located at the top of the chip. The output of the receiver was routed to the global clock generator (CLKGEN), which was placed at the center of the chip to reduce clock skew. The outputs of the global clock generator were buffered by four inverters to
[Figure 5 Clock Generation and Distribution. Block diagram: the OSCILLATOR_LOW and OSCILLATOR_HIGH inputs feed a differential amplifier; the global clock generator (CLKGEN) drives the global clocks (M3) along the central channel; clock buffers for the VIC, I-box, E-box, F-box, P-cache, M-box, and C-box, together with the upper and lower pad clock buffers, drive the box and pad clocks (M3), which are strapped with M2.]
The NVAX CPU Chip: Design Challenges, Methods, and CAD Tools
increase their driving capability. The clocks were then distributed, using the low-resistance third metal layer (17 milliohms per square), from the top to the bottom of the central clock routing channel that spans the chip.
Clocks were supplied to the different functional boxes by locally tapping off the central clock routing and buffering each signal with four inverters to further increase the signal's driving capability. This buffering helps to minimize the capacitive loads seen by the clock phases in the central routing channel, in which the RC delays are held to 30 picoseconds (ps). To reduce distribution skew between the global clock lines, loading on each global line was balanced by adding dummy loads to the more lightly loaded lines. The buffered clock phases were distributed to the east and west of the central clock routing channel, again using M3 to reduce RC delay. The east-west clock routing was strapped with M2 as shown in Figure 5. These straps were not allowed to cross box boundaries. Box-level clock skew was reduced by using a common section buffer design and layout, and by carefully tuning the buffer drive capability to the clock load in each section.
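The load-balancing step described above can be sketched in a few lines. This is only an illustration of the idea of padding lightly loaded lines up to the heaviest one; the line names and capacitance values are hypothetical and do not come from the NVAX design.

```python
def dummy_loads(line_loads_pf):
    """Dummy capacitance to add to each line so that all match the heaviest line."""
    target = max(line_loads_pf.values())
    return {name: target - load for name, load in line_loads_pf.items()}


# Hypothetical global clock lines and their capacitive loads in picofarads.
loads = {"phase_a": 4.0, "phase_b": 3.5, "phase_c": 4.0, "phase_d": 3.0}

added = dummy_loads(loads)
balanced = {name: loads[name] + added[name] for name in loads}

print(added)     # padding per line
print(balanced)  # every line now carries the same load
```

With equal loads on equal drivers, each global line sees the same RC product, so the remaining skew comes from routing differences rather than from load mismatch.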
Finally, before the clocks were used by the logic, the clock signals were locally buffered. These final stages of local buffering served two purposes: they reduced the gate loading on the east-west clock routing, and they helped to sharpen the clock edges seen by the logic.
The global clock routing network was spaced so that the RC delays of local clock branches would never exceed a negligible 125 ps. We calculated the RC delays of local clock branches using the WAWOTH layout interconnect analyzer (described in the section New Proprietary CAD Tools) and, where necessary, rerouted branches to meet the 125-ps design goal. A sample RC plot, generated by WAWOTH for a section of local clock routing, is given in Figure 6. The clock skews and edge rates across this 1.62-centimeter chip are less than 0.5 ns and 0.65 ns, respectively.
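The kind of check described above can be illustrated with the standard Elmore delay estimate for an unbranched RC chain. This is not a description of how WAWOTH computes delays, which this section does not detail; the function names, branch names, and resistance and capacitance values below are all hypothetical.

```python
def elmore_delays_ps(segments):
    """Elmore delay at the end of each segment of an unbranched RC chain.

    segments: list of (resistance_ohms, capacitance_pf) pairs ordered from
    the driver to the far end. Ohms times picofarads gives picoseconds.
    """
    delays = []
    remaining_c = sum(c for _, c in segments)  # capacitance at or beyond this segment
    elapsed = 0.0
    for r, c in segments:
        elapsed += r * remaining_c
        delays.append(elapsed)
        remaining_c -= c
    return delays


def branches_over_goal(branches, goal_ps=125.0):
    """Names of branches whose far-end delay exceeds the design goal."""
    return [name for name, segs in branches.items()
            if elmore_delays_ps(segs)[-1] > goal_ps]


# Hypothetical local clock branches; the values are made up for illustration.
branches = {
    "short_branch": [(10.0, 1.0), (10.0, 1.0), (10.0, 1.0)],
    "long_branch": [(20.0, 2.0), (20.0, 2.0), (20.0, 2.0)],
}

print(elmore_delays_ps(branches["short_branch"]))
print(branches_over_goal(branches))
```

In this sketch the short branch stays within the 125-ps goal while the long branch is flagged, which corresponds to the branches that would be candidates for rerouting.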
Microcode Control Store
The design of the 12KB ROM control store was simplified by dividing it into four subarrays. Each subarray has its own M1 bit lines. The M1 bit lines from the subarrays are cascaded onto low-capacitance M3 super bit lines that extend over all four subarrays. Since the capacitance of the M3 super bit lines is low, the access time is very fast, obviating the
need for sense amplifiers and voltage reference generators. This substantially reduced the time required to design and verify the control store ROM.
Primary Cache
A similar technique was used in the 8KB P-cache to ease the timing requirements. The three high-order P-cache address bits must be translated and consequently become valid later than the untranslated lower-order bits. By dividing the P-cache into eight subarrays, each with its own sense amplifiers, the cache subarray access can be started before the three translated address bits are valid. When the last three address bits become valid, the outputs of the subarrays are multiplexed onto the M3 super bit lines, resulting in a faster cache access time.
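The two-step access described above (start every subarray early, select late) can be modeled behaviorally. The sketch below assumes a byte-addressed 8KB array split evenly into eight 1KB subarrays, so that 10 untranslated bits index within a subarray and the 3 translated bits pick the subarray; the real P-cache organization (line size, tags, associativity) is not modeled, and all names are hypothetical.

```python
SUBARRAYS = 8    # eight subarrays (from the text)
SELECT_BITS = 3  # three translated high-order bits (from the text)
INDEX_BITS = 10  # assumption: 8KB / 8 subarrays = 1KB per subarray


def make_subarrays():
    """Fill each location with its own flat address so reads are easy to check."""
    size = 1 << INDEX_BITS
    return [[s * size + i for i in range(size)] for s in range(SUBARRAYS)]


def split_address(addr):
    """Split a flat address into (untranslated low bits, translated high bits)."""
    low = addr & ((1 << INDEX_BITS) - 1)
    high = (addr >> INDEX_BITS) & ((1 << SELECT_BITS) - 1)
    return low, high


def early_read(subarrays, low_bits):
    """Step 1: read all eight subarrays using only the untranslated low bits."""
    return [subarray[low_bits] for subarray in subarrays]


def late_select(candidates, high_bits):
    """Step 2: once the translated bits are valid, multiplex one candidate out."""
    return candidates[high_bits]


cache = make_subarrays()
low, high = split_address(5000)
candidates = early_read(cache, low)   # can start before translation finishes
print(late_select(candidates, high))  # selected once the translated bits arrive
```

The point of the arrangement is that the slow part of the access overlaps with address translation, leaving only the final multiplexing step after the translated bits arrive.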
Layout Verification Tools
Verifying the NVAX chip layout presented several CAD software challenges. Prior to the NVAX design, the existing layout verification tools were able to extract full-chip netlists from layout for all large designs in a single batch process. However, the existing layout netlist extractor could not handle designs such as NVAX with over one million transistors. Also, a more accurate capacitance extraction algorithm was required to calculate side-to-side and fringing capacitance, which came to show significant effects in the small physical dimensions in CMOS-4. Furthermore, accurate interconnect resistance extraction was needed for NVAX. A combination of new CAD tools (see Figure 7) and design methods was employed to meet the NVAX layout verification requirements.
Partitioning Using "Clean Belts"
To address the problem of extracting parasitic capacitance data from such a large design, the NVAX chip layout was constructed so that each chip partition could be independently extracted without introducing inaccuracies in the results. The chip was partitioned into nonoverlapping regions, each of which had a rectilinear annulus or "clean belt" around its periphery. A clean belt is a rectangular region that contains only metal lines and satisfies a number of layout design rules beyond those set by the technology. The clean belt layout rules prevented design rule violations within the clean belt and between adjacent clean belts. The rules also ensured that extracting parasitic capacitance from a region enclosed by a clean belt could be done
[Figure 6 WAWOTH RC Delay Analysis Results for a Clock Node. The plot annotates points along a section of local clock routing with RC delays between roughly 100.6 and 107.0. Note: Times are given in picoseconds.]
accurately regardless of the materials that border the region. Partitioning the chip in this manner made it easier to locate global wiring errors.
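Two of the properties described above, nonoverlapping partitions and belts that contain only metal, lend themselves to a simple geometric check. The sketch below is a simplified model rather than the actual NVAX rule set: it assumes rectangular regions, a belt of fixed width just inside each region boundary, and hypothetical layer names in which metal layers start with "M".

```python
from itertools import combinations

# Rectangles are (x0, y0, x1, y1) tuples with x0 < x1 and y0 < y1.


def overlaps(a, b):
    """True if two rectangles share interior area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def contains(outer, inner):
    """True if the inner rectangle lies entirely within the outer one."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])


def partitions_disjoint(regions):
    """True if no two partition regions overlap."""
    return not any(overlaps(a, b) for a, b in combinations(regions, 2))


def belt_violations(region, belt_width, shapes):
    """Non-metal shapes that intrude on the belt just inside the region boundary.

    shapes: list of (layer_name, rect). Layer naming is an assumption.
    """
    inner = (region[0] + belt_width, region[1] + belt_width,
             region[2] - belt_width, region[3] - belt_width)
    return [(layer, rect) for layer, rect in shapes
            if not layer.startswith("M")
            and overlaps(region, rect)
            and not contains(inner, rect)]


# Hypothetical example data.
regions = [(0, 0, 100, 100), (100, 0, 200, 100)]
shapes = [("M2", (0, 40, 10, 42)),       # metal crossing the belt: allowed
          ("POLY", (20, 20, 30, 30)),    # non-metal well inside: allowed
          ("POLY", (2, 50, 8, 52))]      # non-metal inside the belt: flagged

print(partitions_disjoint(regions))
print(belt_violations(regions[0], 5, shapes))
```

Because each region passes or fails on its own geometry, a check of this kind can run per partition, which mirrors the independent extraction that the clean belts were designed to allow.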