tectural and technological constraints that define the MEMORY CHANNEL 2 design space. To increase the
Comparison of First- and Second-Generation MEMORY CHANNEL Architectures
[Table residue; recoverable values for MEMORY CHANNEL 2: 16-bit link width, full-duplex LVDS signaling, 66-MHz link clock, 133 + 133 MB/s link bandwidth, 100 MB/s sustained bandwidth, 256-byte maximum packet size, 32-bit Reed-Solomon receive error detection, 4 KB and 8 KB page sizes, crossbar hub, 800 to 1,600 MB/s aggregate bandwidth. First generation: 8 KB page size, shared-bus hub, 77 MB/s.]

Figure 4
MEMORY CHANNEL 2 Packet Format
(a) Data packet: header (DNID, TPCMD, SID, address), payload (data, 4 to 256 bytes), error-detection code
(b) Control packet: header (PSTAT, TPCFG, DNID), control information (hub status, global status), error-detection code

Digital Technical Journal Vol. 9 No. 1 1997
On MEMORY CHANNEL 1 clusters, the network address is mapped to a local page of physical memory using remapping resources contained in the system's PCI-to-host memory bridge. All AlphaServer systems implement these remapping resources. Other systems, particularly those with 32-bit addresses, do not implement this PCI-to-host memory remapping resource. On MEMORY CHANNEL 2, software has the option to enable remapping in the receiver side of the MEMORY CHANNEL 2 adapter on a per-network-page basis. When configured for remapping, a section of the PCT is used to store the upper address bits.

A remote read primitive supports software-assisted shared memory. The primitive allows a node to complete a read request to another node without software intervention. It is implemented by a new remote read-on-write attribute in the receive page control table. The requesting node generates a write with the appropriate remote address (a read-request write). When the packet arrives at the receiver, its address maps in the PCT to a page marked as remote read. After remapping (if enabled), the address is converted to a PCI read command. The read data is returned as a MEMORY CHANNEL write to the same address as the original read-request write. Since read access to a page of memory in a remote node is provided by a unique network address, privileges to write or read cluster memory remain completely independent.
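The read-on-write sequence above can be sketched as a small simulation (a minimal sketch; the class and field names are illustrative, not the adapter's actual register interface):

```python
# Sketch of the MEMORY CHANNEL 2 remote read-on-write primitive.
# All names here are hypothetical; the real mechanism is adapter hardware.

class Page:
    def __init__(self, remote_read=False, memory=None):
        self.remote_read = remote_read   # remote read-on-write attribute
        self.memory = memory or {}       # local physical memory backing the page

class Node:
    def __init__(self):
        self.pct = {}        # receive page control table: network page -> Page
        self.inbox = {}      # addresses written back by remote nodes

    def receive_write(self, sender, net_page, offset, value):
        page = self.pct[net_page]
        if page.remote_read:
            # The arriving write is a read-request write: convert it to a
            # local (PCI) read and return the data as a MEMORY CHANNEL write
            # to the same address on the requester, with no software involved.
            data = page.memory[offset]
            sender.inbox[(net_page, offset)] = data
        else:
            page.memory[offset] = value

# Usage: node B exports a page marked remote-read; node A reads it with a write.
a, b = Node(), Node()
b.pct[7] = Page(remote_read=True, memory={0x10: 42})
b.receive_write(a, net_page=7, offset=0x10, value=0)   # read-request write
print(a.inbox[(7, 0x10)])   # -> 42
```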
A global clock mechanism has been introduced to provide support for clusterwide synchronization. Global clocks, which are highly accurate, are extremely useful in many distributed applications, such as parallel databases or distributed debugging. The MEMORY CHANNEL 2 hub implements this global clock by periodically sending synchronization packets to all nodes in the cluster. The reception of such a pulse can be made to trigger an interrupt or, on future MEMORY CHANNEL-to-CPU direct-interface systems, may be used to update a local counter. The interrupt service software updates the offset between the local time and the global time. This synchronization mechanism allows a unique clusterwide time to be maintained with an accuracy equal to twice the range (max - min) of the MEMORY CHANNEL network latency, plus the interrupt service routine time.
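The stated accuracy bound can be checked numerically (a minimal sketch; the latency and interrupt-service figures below are made-up placeholders, not measured MEMORY CHANNEL values):

```python
# Clusterwide time accuracy = 2 * (max - min network latency) + interrupt
# service routine time. The example numbers are placeholders for illustration.

def clock_accuracy_us(lat_min_us, lat_max_us, isr_us):
    """Worst-case skew of the hub-driven global clock, in microseconds."""
    return 2.0 * (lat_max_us - lat_min_us) + isr_us

# Example: network latency ranging from 2 to 3 us, 5 us interrupt service.
print(clock_accuracy_us(2.0, 3.0, 5.0))   # -> 7.0
```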
Conditional write transactions have been introduced in MEMORY CHANNEL 2 to improve the speed of a recoverable messaging system. On MEMORY CHANNEL 1, the simplest implementation of general-purpose recoverable messaging requires a round-trip acknowledge delay to validate the message transfer, which adds to the communication latency. MEMORY CHANNEL 2's newly introduced conditional write transaction provides a more efficient implementation that requires a single acknowledge packet, thus practically reducing the associated latency by more than a factor of two.

Memory Channel 2 Hardware
As suggested in the previous architectural description, MEMORY CHANNEL 2 hardware components are functionally partitioned into two subsystems: the PCI interface and the link interface. First in, first out (FIFO) queues are placed between the two subsystems. The PCI interface communicates with the host system, feeds the link interface with data packets to be sent, and forwards received packets on to the PCI bus. The link interface manages the link protocol and data flow: It formats data packets, generates control packets, and handles error code generation and detection. It also multiplexes the data path from the PCI format (32 bits at 33 megahertz [MHz]) to the link protocol (16 bits at 66 MHz). In addition, the link interface implements the conversion to and from LVDS signaling.
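The PCI-to-link data-path multiplexing (32 bits at 33 MHz down to 16 bits at 66 MHz, so bandwidth is preserved) can be sketched as follows (a minimal sketch; the function and its halfword ordering are assumptions, not the adapter's documented behavior):

```python
# Multiplex a stream of 32-bit PCI words into 16-bit link halfwords.
# Bandwidth is preserved: 32 bits x 33 MHz == 16 bits x 66 MHz.

def pci_to_link(words_32):
    """Split each 32-bit word into two 16-bit halves, low half first
    (the actual ordering on the real link is not specified here)."""
    halves = []
    for w in words_32:
        halves.append(w & 0xFFFF)           # low 16 bits
        halves.append((w >> 16) & 0xFFFF)   # high 16 bits
    return halves

print([hex(h) for h in pci_to_link([0x12345678])])   # -> ['0x5678', '0x1234']
```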
The transmit (TX) and receive (RX) data paths, both heavily pipelined, are kept completely separate from each other, and there is no resource conflict. In a normal MEMORY CHANNEL 2 transaction, the transmit pipeline processes a transmit request from the PCI bus. The transmit PCT is addressed with a subset of the PCI address bits and is used to determine the intended destination of the packet and its attributes. The transmit pipeline feeds the link interface with data packets and appropriate commands through the transmit FIFO queue. The link interface formats the packets and sends them on the link cable. At the receiver, the link interface disassembles the packet into an intermediate format and stores it into the receive FIFO queue. The PCI interface performs a lookup in the
Figure 5
Block Diagram of a MEMORY CHANNEL 2 Adapter
receiver PCT to ensure that the page has been enabled for reception and to determine the local destination address. In the simplest implementation, packets are subject to two store-and-forward delays: one on the transmit path and one on the receive path. Because of the atomicity of packets, the transmit path must wait for the last data word to be correctly taken in from the PCI bus before forwarding the packet to the link interface. The receive path experiences a delay because the error detection protocol requires the checking of the last cycle before the packet can be declared error-free.

A set of control/status MEMORY CHANNEL 2 registers, addressable through the PCI, is used to set various modes of operation and to read local status of the link and global cluster status.
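The two store-and-forward delays can be folded into a simple end-to-end latency estimate (a minimal sketch; the model and its fixed-overhead parameter are assumptions for illustration, not measured adapter timing):

```python
# Rough one-way latency model for a store-and-forward transfer: the packet
# is streamed three times end to end (into the transmit FIFO, across the
# link, and through the receive-side error check), plus a fixed overhead.

LINK_MBPS = 133.0   # raw link bandwidth in MB/s, from the architecture table

def one_way_latency_us(packet_bytes, fixed_us=1.0):
    """fixed_us lumps pipeline and cable delays; it is a placeholder value."""
    stage_us = packet_bytes / LINK_MBPS   # bytes / (MB/s) == microseconds
    return fixed_us + 3 * stage_us

# Larger packets pay the store-and-forward cost on every stage.
print(round(one_way_latency_us(256), 2))
```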
The MEMORY CHANNEL 2 Hub
The hub is the central resource that interconnects all nodes to form a cluster. Figure 6 is a block diagram of an 8-by-8 MEMORY CHANNEL 2 hub. The hub implements a nonblocking 8-by-8 crossbar and interfaces to eight 16-bit-wide full-duplex links by means of a link interface similar to that used in the adapter. The actual crossbar has eight input ports and eight output ports, all 16 bits wide. Each output port has an 8-to-1 multiplexer, which is able to choose from one of eight input ports. Each multiplexer is controlled by a local arbiter, which is fed decoded destination requests from the eight input ports. The port arbitration is based on a fixed-priority, request-sampling algorithm. All requests that arrive within a sampling interval are considered of equal age and are serviced before any new requests. This algorithm, while not enforcing absolute arrival-time ordering among packets sent from different nodes, assures no starvation and a fair age-driven priority across sampling intervals.
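The request-sampling discipline can be sketched as follows (a minimal sketch; the batch representation is simplified, and the fixed priority is assumed to follow port number, which the text does not specify):

```python
# Fixed-priority, request-sampling arbitration for one hub output port.
# Requests arriving within the same sampling interval form a batch of equal
# age; the whole batch is serviced before any request from a later interval.

def arbitrate(batches):
    """batches: sampling intervals in arrival order, each a set of
    requesting input ports. Returns the grant order."""
    grants = []
    for batch in batches:
        # Within a batch, fixed priority: assumed lowest port number first.
        grants.extend(sorted(batch))
    return grants

# Port 7's request from the first interval is serviced before port 0's
# request from the second interval, so no input port can be starved.
print(arbitrate([{3, 7}, {0}]))   # -> [3, 7, 0]
```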
When a broadcast request arrives at the hub, the otherwise independent arbiters synchronize themselves to transfer the broadcast packet. The arbiters wait for the completion of the packet currently being transferred, disable point-to-point arbitration, signal that they are ready for broadcast, and then wait for all other ports to arrive at the same synchronization point. Once all output ports are ready for broadcast, port 0 proceeds to read from the appropriate input port, and all other ports (including port 0) select the same input source. The maximum synchronization wait time, assuming no output queue blocking, is equal to the time it takes to transfer the largest-size packets (256 bytes), about 4 μs, and is independent of the number of ports. As in any crossbar architecture with a single point of coherency, such a broadcast operation is more costly than a point-to-point transfer. Our experience has been that some critical but relatively low-frequency operations (primarily fast locks) exploit the broadcast circuit.

MEMORY CHANNEL 2 Design Process and Physical Implementation
Figure 7 illustrates the main MEMORY CHANNEL physical components. As shown in Figure 7a, two-node clusters can be constructed by directly connecting two MEMORY CHANNEL PCI adapters and a cable. This configuration is called the virtual hub configuration. Figure 7b shows clusters interconnected by means of a hub.

Figure 6
Block Diagram of an 8-by-8 MEMORY CHANNEL 2 Hub

The MEMORY CHANNEL adapter is implemented as a single PCI card. The hub consists of a motherboard that holds the switch and a set of linecards, one per port, that provide the interface to the link cable. The adapter and hub implementations use a combination of programmable logic devices and off-the-shelf components. This design was preferred to an application-specific integrated circuit (ASIC) implementation because of the short time-to-market
Figure 7
MEMORY CHANNEL Hardware Components
(a) Virtual hub mode: direct node-to-node interconnection of two PCI adapter cards
requirements. In addition, some of the new functionality will evolve as software is modified to take advantage of the new features. The MEMORY CHANNEL 2 design was developed entirely in Verilog at the register transfer level (RTL). It was simulated using the Viewlogic VCS event-driven simulator and synthesized with the Synopsys tool. The resulting netlist was fed through the appropriate vendor tools for

(b) Using the MEMORY CHANNEL hub to create clusters of up to 16 nodes
effective process-to-process bandwidth achieved using a pair of AlphaServer 4100 systems. With 256-byte packets, MEMORY CHANNEL 2 achieves 127 MB/s, or about 96 percent of the raw wire bandwidth.
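The efficiency figure can be checked against the link parameters (a minimal sketch; the 132-MB/s raw figure is derived here from the 16-bit, 66-MHz link, ignoring any other per-cycle overheads):

```python
# Raw wire bandwidth of one link direction: 16 bits (2 bytes) per 66-MHz cycle.
raw_mb_s = 2 * 66          # = 132 MB/s
measured_mb_s = 127        # sustained throughput with 256-byte packets

efficiency = measured_mb_s / raw_mb_s
print(f"{efficiency:.0%}")   # -> 96%
```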
For PCI writes of less than or equal to 256 bytes, the MEMORY CHANNEL 2 interface simply converts the PCI write to a similar-size MEMORY CHANNEL packet. The current design does not aggregate multiple PCI write transactions into a single MEMORY CHANNEL packet.

message latency for different types of communications
simplest implementation of variable-length messaging. The latencies of standard communication interfaces are

packets. On the one hand, smaller packets are less efficient than larger ones in terms of overhead. On the other hand, smaller packets incur a shorter store-and-forward delay per packet, which can then be over

generation MEMORY CHANNEL network, MEMORY CHANNEL 2. The rationale behind the major design decisions is discussed in light of the experience gained from MEMORY CHANNEL 1. A description of the MEMORY CHANNEL 2 hardware components led to the presentation of measured performance results.