In computing, a cluster is loosely defined as a parallel system comprising a collection of stand-alone computers (each called a node) connected by a network. Each node runs its own copy of the operating system, and cluster software coordinating the entire parallel system attempts to provide users with a unified system view.
Since each node in the cluster is an off-the-shelf computer system, clusters offer several advantages over traditional massively parallel processors (MPPs) and large-scale symmetric multiprocessors (SMPs). Specifically, clusters provide1
• Much better price/performance ratios, opening a wide range of computing possibilities for users who could not otherwise afford a single large system.
• Much better availability. With appropriate software support, clusters can survive node failures, whereas SMP and MPP systems generally do not.
• Impressive scaling (hundreds of processors), when the individual nodes are medium-scale SMP systems.
• Easy and economical upgrading and technology migration. Users can simply attach the latest-generation node to the existing cluster network.
Despite their advantages and their impressive peak computational power, clusters have been unable to displace traditional parallel systems in the marketplace because their effective performance on many real-world parallel applications has often been disappointing. Clusters' lack of computational efficiency can be attributed to their traditionally poor communication, which is a result of the use of standard networking technology as a cluster interconnect. The development of the MEMORY CHANNEL network as a cluster interconnect was motivated by the realization that the gap in effective performance between clusters and SMPs can be bridged by designing a communication network to deliver low latency and high bandwidth all the way to the user applications.
Over the years, many researchers have recognized that the performance of the majority of real-world parallel applications is affected by the latency and bandwidth available for communication.2-5 In particular, it has been shown6,7 that the efficiency of parallel scientific applications is strongly influenced by the system's architectural balance as quantified by its communication-to-computation ratio, which is sometimes called the q-ratio.2 The q-ratio is defined as the ratio between the time it takes to send an 8-byte floating-point result from one process to another (communication) and the time it takes to perform a floating-point operation (computation). In a system with a q-ratio equal to 1, it takes the same time for a node to compute a result as it does for the node to communicate the result to another node in the system. Thus, the higher the q-ratio, the more difficult it is to program a parallel system to achieve a given level of performance. Q-ratios close to unity have been obtained only in experimental machines, such as iWarp8 and the M-Machine,9 by employing direct register-based communication.

Digital Technical Journal Vol. 9 No. 1 1997
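As a sketch of the arithmetic, the q-ratio can be computed directly from the two measured quantities. The helper below is illustrative (the function name is ours, not from the paper), with sample values taken from Table 1.

```c
#include <assert.h>
#include <math.h>

/* q-ratio as defined above: the time to communicate an 8-byte
   floating-point result divided by the time to perform one
   floating-point operation. Both inputs are in microseconds,
   so the ratio is dimensionless. */
static double q_ratio(double comm_latency_us, double us_per_flop)
{
    return comm_latency_us / us_per_flop;
}
```

For example, the shared-memory SMP configuration of Table 1 (0.6 microseconds latency, 0.006 microseconds/FLOP) yields a q-ratio of 100, while the FDDI cluster (180.0 microseconds latency) yields 30,000.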
Table 1 shows actual q-ratios for several commercial systems.10,11 These q-ratios vary from about 100 for a DIGITAL AlphaServer 4100 SMP system using shared memory to 30,000 for a cluster of these SMP systems interconnected over a fiber distributed data interface (FDDI) network using the transmission control protocol/internet protocol (TCP/IP). An MPP system, such as the IBM SP2, using the Message Passing Interface (MPI) has a q-ratio of 5,714. The MEMORY CHANNEL network developed by Digital Equipment Corporation reduces the q-ratio of an AlphaServer-based cluster by a factor of 38 to 82, to within the range of 367 to 1,067. Q-ratios in this range permit clusters to efficiently tackle a large class of parallel technical and commercial problems.

The benefits of low-latency, high-bandwidth networks are well understood.12,13 As shown by many studies,14,15 high communication latency over traditional networks is the result of the operating system overhead involved in transmitting and receiving messages. The MEMORY CHANNEL network eliminates this latency by supporting direct process-to-process communication that bypasses the operating system.
The MEMORY CHANNEL network supports this type of communication by implementing a natural extension of the virtual memory space, which provides direct, but protected, access to the memory residing in other nodes.
Based on this approach, DIGITAL developed its first-generation MEMORY CHANNEL network (MEMORY CHANNEL 1),16 which has been shipping in production since April 1996. The network does not require any functionality beyond the peripheral component interconnect (PCI) bus and therefore can be used on any system with a PCI I/O slot. DIGITAL currently supports production MEMORY CHANNEL clusters as large as 8 nodes by 12 processors per node (a total of 96 processors). One of these clusters was presented at Supercomputing '95 and ran clusterwide applications using High Performance Fortran (HPF), Parallel Virtual Machine (PVM),17 and MPI18 in DIGITAL's Parallel Software Environment (PSE). This 96-processor system has a q-ratio of 500 to 1,000, depending on the communication interface. A 4-node MEMORY CHANNEL cluster running DIGITAL TruCluster software19 and the Oracle Parallel Server has held the cluster performance world record on the TPC-C benchmark,20 the industry standard in on-line transaction processing, since April 1996.

We next present an overview of the generic MEMORY CHANNEL network to justify the design goals of the second-generation MEMORY CHANNEL network (MEMORY CHANNEL 2). Following this overview, we describe in detail the architecture of the two components that make up the MEMORY CHANNEL 2 network: the hub and the adapter. Last, we present hardware-measured performance data.
MEMORY CHANNEL Overview
Table 1
Comparison of Communication and Computation Performance (q-ratio) for Various Parallel Systems

                                                  Communication   Computation Performance      Communication-
                                                  Performance     Based on LINPACK 100 x 100   to-computation
System                                            Latency         (Microseconds/FLOP)          Ratio (q-ratio)
                                                  (Microseconds)

AlphaServer 4100 Model 300 configurations:
  SMP using shared memory messaging                    0.6              0.006                       100
  SMP using MPI                                        3.4              0.006                       567
  FDDI cluster using TCP/IP                          180.0              0.006                    30,000
  MEMORY CHANNEL cluster using native messaging        2.2              0.006                       367
  MEMORY CHANNEL cluster using MPI                     6.4              0.006                     1,067
IBM SP2 using MPI                                     40.0              0.007                     5,714

The MEMORY CHANNEL network is a dedicated cluster interconnection network, based on Encore's
MEMORY CHANNEL technology, that supports virtual shared memory space by means of internodal memory address space mapping, similar to that used in the SHRIMP system.21 The MEMORY CHANNEL substrate is a flat, fully interconnected network that provides push-only message-based communication.16,22 Unlike traditional networks, the MEMORY CHANNEL network provides low-latency communication by supporting direct user access to the network. As in Scalable Coherent Interface (SCI)23 and Myrinet24 networks, connections between nodes are established by mapping part of the nodes' virtual address space to the MEMORY CHANNEL interface. A MEMORY CHANNEL connection can be opened as either an outgoing connection (in which case an address-to-destination node mapping must be provided) or an incoming connection. Before a pair of nodes can communicate by means of the MEMORY CHANNEL network, they must consent to share part of their address space, one side as outgoing and the other as incoming. The MEMORY CHANNEL network has no storage of its own. The granularity of the mapping is the same as the operating system page size.
MEMORY CHANNEL Address Space Mapping
Mapping is accomplished through manipulation of page tables. Each node that maps a page as incoming allocates a single page of physical memory and makes it available to be shared by the cluster. The page is always resident and is shared by all processes in the node that map the page. The first map of the page causes the memory allocation, and subsequent reads/maps point to the same page. No memory is allocated for pages mapped as outgoing. The mapper simply assigns the page table entry to a portion of the MEMORY CHANNEL hardware transmit window and defines the destination node for that transmit subspace. Thus, the amount of physical memory consumed for the clusterwide network is the product of the operating system page size and the total number of pages mapped as incoming on each node.
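The memory-consumption rule above is simple enough to state as code. The helper below is purely illustrative (the function name is ours, not part of any MEMORY CHANNEL API); the example assumes the 8-KB page size typical of Alpha systems.

```c
#include <assert.h>
#include <stddef.h>

/* Physical memory consumed on a node for the clusterwide network:
   page size times the number of pages the node maps as incoming.
   Pages mapped as outgoing consume no local memory at all. */
static size_t mc_incoming_memory_bytes(size_t page_size,
                                       size_t pages_incoming)
{
    return page_size * pages_incoming;
}
```

A node that maps four pages as incoming on an 8-KB-page system therefore dedicates 32 KB of physical memory to the cluster, regardless of how many outgoing mappings it holds.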
After mapping, MEMORY CHANNEL accesses are accomplished by simple load and store instructions, as for any other portion of virtual memory, without any operating system or run-time library calls. A store instruction to a MEMORY CHANNEL outgoing address results in data being transferred across the MEMORY CHANNEL network to the memory allocated on the destination node. A load instruction from a MEMORY CHANNEL incoming channel address space results in a read from the local physical memory initialized as a MEMORY CHANNEL incoming channel. The overhead (in CPU cycles) in establishing a MEMORY CHANNEL connection is much higher than that of using the connection. Because of the memory-mapped nature of the interface, the transmit or receive overhead is similar to an access to local main memory.
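The load/store semantics just described can be sketched with a toy model. Everything below is a simulation in ordinary memory (the names mc_store and mc_load and the single static page are our own inventions); the point it illustrates is that transmission is a plain store and reception a plain load, with no system call in the data path.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MC_PAGE_SIZE 8192  /* assume an 8-KB operating system page */

/* Models the page the destination node mapped as incoming: in the
   real network this is ordinary local physical memory on that node. */
static uint8_t incoming_page[MC_PAGE_SIZE];

/* A store to an outgoing address: the hardware forwards the written
   data to the destination node's incoming page at the same offset. */
static void mc_store(size_t offset, uint64_t value)
{
    memcpy(&incoming_page[offset], &value, sizeof value);
}

/* A load from an incoming address is just a local memory read: no
   operating system or run-time library call is involved. */
static uint64_t mc_load(size_t offset)
{
    uint64_t value;
    memcpy(&value, &incoming_page[offset], sizeof value);
    return value;
}
```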
This mechanism is the fundamental reason for the low MEMORY CHANNEL latency. Figure 1 illustrates an example of MEMORY CHANNEL address mapping. The figure shows two sets of independent connections. Node 1 has established an outgoing channel to node 3 and node 4 and also an incoming channel to itself. Node 4 has an outgoing channel to node 2.

[Figure 1: MEMORY CHANNEL Mapping of a Portion of the Clusterwide Address Space]
All connections are unidirectional, either outgoing or incoming. To map a channel as both outgoing and incoming to the same shared address space, node 1 maps the channel two times into a single process' virtual address space. The mapping example in Figure 1 requires a total of four pages of physical memory, one for each of the four arrows pointed toward the nodes' virtual address spaces.
MEMORY CHANNEL mappings reside in two page control tables (PCTs) located on the MEMORY CHANNEL interface, one on the sender side and one on the receiver side. As shown in Figure 2, each page entry in the PCT has a set of attributes that specify the MEMORY CHANNEL behavior for that page.

The page attributes on the sender side are

• Transmit enabled, which must be set to allow transmission from store instructions to a specific page
• Local copy on transmit, which directs an ordered copy of the transmitted packet to the local memory
• Acknowledge request, which is used to request acknowledgment of packet reception from the destination node

The page attributes on the receiver side are

• Receive enabled, which must be set to allow reception of messages addressed to a specific virtual page
• Interrupt on receive, which generates an interrupt on reception of a packet
• Receive enabled under error, which is asserted for error recovery communication pages
• Remote read, which identifies all packets that arrive at a page as requests for a remote read operation
• Conditional write, which identifies all packets that arrive at a page as conditional write packets
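The attribute lists above can be pictured as per-page flag bits in a PCT entry. The bit encoding below is invented for illustration (the article does not give bit positions); it shows only how sender-side and receiver-side attributes might be represented and tested.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit assignments for a PCT entry; positions are ours. */
enum {
    /* sender-side attributes */
    MC_TRANSMIT_ENABLE   = 1u << 0,  /* allow stores to transmit */
    MC_LOCAL_COPY        = 1u << 1,  /* ordered local copy on transmit */
    MC_ACK_REQUEST       = 1u << 2,  /* request reception acknowledgment */
    /* receiver-side attributes */
    MC_RECEIVE_ENABLE    = 1u << 3,  /* allow reception on this page */
    MC_INTERRUPT_ON_RECV = 1u << 4,  /* interrupt on packet arrival */
    MC_RECV_UNDER_ERROR  = 1u << 5,  /* error-recovery pages only */
    MC_REMOTE_READ       = 1u << 6,  /* arriving packets are read requests */
    MC_CONDITIONAL_WRITE = 1u << 7,  /* arriving packets are conditional writes */
};

/* A store may transmit only if the sender page enables transmission. */
static int mc_can_transmit(uint32_t pct_entry)
{
    return (pct_entry & MC_TRANSMIT_ENABLE) != 0;
}
```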
The MEMORY CHANNEL communication paradigm is based on three fundamental ordering rules:

1. Single-sender rule: All destination nodes will receive packets in the order in which they were generated by the sender.
2. Multisender rule: Packets from multiple sender nodes will be received in the same order at all destination nodes.
3. Ordering-under-errors rule: Rules 1 and 2 must apply even when an error occurs in the network.
Let P_j^(s,x) denote the jth point-to-point packet from sender s to destination node x. When multiple senders transmit to destination nodes X and Y, there is a finite set of valid reception orders at the destination nodes, depending on the actual arrival order; however, messages destined to both receivers must be received in the same order at each, so the arrival order must be congruent with both receivers' views.

[Figure 2: MEMORY CHANNEL Page Control Attributes]
These rules are independent of a particular interconnection topology or implementation and must be obeyed in all generations of the MEMORY CHANNEL network.
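The multisender rule can be made concrete with a small check: given the reception logs of two destination nodes, packets that appear in both logs must appear in the same relative order. This is an illustrative consistency checker of our own, not part of any MEMORY CHANNEL software.

```c
#include <assert.h>
#include <stddef.h>

/* Returns nonzero if packet id appears anywhere in the log. */
static int in_log(const int *log, size_t n, int id)
{
    for (size_t i = 0; i < n; i++)
        if (log[i] == id) return 1;
    return 0;
}

/* Checks rule 2: walk both logs, skipping packets the other receiver
   never saw; the packets common to both must match pairwise in order. */
static int multisender_rule_holds(const int *x, size_t nx,
                                  const int *y, size_t ny)
{
    size_t i = 0, j = 0;
    while (i < nx && j < ny) {
        if (!in_log(y, ny, x[i])) { i++; continue; }
        if (!in_log(x, nx, y[j])) { j++; continue; }
        if (x[i] != y[j]) return 0;  /* common packets out of order */
        i++; j++;
    }
    return 1;
}
```

For example, logs {1, 2, 3} at X and {2, 9, 3} at Y are consistent (the common packets 2 and 3 arrive in the same relative order at both), whereas {1, 2, 3} and {3, 2} violate the rule.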
On the MEMORY CHANNEL network, error handling is a shared responsibility of the hardware and the cluster management software. The hardware provides real-time precise error handling and strict packet ordering by discarding all packets in a particular path that follow an erroneous one. The software is responsible for recovering the network from the faulty state back to its normal state and for retransmitting the lost packets.
Additional MEMORY CHANNEL Network Features

Three additional features of the MEMORY CHANNEL network make it ideal for cluster interconnection:

1. A hardware-based barrier acknowledge that sweeps the network and all its buffers
2. A fast, hardware-supported lock primitive
3. Node failure detection and isolation

Because of the three ordering rules, the MEMORY CHANNEL network acknowledge packets are implemented with little variation over ordinary packets. To request acknowledgment of packet reception, a node sends an ordinary packet marked with the request-acknowledge attribute. The packet is used to sweep clean the network queues in the sender-destination path and to ensure that all previously transmitted packets have reached the destination. In response to the reception of a MEMORY CHANNEL acknowledge request, the destination node transmits a MEMORY CHANNEL acknowledgment back to the originator.
The arrival of the acknowledgment at the originating node signals that all preceding packets on that path have been successfully received.

MEMORY CHANNEL locks are implemented using a lock-acquire software data structure mapped as both incoming and outgoing by all nodes in the cluster. To bid for a lock, a node writes to the lock data structure; because of the guaranteed packet ordering, the resulting view of the structure is the same for all nodes. The node can then determine if it was the only bidder for the lock, in which case the node has won the lock. If the node sees multiple bidders for the same lock, it resorts to an operating system-specific back-off-and-retry algorithm. Thanks to the MEMORY CHANNEL guaranteed packet ordering, even under error the above mechanism ensures that at most one node in the cluster perceives that it was the first to write the lock data structure. To guarantee that data structures are never locked indefinitely by a node that is removed from a cluster, the cluster manager software also monitors lock acquisition and release.
The MEMORY CHANNEL network supports a strong-consistency shared-memory model due to its strict packet ordering. In addition, the I/O operations used to access the MEMORY CHANNEL are fully integrated within the node's cache coherency scheme. Besides greatly simplifying the programming model, such consistency allows for an implementation of spinlocks that does not saturate the memory system. For instance, while a receiver is polling for a flag that signals the arrival of data from the MEMORY CHANNEL network, the node processor accesses only the locally cached copy of the flag, which will be updated whenever the corresponding main memory location is written by a MEMORY CHANNEL packet.
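The flag-polling pattern can be sketched as follows. This single-threaded simulation stands in for the real behavior (the names are ours): the receiver spins reading a location in an incoming page, and the loop exits once a simulated MEMORY CHANNEL packet writes the flag; on real hardware each poll is satisfied from cache until that write invalidates the line.

```c
#include <assert.h>

/* Flag and payload live in a page mapped as MEMORY CHANNEL incoming;
   here they are plain variables for a single-threaded simulation. */
static volatile int data_ready;
static int payload;

/* Models an arriving MEMORY CHANNEL packet: data first, flag last,
   relying on the network's strict packet ordering for correctness. */
static void simulate_mc_packet(int value)
{
    payload = value;
    data_ready = 1;
}

/* Receiver spin-wait: on real hardware each iteration reads the
   locally cached copy of the flag, so polling does not saturate
   the memory system. */
static int wait_for_data(void)
{
    while (!data_ready)
        ;
    return payload;
}
```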
Unlike other networks, the MEMORY CHANNEL hardware maintains information on which nodes are currently part of the cluster. Through a collection of timeouts, the MEMORY CHANNEL hardware continuously monitors all nodes in the cluster for illegal behavior. When a failure is detected, the node is isolated from the cluster and recovery software is invoked. A MEMORY CHANNEL cluster is equipped with software capable of reconfiguration when a node is added or removed from the cluster. The node is simply brought on-line or off-line, the event is broadcast to all other nodes, and operation continues. Should the MEMORY CHANNEL network itself fail, the software switches over to the standby network, in a manner transparent to the application.
The First-generation MEMORY CHANNEL Network
The first generation of the MEMORY CHANNEL network consists of a node interface card and a concentrator or hub. The interface card, called an adapter, plugs into the PCI I/O bus. To send a packet, the CPU