Figure 9 Vector Multiply Unit

similar to the process used for double-precision multiplication. However, in single-precision multiplication, only one multiplier chip is needed to produce the result and the pack chips do not need to sum the partial product. Integer multiplication is slightly different from floating point multiplication because it does not need to be accumulated or rounded. Thus, the correct product is produced by one multiplier. The result bypasses the accumulation and rounding logic and proceeds directly into the packing logic to be sent to the vector register unit.

The exponent handling for both multiplication and division is performed by the same logic on the packing chips. Depending on the instruction being executed, the exponents are either added (multiplication) or subtracted (division). The result of this operation is then piped to the next stage, and the position of the hidden bit is determined. If the fractional portion of the data must be shifted to ensure that the hidden bit is in the correct position, the exponent is incremented or decremented accordingly. The normalize count (i.e., shift count) is used to select the correct final exponent. Overflow and underflow exceptions can be detected and reported only after the final exponent is selected. If an exception is detected, a reserved operand is written to the appropriate vector register element.

The first stage of the exponent logic also checks for divide by zero and reserved operand exceptions.
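The exponent path described above can be illustrated with a minimal sketch. It is not the logic on the packing chips: it assumes the F_floating format only (8-bit exponent stored in excess-128 form), it omits the first-stage divide-by-zero and reserved operand checks, and the function name, the status strings, and the encoding of the normalize count as -1, 0, or +1 are all hypothetical.

```python
# Hypothetical sketch of the shared exponent handling (F_floating only).
F_BIAS = 128      # F_floating exponents are stored in excess-128 form
F_EXP_MIN = 1     # a stored exponent of 0 is not a normalized value
F_EXP_MAX = 255   # largest value of the 8-bit exponent field


def final_exponent(exp_a, exp_b, op, normalize_count):
    """Return (status, exponent) for stored operand exponents exp_a and exp_b.

    op is "mul" or "div". normalize_count is -1, 0, or +1: the adjustment
    needed after the fraction is shifted to put the hidden bit in position
    (an assumed encoding, not taken from the source).
    """
    if op == "mul":
        raw = exp_a + exp_b - F_BIAS   # adding exponents counts the bias twice
    elif op == "div":
        raw = exp_a - exp_b + F_BIAS   # subtracting exponents cancels the bias
    else:
        raise ValueError("op must be 'mul' or 'div'")

    # The candidates exist before the shift is known; the normalize count
    # then selects one of them as the final exponent.
    candidates = {-1: raw - 1, 0: raw, +1: raw + 1}
    exponent = candidates[normalize_count]

    # Range checks are meaningful only on the selected exponent.
    if exponent > F_EXP_MAX:
        return ("overflow", None)
    if exponent < F_EXP_MIN:
        return ("underflow", None)
    return ("ok", exponent)
```

For example, 1.0 is stored in F_floating with exponent 129 and fraction 0.5. Multiplying 1.0 by 1.0 gives a fraction product of 0.25, which needs a one-place left shift, so `final_exponent(129, 129, "mul", -1)` selects 129. In the hardware, an overflow or underflow status corresponds to writing a reserved operand to the destination element.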

Division

Vector division is a variable-cycle function. The number of cycles depends on the format of the operands. The custom divider is capable of producing six quotient bits per cycle. Therefore, F_floating point division is performed in 7 cycles, G_floating point in 12 cycles, and D_floating point in 13 cycles. Because of the variable number of cycles in a divide instruction, no other instruction can execute in the V-box while a divide is in process. Also, because of the iterative nature of division (i.e., one division must be completed before another can be started), the instruction cannot be pipelined.
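The stated cycle counts can be checked against the six-bits-per-cycle rate with a short calculation. The fraction widths below (including the hidden bit) are those of the VAX floating-point formats; the fixed three-cycle overhead is an inference from the figures above, not something the text states, and the names are hypothetical.

```python
import math

# Fraction width in bits, including the hidden bit, for each VAX format.
FRACTION_BITS = {"F": 24, "G": 53, "D": 56}

QUOTIENT_BITS_PER_CYCLE = 6   # from the text
ASSUMED_OVERHEAD_CYCLES = 3   # assumption: inferred so the totals match the text


def divide_cycles(fmt):
    """Estimate divide cycles for format "F", "G", or "D"."""
    quotient_cycles = math.ceil(FRACTION_BITS[fmt] / QUOTIENT_BITS_PER_CYCLE)
    return quotient_cycles + ASSUMED_OVERHEAD_CYCLES


for fmt in ("F", "G", "D"):
    print(fmt, divide_cycles(fmt))
```

Under this assumption the quotient itself takes 4, 9, and 10 cycles for F, G, and D, and the same three additional cycles in each case account for the totals of 7, 12, and 13.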

Vol. 2 No. 4 Fall 1990 Digital Technical Journal

As a vector divide instruction executes, two 64-bit elements are received from the vector register unit each cycle and are latched in the divide unpack chip. The elements are unpacked, and the fractional portion of the elements is sent to the custom divider in 32-bit slices. The exponent portion is sent to the shared exponent logic on the packing chips, as described in the Multiplication section. During this cycle, time-critical values, such as complemented element values and first-cycle quotient bits, are calculated and forwarded to the custom divider.

When the divider receives the data, it uses an iterative algorithm to produce six quotient bits per cycle. The quotient bits produced are then sent to the packing chips, which may have to increment the quotient, depending on the value of subsequent quotient bits. The divider instructs the quotient accumulation logic whether or not incrementing is necessary. The partial quotient, once decided, is held in a bank of latches until all the quotient bits are received. When the entire quotient is available, the result is rounded, normalized, and packed by using the same logic path as multiplication. A multiplexer switches this packing logic between the multiplication and division logic.

Performance Characteristics

As of this writing, testing of the vector performance of the VAX 9000 system has only just begun. However, some preliminary results are presented in Table 3. We expect that these results will improve as testing continues and more code is optimized to take advantage of the chaining and overlapping provided by the V-box.

Chaining and Overlapping

Table 3 VAX 9000 Model 210 Preliminary

Vector Processing on the VAX 9000 System

Because of the design of the vector register unit, the V-box can concurrently execute a vector add-class instruction, a vector multiply instruction, and a vector memory instruction. Unlike on the VAX 6000 Model 400 system, vector register conflicts between these instructions have little effect on overlapping. With the VAX 9000 system, a conflict delays the execution of the subsequent vector instruction by only one or two cycles at most.

However, the overlapping behavior of the V-box is sensitive to the issue order of vector instructions.

If two vector instructions executed by the same V-box unit are issued one after the other, the second instruction is delayed until the V-box unit has finished executing the first. In addition, vector instructions issued after a vector memory instruction or divide instruction do not begin execution until the previous instruction completes. A general rule in scheduling code for the VAX 9000 V-box is to generate, whenever possible, instruction triples in which the first two instructions are a vector add-class instruction and a vector multiply instruction and the last instruction is a vector memory or vector divide instruction. Failing that, at least one vector add-class or vector multiply instruction should be issued before a vector memory or vector divide instruction.
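The ordering rules above can be explored with a toy timing model. The sketch below is not the V-box's actual issue logic: the 64-cycle instruction length, the one-cycle issue gap, and the names `schedule`, `Slot`, and `overlaps` are all assumptions made for illustration. It also treats a divide as waiting for every earlier instruction to finish, which is one reading of the statement that no other instruction can execute while a divide is in process.

```python
from dataclasses import dataclass

VECTOR_CYCLES = 64   # hypothetical: one cycle per element of a 64-element vector
ISSUE_GAP = 1        # hypothetical: cycles between successive instruction issues


@dataclass
class Slot:
    name: str
    unit: str    # "add", "mul", "mem", or "div"
    start: int
    end: int


def schedule(program):
    """Assign start and end cycles to a list of (name, unit) pairs."""
    unit_free = {}               # cycle at which each unit becomes free
    barrier = 0                  # completion of the latest memory or divide
    last_end = 0                 # completion of the latest-finishing instruction
    prev_start = -ISSUE_GAP
    slots = []
    for name, unit in program:
        # Rule 1: the same unit cannot start a second instruction early.
        # Rule 2: nothing issued after a memory or divide instruction
        #         starts until that instruction completes.
        start = max(prev_start + ISSUE_GAP, unit_free.get(unit, 0), barrier)
        if unit == "div":
            start = max(start, last_end)   # assumed: divide runs alone
        end = start + VECTOR_CYCLES
        unit_free[unit] = end
        if unit in ("mem", "div"):
            barrier = max(barrier, end)
        last_end = max(last_end, end)
        prev_start = start
        slots.append(Slot(name, unit, start, end))
    return slots


def overlaps(a, b):
    """True if the two instructions execute during at least one common cycle."""
    return a.start < b.end and b.start < a.end


good = schedule([("VVADDD", "add"), ("VVMULD", "mul"), ("VSTQ", "mem")])
bad = schedule([("VSTQ", "mem"), ("VVADDD", "add"), ("VVMULD", "mul")])
print(max(s.end for s in good), max(s.end for s in bad))
```

In this model the recommended triple (add-class, multiply, memory) finishes in 66 cycles because all three instructions overlap, whereas issuing the memory instruction first stretches the same work to 129 cycles. The absolute numbers mean nothing outside the model; only the contrast between the two orderings reflects the rule.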

The following code examples demonstrate the usage of the VAX vector instruction set and the overlapping behavior of the VAX 9000 V-box. (Note: It should be assumed in the examples that all arrays are 8-byte double precision.)

In the following DAXPY inner loop example, the first two VLDQ instructions do not overlap. However, the VSMULD, VVADDD, and VSTQ instructions do overlap.

The first two VLDQ instructions do not overlap in the following MERGE example,

Do i = 1, 64

vectorizes as:

However, the VVSUBD instruction does overlap with the VSTQ instruction. Both the VSLSSD (VSCMP) and VVMERGE instructions are executed by the vector add unit. Therefore, these two instructions do not overlap. However, the VVMERGE instruction does overlap with the VSTQ instruction.

In an IF-THEN-ELSE example, such as the

Nothing overlaps the first VLDQ instruction, but the VSLSSD instruction does overlap the second VLDQ instruction. Nothing can overlap with the VVDIVD instruction. Thus, the VSTQ instruction does not begin execution until the VVDIVD instruction completes. The remaining VSTQ instruction waits for the first VSTQ instruction to complete.

In the following scatter-gather example, none of the instructions is overlapped. VSEQLD and the IOTA instructions do not overlap.

This lack of overlap occurs because the IOTA instruction is actually done with microcode on the E-box, and the IOTA instruction cannot begin execution until the VSEQLD instruction has computed all the new vector mask register bits. The vector register access instructions (MFVCR and MTVLR) take only a few cycles and do not significantly affect the overlapping of other vector instructions.

Summary

By taking advantage of key features of the VAX vector architecture, such as instruction overlapping, imprecise exceptions, and asynchronous interaction with the scalar processor, the vector processor of the VAX 9000 system provides supercomputing performance for computationally intensive applications. Through the use of barber poling, the vector processor can overlap two vector arithmetic instructions with one memory instruction to deliver a peak double-precision performance of 125 MFLOPS.

Acknowledgments

The authors wish to acknowledge the technical contributions of the following individuals to the VAX vector architecture and the VAX 9000 V-box


James B. McElroy Frank J. Swiatowiec

HDSC and Multichip Unit
