
Figure 9 Vector Multiply Unit (diagram not reproduced; surviving labels: FINAL PRODUCT, (TO DIVU), VML_RESULT[63:0] TO VREG)

similar to the process used for double-precision multiplication. However, in single-precision multiplication, only one multiplier chip is needed to produce the result, and the pack chips do not need to sum the partial products. Integer multiplication differs slightly from floating-point multiplication because the result does not need to be accumulated or rounded; the correct product is produced by one multiplier. The result bypasses the accumulation and rounding logic and proceeds directly into the packing logic to be sent to the vector register unit.

The exponent handling for both multiplication and division is performed by the same logic on the packing chips. Depending on the instruction being executed, the exponents are either added (multiplication) or subtracted (division). The result of this operation is then piped to the next stage, and the position of the hidden bit is determined. If the fractional portion of the data must be shifted to place the hidden bit in the correct position, the exponent is incremented or decremented accordingly. The normalize count (i.e., shift count) is used to select the correct final exponent. Overflow and underflow exceptions can be detected and reported only after the final exponent is selected. If an exception is detected, a reserved operand is written to the appropriate vector register element.

The first stage of the exponent logic also checks for divide by zero and reserved operand exceptions.
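The exponent path described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the V-box logic: the function name `final_exponent`, its parameters, and the excess-128 exponent range of 1 to 255 (as used by F_floating) are chosen for the example only.

```python
# Illustrative sketch only: the names and the excess-128 (F_floating-style)
# exponent range are assumptions, not taken from the V-box design.
BIAS = 128          # assumed exponent bias
MAX_BIASED = 255    # assumed largest valid biased exponent
MIN_BIASED = 1      # a biased exponent of 0 does not encode a normal value


def final_exponent(op, exp_a, exp_b, normalize_shift):
    """Return (biased_exponent, exception) for a multiply or divide.

    op              -- "mul" adds the exponents, "div" subtracts them
    exp_a, exp_b    -- biased operand exponents
    normalize_shift -- signed exponent correction from normalization:
                       negative if the fraction was shifted left,
                       positive if it was shifted right
    """
    if op == "mul":
        exp = exp_a + exp_b - BIAS      # remove the doubled bias
    elif op == "div":
        exp = exp_a - exp_b + BIAS      # restore the cancelled bias
    else:
        raise ValueError("op must be 'mul' or 'div'")

    exp += normalize_shift              # select the final exponent

    # Range checks are possible only now, after the final selection.
    if exp > MAX_BIASED:
        return None, "overflow"
    if exp < MIN_BIASED:
        return None, "underflow"
    return exp, None
```

In this sketch an exception result stands in for the reserved operand that the hardware writes to the vector register element.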

Division

Vector division is a variable-cycle function. The number of cycles depends on the format of the operands. The custom divider is capable of producing six quotient bits per cycle. Therefore, F_floating point division is performed in 7 cycles, G_floating point in 12 cycles, and D_floating point in 13 cycles. Because a divide instruction takes a variable number of cycles, no other instruction can execute in the V-box while a divide is in progress. Also, because of the iterative nature of division (i.e., one division must be completed before another can be started), the instruction cannot be pipelined.
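One way to read the cycle counts quoted above is as the number of six-bit iterations needed to cover the fraction, plus a fixed overhead. The sketch below uses the fraction widths of the VAX formats including the hidden bit (24, 53, and 56 bits). The three-cycle overhead is an assumption chosen because it reproduces the stated totals; the text does not break the counts down this way.

```python
import math

QUOTIENT_BITS_PER_CYCLE = 6   # stated in the text
ASSUMED_OVERHEAD_CYCLES = 3   # assumption: fitted to the stated totals

# Fraction widths including the hidden bit for each VAX format.
FRACTION_BITS = {"F_floating": 24, "G_floating": 53, "D_floating": 56}


def divide_cycles(fmt):
    """Estimate divide latency in cycles for the given operand format."""
    iterations = math.ceil(FRACTION_BITS[fmt] / QUOTIENT_BITS_PER_CYCLE)
    return iterations + ASSUMED_OVERHEAD_CYCLES


for fmt in FRACTION_BITS:
    print(fmt, divide_cycles(fmt))
```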

Vol. 2 No. 4 Fall 1990 Digital Technical Journal

As a vector divide instruction executes, two 64-bit elements are received from the vector register unit each cycle and are latched in the divide unpack chip. The elements are unpacked, and the fractional portion of the elements is sent to the custom divider in 32-bit slices. The exponent portion is sent to the shared exponent logic on the packing chips, as described in the Multiplication section. During this cycle, time-critical values, such as complemented element values and first-cycle quotient bits, are calculated and forwarded to the custom divider.

When the divider receives the data, it uses an iterative algorithm to produce six quotient bits per cycle. The quotient bits produced are then sent to the packing chips, which may have to increment the quotient, depending on the value of subsequent quotient bits. The divider tells the quotient accumulation logic whether incrementing is necessary. The partial quotient, once decided, is held in a bank of latches until all the quotient bits are received. When the entire quotient is available, the result is rounded, normalized, and packed by using the same logic path as multiplication. A multiplexer switches this packing logic between the multiplication and division logic.
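The idea of producing six quotient bits per step and accumulating them into a partial quotient can be illustrated with ordinary integer arithmetic. The sketch below is a plain radix-64 long division, not the custom divider's algorithm: it omits the quotient-increment correction described above, and the function name and interface are hypothetical.

```python
def divide_fraction(dividend, divisor, cycles):
    """Divide two positive integers, producing 6 quotient bits per step.

    Requires dividend < divisor so that every step yields a digit in 0..63.
    Returns the quotient scaled by 64**cycles and the final remainder.
    """
    if not 0 < dividend < divisor:
        raise ValueError("expected 0 < dividend < divisor")
    quotient, remainder = 0, dividend
    for _ in range(cycles):
        remainder *= 64                     # move to the next six bit positions
        digit = remainder // divisor        # next six quotient bits
        remainder -= digit * divisor
        quotient = (quotient << 6) | digit  # accumulate the partial quotient
    return quotient, remainder
```

Each pass through the loop depends on the remainder left by the previous pass, which is the dependency that keeps a divide from being pipelined.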

Performance Characteristics

As of this writing, testing of the vector performance of the VAX 9000 system has only just begun. However, some preliminary results are presented in Table 3. We expect that these results will improve as testing continues and more code is optimized to take advantage of the chaining and overlapping provided by the V-box.

Chaining and Overlapping

[Table 3: VAX 9000 Model 210 Preliminary ... (table not recovered)]

Vector Processing on the VAX 9000 System

Because of the design of the vector register unit, the V-box can concurrently execute a vector add-class instruction, a vector multiply instruction, and a vector memory instruction. Unlike on the VAX 6000 Model 400 system, vector register conflicts between these instructions have little effect on overlapping. With the VAX 9000 system, a conflict delays the execution of the subsequent vector instruction by only one or two cycles at most.

However, the overlapping behavior of the V-box is sensitive to the issue order of vector instructions. If two vector instructions executed by the same V-box unit are issued one after the other, the second instruction is delayed until the unit has finished executing the first. In addition, vector instructions issued after a vector memory instruction or divide instruction do not begin execution until the previous instruction completes. A general rule in scheduling code for the VAX 9000 V-box is to generate, whenever possible, instruction triples in which the first two instructions are a vector add-class instruction and a vector multiply instruction and the last instruction is a vector memory or vector divide instruction. Failing that, at least one vector add-class or vector multiply instruction should be issued before a vector memory or vector divide instruction.
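The issue-order rules above can be captured in a toy timing model. The sketch below is an illustration under several assumptions, not a description of the hardware: every instruction is given the same duration (64 cycles, one per element), data dependencies and the one- or two-cycle conflict delays are ignored, divide is modeled as its own unit class, and the function names are hypothetical. The mnemonic-to-unit table uses only instructions named in this article.

```python
# Toy model of the issue rules stated in the text. Durations and the
# treatment of divide as a separate unit class are simplifying assumptions.
UNIT = {
    "VLDQ": "mem", "VSTQ": "mem",
    "VVADDD": "add", "VVSUBD": "add", "VSLSSD": "add", "VVMERGE": "add",
    "VSMULD": "mul",
    "VVDIVD": "div",
}


def schedule(instrs, duration=64):
    """Return a list of (mnemonic, start, end) in issue order."""
    unit_free = {}   # cycle at which each unit becomes idle
    timeline = []
    prev = None      # (unit, start, end) of the previously issued instruction
    for name in instrs:
        unit = UNIT[name]
        start = unit_free.get(unit, 0)         # same unit: wait until it is idle
        if prev is not None:
            p_unit, p_start, p_end = prev
            start = max(start, p_start)        # instructions issue in order
            if p_unit in ("mem", "div"):
                start = max(start, p_end)      # wait for a memory or divide op
        if unit == "div":
            # Nothing else runs during a divide: wait for all units to drain.
            start = max([start] + list(unit_free.values()))
        end = start + duration
        unit_free[unit] = end
        timeline.append((name, start, end))
        prev = (unit, start, end)
    return timeline


def finish_time(timeline):
    """Cycle at which the last instruction in the timeline completes."""
    return max(end for _, _, end in timeline)


# An add/multiply/memory triple overlaps fully; memory-first does not.
print(finish_time(schedule(["VVADDD", "VSMULD", "VLDQ"])))
print(finish_time(schedule(["VLDQ", "VVADDD", "VSMULD"])))
```

Under this model, a sequence such as VLDQ, VLDQ, VSMULD, VVADDD, VSTQ (an assumed ordering of the DAXPY instructions named below) serializes the two loads and then runs the remaining three instructions concurrently.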

The following code examples demonstrate the usage of the VAX vector instruction set and the overlapping behavior of the VAX 9000 V-box. (Note: It should be assumed in the examples that all arrays are 8-byte double precision.)

In the following DAXPY inner loop example, the first two VLDQ instructions do not overlap. However, the VSMULD, VVADDD, and VSTQ instructions

The first two VLDQ instructions do not overlap in the following MERGE example,

DO i = 1, 64

vectorizes as:

However, the VVSUBD instruction does overlap with the VSTQ instruction. Both the VSLSSD (VSCMP) and VVMERGE instructions are executed by the vector add unit. Therefore, these two instructions do not overlap. However, the VVMERGE instruction does overlap with the VSTQ instruction.

In an IF-THEN-ELSE example, such as the

Nothing overlaps the first VLDQ instruction, but the VSLSSD instruction does overlap the second VLDQ instruction. Nothing can overlap with the VVDIVD instruction. Thus, the VSTQ instruction does not begin execution until the VVDIVD instruction completes. The remaining VSTQ instruction waits for the first VSTQ instruction to complete.

In the following scatter-gather example, none of the instructions is overlapped. The VSEQLD and IOTA instructions do not overlap.

This lack of overlap occurs because the IOTA instruction is actually done with microcode on the E-box, and the IOTA instruction cannot begin execution until the VSEQLD instruction has computed all the new vector mask register bits. The vector register access instructions (MFVCR and MTVLR) take only a few cycles and do not significantly affect the overlapping of other vector instructions.
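The dependency of IOTA on the complete vector mask is easier to see from what the instruction computes. The following is a minimal sketch of IOTA as it is generally described for the VAX vector architecture: it builds a compressed vector of offsets, one for each element whose mask bit matches, and reports how many were produced. The function name and argument layout are hypothetical, and the sketch ignores register widths.

```python
def iota_offsets(mask_bits, stride, match=True):
    """Return (offsets, count) for elements whose mask bit equals `match`.

    mask_bits -- sequence of booleans, one per vector element
    stride    -- offset increment per element (8 for 8-byte arrays)
    """
    offsets = []
    offset = 0
    for bit in mask_bits:
        if bit == match:
            offsets.append(offset)   # keep the offset of a selected element
        offset += stride             # advance for every element, kept or not
    return offsets, len(offsets)


# Elements 0, 2, and 3 are selected out of four 8-byte elements.
print(iota_offsets([True, False, True, True], 8))
```

Every mask bit contributes to the result, so no offset can be finalized until the compare has produced the whole mask. The returned count corresponds to the value a program would read with MFVCR and load as the new vector length with MTVLR.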

Summary

By taking advantage of key features of the VAX vector architecture, such as instruction overlapping, imprecise exceptions, and asynchronous interaction with the scalar processor, the vector processor of the VAX 9000 system provides supercomputing performance for computationally intensive applications. Through the use of barber poling, the vector processor can overlap two vector arithmetic instructions with one memory instruction to deliver a peak double-precision performance of 125 MFLOPS.
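The peak figure is consistent with two floating-point results per cycle, one from each of the two overlapped arithmetic instructions. The arithmetic below assumes a 16 ns cycle time, which is not stated in this section; both constants are labeled as assumptions in the code.

```python
ASSUMED_CYCLE_TIME_NS = 16   # assumption: not stated in this section
ASSUMED_FLOPS_PER_CYCLE = 2  # one add-class and one multiply result per cycle


def peak_mflops(cycle_ns=ASSUMED_CYCLE_TIME_NS,
                flops_per_cycle=ASSUMED_FLOPS_PER_CYCLE):
    """Peak rate in millions of floating-point operations per second."""
    cycles_per_microsecond = 1000.0 / cycle_ns
    return flops_per_cycle * cycles_per_microsecond


print(peak_mflops())
```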

Acknowledgments

The authors wish to acknowledge the technical contributions of the following individuals to the VAX vector architecture and the VAX 9000 V-box

References

1. Russell, "The CRAY-1 Computer System," ACM Proceedings, vol. 21, no. 1 (January 1978): 63-72.

2. VAX Vector Processing Handbook (Maynard: Digital Equipment Corporation, Order No. EC-H0419-46/89, 1989).

3. R. Brunner, VAX Architecture Reference Manual (Bedford: Digital Press, Order No. EY-F576E-DP, 1990).

4. D. Fenwick et al., "A VLSI Implementation of the VAX Vector Architecture," Proceedings of COMPCON '90 (IEEE, Spring 1990).


5. CRAY-2 Computer System Functional Description (Cray Research, Inc., 1985).

6. W. Buchholz, "The IBM System/370 Vector Architecture," IBM Systems Journal, vol. 25, no. 1 (1986): 51-62.

7. D. Marshall and J. McElroy, "VAX 9000 Packaging - The Multichip Unit," Proceedings of COMPCON '90 (IEEE, Spring 1990).

8. M. Adiletta et al., "Semiconductor Technology in a High-performance VAX System," Digital Technical Journal, vol. 2, no. 4 (Fall 1990, this issue): 43-60.


James B. McElroy Frank J. Swiatowiec

HDSC and Multichip Unit