Figure 3. Vamp memory organization.
One alternative is to repeatedly issue micro operations for a single shift, stopping when all AU's signal completion. Another reasonable alternative is to use a larger radix. Quaternary, octal or hexadecimal systems are obvious candidates.
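To make the larger-radix idea concrete, here is a minimal sketch of a quaternary combinational shifter; the function name and the 36-bit width are illustrative assumptions, not details from the paper. The shift count is decomposed into radix-4 digits, and each stage shifts by 0, 1, 2 or 3 times its weight, so a 36-bit shifter needs three stages instead of six binary ones.

def quaternary_shift_left(word, count, width=36):
    """Shift word left by count, one radix-4 digit per stage."""
    weight = 1
    while weight < width:
        digit = (count // weight) % 4          # next quaternary digit
        word = (word << (digit * weight)) & ((1 << width) - 1)
        weight *= 4
    return word

assert quaternary_shift_left(0b1011, 7) == (0b1011 << 7) & ((1 << 36) - 1)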

Assuming the engineering problems are solved, one is still left with the situation where a scalar (one pair of operands) operation requires as much time as a vector operation (n pairs of operands).

Thus when a problem is heavily dominated by scalar operations there is little gain in speed.

Rather than attempt to work with multiple AU's, a register array and a much smaller number of very high-speed Execution Units may be used. The basic procedure is to load the register array and stream operands and results to and from high-speed Execution Units. Figure 4 shows the resultant organization of the Mill.


Figure 4. Vamp mill (n = number of functional AU's; k = number of bits per word).

The algorithms we use to obtain high-speed arithmetic operation are almost completely combinatorial circuits. Floating add (and its variants) uses the alignment and normalization combinatorial shifter described above.

The multiplier is based on the well-known carry save multiplier as recently extended by Wallace3 to process all partial products simultaneously. Wallace's proposal is to interpret the contents of the multiplier register as radix 4 digits and recode these digits into the digits -2, -1, 0, 1, 2. The partial products, now easy to generate, are grouped by threes and gated into a tree of carry save adders. The outputs of each level of carry save adders are again grouped by threes and gated into the next level. Each level of carry save adders reduces the number of summands by about a factor of 1.5. The two outputs of the tree are added to produce the double length product.
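The following sketch models the recoding and the reduction tree just described; the helper names are ours, and Python integers stand in for the hardware's bit vectors. In hardware each carry-save level is a row of adders working at once, but the arithmetic is the same.

def booth_radix4_digits(y, k):
    """Recode a k-bit 2's complement multiplier into radix-4 digits
    drawn from {-2, -1, 0, 1, 2}."""
    bit = lambda i: (y >> i) & 1 if i >= 0 else 0
    return [-2 * bit(2*i + 1) + bit(2*i) + bit(2*i - 1) for i in range(k // 2)]

def carry_save_add(a, b, c):
    """Compress three summands into a sum word and a carry word."""
    return a ^ b ^ c, ((a & b) | (a & c) | (b & c)) << 1

def wallace_multiply(x, y, k):
    """x * y via simultaneously generated partial products and a
    carry-save tree; the final level is one carry-propagate add."""
    summands = [d * x << (2 * i)                 # d * x * 4**i
                for i, d in enumerate(booth_radix4_digits(y, k))]
    while len(summands) > 2:                     # group by threes per level
        nxt = []
        for i in range(0, len(summands), 3):
            group = summands[i:i + 3]
            nxt.extend(carry_save_add(*group) if len(group) == 3 else group)
        summands = nxt
    return sum(summands)                         # double length product

assert wallace_multiply(57, -23, 8) == 57 * -23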

The divide algorithm is an iterative method based on that used in the Harvard Mark IV and discussed by Richards.4 The technique essentially involves generating the divisor inverse by a series of truncated multiplications.
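The text does not spell out the exact Mark IV recurrence, so the sketch below uses the common Goldschmidt form of reciprocal-by-multiplication as a stand-in: scale the divisor toward 1 and apply the same correction factors to the dividend. In hardware the multiplications would be truncated; full precision is used here for clarity.

import math

def reciprocal_divide(n, d, iterations=5):
    """Approximate n / d (d > 0) using only multiplications."""
    e = math.frexp(d)[1]            # d = f * 2**e with 0.5 <= f < 1
    n, d = n / 2.0**e, d / 2.0**e   # normalize so the iteration converges
    for _ in range(iterations):
        f = 2.0 - d                 # correction factor
        n, d = n * f, d * f         # error in d is squared each step
    return n

assert abs(reciprocal_divide(355.0, 113.0) - 355.0 / 113.0) < 1e-9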

During a vector floating point multiply we could have four sets of operands in motion at once. One set is being accessed from the register array. One set is being multiplied. A product is being normalized. A normalized product is being stored in the register array.
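As an illustration (stage names and timing are assumptions, not details from the paper), the overlap can be modeled as a four-stage pipeline: at any cycle four consecutive operand sets occupy the four stages.

STAGES = ["fetch from register array", "multiply", "normalize", "store"]

def in_flight(cycle, n_sets):
    """Which operand set occupies each stage at the given cycle."""
    return {stage: cycle - depth for depth, stage in enumerate(STAGES)
            if 0 <= cycle - depth < n_sets}

print(in_flight(3, 16))
# {'fetch from register array': 3, 'multiply': 2, 'normalize': 1, 'store': 0}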

In an effort to keep all operation times in the stream to about the same duration, the multiply tree can be split such that some fraction of the multiplier bits is processed simultaneously in each section. While this may slightly increase the time to complete a single multiply, the cost will be slightly decreased and overall speed will be significantly increased.

As with multiply, a snapshot of Vector Floating Add would reveal many sets of operands in motion at once. Here we have the following possibilities:

Fetch operands from registers, determine the exponent difference, alignment shift, addition or subtraction, normalization shift, and store result in register.
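A worked sketch of that stream on a toy format follows; the integer-fraction representation and the field widths are our assumptions for illustration, not the machine's actual format.

def float_add(fa, ea, fb, eb, frac_bits=8):
    """Add a = fa * 2**ea and b = fb * 2**eb; fractions are signed
    integers normalized to frac_bits bits."""
    if ea < eb:                                   # exponent difference
        fa, ea, fb, eb = fb, eb, fa, ea
    fb >>= min(ea - eb, frac_bits + 1)            # alignment shift
    s, e = fa + fb, ea                            # addition or subtraction
    while abs(s) >= (1 << frac_bits):             # normalization shift
        s >>= 1; e += 1
    while s and abs(s) < (1 << (frac_bits - 1)):
        s <<= 1; e -= 1
    return s, e

# 1.0 + 0.5 with 8 fraction bits: (128, -7) and (128, -8) give (192, -7)
assert float_add(128, -7, 128, -8) == (192, -7)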

To a good first approximation, for floating point fraction lengths of about 32 bits these techniques give approximately 16 times the speed and cost 16 times as much as the classical "parallel" algorithms that use a parallel adder but do serial shifting and process multiplier and quotient digits serially.

Hence, for a given speed and cost the choice of technique depends on the fraction length and the particular circuit characteristics.

VAMP COMPUTER

This section describes VAMP with word length, number of arithmetic units, number of memory units, and instruction set fixed. A simulator to investigate the organizational concepts described above was written to run on the 7094 for the particular organization described in this section. Since our study was to investigate concepts, not circuit and memory speeds, we will not supply such things as multiply times and, hence, performance improvement factors over convenient targets.

The simulator was not complete in the sense that interrupts were not programmed and I/O must be done outside the simulator program. The simulator will accept a program written in VAMP symbolic assembly language, assemble, and execute the program.

The simulated VAMP assumes 16 functional arithmetic units and 16 memory boxes of 16,384 words each.

VAMP has 15 index registers. The index unit provides arithmetic and Boolean (AND, OR, equivalence) operations. The number representation in the index unit is 2's complement. The word length is normally 18 bits, with multiply producing a double length product and divide producing an 18-bit quotient and 18-bit remainder.

Memory addresses are calculated from information specified in the F, I1, and Address fields. Address modification is extended to include base address indirect addressing, specified by a one in bit 13 of the instruction. Vector instructions, which access n operands simultaneously, use the effective base address as the address of the first operand. The address of the second operand is determined by adding the contents of the index register specified by field I2 to the effective base address. Letting a0 represent the effective base address and i2 the contents of the index register addressed by I2, the address vector a is (a0, a0 + i2, a0 + 2i2, ..., a0 + 15i2). Vector indirect addressing does not proceed beyond 1 level; i.e., the address vector fetched from memory is used as the operand address vector without further modification. (When modification of the address vector is required it can be provided by the scalar register-memory instructions.)

The vector transfer instructions combine stepping an index, testing the index, and conditional branching. They are made more powerful by having them set to the "do not execute" state the screen bits of arithmetic units which will not participate on the last iteration when there are fewer than 16 items to process.

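Since the addressing rules above are reconstructed from a garbled passage, the following sketch should be read as our interpretation, not the paper's specification.

def address_vector(a0, i2, n=16, mod=2**18):
    """Direct mode: a = (a0, a0 + i2, a0 + 2*i2, ..., a0 + (n-1)*i2)."""
    return [(a0 + j * i2) % mod for j in range(n)]

def operand_addresses(memory, a0, i2, indirect=False):
    """Vector indirect goes exactly one level: the fetched address
    vector is used without further modification."""
    addrs = address_vector(a0, i2)
    return [memory[a] for a in addrs] if indirect else addrs

# contiguous vector: base 1000, increment 1
assert address_vector(1000, 1, n=4) == [1000, 1001, 1002, 1003]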
Looking at the instruction VTILE in detail: the contents of index I2 are added to the contents of index I3, and the branch decision is made by comparing this sum Mod 2^18 with the contents of index (I3 + 1) Mod 16. The addition is then repeated; if the results of the two comparisons are different, screen bit 2 is set to zero (the do not execute state). The contents of index I2 are now added to the above sum. If this sum Mod 2^18 bears the same relation to index (I3 + 1) Mod 16 as did the addition on which the branch decision was made, screen bit 3 is not modified. If the relation changes the screen bit is set to zero. This process repeats until 15 additions have been done.

Note that screen bit 1 is never modified, screen bit 2 may be modified at the end of the first addition, and screen bit 15 may be modified at the end of the 15th addition.

The VTIH instruction differs from VTILE only in that the branch decision is reversed. If, after the first addition, the contents of index (I3 + 1) Mod 16 are greater than index I3 we branch, i.e., put the effective address in the instruction counter; if the contents of index (I3 + 1) Mod 16 are less than or equal to index I3 the instruction counter is advanced by 1. All other operations are the same in VTILE and VTIH. The instruction VTCR uses an implicit -1 as the increment and the contents of register I2 as the initial value. The comparison is against an implicit zero. Other than this the instruction is identical to VTIH.

The 15 additions required by VTILE, VTIH and VTCR are performed by the same unit and in the same way as addresses for a vector instruction are generated.
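Under the same caveat as above (the VTILE semantics are reconstructed from a damaged passage, and all names here are assumptions), a sketch of how the 15 additions could set the screen bits:

def vtile(index_i3, limit, step, n_aus=16, mod=2**18):
    """Return (branch_taken, screen); screen[j] = 1 means AU j executes."""
    s = (index_i3 + step) % mod
    branch = s <= limit                 # VTILE branches on low or equal
    screen = [1] * n_aus                # screen bit 1 is never modified
    for j in range(1, n_aus):           # the 15 further additions
        s = (s + step) % mod
        if (s <= limit) != branch:      # relation changed: AU j drops out
            screen[j] = 0
    return branch, screen

# stepping from 0 by 1 toward a limit of 5: AUs past the limit drop out
assert vtile(0, 5, 1) == (True, [1]*5 + [0]*11)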

The instruction set for VAMP has been designed for the processing of vectors in memory, including rows and columns of matrices. These will normally have considerably more components than the number of AU's. Many operations such as compress, search for largest, and sum and product reduction (sum or product of all components) must operate over the entire vector even though only 16 are handled at any one time. The instruction set is designed around this concept.

Figure 5. The simulated Vamp CPU.

The simulated CPU is shown in Fig. 5. The w, s, u, X and Z arrays perform the functions described in the second section of this paper. The address unit A contains three 18-bit registers and two 18-bit adders. Like Z, A is not seen by the programmer. It is used by the control for index and address arithmetic.

The right-most 18 bits of the program status word, PSW, contain the instruction counter. Bits 16 and 17 of the PSW contain the condition code. The results of all index operations, as well as scalar operations in the Mill, are used to set the condition code to indicate whether the result of the last operation was zero, less than zero, greater than zero, or overflowed. An instruction to test the condition code and branch accordingly is provided.

Bits 0-15 of the PSW and PSWM were not defined in the simulation. They are reserved for interrupt indicators, interrupt masks, and the interrupt branch address (the location where a new PSW and PSWM are to be picked up from and where the current ones are to be placed).

The registers I-BUF and IRB in Fig. 5 are used by a relatively simple anticipatory control. Instructions are executed in three levels. At level 1 the instruction is fetched from memory and placed in register I-BUF (instruction buffer). In level 2 the instruction op code is scanned to determine if it is a vector arithmetic instruction or one of the vector transfer instructions: VTILE, VTIH or VTCR. If it is one of the vector instructions the necessary preparatory work is begun, but no register seen by the programmer is modified until the previous instruction has been completed. Thus if an interrupt occurs at the end of the current instruction some unnecessary work has been done but no procedure for recovery of previous register contents need be included.

PROGRAMMING EXAMPLE

The following small FORTRAN problem is coded in the VAMP symbolic assembly language:

The following instructions are used (definitions given above). VFAD: algebraically add the floating point numbers specified by the address vector to the floating point numbers stored in the most significant half of the X registers (subject to s). VFMP: multiply the floating point numbers in the memory locations specified by the address vector by the floating point numbers stored in the most significant half of the X registers (subject to s); the double length product appears in X.

By exploiting the features inherent in interleaved memories, very high-speed arithmetic units, and multiple register CPU's, and by adding a number of special instructions, one obtains a machine that has the functional capabilities of SOLOMON but which fits within the framework of a more conventional computer organization. Further, the ideas presented here should result in a machine which is applicable to a much wider range of problems than SOLOMON.

There certainly exists a large class of problems for which neither VAMP nor SOLOMON would show any appreciable advantage over a more conventional organization. Compiling is probably the best example of these.

REFERENCES

1. K. E. Iverson, A Programming Language, Wiley, 1962.

2. J. Gregory and R. McReynolds, "The Solomon Computer," PGEC, vol. EC-12, no. 5, pp. 774-781 (Dec. 1963).

3. C. S. Wallace, "A Suggestion for a Fast Multiplier," PGEC, vol. EC-13, no. 1, pp. 14-17 (Feb. 1964).

4. R. K. Richards, Arithmetic Operations in Digital Computers, Van Nostrand, 1955, pp. 279-282.

LOCATION  OP     ADDRESS, I1, F, I2, I3
          LDA    1, , , 1         SET IR 1 = 1
          LDA    1, , , 2         SET IR 2 = 1
          LDA    58, , , 3        SET IR 3 = 58
          GPS    16               SET SCREEN TO ...
BEGIN     VLDX   A2, 2,,1         LOAD A2(I), A2(I+1), ...
          VUFA   FXZR             PLACE BINARY POINT ...
          VAND   MASK             REMOVE EXPONENT
*                                 A2 CONVERTED TO TRUNCATED INTEGER
          VUFA   LOCB             ADD LOC OF B(1) TO ...
          VLXI                    LOAD B(A2(I)), B(A2(I+1)), ...
          VFAD   A2+1, 2,,1       ADD A2(I+1), A2(I+2), ...
          VFAD   A2-1, 2,,1       ADD A2(I-1), A2(I), ...
          VSTX   TEMP, , ,1       STORE TEMP. RESULT
          VLDX   A2, 2,,1         LOAD A2(I), A2(I+1), ..., A2(I+15)
          VFMP   C                MPY BY C
          VFAD   TEMP, , ,1       ADD TEMP. VECTOR
          VSTX   A1, 2,,1         STORE A1(I), ..., A1(I+15)
          VTILE  BEGIN, , ,1,2    STEP INDEX CNTR 16
*         END TEST PROGRAM
*         DATA STORAGE
C         BSS    1
A1        BSS    60
A2        BSS    60
B         BSS    40
TEMP      BSS    16
LOCB      VFMP   B
FXZR      OCT    233000000000
MASK      OCT    400777777777
          END

Figure 6. A VAMP assembly language program.

G. A. Garrett

Lockheed Missiles and Space Company Sunnyvale, California

At this session it seems to me that you might be interested in several of the more-or-less technical facets of the direction of a large aerospace computer installation. Consequently I will avoid competing with our environment by discussing the ubiquitous problems of recruiting, of personnel motivation, of obtaining cooperation among the members of the various computing groups, or even the basic problems inherent in convincing our computer folks that the whole computer center does not exist for them at all, but rather as a service for the other parts of our company.

Instead, I want to tell you today about a few of the figures we have on the actual costs of "Change"; then go into a few aspects of the "Turn-Around-Problem" from the management point of view; and finish with a few remarks on what a computer center such as ours may reasonably expect in the future.

While there are many fields in which constant change is the order of the day, the operation of virtually any modern computer center is faced with adjustment to changing computers, changing computer languages, and changing software systems with a frequency which is quite notable. In a recent attempt to analyze the economic effects of such changes, several interesting relationships have been noted.


In the past, the speed with which a newly installed computer has been "loaded" has often been of interest, but few figures have appeared which treat as a dynamic quantity the relationship between the program checkout load and the production load. Such relationships must be known, however, before one can evaluate the effects of change, since the purely static "before and after" pictures tend to conceal many of the significant points.

In analyzing the dynamics of the situation, some simple relationships have been postulated, and their predictions compared with those historical data which were obtainable at the LMSC computation center.

First, it was assumed that the loading on a computer could be divided between check-out and production, and that the ratio of these two would vary with time from installation. Since the proportion of computer programs which run for production without previous check-out is vanishingly small, it would seem reasonable that the load on a newly installed computer initially must consist solely of program development work. From such a starting point it follows that the ratio of production to development will show a continuous increase until it either levels off with time, or the pattern becomes confused by changes in language, operating system, or type of work processed by the center.

Both the ratio of production to development at steady state and the rate at which this ratio approaches steady state must be determined in order to understand and to evaluate the economics involved.

Historical data obtained subsequent to the installation of two types of computers, the IBM 709's which replaced Univac 1103's and the IBM 1410's which were installed to handle administrative systems, show the patterns given in Figs. 1 and 2 respectively.

It can be seen that the data in both cases have a basic similarity, and that the experience in general seems to follow the same type of curve. The smoothed curves themselves were obtained by assuming that the development load would be reduced half-way to the steady state load during each four month period.
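Stated as a formula, the halving rule is an exponential approach to steady state; the initial and steady-state shares below are illustrative assumptions, not LMSC's exact figures.

def development_share(t_months, initial=100.0, steady=30.0):
    """Development load t months after installation, in percent,
    halving its distance to steady state every four months."""
    return steady + (initial - steady) * 0.5 ** (t_months / 4.0)

print(development_share(12))   # 30 + 70 * 0.125 = 38.75 percent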

Figure 1. Load components versus time from installation of 709.

Figure 2. Load components versus time from installation of 1410.

Data on the introduction of new computer languages have been somewhat more difficult to obtain, and tend to be less definitive. Hardware changes are necessarily abrupt. Software changes need not be. However, records have been found which show the ratio of production to development following a fairly general shift from Fortran I to Fortran II on the IBM 7090's which started in June of 1963. Since the scatter of these data is considerably greater than it was for the introduction of a computer itself, and since figures are not available for rework as a separate item, smooth curves were not derived from the data. Instead, the curves from Fig. 1, the introduction of the 709's, were plotted directly in Fig. 3. It would seem that the data at least do not disagree with this pattern.

During the ten-year period from mid-1955 to July 1965 LMSC has employed six distinctly different computers to handle important parts of its work (IBM 650, 709/7090/7094, 1410; Univac 1103, 1107/1108; and RCA 301), and six additional computers either as peripheral units or to serve special purposes (IBM 1401, RPC 4000, CDC 160A, 924, 3200 and GE 415).

During the same ten-year span, in addition to the machine languages of these twelve computers, more than twelve distinct versions of compiler languages or general operating systems have been put into operation.

Each innovation brought improvement. It also brought a new set of problems. With each improvement some portion of the total number of computer users at LMSC has been required to familiarize itself with either a new computer, a new computer language, or a new system: some 24 times in these 10 years.

Since some of these changes affected more than two-thirds of the work load, while others touched only a few percent, it would appear reasonable to assume that on the average each change affected one-sixth of the work load. From this figure it follows, again on an average basis, that there must have been the equivalent of a complete change some four times in ten years (24 changes, each touching about one-sixth of the load, is equivalent to four complete changes). That is not to say that large disruptions can be noted at that frequency, but only that they are actually present in the average figures to that extent. The fact that truly disruptive changes do not show in the records results from several factors. First, in an operation as large as the one being studied, there are usually several large computers of the same type, which can be phased into or out of the operation over a long enough period of time to give a nearly smooth operation in the aggregate.

Second, the introduction of a new language can also be phased in rather smoothly by involving only a few programs at a time. As you all probably realize, phasing a language out is an even more gradual process than introducing it. It may not be completely impossible, but it is almost impossibly difficult. Nevertheless, the changes must have the same cumulative effect as would a complete change every two and one-half years. The following analysis is based upon that frequency.

Figure 3. Load components versus time from introduction of Fortran II.

Figures 1, 2, and 3, together with the mean change rate of 0.4 changes per year, provide the data required to calculate the average annual cost of change.

Subtracting the asymptotic values of the curves on these figures gives a value of (98 - 30) = 68 percent for the steady state, which must be contrasted