
Combinational arithmetic systems for the approximation of functions*

From: Spring Joint Computer Conference, 1970 (pages 103-117)

by CHIN TUNG

IBM Research Laboratory San Jose, California

and

ALGIRDAS AVIZIENIS

University of California Los Angeles, California

INTRODUCTION

The concepts of arithmetic building blocks (ABB) and combinational arithmetic (CA) nets as well as their applications have been previously reported in References 3, 4, and 5. The unique ABB, resulting from the efforts of minimizing the set of building blocks in Reference 3, is designed at the arithmetic level, employing the redundant signed-digit number system,2 and is to be implemented as one package by LSI techniques. The ABB performs arithmetic operations on individual digits of radix $r > 2$ and its main transfer functions are: the sum (symbol +) and product (symbol *) of two digits, the multiple sum of $m$ digits ($m \le r + 1$) (symbol ¢), and the reconversion to a non-redundant form (symbol RS).

A single ABB may serve as the arithmetic processor of a serially organized computer. Many ABB's can be interconnected to form parallel arrays called combinational arithmetic (CA) nets which compute sums, products, quotients, or evaluate more complex functions: trigonometric, exponential, logarithmic, gamma, etc.

Because of the use of signed-digit numbers, the parallel addition and multiplication speed is independent of the length of operands. A design procedure has been developed for CA nets:5 a given algorithm is initially represented by a directed graph (algorithm graph, or A-graph), which is then converted to an interconnected diagram of ABB's (hardware graph, or H-graph). The delay through one ABB is defined to be one time unit, $\Delta t$.

* The work was sponsored by AEC-AT(U-1) Gen. 10, Project 14.


A simple example, the evaluation of polynomials, is used here to illustrate the concept of CA nets. The method suggested by Estrin,7 computing

$$P_n(x) = a_0 + a_1 x + x^2 (a_2 + a_3 x) + x^4 \bigl(a_4 + a_5 x + x^2 (a_6 + a_7 x)\bigr) + \cdots$$

permits the fastest evaluation when CA nets are used. This is shown in Figure 1 with $n = 3$; the extension to higher values of $n$ is evident. In general, the delay through such a net is $\lfloor \log_2 n \rfloor + 1$ multiplication-addition times.
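The grouping used in Estrin's method can be sketched in software. The following Python fragment is only an illustration, not the CA net itself: the function names `estrin` and `horner` are hypothetical, and the returned `levels` count models the number of multiplication-addition rounds under the assumption that all pairs in one round are formed in parallel.

```python
from math import floor, log2


def estrin(coeffs, x):
    """Evaluate sum(coeffs[k] * x**k) by Estrin's pairing scheme.

    Returns (value, levels), where levels counts the multiply-add
    rounds that a fully parallel evaluation would need.
    """
    terms = list(coeffs)
    power = x          # x, then x**2, x**4, ... (squared once per round)
    levels = 0
    while len(terms) > 1:
        if len(terms) % 2:          # pad so that the terms pair up evenly
            terms.append(0.0)
        # Every pair (lo, hi) -> lo + power * hi could be formed in parallel.
        terms = [terms[k] + power * terms[k + 1]
                 for k in range(0, len(terms), 2)]
        power *= power
        levels += 1
    return terms[0], levels


def horner(coeffs, x):
    """Serial reference evaluation (Horner's rule)."""
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * x + c
    return acc
```

For a polynomial of degree 3 the sketch reports two rounds, and for degree 7 it reports three, in agreement with $\lfloor \log_2 n \rfloor + 1$.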

This paper summarizes our study of applying a particular version of CA nets, i.e., pipelined CA nets, to approximating functions. Involved are not only the topological layout of pipelined CA nets for approximating functions but also the computational complexity.

Throughout this paper, we will use minimally redundant radix 16 signed-digit number representation whose allowed digit values are $\{-9, -8, \ldots, -1, 0, 1, \ldots, 9\}$.
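To make the digit set concrete, the following Python sketch converts an ordinary integer into one radix-16 signed-digit form and back. It is an illustration only: the function names are hypothetical, and the rule used to pick digits (folding residues above 8 into negative digits) is an assumption chosen for simplicity, not the ABB's conversion procedure. It happens to produce digits in $[-7, 8]$, a subset of the allowed set.

```python
def to_signed_digits(value, radix=16):
    """Return one signed-digit form of an integer, least significant digit first.

    The digits produced here lie in [-7, 8] for radix 16, which is inside the
    allowed set {-9, ..., 9}; other valid forms of the same value exist.
    """
    digits = []
    while value != 0:
        d = value % radix                # residue in 0 .. radix-1
        if d > radix // 2:               # fold large residues into negative digits
            d -= radix
        digits.append(d)
        value = (value - d) // radix
    return digits or [0]


def from_signed_digits(digits, radix=16):
    """Recover the integer value of a signed-digit string (least significant first)."""
    return sum(d * radix**k for k, d in enumerate(digits))
```

The redundancy shows up directly: the value 9 can be written as the single digit 9 or as the two digits (1, -7), since $16 - 7 = 9$.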

APPROXIMATION OF FUNCTIONS

The basic capability of a typical digital computer is limited to simple algebraic manipulations. As a result of this inherent limitation, approximation is inevitably involved in the practical computational procedure if the numerical approach is to apply to the evaluation of functions at all. The discrepancy between the approximated and the approximating values is required to be adjusted to a certain tolerable degree as individual cases demand. There are two general approaches in the theory of approximation: polynomial approximation and rational approximation.8

[Figure 1: Evaluation of 3rd degree polynomial with Estrin's method. Legend: multiplication, addition, and storage vertices. $T = \lfloor \log_2 3 \rfloor + 1 = 2$.]

The representation of functions by polynomials is an old art. The Taylor series has been one of the cornerstones of analytical research. If a series has no other purpose than numerical evaluation of the function, the degree of convergence has to be investigated. The Taylor expansion may converge in the entire plane or within a given circle only, and it may diverge even at every point. With the development of the theory of orthogonal expansions, the realization came that occasionally power expansions whose coefficients are not determined according to the scheme of Taylor expansion can operate more effectively than the Taylor series itself. Such expansions are not based on the process of successive differentiation but on integration. A large class of functions which are not sufficiently analytic to allow a Taylor expansion can be represented by such orthogonal expansions. These expansions belong to a given definite real realm of the independent variable $x$, and the aim is to approximate a function in such a way that the error shall be of the same order of magnitude all over the range. Rapidly convergent power expansions are of practical importance. Mere convergence of an expansion, valuable as it is from the purely analytical standpoint, is of little practical use if the number of terms demanded for a reasonable accuracy is very large.9

In light of the above considerations, Chebyshev polynomials, defined by

$$T_n(x) = \cos(n \cos^{-1} x) \quad \text{for } -1 \le x \le +1, \tag{1}$$

emerge as a promising candidate for approximating functions.
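As a quick numerical check of definition (1), the Python sketch below compares the trigonometric form with the standard three-term recurrence $T_{k+1}(x) = 2x\,T_k(x) - T_{k-1}(x)$. The recurrence is a well-known identity that is not stated in the text above, and the function names are hypothetical.

```python
import math


def chebyshev_trig(n, x):
    """T_n(x) from the defining relation cos(n * arccos(x)), valid on [-1, 1]."""
    return math.cos(n * math.acos(x))


def chebyshev_recurrence(n, x):
    """T_n(x) from T_0 = 1, T_1 = x, T_{k+1} = 2 x T_k - T_{k-1}."""
    t_prev, t_curr = 1.0, x
    if n == 0:
        return t_prev
    for _ in range(n - 1):
        t_prev, t_curr = t_curr, 2.0 * x * t_curr - t_prev
    return t_curr
```

Both forms agree on $[-1, 1]$ up to rounding; for instance, each gives $T_4(0.5) = -0.5$.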

With the speed of division rapidly increased in conventional computers, the superiority of rational approximations seems to be generally recognized.9 Roughly speaking, one may say that the "curve-fitting ability" of the rational function

$$R(x) = \frac{a_0 + a_1 x + a_2 x^2 + \cdots + a_m x^m}{b_0 + b_1 x + b_2 x^2 + \cdots + b_n x^n}$$

is approximately equal to that of a polynomial of degree $n + m$.

In competing with the polynomial of degree $n + m$, $R(x)$ has an unsuspected advantage in that the computation of $R(x)$ for a given $x$ does not require $n + m$ additions, $n + m - 1$ multiplications, and one division as might be surmised at first. By transforming $R(x)$ into a continued fraction

$$R(x) = P_1(x) + \cfrac{c_2}{P_2(x) + \cfrac{c_3}{P_3(x) + \cdots}}$$

we achieve a significant reduction in the number of multiplications and divisions for evaluating any $R(x)$, to $n$ or $m$, whichever is larger. The continued fraction form of a rational function not only lends itself to faster execution but also sometimes avoids a disadvantage suffered by rational functions: the coefficients depend on the degrees of the numerator and denominator.
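A small worked case may help. The Python sketch below assumes, purely for illustration, that each $P_k(x)$ is of the form $x + b_k$; the text does not specify the form of the $P_k$, and the function names are hypothetical. The example rational function $(x^2 + 3x + 3)/(x + 2)$ equals $(x + 1) + 1/(x + 2)$, so the continued fraction needs one division and no multiplication, whereas the direct form needs multiplications in the numerator as well as a division.

```python
def rational_direct(num, den, x):
    """Evaluate P(x)/Q(x) with Horner's rule (coefficients ordered low to high)."""
    def horner(coeffs):
        acc = 0.0
        for c in reversed(coeffs):
            acc = acc * x + c
        return acc
    return horner(num) / horner(den)


def continued_fraction(offsets, numerators, x):
    """Evaluate P1(x) + c2 / (P2(x) + c3 / (P3(x) + ...)), assuming Pk(x) = x + offsets[k-1].

    numerators holds c2, c3, ...; evaluation runs from the innermost term outward.
    """
    value = x + offsets[-1]
    for b, c in zip(reversed(offsets[:-1]), reversed(numerators)):
        value = (x + b) + c / value
    return value
```

With `offsets = [1, 2]` and `numerators = [1]`, the two functions return the same value for any $x \ne -2$.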

A practical application of CA nets to approximating a given function should involve three basic criteria: speed, accuracy, and cost. The overall speed on a machine is governed by two factors: the speed of signals physically going through circuit components and the speed of the computational algorithm in terms of logical steps. We are primarily concerned with the latter. The inherent unique property of a totally combinational arithmetic net clearly allows as many parallel computations to be done simultaneously as are mathematically permissible. Therefore, the delay from the presence of given data to the interpretation of the result could be minimized in a CA net. As the evaluation of a given function through the numerical technique inevitably introduces approximations, the problem of accuracy is twofold. First, how accurate is the approximating formula? Second, how can the error thus incurred be estimated, adjusted, and controlled? Cost is given a restricted meaning here. It is a measure of the number of building blocks needed in the implementation.

[...] most strongly convergent of a wide class of expansions in orthogonal polynomials.9 Therefore, the truncation error of the Chebyshev approximation is ascertainable at a glance. Further, the partial sum of the Chebyshev series [...] assertion can be found in Reference 6.

Even though explicit polynomials can be evaluated on maximally parallel CA nets, as shown in the previous section, they have some drawbacks. First, the power form given by

$$f(x) = c_{0,n} + c_{1,n} x + \cdots + c_{n,n} x^n \tag{6}$$

has coefficients which are functions of $n$, so that a change in the order of approximation requires a new set of coefficients. The second drawback stems from the ill-determination of the coefficients $c_{i,n}$ when $n$ is large, which frequently occurs when a function is represented to high accuracy over a long range.


Recent developments have demonstrated that rational approximations can give higher accuracy than Chebyshev approximations of the same computational complexity.10 However, approximations of the form

$$R(x) = P(x)/Q(x) \approx f(x) \tag{7}$$

share one of the disadvantages of explicit polynomials: the coefficients of $P(x)$ and $Q(x)$ depend on the degrees of $P(x)$ and $Q(x)$. Continued fractions derived from the form (7) can overcome this drawback and have shown promise in numerical computation on conventional computers. Nevertheless, continued fractions still have some shortcomings, at least as far as the application of CA nets is concerned.

The most serious problem in this respect is that the evaluation of a continued fraction involves a series of divisions; division is rather complicated in a CA net.11,12 The other shortcoming, of less importance, is that integration and differentiation cannot be done on a continued fraction as easily as on a Chebyshev series.

The fact that continued fractions involve many divisions has forced us to choose polynomial approximations, which have no division at all, rather than rational approximations in the design of the combinational arithmetic system for the approximation of functions. This choice is more or less unique to our system and may not be justified in many other cases. Further improvement of this system may alter this basic decision.

PIPELINED COMBINATIONAL ARITHMETIC NETS FOR EVALUATING CHEBYSHEV SERIES

In between the two extremes, totally serial and totally concurrent (e.g., Figure 1), a pipelined CA (PCA) net serves as a compromise. A PCA net, in general, consists of both sequential and combinational circuits. Different compositions of these two kinds of circuits give the resultant PCA net a wide spectrum of performance versus cost. A designer thus has more freedom to choose a particular composition to meet his requirements. The study we made shows that the PCA net is particularly attractive for evaluating Chebyshev series.

The general concept of pipelining techniques has been successfully applied to modern information processing systems in order to obtain much improved performance at the cost of a very moderate increase in hardware.1


[Figure 2: An abstract model of pipelining. The input enters at point A, passes through the loop-free ckt into the accumulating loop ckt, and the output is taken at point B.]

An abstract notion of pipelining

The concept of pipelining technique relevant to the evaluation of Chebyshev series and polynomials can be abstracted with the following simplified model. Consider

$$f = \sum_{i=0}^{n} f_i \tag{8}$$

Assume not only that the computations of the $f_i$'s are independent of one another but also that the computation pattern of each $f_i$ is essentially the same, or can be made the same by introducing dummy operations if necessary. The block identified by "loop-free ckt" in Figure 2 assumes the responsibility of computing the $f_i$'s; the raw data of the $f_i$'s are fed into it one after the other. The computed results of the $f_i$'s are then accumulated in the "accumulating ckt."

Let $t_{0,i}$ be the instant at which the raw data for computing $f_i$ are fed into the pipelined circuit (at point A), $t_{1,i}$ the instant at which $f_i$ is computed and accumulated as the partial result (at point B), $\Delta T$ the amount of time needed to compute $f_i$, i.e., $\Delta T = t_{1,i} - t_{0,i}$, and $T$ the total amount of time required to evaluate $f$. Assuming $t_{0,0} = 0$, we clearly have

$$t_{1,i} = t_{0,i} + \Delta T$$

$$T = t_{1,n} = t_{0,n} + \Delta T \tag{9}$$

In order to decrease $T$ one must decrease $t_{0,n}$ or $\Delta T$, or both. To decrease $\Delta T$ is to shorten the longest information flow path; to decrease $t_{0,n}$ is to minimize the time spacings between consecutive $t_0$'s. The minimum possible value of $t_{0,n}$ is $n \cdot \Delta t$.

It is interesting to investigate the effect on $T$ of variations of $\Delta T$ and of the time spacing between consecutive $t_0$'s. Suppose $\Delta T$ is increased by $\Delta t$, due to the addition of more circuitry in the longest information flow path, i.e., $\Delta T' = \Delta T + \Delta t$; then the corresponding total computation time $T'$ becomes

$$T' = t_{0,n} + \Delta T' = t_{0,n} + \Delta T + \Delta t = T + \Delta t \tag{10}$$

In most cases, $T$ is much greater than $\Delta t$, hence $T'$ is approximately the same as $T$.

On the other hand, suppose the time spacing between consecutive $t_0$'s is increased by $\Delta t$, i.e.,

$$\begin{aligned}
t'_{0,0} &= t_{0,0} \\
t'_{0,1} &= t_{0,1} + \Delta t \\
&\;\;\vdots \\
t'_{0,i} &= t_{0,i} + i \cdot \Delta t \\
&\;\;\vdots \\
t'_{0,n} &= t_{0,n} + n \cdot \Delta t
\end{aligned} \tag{11}$$

then

$$T' = t'_{0,n} + \Delta T = t_{0,n} + n \cdot \Delta t + \Delta T = T + n \cdot \Delta t \tag{12}$$

A comparison of Equations (10) and (12) clearly shows that the effect of a variation of the time spacing between consecutive $t_0$'s is $n$ times greater than that of a variation of $\Delta T$ with the same magnitude. Therefore, it is more desirable to decrease the time intervals between consecutive $t_0$'s than to shorten the longest information flow path.
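The comparison of Equations (10) and (12) can be reproduced with a few lines of arithmetic. The Python sketch below models the abstract pipeline only; the function names and the sample values (15 terms after the first, unit input spacing, a latency of 5 time units) are assumptions chosen for illustration.

```python
def total_time(n, spacing, latency):
    """T = t_{0,n} + Delta T, with t_{0,0} = 0 and inputs fed every `spacing` time units."""
    t0_n = n * spacing
    return t0_n + latency


def compare(n, dt=1, spacing=1, latency=5):
    """Return the base time and the increase caused by each kind of variation."""
    base = total_time(n, spacing, latency)
    longer_path = total_time(n, spacing, latency + dt)     # Equation (10): Delta T grows by dt
    wider_spacing = total_time(n, spacing + dt, latency)   # Equation (12): spacing grows by dt
    return base, longer_path - base, wider_spacing - base
```

With these sample values, `compare(15)` returns a base time of 20, an increase of 1 for the longer path, and an increase of 15 for the wider spacing, i.e., $n$ times as much.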

Ideally, one would like the input data for computing the $f_i$'s to be fed into the pipelined circuit one after the other with the least possible delay. In this case, with $t_{0,0} = 0$, we have

$$t_{0,i} = i \cdot \Delta t, \qquad t_{0,n} = n \cdot \Delta t,$$

and

$$T = n \cdot \Delta t + \Delta T \tag{13}$$

In Equation (13), with $\Delta t$ fixed, the interactions of $T$, $n$, and $\Delta T$ can be briefly summarized as follows. As the circuit complexity increases, most likely $\Delta T$ will be lengthened and the computational power of the circuit will be enhanced. If Equation (8) can be reorganized as

$$f = \sum_{i=0}^{n'} f_i' \tag{14}$$

with the complexity of $f_i'$ greater than that of $f_i$, then we expect $n' < n$. The effect on $T$ of the increase of $\Delta T$ and the decrease of $n$ cannot be specified without detailed information. It remains to be investigated.

Layout of PCA nets

With the knowledge of the above section, we can now begin the layout of PCA nets for evaluating Chebyshev series. The functional block diagram of a PCA-W net is shown in Figure 3. It consists of three subnets, namely, the CA-W subnet, the self-multiplication CA (SMCA) subnet, and the pipelined sequential CA (PSCA) subnet. A CA net with the capacity of computing polynomials $P_n(x)$, $n \le w$, asynchronously without the necessity of segmenting $P_n(x)$ is said to have a width $w$ and is denoted by CA-W net. The meaning of SMCA and PSCA will be clear in the later text.

[Figure 3: Functional block diagram of the PCA-W net, showing the CA-W subnet, the SMCA subnet, and the PSCA subnet.]

[Figure 4: CA net for evaluating 4th degree polynomial.]

The Chebyshev series used to approximate a given function $f(x)$ assumes the following forms:

$$f(x) = {\sum_{i=0}^{n}}' a_i T_i(x), \qquad -1 \le x \le 1 \tag{15}$$

$$T_i(x) = \begin{cases}
\displaystyle\sum_{j=0}^{i/2} c_{2j} x^{2j} & \text{for } i = 0, 2, 4, \ldots, \\[2ex]
\displaystyle\sum_{j=0}^{(i-1)/2} c_{2j+1} x^{2j+1} & \text{for } i = 1, 3, 5, \ldots
\end{cases} \tag{16}$$

SMCA assumes the responsibility of computing the powers of $x$. Using Estrin's method, CA-W specializes in evaluating $T_i(x)$ when $i \le w$. If $i > w$, then $T_i(x)$ must be broken into several segments. Each segment can be evaluated on the CA-W net with Estrin's method. The input data for computing these segments are fed into the CA-W one immediately after the other.

The outputs from both CA-W and SMCA arrive in the PSCA to form $T_i(x)$. At PSCA, the coefficient $a_i$ is multiplied with $T_i(x)$, and $a_i T_i(x)$ must then be accumulated.

One of the inherent properties of Chebyshev polynomials, as can be seen from Equation (16), is that $T_i(x)$ contains only even terms when $i$ is even and only odd terms when $i$ is odd. Due to this inhomogeneity, the CA-W subnet can be simplified by removing half of the storage vertices as well as all the $\pi$-$\Sigma$ pairs of vertices for evaluating the sub-expressions $c_j + c_{j+1} x$. For instance, a full CA net for evaluating a normal 4th degree polynomial

$$P_4(x) = c_0 + c_1 x + c_2 x^2 + c_3 x^3 + c_4 x^4 \tag{17}$$

is shown in Figure 4. Since

$$T_4(x) = c_0 + c_1 x + c_2 x^2 + c_3 x^3 + c_4 x^4 \tag{18}$$

with $c_1 = c_3 = 0$, all the dotted vertices and arcs in Figure 4 can be removed, but with $S_{0,1}$ and $S_{0,3}$ connected to $\Sigma_{2,1}$ and $\pi_{2,1}$ respectively. Roughly speaking, the cost of a CA net for evaluating a Chebyshev polynomial with degree $i$ is about half of that of a normal polynomial with the same degree. The speed, however, remains the same since it is the formation of higher degrees of the independent variable $x$ which determines the speed in a CA net. Odd Chebyshev polynomials can be treated as if they were even ones after $x$ is factored out. Double subscripts are appended to the coefficients $c$, with the first one indicating the association of $c$ with $T_i(x)$ and the second one serving as a running index within $T_i(x)$. From now on, even $i$ will be used in the following discussion, but the arguments apply to both even $i$ and odd $i$ cases unless otherwise stated.

[Figure 5: PCA-7 net for evaluating Chebyshev series $f(x) = \sum_{i \ge 0} a_i T_i(x)$ (A-level). Legend: merging vertex.]

Figure 5 shows a PCA-7 net at the A-level. The CA-7 subnet can handle Chebyshev polynomials with degrees up to seven. If the degree is higher than seven, then the Chebyshev polynomial is broken into segments. An example showing how this is done is given below. Assume

$$f(x) = {\sum_{i=0}^{15}}' a_i T_i(x)$$

$$T_i(x) = \begin{cases}
\displaystyle\sum_{j=0}^{i/2} c_{i,2j} x^{2j} & \text{for } i = 0, 2, \ldots \\[2ex]
\displaystyle\sum_{j=0}^{(i-1)/2} c_{i,2j+1} x^{2j+1} & \text{for } i = 1, 3, \ldots
\end{cases} \tag{19}$$

then the contents of $S_{0,1}$, $S_{0,2}$, $S_{0,3}$, and $S_{0,4}$ are given in Figure 6 as time increases from 0. The exact timing as to when these $c$'s should be fed into the PCA-W net will be seen in the appendix.

Figure 6: Contents of the initial storage layer of PCA-7 net (rows listed in order of increasing time t)

S0,1      S0,2      S0,3      S0,4
c0,0      0         0         0
c1,1      0         0         0
c2,0      c2,2      0         0
c3,1      c3,3      0         0
c4,0      c4,2      c4,4      0
c5,1      c5,3      c5,5      0
c6,0      c6,2      c6,4      c6,6
c7,1      c7,3      c7,5      c7,7
c8,0      c8,2      c8,4      c8,6
c8,8      0         0         0
c9,1      c9,3      c9,5      c9,7
c9,9      0         0         0
c10,0     c10,2     c10,4     c10,6
c10,8     c10,10    0         0
c11,1     c11,3     c11,5     c11,7
c11,9     c11,11    0         0
c12,0     c12,2     c12,4     c12,6
c12,8     c12,10    c12,12    0
c13,1     c13,3     c13,5     c13,7
c13,9     c13,11    c13,13    0
c14,0     c14,2     c14,4     c14,6
c14,8     c14,10    c14,12    c14,14
c15,1     c15,3     c15,5     c15,7
c15,9     c15,11    c15,13    c15,15

The output of the CA-7 subnet is multiplied at $\pi_{4,1}$ by an appropriate factor coming from SMCA. For example, the output of CA-7, $T_i(x)$ for $i \le 7$ or the first segment of $T_i(x)$ for $i > 7$, is multiplied at $\pi_{4,1}$ by the unity coming from $S_{3,1}$ through $M_{4,1}$. The second, third, and fourth segments, if they exist, of $T_i(x)$ for $i > 7$ are multiplied by $x^8$, $x^{16}$, and $x^{24}$ respectively.

The partial result of each $T_i(x)$ is accumulated at $\Sigma_{4,1}$ until $T_i(x)$ is completely evaluated and stored in $\Sigma_{4,1}$. $T_i(x)$ is then multiplied at $\pi_{5,1}$ by $a_i$. All $a_i T_i(x)$ are accumulated at $\Sigma_{5,1}$ sequentially with $i$ increasing from 0 to the prescribed $n$.
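The segment-and-accumulate flow just described can be imitated in software. The Python sketch below is a functional model only, not a description of the hardware timing: the function names are hypothetical, the power-form coefficients are generated with the standard Chebyshev recurrence (an assumption, since the text takes the coefficients as given), and the series sum is formed without the halving of the first term that a primed sum would imply.

```python
import math


def chebyshev_power_coeffs(i):
    """Power-form coefficients c[k] of T_i(x) = sum(c[k] * x**k), via the recurrence."""
    prev, curr = [1], [0, 1]                       # T_0 = 1, T_1 = x
    if i == 0:
        return prev
    for _ in range(i - 1):
        nxt = [0] + [2 * c for c in curr]          # 2 x T_k
        for k, c in enumerate(prev):               # minus T_{k-1}
            nxt[k] -= c
        prev, curr = curr, nxt
    return curr


def eval_segmented(coeffs, x, width=7):
    """Evaluate a polynomial in segments of degree <= width, as a CA-7 subnet would.

    Each segment is evaluated on its own and then scaled by 1, x**8, x**16, ...
    before being added to the running partial result.
    """
    step = width + 1
    total = 0.0
    for s in range(0, len(coeffs), step):
        segment = coeffs[s:s + step]
        partial = sum(c * x**k for k, c in enumerate(segment))   # segment output
        total += partial * x**s                                  # factor from the power generator
    return total


def chebyshev_series(a, x, width=7):
    """Accumulate sum(a[i] * T_i(x)) term by term (first term not halved here)."""
    return sum(a_i * eval_segmented(chebyshev_power_coeffs(i), x, width)
               for i, a_i in enumerate(a))
```

For $i = 15$ the model uses two segments, the second scaled by $x^8$, which matches the two rows per polynomial listed for $T_8$ through $T_{15}$ in Figure 6.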

It is thus seen that once the PCA net is filled up with significant data, almost every piece of the hardware is in constant use until no more inputs are fed. In this manner, the overall utilization factor is very high, and the net is hence more economical and practical than the totally combinational arithmetic net. One important and unique advantage of employing Chebyshev approximation is that when changing from approximating one given function to another, nothing in the CA net needs to be changed except that a new set of coefficients $a_i$ should be prepared.

The H-level graph of Figure 5 is shown in Figure 7 with the assumption that the precision is no greater than 17 digits. Detailed timing analysis based on this is shown in the appendix.

Computational complexity

Speed, cost, and error studied here refer only to the H-graph of a CA net. Speed is measured by the delay through the longest information flow path of the net, which depends not only on the topology of the net but also on the precision of the data words. Cost, defined as the count of ABB's, is a function of the degrees of the polynomials and the series as well as the precision of the data word. Error in a net comes only from round-off operations if all input data are assumed free from inherent errors. We will use the precision and the degrees of the polynomial and series as independent variables in the following study. Further, we assume $r = 16$, minimal redundancy, and precision $p \le 17$. The results only show the upper bounds of cost, speed, and error.

[Figure 7: PCA-7 net for evaluating Chebyshev series (H-level).]

Delay analysis

With the detailed timing analysis, we are able to construct tables, e.g., Tables (1) and (2), showing the time required to evaluate a given Chebyshev series on PCA nets of different widths. For each width we consider two cases: one for evaluating full Chebyshev series; the other for evaluating even Chebyshev series in which odd terms are missing. In the tables, $t_0$ is the time at which the data for a given segment are in the first storage layer of the PCA net and $t_f$ is the time at which the segment is evaluated and stored in the last vertex. In each table it is implied that the independent variable $x$ is fed into the PCA net at $t = 0$.

Among the functions whose Chebyshev approximation tables are available in Reference 6, the following functions are approximated by full Chebyshev series: Exponential, Logarithmic, Gamma, Exponential Integral, some Bessel functions, and the following
