
4.5 Design and Architecture of an Arithmetic Accelerator

4.5.1 Design and Architecture

Tab. 4.8: Synthesis results of the VHDL implementation of the floating-to-fixed and fixed-to-floating modules on the Xilinx Virtex5 vlx110t-2ff1738 FPGA.

                     Utilization                         Max. Freq.
Algorithm            Slice Reg.   Slice LUT   LUT-FF     (MHz)
Floating-to-Fixed    63           561         32         222.547
Fixed-to-Floating    64           326         31         141.153


[Fig. 4.8 (block diagram). Blocks: Fetch & Decode, five computation units C1 to C5 (ADD, MUL, SoP, PoS, CORDIC), Write Back, internal input bus, internal output bus. Labelled signals: data-in, valid-in, ack-in, rdy-in; op1, op2, op3, ic; opr1, opr2, opr3, oc; I-in, I-out; Xout, Yout, Zout; rdyi, rdyo, birdy, bordy; data-out, valid-out, ack-out, rdy-out.]

Fig. 4.8: The architecture of the floating-point arithmetic accelerator based on the CORDIC unit.

input and output operands will be held in registers op1, op2, op3, R1, and R3, respectively.

Fifteen micro-instructions are provided for the proposed arithmetic accelerator, as shown in Tab. 4.9. Each micro-instruction takes either two or three input operands, i.e. op₁, op₂, and op₃, and its computational result is either one word or two words, depending on the instruction. Fig. 4.9 presents the short and long instruction formats, #F1 and #F2, as well as the short and long reply formats, #S1 and #S2, of the proposed floating-point arithmetic accelerator.

Format   Fields (bit range)
#F1      cmd [0-15]     n/a [16-31]   op₁ [32-63]   op₂ [64-95]
#F2      cmd [0-15]     n/a [16-31]   op₁ [32-63]   op₂ [64-95]   op₃ [96-127]
#S1      Info. [0-15]   n/a [16-31]   R₁ [32-63]
#S2      Info. [0-15]   n/a [16-31]   R₁ [32-63]    R₂ [64-95]

Fig. 4.9: Instruction formats #F1 and #F2 as well as reply formats #S1 and #S2 of the floating-point arithmetic accelerator.
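The instruction formats of Fig. 4.9 can be illustrated with a minimal packing sketch. It rests on several assumptions that the text does not state: operands are IEEE-754 single-precision values, each instruction is transferred as a sequence of 32-bit words, and cmd sits in the lower 16 bits of the first word with the n/a field left at zero. The names `encode_instruction` and `decode_instruction` are hypothetical.

```python
import struct

# Commands that use the short format #F1 (two operands): ADD, MUL, DIV.
SHORT_CMDS = {0x0001, 0x0002, 0x000D}

def float_to_word(x):
    """Reinterpret a float as a 32-bit single-precision word (assumed encoding)."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

def word_to_float(w):
    """Inverse of float_to_word."""
    return struct.unpack(">f", struct.pack(">I", w))[0]

def encode_instruction(cmd, operands):
    """Build the word list for format #F1 (3 words) or #F2 (4 words)."""
    expected = 2 if cmd in SHORT_CMDS else 3
    if len(operands) != expected:
        raise ValueError(f"cmd {cmd:#06x} expects {expected} operands")
    header = cmd & 0xFFFF  # assumed placement; the n/a field stays zero
    return [header] + [float_to_word(x) for x in operands]

def decode_instruction(words):
    """Split a word list back into (cmd, operands)."""
    return words[0] & 0xFFFF, [word_to_float(w) for w in words[1:]]
```

For example, `encode_instruction(0x0001, [1.5, -2.0])` yields three words, and `decode_instruction` recovers the command and both operands.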

4.5.1.2 A Fetch-and-Decode Unit

This unit receives and decodes an intermediate instruction from outside. It then creates a control word and information corresponding to the properties of the intermediate instruction in order to steer the related components during the computational process.

Two stages, i.e. a fetch stage and a decode stage, are executed. The fetch stage fetches the input instruction from the external bus system and issues it to the processor. The number of words to fetch is determined from Cmd, which is the first word of the instruction. If Cmd equals x’0001, x’0002, or x’000D, three words are fetched; otherwise four. Fig. 4.10 presents the timing diagram of the Fetch-and-Decode unit for the short instruction format #F1 and the long instruction format #F2. Four signals are used to fetch an intermediate instruction from the external bus system, i.e. valid-in signal, data-in signal, rdy-in

Tab. 4.9: The micro-instructions of the proposed floating-point arithmetic accelerator.

Cmd     Mnemonic       Operands               Operation                                 Description
x’0001  ADD            op1, op2, R1           R1 ← op1 + op2                            Addition function
x’0002  MUL            op1, op2, R1           R1 ← op1·op2                              Multiplication function
x’0003  PoS            op1, op2, op3, R1      R1 ← (op1 + op2)·op3                      Product-of-Sum function
x’0004  SoP            op1, op2, op3, R1      R1 ← (op1·op2) + op3                      Sum-of-Product function
x’0005  SIN-COS        op1, op2, op3, R1, R2  R1 ← cos(op3)                             Cosine function
                                              R2 ← sin(op3)                             Sine function
x’0006  SUM-SIN-COS    op1, op2, op3, R1, R2  R1 ← op1·cos(op3) − op2·sin(op3)          Difference of the cosine and sine products
                                              R2 ← op1·sin(op3) + op2·cos(op3)          Sum of the sine and cosine products
x’0007  POLAR-REC      op1, op2, op3, R1, R2  R1 ← √(op1² + op2²)                       Polar to rectangular function
                                              R2 ← op3 + tan⁻¹(op2/op1)
x’0008  SINH-COSH      op1, op2, op3, R1, R2  R1 ← cosh(op3)                            Hyperbolic cosine function
                                              R2 ← sinh(op3)                            Hyperbolic sine function
x’0009  SUM-SINH-COSH  op1, op2, op3, R1, R2  R1 ← op1·cosh(op3) + op2·sinh(op3)        Sum of the hyperbolic cosine and sine products
                                              R2 ← op1·sinh(op3) + op2·cosh(op3)        Sum of the hyperbolic sine and cosine products
x’000A  VEC-HYPERXY    op1, op2, op3, R1, R2  R1 ← √(op1² − op2²)                       Square root of the difference of two squared values
                                              R2 ← op3 + tanh⁻¹(op2/op1)                Hyperbolic arctangent with constant addition
x’000B  ROT-LINEAR     op1, op2, op3, R1, R2  R1 ← op1                                  Sum-of-Product function corresponding to the CORDIC boundary
                                              R2 ← op2 + (op1·op3)
x’000C  VEC-LINEAR     op1, op2, op3, R1, R2  R1 ← op1                                  Sum-of-Division function corresponding to the CORDIC boundary
                                              R2 ← op3 + op2/op1
x’000D  DIV            op1, op2, R1           R1 ← op2/op1                              Division function
x’000E  Ln             op1, op2, op3, R1      R1 ← Ln(op3) = 2·tanh⁻¹(op2/op1)          Natural logarithm function
                                              with op1 = op3 + 1, op2 = op3 − 1
x’000F  SQRT           op1, op2, op3, R1      R1 ← √op3 = √(op1² − op2²)                Square-root function
                                              with op1 = op3 + 0.25, op2 = op3 − 0.25
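The operations in Tab. 4.9 can be summarised as a small floating-point reference model, useful for example as a golden model when checking hardware results. This is only a sketch in double precision: it ignores the fixed-point CORDIC datapath and its accuracy, and the names `MICRO_OPS` and `execute` are hypothetical. For Ln and SQRT it evaluates the function of op3 directly rather than through the derived op1 and op2.

```python
import math

# cmd -> function(op1, op2, op3) returning (R1,) or (R1, R2), following Tab. 4.9.
MICRO_OPS = {
    0x0001: lambda a, b, c: (a + b,),                          # ADD
    0x0002: lambda a, b, c: (a * b,),                          # MUL
    0x0003: lambda a, b, c: ((a + b) * c,),                    # PoS
    0x0004: lambda a, b, c: (a * b + c,),                      # SoP
    0x0005: lambda a, b, c: (math.cos(c), math.sin(c)),        # SIN-COS
    0x0006: lambda a, b, c: (a * math.cos(c) - b * math.sin(c),
                             a * math.sin(c) + b * math.cos(c)),
    0x0007: lambda a, b, c: (math.sqrt(a * a + b * b),         # POLAR-REC
                             c + math.atan(b / a)),
    0x0008: lambda a, b, c: (math.cosh(c), math.sinh(c)),      # SINH-COSH
    0x0009: lambda a, b, c: (a * math.cosh(c) + b * math.sinh(c),
                             a * math.sinh(c) + b * math.cosh(c)),
    0x000A: lambda a, b, c: (math.sqrt(a * a - b * b),         # VEC-HYPERXY
                             c + math.atanh(b / a)),
    0x000B: lambda a, b, c: (a, b + a * c),                    # ROT-LINEAR
    0x000C: lambda a, b, c: (a, c + b / a),                    # VEC-LINEAR
    0x000D: lambda a, b, c: (b / a,),                          # DIV
    0x000E: lambda a, b, c: (math.log(c),),                    # Ln
    0x000F: lambda a, b, c: (math.sqrt(c),),                   # SQRT
}

def execute(cmd, op1, op2, op3=0.0):
    """Return the result tuple (R1,) or (R1, R2) of one micro-instruction."""
    return MICRO_OPS[cmd](op1, op2, op3)
```

Note that DIV computes op2/op1, so `execute(0x000D, 4.0, 2.0)` returns `(0.5,)`.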


[Fig. 4.10 (timing diagram). Signals: Clk, valid-in, data-in, ack-in, rdy-in, over clock cycles 1 to 12. data-in carries cmd, op1, op2 for Instruction 1 (#F1), cmd, op1, op2, op3 for Instruction 2 (#F2), and cmd, op1, op2 for Instruction 3 (#F1). The counter row reads 0, 1, 2, 0, 1, 2, 3, 0, 1, 2, 0 across the three fetch-and-decode cycles.]

Fig. 4.10: Timing diagram of the Fetch-and-Decode unit for the short instruction format #F1 and the long instruction format #F2.

[Fig. 4.11 (block diagram). Blocks: three Floating-to-Fixed converters, three Fixed-to-Floating converters, High Precision CORDIC, Control and Delayline, ConX, ConY. Labelled signals: I-in, op1, op2, op3; p1, p2, p3; x_i, y_i, z_i; x_o, y_o, z_o; cm, ena; hs, m, rmode, K, K^-1, func; I-out; X_out, Y_out, Z_out.]

Fig. 4.11: The architecture of the CORDIC unit.

signal, and ack-in signal. As soon as the rdy-in signal is active, the valid-in signal and the data-in signal are detected and fetched simultaneously. After the two signals have been presented to the Fetch-and-Decode unit, the ack-in value is incremented by one to inform the source of the instruction that the presented word has been received by the processor. When the instruction has been completely fetched, the value of the ack-in signal is reset.
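The fetch rule and the ack-in counter described above can be sketched as a small behavioural model. It assumes that rdy-in and valid-in are asserted for every word (no stalls) and that cmd occupies the lower 16 bits of the first word; both are assumptions, as are the names `words_to_fetch` and `fetch_stream`.

```python
# Commands fetched as three words (format #F1); all others take four (#F2).
SHORT_CMDS = {0x0001, 0x0002, 0x000D}

def words_to_fetch(cmd):
    """Number of words making up the instruction that starts with cmd."""
    return 3 if cmd in SHORT_CMDS else 4

def fetch_stream(words):
    """Split a flat word stream into instructions.

    Returns the list of instructions and the ack-in value observed after
    each accepted word: it counts up per word and is reset to 0 once the
    instruction is complete.
    """
    instructions, ack_trace = [], []
    current, remaining, ack = [], 0, 0
    for w in words:
        if not current:  # first word of an instruction carries cmd
            remaining = words_to_fetch(w & 0xFFFF)
        current.append(w)
        remaining -= 1
        if remaining == 0:
            instructions.append(current)
            current, ack = [], 0
        else:
            ack += 1
        ack_trace.append(ack)
    return instructions, ack_trace
```

Feeding a short, a long, and another short instruction produces the ack-in trace 1, 2, 0, 1, 2, 3, 0, 1, 2, 0, which mirrors the counter pattern of the three fetch-and-decode cycles in Fig. 4.10.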

4.5.1.3 A CORDIC Unit

The architecture of the CORDIC unit illustrated in Fig. 4.11 is designed to cover the functionalities exhibited in Tab. 4.9. The architecture consists of six components, i.e. Floating-to-Fixed, Fixed-to-Floating, ConX, ConY, Constant Multiplier, and a high-precision CORDIC component, together with a control and delay-line component. These components are explained as follows.

• Floating-to-fixed and fixed-to-floating components: These components transform input data from floating-point to fixed-point format, and vice versa. The algorithms employed for the two data converters are described in section 4.4.

• Control and delay-line component: This component receives the information from the I-in signal and generates the control signals ena and cm in order to prepare the pre-processing and post-processing steps for the natural logarithm and square root functions. The control signals are generated as depicted in Algorithm 21.

• ConX and ConY components: These components detect the cm signal generated by the control and delay-line component and use it to manipulate the inputs p1 and p2. Their functionality is expressed in Equations (4.5) and (4.6).

• Constant Multiplier component: Whenever the ena signal is enabled, the output z_o of the high-accuracy CORDIC module is multiplied by two; otherwise it is forwarded unchanged.

• High-accuracy CORDIC component: This component is designed in conformance with Algorithm 13. The component has six input parameters, i.e. m, hs, rmod, K, K⁻¹, and func. The values of these parameters depend on the current function, as illustrated in Tabs. 4.2 and 4.9. The parameter func is customized to conform to the parameter cmd as listed in Tab. 4.10.

Equation of ConX module

\[
p_3 =
\begin{cases}
p_2 + 1.0  & \text{if } CM = 1 \\
p_2 + 0.25 & \text{if } CM = 2 \\
p_1        & \text{otherwise}
\end{cases}
\tag{4.5}
\]

Equation of ConY module

\[
p_3 =
\begin{cases}
p_2 - 1.0  & \text{if } CM = 1 \\
p_2 - 0.25 & \text{if } CM = 2 \\
p_1        & \text{otherwise}
\end{cases}
\tag{4.6}
\]
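Equations (4.5) and (4.6) can be checked numerically against the Ln and SQRT rows of Tab. 4.9. In the sketch below, `math.atanh` and `math.sqrt` merely stand in for the hyperbolic vectoring results of the CORDIC core (the scaling by the CORDIC gain is ignored), and the helper names are hypothetical. The identities used are ln(w) = 2·tanh⁻¹((w − 1)/(w + 1)) and (w + 0.25)² − (w − 0.25)² = w.

```python
import math

def con_x(p1, p2, cm):
    """ConX module, Eq. (4.5)."""
    if cm == 1:
        return p2 + 1.0
    if cm == 2:
        return p2 + 0.25
    return p1

def con_y(p1, p2, cm):
    """ConY module, Eq. (4.6)."""
    if cm == 1:
        return p2 - 1.0
    if cm == 2:
        return p2 - 0.25
    return p1

def ln_via_preprocessing(w):
    """Natural logarithm of w > 0 using cm = 1 pre-processing.

    The final factor of two models the Constant Multiplier with ena = 1.
    """
    x = con_x(0.0, w, 1)
    y = con_y(0.0, w, 1)
    return 2.0 * math.atanh(y / x)

def sqrt_via_preprocessing(w):
    """Square root of w >= 0 using cm = 2 pre-processing."""
    x = con_x(0.0, w, 2)
    y = con_y(0.0, w, 2)
    return math.sqrt(x * x - y * y)
```

With cm = 0 both modules pass p1 through unchanged, which corresponds to all instructions other than Ln and SQRT.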

4.5.1.4 A WriteBack Unit

The WriteBack unit is responsible for managing the computational results that are generated by the arithmetic units. The computational results are presented on the internal output bus, which comprises five buses, i.e. op1, op2, op3, oc, and bo_rdy, as illustrated in Fig. 4.8.

The computational results from all the arithmetic units are selected by an impartial policy, such as a fair arbitration mechanism, and arranged according to reply format #S1 or #S2, depending on the instruction cmd. The issue of the computational results from


Tab. 4.10: Mapping between the instruction cmd in Tab. 4.9 and the functional number func in Tab. 4.2.

Instruction cmd    Functional number func
x’0006             1
x’0007             2
x’0008             3
x’0009             4
x’000A             5
x’000B             6
x’000C             7
x’000D             8
x’000E             6
x’000F             6

Algorithm 21 Pre-Post Processing

Require: Cmd
Ensure: ena, cm

if Cmd = x’000E then
    ena = b’1
    cm = b’01
else if Cmd = x’000F then
    ena = b’0
    cm = b’10
else
    ena = b’0
    cm = b’00
end if
return ena, cm
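Algorithm 21 translates directly into a small executable sketch. The function names are hypothetical; `constant_multiplier` additionally models the Constant Multiplier behaviour described in Section 4.5.1.3, where the CORDIC z output is doubled only when ena is set.

```python
CMD_LN = 0x000E    # natural logarithm
CMD_SQRT = 0x000F  # square root

def pre_post_processing(cmd):
    """Return (ena, cm) for an instruction command, following Algorithm 21."""
    if cmd == CMD_LN:
        return 1, 0b01
    if cmd == CMD_SQRT:
        return 0, 0b10
    return 0, 0b00

def constant_multiplier(z, ena):
    """Double the CORDIC z output when ena is set, else forward it unchanged."""
    return 2.0 * z if ena else z
```

Only Ln enables the doubling, because its result is twice the hyperbolic arctangent computed by the CORDIC core.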

Tab. 4.11: Accuracy analysis of the hardware double-rotation CORDIC in various fixed-point representations.

Fixed-point format
(QI:QF)
2:29     3.6899E-5   3.3357E-5   3.5746E-5   7.8789E-7
4:27     3.6823E-5   3.3272E-5   3.5657E-5   7.8853E-7
6:25     3.6637E-5   3.2987E-5   3.5335E-5   7.8742E-7
8:23     3.5883E-5   3.1127E-5   3.3809E-5   9.1425E-7
10:21    3.7630E-5   2.4561E-5   3.7630E-5   2.5011E-6
12:19    4.9646E-5   5.1370E-7   4.9646E-5   9.1294E-6
14:17    1.2279E-4   3.1866E-7   1.2279E-4   2.7748E-5
16:15    4.7996E-2   4.6672E-4   4.7996E-2   1.4112E-2

Tab. 4.12: Accuracy analysis of the hardware triple-rotation CORDIC in various fixed-point representations.

Fixed-point format
(QI:QF)
2:29     3.0749E-5   2.7798E-5   2.9788E-5   6.5657E-7
4:27     3.0686E-5   2.7727E-5   2.9714E-5   6.5711E-7
6:25     3.0531E-5   2.7489E-5   2.9446E-5   6.5618E-7
8:23     2.9903E-5   2.5940E-5   2.8174E-5   7.6187E-7
10:21    3.1358E-5   2.0468E-5   3.1358E-5   2.0843E-6
12:19    4.1372E-5   4.2808E-7   4.1372E-5   7.6078E-6
14:17    1.0232E-4   2.6555E-7   1.0232E-4   2.3123E-5
16:15    3.9997E-2   3.8893E-4   3.9997E-2   1.1760E-2

each arithmetic unit is controlled by a ready-output signal rdy_o via the bo_rdy bus, as authorized by the impartial policy of this unit. The ready-output signal rdy_o of each arithmetic unit is connected internally to the ready-input signal rdy_i in order to stall instruction fetching in the Fetch-and-Decode unit. Therefore, there is no loss, duplication, or collision of data and instructions.

Fig. 4.12 exhibits the timing diagram of the WriteBack unit. The computational results generated by instructions 1, 3, and 2 appear at clock cycles 9, 16, and 21, respectively. Reading these computational results is handled by bo_rdy signals 1 to 5. As soon as a bo_rdy signal is active, the values of opr buses 1 to 3 and their valid signal are read into the unit.
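The text does not specify the arbitration policy beyond calling it impartial. As one possible illustration, the sketch below uses a round-robin arbiter over the five computation units; this choice, and the names `round_robin_grant` and `reply_format`, are assumptions rather than the documented design. The reply-format rule follows Tab. 4.9, where instructions x’0005 to x’000C produce both R1 and R2.

```python
# Instructions that list both R1 and R2 in Tab. 4.9 reply with format #S2.
TWO_WORD_CMDS = set(range(0x0005, 0x000D))

def reply_format(cmd):
    """Select the reply format for an instruction command."""
    return "#S2" if cmd in TWO_WORD_CMDS else "#S1"

def round_robin_grant(requests, last_granted):
    """Pick the next requesting unit after last_granted (assumed policy).

    requests is a list of booleans, one per arithmetic unit; the search
    starts just after the previously granted unit and wraps around, so
    no unit is served twice before the other requesters have been served.
    Returns the granted index, or None if no unit has a result pending.
    """
    n = len(requests)
    for offset in range(1, n + 1):
        idx = (last_granted + offset) % n
        if requests[idx]:
            return idx
    return None
```

For example, with units 0 and 2 both requesting and unit 0 served last, unit 2 is granted next, and unit 0 follows on the subsequent round.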

From the document Optimal Design of Fixed-Point and Floating-Point Arithmetic Units for Scientific Applications (pages 146-152)