4.5 Design and Architecture of an Arithmetic Accelerator
4.5.1 Design and Architecture
Tab. 4.8: Synthesis results of the VHDL implementation of the floating-to-fixed and fixed-to-floating modules on the Xilinx Virtex5 vlx110t-2ff1738 FPGA.
                     Utilization                         Max. Freq.
Algorithm            Slice Reg.   Slice LUT   LUT-FF     (MHz)
Floating-to-Fixed    63           561         32         222.547
Fixed-to-Floating    64           326         31         141.153
[Fig. 4.8, block diagram. Block labels: Fetch & Decode (data-in, valid-in, ack-in, rdy-in); arithmetic units ADD, MUL, SoP, PoS, and CORDIC (labels C1 to C5), each with operand inputs, I-in/I-out ports, and rdyi/rdyo signals; Write Back (data-out, valid-out, ack-out, rdy-out). Bus labels: internal input-bus (op1, op2, op3, ic, birdy) and internal output-bus (opr1, opr2, opr3, oc, bordy).]
Fig. 4.8: The architecture of the floating-point arithmetic accelerator based on the CORDIC unit.
input and output operands will be held in registers op1, op2, op3, R1, and R2, respectively.
Fifteen micro-instructions are provided for the proposed arithmetic accelerator, as shown in Tab. 4.9. Each micro-instruction takes either two or three input operands (op1, op2, and op3), and its computational result is either one word or two words, depending on the instruction. Fig. 4.9 presents the short and long instruction formats, #F1 and #F2, as well as the short and long reply formats, #S1 and #S2, of the proposed floating-point arithmetic accelerator.
Format   Bits 0 to 15   Bits 16 to 31   Bits 32 to 63   Bits 64 to 95   Bits 96 to 127
#F1      cmd            n/a             op1             op2
#F2      cmd            n/a             op1             op2             op3
#S1      Info.          n/a             R1
#S2      Info.          n/a             R1              R2
Fig. 4.9: Instruction formats #F1 and #F2 as well as reply formats #S1 and #S2 of the floating-point arithmetic accelerator.
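The field layout of #F1 and #F2 can be illustrated with a small packing routine. This is only a sketch under stated assumptions: the bus word is taken to be 32 bits wide, bits 0 to 15 are taken to be the low half of the first word, and operands are taken to be IEEE-754 single-precision values. The names `encode_instruction` and `float_to_word` are hypothetical and do not come from the design.

```python
import struct
from typing import List

# ADD, MUL and DIV are the commands fetched as three words (format #F1);
# all other commands are fetched as four words (format #F2).
SHORT_CMDS = {0x0001, 0x0002, 0x000D}


def float_to_word(value: float) -> int:
    """Bit pattern of value as a 32-bit IEEE-754 single (assumed operand format)."""
    return struct.unpack(">I", struct.pack(">f", value))[0]


def encode_instruction(cmd: int, *operands: float) -> List[int]:
    """Pack cmd and its operands into 32-bit words in #F1 or #F2 layout."""
    expected = 2 if cmd in SHORT_CMDS else 3
    if len(operands) != expected:
        raise ValueError(f"cmd {cmd:#06x} expects {expected} operands")
    # Assumption: cmd occupies the low 16 bits of the first word and the
    # unused field (bits 16 to 31) is left at zero.
    header = cmd & 0xFFFF
    return [header] + [float_to_word(v) for v in operands]
```

For example, `encode_instruction(0x0001, 1.0, 2.0)` yields three words (one header word and two operand words), while a long-format command such as SoP yields four.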
4.5.1.2 A Fetch-and-Decode Unit
This unit is responsible for receiving and decoding an intermediate instruction from outside. It then creates a control word and information corresponding to the properties of the intermediate instruction, which are used to control the related components during the computational process.
The unit operates in two stages, a fetch stage and a decode stage. The fetch stage fetches the input instruction from the external bus system and issues it to the processor. The number of words to fetch is determined from Cmd, which is the first word of the instruction. If Cmd equals x'0001, x'0002, or x'000D, three words are fetched; otherwise four words are fetched. Fig. 4.10 presents the timing diagram of the Fetch-and-Decode unit for the short instruction format #F1 and the long instruction format #F2. Four signals are used to fetch an intermediate instruction from the external bus system, i.e. the valid-in signal, the data-in signal, the rdy-in
Tab. 4.9: The micro-instructions of the proposed floating-point arithmetic accelerator.
Cmd     Mnemonic        Operands                 Operation                                  Description
x'0001  ADD             op1, op2, R1             R1 ← op1 + op2                             Addition function
x'0002  MUL             op1, op2, R1             R1 ← op1 · op2                             Multiplication function
x'0003  PoS             op1, op2, op3, R1        R1 ← (op1 + op2) · op3                     Product-of-Sum function
x'0004  SoP             op1, op2, op3, R1        R1 ← (op1 · op2) + op3                     Sum-of-Product function
x'0005  SIN-COS         op1, op2, op3, R1, R2    R1 ← cos(op3)                              Cosine function
                                                 R2 ← sin(op3)                              Sine function
x'0006  SUM-SIN-COS     op1, op2, op3, R1, R2    R1 ← op1 · cos(op3) − op2 · sin(op3)       Difference of products with sine and cosine
                                                 R2 ← op1 · sin(op3) + op2 · cos(op3)       Sum of products with sine and cosine
x'0007  POLAR-REC       op1, op2, op3, R1, R2    R1 ← √(op1² + op2²)                        Polar to rectangular function
                                                 R2 ← op3 + tan⁻¹(op2/op1)
x'0008  SINH-COSH       op1, op2, op3, R1, R2    R1 ← cosh(op3)                             Hyperbolic cosine function
                                                 R2 ← sinh(op3)                             Hyperbolic sine function
x'0009  SUM-SINH-COSH   op1, op2, op3, R1, R2    R1 ← op1 · cosh(op3) + op2 · sinh(op3)     Sum of products with hyperbolic sine and cosine
                                                 R2 ← op1 · sinh(op3) + op2 · cosh(op3)     Sum of products with hyperbolic sine and cosine
x'000A  VEC-HYPERXY     op1, op2, op3, R1, R2    R1 ← √(op1² − op2²)                        Square root of the difference of two squared values
                                                 R2 ← op3 + tanh⁻¹(op2/op1)                 Hyperbolic arctangent with constant addition
x'000B  ROT-LINEAR      op1, op2, op3, R1, R2    R1 ← op1                                   Sum-of-Product function corresponding to the CORDIC boundary
                                                 R2 ← op2 + (op1 · op3)
x'000C  VEC-LINEAR      op1, op2, op3, R1, R2    R1 ← op1                                   Sum-of-Division function corresponding to the CORDIC boundary
                                                 R2 ← op3 + op2/op1
x'000D  DIV             op1, op2, R1             R1 ← op2/op1                               Division function
x'000E  Ln              op1, op2, op3, R1        R1 ← Ln(op3) = 2 · tanh⁻¹(op2/op1),        Natural logarithm function
                                                 with op1 = op3 + 1, op2 = op3 − 1
x'000F  SQRT            op1, op2, op3, R1        R1 ← √op3 = √(op1² − op2²),                Square-root function
                                                 with op1 = op3 + 0.25, op2 = op3 − 0.25
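As an aid to reading Tab. 4.9, the operations can be written as a floating-point reference model. The sketch below uses Python's `math` library rather than CORDIC iterations, so it only states what each micro-instruction is expected to return, not how the hardware computes it. The names `OPS` and `execute` are hypothetical; Ln and SQRT are modelled directly on op3, because their op1 and op2 are derived from op3.

```python
import math

# Each entry maps Cmd to a function (op1, op2, op3) -> (R1, R2).
# R2 is None for instructions that return a single word.
OPS = {
    0x0001: lambda a, b, c: (a + b, None),                          # ADD
    0x0002: lambda a, b, c: (a * b, None),                          # MUL
    0x0003: lambda a, b, c: ((a + b) * c, None),                    # PoS
    0x0004: lambda a, b, c: (a * b + c, None),                      # SoP
    0x0005: lambda a, b, c: (math.cos(c), math.sin(c)),             # SIN-COS
    0x0006: lambda a, b, c: (a * math.cos(c) - b * math.sin(c),
                             a * math.sin(c) + b * math.cos(c)),    # SUM-SIN-COS
    0x0007: lambda a, b, c: (math.sqrt(a * a + b * b),
                             c + math.atan(b / a)),                 # POLAR-REC
    0x0008: lambda a, b, c: (math.cosh(c), math.sinh(c)),           # SINH-COSH
    0x0009: lambda a, b, c: (a * math.cosh(c) + b * math.sinh(c),
                             a * math.sinh(c) + b * math.cosh(c)),  # SUM-SINH-COSH
    0x000A: lambda a, b, c: (math.sqrt(a * a - b * b),
                             c + math.atanh(b / a)),                # VEC-HYPERXY
    0x000B: lambda a, b, c: (a, b + a * c),                         # ROT-LINEAR
    0x000C: lambda a, b, c: (a, c + b / a),                         # VEC-LINEAR
    0x000D: lambda a, b, c: (b / a, None),                          # DIV
    0x000E: lambda a, b, c: (math.log(c), None),                    # Ln
    0x000F: lambda a, b, c: (math.sqrt(c), None),                   # SQRT
}


def execute(cmd, op1=0.0, op2=0.0, op3=0.0):
    """Return the expected (R1, R2) pair for one micro-instruction."""
    return OPS[cmd](op1, op2, op3)
```

A model of this kind could serve as the golden reference when comparing hardware results, for instance `execute(0x0003, 1.0, 2.0, 3.0)` for a PoS instruction.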
[Fig. 4.10, timing diagram. Signals: Clk, valid-in, data-in, ack-in, rdy-in, over clock cycles 1 to 12. data-in carries cmd, op1, op2 for Instruction 1 (#F1), then cmd, op1, op2, op3 for Instruction 2 (#F2), then cmd, op1, op2 for Instruction 3 (#F1). The counter trace reads 0, 1, 2 for each #F1 fetch-and-decode cycle and 0, 1, 2, 3 for the #F2 cycle.]
Fig. 4.10: Timing diagram of the Fetch-and-Decode unit for the short instruction format #F1 and the long instruction format #F2.
[Fig. 4.11, block diagram. Block labels: three Floating-to-Fixed converters, ConX and ConY (ports p1, p2, p3), High Precision CORDIC (ports xi, yi, zi, xo, yo, zo; parameters m, hs, rmode, K, K-1, func), cm, three Fixed-to-Floating converters, Control and Delayline (signals ena, cm). External ports: I-in, op1, op2, op3, Xout, Yout, Zout, I-out.]
Fig. 4.11: The architecture of a CORDIC Unit
signal, and the ack-in signal. As soon as the rdy-in signal is active, the valid-in signal and the data-in signal are simultaneously detected and fetched. After the two signals have been presented to the Fetch-and-Decode unit, the ack-in value is incremented by one in order to inform the source of the instruction that the presented word has been received by the processor. When the instruction has been completely fetched, the value of the ack-in signal is reset.
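The fetch handshake described above can be sketched as a cycle-level behavioural model. The class name `FetchStage` and its `clock` method are hypothetical, and the model assumes that Cmd sits in the low 16 bits of the first fetched word; it is not the VHDL implementation.

```python
SHORT_CMDS = {0x0001, 0x0002, 0x000D}  # fetched as 3 words; all others as 4


class FetchStage:
    """Behavioural sketch of the fetch stage (assumed structure)."""

    def __init__(self):
        self.ack_in = 0   # number of words acknowledged so far
        self.words = []   # words of the instruction being fetched

    def clock(self, rdy_in, valid_in, data_in):
        """Advance one clock; return the full instruction or None."""
        if not (rdy_in and valid_in):
            return None                   # stalled, or no word presented
        self.words.append(data_in)
        self.ack_in += 1                  # acknowledge the presented word
        cmd = self.words[0] & 0xFFFF      # assumption: Cmd in the low 16 bits
        needed = 3 if cmd in SHORT_CMDS else 4
        if self.ack_in < needed:
            return None
        instruction, self.words = self.words, []
        self.ack_in = 0                   # reset once the fetch is complete
        return instruction
```

Feeding the words of an ADD instruction followed by a SoP instruction, one word per clock, returns a three-word list on the third clock and a four-word list on the seventh.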
4.5.1.3 A CORDIC Unit
The architecture of the CORDIC unit illustrated in Fig. 4.11 is designed to cover the functionalities listed in Tab. 4.9. The architecture consists of six components, i.e. the Floating-to-Fixed, Fixed-to-Floating, ConX, ConY, Constant Multiplier, and high precision CORDIC components. These components are explained as follows.
• Floating-to-fixed and fixed-to-floating components: These components transform input data from floating-point to fixed-point format, and vice versa. The algorithms employed for the two data converters are described in section 4.4.
• Control and delay-line component: This component receives the information from the I-in signal and generates the control signals ena and cm in order to prepare the pre-processing and post-processing steps for the natural logarithm and square root functions. The control signals are generated as depicted in Algorithm 21.
• ConX and ConY components: These components detect the cm signal generated by the control and delay-line component and use it to manipulate the inputs p1 and p2. Their functionality is expressed in Equations (4.5) and (4.6).
• Constant Multiplier component: Whenever the ena signal is enabled, output Z0 of the high accuracy CORDIC module is multiplied by two; otherwise it is forwarded without being processed.
• High accuracy CORDIC component: This component is designed in conformance with Algorithm 13. The component has six input parameters, i.e. m, hs, rmod, K, K−1, and func. The values of these parameters depend on the current function, as illustrated in Tabs. 4.2 and 4.9. The parameter func is derived from the instruction cmd according to Tab. 4.10.
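Equation of the ConX module:
\[
p_3 =
\begin{cases}
p_2 + 1.0  & \text{if } CM = 1 \\
p_2 + 0.25 & \text{if } CM = 2 \\
p_1        & \text{otherwise}
\end{cases}
\tag{4.5}
\]
Equation of the ConY module:
\[
p_3 =
\begin{cases}
p_2 - 1.0  & \text{if } CM = 1 \\
p_2 - 0.25 & \text{if } CM = 2 \\
p_1        & \text{otherwise}
\end{cases}
\tag{4.6}
\]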
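The role of ConX, ConY, and the Constant Multiplier in the natural logarithm and square root functions can be sketched as follows. The sketch assumes that p2 carries op3 and that p1 carries the ordinary operand (op1 for ConX, op2 for ConY). The hyperbolic vectoring CORDIC is replaced by `math.atanh` and `math.sqrt` as stand-ins, and all function names are hypothetical. It relies on the identities ln(x) = 2·tanh⁻¹((x − 1)/(x + 1)) and (x + 0.25)² − (x − 0.25)² = x.

```python
import math


def con_x(p1, p2, cm):
    """ConX selector following Eq. (4.5)."""
    if cm == 1:
        return p2 + 1.0
    if cm == 2:
        return p2 + 0.25
    return p1


def con_y(p1, p2, cm):
    """ConY selector following Eq. (4.6)."""
    if cm == 1:
        return p2 - 1.0
    if cm == 2:
        return p2 - 0.25
    return p1


def constant_multiplier(z, ena):
    """Double z when ena is set; otherwise forward it unchanged."""
    return 2.0 * z if ena else z


def natural_log(op3):
    x = con_x(0.0, op3, cm=1)            # op3 + 1
    y = con_y(0.0, op3, cm=1)            # op3 - 1
    z = math.atanh(y / x)                # stand-in for the CORDIC z output
    return constant_multiplier(z, ena=True)


def square_root(op3):
    x = con_x(0.0, op3, cm=2)            # op3 + 0.25
    y = con_y(0.0, op3, cm=2)            # op3 - 0.25
    return math.sqrt(x * x - y * y)      # stand-in for the CORDIC x output
```

With cm = 0 both selectors pass p1 through, so the other instructions reach the CORDIC core unmodified.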
4.5.1.4 A WriteBack Unit
The WriteBack unit is responsible for managing the computational results that are generated by the arithmetic units. The computational results are presented on the internal output-bus, which comprises five buses, i.e. op1, op2, op3, oc, and bordy, as illustrated in Fig. 4.8.
The computational results from all the arithmetic units are selected by an impartial policy, such as a fairness arbiter mechanism, and arranged according to the reply format, either #S1 or #S2, depending on the instruction cmd. The issue of the computational results from
Tab. 4.10: Mapping between the instruction cmd in Tab. 4.9 and the functional number func in Tab. 4.2.
Instruction cmd    Functional number func
x'0006             1
x'0007             2
x'0008             3
x'0009             4
x'000A             5
x'000B             6
x'000C             7
x'000D             8
x'000E             6
x'000F             6
Algorithm 21 Pre-Post Processing
Require: Cmd
Ensure: ena, cm
if Cmd = x'000E then
    ena = b01
    cm = b001
else if Cmd = x'000F then
    ena = b00
    cm = b010
else
    ena = b00
    cm = b000
end if
return ena, cm
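Algorithm 21 amounts to a small decode table. A minimal executable sketch is given below, assuming that Cmd is available as a 16-bit integer; the function name `pre_post_processing` is hypothetical.

```python
def pre_post_processing(cmd):
    """Return (ena, cm) for an instruction code, following Algorithm 21."""
    if cmd == 0x000E:          # Ln: add/subtract 1.0, then double the result
        return 0b01, 0b001
    if cmd == 0x000F:          # SQRT: add/subtract 0.25, no doubling
        return 0b00, 0b010
    return 0b00, 0b000         # all other instructions: no pre/post step
```

Only the natural logarithm enables ena, which matches the factor of two applied by the Constant Multiplier.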
Tab. 4.11: Accuracy analysis of the hardware double-rotation CORDIC in various fixed-point representations.
Fixed-point format
(QI:QF)    Max. |error|   Min. |error|   Ave. |error|   Std. Dev. |error|
2:29       3.6899E-5      3.3357E-5      3.5746E-5      7.8789E-7
4:27       3.6823E-5      3.3272E-5      3.5657E-5      7.8853E-7
6:25       3.6637E-5      3.2987E-5      3.5335E-5      7.8742E-7
8:23       3.5883E-5      3.1127E-5      3.3809E-5      9.1425E-7
10:21      3.7630E-5      2.4561E-5      3.7630E-5      2.5011E-6
12:19      4.9646E-5      5.1370E-7      4.9646E-5      9.1294E-6
14:17      1.2279E-4      3.1866E-7      1.2279E-4      2.7748E-5
16:15      4.7996E-2      4.6672E-4      4.7996E-2      1.4112E-2
Tab. 4.12: Accuracy analysis of the hardware triple-rotation CORDIC in various fixed-point representations.
Fixed-point format
(QI:QF)    Max. |error|   Min. |error|   Ave. |error|   Std. Dev. |error|
2:29       3.0749E-5      2.7798E-5      2.9788E-5      6.5657E-7
4:27       3.0686E-5      2.7727E-5      2.9714E-5      6.5711E-7
6:25       3.0531E-5      2.7489E-5      2.9446E-5      6.5618E-7
8:23       2.9903E-5      2.5940E-5      2.8174E-5      7.6187E-7
10:21      3.1358E-5      2.0468E-5      3.1358E-5      2.0843E-6
12:19      4.1372E-5      4.2808E-7      4.1372E-5      7.6078E-6
14:17      1.0232E-4      2.6555E-7      1.0232E-4      2.3123E-5
16:15      3.9997E-2      3.8893E-4      3.9997E-2      1.1760E-2
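To relate the QI:QF formats in Tabs. 4.11 and 4.12 to numerical resolution, the following sketch quantizes a value to a signed fixed-point word. It assumes one extra sign bit on top of the QI integer and QF fractional bits (so QI + QF + 1 = 32), round-to-nearest, and saturation on overflow. These are illustrative assumptions and not the converter algorithms of section 4.4.

```python
def to_fixed(value, qi, qf):
    """Quantize value to a signed integer with qi integer and qf fractional bits."""
    raw = round(value * (1 << qf))       # scale by 2**qf and round to nearest
    lo = -(1 << (qi + qf))               # most negative representable code
    hi = (1 << (qi + qf)) - 1            # most positive representable code
    return max(lo, min(hi, raw))         # saturate instead of wrapping


def to_float(raw, qf):
    """Convert a fixed-point code back to a Python float."""
    return raw / (1 << qf)
```

Under these assumptions one least-significant bit corresponds to 2^-QF, so moving bits from the fractional to the integer part widens the range at the cost of a coarser quantization step.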
each arithmetic unit is controlled by a ready-output signal rdyo via the bordy bus, as authorized by the impartial policy of this unit. The ready-output signal rdyo of each arithmetic unit is connected internally to the ready-input signal rdyi in order to stall instruction fetching in the Fetch-and-Decode unit. Therefore, no data or instruction is lost, duplicated, or involved in a collision.
Fig. 4.12 shows the timing diagram of the WriteBack unit. The computational results generated by instructions 1, 3, and 2 appear at clock cycles 9, 16, and 21, respectively. Reading these computational results is handled by bordy signals 1 to 5. As soon as a bordy signal is active, the values of opr buses 1 to 3 and their valid signal are read into the unit.
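The text names a fairness arbiter only as an example of an impartial policy and does not specify it further. One common choice is a round-robin arbiter, sketched below as an assumption: `requests` stands for the rdyo signals of the five arithmetic units, and the returned index stands for the bordy line that is granted. The class name `RoundRobinArbiter` is hypothetical.

```python
class RoundRobinArbiter:
    """Round-robin grant among n requesters (assumed policy)."""

    def __init__(self, n_units=5):
        self.n = n_units
        self.last = n_units - 1   # so that unit 0 has priority first

    def grant(self, requests):
        """Return the index of the granted unit, or None if none is ready."""
        for offset in range(1, self.n + 1):
            idx = (self.last + offset) % self.n   # start after the last winner
            if requests[idx]:
                self.last = idx
                return idx
        return None
```

Because the search always starts just after the previous winner, a unit that keeps requesting is served within at most n grants, which is one way to avoid starvation among the arithmetic units.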