이강좌는과학기술부의국가지정연구실인연세대학교이용석교수연구실 ( 프로세서연구실 ) 에서 C&S Technology 사의지원을받아서제작되었습니다 고성능부동소수점나눗셈기 Goldschmidt`s 00. 1. 연세대학교전기전자공학과프로세서연구실박사과정정우경 E-mail: yonglee@yonsei.ac.kr Homepage: http://mpu.yonsei.ac.kr 전화 : 0-13 13-8787 고성능마이크로프로세서구조와설계강좌시리즈 (http://mpu.yonsei.ac.kr mpu.yonsei.ac.kr) 00. 1. 연세대학교전기전자공학과프로세서연구실박사과정정우경 E-mail: yonglee@yonsei.ac.kr 1. 반도체산업과비메모리분야육성을위한방안 (1998.). 고성능마이크로프로세서구조의개요 (1998.) 3. 고성능마이크로프로세서명령어해석기 (Instruction Decoder) 의구조 (1998.3) 4. 고성능마이크로프로세서분기명령어 (Branch Instruction) 의수행방법 (1998.3) 5. 고성능마이크로프로세서곱셈기 (Multiplier) 의구조 (1998.3) 6. 고성능마이크로프로세서부동소수점연산기 (Floating-Point Unit) 구조 (1999.3) 7. 고성능마이크로프로세서캐쉬 (Cache) 메모리구조 (1999.3) 8. 고성능마이크로프로세서나눗셈연산기 (Divider) 의구조 (1999.3) 9. 고성능마이크로프로세서초월함수 (Transcendental) 연산기구조 (1999.3) 10. 고성능마이크로프로세서 ALU 와레지스터 파일의구조 (000.1) 11. 직접디지털주파수합성기 (DDFS) 의구조 (000.1) -- -3- -4-
1. 암호화를위한 VLSI 구조와설계의개요 (000.1) 13. 고성능마이크로프로세서부동소수점연산기 (Floating-Point Unit) 구조 () (000.1) 14. 고성능부동소수점나눗셈기 : Goldschmidt s s (00.) -5- 참고문헌 [3] D.A.Patterson and J.L.Hennesy, Computer Architecture: a Quantitative Approach,, nd Ed., Morgan Kaufmann, pp.a13-a61, A61, 1996 [4] R.E.Goldschmidt, Applications of Division by Convergence, MS thesis, Dept. of EE, M.I.T., Cambridge, Mass., June 1964. [5] 정재원, 내장형프로세서를위한 IEEE-754 고성능부동소수점나눗셈기의설계, 연세대학교전기전자공학과석사학위논문, 001.. (http:// http://mpu.yonsei.ac.kr/ 석박사학위논문 ) 고성능마이크로프로세서나눗셈연산기 (Divider) 의구조 - 고성능마이크로프로세서구조와설계강좌시리즈 8번 * Restoring division (RD) * Non-restoring division (NRD) * Radix-4 4 SRT division * Square root algorithm Floating Point Division ( 참고문헌 [1]) * FP multiplier, adder 에비해느린발전 * FP division: low frequency, high latency - 최대 +0.50 CPI 까지증가 [1] S.F.Oberman and M.J.Flynn, "Design Issues in Division and Other Floating-Point Operations," IEEE Trans. on Comp., vol.46, no., Feb. 1997, pp.154-161 161 [] P.Soderquist and M.Leeser,, "An Area/Performance Comparison of Subtractive and Multiplicative Divide/Square Root Implementations," Proc. 1th IEEE Symp. Computer Arithmetic,, IEEE, 1995, pp.13-139 139-6- -7- -8- -9- * Instruction mix - 3% division, 0.33% sqrt, 37% multiply, 55% add * Stall - 40% division, 4% add, 18% multiply (divider 0, adder/multiplier 3 cycle) * Interlock distances - 3.34 O0 optimization, 10. O3 optimization -10-
Subtractive Division ( 참고문헌 []) * Subtractive methods (dedicated hardware) Radix-4 4 SRT Parallel operation, cost 크고 latency 길다. multiplier 와 de-couple * Implementation: DEC 1164 Alpha, HP PA8000, IBM/Motorola PowerPC 604, Intel P6, MIPS R4000, Sun UltraSPARC -11 11- Multiplicative Division ( 참고문헌 []) * Multiplicative methods (sharing FPU multiplier) Newton-Raphson Raphson,, Goldschmidt's algorithm High speed- quadratical convergence: 각 iteration 마다두배의정확한 quotient bits 생성 Multiplier 의 latency 의증가 -1- Multiplicative Division ( 참고문헌 []) * Implementation: Newton-Raphson Raphson: : IBM RS/6000, MIPS R8000 Goldschmidt: Sun SuperSPARC, TI 의 arithmetic coprocessor Newton-Raphson ( 참고문헌 [3]) * a/b = a*1/b * 곱셈의반복을통해 1/b 을구한뒤 a를곱해나눗셈을수행한다. -13- -14- Newton-Raphson Newton-Raphson -b Y 0 Y 1 Y P 0 X 0 X 1 X 1/X n -b b = 0 X n = 1/b f (X) = 1/X - b X P0에서접선의기울기 = f`(x0) f`(x) = -1/X f`(x0) = -1/X0 = -Y0/(X1-X0) (X1-X0) ) = Y0X0Y X1 = X0+Y +Y0X0 = X0+(1/X +(1/X0-b)X0 = X0-bX bx0 = X0( (-bx0) -15- Q = a/b = a*1/b f(x) = 1/X-b = 0 X i+1 = XiX - f(xi)/f`(x )/f`(xi)) = Xi X + (1/Xi-b)/(1/X b)/(1/xi ) = Xi*( *(-b*xi) * 번의곱셈과 1번의뺄셈필요 * 뺄셈은 1`s complement 생성 logic 으로변환가능 * Iteration 안의곱셈이 dependent 하기때문에 parallelism 불가능 -16-
나눗셈예제 (1) * 1.01101000 1.11010100 (1.4065 1.8815) b(1.11010100) 의역수를구한다. 1 X 0 = 0.1 (1<b< 이므로, 1/<1/b<1) e 0 = - X 1 = X 0*(-b*X 0) = 0.1*(-1.11010100*0.1) 1.11010100*0.1) = 0.10001011 e 1 = -4 나눗셈예제 (1) 3 X = X 1*(-b*X 1) = 0.10001011*(-1.11010100*0.10001011) 1.11010100*0.10001011) = 0.10001100 e = -8-17- 1/b = 0.10001100 a/b = a*1/b = 1.01101000*0.10001100 = 0.11000101 (0.769049687) * 1.4065 1.8815 = 0.76930769 (0.110001010) -18- Goldschmidt ( 참고문헌 [4]) * 분자와분모에같은수를곱해분모가 1에수렴하도록하면분자는나눗셈의몫으로수렴한다. a a*r 0 *R 1 *...*R m-1 Q Q = = b b*r 0 *R 1 *...*R m-1 1 Goldschmidt * Talyor series g(y) = g(p)+(y-p)g`(p)+(y p)g`(p)+(y-p) g``(p)/!+.. * Maclaurin series g(y) = 1/(1+y), p = 0 g(y) = 1-y+y1 -y 3 +y 4 -... = (1-y)(1+y )(1+y 4 )(1+y 8 ).. -19- -0- Goldschmidt Goldschmidt (1+y)*g(y) = (1+y)/(1+y) = 1 = (1+y)(1-y)(1+y y)(1+y )(1+y 4 )... = (1-y)*{(1+y)(1+y )(1+y 4 )...} a*(1+y)(1+y )(1+y 4 )... = Q b R 0 R 1 R R 0 R 1 R -1- R 0 = 1+y = -(1 (1-y) = -b R 1 = 1+y = -(1 (1-y ) = -b*r 0 R = 1+y 4 = -(1 (1-y 4 ) = -b*r 0 *R 1 R 3 = 1+y 8 = -(1 (1-y 8 ) = -b*r 0 *R 1 *R Q = N i / D i R n = -D n D n+1 = Dn*Rn = b*r0*r *R1*R*R3*...*... 1 n+1 = Nn*Rn = a*r0*r *R1*R*R3*...*... Q N n+1 --
Goldschmidt Goldschmidt N 0 N 1 D 1 = a, D 0 = b, R 0 = -D 0 = -b = 1+y = N 0 *R 0 = a*(1+y) = D 0 *R 0 = b*(1+y) = (1-y)(1+y) = 1-y R 1 = -D 1 = -(1 (1-y ) = 1+y N D = N 1 *R 1 = a*(1-y)(1+y ) = D 1 *R 1 = (1-y )(1+y ) = 1-y 4 R = -D = -(1 (1-y 4 ) = 1+y 4 Ni = a*(1+y)(1+y )(1+y 4 )...(1+y i ) Q Di = 1-y i 1-3- * Di = 1-y i 1 로수렴하기위해서는 0 y < 1 b = 1-y, 0 b < 1 * IEEE-754 부동소수점표준의 significand 1 x < 1/ x` < 1 * 분자, 분모를 1/ 로 prescale 후 iteration -4- 나눗셈예제 () 나눗셈예제 () ex) 1.01101000 1.11010100 (1.4065 1.8815) prescale: : 0.10110100 0.11101010 N 1 = 0.10110100*1.00010110 = 0.11000011 D 1 = 0.11101010*1.00010110 = 0.11111110 R 1 = -0.11111110 = 1.00000010 N = 0.11000011*1.00000010 = 0.11000100 D = 0.11111110*1.00000010 = 0.11111111 R = -0.11111111 = 1.00000001 N 0 = 0.10110100 D 0 = 0.11101010 R 0 = -0.11101010 = 1.00010110-5- * a/b = 0.11000100 (0.76565) * 1.4065 1.8815 = 0.76930769 (0.110001010) -6- 비교 Reciprocal Look-up Table * Newton-Raphson 종속적인두번의곱셈 i+1 = Xi*( *(-b*xi) X i+1 * Goldschmidt s s 독립적인두번의곱셈 N i+1 = Ni*Ri, D i+1 = Di*Ri, R i+1 = -D i+1-7- * Multiplicative Division: Quadratic convergence 몫의정밀도 : 1 4 8 161 16 3 * 근사역수 ROM table 이용해초기 iteration 을빠르게수행 초기값 X 0 의정밀도가 8bit 이면두번만에 3bit 의정밀도를얻는다. -8-
Reciprocal Look-up Table b: 1.xxxxxxxx... k-bits address N 0 X a ROM Table k *m bits data out m-bits 0.1yyyyyy : x 0 X b D 0-9- Reciprocal Look-up Table * 1.xxxx 로 access 하는값의범위 : 1.xxxx0000.. b 1.xxxx1111... * [a,b] 안의어떤값 x에대해최대 error 를최소화하기위한 reciprocal approximation 은 /(a+b) * Relative error 의최대값 : (b-a)/(b+a) * Reci(1.xxxx) = /(1.xxxx + (1.xxxx + 0.0001)) = /(*1.xxxx + 0.0001) = k /(1xxxx + 1/) = 0.1yyyyy... * 1yyyyy = RN( k+m+1 /(1xxxx + 1/)) data out address -30- Reciprocal Look-up Table * Table 에의한에러는 m=k+g 일때 r.e max +1) (1+1/ g +1 -(k +1) +1 ) 나눗셈예제 (3) ex) 1.01101000 1.11010100 (1.4065 1.8815) ROM table: 4bit in, 4bit out * 초기에러가표현범위밖으로작아질때까지 iteration 반복으로정확한몫을찾아간다. -31- k=4, m=4 ROM 출력 : RN( 4+4+1 /(11101 +1/)) = 10001 (ROM table access: 주소 : 1101 출력 : 0001) X 0 = 0.10001-3- 나눗셈예제 (3) Goldschmidt Divider 의구현 N 0 = X 0 *a = 0.10001*1.01101000 = 0.10111111 D 0 = X 0 *b = 0.10001*1.11010100 = 0.11111000 R 0 = -D 0 = -0.11111000 = 1.00001000 N 1 = 0.10111111*1.00001000 = 0.11000100 D 1 = 0.11111000*1.00001000 = 0.11111111 R 1 = -D 1 = -0.11111111 = 1.00000001 * a/b = 0.11000100 (0.76565) * 1.4065 1.8815 = 0.76930769 (0.110001010) -33- * 두개의독립된곱셈동시수행 - 6 cycle latency * 개의 3x3+64 MAC unit 이용 : 최근의고성능마이크로프로세서들의 SIMD 연산기능을이용 * MAC 연산을통해 remainder 계산 * 11Kbit reciprocal ROM table 사용 -34-
Block Diagram Ni A 3 X 3 MAC unit Bits Adjustment B Rom Table 10-bits bits-inin 1-bits bits-out Ri Di Ri 3 3 3 3 3 Ni 3 3 3 3 constant 3 3 64 64 3 X 3 MAC unit Bits Adjustment 3 Remainder look-up 3 Qi,Qi-1, Qi+1 Final Rounding 3 64 Ri Di constant Rounding mode -35- 각 cycle 별동작 cycle 0 1 3 4 5 6 동작 input=(a,b), output=(q) ROM table lookup, x 0 =ROMACCESS(b) d 0 =MUL(x 0,b), n 0 =MUL(x 0,a), r 0 =~d 0 d 1 =MUL(r 0,d 0 ), n 1 =MUL(r 0,n 0 ), r 1 =~d 1 q i =MUL(r 1,n 1 ) rem=a =a-mul(q i,b) q=round(q i,rem) -36- Error Analysis multiplier rounding error: 3 Em -3 - -3 1 s s complement error: E ones = - -31 initial reciprocal error: E x0 -(k+1 ) (1+1/ g+1 ) < -10 Error Analysis x 0 = 1/b + E x0 N 0 = a*x 0 = a*(1/b+e x0 ) + Em D 0 = b*x 0 = b*(1/b+e x0 ) + Em R 0 = -D 0 = -b*(1/b+ b*(1/b+e x0 )-Em + E ones N 1 = N 0*R 0 = -Em +EmE ones -(a+b)e x0 (-a/b) a/b)em -abe x0 +a/be ones +a/b D 1 = -Em -be x0em +EmE ones +Em -b E x0 +be x0e ones +E ones +1 R 1 = Em +be x0em -EmE ones -Em +b E x0 -be x0e ones +1 x0em + x0-37- +1-38- Error Analysis Error Analysis Q = N 1*R 1 = (a-b) b)e x0em - (b 3 +3ab )E 3 x0 Em + b E x0 Em + ab E 3 x0 E ones + a ae x0e ones + abe x0 E ones + (1-a/b) a/b)em + a/be ones - ab 3 4 E x0 + a/b Eq = (1-a/b) a/b)em + a/ + a/be ones - -8 < Eq < -8 - ab 3 4 E x0-39- * Single precision 의정밀도 : -3 * IEEE rounding 을위해 bit 정밀도가더필요 * Rounding 으로인한 1bit shift: 1bit 더필요 - -6 < Eq < -6-40-
Design Methodology 설계결과 * C model 로 algorithm 구현 : random vector generation 으로결과비교 * Verilog HDL 로 modeling: C model 과결과비교 * Design synthesis: 0.35um CMOS standard cell library delay, area report -41- unit Divider MAC Divide only ROM table REM Round Control Bitmask gate 5711 47183. 1007.8 4071.4 17.8 97.8 06. 49.8 delay 17.43ns 15.80ns - 10.70ns 0.54ns 0.86ns 0.36ns area(%) 100 8.47 17.53 7.1 3.88 0.5 0.36 0.09-4- Critical Path Latency Comparison D i MAC R i 입력선택 0.93ns FF output 0.6ns MAC 연산 15.80ns 1 s comp 0.ns Total 17.4ns -43- Processor Intel i486 MIPS R4000 SPARC Intel Pentium (Radix-4 4 SRT) UltraSPARC (Radix-8 8 SRT) IBM RISC/6000 (Newton-Raphson Raphson) Elbrus Ek Intel Pentium III Goldschmidt Latency 35 3 0~3 3 19 1 19 10~13 18~36 6 perf.. ratio 17.14 6.09 30 31.58 50 31.58 60 33.33 100-44-