JBE Vol. 18, No. 2, March 2013 (Regular Paper)
http://dx.doi.org/10.5909/jbe.2013.18.2.204
ISSN 2287-9137 (Online) ISSN 1226-7953 (Print)

Implementation of FFT on Massively Parallel GPU for DVB-T Receiver

Kyu Hyung Lee a) and Seo Weon Heo a)

Abstract

Recently, various studies have been conducted on implementing signal processing and communication systems in software using the massively parallel processing capability of the GPU. In this work, we focus on reducing the software simulation time of the 2K/8K FFT in DVB-T by using a GPU. First, we estimate the processing time of the DVB-T system, which is one of the standards for DTV transmission, on a CPU. Then we implement the FFT in software on NVIDIA's massively parallel GPU. We apply the stream processing method to reduce the overhead of data transfer between the CPU and the GPU, the coalescing method to reduce the global memory access time, and a data structure design method to maximize the shared memory usage. The results show that the proposed method is approximately 20~30 times as fast as the CPU-based FFT processor, and approximately 1.8 times as fast as the CUFFT library (version 2.1) provided by NVIDIA, when applied to the DVB-T 2K/8K mode FFT.

Keyword : FFT, GPU, CUDA, DVB-T, NVIDIA, SDR

a) Department of Electronic, Information & Communication Engineering, Hongik University
Corresponding Author : Seo Weon Heo
E-mail: seoweon.heo@hongik.ac.kr
Tel: +82-2-320-3081
This work was supported in 2012 by a grant (No. 2012-0001368) and by ATC (No. 10032734).
Manuscript received December 10, 2012; Revised January 29, 2013; Accepted February 5, 2013
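I. Introduction

The GPU (graphics processing unit) was developed to accelerate 3D rendering and was originally programmed through graphics APIs such as OpenGL. Many studies now use the GPU together with the CPU for general-purpose computation [1-7]. One application is PC-based SDR (software defined radio) [5,6].

DVB-T (digital video broadcasting-terrestrial) [8] combines Reed-Solomon and convolutional coding with OFDM (orthogonal frequency division multiplexing). In this work, we implement a DVB-T (DTV) receiver in software on a CPU and move part of the processing to a GPU. The FFT is one of the main blocks of the DVB-T receiver [9], and it is well suited to the SIMT (single instruction multiple threads) structure of the GPU.

The FFT of [7] uses the shared memory for small FFT sizes and the global memory for large FFT sizes. The work in [7] addresses batch FFT, whereas the DVB-T receiver needs a single FFT such as the 8K-point FFT. We implement the DVB-T 2K and 8K FFT in CUDA [10,11], using streams and coalesced memory access in units of 32 threads. Because data must be transferred between the CPU and the GPU over PCI-Express, the transfer overhead is also reduced. The FFT is based on the radix-2 algorithm. The proposed 2K and 8K FFT is 22~30 times faster than the CPU implementation and 4~8 times faster than the single FFT of [7].

Section II analyzes the processing time of the DVB-T receiver on the CPU. The following sections describe the GPU architecture, the proposed GPU FFT, and the results, and Section VI concludes the paper.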
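II. DVB-T receiver on the CPU

Fig. 1 shows the block diagram of the DVB-T receiver. The RF signal is converted to baseband and sampled by the ADC (analog to digital converter). The ADC runs at 2 times the OFDM sample rate, and a Farrow resampler adjusts the sampling rate. Synchronization uses the autocorrelation of the CP (cyclic prefix) and the continuous pilots. After the derotator and the halfband filter, STR (symbol timing recovery) is performed, followed by the FFT, which computes the DFT (discrete Fourier transform). The FFT output is deinterleaved and then decoded by the Viterbi decoder and the Reed-Solomon decoder.

In the 8MHz DVB-T mode, one OFDM symbol lasts 280μsec in 2K mode and 1120μsec in 8K mode. A real-time 8MHz/8K receiver must therefore finish processing one symbol within 1120μsec. To evaluate this time budget on a CPU, we implemented the receiver in C and measured the processing time of each block on an INTEL I7 2600K 3.4GHz CPU. Table 1 shows the result. In 2K mode one symbol takes 4.55msec, far longer than the 280μsec symbol duration. The FFT lies on the critical path of the receiver: it is computed 3 times per symbol, once in the FFT block and the remaining times in the channel estimator. We therefore process the FFT on the GPU. Table 2 shows the CPU processing time of a single 2K and 8K FFT for one OFDM symbol.

Fig. 1. DVB-T receiver block diagram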
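The time budget discussed above can be checked with a few lines of arithmetic. The sketch below is illustrative and is not part of the original measurement code. It assumes the DVB-T elementary period of 7/64 μsec for the 8MHz channel and a guard interval of 1/4, which together reproduce the 280μsec and 1120μsec symbol durations, and it takes the measured times from Tables 1 and 2. All function names are hypothetical.

```python
from fractions import Fraction

# Elementary period of the 8 MHz DVB-T channel: T = 7/64 usec.
ELEMENTARY_PERIOD_US = Fraction(7, 64)


def symbol_duration_us(fft_size, guard_fraction=Fraction(1, 4)):
    """OFDM symbol duration (useful part plus guard interval) in usec.

    The guard interval of 1/4 is an assumption chosen because it matches
    the 280 usec / 1120 usec figures quoted in the text.
    """
    useful = fft_size * ELEMENTARY_PERIOD_US
    return float(useful * (1 + guard_fraction))


def fft_share_percent(fft_time_us, total_time_us, ffts_per_symbol=3):
    """Share of the per-symbol processing time spent in FFTs, in percent."""
    return 100.0 * ffts_per_symbol * fft_time_us / total_time_us


# Measured values copied from Table 1 (total) and Table 2 (single FFT).
MODES = {
    "2K": {"n": 2048, "total_us": 4547.198, "fft_us": 270.488},
    "8K": {"n": 8192, "total_us": 18713.143, "fft_us": 930.523},
}

for name, m in MODES.items():
    budget = symbol_duration_us(m["n"])
    overrun = m["total_us"] / budget
    share = fft_share_percent(m["fft_us"], m["total_us"])
    print(f"{name}: budget {budget:.0f} usec, CPU time {m['total_us']:.0f} usec "
          f"({overrun:.1f}x over budget), FFT share {share:.2f}%")
```

With three FFTs per symbol, the computed share lands close to the "Total FFT operation/total" row of Table 1, which supports reading that row as three FFT calls per OFDM symbol.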
Table 1. DVB-T receiver software simulation time

Block                        2K mode processing time     8K mode processing time
                             (μsec)        (%)           (μsec)        (%)
Resampler                    208.136       4.577         950.549       5.08
Derotator                    175.011       3.849         819.741       4.38
Halfband LPF                 253.706       5.579         1349.89       7.213
Frame synchronizer & FFT     785.287       17.27         947.796       5.064
Channel estimator            1300.405      28.598        2280.26       12.185
Symbol deinterleaver         10.807        0.238         72.5903       0.388
Bitwise deinterleaver        21.313        0.469         125.098       0.669
Depuncture                   7.804         0.172         23.8577       0.127
Viterbi decoder              1717.787      37.777        11839.332     63.267
Byte processor               66.942        1.472         304.029       1.624
Total                        4547.198      100           18713.143     100
Total FFT operation/total                  17.85%                      14.90%

Table 2. FFT processing time using CPU

N-point      CPU (μsec)
2K-point     270.488
8K-point     930.523

III. GPU architecture

For GPGPU (general purpose GPU) computing, NVIDIA provides CUDA (compute unified device architecture), and its CUDA C API is used to program the GPU. This work uses the GTX 560 Ti, which is based on the GF114 GPU. Fig. 2 shows its hardware architecture. The GPU consists of 8 SMs (streaming multiprocessors). Each SM contains 48 cores, SFUs (special function units), and warp schedulers. CUDA threads are executed on the SMs in a vector processing manner.

Fig. 2. GTX 560 hardware architecture [11]
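In CUDA, the GPU memory is divided into on-chip and off-chip memory. The on-chip shared memory is 48KB. The off-chip DRAM holds the global memory and the texture memory, which is suited to 2D data. Fig. 3 shows the GPU memory architecture, and Table 3 lists the specification of the GPU used in this work.

Table 3. GTX 560 GPU specification

Number of core               384
GPU clock                    900 MHz
Memory clock                 1700 MHz
Memory interface             GDDR5
Memory interface width       256-bit
Memory Bandwidth             100.9GB/sec
Register / Block             32768
Shared Memory / Block        48KB
L2 Cache Size                384KB

Fig. 3. GPU memory architecture [11]

IV. FFT implementation on the GPU

Each SM of the GPU provides 32,768 registers. The shared memory of an SM is organized in banks, as shown in Fig. 4. With these resources in mind, we implement the DVB-T 2K-point FFT and 8K-point FFT on the GPU.

Fig. 4. Shared memory bank architecture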
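1. FFT algorithm

The FFT is a fast algorithm for computing the DFT. The N-point DFT is defined in (1), where x(n) is the input, X(k) is the output, and W_N is the twiddle factor given in (2).

X(k) = \sum_{n=0}^{N-1} x(n) W_N^{nk}, \quad k = 0, 1, \ldots, N-1    (1)

W_N = e^{-j 2\pi / N}    (2)

Among the FFT algorithms that reduce the computation of the DFT, such as radix-2, radix-4, radix-n, and split-radix, we use radix-2. The basic unit of the radix-2 algorithm is the 2-point FFT butterfly shown in Fig. 5. As shown in Fig. 6, the DVB-T 8K FFT is processed in 13 stages. For example, stage 2 produces 4-point results.

X(0) = x(0) + W_2^0 x(1)
X(1) = x(0) - W_2^0 x(1)

Fig. 5. An example of basic butterfly computation, 2-point FFT algorithm

Fig. 6. Block diagram of 13 stages for 8K point FFT processing

2. Data transfer between CPU and GPU

To process the FFT on the GPU, the input data must be transferred from the CPU to the GPU and the result must be returned to the CPU. As shown in Fig. 7, this transfer goes through the PCI-Express interface, which here is PCI-Express 2.0 x16. Each lane carries 5Gbps in one direction and 10Gbps in both directions. Excluding the 20% encoding overhead, the effective rates are 4Gbps and 8Gbps per lane.

Fig. 7. Data flow for using GPU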
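The effective transfer rate is 32Gbps (8 x 4Gbps). At this rate the 2K-point FFT, with 32KB of data, needs about 8μsec for the transfer, and the 8K-point FFT, with 128KB of data, needs about 32μsec. To reduce this overhead on the GPU, we use page-locked memory, which allows the data in the CPU DRAM to be transferred to the GPU by DMA (direct memory access), together with the stream method.

Fig. 8 shows an example of the processing time with and without the stream method. In the figure, HD denotes the copy from the CPU DRAM to the GPU DRAM, DH denotes the copy in the opposite direction, and S1 to S13 denote the 13 stages of the 8K-point FFT. The 8,192 input samples starting from x(0) are divided into 4 transfers (HD1 - HD4) of 2,048 samples each, so that computation on the data already transferred by HD1 can overlap with the next transfer (HD2).

Fig. 8. An example of processing time with and without the stream method

3. Thread allocation for the GPU FFT

Fig. 9 shows how the threads access and compute the data for the 8K-point FFT. One block contains 1,024 threads, and each of the 1,024 threads handles 8 points, so that all 8,192 points are covered. Each thread computes radix-2 FFT butterflies of the following form.

X(0) = x(0) + W_N^k x(1)
X(1) = x(0) - W_N^k x(1)

Fig. 9. The processing of thread accessing and computing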
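The radix-2 butterfly of Fig. 9 and the stage structure of Fig. 6 can be sketched in plain Python. This is a reference model under stated assumptions, not the authors' CUDA kernel: it uses a decimation-in-time ordering with bit-reversed input, and the inner loop index `t` merely stands in for the butterfly that a GPU thread would compute. In the configuration described above each of the 1,024 threads covers 8 points, that is, 4 butterflies per stage; the loop here flattens this to one butterfly per iteration. The function names are hypothetical.

```python
import cmath


def num_stages(n):
    """Number of radix-2 stages (log2 n) for a power-of-two length n."""
    stages = n.bit_length() - 1
    if n < 1 or (1 << stages) != n:
        raise ValueError("length must be a power of two")
    return stages


def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def fft_radix2(x):
    """Iterative radix-2 decimation-in-time FFT of a power-of-two sequence."""
    n = len(x)
    stages = num_stages(n)
    # Bit-reversed input order lets every stage update the data in place.
    data = [complex(x[bit_reverse(i, stages)]) for i in range(n)]
    for stage in range(1, stages + 1):
        span = 1 << stage        # size of each sub-FFT produced by this stage
        half = span >> 1         # distance between the two butterfly inputs
        # One iteration per butterfly; butterflies within a stage are
        # independent, which is what makes the stage parallelizable.
        for t in range(n // 2):
            group, k = divmod(t, half)
            top = group * span + k
            bottom = top + half
            w = cmath.exp(-2j * cmath.pi * k / span)   # twiddle factor W_span^k
            a, b = data[top], data[bottom] * w
            data[top], data[bottom] = a + b, a - b
    return data


def dft_direct(x):
    """Direct O(N^2) DFT, used only to check the FFT result."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
            for k in range(n)]
```

Calling `num_stages(8192)` confirms the 13 stages of the 8K-point FFT, and `num_stages(2048)` gives 11 stages for the 2K-point FFT. Comparing `fft_radix2` with `dft_direct` on a short input is a quick way to check the butterfly indexing before porting it to a kernel.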
The GPU has 8 SMs, and the FFT is processed with 1,024 threads as shown in Fig. 9.

When the threads access the shared memory, a bank conflict can occur, and accesses that fall on the same bank are serialized. Fig. 10 shows an example of a bank conflict in stage 2, where thread 0 accesses data 0 and data 2.

Fig. 10. An example of bank conflict in stage 2

Fig. 11 shows an example of data access to the shared memory in stage 7 of the 8K-point FFT, where thread 0 accesses data 0 and data 64. The shared memory available in an SM is limited to 48KB.

Fig. 11. An example of data access to shared memory in stage 7
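To make the bank conflict idea concrete, the following sketch counts how many distinct words of a warp's access pattern fall on the same bank. It is a simplified model under stated assumptions and does not reproduce the exact pattern of Fig. 10: it assumes 32 banks with successive 32-bit words mapped to successive banks and a warp of 32 threads (as on compute capability 2.x devices such as GF114), and it treats each FFT element as a single word, although a complex sample occupies more. The indexing function is the same hypothetical in-place butterfly mapping used for illustration, not the authors' data layout.

```python
from collections import defaultdict


def conflict_degree(word_addresses, num_banks=32):
    """Largest number of distinct words that map to one bank.

    1 means conflict-free; d means the access is serialized into d steps.
    Threads reading the same word are not counted as a conflict.
    """
    banks = defaultdict(set)
    for addr in word_addresses:
        banks[addr % num_banks].add(addr)
    return max(len(words) for words in banks.values())


def butterfly_top_index(thread, stage):
    """Index of the upper butterfly input handled by `thread` at `stage`.

    The lower input sits 2**(stage - 1) elements further on.
    """
    half = 1 << (stage - 1)
    group, k = divmod(thread, half)
    return group * 2 * half + k


def warp_conflicts(stage, warp_size=32, num_banks=32):
    """Conflict degree when one warp reads its upper butterfly inputs."""
    tops = [butterfly_top_index(t, stage) for t in range(warp_size)]
    return conflict_degree(tops, num_banks)


for stage in range(1, 8):
    partner = butterfly_top_index(0, stage) + (1 << (stage - 1))
    print(f"stage {stage}: thread 0 pairs data 0 with data {partner}, "
          f"conflict degree {warp_conflicts(stage)}")
```

Under this model thread 0 pairs data 0 with data 2 in stage 2 and data 0 with data 64 in stage 7, in line with the examples in Fig. 10 and Fig. 11. The conflict degree depends on how each thread's indices are laid out in the shared memory, which is why the data structure design matters.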
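This arrangement of the data reduces the memory access latency.

V. Experimental results

We measured the processing time of the DVB-T 2K and 8K FFT. The CPU is an INTEL I7 3.4GHz, and the DVB-T receiver is written in C and compiled with GNU C. The GPU is the NVIDIA GF114 (GTX 560).

With the initial CUDA implementation of the FFT, the measured times are 13.06μsec and 57μsec for the 2K-point FFT, and 34.21μsec and 132μsec for the 8K-point FFT. After optimizing the CUDA code, including its loops, the measured times are 12.25μsec and 2.329μsec for the 2K-point FFT, and 38.78μsec and 5.54μsec for the 8K-point FFT.

Table 4 compares the FFT processing time of the CPU and the GPU. Compared with the CPU FFT, the proposed FFT is about 22 times faster for the 2K-point FFT and about 28 times faster for the 8K-point FFT. Compared with CUFFT (ver2.1), the FFT library provided with CUDA [12], the proposed FFT is up to about 1.8 times faster for the 2K-point and 8K-point FFT. CUFFT is a general-purpose library that supports many FFT sizes, whereas the proposed FFT is specialized for the DVB-T 2K and 8K FFT.

In terms of GFLOPS, the proposed method achieves 89.08 GFLOPS for the 2K-point FFT and 170.67 GFLOPS for the 8K-point FFT. The results in [7] were obtained on a GTX 280, so the GPUs differ. The hardware specifications of the GTX 560 exceed those of the GTX 280 by factors of 1.6, 1.4, and 1.26, which amounts to about 2.8 times in hardware performance. Table 5 lists the results of [7] scaled by this factor of 2.8. Even after this scaling, the proposed method is about 8 times faster than [7] for the 2K-point FFT and about 4 times faster for the 8K-point FFT.

Table 4. FFT processing time comparison (GPU's case includes data copy time)

N-point      CPU (μsec)     GPU (μsec)     CUFFT ver2.1 (μsec)
2K-point     270.488        12.25          22.246
8K-point     930            38.78          56.128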
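The ratios implied by Table 4 can be recomputed directly from its entries. The short sketch below does only that; the names are hypothetical. The GPU column of Table 4 includes the data copy time, so speedup figures in the text that rest on other timings need not coincide exactly with these ratios.

```python
# Processing times in usec, copied from Table 4 (GPU times include data copy).
TABLE4_US = {
    "2K-point": {"cpu": 270.488, "gpu": 12.25, "cufft": 22.246},
    "8K-point": {"cpu": 930.0, "gpu": 38.78, "cufft": 56.128},
}


def speedup(baseline_us, proposed_us):
    """How many times faster the proposed time is than the baseline."""
    if proposed_us <= 0:
        raise ValueError("proposed time must be positive")
    return baseline_us / proposed_us


def table4_speedups(table=TABLE4_US):
    """Return {size: (speedup over CPU, speedup over CUFFT)}."""
    return {
        size: (speedup(t["cpu"], t["gpu"]), speedup(t["cufft"], t["gpu"]))
        for size, t in table.items()
    }


for size, (vs_cpu, vs_cufft) in table4_speedups().items():
    print(f"{size}: {vs_cpu:.1f}x over CPU, {vs_cufft:.2f}x over CUFFT 2.1")
```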
Table 5. Performance comparison of the proposed method with the result in the paper [7]

N-point      Proposed FFT (GFLOPS)     [7] FFT (GFLOPS)
2K-point     89.08                     11.2
8K-point     170.67                    42

Table 6 shows the software simulation time of the DVB-T receiver when the FFT is processed on the GPU. The share of the FFT in the total simulation time falls from 17.85% to 1.42% in 2K mode and from 14.9% to 0.74% in 8K mode.

Table 6. Software simulation time in DVB-T receiver using GPU

Block                        2K mode processing time     8K mode processing time
                             (μsec)        (%)           (μsec)        (%)
Resampler                    208.136       7.935         950.549       5.926
Derotator                    175.011       6.672         819.741       5.11
Halfband LPF                 253.706       9.672         1349.89       8.415
Frame synchronizer & FFT     16.854        0.643         57.276        0.357
Channel estimator            144.701       5.516         499.22        3.112
Symbol deinterleaver         10.807        0.412         72.5903       0.453
Bitwise deinterleaver        21.313        0.813         125.098       0.78
Depuncture                   7.804         0.298         23.8577       0.149
Viterbi decoder              1717.787      65.488        11839.332     73.804
Byte processor               66.942        2.552         304.029       1.895
Total                        2623.061                    16041.58
FFT simulation time per total simulation time rate       1.42%         0.74%

VI. Conclusion

In this paper, we measured the processing time of a DTV DVB-T receiver implemented in C on a CPU and implemented its FFT on a GPU to reduce the simulation time. The stream method, the coalescing method, and a data structure designed for the shared memory were applied to the GPU FFT. The results can be used to implement a PC-based DVB-T receiver.

References

[1] Y. Chen, X. Cui, and H. Mei, Large-scale FFT on GPU clusters, Proc. ACM/IEEE Int. Conf. on Supercomputing, pp. 315-324, June 2010.
[2] Z. Lili, Z. Shengbing, Z. Meng and Z. Yi, Streaming FFT asynchronously on graphics processor units, Proc. IEEE Int. Forum on Information Technology and Applications (IFITA), pp. 308-312, July 2010.
[3] R. de Beer and D. van Ormondt, Accelerating batched 1D-FFT with a CUDA-capable computer, Proc. IEEE Int. Conf. on Imaging Systems and Techniques (IST), pp. 446-451, July 2010.
[4] N. Hinitt and T. Kocak, GPU-based FFT computation for multi-gigabit wireless HD baseband processing, EURASIP Journal on Wireless Communications and Networking, vol. 2010, no. 30, June 2010.
[5] G. Wang, M. Wu, Y. Sun and J. R. Cavallaro, A massively parallel implementation of QC-LDPC decoder on GPU, IEEE 9th Symposium on Application Specific Processors (SASP), pp. 82-85, June 2011.
[6] M. Wu, Y. Sun, S. Gupta, and J. Cavallaro, Implementation of a high throughput soft MIMO detector on GPU, Journal of Signal Processing Systems, vol. 64, no. 1, pp. 123-136, Sept. 2010.
[7] N. K. Govindaraju, B. Lloyd, Y. Dotsenko, B. Smith and J. Manferdelli, High performance discrete Fourier transforms on graphics processors, Proc. ACM/IEEE Int. Conf. on Supercomputing, pp. 1-12, Nov. 2008.
[8] L. Vangelista, N. Benvenuto, S. Tomasin, C. Nokes, J. Stott, A. Filippi, M. Vlot, V. Mignone, and A. Morello, Key technologies for next-generation terrestrial digital television standard DVB-T2, IEEE Communications Magazine, vol. 47, no. 10, pp. 146-153, Oct. 2009.
[9] J. H. Suck, D. W. Kim, T. W. Kwon, S. K. Hyung and J. R. Choi, A 8192 complex point FFT/IFFT for COFDM modulation scheme in DVB-T system, Proc. IEEE Int. Conf. on System on Chip (ICSOC), pp. 131-134, Sept. 2003.
[10] NVIDIA corp., NVIDIA CUDA C Best Practices Guide 5, Oct. 2012.
[11] NVIDIA corp., NVIDIA CUDA C Programming Guide 5, Oct. 2012.
[12] NVIDIA corp., NVIDIA CUDA CUFFT Library, Oct. 2011.