3 : HEVC GPU (Sangmin Kim et al. : Adaptive Search Range Decision for Accelerating GPU-based Integer-pel Motion Estimation in HEVC Encoders) (Regular Paper) 19 5, 2014 9 (JBE Vol. 19, No. 5, September 2014) http://dx.doi.org/10.5909/jbe.2014.19.5.699 ISSN 2287-9137 (Online) ISSN 1226-7953 (Print) HEVC GPU a), a), b), a) Adaptive Search Range Decision for Accelerating GPU-based Integer-pel Motion Estimation in HEVC Encoders Sangmin Kim a), Dongkyu Lee a), Dong-Gyu Sim b), and Seoung-Jun Oh a) High Efficiency Video Coding (HEVC) GPU (integer-pel) (Motion Estimation). Motion Vector Difference (MVD)., MVD. MVD. GPU Motion Vector (MV). CPU, CPU GPU. GPU. 1.1% BD-rate 37.9% 951.2., 57.5% 0.6% BD-rate. Abstract In this paper, we propose a new Adaptive Search Range (ASR) decision algorithm for accelerating GPU-based Integer-pel Motion Estimation (IME) of High Efficiency Video Coding (HEVC). For deciding the ASR, we classify a frame into two models using Motion Vector Differences (MVDs) then adaptively decide the search ranges of each model. In order to apply the proposed algorithm to the GPU-based ME process, starting points of the ME are decided using only temporal Motion Vectors (MVs). The CPU decides the ASR as well as the starting points and transfers them to the GPU. Then, the GPU performs the integer-pel ME. The proposed algorithm reduces the total encoding time by 37.9% with BD-rate increase of 1.1% and yields 951.2 times faster ME against the CPU-based anchor. In addition, the proposed algorithm achieves the time reduction of 57.5% in the ME running time with the negligible coding loss of 0.6%, compared with the simple GPU-based ME without ASR decision. Keyword : HEVC, Fast ME, Adative Search Range, MVD, GPGPU
(JBE Vol. 19, No. 5, September 2014). Moving Picture Experts Group (MPEG) Video Coding Experts Group (VCEG) High Efficiency Video Coding (HEVC). HEVC H.264/AVC 40%.,., (Motion Estimation : ME) 70%, [1]. [2]-[7]. Central Processing Unit (CPU) [2]-[4] Graphics Process- ing Unit (GPU) [5]-[7]. CPU three-step search [2] four-step search [3] diamond search [4]. GPU [5]-[7] General-Purpose computing on GPU (GPGPU). [2]-[4] Single Instruction Multiple Threads (SIMT) a) (Dept. of Electronic Engineering, Kwangwoon university) b) (Dept. of Computer Engineering, Kwangwoon university) Corresponding Author : (Seoung-Jun Oh) E-mail: sjoh@kw.ac.kr Tel: +82-2-940-5102 2014. [ 14-000-11-002, ] Manuscript received July 10, 2014 Revised September 4, 2014 Accepted September 4, 2014., SIMT GPGPU. SIMT, GPU (integer-pel). (global) (local). Motion Vector Difference (MVD),. Coding Tree Unit (CTU) MVD. [7]. [7] Nvidia GPGPU CUDA(Compute Unified Device Architecture). Prediction Unit (PU) (Advanced Motion Vector Prediction : AMVP) PU. Motion Vector (MV) [8][9]. CPU, GPU. GPU. low complexity.. 2 HEVC CUDA. 3, 4. 5, 6.
3 : HEVC GPU (Sangmin Kim et al. : Adaptive Search Range Decision for Accelerating GPU-based Integer-pel Motion Estimation in HEVC Encoders). HEVC CUDA 1. HEVC H.264/AVC 16x16 (Macroblock : MB) HEVC 64x 64~16x16 (Coding Tree Unit : CTU), 64x64~8x8 (quad tree). Coding Unit (CU). CU PU, PU. (1). J - (Rate-Distortion : RD), SAD Sum of Absolute Differences (SAD). B MVP MV MVD. λ (lagrangian multiplier). J MV... H.264/AVC MB MV MVP,. HEVC AMVP PU MV MV MVP. 1 MVP MV. 2. CUDA CUDA CPU host GPU (thread) device. (kernel) CPU SIMT GPU. CUDA. (thread block), (grid). GPU - (on-chip) (shared memory). 3. CUDA Kepler GPU Streaming Multiprocessor (SMX), SMX Streaming Processor (SP),,, (warp scheduler). SMX 32 (warp),.. -,.. 1. Fig. 1. ME process GPU
(JBE Vol. 19, No. 5, September 2014), [7]. [5] (fractional-pel) GPU - (latency). [6]. [7] (hierarchical) SAD,. Concurrent Parallel Reduction (CPR) SAD Parallel Reduction (PR) (synchronization). [7] SAD CPR. [7], MB SAD. MB 4x4 SAD. 4x4 SAD SAD group, MB 16 SAD group CPR SAD. PR (utilization). data hazard. CPR SAD group SAD group.,., CPR 16 4x4 IMV(integer-pel MV). 8x4, 4x8 SAD 4x4 SAD, 8x4, 4x8 16 SAD group. SAD group CPR, 16 IMV. IMV. [7] HEVC 32x32 CTU. 8x8 PU SAD, CPR PU IMV.., AMVP PU. PU PU PU PU MV. H.264/AVC. [7].. [8][9]. Search Starting Point (SSP). SSP CTU CTU MV, CTU PU SSP.. GPU [5]-[7]. [2]-[4] SIMT., SIMT GPGPU. SIMT MVD, GPU.. MVD,. CTU MVD. 2.,. CPU SSP GPU. GPU
3 : HEVC GPU (Sangmin Kim et al. : Adaptive Search Range Decision for Accelerating GPU-based Integer-pel Motion Estimation in HEVC Encoders) 2. Fig. 2. Overflow of the proposed algorithm, CPU. Rate-Distrotion Optimization (RDO) PU IMV. RDO HEVC test model (HM). RDO RDO IMV. HM. 1. MV MVP MVD. MVD..,. (dominant) (chess board distance : ), MVD chess. MVD chess 0. D motion (Degree of Motion),. D motion 0, 100. 2 D motion 3. 3 Class B 2. D motion Table 2. Experimental environment for D motion research reference model HM 10.0 configuration low delay P QP 22,27,32,37 sequence Class B number of frame 100 CTU size 32x32 reference frame 1 search type full search search range ±16
(JBE Vol. 19, No. 5, September 2014) D motion 37%. (correlation). D motion., D motion 37%, LMF(Large Motion Frame). 3. Class B D motion Fig. 3. D motion Distribution of Class B sequences. 4 D motion 2 (Gaussian distribution). (error minimization) 37%. 4. Class B D motion Fig. 4. Gaussian modeling of D motion distribution of Class B sequences SMF(Small Motion Frame). 2.. LMF SMF. CTU. CTU MVD CTU. MVD CTU CTU MVD chess. 5 LMF SMF MVD CTU. (Laplacian distribution).., SMF MVD CTU TH SMF (Search Range : SR), LMF TH LMF SR. 5. SMF LMF MVD CTU Fig. 5. Distributions of MVD CTU in SMF and LMF
3 : HEVC GPU (Sangmin Kim et al. : Adaptive Search Range Decision for Accelerating GPU-based Integer-pel Motion Estimation in HEVC Encoders). 3 MVD CTU 0 75%~95% MVD CTU 1/4 (quarter-pel). LMF 6 MVD CTU 6 MVD CTU 75%. 3. MVD CTU Table 3. Threshold section and MVD CTU Section Mode 75% 85% 95% LMF -6~6-12~12-32~32 SMF -1~1-3~3-11~11. 7 TH LMF 85% TH SMF 95% Class B Basketball Drive Bjontegaard-delta bitrate (BD-rate) [10] 0.6% 1.1%. TH LMF TH SMF 85% 95%., LMF, CTU MVD CTU TH LMF,. 3 3 TH LMF TH SMF. 4. TR(Time Reduction). TR (3). 4. Table 4. Test environment for threshold and adaptive search range decision reference model HM 10.0 configuration low delay P QP 22,27,32,37 sequence Class B number of frame 100 CTU size 32x32 reference frame 1 search type full search search range ±8 6. Fig. 6. Comparison of coding performance for threshold decision. 6,7. 6 TH LMF TH SMF 75% 95% TR., TH LMF TH SMF 95%. 7. TR Fig. 7. Comparison of ME TR for threshold decision TH LMF TH SMF. 3 4x4, 8x8, 12x12 TH LMF TH SMF. 8,9
(JBE Vol. 19, No. 5, September 2014) TR BD-rate. 8x8. TH LMF TH SMF CTU 8x8. 10 (pseudo code).. 8. SR Fig. 8. Comparison of coding performance for SR decision HEVC HM 10.0 CUDA GPU. CPU Intel i7 3.07GHz, 16GB DRAM, GPU GeForce GTX 460, 1GB DRAM. 5. TR. BD-rate RD. (training set) Class B (test set) Class A,C. 5. Table 5. Test environment for coding performance evaluation of the proposed algorithm 9. SR TR Fig. 9. Comparison of ME TR for SR decision configuration low delay P QP 22,27,32,37 sequence Class A,B,C number of frame 100 CTU size 32x32 reference frame 1 search type full search search range ±16 prediction mode no 4x8 and 8x4 inter predictions AMP off 10. Fig. 10. Pseudo code for adaptive search range decision 11 Class A C D motion. Class A 37% D motion. (D motion 37%) LMF,., Class C D motion
3 : HEVC GPU (Sangmin Kim et al. : Adaptive Search Range Decision for Accelerating GPU-based Integer-pel Motion Estimation in HEVC Encoders) 11. Class A C D motion Fig. 11. D motion Distributions of Class A and Class c sequences D motion 16%. 37%. 37% 16%. 6. 5. (Class B) (Class A,C) Y BD-rate 1%, 37.9% TR. SSP. SSP Y BD-rate 0.5%. SSP 0.5% 0.6% Y BD-rate 1.1%., 12 RD.. 6. Table 6. Coding performance of the proposed algorithm Class A Class B Class C Sequence BD-rate (%) Encoding Time (h) Encoding TR (%) Y U V Anchor Algorithm Traffic 1.0 0.9 0.7 1.90 1.06 44.2 PeopleOnStreet 0.7 2.1 1.8 2.39 1.55 34.9 Class A average 0.9 1.5 1.3 2.14 1.31 39.5 Kimono 0.5 0.5 0.7 1.10 0.69 37.7 ParkScene 1.7 1.3 1.7 1.04 0.62 40.2 Cactus 0.7 0.9 0.9 1.06 0.65 39.1 BasketballDrive 2.3 3.6 2.2 1.11 0.71 36.4 BQTerrace 0.4 0.5-0.4 1.14 0.73 36.5 Class B average 1.1 1.4 1.0 1.09 0.68 38.0 BasketballDrill 2.9 3.3 4.5 0.22 0.14 36.8 BQMall 0.9 1.1 0.9 0.22 0.13 37.7 PartyScene 0.1 0.5 0.4 0.27 0.13 30.0 RaceHorses 1.2 1.2 1.7 0.27 0.19 29.0 Class C average 1.3 1.5 1.9 0.24 0.16 33.4 Total average 1.1 1.5 1.4 0.97 0.61 37.9
(JBE Vol. 19, No. 5, September 2014)
3 : HEVC GPU (Sangmin Kim et al. : Adaptive Search Range Decision for Accelerating GPU-based Integer-pel Motion Estimation in HEVC Encoders) 12. RD Fig. 12. RD curve comparisons between anchor and the proposed algorithm 7. 951.2. 7. Table 7. ME execution time of the proposed algorithm Class A Class B Class C Sequence ME Time (s) Anchor Algorithm Speed-up Traffic 3020.19 2.86 1054.5x PeopleOnStreet 3020.68 2.98 1013.7x Class A average 3020.43 2.92 1034.1x Kimono 1512.00 1.79 844.6x ParkScene 1512.02 1.56 972.0x Cactus 1512.79 1.51 1001.0x BasketballDrive 1512.39 1.83 828.6x BQTerrace 1511.60 1.62 930.7x Class B average 1512.16 1.66 915.4x BasketballDrill 293.85 0.30 973.1x BQMall 294.26 0.30 982.0x PartyScene 294.35 0.29 1023.3x RaceHorses 294.42 0.35 839.6x Class C average 294.22 0.31 954.5x Total average 1343.50 1.40 951.2x
(JBE Vol. 19, No. 5, September 2014) SIMT. 8.. 0.6% BD-rate, 57.5% TR. Low delay P Common Test Condition (CTC) 5. CTC Class A,B C Y BD-rate 17.5%., CTC TR TR 46.2% 99.8% Y BD-rate 18.8%. 5. low delay P low complexity,. CTC MVD. CTC.. MVD, GPU., LMF SMF, CTU. CPU, GPU., SSP.. CPU 1.1% BD-rate, 37.9% 951.2 8. Table 8. Coding performance of adaptive search range Class A Class B Class C Sequence BD-rate (%) IME Time (ms) Y U V Anchor Algorithm IME TR (%) Traffic 0.7 0.9 0.9 7314.6 2864.2 60.8 PeopleOnStreet 0.3 1.4 1.3 7314.6 2979.8 59.3 Class A average 0.5 1.1 1.1 7314.6 2922.0 60.1 Kimono 0.2 0.3 0.2 3742.6 1790.2 52.2 ParkScene 0.8 0.6 0.7 3744.0 1555.5 58.5 Cactus 0.3 0.6 0.8 3744.1 1511.2 59.7 BasketballDrive 1.0 1.9 1.2 3742.6 1825.3 51.2 BQTerrace 0.2 1.2 0.3 3744.1 1624.1 56.6 Class B average 0.5 0.9 0.6 3743.5 1661.3 55.6 BasketballDrill 1.9 2.1 2.9 718.7 302.0 58.0 BQMall 0.6 0.7 0.9 718.3 299.6 58.3 PartyScene 0.0 0.3-0.1 717.7 287.7 60.0 RaceHorses 0.8 0.9 1.2 718.8 350.7 51.2 Class C average 0.8 1.0 1.2 718.4 310.0 56.9 Total average 0.6 1.0 0.9 3292.7 1399.1 57.5
3 : HEVC GPU (Sangmin Kim et al. : Adaptive Search Range Decision for Accelerating GPU-based Integer-pel Motion Estimation in HEVC Encoders)., 0.6% 57.5% TR. (References) [1] F. Bossen, B. Bross, K. Suhring, and D. Flynn, HEVC Complexity and Implementation Analysis, IEEE Transactions on Circuit Systems for Video Technoogy, vol. 22, no. 12, Dec. 2012. [2] R. Li, B. Zeng, M. L. Liou, A new three-step search algorithm for block motion estimation, IEEE Trans. Circuits Syst. Video Technol., vol. 4, no: 4, pp. 438-442, Aug. 1994 [3] L. M. Po, W. C. Ma, A novel four-step search algorithm for fast block motion estimation, IEEE Trans. Circuits Syst. Video Technol., vol. 6, pp. 313 317, June 1996. [4] S. Zhu, K. K. Ma, A new diamond search algorithm for fast block matching motion estimation, Information, Communications and Signal Processing, ICICS., 292-296, vol.1, Sep. 1997. [5] W. N. Chen, H. M. Hang, H.264/AVC motion estimation implementation on Compute Unified Device Architecture (CUDA), IEEE International Conference on Multimedia and Expo, pp. 697-700, April, 2008. [6] J. Zhou, L. Jiao, X. Cao, Implementation of parallel full search algorithm for motion estimation on multi-core processors, International Conference on Next Generation Information Technology, Gyeongju, pp. 31-35, June 2011. [7] D. K. Lee, S. J. Oh, Variable block size motion estimation implementation on compute unified device architecture (CUDA), IEEE International Conference on Consumer Electronics, Las Vegas, pp. 635-636, Jan. 2013. [8] R. Rodriguez, L. Martinez, Accelerating H.264 Inter Prediction in a GPU by using CUDA, International Conference on Consumer Electronics (ICCE), Las Vegas, pp. 463-464, Jan. 2010. [9] R. Rodriguez, J. L. Martinez, Reducing Complexity in H.264/AVC Motion Estimation by using a GPU, IEEE International Workshop on Multimedia Signal Processing (MMSP), Hangzhou, pp. 1-6, Oct. 2011 [10] G. Bjontegaard, Calculation of average PSNR differences between RD-curves, document VCEG-M33, ITU-T SG16 Q.6 Video Coding Experts Group (VCEG), April 2001. - 2013 2 : - 2013 3 ~ : - :, - 2012 2 : - 2014 2 : - 2014 3 ~ : - :,
(JBE Vol. 19, No. 5, September 2014) - 1993 2 : - 1995 2 : - 1999 2 : - 1999 3 ~ 2000 8 : - 2000 9 ~ 2002 3 : - 2002 4 ~ 2005 2 : University of Washington Senior research engineer - 2005 3 ~ : - :,, - 1980 2 : - 1982 2 : - 1988 5 : Syracuse University / - 1982 3 ~ 1992 8 : - 1986 7 ~ 1986 8 : NSF Supercomputer Center - 1987 5 ~ 1988 5 : Northeast Parallel Architecture Center - 1992 3 ~ 1992 8 : - 1992 9 ~ : - 2002 3 ~ : SC29-Korea MPEG Forum - :,,