(Regular Paper)
JBE Vol. 18, No. 6, November 2013
http://dx.doi.org/10.5909/jbe.2013.18.6.835
ISSN 2287-9137 (Online) ISSN 1226-7953 (Print)

Study of Parallelization Methods for Software-based Real-time HEVC Encoder Implementation

Yong-Jo Ahn a), Tae-Jin Hwang a), Dongkyu Lee b), Sangmin Kim b), Seoung-Jun Oh b), and Dong-Gyu Sim a)

Abstract

The Joint Collaborative Team on Video Coding (JCT-VC), founded by ISO/IEC MPEG and ITU-T VCEG, has standardized High Efficiency Video Coding (HEVC). HEVC standardization started with the goal of at least twice the coding performance of H.264/AVC. However, its flexible, hierarchical coding blocks and recursive coding structure make the encoder computationally demanding. Many fast encoding algorithms have been proposed to reduce the computational complexity of the HEVC encoder, but it is hard to implement a real-time HEVC encoder with those algorithms alone. In this paper, data-level parallelism using SIMD instructions and CPU/GPU multi-threading methods are proposed for the implementation of a software-based real-time HEVC encoder. We also identify the operations and functional modules of the HM 10.0 software to which the proposed methods apply. The proposed methods, implemented on HM 10.0, achieved 20-30 fps for 832×480 sequences and 5-10 fps for 1920×1080 sequences.

Keywords: HEVC, encoder, real-time, parallelization
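I. Introduction

With the spread of FHD (full high definition) and UHD (ultra high definition) video, ISO/IEC MPEG (moving picture experts group) and ITU-T VCEG (video coding expert group) formed the JCT-VC (joint collaborative team on video coding) and standardized HEVC (high efficiency video coding), which targets twice the coding efficiency of H.264/AVC [1][2]. Compared with earlier standards such as H.264/AVC and MPEG-2/4, HEVC codes pictures with CUs (coding units), PUs (prediction units), and TUs (transform units), and a CU can be 64×64, 32×32, 16×16, or 8×8. The encoder evaluates the RDO (rate-distortion optimization) cost recursively from 64×64 down to 8×8 blocks, which makes the HEVC encoder computationally expensive.

Fast encoding algorithms have been proposed to reduce the complexity of the HEVC encoder [3][4][5]. They reduce the complexity of HM by about 50~60%, which is still not enough for a real-time HEVC encoder. This paper therefore proposes, for a software-based real-time HEVC encoder, DLP (data level parallelism) using SIMD (single instruction multiple data) instructions and multi-threading on the CPU and GPU, together with the operations and functional modules to which they apply.

The paper is organized as follows. Section 2 describes data-level parallelization of the HEVC encoder using SIMD, and Section 3 describes parallelization using CPU and GPU multi-threading. Section 4 applies the methods of Sections 2 and 3 to HM and evaluates the resulting HEVC encoder. Section 5 concludes the paper.

a) Dept. of Computer Engineering, Kwangwoon University
b) Dept. of Electronic Engineering, Kwangwoon University
Corresponding Author: Dong-Gyu Sim
E-mail: dgsim@kw.ac.kr
Tel: +82-2-940-5470
[10039199, 3D ]
Manuscript received September 10, 2013; Revised October 23, 2013; Accepted October 23, 2013.

II. Data-level Parallelization of the HEVC Encoder Using SIMD

This section describes data-level parallelization of the HEVC encoder using SIMD. SIMD instructions process several data elements with one instruction, and they have been used to optimize HEVC software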
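[6][7][8]. In the HEVC encoder, SIMD is applicable to the rate-distortion cost functions, the transform and inverse transform, and the interpolation filter. The following subsections describe each in turn.

1. Cost functions

The HEVC encoder computes rate-distortion costs with SAD (sum of absolute difference), SATD (sum of absolute transformed difference), and SSE (sum of square error). Cost computation accounts for about 30% of the HM encoder [9]. Within HM, SAD takes about 10~12% and SATD about 15~16%, while SSE takes about 2~3%. SIMD was therefore applied to SAD and SSE for block sizes from 4×4 to 64×64, and to SATD for 4×4 and 8×8 blocks.

For SAD with 8-bit samples, 16 samples from each of the two blocks are loaded into 128-bit registers and their SAD is computed with PSADBW. A single PSADBW on 128-bit registers processes 16 8-bit samples. Figure 1 illustrates the instruction.

Fig. 1. PSADBW instruction

Fig. 2. 2×2 SATD using butterfly method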
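The SAD and 2×2 SATD computations described above can be sketched in C as follows. This is a minimal illustration, not HM code: the function names `sad_16xN` and `satd_2x2` are hypothetical, 8-bit samples are assumed, and the SSE2 intrinsic `_mm_sad_epu8` (which compiles to PSADBW) is used only when the compiler targets SSE2, with a scalar fallback otherwise.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#if defined(__SSE2__)
#include <emmintrin.h> /* SSE2 intrinsics; _mm_sad_epu8 maps to PSADBW */
#endif

/* SAD of two blocks that are 16 samples wide and `rows` high (8-bit samples).
 * Hypothetical helper for illustration only. */
static unsigned sad_16xN(const uint8_t *org, int org_stride,
                         const uint8_t *cur, int cur_stride, int rows)
{
    assert(rows > 0);
    unsigned sum = 0;
#if defined(__SSE2__)
    __m128i acc = _mm_setzero_si128();
    for (int y = 0; y < rows; y++) {
        __m128i a = _mm_loadu_si128((const __m128i *)(org + y * org_stride));
        __m128i b = _mm_loadu_si128((const __m128i *)(cur + y * cur_stride));
        /* PSADBW: one partial SAD (8 samples) in each 64-bit half */
        acc = _mm_add_epi64(acc, _mm_sad_epu8(a, b));
    }
    /* Add the two partial sums held in the low and high halves */
    sum = (unsigned)_mm_cvtsi128_si32(acc)
        + (unsigned)_mm_cvtsi128_si32(_mm_srli_si128(acc, 8));
#else
    for (int y = 0; y < rows; y++)
        for (int x = 0; x < 16; x++)
            sum += (unsigned)abs(org[y * org_stride + x] - cur[y * cur_stride + x]);
#endif
    return sum;
}

/* 2x2 SATD: Hadamard transform of the difference block via a two-stage
 * butterfly, followed by the sum of absolute transformed values. */
static unsigned satd_2x2(const uint8_t *org, int org_stride,
                         const uint8_t *cur, int cur_stride)
{
    int d0 = org[0] - cur[0];
    int d1 = org[1] - cur[1];
    int d2 = org[org_stride] - cur[cur_stride];
    int d3 = org[org_stride + 1] - cur[cur_stride + 1];
    /* Stage 1: combine the two rows */
    int m0 = d0 + d2, m1 = d1 + d3, m2 = d0 - d2, m3 = d1 - d3;
    /* Stage 2: combine the two columns and accumulate magnitudes */
    return (unsigned)(abs(m0 + m1) + abs(m0 - m1) + abs(m2 + m3) + abs(m2 - m3));
}
```

Larger block widths can be handled by calling the 16-wide kernel on successive 16-sample columns; narrower blocks would need a different load pattern, which this sketch does not cover.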
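PSADBW produces two partial results, each the SAD of 8 samples held in a 16-bit field, and these are added to obtain the SAD of 16 samples. For SATD, HM uses a butterfly structure, shown for the 2×2 case in Figure 2, and SIMD is applied to it. For SSE, samples are handled as 16-bit values, 8 at a time, using PSUBW and PMADDWD.

2. Transform and inverse transform

Like the cost functions in subsection 1, the transform and inverse transform are accelerated with SIMD. HEVC uses quad-tree-structured TUs with transform sizes of 4×4, 8×8, 16×16, and 32×32. HM implements the transform and inverse transform with partial butterflies, and SIMD is applied for each TU size. Figure 3 shows the 4×4 inverse transform using SIMD.

Fig. 3. 4×4 inverse-transform using SIMD instruction

3. Interpolation filter

In the HEVC encoder, the interpolation filter is, together with cost computation and the transform and inverse transform, one of the most complex modules, accounting for about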
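To make the partial butterfly idea concrete, the sketch below compares a direct 4-point transform with its butterfly form, using the HEVC 4-point core transform matrix. It is an illustrative scalar version under stated assumptions: the function names are hypothetical, it shows the forward 1-D stage rather than the inverse transform of Figure 3, and the rounding and shift stages of a real codec are omitted.

```c
#include <assert.h>
#include <stdint.h>

/* HEVC 4-point core transform matrix (integer approximation of the DCT). */
static const int16_t T4[4][4] = {
    {64,  64,  64,  64},
    {83,  36, -36, -83},
    {64, -64, -64,  64},
    {36, -83,  83, -36},
};

/* Direct form: full matrix-vector product, 16 multiplications. */
static void fwd4_direct(const int16_t src[4], int32_t dst[4])
{
    for (int k = 0; k < 4; k++) {
        int32_t sum = 0;
        for (int n = 0; n < 4; n++)
            sum += T4[k][n] * src[n];
        dst[k] = sum;
    }
}

/* Partial butterfly: exploit the even/odd symmetry of the matrix rows.
 * Even outputs use sums of mirrored inputs, odd outputs use differences. */
static void fwd4_butterfly(const int16_t src[4], int32_t dst[4])
{
    int32_t e0 = src[0] + src[3];
    int32_t o0 = src[0] - src[3];
    int32_t e1 = src[1] + src[2];
    int32_t o1 = src[1] - src[2];
    dst[0] = 64 * (e0 + e1);
    dst[2] = 64 * (e0 - e1);
    dst[1] = 83 * o0 + 36 * o1;
    dst[3] = 36 * o0 - 83 * o1;
}

/* Self-check: both forms must agree on the same input vector. */
static void fwd4_selfcheck(const int16_t src[4])
{
    int32_t a[4], b[4];
    fwd4_direct(src, a);
    fwd4_butterfly(src, b);
    for (int k = 0; k < 4; k++)
        assert(a[k] == b[k]);
}
```

The additions and subtractions of the butterfly act on independent rows or columns of a block, so they are the kind of operation that maps onto packed SIMD arithmetic.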
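30%. HEVC uses 8-tap and 7-tap DCT-IF (DCT-based interpolation filters) for luma samples and a 4-tap DCT-IF for chroma samples. The DCT-IF coefficients are derived from the DCT. Because the DCT-IF has more taps, interpolation is more complex in HEVC [10]. SIMD DLP is therefore applied to the horizontal and vertical interpolation filters. With the 8-tap filter, 15 input samples are needed to produce 8 output samples in the horizontal direction, as shown in Figure 4.

Fig. 4. Horizontal directional interpolation filter using SIMD instruction

Fig. 5. Vertical directional interpolation filter using SIMD instruction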
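The horizontal 8-tap filtering discussed above can be illustrated with a scalar C sketch. It assumes 8-bit samples, the HEVC luma half-sample taps, and a single filtering pass normalized by 64; the function name `interp_h_halfpel` is hypothetical, and the two-pass intermediate precision used by a full codec is not modeled. A SIMD version would compute several of the inner sums at once.

```c
#include <assert.h>
#include <stdint.h>

/* HEVC luma half-sample DCT-IF taps; they sum to 64. */
static const int8_t kHalfPelTaps[8] = { -1, 4, -11, 40, 40, -11, 4, -1 };

/* Filters `width` output samples in the horizontal direction.
 * src must have 3 valid samples to the left of src[0] and 4 valid samples
 * to the right of src[width - 1], i.e. width + 7 inputs in total
 * (15 inputs for 8 outputs). */
static void interp_h_halfpel(const uint8_t *src, uint8_t *dst, int width)
{
    assert(width > 0);
    for (int x = 0; x < width; x++) {
        int sum = 0;
        for (int k = 0; k < 8; k++)
            sum += kHalfPelTaps[k] * src[x + k - 3];
        sum += 32;            /* rounding offset for the divide by 64 */
        if (sum < 0)
            sum = 0;          /* clip low before shifting */
        sum >>= 6;            /* normalize by 64 */
        dst[x] = (uint8_t)(sum > 255 ? 255 : sum);
    }
}
```

Feeding an impulse through the filter returns the taps in reversed order (with negative values clipped to 0), which is a quick way to check the indexing.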
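For the vertical filter, shown in Figure 5, samples are handled as 8 16-bit values per register. The filter taps are applied with PMULLW and the products are accumulated with PADDW.

III. Parallelization of the HEVC Encoder Using CPU/GPU Multi-threading

This section describes parallelization of the HEVC encoder with CPU and GPU multi-threading, which complements the SIMD DLP of Section 2. Subsection 1 covers CPU multi-threading with OpenMP. Subsection 2 covers GPU multi-threading with CUDA, in two forms: GPU-based motion estimation, and GPU-based motion estimation with block partition decision.

1. CPU multi-threading with OpenMP

OpenMP is an API for CPU multi-threading based on the fork-join model, and it is used here for CPU multi-threading of the HEVC encoder. The HEVC version 1 standard (January 2013) supports parallel processing tools such as WPP (wave-front parallel processing) in its MP (main profile). In this paper, the HEVC encoder is parallelized at the slice and tile level. Figure 6 shows the thread allocation and coding order for slice-level and tile-level parallel processing.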
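The fork-join pattern for slice-level or tile-level parallel encoding can be sketched with an OpenMP loop as below. Everything in this sketch is an assumption for illustration: `SliceJob`, `split_into_slices`, `encode_ctu_stub`, and `encode_picture` are hypothetical names, and the CTU "encoding" is a placeholder computation rather than HM's encoder. Without OpenMP support the pragma is ignored and the loop runs serially with the same result.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-slice (or per-tile) work item. Each thread touches only
 * its own item, so no locking is needed. */
typedef struct {
    int  first_ctu;  /* index of the first CTU in this slice */
    int  num_ctus;   /* number of CTUs in this slice */
    long cost;       /* placeholder result produced by the thread */
} SliceJob;

/* Split num_ctus CTUs into num_slices contiguous ranges of near-equal size. */
static void split_into_slices(int num_ctus, int num_slices, SliceJob *jobs)
{
    assert(num_slices > 0 && num_ctus >= 0);
    for (int s = 0; s < num_slices; s++) {
        int first = (int)((long)s * num_ctus / num_slices);
        int next  = (int)((long)(s + 1) * num_ctus / num_slices);
        jobs[s].first_ctu = first;
        jobs[s].num_ctus  = next - first;
        jobs[s].cost      = 0;
    }
}

/* Placeholder for the real per-CTU encoding work. */
static long encode_ctu_stub(int ctu_index)
{
    return (long)ctu_index * 3 + 1;
}

/* Fork: one loop iteration per slice, distributed over the threads.
 * Join: implicit barrier at the end of the parallel loop. */
static long encode_picture(SliceJob *jobs, int num_slices)
{
    long total = 0;
    #pragma omp parallel for schedule(static) reduction(+:total)
    for (int s = 0; s < num_slices; s++) {
        long cost = 0;
        for (int i = 0; i < jobs[s].num_ctus; i++)
            cost += encode_ctu_stub(jobs[s].first_ctu + i);
        jobs[s].cost = cost;
        total += cost;
    }
    return total;
}
```

In a real encoder each iteration would also need its own encoder state, since objects that hold per-block context cannot be shared between threads.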
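Fig. 6. Threads allocation and coding order according to parallel processing of slice and tile ((a) 4 slices, (b) 4 tiles)

Within each slice or tile, CTUs are coded in Z-scan order. To apply OpenMP to HM, each thread needs its own instances of the classes listed in Table 1. In HM, TEncTop owns the encoder objects of each level (GOP, slice, CU), and the classes in Table 1 are those required to encode a CU. For CABAC in particular, the CABAC instances used for RDO must also be kept separately.

Table 1. Required classes and roles of instances for parallel processing of slice and tile
Classes (legible role in parentheses): TEncSearch, TComTrQuant, TComRdCost (RD-cost), TEncEntropyCoder, TEncCavlc (CAVLC), TEncSbac (CABAC), TEncBinCABAC (Bin CABAC), TBinCABACCounter (Bin), TEncBitCounter, TEncSlice, TEncCu (CU)

2. GPU multi-threading

Compared with the CPU (central processing unit), the GPU (graphics processing unit) has many more cores. GPUs were designed for 3D graphics, and they are now also used for general computation, known as GPGPU (general-purpose computing on GPU). NVIDIA provides CUDA (compute unified device architecture) for GPU programming [11].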
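Fig. 7. HEVC encoder flow chart using GPU-based motion estimation

GPU-based video coding, including motion estimation, has been studied in [12][13][14]. Motion estimation accounts for about 70~80% of HM, so it is the main target for GPU processing. Using the GPU also requires data transfer between the CPU and the GPU, which must be taken into account.

In the proposed HEVC encoder, the GPU performs motion estimation for every PU, covering all CU and PU sizes of HEVC. HEVC's AMVP (advanced motion vector prediction) derives the MVP (motion vector predictor) of each PU within a CTU from neighboring blocks, which creates dependencies between blocks.

Fig. 8. HEVC encoder flow chart using GPU-based motion estimation and block partition decision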
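Because of these dependencies, the AMVP of each CU is not used on the GPU. The MVP is set per CTU rather than per CU: the TMVP (temporal MVP) of the co-located CTU is used as the MVP of the CTU, so all PUs in a CTU share the same MVP. Motion estimation uses full search, and the SADs of larger PUs are obtained from the SADs of smaller PUs by hierarchical SAD computing with CPR (concurrent parallel reduction) [15]. In addition, the interpolation filter of [16] is performed on the GPU.

Figure 7 shows the HEVC encoder using GPU-based motion estimation. The GPU performs motion estimation, and the CPU performs the remaining encoding, including RDO. Figure 8 shows the encoder using GPU-based motion estimation and block partition decision, in which the GPU also decides the block partition of each CTU and the CPU performs RDO on the result.

IV. Experimental Results

The proposed methods were implemented in the HM 10.0 software: SIMD instructions in C, CPU multi-threading with OpenMP, and GPU multi-threading with CUDA. Table 2 shows the test environment, and the HEVC common test conditions [17] were used. Encoding speed was measured on a PC in FPS (frames per second). The speed gain is reported as ATS (average time saving), defined in (1),

$$\mathrm{ATS}\,(\%) = \frac{Etime_{anchor} - Etime_{proposed}}{Etime_{anchor}} \times 100 \qquad (1)$$

where $Etime_{anchor}$ is the encoding time of the anchor and $Etime_{proposed}$ is the encoding time of the proposed encoder.

Table 3 shows the results of applying SIMD. SIMD gave an average ATS of 45.92%. With SIMD, the SAD, SATD, and SSE cost functions became about 4 times faster, and the partial butterfly of the transform and inverse transform became about 3~4 times faster.

Table 2. Test environment for parallel processing of HEVC encoder
CPU            Intel(R) Core(TM) i7-3960X
CPU cores      6 (12 threads: hyper threading)
Clock          3.3GHz
Memory         48 GB (DDR3)
OS             Windows 8 (64-bit)
               Intel 64-bit
GPU            NVIDIA GeForce GTX 660 Ti
CUDA           CUDA toolkit 5.0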
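As a worked example of the ATS definition, the helper below evaluates it for one anchor/proposed pair. The function name `ats_percent` is hypothetical. With the Kimono QP 22 encoding times from Table 3 (1289.20 for the anchor and 754.64 with SIMD), it gives roughly 41.46%, matching the table entry.

```c
#include <assert.h>

/* ATS in percent: share of the anchor encoding time that was saved.
 * etime_anchor and etime_proposed must be in the same time unit. */
static double ats_percent(double etime_anchor, double etime_proposed)
{
    assert(etime_anchor > 0.0);
    return (etime_anchor - etime_proposed) / etime_anchor * 100.0;
}
```

An ATS of 50% corresponds to halving the encoding time, so the reported average of 45.92% is slightly less than a 2 times speed-up.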
Table 3. Average time saving of HEVC encoder according to DLP using SIMD instructions

Sequence         QP   Bitrate    PSNR Y  PSNR U  PSNR V  Enc. time (Anchor)  Enc. time (SIMD)  ATS (%)
Kimono           22    6139.30   41.79   43.13   45.01   1289.20              754.64           41.46
                 27    2929.16   39.74   41.59   43.15   1074.03              596.68           44.44
                 32    1432.65   37.17   40.34   41.89    917.22              488.12           46.78
                 37     720.94   34.64   39.54   41.12    806.08              416.72           48.30
ParkScene        22    7869.32   39.92   42.17   43.81   1020.61              611.21           40.11
                 27    3356.52   37.41   40.35   41.70    830.00              462.18           44.32
                 32    1514.52   34.82   38.81   40.33    732.87              394.26           46.20
                 37     694.68   32.38   37.73   39.51    673.51              356.25           47.10
Cactus           22   19358.89   38.39   39.92   43.39   2398.68             1395.79           41.81
                 27    6042.43   36.66   38.93   41.61   1805.29              976.53           45.91
                 32    2814.72   34.70   38.08   40.07   1566.70              800.34           48.92
                 37    1438.38   32.48   37.39   38.87   1440.33              712.27           50.55
BasketballDrive  22   12897.72   39.62   44.41   45.55   2768.25             1531.77           44.67
                 27    4743.87   38.20   43.19   43.60   2226.22             1205.80           45.84
                 32    2297.09   36.49   41.89   41.72   1937.03              982.95           49.25
                 37    1226.28   34.55   40.88   40.37   1721.24              866.85           49.64
BQTerrace        22   38225.57   37.92   42.24   44.27   3615.34             2163.19           40.17
                 27    8414.34   35.44   40.65   42.74   2494.89             1343.65           46.14
                 32    2600.21   33.63   39.26   41.49   2062.09             1072.28           48.00
                 37    1044.15   31.48   38.25   40.54   1892.58              969.91           48.75
Average ATS (%)                                                                                45.92

Table 4. Speed up of integer-pel motion estimation execution time using GPU multi-threading

Sequence         QP   HM full search (ms)  GPU (ms)  Speed up   HM fast search (ms)  GPU (ms)  Speed up
Kimono           22   10190.34             28.06     363.2      380.72               28.06     13.6
                 27   10180.38             28.1      362.3      368.092              28.1      13.1
                 32   10173.52             27.96     363.9      359.254              27.96     12.8
                 37   10154.73             27.8      365.3      348.599              27.8      12.5
ParkScene        22   10175.29             28.7      354.5      271.064              28.7       9.4
                 27   10164.8              28.01     362.9      266.936              28.01      9.5
                 32   10280.91             27.83     369.4      262.726              27.83      9.4
                 37   10263.1              27.88     368.1      260.37               27.88      9.3
Cactus           22   10301.65             28.08     366.9      292.141              28.08     10.4
                 27   10171.86             27.9      364.6      269.928              27.9       9.7
                 32   10157.17             27.91     363.9      262.709              27.91      9.4
                 37   10164.69             27.79     365.8      256.911              27.79      9.2
BasketballDrive  22   10182.83             27.95     364.3      456.531              27.95     16.3
                 27   10173.28             28.15     361.4      414.755              28.15     14.7
                 32   10161.77             27.8      365.5      380.798              27.8      13.7
                 37   10160.24             27.8      365.5      351.927              27.8      12.7
BQTerrace        22   10288.54             28.11     366.0      296.03               28.11     10.5
                 27   10152.8              28.1      361.3      285.393              28.1      10.2
                 32   10150.93             27.95     363.2      290.435              27.95     10.4
                 37   10138.75             27.79     364.8      266.926              27.79      9.6
Average               10189.38             27.98     364.14     296.9                27.98     11.33
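Figure 9 shows the speed-up of encoding time from CPU multi-threading according to the number of slices and tiles. With 4 slices or tiles the encoder was about 3.2 times faster, with 8 about 4.5 times faster, and with 16 about 5.5 times faster.

For GPU multi-threading, the anchor is HM 10.0, and the GPU measurement was made on a GeForce GTX 470. Table 4 compares the integer-pel motion estimation time of HM and of the GPU. The GPU is about 364 times faster than the HM full search and about 11 times faster than the HM fast search.

Finally, SIMD DLP and CPU/GPU multi-threading were applied together to the HEVC encoder. SIMD DLP was applied to the cost functions, the transform and inverse transform, and the interpolation filter. GPU multi-threading with CUDA and CPU multi-threading with OpenMP were applied as described in Section 3. Table 5 shows the encoding speed of the resulting HEVC encoder under the LDP (low-delay P) configuration.

Fig. 9. Speed up of encoding time according to the number of slices and tiles ((a), (b))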
Table 5. Encoding time of HEVC encoder with proposed parallel processing methods

                                    FPS
Class  Sequence         Frames   QP 22   QP 27   QP 32   QP 37
B      Kimono           240       5.74    7.25    8.38    9.40
B      ParkScene        240       5.51    7.52    8.87   10.03
B      Cactus           500       5.19    7.70    9.09   10.09
B      BasketballDrive  500       4.80    6.71    8.09    9.18
B      BQTerrace        600       4.14    7.68    9.60   10.62
C      BasketballDrill  500      14.86   19.07   23.60   28.12
C      BQMall           600      14.81   19.88   24.91   29.20
C      PartyScene       500      11.09   16.46   22.03   27.60
C      RaceHorses       300      10.48   14.60   19.46   24.49

QP values of 22, 27, 32, and 37 were used, and the CTU size was 32×32.

V. Conclusion

This paper proposed parallelization methods for a software-based real-time HEVC encoder. SIMD DLP was applied to the cost functions, the transform and inverse transform, and the interpolation filter, and it made the HEVC encoder about 2 times faster. In addition, multi-threading was applied on the CPU with OpenMP and on the GPU with CUDA.
With OpenMP CPU multi-threading, the encoder was about 3.5 times faster with 4 threads and about 6 times faster with 16. The resulting HEVC encoder ran at 10~30 fps for 832×480 sequences and 5~10 fps for 1920×1080 sequences. Further study is needed on SIMD DLP for modules other than the cost functions, transform, and interpolation filter, on the use of the CPU and GPU together, and on real-time HEVC encoding beyond 832×480, at 2K and 4K resolutions.

References

[1] G. J. Sullivan, J.-R. Ohm, "Recent developments in standardization of high efficiency video coding (HEVC)," SPIE Application of Digital Image Proc. XXXIII, vol. 7798, pp. 7798-30, Aug. 2010.
[2] JCT-VC, "Report of subjective test results of responses to the joint call for proposals (CfP) on Video coding technology for high efficiency video coding (HEVC)," Document JCTVC-A204, Dresden, DE, Apr. 2010.
[3] R. H. Gweon, Y.-L. Lee, J. Lim, "Early termination of CU encoding to reduce HEVC complexity," ITU-T/ISO/IEC Joint Collaborative Team on Video Coding (JCT-VC) document JCTVC-F045, Jul. 2011.
[4] K. Choi, E. S. Jang, "Coding tree pruning based CU early termination," ITU-T/ISO/IEC Joint Collaborative Team on Video Coding (JCT-VC) document JCTVC-F092, Jul. 2011.
[5] J. Wang, J. Kim, K. Won, H. Lee, B. Jeon, "Early skip detection for HEVC," ITU-T/ISO/IEC Joint Collaborative Team on Video Coding (JCT-VC) document JCTVC-G543, Jul. 2011.
[6] L. Yan, Y. Duan, J. Sun, and Z. Guo, "Implementation of HEVC decoder on x86 processor with SIMD optimization," CVIP, pp. 1-6, Nov. 2012.
[7] K. Chen, Y. Duan, L. Yan, J. Sun, and Z. Guo, "Efficient SIMD optimization of HEVC encoder over x86 processor," APSIPA ASC 2012 Asia-pacific, pp. 1-4, Dec. 2012.
[8] M. Alvarez-Mesa, C. C. Chi, V. George, T. Schierl, and B. Juurlink, "Parallel video decoding in the emerging HEVC standard," Proceedings of International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2012), Kyoto, Japan, March 2012.
[9] Y. J. Ahn, T. J. Hwang, S. E. Yoo, W.-J. Han, and D. G. Sim, "Statistical characteristics and complexity analysis of HEVC encoder software," Journal of Broadcasting & Electronic Media, vol. 17, no. 6, pp. 1091-1105, Nov. 2012.
[10] Y. J. Ahn, W. J. Han, and D. G. Sim, "Study of decoder complexity for HEVC and AVC standards based on tool-by-tool comparison," SPIE Applications of Digital Image Proc. XXXV, vol. 8499, pp. 8499-32, Aug. 2012.
[11] NVIDIA, "CUDA C programming guide," document PG-02829-001_v5.0, Oct. 2012.
[12] W.-N. Chen and H.-M. Hang, "H.264/AVC motion estimation implementation on compute unified device architecture (CUDA)," IEEE International Conference on Multimedia and Expo 2008 (ICME'08), pp. 697-700, April 2008.
[13] N.-M. Cheung, X. Fan, O. C. Au, and M.-C. Kung, "Video coding on multi-core graphics processors," in Proc. IEEE Signal Process. Mag., 2010, pp. 78-89.
[14] Z. Jing, J. Liangbao, and C. Xuehong, "Implementation of parallel full search algorithm for motion estimation on multi-core processors," The 2nd International Conference on Next Generation Information Technology (ICNIT), pp. 31-35, June 2011.
[15] D.-K. Lee and S.-J. Oh, "Variable block size motion estimation implementation on compute unified device architecture (CUDA)," IEEE International Conference on Consumer Electronics, pp. 635-636, Jan. 2013.
[16] S. Kim, D. Lee, Y. Ahn, T.-J. Hwang, D. Sim, and S.-J. Oh, "DCT-based interpolation filter for HEVC on Graphics processing units," International Technical Conference on Circuits/Systems, Computers and Communications (ITC-CSCC), pp. 155-158, July 2013.
[17] F. Bossen, "Common test conditions and software reference configuration," ITU-T/ISO/IEC Joint Collaborative Team on Video Coding (JCT-VC) document JCTVC-G1200, Nov. 2011.
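About the Authors

Yong-Jo Ahn
- Feb. 2010 : B.S., Dept. of Computer Engineering, Kwangwoon University
- Feb. 2012 : M.S., Dept. of Computer Engineering, Kwangwoon University
- Mar. 2012 ~ present : Ph.D. student, Dept. of Computer Engineering, Kwangwoon University
- Research interests : video compression, optimization and parallelization

Tae-Jin Hwang
- Feb. 2012 : B.S., Dept. of Computer Engineering, Kwangwoon University
- Mar. 2012 ~ present : M.S. student, Dept. of Computer Engineering, Kwangwoon University
- Research interests : video compression, multimedia systems

Dongkyu Lee
- Feb. 2012 : B.S., Dept. of Electronic Engineering, Kwangwoon University
- Mar. 2012 ~ present : M.S. student, Dept. of Electronic Engineering, Kwangwoon University
- Research interests : video compression, computer vision

Sangmin Kim
- Feb. 2013 : B.S., Dept. of Electronic Engineering, Kwangwoon University
- Mar. 2013 ~ present : M.S. student, Dept. of Electronic Engineering, Kwangwoon University
- Research interests : video compression, optimization and parallelization

Seoung-Jun Oh
- Feb. 1980 : B.S., Dept. of Electronic Engineering, Seoul National University
- Feb. 1982 : M.S., Dept. of Electronic Engineering, Seoul National University
- May 1988 : Ph.D., Electrical/Computer Engineering, Syracuse University, USA
- Mar. 1982 ~ Aug. 1992 : Head of the Multimedia Research Lab, Electronics and Telecommunications Research Institute
- Jul. 1986 ~ Aug. 1986 : Invited student researcher, NSF Supercomputer Center
- May 1987 ~ May 1988 : Student researcher, Northeast Parallel Architecture Center
- Mar. 1992 ~ Aug. 1992 : Adjunct professor, School of Computer Engineering, Chungnam National University
- Sep. 1992 ~ present : Professor, Dept. of Electronic Engineering, Kwangwoon University
- Mar. 2002 ~ present : Chair of SC29-Korea and Vice-chair of MPEG Forum
- Research interests : video data processing, video compression, multimedia systems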
Dong-Gyu Sim
- Feb. 1993 :
- Feb. 1995 :
- Feb. 1999 :
- Mar. 1999 ~ Aug. 2000 :
- Sep. 2000 ~ Mar. 2002 :
- Apr. 2002 ~ Feb. 2005 : Senior research engineer, University of Washington
- Mar. 2005 ~ present :
- Research interests :