Special Paper
JBE Vol. 24, No. 3, May 2019
https://doi.org/10.5909/jbe.2019.24.3.387
ISSN 2287-9137 (Online) ISSN 1226-7953 (Print)

Audio High-Band Coding based on Autoencoder with Side Information

Hyo-Jin Cho a), Seong-Hyeon Shin a), Seung Kwon Beack b), Taejin Lee b), and Hochong Park a)

Abstract

In this study, a new method of audio high-band coding based on an autoencoder with side information is proposed. The proposed method operates in the MDCT domain and improves performance by using additional side information consisting of the previous and current low bands, unlike a conventional autoencoder, which takes only the information to be encoded as input. Moreover, because the side information is given in a time-frequency domain, the high-band coder can utilize temporal characteristics of the signal. In the proposed method, the encoder transmits a 4-dimensional latent vector computed by the autoencoder and a gain variable using 12 bits for each frame. The decoder reconstructs the high band by applying the decoded low bands in the previous and current frames and the transmitted information to the autoencoder. Subjective evaluation confirms that the proposed method provides performance equivalent to that of the SBR at approximately half the bit rate of the SBR.

Keyword : autoencoder, neural network, audio high-band coding, side information

a) Dept. of Electronics Engineering, Kwangwoon University
b) Electronics and Telecommunications Research Institute
Corresponding Author : Hochong Park
E-mail: hcpark@kw.ac.kr
Tel: +82-2-940-5104
ORCID: https://orcid.org/0000-0003-1600-6610
This research was supported by the Research Grant of Kwangwoon University in 2018 and by an Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIP) (No. 2017-0-00072002, Development of audio/video coding and light field media fundamental technologies for ultra realistic tera-media).
Manuscript received March 15, 2019; Revised April 30, 2019; Accepted April 30, 2019.
Copyright 2016 Korean Institute of Broadcast and Media Engineers. All rights reserved. This is an Open-Access article distributed under the terms of the Creative Commons BY-NC-ND (http://creativecommons.org/licenses/by-nc-nd/3.0) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited and not altered.
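Audio coding has long been standardized, as in ISO/IEC 11172-3 [1]. At low bit rates, the high band is commonly represented by a small amount of parametric information rather than coded directly [2]. A representative method is spectral band replication (SBR) [2]. SBR operates in the quadrature mirror filter (QMF) domain and reconstructs the high band from the low band using information such as the time-frequency envelope and tonality. Because SBR depends on the QMF, high-band coding that works without the QMF has also been studied [3].

Neural networks have also been applied to audio bandwidth extension [4,5], using recurrent neural networks and convolutional neural networks (CNN). The autoencoder is a neural network that reduces the dimensionality of its input [6].

In this paper, a high-band coding method based on an autoencoder with side information is proposed. The proposed method operates in the MDCT (modified discrete cosine transform) domain, whereas SBR operates in the QMF domain, so it can be applied directly to the MDCT coefficients. The proposed method provides performance equivalent to that of SBR at about 1/2 of the SBR bit rate.

Fig. 1 shows the basic structure of an autoencoder, in which the input is passed through hidden layers.

Fig. 1. Basic structure of autoencoder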
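An autoencoder compresses its input into a low-dimensional latent vector. The encoding network converts the input into the latent vector, and the decoding network reconstructs the input from the latent vector, as in Fig. 1.

Fig. 2 shows the structure of the proposed autoencoder. The frame length is 1024 samples, and a 2048-point MDCT with 50% overlap yields 1024 MDCT coefficients per frame. The coded bandwidth is 14.25 kHz, and the high band is 9.75 ~ 14.25 kHz. At a sampling rate of 48 kHz, the full band corresponds to 608 MDCT coefficients and the high band to 192 MDCT coefficients.

Unlike the basic structure in Fig. 1, the network in Fig. 2 has two encoding networks. The first takes the 192 high-band MDCT coefficients and converts them, through 3 FCN (fully-connected network) layers, into a 4-dimensional latent vector X [7]. The second takes the side information, which consists of the MDCT coefficients in 3.75 ~ 9.75 kHz of the 7 previous frames and the current frame, arranged as 8 x 256 two-dimensional (2D) data. The band below 3.75 kHz is not included. The side information passes through 3 2D CNN layers and 1 flatten and FCN layer to give a 10-dimensional latent vector Y [7]. The 2D CNN allows the network to use the time-frequency structure of the side information. The latent vectors X and Y are combined into a 14-dimensional latent vector, which is input to the decoding network.

Fig. 2. Structure of the proposed autoencoder

Table 1. Detail of network structure in the proposed method

Encoding network for high band
layer   function                 output dim.
in      high-band MDCT coeff.    192
1       FCN, GLU                 96
2       FCN, GLU                 24
3       FCN, sigmoid             4
out     latent vector            4

Encoding network for side information
layer   function                 output dim.     filters   kernel   stride
in      side-info. MDCT coeff.   8 x 256         -         -        -
1       2D CNN, GLU              4 x 128 x 32    32        [5,5]    [2,2]
2       2D CNN, GLU              2 x 64 x 64     64        [5,5]    [2,2]
3       2D CNN, GLU              1 x 32 x 128    128       [5,5]    [2,2]
4       flatten, FCN, sigmoid    10              -         -        -
out     latent vector            10              -         -        -

Decoding network
layer   function                 output dim.
in      latent vector            14
1       FCN, GLU                 32
2       FCN, GLU                 96
3       FCN, sigmoid             192
out     high-band MDCT coeff.    192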
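The coefficient counts and layer shapes quoted in the text and in Table 1 can be checked with a little arithmetic. The following is a minimal sketch, not the authors' code: the helper names `num_bins` and `conv_out` are hypothetical, and the "same"-style padding rule (output size = ceil(input size / stride)) is an assumption inferred from the output dimensions listed in Table 1.

```python
import math

SAMPLE_RATE_HZ = 48_000      # sampling rate stated in the text
COEFFS_PER_FRAME = 1024      # MDCT coefficients per frame (2048-point MDCT, 50% overlap)

# Each MDCT coefficient covers (fs / 2) / 1024 Hz of the spectrum.
BIN_WIDTH_HZ = (SAMPLE_RATE_HZ / 2) / COEFFS_PER_FRAME


def num_bins(low_hz: float, high_hz: float) -> int:
    """Number of MDCT coefficients between two band edges (hypothetical helper)."""
    return round((high_hz - low_hz) / BIN_WIDTH_HZ)


def conv_out(size: int, stride: int) -> int:
    """Output length of a strided convolution, ASSUMING 'same'-style padding."""
    return math.ceil(size / stride)


full_band = num_bins(0, 14_250)        # coded bandwidth
high_band = num_bins(9_750, 14_250)    # band coded by the autoencoder
side_band = num_bins(3_750, 9_750)     # low-band range used as side information

# Side-information encoder: 8 frames x side_band bins, three stride-[2,2] CNN layers.
shape = (8, side_band)
shapes = []
for n_filters in (32, 64, 128):
    shape = (conv_out(shape[0], 2), conv_out(shape[1], 2))
    shapes.append((shape[0], shape[1], n_filters))

# Size of the vector that the flatten step hands to the final FCN layer.
flattened = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]

print(BIN_WIDTH_HZ, full_band, high_band, side_band)
print(shapes, flattened)
```

Under these assumptions the band edges fall exactly on coefficient boundaries, which is consistent with the choice of 3.75, 9.75 and 14.25 kHz as band limits.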
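Because the latent vector Y is computed from the low bands, which are also available in the decoder, only the latent vector X needs to be transmitted. The dimension of Y is set to 10. The final layer of each network uses the sigmoid function, and the other 3 layers use the GLU (gated linear unit) [8]. As shown in Fig. 3, the GLU computes z from the output of the previous layer h_{t-1} with the weight W and bias b, and z is passed through the tanh and sigmoid functions, whose outputs are combined to give the GLU output h_t.

Fig. 3. GLU structure

In training the network of Fig. 2, the inputs are the 192 high-band MDCT coefficients and the 8 x 256 side information, and the target output is the 192 high-band MDCT coefficients. The network is trained with the ADAM optimizer [9].

The 4-dimensional latent vector X is quantized before transmission. Fig. 4 shows a 2D scatter diagram of the latent vector X. The quantizer for X is an 8-bit codebook designed with the k-means algorithm.

Fig. 4. 2D scatter diagram of latent vector X

The autoencoder operates on the magnitudes of the MDCT coefficients, and the sign of each MDCT coefficient is handled separately.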
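The MDCT magnitudes input to the networks are normalized to the range 0 ~ 1. The gain G needed to restore the scale of the high-band MDCT coefficients is therefore transmitted together with the latent vector. G is quantized with 4 bits using a k-means codebook.

In the encoder, the 192 high-band MDCT coefficients are converted into the 4-dimensional latent vector X, which is quantized with 8 bits. Together with the 4-bit gain G, 12 bits are transmitted per frame, which corresponds to a bit rate of 0.56 kbps.

In the decoder, the decoded low-band MDCT coefficients of the 7 previous frames and the current frame are converted into the 10-dimensional latent vector Y. Y is combined with the transmitted 4-dimensional X into a 14-dimensional latent vector, which is input to the decoding network, and the output is scaled by G to give the high-band MDCT magnitudes. The signs of the high-band MDCT coefficients are obtained from the decoded low-band MDCT coefficients, in a manner similar to intelligent gap filling (IGF) [3].

The network was trained with the VCTK (voice cloning toolkit) speech database [10] and the RWC (real world computing) music database [11]. The test data are the 12 items used in the standardization of USAC (unified speech and audio coding), 4 in each of the 3 categories speech, speech-over-music (SoM), and music [12]. The low band is coded by USAC at 48 kbps [13], which codes the MDCT coefficients up to 9.75 kHz. The experiment considers frames coded with the long window.

The proposed method is compared with SBR. In the SBR configuration used, the SBR start frequency is 10.125 kHz [13], so SBR codes the 10.125 ~ 14.25 kHz band. The bit rate of SBR is 1.08 kbps, which is about 2 times that of the proposed method. In both cases the low band is coded by the 48 kbps USAC.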
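The 0.56 kbps figure follows from the frame rate and the per-frame bit allocation. The short sketch below only reproduces that arithmetic; the split into 8 bits for the latent vector X and 4 bits for the gain G is taken from the text, and the variable names are illustrative, not the authors'.

```python
SAMPLE_RATE_HZ = 48_000
FRAME_LENGTH = 1024          # new samples per frame (hop size of the 50%-overlap MDCT)

BITS_LATENT_X = 8            # quantized 4-dimensional latent vector X
BITS_GAIN_G = 4              # quantized gain G

bits_per_frame = BITS_LATENT_X + BITS_GAIN_G
frames_per_second = SAMPLE_RATE_HZ / FRAME_LENGTH
proposed_bps = bits_per_frame * frames_per_second

SBR_BPS = 1080.0             # SBR bit rate quoted in the text (1.08 kbps)
ratio = proposed_bps / SBR_BPS

print(f"{bits_per_frame} bits/frame at {frames_per_second} frames/s "
      f"= {proposed_bps / 1000:.2f} kbps ({ratio:.2f} of the SBR bit rate)")
```

The ratio comes out slightly above one half, which matches the paper's wording of "approximately half the bit rate of the SBR".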
Fig. 5 shows the spectrograms of the original signal, the signal decoded by the proposed method, and the signal decoded by SBR. The proposed method reconstructs the high band at about 1/2 of the SBR bit rate.

Fig. 5. Spectrogram of test data (a) original, (b) decoded signal by proposed method and (c) decoded signal by SBR

Subjective quality was evaluated by a MUSHRA test that included a 3.5 kHz low-pass anchor [14]. Scores are given in the range 0 ~ 100. Fig. 6 shows the results with 95% confidence intervals. The proposed method provides performance equivalent to that of SBR while using about 1/2 of the SBR bit rate.

Fig. 6. Result of MUSHRA test

In this paper, a method of audio high-band coding based on an autoencoder with side information was proposed. The proposed method transmits a 4-dimensional latent vector and a gain at a bit rate of 0.56 kbps, and provides performance equivalent to that of SBR at about 1/2 of the SBR bit rate.

References

[1] ISO/IEC 11172-3, "Coding of moving pictures and associated audio for digital storage media at up to about 1.5 Mbit/s - Part 3," 1993.
[2] M. Dietz, L. Liljeryd, K. Kjörling, and O. Kunz, "Spectral band replication, a novel approach in audio coding," 112th Conv. Audio Eng. Soc., May 2002.
[3] C. R. Helmrich, et al., "Spectral envelope reconstruction via IGF for audio transform coding," Proc. of IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Brisbane, Australia, pp. 389-393, 2015.
[4] L. Jiang, R. Hu, X. Wang, W. Tu, and M. Zhang, "Nonlinear prediction with deep recurrent neural networks for non-blind audio bandwidth extension," China Communications, vol. 15, no. 1, pp. 72-85, Jan. 2018.
[5] K. Schmidt and B. Edler, "Blind bandwidth extension based on convolutional and recurrent deep neural networks," Proc. of IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Calgary, Canada, pp. 5444-5448, 2018.
[6] G. E. Hinton and R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786, pp. 504-507, 2006.
[7] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436-444, 2015.
[8] Y. N. Dauphin, et al., "Language modeling with gated convolutional networks," Proc. of the 34th Int. Conf. on Machine Learning, vol. 70, Sydney, Australia, pp. 933-941, 2017.
[9] D. P. Kingma and J. L. Ba, "Adam: A method for stochastic optimization," Proc. of Int. Conf. on Learning Representations, San Diego, USA, 2015.
[10] C. Veaux, et al., "Superseded-CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit," 2016.
[11] M. Goto, "Development of the RWC music database," Proc. of Int. Congress on Acoustics, vol. 1, pp. 553-556, April 2004.
[12] ISO/IEC JTC1/SC29/WG11 N9927, "Workplan for subjective testing of Unified Speech and Audio Coding proposals," April 2008.
[13] S. Beack, et al., "Single-mode-based Unified Speech and Audio Coding by extending the linear prediction domain coding mode," ETRI Journal, vol. 39, no. 3, pp. 310-318, 2017.
[14] ITU-R BS.1534-3, "Method for the subjective assessment of intermediate quality level of audio systems," 2015.
Hyo-Jin Cho
- ORCID : http://orcid.org/0000-0003-2296-2270

Seong-Hyeon Shin
- ORCID : http://orcid.org/0000-0002-2343-8983

Seung Kwon Beack
- ORCID : https://orcid.org/0000-0002-6254-2062

Taejin Lee
- 2002 ~ 2003 : Tokyo Denki University
- 2000 ~ : ETRI

Hochong Park
- Dec. 1987 : Univ. of Wisconsin-Madison
- May 1993 : Univ. of Wisconsin-Madison
- ORCID : https://orcid.org/0000-0003-1600-6610