
Regular Paper
JBE Vol. 23, No. 6, November 2018
https://doi.org/10.5909/jbe.2018.23.6.855
ISSN 2287-9137 (Online) ISSN 1226-7953 (Print)

Dual CNN Structured Sound Event Detection Algorithm Based on Real Life Acoustic Dataset

Sangwon Suh a), Wootaek Lim a), Youngho Jeong a), Taejin Lee a), and Hui Yong Kim a)

Abstract

Sound event detection is a research area that models human auditory cognition by recognizing the events in an environment containing multiple acoustic events and determining the onset and offset time of each event. DCASE, a research community on acoustic scene classification and sound event detection, organizes challenges to encourage researchers to participate and to stimulate sound event detection research. However, the dataset provided by the DCASE Challenge is relatively small compared to ImageNet, a representative dataset for visual object recognition, and few acoustic datasets are openly available. In this study, sound events that can occur indoors and outdoors were collected on a larger scale and annotated to construct a dataset. Furthermore, to improve sound event detection performance, we developed a dual CNN structured sound event detection system by adding to a convolutional neural network a supplementary neural network that determines the presence of sound events. Finally, we conducted a comparative experiment with the baseline systems of both DCASE 2016 and DCASE 2017.

Keywords: Machine learning, Deep learning, Audio signal processing, Sound event detection, Dataset

Copyright 2016 Korean Institute of Broadcast and Media Engineers. All rights reserved.
This is an Open-Access article distributed under the terms of the Creative Commons BY-NC-ND (http://creativecommons.org/licenses/by-nc-nd/3.0) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited and not altered.

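I. Introduction

Sound event detection [1-4] recognizes the acoustic events occurring in an environment and determines the onset and offset of each event. DCASE (Detection and Classification of Acoustic Scenes and Events) has held challenges on this task, starting in 2013 and continuing in 2016 and 2017 [5-7]. Methods applied in DCASE range from non-negative matrix factorization [2] and the Gaussian mixture model [8] to deep learning models such as the RNN (recurrent neural network) [9] and the CNN (convolutional neural network) [10]. In 2017, the CRNN (convolutional recurrent neural network), which combines a CNN and an RNN, was also applied [11].

In visual object recognition, an error rate of 3.6% [12], set against a 5~10% level, was achieved at the 2015 ILSVRC (ImageNet Large Scale Visual Recognition Challenge) [13] using ImageNet [14]. ILSVRC covers 1,000 classes and about 1.2 million images, whereas the DCASE 2017 data consists of 24 recordings of 3~5 minutes each, containing 659 event instances. The DCASE data is also an imbalanced dataset.

This paper therefore constructs a larger dataset of real life sound events and, building on the CNN as a state-of-the-art approach, proposes a dual CNN structured sound event detection algorithm. Section II describes the dataset, Section III the dual CNN based sound event detection algorithm, and Section IV the experiments, including a comparison with the DCASE 2016 and 2017 baseline systems. Section V concludes the paper.

a) Realistic AV Research Group, Electronics and Telecommunications Research Institute
Corresponding Author: Youngho Jeong
E-mail: yhcheong@etri.re.kr
Tel: +82-42-860-6472
ORCID: https://orcid.org/0000-0001-9552-8593
This work was supported by Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) (No.2017-0-00050, Development of Human Enhancement Technology for auditory and muscle support).
Manuscript received August 20, 2018; Revised October 24, 2018; Accepted October 24, 2018.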

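II. Dataset

1. Sound event classes

Every sound event in the recordings was annotated with its onset and offset [15]. With reference to the DCASE TUT database [8], the sound event classes were defined as listed in Table 1.

Table 1. Sound event classes

Environment  Event class label         Number of instances  Hazard event
Indoor       Kettle whistle            227                  O
Indoor       Children crying           255                  O
Indoor       Children playing          602                  -
Indoor       Children shouting         94                   -
Indoor       Dish cleanup              249                  -
Indoor       Dish rinse sound          121                  -
Indoor       Dishwasher                143                  -
Indoor       Doorbell                  201                  -
Indoor       Drawer sound              155                  -
Indoor       Drop impact sound         287                  O
Indoor       Fire alarm sound          218                  O
Indoor       Footfall                  139                  -
Indoor       Keyboard sound            95                   -
Indoor       Scream                    236                  O
Indoor       Speech                    411                  -
Indoor       Water flowing             211                  -
Outdoor      Bicycle idle horn         127                  O
Outdoor      Bird singing              504                  -
Outdoor      Car crash                 82                   O
Outdoor      Car idle horn             90                   O
Outdoor      Car passing               173                  O
Outdoor      Car passing horn          130                  O
Outdoor      Drop impact sound         231                  O
Outdoor      Footfall                  485                  -
Outdoor      Motorcycle idle horn      141                  O
Outdoor      Motorcycle passing horn   212                  O
Outdoor      Scream                    122                  O
Outdoor      Speech                    3                    -
Outdoor      Truck idle horn           108                  O
Outdoor      Truck passing             123                  O
Outdoor      Truck passing horn        206                  O
Outdoor      Wind sound                593                  -

As Table 1 shows, 16 event classes are defined for each of the indoor and outdoor environments (cf. DCASE 2016 and 2017 [5,6]). The recording signal specifications are given in Table 2.

Table 2. Recording signal specifications

Signal type        Sampling rate  Bit resolution  Audio format
Binaural / Stereo  44.1 kHz       24 bits         PCM WAV

2. Recording

The recordings were made with Soundman OKM II Klassik/studio A3 electret in-ear microphones and a Rode NT4 X/Y stereo condenser microphone, using a TASCAM DR-100mkII PCM recorder.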

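The binaural and stereo recordings total 13 hours 9 minutes in 230 recordings, with an average of 254 event instances per class. Compared with the DCASE 2017 data, which totals 1 hour 31 minutes, this is 8.7 times the duration and 1.7 times the average number of instances per class. Table 3 shows examples of the sound event metadata.

Table 3. Examples of sound event metadata

onset      offset     event class label
2.087000   10.354000  footfall
16.977000  21.690000  footfall
26.481000  31.470000  water flowing
32.999000  40.825000  dishwashing
38.326000  38.984000  drop impact sound

III. Dual CNN Based Sound Event Detection Algorithm

1. Dual CNN structure

The proposed dual CNN structure is shown in Fig. 1 (a) and (b). The input audio, sampled at 44.1 kHz, is analyzed with a 40 ms analysis window and 50% overlap, and a 40-dimensional feature is extracted from each frame. Context information from adjacent frames is included in the input to the networks.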
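The framing stated above (44.1 kHz sampling, 40 ms analysis window, 50% overlap) fixes the window and hop length in samples. The following is a minimal sketch of that arithmetic only; the function names are hypothetical, and counting frames without padding is an assumption, not something the paper specifies.

```python
SAMPLE_RATE_HZ = 44_100   # sampling rate from the recording specification
WINDOW_MS = 40            # analysis window length in milliseconds
OVERLAP = 0.5             # 50% overlap between consecutive windows


def frame_parameters(sample_rate_hz=SAMPLE_RATE_HZ, window_ms=WINDOW_MS, overlap=OVERLAP):
    """Return (window_length, hop_length) in samples."""
    window_length = round(sample_rate_hz * window_ms / 1000)
    hop_length = round(window_length * (1 - overlap))
    return window_length, hop_length


def num_frames(num_samples, window_length, hop_length):
    """Count the full analysis windows in a signal (no padding: an assumption)."""
    if num_samples < window_length:
        return 0
    return 1 + (num_samples - window_length) // hop_length


window_length, hop_length = frame_parameters()
# Number of frames in a hypothetical 10-second excerpt.
frames_in_10_s = num_frames(10 * SAMPLE_RATE_HZ, window_length, hop_length)
print(window_length, hop_length, frames_in_10_s)
```

Under these assumptions each frame advances the analysis by 20 ms, so one second of audio corresponds to roughly 50 frames.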

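Fig. 1. Schematic diagram of dual CNN based sound event detection algorithm

The main network consists of 3 convolution layers and 2 fully-connected layers. Each convolution layer has 64 filters of size 3x3 with ReLU activation, followed by 20% dropout and 2x2 max-pooling. The fully-connected layers use 128 units, with ReLU activation and a sigmoid output. Training uses a minibatch size of 64, a learning rate of 0.001, and the Adam optimizer [16] for up to 100 epochs. To prevent over-adaptation, training is stopped when the loss has not improved for 10 epochs, and the validation data serves as the evaluation criterion.

The auxiliary network has 3 convolution layers, each with 32 filters of size 3x3 and ReLU activation, followed by 20% dropout. Its output is 1 when a sound event is present and 0 when no sound event is present.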
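The stopping rule in the training setup above (at most 100 epochs, stop after 10 epochs without loss improvement) can be illustrated with a small framework-independent sketch. It assumes the monitored quantity is a per-epoch validation loss and that "improvement" means a strictly lower value; the function name and the example loss values are hypothetical and not taken from the paper.

```python
def run_early_stopping(val_losses, max_epochs=100, patience=10):
    """Simulate early stopping over a sequence of per-epoch losses.

    Returns (epochs_run, best_epoch), both 1-based. Training stops once
    `patience` consecutive epochs pass without a strictly lower loss,
    or when `max_epochs` is reached.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    epochs_run = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        epochs_run = epoch
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return epochs_run, best_epoch


# Hypothetical loss curve: improves for 3 epochs, then plateaus.
example_losses = [1.0, 0.8, 0.7] + [0.75] * 20
epochs_run, best_epoch = run_early_stopping(example_losses)
print(epochs_run, best_epoch)
```

In a Keras-based implementation the same behavior is usually obtained with an early-stopping callback; the explicit loop is shown here only to make the rule concrete.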

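The networks were implemented with TensorFlow [17] and Keras [18].

2. DCASE baseline systems

The two DCASE baseline systems are based on a Gaussian Mixture Model (GMM) and a Multi-Layer Perceptron (MLP).

The GMM models each class with a mixture of Gaussian distributions whose parameters are estimated with the EM (Expectation Maximization) algorithm. The DCASE 2016 baseline [8] uses MFCC, MFCC-delta, and MFCC-acceleration features. MFCC coefficients 0 to 19 are extracted from 40 ms frames with 50% overlap, and the MFCC-delta and MFCC-acceleration features are derived from the MFCC, giving a 60-dimensional feature vector. Each class model uses 16 Gaussian components.

The MLP is trained with backpropagation.

Fig. 2. Block diagram of multi-layer perceptron model

The DCASE 2017 baseline [7] uses 40 log mel-band energies covering 0 ~ 22050 Hz, extracted from 40 ms frames with 50% overlap. The network has 2 hidden layers of 50 units each, and 20% dropout is applied to prevent overfitting. It is trained with the Adam optimizer at a learning rate of 0.001 for up to 200 epochs; after 100 epochs, training stops if there is no improvement for 10 epochs.

IV. Experiments

1. Experimental setup and metrics

The experiments use the dataset of Section II and the systems of Section III. Following the DCASE setup [5,6], the 16 classes of each environment were evaluated with 4-fold cross validation. The DCASE metrics, F1-score and error rate [19], were used for evaluation. The system output is compared with the ground truth in one-second segments, as illustrated in Fig. 3. An event that is active in both the system output and the ground truth is a true positive (TP), and an event that is inactive in both is a true negative (TN). An event that is active only in the system output is a false positive (FP), and an event that is active only in the ground truth is a false negative (FN).

The F1-score is computed as in equation (1) from the precision P and recall R accumulated over all segments k:

$$P = \frac{\sum_k TP(k)}{\sum_k TP(k) + \sum_k FP(k)}, \qquad R = \frac{\sum_k TP(k)}{\sum_k TP(k) + \sum_k FN(k)}, \qquad F1 = \frac{2PR}{P + R} \tag{1}$$

The error rate is computed as in equation (2):

$$ER = \frac{\sum_k S(k) + \sum_k D(k) + \sum_k I(k)}{\sum_k N(k)} \tag{2}$$

where N(k) is the number of reference events in segment k, S(k) = min(FN(k), FP(k)) is the number of substitutions, I(k) = FP(k) - S(k) is the number of insertions, and D(k) = FN(k) - S(k) is the number of deletions.

Fig. 3. Visualization of system output to ground truth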
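The segment-based F1-score and error rate described above (TP, FP, FN per segment; substitutions, insertions as FP minus S, deletions as FN minus S) can be sketched as follows. This is a minimal illustration assuming each segment is represented by the set of event labels active in it; the function name and the example segments are hypothetical, and returning 0 when a denominator is 0 is an assumption.

```python
def segment_based_metrics(reference, estimated):
    """Compute segment-based F1-score and error rate.

    `reference` and `estimated` are equal-length lists; element k is the
    set of event labels active in segment k.
    """
    if len(reference) != len(estimated):
        raise ValueError("reference and estimated must have the same number of segments")

    tp = fp = fn = 0
    substitutions = deletions = insertions = n_ref = 0
    for ref, est in zip(reference, estimated):
        tp_k = len(ref & est)      # active in both
        fp_k = len(est - ref)      # active only in the system output
        fn_k = len(ref - est)      # active only in the ground truth
        s_k = min(fn_k, fp_k)      # substitutions in this segment
        substitutions += s_k
        deletions += fn_k - s_k    # D(k) = FN(k) - S(k)
        insertions += fp_k - s_k   # I(k) = FP(k) - S(k)
        n_ref += len(ref)          # N(k): reference events in this segment
        tp, fp, fn = tp + tp_k, fp + fp_k, fn + fn_k

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    error_rate = (substitutions + deletions + insertions) / n_ref if n_ref else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "error_rate": error_rate}


# Hypothetical four-segment example: one hit, one deletion, one substitution, one insertion.
reference = [{"footfall"}, {"footfall", "speech"}, {"water flowing"}, set()]
estimated = [{"footfall"}, {"footfall"}, {"speech"}, {"doorbell"}]
metrics = segment_based_metrics(reference, estimated)
print(metrics)
```

Because substitutions pair one false positive with one false negative in the same segment, the error rate can exceed 1 when a system produces many insertions, which is consistent with error rates above 1 appearing in evaluation tables.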

2. Context size and hop size

The effect of the context size and the hop size on detection performance was examined. Table 4 shows the F1-score for each combination. The best result was obtained with a context size of 51 and a hop size of 10.

Table 4. Acoustic event detection results according to the context and hop size

                        Hop size [frames]
Context size [frames]   5      10     25     50
11                      80.1   79.6   77.5   77.5
21                      79.9   79.9   79.1   77.5
51                      80     80.4   78.4   78.1
101                     79.7   80.3   79.1   78.3

With a context size of 51 and a hop size of 10, the detection results on the dataset of Section II are given in Tables 5 and 6. The indoor environment reached an F1-score of 80.4% and an error rate of 0.35, and the outdoor environment reached an F1-score of 73% and an error rate of 0.46. For classes such as Children shouting, Dish cleanup, and Speech, the results reflect class imbalance.

Table 5. Detailed results of detection per sound event in an indoor environment

<Indoor>                 F1-score [%]  Error Rate  Precision  Recall
Kettle whistle           93.4          0.13        0.928      0.94
Children crying          64.4          0.74        0.62       0.67
Children playing         75.7          0.56        0.666      0.877
Children shouting        0             1           0          0
Dish cleanup             0             1           0          0
Dish rinse sound         81.4          0.38        0.803      0.824
Dishwasher               79.5          0.42        0.77       0.821
Doorbell                 91.5          0.18        0.881      0.952
Drawer sound             66.8          0.66        0.672      0.664
Drop impact sound        54.4          0.82        0.611      0.49
Fire alarm sound         91.3          0.18        0.895      0.933
Footfall                 78.3          0.43        0.797      0.769
Keyboard sound           92.4          0.16        0.903      0.945
Scream                   69.3          0.64        0.668      0.721
Speech                   77.2          0.45        0.781      0.763
Water flowing            88.2          0.25        0.851      0.915
Instance-based average   80.4          0.35        -          -
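The context size and hop size compared in Table 4 can be pictured as a sliding window over the frame sequence. The sketch below assumes that the context size is the number of consecutive frames in one network input and that the hop size is the shift, in frames, between consecutive inputs; this reading, the function name, and the 499-frame example length are assumptions for illustration.

```python
def context_windows(num_frames, context_size=51, hop_size=10):
    """Return (start, end) frame index pairs for each context window.

    `end` is exclusive. Only windows that fit entirely inside the
    sequence are returned (no padding: an assumption).
    """
    if context_size <= 0 or hop_size <= 0:
        raise ValueError("context_size and hop_size must be positive")
    last_start = num_frames - context_size
    return [(start, start + context_size) for start in range(0, last_start + 1, hop_size)]


# Hypothetical sequence of 499 frames with the best setting from Table 4.
windows = context_windows(499, context_size=51, hop_size=10)
print(len(windows), windows[0], windows[-1])
```

A smaller hop size yields more, heavily overlapping inputs from the same recording, while a larger context size gives each input a longer temporal view.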

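In the outdoor environment, Bird singing shows a notably low F1-score, as can be seen in Table 6.

Table 6. Detailed results of detection per sound event in an outdoor environment

<Outdoor>                F1-score [%]  Error Rate  Precision  Recall
Bicycle idle horn        72.7          0.57        0.698      0.758
Bird singing             9             0.98        0.612      0.049
Car crash                54.9          0.86        0.574      0.526
Car idle horn            66.3          0.61        0.739      0.602
Car passing              77.3          0.48        0.738      0.813
Car passing horn         54.9          0.7         0.768      0.427
Drop impact sound        68.5          0.61        0.71       0.662
Footfall                 74.4          0.52        0.727      0.762
Motorcycle idle horn     68.7          0.66        0.656      0.721
Motorcycle passing horn  65.1          0.63        0.724      0.591
Scream                   46.9          0.82        0.664      0.362
Speech                   0             1           0          0
Truck idle horn          61.8          0.62        0.808      0.5
Truck passing            61.1          0.71        0.671      0.56
Truck passing horn       45.8          0.82        0.673      0.347
Wind sound               84            0.31        0.857      0.823
Instance-based average   73.0          0.46        -          -

The proposed dual CNN was compared with the DCASE baseline systems described in Section III.2. As shown in Table 7, in the indoor environment the proposed system improved the F1-score by 19.7 percentage points and the error rate by 0.72 over the GMM baseline, and the F1-score by 3.0 percentage points and the error rate by 0.01 over the MLP baseline. In the outdoor environment, it improved the F1-score by 7.2 percentage points and the error rate by 0.16 over the GMM baseline, and the F1-score by 3.0 percentage points and the error rate by 0.01 over the MLP baseline. To examine the effect of the auxiliary network, the CNN without it (w/o aux.) and the CNN with it (w/ aux.) were also compared; adding the auxiliary network improved the results in both environments.

Table 7. Performance test for sound event detection systems

                     Indoor                     Outdoor
System               F1-score [%]  Error Rate   F1-score [%]  Error Rate
GMM                  60.7          1.07         65.8          0.62
MLP                  77.4          0.36         70.0          0.47
CNN (w/o aux.)       79.2          0.36         72.1          0.47
Proposed (w/ aux.)   80.4          0.35         73.0          0.46

V. Conclusion

This paper presented a dual CNN structured sound event detection algorithm. With reference to the DCASE TUT database, a dataset of indoor and outdoor sound events was collected and annotated. It totals 13 hours 9 minutes with an average of 254 event instances per class; compared with the 1 hour 31 minutes of DCASE 2017, the average number of instances per class is 1.7 times larger.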

An auxiliary network whose output is 0 or 1 was added to the CNN. On the dataset of Section II, the best result was obtained with a context size of 51 and a hop size of 10. In the comparison with the DCASE 2016 and 2017 baseline systems, the proposed system achieved a higher F1-score.

References

[1] A. Temko et al., "CLEAR evaluation of acoustic event detection and classification systems," Lecture Notes in Computer Science, vol. 4122, pp. 311-322, 2007.
[2] D. Stowell et al., "Detection and classification of acoustic scenes and events," IEEE Transactions on Multimedia, vol. 17, no. 10, pp. 1733-1746, 2015.
[3] DCASE Community, http://dcase.community/community_info
[4] J. Portêlo et al., "Non-Speech Audio Event Detection," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2009.
[5] DCASE 2016 Task3 Sound event detection in real life audio, http://www.cs.tut.fi/sgn/arg/dcase2016/task-sound-event-detection-in-real-life-audio
[6] DCASE 2017 Task3 Sound event detection in real life audio, http://www.cs.tut.fi/sgn/arg/dcase2017/challenge/task-sound-event-detection-in-real-life-audio
[7] A. Mesaros et al., "DCASE 2017 Challenge Setup: Tasks, Datasets and Baseline System," Detection and Classification of Acoustic Scenes and Events (DCASE), 2017.
[8] A. Mesaros, T. Heittola, and T. Virtanen, "TUT Database for Acoustic Scene Classification and Sound Event Detection," 24th European Signal Processing Conference (EUSIPCO), pp. 1128-1132, 2016.
[9] S. Adavanne, G. Parascandolo, P. Pertila, T. Heittola, and T. Virtanen, "Sound event detection in multichannel audio using spatial and harmonic features," Detection and Classification of Acoustic Scenes and Events (DCASE), 2016.
[10] I. Jeong, S. Lee, Y. Han, and K. Lee, "Audio event detection using multiple-input convolutional neural network," Detection and Classification of Acoustic Scenes and Events (DCASE), 2017.
[11] S. Adavanne and T. Virtanen, "A report on sound event detection with different binaural features," Detection and Classification of Acoustic Scenes and Events (DCASE), 2017.
[12] K. He, X. Zhang, S. Ren, and J. Sun, "Deep Residual Learning for Image Recognition," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[13] Large Scale Visual Recognition Challenge (LSVRC), http://image-net.org/challenges/lsvrc/
[14] ImageNet, http://www.image-net.org/
[15] Y. Jung, S. Seo, W. Lim, and H. Kim, "Design and construction of Acoustic Database for developing Sound Event Detection technique," IEIE Summer General Conference, June 2018.
[16] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," Proceedings of the 3rd International Conference on Learning Representations (ICLR), 2014.
[17] TensorFlow, https://www.tensorflow.org/
[18] Keras, https://keras.io/
[19] Metrics for sound event detection tasks, http://www.cs.tut.fi/sgn/arg/dcase2017/challenge/metrics

About the Authors

Sangwon Suh
- 2015 ~ present: ETRI, Realistic AV Research Group
- ORCID: https://orcid.org/0000-0002-4286-6537

Wootaek Lim
- 2012 ~ present: ETRI, Realistic AV Research Group

Youngho Jeong
- 2011 ~ 2017: UST
- 1994 ~ present: ETRI, Realistic AV Research Group
- ORCID: https://orcid.org/0000-0001-9552-8593

Taejin Lee
- 2002 ~ 2003: Tokyo Denki University
- 2000 ~ present: ETRI, Realistic AV Research Group

Hui Yong Kim
- 2006 ~ 2010: UST
- 2013 ~ 2014: Univ. Southern California (USC)
- 2005 ~ present: ETRI, Realistic AV Research Group
- ORCID: https://orcid.org/0000-0001-7308-133x
- Research interests: UHD/3D/HDR/VR