DNN Based Multi-spectrum Pedestrian Detection Method Using Color and Thermal Image
(Special Paper) JBE Vol. 23, No. 3, May 2018
https://doi.org/10.5909/jbe.2018.23.3.361
ISSN 2287-9137 (Online), ISSN 1226-7953 (Print)

Yongwoo Lee a) and Jitae Shin a)

Abstract

As autonomous driving research develops rapidly, pedestrian detection has also been studied extensively. However, most studies use color-image datasets, in which pedestrians are relatively easy to detect. A color image captures a pedestrian only when the scene is sufficiently lit, and conventional methods struggle otherwise. In this paper, we therefore propose a deep neural network (DNN)-based multi-spectrum pedestrian detection method that uses color and thermal images together. Building on the single shot multibox detector (SSD), we propose fusion network structures that employ color and thermal images simultaneously. In experiments on the KAIST dataset, the proposed SSD-H (SSD-halfway fusion) technique achieves an 18.18% lower miss rate than the KAIST pedestrian detection baseline, and at least a 2.1% lower miss rate than the conventional halfway fusion method.

Keywords: CNN, pedestrian detection, multi-spectrum, network fusion

Copyright 2016 Korean Institute of Broadcast and Media Engineers. All rights reserved.
This is an Open-Access article distributed under the terms of the Creative Commons BY-NC-ND (http://creativecommons.org/licenses/by-nc-nd/3.0) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited and not altered.
a) School of Electronic and Electrical Engineering, Sungkyunkwan University
Corresponding Author: Jitae Shin, E-mail: jtshin@skku.edu, Tel: +82-31-290-7994, ORCID: https://orcid.org/0000-0002-2599-3331
This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education (NRF-2017R1D1A1B03031752).
Manuscript received March 30, 2018; Revised April 30, 2018; Accepted April 30, 2018.

1. Introduction

As research on autonomous driving accelerates, pedestrian detection has been studied intensively, and DNN-based detectors now dominate the field [1]. Most of this work, however, targets color images, which capture pedestrians only under sufficient illumination; approaches that add thermal imagery, which is insensitive to lighting conditions, have therefore drawn attention [2]. In object detection, R-CNN (region-based convolutional neural networks) [3] first applied CNN features to region proposals, and subsequent work [4-6] improved both its speed and accuracy; such detectors have also been applied to multispectral pedestrian detection [2]. In particular, Liu et al. [7] built multispectral detectors on Faster R-CNN [5], which integrates proposal generation and classification into a single network and is faster and more accurate than the original R-CNN. SSD (single shot multibox detector) [8] goes a step further: it predicts detections directly from a fixed set of default bounding boxes, removing the RPN (region proposal networks) stage of Faster R-CNN and improving speed while maintaining accuracy.

In this paper, we propose SSD-based multispectral pedestrian detection networks and evaluate them on the KAIST multispectral pedestrian dataset [9]. Following the early fusion, halfway fusion, and late fusion designs of [7], we construct SSD-E (SSD-early fusion), SSD-H (SSD-halfway fusion), and SSD-L (SSD-late fusion), and compare them against the KAIST ACF+T+THOG (aggregated channel features with thermal and thermal histogram of oriented gradients) baseline [9] using the FPPI (false positives per image) versus miss-rate measure. The proposed SSD-H achieves an 18.18% lower miss rate than ACF+T+THOG and at least a 2.1% lower miss rate than the halfway fusion of [7].

The rest of this paper is organized as follows. Section 2 reviews the fusion architectures of [7] and SSD; Section 3 describes the proposed SSD-based fusion networks; Section 4 presents experimental results; and Section 5 concludes the paper.
2. Background

This section reviews the network fusion architectures of [7] and the SSD detector [8].

2.1 Network fusion architectures

Liu et al. [7] proposed three Faster R-CNN-based fusion architectures for color-thermal pedestrian detection, shown in Fig. 1: early fusion, halfway fusion, and late fusion. Each stream follows VGG-16 [10], which consists of five convolutional (Conv.) blocks and two fully connected (FC) layers. Early fusion merges the feature maps of the two modalities after the first convolutional block using a concatenate (Concat.) layer, and reduces the doubled channel count with a 1x1 convolutional layer in the style of Network-in-Network (NIN) [11]. Halfway fusion instead concatenates the feature maps after the fourth convolutional block, again followed by NIN; at this depth the features carry semantic information while still preserving spatial detail. Late fusion concatenates the FC features of the two streams, so only highly semantic features are merged. [7] additionally examined score fusion, which combines the detection scores of the two streams.

2.2 SSD

SSD [8] is a DNN-based one-stage detector.

Fig. 1. The network fusion architectures [7]: (a) early fusion, (b) halfway fusion, (c) late fusion. Gray and blue boxes indicate convolutional and concatenate layers; orange and bright yellow boxes represent NIN and fully connected layers
Fig. 2. SSD network structure: ReLU layers, pooling layers, and dropout layers are excluded for simplicity
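The concatenate-then-NIN fusion shared by these architectures can be sketched with plain NumPy (a minimal illustration; the shapes and random weights are stand-ins for learned 1x1-convolution parameters, not values from [7]):

```python
import numpy as np

def nin_fuse(feat_color, feat_thermal, weight, bias):
    """Halfway-style fusion: concatenate two (C, H, W) feature maps along
    the channel axis, then apply a 1x1 'Network-in-Network' convolution
    to bring the channel count back down."""
    concat = np.concatenate([feat_color, feat_thermal], axis=0)  # (2C, H, W)
    c2, h, w = concat.shape
    # A 1x1 convolution is a per-pixel linear map over channels.
    out = weight @ concat.reshape(c2, h * w) + bias[:, None]     # (C_out, H*W)
    return out.reshape(weight.shape[0], h, w)

rng = np.random.default_rng(0)
color = rng.standard_normal((4, 8, 8))      # color-stream feature map
thermal = rng.standard_normal((4, 8, 8))    # thermal-stream feature map
weight = rng.standard_normal((4, 8)) * 0.1  # (C_out, 2C): halves channels
bias = np.zeros(4)
fused = nin_fuse(color, thermal, weight, bias)
print(fused.shape)  # (4, 8, 8)
```

The NIN step matters because without it the fused tensor would have twice the channels of a single VGG-16 stream, and the downstream layers could not be reused unchanged.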
As shown in Fig. 2, SSD [8] uses VGG-16 [10] as its base network, converting the FC6 and FC7 layers into convolutional layers and appending four additional convolutional blocks. Unlike Faster R-CNN, SSD needs no separate RPN: a set of default bounding boxes of different scales and aspect ratios is tiled over feature maps of several resolutions, and detections are predicted directly from these boxes. During training, each default box is matched against the ground-truth boxes to determine its classification and localization targets. Because predictions come from feature maps of multiple scales, SSD handles objects of various sizes in a single forward pass.

3. Proposed Method

We build the proposed fusion networks on SSD with the VGG-16 base. Fig. 3 shows the three variants, SSD-E, SSD-H, and SSD-L, where ConvN_M denotes the M-th convolutional layer of the N-th block. Whereas [7] fuses within Faster R-CNN, we fuse within SSD. SSD-E concatenates the color and thermal feature maps after Conv1_2 with a concatenate layer and applies an NIN layer so that the fused map regains the channel dimension of the original VGG-16; the remainder of the network is left unchanged. In the same manner, SSD-H fuses after Conv4_3, and SSD-L fuses after Conv11_2.

Fig. 3. Proposed SSD-E (left), SSD-H (middle), SSD-L (right) network structures: except for the input images, white, blue, and yellow boxes indicate convolution layers, concatenate layers, and NIN, respectively
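SSD's default-box tiling can be illustrated as follows (a simplified sketch; the 4x4 grid, scale 0.3, and aspect ratios are example values, not the configuration used in [8], which also adds an extra geometric-mean box for aspect ratio 1):

```python
import itertools
import math

def default_boxes(fmap_size, scale, aspect_ratios):
    """Tile SSD-style default boxes (cx, cy, w, h), normalized to [0, 1],
    over one square feature map: one set of boxes per cell, centered on
    the cell, with width/height set by the layer scale and aspect ratio."""
    boxes = []
    for i, j in itertools.product(range(fmap_size), repeat=2):
        cx = (j + 0.5) / fmap_size  # box center at the cell center
        cy = (i + 0.5) / fmap_size
        for ar in aspect_ratios:
            boxes.append((cx, cy, scale * math.sqrt(ar), scale / math.sqrt(ar)))
    return boxes

# e.g. a coarse 4x4 detection layer with three aspect ratios
boxes = default_boxes(fmap_size=4, scale=0.3, aspect_ratios=[1.0, 2.0, 0.5])
print(len(boxes))  # 4 * 4 * 3 = 48 default boxes
print(boxes[0])    # (0.125, 0.125, 0.3, 0.3)
```

In the full detector, this tiling is repeated on each of the multi-scale feature maps, with smaller scales on the high-resolution maps and larger scales on the coarse ones.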
Each proposed network keeps the six multi-scale detection layers of SSD shown in Fig. 2; in SSD-L, the fusion point (Conv11_2) coincides with the last of these layers.

4. Experimental Results

Experiments were carried out on the KAIST multispectral pedestrian dataset [9], which consists of 95,328 aligned color-thermal image pairs annotated with 103,128 bounding boxes of 1,182 unique pedestrians. Excluding heavily occluded pedestrians and those smaller than 50 pixels, 3,357 images were used for training and 2,094 for testing. Following [8], the networks were fine-tuned from pre-trained weights for 120,000 iterations.

Detection performance is evaluated with FPPI (false positives per image) versus miss rate. We compare ACF+T+THOG [9], SSD-C, SSD-T, the halfway fusion of [7], and the proposed SSD-E, SSD-H, and SSD-L. ACF+T+THOG [9] augments the ten channels of ACF [13] with thermal intensity and a HOG (histogram of oriented gradients) [14] of the thermal image. SSD-C and SSD-T are single-modality SSD baselines trained on color and thermal images, respectively, while SSD-E, SSD-H, and SSD-L apply early, halfway, and late fusion to SSD.

Fig. 4 plots the FPPI-miss rate curves, with the miss rate averaged over FPPI on a log scale following [12]. SSD-E, SSD-H, and SSD-L all outperform the KAIST ACF+T+THOG baseline, and SSD-H achieves the largest gain, an 18.18% lower miss rate.

Fig. 4. Comparison of detection results in FPPI-miss rate
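The log-averaged miss rate used to summarize each curve can be sketched as follows (a minimal sketch in the spirit of [12]; the nine reference points over [10^-2, 10^0] follow common practice for this benchmark, and the interpolation details here are assumptions, not taken from the paper):

```python
import numpy as np

def log_average_miss_rate(fppi, miss_rate):
    """Sample the miss rate at nine FPPI reference points spaced evenly
    in log space over [1e-2, 1e0], interpolating the measured curve in
    log-FPPI, and return the geometric mean of the sampled values."""
    refs = np.logspace(-2, 0, 9)
    sampled = np.interp(np.log(refs), np.log(fppi), miss_rate)
    # Geometric mean; the floor guards against log(0).
    return float(np.exp(np.mean(np.log(np.maximum(sampled, 1e-10)))))

# Hypothetical detector curve: miss rate falls as FPPI rises.
fppi = np.array([1e-3, 1e-2, 1e-1, 1e0, 1e1])
mr = np.array([0.90, 0.70, 0.50, 0.35, 0.30])
print(round(log_average_miss_rate(fppi, mr), 3))  # 0.501
```

A single scalar like this makes the "18.18% lower" comparisons above well defined, since two full curves cannot otherwise be ranked at every operating point.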
SSD-E and SSD-L performed better than SSD-C and SSD-T, which were trained on a single modality, but worse than the halfway fusion of [7]. SSD-H, however, achieved a 2.1% lower miss rate, confirming that SSD is a more accurate basis than Faster R-CNN. The difference in miss rate between SSD-C and SSD-T was marginal. Our analysis of why SSD-H performs best is as follows. Because SSD-H fuses features at an intermediate layer, the fused feature maps carry semantic information while preserving fine detail. SSD-E fuses features close to the pixel level, so the benefit of network fusion is inherently limited, while SSD-L fuses overly semantic features, so overall performance is strongly affected by inaccuracies arising in those semantic features. Moreover, [7] is based on Faster R-CNN, whose accuracy is lower than that of SSD built on the same VGG-16. Even setting aside SSD's speed advantage over Faster R-CNN, SSD uses feature maps of multiple scales in its final detection decision, which improves accuracy; this is why SSD-H outperforms all compared methods.

To confirm that the proposed fusion network (SSD-H) improves over using color images alone (SSD-C), we examined actual detection outputs on well-lit daytime images and nearly unlit nighttime images. Fig. 5 compares the pedestrian detection results of SSD-C and SSD-H: the first two rows show SSD-C and the bottom two rows show SSD-H, with each result displayed on both the color and the thermal image. Across diverse scenes, SSD-H detects pedestrians more accurately than SSD-C. In the daytime images of Fig. 5 (a) and (c), more pedestrians are detected thanks to the advantages of SSD and the additional information provided by the thermal image.

Fig. 5. Results from SSD-C (first two rows) and SSD-H (bottom two rows)
In the night images of Fig. 5 (b) and (e) and in Fig. 5 (d), SSD-C misses pedestrians that SSD-H detects.

5. Conclusion

In this paper, we proposed DNN-based multispectral pedestrian detection methods that fuse color and thermal images within the SSD framework. In FPPI-miss rate evaluation on the KAIST dataset, the proposed SSD-H achieved an 18.18% lower miss rate than the ACF+T+THOG baseline and a 2.1% lower miss rate than the halfway fusion of [7].

References
[1] S. Zhang, R. Benenson, M. Omran, J. Hosang, and B. Schiele, How far are we from solving pedestrian detection?, IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, USA, pp. 1259-1267, 2016.
[2] J. Wagner, V. Fischer, M. Herman, and S. Behnke, Multispectral pedestrian detection using deep fusion convolutional neural networks, European Symposium on Artificial Neural Networks, Bruges, Belgium, pp. 509-514, 2016.
[3] R. Girshick, J. Donahue, T. Darrell, and J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, IEEE Conference on Computer Vision and Pattern Recognition, Columbus, USA, pp. 580-587, 2014.
[4] R. Girshick, Fast R-CNN, arXiv preprint arXiv:1504.08083, 2015.
[5] S. Ren, K. He, R. Girshick, and J. Sun, Faster R-CNN: Towards real-time object detection with region proposal networks, Neural Information Processing Systems, Montreal, Canada, pp. 91-99, 2015.
[6] K. He, G. Gkioxari, P. Dollar, and R. Girshick, Mask R-CNN, IEEE International Conference on Computer Vision, Venice, Italy, pp. 2980-2988, 2017.
[7] J. Liu, S. Zhang, S. Wang, and D. N. Metaxas, Multispectral deep neural networks for pedestrian detection, arXiv preprint arXiv:1611.02644, 2016.
[8] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Y. Fu, and A. C. Berg, SSD: Single Shot MultiBox Detector, European Conference on Computer Vision, Amsterdam, the Netherlands, pp. 21-37, 2016.
[9] S. Hwang, J. Park, N. Kim, Y. Choi, and I. S. Kweon, Multispectral pedestrian detection: benchmark dataset and baseline, IEEE Conference on Computer Vision and Pattern Recognition, Boston, USA, pp. 1037-1045, 2015.
[10] K. Simonyan and A.
Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556, 2014.
[11] M. Lin, Q. Chen, and S. Yan, Network in Network, arXiv preprint arXiv:1312.4400, 2013.
[12] P. Dollar, C. Wojek, B. Schiele, and P. Perona, Pedestrian detection: A benchmark, IEEE Conference on Computer Vision and Pattern Recognition, Miami, USA, pp. 304-311, 2009.
[13] P. Dollar, R. Appel, S. Belongie, and P. Perona, Fast feature pyramids for object detection, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 36, No. 8, pp. 1532-1545, 2014.
[14] N. Dalal and B. Triggs, Histograms of oriented gradients for human detection, IEEE Conference on Computer Vision and Pattern Recognition, San Diego, USA, pp. 886-893, 2005.
Yongwoo Lee
- ORCID: https://orcid.org/0000-0002-2873-7122

Jitae Shin
- 1988: KAIST
- 1996 ~ 2001: University of Southern California
- ORCID: https://orcid.org/0000-0002-2599-3331