Delving Deeper into Convolutional Networks for Learning Video Representations - Nicolas Ballas, Li Yao, Chris Pal, Aaron Courville (arXiv: 1511.06432)

Delving Deeper into Convolutional Networks for Learning Video Representations
Nicolas Ballas, Li Yao, Chris Pal, Aaron Courville
arXiv: 1511.06432
Il Gu Yi, DeepLAB in Modu Labs. June 13, 2016

Content
1. Introduction
   - Gated Recurrent Unit Networks (GRU)
2. Delving Deeper into Convolutional Neural Networks
3. Related Works
4. Experiments
   - Action Recognition
   - Video Captioning
5. Conclusion

Introduction
- Video analysis and understanding
  - Human action recognition, video retrieval, or video captioning
  - Previous: hand-crafted and task-specific representations
- Current research
  - CNN: good at image analysis, but does NOT use temporal information
  - RNN: good at temporal sequence analysis
- Recurrent Convolutional Networks (RCN)
  - Srivastava et al., 2015; Donahue et al., 2014; Ng et al., 2015
  - RNN + CNN for learning video representations

Introduction - Recurrent Convolutional Networks (RCN)
- Basic architecture
  - Visual percepts: CNN feature maps
  - RNN input: visual percepts
- Previous works
  - High-level visual percepts (top layer only)
  - Drawback: much of the local information is lost
  - Drawback: the temporal variation from frame to frame is small
- Novel architecture
  - Top layer + middle layers
  - GRU-RNN: uses conv2d ops instead of fc ops inside the RNN cell

Introduction - Gated Recurrent Unit Networks (GRU)
- GRU

$$
\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1}), \\
r_t &= \sigma(W_r x_t + U_r h_{t-1}), \\
\tilde{h}_t &= \tanh(W x_t + U (r_t \odot h_{t-1})), \\
h_t &= (1 - z_t)\, h_{t-1} + z_t\, \tilde{h}_t
\end{aligned}
$$

- Learning phrase representations using RNN encoder-decoder for statistical machine translation, Cho et al., arXiv: 1406.1078, 2014
- Long-term temporal dependency modelling
- $z_t$: update gate
- $r_t$: reset gate
- $\odot$: element-wise multiplication
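The four GRU equations can be sketched as a single update function. This is a minimal NumPy illustration, not the authors' code: biases are omitted as on the slide, and the helper `init_gru_params`, the toy dimensions, and the random inputs are assumptions made only for the example.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, params):
    """One GRU update following the four equations (no biases)."""
    W_z, U_z, W_r, U_r, W, U = params
    z_t = sigmoid(W_z @ x_t + U_z @ h_prev)          # update gate
    r_t = sigmoid(W_r @ x_t + U_r @ h_prev)          # reset gate
    h_tilde = np.tanh(W @ x_t + U @ (r_t * h_prev))  # candidate state
    return (1.0 - z_t) * h_prev + z_t * h_tilde      # new hidden state

def init_gru_params(input_dim, hidden_dim, rng):
    # Hypothetical helper: small random weights in the order
    # W_z, U_z, W_r, U_r, W, U.
    shapes = [(hidden_dim, input_dim), (hidden_dim, hidden_dim)] * 3
    return [0.1 * rng.standard_normal(s) for s in shapes]

rng = np.random.default_rng(0)
params = init_gru_params(input_dim=4, hidden_dim=3, rng=rng)
h = np.zeros(3)
for x_t in rng.standard_normal((5, 4)):  # toy sequence of 5 inputs
    h = gru_step(x_t, h, params)
```

Because the new state is a gated blend of the previous state and a tanh output, every entry of `h` stays strictly inside (-1, 1) when the initial state is zero.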

Delving Deeper into Convolutional Neural Networks - Two RCN architectures
- GRU-RCN (the figure without the upward dashed arrows)
- Stacked GRU-RCN
- (figure)
- Inputs: $(x^1_t, \dots, x^{L-1}_t, x^L_t)$, $t = 1, \dots, T$

Delving Deeper into Convolutional Neural Networks - GRU-RCN
- GRU-RCN

$$
\begin{aligned}
z^l_t &= \sigma(W^l_z * x^l_t + U^l_z * h^l_{t-1}), \\
r^l_t &= \sigma(W^l_r * x^l_t + U^l_r * h^l_{t-1}), \\
\tilde{h}^l_t &= \tanh(W^l * x^l_t + U^l * (r^l_t \odot h^l_{t-1})), \\
h^l_t &= (1 - z^l_t)\, h^l_{t-1} + z^l_t\, \tilde{h}^l_t
\end{aligned}
$$

- $h^l_t = \phi^l(x^l_t, h^l_{t-1})$
- $*$: conv2d ops
- Classification uses the hidden states at the last time step, $(h^1_T, \dots, h^L_T)$
- fc ops: do not reflect the properties of conv maps
- conv maps: extract strong local correlations that recur at different spatial positions
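Replacing the matrix products of a GRU with 2-D convolutions gives the GRU-RCN update. The sketch below is an illustrative NumPy version under several assumptions that are not from the slides: feature maps are stored as (height, width, channels), convolutions use zero padding so the spatial size is preserved, and `conv2d_same`, `init_params`, and the toy sizes are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def conv2d_same(x, kernel):
    """Zero-padded 'same' convolution (cross-correlation form).
    x: (H, W, C_in); kernel: (k, k, C_in, C_out) with odd k."""
    k = kernel.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    H, W = x.shape[:2]
    out = np.zeros((H, W, kernel.shape[3]))
    for i in range(k):
        for j in range(k):
            # Shifted view of the input, mixed across channels.
            out += xp[i:i + H, j:j + W, :] @ kernel[i, j]
    return out

def gru_rcn_step(x_t, h_prev, p):
    """GRU update where every matrix product is a 2-D convolution."""
    z = sigmoid(conv2d_same(x_t, p["W_z"]) + conv2d_same(h_prev, p["U_z"]))
    r = sigmoid(conv2d_same(x_t, p["W_r"]) + conv2d_same(h_prev, p["U_r"]))
    h_tilde = np.tanh(conv2d_same(x_t, p["W"]) + conv2d_same(r * h_prev, p["U"]))
    return (1.0 - z) * h_prev + z * h_tilde

def init_params(c_in, c_hid, rng, k=3):
    # Hypothetical initialiser: W* kernels read the input, U* the state.
    in_ch = {"W_z": c_in, "U_z": c_hid, "W_r": c_in,
             "U_r": c_hid, "W": c_in, "U": c_hid}
    return {n: 0.1 * rng.standard_normal((k, k, c, c_hid))
            for n, c in in_ch.items()}

rng = np.random.default_rng(0)
p = init_params(c_in=4, c_hid=2, rng=rng)
frames = rng.standard_normal((3, 8, 8, 4))  # T=3 toy feature maps
h = np.zeros((8, 8, 2))
for x_t in frames:
    h = gru_rcn_step(x_t, h, p)
```

The hidden state keeps the spatial layout of the input map, which is what lets the classifier on top see where in the frame a pattern occurred.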

Delving Deeper into Convolutional Neural Networks - GRU-RCN (cont.)
- GRU-RCN

$$
\begin{aligned}
z^l_t &= \sigma(W^l_z * x^l_t + U^l_z * h^l_{t-1}), \\
r^l_t &= \sigma(W^l_r * x^l_t + U^l_r * h^l_{t-1}), \\
\tilde{h}^l_t &= \tanh(W^l * x^l_t + U^l * (r^l_t \odot h^l_{t-1})), \\
h^l_t &= (1 - z^l_t)\, h^l_{t-1} + z^l_t\, \tilde{h}^l_t
\end{aligned}
$$

- Number of parameters in GRU
  - Size of $W^l$, $W^l_z$, and $W^l_r$: $N_1 \times N_2 \times O_x \times O_h$
  - $N$: input spatial size, $O_x$: number of input channels, $O_h$: size of the hidden node
- Number of parameters in GRU-RCN
  - Size of $W^l$, $W^l_z$, and $W^l_r$: $k_1 \times k_2 \times O_x \times O_h$
  - $k$: kernel size; usually $3 \times 3$, much smaller than $N_1 \times N_2$
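A quick arithmetic check makes the saving concrete. The sizes below (a 56 x 56 map, 128 input channels, 64 hidden channels, 3 x 3 kernels) are hypothetical values chosen for illustration; only the two size formulas come from the slide.

```python
def fc_weight_size(n1, n2, o_x, o_h):
    # Size of one input weight in the fully connected GRU, as on the slide.
    return n1 * n2 * o_x * o_h

def conv_weight_size(k1, k2, o_x, o_h):
    # Size of one input kernel in GRU-RCN, as on the slide.
    return k1 * k2 * o_x * o_h

fc = fc_weight_size(56, 56, 128, 64)    # 25,690,112 weights
conv = conv_weight_size(3, 3, 128, 64)  # 73,728 weights
ratio = fc / conv                       # = (56 * 56) / (3 * 3), about 348
```

The ratio depends only on the spatial sizes, N1 * N2 versus k1 * k2, so the saving grows with the resolution of the feature map.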

Delving Deeper into Convolutional Neural Networks - Stacked GRU-RCN
- Stacked GRU-RCN

$$
\begin{aligned}
z^l_t &= \sigma(W^l_z * x^l_t + W^l_{z^l} * h^{l-1}_t + U^l_z * h^l_{t-1}), \\
r^l_t &= \sigma(W^l_r * x^l_t + W^l_{r^l} * h^{l-1}_t + U^l_r * h^l_{t-1}), \\
\tilde{h}^l_t &= \tanh(W^l * x^l_t + U^l * (r^l_t \odot h^l_{t-1})), \\
h^l_t &= (1 - z^l_t)\, h^l_{t-1} + z^l_t\, \tilde{h}^l_t
\end{aligned}
$$

- $h^l_t = \phi^l(x^l_t, h^l_{t-1}, h^{l-1}_t)$: the gates also see the previous layer at the current time step
- $*$: conv2d ops
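The stacked variant adds one extra convolution over the lower layer's state inside each gate. The following NumPy sketch is an assumption-laden illustration rather than the paper's implementation: the (height, width, channels) layout, zero-padded convolutions, the 2 x 2 max-pooling used to match spatial sizes, and all parameter names and toy shapes are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def conv2d_same(x, kernel):
    """Zero-padded 'same' convolution; x: (H, W, C_in), kernel: (k, k, C_in, C_out)."""
    k = kernel.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    H, W = x.shape[:2]
    out = np.zeros((H, W, kernel.shape[3]))
    for i in range(k):
        for j in range(k):
            out += xp[i:i + H, j:j + W, :] @ kernel[i, j]
    return out

def max_pool2x2(h):
    """2x2 max-pooling on an (H, W, C) map with even H and W."""
    H, W, C = h.shape
    return h.reshape(H // 2, 2, W // 2, 2, C).max(axis=(1, 3))

def stacked_gru_rcn_step(x_t, h_prev, h_below, p):
    """h_below is h^{l-1}_t, already resized to the layer-l spatial size."""
    z = sigmoid(conv2d_same(x_t, p["W_z"]) + conv2d_same(h_below, p["W_zl"])
                + conv2d_same(h_prev, p["U_z"]))
    r = sigmoid(conv2d_same(x_t, p["W_r"]) + conv2d_same(h_below, p["W_rl"])
                + conv2d_same(h_prev, p["U_r"]))
    h_tilde = np.tanh(conv2d_same(x_t, p["W"]) + conv2d_same(r * h_prev, p["U"]))
    return (1.0 - z) * h_prev + z * h_tilde

rng = np.random.default_rng(0)
# Input channels per kernel: 4 for x_t, 3 for the lower layer, 2 for the state.
in_ch = {"W_z": 4, "W_zl": 3, "U_z": 2, "W_r": 4, "W_rl": 3, "U_r": 2, "W": 4, "U": 2}
p = {n: 0.1 * rng.standard_normal((3, 3, c, 2)) for n, c in in_ch.items()}
x_t = rng.standard_normal((4, 4, 4))
h_prev = 0.1 * rng.standard_normal((4, 4, 2))
h_below = max_pool2x2(rng.standard_normal((8, 8, 3)))  # 8x8 state pooled to 4x4
h = stacked_gru_rcn_step(x_t, h_prev, h_below, p)
```

In this sketch the lower-layer state only enters the two gates, matching the equations above, where the candidate state has no bottom-up term.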

Related Works
- Large-scale Video Classification with Convolutional Neural Networks (Karpathy et al., 2014)
- Tran et al. (2014): C3D (covered in 박은수's talk)
  - Unlike image classification, there was no dramatic improvement
  - If anything, training on video with large datasets is reported to be difficult
- Simonyan & Zisserman (2014a): proposed the two-stream framework
  - RGB color and optical flow are fed as separate inputs, each to its own CNN
- Ng et al. (2015), Donahue et al. (2014): apply an RNN to the top layer of the two-stream model

Experiments - Action Recognition
- Model architecture
  - VGG-16 (pretrained on ImageNet, fine-tuned on UCF-101)
  - Extract 5 feature maps: pool2, pool3, pool4, pool5, and fc-7
  - These feature maps are the inputs $x^l_t$ of the RCN model
- UCF-101 dataset
  - 101 actions, 13320 YouTube video clips

Experiments - Action Recognition - Three RCN architectures
- GRU-RCN
  - Number of feature maps: 64, 128, 256, 256, 512
  - Average pooling at the last time step T
    - e.g. Layer 1 (pool2): converts (56 x 56 x 64) to (1 x 1 x 64)
  - Each pooled representation is sent to its own classifier (five in total)
  - Each classifier is trained to focus on a single hidden representation
  - The final decision is the average of the five classifiers
  - Dropout prob: 0.7
- Stacked GRU-RCN
  - Experiment to investigate how important the bottom-up connections are
  - Max-pooling is applied to match the spatial dimension of the lower-layer input
- Bi-directional GRU-RCN
  - Experiment to check the importance of reverse temporal information
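The GRU-RCN read-out described above (spatial average pooling of each last-step hidden state, one classifier per layer, averaged predictions) can be sketched as follows. The channel counts and the 56 x 56 size of layer 1 come from the slide; the other spatial sizes, the linear-softmax form of each classifier, the random weights, and the function name `predict` are assumptions for illustration, and dropout is left out.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def predict(hiddens, classifiers):
    """Average the class probabilities of one classifier per layer."""
    probs = []
    for h, (W, b) in zip(hiddens, classifiers):
        pooled = h.mean(axis=(0, 1))           # (H, W, C) -> (C,)
        probs.append(softmax(W @ pooled + b))  # per-layer prediction
    return np.mean(probs, axis=0)              # final decision: average

rng = np.random.default_rng(0)
num_classes = 101
channels = [64, 128, 256, 256, 512]  # feature maps per layer (from the slide)
spatial = [56, 28, 14, 7, 1]         # assumed sizes; only 56 is on the slide
hiddens = [rng.standard_normal((s, s, c)) for s, c in zip(spatial, channels)]
classifiers = [(0.01 * rng.standard_normal((num_classes, c)), np.zeros(num_classes))
               for c in channels]
class_probs = predict(hiddens, classifiers)
predicted_class = int(class_probs.argmax())
```

Averaging probabilities keeps the output a valid distribution over the 101 classes, so `class_probs` still sums to one.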

Experiments - Action Recognition - Model Training and Evaluation
- Follows the two-stream framework
- Batch size: 64 videos
- Random spatial cropping to one of four sizes: 256, 224, 192, 168
- Temporal cropping size: 10
- The final input is resized to 224; the final input volume is (224 x 224 x 10)
- Maximum log-likelihood:

$$
L = \frac{1}{N} \sum_{n=1}^{N} \log p(y^n \mid c(x^n), \theta)
$$
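The sampling recipe and the objective can be sketched in a few lines of NumPy. This is only an illustration under assumptions not stated on the slide: videos are arrays of shape (frames, height, width, channels), the resize uses nearest-neighbour indexing, and the names `sample_clip` and `mean_log_likelihood` are hypothetical.

```python
import numpy as np

def sample_clip(video, rng, sizes=(256, 224, 192, 168), length=10, out=224):
    """Random temporal and spatial crop, then resize to out x out.
    video: (T, H, W, C) with H, W >= max(sizes) and T >= length."""
    T, H, W, _ = video.shape
    s = int(rng.choice(sizes))                  # one of the four crop sizes
    t0 = int(rng.integers(0, T - length + 1))   # temporal crop start
    y0 = int(rng.integers(0, H - s + 1))        # spatial crop corner
    x0 = int(rng.integers(0, W - s + 1))
    clip = video[t0:t0 + length, y0:y0 + s, x0:x0 + s]
    idx = np.arange(out) * s // out             # nearest-neighbour resize
    return clip[:, idx][:, :, idx]

def mean_log_likelihood(probs, labels):
    """L = (1/N) * sum_n log p(y^n | ...); probs: (N, classes)."""
    return np.mean(np.log(probs[np.arange(len(labels)), labels]))

rng = np.random.default_rng(0)
video = np.zeros((16, 256, 340, 3), dtype=np.uint8)  # toy video
clip = sample_clip(video, rng)                       # (10, 224, 224, 3)
uniform = np.full((4, 101), 1.0 / 101)
ll = mean_log_likelihood(uniform, np.array([0, 5, 7, 100]))
```

For a uniform prediction over 101 classes the mean log-likelihood equals -log(101), a useful sanity value for the start of training.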

Experiments - Action Recognition - Results
- Baseline
  - VGG-16: pretrained on ImageNet and fine-tuned on UCF-101
  - VGG-16 RNN: fc7 is the input to a GRU (fc)
- VGG-16 RNN (78.1) > VGG-16 (78.0): only a slight improvement
- Evidence that the CNN top layer has lost much of the temporal information

Experiments - Action Recognition - Results (cont.)
- RGB test
- Best: Bi-directional GRU-RCN
- State of the art
  - C3D (Tran et al.): 85.2
  - Karpathy: 65.2

Experiments - Action Recognition - Results (cont.)
- Flow test
- Best: GRU-RCN (85.4 → 85.7)
- Probably because VGG-16 is already trained on 10 consecutive images

Experiments - Action Recognition - Results (cont.)
- RGB + Flow
- Details: Wang et al. (2015b)
  - The two models are run separately and combined by a weighted linear combination
- Baseline: fusion VGG-16: 89.1; state of the art: 90.9 (Wang)
- Combining Bi-directional GRU-RCN: 90.8
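The fusion step is a weighted sum of the per-class scores of the two streams. The sketch below uses made-up scores and equal weights purely for illustration; the slides do not give the actual weights, and `fuse_scores` is a hypothetical name.

```python
import numpy as np

def fuse_scores(rgb_scores, flow_scores, w_rgb, w_flow):
    """Weighted linear combination of per-class scores from two models."""
    return w_rgb * np.asarray(rgb_scores) + w_flow * np.asarray(flow_scores)

# Toy scores for 2 videos and 2 classes (illustrative values only).
rgb = np.array([[0.6, 0.4], [0.2, 0.8]])
flow = np.array([[0.3, 0.7], [0.1, 0.9]])
fused = fuse_scores(rgb, flow, w_rgb=0.5, w_flow=0.5)
pred = fused.argmax(axis=1)  # predicted class per video
```

In the first toy video the RGB stream alone would pick class 0, while the fused score picks class 1, which shows how the second stream can change the decision.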

Experiments - Video Captioning
- Model architecture
- Data
  - YouTube2Text: 1970 video clips with multiple natural language descriptions
  - train: 1200, valid: 100, test: 670
- Encoder-decoder framework: Cho et al. (2014)
- Encoder
  - K equally spaced segments (K = 10)
  - The video is split into 10 segments and the fc7 layer of VGG-16 is extracted for each
  - The representations are concatenated at the last time step and used as the input
- Decoder: LSTM text generator with soft attention, Yao et al. (2015b)

$$
L = \frac{1}{N} \sum_{n=1}^{N} \sum_{i=1}^{t_n} \log p(y^n_i \mid y^n_{<i}, x^n_i, \theta)
$$
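Splitting a clip into K equally spaced segments can be sketched with `numpy.array_split`. Taking the centre frame of each segment as its representative is an assumption made for this example; the slides only say that the video is divided into 10 segments, and `segment_centres` is a hypothetical name.

```python
import numpy as np

def segment_centres(num_frames, k=10):
    """Split frame indices into k near-equal segments and return the
    centre index of each (requires num_frames >= k)."""
    segments = np.array_split(np.arange(num_frames), k)
    return [int(seg[len(seg) // 2]) for seg in segments]

centres = segment_centres(100)   # one frame index per segment
uneven = segment_centres(23)     # also works when 10 does not divide T
```

The returned indices would then select the frames passed through the CNN, one feature vector per segment.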

Experiments - Video Captioning - Results

Conclusion
- Different spatial resolutions are used to model temporal variation well
- Closer to the top layer: more discriminative information, but lower spatial resolution
- Closer to the bottom layers: the opposite
- Five layers are extracted from VGG-16 and a multi-level GRU is applied

Thank you for your attention!