Delving Deeper into Convolutional Networks for Learning Video Representations - Nicolas Ballas, Li Yao, Chris Pal, Aaron Courville arXiv:


Delving Deeper into Convolutional Networks for Learning Video Representations
Nicolas Ballas, Li Yao, Chris Pal, Aaron Courville
arXiv:1511.06432

Il Gu Yi, DeepLAB in Modu Labs
June 13, 2016

Content
1. Introduction
   - Gated Recurrent Unit Networks (GRU)
2. Delving Deeper into Convolutional Neural Networks
3. Related Works
4. Experiments
   - Action Recognition
   - Video Captioning
5. Conclusion

Introduction
- Video analysis and understanding: human action recognition, video retrieval, video captioning
- Previous approaches: hand-crafted, task-specific representations
- Current research
  - CNN: good at image analysis, but does NOT use temporal information
  - RNN: good at analyzing temporal sequences
- Recurrent Convolutional Networks (RCN): Srivastava et al., 2015; Donahue et al., 2014; Ng et al., 2015
  - RNN + CNN for learning video representations

Introduction: Recurrent Convolutional Networks (RCN)
- Basic architecture
  - Visual percepts: CNN feature maps
  - RNN input: visual percepts
- Previous works
  - Use high-level visual percepts only (top layer)
  - Drawback: much of the local information is lost
  - Drawback: the temporal variation from frame to frame is small
- Novel architecture
  - Uses the top layer plus middle layers
  - GRU-RCN: uses conv2d ops instead of fc ops inside the RNN cell

Introduction: Gated Recurrent Unit Networks (GRU)
GRU
$$
\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1}), \\
r_t &= \sigma(W_r x_t + U_r h_{t-1}), \\
\tilde{h}_t &= \tanh(W x_t + U (r_t \odot h_{t-1})), \\
h_t &= (1 - z_t)\, h_{t-1} + z_t\, \tilde{h}_t
\end{aligned}
$$
- Source: Cho et al., "Learning phrase representations using RNN encoder-decoder for statistical machine translation", arXiv:1406.1078, 2014
- Models long-term temporal dependencies
- $z_t$: update gate
- $r_t$: reset gate
- $\odot$: element-wise multiplication
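As a rough illustration of the four GRU equations above, here is a minimal pure-Python sketch of a single time step. The helper names (`sigmoid`, `matvec`, `gru_step`) are hypothetical and chosen only for this example, and bias terms are omitted because the slide omits them.

```python
import math

def sigmoid(a):
    """Logistic function used for both gates."""
    return 1.0 / (1.0 + math.exp(-a))

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def gru_step(x, h_prev, W, U, Wz, Uz, Wr, Ur):
    """One GRU update: gates z and r, candidate h_tilde, then interpolation."""
    z = [sigmoid(a + b) for a, b in zip(matvec(Wz, x), matvec(Uz, h_prev))]
    r = [sigmoid(a + b) for a, b in zip(matvec(Wr, x), matvec(Ur, h_prev))]
    # The reset gate scales the previous state before it enters the candidate.
    rh = [ri * hi for ri, hi in zip(r, h_prev)]
    h_tilde = [math.tanh(a + b) for a, b in zip(matvec(W, x), matvec(U, rh))]
    # The update gate interpolates between the old state and the candidate.
    return [(1 - zi) * hp + zi * ht for zi, hp, ht in zip(z, h_prev, h_tilde)]
```

With all weights set to zero, both gates equal 0.5 and the candidate is 0, so the new state is simply half of the previous state. That makes a convenient sanity check.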

Delving Deeper into Convolutional Neural Networks: Two RCN architectures
- GRU-RCN (the figure with the upward dashed arrows removed)
- Stacked GRU-RCN (figure)
- Inputs: $(x_t^1, \ldots, x_t^{L-1}, x_t^L)$, $t = 1, \ldots, T$

Delving Deeper into Convolutional Neural Networks: GRU-RCN
GRU-RCN
$$
\begin{aligned}
z_t^l &= \sigma(W_z^l \ast x_t^l + U_z^l \ast h_{t-1}^l), \\
r_t^l &= \sigma(W_r^l \ast x_t^l + U_r^l \ast h_{t-1}^l), \\
\tilde{h}_t^l &= \tanh(W^l \ast x_t^l + U^l \ast (r_t^l \odot h_{t-1}^l)), \\
h_t^l &= (1 - z_t^l)\, h_{t-1}^l + z_t^l\, \tilde{h}_t^l
\end{aligned}
$$
- In short: $h_t^l = \phi^l(x_t^l, h_{t-1}^l)$
- $\ast$: conv2d ops
- Classification uses the hidden states at the last time step, $(h_T^1, \ldots, h_T^L)$
- fc ops: do not reflect the properties of conv maps
- conv maps: pick out strong local correlations that recur at different spatial locations
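To make the change from fc ops to conv2d ops concrete, the following is a deliberately simplified sketch of one GRU-RCN step. It assumes a single input channel and a single hidden channel, zero padding that preserves the spatial size, and cross-correlation as the conv2d operation. All function names are hypothetical; the multi-channel case from the slides is not covered.

```python
import math

def conv2d_same(img, k):
    """Zero-padded 'same' cross-correlation of one 2D map with an odd-sized kernel."""
    H, W = len(img), len(img[0])
    kh, kw = len(k), len(k[0])
    ph, pw = kh // 2, kw // 2
    out = [[0.0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            s = 0.0
            for a in range(kh):
                for b in range(kw):
                    y, x = i + a - ph, j + b - pw
                    if 0 <= y < H and 0 <= x < W:  # outside the map counts as zero
                        s += img[y][x] * k[a][b]
            out[i][j] = s
    return out

def emap(f, *maps):
    """Apply f element-wise across same-shaped 2D maps."""
    return [[f(*vals) for vals in zip(*rows)] for rows in zip(*maps)]

def gru_rcn_step(x, h_prev, W, U, Wz, Uz, Wr, Ur):
    """One GRU-RCN update: same structure as a GRU, with products replaced by conv2d."""
    sig = lambda a, b: 1.0 / (1.0 + math.exp(-(a + b)))
    z = emap(sig, conv2d_same(x, Wz), conv2d_same(h_prev, Uz))
    r = emap(sig, conv2d_same(x, Wr), conv2d_same(h_prev, Ur))
    rh = emap(lambda a, b: a * b, r, h_prev)
    h_tilde = emap(lambda a, b: math.tanh(a + b),
                   conv2d_same(x, W), conv2d_same(rh, U))
    return emap(lambda zi, hp, ht: (1 - zi) * hp + zi * ht, z, h_prev, h_tilde)
```

The hidden state stays a 2D map of the same size as the input, which is the point of the design: spatial layout is kept through time instead of being flattened.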

Delving Deeper into Convolutional Neural Networks: GRU-RCN (cont.)
GRU-RCN equations as on the previous slide.
- Number of parameters in a (fully connected) GRU
  - Size of $W^l$, $W_z^l$, and $W_r^l$: $N_1 \times N_2 \times O_x \times O_h$
  - $N$: input spatial size, $O_x$: number of input channels, $O_h$: size of the hidden state
- Number of parameters in GRU-RCN
  - Size of $W^l$, $W_z^l$, and $W_r^l$: $k_1 \times k_2 \times O_x \times O_h$
  - $k$: kernel size, usually $3 \times 3$, much smaller than $N_1 \times N_2$
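A quick arithmetic check of the two size formulas above. The concrete numbers (a 56 x 56 map, 128 input channels, 64 hidden channels, 3 x 3 kernel) are illustrative assumptions, not values stated on this slide, and the function names are hypothetical.

```python
def fc_gru_weight_size(n1, n2, o_x, o_h):
    """Size of one input-to-hidden matrix in the fully connected GRU: N1 * N2 * Ox * Oh."""
    return n1 * n2 * o_x * o_h

def conv_gru_weight_size(k1, k2, o_x, o_h):
    """Size of one input-to-hidden kernel in GRU-RCN: k1 * k2 * Ox * Oh."""
    return k1 * k2 * o_x * o_h

fc = fc_gru_weight_size(56, 56, 128, 64)    # 25,690,112 weights
conv = conv_gru_weight_size(3, 3, 128, 64)  # 73,728 weights
ratio = fc / conv                           # (56 * 56) / (3 * 3), roughly 348
```

The channel terms cancel in the ratio, so the saving depends only on how much smaller the kernel is than the feature map.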

Delving Deeper into Convolutional Neural Networks: Stacked GRU-RCN
Stacked GRU-RCN
$$
\begin{aligned}
z_t^l &= \sigma(W_z^l \ast x_t^l + W_{z^l}^l \ast h_t^{l-1} + U_z^l \ast h_{t-1}^l), \\
r_t^l &= \sigma(W_r^l \ast x_t^l + W_{r^l}^l \ast h_t^{l-1} + U_r^l \ast h_{t-1}^l), \\
\tilde{h}_t^l &= \tanh(W^l \ast x_t^l + U^l \ast (r_t^l \odot h_{t-1}^l)), \\
h_t^l &= (1 - z_t^l)\, h_{t-1}^l + z_t^l\, \tilde{h}_t^l
\end{aligned}
$$
- In short: $h_t^l = \phi^l(x_t^l, h_{t-1}^l, h_t^{l-1})$, where $h_t^{l-1}$ comes from the current time step and the previous layer
- $\ast$: conv2d ops

Related Works
- Karpathy et al. (2014): Large-scale Video Classification with Convolutional Neural Networks
- Tran et al. (2014): C3D (presentation by 박은수)
  - Unlike image classification, there was no dramatic progress
  - If anything, training on video with large datasets is reported to be difficult
- Simonyan & Zisserman (2014a): proposed the two-stream framework
  - RGB color and optical flow are each fed as input to train a CNN
- Ng et al. (2015), Donahue et al. (2014): apply an RNN to the top layer of the two-stream framework model

Experiments: Action Recognition
Model Architecture
- VGG-16 (pretrained on ImageNet, fine-tuned on UCF-101)
  - Extract 5 feature maps: pool2, pool3, pool4, pool5, and fc-7
  - These feature maps are the inputs $x_t^l$ of the RCN model
- UCF-101 dataset
  - 101 action classes, 13320 YouTube video clips

Experiments: Action Recognition
Three RCN architectures
- GRU-RCN
  - Number of feature maps: 64, 128, 256, 256, 512
  - Average pooling at the last time step T, e.g. layer 1 (pool2): reduces (56 x 56 x 64) to (1 x 1 x 64)
  - Each pooled representation is sent to its own classifier (five in total)
  - Each classifier is trained to focus on a single hidden representation
  - The final decision is the average of the five classifiers
  - Dropout probability: 0.7
- Stacked GRU-RCN
  - Tests how important the bottom-up connections are
  - Max-pooling is applied to match the spatial dimensions of the lower-layer input
- Bi-directional GRU-RCN
  - Tests the importance of reverse temporal information
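The pooling and averaging steps described for GRU-RCN can be sketched as follows. This is only an illustration with nested lists: the function names are hypothetical, the classifiers themselves are left out, and averaging class probabilities (rather than some other score) is an assumption.

```python
def global_average_pool(fmap):
    """Average an H x W x C feature map over H and W, giving C values (1 x 1 x C)."""
    H, W, C = len(fmap), len(fmap[0]), len(fmap[0][0])
    return [
        sum(fmap[i][j][c] for i in range(H) for j in range(W)) / (H * W)
        for c in range(C)
    ]

def average_predictions(prob_lists):
    """Average the class-probability vectors produced by several classifiers."""
    n = len(prob_lists)
    num_classes = len(prob_lists[0])
    return [sum(p[k] for p in prob_lists) / n for k in range(num_classes)]

# Toy 2 x 2 x 2 map standing in for the 56 x 56 x 64 hidden state of layer 1.
toy_map = [[[1.0, 0.0], [3.0, 0.0]], [[5.0, 4.0], [7.0, 4.0]]]
pooled = global_average_pool(toy_map)

# Two toy classifier outputs standing in for the five per-layer classifiers.
final = average_predictions([[0.5, 0.5], [1.0, 0.0]])
```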

Experiments: Action Recognition
Model Training and Evaluation
- Follows the two-stream framework
  - Batch size: 64 videos
  - Spatial cropping at a size chosen at random from 256, 224, 192, 168
  - Temporal cropping size: 10
  - The crop is resized to 224, so the final input volume is (224 x 224 x 10)
- Maximum log-likelihood
$$
\mathcal{L} = \frac{1}{N} \sum_{n=1}^{N} \log p(y^n \mid c(x^n), \theta)
$$
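A small sketch of two pieces of this training recipe: picking a crop size and evaluating the mean log-likelihood over a batch. The function names are hypothetical, the uniform choice among the four sizes is an assumption, and the predicted probabilities are plain lists rather than model outputs.

```python
import math
import random

CROP_SIZES = (256, 224, 192, 168)  # the four sizes listed on the slide

def sample_crop_size(rng):
    """Pick one crop size at random (uniform sampling is an assumption)."""
    return rng.choice(CROP_SIZES)

def mean_log_likelihood(probs, labels):
    """(1/N) * sum over n of log p(y^n | ...); probs[n] is a class distribution."""
    return sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)

rng = random.Random(0)
size = sample_crop_size(rng)

# Two toy examples: the first is predicted with certainty, the second with p = 0.5.
ll = mean_log_likelihood([[1.0, 0.0], [0.5, 0.5]], [0, 1])
```

Training maximizes this quantity, which is the same as minimizing the usual cross-entropy loss.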

Experiments: Action Recognition
Results
- Baselines
  - VGG-16: pretrained on ImageNet and fine-tuned on UCF-101
  - VGG-16 RNN: fc7 is fed as input to a (fully connected) GRU
- VGG-16 RNN (78.1) > VGG-16 (78.0): only a slight improvement
  - Evidence that the CNN top layer has lost much of the temporal information

Experiments: Action Recognition
Results (cont.): RGB test
- Best: Bi-directional GRU-RCN
- State of the art: C3D (Tran et al.): 85.2
- Karpathy: 65.2

Experiments: Action Recognition
Results (cont.): Flow test
- Best: GRU-RCN (85.4 → 85.7)
- Probably because VGG-16 is already trained on 10 consecutive images

Experiments: Action Recognition
Results (cont.): RGB + Flow
- Details: Wang et al. (2015b)
- The two models are run separately and merged by a weighted linear combination
- Baseline: fusion VGG-16: 89.1; state of the art: 90.9 (Wang)
- Combining Bi-directional GRU-RCN: 90.8
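The weighted linear combination of the RGB and flow models can be sketched as below. The slide does not give the weights, so the defaults here are placeholders, and `fuse_scores` is a hypothetical name.

```python
def fuse_scores(rgb_scores, flow_scores, w_rgb=0.5, w_flow=0.5):
    """Weighted linear combination of per-class scores from the two streams."""
    return [w_rgb * a + w_flow * b for a, b in zip(rgb_scores, flow_scores)]

def predict(scores):
    """Index of the highest fused score."""
    return max(range(len(scores)), key=lambda k: scores[k])

# Toy case: the streams disagree, and the larger flow weight decides the outcome.
fused = fuse_scores([1.0, 0.0], [0.0, 1.0], w_rgb=0.25, w_flow=0.75)
label = predict(fused)
```

In practice the weights would be chosen on validation data; that procedure is not described on the slide.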

Experiments: Video Captioning
Model Architecture
- Data
  - YouTube2Text: 1970 video clips with multiple natural language descriptions
  - train: 1200, valid: 100, test: 670
- Encoder-decoder framework: Cho et al. (2014)
  - Encoder: K equally spaced segments (K = 10)
    - Each video is divided into 10 segments, and the fc7 layer of VGG-16 is extracted for each
    - These are concatenated at the last time step and used as the input
  - Decoder: LSTM text generator with soft attention, Yao et al. (2015b)
$$
\mathcal{L} = \frac{1}{N} \sum_{n=1}^{N} \sum_{i=1}^{t_n} \log p(y_i^n \mid y_{<i}^n, x_i^n, \theta)
$$
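One possible way to cut a clip into K equally spaced segments, shown as a sketch. The slide does not say how segment boundaries are computed or which frames are used within a segment, so the integer arithmetic here and the name `segment_bounds` are assumptions.

```python
def segment_bounds(num_frames, k=10):
    """Split frame indices 0..num_frames-1 into k contiguous, nearly equal segments.

    Returns (start, end) pairs with end exclusive.
    """
    return [(i * num_frames // k, (i + 1) * num_frames // k) for i in range(k)]

bounds = segment_bounds(100)     # ten segments of 10 frames each
uneven = segment_bounds(25)      # lengths differ by at most one frame
lengths = [end - start for start, end in uneven]
```

Each segment would then be passed through VGG-16 to obtain one fc7 feature per segment.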

Experiments: Video Captioning
Results
[Results figure/table not preserved in the transcript]

Conclusion
- Different spatial resolutions are used in order to model temporal variation well
  - Closer to the top layer: more discriminative information, but lower spatial resolution
  - Closer to the bottom layers: the opposite
- Five layers are taken from VGG-16 and a multi-level GRU is applied to them

Thank you for your attention!