The likelihood of the data under the model is the product of per-example probabilities, each given by a Gibbs (Boltzmann) distribution:

$$P(D \mid W) = \prod_{d=1}^{N} P\!\left(x^{(d)} \mid W\right)$$

$$P(x \mid W) = \frac{1}{Z(W)} \exp\!\left[-\frac{E(x; W)}{k_B T}\right] = \frac{\exp\left[-E(x; W)\right]}{\sum_{x'} \exp\left[-E(x'; W)\right]}$$

$$Z(W) = \sum_{x} \exp\!\left[-\frac{E(x; W)}{k_B T}\right]$$
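To make the normalization concrete, the following sketch enumerates all binary states of a tiny model and computes the Gibbs distribution directly. The quadratic energy and the weight matrix W below are illustrative assumptions, not the thesis implementation.

```python
import itertools
import numpy as np

def boltzmann_probs(energy_fn, n_vars, kT=1.0):
    """Brute-force P(x|W) = exp(-E(x;W)/kT) / Z(W) over all binary x.

    Only feasible for small n_vars; this makes the partition function
    Z(W) = sum_x exp(-E(x;W)/kT) concrete, it is not a practical learner.
    """
    states = list(itertools.product([0, 1], repeat=n_vars))
    energies = np.array([energy_fn(np.array(x)) for x in states])
    unnorm = np.exp(-energies / kT)
    Z = unnorm.sum()                      # partition function
    return states, unnorm / Z

# Toy energy E(x;W) = -x^T W x with a hypothetical weight matrix W.
W = np.array([[0.0, 1.5], [1.5, 0.0]])
states, probs = boltzmann_probs(lambda x: -x @ W @ x, n_vars=2)
for s, p in zip(states, probs):
    print(s, round(float(p), 4))
```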
$$a_j = \begin{cases} w_j & x_i \in E_j,\; j = 0 \\ 1 & x_i \in E_j,\; j = 1, \dots, |E| \\ 0 & x_i \notin E_j,\; j = 1, \dots, |E| \end{cases}$$

$$P\!\left(x^{(d)} \mid W\right) = \frac{1}{Z(W)} \exp\!\left(-\varepsilon\!\left(x^{(d)}; W\right)\right)$$

$$\varepsilon\!\left(x^{(d)}; W\right) = -\sum_{i=1}^{|E|} w_{i_1 i_2 \dots i_{|E_i|}} \, x^{(d)}_{i_1} x^{(d)}_{i_2} \cdots x^{(d)}_{i_{|E_i|}}$$
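A minimal sketch of the hyperedge energy above, assuming hyperedges are stored as (weight, index-tuple) pairs; the function name and example weights are hypothetical, not from the thesis code.

```python
import numpy as np

def hypernetwork_energy(x, hyperedges):
    """eps(x;W) = - sum_i w_i * x_{i1} x_{i2} ... x_{ik}.

    Each hyperedge contributes its weight only when every variable it
    touches is active (the product of the selected entries is 1).
    """
    return -sum(w * np.prod(x[list(idx)]) for w, idx in hyperedges)

x = np.array([1, 0, 1, 1])
edges = [(0.8, (0, 2)), (0.3, (1, 3)), (1.1, (0, 2, 3))]  # hypothetical weights
print(hypernetwork_energy(x, edges))   # -(0.8*1 + 0.3*0 + 1.1*1) = -1.9
```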
Given the data set $D_N = \{x^{(d)}\}_{d=1}^{N}$, the likelihood factorizes as

$$P(D \mid W) = \prod_{n=1}^{N} P\!\left(x^{(n)} \mid W\right)$$

$$\ln P(D \mid W) = \sum_{d=1}^{N} \ln P\!\left(x^{(d)} \mid W\right) = \sum_{d=1}^{N} \left\{ \left[ \sum_{k=1}^{K} \frac{1}{C_k} \sum_{i_1, i_2, \dots, i_k} w^{(k)}_{i_1 i_2 \dots i_k} x^{(d)}_{i_1} x^{(d)}_{i_2} \cdots x^{(d)}_{i_k} \right] - \ln Z(W) \right\}$$

Differentiating with respect to the weight $w^{(k)}_{i_1 i_2 \dots i_k}$ gives

$$\frac{\partial}{\partial w^{(k)}_{i_1 i_2 \dots i_k}} \ln P(D \mid W) = \sum_{d=1}^{N} \left\{ x^{(d)}_{i_1} x^{(d)}_{i_2} \cdots x^{(d)}_{i_k} - \left\langle x_{i_1} x_{i_2} \cdots x_{i_k} \right\rangle_{P(x \mid W)} \right\} = N \left\{ \left\langle x_{i_1} x_{i_2} \cdots x_{i_k} \right\rangle_{\text{Data}} - \left\langle x_{i_1} x_{i_2} \cdots x_{i_k} \right\rangle_{P(x \mid W)} \right\}$$

where

$$\left\langle x_{i_1} x_{i_2} \cdots x_{i_k} \right\rangle_{\text{Data}} = \frac{1}{N} \sum_{d=1}^{N} x^{(d)}_{i_1} x^{(d)}_{i_2} \cdots x^{(d)}_{i_k}, \qquad \left\langle x_{i_1} x_{i_2} \cdots x_{i_k} \right\rangle_{P(x \mid W)} = \sum_{x} x_{i_1} x_{i_2} \cdots x_{i_k} \, P(x \mid W)$$
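The learning rule thus reduces to "data average minus model average" for each hyperedge. A sketch of one gradient-ascent step, assuming the model expectation is approximated by samples (e.g., drawn by Gibbs sampling); all names are illustrative.

```python
import numpy as np

def gradient_step(weights, edges, data, model_samples, lr=0.01):
    """One ascent step on ln P(D|W).

    d lnP / d w_e  is proportional to  <x_e>_Data - <x_e>_P(x|W),
    where x_e is the product of the variables in hyperedge e and the
    model average is estimated from `model_samples`.
    """
    def avg_product(samples, idx):
        return np.mean([np.prod(s[list(idx)]) for s in samples])

    new_w = weights.copy()
    for e, idx in enumerate(edges):
        grad = avg_product(data, idx) - avg_product(model_samples, idx)
        new_w[e] += lr * grad
    return new_w
```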
For visible units $x = (x_1, x_2, \dots, x_v)$ and hidden units $h = (h_1, h_2, \dots, h_n)$, the model marginalizes over the hidden layers:

$$P(x) = \sum_{h^1} P(x \mid h^1)\, P(h^1) = \sum_{h^1} \sum_{h^2} P(x \mid h^1)\, P(h^1 \mid h^2)\, P(h^2)$$
Extending the hierarchy to $n$ hidden layers:

$$P(x) = \sum_{h^n} \cdots \sum_{h^1} P(x \mid h^1)\, P(h^1 \mid h^2) \cdots P(h^{n-1} \mid h^n)\, P(h^n) = \sum_{h^n} \cdots \sum_{h^1} P(h^n \mid h^{n-1}) \cdots P(h^2 \mid h^1)\, P(h^1 \mid x)\, P(x)$$

$$P(h^s \mid h^{s-1}) = \frac{\exp\!\left(-\varepsilon(h^s, h^{s-1})\right)}{\sum_{h^s} \exp\!\left(-\varepsilon(h^s, h^{s-1})\right)}$$

$$\varepsilon(h^s) = \sigma\!\left(s(h^s)\right), \qquad \sigma(x) = \frac{1}{1 + \exp(-x)}$$

$$s(h^s) = \sum_{i_1} w^{(s)}_{i_1} h^{(s)}_{i_1} + \sum_{i_1, i_2} w^{(s)}_{i_1 i_2} h^{(s)}_{i_1} h^{(s)}_{i_2} + \dots + \sum_{i_1, i_2, \dots, i_k} w^{(s)}_{i_1 \dots i_k} h^{(s)}_{i_1} \cdots h^{(s)}_{i_k}$$

The visible layer consists of the visual features $r_1, r_2, \dots, r_n$ and the word features $w_1, w_2, \dots, w_m$.
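The following sketch makes the layer-by-layer marginalization explicit for two binary hidden layers; the conditionals are passed in as callables, and the uniform toy distributions in the usage are assumptions for illustration only.

```python
import itertools

def marginal_px(p_x_given_h1, p_h1_given_h2, p_h2, n1, n2):
    """P(x) = sum_{h1} sum_{h2} P(x|h1) P(h1|h2) P(h2) over binary layers."""
    total = 0.0
    for h2 in itertools.product([0, 1], repeat=n2):
        for h1 in itertools.product([0, 1], repeat=n1):
            total += p_x_given_h1(h1) * p_h1_given_h2(h1, h2) * p_h2(h2)
    return total

# Trivial usage with uniform (hypothetical) conditionals:
uniform1 = lambda h1: 0.5 ** 3          # P(x|h1) for a 3-unit visible x
uniform12 = lambda h1, h2: 0.5 ** 2     # P(h1|h2) over 2 hidden units
prior2 = lambda h2: 0.5 ** 2            # P(h2) over 2 hidden units
print(marginal_px(uniform1, uniform12, prior2, n1=2, n2=2))   # 0.125
```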
The input is the concatenation $x = (r, w)$, with $r = (r_1, r_2, \dots, r_n)$ and $w = (w_1, w_2, \dots, w_m)$. The hidden hierarchy consists of the hyperedge layer and the two concept layers $C^1$ and $C^2$, with $h^1 = h$, $C^1 = h^2$, and $C^2 = h^3$.
As new observations $(r, w, c^1)$ arrive, the posterior over the hidden layer $h$ and the concept layer $c^2$ is updated sequentially:

$$P_t(h, c^2 \mid r, w, c^1) = \frac{P(r, w \mid h, c^1, c^2)\, P(c^2 \mid c^1, h)\, P_{t-1}(h, c^2)}{P(r, w, c^1)}$$

$$P(r, w, c^1) = \iint P(r, w \mid h, c^1, c^2)\, P(c^2 \mid c^1, h)\, P_{t-1}(h, c^2)\, dh\, dc^2$$

$$P_t(h, c^2 \mid r, w, c^1) \propto \prod_{d=1}^{D} \left\{ P\!\left(r^{(d)}, w^{(d)} \mid h, c^1, c^2\right) P(c^1 \mid c^2)\, P(c^2 \mid h) \right\} P_{t-1}(h)$$
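The update has the usual recursive-Bayes form: the posterior after batch $t$ becomes the prior for batch $t+1$. A sketch over a finite, discretized hypothesis space, with made-up likelihood values; this mirrors the online update over growing drama data but is not the thesis implementation.

```python
import numpy as np

def sequential_update(prior, likelihoods):
    """P_t(h|D) ∝ P(D_t|h) P_{t-1}(h), folded in batch by batch."""
    post = prior.astype(float)
    for lik in likelihoods:
        post = post * lik
        post /= post.sum()   # renormalize = divide by the evidence term
    return post

prior = np.ones(4) / 4                       # uniform over 4 hypotheses
batches = [np.array([0.9, 0.5, 0.2, 0.1]),   # hypothetical P(D_1 | h)
           np.array([0.8, 0.4, 0.3, 0.2])]   # hypothetical P(D_2 | h)
print(sequential_update(prior, batches))
```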
The parameters are estimated by maximizing the regularized log-likelihood over the $D_t$ examples observed up to time $t$:

$$\theta_t = \underset{\theta}{\operatorname{argmax}} \left\{ \sum_{d=1}^{D_t} \left( \log P\!\left(r^{(d)}, w^{(d)} \mid c^{(d)}, e\right) + \log P\!\left(c^{(d)} \mid e\right) \right) + D_t \log P(e) - \alpha \lVert \theta \rVert \right\}, \qquad \theta = (a, e)$$

The likelihood term decomposes over the $N$ visual units and the $M$ word units:

$$\log P\!\left(r^{(d)}, w^{(d)} \mid c^2, c^1, h\right) = \sum_{n=1}^{N} \log P\!\left(r_n^{(d)} \mid c^2, c^1, h\right) + \sum_{m=1}^{M} \log P\!\left(w_m^{(d)} \mid c^2, c^1, h\right)$$

$$P\!\left(w_m^{(d)} = 1 \mid c^2, c^1, h\right) = \exp\!\left( w_m s_m - \sum_{i=1}^{|e^c|} a_i \right), \qquad P\!\left(r_n^{(d)} = 1 \mid c^2, c^1, h\right) = \exp\!\left( r_n s_n - \sum_{i=1}^{|e^c|} a_i \right)$$

$$s^w = \sum_{i=1}^{|e^c|} a_i e_i^w, \qquad s^r = \sum_{i=1}^{|e^c|} a_i e_i^r$$

where $e_i^w$ and $e_i^r$ denote the word part and the visual part of hyperedge $e_i$, respectively.
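The decomposition means the log-likelihood of an example is just a sum of per-unit log-probabilities once the conditionals are known. A minimal sketch, assuming those per-unit probabilities have already been computed from the concept layers; the array values are made up.

```python
import numpy as np

def log_joint(r_unit_probs, w_unit_probs):
    """log P(r, w | c2, c1, h) = sum_n log P(r_n|.) + sum_m log P(w_m|.)."""
    return float(np.sum(np.log(r_unit_probs)) + np.sum(np.log(w_unit_probs)))

print(log_joint(np.array([0.9, 0.7]), np.array([0.6, 0.8, 0.5])))
```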
The hyperedge weights $a_i$ are updated from the current batch and smoothed with the previous estimate:

$$a_i^t = \sum_{d=1}^{D_t} \left\{ g(e_i)\, f\!\left(r^{(d)}, w^{(d)}; e_i\right) \right\}, \qquad a_i = \lambda a_i^t + (1 - \lambda)\, a_i^{t-1}$$

$$f\!\left(r^{(d)}, w^{(d)}; e_i\right) = \begin{cases} 1, & \text{if } \left(r^{(d)}\right)^{\!T} e_i^r + \left(w^{(d)}\right)^{\!T} e_i^w > \kappa \\ 0, & \text{otherwise} \end{cases}$$

where $g(e_i)$ weights hyperedge $e_i$. The similarity between two networks $u$ and $v$ is measured by

$$\mathrm{Sim}(u, v) = \sum_{i,j} \left( u_{ij} - v_{ij} \right)^2$$
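A sketch of the weight update and the similarity measure, assuming a batch of (r, w) vector pairs per hyperedge; lam, kappa, and g_i stand in for the thesis's hyperparameter values, which are not given here.

```python
import numpy as np

def sim(u, v):
    """Sim(u, v) = sum_ij (u_ij - v_ij)^2 between two network matrices."""
    return float(np.sum((np.asarray(u) - np.asarray(v)) ** 2))

def update_edge_weight(a_prev, batch, e_r, e_w, g_i, lam=0.5, kappa=0.5):
    """a_i^t = sum_d g(e_i) f(r,w; e_i);  a_i = lam*a_i^t + (1-lam)*a_i^{t-1}.

    f fires when the example overlaps hyperedge e_i strongly enough,
    i.e. r^T e_i^r + w^T e_i^w > kappa (illustrative threshold form).
    """
    def f(r, w):
        return 1.0 if r @ e_r + w @ e_w > kappa else 0.0

    a_t = sum(g_i * f(r, w) for r, w in batch)
    return lam * a_t + (1.0 - lam) * a_prev
```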
By Bayes' rule, each modality can be inferred from the other:

$$p(w \mid r) = \frac{p(r \mid w)\, p(w)}{p(r)} \propto p(r \mid w)\, p(w), \qquad p(r \mid w) = \frac{p(w \mid r)\, p(r)}{p(w)} \propto p(w \mid r)\, p(r)$$
$$w^* = \underset{w}{\operatorname{argmax}}\; P(w \mid r, \theta) = \underset{w}{\operatorname{argmax}}\; P(r \mid w, \theta)\, P(w \mid \theta)$$

$$r^* = \underset{r}{\operatorname{argmax}}\; P(r \mid w, \theta) = \underset{r}{\operatorname{argmax}}\; P(w \mid r, \theta)\, P(r \mid \theta)$$

$$w^* = \underset{w}{\operatorname{argmax}}\; \log P(w \mid r, \theta) = \underset{w}{\operatorname{argmax}} \left\{ \log P(r \mid w, \theta) + \log P(w \mid \theta) \right\}$$
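In practice the argmax is taken over a finite candidate set, scoring each candidate by likelihood plus prior. A minimal retrieval sketch: the Gaussian-style match score and uniform prior in the usage are placeholder assumptions for the DHN-derived terms; image retrieval ($r^*$) is symmetric.

```python
import numpy as np

def retrieve(query_r, candidates_w, log_lik, log_prior):
    """w* = argmax_w [ log P(r|w,theta) + log P(w|theta) ] over candidates."""
    scores = [log_lik(query_r, w) + log_prior(w) for w in candidates_w]
    return int(np.argmax(scores))

# Toy usage (assumed scoring functions, not the thesis model):
cands = [np.array([1.0, 0.0]), np.array([0.2, 0.9])]
best = retrieve(np.array([0.1, 1.0]), cands,
                log_lik=lambda r, w: -np.sum((r - w) ** 2),
                log_prior=lambda w: 0.0)
print(best)   # 1: the closer candidate wins
```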
                   Sentence retrieval               Image retrieval
              Rec@1  Rec@5  Rec@10  Med r      Rec@1  Rec@5  Rec@10  Med r
Random          0.5    1.2     3.3    121        0.6    1.3     5.5    101
STD-RNN[38]    10.3   20.4    41.1     20       11.9   18.3    39.1     18
m-RNN[48]      24.6   36.3    52.1     15       23.5   36.3    53.0     11
FV[40]         33.5   44.3    57.1      9       35.1   46.8    61.4      3
m-CNN[51]      40.2   50.3    64.3      3       34.4   46.4    59.1      6
m-CNN+DHN      42.6   52.1    63.2      3       36.1   54.9    60.7      4
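Rec@K and "Med r" in the table are computed from the rank of the ground-truth item for each query. A small sketch, assuming 1-based ranks; the example ranks are made up.

```python
import numpy as np

def recall_at_k(ranks, k):
    """Percentage of queries whose ground-truth item ranks in the top k."""
    return 100.0 * np.mean(np.asarray(ranks) <= k)

def median_rank(ranks):
    """'Med r': median rank of the ground-truth item over all queries."""
    return float(np.median(ranks))

ranks = [1, 3, 12, 2, 7]   # hypothetical ranks for five queries
print(recall_at_k(ranks, 5), median_rank(ranks))   # 60.0 3.0
```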
[1] B. J. Biddle. Recent developments in role theory. In Proceedings of Annual Review of Sociology. 12: 67-92. (1986) [2] S. Eisenstein. The film sense. New York: Harcourt Brace & World, Inc. (1947) [3] U. Hasson, O. Furman, D. Clark, Y. Dudai and L. Davachi. Enhanced intersubject correlations during movie viewing correlate with successful episodic encoding. In Proceedings of Neuron. 57: 452-462. (2008) [4] E. Morin. The cinema or the imaginary man. Minneapolis: University of Minnesota Press. (2005) [5] J.-W. Ha, K.-M. Kim and B.-T. Zhang. Automated construction of visual-linguistic knowledge via concept learning from cartoon videos. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence. pp. 522-528. (2015) [6] A. N. Meltzoff. Toward a developmental cognitive science: The implications of cross-modal matching and imitation for development of representation and memory in infancy. In Proceedings of Annals of the New York Academy of Sciences. 608: 1-31. (1990) [7] Y. Bengio, A. Courville and P. Vincent. Representation learning: A review and new perspectives. In Proceedings of IEEE Transactions on Pattern Analysis and Machine Intelligence. 35(8): 1798-1828. (2013) [8] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard and L. D. Jackel. Backpropagation applied to handwritten Zip
code recognition. In Proceedings of Neural Computation. 1(4): 541-551. (1989) [9] S. Lawrence, C. L. Giles, A. C. Tsoi and A. D. Back. Face recognition: A convolutional neural-network approach. In Proceedings of IEEE Transactions on Neural Networks. 8(1): 98-113. (1997) [10] G. E. Hinton. Deep belief networks. In Proceedings of Scholarpedia. 4(5): 5947. (2009) [11] R. Salakhutdinov and G. E. Hinton. An efficient learning procedure for deep Boltzmann machines. In Proceedings of Neural Computation. 24(8): 1967-2006. (2012) [12] D. Wang and E. Nyberg. A long short-term memory model for answer sentence selection in question answering. In Proceedings of Association for Computational Linguistics. pp. 707-712. (2015) [13] T. Mikolov, M. Karafiát, L. Burget et al. Recurrent neural network based language model. In Proceedings of INTERSPEECH 2010. pp. 1045-1048. (2010) [14] N. Srivastava and R. Salakhutdinov. Multimodal learning with deep Boltzmann machines. In Proceedings of Advances in Neural Information Processing Systems. pp. 2222-2230. (2012) [15] C.-J. Nan, K.-M. Kim and B.-T. Zhang. Social network analysis of TV drama characters via deep concept hierarchies. In Proceedings of International Conference on Advances in Social Networks Analysis and Mining. pp. 831-836. (2015) [16] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In Proceedings of IEEE Conference on Computer Vision and
Pattern Recognition. pp. 1725-1732. (2014) [17] I.-H. Jhuo and D. T. Lee. Video event detection via multi-modality deep learning. In Proceedings of International Conference on Pattern Recognition. pp. 666-671. (2014) [18] H.-W. Chen, J.-H. Kuo, W.-T. Chu and J.-L. Wu. Action movies segmentation and summarization based on tempo analysis. In Proceedings of the 6th ACM SIGMM International Workshop on Multimedia Information Retrieval. pp. 251-258. (2004) [19] C.-W. Wang, W.-H. Cheng, J.-C. Chen, S.-S. Yang and J.-L. Wu. Film narrative exploration through the analysis of aesthetic elements. In Proceedings of the 13th International Conference on Multimedia Modeling - Volume Part I. pp. 606-615. (2007) [20] D. Tran, L. Bourdev, R. Fergus, L. Torresani and M. Paluri. C3D: Generic features for video analysis. arXiv preprint arXiv:1412.0767. (2014) [21] H. Li, H. Ji and L. Zhao. Social event extraction: Task, challenges and techniques. In Proceedings of International Conference on Advances in Social Networks Analysis and Mining. pp. 526-532. (2015) [22] V. Ramanathan, B. Yao and L. Fei-Fei. Social role discovery in human events. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 2475-2482. (2013) [23] Y.-F. Zhang, C.-S. Xu, H.-Q. Lu and Y.-M. Huang. Character identification in feature-length films using global face-name matching. In Proceedings of IEEE Transactions on Multimedia. pp. 1276-1288. (2009)
[24] C.-Y. Weng, W.-T. Chu and J.-L. Wu. RoleNet: Movie analysis from the perspective of social networks. In Proceedings of IEEE Transactions on Multimedia. pp. 256-271. (2009) [25] T. Lan, L. Sigal and G. Mori. Social roles in hierarchical models for human activity recognition. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 1354-1361. (2012) [26] T. Yu, S.-N. Lim, K. Patwardhan and N. Krahnstoever. Monitoring, recognizing and discovering social networks. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 1462-1469. (2009) [27] L. Ding and A. Yilmaz. Learning relations among movie characters: A social network perspective. In Proceedings of European Conference on Computer Vision. pp. 410-423. (2010) [28] L. Ding and A. Yilmaz. Inferring social relations from visual concepts. In Proceedings of IEEE International Conference on Computer Vision. pp. 699-706. (2011) [29] G. Wang, A. Gallagher, J.-B. Luo and D. Forsyth. Seeing people in social context: Recognizing people and social relationships. In Proceedings of European Conference on Computer Vision. pp. 169-182. (2010) [30] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, M. A. Ranzato and T. Mikolov. DeViSE: A deep visual-semantic embedding model. In Proceedings of Advances in Neural Information Processing Systems. pp. 2121-2129. (2013) [31] D. Grangier and S. Bengio. A neural network to retrieve images from text queries. In Proceedings of International Conference on Artificial Neural Networks. pp. 24-34. (2006)
[32] N. Srivastava and R. Salakhutdinov. Learning representations for multimodal data with deep belief nets. In Proceedings of International Conference on Machine Learning Representation Learning Workshop. (2012) [33] J. Weston, S. Bengio and N. Usunier. WSABIE: Scaling up to large vocabulary image annotation. In Proceedings of the International Joint Conference on Artificial Intelligence. (2011) [34] M. A. Sadeghi and A. Farhadi. Recognition using visual phrases. In Proceedings of Computer Vision and Pattern Recognition. pp. 1745-1752. (2011) [35] C. L. Zitnick, D. Parikh and L. Vanderwende. Learning the visual interpretation of sentences. In Proceedings of the IEEE International Conference on Computer Vision. pp. 1681-1688. (2013) [36] M. Hodosh, P. Young and J. Hockenmaier. Framing image description as a ranking task: Data, models and evaluation metrics. In Proceedings of Journal of Artificial Intelligence Research. 47: 853-899. (2013) [37] A. Karpathy, A. Joulin and L. Fei-Fei. Deep fragment embeddings for bidirectional image sentence mapping. In Proceedings of Advances in Neural Information Processing Systems. pp. 1889-1897. (2014) [38] R. Socher, A. Karpathy, Q. V. Le, C. D. Manning and A. Y. Ng. Grounded compositional semantics for finding and describing images with sentences. In Proceedings of Transactions of the Association for Computational Linguistics. 2: 207-218. (2014) [39] F. Yan and K. Mikolajczyk. Deep correlation for matching images and text. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 3441-3450. (2015)
[40] B. Klein, G. Lev, G. Sadeh and L. Wolf. Associating neural word embeddings with deep image representations using Fisher vectors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4437-4446. (2015) [41] R. Kiros, Y. Zhu, R. Salakhutdinov, R. S. Zemel, A. Torralba, R. Urtasun and S. Fidler. Skip-thought vectors. In Proceedings of Advances in Neural Information Processing Systems. pp. 3294-3302. (2015) [42] B. Plummer, L. Wang, C. Cervantes, J. Caicedo, J. Hockenmaier and S. Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE International Conference on Computer Vision. pp. 2641-2649. (2015) [43] X. Chen and C. L. Zitnick. Learning a recurrent visual representation for image caption generation. arXiv:1411.5654. (2014) [44] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2625-2634. (2015) [45] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 3128-3137. (2015) [46] R. Kiros, R. Salakhutdinov and R. Zemel. Multimodal neural language models. In Proceedings of International Conference on Machine Learning. pp. 595-603. (2014) [47] R. Kiros, R. Salakhutdinov and R. Zemel. Unifying visual-semantic embeddings with multimodal neural language models. arXiv:1411.2539.
(2014) [48] J. Mao, W. Xu, Y. Yang, J. Wang and A. L. Yuille. Explain images with multimodal recurrent neural networks. arXiv:1410.1090. (2014) [49] J. Mao, W. Xu, Y. Yang, J. Wang and A. L. Yuille. Deep captioning with multimodal recurrent neural networks. arXiv:1412.6632. (2014) [50] O. Vinyals, A. Toshev, S. Bengio and D. Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 3156-3164. (2015) [51] L. Ma, Z. Lu, L. Shang et al. Multimodal convolutional neural networks for matching image and sentence. In Proceedings of the IEEE International Conference on Computer Vision. pp. 2623-2631. (2015) [52] B.-T. Zhang. Hypernetworks: A molecular evolutionary architecture for cognitive learning and memory. In Proceedings of IEEE Computational Intelligence Magazine. 3(3): 49-63. (2008) [53] B.-T. Zhang, J.-W. Ha and M. Kang. Sparse population code models of word learning in concept drift. In Proceedings of Annual Meeting of the Cognitive Science Society. pp. 1221-1226. (2012) [54] B.-T. Zhang, P. Ohm and H. Mühlenbein. Evolutionary induction of sparse neural trees. In Proceedings of Evolutionary Computation. 5(2): 213-236. (1997) [55] S. S. Farfade, M. J. Saberian and L.-J. Li. Multi-view face detection using deep convolutional neural networks. In Proceedings of the 5th ACM on International Conference on Multimedia Retrieval. pp. 643-650. (2015) [56] P. Viola and M. J. Jones. Robust real-time face detection. In Proceedings of International Journal of Computer Vision. 57(2):
137-154. (2004) [57] R. Socher, B. Huval, B. Bath, C. D. Manning and A. Y. Ng. Convolutional-recursive deep learning for 3D object classification. In Proceedings of Advances in Neural Information Processing Systems. pp. 665-673. (2012) [58] A. Coates, H. Lee and A. Y. Ng. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of Journal of Machine Learning Research. 1001(48109): 2. (2010) [59] L. Fei-Fei and P. Perona. A Bayesian hierarchical model for learning natural scene categories. In Proceedings of Computer Vision and Pattern Recognition. 2: 524-531. (2005) [60] R. Girshick, J. Donahue, T. Darrell and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 580-587. (2014) [61] T. Mikolov, I. Sutskever, K. Chen, G. Corrado and J. Dean. Distributed representations of words and phrases and their compositionality. In Proceedings of Advances in Neural Information Processing Systems. pp. 3111-3119. (2013) [62] B. S. Everitt and G. Dunn. Principal components analysis. In Proceedings of Applied Multivariate Data Analysis. 2: 48-73. (1993) [63] N. Durrani, A. Fraser, H. Schmid, H. Hoang and P. Koehn. Can Markov models over minimal translation units help phrase-based SMT? In Proceedings of ACL (2). pp. 399-405. (2013) [64] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556. (2014)
ABSTRACT Multimodal Learning from TV Drama using Deep Hypernetworks Chang-Jun Nan Computer Science and Engineering The Graduate School Seoul National University Recently, the development of internet technology and advances in deep learning research have led to the rapid expansion of datasets in the artificial intelligence field. Beyond standardized single-modality datasets such as ImageNet and WordNet, representative multimodal datasets such as Flickr 8K, Flickr 30K and Microsoft COCO have also appeared. Artificial intelligence systems trained on this kind of static data have achieved many successes in fields such as image retrieval and visual-language translation. Nevertheless, to handle a much wider variety of real-world problems, artificial intelligence technology that can learn dynamic multimodal data efficiently is necessary. TV drama is a kind of big data that contains an enormous amount of knowledge about modern human society. As its character-centered stories unfold, this video data presents diverse knowledge on topics such as economics, politics and culture. In particular, the characters' speaking habits
and behavioral patterns in different situations provide helpful information for understanding the social relationships between the characters. However, because of the dynamic and multimodal properties of TV drama, it is difficult for a learning model to extract knowledge from the videos automatically. Solving these problems requires efficient learning technology for dynamic multimodal data, together with diverse image processing methods. Here, we propose a multimodal learning method based on deep hypernetworks (DHNs) to construct and analyze knowledge from TV drama automatically. A DHN uses a multilevel hierarchical structure to abstract knowledge at various levels and thereby extracts knowledge from data, which makes complicated multimodal learning efficient. Compared to the fixed structure of neural network models, the structure of a DHN can change flexibly, making it more appropriate for handling dynamic information. Following the proposed method, we chose TV drama as our research object. For our experiments, we used approximately 183 episodes, or 4,400 minutes, of the TV drama Friends as our dataset. Using various image processing methods, we extracted visual information such as scenes and characters. The social network between the characters was then established automatically by the DHN model, and the relationship changes across different scenes were analyzed. The social network analysis showed that the proposed method is effective for multimodal learning. Furthermore, the relationship changes between characters as the story unfolds demonstrated that learning from dynamic multimodal data is achievable. Moreover, for quantitative evaluation, we used the knowledge extracted from the data to conduct visual-language translation experiments. The experimental results confirmed that the knowledge extracted through multimodal learning improved the accuracy of visual-language translation, and that the accuracy increased as the story accumulated. Keywords: Deep Hypernetworks; Multimodal Learning; Social Network Analysis; Visual-language Translation. Student Number: 2014-25159