R Data Analyst / kim@mindscale.kr
Analytic Director
Why R?
- Visualization: ggplot2 / Web
- Presentation: slidify
- Matlab / Python -> R: interactive plots
Text analysis: KLT2000 / R / wordcloud / shiny
Shiny
Movie: The Imitation Game & Benedict
Text
library(KoNLP)   # note: the package name is case-sensitive (KoNLP, not konlp)
library(tm)
library(qgraph)
## stopwords (sample): "3d" "4d" "cg" ... (Korean terms lost in extraction)
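The preprocessing the packages above perform (tokenize, drop stopwords, count terms before plotting a wordcloud) can be sketched in base R. The reviews and stopword vector here are English placeholders, since the original Korean data is not available in this document:

```r
# Minimal sketch of the preprocessing step: tokenize, remove stopwords,
# and count term frequencies. Placeholder data -- the slides used Korean
# reviews and a Korean stopword vector (e.g. "3d", "4d", "cg", ...).
reviews   <- c("great movie great acting", "the movie was 3d")
stopwords <- c("the", "was", "3d")

tokens <- unlist(strsplit(reviews, "\\s+"))   # whitespace tokenizer
tokens <- tokens[!tokens %in% stopwords]      # drop stopword tokens
freq   <- sort(table(tokens), decreasing = TRUE)
print(freq)                                   # great/movie lead the counts
```

The resulting frequency table is what `wordcloud::wordcloud()` would consume.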
library(networkD3)   # case-sensitive package name: networkD3
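A hedged sketch of how a term graph for networkD3 can be assembled: build a word co-occurrence edge list in base R, then hand it to `simpleNetwork()`. The documents below are placeholders for the Korean review data:

```r
# Build an edge list of words that co-occur within the same document.
# Placeholder documents; the slides used tokenized Korean reviews.
docs <- list(c("turing", "enigma", "code"),
             c("turing", "code"))

# all unordered word pairs per document, stacked into one edge list
pairs <- do.call(rbind, lapply(docs, function(w) t(combn(sort(unique(w)), 2))))
edges <- as.data.frame(pairs, stringsAsFactors = FALSE)
names(edges) <- c("src", "target")
print(edges)

# library(networkD3)     # uncomment when the package is installed
# simpleNetwork(edges)   # interactive force-directed co-occurrence graph
```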
How?
Packages: tm / tau / NLP / openNLP / KoNLP / tm.plugin.sentiment
Sentiment lexicons:
- http://mpqa.cs.pitt.edu/lexicons/subj_lexicon/
- http://word.snu.ac.kr/kosac/
- http://clab.snu.ac.kr/arssa/doku.php?id=app_dict_1.0
- www.openhangul.com
Dragut, E. C., Yu, C., Sistla, P., & Meng, W. (2010). Construction of a sentimental word dictionary. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management.
Rao, Y., Lei, J., Wenyin, L., Li, Q., & Chen, M. (2014). Building emotional dictionary for sentiment analysis of online news. World Wide Web, 17(4), 723-742.
Workflow
(tm.plugin.sentiment)
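tm.plugin.sentiment scores each document with simple dictionary-ratio metrics; its polarity score is, to my understanding, (pos - neg) / (pos + neg) over matched lexicon terms. A base-R sketch with tiny illustrative lexicons (not the package's actual dictionaries):

```r
# Hedged base-R sketch of the polarity metric used by tm.plugin.sentiment:
# (pos - neg) / (pos + neg) over dictionary hits. Toy lexicons only.
pos_words <- c("good", "great", "masterpiece")
neg_words <- c("bad", "boring", "waste")

polarity <- function(text) {
  words <- unlist(strsplit(tolower(text), "\\s+"))
  p <- sum(words %in% pos_words)   # positive-term hits
  n <- sum(words %in% neg_words)   # negative-term hits
  if (p + n == 0) return(NA_real_) # no sentiment terms matched
  (p - n) / (p + n)
}

polarity("great movie but boring ending")   # 1 pos, 1 neg -> 0
polarity("a great great masterpiece")       # 3 pos, 0 neg -> 1
```

Scores range from -1 (all negative hits) to +1 (all positive hits), with NA when no lexicon term appears.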
Source: Mario Annau (2010)
WHY?
## (sample of 24 raw review texts; Korean content lost in extraction)
Probabilistic Topic Models: LDA
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993-1022.
LDA
LDA
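LDA's generative story (each word is drawn by first sampling a topic from the document's topic mixture, then a word from that topic's term distribution) can be illustrated in a few lines of base R. The topics, terms, and probabilities below are invented for illustration:

```r
# Toy illustration of LDA's generative process. All distributions are
# made up; a real model would estimate them from the review corpus.
set.seed(42)
topics <- list(
  cipher = c(enigma = 0.6, code = 0.4),     # beta for topic "cipher"
  acting = c(benedict = 0.7, role = 0.3)    # beta for topic "acting"
)
theta <- c(cipher = 0.5, acting = 0.5)      # per-document topic mixture

gen_word <- function() {
  z <- sample(names(theta), 1, prob = theta)          # z ~ theta
  sample(names(topics[[z]]), 1, prob = topics[[z]])   # w ~ beta_z
}
doc <- replicate(10, gen_word())            # generate a 10-word document
print(doc)
```

Fitting the model in practice inverts this process, e.g. with `topicmodels::LDA()` on a document-term matrix.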
SLDA
SLDA
Blei, D. M., & McAuliffe, J. D. (2008). Supervised topic models. Advances in Neural Information Processing Systems, 121-128. MIT Press.

Cross-validation: training set : test set = 7 : 3

              test.point  Polarity  Senti-Diff   SLDA
  test.point        1.00      0.01        0.07   0.66
  Polarity          0.01      1.00        0.75  -0.01
  Senti-Diff        0.07      0.75        1.00   0.05
  SLDA              0.66     -0.01        0.05   1.00

library(lda)
library(topicmodels)
library(LDAvis)
library(servr)
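The 7:3 train/test split used for the cross-validation above can be sketched in base R (the corpus size is a placeholder; the supervised fit itself would presumably use the lda package, e.g. `slda.em()`, with polarity as the response):

```r
# 7:3 train/test split over document indices. n is a placeholder for
# the number of reviews in the actual corpus.
set.seed(123)
n     <- 100
idx   <- sample(n, size = round(0.7 * n))   # 70% sampled for training
train <- idx
test  <- setdiff(seq_len(n), idx)           # remaining 30% for testing
length(train); length(test)
```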
Graph
LDAvis screenshots (Intertopic Distance Map via multidimensional scaling, plus Top-30 Most Relevant Terms for the selected topic):
- Topic 9 (9.1% of tokens), λ = 0.51, 10-topic model
- Topic 7 (9.4% of tokens), λ = 0.51, 10-topic model
- Topic 4 (6.2% of tokens), λ = 0.5, 20-topic model
- Topic 15 (3.9% of tokens), λ = 0.5, 20-topic model
- Topic 18 (3.4% of tokens), λ = 0.5, 20-topic model

1. saliency(term w) = frequency(w) * [sum_t p(t|w) * log(p(t|w)/p(t))] over topics t; see Chuang et al. (2012)
2. relevance(term w | topic t) = λ * p(w|t) + (1 - λ) * p(w|t)/p(w); see Sievert & Shirley (2014)
Dynamic Topic Model
Blei, D. M., & Lafferty, J. D. (2006). Dynamic topic models. In Proceedings of the 23rd International Conference on Machine Learning. ACM.
Marginal Topic Distribution
Deep Learning
- WordNet / SentiWordNet
- N-gram + LDA
- Conditional Random Fields
- Recursive Neural Network
- Recurrent Neural Network
- Convolutional Neural Network
http://course.mindscale.kr/course/text-analysis