Text Sentiment Analysis with R
- 가은 장
- 6 years ago
2 R Data Analyst / ( ) / kim@mindscale.kr
3 / ( ) ( ) Analytic Director R ( ) / / 3/45
4 4/45
5 R? 1. : / 2. : ggplot2 / Web 3. : slidify 4. : 5. Matlab / Python -> R Interactive Plots. 5/45
6 :.,,, SNS. : (, ).,. : + 6/45
7 - Text KLT2000 ( ) R wordcloud shiny 7/45
8 Shiny - - 8/45
9 The film The Imitation Game & Benedict 9/45
10 - Text 10/45
11 library(KoNLP) library(tm) library(qgraph) (stopwords) ## [1] "" "3d" "4d" "cg " "" ## [1] "" "" "" "" "" 11/45
12 [1] "" "" "" "" "" "" "" "" "" "" [1] "" "" "" "" " " "" [7] " " "" "" "" 12/45
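The stopword step on the slides above can be sketched in base R. This is only an illustration: the reviews and the stopword list are hypothetical (the deck's own data and list are not recoverable), and plain whitespace tokenisation stands in for the morphological analysis that KoNLP would perform on Korean text.

```r
# Toy reviews (hypothetical data, not taken from the deck).
reviews <- c("the 3d effects were great",
             "great cast but the cg was weak",
             "the story was great")

# Hypothetical stopword list, for illustration only.
stop_words <- c("the", "was", "were", "but")

# Whitespace tokenisation; an assumption that replaces KoNLP's
# morphological analysis for this English toy example.
tokens <- strsplit(tolower(reviews), "\\s+")

# Drop stopwords from every document.
tokens_clean <- lapply(tokens, function(tok) tok[!tok %in% stop_words])

# Corpus-wide term frequencies, most frequent first.
term_freq <- sort(table(unlist(tokens_clean)), decreasing = TRUE)
print(term_freq)
```

The resulting frequency table is the kind of input a word cloud or a term network plot would take.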
13 a a library(networkD3) 13/45
14 How?
15 tm / tau / NLP / openNLP KoNLP tm.plugin.sentiment 15/45
16 Dragut, E. C., Yu, C., Sistla, P., & Meng, W. (2010). Construction of a sentimental word dictionary. Paper presented at the Proceedings of the 19th ACM international conference on Information and knowledge management. Rao, Y., Lei, J., Wenyin, L., Li, Q., & Chen, M. (2014). Building emotional dictionary for sentiment analysis of online news. World Wide Web, 17(4). 16/45
17 Workflow 17/45
18 (tm.plugin.sentiment) 18/45
19 : Mario Annau (2010) 19/45
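A dictionary-based score of the kind tm.plugin.sentiment reports can be sketched in base R. The definitions below (polarity as (p - n)/(p + n), subjectivity as (p + n)/N, sentiment difference as (p - n)/N) are my understanding of that package's scores and should be treated as an assumption; the word lists and the function name `score_sentiment` are hypothetical.

```r
# Hypothetical sentiment dictionaries (illustrative only).
positive <- c("great", "moving", "brilliant")
negative <- c("boring", "weak", "slow")

# Score one tokenised document against the two dictionaries.
score_sentiment <- function(tokens, positive, negative) {
  p <- sum(tokens %in% positive)  # number of positive terms
  n <- sum(tokens %in% negative)  # number of negative terms
  N <- length(tokens)             # total number of terms
  c(polarity     = if (p + n > 0) (p - n) / (p + n) else NA_real_,
    subjectivity = if (N > 0) (p + n) / N else NA_real_,
    senti_diff   = if (N > 0) (p - n) / N else NA_real_)
}

doc <- c("great", "cast", "brilliant", "script", "slow", "ending")
scores <- score_sentiment(doc, positive, negative)
print(scores)
```

Polarity ignores neutral terms, while the sentiment difference is diluted by them, so a long and mostly neutral review can have high polarity and a small sentiment difference at the same time.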
20 (,, ) 20/45
21 21/45
22 22/45
23 23/45
24 24/45
25 WHY? ## [1] ",......" ## [2] "..?..." ## [3] " " ## [4] " " ## [5] " : " ## [6] ".." ## [7] " " ## [8] "..." ## [9] "" ## [10] "..." ## [11] " " ## [12] ".? ## [13] " " ## [14] "0? " ## [15] "." ## [16] "..." ## [17] "." ## [18] "? " ## [19] " " ## [20] ".?..SK ## [21] " ## [22] " ^^ " ## [23] "..." ## [24] " " 25/45
26 Probabilistic Topic Models LDA Blei, David M. and Ng, Andrew and Jordan, Michael. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research 26/45
27 LDA 27/45
28 LDA 28/45
29 SLDA 29/45
30 SLDA Blei and McAuliffe, (2008). Supervised topic models. Advances in Neural Information Processing Systems, pages MIT Press. Cross-Validation X TEST.POINT POLARITY SENTI.DIFF SLDA 1 test.point Polarity Senti-Diff Training Set Test Set 7:3 4 slda library(lda) library(topicmodels) library(LDAvis) library(servr) 30/45
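The 7:3 training/test split mentioned on this slide can be sketched in base R. Everything here is a placeholder: the corpus size, the random scores, and the mean-prediction baseline are hypothetical, and an sLDA fit would take the place of the baseline in a real evaluation.

```r
set.seed(1)                        # arbitrary seed, for reproducibility only
n_docs   <- 100                    # hypothetical corpus size
polarity <- runif(n_docs, -1, 1)   # placeholder per-document scores

# Draw 70% of the document indices for training; the rest form the test set.
train_idx <- sample(n_docs, size = round(0.7 * n_docs))
test_idx  <- setdiff(seq_len(n_docs), train_idx)

train <- polarity[train_idx]
test  <- polarity[test_idx]

# Baseline: predict the training mean for every test document and
# measure the root mean squared error on the held-out 30%.
baseline <- mean(train)
rmse     <- sqrt(mean((test - baseline)^2))
print(rmse)
```

Any model evaluated on the same held-out documents can then be compared against this baseline error.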
31 Graph
32 LDAvis output, selected topic 9. Left panel: Intertopic Distance Map (via multidimensional scaling), axes PC1 and PC2, with a marginal topic distribution legend. Right panel: Top-30 Most Relevant Terms for Topic 9 (9.1% of tokens), showing overall term frequency and estimated term frequency within the selected topic. Footnotes: 1. saliency(term w) = frequency(w) * [sum_t p(t | w) * log(p(t | w)/p(t))] for topics t; see Chuang et al. (2012) 2. relevance(term w | topic t) = λ * p(w | t) + (1 - λ) * p(w | t)/p(w); see Sievert & Shirley (2014) 32/45
33 LDAvis output, selected topic 7: Top-30 Most Relevant Terms for Topic 7 (9.4% of tokens); panels and footnotes as on slide 32. 33/45
34 LDAvis output, selected topic 4: Top-30 Most Relevant Terms for Topic 4 (6.2% of tokens); panels and footnotes as on slide 32. 34/45
35 LDAvis output, selected topic 15: Top-30 Most Relevant Terms for Topic 15 (3.9% of tokens); panels and footnotes as on slide 32. 35/45
36 LDAvis output, selected topic 18: Top-30 Most Relevant Terms for Topic 18 (3.4% of tokens); panels and footnotes as on slide 32. 36/45
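The relevance and saliency footnotes on the LDAvis slides can be worked through on a toy model in base R. The topic-term matrix `phi`, the topic weights `p_topic`, and the value of λ are all hypothetical, and the formulas are implemented exactly as printed in the footnotes (Sievert & Shirley's paper states relevance on a log scale, which this sketch does not use). The term probability p(w) stands in for frequency(w) in the saliency formula.

```r
# Hypothetical topic-term probabilities p(w | t): rows are topics, columns terms.
phi <- rbind(t1 = c(actor = 0.6, plot = 0.3, music = 0.1),
             t2 = c(actor = 0.2, plot = 0.3, music = 0.5))
p_topic <- c(t1 = 0.5, t2 = 0.5)   # hypothetical marginal topic distribution p(t)

# Marginal term probability p(w) = sum_t p(w | t) * p(t).
p_w <- colSums(phi * p_topic)

# relevance(w | t) = lambda * p(w | t) + (1 - lambda) * p(w | t) / p(w)
relevance <- function(phi, p_w, lambda) {
  lift <- sweep(phi, 2, p_w, "/")  # p(w | t) / p(w)
  lambda * phi + (1 - lambda) * lift
}
rel <- relevance(phi, p_w, lambda = 0.6)  # lambda chosen arbitrarily

# saliency(w) = p(w) * sum_t p(t | w) * log(p(t | w) / p(t))
p_t_given_w <- sweep(phi * p_topic, 2, p_w, "/")
distinct    <- colSums(p_t_given_w * log(p_t_given_w / p_topic))
saliency    <- p_w * distinct

print(round(rel, 3))
print(round(saliency, 3))
```

In this toy model "plot" is equally likely under both topics, so its saliency is zero, while lowering λ pushes topic-specific terms such as "music" up the ranking for topic t2.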
37 Dynamic Topic Model Blei, D. M., & Lafferty, J. D. (2006) Dynamic topic models. In Proceedings of the 23rd international conference on Machine learning. ACM. 37/45
38 A 38/45
39 A 39/45
40 Marginal Topic Distribution 40/45
41 Deep-Learning 41/45
42 Deep-Learning 42/45
43 WordNet / SentiWordNet N-gram + LDA Conditional Random Fields Recursive Neural Network Recurrent Neural Network Convolutional Neural Network 43/45
44 : 44/45