Sentiment Classification with Word Attention based on Weakly Supervised Learning with a Convolutional Neural Network

Gichang Lee¹, Jaeyun Jeong¹, Seungwan Seo¹, CzangYeob Kim¹, Pilsung Kang¹

arXiv:1709.09885v1 [cs.CL] 28 Sep 2017

¹School of Industrial Management Engineering, Korea University, Seoul, South Korea. Correspondence to: Pilsung Kang <pilsung_kang@korea.ac.kr>.

Abstract

In order to maximize the applicability of sentiment analysis results, it is necessary not only to classify the overall sentiment (positive/negative) of a given document but also to identify the main words that contribute to the classification. However, most datasets for sentiment analysis provide a sentiment label only for each document or sentence; there is no information about which words play an important role in the classification. In this paper, we propose a method for identifying key words that discriminate positive from negative sentences by using weakly supervised learning based on a convolutional neural network (CNN). In our model, each word is represented as a continuous-valued vector, and each sentence is represented as a matrix whose rows correspond to the word vectors used in the sentence. The CNN model is then trained using these sentence matrices as inputs and the sentiment labels as the output. Once the CNN model is trained, we implement a word attention mechanism that identifies the words contributing strongly to the classification result with a class activation map, using the weights of the fully connected layer at the end of the learned CNN model. To verify the proposed methodology, we evaluated the classification accuracy and the inclusion rate of polarity words on two movie review datasets. Experimental results show that the proposed model can not only correctly classify sentence polarity but also successfully identify the corresponding words with high polarity scores.

Keywords: Weakly Supervised Learning, Word Attention, Convolutional Neural Network, Class Activation Mapping

1. Introduction

Sentiment analysis and opinion mining is a field of study that analyzes people's opinions, sentiments, evaluations, attitudes, and emotions from written language. It is one of the most active research areas in natural language processing (NLP) and has also been widely studied in data mining, Web mining, and text mining (Medhat et al., 2014; Liu, 2012; Pang et al., 2008; Ravi & Ravi, 2015). Application domains for sentiment analysis include analyses of customer responses to new products or services and analyses of public opinion toward a government's new policies or political issues under debate (Jo, 2012). In response to increasing needs in diverse domains, various sentiment analysis techniques have been developed (Gui et al., 2017; Cho et al., 2014; Poria et al., 2016; Xianghua et al., 2013; Socher et al., 2013; Kalchbrenner et al., 2014; Tai et al., 2015). However, many current sentiment analysis techniques suffer from the over-abstraction problem (Nasukawa & Yi, 2003): the only information they provide is the polarity of the document, i.e., whether the nuance of the document is positive or negative. It is difficult to obtain more in-depth results, such as identifying the main words contributing to the polarity classification, or finding words or phrases that oppose the overall sentiment of the document, i.e., negative words/phrases in a positive document or positive words/phrases in a negative document.
Recently, attention models have been highlighted in the field of computer vision because of their ability to focus on semantically significant areas of a given image when solving the tasks of object classification, localization, and detection (Ba et al., 2014; Russakovsky et al., 2015; Mnih et al., 2014). They have also been widely adopted in the field of NLP, as attention models can provide more fruitful interpretations for text analysis tasks (Luong et al., 2015; Shen & Huang, 2016; Rush et al., 2015).

Attention models help an NLP model focus on salient words and phrases, and these attentions can be transferred to other machine learning models to solve more complicated tasks such as image captioning or text-to-image generation (Xu et al., 2015). In addition, as one of the basic building blocks of artificial intelligence (AI) is to understand a human speaker's intention, global technology leaders have released their own AI speakers, such as Amazon's Echo, Google's Google Home, and Apple's HomePod, to collect real-world conversational data and upgrade their AI engines. Because these AI speakers process the human speaker's query at the sentence level, it becomes critical to correctly identify the main intentions (words/phrases) of the speaker, which is the ultimate goal of attention models.

It is not easy to implement an attention model for NLP tasks. This is mainly because most text datasets carry document-level labels, i.e., whether the overall nuance of a document is positive or negative, whereas phrase- or word-level sentiment labels are rarely available. The model is therefore restricted to learning attention scores for words or phrases without actual labels. To overcome this problem, previous studies modified the structure of a recurrent neural network (RNN) so that added weights play an attention role inside the model. Applications of RNN-based attention models include document classification (Yang et al., 2016), parsing (Vinyals et al., 2015), machine translation (Bahdanau et al., 2014; Luong et al., 2015), and image captioning (Xu et al., 2015).

In this paper, we propose a sentiment classification model with word attention based on weakly supervised learning with a convolutional neural network (CNN), named CAM²: Classification and Attention Model with a Class Activation Map. The main advantage of the proposed model is its ability to identify the words or phrases in a sentence that are crucial from the sentiment classification perspective, without explicit word- or phrase-level sentiment polarity information. It identifies these words from weak labels only, i.e., the sentence-level polarity, which is more abstract but easily available. In the proposed model, words are embedded in a fixed-size continuous vector space using Word2Vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), and FastText (Bojanowski et al., 2016). Sentences are represented in matrix form, with rows corresponding to word vectors, and these matrices are used as the input of a CNN model. The CNN model is trained with the sentence-level sentiment polarity as the target, and it produces both a sentence-level polarity score and word-level polarity scores for all words in the sentence, which helps us understand the result of the sentence-level sentiment classification. Unlike existing attention models based on RNNs, there is no need to learn separate weights for the attention. Considering that the same word is used in different contexts in different domains, it is also relatively easy to build a dictionary reflecting the characteristics of each domain with the proposed model.

The rest of this paper is organized as follows. In Section 2, we briefly review and discuss related work. In Section 3, we describe the architecture of the proposed model. Detailed experimental settings are given in Section 4, followed by the analysis and discussion of the results in Section 5. Finally, in Section 6, we present our conclusions.
2. Related Work

In this section, we briefly review representative studies on CNN-based document classification (Kim, 2014), weakly supervised learning for CNN-based object detection (Oquab et al., 2015; Zhou et al., 2016), and the RNN-based document attention model named the hierarchical attention network (Yang et al., 2016).

2.1. Convolutional Neural Networks for Document Classification

Kim (2014) showed that the CNN, the most successful neural network structure for image processing, can also work well for text data, especially for document classification. The architecture of Kim (2014) is shown in Figure 1, and it rests on three main ideas: (1) a large number of filters are used, but the network is not as deep as popular CNN architectures for image processing; (2) the size of each CNN filter is matched to the vector size of the input words; (3) multiple channels consisting of static and non-static input vectors are combined. Experimental results showed that the CNN-based document classification model achieved higher classification accuracy than conventional machine-learning-based models, such as the support vector machine or the conditional random field, and than other deep neural network structures, such as the deep feedforward neural network or the recursive neural network. In addition, the word vectors could be customized for a given corpus, which sometimes yielded better classification performance than pre-trained word vectors.
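Concretely, each sentence is embedded as a (sentence length × embedding dimension) matrix and convolved with filters whose width equals the embedding dimension, so the filters slide over word positions only. The following is a minimal PyTorch sketch of a single-channel variant of this architecture; the class name and the hyper-parameter defaults are illustrative assumptions, not Kim (2014)'s released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    """Minimal single-channel text CNN in the spirit of Kim (2014)."""
    def __init__(self, vocab_size, embed_dim=100, n_classes=2,
                 window_sizes=(3, 4, 5), n_filters=100):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # One convolution per window size; each filter spans the full
        # embedding dimension, so it slides over words only.
        self.convs = nn.ModuleList([
            nn.Conv2d(1, n_filters, kernel_size=(h, embed_dim))
            for h in window_sizes
        ])
        self.fc = nn.Linear(n_filters * len(window_sizes), n_classes)

    def forward(self, token_ids):              # (batch, seq_len)
        x = self.embedding(token_ids)          # (batch, seq_len, embed_dim)
        x = x.unsqueeze(1)                     # (batch, 1, seq_len, embed_dim)
        pooled = []
        for conv in self.convs:
            fmap = F.relu(conv(x)).squeeze(3)  # (batch, n_filters, seq_len-h+1)
            # Kim (2014) uses max-over-time pooling per feature map.
            pooled.append(F.max_pool1d(fmap, fmap.size(2)).squeeze(2))
        z = torch.cat(pooled, dim=1)           # (batch, n_filters * |windows|)
        return self.fc(z)                      # class logits
```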

Figure 1. Model architecture with two channels for an example sentence (Kim, 2014).

Figure 2. Class activation mapping (Zhou et al., 2016).

2.2. Class Activation Mapping

Oquab et al. (2015) proposed a weakly supervised learning method for object detection that requires no bounding box information. In that study, a standard CNN architecture with max pooling between the final convolution layer and the output layer was utilized. Zhou et al. (2016) showed that average pooling is more appropriate than max pooling for the object detection task. The CNN structure and an example of the attention mechanism are shown in Figure 2. In this model, the CNN is trained to correctly classify the object in the input image. In Figure 2, the target of the given image is an Australian terrier, but no information on the dog's position in the image is available during training. When training is complete, the weights of the fully connected layer are used to combine the feature maps and emphasize the attention area of the original input image. This process is called class activation mapping (CAM); by utilizing it, the CNN model can determine not only that the Australian terrier is in the image, but also that this classification is mainly inferred from the bottom-right part of the image (the red area in the final CAM in Figure 2).

2.3. Hierarchical Attention Network

Yang et al. (2016) proposed a hierarchical RNN architecture, inspired by the fact that a document consists of sentences and sentences are composed of words. The authors added attention weights to reflect the importance of each sentence and word. As can be seen in Figure 3, the output of their model is the most similar to what we attempt in this study. The main differences are that Yang et al. (2016) employed an RNN as the base model and learned the attention weights separately from the corpus, whereas we employ a CNN as the base model for sentiment classification and do not explicitly train the model to learn word-level attention scores.
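The CAM combination step itself is just a weighted sum of the final feature maps, with the fully connected weights of the chosen class acting as the coefficients. Below is a small NumPy sketch of this computation; the array shapes and the normalization step are illustrative assumptions.

```python
import numpy as np

def class_activation_map(feature_maps, fc_weights, class_idx):
    """Compute a class activation map in the style of Zhou et al. (2016).

    feature_maps: (H, W, K) activations of the last conv layer,
                  i.e., the K maps that were globally average-pooled.
    fc_weights:   (n_classes, K) weights of the final dense layer.
    Returns an (H, W) map; large values mark the regions that
    contributed most to the score of `class_idx`.
    """
    w = fc_weights[class_idx]                             # (K,)
    cam = np.tensordot(feature_maps, w, axes=([2], [0]))  # (H, W)
    # Optional: normalize to [0, 1] for visualization.
    cam -= cam.min()
    if cam.max() > 0:
        cam /= cam.max()
    return cam

# Toy usage with random activations: 7x7 spatial grid, 512 maps, 10 classes.
fmaps = np.random.rand(7, 7, 512)
W_fc = np.random.rand(10, 512)
heatmap = class_activation_map(fmaps, W_fc, class_idx=3)
```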

Figure 3. Hierarchical attention network (Yang et al., 2016).

3. Classification and Attention Model based on a Class Activation Map: CAM²

3.1. Overall Framework

Figure 4 shows the overall framework of the proposed method. After collecting the sentences, low-level embedding is performed with the Word2Vec, GloVe, and FastText methods, and the word vectors in a sentence are concatenated to form the initial input matrix for the CNN. Once CNN training is completed, the polarity of a given test sentence is predicted. Then, the weights of the fully connected layer are used to combine the feature maps and produce an attention score for every single word in the sentence.

3.2. Network Architecture

The architecture of the CNN used in this paper is rooted in the CNN architecture of Kim (2014). However, since that CNN was originally designed for document classification, we made some modifications to facilitate the extraction of essential words or phrases. First, zero-padding is added before the first word and after the last word of the sentence so that each word is included in the receptive field the same number of times during convolution, irrespective of the word's position in the sentence. Second, we applied average pooling instead of max pooling. According to Zhou et al. (2016), average pooling and max pooling are essentially similar, but average pooling is advantageous in identifying the overall scope of the target. Third, we increased the number of filters compared with the CAMs used in Oquab et al. (2015) and Zhou et al. (2016). As those CAMs are specialized for image processing, their receptive field of convolution is a square (e.g., 3 × 3). The receptive field of the proposed CAM², in contrast, is a rectangle (e.g., 3 × the word embedding dimension), which integrates a larger amount of information into one scalar value than a convolutional filter in image processing does. To prevent a possible loss of information due to this larger receptive field, we used many more convolution filters than were used in Kim (2014). Finally, we used a wider variety of word embedding techniques to form the input matrix of a sentence. Kim (2014) used only Word2Vec for word embedding, whereas we also consider two recently developed word embedding techniques: GloVe and FastText.

3.3. Classification and Attention Model based on a Class Activation Map

The input of the CNN, $x_{1:l}$, is created by concatenating the word vectors in a sentence with zero-paddings. We used four types of inputs: CNN-rand, CNN-static, CNN-non-static, and CNN-Multichannel. CNN-rand uses randomly initialized word vectors, while CNN-static and CNN-non-static use word vectors pre-trained by Word2Vec. CNN-Multichannel uses word vectors pre-trained by Word2Vec, GloVe, and FastText. Let $k$, $d$, and $h$ denote the dimension of the word embedding vectors, the maximum number of words in a sentence, and the height of the receptive field of convolution, respectively.

Figure 4. Framework of the proposed method.

The input matrix $X \in \mathbb{R}^{[d+2(h-1)] \times k}$ is then constructed as

$$X = x_{1:l} = [\,\underbrace{0, \ldots, 0}_{h-1},\ \underbrace{x_1, x_2, \ldots, x_d}_{d},\ \underbrace{0, \ldots, 0}_{h-1}\,]. \qquad (1)$$

The zero-padding is performed before and after $x_{1:d}$ so that every word is included in the receptive field the same number of times ($h$ times) during convolution. When the window size of the CNN filter, i.e., the height of the filter, is $h$, the $i$-th feature map $f_i$ is constructed as

$$f_i = [f_{1i}, f_{2i}, \ldots, f_{Ii}]^T, \qquad (2)$$
$$f_{ji} = \mathrm{ReLU}(W_{conv} \cdot x_{j:j+h-1} + b), \qquad (3)$$
$$W_{conv} \in \mathbb{R}^{h \times k}, \quad b \in \mathbb{R}. \qquad (4)$$

Since the size of the CNN filter is $h \times k$ and zero-padding was performed in the previous step, $f_i$ becomes an $I$-dimensional vector, where $I = d + h - 1$. Let $\hat{f}_i$ be the scalar value computed by applying average pooling to the feature map $f_i$. The final feature vector $z$ passed to the fully connected layer is

$$z = [\hat{f}_1, \hat{f}_2, \ldots, \hat{f}_n]^T, \qquad (5)$$

where $n = n_{ftypes} \times n_{filters}$, the number of filter types multiplied by the number of filters per type; thus $z$ is an $n$-dimensional vector.
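To make the shapes concrete, here is a minimal PyTorch sketch of the forward pass of Eqs. (1)-(8), assuming batched inputs of pre-embedded sentences; the module name and the batch interface are our own assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CAM2Net(nn.Module):
    """Sketch of the classification network of Eqs. (1)-(8): symmetric
    zero-padding, ReLU convolution, average pooling, and a fully
    connected output layer."""
    def __init__(self, k=100, n_classes=2, window_sizes=(3, 4, 5), n_filters=128):
        super().__init__()
        self.window_sizes = window_sizes
        self.convs = nn.ModuleList([
            nn.Conv2d(1, n_filters, kernel_size=(h, k)) for h in window_sizes
        ])
        n = n_filters * len(window_sizes)      # n = n_ftypes * n_filters
        self.fc = nn.Linear(n, n_classes)

    def forward(self, sent):                   # sent: (batch, d, k) word vectors
        feats = []
        for h, conv in zip(self.window_sizes, self.convs):
            # Eq. (1): pad h-1 zero word-vectors before and after the
            # sentence, so every word appears in exactly h receptive fields.
            x = F.pad(sent, (0, 0, h - 1, h - 1)).unsqueeze(1)
            # Eqs. (2)-(4): f_i has length I = d + h - 1 after convolution.
            f = F.relu(conv(x)).squeeze(3)     # (batch, n_filters, I)
            # Eq. (5): average pooling over positions, one scalar per map.
            feats.append(f.mean(dim=2))        # (batch, n_filters)
        z = torch.cat(feats, dim=1)            # (batch, n)
        return self.fc(z)                      # Eq. (6): class scores y
```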

Figure 5. An example of computing a score vector.

The output of the fully connected layer for the $i$-th sentence, $y$, is computed as

$$y = W_{fc}\, z + b_{fc}, \qquad (6)$$
$$W_{fc} \in \mathbb{R}^{c \times n}, \qquad (7)$$
$$b_{fc} \in \mathbb{R}^{c}, \qquad (8)$$

where $c$ is the number of classes.

Once the CNN model is trained, the sentiment importance score of each word is computed as follows; an illustrated example of this process is provided in Figure 5. Let $F_l$ be the feature maps corresponding to the $l$-th filter type and $w_{lc_i}$ be the row vector of $W_{fc}$ for the $l$-th filter type and the $c_i$-th class. Then, the score vector $v$ is computed as

$$v = F_l\, w_{lc_i}^T, \qquad (9)$$
$$F_l \in \mathbb{R}^{I \times n_{filters}}, \qquad (10)$$
$$w_{lc_i}^T \in \mathbb{R}^{n_{filters}}. \qquad (11)$$

The $p$-th element of the score vector $s_{lc_i}$ for the $l$-th filter type and the $c_i$-th class is computed by averaging $h$ elements of $v$ with a step size of 1, which makes $s_{lc_i}$ a $d$-dimensional vector regardless of the filter height:

$$s_{lc_i,p} = \frac{1}{h} \sum_{q=p}^{p+h-1} v_q. \qquad (12)$$

The final sentiment score of the words in the sentence for the $c_i$-th class, $\mathrm{CAM}^2_{c_i}$, is computed as

$$\mathrm{CAM}^2_{c_i} = \sum_{l=1}^{n_{ftypes}} s_{lc_i}. \qquad (13)$$

3.4. Word Embedding

We employed four different word embedding methods to construct the input matrix $X$: random vectors, Word2Vec, GloVe, and FastText. With random vectors, the elements of the word vectors are randomly initialized and updated during CNN training. For the latter three methods, word embedding vectors were separately trained on the same corpus used for sentiment classification. We also compared static and non-static word embeddings for CAM², according to whether the word embedding vectors are updated during CNN training (non-static) or not (static). In addition, two multi-channel input matrices were considered. In summary, we tested the following five input matrices for CAM².

(1) CNN-Rand: word vectors are randomly initialized and updated during CNN training.
(2) CNN-Static: word vectors are trained by Word2Vec and not updated during CNN training.
(3) CNN-Non-Static: word vectors are first trained by Word2Vec and then updated during CNN training.
(4) CNN-2ch: CNN-Static and CNN-Non-Static are combined. The input of the CNN becomes a three-dimensional (I × k × 2) tensor.
(5) CNN-4ch: three matrices with word vectors trained by Word2Vec, GloVe, and FastText are used, and they are updated during CNN training. The CNN-Non-Static matrix is used as the fourth channel. The input of the CNN becomes a three-dimensional (I × k × 4) tensor.
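Given a trained model, the word-score computation of Eqs. (9)-(13) only needs the pre-pooling feature maps and the fully connected weights. Below is a NumPy sketch, under the assumption that the fully connected weight matrix stores one consecutive block of n_filters columns per filter type; variable names are ours.

```python
import numpy as np

def word_scores(F_maps, W_fc, class_idx, h_list, d):
    """Word-level sentiment scores of Eqs. (9)-(13) for one sentence.

    F_maps:  list with one (I_l, n_filters) array per filter type l,
             where I_l = d + h_l - 1 (pre-pooling feature maps).
    W_fc:    (n_classes, n) fully connected weights, laid out as
             consecutive n_filters-sized blocks, one per filter type.
    Returns a length-d array: the CAM^2 score of every word for class_idx.
    """
    n_filters = F_maps[0].shape[1]
    cam = np.zeros(d)
    for l, (F_l, h) in enumerate(zip(F_maps, h_list)):
        # Eq. (9): weight the feature maps of filter type l by the fc row.
        w = W_fc[class_idx, l * n_filters:(l + 1) * n_filters]
        v = F_l @ w                                 # (I_l,) score vector
        # Eq. (12): average h consecutive positions -> one score per word.
        s = np.array([v[p:p + h].mean() for p in range(d)])
        cam += s                                    # Eq. (13): sum over types
    return cam
```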

4. Experimental Settings

4.1. Data Sets & Target Labeling

To verify the proposed CAM², we used two sets of movie reviews, one written in English and the other written in Korean. Not only do movie reviews have explicit sentiment labels (ratings or stars), but they also generally contain more subjective expressions than formal texts such as news articles. For the English movie reviews, we used the publicly available IMDB dataset (Maas et al., 2011), while the Korean movie reviews were collected directly from the WATCHA website (https://watcha.net), the largest movie recommendation service in Korea. Each dataset consists of review sentences and ratings. The distributions of ratings for IMDB and WATCHA are shown in Tables 1 and 2.

Table 1. Rating distribution of the IMDB dataset.
Score   | 1      | 2     | 3     | 4     | 7     | 8     | 9     | 10
Reviews | 10,122 | 4,586 | 4,961 | 5,531 | 4,803 | 5,859 | 4,607 | 9,731
Class   | Negative (1-4)                 | Positive (7-10)

Table 2. Rating distribution of the WATCHA dataset.
Score   | 0.5    | 1      | 1.5    | 2       | 2.5     | 3       | 3.5     | 4       | 4.5     | 5
Reviews | 50,660 | 66,184 | 62,094 | 163,272 | 173,650 | 411,757 | 424,378 | 652,250 | 297,327 | 416,096
Class   | Negative (0.5-2)                  | Not used (2.5-4.5)                      | Positive (5)

As shown in Table 1, the ratings are well balanced in the IMDB dataset. Hence, we used the reviews with ratings smaller than or equal to 4 as negative examples, whereas the reviews with ratings greater than or equal to 7 were used as positive examples. Unlike the IMDB dataset, the ratings of the WATCHA dataset are highly skewed toward positive scores. Therefore, we used the reviews with ratings smaller than or equal to 2 as negative examples, whereas only the reviews with 5-point ratings were used as positive examples. In both datasets, 70% of the reviews were used as training data and the remaining 30% as test data.

4.2. Word Embedding, CNN Parameters, and Performance Measure

Each sentence was split into tokens on spaces. Punctuation marks and numbers were removed. All tokens were used to learn the word embedding vectors. We fixed the dimension of the word embeddings at 100 and set the window size of Word2Vec and FastText to 3.
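A minimal sketch of this labeling, tokenization, and embedding setup using the gensim library, matching the settings above and the skip-gram choice noted below; the toy review list and the exact cleaning regex are placeholder assumptions.

```python
import re
from gensim.models import Word2Vec, FastText

def tokenize(review):
    # Space tokenization after removing punctuation and numbers (Section 4.2).
    return re.sub(r"[^\w\s]|\d", " ", review).lower().split()

def polarity_label(rating, dataset):
    # Rating-to-class rule of Section 4.1; None = review excluded.
    if dataset == "IMDB":
        return 0 if rating <= 4 else (1 if rating >= 7 else None)
    if dataset == "WATCHA":
        return 0 if rating <= 2 else (1 if rating == 5 else None)

# Toy corpus standing in for the review texts.
reviews = ["This film is actually quite entertaining!",
           "One of the most boring films I have ever seen."]
sentences = [tokenize(r) for r in reviews]

# Skip-gram (sg=1) embeddings of dimension 100 with window size 3,
# matching the settings reported in Section 4.2.
w2v = Word2Vec(sentences, vector_size=100, window=3, sg=1, min_count=1)
ft = FastText(sentences, vector_size=100, window=3, sg=1, min_count=1)
```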

For Word2Vec and FastText we used the skip-gram architecture, while unigrams were used to create the co-occurrence matrix for GloVe. The total number of tokens in each dataset is shown in Table 3.

Table 3. The number of tokens.
IMDB    | WATCHA
115,205 | 424,027

The hyper-parameters used to train the CNN are summarized in Table 4. We used three different window sizes (i.e., how many words are considered in one receptive field), while the number of filters was fixed at 128 per window size. The document length, i.e., the maximum number of words, was set to 100: for sentences shorter than 100 words, zero-paddings were added after the last word, whereas longer sentences were trimmed to their first 100 words. We also used two regularization methods: dropout, an implicit regularization that randomly ignores some weights in each step (dropout rate = 0.5 in this study), and L2 regularization, an explicit regularization that adds the L2-norm of the weights to the loss function.

Table 4. The hyper-parameters of the CNN.
Filter type (window size) | 3 (tri-gram), 4 (quad-gram), 5 (5-gram)
N. filters                | 128 each
Doc. length               | 100 words
Dropout rate              | 0.5
L2 regularization (λ)     | 0.1
Batch size                | 64

5. Results

5.1. Classification Performance

Table 5. Test accuracy by methodology.
Model          | IMDB   | WATCHA
CNN-Rand       | 0.8435 | 0.7793
CNN-Static     | 0.7750 | 0.7150
CNN-Non-Static | 0.8257 | 0.7538
CNN-2channel   | 0.8300 | 0.7602
CNN-4channel   | 0.8729 | 0.7533

Table 5 shows the classification accuracies of the five CNN models. It is worth noting that the CNN-Static resulted in the lowest classification accuracy on both the IMDB and WATCHA datasets. Since the CNN-Static is the only model that does not update the word embedding vectors during CNN training, updating the word embedding vectors on the given corpus during model training, whether or not the word vectors were independently pre-trained, is encouraged to achieve better classification performance.

Table 6. CAM² example.
Word         | Score
this         | 0.0145
film         | 0.0291
is           | 0.1324
actually     | 0.2183
quite        | 0.2561
entertaining | 0.3496

Table 6 shows an example of CAM² scores for a test sentence. The overall sentiment of this sentence was classified as positive. For each word, the higher the score, the more significantly the CNN model considers it to contribute to the overall sentiment. Thus, the word "entertaining" had the greatest impact on classifying this review as positive.

Table 7. Frequently appearing words in the positive and negative sentences of the IMDB test dataset (semantically positive and negative words were colored blue and red, respectively, in the original).

Positive sentences (CNN-Rand | CNN-Static | CNN-Non-Static | CNN-2channel | CNN-4channel):
the | and | and | and | and
and | great | is | is | is
a | is | the | the | a
of | a | a | a | the
is | very | of | of | of
to | the | s | s | s
I | well | excellent | excellent | it
in | film | it | great | excellent
this | of | great | it | to
it | it | to | in | I
that | I | in | to | great
was | as | an | I | it
as | excellent | I | an | in
movie | wonderful | was | perfect | perfect
with | movie | perfect | with | very
for | story | as | as | was
film | in | best | very | fun
but | favorite | very | best | by
on | beautiful | enjoyed | enjoyed | enjoyed
an | my | with | wonderful | an
have | good | wonderful | fun | as
are | comedy | fun | by | with
one | loved | by | movie | best
his | also | amazing | amazing | wonderful
you | most | movie | loved | loved
not | s | most | that | amazing
be | best | loved | superb | most
who | enjoyed | superb | film | are
by | love | that | most | for

Negative sentences (CNN-Rand | CNN-Static | CNN-Non-Static | CNN-2channel | CNN-4channel):
the | the | the | the | the
a | is | and | and | and
and | was | worst | worst | of
of | and | of | of | a
to | bad | a | a | worst
is | a | is | is | is
I | this | the | was | was
in | of | awful | awful | I
this | plot | boring | to | to
that | just | was | I | awful
was | acting | to | boring | this
it | movie | I | movie | movie
movie | I | bad | this | boring
for | awful | poor | poor | bad
with | script | this | bad | s
as | boring | movie | waste | in
have | to | waste | terrible | waste
on | that | terrible | in | poor
film | so | in | s | terrible
but | t | s | with | for
not | terrible | horrible | are | with
be | it | are | as | as
are | stupid | with | by | are
you | horrible | for | acting | it
an | in | by | for | that
at | are | film | horrible | film
his | film | as | film | by
one | no | it | that | horrible
from | worst | poorly | it | so

Table 8. Frequently appearing words in the positive and negative sentences of the WATCHA test dataset (semantically positive and negative words were colored blue and red, respectively, in the original; two cells could not be reliably recovered from the source layout).

Positive sentences (CNN-Rand | CNN-Static | CNN-Non-Static | CNN-2channel | CNN-4channel):
영화 | 영화 | 영화 | 영화 | 영화
너무 | 이 | 수 | 최고의 (best) | 최고의 (best)
너무 | 너무 | 너무 | 너무 | 최고의 (best)
더 | 왜 | 다시 | 그리고 | 그
그냥 | 그냥 | 없고 | 없는 | 영화를
그냥 | 더 | 없는 | 그 | 그
가장 | 정말 | 더 | 그냥 | 좀
왜 | 이 | 없는 | 이 | 그
없고 | 또 | 더 | 더 | 가장
진짜 | 최고 (best) | 다 | 너무 | 있는
없는 | 없다 | 이런 | 가장 | 진짜
영화를 | 다 | 그 | 이런 | 수
내 | 잘 | 다시 | 최고 (best) | 뻔한 (obvious)
그 | 있는 | 것 | 잘 | 진짜
영화가 | 좋다 (good) | 이 | 것 | 것
영화는 | 아름다운 (beautiful) | 안 (not) | 없다 | 모든
진짜 | 함께 | 영화가 | 보고 | 좋다 (good)
본 | 더 | 보고 | 있는 | 봐도
좀 | 이 | 내 | 내가 | 내
정말 | 최고 (best) | 다시 | 마지막 | 있는
작품 | 내가 | 내 | 본 | 보는
모든 | 봐도 | 모든 | 모든 | 본
내가 | 본 | 한 | 이렇게 | 본
중 | 좋다 (good) | 좋다 (good) | 완벽한 (perfect) | 완벽한 (perfect)
한 | 잘 | 없고 | 봐도 | 하는
잘 | 건 | 아닌 (not) | 이건 | 있을까
안 | 좋은 (good) | 보고 | 정말 | 별로 (not much of)
느낌 | 뻔한 (obvious) | 꼭 | 대한 | 다
모두 | 완벽한 (perfect) | 본 | 한 | 이건

Negative sentences (CNN-Rand | CNN-Static | CNN-Non-Static | CNN-2channel | CNN-4channel):
영화 | 영화 | 영화 | 영화 | 영화
너무 | 최고의 (best) | 너무 | 너무 | 너무
수 | 정말 | 수 | 그리고 | 수
그 | 왜 | 이 | 왜 | 이
잘 | 정말 | 수 | 그리고 | 수
가장 | 정말 | 없다 | 그냥 | 그리고
없다 | 더 | 그냥 | 왜 | 왜
그 | 없는 | 없고 | 없는 | 이
영화를 | 영화는 | 영화는 | 더 | 그나마
이 | 이 | 그 | 최고 (best) | 잘
진짜 | 다시 | 다 | 느낌 | 다
정말 | 더 | 없고 | 없다 | 것
것 | 좀 | 영화를 | 영화를 |
내가 | 영화를 | 다 | 그나마 | 좀
이런 | 보는 | 한 | 좀 | 영화는
영화가 | 보고 | 영화가 | 정말 | 영화는
없다 | 정말 | 내 | 내 | 이런
영화는 | 그 | 그나마 | 수 | 수
이렇게 | 무슨 | 보는 | 이런 | 별
좀 | 건 | 이런 | 대한 | 내
것도 | 내 | 보는 | 영화가 | 잘
스토리 | 건 | 차라리 (rather) | 내 | 내가
내내 | 한 | 봐도 | 또 | 보는
이렇게 | 내가 | 진짜 | 많이 | 대한
한 | 차라리 (rather) | 차라리 (rather) | 별 | 것
마지막 | 영화를 | 다 | 내 | 듯
차라리 (rather) | 안 | 아깝다 (wasted) | 별 | 잘
또 | 스토리 | 뭘 | 내가 | 봤는데
못한 (not) | 영화가 | 최악의 (worst) | 대한 | 최악의 (worst) 것

Table 9. Example of word attention for a positively classified sentence in the IMDB dataset (in the original, each model's top 10% scored words are highlighted in the review text).
Raw text (10/10 points): "I'm normally not a Drama/Feel good movie kind of guy but once I saw the trailer for Radio I couldn't resist. Not only is this a great film but it also has great acting. Cuba Gooding Jr. did an excellent job portraying James Robert Kennedy a.k.a. Radio. Ed Harris also did a fantastic job as Coach Jones. I was pleasantly surprised to see some comedy in it as well. So for a great story great acting and a little comedy I give Radio a 10 out of 10!"
Prediction: Positive for all five models (CNN-Rand, CNN-Static, CNN-Non-Static, CNN-2channel, CNN-4channel).
5.2. Finding Sentimental Words

Table 7 lists the frequent words in the IMDB test dataset, obtained by selecting the top five highest-scored words in the sentences classified as positive (left five columns) and negative (right five columns). It is worth noting that although the CNN-Rand yielded relatively good classification performance compared with the other techniques, it identified the fewest emotional words among the five CNN models. Conversely, although the classification performance of the CNN-Static was the worst, its attention mechanism seemed to work well, in that many emotional words were ranked highly. For classification performance, it is important whether or not the input vectors are updated during training. For word attention in sentiment classification, however, it becomes more important whether the general grammatical relationships between words are well preserved in the word embedding vectors (i.e., not updated for the classification task).
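A small sketch of how such word lists can be read off the CAM² scores: selecting the top five words per sentence (as for Tables 7 and 8) or the top 10% (the highlighting rule used in Section 5.3). The function and variable names are ours, and the example scores are taken from Table 6.

```python
import numpy as np

def top_scored_words(words, scores, k=None, fraction=0.10):
    """Select the highest-CAM^2-scored words of one sentence:
    k=5 mimics the per-sentence selection behind Tables 7-8,
    fraction=0.10 the highlighting rule of Tables 9-14."""
    if k is None:
        k = max(1, int(round(len(words) * fraction)))
    idx = np.argsort(scores)[::-1][:k]
    return [words[i] for i in sorted(idx)]      # keep sentence order

# Scores taken from Table 6:
words = ["this", "film", "is", "actually", "quite", "entertaining"]
scores = np.array([0.0145, 0.0291, 0.1324, 0.2183, 0.2561, 0.3496])
print(top_scored_words(words, scores))          # -> ['entertaining']
print(top_scored_words(words, scores, k=5))     # top five words
```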

Table 8 lists the frequent words in the WATCHA test dataset, obtained in the same way. In this case, the emotional words in the upper part of the lists overlap across the methods more than in the IMDB dataset. This is because Korean is an agglutinative language, which tends to have a high rate of affixes per word. For example, 없다, 없는, 없고, 안, 아닌, and 못 (all meaning "not"), as well as 차라리 (rather), are usually used in Korean for negative expressions. The experimental results confirm that these words are used more frequently in the negative reviews than in the positive ones (except for the CNN-Rand).

Table 10. Example of word attention for a negatively classified sentence in the IMDB dataset (in the original, each model's top 10% scored words are highlighted in the review text).
Raw text (3/10 points): "This is one of the most boring films I've ever seen. The three main cast members just didn't seem to click well. Giovanni Ribisi's character was quite annoying. For some reason he seems to like repeating what he says. If he was the Rain Man it would've been fine but he's not."
Prediction: Negative for all five models (CNN-Rand, CNN-Static, CNN-Non-Static, CNN-2channel, CNN-4channel).

5.3. Word Attention: IMDB

Table 9 shows an example of word attention for a positively classified sentence in the IMDB dataset; the highlighted words are the top 10% highest-scored words in the sentence. The four models other than the CNN-Rand successfully capture semantically positive words or phrases (e.g., "excellent", "fantastic", and "was pleasantly surprised"). In particular, the CNN-Static is especially good at attending to longer sentimental phrases such as "a great story great acting". Table 10 shows an example of word attention for a negatively classified sentence in the IMDB dataset; here the highlighted words are likewise the top 10% highest-scored words.
If one reads the review, one can easily recognize multiple negative expressions within it, which results in different attention words or phrases across the models. For example, the CNN-Non-Static, CNN-2channel, and CNN-4channel attend to "boring" and "annoying", both of which are clearly negative expressions when used in a movie review. There is, however, another explicit negative expression, "it would (have) been fine", which receives attention from the CNN-Rand.

Table 11 shows attention results for a sentence whose predicted class differs across the CNN models because of mixed emotional expressions within the sentence. In this case, the words with the top 10% highest scores are highlighted in blue and those with the bottom 10% lowest scores in red if the sentence is classified as positive; the highlighting scheme is reversed if the sentence is classified as negative. As before, the CNN-Static, CNN-Non-Static, CNN-2channel, and CNN-4channel show relatively better attention performance than the CNN-Rand. Again, the CNN-Static is relatively good at capturing longer emotional phrases such as "is also very interesting and touching".

Table 11. Example of word attention for a sentence in the IMDB dataset whose predicted class differs across the CNN models (in the original, each model's top and bottom 10% scored words are highlighted in the review text).
Raw text (9/10 points): "This movie has a lot to recommend it. The paintings the music and David Hewlett's naked butt are all gorgeous! The plot a story of redemption forgiveness and courage in the face of adversity is also very interesting and touching and it's not predictable which is saying quite a lot about a movie in this day and age. But the acting is mediocre the direction is confusing and the script is just odd. It often felt like it was trying to be a parody but I never figured out what it was trying to be parody *of*."
Predictions: CNN-Rand: Negative; CNN-Static: Negative; CNN-Non-Static: Positive; CNN-2channel: Positive; CNN-4channel: Positive.

5.4. Word Attention: WATCHA

Table 12 shows an example of word attention for a positively classified sentence in the WATCHA dataset; the highlighted words are the top 10% highest-scored words in the sentence.
In this sentence, there are two obvious positive expressions, 감탄스럽다 (amazing) and 존경스럽다 (admirable); the former was successfully detected by the CNN-Static, CNN-Non-Static, CNN-2channel, and CNN-4channel, while the latter was detected by the CNN-Rand.

Table 13 shows an example of word attention for a negatively classified sentence in the WATCHA dataset, again with the top 10% highest-scored words highlighted. This sentence has two semantically explicit negative expressions: 불필요하고 의미없는 가오 (unnecessary and meaningless flaunt) and 한마디로 총체적 난국 (in a word, a total mess). The CNN-Rand focused on the former expression, whereas the other four models focused on the latter. Similar to the positive example in Table 12, the attention mechanism of the CNN-Rand appears somewhat different from those of the other models. This is mainly because its word embedding vectors are not updated to reflect the users' rating information. Hence, more general emotional expressions, rather than movie-review-specific ones, receive higher attention from the CNN-Rand.

Table 12. Example of word attention for a positively classified sentence in the WATCHA dataset.
Raw text (5/5 points): 살라딘의 기사도 정신이 진짜 감탄스럽다. 예수상을 다시 세우고 십자가 바닥을 안 밟고 지나가는 장면이 존경스럽다. (Saladin's chivalrous spirit is truly amazing. The scene of re-erecting the statue of Jesus and passing by without stepping on the cross on the floor is admirable.)
CNN-Rand: highlights 밟고, 존경스럽다; prediction Positive
CNN-Static: highlights 감탄스럽다, 예수상을; prediction Positive
CNN-Non-Static: highlights 감탄스럽다, 예수상을; prediction Positive
CNN-2channel: highlights 감탄스럽다, 예수상을; prediction Positive
CNN-4channel: highlights 감탄스럽다, 예수상을; prediction Positive

Table 13. Example of word attention for a negatively classified sentence in the WATCHA dataset.
Raw text (2/5 points): 영화 전체를 통틀어 가장 불필요하고 의미없는 가오를 잡는 여자가 환호를 받고 있는 아이러니한 영화! 사운드트랙은 인정하더라도 관객을 지나가는 메트로폴리스 행인만도 못하게 다루는 스토리텔링 한마디로 총체적 난국. (An ironic movie in which the woman striking the most unnecessary and meaningless pose in the whole film is being cheered! Even granting the soundtrack, the storytelling treats the audience worse than a passing Metropolis pedestrian; in a word, a total mess.)
CNN-Rand: highlights 불필요하고 의미없는 가오를; prediction Negative
CNN-Static: highlights 한마디로 총체적 난국; prediction Negative
CNN-Non-Static: highlights 한마디로 총체적 난국; prediction Negative
CNN-2channel: highlights 한마디로 총체적 난국; prediction Negative
CNN-4channel: highlights 한마디로 총체적 난국; prediction Negative

Table 14. Example of word attention for a sentence in the WATCHA dataset whose predicted class differs across the CNN models.
Raw text (1/5 points): 이렇게 재미없고 그래픽도 꾸지고 난장판인 엑스맨을 과거의 이야기로 새로 시작한 메튜 본 감독과 깔끔하게 다시 재정리한 브라이언 싱어 감독에게 박수를... (Applause to director Matthew Vaughn, who restarted this boring, shoddy-looking, messy X-Men as a story of the past, and to director Bryan Singer, who neatly reorganized it again.)
CNN-Rand: highlights 이야기로, 시작한, 싱어, 박수를; prediction Negative
CNN-Static: highlights 재미없고 그래픽도, 감독에게 박수를; prediction Positive
CNN-Non-Static: highlights 재미없고 그래픽도, 깔끔하게 다시; prediction Negative
CNN-2channel: highlights 재미없고 그래픽도 꾸지고, 깔끔하게 다시; prediction Negative
CNN-4channel: highlights 재미없고 그래픽도 꾸지고, 깔끔하게 다시; prediction Negative

Table 14 shows an example in the same manner as Table 11. The three models other than the CNN-Rand and CNN-Static focus on both the negative phrase 재미없고 (boring) and the positive phrase 깔끔하게 (neatly). Qualitatively, the former is a stronger emotional expression than the latter, which results in the entire sentence being predicted as negative. The CNN-Static, however, finds a stronger positive expression, 박수를 (applause), rather than 깔끔하게 (neatly), which leads it to predict the whole sentence as positive.

6. Conclusion

In this paper, we proposed CAM², a classification and attention model with a class activation map, i.e., a sentiment classification model with word attention based on weakly supervised CNN learning. Although the proposed model is trained with class labels only, it can not only predict the overall sentiment of a given sentence but also find the emotional words that contribute significantly to the predicted class. Compared with previous CNN-based text classification models, CAM² utilizes zero-padding to help the CNN consider every word equally regardless of its position in the sentence. Moreover, it uses average pooling and a large number of filters to preserve as much information as possible, and it employs and integrates various word embedding techniques. Experimental results on two movie review datasets, IMDB (English) and WATCHA (Korean), show that CAM² yielded classification accuracies above 87% on the IMDB dataset and about 78% on the WATCHA dataset. The CNN models that update the word embedding vectors during sentiment classification training (CNN-Rand, CNN-Non-Static, CNN-2channel, and CNN-4channel) achieved higher classification performance than the one that does not (CNN-Static). It is also worth noting that integrating multiple word embedding techniques improved the classification performance on the IMDB dataset. Nevertheless, all models showed the ability to find important emotional words in a sentence, although their internal mechanisms may differ.
For the WATCHA dataset in particular, the CNN-Static, which does not update the word embedding vectors during training, focused more on generally accepted emotional expressions, whereas the other models, which adapt to the language usage patterns of the movie review domain, seemed to focus more on domain-dependent emotional expressions. We expect the proposed methodology to be useful in domains where it is important to understand what input sentences are intended to convey, such as visual question answering systems or chatbots. Although the experimental results were favorable, the current study has some limitations, which lead us to the following future research directions.

First, the proposed method used simple space-based tokenization for training the word embedding vectors. If more sophisticated preprocessing techniques, such as lemmatization, were applied, the classification and attention performance could be improved. Second, a quantitative evaluation of word attention, i.e., of how appropriate the identified words are in the context of sentiment classification, is difficult, which is why we interpreted the word attention results qualitatively in Section 5. Developing a systematic and quantitative evaluation method for word attention is another meaningful topic for future research.

References

Ba, Jimmy, Mnih, Volodymyr, and Kavukcuoglu, Koray. Multiple object recognition with visual attention. arXiv preprint arXiv:1412.7755, 2014.

Bahdanau, Dzmitry, Cho, Kyunghyun, and Bengio, Yoshua. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.

Bojanowski, Piotr, Grave, Edouard, Joulin, Armand, and Mikolov, Tomas. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606, 2016.

Cho, Heeryon, Kim, Songkuk, Lee, Jongseo, and Lee, Jong-Seok. Data-driven integration of multiple sentiment dictionaries for lexicon-based sentiment classification of product reviews. Knowledge-Based Systems, 71:61-71, 2014.

Gui, Lin, Zhou, Yu, Xu, Ruifeng, He, Yulan, and Lu, Qin. Learning representations from heterogeneous network for sentiment classification of product reviews. Knowledge-Based Systems, 124:34-45, 2017.

Jo, Eun Kyoung. The current state of affairs of the sentiment analysis and case study based on corpus. The Journal of Linguistics Science, 61:259-282, 2012. URL http://www.dbpia.co.kr/Article/NODE06607901.

Kalchbrenner, Nal, Grefenstette, Edward, and Blunsom, Phil. A convolutional neural network for modelling sentences. arXiv preprint arXiv:1404.2188, 2014.

Kim, Yoon. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882, 2014.

Liu, Bing. Sentiment analysis and opinion mining. Synthesis Lectures on Human Language Technologies, 5(1):1-167, 2012.

Luong, Minh-Thang, Pham, Hieu, and Manning, Christopher D. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025, 2015.

Maas, Andrew L., Daly, Raymond E., Pham, Peter T., Huang, Dan, Ng, Andrew Y., and Potts, Christopher. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pp. 142-150. Association for Computational Linguistics, 2011.

Medhat, Walaa, Hassan, Ahmed, and Korashy, Hoda. Sentiment analysis algorithms and applications: A survey. Ain Shams Engineering Journal, 5(4):1093-1113, 2014.

Mikolov, Tomas, Chen, Kai, Corrado, Greg, and Dean, Jeffrey. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.

Mnih, Volodymyr, Heess, Nicolas, Graves, Alex, et al. Recurrent models of visual attention. In Advances in Neural Information Processing Systems, pp. 2204-2212, 2014.

Nasukawa, Tetsuya and Yi, Jeonghee. Sentiment analysis: Capturing favorability using natural language processing. In Proceedings of the 2nd International Conference on Knowledge Capture, pp. 70-77. ACM, 2003.

Oquab, Maxime, Bottou, Léon, Laptev, Ivan, and Sivic, Josef. Is object localization for free? Weakly-supervised learning with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 685-694, 2015.

Pang, Bo, Lee, Lillian, et al. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1-2):1-135, 2008.

Pennington, Jeffrey, Socher, Richard, and Manning, Christopher D. GloVe: Global vectors for word representation. In EMNLP, volume 14, pp. 1532-1543, 2014.

Poria, Soujanya, Cambria, Erik, and Gelbukh, Alexander. Aspect extraction for opinion mining with a deep convolutional neural network. Knowledge-Based Systems, 108:42-49, 2016.
685 694, 2015. Pang, Bo, Lee, Lillian, et al. Opinion mining and sentiment analysis. Foundations and Trends R in Information Retrieval, 2(1 2):1 135, 2008. Pennington, Jeffrey, Socher, Richard, and Manning, Christopher D. Glove: Global vectors for word representation. In EMNLP, volume 14, pp. 1532 1543, 2014. Poria, Soujanya, Cambria, Erik, and Gelbukh, Alexander. Aspect extraction for opinion mining with a deep convolutional neural network. Knowledge- Based Systems, 108:42 49, 2016.