Introduction to Topic Models


Analyzing Big Scientific Publication Collections by Topic Modeling Techniques
Min Song, Ph.D.
Department of Library and Information Science, Yonsei University

The Problem with Information
- Needle in a haystack: as more information becomes available, it is harder and harder to find what we are looking for.
- We need new tools to help us organize, search, and understand information.
:: topic modeling ::

A Solution?
- Use topic models to discover hidden topic-based patterns.
- Use the discovered topics to annotate the collection.
- Use the annotations to organize, understand, summarize, search...

Topic (Concept) Models
- Topic models: LSA, PLSA, LDA
- They share 3 fundamental assumptions:
  - Documents have latent semantic structure ("topics").
  - Topics can be inferred from word-document co-occurrences.
  - Words are related to topics, and topics to documents.
- They use different mathematical frameworks: linear algebra vs. probabilistic modeling.

Topic models (from David Blei, KDD-11 tutorial)
- Observation: a collection of texts
- Assumption: the texts have been generated according to some model
- Output: the model that has generated the texts

Topic models
- Discover hidden topical patterns that pervade the collection through statistical regularities.
- Annotate documents with these topics.
- Use the topic annotations to organize, summarize, search texts...

Topic examples (Steyvers & Griffiths, 2006)

LSA and topic models (Steyvers & Griffiths, 2006)

Topic models intuition
- Find the latent structure of "topics" or "concepts" in a text corpus, which is obscured by "word choice" noise.
- Deerwester et al. (1990), LSA: the co-occurrence of terms in text documents can be used to recover this latent structure, without additional knowledge.
- Latent topic representations of text allow modelling linguistic phenomena like synonymy and polysemy.

Topic models
- Each document is a mixture of topics.
- Each word is drawn from one of its document's topics.

Topic models
- The observations are the documents.
- We need to infer the model, i.e. the underlying topic structure: the topic assignments z_{m,n} and the topic and word distributions.
- Priors: a Dirichlet distribution with hyperparameter α and a Dirichlet distribution with hyperparameter β.
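The generative story on the last two slides can be sketched in a few lines of Java. This is only an illustration, not Mallet code: the class name GenerativeSketch, the toy vocabulary, and the values of theta (one document's topic mixture) and phi (per-topic word distributions) are all hypothetical, and they are fixed by hand here rather than drawn from Dirichlet priors.

```java
import java.util.Random;

/** Sketch of the topic-model generative story for one document (toy numbers). */
final class GenerativeSketch {

    /** Draws an index from a discrete distribution whose entries sum to one. */
    static int sampleDiscrete(double[] probs, Random rng) {
        double u = rng.nextDouble();
        double cumulative = 0.0;
        for (int i = 0; i < probs.length; i++) {
            cumulative += probs[i];
            if (u < cumulative) {
                return i;
            }
        }
        return probs.length - 1; // guard against floating-point round-off
    }

    /**
     * Generates word ids for one document: for each position n, draw a topic
     * assignment z from the document's topic mixture theta, then draw the word
     * from that topic's word distribution phi[z].
     */
    static int[] generateDocument(double[] theta, double[][] phi, int length, Random rng) {
        int[] words = new int[length];
        for (int n = 0; n < length; n++) {
            int z = sampleDiscrete(theta, rng);
            words[n] = sampleDiscrete(phi[z], rng);
        }
        return words;
    }

    public static void main(String[] args) {
        String[] vocabulary = {"gene", "dna", "vote", "election"}; // hypothetical vocabulary
        double[] theta = {0.8, 0.2};                               // hypothetical topic mixture
        double[][] phi = {
            {0.5, 0.5, 0.0, 0.0},  // topic 0 only emits "gene" and "dna"
            {0.0, 0.0, 0.5, 0.5}   // topic 1 only emits "vote" and "election"
        };
        int[] doc = generateDocument(theta, phi, 10, new Random(42));
        StringBuilder sb = new StringBuilder();
        for (int w : doc) {
            sb.append(vocabulary[w]).append(' ');
        }
        System.out.println(sb.toString().trim());
    }
}
```

Inference runs this story backwards: only the words are observed, and the assignments and distributions have to be recovered from them.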

Graphical Model
[Figure: plate diagram with nodes labeled Dirichlet parameters, document-specific topic distribution, topic assignment, observed word, topics, Dirichlet parameters]

Dirichlet Distribution
- Distribution over K-dimensional positive vectors that sum to one (i.e., points on the probability simplex)
- Two parameters:
  - Base measure, e.g., m (vector)
  - Concentration parameter, e.g., α (scalar)
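To make the two-parameter description concrete, here is a minimal sketch that draws one point on the simplex. It assumes the common parameterization in which component k has Dirichlet parameter α · m_k, and it uses the standard construction of normalizing independent Gamma draws, with a Marsaglia-Tsang Gamma sampler written out because the Java standard library has none. The class name DirichletSketch and the numbers in main are hypothetical.

```java
import java.util.Random;

/** Sketch: sampling a Dirichlet vector from a base measure and a concentration. */
final class DirichletSketch {

    /** Draws from Gamma(shape, 1) using the Marsaglia-Tsang method. */
    static double sampleGamma(double shape, Random rng) {
        if (shape < 1.0) {
            // For shape < 1, sample Gamma(shape + 1) and scale by U^(1/shape).
            double u = rng.nextDouble();
            return sampleGamma(shape + 1.0, rng) * Math.pow(u, 1.0 / shape);
        }
        double d = shape - 1.0 / 3.0;
        double c = 1.0 / Math.sqrt(9.0 * d);
        while (true) {
            double x = rng.nextGaussian();
            double v = 1.0 + c * x;
            if (v <= 0.0) {
                continue; // proposal outside the support, try again
            }
            v = v * v * v;
            double u = rng.nextDouble();
            if (Math.log(u) < 0.5 * x * x + d - d * v + d * Math.log(v)) {
                return d * v;
            }
        }
    }

    /** Draws one vector with parameters concentration * baseMeasure[k] (assumed form). */
    static double[] sampleDirichlet(double[] baseMeasure, double concentration, Random rng) {
        double[] sample = new double[baseMeasure.length];
        double sum = 0.0;
        for (int k = 0; k < sample.length; k++) {
            sample[k] = sampleGamma(concentration * baseMeasure[k], rng);
            sum += sample[k];
        }
        for (int k = 0; k < sample.length; k++) {
            sample[k] /= sum; // normalizing puts the point on the probability simplex
        }
        return sample;
    }

    public static void main(String[] args) {
        double[] m = {0.25, 0.25, 0.5}; // hypothetical base measure
        double alpha = 10.0;            // hypothetical concentration
        double[] theta = sampleDirichlet(m, alpha, new Random(7));
        System.out.println(java.util.Arrays.toString(theta));
    }
}
```

Under this parameterization the base measure is the mean of the distribution, and a larger concentration keeps draws closer to it, while a small one pushes them toward sparse corners of the simplex.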

Posterior Inference
- Infer (or integrate out) all latent variables, given the tokens.

Inference Algorithms (Mukherjee & Blei, 2009; Asuncion et al., 2009)
- Exact inference in LDA is not tractable.
- Approximate inference algorithms:
  - Mean field variational inference (Blei et al., 2001; 2003)
  - Expectation propagation (Minka & Lafferty, 2002)
  - Collapsed Gibbs sampling (Griffiths & Steyvers, 2002)
  - Collapsed variational inference (Teh et al., 2006)
- Each method has advantages and disadvantages.
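Of the algorithms listed, collapsed Gibbs sampling is the easiest to sketch. The toy class below assumes symmetric scalar priors alpha and beta and resamples each token's topic with weight (n_dk + alpha)(n_kw + beta) / (n_k + V·beta), where the counts exclude the token being resampled. It is an illustration under those assumptions, not Mallet's implementation; the class name CollapsedGibbsSketch and the toy corpus are hypothetical.

```java
import java.util.Random;

/** Toy collapsed Gibbs sampler for LDA with symmetric priors (illustrative only). */
final class CollapsedGibbsSketch {
    final int[][] docs;       // docs[d][n] = word id of token n in document d
    final int[][] z;          // z[d][n] = current topic assignment of that token
    final int[][] docTopic;   // docTopic[d][k] = tokens in document d assigned to topic k
    final int[][] topicWord;  // topicWord[k][w] = times word w is assigned to topic k
    final int[] topicTotal;   // topicTotal[k] = tokens assigned to topic k overall
    final int numTopics;
    final int vocabSize;
    final double alpha;
    final double beta;
    final Random rng;

    CollapsedGibbsSketch(int[][] docs, int vocabSize, int numTopics,
                         double alpha, double beta, long seed) {
        this.docs = docs;
        this.vocabSize = vocabSize;
        this.numTopics = numTopics;
        this.alpha = alpha;
        this.beta = beta;
        this.rng = new Random(seed);
        this.z = new int[docs.length][];
        this.docTopic = new int[docs.length][numTopics];
        this.topicWord = new int[numTopics][vocabSize];
        this.topicTotal = new int[numTopics];
        // Start from uniformly random topic assignments.
        for (int d = 0; d < docs.length; d++) {
            z[d] = new int[docs[d].length];
            for (int n = 0; n < docs[d].length; n++) {
                z[d][n] = rng.nextInt(numTopics);
                addCounts(d, docs[d][n], z[d][n], 1);
            }
        }
    }

    private void addCounts(int d, int w, int k, int delta) {
        docTopic[d][k] += delta;
        topicWord[k][w] += delta;
        topicTotal[k] += delta;
    }

    /** One full pass: resample the topic of every token given all other assignments. */
    void sweep() {
        double[] weights = new double[numTopics];
        for (int d = 0; d < docs.length; d++) {
            for (int n = 0; n < docs[d].length; n++) {
                int w = docs[d][n];
                addCounts(d, w, z[d][n], -1); // remove this token from the counts
                double total = 0.0;
                for (int k = 0; k < numTopics; k++) {
                    weights[k] = (docTopic[d][k] + alpha) * (topicWord[k][w] + beta)
                            / (topicTotal[k] + vocabSize * beta);
                    total += weights[k];
                }
                // Draw the new topic in proportion to the weights.
                double u = rng.nextDouble() * total;
                int k = 0;
                double cumulative = weights[0];
                while (u >= cumulative && k < numTopics - 1) {
                    k++;
                    cumulative += weights[k];
                }
                z[d][n] = k;
                addCounts(d, w, k, 1);
            }
        }
    }

    public static void main(String[] args) {
        int[][] docs = { {0, 1, 0, 1}, {2, 3, 2, 3}, {0, 1, 2, 3} }; // hypothetical word ids
        CollapsedGibbsSketch sampler = new CollapsedGibbsSketch(docs, 4, 2, 0.5, 0.1, 1L);
        for (int i = 0; i < 200; i++) {
            sampler.sweep();
        }
        for (int[] row : sampler.docTopic) {
            System.out.println(java.util.Arrays.toString(row));
        }
    }
}
```

The document-topic and topic-word distributions are integrated out ("collapsed"), so the sampler only tracks counts; estimates of those distributions can be read off the smoothed counts afterwards.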

Topic examples

Topic examples
- Object: bag of words with labels

Topic examples
Basic components:
- A set of entities (e.g. documents, images, individuals, genes)
- A set of relations (e.g. citation, coauthor, co-tag, friends, pathways)


Topic models in machine learning
- Generative models:
  - assume an underlying model (probability distribution, parameters) generated the observed data
  - the class is a hidden variable
  - can handle a large number of classes
- Difference relative to discriminative models?
  - discriminative: P(Y | X)
  - generative: P(Y, X)

Topic Modeling with Mallet
http://mallet.cs.umass.edu
- Command line scripts: bin/mallet [command] [option] [value]
- Text User Interface ("tui") classes
- Direct Java API: http://mallet.cs.umass.edu/api

Learning More about Mallet
http://mallet.cs.umass.edu
- "Quick Start" guides, focused on command line processing
- Developers' guides, with Java examples
- mallet-dev@cs.umass.edu mailing list: low volume, but can be bursty

Models for Text Data
- Generative models (Multinomials)
  - Naïve Bayes
  - Hidden Markov Models (HMMs)
  - Latent Dirichlet Topic Models
- Discriminative Regression Models
  - MaxEnt/Logistic regression
  - Conditional Random Fields (CRFs)


Representations
- Transform text documents to vectors x_1, x_2, ...
- Retain the meaning of vector indices
- Ideally sparsely
Illustration: Document ("Call me Ishmael.") -> x_i = (1.0, 0.0, 0.0, 6.0, 0.0, 3.0)

Representations
- Elements of the vector are called feature values.
- Example: the feature at row 345 is the number of times "dog" appears in the document.

Documents to Vectors
1) Document: "Call me Ishmael."
2) Tokens: Call, me, Ishmael
3) Lowercased tokens: call, me, ishmael
4) Features: 473, 3591, 17, using the alphabet 17 = ishmael, 473 = call, 3591 = me
5) Features (sequence): 473, 3591, 17
6) Features (bag): 17 = 1.0, 473 = 1.0, 3591 = 1.0
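The document-to-vector steps can be sketched with standard-library Java alone. The class DocumentToVectorSketch and its method names are hypothetical and are not Mallet classes; the alphabet is hard-coded to the three ids used on the slides, and tokens missing from it are simply skipped, which is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch of the document -> tokens -> features -> bag steps (not Mallet code). */
final class DocumentToVectorSketch {
    private static final Pattern WORD = Pattern.compile("\\p{L}+");

    /** Splits a document into tokens made of letters. */
    static List<String> tokenize(String document) {
        List<String> tokens = new ArrayList<>();
        Matcher m = WORD.matcher(document);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    /** Lowercases every token. */
    static List<String> lowercase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            out.add(t.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    /** Maps tokens to integer ids, keeping order; unknown tokens are skipped. */
    static List<Integer> toFeatureSequence(List<String> tokens, Map<String, Integer> alphabet) {
        List<Integer> features = new ArrayList<>();
        for (String t : tokens) {
            Integer id = alphabet.get(t);
            if (id != null) {
                features.add(id);
            }
        }
        return features;
    }

    /** Drops the order and counts duplicates: feature id -> count. */
    static Map<Integer, Double> toFeatureVector(List<Integer> featureSequence) {
        Map<Integer, Double> bag = new TreeMap<>();
        for (int id : featureSequence) {
            bag.merge(id, 1.0, Double::sum);
        }
        return bag;
    }

    public static void main(String[] args) {
        Map<String, Integer> alphabet = new HashMap<>();
        alphabet.put("ishmael", 17);
        alphabet.put("call", 473);
        alphabet.put("me", 3591);

        List<String> tokens = lowercase(tokenize("Call me Ishmael."));
        List<Integer> sequence = toFeatureSequence(tokens, alphabet);
        System.out.println(sequence);                  // [473, 3591, 17]
        System.out.println(toFeatureVector(sequence)); // {17=1.0, 473=1.0, 3591=1.0}
    }
}
```

The sequence form keeps word order, which sequence models need; the bag form is the sparse count vector that bag-of-words models consume.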

Instances
Examples: email message, web page, sentence, journal abstract
- Name: What is it called?
- Data: What is the input?
- Target/Label: What is the output?
- Source: What did it originally look like?

Instances (cc.mallet.types)
Fields: Name, Data, Target, Source
Forms the Data field can take:
- String
- TokenSequence (ArrayList<Token>)
- FeatureSequence (int[])
- FeatureVector (int -> double map)

Alphabets (cc.mallet.types, gnu.trove)
An alphabet maps entries to integer indices, e.g. 17 = ishmael, 473 = call, 3591 = me.
- Fields: TObjectIntHashMap map; ArrayList entries
- int lookupIndex(Object o, boolean shouldAdd)
- Object lookupObject(int index)
- void stopGrowth(): do not add entries for new Objects (the default is to allow growth)
- void startGrowth()
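The two-way mapping and the growth switch can be sketched without Mallet or Trove. The class AlphabetSketch below is hypothetical: it uses a plain HashMap in place of TObjectIntHashMap, and returning -1 for an entry that is not added is a convention chosen for this sketch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of a two-way object <-> index mapping with a growth switch (not Mallet code). */
final class AlphabetSketch {
    private final Map<Object, Integer> map = new HashMap<>();
    private final List<Object> entries = new ArrayList<>();
    private boolean growthStopped = false;

    /** Returns the index of entry, adding it when allowed; -1 if absent and not added. */
    int lookupIndex(Object entry, boolean shouldAdd) {
        Integer index = map.get(entry);
        if (index != null) {
            return index;
        }
        if (!shouldAdd || growthStopped) {
            return -1;
        }
        int newIndex = entries.size(); // indices are assigned in insertion order
        map.put(entry, newIndex);
        entries.add(entry);
        return newIndex;
    }

    /** Returns the entry stored at the given index. */
    Object lookupObject(int index) {
        return entries.get(index);
    }

    /** Freezes the alphabet: new entries are no longer added. */
    void stopGrowth() {
        growthStopped = true;
    }

    /** Allows new entries again (the default state). */
    void startGrowth() {
        growthStopped = false;
    }

    int size() {
        return entries.size();
    }

    public static void main(String[] args) {
        AlphabetSketch alphabet = new AlphabetSketch();
        System.out.println(alphabet.lookupIndex("call", true));    // 0
        System.out.println(alphabet.lookupIndex("me", true));      // 1
        alphabet.stopGrowth();
        System.out.println(alphabet.lookupIndex("ishmael", true)); // -1
    }
}
```

Stopping growth matters when test data is processed after training: unseen words must not silently extend the feature space the model was trained on.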

Creating Instances (cc.mallet.pipe.iterator)
- Instance constructor method: new Instance(data, target, name, source)
- Iterators (Iterator<Instance>):
  - FileIterator(File[], …)
  - CsvIterator(FileReader, Pattern, …)
  - ArrayIterator(Object[])

Creating Instances: FileIterator (cc.mallet.pipe.iterator)
- Each instance in its own file
- Label from the directory name, e.g. /data/bad/ and /data/good/

Creating Instances: CsvIterator (cc.mallet.pipe.iterator)
- Each instance on its own line (fields separated by tabs):
  1001 Melville Call me Ishmael. Some years ago
  1002 Dickens It was the best of times, it was
- Regular expression: ^([^\t]+)\t([^\t]+)\t(.*)
- Name, label, data from regular expression groups.
- "CSV" is a lousy name. LineRegexIterator?
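How the three regular expression groups split a line can be checked with java.util.regex alone. The class LineRegexSketch and its parse method are hypothetical; the sketch shows only the group extraction, not how CsvIterator itself is implemented.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: splitting one tab-separated line into name, label, and data (not Mallet code). */
final class LineRegexSketch {
    // Three groups separated by tabs: name, label, and the rest of the line as data.
    static final Pattern LINE = Pattern.compile("^([^\\t]+)\\t([^\\t]+)\\t(.*)");

    /** Returns {name, label, data}, or null when the line does not match. */
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String[] fields = parse("1001\tMelville\tCall me Ishmael. Some years ago");
        System.out.println("name  = " + fields[0]);
        System.out.println("label = " + fields[1]);
        System.out.println("data  = " + fields[2]);
    }
}
```

Because the fields come from regex groups rather than a fixed delimiter, the same approach works for any line-oriented format by swapping the pattern.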

Instance Pipelines (cc.mallet.pipe)
- Sequential transformations of instance fields (usually Data)
- Pass an ArrayList<Pipe> to SerialPipes

// data is a String
CharSequence2TokenSequence      // tokenize with regexp
TokenSequenceLowercase          // modify each token's text
TokenSequenceRemoveStopwords    // drop some tokens
TokenSequence2FeatureSequence   // convert token Strings to ints
FeatureSequence2FeatureVector   // lose order, count duplicates
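The idea of passing a list of pipes to be applied in order can be sketched with plain Java functions. SerialPipesSketch and its helpers are hypothetical stand-ins, not Mallet's SerialPipes or Pipe classes, and the sketch covers only the lowercase and stopword steps on a token list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.UnaryOperator;

/** Sketch of applying a list of transformations in order (not Mallet code). */
final class SerialPipesSketch {

    /** Applies each pipe in order, feeding the output of one into the next. */
    static List<String> run(List<UnaryOperator<List<String>>> pipes, List<String> tokens) {
        List<String> current = tokens;
        for (UnaryOperator<List<String>> pipe : pipes) {
            current = pipe.apply(current);
        }
        return current;
    }

    /** Pipe that lowercases each token's text. */
    static UnaryOperator<List<String>> lowercase() {
        return tokens -> {
            List<String> out = new ArrayList<>();
            for (String t : tokens) {
                out.add(t.toLowerCase(Locale.ROOT));
            }
            return out;
        };
    }

    /** Pipe that drops every token found in the stopword set. */
    static UnaryOperator<List<String>> removeStopwords(Set<String> stopwords) {
        return tokens -> {
            List<String> out = new ArrayList<>();
            for (String t : tokens) {
                if (!stopwords.contains(t)) {
                    out.add(t);
                }
            }
            return out;
        };
    }

    public static void main(String[] args) {
        List<UnaryOperator<List<String>>> pipes = new ArrayList<>();
        pipes.add(lowercase());
        pipes.add(removeStopwords(new HashSet<>(Arrays.asList("me")))); // hypothetical stopword list
        System.out.println(run(pipes, Arrays.asList("Call", "me", "Ishmael"))); // [call, ishmael]
    }
}
```

Order matters: lowercasing before stopword removal lets a lowercase stopword list catch capitalized tokens as well.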

Instance Pipelines (cc.mallet.pipe, cc.mallet.types)
- A small number of pipes modify the target field
- There are now two alphabets: data and label (LabelAlphabet is a subclass of Alphabet)

// target is a String
Target2Label    // convert String to int
// target is now a Label

Label objects (cc.mallet.types)
- Weights on a fixed set of classes
- For training data, the weight for the correct label is 1.0, all others 0.0
- Label implements Labeling: int getBestIndex(), Label getBestLabel()
- You cannot create a Label; they are only produced by LabelAlphabet

InstanceLists (cc.mallet.types)
A List of Instance objects, along with a Pipe, data Alphabet, and LabelAlphabet.

InstanceList instances = new InstanceList(pipe);
instances.addThruPipe(iterator);

Putting it all together

Demo Results
Twitter data related to the 2012 Korean presidential election
Reference: Song, M., Kim, M., and Jung, Y.K. (2014). Analyzing the Political Landscape of 2012 Korean Presidential Election in Twitter, IEEE Intelligent Systems (SCI)

Multinomial Topic Modeling
[Figure: chart of values for Topic_01 through Topic_09]

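Multinomial Topic Modeling

Topic | Label | Major Terms | Type
Topic_01 | Jeongsu Scholarship Foundation | Park Geun-hye, Jeongsu Scholarship Foundation, Ahn Cheol-soo, presidential election, Moon Jae-in, MBC, Choi Phil-lip, Saenuri Party, Busan Ilbo | rising
Topic_02 | Presidential candidates | Park Geun-hye, Moon Jae-in, Ahn Cheol-soo, candidate, presidential election, president | rising
Topic_03 | Park Geun-hye approval rating | Park Geun-hye, candidate, Ahn Cheol-soo, presidential election, Saenuri Party, Moon Jae-in, candidate unification, president, approval rating, election | falling
Topic_04 | Ahn Cheol-soo allegations | Ahn Cheol-soo, Park Geun-hye, thesis, plagiarism, allegation, underreported real-estate contract, Seoul National University | falling
Topic_05 | Presidential candidates | Park Geun-hye, Moon Jae-in, Ahn Cheol-soo, candidate, presidential election, president | rising
Topic_06 | Candidate unification | Ahn Cheol-soo, Park Geun-hye, Moon Jae-in, presidential election, candidate, independent, candidate unification | rising
Topic_07 | Park Geun-hye slogan | Park Geun-hye, Moon Jae-in, Ahn Cheol-soo, country, my, dream, comes true | rising
Topic_08 | Park Geun-hye camp formation | Park Geun-hye, Moon Jae-in, Ahn Cheol-soo, Democratic Party, camp, Kim Kyung-jae, this, presidential election, seized, pro-North Korea forces, to stop, came in | rising
Topic_09 | Presidential candidates | candidate, Park Geun-hye, Ahn Cheol-soo, Saenuri Party, independent, Democratic United Party, Moon Jae-in | falling
Topic_10 | NLL renunciation allegation | Park Geun-hye, Moon Jae-in, NLL, Ahn Cheol-soo, Roh Moo-hyun, Chung Moon-hun, Democratic United Party | rising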

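Rising Issues
[Figure: trend chart of the rising topics: Jeongsu Scholarship Foundation, presidential candidates, presidential candidates, candidate unification, Park Geun-hye slogan]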

Topic Modeling Toolkit
- Download Mallet-2.0.7.zip at http://informatics.yonsei.ac.kr/tsmm/download/mallet-2.0.7.zip
- Download Eclipse at https://eclipse.org/downloads/
- Unzip Mallet-2.0.7.zip and create a Java project in Eclipse
- Follow the instructions in the next slides (p. 50 to p. 66)

Importing a Java Project to Eclipse

Program Import
1) Download the program and unzip it.
2) Then paste it into the workspace (check the workspace location first).

Program Import
1) When Eclipse starts, the first screen is the Welcome message; just click Close (X).
2) Click File > Import.
3) Click General > Existing Projects into Workspace, then click Next.
4) Click Browse.
5) Click the project inside the workspace you set up earlier. The project is named mallet-2.0.7.
6) Check the path (e.g. C:\workspace\text_mining_class), then click Finish.
7) If the imported project appears in Eclipse, the import succeeded.

Build path settings
1) Click the triangle next to mallet-2.0.7.
2) Click the triangle next to the lib folder.
3) A file with an extra marker on its icon has already been imported; a file without the marker has not.
4) Select the files that have not been imported, right-click, and choose Build Path > Add to Build Path.

Encoding settings
1) From the Window menu, select Preferences.
2) Select General > Workspace.
3) Under Text file encoding, select Other and set it to UTF-8.

When all the red X marks in the Package Explorer on the left side of Eclipse have disappeared, the setup succeeded.

Q/A
Thank you.