Analyzing Big Scientific Publication Collections by Topic Modeling Techniques
Min Song, Ph.D.
Department of Library and Information Science, Yonsei University
The Problem with Information
Needle in a haystack: as more information becomes available, it is harder and harder to find what we are looking for.
We need new tools to help us organize, search, and understand information.
:: topic modeling ::
A Solution?
Use topic models to discover hidden topic-based patterns.
Use discovered topics to annotate the collection.
Use annotations to organize, understand, summarize, search...
Topic (Concept) Models
Topic models: LSA, PLSA, LDA.
They share 3 fundamental assumptions:
  Documents have latent semantic structure (topics).
  Topics can be inferred from word-document co-occurrences.
  Words are related to topics, and topics to documents.
They use different mathematical frameworks: linear algebra vs. probabilistic modeling.
Topic models (from David Blei, KDD-11 tutorial)
Observation: a collection of texts.
Assumption: the texts have been generated according to some model.
Output: the model that has generated the texts.
Topic models
Discover hidden topical patterns that pervade the collection through statistical regularities.
Annotate documents with these topics.
Use the topic annotations to organize, summarize, search texts...
Topic examples (Steyvers & Griffiths, 2006)
LSA and topic models (Steyvers & Griffiths, 2006)
Topic models intuition
Find the latent structure of topics or concepts in a text corpus, which is obscured by word choice noise.
Deerwester et al. (1990), LSA: the co-occurrence of terms in text documents can be used to recover this latent structure, without additional knowledge.
Latent topic representations of text allow modelling linguistic phenomena, like synonymy and polysemy.
Topic models
Each document is a mixture of topics.
Each word is drawn from one of its document's topics.
Topic models
The observations are the documents.
We need to infer the model, i.e. the underlying topic structure: the topic assignments z_{m,n} and the topic and word distributions.
Priors: a distribution with hyperparameter α and a distribution with hyperparameter β.
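For reference, the generative story described on these two slides is usually written as follows in the standard LDA formulation. This is a sketch under assumptions: the priors are taken to be Dirichlet, and the symbols \theta_m (topic mixture of document m), \varphi_k (word distribution of topic k), w_{m,n} (observed word), K, M and N_m are introduced here for illustration; only z_{m,n}, α and β come from the slides.

```latex
\begin{align*}
\varphi_k &\sim \mathrm{Dirichlet}(\beta), & k &= 1,\dots,K \\
\theta_m &\sim \mathrm{Dirichlet}(\alpha), & m &= 1,\dots,M \\
z_{m,n} \mid \theta_m &\sim \mathrm{Multinomial}(\theta_m), & n &= 1,\dots,N_m \\
w_{m,n} \mid z_{m,n}, \varphi &\sim \mathrm{Multinomial}(\varphi_{z_{m,n}})
\end{align*}
```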
Graphical Model
Figure labels: Dirichlet parameters, document-specific topic distribution, topic assignment, observed word, topics, Dirichlet parameters.
Dirichlet Distribution
Distribution over K-dimensional positive vectors that sum to one (i.e., points on the probability simplex).
Two parameters: a base measure, e.g., m (vector), and a concentration parameter, e.g., α (scalar).
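With this parameterization, the density can be written as below. This is the standard Dirichlet density, stated under the assumption that the base measure m is itself a point on the simplex (m_k > 0, \sum_k m_k = 1) and α > 0; the symbol \theta for the K-dimensional vector is introduced here for illustration.

```latex
\[
p(\theta \mid \alpha m)
  = \frac{\Gamma(\alpha)}{\prod_{k=1}^{K} \Gamma(\alpha m_k)}
    \prod_{k=1}^{K} \theta_k^{\alpha m_k - 1},
\qquad \theta_k > 0,\quad \sum_{k=1}^{K} \theta_k = 1 .
\]
```

Under these assumptions the mean of \theta is m, and a larger α concentrates the mass more tightly around m.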
Posterior Inference
Infer (or integrate out) all latent variables, given the tokens.
Inference Algorithms (Mukherjee & Blei, 2009; Asuncion et al., 2009)
Exact inference in LDA is not tractable.
Approximate inference algorithms:
  Mean field variational inference (Blei et al., 2001; 2003)
  Expectation propagation (Minka & Lafferty, 2002)
  Collapsed Gibbs sampling (Griffiths & Steyvers, 2002)
  Collapsed variational inference (Teh et al., 2006)
Each method has advantages and disadvantages.
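To make one of these algorithms concrete, here is a minimal sketch of collapsed Gibbs sampling for LDA. It assumes symmetric Dirichlet hyperparameters and documents already encoded as arrays of word ids. The class name ToyLdaGibbs and all field names are hypothetical, and this is not Mallet's implementation.

```java
import java.util.Random;

// Toy collapsed Gibbs sampler for LDA. Illustrative only; not Mallet's implementation.
class ToyLdaGibbs {
    final int[][] docs;       // docs[d][i] = word id of the i-th token in document d
    final int[][] z;          // z[d][i] = topic currently assigned to that token
    final int[][] docTopic;   // docTopic[d][k] = tokens in document d assigned to topic k
    final int[][] topicWord;  // topicWord[k][w] = tokens of word w assigned to topic k
    final int[] topicTotal;   // topicTotal[k] = all tokens assigned to topic k
    final int numTopics, vocabSize;
    final double alpha, beta; // symmetric Dirichlet hyperparameters (an assumption)
    final Random rng;

    ToyLdaGibbs(int[][] docs, int vocabSize, int numTopics,
                double alpha, double beta, long seed) {
        this.docs = docs;
        this.vocabSize = vocabSize;
        this.numTopics = numTopics;
        this.alpha = alpha;
        this.beta = beta;
        this.rng = new Random(seed);
        z = new int[docs.length][];
        docTopic = new int[docs.length][numTopics];
        topicWord = new int[numTopics][vocabSize];
        topicTotal = new int[numTopics];
        // Random initial topic assignment for every token.
        for (int d = 0; d < docs.length; d++) {
            z[d] = new int[docs[d].length];
            for (int i = 0; i < docs[d].length; i++) {
                z[d][i] = rng.nextInt(numTopics);
                addCounts(d, docs[d][i], z[d][i], 1);
            }
        }
    }

    private void addCounts(int d, int w, int k, int delta) {
        docTopic[d][k] += delta;
        topicWord[k][w] += delta;
        topicTotal[k] += delta;
    }

    // One pass over all tokens, resampling each topic assignment given all the others.
    void sweep() {
        double[] p = new double[numTopics];
        for (int d = 0; d < docs.length; d++) {
            for (int i = 0; i < docs[d].length; i++) {
                int w = docs[d][i];
                addCounts(d, w, z[d][i], -1); // take the token out of the counts
                double sum = 0.0;
                for (int k = 0; k < numTopics; k++) {
                    p[k] = (docTopic[d][k] + alpha) * (topicWord[k][w] + beta)
                            / (topicTotal[k] + vocabSize * beta);
                    sum += p[k];
                }
                // Draw a topic with probability proportional to p[k].
                double u = rng.nextDouble() * sum;
                int k = 0;
                double acc = p[0];
                while (acc < u && k < numTopics - 1) {
                    k++;
                    acc += p[k];
                }
                z[d][i] = k;
                addCounts(d, w, k, 1); // put the token back under its new topic
            }
        }
    }
}
```

After enough sweeps, docTopic and topicWord (smoothed by alpha and beta) can be normalized to estimate the document-topic and topic-word distributions. The number of sweeps needed is not addressed by this sketch.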
Topic examples
Topic examples
Object bag of words with labels
Topic examples
Basic components:
  A set of entities (e.g. documents, images, individuals, genes)
  A set of relations (e.g. citation, coauthor, co-tag, friends, pathways)
Topic models in machine learning
Generative: assume an underlying model (probability distribution, parameters) generated the observed data.
The class is a hidden variable.
Can handle a large number of classes.
Difference relative to discriminative models? Discriminative: P(Y | X); generative: P(Y, X).
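The two quantities are linked by the definition of conditional probability, which is one way to see the difference: a generative model of the joint distribution can always be turned into the conditional one, but not the other way round. For discrete classes y' (the summation index is introduced here for illustration):

```latex
\[
P(Y \mid X) = \frac{P(Y, X)}{P(X)} = \frac{P(Y, X)}{\sum_{y'} P(y', X)} .
\]
```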
Topic Modeling with Mallet
http://mallet.cs.umass.edu
Command line scripts: bin/mallet [command] [option] [value]
Text User Interface (tui) classes
Direct Java API: http://mallet.cs.umass.edu/api
Learning More about Mallet
http://mallet.cs.umass.edu
Quick Start guides, focused on command line processing
Developers' guides, with Java examples
mallet-dev@cs.umass.edu mailing list: low volume, but can be bursty
Models for Text Data
Generative models (multinomials): Naïve Bayes, Hidden Markov Models (HMMs), Latent Dirichlet topic models.
Discriminative regression models: MaxEnt/logistic regression, Conditional Random Fields (CRFs).
Representations
Transform text documents to vectors x_1, x_2, ...
Retain the meaning of vector indices.
Ideally sparsely.
Document: "Call me Ishmael." -> x_i: 1.0, 0.0, 0.0, 6.0, 0.0, 3.0
Representations
Elements of the vector are called feature values.
Example: the feature at row 345 is the number of times "dog" appears in the document.
x_i: 1.0, 0.0, 0.0, 6.0, 0.0, 3.0
Documents to Vectors
Document: "Call me Ishmael."
Tokens: Call, me, Ishmael
Tokens (lowercased): call, me, ishmael
Features: 473, 3591, 17 (alphabet: 17 = ishmael, 473 = call, 3591 = me)
Features (sequence): 473, 3591, 17
Features (bag): 17 -> 1.0, 473 -> 1.0, 3591 -> 1.0
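The steps on these slides can be sketched in plain Java as below. This is a minimal illustration using only the JDK, not Mallet's classes. The class DocToVectors and its methods are hypothetical, the tokenizer (split on non-letters) is an assumption, and tokens missing from the alphabet are simply skipped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical helper class; mirrors the slide's steps without using Mallet.
class DocToVectors {
    // Document -> lowercased tokens (assumed tokenizer: split on non-letter characters).
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // Tokens -> feature sequence: each token is replaced by its integer id, order kept.
    static List<Integer> toFeatureSequence(List<String> tokens, Map<String, Integer> alphabet) {
        List<Integer> seq = new ArrayList<>();
        for (String t : tokens) {
            Integer id = alphabet.get(t);
            if (id != null) { // unknown tokens are skipped in this sketch
                seq.add(id);
            }
        }
        return seq;
    }

    // Feature sequence -> feature bag: order is lost, duplicates are counted.
    static SortedMap<Integer, Double> toFeatureBag(List<Integer> seq) {
        SortedMap<Integer, Double> bag = new TreeMap<>();
        for (int id : seq) {
            bag.merge(id, 1.0, Double::sum);
        }
        return bag;
    }
}
```

With an alphabet mapping ishmael to 17, call to 473 and me to 3591, the document "Call me Ishmael." yields the sequence 473, 3591, 17 and a bag with weight 1.0 on each of the three ids, matching the slide.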
Instances
Examples: email message, web page, sentence, journal abstract.
Name: What is it called?
Data: What is the input?
Target/Label: What is the output?
Source: What did it originally look like?
Instances (cc.mallet.types)
Name: String
Data: TokenSequence (ArrayList<Token>), FeatureSequence (int[]), FeatureVector (int -> double map)
Target
Source
Alphabets (cc.mallet.types, gnu.trove)
Example entries: 17 = ishmael, 473 = call, 3591 = me
Fields: TObjectIntHashMap map; ArrayList entries
int lookupIndex(Object o, boolean shouldAdd)
Object lookupObject(int index)
Alphabets (cc.mallet.types, gnu.trove)
void stopGrowth(): do not add entries for new Objects. The default is to allow growth.
void startGrowth()
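The idea behind an alphabet, a two-way mapping between objects and dense integer ids, can be sketched with JDK collections. SimpleAlphabet below is a hypothetical stand-in that borrows the method names from the slides. It is not Mallet's Alphabet class, and returning -1 for an entry that is absent and not added is a choice made for this sketch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for an alphabet; uses JDK collections rather than gnu.trove.
class SimpleAlphabet {
    private final Map<Object, Integer> map = new HashMap<>(); // object -> index
    private final List<Object> entries = new ArrayList<>();   // index -> object
    private boolean growthStopped = false;

    // Returns the index of o, adding it if allowed; -1 if absent and not added.
    int lookupIndex(Object o, boolean shouldAdd) {
        Integer index = map.get(o);
        if (index != null) {
            return index;
        }
        if (!shouldAdd || growthStopped) {
            return -1;
        }
        entries.add(o);
        map.put(o, entries.size() - 1);
        return entries.size() - 1;
    }

    Object lookupObject(int index) {
        return entries.get(index);
    }

    void stopGrowth() {
        growthStopped = true;
    }

    void startGrowth() {
        growthStopped = false;
    }

    int size() {
        return entries.size();
    }
}
```

Stopping growth is useful when processing test data: unseen words then map to -1 instead of silently enlarging the vocabulary that a trained model expects.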
Creating Instances (cc.mallet.pipe.iterator)
Instance constructor: new Instance(data, target, name, source)
Iterators (Iterator<Instance>):
  FileIterator(File[], ...)
  CsvIterator(FileReader, Pattern, ...)
  ArrayIterator(Object[])
Creating Instances: FileIterator (cc.mallet.pipe.iterator)
Each instance in its own file.
Label from directory name, e.g. /data/bad/ and /data/good/.
Creating Instances: CsvIterator (cc.mallet.pipe.iterator)
Each instance on its own line (fields separated by tabs):
  1001 Melville Call me Ishmael. Some years ago
  1002 Dickens It was the best of times, it was
Regular expression: ^([^\t]+)\t([^\t]+)\t(.*)
Name, label, and data come from the regular expression groups.
"CSV" is a lousy name. LineRegexIterator?
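How the three groups of that regular expression split a line can be checked with java.util.regex alone. The class LineRegexDemo is hypothetical and only illustrates the pattern; it does not use CsvIterator itself.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: applies the slide's pattern to one tab-separated line.
class LineRegexDemo {
    // group 1 = name, group 2 = label, group 3 = data
    static final Pattern LINE = Pattern.compile("^([^\\t]+)\\t([^\\t]+)\\t(.*)$");

    // Returns {name, label, data}, or null when the line does not match.
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String[] fields = parse("1001\tMelville\tCall me Ishmael. Some years ago");
        System.out.println("name=" + fields[0] + " label=" + fields[1] + " data=" + fields[2]);
    }
}
```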
Instance Pipelines (cc.mallet.pipe)
Sequential transformations of instance fields (usually Data).
Pass an ArrayList<Pipe> to SerialPipes:
  // data is a String
  CharSequence2TokenSequence      // tokenize with regexp
  TokenSequenceLowercase          // modify each token's text
  TokenSequenceRemoveStopwords    // drop some tokens
  TokenSequence2FeatureSequence   // convert token Strings to ints
  FeatureSequence2FeatureVector   // lose order, count duplicates
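The idea of chaining transformations serially can be illustrated with java.util.function.Function. This is a sketch of the concept only: ToyPipeline, its three stages and the tiny stopword list are hypothetical and do not use Mallet's Pipe classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical serial pipeline: each stage consumes the previous stage's output.
class ToyPipeline {
    // Assumed stopword list, for illustration only.
    static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList("me", "the", "of"));

    // String -> tokens (tokenize with a regexp: runs of letters)
    static List<String> tokenize(String data) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile("\\p{L}+").matcher(data);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    // tokens -> tokens (modify each token's text)
    static List<String> lowercase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            out.add(t.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    // tokens -> tokens (drop some tokens)
    static List<String> removeStopwords(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (!STOPWORDS.contains(t)) {
                out.add(t);
            }
        }
        return out;
    }

    // Compose the stages in order, in the spirit of a serial pipe.
    static Function<String, List<String>> build() {
        Function<String, List<String>> first = ToyPipeline::tokenize;
        return first.andThen(ToyPipeline::lowercase).andThen(ToyPipeline::removeStopwords);
    }
}
```

Order matters in such a chain: here stopword removal runs after lowercasing, so "Me" and "me" are treated alike.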
Instance Pipelines (cc.mallet.pipe, cc.mallet.types)
A small number of pipes modify the target field.
There are now two alphabets: data and label (LabelAlphabet extends Alphabet).
  // target is a String
  Target2Label    // convert String to int
  // target is now a Label
Label objects (cc.mallet.types)
Weights on a fixed set of classes.
For training data, the weight for the correct label is 1.0 and all others are 0.0.
Label implements Labeling:
  int getBestIndex()
  Label getBestLabel()
You cannot create a Label directly; Labels are only produced by a LabelAlphabet.
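The notion of a labeling as weights over a fixed set of classes can be sketched as follows. ToyLabeling is hypothetical: it borrows the method names from the slide, represents labels as plain Strings, and is not Mallet's Labeling interface.

```java
// Hypothetical sketch: weights over a fixed set of classes.
class ToyLabeling {
    final String[] classes;
    final double[] weights;

    ToyLabeling(String[] classes, double[] weights) {
        this.classes = classes;
        this.weights = weights;
    }

    // Training data: weight 1.0 for the correct class, 0.0 for all others.
    static ToyLabeling forTrainingLabel(String[] classes, int correctIndex) {
        double[] w = new double[classes.length];
        w[correctIndex] = 1.0;
        return new ToyLabeling(classes, w);
    }

    // Index of the class with the largest weight (first one wins on ties).
    int getBestIndex() {
        int best = 0;
        for (int i = 1; i < weights.length; i++) {
            if (weights[i] > weights[best]) {
                best = i;
            }
        }
        return best;
    }

    String getBestLabel() {
        return classes[getBestIndex()];
    }
}
```

The same structure covers classifier output, where the weights are spread over several classes instead of being one-hot.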
InstanceLists (cc.mallet.types)
A List of Instance objects, along with a Pipe, data Alphabet, and LabelAlphabet.
  InstanceList instances = new InstanceList(pipe);
  instances.addThruPipe(iterator);
Putting it all together
Demo Results
Twitter data on the 2012 Korean presidential election.
Reference: Song, M., Kim, M., and Jung, Y.K. (2014). Analyzing the Political Landscape of 2012 Korean Presidential Election in Twitter. IEEE Intelligent Systems (SCI).
Multinomial Topic Modeling
[Figure: chart with series Topic_01 through Topic_09; vertical axis from -2.5 to 3.5]
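Multinomial Topic Modeling
Topic | Label | Major Terms | Type
Topic_01 | Jeongsu Scholarship Foundation | Park Geun-hye, Jeongsu Scholarship Foundation, Ahn Cheol-soo, presidential election, Moon Jae-in, MBC, Choi Pil-lip, Saenuri Party, Busan Ilbo | rising
Topic_02 | Presidential candidates | Park Geun-hye, Moon Jae-in, Ahn Cheol-soo, candidate, presidential election, president | rising
Topic_03 | Park Geun-hye approval rating | Park Geun-hye, candidate, Ahn Cheol-soo, presidential election, Saenuri Party, Moon Jae-in, candidate unification, president, approval rating, election | falling
Topic_04 | Ahn Cheol-soo allegations | Ahn Cheol-soo, Park Geun-hye, thesis, plagiarism, allegations, under-reported ("down") contract, Seoul National University | falling
Topic_05 | Presidential candidates | Park Geun-hye, Moon Jae-in, Ahn Cheol-soo, candidate, presidential election, president | rising
Topic_06 | Candidate unification | Ahn Cheol-soo, Park Geun-hye, Moon Jae-in, presidential election, candidate, independent, candidate unification | rising
Topic_07 | Park Geun-hye slogan | Park Geun-hye, Moon Jae-in, Ahn Cheol-soo, country, my, dream, come true | rising
Topic_08 | Park Geun-hye campaign team | Park Geun-hye, Moon Jae-in, Ahn Cheol-soo, Democratic Party, camp, Kim Kyung-jae, 이 (particle), presidential election, taken over, pro-North Korea forces, to stop, joined | rising
Topic_09 | Presidential candidates | candidate, Park Geun-hye, Ahn Cheol-soo, Saenuri Party, independent, Democratic United Party, Moon Jae-in | falling
Topic_10 | NLL abandonment allegations | Park Geun-hye, Moon Jae-in, NLL, Ahn Cheol-soo, Roh Moo-hyun, Chung Moon-hun, Democratic United Party | rising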
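Rising Issues
[Figure: chart of rising topics: Jeongsu Scholarship Foundation, presidential candidates, presidential candidates, candidate unification, Park Geun-hye slogan; vertical axis from -2.5 to 1.5]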
Topic Modeling Toolkit
Download Mallet-2.0.7.zip at http://informatics.yonsei.ac.kr/tsmm/download/mallet-2.0.7.zip
Download Eclipse at https://eclipse.org/downloads/
Unzip Mallet-2.0.7.zip and create a Java project in Eclipse.
Follow the instructions in the next slides (pp. 50-66).
Importing a Java Project to Eclipse
Program Import
1) Download the program and unzip it.
2) Then paste it into the workspace: check the workspace location and paste the unzipped folder inside it.
Program Import 1)
When Eclipse starts, the first screen is only a Welcome message, so just close it (click X).
Program Import 2)
Click File > Import.
Program Import 3)
Click General > Existing Projects into Workspace, then click Next.
Program Import 4)
Click Browse.
Program Import 5)
Click the project inside the workspace you set up earlier. The project is named mallet-2.0.7.
Program Import 6)
Check the settings (C:\workspace\text_mining_class), then click Finish.
Program Import 7)
If the project appears in Eclipse after clicking Finish, the import succeeded.
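Build Path Setup 1)
Click the triangle next to mallet-2.0.7.
Build Path Setup 2)
Click the triangle next to the lib folder.
Build Path Setup 3)
Files with an extra marker on their icon have been imported; files without the marker have not.
Build Path Setup 4)
Select the files that have not been imported, right-click, and choose Build Path > Add to Build Path.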
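Encoding Setup
From the Window menu, select Preferences.
Encoding Setup
Select General > Workspace. Under Text file encoding, select Other and set it to UTF-8.
When all the red X markers disappear from the Package Explorer on the left side of Eclipse, the setup succeeded.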
Q/A
Thank you.