Exobrain Artificial Intelligence Workshop (엑소브레인 인공지능 워크샵)
Discovering Implicit Relationships to Augment Web-scale Knowledge Base for QA
Sung-Hyon Myaeng (맹성현), Jinho Kim (김진호)
IR & NLP Lab, KAIST
2015.08.21 (Fri.)
CONTENTS
001 Introduction
002 Method for Open Knowledge Acquisition
003 Evaluation
004 Conclusion & Future Work
1 INTRODUCTION
Concept Graph (CG)-based QA

Question (Korean): 띠를 나타내는 12마리의 동물 중에서 상상의 동물은?
Question (English): It is the imaginary animal among the twelve animals in the Chinese zodiac.

Question CG:
  it -isa-> imaginary animal
  it -includedin-> one of twelve animals -comprisedof-> Chinese zodiac

Knowledge CGs:
  Dragon -oneof-> 12-year cycle of animals -appearsin-> Chinese zodiac
  dragon -isa-> large imaginary animal
  dragon -breathesout-> fire
  dragon -has-> wings
  dragon -has-> long tail

Sources:
  Dragon (zodiac), from Wikipedia: "The Dragon is one of the 12-year cycle of animals which appear in the Chinese zodiac"
  dragon, from Longman dictionary: "a large imaginary animal that has wings and a long tail and can breathe out fire"
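CGQA Flow
[Flow diagram; components and responsible institutions]
  Input: Korean question, English question
  Question-to-CG conversion (Busan University of Foreign Studies)
  Answer type detection (POSTECH)
  Paraphrasing (Kyonggi University)
  CG generation (KAIST)
  CG filtering
  CG matching/storage (KAIST)
  Partial node matching (KAIST)
  Image processing & CG generation (KAIST)
  Linking/integration/expansion of language resources
  Semantic class generation
  Word sense disambiguation and its use (University of Ulsan / Pusan National University)
  Knowledge expansion, knowledge correction
  Lexical map (어휘지도) expansion (University of Ulsan)
  KorLex expansion (Pusan National University)
  Training & evaluation, quality verification
  Corpus tagging and evaluation set construction (솔샘넷)
  Answer candidate conversion (Busan University of Foreign Studies)
  Answer integration & ranking (POSTECH)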
1.1 Motivation
[Figure: Knowledge Base]
1.1 Motivation: Augmenting Knowledge Base
Knowledge Base: "I have only 0.0013% of knowledge (possible relation links)"
Ways to augment a knowledge base:
  - Add new entities and relations
  - Modify existing entities and relations
  - Add new relations between existing entities
1.1 Motivation: Open Information Extraction (OIE) and Implicit Relationship
[Figure: "Seoul is the largest city of Korea" -> OIE -> Knowledge Base]
1.2 Problem Statement: Open Knowledge Acquisition
Infer all possible implicit relationships between two entities for which no relationship exists in the OIE knowledge base.

Input: a set of triples and an entity pair, e.g. <Asiana airlines, Korea>
Output: implicit relationships of the input pair, e.g. <Asiana airlines, major_airline, Korea>

  - Infer Implicit Relationship: infer an appropriate relation name that can explain the relationship between the two given entities.
  - All Possible Relationships: infer every relation name that is possible between the two entities.
  - No Textual Context: use only the triple set of the knowledge base.
1.3 Related Work: Knowledge Acquisition

2010, SHERLOCK (Schoenmackers & Etzioni, University of Washington)
  Automatically extracts inference rules using Open IE relation names.
  Output: {A, Cause By, C} + {C, Short For, B} -> {A, Cause By, B}

2011 (Saeger & Torisawa, NICT)
  Uses class-dependent patterns and partial patterns to extract additional instances of a target relation.
  Output: Cause(Hypertension, intracranial bleeding)

2014, ProbKB (Yang et al., University of Florida)
  Extracts new inference rules by combining SHERLOCK rules.
  Output: {A, Born in, B} + {A, Born in, C} -> {B, Located in, C}

2014, Knowledge Vault: Link Prediction (Dong et al., Google)
  Uses random walks to extract new instances for each target relation.
  Output: Profession(Charles Dickens, ?) -> Writer
2 THE METHOD
2.1 Overview
[Pipeline diagram]
Input: Documents -> Triples, e.g. (diabetes, Treated With, insulin), (insulin, Needed To Create, energy)
Main tasks:
  1. Preprocessing
  2. Building Entity Graph
  3. Discovering Implicit Entity Pairs
  4. Reduce
  5. Labeling Relationship
  6. Augment
Sub-tasks: Entity Merging, Entity Categorization, Subgraph Matching, Relation Selection
Graph example: Diabetes -Treated With-> insulin, energy
2.2 Preprocessing: Entity and Relational Phrase Normalization
Goal:
  - Merge entities and relations that have the same meaning into one.
  - Improve matching efficiency in the later modules.

Entity:
  Original unique entities: 100%
  Entity Merging: 94.4%
    - Remove "a" and "the"
    - When forms conflict, keep the one with the higher frequency
  Entity Filtering (for efficiency): 30.1%
    - Remove entities that appear only once

Relation:
  Original unique relations: 100%
  Relational Phrase Normalization: 64%
    - Remove 29 stopwords
    - Remove tokens with POS tags MD and DT
    - Lemmatization
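The normalization steps on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the stopword set, the modal and determiner word lists (used here in place of a real POS tagger for MD and DT), the toy lemma table, and all function names are assumptions made for the example. Case-insensitive matching and counting frequency per triple occurrence are also assumptions.

```python
from collections import Counter

# Hypothetical stand-ins: the slide mentions 29 stopwords but does not list them.
STOPWORDS = {"also", "just", "very"}
# Word lists used instead of a POS tagger for DT (determiner) and MD (modal).
DETERMINERS = {"a", "an", "the"}
MODALS = {"can", "could", "may", "might", "must", "shall", "should", "will", "would"}
# Toy lemma table standing in for a real lemmatizer.
LEMMAS = {"is": "be", "was": "be", "are": "be", "were": "be", "treated": "treat"}


def entity_key(entity):
    """Canonical key for an entity: lowercase, with 'a' and 'the' removed."""
    return " ".join(t for t in entity.lower().split() if t not in {"a", "the"})


def entity_counts(triples):
    """Count how often each entity occurs as subject or object."""
    counts = Counter()
    for e1, _, e2 in triples:
        counts[e1] += 1
        counts[e2] += 1
    return counts


def merge_entities(triples):
    """Map each surface form to the most frequent form sharing its key."""
    counts = entity_counts(triples)
    best = {}
    for surface, n in counts.items():
        key = entity_key(surface)
        if key not in best or n > counts[best[key]]:
            best[key] = surface
    return {surface: best[entity_key(surface)] for surface in counts}


def normalize_relation(phrase):
    """Drop stopwords, determiners, and modals, then lemmatize each token."""
    kept = []
    for tok in phrase.lower().split():
        if tok in STOPWORDS or tok in DETERMINERS or tok in MODALS:
            continue
        kept.append(LEMMAS.get(tok, tok))
    return " ".join(kept)


def preprocess(triples, min_count=2):
    """Merge entities, normalize relations, and drop triples with rare entities."""
    mapping = merge_entities(triples)
    merged = [(mapping[e1], normalize_relation(r), mapping[e2]) for e1, r, e2 in triples]
    counts = entity_counts(merged)
    return [t for t in merged if counts[t[0]] >= min_count and counts[t[2]] >= min_count]
```

With `min_count=2`, any triple that mentions an entity seen only once after merging is dropped, which mirrors the "remove entities that appear only once" filter.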
2.2 Preprocessing: Entity Categorization (28.3%)
Goal: assign a class to each entity to improve matching accuracy.

WordNet-based (10.7%):
  - Assign the first direct hypernym of the entity's first synset as its class.

Triple-based (89.3%):
  - Pattern extraction: {A, [be-verb + noun], B} -> the class of A is the noun.
    e.g. {Asiana Airlines, is an airline based in, Seoul}: the class of Asiana Airlines is "airline".
  - Meronym: {A, [part of / member of], B} -> the class of A is B.
    e.g. {Astronomy, is a part of, physics}: the class of Astronomy is "physics".
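The two triple-based patterns can be approximated with regular expressions, as in the sketch below. The regexes and the function name `triple_class` are assumptions for illustration; the slide does not say how the noun is located, so this version simply takes the first word after a be-verb and a determiner. The WordNet-based branch is not covered here.

```python
import re

# Assumed pattern: be-verb, then a determiner, then the noun used as the class.
BE_NOUN = re.compile(r"^(?:is|are|was|were)\s+(?:a|an|the)\s+([a-z]+)", re.IGNORECASE)
# Assumed meronym pattern: relation phrase ending in "part of" or "member of".
MERONYM = re.compile(r"\b(?:part|member)\s+of$", re.IGNORECASE)


def triple_class(triple):
    """Return (entity, class) inferred from one triple, or None if no pattern applies."""
    a, relation, b = triple
    relation = relation.strip()
    # Check the meronym pattern first, since "is a part of" also matches BE_NOUN.
    if MERONYM.search(relation):
        return a, b
    match = BE_NOUN.match(relation)
    if match:
        return a, match.group(1).lower()
    return None
```

Checking the meronym pattern before the be-verb pattern matters: otherwise "is a part of" would yield the class "part" instead of the object entity.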
2.3 Constructing Entity-Relation Graph
Directed graph G = (V, E)
  V(G) = the set of entities in the triple set T
  E(G) = the set of relational phrases in T
  Direction of Edge(e_1, e_2) = (e_1 -> e_2) for a triple t = (e_1, r, e_2) in T
  Edge Weight(e_1, e_2) = PMI(e_1, e_2)

\[
\mathrm{PMI}(e_1, e_2) = \log \frac{P(e_1, e_2)}{P(e_1)\, P(e_2)}
\]
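A small sketch of building such a PMI-weighted directed graph from triples follows. How the probabilities are estimated is not legible on the slide, so the estimates used here (the fraction of triples containing an entity, and the fraction of triples containing an ordered pair) are assumptions, as are the function name and the dictionary layout.

```python
import math
from collections import Counter, defaultdict


def build_graph(triples):
    """Return {(e1, e2): {"relations": set, "weight": PMI}} for a list of triples."""
    n = len(triples)
    entity_count = Counter()
    pair_count = Counter()
    relations = defaultdict(set)
    for e1, r, e2 in triples:
        entity_count[e1] += 1
        entity_count[e2] += 1
        pair_count[(e1, e2)] += 1      # direction follows the triple: e1 -> e2
        relations[(e1, e2)].add(r)

    graph = {}
    for (e1, e2), c in pair_count.items():
        # Assumed estimates: P(x) = count(x) / n, P(e1, e2) = count(e1, e2) / n.
        p_pair = c / n
        p_e1 = entity_count[e1] / n
        p_e2 = entity_count[e2] / n
        graph[(e1, e2)] = {
            "relations": relations[(e1, e2)],
            "weight": math.log(p_pair / (p_e1 * p_e2)),
        }
    return graph
```

For example, with the three triples (diabetes, treated with, insulin), (insulin, needed to create, energy), and (aspirin, treats, headache), the edge diabetes -> insulin gets weight log(1.5) under these estimates, because insulin occurs in two of the three triples.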
2.4 Finding Implicit Entity Pairs
Goal: extract and rank the entity pairs for which implicit relationships will be discovered.
Algorithm: the link prediction algorithm for social networks (Dong et al. 2013)
Intuition: the more similar the subgraph consisting only of two nodes A, B and their common neighbors is to the original graph, the more likely A and B are to be related.
Similarity formula: Similarity_CNGF(A, B) [formula not recoverable from the slide]
Output:
  - From a set of 300,000 triples, about 7.87 million implicit entity pairs were extracted.
  - e.g. (the us., united states: 61.88), (Europe, European union: 43.41)
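The intuition above can be illustrated with a common-neighbor score over unlinked pairs. Because the similarity formula on the slide did not survive extraction, the score below is an assumption, not the formula the authors used: each common neighbor z contributes its degree inside the subgraph {A, B, common neighbors} divided by the natural log of its degree in the full graph. Treating the graph as undirected and the function name are also assumptions.

```python
import math
from collections import defaultdict
from itertools import combinations


def rank_implicit_pairs(edges):
    """Rank unlinked node pairs that share at least one common neighbor."""
    neighbors = defaultdict(set)
    for a, b in edges:
        if a != b:                      # ignore self-loops
            neighbors[a].add(b)
            neighbors[b].add(a)

    # Candidate pairs: two neighbors of the same node that are not linked themselves.
    candidates = set()
    for node_neighbors in list(neighbors.values()):
        for a, b in combinations(sorted(node_neighbors), 2):
            if b not in neighbors[a]:
                candidates.add((a, b))

    scores = {}
    for a, b in candidates:
        common = neighbors[a] & neighbors[b]
        sub_nodes = common | {a, b}
        score = 0.0
        for z in common:
            deg_sub = len(neighbors[z] & sub_nodes)   # degree of z inside the subgraph
            deg_full = len(neighbors[z])              # >= 2, so the log is positive
            score += deg_sub / math.log(deg_full)     # assumed scoring, see lead-in
        scores[(a, b)] = score
    return sorted(scores.items(), key=lambda item: -item[1])
```

Under this assumed score, a common neighbor whose links stay mostly inside the subgraph counts for more than a hub whose links mostly point elsewhere.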
2.5 Discovering Implicit Relationships: Connected Components (CC) Matching
[Figure: connected components (CC) matching; labels: Connected Components (CC) Similarity, Relation Commonality]
3 EVALUATION
3.1 Evaluation Dataset: Characteristics of Dataset

Property             | First dataset                           | Second dataset
Experiment data size | 300,000 triples                         | 300,000 triples
Original data size   | 3,000,000 triples (Lin et al. 2012)     | 5,652,463 triples (2015.06)
Unique entities      | 983,410                                 | 2,824,974
Unique relations     | 540,620                                 | 36
Example              | {Ben Kingsley, was born in, Yorkshire}  | {Bruce Murray, plays_for, FC Luzern}
Source               | ClueWeb '09                             | Wikipedia, WordNet
Characteristics      | Entity disambiguation is needed;        | Entity expressions are partly disambiguated;
                     | diverse relations                       | limited set of relations
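3.2 Evaluation Method: Correctness Judgment by Search Engine

Evaluation process:
  1. Remove an arbitrary triple (e1, r, e2) from the experimental dataset.
  2. Use (e1, e2) as the query and infer an implicit relation r'.
  3. Judge the newly extracted triple (e1, r', e2): it is correct if r = r' or if (e1, r', e2) is true.
  4. Evaluate every triple in the experimental dataset in turn.

Judgment by search engine:
  - Query: search for the triple phrase with exact-match search, e.g. "seoul is a city of Korea".
  - Correct: the search returns two or more result documents.

Validation test (200 triples):
  - Human: 156 judged correct, 44 judged incorrect
  - Search engine: 141 judged correct, 59 judged incorrect
  - Correlation: 0.821
  - Every triple the search engine judged correct was also judged correct by humans.
  - Some triples the search engine judged incorrect were judged correct by humans.
  - Every triple humans judged incorrect was also judged incorrect by the search engine.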
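The leave-one-out procedure and the search-engine judgment can be sketched as below. The search engine and the inference method are passed in as callables (`count_exact_matches` and `infer_relations`), both hypothetical names, since the slide names neither a search API nor an interface for the method. Only the two-document threshold and the r = r' rule come from the slide.

```python
def judge_by_search(triple, count_exact_matches, min_docs=2):
    """Judge a triple correct if an exact-phrase search returns at least min_docs documents."""
    phrase = " ".join(triple)           # e.g. "seoul is a city of Korea"
    return count_exact_matches(phrase) >= min_docs


def leave_one_out_accuracy(triples, infer_relations, count_exact_matches):
    """Hold out each triple in turn, infer relations for its entity pair, and score them."""
    correct = 0
    total = 0
    for i, (e1, r, e2) in enumerate(triples):
        remaining = triples[:i] + triples[i + 1:]        # dataset without the held-out triple
        for r_new in infer_relations(remaining, e1, e2):
            total += 1
            # Correct if the held-out relation is recovered or the search engine confirms it.
            if r_new == r or judge_by_search((e1, r_new, e2), count_exact_matches):
                correct += 1
    return correct / total if total else 0.0
```

In a quick check, `count_exact_matches` can be a lookup into a fixed dictionary of phrase counts and `infer_relations` a stub that returns a fixed list of relation names.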
3.3 Comparison with Other Methods

SHERLOCK: 19,785 results from 7,368 patterns

Baseline #1: Basic Transitive Inference Rule
  - Rule: Entity A -> Entity B -> Entity C implies Entity A -> Entity C
  - Relation: all relations that contain "in" (containment relations)
  - Applied only when the second relation is more general than the first.
  - e.g. {A, is a city of, B} + {B, is located in, C} -> {A, is located in, C}

Baseline #2: Rank by Cosine Similarity Value
Baseline #3: Rank by Connected Components Similarity Value
Baseline #4: Rank by Relation Commonality Value
Baseline #5: Rank by Cohesion Value
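Baseline #1 can be sketched as a simple join over triples. Two points are assumptions here: only the second relation is required to contain the token "in" (the slide's own example has "is a city of" as the first relation), and the "more general" test is left as a caller-supplied predicate `is_more_general`, a hypothetical hook, because the slide does not say how generality is decided.

```python
from collections import defaultdict


def contains_in(relation):
    """True if the relation phrase contains the token 'in'."""
    return "in" in relation.lower().split()


def transitive_inference(triples, is_more_general=lambda r1, r2: True):
    """Infer (A, r2, C) from (A, r1, B) and (B, r2, C) when r2 is a containment relation."""
    known = set(triples)
    by_subject = defaultdict(list)
    for subject, relation, obj in triples:
        by_subject[subject].append((relation, obj))

    inferred = set()
    for a, r1, b in triples:
        for r2, c in by_subject.get(b, []):
            if a == c or not contains_in(r2):
                continue
            if not is_more_general(r1, r2):     # hypothetical generality check
                continue
            if (a, r2, c) not in known:
                inferred.add((a, r2, c))
    return inferred
```

With the default predicate, every chain whose second relation contains "in" fires, so this version over-generates compared with the rule as described on the slide.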
3.3 Comparison with Other Methods: Evaluation Result (OIE ReVerb Dataset)
[Figure: results for Transitive Rule, Sherlock, Cosine Sim, Proposed, Cohesion, CC Similarity, and Relation Commonality; axis label: # triples extracted]
  - The proposed method achieves the highest MAP.
  - SHERLOCK's performance drops because of insufficient entity class information.
  - The most effective factor is Cohesion.
3.5 Evaluation with Individual Relations: Evaluation on OIE ReVerb Dataset
Relations with only one or two instances have a very high rate of correct answers.
  - Relations with few instances appear relatively rarely and are often unlikely to be selected as an answer.
  - When such a relation is selected, however, it is backed by strong evidence, so it is more likely to be correct.
4 CONCLUSION
4.2 Conclusion and Future Work

Conclusion:
  - Defined a new problem: Open Knowledge Acquisition.
  - Proposed a new method that uses graph structure for knowledge acquisition.
  - The method can be applied to a variety of datasets.
  - It can be combined with other research for added benefit: Entity Resolution, Entity Linking, Relation Clustering.

Future Work:
  - Remove matching errors through entity linking and semantic class learning.
  - Improve recall through entity resolution and relation clustering.
  - Improve performance through better graph search and matching algorithms.