Information Retrieval: Part 2
sigma α
2015.11.15. / 2015.11.29. / 2015.12.23.
Information Retrieval (IR): Outline
- Issues
- Information Retrieval
- Boolean Retrieval
- The term vocabulary and postings lists
- Dictionaries and tolerant retrieval
- Scoring, term weighting
- The vector space model
- Evaluation in IR
- Relevance Feedback
- Probabilistic information retrieval
- Language models for IR
Part 2
Issues
Issues
- Shape-shifting clothing made from bacteria: the bioLogic project
  https://www.facebook.com/makeyourfutures/videos/601130050028273/
- Gest, a gesture controller
  https://www.facebook.com/makeyourfutures/videos/600819376726007/
- TensorFlow
  https://youtu.be/ozikw5k_2fm
  http://tensorflow.org/
The Vector Space Model
Binary incidence and count matrices (slide shows example term-document matrices; figures not reproduced)
Weight matrix (slide shows an example tf-idf weight matrix; figure not reproduced)
Documents as vectors
- Each document is represented as a real-valued vector of tf-idf weights in $\mathbb{R}^{|V|}$, where $V$ is the vocabulary.
- We have a $|V|$-dimensional real-valued vector space.
- Terms are the axes of this space; documents are points or vectors in it.
- The space is very high-dimensional, and each vector is very sparse.
- Queries can be represented as vectors in the same high-dimensional space.
- We can then rank documents according to their proximity to the query.
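As a concrete illustration of such weights, here is a minimal sketch of one common tf-idf variant (logarithmic tf, logarithmic idf); the scheme and numbers are illustrative assumptions, not taken from the slides:

```python
# A toy tf-idf weighting function (log-tf / log-idf variant).
import math

def tf_idf(tf, df, N):
    """w = (1 + log10 tf) * log10(N / df); 0 if the term is absent."""
    if tf == 0 or df == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(N / df)

# E.g., a term occurring 3 times in a doc, in 100 of 1,000,000 docs:
print(tf_idf(3, 100, 1_000_000))  # (1 + log10 3) * 4 ≈ 5.91
```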
How to use vector space?
- First idea: rank documents by Euclidean distance to the query.
- Problem: $q$ and $d_2$ may be very similar in content, but because their vector lengths differ, their Euclidean distance is large and they fail to match.
From angles to cosines
- Ranking documents by the angle between query and document is equivalent to ranking by cosine.
- Cosine is a monotonically decreasing function of the angle over $[0°, 180°]$.
Length normalization
- A vector can be length-normalized by dividing each component by its $L_2$ norm: $\|\vec x\|_2 = \sqrt{\sum_i x_i^2}$; after normalization, $\|\vec x\|_2 = 1.0$.
- This maps vectors onto the unit sphere.
- As a result, long and short documents have weights of the same order of magnitude, fixing the earlier Euclidean-distance problem.
→ Cosine similarity
Cosine similarity

$$\cos(\vec q, \vec d) = \frac{\vec q \cdot \vec d}{\|\vec q\|\,\|\vec d\|} = \frac{\sum_{i=1}^{|V|} q_i d_i}{\sqrt{\sum_{i=1}^{|V|} q_i^2}\,\sqrt{\sum_{i=1}^{|V|} d_i^2}}$$

- $q_i$: tf-idf weight of term $i$ in the query
- $d_i$: tf-idf weight of term $i$ in the document
- $\|\vec q\|$ and $\|\vec d\|$: the lengths of $\vec q$ and $\vec d$
- For length-normalized vectors, the cosine is equivalent to the dot (scalar) product: $\cos(\vec q, \vec d) = \vec q \cdot \vec d = \sum_i q_i d_i$
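A minimal sketch of length normalization and cosine similarity over sparse vectors stored as {term: weight} dicts; the vectors below are toy data, not from the slides:

```python
import math

def l2_norm(vec):
    """Euclidean (L2) length of a sparse vector."""
    return math.sqrt(sum(w * w for w in vec.values()))

def normalize(vec):
    """Map vec onto the unit sphere by dividing by its L2 norm."""
    n = l2_norm(vec)
    return {t: w / n for t, w in vec.items()} if n else vec

def cosine(q, d):
    """cos(q, d) = (q . d) / (|q| |d|)."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    denom = l2_norm(q) * l2_norm(d)
    return dot / denom if denom else 0.0

q = {"jealous": 0.7, "gossip": 0.5}
d = {"jealous": 0.3, "gossip": 0.2, "wuthering": 0.9}
qn, dn = normalize(q), normalize(d)
print(cosine(q, d))                                  # full formula
print(sum(w * dn.get(t, 0.0) for t, w in qn.items()))  # dot product of unit vectors, same value
```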
Cosine scoring algorithm (slide shows the standard pseudocode; not reproduced)
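A sketch of the usual term-at-a-time scoring loop over an inverted index; the index layout (term → postings of (doc_id, weight) pairs) and the toy data are assumptions:

```python
import heapq

def cosine_score(index, doc_length, query_weights, k=10):
    """Accumulate w_tq * w_td per document, normalize by document
    length, and return the top-k (doc_id, score) pairs."""
    scores = {}
    for term, w_tq in query_weights.items():
        for doc_id, w_td in index.get(term, []):
            scores[doc_id] = scores.get(doc_id, 0.0) + w_tq * w_td
    for doc_id in scores:
        scores[doc_id] /= doc_length[doc_id]
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])

index = {"brutus": [(1, 1.2), (2, 0.4)], "caesar": [(1, 0.8)]}
doc_length = {1: 2.0, 2: 1.0}
print(cosine_score(index, doc_length, {"brutus": 1.0, "caesar": 1.0}))
```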
Components of tf-idf weighting (slide shows a table of term frequency, document frequency, and normalization variants; not reproduced)
Evaluation in IR
Unranked evaluation
Measuring the effectiveness of an IR system requires a test collection consisting of three things:
- a document collection
- a test suite of information needs, expressible as queries
- a set of relevance judgments: a binary assessment of relevant / non-relevant for each query-document pair
Precision and recall
- Precision ($P$) is the fraction of retrieved documents that are relevant:

$$\text{Precision} = \frac{\#(\text{relevant items retrieved})}{\#(\text{retrieved items})} = P(\text{relevant} \mid \text{retrieved})$$

- Recall ($R$) is the fraction of relevant documents that are retrieved:

$$\text{Recall} = \frac{\#(\text{relevant items retrieved})}{\#(\text{relevant items})} = P(\text{retrieved} \mid \text{relevant})$$
Precision and recall

|               | Relevant             | Non-relevant         |
|---------------|----------------------|----------------------|
| Retrieved     | true positives (TP)  | false positives (FP) |
| Not retrieved | false negatives (FN) | true negatives (TN)  |

$P = TP/(TP + FP)$
$R = TP/(TP + FN)$

A common overall measure alongside precision and recall: accuracy

$\text{accuracy} = (TP + TN)/(TP + FP + FN + TN)$
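The same measures as functions of the contingency-table counts; the TP/FP/FN/TN numbers below are made up for illustration:

```python
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

tp, fp, fn, tn = 20, 40, 60, 1_000_000
print(precision(tp, fp), recall(tp, fn), accuracy(tp, fp, fn, tn))
# Accuracy is ~1.0 even though P and R are poor: in IR almost every
# document is non-relevant, so accuracy alone is not informative.
```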
Precision / recall tradeoff
- Precision and recall trade off against each other.
- A casual user searching for "refactoring" wants precise results on the first page (high precision).
- A stock analyst searching for stock-related documents wants to examine every relevant document (high recall).
- Recall is a non-decreasing function of the number of documents retrieved: retrieving all documents for every query always yields 100% recall.
- An average that balances the two: the harmonic mean.
A combined measure: F
A way to handle the precision/recall tradeoff: the F measure

$$F = \frac{1}{\alpha\frac{1}{P} + (1-\alpha)\frac{1}{R}} = \frac{(\beta^2 + 1)PR}{\beta^2 P + R}, \quad \text{where } \beta^2 = \frac{1-\alpha}{\alpha}$$

- $\alpha \in [0, 1]$ and thus $\beta^2 \in [0, \infty)$
- Most frequently used: balanced $F$ with $\beta = 1$ (i.e., $\alpha = 0.5$)
- This is the harmonic mean of $P$ and $R$: $\frac{1}{F} = \frac{1}{2}\left(\frac{1}{P} + \frac{1}{R}\right)$, i.e., $F_1 = \frac{2PR}{P + R}$
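The general F measure as a one-line function; the sample P/R values are arbitrary:

```python
def f_measure(p, r, beta=1.0):
    """F = (beta^2 + 1) P R / (beta^2 P + R); beta = 1 gives the
    balanced F1, the harmonic mean of precision and recall."""
    if p == 0.0 or r == 0.0:
        return 0.0
    b2 = beta * beta
    return (b2 + 1) * p * r / (b2 * p + r)

print(f_measure(0.5, 0.4))             # balanced F1 = 4/9 ≈ 0.444
print(f_measure(0.5, 0.4, beta=2.0))   # recall-weighted F2 ≈ 0.417
```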
F: Example (slide shows a worked precision/recall/F example; not reproduced)
Why harmonic mean?
- With the arithmetic mean, returning everything gives 100% recall and hence an average of at least 50% — far too high to be an accurate measure.
- The harmonic mean is always less than or equal to the arithmetic and geometric means.
- Example: suppose only 1 of 10,000 documents is relevant to a query and the system retrieves all 10,000. Then Recall = 100%, Precision = 0.01%, and F1 ≈ 0.02%.
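Working the numbers from the example above, with $P = 1/10{,}000 = 0.0001$ and $R = 1$:

$$F_1 = \frac{2PR}{P + R} = \frac{2 \cdot 0.0001 \cdot 1}{0.0001 + 1} \approx 0.0002 = 0.02\%, \qquad \text{whereas the arithmetic mean } \frac{P + R}{2} \approx 50\%.$$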
F1 and other averages (slide shows a plot comparing the harmonic mean with arithmetic and geometric means; not reproduced)
Probabilistic IR
Basic Probability Theory
For events $A$ and $B$:
- $P(A, B)$: joint probability — the probability of both events occurring together
- $P(A \mid B)$: conditional probability — the probability of $A$ occurring given that $B$ has occurred
- Chain rule: computes the joint probability of a set of random variables from conditional probabilities:

$$P(A, B) = P(A \cap B) = P(A \mid B)\,P(B) = P(B \mid A)\,P(A)$$

- Partition rule: if $B$ is split according to whether $A$ holds, $P(B)$ is the sum over the parts:

$$P(B) = P(A, B) + P(\bar A, B), \quad \text{where } P(\bar A) = 1 - P(A)$$
Basic Probability Theory
- Bayes' rule:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} = \frac{P(B \mid A)}{\sum_{X \in \{A, \bar A\}} P(B \mid X)\,P(X)}\,P(A)$$

- $P(A)$: prior probability; $P(A \mid B)$: posterior probability
- Odds: how many times larger $P(A)$ is than $1 - P(A)$, i.e., a multiplier for how a probability changes:

$$O(A) = \frac{P(A)}{P(\bar A)} = \frac{P(A)}{1 - P(A)}$$
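A toy numeric check of the partition rule, Bayes' rule, and odds; all the probabilities below are made-up numbers:

```python
p_a = 0.3                # prior P(A)
p_b_given_a = 0.8        # P(B | A)
p_b_given_not_a = 0.2    # P(B | ~A)

p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)  # partition rule
p_a_given_b = p_b_given_a * p_a / p_b                  # Bayes' rule (posterior)
odds_a = p_a / (1 - p_a)                               # O(A)

print(p_b, p_a_given_b, odds_a)  # 0.38, ~0.632, ~0.429
```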
Probability Ranking Principle (PRP)
- Ranked retrieval setup: given a document collection and a user query, return a ranked list of documents.
- Assume a binary notion of relevance: $R_{d,q}$ is a random binary variable, with $R_{d,q} = 1$ if document $d$ is relevant w.r.t. query $q$, and $R_{d,q} = 0$ otherwise.
- Probabilistic ranking orders documents by decreasing estimated probability of relevance: $P(R = 1 \mid d, q)$, where $d$ ranges over the documents for query $q$.
- PRP in brief: if the documents retrieved for a query are ranked by probability of relevance, the effectiveness of the system is the best obtainable.
Probability Ranking Principle (PRP)
- 0/1 loss: the PRP ranks under 0/1 loss, a way of measuring correctness in a binary setting. You lose a point for either:
  - retrieving one non-relevant document, or
  - failing to retrieve one relevant document.
- Bayes optimal decision rule — retrieve a document iff it is more likely relevant than not:
  $d$ is relevant iff $P(R = 1 \mid d, q) > P(R = 0 \mid d, q)$
- The PRP is optimal in that it minimizes the expected loss under 0/1 loss (Ripley, 1996).
Binary Independence Model (BIM)
- Traditionally used with the PRP.
- "Binary" (equivalent to Boolean): documents and queries are represented as binary term incidence vectors, as in the sketch below.
  - E.g., document $d$ is represented by the vector $\vec x = (x_1, \dots, x_M)$, where $x_t = 1$ if term $t$ occurs in $d$ and $x_t = 0$ otherwise.
  - Many documents end up with the same vector representation.
- "Independence": no association between terms (not true, but it works in practice — the naïve assumption of Naïve Bayes models).
  - In a vector space view, each term corresponds to its own dimension, orthogonal to all other terms.
  - Associations between terms are not used.
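A toy conversion of documents into binary term incidence vectors; the vocabulary and documents are made up:

```python
vocab = ["antony", "brutus", "caesar", "calpurnia"]

def incidence_vector(doc_tokens):
    """x_t = 1 if term t occurs in the document, else 0."""
    present = set(doc_tokens)
    return [1 if t in present else 0 for t in vocab]

print(incidence_vector("caesar was ambitious".split()))  # [0, 0, 1, 0]
print(incidence_vector("brutus killed caesar".split()))  # [0, 1, 1, 0]
```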
Binary Independence Model (BIM)
- Probability model: to use a probability model, we must estimate how the terms appearing in a document contribute to its relevance.
- Factors to consider:
  - term frequency, document frequency, document length, etc.
  - how to combine these optimally to compute the probability that a document is relevant.
Binary Independence Model (BIM)
The BIM models $P(R \mid d, q)$ via term incidence vectors as $P(R \mid \vec x, q)$, using Bayes' rule:

$$P(R = 1 \mid \vec x, q) = \frac{P(\vec x \mid R = 1, q)\,P(R = 1 \mid q)}{P(\vec x \mid q)}$$

$$P(R = 0 \mid \vec x, q) = \frac{P(\vec x \mid R = 0, q)\,P(R = 0 \mid q)}{P(\vec x \mid q)}$$

- $P(\vec x \mid R = 1, q)$ and $P(\vec x \mid R = 0, q)$: the probability that, if a relevant (resp. non-relevant) document is retrieved, that document's representation is $\vec x$.
Binary Independence Model (BIM)
Continuing with the same equations for $P(R \mid \vec x, q)$:
- $P(R = 1 \mid q)$ and $P(R = 0 \mid q)$: the prior probability of retrieving a relevant (resp. non-relevant) document for a query $q$.
- Estimate $P(R = 1 \mid q)$ and $P(R = 0 \mid q)$ from the percentage of relevant documents in the collection.
- Since a document is either relevant or non-relevant to a query, we must have:

$$P(R = 1 \mid q) + P(R = 0 \mid q) = 1$$
Deriving a Ranking Function for Query Terms
- Probability Ranking Principle (PRP): given a query $q$, rank documents by $P(R = 1 \mid d, q)$ in descending order.
- In the BIM, ranking uses $P(R \mid \vec x, q)$.
- Easier: since we only care about the ranking, we can rank by the odds of relevance instead:

$$O(R \mid \vec x, q) = \frac{P(R = 1 \mid \vec x, q)}{P(R = 0 \mid \vec x, q)} = \frac{\dfrac{P(\vec x \mid R = 1, q)\,P(R = 1 \mid q)}{P(\vec x \mid q)}}{\dfrac{P(\vec x \mid R = 0, q)\,P(R = 0 \mid q)}{P(\vec x \mid q)}} = \frac{P(R = 1 \mid q)}{P(R = 0 \mid q)} \cdot \frac{P(\vec x \mid R = 1, q)}{P(\vec x \mid R = 0, q)}$$

- The first factor is a constant for a given query: since we only care about ranking, it can be ignored.
Deriving a Ranking Function for Query Terms 2
- Naïve Bayes conditional independence assumption: given the query, the presence (or absence) of any term does not influence the presence (or absence) of any other term:

$$\frac{P(\vec x \mid R = 1, q)}{P(\vec x \mid R = 0, q)} = \prod_{t=1}^{M} \frac{P(x_t \mid R = 1, q)}{P(x_t \mid R = 0, q)}$$

$$O(R \mid \vec x, q) = O(R \mid q) \cdot \prod_{t=1}^{M} \frac{P(x_t \mid R = 1, q)}{P(x_t \mid R = 0, q)} = O(R \mid q) \cdot \prod_{t:\, x_t = 1} \frac{P(x_t = 1 \mid R = 1, q)}{P(x_t = 1 \mid R = 0, q)} \cdot \prod_{t:\, x_t = 0} \frac{P(x_t = 0 \mid R = 1, q)}{P(x_t = 0 \mid R = 0, q)}$$

- Since each $x_t$ is either 0 or 1, the product splits as above.
Deriving a Ranking Function for Query Terms 3
Odds of relevance:

$$O(R \mid \vec x, q) = O(R \mid q) \cdot \prod_{t:\, x_t = 1} \frac{P(x_t = 1 \mid R = 1, q)}{P(x_t = 1 \mid R = 0, q)} \cdot \prod_{t:\, x_t = 0} \frac{P(x_t = 0 \mid R = 1, q)}{P(x_t = 0 \mid R = 0, q)}$$

- $P(x_t = 1 \mid R = 1, q)$: the probability that a term occurs in a document relevant to the query
- $P(x_t = 1 \mid R = 0, q)$: the probability that the term occurs in a non-relevant document

Written as a contingency table, with $p_t = P(x_t = 1 \mid R = 1, q)$ and $u_t = P(x_t = 1 \mid R = 0, q)$:

| document | relevant ($R = 1$) | non-relevant ($R = 0$) |
|---|---|---|
| term present ($x_t = 1$) | $p_t$ | $u_t$ |
| term absent ($x_t = 0$) | $1 - p_t$ | $1 - u_t$ |
Deriving a Ranking Function for Query Terms 4
- Additional assumption from the contingency table: terms not occurring in the query are equally likely to occur in relevant and non-relevant documents: if $q_t = 0$, then $p_t = u_t$.
- Under this assumption:

$$O(R \mid \vec x, q) = O(R \mid q) \cdot \prod_{t:\, x_t = q_t = 1} \frac{p_t}{u_t} \cdot \prod_{t:\, x_t = 0,\, q_t = 1} \frac{1 - p_t}{1 - u_t}$$

- Left product: over the query terms found in the document
- Right product: over the query terms not found in the document
Deriving a Ranking Function for Query Terms 5
- Include the query terms found in the document in the right product (making it run over all query terms), and divide the left product by the same factors:

$$O(R \mid \vec x, q) = O(R \mid q) \cdot \prod_{t:\, x_t = q_t = 1} \frac{p_t (1 - u_t)}{u_t (1 - p_t)} \cdot \prod_{t:\, q_t = 1} \frac{1 - p_t}{1 - u_t}$$

- Left product: over the query terms found in the document
- Right product: over all query terms — a constant for a given query, so it is ignored for ranking, as is $O(R \mid q)$
- The only quantity that matters for ranking is the left product; its logarithm is the Retrieval Status Value (RSV):

$$RSV_d = \log \prod_{t:\, x_t = q_t = 1} \frac{p_t (1 - u_t)}{u_t (1 - p_t)} = \sum_{t:\, x_t = q_t = 1} \log \frac{p_t (1 - u_t)}{u_t (1 - p_t)}$$
Deriving a Ranking Function for Query Terms 6
- (Log) odds ratio:

$$c_t = \log \frac{p_t (1 - u_t)}{u_t (1 - p_t)} = \log \frac{p_t}{1 - p_t} + \log \frac{1 - u_t}{u_t}$$

- $p_t / (1 - p_t)$: the odds of the term appearing in a relevant document
- $u_t / (1 - u_t)$: the odds of the term appearing in a non-relevant document
- $c_t = 0$: the term has equal odds of appearing in relevant and non-relevant documents
- $c_t > 0$: the term is more likely to appear in relevant documents
- The $c_t$ values function as term weights, and the score of document $d$ for a given query is $RSV_d = \sum_{x_t = q_t = 1} c_t$.
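A sketch of BIM scoring built directly from the formulas above: $c_t$ from given estimates of $p_t$ and $u_t$, and $RSV_d$ as the sum over query terms present in the document. The $p_t$ / $u_t$ values are made-up estimates, not derived from a real collection:

```python
import math

def c_t(p, u):
    """Log odds ratio: log[ p(1-u) / (u(1-p)) ]."""
    return math.log(p * (1 - u) / (u * (1 - p)))

def rsv(doc_terms, query_terms, p, u):
    """Sum c_t over terms occurring in both the query and the document."""
    return sum(c_t(p[t], u[t]) for t in query_terms & doc_terms)

p = {"okapi": 0.6, "model": 0.5}   # p_t = P(x_t = 1 | R = 1, q), assumed
u = {"okapi": 0.1, "model": 0.3}   # u_t = P(x_t = 1 | R = 0, q), assumed
print(rsv({"okapi", "giraffe"}, {"okapi", "model"}, p, u))  # c_okapi ≈ 2.60
```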
Okapi BM25: A non-binary model
- The BIM was originally designed for short, fixed-length documents such as catalog records and abstracts.
- For full-text collections, term frequency and document length should also be taken into account.
- BestMatch25 (a.k.a. BM25 or Okapi):
  - a probabilistic model that is sensitive to term frequency and document length without introducing too many additional parameters
  - has performed well since its introduction in 1994
  - (these days, neural approaches are the state of the art!)
Okapi BM25: A non-binary model 2
- The simplest way to score a document $d$ is to sum the idf values of the query terms appearing in it:

$$RSV_d = \sum_{t \in q} \log \frac{N}{df_t}$$

- Factoring term frequency and document length into the formula above:

$$RSV_d = \sum_{t \in q} \log \frac{N}{df_t} \cdot \frac{(k_1 + 1)\, tf_{td}}{k_1\big((1 - b) + b \cdot (L_d / L_{ave})\big) + tf_{td}}$$

- $tf_{td}$: frequency of term $t$ in document $d$
- $L_d$: length of document $d$; $L_{ave}$: average document length over the whole collection
- $k_1$: tuning parameter controlling the term-frequency scaling — $k_1 = 0$ amounts to ignoring term frequency (a binary model), while a large $k_1$ corresponds to using raw term frequency
- $b$: tuning parameter controlling the document-length normalization
Okapi BM25: A non-binary model 3
- For long queries, a similar weighting can be applied to the query terms:

$$RSV_d = \sum_{t \in q} \log \frac{N}{df_t} \cdot \frac{(k_1 + 1)\, tf_{td}}{k_1\big((1 - b) + b \cdot (L_d / L_{ave})\big) + tf_{td}} \cdot \frac{(k_3 + 1)\, tf_{tq}}{k_3 + tf_{tq}}$$

- $tf_{tq}$: frequency of term $t$ in query $q$
- $k_3$: tuning parameter controlling the query term-frequency scaling
- No query-length normalization: retrieval is performed with respect to a single query.
- The parameters can be optimally tuned on a development test collection.
- In the absence of tuning, values of $k_1$ and $k_3$ between 1.2 and 2, and $b = 0.75$, are reasonable.
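A sketch of the full BM25 formula above, including the $k_3$ query-term factor; the collection statistics and default parameter values below are assumptions for illustration:

```python
import math

def bm25(query_tf, doc_tf, df, N, L_d, L_ave, k1=1.5, b=0.75, k3=1.5):
    """query_tf / doc_tf map terms to frequencies; df maps terms to
    document frequencies; N is the number of documents."""
    score = 0.0
    for t, tf_tq in query_tf.items():
        tf_td = doc_tf.get(t, 0)
        if tf_td == 0 or t not in df:
            continue  # term absent from the document: contributes 0
        idf = math.log(N / df[t])
        tf_part = ((k1 + 1) * tf_td) / (k1 * ((1 - b) + b * L_d / L_ave) + tf_td)
        q_part = ((k3 + 1) * tf_tq) / (k3 + tf_tq)
        score += idf * tf_part * q_part
    return score

doc_tf = {"okapi": 3, "bm25": 1}
print(bm25({"okapi": 1, "bm25": 2}, doc_tf, {"okapi": 50, "bm25": 8},
           N=10_000, L_d=120, L_ave=100))
```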
References
- http://nlp.stanford.edu/ir-book/
- http://cs.kangwon.ac.kr/~leeck/ir/
QA
Thank you.
박천음, 박찬민, 최재혁, 홍다솔
sigma α, Kangwon National University
Email: parkce3@gmail.ac.kr