Data Mining Association Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 6
Contents: Rule Generation
Rule Generation from Frequent Itemsets
Given a frequent itemset L, find all non-empty subsets f ⊂ L such that the rule f → L − f satisfies the minimum confidence requirement.
If {A,B,C,D} is a frequent itemset, the candidate rules are:
ABC → D, ABD → C, ACD → B, BCD → A, A → BCD, B → ACD, C → ABD, D → ABC, AB → CD, AC → BD, AD → BC, BC → AD, BD → AC, CD → AB
If |L| = k, then there are 2^k − 2 candidate association rules (ignoring L → ∅ and ∅ → L); see the sketch below.
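As a quick illustration (a sketch, not part of the original notes; the function name candidate_rules is invented here), the enumeration can be written in a few lines of Python:

```python
# A minimal sketch: enumerate all 2^k - 2 candidate rules of a frequent
# k-itemset L by taking each non-empty proper subset as the rule's LHS.
from itertools import combinations

def candidate_rules(itemset):
    """Yield (lhs, rhs) pairs for every rule f -> L - f with {} != f != L."""
    items = sorted(itemset)
    for r in range(1, len(items)):        # r = 0 and r = k would give {} -> L and L -> {}
        for lhs in combinations(items, r):
            rhs = tuple(i for i in items if i not in lhs)
            yield lhs, rhs

rules = list(candidate_rules({'A', 'B', 'C', 'D'}))
print(len(rules))  # 14 = 2**4 - 2
```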
Rule Generation from Frequent Itemsets
How to efficiently generate rules from frequent itemsets?
Confidence does not have the anti-monotone property in general, so the Apriori property is hard to apply directly: c(ABC → D) can be larger or smaller than c(AB → D).
However, for rules generated from the same itemset, an anti-monotone property does hold. That is, confidence is anti-monotone w.r.t. the number of items on the RHS of the rule, or monotone w.r.t. the LHS of the rule.
E.g., for L = {A,B,C,D}: c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD), since
c(ABC → D) = σ({A,B,C,D}) / σ({A,B,C})
c(AB → CD) = σ({A,B,C,D}) / σ({A,B})
c(A → BCD) = σ({A,B,C,D}) / σ({A})
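A small numeric sketch (with made-up support counts, not data from the notes) of why this ordering always holds: the three confidences share the numerator σ({A,B,C,D}), and by the anti-monotone property of support the denominators satisfy σ({A}) ≥ σ({A,B}) ≥ σ({A,B,C}).

```python
# Hypothetical support counts, consistent with support's anti-monotone
# property: sigma(A) >= sigma(AB) >= sigma(ABC) >= sigma(ABCD).
sigma = {
    frozenset('A'): 50,
    frozenset('AB'): 30,
    frozenset('ABC'): 20,
    frozenset('ABCD'): 10,
}

def conf(lhs, full='ABCD'):
    """Confidence of the rule lhs -> (full - lhs)."""
    return sigma[frozenset(full)] / sigma[frozenset(lhs)]

print(conf('ABC'), conf('AB'), conf('A'))  # 0.5 >= 0.333... >= 0.2
```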
Rule Generation for Apriori Algorithm
[Figure: lattice of rules generated from {A,B,C,D}, from ABCD ⇒ {} at the top down to D ⇒ ABC, C ⇒ ABD, B ⇒ ACD, A ⇒ BCD at the bottom. If a rule in the lattice has low confidence, all rules below it (obtained by moving items from its LHS to its RHS) are pruned.]
Rule Generation for Apriori Algorithm
A candidate rule is generated by merging two rules that share the same prefix in the rule consequent: join(CD → AB, BD → AC) produces the candidate rule D → ABC (the two consequents share the prefix A).
Prune the rule D → ABC if there exists a rule whose consequent is a subset of ABC (such as AD → BC) that does not have high confidence.
Essentially, we are doing Apriori on the RHS; a sketch of the merge step follows below.
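A minimal sketch of this merge step (the helper name merge_consequents is invented; this is Apriori-style candidate generation applied to rule consequents, not the lecture's code):

```python
from itertools import combinations

def merge_consequents(consequents):
    """Join sorted m-item consequents that share their first m-1 items,
    producing candidate (m+1)-item consequents."""
    merged = set()
    for c1, c2 in combinations(sorted(consequents), 2):
        if c1[:-1] == c2[:-1]:          # same prefix, e.g. ('A','B') and ('A','C')
            merged.add(c1 + c2[-1:])    # -> ('A','B','C')
    return merged

# Consequents of CD -> AB and BD -> AC; the merged consequent ABC yields
# the candidate rule D -> ABC (antecedent = ABCD minus the consequent).
print(merge_consequents({('A', 'B'), ('A', 'C')}))  # {('A', 'B', 'C')}
```

The candidate D → ABC would then be dropped if a rule with a smaller consequent, such as AD → BC, had already failed the confidence threshold, mirroring subset-based pruning in Apriori.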
Contents: Maximal Itemsets / Closed Itemsets
Maximal Frequent Itemset
An itemset is maximal frequent if none of its immediate supersets is frequent. That is, a maximal frequent itemset is a frequent itemset that is not contained in any other frequent itemset.
How to find them:
- First find the frequent itemsets on the border between the infrequent and frequent itemsets.
- Find all of their immediate supersets.
- If all immediate supersets are infrequent, the itemset is maximal frequent.
Example: items a, b, c, d, e; {a, b, c} is a frequent itemset; {a, b, c, d}, {a, b, c, e}, and {a, b, c, d, e} are not frequent; so {a, b, c} is a maximal frequent itemset. (A membership check is sketched below.)
Maximal frequent itemsets are useful as a compact representation when frequent itemsets are very long. Short itemsets generally yield rules of little interest, whereas long itemsets can often generate surprising association rules.
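A minimal membership check, assuming the full collection of frequent itemsets is already available (names invented for this sketch):

```python
def is_maximal(itemset, frequent, all_items):
    """True iff the itemset is frequent and no immediate superset
    (the itemset plus one item) is frequent."""
    s = frozenset(itemset)
    return s in frequent and all(s | {i} not in frequent for i in all_items - s)

items = set('abcde')
# Frequent itemsets from the slide's example: {a,b,c} and all its subsets.
frequent = {frozenset(x) for x in ('a', 'b', 'c', 'ab', 'ac', 'bc', 'abc')}
print(is_maximal('abc', frequent, items))  # True: abcd, abce, ... are infrequent
print(is_maximal('ab', frequent, items))   # False: its superset abc is frequent
```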
Maximal Frequent Itemset
Example 1 of finding maximal frequent itemsets:
- First, identify the frequent itemsets on the border between the infrequent and frequent itemsets: d, bc, ad, abc.
- Find the immediate supersets of each of these itemsets.
- The supersets of d are ad, bd, and cd; ad is frequent, so d is not maximal.
- bc has abc and bcd as supersets; abc is frequent, so bc is not maximal.
- All supersets of ad and abc are infrequent, so ad and abc are both maximal.
Maximal Frequent Itemset
Example 2 of finding maximal frequent itemsets:
[Figure: the itemset lattice over items A-E, with the border separating frequent from infrequent itemsets; the maximal frequent itemsets are the frequent itemsets lying immediately inside the border.]
Closed Itemset
An itemset is closed if none of its immediate supersets has the same support as the itemset itself. That is, a closed itemset is as large as it can possibly be without losing any transactions. A closed itemset that is frequent is a closed frequent itemset.
How to find closed frequent itemsets:
- First find all frequent itemsets.
- Then, for each frequent itemset, if some superset has the same support as the original itemset, the original itemset is not closed; otherwise it is a closed itemset.
(A membership check is sketched below.)
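A minimal closedness check (a sketch, not lecture code), computing supports directly from a toy transaction list taken from the maximal-vs-closed example a few slides below:

```python
def support(itemset, transactions):
    """Number of transactions containing every item of the itemset."""
    s = set(itemset)
    return sum(1 for t in transactions if s <= set(t))

def is_closed(itemset, transactions, all_items):
    """True iff no immediate superset has the same support."""
    s = set(itemset)
    sup = support(s, transactions)
    return all(support(s | {i}, transactions) < sup for i in all_items - s)

transactions = ['ABC', 'ABCD', 'BCE', 'ACDE', 'DE']  # TIDs 1-5
print(is_closed('C', transactions, set('ABCDE')))   # True: support 4, all supersets smaller
print(is_closed('AD', transactions, set('ABCDE')))  # False: ACD has the same support (2)
```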
Closed Itemset
Example of finding closed frequent itemsets:
- c is a closed frequent itemset: its supersets ac, bc, and cd all have support smaller than c's support of 3.
- ad is a frequent itemset but has the same support as its superset abd, so it is not closed.
- In this example there are 9 frequent itemsets in total, of which 4 are closed frequent itemsets.
[Figure: blue circles are frequent itemsets; thick blue circles are closed frequent itemsets (a closed itemset must not have a superset with the same support); yellow-shaded circles are maximal frequent itemsets.]
Maximal vs Closed Itemsets

TID  Items
1    ABC
2    ABCD
3    BCE
4    ACDE
5    DE

[Figure: the itemset lattice over A-E, with each itemset annotated by the IDs of the transactions that contain it; e.g., AC is supported by transactions 1, 2, and 4. ABCDE is not supported by any transaction.]
Maximal vs Closed Frequent Itemsets
Minimum support = 2
[Figure: the same lattice with closed and maximal frequent itemsets marked; some itemsets are closed but not maximal, others are both closed and maximal.]
# Closed = 9, # Maximal = 4
Maximal vs Closed Itemsets
It is important to point out the relationship between frequent itemsets, closed frequent itemsets, and maximal frequent itemsets. Closed and maximal frequent itemsets are both subsets of the frequent itemsets, and the maximal frequent itemsets are in turn a subset of the closed frequent itemsets, making them the most compact representation: Maximal Frequent Itemsets ⊆ Closed Frequent Itemsets ⊆ Frequent Itemsets. (The sketch below computes all three families on the transaction table above.)
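As a cross-check (a brute-force sketch, not lecture code), the three families can be computed on the five-transaction table above with minimum support 2; the counts match the slide's # Closed = 9 and # Maximal = 4:

```python
from itertools import combinations

transactions = [set('ABC'), set('ABCD'), set('BCE'), set('ACDE'), set('DE')]
items = set('ABCDE')
minsup = 2

def support(s):
    return sum(1 for t in transactions if s <= t)

frequent = {frozenset(c)
            for r in range(1, len(items) + 1)
            for c in combinations(items, r)
            if support(set(c)) >= minsup}
closed = {s for s in frequent
          if all(support(s | {i}) < support(s) for i in items - s)}
maximal = {s for s in frequent
           if all(s | {i} not in frequent for i in items - s)}

print(len(frequent), len(closed), len(maximal))  # 14 9 4
assert maximal <= closed <= frequent             # the subset chain from this slide
```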
Contents: Evaluation of Association Patterns
Pattern Evaluation
Association rule generation algorithms tend to produce too many rules, and many of the generated rules are not useful (uninteresting or redundant). For example, if {A,B,C} → {D} and {A,B} → {D} have the same support and confidence, the two rules are redundant.
Interestingness measures are used to prune or rank the derived rules. Support and confidence are themselves interestingness measures.
Interestingness Measures: When They Are Applied
Computing Interestingness Measures
Given a rule X → Y, various interestingness measures can be computed from the following contingency table:

Contingency table for X → Y
         Y     ¬Y
  X     f11   f10   f1+
 ¬X     f01   f00   f0+
        f+1   f+0    N

(¬X means the items of X are absent from the transaction.) Each f_ij is a support, i.e., a frequency count; f1+ is the support count of X.
f11: support count of X and Y
f10: support count of X and ¬Y
f01: support count of ¬X and Y
f00: support count of ¬X and ¬Y
These counts are used to define various measures: support, confidence, lift, Gini, J-measure, etc. (a helper computing the first three is sketched below).
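A small helper (a sketch; the function name measures is invented) deriving support, confidence, and lift for X → Y from the four counts:

```python
def measures(f11, f10, f01, f00):
    """Support, confidence, and lift of X -> Y from contingency counts."""
    n = f11 + f10 + f01 + f00
    support = f11 / n                       # s(X ∪ Y)
    confidence = f11 / (f11 + f10)          # s(X ∪ Y) / s(X)
    lift = confidence / ((f11 + f01) / n)   # c(X -> Y) / s(Y)
    return support, confidence, lift

# Tea/coffee table on the next slide: f11=15, f10=5, f01=65, f00=15.
print(measures(15, 5, 65, 15))  # (0.15, 0.75, 0.9375)
```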
Drawback of Confidence

         Coffee  ¬Coffee
  Tea      15       5      20
 ¬Tea      65      15      80
           80      20     100

s(X → Y) = σ(X ∪ Y) / N
c(X → Y) = σ(X ∪ Y) / σ(X)

Association rule: Tea → Coffee
Support(Tea → Coffee) = 15/100 = 15%
Confidence(Tea → Coffee) = σ(Tea ∪ Coffee) / σ(Tea) = 15/20 = 75%
- Looking at this confidence, one might conclude that people who drink tea tend to drink coffee.
- However, the data show that 80% of all people drink coffee, whether or not they drink tea.
- In other words, despite its high confidence of 75%, the rule Tea → Coffee is not very meaningful: knowing that a person drinks tea tells us little about whether they drink coffee.
Statistical Independence
Population of 1000 students:
- 600 students know how to swim (S)
- 700 students know how to bike (B)
- 420 students know how to swim and bike (S, B)
P(S ∧ B) = 420/1000 = 0.42
P(S) × P(B) = 0.6 × 0.7 = 0.42
P(S ∧ B) = P(S) × P(B) ⇒ statistical independence
P(S ∧ B) > P(S) × P(B) ⇒ positively correlated
P(S ∧ B) < P(S) × P(B) ⇒ negatively correlated
Statistical-based Measures
Measures that take statistical dependence into account, e.g.:
Lift = c(X → Y) / s(Y)
Interest = s(X, Y) / (s(X) × s(Y))
Lift and Interest are equivalent, since c(X → Y) / s(Y) = s(X, Y) / (s(X) × s(Y)).
Pattern Evaluation
Lift of an association rule X → Y: lift = P(Y|X) / P(Y).
- If lift > 1, then X and Y appear together more often than expected: the occurrence of X has a positive effect on the occurrence of Y, i.e., X is positively correlated with Y.
- If lift < 1, then X and Y appear together less often than expected: the occurrence of X has a negative effect on the occurrence of Y, i.e., X is negatively correlated with Y.
- If lift = 1, then X and Y are independent: the occurrence of X has no effect on the occurrence of Y.
Example: Lift/Interest

         Coffee  ¬Coffee
  Tea      15       5      20
 ¬Tea      75       5      80
           90      10     100

Association rule: Tea → Coffee
Confidence = P(Coffee | Tea) = 0.75, but P(Coffee) = 0.9
⇒ Lift = P(Coffee | Tea) / P(Coffee) = 0.75/0.9 = 0.8333 (< 1, so the lift suggests a slight negative correlation between tea drinkers and coffee drinkers; the arithmetic is checked below)
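Plugging this table into the definitions directly (a quick arithmetic check, not part of the notes):

```python
confidence = 15 / 20          # P(Coffee | Tea) = 0.75
p_coffee = 90 / 100           # P(Coffee) = 0.9
print(confidence / p_coffee)  # 0.8333...: lift < 1, mild negative correlation
```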
There are lots of measures proposed in the literature. Some measures are good for certain applications, but not for others. What criteria should we use to determine whether a measure is good or bad? And what about Apriori-style support-based pruning? How does it affect these measures?