chap6_basic_association_analysis PART2 ver2


Data Mining Association Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 6.

Contents: Rule Generation

Rule Generation from Frequent Itemsets

Given a frequent itemset L, find all non-empty subsets f ⊂ L such that the rule f → L − f satisfies the minimum confidence requirement.

If {A,B,C,D} is a frequent itemset, the candidate rules are:
ABC → D, ABD → C, ACD → B, BCD → A, A → BCD, B → ACD, C → ABD, D → ABC,
AB → CD, AC → BD, AD → BC, BC → AD, BD → AC, CD → AB

If |L| = k, then there are 2^k − 2 candidate association rules (ignoring L → ∅ and ∅ → L).
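As a small illustration (not part of the original slides), the 2^k − 2 candidate rules can be enumerated by taking every non-empty proper subset of L as an antecedent:

```python
from itertools import combinations

def candidate_rules(itemset):
    """Enumerate all candidate rules f -> (L - f), where f is a
    non-empty proper subset of the frequent itemset L."""
    items = set(itemset)
    rules = []
    for r in range(1, len(items)):  # antecedent sizes 1 .. k-1
        for antecedent in combinations(sorted(items), r):
            consequent = items - set(antecedent)
            rules.append((frozenset(antecedent), frozenset(consequent)))
    return rules

rules = candidate_rules({"A", "B", "C", "D"})
print(len(rules))  # 2^4 - 2 = 14
```

For k = 4 this yields exactly the 14 rules listed above.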

Rule Generation from Frequent Itemsets

How can we efficiently generate rules from frequent itemsets?

Confidence is not anti-monotone in general, so the Apriori property is hard to apply directly: c(ABC → D) can be larger or smaller than c(AB → D).

However, for rules generated from the same itemset, the anti-monotone property does hold. (That is, confidence is anti-monotone w.r.t. the number of items on the RHS of the rule, or monotone w.r.t. the LHS of the rule.)

e.g., for L = {A,B,C,D}: c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)

c(ABC → D) = σ({A,B,C,D}) / σ({A,B,C})
c(AB → CD) = σ({A,B,C,D}) / σ({A,B})
c(A → BCD) = σ({A,B,C,D}) / σ({A})
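This ordering can be checked numerically. The support counts below are hypothetical (any counts obeying the subset property σ(A) ≥ σ(AB) ≥ σ(ABC) ≥ σ(ABCD) would do):

```python
# Hypothetical support counts; only the subset relationship
# sigma(A) >= sigma(AB) >= sigma(ABC) >= sigma(ABCD) matters.
support = {
    frozenset("A"): 8,
    frozenset("AB"): 6,
    frozenset("ABC"): 5,
    frozenset("ABCD"): 4,
}

def confidence(antecedent, itemset):
    """c(antecedent -> itemset - antecedent) = sigma(itemset) / sigma(antecedent)."""
    return support[frozenset(itemset)] / support[frozenset(antecedent)]

c1 = confidence("ABC", "ABCD")  # c(ABC -> D)   = 4/5 = 0.80
c2 = confidence("AB", "ABCD")   # c(AB  -> CD)  = 4/6 = 0.67
c3 = confidence("A", "ABCD")    # c(A   -> BCD) = 4/8 = 0.50
assert c1 >= c2 >= c3           # anti-monotone w.r.t. the RHS size
```

The chain holds because all three rules share the same numerator σ({A,B,C,D}) while the denominator can only grow as items move from the RHS back to the LHS.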

Rule Generation for the Apriori Algorithm

[Figure: lattice of rules for the frequent itemset {A,B,C,D}, from ABCD → {} at the top down to D → ABC, C → ABD, B → ACD, A → BCD at the bottom. CD → AB is marked as a low-confidence rule, and every rule below it in the lattice (its descendants with larger consequents) is pruned.]

Rule Generation for the Apriori Algorithm

A candidate rule is generated by merging two rules that share the same prefix in the rule consequent: join(CD → AB, BD → AC) produces the candidate rule D → ABC. The two rules share the item A in the rule consequent.

Prune the rule D → ABC if there exists a subset rule (e.g. AD → BC) that does not have high confidence.

Essentially, we are running Apriori on the RHS of the rules.
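The level-wise procedure can be sketched as follows. This is a sketch in the spirit of Apriori rule generation, not the textbook's exact pseudocode, and the toy support dictionary at the bottom is invented for the demo:

```python
from itertools import combinations

def ap_genrules(freq_support, itemset, minconf):
    """Level-wise rule generation from one frequent itemset.
    Consequents grow one item per level; a consequent whose rule
    fails minconf is never extended, exploiting anti-monotonicity
    of confidence w.r.t. the RHS."""
    itemset = frozenset(itemset)
    rules = []
    consequents = [frozenset([i]) for i in itemset]  # size-1 consequents
    while consequents:
        survivors = []
        for cons in consequents:
            ante = itemset - cons
            if not ante:
                continue  # rule with empty antecedent is ignored
            conf = freq_support[itemset] / freq_support[ante]
            if conf >= minconf:
                rules.append((ante, cons, conf))
                survivors.append(cons)
        # merge surviving consequents that differ in exactly one item
        consequents = {a | b for a, b in combinations(survivors, 2)
                       if len(a | b) == len(a) + 1}
    return rules

# Toy supports: every subset of ABCD has count 10, so every rule has
# confidence 1.0 and all 2^4 - 2 = 14 rules survive.
freq_support = {frozenset(c): 10
                for r in range(1, 5) for c in combinations("ABCD", r)}
rules = ap_genrules(freq_support, "ABCD", minconf=0.8)
print(len(rules))  # 14
```

With real (non-uniform) supports, a low-confidence rule such as CD → AB would drop out of `survivors`, and none of its extensions would ever be generated.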

Contents: Maximal Itemsets / Closed Itemsets

Maximal Frequent Itemset

An itemset is maximal frequent (최대빈발항목집합) if none of its immediate supersets is frequent. That is, a maximal frequent itemset is a frequent itemset that is not contained in any other frequent itemset.

How to find them:
- First find the frequent itemsets on the border between the infrequent and the frequent itemsets.
- Find all of their immediate supersets.
- If none of the immediate supersets is frequent, the itemset is maximal frequent.

Example:
- Items: a, b, c, d, e
- Frequent itemset: {a, b, c}
- {a, b, c, d}, {a, b, c, e}, {a, b, c, d, e} are not frequent itemsets.
- Maximal frequent itemsets: {a, b, c}

Maximal frequent itemsets are useful when producing very long frequent itemsets. Short itemsets usually carry little meaning as rules, whereas long itemsets can often generate surprising association rules.

Maximal Frequent Itemset

Example 1 of finding maximal frequent itemsets:
- First confirm that the frequent itemsets on the border between the infrequent and the frequent itemsets are d, bc, ad, and abc.
- Find the immediate supersets of these itemsets.
- The supersets of d are ad, bd, and cd, and ad is frequent, so d is not maximal.
- bc has abc and bcd as supersets, and abc is frequent, so bc is not maximal.
- All supersets of ad and abc are infrequent, so ad and abc are both maximal.
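The maximality check reduces to a superset test over the collection of frequent itemsets. The collection below is a plausible one consistent with this example (the slide's full figure is not reproduced here):

```python
def maximal_frequent(frequent_itemsets):
    """Keep only the itemsets with no proper superset in the collection."""
    fis = [frozenset(s) for s in frequent_itemsets]
    return [s for s in fis if not any(s < t for t in fis)]

# A collection of frequent itemsets consistent with the example:
# d is absorbed by ad, bc by abc; only ad and abc are maximal.
frequent = ["a", "b", "c", "d", "ab", "ac", "bc", "ad", "abc"]
print(sorted("".join(sorted(s)) for s in maximal_frequent(frequent)))
# ['abc', 'ad']
```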

Maximal Frequent Itemset

Example 2: [Figure: the itemset lattice over items A–E, from the null set down to ABCDE, with the border separating the frequent from the infrequent itemsets and the maximal itemsets marked.]

Closed Itemset

An itemset is closed if none of its immediate supersets has the same support as the itemset itself. That is, a closed itemset is as large as it can possibly be without losing any transactions.

If a closed itemset is frequent, it is a closed frequent itemset.

Example: how to find closed frequent itemsets:
- First find all frequent itemsets.
- Then, for each frequent itemset, check its supersets: if some superset has the same support as the original itemset, the original itemset is not closed; otherwise it is a closed itemset.

Closed Itemset

Example of finding closed frequent itemsets:
- c is a closed frequent itemset: its supersets ac, bc, and cd all have support values smaller than 3.
- In the example on the left, there are 9 frequent itemsets in total, of which 4 are closed frequent itemsets.
- Blue circles are frequent itemsets; thick blue circles are closed frequent itemsets (a closed itemset must not have the same support value as any of its supersets); yellow-shaded circles are maximal frequent itemsets.
- ad is a frequent itemset, but it has the same support value as its superset abd, so it is not closed.

Maximal vs Closed Itemsets

TID | Items
 1  | ABC
 2  | ABCD
 3  | BCE
 4  | ACDE
 5  | DE

[Figure: the itemset lattice for these five transactions, each node annotated with the IDs of the transactions that contain it. For example, AB is supported by transactions 1 and 2, while ABCDE is not supported by any transaction.]

Maximal vs Closed Frequent Itemsets

Minimum support = 2

[Figure: the same lattice with the frequent itemsets marked, distinguishing itemsets that are closed but not maximal from those that are both closed and maximal.]

# Closed = 9
# Maximal = 4
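These counts can be reproduced directly from the five transactions above with a brute-force sketch (fine for 5 items, not meant as an efficient miner):

```python
from itertools import combinations

# Transactions from the slide; minimum support count = 2.
transactions = ["ABC", "ABCD", "BCE", "ACDE", "DE"]
minsup = 2
items = sorted(set("".join(transactions)))

def support(itemset):
    return sum(1 for t in transactions if itemset <= set(t))

# Enumerate all frequent itemsets by brute force (2^5 - 1 candidates).
frequent = {}
for r in range(1, len(items) + 1):
    for combo in combinations(items, r):
        s = support(frozenset(combo))
        if s >= minsup:
            frequent[frozenset(combo)] = s

# Closed: no proper superset has the same support.  Checking only the
# frequent supersets is enough, because a superset with the same
# support as a frequent itemset is itself frequent.
closed = [f for f in frequent
          if not any(f < g and frequent[g] == frequent[f] for g in frequent)]
# Maximal: no proper superset is frequent at all.
maximal = [f for f in frequent if not any(f < g for g in frequent)]

print(len(frequent), len(closed), len(maximal))  # 14 9 4
```

This confirms the slide's counts: 14 frequent itemsets, of which 9 are closed and 4 (ABC, ACD, CE, DE) are maximal.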

Maximal vs Closed Itemsets

Frequent Itemsets ⊇ Closed Frequent Itemsets ⊇ Maximal Frequent Itemsets

It is important to point out the relationship between frequent itemsets, closed frequent itemsets, and maximal frequent itemsets. Closed and maximal frequent itemsets are subsets of the frequent itemsets, but the maximal frequent itemsets are a more compact representation because they are a subset of the closed frequent itemsets. The diagram to the right shows the relationship between these three types of itemsets.

Contents: Pattern Evaluation

Pattern Evaluation

Association rule generation algorithms tend to produce too many association rules, and many of the generated rules are not useful (uninteresting or redundant). For example, if {A, B, C} → {D} and {A, B} → {D} have the same support and confidence, the two rules are redundant.

Interestingness measures (유용성척도) are used to prune or rank the derived rules. Support and confidence are themselves interestingness measures.

When to Apply Interestingness Measures

Computing Interestingness Measures

Given a rule X → Y, various interestingness measures can be computed from the following contingency table.

Contingency table for X → Y:

          Y      ¬Y
  X      f11    f10   | f1+
  ¬X     f01    f00   | f0+
         f+1    f+0   | |T|

¬X denotes the transactions that do not contain item X. Each f_ij is a support count (frequency); f1+, for example, is the support count of X.

f11: support count of X and Y
f10: support count of X and ¬Y
f01: support count of ¬X and Y
f00: support count of ¬X and ¬Y

These counts are used to define various measures: support, confidence, lift, Gini, J-measure, etc.
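A minimal sketch of building this table from a transaction list (the toy transactions below are constructed to match the tea/coffee counts used on the next slide; the function and variable names are my own):

```python
def contingency(transactions, X, Y):
    """Compute the 2x2 contingency counts (f11, f10, f01, f00)
    for the rule X -> Y over a list of transaction sets."""
    f11 = f10 = f01 = f00 = 0
    for t in transactions:
        has_x, has_y = X <= t, Y <= t
        if has_x and has_y:
            f11 += 1
        elif has_x:
            f10 += 1
        elif has_y:
            f01 += 1
        else:
            f00 += 1
    return f11, f10, f01, f00

# 100 toy transactions matching the tea/coffee table that follows.
ts = ([{"tea", "coffee"}] * 15 + [{"tea"}] * 5
      + [{"coffee"}] * 65 + [set()] * 15)
print(contingency(ts, {"tea"}, {"coffee"}))  # (15, 5, 65, 15)
```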

Drawback of Confidence

          Coffee   ¬Coffee
  Tea       15        5    |  20
  ¬Tea      65       15    |  80
            80       20    | 100

s(X → Y) = σ(X ∪ Y) / N
c(X → Y) = σ(X ∪ Y) / σ(X)

Association rule: Tea → Coffee
Support(Tea → Coffee) = 15/100 = 15%
Confidence(Tea → Coffee) = σ(Tea ∪ Coffee)/σ(Tea) = 15/20 = 75%

- Seeing this confidence, one might conclude that people who drink tea tend to drink coffee.
- However, the data show that regardless of whether a person drinks tea, the proportion of people who drink coffee was 80% to begin with.
- In other words, despite its high 75% confidence, the rule Tea → Coffee carries little meaning: learning that a person drinks tea tells us almost nothing new about whether they drink coffee.
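The comparison against the baseline rate is just three divisions; using the counts from the table above:

```python
# Counts from the tea/coffee contingency table above.
N = 100
n_tea = 20
n_coffee = 80
n_tea_and_coffee = 15

support = n_tea_and_coffee / N         # s(Tea -> Coffee) = 0.15
confidence = n_tea_and_coffee / n_tea  # c(Tea -> Coffee) = 0.75
p_coffee = n_coffee / N                # baseline P(Coffee) = 0.80

# 75% confidence looks strong, yet it is *below* the 80% baseline
# rate of coffee drinking, so the rule is actually uninformative.
print(support, confidence, p_coffee)
```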

Statistical Independence

Population of 1000 students:
- 600 students know how to swim (S)
- 700 students know how to bike (B)
- 420 students know how to swim and bike (S, B)

P(S ∧ B) = 420/1000 = 0.42
P(S) × P(B) = 0.6 × 0.7 = 0.42

P(S ∧ B) = P(S) × P(B) => statistical independence
P(S ∧ B) > P(S) × P(B) => positively correlated
P(S ∧ B) < P(S) × P(B) => negatively correlated

Statistical-based Measures

Measures that take statistical dependence into account, e.g.:

Lift = P(Y|X) / P(Y)
Interest = P(X, Y) / (P(X) P(Y))

Lift and Interest are equivalent, since P(Y|X)/P(Y) = P(X, Y)/(P(X) P(Y)).

Pattern Evaluation: Lift

Lift of an association rule X → Y: Lift = P(Y|X) / P(Y)

- If Lift > 1, then X and Y appear together more often than expected. This means the occurrence of X has a positive effect on the occurrence of Y, i.e., X is positively correlated with Y.
- If Lift < 1, then X and Y appear together less often than expected. This means the occurrence of X has a negative effect on the occurrence of Y, i.e., X is negatively correlated with Y.
- If Lift = 1, then X and Y are independent. This means the occurrence of X has almost no effect on the occurrence of Y.

Example: Lift/Interest

          Coffee   ¬Coffee
  Tea       15        5    |  20
  ¬Tea      75        5    |  80
            90       10    | 100

Association rule: Tea → Coffee
Confidence = P(Coffee | Tea) = 0.75, but P(Coffee) = 0.9
⇒ Lift = P(Coffee | Tea) / P(Coffee) = 0.75/0.9 = 0.8333 (< 1; the lift suggests a slight negative correlation between tea drinkers and coffee drinkers)
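The same calculation, using the counts from the table above:

```python
# Counts from the second tea/coffee contingency table above.
p_coffee_given_tea = 15 / 20  # confidence of Tea -> Coffee = 0.75
p_coffee = 90 / 100           # overall probability of Coffee = 0.90

lift = p_coffee_given_tea / p_coffee
print(round(lift, 4))  # 0.8333: below 1, a slight negative correlation
```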

There are many measures proposed in the literature. Some measures are good for certain applications, but not for others. What criteria should we use to determine whether a measure is good or bad? And what about Apriori-style support-based pruning? How does it affect these measures?