Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6
Contents: Association Rule Mining
Association Rule Mining
Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.

Market-basket transactions:
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of association rules:
{Diaper} -> {Beer}
{Milk, Bread} -> {Eggs, Coke}
{Beer, Bread} -> {Milk}

Implication means co-occurrence, not causality!
Definition: Frequent Itemset
Itemset: a collection of one or more items.
- Example: {Milk, Bread, Diaper}
- k-itemset: an itemset that contains k items, e.g., {Bread, Milk} is a 2-itemset and {Bread, Diaper, Beer, Eggs} is a 4-itemset.
Support count (σ): the number of transactions that contain a given itemset.
- Example: σ({Eggs}) = 1, σ({Milk, Bread, Diaper}) = 2
Support (s): the fraction of transactions in which the itemset appears.
- Example: s({Eggs}) = 1/5 = 0.2, s({Milk, Bread, Diaper}) = 2/5 = 0.4
Frequent itemset: an itemset whose support is at least a given threshold minsup.

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

If minsup = 0.3, then {Eggs} is not frequent, while {Milk, Bread, Diaper} is frequent.
Definition: Association Rule
An association rule is an implication expression of the form X -> Y, where X and Y are itemsets.
- Example: {Milk, Diaper} -> {Beer}
Let T = {t1, t2, ..., tN} be the set of all transactions, and let σ(X) = |{ti | X ⊆ ti, ti ∈ T}| be the support count of X. Rule evaluation metrics:
- Support: s(X -> Y) = σ(X ∪ Y) / N
  The fraction of transactions that contain both X and Y. It tells how important and useful the rule is (a rule with low support may occur simply by chance).
- Confidence: c(X -> Y) = σ(X ∪ Y) / σ(X)
  The fraction of transactions containing X that also contain Y. It tells how reliable the rule is (how tightly the antecedent and the consequent are related).

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example: {Milk, Diaper} -> {Beer}
s = σ({Milk, Diaper, Beer}) / |T| = 2/5 = 0.4
c = σ({Milk, Diaper, Beer}) / σ({Milk, Diaper}) = 2/3 = 0.67
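As a quick check on these definitions, here is a minimal Python sketch (function and variable names are my own, not from the slides) that recomputes s and c for {Milk, Diaper} -> {Beer} on the five transactions above:

```python
# The five market-basket transactions from the slide.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sigma(itemset):
    """Support count: number of transactions containing every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def support(X, Y):
    """s(X -> Y) = sigma(X u Y) / N"""
    return sigma(X | Y) / len(transactions)

def confidence(X, Y):
    """c(X -> Y) = sigma(X u Y) / sigma(X)"""
    return sigma(X | Y) / sigma(X)

X, Y = {"Milk", "Diaper"}, {"Beer"}
print(support(X, Y))               # 0.4
print(round(confidence(X, Y), 2))  # 0.67
```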
Association Rule Mining Task
Given a set of transactions T, the goal of association rule mining is to find all rules having
- Support >= minsup threshold (minsup: minimum support)
- Confidence >= minconf threshold (minconf: minimum confidence)
Association rule mining = the task of finding every rule that satisfies both conditions.
Brute-force approach:
- List all possible association rules
- Compute the support and confidence for each rule
- Prune rules that fail the minsup and minconf thresholds
=> Computationally prohibitive!
Mining Association Rules

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of rules:
{Milk, Diaper} -> {Beer} (s=0.4, c=0.67)
{Milk, Beer} -> {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} -> {Milk} (s=0.4, c=0.67)
{Beer} -> {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} -> {Milk, Beer} (s=0.4, c=0.5)
{Milk} -> {Diaper, Beer} (s=0.4, c=0.5)
Observations:
- All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
- Rules originating from the same itemset have identical support but can have different confidence
- This is why the support and confidence requirements can be handled separately during mining.
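The observation that all binary partitions of one itemset share the same support can be verified with a short sketch (names are illustrative):

```python
from itertools import combinations

# The five market-basket transactions from the slide.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sigma(itemset):
    return sum(1 for t in transactions if itemset <= t)

# Enumerate every binary partition X -> Y of the frequent itemset.
itemset = {"Milk", "Diaper", "Beer"}
rules = []
for r in range(1, len(itemset)):
    for ante in combinations(sorted(itemset), r):
        X = set(ante)
        Y = itemset - X
        s = sigma(itemset) / len(transactions)   # same numerator for every rule
        c = round(sigma(itemset) / sigma(X), 2)  # depends on the antecedent
        rules.append((sorted(X), sorted(Y), s, c))

for rule in rules:
    print(rule)
```

Every rule prints with s = 0.4, while the confidence varies between 0.5 and 1.0 depending on the antecedent, matching the slide.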
Contents: Frequent Itemset Generation & Apriori Algorithm
Mining Association Rules
Association rules must be found in a transaction DB. Two-step approach:
1. Frequent Itemset Generation: find the itemsets that occur frequently in the transaction DB, i.e., all itemsets with support >= minsup.
2. Rule Generation: split each frequent itemset into two itemsets to derive the association rules satisfying confidence >= minconf (generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset).
Here, finding the frequent itemsets in the transaction DB is the computationally expensive part.
Frequent Itemset Generation
Lattice-based itemset enumeration: a lattice structure helps enumerate the list of all possible itemsets. A dataset that contains d items can generate up to 2^d - 1 frequent itemsets (every non-empty subset).
(Figure: itemset lattice for I = {a, b, c, d, e}, from null at the top, through the single items A-E, down to ABCDE at the bottom.)
Given d items, there are 2^d possible candidate itemsets (the number of subsets).
Frequent Itemset Generation
Brute-force approach:
- Every itemset in the lattice is a candidate frequent itemset.
- Scan the transaction database and compute the support count of each candidate.
- Each transaction is matched against every candidate.
- Complexity ~ O(NMw) => expensive, since M = 2^d!!! (M: number of candidate frequent itemsets, N: number of transactions, w: maximum transaction width)

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
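A minimal brute-force sketch of this O(NMw) approach on the slide's transactions, assuming (for illustration only) an absolute support-count threshold of 3:

```python
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
items = sorted(set().union(*transactions))  # d = 6 distinct items

# M: every non-empty subset of the d items is a candidate (M = 2^d - 1).
candidates = [frozenset(c) for r in range(1, len(items) + 1)
              for c in combinations(items, r)]
print(len(candidates))  # 2^6 - 1 = 63

# N * M containment checks, each over a transaction of width <= w: ~O(NMw).
counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}

minsup_count = 3  # absolute threshold, an assumption for this demo
frequent = [c for c in candidates if counts[c] >= minsup_count]
print(len(frequent))
```

Even for six items the candidate list already has 63 entries; the exponential growth in M is exactly what the pruning strategies on the next slides attack.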
Computational Complexity
Given d items:
- Number of possible itemsets = 2^d
- Number of possible association rules:

  R = Σ_{k=1}^{d-1} [ C(d,k) × Σ_{j=1}^{d-k} C(d-k,j) ] = 3^d - 2^(d+1) + 1

If d = 6, R = 602 rules.
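The closed form can be checked against a direct enumeration (helper names are my own):

```python
from math import comb

def rule_count(d):
    """Closed form from the slide: R = 3^d - 2^(d+1) + 1."""
    return 3 ** d - 2 ** (d + 1) + 1

def rule_count_enumerated(d):
    """Pick a non-empty antecedent X of size k, then any non-empty
    consequent Y from the remaining d - k items: C(d,k) * (2^(d-k) - 1)."""
    return sum(comb(d, k) * (2 ** (d - k) - 1) for k in range(1, d))

print(rule_count(6))  # 602
```

Both functions agree for every d, confirming the identity on the slide.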
Frequent Itemset Generation Strategies
- Reduce the number of candidates (M): a complete search has M = 2^d; use pruning techniques to reduce M.
- Reduce the number of transactions (N): reduce the size of N as the size of the itemsets increases, e.g., using DHP (Direct Hashing & Pruning) and vertical-based mining algorithms.
- Reduce the number of comparisons (NM): use efficient data structures to store the candidates or transactions, so there is no need to match every candidate against every transaction.
Reducing the Number of Candidates
Apriori principle: if an itemset is frequent, then all of its subsets must also be frequent.
- Example: if {Milk, Bread, Diaper} is a frequent itemset, then its subsets such as {Milk, Bread} and {Bread, Diaper} are also frequent itemsets.
The Apriori principle holds due to the following property of the support measure:

  ∀ X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y)

The support of an itemset never exceeds the support of any of its subsets. Equivalently, if a short sub-pattern is known to be infrequent, then any super-pattern obtained by adding more items to it is also infrequent. This is because support is anti-monotone: enlarging the itemset can only keep the support the same or shrink it.
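A small sanity check of the anti-monotone property, exhaustive over every itemset of the five market-basket transactions (illustrative code):

```python
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
items = sorted(set().union(*transactions))

def sigma(X):
    return sum(1 for t in transactions if X <= t)

# Anti-monotone check: adding any item to any itemset never raises support.
anti_monotone = all(
    sigma(set(sub)) >= sigma(set(sub) | {extra})
    for r in range(1, len(items))
    for sub in combinations(items, r)
    for extra in items
    if extra not in sub
)
print(anti_monotone)  # True
```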
Illustrating the Apriori Principle
If the itemset {A, B} is found to be infrequent, then every superset that contains it is also infrequent. In the lattice, all supersets of AB (ABC, ABD, ABE, ..., ABCDE) are therefore pruned without ever being counted.
(Lattice figure over {A, B, C, D, E}: the supersets of AB are crossed out as "pruned supersets".)
Illustrating the Apriori Principle
Minimum support (count) = 3.

Items (1-itemsets):
Item    Count
Bread   4
Coke    2
Milk    4
Beer    3
Diaper  4
Eggs    1

(1) Coke and Eggs are dropped because their counts fall below the threshold of 3.
(2) Since 4 frequent items remain, only C(4,2) = 6 candidate 2-itemsets are generated:

Itemset          Count
{Bread, Milk}    3
{Bread, Beer}    2
{Bread, Diaper}  3
{Milk, Beer}     2
{Milk, Diaper}   3
{Beer, Diaper}   3

(3) Dropping the two 2-itemsets below the threshold, a single candidate 3-itemset remains to be counted:

Triplets (3-itemsets):
Itemset                Count
{Bread, Milk, Diaper}  2

(consistent with σ({Milk, Bread, Diaper}) = 2 on the earlier slide; the triplet turns out not to be frequent at this threshold, but it still had to be counted as a candidate)
If every subset were considered: C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41 candidates (brute force).
With support-based pruning this shrinks to 6 + 6 + 1 = 13 (Apriori).
Illustrating the Apriori Principle
Database (minimum support count = 2):

TID  Items
10   A, C, D
20   B, C, E
30   A, B, C, E
40   B, E

1st scan — C1 (candidate 1-itemsets) with support counts: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1 (frequent 1-itemsets): {A}:2, {B}:3, {C}:3, {E}:3  ({D} is dropped)

C2 (candidates generated from L1): {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan — counts: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
L2 (frequent 2-itemsets): {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

C3 (candidate 3-itemsets): {B,C,E} only.
- Why is {A,C,E} not a candidate? Because {A,E} is not in L2.
- {B,C,E} is a candidate because {B,C}, {B,E}, and {C,E} are all in L2.
3rd scan — L3 (frequent 3-itemsets): {B,C,E}:2
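The trace above can be reproduced with a compact Apriori sketch (an illustrative implementation, not reference code from the book):

```python
from itertools import combinations

# The four transactions from the slide; minimum support count = 2.
db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
minsup = 2

def sigma(X):
    return sum(1 for t in db if X <= t)

items = sorted(set().union(*db))
L = [{frozenset([i]) for i in items if sigma({i}) >= minsup}]  # L1

k = 1
while L[-1]:
    prev = L[-1]
    # Candidate generation: unions of two frequent k-itemsets of size k+1 ...
    cands = {a | b for a in prev for b in prev if len(a | b) == k + 1}
    # ... kept only if every k-subset is frequent (Apriori pruning).
    cands = {c for c in cands
             if all(frozenset(s) in prev for s in combinations(c, k))}
    L.append({c for c in cands if sigma(c) >= minsup})
    k += 1

print([sorted(sorted(x) for x in level) for level in L if level])  # L1, L2, L3
```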
The Apriori Algorithm (Pseudo-Code)
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
Implementation of Apriori
How to generate candidates?
- Step 1: self-joining Lk (Ck+1 = candidates generated from Lk)
- Step 2: pruning
Example of candidate generation:
L3 = {abc, abd, acd, ace, bcd}
- Step 1 (self-join L3*L3): abcd from abc and abd; acde from acd and ace
- Step 2 (pruning): acde is removed because ade is not in L3
C4 = {abcd}
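The self-join and pruning steps can be sketched as follows (tuples represent sorted itemsets; names are illustrative):

```python
from itertools import combinations

# L3 from the slide, each itemset kept as a sorted tuple.
L3 = [("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"),
      ("a", "c", "e"), ("b", "c", "d")]

# Step 1: self-join -- merge two itemsets that share the first k-1 items.
joined = {p + (q[-1],)
          for p in L3 for q in L3
          if p[:-1] == q[:-1] and p[-1] < q[-1]}

# Step 2: prune -- drop any candidate that has an infrequent 3-subset.
L3_set = set(L3)
C4 = [c for c in sorted(joined)
      if all(s in L3_set for s in combinations(c, 3))]

print(sorted(joined))  # [('a', 'b', 'c', 'd'), ('a', 'c', 'd', 'e')]
print(C4)              # [('a', 'b', 'c', 'd')] -- acde dropped: ade not in L3
```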
Reducing the Number of Comparisons
Candidate counting: before deciding that an itemset is frequent, the support of each candidate itemset must be computed. Computing these supports requires comparing every candidate against the whole transaction DB -> high computational cost -> store the candidates in a hash structure for efficient comparison.
- Instead of matching each transaction against every candidate, match it only against the candidates contained in the hashed buckets.
- That is, a hash tree structure reduces the number of candidates in Ck that must be checked for a given transaction.
(Figure: N transactions hashed into k buckets of the hash structure.)
How to Count Supports of Candidates?
Why is counting the supports of candidates a problem?
- The total number of candidates can be very large.
- One transaction may contain many candidates.
Method:
- Candidate itemsets are stored in a hash tree.
- A leaf node of the hash tree contains a list of itemsets and counts.
- An interior node contains a hash table.
- The subset function (hash function) finds all the candidates contained in a transaction.
Support Counting
A systematic way to enumerate all 3-itemsets contained in transaction t = {1, 2, 3, 5, 6}: enumerate the 3-itemsets level by level, then compare each against the candidate list; whenever a match is found, the support count of that candidate is incremented.
- Level 1: fix the first item — 1 (suffix 2 3 5 6), 2 (suffix 3 5 6), or 3 (suffix 5 6).
- Level 2: fix the second item from the remaining suffix.
- Level 3 (subsets of 3 items): 1 2 3, 1 2 5, 1 2 6, 1 3 5, 1 3 6, 1 5 6, 2 3 5, 2 3 6, 2 5 6, 3 5 6.
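The level-wise enumeration amounts to generating all 3-subsets of the transaction; a sketch with a hypothetical candidate list:

```python
from itertools import combinations

t = (1, 2, 3, 5, 6)

# All 3-itemsets contained in the transaction: C(5,3) = 10 of them,
# emitted in the same lexicographic order as the level-wise enumeration.
subsets = list(combinations(t, 3))
print(len(subsets))  # 10
print(subsets[:3])   # [(1, 2, 3), (1, 2, 5), (1, 2, 6)]

# Hypothetical candidate list: a match increments that candidate's count.
counts = {(1, 2, 4): 0, (1, 2, 5): 0, (3, 5, 6): 0}
for s in subsets:
    if s in counts:
        counts[s] += 1
print(counts)  # {(1, 2, 4): 0, (1, 2, 5): 1, (3, 5, 6): 1}
```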
Support Counting using a Hash Tree
- The candidate itemsets are partitioned into different buckets and stored in a hash tree.
- During support counting, each itemset contained in a transaction is also hashed into its appropriate bucket: itemsets enumerated from the transaction are routed to leaf nodes of the hash tree by the hash function.
- As a result, instead of comparing every itemset of the transaction against every candidate, each one is compared only against the candidate itemsets in the same bucket.
(Figure: (1) candidate itemsets distributed across the hash tree; (2) the transaction items 2, 3, 4 are compared only against the left leaf node (bucket) of the hash tree.)
Support Counting using a Hash Tree
Suppose you have 15 candidate itemsets of length 3:
{1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}
You need:
- A hash function (here: items 1, 4, 7 -> left branch; 2, 5, 8 -> middle; 3, 6, 9 -> right)
- A max leaf size: the maximum number of itemsets stored in a leaf node (if the number of candidate itemsets exceeds the max leaf size, split the node)
The 15 candidate itemsets are assigned to the leaf nodes.
(Figure: the resulting hash tree with the candidates distributed over its leaves.)
Support Counting using a Hash Tree
With the same 15 candidate itemsets and hash function (1,4,7 / 2,5,8 / 3,6,9):
All candidates whose first item is 1, 4, or 7 must be assigned to branch A: {1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6} — 7 itemsets in total.
If each leaf node can hold at most 3 itemsets, the hash tree must grow in depth: a subtree B is added. Subtree B gathers the itemsets whose second item hashes to 2, 5, or 8, and distributes them to the left, middle, or right node according to their third item.
Hash Tree
How a given candidate itemset is assigned within the hash tree:
- At level 1 of the hash tree, the branch is determined by hashing the first item of the candidate itemset.
- At the level-2 nodes, the branch is determined by hashing the second item ("hash on 2, 5 or 8").
(Candidate hash tree figure; hash function: 1,4,7 / 2,5,8 / 3,6,9.)
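A minimal hash-tree sketch under the stated assumptions (hash on (item - 1) mod 3, max leaf size 3); the class layout is my own, not a reference implementation:

```python
# Hash function from the slide: items {1,4,7}, {2,5,8}, {3,6,9}
# map to buckets 0, 1, 2 (left, middle, right).
def h(item):
    return (item - 1) % 3

MAX_LEAF = 3  # max itemsets per leaf before it splits (slide's assumption)

class Node:
    def __init__(self):
        self.children = {}  # bucket -> Node; non-empty => interior node
        self.itemsets = []  # stored candidates; used only by leaves

def insert(node, itemset, depth=0):
    if node.children:  # interior: hash on the item at the current depth
        insert(node.children.setdefault(h(itemset[depth]), Node()),
               itemset, depth + 1)
        return
    node.itemsets.append(itemset)
    if len(node.itemsets) > MAX_LEAF and depth < len(itemset):
        pending, node.itemsets = node.itemsets, []  # split the full leaf
        for it in pending:
            insert(node.children.setdefault(h(it[depth]), Node()),
                   it, depth + 1)

def count(node):
    """Total candidates stored in the (sub)tree."""
    return len(node.itemsets) + sum(count(c) for c in node.children.values())

root = Node()
for c in [(1, 4, 5), (1, 2, 4), (4, 5, 7), (1, 2, 5), (4, 5, 8), (1, 5, 9),
          (1, 3, 6), (2, 3, 4), (5, 6, 7), (3, 4, 5), (3, 5, 6), (3, 5, 7),
          (6, 8, 9), (3, 6, 7), (3, 6, 8)]:
    insert(root, c)

print(len(root.children))  # 3 top-level branches
print(count(root))         # all 15 candidates stored
```

Because 1, 4, and 7 share a bucket, the seven candidates starting with 1 or 4 all land under the left branch, which then splits on the second and third items exactly as the slides describe.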
Hash Tree
(Candidate hash tree figure; hash function 1,4,7 / 2,5,8 / 3,6,9: hashing on 3, 6 or 9 selects the right branch.)
Hash Tree
(Candidate hash tree figure; hash function 1,4,7 / 2,5,8 / 3,6,9: hashing on 1, 4 or 7 selects the left branch.)
Support Counting using a Hash Tree
Matching transaction {1 2 3 5 6} starts at the root: the transaction is split into the prefixes 1 + {2 3 5 6}, 2 + {3 5 6}, and 3 + {5 6}. By the hash function, items 1, 2, and 3 direct the search to the left, middle, and right branches of the tree, respectively.
(Figure: the three root-level branches being probed.)
Support Counting using a Hash Tree
- At the root, items 1, 2, and 3 of the transaction select the left, middle, and right branches.
- From the prefixes 1 2 +, 1 3 +, and 1 5 +, the second items 2, 3, and 5 select the middle, right, and middle branches, respectively (hash function). For example, from 1 + {2 3 5 6}, the 3 in prefix 1 3 leads to the right branch.
- From 1 2 + {3 5 6}, the third items 3, 5, and 6 select the right, middle, and right leaves; from 1 5 + 6, item 6 selects the right leaf; from 3 5 6, item 5 selects the middle leaf.
Support Counting using a Hash Tree
15 candidate itemsets of length 3: {1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}
Matching transaction {1 2 3 5 6} against the hash tree visits 5 of the 9 leaf nodes, so only 9 of the 15 candidate itemsets are compared against the transaction. Of those, 3 candidates ({1 2 5}, {1 3 6}, {3 5 6}) are actually contained in the transaction, and only their counts are incremented.
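Which candidates actually get their counts incremented can be double-checked directly (illustrative code that bypasses the tree and tests containment):

```python
from itertools import combinations

# The 15 length-3 candidates from the slide, as sorted tuples.
candidates = [(1, 4, 5), (1, 2, 4), (4, 5, 7), (1, 2, 5), (4, 5, 8),
              (1, 5, 9), (1, 3, 6), (2, 3, 4), (5, 6, 7), (3, 4, 5),
              (3, 5, 6), (3, 5, 7), (6, 8, 9), (3, 6, 7), (3, 6, 8)]
t = (1, 2, 3, 5, 6)

# Every 3-subset of the transaction, as sorted tuples.
subsets = set(combinations(t, 3))

# A candidate's count is incremented iff it appears among the subsets.
matched = [c for c in candidates if c in subsets]
print(matched)  # [(1, 2, 5), (1, 3, 6), (3, 5, 6)]
```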