http://pdd4.webnode.kr/
e-Business, Ch. 9: Big Data & IoT
Young-Min Kyoung, Ph.D.
Contents

Part 1. Data Mining
- Overview of data mining
- Machine learning
- Data mining techniques

Part 2. Decision Tree
- Basics of the decision tree technique
- The C4.5 algorithm
- Entropy
- Decision tree rule generation

<Team Report #> Due date: 9.6.7 (hard copy: submit before class / computer file: by e-mail until midnight of the same day)
Report topic: cryptosystems (types: symmetric and asymmetric systems) and encryption algorithms (e.g., an explanation and applied example of the Caesar Cipher)
9-5-3
The best-known success story: Market Basket Analysis (Wal-Mart)
- [CRM] Analyzing customers' purchase data for marketing and for new-card-issuance decisions.
- Baby diapers and beer were found to sell together (a purchase correlation).
- Should we attach meaning to this phenomenon? If so, why? (insight)
Background and necessity
- Technology is needed for handling data that is large in volume, generated at high velocity, and of great variety: not only numbers but also text, images, and video (structured and unstructured data).
- Variety: demands new analysis methods (traditional analysis techniques reach their limits).
- Velocity: respond quickly to market-change cycles (fast processing of data collected in real time).
- Volume: cope with growing storage and processing loads (a single high-performance computer system vs. distributed computing).
- The ability to make smart decisions using all the internal and external data a firm can access is an essential competitive capability in the digital economy.
- Keywords: Rich Data, Meaningful Information; Data, Insight & Foresight; the 3Vs (or 4Vs); data explosion (IT / Internet / IoT); Knowledge Discovery; progress in computer science (AI techniques); Data Analysis and Data Processing; rapid advances in computing hardware and software; Useful Information & Knowledge; Understanding Customers; Competitive Power.
Data mining overview
- Foundations: statistical analysis methods, machine learning, artificial intelligence, computer science (e.g., neural networks).
- Big Data (daily transaction data, customer data, product data, customer response data) yields patterns, associations, and data models: valuable information (hidden knowledge, unexpected information, or new rules) that feeds marketing (customer evaluation, classification/prediction, association analysis).
- Example questions: Which products sell well on Fridays? Which products are purchased together? How should the target customers for the promotion under consideration be selected?
Definition of data mining
- A computational process that, after transforming collected data, discovers and models the relationships, patterns, and rules contained in the data, deriving insight and foresight and thereby useful information and knowledge.
- It supports objective, fact-based decision making.
- Lineage (Statistics + Computer Science leading to Data Mining today): [18th c.] Bayes' theorem, regression; [mid-20th c.] neural networks, databases, genetic algorithms; [late 20th c.] KDD, SVM; [21st c.] data science, Big Data / IoT.
Knowledge Discovery in Databases (KDD)
- Process: Selection, Preprocessing, Transformation, Data Mining, then Interpretation / Evaluation.
- Data flow: Raw Data, Target Data, Preprocessed Data, Transformed Data, Patterns, Model / Rules.
- Supporting techniques and algorithms: data fusion, sampling, de-noising, feature extraction, normalization, dimension reduction, classification, clustering, pattern recognition, visualization, validation.
- Based on the content of Fayyad, Usama; Piatetsky-Shapiro, Gregory; and Smyth, Padhraic (1996), "From Data Mining to Knowledge Discovery in Databases," AI Magazine 17(3), 37-54.
Data Mining: Preprocessing (Pre-Processing)
- Meaning: manipulation of data into the form suitable for further processing and analysis.
- Routine, tedious, and time-consuming; data preparation is estimated to account for 60%-80% of the time spent on a data mining project.
- Kinds of dirty data that need cleaning:
  - Incomplete data: missing attribute values, e.g., Occupation = "".
  - Noisy data (incorrect values): errors or outliers, e.g., Salary = "-10".
  - Inconsistent data: discrepancies in codes or names, e.g., Age = "39" recorded alongside Birthday = "5/9/99"; gender recorded as "male" in one place and "M" in another; discrepancies between duplicate records.
- When exploring relationships between variables, a rough picture of the data can often be obtained from a sample of the records.
- By what criterion is an outlier judged? What does an outlier mean?
(Advanced Topics in Production Systems Engineering)
Data Mining: Data transformation
- Example: data often must be converted into a form suitable for the DM algorithm, e.g., discretization to reduce the number of distinct continuous attribute values.
- Regression: categorical data (gender, housing type, education level, etc.) must be expressed as numeric data.
  - Housing type (categorical to numeric): apartment, detached house, villa, mixed-use, studio become 1, 2, 3, 4, 5.
- Classification: the dependent variable must be expressed as categorical data.
  - Age (continuous to categorical): e.g., 27, 33, 48, 5?, 66 become 20s, 30s, 40s, 50s, 60s.
  - Low-level concepts (numerical values for age) become high-level concepts (young, middle-aged, old).
- Numeric data proper: age, annual income, number of purchases, etc.
Machine Learning
- A branch of artificial intelligence: enabling computers to acquire knowledge by learning from data.
- Supervised learning:
  - Concept: a machine learning method that infers a function from training data.
  - Uses training data and testing data.
  - Analysis in which the target (output) variable is given in advance.
  - Examples: classification, regression analysis.
- Unsupervised learning:
  - Concept: a machine learning method that discovers how the data is organized, without prior information.
  - Analysis in which no target (output) variable is given.
  - Examples: cluster analysis (segmentation), association rule discovery.
Types of data mining techniques
- Unsupervised learning: all attributes are independent variables (independent variable = input variable = cause attribute).
- Supervised learning: usually one attribute is the dependent variable (dependent variable = output variable = result attribute = the class).
- Descriptive techniques: clustering, association, sequential analysis.
- Predictive techniques: classification (output variable is categorical), regression (output variable is numeric), prediction (output variable is categorical or numeric).
- Representative methods: decision tree, rule induction, neural networks, nearest-neighbor classification.
An example of supervised learning: classification / rule generation with a decision tree

Table 1. Hypothetical training data for disease diagnosis. Rows are instances; Sore Throat, Fever, Swollen Glands, Congestion, and Headache are input attributes; Diagnosis is the output attribute (the class).

Patient ID  Sore Throat  Fever (temp.)  Swollen Glands  Congestion  Headache  Diagnosis
1           Yes          Yes (38.5)     Yes             Yes         Yes       Strep throat
2           No           No  (36.7)     No              Yes         Yes       Allergy
3           Yes          Yes (39.)      No              Yes         No        Cold
4           Yes          No  (36.4)     Yes             No          No        Strep throat
5           No           Yes (37.8)     No              Yes         No        Cold
6           No           No  (36.3)     No              Yes         No        Allergy
7           No           No  (36.8)     Yes             No          No        Strep throat
8           Yes          No  (36.7)     No              Yes         Yes       Allergy
9           No           Yes (38.6)     No              Yes         Yes       Cold
10          Yes          Yes (38.)      No              Yes         Yes       Cold

- From this data we want to learn general properties, i.e., to obtain a generalized diagnostic model (inductive inference).
The decision tree for the data of Table 1
- Root test: Swollen Glands? If Yes, then Diagnosis = Strep Throat.
- If Swollen Glands = No, test Fever: if Yes, Diagnosis = Cold; if No, Diagnosis = Allergy.
- In words: if the glands are swollen, it is strep throat; if the glands are not swollen and there is fever, it is a cold; if the glands are not swollen and there is no fever, it is an allergy.
- This tree is the diagnostic model; applying it to new instances is classification.

Table 2. Data instances with an unknown classification (used as test data)

Patient ID  Sore Throat  Fever  Swollen Glands  Congestion  Headache  Diagnosis
11          No           No     Yes             Yes         Yes       ?
12          Yes          Yes    No              No          Yes       ?
13          No           No     No              No          Yes       ?

- Diagnosis example (1): strep throat, cold, allergy.
- Diagnosis example (2): strep throat, cold, cold.
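The tree above is literally a pair of nested if-then-else tests. A minimal sketch in Python (the function name `diagnose` and the boolean encoding are illustrative, not from the slides):

```python
def diagnose(swollen_glands: bool, fever: bool) -> str:
    # Root test: swollen glands decide strep throat outright.
    if swollen_glands:
        return "Strep Throat"
    # Otherwise fever separates cold from allergy.
    if fever:
        return "Cold"
    return "Allergy"

# Classifying the three unlabeled instances of Table 2 (IDs 11-13):
print(diagnose(True, False))   # Strep Throat
print(diagnose(False, True))   # Cold
print(diagnose(False, False))  # Allergy
```

Note that only two of the five input attributes appear in the model: the tree keeps just the attributes that best separate the classes.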
DM technique 1: Classification
- Overview: extract the characteristics of each class from the data to build a classification model, then use it to assign a class value to new records (data instances).
- Features:
  - The most widely used approach in data mining.
  - A supervised method (the result attribute is fixed in advance).
  - The dependent variable (the information sought) is categorical (nominal) data.
  - The analysis concerns the present, not the future.
- Examples: a card company's customer evaluation model that separates good members from bad ones; profiling people with an annual salary of 100 million won or more; judging whether a credit card has been used fraudulently / screening new credit card applications for approval; detecting fraudulent claims at the health insurance service.
DM technique 2: Prediction
- Overview: uses a procedure similar to classification, but the goal of the analysis is to determine (classify or estimate) an outcome at a future point in time rather than the present.
- Features:
  - A supervised method. The emphasis is on predicting future rather than current outcomes.
  - The output attribute may be categorical or numeric; the variable to be predicted is often continuous.
- Examples: predicting whether a credit card customer will take up a promotion enclosed with the bill; predicting next week's closing value of the Dow Jones Industrial Average; predicting whether a mobile subscriber will switch to another carrier within the next three months.
DM technique 3: Clustering
- Overview: a learning method that builds a knowledge structure by grouping data instances into two or more homogeneous groups, maximizing the intra-class similarity and minimizing the inter-class similarity.
- Features:
  - An unsupervised method; there is no dependent variable.
  - Its main purpose is to discover the concept structures hidden in the data, and to judge whether meaningful relationships expressible as concepts exist at all.
- Examples: mailing a book catalog only to the customers in the relevant cluster; evaluating the performance of a supervised model; serving as a preliminary step for other data mining tasks, e.g., judging the most suitable input attributes for supervised learning, or detecting outliers (which are often noise data).
DM technique 4: Association analysis
- Overview: a learning method that uses association rule algorithms to find hidden relationships among retail items, i.e., to discover the dependency relationships latent in the data. Also called Market Basket Analysis.
- Features:
  - An unsupervised method; it aims to find unknown associations latent in the data.
  - Association rules can have one or several output attributes, and an output attribute for one rule can be an input attribute for another rule.
- Examples: shelf layout in department stores and marts, cross-selling of products or services (cross-marketing strategy), mailing inserts (promotion design), fraud detection.
DM technique 5: Statistical regression
- Overview: a supervised method that builds a general model as an equation linking one or more input attributes to a single numeric output attribute; classified into linear and nonlinear regression techniques.
- Categorical data is converted to numeric form before being used.
- Example linear regression equation:
  life insurance promotion = 0.5909 (credit card insurance) - 0.5455 (sex) + 0.7727
- Applied examples (credit card insurance: no = 0, yes = 1; sex: female = 0, male = 1):
  - How likely is a female without credit card insurance to take the life insurance promotion?
    life insurance promotion = 0.5909 (0) - 0.5455 (0) + 0.7727 = 0.7727
  - How likely is a male without credit card insurance to take it?
    life insurance promotion = 0.5909 (0) - 0.5455 (1) + 0.7727 = 0.2272
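The two worked examples amount to plugging 0/1 encodings into the fitted equation. A sketch, assuming the coefficients 0.5909, -0.5455, and 0.7727 read off the slide (the function name and the 0/1 encoding convention are illustrative):

```python
def life_ins_promo(credit_card_ins: int, sex: int) -> float:
    # Fitted linear model from the slide.
    # credit_card_ins: no = 0, yes = 1; sex: female = 0, male = 1.
    return 0.5909 * credit_card_ins - 0.5455 * sex + 0.7727

print(round(life_ins_promo(0, 0), 4))  # female, no card insurance: 0.7727
print(round(life_ins_promo(0, 1), 4))  # male, no card insurance: 0.2272
```

The only difference between the two cases is the sex coefficient -0.5455, which is exactly the gap between the two predicted propensities.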
Performance evaluation: evaluating a supervised learning model
- Evaluate the model's accuracy: using new data not used in model building (test data), compute how accurately the model's classifications match the known classes.
- Evaluation tool: the confusion matrix. A matrix used to summarize the results of a supervised classification. Entries along the main diagonal are correct classifications; entries other than those on the main diagonal are classification errors.
- The entries of the confusion matrix provide the values needed to compute the model's accuracy.
Performance evaluation: evaluating a supervised learning model
A. A two-class confusion matrix (Tables A-1 and A-2)
- For each of the two matrices: what is the accuracy? What is the error rate?
- Do both models have the same performance?
Performance evaluation: evaluating a supervised learning model
B. A three-class confusion matrix
- The confusion matrix holds the classification results obtained by applying the model; its rows correspond to the three classes the instances actually belong to.
- The number of instances used to evaluate the model is the sum of all entries: sum over i and j of c_ij.
- [Reading 1] The values on the main diagonal represent correct classifications.
- [Reading 2] The values in row C_i represent the instances that actually belong to class C_i.
- [Reading 3] The values in column C_j represent the instances classified as members of C_j.
- Model accuracy: Accuracy = (sum over i of c_ii) / (sum over i and j of c_ij) x 100 (%)
- Model error rate: Error_Rate = 1 - Accuracy
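The two formulas above can be sketched directly. The matrix values below are hypothetical (the slide's own entries were not preserved); only the computation pattern is the point:

```python
def accuracy(cm):
    # cm[i][j] = number of instances whose true class is i, classified as j.
    correct = sum(cm[i][i] for i in range(len(cm)))  # main diagonal
    total = sum(sum(row) for row in cm)              # all entries
    return correct / total

# A hypothetical three-class confusion matrix (rows = actual, columns = predicted).
cm = [[50, 3, 2],
      [4, 40, 6],
      [1, 5, 44]]

acc = accuracy(cm)
print(round(acc, 3), round(1 - acc, 3))  # accuracy, then error rate
```

Here 134 of 155 instances lie on the diagonal, so the accuracy is 134/155 and the error rate is simply its complement.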
http://pdd4.webnode.kr/
e-Business, Ch. 9: Big Data & IoT / Data Mining
Part 2. The Decision Tree Technique
Decision Tree
- A technique used in supervised learning, mainly for classification; in effect, a series of nested if-then-else statements.
- Tree generation and the learning model:
  1. Split the set of all instances into training data (some instances) and test data (the remaining instances).
  2. Generate a decision tree from the training data: this is the classification model (the rules).
  3. Test (validate) the tree's performance on the test data.
  4. If the tree classifies every instance of the test data correctly, stop; otherwise add new instances to the training data and repeat the process.
- The tree is built using the attributes that best represent the concept to be learned; the finished model is then used for classification (and can also be used for prediction).
The tree-building algorithm: the C4.5 algorithm (Ross Quinlan)
- Basic idea: builds decision trees from a set of training data using the concept of information entropy (information gain).
- At each node of the tree, C4.5 chooses the one attribute of the data that most effectively splits its set of samples into subsets enriched in one class or the other. The attribute with the highest information gain is chosen to make the decision.
- Entropy = - sum over i of p_i log2 p_i. Entropy = 0 for a pure node; Entropy = 1 for an even two-class split.
- Information Gain = Entropy(parent) - [weighted average] Entropy(children).
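The two definitions above translate into a few lines of Python. A minimal sketch (the function names are illustrative; class distributions are passed as raw counts):

```python
from math import log2

def entropy(counts):
    # Shannon entropy of a class distribution given as raw counts.
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c)

def info_gain(parent_counts, children_counts):
    # Entropy of the parent minus the instance-weighted entropy of the children.
    n = sum(parent_counts)
    weighted = sum(sum(c) / n * entropy(c) for c in children_counts)
    return entropy(parent_counts) - weighted

print(entropy([7, 7]))   # 1.0  (an even two-class node)
print(entropy([14, 0]))  # 0.0  (a pure node)
```

The `if c` guard skips empty classes, since p log2 p tends to 0 as p goes to 0.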
The C4.5 algorithm
1. Let T be the set of training instances.
2. Select the attribute that best differentiates the instances in T.
3. Make the selected attribute a node of the tree, and create one child link for each distinct value of the attribute; each child link partitions the instances into a sub-class.
4. For each sub-class created in step 3:
   a. If the instances in the sub-class satisfy a predefined criterion, or if no attributes remain along this path of the tree, use this path to determine the classification of new instances.
   b. If the sub-class does not satisfy the predefined criterion and one or more attributes remain for further subdivision along the path, let T be the set of all instances of the current sub-class and return to step 2.
The criterion for selecting an attribute (the branching node): which attribute best differentiates the instances in T?
- Candidate A: Age, with branches "20s" and "30s", and the count of home-shopping purchasers under each branch.
- Candidate B: Gender, with branches "Male" and "Female", and the count of home-shopping purchasers under each branch.
- Which of A and B should be chosen? The one whose branches separate purchasers from non-purchasers more cleanly.
CLASSIFICATION DECISION TREE: raw data

Instance  Age  Income  Student  Credit_Rating  Buys_Computer
1         28   High    No       Fair           No
2         27   High    No       Excellent      No
3         36   High    No       Fair           Yes
4         44   Medium  No       Fair           Yes
5         58   Low     Yes      Fair           Yes
6         47   Low     Yes      Excellent      No
7         33   Low     Yes      Excellent      Yes
8         26   Medium  No       Fair           No
9         24   Low     Yes      Fair           Yes
10        46   Medium  Yes      Fair           Yes
11        27   Medium  Yes      Excellent      Yes
12        38   Medium  No       Excellent      Yes
13        35   High    Yes      Fair           Yes
14        45   Medium  No       Excellent      No

Age is numeric here; the next slide shows it discretized into the ranges <=30, 31..40, and >40.
CLASSIFICATION DECISION TREE: training instances with Age discretized

Instance  Age     Income  Student  Credit_Rating  Buys_Computer
1         <=30    High    No       Fair           No
2         <=30    High    No       Excellent      No
3         31..40  High    No       Fair           Yes
4         >40     Medium  No       Fair           Yes
5         >40     Low     Yes      Fair           Yes
6         >40     Low     Yes      Excellent      No
7         31..40  Low     Yes      Excellent      Yes
8         <=30    Medium  No       Fair           No
9         <=30    Low     Yes      Fair           Yes
10        >40     Medium  Yes      Fair           Yes
11        <=30    Medium  Yes      Excellent      Yes
12        31..40  Medium  No       Excellent      Yes
13        31..40  High    Yes      Fair           Yes
14        >40     Medium  No       Excellent      No
An example entropy calculation
- Summary for Buys_Computer: Positive (Yes) 9, Negative (No) 5. Two classes exist.
- Entropy of the attribute Buys_Computer:
  Entropy = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940
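The 0.940 above can be checked directly from the 9-to-5 class split:

```python
from math import log2

# Buys_Computer: 9 positive and 5 negative instances out of 14.
p_yes, p_no = 9 / 14, 5 / 14
h = -(p_yes * log2(p_yes) + p_no * log2(p_no))
print(round(h, 2))  # 0.94
```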
Using this training data, the task is to judge who will buy a computer: Buys_Computer is the output (class) variable.
If Age is taken as the split node:
- Branch <=30: 2 Yes, 3 No. Entropy = -(2/5) log2(2/5) - (3/5) log2(3/5) = 0.971
- Branch 31..40: 4 Yes, 0 No. Entropy = -(4/4) log2(4/4) = 0
- Branch >40: 3 Yes, 2 No. Entropy = -(3/5) log2(3/5) - (2/5) log2(2/5) = 0.971
- Weighted entropy: Entropy(Age) = (5/14)(0.971) + (4/14)(0) + (5/14)(0.971) = 0.694
- Information Gain = Entropy - Entropy(Age) = 0.940 - 0.694 = 0.246
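The weighted average and the gain for the Age split can be reproduced as follows, assuming the (Yes, No) counts per branch given above. (Computed from unrounded entropies the gain comes out as 0.247; the slide's 0.246 is the difference of the rounded values 0.940 and 0.694.)

```python
from math import log2

def entropy(counts):
    # Shannon entropy of a class distribution given as raw counts.
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c)

# (Yes, No) counts of Buys_Computer inside each Age branch.
branches = {"<=30": (2, 3), "31..40": (4, 0), ">40": (3, 2)}
n = 14  # total number of training instances

weighted = sum(sum(c) / n * entropy(c) for c in branches.values())
gain = entropy((9, 5)) - weighted
print(round(weighted, 3), round(gain, 3))  # 0.694 0.247
```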
If Income is taken as the split node:
- Branch Low: 3 Yes, 1 No. Entropy = -(3/4) log2(3/4) - (1/4) log2(1/4) = 0.811
- Branch Medium: 4 Yes, 2 No. Entropy = -(4/6) log2(4/6) - (2/6) log2(2/6) = 0.918
- Branch High: 2 Yes, 2 No. Entropy = -(2/4) log2(2/4) - (2/4) log2(2/4) = 1.0
- Weighted entropy: Entropy(Income) = (4/14)(0.811) + (6/14)(0.918) + (4/14)(1.0) = 0.911
- Information Gain = Entropy - Entropy(Income) = 0.940 - 0.911 = 0.029
Entropy and information gain: summary of results
Output: Buys_Computer (Entropy = 0.940)

Split node      Entropy  Info. Gain              Select
Age             0.694    0.940 - 0.694 = 0.246   Max.
Income          0.911    0.940 - 0.911 = 0.029
Student         0.789    0.940 - 0.789 = 0.151
Credit_Rating   0.892    0.940 - 0.892 = 0.048

Age is therefore selected as the root node, with branches age <= 30 (2 Yes / 3 No), age 31..40 (4 Yes), and age > 40 (3 Yes / 2 No).
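The whole comparison can be reproduced in one pass over the data. A sketch assuming the 14-instance training table above; because the gains here are computed from unrounded entropies, the last digit can differ slightly from the slide's rounded arithmetic (e.g., 0.247 vs. 0.246):

```python
from math import log2
from collections import Counter, defaultdict

# The 14 training instances: (age, income, student, credit_rating, buys_computer).
DATA = [
    ("<=30", "High", "No", "Fair", "No"),
    ("<=30", "High", "No", "Excellent", "No"),
    ("31..40", "High", "No", "Fair", "Yes"),
    (">40", "Medium", "No", "Fair", "Yes"),
    (">40", "Low", "Yes", "Fair", "Yes"),
    (">40", "Low", "Yes", "Excellent", "No"),
    ("31..40", "Low", "Yes", "Excellent", "Yes"),
    ("<=30", "Medium", "No", "Fair", "No"),
    ("<=30", "Low", "Yes", "Fair", "Yes"),
    (">40", "Medium", "Yes", "Fair", "Yes"),
    ("<=30", "Medium", "Yes", "Excellent", "Yes"),
    ("31..40", "Medium", "No", "Excellent", "Yes"),
    ("31..40", "High", "Yes", "Fair", "Yes"),
    (">40", "Medium", "No", "Excellent", "No"),
]
ATTRS = ["Age", "Income", "Student", "Credit_Rating"]

def entropy(labels):
    # Shannon entropy of a list of class labels.
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr_idx):
    # Parent entropy minus the instance-weighted entropy after the split.
    parent = entropy([r[-1] for r in rows])
    groups = defaultdict(list)
    for r in rows:
        groups[r[attr_idx]].append(r[-1])
    weighted = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return parent - weighted

gains = {a: info_gain(DATA, i) for i, a in enumerate(ATTRS)}
for a, g in gains.items():
    print(a, round(g, 3))
best = max(gains, key=gains.get)
print("split on:", best)  # split on: Age
```

Age wins by a wide margin, which is why it becomes the root of the tree.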
Partitioning the training instances by Age:

age <= 30:
Instance  Age    Income  Student  Credit_Rating  Buys_Computer
1         <=30   High    No       Fair           No
2         <=30   High    No       Excellent      No
8         <=30   Medium  No       Fair           No
9         <=30   Low     Yes      Fair           Yes
11        <=30   Medium  Yes      Excellent      Yes

age = 31..40:
Instance  Age     Income  Student  Credit_Rating  Buys_Computer
3         31..40  High    No       Fair           Yes
7         31..40  Low     Yes      Excellent      Yes
12        31..40  Medium  No       Excellent      Yes
13        31..40  High    Yes      Fair           Yes

age > 40:
Instance  Age   Income  Student  Credit_Rating  Buys_Computer
4         >40   Medium  No       Fair           Yes
5         >40   Low     Yes      Fair           Yes
6         >40   Low     Yes      Excellent      No
10        >40   Medium  Yes      Fair           Yes
14        >40   Medium  No       Excellent      No
Branching below the root: for each Age branch, which of Income, Student, and Credit_Rating should be selected as the split node? The age <= 30 subset (instances 1, 2, 8, 9, 11; entropy = 0.971) is examined first.
Within the age <= 30 branch (Entropy = 0.971):
- If Income is the split node: Low (1 Yes) gives entropy 0; Medium (1 Yes, 1 No) gives 1; High (2 No) gives 0.
  Entropy(Income) = (1/5)(0) + (2/5)(1) + (2/5)(0) = 0.4; Information Gain = 0.971 - 0.4 = 0.571
- If Student is the split node: Yes (2 Yes) gives entropy 0; No (3 No) gives 0.
  Entropy(Student) = (2/5)(0) + (3/5)(0) = 0; Information Gain = 0.971 - 0 = 0.971 (Max.)
- In the same way, for Credit_Rating: Entropy(Credit_Rating) = 0.951; Information Gain = 0.020
- Student is therefore selected: the Student = Yes branch is pure "buy" (2 instances) and the Student = No branch is pure "don't buy" (3 instances). The 31..40 branch is already pure (all 4 buy).
Next, the age > 40 branch (Entropy = 0.971): which of Income, Student, and Credit_Rating should be selected as the split node? The subset is instances 4, 5, 6, 10, 14:

Instance  Age   Income  Student  Credit_Rating  Buys_Computer
4         >40   Medium  No       Fair           Yes
5         >40   Low     Yes      Fair           Yes
6         >40   Low     Yes      Excellent      No
10        >40   Medium  Yes      Fair           Yes
14        >40   Medium  No       Excellent      No
Computing in the same way as before for the age > 40 branch:
1) Income: Entropy(Income) = 0.951, Information Gain = 0.020
2) Student: Entropy(Student) = 0.951, Information Gain = 0.020
3) Credit_Rating: Entropy(Credit_Rating) = 0, Information Gain = 0.971 (Max.)

Conclusion: the finished tree (root Age; age <= 30 branches on Student; age 31..40 is pure; age > 40 branches on Credit_Rating, where Fair is all "buy" (3) and Excellent is all "don't buy" (2)). The classification rules for this tree can be jotted down as:
- Rule 1: If a person's age is at most 30 and he is not a student, he will not buy the product. Age(<=30) & Student(No) = No
- Rule 2: If a person's age is at most 30 and he is a student, he will buy the product. Age(<=30) & Student(Yes) = Yes
- Rule 3: If a person's age is between 31 and 40, he is most likely to buy. Age(31..40) = Yes
- Rule 4: If a person's age is greater than 40 and he has an excellent credit rating, he will not buy. Age(>40) & Credit_Rating(Excellent) = No
- Rule 5: If a person's age is greater than 40 with a fair credit rating, he will probably buy. Age(>40) & Credit_Rating(Fair) = Yes
We get the perfect decision tree!
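The five rules can be applied mechanically. A minimal sketch (the function name and string encodings are illustrative):

```python
def buys_computer(age: str, student: str, credit_rating: str) -> str:
    # The five rules read directly off the finished tree.
    if age == "<=30":
        return "Yes" if student == "Yes" else "No"       # Rules 1-2
    if age == "31..40":
        return "Yes"                                     # Rule 3
    return "Yes" if credit_rating == "Fair" else "No"    # Rules 4-5 (age > 40)

print(buys_computer("<=30", "Yes", "Fair"))     # Yes
print(buys_computer(">40", "No", "Excellent"))  # No
```

Because the leaves are all pure, these rules classify every one of the 14 training instances correctly, which is what "perfect" means here; on unseen data a tree this exact may of course overfit.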