
1 Data Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation. Lecture Notes for Chapter 4.

2 Classification: Definition. Given a collection of records (the training set), where each record contains a set of attributes and one of the attributes is the class, find a model for the class attribute as a function of the values of the other attributes. Goal: assign a class as accurately as possible to records whose class is not yet determined. A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets: the training set is used to build the model and the test set is used to validate it.

3 The vertebrate data set: input attributes and the class attribute (target attribute).

4 Illustrating Classification Task.

Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

Induction: a learning algorithm learns a model from the training set. Deduction: the model is then applied to the test set.

5 Evaluating the performance of a classification model. Confusion matrix: performance evaluation is based on the counts of test records correctly and incorrectly predicted by the model. Example confusion matrix for a binary classification problem, where f_ij is the number of records of class i predicted as class j (f_10 counts class-1 records wrongly predicted as class 0):

                   Predicted Class=1   Predicted Class=0
Actual Class=1          f_11                f_10
Actual Class=0          f_01                f_00

Accuracy = (number of correct predictions) / (total number of predictions) = (f_11 + f_00) / (f_11 + f_10 + f_01 + f_00)
Error rate = (number of wrong predictions) / (total number of predictions) = (f_10 + f_01) / (f_11 + f_10 + f_01 + f_00)
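
As a quick sketch of these definitions (the counts below are made up for illustration), accuracy and error rate can be computed directly from the four confusion-matrix cells:

```python
# Accuracy and error rate from a binary confusion matrix.
# f11, f00: correctly predicted class-1 and class-0 records;
# f10, f01: misclassified records (actual class 1 and 0, respectively).
def accuracy(f11, f10, f01, f00):
    return (f11 + f00) / (f11 + f10 + f01 + f00)

def error_rate(f11, f10, f01, f00):
    return (f10 + f01) / (f11 + f10 + f01 + f00)

# Hypothetical test-set counts: 40 + 45 correct, 10 + 5 wrong.
print(accuracy(40, 10, 5, 45))    # 0.85
print(error_rate(40, 10, 5, 45))  # 0.15
```

The two quantities always sum to 1, since every prediction is either correct or wrong.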

6 Examples of Classification Task: determining whether tumor cells are benign or malignant; deciding whether a credit card transaction is legitimate or fraudulent; classifying the secondary structure of a protein as alpha-helix, beta-sheet, or random coil; categorizing news stories as finance, weather, entertainment, sports, etc.

7 Classification Techniques: Decision Tree based Methods, Rule-based Methods, Memory based reasoning, Neural Networks, Naïve Bayes and Bayesian Belief Networks, Support Vector Machines.

8 Example of a Decision Tree.

Training Data:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model (decision tree): the root node tests the splitting attribute Refund (Yes -> leaf NO; No -> internal node MarSt). MarSt = Married -> leaf NO; MarSt = Single or Divorced -> test TaxInc (< 80K -> leaf NO; > 80K -> leaf YES). The terminal (leaf) nodes carry the class labels; Cheat is interpreted as "defaulted borrower".

9 Another Example of Decision Tree. On the same training data, a tree that tests MarSt first also fits: MarSt = Married -> NO; MarSt = Single or Divorced -> test Refund (Yes -> NO; No -> test TaxInc: < 80K -> NO, > 80K -> YES). There could be more than one tree that fits the same data!

10 Decision Tree Classification Task. As on slide 4, a tree induction algorithm learns a decision tree model from the training set (Learn Model / induction); the decision tree is then applied to the test set to classify the unlabeled records (Apply Model / deduction).

11 Apply Model to Test Data. Start from the root of the tree. Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ? At each node the record follows the branch matching its attribute value: Refund = No, then MarSt = Married.

16 Apply Model to Test Data. The test record (Refund = No, Marital Status = Married, Taxable Income = 80K) reaches the leaf under MarSt = Married, so the model assigns Cheat = No.

17 Decision Tree Classification Task. A tree induction algorithm learns a decision tree from the training set (Learn Model / induction); the tree is then used to classify the test set (Apply Model / deduction), exactly as illustrated on slide 10.

18 Decision Tree Induction. Many algorithms: Hunt's Algorithm (one of the earliest), CART, ID3, C4.5, SLIQ, SPRINT.

19 Tree Induction. Greedy strategy: split the records based on an attribute test that optimizes a certain criterion, i.e., select as the split the attribute that best satisfies the criterion. Issues: determining how to split the records (how should the attribute test condition be specified? how is the best split determined?) and determining when to stop splitting.

20 General structure of Hunt's algorithm. Let D_t be the set of training records that reach a node t. General procedure: (1) If D_t contains records that all belong to the same class y_t, then t is a leaf node labeled as y_t. (2) If D_t is an empty set, then t is a leaf node labeled by the default class y_d. (3) If D_t contains records that belong to more than one class, use an attribute test to split the data into smaller subsets, and recursively apply the procedure to each subset. (In the running example, Cheat is interpreted as "defaulted borrower".)
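
A minimal sketch of the three cases above in Python. The record/label representation and the take-the-next-attribute split choice are simplifications for illustration; the real algorithm selects the best attribute test at each step:

```python
from collections import Counter

def hunt(records, attributes, default_class):
    """Sketch of Hunt's algorithm; records are (attribute_dict, label) pairs."""
    if not records:                        # case 2: empty set -> default-class leaf
        return default_class
    labels = [y for _, y in records]
    if len(set(labels)) == 1:              # case 1: pure node -> leaf
        return labels[0]
    if not attributes:                     # no tests left -> majority-class leaf
        return Counter(labels).most_common(1)[0][0]
    attr = attributes[0]                   # (a real implementation picks the best test)
    majority = Counter(labels).most_common(1)[0][0]
    branches = {}
    for value in {r[attr] for r, _ in records}:   # case 3: split and recurse
        subset = [(r, y) for r, y in records if r[attr] == value]
        branches[value] = hunt(subset, attributes[1:], majority)
    return (attr, branches)
```

Internal nodes come back as `(attribute, branches)` pairs and leaves as plain class labels.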

21 Hunt's Algorithm (applied to the loan data). Start with a single node, Defaulted = No. First split on Home Owner (Yes -> Defaulted = No; No -> Defaulted = No). Next, split the Home Owner = No branch on Marital Status (Single, Divorced -> Defaulted = Yes; Married -> Defaulted = No). Finally, split the Single/Divorced branch on Taxable Income (< 80K -> Defaulted = No; >= 80K -> Defaulted = Yes).

22 How do we specify the attribute test condition? It depends on the attribute type: nominal, ordinal, or continuous. It also depends on the number of ways to split: binary split or multi-way split.

23 Splitting based on nominal attributes. Multi-way split: use as many partitions as there are distinct attribute values. Binary split: divide the attribute values into two subsets (the optimal partitioning must be found).

24 Splitting based on ordinal attributes (e.g., Size with values Small, Medium, Large). Multi-way split: use as many partitions as there are distinct values (Small / Medium / Large). Binary split: divide the values into two subsets, e.g., {Small, Medium} vs. {Large} or {Medium, Large} vs. {Small} (the optimal partitioning must be found). But what about the split {Small, Large} vs. {Medium}? It does not preserve the order of the values.

25 Splitting based on continuous attributes. Two ways to handle a continuous attribute: (1) Discretization, to turn it into an ordinal attribute — static: discretize once at the start; dynamic: discretize each time a split is needed, using equal-width or equal-frequency intervals, clustering, etc. (2) Binary decision: (A < v) or (A >= v) — consider all possible splits and find the best one, which can require a large amount of computation.

26 Splitting based on continuous attributes: (i) binary split, e.g., Taxable Income > 80K? Yes/No; (ii) multi-way split, e.g., Taxable Income in < 10K, [10K, 25K), [25K, 50K), [50K, 80K), > 80K.

27 Tree Induction. Greedy strategy: split the records based on an attribute test that optimizes a certain criterion, i.e., select as the split the attribute that best satisfies the criterion. Issues: determining how to split the records (how should the attribute test condition be specified? how is the best split determined?) and determining when to stop splitting.

28 How to determine the best split? Before splitting: 10 records of class 0 and 10 records of class 1. Candidate tests: Own Car? (Yes: C0 6 / C1 4, No: C0 4 / C1 6); Car Type? (Family: C0 1 / C1 3, Sports: C0 8 / C1 0, Luxury: C0 1 / C1 7); Student ID? (one single-record partition per student, c_1 ... c_20, each pure). Which test condition is the best? Splitting on Car Type yields partitions of higher purity than splitting on Own Car. To make this precise, we introduce the notion of impurity (an impurity measure) and split so as to reduce impurity.

29 How to determine the best split? Greedy approach: split so that each resulting node has as homogeneous a class distribution as possible; this requires a measure of node impurity. Example: C0: 5 / C1: 5 is non-homogeneous (high degree of impurity), while C0: 9 / C1: 1 is homogeneous (low degree of impurity).

30 How to determine the best split? A measure of the class proportions at a node: p(i|t) is the fraction of records of class i at node t. For two classes 0 and 1, p(1|t) = 1 - p(0|t); it is also written briefly as p_i. In the example on the right, the class distribution before splitting is (0.5, 0.5); splitting on Own Car gives (0.6, 0.4) and (0.4, 0.6), while splitting on Car Type gives (1/4, 3/4), (1, 0), and (1/8, 7/8). A good split makes the class distribution at each node as skewed as possible.

31 Impurity measures.
Gini index: GINI(t) = 1 - Σ_{i=0}^{c-1} [p(i|t)]²
Entropy: Entropy(t) = - Σ_{i=0}^{c-1} p(i|t) log₂ p(i|t)
Misclassification error: Error(t) = 1 - max_i [p(i|t)]
When the classes are well separated, so that the distribution is skewed toward one class, impurity is small; when the classes are evenly distributed, impurity is large.
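
The three measures can be sketched directly from these formulas; here `p` is the list of class proportions p(i|t) at a node:

```python
import math

def gini(p):
    """Gini index: 1 - sum of squared class proportions."""
    return 1 - sum(pi ** 2 for pi in p)

def entropy(p):
    """Entropy in bits; 0 * log2(0) is treated as 0."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def classification_error(p):
    """Misclassification error: 1 - largest class proportion."""
    return 1 - max(p)

# A uniform two-class node is maximally impure under all three measures:
print(gini([0.5, 0.5]), entropy([0.5, 0.5]), classification_error([0.5, 0.5]))
# 0.5 1.0 0.5
```

A pure node, e.g. `[1.0, 0.0]`, scores 0 under all three.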

32 Information gain: the difference between the impurity of the parent node before splitting and the weighted impurity of the child nodes after splitting:

Δ = I(parent) - Σ_{j=1}^{k} [N(v_j) / N] · I(v_j)

where N is the total number of records at the parent node, k is the number of attribute values, and N(v_j) is the number of records associated with child node v_j. Maximizing the gain is equivalent to minimizing the weighted average impurity of the children. If I(·) is the entropy, then Δ_info is called the information gain.
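
A sketch of this gain computation, using per-class count lists for the parent and each child and the Gini index as the impurity I; the example counts are the Car Type split from slide 28:

```python
def gini(counts):
    """Gini impurity from a list of per-class record counts."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def information_gain(parent, children, impurity=gini):
    """Delta = I(parent) - sum_j N(v_j)/N * I(v_j)."""
    n = sum(parent)
    weighted = sum(sum(child) / n * impurity(child) for child in children)
    return impurity(parent) - weighted

# Car Type split: Family (1, 3), Sports (8, 0), Luxury (1, 7) from a (10, 10) parent.
gain = information_gain([10, 10], [[1, 3], [8, 0], [1, 7]])
```

Passing `entropy` instead of `gini` as `impurity` gives the information gain in the strict sense.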

33 Tree Induction. Greedy strategy: split the records based on an attribute test that optimizes a certain criterion. Issues: determining how to split the records (how should the attribute test condition be specified? how is the best split determined? -- the Gini index) and determining when to stop splitting.

34 Impurity measure: GINI. Gini index for a given node t: GINI(t) = 1 - Σ_j [p(j|t)]², where p(j|t) is the fraction of records of class j at node t. Maximum (1 - 1/n_c) when records are equally distributed among all classes, implying least interesting information; minimum (0) when all records belong to one class, implying most interesting information. Examples: C1 0 / C2 6 -> Gini = 0.000; C1 1 / C2 5 -> Gini = 0.278; C1 2 / C2 4 -> Gini = 0.444; C1 3 / C2 3 -> Gini = 0.500.

35 Impurity measure: GINI. GINI(t) = 1 - Σ_j [p(j|t)]². Example: with 2 classes, each holding half of the records: 1 - ((1/2)² + (1/2)²) = 1 - 1/2 = 1/2.

36 Examples of computing GINI. GINI(t) = 1 - Σ_j [p(j|t)]².
C1 0 / C2 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1; Gini = 1 - [P(C1)² + P(C2)²] = 1 - 0 - 1 = 0.
C1 1 / C2 5: P(C1) = 1/6, P(C2) = 5/6; Gini = 1 - [(1/6)² + (5/6)²] = 0.278.
C1 2 / C2 4: P(C1) = 2/6, P(C2) = 4/6; Gini = 1 - [(2/6)² + (4/6)²] = 0.444.

37 Splitting based on GINI. Used in CART, SLIQ, SPRINT. When a node p is split into k partitions (children), the quality of the split is computed as

GINI_split = Σ_{i=1}^{k} (n_i / n) · GINI(i)

where n_i = number of records at child i and n = number of records at node p.

38 Binary attributes: computing the GINI index. The node splits into two partitions; effect of weighing partitions: larger and purer partitions are sought. Split on attribute A: node N1 holds (4, 3), node N2 holds (2, 3).
Gini(N1) = 1 - [(4/7)² + (3/7)²] = 0.490
Gini(N2) = 1 - [(2/5)² + (3/5)²] = 0.480
Weighted average over the children: Gini = (7/12) × 0.490 + (5/12) × 0.480 = 0.486.

39 Binary attributes: computing the GINI index. Split on attribute B: node N1 holds (1, 4), node N2 holds (5, 2).
Gini(N1) = 1 - (1/5)² - (4/5)² = 0.320
Gini(N2) = 1 - (5/7)² - (2/7)² = 0.408
Weighted average: Gini = (5/12) × 0.320 + (7/12) × 0.408 = 0.371. Since the Gini index for attribute B is smaller, splitting on B is a better split than splitting on A.
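
The two computations can be checked with a short sketch, using per-class counts as on the slides (split A gives children (4, 3) and (2, 3); split B gives (1, 4) and (5, 2)):

```python
def gini(counts):
    """Gini impurity from per-class record counts at one node."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def weighted_gini(children):
    """GINI_split: weighted average of the children's Gini indexes."""
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * gini(child) for child in children)

gini_a = weighted_gini([[4, 3], [2, 3]])  # split on attribute A
gini_b = weighted_gini([[1, 4], [5, 2]])  # split on attribute B
```

`gini_b < gini_a`, so the split on B is preferred, matching the slide's conclusion.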

40 Nominal attributes: computing the Gini index. A nominal attribute allows not only binary splits but also a multi-way split, as in the example below. The multi-way split has a smaller Gini index than the two-way splits, since a two-way split merges some of the multi-way partitions and therefore has lower purity.
Multi-way split on Car Type, with partitions Family (1, 3), Sports (8, 0), Luxury (1, 7):
Gini(Family) = 1 - (1/4)² - (3/4)² = 0.375
Gini(Sports) = 1 - (8/8)² - (0)² = 0
Gini(Luxury) = 1 - (1/8)² - (7/8)² = 0.219
Weighted Gini = (4/20) × 0.375 + (8/20) × 0 + (8/20) × 0.219 = 0.163
Two-way split: find the best partition of the attribute values.

41 Continuous attributes: computing the Gini index. Use binary decisions based on one value (e.g., Taxable Income > 80K? Yes/No). There are several choices for the splitting value: the number of possible splitting values equals the number of distinct values. Each splitting value v has a count matrix associated with it: the class counts in each of the partitions A < v and A >= v. A simple method to choose the best v: for each v, scan the database to gather the count matrix and compute its Gini index. This is computationally inefficient, since work is repeated.

42 Continuous attributes: computing the Gini index. In the example below, every value of Taxable Income is used as a candidate split position, splitting at a suitable value between neighbors (e.g., for the attribute values 100 and 120, split at the midpoint 110). The Gini index is then computed for each candidate interval; the lowest Gini is obtained at v = 97. Sample computations (sorted values 60, 70, 75, 85, 90, 95, 100, 120, 125, 220):
Split at 55 (all records fall on one side): Gini = 1 - (3/10)² - (7/10)² = 0.42
Split at 65: Gini(left) = 1 - (1)² - (0)² = 0; Gini(right) = 1 - (3/9)² - (6/9)² = 0.444
Weighted Gini at 65 = (1/10) × 0 + (9/10) × 0.444 = 0.400
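
The efficient single-scan procedure can be sketched as follows: sort once, then update the class counts incrementally instead of rescanning the database for every candidate. Note the sketch uses exact midpoints, so the best threshold comes out as 97.5 rather than the slide's rounded split position 97:

```python
def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts) if n else 0.0

def best_split(values, labels):
    """Scan candidate thresholds (midpoints between sorted distinct values),
    maintaining class counts on each side; returns (threshold, weighted Gini)."""
    pairs = sorted(zip(values, labels))
    classes = sorted(set(labels))
    left = {c: 0 for c in classes}
    right = {c: sum(1 for l in labels if l == c) for c in classes}
    n = len(labels)
    best = (None, float("inf"))
    for i in range(len(pairs) - 1):
        left[pairs[i][1]] += 1            # move one record from right to left
        right[pairs[i][1]] -= 1
        if pairs[i][0] == pairs[i + 1][0]:
            continue                      # no valid threshold between equal values
        t = (pairs[i][0] + pairs[i + 1][0]) / 2
        nl = i + 1
        w = (nl / n) * gini(list(left.values())) \
            + ((n - nl) / n) * gini(list(right.values()))
        if w < best[1]:
            best = (t, w)
    return best
```

On the slide's ten Taxable Income values this finds the weighted Gini 0.300 between 95 and 100.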

43 Tree Induction. Greedy strategy: split the records based on an attribute test that optimizes a certain criterion. Issues: determining how to split the records (how should the attribute test condition be specified? how is the best split determined? -- entropy) and determining when to stop splitting.

44 Alternative splitting criteria based on INFO. Entropy at a given node t: Entropy(t) = - Σ_j p(j|t) log₂ p(j|t), where p(j|t) is the relative frequency of class j at node t. It measures the homogeneity of a node: maximum (log n_c) when records are equally distributed among all classes, implying least information; minimum (0.0) when all records belong to one class, implying most information. Entropy-based computations are similar to the GINI index computations.

45 Alternative Splitting Criteria based on INFO 45

46 Examples of computing entropy. Entropy(t) = - Σ_j p(j|t) log₂ p(j|t).
C1 0 / C2 6: P(C1) = 0, P(C2) = 1; Entropy = -0 log 0 - 1 log 1 = 0 - 0 = 0.
C1 1 / C2 5: P(C1) = 1/6, P(C2) = 5/6; Entropy = -(1/6) log₂(1/6) - (5/6) log₂(5/6) = 0.65.
C1 2 / C2 4: P(C1) = 2/6, P(C2) = 4/6; Entropy = -(2/6) log₂(2/6) - (4/6) log₂(4/6) = 0.92.

47 Splitting based on INFO: information gain.

GAIN_split = Entropy(p) - Σ_{i=1}^{k} (n_i / n) Entropy(i)

The parent node p is split into k partitions; n_i is the number of records in partition i. The gain measures the reduction in entropy achieved by the split; choose the split that achieves the most reduction (maximizes GAIN). Used in ID3 and C4.5. Disadvantage: it tends to prefer splits that result in a large number of partitions, each being small but pure.

48 Splitting based on INFO: gain ratio.

GainRATIO_split = GAIN_split / SplitINFO, where SplitINFO = - Σ_{i=1}^{k} (n_i / n) log(n_i / n)

The parent node p is split into k partitions; n_i is the number of records in partition i. The gain ratio adjusts the information gain by the entropy of the partitioning (SplitINFO): higher-entropy partitionings (a large number of small partitions) are penalized. Used in C4.5; designed to overcome the disadvantage of information gain.
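
A sketch of SplitINFO and the gain ratio; the example shows how an even 4-way partitioning is penalized twice as hard as an even 2-way one for the same gain:

```python
import math

def split_info(sizes):
    """SplitINFO = -sum(n_i/n * log2(n_i/n)) over the partition sizes."""
    n = sum(sizes)
    return -sum(s / n * math.log2(s / n) for s in sizes if s)

def gain_ratio(gain, sizes):
    """Information gain normalized by the entropy of the partitioning."""
    return gain / split_info(sizes)

print(split_info([10, 10]))      # 1.0  (even 2-way split)
print(split_info([5, 5, 5, 5]))  # 2.0  (even 4-way split)
```

With the same gain of 0.5, the 4-way split's gain ratio is only half that of the 2-way split.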

49 Tree Induction. Greedy strategy: split the records based on an attribute test that optimizes a certain criterion. Issues: determining how to split the records (how should the attribute test condition be specified? how is the best split determined? -- classification error) and determining when to stop splitting.

50 Examples of computing classification error. Error(t) = 1 - max_i P(i|t).
C1 0 / C2 6: P(C1) = 0, P(C2) = 1; Error = 1 - max(0, 1) = 1 - 1 = 0.
C1 1 / C2 5: P(C1) = 1/6, P(C2) = 5/6; Error = 1 - max(1/6, 5/6) = 1 - 5/6 = 1/6.
C1 2 / C2 4: P(C1) = 2/6, P(C2) = 4/6; Error = 1 - max(2/6, 4/6) = 1 - 4/6 = 1/3.

51 Splitting criteria based on classification error. Classification error at a node t: Error(t) = 1 - max_i P(i|t). It measures the misclassification error made by a node: maximum (1 - 1/n_c) when records are equally distributed among all classes, implying least interesting information; minimum (0.0) when all records belong to one class, implying most interesting information.

52 Training error rates e(T_L) and e(T_R) of two candidate trees (figure).

53 Splitting Criteria based on Classification Error 53

54 Comparison among splitting criteria. For a 2-class problem: when the proportion p of one class is very low or very high, impurity is low; when p is close to 0.5 (in binary classification), impurity is high.

55 Misclassification error vs. Gini. Split on attribute A: node N1 holds (3, 0), node N2 holds (4, 3).
ME(Parent) = 1 - max(7, 3)/10 = 1 - 7/10 = 0.3
Gini(N1) = 1 - (3/3)² - (0/3)² = 0
Gini(N2) = 1 - (4/7)² - (3/7)² = 0.489
Gini(Children) = (3/10) × 0 + (7/10) × 0.489 = 0.342 -> Gini improves!
ME(N1) = 1 - max(3, 0)/3 = 0
ME(N2) = 1 - max(4, 3)/7 = 1 - 4/7 = 3/7
ME(Children) = (3/10) × 0 + (7/10) × (3/7) = 0.3 -> the misclassification error does not improve.

56 Tree Induction. Greedy strategy: split the records based on an attribute test that optimizes a certain criterion. Issues: determining how to split the records (how should the attribute test condition be specified? how is the best split determined?) and determining when to stop splitting.

57 When to stop growing the tree. Stop expanding a node when all of its records belong to the same class; stop when all of its records have the same (or similar) attribute values; stop when the number of records at the node falls below a threshold. Other early-termination criteria also exist.

58 Advantages of decision-tree-based classification: inexpensive to construct; extremely fast at classifying unknown records; easy to interpret for small-sized trees; accuracy comparable to other classification techniques for many simple data sets.

59 Decision tree example: C4.5. Simple depth-first construction; uses information gain; sorts continuous attributes at each node; needs the entire data set to fit in memory, making it unsuitable for large data sets (it would need out-of-core sorting). You can download the software from:

60 Major issues for decision trees.
Overfitting: (typically) the tree grows so large and detailed that accuracy actually drops; the model over-fits the training set, so the error on the test set or on unclassified real data grows larger instead (cf. underfitting).
Missing values: records with missing values can lead to an inaccurate tree.
Costs of classification: for massive, high-dimensional, or complex data sets, classification can take a long time, and building a highly accurate decision tree can be expensive.

61 Underfitting and overfitting (example). 500 circular and 500 triangular data points. Circular points: 0.5 <= sqrt(x1² + x2²) <= 1. Triangular points: sqrt(x1² + x2²) < 0.5 or sqrt(x1² + x2²) > 1.

62 Underfitting and overfitting. Overfitting: as the tree keeps splitting on attribute values, the classification error on the training set keeps decreasing; but because the tree has been split too finely, the error on the test set starts increasing at some point. Underfitting: when the model is too simple, both the training and test errors are large.

63 Overfitting due to noise: the decision boundary is distorted by a noise point.

64 Overfitting due to insufficient examples. Overfitting often arises from too few training records. An insufficient number of training records in a region causes the decision tree to predict the test examples using other training records that are irrelevant to the classification task. (In the figure, the filled red class has only two training points; learning from such sparse data yields a boundary very different from the true decision boundary shown in green.)

65 Overfitting due to multiple comparison procedures. Starting from an initial model M, if adding a candidate component γ yields a gain, an alternative model containing it may be adopted. Many candidate additions γ can exist; when the number of alternatives is large, a wrong choice may be made unintentionally, which can also cause overfitting.

66 Estimating the generalization error. The generalization error should be reduced as much as possible; but since only the training set is available when building the model, the generalization error must be estimated indirectly and minimized through that estimate.
1. Re-substitution estimate: assume the training set represents the overall data well, so that the training error (re-substitution error) can serve as an estimate of the generalization error. The decision tree induction algorithm then simply selects the model with the lowest training error as its final model. Of course, the training error is not a good estimate of the generalization error!
2. Incorporating model complexity: the more complex the model, the more likely overfitting becomes, so model complexity should be taken into account when estimating the generalization error.
3. Pessimistic estimate: view the generalization error as the sum of the training error and a penalty term for model complexity (a penalty value is added for each leaf node).

67 Estimating the generalization error (continued).
4. Minimum Description Length principle (described on the following slides).
5. Statistical bounds: the generalization error can be estimated as a statistical correction of the training error; since the generalization error is usually larger than the training error, it is sometimes estimated as an upper bound on the training error.
6. Using a validation set: divide the original training set into two smaller subsets, using one for training and the other as a validation set.

68 Re-substitution estimate: using the training error as an optimistic estimate of the generalization error. For the two example trees, decision tree T_L has training error rate e(T_L) = 4/24 and decision tree T_R has e(T_R) = 6/24; by this estimate, T_L would be preferred.

69 Incorporating model complexity. Rationale: Occam's razor. Given two models with similar generalization errors, one should prefer the simpler model over the more complex one, because a complex model has a greater chance of being fitted accidentally by errors in the data. Therefore, model complexity should be included when evaluating a model.

70 Occam's razor. Definition: given two models with the same generalization error, the simpler model is preferred over the more complex one, because in a complex model there is a greater chance of being fitted accidentally by errors present in the data.

71 Pessimistic estimate. With e(T_L) = 4/24, e(T_R) = 6/24, and a penalty Ω = 1 per leaf:
e'(T_L) = (4 + 7 × 1)/24 = 11/24 ≈ 0.458 (7 leaf nodes, penalty 1 each)
e'(T_R) = (6 + 4 × 1)/24 = 10/24 ≈ 0.417 (4 leaf nodes, penalty 1 each)

72 Pessimistic estimate: view the generalization error as the training error plus a penalty for model complexity (a penalty value added for each leaf node). Given a decision tree node t, let n(t) be the number of training records classified by t and e(t) the misclassification error of node t. The pessimistic training error of tree T is

e'(T) = Σ_i [e(t_i) + Ω(t_i)] / Σ_i n(t_i) = [e(T) + Ω(T)] / N

where Ω is the cost of adding a node and N is the total number of training records.
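
With a uniform per-leaf penalty Ω(t_i) = omega, the formula reduces to the sketch below; the first two checks reproduce the e'(T_L) and e'(T_R) values from slide 71 (Ω = 1):

```python
def pessimistic_error(n_errors, n_leaves, n_records, omega=0.5):
    """e'(T) = [e(T) + Omega(T)] / N with a fixed penalty omega per leaf node."""
    return (n_errors + omega * n_leaves) / n_records

print(pessimistic_error(4, 7, 24, omega=1))  # e'(T_L) = 11/24
print(pessimistic_error(6, 4, 24, omega=1))  # e'(T_R) = 10/24
```

The default `omega=0.5` matches the penalty used in the post-pruning example later in these notes.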

73 Minimum Description Length: halt the growth of the tree when the encoding is minimized (cf. Occam's razor). Most data mining tasks can be described as creating a model for the data; e.g., K-means models the data as a set of centroids. Occam's razor: all other things being equal, the simplest model is the best.

74 Minimum Description Length. Then, what is a simple model? The Minimum Description Length principle: every model provides an encoding of our data; the model that gives the shortest encoding (best compression) of the data is the best. MDL restricts the family of models considered. Encoding cost: the cost for party A to transmit the data to party B.

75 Minimum Description Length. The description length consists of two terms: the cost of describing the model (model cost) and the cost of describing the data given the model (data cost): L(D) = L(M) + L(D|M). There is a tradeoff between the two costs: very complex models describe the data in a lot of detail but are expensive to describe, while very simple models are cheap to describe but require a lot of work to describe the data given the model.

76 Minimum Description Length. Regression example: find the polynomial for describing the data — complexity of the model vs. goodness of fit. A simple fit has low model cost but high data cost; an over-complex fit has high model cost but low data cost; the MDL choice achieves both low model cost and low data cost. MDL avoids overfitting automatically! Source: Grünwald et al. (2005), Advances in Minimum Description Length: Theory and Applications.

77 Estimating statistical bounds. The generalization error can be estimated as a statistical correction of the training error:

e'(N, e, α) = [ e + z²_{α/2}/(2N) + z_{α/2} · sqrt( e(1-e)/N + z²_{α/2}/(4N²) ) ] / (1 + z²_{α/2}/N)

where α is the confidence level, z_{α/2} is the standardized value from the standard normal distribution, and N is the total number of training records used to compute e. Example: before splitting, a leaf with (+: 5, -: 2) has e = 2/7 and e'(7, 2/7, 0.25) = 0.503, so e'(T) = 0.503. After splitting into leaves (+: 3, -: 1) and (+: 2, -: 1): e(T_L) = 1/4 with e'(4, 1/4, 0.25) = 0.537, and e(T_R) = 1/3 with e'(3, 1/3, 0.25) = 0.650, giving e'(T) = (4 × 0.537 + 3 × 0.650)/7 = 0.585.
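
The bound can be sketched as below. The value z = 1.15 (roughly z_{α/2} for α = 0.25) is an assumption chosen so the sketch reproduces the slide's numbers; the source does not state z explicitly:

```python
import math

def pessimistic_bound(n, e, z=1.15):
    """Upper confidence bound e'(N, e, alpha) on a node's error rate;
    z ~ 1.15 is assumed here as z_{alpha/2} for alpha = 0.25."""
    z2 = z * z
    numerator = e + z2 / (2 * n) + z * math.sqrt(e * (1 - e) / n + z2 / (4 * n * n))
    return numerator / (1 + z2 / n)
```

Evaluating at the slide's three leaves gives approximately 0.503, 0.537, and 0.650, as quoted above.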

78 Using a validation set. Divide the original training set into two smaller subsets: one is used for training, the other as a validation set. Training set: used for model building; validation set: used for estimating the generalization error. Note: the validation set is not the same as the test set. Drawback: less data is available for training.

79 Notes on overfitting. Overfitting results in decision trees that are more complex than necessary. Minimizing the training error does not necessarily yield the best decision tree. So, how can overfitting be reduced?

80 How to address overfitting: pre-pruning (early stopping rule). Stop the tree-growing algorithm before a fully grown tree that perfectly fits the entire training data is produced. Typical stopping conditions for a node: stop if all instances belong to the same class; stop if all the attribute values are the same. More restrictive stopping conditions: stop if the number of instances is less than some user-specified threshold; stop if the class distribution of the instances is independent of the available features (e.g., using a χ² test); stop if expanding the current node does not improve impurity measures (e.g., Gini or information gain). The advantage of the more restrictive conditions is that they prevent complex subtrees that excessively overfit the training data from being generated.

81 How to address overfitting: post-pruning. First grow the decision tree to its maximum size; then trim the fully grown tree in a bottom-up fashion: (1) if the generalization error improves after trimming, replace the sub-tree with a leaf node; (2) the sub-tree is replaced by a single leaf node whose class label is the majority class of the sub-tree. Because it prunes a fully grown tree, post-pruning tends to give better results than pre-pruning, whose tree-growing process can terminate too early; however, the computation required to grow the tree fully may be wasted.

82 Example of post-pruning. At the node being considered: Class = Yes 20, Class = No 10, error = 10/30.
As a single node: training error = 1 - max(20, 10)/30 = 1 - 20/30 = 10/30; pessimistic error = (10 + 1 node × 0.5)/30 = 10.5/30.
As a 4-node subtree split on A, with leaf counts (Yes 8, No 4), (Yes 3, No 4), (Yes 4, No 1), (Yes 5, No 1): training error = (4 + 3 + 1 + 1)/30 = 9/30; pessimistic error = (9 + 4 nodes × 0.5)/30 = 11/30.
The subtree's pessimistic error is larger than the single node's, so the subtree is pruned.

83 Handling missing attribute values. Missing values affect decision tree construction in three different ways: they affect how impurity measures are computed; they affect how an instance with missing values is distributed to child nodes; and they affect how a test instance with missing values is classified.

84 Computing the impurity measure with a missing value. Before splitting: Entropy(Parent) = -0.3 log(0.3) - 0.7 log(0.7) = 0.8813. Split on Home Owner (one record's Home Owner value is missing): Entropy(Home = Yes) = 0; Entropy(Home = No) = -(2/6) log(2/6) - (4/6) log(4/6) = 0.9183. Entropy(Children) = 0.3 × 0 + 0.6 × 0.9183 = 0.551. Gain = 0.9 × (0.8813 - 0.551) = 0.3303.

85 Computing the impurity measure (continued). Before splitting: Entropy(Parent) = -0.3 log(0.3) - 0.7 log(0.7) = 0.8813. Split on Home Owner: for Home = Yes, P(C1) = 0/3 and P(C2) = 3/3, so Entropy = -(0) log₂(0) - (1) log₂(1) = 0; for Home = No, P(C1) = 2/6 and P(C2) = 4/6, so Entropy = -(2/6) log₂(2/6) - (4/6) log₂(4/6) = 0.9183. Entropy(Children) = 0.3 × 0 + 0.6 × 0.9183 = 0.551; Gain = 0.9 × (0.8813 - 0.551) = 0.3303.

86 Distributing instances with a missing value. A record whose Home Owner value is missing could go to either the Yes or the No branch. Known records: Home = Yes (Class=Yes 0, Class=No 3); Home = No (Class=Yes 2, Class=No 4). The probability that Home = Yes is 3/9 and that Home = No is 6/9, so the record is assigned to the left child with weight 3/9 and to the right child with weight 6/9, giving Home = Yes (Class=Yes 0 + 3/9, Class=No 3) and Home = No (Class=Yes 2 + 6/9, Class=No 4).

87 Classifying an instance with a missing value. New record: Refund = ?, Marital Status = Married. Weighted class counts at the MarSt node:

            Married  Single  Divorced  Total
Class=No       3       1        0        4
Class=Yes     6/9      1        1       2.67
Total        3.67      2        1       6.67

Probability that Marital Status = Married is 3.67/6.67; probability that Marital Status = {Single, Divorced} is 3/6.67.

88 Other Issues: Data Fragmentation, Search Strategy, Expressiveness, Tree Replication.

89 Data fragmentation. As the tree is developed, the questions are selected on the basis of less and less data: the number of records gets smaller as you traverse down the tree, and the number of records at the leaf nodes can become too small to make any statistically significant decision. A lower bound on the number of records per leaf node can be introduced into the stopping criterion.

90 Search Strategy Finding an optimal decision tree is NP-hard The algorithm presented so far uses a greedy, top-down, recursive partitioning strategy to induce a reasonable solution Other strategies? Bottom-up Bi-directional 90

91 Expressiveness (1/2). Decision trees provide an expressive representation for learning discrete-valued functions, but they do not generalize well to certain types of Boolean functions (e.g., XOR or parity functions), where accurate modeling requires building the complete tree. Example, the parity function: Class = 1 if there is an even number of Boolean attributes with truth value True; Class = 0 if there is an odd number of Boolean attributes with truth value True. For accurate modeling, a complete tree is required. Decision trees are also not expressive enough for modeling continuous variables, particularly when each test condition involves only a single attribute at a time.

92 Expressiveness (2/2). Decision trees can express any function of the input attributes (e.g., for Boolean functions, truth tables). Trivially, there is a consistent decision tree for any training set, with one path to a leaf for each example, but it probably won't generalize to new examples. In other words, for any training set a decision tree consistent with it exists; in practice, however, we often prefer a compact decision tree over a complete one grown to full depth.

93 Oblique decision trees (oblique splits). Splits that are not orthogonal to the attribute axes (oblique splits) are also possible, allowing a more expressive and more compact tree. Example test condition: x + y < 1, separating Class = + from Class = -. The test condition may involve multiple attributes, giving a more expressive representation, but finding the optimal test condition is computationally expensive.

94 Oblique decision trees (continued). Oblique splits are possible in decision trees but have many limitations; the example below shows a case where linear regression is easier to apply than a decision tree.

95 Decision boundary (1/2). The decision boundary is determined by the test conditions. Example tree: x < 0.43? Yes -> y < 0.47? (Yes: 4/0, No: 0/4); No -> y < 0.33? (Yes: 0/3, No: 4/0). The border line between two neighboring regions of different classes is known as the decision boundary. Here the decision boundary is parallel to the axes because each test condition involves a single attribute at a time.

96 Decision boundary (2/2). The decision boundary is determined by the test conditions; the figure shows a depth-4 decision tree and the axis-parallel decision boundary it induces (regions a, b, c, d in the x-y plane).

97 Tree replication. As in the figure below, the same subtree can appear replicated in multiple branches of a decision tree (e.g., the subtree rooted at Q, with child S, under both P and R). This makes the tree more complex than necessary and harder to interpret. It can occur because each node's test condition is built from a single attribute.

98 Model Evaluation. Metrics for performance evaluation: how to evaluate the performance of a model? Methods for performance evaluation: how to obtain reliable estimates? Methods for model comparison: how to compare the relative performance among competing models?

99 Metrics for performance evaluation. Focus on the predictive capability of a model rather than on how fast it classifies or builds models, scalability, etc. Confusion matrix (actual class in rows, predicted class in columns, both Yes/No): a = TP (true positive), b = FN (false negative), c = FP (false positive), d = TN (true negative).

100 Metrics for performance evaluation (continued). The most widely used metric:

Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)

101 Limitation of accuracy. Consider a 2-class problem with 9990 examples of class 0 and 10 examples of class 1. If the model predicts everything to be class 0, its accuracy is 9990/10000 = 99.9%. Accuracy is misleading here because the model does not detect any class 1 example.

102 Cost matrix. C(i|j) is the cost of misclassifying a class j example as class i. For a two-class problem the matrix has entries C(Yes|Yes) and C(No|Yes) for actual class Yes, and C(Yes|No) and C(No|No) for actual class No.

103 Cost vs. accuracy. With counts a, b, c, d as before and N = a + b + c + d, Accuracy = (a + d)/N. Accuracy is proportional to cost if (1) C(Yes|No) = C(No|Yes) = q and (2) C(Yes|Yes) = C(No|No) = p. Then:

Cost = p (a + d) + q (b + c)
     = p (a + d) + q (N - a - d)
     = q N - (q - p)(a + d)
     = N [q - (q - p) × Accuracy]
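
The identity can be checked numerically with a short sketch; the counts a, b, c, d and costs p, q below are arbitrary illustration values:

```python
def total_cost(a, b, c, d, p, q):
    """Cost = p*(a + d) + q*(b + c) under the symmetric cost matrix above."""
    return p * (a + d) + q * (b + c)

def cost_from_accuracy(n, acc, p, q):
    """Equivalent form: Cost = N * [q - (q - p) * Accuracy]."""
    return n * (q - (q - p) * acc)
```

Both forms give the same number for any counts, which is why ranking models by accuracy and ranking them by cost coincide in this symmetric case.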

104 Computing the cost of classification. Given a cost matrix C(i|j), two models evaluated on the same data can rank differently under accuracy and under cost: Model M1 has accuracy 80% and cost 3910, while Model M2 has accuracy 90% and cost 4255 — the more accurate model incurs the higher cost.

105 Cost-sensitive measures.
Precision (positive predictive value): p = TP / (TP + FP) — of the cases the model claims are positive, the fraction that really are.
Recall (sensitivity): r = TP / (TP + FN) — of the actual positives, the fraction the model finds.
F-measure: F = 2rp / (r + p), the harmonic mean of precision and recall.
Example: suppose that among 10,000 packets, truly malicious packets are very rare. Precision measures how many of the packets flagged as malicious really are malicious; recall measures how many of the actual malicious packets are found. When real attack packets are very rare (and finding the real attacks matters, i.e., high recall), a detector that flags packets indiscriminately (producing many false positives) improves recall but degrades precision. Hence the harmonic mean of the two, the F-measure, becomes important.
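
A sketch of the three measures; the packet-inspection numbers are made up for illustration (10 true detections, 90 false alarms, no missed attacks):

```python
def precision(tp, fp):
    """Fraction of flagged positives that are truly positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of actual positives that are flagged."""
    return tp / (tp + fn)

def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

# An indiscriminate detector: perfect recall, poor precision, low F.
print(recall(10, 0), precision(10, 90))  # 1.0 0.1
```

The F-measure for this detector is about 0.18, far below its recall of 1.0, exposing the false-alarm flood.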

106 Model Evaluation. Metrics for performance evaluation: how to evaluate the performance of a model? Methods for performance evaluation (performance evaluation using test data): how to obtain reliable estimates? Methods for model comparison: how to compare the relative performance among competing models?

107 Methods for performance evaluation (using test data). How to obtain a reliable estimate of performance? The performance of a model may depend on factors other than the learning algorithm: the class distribution, the cost of misclassification, and the sizes of the training and test sets.

108 Methods of estimation (1/4: evaluating classifier performance). Holdout: e.g., use 2/3 of the data for training and 1/3 for testing; with too little training data the model becomes poor, and the result depends on how the training and test sets are composed. Random subsampling: repeat holdout several times to improve the estimate, choosing the test data at random each time.

109 Methods of estimation (2/4). Cross validation: partition the data into k disjoint subsets. k-fold: train on k-1 partitions, test on the remaining one. Leave-one-out: k = n. Example: divide the data into 5 parts of equal size, then test using exactly one part at a time (folds #1 through #5), giving 5 test runs in total. If the data is divided into just two parts, one for training and one for testing, this is called two-fold cross validation.
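
The k-fold partitioning can be sketched over record indices; this sketch assigns indices round-robin, while in practice the data is often shuffled before partitioning:

```python
def k_fold_indices(n, k):
    """Partition record indices 0..n-1 into k disjoint folds; each fold serves
    once as the test set while the remaining k-1 folds form the training set."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test
```

For n = 10 and k = 5 this yields 5 train/test pairs of sizes 8 and 2, and every record is tested exactly once.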

110 Methods of Estimation (3/4: evaluating classifier performance)

Stratified sampling:
- The data is divided into groups (strata) according to some criterion
- A subsample is drawn from each stratum and used for testing

111 Methods of Estimation (4/4: evaluating classifier performance)

Bootstrap:
- Sampling with replacement (a record already drawn is returned to the pool, so it can be sampled again)
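A minimal sketch of drawing one bootstrap sample. A standard property of sampling n records with replacement from n is that, in expectation, about 63.2% of the distinct records appear in the sample (1 − 1/e); the left-out records can serve as a test set:

```python
# One bootstrap sample: draw n records with replacement from an n-record set.
import random

def bootstrap_sample(data, seed=0):
    rng = random.Random(seed)
    sample = [rng.choice(data) for _ in data]   # with replacement
    in_sample = set(sample)
    out_of_bag = [x for x in data if x not in in_sample]
    return sample, out_of_bag

data = list(range(1000))
sample, oob = bootstrap_sample(data)
# The fraction of distinct records in the sample is ~0.632 in expectation.
print(len(set(sample)) / len(data))
```

The "out-of-bag" records form a natural test set because the model never saw them during training.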

112 Model Evaluation

Metrics for Performance Evaluation: how to evaluate the performance of a model?
Methods for Performance Evaluation: how to obtain reliable estimates?
Methods for Model Comparison: how to compare the relative performance among competing models?

113 Learning Curve (1/2)

A learning curve shows how accuracy changes with varying sample size.
It requires a sampling schedule for creating the curve:
- Arithmetic sampling (Langley et al.)
- Geometric sampling (Provost et al.)
Effect of small sample size:
- Bias in the estimate
- Variance of the estimate

114 Learning Curve (2/2)

Arithmetic sampling:
- Sample sizes grow by a simple arithmetic rule: S_i = S_0 + i·C, where S_0 is the initial sample size and C is a constant
- Schedule: S_0, S_0 + C, S_0 + 2C, S_0 + 3C, ...
- If S_0 = 1,000 and C = 100, then S_1 = 1,100, S_2 = 1,200, ...

Geometric sampling:
- Sample sizes grow in geometric progression: S_i = S_0 · C^i
- Schedule: S_0, S_0·C, S_0·C², S_0·C³, ...
- If S_0 = 1,000 and C = 2, then S_1 = 2,000, S_2 = 4,000, S_3 = 8,000, ...
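The two schedules above, generated directly from the formulas (using the slide's example parameters):

```python
# Sample-size schedules for plotting a learning curve.
def arithmetic_schedule(s0, c, steps):
    """S_i = S_0 + i*C"""
    return [s0 + i * c for i in range(steps)]

def geometric_schedule(s0, c, steps):
    """S_i = S_0 * C**i"""
    return [s0 * c**i for i in range(steps)]

print(arithmetic_schedule(1000, 100, 4))  # [1000, 1100, 1200, 1300]
print(geometric_schedule(1000, 2, 4))     # [1000, 2000, 4000, 8000]
```

Geometric schedules cover a wide range of sizes with few training runs, which is why they are often preferred when training is expensive.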

115 ROC (Receiver Operating Characteristic)

- Developed in the 1950s in signal detection theory, to analyze noisy signals
- Characterizes the trade-off between positive hits and false alarms
- The ROC curve plots the true positive rate (y-axis) against the false positive rate (x-axis)
- The performance of a classifier at a given operating point is a single point on the ROC curve; changing the algorithm's threshold, the sample distribution, or the cost matrix moves that point

116 ROC Curve

- A 1-dimensional data set containing 2 classes (positive and negative)
- Any point located at x > t is classified as positive
- At threshold t: TPR = 0.5, FNR = 0.5, FPR = 0.12, TNR = 0.88

117 Using ROC for Model Comparison

- The closer the curve is pulled toward the upper-left corner, the better the performance
- Neither model consistently outperforms the other:
  - M1 is better for small FPR
  - M2 is better for large FPR
- Area Under the ROC Curve (AUC):
  - Ideal: area = 1
  - Random guess: area = 0.5; an AUC of 0.5 means the model has no discriminative power

118 ROC Curve

Notable points (TPR, FPR):
- (0,0): declare everything to be the negative class
- (1,1): declare everything to be the positive class
- (1,0): ideal
- Diagonal line: random guessing
- Below the diagonal line: prediction is the opposite of the true class

119 How to Construct an ROC Curve

(Example table columns: Instance | P(+|A) | True Class)

- Use a classifier that produces a posterior probability P(+|A) for each test instance A
- Sort the instances by P(+|A) in decreasing order
- Apply a threshold at each unique value of P(+|A)
- Count the number of TP, FP, TN, FN at each threshold
  - TP rate: TPR = TP / (TP + FN)
  - FP rate: FPR = FP / (FP + TN)

120 How to Construct an ROC Curve

(Worked example: a table listing, for each threshold on P(+|A), the counts TP, FP, TN, FN and the resulting TPR and FPR; plotting the (FPR, TPR) pairs yields the ROC curve.)
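The construction described on the previous slide can be sketched directly: sort by score, lower the threshold one instance at a time, and record the running rates. The scores and labels below are made up for illustration, and ties are handled instance by instance for simplicity:

```python
# Build ROC points by sweeping a threshold over the sorted scores.
def roc_points(scores, labels):
    """labels: 1 = positive, 0 = negative. Returns a list of (FPR, TPR)."""
    pairs = sorted(zip(scores, labels), reverse=True)
    P = sum(labels)            # total positives
    N = len(labels) - P        # total negatives
    points = [(0.0, 0.0)]      # threshold above the highest score
    tp = fp = 0
    for _, label in pairs:     # lower the threshold past each instance
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / N, tp / P))
    return points

scores = [0.95, 0.93, 0.87, 0.85, 0.85, 0.53, 0.43, 0.25]
labels = [1,    1,    0,    1,    0,    1,    0,    0]
print(roc_points(scores, labels))
```

The first point (0,0) corresponds to classifying everything as negative and the last point (1,1) to classifying everything as positive, matching slide 118.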

121 Test of Significance

- Depending on the size of the data, an observed difference in accuracy between two classifier models may not be meaningful
- Given two models:
  - Model M1: accuracy = 85%, tested on 30 instances
  - Model M2: accuracy = 75%, tested on 5000 instances
- Can we say M1 is better than M2?
  - M1 has the higher accuracy, but it was tested on a much smaller test set
  - How much confidence can we place in the accuracy of M1?

122 Confidence Interval for Accuracy

- A prediction can be regarded as a Bernoulli trial
  - A Bernoulli trial has 2 possible outcomes; for a prediction: correct or wrong
- A collection of Bernoulli trials has a binomial distribution: x ~ Bin(N, p), where x is the number of correct predictions
  - e.g., toss a fair coin 50 times; how many heads turn up? Expected number of heads = N × p = 50 × 0.5 = 25
- Given x (# of correct predictions) and N (total # of test instances), the empirical accuracy is acc = x / N
- Can we predict p, the true accuracy of the model?

123 Confidence Interval for Accuracy

For large test sets (N > 30), acc has a normal distribution with mean p and variance p(1 − p)/N:

P( Z_{α/2} < (acc − p) / √(p(1 − p)/N) < Z_{1−α/2} ) = 1 − α

(the area under the standard normal curve between Z_{α/2} and Z_{1−α/2} is 1 − α)

Solving for p gives the confidence interval:

p = ( 2·N·acc + Z²_{α/2} ± Z_{α/2} · √( Z²_{α/2} + 4·N·acc − 4·N·acc² ) ) / ( 2·(N + Z²_{α/2}) )

124 Confidence Interval for Accuracy

- Consider a model that produces an accuracy of 80% when evaluated on 100 test instances: N = 100, acc = 0.8
- Let 1 − α = 0.95 (95% confidence)
- From the standard normal table, Z_{α/2} = 1.96
- Plugging into the interval formula gives p between roughly 0.711 and 0.867; as N grows, the interval tightens around acc
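The interval formula from the previous slide, applied to this example:

```python
# Confidence interval for the true accuracy p, given empirical accuracy
# acc on N test instances (Z = 1.96 for 95% confidence).
from math import sqrt

def accuracy_interval(acc, n, z=1.96):
    center = 2 * n * acc + z * z
    spread = z * sqrt(z * z + 4 * n * acc - 4 * n * acc * acc)
    denom = 2 * (n + z * z)
    return (center - spread) / denom, (center + spread) / denom

lo, hi = accuracy_interval(0.8, 100)
print(round(lo, 3), round(hi, 3))   # 0.711 0.867

# The interval tightens as the test set grows:
lo, hi = accuracy_interval(0.8, 5000)
print(round(lo, 3), round(hi, 3))   # 0.789 0.811
```

With 100 instances the true accuracy could plausibly be anywhere from 71% to 87%; with 5000 instances the uncertainty shrinks to about ±1 percentage point.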

125 Comparing Performance of 2 Models

- Given two models, say M1 and M2, which is better?
  - M1 is tested on D1 (size = n1), observed error rate e1
  - M2 is tested on D2 (size = n2), observed error rate e2
  - Assume D1 and D2 are independent
- If n1 and n2 are sufficiently large, then
  e1 ~ N(μ1, σ1²) and e2 ~ N(μ2, σ2²)
- The variance of each error rate can be approximated by
  σ̂i² = ei(1 − ei) / ni

126 Comparing Performance of 2 Models

- To test whether the performance difference is statistically significant, consider d = e1 − e2
  - d ~ N(d_t, σ_t²), where d_t is the true difference
  - Since D1 and D2 are independent, their variances add up:
    σ_t² = σ1² + σ2² ≈ σ̂1² + σ̂2² = e1(1 − e1)/n1 + e2(1 − e2)/n2
- At the (1 − α) confidence level:
  d_t = d ± Z_{α/2} · σ̂_t

127 An Illustrative Example

- Given: M1: n1 = 30, e1 = 0.15; M2: n2 = 5000, e2 = 0.25
- d = |e2 − e1| = 0.1 (2-sided test)
- σ̂_d² = 0.15(1 − 0.15)/30 + 0.25(1 − 0.25)/5000 = 0.0043
- At the 95% confidence level, Z_{α/2} = 1.96:
  d_t = 0.100 ± 1.96 × √0.0043 = 0.100 ± 0.128
- The interval contains 0 → the difference may not be statistically significant
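This example can be reproduced directly from the formulas on the previous slide:

```python
# Confidence interval for the true difference between two error rates,
# d_t = d +/- Z * sqrt(e1(1-e1)/n1 + e2(1-e2)/n2).
from math import sqrt

def difference_interval(e1, n1, e2, n2, z=1.96):
    d = abs(e2 - e1)
    var = e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2
    margin = z * sqrt(var)
    return d - margin, d + margin

lo, hi = difference_interval(0.15, 30, 0.25, 5000)
print(round(lo, 3), round(hi, 3))   # -0.028 0.228
print(lo < 0 < hi)                  # True: interval contains 0
```

Almost all of the variance comes from M1's tiny test set (0.15 × 0.85 / 30 ≈ 0.00425, versus ≈ 0.00004 for M2), which is why the 10-point accuracy gap is still not conclusive.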

128 Comparing Performance of 2 Algorithms

- Each learning algorithm may produce k models:
  - L1 may produce M11, M12, ..., M1k
  - L2 may produce M21, M22, ..., M2k
- If the models are tested on the same test sets D1, D2, ..., Dk (e.g., via cross-validation):
  - For each set, compute d_j = e_1j − e_2j
  - d_j has mean d_t and variance σ_t²
  - Estimate:
    σ̂_t² = Σ_{j=1..k} (d_j − d̄)² / (k(k − 1))
    d_t = d̄ ± t_{1−α, k−1} · σ̂_t
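A minimal sketch of this paired comparison. The per-fold error differences below are made up for illustration, and 2.262 is the standard two-sided 95% t critical value for k − 1 = 9 degrees of freedom:

```python
# Paired comparison of two algorithms across k cross-validation folds:
# d_t = d_bar +/- t_{1-alpha, k-1} * sigma_hat_t.
from math import sqrt

def paired_interval(d, t_crit):
    k = len(d)
    d_bar = sum(d) / k
    var = sum((x - d_bar) ** 2 for x in d) / (k * (k - 1))
    margin = t_crit * sqrt(var)
    return d_bar - margin, d_bar + margin

# d_j = e_1j - e_2j on each of 10 folds (hypothetical values)
d = [0.02, -0.03, 0.01, 0.04, -0.02, 0.00, 0.03, -0.01, 0.02, -0.02]
lo, hi = paired_interval(d, t_crit=2.262)
print(lo < 0 < hi)  # True: interval contains 0, difference not significant
```

The t distribution replaces the normal here because the variance is itself estimated from only k samples; for small k its critical values are noticeably larger than 1.96.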