Korea BI Data Mining Society, 2010 Fall Conference

Detecting Low-Yield Semiconductor Wafers and Identifying Suspect Equipment Using Random Forests

고태훈, 김동일, 박은정, 조성준*
Data Mining Lab., Seoul National University, hooni915@snu.ac.kr
Introduction: Semiconductor Wafer Yield

The semiconductor process and wafer yield
- A semiconductor process consists of hundreds of individual process steps.
- The unit of production is the wafer, on which many microprocessors are patterned.
- Yield: measured as the number of properly functioning chips on a wafer.
Introduction: The Difficulty of Measuring Yield

- Consistently producing high-yield wafers is critical, yet yield is only known after all process steps are complete.
- One alternative is to insert metrology steps between process steps and measure production quality in real time.
- Due to cost and productivity constraints, only about 4% of all wafers undergo quality metrology.
- Sensor data from the production equipment does exist, however (FDC data: Fault Detection & Classification data).
Introduction: Suspect Equipment Parameters

Why search for suspect equipment parameters?
- Low-yield wafers occur when a process step is not executed according to its prescribed recipe (temperature, pressure, processing time, etc.).
- If the suspect equipment parameters that lower yield can be identified, the corresponding process steps can be managed intensively, ultimately improving final yield.
Introduction: Characteristics of FDC Data

- The number of input variables is large relative to the number of records. Why? Because the semiconductor process is so complex.
- Example: assume 400 process steps, with 100 process units performed at each step.
  - Total number of process units = 400 × 100 = 40,000.
  - If each unit produces an equipment parameter, 40,000 input variables result!
- The data is therefore "fat" (far more columns than rows).
Random Forests

- Developed by Leo Breiman (father of CART) at the University of California, Berkeley (1996, 1999).
- A special case of the model-averaging approach.
- An attempt to reduce the bias of a single tree.

[Figure: model averaging: the votes of several trees (w1, w2, w3) are combined into a single prediction.]
Why Random Forests?

Decision trees
- Advantages: extract decision rules ("If A, then B") and select important predictors automatically.
- Limitation: high bias (poor fit to a nonlinear decision boundary).

Why RF? To keep the advantage (automatic selection of important predictors) while reducing bias.

How?
- By evaluating each predictor.
- Through two randomizations: (1) bagging (or bootstrap aggregation) (L. Breiman, 1994) and (2) randomly chosen predictor subsets.
RF: Evaluating Each Predictor

- In Random Forests, each single tree selects important predictors automatically.
- Random Forests can evaluate each predictor by combining all of the single trees' opinions.

Worked example: a node holding the targets {1, 2, 3, 4} has mean 2.5 and MSE 1.25. Splitting on x1 produces the children {1, 2} (mean 1.5, MSE 0.25) and {3, 4} (mean 3.5, MSE 0.25). The weighted child MSE is 0.25 × 0.5 + 0.25 × 0.5 = 0.25, so the MSE decrease credited to x1 is 1.25 - 0.25 = 1.
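The worked example can be reproduced in a few lines. This is a minimal sketch using NumPy, with the split induced by x1 hard-coded for illustration:

```python
import numpy as np

def node_mse(y):
    # MSE of a node: mean squared deviation from the node mean.
    return np.mean((y - y.mean()) ** 2)

parent = np.array([1.0, 2.0, 3.0, 4.0])
left, right = np.array([1.0, 2.0]), np.array([3.0, 4.0])  # the split induced by x1

# Children are weighted by their share of the parent's records.
weighted = (len(left) * node_mse(left) + len(right) * node_mse(right)) / len(parent)
print(node_mse(parent) - weighted)  # 1.25 - 0.25 = 1.0
```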
RF: Evaluating Each Predictor (continued)

Consider a forest of 4 trees, with MSE decreases by X1 of 1, 1.2, 3.2, and 0.75 respectively. Then:

Importance of X1 = mean MSE decrease = (1 + 1.2 + 3.2 + 0.75) / 4 = 1.5375
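The averaging step is then a one-liner; the four per-tree decreases below are the numbers from the example above.

```python
# Forest-level importance of X1: the mean of its per-tree MSE decreases.
mse_decreases_x1 = [1.0, 1.2, 3.2, 0.75]
importance_x1 = sum(mse_decreases_x1) / len(mse_decreases_x1)
print(importance_x1)  # 1.5375
```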
RF: Randomization through Bagging

- A parallel combination of learners, each trained independently on a distinct bootstrap sample.
- The final prediction is the mean prediction (regression) or the class with the most votes (classification).
- Bagging methods reduce variance.

[Figure legend: black line = true decision boundary; green line = a single tree's decision boundary; red line = the Random Forest's decision boundary.]
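A minimal bagging sketch for the regression case, assuming scikit-learn's DecisionTreeRegressor as the base learner (the paper does not specify an implementation):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_trees(X, y, n_trees=100, seed=0):
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(y), size=len(y))  # bootstrap sample, drawn with replacement
        trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return trees

def bagged_predict(trees, X):
    # Regression: the final prediction is the mean over all trees.
    return np.mean([t.predict(X) for t in trees], axis=0)
```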
RF: Randomization through Predictor Subsets

- If every single tree in the forest used all predictors, the method would reduce to plain bagging.
- The Random Forests algorithm instead chooses predictor subsets at random and trains each single tree on its own subset.
- The number of predictors per tree, m_try (where k is the total number of predictors):
  - classification (generally): m_try = √k
  - regression (generally): m_try = k / 3

Example: predictor set = {X1, X2, X3, ..., X12}
- Regression tree 1's predictor subset = {X1, X4, X5, X9}
- Regression tree 2's predictor subset = {X1, X2, X10, X11}
- Regression tree 3's predictor subset = {X7, X8, X10, X12}
- ...
- Regression tree n's predictor subset = {X1, X2, X6, X12}
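The subset rule can be sketched as below. One caveat: Breiman's original algorithm redraws the m_try-sized subset at every split rather than once per tree; the per-tree version here follows the slide's description.

```python
import numpy as np

def m_try(k, task="regression"):
    # k/3 predictors for regression, sqrt(k) for classification (k = all predictors).
    return max(1, k // 3) if task == "regression" else max(1, int(np.sqrt(k)))

rng = np.random.default_rng(0)
k = 12
subset = rng.choice(k, size=m_try(k), replace=False)  # one random 4-predictor subset of 12
```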
Data Description

Analysis was performed on the four groups with a reasonably sufficient number of records, plus the full data set: Group 2, Group 4, Group 5, Group 7, and All Groups (5 cases in total).

Group | Collection period | # of records
1     | 1 day             | 3
2     | 17 days           | 74
3     | 1 day             | 6
4     | 44 days           | 139
5     | 28 days           | 157
6     | 16 days           | 34
7     | 19 days           | 128
8     | 1 day             | 3
All   | entire period     | 544
Data Description (continued)

- Input variables: 387 used after preprocessing, exhibiting a wide variety of time-series distributions.
- Output variables: 3 quality indicators.
Regression Analysis Using RF: Algorithms

- Regression Random Forests algorithm.

Benchmark algorithms:
- Stepwise Linear Regression (Stepwise-LR): selects input variables by repeatedly adding and removing them. Addition: one at a time, add the variable that contributes most to the accuracy of the linear regression. Removal: one at a time, drop variables that are unnecessary for improving accuracy. (A hedged sketch of the forward step follows below.)
- Genetic Algorithm Linear Regression (GA-LR): generate a random initial population of solutions; judge solution fitness using MSE (Mean Squared Error), the regression evaluation criterion; iterate selection, crossover, and mutation until the algorithm terminates.

10-fold cross-validation was run for all three algorithms.
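A hedged sketch of the forward half of Stepwise-LR under 10-fold CV, assuming a recent scikit-learn. The paper's exact entry and exit criteria are not given, so the stopping rule here (stop when no candidate improves CV RMSE) is an assumption, and the removal step is omitted.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def forward_stepwise(X, y, max_vars=50):
    selected, remaining = [], list(range(X.shape[1]))
    best = -np.inf  # scores are negative RMSE, so higher is better
    while remaining and len(selected) < max_vars:
        # Score every candidate variable added to the current selection.
        scores = [(cross_val_score(LinearRegression(), X[:, selected + [j]], y, cv=10,
                                   scoring="neg_root_mean_squared_error").mean(), j)
                  for j in remaining]
        score, j = max(scores)
        if score <= best:  # no candidate improves CV RMSE: stop
            break
        best = score
        selected.append(j)
        remaining.remove(j)
    return selected
```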
Regression Analysis Using RF: Performance Metrics

- RMSE (Root Mean Squared Error): the regression run with the lower RMSE is preferred.
- R²: indicates how closely the predicted output variable matches the actual output variable.
- Cross-validation: because the number of data records is small, every model was evaluated with 10-fold cross-validation. (See the sketch below.)
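A sketch of the evaluation protocol, assuming scikit-learn: out-of-fold predictions from 10-fold CV, scored with RMSE and R². The model argument stands in for any of the three algorithms.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import mean_squared_error, r2_score

def evaluate(model, X, y):
    # Out-of-fold predictions from 10-fold cross-validation.
    pred = cross_val_predict(model, X, y, cv=10)
    return np.sqrt(mean_squared_error(y, pred)), r2_score(y, pred)

# Example: rmse, r2 = evaluate(RandomForestRegressor(n_estimators=500), X, y)
```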
Regression Analysis Using RF: Stepwise-LR Results

RMSE
Group | Y1     | Y2     | Y3
2     | 0.1646 | 0.6134 | 0.2267
4     | 0.4164 | 0.3394 | 0.3695
5     | 1.1283 | 1.3608 | 0.7568
7     | 1.6608 | 1.4786 | 2.0585
All   | 2.0752 | 3.0479 | 1.9992

Number of selected predictors
Group | Y1 | Y2 | Y3
2     | 41 | 27 | 27
4     | 16 | 32 | 39
5     | 20 | 14 | 31
7     | 32 | 48 | 45
All   | 35 | 34 | 31
Regression Analysis Using RF: GA-LR Results

RMSE
Group | Y1     | Y2     | Y3
2     | 0.3332 | 0.5073 | 0.1616
4     | 0.3250 | 0.3323 | 0.3253
5     | 0.5406 | 1.0010 | 0.6110
7     | 2.1755 | 1.1897 | 1.1967
All   | 2.0109 | 2.8851 | 1.9704

Number of selected predictors
Group | Y1  | Y2 | Y3
2     | 53  | 69 | 91
4     | 78  | 74 | 83
5     | 75  | 64 | 59
7     | 117 | 99 | 91
All   | 96  | 87 | 77
Regression Analysis Using RF: Random Forests Results

RMSE
Group | Y1     | Y2     | Y3
2     | 0.7650 | 1.1150 | 0.5014
4     | 0.3962 | 0.4313 | 0.5511
5     | 1.0368 | 1.1860 | 0.9031
7     | 2.4396 | 3.9988 | 2.8984
All   | 1.4944 | 2.1527 | 1.4753

Number of selected predictors: N/A
Regression Analysis Using RF: Best Model per Case

Group | # of records | Y  | Best model     | RMSE   | R²
2     | 74           | Y1 | Stepwise-LR    | 0.1646 | 0.9820
2     | 74           | Y2 | GA-LR          | 0.5073 | 0.9175
2     | 74           | Y3 | GA-LR          | 0.1616 | 0.9570
4     | 139          | Y1 | GA-LR          | 0.3250 | 0.6570
4     | 139          | Y2 | GA-LR          | 0.3323 | 0.7994
4     | 139          | Y3 | GA-LR          | 0.3253 | 0.7780
5     | 157          | Y1 | GA-LR          | 0.5406 | 0.8780
5     | 157          | Y2 | GA-LR          | 1.0010 | 0.6288
5     | 157          | Y3 | GA-LR          | 0.6110 | 0.7829
7     | 128          | Y1 | Stepwise-LR    | 1.6608 | 0.8849
7     | 128          | Y2 | GA-LR          | 1.1897 | 0.9685
7     | 128          | Y3 | GA-LR          | 1.1967 | 0.9113
All   | 544          | Y1 | Random Forests | 1.4944 | 0.7448
All   | 544          | Y2 | Random Forests | 2.1527 | 0.7516
All   | 544          | Y3 | Random Forests | 1.4753 | 0.7540
Classification Using RF: Binarizing the Output Variable (Excursion vs. Normal)

- Excursion: a data record whose defect level suddenly increases sharply.
- All other records are classified as Normal.
Classification Using RF: Criterion for Excursion vs. Normal

- A different threshold is applied to each combination of group and output variable.
- Groups: Group 2, Group 4, Group 5, Group 7, All Groups (5 groups); targets: 3 per group; 15 cases in total.
- Data more than c sigma above the mean is classified as an excursion.
- Based on expert domain knowledge, c is set differently for each of the 15 cases:

If Y_ij > mean(Y_ij) + c_ij · σ(Y_ij), then "excursion"   (i = 1, 2, 3, 4, 5; j = 1, 2, 3)
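The rule translates directly into code. Since the expert-chosen c_ij values are not published, c below is a placeholder parameter:

```python
import numpy as np

def label_excursions(y, c):
    # 1 = excursion (more than c standard deviations above the mean), 0 = normal.
    y = np.asarray(y, dtype=float)
    return (y > y.mean() + c * y.std()).astype(int)
```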
Classification Using RF: Oversampling

Because each group contains very few excursion records, oversampling was performed by adding noise to the excursion data. (A sketch follows below.)

Before oversampling (normal / excursion / total)
Group | Y1             | Y2             | Y3
2     | 71 / 3 / 74    | 71 / 3 / 74    | 70 / 4 / 74
4     | 133 / 6 / 139  | 133 / 6 / 139  | 137 / 2 / 139
5     | 148 / 9 / 157  | 147 / 10 / 157 | 137 / 20 / 157
7     | 105 / 23 / 128 | 108 / 20 / 128 | 110 / 18 / 128
All   | 476 / 68 / 544 | 472 / 72 / 544 | 491 / 53 / 544

After oversampling (normal / excursion / total)
Group | Y1              | Y2              | Y3
2     | 71 / 71 / 142   | 71 / 71 / 142   | 70 / 70 / 140
4     | 133 / 133 / 266 | 133 / 133 / 266 | 137 / 137 / 274
5     | 148 / 148 / 296 | 147 / 147 / 294 | 137 / 137 / 274
7     | 105 / 105 / 210 | 108 / 108 / 216 | 110 / 110 / 220
All   | 476 / 476 / 952 | 472 / 472 / 944 | 491 / 491 / 982
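A sketch of the noise-based oversampling: excursion records are replicated with small Gaussian jitter until the classes are balanced. The noise scale used in the paper is not reported, so 0.01 is a placeholder:

```python
import numpy as np

def oversample_with_noise(X_excursion, n_target, scale=0.01, seed=0):
    # Resample excursion rows with replacement, then jitter each copy.
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(X_excursion), size=n_target)
    noise = rng.normal(0.0, scale, size=(n_target, X_excursion.shape[1]))
    return X_excursion[idx] + noise
```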
Classification Using RF: Algorithms and Metrics

- Classification Random Forests algorithm.
- Benchmark algorithms: Logistic Regression and a single Decision Tree.
- 10-fold cross-validation was run for all three algorithms.

Evaluation metrics (confusion matrix):

                 | Predicted Excursion | Predicted Normal
Actual Excursion | TP                  | FN
Actual Normal    | FP                  | TN

- Sensitivity = TP / (TP + FN): the proportion of actual excursions that the model predicts as excursions.
- Specificity = TN / (FP + TN): the proportion of actual normals that the model predicts as normal.
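The two metrics can be computed from predictions with scikit-learn's confusion_matrix (excursion encoded as 1, normal as 0):

```python
from sklearn.metrics import confusion_matrix

def sensitivity_specificity(y_true, y_pred):
    # labels=[0, 1] fixes the layout: [[TN, FP], [FN, TP]].
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return tp / (tp + fn), tn / (fp + tn)
```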
Classification Using RF: Results

Group | Y  | # of records | Logistic Regression (Sens / Spec) | Decision Tree (Sens / Spec) | Random Forests (Sens / Spec)
2     | Y1 | 142 | 0.8425 / 0.9754 | 0.8754 / 0.9921 | 0.9296 / 1.0000
2     | Y2 | 142 | 0.8454 / 0.9825 | 0.8763 / 0.9874 | 0.9296 / 1.0000
2     | Y3 | 140 | 0.8541 / 0.9798 | 0.8698 / 0.9823 | 0.9286 / 1.0000
4     | Y1 | 266 | 0.8579 / 0.9465 | 0.8493 / 0.9745 | 0.9549 / 1.0000
4     | Y2 | 266 | 0.8621 / 0.9874 | 0.8458 / 0.9789 | 0.9699 / 1.0000
4     | Y3 | 274 | 0.8520 / 0.9547 | 0.8721 / 0.9825 | 0.9635 / 1.0000
5     | Y1 | 296 | 0.8654 / 0.9614 | 0.8948 / 0.9890 | 0.9662 / 1.0000
5     | Y2 | 294 | 0.8745 / 0.9501 | 0.8714 / 0.9901 | 0.9660 / 1.0000
5     | Y3 | 274 | 0.8512 / 0.9682 | 0.8412 / 0.9821 | 0.9416 / 1.0000
7     | Y1 | 210 | 0.8621 / 0.9732 | 0.9021 / 0.9800 | 0.9524 / 0.9905
7     | Y2 | 216 | 0.8685 / 0.9520 | 0.9114 / 0.9514 | 0.9630 / 0.9722
7     | Y3 | 220 | 0.8579 / 0.9421 | 0.8942 / 0.9632 | 0.9273 / 0.9909
All   | Y1 | 952 | 0.8754 / 0.9541 | 0.8821 / 0.9588 | 0.9664 / 0.9979
All   | Y2 | 944 | 0.8954 / 0.9325 | 0.9102 / 0.9520 | 0.9746 / 0.9979
All   | Y3 | 982 | 0.8746 / 0.9387 | 0.8925 / 0.9687 | 0.9695 / 0.9980
Classification Using RF: Variable Selection Results

Domain experts confirmed that the selected variables are genuinely important in the actual process.

Rank by importance* | Variable
1 | X313
2 | X387
3 | X145
4 | X285
5 | X333
6 | X31

* Importance of variable Xi = (number of models in which Xi was selected) × (sum of Xi's importance in each of those models).
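The footnoted score can be sketched as below; model_importances (one {variable: importance} dict per model) is a hypothetical structure used here for illustration:

```python
def cross_model_score(model_importances, var):
    # (# of models that selected var) * (sum of var's importances in those models).
    vals = [imp[var] for imp in model_importances if var in imp]
    return len(vals) * sum(vals)
```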
Summary and Conclusions

Regression Random Forests
- For regression problems with a reasonably sufficient number of records, it outperforms the linear regression models.
- It extracts meaningful input variables.

Classification Random Forests
- For the classification problem, it predicts Excursion vs. Normal records with considerably high accuracy.
- It reduces the extra cost incurred by missing excursion records, and should be especially useful in the early stages of a process, where excursions occur frequently.
- It reduces the opportunity cost incurred by missing normal records.
- It extracts meaningful input variables.
Summary and Conclusions (continued)

Modeling fat data
- Analyzing fat data, where input variables far outnumber data records, is a major topic in pattern recognition and data mining, and semiconductor process data is a canonical example of it.
- Because Random Forests considers many different combinations of input variables, it appears well suited to handling such fat data.

Future work
- Objective validation of the significance of the selected variables: domain experts have endorsed them to some degree, but more objective verification on the actual process is needed.
- Apply other techniques, such as (1-class) novelty detection, rather than only 2-class classification.
References

Breiman, L., 1996. Bagging predictors. Machine Learning 24, pp. 123-140.
Breiman, L., 2001. Random Forests. Machine Learning 45, pp. 5-32.
Segal, M. R., 2004. Machine Learning Benchmarks and Random Forest Regression. Center for Bioinformatics & Molecular Biostatistics.