Chapter 7 Analysis of Variance (ANOVA) 2014/4/29
7.1 Introduction
Analysis of variance (ANOVA): a technique that divides the total variation into several components and measures how much each source of variation contributes to the total.
Aims:
- estimation and hypothesis testing for population variances
- estimation and hypothesis testing for population means
Motivation
Two groups to compare -> t-test
More than two groups -> form all pairs of groups and run a t-test on each pair (pairwise t-tests)
- This is cumbersome and can lead to theoretically wrong conclusions (the multiple-comparisons problem).
- Each test uses only part of the data, so efficiency is lost.
Comparing three or more groups using the whole data set -> ANOVA
(response variable: continuous; explanatory variable: categorical)
7.2 Completely Randomized Design
Definition: treatments are assigned at random (complete randomization), so that the treatment effects can be assessed.
[Table: data layout by treatment, with totals and averages]
One-way analysis of variance
Example 7.2.1: Effect of glucose on insulin release (glucose & insulin)
Model
$x_{ij} = \mu_j + \varepsilon_{ij}$
  $x_{ij}$: ij-th observation; $\mu_j$: mean of the j-th treatment group; $\varepsilon_{ij}$: error of the ij-th observation
$x_{ij} = \mu + \tau_j + \varepsilon_{ij}$
  $\mu = \frac{1}{k}\sum_{j=1}^{k}\mu_j$: grand mean
  $\tau_j = \mu_j - \mu$: effect of the j-th treatment group
Assumptions of the model
1. $\varepsilon_{ij} \sim N(0, \sigma^2)$, independent
2. That is: zero mean, equal variance (homogeneity), normality, independence
Hypotheses of the model
$H_0: \mu_1 = \mu_2 = \cdots = \mu_k$
$H_A$: not all the $\mu_j$ are the same
[Figure: two panels of group distributions. Panel titles: "Same variances & same means"; "Same variances but different means"]
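Total sum of squares (SST)
$SST = \sum_{j=1}^{k}\sum_{i=1}^{n_j}(x_{ij}-\bar{x}_{..})^2 = \sum_{j=1}^{k}\sum_{i=1}^{n_j}x_{ij}^2 - \frac{T_{..}^2}{N}$
where $T_{..}$ is the grand total and $N$ is the total number of observations.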
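$SST = \sum_{j=1}^{k}\sum_{i=1}^{n_j}(x_{ij}-\bar{x}_{..})^2$
$= \sum_{j=1}^{k}\sum_{i=1}^{n_j}(x_{ij}-\bar{x}_{.j}+\bar{x}_{.j}-\bar{x}_{..})^2$
$= \sum_{j=1}^{k}\sum_{i=1}^{n_j}(x_{ij}-\bar{x}_{.j})^2 + 2\sum_{j=1}^{k}\sum_{i=1}^{n_j}(x_{ij}-\bar{x}_{.j})(\bar{x}_{.j}-\bar{x}_{..}) + \sum_{j=1}^{k}\sum_{i=1}^{n_j}(\bar{x}_{.j}-\bar{x}_{..})^2$
$= \sum_{j=1}^{k}\sum_{i=1}^{n_j}(x_{ij}-\bar{x}_{.j})^2 + \sum_{j=1}^{k}n_j(\bar{x}_{.j}-\bar{x}_{..})^2$
(the cross-product term is zero)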
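The decomposition above drops the cross-product term without comment. A short worked step, using only the symbols already defined on this slide, shows why it is zero: the factor $(\bar{x}_{.j}-\bar{x}_{..})$ does not depend on $i$, and deviations from a group mean sum to zero within that group.

```latex
\begin{aligned}
\sum_{j=1}^{k}\sum_{i=1}^{n_j}(x_{ij}-\bar{x}_{.j})(\bar{x}_{.j}-\bar{x}_{..})
  &= \sum_{j=1}^{k}(\bar{x}_{.j}-\bar{x}_{..})\sum_{i=1}^{n_j}(x_{ij}-\bar{x}_{.j}) \\
  &= \sum_{j=1}^{k}(\bar{x}_{.j}-\bar{x}_{..})\,(n_j\bar{x}_{.j}-n_j\bar{x}_{.j}) \\
  &= 0 .
\end{aligned}
```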
SST = SSW + SSA
  SSW: within-group sum of squares
  SSA: among (between)-group sum of squares
Variance ratio: VR = MSA / MSW = (among-group mean square) / (within-group mean square)
-> A larger VR means larger between-group variation relative to within-group variation: the groups are different, and the group effect is bigger.
Heritability example
ANOVA Table
Factor          Sum of squares   df      Mean square          Variance ratio
Between group   SSA              k - 1   MSA = SSA/(k - 1)    VR = MSA/MSW
Within group    SSW              N - k   MSW = SSW/(N - k)
Total           SST              N - 1
SAS program
* file eg7_2_1.sas ;
data insul;
  input glu ins @@;   /* @@ reads several (glu, ins) pairs from each data line */
  cards;
1 1.53  1 1.61  1 3.75  1 2.89  1 3.26
2 3.15  2 3.96  2 3.59  2 1.89  2 1.45  2 1.56
3 3.89  3 3.68  3 5.70  3 5.62  3 5.79  3 5.33
4 8.18  4 5.64  4 7.36  4 5.33  4 8.82  4 5.26  4 7.10
5 5.86  5 5.46  5 5.69  5 6.49  5 7.81  5 9.03  5 7.49  5 8.98
;
run;
proc means sum mean;
  by glu;             /* the data are already sorted by glu */
  var ins;
run;
proc anova;
  class glu;
  model ins = glu;
run;
The ANOVA Procedure
Dependent Variable: ins

Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              4   121.1854282      30.2963570    19.78     <.0001    (between groups; F Value = VR)
Error             27    41.3573937       1.5317553                        (within groups)
Corrected Total   31   162.5428219                                        (total)

R-Square   Coeff Var   Root MSE   ins Mean
0.745560   24.27491    1.237641   5.098438

Source   DF   Anova SS      Mean Square   F Value   Pr > F
glu       4   121.1854282   30.2963570    19.78     <.0001
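The SAS table can be cross-checked by hand from the sum-of-squares formulas. The following is a minimal Python sketch, not part of the original course material: the function name `one_way_anova` and the dictionary layout are illustrative assumptions, while the data are the glucose/insulin values from eg7_2_1.sas.

```python
# One-way ANOVA table computed directly from the sum-of-squares formulas.
# Data: insulin release (ins) by glucose group (glu), copied from eg7_2_1.sas.
groups = {
    1: [1.53, 1.61, 3.75, 2.89, 3.26],
    2: [3.15, 3.96, 3.59, 1.89, 1.45, 1.56],
    3: [3.89, 3.68, 5.70, 5.62, 5.79, 5.33],
    4: [8.18, 5.64, 7.36, 5.33, 8.82, 5.26, 7.10],
    5: [5.86, 5.46, 5.69, 6.49, 7.81, 9.03, 7.49, 8.98],
}


def one_way_anova(groups):
    """Return the entries of the one-way ANOVA table (hypothetical helper)."""
    all_x = [x for xs in groups.values() for x in xs]
    n_total, k = len(all_x), len(groups)
    grand_mean = sum(all_x) / n_total
    means = {j: sum(xs) / len(xs) for j, xs in groups.items()}

    # Among (between)-group SS: n_j * (group mean - grand mean)^2
    ssa = sum(len(xs) * (means[j] - grand_mean) ** 2 for j, xs in groups.items())
    # Within-group SS: squared deviations from each group's own mean
    ssw = sum((x - means[j]) ** 2 for j, xs in groups.items() for x in xs)
    # Total SS: squared deviations from the grand mean
    sst = sum((x - grand_mean) ** 2 for x in all_x)

    msa = ssa / (k - 1)
    msw = ssw / (n_total - k)
    return {
        "SSA": ssa, "SSW": ssw, "SST": sst,
        "df_between": k - 1, "df_within": n_total - k, "df_total": n_total - 1,
        "MSA": msa, "MSW": msw, "VR": msa / msw,
    }


table = one_way_anova(groups)
for name, value in table.items():
    print(f"{name:>10}: {value:.4f}")
```

The printed SSA, SSW, SST and VR should agree with the Model, Error, Corrected Total and F Value entries of the SAS output up to rounding.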
Multiple Comparisons
ex) significance level = $\alpha$ for a test
Let
  $H_{01}: \alpha_1 = 0$, with P(do not reject $H_{01}$ | $H_{01}$ is true) = $1-\alpha$
  $H_{02}: \alpha_2 = 0$, with P(do not reject $H_{02}$ | $H_{02}$ is true) = $1-\alpha$
then, for $H_0$ = $H_{01}$ and $H_{02}$ (treating the two tests as independent),
  P(do not reject $H_0$ | $H_0$) = P(do not reject $H_{01}$ and do not reject $H_{02}$ | $H_0$) = $(1-\alpha)(1-\alpha) = (1-\alpha)^2$
In general, with $k$ tests the probability is $(1-\alpha)^k$.
For example, to test $H_0: \alpha_1 = \alpha_2 = \alpha_3 = \alpha_4 = 0$ with $\alpha = 0.05$:
  $(0.95)^4 = 0.8145$, so $1 - 0.8145 = 0.1855$
The overall $\alpha$ is 0.1855, not 0.05 -> inflated type I error!
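The inflation of the overall type I error can be tabulated for other numbers of tests. This is a small Python sketch under the same assumption as the slide (independent tests, each at level alpha); the function name `familywise_error` is hypothetical.

```python
def familywise_error(alpha, k):
    """Overall type I error for k independent tests, each run at level alpha."""
    return 1 - (1 - alpha) ** k


# The slide's case is k = 4 at alpha = 0.05.
for k in (1, 2, 4, 10):
    print(f"k = {k:2d}: overall alpha = {familywise_error(0.05, k):.4f}")
```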
Bonferroni Correction: set the individual significance level to $\alpha/m$, so that the overall significance level is about $\alpha$ for $m$ multiple tests.
m = 4: $(1 - 0.05/4)^4 \approx 0.95 = 1 - 0.05$
example) When we have 10 hypotheses:
  Individual p-value threshold 0.05 -> multiple-comparisons problem (too many false findings)
  Individual p-value threshold 0.05/10 = 0.005
This is often called the Bonferroni-corrected p-value threshold.
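The effect of the correction can be checked numerically. The sketch below assumes independent tests, as in the multiple-comparisons slide; `bonferroni_level` and `overall_level` are hypothetical helper names.

```python
def bonferroni_level(alpha, m):
    """Per-test significance level under the Bonferroni correction."""
    return alpha / m


def overall_level(per_test_alpha, m):
    """Overall type I error of m independent tests at the given per-test level."""
    return 1 - (1 - per_test_alpha) ** m


per_test_4 = bonferroni_level(0.05, 4)     # 0.0125
overall_4 = overall_level(per_test_4, 4)   # slightly below 0.05
per_test_10 = bonferroni_level(0.05, 10)   # 0.005, the 10-hypothesis example

print(f"m = 4:  per-test level {per_test_4}, overall level {overall_4:.5f}")
print(f"m = 10: per-test level {per_test_10}")
```

The overall level stays just under the nominal 0.05, which is why the correction is described as giving "about" alpha.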
Detecting pairwise differences
After rejecting $H_0: \mu_1 = \mu_2 = \cdots = \mu_5$, which pairs have larger differences?
1. LSD (least significant difference) test
2. Duncan's new multiple range test
3. Tukey's HSD (honestly significant difference) test
When the $n_j$'s are the same: $HSD = q_{\alpha,k,N-k}\sqrt{\frac{MSE}{n}}$
When they differ: $HSD^* = q_{\alpha,k,N-k}\sqrt{\frac{MSE}{n_j^*}}$
  $n_j^*$: sample size of the smaller cell
Example 7.2.2: Pairwise mean differences in the glucose example
Table 7.2.6: Pairwise comparisons by Tukey's HSD test
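e.g. Interpolation of the critical value q at error df = 27, from the table values q = 4.17 (df = 24) and q = 4.10 (df = 30):
  0.07 : x = (30 - 24) : (27 - 24), where 0.07 = 4.17 - 4.10
  6x = 0.07 × 3
  x = (0.07 × 3) / 6 = 0.035
  q = 4.17 - 0.035 = 4.135 ≈ 4.14
Ordered group means: 2.60  2.61  5.00  6.81  7.10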
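The interpolation step can be written as a small function. This is a minimal Python sketch; `interpolate` is a hypothetical helper, and the two table values are the ones quoted on the slide.

```python
def interpolate(x, x0, y0, x1, y1):
    """Linear interpolation between the points (x0, y0) and (x1, y1)."""
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


# Studentized-range critical values from the slide: 4.17 at df = 24, 4.10 at df = 30.
q_27 = interpolate(27, 24, 4.17, 30, 4.10)
print(f"interpolated q at df = 27: {q_27:.3f}")
```

For comparison, the SAS output for this example reports a critical value of 4.13047, so the interpolated value is a close approximation rather than the exact quantile.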
proc anova;
  class glu;
  model ins = glu;
  means glu / tukey;
run;

The ANOVA Procedure
Tukey's Studentized Range (HSD) Test for ins
NOTE: This test controls the Type I experimentwise error rate.

Alpha                                   0.05
Error Degrees of Freedom                  27
Error Mean Square                   1.531755
Critical Value of Studentized Range  4.13047

Comparisons significant at the 0.05 level are indicated by ***.

   glu        Difference      Simultaneous 95%
Comparison   Between Means    Confidence Limits
   5 - 4        0.2884       -1.5824    2.1592
   5 - 3        2.0996        0.1474    4.0518   ***
   5 - 1        4.4933        2.4325    6.5540   ***
   5 - 2        4.5013        2.5491    6.4534   ***
   4 - 5       -0.2884       -2.1592    1.5824
   4 - 3        1.8112       -0.1999    3.8223
   4 - 1        4.2049        2.0883    6.3214   ***
   4 - 2        4.2129        2.2018    6.2239   ***
   3 - 5       -2.0996       -4.0518   -0.1474   ***
   3 - 4       -1.8112       -3.8223    0.1999
   3 - 1        2.3937        0.2048    4.5825   ***
   3 - 2        2.4017        0.3147    4.4886   ***
   1 - 5       -4.4933       -6.5540   -2.4325   ***
   1 - 4       -4.2049       -6.3214   -2.0883   ***
   1 - 3       -2.3937       -4.5825   -0.2048   ***
   1 - 2        0.0080       -2.1808    2.1968
   2 - 5       -4.5013       -6.4534   -2.5491   ***
   2 - 4       -4.2129       -6.2239   -2.2018   ***
   2 - 3       -2.4017       -4.4886   -0.3147   ***
   2 - 1       -0.0080       -2.1968    2.1808
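The confidence limits in the SAS listing can be reproduced by hand. The sketch below is an assumption-laden check, not the course's method: it uses the Tukey-Kramer half-width q * sqrt(MSE/2 * (1/n_i + 1/n_j)) for unequal group sizes, which is a different formula from the HSD* on the earlier slide (that one uses the smaller cell size) but appears to match the SAS limits numerically. The critical value and error mean square are taken from the SAS output; the function name is hypothetical.

```python
import math

# Glucose/insulin data from eg7_2_1.sas.
groups = {
    1: [1.53, 1.61, 3.75, 2.89, 3.26],
    2: [3.15, 3.96, 3.59, 1.89, 1.45, 1.56],
    3: [3.89, 3.68, 5.70, 5.62, 5.79, 5.33],
    4: [8.18, 5.64, 7.36, 5.33, 8.82, 5.26, 7.10],
    5: [5.86, 5.46, 5.69, 6.49, 7.81, 9.03, 7.49, 8.98],
}
Q_CRIT = 4.13047   # "Critical Value of Studentized Range" from the SAS output
MSE = 1.531755     # "Error Mean Square" from the SAS output


def tukey_kramer_interval(i, j):
    """Simultaneous 95% limits for mean(i) - mean(j), Tukey-Kramer form."""
    n_i, n_j = len(groups[i]), len(groups[j])
    diff = sum(groups[i]) / n_i - sum(groups[j]) / n_j
    half_width = Q_CRIT * math.sqrt(MSE / 2 * (1 / n_i + 1 / n_j))
    return diff - half_width, diff + half_width


for i, j in [(5, 4), (5, 3), (4, 3)]:
    lower, upper = tukey_kramer_interval(i, j)
    flag = "***" if lower > 0 or upper < 0 else ""
    print(f"{i} - {j}: [{lower:8.4f}, {upper:8.4f}] {flag}")
```

A pair is flagged when its interval excludes zero, which is the same rule the `***` marks follow in the listing.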
Homework
Make ANOVA tables for the following problems using the formulae (you may use MS Excel), then compare your results with the results from SAS.
Exercise 7.2.2
Exercise 7.2.7
7.3 Randomized Complete Block Design
R. A. Fisher (1925): to compare the yields of certain species
The land is divided into blocks (block = land), and the treatments (other factors) are randomized within each block.
[Table: data layout by block and treatment, with totals and averages]
Example 7.3.1: Number of days to learn how to use a dental device
[Table: data classified by age and teaching method]
Model
$x_{ij} = \mu + \beta_i + \tau_j + e_{ij}$
  $\beta_i$: block effect; $\tau_j$: treatment effect
$e_{ij} = x_{ij} - (\mu + \beta_i + \tau_j) \sim N(0, \sigma^2)$
Hypotheses
$H_0: \tau_j = 0$, $j = 1, 2, \ldots, k$
$H_A$: "all $\tau_j = 0$" is not true; some $\tau_j \neq 0$.
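$SST = \sum_{j=1}^{k}\sum_{i=1}^{n}(x_{ij}-\bar{x}_{..})^2$
$= \sum_{j=1}^{k}\sum_{i=1}^{n}(\bar{x}_{i.}-\bar{x}_{..})^2 + \sum_{j=1}^{k}\sum_{i=1}^{n}(\bar{x}_{.j}-\bar{x}_{..})^2 + \sum_{j=1}^{k}\sum_{i=1}^{n}(x_{ij}-\bar{x}_{i.}-\bar{x}_{.j}+\bar{x}_{..})^2$
SST = SSBl + SSTr + SSE
df: $nk - 1 = (n-1) + (k-1) + (n-1)(k-1)$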
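The block-design decomposition can be verified numerically. The sketch below uses a small made-up data set (three blocks by three treatments, purely hypothetical and not from the course) and a hypothetical helper `rcbd_sums_of_squares`; it only checks that the three components add up to SST.

```python
# Hypothetical data: each row is a block (n = 3), each column a treatment (k = 3).
data = [
    [7.0, 9.0, 10.0],
    [8.0, 9.0, 10.0],
    [9.0, 9.0, 12.0],
]


def rcbd_sums_of_squares(data):
    """Return SSBl, SSTr, SSE and SST for one observation per block-treatment cell."""
    n, k = len(data), len(data[0])
    grand = sum(sum(row) for row in data) / (n * k)
    block_means = [sum(row) / k for row in data]
    trt_means = [sum(data[i][j] for i in range(n)) / n for j in range(k)]

    # Each block mean deviation is repeated over the k treatments, and vice versa.
    ss_bl = k * sum((m - grand) ** 2 for m in block_means)
    ss_tr = n * sum((m - grand) ** 2 for m in trt_means)
    ss_e = sum(
        (data[i][j] - block_means[i] - trt_means[j] + grand) ** 2
        for i in range(n)
        for j in range(k)
    )
    sst = sum((x - grand) ** 2 for row in data for x in row)
    return {"SSBl": ss_bl, "SSTr": ss_tr, "SSE": ss_e, "SST": sst}


ss = rcbd_sums_of_squares(data)
print(ss)
print("components add up:", abs(ss["SST"] - (ss["SSBl"] + ss["SSTr"] + ss["SSE"])) < 1e-9)
```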
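ANOVA table
Factor      Sum of squares   df               Mean square
Treatment   SSTr             k - 1            MSTr = SSTr/(k - 1)
Block       SSBl             n - 1            MSBl = SSBl/(n - 1)
Error       SSE              (n - 1)(k - 1)   MSE = SSE/((n - 1)(k - 1))
Total       SST              nk - 1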
Homework
Exercise 7.3.4 (SAS)
7.4 Factorial Design
Reduction of response time = drug level (low, medium, high) * age group (middle-aged, elderly)

Without interaction
Factor A (age)        Factor B (drug level)
                      j=1    j=2    j=3
Middle-aged (i=1)       5     10     20
Elderly (i=2)          10     15     25

[Figure: reduction of response time plotted against drug level for each age group, and against age for each drug level]
With interaction
Factor A (age)        Factor B (drug level)
                      j=1    j=2    j=3    (j=2)-(j=1)   (j=3)-(j=2)
Middle-aged (i=1)       5     10     20         5            10
Elderly (i=2)          15     10      5        -5            -5
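One way to see the difference between the two tables is to compute the interaction term of each cell, cell mean minus row mean minus column mean plus grand mean. This is a minimal Python sketch using the cell values from the two slides; the function name `interaction_effects` is hypothetical, and the row and column means are unweighted.

```python
# Cell means from the slides: rows are age groups (i), columns are drug levels (j).
no_interaction = [[5, 10, 20], [10, 15, 25]]
with_interaction = [[5, 10, 20], [15, 10, 5]]


def interaction_effects(cell):
    """Interaction term of each cell: cell - row mean - column mean + grand mean."""
    a, b = len(cell), len(cell[0])
    grand = sum(sum(row) for row in cell) / (a * b)
    row_means = [sum(row) / b for row in cell]
    col_means = [sum(cell[i][j] for i in range(a)) / a for j in range(b)]
    return [
        [cell[i][j] - row_means[i] - col_means[j] + grand for j in range(b)]
        for i in range(a)
    ]


print("without interaction:", interaction_effects(no_interaction))
print("with interaction:   ", interaction_effects(with_interaction))
```

In the first table every term is zero up to floating-point rounding (the age effect is the same at every drug level); in the second table the terms are clearly non-zero because the drug effect reverses direction between the two age groups.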
Two-factor completely randomized design
[Table: data layout with Factor A in rows and Factor B in columns]
Example 7.4.2: Length of a nurse's home visit as a function of the age of the nurse and the disease of the patient
Model
$x_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + e_{ijk}$
$i = 1, \ldots, a$;  $j = 1, \ldots, b$;  $k = 1, \ldots, n$
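Hypotheses
$H_0: \alpha_i = 0$, $i = 1, \ldots, a$          $H_A$: not $H_0$ ($\alpha_i \neq 0$ for some $i$)
$H_0: \beta_j = 0$, $j = 1, \ldots, b$           $H_A$: not $H_0$ ($\beta_j \neq 0$ for some $j$)
$H_0: (\alpha\beta)_{ij} = 0$, $i = 1, \ldots, a$, $j = 1, \ldots, b$     $H_A$: not $H_0$ ($(\alpha\beta)_{ij} \neq 0$ for some $i, j$)
SST = SSA + SSB + SSAB + SSE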
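The identity SST = SSA + SSB + SSAB + SSE can be checked on a toy example. The sketch below assumes a balanced design (the same number n of replicates in every cell) and uses made-up numbers, not the nurse data from Example 7.4.2; `two_way_ss` is a hypothetical helper.

```python
# Hypothetical balanced data: data[i][j] holds the n replicates of cell (i, j).
data = [
    [[4.0, 6.0], [7.0, 9.0]],
    [[5.0, 7.0], [12.0, 14.0]],
]


def two_way_ss(data):
    """Sums of squares for a balanced two-factor design with replication."""
    a, b, n = len(data), len(data[0]), len(data[0][0])
    cell = [[sum(data[i][j]) / n for j in range(b)] for i in range(a)]
    grand = sum(sum(row) for row in cell) / (a * b)
    a_means = [sum(cell[i]) / b for i in range(a)]
    b_means = [sum(cell[i][j] for i in range(a)) / a for j in range(b)]

    ssa = b * n * sum((m - grand) ** 2 for m in a_means)
    ssb = a * n * sum((m - grand) ** 2 for m in b_means)
    ssab = n * sum(
        (cell[i][j] - a_means[i] - b_means[j] + grand) ** 2
        for i in range(a)
        for j in range(b)
    )
    sse = sum(
        (x - cell[i][j]) ** 2 for i in range(a) for j in range(b) for x in data[i][j]
    )
    sst = sum(
        (x - grand) ** 2 for i in range(a) for j in range(b) for x in data[i][j]
    )
    return {"SSA": ssa, "SSB": ssb, "SSAB": ssab, "SSE": sse, "SST": sst}


ss2 = two_way_ss(data)
print(ss2)
```

The corresponding degrees of freedom are a - 1, b - 1, (a - 1)(b - 1) and ab(n - 1), which add up to abn - 1.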
[ANOVA table: factor, treatment, error, total]
> qf(0.95,3,64)
[1] 2.748191
> qf(0.95,9,64)
[1] 2.029792
> qf(0.95,15,64)
[1] 1.825586
> 1-pf(67.95,3,64)
[1] 0
> 1-pf(27.27,9,64)
[1] 0
> 1-pf(4.61,15,64)
[1] 7.473861e-06
Homework
Exercise 7.4.2 (SAS)
Exercise 7.4.3 (SAS)
7.5 Miscellaneous
Log transformation: when the normality assumption is violated (e.g. income, concentration)
Nonparametric approach: when normality is still problematic even after the variable transformation, or when the sample size is too small to check normality
One Way ANOVA: Types of Sums of Squares
*  Type I: sequential (if we know the relative importance of the variables)
   Type II: partial, without interaction terms
** Type III: partial, with interactions (if we don't know the relative importance of the variables)
   Type IV: for designs with missing cells (if there are none, same as Type III)
*, **: defaults
Model: $Y_{ij} = \mu + A_i + \varepsilon_{ij}$
One Way ANOVA, mod12.sas
/* File : mod12.sas
   To demonstrate one way ANOVA */
filename in 'd:\intro\taillite.dat';
data one;
  infile in;
  input id vehtype group position speedzn resptime follotme folltmec;
  if group = 1;        /* keep only the observations with group = 1 */
run;
proc sort; by vehtype;
proc means;
  var resptime;
  by vehtype;
  title 'Means of Response Time by Vehicle Type';
run;
proc gplot;
  plot resptime*vehtype;
  symbol i=box;
  title 'Box Plot Response Time by Vehicle Type';
run;
proc anova;
  class vehtype;
  model resptime = vehtype;
  means vehtype / tukey lines bon cldiff scheffe snk lsd;
  title 'One way ANOVA for Tail Light Study';
  title2;
run;
Two Way ANOVA, mod13.sas
/* File : mod13.sas
   To demonstrate Two way ANOVA */
filename stiff 'd:\intro\dummy.dat';
data one;
  infile stiff;
  input species $ impactor $ stiff1 stiff2 calcium magnesm;
run;
proc gchart;
  block species / group=impactor sumvar=stiff1 type=mean;
  title 'Block Chart of Stiff1 by Impactor and Species';
run;
proc anova;
  class species impactor;
  model stiff1 = species impactor species*impactor;   /* main effects and interaction */
  means species impactor / duncan lines;
  title 'Two way ANOVA Dummy Data';
run;