Chapter 8 Experimental Design and Analysis of Variance (ANalysis Of VAriance, ANOVA) Updated 2018/4/30
8.1 Introduction
Analysis of variance: a technique that divides the total variation into several components, so that the contribution of each source of variation to the total can be assessed.
Aims:
  estimation and hypothesis testing for the population variances
  estimation and hypothesis testing for the population means
Motivation
Two groups to compare -> t-test.
More than two groups -> one could form every pair of groups and run a t-test on each pair (pairwise t-tests), but:
  this is cumbersome and can lead to theoretically wrong conclusions (the multiple-comparisons problem);
  each test uses only part of the data rather than all of it, so efficiency is lost.
Comparing three or more groups using the whole data set -> analysis of variance (ANOVA).
(The response variable is continuous; the explanatory variable is categorical.)
8.2 One-way analysis of variance
One explanatory variable.
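Completely Randomized Design
Definition: treatments are assigned to the experimental units at random (randomization), and the treatment effects are then compared.

Sample observed under a completely randomized design:

            Treatment
            1            2            3            ...   k
            x_11         x_12         x_13         ...   x_1k
            x_21         x_22         x_23         ...   x_2k
            x_31         x_32         x_33         ...   x_3k
            ...          ...          ...          ...   ...
            x_{n_1,1}    x_{n_2,2}    x_{n_3,3}    ...   x_{n_k,k}
Total       T.1          T.2          T.3          ...   T.k          T..
Mean        x̄.1          x̄.2          x̄.3          ...   x̄.k          x̄..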
One-way analysis of variance
Example 8.2.1 Comparison of selenium concentration of meat according to age of cattle
Age group of cattle: A B C D
1820 1483 191 724 1020 1652 775 752 2588 1723 1098 613 805 1309 1393 804 2670 727 644 918 631 1002 533 1182 1022 1463 136 949 641 966 734 1243 1555 1777 1605 877 760 788 485 985 222 1129 1247 1368 1085 472 449 1295 1197 1529 1692 775 471 236 1676 1249 1422 697 1307 771 831 754 1520 445 849 344 869 698 937 489 990 1199 961 513 167 1022 2575 489 429 239 731 824 1073 1426 2408 798 944 1130 448 948 1846 1064 631 1096 1034 991 222 1088 629 1016 1261 590 721 912 1025 42 994 375 1383 948 767 1781 1187
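Model
$$x_{ij} = \mu_j + \epsilon_{ij} = \mu + \tau_j + \epsilon_{ij}$$
  $x_{ij}$: ij-th observation
  $\mu_j$: mean of the j-th treatment group
  $\epsilon_{ij}$: error of the ij-th observation
$$\mu = \frac{1}{k}\sum_{j=1}^{k}\mu_j, \qquad \tau_j = \mu_j - \mu$$
  $\mu$: grand mean
  $\tau_j$: effect of the j-th treatment group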
Model assumptions
  1. $\epsilon_{ij} \sim N(0, \sigma^2)$, independent
  2. That is: zero mean, equal variance (homogeneity), normality, independence

Hypotheses of the model
  $H_0: \mu_1 = \mu_2 = \dots = \mu_k$
  $H_A$: not all $\mu_j$ are equal
[Figure: group distributions in two cases: (a) same variances and same means; (b) same variances but different means]
Total sum of squares (SST)
$$SST = \sum_{j=1}^{k}\sum_{i=1}^{n_j}(x_{ij} - \bar{x}_{\cdot\cdot})^2 = \sum_{j=1}^{k}\sum_{i=1}^{n_j} x_{ij}^2 - \frac{T_{\cdot\cdot}^2}{N}$$
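$$SST = \sum_{j=1}^{k}\sum_{i=1}^{n_j}(x_{ij} - \bar{x}_{\cdot\cdot})^2 = \sum_{j=1}^{k}\sum_{i=1}^{n_j}(x_{ij} - \bar{x}_{\cdot j} + \bar{x}_{\cdot j} - \bar{x}_{\cdot\cdot})^2$$
$$= \sum_{j=1}^{k}\sum_{i=1}^{n_j}(x_{ij} - \bar{x}_{\cdot j})^2 + 2\sum_{j=1}^{k}\sum_{i=1}^{n_j}(x_{ij} - \bar{x}_{\cdot j})(\bar{x}_{\cdot j} - \bar{x}_{\cdot\cdot}) + \sum_{j=1}^{k}\sum_{i=1}^{n_j}(\bar{x}_{\cdot j} - \bar{x}_{\cdot\cdot})^2$$
$$= \sum_{j=1}^{k}\sum_{i=1}^{n_j}(x_{ij} - \bar{x}_{\cdot j})^2 + \sum_{j=1}^{k} n_j(\bar{x}_{\cdot j} - \bar{x}_{\cdot\cdot})^2$$
The cross term vanishes because $\sum_{i=1}^{n_j}(x_{ij} - \bar{x}_{\cdot j}) = 0$ within every group.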
$$SST = SSW + SSA$$
  SSW: within-group sum of squares
  SSA: among (between)-group sum of squares
$$\text{Variance ratio (VR)} = \frac{MSA}{MSW} = \frac{\text{among-group mean square}}{\text{within-group mean square}}$$
-> A larger variance ratio means the between-group variation is large relative to the within-group variation: the groups differ, i.e. the group effect is large.
Heritability Example
ANOVA Table

Source (factor)    Sum of squares                                                           df      Mean square          Variance ratio (F)
Between groups     $SSA = \sum_{j=1}^{k} n_j(\bar{x}_{\cdot j} - \bar{x}_{\cdot\cdot})^2$      k - 1   MSA = SSA/(k - 1)    MSA/MSW
Within groups      $SSW = \sum_{j=1}^{k}\sum_{i=1}^{n_j}(x_{ij} - \bar{x}_{\cdot j})^2$        N - k   MSW = SSW/(N - k)
Total              $SST = \sum_{j=1}^{k}\sum_{i=1}^{n_j}(x_{ij} - \bar{x}_{\cdot\cdot})^2$     N - 1
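The sum-of-squares formulas in the ANOVA table can be checked by hand on a tiny data set. The sketch below is plain Python and is not part of the course's SAS/R material; the three groups and the function name `one_way_anova` are hypothetical, chosen only to keep the arithmetic small.

```python
# One-way ANOVA quantities computed directly from the formulas.
# The data are hypothetical (not from the lecture).
groups = {
    "g1": [4.0, 5.0, 6.0],
    "g2": [7.0, 8.0, 9.0, 10.0],
    "g3": [1.0, 2.0, 3.0],
}


def one_way_anova(groups):
    """Return SSA, SSW, SST, mean squares and the variance ratio."""
    values = [x for xs in groups.values() for x in xs]
    N, k = len(values), len(groups)
    grand_mean = sum(values) / N
    means = {g: sum(xs) / len(xs) for g, xs in groups.items()}

    # Between-group SS: sum of n_j * (group mean - grand mean)^2
    ssa = sum(len(xs) * (means[g] - grand_mean) ** 2 for g, xs in groups.items())
    # Within-group SS: squared deviations from each group's own mean
    ssw = sum((x - means[g]) ** 2 for g, xs in groups.items() for x in xs)
    # Total SS, by definition and by the shortcut sum(x^2) - T..^2 / N
    sst = sum((x - grand_mean) ** 2 for x in values)
    sst_shortcut = sum(x * x for x in values) - sum(values) ** 2 / N

    msa, msw = ssa / (k - 1), ssw / (N - k)
    return {
        "SSA": ssa, "SSW": ssw, "SST": sst, "SST_shortcut": sst_shortcut,
        "df": (k - 1, N - k, N - 1),
        "MSA": msa, "MSW": msw, "VR": msa / msw,
    }


table = one_way_anova(groups)
for name, value in table.items():
    print(name, value)
```

The printed SST equals SSA + SSW, and the shortcut form gives the same total, which is a quick sanity check before comparing against SAS or R output.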
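ANOVA Table: expected mean squares
Let $\sigma_1^2 = E(MSA)$ and $\sigma_2^2 = E(MSW)$. With equal group sizes $n_j = n$ and treatment effects $\tau_j = \mu_j - \mu$:
$$E(MSW) = \sigma^2, \qquad E(MSA) = \sigma^2 + n\,\sigma_A^2, \qquad \sigma_A^2 = \frac{1}{k-1}\sum_{j=1}^{k}\tau_j^2$$
The null hypothesis $\tau_1 = \dots = \tau_k = 0$ (that is, $\sigma_A^2 = 0$) states that MSA and MSW estimate the same variance. Under the alternative hypothesis, the variance estimate from MSA is larger than that from MSW.
-> Right-tailed test.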
Program

* SAS code ;
PROC IMPORT OUT=WORK.sele
    DATAFILE="E:\kim\yes\myweb\int\2018\newlectureNote\data\sele.csv"
    DBMS=CSV REPLACE;
    GETNAMES=YES;
    DATAROW=2;
RUN;

proc anova data=sele;
    class group;
    model value=group;
run;

# R code (R is case-sensitive; the column is named Group in the printed Call)
> sele <- read.table('e:\\kim\\yes\\myweb\\int\\2018\\newlecturenote\\data\\sele.csv', sep=',', header=TRUE)
> aov(value~Group, data=sele)
> boxplot(value~Group, data=sele)
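Call:
   aov(formula = value ~ Group, data = sele)

Terms:
                   Group Residuals
Sum of Squares   5931208  23026500
Deg. of Freedom        3       109

Residual standard error: 459.6219
Estimated effects may be unbalanced

Reading the output: Group is the between-group row, Residuals is the within-group row; the total and the variance ratio (VR) follow from these two rows.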
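The `aov` printout gives only sums of squares and degrees of freedom, so the mean squares, the total and the variance ratio have to be filled in by hand. A small Python sketch of that arithmetic, using only the four numbers printed above (the variable names are illustrative):

```python
from math import sqrt

# Values copied from the aov() output
ss_between, df_between = 5931208.0, 3     # Group row
ss_within, df_within = 23026500.0, 109    # Residuals row

ms_between = ss_between / df_between      # MSA
ms_within = ss_within / df_within         # MSW
variance_ratio = ms_between / ms_within   # VR = MSA / MSW

ss_total = ss_between + ss_within         # SST = SSA + SSW
df_total = df_between + df_within         # N - 1

print("MSA =", round(ms_between, 1))
print("MSW =", round(ms_within, 1))
print("VR  =", round(variance_ratio, 2))
print("SST =", ss_total, "on", df_total, "df")
# The square root of MSW is the "Residual standard error" line
print("sqrt(MSW) =", round(sqrt(ms_within), 4))
```

The variance ratio comes out at about 9.36 on (3, 109) degrees of freedom, and sqrt(MSW) reproduces the reported residual standard error.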
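Multiple Comparisons
Example: each individual test uses significance level $\alpha$. Let
  $H_{01}: \mu_1 = 0$, with $P(\text{do not reject } H_{01} \mid H_{01} \text{ is true}) = 1 - \alpha$
  $H_{02}: \mu_2 = 0$, with $P(\text{do not reject } H_{02} \mid H_{02} \text{ is true}) = 1 - \alpha$
Then, for $H_0 = H_{01} \text{ and } H_{02}$ and independent tests,
$$P(\text{do not reject } H_0 \mid H_0) = P(\text{do not reject } H_{01} \text{ and do not reject } H_{02} \mid H_0) = (1-\alpha)(1-\alpha) = (1-\alpha)^2$$
In general, if we want to test $\mu_1 = \mu_2 = \mu_3 = \dots = \mu_k = 0$ with k separate tests, the probability of no false rejection is $(1-\alpha)^k$.
With $\alpha = 0.05$ and $k = 4$: $(0.95)^4 = 0.8145$, so $1 - 0.8145 = 0.1855$.
The overall $\alpha$ is 0.1855, not 0.05 -> inflated type I error!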
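The inflation of the overall type I error can be tabulated for several numbers of tests. This is a minimal Python sketch assuming independent tests, as in the derivation; the function name `familywise_error` is illustrative.

```python
def familywise_error(alpha, k):
    """Probability of at least one false rejection among k independent
    tests, each carried out at significance level alpha."""
    return 1.0 - (1.0 - alpha) ** k


for k in (1, 2, 4, 10):
    print(k, round(familywise_error(0.05, k), 4))
```

For k = 4 this reproduces the 0.1855 of the slide; with 10 tests the overall error is already about 0.40.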
Bonferroni Correction
Set the individual significance level to $\alpha/m$; the overall significance level is then about $\alpha$ for m multiple tests.
  m = 4: $1 - (1 - 0.05/4)^4 \approx 1 - 0.951 = 0.049 \approx 0.05$
Example: when we have 10 hypotheses,
  individual level 0.05 -> multiple-comparisons problem (too many false findings)
  individual level 0.05/10 = 0.005
This is often called the Bonferroni-corrected p-value (strictly, a corrected significance threshold).
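A short Python sketch of the Bonferroni arithmetic, again assuming independent tests when the overall level is computed; both function names are illustrative.

```python
def bonferroni_level(alpha, m):
    """Per-test significance level for m tests under Bonferroni."""
    return alpha / m


def overall_level(per_test_alpha, m):
    """Overall type I error of m independent tests at per_test_alpha."""
    return 1.0 - (1.0 - per_test_alpha) ** m


for m in (4, 10):
    per_test = bonferroni_level(0.05, m)
    print(m, per_test, round(overall_level(per_test, m), 4))
```

The overall level stays just below the nominal 0.05, which is why the correction is described as conservative.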
[Testing the difference between two population means for each pair of treatment groups]
Detecting pairwise differences
After rejecting $H_0: \mu_1 = \mu_2 = \dots = \mu_5$, which pairs have larger differences?
  1. LSD (least significant difference)
  2. Duncan's new multiple range test
  3. Tukey's HSD

Liberal  <-  Duncan, LSD, SNK, Tukey HSD, Scheffe  ->  Conservative
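3. Tukey's HSD (honestly significant difference) test
When the $n_j$'s are the same:
$$HSD = q_{\alpha,k,N-k}\sqrt{\frac{MSE}{n}}$$
When the $n_j$'s differ:
$$HSD^* = q_{\alpha,k,N-k}\sqrt{\frac{MSE}{n_j^*}}, \qquad n_j^*: \text{sample size of the smaller cell}$$
  $q_{\alpha,k,N-k}$: critical value of the distribution of $(\bar{y}_{max} - \bar{y}_{min})/\sqrt{S^2/n}$
  $\alpha$: significance level, k: number of groups, N - k: df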
Example 8.2.2 Pairwise differences

        A        B        C        D
A       -      455.72   574.54   596.63
B                -      118.82   140.91
C                         -       22.10
D                                   -
Table 8.2.6 Pairwise comparisons by Tukey's HSD test

Individual null hypothesis    HSD*                                                          Result
$H_0: \mu_A = \mu_B$    $3.690\sqrt{\frac{211252}{2}(\frac{1}{22}+\frac{1}{14})} = 409.99$    455.72 > 409.99, so $H_0$ is rejected.
$H_0: \mu_A = \mu_C$    $3.690\sqrt{\frac{211252}{2}(\frac{1}{22}+\frac{1}{29})} = 339.05$    574.54 > 339.05, so $H_0$ is rejected.
$H_0: \mu_A = \mu_D$    $3.690\sqrt{\frac{211252}{2}(\frac{1}{22}+\frac{1}{48})} = 308.75$    596.63 > 308.75, so $H_0$ is rejected.
$H_0: \mu_B = \mu_C$    $3.690\sqrt{\frac{211252}{2}(\frac{1}{14}+\frac{1}{29})} = 390.27$    118.82 < 390.27, so $H_0$ is not rejected.
$H_0: \mu_B = \mu_D$    $3.690\sqrt{\frac{211252}{2}(\frac{1}{14}+\frac{1}{48})} = 364.25$    140.91 < 364.25, so $H_0$ is not rejected.
$H_0: \mu_C = \mu_D$    $3.690\sqrt{\frac{211252}{2}(\frac{1}{29}+\frac{1}{48})} = 282.05$    22.10 < 282.05, so $H_0$ is not rejected.
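The six comparisons in Table 8.2.6 can be reproduced with a few lines of Python. The sketch uses only numbers that appear in the slides (q = 3.690, MSE = 211252, the four group sizes and the pairwise differences of Example 8.2.2) and the pairwise form with both group sizes, often called the Tukey-Kramer form. Small differences in the second decimal against the table are expected from rounding of the inputs.

```python
from math import sqrt

q_crit = 3.690          # critical value used in Table 8.2.6
mse = 211252.0          # within-group mean square
n = {"A": 22, "B": 14, "C": 29, "D": 48}

# Absolute differences of group means from Example 8.2.2
diffs = {
    ("A", "B"): 455.72, ("A", "C"): 574.54, ("A", "D"): 596.63,
    ("B", "C"): 118.82, ("B", "D"): 140.91, ("C", "D"): 22.10,
}


def hsd(n_i, n_j):
    """HSD* = q * sqrt((MSE / 2) * (1/n_i + 1/n_j))."""
    return q_crit * sqrt(mse / 2.0 * (1.0 / n_i + 1.0 / n_j))


results = {}
for (g1, g2), d in diffs.items():
    threshold = hsd(n[g1], n[g2])
    results[(g1, g2)] = (threshold, d > threshold)
    verdict = "reject H0" if d > threshold else "do not reject H0"
    print(f"{g1} vs {g2}: diff={d:.2f}, HSD*={threshold:.2f} -> {verdict}")
```

Only the three comparisons involving group A exceed their thresholds, matching the conclusions in the table.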
proc anova;
    class group;
    model value = group;
    means group / tukey;
run;
Homework 1-8
-> Make the ANOVA tables for the following problems using the formulae (you may use MS Excel).
-> Compare your results with the results from SAS.
8.3 Randomized complete block design and two-way ANOVA
R. A. Fisher (1925): to compare the yields of certain species, divide the land into blocks (block = plot of land) and randomize (the other factors) within each block.

Block     Treatments                                  Total    Mean
          1       2       3       ...     k
1         x_11    x_12    x_13    ...     x_1k        T_1.     x̄_1.
2         x_21    x_22    x_23    ...     x_2k        T_2.     x̄_2.
3         x_31    x_32    x_33    ...     x_3k        T_3.     x̄_3.
...       ...     ...     ...     ...     ...         ...      ...
n         x_n1    x_n2    x_n3    ...     x_nk        T_n.     x̄_n.
Total     T.1     T.2     T.3     ...     T.k         T..
Mean      x̄.1     x̄.2     x̄.3     ...     x̄.k                  x̄..
Example 8.3.1 Treatment duration (days) by drug

Age group       Drug A    Drug B    Drug C    Sum    Average
Under 20          11         8        10       29      9.667
20-29              6         5        11       22      7.333
30-39              7        10        13       30     10
40-49              9        12        13       34     11.333
50 and over       10        17        15       42     14
Sum               43        52        62      157
Average            8.6      10.4      12.4            10.467
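Model
$$x_{ij} = \mu + \beta_i + \tau_j + \epsilon_{ij}$$
  $\beta_i$: block effect
  $\tau_j$: treatment (trt) effect
$$\epsilon_{ij} = x_{ij} - (\mu + \beta_i + \tau_j) \sim N(0, \sigma^2)$$
Hypotheses
  $H_0: \tau_j = 0, \quad j = 1, 2, \dots, k$
  $H_A$: not all $\tau_j = 0$ (some $\tau_j \neq 0$)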
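$$SST = \sum_{j=1}^{k}\sum_{i=1}^{n}(x_{ij} - \bar{x}_{\cdot\cdot})^2$$
$$= \sum_{j=1}^{k}\sum_{i=1}^{n}(\bar{x}_{i\cdot} - \bar{x}_{\cdot\cdot})^2 + \sum_{j=1}^{k}\sum_{i=1}^{n}(\bar{x}_{\cdot j} - \bar{x}_{\cdot\cdot})^2 + \sum_{j=1}^{k}\sum_{i=1}^{n}(x_{ij} - \bar{x}_{i\cdot} - \bar{x}_{\cdot j} + \bar{x}_{\cdot\cdot})^2$$
$$SST = SSBl + SSTr + SSE$$
df: $nk - 1 = (n - 1) + (k - 1) + (n - 1)(k - 1)$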
ANOVA table

Source (factor)      Sum of squares    Degrees of freedom    Mean square                    F
Treatments (trt)     SSTr              k - 1                 MSTr = SSTr/(k - 1)            MSTr/MSE
Blocks               SSBl              n - 1                 MSBl = SSBl/(n - 1)
Error (residual)     SSE               (n - 1)(k - 1)        MSE = SSE/[(n - 1)(k - 1)]
Total                SST               kn - 1
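As a worked instance of this table, the sketch below applies the block-design formulas to the drug data of Example 8.3.1 (rows are the five age blocks, columns are drugs A, B, C). It is plain Python, separate from the course's SAS/R code, and the variable names are illustrative.

```python
# Rows: age blocks (under 20, 20-29, 30-39, 40-49, 50 and over)
# Columns: drugs A, B, C (data of Example 8.3.1)
data = [
    [11, 8, 10],
    [6, 5, 11],
    [7, 10, 13],
    [9, 12, 13],
    [10, 17, 15],
]
n, k = len(data), len(data[0])            # n blocks, k treatments

grand_mean = sum(sum(row) for row in data) / (n * k)
block_means = [sum(row) / k for row in data]
trt_means = [sum(data[i][j] for i in range(n)) / n for j in range(k)]

ss_bl = k * sum((m - grand_mean) ** 2 for m in block_means)   # SSBl
ss_tr = n * sum((m - grand_mean) ** 2 for m in trt_means)     # SSTr
ss_t = sum((x - grand_mean) ** 2 for row in data for x in row)
ss_e = ss_t - ss_bl - ss_tr                                   # SSE

ms_tr = ss_tr / (k - 1)
ms_bl = ss_bl / (n - 1)
ms_e = ss_e / ((n - 1) * (k - 1))
f_tr, f_bl = ms_tr / ms_e, ms_bl / ms_e

print(f"Treatments: SS={ss_tr:.3f} df={k - 1} MS={ms_tr:.4f} F={f_tr:.4f}")
print(f"Blocks:     SS={ss_bl:.3f} df={n - 1} MS={ms_bl:.4f} F={f_bl:.4f}")
print(f"Error:      SS={ss_e:.3f} df={(n - 1) * (k - 1)} MS={ms_e:.4f}")
print(f"Total:      SS={ss_t:.3f} df={n * k - 1}")
```

The treatment row gives F of about 3.45 on (2, 8) degrees of freedom and the block row about 3.43 on (4, 8), which can be compared line by line with the software output for the same data.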
R session:

> response <- c(11, 6, 7, 9, 10, 8, 5, 10, 12, 17, 10, 11, 13, 13, 15)
> drug <- factor(rep(c('a','b','c'), each=5))
> age <- factor(rep(1:5, 3))   # note: error in the textbook
> dat <- data.frame(response=response, drug=drug, age=age)
> anova(lm(response~drug+age, data=dat))
Analysis of Variance Table

Response: response
          Df Sum Sq Mean Sq F value  Pr(>F)
drug       2 36.133 18.0667  3.4522 0.08300 .
age        4 71.733 17.9333  3.4268 0.06505 .
Residuals  8 41.867  5.2333
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

SAS program:

data d;   /* drug index: 1 1 1 1 1 2 2 ... */
    do j=1 to 3; do i=1 to 5; drug=j; output; end; end;
run;
data a;   /* age index: 1 2 3 4 5 1 2 ... */
    do i=1 to 3; do j=1 to 5; age=j; output; end; end;
run;
data cc;
    input response @@;
    cards;
11 6 7 9 10 8 5 10 12 17 10 11 13 13 15
;
run;
data res;
    merge a(keep=age) d(keep=drug) cc;
run;
proc print data=res; run;
proc anova;
    class age drug;
    model response = drug age;
run;
8.4 Factorial design and two-way ANOVA
Response: reduction of response time
Factors: drug level (low, medium, high) x age group (middle-aged, elderly)

Without interaction

                          Factor B: drug level
Factor A: age             j=1     j=2     j=3
Middle-aged (i=1)          5      10      20
Elderly (i=2)             10      15      25
With interaction

                          Factor B: drug level
Factor A: age             j=1     j=2     j=3     (j=2)-(j=1)     (j=3)-(j=2)
Middle-aged (i=1)          5      10      20           5              10
Elderly (i=2)             15      10       5          -5              -5
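The difference between the two tables of cell means can be made explicit by computing, for each drug level, the gap between the two age groups. A minimal Python sketch follows; the function names are illustrative, and the equality check applies only to noise-free cell means like these, whereas real data need the F test on the interaction term.

```python
# Cell means from the two tables (rows: middle-aged, elderly; columns: j=1,2,3)
without_interaction = [[5, 10, 20], [10, 15, 25]]
with_interaction = [[5, 10, 20], [15, 10, 5]]


def age_gaps(cell_means):
    """Elderly minus middle-aged mean at each drug level."""
    mid, old = cell_means
    return [o - m for m, o in zip(mid, old)]


def has_interaction(cell_means):
    """True when the age gap changes with the drug level,
    i.e. the two profiles are not parallel."""
    return len(set(age_gaps(cell_means))) > 1


for name, table in [("without", without_interaction), ("with", with_interaction)]:
    print(name, age_gaps(table), has_interaction(table))
```

A constant gap (5, 5, 5) means the age effect is the same at every dose; a changing gap (10, 0, -15) means the effect of dose depends on age.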
Two-factor completely randomized design (2 factors)

              Factor B
Factor A      1                  2                  ...    b                  Total    Mean
1             x_111 ... x_11n    x_121 ... x_12n    ...    x_1b1 ... x_1bn    T_1..    x̄_1..
2             x_211 ... x_21n    x_221 ... x_22n    ...    x_2b1 ... x_2bn    T_2..    x̄_2..
...           ...                ...                ...    ...                ...      ...
a             x_a11 ... x_a1n    x_a21 ... x_a2n    ...    x_ab1 ... x_abn    T_a..    x̄_a..
Total         T.1.               T.2.               ...    T.b.               T...
Mean          x̄.1.               x̄.2.               ...    x̄.b.                        x̄...
EX 8.4.2 Length of a nurse's home visit, explained by the age of the nurse and the disease of the patient
Model
$$x_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \epsilon_{ijk}, \qquad i = 1, \dots, a; \; j = 1, \dots, b; \; k = 1, \dots, n$$
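Hypotheses
  Factor A:     $H_0: \alpha_i = 0, \; i = 1, \dots, a$           $H_A$: not $H_0$ ($\alpha_i \neq 0$ for some i)
  Factor B:     $H_0: \beta_j = 0, \; j = 1, \dots, b$            $H_A$: not $H_0$ ($\beta_j \neq 0$ for some j)
  Interaction:  $H_0: (\alpha\beta)_{ij} = 0, \; i = 1, \dots, a, \; j = 1, \dots, b$     $H_A$: not $H_0$ ($(\alpha\beta)_{ij} \neq 0$ for some i, j)

$$SST = SSA + SSB + SSAB + SSE$$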
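The partition SST = SSA + SSB + SSAB + SSE can be verified numerically on a small balanced layout. The sketch below is plain Python with hypothetical data (a = 2, b = 2, n = 2 replicates per cell), not the nurse data; the function name is illustrative.

```python
# cells[i][j] holds the n replicate observations for level i of A, level j of B.
# Hypothetical data, chosen to give round numbers.
cells = [
    [[4.0, 6.0], [9.0, 11.0]],
    [[14.0, 16.0], [9.0, 11.0]],
]


def two_way_ss(cells):
    """Sums of squares for a balanced two-factor design with replication."""
    a, b, n = len(cells), len(cells[0]), len(cells[0][0])
    values = [x for row in cells for cell in row for x in cell]
    grand = sum(values) / (a * b * n)
    cell_mean = [[sum(c) / n for c in row] for row in cells]
    a_mean = [sum(cell_mean[i]) / b for i in range(a)]
    b_mean = [sum(cell_mean[i][j] for i in range(a)) / a for j in range(b)]

    ssa = b * n * sum((m - grand) ** 2 for m in a_mean)
    ssb = a * n * sum((m - grand) ** 2 for m in b_mean)
    ssab = n * sum(
        (cell_mean[i][j] - a_mean[i] - b_mean[j] + grand) ** 2
        for i in range(a) for j in range(b)
    )
    sse = sum(
        (x - cell_mean[i][j]) ** 2
        for i in range(a) for j in range(b) for x in cells[i][j]
    )
    sst = sum((x - grand) ** 2 for x in values)
    df = {"A": a - 1, "B": b - 1, "AB": (a - 1) * (b - 1), "E": a * b * (n - 1)}
    return {"SSA": ssa, "SSB": ssb, "SSAB": ssab, "SSE": sse, "SST": sst, "df": df}


ss = two_way_ss(cells)
print(ss)
```

Each F statistic is the corresponding mean square divided by MSE = SSE / [ab(n - 1)]; in this made-up example factor B has no main effect, but the interaction is large.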
PROC IMPORT OUT=WORK.nurse
    DATAFILE="E:\kim\yes\myweb\int\2018\newlectureNote\data\nurse.csv"
    DBMS=CSV REPLACE;
    GETNAMES=YES;
    DATAROW=2;
RUN;

proc anova;
    class a b;
    model time = a b a*b;
run;

R: critical values and p-values

> qf(0.95,3,64)
[1] 2.748191
> qf(0.95,9,64)
[1] 2.029792
> qf(0.95,15,64)
[1] 1.825586
> 1-pf(67.95,3,64)
[1] 0
> 1-pf(27.27,9,64)
[1] 0
> 1-pf(4.61,15,64)
[1] 7.473861e-06
Miscellaneous
Log transformation: used when the normality assumption is violated (e.g. income, concentration).
If normality is still problematic even after the variable transformation, or the sample size is too small to check normality -> nonparametric approach.
One Way ANOVA
Type of Sum of Squares
  *  Type I: sequential (if we know the relative importance of the variables)
     Type II: partial, without interaction terms
  ** Type III: partial, with interactions (if we don't know the relative importance of the variables)
     Type IV: for designs with missing cells (if there are none, same as Type III)
  *, ** : defaults
Model: $Y_{ij} = \mu + A_i + \epsilon_{ij}$
One Way ANOVA, mod12.sas

/* File : mod12.sas
   To demonstrate one way ANOVA */
filename in 'd:\intro\taillite.dat';

data one;
    infile in;
    input id vehtype group position speedzn resptime follotme folltmec;
    if group = 1;   /* keep only observations with group = 1 */
run;

proc sort; by vehtype;

proc means;
    var resptime;
    by vehtype;
    title 'Means of Response Time by Vehicle Type';
run;

proc gplot;
    plot resptime*vehtype;
    symbol i=box;
    title 'Box Plot Response Time by Vehicle Type';
run;

proc anova;
    class vehtype;
    model resptime = vehtype;
    means vehtype / tukey lines bon cldiff scheffe snk lsd;
    title 'One way ANOVA for Tail Light Study';
    title2;
run;
Two Way ANOVA, mod13.sas

/* File : mod13.sas
   To demonstrate Two way ANOVA */
filename stiff 'd:\intro\dummy.dat';

data one;
    infile stiff;
    input species $ impactor $ stiff1 stiff2 calcium magnesm;
run;

proc gchart;
    block species / group=impactor sumvar=stiff1 type=mean;
    title 'Block Chart of Stiff1 by Impactor and Species';
run;

proc anova;
    class species impactor;
    model stiff1 = species impactor species*impactor;
    means species impactor / duncan lines;
    title 'Two way ANOVA Dummy Data';
run;