Chapter 10. The $\chi^2$ distribution and the analysis of frequencies (2014/5/22)
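10.2 Mathematical properties of the $\chi^2$ distribution
Definition of $\chi^2$: if $Z_1, \ldots, Z_n$ are independent $N(0,1)$ random variables, then
$$\sum_{i=1}^{n} Z_i^2 \sim \chi^2_n$$
E.g., $Z_i = \dfrac{Y_i - \mu}{\sigma}$ with $Y_i \sim N(\mu, \sigma^2)$.
Uses of $\chi^2$: tests of goodness-of-fit, tests of independence, tests of homogeneity.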
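The definition above can be checked numerically. The following is a minimal simulation sketch, not part of the original slides: it sums $n = 5$ squared standard normal draws many times and compares the sample mean and variance with the known $\chi^2_n$ values $n$ and $2n$. The function name `simulate_chi_square`, the seed, and the number of draws are illustrative assumptions.

```python
import random


def simulate_chi_square(n, draws, seed=0):
    """Return `draws` simulated values of Z_1^2 + ... + Z_n^2 with Z_i ~ N(0, 1)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n)) for _ in range(draws)]


samples = simulate_chi_square(n=5, draws=20000)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / (len(samples) - 1)

# A chi-square variable with n degrees of freedom has mean n and variance 2n,
# so both numbers should land near 5 and 10 respectively.
print(round(mean, 2), round(var, 2))
```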
10.3 Goodness-of-fit test
Does our data agree with a hypothesized theoretical distribution (normal, binomial, Poisson, etc.)?
[Table: inpatient occupancy ratio vs. number of hospitals; values not recoverable from the source]
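Example 10.3.1 (normal distribution)
$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} \sim \chi^2_{k-r}$$
$O_i$: observed frequency; $E_i$: expected frequency.
$r$ = number of restrictions ($\sum O_i = \sum E_i$) + number of parameters estimated.
[Table columns: interval, expected relative frequency, expected frequency, observed frequency; values not recoverable from the source]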
Result of Example 10.3.1
$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} = 25.854 > \chi^2_{9-3}(0.005) = 18.548$$
Reject $H_0$: the data are not normally distributed.
The chi-square approximation is good only when the expected frequencies are large enough ($E_i > 10$). When $E_i < 5$, merge cells and re-categorize so that each cell's expected frequency exceeds 10.
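As a minimal sketch of the goodness-of-fit computation: the observed and expected frequencies of the example are not available in this text, so the toy counts below are hypothetical. The helper names `gof_statistic` and `gof_df` are illustrative, and the reading that two parameters (mean and standard deviation) were estimated for the normal fit is an assumption consistent with the slide's $9 - 3$ degrees of freedom.

```python
def gof_statistic(observed, expected):
    """Pearson goodness-of-fit statistic: sum of (O_i - E_i)^2 / E_i."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def gof_df(k, n_estimated_params):
    """Degrees of freedom k - r, where r = 1 restriction + estimated parameters."""
    return k - (1 + n_estimated_params)


# Hypothetical toy counts, only to show the arithmetic.
toy_stat = gof_statistic([8, 12, 10, 10], [10, 10, 10, 10])

# Numbers quoted on the slide: 9 cells, statistic 25.854, critical value 18.548.
df = gof_df(k=9, n_estimated_params=2)  # assumption: mean and sd were estimated
reject_h0 = 25.854 > 18.548

print(toy_stat, df, reject_h0)
```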
Example 10.3.2 (binomial distribution)
Under the binomial assumption, expected frequency = expected relative frequency × total.
$$\hat{p} = \frac{500}{2500} = 0.2, \qquad p(x) = \binom{25}{x} (0.2)^x (1 - 0.2)^{25 - x}, \quad x = 0, 1, 2, \ldots, 25$$
$$\chi^2 = \frac{(11 - 2.74)^2}{2.74} + \cdots + \frac{(0 - 1.73)^2}{1.73} = 47.624, \quad \text{compared with } \chi^2_{10-2}$$
The statistic is significant, so we reject $H_0$ (data ~ binomial distribution).
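The two expected frequencies that appear in the statistic (2.74 and 1.73) can be reproduced from the binomial probabilities. This is a sketch under assumptions that are inferred rather than stated on the slide: the total number of samples is taken as 100 (2500 / 25), the first cell is read as "0 or 1" and the last cell as "10 or more".

```python
from math import comb


def binom_pmf(x, n, p):
    """Binomial probability P(X = x) for n trials with success probability p."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


n_trials = 25
p_hat = 500 / 2500          # estimated success probability from the slide
n_samples = 100             # assumption: 2500 observations / 25 trials each

# Assumed pooled cells: "0 or 1" at the low end, "10 or more" at the high end.
expected_0_or_1 = n_samples * sum(binom_pmf(x, n_trials, p_hat) for x in (0, 1))
expected_10_plus = n_samples * sum(
    binom_pmf(x, n_trials, p_hat) for x in range(10, n_trials + 1)
)

print(round(expected_0_or_1, 2), round(expected_10_plus, 2))
```

Under these assumptions the two values agree with the slide to two decimals, which supports reading the end cells as pooled categories.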
Example 10.3.3 (Poisson distribution)
Expected relative frequency under the Poisson assumption:
$$p(x) = \frac{e^{-\lambda} \lambda^x}{x!}, \quad x = 0, 1, 2, \ldots$$
$\lambda = 3$: known.
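$$\chi^2 = \frac{(5 - 4.50)^2}{4.50} + \cdots + \frac{(2 - 1.08)^2}{1.08} = 3.664 < \chi^2_{9-1}(0.05) = 15.507$$
Not significant: we do not reject $H_0$ (data ~ Poisson distribution with $\lambda = 3$).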
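A short sketch of the expected relative frequencies under Poisson($\lambda = 3$). The cell layout (x = 0, ..., 7 plus a pooled "8 or more" cell) is an assumption chosen to give the 9 cells implied by the slide's $9 - 1$ degrees of freedom. Because $\lambda$ is known rather than estimated, only the restriction on the totals is subtracted.

```python
from math import exp, factorial


def poisson_pmf(x, lam):
    """Poisson probability P(X = x) = exp(-lam) * lam^x / x!."""
    return exp(-lam) * lam ** x / factorial(x)


lam = 3.0  # known, not estimated from the data

# Assumed cells: x = 0..7 individually, then everything from 8 upward pooled.
rel_freq = [poisson_pmf(x, lam) for x in range(8)]
rel_freq.append(1.0 - sum(rel_freq))

# One restriction (totals agree), zero estimated parameters.
df = len(rel_freq) - 1

for label, p in zip([str(x) for x in range(8)] + ["8+"], rel_freq):
    print(label, round(p, 4))
print("df =", df)
```

Multiplying each relative frequency by the total number of observations gives the expected frequency for that cell.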
Exercise 10.3.5: homework
10.4 Tests of independence
Contingency table
Example 10.4.1
[Table: blood type by progress of disease (none, some, serious); values not recoverable from the source]
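$H_0$: the two variables are independent.
$$P_{ij} = P_{i\cdot} P_{\cdot j}, \qquad E_{ij} = N P_{ij} = N \cdot \frac{N_{i\cdot}}{N} \cdot \frac{N_{\cdot j}}{N} = \frac{N_{i\cdot} N_{\cdot j}}{N}$$
$$\chi^2 = \sum_{i} \sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \sim \chi^2_{(r-1)(c-1)}, \quad i = 1, 2, 3; \; j = 1, 2, 3, 4$$
$r$ = number of rows, $c$ = number of columns.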
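The expected counts $E_{ij} = N_{i\cdot} N_{\cdot j} / N$ and the resulting statistic can be sketched as follows. The 2×3 table is hypothetical (the blood-type data of the example are not available in this text), and the function names are illustrative.

```python
def expected_counts(table):
    """Expected counts under independence: row total * column total / N."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def pearson_chi_square(table):
    """Return (statistic, degrees of freedom) for an r x c contingency table."""
    expected = expected_counts(table)
    stat = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


# Hypothetical 2 x 3 table, chosen so the arithmetic is easy to follow by hand.
toy = [[10, 20, 30], [20, 20, 20]]
toy_expected = expected_counts(toy)
toy_stat, toy_df = pearson_chi_square(toy)

print(toy_expected)
print(round(toy_stat, 4), toy_df)
```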
Small expected frequencies
If cells with an expected frequency below 5 make up no more than 20% of all cells and the minimum expected frequency is at least 1, small expected frequencies are not a problem.
2×2 table: do not use the $\chi^2$ test if $n < 20$, or if $20 < n < 49$ and one or more cells have an expected frequency below 5.
Yates adjustment (continuity correction): be sure to read about it!
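The 20% / minimum-1 rule can be expressed as a small check. This is a sketch with illustrative names. The counts are the 2×2 "severe" table that appears in the SAS example later in these slides (10, 2 / 2, 4), for which SAS warns that 75% of cells have expected counts below 5.

```python
def small_expected_check(expected, threshold=5.0, max_fraction=0.20, min_allowed=1.0):
    """Return (fraction of cells below threshold, smallest cell, rule satisfied?)."""
    cells = [e for row in expected for e in row]
    fraction_small = sum(e < threshold for e in cells) / len(cells)
    smallest = min(cells)
    ok = fraction_small <= max_fraction and smallest >= min_allowed
    return fraction_small, smallest, ok


observed = [[10, 2], [2, 4]]  # "severe" table from the SAS example
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)
expected = [[r * c / n for c in col_totals] for r in row_totals]

fraction_small, smallest, rule_ok = small_expected_check(expected)
print(expected, fraction_small, smallest, rule_ok)
```

Here three of the four expected counts fall below 5, so the rule fails and an exact test is the safer choice.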
Exercise 10.4.6: homework
10.5 Test of homogeneity
Homogeneity test: samples are drawn independently from each of several populations. Are their distributions the same, i.e., could the samples have come from one population?
Independence test: a single sample is drawn from one population; the row and column totals are not controlled but arise by chance.
Independence test vs. homogeneity test
Example 10.5.1
[Table: class (Freshman, Sophomore, Junior, Senior) by drug usage (experimental, casual, moderate to heavy); values not recoverable from the source]
$H_0$: the distribution of hallucinogen usage is the same (homogeneous) among the 4 groups (freshman through senior).
$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} = 19.4 > \chi^2_{(4-1)(3-1)} = 12.592$$
Reject $H_0$.
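The comparison with the critical value can also be read as a p-value. For an even number of degrees of freedom the upper tail of the chi-square distribution has a closed form, $P(\chi^2_{2m} > x) = e^{-x/2} \sum_{k=0}^{m-1} (x/2)^k / k!$. The sketch below uses that identity; the helper name is illustrative and it is not a general-purpose routine, since odd degrees of freedom are rejected.

```python
from math import exp, factorial


def chi2_sf_even_df(x, df):
    """Upper-tail probability P(chi2_df > x), valid only for even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2.0
    return exp(-half) * sum(half ** k / factorial(k) for k in range(df // 2))


df = (4 - 1) * (3 - 1)                        # 4 classes, 3 usage levels
p_at_critical = chi2_sf_even_df(12.592, df)   # tail area at the 5% critical value
p_value = chi2_sf_even_df(19.4, df)           # tail area at the observed statistic

print(df, round(p_at_critical, 4), round(p_value, 4))
```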
2×2 table: $\chi^2_1$ test
$$\chi^2 = \frac{n (ad - bc)^2}{(a + c)(b + d)(a + b)(c + d)} = \frac{220 (60 \cdot 72 - 40 \cdot 48)^2}{108 \cdot 112 \cdot 100 \cdot 120} = 8.7302 > 3.841$$
Reject $H_0$: the probabilities of having the disease in the two groups are significantly different.
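Comparing two probabilities
$H_0: p_1 = p_2$ vs. $H_a: p_1 \neq p_2$
$$Z = \frac{(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)}{\sqrt{\dfrac{\bar{p}(1 - \bar{p})}{n_1} + \dfrac{\bar{p}(1 - \bar{p})}{n_2}}}$$
E.g., $n_1 = 100$, $\hat{p}_1 = 0.60$, $n_2 = 120$, $\hat{p}_2 = 0.40$:
$$\bar{p} = \frac{0.60 \cdot 100 + 0.40 \cdot 120}{100 + 120} = 0.4909$$
$$Z = \frac{0.60 - 0.40}{\sqrt{\dfrac{0.4909 \cdot 0.5091}{100} + \dfrac{0.4909 \cdot 0.5091}{120}}} = 2.95469 > 1.96 \quad \text{(significant)}$$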
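For a 2×2 table the two-sample Z test and the $\chi^2_1$ test are the same test, because $Z^2 = \chi^2$. The sketch below checks this with the counts from the slides, reading the cells as a = 60, b = 40, c = 48, d = 72 (rows are the two groups). That layout is inferred from the margins 100, 120, 108, 112 in the slide formula.

```python
from math import sqrt

# Assumed cell layout: row 1 = group 1 (n1 = 100), row 2 = group 2 (n2 = 120).
a, b, c, d = 60, 40, 48, 72
n = a + b + c + d

# Shortcut formula for the 2 x 2 Pearson chi-square statistic.
chi2 = n * (a * d - b * c) ** 2 / ((a + c) * (b + d) * (a + b) * (c + d))

# Two-sample Z statistic with the pooled proportion under H0: p1 = p2.
n1, n2 = a + b, c + d
p1_hat, p2_hat = a / n1, c / n2
p_bar = (a + c) / n
z = (p1_hat - p2_hat) / sqrt(p_bar * (1 - p_bar) * (1 / n1 + 1 / n2))

print(round(chi2, 4), round(z, 5), round(z * z, 4))
```

The critical values match in the same way: $1.96^2 \approx 3.841$.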
Exercise 10.5.4: homework
debatea.sas

    * File : debatea.sas ;
    options ls=70 ps=55 nodate nonumber ;

    data one;
      input id school gender compare argue research reason speak ;
      if school in (3,5,6,8) ;
      label id='survey Number'
            school='high School'
            compare='how Debate Compares to OthersClasses'
            argue='argumentation'
            research='research'
            reason='reasoning'
            speak='speaking' ;
      cards;
    1 6 1 1 1 1 1 1
    108 7 1 1 1 1 1 2
    56 3 1 1 1 1 1 1
    ,,, (rows omitted)
    70 6 1 1 1 1 1 1
    69 6 2 1 1 1 1 1
    ;
    run;

    * Pearson chi-square test with expected cell counts ;
    proc freq data=one;
      tables school*compare / chisq expected ;
      title 'Comparing Schools in the Debate Survey';
    run;

    * Exact test for the same table ;
    proc freq data=one;
      tables school*compare / exact ;
      title 'Comparing Schools in the Debate Survey';
    run;
    data respire;
      input treat $ outcome $ count ;
      cards;
    test f 40
    test u 20
    placebo f 16
    placebo u 48
    ;

    proc freq;
      weight count;
      tables treat*outcome / chisq;
    run;
The SAS System
The FREQ Procedure
Table of treat by outcome

    treat      outcome
    Frequency |
    Percent   |
    Row Pct   |
    Col Pct   | f      | u      |  Total
    ----------+--------+--------+
    placebo   |     16 |     48 |     64
              |  12.90 |  38.71 |  51.61
              |  25.00 |  75.00 |
              |  28.57 |  70.59 |
    ----------+--------+--------+
    test      |     40 |     20 |     60
              |  32.26 |  16.13 |  48.39
              |  66.67 |  33.33 |
              |  71.43 |  29.41 |
    ----------+--------+--------+
    Total           56       68      124
                 45.16    54.84   100.00
Statistics for Table of treat by outcome

    Statistic                      DF      Value      Prob
    --------------------------------------------------------
    Chi-Square                      1    21.7087    <.0001
    Likelihood Ratio Chi-Square     1    22.3768    <.0001
    Continuity Adj. Chi-Square      1    20.0589    <.0001
    Mantel-Haenszel Chi-Square      1    21.5336    <.0001
    Phi Coefficient                      -0.4184
    Contingency Coefficient               0.3860
    Cramer's V                           -0.4184

    Fisher's Exact Test
    ------------------------------------
    Cell (1,1) Frequency (F)          16
    Left-sided Pr <= F         2.838E-06
    Right-sided Pr >= F           1.0000
    Table Probability (P)      2.397E-06
    Two-sided Pr <= P          4.754E-06

    Sample Size = 124
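Three of the reported statistics can be reproduced by hand from the four cell counts. The following is a sketch rather than a replacement for PROC FREQ. It uses the standard 2×2 shortcut formulas for the Pearson chi-square, the continuity-adjusted (Yates) chi-square, and the phi coefficient, with rows ordered as in the SAS output (placebo first).

```python
from math import sqrt

# Cell counts in the order shown in the SAS cross-tabulation.
a, b = 16, 48   # placebo: f, u
c, d = 40, 20   # test:    f, u
n = a + b + c + d

margin_product = (a + b) * (c + d) * (a + c) * (b + d)

# Pearson chi-square for a 2 x 2 table.
pearson = n * (a * d - b * c) ** 2 / margin_product

# Continuity-adjusted chi-square: shrink |ad - bc| by n/2 before squaring.
yates = n * (abs(a * d - b * c) - n / 2) ** 2 / margin_product

# Phi coefficient keeps the sign of ad - bc.
phi = (a * d - b * c) / sqrt(margin_product)

print(round(pearson, 4), round(yates, 4), round(phi, 4))
```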
    data severe;
      input treat $ outcome $ count ;
      cards;
    Test f 10
    Test u 2
    Control f 2
    Control u 4
    ;

    proc freq order=data;
      tables treat*outcome / chisq nocol;
      weight count;
    run;
The SAS System
The FREQ Procedure
Table of treat by outcome

    treat      outcome
    Frequency |
    Percent   |
    Row Pct   | f      | u      |  Total
    ----------+--------+--------+
    Test      |     10 |      2 |     12
              |  55.56 |  11.11 |  66.67
              |  83.33 |  16.67 |
    ----------+--------+--------+
    Control   |      2 |      4 |      6
              |  11.11 |  22.22 |  33.33
              |  33.33 |  66.67 |
    ----------+--------+--------+
    Total           12        6       18
                 66.67    33.33   100.00
Statistics for Table of treat by outcome

    Statistic                      DF      Value      Prob
    --------------------------------------------------------
    Chi-Square                      1     4.5000    0.0339
    Likelihood Ratio Chi-Square     1     4.4629    0.0346
    Continuity Adj. Chi-Square      1     2.5313    0.1116
    Mantel-Haenszel Chi-Square      1     4.2500    0.0393
    Phi Coefficient                       0.5000
    Contingency Coefficient               0.4472
    Cramer's V                            0.5000

    WARNING: 75% of the cells have expected counts less than 5.
             Chi-Square may not be a valid test.

    Fisher's Exact Test
    ------------------------------------
    Cell (1,1) Frequency (F)          10
    Left-sided Pr <= F            0.9961
    Right-sided Pr >= F           0.0573
    Table Probability (P)         0.0533
    Two-sided Pr <= P             0.1070

    Sample Size = 18
Exact test: all tables with the observed margins

    Table cell
    (1,1)   (1,2)   (2,1)   (2,2)   Probability
    ---------------------------------------------
      12       0       0       6      .0001
      11       1       1       5      .0039
      10       2       2       4      .0533
       9       3       3       3      .2370
       8       4       4       2      .4000
       7       5       5       1      .2560
       6       6       6       0      .0498
Table probabilities
One-tailed p-value: p = 0.0533 + 0.0039 + 0.0001 = 0.0573
Two-tailed p-value: p = 0.0533 + 0.0039 + 0.0001 + 0.0498 = 0.1071 (SAS reports 0.1070 because it sums the unrounded probabilities)
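The enumeration behind Fisher's exact test can be sketched directly: with the margins fixed, the (1,1) cell follows a hypergeometric distribution, and the two-sided p-value adds up every table that is no more probable than the observed one. The function name is illustrative, and the integer weights are used for the comparison to avoid floating-point ties.

```python
from math import comb


def fisher_exact_2x2(a, b, c, d):
    """Return (left-sided, right-sided, two-sided) p-values for a 2 x 2 table."""
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Unnormalised hypergeometric weight of each possible (1,1) cell value.
    weights = {x: comb(row1, x) * comb(row2, col1 - x) for x in range(lo, hi + 1)}
    observed = weights[a]
    left = sum(w for x, w in weights.items() if x <= a) / total_tables
    right = sum(w for x, w in weights.items() if x >= a) / total_tables
    two_sided = sum(w for w in weights.values() if w <= observed) / total_tables
    return left, right, two_sided


# "severe" table: Test (f = 10, u = 2), Control (f = 2, u = 4).
left, right, two_sided = fisher_exact_2x2(10, 2, 2, 4)
print(round(left, 4), round(right, 4), round(two_sided, 4))
```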
McNemar Test : Matched pairs
    data one;
      input hus_resp $ wif_resp $ no ;
      datalines;
    yes yes 20
    yes no 5
    no yes 10
    no no 10
    ;
    run;

    proc freq ;
      tables hus_resp*wif_resp / agree ;
      weight no ;
    run;

We do not reject $H_0$: the approval rates of husband and wife are the same.
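McNemar's test uses only the discordant pairs. As a minimal sketch with the counts from the data step (husband yes / wife no = 5, husband no / wife yes = 10), the statistic is $(b - c)^2 / (b + c)$ with 1 degree of freedom. The p-value uses the identity $P(\chi^2_1 > x) = \operatorname{erfc}(\sqrt{x/2})$. No continuity correction is applied here, which is an assumption about what the slide intends.

```python
from math import erfc, sqrt


def mcnemar(b, c):
    """McNemar statistic and p-value from the two discordant counts."""
    stat = (b - c) ** 2 / (b + c)
    # A chi-square(1) variable is a squared standard normal, so its upper
    # tail equals the two-sided normal tail, written with erfc.
    p_value = erfc(sqrt(stat / 2.0))
    return stat, p_value


stat, p_value = mcnemar(b=5, c=10)
print(round(stat, 4), round(p_value, 4))
```

The p-value is well above 0.05, which agrees with the conclusion that $H_0$ is not rejected.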
The confidence interval does not include 0, so we reject the null hypothesis $\kappa = 0$ at the 95% confidence level.
Kappa = 1: perfect agreement
Kappa > 0.8: excellent agreement
Kappa > 0.4: moderate agreement
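The kappa point estimate compares observed agreement with the agreement expected by chance. The sketch below applies the standard formula $\kappa = (p_o - p_e) / (1 - p_e)$ to the husband/wife counts from the SAS data step (20, 5 / 10, 10). It computes the point estimate only; the confidence interval mentioned above comes from the SAS output, which is not reproduced in this text.

```python
def cohen_kappa(table):
    """Cohen's kappa for a square agreement table (rows and columns same categories)."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    p_observed = sum(table[i][i] for i in range(len(table))) / n
    p_expected = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2
    return (p_observed - p_expected) / (1 - p_expected)


# Rows: husband (yes, no); columns: wife (yes, no).
agreement = [[20, 5], [10, 10]]
kappa = cohen_kappa(agreement)
print(round(kappa, 4))
```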
    # Chisq test by R
    # filename: chisq.r
    data <- matrix(c(25, 5, 15, 15), ncol=2, byrow=TRUE)
    data
    data2 <- matrix(c(16, 11, 3, 21, 8, 1), ncol=2, byrow=TRUE)
    data2
    chisq.test(data)
    chisq.test(data2)
    fisher.test(data2)

    # McNemar test for paired data
    data <- matrix(c(6, 2, 8, 4), ncol=2, byrow=TRUE)
    data
    mcnemar.test(data)

    ## From Agresti(2007) p.39
    M <- as.table(rbind(c(762, 327, 468), c(484, 239, 477)))
    dimnames(M) <- list(gender=c("M","F"),
                        party=c("Democrat","Independent","Republican"))
    M
    colSums(M)
    rowSums(M)
    cbind(M, rowSums(M))
    rbind(M, colSums(M))
    prop.table(M, margin=2)*100   # column percentages
    prop.table(M, margin=1)*100   # row percentages

    (Xsq <- chisq.test(M))  # Prints test summary
    Xsq$observed            # observed counts (same as M)
    Xsq$expected            # expected counts under the null
    Xsq$residuals           # Pearson residuals

    # The statistic is the sum of squared Pearson residuals
    sum(Xsq$residuals^2)
    1 - pchisq(sum(Xsq$residuals^2), (ncol(M)-1)*(nrow(M)-1))
Statistical tests by type of variable

    Dependent variable               | Independent variable                                      | Statistical test
    ---------------------------------+-----------------------------------------------------------+--------------------------------------------
    Continuous (blood pressure)      | Nominal, 2 levels                                         | t test, paired t test
    Continuous (blood pressure)      | Categorical, more than 2 levels                           | Analysis of variance (ANOVA)
    Categorical (disease status)     | Categorical (treatments: A, B, C, etc.)                   | Chi-square test (1 independent variable);
                                     |                                                           | logistic regression (>1 independent variable)
    Continuous (baby's weight)       | Continuous (gestation)                                    | Regression analysis
    Continuous (birth weight)        | Continuous + categorical (gestation + smoking status)     | Analysis of covariance (ANCOVA)
    Survival time (continuous, >0)   | Continuous + categorical (age + smoking status)           | Survival analysis
Parametric and non-parametric tests

    Characteristics of the data                     | Parametric          | Non-parametric
    ------------------------------------------------+---------------------+------------------------------------------------------
    Dependent variable is categorical               | Chi-square test     | Fisher's exact test, McNemar test, Cochran's Q
    Dependent variable is continuous, two           | t-test              | Wilcoxon rank sum test, Mann-Whitney test, median test
      independent groups                            |                     |
    Paired observations                             | Paired t-test       | Wilcoxon signed rank test
    More than 2 groups                              | ANOVA               | Kruskal-Wallis test
    Adjusting for other variables                   | 2-way ANOVA         | Friedman's 2-way ANOVA
    Correlation                                     | Pearson correlation | Spearman's correlation, Kendall's tau, Stuart's tau