Chapter 12 Nonparametric Statistics (nonparametric analysis) 2017/6/5
12.1 Introduction
Parametric approach
* Assumes a distribution for the population.
* The distribution is a function of its parameters, so knowing the parameters determines the distribution (and the characteristics of the population) completely.
* Estimation and testing of the parameters are the main problems.
* If the distributional assumption is wrong, all results of the analysis are questionable.
12.1 Introduction (continued)
Nonparametric approach
* Does not assume a distribution for the population (distribution-free method).
* Uses the order (ranks) of the data.
* If the parametric assumptions are valid, the parametric method is much more efficient (smaller variance, smaller p-values).
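Mean vs. median
Table columns: data, mean, median (entries as extracted: 1,2,3,4, ,2,3,4,5, )
* The median is robust to outliers, whereas the mean is sensitive to them.
* The median stays the same when an extreme value such as 100 is replaced by an even more extreme one.
* Nonparametric methods typically use the order of the data, not the values of the data.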
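As a small illustration of the robustness point, the following Python sketch compares the mean and the median of two made-up samples. The values are assumptions chosen for illustration; they are not the entries of the slide's table.

```python
from statistics import mean, median

# Illustrative values (assumed, not taken from the slide).
clean = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 100]  # last value replaced by an outlier

for label, data in (("clean", clean), ("with outlier", with_outlier)):
    # The mean moves with the outlier; the median does not.
    print(f"{label:13s} mean = {mean(data):5.1f}  median = {median(data)}")
```

Only the order of the observations matters for the median, which is why rank-based procedures inherit this insensitivity.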
Parametric vs. nonparametric methods
* Nonparametric methods do not depend on a distributional assumption (e.g. normality) for the data.
* They typically use ranks rather than the mean and variance.
* If the distributional assumption (e.g. normality) is valid, nonparametric methods are less efficient (larger variance).
* They give robust results (not sensitive to outliers).
12.2 Measurement scale
* Nominal scale: male, female; Seoul, Busan (NY, LA)
* Ordinal scale: high, medium, low
* Interval scale: order is meaningful and absolute differences are meaningful
* Ratio scale: ratios are meaningful as well
12.3 Sign test
Ex: table of student number (No) and score (Score)
Hypotheses: H0: Median = 102, HA: Median ≠ 102
Scores above (+) or below (-) the hypothesized median (102); table columns: student number, observed value
Decision rule
* HA: P(+) > P(-) (Median > 102): enough +'s -> reject H0
* HA: P(+) < P(-) (Median < 102): enough -'s -> reject H0
* HA: P(+) ≠ P(-) (Median ≠ 102): enough +'s or -'s -> reject H0
In the example, H0: Median = 102, i.e. P(+) = P(-), and HA: P(+) ≠ P(-).
Under H0, the number of +'s out of 15 ~ Bin(15, 1/2).
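Test statistic: $P(k \le 6 \mid 15, 0.5) = 0.3036$ (one-sided), so the two-sided p-value is 0.6072.
We cannot reject H0.
[Sign test for paired comparisons] The test uses whether each difference between paired observations is + or -. The sign test can be applied to paired observations (like the paired t-test).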
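The binomial tail probability used by the sign test can be reproduced with a few lines of Python. This is a minimal sketch under the slide's setting (n = 15 signs, observed count 6, p = 1/2 under H0); the function names `binom_cdf` and `sign_test_two_sided` are illustrative and not part of the slides.

```python
from math import comb


def binom_cdf(k, n, p=0.5):
    """P(K <= k) for K ~ Bin(n, p)."""
    return sum(comb(n, r) * p**r * (1 - p) ** (n - r) for r in range(k + 1))


def sign_test_two_sided(count, n):
    """Two-sided sign-test p-value: double the smaller tail, capped at 1."""
    k = min(count, n - count)
    return min(1.0, 2 * binom_cdf(k, n))


one_sided = binom_cdf(6, 15)            # P(k <= 6 | 15, 0.5)
two_sided = sign_test_two_sided(6, 15)
print(round(one_sided, 4), round(two_sided, 4))
```

Doubling the smaller tail is one common convention for the two-sided p-value; it matches the relation "1-sided = 2-sided / 2" quoted with the SAS output.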
data sign;
  input score @@;   /* @@ lets several scores share one data line */
  datalines;
;
run;
proc univariate mu0=102;
run;
Two-sided p-value = .6072; one-sided p-value = .6072/2 = .3036
Ex (comparison of paired groups): paired data, dental hygiene score
Table columns: id; instructed in toothbrushing (X_i); not instructed (Y_i)
Hypotheses
H0: the median of the differences is 0, P(+) = P(-)
HA: the median of the differences is negative, P(+) < P(-)
Test statistic: number of (+) signs (table columns: id, X_i, Y_i)
$P(k \le 2 \mid 11, p = 0.5) = \sum_{r=0}^{2} \binom{11}{r} 0.5^{r} \, 0.5^{11-r} = 0.0327$
pbinom(2,11,0.5) = 0.0327 < α = 0.05, so we reject H0 at α = 0.05.
[Sign test using the right tail]
[Sample size]
data pair;
  input edu noedu;
  diff = noedu - edu;
  datalines;
;
run;
proc univariate;
  var diff;
run;
Two-sided p-value = .0654; one-sided p-value = .0654/2 = .0327
12.4 Wilcoxon signed-rank test for location
Table columns: observation x_i; d_i = x_i - 5.05; rank of |d_i|; rank of |d_i| multiplied by the sign of d_i
$W_+ = 86$, $W_- = -34$, $W = 52$
H0: mean = 5.05, HA: mean ≠ 5.05
Test statistic: $W = W_+ + W_- = 52$
Reject H0 if W is too large or too small.
> wilcox.test(c(4.90,4.1,6.73,7.27,7.42,7.5,6.76,4.64,5.98,3.14,3.24,5.8,6.17,5.39,5.78), mu=5.05)
p-value =
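The rank sums quoted for this example can be checked with a short Python sketch. It uses the 15 observations from the `wilcox.test` call and the hypothesized location 5.05; the helper names `average_ranks` and `signed_rank_sums` are illustrative, and zero differences are simply dropped, which is one common convention.

```python
def average_ranks(values):
    """Ranks 1..n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def signed_rank_sums(x, mu0):
    """Return (W+, |W-|): rank sums of positive and negative differences."""
    d = [xi - mu0 for xi in x if xi != mu0]      # drop zero differences
    ranks = average_ranks([abs(di) for di in d])
    w_plus = sum(r for r, di in zip(ranks, d) if di > 0)
    w_minus = sum(r for r, di in zip(ranks, d) if di < 0)
    return w_plus, w_minus


x = [4.90, 4.1, 6.73, 7.27, 7.42, 7.5, 6.76, 4.64,
     5.98, 3.14, 3.24, 5.8, 6.17, 5.39, 5.78]
w_plus, w_minus = signed_rank_sums(x, 5.05)
print(w_plus, w_minus, w_plus - w_minus)
```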
12.5 Median test
H0: Median(rural) = Median(urban)
Data: mental health scores of urban and rural subjects
2x2 table layout:
                 Urban   Rural   Total
# >= median        a       b      a+b
# <  median        c       d      c+d
Total             a+c     b+d      n
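Under H0, the rows and columns of the 2x2 contingency table are independent.
$X^2 = \frac{n(ad - bc)^2}{(a+c)(b+d)(a+b)(c+d)}$, where a, b (first row) and c, d (second row) are the cell counts and n is their total.
Hence p > 0.10: do not reject H0. The medians of the two groups are not significantly different.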
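The median test can be sketched in Python as follows. The two samples are made-up values (the slide's data were not preserved), the function names are illustrative, and the p-value uses the chi-square approximation with 1 degree of freedom, for which the upper tail equals erfc(sqrt(x/2)).

```python
from math import erfc, sqrt
from statistics import median


def chi2_2x2(a, b, c, d):
    """X^2 = n(ad - bc)^2 / ((a+c)(b+d)(a+b)(c+d)) for a 2x2 table."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + c) * (b + d) * (a + b) * (c + d))


def median_test(x, y):
    """Return (X^2, approximate p-value) for two independent samples."""
    m = median(x + y)                      # pooled median
    a = sum(v >= m for v in x)             # x values at or above the median
    b = sum(v >= m for v in y)
    c = len(x) - a                         # x values below the median
    d = len(y) - b
    stat = chi2_2x2(a, b, c, d)
    return stat, erfc(sqrt(stat / 2))      # upper tail of chi-square(1)


# Hypothetical scores, for illustration only.
urban = [35, 26, 27, 21, 27, 38, 23, 25, 25, 27]
rural = [29, 50, 43, 22, 42, 47, 42, 32, 50, 37]
stat, p = median_test(urban, rural)
print(round(stat, 3), round(p, 4))
```

If every pooled value falls on one side of the median the denominator is zero; a production version would need to guard against that case.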
12.6 Mann-Whitney test
Assumptions: the two samples have sizes n and m, respectively.
1. The samples are drawn independently and randomly.
2. The measurement scale is at least ordinal.
3. The two populations have the same distribution (exactly the same shape) and differ only in their medians.
Ex: weight; Group 1 (X), Group 2 (Y)
Is the population median of group 1 smaller than that of group 2?
$H_0: M_X \ge M_Y$ vs. $H_A: M_X < M_Y$
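Table columns: group 1 value, rank; group 2 value, rank; total
W = rank sum of X
$U = W - \frac{m(m+1)}{2}$, where m is the number of observations in the group whose ranks are summed (here X)
Rule: reject H0 if U is small enough.
p-value = 0.14: the evidence is not enough to reject H0.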
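The statistic U = W - m(m+1)/2 can be computed with a short Python sketch. The function names are illustrative, the small samples in the usage lines are made up, and ties receive average ranks (an assumption, since the slide does not say how ties are handled).

```python
def average_ranks(values):
    """Ranks 1..n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """U = W - m(m+1)/2, with W the rank sum of x in the pooled sample."""
    ranks = average_ranks(list(x) + list(y))
    m = len(x)
    w = sum(ranks[:m])                 # rank sum of the first sample
    return w - m * (m + 1) / 2


# Hypothetical data: every x is below every y, so U takes its minimum, 0.
print(mann_whitney_u([1, 2, 3], [4, 5, 6]))
```

U counts the (x, y) pairs in which x exceeds y (ties counting one half), so a small U supports the alternative that X tends to be smaller than Y.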
> install.packages('coin')
> library(coin)
> xx <- c(252,240,205,200,170,170,320,148,214,185,310,212,238,184,136,200,270)
> yy <- c(254,164,288,138,240,217,240,302,312,254,164,288,138,240,217,240,302,312)
> dat <- data.frame(val=c(xx,yy), group=factor(rep(1:2,c(17,18))))
> wilcox_test(val~group, data=dat, distribution='exact')
        Exact Wilcoxon-Mann-Whitney Test
data:  val by group (1, 2)
Z = , p-value =
alternative hypothesis: true mu is not equal to 0
12.7 Kolmogorov-Smirnov (K-S) goodness-of-fit test
Are the cumulative distributions the same? Are the distributions of two populations the same?
$H_0: F_S(x) = F_T(x)$
$H_A: F_S(x) \ne F_T(x)$
$\hat{F}_S(x)$: sample cumulative distribution function, the proportion of observations $\le x$
$F_T(x)$: population (theoretical) cumulative distribution function, $\Pr(X \le x)$
Test statistic: $D = \sup_x \left| \hat{F}_S(x) - F_T(x) \right|$
Calculation method, example
Does the fasting blood glucose level follow a normal distribution?
Table columns: x, frequency, cumulative frequency, F_S(x); total 36
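Table 1 columns: x, $F_S(x)$, $F_T(x)$, $|F_S(x) - F_T(x)|$
Table 2 columns: x, $z = (x - 80)/6$, $F_T(x)$
Intervals of x: [67,72), [72,75), [75,76), [76,77), [77,78), [78,80), [80,83), [83,84), [84,86), [86,87), [87,92), [92, ∞)
D = < 0.221; since D is below the critical value 0.221, H0 (normality) is not rejected.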
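The one-sample K-S statistic against N(80, 6²) can be sketched in Python as below. The 36 observations are copied from the `ks.test` call in these slides; the function names are illustrative, and the normal CDF is written with `math.erf`. The sketch evaluates the empirical CDF just before and just after each observation, which is where the supremum is attained.

```python
from math import erf, sqrt


def norm_cdf(x, mean, sd):
    """CDF of the normal distribution N(mean, sd^2)."""
    return 0.5 * (1 + erf((x - mean) / (sd * sqrt(2))))


def ks_statistic(data, cdf):
    """D = sup_x |F_S(x) - F_T(x)| for a sample and a theoretical CDF."""
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # compare F_T with the empirical CDF after and before the jump at x
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


xx = [75, 92, 80, 80, 83, 72, 83, 77, 81, 77, 75, 81, 80, 92, 72, 77, 78, 76,
      77, 86, 77, 92, 80, 78, 67, 78, 92, 67, 80, 81, 87, 76, 80, 87, 77, 86]
d_stat = ks_statistic(xx, lambda x: norm_cdf(x, 80, 6))
print(round(d_stat, 3), d_stat < 0.221)
```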
> xx <- c(75,92,80,80,83,72,83,77,81,77,75,81,80,92,72,77,78,76,77,86,77,92,80,78,
+         67,78,92,67,80,81,87,76,80,87,77,86)
> ks.test(xx,'pnorm',mean=80,sd=6)
        One-sample Kolmogorov-Smirnov test
data:  xx
D = , p-value =
alternative hypothesis: two-sided
Warning message:
In ks.test(xx, "pnorm", mean = 80, sd = 6) :
  ties should not be present for the Kolmogorov-Smirnov test
-> An approximate p-value is used.
12.8 Kruskal-Wallis one-way ANOVA
Hypotheses
H0: the k samples come from the same distribution.
HA: at least one sample comes from a distribution with a larger or smaller location parameter than the others.
Under H0, the rank sums $R_1, R_2, \ldots, R_k$ of the groups are similar.
The statistic is essentially of the form $\sum_i (R_i - \bar{R})^2$: if the $R_i$ are similar, $\sum_i (R_i - \bar{R})^2$ is small, so H is small and we cannot reject H0.
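Example
$H = \frac{12}{n(n+1)} \sum_{j=1}^{k} \frac{R_j^2}{n_j} - 3(n+1) \sim \chi^2_{k-1}$
Tables: original values of the response for groups A, B, C, and the corresponding ordered values (ranks)
With n = 15: H = 9.14 (the subtracted term is 3(16)), P < 0.009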
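The value H = 9.14 can be reproduced with a short Python sketch. It uses the 15 observations from the `kruskal.test` example in these slides, assigned to three groups cyclically as `rep(1:3, 5)` does; the function names are illustrative and no tie correction is applied (these data contain no ties).

```python
def average_ranks(values):
    """Ranks 1..n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """H = 12/(n(n+1)) * sum(R_j^2 / n_j) - 3(n+1), without tie correction."""
    pooled = [v for g in groups for v in g]
    ranks = average_ranks(pooled)
    n = len(pooled)
    total, start = 0.0, 0
    for g in groups:
        r_j = sum(ranks[start:start + len(g)])   # rank sum of group j
        total += r_j ** 2 / len(g)
        start += len(g)
    return 12 / (n * (n + 1)) * total - 3 * (n + 1)


xx = [12.01, 3.67, 55.63, 29.44, 4.05, 27.88, 28.02, 6.49,
      66.81, 38.33, 21.12, 46.27, 55.91, 1.11, 31.19]
groups = [xx[i::3] for i in range(3)]   # cyclic assignment, like rep(1:3, 5)
h = kruskal_wallis_h(groups)
print(round(h, 2))
```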
> xx <- c(12.01,3.67,55.63,29.44,4.05,27.88,28.02,6.49,66.81,38.33,21.12,46.27,55.91,1.11,31.19)
> dat <- data.frame(val=xx, group=factor(rep(1:3,5)))
> kruskal.test(val~group, data=dat)
        Asymptotic Kruskal-Wallis Test
data:  val by group (1, 2, 3)
chi-squared = 9.14, df = 2, p-value =
Ex: Treatment cost by drug type per bed by hospital type (rank in parentheses)
Drug type:    A            B            C            D            E
              17.38(11)    52.59(35)    27.87(20)    34.55(26)    60.77(40)
              15.20(2)     44.55(28)    24.00(12)    31.15(22)    59.99(38)
              14.76(1)     44.80(29)    26.55(16)    30.50(21)    58.94(37)
              16.88(7)     43.25(27)    25.00(13)    31.25(23)    57.05(36)
              17.02(10)    50.75(32)    27.55(19)    32.75(24)    60.50(39)
              26.67(17)    52.25(34)    25.92(14)    33.00(25)    61.50(41)
              15.75(4)     46.13(30)    26.01(15)    27.30(18)    51.10(33)
              16.02(5)     48.87(31)    16.48(6)
              15.30(3)                  17.00(9)
              16.98(8)
Rank sums:    R_1=68       R_2=246      R_3=124      R_4=159      R_5=264
$H = \frac{12}{41(41+1)} \left[ \frac{68^2}{10} + \frac{246^2}{8} + \frac{124^2}{9} + \frac{159^2}{7} + \frac{264^2}{7} \right] - 3(41+1) = 36.39$
pchisq(36.39, 4, lower.tail=FALSE) ≈ 2.4e-07
> val <- c(17.38,15.20,14.76,16.88,17.02,26.67,15.75,16.02,15.30,16.98,
+          52.59,44.55,44.80,43.25,50.75,52.25,46.13,48.87,
+          27.87,24.00,26.55,25.00,27.55,25.92,26.01,16.48,17.00,
+          34.55,31.15,30.50,31.25,32.75,33.00,27.30,
+          60.77,59.99,58.94,57.05,60.50,61.50,51.10)
> group <- factor(rep(c('a','b','c','d','e'), c(10,8,9,7,7)))
> dat <- data.frame(val, group)
> kruskal.test(val~group, data=dat)
        Kruskal-Wallis rank sum test
data:  val by group
Kruskal-Wallis chi-squared = , df = 4, p-value = 2.401e-07
Ex: Friedman's two-way ANOVA
Physical therapists' rankings of three low-volt electrical stimulators
Table columns: therapist; medical device A, B, C; rank sums R_j
H0: the three medical devices perform equally (the three devices are equivalent).
HA: at least one device performs differently (they are not equivalent).
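$X_r^2 = \frac{12}{nk(k+1)} \sum_{j=1}^{k} R_j^2 - 3n(k+1)$; here n = 9 therapists and k = 3 devices, so $X_r^2 = \frac{12}{9 \cdot 3 \cdot (3+1)} \sum_{j} R_j^2 - 3(9)(3+1)$
[Table B(a)] -> p =
Reject H0 at significance level 0.05.
> val <- c(2,3,1,2,3,1,2,3,1,1,3,2,3,2,1,1,2,3,2,3,1,1,3,2,1,3,2)
> group <- factor(rep(1:3,9))
> id <- factor(rep(1:9,each=3))
> friedman.test(val,group,id)
        Friedman rank sum test
data:  val, group and id
Friedman chi-squared = , df = 2, p-value =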
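The Friedman statistic for this example can be worked out with a short Python sketch. The 27 ranks are copied from the `friedman.test` call in these slides (nine therapists, three devices each, in order A, B, C); the function name is illustrative, and the input is assumed to be already ranked within each block, as it is here.

```python
def friedman_statistic(blocks):
    """X_r^2 = 12/(n k (k+1)) * sum(R_j^2) - 3 n (k+1).

    `blocks` holds one list of within-block ranks per subject.
    """
    n = len(blocks)            # number of blocks (therapists)
    k = len(blocks[0])         # number of treatments (devices)
    rank_sums = [sum(block[j] for block in blocks) for j in range(k)]
    stat = 12 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3 * n * (k + 1)
    return stat, rank_sums


val = [2, 3, 1, 2, 3, 1, 2, 3, 1, 1, 3, 2, 3, 2, 1, 1, 2, 3,
       2, 3, 1, 1, 3, 2, 1, 3, 2]
blocks = [val[i:i + 3] for i in range(0, len(val), 3)]   # one row per therapist
stat, rank_sums = friedman_statistic(blocks)
print(rank_sums, round(stat, 3))
```

Under H0 the statistic is compared with a chi-square distribution with k - 1 = 2 degrees of freedom, or with an exact table for small n.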
12.10 Spearman rank correlation coefficient
Two-sided test
  H0: X and Y are independent. HA: X and Y are not independent.
One-sided tests
  H0: X and Y are independent. HA: X and Y are positively associated.
  H0: X and Y are independent. HA: X and Y are negatively associated.
Ex
Table 1 columns: ID, X, Y
Table 2 columns: ID, rank(X), rank(Y), d_i, d_i^2
$\sum d_i^2 = 246.5$
Steps of the hypothesis test
1. Rank X and Y separately.
2. Compute d_i = rank(x_i) - rank(y_i).
3. Compute $\sum d_i^2$.
$r_s = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} = 0.697$, which exceeds the critical value (Table C).
* Negative association -> large $\sum d_i^2$ -> small r_s
* Positive association -> small $\sum d_i^2$ -> large r_s
r_s is large enough -> reject H0: independence. We conclude that there is a positive association between X and Y.
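The three steps above can be sketched in Python as follows. The function names and the small data set are illustrative assumptions; the formula 1 - 6Σd²/(n(n²-1)) is exact only when there are no ties, and with ties (handled here by average ranks) it is an approximation.

```python
def average_ranks(values):
    """Ranks 1..n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rs(x, y):
    """r_s = 1 - 6 * sum(d_i^2) / (n (n^2 - 1)), d_i = rank(x_i) - rank(y_i)."""
    rx, ry = average_ranks(x), average_ranks(y)      # step 1
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))   # steps 2 and 3
    n = len(x)
    return 1 - 6 * d2 / (n * (n * n - 1))


# Hypothetical data, for illustration only.
x = [10, 20, 30, 40, 50]
y = [1.2, 0.9, 2.5, 3.1, 2.8]
print(spearman_rs(x, y))
```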
Ex (when n > 30)
Table columns: ID, age (X), mineral concentration (Y)
Ex (when n > 30), continued
Table columns: ID, rank(X), rank(Y), d_i, d_i^2
$\sum d_i^2 = $
Reported r_s value: 0.75
$z = r_s \sqrt{n - 1}$
Reject H0 if z is too large or too small:
* small $\sum d_i^2$ -> large r_s -> large positive z (positive association)
* large $\sum d_i^2$ -> small (negative) r_s -> large negative z (negative association)
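For n > 30 the normal approximation z = r_s·sqrt(n - 1) is easy to evaluate in Python. The values of r_s and n in the usage lines are made-up inputs, not the slide's example, and the function names are illustrative; the two-sided p-value uses the standard normal tail written with `math.erfc`.

```python
from math import erfc, sqrt


def spearman_z(r_s, n):
    """Large-sample test statistic z = r_s * sqrt(n - 1)."""
    return r_s * sqrt(n - 1)


def two_sided_p(z):
    """P(|Z| >= |z|) for a standard normal Z."""
    return erfc(abs(z) / sqrt(2))


# Hypothetical inputs, for illustration only.
z = spearman_z(0.5, 37)
print(z, round(two_sided_p(z), 4))
```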
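12.11 Nonparametric regression
Ex [Theil's method]; data: testosterone (Y), citric acid (X)
Slope: $\hat\beta_1 = \mathrm{median}\{S_{12}, \ldots, S_{n-1,n}\}$, where $S_{ij} = (y_j - y_i)/(x_j - x_i)$ for i < j
Estimating the intercept, two options:
* $\hat\beta_0 = \mathrm{median}\{y_1 - \hat\beta_1 x_1, \ldots, y_n - \hat\beta_1 x_n\}$
* $\hat\beta_0 = \mathrm{median}$ of the pairwise means of the terms $y_1 - \hat\beta_1 x_1, y_2 - \hat\beta_1 x_2, \ldots, y_n - \hat\beta_1 x_n$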
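Theil's estimator can be sketched in Python as below, using the first intercept option (the median of y_i - β̂1·x_i). The function name `theil_fit` and the data are illustrative assumptions; pairs with equal x are skipped because their slope is undefined.

```python
from itertools import combinations
from statistics import median


def theil_fit(x, y):
    """Return (intercept, slope) from Theil's median-of-slopes method."""
    slopes = [
        (y[j] - y[i]) / (x[j] - x[i])            # S_ij for i < j
        for i, j in combinations(range(len(x)), 2)
        if x[j] != x[i]
    ]
    b1 = median(slopes)
    b0 = median(yi - b1 * xi for xi, yi in zip(x, y))
    return b0, b1


# Hypothetical data: points on y = 1 + 2x, with the last y replaced by an outlier.
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 100]
b0, b1 = theil_fit(x, y)
print(b0, b1)
```

Because the slope and intercept are both medians, a single extreme point leaves the fitted line unchanged in this example, whereas a least-squares line would be pulled strongly toward it.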
/* File name : Nonparametric One-Way Anova */
options pageno=1 nodate ls=130 ps=60 nocenter;
filename inbrakes 'd:\myweb\intro\taillite.dat';
data one;
  infile inbrakes;
  input id vehtype group positn speedzn resptime follotme folltmec;
  if group=1;
  label vehtype='vehicle Type'
        group='group - Light On=1 Light Off=2'
        positn='light Position'
        speedzn='speed Zone'
        resptime='response Time'
        follotme='following Time in Video Frames'
        folltmec='following Time in Categories';
run;
proc sort; by vehtype;
/* Let's do one-way ANOVA to see the effect of vehicle type */
proc anova;
  class vehtype;
  model resptime=vehtype;
  title 'Parametric ANOVA analysis';
run;
/* What's wrong with this? We didn't check the normality assumption.
   Let's do proc univariate to check the normality */
proc univariate normal plot;
  var resptime;
  by vehtype;
  title 'Normality Check';
run;
/* NOT NORMALLY DISTRIBUTED >> NONPARAMETRIC ANOVA */
proc npar1way wilcoxon;
  class vehtype;
  var resptime;
  title 'Nonpara One-Way ANOVA for Tail Light Study';
run;
/* The other way is transformation. Let's take the log transformation
   so that we have a normal distribution. */
data t;
  set one;
  t=log(resptime);
  label t='ln (response time)';
run;
proc sort; by vehtype;
proc univariate normal plot;
  var t;
  by vehtype;
  title 'Normality Check for transformed variable';
run;
/* The transformed variable seems to be normally distributed.
   Then we can do parametric ANOVA with the normality assumption. */
proc anova;
  class vehtype;
  model t=vehtype;
  title 'ANOVA for the log transformed response time';
run;
Nonparametric Smoothing (1)
Smoothing
* Consider an X-Y plot.
* Draw a regression line that requires no parametric assumptions.
* The regression line is not linear.
* The regression line depends entirely on the data.
Two components of smoothing
* Kernel function: how the weighted mean is calculated
* Bandwidth: the width of the window (span); it determines the smoothness of the regression line; wider -> smoother
Nonparametric Smoothing (2) Uniform Kernel
Nonparametric Smoothing (3) Triangular Kernel
Nonparametric Smoothing (4) Normal Kernel
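The roles of the kernel function and the bandwidth can be illustrated with a kernel-weighted mean in Python. This is a minimal sketch of one common form (a Nadaraya-Watson style smoother); the slides do not state which exact estimator produced their figures, and the function name, kernel labels, and data are illustrative assumptions.

```python
from math import exp


def kernel_weight(u, kernel):
    """Weight for scaled distance u = (x_i - x0) / bandwidth."""
    if kernel == "uniform":
        return 1.0 if abs(u) <= 1 else 0.0
    if kernel == "triangular":
        return max(0.0, 1.0 - abs(u))
    if kernel == "normal":
        return exp(-0.5 * u * u)
    raise ValueError(f"unknown kernel: {kernel}")


def kernel_smooth(x, y, x0, bandwidth, kernel="normal"):
    """Kernel-weighted mean of y around x0."""
    weights = [kernel_weight((xi - x0) / bandwidth, kernel) for xi in x]
    total = sum(weights)
    if total == 0:
        raise ValueError("no observations inside the window")
    return sum(w * yi for w, yi in zip(weights, y)) / total


# Hypothetical data, for illustration only.
x = [0, 1, 2, 3, 4, 5]
y = [0.1, 0.9, 2.2, 2.8, 4.1, 5.0]
for kern in ("uniform", "triangular", "normal"):
    print(kern, round(kernel_smooth(x, y, 2.5, bandwidth=1.5, kernel=kern), 3))
```

Evaluating `kernel_smooth` over a grid of x0 values traces out the smoothed curve; a larger `bandwidth` averages over more neighbours and gives a smoother line.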
Nonparametric Smoothing (5) Default Lowess line: Span=0.5
Nonparametric Smoothing (6) Lowess line: Span=0.2
Nonparametric Smoothing (7) Lowess line: Span=0.1
data A;
  input x y;
  datalines;
;
title "sm45 spline smoother";
proc gplot data=a;
  plot y*x;
  symbol1 interpol=sm45 value=circle height=2;   /* note that x is sorted */
run;
title "sm70 spline smoother";
proc gplot data=a;
  plot y*x;
  symbol1 interpol=sm70 value=circle height=2;   /* note that x is sorted */
run;
title "sm20 spline smoother";
proc gplot data=a;
  plot y*x;
  symbol1 interpol=sm20 value=circle height=2;   /* note that x is sorted */
run;
require(graphics)
plot(cars, main = "lowess(cars)")
lines(lowess(cars), col = 2)
lines(lowess(cars, f = .2), col = 3)
legend(5, 120, c(paste("f = ", c("2/3", ".2"))), lty = 1, col = 2:3)
data <- read.csv(" ", sep=",")
head(data)
data$date <- as.Date(data$date)
sl <- subset(data, ccode==11)
boxplot(meanpm10~yy, ylab=expression(pm[10]), axes=TRUE, data=sl)
plot(sl$date, sl$meanpm10, ylab=expression(pm[10]), xaxt='n', cex=0.6)
# xat is a placeholder name; the original identifier and the date strings were not preserved
xat <- seq(as.Date(" "), as.Date(" "), "year")
xname <- c("' ","' ", "' ", "' ", "' ", "' ", "' ", "' ")
axis(side=1, at=xat, labels=xname)
table(is.na(sl$meanpm10))
which(is.na(sl$meanpm10))
# replace the missing value in row 829 by the mean of its two neighbours
sl[829,"meanpm10"] <- (sl[828,"meanpm10"] + sl[830,"meanpm10"])/2
sl[829,"meanpm10"]
par(mfrow=c(3,1))
plot(sl$date, sl$meanpm10, ylab=expression(pm[10]), xlab="date", main="(a)f=.1", xaxt='n', cex=0.6)
lines(lowess(sl$date, sl$meanpm10, f=0.1), col="red", lwd=2)
axis(side=1, at=xat, labels=xname)
plot(sl$date, sl$meanpm10, ylab=expression(pm[10]), xlab="date", main="(b)f=.05", xaxt='n', cex=0.6)
lines(lowess(sl$date, sl$meanpm10, f=0.05), col="red", lwd=2)
axis(side=1, at=xat, labels=xname)
plot(sl$date, sl$meanpm10, ylab=expression(pm[10]), xlab="date", main="(c)f=.5", xaxt='n', cex=0.6)
lines(lowess(sl$date, sl$meanpm10, f=0.5), col="red", lwd=2)
axis(side=1, at=xat, labels=xname)