
6. Statistical Estimation (updated: 2017/4/10)

6.1 Introduction
Statistical inference: the process of drawing conclusions about a population from the results obtained in a sample taken from that population (to say something about the population based on the information in the sample).
1) Estimation
2) Hypothesis testing

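Estimate
1) Point estimate
2) Interval estimate
Estimator: the rule (a function of the data) that produces the estimate. For example, $\bar{x} = \sum x_i / n$ is an estimator of $\mu$.
Unbiasedness: $\hat{\theta}$ (a random variable based on the data) is an unbiased estimator of the parameter $\theta$ if $E(\hat{\theta}) = \theta$.
ex. $E(\bar{X}) = \mu$, so the sample mean is an unbiased estimator (UE) of the population mean if the samples are randomly selected from $N(\mu, \sigma^2)$.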

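* Example) The sample variance $S^2$ is an unbiased estimator of $\sigma^2$, so $E(S^2) = \sigma^2$. In contrast, $E\left(\frac{1}{n}\sum (y_i - \bar{y})^2\right) \neq \sigma^2$: dividing by $n$ does not give an unbiased estimator.
Bias $= E(\hat{\theta}) - \theta$. The bias of an unbiased estimator is zero.
Probability sampling and non-probability sampling
Randomization
Blinding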
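The unbiasedness claim can be checked exactly on a tiny example. The sketch below is an illustration only: the four-value population and the sample size n = 2 are hypothetical choices, not taken from the notes. It lists every ordered sample drawn with replacement and averages the two variance estimators.

```r
# Hypothetical toy population (not from the notes)
population <- c(1, 2, 3, 4)
n <- 2
sigma2 <- mean((population - mean(population))^2)  # population variance (divisor N)

# Every ordered sample of size n drawn with replacement (16 rows)
samples <- expand.grid(rep(list(population), n))

s2_unbiased <- apply(samples, 1, var)                                # divisor n - 1
s2_biased   <- apply(samples, 1, function(y) mean((y - mean(y))^2))  # divisor n

expected_unbiased <- mean(s2_unbiased)  # E(S^2) over all equally likely samples
expected_biased   <- mean(s2_biased)

c(sigma2 = sigma2, E_S2 = expected_unbiased, E_div_n = expected_biased)
```

Averaged over all samples, the n - 1 divisor recovers the population variance, while the n divisor falls short by the factor (n - 1)/n.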

Sampled population and target population
Random sample and non-random sample
Convenience sample

6.2 Confidence interval of the population mean
estimator ± reliability coefficient × standard error
$\bar{x} \pm z_{1-\alpha/2}\,\sigma_{\bar{x}}$, where $P(Z < z_{1-\alpha/2}) = 1 - \alpha/2$
If we select samples repeatedly from a normal population, $100(1-\alpha)\%$ of the intervals $\bar{x} \pm z_{1-\alpha/2}\,\sigma_{\bar{x}}$ will include $\mu$.
$1-\alpha$: confidence level (ex. .95)
$\alpha$: significance level (ex. .05)
Precision, margin of error: reliability coefficient × standard error

95% confidence interval of $\mu$

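<Ex 6.2.1> A researcher wants to estimate the mean level of a certain enzyme in a population. A sample of n = 10 individuals is drawn and the enzyme level is measured, giving a sample mean of $\bar{x} = 22$. Assuming the enzyme level is normally distributed with population variance 45, find the 95% C.I. of $\mu$.
$\bar{x} \pm 1.96\,\sigma_{\bar{x}} = 22 \pm 1.96\sqrt{45/10} = 22 \pm 1.96(2.1213) \Rightarrow (17.84, 26.16)$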

<Ex 6.2.2> A physical therapist wants to estimate, with 99% confidence, the mean maximal strength of a particular muscle in a group of individuals. Strength scores are assumed to be normally distributed with population variance 144. A sample of n = 15 subjects has a mean strength score of 84.3.
The reliability coefficient for confidence level 0.99 is 2.58, from the R function qnorm(0.995).
The standard error is $\sigma_{\bar{x}} = 12/\sqrt{15} = 3.0984$.
The 99% confidence interval of $\mu$ is $84.3 \pm 2.58(3.0984) = 84.3 \pm 8.0 \Rightarrow (76.3, 92.3)$.

Sampling from a non-normal population: central limit theorem
<Ex 6.2.3> The delay caused by patients arriving late at a clinic was recorded for n = 35 patients. The sample mean is 17.2 min, and the population standard deviation (from a previous study, assumed known) is 8 min. The population is not assumed to be normally distributed. What is the 90% CI of $\mu$?
Sol) The sample size is large enough (n = 35 > 30), so the CLT applies. qnorm(0.95) = 1.645 and $\sigma_{\bar{x}} = 8/\sqrt{35} = 1.35$, so
$17.2 \pm 1.645(1.35) = 17.2 \pm 2.2 \Rightarrow (15.0, 19.4)$
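The three known-variance examples above use the same formula, so they can be checked with one small helper. This is a minimal sketch: `z_interval` is a hypothetical name introduced here, and the sample means 22 and 17.2 are the values implied by the intervals quoted in Ex 6.2.1 and Ex 6.2.3.

```r
# z-based CI for a mean when the population sd is treated as known
# (z_interval is an illustrative helper, not part of the course code)
z_interval <- function(xbar, sigma, n, conf = 0.95) {
  alpha  <- 1 - conf
  margin <- qnorm(1 - alpha / 2) * sigma / sqrt(n)  # reliability coefficient x SE
  c(lower = xbar - margin, upper = xbar + margin)
}

ex1 <- z_interval(22,   sqrt(45),  10, conf = 0.95)  # Ex 6.2.1
ex2 <- z_interval(84.3, sqrt(144), 15, conf = 0.99)  # Ex 6.2.2
ex3 <- z_interval(17.2, 8,         35, conf = 0.90)  # Ex 6.2.3 (CLT)
round(rbind(ex1, ex2, ex3), 2)
```

Because `qnorm()` is used without rounding, the limits can differ from the hand calculations in the second decimal place.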

* <Ex 6.2.4> When the population variance is not known. Find the 95% confidence interval of the variable bweight in the birth.csv data.
Sol)
# read the data (adjust the path to your own machine)
birth <- read.csv("C:\\Users\\owner\\Desktop\\보건통계학개론\\birth.csv", header=TRUE)
head(birth)
x <- birth$bweight
n <- length(x)
# lower and upper limits, with the sample sd in place of sigma
mean(x) - qnorm(0.975)*sd(x)/sqrt(length(x))
mean(x) + qnorm(0.975)*sd(x)/sqrt(length(x))
summary(x)
sd(x)

* CI calculation using R
m <- mean(x) ; m
s <- sqrt(var(x)) ; s
n <- length(x) ; alpha <- 0.05
error <- qnorm(1-alpha/2)*s/sqrt(n)   # margin of error
left <- m-error
right <- m+error
left; right
# the same steps as a function (note: this name masks stats::confint)
confint <- function(m,s,n,alpha=0.05){
  error <- qnorm(1-alpha/2)*s/sqrt(n)
  left <- m-error
  right <- m+error
  print(c(left,right))
}
confint(m,s,n)
confint(0.7164,sqrt(0.36),35)

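[Estimation using the median] The sample median can be used to estimate the population mean. It gives robust results, so it is preferred when the data contain many errors.
[Trimmed mean] The mean computed after removing the largest and smallest observations. It also gives robust results.
> x<-c(1:9,100)
> mean(x);?mean
[1] 14.5
> mean(1:10, trim=0.1)
[1] 5.5
> mean(2:9)
[1] 5.5
> mean(x, trim=0.5);median(x)
[1] 5.5
[1] 5.5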
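To make explicit what the `trim` argument does, the sketch below recomputes the trimmed mean by hand. `trimmed_mean` is a hypothetical helper written for illustration. It follows the documented behaviour of `mean(x, trim)`: a fraction `trim` of the observations is dropped from each end, and a value of 0.5 or more returns the median.

```r
# Hand-rolled trimmed mean (illustrative helper, not a replacement for mean())
trimmed_mean <- function(x, trim = 0) {
  if (trim >= 0.5) return(median(x))
  x <- sort(x)
  k <- floor(length(x) * trim)       # observations dropped from each end
  mean(x[(k + 1):(length(x) - k)])
}

x <- c(1:9, 100)
c(plain = mean(x), trimmed = trimmed_mean(x, 0.1), median = trimmed_mean(x, 0.5))
```

With the outlier 100 in the data, the plain mean is pulled upward, while the 10% trimmed mean and the median agree.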

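6.3 t-distribution
When the population variance is known and the sample size is sufficiently large, the standard normal distribution is applied to $Z = \dfrac{\bar{X} - \mu}{\sigma/\sqrt{n}}$. (Pop variance is known and n is large -> use Z)
When the population variance is unknown but the sample size is sufficiently large, $s = \sqrt{\sum (x_i - \bar{x})^2/(n-1)}$ is used in place of $\sigma$. (Pop variance is not known and n is large: use s instead of $\sigma$)
When the sample size is small (n < 30): Student's t-distribution, derived by Gosset,
$t = \dfrac{\bar{X} - \mu}{S/\sqrt{n}} \sim t_{n-1}$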

Some properties of the t-distribution
1) The mean is 0.
2) The probability density function is symmetric about the mean.
3) $-\infty < t < \infty$
4) [illegible in source]
5) Variance = df/(df-2) for df > 2, -> 1 as n -> $\infty$
6) The tails of the t-distribution are thicker than those of the normal distribution.
7) The t-distribution approaches the normal distribution as df increases.

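[Figure: comparison of the normal distribution and the t-distribution]
[Figure: shape of the t-distribution by degrees of freedom]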

[Confidence interval] CI: $\bar{x} \pm t_{df=n-1,\,1-\alpha/2}\; s/\sqrt{n}$
<Ex 6.3.1> A sample of n = 19 physical strength measurements has mean 250.8 and standard deviation 130.9. The population variance is not known. Assuming the population is normally distributed, find the 95% CI of the population mean.
$\bar{x} = 250.8$, $s/\sqrt{n} = 130.9/\sqrt{19} = 30.0305$, df = n - 1 = 18.
qt(0.975,df=18) gives $t_{df=18,\,0.975} = 2.1009$, so
$250.8 \pm 2.1009(30.0305) = 250.8 \pm 63.1 \Rightarrow (187.7, 313.9)$
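The interval in Ex 6.3.1 can be reproduced from the summary statistics alone. The following is a minimal sketch: `t_interval` is a hypothetical helper name, and the sample mean 250.8 is the value implied by the interval (187.7, 313.9).

```r
# t-based CI for a mean from summary statistics (illustrative helper)
t_interval <- function(xbar, s, n, conf = 0.95) {
  t_crit <- qt(1 - (1 - conf) / 2, df = n - 1)  # reliability coefficient
  margin <- t_crit * s / sqrt(n)
  c(lower = xbar - margin, upper = xbar + margin)
}

ci <- t_interval(xbar = 250.8, s = 130.9, n = 19)
round(ci, 1)
```

When the raw data are available, `t.test(x)$conf.int` gives the same kind of interval directly.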

Choice of z and t
Non-parametric methods

* R code: comparison of t-distributions with the standard normal
x <- seq(-4, 4, length=100)
hx <- dnorm(x)                      # standard normal density
degf <- c(1, 3, 8, 30)              # degrees of freedom to compare
colors <- c("red", "blue", "darkgreen", "gold", "black")
labels <- c("df=1", "df=3", "df=8", "df=30", "normal")
plot(x, hx, type="l", lty=2, xlab="x value", ylab="Density", main="Comparison of t Distributions")
for (i in 1:4){
  lines(x, dt(x,degf[i]), lwd=2, col=colors[i])
}
legend("topright", inset=.05, title="Distributions", labels, lwd=2, lty=c(1, 1, 1, 1, 2), col=colors)

6.4 Confidence interval of the difference of two means
Samples from normal populations:
$(\bar{x}_1 - \bar{x}_2) \pm z_{1-\alpha/2}\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}$

<Ex 6.4.1> At a large hospital, the mean serum uric acid level computed from 12 patients with Down syndrome is $\bar{x}_1 = 4.5$ mg/100ml, and the mean for 15 normal controls of the same age and sex is $\bar{x}_2 = 3.4$ mg/100ml. If the two populations are normally distributed with variances 1 and 1.5 (patient and control group, respectively), find the 95% CI for $\mu_1 - \mu_2$.
Sol) $\sigma_{\bar{x}_1 - \bar{x}_2} = \sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}} = \sqrt{\dfrac{1}{12} + \dfrac{1.5}{15}} = 0.428$
$1.1 \pm 1.96(0.428) = 1.1 \pm 0.84 \Rightarrow (0.26, 1.94)$
The CI does not include 0.
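The same arithmetic can be written as a short function. This is a sketch under the assumptions of Ex 6.4.1 (known variances, normal populations): `two_mean_z_interval` is a hypothetical name, and n1 = 12 is the sample size consistent with the standard error 0.428.

```r
# CI for mu1 - mu2 with known population variances (illustrative helper)
two_mean_z_interval <- function(x1, x2, var1, var2, n1, n2, conf = 0.95) {
  se     <- sqrt(var1 / n1 + var2 / n2)         # standard error of the difference
  margin <- qnorm(1 - (1 - conf) / 2) * se
  c(lower = (x1 - x2) - margin, upper = (x1 - x2) + margin, se = se)
}

ci <- two_mean_z_interval(x1 = 4.5, x2 = 3.4, var1 = 1, var2 = 1.5, n1 = 12, n2 = 15)
round(ci, 3)
```

For large samples from non-normal populations, the same function can be called with the sample variances in place of `var1` and `var2`, relying on the central limit theorem.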

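[Confidence interval when the populations are not normally distributed] Samples from non-normal populations -> central limit theorem
<Ex 6.4.2> To compare the number of cigarettes smoked by pregnant women in two groups.
A: $n_1 = 328$, $\bar{x}_1 = 5.2$, $s_1 = 6.33$; B: $n_2 = 64$, $\bar{x}_2 = 15$, $s_2 = 7.16$. 99% CI of $\mu_1 - \mu_2$?
The sample sizes are large enough, so
$s_{\bar{x}_1 - \bar{x}_2} = \sqrt{\dfrac{6.33^2}{328} + \dfrac{7.16^2}{64}} = 0.96$
$-9.8 \pm 2.58(0.96) \Rightarrow (-12.28, -7.32)$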

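Using the t-distribution for the difference of the means
1) When the population variances are the same
2) When the population variances are different
- When the variances are the same: use the pooled estimate, a weighted average of the two sample variances.
$s_p^2 = \dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$, $\quad s_{\bar{x}_1 - \bar{x}_2} = \sqrt{\dfrac{s_p^2}{n_1} + \dfrac{s_p^2}{n_2}}$
$100(1-\alpha)\%$ CI for $\mu_1 - \mu_2$:
$(\bar{x}_1 - \bar{x}_2) \pm t_{n_1 + n_2 - 2,\,1-\alpha/2}\sqrt{\dfrac{s_p^2}{n_1} + \dfrac{s_p^2}{n_2}}$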

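- When the variances are different:
$\dfrac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}$ does not follow a t-distribution!
Use the weighted reliability coefficient
$t'_{1-\alpha/2} = \dfrac{w_1 t_1 + w_2 t_2}{w_1 + w_2}$, where $w_1 = s_1^2/n_1$, $w_2 = s_2^2/n_2$, $t_1 = t_{1-\alpha/2}$ with $df = n_1 - 1$, and $t_2 = t_{1-\alpha/2}$ with $df = n_2 - 1$.
CI: $(\bar{x}_1 - \bar{x}_2) \pm t'_{1-\alpha/2}\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}$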

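<Ex 6.4.3> The mean length of treatment for 18 patients with schizophrenia is 4.7 days with a standard deviation of 9.3. For 10 patients with bipolar disorder the mean is 8.8 days with a standard deviation of 11.5. Use the two samples to find the 95% CI of the difference of the two population means.

        n    mean   sd
Dx A    18   4.7    9.3
Dx B    10   8.8    11.5

If we assume that the variances are the same, the pooled estimate of the variance is
$s_p^2 = \dfrac{(18-1)(9.3^2) + (10-1)(11.5^2)}{18 + 10 - 2} = 102.33$
and the confidence interval is
$(4.7 - 8.8) \pm 2.0555\sqrt{\dfrac{102.33}{18} + \dfrac{102.33}{10}} = -4.1 \pm 8.20 \Rightarrow (-12.3, 4.1)$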

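<Ex 6.4.4> If we assume that the variances are different (heterogeneous variances): $t_{17} = 2.1098$, $t_9 = 2.2622$
$t' = \dfrac{(9.3^2/18)(2.1098) + (11.5^2/10)(2.2622)}{9.3^2/18 + 11.5^2/10} = 2.2216$
$(4.7 - 8.8) \pm 2.2216\sqrt{\dfrac{9.3^2}{18} + \dfrac{11.5^2}{10}} = (4.7 - 8.8) \pm 2.2216(4.246175) \Rightarrow (-13.5, 5.3)$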
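Both versions of the treatment-length example can be recomputed from the summary statistics. The sketch below is illustrative: `pooled_interval` and `cochran_interval` are hypothetical helper names. The second function implements the weighted reliability coefficient t' described above, which is not the Welch/Satterthwaite method used by `t.test()`.

```r
# Equal-variance (pooled) interval from summary statistics
pooled_interval <- function(x1, s1, n1, x2, s2, n2, conf = 0.95) {
  sp2    <- ((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2)
  se     <- sqrt(sp2 / n1 + sp2 / n2)
  t_crit <- qt(1 - (1 - conf) / 2, df = n1 + n2 - 2)
  c(sp2 = sp2, lower = (x1 - x2) - t_crit * se, upper = (x1 - x2) + t_crit * se)
}

# Unequal-variance interval with the weighted coefficient t'
cochran_interval <- function(x1, s1, n1, x2, s2, n2, conf = 0.95) {
  w1 <- s1^2 / n1
  w2 <- s2^2 / n2
  p  <- 1 - (1 - conf) / 2
  t_prime <- (w1 * qt(p, n1 - 1) + w2 * qt(p, n2 - 1)) / (w1 + w2)
  se <- sqrt(w1 + w2)
  c(t_prime = t_prime, lower = (x1 - x2) - t_prime * se, upper = (x1 - x2) + t_prime * se)
}

equal_var   <- pooled_interval(4.7, 9.3, 18, 8.8, 11.5, 10)
unequal_var <- cochran_interval(4.7, 9.3, 18, 8.8, 11.5, 10)
round(equal_var, 2)
round(unequal_var, 2)
```

The unequal-variance interval is wider here because the smaller sample has the larger standard deviation.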

1-1, 1, Homework

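6.5 Confidence interval of a population proportion
$100(1-\alpha)\%$ confidence interval for $p$:
$\hat{p} \pm z_{1-\alpha/2}\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}$
<Ex 6.5.1> Of n = 1,000 users of pharmaceutical products, 20% use the internet to search for information. What is the 95% confidence interval of the population proportion?
$\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}} = \sqrt{\dfrac{(0.20)(0.80)}{1000}} = 0.013$
$0.20 \pm 1.96(0.013) = 0.20 \pm 0.025 \Rightarrow (0.175, 0.225)$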
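The interval above is the normal-approximation (Wald) interval. A minimal sketch follows, with `proportion_interval` as a hypothetical helper name.

```r
# Normal-approximation CI for a single proportion (illustrative helper)
proportion_interval <- function(p_hat, n, conf = 0.95) {
  se     <- sqrt(p_hat * (1 - p_hat) / n)
  margin <- qnorm(1 - (1 - conf) / 2) * se
  c(lower = p_hat - margin, upper = p_hat + margin, se = se)
}

ci <- proportion_interval(p_hat = 0.20, n = 1000)
round(ci, 3)
```

R's built-in `prop.test()` uses a different interval (Wilson score, with continuity correction by default), so its limits will not match this hand formula exactly.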

6.6 Confidence interval of the difference of two proportions
$100(1-\alpha)\%$ CI for $p_1 - p_2$:
$(\hat{p}_1 - \hat{p}_2) \pm z_{1-\alpha/2}\sqrt{\dfrac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \dfrac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$

<Ex 6.6.1> In a probability sample of 388 children and adolescents (73 females and 315 males), 21 females and 45 males reported having had suicidal thoughts. Find the 99% confidence interval for the difference between the two population proportions.
$\hat{p}_F = 21/73 = 0.2877$, $\hat{p}_M = 45/315 = 0.1429$
$\hat{p}_F - \hat{p}_M = 0.2877 - 0.1429 = 0.1448$
$\hat{\sigma}_{\hat{p}_F - \hat{p}_M} = \sqrt{\dfrac{(0.2877)(0.7123)}{73} + \dfrac{(0.1429)(0.8571)}{315}} = 0.0565$
qnorm(0.995) = 2.58
$0.1448 \pm 2.58(0.0565) \Rightarrow (-0.0010, 0.2906)$
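The two-proportion interval can be sketched the same way. `two_proportion_interval` is a hypothetical helper, and the counts 21 of 73 and 45 of 315 are the values consistent with the proportions 0.2877 and 0.1429 in Ex 6.6.1.

```r
# Normal-approximation CI for p1 - p2 (illustrative helper)
two_proportion_interval <- function(x1, n1, x2, n2, conf = 0.95) {
  p1 <- x1 / n1
  p2 <- x2 / n2
  se <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
  margin <- qnorm(1 - (1 - conf) / 2) * se
  c(diff = p1 - p2, se = se, lower = (p1 - p2) - margin, upper = (p1 - p2) + margin)
}

ci <- two_proportion_interval(x1 = 21, n1 = 73, x2 = 45, n2 = 315, conf = 0.99)
round(ci, 4)
```

The lower limit is just below zero, so at the 99% level the interval does not exclude a difference of 0. The limits differ slightly from the hand calculation because `qnorm(0.995)` is not rounded to 2.58.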

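6.7 Sample size determination: inference on the mean
When the sample size is large or sampling is with replacement:
$d = z_{1-\alpha/2}\,\dfrac{\sigma}{\sqrt{n}}$: half-width of the confidence interval (width of CI / 2)
$n = \dfrac{z_{1-\alpha/2}^2\,\sigma^2}{d^2}$
When the sample size is small and sampling is without replacement:
$d = z_{1-\alpha/2}\,\dfrac{\sigma}{\sqrt{n}}\sqrt{\dfrac{N-n}{N-1}}$
$n = \dfrac{N z_{1-\alpha/2}^2\,\sigma^2}{d^2(N-1) + z_{1-\alpha/2}^2\,\sigma^2}$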

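[Estimation of the variance]
1. Draw a pilot sample from the population and use its sample variance to compute the required sample size. The pilot sample can be analyzed together with the sample drawn later, so the additional sample size needed is (computed sample size) - (pilot sample size).
2. Use an estimate of $\sigma$ from previous or similar studies.
3. When the population is normally distributed, the range is approximately 6 times the standard deviation ($\sigma \approx R/6$). So if the minimum and maximum of the population are known, an estimate of the standard deviation can be obtained.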

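<Ex 6.7.1> The width of the confidence interval is 20 (±10), the confidence level is 0.95, and the population standard deviation is 15. Explain how to find the sample size. The population is very large, so we can ignore the finite population correction factor.
$z_{1-\alpha/2} = 1.96$, $\sigma = 15$, $d = 10$
$n = \dfrac{1.96^2 \cdot 15^2}{10^2} = 8.6436$ -> 9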
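Both sample-size formulas for the mean fit in one function. This is a sketch: `sample_size_mean` is a hypothetical helper, and the population size N = 1000 in the second call is an invented value used only to show the finite population version.

```r
# Required n for estimating a mean to within +/- d (illustrative helper)
sample_size_mean <- function(sigma, d, conf = 0.95, N = Inf) {
  z <- qnorm(1 - (1 - conf) / 2)
  if (is.infinite(N)) {
    n <- (z * sigma / d)^2                                      # large population
  } else {
    n <- N * z^2 * sigma^2 / (d^2 * (N - 1) + z^2 * sigma^2)    # finite population
  }
  ceiling(n)  # always round up
}

n_infinite <- sample_size_mean(sigma = 15, d = 10)            # Ex 6.7.1
n_finite   <- sample_size_mean(sigma = 15, d = 10, N = 1000)  # hypothetical N
c(n_infinite, n_finite)
```

The result is rounded up rather than to the nearest integer, because rounding down would give an interval wider than requested.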

6.8 Sample size determination: inference on the proportion
Infinite population: $n = \dfrac{z_{1-\alpha/2}^2\,pq}{d^2}$
Finite population: $n = \dfrac{N z_{1-\alpha/2}^2\,pq}{d^2(N-1) + z_{1-\alpha/2}^2\,pq}$
If the population is large enough ($n/N \le .05$), an infinite population can be assumed.

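[Estimating the population proportion]
A point estimate computed from a pilot sample can be used as the population proportion.
An estimate of $p$ from previous or similar studies can be used.
Suppose every value except $p$ is fixed. The sample size is largest when $p = 0.5$. So if nothing at all is known about $p$, the sample size can be computed with 0.5. Keep in mind that this may make the sample larger than necessary and so raises the cost of the study. (n is maximized when p = .5. You may assume p = 0.5 if you have no idea.)
Suppose the range of $p$ is known. Then compute the sample size with the value of $p$ in that range that maximizes the sample size. For equations (6.8.1) and (6.8.2) this is the value closest to 0.5. For example, suppose we are estimating the proportion of women who have experienced gender discrimination. If it is known that this proportion cannot exceed 0.40, use 0.40 as the value of $p$. (If you know the range of p, choose the p closest to .5.)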

<Ex 6.8.1> We want to estimate the proportion of people in a city who live in an apartment. The proportion is known to be less than 0.45. Find the sample size needed for a 95% confidence interval whose width is less than 0.10 (d = 0.05).
$n = \dfrac{z_{1-\alpha/2}^2\,pq}{d^2} = \dfrac{1.96^2 (0.45)(0.55)}{0.05^2} = 380.3184$ -> 381
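The proportion version of the sample-size calculation can be sketched as follows. `sample_size_proportion` is a hypothetical helper, and N = 5000 is an invented population size used only to exercise the finite population formula.

```r
# Required n for estimating a proportion to within +/- d (illustrative helper)
sample_size_proportion <- function(p, d, conf = 0.95, N = Inf) {
  z  <- qnorm(1 - (1 - conf) / 2)
  pq <- p * (1 - p)
  if (is.infinite(N)) {
    n <- z^2 * pq / d^2
  } else {
    n <- N * z^2 * pq / (d^2 * (N - 1) + z^2 * pq)
  }
  ceiling(n)
}

n_known_range <- sample_size_proportion(p = 0.45, d = 0.05)            # Ex 6.8.1
n_worst_case  <- sample_size_proportion(p = 0.50, d = 0.05)            # nothing known about p
n_finite      <- sample_size_proportion(p = 0.45, d = 0.05, N = 5000)  # hypothetical N
c(n_known_range, n_worst_case, n_finite)
```

Comparing the first two results shows how much knowing that p < 0.45 saves compared with the conservative choice p = 0.5.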

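6.9 Confidence interval of the variance of a normally distributed population
Point estimator of the variance. Is it a good estimator? Check unbiasedness.
If each sample comes from a normal distribution:
$E\left(\sum_{i=1}^{n} (x_i - \bar{x})^2\right) = (n-1)\sigma^2$
$E\left(\dfrac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2\right) = \sigma^2$
$E(S^2) = \sigma^2$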

* In general there are many ways to estimate a parameter $\theta$. The ideal method makes the bias 0 (an unbiased estimator) while minimizing the variance. Such an estimator is called the Uniformly Minimum Variance Unbiased Estimator (UMVUE): among all unbiased estimators, it has the smallest variance.
Under the normal assumption, the sample mean is the UMVUE of the population mean.

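Chi-square distribution
$\dfrac{(n-1)S^2}{\sigma^2} = \dfrac{n-1}{\sigma^2}\cdot\dfrac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \dfrac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{\sigma^2} \sim \chi^2\ (df = n-1)$
[Figure: probability density function of the chi-square distribution]
[Figure: quantiles $\chi^2_{df,\alpha}$ of the chi-square distribution]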

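$100(1-\alpha)\%$ CI of $\sigma^2$?
$\chi^2_{\alpha/2} < \dfrac{(n-1)s^2}{\sigma^2} < \chi^2_{1-\alpha/2}$
$100(1-\alpha)\%$ CI of $\sigma^2$: $\dfrac{(n-1)s^2}{\chi^2_{1-\alpha/2}} < \sigma^2 < \dfrac{(n-1)s^2}{\chi^2_{\alpha/2}}$
$100(1-\alpha)\%$ CI of $\sigma$: $\sqrt{\dfrac{(n-1)s^2}{\chi^2_{1-\alpha/2}}} < \sigma < \sqrt{\dfrac{(n-1)s^2}{\chi^2_{\alpha/2}}}$
<Ex 6.9.1> The following are fasting blood glucose measurements for 10 patients with type 2 diabetes.
150.3, 140.1, 144.3, 155.3, 175.4, 182.9, 140.7, 143.7, 139.0, 142.3
Under the normal assumption, what is the 95% CI of the population variance?
Sol) $s^2 = 241.4578$. From qchisq(0.975, df=9) and qchisq(0.025, df=9), $\chi^2_{0.975} = 19.0228$ and $\chi^2_{0.025} = 2.7004$.
$\dfrac{9(241.4578)}{19.0228} < \sigma^2 < \dfrac{9(241.4578)}{2.7004}$
$114.2365 < \sigma^2 < 804.7401$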
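The variance interval of Ex 6.9.1 can be recomputed from the raw readings. This is a sketch: `variance_interval` is a hypothetical helper, and the sixth and tenth readings are taken as 182.9 and 142.3, an assumption that reproduces the stated sample variance $s^2 = 241.4578$.

```r
# Fasting glucose readings of Ex 6.9.1
# (182.9 and 142.3 are assumed values that reproduce s^2 = 241.4578)
glucose <- c(150.3, 140.1, 144.3, 155.3, 175.4, 182.9, 140.7, 143.7, 139.0, 142.3)

# Chi-square CI for a normal population variance (illustrative helper)
variance_interval <- function(x, conf = 0.95) {
  n     <- length(x)
  s2    <- var(x)
  alpha <- 1 - conf
  c(s2    = s2,
    lower = (n - 1) * s2 / qchisq(1 - alpha / 2, df = n - 1),
    upper = (n - 1) * s2 / qchisq(alpha / 2,     df = n - 1))
}

ci <- variance_interval(glucose)
round(ci, 2)
sqrt(ci[c("lower", "upper")])  # corresponding interval for sigma
```

The interval is not symmetric about $s^2$ because the chi-square distribution is skewed. Note that the larger quantile goes in the denominator of the lower limit.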

6.10 Confidence interval for the ratio of the variances of two normally distributed populations
F-distribution:
$\dfrac{s_1^2/\sigma_1^2}{s_2^2/\sigma_2^2} \sim F_{n_1-1,\,n_2-1}$

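$100(1-\alpha)\%$ CI of $\sigma_1^2/\sigma_2^2$?
$F_{\alpha/2} < \dfrac{s_1^2/\sigma_1^2}{s_2^2/\sigma_2^2} < F_{1-\alpha/2}$, so $\dfrac{s_1^2/s_2^2}{F_{1-\alpha/2}} < \dfrac{\sigma_1^2}{\sigma_2^2} < \dfrac{s_1^2/s_2^2}{F_{\alpha/2}}$
<Ex 6.10.1> Among normal adults, the standard deviation of body mass index is 5.84 for 16 females and 6.3 for 4 males. Find the 95% C.I. of the ratio of the variances.
$n_1 = 16$, $n_2 = 4$, $s_1^2 = 5.84^2 = 34.11$, $s_2^2 = 6.3^2 = 39.69$
$df_1 = 15$, $df_2 = 3$ -> $F_{0.025} = 0.24096$, $F_{0.975} = 14.25$
$\dfrac{34.11/39.69}{14.2527} < \dfrac{\sigma_1^2}{\sigma_2^2} < \dfrac{34.11/39.69}{0.2408}$
> qf(0.025,15,3)
[1] 0.240801
> qf(0.975,15,3)
[1] 14.25271
$0.0603 < \dfrac{\sigma_1^2}{\sigma_2^2} < 3.5690$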

[Relation between $F_{df_1,df_2,\,1-\alpha/2}$ and $F_{df_2,df_1,\,\alpha/2}$]
$F_{df_1,df_2,\,1-\alpha/2} = \dfrac{1}{F_{df_2,df_1,\,\alpha/2}}$
> qf(0.975,3,15)
[1] 4.152804
> 1/qf(0.025,15,3)
[1] 4.152804
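The variance-ratio interval and the reciprocal relation can be checked together. This is a sketch: `variance_ratio_interval` is a hypothetical helper, and the group sizes 16 and 4 are those consistent with the degrees of freedom 15 and 3 in the `qf()` calls.

```r
# CI for sigma1^2 / sigma2^2 from two sample sds (illustrative helper)
variance_ratio_interval <- function(s1, n1, s2, n2, conf = 0.95) {
  alpha <- 1 - conf
  ratio <- s1^2 / s2^2
  c(ratio = ratio,
    lower = ratio / qf(1 - alpha / 2, n1 - 1, n2 - 1),
    upper = ratio / qf(alpha / 2,     n1 - 1, n2 - 1))
}

ci <- variance_ratio_interval(s1 = 5.84, n1 = 16, s2 = 6.3, n2 = 4)
round(ci, 4)

# Reciprocal relation: the upper quantile with (3, 15) df equals
# one over the lower quantile with the df swapped
all.equal(qf(0.975, 3, 15), 1 / qf(0.025, 15, 3))
```

The upper limit can differ from the hand calculation in the third decimal place, since the code keeps $5.84^2 = 34.1056$ unrounded. With the raw data available, `var.test(x, y)$conf.int` would return this interval directly.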

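*[Levene's test: test of homogeneity of variances]
library(car)
male <- c(10.673, 14.103, 25.731, 30.081)
# female values are reproduced as extracted; some digits may be missing
female <- c(6.086, 13.37, 5.195, 15.40, .537, 0.860, .409, 18.106, 19.779, 17.651, 4.403, 18.474, 15.063, 14.64, 9.136, 40.354)
data <- c(male,female)
# levene.test() is the older name; recent versions of car provide leveneTest()
# leveneTest(male,female) is not a valid call: the second argument must be a
# grouping factor with one entry per observation
?leveneTest
group <- c(rep(1,length(male)),rep(2,length(female)))
leveneTest(data,factor(group))
t.test(male,female)                  # Welch test (unequal variances)
?t.test
t.test(male,female,var.equal=TRUE)   # pooled test (equal variances)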

> t.test(male,female)

        Welch Two Sample t-test

data:  male and female
t = -0.0139, df = 3.936, p-value = 0.9896
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -13.90105  13.76368
sample estimates:
mean of x mean of y
 20.14700  20.21569

> t.test(male,female,var.equal=TRUE)

        Two Sample t-test

data:  male and female
t = -0.0164, df = 18, p-value = 0.9871
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -8.883343  8.745968
sample estimates:
mean of x mean of y
 20.14700  20.21569

> leveneTest(data,factor(group))
Levene's Test for Homogeneity of Variance (center = median)
      Df F value Pr(>F)
group  1    1.11 0.3035
      18
>

* SAS Example: Two Independent Samples
dataset: bullets

Bullets Dataset
Obs   powder   velocity
  1     1       27.3
  2     1       28.1
  3     1       27.4
  4     1       27.7
  5     1       28.0
  6     1       28.1
  7     1       27.4
  8     1       27.1
  9     2       28.3
 10     2       27.9
 11     2       28.1
 12     2       28.3
 13     2       27.9
 14     2       27.6
 15     2       28.5
 16     2       27.9
 17     2       28.4
 18     2       27.7

proc ttest data=bullets;
  var velocity;
  class powder;
run;

The TTEST Procedure

Statistics
                              Lower CL            Upper CL
Variable   powder        N    Mean      Mean      Mean
velocity   1             8    27.309    27.638    27.966
velocity   2            10    27.841    28.06     28.279
velocity   Diff (1-2)         -0.771    -0.422    -0.074

                         Lower CL             Upper CL
Variable   powder        Std Dev    Std Dev   Std Dev    Std Err   Minimum   Maximum
velocity   1             0.2596     0.3926    0.799      0.1388    27.1      28.1
velocity   2             0.2106     0.3062    0.5591     0.0968    27.6      28.5
velocity   Diff (1-2)    0.2582     0.3467    0.5276     0.1644

T-Tests
Variable   Method          Variances    DF      t Value   Pr > |t|
velocity   Pooled          Equal        16      -2.57     0.0206
velocity   Satterthwaite   Unequal      13.1    -2.50     0.0267

Equality of Variances
Variable   Method      Num DF   Den DF   F Value   Pr > F
velocity   Folded F    7        9        1.64      0.478

For H0: Variances are equal, F' = 1.64, DF = (7,9)
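For readers following along in R rather than SAS, the same analysis can be sketched with `t.test()` and `var.test()`. The velocities are entered as 27.x and 28.x, an assumption consistent with the group means 27.638 and 28.06 reported by PROC TTEST.

```r
# Bullets data: velocity by powder type
# (values assumed to be 27.x / 28.x, matching the reported group means)
powder1 <- c(27.3, 28.1, 27.4, 27.7, 28.0, 28.1, 27.4, 27.1)
powder2 <- c(28.3, 27.9, 28.1, 28.3, 27.9, 27.6, 28.5, 27.9, 28.4, 27.7)

pooled <- t.test(powder1, powder2, var.equal = TRUE)  # "Pooled / Equal" row
welch  <- t.test(powder1, powder2)                    # "Satterthwaite / Unequal" row
folded <- var.test(powder1, powder2)                  # F test for equal variances

c(t_pooled = unname(pooled$statistic), df_pooled = unname(pooled$parameter),
  t_welch  = unname(welch$statistic),  df_welch  = unname(welch$parameter),
  F        = unname(folded$statistic))
pooled$conf.int  # compare with the "Diff (1-2)" confidence limits
```

`var.test()` reports var(x)/var(y). That equals SAS's folded F here only because the first group has the larger variance.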

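* Statistical distributions: sums of independent normal random variables
$Y_1, Y_2, \ldots, Y_n$: random sample, iid from $N(\mu, \sigma^2)$
$\bar{Y} \sim N(\mu, \sigma^2/n)$
$\dfrac{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}{\sigma^2} = \dfrac{(n-1)s^2}{\sigma^2} \sim \chi^2_{n-1}$
$\dfrac{\bar{Y} - \mu}{\sigma/\sqrt{n}} \sim N(0,1)$
$\dfrac{\bar{Y} - \mu}{s/\sqrt{n}} \sim t_{n-1}$
$\dfrac{Y_i - \mu}{\sigma} \sim N(0,1)$, $\quad \sum_{i=1}^{n}\left(\dfrac{Y_i - \mu}{\sigma}\right)^2 \sim \chi^2_{n}$
$P\left(\chi^2_{\alpha/2,\,n-1} \le \dfrac{(n-1)s^2}{\sigma^2} \le \chi^2_{1-\alpha/2,\,n-1}\right) = 1 - \alpha$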

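$F_{\nu_1,\nu_2}$: ratio of independent chi-squares (df = $\nu_1$, $\nu_2$), each divided by its degrees of freedom
$\dfrac{\chi^2_{\nu_1}/\nu_1}{\chi^2_{\nu_2}/\nu_2} \sim F_{\nu_1,\nu_2}$
$\dfrac{s_1^2/\sigma_1^2}{s_2^2/\sigma_2^2} \sim F_{n_1-1,\,n_2-1}$
$P\left(F_{\alpha/2} \le \dfrac{s_1^2/\sigma_1^2}{s_2^2/\sigma_2^2} \le F_{1-\alpha/2}\right) = 1 - \alpha$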