Chapter 8 Simple Linear Regression and Correlation Analysis

Chapter 9 Regression Models (regression analysis)

9.1 Introduction
Sir Francis Galton (1822-1911), in his studies on heredity, compared the heights of parents and their children. He found that, relative to their parents' heights, children's heights tend to revert to the population mean. This reversion came to be called regression.

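Basic regression model
$$ y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad \varepsilon_i \overset{iid}{\sim} N(0, \sigma^2) $$
Y: dependent variable.
x: independent variable; in the classical model it is not random.
iid: the errors are independent and identically distributed, normal with the same variance.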

Regression model: assumptions of simple linear regression
Y: dependent (response) variable. X: independent (explanatory) variable.
1. Y is a random variable with a distribution.
2. X takes fixed values that are measured without error and can be controlled.
3. For each value of X there is a subpopulation of Y values, and each subpopulation must be normally distributed.

Figure 9.1.1 Graphical description of the simple linear regression model

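4. The variances of the subpopulations are all equal.
5. Linearity: the subpopulation means lie on a straight line, $E(Y \mid x) = \mu_{y|x} = \beta_0 + \beta_1 x$.
6. The Y values are statistically independent.
Together: $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$ with $\varepsilon_i \sim N(0, \sigma^2)$, that is, a linear association with independence, normality, and a common variance that does not depend on x.
In principle, every assumption should be checked.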

9.2 The sample regression equation (simple linear regression)
Example 9.2.1

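Least-squares line
Find the a and b that minimize the sum of squares
$$ A = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - a - b x_i)^2 $$
Setting the partial derivatives to zero:
$$ \frac{\partial A}{\partial a} = -2 \sum_{i=1}^{n} (y_i - a - b x_i) = 0, \qquad \frac{\partial A}{\partial b} = -2 \sum_{i=1}^{n} (y_i - a - b x_i)\, x_i = 0 $$
This gives the normal equations
$$ \sum y_i = n a + b \sum x_i \qquad (8.3.2) $$
$$ \sum x_i y_i = a \sum x_i + b \sum x_i^2 \qquad (8.3.3) $$
We need $\sum x_i$, $\sum y_i$, $\sum x_i^2$, and $\sum x_i y_i$.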

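Least-squares estimator
$$ \hat\beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}, \qquad \hat{y} = \hat\beta_0 + \hat\beta_1 x $$
Here $\hat\beta_0$ and $\hat\beta_1$ are the a and b of the least-squares line. The least-squares method takes as estimates the values of $\beta_0$, $\beta_1$ that minimize the sum of squared deviations over all the data, so $\hat\beta_0$, $\hat\beta_1$ must satisfy
$$ \sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2 \le \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2 \quad \text{for all } \beta_0, \beta_1 $$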
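As a rough check of the closed-form estimator, the sketch below computes the slope and intercept by hand and compares them with `lm()`. The vectors `x` and `y` are made-up toy values chosen only for illustration; they are not the lecture's waist/fat data.

```r
# Toy data, made up for illustration (NOT the lecture's waist.csv)
x <- c(70, 75, 80, 85, 90, 95, 100, 105)
y <- c(84, 87, 86, 90, 89, 92, 91, 93)
n <- length(x)

# Closed-form least-squares estimates
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

# The same estimates from R's built-in linear model fit
fit <- lm(y ~ x)

print(c(b0 = b0, b1 = b1))
print(coef(fit))

# Both normal equations hold at the solution (up to rounding error)
print(sum(y) - (n * b0 + b1 * sum(x)))
print(sum(x * y) - (b0 * sum(x) + b1 * sum(x^2)))
```

The two printed residuals of the normal equations should be zero up to floating-point error, which is the condition the derivation imposes.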

> chap9. <- read.csv("e:\\kim\\yes\\myweb\\int\\018\\newlecturenote\\data\\waist.csv", header=TRUE)
> reg <- lm(fat ~ waist, data=chap9.)
> summary(reg)

Call:
lm(formula = fat ~ waist, data = chap9.)

Residuals:
     Min       1Q   Median       3Q      Max
-15.7975  -4.6357  -0.9774   4.4471   18.168

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  71.6897     1.3350   53.70   <2e-16 ***
waist         0.1942     0.0114   17.04   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.788 on 105 degrees of freedom
Multiple R-squared: 0.7343, Adjusted R-squared: 0.7318
F-statistic: 290.3 on 1 and 105 DF, p-value: < 2.2e-16

> anova(reg)
Analysis of Variance Table

Response: fat
           Df  Sum Sq Mean Sq F value    Pr(>F)
waist       1 13374.5 13374.5  290.25 < 2.2e-16 ***
Residuals 105  4838.3    46.1
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

9.3 Evaluating the regression equation
$\beta_1 = 0$: no linear relationship between x and y.
$\beta_1 > 0$: y increases as x increases (positive linear association).
$\beta_1 < 0$: y decreases as x increases (negative linear association).

Figure 9.3.1 Cases in which $H_0: \beta_1 = 0$ is not rejected
Figure 9.3.2 Cases in which $H_0: \beta_1 = 0$ is rejected

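Total deviation and sums of squares
Total SS = explained SS + unexplained SS, that is, SST = SSR + SSE
$$ SST = \sum (y_i - \bar{y})^2 = \sum y_i^2 - \frac{(\sum y_i)^2}{n} $$
$$ SSR = \sum (\hat{y}_i - \bar{y})^2 = b^2 \left[ \sum x_i^2 - \frac{(\sum x_i)^2}{n} \right] $$
$$ SSE = SST - SSR $$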

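Coefficient of determination
The coefficient of determination $r^2$ is the proportion of the total variation that is explained by the regression equation (SSR over SST):
$$ r^2 = \frac{SSR}{SST} = \frac{\sum (\hat{y}_i - \bar{y})^2}{\sum (y_i - \bar{y})^2} = \frac{b^2 \left[ \sum x_i^2 - (\sum x_i)^2 / n \right]}{\sum y_i^2 - (\sum y_i)^2 / n} $$
$0 \le r^2 \le 1$; the larger $r^2$ is, the better the regression equation describes the data.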
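The decomposition SST = SSR + SSE and the ratio $r^2$ can be sketched numerically as follows. The data frame `d` holds made-up toy values (not the lecture data); all variable names are illustrative.

```r
# Toy data, made up for illustration (NOT the lecture's waist.csv)
d <- data.frame(
  x = c(70, 75, 80, 85, 90, 95, 100, 105),
  y = c(84, 87, 86, 90, 89, 92, 91, 93)
)
fit  <- lm(y ~ x, data = d)
yhat <- fitted(fit)

SST <- sum((d$y - mean(d$y))^2)    # total sum of squares
SSR <- sum((yhat - mean(d$y))^2)   # explained by the regression
SSE <- sum((d$y - yhat)^2)         # unexplained (residual)

r2 <- SSR / SST
print(c(SST = SST, SSR = SSR, SSE = SSE, r2 = r2))

# Shortcut formula for SSR from the slide: b^2 * [sum(x^2) - (sum x)^2 / n]
b <- unname(coef(fit)["x"])
SSR_short <- b^2 * (sum(d$x^2) - sum(d$x)^2 / nrow(d))
print(SSR_short)
```

In simple linear regression $r^2$ also equals the squared sample correlation between x and y, which gives a second check on the computation.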

Figure 9.3.4

Figure 9.3.5 (a) $r^2 \approx 0.99$ (b) $r^2 \approx 0.3$ (c) $r^2 = 1$ (d) $r^2 \approx 0$

ANOVA table for simple linear regression
Source               SS    df      Mean square          F (V.R.)
Regression (model)   SSR   1       MSR = SSR/1          MSR/MSE
Error                SSE   n - 2   MSE = SSE/(n - 2)
Total                SST   n - 1

Test using the F statistic (variance ratio)
Hypotheses: $H_0: \beta_1 = 0$ versus $H_A: \beta_1 \ne 0$
Test statistic: $V.R. = MSR/MSE \sim F(1, n-2)$ under $H_0$
In the output below, V.R. = 290.25 > F(1, 105; 0.95) = 3.93, so $H_0$ is rejected.

> anova(reg)
Analysis of Variance Table

Response: fat
           Df  Sum Sq Mean Sq F value    Pr(>F)
waist       1 13374.5 13374.5  290.25 < 2.2e-16 ***
Residuals 105  4838.3    46.1
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

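Test using the t statistic
Hypotheses: $H_0: \beta_1 = 0$ versus $H_A: \beta_1 \ne 0$
Sampling distributions of a and b:
$$ E(a) = \beta_0, \qquad Var(a) = \frac{\sigma^2 \sum x_i^2}{n \sum (x_i - \bar{x})^2} $$
$$ E(b) = \beta_1, \qquad Var(b) = \frac{\sigma^2}{\sum (x_i - \bar{x})^2} $$
Estimated standard error of b, with $s_{y|x}^2 = SSE/(n-2)$:
$$ s_b = \frac{s_{y|x}}{\sqrt{\sum (x_i - \bar{x})^2}} $$

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  71.6897     1.3350   53.70   <2e-16 ***
waist         0.1942     0.0114   17.04   <2e-16 ***
---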

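Hypotheses: $H_0: \beta_1 = \beta_{1,0}$ versus $H_A: \beta_1 \ne \beta_{1,0}$ (usually $\beta_{1,0} = 0$)
Test statistic: $t = \dfrac{b - \beta_{1,0}}{s_b} \sim t(n-2)$ under $H_0$
In the example: $t = 17.04 > t_{0.975}(105) = 1.9828$, so $H_0: \beta_1 = 0$ is rejected.
Confidence interval for $\beta_1$:
$$ b \pm t_{1-\alpha/2}\, s_b, \qquad s_b = \frac{s_{y|x}}{\sqrt{\sum (x_i - \bar{x})^2}} $$
From the coefficient table (waist: estimate 0.1942, standard error 0.0114), the 95% interval is 0.1942 ± 1.983(0.0114), approximately (0.1716, 0.2169).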
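The t test and the confidence interval for the slope can be reproduced step by step. The sketch below uses made-up toy data (not the lecture data) and illustrative variable names; it also checks the known identity $t^2 = F$ for simple linear regression.

```r
# Toy data, made up for illustration (NOT the lecture's waist.csv)
d <- data.frame(
  x = c(70, 75, 80, 85, 90, 95, 100, 105),
  y = c(84, 87, 86, 90, 89, 92, 91, 93)
)
fit <- lm(y ~ x, data = d)
n   <- nrow(d)

s  <- sqrt(sum(resid(fit)^2) / (n - 2))      # s_{y|x} = sqrt(MSE)
sb <- s / sqrt(sum((d$x - mean(d$x))^2))     # standard error of the slope
b  <- unname(coef(fit)["x"])

tstat <- b / sb                              # test of H0: beta1 = 0
tcrit <- qt(0.975, df = n - 2)               # two-sided 5% critical value
pval  <- 2 * pt(-abs(tstat), df = n - 2)
ci    <- c(b - tcrit * sb, b + tcrit * sb)   # 95% CI for beta1

Fstat <- anova(fit)[1, "F value"]            # variance ratio MSR/MSE

print(c(t = tstat, t_squared = tstat^2, F = Fstat, p = pval))
print(ci)
print(confint(fit)["x", ])                   # built-in interval for comparison
```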

9.4 Using the regression equation (application)
Predicting Y for a given $x_p$: $\hat{y} = a + b x_p$
Prediction interval:
$$ \hat{y} \pm t_{1-\alpha/2}\, s_{y|x} \sqrt{1 + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{\sum (x_i - \bar{x})^2}} $$

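9.4 Using the regression equation (continued): estimating the mean of Y for a given X
Confidence interval for the mean $E(y \mid x_p)$ at a given $x_p$:
$$ \hat{y} \pm t_{1-\alpha/2}\, s_{y|x} \sqrt{\frac{1}{n} + \frac{(x_p - \bar{x})^2}{\sum (x_i - \bar{x})^2}} $$
When $x_p = 100$: $\hat{y} = 91.114$ and $s_{y|x} \sqrt{\frac{1}{n} + \frac{(x_p - \bar{x})^2}{\sum (x_i - \bar{x})^2}} = 0.6566$
95% confidence interval: (89.81, 92.416)
95% prediction interval: (77.59, 104.63)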

Estimating the mean of Y for a given X: (a) 95% confidence interval, (b) 95% prediction interval
> newdata = data.frame(waist=100)
> predict(reg, newdata, interval="confidence")
      fit     lwr      upr
1 91.1141 89.8115 92.41605
> predict(reg, newdata, interval="prediction")
      fit      lwr      upr
1 91.1141 77.59164 104.6366
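To see where the two `predict()` intervals come from, the sketch below rebuilds them from the formulas for the confidence interval of the mean and for the prediction interval. The data are made-up toy values (not waist.csv), and the names `se_mean`, `se_pred`, `ci_hand`, `pi_hand` are illustrative only.

```r
# Toy data, made up for illustration (NOT the lecture's waist.csv)
d <- data.frame(
  x = c(70, 75, 80, 85, 90, 95, 100, 105),
  y = c(84, 87, 86, 90, 89, 92, 91, 93)
)
fit <- lm(y ~ x, data = d)

n     <- nrow(d)
xp    <- 100                                   # value of x at which to predict
s     <- sqrt(sum(resid(fit)^2) / (n - 2))     # s_{y|x}
sxx   <- sum((d$x - mean(d$x))^2)
tcrit <- qt(0.975, df = n - 2)
yhat  <- sum(coef(fit) * c(1, xp))             # a + b * xp

# Standard errors: mean of Y at xp, and a single new Y at xp
se_mean <- s * sqrt(1 / n + (xp - mean(d$x))^2 / sxx)
se_pred <- s * sqrt(1 + 1 / n + (xp - mean(d$x))^2 / sxx)

ci_hand <- yhat + c(-1, 1) * tcrit * se_mean
pi_hand <- yhat + c(-1, 1) * tcrit * se_pred

new  <- data.frame(x = xp)
ci_r <- predict(fit, new, interval = "confidence")
pi_r <- predict(fit, new, interval = "prediction")

print(rbind(ci_hand, pi_hand))
print(ci_r)
print(pi_r)
```

The prediction interval is always the wider of the two because it adds the variance of an individual observation (the extra "1 +" under the square root) to the variance of the estimated mean.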

# R code: fitted line with 95% confidence and prediction bands
chap9. <- read.csv("e:\\kim\\yes\\myweb\\int\\018\\newlecturenote\\data\\waist.csv", header=TRUE)
reg <- lm(fat ~ waist, data=chap9.)
# Assumed values: axis limits 0-250 and 60-120, prediction grid waist = 10, 20, ..., 250
result <- data.frame(waist=1:25*10)

# Confidence band for the mean of fat at each waist value
plot(fat ~ waist, data=chap9., xlim=c(0,250), ylim=c(60,120))
abline(reg)
p <- as.data.frame(predict(reg, result, level=0.95, interval="confidence"))
lines(result$waist, p$lwr, lty=2)
lines(result$waist, p$upr, lty=2)

# Prediction band for a new observation, drawn in a second graphics window
dev.new()   # win.graph() on Windows
plot(fat ~ waist, data=chap9., xlim=c(0,250), ylim=c(60,120))
abline(reg)
pp <- as.data.frame(predict(reg, result, level=0.95, interval="prediction"))
lines(result$waist, pp$lwr, lty=2)
lines(result$waist, pp$upr, lty=2)

* SAS code ;
PROC IMPORT OUT=WORK.waist
    DATAFILE="E:\kim\yes\myweb\int\018\newl waist.csv"
    DBMS=CSV REPLACE;
    GETNAMES=YES;
    DATAROW=2;
RUN;
proc reg;
    model fat=waist;
    plot fat*waist;
run;

Homework 1-9 10 -> work it once by hand (manual calculation), then repeat with SAS and R

9.5 The concept of multiple regression analysis
One Y and k independent variables $x_1, \ldots, x_k$
Y: dependent variable, also called the response variable
$x_1, \ldots, x_k$: independent variables, also called explanatory variables or predictor variables

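Multiple regression model
$$ y_j = \beta_0 + \beta_1 x_{1j} + \beta_2 x_{2j} + \cdots + \beta_k x_{kj} + \varepsilon_j, \qquad \varepsilon_j \overset{iid}{\sim} N(0, \sigma^2), \quad j = 1, \ldots, n $$
(iid: independently and identically distributed)
Interpreting the coefficients, e.g. with two independent variables:
$$ Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon $$
(Y: length of hospital stay, $x_1$: number of previous admissions, $x_2$: age)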

$\beta_0 = E[Y \mid x_1 = x_2 = 0]$: the expected value of Y when $x_1$ and $x_2$ are both 0. This is meaningful only when 0 is a sensible value of the x's, which is why centering is needed.
$$ \beta_1 = E[y \mid x_1 = a + 1, x_2 = b] - E[y \mid x_1 = a, x_2 = b] = [\beta_0 + \beta_1 (a + 1) + \beta_2 b] - [\beta_0 + \beta_1 a + \beta_2 b] $$
$\beta_1$ is the increase in the expected value of Y for a one-unit increase in $x_1$ when $x_2$ is held fixed, that is, the effect of $x_1$ on Y after adjusting for (controlling) the effect of $x_2$.

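9.6 Estimating the multiple regression equation (estimating the regression coefficients)
Estimate $\beta_0, \beta_1, \beta_2$ by the values that minimize
$$ L = \sum_j (y_j - \beta_0 - \beta_1 x_{1j} - \beta_2 x_{2j})^2, \qquad \frac{\partial L}{\partial \beta_0} = \frac{\partial L}{\partial \beta_1} = \frac{\partial L}{\partial \beta_2} = 0 $$
Normal equations:
$$ n b_0 + b_1 \sum x_{1j} + b_2 \sum x_{2j} = \sum y_j $$
$$ b_0 \sum x_{1j} + b_1 \sum x_{1j}^2 + b_2 \sum x_{1j} x_{2j} = \sum x_{1j} y_j $$
$$ b_0 \sum x_{2j} + b_1 \sum x_{1j} x_{2j} + b_2 \sum x_{2j}^2 = \sum x_{2j} y_j $$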

Example 9.6.1
chap9r <- read.csv("e:\\kim\\yes\\myweb\\int\\018\\newlecturenote\\data\\cda.csv", header=TRUE)
plot(chap9r)      # scatterplot matrix of all variables
head(chap9r)      # first few rows
line <- lm(cda ~ age + ed, data=chap9r)
summary(line)
anova(line)

Fitted equation: $\hat{y}_j = 5.693 - 0.1898 x_{1j} + 0.668 x_{2j}$

9.7 Evaluating the multiple regression equation
Coefficient of multiple determination
SST = SSR + SSE: total sum of squares = explained sum of squares + unexplained sum of squares
$$ R^2_{y.12 \ldots k} = \frac{\sum (\hat{y}_j - \bar{y})^2}{\sum (y_j - \bar{y})^2} = \frac{SSR}{SST} $$

Ex 9.7.1
> aa <- anova(line)
> aa[,2]
[1] 206.0542 209.7428 678.2030
> sst = sum(aa[,2])
> sst
[1] 1094
> ssr = sum(aa[1,2], aa[2,2])
> ssr
[1] 415.797
> Rsq = ssr/sst
> Rsq
[1] 0.3800704

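Matrix notation
$$ \underset{n \times 1}{\mathbf{y}} = \underset{n \times (k+1)}{\mathbf{X}} \; \underset{(k+1) \times 1}{\boldsymbol\beta} + \underset{n \times 1}{\boldsymbol\varepsilon} $$
$$ \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} 1 & x_{11} & x_{21} & \cdots & x_{k1} \\ 1 & x_{12} & x_{22} & \cdots & x_{k2} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{1n} & x_{2n} & \cdots & x_{kn} \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix} $$
$$ L = (\mathbf{y} - \mathbf{X}\boldsymbol\beta)'(\mathbf{y} - \mathbf{X}\boldsymbol\beta) = \mathbf{y}'\mathbf{y} - 2 \boldsymbol\beta' \mathbf{X}'\mathbf{y} + \boldsymbol\beta' \mathbf{X}'\mathbf{X} \boldsymbol\beta $$
$$ \frac{\partial L}{\partial \boldsymbol\beta} = -2 \mathbf{X}'\mathbf{y} + 2 \mathbf{X}'\mathbf{X} \boldsymbol\beta = 0 \;\Rightarrow\; (\mathbf{X}'\mathbf{X}) \boldsymbol\beta = \mathbf{X}'\mathbf{y} \;\Rightarrow\; \hat{\boldsymbol\beta} = (\mathbf{X}'\mathbf{X})^{-1} \mathbf{X}'\mathbf{y} $$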

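Least-squares estimator (LSE): $\hat{\boldsymbol\beta} = (\mathbf{X}'\mathbf{X})^{-1} \mathbf{X}'\mathbf{Y}$
$$ \mathbf{X}'\mathbf{X} = \begin{pmatrix} n & \sum x_{1j} & \sum x_{2j} & \cdots & \sum x_{kj} \\ \sum x_{1j} & \sum x_{1j}^2 & \sum x_{1j} x_{2j} & \cdots & \sum x_{1j} x_{kj} \\ \vdots & \vdots & \vdots & & \vdots \\ \sum x_{kj} & \sum x_{1j} x_{kj} & \sum x_{2j} x_{kj} & \cdots & \sum x_{kj}^2 \end{pmatrix}, \qquad \mathbf{X}'\mathbf{Y} = \begin{pmatrix} \sum y_j \\ \sum x_{1j} y_j \\ \vdots \\ \sum x_{kj} y_j \end{pmatrix} $$
$$ Var(\hat{\boldsymbol\beta}) = \sigma^2 (\mathbf{X}'\mathbf{X})^{-1}, \quad \text{estimated by } \hat\sigma^2 (\mathbf{X}'\mathbf{X})^{-1} $$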

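When k = 2:
$$ \mathbf{X}'\mathbf{X} = \begin{pmatrix} n & \sum x_{1j} & \sum x_{2j} \\ \sum x_{1j} & \sum x_{1j}^2 & \sum x_{1j} x_{2j} \\ \sum x_{2j} & \sum x_{1j} x_{2j} & \sum x_{2j}^2 \end{pmatrix} $$
$$ Var(\hat{\boldsymbol\beta}) = \begin{pmatrix} var(b_0) & cov(b_0, b_1) & cov(b_0, b_2) \\ cov(b_0, b_1) & var(b_1) & cov(b_1, b_2) \\ cov(b_0, b_2) & cov(b_1, b_2) & var(b_2) \end{pmatrix} = \sigma^2 (\mathbf{X}'\mathbf{X})^{-1} = \sigma^2 \begin{pmatrix} C_{00} & C_{01} & C_{02} \\ C_{01} & C_{11} & C_{12} \\ C_{02} & C_{12} & C_{22} \end{pmatrix} $$
where the matrix of $C_{ij}$ is the inverse of $\mathbf{X}'\mathbf{X}$:
$$ (\mathbf{X}'\mathbf{X}) \begin{pmatrix} C_{00} & C_{01} & C_{02} \\ C_{01} & C_{11} & C_{12} \\ C_{02} & C_{12} & C_{22} \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} = \mathbf{I} $$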
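The matrix formulas can be tried directly in R. The sketch below is only an illustration under assumptions: the data frame `d` contains made-up values for two predictors (it is not the lecture's cda.csv), and the names `XtX_inv`, `beta_hat`, `V` are hypothetical. It forms $(X'X)^{-1}X'y$ and $s^2 (X'X)^{-1}$ and compares them with `lm()`.

```r
# Toy data with k = 2 predictors, made up for illustration (NOT cda.csv)
d <- data.frame(
  x1 = c(60, 65, 70, 72, 75, 80, 68, 77),
  x2 = c(12, 10, 14,  8, 16, 12,  9, 15),
  y  = c(10,  9, 11,  6, 12,  9,  7, 11)
)

X <- cbind(1, d$x1, d$x2)          # n x (k+1) design matrix, first column of 1s
n <- nrow(X)
k <- ncol(X) - 1

XtX_inv  <- solve(t(X) %*% X)      # the matrix of C_ij
beta_hat <- XtX_inv %*% t(X) %*% d$y

# Estimated error variance and covariance matrix of the estimates
s2 <- sum((d$y - X %*% beta_hat)^2) / (n - k - 1)
V  <- s2 * XtX_inv
se <- sqrt(diag(V))                # s_{b_i} = s * sqrt(C_ii)
tstat <- as.vector(beta_hat) / se  # tests of H0: beta_i = 0

fit <- lm(y ~ x1 + x2, data = d)
print(cbind(by_hand = as.vector(beta_hat), lm = coef(fit)))
print(cbind(se = se, t = tstat))
```

In practice `lm()` uses a QR decomposition rather than inverting $X'X$, which is numerically more stable; the explicit inverse is shown here only because it mirrors the slide.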

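ANOVA table
Source               SS    df          Mean square               F (V.R.)
Regression (model)   SSR   k           MSR = SSR/k               MSR/MSE
Residual (error)     SSE   n - k - 1   MSE = SSE/(n - k - 1)
Total                SST   n - 1
Hypotheses: $H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$ versus $H_A$: not $H_0$
If $V.R. \ge F(k, n-k-1; 1-\alpha)$, reject $H_0$.
Each $b_i \sim N(\beta_i, \sigma^2 c_{ii})$, which leads to the statistic $(b_i - \beta_{i0}) / s_{b_i}$ for an individual coefficient.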

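Testing an individual coefficient
Hypotheses: $H_0: \beta_i = \beta_{i0}$ versus $H_A: \beta_i \ne \beta_{i0}$ (usually $\beta_{i0} = 0$)
Test statistic: $t = \dfrac{b_i - \beta_{i0}}{s_{b_i}}$
Standard error: $s_{b_i} = s \sqrt{C_{ii}}$
If $|t| \ge t_{1-\alpha/2}(n-k-1)$, reject $H_0$.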

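9.8 Using the multiple regression equation (application)
Estimating the mean of Y for given X: confidence interval for the mean of the subpopulation of Y values at specified values of the $X_i$
$$ \hat{y}_j \pm t_{1-\alpha/2,\, df = n-k-1}\; s_{\hat{y}_j} $$
Predicting Y for given X: prediction interval for a Y value obtained at specified values of the $X_i$
$$ \hat{y}_j \pm t_{1-\alpha/2,\, df = n-k-1}\; s'_{\hat{y}_j} $$
Here $s_{\hat{y}_j}$ is the standard error of the estimated mean and $s'_{\hat{y}_j} = \sqrt{s^2 + s_{\hat{y}_j}^2}$ is the standard error of prediction.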

Ex 9.8.1
new = data.frame(age=68, ed=1)
predict(line, new, level=0.95, interval="confidence")
predict(line, new, level=0.95, interval="prediction")

9.9 Violations of the regression assumptions (checking the assumptions of the regression model)
Non-normality (errors not normally distributed)
Heteroscedasticity (unequal variances)
Correlation among the independent variables (collinearity)

9.10 Qualitative independent variables
Types of variable:
Quantitative (continuous): score, age
Qualitative (categorical): sex, race, occupation
A qualitative variable enters the model through dummy variables, each taking the value 0 or 1.
A qualitative variable with k categories needs k - 1 dummy variables.

Examples of dummy variables
* Sex: $x_1 = 1$ for male, 0 for female
* Residential area (urban, rural, suburban): $x_2 = 1$ for urban, 0 otherwise; $x_3 = 1$ for rural, 0 otherwise
* Smoking status (current smoker, ex-smoker who quit within 5 years, ex-smoker who quit more than 5 years ago, never smoker): $x_4 = 1$ for current smoker, 0 otherwise; $x_5 = 1$ for ex-smoker (<= 5 years), 0 otherwise; $x_6 = 1$ for ex-smoker (> 5 years), 0 otherwise
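In R the k - 1 dummy variables are created automatically when a factor is used in a model formula. The sketch below is an illustration with a made-up vector `area`; the choice of "suburban" as the reference level is an assumption made here to match the urban/rural dummies in the example above.

```r
# Made-up residential-area values, for illustration only
area <- factor(c("urban", "rural", "suburban", "urban", "rural"),
               levels = c("suburban", "urban", "rural"))  # first level = reference

# Design matrix: intercept plus k - 1 = 2 dummy columns
X <- model.matrix(~ area)
print(X)

# The same dummies coded by hand
x2 <- as.integer(area == "urban")   # 1 = urban, 0 = otherwise
x3 <- as.integer(area == "rural")   # 1 = rural, 0 = otherwise
print(cbind(x2, x3))
```

The reference category (here suburban) is the one with all dummies equal to 0; its mean is absorbed into the intercept.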

Ex 9.10.1
Data columns: case number, birth weight, gestation (weeks), smoking status of the mother

Y: birth weight (grams)
$x_1$: gestation period (weeks)
$x_2$: smoking status of the mother (1 = smoker, 0 = nonsmoker)
Model 1: $E(Y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$
for nonsmokers: $E(Y) = \beta_0 + \beta_1 x_1$
for smokers: $E(Y) = (\beta_0 + \beta_2) + \beta_1 x_1$
Same slope, different intercept.

data <- read.csv("d:\\kim\\yes\\myweb\\int\\018\\newlecturenote\\data\\태아몸무게.csv", header=TRUE)
reg <- lm(gram ~ weeks + factor(smoke), data=data)
summary(reg)

Output for Model 1 (same slope, different intercept):

Call:
lm(formula = gram ~ weeks + factor(smoke), data = data)

Residuals:
    Min      1Q  Median      3Q     Max
-160.45  -65.80     9.6  311.53 1016.59

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)     -1724.42     558.84  -3.086  0.00265 **
weeks             130.05      14.52   8.957 2.39e-14 ***
factor(smoke)1   -294.40     135.78  -2.168  0.03260 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 484.6 on 97 degrees of freedom
Multiple R-squared: 0.4636, Adjusted R-squared: 0.4525
F-statistic: 41.92 on 2 and 97 DF, p-value: 7.594e-14

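Interpretation of $\beta_2$
$$ \beta_2 = E(Y \mid X_1 = x_1, \text{smoker}) - E(Y \mid X_1 = x_1, \text{nonsmoker}) $$
For a given gestation period $x_1$, $\beta_2$ is the expected difference in birth weight between babies whose mothers smoke and babies whose mothers do not.
Fitted equation from the R output: $\hat{y}_j = \hat\beta_0 + \hat\beta_1 x_{1j} + \hat\beta_2 x_{2j} = -1724.42 + 130.05 x_{1j} - 294.40 x_{2j}$
for smokers ($x_{2j} = 1$): $\hat{y}_j = -1724.42 + 130.05 x_{1j} - 294.40(1) = -2018.82 + 130.05 x_{1j}$
for nonsmokers ($x_{2j} = 0$): $\hat{y}_j = -1724.42 + 130.05 x_{1j} - 294.40(0) = -1724.42 + 130.05 x_{1j}$
* Test of $H_0: \beta_2 = 0$ with $T = \hat\beta_2 / se(\hat\beta_2)$: $H_0$ is rejected, so the two groups are significantly different.
* 95% confidence interval (CI): $b_2 \pm t\, s_{b_2}$
The values quoted with this example are $\hat\beta_2 \approx -245$ grams, $T = -5.83$, and the interval (-330.3975, -158.685). The R fit above gives $\hat\beta_2 = -294.40$ with $t = -2.168$ ($p = 0.0326$), which also rejects $H_0$ at the 5% level.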

(Figure legend: nonsmoker, smoker)

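Model 2: $E(Y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2$
for nonsmokers: $E(Y) = \beta_0 + \beta_1 x_1$
for smokers: $E(Y) = (\beta_0 + \beta_2) + (\beta_1 + \beta_3) x_1$
Different slope, different intercept.
If $\beta_3$ is significant, the slopes differ between smokers and nonsmokers.
If $\beta_2$ is significant, the intercepts differ; without centering this is not an important comparison, because it refers to $x_1 = 0$.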

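Figure (Model 2): birth weight (vertical axis) against gestation period (horizontal axis, with 38 weeks marked), one line for nonsmokers (intercept $\beta_0$, slope $\beta_1$) and one for smokers (intercept $\beta_0 + \beta_2$, slope $\beta_1 + \beta_3$).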

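Centering: replace $x_1$ by $x_1 - 38$ (weeks)
$$ E(Y) = \beta_0 + \beta_1 (x_1 - 38) + \beta_2 x_2 + \beta_3 (x_1 - 38) x_2 $$
for nonsmokers: $E(Y) = \beta_0 + \beta_1 (x_1 - 38)$
for smokers: $E(Y) = (\beta_0 + \beta_2) + (\beta_1 + \beta_3)(x_1 - 38)$
$\beta_0$ is now the expected value of Y for nonsmokers at $x_1 = 38$, $E(Y \mid x_1 = 38)$, which is a meaningful parameter of interest.
$\beta_2$ is now the difference in expected values between smokers and nonsmokers at $x_1 = 38$ rather than at $x_1 = 0$: $E(Y \mid x_1 = 38, \text{smoker}) - E(Y \mid x_1 = 38, \text{nonsmoker})$.
* Lesson: after centering a continuous variable, the intercept is the expected value at a chosen value of x instead of at x = 0, so it becomes more meaningful.
* Another effect of centering: it weakens the multicollinearity among the x terms.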
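The effect of centering on the interaction model can be sketched with a small example. The data frame `d` below is made up for illustration (it is not the birth-weight file used in the lecture), and `weeks_c` is a hypothetical name for the centered gestation variable.

```r
# Toy data, made up for illustration (NOT the lecture's birth-weight data)
d <- data.frame(
  weeks = rep(35:42, times = 2),
  smoke = rep(c(0, 1), each = 8),               # 0 = nonsmoker, 1 = smoker
  gram  = c(2500, 2700, 2800, 3000, 3100, 3300, 3350, 3500,
            2300, 2400, 2600, 2650, 2800, 2900, 3000, 3050)
)

# Model 2 on the raw scale: different intercepts and slopes
m_raw <- lm(gram ~ weeks * smoke, data = d)

# Same model after centering gestation at 38 weeks
d$weeks_c <- d$weeks - 38
m_cen <- lm(gram ~ weeks_c * smoke, data = d)

print(coef(m_raw))
print(coef(m_cen))

# Correlation between the smoking dummy and the interaction term,
# before and after centering
cor_raw <- cor(d$smoke, d$weeks * d$smoke)
cor_cen <- cor(d$smoke, d$weeks_c * d$smoke)
print(c(raw = cor_raw, centered = cor_cen))
```

The slopes (`weeks` and the interaction) are unchanged by centering; only the intercept and the `smoke` coefficient move, becoming the nonsmoker mean and the smoker-nonsmoker difference at 38 weeks.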

Ex 9.10.2
effectiveness  age  treatment    effectiveness  age  treatment
56             1    A            65             43   A
41             3    B            55             45   B
40             30   B            57             48   B
8              19   C            59             47   C
55             8    A            64             48   A
5              3    C            61             53   A
46             33   B            6              58   B
71             67   C            36             9    C
48             4    B            69             53   A
63             33   A            47             9    B
5              33   A            73             58   A
6              56   C            64             66   B
50             45   C            60             67   B
45             43   B            6              63   A
58             38   A            71             59   C
46             37   C            6              51   C
58             43   B            70             67   A
34             7    C            71             63   C

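Model for Example 9.10.2
* Y: treatment effectiveness
* $X_1$: age (quantitative)
* Treatment (qualitative, reference category A): $X_2 = 1$ if treatment B, 0 otherwise; $X_3 = 1$ if treatment C, 0 otherwise
$$ Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_1 x_2 + \beta_5 x_1 x_3 + \varepsilon $$
$E(Y) = \beta_0 + \beta_1 x_1$ for treatment A
$E(Y) = (\beta_0 + \beta_2) + (\beta_1 + \beta_4) x_1$ for treatment B
$E(Y) = (\beta_0 + \beta_3) + (\beta_1 + \beta_5) x_1$ for treatment C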

$\beta_0, \beta_1$: intercept and slope for the reference cell A
$\beta_2$: difference of intercepts (B - A); is it 0?
$\beta_3$: difference of intercepts (C - A); is it 0?
$\beta_4$: difference of slopes (B - A); is it 0?
$\beta_5$: difference of slopes (C - A); is it 0?

Example 9.10.2 - SAS
* File: mreg018.sas ;
data reg;
  input effect age method $;
  /* dummy variables for treatment (reference A) and their products with age */
  x1=age; x2=(method='B'); x3=(method='C');
  x12=x1*x2; x13=x1*x3;
cards;
56 1 A
41 3 B
40 30 B
8 19 C
55 8 A
5 3 C
46 33 B
71 67 C
48 4 B
63 33 A
5 33 A
6 56 C
50 45 C
45 43 B
58 38 A
46 37 C
58 43 B
34 7 C
65 43 A
55 45 B
57 48 B
59 47 C
64 48 A
61 53 A
6 58 B
36 9 C
69 53 A
47 9 B
73 58 A
64 66 B
60 67 B
6 63 A
71 59 C
6 51 C
70 67 A
71 63 C
;
run;

/* regression on the hand-coded dummy variables */
proc reg;
  model effect=x1 x2 x3 x12 x13;
  output out=d p=pred;
  id age method;
run;

/* one fitted line per treatment */
proc sort; by method;
proc gplot;
  plot effect*age=method / legend;
  symbol1 v='a' i=r c=c l=1;
  symbol2 v='b' i=r c=c l=2;
  symbol3 v='c' i=r c=c l=3;
run;

/* same model with a CLASS variable (default reference level) */
proc glm;
  class method;
  model effect=age method age*method / solution;
run;

/* same model with treatment A as the reference level */
proc glm;
  class method(reference='A');
  model effect=age method age*method / solution;
run;

9.11 Variable (model) selection procedures
Forward selection
Backward elimination
Stepwise selection

Mod18.sas
/* file : mod18.sas
   Multiple Regression Model with stepwise selection */
Filename electric 'd:\myweb\int\electric.dat';
data peak;
  infile electric;
  /* column ranges 19-23 and 26-28 are assumed from the spacing of the other fields */
  input housize 1-3 income 6-11 aircapac 14-16 applindx 19-23 family 26-28 peak 31-35;
  label housize  = 'House Size'
        income   = 'Family Income'
        aircapac = 'Air Conditioning Capacity'
        applindx = 'Appliance Index'
        family   = 'Number of Family Members'
        peak     = 'Peak Hour Electric Load';
run;

/* stepwise selection */
proc reg data=peak;
  model peak = housize income aircapac applindx family / selection=stepwise;
  title 'Multiple Regression Model with stepwise selection';
run;

/* best subsets by R-square, with Cp, adjusted R-square and MSE; best=2 is an assumed value */
proc reg data=peak outest=est;
  model peak = housize income aircapac applindx family
        / selection=rsquare cp adjrsq mse best=2;
  title 'Multiple Regression Model with stepwise selection';
run;

proc print;
  title 'Actual Coefficients, etc.';
/* Cp against the number of regressors; the upper axis limit 25 is an assumed value */
proc plot;
  plot _cp_*_in_='C' _p_*_in_='*' / overlay
       vaxis=0 to 25 by 5 haxis=1 to 5 hpos=40 vpos=30;
  title;
run;
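For readers working in R, the three selection procedures can be approximated with `step()`. This is a sketch under assumptions, not a translation of the SAS program: `step()` adds and drops terms by AIC, whereas `selection=stepwise` in PROC REG uses significance levels for entry and removal, so the two may choose different models. The data are simulated here (`x1` to `x4` and `y` are hypothetical names), with only `x1` and `x2` truly related to `y`.

```r
# Simulated data for illustration only (NOT the electric.dat file)
set.seed(1)
n <- 40
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
d$y <- 3 + 2 * d$x1 - 1.5 * d$x2 + rnorm(n, sd = 0.5)

null  <- lm(y ~ 1, data = d)                   # intercept-only model
full  <- lm(y ~ x1 + x2 + x3 + x4, data = d)   # all candidate predictors
upper <- ~ x1 + x2 + x3 + x4

# Forward selection: start empty, add one term at a time
fwd <- step(null, scope = list(lower = ~ 1, upper = upper),
            direction = "forward", trace = 0)

# Backward elimination: start full, drop one term at a time
bwd <- step(full, direction = "backward", trace = 0)

# Stepwise selection: terms may be added or dropped at each step
both <- step(null, scope = list(lower = ~ 1, upper = upper),
             direction = "both", trace = 0)

print(formula(fwd))
print(formula(bwd))
print(formula(both))
```

Setting `trace = 1` prints the AIC of every candidate model at each step, which is the closest analogue to the step-by-step listing that PROC REG produces.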