Chapter 8 Simple Linear Regression and Correlation Analysis


Chapter 9 Regression Models (regression analysis)

9.1 Introduction
Sir Francis Galton (1822-1911) studied heredity. Comparing the heights of parents and their children, he found that the heights of the second generation tend to revert to the population mean rather than follow the parents' heights. This reversion came to be called regression.

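Basic model of regression analysis
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad \varepsilon_i \overset{iid}{\sim} N(0, \sigma^2)$$
y_i: dependent variable.
x_i: independent variable; in the classical model it is not random.
ε_i: errors that are normal, have the same variance σ², and are independently and identically distributed (iid).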

Regression model: assumptions of simple linear regression
Y: dependent (response) variable
X: independent (explanatory) variable
1. Y is a random variable with a probability distribution.
2. X takes fixed values that are measured without error and can be controlled.
3. For each value of X there is a subpopulation of Y values, and each subpopulation must be normally distributed.
4. The subpopulation variances are all equal.
5. Linearity: the subpopulation means lie on a straight line, $\mu_{y|x} = \beta_0 + \beta_1 x$.
6. The Y values are statistically independent.
Equivalently, $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$ with $\varepsilon_i \sim N(0, \sigma^2)$: independence, normality, and a constant variance that does not depend on x. As a rule, all of these assumptions need to be checked.
Figure 9.1.1 Graphical description of the simple linear regression model
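The assumptions listed above can be checked from the residuals of a fitted model. The following is a minimal sketch on simulated data; the parameter values (70, 0.2, sd = 5) and the variable names are assumptions chosen for illustration and are not taken from the lecture data.

```r
# Hypothetical data simulated from y_i = beta0 + beta1 * x_i + e_i
set.seed(1)
n <- 100
x <- seq(60, 120, length.out = n)       # fixed (non-random) design points
y <- 70 + 0.2 * x + rnorm(n, sd = 5)    # iid normal errors, equal variance

fit <- lm(y ~ x)
res <- residuals(fit)

# Residuals vs fitted values: curvature suggests non-linearity,
# a funnel shape suggests unequal variance.
plot(fitted(fit), res, xlab = "Fitted value", ylab = "Residual")
abline(h = 0, lty = 2)

# Normal Q-Q plot and Shapiro-Wilk test for normality of the residuals.
qqnorm(res)
qqline(res)
sw <- shapiro.test(res)
print(sw$p.value)
```

A small Shapiro-Wilk p-value or a clear pattern in the residual plot would suggest that an assumption is violated; independence usually has to be judged from how the data were collected.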

9.2 The sample regression equation (simple linear regression)
Example 9.2.1

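Least-squares line: find the a and b that minimize the sum of squares
$$A = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - a - b x_i)^2$$
Setting the partial derivatives to zero:
$$\frac{\partial A}{\partial a} = -2\sum_{i=1}^{n} (y_i - a - b x_i) = 0, \qquad \frac{\partial A}{\partial b} = -2\sum_{i=1}^{n} (y_i - a - b x_i)\,x_i = 0$$
This gives the normal equations
$$\sum y_i = n a + b \sum x_i \qquad (8.3.2)$$
$$\sum x_i y_i = a \sum x_i + b \sum x_i^2 \qquad (8.3.3)$$
To solve them we need $\sum x_i$, $\sum y_i$, $\sum x_i^2$ and $\sum x_i y_i$.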

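Least-squares estimators (written in the β notation, with intercept β0 and slope β1)
$$\hat\beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}, \qquad \hat{y} = \hat\beta_0 + \hat\beta_1 x$$
The least-squares method uses as estimates the values $\hat\beta_0, \hat\beta_1$ that minimize the sum of squared deviations over all data points, so they must satisfy
$$\sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2 \le \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2 \quad \text{for all } \beta_0, \beta_1$$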
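As a check on the closed-form estimators, the sketch below computes them directly and compares them with lm(). The data are hypothetical values made up for this illustration, not the waist data used in the lecture.

```r
# Hypothetical data (illustrative values only)
set.seed(2)
x <- c(75, 82, 90, 95, 101, 108, 114)
y <- 71 + 0.2 * x + rnorm(length(x), sd = 3)

# Closed-form least-squares estimates
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

# The same estimates from lm()
fit <- lm(y ~ x)
print(c(b0 = b0, b1 = b1))
print(coef(fit))
```

Both printed vectors should agree up to floating-point error, since lm() solves the same least-squares problem.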

> chap9. <- read.csv("e:\\kim\\yes\\myweb\\int\\018\\newlecturenote\\data\\waist.csv", header=T)
> reg <- lm(fat~waist, data=chap9.)
> summary(reg)

Call:
lm(formula = fat ~ waist, data = chap9.)

Residuals:
     Min       1Q   Median       3Q      Max
-15.7975  -4.6357  -0.9774   4.4471   18.168

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  71.6897     1.3350   53.70   <2e-16 ***
waist         0.1942     0.0114   17.04   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.788 on 105 degrees of freedom
Multiple R-squared:  0.7343, Adjusted R-squared:  0.7318
F-statistic: 290.3 on 1 and 105 DF,  p-value: < 2.2e-16

> anova(reg)
Analysis of Variance Table

Response: fat
           Df  Sum Sq Mean Sq F value    Pr(>F)
waist       1 13374.5 13374.5  290.25 < 2.2e-16 ***
Residuals 105  4838.3    46.1
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

9.3 Evaluating the regression equation
β1 = 0: there is no linear association between x and y.
β1 > 0: y increases as x increases (direct linear relationship).
β1 < 0: y decreases as x increases (inverse linear relationship).

Figure 9.3.1 Cases in which H0: β1 = 0 is not rejected
Figure 9.3.2 Cases in which H0: β1 = 0 is rejected

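Total deviation (sums of squares)
Total SS = explained SS + unexplained SS, that is, SST = SSR + SSE.
$$SST = \sum (y_i - \bar{y})^2 = \sum y_i^2 - \frac{(\sum y_i)^2}{n}$$
$$SSR = \sum (\hat{y}_i - \bar{y})^2 = b^2 \left[ \sum x_i^2 - \frac{(\sum x_i)^2}{n} \right]$$
$$SSE = SST - SSR$$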

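Coefficient of determination
The coefficient of determination $r^2$ is the proportion of the total variation that is explained by the regression equation (SSR over SST):
$$r^2 = \frac{SSR}{SST} = \frac{\sum (\hat{y}_i - \bar{y})^2}{\sum (y_i - \bar{y})^2} = \frac{b^2 \left[ \sum x_i^2 - (\sum x_i)^2 / n \right]}{\sum y_i^2 - (\sum y_i)^2 / n}, \qquad 0 \le r^2 \le 1$$
The larger $r^2$ is, the better the regression equation explains the data.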
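The decomposition SST = SSR + SSE and the ratio SSR/SST can be verified numerically. This is a minimal sketch on simulated data; the data-generating values are assumptions for illustration only.

```r
# Hypothetical data (illustrative values only)
set.seed(3)
x <- runif(30, 60, 120)
y <- 71 + 0.2 * x + rnorm(30, sd = 5)

fit  <- lm(y ~ x)
yhat <- fitted(fit)

SST <- sum((y - mean(y))^2)      # total sum of squares
SSR <- sum((yhat - mean(y))^2)   # explained by the regression
SSE <- sum((y - yhat)^2)         # unexplained (residual)

r2 <- SSR / SST
print(c(SST = SST, SSR = SSR, SSE = SSE, r2 = r2))
print(summary(fit)$r.squared)    # should match r2
```

In simple linear regression the same value is also the squared sample correlation, cor(x, y)^2.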

Figure 9.3.4

Figure 9.3.5 (a) r² ≈ 0.99 (b) r² ≈ 0.3 (c) r² = 1 (d) r² ≈ 0

ANOVA table for simple linear regression
Source               SS    df     Mean square        F (V.R.)
Regression (model)   SSR   1      MSR = SSR/1        MSR/MSE
Error                SSE   n-2    MSE = SSE/(n-2)
Total                SST   n-1

Hypothesis test using the F statistic (variance ratio)
Hypotheses: H0: β1 = 0, HA: β1 ≠ 0
Test statistic (F-test): V.R. = MSR/MSE ~ F(1, n-2) under H0
From the anova(reg) table for the waist data: V.R. = 290.25 > F(1, 105; 0.95) = 3.91, so H0 is rejected.

t-test: test using the t statistic
Hypotheses: H0: β1 = 0, HA: β1 ≠ 0
Sampling properties of the estimates a (intercept) and b (slope):
$$E(a) = \beta_0, \qquad Var(a) = \frac{\sigma^2 \sum x_i^2}{n \sum (x_i - \bar{x})^2}$$
$$E(b) = \beta_1, \qquad Var(b) = \frac{\sigma^2}{\sum (x_i - \bar{x})^2}, \qquad s_b = \frac{s_{y|x}}{\sqrt{\sum (x_i - \bar{x})^2}}$$
Test statistic, with hypothesized value $\beta_{1,0} = 0$:
$$t = \frac{b - \beta_{1,0}}{s_b} \sim t(n-2)$$
In the example, the summary(reg) output gives b = 0.1942 with standard error 0.0114, so t = 17.04 > t(105) = 1.9828 and H0 is rejected.

Confidence interval for β1
$$b \pm t_{(1-\alpha/2)} \, \frac{s_{y|x}}{\sqrt{\sum (x_i - \bar{x})^2}}$$
0.1942 ± 1.983(0.0114) = (0.1716, 0.2169)
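The slope test and its confidence interval can be reproduced from the formulas for s_b. The sketch below uses simulated data (an assumption; not the waist data) and compares the hand computation with summary(), confint() and anova().

```r
# Hypothetical data (illustrative values only)
set.seed(4)
n <- 40
x <- runif(n, 60, 120)
y <- 71 + 0.2 * x + rnorm(n, sd = 5)
fit <- lm(y ~ x)

s   <- sqrt(sum(residuals(fit)^2) / (n - 2))   # s_{y|x}
s_b <- s / sqrt(sum((x - mean(x))^2))          # standard error of the slope
b   <- unname(coef(fit)["x"])

t0    <- b / s_b                               # t statistic for H0: beta1 = 0
tcrit <- qt(0.975, df = n - 2)
ci    <- c(b - tcrit * s_b, b + tcrit * s_b)   # 95% CI for beta1

Fstat <- anova(fit)["x", "F value"]
print(c(t = t0, t_squared = t0^2, F = Fstat))
print(ci)
print(confint(fit)["x", ])
```

With a single predictor the F statistic equals the square of the t statistic, which is why the two tests always lead to the same decision.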

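9.4 Using the regression equation (application)
Predicting Y for a given value x_p of X
Prediction interval:
$$\hat{y} \pm t_{(1-\alpha/2)} \, s_{y|x} \sqrt{1 + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{\sum (x_i - \bar{x})^2}}, \qquad \hat{y} = a + b x_p$$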

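9.4 Using the regression equation (continued)
Estimating the mean of Y, E(y | x_p), for a given value x_p of X
Confidence interval:
$$\hat{y} \pm t_{(1-\alpha/2)} \, s_{y|x} \sqrt{\frac{1}{n} + \frac{(x_p - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$$
When x_p = 100: ŷ = 91.114, and the standard error of the estimated mean, $s_{y|x}\sqrt{1/n + (x_p - \bar{x})^2 / \sum (x_i - \bar{x})^2}$, is 0.6566.
95% confidence interval: (89.812, 92.416)
95% prediction interval: (77.59, 104.63)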

Estimating the mean of Y for a given X
Figure panels: (a) 95% confidence interval (b) 95% prediction interval
> newdata = data.frame(waist=100)
> predict(reg, newdata, interval="confidence")
      fit      lwr      upr
1 91.1141 89.81215 92.41605
> predict(reg, newdata, interval="prediction")
      fit      lwr      upr
1 91.1141 77.59164 104.6366
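The two intervals differ only by the extra "1 +" under the square root. The following sketch computes both by hand and compares them with predict(); the data are simulated for illustration and the point x_p = 100 is simply reused from the example above.

```r
# Hypothetical data (illustrative values only)
set.seed(5)
n <- 50
x <- runif(n, 60, 120)
y <- 71 + 0.2 * x + rnorm(n, sd = 5)
fit <- lm(y ~ x)

xp    <- 100
yhat  <- unname(coef(fit)[1] + coef(fit)[2] * xp)
s     <- summary(fit)$sigma                   # s_{y|x}
Sxx   <- sum((x - mean(x))^2)
tcrit <- qt(0.975, df = n - 2)

se_mean <- s * sqrt(1 / n + (xp - mean(x))^2 / Sxx)       # for E(y | xp)
se_pred <- s * sqrt(1 + 1 / n + (xp - mean(x))^2 / Sxx)   # for a new y at xp

ci_manual <- yhat + c(-1, 1) * tcrit * se_mean
pi_manual <- yhat + c(-1, 1) * tcrit * se_pred

new  <- data.frame(x = xp)
ci_r <- predict(fit, new, interval = "confidence")
pi_r <- predict(fit, new, interval = "prediction")
print(rbind(ci_manual, pi_manual))
print(rbind(ci_r, pi_r))
```

The prediction interval is always wider than the confidence interval at the same x_p, because it also accounts for the variability of a single new observation around its mean.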

# R code: fit the model, then draw 95% confidence and prediction bands
chap9. <- read.csv("e:\\kim\\yes\\myweb\\int\\018\\newlecturenote\\data\\waist.csv", header=T)
reg <- lm(fat~waist, data=chap9.)
attach(chap9.)

# Scatter plot, fitted line and 95% confidence band for the mean
plot(waist, fat, xlim=c(0,250), ylim=c(60,120))
abline(reg)
result = data.frame(waist=1:25*10)
p <- as.data.frame(predict(reg, result, level=0.95, interval="confidence"))
lines(cbind(result, p$lwr), lty=2)
lines(cbind(result, p$upr), lty=2)

# New graphics window: same plot with the 95% prediction band
win.graph()
plot(waist, fat, xlim=c(0,250), ylim=c(60,120))
abline(reg)
result = data.frame(waist=1:25*10)
pp <- as.data.frame(predict(reg, result, level=0.95, interval="prediction"))
lines(cbind(result, pp$lwr), lty=2)
lines(cbind(result, pp$upr), lty=2)

PROC IMPORT OUT= WORK.waist
     DATAFILE= "E:\kim\yes\myweb\int\018\newl waist.csv"
     DBMS=CSV REPLACE;
     GETNAMES=YES;
     DATAROW=2;
RUN;
* SAS code ;
proc reg;
  model fat=waist;
  plot fat*waist;
run;

Homework: exercises 1-9 and 10. Work each one by hand once (manual calculation), then repeat it in SAS and R.

9.5 The concept of multiple regression analysis
One dependent variable Y and k independent variables x1, ..., xk.
Y: dependent variable, also called the response variable.
x1, ..., xk: independent variables, also called explanatory variables or predictor variables.

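Multiple regression model
$$y_j = \beta_0 + \beta_1 x_{1j} + \beta_2 x_{2j} + \cdots + \beta_k x_{kj} + \varepsilon_j, \qquad \varepsilon_j \overset{iid}{\sim} N(0, \sigma^2), \quad j = 1, \ldots, n$$
The errors are independently and identically distributed.
Interpreting the regression coefficients, for example with two independent variables:
$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$$
(Y: length of hospital stay, x1: number of previous admissions, x2: age)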

Interpretation of β0
$$E[Y \mid x_1 = x_2 = 0] = \beta_0$$
β0 is the expected value of Y when x1 and x2 are both 0. For this to be meaningful, the variables usually need to be centered.
Interpretation of β1
$$E[y \mid x_1 = a + 1, x_2 = b] - E[y \mid x_1 = a, x_2 = b] = (\beta_0 + \beta_1 (a + 1) + \beta_2 b) - (\beta_0 + \beta_1 a + \beta_2 b) = \beta_1$$
β1 is the increase in the expected value of Y when x1 increases by one unit while x2 stays fixed. In other words, it is the effect of x1 on Y after adjusting (controlling) for the effect of x2.

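9.6 Estimating the multiple regression equation (estimating the regression coefficients)
Normal equations (k = 2):
$$n b_0 + b_1 \sum x_{1j} + b_2 \sum x_{2j} = \sum y_j$$
$$b_0 \sum x_{1j} + b_1 \sum x_{1j}^2 + b_2 \sum x_{1j} x_{2j} = \sum x_{1j} y_j$$
$$b_0 \sum x_{2j} + b_1 \sum x_{1j} x_{2j} + b_2 \sum x_{2j}^2 = \sum x_{2j} y_j$$
The estimates of β0, β1, β2 are the values that minimize
$$L = \sum_j (y_j - \beta_0 - \beta_1 x_{1j} - \beta_2 x_{2j})^2, \qquad \frac{\partial L}{\partial \beta_0} = \frac{\partial L}{\partial \beta_1} = \frac{\partial L}{\partial \beta_2} = 0$$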

Example 9.6.1
chap9r <- read.csv("e:\\kim\\yes\\myweb\\int\\018\\newlecturenote\\data\\cda.csv", header=T)
plot(chap9r)
head(chap9r)
line <- lm(cda~age+ed, data=chap9r)
summary(line)
anova(line)
Fitted equation: $\hat{y}_j = 5.693 - 0.1898 x_{1j} + 0.668 x_{2j}$

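9.7 Evaluating the multiple regression equation (evaluating the regression model)
Multiple coefficient of determination
SST = SSR + SSE: total sum of squares = explained sum of squares + unexplained sum of squares.
$$R^2_{y.12 \ldots k} = \frac{\sum (\hat{y}_j - \bar{y})^2}{\sum (y_j - \bar{y})^2} = \frac{SSR}{SST}$$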

Example 9.7.1
> aa <- anova(line)
> aa[,2]
[1] 206.0542 209.7428 678.2030
> sst=sum(aa[,2])
> sst
[1] 1094
> ssr=sum(aa[1,2],aa[2,2])
> ssr
[1] 415.797
> Rsq=ssr/sst
> Rsq
[1] 0.3800704

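Matrix notation
$$\underset{n \times 1}{y} = \underset{n \times (k+1)}{X} \; \underset{(k+1) \times 1}{\beta} + \underset{n \times 1}{\varepsilon}$$
$$y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad X = \begin{pmatrix} 1 & x_{11} & x_{21} & \cdots & x_{k1} \\ 1 & x_{12} & x_{22} & \cdots & x_{k2} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{1n} & x_{2n} & \cdots & x_{kn} \end{pmatrix}, \quad \beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{pmatrix}, \quad \varepsilon = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}$$
$$L = (y - X\beta)'(y - X\beta) = y'y - 2\beta' X'y + \beta' X'X \beta$$
$$\frac{\partial L}{\partial \beta} = -2X'y + 2X'X\beta = 0 \;\Rightarrow\; (X'X)\beta = X'y \;\Rightarrow\; \hat\beta = (X'X)^{-1} X'y$$

Least-squares estimator: $\hat\beta_{LSE} = (X'X)^{-1} X'Y$, where
$$X'X = \begin{pmatrix} n & \sum x_{1j} & \sum x_{2j} & \cdots & \sum x_{kj} \\ \sum x_{1j} & \sum x_{1j}^2 & \sum x_{1j} x_{2j} & \cdots & \sum x_{1j} x_{kj} \\ \vdots & \vdots & \vdots & & \vdots \\ \sum x_{kj} & \sum x_{1j} x_{kj} & \sum x_{2j} x_{kj} & \cdots & \sum x_{kj}^2 \end{pmatrix}, \qquad X'Y = \begin{pmatrix} \sum y_j \\ \sum x_{1j} y_j \\ \vdots \\ \sum x_{kj} y_j \end{pmatrix}$$
$$\widehat{Var}(\hat\beta) = \hat\sigma^2 (X'X)^{-1}$$

When k = 2:
$$X'X = \begin{pmatrix} n & \sum x_{1j} & \sum x_{2j} \\ \sum x_{1j} & \sum x_{1j}^2 & \sum x_{1j} x_{2j} \\ \sum x_{2j} & \sum x_{1j} x_{2j} & \sum x_{2j}^2 \end{pmatrix}$$
$$\begin{pmatrix} var(b_0) & cov(b_0, b_1) & cov(b_0, b_2) \\ cov(b_0, b_1) & var(b_1) & cov(b_1, b_2) \\ cov(b_0, b_2) & cov(b_1, b_2) & var(b_2) \end{pmatrix} = \hat\sigma^2 (X'X)^{-1} = \hat\sigma^2 \begin{pmatrix} C_{00} & C_{01} & C_{02} \\ C_{01} & C_{11} & C_{12} \\ C_{02} & C_{12} & C_{22} \end{pmatrix}$$
Here $C = (X'X)^{-1}$, so that
$$(X'X) \begin{pmatrix} C_{00} & C_{01} & C_{02} \\ C_{01} & C_{11} & C_{12} \\ C_{02} & C_{12} & C_{22} \end{pmatrix} = I = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}$$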
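The matrix formulas above can be evaluated directly in R. This is a minimal sketch with k = 2 on simulated data; the variable names and parameter values are assumptions for illustration, and the residual variance is estimated by MSE = SSE/(n - k - 1).

```r
# Hypothetical data with two predictors (illustrative values only)
set.seed(6)
n  <- 30
x1 <- rnorm(n, mean = 70, sd = 8)
x2 <- rnorm(n, mean = 12, sd = 3)
y  <- 5 - 0.2 * x1 + 0.6 * x2 + rnorm(n, sd = 3)

X   <- cbind(1, x1, x2)          # n x (k+1) design matrix
XtX <- crossprod(X)              # X'X
C   <- solve(XtX)                # C = (X'X)^{-1}
b   <- C %*% crossprod(X, y)     # (X'X)^{-1} X'y

s2 <- sum((y - X %*% b)^2) / (n - 2 - 1)   # MSE with k = 2
V  <- s2 * C                               # estimated Var(b)

fit <- lm(y ~ x1 + x2)
print(cbind(matrix_form = as.vector(b), lm = coef(fit)))
print(sqrt(diag(V)))             # standard errors s_{b_i} = s * sqrt(C_ii)
```

In practice lm() uses a QR decomposition rather than inverting X'X, which is numerically more stable, but the results agree for well-conditioned data.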

ANOVA table
Source               SS    df       Mean square          F (V.R.)
Regression (model)   SSR   k        MSR = SSR/k          MSR/MSE
Residual (error)     SSE   n-k-1    MSE = SSE/(n-k-1)
Total                SST   n-1
Hypotheses: H0: β1 = β2 = ... = βk = 0, HA: not H0
If V.R. ≥ F(k, n-k-1; 1-α), reject H0.
Each estimate satisfies $b_i \sim N(\beta_i, c_{ii}\sigma^2)$, which is the basis for testing an individual βi.

Testing an individual coefficient
Hypotheses: H0: βi = 0, HA: βi ≠ 0
Test statistic: $t = \dfrac{b_i - 0}{s_{b_i}}$
Standard error: $s_{b_i} = s\sqrt{C_{ii}}$
If $|t| > t_{1-\alpha/2}(n - k - 1)$, reject H0.

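9.8 Using the multiple regression equation (application)
Estimating the mean of Y for given X: confidence interval for the mean of the subpopulation of Y values at specified values of the X_i
$$\hat{y}_j \pm t_{1-\alpha/2,\, df = n-k-1} \; s_{\hat{y}_j}$$
Predicting Y for given X: prediction interval for a single Y value at specified values of the X_i
$$\hat{y}_j \pm t_{1-\alpha/2,\, df = n-k-1} \; s'_{\hat{y}_j}$$
Here $s_{\hat{y}_j}$ is the standard error of the estimated mean and $s'_{\hat{y}_j}$ is the (larger) standard error of the prediction.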

Example 9.8.1
new = data.frame(age=68, ed=12)
predict(line, new, level=0.95, interval="confidence")
predict(line, new, level=0.95, interval="prediction")

9.9 Violations of the regression assumptions (checking the assumptions of the regression model)
Non-normality (errors not normally distributed)
Heteroscedasticity (unequal variances)
Correlation among the independent variables (collinearity)

9.10 Qualitative independent variables
Types of variable:
Quantitative (continuous): e.g. test score, age
Qualitative (categorical): e.g. sex, race, occupation
A qualitative variable enters the model through dummy variables, which take only the values 0 and 1.
A qualitative variable with k categories requires k-1 dummy variables.

Examples of dummy variables
* Sex:
  x1 = 1 if male, 0 if female
* Residential area (urban, rural, suburban):
  x2 = 1 if urban, 0 otherwise
  x3 = 1 if rural, 0 otherwise
* Smoking status (current smoker, ex-smoker who quit within the last 5 years, ex-smoker who quit more than 5 years ago, non-smoker):
  x4 = 1 if current smoker, 0 otherwise
  x5 = 1 if ex-smoker (<= 5 years), 0 otherwise
  x6 = 1 if ex-smoker (> 5 years), 0 otherwise
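In R, the k-1 dummy variables are created automatically when a factor appears in a model formula. The sketch below shows the coding for the residential-area example; the five observations are hypothetical, and "suburban" is set as the reference category to match the coding above.

```r
# Hypothetical observations of a 3-category variable
area <- factor(c("urban", "rural", "suburban", "urban", "rural"),
               levels = c("suburban", "urban", "rural"))  # first level = reference

# model.matrix() shows the design matrix that lm() would build:
# an intercept column plus k - 1 = 2 dummy columns.
X <- model.matrix(~ area)
print(X)
```

The columns areaurban and arearural play the roles of x2 and x3; a suburban observation has 0 in both.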

Example 9.10.1
Data columns: case number, birth weight (grams), gestation (weeks), smoking status of the mother

Y: birth weight (grams)
x1: length of gestation (weeks)
x2: smoking status of the mother (1 = smoker, 0 = nonsmoker)
Model 1: same slope, different intercepts
$$E(Y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$$
$$E(Y) = \beta_0 + \beta_1 x_1 \quad \text{for nonsmokers}$$
$$E(Y) = (\beta_0 + \beta_2) + \beta_1 x_1 \quad \text{for smokers}$$

data <- read.csv("d:\\kim\\yes\\myweb\\int\\018\\newlecturenote\\data\\태아몸무게.csv", header=T)
reg <- lm(gram~weeks+factor(smoke), data=data)
summary(reg)

Call:
lm(formula = gram ~ weeks + factor(smoke), data = data)

Residuals:
    Min      1Q  Median      3Q     Max
-160.45  -65.80     9.6  311.53 1016.59

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)     -1724.42     558.84  -3.086  0.00265 **
weeks             130.05      14.52   8.957 2.39e-14 ***
factor(smoke)1   -294.40     135.78  -2.168  0.03260 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 484.6 on 97 degrees of freedom
Multiple R-squared:  0.4636, Adjusted R-squared:  0.4525
F-statistic: 41.92 on 2 and 97 DF,  p-value: 7.594e-14

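* Interpretation of β2:
$$\beta_2 = E(Y \mid X_1 = x_1, \text{smoker}) - E(Y \mid X_1 = x_1, \text{nonsmoker})$$
For a given value x1, that is, for the same length of gestation, β2 is the expected difference in birth weight between babies of mothers who smoke and babies of mothers who do not.
Fitted equation from the R output:
$$\hat{y}_j = \hat\beta_0 + \hat\beta_1 x_{1j} + \hat\beta_2 x_{2j} = -1724.4 + 130.05 x_{1j} - 294.40 x_{2j}$$
$$\hat{y}_j = -1724.4 + 130.05 x_{1j} - 294.40(1) = -2018.8 + 130.05 x_{1j} \quad \text{for smokers}$$
$$\hat{y}_j = -1724.4 + 130.05 x_{1j} - 294.40(0) = -1724.4 + 130.05 x_{1j} \quad \text{for nonsmokers}$$
* Test of H0: β2 = 0 using T = b2 / se(b2): H0 is rejected, so the two groups are significantly different.
* 95% confidence interval (CI) for β2: b2 ± t s_{b2}
Summary values shown on the slide with this example: estimated difference of about -245 grams, T = -5.83 compared with the critical value 2.045, and 95% CI (-330.3975, -158.6825).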

Figure: fitted regression lines for nonsmokers and smokers

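Model 2: different slopes, different intercepts
$$E(Y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2$$
$$E(Y) = \beta_0 + \beta_1 x_1 \quad \text{for nonsmokers}$$
$$E(Y) = (\beta_0 + \beta_2) + (\beta_1 + \beta_3) x_1 \quad \text{for smokers}$$
If β3 is significant, the slopes differ between smokers and nonsmokers.
If β2 is significant, the intercepts differ; without centering this is not an important comparison.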

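Figure (model 2): birth weight against length of gestation, showing the nonsmoker line (intercept β0, slope β1) and the smoker line (intercept β0 + β2, slope β1 + β3), with 38 weeks marked on the gestation axis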

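Centering
Replace x1 by x1 - 38 (weeks):
$$E(Y) = \beta_0 + \beta_1 (x_1 - 38) + \beta_2 x_2 + \beta_3 (x_1 - 38) x_2$$
$$E(Y) = \beta_0 + \beta_1 (x_1 - 38) \quad \text{for nonsmokers}$$
$$E(Y) = (\beta_0 + \beta_2) + (\beta_1 + \beta_3)(x_1 - 38) \quad \text{for smokers}$$
β0 is now the expected value of Y for nonsmokers at x1 = 38, E(Y | x1 = 38), which is a meaningful parameter of interest.
β2 is no longer the difference in expected values at x1 = 0 but the difference between smokers and nonsmokers at x1 = 38: E(Y | x1 = 38, smoker) - E(Y | x1 = 38, nonsmoker).
* Lesson: centering a continuous variable makes the intercept the expected value at a chosen value of x rather than at x = 0, so it becomes more meaningful.
* Another effect of centering: it weakens the multicollinearity among the x terms.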
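The effect of centering can be seen by fitting the interaction model with and without it. The sketch below uses simulated birth-weight-like data; all values and the variable names weeks, smoke, gram and wc are assumptions for illustration, not the lecture data.

```r
# Hypothetical data (illustrative values only)
set.seed(9)
n     <- 100
weeks <- round(runif(n, 34, 42))
smoke <- rbinom(n, 1, 0.4)                 # 1 = smoker, 0 = nonsmoker
gram  <- -1700 + 130 * weeks - 300 * smoke + rnorm(n, sd = 450)

raw <- lm(gram ~ weeks * smoke)            # uncentered interaction model
wc  <- weeks - 38                          # gestation centered at 38 weeks
cen <- lm(gram ~ wc * smoke)               # centered interaction model

# Predictions of the uncentered model at 38 weeks, nonsmoker then smoker
p38 <- predict(raw, data.frame(weeks = 38, smoke = c(0, 1)))

print(coef(raw))
print(coef(cen))
print(p38)

# Collinearity between the dummy and the interaction term, before and after
print(c(raw = cor(smoke, weeks * smoke), centered = cor(smoke, wc * smoke)))
```

The centered intercept equals the nonsmoker prediction at 38 weeks, and the centered smoke coefficient equals the smoker minus nonsmoker difference at 38 weeks; the slopes are unchanged.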

Example 9.10.2
effectiveness  age  treatment    effectiveness  age  treatment
56             21   A            65             43   A
41             23   B            55             45   B
40             30   B            57             48   B
28             19   C            59             47   C
55             28   A            64             48   A
25             23   C            61             53   A
46             33   B            62             58   B
71             67   C            36             29   C
48             42   B            69             53   A
63             33   A            47             29   B
52             33   A            73             58   A
62             56   C            64             66   B
50             45   C            60             67   B
45             43   B            62             63   A
58             38   A            71             59   C
46             37   C            62             51   C
58             43   B            70             67   A
34             27   C            71             63   C

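Model for Example 9.10.2
* Y: treatment effectiveness
* X1: age (quantitative)
* Treatment method (qualitative), coded with two dummy variables:
  X2 = 1 if treatment = B, 0 otherwise
  X3 = 1 if treatment = C, 0 otherwise
$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_1 x_2 + \beta_5 x_1 x_3 + \varepsilon$$
$$E(Y) = \beta_0 + \beta_1 x_1 \quad \text{for treatment A}$$
$$E(Y) = (\beta_0 + \beta_2) + (\beta_1 + \beta_4) x_1 \quad \text{for treatment B}$$
$$E(Y) = (\beta_0 + \beta_3) + (\beta_1 + \beta_5) x_1 \quad \text{for treatment C}$$

β0, β1: intercept and slope for the reference cell A
β2: difference of intercepts (B - A); is it 0?
β3: difference of intercepts (C - A); is it 0?
β4: difference of slopes (B - A); is it 0?
β5: difference of slopes (C - A); is it 0?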
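The coefficient interpretations above can be confirmed by comparing the dummy-variable model with a separate regression for one treatment group. This is a sketch on simulated data; the values and the names age, trt and effect are assumptions for illustration and do not reproduce the example data.

```r
# Hypothetical data with three treatment groups (illustrative values only)
set.seed(10)
n      <- 36
age    <- round(runif(n, 19, 67))
trt    <- factor(rep(c("A", "B", "C"), length.out = n))  # reference level "A"
effect <- 45 + 0.3 * age + rnorm(n, sd = 4)

# Explicit dummy variables and interaction terms
x2  <- as.numeric(trt == "B")
x3  <- as.numeric(trt == "C")
fit <- lm(effect ~ age + x2 + x3 + age:x2 + age:x3)
b   <- unname(coef(fit))   # order: b0, b1, b2, b3, b4, b5

# Line for treatment B implied by the model: (b0 + b2) + (b1 + b4) * age
lineB <- c(b[1] + b[3], b[2] + b[5])

# Separate simple regression using only treatment B
sepB <- unname(coef(lm(effect ~ age, subset = trt == "B")))

# The factor formulation gives the same coefficients
fit2 <- lm(effect ~ age * trt)

print(rbind(lineB, sepB))
print(coef(fit2))
```

Because the model has a separate intercept and slope for every group, its fitted lines coincide with the three separate regressions; what the single model adds is a common error variance and direct tests of the differences β2 to β5.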

Example 9.10.2 - SAS
* File: mreg018.sas ;
data reg;
  input effect age method $ @@;   * @@ reads several observations per line ;
  x1=age; x2=(method='B'); x3=(method='C');
  x12=x1*x2; x13=x1*x3;
cards;
56 21 A 41 23 B 40 30 B 28 19 C 55 28 A 25 23 C
46 33 B 71 67 C 48 42 B 63 33 A 52 33 A 62 56 C
50 45 C 45 43 B 58 38 A 46 37 C 58 43 B 34 27 C
65 43 A 55 45 B 57 48 B 59 47 C 64 48 A 61 53 A
62 58 B 36 29 C 69 53 A 47 29 B 73 58 A 64 66 B
60 67 B 62 63 A 71 59 C 62 51 C 70 67 A 71 63 C
;
run;
* Regression with explicit dummy variables and interactions ;
proc reg;
  model effect=x1 x2 x3 x12 x13;
  output out=d p=pred;
  id age method;
run;
proc sort; by method;
* Scatter plot with one regression line per treatment ;
proc gplot;
  plot effect*age=method / legend;
  symbol1 v='a' i=r c=c l=1;
  symbol2 v='b' i=r c=c l=2;
  symbol3 v='c' i=r c=c l=3;
run;
* Same model with CLASS; the default reference is the last level ;
proc glm;
  class method;
  model effect=age method age*method / solution;
run;
* Same model with treatment A as the reference level ;
proc glm;
  class method(reference='A');
  model effect=age method age*method / solution;
run;

9.11 Variable (model) selection procedures
Forward selection
Backward elimination
Stepwise selection
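The three procedures can be tried in R with step(). Note that this is a sketch under stated assumptions: step() adds and drops terms by AIC rather than by the F-test entry and stay thresholds that SAS PROC REG uses, so the two programs will not always select the same model. The data are simulated, with only x1 and x2 truly related to y.

```r
# Hypothetical data: y depends on x1 and x2 only (illustrative values)
set.seed(11)
n <- 60
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
d$y <- 3 + 2 * d$x1 - 1.5 * d$x2 + rnorm(n)

null <- lm(y ~ 1, data = d)                   # intercept-only model
full <- lm(y ~ x1 + x2 + x3 + x4, data = d)   # model with all candidates

# Forward selection: start from the null model and add terms
fwd <- step(null, scope = formula(full), direction = "forward", trace = 0)

# Backward elimination: start from the full model and drop terms
bwd <- step(full, direction = "backward", trace = 0)

# Stepwise selection: terms may be added or dropped at each step
both <- step(null, scope = formula(full), direction = "both", trace = 0)

print(formula(fwd))
print(formula(bwd))
print(formula(both))
```

Setting trace = 1 prints the AIC table at each step, which is useful for seeing the order in which variables enter or leave.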

Mod18.sas
/* file : mod18.sas  Multiple Regression Model with stepwise selection */
Filename electric 'd:\myweb\int\electric.dat';
data peak;
  infile electric;
  input housize 1-3 income 6-11 aircapac 14-16 applindx 19-23 family 26-28 peak 31-35;
  label housize  = 'House Size'
        income   = 'Family Income'
        aircapac = 'Air Conditioning Capacity'
        applindx = 'Appliance Index'
        family   = 'Number of Family Members'
        peak     = 'Peak Hour Electric Load';
run;
/* Stepwise selection */
proc reg data=peak;
  model peak = housize income aircapac applindx family / selection=stepwise;
  title 'Multiple Regression Model with stepwise selection';
run;
/* Best subsets by R-square, with Cp, adjusted R-square and MSE */
proc reg data=peak outest=est;
  model peak = housize income aircapac applindx family
        / selection=rsquare cp adjrsq mse best=2;
  title 'Multiple Regression Model with stepwise selection';
run;
proc print;
  title 'Actual Coefficients, etc.';
/* Cp against the number of variables in the model */
proc plot;
  plot _cp_*_in_='C' _p_*_in_='*' / overlay vaxis=0 to 25 by 5 haxis=1 to 5 hpos=40 vpos=30;
  title;
run;