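Evaluation of Diagnostic Tests
- Bayes' rule
- Sensitivity, specificity, predictive values
- ROC curve
- Assessing agreement: Kappa, ICC
- Paired data analysis: McNemar test
2012. 10. 31
Yonsei University College of Medicine, Kang Dae-yong (강대용)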
Validity of a test (accuracy): Validity, Accuracy
Reliability of a test (precision): Reliability, Precision, Reproducibility, Stability
Indices of Validity and Reliability
for Validity:
- Sensitivity, Specificity
- Predictive values
- Likelihood ratios
- ROC curve: area under the ROC curve, C-statistics (a measure of discrimination)
for Reliability:
- (weighted) Kappa statistic
- Scatter plot
- Correlation
- Intraclass Correlation Coefficient (ICC)
- Coefficient of Variation (CV)
- Bland-Altman plot
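Likelihood ratios
With a and b the numbers of diseased subjects who test positive and negative, and c and d the numbers of non-diseased subjects who test positive and negative:
\[ \mathrm{LR}^{+} = \frac{a/(a+b)}{c/(c+d)}, \qquad \mathrm{LR}^{-} = \frac{b/(a+b)}{d/(c+d)} \]
The further LR+ lies above 1 the better, and the closer LR- is to 0 the better. If the LR+ is high and the LR- is low, then the test is good.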
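The ratios above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name `diagnostic_indices`, the cell layout (a, b = diseased with positive/negative test; c, d = non-diseased with positive/negative test), and the example counts are assumptions made for the sketch.

```python
def diagnostic_indices(a, b, c, d):
    """Sensitivity, specificity and likelihood ratios from a 2x2 table.

    Assumed layout (hypothetical, chosen for this sketch):
      a = diseased, test positive      b = diseased, test negative
      c = non-diseased, test positive  d = non-diseased, test negative
    """
    se = a / (a + b)              # sensitivity  P(T+ | D+)
    sp = d / (c + d)              # specificity  P(T- | D-)
    fpr = c / (c + d)             # 1 - specificity
    fnr = b / (a + b)             # 1 - sensitivity
    lr_pos = se / fpr if fpr > 0 else float("inf")
    lr_neg = fnr / sp if sp > 0 else float("inf")
    return {"sensitivity": se, "specificity": sp,
            "LR+": lr_pos, "LR-": lr_neg}


# Hypothetical counts, for illustration only.
result = diagnostic_indices(a=90, b=10, c=20, d=80)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

A test with LR+ far above 1 and LR- near 0 shifts the post-test probability strongly in either direction, which is the sense in which the slide calls it a good test.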
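Sensitive test, P(T+ | D+)
How sensitive is the test? A highly sensitive test is almost always positive when the disease is present, so it is most valuable when the result is negative.
- Screening tests
- Fatal diseases whose prognosis becomes much worse if an early diagnosis is missed
- Serious diseases for which a treatment exists
- Very important when diagnosing rare diseases

Specific test, P(T- | D-)
A highly specific test almost never gives a positive result when the disease is absent, so it is most valuable when the result is positive.
- Confirmatory tests
- Important when treating the disease is very expensive
- When a false positive (FP) causes the patient major physical, psychological, or financial harm
- Example: a biopsy must be performed before starting a treatment such as cancer surgery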
Indices for evaluating the usefulness of a diagnostic tool
Chinese Mini-Mental Status Test (CMMS): 114 items, score 0-30
Comparing two sensitivities (Se): Chi-square test, Fisher's exact test
Two independent groups, each assessed with one diagnostic test.

Diagnostic Test #1   Disease   No Disease
Positive             82        30
Negative             18        70

Diagnostic Test #2   Disease   No Disease
Positive             140       10
Negative             60        90

Diagnostic Test #1 has a higher sensitivity than Diagnostic Test #2 (82/100 = 0.82 vs. 140/200 = 0.70, p-value = 0.026).
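As a check on the p-value quoted for the two independent groups, the Pearson chi-square test can be sketched with the standard library alone. The function name `chi2_two_proportions` is hypothetical, and the sketch assumes the uncorrected Pearson statistic (no continuity correction); for 1 degree of freedom the upper tail probability equals erfc(sqrt(chi2/2)).

```python
import math


def chi2_two_proportions(x1, n1, x2, n2):
    """Pearson chi-square test (1 df, no continuity correction)
    comparing two independent proportions x1/n1 and x2/n2."""
    observed = [[x1, n1 - x1], [x2, n2 - x2]]
    n = n1 + n2
    row_totals = [n1, n2]
    col_totals = [x1 + x2, n - x1 - x2]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    # Upper tail of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Diseased subjects only: 82 of 100 positive on Test #1, 140 of 200 on Test #2.
chi2, p = chi2_two_proportions(82, 100, 140, 200)
print(f"chi-square = {chi2:.3f}, p = {p:.4f}")
```

With small expected counts Fisher's exact test would be preferred; this sketch does not implement it.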
Comparing two sensitivities (Se): McNemar's test, Stuart-Maxwell test
Both diagnostic tests are performed on each patient.

                       Diagnostic Test #2
Diagnostic Test #1     Positive   Negative
Positive               30         35
Negative               23         12

The sensitivities of Diagnostic Test #1 and Diagnostic Test #2 do not differ significantly (p-value = 0.148).
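McNemar test example: depressive disorder, improved (+) or not improved (-) under HD and LD

           HD +   HD -   Total
LD +       8      40     48
LD -       22     30     52
Total      30     70     100

Let b = 40 and c = 22 be the discordant cells.
1. If (b + c) ≥ 10:
\[ \chi^2 = \frac{(b-c)^2}{b+c} \sim \chi^2_{1\,\mathrm{df}} \]
2. If this condition is not met:
\[ \chi^2 = \frac{(|b-c|-1)^2}{b+c} \sim \chi^2_{1\,\mathrm{df}} \]
Here
\[ \chi^2 = \frac{(40-22)^2}{40+22} = 5.23 > \chi^2_{1\,\mathrm{df},\,5\%} = 3.841 \]
Reject H_0, accept H_1. That is, at the 5% significance level we can state that LD medication has a statistically better treatment effect than HD medication.
LD, P(improve) = 48% vs. HD, P(improve) = 30%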
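The McNemar statistic depends only on the two discordant cells, so it is easy to sketch. The function name `mcnemar` and the `continuity` flag are hypothetical names introduced here; the p-value uses the 1-df chi-square tail, erfc(sqrt(chi2/2)).

```python
import math


def mcnemar(b, c, continuity=False):
    """McNemar chi-square test (1 df) from the discordant counts b and c.

    continuity=True applies the (|b - c| - 1)^2 correction."""
    diff = abs(b - c)
    if continuity:
        diff = max(diff - 1, 0)
    chi2 = diff ** 2 / (b + c)
    p_value = math.erfc(math.sqrt(chi2 / 2))  # upper tail, 1 df
    return chi2, p_value


# Depressive disorder example: b = 40 (LD+, HD-), c = 22 (LD-, HD+).
chi2, p = mcnemar(40, 22)
print(f"uncorrected: chi-square = {chi2:.2f}, p = {p:.4f}")

# Paired diagnostic tests: b = 35, c = 23, with continuity correction.
chi2_cc, p_cc = mcnemar(35, 23, continuity=True)
print(f"corrected:   chi-square = {chi2_cc:.2f}, p = {p_cc:.4f}")
```

The corrected version applied to the paired diagnostic-test table (35 and 23 discordant pairs) gives a p-value close to the 0.148 reported on that slide, which may instead come from an exact binomial calculation.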
SPSS: the frequency variable is selected under Weight Cases.
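Bayes Theorem (Reverend Thomas Bayes, England, 1702-1761)
A theory for computing the probability that an event occurs on the basis of prior information; equivalently, a theory for computing the probability of a cause given the observed result.
\[ P(B_k \mid A) = \frac{P(B_k \cap A)}{P(A)} = \frac{P(A \mid B_k)\,P(B_k)}{\sum_{i=1}^{m} P(A \mid B_i)\,P(B_i)}, \qquad k = 1, 2, \ldots, m \]
Diagram: event A overlapping a partition B_1, B_2, B_3, B_4, B_5 of the sample space. For this partition,
\[ P(B_1 \mid A) = \frac{P(B_1 \cap A)}{P(A)} = \frac{P(B_1 \cap A)}{P(B_1 \cap A) + P(B_2 \cap A) + \cdots + P(B_5 \cap A)} = \frac{P(A \mid B_1)\,P(B_1)}{\sum_{k=1}^{5} P(A \mid B_k)\,P(B_k)} \]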
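PPV / NPV
Bayes' rule:
\[ P(D^{+} \mid T^{+}) = \frac{P(T^{+} \mid D^{+})\,P(D^{+})}{P(T^{+} \mid D^{+})\,P(D^{+}) + P(T^{+} \mid D^{-})\,P(D^{-})} = \frac{Se \times \text{prevalence}}{Se \times \text{prevalence} + (1 - Sp)(1 - \text{prevalence})} \]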
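A short sketch shows how strongly the predictive values depend on prevalence. The PPV line follows the Bayes' rule expression above; the NPV line is the analogous application of Bayes' rule to a negative result, which is added here as an assumption since only the PPV form appears on the slide. The function name and the example inputs are hypothetical.

```python
def predictive_values(se, sp, prevalence):
    """Positive and negative predictive values via Bayes' rule."""
    ppv = (se * prevalence) / (se * prevalence + (1 - sp) * (1 - prevalence))
    # NPV: same rule applied to a negative test result (not on the slide).
    npv = (sp * (1 - prevalence)) / (sp * (1 - prevalence) + (1 - se) * prevalence)
    return ppv, npv


# Hypothetical test with Se = Sp = 0.90 at several prevalences.
for prev in (0.01, 0.10, 0.50):
    ppv, npv = predictive_values(0.90, 0.90, prev)
    print(f"prevalence = {prev:.2f}: PPV = {ppv:.3f}, NPV = {npv:.3f}")
```

Even a test with 90% sensitivity and specificity has a PPV of only 50% when the prevalence is 10%, because false positives from the large non-diseased group match the true positives in number.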
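AUC, ROC Curve
The ROC curve originated in Signal Detection Theory, which was developed during World War II to analyze radar images. The purpose was to decide whether a blip on the radar was a friendly unit, an enemy, or mere noise, and this discriminating ability of radar receiver operators was called the Receiver Operating Characteristic (ROC). It has been applied to medical diagnosis since the 1970s.
Figure: Receiver Operating Characteristic (ROC) curves of serum aminotransferase concentration (GOT, GPT) for identification of people at risk of death from liver diseases.
Source: Kim HC et al., BMJ 2004; 328: 983-987.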
Determining the cut-off point
American Journal of Roentgenology (AJR) 2010; 194(1)
Indications for fine needle aspiration biopsy of thyroid nodules seen on ultrasound: comparison of three existing guidelines
Study population
FNAB: fine needle aspiration biopsy, OP: thyroidectomy
SRU: Society of Radiologists in Ultrasound
AACE: American Association of Clinical Endocrinologists
Reference
1. Ahn SS, Kim EK, Kang DR, Lim SK, Kwak JY, Kim MJ. Biopsy of thyroid nodules: comparison of three sets of guidelines. AJR 2010; 194:31-37.
Data structure for comparing two ROC curves
Comparing two ROC curves
\[ H_0: AUC_1 = AUC_2 \quad \text{vs.} \quad H_1: AUC_1 \neq AUC_2 \]
Methods: Hanley and McNeil; DeLong (based on the Mann-Whitney statistic)
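The Mann-Whitney connection can be made concrete: the empirical AUC is the proportion of (diseased, non-diseased) pairs in which the diseased subject has the higher marker value, with ties counted as one half. The sketch below computes only this AUC; it does not implement the Hanley-McNeil or DeLong variance estimates needed for the actual test. The function name and the marker values are hypothetical.

```python
def auc_mann_whitney(diseased, non_diseased):
    """Empirical AUC as the Mann-Whitney statistic:
    P(X_diseased > X_non_diseased) + 0.5 * P(tie)."""
    score = 0.0
    for x in diseased:
        for y in non_diseased:
            if x > y:
                score += 1.0
            elif x == y:
                score += 0.5
    return score / (len(diseased) * len(non_diseased))


# Hypothetical marker values, for illustration only.
marker_diseased = [3, 5, 6, 8]
marker_healthy = [1, 2, 4, 5]
print(f"AUC = {auc_mann_whitney(marker_diseased, marker_healthy):.5f}")
```

When both markers are measured on the same subjects, the two AUC estimates are correlated, which is why a dedicated method such as DeLong's is needed rather than a simple two-sample comparison.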
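Analysis of intra-method reliability studies
Intraclass correlation coefficient (ICC):
\[ R = \frac{\sigma_S^2}{\sigma_X^2} = \frac{\sigma_S^2}{\sigma_S^2 + \sigma_F^2} \]
Data layout: n subjects (1, 2, 3, ..., n-1, n) each rated by k judges.
Example: k psychiatrists rate the severity of depression in n patients with depression. How reliable are the ratings?
Limitation of Pearson's correlation: ρ(y = x) = ρ(y = x + 2)

A   24   20   27   29   17
B   22   21   25   27   18

Agreement can be gauged through association, but a high association can be obtained even when the measurements do not agree at all. This motivates the intraclass correlation coefficient R.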
Example of ANOVA for a test-retest study of % energy from fat estimated from a FFQ

Variable                         N     Mean (% kcal)   SD
% energy at baseline (X_1)       110   37.5            5.96
% energy at 6 months (X_2)       110   35.5            6.53

Intraclass correlation for a simple replication study
Model: X_{nk} = S_n + F_{nk}
\[ R_1 = \frac{\sigma_S^2}{\sigma_X^2} = \frac{\sigma_S^2}{\sigma_S^2 + \sigma_F^2} \]

One-way ANOVA
Source of variance                  SS        df    MS
Between subjects                    7015.89   109   BMS = 64.37
Within subjects (+ random error)    1739.52   110   WMS = 15.81
Total                               8755.41   219   39.98

\[ R_1 = \frac{(BMS - WMS)/k}{(BMS - WMS)/k + WMS} = \frac{BMS - WMS}{BMS + (k-1)\,WMS} = \frac{64.37 - 15.81}{64.37 + (2-1) \times 15.81} = 0.6056 \]
Range of R_1:
- R_1 = 1 when WMS = 0
- 0 < R_1 < 1 when BMS > WMS
- R_1 = 0 when BMS = WMS
- R_1 = -1/(k-1) when BMS = 0
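The one-way intraclass correlation is a direct function of the two mean squares, so the slide's arithmetic can be reproduced in a few lines. The function name `icc_oneway` is hypothetical; the inputs are the rounded mean squares from the one-way ANOVA table, with k = 2 measurements per subject.

```python
def icc_oneway(bms, wms, k):
    """One-way intraclass correlation R1 from the between-subjects mean
    square (BMS), within-subjects mean square (WMS) and k replicates."""
    return (bms - wms) / (bms + (k - 1) * wms)


# FFQ test-retest example: BMS = 64.37, WMS = 15.81, k = 2.
r1 = icc_oneway(64.37, 15.81, k=2)
print(f"R1 = {r1:.4f}")

# Boundary cases described on the slide.
print(icc_oneway(10.0, 0.0, k=2))    # WMS = 0   -> perfect reliability
print(icc_oneway(10.0, 10.0, k=2))   # BMS = WMS -> no reliability
```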
Intraclass correlation for subjects by measures (two-way) design
Model: X_{nk} = S_n + m_k + F_{nk}
- m_k a fixed effect of measure k: R_2
- m_k a random effect of measure k: R_3

Two-way ANOVA
Source of variance    SS        df    MS
Between subjects      7015.89   109   SMS = 64.37
Between measures      233.81    1     MMS = 233.81
Random error          1505.71   109   EMS = 13.81
Total                 8755.41   219   39.98

Here SMS is the between-subjects mean square (BMS in the one-way table).
\[ R_2 = \frac{n\,(SMS - EMS)}{n\,SMS + (k-1)\,MMS + (n-1)(k-1)\,EMS} \]
\[ R_3 = \frac{n\,(SMS - EMS)}{n\,SMS + k\,MMS + (nk - n - k)\,EMS} \]
Intraclass correlation which excludes the mean differences between measures:
\[ R_4 = \frac{SMS - EMS}{SMS + (k-1)\,EMS} \]
- Adjusts for interviewer effects
- Accounts for systematic differences between the two surveys, such as learning effects
- If the k measures X_1, X_2, ..., X_k have identical distributions, R_4 = Pearson correlation coefficient

Concordance and R values
Concordance          R
Very good            > 0.91
Good                 0.90 ~ 0.71
Moderate             0.70 ~ 0.51
Mediocre             0.50 ~ 0.31
Very poor or none    < 0.30
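Cohen's κ for binary or nominal variables

                 Measure II
Measure I    1      2      ...   k      Total
1            p_11   p_12   ...   p_1k   r_1
2            p_21   p_22   ...   p_2k   r_2
...          ...    ...    ...   ...    ...
k            p_k1   p_k2   ...   p_kk   r_k
Total        s_1    s_2    ...   s_k    1

\[ \hat{\kappa} = \frac{P_o - P_e}{1 - P_e}, \qquad P_o = \sum_{i=1}^{k} p_{ii}, \qquad P_e = \sum_{i=1}^{k} r_i s_i \]

Kappa value     Degree of agreement
> 0.81          Very good
0.80 ~ 0.61     Good
0.60 ~ 0.41     Fairly good
0.40 ~ 0.21     Fair
0.20 ~ 0        Poor
< 0             Very poor

Range: -P_e/(1 - P_e) ≤ κ ≤ 1
- κ_max = 1 when P_o = 1
- κ > 0 when P_o > P_e
- κ = 0 when P_o = P_e (independence)
- κ < 0 when P_o < P_e
- κ_min = -P_e/(1 - P_e) when P_o = 0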
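A minimal sketch of the kappa calculation from a square table of counts follows. The function name `cohen_kappa` and the example table are hypothetical; the code converts counts to proportions and then applies the P_o and P_e definitions given above.

```python
def cohen_kappa(counts):
    """Cohen's kappa from a k x k table of counts
    (rows: Measure I, columns: Measure II)."""
    k = len(counts)
    n = sum(sum(row) for row in counts)
    row_prop = [sum(row) / n for row in counts]                        # r_i
    col_prop = [sum(counts[i][j] for i in range(k)) / n for j in range(k)]  # s_j
    p_observed = sum(counts[i][i] for i in range(k)) / n               # P_o
    p_expected = sum(r * s for r, s in zip(row_prop, col_prop))        # P_e
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical 2x2 agreement table, for illustration only.
table = [[40, 10],
         [5, 45]]
print(f"kappa = {cohen_kappa(table):.3f}")
```

In this example the raters agree on 85% of subjects, but half of that agreement is expected by chance alone, so kappa is considerably lower than the raw agreement.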
Weighted Kappa for Agreement
Weighted κ for ordered categorical variables
\[ \hat{\kappa}_w = \frac{P_o - P_e}{1 - P_e}, \qquad P_o = \sum_{i=1}^{k}\sum_{j=1}^{k} w_{ij}\,p_{ij}, \qquad P_e = \sum_{i=1}^{k}\sum_{j=1}^{k} w_{ij}\,r_i s_j \]
Linear weight, equal spacing of weights for a near-match (Cicchetti-Allison, 1971):
\[ w_{ij} = 1 - \frac{|i-j|}{k-1} \]
Quadratic weight, inverse-square spacing of weights (Fleiss-Cohen, 1973):
\[ w_{ij} = 1 - \frac{(i-j)^2}{(k-1)^2} \]

Linear weights for k = 4
                 Measure II, j
Measure I, i     1     2     3     4
1                1     2/3   1/3   0
2                2/3   1     2/3   1/3
3                1/3   2/3   1     2/3
4                0     1/3   2/3   1

Quadratic weights for k = 4
                 Measure II, j
Measure I, i     1     2     3     4
1                1     8/9   5/9   0
2                8/9   1     8/9   5/9
3                5/9   8/9   1     8/9
4                0     5/9   8/9   1
Computing the Weighted Kappa statistic in SPSS

      1    2    3    4    5
1     25   5    3    2    0
2     12   18   8    3    1
3     0    3    14   4    2
4     1    3    3    19   4
5     2    1    5    9    20

Syntax file: WeightedKappaSPSS.sps
Agresti, A (1990). Categorical Data Analysis. New York: Wiley. (p. 367).
Results:
- Weighted Kappa proposed by Cicchetti-Allison: WK1 = .6218313730. Uses weights based on the absolute distance between categories (rows or columns).
- Weighted Kappa proposed by Fleiss-Cohen: WK2 = .7273434058. Uses a set of weights based on the squared distance between categories (rows or columns).
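The two SPSS results can be reproduced from the 5 x 5 table with a short sketch. The function name `weighted_kappa` and its `power` argument are hypothetical; power = 1 gives weights based on absolute distance (Cicchetti-Allison) and power = 2 gives weights based on squared distance (Fleiss-Cohen).

```python
def weighted_kappa(counts, power):
    """Weighted kappa for a k x k table of counts.

    power=1: linear weights    w_ij = 1 - |i - j| / (k - 1)
    power=2: quadratic weights w_ij = 1 - (i - j)^2 / (k - 1)^2
    """
    k = len(counts)
    n = sum(sum(row) for row in counts)
    row_tot = [sum(row) for row in counts]
    col_tot = [sum(counts[i][j] for i in range(k)) for j in range(k)]
    p_obs = 0.0
    p_exp = 0.0
    for i in range(k):
        for j in range(k):
            w = 1 - (abs(i - j) / (k - 1)) ** power
            p_obs += w * counts[i][j] / n
            p_exp += w * row_tot[i] * col_tot[j] / n ** 2
    return (p_obs - p_exp) / (1 - p_exp)


ratings = [[25, 5, 3, 2, 0],
           [12, 18, 8, 3, 1],
           [0, 3, 14, 4, 2],
           [1, 3, 3, 19, 4],
           [2, 1, 5, 9, 20]]
print(f"WK1 (linear)    = {weighted_kappa(ratings, 1):.10f}")
print(f"WK2 (quadratic) = {weighted_kappa(ratings, 2):.10f}")
```

Quadratic weights penalize near-misses less than linear weights do, which is why WK2 exceeds WK1 for a table whose disagreements are mostly one category apart.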
SAS Macro & References
1. Compute estimates and tests of agreement among multiple raters: %MAGREE macro
- Fleiss, J.L. (1981), Statistical Methods for Rates and Proportions, Second Edition. New York: John Wiley & Sons Inc.
- Kendall, M.G. (1955), Rank Correlation Methods, Second Edition, London: Charles Griffin & Co. Ltd.
- Landis, J.R. and Koch G.G. (1977), "The measurement of observer agreement for categorical data," Biometrics, 33, 159-174.
2. Intraclass Correlations for Inter-Rater Reliability: %INTRACC macro
- Shrout, P.E., and Fleiss, J.L. (1979), "Intraclass correlations: uses in assessing rater reliability," Psychological Bulletin, 86, 420-428.
- Winer, B.J. (1971), Statistical Principles in Experimental Design, New York: McGraw Hill.
Example
MAGREE macro: Kappa statistics for nominal response

final      Kappa     Standard Error   z         Prob>Z
1          0.41574   0.033333         12.4723   <.0001
2          0.06464   0.033333         1.9393    0.0262
3          0.50841   0.033333         15.2522   <.0001
4          0.34824   0.033333         10.4472   <.0001
5          0.65038   0.033333         19.5113   <.0001
Overall    0.44136   0.018572         23.7650   <.0001

MAGREE macro: Kendall's Coefficient of Concordance for ordinal response

Coeff of Concordance   F         Num DF   Denom DF   Prob>F
0.73741                8.42473   148.5    445.5      <.0001
Positional Reproducibility and Effects of a Rectal Balloon in Prostate Cancer Radiotherapy
J Korean Med Sci 2009; 24: 894-903
- ICC < 0.4: poor reproducibility
- 0.4 <= ICC < 0.75: fair to good reproducibility
- ICC >= 0.75: excellent reproducibility
Bland-Altman Plot
Panels: Male; Premenopausal Female
Figure 1. Bland-Altman agreement analysis plotting the difference between measured height and estimated height.
Predictive power and association (limitations)
C-statistics for discrimination of the survival data
The AUC of the ROC curve relates a continuous variable to a binary outcome. With survival data the outcome is survival time + event/censor rather than a binary outcome alone.
How can a Cox proportional hazards regression model separate a good prognostic group from a poor prognostic group?
C-statistics used to quantify the discrimination of a survival model:
1. Ignoring time to event/censor: the C-statistic of logistic regression
2. Chambless & Diao's C-statistics
3. Harrell's C-index (the most widely used)
4. Gonen & Heller's K
More recently, methods such as the NRI, the IDI, and the time-dependent c-index have been studied to improve on the discriminating power of C-statistics.
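NRI (Net Reclassification Improvement), IDI (Integrated Discrimination Improvement)

Event (rows: old model, columns: new model)
old \ new      ≤33%   33%<...<67%   ≥67%
≤33%           a      b             c
33%<...<67%    d      e             f
≥67%           g      h             i

Non-event (rows: old model, columns: new model)
old \ new      ≤33%   33%<...<67%   ≥67%
≤33%           j      k             l
33%<...<67%    m      n             o
≥67%           p      q             r

\[ NRI = (\hat{p}_{up \mid events} - \hat{p}_{down \mid events}) + (\hat{p}_{down \mid nonevents} - \hat{p}_{up \mid nonevents}) \]
The first term measures the improvement in sensitivity and the second the improvement in specificity.
\[ IDI = (IS_{new} - IS_{old}) - (IP_{new} - IP_{old}) \]
Again the first term measures the improvement in sensitivity and the second the improvement in specificity.
\[ \widehat{IDI} = (\bar{\hat{p}}_{new \mid events} - \bar{\hat{p}}_{old \mid events}) - (\bar{\hat{p}}_{new \mid nonevents} - \bar{\hat{p}}_{old \mid nonevents}) = (\bar{\hat{p}}_{new \mid events} - \bar{\hat{p}}_{new \mid nonevents}) - (\bar{\hat{p}}_{old \mid events} - \bar{\hat{p}}_{old \mid nonevents}) \]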
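The NRI can be sketched directly from two reclassification tables. This is an illustration under stated assumptions: rows are taken to be the old model's risk category and columns the new model's, so cells above the diagonal are upward moves; the function names and all counts are hypothetical, not data from the lecture.

```python
def up_down_proportions(table):
    """Proportion of subjects moved to a higher / lower risk category.
    Assumes rows = old model category, columns = new model category."""
    k = len(table)
    n = sum(sum(row) for row in table)
    up = sum(table[i][j] for i in range(k) for j in range(k) if j > i)
    down = sum(table[i][j] for i in range(k) for j in range(k) if j < i)
    return up / n, down / n


def net_reclassification_improvement(events, nonevents):
    """NRI = (p_up|events - p_down|events) + (p_down|nonevents - p_up|nonevents)."""
    up_e, down_e = up_down_proportions(events)
    up_n, down_n = up_down_proportions(nonevents)
    return (up_e - down_e) + (down_n - up_n)


# Hypothetical 3 x 3 reclassification tables, for illustration only.
events = [[10, 5, 1],
          [2, 20, 8],
          [0, 3, 51]]
nonevents = [[60, 4, 0],
             [10, 15, 2],
             [1, 3, 5]]
print(f"NRI = {net_reclassification_improvement(events, nonevents):.2f}")
```

In terms of the lettered cells, upward moves among events are b + c + f and downward moves are d + g + h; the same pattern applies to the non-event table.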
Macro NRI IDI: Calculates NRI and IDI measures
Description: The macro NRI IDI calculates the NRI and IDI, which are measures that compare discrimination ability between two logistic regression prediction models. The macro assumes a binary dependent variable and two sets of numerical and/or categorical covariates for the two models. For calculation of NRI, the number of risk categories (<= 10) and the cutoff probabilities separating risk categories are input parameters. Output are estimated NRI and IDI with standard errors and p values for the test of the null hypotheses that each measure in the population is zero. Optionally the reclassification table for NRI is printed.
Usage: %nriidi (ds=, y=, id=, model1n=, model1c=, model2n=, model2c=, nriskcat=, riskcutoffs=, printtable= )
Arguments:
- ds: Input data set
- y: Binary dependent variable (0=controls, 1=cases)
- id: Subject identification variable
- model1n: Model 1: numerical covariates
- model1c: Model 1: categorical covariates
- model2n: Model 2: numerical covariates
- model2c: Model 2: categorical covariates
- nriskcat: Number of risk categories for NRI (<= 10)
- riskcutoffs: Cutoff probabilities (in %) separating risk categories (number of cutoffs = nriskcat-1)
- printtable: Y/N = print/do not print classification table for NRI
Reference: Pencina MJ, D'Agostino RB Sr, D'Agostino RB Jr, Vasan RS. Evaluating the added predictive ability of a new marker: From area under the ROC curve to reclassification and beyond. Stat Med. 2008 Jan 30; 27(2):157-72; discussion 207-12.
Prognostic value of tumor volume after prostate cancer surgery
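TD ROC (bootstrapping)
Because no standard test statistic exists for this comparison, the data are resampled many times and the 95% CIs of the AUCs are checked for overlap.
Q> How do we assess the predictive power of a new biomarker?
TD ROC: Se and Sp are obtained at every time point using the Kaplan-Meier estimator of the survival function and Bayes' theorem.
Dependent: Event + Time
\[ Se(c, t) = P(X > c \mid D(t) = 1) = \frac{\{1 - S(t \mid X > c)\}\,P(X > c)}{1 - S(t)} \]
\[ Sp(c, t) = P(X \le c \mid D(t) = 0) = \frac{S(t \mid X \le c)\,P(X \le c)}{S(t)} \]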
Harrell's C (concordance) index
Any 2 subjects are comparable if
\[ X_i < X_j \ \text{or} \ X_i > X_j \]
where X denotes survival time.
Any 2 subjects are concordant if
\[ (X_i < X_j \ \text{and} \ T_i < T_j) \ \text{or} \ (X_i > X_j \ \text{and} \ T_i > T_j) \]
where X denotes actual survival time and T denotes predicted survival time.
When two subjects are drawn at random, the pair is either concordant or discordant.
The C-statistic is defined as the probability of concordance given that all pairs considered are usable:
1. 2 subjects without events: not comparable
2. event vs. event comparisons: comparable
3. event vs. non-event comparisons: comparable (when the non-event subject was followed beyond the event time)
\[ \pi_c = P(X_i < X_j \ \text{and} \ T_i < T_j) + P(X_i > X_j \ \text{and} \ T_i > T_j) \]
\[ \pi_d = P(X_i < X_j \ \text{and} \ T_i > T_j) + P(X_i > X_j \ \text{and} \ T_i < T_j) \]
\[ C = \frac{\pi_c}{\pi_c + \pi_d} \]
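The pair-counting definition translates into a short sketch. Assumptions made here, beyond what the slide states: pairs with tied observed times are skipped, a pair is usable only when the subject with the shorter observed time had an event, and ties in the predicted time count as one half (a common convention for Harrell's C). The function name and the data are hypothetical.

```python
def harrell_c_index(time, event, predicted_time):
    """Harrell's C from observed times, event indicators (1 = event,
    0 = censored) and predicted survival times."""
    concordant = discordant = tied = 0
    n = len(time)
    for i in range(n):
        for j in range(i + 1, n):
            if time[i] == time[j]:
                continue                      # tied observed times: skipped in this sketch
            first, second = (i, j) if time[i] < time[j] else (j, i)
            if not event[first]:
                continue                      # shorter time is censored: not comparable
            if predicted_time[first] < predicted_time[second]:
                concordant += 1
            elif predicted_time[first] > predicted_time[second]:
                discordant += 1
            else:
                tied += 1
    usable = concordant + discordant + tied
    return (concordant + 0.5 * tied) / usable


# Hypothetical follow-up data, for illustration only.
observed = [5, 8, 12, 15, 20]
had_event = [1, 1, 0, 1, 0]
predicted = [6, 7, 14, 10, 9]
print(f"C-index = {harrell_c_index(observed, had_event, predicted):.3f}")
```

When the model outputs a risk score rather than a predicted survival time, the comparison direction is reversed: a higher score should go with the shorter survival.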
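GEE, WLS
Interest has recently grown in comparing an existing diagnostic method with a new one, or in evaluating whether combining a new method with the existing one is more efficient. This is a comparison of categorical data. In the past, sensitivity and specificity were compared with the chi-square test, but that ignores the correlation of results within a patient, so McNemar's test came to be used instead. Because this approach is limited when comparing the PPV and NPV in such data, it was extended to GEE (Generalized Estimating Equations, proc genmod). And because GEE cannot be estimated when the TP, FP, FN, or TN counts contain missing values, it was further extended to various methods such as WLS (weighted least squares, proc catmod).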
The following data set records age, sex, DM, HT, hdl, ldl, tg, hscrp, and CAOD status for 970 patients.

Variable   Description
ID         Patient serial number
Age        Age
Sex        Sex (1: male, 2: female)
DM         Diabetes mellitus (1: yes, 2: no)
HT         Hypertension (1: yes, 2: no)
hdl        HDL
ldl        LDL
tg         Triglyceride
hscrp      hsCRP
Comparing two independent ROC curves
When an ROC curve is obtained from each data set, can we say that the two AUC values differ?
Comparing two correlated ROC curves
Q) Which of DM and HT predicts CAOD better?
Q) Does adding hscrp to DM, HT, hdl, tg, and ldl increase the predictive power?
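Se, Sp, AUC-ROC
Consultation question: The aim of our study is to show that a new liver cancer marker is superior to the existing marker in patients with liver cancer. The existing marker has a sensitivity of 60% and a specificity of 80%; the new marker we have found has a sensitivity of 70% and a specificity of 85%. How large should N be? Because a larger N makes the experiment more expensive, we would like to divide patients into two groups, those positive on the existing test and those negative on it, and demonstrate the validity of our new marker in them. In that case, how many subjects per group would be appropriate?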
Sample size for diagnostic test
Sample size for the 1-α lower confidence limit for the sensitivity to exceed π - δ with probability 1-β:
\[ n = \frac{\left[ z_{1-\alpha}\sqrt{\pi(1-\pi)} + z_{1-\beta}\sqrt{(\pi-\delta)(1-\pi+\delta)} \right]^2}{\delta^2} \]
π: the expected sensitivity of the diagnostic test
δ: maximal distance from sensitivity (π)
α: type I error rate
β: type II error rate
Flahault A, Cadilhac M, and Thomas G, Journal of Clinical Epidemiology 58 (2005) 859-862
Sample size for diagnostic test
The proportions of cases and controls should take account of the prevalence (Prev) of the disease.
If Prev < 0.50, first read N cases in the tables, then compute N controls using the following equation:
\[ N_{controls} = N_{cases} \times \frac{1 - Prev}{Prev} \]
Sample size for diagnostic test
For example:
- expected sensitivity and specificity = 0.90
- disease prevalence = 0.10
- sample size for the lower 95% confidence limit to be 0.75 with 0.95 probability
From Table 1, N cases = 70; from the equation, N controls = 630.
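The step from cases to controls in this example is simple arithmetic, sketched below. The function name `controls_from_cases` is hypothetical, and N cases = 70 is taken from the table lookup described in the text rather than computed; the result is rounded up to a whole subject, with a small tolerance against floating-point noise.

```python
import math


def controls_from_cases(n_cases, prevalence):
    """N controls = N cases * (1 - Prev) / Prev, rounded up to a whole subject.
    Intended for Prev < 0.50, as in the text."""
    raw = n_cases * (1 - prevalence) / prevalence
    return math.ceil(raw - 1e-9)   # tolerance guards against floating-point noise


n_cases = 70          # read from the published table, not computed here
prevalence = 0.10
n_controls = controls_from_cases(n_cases, prevalence)
print(f"N cases = {n_cases}, N controls = {n_controls}, total = {n_cases + n_controls}")
```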