untitled

Similar documents
전립선암발생률추정과관련요인분석 : The Korean Cancer Prevention Study-II (KCPS-II)

(Exposure) Exposure (Exposure Assesment) EMF Unknown to mechanism Health Effect (Effect) Unknown to mechanism Behavior pattern (Micro- Environment) Re

04-다시_고속철도61~80p

확률과통계 강의자료-1.hwp

untitled


歯1.PDF

저작자표시 - 비영리 - 변경금지 2.0 대한민국 이용자는아래의조건을따르는경우에한하여자유롭게 이저작물을복제, 배포, 전송, 전시, 공연및방송할수있습니다. 다음과같은조건을따라야합니다 : 저작자표시. 귀하는원저작자를표시하여야합니다. 비영리. 귀하는이저작물을영리목적으로이용할


서론 34 2

11¹Ú´ö±Ô


Rheu-suppl hwp

- 2 -

Oracle Apps Day_SEM

step 1-1

<31342D3034C0E5C7FDBFB52E687770>

PowerPoint 프레젠테이션

methods.hwp

강의10

<32382DC3BBB0A2C0E5BED6C0DA2E687770>


PowerPoint 프레젠테이션

Page 2 of 6 Here are the rules for conjugating Whether (or not) and If when using a Descriptive Verb. The only difference here from Action Verbs is wh

BK21 플러스방법론워크숍 Data Management Using Stata 오욱찬 서울대사회복지학과 BK21 플러스사업팀


Page 2 of 5 아니다 means to not be, and is therefore the opposite of 이다. While English simply turns words like to be or to exist negative by adding not,

ISO17025.PDF

<B1E2C8B9BEC828BFCFBCBAC1F7C0FC29322E687770>

DBPIA-NURIMEDIA


Vol. 20, December 2014 Tobacco Control Issue Report Contents Infographic 년 전 세계 FCTC 주요 이행현황 Updates 04 이 달의 정책 06 이 달의 연구 Highlights 09 담배규제기본

6자료집최종(6.8))

원고스타일 정의

김기남_ATDC2016_160620_[키노트].key

09È«¼®¿µ 5~152s

한국성인에서초기황반변성질환과 연관된위험요인연구


012임수진

석사논문.PDF

182 동북아역사논총 42호 금융정책이 조선에 어떤 영향을 미쳤는지를 살펴보고자 한다. 일제 대외금융 정책의 기본원칙은 각 식민지와 점령지마다 별도의 발권은행을 수립하여 일본 은행권이 아닌 각 지역 통화를 발행케 한 점에 있다. 이들 통화는 일본은행권 과 等 價 로 연

Microsoft PowerPoint - Stat03_Numerical technique(New) [Compatibility Mode]

Dialog Box 실행파일을 Web에 포함시키는 방법

Microsoft PowerPoint - AC3.pptx


ETL_project_best_practice1.ppt

Journal of Educational Innovation Research 2019, Vol. 29, No. 1, pp DOI: (LiD) - - * Way to

#KLZ-371(PB)

DBPIA-NURIMEDIA

<30382E20B1C7BCF8C0E720C6EDC1FD5FC3D6C1BEBABB2E687770>

03.Agile.key

- 2 -

광덕산 레이더 자료를 이용한 강원중북부 내륙지방의 강수특성 연구


,,,.,,,, (, 2013).,.,, (,, 2011). (, 2007;, 2008), (, 2005;,, 2007).,, (,, 2010;, 2010), (2012),,,.. (, 2011:,, 2012). (2007) 26%., (,,, 2011;, 2006;

#KM-235(110222)

Journal of Educational Innovation Research 2018, Vol. 28, No. 4, pp DOI: * A Research Trend


歯5-2-13(전미희외).PDF

Main Title

저작자표시 - 비영리 - 변경금지 2.0 대한민국 이용자는아래의조건을따르는경우에한하여자유롭게 이저작물을복제, 배포, 전송, 전시, 공연및방송할수있습니다. 다음과같은조건을따라야합니다 : 저작자표시. 귀하는원저작자를표시하여야합니다. 비영리. 귀하는이저작물을영리목적으로이용할

03-서연옥.hwp

< FC7D1BEE7B4EB2DB9FDC7D0B3EDC3D132382D332E687770>

#KM560

975_983 특집-한규철, 정원호

Buy one get one with discount promotional strategy

DBPIA-NURIMEDIA

GEAR KOREA

FMX M JPG 15MB 320x240 30fps, 160Kbps 11MB View operation,, seek seek Random Access Average Read Sequential Read 12 FMX () 2

(5차 편집).hwp

#Ȳ¿ë¼®

2009년 국제법평론회 동계학술대회 일정

,,,,,,, ,, 2 3,,,,,,,,,,,,,,,, (2001) 2

Vol.257 C O N T E N T S M O N T H L Y P U B L I C F I N A N C E F O R U M

Journal of Educational Innovation Research 2017, Vol. 27, No. 2, pp DOI: : Researc

Crt114( ).hwp

Vol.258 C O N T E N T S M O N T H L Y P U B L I C F I N A N C E F O R U M

<31372DB9DABAB4C8A32E687770>

저작자표시 - 비영리 - 변경금지 2.0 대한민국 이용자는아래의조건을따르는경우에한하여자유롭게 이저작물을복제, 배포, 전송, 전시, 공연및방송할수있습니다. 다음과같은조건을따라야합니다 : 저작자표시. 귀하는원저작자를표시하여야합니다. 비영리. 귀하는이저작물을영리목적으로이용할

4번.hwp

ÀÌÁÖÈñ.hwp

Hi-MO 애프터케어 시스템 편 5. 오비맥주 카스 카스 후레쉬 테이블 맥주는 천연식품이다 편 처음 스타일 그대로, 부탁 케어~ Hi-MO 애프터케어 시스템 지속적인 모발 관리로 끝까지 스타일이 유지되도록 독보적이다! 근데 그거 아세요? 맥주도 인공첨가물이

PowerSHAPE 따라하기 Calculate 버튼을 클릭한다. Close 버튼을 눌러 미러 릴리프 페이지를 닫는다. D 화면을 보기 위하여 F 키를 누른다. - 모델이 다음과 같이 보이게 될 것이다. 열매 만들기 Shape Editor를 이용하여 열매를 만들어 보도록

Output file


WIDIN - Toolholding Catalogue.pdf

Journal of Educational Innovation Research 2018, Vol. 28, No. 1, pp DOI: A study on Characte

#KM-340BL

3 Gas Champion : MBB : IBM BCS PO : 2 BBc : : /45

<3130C0E5>

DBPIA-NURIMEDIA

27 2, * ** 3, 3,. B ,.,,,. 3,.,,,,..,. :,, : 2009/09/03 : 2009/09/21 : 2009/09/30 * ICAD (Institute for Children Ability


기관고유연구사업결과보고

°í¼®ÁÖ Ãâ·Â

public key private key Encryption Algorithm Decryption Algorithm 1

untitled

PowerPoint 프레젠테이션

1..

Coriolis.hwp

Transcription:

Data cleaning & Exploratory data analysis Seungho Ryu, MD, PhD Kanguk Samsung Hospital, Sungkyunkwan University

Data Cleaning Definition: A process used to determining inaccurate, incomplete, or unreasonable data and then improving the quality through correction of detected errors and omission. Error prevention is far superior to error detection and cleaning. No matter how efficient the process of data entry, errors will still occur and therefore data validation and correction cannot be ignored. One important product of data cleaning is the identification of the basic errors detected and using that information to improvement the data entry process to prevent those errors from reoccurring.

Need for Data Cleaning Centre around improving the quality of data to make them fit for use by users through reducing errors in the data. They are bad, but a good understanding of errors and error propagation can lead to active quality control and managed improvement in the overall data quality. It is important that errors not just be deleted, but corrections documented and changes traced. It is best to add corrections to the database while retaining the original data in a separate field or fields so that there is always the chance of going back to the original information.

Principles of Data Cleaning Planning is essential Organizing data improves efficiency Prevention is better than cure Responsibility belongs to everyone Partnerships improves efficiency Minimize duplication Documentation of validation procedures Feedback is a two-way street Education and training

Overview of data cleaning process Data request & download without original unique identifier/ with a new assigned study-id Read data files with STATA by types of variables Examine individual variable data files check potential errors including duplicates define labels check inconsistency Edit commands Integrate labels Check incorrect id Merge (combine) individual variables comprehensive visits versus other visits define index visit/ define incorrect id Create STATA data set for analysis apply labels & possible ranges impute values for limits of detection create derived variables

Download format To define a proper format for each variable Lab/anthropometry/exam/image or procedures To define different time variables request time- the time a participant registered a health check-up in the center sampling time- proxy of blood sampling time (the time for printing out a lab label test time the time an analyzer starts to measure report time the time a value is confirmed and saved

Simplified data sources Local equipment or server Laboratory Raw exam values -directly from analyzer machine Procedures (PFT, PWV) PACS Server server

Data sources Raw exam values Raw values variable data files (*.csv) Reported values Without personal unique identifiers

Data cleaning files

General principles Preserving all original data files received

General principles STATA data files self-explanatory

Stata do files Do files - Include all data cleaning processes - Self explanatory - All reproducible - Apply feedback - Record tasks - Date and version

Self-explanatory titles & tasks Define tasks Record version & date

kshc_data_dictionary_v01_2012_02_29 Provide summary documentation of all data changes and help investigators to use the data efficiently (avoid same problems of data)

Checking data quality & correction Checking erroneous values Range and consistency checks Developing boundaries for out-of range values required a collaborative effort between data cleaning team and clinicians

Data cleaning examples Hearing High sensitivity C-reactive protein Blood pressure Direct bilirubin Waist Duplicates Decision process

Simple example - hearing

Assign a different code for possible a different meaning value Assign a same code for same meaning values

High sensitivity C-reactive protein Impossible value of zero (0) limits of detection varied over time in terms of magnitude

Inconsistent between raw values and reported values. tab year if raw_values=="<0.30" raw_values=="<0.31" raw_values=="< > 0.32" raw_values=="<0.34" year Freq. Percent Cum. 2005 48 96.00 96.00 2007 2 4.00 100.00 Total 50 100.00. tab year if raw_values=="<0.03" year Freq. Percent Cum. 2005 10 90.91 90.91 2009 1 9.09 100.00 Total 11 100.00.

High sensitivity C-reactive protein

Example- SBP (systolic blood pressure)

Density 0.02.04.06.08 50 100 150 200 250 systolic blood pressure, mmhg Density 0.02.04.06.08.1 Year 2002 Year 2003 Density 0.02.04.06.08 50 100 150 200 250 systolic blood pressure, mmhg Year 2004 50 100 150 200 250 systolic blood pressure, mmhg

Direct bilirubin

waist

Duplicates issue All variable duplicates were removed

Inconsistent identifier Decide to remove

Creating data sets for analysis To help investigators use data efficiently and properly remove duplicates and unreliable partial duplicates set erroneous values of each variable to missing set out-of range to missing check for consistency and correct inconsistency Impute a certain value for limit of detection

Difficulties Data -Not designed for study No current available documentation for range check (e.g., limit of detection over time) Required a collaborative effort between data cleaning team and clinicians for developing out of range in terms of both laboratory and clinical aspects Raw exam values- Not available before 2003

Data cleaning KSHC 288,419 subjects; 619,763 visits; 1,238 deaths Clean, internally consistent data sets for analysis Maximize the existing database Help investigators perform studies efficiently Promote better quality products Improve hospital image with reliable data and proper data usage

Kangbuk Samsung Cohort Study Data entry, timely review and correction will continue throughout the data cleaning possible recovery of erroneous data Feedback throughout retrospective data cleaning improvement of reporting consistency Data collection through standardized examination best quality of cohort study data

Exploratory Data Analysis

Main Reasons of EDA Detection of mistakes Checking of assumptions Preliminary selection of appropriate models Determining relationships among the explanatory variables Assessing the direction and rough size of relationships between explanatory and outcome variables

( ) 1

Data

Univariate Analysis

Mean Median Measures of central location Geometric mean

Mean (arithmetic mean) If there are n observations, x 1, x 2,, x n, the sample mean is x = x 1 + x 2 +... + n x Properties Intuitive Nice statistical properties Sensitive to outliers Appropriate for continuous and discrete variables n = n i= 1 n x i

Computing the mean in a random sample of 10 SBP measurements Random sample of 10 SBP measurements in NHANES III 111, 134, 105, 138, 110, 122, 196, 126, 177, 117 x = n i= 1 n x i = 111+ 134 + 10... + 117 = 133.6 mmhg

Median (50 th percentile) Middle value of the sample (50 th percentile) Order the sample (from lowest to highest) If n is odd, the median is the middle value If n is even, the median is the average of the two middle values Properties: The mean may not be a good measure of the middle value of a distribution Insensitive to outliers

Computing the median in a random sample of 10 SBP measurements Random sample of 10 SBP measurements in NHANES III 111, 134, 105, 138, 110, 122, 196, 126, 177, 117 Order the sample values 105, 110, 111, 117, 122, 126, 134, 138, 177, 196 The median is the average of the 5 th and 6 th values Median = (122 + 126) / 2 = 124

Geometric mean If there are n observations, x1, x2,, xn, the geometric mean is x geom = = n x 1 x 2... x n logx1 + logx2 +... + logx exp n n

Serum β-carotene in NHANES III adults (1988 1994) Frequency 1000 750 500 Geom. mean: 14.3 µg/dl Min: 0.48 µg/dl P 25: 8.0 µg/dl P 50: 14.0 µg/dl P 75: 24.0 µg/dl Max: 674.0 µg/dl 250 0 1 25 50 100 400 Serum β-carotene (µg/dl) Based on unweighted analysis of NHANES III data. N = 16,629. N missing = 2,989

Measures of spread Standard deviation Interquartile range Range Coefficient of variation

Standard deviation If there are n observations, x 1, x 2,, x n, the sample standard deviation is s = n i= 1 ( ) Note: s 2 is the sample variance Properties x n i 1 x 2

s Computing the SD in a random sample of 10 SBP measurements Random sample of 10 SBP measurements in NHANES III 111, 134, 105, 138, 110, 122, 196, 126, 177, 117 Mean: 133.6 mmhg = = = n i= 1 ( x x) i n 1 2 2 2 ( 111 133.6) + ( 134 133.6) +... + ( 117 133.6) 30.09 mmhg 10 1 2

Other measures of spread Interquartile range 75 th percentile 25 th percentile Range Highest lowest observation Coefficient of variation s CV = 100% x

Skewness of distributions Pagano and Gauvreau, page 42

Univariate non-graphical EDA: Continuous variable Summary statistics

Univariate non-graphical EDA: Categorical variable A simple tabulation of the frequency of each category is the best univariate non-graphical EDA for categorical data

Graphical methods for continuous data (univariate) Bar chart Histogram Box plot Stem-and-leaf plot

SBP in NHANES III adults (1988 1994) 0.03 0.02 Properties of a distribution Location Probability 0.01 Spread Shape 0.0 80 90 100 110 120 130 140 150 160 170 180 190 200 Systolic blood pressure (mm Hg) Based on unweighted analysis of NHANES III data. N = 19,256. N missing = 362

SBP in NHANES III adults (1988 1994) Probability 0.03 0.02 0.01 Mean: 126.4 mm Hg SD: 20.6 mm Hg Min: 69.0 mm Hg P 25: 111.0 mm Hg P 50: 122.0 mm Hg P 75: 138.0 mm Hg Max: 246.0 mm Hg IQR: 27.0 mm Hg CV: 16.3 % 0.0 80 90 100 110 120 130 140 150 160 170 180 190 200 Systolic blood pressure (mm Hg) Based on unweighted analysis of NHANES III data. N = 19,256. N missing = 362

Histogram The key with a histogram is to use a sufficient number of intervals to define the shape of the distribution clearly and not lose much information. A rough rule of thumb is to choose the number of bins to be about 1+3.3log 10 (n) where no is the sample size.

Boxplots

Boxplot Location, as measured by the median Spead, as measured by the height of the box (this is called the interquartile range for IQR) Range of the observations Presence of outliers Some information about shape

SBP in NHANES III adults (1988 1994) Stem and leaf plot N = 500 Median = 123 Quartiles = 112, 140 Decimal point is 1 place to the right of the colon 8 : 1 8 : 9 : 1123 9 : 55577788899999 10 : 000001111122222222333333444444444444 10 : 55555555666666666666777777888888888899999999 11 : 00000001111111111111111122222222222222233333333334444444444444444 11 : 55555555555666666666666677777777788888888888999999999 12 : 000000000001111111111111222222233333333444444 12 : 5555556666667777777777778888888899999 13 : 000000111111122222222233333333444444444 13 : 555555566666777777888888888899999999 14 : 0000000000001111222223334444444 14 : 555555666667899999 15 : 001113344444 15 : 5567788888899 16 : 0000000011122233 16 : 66778999 17 : 0444 17 : 56677788899 18 : 134 18 : 8 19 : 24 19 : 6 High: 197 210 220 221 224 231 Based on a random sample of 500 adult NHANES III participants

Quantile-normal plots

KSHS fasting triglycerides

Transformation of Data

bivariate Analysis

Scatterplots

Correlation Analysis. pwcorr bmi gluc ldl hdl tchol, sig

Side-by-side boxplots

Side-by-side boxplots Age 30 40 50 60 non-smoker ex-smoker current smoker

* ttest(mean-comparison tests) Group mean-comparison test that varname has the different mean(unequal variance) within the two groups defined by group variable ttest variable name, by(group variable) option. ttest age, by(diabetes)unequal Two-sample t test with unequal variances ------------------------------------------------------------------------------ Group Obs Mean Std. Err. Std. Dev. [95% Conf. Interval] ---------+-------------------------------------------------------------------- 0 no 27982 40.14032.0594981 9.952736 40.0237 40.25693 1 yes 857 50.22418.3693619 10.81291 49.49922 50.94914 ---------+-------------------------------------------------------------------- combined 28839 40.43997.0596218 10.12501 40.32311 40.55684 ---------+-------------------------------------------------------------------- diff -10.08386.3741233-10.81812-9.34961 ------------------------------------------------------------------------------ diff = mean(0 no) - mean(1 yes) t = -26.9533 Ho: diff = 0 Satterthwaite's degrees of freedom = 900.981 Ha: diff < 0 Ha: diff!= 0 Ha: diff > 0 Pr(T < t) = 0.0000 Pr( T > t ) = 0.0000 Pr(T > t) = 1.0000

* Chi-square test.. tab2 dm1 agegr, col chi2 -> tabulation of dm1 by agegr +-------------------+ Key ------------------- frequency column percentage +-------------------+ Two-way tab2 1 2, col chi2 3 2, 1 Diabetes mellitus RECODE of age (Age) in 2002 30-39 40-49 50+ Total -----------+---------------------------------+---------- no 9,902 4,961 205 15,068 98.86 97.05 93.61 98.18 -----------+---------------------------------+---------- yes 114 151 14 279 1.14 2.95 6.39 1.82 -----------+---------------------------------+---------- Total 10,016 5,112 219 15,347 100.00 100.00 100.00 100.00 Pearson chi2(2) = 88.5612 Pr = 0.000

. tab2 dm1 overweight, col exact chi2 -> tabulation of dm1 by overweight +-------------------+ Key ------------------- frequency column percentage +-------------------+ Diabetes RECODE of bmi1 (Body mellitus mass index in 2002) in 2002 BMI<25 BMI>=25 Total -----------+----------------------+---------- no 9,320 5,748 15,068 98.78 97.23 98.18 -----------+----------------------+---------- yes 115 164 279 1.22 2.77 1.82 -----------+----------------------+---------- Total 9,435 5,912 15,347 100.00 100.00 100.00 Pearson chi2(1) = 49.2477 Pr = 0.000 Fisher's exact = 0.000 1-sided Fisher's exact = 0.000

. tab2 dm1 smoke02, col chi2 -> tabulation of dm1 by smoke02 +-------------------+ Key ------------------- frequency column percentage +-------------------+ Diabetes mellitus Smoking habit in 2002 in 2002 non-smoke ex-smoker current s Total -----------+---------------------------------+---------- no 4,383 3,432 6,755 14,570 98.52 98.00 98.10 98.20 -----------+---------------------------------+---------- yes 66 70 131 267 1.48 2.00 1.90 1.80 -----------+---------------------------------+---------- Total 4,449 3,502 6,886 14,837 100.00 100.00 100.00 100.00 Pearson chi2(2) = 3.7146 Pr = 0.156