Data Cleaning & Exploratory Data Analysis Seungho Ryu, MD, PhD Kangbuk Samsung Hospital, Sungkyunkwan University
Data Cleaning Definition: A process used to determine inaccurate, incomplete, or unreasonable data and then to improve data quality by correcting detected errors and omissions. Error prevention is far superior to error detection and cleaning. No matter how efficient the data entry process, errors will still occur, and therefore data validation and correction cannot be ignored. One important product of data cleaning is the identification of the basic errors detected and the use of that information to improve the data entry process so that those errors do not recur.
Need for Data Cleaning Data cleaning centres around improving the quality of data to make them fit for use by reducing errors in the data. Errors are undesirable, but a good understanding of errors and error propagation can lead to active quality control and managed improvement in overall data quality. It is important that errors are not simply deleted: corrections should be documented and changes traced. It is best to add corrections to the database while retaining the original data in a separate field or fields, so that there is always the possibility of going back to the original information.
Principles of Data Cleaning
- Planning is essential
- Organizing data improves efficiency
- Prevention is better than cure
- Responsibility belongs to everyone
- Partnerships improve efficiency
- Minimize duplication
- Document validation procedures
- Feedback is a two-way street
- Education and training
Overview of the data cleaning process
- Data request & download, without the original unique identifier and with a newly assigned study ID
- Read data files into STATA by type of variable
- Examine individual variable data files: check potential errors including duplicates, define labels, check inconsistencies
- Edit commands: integrate labels, check incorrect IDs
- Merge (combine) individual variables: comprehensive visits versus other visits, define the index visit, identify incorrect IDs
- Create a STATA data set for analysis: apply labels & possible ranges, impute values for limits of detection, create derived variables
A minimal sketch of this workflow follows.
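As a rough illustration of the steps above, here is a minimal, hypothetical Stata sketch; all file names and variable names (lab_glucose.csv, study_id, visit_date, glucose, lipid_clean.dta, kshc_analysis_v01.dta) are assumptions for illustration, not the actual KSHC files.

* Hypothetical sketch of the cleaning workflow (file and variable names assumed)
* 1. Read a raw variable file supplied as CSV
import delimited using "lab_glucose.csv", clear
* 2. Check for duplicates on the assigned study ID and visit date
duplicates report study_id visit_date
duplicates drop study_id visit_date, force
* 3. Define and apply labels
label variable glucose "Fasting serum glucose, mg/dL"
* 4. Save the cleaned single-variable file
save "glucose_clean.dta", replace
* 5. Merge individual variable files on study ID and visit date
use "glucose_clean.dta", clear
merge 1:1 study_id visit_date using "lipid_clean.dta"
drop _merge
* 6. Save the analysis data set
save "kshc_analysis_v01.dta", replace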
Download format
- Define a proper format for each variable: lab / anthropometry / exam / image or procedures
- Define the different time variables:
  - request time: the time a participant registers for a health check-up at the center
  - sampling time: proxy for the blood sampling time (the time the lab label is printed)
  - test time: the time the analyzer starts to measure
  - report time: the time a value is confirmed and saved
Simplified data sources
- Local equipment or server: laboratory raw exam values (directly from the analyzer), procedures (PFT, PWV)
- PACS server
Data sources
- Raw exam values: raw-value variable data files (*.csv)
- Reported values
- Without personal unique identifiers
Data cleaning files
General principles Preserving all original data files received
General principles STATA data files should be self-explanatory
Stata do files
Do files:
- Include all data cleaning processes
- Self-explanatory
- Fully reproducible
- Apply feedback
- Record tasks
- Record date and version
Self-explanatory titles & tasks Define tasks Record version & date
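A minimal sketch of a self-explanatory do-file header following these conventions; the file name, task descriptions, and version details are hypothetical, not the actual KSHC scripts.

* -----------------------------------------------------------------
* kshc_clean_hscrp_v01.do            (hypothetical file name)
* Task   : clean high-sensitivity CRP values (duplicates, labels,
*          limits of detection)
* Author : data cleaning team
* Version: v01, date: 2012-02-29     (version and date recorded here)
* -----------------------------------------------------------------
version 12        // declare the Stata version the do file was written for
set more off
* ... data cleaning commands follow, one documented task per block ...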
kshc_data_dictionary_v01_2012_02_29 Provides summary documentation of all data changes and helps investigators use the data efficiently (and avoid running into the same data problems)
Checking data quality & correction Checking for erroneous values: range and consistency checks. Developing boundaries for out-of-range values requires a collaborative effort between the data cleaning team and clinicians.
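A minimal, hypothetical sketch of a range and consistency check in Stata; the variable names (sbp, dbp) and the boundaries shown are assumptions for illustration, not the boundaries agreed with the clinicians.

* Range check: flag systolic blood pressure outside an assumed plausible range
gen byte sbp_outlier = !missing(sbp) & (sbp < 60 | sbp > 260)
list study_id visit_date sbp if sbp_outlier
* Consistency check: systolic pressure should exceed diastolic pressure
assert sbp > dbp if !missing(sbp, dbp)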
Data cleaning examples Hearing High sensitivity C-reactive protein Blood pressure Direct bilirubin Waist Duplicates Decision process
Simple example - hearing
Assign a different code to values with possibly different meanings; assign the same code to values with the same meaning.
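A small hypothetical illustration of this coding rule in Stata, using made-up hearing result strings and codes; the actual KSHC categories and codes are not shown here.

* Hypothetical example: raw hearing results arrive as free-text strings
* Same meaning -> same code; possibly different meaning -> different code
gen byte hearing_code = .
replace hearing_code = 1 if inlist(raw_hearing, "normal", "Normal", "NORMAL")  // same meaning
replace hearing_code = 2 if raw_hearing == "impaired"
replace hearing_code = 9 if raw_hearing == "not tested"                        // different meaning, own code
label define hearinglbl 1 "normal" 2 "impaired" 9 "not tested"
label values hearing_code hearinglbl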
High sensitivity C-reactive protein
- Impossible values of zero (0)
- Limits of detection varied over time in terms of magnitude
Inconsistency between raw values and reported values.

. tab year if raw_values=="<0.30" | raw_values=="<0.31" | raw_values=="<0.32" | raw_values=="<0.34"

       year |      Freq.     Percent        Cum.
------------+-----------------------------------
       2005 |         48       96.00       96.00
       2007 |          2        4.00      100.00
------------+-----------------------------------
      Total |         50      100.00

. tab year if raw_values=="<0.03"

       year |      Freq.     Percent        Cum.
------------+-----------------------------------
       2005 |         10       90.91       90.91
       2009 |          1        9.09      100.00
------------+-----------------------------------
      Total |         11      100.00
High sensitivity C-reactive protein
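A hedged sketch of imputing a value for results below the limit of detection; the variable names and the convention of substituting LOD divided by the square root of 2 are assumptions for illustration (other conventions, such as LOD/2, are also common).

* Hypothetical LOD handling for hsCRP (variable names and rule assumed)
* raw_values holds strings such as "<0.30"; hscrp holds the numeric result
gen byte below_lod = strpos(raw_values, "<") > 0                 // flag "<LOD" results
gen double lod = real(subinstr(raw_values, "<", "", .)) if below_lod
replace hscrp = lod / sqrt(2) if below_lod                       // one common substitution rule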
Example- SBP (systolic blood pressure)
[Density plots of systolic blood pressure (mmHg), by year: 2002, 2003, 2004]
Direct bilirubin
Waist
Duplicates issue Records that were duplicated on all variables were removed
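A minimal sketch of how such duplicates could be identified and removed in Stata; the key variables shown are assumptions.

* Report observations that are identical on every variable, then drop them
duplicates report
duplicates drop                                   // keeps one copy of each full duplicate
* Partial duplicates (same study ID and visit date but conflicting values)
duplicates tag study_id visit_date, gen(dup_visit)
list study_id visit_date if dup_visit > 0         // review before deciding what to drop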
Inconsistent identifiers: decided to remove
Creating data sets for analysis To help investigators use the data efficiently and properly:
- remove duplicates and unreliable partial duplicates
- set erroneous values of each variable to missing
- set out-of-range values to missing
- check for consistency and correct inconsistencies
- impute a value for results below the limit of detection
A short sketch of the set-to-missing step follows.
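A hedged illustration of setting erroneous and out-of-range values to missing; the variable names, the range used, and the -99 missing code are assumptions for illustration only.

* Set impossible and out-of-range values to missing (assumed ranges and codes)
replace hscrp = . if hscrp == 0                                    // impossible value of zero
replace sbp   = . if !missing(sbp) & (sbp < 60 | sbp > 260)        // outside the agreed plausible range
* mvdecode does the same for coded missing values, e.g. -99
mvdecode sbp dbp, mv(-99)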
Difficulties
- Data not designed for the study
- No currently available documentation for range checks (e.g., limits of detection over time)
- Developing out-of-range boundaries required a collaborative effort between the data cleaning team and clinicians, covering both laboratory and clinical aspects
- Raw exam values not available before 2003
Data cleaning KSHC 288,419 subjects; 619,763 visits; 1,238 deaths
- Clean, internally consistent data sets for analysis
- Maximize the existing database
- Help investigators perform studies efficiently
- Promote better quality products
- Improve the hospital's image through reliable data and proper data usage
Kangbuk Samsung Cohort Study
- Data entry, timely review, and correction will continue throughout data cleaning: possible recovery of erroneous data
- Feedback throughout retrospective data cleaning: improvement of reporting consistency
- Data collection through standardized examination: best quality of cohort study data
Exploratory Data Analysis
Main reasons for EDA
- Detection of mistakes
- Checking of assumptions
- Preliminary selection of appropriate models
- Determining relationships among the explanatory variables
- Assessing the direction and rough size of relationships between explanatory and outcome variables
Data
Univariate Analysis
Measures of central location: mean, median, geometric mean
Mean (arithmetic mean) If there are n observations, x_1, x_2, ..., x_n, the sample mean is
$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Properties: intuitive; nice statistical properties; sensitive to outliers; appropriate for continuous and discrete variables
Computing the mean in a random sample of 10 SBP measurements Random sample of 10 SBP measurements in NHANES III: 111, 134, 105, 138, 110, 122, 196, 126, 177, 117
$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{111 + 134 + \dots + 117}{10} = 133.6$ mmHg
Median (50th percentile) The middle value of the sample (50th percentile). Order the sample from lowest to highest; if n is odd, the median is the middle value; if n is even, the median is the average of the two middle values. Properties: insensitive to outliers; for skewed distributions the mean may not be a good measure of the middle value, whereas the median is.
Computing the median in a random sample of 10 SBP measurements Random sample of 10 SBP measurements in NHANES III: 111, 134, 105, 138, 110, 122, 196, 126, 177, 117. Order the sample values: 105, 110, 111, 117, 122, 126, 134, 138, 177, 196. The median is the average of the 5th and 6th values: median = (122 + 126) / 2 = 124 mmHg
Geometric mean If there are n observations, x_1, x_2, ..., x_n, the geometric mean is
$\bar{x}_{\text{geom}} = \sqrt[n]{x_1 x_2 \cdots x_n} = \exp\!\left(\frac{\log x_1 + \log x_2 + \dots + \log x_n}{n}\right)$
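A small sketch of computing a geometric mean in Stata via the log scale; the variable name bcarotene is an assumption, and the variable must be strictly positive.

* Geometric mean as the exponentiated mean of the logs (variable name assumed)
gen double ln_bcar = ln(bcarotene) if bcarotene > 0 & !missing(bcarotene)
quietly summarize ln_bcar
display "geometric mean = " exp(r(mean))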
Serum β-carotene in NHANES III adults (1988–1994) [Histogram of serum β-carotene (µg/dL).] Geometric mean: 14.3 µg/dL; min: 0.48 µg/dL; P25: 8.0 µg/dL; P50: 14.0 µg/dL; P75: 24.0 µg/dL; max: 674.0 µg/dL. Based on unweighted analysis of NHANES III data. N = 16,629; N missing = 2,989
Measures of spread: standard deviation, interquartile range, range, coefficient of variation
Standard deviation If there are n observations, x_1, x_2, ..., x_n, the sample standard deviation is
$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$
Note: s² is the sample variance. Properties: expressed in the same units as the data; sensitive to outliers
Computing the SD in a random sample of 10 SBP measurements Random sample of 10 SBP measurements in NHANES III: 111, 134, 105, 138, 110, 122, 196, 126, 177, 117. Mean: 133.6 mmHg
$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}} = \sqrt{\frac{(111 - 133.6)^2 + (134 - 133.6)^2 + \dots + (117 - 133.6)^2}{10 - 1}} = 30.09 \text{ mmHg}$
Other measures of spread
- Interquartile range: 75th percentile − 25th percentile
- Range: highest − lowest observation
- Coefficient of variation: $CV = \frac{s}{\bar{x}} \times 100\%$
Skewness of distributions Pagano and Gauvreau, page 42
Univariate non-graphical EDA: Continuous variable Summary statistics
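A hedged example of obtaining these summary statistics in Stata; the variable names sbp, dbp, and bmi are assumptions.

* Detailed summary statistics for one continuous variable (variable name assumed)
summarize sbp, detail
* Selected statistics for several variables at once
tabstat sbp dbp bmi, statistics(mean sd p25 p50 p75 min max) columns(statistics)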
Univariate non-graphical EDA: Categorical variable A simple tabulation of the frequency of each category is the best univariate non-graphical EDA for categorical data
Graphical methods for continuous data (univariate) Bar chart Histogram Box plot Stem-and-leaf plot
SBP in NHANES III adults (1988–1994) [Histogram of systolic blood pressure (mm Hg), with probability on the vertical axis.] Properties of a distribution: location, spread, shape. Based on unweighted analysis of NHANES III data. N = 19,256; N missing = 362
SBP in NHANES III adults (1988–1994) [Histogram of systolic blood pressure (mm Hg).] Mean: 126.4 mm Hg; SD: 20.6 mm Hg; min: 69.0 mm Hg; P25: 111.0 mm Hg; P50: 122.0 mm Hg; P75: 138.0 mm Hg; max: 246.0 mm Hg; IQR: 27.0 mm Hg; CV: 16.3%. Based on unweighted analysis of NHANES III data. N = 19,256; N missing = 362
Histogram The key with a histogram is to use a sufficient number of intervals to define the shape of the distribution clearly without losing much information. A rough rule of thumb is to choose the number of bins to be about 1 + 3.3·log10(n), where n is the sample size.
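A brief Stata sketch applying this rule of thumb; the variable name sbp is an assumption.

* Number of bins from the rule of thumb 1 + 3.3*log10(n), rounded up
quietly count if !missing(sbp)
local nbins = ceil(1 + 3.3*log10(r(N)))
histogram sbp, bin(`nbins') frequency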
Boxplots
Boxplot A boxplot shows:
- Location, as measured by the median
- Spread, as measured by the height of the box (the interquartile range, or IQR)
- The range of the observations
- The presence of outliers
- Some information about shape
SBP in NHANES III adults (1988–1994) Stem-and-leaf plot
N = 500   Median = 123   Quartiles = 112, 140
Decimal point is 1 place to the right of the colon
 8 : 1
 8 :
 9 : 1123
 9 : 55577788899999
10 : 000001111122222222333333444444444444
10 : 55555555666666666666777777888888888899999999
11 : 00000001111111111111111122222222222222233333333334444444444444444
11 : 55555555555666666666666677777777788888888888999999999
12 : 000000000001111111111111222222233333333444444
12 : 5555556666667777777777778888888899999
13 : 000000111111122222222233333333444444444
13 : 555555566666777777888888888899999999
14 : 0000000000001111222223334444444
14 : 555555666667899999
15 : 001113344444
15 : 5567788888899
16 : 0000000011122233
16 : 66778999
17 : 0444
17 : 56677788899
18 : 134
18 : 8
19 : 24
19 : 6
High: 197 210 220 221 224 231
Based on a random sample of 500 adult NHANES III participants
Quantile-normal plots
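A minimal example of producing a quantile-normal plot in Stata; the variable name sbp is an assumption.

* Quantile-normal (Q-Q) plot to assess how close a variable is to a normal distribution
qnorm sbp
* A histogram with a normal overlay gives a complementary view
histogram sbp, normal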
KSHS fasting triglycerides
Transformation of Data
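A hedged sketch of a log transformation for a right-skewed variable such as fasting triglycerides; the variable name triglyceride is an assumption.

* Log-transform a right-skewed variable and compare distributions
gen double ln_tg = ln(triglyceride) if triglyceride > 0 & !missing(triglyceride)
histogram triglyceride, name(raw, replace)
histogram ln_tg, name(logged, replace)
graph combine raw logged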
Bivariate Analysis
Scatterplots
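A minimal scatterplot example in Stata; the variable names ldl and bmi are assumptions.

* Scatterplot of LDL cholesterol against body mass index (variable names assumed)
scatter ldl bmi
* Add a linear fit to show the rough direction and size of the relationship
twoway (scatter ldl bmi) (lfit ldl bmi)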
Correlation Analysis
. pwcorr bmi gluc ldl hdl tchol, sig
Side-by-side boxplots
[Side-by-side boxplots of age (years) by smoking status: non-smoker, ex-smoker, current smoker]
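A short sketch of producing side-by-side boxplots in Stata; the variable names age and smoke02 follow the surrounding output, but the exact coding is assumed.

* Side-by-side boxplots of age by smoking status
graph box age, over(smoke02) ytitle("Age, years")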
* ttest (mean-comparison tests)
Two-group mean-comparison test of whether varname has a different mean across the two groups defined by the group variable, allowing unequal variances.
Syntax: ttest varname, by(groupvar) unequal

. ttest age, by(diabetes) unequal

Two-sample t test with unequal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
    0 no |   27982    40.14032    .0594981    9.952736     40.0237    40.25693
   1 yes |     857    50.22418    .3693619    10.81291    49.49922    50.94914
---------+--------------------------------------------------------------------
combined |   28839    40.43997    .0596218    10.12501    40.32311    40.55684
---------+--------------------------------------------------------------------
    diff |           -10.08386    .3741233               -10.81812    -9.34961
------------------------------------------------------------------------------
    diff = mean(0 no) - mean(1 yes)                               t = -26.9533
    Ho: diff = 0                  Satterthwaite's degrees of freedom = 900.981

    Ha: diff < 0                 Ha: diff != 0                  Ha: diff > 0
 Pr(T < t) = 0.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 1.0000
* Chi-square test

. tab2 dm1 agegr, col chi2

-> tabulation of dm1 by agegr

+-------------------+
| Key               |
|-------------------|
|     frequency     |
| column percentage |
+-------------------+

  Diabetes |     RECODE of age (Age) in 2002
  mellitus |
   in 2002 |     30-39      40-49        50+ |     Total
-----------+---------------------------------+----------
        no |     9,902      4,961        205 |    15,068
           |     98.86      97.05      93.61 |     98.18
-----------+---------------------------------+----------
       yes |       114        151         14 |       279
           |      1.14       2.95       6.39 |      1.82
-----------+---------------------------------+----------
     Total |    10,016      5,112        219 |    15,347
           |    100.00     100.00     100.00 |    100.00

          Pearson chi2(2) =  88.5612   Pr = 0.000
. tab2 dm1 overweight, col exact chi2

-> tabulation of dm1 by overweight

+-------------------+
| Key               |
|-------------------|
|     frequency     |
| column percentage |
+-------------------+

  Diabetes |  RECODE of bmi1 (Body
  mellitus |  mass index) in 2002
   in 2002 |    BMI<25    BMI>=25 |     Total
-----------+----------------------+----------
        no |     9,320      5,748 |    15,068
           |     98.78      97.23 |     98.18
-----------+----------------------+----------
       yes |       115        164 |       279
           |      1.22       2.77 |      1.82
-----------+----------------------+----------
     Total |     9,435      5,912 |    15,347
           |    100.00     100.00 |    100.00

          Pearson chi2(1) =  49.2477   Pr = 0.000
           Fisher's exact =            0.000
   1-sided Fisher's exact =            0.000
. tab2 dm1 smoke02, col chi2

-> tabulation of dm1 by smoke02

+-------------------+
| Key               |
|-------------------|
|     frequency     |
| column percentage |
+-------------------+

  Diabetes |      Smoking habit in 2002
  mellitus |
   in 2002 | non-smoke  ex-smoker  current s |     Total
-----------+---------------------------------+----------
        no |     4,383      3,432      6,755 |    14,570
           |     98.52      98.00      98.10 |     98.20
-----------+---------------------------------+----------
       yes |        66         70        131 |       267
           |      1.48       2.00       1.90 |      1.80
-----------+---------------------------------+----------
     Total |     4,449      3,502      6,886 |    14,837
           |    100.00     100.00     100.00 |    100.00

          Pearson chi2(2) =   3.7146   Pr = 0.156