Descriptive Statistics (Numerical techniques)
2014-1 Hyejung Chang (hjchang@khu.ac.kr)

Descriptive Statistics
- Describing data with tables and graphs (quantitative or categorical variables)
- Numerical descriptions of center, variability, position (quantitative variables)
- Bivariate descriptions (in practice, most studies have several variables)

Numerical Descriptive Techniques
- Measures of central tendency: mean, median, mode
- Measures of variability: range, standard deviation, variance, coefficient of variation
- Measures of relative position: percentiles, deciles, quintiles, quartiles
- Measures of linear relationship: covariance, correlation coefficient, coefficient of determination, least squares line

Numerical descriptions
Let y denote a quantitative variable, with observations $y_1, y_2, y_3, \ldots, y_n$.

1) Describing the central tendency
- Median: middle measurement of the ordered sample
- Mean: $\bar{y} = \frac{y_1 + y_2 + \cdots + y_n}{n} = \frac{\sum_{i=1}^{n} y_i}{n}$
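Mean
- N = number of observations in the population
- n = number of observations in the sample
- $\mu$ (mu) = population mean (arithmetic mean of the population): $\mu = \frac{\sum_{i=1}^{N} x_i}{N}$
- $\bar{x}$ (x-bar) = sample mean (arithmetic mean of the sample): $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$

Median
The median is the observation in the middle when all observations are arranged in order.
- Data: {0, 7, 12, 5, 14, 8, 0, 9, 22}, N = 9 (odd). Sort the data from smallest to largest and pick the middle value:
  0 0 5 7 8 9 12 14 22
  median = 8
- Data: {0, 7, 12, 5, 14, 8, 0, 9, 22, 33}, N = 10 (even). Sort the data from smallest to largest and take the arithmetic mean of the two middle values, 8 and 9:
  0 0 5 7 8 9 12 14 22 33
  median = (8 + 9)/2 = 8.5
- The sample median and the population median are computed in the same way.

Mode
- The observation that occurs with the greatest frequency.
- A data set may have one mode or more than one.
- The mode is used mainly for nominal data, but it is a useful measure of central location for all data types.
- For large data sets, the modal class is more useful than a single-valued mode.
- The sample mode and the population mode are computed in the same way.

Mean, Median, Mode
- If the distribution of a variable is symmetric, the mean, median, and mode can all be equal (they coincide when the distribution also has a single peak).
[Figure: symmetric distribution with mode, median, and mean at the same point]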
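As a quick check of the median and mode examples above, the following Python sketch uses only the standard-library `statistics` module. The variable names are illustrative and not part of the original slides.

```python
import statistics

# Data sets from the median example (odd and even number of observations).
odd_data = [0, 7, 12, 5, 14, 8, 0, 9, 22]
even_data = odd_data + [33]

# statistics.median sorts internally; for an even count it averages
# the two middle values.
median_odd = statistics.median(odd_data)    # 8
median_even = statistics.median(even_data)  # 8.5

# multimode returns every most-frequent value, because a data set
# can have more than one mode.
modes = statistics.multimode(odd_data)      # [0]

# Arithmetic mean of the nine observations (77 / 9).
mean_odd = statistics.mean(odd_data)

print(median_odd, median_even, modes, round(mean_odd, 2))
```

Here the mean (about 8.56) sits above the median (8) because the single large value 22 pulls it upward.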
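Mean, Median, Mode
- If the distribution of a variable is asymmetric, that is, skewed to the left or to the right, the mean, median, and mode may all differ.
[Figure: skewed distribution with mode, median, and mean marked at different points]

Which of the mean, median, and mode is the best measure of central location?
- Mean: generally the most widely used and useful measure of central tendency. There are, however, situations in which the median is the better measure.
- Median: unlike the mean, it is not sensitive to extreme values.
- Mode: for quantitative data it is seldom the best measure of central location; for nominal data it is the only one of the three that can be used.

Example: Annual per capita carbon dioxide emissions (metric tons) for the n = 8 largest nations in population size
Bangladesh 0.3, Brazil 1.8, China 2.3, India 1.2, Indonesia 1.4, Pakistan 0.7, Russia 9.9, U.S. 20.1
Ordered sample: 0.3, 0.7, 1.2, 1.4, 1.8, 2.3, 9.9, 20.1
Median = (1.4 + 1.8)/2 = 1.6
Mean = (0.3 + 0.7 + 1.2 + ... + 20.1)/8 = 37.7/8 ≈ 4.7

Properties of mean and median
- For symmetric distributions, mean = median.
- For skewed distributions, the mean is drawn in the direction of the longer tail, relative to the median.
- The mean is valid for interval scales, the median for interval or ordinal scales.
- The mean is sensitive to outliers (the median is often preferred for highly skewed distributions).
- When the distribution is symmetric or mildly skewed, or discrete with few values, the mean is preferred because it uses the numerical values of all observations.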
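The carbon dioxide example can be reproduced in a few lines of Python. The second half of the sketch is an illustrative what-if that is not in the original slides: it drops the largest observation to show how much more the mean reacts to an outlier than the median does.

```python
import statistics

# Per capita CO2 emissions (metric tons) from the example above.
emissions = {
    "Bangladesh": 0.3, "Brazil": 1.8, "China": 2.3, "India": 1.2,
    "Indonesia": 1.4, "Pakistan": 0.7, "Russia": 9.9, "U.S.": 20.1,
}

values = list(emissions.values())
mean_all = statistics.mean(values)      # about 4.7
median_all = statistics.median(values)  # about 1.6

# Illustrative what-if (an assumption, not part of the slides):
# remove the largest value and recompute both measures.
without_us = [v for name, v in emissions.items() if name != "U.S."]
mean_drop = statistics.mean(without_us)
median_drop = statistics.median(without_us)

print(f"all 8 nations: mean={mean_all:.2f}, median={median_all:.2f}")
print(f"without U.S.:  mean={mean_drop:.2f}, median={median_drop:.2f}")
```

The mean falls by more than two metric tons when one observation is removed, while the median moves only from 1.6 to 1.4.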
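Geometric Mean
- The geometric mean is the average compound rate of return.
- Let $R_i$ be the rate of return in period i, expressed as a decimal (i = 1, 2, ..., n).
- The geometric mean $R_g$ of the returns is defined by $(1 + R_g)^n = (1 + R_1)(1 + R_2) \cdots (1 + R_n)$, so $R_g = \sqrt[n]{(1 + R_1)(1 + R_2) \cdots (1 + R_n)} - 1$.

Geometric Mean: Example
$1,000 is invested for 2 years. In the first year the investment grows by 100% to $2,000; in the second year it falls by 50% (a loss) back to $1,000.
- Returns in years 1 and 2: $R_1 = 100\%$, $R_2 = -50\%$
- Arithmetic mean (and median) of the two returns: $\frac{100\% + (-50\%)}{2} = 25\%$
- The value of the investment does not change over the 2 years, so the average compound rate of return is 0%, not 25%.
- In this example the geometric mean satisfies $(1 + R_g)^2 = (1 + 1)(1 + (-0.5)) = 1$. Solving for $R_g$: $R_g = \sqrt{1} - 1 = 0$.
- The geometric mean of the returns is therefore 0%. Applying the compound interest formula with a rate of 0%, the value at the end of the investment period is $1{,}000(1 + R_g)^2 = 1{,}000(1 + 0)^2 = 1{,}000$.
- The geometric mean is used to compute the average growth rate or rate of change over a period of time.

2) Describing variability
Measures of variability describe how widely the observations are spread around the mean.
Example: scores in two courses. The mean is 50 in both courses, but the scores for the course shown in red are more variable (more widely spread around the mean) than the scores for the course shown in blue.
[Figure: score distributions for two courses, red and blue, both centered at 50]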
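The two-year investment example for the geometric mean can be checked numerically. The sketch below is a minimal illustration; the function name `geometric_mean_return` is hypothetical, and it assumes every return is greater than -100% so that each growth factor is positive.

```python
import math

def geometric_mean_return(returns):
    """Average compound rate of return for per-period returns given as decimals.

    Assumes each return is greater than -1.0 (no total loss), so the
    product of growth factors is positive.
    """
    growth = math.prod(1.0 + r for r in returns)
    return growth ** (1.0 / len(returns)) - 1.0

# +100% in year 1, -50% in year 2, as in the example.
returns = [1.00, -0.50]

arithmetic = sum(returns) / len(returns)   # 0.25, i.e. 25%
geometric = geometric_mean_return(returns) # 0.0, i.e. 0%

# Value of the $1,000 investment after compounding at the geometric mean.
final_value = 1000 * (1 + geometric) ** len(returns)

print(arithmetic, geometric, final_value)
```

Compounding at the arithmetic mean of 25% would predict 1,000 x 1.25^2 = 1,562.50, which overstates the actual ending value of 1,000.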
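Measurement of variability (dispersion)
- Range: difference between the largest and smallest observations (highly sensitive to outliers, insensitive to the shape of the distribution)
- Standard deviation: a typical distance from the mean. The deviation of observation i from the mean is $y_i - \bar{y}$.

Range
- The range is the simplest measure of variability: Range = largest observation - smallest observation
- Example:
  Data: {4, 4, 4, 4, 50}, Range = 46
  Data: {4, 8, 15, 24, 39, 50}, Range = 46
- The range is the same in both cases, but the two data sets have different distributions.

Characteristics of the range
- Advantage: it is easy to compute.
- Disadvantage: it gives no information about how the observations between the two extremes are spread out.
- A measure of variability that uses all of the observations is therefore needed.

Variance
- The variance and the standard deviation are the most important measures of variability and play a key role in almost all statistical inference.
- Notation: $\sigma^2$ (sigma squared) = population variance; $s^2$ (s squared) = sample variance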
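The range example can be made concrete with a short Python sketch. It computes the range of both data sets and, for contrast, the sample standard deviation from the standard-library `statistics` module, which uses every observation. The helper name `data_range` is illustrative.

```python
import statistics

def data_range(data):
    """Largest observation minus smallest observation."""
    return max(data) - min(data)

a = [4, 4, 4, 4, 50]
b = [4, 8, 15, 24, 39, 50]

for name, data in (("a", a), ("b", b)):
    # The range is 46 for both sets; the standard deviation differs
    # because it depends on all observations, not only the extremes.
    print(name, data_range(data), round(statistics.stdev(data), 2))
```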
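Variance
- Population variance: $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$, where $\mu$ is the population mean and N is the population size.
- Sample variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$, where $\bar{x}$ is the sample mean.
- The denominator of the sample variance is n - 1, not the sample size n (one degree of freedom is used up by estimating the mean).
- The standard deviation is the square root of the variance: $\sigma = \sqrt{\sigma^2}$, $s = \sqrt{s^2}$.

Sample Variance
- To compute the sample variance from its definition, the sample mean must be computed first.
- Shortcut formula that computes the sample variance directly from the data, without the intermediate step of computing the sample mean: $s^2 = \frac{1}{n - 1}\left[\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right]$

Properties of the standard deviation
- $s \ge 0$, and s = 0 only if all observations are equal.
- s increases with the amount of variation around the mean.
- Division by n - 1 (not n) reflects the degree of freedom used up in estimating the mean.
- s depends on the units of the data (e.g., measuring in KRW vs. USD).
- Like the mean, s is affected by outliers.

Empirical Rule
If the histogram (distribution) of the data is approximately bell-shaped:
(1) about 68% of all observations fall within 1 standard deviation of the mean;
(2) about 95% of all observations fall within 2 standard deviations of the mean;
(3) all or nearly all observations (about 99.7%) fall within 3 standard deviations of the mean.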
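The sample variance can be computed either from its definition or from a shortcut formula. The sketch below assumes the usual shortcut form, s^2 = (sum of x^2 - (sum of x)^2 / n) / (n - 1), and checks both against `statistics.variance`. The function names are illustrative, and the data set is reused from the range example.

```python
import statistics

def sample_variance(data):
    """Definitional formula: squared deviations from the sample mean over n - 1."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - 1)

def sample_variance_shortcut(data):
    """Shortcut formula: needs only the sum and the sum of squares."""
    n = len(data)
    total = sum(data)
    total_sq = sum(x * x for x in data)
    return (total_sq - total * total / n) / (n - 1)

data = [4, 8, 15, 24, 39, 50]

v_def = sample_variance(data)
v_short = sample_variance_shortcut(data)
v_lib = statistics.variance(data)   # library implementation, also divides by n - 1
s = v_def ** 0.5                    # sample standard deviation

print(round(v_def, 2), round(v_short, 2), round(v_lib, 2), round(s, 2))
```

The shortcut is convenient by hand, but with floating-point data whose values are large relative to their spread it can lose precision, so the definitional form (or the library function) is usually the safer choice in code.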
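Chebysheff's Theorem (also spelled Chebyshev)
- A more general interpretation of the standard deviation: it applies to histograms (distributions) of any shape, including bell-shaped ones.
- The proportion of observations that fall within k standard deviations of the mean (k > 1) is at least $1 - \frac{1}{k^2}$.
- For k = 2, Chebysheff's theorem says that at least 3/4 of all observations fall within 2 standard deviations of the mean. This is a lower bound, in contrast to the Empirical Rule's approximate figure of 95%.

Coefficient of Variation
- A relative (proportional) measure of variability
- The standard deviation divided by the mean
- Population coefficient of variation: $CV = \frac{\sigma}{\mu}$
- Sample coefficient of variation: $cv = \frac{s}{\bar{x}}$

3) Measures of position
- p-th percentile: p percent of observations fall below it, (100 - p)% above it.
- p = 50: median; p = 25: lower quartile (LQ); p = 75: upper quartile (UQ)
- Quartiles are portrayed graphically by box plots (John Tukey).
- Example: weekly TV watching for n = 60 from the student survey data file, 3 outliers
[Figure: box plot of weekly TV watching]
- Interquartile range: IQR = UQ - LQ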
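The three ideas above (Chebysheff's lower bound, the coefficient of variation, and quartiles with the IQR) can be sketched in Python as follows. The function names are hypothetical, and the data set is borrowed from the earlier median example purely for illustration. Quartile conventions differ between textbooks and software; this sketch uses the default method of `statistics.quantiles`, so hand-computed quartiles may differ slightly.

```python
import statistics

def chebyshev_lower_bound(k):
    """Minimum proportion of observations within k standard deviations (k > 1)."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1.0 - 1.0 / k**2

def coefficient_of_variation(sample):
    """Sample standard deviation divided by the sample mean."""
    return statistics.stdev(sample) / statistics.mean(sample)

# Illustrative data, reused from the median example.
data = [0, 7, 12, 5, 14, 8, 0, 9, 22]

bound_k2 = chebyshev_lower_bound(2)   # 0.75, i.e. at least 3/4
cv = coefficient_of_variation(data)

# quantiles(..., n=4) returns the three cut points: LQ, median, UQ.
lq, median, uq = statistics.quantiles(data, n=4)
iqr = uq - lq

print(bound_k2, round(cv, 3), lq, median, uq, iqr)
```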
Box plots
- A box plot has a box from LQ to UQ, with the median marked. It portrays a five-number summary of the data: Minimum, LQ, Median, UQ, Maximum, except that outliers are identified separately.
- Outlier = observation falling below LQ - 1.5(IQR) or above UQ + 1.5(IQR)
- Ex. If LQ = 2 and UQ = 10, then IQR = 8 and outliers are observations above 10 + 1.5(8) = 22.

Bivariate description
- Associations between two or more variables (e.g., how does number of close friends depend on gender, income, education, age, working status, rural/urban, religiosity, ...)
- Response variable: the outcome variable (also called the dependent variable)
- Explanatory variable(s): define the groups to compare (also called independent variables)
- Example: number of close friends is a response variable, while gender, income, ... are explanatory variables.

Summarizing associations
- Categorical variables: show data using contingency tables.
- Quantitative variables: show data using scatterplots.
- Mixture of a categorical variable and a quantitative variable (e.g., number of close friends and gender): give numerical summaries (mean, standard deviation) or side-by-side box plots for the groups.
- Example: income by highest degree
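Returning to the box plot outlier rule, the 1.5(IQR) criterion is simple to express in code. The sketch below reproduces the LQ = 2, UQ = 10 example; the function names and the small data list are hypothetical and serve only to show the rule in action.

```python
def outlier_fences(lq, uq, k=1.5):
    """Return the (lower, upper) cutoffs LQ - k*IQR and UQ + k*IQR."""
    iqr = uq - lq
    return lq - k * iqr, uq + k * iqr

def find_outliers(data, lq, uq):
    """Observations strictly below the lower fence or strictly above the upper fence."""
    low, high = outlier_fences(lq, uq)
    return [x for x in data if x < low or x > high]

# Example from the slides: LQ = 2, UQ = 10, so IQR = 8.
low, high = outlier_fences(2, 10)   # -10.0 and 22.0

# Hypothetical observations, made up for illustration only.
observations = [0, 1, 2, 5, 10, 21, 22, 23, 40]
flagged = find_outliers(observations, 2, 10)   # [23, 40]

print(low, high, flagged)
```

An observation exactly equal to 22 is not flagged, because the rule asks for values above the fence.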
Contingency Tables
- Cross classifications of categorical variables in which rows (typically) represent categories of the explanatory variable and columns represent categories of the response variable.
- Counts in the cells of the table give the numbers of individuals at the corresponding combination of levels of the two variables.

Scatterplots (for quantitative variables)
- Plot the response variable on the vertical axis and the explanatory variable on the horizontal axis.
- Example: UN data, fertility (births per woman) vs. per capita gross domestic product (GDP)
[Figure: scatterplot of fertility vs. per capita GDP]

Correlation: strength of association
- Falls between -1 and +1, with the sign indicating the direction of the association.
- The larger the correlation in absolute value, the stronger the association (in terms of a straight-line trend).
- Examples (positive or negative, how strong?):
  Mental impairment and life events, correlation =
  GDP and fertility, correlation =
  GDP and percent using Internet, correlation =
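The slides do not give a formula for the correlation, so the following sketch assumes the usual Pearson sample correlation: the sum of cross-products of deviations divided by the square root of the product of the two sums of squared deviations. The function name and the small data set are hypothetical; the numbers are made up for illustration and are not the UN data.

```python
import math

def pearson_correlation(x, y):
    """Pearson sample correlation between two equal-length sequences.

    Assumes neither variable is constant (otherwise the denominator is zero).
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Made-up values: an explanatory variable that rises while the response falls.
explanatory = [1, 2, 4, 8, 16]
response = [6.0, 5.0, 3.5, 2.5, 1.8]

r = pearson_correlation(explanatory, response)
print(round(r, 3))   # negative sign: the response decreases as x increases
```

A perfectly linear increasing relationship gives r = +1 and a perfectly linear decreasing one gives r = -1; values near 0 indicate little straight-line trend, even if the variables are related in a curved way.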