군집분석 : 군집의개수 서울시립대통계학과이용희 FALL 2017 1 예제자료의준비 k- 평균군집분석의예제로서다음과같은 3 개의군집을인공적으로만들었다. library(mvtnorm) set.seed(244211) # number units in each cluster nn <-100 # mean and covariance center <- list(c(2,1),c(0,4),c(3,4)) covmat <- list(matrix(c(0.5,0,0,0.5),2,2), matrix(c(0.4,0.1,0.1,0.4),2,2), matrix(c(1,-0.9,-0 # data x1 <- rmvnorm(nn,mean=center[[1]],sigma = covmat[[1]]) x2 <- rmvnorm(nn,mean=center[[2]],sigma = covmat[[2]]) x3 <- rmvnorm(nn,mean=center[[3]],sigma = covmat[[3]]) data1 <- as.data.frame(rbind(x1,x2,x3)) # plot cols <- rep(c("red","blue","green"),each=nn) # colors with(data1,plot(v1,v2,col=cols)) title("true number of clusters = 3") 1
True number of clusters = 3 V2 0 2 4 6 1 0 1 2 3 4 5 V1 2 군집의수선택 2.1 군집내와군집간제곱합이용 군집분석은각군집의평균의차이를크게하고 ( 군집간의변동을크게하고 ) 군집내의변동을작게하는 것이좋다. 군집의개수가늘어날수록커지고군집내의변동은작아진다. 군집내의변동은다음과같이 각군집에서제곱합을구하여군집내변동의측도로사용할수있다. x ijk 를 k 번째군집에속한 i 번째관측벡터의 j 번째변수이라고하자. 각군집내제곱합 SSW k 는 다음과같이계산된다. SSW k = n k i=1 j=1 p n k (x ijk x jk ) 2 = (n k 1)s 2 k 여기서 s 2 k 은군집의표본분산이다. 각군집내제곱합을더한군집내제곱합 (within group sum of 2 i=1
sqaure; SSW) 는다음과같다. K SSW = SSW k 또한군집간의제곱합 (between group sum of square; SSB) 는다음과같다. K n k p K p SSB = ( x jk x j ) 2 = n k ( x jk x j ) 2 k=1 k=1 i=1 j=1 k=1 j=1 여기서 x j 는모든자료의 j 번째변수의평균이다. n k 는 k 번쨰군집의개체의수이다. 하지만군집의개수가늘어나서너무많으면자료를적절하게나누어유용한정보를가진그룹으로나누는목적과는멀어진다. 따라서군집의수와군집내변동을적절하게고려하여군집의개수를정해야한다. 이제군집의개수를 10까지변화시키면서각군집분석의결과를저장하자. K <- 1:10 cluslist <- list() for ( k in K) { cluslist[[k]] <- kmeans(data1,center=k) } # this is another way # cluslist <- lapply(k, function(x,d) { kmeans(d,center=x) },d=data1 ) sst <- getelement(cluslist[[1]],"totss") # 총제곱합 sst ## [1] 1461 10 개의서로다른군집에대하여제곱합을구하여그변화를살펴보자. ssb <- sapply(cluslist, function(x){getelement(x,"betweenss")} ) # 군집간의제곱합 ssb ## [1] 9.095e-13 7.552e+02 1.102e+03 1.213e+03 1.248e+03 1.268e+03 1.312e+03 1.310e+03 ## [9] 1.337e+03 1.343e+03 plot(k, ssb, xlab="number of clusters", ylab="between group sum of square") lines(k, ssb) 3
between group sum of square 0 400 800 1200 2 4 6 8 10 number of clusters 군집간의제곱합의변화를보면군집의개수가늘어나면계속커진다. 군집의개수가 3 개를넘어가면 제곱합이증가하는변화율이작다. ssw <- sapply(cluslist,function(x){getelement(x,"tot.withinss")}) # 군집내제곱합을더한값 ssw ## [1] 1461.2 706.0 358.7 247.7 213.0 193.5 149.6 150.8 123.7 118.1 plot(k, ssw, xlab="number of clusters", ylab="within group sum of sqaure") lines(k, ssw) 4
within group sum of sqaure 200 600 1000 1400 2 4 6 8 10 number of clusters 군집내의제곱합의변화를보면군집의개수가늘어나면계속작아진다. 다만군집의개수가 3개를넘어가면제곱합이감소하는변화율이작다. 따라서군집간의제곱합과군집내의제곱합을따로이용하면군집의개수를정하는데크게도음이되지못한다. 2.2 CH 지수이제군집간의제곱합과군집내의제곱합의비율을고려한아래와같이정의된 CH 지수 (CH index) 를고려해보자. CH 지수는두제곱합의변화를동시에고려한양으로서 CH 지수이가장큰값을가지는군집의수를선택하는기준이다. CH(k) = 위에서 k 는군집의개수이고 n 은총자료의개수이다. SSB/(k 1) SSW /(n k) n <- dim(data1)[1] chindex <- (ssb/(k-1))/(ssw/(n-k)) chindex 5
## [1] Inf 318.8 456.4 483.4 432.1 385.2 428.0 362.5 393.1 366.5 plot(k, chindex, xlab="number of clusters", ylab="ch index") lines(k, chindex) CH index 350 400 450 2 4 6 8 10 number of clusters CH 지수로볼때최적의군집의수는 3 개가아닌 4 개로나타난다. 이는하나의군집이길게늘어지는 경향 ( 초록색군집 ) 을가져서두개로나누어지는경향이있는것으로보인다. with(data1,plot(v1,v2,col=cols,cex=2)) with(data1,points(v1,v2,pch=cluslist[[3]]$cluster)) title("k-means clustering (K=3)", sub=" color is true point, shape is clustering point") 6
K means clustering (K=3) V2 0 2 4 6 1 0 1 2 3 4 5 V1 color is true point, shape is clustering point with(data1,plot(v1,v2,col=cols,cex=2)) with(data1,points(v1,v2,pch=cluslist[[4]]$cluster)) title("k-means clustering (K=4)", sub="color is true point, shape is clustering point") 7
K means clustering (K=4) V2 0 2 4 6 1 0 1 2 3 4 5 V1 color is true point, shape is clustering point 2.3 Average silhouette 방법 Average silhouette 방법은군집의결과에의한분류된하나의자료가이웃하는군집과의거리를계산하여군집분석이잘이루어져있는가를판단하는방법이다. 하나의관측치에대하여 silhouette 값이주어지는데이는 [ 1, 1] 사이에있는값으로 +1 에가까울수록관측치가속한군집에분류가좋은것이고 1에가까울수록자신이배정된군집이아닌다른군집에가깝다. 모든관측치에대하여 silhouette 값을계산하여평균을구해가장큰값을나타내는군집의수를선택한다. silhouette 값은 cluster 패키지의 silhouette() 함수를사용하여계산한다. 다음의그림은군집의개수가 3개와 4개인경우 silhouette 값을군집별로그린그림이다. library(cluster) sil3 <- silhouette(cluslist[[3]]$cluster,dist(data1)) 8
plot(sil3, main="silhouette graph of 3 clusters") silhouette graph of 3 clusters n = 300 3 clusters C j j : n j ave i Cj s i 1 : 99 0.59 2 : 86 0.45 3 : 115 0.59 0.0 0.2 0.4 0.6 0.8 1.0 Silhouette width s i Average silhouette width : 0.55 sil4 <- silhouette(cluslist[[4]]$cluster,dist(data1)) plot(sil4, main="silhouette graph of 4 clusters") 9
silhouette graph of 4 clusters n = 300 4 clusters C j j : n j ave i Cj s i 1 : 48 0.49 2 : 99 0.55 3 : 98 0.57 4 : 55 0.44 0.0 0.2 0.4 0.6 0.8 1.0 Silhouette width s i Average silhouette width : 0.53 이제각군집에대하여 silhouette 값의평균을구해보자. KK <- 2:10 silval <- KK for ( k in KK) { ss <- silhouette(cluslist[[k]]$cluster, dist(data1)) silval[k] <- mean( ss[,3]) } silval ## [1] 2.0000 0.4789 0.5488 0.5273 0.4957 0.4220 0.4305 0.3607 0.3671 0.3565 # NOTICE: eleminate the first value of silhouette value plot(kk, silval[-1], xlab="number of clusters", ylab="silhouette value") 10
lines(kk, silval[-1]) silhouette value 0.35 0.40 0.45 0.50 0.55 2 4 6 8 10 number of clusters silhouette 값의평균은군집의개수가 3 인경우가장크다. 2.4 GAP 통계량 군집을선택하는경우사용하는또하나의기준은 GAP 통계량이다. 군집의개수가 k 인경우군집내의 제곱합을 SSW (k) 라고하면 GAP 통계량 G(k) 는다음과같이정의된다. G(k) = log SSW (k) uni log SSW (k) 위에서 SSW (k) uni 는자료가 p-차원일양분포 (uniform distribution) 에서나왔다고가정했을경우 k 개군집에대한군집내제곱합이다. SSW (k) uni 은모의실험을통헤서구하고모의실험의결과를가지고 SSW (k) uni 의표준편차 s(k) 를구하여다음의기준으로군집의수를결정한다. 11
K optimal = min{k {1, 2,..., K max } G(k) G(K + 1) s(k + 1)} 통계량 G(k) 를구하는함수는 cluster 패키지의 clusgap() 이며다음과같이계산한다. Kmeans 방법으로모의실험의반복수는 b = 50 으로지정하였다. gap_stat <- clusgap(data1, FUNcluster = kmeans, nstart=25, K.max = 10, B = 50) print(gap_stat, method = "firstmax") ## Clustering Gap statistic ["clusgap"] from call: ## clusgap(x = data1, FUNcluster = kmeans, K.max = 10, B = 50, nstart = 25) ## B=50 simulated reference sets, k = 1..10; spaceh0="scaledpca" ## --> Number of clusters (method 'firstmax'): 3 ## logw E.logW gap SE.sim ## [1,] 5.342 5.539 0.1971 0.01982 ## [2,] 4.942 5.227 0.2848 0.01929 ## [3,] 4.604 5.014 0.4101 0.02133 ## [4,] 4.428 4.828 0.4007 0.01845 ## [5,] 4.329 4.714 0.3850 0.01386 ## [6,] 4.239 4.613 0.3745 0.01597 ## [7,] 4.163 4.527 0.3647 0.01560 ## [8,] 4.096 4.451 0.3557 0.01715 ## [9,] 4.029 4.382 0.3530 0.01698 ## [10,] 3.986 4.318 0.3323 0.01661 plot(gap_stat, frame = FALSE, xlab = "Number of clusters k",main="crime data: GAP statistics") 12
Crime data: GAP statistics Gap k 0.20 0.25 0.30 0.35 0.40 2 4 6 8 10 Number of clusters k GAP 통계량으로보면군집의개수가 3 인경우가선택된다. 2.5 군집분석결과 (k = 3, 4) cluslist[3] ## [[1]] ## K-means clustering with 3 clusters of sizes 99, 86, 115 ## ## Cluster means: ## V1 V2 ## 1 2.0158 0.9668 ## 2 3.2293 3.8309 13
## 3 0.1966 4.1921 ## ## Clustering vector: ## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ## [43] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ## [85] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 ## [127] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 ## [169] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 3 3 2 2 2 2 2 ## [211] 2 2 3 2 2 2 2 2 2 2 2 2 2 3 3 2 2 3 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 2 2 2 ## [253] 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 2 2 2 2 2 3 ## [295] 2 2 2 2 2 2 ## ## Within cluster sum of squares by cluster: ## [1] 105.1 132.3 121.3 ## (between_ss / total_ss = 75.5 %) ## ## Available components: ## ## [1] "cluster" "centers" "totss" "withinss" "tot.withinss" ## [6] "betweenss" "size" "iter" "ifault" cluslist[4] ## [[1]] ## K-means clustering with 4 clusters of sizes 48, 99, 98, 55 ## ## Cluster means: ## V1 V2 ## 1 2.0244 4.8944 ## 2 2.0158 0.9668 ## 3-0.0127 4.0115 14
## 4 3.7163 3.3361 ## ## Clustering vector: ## [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 ## [43] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 ## [85] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 ## [127] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 ## [169] 1 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 4 4 1 1 1 4 1 4 4 ## [211] 4 4 1 1 1 1 1 1 1 4 4 4 4 1 1 4 4 1 4 1 4 4 4 1 4 4 4 4 1 4 4 1 1 1 1 4 1 1 1 4 1 1 ## [253] 4 4 4 1 1 4 1 1 1 4 4 4 1 1 1 4 4 4 1 4 4 4 4 1 4 4 1 4 4 1 4 1 4 1 4 1 4 4 4 4 4 1 ## [295] 4 4 1 4 4 1 ## ## Within cluster sum of squares by cluster: ## [1] 31.46 105.09 60.61 50.54 ## (between_ss / total_ss = 83.0 %) ## ## Available components: ## ## [1] "cluster" "centers" "totss" "withinss" "tot.withinss" ## [6] "betweenss" "size" "iter" "ifault" 3 실제자료에대한적용 : 범죄율자료 k- 평균군집분석의예제로서범죄율프로화일에기초한미국의 50 개주에대한군집화에군집의수를 결정하는방법을적용해보자. `crime` <- structure(c(2, 2.2, 2, 3.6, 3.5, 4.6, 10.7, 5.2, 5.5, 5.5, 6, 8.9, 11.3, 3.1, 2.5, 1.8, 9.2, 1, 4, 3.1, 4.4, 4.9, 9, 31, 7.1, 5.9, 8.1, 8.6, 11.2, 11.7, 6.7, 10.4, 10.1, 11.2, 8.1, 12.8, 15
8.1, 13.5, 2.9, 3.2, 5.3, 7, 11.5, 9.3, 3.2, 12.6, 5, 6.6, 11.3, 8.6, 4.8, 14.8, 21.5, 21.8, 29.7, 21.4, 23.8, 30.5, 33.2, 25.1, 38.6, 25.9, 32.4, 67.4, 20.1, 31.8, 12.5, 29.2, 11.6, 17.7, 24.6, 32.9, 56.9, 43.6, 52.4, 26.5, 18.9, 26.4, 41.3, 43.9, 52.7, 23.1, 47, 28.4, 25.8, 28.9, 40.1, 36.4, 51.6, 17.3, 20, 21.9, 42.3, 46.9, 43, 25.3, 64.9, 53.4, 51.1, 44.9, 72.7, 31, 28, 24, 22, 193, 119, 192, 514, 269, 152, 142, 90, 325, 301, 73, 102, 42, 170, 7, 16, 51, 80, 124, 304, 754, 106, 41, 88, 99, 214, 367, 83, 208, 112, 65, 80, 224, 107, 240, 20, 21, 22, 145, 130, 169, 59, 287, 135, 206, 343, 88, 106, 102, 92, 103, 331, 192, 205, 431, 265, 176, 235, 186, 434, 424, 162, 148, 179, 370, 32, 87, 184, 252, 241, 476, 668, 167, 99, 354, 525, 319, 605, 222, 274, 408, 172, 278, 482, 285, 354, 118, 178, 243, 329, 538, 437, 180, 354, 244, 286, 521, 401, 103, 803, 755, 949, 1071, 1294, 1198, 1221, 1071, 735, 988, 887, 1180, 1509, 783, 1004, 956, 1136, 385, 554, 748, 1188, 1042, 1296, 1728, 813, 625, 1225, 1340, 1453, 2221, 824, 1325, 1159, 1076, 1030, 1461, 1787, 2049, 783, 1003, 817, 1792, 1845, 1908, 915, 1604, 1861, 1967, 1696, 1162, 1339, 2347, 2208, 2697, 2189, 2568, 2758, 2924, 2822, 1654, 2574, 2333, 2938, 3378, 2802, 2785, 2801, 2500, 2049, 1939, 2677, 3008, 3090, 2978, 4131, 2522, 1358, 2423, 2846, 2984, 4373, 1740, 2126, 2304, 1845, 2305, 3417, 3142, 3987, 3314, 2800, 3078, 4231, 3712, 4337, 4074, 3489, 4267, 4163, 3384, 3910, 3759, 164, 228, 181, 906, 705, 447, 637, 776, 354, 376, 328, 628, 800, 254, 288, 158, 439, 120, 99, 168, 258, 272, 545, 975, 219, 169, 208, 277, 430, 598, 193, 544, 267, 150, 195, 442, 649, 714, 215, 181, 169, 486, 343, 419, 223, 478, 315, 402, 762, 604, 328),.Dim = c(51l, 7L ),.Dimnames = list(c("me", "NH", "VT", "MA", "RI", "CT", "NY", "NJ", "PA", "OH", "IN", "IL", "MI", "WI", "MN", "IA", "MO", "ND", "SD", "NE", "KS", "DE", "MD", "DC", "VA", "WV", "NC", "SC", "GA", 16
"FL", "KY", "TN", "AL", "MS", "AR", "LA", "OK", "TX", "MT", "ID", "WY", "CO", "NM", "AZ", "UT", "NV", "WA", "OR", "CA", "AK", "HI" ), c("murder", "Rape", "Robbery", "Assault", "Burglary", "Theft", "Vehicle"))) crime <- as.data.frame(crime) # eliminate outlier crime_t <- subset(crime, Murder < 15) rge <- sapply(crime_t, function(x) diff(range(x))) crime_s <- sweep(crime_t, 2, rge, FUN = "/") head(crime_s) ## Murder Rape Robbery Assault Burglary Theft Vehicle ## ME 0.160 0.2422 0.05523 0.1780 0.4374 0.7784 0.2032 ## NH 0.176 0.3519 0.04734 0.1606 0.4112 0.7323 0.2825 ## VT 0.160 0.3568 0.04339 0.1798 0.5169 0.8945 0.2243 ## MA 0.288 0.4861 0.38067 0.5777 0.5833 0.7260 1.1227 ## RI 0.280 0.3502 0.23471 0.3351 0.7048 0.8517 0.8736 ## CT 0.368 0.3895 0.37870 0.3578 0.6525 0.9148 0.5539 3.1 CH 지수 K <- 1:10 cluslist <- list() cluslist <- lapply(k, function(x,d) { kmeans(d,center=x) },d=crime_s ) ssb <- sapply(cluslist, function(x){getelement(x,"betweenss")} ) # 군집간의제곱합 ssw <- sapply(cluslist,function(x){getelement(x,"tot.withinss")}) # 군집내제곱합을더한값 n <- dim(crime_s)[1] chindex <- (ssb/(k-1))/(ssw/(n-k)) chindex ## [1] Inf 45.41 31.32 25.69 23.76 22.61 22.35 22.05 20.77 20.93 17
plot(k, chindex, xlab="number of clusters", ylab="ch index", main="crime data: CH index") lines(k, chindex) Crime data: CH index CH index 20 25 30 35 40 45 2 4 6 8 10 number of clusters 3.2 Average silhouette 방법 KK <- 2:10 silval <- KK for ( k in KK) { ss <- silhouette(cluslist[[k]]$cluster, dist(crime_s)) silval[k] <- mean( ss[,3]) } silval 18
## [1] 2.0000 0.3987 0.2590 0.2653 0.2161 0.2267 0.2467 0.2485 0.2282 0.2627 # NOTICE: eleminate the first value of silhouette value plot(kk, silval[-1], xlab="number of clusters", ylab="silhouette value",main="crime data:avera lines(kk, silval[-1]) Crime data:average silhouette silhouette value 0.25 0.30 0.35 0.40 2 4 6 8 10 number of clusters 3.3 GAP 통계량 gap_stat <- clusgap(crime_s, FUNcluster = kmeans, nstart=25, K.max = 10, B = 100) print(gap_stat, method = "firstmax") ## Clustering Gap statistic ["clusgap"] from call: ## clusgap(x = crime_s, FUNcluster = kmeans, K.max = 10, B = 100, nstart = 25) 19
## B=100 simulated reference sets, k = 1..10; spaceh0="scaledpca" ## --> Number of clusters (method 'firstmax'): 2 ## logw E.logW gap SE.sim ## [1,] 2.339 2.587 0.2476 0.03485 ## [2,] 2.015 2.318 0.3035 0.03004 ## [3,] 1.914 2.208 0.2935 0.02946 ## [4,] 1.817 2.120 0.3029 0.02955 ## [5,] 1.737 2.044 0.3069 0.03034 ## [6,] 1.658 1.975 0.3165 0.03054 ## [7,] 1.574 1.913 0.3390 0.03129 ## [8,] 1.509 1.855 0.3466 0.03234 ## [9,] 1.454 1.802 0.3472 0.03248 ## [10,] 1.396 1.750 0.3544 0.03317 plot(gap_stat, frame = FALSE, xlab = "Number of clusters k",main="crime data: GAP statistics") 20
Crime data: GAP statistics Gap k 0.26 0.28 0.30 0.32 0.34 2 4 6 8 10 Number of clusters k 21