Toward Open Platform
R, the Open-Source Statistical Language, and Big Data Analysis
NexR Data Scientist Jeon Hee-Won
Contents
  Introduction to R: definition, history, philosophy, features, package system
  Big data analysis: big data, data science and data scientists; The Marriage of Hadoop and R; NexR's Way for Big Data Analysis
  Etc.: KRUG (Korean R User Group); Korea R CRAN Mirror Statistics
Introduction to R: Definition
  R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered as a different implementation of S. There are some important differences, but much code written for S runs unaltered under R.
                                  Inception      Spread                        Diffusion
                                  Bell Lab       Commercial                    GNU/Open source
  O/S                             UNIX           BSD/System V (HP, IBM, SUN)   LINUX
  Application (analysis system)   The S system   S-PLUS                        R Packages
Introduction to R: History
  S (John Chambers)
    1976  Version 1: Fortran-based
    1980  Version 2: UNIX
    1988  Version 3: C-based, Class/Method
    1998  Version 4: Java interface, Class/Method
  S-PLUS
    1988  StatSci
    1993  With MathSoft, E-license
    2001  Insightful (2005: V.7, big data; 2007: V.8, R package)
    2008  TIBCO
  R
    1993        Ross Ihaka, Robert Gentleman
    1997.4.1    Mailing list
    1997.4.23   CRAN
    1997.12.5   GNU Project
    2000.1      Version 1.0
Introduction to R: Philosophy
  Free Software Foundation, GNU (GNU is Not Unix) Project, Richard Stallman
  GNU GPL (General Public License): grants the freedom to distribute. Software is distributed under a license stating that anyone may freely run, copy, modify, and distribute it, and that no one may restrict those rights.
  Free Software: "free" means freedom more than zero cost
  Related projects
    Organization: The R Foundation for Statistical Computing (R Development Core Team); Windows, UNIX, OS X
    BioConductor: analysis of genomic data, more than 460 packages
    Distribution: The Comprehensive R Archive Network (CRAN), 3,452 packages (2011/12/01)
Introduction to R: Features
  An analysis system built on an interpreted language
      > tot = 0
      > for (i in 1:10) {
      +   tot = tot + i
      + }
      > print(tot)
      [1] 55
      > sum(1:10)
      [1] 55
  Tailor-made vs. ready-made clothing: freedom vs. convenience
  SAS is procedure-centered; SPSS is menu-driven
      PROC FREQ options1;
        TABLES requests / options2;
        WEIGHT variable;
        BY variables;
Introduction to R: Features (cont.)
  Connectivity: ease of system integration
    Language interface: C, C++, FORTRAN, Java, Python, Tcl/Tk, VB, Perl, Ruby
    Application interface: Excel, Google Earth, ArcView, COM/DCOM, etc.
    DB interface: ODBC (Oracle, MySQL, MS-SQL, PostgreSQL, ...)
    IDE: RStudio, Eclipse, Emacs, Bluefish, Crimson Editor, ConTEXT, Vim, jEdit, Kate, TextMate, gedit, SciTE, WinEdt
  When building an application or platform, R is easy to adopt as the solution for the analysis layer.
  Integration examples
    Revolution Analytics: Revolution R
    IBM: Netezza appliance DB
    EMC: Greenplum appliance DB
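Introduction to R: Features (cont.)
  Data structures: optimized for statistical computation
  Data objects
    Vector: structure for vector operations
    Factor: categorical data
    Ordered factor: ordinal categorical data
    Matrix: matrix for matrix operations
    List: list object, similar to a C struct
    Data Frame: multivariate data structure, similar to a DBMS table
    Array: structure for array operations
    Time Series: structure for time-series analysis
  Vectorized operations: avoid loops and compute with matrix or vector operations
    apply, lapply, tapply, outer, ...
    Data structures optimized for statistical analysis: matrix, vector, etc.
  Vectorization example
      > mat = matrix(1:12, ncol=4)
      > mat
           [,1] [,2] [,3] [,4]
      [1,]    1    4    7   10
      [2,]    2    5    8   11
      [3,]    3    6    9   12
      > apply(mat, 2, sum)
      [1]  6 15 24 33
      > colMeans(mat)
      [1]  2  5  8 11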
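The data objects and vectorized helpers listed on this slide can be sketched in a few lines of base R. All variable names and sample values below are made up for illustration; they do not come from the talk.

```r
# Illustrative values only; none of these names appear in the slides.
v   <- c(2.5, 3.1, 4.8)                          # vector
f   <- factor(c("a", "b", "a"))                  # factor (categorical data)
of  <- factor(c("low", "high", "mid"),
              levels = c("low", "mid", "high"),
              ordered = TRUE)                    # ordered factor
m   <- matrix(1:6, nrow = 2)                     # 2 x 3 matrix
l   <- list(id = 1L, tags = c("x", "y"))         # list, similar to a C struct
df  <- data.frame(group = f, value = v)          # data frame, similar to a table
a   <- array(1:24, dim = c(2, 3, 4))             # 3-dimensional array
ts1 <- ts(c(10, 12, 9, 14), start = c(2011, 1), frequency = 4)  # quarterly series

# Vectorized alternatives to explicit loops
doubled     <- v * 2                               # element-wise arithmetic
group_means <- tapply(df$value, df$group, mean)    # mean of value per group
lens        <- sapply(l, length)                   # length of each list element
tbl         <- outer(1:3, 1:3)                     # 3 x 3 multiplication table
```

Each helper replaces a loop: tapply iterates over groups, sapply over list elements, and outer over all pairs of its two arguments.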
Introduction to R: Features (cont.)
  Example of optimized statistical computation: regression analysis
      > stack.loss[1:6]
      [1] 42 37 37 28 18 18
      > X <- cbind(1, stack.x)
      > head(X)
             Air.Flow Water.Temp Acid.Conc.
      [1,] 1       80         27         89
      [2,] 1       80         27         88
      [3,] 1       75         25         90
      [4,] 1       62         24         87
      [5,] 1       62         22         87
      [6,] 1       62         23         87
      > solve(t(X) %*% X) %*% t(X) %*% stack.loss
                        [,1]
                 -39.9196744
      Air.Flow     0.7156402
      Water.Temp   1.2952861
      Acid.Conc.  -0.1521225
      > lm(stack.loss ~ stack.x)

      Call:
      lm(formula = stack.loss ~ stack.x)

      Coefficients:
            (Intercept)    stack.xAir.Flow  stack.xWater.Temp  stack.xAcid.Conc.
               -39.9197             0.7156             1.2953            -0.1521
  Native matrix/vector data types and matrix operations remove complicated loops and make the code easy to understand.
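The matrix expression on this slide is the normal-equations solution of least squares, beta = (X'X)^(-1) X'y. As a minimal sketch, the same estimate can be computed without forming an explicit inverse and checked against lm. The names beta_ne and fit are chosen here for illustration; crossprod(X) is base R shorthand for t(X) %*% X.

```r
# stack.x and stack.loss ship with R's built-in datasets package.
X <- cbind(1, stack.x)                       # design matrix with intercept column

# Solve (X'X) beta = X'y directly instead of inverting X'X
beta_ne <- solve(crossprod(X), crossprod(X, stack.loss))

# Reference fit using lm
fit <- lm(stack.loss ~ stack.x)

# Compare the two coefficient vectors up to numerical tolerance
same <- isTRUE(all.equal(as.vector(beta_ne), unname(coef(fit)), tolerance = 1e-6))
same
```

Solving the linear system is usually preferable to solve(t(X) %*% X) followed by a multiplication, since it avoids computing the inverse explicitly.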
Introduction to R: Features (cont.)
  Like UNIX commands (Bell Lab)
  Commands (functions)
    ls: list objects
    rm: remove objects
    grep: pattern matching
    apropos: list commands (functions) by name
    find: find objects
    vi, emacs: invoke a text editor
    cat: print object contents
    head: show the first rows of data
    tail: show the last rows of data
    diff: differences (in R, lagged differences of a vector)
    paste: join
    split: split
  Regular expressions are supported
  Hidden objects start with a dot (.)
  In its Bell Lab days the S language inherited many characteristics of UNIX
  Command examples
      > ls(pat="^p")
      [1] "pattern.features"
      > apropos("sum$")
      [1] "contr.sum" "cumsum"    "rowsum"    "sum"
      > head(iris[,1:2], n=3)
        Sepal.Length Sepal.Width
      1          5.1         3.5
      2          4.9         3.0
      3          4.7         3.2
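A few of the UNIX-like functions from this slide, shown on made-up data as a minimal sketch. The vector fruits and the object .hidden are illustrative names only.

```r
fruits <- c("apple", "banana", "pear", "plum")    # made-up sample data

idx     <- grep("^p", fruits)                     # positions matching the regex
matches <- grep("^p", fruits, value = TRUE)       # the matching elements
ids     <- paste("id", 1:3, sep = "_")            # join pieces into strings
parts   <- split(1:6, c("a", "b", "a", "b", "a", "b"))  # split values by group
last2   <- tail(fruits, n = 2)                    # last two elements

# Objects whose names start with a dot are hidden from a plain ls()
.hidden <- 1
visible_by_default <- ".hidden" %in% ls()
visible_with_all   <- ".hidden" %in% ls(all.names = TRUE)
```

As with ls -a in a UNIX shell, ls(all.names = TRUE) is needed to see the dot-prefixed object.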
Introduction to R: Features (cont.)
  Graphics
    Graphics devices: bmp, jpeg, png, tiff, pdf, postscript, SVG (R 2.14)
    Other support: OpenGL, spatial (ArcView, Google Maps), ...
    Low-level plot: points, lines, box, rect, polygon, text, title, mtext, legend, axis, grid
    High-level plot: plot, barplot, boxplot, pie, qqplot, ...
    trellis (lattice package), rgl, sna, wordcloud, ...
  Users can draw a wide variety of graphs and adjust them in fine detail.
  Command example
      x <- 1:10
      y <- x^2
      plot(x, y, type="b", col="red", lwd=1.2, pch=16,
           xlab="x", ylab="y", main=expression(x^2))
  [Figure: the resulting plot of y against x, titled x^2]
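The split between graphics devices, high-level plots, and low-level plots can be sketched as follows. This is an illustrative example, not code from the talk: it opens a pdf device (the output path under tempdir() is an assumption), draws an empty frame with a high-level call, then adds each element with low-level calls.

```r
out <- file.path(tempdir(), "square.pdf")   # hypothetical output file

pdf(out, width = 5, height = 5)             # open a graphics device
x <- 1:10
y <- x^2
plot(x, y, type = "n", xlab = "x", ylab = "y",
     main = expression(x^2))                # high-level: axes and title only
grid()                                      # low-level: background grid
lines(x, y, col = "red", lwd = 1.2)         # low-level: connecting line
points(x, y, col = "red", pch = 16)         # low-level: filled points
legend("topleft", legend = "y = x^2",
       col = "red", lty = 1, pch = 16)      # low-level: legend
invisible(dev.off())                        # close the device and write the file

written <- file.exists(out)
```

Swapping pdf() for png() or svg() changes only the device; the drawing calls stay the same.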
Using R
  Much as English is among natural languages, R is the most widely used language in statistical analysis and in data analysis competitions.
  http://www.kdnuggets.com/2011/08/poll-languages-for-data-mining-analytics.html
  http://blog.revolutionanalytics.com/2011/11/r-still-the-preferred-tool-of-predictive-modelers-competing-at-kaggle.html
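R Packaging System
  The system that has contributed most to R's growth so far
  The number of packages has grown exponentially with each new version
  A main reason R is the most powerful platform in data analysis
  To submit a package: upload it to ftp://cran.r-project.org/incoming and send an email to cran@r-project.org
  http://journal.r-project.org/archive/2009-2/rjournal_2009-2_fox.pdf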
Big Data: The Data Flood
Big Data: From the Concept to the Data Scientist
  "Big data" is a relative term.
    "Big data is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time." (Wikipedia)
    Data is "big data" when the size of the data itself becomes part of the problem. (My favorite definition.)
  What big data has brought about
    The traditional analyst's skills + computer science knowledge
    A peculiar job category called the data scientist...
    The marriage (?) of Hadoop and R
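Data Science
  The practice of providing services built on a feedback loop: extracting value from user data and returning that value to the users
  A composite art of computer science + statistics + cognitive psychology + design
    Because large volumes of data must be analyzed, the share taken by computer science keeps growing.
    How to visualize large amounts of data effectively is another pillar.
      One picture is usually better than hundreds of numbers.
    Statistics is "the grammar of data science".
  http://benfry.com/phd/dissertation-050312b-acrobat.pdf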
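Why R in Big Data analysis?
  Already has many users
    Universities, companies, public institutions?
  This is the age of analysis; which tool should we use?
    There is no alternative to R. (SAS, SPSS, numpy??)
  Open source and a participatory development culture
    Works well with Hadoop
    Paper (or book) + R packages: the recent trend
  Various approaches to R's existing limitations
    The R core team is already addressing them (multicore and memory-limit issues)
    Many vendors are participating and offering solutions (Revolution Analytics)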
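Problems and Solutions for R in Big Data Analysis
  Memory limit issues
    Problem: all data is loaded into memory before it is processed
      Solution: ff, bigmemory, RevoScaleR
      GB-scale data can be processed
      Data over 10 GB can be processed, but too slowly
    Problem: memory runs short because unneeded data is kept
      Solution: gc(), rm()
    Problem: only numbers representable in 32 bits are used, up to 2^31-1
      Solution: from R 2.15, vector lengths of 2^51 and beyond can be used
    Problem: no int64
      Solution: int64 package from Google
    Problem: memory fragmentation
      Solution: use a 64-bit machine, with more memory
  Single-core issue
    Problem: only one core of a multicore CPU is used
      Solution: the parallel package is bundled from R 2.14
  TB-scale big data is still hard to process.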
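Three of the points on this slide can be tried directly in base R: the 32-bit integer limit, freeing memory with rm() and gc(), and using more than one core through the bundled parallel package. This is a minimal sketch, assuming a Unix-like system where mclapply can fork; the worker count of 2 and the function square are illustrative choices.

```r
library(parallel)   # bundled with R since 2.14

# Largest representable integer: 2^31 - 1
int_limit <- .Machine$integer.max

# Allocate about 8 MB of doubles, then release it
big <- numeric(1e6)
size_bytes <- as.numeric(object.size(big))
rm(big)             # drop the reference to the object
invisible(gc())     # ask R to reclaim the freed memory

# Spread independent tasks over 2 worker processes (forking; Unix-like only)
square <- function(i) i^2
res <- mclapply(1:8, square, mc.cores = 2L)
squares <- unlist(res)
```

On Windows, where forking is unavailable, the same package offers makeCluster with parLapply as an alternative.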
Why Hadoop for Big Data Analysis?
  "Hadoop has become the kernel of the distributed operating system for Big Data" (Doug Cutting, keynote at Hadoop World 2011)
  Most data processing is groundwork for data analysis
  Many companies already use Hadoop for data analysis and processing
  Here too, there is almost no alternative to Hadoop
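The Marriage of Hadoop and R (1)
  R needed the ability to handle big data, and Hadoop needed advanced analytics.
    An inevitable match in which each fills the other's gaps
    Neither has an alternative to the other.
  Most R users do not want to leave the R shell to do their work.
    Wouldn't it be great to write map/reduce code in R?
    How can a data mining model built in R be applied to large data sets? PMML?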
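To make the idea of writing map/reduce code in R concrete, here is a toy word count that follows the map, shuffle, and reduce steps in plain base R on a single machine. It is only a sketch of the programming model: it does not use Hadoop, RHIPE, or RHadoop, and the input and all function names are made up.

```r
docs <- c("big data and r", "r and hadoop", "hadoop and big data")  # toy input

# Map: each document emits (word, 1) pairs as a named integer vector
map_fn <- function(doc) {
  words <- strsplit(doc, " ", fixed = TRUE)[[1]]
  setNames(rep(1L, length(words)), words)
}
mapped <- unlist(lapply(docs, map_fn))

# Shuffle: group the emitted values by key (the word)
grouped <- split(unname(mapped), names(mapped))

# Reduce: sum the values collected for each key
counts <- sapply(grouped, function(values) Reduce(`+`, values))
counts
```

In a real cluster the map and reduce functions run on different nodes and the framework performs the shuffle; only the two user-written functions would look similar to this.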
The Marriage of Hadoop and R (2)
  RHIPE (R and Hadoop Integrated Processing Environment) is an R package developed by Saptarshi Guha, then a Ph.D. student in statistics at Purdue Univ.
    It lets R run MapReduce-style distributed processing in a Hadoop environment
    It can be used on Amazon EC2 (http://www.stat.purdue.edu/~sguha/rhipe/doc/html/ec2.html)
  Recently, Revolution Analytics released an open-source package called RHadoop
  RHIPE code
  Guha's lecture on R + RHIPE at Facebook: http://www.lecturemaker.com/2011/02/rhipe/
Question! Do analysts need to know Map/Reduce?
  Pig, Streaming, native map/reduce with Java, RHIPE, RHadoop, ...
NexR's Way for Big Data Analysis
      select * from foo;
  Map/Reduce for data analysis? It has to be learned, and it is hard.
  SQL for data analysis! Most analysts do not need to learn it, and it is easy.
RHive Sample: Flight Delay Prediction
  R: build a flight-delay prediction model using linear regression on a training data set (sampled from Hive)
  Hive: run the prediction model (R objects) against the entire data set in Hive
  Data set: airline on-time performance, http://stat-computing.org/dataexpo/2009/
    Flight arrival and departure details for all commercial flights within the USA, from October 1987 to April 2008.
      library(RHive)
      rhive.connect("127.0.0.1")

      # get a training data set from Hive
      trainset <- rhive.query("select dayofweek,arrdelay,distance FROM airlines",
                              fetchsize=30, limit=100)

      # convert to numeric, and drop rows with missing values
      trainset$arrdelay <- as.numeric(trainset$arrdelay)
      trainset$distance <- as.numeric(trainset$distance)
      trainset <- trainset[!(is.na(trainset$arrdelay) | is.na(trainset$distance)),]

      # create a prediction model using R model objects and internal functions
      model <- lm(arrdelay ~ distance + dayofweek, data=trainset)
      rhpredict <- function(arg1, arg2, arg3) {
        # Hive passes missing values as the string "NULL"
        if (arg1 == "NULL" || arg2 == "NULL" || arg3 == "NULL")
          return(0.0)
        res <- predict.lm(model, data.frame(dayofweek=arg1, arrdelay=arg2, distance=arg3))
        return(as.numeric(res))
      }
      null <- "NULL"

      # set up R objects in Hive
      rhive.assign("null", null)
      rhive.assign("rhpredict", rhpredict)
      rhive.assign("model", model)

      # export the R prediction model and run it in Hive
      rhive.exportAll("rhpredict", c("10.1.3.2","10.1.3.3","10.1.3.4","10.1.3.5","10.1.3.6","10.1.3.7"))
      rhive.query("create table delaypredict as select R('rhpredict', dayofweek, arrdelay, distance, 0.0) from airlines")
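The pattern in the RHive sample, fitting a model on a small sample and then scoring the entire data set, can be tried locally without a Hive cluster. The sketch below uses synthetic data as a stand-in for the airlines table; the column names follow the sample, but the values, the sample sizes, and the assumed linear relationship are invented for illustration.

```r
set.seed(1)
n <- 1000

# Synthetic stand-in for the airlines table (not the real data set)
airlines <- data.frame(
  dayofweek = sample(1:7, n, replace = TRUE),
  distance  = round(runif(n, 100, 2500))
)
airlines$arrdelay <- 5 + 0.002 * airlines$distance + rnorm(n, sd = 10)

# Fit on a small training sample, as the RHive sample does with limit=100
trainset <- airlines[sample(n, 100), ]
model <- lm(arrdelay ~ distance + dayofweek, data = trainset)

# Score every row of the full table with the model fitted on the sample
airlines$predicted <- predict(model, newdata = airlines)
head(airlines, 3)
```

With RHive the scoring step runs inside Hive across the cluster nodes instead of in the local R session, which is what allows it to cover data that does not fit in memory.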
Advice for Big Data Analysis with R
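About KRUG: Areas of Activity
  KRUG (Korean R Users Group)
    A non-profit user group, officially active since January 2007, that follows the GNU philosophy: it translates documents and shares knowledge and skills so that Korean speakers can use R correctly and easily.
  Online activities: document translation, technology sharing, Q&A
  Hosting the R User Conference
    http://www.openstatistics.net
    http://www.r-project.kr/
  External cooperation: rights to translate and distribute documents, white papers, and blogs
  Offline activities: technical exchange through meetups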
Korea R CRAN Mirror Statistics
  About 24,000 package downloads in the 6 days after opening
  3,392 of the 3,452 R packages were downloaded
Q & A
  http://freesearch.pe.kr
  madjakartra@gmail.com