<Insert Picture Here> Big Data 분석을위한 Oracle 의전략 Oracle R Enterprise 한승철 (sean.han@oracle.com) Exadata & Appliance Solution Consulting
목 차 R 의소개 R 이란? 오픈소스 R 의제약사항 Oracle 의 Big Data 분석전략 Big Data 분석을위한 Oracle 의 R 전략 Oracle R Enterprise 접근법 Oracle R Enterprise 소개 ORE 특징 ORE 실행절차 Transparency Layer Statistics Engine Embedded R Execution Big Data 분석을위한 ORE 와 Exadata 2
R 이란? 1994년개발 Open Source 상용플랫폼 시각화의용이성 Add-on 패키지설치로기능확장 The community of users and developers around R are 2-million strong. A significant body of their work on R is available freely on The Comprehensive R Archive Network. 두터운사용자층 Names You Need to Know in 2011: R Data Analysis Software, Forbes on Nov 10, 2010 3
오픈소스 R 의제약사항 단일사용자모드 다중쓰레드사용불가 메모리제약 보안문제 Enterprise 환경에 적합하지않음 4
정형 반정형 비정형 Oracle 의 Big Data 분석전략 Master & Reference Transactions Machine Generated Oracle 11g Files Oracle NoSQL Oracle Data Integrator Oracle Golden Gate Oracle 11g Oracle TimesTen Endeca MDEX Oracle Advanced Analytics Option (ORE & ODM) Oracle BI Enterprise Oracle Real Time Decisions Text, Image, Video, Audio Oracle Hadoop HDFS Oracle Hadoop MapReduce Oracle Essbase Oracle Endeca Information Discovery Data Acquire Organize Analyze Decide 5
Big Data 분석을위한 Oracle 의 R 전략 R 작업콘솔 Oracle R Connector for Hadoop MapReduce in R (ORCH) HDFS 파일 Bulk import Oracle R Enterprise Oracle statistics engine External Tables File systems (ORE) Applications, BI, Web Services Database Links Other databases 6
Oracle R Enterprise 접근법 데이터와분석모듈이 DB 에저장되고 DB 내에서실행 R Open Source 기존과동일한 R 개발환경 운영시스템에임베딩가능 Oracle Data Mining 와공유 7
Oracle R Enterprise 의특징 Transparency Layer R 데이터타입 DB 데이터타입 R 스크립트를 SQL로변환 R 만으로 DB내의데이터를직접조작가능 데이터셋의용량에무관 Statistics Engine 자주쓰는통계라이브러리를 DB 에서실행 SQL 확장 운영시스템에서 R 코드를임베딩가능 데이터의이동불필요 보안관련이슈에정답제공 8
Oracle R Enterprise 실행절차 1 2 3 R Engine 추가 R 패키지 Oracle R Enterprise 패키지 SQL 결과 Oracle Database User tables R 결과 R Engine 추가 R 패키지 Oracle R Enterprise 패키지 User R Engine on desktop Transparency 프레임워크는 R 함수를 SQL 로변환 DB 에전체 R 스크립트를전달하여실행 Database Compute Engine 큰용량의데이터에도확장성보장 변환된 SQL 의병렬실행지원 분석및데이터마이닝기능을동시에사용가능 R Engine(s) managed by Oracle DB DB 자체가 R 엔진의병렬실행을보장하기위해다수의 R 엔진을관리 데이터를 MapReduce 방식으로각 R 엔진에병렬전송함으로써효율성증대 R 스크립트를원격으로실행가능 사용자에게결과표시 DB 내에서 R 엔진동시실행 DB 에없는분석방법 9
( 초 ) 0 5 10 15 20 25 30 Transparency Layer 성능비교평가 투명하게 R 함수및데이터타입을 SQL 로변환 데이터처리함수 (select, project, join) 집계함수 (avg, sum, summary) 고급통계분석및함수 (gamma, beta, step-wise regression) 성능의향상뿐만아니라데이터의용량제한문제를효과적으로해결가능 적용사례 ) ONTIME 데이터에서지연도착이심한 36개공항의상자그림그리기메모리부족으로오픈소스 R 실행불가 120K 행 오픈소스 R 120M 행 Oracle R Enterprise 10
Transparency Layer aggdata <- aggregate(ontime$dest, class(aggdata) head(aggdata) by = list(ontime$dest), FUN = length) Client R Engine Transparency Layer Oracle R package with OBJ1_13 as ( select "DEST" NAME001, "DEST" VAL001, count(*) VAL002 from "OREADMIN"."ONTIME" where ("DEST" is not null) group by "DEST" ) select * from OBJ1_13 order by NAME001; Oracle Database ONTIME In-DB Statistics 11
Transparency Layer ontime <- ONTIME delay <- ontime$arrdelay dayofweek <- ontime$dayofweek bd <- split(delay, dayofweek) boxplot(bd, notch = TRUE, col = "red", cex = 0.5, outline = FALSE, axes = FALSE, main = "Airline Flight Delay by Day of Week", ylab = "Delay (minutes)", xlab = "Day of Week") axis(1, at=1:7, labels=c("monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")) axis(2) 12
Statistics Engine 특수함수 Gamma function Natural logarithm of the Gamma function Digamma function Trigamma function Error function Complementary error function 통계검정 Chi-square, McNemar, Bowker Simple and weighted kappas Cochran-Mantel-Haenzel correlation Cramer's V Binomial, KS, t, F, Wilcox 분포및확률함수 Transparency Layer Beta distribution Binomial distribution Cauchy distribution Chi-square distribution Exponential distribution F-distribution Gamma distribution Geometric distribution Log Normal distribution Logistic distribution Statistics Engine 특수함수통계검정분포함수확률함수 SAS 관련함수 Freq, Summary, Sort Rank, Corr, Univariate Negative Binomial distribution Normal distribution Poisson distribution Sign Rank distribution Student's t distribution Uniform distribution Weibull distribution Density Function Probability Function Quantile 13
Embedded R Execution DB 함수로매핑이되지않는함수인경우에적용 데이터베이스를데이터 Provider 의목적으로활용 데이터레벨의병렬처리가능 SQL 함수 R 함수 함수설명 rqeval() ore.doeval() 독립적으로 R 스크립트실행 rqtableeval() ore.tableapply() 하나의테이블전체를이용한 R 스크립트실행 rqroweval() ore.rowapply() 한번에테이블의일부분만을이용한 R 스크립트실행 rqgroupeval() ore.indexapply() 테이블을특정컬럼값으로파티셔닝한후각파티션에대해 R 스크립트실행 sys.rqscriptcreate() R 스크립트생성 sys.rqscriptdrop() R 스크립트삭제 14
Embedded R Execution R Interface modlist <- ore.groupapply( X = ONTIME, INDEX = ONTIME$DEST, function(dat) { lm(arrdelay ~ DISTANCE + DEPDELAY, dat) }); modlist_local <- ore.pull(modlist) summary(modlist_local$bos) ## return model for BOS 1 Client R Engine Transparency Layer ORE 2 DB R Engine ORE Oracle Database 3 4 User tables DB R Engine ORE rq*apply() interface extproc extproc 5 5 15
Embedded R Execution SQL Interface begin sys.rqscriptdrop('example1'); sys.rqscriptcreate('example1', 'function(dat,datastore_name) { mod <- lm(arrdelay ~ DISTANCE + DEPDELAY, dat) ore.delete(datastore_name) ore.save(mod, name = datastore_name) }'); end; / select * from table(rqtableeval( cursor(select ARRDELAY, DISTANCE, DEPDELAY from ontime), cursor(select 1 "ore.connect", 'mydatastore' as "datastore_name" from dual), 'XML', 'Example1')); begin sys.rqscriptcreate('example2', 'function(dat, datastore_name) { ore.load(datastore_name) prd <- predict(mod, newdata=dat) prd[as.integer(rownames(prd))] <- prd res <- cbind(dat, PRED = prd) res }'); end; / select * from table(rqtableeval( cursor(select ARRDELAY, DISTANCE, DEPDELAY from ontime where year = 2003 and month = 5 and dayofmonth = 2), cursor(select 1 "ore.connect", 'mydatastore' as "datastore_name" from dual), 'select ARRDELAY, DISTANCE, DEPDELAY, 1 PRED from ontime', 'Example2')) order by 1, 2, 3; 16
Big Data 분석을위한 ORE 와 Exadata OREpredict 패키지는 Exadata storage 서버에서 R 모형에대한 Scoring 을지원 다양한표본추출을위한지원이 Database 내에서지원 OREdm 패키지는고성능의 Oracle Data Mining 예측알고리즘지원 17
Big Data 분석을위한 ORE 와 Exadata - Scoring Oracle Database 에서모형스코어링을위한가장빠른방법 R-generated Predictive Model ore.predict() SQL 모형구축에서스코어링까지의절차단순화 현재지원가능한통계모형 glm, glm.nb, hclust, kmeans, lm, multinom, nnet, rpart 개발완료된모형은 ORE 전용 datastore 에저장되어관리됨 Oracle Database R Object Datastore Data 18
Big Data 분석을위한 ORE 와 Exadata - Sampling Simple random sampling Split data sampling Systematic sampling Stratified sampling Cluster sampling Quota sampling Accidental sampling dat <- ore.pull( ) samp <- dat[sample(nrow(x),size,] samp <- x[sample(nrow(x), size),,] samp <- ore.pull( ) Oracle Database Data Oracle Database Data 19
Summary Oracle R Enterprise Oracle R Connector for Hadoop 20
관련자료 블로그 : https://blogs.oracle.com/r/ 포럼 : https://forums.oracle.com/forums/forum.jspa?forumid=1397 Oracle R Distribution: http://www.oracle.com/technetwork/indexes/downloads/r-distribution-1532464.html ROracle: http://cran.r-project.org/web/packages/roracle Oracle R Enterprise: http://www.oracle.com/technetwork/database/options/advanced-analytics/r-enterprise Oracle R Connector for Hadoop: http://www.oracle.com/us/products/database/big-data-connectors/overview 21
22
23