Apache Tajo 와 R 을연동한빅데이터분석 고영경 / 그루터 ykko@gruter.com
목차 : R Tajo Tajo RJDBC Tajo Tajo UDF( ) TajoR Demo Q&A
R 과빅데이터분석 ' R 1) R 2) 3) R (bigmemory, snowfall,..) 4) R (NoSQL, MapReduce, Hive / RHIPE, RHive,..) 5) 6) ~
R 과빅데이터기술의역할분담 Hadoop (Hadoop/MapReduce, NoSQL, SQL-on-Hadoop), R
Tajo 소개 Apache Tajo : 대용량분산 Data Warehouse 시스템 (SQL-on-Hadoop) Tajo ( ) SQL MapReduce (M/R, Hive, Pig) 3 (HDFS, S3, Swift, Local filesystem, HBase, ElasticSearch ) (CSV, JSON, Apache log,, ) Tajo API, JDBC (ODBC ) UDF (User-Defined Function) - ( )
Tajo 로처리하기적합한작업 Tajo SQL (eg.,,, JOIN) JSON, Apache log HDFS, Local file, Amazon S3 iterative (eg. machine learning) (eg. )
다양한환경에서 Tajo 사용하기 Tajo Desktop Package ( ) Mac, Linux stand-alone Tajo 엑셀로다루기힘든큰데이터를변환 /Load 없이바로 SQL로분석 mysql, mysql Tajo-as-a-Service (http://taas.gruter.com) Tajo S3 (No ETL) EMR bootstrap action EC2 AMI (Amazon Machine Image) ( ) Tajo DW H/W ( G-DPU)
Tajo 와 R 연동 - 데이터준비 1. Airline $ head /Users/ykko/tajo/data/air2008/air2008.csv 2008,1,3,4,926,930,1054,1100,WN,1746,N612SW,88,90,78,- 6,- 4,IND,BWI,515,3,7,0,,0,NA,NA,NA,NA,NA 2008,1,3,4,1829,1755,1959,1925,WN,3920,N464WN,90,90,77,34,34,IND,BWI,515,3,10,0,,0,2,0,0,0,32 2008,1,3,4,1937,1830,2037,1940,WN,509,N763SW,240,250,230,57,67,IND,LAS,1591,3,7,0,,0,10,0,0,0,47 2008,1,3,4,617,615,652,650,WN,11,N689SW,95,95,70,2,2,IND,MCI,451,6,19,0,,0,NA,NA,NA,NA,NA 2. Tajo CSV External Table CREATE EXTERNAL TABLE ext2008 ( year int, month int, ) USING text with ('csvfile.delimiter'=' ') LOCATION file:///users/ykko/tajo/data/air2008/ ; 3. ( ) CREATE TABLE air2008 AS SELECT year, month,, CASE airtime WHEN 'NA' THEN null ELSE airtime::int END as airtime, FROM ext2008;
RJDBC 로 Tajo 질의실행 library(rjdbc) drv <- JDBC("org.apache.tajo.jdbc.TajoDriver", "/Users/ykko/tajo/tajo_single_jdbc/tajo- jdbc- 0.10.0- SNAPSHOT.jar") conn <- dbconnect(drv, "jdbc:tajo://localhost:26002/default", "", "") sql <- "SELECT concat(origin, ' - > ', dest) as flight, count(*) as delay_cnt FROM airline.air2008 WHERE weatherdelay > 0 GROUP BY flight ORDER BY delay_cnt desc LIMIT 10 " dfrm <- dbgetquery(conn, sql) R> dfrm flight delay_cnt 1 DTW - > LGA 435 2 ORD - > DFW 418 3 ORD - > CVG 383 4 ORD - > MSP 371 5 DTW - > ORD 370
질의결과시각화 library(ggplot2) ggplot(dfrm, aes(flight, delay_cnt)) + geom_bar(aes(fill = flight), stat = "identity") + theme(axis.text.x = element_text(angle = 45))
대용량분산처리 ( 맛보기 ) 4 Tajo (1 master, 3 workers) / TPC-H 100GB conn <- dbconnect(drv, "jdbc:tajo://app01:26002/default", "", "") sql <- "SELECT count(*) as cnt FROM tpch100g.lineitem cnt <- dbgetquery(conn, sql) sql <- "SELECT l_shipdate, count(*) as cnt FROM lineitem GROUP by l_shipdate" dfrm <- dbgetquery(conn, sql)
Tajo 관리 UI - 질의실행상황확인 http://tajo-master-server:26080
다양한저장소지원, : Efficient In-situ Processing of Various Storage Types on Apache Tajo (, Hadoop Summit 2015 )
다양한저장소지원 Hadoop (HDFS), Amazon S3, Local Filesystem CREATE EXTERNAL TABLE ext2008 ( year text, ) USING text with ('csvfile.delimiter'=',') LOCATION 'hdfs://server:9000/airline/''; - - LOCATION 'file:///users/ykko/downloads/'; - - LOCATION 's3://tajo- data- us- east- 1/airline/'; HBase (cf. HBase Integration ) USING hbase WITH ('table'='blog', 'columns'=':key,info:author,info:date,content:title'); OpenStack Swift (cf. Apache Tajo on Swift ) ElasticSearch, Kafka 지원도개발중
다양한데이터포맷지원 CSV, JSON (No ETL) Custom (De)Serializer / SQL ex. Pluggable Scanner Tajo Tuple Tajo Engine JSON JSON JSON file JSON Scanner Tajo Apache Log Apache Log Scanner col1 col1 col2 col2 col3 col1 col2 col3 col3 SQL Result Set My Custom Log My Custom Scanner
다양한데이터포맷지원 - 웹로그분석예제 LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User- Agent}i\"" combined 61.23.4.16 - - [15/Oct/2014:09:00:22 +0900] "GET /main HTTP/1.1" 200 942 "- " "Mozilla/..." 65.13.2.96 - - [15/Oct/2014:09:01:23 +0900] "GET /help HTTP/1.1" 200 242 "- " "Mozilla/..." 65.33.6.10 - - [15/Oct/2014:09:01:23 +0900] "GET /view HTTP/1.1" 200 810 "- " "Mozilla/..." Tajo External Apache Log SerDe CREATE EXTERNAL TABLE web_logs ( remote_addr TEXT, logname TEXT, remote_user TEXT, access_timestamp TIMESTAMP, request_method TEXT, request_path TEXT, http_version TEXT, response_status TEXT, transferred_bytes INT, referrer TEXT, user_agent TEXT ) USING TEXT WITH ( 'text.serde'='org.apache.tajo.storage.text.apacheloglineserde', 'text.delimiter'=' ','text.null'='- ' ) LOCATION 'file:///users/ykko/tajo/tajo- desktop- weblog- 2.0/data/web- log'; SQL (eg. ) SELECT extract(hour from access_timestamp) as hh, count(*) as cnt FROM web_logs GROUP BY hh ORDER by hh * SerDe ( https://github.com/hyunsik/tajo/tree/web_log_parser )
Tajo UDF ( 사용자정의함수 ) Tajo UDF SQL Function UDF (User-Defined Functions) :. Java, Python eg. Pearson Correlation Tajo UDF airline> SELECT corr(actualelapsedtime, crselapsedtime) FROM air2008; Progress: 8%, response time: 1.139 sec Progress: 8%, response time: 1.14 sec Progress: 100%, response time: 1.324 sec?corr - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 0.9798745590752236 (1 rows, 1.324 sec, 19 B selected) * corr UDF : https://issues.apache.org/jira/browse/tajo-1661
TajoR 로드맵 R Tajo CRAN Tajo JDBC (eg. ) Approximate query (eg. time-bound, error-bound) R Tajo UDF, (eg. t-test ) (multi-tenancy, Fair scheduler ) Data Frame API Tajo data frame. SQL.
함께해요 Tajo.. Getting Started Tajo Desktop Package EMR bootstrap action TajoR feedback Tajo. Apache Tajo : http://tajo.apache.org Tajo : /
GRUTER: YOUR PARTNER IN THE BIG DATA REVOLUTION Phone +82-2-508-5911 Fax +82-2-508-5912 E-mail Web contact@gruter.com www.gruter.com