Introduction to Big Data Architecture
Sangbae Lim (sangbae.lim@oracle.com)
Technology Sales Consulting, Oracle Korea
Agenda
Big Data Architecture Trends
Key Technologies by Big Data Adoption Stage
Business Direction and Use Cases
Summary
Q&A
Big Data Architecture Trends
Big Data Architecture Trends: convergence of open source with operational and informational systems
The current big data boom stems from the Hadoop ecosystem
Open-source analytic platforms are being repositioned as complementary to existing systems, not replacements for them
Open source (Hadoop) + operational/informational systems = the next-generation big data enterprise architecture
Big Data Trends (data-processing-pipeline view)
Unstructured/semi-structured data: SNS, machine data, log data, images, documents
Structured data: OLTP, ERP, CRM
Processing (batch-oriented): ETL/ELT, data integration, statistics, data mining, traditional BI, machine learning, in-memory processing, Complex Event Processing
Storage: HDFS (batch), NoSQL (real time), RDBMS, connectors, engineered systems
Output: decision making
Key Technologies by Big Data Adoption Stage
Big Data Solution Spectrum (by data variety, from unstructured/stream to schema)
Schema-less path: Acquire (HDFS, NoSQL) -> Organize (Big Data Appliance, open-source Hadoop) -> Analyze (Hadoop MapReduce, Data Integrator, Advanced Analytics, Data Mining, R) -> Decide/Visualize (OEP, Exalytics, BI)
Stream path: Complex Event Processing / Event Stream Processing vs. Simple Event Processing
Schema path: DBMS: OLTP -> ETL -> DBMS: DW (Exadata, Spatial, Graph)
Acquire : Big Data
ACQUIRE (HDFS, NoSQL) -> ORGANIZE -> ANALYZE -> DECIDE
Acquire all available data, both schema-based and non-relational
Considerations when choosing a Hadoop infrastructure
Key technologies evolve quickly
Built by Hadoop experts, focused on what large clusters need, with an open approach
Proven in large-scale environments
Open-source components curated and tested by Cloudera, with multi-function GUI management tools provided
Cloudera CDH Components Hadoop Hive Pig HBase Zookeeper Flume Sqoop Mahout Whirr Oozie Fuse-DFS Hue
Why adopt Cloudera CDH: stock Hadoop vs. Hadoop with Cloudera Manager
NoSQL Database Use Cases (data capture/services)
Web applications
Sensor/statistics/network capture
Distributed backup service providers
Online services, social media
Scalable authentication services
Personalization
Common traits: queries are simple, schemas are dynamic, data volumes are high
Oracle NoSQL DB Request Processing

majorComponents.add("Smith");
majorComponents.add("Bob");
minorComponents.add("phonenumber");
Key myKey = Key.createKey(majorComponents, minorComponents);
String data = "408 555 5555";
Value myValue = Value.createValue(data.getBytes());
kvstore.put(myKey, myValue);
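The major/minor key-path model in the snippet above can be sketched with a toy in-memory store. This is an illustrative stand-in in Python, not the real Oracle NoSQL DB client API (which is the Java `KVStore` shown above); the class and method names here are invented for the sketch:

```python
# Toy sketch of Oracle NoSQL DB's major/minor key-path model.
# Hypothetical in-memory stand-in; not the real client API.

class ToyKVStore:
    def __init__(self):
        self._data = {}

    def put(self, major, minor, value):
        # Records sharing a major path hash to the same partition in the
        # real store; here the full (major, minor) pair is just a dict key.
        self._data[(tuple(major), tuple(minor))] = value

    def get(self, major, minor):
        return self._data.get((tuple(major), tuple(minor)))

store = ToyKVStore()
# Mirrors the Java snippet: the major path identifies the record owner,
# the minor path identifies the field.
store.put(["Smith", "Bob"], ["phonenumber"], "408 555 5555")
print(store.get(["Smith", "Bob"], ["phonenumber"]))  # 408 555 5555
```

The point of the split key is locality: all minor-key records under one major path live on the same partition, which is what makes single-owner multi-record operations efficient.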
Organize : Big Data
ACQUIRE (HDFS, NoSQL) -> ORGANIZE (MR, Hive, Pig, Oracle Big Data Connectors) -> ANALYZE -> DECIDE
Organize and distill data using massive parallelism
Hive (HiveQL compiled to MapReduce)

FROM (SELECT a.status, b.school, b.gender
      FROM status_updates a JOIN profiles b
        ON (a.userid = b.userid AND a.ds = '2009-03-20')) subq1
INSERT OVERWRITE TABLE gender_summary PARTITION(ds='2009-03-20')
  SELECT subq1.gender, COUNT(1) GROUP BY subq1.gender
INSERT OVERWRITE TABLE school_summary PARTITION(ds='2009-03-20')
  SELECT subq1.school, COUNT(1) GROUP BY subq1.school

Source: "Hive - A Petabyte Scale Data Warehouse Using Hadoop", Facebook Data Infrastructure Team
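The multi-table insert above fans a single scan of the joined data out into two summaries. A minimal Python sketch of the same logic, with made-up sample rows and plain dictionaries standing in for the Hive tables:

```python
# Sketch of the Hive multi-insert: one pass over the joined, filtered rows
# feeds both the gender summary and the school summary.
from collections import Counter

status_updates = [  # (userid, status, ds)
    (1, "hi", "2009-03-20"),
    (2, "yo", "2009-03-20"),
    (3, "hey", "2009-03-19"),
]
profiles = {1: ("Stanford", "F"), 2: ("MIT", "M"), 3: ("MIT", "F")}

gender_summary, school_summary = Counter(), Counter()
for userid, status, ds in status_updates:
    if ds == "2009-03-20" and userid in profiles:  # join + partition filter
        school, gender = profiles[userid]
        gender_summary[gender] += 1                # first INSERT OVERWRITE
        school_summary[school] += 1                # second INSERT OVERWRITE

print(dict(gender_summary))  # {'F': 1, 'M': 1}
print(dict(school_summary))  # {'Stanford': 1, 'MIT': 1}
```

This is exactly what makes the multi-insert form attractive: Hive reads and joins the input once, then writes both output tables from that single pass.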
GUI-based Hadoop job execution: a GUI lowers the barrier to using Hadoop technologies (Oracle Data Integrator)
Oracle Data Integration for Big Data: improves the productivity and efficiency of big data processing
Flow: Oracle Data Integrator transforms via MapReduce, then activates Oracle Loader for Hadoop, which loads into Oracle Exadata
Benefits: higher productivity for big data processing; load jobs optimized with Oracle Loader for Hadoop; Hadoop complexity reduced through GUI tooling
Pig
Pig (MapReduce vs. Pig)

Users    = load 'users' as (name, age);
Filtered = filter Users by age >= 18 and age <= 25;
Pages    = load 'pages' as (user, url);
Joined   = join Filtered by name, Pages by user;
Grouped  = group Joined by url;
Summed   = foreach Grouped generate group, COUNT(Joined) as clicks;
Sorted   = order Summed by clicks desc;
Top5     = limit Sorted 5;
store Top5 into 'top5sites';
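For comparison, here is the same pipeline restated imperatively in Python on made-up sample data (the Pig version expresses this declaratively and compiles it to MapReduce jobs automatically):

```python
# The Pig script above, restated as plain Python on toy data.
from collections import Counter

users = [("alice", 20), ("bob", 30), ("carol", 22)]          # load 'users'
pages = [("alice", "/a"), ("alice", "/b"), ("carol", "/a")]  # load 'pages'

filtered = {name for name, age in users if 18 <= age <= 25}        # filter
joined = [(user, url) for user, url in pages if user in filtered]  # join
clicks = Counter(url for _, url in joined)                 # group + count
top5 = clicks.most_common(5)                               # order + limit
print(top5)  # [('/a', 2), ('/b', 1)]
```

The contrast the slide is making: a handful of Pig relational operators replace a hand-written MapReduce job for each join, group, and sort step.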
Pig(Performance)
Oracle Loader for Hadoop
Provides high-performance parallel loading (data pre-partitioned and sorted using Hadoop)
Maximizes performance (Oracle internal formats)
Reduces DB CPU load (format conversion, partitioning, and sorting are done in Hadoop)
Oracle Loader for Hadoop: source Apache log file vs. the result stored in the DB
Oracle Loader for Hadoop: Online Option
1. Read target table metadata from the database
2. Perform partitioning, sorting, and data conversion in the map/shuffle/reduce phases
3. Connect to the database from reducer nodes and load into database partitions in parallel (JDBC or OCI)
Oracle Loader for Hadoop: Offline Option
1. Read target table metadata from the database
2. Perform partitioning, sorting, and data conversion in the map/shuffle/reduce phases
3. Write Oracle Data Pump files from the reducer nodes
4. Copy the files from HDFS to a location the database can access (or 4.1: access the Data Pump files in HDFS directly using ODCH, introduced later)
5. Import into the database in parallel using the external table mechanism
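The key idea shared by both options is that Hadoop, not the database, does the partitioning and sorting. A toy sketch of that shuffle step in Python, with an invented range partitioner and two "reducers" (the real loader routes rows by the target table's actual partitioning scheme):

```python
# Sketch of the loader's partition-and-sort step: each row is routed to a
# target-table partition by key, and each "reducer" emits its partition
# pre-sorted, so the database only ingests ready-made segments.

rows = [("2024-03-01", 7), ("2024-01-15", 3),
        ("2024-02-02", 9), ("2024-01-03", 1)]

def partition_for(date_key):
    # Toy range partitioner: January to partition 0, later months to 1.
    return 0 if date_key[5:7] == "01" else 1

# "Shuffle": route every row to its partition's bucket.
shuffled = {0: [], 1: []}
for key, val in rows:
    shuffled[partition_for(key)].append((key, val))

# Each "reducer" sorts its bucket before writing
# (JDBC load in the online option, Data Pump files in the offline option).
loaded = {p: sorted(bucket) for p, bucket in shuffled.items()}
print(loaded[0])  # [('2024-01-03', 1), ('2024-01-15', 3)]
```

Because the rows arrive already partitioned, converted, and sorted, the database skips exactly the work that normally burns its CPU during a bulk load, which is where the CPU-offload numbers on the next slide come from.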
Oracle Loader for Hadoop: Performance
5x to 20x load performance; 85% less database CPU load than third-party loaders
Oracle Direct Connector for HDFS (ODCH)
Directly access data files on HDFS from external tables: data written to HDFS by any MapReduce job can be read in place with a SQL query through an external table
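The external-table idea is that the database reads the HDFS files in place at query time rather than copying them in first. A rough Python analogy, with in-memory "files" standing in for HDFS and a generator standing in for the external table (all names here are invented for the sketch):

```python
# Rough analogy for an external table over HDFS: the "table" is just a
# scan over files in place; no load step happens before the query.
import io

# Two "HDFS part files" written by some MapReduce job (in-memory here).
part_files = [io.StringIO("alice,3\nbob,5\n"), io.StringIO("carol,2\n")]

def external_table(files):
    # Yield rows on demand, straight from the files, the way ODCH exposes
    # HDFS data to SQL without a prior load.
    for f in files:
        for line in f:
            name, n = line.strip().split(",")
            yield name, int(n)

# The "SQL query" over the external table.
total = sum(n for _, n in external_table(part_files))
print(total)  # 10
```

The trade-off mirrors the real feature: no load latency or staging copy, at the cost of paying the scan on every query.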
Oracle Direct Connector for HDFS (ODCH): 5x performance improvement; 75% less CPU load than third-party approaches
Analyze : Big Data
ACQUIRE (HDFS, NoSQL) -> ORGANIZE (MR, Hive, Pig) -> ANALYZE (Oracle R, ORE) -> DECIDE
Analyze all your data together
Data analysis support (open source R)
An open-source language and environment for statistical computing and graphics
Highly extensible; the common language of statistical analysis
Oracle R Connector for Hadoop: Features
- Interactive access to HDFS from R: manipulate and explore data stored in HDFS using R functions; transparent data movement among HDFS, R/Oracle DB, and the local file system from within the R environment
- Integration of Hadoop and R: R users can apply the MapReduce programming paradigm from the familiar R environment without learning Hadoop concepts; mapper, combiner, and reducer R functions are all supported, with no additional metadata coding required
Oracle R Connector for Hadoop

Java MapReduce (word count):

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    String s = value.toString();
    for (String word : s.split("\\W+")) {
      if (word.length() > 0) {
        output.collect(new Text(word), new IntWritable(1));
      }
    }
  }
}

public class SumReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int wordCount = 0;
    while (values.hasNext()) {
      IntWritable value = values.next();
      wordCount += value.get();
    }
    output.collect(key, new IntWritable(wordCount));
  }
}

The same map/reduce paradigm in R with the connector:

ontime <- ore.pull(ontime_s[ontime_s$year == 2007,])
ontime.dfs <- hdfs.put(ontime, key='dest')
res <- hadoop.run(
  ontime.dfs,
  mapper = function(key, ontime) {
    if (key == 'SFO') { keyval(key, ontime) } else { NULL }
  },
  reducer = function(key, vals) {
    sumad <- 0; count <- 0
    for (x in vals) {
      if (!is.na(x$arrdelay)) {
        sumad <- sumad + x$arrdelay
        count <- count + 1
      }
    }
    res <- sumad / count
    keyval(key, res)
  }
)
hdfs.get(res)
Oracle R Enterprise Approach
Store models in the database and run them there
Provides the same environment as standard R
Uses the database server's horsepower for R analysis (addressing R's memory and scale limits)
Complements Oracle Data Mining (Advanced Analytics)
Decide : Big Data
ACQUIRE (HDFS, NoSQL) -> ORGANIZE (MR, Hive, Pig) -> ANALYZE (Oracle R, ORE) -> DECIDE (Exalytics)
Make data-driven, statistics-based, real-time decisions
Big Data Connectors: Components
- Oracle Loader for Hadoop: efficiently loads data from Hadoop into Oracle DB
- Oracle Data Integrator Application Adapters for Hadoop: new ODI application adapters integrated with Hadoop, with support for generating Hadoop code
- Oracle R-to-Hadoop Connector: Oracle component that lets R programs run directly on HDFS data
- Oracle DirectHDFS: integrates SQL queries with HDFS data, supporting direct joins between SQL result sets and HDFS result sets
Business Direction and Use Cases
Big data business directions, by use-case type
- Anomaly detection: record the many events a business generates to learn normal and abnormal patterns, then judge whether a new event is anomalous (VISA: fraud-detection usage-pattern analysis on Hadoop, cut from 1 month to 13 minutes)
- Near-future prediction: nowcast rather than forecast; detect that a user is about to change their mind rather than that they already have, and respond preemptively (Japan's CyberAgent analyzes user behavior patterns to prevent churn)
- Current-state analysis: Japan's Nishitetsu Store is building a big-data-based accounting system, moving from monthly to daily accounting so it can track per-product cost-rate trends and target marketing at high-margin products
Source: "Big Data Business Applications and Challenges" (Federation of Korean Information Industries)
Summary
What is the foundation of big data architecture?
The Hadoop ecosystem and the RDBMS are complementary.
Once data has passed through the Acquire and Organize stages and its "big data" characteristics are gone, processing it in the SQL world we know best is the fastest, easiest, and safest approach.
Oracle Big Data Platform
Oracle Big Data Appliance (Acquire, Organize) | Oracle Big Data Connectors (Organize) | Oracle Exadata (Analyze) | Oracle Exalytics (Decide)
Oracle Big Data Appliance Software
Software pre-installed and pre-configured for optimal performance:
Oracle Linux 5.6
Java HotSpot VM
Cloudera CDH
Cloudera Manager
Oracle NoSQL Database CE/EE*
Oracle Big Data Connectors*
Open Source R Distribution
* Separately licensed software
Oracle Big Data Appliance Hardware
18 Sun X4270 M2 servers per rack
864 GB memory (48 GB x 18)
216 cores (12 x 18)
648 TB storage (36 TB x 18)
40 Gb/s InfiniBand fabric: inter-rack and inter-node connectivity
10 Gb/s Ethernet: data center connectivity
Full-rack configuration only
Advantages of the Oracle Big Data Platform
- Engineered system: top performance through tight hardware/software integration; the most reliable infrastructure for big data processing
- Stable technical support: Oracle and Cloudera support (24x7) covering Cloudera Hadoop, Oracle NoSQL, and Big Data Connectors
- Enterprise-architecture support: consistent, enterprise-wide integrated management of big data and database data through interconnection with existing Oracle databases
Questions