Creating Hidden Value from Big Data with Hadoop and Advanced Analytics. Sangbae Lim, General Manager (sangbae.lim@oracle.com), Technology Business Unit, Oracle Korea
Safe Harbor: The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle's products remains at the sole discretion of Oracle.
Big Data Strategy; Produce Data vs. Use Data; Big Data Use Cases
Big Data's Impact on Business: Run the Business; Transform the Business. (Figure labels: Volume, Velocity, Variety)
Big Data Strategic Recommendations: 1. Create a data reservoir for future value 2. Combine data to get fast answers to new questions 3. Apply predictive analysis to get a billion points of prediction 4. Accelerate data-driven actions
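Big Data Environment, the Biggest Technical Change (Storage / Processing): a shift from shipping data to the program to shipping the program to the data. (Diagram: Program and Data, before vs. after)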
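Big Data Environment, the Biggest Technical Change (Advanced Analytics): data and analytics in a single environment, with no data movement. Separate Advanced Analytics vs. In-DB Analytics. Benefits: no data movement; no data duplication; strong security; no additional infrastructure required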
Produce Data vs. Use Data: How can we narrow this gap? Capture data at volume; analyze all the data; a secure, unified data platform
Approaches to Implementing Big Data. Source: Google Trends (as of 2013-11-25)
Functional Assessment: Hadoop vs. Relational. [Radar chart, scale 0 to 5, comparing Hadoop on BDA with Oracle on Exadata. Axes: STP; tooling maturity; stringent non-functionals; ingestion rate; ACID transactions; cost-effective storage of low-value data; security; ETL simplicity; variety of data formats; data sparsity]
Unified Big Data Environment: not "vs." but "&"
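Big Data Architecture Build-Out Stages: extend the existing DW architecture incrementally to include Big Data. 1. Advanced analytics on existing data 2. Low Density Data: store and process low-density, high-volume data 3. Fast Data: real-time processing 4. Discovery: find new information through exploration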
Stage 1: Advanced Analytics on the DW. Run advanced analytics where the data resides, without moving it. Components: Business Data (ERP, CRM, SCM, etc.); Oracle Database with Advanced Analytics; Oracle BI Enterprise Edition. Flow: Acquire, Organize, Analyze, Decide
Stage 2: Introduce Low-Density Data Storage / Processing. Adopt Hadoop as an unstructured ODS for the DW, as an ETL assistant, and for MR-based algorithms. Components: Business Data (ERP, CRM, SCM, etc.); Unstructured Big Data; Hadoop (MapReduce algorithms; Aggregate, Pre-Analyze); Oracle Database with Advanced Analytics; Oracle BI Enterprise Edition. Flow: Acquire, Organize, Analyze, Decide
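The "Aggregate / Pre-Analyze" role of Hadoop in this stage can be pictured with a minimal, single-process sketch of the MapReduce pattern. It is not Hadoop code; the log line format and the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) are assumptions made for illustration. Low-density log lines go in, and compact aggregate rows suitable for loading into a database come out.

```python
from collections import defaultdict


def map_phase(line):
    """Emit ((date, url), 1) for one log line.

    Assumed (hypothetical) format: "<date> <time> <url> <status>".
    """
    date, _time, url, _status = line.split()
    yield (date, url), 1


def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all values for one key into a single aggregate row."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle, and reduce in one process and return sorted rows."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return sorted(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())
```

Each returned row has the shape `((date, url), hit_count)`, which is far smaller than the raw log and is the kind of pre-aggregated result that would then be handed to the database for analysis.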
Stage 3: Extend to a Real-Time Processing Architecture. Build a CEP-based environment for real-time strategy and response. Components: Business Data (ERP, CRM, SCM, etc.); Unstructured Big Data; Hadoop (Aggregate, Pre-Analyze); Oracle Database with Advanced Analytics; Oracle BI Enterprise Edition; Streaming Data; Event Processing; Model; Real Time Decisions. Flow: Acquire, Organize, Analyze, Decide, Act
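Database, [NoSQL & Hadoop] Best Together
RDBMS: best for business data (accounts, customers, etc.); high-density data; strict transaction processing (ACID); guarantees consistency and stability for many users; 100% SQL compliance; high cost
NoSQL: best for text such as SNS and blog content; partial consistency; tolerates delay; flexibility and efficiency; used for specialized purposes; complements the RDBMS; widens the range of choices
Hadoop: best for low-density data such as web / sensor logs; archival of existing data; parallel batch processing; no transaction support; suited to data preprocessing and aggregation; low cost
Storing data in the architecture that matches its characteristics is the starting point for reducing TCO.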
Types of Enterprise Hadoop Use Cases: 1. Extending the existing DW (Hot, Warm, Cold) 2. ETL role for unstructured / structured data 3. MapReduce-based analysis (MR-style algorithms). Diagram: Unstructured Big Data; Hadoop (ILM, ETL, MapReduce algorithms; Aggregate, Pre-Analyze); RDBMS with Advanced Analytics; BI. Hadoop roles: Query, ETL, Analysis. Flow: Acquire, Organize, Analyze, Decide
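Which Hadoop Should You Adopt? 100% open-source based, with rapid evolution of key technologies; implemented by Hadoop experts; focused on what large clusters need; open approach; proven in large-scale environments; managed and tested by Cloudera; open-source component management; multi-function management GUI tools provided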
Buy or Build? Building and Optimizing Hadoop Infrastructure
Oracle Big Data Appliance: balanced architecture; 6x higher performance; specialized connectors; 40 Gb/sec network; pre-built, optimized for Hadoop; redundancy built in; simplified support; automated install (Mammoth)
BYO Hadoop Cluster: traditional; low performance; no specialized connectors; 8 10 Gb/sec network; requires tuning; build-your-own HA; complex multivendor support; manual provisioning
Perfect Balance: Reducing Skew in Reducers. [Chart: reducer load per key (China, BE, NL, Lux); total runtime of the reduce phase]
Perfect Balance: Reducing Skew in Reducers. A Hadoop MR performance-enhancement feature provided on Oracle Big Data Appliance. [Chart: China split into C1, C2, C3 alongside BE, NL, Lux; time reduction with Perfect Balance compared with the original runtime of the reduce phase]
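Why splitting the China key shortens the reduce phase can be shown with a small sketch. This is not Oracle's Perfect Balance algorithm, which the slides do not describe; it is a generic greedy balancing heuristic, and the load numbers are invented. It assumes the reduce function can combine partial results, so that one key may be processed in several chunks.

```python
import heapq


def reduce_phase_runtime(assignments):
    """The reduce phase ends when the most heavily loaded reducer finishes."""
    return max(sum(loads) for loads in assignments)


def one_key_per_reducer(key_loads):
    """Skewed baseline: every key lands whole on its own reducer."""
    return [[load] for load in key_loads.values()]


def split_and_balance(key_loads, n_reducers, max_chunk):
    """Split heavy keys into chunks, then place chunks greedily.

    Chunks are assigned largest-first to the currently least-loaded reducer.
    """
    chunks = []
    for load in key_loads.values():
        while load > max_chunk:
            chunks.append(max_chunk)
            load -= max_chunk
        if load > 0:
            chunks.append(load)

    heap = [(0, i) for i in range(n_reducers)]  # (current load, reducer id)
    assignments = [[] for _ in range(n_reducers)]
    for chunk in sorted(chunks, reverse=True):
        total, i = heapq.heappop(heap)
        assignments[i].append(chunk)
        heapq.heappush(heap, (total + chunk, i))
    return assignments


# Hypothetical work units per key, loosely mirroring the chart's labels.
key_loads = {"China": 90, "BE": 10, "NL": 12, "Lux": 8}

skewed = reduce_phase_runtime(one_key_per_reducer(key_loads))
balanced = reduce_phase_runtime(split_and_balance(key_loads, 4, 30))
```

With these invented numbers the skewed run is bounded by the single China reducer, while the balanced run spreads China across three reducers and packs the small keys onto the fourth.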
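If You Still Want to Build It Yourself (DIY): you need both Hadoop infrastructure experts and Hadoop development experts; buy or reuse x86 servers (performance issues); how will you build the network infrastructure for the Hadoop cluster? (80/20); install -> configure -> tune (OS, JVM, network, Hadoop); what is the fallback when data skew occurs?; how will you apply patches for new Hadoop versions?; how will you configure HA? (NN, JT); how will you secure Hadoop? (Kerberos, Sentry, Audit); difficulties you will face when operating Hadoop (failures, no knowledge base)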
Unified Data Analytics Environment. Unified Analytics API: SQL, R, MR. Unified Analytics Processing Platform: Hadoop and RDBMS, connected over IB. Management Framework and Tools
Produce Data vs. Use Data: How can we narrow this gap? Capture data at volume; analyze all the data; a secure, unified data platform
Big Data Connectors and Data Integrator: 15 TB/hour, 10x faster. Big Data Appliance + Hadoop; Exadata + Oracle Database
Using Hadoop through Oracle SQL: join Hadoop (Hive) data with database data
Analyze All Your Data In-Place: Advanced Analytics. Big Data Appliance + Hadoop; Exadata + Oracle Database
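Oracle DBMS SQL & R: Analyze Data across All Your Systems. Hadoop and Oracle Database connected over IB, accessed through R and SQL. Broaden the data available for analysis and gain users who can analyze the data held in Hadoop. Use the powerful features of the Oracle SQL and R you already know to analyze all data, unstructured and structured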
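Advanced Analytics: Components. Oracle's RDBMS includes data mining, statistical analysis, and advanced analytics capabilities.
» Oracle Data Mining: in-database mining algorithms; mining of star schema, text, and transactional data; in-DB model build and apply; Exadata scoring; 50+ in-DB statistical functions
» Oracle R Enterprise: runs R inside the DB; some functions are translated to SQL; a broad library that supports in-DB statistical functions; supports all R packages through embedded R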
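The Value of Advanced Analytics: Value Proposition
Traditional Analytics: Data Extraction; Data Prep & Transformation; Data Mining Model Building; Data Mining Model Scoring; Data Preparation and Transformation; Data Import. Elapsed time: hours, days, or weeks
Oracle Advanced Analytics: Model Scoring; Embedded Data Prep; Model Building; Data Preparation. Elapsed time: seconds, minutes, or hours (savings)
Data stays in the database: scalable, parallel data mining algorithms implemented in the SQL kernel; automated data preparation; lower total cost of ownership (TCO); no data duplication; separate analytics servers eliminated; scalability, manageability, and security supported
10-100x PERFORMANCE: integrated with database functionality; in-DB analytics without data movement; information latency eliminated, from days or weeks down to minutes or hours
10x LOWER TOTAL COST OF OWNERSHIP: annual license fees for traditional statistics / mining packages saved or reduced; leverages the Oracle DB, DW, and BI technology platform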
SQL Developer / Oracle Data Miner 4.0: use R scripts in a GUI environment. The SQL Query node provides integrated support for R scripts
Oracle In-Memory DBMS (announced at OOW 2013): Fastest Query Performance. In-memory column store; scales up or scales out for very large data sets; scans use super fast SIMD vector instructions; billions of rows/sec scan rate per CPU core; >100x faster; joins up to 10x faster. [Diagram: the State column of the Sales table in the in-memory column store; the CPU loads multiple State values into a vector register and a SIMD vector compare checks all values against "CA" in 1 cycle]
Produce Data vs. Use Data: How can we narrow this gap? Capture data at volume; analyze all the data; a secure, unified data platform
Platform Strengths
Big Data Appliance + Hadoop: low-cost scalability; flexible schema on read; abstract storage model; open; rapid evolution
Exadata + Oracle Database: extreme performance; highly secure; analytic SQL; rich tool set; vast expertise
How can we leverage the strengths of both platforms?
Big Data Security: Big Data must be protected and must be audited. From a security standpoint it is no different from sensitive data stored in an existing RDBMS.
Big Data Security Considerations in Summary: AAA (the 3 As). Authenticate users; authorize access to data and services; audit activity and users. Accurate auditing depends on authentication and authorization, and authentication is essential.
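The dependency among the three As can be made concrete with a toy sketch. Everything here is hypothetical (the `SESSIONS` and `GRANTS` tables, the `access` function, the token value); it is not how Hadoop, Kerberos, or Sentry implement these steps. It only shows that an audit record is as trustworthy as the identity established in the authentication step.

```python
from datetime import datetime, timezone

# Hypothetical in-memory state, for illustration only.
SESSIONS = {"token-abc": "alice"}            # token -> authenticated user
GRANTS = {"alice": {("/fin_data", "read")}}  # user -> allowed (path, action)
AUDIT_LOG = []                               # append-only list of records


def access(token, path, action):
    """Authenticate, authorize, then audit a single access request."""
    # 1. Authenticate: derive the identity from a credential,
    #    never from a name the caller simply claims.
    user = SESSIONS.get(token)

    # 2. Authorize: check the authenticated identity against its grants.
    allowed = user is not None and (path, action) in GRANTS.get(user, set())

    # 3. Audit: record every attempt, allowed or denied.
    AUDIT_LOG.append({
        "time": datetime.now(timezone.utc).isoformat(),
        "user": user,  # None means the caller could not be authenticated
        "path": path,
        "action": action,
        "allowed": allowed,
    })
    return allowed
```

If step 1 accepted a self-declared user name instead of a verified credential, steps 2 and 3 would still run, but the grants would be checked for, and the log entry attributed to, whoever the caller claimed to be.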
User Impersonation: Directory / File-Level Access Permissions. Hadoop Distributed File System:
drwxr-xr-x - finance supergroup 0 2013-10-08 10:52 /fin_data
drwxr-xr-x - healthcare supergroup 0 2013-10-08 10:52 /health_data
[Diagram: masquerade security; Hadoop cluster; sensitive data owned by users / groups]
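The listing can be read with a simplified model of POSIX-style permission checks. This is a sketch, not HDFS code, and the helper names `parse_listing` and `allowed` are made up for illustration. The point is that the check relies entirely on the user name presented: if a client can simply claim to be `finance` (as is possible when Hadoop runs without strong authentication), it gets the owner's rights.

```python
def parse_listing(line):
    """Parse one line in the listing format shown on the slide."""
    mode, _repl, owner, group, _size, _date, _time, path = line.split()
    return {"mode": mode, "owner": owner, "group": group, "path": path}


def allowed(entry, user, groups, action):
    """Simplified owner/group/other check; action is 'r', 'w', or 'x'."""
    perms = entry["mode"][1:]  # drop the leading file-type character
    if user == entry["owner"]:
        triplet = perms[0:3]
    elif entry["group"] in groups:
        triplet = perms[3:6]
    else:
        triplet = perms[6:9]
    return triplet["rwx".index(action)] == action


fin = parse_listing("drwxr-xr-x - finance supergroup 0 2013-10-08 10:52 /fin_data")

outsider_can_write = allowed(fin, "mallory", [], "w")  # falls under "other"
imposter_can_write = allowed(fin, "finance", [], "w")  # claims the owner's name
```

Note also that `r-x` for "other" means any user can already list and read `/fin_data`; impersonating the owner adds write access on top of that.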
Demo screen: hacking the default Hadoop security model
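Strong Authentication through Kerberos. Hadoop services that require authentication: the major Hadoop services (Flume, Hue, Oozie, Hive, HBase, ZooKeeper, etc.) and 3rd-party connectors. Kerberos verifies that system access requests from users and services are legitimate. [Diagram: the client authenticates to the Key Distribution Center and gets a Ticket Granting Ticket, then accesses the service using a ticket; Kerberos service registration; Key Distribution Center (optional); Big Data Appliance]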
Introducing Apache Sentry: Authorization Module for Hive & Impala. Founded by Cloudera, Oracle, and friends; open source; donated to the Apache Software Foundation (incubating)
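Sentry Authorization Features: Authorization Module for Hive & Impala. Secure authorization: controls data access and data privileges. Fine-grained authorization: controls user access at the level of database subsets (columns). Role-based authorization: create and apply templated, role-based privileges. Multitenant administration: a separate policy per database / schema, each manageable by a different administrator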
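Role-based, fine-grained authorization can be sketched in a few lines. This is a generic illustration of the idea, not Sentry's API or policy file syntax; the `Privilege` and `Policy` classes and every name in the example are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Privilege:
    database: str
    table: str
    column: str  # "*" stands for every column in the table
    action: str  # for example "select"

    def covers(self, request):
        """True if this granted privilege permits the requested access."""
        return (
            self.database == request.database
            and self.table == request.table
            and self.column in ("*", request.column)
            and self.action == request.action
        )


class Policy:
    """Groups map to roles; roles map to privileges."""

    def __init__(self):
        self.role_privileges = {}  # role name -> set of Privilege
        self.group_roles = {}      # group name -> set of role names

    def grant(self, role, privilege):
        self.role_privileges.setdefault(role, set()).add(privilege)

    def assign(self, group, role):
        self.group_roles.setdefault(group, set()).add(role)

    def is_allowed(self, groups, request):
        """Allow if any role of any of the user's groups covers the request."""
        roles = set()
        for group in groups:
            roles |= self.group_roles.get(group, set())
        return any(
            privilege.covers(request)
            for role in roles
            for privilege in self.role_privileges.get(role, set())
        )
```

A role acts as the reusable template: granting `Privilege("sales", "orders", "amount", "select")` to an `analyst` role and assigning that role to a group lets its members select the `amount` column while a request for any other column of the same table is denied.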
Auditing in a Hadoop Environment: Cloudera Navigator Architecture. Audits the HDFS data and Hive metadata accessed through the HDFS, Hive, HBase, and Cloudera Impala services
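Oracle Audit Vault and Database Firewall: audit the Oracle DBMS and Hadoop with a single solution. One Audit Vault covering Hadoop (non-relational data), operating systems, and databases (relational data): a unified, secure repository for all audit data, and a centralized platform for audit reporting, early warning, and policy management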
Big Data Use Case
Turkcell: Anti-Fraud Predictive Analytics
Objectives: prepaid card fraud (millions of dollars/year); extremely fast sifting through huge data volumes; with fraud, time is money
Solution: monitor 10 billion daily call-data records; leveraged SQL for the preparation (1 PB); due to the slow process of moving data, Turkcell IT builds and deploys models in-DB; Oracle Advanced Analytics on Exadata for extreme speed; analysts can detect fraud patterns almost immediately
"We can analyze large volumes of customer data and call-data records easier and faster than with any other tool and rapidly detect and combat fraudulent phone use." Hasan Tongu Yılmaz, Manager
[Diagram: Oracle Advanced Analytics in-database fraud models on Exadata]
National Cancer Institute: Identifying Gene-to-Cancer Interaction Relationships. 17,000 genes; 60M patients; 5 major cancer types; 20M medical publications
Frederick National Laboratory: Gene/Cancer Co-Occurrence. The analysis revealed detailed pathways of gene activity, yielding a great deal of information for anticancer drug development.
Frederick National Laboratory
FSI: Full-Service Bank. Before: Mainframe. After: Mainframe, Oracle Big Data Appliance, Oracle Exadata. Challenges: reduce IT costs; comply with regulations requiring more data to support stress testing; consolidate and streamline data processing. Benefits: faster access to 6x more data; lower cost, simplified architecture; implemented in a matter of months
Big Data Use Cases (Manufacturing, Finance, Transportation, Retail)
Unified Data Analytic Environment
Oracle Big Data Appliance (optimized for Hadoop, R, and NoSQL processing): Hadoop (CDH Enterprise); Oracle R; Oracle NoSQL Database; Applications; Oracle Big Data Connectors
Oracle Exadata (system of record, optimized for DW/OLTP): Oracle Advanced Analytics (In-Database Analytics); Data Warehouse; Oracle Database
Oracle Exalytics (optimized for analytics & in-memory workloads): Oracle Enterprise Performance Management; Oracle Business Intelligence Applications; Oracle Business Intelligence Tools; Oracle Endeca Information Discovery
A complete Hadoop infrastructure; integration with a high-performance RDBMS; in-DB advanced analytics; exploration-based combined processing of structured + unstructured data; end-to-end enterprise architecture support, with a single point of maintenance and technical support