SAS FORUM A Data Platform Build Guide for SAS Users Leading the AI / Machine Learning Era Cloudera Korea, 임상배 Copyright SAS Institute Inc. All rights reserved.
Agenda: Cloudera Hadoop | How to Use SAS & Cloudera
Cloudera Hadoop Overview: Catching Up with Hadoop
Hadoop: a distributed processing framework based on the Google File System whitepaper published by Google in 2003.
The Old Way: compute (RDBMS, EDW) connected over the network to data storage (SAN, NAS). Hard to scale; the network inevitably becomes a bottleneck; only handles structured/relational data; difficult to add new fields and data types.
The Hadoop Way: compute (CPU), memory, and storage (disk) live together on each node. Scales out indefinitely; the network is eliminated as a bottleneck; easy to ingest any type of data; agile schema-on-read data access.
Cloudera Hadoop Overview: Catching Up with Hadoop
A cost-effective platform for storing and analyzing very large data volumes while minimizing data movement.
Then: bring data to compute. Process-centric businesses use structured data, internal data only, important data only.
Now: bring compute to data. Information-centric businesses use all data: multi-structured, internal and external data of all types.
Cloudera Hadoop Overview: Catching Up with Hadoop
- A distributed processing architecture spread across multiple servers.
- Distributed processing handles data faster and at greater volume than legacy systems (analyzing the full data set when needed).
- It can process data of every type and volume, structured or unstructured.
- Hadoop is the standard data platform adopted first in most data analytics system projects, in Korea and abroad.
- Integration with the existing analytics environment (SAS) is needed to maximize the value of prior IT investments and the skill sets of existing staff.
Hadoop: Transforming Enterprise Data Architecture
Explore new data, become a data-driven enterprise, on an open architecture.
Designed for the 3Vs of new data; native security; significantly lower cost; multiple analytic engines; agile development tools; extremely fast performance; rapid innovation; large ecosystem; no vendor lock-in.
Cloudera supports an enterprise-grade machine learning and analytics platform.
- Machine Learning: pattern recognition, anomaly detection, prediction (customers run ML on Cloudera).
- Analytics: self-service intelligence, real-time analytics, secure reporting (customers run Impala on Cloudera).
A typical adoption journey (use cases 1 to 4):
1. EDW Optimization: lowest-cost storage for account/customer/transaction data; offload from servers, marts (Teradata, SAS), storage, search, and archive.
2. Analytics / Self-Service BI: combine different workloads on common data (i.e. SQL + Search), optimize infrastructure usage, reduce BI backlog requests.
3. Predictive Analytics: proactively respond to issues, predictive maintenance, predict customer behaviour, reduce outages; driven by system logs and maintenance data.
4. Big Data App / Customer Care: churn analytics, discover unknowns, big data applications; driven by set-top box, mobile, online, and 3rd-party datasets.
Data sources: clickstreams, system logs, set-top box, mobile, 3rd-party data.
Evolution of the Hadoop Platform
The stack is continually evolving and growing. Starting from Core Hadoop (HDFS, MapReduce) in 2006, new projects were added year after year: Solr and Pig; HBase and ZooKeeper; Hive and Mahout; Sqoop and Avro; Flume, Bigtop, Oozie, HCatalog, and Hue; YARN; Spark, Tez, Impala, Kafka, and Drill; Parquet and Sentry; Knox and Flink; and by 2017, CDSW, Altus, Kudu, RecordService, Ibis, and Falcon.
Terms you hear on analytics projects these days: Hadoop, HDFS, Spark, Hive, Impala, Kudu
Terms you hear on analytics projects these days
- Hadoop: a distributed computing project whose purpose is to process large volumes of data in a parallel, distributed environment.
- HDFS (Hadoop Distributed File System): a file system that stores very large files across distributed servers, supporting data larger than the disk capacity of any individual node.
- MapReduce: a framework for batch processing of large data volumes, composed of Map and Reduce tasks.
- Node: usually one physical server.
- Cluster: multiple computers grouped so that they appear to be a single computer.
- Hadoop Ecosystem: Hadoop consists of the core project and subprojects that make Hadoop easier to use; because they resemble an ecosystem, they are collectively called the Hadoop ecosystem.
- Flume: provides real-time log data collection.
- Sqoop: imports data from relational databases into Hadoop and exports Hadoop data back to relational databases.
Terms you hear on analytics projects these days
- Hive: a SQL query engine; when a user writes SQL, the data is processed via MapReduce. Well suited to batch processing.
- Impala: an interactive SQL query engine that does not use MapReduce; it provides fast, memory-based data processing. Well suited to interactive, near-real-time queries.
- CDSW (Cloudera Data Science Workbench): a collaboration tool that supports ML/AI analytics work (R, Python, Scala).
- Cloudera Manager: installs/upgrades the Hadoop cluster and manages and monitors individual services; used by Hadoop cluster administrators.
- Hue: a UI tool for end users and administrators in the Hadoop environment (query tool, permission settings, job workflow authoring, and so on).
Anatomy of a Hadoop Cluster
- Masters: YARN (ResourceManager), Impala Catalog Store, Impala Statestore, NameNode, Secondary NameNode, HiveServer, ZooKeeper (x3), Cloudera Manager, Kudu Master (x3), HUE Server, Sentry Server, Oozie Server, HBase HMaster (x3); a Cloudera Manager (CM) Agent runs on every master host.
- Workers: each worker host runs a CM Agent, YARN resource pools, Search, an HBase RegionServer, an HDFS DataNode, an Impala Daemon, and a Kudu Tablet Server.
- Gateway(s): edge hosts for user applications, each with a CM Agent.
HDFS
- The NameNode (with a Standby or Secondary NameNode) manages the namespace; DataNodes store the data.
- A file is split into blocks (e.g. blocks X, Y, Z), and each block is replicated across DataNodes on different racks (Rack 1, 2, 3).
- Default block size = 128MB or 256MB.
A quick PROC HADOOP sketch of putting a file into HDFS follows below.
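As a minimal sketch of working with HDFS from a SAS session, the PROC HADOOP step below creates a directory and copies a local file into HDFS, where it is split into blocks and replicated as described above. The credentials and paths are hypothetical, and the step assumes the usual Hadoop configuration and JAR environment variables (e.g. SAS_HADOOP_CONFIG_PATH) are already set for the cluster.

/* Hedged sketch: HDFS file operations from SAS via PROC HADOOP.          */
/* User, password, and paths are hypothetical placeholders.               */
proc hadoop username='cloudera' password='cloudera' verbose;
   hdfs mkdir='/user/cloudera/flights';               /* create an HDFS directory            */
   hdfs copyfromlocal='/tmp/flt98.csv'
        out='/user/cloudera/flights/flt98.csv';       /* upload; HDFS splits it into blocks  */
run;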
Hive ("Don't forget who won the race, Bucko!")
- Spins up processes under the control of YARN
- Can handle the failure of a machine
- Will overflow joins to HDFS
- Architecture: clients (Beeline CLI, JDBC, ODBC) connect to HiveServer2, a Thrift service with a per-session Driver, Compiler, and Executor; the Hive Metastore keeps location, schema, file format, and SerDe information; the data itself lives in HDFS, blob storage, or other stores.
Impala (the fastest of the antelopes: "But I left you in the dust at the starting line, Grandpa!")
- Written in C++, no JVM
- Uses the Hive Metastore
- Employs algorithms from MPP databases
- Architecture: SQL applications connect via ODBC; Impala relies on the Hive Metastore, the Hadoop NameNode, and its own Statestore; each node runs a query planner, query coordinator, and query executor alongside the HDFS DataNode.
Spark
- In-memory caching, optimized scheduler, query optimizer
- Easy development: rich and flexible APIs for Scala, Java, and Python; seamlessly interleave SQL syntax with code; interactive shell
- Batch, stream, and machine learning: a unified framework for batch and stream processing; a rich collection of distributed ML algorithms
- (Diagram: a DAG of RDDs and cached partitions connected by operations such as map, groupby, join, filter, and take.)
Big Data Pipelines: which parts of the ecosystem are used at each project stage
- Data Ingestion: capture, move, stream
- Data Engineering: cleanse, conform, transform, enrich
- Data Stewardship: store, secure, govern, tag
- Data Science: model, score, enrich, predict
- Data Analytics: BI, online, APIs
SAS & Cloudera Partner Ecosystem
Cloudera has built an ecosystem of more than 2,800 partners: ISVs and solutions, resellers, cloud and platform providers, and system integrators.
SAS & Cloudera enable organizations to achieve competitive advantage by gaining value from all their data, through a proven combination of enterprise-ready storage, processing, analytics, and data management.
SAS & Cloudera Joint Customer Successes
- Optimize: With SAS Visual Analytics, business executives at Telecom Italia can compare the performance of all operators for a key indicator such as accessibility or percentage of dropped calls on a single screen, for a quick overview of pertinent strengths and weaknesses.
- Discover: Epsilon built a next-generation marketing application, leveraging Cloudera and taking advantage of SAS capabilities through its data science/analytics team, that provides its clients with a 360-degree view of their customers.
- Empower: Ameren provides 360-degree views into energy usage patterns and similar-household comparisons to help consumers save energy.
How to Use SAS & Cloudera
If you are already using SAS well and Hadoop has been (or is about to be) introduced, how should you work with the data stored in Hadoop?
- First of all, to a SAS user the data stored in Hadoop is simply another library.
- There are several access paths: HDFS, Hive, Impala, and so on.
- For fast interactive queries, Impala is recommended; for ETL, Hive or Hive-on-Spark is recommended (see the libname sketch below).
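As a minimal sketch of that recommendation, the statements below register the same Hadoop data twice, once through the Hive engine for ETL-style work and once through the Impala engine for interactive queries. The host name, credentials, table, and column are hypothetical, and the SAS/ACCESS client JARs and configuration files must already be in place.

/* Hedged sketch: the same Hadoop data exposed as two SAS libraries.        */
/* Host, credentials, table name, and column are hypothetical.              */
libname etlcdh hadoop server='quickstart.cloudera'
        user=cloudera password=cloudera;            /* Hive: batch / ETL work   */
libname adhoc  impala server='quickstart.cloudera'
        user=cloudera password=cloudera;            /* Impala: interactive use  */

proc sql;                                /* fast interactive count via Impala   */
   select count(*) from adhoc.mytable;
quit;

data etlcdh.mytable_clean;               /* ETL-style step executed through Hive */
   set etlcdh.mytable;
   where status = 'OK';
run;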
(Architecture overview: business users, executives, data analysts, and applications query the cluster through Impala, Spark, and Hive-on-Spark; data flows in from cloud sources, databases, and data warehouses, and from web logs, clickstream, and semi-structured data via Flafka (Flume + Kafka).)
Example use case (diagram): other data sources feed a Cloudera cluster running the LASR server, the SAS Embedded Process (EP), and the Data Loader vApp; on top sit SAS/ACCESS for Hadoop and the SAS server tier with SAS Studio, SAS EBI & SAS Solutions, SAS Data Loader, and SAS Visual Analytics.
SAS integrations with Cloudera: From, With, In
- From Cloudera: SAS accesses and extracts data from Cloudera Enterprise to a SAS server for processing and writes the results back.
- With Cloudera: SAS accesses and processes Cloudera Enterprise data on SAS distributed servers; data is lifted into the SAS in-memory environment.
- In Cloudera: SAS accesses and processes data directly in Cloudera Enterprise.
SAS integrations with Cloudera. Product map across the three styles (From / With / In Cloudera): SAS/ACCESS to Hadoop, SAS/ACCESS to Impala, SAS Visual Analytics Explorer, SAS In-Memory Statistics for Hadoop, SAS Scoring Accelerator, SAS Data Loader for Hadoop.
SAS pulls data FROM Cloudera. Product map (From / With / In Cloudera): SAS/ACCESS to Hadoop, SAS/ACCESS to Impala, SAS Visual Analytics Explorer, SAS In-Memory Statistics for Hadoop, SAS Scoring Accelerator, SAS Data Loader for Hadoop.
References: https://github.com/jeff-bailey
SAS/ACCESS to Hadoop
Can't I just access HDFS files directly or program MapReduce myself? You can, but considering productivity and code-maintenance cost, the SQL interface is recommended.
(Diagram: a FILEREF or PROC HADOOP reaches the data files through HDFS commands and MapReduce, while SAS/ACCESS returns result sets through HiveQL (SQL-like) against HiveServer2.) A sketch of both paths follows below.
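To make the contrast concrete, here is a hedged sketch of the two paths: reading a raw HDFS file through a fileref versus letting SAS/ACCESS push a HiveQL query to HiveServer2. The file path, host, credentials, and columns are hypothetical, and the FILENAME Hadoop access method assumes the usual Hadoop configuration and JAR environment variables are already set.

/* Path 1 (possible, but more code to write and maintain): raw HDFS file.  */
/* Path, credentials, and layout are hypothetical.                          */
filename rawlog hadoop '/user/cloudera/logs/web_2018.csv'
                user='cloudera' pass='cloudera';
data work.weblog;
   infile rawlog dsd firstobs=2;            /* parse the CSV manually       */
   input ts :$20. url :$200. status;
run;

/* Path 2 (recommended): the SQL interface through SAS/ACCESS to Hadoop.    */
libname mycdh hadoop server='quickstart.cloudera'
        user=cloudera password=cloudera;
proc sql;
   select status, count(*) as hits
   from mycdh.weblog                        /* translated to HiveQL and run in the cluster */
   group by status;
quit;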
SAS/ACCESS to Hadoop
Features
- Uses existing SAS interfaces: standard LIBNAME syntax (a one-line code change to use Hadoop); DATA step and PROC SQL are translated to Hive
- Custom SerDe support: Parquet, Avro, Text, etc.; SPDE formats
- Integrates with YARN; use Hive or Hive-on-Spark
- Uses the Hive and HDFS APIs
Deployment method
- Connect using client JARs and configuration files; REST APIs can also be used
Source: http://documentation.sas.com/api/docsets/acreldb/9.4/content/acreldb.pdf?locale=en#nameddest=p0rkug1n9ub7b0n132xjxknz1qvv
SAS/ACCESS to Hadoop: simple code, very different performance

libname mycdh hadoop server='quickstart.cloudera' user=cloudera password=cloudera;

proc sql;
   connect to hadoop (server='quickstart.cloudera' user=cloudera);
   select count(*) from connection to hadoop (select * from mytext);
quit;

proc sql;
   connect to hadoop (server='quickstart.cloudera' user=cloudera);
   select * from connection to hadoop (select count(*) from mytext);
quit;

Which code is faster? The second: its count(*) executes inside Hive and returns a single row to SAS, while the first ships every row of mytext back to SAS before counting.
Source: https://raw.githubusercontent.com/jeff-bailey/sgf2016_sas3880_insiders_guide_hadoop_how/master/code/ex02_sas_hadoop_sgf_2016.sas
Complex Queries: Hive (MR) vs. Hive-on-Spark. Why Spark is faster than Hive on MapReduce:
- MapReduce runs a set of MR jobs in sequence and persists the full dataset to HDFS after each job (3 disk I/Os + 3 network I/Os).
- Spark passes data directly between stages: at most 1 disk I/O + 1 network I/O. Unless you are writing a lot, you will hit the buffer cache, and the data is fetched by the next Spark stage just like a Reduce task in MapReduce.
- Borrowing from Microsoft's Dryad paper, Spark cuts down the extra Map tasks in MapReduce: M1-R1-R2 instead of M1-R1, M2-R2.
Hive (MR) vs. Hive-on-Spark: Performance
- Benchmark: on average ~3x faster than Hive-on-MapReduce
- More suitable: complex workloads with multiple MR stages (e.g. a filter followed by a JOIN followed by a GROUP BY); disk-bound workloads with multiple disk reads/writes; workloads requiring minutes to hours to complete
- Less suitable: simple workloads (e.g. select *); CPU-bound workloads (e.g. complex UDFs); workloads typically requiring less than 1 minute
(See the engine-switch sketch below.)
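If the cluster has Hive-on-Spark enabled, one way to opt into it from a SAS session is to send the Hive session setting through explicit pass-through before running the heavy, multi-stage query. This is only a hedged sketch: the host, credentials, tables, and columns are hypothetical, and whether the SET is honored depends on how the cluster's Hive service is configured.

/* Hedged sketch: ask Hive to use the Spark execution engine for this session, */
/* then run a multi-stage query (filter -> join -> group by) where it helps.   */
proc sql;
   connect to hadoop (server='quickstart.cloudera' user=cloudera);
   execute (set hive.execution.engine=spark) by hadoop;   /* session-level Hive setting */
   create table work.daily_counts as
   select * from connection to hadoop
      (select t.dt, count(*) as n
         from weblog t join customers c on t.cust_id = c.cust_id
        where t.status = 200
        group by t.dt);
   disconnect from hadoop;
quit;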
SAS/ACCESS to Impala
Features
- Same as SAS/ACCESS to Hadoop
- Massively Parallel Processing (MPP) query engine, optimized for interactive analytics/queries
- Uses the HDFS API and Impala
Deployment method
- Connect using client JARs and configuration files; REST APIs can also be used
SAS/ACCESS to Impala

libname sasflt 'SAS-data-library';
libname mydblib impala host=mysrv1 db=users user=myusr1 password=mypwd1;

proc sql;
   create table mydblib.flights98
      (BULKLOAD=YES BL_DATAFILE='/tmp/mytable.dat'
       BL_HOST='192.168.x.x' BL_PORT=50070)
      as select * from sasflt.flt98;
quit;

libname myimp impala server="quickstart.cloudera" user=cloudera password=cloudera
        dbconinit="set mem_limit=1g";

What do these options mean? set disable_unsafe_spills=true
(Roughly: MEM_LIMIT caps the memory a query may use on each node, and DISABLE_UNSAFE_SPILLS=true makes queries without reliable statistics fail fast instead of spilling to disk.)

Sources:
http://documentation.sas.com/api/docsets/acreldb/9.4/content/acreldb.pdf?locale=en#nameddest=p0rkug1n9ub7b0n132xjxknz1qvv
https://support.sas.com/resources/papers/proceedings16/sas3960-2016.pdf
SAS/ACCESS to Impala: explicit pass-through examples

proc sql;
   connect to impala (server="quickstart.cloudera" user=cloudera password=cloudera);
   execute (create table mytable (mycol varchar(20))) by impala;
   disconnect from impala;
quit;

proc sql;
   connect to impala (server="quickstart.cloudera" user=cloudera password=cloudera);
   select * from connection to impala (select * from mytable where mycol='xx');
quit;

Most large tables are partitioned, so always restrict the query on the partition key (see the sketch below)!
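As a hedged illustration of that advice, the pass-through query below filters on a date partition column so that Impala only scans the relevant partitions. The partition column (dt) and the m_url column are hypothetical; big_pageview is reused from the impala-shell examples that follow.

/* Hedged sketch: always prune by the partition key of a large table.       */
/* Assumes big_pageview is partitioned by dt; adjust to your schema.        */
proc sql;
   connect to impala (server="quickstart.cloudera" user=cloudera password=cloudera);
   select * from connection to impala
      (select m_url, count(*) as pv
         from big_pageview
        where dt between '2018-05-01' and '2018-05-07'   /* partition pruning */
        group by m_url);
   disconnect from impala;
quit;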
Impala tip: how many count(distinct column) queries do you end up running?
- Before model development you typically run many count(distinct column) SQL statements.
- What if you could get nearly the same accuracy much faster? See Google's HyperLogLog.

[ip-10-0-0-195.ap-northeast-2.compute.internal:21000] > select count(*) from big_pageview;
Query: select count(*) from big_pageview
Query submitted at: 2018-05-14 08:52:27 (Coordinator: http://ip-10-0-0-195.ap-northeast-2.compute.internal:25000)
Query progress can be monitored at: http://ip-10-0-0-195.ap-northeast-2.compute.internal:25000/query_plan?query_id=e24563149c8779e3:6d347a6700000000
+-----------+
| count(*)  |
+-----------+
| 100800000 |
+-----------+
Fetched 1 row(s) in 0.21s

(A smallish table of about 100 million rows.)
Impala tip (continued): the exact count(distinct) on the same table

[ip-10-0-0-195.ap-northeast-2.compute.internal:21000] > select count(distinct m_timestamp) from big_pageview;
Query: select count(distinct m_timestamp) from big_pageview
Query submitted at: 2018-05-14 08:53:20 (Coordinator: http://ip-10-0-0-195.ap-northeast-2.compute.internal:25000)
Query progress can be monitored at: http://ip-10-0-0-195.ap-northeast-2.compute.internal:25000/query_plan?query_id=414dd6a82776189c:c3c436d100000000
+-----------------------------+
| count(distinct m_timestamp) |
+-----------------------------+
| 3513600                     |
+-----------------------------+
Fetched 1 row(s) in 4.17s

A second run returns the same 3,513,600 distinct values in about 4.16 seconds.
Impala tip: one option makes this fast, and with less memory

[ip-10-0-0-195.ap-northeast-2.compute.internal:21000] > set appx_count_distinct=true;
APPX_COUNT_DISTINCT set to true
[ip-10-0-0-195.ap-northeast-2.compute.internal:21000] > select count(distinct m_timestamp) from big_pageview;
Query: select count(distinct m_timestamp) from big_pageview
Query submitted at: 2018-05-14 08:58:07 (Coordinator: http://ip-10-0-0-195.ap-northeast-2.compute.internal:25000)
Query progress can be monitored at: http://ip-10-0-0-195.ap-northeast-2.compute.internal:25000/query_plan?query_id=b34cdff2c61bd230:879ee3dd00000000
+-----------------------------+
| count(distinct m_timestamp) |
+-----------------------------+
| 3434319                     |
+-----------------------------+
Fetched 1 row(s) in 1.03s

Approximate: 3,434,319 distinct values in about 1 second. With set appx_count_distinct=false; the exact query returns 3,513,600 in 4.17 seconds. Roughly 98% accuracy at about 4x the speed. The same option can be set from SAS, as sketched below.
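From SAS, the same trick can be applied by setting the query option when the Impala connection is initialized, reusing the DBCONINIT= option shown earlier. A hedged sketch with a hypothetical host and credentials; whether the count is pushed down depends on implicit pass-through applying to the query.

/* Hedged sketch: approximate count(distinct) from SAS via the Impala engine. */
/* DBCONINIT= runs the SET when the connection is opened; host and            */
/* credentials are hypothetical.                                               */
libname fastimp impala server="quickstart.cloudera" user=cloudera password=cloudera
        dbconinit="set appx_count_distinct=true";

proc sql;
   select count(distinct m_timestamp) as approx_distinct   /* HLL-based when pushed to Impala */
   from fastimp.big_pageview;
quit;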
SAS processes data WITH Cloudera. Product map (From / With / In Cloudera): SAS/ACCESS to Hadoop, SAS/ACCESS to Impala, SAS Visual Analytics Explorer, SAS In-Memory Statistics for Hadoop, SAS Scoring Accelerator, SAS Data Loader for Hadoop.
SAS WITH Cloudera architecture
SAS WITH Cloudera: products
- Client applications: SAS Visual Analytics Explorer, SAS Visual Statistics, SAS In-Memory Statistics for Hadoop
- Backend application: SAS LASR Server
SAS WITH Cloudera: products
Features
- Read and write directly to HDFS using the SASHDAT format or as plain-text files; uses the HDFS API
- Using the Embedded Process (EP) allows access to Hive tables and custom SerDe formats (Parquet, ...)
- Integrates with YARN (*** preconfigured)
Deployment method
- LASR can be deployed on a separate SAS server or co-located on the Cloudera Enterprise servers
(A hedged loading sketch follows below.)
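As a hedged sketch of the co-located pattern, the statements below write a SAS data set to HDFS in SASHDAT format via the SASHDAT engine, and separately lift a SAS data set into LASR memory via the SASIOLA engine. The host names, port, tag, HDFS path, and TKGrid install location are all hypothetical and depend on your deployment; sasflt is the SAS library from the earlier SAS/ACCESS to Impala example.

/* Hedged sketch: two common loading patterns for the co-located setup.       */
/* Host names, port, tag, HDFS path, and install path are hypothetical.       */

/* (1) Write a SAS data set to HDFS in SASHDAT format */
libname hdat sashdat path="/user/sasdemo/data"
             server="namenode.example.com" install="/opt/TKGrid";
data hdat.flights98;
   set sasflt.flt98;
run;

/* (2) Lift a SAS data set into LASR memory for VA / in-memory statistics */
libname lasr1 sasiola host="lasr.example.com" port=10010 tag=hps;
data lasr1.flights98;
   set sasflt.flt98;
run;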
SAS Visual Analytics Explorer
Features: data exploration at massive scale; intuitive visual analytics
SAS In-Memory Statistics for Hadoop
Feature: a programming interface for model development
Cloudera + SAS advantages: a proven system, faster development, minimal data movement
- Improved business outcomes: better decisions by analyzing more data; solve the hard problems with interactive and iterative analytics; unlimited variables for analysis, i.e. no column restrictions.
- Accelerated time-to-value: in-memory data and analytics processing for faster performance; SAS simplifies working with Hadoop, Cloudera Manager simplifies system administration.
- Reduced risk: SAS & Cloudera integration minimizes data movement and improves governance; Cloudera and SAS are stable market leaders aligned across R&D (dedicated Cloudera engineer), product management, services, education, and tech support.
- More innovation: more analytic exploration of data that previously was too costly to store or troublesome to format; Cloudera and SAS integrated technologies make big data analytics approachable and can support innovative use cases.
Considerations when building an AI/Machine Learning system: the size and complexity of the surrounding infrastructure (from Google).
Source: https://pdfs.semanticscholar.org/1eb1/31a34fbb508a9dd8b646950c65901d6f1a5b.pdf
SAS FORUM Thank you.