SAS FORUM KOREA 2018_Cloudera_Presentation

Transcription:

SAS FORUM: A Guide to Building a Data Platform for SAS Users Leading the AI / Machine Learning Era. Cloudera Korea, 임상배. Copyright SAS Institute Inc. All rights reserved.

Cloudera Hadoop. How to Use SAS & Cloudera.

Cloudera Hadoop Overview: Catching Up with Hadoop
Hadoop: a distributed processing framework based on the Google File System whitepaper that Google published in 2003.
The Old Way: Compute (RDBMS, EDW), Network, Data Storage (SAN, NAS). Hard to scale. Network inevitably becomes a bottleneck. Only handles structured/relational data. Difficult to add new fields & data types.
The Hadoop Way: Compute (CPU), Memory, Storage (Disk). Scales out indefinitely. Network eliminated as a bottleneck. Easy to ingest any type of data. Agile schema-on-read data access.
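The "schema-on-read" point can be illustrated with a small sketch. This is not Hadoop code: the record layout, the field names and the `read_with_schema` helper are all hypothetical. The idea shown is only that raw records are stored as they arrive and each reader applies its own schema when it reads.

```python
import json

# Raw records are stored exactly as they arrive; no table schema is declared up front.
RAW_LINES = [
    '{"user": "a", "clicks": 3}',
    '{"user": "b", "clicks": 5, "device": "mobile"}',  # a new field appears later
]


def read_with_schema(lines, fields):
    """Apply a schema at read time: keep the requested fields, None when absent."""
    for line in lines:
        record = json.loads(line)
        yield {name: record.get(name) for name in fields}


# Two readers project different schemas over the same stored data.
clicks_view = list(read_with_schema(RAW_LINES, ["user", "clicks"]))
device_view = list(read_with_schema(RAW_LINES, ["user", "device"]))
print(clicks_view)
print(device_view)
```

Adding the `device` field required no change to the stored data or to the first reader, which is the contrast with "Difficult to add new fields & data types" in the old way.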

Cloudera Hadoop Overview: Catching Up with Hadoop
A cost-effective platform for storing and analyzing large volumes of data that minimizes data movement.
Then: Bring Data to Compute. Process-centric businesses use: structured data, internal data only, important data only.
Now: Bring Compute to Data. Information-centric businesses use all data: multi-structured, internal & external data of all types.

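Cloudera Hadoop Overview: Catching Up with Hadoop
A distributed processing architecture spread across multiple servers.
Distributed processing handles more data, faster, than existing systems (analyzing the entire data set when needed).
Data of any type and any volume can be processed, whether structured or unstructured.
Hadoop is the standard data platform adopted first when building most data analytics systems in Korea and abroad.
Integration with the existing analytics environment (SAS) is needed to make the most of existing IT investments and the skill set of current staff.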

Hadoop: Transforming Enterprise Data Architecture
New data exploration, data-driven enterprise, open architecture.
Designed for 3Vs of new data. Native security. Significantly lower cost. Multiple analytic engines. Agile development tools. Extremely fast performance. Rapid innovation. Large ecosystem. No vendor lock-in.

Cloudera: Support for an Enterprise-Grade Machine Learning and Analytics Platform
Machine Learning: Pattern recognition, Anomaly detection, Prediction. Customers Run on Cloudera.
Analytics: Self-service intelligence, Real-time analytics, Secure reporting. Customers Run IMPALA on Cloudera.

[Diagram: four numbered use cases on a shared data platform. Labels: Big Data App / Customer Care Churn Analytics (discover unknowns, big data application); Predictive Analytics (proactively respond to issues, predictive maintenance, predict customer behaviour); Analytics / Self Service BI (combine different workloads on common data, i.e. SQL + Search; optimize infrastructure usage; reduce outages; reduce BI backlog requests); EDW Optimization (lowest cost storage). Data sources: account/customer/transaction data, clickstreams, system logs, maintenance data, set-top box data, mobile data, online data, 3rd-party datasets. Platform components: servers, marts, Teradata, SAS, storage, search, archive.]

Evolution of the Hadoop Platform
The stack is continually evolving and growing! The timeline runs 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, ..2017; each stage adds components to the previous stack:
Core Hadoop (HDFS, MapReduce)
+ Solr, Pig
+ HBase, ZooKeeper
+ Hive, Mahout
+ Sqoop, Avro
+ Flume, Bigtop, Oozie, HCatalog, Hue, YARN
+ Spark, Tez, Impala, Kafka, Drill
+ Parquet, Sentry
+ Knox, Flink
+ CDSW, Altus, Kudu, RecordService, Ibis, Falcon

Terms You Hear These Days on Analytics Projects: Hadoop, HDFS, Spark, Hive, Impala, Kudu

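Terms You Hear These Days on Analytics Projects
Term: Description
Hadoop: A distributed computing project whose goal is to process large volumes of data in a parallel, distributed environment.
HDFS (Hadoop Distributed File System): A file system that stores large files across distributed servers and supports storing and processing data larger than the hard disk capacity of any single node.
MapReduce: A framework for batch processing of large data sets, made up of Map and Reduce tasks.
Node: Usually means one physical server.
Cluster: Several computers bundled together and presented as if they were a single computer.
Hadoop Ecosystem: Hadoop consists of core projects and subprojects. They provide various features that make Hadoop easier to use, and the whole is called an ecosystem because it resembles one.
Flume: Provides real-time log data collection.
Sqoop: Imports data from relational databases into Hadoop and sends Hadoop data to relational databases.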

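Terms You Hear These Days on Analytics Projects (continued)
Term: Description
Hive: A SQL query engine; when the user writes SQL, it processes the data using MapReduce. Suited to batch processing.
Impala: An interactive SQL query engine that does not use MapReduce and provides fast, memory-based data processing. Suited to real-time queries.
CDSW (Cloudera Data Science Workbench): A collaboration tool that supports ML/AI analysis work (R, Python, Scala).
Cloudera Manager: Installs and upgrades Hadoop clusters and manages and monitors individual services; used by Hadoop cluster administrators.
HUE: A UI tool used by end users and administrators in a Hadoop environment (query tool, permission settings, job workflow authoring, and so on).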
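The MapReduce entry in the glossary ("made up of Map and Reduce tasks") can be made concrete with a single-process word count. This is a teaching sketch in plain Python, not the Hadoop API: the function names are hypothetical, and the shuffle step that Hadoop performs between the two phases is simulated with a dictionary.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["Hadoop stores data", "Spark reads data from Hadoop"])
print(counts)
```

In a real cluster the map calls run in parallel on the nodes that hold the input blocks, and each reducer receives the grouped values for its share of the keys.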

Anatomy of a Hadoop Cluster
[Diagram] Masters: YARN, Impala Catalog Store, Impala Statestore, Name Node, Secondary Name Node, HiveServer, ZooKeeper (x3), Kudu Master (x3), HMaster (x3), HUE Server, Sentry Server, Oozie Server. Manager: Cloudera Manager. Workers: YARN Resource Pool(s) on each worker; some workers run Search, HBase Region Server and Data Node, others run Impala Daemon, Kudu Tablet Server and Data Node. Gateway(s): User Apps. CM Agents run on the master, worker and gateway hosts.
Cloudera, Inc. All rights reserved.

HDFS
[Diagram] Name Node, Standby Name Node, Secondary Name Node. File Q is split into blocks B X, B Y and B Z; each block has three replicas (B X1 to B X3, B Y1 to B Y3, B Z1 to B Z3) spread over Data Nodes A to D in Racks 1 to 3.
Default block size = 128MB, 256MB
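The block layout in the HDFS diagram can be sketched as arithmetic. The 128 MB block size comes from the slide; the replication factor of 3 matches the three replicas per block in the diagram. The round-robin placement below is a deliberate simplification and an assumption of this sketch: real HDFS placement is rack-aware, and both function names are hypothetical.

```python
BLOCK_SIZE_MB = 128  # one of the default block sizes named on the slide
REPLICATION = 3      # three replicas per block, as drawn in the diagram


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the size of each block; only the last block may be smaller."""
    full, rest = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full + ([rest] if rest else [])


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Toy round-robin placement (assumption); real HDFS also considers racks."""
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


blocks = split_into_blocks(300)  # a 300 MB file
layout = place_replicas(len(blocks), ["A", "B", "C", "D"])
print(blocks)
print(layout)
```

Because every block lives on three different data nodes, losing one node never removes the only copy of a block.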

Hive
"Don't forget who won the race, Bucko!"
Spins up processes under the control of YARN. Can handle the failure of a machine. Will overflow joins to HDFS.
[Diagram] Clients (Beeline CLI, JDBC, ODBC) connect to HiveServer2, which runs a Driver, Compiler and Executor per session (Session A, Session B). The Hive Metastore (Thrift Service) holds Location, Schema, File Format and SerDe. Storage: HDFS, BLOB, Other.

Impala (the fastest of the Antelopes)
"But, I left you in the dust at the starting line, Grandpa!"
Written in C++, no JVM. Uses the Hive Metastore. Employs algorithms from MPP databases.
[Diagram] A SQL app connects over ODBC. Shared services: Hive Metastore, Hadoop NN, Statestore. Each of three nodes runs a Query Planner, Query Coordinator and Query Executor next to an HDFS DN.

Spark
In-Memory Caching. Optimized Scheduler. Query optimizer.
Easy Development: Rich & flexible APIs for Scala, Java, and Python. Seamlessly interleave SQL syntax with code. Interactive shell.
Batch, Stream & Machine Learning: Unified framework for batch and stream processing. Rich collection of distributed ML algorithms.
[Diagram: RDDs A to F linked by map, groupBy, take, join, map and filter operations; legend: RDD, cached partition.]

Big Data Pipelines: Ecosystem Used at Each Project Stage
Data Ingestion: Capture, Move, Stream
Data Engineering: Cleanse, Conform, Transform, Enrich
Data Stewardship: Store, Secure, Govern, Tag
Data Science: Model, Score, Enrich, Predict
Data Analytics: BI, Online, APIs

SAS & Cloudera Partner Ecosystem
Cloudera has built an ecosystem of more than 2,800 partners: ISVs & Solutions, Resellers, Cloud & Platform, System Integrators.
SAS & Cloudera enable organizations to achieve competitive advantage by gaining value from all their data, through a proven combination of enterprise-ready storage, processing, analytics, and data management.

SAS & Cloudera Joint Customer Successes
Optimize: With SAS Visual Analytics, business executives at Telecom Italia can compare the performance between all operators for a key indicator such as accessibility or percentage of dropped calls on a single screen for a quick overview of pertinent strengths and weaknesses.
Discover: Epsilon built a next-generation marketing application, leveraging Cloudera and taking advantage of SAS capabilities by our data science/analytics team, that provides its clients with a 360-degree view of their customer.
Empower: AMERAN provides 360-degree views into energy usage patterns and similar household comparisons to help consumers save energy.

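How to Use SAS & Cloudera
If you already use SAS well, and Hadoop has been adopted or is about to be, how should you use the data stored in Hadoop?
First, data stored in Hadoop is simply one more Library for SAS users.
There are several access paths: HDFS, Hive, Impala, and others.
Impala is recommended for fast interactive queries.
Hive or Hive on Spark is recommended for ETL.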

[Diagram labels] Business Users, Executives, Data analysts, Applications; Impala, Spark, Hive on Spark; Cloud, Databases, Data Warehouses, Flafka, Web Logs, Click Stream Data, Semi-Structured Data.

Example Use Case
[Diagram labels] Other data sources, LASR, SAS Visual Analytics, Embedded Process (EP), Data Loader vApp, SAS/Access for Hadoop, Server Tier, SAS Studio, SAS EBI & SAS Solutions, SAS Data Loader, SAS Visual Analytics.

SAS integrations with Cloudera: From, With, In
From Cloudera: SAS accesses and extracts data from Cloudera Enterprise to a SAS server for processing and writes results back.
With Cloudera: SAS accesses and processes Cloudera Enterprise data on SAS distributed servers; data is lifted into the SAS in-memory environment.
In Cloudera: SAS accesses and processes data directly in Cloudera Enterprise.

SAS integrations with Cloudera
From Cloudera: SAS/Access to Hadoop, SAS/Access to Impala
With Cloudera: SAS Visual Analytics Explorer, SAS In-Memory Statistics for Hadoop
In Cloudera: SAS Scoring Accelerator, SAS Data Loader for Hadoop

SAS pulls data FROM Cloudera SAS/Access to Hadoop SAS/Access to Impala SAS Visual Analytics Explorer SAS In-Memory Statistics for Hadoop SAS Scoring Accelerator SAS Data Loader for Hadoop From Cloudera With Cloudera In Cloudera

References: https://github.com/jeff-bailey

SAS/Access to Hadoop
Can't I just access HDFS files or write MapReduce programs? You can, but considering productivity, code maintenance cost, and so on, using the SQL interface is recommended.
[Diagram] FileRef and PROC Hadoop reach data files through MapReduce + HDFS commands; SAS/Access sends Hive QL (SQL-like) to HiveServer2 and receives a result set.

SAS/Access to Hadoop
Features: Uses existing SAS interfaces. Standard Libname syntax: one-line code change to use Hadoop. Datastep and Proc SQL translated to Hive. Custom SerDe support: Parquet, Avro, Text, etc. SPDE formats. Integrates with YARN. Use Hive or Hive-on-Spark. Uses Hive and HDFS API.
Deployment method: Connect using client jars and configuration files. REST APIs can also be used.
Source: http://documentation.sas.com/api/docsets/acreldb/9.4/content/acreldb.pdf?locale=en#nameddest=p0rkug1n9ub7b0n132xjxknz1qvv

SAS/Access to Hadoop: Simple Code, Different Performance
libname mycdh hadoop server='quickstart.cloudera' user=cloudera password=cloudera;
/* Query 1: Hadoop returns every row of mytext; SAS counts them */
proc sql;
  connect to hadoop(server='quickstart.cloudera' user=cloudera);
  select count(*) from connection to hadoop (select * from mytext);
quit;
/* Query 2: Hadoop counts the rows and returns a single row */
proc sql;
  connect to hadoop(server='quickstart.cloudera' user=cloudera);
  select * from connection to hadoop (select count(*) from mytext);
quit;
Which code is faster?
Source: https://raw.githubusercontent.com/jeff-bailey/sgf2016_sas3880_insiders_guide_hadoop_how/master/code/ex02_sas_hadoop_sgf_2016.sas
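The difference between the two pass-through queries is where the count happens. With explicit pass-through, the text inside the parentheses runs on the Hadoop side and only its result travels back, so the first query ships every row of mytext to SAS while the second ships one row. The sketch below reproduces that shape locally. It uses Python's built-in sqlite3 purely as a stand-in for the remote engine, and the 10,000-row table and both function names are made up for illustration.

```python
import sqlite3

# Stand-in for the remote table "mytext"; sqlite3 is used only so the sketch runs locally.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytext (line TEXT)")
conn.executemany("INSERT INTO mytext VALUES (?)", [(f"row {i}",) for i in range(10_000)])


def count_after_pull(connection):
    """First pattern: fetch every row, then count on the client side."""
    rows = connection.execute("SELECT * FROM mytext").fetchall()
    return len(rows), len(rows)  # (answer, rows transferred)


def count_pushed_down(connection):
    """Second pattern: the database counts and returns a single row."""
    rows = connection.execute("SELECT COUNT(*) FROM mytext").fetchall()
    return rows[0][0], len(rows)  # (answer, rows transferred)


pulled = count_after_pull(conn)
pushed = count_pushed_down(conn)
print("pull then count:", pulled)
print("count pushed down:", pushed)
```

Both patterns return the same answer; the second transfers one row instead of all of them, which is why pushing the aggregate down to the cluster is normally the faster choice on large tables.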

Complex Queries: Hive (MR) vs Hive (Hive on Spark)
Why Spark is faster than Hive:
Hive on MapReduce: a set of MR jobs in sequence. MR persists the full dataset to HDFS after each job: 3 disk I/Os + 3 network I/Os.
Hive on Spark: Spark passes data directly: at most 1 disk I/O + 1 network I/O. Unless writing a lot, you will hit buffer cache. Data is fetched by the next Spark stage, just like a Reduce task in MapReduce.
From Microsoft's Dryad paper: cuts down the extra Map tasks in MapReduce! M1-R1-R2 instead of M1-R1, M2-R2.

Hive (MR) vs Hive (Hive on Spark): Performance Benchmark
Avg. ~3X faster than Hive-on-MapReduce.
More suitable: complex workloads w/ multiple MR stages, e.g. filter followed by JOIN followed by GROUP BY; disk-bound workloads w/ multiple disk reads/writes; workloads requiring mins to hours for completion.
Less suitable: simple workloads, e.g. select *; CPU-bound workloads, e.g. complex UDFs; workloads typically requiring <1 min.

SAS/Access to Impala
Features: Same as SAS/Access to Hadoop. Massively Parallel Processing (MPP) query engine. Optimized for interactive analytics/queries. Uses HDFS API and Impala.
Deployment method: Connect using client jars and configuration files. REST APIs can also be used.

SAS/Access to Impala
libname sasflt 'SAS-data-library';
libname mydblib impala host=mysrv1 db=users user=myusr1 password=mypwd1;
/* Bulk-load a SAS table into Impala */
proc sql;
  create table mydblib.flights98
    (BULKLOAD=YES BL_DATAFILE='/tmp/mytable.dat' BL_HOST='192.168.x.x' BL_PORT=50070)
    as select * from sasflt.flt98;
quit;
/* Pass a query option to Impala when the connection is opened */
libname myimp impala server="quickstart.cloudera" user=cloudera password=cloudera dbconinit="set mem_limit=1g";
What do these options mean? set disable_unsafe_spills=true
Sources: http://documentation.sas.com/api/docsets/acreldb/9.4/content/acreldb.pdf?locale=en#nameddest=p0rkug1n9ub7b0n132xjxknz1qvv and https://support.sas.com/resources/papers/proceedings16/sas3960-2016.pdf

SAS/Access to Impala: Pass-Through Usage Example
Explicit Pass-Through
/* Run a DDL statement directly in Impala */
proc sql;
  connect to impala (server="quickstart.cloudera" user=cloudera password=cloudera);
  execute(create table mytable(mycol varchar(20))) by impala;
  disconnect from impala;
quit;
/* Run a query in Impala and return the result set to SAS */
proc sql;
  connect to impala (server="quickstart.cloudera" user=cloudera password=cloudera);
  select * from connection to impala (select * from mytable where mycol='xx');
quit;
Most large tables are partitioned, so always specify the partition key!
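The advice to always specify the partition key can be illustrated with a toy model of a partitioned table. Everything here is hypothetical: the `dt` partition key, the sample rows and the `scan` helper are invented for the sketch, and a real engine prunes partitions in its query planner rather than in a Python loop.

```python
# Hypothetical table partitioned by a "dt" key; each partition is stored separately.
PARTITIONS = {
    "2018-05-12": [{"mycol": "xx"}, {"mycol": "yy"}],
    "2018-05-13": [{"mycol": "xx"}],
    "2018-05-14": [{"mycol": "zz"}, {"mycol": "xx"}, {"mycol": "xx"}],
}


def scan(partitions, mycol, dt=None):
    """Return (matching rows, rows read). A dt filter skips all other partitions."""
    selected = [dt] if dt is not None else list(partitions)
    rows_read = 0
    matches = []
    for key in selected:
        for row in partitions.get(key, []):
            rows_read += 1
            if row["mycol"] == mycol:
                matches.append(row)
    return matches, rows_read


full_matches, full_read = scan(PARTITIONS, "xx")
pruned_matches, pruned_read = scan(PARTITIONS, "xx", dt="2018-05-14")
print(len(full_matches), full_read)
print(len(pruned_matches), pruned_read)
```

Without the partition key the scan touches every partition; with it, only one partition is read. On a table with thousands of daily partitions that is the difference between reading one day of data and reading all of it.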

impala-tip: How many count(distinct column) queries do we have to run?
Many count(distinct column) SQL queries are run before model development. What if they could run fast with nearly the same accuracy? See Google's HyperLogLog.
[ip-10-0-0-195.ap-northeast-2.compute.internal:21000] > select count(*) from big_pageview;
Query: select count(*) from big_pageview
Query submitted at: 2018-05-14 08:52:27 (Coordinator: http://ip-10-0-0-195.ap-northeast-2.compute.internal:25000)
Query progress can be monitored at: http://ip-10-0-0-195.ap-northeast-2.compute.internal:25000/query_plan?query_id=e24563149c8779e3:6d347a6700000000
+-----------+
| count(*)  |
+-----------+
| 100800000 |
+-----------+
Fetched 1 row(s) in 0.21s
A small table of about 100 million rows.

impala-tip: How many count(distinct column) queries do we have to run?
[ip-10-0-0-195.ap-northeast-2.compute.internal:21000] > select count(distinct m_timestamp) from big_pageview;
Query: select count(distinct m_timestamp) from big_pageview
Query submitted at: 2018-05-14 08:53:20 (Coordinator: http://ip-10-0-0-195.ap-northeast-2.compute.internal:25000)
Query progress can be monitored at: http://ip-10-0-0-195.ap-northeast-2.compute.internal:25000/query_plan?query_id=414dd6a82776189c:c3c436d100000000
+-----------------------------+
| count(distinct m_timestamp) |
+-----------------------------+
| 3513600                     |
+-----------------------------+
Fetched 1 row(s) in 4.17s
[ip-10-0-0-195.ap-northeast-2.compute.internal:21000] > select count(distinct m_timestamp) from big_pageview;
Query: select count(distinct m_timestamp) from big_pageview
Query submitted at: 2018-05-14 08:53:28 (Coordinator: http://ip-10-0-0-195.ap-northeast-2.compute.internal:25000)
Query progress can be monitored at: http://ip-10-0-0-195.ap-northeast-2.compute.internal:25000/query_plan?query_id=a7416f56e14c9c28:29cd86000000000
+-----------------------------+
| count(distinct m_timestamp) |
+-----------------------------+
| 3513600                     |
+-----------------------------+
Fetched 1 row(s) in 4.16s
Result: 3,513,600 distinct values in about 4.16 seconds.

impala-tip: One option makes this fast, and with less memory
[ip-10-0-0-195.ap-northeast-2.compute.internal:21000] > set appx_count_distinct=true;
APPX_COUNT_DISTINCT set to true
[ip-10-0-0-195.ap-northeast-2.compute.internal:21000] > select count(distinct m_timestamp) from big_pageview;
Query: select count(distinct m_timestamp) from big_pageview
Query submitted at: 2018-05-14 08:58:07 (Coordinator: http://ip-10-0-0-195.ap-northeast-2.compute.internal:25000)
Query progress can be monitored at: http://ip-10-0-0-195.ap-northeast-2.compute.internal:25000/query_plan?query_id=b34cdff2c61bd230:879ee3dd00000000
+-----------------------------+
| count(distinct m_timestamp) |
+-----------------------------+
| 3434319                     |
+-----------------------------+
Fetched 1 row(s) in 1.03s
With appx_count_distinct=true: 3,434,319 distinct values in about 1 second.
With set appx_count_distinct=false; the same query returned:
+-----------------------------+
| count(distinct m_timestamp) |
+-----------------------------+
| 3513600                     |
+-----------------------------+
Fetched 1 row(s) in 4.17s
With appx_count_distinct=false: 3,513,600 distinct values in about 4.16 seconds.
About 98% accuracy, 4x the performance.
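The deck points to HyperLogLog as the idea behind trading a little accuracy for speed and memory. The following is a minimal textbook-style HyperLogLog sketch in Python, not Impala's implementation: the function name, the choice of SHA-1 as the hash and the register count (2**12) are assumptions made for illustration.

```python
import hashlib
import math


def hll_estimate(values, p=12):
    """Estimate the number of distinct values using 2**p small registers."""
    m = 1 << p
    registers = [0] * m
    for value in values:
        # 64-bit hash: the first p bits pick a register, the rest feed the rank.
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        x = int.from_bytes(digest[:8], "big")
        index = x >> (64 - p)
        rest = x & ((1 << (64 - p)) - 1)
        # Rank = position of the leftmost 1-bit within the remaining (64 - p) bits.
        rank = (64 - p) - rest.bit_length() + 1
        registers[index] = max(registers[index], rank)
    alpha = 0.7213 / (1 + 1.079 / m)  # bias-correction constant for m >= 128
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:  # small-range (linear counting) correction
        return m * math.log(m / zeros)
    return raw


# 50,000 distinct values, each seen twice: duplicates do not change the registers.
stream = [i % 50_000 for i in range(100_000)]
estimate = hll_estimate(stream)
print(round(estimate))
```

The sketch keeps only 4,096 small integers however many rows it sees, which is where the memory saving comes from; its typical relative error is around 1.04 / sqrt(4096), or about 1.6%. For the figures on the slide, 3,434,319 / 3,513,600 is about 97.7% and 4.17 s / 1.03 s is about 4.0, consistent with the stated "about 98% accuracy, 4x the performance".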

SAS processes data WITH Cloudera SAS/Access to Cloudera SAS/Access to Impala SAS Visual Analytics Explorer SAS In-Memory Statistics for Hadoop SAS Scoring Accelerator SAS Data Loader for Hadoop From Cloudera With Cloudera In Cloudera

SAS WITH Cloudera architecture

SAS WITH Cloudera products
Client applications: SAS Visual Analytics Explorer, SAS Visual Statistics, SAS In-Memory Statistics for Hadoop.
Backend application: SAS LASR Server.

SAS WITH Cloudera products
Features: Read and write directly to HDFS using SASHDAT format or as plain-text files. Uses HDFS API. Using EP allows accessing Hive tables and custom SerDe formats (Parquet, etc.). Integrates with YARN (preconfigured).
Deployment method: LASR can be deployed on a separate SAS server or co-located on a Cloudera Enterprise server.

SAS Visual Analytics Explorer
Features: Data exploration at massive scale. Intuitive visual analytics.

SAS In-Memory Statistics for Hadoop Feature Programming interface for model development

Advantages of Cloudera + SAS: adopting a proven system, fast development, minimal data movement
Improved Business Outcomes and Accelerated Time-to-Value: Better decisions by analyzing more data. Solve the hard problems with interactive and iterative analytics. Unlimited variables for analysis, i.e. no column restrictions. In-memory data and analytics processing for faster performance. SAS simplifies working with Hadoop; Cloudera Manager simplifies system admin.
Reduced Risk: SAS & Cloudera integration minimizes data movement & improves governance. Cloudera & SAS are stable market leaders aligned across R&D (dedicated Cloudera engineer), product mgt., services, education, and tech support.
More Innovation: More analytic exploration of data that previously was too costly to store or troublesome to format. Cloudera & SAS integrated technologies make Big Data Analytics approachable and can support innovative use cases.

Considerations When Building AI/Machine Learning Systems: the scale and complexity of the surrounding infrastructure (from Google)
Source: https://pdfs.semanticscholar.org/1eb1/31a34fbb508a9dd8b646950c65901d6f1a5b.pdf

SAS FORUM. Thank you.