Giving Hadoop Wings: The SAS Enterprise Machine Learning Platform
SAS Korea / Director Kim Geun-tae (김근태)
CLOUDERA & SAS: OVERVIEW
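FORCES SHAPING ANALYTICS
Analytics embraces open: openness, integration with diverse technologies
Everyone wants to be a data scientist: the challenge of spreading analytics, organizational capability
Changing data landscape: structured, unstructured, image, video, and sensor data
Machine learning & artificial intelligence: Fourth Industrial Revolution, AI
Analytics of Things: IoT, streams, real-time analytics
Cloud-enabled analytics: infrastructure flexibility, cost optimization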
CLOUDERA & SAS
GLOBAL TOP LEADER IN HADOOP & MACHINE LEARNING
Big Data Hadoop Distributions: The Forrester Wave: Big Data Hadoop Distributions, Q1 2016
Predictive Analytics & Machine Learning: The Forrester Wave: Predictive Analytics & Machine Learning, Q1 2017
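INTEGRATED SAS AND CLOUDERA SOLUTION FRAMEWORK
Capabilities: AI, machine learning, deep learning, text analytics, visualization, ...
Flow: data preparation, exploration, model building, model deployment
Serves a wide range of user groups
https://www.sas.com/content/dam/sas/en_us/doc/partners/intel-cloudera-sas-reduce-money-laundering-risks.pdf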
THE SAS PLATFORM
THE ANALYTICS PLATFORM
Analytics Requirements
Analytics Lifecycle
THE SAS PLATFORM
Products / Solutions: CI, Risk, Analytics, Visualization, Fraud & Security, Data Management
Application Services (Microservices)
Runtime Environments: MVA, CAS, ESP, In-DB
Data Access, ACCESS engines (Data Connectors): Streaming, Cloud, Hadoop, Database
Security, Governance, Administration
APIs / UIs: SAS, Python, R, REST, Java, Lua; SAS Studio, Jupyter, R Studio, Enterprise Guide
Host Environments: On-Premises, Private, Public, Hybrid Cloud
MULTIPLE INTERFACES TARGET DIFFERENT USERS
Copyright SAS Institute Inc. All rights reserved.
THE SAS PLATFORM
1 Parallel Loading: Hadoop to SAS Viya (web client)
2 Viya, Massively Parallel Processing: pooled RAM & multi-processors
3 Utilize Memory & Disk
Fast: multi-threaded, distributed in-memory, inter-node communication
Scalable: single machine to distributed MPP, scale-out, on-premise to cloud
High Availability: for mission-critical business, data redundancy, controller fail-over
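THE SAS PLATFORM: ANALYTICS LIFECYCLE
SCALE, DIVERSITY, TRUST
DATA: integrate and process data from diverse sources and extract variables; partition data for training and testing
DISCOVERY: explore the meaning in the data and uncover patterns; run a variety of analyses and select the best algorithm
DEPLOYMENT: move to a trustworthy production system; monitor the accuracy of analytical results and maintain the models
Data sources: IoT, Database, PC Files, Hadoop
Interfaces: SAS, Python, R, Lua, Java, REST APIs
Analytics PLATFORM: Deployment, Security, Governance, Administration; Environments (Cloud and On Premise)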
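Speaker note: the DATA stage mentions partitioning data for training and testing. The slide does not say how SAS does this, so the following is only a minimal plain-Python sketch of the idea. The function name `partition`, the 75/25 split, and the fixed seed are assumptions made for illustration.

```python
import random


def partition(rows, train_fraction=0.75, seed=42):
    """Shuffle a copy of rows and split it into (train, test) lists.

    Hypothetical helper for illustration; not a SAS API.
    """
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    train, test = partition(range(100))
    print(len(train), len(test))
```

A fixed seed makes the partition repeatable, which matters when several candidate algorithms are to be compared on the same test rows.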
ANALYTICS LIFECYCLE - DATA
PREPARING DATA FOR ANALYTICS
Source data: customers, products, customer info, branch, channel type, transactions, demographics, location, product, offers, 3rd-party data, purchase transactions, 3rd-party customer demographics, sales territories, promotional history
Steps: transformation, denormalization
Analytical Base Table (thousands of columns): Customers, Products, Stores, Channel, Promo, Territories, Demographics
Example derived columns: customer spend (last 12 mo), customer spend (last 6 mo), distance to nearest branch, average time between transactions
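Speaker note: the derived columns named on this slide (customer spend over the last 12 and 6 months, average time between transactions) can be pictured with a small plain-Python sketch. This is not the SAS Data Preparation implementation. The record layout `(customer_id, date, amount)`, the function names, and the approximation of 12 and 6 months as 365 and 182 days are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import date, timedelta


def spend_since(txs, as_of, days):
    """Sum amounts for transactions in the window (as_of - days, as_of]."""
    start = as_of - timedelta(days=days)
    return sum(amount for tx_date, amount in txs if start < tx_date <= as_of)


def build_abt(transactions, as_of):
    """Denormalize (customer_id, date, amount) records into one row per customer."""
    by_customer = defaultdict(list)
    for customer_id, tx_date, amount in transactions:
        by_customer[customer_id].append((tx_date, amount))

    abt = {}
    for customer_id, txs in by_customer.items():
        txs.sort()  # chronological order
        dates = [tx_date for tx_date, _ in txs]
        gaps = [(later - earlier).days for earlier, later in zip(dates, dates[1:])]
        abt[customer_id] = {
            "spend_last_12_mo": spend_since(txs, as_of, 365),  # months approximated in days
            "spend_last_6_mo": spend_since(txs, as_of, 182),
            # Undefined for customers with a single transaction.
            "avg_days_between_tx": sum(gaps) / len(gaps) if gaps else None,
        }
    return abt


if __name__ == "__main__":
    sample = [
        ("c1", date(2018, 1, 10), 100),
        ("c1", date(2018, 7, 10), 50),
        ("c1", date(2018, 12, 10), 25),
    ]
    print(build_abt(sample, as_of=date(2018, 12, 31)))
```

Each source transaction table collapses into one row per customer, which is the denormalization step the slide refers to.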
DATA MANAGEMENT
Non-technical user: point-and-click GUI (SAS Data Preparation)
Data scientist: coding interfaces (SAS, R, Python, Lua, REST)
SAS Data Preparation functions: SQL, DATA step, transpose, data quality
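DATA PREPARATION: PERFORMANCE ACCELERATION
Architecture
1 Flexible data access: integrated management of heterogeneous data sources, self-service data access for users
2 Hadoop MPP distributed parallel processing
3 Analytics users and applications, through the SAS Platform API
Results
1 Data loading performance (parallel data loading technology): serial loading 92 minutes, parallel loading 2 minutes. Customer A case: 65 GB of data, Hadoop 12 / SAS 16 nodes
2 In-memory query performance: DW; SAS FedSQL 2 seconds. Customer A case: 40+10 GB of data, Hadoop 12 / SAS 16 nodes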
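Speaker note: the loading figures on this slide (92 minutes serial, 2 minutes parallel) imply a speedup that is easy to check by hand. The sketch below does the arithmetic. It assumes the 65 GB case is the one the loading times refer to, which the slide layout suggests but does not state outright. The function names are illustrative.

```python
def speedup(baseline_seconds, improved_seconds):
    """Ratio of the baseline duration to the improved duration."""
    return baseline_seconds / improved_seconds


def throughput_gb_per_s(size_gb, seconds):
    """Average transfer rate in GB per second."""
    return size_gb / seconds


if __name__ == "__main__":
    serial_s = 92 * 60    # 92 minutes, from the slide
    parallel_s = 2 * 60   # 2 minutes, from the slide
    size_gb = 65          # assumption: the 65 GB case is the loading test

    print(f"speedup: {speedup(serial_s, parallel_s):.0f}x")
    print(f"serial:   {throughput_gb_per_s(size_gb, serial_s):.3f} GB/s")
    print(f"parallel: {throughput_gb_per_s(size_gb, parallel_s):.3f} GB/s")
```

Under that assumption the parallel load is 46 times faster than the serial load and averages a little over half a gigabyte per second.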
ANALYTICS LIFECYCLE - DISCOVERY
The New Era of Analytics
It's visual. It's analytical. It's automated.
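DISCOVERY - KEY ISSUES
Exploration and discovery: How quickly can you surface the trends and insights hidden in your data? Are you making use of text information? Are people still reading and summarizing large numbers of documents by hand?
Past & future: Can you work from more accurate forecasts? For how many products or SKUs do you need forecasts? Are decision makers and planners unhappy with inaccurate forecasts?
Collaboration: Do members of the organization collaborate on the basis of shared information? Does PC-based data create information silos?
Probability & prediction: Do analysts use different tools for different techniques and data? Are the latest machine learning and deep learning algorithms in use? How much time has to be spent improving model accuracy?
Optimal decision-making: Do you face problems in which a constraint or objective has to be minimized or maximized? Do you struggle to plan demand, inventory, or investment?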
DISCOVERY - KEY ISSUES
Ability to run many kinds of analytics on a single platform
Capabilities: Data Preparation, Visualization, Machine Learning, Text Analytics, Forecasting, Optimization, Reporting, Model/Decision Management, Model Deployment
Products: Visual Statistics, Visual Data Mining & Machine Learning, Visual Forecasting, Visual Text Analytics
ANALYTICS LIFECYCLE - DEPLOYMENT
MCKINSEY SURVEY - KEY ISSUES
Why data and analytics initiatives fail
Source: 2016 McKinsey survey of data and analytics leaders at global life insurance and P&C insurance carriers
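HIDDEN TECHNICAL DEBT
Inconsistent, manual model validation and deployment
Very long lead time to put a model into production
No monitoring of how model performance changes over time
Lack of the information, documentation, and management functions needed for security and governance
Retaining the know-how and IP for operating models when the person in charge is absent
Considerations: model development environment versus production environment, diverse deployment scenarios, applying machine learning & deep learning, business processes and decisions
Deployment = Operationalize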
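Speaker note: one item on this slide is the absence of monitoring for model performance over time. The slide names no specific metric. As an illustration only, the sketch below uses the Population Stability Index (PSI), a commonly used drift measure that compares the score distribution at training time with the distribution seen in production. The ten equal-width bins on [0, 1], the small floor `eps` for empty bins, and the function names are assumptions.

```python
import math


def bin_shares(scores, n_bins=10, eps=1e-6):
    """Share of scores per equal-width bin on [0, 1], floored at eps."""
    if not scores:
        raise ValueError("scores must not be empty")
    counts = [0] * n_bins
    for s in scores:
        # Clamp so that 1.0 and any out-of-range value land in an edge bin.
        idx = min(max(int(s * n_bins), 0), n_bins - 1)
        counts[idx] += 1
    # The floor avoids log(0) and division by zero for empty bins.
    return [max(c / len(scores), eps) for c in counts]


def psi(expected_scores, actual_scores, n_bins=10):
    """Population Stability Index: sum of (a - e) * ln(a / e) over bins."""
    expected = bin_shares(expected_scores, n_bins)
    actual = bin_shares(actual_scores, n_bins)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))


if __name__ == "__main__":
    baseline = [i / 100 for i in range(100)]
    drifted = [min(s + 0.3, 0.99) for s in baseline]
    print(f"PSI, same distribution: {psi(baseline, baseline):.3f}")
    print(f"PSI, shifted scores:    {psi(baseline, drifted):.3f}")
```

A PSI near zero means the production scores look like the training scores. A frequently cited rule of thumb treats values above roughly 0.25 as a signal to review or retrain the model, though the right threshold depends on the use case.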
FAST IN EXECUTION --> DEVOPS
Data / Discovery: compose and monitor data
Lifecycle: Assemble, Deploy, Monitor, Update
Open API, Common Services; Model/Decision Management; Astore
Operational targets: SAS9 Base, Viya CAS, In-DB, In-Hadoop, MAS, REST (MM, DM), Streams (ESP)
Execution (deploy analytical models as soon as they are built): Batch, Online, Streaming
TRAINING AND SCORING DEEP LEARNING MODELS
[Figure: diagram labeled Model and Data]
REFERENCE ARCHITECTURE FOR IOT USING SAS
Edge: PLC / sensors, IoT gateways, OPC, SAS ESP Edge
Network infrastructure: message broker, device management, ESP model updates, ESP version updates
Data center or cloud: SAS ESP Server, stream store, streams with SAS ESP, AI
IoT analytics, DevOps & management: SAS ESP Studio, SAS Visual Analytics, SAS Visual Statistics, other SAS apps
CLOUDERA & SAS: TECHNICAL INTEGRATION AND APPLICATION
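WHAT CAN YOU DO WITH SAS ON CLOUDERA?
Access and Manage Hadoop Data: fast, easy access to Hadoop data (Parallel Loading); Hadoop data preprocessing for analysts
Interactively Explore and Visualize Hadoop Data: explore Hadoop data quickly by visual means; report on and share large volumes of Hadoop data
Analyze and Model: data mining, machine learning, and deep learning on large volumes of Hadoop data; one platform for diverse analytical techniques and tasks; a fast, efficient model-building environment (Visual, Auto Tuning)
Deploy & Integrate: deploy analytical models inside Hadoop without conversion; deploy analytical models to the real-time engine without conversion; APIs that connect in-memory analytics with diverse external environments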
SERIAL TRANSFER
Serial transfer with SAS Data Connector
SAS Viya 3.4 (viya1.site.com): infrastructure services, SPRE
CAS nodes: cas1.site.com CONTROLLER (DC); cas2.site.com, cas3.site.com, cas4.site.com WORKER (DC)
CAS storage: in memory (RAM), CAS_DISK_CACHE
Hadoop JARs & config
Hadoop: HiveServer2; table blocks A, B, C spread across three DATANODEs
Test case: Data 65GB, Hadoop 12 / SAS 16 nodes
MULTI-NODE TRANSFER
Multi-node transfer with SAS Data Connector
SAS Viya 3.4 (viya1.site.com): infrastructure services, SPRE
CAS nodes: cas1.site.com CONTROLLER (DC); cas2.site.com, cas3.site.com, cas4.site.com WORKER (DC)
CAS storage: in memory (RAM), CAS_DISK_CACHE
Hadoop JARs & config
Hadoop: HiveServer2; table blocks A, B, C spread across three DATANODEs
Test case: Data 65GB, Hadoop 12 / SAS 16 nodes
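Speaker note: in a multi-node transfer, several CAS nodes read from the source at the same time, so each node needs its own disjoint share of the table. The slide does not show how the rows are divided. The toy sketch below uses a modulo rule on a numeric key purely as an illustrative assumption. The function names and the `id` column are hypothetical, and nothing here calls a SAS or Hadoop API.

```python
def slice_for_node(rows, node_index, n_nodes, key=lambda row: row["id"]):
    """Rows whose numeric key maps to this node under a modulo rule.

    Toy model only: the modulo rule is an assumption, not taken from the slide.
    """
    return [row for row in rows if key(row) % n_nodes == node_index]


def multi_node_load(rows, n_nodes):
    """Give every node its own slice so the nodes can read concurrently."""
    return [slice_for_node(rows, i, n_nodes) for i in range(n_nodes)]


if __name__ == "__main__":
    table = [{"id": i, "block": "ABC"[i % 3]} for i in range(9)]
    for node, rows in enumerate(multi_node_load(table, n_nodes=4)):
        print(f"cas{node + 1}: ids {[row['id'] for row in rows]}")
```

The point of the picture is that no single node has to carry the whole table, in contrast to the serial case where one connection does all the reading.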
PARALLEL TRANSFER
Parallel transfer with the SAS In-Database Embedded Process
SAS Viya 3.4 (viya1.site.com): infrastructure services, SPRE
CAS nodes: cas1.site.com CONTROLLER (DCA); cas2.site.com, cas3.site.com, cas4.site.com WORKER (DCA)
CAS storage: in memory (RAM), CAS_DISK_CACHE
Hadoop JARs & config
Hadoop: SAS Embedded Process, HiveServer2, YARN Resource Manager, HDFS NameNode; table blocks A, B, C spread across three DATANODEs
Test case: Data 65GB, Hadoop 12 / SAS 16 nodes
CO-LOCATED HDFS
Parallel transfer of SASHDAT from co-located HDFS
SAS Viya 3.4 (viya1.site.com): infrastructure services, SPRE
CAS nodes: cas1.site.com CONTROLLER; cas2.site.com, cas3.site.com, cas4.site.com WORKER
CAS storage: in memory (RAM), CAS_DISK_CACHE
HDFS on the same hosts: NAME NODE and three DATA NODEs holding SASHDAT/CSV file blocks A, B, C; SAS Plug-In for Hadoop
The fastest, most efficient way to (re-)load data into CAS
MODEL DEPLOYMENT (IN-HADOOP)
SAS Viya: Astore
Scoring Accelerator: model execution (scoring)
Machine learning models run inside Hadoop
ARCHITECTURE TO THE IOT ANALYTICS LIFECYCLE
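ARCHITECTURE TO THE IOT ANALYTICS LIFECYCLE
Cloudera and SAS
Data collection (batch)
Data collection (real-time)
Analytics execution (real-time)
Data management
Profiling, prototyping
SAS model execution
AI, machine learning, deep learning
Image / text analytics
Data exploration, visualization, sharing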
WHY CLOUDERA & SAS?
THANK YOU