
Big Data Real-time Analysis Technology: Trends and Application Cases
2013. 10. 08
RealTimeTech Co., Ltd.

Contents
1. Big Data Overview
2. Big Data Analytics Overview
3. Big Data Analysis Technologies
4. Case Studies

1. Big Data Overview

Big Data Overview — Background of Big Data Technologies
- Digital Universe: the total amount of data stored in the world's computers.
- The rapid rate of data growth (over 45%) creates problems of storage, processing speed, etc.
- Over 90% of data is unstructured or semi-structured — can conventional data processing cope?
- Given the frequency of data generation and delivery, processing should be applied to data in motion.
Source: IDC Digital Universe Study (2011); IDC (2012)

Big Data Overview — Definition of Big Data
"Big data technologies describe a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data, by enabling high-velocity capture, discovery, and/or analysis." — Definition by IDC
- Volume: data volumes beyond DBMS capacity; requires system scalability (distributed computing, parallelism).
- Variety: diversification of data; unstructured data must be processed; requires system flexibility (user-defined processes and new processing models).
- Velocity: high-speed processing (analysis) of data; decision speed matters and latency must be minimized; in-memory computing, supercomputing, and stream processing.

Big Data Overview — Components of a Big Data Platform
Data collection → data preprocessing → information storage and management → information processing and analysis → intelligence and visualization

Big Data Overview — Open-Source Big Data Platform (1/2)
- Data analysis: machine learning (Mahout); data mining, statistics, and visualization libraries (R); text mining
- (Near) real-time processing: CEP (Esper); real-time stream processing (Storm, S4)
- Batch processing: data processing framework (MapReduce); data processing languages (Pig, Hive); graph processing (Hama, Giraph)
- Data aggregation: web crawler (Nutch); RDBMS adapter (Sqoop); collectors (Flume, Scribe, Chukwa); job workflow engine (Oozie)
- Data store: RDBMS (MySQL, PostgreSQL); NoSQL (HBase, Redis, MongoDB); file system (HDFS); NewSQL (VoltDB); search store (ElasticSearch, Solr)
- Management: cluster management (ZooKeeper)

Big Data Overview — Open-Source Big Data Platform (2/2)

Category             Software                 Description
Data Collection      Flume, Scribe, Chukwa    Collecting data from data sources
                     Sqoop                    Data delivery between HDFS and RDBMS
                     Nutch                    Web crawler
Data Store           HDFS                     Distributed file system
                     HBase, Redis, MongoDB    Key-value based database management systems
                     VoltDB                   RDBMS supporting scalability and ACID
                     ElasticSearch, Solr      Search engines
Real-time Analytics  Storm, S4                Real-time distributed and parallel data processing
                     Esper                    Processing stream data and providing a high-level language
Batch Analytics      Oozie                    Workflow scheduler for Hadoop jobs
                     MapReduce                Batch distributed and parallel data processing
                     Pig, Hive                Analytic operations and a high-level language for big data
                     Giraph, Hama             Distributed and parallel programming model for big graph data
Mining               Mahout                   Machine learning
                     R                        Statistics, data mining, and visualization library
Management           ZooKeeper                Distributed coordinator for cluster management

2. Big Data Analytics Overview

Big Data Analytics Overview — Evolution of Analysis Technology (flow of concepts in big data analytics)
- Descriptive (past): What happened? What is happening? — business reporting, dashboards, scoreboards, data warehousing → well-defined business problems and opportunities.
- Predictive: What will happen? Why will it happen? — data mining, text mining, web/media mining, forecasting → accurate projections of future states and conditions.
- Prescriptive (future): What should I do? Why should I do it? — optimization, simulation, decision modeling, expert systems → best possible business decisions and transactions.

Big Data Analytics Overview — Changes in the Analysis Environment

Big Data Analytics Overview — Application Areas of Analysis Technology (potential use cases)
Source: SAS & IDC

3. Big Data Analysis Technologies
- Big data batch analysis technology
- Big data real-time analysis technology

Big Data Batch Analysis Technology — Hadoop overview
An Apache open-source project started in 2004 as a clone of Google's platform; it has since grown into the mainstream platform for big data storage and analysis.
A software platform that lets one easily write and run applications that process vast amounts of data. It includes:
- MapReduce: offline computing engine
- HDFS: Hadoop distributed file system
- HBase (pre-alpha): online data access
Why Hadoop is useful:
- Scalable: it can reliably store and process petabytes.
- Economical: it distributes the data and processing across clusters of commonly available computers (in the thousands).
- Efficient: by distributing the data, it can process it in parallel on the nodes where the data is located.
- Reliable: it automatically maintains multiple copies of data and automatically redeploys computing tasks on failures.

Big Data Batch Analysis Technology — HDFS
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems, but the differences are significant. HDFS:
- is highly fault-tolerant and designed to be deployed on low-cost hardware;
- provides high-throughput access to application data and is suitable for applications with large data sets;
- relaxes a few POSIX requirements to enable streaming access to file system data;
- is part of the Apache Hadoop Core project.
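The fault tolerance described above comes from block replication: each block of a file is stored on several datanodes. The following is a toy, single-process sketch of that idea only — the class and method names are invented for illustration, and real HDFS uses a rack-aware placement policy, not simple round-robin.

```java
import java.util.*;

// Toy sketch of HDFS-style block replication (illustrative only; not the
// real HDFS placement policy). Each block is assigned to `replication`
// distinct datanodes so a single node failure cannot lose data.
public class BlockPlacement {
    public static Map<Integer, List<String>> place(int numBlocks,
                                                   List<String> datanodes,
                                                   int replication) {
        Map<Integer, List<String>> placement = new HashMap<>();
        for (int b = 0; b < numBlocks; b++) {
            List<String> replicas = new ArrayList<>();
            // Round-robin over datanodes, starting at a different offset per
            // block so the replicas of one block land on distinct nodes and
            // load spreads across the cluster.
            for (int r = 0; r < replication; r++) {
                replicas.add(datanodes.get((b + r) % datanodes.size()));
            }
            placement.put(b, replicas);
        }
        return placement;
    }

    public static void main(String[] args) {
        Map<Integer, List<String>> p =
            place(4, List.of("dn1", "dn2", "dn3"), 3);
        System.out.println(p.get(0)); // [dn1, dn2, dn3]
    }
}
```

With replication = 3 (the HDFS default) and at least three datanodes, every block survives the loss of any two nodes.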

Big Data Batch Analysis Technology — MapReduce
- A programming model developed at Google
- Sort/merge based distributed computing
- Used extensively by many organizations (e.g., Yahoo, Amazon.com, IBM)
- A functional style of programming (cf. LISP), parallelizable across a large cluster of workstations or PCs
Key features behind Hadoop's success:
- partitioning of the input data
- scheduling the program's execution across several machines
- handling machine failures
- managing required inter-machine communication
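The map → shuffle/sort → reduce flow of this model can be sketched in-process with plain Java. This is a single-machine simulation of the model for the classic word-count example, not the Hadoop API; all names here are illustrative.

```java
import java.util.*;

// Minimal in-process sketch of the MapReduce flow: map emits (word, 1)
// pairs, shuffle groups them by key (sort/merge in real Hadoop), and
// reduce sums the values per key.
public class WordCountMR {
    public static Map<String, Integer> run(List<String> lines) {
        // Map phase: emit (word, 1) for every word in every input line.
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.split("\\s+")) {
                if (!word.isEmpty()) emitted.add(Map.entry(word, 1));
            }
        }
        // Shuffle phase: group all emitted values by key, in sorted order.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> kv : emitted) {
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                   .add(kv.getValue());
        }
        // Reduce phase: sum the grouped values for each key.
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            counts.put(e.getKey(), sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("big data", "big analytics")));
        // {analytics=1, big=2, data=1}
    }
}
```

In real Hadoop the three phases run on different machines, which is exactly why the partitioning, scheduling, and failure handling listed above matter.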

Big Data Batch Analysis Technology — Working model for offline batched analytics

Big Data Batch Analysis Technology — Example applications of Hadoop
- A9.com (Amazon): builds Amazon's product search indices; processes millions of sessions daily for analytics, using both the Java and streaming APIs; clusters vary from 1 to 100 nodes.
- Yahoo!: more than 100,000 CPUs in ~20,000 computers running Hadoop; biggest cluster: 2,000 nodes (2 × 4-CPU boxes with 4 TB disk each); used to support research for ad systems and web search.
- AOL: used for a variety of things ranging from statistics generation to running advanced algorithms for behavioral analysis and targeting; cluster of 50 machines (Intel Xeon, dual processors, dual core, each with 16 GB RAM and 800 GB hard disk), giving a total of 37 TB HDFS capacity.
- Facebook: stores copies of internal log and dimension data sources and uses them as a source for reporting/analytics and machine learning; 320-machine cluster with 2,560 cores and about 1.3 PB raw storage.
- FOX Interactive Media: 3 × 20-machine cluster (8 cores/machine, 2 TB/machine storage) and a 10-machine cluster (8 cores/machine, 1 TB/machine storage); used for log analysis, data mining, and machine learning.
- University of Nebraska-Lincoln: one medium-sized Hadoop cluster (200 TB) to store and serve physics data.
- Adknowledge: builds the recommender system for behavioral targeting, plus other clickstream analytics; clusters vary from 50 to 200 nodes, mostly on EC2.
- Contextweb: stores ad-serving logs and uses them as a source for ad optimization, analytics, reporting, and machine learning; 23-machine cluster with 184 cores and about 35 TB raw storage; each (commodity) node has 8 cores, 8 GB RAM, and 1.7 TB of storage.
- Cornell University Web Lab: generating web graphs on 100 nodes (dual 2.4 GHz Xeon processors, 2 GB RAM, 72 GB hard drive).
- NetSeer: up to 1,000 instances on Amazon EC2; data storage in Amazon S3; used for crawling, processing, serving, and log analysis.
- The New York Times: large-scale image conversions; uses EC2 to run Hadoop on a large virtual cluster.
- Powerset / Microsoft: natural-language search; up to 400 instances on Amazon EC2; data storage in Amazon S3.

Big Data Real-time Analysis Technology — Real-time big data analysis platforms
Big data analysis technology is evolving from batch processing toward technologies for processing explosively growing data streams.
Source: ETRI

Big Data Real-time Analysis Technology — Concept of stream processing
- Stream: unbounded sequence of data
- Processing of data-in-motion
- Finite-window data processing
- Continuous query processing
Source: EMC blog post by William Zhou, Sep. 2012
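Finite-window processing is what makes a continuous query over an unbounded stream feasible: the query only ever touches the most recent N elements. A minimal sketch (all names invented for illustration) of a sliding-window moving average:

```java
import java.util.*;

// Sketch of finite-window processing over an unbounded stream: a fixed-size
// sliding window keeps only the most recent N values, so a continuous query
// (here, a moving average) runs in O(1) memory per window instead of
// storing the whole stream.
public class SlidingWindowAvg {
    private final int size;
    private final Deque<Double> window = new ArrayDeque<>();
    private double sum = 0.0;

    public SlidingWindowAvg(int size) { this.size = size; }

    // Feed one stream element; returns the current windowed average.
    public double push(double value) {
        window.addLast(value);
        sum += value;
        if (window.size() > size) {
            sum -= window.removeFirst(); // evict the oldest element
        }
        return sum / window.size();
    }

    public static void main(String[] args) {
        SlidingWindowAvg avg = new SlidingWindowAvg(3);
        for (double v : new double[]{1, 2, 3, 10}) {
            System.out.println(avg.push(v));
        }
        // Final value: (2 + 3 + 10) / 3 = 5.0
    }
}
```

Systems like Esper, Storm, and S4 generalize this pattern: the window (by count or by time) bounds the state a continuous query must keep as data flows past.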

Big Data Real-time Analysis Technology — Storm overview
- Developed by BackType, which was acquired by Twitter.
- There are lots of tools for (batch) data processing — Hadoop, Pig, HBase, Hive — but none of them are real-time systems, which is becoming a real requirement for businesses.
- Problems of MapReduce: scaling is painful, fault tolerance is poor, coding is tedious.
- What we want: guaranteed data processing; horizontal scalability; fault tolerance; no intermediate message brokers; a higher-level abstraction than message passing; something that "just works".
- Storm provides real-time computation: scalable, guarantees no data loss, extremely robust and fault-tolerant, and programming-language agnostic.

Big Data Real-time Analysis Technology — Storm architecture & stream processing model
Storm cluster: distributed master/slave architecture
- Nimbus: code distribution, task deployment, fault monitoring
- Supervisor: processing task control
- ZooKeeper: cluster management

Big Data Real-time Analysis Technology — Storm stream grouping
When a tuple is emitted, which task does it go to?
- Shuffle grouping: pick a random task
- Fields grouping: consistent hashing on a subset of tuple fields
- All grouping: send to all tasks
- Global grouping: pick the task with the lowest id
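The groupings above can be sketched as routing functions from a tuple to a task index. This is not the Storm API — the class and method names are invented — but it shows the key property of fields grouping: tuples that agree on the chosen fields always land on the same task.

```java
import java.util.*;

// Sketch of Storm-style stream groupings: given a tuple and a number of
// target tasks, each grouping decides which task receives the tuple.
// Illustrative only; not the actual Storm API.
public class StreamGrouping {
    // Fields grouping: hash a subset of tuple fields, so tuples with the
    // same field values always go to the same task (needed, e.g., so one
    // task sees every occurrence of a given word when counting).
    public static int fieldsGrouping(Map<String, Object> tuple,
                                     List<String> fields, int numTasks) {
        List<Object> key = new ArrayList<>();
        for (String f : fields) key.add(tuple.get(f));
        return Math.floorMod(key.hashCode(), numTasks);
    }

    // Shuffle grouping: pick a random task (load balancing).
    public static int shuffleGrouping(Random rng, int numTasks) {
        return rng.nextInt(numTasks);
    }

    // Global grouping: always the task with the lowest id.
    public static int globalGrouping(int numTasks) {
        return 0;
    }

    public static void main(String[] args) {
        Map<String, Object> t1 = Map.of("word", "storm", "count", 1);
        Map<String, Object> t2 = Map.of("word", "storm", "count", 7);
        // Same "word" field -> same task, regardless of other fields.
        System.out.println(fieldsGrouping(t1, List.of("word"), 4)
                        == fieldsGrouping(t2, List.of("word"), 4)); // true
    }
}
```

Real Storm additionally rebalances when tasks are added or removed, which is where the "consistent" in consistent hashing earns its name.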

Big Data Real-time Analysis Technology — Storm processing example (word count)

Big Data Real-time Analysis Technology — S4 overview (Simple Scalable Streaming System)
S4 is a general-purpose, distributed, scalable, fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data.
- Released by Yahoo! in October 2010; an Apache Incubator project since September 2011; under the Apache 2.0 license.
- Proven: deployed in production systems at Yahoo! to process thousands of search queries per second.
- Extensible: applications can easily be written and deployed using a simple API.
- Decentralized: all nodes are symmetric, with no centralized service and no single point of failure.
- Cluster management: uses a communication layer built on top of ZooKeeper.
- Scalable: throughput increases linearly as additional nodes are added to the cluster.
- Fault-tolerant: when a server in the cluster fails, a stand-by server is automatically activated to take over its tasks.

Big Data Real-time Analysis Technology — S4 architecture
S4 is logically a message-passing system: computational units, called Processing Elements (PEs), send and receive messages (called Events). The S4 framework defines an API which every PE must implement, and provides facilities for instantiating PEs and for transporting Events.

Big Data Real-time Analysis Technology — S4 stream processing model
External data sources → data stream adapter (converts to Events) → input Events → Processing Node (PEC: Processing Element Container, holding the Processing Elements) → output Events
- Stream: a sequence of Events.
- Events: arbitrary Java objects that can be passed between PEs, of the form (K, A), where K is the keyed attribute/value and A is the set of other attributes.
- Adapters convert external data sources into Events that S4 can process.
- Attributes of Events can be accessed via getters in PEs.
- Events are dispatched in named streams.

An Event can be a plain Java object such as:

public class Person {
    String name = "Lee";
    int age = 30;
    String addr = "Daejeon";
}

Big Data Real-time Analysis Technology — S4 stream processing model
PE (Processing Element):
- The basic computational unit in S4; PEs consume events and can in turn emit new events and update their state.
- Each instance of a PE is uniquely identified by four components: its functionality as defined by a PE class and associated configuration, the named stream that it consumes, the keyed attribute in those events, and the value of the keyed attribute in the events it consumes.
- Every PE consumes exactly those events which correspond to the value on which it is keyed.
- A PE is instantiated for each value of the key attribute; this instantiation is performed by the platform.

public class Person {   // type of event = named stream
    String name;        // keyed attribute
    int age;            // other attribute
    String addr;        // other attribute
}
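The "one PE instance per key value" rule above can be sketched as lazy instantiation from a prototype. All names here are invented for illustration (this is not the S4 API): a word-count PE is created the first time a given key value is seen, and from then on receives every event keyed on that value.

```java
import java.util.*;

// Sketch of S4-style keyed PE instantiation: the platform keeps a PE
// prototype and lazily creates one PE instance per distinct value of the
// keyed attribute. Illustrative only; not the actual S4 API.
public class KeyedDispatch {
    // A trivial PE that counts the events routed to its own key value.
    static class WordCountPE {
        final String key; // value of the keyed attribute this PE owns
        int count = 0;
        WordCountPE(String key) { this.key = key; }
        void processEvent() { count++; }
    }

    private final Map<String, WordCountPE> instances = new HashMap<>();

    // Route an event to the PE keyed on this value, creating the instance
    // on first use (as the platform would do from the prototype).
    public WordCountPE dispatch(String keyValue) {
        WordCountPE pe = instances.computeIfAbsent(keyValue, WordCountPE::new);
        pe.processEvent();
        return pe;
    }

    public static void main(String[] args) {
        KeyedDispatch d = new KeyedDispatch();
        d.dispatch("big");
        d.dispatch("data");
        System.out.println(d.dispatch("big").count); // 2
    }
}
```

Because each PE instance owns exactly one key value, its state (here, `count`) needs no locking: all events for that value arrive at the same instance.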

Big Data Real-time Analysis Technology — S4 stream processing model
Processing Node (PN):
- Logical host to PEs; responsible for listening to events, executing operations on incoming events, dispatching events with the assistance of the communication layer, and emitting output events.
- S4 routes each event to PNs based on a hash function of the values of all known keyed attributes in that event.
- Event listener: passes incoming events to the PEC.
- PEC: invokes the appropriate PEs in the appropriate order.
- Every keyless PE is instantiated once per PN; only one PE prototype exists in a PN.
PE Container (PEC):
- Holds all PE instances, including the PE prototypes.
- Responsible for routing incoming events to the appropriate PE instances.

Big Data Real-time Analysis Technology — S4 processing example (word count)

Big Data Real-time Analysis Technology — Twitter Storm vs. Yahoo! S4

4. Case Studies
- Development case: a real-time big data platform
- Application case: a real-time big data platform
- In-memory computing for big data

Real-time Big Data Platform Development Case
Project: Development of original technology for next-generation-memory-based big data analysis and management software (ETRI, 2012.6–2017.5)

Real-time Big Data Platform Development Case — Architecture of the real-time big data analysis platform

Real-time Big Data Platform Application Case
Project: Development of cyber targeted-attack recognition and tracing technology (ETRI, 2013.3–2017.2)

Real-time Big Data Platform Application Case — Architecture of the platform for processing large accumulated data and real-time data (built on open source)

In-Memory Computing for Big Data
[Hype Cycle for Big Data]

In-Memory Computing for Big Data — Application case 1: real-time spatial statistics analysis and delivery system (Statistics Korea)
Statistics Korea's Statistical Navigator system:
1) A spatial big data system providing a web-based public service that links detailed regional living statistics, closely tied to citizens' daily lives, with region-level spatial information; applying Kairos realized a high-speed, web-based statistical GIS service.
2) A successful case of replacing a system originally built on foreign software with a new system based on domestic technology, including domestic web technology.
3) Service reliability secured through real-time data updates.
System configuration: service gateway, WebGIS server, and two web servers; HP Superdome (HP-UX, 8 CPUs × quad core, 256 GB RAM); DB: 100 GB (as of 2012); Kairos Spatial 4.8; high availability through active-active redundancy (gis.nso.go.kr); middleware (active); Map DB and Census DB replicated via a sync agent.

In-Memory Computing for Big Data — Application case 2: real-time traffic information collection, processing, and analysis system (Hyundai/Kia Motors)
Advancement of the Hyundai/Kia Motors traffic information system:
1) A representative big data success story: an in-memory DBMS was introduced because a disk-based DBMS hit its performance limits processing Hyundai/Kia's traffic information big data.
2) The first application of an in-memory DBMS at Hyundai/Kia Motors headquarters.
3) Shorter processing times enable more accurate traffic information than before, improving service quality.
4) The traffic information service can be linked to in-vehicle devices (cards, navigation systems, etc.).