
Big Data Real-time Analysis Technology: Trends and Application Cases
2013. 10. 08
RealTimeTech Co., Ltd.

Contents
1. Big Data Overview
2. Big Data Analytics Overview
3. Big Data Analysis Technologies
4. Case Studies

1. Big Data Overview

Big Data Overview — Background of Big Data Technologies
- Digital Universe: the total amount of data stored in the world's computers.
- The rapid rate of data growth (over 45%) creates problems of storage, processing speed, etc.
- Over 90% of data is unstructured or semi-structured — can conventional data processing cope?
- Given the frequency of data generation and delivery, processing should be applied to data in motion.
Source: IDC Digital Universe Study (2011); IDC (2012)

Big Data Overview — Definition of Big Data
"Big data technologies describe a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data, by enabling high-velocity capture, discovery, and/or analysis." — Definition by IDC
- Volume: data volumes beyond DBMS capacity; requires system scalability (distributed computing, parallelism).
- Variety: diversification of data; unstructured data must be processed; requires system flexibility (user-defined processes and new processing models).
- Velocity: high-speed processing (analysis) of data; decision speed matters and latency must be minimized; in-memory computing, supercomputing, and stream processing.

Big Data Overview — Components of a Big Data Platform
Data collection → data preprocessing → information storage and management → information processing and analysis → intelligence and visualization

Big Data Overview — Open-Source Big Data Platform (1/2)
- Data analysis: machine learning (Mahout); data mining, statistics, and visualization libraries (R); text mining
- (Near) real-time processing: CEP (Esper); real-time stream processing (Storm, S4)
- Batch processing: data processing framework (MapReduce); data processing languages (Pig, Hive); graph processing (Hama, Giraph)
- Data aggregation: web crawler (Nutch); RDBMS adapter (Sqoop); collectors (Flume, Scribe, Chukwa); job workflow engine (Oozie)
- Data store: RDBMS (MySQL, PostgreSQL); NoSQL (HBase, Redis, MongoDB); file system (HDFS); NewSQL (VoltDB); search store (ElasticSearch, Solr)
- Management: cluster management (ZooKeeper)

Big Data Overview — Open-Source Big Data Platform (2/2)

Category             Software                 Description
Data Collection      Flume, Scribe, Chukwa    Collecting data from data sources
                     Sqoop                    Data delivery between HDFS and RDBMS
                     Nutch                    Web crawler
Data Store           HDFS                     Distributed file system
                     HBase, Redis, MongoDB    Key-value based database management systems
                     VoltDB                   RDBMS supporting scalability and ACID
                     ElasticSearch, Solr      Search engines
Real-time Analytics  Storm, S4                Real-time distributed and parallel data processing
                     Esper                    Processing stream data and providing a high-level language
Batch Analytics      Oozie                    Workflow scheduler for Hadoop jobs
                     MapReduce                Batch distributed and parallel data processing
                     Pig, Hive                Analytic operations and a high-level language for big data
                     Giraph, Hama             Distributed and parallel programming model for big graph data
Mining               Mahout                   Machine learning
                     R                        Statistics, data mining, and visualization library
Management           ZooKeeper                Distributed coordinator for cluster management

2. Big Data Analytics Overview

Big Data Analytics Overview — Evolution of Analysis Technology (flow of concepts in big data analytics)
- Descriptive (past): What happened? What is happening? — business reporting, dashboards, scoreboards, data warehousing → well-defined business problems and opportunities.
- Predictive: What will happen? Why will it happen? — data mining, text mining, web/media mining, forecasting → accurate projections of future states and conditions.
- Prescriptive (future): What should I do? Why should I do it? — optimization, simulation, decision modeling, expert systems → best possible business decisions and transactions.

Big Data Analytics Overview — Changes in the Analysis Environment

Big Data Analytics Overview — Application Areas of Analysis Technology (potential use cases)
Source: SAS & IDC

3. Big Data Analysis Technologies
- Big data batch analysis technology
- Big data real-time analysis technology

Big Data Batch Analysis Technology — Hadoop overview
An Apache open-source project started in 2004 as a clone of Google's platform; it has since grown into the mainstream platform for big data storage and analysis.
A software platform that lets one easily write and run applications that process vast amounts of data. It includes:
- MapReduce: offline computing engine
- HDFS: Hadoop distributed file system
- HBase (pre-alpha): online data access
Why Hadoop is useful:
- Scalable: it can reliably store and process petabytes.
- Economical: it distributes the data and processing across clusters of commonly available computers (in the thousands).
- Efficient: by distributing the data, it can process it in parallel on the nodes where the data is located.
- Reliable: it automatically maintains multiple copies of data and automatically redeploys computing tasks on failures.

Big Data Batch Analysis Technology — HDFS
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems, but the differences are significant. HDFS:
- is highly fault-tolerant and designed to be deployed on low-cost hardware;
- provides high-throughput access to application data and is suitable for applications with large data sets;
- relaxes a few POSIX requirements to enable streaming access to file system data;
- is part of the Apache Hadoop Core project.
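The fault tolerance described above comes from block replication: each block of a file is stored on several datanodes. The following is a toy, single-process sketch of that idea only — the class and method names are invented for illustration, and real HDFS uses a rack-aware placement policy, not simple round-robin.

```java
import java.util.*;

// Toy sketch of HDFS-style block replication (illustrative only; not the
// real HDFS placement policy). Each block is assigned to `replication`
// distinct datanodes so a single node failure cannot lose data.
public class BlockPlacement {
    public static Map<Integer, List<String>> place(int numBlocks,
                                                   List<String> datanodes,
                                                   int replication) {
        Map<Integer, List<String>> placement = new HashMap<>();
        for (int b = 0; b < numBlocks; b++) {
            List<String> replicas = new ArrayList<>();
            // Round-robin over datanodes, starting at a different offset per
            // block so the replicas of one block land on distinct nodes and
            // load spreads across the cluster.
            for (int r = 0; r < replication; r++) {
                replicas.add(datanodes.get((b + r) % datanodes.size()));
            }
            placement.put(b, replicas);
        }
        return placement;
    }

    public static void main(String[] args) {
        Map<Integer, List<String>> p =
            place(4, List.of("dn1", "dn2", "dn3"), 3);
        System.out.println(p.get(0)); // [dn1, dn2, dn3]
    }
}
```

With replication = 3 (the HDFS default) and at least three datanodes, every block survives the loss of any two nodes.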

Big Data Batch Analysis Technology — MapReduce
- A programming model developed at Google
- Sort/merge based distributed computing
- Used extensively by many organizations (e.g., Yahoo, Amazon.com, IBM)
- A functional style of programming (cf. LISP), parallelizable across a large cluster of workstations or PCs
Key features behind Hadoop's success:
- partitioning of the input data
- scheduling the program's execution across several machines
- handling machine failures
- managing required inter-machine communication
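The map → shuffle/sort → reduce flow of this model can be sketched in-process with plain Java. This is a single-machine simulation of the model for the classic word-count example, not the Hadoop API; all names here are illustrative.

```java
import java.util.*;

// Minimal in-process sketch of the MapReduce flow: map emits (word, 1)
// pairs, shuffle groups them by key (sort/merge in real Hadoop), and
// reduce sums the values per key.
public class WordCountMR {
    public static Map<String, Integer> run(List<String> lines) {
        // Map phase: emit (word, 1) for every word in every input line.
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.split("\\s+")) {
                if (!word.isEmpty()) emitted.add(Map.entry(word, 1));
            }
        }
        // Shuffle phase: group all emitted values by key, in sorted order.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> kv : emitted) {
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                   .add(kv.getValue());
        }
        // Reduce phase: sum the grouped values for each key.
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            counts.put(e.getKey(), sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("big data", "big analytics")));
        // {analytics=1, big=2, data=1}
    }
}
```

In real Hadoop the three phases run on different machines, which is exactly why the partitioning, scheduling, and failure handling listed above matter.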

Big Data Batch Analysis Technology — Working model for offline batched analytics

Big Data Batch Analysis Technology — Example applications of Hadoop
- A9.com (Amazon): builds Amazon's product search indices; processes millions of sessions daily for analytics, using both the Java and streaming APIs; clusters vary from 1 to 100 nodes.
- Yahoo!: more than 100,000 CPUs in ~20,000 computers running Hadoop; biggest cluster: 2,000 nodes (2 × 4-CPU boxes with 4 TB disk each); used to support research for ad systems and web search.
- AOL: used for a variety of things ranging from statistics generation to running advanced algorithms for behavioral analysis and targeting; cluster of 50 machines (Intel Xeon, dual processors, dual core, each with 16 GB RAM and 800 GB hard disk), giving a total of 37 TB HDFS capacity.
- Facebook: stores copies of internal log and dimension data sources and uses them as a source for reporting/analytics and machine learning; 320-machine cluster with 2,560 cores and about 1.3 PB raw storage.
- FOX Interactive Media: 3 × 20-machine cluster (8 cores/machine, 2 TB/machine storage) and a 10-machine cluster (8 cores/machine, 1 TB/machine storage); used for log analysis, data mining, and machine learning.
- University of Nebraska-Lincoln: one medium-sized Hadoop cluster (200 TB) to store and serve physics data.
- Adknowledge: builds the recommender system for behavioral targeting, plus other clickstream analytics; clusters vary from 50 to 200 nodes, mostly on EC2.
- Contextweb: stores ad-serving logs and uses them as a source for ad optimization, analytics, reporting, and machine learning; 23-machine cluster with 184 cores and about 35 TB raw storage; each (commodity) node has 8 cores, 8 GB RAM, and 1.7 TB of storage.
- Cornell University Web Lab: generating web graphs on 100 nodes (dual 2.4 GHz Xeon processors, 2 GB RAM, 72 GB hard drive).
- NetSeer: up to 1,000 instances on Amazon EC2; data storage in Amazon S3; used for crawling, processing, serving, and log analysis.
- The New York Times: large-scale image conversions; uses EC2 to run Hadoop on a large virtual cluster.
- Powerset / Microsoft: natural-language search; up to 400 instances on Amazon EC2; data storage in Amazon S3.

Big Data Real-time Analysis Technology — Real-time big data analysis platforms
Big data analysis technology is evolving from batch processing toward technologies for processing explosively growing data streams.
Source: ETRI

Big Data Real-time Analysis Technology — Concept of stream processing
- Stream: unbounded sequence of data
- Processing of data-in-motion
- Finite-window data processing
- Continuous query processing
Source: EMC blog post by William Zhou, Sep. 2012
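Finite-window processing is what makes a continuous query over an unbounded stream feasible: the query only ever touches the most recent N elements. A minimal sketch (all names invented for illustration) of a sliding-window moving average:

```java
import java.util.*;

// Sketch of finite-window processing over an unbounded stream: a fixed-size
// sliding window keeps only the most recent N values, so a continuous query
// (here, a moving average) runs in O(1) memory per window instead of
// storing the whole stream.
public class SlidingWindowAvg {
    private final int size;
    private final Deque<Double> window = new ArrayDeque<>();
    private double sum = 0.0;

    public SlidingWindowAvg(int size) { this.size = size; }

    // Feed one stream element; returns the current windowed average.
    public double push(double value) {
        window.addLast(value);
        sum += value;
        if (window.size() > size) {
            sum -= window.removeFirst(); // evict the oldest element
        }
        return sum / window.size();
    }

    public static void main(String[] args) {
        SlidingWindowAvg avg = new SlidingWindowAvg(3);
        for (double v : new double[]{1, 2, 3, 10}) {
            System.out.println(avg.push(v));
        }
        // Final value: (2 + 3 + 10) / 3 = 5.0
    }
}
```

Systems like Esper, Storm, and S4 generalize this pattern: the window (by count or by time) bounds the state a continuous query must keep as data flows past.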

Big Data Real-time Analysis Technology — Storm overview
- Developed by BackType, which was acquired by Twitter.
- There are lots of tools for (batch) data processing — Hadoop, Pig, HBase, Hive — but none of them are real-time systems, which is becoming a real requirement for businesses.
- Problems of MapReduce: scaling is painful, fault tolerance is poor, coding is tedious.
- What we want: guaranteed data processing; horizontal scalability; fault tolerance; no intermediate message brokers; a higher-level abstraction than message passing; something that "just works".
- Storm provides real-time computation: scalable, guarantees no data loss, extremely robust and fault-tolerant, and programming-language agnostic.

Big Data Real-time Analysis Technology — Storm architecture & stream processing model
Storm cluster: distributed master/slave architecture
- Nimbus: code distribution, task deployment, fault monitoring
- Supervisor: processing task control
- ZooKeeper: cluster management

Big Data Real-time Analysis Technology — Storm stream grouping
When a tuple is emitted, which task does it go to?
- Shuffle grouping: pick a random task
- Fields grouping: consistent hashing on a subset of tuple fields
- All grouping: send to all tasks
- Global grouping: pick the task with the lowest id
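The groupings above can be sketched as routing functions from a tuple to a task index. This is not the Storm API — the class and method names are invented — but it shows the key property of fields grouping: tuples that agree on the chosen fields always land on the same task.

```java
import java.util.*;

// Sketch of Storm-style stream groupings: given a tuple and a number of
// target tasks, each grouping decides which task receives the tuple.
// Illustrative only; not the actual Storm API.
public class StreamGrouping {
    // Fields grouping: hash a subset of tuple fields, so tuples with the
    // same field values always go to the same task (needed, e.g., so one
    // task sees every occurrence of a given word when counting).
    public static int fieldsGrouping(Map<String, Object> tuple,
                                     List<String> fields, int numTasks) {
        List<Object> key = new ArrayList<>();
        for (String f : fields) key.add(tuple.get(f));
        return Math.floorMod(key.hashCode(), numTasks);
    }

    // Shuffle grouping: pick a random task (load balancing).
    public static int shuffleGrouping(Random rng, int numTasks) {
        return rng.nextInt(numTasks);
    }

    // Global grouping: always the task with the lowest id.
    public static int globalGrouping(int numTasks) {
        return 0;
    }

    public static void main(String[] args) {
        Map<String, Object> t1 = Map.of("word", "storm", "count", 1);
        Map<String, Object> t2 = Map.of("word", "storm", "count", 7);
        // Same "word" field -> same task, regardless of other fields.
        System.out.println(fieldsGrouping(t1, List.of("word"), 4)
                        == fieldsGrouping(t2, List.of("word"), 4)); // true
    }
}
```

Real Storm additionally rebalances when tasks are added or removed, which is where the "consistent" in consistent hashing earns its name.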

Big Data Real-time Analysis Technology — Storm processing example (word count)

Big Data Real-time Analysis Technology — S4 overview (Simple Scalable Streaming System)
S4 is a general-purpose, distributed, scalable, fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data.
- Released by Yahoo! in October 2010; an Apache Incubator project since September 2011; under the Apache 2.0 license.
- Proven: deployed in production systems at Yahoo! to process thousands of search queries per second.
- Extensible: applications can easily be written and deployed using a simple API.
- Decentralized: all nodes are symmetric, with no centralized service and no single point of failure.
- Cluster management: uses a communication layer built on top of ZooKeeper.
- Scalable: throughput increases linearly as additional nodes are added to the cluster.
- Fault-tolerant: when a server in the cluster fails, a stand-by server is automatically activated to take over its tasks.

Big Data Real-time Analysis Technology — S4 architecture
S4 is logically a message-passing system: computational units, called Processing Elements (PEs), send and receive messages (called Events). The S4 framework defines an API which every PE must implement, and provides facilities for instantiating PEs and for transporting Events.

Big Data Real-time Analysis Technology — S4 stream processing model
External data sources → data stream adapter (converts to Events) → input Events → Processing Node (PEC: Processing Element Container, holding the Processing Elements) → output Events
- Stream: a sequence of Events.
- Events: arbitrary Java objects that can be passed between PEs, of the form (K, A), where K is the keyed attribute/value and A is the set of other attributes.
- Adapters convert external data sources into Events that S4 can process.
- Attributes of Events can be accessed via getters in PEs.
- Events are dispatched in named streams.

An Event can be a plain Java object such as:

public class Person {
    String name = "Lee";
    int age = 30;
    String addr = "Daejeon";
}

Big Data Real-time Analysis Technology — S4 stream processing model
PE (Processing Element):
- The basic computational unit in S4; PEs consume events and can in turn emit new events and update their state.
- Each instance of a PE is uniquely identified by four components: its functionality as defined by a PE class and associated configuration, the named stream that it consumes, the keyed attribute in those events, and the value of the keyed attribute in the events it consumes.
- Every PE consumes exactly those events which correspond to the value on which it is keyed.
- A PE is instantiated for each value of the key attribute; this instantiation is performed by the platform.

public class Person {   // type of event = named stream
    String name;        // keyed attribute
    int age;            // other attribute
    String addr;        // other attribute
}
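The "one PE instance per key value" rule above can be sketched as lazy instantiation from a prototype. All names here are invented for illustration (this is not the S4 API): a word-count PE is created the first time a given key value is seen, and from then on receives every event keyed on that value.

```java
import java.util.*;

// Sketch of S4-style keyed PE instantiation: the platform keeps a PE
// prototype and lazily creates one PE instance per distinct value of the
// keyed attribute. Illustrative only; not the actual S4 API.
public class KeyedDispatch {
    // A trivial PE that counts the events routed to its own key value.
    static class WordCountPE {
        final String key; // value of the keyed attribute this PE owns
        int count = 0;
        WordCountPE(String key) { this.key = key; }
        void processEvent() { count++; }
    }

    private final Map<String, WordCountPE> instances = new HashMap<>();

    // Route an event to the PE keyed on this value, creating the instance
    // on first use (as the platform would do from the prototype).
    public WordCountPE dispatch(String keyValue) {
        WordCountPE pe = instances.computeIfAbsent(keyValue, WordCountPE::new);
        pe.processEvent();
        return pe;
    }

    public static void main(String[] args) {
        KeyedDispatch d = new KeyedDispatch();
        d.dispatch("big");
        d.dispatch("data");
        System.out.println(d.dispatch("big").count); // 2
    }
}
```

Because each PE instance owns exactly one key value, its state (here, `count`) needs no locking: all events for that value arrive at the same instance.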

Big Data Real-time Analysis Technology — S4 stream processing model
Processing Node (PN):
- Logical host to PEs; responsible for listening to events, executing operations on incoming events, dispatching events with the assistance of the communication layer, and emitting output events.
- S4 routes each event to PNs based on a hash function of the values of all known keyed attributes in that event.
- Event listener: passes incoming events to the PEC.
- PEC: invokes the appropriate PEs in the appropriate order.
- Every keyless PE is instantiated once per PN; only one PE prototype exists in a PN.
PE Container (PEC):
- Holds all PE instances, including the PE prototypes.
- Responsible for routing incoming events to the appropriate PE instances.

Big Data Real-time Analysis Technology — S4 processing example (word count)

Big Data Real-time Analysis Technology — Twitter Storm vs. Yahoo! S4

4. Case Studies
- Development case: a real-time big data platform
- Application case: a real-time big data platform
- In-memory computing for big data

Real-time Big Data Platform Development Case
Project: Development of original technology for next-generation-memory-based big data analysis and management software (ETRI, 2012.6–2017.5)

Real-time Big Data Platform Development Case — Architecture of the real-time big data analysis platform

Real-time Big Data Platform Application Case
Project: Development of cyber targeted-attack recognition and tracing technology (ETRI, 2013.3–2017.2)

Real-time Big Data Platform Application Case — Architecture of the platform for processing large accumulated data and real-time data (built on open source)

In-Memory Computing for Big Data
[Hype Cycle for Big Data]

In-Memory Computing for Big Data — Application case 1: real-time spatial statistics analysis and delivery system (Statistics Korea)
Statistics Korea's Statistical Navigator system:
1) A spatial big data system providing a web-based public service that links detailed regional living statistics, closely tied to citizens' daily lives, with region-level spatial information; applying Kairos realized a high-speed, web-based statistical GIS service.
2) A successful case of replacing a system originally built on foreign software with a new system based on domestic technology, including domestic web technology.
3) Service reliability secured through real-time data updates.
System configuration: service gateway, WebGIS server, and two web servers; HP Superdome (HP-UX, 8 CPUs × quad core, 256 GB RAM); DB: 100 GB (as of 2012); Kairos Spatial 4.8; high availability through active-active redundancy (gis.nso.go.kr); middleware (active); Map DB and Census DB replicated via a sync agent.

In-Memory Computing for Big Data — Application case 2: real-time traffic information collection, processing, and analysis system (Hyundai/Kia Motors)
Advancement of the Hyundai/Kia Motors traffic information system:
1) A representative big data success story: an in-memory DBMS was introduced because a disk-based DBMS hit its performance limits processing Hyundai/Kia's traffic information big data.
2) The first application of an in-memory DBMS at Hyundai/Kia Motors headquarters.
3) Shorter processing times enable more accurate traffic information than before, improving service quality.
4) The traffic information service can be linked to in-vehicle devices (cards, navigation systems, etc.).