Open Source 를이용한 Big Data 플랫폼과실시간처리분석 한국스파크사용자모임, R Korea 운영자 SK C&C 이상훈 (phoenixlee1@gmail.com)
Contents Why Real-time? What is Real-time? Big Data Platform for Streaming Apache Spark 2 KRNET 2015
Why Real-time? 3 KRNET 2015
Data explode Source : http://cloudtweaks.com/2013/02/whitepaper-big-security-for-big-data/ 4 KRNET 2015
BigData Analysis using Hadoop 5 KRNET 2015
BigData Analysis using Hadoop Source : The Definitive Guide 6 KRNET 2015
Your Data is Absorbed 현재의 Big Data 플랫폼은배치위주분석만가능 배치영역과현재사이에갭이생기고, 이것은새로운요구사항을위한새로운플랫폼이필요하다는것을의미함 7 KRNET 2015
Example IoT, Fintech 가발전함에따라실시간처리에대한 Needs 증가 -> 상품추천, 보안 2014 년하반기농협인출사고이후금감원 FDS 로드맵을발표 -> 실시간처리 / 분석필요 8 KRNET 2015
Use Cases Across Industries 9 KRNET 2015
What is Realtime? 10 KRNET 2015
Realtime processing 빅데이터환경에서의실시간컴퓨팅 엄격히정해진시간내에응답을보장해주는것 즉주어진시간안에필요한프로세싱을해서결과를내주거나처리를하게되면이를실시간이라고말함 이것은 0.1 sec~ 1min 등다양한범위에서가능 11 KRNET 2015
Realtime Processing 실시간처리에는다양한영역이있음 Source : Strata+Hadoop World, Srinath Perera 12 KRNET 2015
MapReduce (Batch) 여러가지단점들 - Job Loading 시간소요 - Job간의데이터교환오버헤드 - 불필요한기록 - 고정된 data flow - 어려움 13 KRNET 2015
Sql On Hadoop 의필요성 Needs 의변화 투자대비저렴한가격으로대용량데이터처리에만족 -> 보다높은처리성능및반응요구 많은사용자가 ad-hoc 질의를위해 db 병행사용에불만 대화형질의 (interactive query) 발견은 [ 질의 -> 결과분석과사고 -> 질의 ] 의순환 : 시스템의빠른반응속도가데이터분석의생산성 빠른의사결정가능 Legacy 시스템의상당수가 SQL 로되어있음 성능보장및사람에의한오류방지 Mapreduce 프로그래밍 개발자역량에의존적 버그가능성높음 질의언어 적절한성능은시스템이보장 버그가능성낮음 14 KRNET 2015
Olap Style in memory computing Interactive Processing (SQL On Hadoop) 15 KRNET 2015
Indexed Storage 16 KRNET 2015
CEP CEP (Complex Event Processing) 다양한실시간이벤트를분석할수있는기술이개발되고이를기반으로다양한솔루션들이나와있음 EPL(Event Processing Language) 또는 EQL(Event Query Langauge) 이라는스크립트언어를통해서 SQL 에익숙한개발자나데이터관리자가직관적으로데이터 ( 이벤트 ) 모델링과프로세스를설계해서적용할수있음 Oracle Complex Event Processing, IBM Websphere, Esper 그러나.. 빅데이터에적합한수평적인확장성 (scale-out) 이불가능함 이벤트스트림별로여러대의서버로부하분산을하거나여러개의네트워크카드가있고수백기가메인메모리를갖춘고성능서버를이용해서대량의이벤트처리 17 KRNET 2015
Big Data Platform for Streaming 18 KRNET 2015
Real Time Streaming Architecture 19 KRNET 2015
Data Collection + Message System Flume + Kafka 20 KRNET 2015
What is Storm? Twitter 로합병된 BackType 에서최초개발 Hadoop 에서는처리하지못하는실시간분석을가능하게해줌 Twitter 에서 Storm 은 Tweet 실시간분석알고리즘최적화 Anti-spam 처리 2013 년 9 월 Apache Incubator Project 로등록 21 KRNET 2015
Storm s Features 1/2 Simple programming model Mapreduce 가병렬처리프로세싱구현의복잡도를낮춰주는것과같이 Storm 또한분산 real-time 프로세싱구현의복잡도를낮춰줌 Runs any programming language 어떤언어든사용자가익숙한언어를이용하여구현을할수있음 Clojure, Java, Ruby, Python 은기본으로제공하고있으며그밖에언어도 Storm communication protocol 의구현만으로도사용이가능 Fault-tolerant Worker process 나 node 의장애를자동으로관리해줌 Horizontally scalable Multiple threads, process, server 를이용하여병렬처리가가능하며추가확장이용이 22 KRNET 2015
Storm s Features 2/2 Guaranteed message processing Fast Hadoop 과같이각각의메시지가유실되지않음. 작업실패시에는데이터의시작단계부터다시재시도하도록 replaying message system 이구현되어있음 Netty (or ZeroMQ) 를사용하여메시지를빠르게처리할수있도록설계되어있음 1M + Messages per second per node Local mode Storm 에서는 Cluster mode 와 Local mode 를제공 Local mode 로테스트하여번거로운배포작업을피하면서단위테스트를용이하게할수있음 Easy to Manage Hadoop 과는달리클러스터를관리하는작업이매우간단함 복잡한설정이나관리포인트가없이매우단순하면서도강인함 23 KRNET 2015
Storm Architecture 24 KRNET 2015
Storm Architecture Storm 의클러스터는마스터노드 (Nimbus) 와워커노드 (Supervisor) 로구성되며 Zookeeper 를이용하여노드관리 Nimbus Nimbus 라는이름의데몬이마스터노드의역할 작업할당, 실패확인등의관리역할 Supervisor Supervisor 데몬이실제적으로워커프로세스의시작과종료, 실행상태모니터링등을수행 Zookeeper Apache 프로젝트 분산되어있는노드간의관리를수행하고시스템의안정성을유지하도록관리 25 KRNET 2015
Key concepts Tuples Ordered list of elements Streams Spout Unbounded sequence of tuples Source of streams Queues, Web logs, API calls, Event data 26 KRNET 2015
Key concepts Tuples Ordered list of elements Streams Unbounded sequence of tuples Spout Bolt Source of streams Queues, Web logs, API calls, Event data Process tuples and create new streams Apply functions/transformations Filter, Aggregation, Streaming joins, access DBs, APIs, etc.. 27 KRNET 2015
Key concepts Tuples Ordered list of elements Streams Unbounded sequence of tuples Spout Bolt Source of streams Queues, Web logs, API calls, Event data Process tuples and create new streams Topologies A directed graph of Spout and Bolts 28 KRNET 2015
Storm vs hadoop Real-Time Storm Batch Hadoop Nimbus 는 Storm 에요청되고실행되는모든잡들을관리 JobTracker Supervisor 가모든워커프로세스들을관리 TaskTracker Worker 는 Spout, Bolt 를실행하는프로세서 Task Multiple stages in processing pipeline 스토리지가필요없음. ( 물론스토리지사용도가능함 ) Only two stages in processing pipeline : map and reduce HDFS 가필요함 작업의끝이없음 (Continuous Processing) Mapreduce 작업은끝이있음 29 KRNET 2015
Fault-tolerance Worker 가죽었을경우 Supervisor 가 worker 를 restart 시켜줌 지속적으로 worker 실행이실패할경우 nimbus 가다른 node 에재할당 Node 가죽었을경우 Time-out 이되면 nimbus 는다른 node 에재할당 Nimbus 나 Supervisor 가죽었을경우 재실행이되면모든작업이정상적으로작동됨 재실행되지않아도작업은정상적으로진행 Nimbus Single point of failure? Nimbus 가죽으면 node 재할당은되지않음 HA 준비중
Trident High-level abstraction for doing realtime computations Stateful stream processing Storm Transactions 의모든기능을상속함 Persistence store 기반으로다양한계산처리가가능 Memory Memcached Cassandra Redis Abstraction like Pig, Hive, Cascading 분산처리와최적의성능자동화 MR Combiner 기능존재 Data 네트워크이동최소화
Lambda Architecture Source : MapR developercentral 32 KRNET 2015
But.. Lambda Architecture 너무많은오픈소스 관리하기어려움 더빠른속도가필요 Etc Window Function Machine Learning Analytics 33 KRNET 2015
34 KRNET 2015
Unified Platform 35 KRNET 2015
Fast 36 KRNET 2015
Simple 37 KRNET 2015
Simple 38 KRNET 2015
How Fast? RDDs (Resilient Distributed Datasets) 클러스터전체에서공유되는데이터형태로대부분메모리에올라가있음 Read Only 데이터를수정할수있게되면데이터유실시복구가어려움. Check Point 등고려하지않아도됨 대신새로운메모리를확보하여새로운값을할당. Update 무시 Cache 39 KRNET 2015
Fault Tolerance? RDDs (Resilient Distributed Datasets) Fault Tolerance Lineage 를이용한데이터복구 Need not exist in physical storage RDDs 는메모리에분산임시저장하기때문에데이터처리시디스크를사용하지않음. 그러나, 데이터복구시매우안정적인저장공간으로부터 (ex> HDFS) 데이터를복원하기시작함 Laziness : 모든작업은여러작업을설정해두고마지막액션함수수행시계산함 40 KRNET 2015
Spark Streaming 41 KRNET 2015
Fault-tolerance and Zero Data Loss 42 KRNET 2015
Fault-tolerance and Zero Data Loss 43 KRNET 2015
Window Operation 44 KRNET 2015
Combine batch 45 KRNET 2015
Combine machine learning 46 KRNET 2015
Combine SQL 47 KRNET 2015
Any Question? 48 KRNET 2015