Cloud Relational Database Aurora & Real-Time Data Analytics with Open Source. 양승도, Solutions Architect, Amazon Web Services. © 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
"When we speak of free software, we are referring to freedom, not price." - Richard Stallman, Free Software Foundation, GNU Project
http://amzn.github.io
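Relational databases with RDS: quick and easy setup; routine administrative tasks handled for you; a choice of relational database engines; quick and easy scaling; simple high-availability configuration.
RDS database engines: Aurora
What is Aurora? A MySQL-compatible relational database engine that offers the performance and availability of commercial databases with the efficiency and cost of open-source databases.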
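Database architecture for the cloud: (1) Logging and storage are moved into a multi-tenant, scale-out, database-optimized storage service. (2) The service internally uses other AWS services such as EC2, VPC, DynamoDB, SWF, and Route 53. (3) Integration with S3 for continuous backup provides 99.999999999% durability. [Diagram: data plane (SQL, transactions, caching, logging + storage); control plane (DynamoDB, SWF, Route 53); S3.]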
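Key features of Aurora: high performance; strong security; MySQL compatibility; high scalability; high availability and durability; fully managed.
Strong security. Encryption at rest: AES-256 with hardware acceleration; all blocks on disk and in S3 are encrypted; keys are managed through AWS KMS. Encryption in transit: SSL. Network isolation through VPC. No direct access to nodes. Supports industry-standard security and data protection certifications. [Diagram: application; SQL, transactions, caching; storage; S3.]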
You've probably heard about our benchmark numbers
SQL performance test results: MySQL SysBench benchmark on Aurora r3.8xl (32 vCPU, 244 GiB RAM). Write performance: 4 client machines with 1,000 connections each. Read performance: a single client machine with 1,600 connections.
5x faster than RDS MySQL 5.6 & 5.7. [Charts: write performance and read performance; MySQL SysBench results on R3.8XL (32 cores / 244 GiB RAM); series: Aurora, MySQL 5.6, MySQL 5.7.] Five times higher throughput than stock MySQL, based on industry-standard benchmarks.
Performance by instance size. [Charts: write performance and read performance; series: Aurora, MySQL 5.6, MySQL 5.7.] Aurora scales with instance size for both reads and writes.
Lower read replica lag (SysBench OLTP workload, 250 tables). Up to 500x lower lag.
    Updates per second    Aurora     RDS MySQL, 30K IOPS (single AZ)
    1,000                 2.62 ms    0 s
    2,000                 3.42 ms    1 s
    5,000                 3.94 ms    60 s
    10,000                5.38 ms    300 s
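Aurora architecture for performance. Do less work: do fewer I/Os; minimize network packets; cache prior results; offload the database engine. Be more efficient: process asynchronously; reduce the latency path; use lock-free data structures; batch operations together. Databases are all about I/O. Network-attached storage is all about packets/second. High-throughput processing does not allow context switches.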
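Aurora cluster. [Diagram: an Aurora primary instance; a cluster volume spanning three Availability Zones (AZ 1, AZ 2, AZ 3); S3.]
Aurora cluster with read replicas. [Diagram: an Aurora primary instance and two Aurora Replicas across AZ 1, AZ 2, and AZ 3; a cluster volume spanning three Availability Zones; S3.]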
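Aurora I/O traffic. MySQL read scaling: the MySQL master handles 30% reads and 70% writes; single-threaded binlog shipping feeds the MySQL replica, which handles 70% writes and 30% new reads; each node has its own data volume. Logical replication: SQL statements are applied on the replica; the write load is similar on both nodes; storage is separate; data can differ between master and replica. Amazon Aurora read scaling: the Aurora master handles 30% reads and 70% writes; page cache updates go to the Aurora Replica, which serves 100% new reads; both share Multi-AZ storage. Physical replication: redo is shipped from the master to the replica; the replica shares storage and performs no writes; redo is applied to cached pages.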
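Aurora high availability
Aurora storage. Highly available by default: 6-way replication across 3 Availability Zones; 4-of-6 write quorum and 3-of-6 read quorum; continuous backup to S3. SSD-based, scale-out, multi-tenant storage: storage grows continuously up to 64 TB; pay only for what you use; log-structured storage. [Diagram: SQL, transactions, caching above storage in AZ 1, AZ 2, AZ 3; S3.]
Self-healing, fault-tolerant storage: automatic failure detection, replication, and repair. Losing two copies, or one Availability Zone, does not affect read or write availability. Losing three copies does not affect read availability. [Diagrams: read availability; read and write availability.]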
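The quorum rules on these two slides can be checked with a little arithmetic. The following is a minimal sketch of that arithmetic only, not Aurora's implementation; it assumes the six copies are spread as two per Availability Zone, so that losing one zone means losing two copies.

```python
# Illustrative model of the quorum rules on the slides (not Aurora's implementation).
COPIES = 6          # 6-way replication; assumed to be 2 copies in each of 3 AZs
WRITE_QUORUM = 4    # 4 of 6 copies must acknowledge a write
READ_QUORUM = 3     # 3 of 6 copies must answer a read


def availability(failed_copies: int) -> dict:
    """Return which operations still reach quorum after `failed_copies` copies are lost."""
    if not 0 <= failed_copies <= COPIES:
        raise ValueError("failed_copies must be between 0 and 6")
    healthy = COPIES - failed_copies
    return {"read": healthy >= READ_QUORUM, "write": healthy >= WRITE_QUORUM}


# Any read quorum overlaps any write quorum in at least one copy because 3 + 4 > 6,
# so a quorum read always includes a copy that saw the latest acknowledged write.
assert READ_QUORUM + WRITE_QUORUM > COPIES

for failed in (0, 2, 3, 4):
    print(failed, availability(failed))
```

With two failed copies (or one lost zone under the two-per-zone assumption) both quorums still hold; with three failed copies only the read quorum holds, which matches the two diagrams.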
Automatic instance failover in Aurora. With an Aurora Replica: an existing replica is promoted to the new primary instance; the priority of failover targets can be specified; the DB cluster endpoint is kept, and its DNS record is changed to point to the new primary instance; failover typically completes within 1 minute. Without an Aurora Replica: Aurora attempts to create a new DB instance in the same Availability Zone; if that is not possible, it attempts to create a new DB instance in a different Availability Zone; failover typically completes within 15 minutes. [Diagrams: automatic failover to a replica instance; creation of a new primary instance; shared Multi-AZ storage across AZ 1, AZ 2, AZ 3.]
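To make the "failover target priority" idea concrete, here is a small sketch of how a promotion target could be chosen. The `Replica` class, its field names, and the tie-break on instance size are assumptions made for illustration; the slide only states that a priority can be specified.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Replica:
    name: str            # hypothetical instance identifier
    promotion_tier: int  # assumed convention: lower number = higher failover priority
    size_rank: int       # hypothetical ordering of instance sizes; larger = bigger


def pick_failover_target(replicas: Sequence[Replica]) -> Optional[Replica]:
    """Pick the replica to promote, or None if there is no replica.

    None corresponds to the "no read replica" path on the slide, where a new
    primary instance has to be created instead.
    """
    if not replicas:
        return None
    # Lowest tier wins; among equal tiers prefer the larger instance (assumed tie-break).
    return min(replicas, key=lambda r: (r.promotion_tier, -r.size_rank))
```

Because the cluster endpoint is kept and only its DNS record changes, clients that reconnect through the endpoint do not need to know which replica was chosen.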
Fast crash recovery. Traditional databases: logs since the last checkpoint must be replayed; in MySQL this is single-threaded and requires a large number of disk accesses. A crash at T0 requires re-application of the SQL in the redo log since the last checkpoint. Aurora: the storage layer replays redo records on demand as part of reads; parallel, distributed, asynchronous. A crash at T0 results in redo logs being applied to each segment on demand, in parallel, asynchronously. [Diagram: checkpointed data, redo log, T0.]
Survivable cache. The cache is separated from the database process: the caching process is outside the DB process and remains warm across a database restart, so the full cache is available quickly. Instant crash recovery + survivable cache = fast, easy recovery from DB failures. [Diagram: SQL, transactions, caching.]
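The contrast between replaying the whole log at startup and replaying on demand can be illustrated with a toy model. This is a sketch under simplifying assumptions, not Aurora's storage code: a redo record is modeled as "set this page to this value", and the `Segment` class is hypothetical.

```python
from collections import defaultdict


class Segment:
    """Toy storage segment that applies redo records lazily, on read."""

    def __init__(self):
        self.pages = {}                    # page_id -> last materialized value
        self.pending = defaultdict(list)   # page_id -> redo records not yet applied

    def append_redo(self, page_id, new_value):
        # A write only appends a redo record; the page itself is untouched.
        self.pending[page_id].append(new_value)

    def read(self, page_id):
        # Replay outstanding redo for this page only, then serve the read.
        for value in self.pending.pop(page_id, []):
            self.pages[page_id] = value
        return self.pages.get(page_id)
```

After a crash nothing has to be replayed up front: each page catches up the first time it is read, and independent segments can do so in parallel.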
Compatible with the MySQL ecosystem
Well-established MySQL ecosystem. "We ran our compatibility test suites against Aurora and everything just worked." - Dan Jewett, Vice President of Product Management at Tableau. Partner categories: Business Intelligence; Data Integration; Query and Monitoring; SI and Consulting. Source:
How do Open Source & Cloud fit into Data Analytics?
Generation; Collection & Storage; Analytics & Computation; Collaboration & Sharing
[Diagram: Generation (more devices, lower cost, higher throughput); Collection & Storage; Analytics & Computation; Collaboration & Sharing; label: constraints.] Amazon Web Services helps remove constraints.
Three types of data analytics: Retrospective (analysis or reporting); Here-and-now (real-time analytics and dashboards); Predictions (smarter services).
How Fast is Real-Time?
"There's no such thing as real time, only near-real time. Typically when we talk about real-time, we mean architectures that allow to respond to data without persisting it to a database first!" - John Akred, CTO, Silicon Valley Data Science
So what is near real-time? The ability to process data as soon as it arrives; in other words, processing data in its "present" state rather than at some "future" time. So what is the "present"? E-commerce: the attention span of a potential customer. Options trader: milliseconds. Guided missile: microseconds.
Solution: stream processing. Stream storage that lets you process events as they come in and react accordingly.
What do we expect from a real-time data stream?
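Expectations for a real-time data stream. What do we expect from a real-time data stream? High availability; scalability; fault tolerance; durability (temporary). How is that possible? Multiple data center facilities; automatically scalable infrastructure; global load balancing; and so on.
AWS Global Infrastructure: 12 Regions (Oregon, GovCloud, Northern California, N. Virginia, São Paulo, Ireland, Frankfurt, Beijing, Seoul, Tokyo, Singapore, Sydney); 33 Availability Zones; 55 edge locations; continuous expansion.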
Pipeline stages: Ingest, Store, Process, Visualize. AWS services shown: RDS, Kinesis, EMR, Machine Learning, Mobile Analytics, DynamoDB, CloudSearch, Redshift, AWS Data Pipeline, AWS Import/Export, S3, Glacier, Lambda, EC2.
Fluentd: open-source log collection. Fluentd is an open-source data collector that unifies data collection and consumption. It integrates with many data sources (app logs, syslogs, Twitter, etc.) and directly with AWS services such as S3 and Kinesis. Example configuration:
    <source>
      type tail
      format apache2
      path /var/log/apache2/access_log
      tag s3.apache.access
    </source>
    <match s3.*.*>
      type s3
      s3_bucket myweblogs
      path logs/
    </match>
https://github.com/fluent/fluentd/
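The configuration routes events by tag: the source assigns the tag s3.apache.access, and the match section picks up any tag that fits s3.*.*. As a rough sketch of that matching rule, the function below treats `*` as standing for exactly one dot-separated tag part. It is written for illustration only; `tag_matches` is a hypothetical helper, not part of Fluentd, and other wildcard forms are not modeled.

```python
def tag_matches(pattern: str, tag: str) -> bool:
    """Return True if a dot-separated tag matches a pattern where `*` stands for exactly one part."""
    pattern_parts = pattern.split(".")
    tag_parts = tag.split(".")
    if len(pattern_parts) != len(tag_parts):
        return False
    return all(p == "*" or p == t for p, t in zip(pattern_parts, tag_parts))


print(tag_matches("s3.*.*", "s3.apache.access"))
print(tag_matches("s3.*.*", "s3.apache"))
```

The first call matches because the tag has three parts starting with s3; the second does not because it has only two parts.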
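Real-time data stream: Kinesis. Real-time data analysis over large, distributed streams. Elastic capacity that can handle millions of events per second. React in real time to events as they enter the stream. A reliable stream that is replicated to three storage locations.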
Kinesis for Real-Time
Kinesis: producers and consumers. Producers: HTTP POST, AWS SDKs, LOG4J, Flume, Fluentd, Kinesis Producer Library (IoT). Consumers: App 1 (aggregate and de-duplicate), App 2 (metric extraction), App 3 (decision-making tree), App 4 (machine learning). Destinations shown: S3, DynamoDB, Apache Storm, EMR.
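On the producer side, sending an event through an AWS SDK comes down to a single put-record call with a stream name, a payload, and a partition key. The sketch below assumes events are JSON objects with a "user" field used as the partition key; `send_event` and `FakeKinesis` are hypothetical names, and the fake client exists only so the example runs without AWS credentials.

```python
import json


def send_event(client, stream_name: str, event: dict, key_field: str = "user") -> dict:
    """Send one event via any client that exposes a Kinesis-style put_record()."""
    return client.put_record(
        StreamName=stream_name,
        Data=json.dumps(event).encode("utf-8"),
        # Records with the same partition key are routed to the same shard.
        PartitionKey=str(event[key_field]),
    )


class FakeKinesis:
    """In-memory stand-in for a Kinesis client (illustration only)."""

    def __init__(self):
        self.records = []

    def put_record(self, StreamName, Data, PartitionKey):
        self.records.append((StreamName, Data, PartitionKey))
        return {"SequenceNumber": str(len(self.records))}


fake = FakeKinesis()
send_event(fake, "twitter-stream", {"user": "alice", "text": "Open-Source rocks"})
print(fake.records)
```

In a real producer, a boto3 Kinesis client created with `boto3.client("kinesis")` could be passed in place of the fake, since it accepts the same keyword arguments.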
Apache Spark Streaming. Apache Spark is an in-memory cluster analytics engine that uses RDDs for fast processing. Spark Streaming can read directly from a Kinesis stream. Counting tweets on a sliding window (simplified; the createStream arguments are abbreviated):
    KinesisUtils.createStream("twitter-stream")
      .filter(_.getText.contains("Open-Source"))
      .countByWindow(Seconds(5))
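The idea behind counting on a sliding window can be shown without a cluster. The following plain-Python sketch is not Spark code and does not reproduce Spark's batch-interval semantics; it only illustrates filtering events and counting those that fall within the last few seconds. `WindowCounter` and `count_matching` are hypothetical names.

```python
from collections import deque


class WindowCounter:
    """Count events whose timestamps fall within the last `window_seconds` seconds."""

    def __init__(self, window_seconds: float):
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def add(self, timestamp: float) -> None:
        # Timestamps are assumed to arrive in non-decreasing order.
        self._timestamps.append(timestamp)

    def count(self, now: float) -> int:
        # Drop events that have slid out of the window, then count the rest.
        while self._timestamps and self._timestamps[0] <= now - self.window_seconds:
            self._timestamps.popleft()
        return len(self._timestamps)


def count_matching(tweets, keyword, window_seconds, now):
    """tweets: iterable of (timestamp, text) pairs, ordered by timestamp."""
    counter = WindowCounter(window_seconds)
    for timestamp, text in tweets:
        if keyword in text:
            counter.add(timestamp)
    return counter.count(now)
```

A deque keeps both adding an event and expiring old ones cheap, which is the property a streaming counter needs when events arrive continuously.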
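React in real time: Lambda. A fully managed, highly available serverless compute and cloud functions service. Triggered by service calls or state changes. Scales automatically to match the rate of incoming events. Can be attached to a Kinesis stream to react to every incoming event. [Diagram: Kinesis, Lambda.]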
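When a function is attached to a Kinesis stream, it receives batches of records whose payloads are base64-encoded. The handler below is a minimal sketch of reacting to each incoming event; it assumes producers send JSON objects with a "text" field, and the keyword and the returned summary are illustrative choices.

```python
import base64
import json


def lambda_handler(event, context):
    """Decode each Kinesis record in the batch and count tweets mentioning a keyword."""
    records = event.get("Records", [])
    matches = 0
    for record in records:
        # Kinesis delivers each record's payload base64-encoded.
        payload = base64.b64decode(record["kinesis"]["data"])
        tweet = json.loads(payload)  # assumption: payload is JSON with a "text" field
        if "Open-Source" in tweet.get("text", ""):
            matches += 1
    return {"processed": len(records), "matches": matches}
```

The handler can be exercised locally by building an event dictionary by hand, which is a convenient way to test the decoding logic before connecting it to a stream.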
DynamoDB: a fully managed NoSQL database service. Schemaless data model (tables, items, attributes); seamless scalability; no storage or throughput limits; consistent low-latency performance; high durability and availability.
Customer example: 500,000 writes per second to their DynamoDB tables; 200 additional servers during the Super Bowl; 0 additional servers right after.
1 instance x 100 hours = 100 instances x 1 hour
Kibana: open-source visualization tool. Kibana is an open-source project from Elastic for visualizing data in the browser. It uses Elasticsearch as its indexing engine (based on Apache Lucene). Elasticsearch on Hadoop is available (es-hadoop). https://github.com/elastic/kibana
Let's put it all together!
Live Twitter feed analysis. Twitter Blog*: on a typical day, more than 500 million Tweets are sent, an average of 5,700 TPS. [Diagram: Twitter stream, Kinesis, Lambda, DynamoDB, S3, visualization with D3.js.] * https://blog.twitter.com/2013/new-tweets-per-second-record-and-how
Open source & cloud: efficient data storage and processing.
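The two figures on this slide are consistent with each other, and the same arithmetic gives a first estimate of stream capacity. The sketch below assumes a per-shard write limit of 1,000 records per second, which is an assumption to verify against current Kinesis limits; it ignores the per-shard byte limit and traffic peaks, and `shards_needed` is a hypothetical helper.

```python
import math

TWEETS_PER_DAY = 500_000_000      # "more than 500 million Tweets" from the slide
SECONDS_PER_DAY = 24 * 60 * 60

average_tps = TWEETS_PER_DAY / SECONDS_PER_DAY

# Assumption: one shard accepts up to 1,000 records per second for writes.
RECORDS_PER_SHARD_PER_SECOND = 1_000


def shards_needed(records_per_second: float, headroom: float = 1.0) -> int:
    """Minimum shard count by record rate, optionally scaled by a headroom factor."""
    return math.ceil(records_per_second * headroom / RECORDS_PER_SHARD_PER_SECOND)


print(round(average_tps))
print(shards_needed(average_tps))
print(shards_needed(average_tps, headroom=2.0))
```

The average rate comes out slightly above the 5,700 TPS quoted on the slide because the slide rounds down. A real deployment would size for peak traffic rather than the daily average.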