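Agenda
- Spark workload optimization
- Performance optimization for Spark on EMR
- EMR Runtime for Apache Spark
- Apache Hudi: record-level update, delete, and insert
- EMR Managed Resize
- Off-cluster Spark log management for debugging and monitoring Spark jobs
- Lake Formation integration with EMR
- Deploying Spark applications in a Docker environment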

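Managed Hadoop platform: Amazon EMR
Analytics and ML at scale
- Compute (Amazon EMR) is decoupled from storage (Amazon S3)
- Stores and processes data from petabyte to exabyte scale
- Scales out to as many nodes as needed
- Latest open-source applications
- Integrated with AWS security features
- Freely customizable and accessible
- Elastic configuration through auto scaling
- Lower cost through per-second billing
[Diagram: Amazon EMR and a data lake on AWS]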

Amazon EMR - Hadoop
An enterprise-grade Hadoop platform
[Diagram: a layered stack (applications such as Pig and SQL, Framework, Process Layer, Data Layer, Infrastructure) running on Amazon EMR, with EMRFS connecting the cluster to Amazon S3]

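EMR node types
- Master node: manages the cluster; hosts the NameNode and JobTracker
- Core node: runs the TaskTracker that executes tasks and acts as the Hadoop DataNode (HDFS)
- Task node: runs only the TaskTracker; no local HDFS
[Diagram: Amazon EMR cluster with a master instance group, a core instance group (HDFS), and a task instance group]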

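Stateless cluster architecture recommended
- Keep the metastore outside the cluster (Glue Data Catalog or RDS)
- Start clusters quickly and run only the intended job
- Multiple clusters can use the same S3 data at the same time
[Diagram: Amazon Redshift, Amazon RDS, AWS Glue Data Catalog, Amazon Athena, AWS Glue]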

Stateless cluster architecture recommended
- Old clustering/localized model: every node couples CPU, memory, and HDFS storage. Because HDFS keeps three replicas, storing 500 TB of data requires a cluster with about 1.5 PB of storage (500 TB x 3).
- Amazon EMR decoupled model: many EMR clusters and nodes use the same S3 data at once through the EMR File System (EMRFS).
[Diagram: master node plus nodes with CPU, memory, and HDFS storage (old model) versus CPU and memory only (decoupled model)]

Apache Spark
Spark is a parallel processing platform for large-scale data. It consists of components that run different types of workloads.
[Diagram: Spark SQL, Spark/Structured Streaming, SparkR, Spark ML, and GraphX on top of Spark Core]

Spark execution model
Spark uses a master/worker architecture: Executors perform the actual work, and the Driver manages those workers.

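Spark memory configuration
- yarn.nodemanager.resource.memory-mb: total physical memory available to all executors on a node
- spark.executor.memory: memory (heap) an executor can use to run jobs
- spark.shuffle.memoryFraction: fraction of the executor heap used for shuffle data
- spark.storage.memoryFraction: fraction of the allocated memory used for storing (cached) RDD data
- spark.yarn.executor.memoryOverhead: allocated for VM-related overhead; adjust when needed
Note: spark.shuffle.memoryFraction and spark.storage.memoryFraction are legacy settings. Since Spark 1.6 the unified memory manager uses spark.memory.fraction and spark.memory.storageFraction instead.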

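Spark workload optimization (1/2)
Spark jobs usually have to process large volumes of data quickly on memory-based resources.
- For large data jobs, reducing the amount of data loaded matters most
  - Use an optimal data format: Apache Parquet compressed with Snappy (the default in Spark 2.x)
  - Minimize the scan range: Hive partitions, bucketing, predicate pushdown, partition pruning
  - If data skew appears on specific partitions, reconsider the partition key
- Use the available cluster memory as efficiently as possible
  - Cache data the job uses frequently in memory: dataframe.cache()
- Set Spark configuration values that suit the workload
  - Change the partition size to suit the job type: spark.sql.files.maxPartitionBytes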

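Spark workload optimization (2/2)
- Optimize joins and shuffles, which consume the most resources
  - Shuffling and repartitioning are the most expensive operations, so monitor jobs to keep them to a minimum
  - Since Spark 2.3 the default join is sort-merge (each dataset must be sorted before the join). A broadcast hash join helps when the table sizes differ greatly. Tables under 10 MB are broadcast automatically; the threshold can be changed with spark.sql.autoBroadcastJoinThreshold
  - Join reorder: change the join order by table size to avoid shuffling large data
- Set parameters that suit the cluster size and job type
  - Number of executors: --num-executors
  - Cores per executor: --executor-cores
  - Memory per executor: --executor-memory
  - In a notebook: %%configure {"executorMemory": "3072M", "executorCores": 4, "numExecutors": 10}
- Optimize storage performance
  - Choose HDFS or EMRFS (S3) according to how often and how fast the data must be accessed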

Common Spark memory issues (1/2)
- Java heap out of memory: the number of Spark instances, executor memory, or core count is not configured to handle the volume of data
  WARN TaskSetManager: Loss was due to java.lang.OutOfMemoryError
  java.lang.OutOfMemoryError: Java heap space
- Physical memory exceeded: the Spark executor instance has no memory left for system work such as garbage collection
  Error: ExecutorLostFailure Reason: Container killed by YARN for exceeding limits. 12.4 GB of 12.3 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
  Error: ExecutorLostFailure Reason: Container killed by YARN for exceeding limits. 4.5GB of 3GB physical memory used limits. Consider boosting spark.yarn.executor.memoryOverhead.

Common Spark memory issues (2/2)
- Virtual memory exceeded: the Spark executor instance has no memory left for system work such as garbage collection
  Container killed by YARN for exceeding memory limits. 1.1gb of 1.0gb virtual memory used. Killing container.
- Executor memory exceeded: the memory the Spark executor requests is larger than what YARN can allocate
  Required executor memory (1024+384 MB) is above the max threshold (896 MB) of this cluster! Please check the values of 'yarn.scheduler.maximum-allocation-mb' and/or 'yarn.nodemanager.resource.memory-mb'.
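The "Required executor memory (1024+384 MB)" message compares the executor container request (heap plus overhead) with the YARN limit. The following is a minimal sketch of that check, assuming Spark's documented default overhead of max(384 MB, 10% of executor memory); the function names are hypothetical and not part of any Spark API.

```python
def default_overhead_mb(executor_memory_mb, factor=0.10, minimum_mb=384):
    """Default memory overhead: the larger of 384 MB and 10% of the heap (assumed defaults)."""
    return max(minimum_mb, int(executor_memory_mb * factor))


def container_request_mb(executor_memory_mb, overhead_mb=None):
    """Total memory requested from YARN for one executor container."""
    if overhead_mb is None:
        overhead_mb = default_overhead_mb(executor_memory_mb)
    return executor_memory_mb + overhead_mb


def fits_in_yarn(executor_memory_mb, max_allocation_mb, overhead_mb=None):
    """True when the container request does not exceed the YARN maximum allocation."""
    return container_request_mb(executor_memory_mb, overhead_mb) <= max_allocation_mb


# Values taken from the error message on the slide: 1024 MB heap, 896 MB limit.
print(container_request_mb(1024), fits_in_yarn(1024, 896))
```

With these inputs the request is 1024 + 384 MB, which is above the 896 MB threshold, so either the executor memory must shrink or the YARN limits must grow.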

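Key Spark memory parameters: configuration example
EMR sets defaults through spark-defaults, but those values are generally low, so an application cannot use the full capacity of the cluster without additional tuning.
Spark configuration parameters
- spark.executor.memory: amount of memory for each executor that runs tasks
- spark.executor.cores: number of virtual cores assigned to each executor
- spark.driver.memory: amount of memory for the driver
- spark.driver.cores: number of virtual cores for the driver
- spark.executor.instances: number of executors. Set this parameter unless spark.dynamicAllocation.enabled is true
- spark.default.parallelism: default number of partitions in RDDs returned by transformations such as join, reduceByKey, and parallelize when the user does not set a partition count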

Key Spark memory parameters: configuration example
Environment: an EMR cluster with one r5.12xlarge master (48 vCPU, 384 GB memory) and 19 r5.12xlarge core nodes, processing 10 TB of data stored in S3.
- Assign 5 vCPUs to each executor: spark.executor.cores = 5
- Executors per instance: (48 - 1) / 5 = 9
- Memory per executor: 384 GB / 9 executors = 42 GB (rounded down)
- About 90% of that goes to the executor heap: spark.executor.memory = 42 * 0.9 = 37
- About 10% goes to executor overhead: spark.yarn.executor.memoryOverhead = 42 * 0.1 = 5 (rounded up)
- Driver memory equals executor memory: spark.driver.memory = spark.executor.memory
- Total executors: spark.executor.instances = (9 * 19) - 1 (for the driver) = 170
- Parallelism: spark.default.parallelism = 170 (executors) * 5 (cores) * 2 = 1,700
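The sizing arithmetic above can be sketched as a small Python helper. The function name and the rounding choices (round the heap down, round the overhead up) are assumptions made to reproduce the figures on the slide; this is not an official formula.

```python
import math


def size_executors(vcpus_per_node, mem_gb_per_node, core_nodes, cores_per_executor=5):
    """Derive Spark settings from node size, following the steps on the slide."""
    # The slide subtracts one vCPU per node before dividing by cores per executor.
    executors_per_node = (vcpus_per_node - 1) // cores_per_executor
    # Memory available to each executor on a node, rounded down to whole GB.
    mem_per_executor_gb = mem_gb_per_node // executors_per_node
    return {
        "spark.executor.cores": cores_per_executor,
        "spark.executor.memory": f"{math.floor(mem_per_executor_gb * 0.9)}g",
        "spark.yarn.executor.memoryOverhead": f"{math.ceil(mem_per_executor_gb * 0.1)}g",
        # One executor slot is given up for the driver.
        "spark.executor.instances": executors_per_node * core_nodes - 1,
        "spark.default.parallelism":
            (executors_per_node * core_nodes - 1) * cores_per_executor * 2,
    }


settings = size_executors(vcpus_per_node=48, mem_gb_per_node=384, core_nodes=19)
for key, value in settings.items():
    print(key, value)
```

For other instance types only the three inputs change, which makes it easy to compare candidate cluster shapes before launching them.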

Changing the Spark configuration
There are several ways to change the Spark configuration. Each applies at a different point in time, and they take precedence in the order below, with 1 the highest.
1. Change values at runtime with the SparkConf set functions.
   conf = spark.sparkContext._conf.setAll([('spark.executor.memory', '4g'), ('spark.executor.cores', '4'), ('spark.driver.memory', '4g')])
   spark.sparkContext.stop()
   spark = SparkSession.builder.config(conf=conf).getOrCreate()
2. Pass values through spark-submit.
   ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster --executor-memory 20G \
     --num-executors 50
3. Edit conf/spark-defaults.conf directly.
   spark.executor.memory 18971M
   spark.executor.cores 4
   spark.yarn.executor.memoryOverheadFactor 0.1875

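EMR instance templates and configuration
Spark and YARN configuration parameters can be typed directly into the Edit software settings field of the EMR console, or loaded from a prepared JSON file with the Load JSON from S3 option.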

EMR cluster configuration example
{
  "InstanceGroups": [
    {
      "Name": "AmazonEMRMaster",
      "Market": "ON_DEMAND",
      "InstanceRole": "MASTER",
      "InstanceType": "r5.12xlarge",
      "InstanceCount": 1,
      "Configurations": [
        {
          "Classification": "yarn-site",
          "Properties": {
            "yarn.nodemanager.vmem-check-enabled": "false",
            "yarn.nodemanager.pmem-check-enabled": "false"
          }
        },
        {
          "Classification": "spark",
          "Properties": {
            "maximizeResourceAllocation": "false"
          }
        },
        {
          "Classification": "spark-defaults",
          "Properties": {
            "spark.network.timeout": "800s",
            "spark.executor.heartbeatInterval": "60s",
            "spark.dynamicAllocation.enabled": "false",
            "spark.driver.memory": "21000M",
            "spark.executor.memory": "21000M",
            "spark.executor.cores": "5",
            "spark.executor.instances": "171",

EMR cluster configuration example (continued)
            "spark.memory.fraction": "0.80",
            "spark.memory.storageFraction": "0.30",
            "spark.executor.extraJavaOptions": "-XX:+UseG1GC -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark -XX:InitiatingHeapOccupancyPercent=35 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:OnOutOfMemoryError='kill -9 %p'",
            "spark.driver.extraJavaOptions": "-XX:+UseG1GC -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark -XX:InitiatingHeapOccupancyPercent=35 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:OnOutOfMemoryError='kill -9 %p'",
            "spark.yarn.scheduler.reporterThread.maxFailures": "5",
            "spark.storage.level": "MEMORY_AND_DISK_SER",
            "spark.rdd.compress": "true",
            "spark.shuffle.compress": "true",
            "spark.shuffle.spill.compress": "true",
            "spark.default.parallelism": "3400"
          }

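Spark on EMR performance optimization: basic tips
- Choose an instance type whose memory suits the data size and job size (for the same vCPU count, m-type instances have about twice the memory of c-type, and r-type about twice that of m-type)
- Use a recent EMR release (5.24.0 at minimum). Releases from 5.28 include EMR Runtime for Apache Spark, which improves performance
- For further gains, adjust the memory settings to the characteristics of the workload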

Spark on EMR performance optimization (1/5)
Dynamic Partition Pruning: selects the partitions a query needs more precisely, which reduces the data read from storage and processed, saving time and resources.
Spark property: spark.sql.dynamicPartitionPruning.enabled (available from EMR 5.24, enabled by default from 5.26)
In the query below, the range of partitioned data is narrowed dynamically: only the partitions for regions that satisfy the WHERE clause (North America) are read and processed.
  select ss.quarter, ss.region, ss.store, ss.total_sales
  from store_sales ss, store_regions sr
  where ss.region = sr.region and sr.country = 'North America'
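The idea behind dynamic partition pruning can be illustrated with plain Python: filter the small dimension table first, then read only the fact-table partitions whose key survived the filter. This is a toy model, not how Spark implements it; the region names and rows are made-up sample data.

```python
# Fact table stored as one partition (list of rows) per region. Hypothetical data.
store_sales = {
    "NA-East": [{"store": "s1", "total_sales": 100}, {"store": "s2", "total_sales": 80}],
    "NA-West": [{"store": "s3", "total_sales": 120}],
    "EU-West": [{"store": "s4", "total_sales": 90}],
    "AP-South": [{"store": "s5", "total_sales": 60}],
}

# Small dimension table that carries the filter column.
store_regions = [
    {"region": "NA-East", "country": "North America"},
    {"region": "NA-West", "country": "North America"},
    {"region": "EU-West", "country": "Europe"},
    {"region": "AP-South", "country": "Asia"},
]


def pruned_scan(fact_partitions, dim_rows, country):
    """Return the partitions actually scanned and the rows read from them."""
    # Step 1: evaluate the filter on the dimension table at run time.
    wanted = {row["region"] for row in dim_rows if row["country"] == country}
    # Step 2: skip every fact partition whose key is not in the filtered set.
    scanned = [key for key in fact_partitions if key in wanted]
    rows = [row for key in scanned for row in fact_partitions[key]]
    return scanned, rows


scanned, rows = pruned_scan(store_sales, store_regions, "North America")
print(scanned, len(rows))
```

Only two of the four partitions are touched here, which is the saving the slide describes.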

Spark on EMR performance optimization (2/5)
Flattening Scalar Subqueries: rewrites several subqueries as one to improve performance.
Spark property: spark.sql.optimizer.flattenScalarSubqueriesWithAggregates.enabled (available from EMR 5.24, enabled by default from 5.26)
Multiple scalar subqueries over the same relation are merged into a single query, so the relation is scanned once.
  /* Sample query */
  select (select avg(age) from students               /* Subquery 1 */
          where age between 5 and 10) as group1,
         (select avg(age) from students               /* Subquery 2 */
          where age between 10 and 15) as group2,
         (select avg(age) from students               /* Subquery 3 */
          where age between 15 and 20) as group3
  /* Optimized query */
  select c1 as group1, c2 as group2, c3 as group3
  from (select avg(if(age between 5 and 10, age, null)) as c1,
               avg(if(age between 10 and 15, age, null)) as c2,
               avg(if(age between 15 and 20, age, null)) as c3
        from students);
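To see that the rewrite preserves the result, the two query shapes can be compared on a small table. The sketch below uses Python's built-in sqlite3 module rather than Spark, so `if(cond, age, null)` is written as a `CASE` expression; the ages are made-up sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (age INTEGER)")
conn.executemany(
    "INSERT INTO students VALUES (?)",
    [(age,) for age in (5, 7, 10, 12, 15, 18, 20, 30)],
)

# Three scalar subqueries: the table is scanned three times.
ORIGINAL = """
SELECT (SELECT avg(age) FROM students WHERE age BETWEEN 5 AND 10)  AS group1,
       (SELECT avg(age) FROM students WHERE age BETWEEN 10 AND 15) AS group2,
       (SELECT avg(age) FROM students WHERE age BETWEEN 15 AND 20) AS group3
"""

# Flattened form: one scan with conditional aggregates (avg ignores NULL).
FLATTENED = """
SELECT c1 AS group1, c2 AS group2, c3 AS group3
FROM (SELECT avg(CASE WHEN age BETWEEN 5 AND 10  THEN age END) AS c1,
             avg(CASE WHEN age BETWEEN 10 AND 15 THEN age END) AS c2,
             avg(CASE WHEN age BETWEEN 15 AND 20 THEN age END) AS c3
      FROM students)
"""

original_row = conn.execute(ORIGINAL).fetchone()
flattened_row = conn.execute(FLATTENED).fetchone()
print(original_row == flattened_row)
```

The equivalence relies on avg skipping NULL values, which is why rows outside a range do not distort that group's average.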

Spark on EMR performance optimization (3/5)
DISTINCT Before INTERSECT: an INTERSECT is converted automatically into a left semi join, and the DISTINCT operation is pushed down to the children of the INTERSECT to improve performance.
Spark property: spark.sql.optimizer.distinctBeforeIntersect.enabled (available from EMR 5.24, enabled by default from 5.26)
  /* Sample query */
  (select item.brand brand from store_sales, item
   where store_sales.item_id = item.item_id)
  intersect
  (select item.brand cs_brand from catalog_sales, item
   where catalog_sales.item_id = item.item_id)
  /* Optimized query */
  select brand from
    (select distinct item.brand brand from store_sales, item
     where store_sales.item_id = item.item_id)
  left semi join
    (select distinct item.brand cs_brand from catalog_sales, item
     where catalog_sales.item_id = item.item_id)
  on brand <=> cs_brand

Spark on EMR performance optimization (4/5)
Bloom Filter Join: a Bloom filter built in advance narrows the range of data the query has to process, which improves performance.
Spark property: spark.sql.bloomFilterJoin.enabled (available from EMR 5.24, enabled by default from 5.26)
In the query below, the sales table is filtered before the join down to rows whose item.category is 1, 10, or 16, which can greatly improve join performance.
  select count(*) from sales, item
  where sales.item_id = item.id and item.category in (1, 10, 16)
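A Bloom filter is a compact bit set that answers "possibly present" or "definitely absent". The following minimal Python sketch shows how such a filter, built from the filtered item ids, can discard sales rows before the join. The class, the hashing scheme, and the sample rows are illustrative assumptions, not Spark's internal implementation.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter backed by a Python int used as a bit array."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, key):
        # Derive several bit positions per key by salting the hash input.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False means definitely absent; True may be a false positive.
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


# Hypothetical sample tables.
items = [{"id": 1, "category": 1}, {"id": 2, "category": 5}, {"id": 3, "category": 16}]
sales = [{"item_id": i % 4 + 1} for i in range(12)]  # item ids 1..4, three rows each

# Build the filter from the small, filtered side of the join.
matching_ids = {item["id"] for item in items if item["category"] in (1, 10, 16)}
bloom = BloomFilter()
for item_id in matching_ids:
    bloom.add(item_id)

# Pre-filter the large side, then do the exact join check on the survivors.
candidates = [row for row in sales if bloom.might_contain(row["item_id"])]
count = sum(1 for row in candidates if row["item_id"] in matching_ids)
print(count)
```

Because a Bloom filter never produces false negatives, the final count is unaffected; false positives only mean a few extra rows reach the exact join check.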

Spark on EMR performance optimization (5/5)
Optimized Join Reorder: reorders the tables listed in a query according to their filters and data sizes so that smaller joins run first.
Spark property: spark.sql.optimizer.sizeBasedJoinReorder.enabled (available from EMR 5.24, enabled by default from 5.26)
By default, Spark joins the tables in a query from left to right in the order written. In the query below the written join order is store_sales, store_returns, store, item:
  select ss.item_value, sr.return_date, s.name, i.desc
  from store_sales ss, store_returns sr, store s, item i
  where ss.id = sr.id and ss.store_id = s.id and ss.item_id = i.id and s.country = 'USA'
The actual execution order is:
1. store_sales with store (because store has the country filter)
2. store_returns
3. item
If a filter is added on item, item can be reordered to join before store_returns.

Improving Spark performance with S3 (1/2)
Use the EMRFS S3-optimized committer
Spark property: spark.sql.parquet.fs.optimized.committer.optimization-enabled (available from EMR 5.19, enabled by default from 5.20)
- Applies when writing Parquet from Spark SQL, DataFrames, or Datasets with S3 multipart upload enabled (the default)
- Test environment: EMR 5.19 (master m5d.2xlarge, core nodes m5d.2xlarge x 8)
- Input data: 15 GB (100 Parquet files)
  INSERT OVERWRITE DIRECTORY 's3://${bucket}/perf-test/${trial_id}'
  USING PARQUET SELECT * FROM range(0, ${rows}, 1, ${partitions});
[Chart labels: 60% performance improvement, 80% performance improvement]

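Improving Spark performance with S3 (2/2)
Push data filtering down to S3 with S3 Select
- Spark, Presto, and Hive can pre-filter large datasets at the S3 level through S3 Select (available from EMR 5.17)
- Supports CSV, JSON, and Parquet files, and Bzip2, Gzip, and Snappy compression
- Useful for simple column-based filtering in a WHERE clause; the filter must be stated explicitly in the query
- Aggregate functions and filters that involve type casts are not pushed down to S3
- Declare the table and file format as follows, then query with an ordinary WHERE clause
  CREATE TEMPORARY VIEW MyView (number INT, name STRING)
  USING s3selectCSV
  OPTIONS (path "s3://path/to/my/datafiles", header "true", delimiter "\t");
  SELECT * FROM MyView WHERE number > 10;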

EMR performance optimization for Spark
A performance-optimized Spark runtime is included in EMR: 2.6x the performance at 1/10 the price.
- Built-in runtime that speeds up Spark workloads
- Best performance
  - 2.6x faster than the earlier release without the runtime
  - 1.6x faster than the Spark package offered by a 3rd party
- Cost efficiency: 1/10 the price of the Spark package offered by a 3rd party
- 100% compatible with the open-source Apache Spark API
[Chart: Runtime total on 104 queries (seconds, lower is better)]
  Spark with EMR (without runtime): 26,478
  3rd party managed Spark (with their runtime): 16,478
  Spark with EMR (with runtime): 10,164
* Based on TPC-DS 3 TB benchmarking on 6-node c4.8xlarge clusters with EMR 5.28 and Spark 2.4

Better performance and lower cost with the latest EMR release
[Chart: Improvements since last year (minutes)]
  Runtime for 102 TPC-DS queries: EMR 5.16 with Spark 2.4 = 427.68, EMR 5.28 with Spark 2.4 = 169.41 (2.5x)
  Geomean for 104 TPC-DS queries: EMR 5.16 with Spark 2.4 = 113.13, EMR 5.28 with Spark 2.4 = 46.28 (2.4x)

An average 5x speedup for long-running queries
[Chart: speedup per query; x-axis: query number, y-axis: speedup from 0.5x to 35.5x]

Cost savings and performance gains by EMR release
[Chart: runtime (hours) by EMR release 5.16, 5.24, 5.25, 5.26, 5.27, 5.28; speedup labels on the chart: 1.74x, 2.00x, 2.25x, 2.27x, 2.43x]

How the runtime optimizes Spark
- Optimal configuration values for running Spark jobs: CPU/disk ratios, driver/executor conf, heap/GC, native overheads, instance defaults
- Optimized query plan generation: dynamic partition pruning, join reordering
- Optimized query execution: data pre-fetch and more
- Job startup settings: eager executor allocation and more

The data update problem in a production data lake
[Diagram: MySQL database -> data lake on Amazon S3]

  Order ID  Quantity  Date
  001       10        01/01/2019
  001       15        01/02/2019
  002       20        01/01/2019
  002       20        01/02/2019

  Action  Order ID  Quantity  Date
  I       001       10        01/01/2019
  U       001       15        01/02/2019
  I       002       20        01/01/2019
  D       002       20        01/02/2019
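Reading the Action column as insert (I), update (U), and delete (D), the change records have to be replayed in order to recover the current state of the source table. A minimal Python sketch, assuming Order ID is the record key; the function name is hypothetical.

```python
# Change records from the slide: (action, order_id, quantity, date).
changes = [
    ("I", "001", 10, "01/01/2019"),
    ("U", "001", 15, "01/02/2019"),
    ("I", "002", 20, "01/01/2019"),
    ("D", "002", 20, "01/02/2019"),
]


def apply_changes(change_log):
    """Replay insert/update/delete records in order and return the current rows."""
    table = {}
    for action, order_id, quantity, date in change_log:
        if action in ("I", "U"):
            # Insert and update both leave the latest version of the row.
            table[order_id] = {"quantity": quantity, "date": date}
        elif action == "D":
            table.pop(order_id, None)
        else:
            raise ValueError(f"unknown action: {action}")
    return table


current = apply_changes(changes)
print(current)
```

On immutable files in S3, every such update or delete means rewriting the affected files, which is the problem the following slides address.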

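Uber's real-world use case: incremental update
An incremental update touches files across many time ranges and causes a huge amount of I/O.
[Diagram: partitions for Past Years, Last Month, Last Week, Yesterday, and Today; legend: New Files, Unaffected Files, Updated Files, Files Affected]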

Uber's real-world use case: cascading effects
[Diagram: an update to the raw table propagates through ETL to Table A and through another ETL to Table B; legend: New Data, Unaffected Data, Updated Data]

The slow data lake problem
- To bring the roughly 500 GB updated in HBase each day into the data lake, the full 120 TB is actually processed, which takes 8 hours
- 120 TB HBase table ingested every 8 hours; actual change < 500 GB
- Full recompute every 6-8 hours
[Diagram: updated/created rows from databases and streaming data flow into raw tables in the data lake (Amazon S3); big batch jobs produce derived tables]

Apache Hudi (Hadoop upserts and incrementals)
- A Spark-based data management layer
- Compatible with S3 and the Hive metastore
- Data can be queried through Spark SQL, Hive, and Presto
- Offers interfaces such as the Hudi CLI, DeltaStreamer, and Hive Catalog Sync

Apache Hudi (incubating) is a data abstraction layer
[Diagram: Queries, Hudi Spark Data Source]

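Two storage types for different workloads
- Copy On Write (read heavy)
  - Optimized for read performance
  - Relatively predictable, small-scale workloads
- Merge On Read (write heavy)
  - Data changes are usable immediately
  - Advanced mode
  - Can adapt to changes in the workload
[Diagram: Hudi Dataset]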

Hudi storage types and views: Copy on Write
Storage type: Copy On Write
Views/queries: Read-Optimized, Incremental
[Diagram: records A and B in File 0 and C and D in File 1; after a write, new versions of the affected files and a new File 2 appear]
When to use it?
- The current job already rewrites the whole table or partition to update data
- The workload grows at a fairly steady rate without sudden peaks
- The data is already stored as Parquet files
- Operational requirements need to be as simple as possible

Hudi storage types and views: Merge on Read
Storage type: Merge On Read
Views/queries: Read Optimized, Incremental, Real Time
[Diagram: changes are first appended to log files (Log 0, Log 1, Log 2) and later merged into the base files (File 0, File 1, File 2)]
When to use it?
- Data must be queryable as soon as possible after ingestion
- The workload or query pattern can change suddenly
- Example: bulk updates of database changes that require modifying large amounts of existing S3 partition data

Hudi Dataset sample code
- Settings for storing a Hudi dataset: storage type, record key, partition key
- Save to S3 in Hudi dataset format: bulk insert
- Load the data stored in Hudi dataset format and query it with Spark SQL
- To update or delete records, build a DataFrame of the changes and append it
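The code from these slides did not survive extraction. As a stand-in, here is a hedged PySpark-style sketch of the steps they describe. The option keys are Hudi datasource write options as I understand them for the 0.5.x releases (newer releases rename `hoodie.datasource.write.storage.type` to `hoodie.datasource.write.table.type`); the table name, column names, and helper functions are hypothetical, and the DataFrame and SparkSession are passed in by the caller.

```python
def hudi_options(table_name, record_key, partition_key, precombine_key,
                 operation, storage_type="COPY_ON_WRITE"):
    """Build write options: storage type, record key, partition key, operation."""
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.storage.type": storage_type,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.partitionpath.field": partition_key,
        # Field used to pick the latest version when keys collide.
        "hoodie.datasource.write.precombine.field": precombine_key,
        "hoodie.datasource.write.operation": operation,
    }


def write_hudi(df, base_path, options, mode="append"):
    """Write a Spark DataFrame as a Hudi dataset (df is a pyspark DataFrame)."""
    df.write.format("org.apache.hudi").options(**options).mode(mode).save(base_path)


def read_hudi(spark, base_path):
    """Load a Hudi dataset; older releases expect a glob such as base_path + '/*/*'."""
    return spark.read.format("org.apache.hudi").load(base_path)


# Initial load, then later changes appended as upserts. Names are placeholders.
bulk_insert_opts = hudi_options("orders", "order_id", "order_date", "updated_at", "bulk_insert")
upsert_opts = hudi_options("orders", "order_id", "order_date", "updated_at", "upsert")
print(bulk_insert_opts["hoodie.datasource.write.operation"])
```

A typical flow would call `write_hudi(initial_df, path, bulk_insert_opts, mode="overwrite")` once, then `write_hudi(changes_df, path, upsert_opts)` for each batch of changes, and register the result of `read_hudi` as a temporary view to query it with Spark SQL.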

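Benefits of using Hudi datasets
- Technical support from the Apache Hudi open-source community
- Supported by Spark, Hive, and Presto
- Makes the following possible in a data lake:
  a) Compliance with data privacy laws
  b) Efficient use of real-time stream data and change data capture (CDC)
  c) Management of frequently changing data
  d) Change history management and rollback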

EMR Managed resize

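EMR Managed resize (beta)
- Manages EMR cluster resizing automatically
- Only the minimum and maximum number of nodes need to be specified; no other settings are required
- Scales out quickly, in about one minute, based on monitoring data
- Saves 20-60% of total cost, depending on the workload
- Clusters can still be run with the existing auto scaling approach
- Users can either configure scaling themselves with custom metrics or automate it with the Managed resize option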

Off-cluster persistent Spark History Service

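Configuring EMR application logs off-cluster
Amazon EMR and Hadoop produce cluster state and application log files, which are written to the master node by default. They can be viewed in the following ways:
- Connect to the master node over SSH and check the log files by job type under /mnt/var/log
- Use the Spark history server UI from the EMR console
- Have the logs saved automatically to a specified S3 location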

Viewing EMR application logs
When creating an EMR cluster, the master node logs can be configured to be saved to S3, and the saved logs can then be explored with Athena.
  CREATE EXTERNAL TABLE `myemrlogs`(
    `data` string COMMENT 'from deserializer')
  ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '|'
  LINES TERMINATED BY '\n'
  STORED AS INPUTFORMAT
    'org.apache.hadoop.mapred.TextInputFormat'
  OUTPUTFORMAT
    'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
  LOCATION
    's3://aws-logs-123456789012-us-west-2/elasticmapreduce/j-2ABCDE34F5GH6'
Example: query the NameNode application logs for ERROR, WARN, INFO, EXCEPTION, FATAL, or DEBUG
  SELECT "data", "$PATH" AS filepath
  FROM "default"."myemrlogs"
  WHERE regexp_like("$PATH", 'namenode')
    AND regexp_like("data", 'ERROR|WARN|INFO|EXCEPTION|FATAL|DEBUG')
  limit 100;
Example: query the hadoop-mapreduce partition for job job_1561661818238_0004 and Failed Reduces
  SELECT data, "$PATH"
  FROM "default"."mypartitionedemrlogs"
  WHERE logtype = 'hadoop-mapreduce'
    AND regexp_like(data, 'job_1561661818238_0004|Failed Reduces')
  limit 100;
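The log-level filter works only when the level names are regex alternatives separated by `|`, because `regexp_like` tests whether the pattern occurs anywhere in the string. The same behaviour can be checked locally with Python's `re.search`; the log lines below are made-up samples.

```python
import re

# Alternation: a line matches if it contains any one of the level names.
LEVELS = re.compile(r"ERROR|WARN|INFO|EXCEPTION|FATAL|DEBUG")

# Without the separators the pattern is one long literal phrase.
UNSEPARATED = re.compile(r"ERROR WARN INFO EXCEPTION FATAL DEBUG")

log_lines = [
    "2019-06-27 12:00:01 INFO namenode.NameNode: startup complete",
    "2019-06-27 12:00:05 WARN hdfs.StateChange: block under-replicated",
    "plain line without a log level",
]

matched = [line for line in log_lines if LEVELS.search(line)]
matched_unseparated = [line for line in log_lines if UNSEPARATED.search(line)]
print(len(matched), len(matched_unseparated))
```

The unseparated pattern matches nothing in ordinary log output, since no single line contains all six level names in sequence.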

Secure all your data with a single setup
[Diagram: an admin defines permissions in Lake Formation, which applies them through the data catalog to data in Amazon S3 for Amazon Athena, Amazon EMR, Amazon Redshift, and AWS Glue]

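Data management with Lake Formation
- Fine-grained permission management down to the column level of a table
- A metastore integrated with the AWS Glue Data Catalog
- Integrated authentication with internal identity management systems (AD, Auth0, Okta)
- Management supported through SAML 2.0 files
- Supported in a range of applications
  - Spark SQL
  - EMR Notebooks and Zeppelin with Livy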

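Docker support in Hadoop 3.0
Supported from Hadoop 3.1.0 and Spark 2.4.3; support in EMR starts with EMR 6.0.0.
Using Docker on EMR brings the following benefits:
- Less complexity: bundled libraries and application dependencies are managed for you
- Better utilization: multiple versions of EMR or of an application can run on the same cluster
- More agility: new software versions can be tested and put into production quickly
- Application portability: applications run on multiple operating systems without changes to the user's operating environment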

Docker registry options
- Public subnet: YARN can deploy from a public repository of your choice, such as Docker Hub, over the internet
- Private subnet: YARN can deploy from an Amazon ECR repository over AWS PrivateLink

Configuring an EMR cluster with Docker
Create a container-executor.json file as follows, then start an EMR 6.0.0 (beta) cluster from the CLI.
  [
    {
      "Classification": "container-executor",
      "Configurations": [
        {
          "Classification": "docker",
          "Properties": {
            "docker.trusted.registries": "local,centos,your-public-repo,123456789123.dkr.ecr.us-east-1.amazonaws.com",
            "docker.privileged-containers.registries": "local,centos,your-public-repo,123456789123.dkr.ecr.us-east-1.amazonaws.com"
          }
        }
      ]
    }
  ]

  $ aws emr create-cluster \
      --name "EMR-6-Beta Cluster" \
      --region $REGION \
      --release-label emr-6.0.0-beta \
      --applications Name=Hadoop Name=Spark \
      --service-role EMR_DefaultRole \
      --ec2-attributes KeyName=$KEYPAIR,InstanceProfile=EMR_EC2_DefaultRole,SubnetId=$SUBNET_ID \
      --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=$INSTANCE_TYPE \
                        InstanceGroupType=CORE,InstanceCount=2,InstanceType=$INSTANCE_TYPE \
      --configuration file://container-executor.json