Hadoop's 10th Anniversary and the Arrival of Hadoop 3.0 (Dongjin Seo)

Transcription:

Hadoop 10th Birthday and Hadoop 3 Alpha
Dongjin Seo
Cloudera Korea, SE

Agenda
Ⅰ. Hadoop 10th Birthday
Ⅱ. Hadoop 3 Alpha

Apache Hadoop at 10

Apache Hadoop's Timeline

The Invention Years (2002 ~ 2004)
- [2002] Doug Cutting and Mike Cafarella create Nutch, an open source web crawler
- [2003] Google publishes its Google File System paper
- [2004] Cutting & Cafarella implement Nutch features that will become HDFS; Google publishes its MapReduce paper

The Incubation Years (2005 ~ 2007)
- [2005] Cafarella spearheads an implementation of MapReduce in Nutch
- [2006] Cutting joins Yahoo!; starts Hadoop subproject by carving code from Nutch; first Apache release of Hadoop
- [2007] First Hadoop User Group meeting; community contributions begin to rise steeply

The Coming-Out Years (2008 ~ 2009)
- [2008] Hadoop becomes a Top Level ASF project; Yahoo! launches world's largest Hadoop application; Hive, Hadoop's first SQL framework, becomes a Hadoop sub-project; Cloudera, first company to commercialize Hadoop, is founded; initial Apache release of [...]
- [2009] Cutting joins Cloudera as its chief architect

The Rapid Adoption Years (2010 ~ 2015)
- [2010-11] The extended Hadoop community busily builds out a plethora of new components (Crunch, Sqoop, Flume, Oozie, etc)
- [2012-15] HDFS NameNode HA, YARN, significant new features for enterprise adoption; Impala joins the ecosystem; Spark becomes a Top Level ASF project; Kudu, the first native storage option for Hadoop since [...], joins the ASF Incubator

Evolution of the Hadoop Platform
The stack is continually evolving and growing!
(Figure: the component stack by year, 2006 to 2015. It starts from core Hadoop (HDFS, MapReduce) and grows cumulatively: Hive, Mahout; then Sqoop, Avro; then Flume, Bigtop, Oozie, HCatalog, Hue; then YARN, Spark, Tez, Impala, Kafka, Drill; then Parquet, Sentry; then Knox, Flink; then Kudu, RecordService, Ibis, Falcon.)

Why Did Hadoop Succeed?
- Open source community and license: A large and diverse community of developers has historically made, and continues to make, the Hadoop ecosystem among the most active and engaged in history, while the Apache License lowers the barrier to entry for users.
- Extensibility/adaptability: With the possible exception of Linux, no other complex platform has evolved on so many levels, and so quickly, to meet user requirements over time.
- A strong focus on systems: The roots of Hadoop are in making distributed computing infrastructure more accessible to application developers. That focus continues to bear fruit in areas like resource management and security.

Apache Hadoop 3 Alpha Release!!
3.0.0-alpha1. Release date: 03 September 2016. Changes: about 3,000.

Major new changes
- HDFS Erasure Coding (HDFS-EC)
- YARN Timeline Service v.2
- Shell Script Rewrite
- MapReduce task-level native optimization
- Support for Multiple Standby NameNodes
- Java 8 Minimum Runtime Version
- New Default Ports for Several Services
- Intra-DataNode Balancer
- Reworked daemon and task heap management

Apache Hadoop 3 Alpha Major New Changes: HDFS Erasure Coding (HDFS-EC)

What is erasure coding?
- A data-protection technique for fault tolerance, commonly used in RAID-5.
- On write, an EC codec encodes the data into uniformly sized data cells and parity cells.
- On read, any cell lost from an EC group (a set of data cells and parity cells) is rebuilt from the cells that remain in that group. This decoding step recovers the original data.

HDFS-EC
- Uses Intel ISA-L (Intelligent Storage Acceleration Library), which implements the Reed-Solomon algorithm for EC, to improve storage performance, throughput, security, and reliability.
- Basic EC is built on Exclusive-OR (XOR) parity, which cannot survive multiple failures. The Reed-Solomon algorithm is applied so that multiple failures can be tolerated.
- Built jointly by Cloudera, Intel, and the Hadoop community.
- A policy is applied to each individual directory with the hdfs erasurecode -setpolicy command.

Why erasure coding is needed
- Keeping 3 replicas handles failures well, but the cost is high (200% overhead in storage space and other resources, e.g. network bandwidth when writing the data).
- In normal operations the additional replicated blocks are rarely accessed.

Effects of erasure coding
- Fault tolerance is guaranteed with less capacity.
- Storage cost drops by about 50% compared with 3x replication.
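The encode/decode idea and the storage arithmetic on this slide can be illustrated with a minimal sketch. It uses plain XOR parity, not the Reed-Solomon codec that HDFS-EC actually relies on, so it only shows why a single lost cell is recoverable and why a second loss is not. The function names are hypothetical, and the 6 data + 3 parity layout used in the overhead calculation is an assumption chosen because it matches the roughly 50% saving quoted above.

```python
from functools import reduce


def xor_cells(cells):
    """XOR equal-length byte cells together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*cells))


def encode(data_cells):
    """Return an EC group: the data cells followed by one XOR parity cell."""
    return list(data_cells) + [xor_cells(data_cells)]


def recover(group):
    """Rebuild the single missing cell (marked None) from the surviving cells."""
    missing = [i for i, cell in enumerate(group) if cell is None]
    if len(missing) != 1:
        # XOR parity carries enough information for one lost cell only.
        raise ValueError("XOR parity can rebuild exactly one lost cell")
    survivors = [cell for cell in group if cell is not None]
    repaired = list(group)
    repaired[missing[0]] = xor_cells(survivors)
    return repaired


def storage_overhead(data_units, extra_units):
    """Extra storage as a fraction of the original data size."""
    return extra_units / data_units


group = encode([b"AAAA", b"BBBB", b"CCCC"])
damaged = list(group)
damaged[1] = None  # simulate one lost data cell
print(recover(damaged) == group)

# 3x replication: 1 original + 2 extra copies.
print(storage_overhead(1, 2))
# Assumed layout of 6 data cells + 3 parity cells.
print(storage_overhead(6, 3))
```

Under these assumptions the total footprint is 3.0x the data size for replication and 1.5x for the 6+3 layout, which is where a saving of about half comes from. Tolerating more than one lost cell per group needs a code such as Reed-Solomon, which this sketch does not implement.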

Apache Hadoop 3 Alpha Major New Changes: YARN Timeline Service v.2, Shell Script Rewrite

YARN Timeline Service v.2
- A service that exposes container-related information such as events and metrics, together with application information about map and reduce tasks, through a web UI.
1) Improving scalability and reliability of Timeline Service
- Adopts a highly scalable architecture with distributed storage (e.g. [...]) to improve reliability.
2) Enhancing usability by introducing flows and aggregation
- Provides a logical flow for each stage of a YARN application.
- Supports aggregation of metrics at the flow level.

Shell Script Rewrite
- The Hadoop shell scripts were rewritten to fix long-standing bugs and to add new features.
- The new scripts are not fully compatible with the previous versions, so users who rely on Hadoop environment variables or shell commands need to review how they are affected.
more info:
https://issues.apache.org/jira/browse/hadoop-9902
https://issues.apache.org/jira/secure/attachment/12599817/more-info.txt

Apache Hadoop 3 Alpha Major New Changes: MapReduce task-level native optimization

What is NativeTask?
- A native computing unit focused on data processing: a high-performance C++ API and runtime for Hadoop MapReduce.

Why NativeTask is needed
- I/O bottleneck. Most Hadoop workloads are data intensive, so if no compression is used for input, mid-output, and output, I/O (disk, network) could be a bottleneck.
- Inefficient implementation (map side sort, serialization/deserialization, shuffle, data locality, scheduling and starting overhead).
- Inflexible programming paradigm, which limits its performance.

Effects of NativeTask
- Applying NativeTask to the map output collector improves the performance of shuffle-intensive jobs by 30% or more.
(Figure: Java side (to bypass normal Java data flow) <-> JNI <-> native side (actual computation).)
more info: https://issues.apache.org/jira/browse/mapreduce-2841

Apache Hadoop 3 Alpha Major New Changes: Multiple Standby NameNodes, Java 8

Support for Multiple Standby NameNodes
- The previous architecture allowed only a single active NameNode and a single standby NameNode. It can now be configured with multiple standby NameNodes.
- Provides a higher level of High-Availability.
- Do not run more than 5 NameNodes; 3 are recommended, because of communication overheads.
more info: http://hadoop.apache.org/docs/r3.0.0-alpha1/hadoop-project-dist/hadoophdfs/hdfshighavailabilitywithqjm.html

Java 8 Minimum Runtime Version
- Oracle's official support for Java 7 ended in April 2015, which left no choice but to move to Java 8.
- As a result, the minimum Java version for Hadoop 3 is Java 8.

Apache Hadoop 3 Alpha Major New Changes: New Default Ports, Intra-DataNode Balancer

New Default Ports for Several Services
- The ports of the NameNode, Secondary NameNode, DataNode, and KMS were changed to avoid bind errors when Hadoop starts.
- This is expected to make rolling restarts on large clusters more reliable.
- NameNode ports: 50470 -> 9871, 50070 -> 9870, 8020 -> 9820
- Secondary NameNode ports: 50091 -> 9869, 50090 -> 9868
- DataNode ports: 50020 -> 9867, 50010 -> 9866, 50475 -> 9865, 50075 -> 9864
- KMS port: 16000 -> 9600
more info:
https://issues.apache.org/jira/browse/hdfs-9427
https://issues.apache.org/jira/browse/hadoop-12811

Intra-DataNode Balancer
- Adding or replacing disks leaves data unevenly distributed across the disks inside a DataNode. The HDFS command (diskbalancer) resolves this imbalance.
more info: http://hadoop.apache.org/docs/r3.0.0-alpha1/hadoop-project-dist/hadoophdfs/hdfscommands.html
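When reviewing scripts or client configurations that hard-code the old defaults, the port changes listed on this slide can be kept as a small lookup table. The sketch below is only an illustration: the port numbers are copied from the slide, while the helper names and the host:port rewriting are hypothetical and not part of any Hadoop tool.

```python
# Default port changes as listed for 3.0.0-alpha1 (old -> new).
PORT_CHANGES = {
    "NameNode": {50470: 9871, 50070: 9870, 8020: 9820},
    "Secondary NameNode": {50091: 9869, 50090: 9868},
    "DataNode": {50020: 9867, 50010: 9866, 50475: 9865, 50075: 9864},
    "KMS": {16000: 9600},
}


def new_default_port(old_port):
    """Return (service, new_port) if old_port is a changed default, else None."""
    for service, mapping in PORT_CHANGES.items():
        if old_port in mapping:
            return service, mapping[old_port]
    return None


def rewrite_address(address):
    """Rewrite 'host:port' when the port is one of the changed defaults.

    Hypothetical helper: addresses without a host or with a non-numeric
    port are returned unchanged.
    """
    host, _, port = address.rpartition(":")
    if not host or not port.isdigit():
        return address
    hit = new_default_port(int(port))
    return f"{host}:{hit[1]}" if hit else address


print(new_default_port(50070))
print(rewrite_address("nn1.example.com:8020"))
```

A port that is not in the table, such as a site-specific override, passes through untouched, so the helper only flags values that were relying on the old defaults.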

Apache Hadoop 3 Alpha Major New Changes: Reworked daemon and task heap management

Reworked daemon and task heap management
- Adds automatic tuning based on the memory size of the host.
- HADOOP_HEAPSIZE is deprecated.
- Configuring the map/reduce heap size is simpler than before: the required heap size no longer has to be spelled out in the task configuration or in Java options.
more info:
https://issues.apache.org/jira/browse/hadoop-10950
https://issues.apache.org/jira/browse/mapreduce-5785

Conclusion
- The major changes to Hadoop's core components (HDFS, MapReduce) are a stepping stone toward another big advance.
  - Better use of HDFS storage space -> higher ROI, lower TCO -> a lower barrier to initial adoption.
  - Multiple standby NameNodes and heap tuning -> easier operations -> more opportunity to redirect operations spending toward cluster expansion and improvement.
- Changes are expected in the components that build on the new Hadoop.
  - Faster Hive queries?
  - Higher throughput for [...]?
  - More efficient data compression?
  - etc.
- Changes this large are possible because of the attention and affection of the community and its members.

Thank you
djseo@cloudera.com