Hadoop 10th Birthday and Hadoop 3 Alpha
Dongjin Seo
Cloudera Korea, SE

Agenda
Ⅰ. Hadoop 10th Birthday
Ⅱ. Hadoop 3 Alpha

Apache Hadoop at 10

Apache Hadoop's Timeline

The Invention Years (2002 ~ 2004)
[2002]
- Doug Cutting and Mike Cafarella create Nutch, an open source web crawler
[2003]
- Google publishes its Google File System paper
[2004]
- Cutting & Cafarella implement Nutch features that will become HDFS
- Google publishes its MapReduce paper

The Incubation Years (2005 ~ 2007)
[2005]
- Cafarella spearheads an implementation of MapReduce in Nutch
[2006]
- Cutting joins Yahoo!; starts Hadoop subproject by carving code from Nutch
- First Apache release of Hadoop
[2007]
- First Hadoop User Group meeting
- Community contributions begin to rise steeply

The Coming-Out Years (2008 ~ 2009)
[2008]
- Hadoop becomes a Top Level ASF project
- Yahoo! launches world's largest Hadoop application
- Hive, Hadoop's first SQL framework, becomes a Hadoop sub-project
- Cloudera, first company to commercialize Hadoop, is founded
- Initial Apache release of
[2009]
- Cutting joins Cloudera as its chief architect

The Rapid Adoption Years (2010 ~ 2015)
[2010-11]
- The extended Hadoop community busily builds out a plethora of new components (Crunch, Sqoop, Flume, Oozie, etc)
[2012-15]
- HDFS NameNode HA, YARN, significant new features for enterprise adoption
- Impala joins the ecosystem
- Spark becomes a Top Level ASF project
- Kudu, the first native storage option for Hadoop since, joins the ASF Incubator

Evolution of the Hadoop Platform
The stack is continually evolving and growing!

[Figure: components of the Hadoop stack by year, 2006 to 2015. Starting from core Hadoop (HDFS, MapReduce), successive columns add, in order:]
- Hive, Mahout
- Sqoop, Avro
- Flume, Bigtop, Oozie, HCatalog, Hue
- YARN, Spark, Tez, Impala, Kafka, Drill
- Parquet, Sentry
- Knox, Flink
- Kudu, RecordService, Ibis, Falcon

Why Did Hadoop Succeed?

Open source community and license
A large and diverse community of developers has made, and continues to make, the Hadoop ecosystem among the most active and engaged in history, while the Apache License lowers the barrier to entry for users.

Extensibility/adaptability
With the possible exception of Linux, no other complex platform has evolved on so many levels, and so quickly, to meet user requirements over time.

A strong focus on systems
The roots of Hadoop are in making distributed computing infrastructure more accessible to application developers. That focus continues to bear fruit in areas like resource management and security.

Apache Hadoop 3 Alpha Release!!

3.0.0-alpha1
Release Date: 03 September, 2016
Changes: about 3,000

Major new changes
- HDFS Erasure Coding (HDFS-EC)
- YARN Timeline Service v.2
- Shell Script Rewrite
- MapReduce task-level native optimization
- Support for Multiple Standby NameNodes
- Java 8 Minimum Runtime Version
- New Default Ports for Several Services
- Intra-DataNode Balancer
- Reworked daemon and task heap management

Apache Hadoop 3 Alpha Major New Changes

HDFS Erasure Coding (HDFS-EC)

What is erasure coding?
- One of the data-protection techniques used for fault tolerance, commonly seen in RAID-5.
- When data is stored, the EC codec encodes it into uniformly sized data cells and parity cells.
- When data is loaded, any lost cell in an EC group (made up of data cells and parity cells) is reconstructed from the cells that remain in that group; this decoding step recovers the original data.

HDFS-EC
- Uses Intel ISA-L (Intelligent Storage Acceleration Library) to run the Reed-Solomon algorithm for EC, improving storage performance, throughput, security, and reliability
- EC is based on Exclusive-OR, but plain XOR parity cannot survive multiple failures, so the Reed-Solomon algorithm is applied to tolerate multiple failures
- Built jointly by Cloudera, Intel, and the Hadoop community
- A policy is applied to each individual directory with the hdfs erasurecode -setpolicy command

Why erasure coding is needed
- A replication factor of 3 makes failure handling easy but is expensive (200% overhead in storage space and other resources, e.g. network bandwidth when writing the data)
- In normal operations the additional block replicas are rarely accessed

Effect of erasure coding
- Fault tolerance is guaranteed with less storage
- Storage cost drops by about 50% compared with 3x replication

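The storage arithmetic and the XOR limitation on this slide can be sketched in a few lines of Python. This is a minimal illustration, not HDFS code: the 6 data + 3 parity layout is an assumed Reed-Solomon configuration chosen because it reproduces the roughly 50% saving quoted above, and the helper names `storage_overhead` and `xor_parity` are hypothetical.

```python
from functools import reduce
from typing import Sequence


def storage_overhead(data_units: int, parity_units: int) -> float:
    """Extra storage needed, as a fraction of the original data size."""
    return parity_units / data_units


def xor_parity(cells: Sequence[bytes]) -> bytes:
    """Byte-wise XOR of equally sized cells (RAID-5 style parity)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*cells))


# 3x replication: one original plus two extra copies -> 200% overhead.
replication = storage_overhead(data_units=1, parity_units=2)

# Assumed Reed-Solomon layout: 6 data cells + 3 parity cells -> 50% overhead.
rs_6_3 = storage_overhead(data_units=6, parity_units=3)

# Total footprint is (1 + overhead); compare the two footprints.
saving = 1 - (1 + rs_6_3) / (1 + replication)

# XOR parity can rebuild exactly one lost cell from the survivors.
data_cells = [b"hado", b"op3a", b"lpha"]
parity = xor_parity(data_cells)
survivors = [data_cells[0], data_cells[2], parity]  # cell 1 is lost
recovered = xor_parity(survivors)

print(f"replication overhead: {replication:.0%}")
print(f"RS(6,3) overhead:     {rs_6_3:.0%}")
print(f"storage saving:       {saving:.0%}")
print(f"recovered cell:       {recovered!r}")
```

If two cells were lost, the single XOR parity cell would not carry enough information to rebuild both, which is the gap Reed-Solomon closes by computing several independent parity cells.
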
Apache Hadoop 3 Alpha Major New Changes

YARN Timeline Service v.2
A service that exposes container-related information such as events and metrics, together with application information about map and reduce tasks, so that it can be viewed through a web UI.

1) Improving scalability and reliability of Timeline Service
- Adopts a highly scalable architecture with distributed storage (e.g. ) to improve reliability

2) Enhancing usability by introducing flows and aggregation
- Provides a logical flow across the stages of a YARN application
- Supports aggregation of metrics at the flow level

Shell Script Rewrite
- The Hadoop shell scripts were rewritten to fix long-standing bugs and add new features
- The new scripts are not fully compatible with the previous versions, so users who rely on Hadoop environment variables or shell commands need to review what affects them

more info:
https://issues.apache.org/jira/browse/hadoop-9902
https://issues.apache.org/jira/secure/attachment/12599817/more-info.txt

Apache Hadoop 3 Alpha Major New Changes

MapReduce task-level native optimization

What is NativeTask?
A native computing unit focused on data processing: a high-performance C++ API and runtime for Hadoop MapReduce.

Why NativeTask is needed
- I/O bottleneck. Most Hadoop workloads are data intensive, so if no compression is used for input, mid-output, and output, I/O (disk, network) can become a bottleneck.
- Inefficient implementation (map-side sort, serialization/deserialization, shuffle, data locality, scheduling and startup overhead).
- Inflexible programming paradigm, which limits performance.

Effect of NativeTask
- Applying NativeTask to the map output collector improves the performance of shuffle-intensive jobs by 30% or more

[Figure: Java side (bypasses the normal Java data flow), connected through JNI to the native side (actual computation)]

more info: https://issues.apache.org/jira/browse/mapreduce-2841

Apache Hadoop 3 Alpha Major New Changes

Support for Multiple Standby NameNodes
- The previous architecture allowed only a single active NameNode and a single standby NameNode; it can now be configured with multiple standby NameNodes
- Provides a higher level of high availability
- No more than 5 NameNodes; 3 are recommended, because of communication overheads

more info: http://hadoop.apache.org/docs/r3.0.0-alpha1/hadoop-project-dist/hadoophdfs/hdfshighavailabilitywithqjm.html

Java 8 Minimum Runtime Version
- Oracle's official support for Java 7 ended (April 2015), leaving no choice but to move to Java 8
- Accordingly, the minimum Java version for Hadoop 3 is now Java 8

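Returning to the NameNode limits on this slide, the sketch below shows one way a deployment script might enforce them before writing configuration. It assumes the `dfs.ha.namenodes.<nameservice>` property from the HDFS HA documentation; the function name `ha_namenodes_property`, the nameservice `mycluster`, and the IDs `nn1` to `nn3` are illustrative placeholders, not part of the presentation.

```python
from typing import Iterable, Tuple

MAX_NAMENODES = 5          # upper bound stated on the slide
RECOMMENDED_NAMENODES = 3  # recommendation stated on the slide


def ha_namenodes_property(nameservice: str,
                          namenode_ids: Iterable[str]) -> Tuple[str, str]:
    """Return the (name, value) pair listing the NameNodes of a nameservice.

    Raises ValueError when the list cannot form a valid HA setup under the
    limits quoted on the slide.
    """
    ids = list(namenode_ids)
    if len(ids) < 2:
        raise ValueError("HA needs one active and at least one standby NameNode")
    if len(ids) > MAX_NAMENODES:
        raise ValueError(f"at most {MAX_NAMENODES} NameNodes are allowed")
    if len(set(ids)) != len(ids):
        raise ValueError("NameNode IDs must be unique")
    return f"dfs.ha.namenodes.{nameservice}", ",".join(ids)


name, value = ha_namenodes_property("mycluster", ["nn1", "nn2", "nn3"])
print(f"{name} = {value}")
```

Rejecting oversized lists up front keeps the check in one place; whether to warn when the count differs from the recommended three is a policy choice left to the caller.
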
Apache Hadoop 3 Alpha Major New Changes

New Default Ports for Several Services
- The ports of the NameNode, Secondary NameNode, DataNode, and KMS were changed to avoid bind errors when Hadoop starts
- Expected to improve the reliability of rolling restarts on large clusters

NameNode ports: 50470 --> 9871, 50070 --> 9870, 8020 --> 9820
Secondary NN ports: 50091 --> 9869, 50090 --> 9868
DataNode ports: 50020 --> 9867, 50010 --> 9866, 50475 --> 9865, 50075 --> 9864
KMS port: 16000 --> 9600

more info:
https://issues.apache.org/jira/browse/hdfs-9427
https://issues.apache.org/jira/browse/hadoop-12811

Intra-DataNode Balancer
- Uneven data placement across the disks inside a DataNode, caused by adding or replacing disks, is resolved with the HDFS command (diskbalancer)

more info: http://hadoop.apache.org/docs/r3.0.0-alpha1/hadoop-project-dist/hadoophdfs/hdfscommands.html

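For the port changes listed on this slide, a small lookup table can help when auditing client configurations or firewall rules. The sketch below uses only the old and new port numbers from the slide (3.0.0-alpha1 defaults); the helper name `migrate_address` and the example host are hypothetical, and it handles plain `host:port` strings only.

```python
# Old default port -> new default port, as listed for 3.0.0-alpha1.
OLD_TO_NEW_PORT = {
    # NameNode
    50470: 9871, 50070: 9870, 8020: 9820,
    # Secondary NameNode
    50091: 9869, 50090: 9868,
    # DataNode
    50020: 9867, 50010: 9866, 50475: 9865, 50075: 9864,
    # KMS
    16000: 9600,
}


def migrate_address(address: str) -> str:
    """Rewrite a trailing ':port' if it is one of the old defaults.

    Addresses without a numeric trailing port, or with a port that is not
    an old default, are returned unchanged.
    """
    host, sep, port = address.rpartition(":")
    if not sep or not port.isdigit():
        return address
    new_port = OLD_TO_NEW_PORT.get(int(port))
    return address if new_port is None else f"{host}:{new_port}"


for addr in ("nn1.example.com:50070", "hdfs://nn1.example.com:8020",
             "dn7.example.com:1234"):
    print(addr, "->", migrate_address(addr))
```

Custom ports are left untouched on purpose: only addresses that still rely on an old default need attention after an upgrade.
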
Apache Hadoop 3 Alpha Major New Changes

Reworked daemon and task heap management
- Adds automatic tuning based on the host's memory size
- HADOOP_HEAPSIZE is deprecated
- Map/reduce heap sizes are simpler to configure than before, removing the chore of spelling out the required heap size in the task configuration or as a Java option

more info:
https://issues.apache.org/jira/browse/hadoop-10950
https://issues.apache.org/jira/browse/mapreduce-5785

Conclusion

The major changes to Hadoop's core components (HDFS, MapReduce) are a stepping stone toward further big advances
- Better HDFS storage utilization -> higher ROI, lower TCO -> lower barrier to initial adoption
- Multiple standby NameNodes and heap tuning -> easier operations -> more opportunity to redirect operating costs toward cluster expansion and improvement

Changes are also expected in the components that build on the new Hadoop
- Faster Hive queries?
- Increased throughput?
- Better data compression efficiency?
- etc

Big changes like these are possible thanks to the interest and love of the community and its members

Thank you
djseo@cloudera.com