Big Data Technology Overview 2016/8/20 ~ 9/3 윤형기 (hky@openwith.net)
Schedule
Day 1: AM Background & overview / PM Environment setup & basic labs
Day 2: AM MR programming / PM N/A
Day 3: AM MR programming / PM Pig & Hive
Day 4: AM Pig & Hive / PM Flume & Sqoop
Day 5: AM Flume & Sqoop / PM Data analysis basics
Day 6: AM Using R / PM Basic statistics & visualization
Day 7: AM Machine learning (1) / PM Machine learning (2)
Day 8: AM Using the cloud / PM Case studies
Big Data Technology Overview (Day 1)
Introduction: a changing world and the power of data
Changing World
Irreversible
Data Power
Big Data (image source: ZDNet)
Big Data Technology Overview
Background: the 3V tidal wave (3VC); supercomputers and high-throughput computing; two directions, remote/distributed large-scale computing (grid computing) vs. centralized (MPP); Scale-Up vs. Scale-Out; BI (Business Intelligence), especially DW/OLAP/data mining
BI overview — components and solutions:
- Strategic BI: BSC (Balanced Scorecard), VBM (Value-Based Management), ABC (Activity-Based Costing)
- Analytical BI: OLAP (On-Line Analytical Processing, multidimensional analysis)
- Extended BI: ERP, CRM, SCM extended to provide BI functions
- Infrastructure: ETL (Extraction-Transformation-Loading)
- Operational BI: DW (Data Warehouse, the data repository)
- Delivery BI: Portal
Hadoop: its birth, background, and characteristics
Google! Spun out of the Nutch/Lucene project in 2006 (Doug Cutting); an Apache top-level open-source project; a framework for distributed processing of large-scale data (http://hadoop.apache.org); pure software; a simplified programming model yields linear scalability (flat linearity); a function-to-data model rather than data-to-function (locality); KVP (Key-Value Pair)
Background to Hadoop's birth: 1990s Excite, AltaVista, Yahoo; 2000 Google: PageRank, GFS/MapReduce; 2003~04 the Google papers; 2005 Hadoop is born (D. Cutting & M. Cafarella); 2006 registered as an Apache project
Frameworks
Big Picture
Hadoop kernel and Hadoop distributions: the Apache version (2.x.x, based on 0.23.x) and 3rd-party distributions (Cloudera, Hortonworks, and MapR)
Hadoop distributions
Apache Hadoop started at 0.10 and has reached 0.23. Current Apache lines: 2.x.x (based on 0.23.x); 1.1.x, the current stable release (based on 0.22); 0.20.x, a legacy stable release still widely used.
Current 3rd-party distributions: Cloudera CDH, Hortonworks, MapR
Hadoop Ecosystem Map
Hadoop HDFS & MapReduce
HDFS
Requirements: commodity hardware, where frequent failures are a fact of life; a modest number of HUGE files (just millions, each > 100 MB, multi-GB files typical), hundreds of GB or TB; large streaming reads, not random access; write-once, read-many-times; high throughput matters more than low latency
HDFS's answers: store files in blocks much larger than in ordinary file systems (default: 64 MB); improve reliability through replication, with each block replicated across 3+ DataNodes; a single master (NameNode) coordinates access and metadata, simplifying central management; no data caching (it would be of little help for streaming reads); a familiar interface with a customized API; simplify the problem and focus on the distributed solution
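As an aside, the block/replication view that the NameNode maintains for a file can be inspected from client code. A minimal sketch (the class name and command-line path are only illustrative):

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockReport {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus st = fs.getFileStatus(new Path(args[0]));   // some large file already in HDFS
        System.out.println("block size=" + st.getBlockSize() + " replication=" + st.getReplication());
        for (BlockLocation b : fs.getFileBlockLocations(st, 0, st.getLen())) {
            // each block lists the DataNodes holding its replicas (normally 3)
            System.out.println("offset=" + b.getOffset() + " length=" + b.getLength()
                    + " hosts=" + Arrays.toString(b.getHosts()));
        }
    }
}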
GFS architecture (figure source: Ghemawat et al., The Google File System, SOSP, 2003)
HDFS File Storage
Ways to use HDFS: the command-line interface, the Java API, the web interface, the REST interface (WebHDFS REST API), or mounting HDFS as a filesystem
HDFS command-line interface
Create a directory:
$ hadoop fs -mkdir /user/idcuser/data
Copy a file from the local filesystem to HDFS:
$ hadoop fs -copyFromLocal cit-patents.txt /user/idcuser/data/.
List all files in the HDFS file system:
$ hadoop fs -ls data/*
Show the end of the specified HDFS file:
$ hadoop fs -tail /user/idcuser/data/cit-patents-copy.txt
Concatenate multiple files and move them to HDFS (via stdin/pipes):
$ cat /data/ita13-tutorial/pg*.txt | hadoop fs -put - data/all_gutenberg.txt
File/directory commands: copyFromLocal, copyToLocal, cp, getmerge, ls, lsr (recursive ls), moveFromLocal, moveToLocal, mv, rm, rmr (recursive rm), touchz, mkdir
Status/list/show commands: stat, tail, cat, test (checks for existence of a path, file, or zero-length file), du, dus
Misc commands: setrep, chgrp, chmod, chown, expunge (empties the trash folder)
HDFS Java API: listing files/directories (globbing), opening/closing input streams, copying bytes (IOUtils), seeking, writing/appending data to files, creating/renaming/deleting files, creating/removing directories.
Reading data from HDFS: org.apache.hadoop.fs.FileSystem (abstract), org.apache.hadoop.hdfs.DistributedFileSystem, org.apache.hadoop.fs.LocalFileSystem, org.apache.hadoop.fs.s3.S3FileSystem
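A minimal sketch of reading an HDFS file through the FileSystem API listed above (the class name is illustrative; the input path is taken from the command line and error handling is simplified):

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsCat {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();            // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);                 // DistributedFileSystem when fs.defaultFS is hdfs://
        InputStream in = null;
        try {
            in = fs.open(new Path(args[0]));                  // e.g. a file under /user/idcuser/data
            IOUtils.copyBytes(in, System.out, 4096, false);   // stream the file contents to stdout
        } finally {
            IOUtils.closeStream(in);
        }
    }
}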
HDFS Web Interface
Typical topology
HDFS recap: focuses on large-scale workloads on many low-cost machines; copes with frequent failures; focuses on large files (mostly appended and read); a filesystem interface aimed at developers; scale-out and batch jobs; recently complemented by several follow-on projects
MapReduce
MapReduce programming model
MapReduce programming model (cont.)
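In essence the user supplies two functions, map: (K1, V1) -> list(K2, V2) and reduce: (K2, list(V2)) -> list(K3, V3); the framework handles partitioning, sorting, and shuffling in between. A sketch of their shape, slightly simplified from the old org.apache.hadoop.mapred interfaces used in the WordCount lab later (the interface names here are illustrative):

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

interface MapFunction<K1, V1, K2, V2> {
    // called once per input record; emits zero or more intermediate (K2, V2) pairs
    void map(K1 key, V1 value, OutputCollector<K2, V2> output, Reporter reporter) throws IOException;
}

interface ReduceFunction<K2, V2, K3, V3> {
    // called once per distinct intermediate key with all of its values
    void reduce(K2 key, Iterator<V2> values, OutputCollector<K3, V3> output, Reporter reporter) throws IOException;
}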
Job execution
WordCount output
Improving the WordCount example
Problem: a single reducer becomes a bottleneck. The work can be distributed over multiple nodes (better work balance), but all the input data has to be sorted before processing.
Question: which data should be sent to which reducer?
Solution: distribute arbitrarily based on a hash function (the default mode), or supply a Partitioner class that determines the target reducer for every map output tuple (see the sketch below).
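A minimal sketch of such a Partitioner using the old mapred API (the class name and the first-letter scheme are illustrative, not part of the course material; the default remains hash partitioning):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class FirstLetterPartitioner implements Partitioner<Text, IntWritable> {
    public void configure(JobConf job) { }   // no job-specific configuration needed

    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // bucket words by their first character so each reducer handles a slice of the alphabet
        String word = key.toString();
        int c = word.isEmpty() ? 0 : word.charAt(0);
        return (c & Integer.MAX_VALUE) % numPartitions;   // stable, non-negative bucket index
    }
}

It would be registered on the job with conf.setPartitionerClass(FirstLetterPartitioner.class) and only takes effect when more than one reducer is configured.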
Notes:
1. Number of map tasks: depends on the input data size, usually 10~100 per node; can be adjusted with JobConf.setNumMapTasks(int).
2. Number of reducers: rule of thumb = 0.95~1.75 x <no. of nodes> x mapred.tasktracker.reduce.tasks.maximum; jobs with zero reducers are also common; can be adjusted with JobConf.setNumReduceTasks(int) (see the sketch below).
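A short sketch of these knobs on the old JobConf API (the numbers are examples only; setNumMapTasks is merely a hint to the framework, while setNumReduceTasks is honored exactly):

import org.apache.hadoop.mapred.JobConf;

public class TaskCountTuning {
    public static void tune(JobConf conf) {
        conf.setNumMapTasks(40);      // hint: the real count follows the number of input splits
        conf.setNumReduceTasks(7);    // e.g. 0.95 x 4 nodes x 2 reduce slots, rounded
        // conf.setNumReduceTasks(0); // map-only job: skips the shuffle & sort phase entirely
    }
}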
Unix commands with the Streaming API
Question: how many cities does each country have?
$ hadoop jar /mnt/biginsights/opt/ibm/biginsights/pig/test/e2e/pig/lib/hadoop-streaming.jar \
    -input input/city.csv \
    -output output \
    -mapper "cut -f2 -d," \
    -reducer "uniq -c" \
    -numReduceTasks 5
Explanation:
cut -f2 -d,   # extract the 2nd column of each CSV line
uniq -c       # collapse adjacent matching lines from the input; -c prefixes each line with its number of occurrences
Additional remark: with -numReduceTasks 0 there is no shuffle & sort phase.
[Lab] MR with Python streaming
Use The Hound of the Baskervilles (Project Gutenberg plain text, UTF-8) as input.txt
-- Python: mapper.py
#!/usr/bin/env python
import sys
counts = {}
for line in sys.stdin:
    words = line.split()
    for word in words:
        counts[word] = counts.get(word, 0) + 1
print counts
--
$ chmod +x mapper.py
$ ./mapper.py < input.txt
mapper2.py
#!/usr/bin/env python
import sys
for line in sys.stdin:
    words = line.split()
    for word in words:
        print word + "\t" + str(1)

$ ./mapper2.py < input.txt | sort
reducer.py
#!/usr/bin/env python
import sys
previous_key = None
total = 0
for line in sys.stdin:
    key, value = line.split("\t", 1)
    if key != previous_key:
        if previous_key != None:
            print previous_key + " was found " + str(total) + " times"
        previous_key = key
        total = 0
    total += int(value)
if previous_key != None:
    print previous_key + " was found " + str(total) + " times"
$ ./mapper2.py < input.txt | sort | ./reducer.py
(refactoring) $ ./mapper2.py < input.txt | sort | ./reducer2.py
$ cat *.txt | ./mapper2.py | sort | ./reducer2.py
MapReduce High Level
MRv1 vs. MRv2
How MRv2 works. Overview: the JobTracker/TaskTracker functions are split into a global ResourceManager, a per-application ApplicationMaster, a per-node slave NodeManager, and per-application Containers running on NodeManagers. The ResourceManager and NodeManager are newly introduced.
ResourceManager: arbitrates resources among applications; resource allocation to applications is done through its scheduler.
ApplicationMaster: a framework-specific entity that obtains the resource containers it needs from the scheduler; after negotiating with the ResourceManager it executes the component tasks through the NodeManager(s); it also tracks status and monitors progress.
NodeManager: the per-machine slave, responsible for launching the applications' containers, monitoring their resource usage (CPU, memory, disk, network), and reporting it to the ResourceManager.
MRv2 progress
Why it is needed:
- Multi-tenancy: YARN allows multiple access engines to use Hadoop as the common standard for batch, interactive, and real-time engines that can simultaneously access the same data set; multi-tenant data processing improves an enterprise's return on its Hadoop investment.
- Cluster utilization: dynamic allocation of cluster resources improves MR job throughput.
- Scalability: improved scheduling strengthens scalability (thousands of nodes managing PBs of data).
Hadoop 1 MR daemons
Hadoop 1 limitations
Scalability (the NameNode is the weak point); restartability; low resource utilization; limited to MR; lack of wire-compatible protocols.
Max cluster size 4,000 nodes; max 40,000 concurrent tasks; coarse synchronization in the JobTracker; a failure kills all queued and running jobs; restart is very tricky due to complex state; hard partition of resources into map and reduce slots; doesn't support programs other than MR; iterative application implementations are 10x slower; client and cluster must be the same version; applications and workflows cannot migrate to different clusters.
Hadoop 2 design concept: split the JobTracker's role into two functions, cluster resource management and application life-cycle management; MR becomes a user library, just one of the applications residing in Hadoop
Key concepts for understanding MR2:
- Application: a job submitted to the framework, e.g. an MR job
- Container: the basic unit of allocation; fine-grained resource allocation (e.g. container A = 2 GB, 1 CPU); replaces the fixed MR slots (see the sketch after this list)
- ResourceManager: the global resource scheduler; hierarchical queues
- NodeManager: the per-machine agent; manages the container life-cycle; monitors container resource usage
- ApplicationMaster: one per application; manages application scheduling and task execution, e.g. the MR ApplicationMaster
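To make the Container concept concrete, a hedged sketch of how an ApplicationMaster might ask for a 2 GB / 1-vcore container with the Hadoop 2.x AMRMClient API (not part of the course material; registration details, heartbeats, and error handling are omitted):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class ContainerRequestSketch {
    public static void main(String[] args) throws Exception {
        AMRMClient<ContainerRequest> rm = AMRMClient.createAMRMClient();
        rm.init(new Configuration());
        rm.start();
        rm.registerApplicationMaster("", 0, "");               // register this AM with the ResourceManager
        Resource capability = Resource.newInstance(2048, 1);   // 2 GB of memory, 1 virtual core
        rm.addContainerRequest(new ContainerRequest(capability, null, null, Priority.newInstance(0)));
        // allocated containers arrive asynchronously via rm.allocate(progress) and are then
        // launched on their NodeManagers through an NMClient
    }
}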
YARN = MR 2.0 + a framework for developing and/or executing distributed processing applications, e.g. MR, Spark, Hama, Giraph
Hadoop 2 high-level architecture
Comparison
YARN pain points: complexity; the protocols are very low-level and verbose; not well suited to long-running jobs; an application doesn't survive a master crash; no built-in communication between container and master; hard to debug
Hadoop strengths, weaknesses, and responses
Strengths: commodity hardware, scale-out, fault tolerance, flexibility through MR.
Weaknesses: MR itself! Missing pieces: schema and optimizer, indexes, views, ...; lack of compatibility with existing tools.
Remedy: Hive, an SQL-to-MR compiler plus execution engine with a pluggable storage layer (SerDes).
Unsolved homework: Hive vs. ANSI SQL, UDFs, ...; MR latency overhead; work continues!
The SQL-on-Hadoop direction: query and analyze data stored in HDFS quickly with SQL; aim for real-time (low-latency) analysis that bypasses MR; usable for large-scale batch as well as interactive analysis; ETL over HDFS and other data, ad-hoc queries, online integration. A new architecture for SQL on Hadoop: data locality (instead of MR), real-time queries, schema-on-read, tight integration with the SQL ecosystem
Example SQL-on-Hadoop projects: Google Dremel, Apache Drill, Cloudera Impala, Citus Data, Tajo (accepted into the Apache Incubator in March 2013; Apache License 2.0; adopted by Korean companies such as SK Telecom)
Use the right tool for the right job
Typical Hadoop applications: text mining, index building, graph analysis, pattern recognition, collaborative filtering, predictive models, sentiment analysis, risk analysis
Usage patterns by type (chart: data velocity from batch to real time vs. data type from structured to unstructured):
- Real-time: risk analysis (banking), fraud detection (credit cards), money-laundering risk detection, social network analysis, marketing (event) campaigns at financial and telecom companies, distribution optimization (simulation), detection of fraudulent insurance claims and tax evasion
- Preventive/predictive: proactive maintenance checks (airlines), sentiment analysis/SNA, demand forecasting in manufacturing, health insurance/disease data analysis
- Batch: traditional DW, text analysis, real-time video surveillance
Hadoop Ecosystems
Ecosystem relationship map (figure source: https://www.mssqltips.com/)
Hadoop Ecosystem
Primary subprojects: ZooKeeper, Hive and Pig, HBase, Flume
Secondary subprojects: Sqoop, Oozie, Hue, Mahout
The ecosystem is the system: Hadoop plays the role of the kernel of a distributed operating system for big data, and no one uses the kernel alone
YARN & the Hadoop ecosystem: MR, a core component since Hadoop 1; Tez, which provides pre-warmed containers and low-latency dispatch (up to 100x faster, used especially by Hive and Pig); HBase, a column-oriented data store; Storm, streaming for large-scale live event processing; Giraph, iterative graph processing; Spark, in-memory cluster computing
Big data analytics
Big data platform (figure source: it.toolbox.com)
Analysis tools: all-in-one (big bang) suites vs. function-specialized tools
R: an open-source mathematical/statistical analysis tool and programming language; descended from the S language, with some 7,000 packages on CRAN (http://cran.r-project.org/); excellent capability and visualization features
Python: an open-source programming language; multi-platform; a rich package ecosystem (10k+); readable, "executable pseudocode"; concise and expressive (less code); full-stack (web, GUI, OS, science); an active community
Analysis techniques: data mining, predictive analysis, data analysis, data science, OLAP, BI, analytics, text mining, SNA (social network analysis), modeling, prediction, machine learning, statistical/mathematical analysis, KDD (knowledge discovery), decision support systems, simulation. For convenience, the terms (data) analysis and (data) mining are used interchangeably here.
Basic statistics: a taxonomy
Machine learning: a taxonomy
Setting up the lab environment
Installing Hadoop
What is Hadoop? Since around 2012, "Hadoop" has broadened to mean the whole Hadoop ecosystem.
Base framework: Hadoop Common (libraries and utilities needed by the other Hadoop modules), HDFS, Hadoop MapReduce (a programming model), YARN (a resource-management platform), plus the ecosystem projects.
Hadoop installation choices
(1) Installation mode: standalone, pseudo-distributed, multi-node cluster
(2) Distribution: hadoop.apache.org, Cloudera, Hortonworks, MapR, others
(3) What to install: one-by-one vs. all-in-one; in the cloud (e.g. Amazon); in a virtual machine?
Cloudera Quick Start
Hadoop labs
[Lab 1] Ubuntu + Apache Hadoop
Download Ubuntu 14.04 and install it (English locale).
$ sudo apt-get install ssh
$ sudo apt-get install rsync
-- [Install Java] --
$ sudo apt-get install openjdk-7-jdk
$ ls /usr/lib/jvm/java-7-openjdk-amd64/
$ sudo vi /etc/bash.bashrc
  append: export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64/
After saving:
$ source /etc/bash.bashrc
$ java -version
Reference: https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/SingleCluster.html
--- Add a hadoop user
$ sudo adduser hadoop (password: hadoop)
$ su - hadoop
$ ssh-keygen -t rsa
$ cat ~/.ssh/id_rsa.pub
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys
$ ssh localhost
$ exit
[Download and install Hadoop]
$ wget http://apache.tt.co.kr/hadoop/common/stable/hadoop-2.7.2.tar.gz
$ tar xzvf hadoop-2.7.2.tar.gz
$ mv hadoop-2.7.2 hadoop
-- vi $HADOOP_HOME/etc/hadoop/hadoop-env.sh and add:
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64/   (the same JDK path as above)
[Lab 2] Running a simple MR example program
Standalone operation. Example: copy the unpacked conf directory contents to use as input data, then apply a regular expression.
$ mkdir input
$ cp etc/hadoop/*.xml input
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar grep input output 'dfs[a-z.]+'
$ cat output/*
Pseudo-distributed operation
etc/hadoop/core-site.xml:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
etc/hadoop/hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
Format the filesystem:
$ bin/hdfs namenode -format
(1) Start the daemons (NameNode and DataNode):
$ sbin/start-dfs.sh
(2) Browse the NameNode web UI: http://localhost:50070/
(3) Create the HDFS directories:
$ bin/hdfs dfs -mkdir /user
$ bin/hdfs dfs -mkdir /user/<username>
(4) Copy the input files:
$ bin/hdfs dfs -put etc/hadoop input
(5) Run the example program:
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar grep input output 'dfs[a-z.]+'
(6) Examine the output:
$ bin/hdfs dfs -get output output
$ cat output/*
Or view the output files directly on the distributed filesystem:
$ bin/hdfs dfs -cat output/*
(7) Shut down:
$ sbin/stop-dfs.sh
[Lab 3] Streaming
[Lab 5] From the textbook: Map.java
package com.packt.chapter1;

import java.io.*;
import java.util.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
        StringTokenizer st = new StringTokenizer(value.toString().toLowerCase());
        while (st.hasMoreTokens()) {
            output.collect(new Text(st.nextToken()), new IntWritable(1));
        }
    }
}
Reduce.java
// Defining the package of the class
package com.packt.chapter1;

// Importing Java and Hadoop libraries
import java.io.*;
import java.util.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

// Defining the Reduce class
public class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
    // The reduce method aggregates the output generated by the Map phase
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        int count = 0;
        while (values.hasNext()) {
            count += values.next().get();
        }
        output.collect(key, new IntWritable(count));
    }
}
WordCount.java
package com.packt.chapter1;

import java.io.IOException;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCount extends Configured implements Tool {
    // run() method for setting the job configuration
    public int run(String[] args) throws IOException {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(Map.class);
        conf.setReducerClass(Reduce.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new WordCount(), args);
        System.exit(exitCode);
    }
}
[Lab 5] Installing a distribution: Hortonworks HDP, Cloudera, MapR