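IoT Data Collection and Analysis Methods, Seen Through Experience with Hadoop-Based Traffic Analysis
May 29, 2014
Youngseok Lee (lee@cnu.ac.kr)
Data Network Lab, Department of Computer Engineering, Chungnam National University (http://networks.cnu.ac.kr)
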
Outline
- Hadoop-based Internet traffic measurement
- IoT data collection and analysis
- Conclusion

Research on Internet Traffic Measurement and Analysis
Challenges
- Scalability
  - Storage for bulky data: 4.6 TB/hr in 1GE packet monitoring
  - High-performance computing: scale-up or scale-out?
- Fault-tolerant system
  - Against HDD/system failures
- Extensibility
  - Agile analysis for diverse traffic formats
CAIDA data
- Ark topology: 1.8 TB
- Telescope: 102 TB
- Packet headers: 18.8 TB
Source: Josh Polterock, "CAIDA: A Data Sharing Case Study," 2012

Research Problem Addressed
- Given: various traffic sources (packet, NetFlow, and BGP data)
- Design: a traffic collection and analysis platform
- Such that it is:
  - Scalable: computing and storage performance grows in a scale-out manner
  - Fault-tolerant: storage and computing
  - Extensible: easy to handle user-defined queries for diverse traffic analysis
  - Cost-effective: commodity hardware and open-source software

ACM SIGCOMM 2012

Distributed Computing and Distributed Storage
- Google MapReduce, 2004
- 1 PB sorting by Google
  - 2008: 6 hours and 2 minutes on 4,000 computers
  - 2011: 33 minutes on 8,000 computers
- Apache Hadoop project
  - MapReduce computing framework (Java)
  - Distributed filesystem

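Apache Hadoop Software
[Figure: Apache Hadoop software stack. Layer labels: Ingestion, Analysis, Structured storage, Computation, Storage, Infrastructure. Component names that survive in the text: Pig, Sqoop, ZooKeeper.]
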
Related Work on Distributed Traffic Analysis
- Traffic analysis of DNS root server (RIPE, 2011.11)
- PacketPig (2012.03): Big Data security analytics platform
- Sherpasurfing: Open Source Cyber Security Solution, Hadoop World 2011
  - Firewall/IDS logs, NetFlow/packet
- Performing Network and Security Analytics with Hadoop (Travis Dawson, Narus), Hadoop Summit 2012
- Distributed Bro (IDS)

Hadoop Pros and Cons
Pros
- Performance: good for batch processing
- Development: easy for developers (Java Map & Reduce)
- Management: fault-tolerant system
- Cost: scale-out feature, Apache open source
Cons
- Performance: not good for real-time processing; data uploading
- Development: difficult for developers (can the problem and its solution be expressed in a parallel way suitable for MapReduce?)
- Management: version control; debugging and troubleshooting in a distributed environment

Hadoop-Based Internet Traffic Analysis
[Figure: Hadoop-based Traffic Measurement and Analysis Platform. Labels: Administrator; NetFlow v5; Packet; Web Visualizer / Hive; Master; Slave; Traffic Collector; Pcap I/O; Traffic Analyzer; Traffic Analysis Mapper & Reducer; Bin I/O; NetFlow I/O; HDFS; Hadoop.]
References
1. Yeonhee Lee and Youngseok Lee, "Towards Scalable Internet Traffic Measurement and Analysis with Hadoop," ACM SIGCOMM Computer Communication Review (CCR), Jan. 2013
2. Yeonhee Lee and Youngseok Lee, "Scalable NetFlow Analysis with Hadoop," FloCon 2013, Jan. 2013

Traffic Analyzer
[Figure: platform architecture in three layers.]
- Data Source (Jpcap, HDFS)
  - Traffic Collector & Loader: Packet, NetFlow
  - Distributor
- Data Processing (HDFS, MapReduce, Hive)
  - IO formats: Pcap InputFormat, Binary Input/OutputFormat, Text Input/OutputFormat
  - MapReduce for traffic analysis: IP analysis MR, TCP analysis MR, HTTP analysis MR, DDoS analysis MR, NetFlow analysis MR
  - Hive QL queries for traffic analysis: Scan IP query, Spoofed IP query, Heavy User query, user-defined query
- User Interface (Hive, Web)
  - Web UI, CLI: monitor, query

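The MapReduce-style analysis named on this slide (for example, the heavy-user query) can be illustrated with a single-process sketch. This is not the platform's code, which the slides describe as Hadoop MapReduce jobs and Hive queries. The record fields `src` and `length`, the function names, and the sample data are all assumptions made for illustration.

```python
from collections import defaultdict


def map_packet(packet):
    """Map step: emit (source IP, byte count) for one packet record."""
    yield packet["src"], packet["length"]


def reduce_bytes(src_ip, byte_counts):
    """Reduce step: total the bytes sent by one source IP."""
    return src_ip, sum(byte_counts)


def run_job(packets):
    """Run map, shuffle (group by key), and reduce in a single process."""
    groups = defaultdict(list)
    for packet in packets:
        for key, value in map_packet(packet):
            groups[key].append(value)
    return dict(reduce_bytes(key, values) for key, values in groups.items())


def heavy_users(totals, top_n=2):
    """Rank source IPs by total bytes, largest first."""
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_n]


# Made-up sample records, for illustration only.
SAMPLE = [
    {"src": "10.0.0.1", "length": 1500},
    {"src": "10.0.0.2", "length": 60},
    {"src": "10.0.0.1", "length": 500},
    {"src": "10.0.0.3", "length": 900},
]

if __name__ == "__main__":
    totals = run_job(SAMPLE)
    print(heavy_users(totals))
```

In a real Hadoop job the shuffle is done by the framework across nodes; here the `groups` dictionary stands in for it.
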
Difficult Problems
1. Data handling issue in Hadoop
   - Reading variable-sized libpcap records in HDFS
   - Heuristic to identify packet boundaries in HDFS
2. Distributed traffic analysis MapReduce algorithms
   - IP/TCP/HTTP data analysis metrics
3. Performance tuning in a large-scale Hadoop testbed

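The slide names a heuristic for finding packet boundaries but does not spell it out. One plausible version, sketched below under stated assumptions, scans a block of bytes for a 16-byte pcap record header whose fields look sane and whose successor header also looks sane. The little-endian byte order, the time window `t_min`/`t_max`, the `snaplen` bound, and all function names are assumptions, not details taken from the slides.

```python
import struct

# Classic pcap per-record header: ts_sec, ts_usec, incl_len, orig_len.
# Little-endian byte order is assumed here.
RECORD_HEADER = struct.Struct("<IIII")


def plausible_record(buf, off, t_min, t_max, snaplen):
    """Return incl_len if a sane-looking record header starts at off, else None."""
    if off < 0 or off + RECORD_HEADER.size > len(buf):
        return None
    ts_sec, ts_usec, incl_len, orig_len = RECORD_HEADER.unpack_from(buf, off)
    sane = (
        t_min <= ts_sec <= t_max        # timestamp inside the capture window
        and ts_usec < 1_000_000         # microseconds field must be < 1 second
        and 0 < incl_len <= snaplen     # captured length within the snap length
        and incl_len <= orig_len        # cannot capture more than was on the wire
    )
    return incl_len if sane else None


def find_first_record(buf, t_min, t_max, snaplen=65535):
    """Return the offset of the first record boundary in buf, or -1 if none."""
    for off in range(len(buf) - RECORD_HEADER.size + 1):
        incl_len = plausible_record(buf, off, t_min, t_max, snaplen)
        if incl_len is None:
            continue
        nxt = off + RECORD_HEADER.size + incl_len
        # Accept only if the record ends the buffer exactly or is followed
        # by another plausible header; this filters out chance matches.
        if nxt == len(buf) or plausible_record(buf, nxt, t_min, t_max, snaplen) is not None:
            return off
    return -1
```

Checking two consecutive headers trades a little extra reading for far fewer false matches inside packet payloads; a record cut off by the end of the buffer is rejected here rather than guessed at.
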
Experiments
Testbed
Type    Nodes  CPU               Memory  Hard disk  Racks
Small   10     2.93 GHz, 8 core  16 GB   1 TB       1
Medium  20     2.93 GHz, 8 core  16 GB   1 TB       1
Large   200    2.66 GHz, 2 core  2 GB    500 GB     4

Data and MapReduce jobs
Type    Dataset                            MapReduce job                                  Testbed
Packet  1 to 5 TB from CNU campus network  IP, TCP, Web (webpop, User Behavior, DDoS)     Small, Medium, Large

Scalability
- Linear performance increase
  - IP analysis: 120 min with 3 nodes, 32 min with 10 nodes
- Low TCP performance
  - 7.3 Gbps for DDoS, 1.6 Gbps with 10 nodes

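The "linear performance increase" claim can be checked against the two IP-analysis timings on this slide (120 min with 3 nodes, 32 min with 10 nodes). The sketch below is only arithmetic on those figures; the function names are made up for illustration.

```python
def speedup(t_base_min, t_scaled_min):
    """How many times faster the scaled run finished."""
    return t_base_min / t_scaled_min


def scaling_efficiency(n_base, t_base_min, n_scaled, t_scaled_min):
    """Speedup divided by the node-count ratio; 1.0 means exactly linear."""
    return speedup(t_base_min, t_scaled_min) / (n_scaled / n_base)


if __name__ == "__main__":
    # Figures from the slide: 3 nodes -> 120 min, 10 nodes -> 32 min.
    print(speedup(120, 32))                    # 3.75
    print(scaling_efficiency(3, 120, 10, 32))  # about 1.125
```

A 3.75x speedup for roughly 3.33x the nodes gives an efficiency slightly above 1, which is consistent with the slide's description of near-linear scaling.
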
Scalability (200 nodes)
- Input data: 1 to 5 TB
- Throughput: 8 to 15 Gbps for 5 TB on 200 nodes (400 cores)
- Completion time: 46 to 79 min

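The throughput and time ranges on this slide agree with each other, which a short calculation shows. The sketch assumes decimal units (1 TB = 10^12 bytes) and that the 46 to 79 min range refers to processing the 5 TB input; neither assumption is stated on the slide.

```python
def throughput_gbps(terabytes, minutes):
    """Average processing rate in Gbps, assuming 1 TB = 10**12 bytes."""
    bits = terabytes * 1e12 * 8
    seconds = minutes * 60
    return bits / seconds / 1e9


if __name__ == "__main__":
    print(round(throughput_gbps(5, 46), 1))  # fastest run, about 14.5 Gbps
    print(round(throughput_gbps(5, 79), 1))  # slowest run, about 8.4 Gbps
```

Both results fall inside the 8 to 15 Gbps range quoted on the slide.
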
Status of Hadoop-Based Network Analysis Projects
[Figure: project landscape arranged as a pipeline: Frontend, Ingestion, Storage, Computation & Analysis, Backend. Labels that survive in the text, grouped by data source:]
- Packet/flow: packet analysis (CoralReef), flow analysis (flow-tools); packets and flows; p3, hadoop-pcap; HDFS put; Packet Analysis, DNS Analysis; Visualization
- IDS/IPS and logs: Bro, system monitor; system, firewall, server, and IDS/IPS logs; PacketPig, SHERPASURFING; HDFS put; Security Analysis (Pig)
- BGP RIB/UPDATE: BGP data analysis (Quagga, bgptools); BGP messages, BGP routing tables; BGPdoop; HDFS put; Routing Analysis

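Who Uses Hadoop-Based Internet Analysis, and for What?
- ISPs: telecom operators' voice/video management and analysis; security
- Portals: content/service management and analysis (mobile IMS)
- Solutions: security
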
Source: Ministry of Science, ICT and Future Planning (미래창조과학부), 2014

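Internet of Things (IoT)
Definition: advanced connectivity of devices, systems, and services that goes beyond traditional machine-to-machine (M2M) communication and covers a variety of protocols, domains, and applications.
- Personal
  - Smartphones, wearable devices
  - Transit cards, dashcams
- Home appliances
  - Household electronics: TV, refrigerator, air conditioner, oven
  - Home monitoring: energy (Nest), lighting (Philips Hue), security (CCTV)
- Infrastructure
  - Buildings, roads, railways, ports, logistics, power plants, factories
  - Sensors, logs, images/video
Image source: http://en.wikipedia.org/wiki/file:internet_of_things.jpg
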
Personal Activity Record Data: Jawbone UP

[Figure: plot of the m_steps series; y-axis from 0 to 20,000, x-axis from 0 to 120.]

Developer APIs
- Jawbone UP
  - https://jawbone.com/up/developer/
  - http://ericblue.com/projects/upapi/
  - JSON, OAuth
- Fitbit
  - https://wiki.fitbit.com/display/api/fitbit+resource+access+api
  - JSON, OAuth

Data Partnership
Example: Jawbone UP + Withings

Personal Body Data
Wi-Fi scales
- Fitbit Aria
- Withings Body Analyzer

Sony Lifelog

Softbank Healthcare
http://www.softbank.jp/mobile/service/healthcare/

Power Consumption Measurement Data
- WattsUp: https://www.wattsupmeters.com/secure/support.php
- Google PowerMeter: http://www.google.com/powermeter/about/
- Article: "Why do Korean websites consume more power than those of the US, China, and Japan?" http://www.zdnet.co.kr/news/news_view.asp?artice_id=20120518180333

[Figure: power draw in watts over time; x-axis from 18:04:04 to 18:06:04 in 6-second steps, y-axis from 18.0 to 28.0 W.]

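Data Types
Data type  Device                              Example                Form
Text       Wearable device sensors             Health information     Structured
Binary     Computing / smart devices           Logs                   Structured / unstructured
Image      Cameras, dashcams, CCTV             Photos                 Unstructured
Audio      Smartphones                         Calls, music           Unstructured
Video      Cameras, dashcams                   Video clips, movies    Unstructured
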
IoT, Hadoop, Big Data
Hortonworks example: http://hortonworks.com/hadoop-tutorial/how-to-analyze-machine-and-sensor-data/

Data Mining

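Conclusion
IoT?
- Importance of devices
  - Smartphones, wearable devices
  - A data collection and analysis framework is important for discovering new services
- Importance of data
  - Creating added value through (big) data mining
  - DB, NoSQL, Hadoop, and streaming analysis technologies are available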