
Big Data and Analysis Algorithms
2013-08-27, 정보화사회실천연합 (qna.pcis@daum.net)

Table of Contents
01 Big Data: 1.1 Big Data / 1.2 Big Data Technology / 1.3 The State of Big Data / 1.4 Apache Project
02 Data: 2.1 Characteristics of Data / 2.2 Statistical Analysis Techniques / 2.3 Mining Analysis Techniques
03 Algorithm: 3.1 Characteristics of Algorithms / 3.2 Distributed Processing of Algorithms / 3.3 Selecting the Scope of Analysis Data
04 Data Mining: 4.1 Data Mining / 4.2 Classification Rules / 4.3 Clustering Rules / 4.4 Association Rules / 4.5 Link Analysis
Page 2

1.1 Big Data 1. Big Data Big Data Page 3

1.1 Big Data 1. Big Data
Big Data is commonly characterized by Volume (the amount of information), Velocity (the speed at which information changes), and Variety (the diversity of data and technology), which together produce Value for people: new value (value from new data sets), efficient value (ROI, innovation), and diverse value (hidden value).
Page 4

1.1 Big Data 1. Big Data
Big Data Technology Map (overview diagram): analysis (machine learning, statistics, data mining, text mining, graph/grid analysis), visualization (infographics, lexicon/network views), computing (cloud, distributed/parallel computing, in-memory), information collection and search (search engine, NoSQL, SQL), and pre-processing (parser, NLP, normalization).
Page 5

1.1 Big Data 1. Big Data
Big Data Process Flow: collection, pre-processing/storage, analysis, visualization
- Sources: web, SNS, system logs, sensor data, audio, images, video, documents/papers, etc.
- Collection: search engines, RSS readers, crawling, Open APIs, sensor acquisition, RFID readers
- Pre-processing: distributed/parallel pre-processing, NLP, document filtering, parsing
- Storage: distributed/parallel storage of unstructured and structured data plus metadata (streaming, NoSQL, SQL)
- Analysis: distributed/parallel data analysis on a distributed parallel processing framework - clustering, classification/prediction, association analysis, social network analysis
- Output: analysis results
Page 6

1.1 Big Data 1. Big Data
Roles and required skills for Big Data:
- Domain expert: recommendation logic and ad-platform planning, financial & stock markets, health care, bioinformatics, power management
- Data analyst (Data Scientist): statistics & data exploration, data mining & machine learning, data analysis, reporting, data visualization; tools such as visualization/infographics, IR & RecSys, OLAP tools, SAS, SPSS, R, SQL
- S/W engineer: data collection, implementation of mining algorithms & ML, data processing engines, distributed algorithms; RDBMS, ETL, scripting languages, Pig, Hive, MapReduce, log aggregators, NoSQL, Hadoop
- System engineer: data store optimization, operating system optimization, computing H/W and network optimization; Linux, x86, network
Page 7

1.2 Directions of Big Data Development
Directions of Big Data development: real-time analytics, advanced & predictive analytics, advanced data visualization.
Source: TDWI Research, Q4 2011, on Big Data Analytics
Page 8

1.3 The State of Big Data 1. Big Data
Scale of analyzed data, and scale of analyzed data by region (chart).
Source: KDnuggets Home » Polls » Algorithms for Data Mining (Nov 2011)
Page 9

1.3 The State of Big Data 1. Big Data
Uses of Big Data (figure).
Page 10

1.3 The State of Big Data 1. Big Data
Algorithms in use (chart).
Page 11

1.4 Apache Project 1. Big Data
The Apache Hadoop ecosystem: data mining (Mahout), distributed coordination (ZooKeeper), workflow management (Oozie), columnar NoSQL store (HBase), dataflow language (Pig), SQL-style data processing (Hive), distributed programming framework (MapReduce), metadata management (HCatalog: table & schema management), serialization (Avro), distributed file system (HDFS), unstructured data collection (Chukwa, Flume, Scribe), structured data collection (Sqoop, hiho).
Page 12

1.4.1 Apache Hadoop 1.4 Apache Project
Apache frameworks and more:
- Data storage (HDFS): runs on commodity hardware (usually Linux), horizontally scalable
- Processing (MapReduce): parallelized (scalable) processing, fault tolerant
- Other tools / frameworks: data access (HBase, Hive, Pig, Mahout), tools (Hue, Sqoop), monitoring & alerting (Greenplum, Cloudera), all layered over the MapReduce API and Hadoop Core (HDFS)
Page 13

1.4.1 Apache Hadoop 1.4 Apache Project
What does a Hadoop distribution provide?
- HDFS storage: redundant (3 copies); large blocks for large files (64 or 128 MB per block); can scale to 1000s of nodes
- MapReduce API: batch (job) processing; distributed and localized to clusters (Map); auto-parallelizable for huge amounts of data; fault-tolerant (auto retries); adds high availability and more
- Other libraries: Pig, Hive, HBase, others
Page 14

1.4.1 Apache Hadoop 1.4 Apache Project
Cluster: HDFS (physical) storage
- One Name Node (plus a Secondary Name Node): contains a web site to view cluster information; V2 Hadoop uses multiple Name Nodes for HA
- Many Data Nodes (Data Node 1, Data Node 2, Data Node 3, ...): 3 copies of each block by default; block size is 64 or 128 MB
- Work with data in HDFS using common Linux shell commands
Page 15
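A minimal sketch of the same kind of HDFS interaction from Java, using the org.apache.hadoop.fs.FileSystem API (the slide itself refers to the equivalent hadoop fs shell commands; the paths and file names below are hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);

            fs.mkdirs(new Path("/user/demo"));          // like: hadoop fs -mkdir /user/demo
            fs.copyFromLocalFile(new Path("input.txt"), // like: hadoop fs -put input.txt /user/demo
                    new Path("/user/demo/input.txt"));

            for (FileStatus f : fs.listStatus(new Path("/user/demo"))) {  // like: hadoop fs -ls
                System.out.println(f.getPath() + "  " + f.getLen() + " bytes");
            }
        }
    }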

1.4.1 Apache Hadoop 1.4 Apache Project MapReduce Job Logical View Image from - http://mm-tom.s3.amazonaws.com/blog/mapreduce.png Page 16

1.4.1 Apache Hadoop 1.4 Apache Project
Setting up Hadoop development (Hadoop binaries / data storage / MapReduce / other libraries & tools):
- Local install: local file system (Linux or Windows) or HDFS in pseudo-distributed (single-node) mode; local MapReduce; vendor tools such as Cloudera's Demo VM (requires virtualization software, e.g. VMware)
- Cloud: AWS, Azure, others; cloud storage and cloud libraries; MapReduce on AWS, Microsoft (beta), others
Page 17

1.4.1 Apache Hadoop 1.4 Apache Project
Common data sources:
- Text files (e.g. log files): semi-structured or unstructured
- Statistical information: piles of numbers, often from scientific sources
- Geospatial information: e.g. cell phone activity
- Clickstream: advertising, website traversals
Page 18

1.4.1 Apache Hadoop 1.4 Apache Project
Hadoop Distributed File System
The Hadoop Distributed File System (HDFS) is the primary storage system used by Hadoop applications. HDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations.
- Single namespace for the entire cluster
- Data coherency: write-once-read-many access model; clients can only append to existing files
- Files are broken up into blocks: typically 128 MB block size; each block replicated on multiple DataNodes
- Intelligent client: the client can find the location of blocks and accesses data directly from the DataNode
Page 19

1.4.1 Apache Hadoop 1.4 Apache Project
Building Blocks of Hadoop
A fully configured cluster running Hadoop means running a set of daemons, or resident programs, on the different servers in your network. These daemons have specific roles; some exist only on one server, some exist across multiple servers. The daemons include the NameNode, Secondary NameNode, DataNode, JobTracker, and TaskTracker.
Hadoop server roles (diagram): clients submit work; on the master side, the JobTracker coordinates distributed data analytics (MapReduce) while the NameNode and Secondary NameNode coordinate distributed data storage (HDFS); each slave runs a DataNode & TaskTracker pair.
Page 20

1.4.1 Apache Hadoop 1.4 Apache Project
NameNode
The most vital of the Hadoop daemons is the NameNode. Hadoop employs a master/slave architecture for both distributed storage and distributed computation. The distributed storage system is called the Hadoop File System, or HDFS. The NameNode is the master of HDFS that directs the slave DataNode daemons to perform the low-level I/O tasks. The NameNode is the bookkeeper of HDFS; it keeps track of how your files are broken down into file blocks, which nodes store those blocks, and the overall health of the distributed filesystem. The function of the NameNode is memory and I/O intensive. As such, the server hosting the NameNode typically doesn't store any user data or perform any computations for a MapReduce program, to lower the workload on the machine.
Page 21

1.4.1 Apache Hadoop 1.4 Apache Project
Secondary NameNode
The Secondary NameNode (SNN) is an assistant daemon for monitoring the state of the cluster HDFS. Like the NameNode, each cluster has one SNN, and it typically resides on its own machine as well; no other DataNode or TaskTracker daemons run on the same server. The SNN differs from the NameNode in that this process doesn't receive or record any real-time changes to HDFS. Instead, it communicates with the NameNode to take snapshots of the HDFS metadata at intervals defined by the cluster configuration. As mentioned earlier, the NameNode is a single point of failure for a Hadoop cluster, and the SNN snapshots help minimize the downtime and loss of data.
Page 22

1.4.1 Apache Hadoop 1.4 Apache Project
DataNode
Each slave machine in your cluster will host a DataNode daemon to perform the grunt work of the distributed filesystem: reading and writing HDFS blocks to actual files on the local filesystem. When you want to read or write an HDFS file, the file is broken into blocks and the NameNode will tell your client which DataNode each block resides in. Your client communicates directly with the DataNode daemons to process the local files corresponding to the blocks. Furthermore, a DataNode may communicate with other DataNodes to replicate its data blocks for redundancy.
Page 23

1.4.1 Apache Hadoop 1.4 Apache Project
Trackers
JobTracker: The JobTracker daemon is the liaison between your application and Hadoop. Once you submit your code to your cluster, the JobTracker determines the execution plan by determining which files to process, assigns nodes to different tasks, and monitors all tasks as they're running. Should a task fail, the JobTracker will automatically relaunch the task, possibly on a different node, up to a predefined limit of retries. There is only one JobTracker daemon per Hadoop cluster. It's typically run on a server as a master node of the cluster.
TaskTracker: As with the storage daemons, the computing daemons also follow a master/slave architecture: the JobTracker is the master overseeing the overall execution of a MapReduce job and the TaskTrackers manage the execution of individual tasks on each slave node. Each TaskTracker is responsible for executing the individual tasks that the JobTracker assigns. Although there is a single TaskTracker per slave node, each TaskTracker can spawn multiple JVMs to handle many map or reduce tasks in parallel. One responsibility of the TaskTracker is to constantly communicate with the JobTracker. If the JobTracker fails to receive a heartbeat from a TaskTracker within a specified amount of time, it will assume the TaskTracker has crashed and will resubmit the corresponding tasks to other nodes in the cluster.
Page 24

1.4.1 Apache Hadoop 1.4 Apache Project
MapReduce Thinking
MapReduce programs are designed to compute large volumes of data in a parallel fashion. This requires dividing the workload across a large number of machines. MapReduce programs transform lists of input data elements into lists of output data elements. A MapReduce program will do this twice, using two different list processing idioms: map, and reduce. A MapReduce program processes data by manipulating (key/value) pairs in the general form
map: (K1, V1) → list(K2, V2)
reduce: (K2, list(V2)) → list(K3, V3)
Page 25

1.4.1 Apache Hadoop 1.4 Apache Project
Input
Input files: This is where the data for a MapReduce task is initially stored. While this does not need to be the case, the input files typically reside in HDFS. The format of these files is arbitrary; while line-based log files can be used, we could also use a binary format, multi-line input records, or something else entirely. It is typical for these input files to be very large -- tens of gigabytes or more.
InputFormat: How these input files are split up and read is defined by the InputFormat. An InputFormat is a class that provides the following functionality:
- Selects the files or other objects that should be used for input
- Defines the InputSplits that break a file into tasks
- Provides a factory for RecordReader objects that read the file
Several InputFormats are provided with Hadoop. An abstract type is called FileInputFormat; all InputFormats that operate on files inherit functionality and properties from this class. When starting a Hadoop job, FileInputFormat is provided with a path containing files to read. The FileInputFormat will read all files in this directory. It then divides these files into one or more InputSplits each. You can choose which InputFormat to apply to your input files for a job by calling the setInputFormat() method of the JobConf object that defines the job. A table of standard InputFormats is given below.
- TextInputFormat: default format; reads lines of text files. Key: the byte offset of the line. Value: the line contents.
- KeyValueInputFormat: parses lines into (key, val) pairs. Key: everything up to the first tab character. Value: the remainder of the line.
- SequenceFileInputFormat: a Hadoop-specific high-performance binary format. Key: user-defined. Value: user-defined.
Page 26

1.4.1 Apache Hadoop 1.4 Apache Project
Input contd.
Input Splits: An InputSplit describes a unit of work that comprises a single map task in a MapReduce program. A MapReduce program applied to a data set, collectively referred to as a Job, is made up of several (possibly several hundred) tasks. Map tasks may involve reading a whole file; they often involve reading only part of a file. By default, the FileInputFormat and its descendants break a file up into 64 MB chunks (the same size as blocks in HDFS). You can control this value by setting the mapred.min.split.size parameter in hadoop-site.xml, or by overriding the parameter in the JobConf object used to submit a particular MapReduce job. By processing a file in chunks, we allow several map tasks to operate on a single file in parallel. If the file is very large, this can improve performance significantly through parallelism. Even more importantly, since the various blocks that make up the file may be spread across several different nodes in the cluster, it allows tasks to be scheduled on each of these different nodes; the individual blocks are thus all processed locally, instead of needing to be transferred from one node to another. Of course, while log files can be processed in this piece-wise fashion, some file formats are not amenable to chunked processing. By writing a custom InputFormat, you can control how the file is broken up (or is not broken up) into splits. The InputFormat defines the list of tasks that make up the mapping phase; each task corresponds to a single input split. The tasks are then assigned to the nodes in the system based on where the input file chunks are physically resident. An individual node may have several dozen tasks assigned to it. The node will begin working on the tasks, attempting to perform as many in parallel as it can. The on-node parallelism is controlled by the mapred.tasktracker.map.tasks.maximum parameter.
RecordReader: The InputSplit has defined a slice of work, but does not describe how to access it. The RecordReader class actually loads the data from its source and converts it into (key, value) pairs suitable for reading by the Mapper. The RecordReader instance is defined by the InputFormat. The default InputFormat, TextInputFormat, provides a LineRecordReader, which treats each line of the input file as a new value. The key associated with each line is its byte offset in the file. The RecordReader is invoked repeatedly on the input until the entire InputSplit has been consumed. Each invocation of the RecordReader leads to another call to the map() method of the Mapper.
Page 27

1.4.1 Apache Hadoop 1.4 Apache Project
Mapper
The Mapper performs the interesting user-defined work of the first phase of the MapReduce program. Given a key and a value, the map() method emits (key, value) pair(s) which are forwarded to the Reducers. A new instance of Mapper is instantiated in a separate Java process for each map task (InputSplit) that makes up part of the total job input. The individual mappers are intentionally not provided with a mechanism to communicate with one another in any way. This allows the reliability of each map task to be governed solely by the reliability of the local machine. In addition to the key and the value, the map() method receives a Context object, which has a method named write() that will forward a (key, value) pair to the reduce phase of the job. The Mapper interface is responsible for the data processing step. Its single method is to process an individual (key/value) pair:
public void map(K1 key, V1 value, Context context) throws IOException
Page 28
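As a hedged illustration of the map() idiom described above, a word-count Mapper sketch using the newer org.apache.hadoop.mapreduce API (class and field names are my own, not taken from the slides): it emits a (word, 1) pair for every token of each input line, where the input key is the byte offset supplied by TextInputFormat.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // key = byte offset of the line, value = line contents (TextInputFormat)
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);   // forward (word, 1) to the reduce phase
            }
        }
    }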

1.4.1 Apache Hadoop 1.4 Apache Project In Between Phases Partition & Shuffle: After the first map tasks have completed, the nodes may still be performing several more map tasks each. But they also begin exchanging the intermediate outputs from the map tasks to where they are required by the reducers. This process of moving map outputs to the reducers is known as shuffling. A different subset of the intermediate key space is assigned to each reduce node; these subsets (known as "partitions") are the inputs to the reduce tasks. Each map task may emit (key, value) pairs to any partition; all values for the same key are always reduced together regardless of which mapper is its origin. Therefore, the map nodes must all agree on where to send the different pieces of the intermediate data. The Partitioner class determines which partition a given (key, value) pair will go to. The default partitioner computes a hash value for the key and assigns the partition based on this result. Sort: Each reduce task is responsible for reducing the values associated with several intermediate keys. The set of intermediate keys on a single node is automatically sorted by Hadoop before they are presented to the Reducer. Page 29

1.4.1 Apache Hadoop 1.4 Apache Project
Reducer
A Reducer instance is created for each reduce task. This is an instance of user-provided code that performs the second important phase of job-specific work. For each key in the partition assigned to a Reducer, the Reducer's reduce() method is called once. This receives a key as well as an iterator over all the values associated with the key. The values associated with a key are returned by the iterator in an undefined order. The Reducer also receives the Context object; that is used to write the output in the same manner as in the map() method.
void reduce(K2 key, Iterable<V2> values, Context context) throws IOException
Page 30
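A matching word-count Reducer sketch (same assumptions as the Mapper sketch above): for each word it sums the counts gathered from all mappers and writes the total through the Context.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {  // all counts emitted for this word, in undefined order
                sum += v.get();
            }
            result.set(sum);
            context.write(key, result);     // emit (word, total count)
        }
    }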

1.4.1 Apache Hadoop 1.4 Apache Project
Combiner
Combiner: The pipeline shown earlier omits a processing step which can be used for optimizing bandwidth usage by your MapReduce job. Called the Combiner, this pass runs after the Mapper and before the Reducer. Usage of the Combiner is optional. If this pass is suitable for your job, instances of the Combiner class are run on every node that has run map tasks. The Combiner will receive as input all data emitted by the Mapper instances on a given node. The output from the Combiner is then sent to the Reducers, instead of the output from the Mappers. The Combiner is a "mini-reduce" process which operates only on data generated by one machine.
Example: Word count is a prime example of where a Combiner is useful. The Word Count program emits a (word, 1) pair for every instance of every word it sees. So if the same document contains the word "cat" 3 times, the pair ("cat", 1) is emitted three times; all of these are then sent to the Reducer. By using a Combiner, these can be condensed into a single ("cat", 3) pair to be sent to the Reducer. Now each node only sends a single value to the reducer for each word -- drastically reducing the total bandwidth required for the shuffle process, and speeding up the job. The best part of all is that we do not need to write any additional code to take advantage of this! If a reduce function is both commutative and associative, then it can be used as a Combiner as well. You can enable combining in the word count program by adding the following line to the driver:
conf.setCombinerClass(Reduce.class);
The Combiner should be an instance of the Reducer interface. If your Reducer itself cannot be used directly as a Combiner because of commutativity or associativity, you might still be able to write a third class to use as a Combiner for your job.
Page 31
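A hedged driver sketch tying the pieces together; it uses the newer Job API rather than the JobConf call quoted above, and TokenizerMapper/IntSumReducer are the sketch classes from the previous pages. Because summing counts is commutative and associative, the Reducer class also serves as the Combiner.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);  // mini-reduce on each map node
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }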

1.4.1 Apache Hadoop 1.4 Apache Project
Output
OutputFormat: The (key, value) pairs provided to the OutputCollector are then written to output files. The way they are written is governed by the OutputFormat. The OutputFormat functions much like the InputFormat class described earlier. The instances of OutputFormat provided by Hadoop write to files on the local disk or in HDFS; they all inherit from a common FileOutputFormat. Each Reducer writes a separate file in a common output directory. These files will typically be named part-nnnnn, where nnnnn is the partition id associated with the reduce task. The output directory is set by the FileOutputFormat.setOutputPath() method. You can control which particular OutputFormat is used by calling the setOutputFormat() method of the JobConf object that defines your MapReduce job. A table of provided OutputFormats is given below.
- TextOutputFormat: default; writes lines in "key \t value" form
- SequenceFileOutputFormat: writes binary files suitable for reading into subsequent MapReduce jobs
- NullOutputFormat: disregards its inputs
Hadoop provides some OutputFormat instances to write to files. The basic (default) instance is TextOutputFormat, which writes (key, value) pairs on individual lines of a text file. This can easily be re-read by a later MapReduce task using the KeyValueInputFormat class, and is also human-readable. A better intermediate format for use between MapReduce jobs is the SequenceFileOutputFormat, which rapidly serializes arbitrary data types to the file; the corresponding SequenceFileInputFormat will deserialize the file into the same types and present the data to the next Mapper in the same manner as it was emitted by the previous Reducer. The NullOutputFormat generates no output files and disregards any (key, value) pairs passed to it by the OutputCollector. This is useful if you are explicitly writing your own output files in the reduce() method and do not want additional empty output files generated by the Hadoop framework.
RecordWriter: Much like how the InputFormat actually reads individual records through the RecordReader implementation, the OutputFormat class is a factory for RecordWriter objects; these are used to write the individual records to the files as directed by the OutputFormat. The output files written by the Reducers are then left in HDFS for your use, either by another MapReduce job, by a separate program, or for human inspection.
Page 32

1.4.1 Apache Hadoop 1.4 Apache Project
Hadoop MapReduce: processes & data flow (figure)
Page 33

1.4.1 Apache Hadoop 1.4 Apache Project
Job Execution
Hadoop MapReduce is based on a pull model, where multiple TaskTrackers poll the JobTracker for tasks (either map tasks or reduce tasks). Job execution starts when the client program uploads three files to the HDFS location specified by the mapred.system.dir property in the hadoop-default.conf file: job.xml (the job config, including the map, combine, and reduce functions and the input/output data paths, etc.), job.split (which specifies how many splits there are and their ranges, based on dividing files into chunks of roughly 16-64 MB), and job.jar (the actual Mapper and Reducer implementation classes). Then the client program notifies the JobTracker about the job submission. The JobTracker returns a job id to the client program and starts allocating map tasks to idle TaskTrackers when they poll for tasks. Each TaskTracker has a defined number of "task slots" based on the capacity of the machine. A heartbeat protocol allows the JobTracker to know how many free slots each TaskTracker has. The JobTracker determines appropriate tasks for the TaskTrackers based on how busy they are and on their network proximity to the data sources (preferring the same node, then the same rack, then the same network switch). The assigned TaskTrackers fork a MapTask (a separate JVM process) to execute the map phase processing. The MapTask extracts the input data from the splits using the RecordReader and InputFormat and invokes the user-provided map function, which emits a number of key/value pairs into the memory buffer.
Page 34

1.4.1 Apache Hadoop 1.4 Apache Project
Job Execution contd.
When the buffer is full, the output collector spills the memory buffer to disk. To optimize network bandwidth, an optional combine function can be invoked to partially reduce the values of each key. Afterwards, the partition function is invoked on each key to calculate its reducer node index. The memory buffer is eventually flushed into 2 files: the first, index file contains an offset pointer for each partition; the second, data file contains all records sorted by partition and then by key. When the map task has finished executing all input records, it starts the commit process: it first flushes the in-memory buffer (even if it is not full) to the index + data file pair. Then a merge sort over all index + data file pairs is performed to create a single index + data file pair. The index + data file pair is then split into R local directories, one for each partition. After all the MapTasks complete (all splits are done), the TaskTracker notifies the JobTracker, which keeps track of the overall progress of the job. The JobTracker also provides a web interface for viewing the job status.
When the JobTracker notices that some map tasks are completed, it starts allocating reduce tasks to subsequently polling TaskTrackers (R TaskTrackers will be allocated for reduce tasks). These allocated TaskTrackers remotely download the region files (according to the assigned reducer index) from the completed map phase nodes and concatenate (merge sort) them into a single file. Whenever more map tasks are completed afterwards, the JobTracker notifies these allocated TaskTrackers to download more region files (merging with the previous file). In this manner, downloading region files is interleaved with the map task progress; the reduce phase has not yet started at this point. Eventually all the map tasks are completed. The JobTracker then notifies all the allocated TaskTrackers to proceed to the reduce phase. Each allocated TaskTracker forks a ReduceTask (a separate JVM) to read the downloaded file (which is already sorted by key) and invoke the reduce function, which collects the key/aggregated-value pairs into the final output file (one per reducer node). Note that each reduce task (and each map task as well) is single-threaded, and this thread invokes the reduce(key, values) function in ascending (or descending) order of the keys assigned to the reduce task. This provides an interesting property: all entries written by the reduce() function are sorted in increasing order. The output of each reducer is written to a temp output file in HDFS. When the reducer finishes processing all keys, the temp output file is renamed atomically to its final output filename.
Page 35

1.4.1 Apache Hadoop 1.4 Apache Project MapReduce Example - WordCount Image from: http://blog.jteam.nl/wp-content/uploads/2009/08/mapreducewordcountoverview1.png Page 36

1.4.2 Mahout Algorithms 1.4 Apache Project
Classification
- Logistic Regression (SGD)
- Bayesian
- Support Vector Machines (SVM) (open)
- Perceptron and Winnow (open)
- Neural Network (open)
- Random Forests (integrated)
- Restricted Boltzmann Machines (open)
- Online Passive Aggressive (integrated)
- Boosting (awaiting patch commit)
- Hidden Markov Models (HMM): training is done in MapReduce
Page 37

1.4.2 Mahout Algorithms 1.4 Apache Project
Clustering
- Canopy Clustering (integrated)
- K-Means Clustering (integrated)
- Fuzzy K-Means (integrated)
- Expectation Maximization (EM)
- Mean Shift Clustering (integrated)
- Hierarchical Clustering
- Dirichlet Process Clustering (integrated)
- Latent Dirichlet Allocation (integrated)
- Spectral Clustering (integrated)
- Minhash Clustering (integrated)
- Top Down Clustering (integrated)
Page 38

1.4.2 Mahout Algorithms 1.4 Apache Project
Pattern Mining
- Parallel FP-Growth algorithm (also known as frequent itemset mining), using MapReduce
Regression
- Locally Weighted Linear Regression (open)
Dimension reduction
- Principal Components Analysis (PCA) (open)
- Independent Component Analysis (open)
- Gaussian Discriminative Analysis (GDA) (open)
Page 39

2.1 Characteristics of Data 2 Data
Data formats:
- Structured data: information defined in fixed fields, e.g. databases, spreadsheets
- Semi-structured data: information with a certain structure that includes meta-information or a schema, e.g. XML, HTML
- Unstructured data: information not defined in fixed fields, e.g. document files, posts, news articles, SNS posts, images, video, audio
Kinds of data:
- Quantitative (numeric) data: expressed as numbers; 1) discrete data (countable), 2) continuous data (not countable)
- Qualitative (categorical) data: characteristics that can only be distinguished as categories and cannot be measured numerically
Page 40

2.1 Characteristics of Data 2 Data
Classification of data by measurement scale:
- Nominal scale (qualitative/categorical): symbols used to distinguish the content or characteristics of an object; categories only (no order, equal intervals, or absolute zero); comparison: identification/classification; arithmetic: =, ≠; average: mode. Examples: gender (male = '1', female = '2'), season, product type, marital status, player numbers, region.
- Ordinal (rank) scale (qualitative/categorical): carries order, so relative comparison by the size of the numbers is possible; categories and order; comparison: rank comparison; arithmetic: =, ≠; average: median. Examples: preference (like = 1, neutral = 2, dislike = 3), class rank, level of education.
- Interval scale (quantitative): beyond order, the differences between measurements are also meaningful; categories, order, and equal intervals; comparison: interval comparison; arithmetic: =, ≠, +, -; average: arithmetic mean. Examples: IQ, temperature, social indicators.
- Ratio scale (quantitative): has the properties of an interval scale plus an absolute zero point; the scale to which statistical techniques generally apply; comparison: magnitude comparison; arithmetic: =, ≠, +, -, *, /; average: geometric mean and all other statistics. Examples: TV ratings, voter turnout, weight, age, production cost, height, attendance rate.
Page 41

2.2 Statistical Analysis Techniques 2 Data
Choosing a statistical technique by variable (scale) type: the choice depends on whether the dependent and independent variables are interval/ratio or nominal/ordinal, on the number of items, and on the purpose (relationship, explanation/prediction, discrimination, grouping) and number of dimensions:
- Interval/ratio dependent and independent variables: correlation analysis, regression analysis (relationship, explanation/prediction)
- Interval/ratio dependent variable with nominal/ordinal independent variables: T-test, ANOVA
- Nominal/ordinal dependent variable with interval/ratio independent variables: logistic regression, discriminant analysis
- Nominal/ordinal dependent and independent variables: categorical data analysis
- Grouping with no dependent variable: cluster analysis
Page 42

2.3 Mining Analysis Techniques 2 Data
Analysis techniques by data type: with a continuous or discrete dependent variable, continuous, discrete, or categorical independent variables lead to forecasting and classification; with no dependent variable, the techniques are clustering, association, sequencing, and link analysis.
- Forecasting: predicting the future based on patterns within a large data set (e.g. demand forecasting)
- Classification: inferring classes and distinctions through a specific definition of a given group (e.g. churned customers)
- Clustering: grouping data that share concrete characteristics; differs from classification in that no predefined class information is available (e.g. segmenting groups with similar behavior)
- Association: exploring the interrelation of events that occur together (e.g. identifying relations among products in a shopping basket)
- Sequencing: association rules with the concept of time added, exploring the interrelation of patterns along a time series (e.g. repeated visits around the use of a financial product)
- Link analysis: identifying relationships among large volumes of values (e.g. relationship analysis in an SNS)
Page 43

3.1 Characteristics of Algorithms 3 Algorithm
Performance characteristics of algorithms (big-O notation):
- O(1), constant: takes the same time even as the amount of data grows
- O(log n), logarithmic: time grows by log n as n grows
- O(n), linear: time grows in proportion to n (the same processing applied to each item)
- O(n log n), linearithmic: when n doubles, time grows slightly more than twofold
- O(n²), quadratic: double loop
- O(n³), cubic: triple loop
- O(2ⁿ), exponential: time grows sharply with the input
- O(n!), factorial
Comparison of running times by big-O notation: O(1) < O(log n) < O(n) < O(n log n) < O(n²) < O(n³) < O(2ⁿ) < O(n!)
Mathematical definition: given two functions f(n) and g(n), if there exist constants c and n₁ such that f(n) ≤ c·g(n) for all n ≥ n₁, then f(n) = O(g(n)). Using this definition to show that T(n) = n² + n + 1 is O(n²) (several choices of n₁ and c are possible): with n₁ = 2 and c = 3, n² + n + 1 ≤ 3n² holds for all n ≥ 2.
Page 44

3.1 Characteristics of Algorithms 3 Algorithm
Growth of execution time and memory (chart): curves for nCr, N², N log N, N, and log N plotted against the size of the data or of the item values, with memory size on the vertical axis.
Apriori algorithm (example), number of combinations for 1,000 items by combination size:
- size 2: 499,500
- size 4: 41,417,124,750
- size 6: 1,368,173,298,991,500
- size 8: 24,115,080,524,699,400,000
- size 10: 263,409,560,461,970,000,000,000
Page 45
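These counts are binomial coefficients; as a quick check with the standard formula (not shown on the slide):

$$\binom{n}{r} = \frac{n!}{r!\,(n-r)!}, \qquad \binom{1000}{2} = \frac{1000 \cdot 999}{2} = 499{,}500, \qquad \binom{1000}{4} = \frac{1000 \cdot 999 \cdot 998 \cdot 997}{4!} = 41{,}417{,}124{,}750.$$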

3.1 Characteristics of Algorithms 3 Algorithm
Example comparison of algorithm performance (figure).
Page 46

3.2 Distributed Processing of Algorithms 3 Algorithm
Hadoop MapReduce: processes & data flow (figure)
Page 47

3.2 Distributed Processing of Algorithms 3 Algorithm
MapReduce Example - WordCount
Image from: http://blog.jteam.nl/wp-content/uploads/2009/08/mapreducewordcountoverview1.png
Page 48

3.2 Distributed Processing of Algorithms 3 Algorithm
Example: computing the variance (single process)
σ² = Σ(x − μ)²/n = Σ(x² − 2μx + μ²)/n = Σx²/n − 2μ·(Σx/n) + μ² = Σx²/n − 2μ·μ + μ² = Σx²/n − μ²
If each worker returns σ² as its result, the aggregated σ² is distorted. Since σ² = Σx²/n − μ² = Σx²/n − (Σx/n)², the distributed step should compute only the factors needed for σ² (Σx, Σx², n), and the aggregation step plugs the collected factors into the formula Σx²/n − μ².
Worked example with the values 25, 32, 27, 45, 39, 21, 51, 46 (squares: 625, 1,024, 729, 2,025, 1,521, 441, 2,601, 2,116): sum = 286, n = 8, sumsq = 11,082; A = sumsq/n = 1,385.25; u = sum/n = 35.75; C = u·u = 1,278.06; variance = A − C = 107.19.
Page 49

3.2 Distributed Processing of Algorithms 3 Algorithm
Example: computing the variance (distributed processing)
- Split A (25, 32, 27, 45): sum = 129.00, variance = 60.69
- Split B (39, 21, 51, 46): sum = 157.00, variance = 129.19
- Naive aggregation of the per-split variances, (60.69 + 129.19)/2 = 94.94, does not equal the true variance; the result is distorted.
Correct approach: each split computes only the factors.
- Split A: sum = 129.00, n = 4, sumsq = 625 + 1,024 + 729 + 2,025 = 4,403
- Split B: sum = 157.00, n = 4, sumsq = 1,521 + 441 + 2,601 + 2,116 = 6,679
- Aggregation: sum = 286.00, n = 8, sumsq = 11,082; A = sumsq/n = 1,385.25; u = sum/n = 35.75; C = u·u = 1,278.06; variance = A − C = 107.19
Page 50
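A minimal Java sketch of the same idea (class and method names are my own; in a real MapReduce job the per-split factors would be computed in mappers and merged in a reducer):

    /** Partial factors for one data split: n, sum of x, sum of x squared. */
    class Partial {
        long n;
        double sum;
        double sumSq;

        static Partial of(double[] split) {
            Partial p = new Partial();
            for (double x : split) {
                p.n++;
                p.sum += x;
                p.sumSq += x * x;
            }
            return p;
        }
    }

    public class DistributedVariance {
        public static void main(String[] args) {
            Partial a = Partial.of(new double[] {25, 32, 27, 45});  // split A from the slide
            Partial b = Partial.of(new double[] {39, 21, 51, 46});  // split B from the slide

            // Merge the factors, then apply variance = sumsq/n - (sum/n)^2
            long n = a.n + b.n;                        // 8
            double sum = a.sum + b.sum;                // 286
            double sumSq = a.sumSq + b.sumSq;          // 11,082
            double mean = sum / n;                     // 35.75
            double variance = sumSq / n - mean * mean; // 1,385.25 - 1,278.06 = 107.19

            System.out.printf("mean=%.2f variance=%.2f%n", mean, variance);
            // Naively averaging the per-split variances, (60.69 + 129.19) / 2 = 94.94, is distorted.
        }
    }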

3.3 Selecting the Scope of Analysis Data 3 Algorithm
Data ranges used during analysis (timeline Y−2 ... Y+2):
- Full-data analysis: re-analyze all data for each period (first half of Y+0, second half of Y+0, first half of Y+1)
- Full-data analysis reusing previously analyzed intermediate results for each period
- 1-year data analysis: analyze only the most recent year of data for each period
The growth of the amount of data to be analyzed is closely tied to execution time and memory capacity.
Page 51

4.1 Data Mining 4 Data Mining
Evolution of algorithms: Top 10 Data Mining Algorithms
1. C4.5
2. k-Means
3. SVM (Support Vector Machines)
4. Apriori
5. EM (Expectation Maximization)
6. PageRank
7. AdaBoost
8. kNN
9. Naive Bayes
10. CART
Source: IEEE ICDM, December 2006
Page 52

4.1 Data Mining 4 Data Mining
Data mining approaches by purpose:
- Predictive modeling / Classification: learns the characteristics of past data to build a classification model that predicts the outcome for new data; used for target marketing and customer credit scoring; model types: regression analysis, decision trees, neural networks, genetic algorithms.
- Predictive modeling / Clustering: splits data into groups by analyzing similarity of characteristics (similar to classification, but the analyzed data has no outcome value); used to select targets for promotions or events; model type: clustering.
- Descriptive modeling / Association: finds relationships among items present in the data; used for cross-selling of products or services, shelf layout, and fraud detection; model type: pattern analysis.
- Descriptive modeling / Sequence: association rules with the concept of time applied; used for target marketing and personalized services; model type: sequential pattern analysis.
- Descriptive modeling / Link analysis: identifies relationships among data values; used for social network analysis and sentiment analysis; model types: Social Network Analysis, Relational Content Analysis.
Page 53

4.2 Classification Rules 4 Data Mining
Classification rules:
- Decision trees: classify by splitting vertically on each variable
  - ID3 (Iterative Dichotomiser 3): binary splits on nominal predictors
  - C4.5 (successor of ID3): multiway splits on nominal predictors
  - CART (Classification And Regression Tree): splits using the Gini index (for categorical target variables) or variance reduction (for continuous target variables)
  - CHAID (CHi-squared Automatic Interaction Detector)
- Neural networks: apply weights to each variable and maximize the classification rate (minimize the error rate); multi-layer perceptron
- Genetic algorithms
- Linear classifiers
  - Logistic regression: can use nominal-scale independent variables (gender, race, etc.)
  - Naive Bayes classifier: a simple probabilistic classifier based on Bayes' theorem (e.g. spam mail)
  - Support vector machines: maximize the classification rate while maximizing the margin that separates the classes
- Kernel estimation: k-nearest neighbor (kNN), one of the simplest machine learning methods
- Bayesian networks
Page 54

4.2.1 Decision-tree 4.2 Classification Rules
Decision-tree classification: the entropy index used by the C4.5 algorithm comes from the likelihood-ratio test statistic for a multinomial distribution; a node is split on the predictor, and the split point, that makes this index smallest.
Sample data (Age: numeric, Car Type: categorical):
Tid | Age | Car Type | Class
0 | 23 | Family | High
1 | 17 | Sports | High
2 | 43 | Sports | High
3 | 68 | Family | Low
4 | 32 | Truck | Low
5 | 20 | Family | High
Page 55
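As a hedged illustration (the standard entropy formula and my own worked numbers, not quoted from the slide), the entropy of a node with class proportions p_1, ..., p_k, and its value at the root of the sample table above (4 High, 2 Low), are:

$$\mathrm{Entropy} = -\sum_{i=1}^{k} p_i \log_2 p_i, \qquad -\tfrac{4}{6}\log_2\tfrac{4}{6} - \tfrac{2}{6}\log_2\tfrac{2}{6} \approx 0.918.$$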

4.2.2 Neural Network Classification Neural Network Page 56

4.2.3 Kernel Estimation 4.2 Classification rules Kernel Estimation Page 57

4.3 Clustering Rules 4 Data Mining
Clustering rules (a minimal k-means sketch follows this list):
- Connectivity-based methods
  - Hierarchical clustering: linkage clustering
  - CURE (Clustering Using REpresentatives): handles non-spherical clusters
  - Chameleon: clustering with a dynamic model; finds clusters of arbitrary shape better than CURE and DBSCAN; O(n²) for high-dimensional data
- Centroid-based methods: k-means (mean), k-medoids (medoid/median), k-modes (mode)
- Distribution-based methods: EM (Expectation Maximization)
- Density-based methods
  - OPTICS, using an R-tree index: ordering points to identify the clustering structure
  - DBSCAN (Density-Based Spatial Clustering of Applications with Noise): density-based
  - DENCLUE (DENsity-based CLUstEring): uses density distribution functions
- Grid-based methods
  - STING (STatistical INformation Grid): uses a statistical information grid
  - WaveCluster: uses the wavelet transform
  - CLIQUE (Clustering In QUEst): clustering of high-dimensional spaces
Page 58
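A minimal one-dimensional k-means (Lloyd's algorithm) sketch to make the centroid-based idea concrete; this is my own toy code, not Mahout's implementation, and it simply reuses the sample values from the variance example:

    import java.util.Arrays;
    import java.util.Random;

    public class KMeans1D {
        public static double[] cluster(double[] data, int k, int iterations) {
            Random rnd = new Random(42);
            double[] centroids = new double[k];
            for (int c = 0; c < k; c++) {
                centroids[c] = data[rnd.nextInt(data.length)];  // random initial centroids
            }
            int[] assign = new int[data.length];
            for (int it = 0; it < iterations; it++) {
                // Assignment step: each point goes to its nearest centroid
                for (int p = 0; p < data.length; p++) {
                    int best = 0;
                    for (int c = 1; c < k; c++) {
                        if (Math.abs(data[p] - centroids[c]) < Math.abs(data[p] - centroids[best])) {
                            best = c;
                        }
                    }
                    assign[p] = best;
                }
                // Update step: move each centroid to the mean of its assigned points
                double[] sum = new double[k];
                int[] count = new int[k];
                for (int p = 0; p < data.length; p++) {
                    sum[assign[p]] += data[p];
                    count[assign[p]]++;
                }
                for (int c = 0; c < k; c++) {
                    if (count[c] > 0) centroids[c] = sum[c] / count[c];
                }
            }
            return centroids;
        }

        public static void main(String[] args) {
            double[] data = {25, 32, 27, 45, 39, 21, 51, 46};
            System.out.println(Arrays.toString(cluster(data, 2, 10)));  // two cluster centers
        }
    }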

4.3.1 Connectivity based 4.3 Clustering rules Connectivity based Page 59

4.3.2 Centroid based 4.3 Clustering rules Centroid based Page 60

4.3.3 Distribution based 4.3 Clustering rules Distribution based Page 61

4.3.4 Density based 4.3 Clustering rules Density based Page 62

4.4 Association Rules 4 Data Mining
Association rules:
- Apriori family: Apriori algorithm, AprioriTid algorithm, AprioriHybrid algorithm
- Eclat algorithm (depth-first search algorithm)
- RElim algorithm (Recursive Elimination algorithm)
- Pattern-growth family: FP-Growth algorithm (Frequent Pattern Growth algorithm)
Sequential patterns:
- Apriori family: AprioriAll, AprioriSome, DynamicSome, GSP (Generalized Sequential Patterns)
- Pattern-growth family: FreeSpan (Frequent pattern-projected Sequential pattern mining), PrefixSpan (Prefix-projected Sequential pattern mining)
Page 63

4.4.1 Apriori Algorithm 4.4 Association Rules
Apriori algorithm: number of candidate combinations for 1,000 items by combination size:
- size 2: 499,500
- size 4: 41,417,124,750
- size 6: 1,368,173,298,991,500
- size 8: 24,115,080,524,699,400,000
- size 10: 263,409,560,461,970,000,000,000
Page 64

4.4.2 FP-GROWTH Algorithm 4.4 Association Rules PARALLEL FP-GROWTH : Mahout Figure 2: The overall PFP framework, showing five stages of computation. Page 65

4.4.2 FP-GROWTH Algorithm 4.4 Association Rules PARALLEL FP-GROWTH : Mahout Page 66

4.4.2 FP-GROWTH Algorithm 4.4 Association Rules PARALLEL FP-GROWTH : Mahout Page 67

4.5 Link Analysis 4 Data Mining
Analysis of unstructured data:
- Content analysis: a method for raising the value of business intelligence and business strategy using content at many levels, structured and unstructured, generated in digital environments; it aims to discover trends and patterns for better decision-making.
- Text analytics: transforms unstructured data into a form usable for analysis, via linguistic and statistical techniques and natural language processing, in order to extract meaningful information.
- Real-time analytics: uses all the data needed for an analysis to deliver knowledge quickly and in good time, at the moment the user performs the analysis; the emphasis is on delivering results to the user at the right time rather than on accuracy or reliability.
- Web mining: analyzing information collected on the Internet with data mining methods.
- Social mining: collecting social media posts and the relationships between users to analyze consumer tendencies and patterns for sales and marketing, and to grasp changes in public opinion and social trends.
- Reality mining: collecting and analyzing information on social behavior to predict people's behavior patterns; infers human relationships and behavior from information generated in real life.
Page 68

4.5 Link Analysis 4 Data Mining
Uses of content analysis (source: Ole Holsti, Duke University):
- Make inferences about the antecedents of communications
  - Source / Who: answer questions of disputed authorship
  - Encoding process / Why: secure political & military intelligence; analyse traits of individuals; infer cultural aspects & change; provide legal & evaluative evidence
- Describe & make inferences about the characteristics of communications
  - Channel / How: analyse techniques of persuasion; analyse style
  - Message / What: describe trends in communication content; relate known characteristics of sources to the messages they produce; compare communication content to standards
  - Recipient / To whom: relate known characteristics of audiences to messages produced for them; describe patterns of communication
- Make inferences about the consequences of communications
  - Decoding process / With what effect: measure readability; analyse the flow of information; assess responses to communications
Page 69

4.5 Link Analysis 4 Data Mining
Link analysis: Social Network Analysis and Relational Content Analysis represent social structure as a network of nodes and the links connecting them, and quantify the interactions between them.
Techniques:
- Natural language processing: morphological analysis, syntactic parsing
- Algorithms: graph theory, matrix/vector operations, ANOVA, etc.
- Visualization
Application areas:
- Biology: breaking chains of infection (infected and healthy people = nodes, infectious contact = links)
- Business: Amazon's book recommendation service (buyers/books = nodes, transactions = links)
- Urban planning: road design in city planning (cities = nodes, roads = links)
- Politics: dismantling terrorist organizations (terrorists = nodes, contacts = links)
- Computing: the Internet (computers and people = nodes, communication lines = links)
- Sociology: marriage networks among Korean chaebol families (family members = nodes, marriages = links)
- Management: activating knowledge management (individual knowledge = nodes, knowledge sharing = links)
Page 70

4.5.1 Natural Language Processing 4.5 Link Analysis
What is natural language processing? Research aimed at enabling machines to understand and generate human language.
Architecture of a natural language processing system: input sentence → morphological analyzer → syntactic parser (using a grammar and dictionary) → semantic analyzer → discourse analyzer → sentence generator (using a generation dictionary and generation grammar) → output sentence, all drawing on various knowledge bases.
Page 71

4.5.1 Natural Language Processing 4.5 Link Analysis
Eojeol, word, and morpheme (Korean linguistic units):
- Eojeol (어절): the unit delimited by spacing. A two-word eojeol is a substantive (or a predicate or adverb) plus a particle; a one-word eojeol is a substantive, predicate, modifier, or interjection.
- Morpheme (형태소): the smallest unit of language that carries meaning. By independence: free morphemes (substantives, modifiers, interjections) and bound morphemes (particles, stems, endings, affixes). By meaning and function: lexical morphemes (substantives, predicate roots, modifiers, interjections) and grammatical (formal) morphemes (particles, endings, affixes).
- Word (단어): a unit that can stand alone, or one easily separated from a free morpheme. Standing alone: substantives, modifiers, interjections; easily separated from a free morpheme: particles; bound morphemes combining to stand alone: predicates.
Page 72

4.5.1 Natural Language Processing 4.5 Link Analysis
Grammars and parsing:
- Grammar: the structural properties of sentences expressed as rules.
- Parser: the process that finds the structure of a sentence using the grammar.
The syntactic structure of a sentence forms a tree: several morphemes combine into phrases, and the way those phrases combine forms the tree. Example: S → NP VP; NP → N ("John"); VP → V NP ("ate"); NP → ART N ("the apple").
Page 73

4.5.1 Natural Language Processing 4.5 Link Analysis
Morphological analysis. Example sentence: "대기업의 불공정거래로 벤처나 중소기업이 성장하지 못하고 국가경제에 악순환을 불러오고 있다." (Kwanhun debate, March 22, 2011) - roughly, "Unfair trading by large corporations keeps ventures and small and medium-sized enterprises from growing and brings a vicious cycle to the national economy."
Sejong Project part-of-speech tag set:
- NNG general noun, NNP proper noun, NNB bound noun, NP pronoun, NR numeral
- VV verb, VA adjective, VX auxiliary predicate, VCP positive copula, VCN negative copula
- MM determiner, MAG general adverb, MAJ conjunctive adverb, IC interjection
- JKS subject case particle, JKC complement case particle, JKG adnominal case particle, JKO object case particle, JKB adverbial case particle, JKV vocative case particle, JKQ quotative case particle, JX auxiliary particle, JC conjunctive particle
- EP pre-final ending, EF sentence-final ending, EC connective ending, ETN nominalizing ending, ETM adnominalizing ending
- XPN noun prefix, XSN noun-derivational suffix, XSV verb-derivational suffix, XSA adjective-derivational suffix, XR root
- SF period/question mark/exclamation mark, SP comma/middle dot/colon/slash, SS quotation marks/brackets/dash, SE ellipsis, SO attachment marks (tilde, hidden, omitted), SL foreign word, SH Chinese characters, SW other symbols (logic/math symbols, currency symbols, etc.)
- NF noun-like unknown, NV predicate-like unknown, SN number, NA unanalyzable
Page 74

4.5.1 Natural Language Processing 4.5 Link Analysis
Syntactic analysis of the same example sentence ("대기업의 불공정거래로 벤처나 중소기업이 성장하지 못하고 국가경제에 악순환을 불러오고 있다.", Kwanhun debate, March 22, 2011): the slide shows the parser output, with each eojeol labelled by its syntactic role (subject, object, adverbial, adnominal, predicate) and segmented into morphemes, e.g. 있다 = 있 (stem) + 다 (ending), 악순환을 = 악순환 (noun) + 을 (object particle), 대기업의 = 대기업 (noun) + 의 (adnominal particle).
Page 75

4.5.2 Graph Theory 4.5 Link Analysis
Graph theory:
- Node and link: nodes are people, organisms, objects, or concepts; links are actions (one-way or two-way), relations (friendly or hostile), communication, sales and purchases, and so on.
- Matrix algebra (vectors and matrices)
  - scalar: data consisting of a single value, e.g. Likert 3, female, TOEIC score 849
  - vector: data consisting of a single array, e.g. x = {x1, x2, x3, ...}
  - matrix: values arranged in a two-dimensional rectangle, m (rows) × n (columns); when m = n the matrix is square and has a diagonal
  - matrix operations: addition, subtraction, multiplication, and division between matrices; correlations between matrices, etc.
- Mode of a matrix
  - 1-mode network: interactions within a single set, e.g. an e-mail exchange network among learners 1..n, or a network of links between content items
  - 2-mode network: interactions between different sets, e.g. between learners and content
Page 76

4.5.2 Graph Theory 4.5 Link Analysis
Key measures (a small degree-centrality sketch follows this list):
Node measures:
- Degree centrality: the number of links a node has
- Closeness centrality: the average shortest-path distance between a node and every other node
- Betweenness centrality: the degree to which a node sits on the paths between all other nodes or subgroups and thus smooths the flow of communication between them
- Others: power, effects, eigenvector, status, etc.
Link (network) measures:
- Density / centralization: the ratio of realized links to the total number of possible links
- Cohesion: the total number of links, i.e. the path distance, needed for all nodes in the network to reach one another
- Geodesic distance: the shortest-path distance between all pairs of nodes (cf. closeness centrality)
- Subgroups (component, clique, etc.): identifying subgroups that exist within the network
- Structural similarity (core-periphery, block model, etc.): the degree to which two networks are structurally similar
Page 77
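A small sketch of the first node measure, degree centrality, computed from an undirected adjacency matrix (my own example network, not from the slides):

    import java.util.Arrays;

    public class DegreeCentrality {
        /** Degree of each node = number of links it has (row sums of the adjacency matrix). */
        public static int[] degrees(int[][] adj) {
            int[] deg = new int[adj.length];
            for (int i = 0; i < adj.length; i++) {
                for (int j = 0; j < adj[i].length; j++) {
                    deg[i] += adj[i][j];
                }
            }
            return deg;
        }

        public static void main(String[] args) {
            // Hypothetical 4-node network with links 0-1, 0-2, 1-2, 2-3
            int[][] adj = {
                {0, 1, 1, 0},
                {1, 0, 1, 0},
                {1, 1, 0, 1},
                {0, 0, 1, 0},
            };
            System.out.println(Arrays.toString(degrees(adj)));  // [2, 2, 3, 1]
        }
    }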

4.5.2 Graph Theory 4.5 Link Analysis
Graph theory (example): betweenness centrality and degree centrality (figure).
Page 78

4.5.2 Graph Theory 4.5 Link Analysis Relational Content Analysis Page 79

4.5.2 Graph Theory 4.5 Link Analysis
Relational content analysis of the example sentence "대기업의 불공정거래로 벤처나 중소기업이 성장하지 못하고 국가경제에 악순환을 불러오고 있다." (Kwanhun debate, March 22, 2011): a word network built from the sentence (figure).
Page 80

Page 81