
상길 추
6 years ago

Machine Learning with Python TensorFlow

1. Machine Learning

1.1 Statistics, Machine Learning, and Artificial Intelligence

(1) Definitions of statistics, artificial intelligence, and machine learning
- Statistics (S. M. Ross): Statistics is the art of learning from data.
- Machine learning (A. Samuel): Machine learning is a field of study that gives computers the ability to learn without being explicitly programmed. In other words, it is a set of techniques that give a computer the ability to learn from data without an explicit program.
- Artificial intelligence (S. Russell): Artificial intelligence is making computers intelligent.

(2) Probability distributions
- Binomial distribution
X: the number of successes in n Bernoulli trials, X ~ B(n, p)
$P(X = x) = \binom{n}{x} p^x (1 - p)^{n - x}, \quad x = 0, 1, 2, \ldots, n$

(Example: Binomial) The probability of getting 2 heads when a coin is tossed 3 times:
X ~ B(3, 0.5)
$P(X = 2) = \binom{3}{2} (0.5)^2 (1 - 0.5)^{3 - 2} = 0.375$
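The binomial probability above can also be checked in plain Python. The following is a minimal sketch, not part of the original slides; the function name `binomial_pmf` is a hypothetical helper, and `math.comb` assumes Python 3.8 or later.

```python
import math


def binomial_pmf(x, n, p):
    """Return P(X = x) for X ~ B(n, p) (hypothetical helper name)."""
    return math.comb(n, x) * p**x * (1 - p) ** (n - x)


# Coin tossed 3 times, probability of exactly 2 heads.
prob_two_heads = binomial_pmf(2, 3, 0.5)
print(prob_two_heads)  # 0.375

# The probabilities over x = 0..n should add up to 1.
total = sum(binomial_pmf(k, 3, 0.5) for k in range(4))
print(total)
```

Writing the formula as a small function keeps the three factors (the binomial coefficient, p^x, and (1 - p)^(n - x)) visible, which makes it easy to compare against the definition term by term.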

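dbinom(2, 3, 0.5)  # x, n, p
[1] 0.375

- Poisson distribution
X: the number of occurrences of an event, X ~ Poisson(m), m > 0
$P(X = x) = \frac{e^{-m} m^x}{x!}, \quad x = 0, 1, 2, \ldots$

(Example: Poisson) The numbers of students who were late to the programming lecture up to last week are as follows:
3, 5, 4, 3, 2
Using these results, find the probability that 3 students are late to today's programming lecture.
m = mean(c(3, 5, 4, 3, 2))
dpois(3, m)
[1] 0.2186172

(note) Document data obtained after big-data preprocessing (text from the Web, SNS, patents, papers, ...) are in many cases count data.

1.2 Learning

(1) Supervised learning (with teacher): both the input variables (explanatory variables) and the output variable (response variable) are known.
- Classification: (age, salary / credit status) → credit status prediction
- Regression: (advertising cost, temperature / ice cream sales) → sales prediction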


(2) Unsupervised learning (without teacher): only the input variables (explanatory variables) are known.
- Clustering: (age, salary) → credit status prediction

(3) Classification techniques
- Discriminant analysis
- Decision tree
- Random forest
- Support vector machine (SVM)

(4) Regression techniques
- Linear regression
- Logistic regression
- Lasso (least absolute shrinkage and selection operator) regression

(5) Clustering techniques
- K-means clustering
- Silhouette width (for choosing the optimal number of clusters)

- Fuzzy clustering
- Hierarchical methods
- K-medoids clustering

(6) UCI machine learning repository

1.3 Model Evaluation

(1) Accuracy (source: J. Han et al., 2012)
The larger the accuracy, the better the model.

(2) MSE (mean squared error)
$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$
$Y_i$: actual value, $\hat{Y}_i$: predicted value, n: size of the test data
The smaller the MSE, the better the model.

(3) AIC (Akaike's Information Criterion)

: an index for finding a model that predicts well; the smaller the AIC, the better the model.
$\mathrm{AIC} = -2 \times (\text{maximum log-likelihood}) + 2 \times (\text{number of parameters})$

1.4 Ensemble Learning (source: J. Han et al., 2012)

(1) Ensemble learning
- Learning a weighted combination of various base models:
$f(y \mid x, \theta) = \sum_{m \in M} w_m f_m(y \mid x)$
- Uses a combination of several models to increase accuracy
- Combines the results of repeated training runs to improve the performance of a single model (bootstrap method)
- Committee learning, voting

(2) Ensemble methods
- Bagging: averaging the prediction over a collection of classifiers
- Boosting: weighted vote with a collection of classifiers
- Ensemble: combining a set of heterogeneous classifiers

1.5 Data Scaling

(1) Normalization, range [0, 1]

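$\frac{x - \min(x)}{\max(x) - \min(x)}$

(2) Standardization, range $(-\infty, \infty)$, in practice mostly within [-3, 3]
$\frac{x - \mathrm{mean}(x)}{\mathrm{sd}(x)}$

1.6 R and Python for Machine Learning

(1) R data language
- An object-oriented programming language developed by Robert Gentleman and Ross Ihaka at the University of Auckland
- Open source and free to use
- Provides the latest analysis techniques through a wide range of packages
- Convenient visualization features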

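- Has a vast collection of data analysis functions

(2) Differences between R and Python
- R is strong in data analysis; Python is strong in software (service) development.
- Software development (web services, etc.) is also possible in R, but it is less efficient than in Python.
- Compared with other programming languages such as C/C++ and Java, Python has well-developed data analysis capabilities (TensorFlow, NumPy, etc.).

(3) Using functions and methods
- R: function(object)
- Python: object.function

(4) Base R and RStudio
- Base R
- RStudio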

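2. Installing Python TensorFlow and Basic Usage

2.1 Introduction to Python and Installation

(1) An object-oriented programming language created by Guido van Rossum in 1990
(2) An interpreted RAD (rapid application development) language that can be used on various platforms (Windows, Linux, Mac)
(note) Interpreted language: a language that interprets and runs source code one line at a time, so results can be checked immediately
(3) An open-source language with a vast collection of libraries
(4) When more speed is needed, or part of a program must be kept private, part of the Python code can be written in C/C++ and then called from Python.
(5) On Windows, Python can be downloaded and installed; on Linux it is already installed in most environments (32-bit / 64-bit selectable).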

: Run "Download Windows x86-64 web-based installer"
: Select "Add Python 3.7 to PATH" and click "Install Now"

(6) Running Python
Programs → Python 3.7 → IDLE

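2.2 Introduction to Anaconda and Installation

(1) About Anaconda
- A development platform (environment) that bundles open-source software for Python-based programming
- Anaconda comes with various built-in packages (libraries) for data analysis, such as numpy and matplotlib.

(2) Installing Anaconda
Download and install the Anaconda version that matches the Python version and the bitness of your computer.
: Click "64-Bit Graphical Installer (631 MB)"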

: Select Next
: Select I Agree
: Select Just Me

: Select Next
: Select Next
: Select Skip


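: Select Finish

(3) Using Spyder and the IPython console included in Anaconda is convenient because program output can be viewed on screen.
- The Spyder editor shows usage hints for programming and also provides syntax checking.

2.3 Introduction to TensorFlow and Installation

(1) Starting the installation
Use the Anaconda Prompt: Windows Programs → Anaconda3 (64-bit) → Anaconda Prompt

(2) Update conda itself
conda update -n base conda

(3) Update the installed Python packages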

conda update --all
Enter y at the prompt "Proceed ([y]/n)?"

(4) Install TensorFlow
conda install tensorflow
Enter y at the prompt "Proceed ([y]/n)?"

(5) Verify the installation
- Run Programs → Anaconda3 (64-bit) → Spyder, then run "import tensorflow as tf" in the IPython Console pane. If no error message appears, the installation succeeded.

3. Python Programming Basics

- Indexing: Python indexing starts at 0. (note) R indexing starts at 1.
- Upper and lower case are distinguished.
- Indentation matters (block structure).

3.1 Python data types

(1) Numbers
>>> a = 3
>>> b = 4
>>> a + b
7
>>> a * b
12
>>> a / b
0.75

(2) Strings
>>> x = "Python"
>>> x
'Python'
>>> a = "Life is too short, You need Python"
>>> a[0]
'L'
>>> a[12]
's'
>>> a[-1]
'n'

>>> a[0:4]
'Life'

# Spyder: string formatting
num = 5
st = "two"
print("I ate %d apples. So I was sick for %s days." % (num, st))

# Counting characters in a string (count)
>>> a = "hobby"
>>> a.count('b')
2

(3) List
# List indexing
>>> a = [1, 2, 3]
>>> a[0]
1
>>> a[0] + a[2]
4
>>> a[-1]
3

# Sorting a list (sort)
>>> a = [1, 4, 3, 2]
>>> a.sort()
>>> a
[1, 2, 3, 4]

# Inserting an element into a list (insert)
>>> a = [1, 2, 3]
>>> a.insert(0, 4)
>>> a
[4, 1, 2, 3]

# Removing and returning the last list element (pop)
>>> a = [1, 2, 3]
>>> a.pop()
3
>>> a
[1, 2]

(4) Tuple
# Indexing
>>> t1 = (1, 2, 'a', 'b')
>>> t1[0]
1
>>> t1[3]
'b'
# A list can be modified, but a tuple cannot.

(5) Dictionary
# Dictionary, key: value
# Spyder
dic1 = {'name': 'pey', 'phone': ' ', 'birth': '1118'}
a = dic1['name']
print(a)
dic2 = {1: 23, 2: 14, 6: 89, 'x': 78}
b = dic2[2]
c = dic2['x']
print(b)
print(c)

Output:
pey
14
78

# Adding data to a dictionary
>>> a = {1: 'a'}
>>> a[2] = 'b'
>>> a
{1: 'a', 2: 'b'}

3.2 Python control statements

(1) Conditions: if, elif
x = 10
if x > 10:
    print("x is large")
else:
    print("x is small")

- Basic structure of the if statement
if condition:
    statements to run
    ...
else:
    statements to run
    ...

# Testing multiple conditions: elif
score = 88
if score >= 90:
    print("high")
elif score >= 80:
    print("middle")
else:
    print("low")

(2) Loops: while, for
# Sum of the integers from 1 to 10 using a while loop
sum = 0
i = 1
while i <= 10:
    sum = sum + i
    i = i + 1
print(sum)

# Sum of the integers from 1 to 10 using a for loop
sum = 0
for i in range(1, 11):
    sum = sum + i
print(sum)

3.3 Loading external data

(1) Example data: cars
Speed and Stopping Distances of Cars (M. Ezekiel), variables speed and dist

# Loading external data: file
import numpy as np
data_file_name = 'e:/data/python/cars.txt'
dat = np.genfromtxt(data_file_name, dtype='float32', skip_header=True)
print(np.shape(dat))
speed = dat[:, 1]
dist = dat[:, 2]

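print(speed)
print(dist)

(note1) skip_header=True or skip_header=1: specify this when the first row of the data holds the variable names.
(note2) np.shape(dat): gives the number of rows and columns of the data object.

# Loading external data: data on the web
import pandas as pd
target_url = ("...")
# Load the iris data as a pandas data frame
# (the prefix argument is only available in pandas versions before 2.0)
iris_data = pd.read_csv(target_url, header=None, prefix="X")
print(iris_data)
print(iris_data.X4)
summary = iris_data.describe()
print(summary)
Sepal_Length = list(iris_data.X0)
print(Sepal_Length)

EXAMPLE
Load the Japan credit data and print the values of each column.

3.4 Working with pandas and numpy

(1) pandas/numpy
- Extension modules that provide advanced data analysis, numerical computation, and related features
- Fast because they are written in C
- numpy: provides multidimensional arrays and high-level mathematical functions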

- pandas: a data analysis library; reads CSV files and other sources as data and converts them to the desired data format

(2) Data frames
- DataFrame: the basic data structure used in pandas
- A 2-dimensional list is passed as the argument when defining a data frame.
import pandas as pd
a = pd.DataFrame([
    [10, 20, 30],
    [40, 50, 60],
    [70, 80, 90]
])
print(a)

# Use Series for 1-dimensional data
import pandas as pd
import numpy as np
s = pd.Series([1.0, 3.0, 5.0, 7.0, 9.0])
print(s)  # the data type is printed as well
m = np.mean(s)
print(m)

(3) Extracting the desired data
# From a dictionary of 1-dimensional lists, print the data of the desired column using its key
import pandas as pd
# Create a data frame of weight, height, and type
tbl = pd.DataFrame({
    "weight": [80.0, 70.4, 65.5, 45.9, 51.2],
    "height": [170, 180, 155, 143, 154],
    "type": ["f", "n", "n", "t", "t"]
})
# Extract the weight column
print("Weight list")
print(tbl["weight"])
# Extract the weight and height columns
print("Weight and height list")
print(tbl[["weight", "height"]])

# To extract values at the desired positions, use slices as with a Python list
import pandas as pd
tbl = pd.DataFrame({
    "weight": [80.0, 70.4, 65.5, 45.9, 51.2],
    "height": [170, 180, 155, 143, 154],
    "type": ["f", "n", "n", "t", "t"]
})
print("tbl[2:4]\n", tbl[2:4])
print("tbl[3:]\n", tbl[3:])

# Extracting rows that satisfy a condition
import pandas as pd
tbl = pd.DataFrame({
    "weight": [80.0, 70.4, 65.5, 45.9, 51.2, 72.5],
    "height": [170, 180, 155, 143, 154, 160],
    "gender": ["f", "m", "m", "f", "f", "m"]
})
print("Weight and height list")
print(tbl[["weight", "height"]])
print("--- height of 160 or more")
print(tbl[tbl.height >= 160])
print("--- gender is m")
print(tbl[tbl.gender == "m"])

# Sorting
import pandas as pd
tbl = pd.DataFrame({
    "weight": [80.0, 70.4, 65.5, 45.9, 51.2, 72.5],
    "height": [170, 180, 155, 143, 154, 160],
    "gender": ["f", "m", "m", "f", "f", "m"]
})
print("--- sorted by height")
print(tbl.sort_values(by="height"))
print("--- sorted by weight")
print(tbl.sort_values(by="weight", ascending=False))

# Transposition
import pandas as pd
tbl = pd.DataFrame([
    ["A", "B", "C"],
    ["D", "E", "F"],
    ["G", "H", "I"]
])

print(tbl)
print("------")
print(tbl.T)

(4) Data manipulation
import numpy as np
# Create 10 values of dtype float32
v = np.zeros(10, dtype=np.float32)
print(v)
# Create 10 consecutive values of dtype uint64
v = np.arange(10, dtype=np.uint64)
print(v)
# Multiply the values of v by 3
v *= 3
print(v)
# Compute the mean of v
print(v.mean())

# Data normalization
import pandas as pd

# Create a data frame of weight, height, and gender
tbl = pd.DataFrame({
    "weight": [80.0, 70.4, 65.5, 45.9, 51.2, 72.5],
    "height": [170, 180, 155, 143, 154, 160],
    "gender": ["f", "m", "m", "f", "f", "m"]
})
# Normalize height and weight
# Find the maximum and minimum values
def norm(tbl, key):
    c = tbl[key]
    v_max = c.max()
    v_min = c.min()
    print(key, "=", v_min, "-", v_max)
    tbl[key] = (c - v_min) / (v_max - v_min)

norm(tbl, "weight")
norm(tbl, "height")
print(tbl)

(5) Converting to numpy
When a machine learning library does not support pandas data frames, convert the data to numpy format before using it.

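EXAMPLE
[1] Load the Japan credit data into a pandas data frame and compute the mean and standard deviation of each column (variable).
[2] Load the Japan credit data into a pandas data frame and apply normalization and standardization to each column (variable), excluding the last column.

3.5 Additional packages

(1) sklearn
- scikit-learn
- Includes a variety of datasets
- Includes data preprocessing, supervised and unsupervised learning algorithms, and evaluation methods

(2) scipy (pronounced "sigh pie")
- Supports scientific and technical computing
- Provides learning algorithms and optimization methods

(3) statsmodels
- Provides statistical analysis including estimation and testing (regression, time-series analysis, ...)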

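4. Introduction to TensorFlow

4.1 Basic usage

(1) TensorFlow
- A machine learning library released by Google
- TensorFlow is driven from Python.
- A directed graph in which nodes (circles) represent functions/operations and edges (arrows) represent tensors (numbers, matrices, arrays)
- TensorFlow builds a program from tensors and flows, and obtains results by creating a Session and calling run.
- Structure of a TensorFlow program: define the basic operations → run the defined data flow graph in a session

(2) Checking the TensorFlow version

(3) Simple TensorFlow usage

Addition 1
import tensorflow as tf
# Define constants
a = tf.constant(3)
b = tf.constant(5)
# Define the computation: TensorFlow does not perform the addition here; it only defines the addition as a computation
c = a + b
# What is stored in c is not the result of the addition (a number) but a data flow graph (an object)
# Run the session: to run it, pass the data flow graph as the argument of the run() method
# (tf.Session belongs to the TensorFlow 1.x API used throughout these notes)
sess = tf.Session()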

ret = sess.run(c)
print(ret)

Output:
8

Addition 2
import tensorflow as tf
# Define constants
a = tf.constant(2)
b = tf.constant(3)
c = tf.constant(4)
# Define the operations
calc1_op = a + b * c
calc2_op = (a + b) * c
# Start the session
sess = tf.Session()
res1 = sess.run(calc1_op)
print(res1)
res2 = sess.run(calc2_op)
print(res2)

(4) Computation graph
- Computation graph: machine learning work is carried out by building and running operations that interact with each other.
- TensorFlow operations are organized as a data flow graph.
- Node: arithmetic operator
- Edge: tensor, multidimensional data, operand
- Session: session.run obtains output values from the graph.

(5) TensorFlow graph and code

import tensorflow as tf
a = tf.constant(2, name="input_a")
b = tf.constant(3, name="input_b")
c = tf.multiply(a, b, name="mul_c")
d = tf.add(a, b, name="add_d")
e = tf.add(c, d, name="add_e")
sess = tf.Session()
ret_e = sess.run(e)
print("e=", ret_e)
ret_c = sess.run(c)
print("c=", ret_c)

(6) Representing variables in TensorFlow
import tensorflow as tf
# Define constants
a = tf.constant(120, name="a")
b = tf.constant(130, name="b")
c = tf.constant(140, name="c")
# Define a variable
v = tf.Variable(0, name="v")
# Define the data flow graph
calc_op = a + b + c
assign_op = tf.assign(v, calc_op)  # assign calc_op to v
# Run the session

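sess = tf.Session()
sess.run(assign_op)
# Print the contents of v
print(sess.run(v))

(7) TensorFlow placeholder
- Reserves a slot into which a value will be put later
- Rather than being initialized at declaration, it is declared first and the value is passed afterwards.
- Data must be supplied at run time. The data is not assigned as when passing a constant; instead, another tensor is mapped onto the placeholder.
- Parameters of placeholder: placeholder(dtype, shape=None, name=None)
  dtype: data type
  shape: shape of the input data (scalar, multidimensional array, ...); default None
  name: name given to the placeholder (optional); default None

import tensorflow as tf
# Define a placeholder
a = tf.placeholder(tf.int32, [3])  # array holding 3 integers
# Define an operation that doubles every value in the array
b = tf.constant(2)
x2_op = a * b
# Start the session
sess = tf.Session()
# Put values into the placeholder and run (using feed_dict)
r1 = sess.run(x2_op, feed_dict={a: [1, 2, 3]})
print(r1)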

r2 = sess.run(x2_op, feed_dict={a: [10, 20, 10]})
print(r2)

import tensorflow as tf
a = tf.placeholder(tf.int32, [None])  # None: an array of any size, not fixed in advance
# Define an operation that multiplies every value in the array by 10
b = tf.constant(10)
x10_op = a * b
# Start the session
sess = tf.Session()
# Put values into the placeholder and run
r1 = sess.run(x10_op, feed_dict={a: [1, 2, 3, 4, 5]})
print(r1)
r2 = sess.run(x10_op, feed_dict={a: [10, 20]})
print(r2)

4.2 A basic TensorFlow program

(1) Python basics and TensorFlow
# Hi, Python! - general Python
x1 = "Hi,"
x2 = " Python"
y = x1 + x2
print(y)

# Hi, Python! - TensorFlow
import tensorflow as tf
x1 = tf.constant("Hi,")
x2 = tf.constant(" Python")
y = x1 + x2
with tf.Session() as sess:
    ret = sess.run(y)
    print(ret)

# The following code prints 10
x = 1
y = x + 9
print(y)

# Produce the same result with TensorFlow
import tensorflow as tf
x = tf.constant(1)
y = tf.Variable(x + 9)
model = tf.global_variables_initializer()  # call the variable initialization function
# Use the model created above to evaluate the variable y, then print the result
with tf.Session() as session:
    session.run(model)  # the value of y is not computed until the session runs
    print(session.run(y))

(2) TensorFlow programming: multiplying the integers a and b
import tensorflow as tf
a = tf.placeholder("int32")  # define the basic data structure called placeholder
b = tf.placeholder("int32")
y = tf.multiply(a, b)  # returns the product of the integers a and b

sess = tf.Session()  # create a session to manage the execution flow
print(sess.run(y, feed_dict={a: 2, b: 5}))  # print the result of the operation

(3) Tensor data structure
- tensor: the basic data structure of TensorFlow; it travels along the edges of the data flow graph and is a structure made of multidimensional arrays or lists.
- A tensor is described by three parameters: rank, shape, and type.
  rank: the dimension of the tensor; 1 = vector, 2 = matrix, ..., N = N-dimensional array
  shape: the number of rows and columns of the tensor
  type: the data type of the tensor

(4) 1-dimensional tensor
# Create a 1-dimensional tensor with a numpy array
import numpy as np
tensor_1d = np.array([1.3, 1, 4.0, 23.99])
print(tensor_1d)
print(tensor_1d[0])
print(tensor_1d[2])
print(tensor_1d.ndim)   # query the rank
print(tensor_1d.shape)  # query the shape
print(tensor_1d.dtype)  # query the type

# Convert to a TensorFlow tensor
import tensorflow as tf
import numpy as np
tensor_1d = np.array([1.3, 1, 4.0, 23.99])
tf_tensor = tf.convert_to_tensor(tensor_1d, dtype=tf.float64)
with tf.Session() as sess:
    print(sess.run(tf_tensor))
    print(sess.run(tf_tensor[0]))
    print(sess.run(tf_tensor[2]))
# convert_to_tensor: converts various Python objects, such as numpy arrays, Python lists, and Python scalars, to tensor form

(5) 2-dimensional tensor
# Using matrices
import tensorflow as tf
import numpy as np
tensor_2d = np.array([(1, 2, 3, 4), (4, 5, 6, 7), (8, 9, 10, 11), (12, 13, 14, 15)])
print(tensor_2d)
print(tensor_2d[3][3])
print(tensor_2d[0:2, 0:2])
tf_tensor = tf.convert_to_tensor(tensor_2d, dtype=tf.float64)
with tf.Session() as sess:
    print(sess.run(tf_tensor))

# Working with tensors
import tensorflow as tf
import numpy as np
matrix1 = np.array([(2, 2, 2), (2, 2, 2), (2, 2, 2)], dtype='int32')
matrix2 = np.array([(1, 1, 1), (1, 1, 1), (1, 1, 1)], dtype='int32')
print("matrix1 =")
print(matrix1)
print("matrix2 =")
print(matrix2)
# matrix1 = tf.constant(matrix1)
# matrix2 = tf.constant(matrix2)
matrix_product = tf.matmul(matrix1, matrix2)

matrix_sum = tf.add(matrix1, matrix2)
with tf.Session() as sess:
    result1 = sess.run(matrix_product)
    result2 = sess.run(matrix_sum)
print("matrix1*matrix2 =")
print(result1)
print("matrix1+matrix2 =")
print(result2)

EXAMPLE
Write a program that prints the result of the following matrix operation.
(solution)
import tensorflow as tf
import numpy as np
matrix1 = np.array([(1, 2), (3, 4)], dtype='int32')
matrix2 = np.array([(5, 6), (7, 9)], dtype='int32')
matrix3 = np.array([(2, 1), (2, 4)], dtype='int32')
matrix_product = tf.matmul(matrix1, matrix2)
matrix_sum = tf.add(matrix_product, matrix3)
with tf.Session() as sess:
    result1 = sess.run(matrix_product)
    result2 = sess.run(matrix_sum)
print("matrix1*matrix2 =")
print(result1)
print("matrix1*matrix2+matrix3 =")
print(result2)

(6) Random numbers


# Uniform distribution
import tensorflow as tf
import matplotlib.pyplot as plt
uniform = tf.random_uniform([100], minval=0, maxval=1, dtype=tf.float32)
with tf.Session() as session:
    print(uniform.eval())
    plt.hist(uniform.eval(), density=True)  # plot relative frequencies (density replaces the older normed argument)
    plt.show()

# Normal distribution (Gaussian distribution)
import tensorflow as tf
import matplotlib.pyplot as plt
norm = tf.random_normal([10000], mean=0, stddev=2)
with tf.Session() as session:
    print(norm.eval())
    plt.hist(norm.eval(), density=True)
    plt.show()

EXAMPLE
[1] Generate 1,000 random numbers from the binomial distribution with n = 10 and p = 0.5, and draw a histogram of the values.
[2] Generate 1,000 random numbers from the Poisson distribution with λ = 3, and draw a histogram of the values.

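5. Linear Regression Analysis

5.1 Regression analysis: statistics

(1) Multiple linear regression
The problem of determining the relationship among several variables.
$x_1, \ldots, x_r$: independent variables (input variables, explanatory variables)
Y: dependent variable (output variable, response variable)
$\beta_0, \beta_1, \ldots, \beta_r$: regression parameters
$Y = \beta_0 + \beta_1 x_1 + \cdots + \beta_r x_r + e$
e: assumed to be a random variable with mean 0
Another expression of the equation above:
$E[Y \mid x] = \beta_0 + \beta_1 x_1 + \cdots + \beta_r x_r$
$E[Y \mid x]$: the expected value of the response Y given the input variables x
The constants $\beta_0, \beta_1, \ldots, \beta_r$: regression coefficients, estimated from data
- Simple regression: one independent variable
- Multiple regression: several independent variables

Least squares estimation
$Y = \alpha + \beta x + e$: simple linear regression model
A: estimator of $\alpha$
B: estimator of $\beta$
SS (sum of squares): the sum of squared differences between the actual response values and the estimated response values
$SS = \sum_{i=1}^{n} (Y_i - A - B x_i)^2$
$\frac{\partial SS}{\partial A} = -2 \sum_{i=1}^{n} (Y_i - A - B x_i)$
$\frac{\partial SS}{\partial B} = -2 \sum_{i=1}^{n} x_i (Y_i - A - B x_i)$
Setting the partial derivatives above to 0 gives the values of A and B that minimize SS.
$\sum_{i=1}^{n} Y_i = nA + B \sum_{i=1}^{n} x_i$
$\sum_{i=1}^{n} x_i Y_i = A \sum_{i=1}^{n} x_i + B \sum_{i=1}^{n} x_i^2$
$B = \frac{\sum_i (x_i - \bar{x})(Y_i - \bar{Y})}{\sum_i (x_i - \bar{x})^2} = \frac{\sum_i x_i Y_i - n \bar{x} \bar{Y}}{\sum_i x_i^2 - n \bar{x}^2} = \frac{S_{xY}}{S_{xx}}$
$A = \bar{Y} - B \bar{x}$

(2) Evaluating regression performance
- Coefficient of determination
$R^2 = \frac{SSR}{SST}, \quad 0 \le R^2 \le 1$
SST (total sum of squared deviations): $SST = \sum (y_i - \bar{y})^2$
SSR (sum of squares due to regression): $SSR = \sum (\hat{y}_i - \bar{y})^2$
SSE (sum of squared errors): $SSE = \sum (y_i - \hat{y}_i)^2$
$SST = SSR + SSE$
: the larger the coefficient of determination, the greater the explanatory power of the model.

5.2 Regression analysis with R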


5.3 Regression analysis with TensorFlow

# Learning (estimating) the regression coefficients
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
data_file_name = 'l:/data/python/cars.txt'
dat = np.genfromtxt(data_file_name, dtype='float32', skip_header=True)
speed = dat[:, 1]
dist = dat[:, 2]
X = tf.placeholder("float32")
Y = tf.placeholder("float32")
init_b0 = 0.5
init_b1 = 0.5
b0 = tf.Variable(init_b0)

b1 = tf.Variable(init_b1)
y = b0 + b1 * X                              # predicted value
cost = tf.reduce_mean(tf.square(y - Y))      # mean squared error
opti = tf.train.GradientDescentOptimizer(0.001)
training = opti.minimize(cost)
init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    for i in range(0, 5000):
        sess.run(training, feed_dict={X: speed, Y: dist})
        if i % 100 == 0:
            cost_out = sess.run(cost, feed_dict={X: speed, Y: dist})
            b0_out = sess.run(b0, feed_dict={X: speed, Y: dist})
            b1_out = sess.run(b1, feed_dict={X: speed, Y: dist})
            print(i, "session is performed.. cost is ", cost_out, ", b1=", b1_out, "b0=", b0_out)
plt.plot(speed, dist, 'o')
plt.show()

(note1) placeholder(dtype, shape=None, name=None)
(note2) The estimated regression coefficients vary with the initial values (init_b0, init_b1).
(note3) The estimated regression coefficients vary with the number of iterations (for i in range(0, 5000):).

Comparison of estimated regression coefficients b0 and b1: statistics (least squares, no iteration) versus machine learning (cost optimization) after 1,000, 5,000, 10,000, 20,000, and 30,000 iterations.
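The comparison between the least-squares estimates and the iteratively learned coefficients can be reproduced on a small scale without TensorFlow. The sketch below is an illustration only: it swaps TensorFlow's optimizer for a hand-written gradient descent on the mean squared error, and the data, the function names `least_squares` and `gradient_descent`, the learning rate, and the step counts are all assumptions chosen for the example rather than values from the cars data.

```python
def least_squares(xs, ys):
    """Closed-form estimates: b1 = S_xY / S_xx, b0 = y_bar - b1 * x_bar."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def gradient_descent(xs, ys, init_b0, init_b1, learning_rate, steps):
    """Minimize the mean squared error of y = b0 + b1 * x by gradient descent."""
    n = len(xs)
    b0, b1 = init_b0, init_b1
    for _ in range(steps):
        residuals = [b0 + b1 * x - y for x, y in zip(xs, ys)]
        grad_b0 = 2.0 * sum(residuals) / n
        grad_b1 = 2.0 * sum(r * x for r, x in zip(residuals, xs)) / n
        b0 -= learning_rate * grad_b0
        b1 -= learning_rate * grad_b1
    return b0, b1


# Hypothetical toy data (not the cars data set).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

exact = least_squares(xs, ys)
print("least squares:", exact)
for steps in (10, 100, 1000, 10000):
    approx = gradient_descent(xs, ys, 0.5, 0.5, 0.01, steps)
    print(steps, "steps:", approx)
```

With few steps the learned coefficients still reflect the starting point (0.5, 0.5); as the number of steps grows they approach the closed-form values, which is the behaviour notes 2 and 3 describe.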

EXAMPLE
[1] Using the following data, find the estimated regression equation $Y = b_0 + b_1 x$.
Advertising cost (X), Sales (Y)
[2] Draw a scatter plot of advertising cost and sales.

5.4 Regression analysis: simulation data
Y = Ax + b

# Data model
import numpy as np
number_of_points = 200
x_point = []
y_point = []
a = 0.22
b = 0.78
for i in range(number_of_points):
    x = np.random.normal(0.0, 0.5)
    y = a * x + b + np.random.normal(0.0, 0.1)
    x_point.append([x])
    y_point.append([y])
import matplotlib.pyplot as plt

plt.plot(x_point, y_point, 'o', label='Input Data')
plt.legend()
plt.show()

# Cost function and gradient descent
import tensorflow as tf
# Define A and B as tf.Variable and assign initial values:
# A is initialized to a random value between -1 and 1, B to 0
A = tf.Variable(tf.random_uniform([1], -1.0, 1.0))
B = tf.Variable(tf.zeros([1]))
# Define the linear relationship between y and x
y = A * x_point + B
# Define the cost function: difference between predicted and actual values -> mean squared error (MSE)
cost_function = tf.reduce_mean(tf.square(y - y_point))
# Minimize cost_function with gradient descent in TensorFlow
optimizer = tf.train.GradientDescentOptimizer(0.5)  # 0.5 is the learning rate
train = optimizer.minimize(cost_function)
# Initialize the variables
model = tf.global_variables_initializer()
# Train the model for 20 iterations in a session so that the values of A and B can be obtained
with tf.Session() as session:
    # Run the model simulation
    session.run(model)
    for step in range(0, 21):
        session.run(train)  # one training step per iteration
        if (step % 5) == 0:
            # Every 5th step, plot the points and the current fit
            plt.plot(x_point, y_point, 'o', label='step = {}'.format(step))
            # Plot the regression line y = Ax + B using the learned A and B
            plt.plot(x_point, session.run(A) * x_point + session.run(B))
            plt.legend()
            plt.show()


EXAMPLE
Using the Iris data, estimate the following regression equation.
$\text{Sepal.Width} = b_0 + b_1 \, \text{Sepal.Length} + b_2 \, \text{Petal.Width}$

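6. Logistic Regression Analysis

6.1 Logistic regression

(1) Logistic regression
- A generalized linear model (GLM) that uses the binomial distribution
- A GLM is a statistical model that requires specifying a probability distribution, a link function, and a linear predictor.

(2) Logistic regression as a GLM
- Probability distribution: binomial distribution
- Link function: logit link function
- Linear predictor: $b_0 + b_1 x_1 + \cdots$

(3) Logistic function
- Constraint: $0 \le q_i \le 1$ ($q_i$ is a probability)
- Linear predictor: $z_i = \beta_1 + \beta_2 x_i + \cdots$
$q_i = \mathrm{logistic}(z_i) = \frac{1}{1 + \exp(-z_i)}$
- If $q_i$ is assumed to be the logistic function of $z_i$, the condition $0 \le q_i \le 1$ holds whatever value the linear predictor $z_i$ takes. (Probability, score, ...)

(4) Transforming the logistic function
$q_i = \frac{1}{1 + e^{-z_i}}$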

$\log \frac{q_i}{1 - q_i} = z_i$
The left-hand side is the logit function:
$\mathrm{logit}(q_i) = \log \frac{q_i}{1 - q_i}$
- The logit function and the logistic function are inverses of each other.
- Hence we obtain the following logistic regression equation:
$\mathrm{logit}(q_i) = b_0 + b_1 x_1 + \cdots$

6.2 Logistic regression with Python
import numpy as np
np.random.seed(456)
import tensorflow as tf
tf.set_random_seed(456)
from sklearn.linear_model import LogisticRegression
import statsmodels.api as sm
N = 100
# np.identity(2), np.eye(2): identity matrix
x_zeros = np.random.multivariate_normal(mean=np.array((-1, -1)), cov=.1*np.eye(2), size=(N//2,))
y_zeros = np.zeros((N//2,))
x_ones = np.random.multivariate_normal(mean=np.array((1, 1)), cov=.1*np.eye(2), size=(N//2,))
y_ones = np.ones((N//2,))
x_np = np.vstack([x_zeros, x_ones])
y_np = np.concatenate([y_zeros, y_ones])
y = y_np
X = x_np

X_with_constant = sm.add_constant(X, prepend=True)
model = LogisticRegression()
model = model.fit(X_with_constant, y)
print(model.coef_)
print(model.intercept_)

EXAMPLE
Estimate the logistic regression equation using the JAPAN Credit data.
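The inverse relationship between the logistic and logit functions from section 6.1 can be checked numerically. This is a minimal sketch in plain Python; the function names `logistic` and `logit` and the sample values of z are chosen here for illustration and do not come from the original notes.

```python
import math


def logistic(z):
    """Map a linear predictor z to a probability q in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def logit(q):
    """Map a probability q in (0, 1) back to the linear predictor scale."""
    return math.log(q / (1.0 - q))


# Illustrative values of the linear predictor z.
for z in (-5.0, -1.0, 0.0, 2.5):
    q = logistic(z)
    # q stays strictly between 0 and 1, and logit(q) recovers z.
    print(z, q, logit(q))
```

Whatever value z takes in this range, q falls between 0 and 1 and applying logit returns the original z up to rounding, which is why the model can be written on the logit scale as a linear equation.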

7. K-Means Clustering

7.1 K-means clustering

(1) Overview of K-means clustering

(2) K-means clustering procedure
- Decide the number of clusters K (silhouette width)
- Decide the K initial cluster centers (at random or chosen by the analyst)
- Each object is grouped with the cluster center nearest to it
- Training ends when the clusters no longer change

7.2 K-means clustering practice
from sklearn.cluster import KMeans
from sklearn import datasets
import numpy as np
import matplotlib.pyplot as plt

# np.random.seed(5)
# centers = [[1, 1], [-1, -1], [1, -1]]
iris = datasets.load_iris()
X = iris.data
y = iris.target
Sepal_Length = X[:, 0]
Sepal_Width = X[:, 1]
Petal_Length = X[:, 2]
Petal_Width = X[:, 3]
Species = y
print(Species)
clustering = KMeans(n_clusters=3)
clustering.fit(X)
y_predict = clustering.predict(X)
plt.scatter(Sepal_Length, Sepal_Width)

EXAMPLE
Clustering using the Japan credit data
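The procedure listed in 7.1 (2) can be sketched in a few lines of plain Python to show what the library call does in outline. This is a simplified illustration, not scikit-learn's implementation: the function name `k_means`, the six sample points, and the analyst-chosen initial centers are all assumptions made for the example.

```python
def k_means(points, initial_centers, max_iter=100):
    """Assign points to the nearest center, update centers, repeat until stable."""
    centers = [tuple(c) for c in initial_centers]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the nearest center.
        new_labels = [
            min(
                range(len(centers)),
                key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centers[k])),
            )
            for point in points
        ]
        # Stop when no point changes cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centers[k] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centers


# Hypothetical 2-D points forming two well-separated groups.
points = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8), (5.0, 5.1), (5.2, 4.9), (4.8, 5.0)]
labels, centers = k_means(points, initial_centers=[(0.0, 0.0), (6.0, 6.0)])
print(labels)
print(centers)
```

Here K = 2 is fixed by the number of initial centers; choosing K itself (for example with the silhouette width) is a separate step that this sketch leaves out.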

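8. Introduction to Deep Learning

8.1 What is deep learning?

(1) Deep learning
- Deep learning: machine learning based on neural network models that are capable of learning in depth
- A neural network structure with many layers, used to extract features from the input data and to learn the complex (nonlinear) functions needed to solve a problem
- Requires large amounts of data and computing resources
- Shows markedly better performance than statistics and conventional machine learning techniques in tasks such as speech recognition and image recognition

(2) Problem solving with deep learning
- Through training, a model for solving the problem is built while suitable features are extracted from the input data.
- Layers close to the input layer learn low-level features; layers close to the output layer learn more abstract features → hierarchical feature learning

8.2 Convolutional Neural Network (CNN)

(1) CNN
- A neural network structure inspired by the structure of the animal visual cortex
- Each neuron in the visual cortex receives stimuli only from a specific region of the visual field (it responds only to particular features in that region).
- Visual recognition: a visual stimulus is processed by the primary visual cortex, then the secondary visual cortex, the tertiary visual cortex, and so on → hierarchical information processing (as information is processed layer by layer, increasingly abstract features are extracted and visual recognition takes place)

(2) CNN structure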

8.3 Neural network model
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
data_file_name = 'e:/data/python/cars.txt'
dat = np.genfromtxt(data_file_name, dtype='float32', skip_header=True)
speed_data = dat[:, 1]
dist_data = dat[:, 2]
speed = np.reshape(speed_data, [1, -1])
dist = np.reshape(dist_data, [1, -1])
x = tf.placeholder(dtype=tf.float32, shape=[1, None])
y = tf.placeholder(dtype=tf.float32, shape=[1, None])
hidden_number = 10
b1_hidden = tf.Variable(tf.random_normal([hidden_number, 1]))
b0_hidden = tf.Variable(tf.random_normal([hidden_number, 1]))
layer1_out = tf.nn.sigmoid(tf.matmul(b1_hidden, x) + b0_hidden)
b1_out = tf.Variable(tf.random_normal([1, hidden_number]))
b0_out = tf.Variable(tf.random_normal([1, 1]))
y_out = tf.matmul(b1_out, layer1_out) + b0_out
cost = tf.nn.l2_loss(y_out - y)
optimizer = tf.train.AdamOptimizer(0.1)
training = optimizer.minimize(cost)
init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)
    for i in range(500):
        sess.run(training, feed_dict={x: speed, y: dist})
    speed_data = np.linspace(0, 20, 50)
    x_test = [speed_data]
    y_test = sess.run(y_out, feed_dict={x: x_test})
plt.plot(speed, dist, 'ro', alpha=0.05)
plt.plot(x_test, y_test, 'b^', alpha=1)
plt.show()

Figures: fitted curves after 10, 500, and 5,000 iterations.

8.4 Neural network practice

(1) IRIS data classification (sigmoid function)
(2) IRIS data regression (linear function)

EXAMPLE
[1] Run a neural network model using the JAPAN Credit data.

9. Practice Examples

9.1 Association Rule Mining (ARM)

(1) Analyzes the associations between items using item and transaction data (item = event, transaction = experimental outcome)

(2) Item and transaction data sets
I = {i1, i2, ..., in}: set of n items
T = {t1, t2, ..., tm}: set of m transactions
(ex) Wal-Mart data
I = {Beer, Nuts, Diaper, Coffee, Eggs, Milk}
T = {10, 20, 30, 40, 50}

(3) Each transaction consists of a number (unique identification number) and the items it contains.
$t_j = \{i_{j1}, i_{j2}, \ldots, i_{jk}\}$
(ex) Wal-Mart data
t10 = (Beer, Nuts, Diaper)

(4) Representing association rules
- X → Y means that item Y is traded after item X is traded.
X and Y are items contained in the item set.
X: antecedent, lhs (left-hand side)
Y: consequent, rhs (right-hand side)

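(5) Three evaluation measures of ARM
- Support: for two events A and B, the probability that A and B occur together, $P(A \cap B)$
- Confidence: the probability that B occurs given that A has occurred, $P(B \mid A)$
- Lift: $\frac{P(B \mid A)}{P(B)}$

(6) For support and confidence, a minimum probability value is set and meaning is attached to the rules whose values exceed it: in ARM this value is the minimum threshold.

(7) Support
$\mathrm{support}(X \rightarrow Y) = P(X \cap Y) = \frac{\text{number of transactions containing both } X \text{ and } Y}{\text{total number of transactions}}$
$0 \le \mathrm{support}(X \rightarrow Y) \le 1$
Because the support values of (X → Y) and (Y → X) are the same, support cannot distinguish between the two rules.
$\mathrm{support}(X \rightarrow Y) = P(X \cap Y) = P(Y \cap X) = \mathrm{support}(Y \rightarrow X)$

(8) Confidence
$\mathrm{confidence}(X \rightarrow Y) = P(Y \mid X) = \frac{P(X \cap Y)}{P(X)} = \frac{\text{number of transactions containing both } X \text{ and } Y}{\text{number of transactions containing } X}$
$0 \le \mathrm{confidence}(X \rightarrow Y) \le 1$
Confidence, defined as the probability that Y occurs given that X has occurred, is the support of X and Y ($P(X \cap Y)$) divided by the support of X ($P(X)$):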

$\mathrm{confidence}(X \rightarrow Y) = P(Y \mid X) = \frac{P(X \cap Y)}{P(X)} = \frac{\mathrm{support}(X \rightarrow Y)}{\mathrm{support}(X)}$

(9) Lift
$\mathrm{lift}(X \rightarrow Y) = \frac{\mathrm{confidence}(X \rightarrow Y)}{\mathrm{support}(Y)} = \frac{P(Y \mid X)}{P(Y)} = \frac{P(X \cap Y)}{P(X) P(Y)} = \frac{\mathrm{support}(X \rightarrow Y)}{\mathrm{support}(X)\,\mathrm{support}(Y)}$
$0 \le \mathrm{lift}(X \rightarrow Y) < \infty$
- The lift is not a probability; in theory it takes values between 0 and infinity.
- When the lift equals 1, X and Y are independent:
$\mathrm{lift}(X \rightarrow Y) = \frac{P(X \cap Y)}{P(X) P(Y)} = 1, \quad P(X \cap Y) = P(X) P(Y)$
- Relationship between X and Y according to the lift value:
  lift(X → Y) > 1: X and Y are complementary
  lift(X → Y) = 1: X and Y are independent
  lift(X → Y) < 1: X and Y are substitutes

(10) Example [Wal-Mart case]
Quantities to compute: P(beer), P(diaper), P(beer ∩ diaper), P(beer | diaper), P(diaper | beer), and the lifts P(diaper | beer) / P(diaper) and P(beer | diaper) / P(beer).

(11) Practice code

> library(arules)
> library(arulesViz)
> tr = read.transactions("c:/data/walmart.txt", format = "basket", sep = ",")
> tr
transactions in sparse format with
 5 transactions (rows) and
 6 items (columns)
> rules = apriori(tr, parameter = list(support = 0.1, confidence = 0.8))
> rules
set of 46 rules
> inspect(rules)
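The three measures from items (7) to (9) can be computed by hand on a small transaction table. The sketch below uses plain Python sets rather than the arules package. Only transaction 10 (Beer, Nuts, Diaper) is given in the text; the contents of transactions 20 to 50 are hypothetical, filled in purely for illustration, and the function names are likewise assumptions.

```python
# Transaction 10 follows the text; transactions 20-50 are hypothetical examples.
transactions = {
    10: {"Beer", "Nuts", "Diaper"},
    20: {"Beer", "Coffee", "Diaper"},
    30: {"Beer", "Diaper", "Eggs"},
    40: {"Nuts", "Eggs", "Milk"},
    50: {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
}


def support(items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for basket in transactions.values() if items <= basket)
    return hits / len(transactions)


def confidence(lhs, rhs):
    """P(rhs | lhs) = support(lhs and rhs together) / support(lhs)."""
    return support(set(lhs) | set(rhs)) / support(lhs)


def lift(lhs, rhs):
    """confidence(lhs -> rhs) / support(rhs); 1 means independence."""
    return confidence(lhs, rhs) / support(rhs)


print("support(Beer -> Diaper)   :", support({"Beer", "Diaper"}))
print("confidence(Beer -> Diaper):", confidence({"Beer"}, {"Diaper"}))
print("confidence(Diaper -> Beer):", confidence({"Diaper"}, {"Beer"}))
print("lift(Beer -> Diaper)      :", lift({"Beer"}, {"Diaper"}))
```

On this toy table the support of the two rules is identical while their confidences differ, which illustrates why support alone cannot tell X → Y from Y → X.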


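9.2 Decision Tree

(1) Decision tree
- Breiman et al. introduced the decision tree model, and it was developed much further by Loh et al. [Breiman, 1984], [Loh, 1997]
- A classification and prediction technique that expresses the model-building process in the form of a tree and divides the target population into several subgroups

(2) Practice
library(tree)
iris.tr = tree(Species ~ ., iris)
iris.tr
summary(iris.tr)
plot(iris.tr); text(iris.tr)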


References

Abrahams, S. et al. (2016) TensorFlow for Machine Intelligence, Bleeding Edge Press.
Brownley, C. (2017) Foundations for Analytics with Python, O'Reilly.
Chatterjee, S. et al. (2012) Regression Analysis by Example, 5th edition, Wiley.
Efron, B., and Hastie, T. (2016) Computer Age Statistical Inference, Cambridge University Press.
Goodfellow, I. et al. (2016) Deep Learning, MIT Press.
Han, J. et al. (2012) Data Mining: Concepts and Techniques, Morgan Kaufmann.
McClure, N. (2017) TensorFlow Machine Learning Cookbook, Packt Publishing.
Murphy, K. P. (2012) Machine Learning: A Probabilistic Perspective, MIT Press.
Ramsundar, B. et al. (2018) TensorFlow for Deep Learning, O'Reilly.
Zaccone, G. (2016) Getting Started with TensorFlow, Packt Publishing.
김영우 (2017) 쉽게 배우는 R 데이터 분석, 이지스퍼블리싱.
나카이 에츠지 (2016) 텐서플로로 시작하는 딥러닝, 제이펍.
박응용 (2017) 점프 투 파이썬, 이지스퍼블리싱.
이건명 (2018) 인공지능, 생능출판.
최병관 외 (2018) Tensorflow 프로그래밍 기초, 청구문화사.
