
Big Data Collection, Analysis, and Visualization with Python. Part 2: Data Collection. 이원하

Table of Contents

1. FACEBOOK Crawling
2. TWITTER Crawling
3. NAVER Crawling
4. Public Data Portal (공공데이터포털) Crawling
5. General WEB Page Crawling

JSON (JavaScript Object Notation) is built from two structures: objects ({ }) and arrays ([ ]). Example:

{
    "name": "John",
    "age": 30,
    "cars": ["Ford", "BMW", "Fiat"]
}
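Python's built-in json module maps JSON objects to dicts and arrays to lists; a quick shell check using the sample above:

>>> import json
>>> data = json.loads('{"name": "John", "age": 30, "cars": ["Ford", "BMW", "Fiat"]}')
>>> data['name']
'John'
>>> data['cars'][1]
'BMW'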

1. FACEBOOK CRAWLING

Facebook Graph API

Request format:

GET /[version info]/[node|edge name] HTTP/1.1
Host: graph.facebook.com

Response format:

{
    "fieldname": {field-value},
    ...
}

Facebook Graph API

Cursor-based paging response:

{
    "data": [
        ... retrieved data ...
    ],
    "paging": {
        "cursors": {
            "after": "MTAxNTExOTQ1MjAwNzI5NDE=",
            "before": "NDMyNzQyODI3OTQw"
        },
        "previous": "https://graph.facebook.com/me/albums?limit=25&before=NDMyNzQyODI3OTQw",
        "next": "https://graph.facebook.com/me/albums?limit=25&after=MTAxNTExOTQ1MjAwNzI5NDE="
    }
}

Facebook Graph API

Time-based paging response:

{
    "data": [
        ... retrieved data ...
    ],
    "paging": {
        "previous": "https://graph.facebook.com/me/feed?limit=25&since=1364849754",
        "next": "https://graph.facebook.com/me/feed?limit=25&until=1364587774"
    }
}
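Either way, collecting every page reduces to a loop: request a page, keep its "data", then follow paging['next'] until it disappears. A minimal sketch with no error handling (the full crawler later in this part wraps requests in get_request_url instead):

import json
import urllib.request

def fetch_all(url):
    # Follow "next" links until the response carries no further page.
    results = []
    while url:
        with urllib.request.urlopen(url) as response:
            page = json.loads(response.read().decode('utf-8'))
        results.extend(page.get('data', []))
        url = page.get('paging', {}).get('next')  # None ends the loop
    return results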


Getting a Facebook Numeric ID

http://www.facebook.com/jtbcnews
https://graph.facebook.com/v2.8/[page_id]/?access_token=[App_ID]|[Secret_Key]

>>> page_name = "jtbcnews"
>>> app_id = "2009*******7013"
>>> app_secret = "daccef14d5*******95060d65e66c41d"
>>> access_token = app_id + "|" + app_secret
>>>
>>> base = "https://graph.facebook.com/v2.8"
>>> node = "/" + page_name
>>> parameters = "/?access_token=%s" % access_token
>>> url = base + node + parameters
>>> print (url)
https://graph.facebook.com/v2.8/jtbcnews/?access_token=2009*******7013|daccef14d5*******95060d65e66c41d

Opening that URL returns:

{
    "name": "JTBC 뉴스",
    "id": "240263402699918"
}

Getting a Facebook Numeric ID

>>> import urllib.request
>>> import json
>>> req = urllib.request.Request(url)
>>> response = urllib.request.urlopen(req)
>>> data = json.loads(response.read().decode('utf-8'))
>>> page_id = data['id']
>>> page_name = data['name']
>>> print ("%s Facebook Numeric ID : %s" % (page_name, page_id))
JTBC 뉴스 Facebook Numeric ID : 240263402699918

Getting a Facebook Numeric ID

import sys
import urllib.request
import json

if __name__ == '__main__':
    page_name = "jtbcnews"
    app_id = "200920440387013"
    app_secret = "daccef14d5cd41c0e95060d65e66c41d"
    access_token = app_id + "|" + app_secret

    base = "https://graph.facebook.com/v2.8"
    node = "/" + page_name
    parameters = "/?access_token=%s" % access_token
    url = base + node + parameters

    req = urllib.request.Request(url)
    try:
        response = urllib.request.urlopen(req)
        if response.getcode() == 200:
            data = json.loads(response.read().decode('utf-8'))
            page_id = data['id']
            print ("%s Facebook Numeric ID : %s" % (page_name, page_id))
    except Exception as e:
        print (e)

Notes:
- geturl(): returns the URL of the server that actually responded
- info(): returns the page headers and meta information
- getcode(): returns the server's HTTP status code (200 means success)
- If an error occurs inside the try block, execution jumps to the except block instead of crashing the program
- Exception gives access to the error details

Crawling Facebook Posts (/{post-id})

GET /[version info]/{post-id} HTTP/1.1
Host: graph.facebook.com

Field          Description                                  Type
id             post ID                                      String
comments       comment information                          Object
created_time   when the post was first created              Datetime
from           profile of the user who wrote the post       Profile
link           link embedded in the post                    String
message        post message                                 String
name           name of the link                             String
object_id      ID of an uploaded photo or video             String
parent_id      parent post of this post                     String
picture        links to pictures included in the post       String
place          location where the post was written          Place
reactions      reaction info (Like, Angry, and so on)       Object
shares         number of times the post was shared          Object
type           object type of the post                      enum{link, status, photo, video, offer}
updated_time   when the post was last updated               Datetime

Crawling Facebook Posts (/{post-id})

{
    "data": [
        {
            "comments": {
                "data": [ ... ],
                "summary": {
                    "order": "ranked",
                    "total_count": 12,
                    "can_comment": "false"
                }
            },
            "message": "즉 청와대가 최씨의 국정개입 사건을 파악하고도\n은폐했다는 사실이 안전수석 입에서 나온 겁니다.",
            "type": "link",
            "shares": {
                "count": 46
            },
            "reactions": {
                "data": [ ... ],
                "summary": {
                    "viewer_reaction": "none",
                    "total_count": 443
                }
            },
            "created_time": "2017-02-23T00:00:00+0000",
            "name": "안종범 \"재단 임원 인사에 최순실 개입, 알고도 숨겼다\"",
            "id": "240263402699918_1328805163845731",
            "link": "http://news.jtbc.joins.com/article/article.aspx?news_id=NB11427906&pDate=20170222"
        }
    ],
    "paging": {
        "next": "https://graph.facebook.com/v2.8/240263402699918/posts?fields=...",
        "previous": "https://graph.facebook.com/v2.8/240263402699918/posts?fields=..."
    }
}
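Reading values out of this response is plain dict and list indexing once it has been parsed. For example, assuming the sample above was parsed with json.loads into data:

>>> post = data['data'][0]
>>> post['reactions']['summary']['total_count']
443
>>> post['shares']['count']
46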

Crawling Facebook Posts (/{post-id})

def get_request_url(url):
    req = urllib.request.Request(url)
    try:
        response = urllib.request.urlopen(req)
        if response.getcode() == 200:
            print ("[%s] Url Request Success" % datetime.datetime.now())
            return response.read().decode('utf-8')
    except Exception as e:
        print(e)
        print("[%s] Error for URL : %s" % (datetime.datetime.now(), url))
        return None

Notes:
- On success, returns the body as response.read().decode('utf-8')
- Returns None when an error occurs

Crawling Facebook Posts (/{post-id})

def getFacebookNumericID(page_id, access_token):
    base = "https://graph.facebook.com/v2.8"
    node = "/" + page_id
    parameters = "/?access_token=%s" % access_token
    url = base + node + parameters

    retData = get_request_url(url)

    if (retData == None):
        return None
    else:
        jsonData = json.loads(retData)
        return jsonData['id']

Crawling Facebook Posts (/{post-id})

def getFacebookPost(page_id, access_token, from_date, to_date, num_statuses):
    base = "https://graph.facebook.com/v2.8"
    node = "/%s/posts" % page_id
    fields = "/?fields=id,message,link,name,type,shares,reactions" + \
             ".limit(0).summary(true),created_time,comments" + \
             ".limit(0).summary(true)"
    duration = "&since=%s&until=%s" % (from_date, to_date)
    parameters = "&limit=%s&access_token=%s" % (num_statuses, access_token)
    url = base + node + fields + duration + parameters

    retData = get_request_url(url)

    if (retData == None):
        return None
    else:
        return json.loads(retData)

Crawling Facebook Posts (/{post-id})

def getPostItem(post, key):
    try:
        if key in post.keys():
            return post[key]
        else:
            return ''
    except:
        return ''

def getPostTotalCount(post, key):
    try:
        if key in post.keys():
            return post[key]['summary']['total_count']
        else:
            return 0
    except:
        return 0

Reference structure:

"reactions": {
    "data": [ ... ],
    "summary": {
        "viewer_reaction": "none",
        "total_count": 443
    }
}
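Against the reactions block above, the helpers behave like this (a hypothetical shell session):

>>> getPostTotalCount({'reactions': {'summary': {'total_count': 443}}}, 'reactions')
443
>>> getPostItem({}, 'message')   # a missing key falls back to ''
''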

Crawling Facebook Posts (/{post-id})

def getFacebookReaction(post_id, access_token):
    base = "https://graph.facebook.com/v2.8"
    node = "/%s" % post_id
    reactions = "/?fields=" \
        "reactions.type(LIKE).limit(0).summary(total_count).as(like)" \
        ",reactions.type(LOVE).limit(0).summary(total_count).as(love)" \
        ",reactions.type(WOW).limit(0).summary(total_count).as(wow)" \
        ",reactions.type(HAHA).limit(0).summary(total_count).as(haha)" \
        ",reactions.type(SAD).limit(0).summary(total_count).as(sad)" \
        ",reactions.type(ANGRY).limit(0).summary(total_count).as(angry)"
    parameters = "&access_token=%s" % access_token
    url = base + node + reactions + parameters

    retData = get_request_url(url)

    if (retData == None):
        return None
    else:
        return json.loads(retData)

Crawling Facebook Posts (/{post-id})

def getPostData(post, access_token, jsonResult):
    post_id = getPostItem(post, 'id')
    post_message = getPostItem(post, 'message')
    post_name = getPostItem(post, 'name')
    post_link = getPostItem(post, 'link')
    post_type = getPostItem(post, 'type')

    post_num_reactions = getPostTotalCount(post, 'reactions')
    post_num_comment = getPostTotalCount(post, 'comments')
    post_num_shares = 0 if 'shares' not in post.keys() else post['shares']['count']

    post_created_time = getPostItem(post, 'created_time')
    post_created_time = datetime.datetime.strptime(post_created_time, '%Y-%m-%dT%H:%M:%S+0000')
    post_created_time = post_created_time + datetime.timedelta(hours=+9)
    post_created_time = post_created_time.strftime('%Y-%m-%d %H:%M:%S')

Crawling Facebook Posts (/{post-id})

    reaction = getFacebookReaction(post_id, access_token) if post_created_time > '2016-02-24 00:00:00' else {}
    post_num_likes = getPostTotalCount(reaction, 'like')
    post_num_likes = post_num_reactions if post_created_time < '2016-02-24 00:00:00' else post_num_likes

    post_num_loves = getPostTotalCount(reaction, 'love')
    post_num_wows = getPostTotalCount(reaction, 'wow')
    post_num_hahas = getPostTotalCount(reaction, 'haha')
    post_num_sads = getPostTotalCount(reaction, 'sad')
    post_num_angrys = getPostTotalCount(reaction, 'angry')

    jsonResult.append({'post_id': post_id, 'message': post_message, 'name': post_name,
                       'link': post_link, 'created_time': post_created_time,
                       'num_reactions': post_num_reactions, 'num_comments': post_num_comment,
                       'num_shares': post_num_shares, 'num_likes': post_num_likes,
                       'num_loves': post_num_loves, 'num_wows': post_num_wows,
                       'num_hahas': post_num_hahas, 'num_sads': post_num_sads,
                       'num_angrys': post_num_angrys})

Crawling Facebook Posts (/{post-id})

go_next = True
...
page_id = getFacebookNumericID(page_name, access_token)
...
jsonPost = getFacebookPost(page_id, access_token, from_date, to_date, num_posts)
...
while (go_next):
    for post in jsonPost['data']:
        getPostData(post, access_token, jsonResult)

    if 'paging' in jsonPost.keys():
        jsonPost = json.loads(get_request_url(jsonPost['paging']['next']))
    else:
        go_next = False
...
with open('%s_facebook_%s_%s.json' % (page_name, from_date, to_date), 'w', encoding='utf8') as outfile:
    str_ = json.dumps(jsonResult, indent=4, sort_keys=True, ensure_ascii=False)
    outfile.write(str_)

2. TWITTER CRAWLING

OAuth
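The examples in this section sign requests with OAuth 1.0a through the third-party oauth2 package (python-oauth2). The install step below is an assumption based on the calls those examples rely on:

[Python installation path]> pip install oauth2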

Twit Crawling

def oauth2_request(consumer_key, consumer_secret, access_token, access_secret):
    try:
        consumer = oauth2.Consumer(key=consumer_key, secret=consumer_secret)
        token = oauth2.Token(key=access_token, secret=access_secret)
        client = oauth2.Client(consumer, token)
        return client
    except Exception as e:
        print(e)
        return None

def get_user_timeline(client, screen_name, count=50, include_rts='false'):
    base = "https://api.twitter.com/1.1"
    node = "/statuses/user_timeline.json"
    fields = "?screen_name=%s&count=%s&include_rts=%s" % (screen_name, count, include_rts)
    #fields = "?screen_name=%s" % (screen_name)
    url = base + node + fields

    response, data = client.request(url)
    try:
        if response['status'] == '200':
            return json.loads(data.decode('utf-8'))
    except Exception as e:
        print(e)
        return None

Twit Crawling

def getTwitterTwit(tweet, jsonResult):
    tweet_id = tweet['id_str']
    tweet_message = '' if 'text' not in tweet.keys() else tweet['text']
    screen_name = '' if 'user' not in tweet.keys() else tweet['user']['screen_name']

    tweet_link = ''
    if tweet['entities']['urls']:  # list
        for i, val in enumerate(tweet['entities']['urls']):
            tweet_link = tweet_link + tweet['entities']['urls'][i]['url'] + ' '
    else:
        tweet_link = ''

    hashtags = ''
    if tweet['entities']['hashtags']:  # list
        for i, val in enumerate(tweet['entities']['hashtags']):
            hashtags = hashtags + tweet['entities']['hashtags'][i]['text'] + ' '
    else:
        hashtags = ''

Twit Crawling

    if 'created_at' in tweet.keys():
        # Twitter uses UTC; KST = UTC + 9 hours. Format example: Fri Feb 10 03:57:27 +0000 2017
        tweet_published = datetime.datetime.strptime(tweet['created_at'], '%a %b %d %H:%M:%S +0000 %Y')
        tweet_published = tweet_published + datetime.timedelta(hours=+9)
        tweet_published = tweet_published.strftime('%Y-%m-%d %H:%M:%S')
    else:
        tweet_published = ''

    num_favorite_count = 0 if 'favorite_count' not in tweet.keys() else tweet['favorite_count']
    num_comments = 0
    num_shares = 0 if 'retweet_count' not in tweet.keys() else tweet['retweet_count']
    num_likes = num_favorite_count
    num_loves = num_wows = num_hahas = num_sads = num_angrys = 0

    jsonResult.append({'post_id': tweet_id, 'message': tweet_message, 'name': screen_name,
                       'link': tweet_link, 'created_time': tweet_published,
                       'num_reactions': num_favorite_count, 'num_comments': num_comments,
                       'num_shares': num_shares, 'num_likes': num_likes,
                       'num_loves': num_loves, 'num_wows': num_wows,
                       'num_hahas': num_hahas, 'num_sads': num_sads,
                       'num_angrys': num_angrys, 'hashtags': hashtags})

Twit Crawling

def main():
    screen_name = "jtbc_news"
    num_posts = 50
    jsonResult = []

    client = oauth2_request(CONSUMER_KEY, CONSUMER_SECRET, ACCESS_TOKEN, ACCESS_SECRET)
    tweets = get_user_timeline(client, screen_name)

    for tweet in tweets:
        getTwitterTwit(tweet, jsonResult)

    with open('%s_twitter.json' % (screen_name), 'w', encoding='utf8') as outfile:
        str_ = json.dumps(jsonResult, indent=4, sort_keys=True, ensure_ascii=False)
        outfile.write(str_)

3. NAVER CRAWLING

Naver Search API

def get_request_url(url):
    req = urllib.request.Request(url)
    req.add_header("X-Naver-Client-Id", app_id)
    req.add_header("X-Naver-Client-Secret", "scxkesjyib")
    try:
        response = urllib.request.urlopen(req)
        if response.getcode() == 200:
            print ("[%s] Url Request Success" % datetime.datetime.now())
            return response.read().decode('utf-8')
    except Exception as e:
        print(e)
        print("[%s] Error for URL : %s" % (datetime.datetime.now(), url))
        return None

Naver Search API

def getNaverSearchResult(sNode, search_text, page_start, display):
    base = "https://openapi.naver.com/v1/search"
    node = "/%s.json" % sNode
    parameters = "?query=%s&start=%s&display=%s" % (urllib.parse.quote(search_text), page_start, display)
    url = base + node + parameters

    retData = get_request_url(url)

    if (retData == None):
        return None
    else:
        return json.loads(retData)

Naver Search API

def getPostData(post, jsonResult):
    title = post['title']
    description = post['description']
    org_link = post['originallink']
    link = post['link']

    # Tue, 14 Feb 2017 18:46:00 +0900
    pDate = datetime.datetime.strptime(post['pubDate'], '%a, %d %b %Y %H:%M:%S +0900')
    pDate = pDate.strftime('%Y-%m-%d %H:%M:%S')

    jsonResult.append({'title': title, 'description': description,
                       'org_link': org_link, 'link': link, 'pdate': pDate})
    return

Response structure:

{
    "display": 100,
    "items": [
        {
            "description": "[article summary]",
            "link": "[Naver URL]",
            "originallink": "[original article URL]",
            "pubDate": "Thu, 13 Jul 2017 03:38:00 +0900",
            "title": "[article title]"
        }
    ],
    "lastBuildDate": "Thu, 13 Jul 2017 03:50:47 +0900",
    "start": 1,
    "total": 454265
}

Naver Search API

def main():
    jsonResult = []

    # 'news', 'blog', 'cafearticle'
    sNode = 'news'
    search_text = '탄핵'
    display_count = 100

    jsonSearch = getNaverSearchResult(sNode, search_text, 1, display_count)

    while ((jsonSearch != None) and (jsonSearch['display'] != 0)):
        for post in jsonSearch['items']:
            getPostData(post, jsonResult)

        nStart = jsonSearch['start'] + jsonSearch['display']
        jsonSearch = getNaverSearchResult(sNode, search_text, nStart, display_count)

    with open('%s_naver_%s.json' % (search_text, sNode), 'w', encoding='utf8') as outfile:
        retJson = json.dumps(jsonResult, indent=4, sort_keys=True, ensure_ascii=False)
        outfile.write(retJson)

Naver Map (지도) API

def getGeoData(address):
    base = "https://openapi.naver.com/v1/map/geocode"
    node = ""
    parameters = "?query=%s" % urllib.parse.quote(address)
    url = base + node + parameters

    retData = get_request_url(url)

    if (retData == None):
        return None
    else:
        return json.loads(retData)
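A quick usage sketch; the address is a hypothetical example, and the response is printed as-is rather than parsed, since its exact layout depends on the API version (assumes the json and urllib imports used earlier):

if __name__ == '__main__':
    # Hypothetical example address (Seoul City Hall); replace with your own.
    jsonData = getGeoData('서울특별시 중구 세종대로 110')
    if jsonData is not None:
        print(json.dumps(jsonData, indent=4, sort_keys=True, ensure_ascii=False))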

4. PUBLIC DATA PORTAL (공공데이터포털) CRAWLING

Visitor Counts for Paid Tourist Attractions Nationwide (전국유료관광지입장객정보)

def getTourPointVisitor(yyyymm, sido, gungu, nPageNum, nItems):
    end_point = "http://openapi.tour.go.kr/openapi/service/TourismResourceStatsService/getPchrgTrrsrtVisitorList"

    parameters = "?_type=json&serviceKey=" + access_key
    parameters += "&YM=" + yyyymm
    parameters += "&SIDO=" + urllib.parse.quote(sido)
    parameters += "&GUNGU=" + urllib.parse.quote(gungu)
    parameters += "&RES_NM=&pageNo=" + str(nPageNum)
    parameters += "&numOfRows=" + str(nItems)

    url = end_point + parameters

    retData = get_request_url(url)

    if (retData == None):
        return None
    else:
        return json.loads(retData)

Visitor Counts for Paid Tourist Attractions Nationwide (전국유료관광지입장객정보)

def getTourPointData(item, yyyymm, jsonResult):
    addrCd = 0 if 'addrCd' not in item.keys() else item['addrCd']
    gungu = '' if 'gungu' not in item.keys() else item['gungu']
    sido = '' if 'sido' not in item.keys() else item['sido']
    resNm = '' if 'resNm' not in item.keys() else item['resNm']
    rnum = 0 if 'rnum' not in item.keys() else item['rnum']
    ForNum = 0 if 'csForCnt' not in item.keys() else item['csForCnt']
    NatNum = 0 if 'csNatCnt' not in item.keys() else item['csNatCnt']

    jsonResult.append({'yyyymm': yyyymm, 'addrCd': addrCd, 'gungu': gungu,
                       'sido': sido, 'resNm': resNm, 'rnum': rnum,
                       'ForNum': ForNum, 'NatNum': NatNum})

Visitor Counts for Paid Tourist Attractions Nationwide (전국유료관광지입장객정보)

nItems = 100
nStartYear = 2011
nEndYear = 2017

for year in range(nStartYear, nEndYear):
    for month in range(1, 13):
        yyyymm = "{0}{1:0>2}".format(str(year), str(month))
        nPageNum = 1

        while True:
            jsonData = getTourPointVisitor(yyyymm, sido, gungu, nPageNum, nItems)

            if (jsonData['response']['header']['resultMsg'] == 'OK'):
                nTotal = jsonData['response']['body']['totalCount']

                if nTotal == 0:
                    break

                for item in jsonData['response']['body']['items']['item']:
                    getTourPointData(item, yyyymm, jsonResult)

                nPage = math.ceil(nTotal / nItems)
                if (nPageNum == nPage):
                    break
                nPageNum += 1
            else:
                break

Entry and Departure Tourism Statistics Service (출입국관광통계서비스)

def getNatVisitor(yyyymm, nat_cd, ed_cd):
    end_point = "http://openapi.tour.go.kr/openapi/service/EdrcntTourismStatsService/getEdrcntTourismStatsList"

    parameters = "?_type=json&serviceKey=" + access_key
    parameters += "&YM=" + yyyymm
    parameters += "&NAT_CD=" + nat_cd
    parameters += "&ED_CD=" + ed_cd

    url = end_point + parameters

    retData = get_request_url(url)

    if (retData == None):
        return None
    else:
        return json.loads(retData)

Entry and Departure Tourism Statistics Service (출입국관광통계서비스)

def main():
    # ... code omitted ...
    for year in range(nStartYear, nEndYear):
        for month in range(1, 13):
            yyyymm = "{0}{1:0>2}".format(str(year), str(month))
            jsonData = getNatVisitor(yyyymm, national_code, ed_cd)

            if (jsonData['response']['header']['resultMsg'] == 'OK'):
                krName = jsonData['response']['body']['items']['item']["natKorNm"]
                krName = krName.replace(' ', '')
                iTotalVisit = jsonData['response']['body']['items']['item']["num"]
                print('%s_%s : %s' % (krName, yyyymm, iTotalVisit))

                jsonResult.append({'nat_name': krName, 'nat_cd': national_code,
                                   'yyyymm': yyyymm, 'visit_cnt': iTotalVisit})
    # ... code omitted ...
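The slide omits the setup and save code around this loop. A plausible sketch following the pattern of the earlier crawlers; every value and name here is an assumption for illustration, not the author's original code:

def main():
    jsonResult = []
    national_code = "112"   # hypothetical country code; check the service's code table
    ed_cd = "E"             # hypothetical flag selecting entry vs. departure statistics
    nStartYear, nEndYear = 2011, 2017

    # ... the loop shown above ...

    with open('%s_visitors.json' % national_code, 'w', encoding='utf8') as outfile:
        retJson = json.dumps(jsonResult, indent=4, sort_keys=True, ensure_ascii=False)
        outfile.write(retJson)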

5. GENERAL WEB PAGE CRAWLING

Web Page Copyright (웹페이지 저작권)

BeautifulSoup4: an HTML parsing package

[Python installation path]> pip install beautifulsoup4
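A quick sanity check after installing; 'html.parser' is Python's built-in parser, so no extra dependency is needed:

>>> from bs4 import BeautifulSoup
>>> BeautifulSoup('<b>ok</b>', 'html.parser').b.text
'ok'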

BeautifulSoup4: an HTML parsing package

>>> from bs4 import BeautifulSoup
>>> html = '<td class="title"><div class="tit3"><a href="/movie/bi/mi/basic.nhn?code=136872" title="미녀와야수">미녀와야수</a></div></td>'
>>> soup = BeautifulSoup(html, 'html.parser')
>>> soup
<td class="title"><div class="tit3"><a href="/movie/bi/mi/basic.nhn?code=136872" title="미녀와야수">미녀와야수</a></div></td>
>>> tag = soup.td
>>> tag
<td class="title"><div class="tit3"><a href="/movie/bi/mi/basic.nhn?code=136872" title="미녀와야수">미녀와야수</a></div></td>
>>> tag = soup.div
>>> tag
<div class="tit3"><a href="/movie/bi/mi/basic.nhn?code=136872" title="미녀와야수">미녀와야수</a></div>
>>> tag = soup.a
>>> tag
<a href="/movie/bi/mi/basic.nhn?code=136872" title="미녀와야수">미녀와야수</a>
>>> tag.name
'a'

BeautifulSoup4: an HTML parsing package

>>> from bs4 import BeautifulSoup
>>> html = '<td class="title"><div class="tit3"><a href="/movie/bi/mi/basic.nhn?code=136872" title="미녀와야수">미녀와야수</a></div></td>'
>>> soup = BeautifulSoup(html, 'html.parser')
>>> soup
<td class="title"><div class="tit3"><a href="/movie/bi/mi/basic.nhn?code=136872" title="미녀와야수">미녀와야수</a></div></td>
>>> tag = soup.td
>>> tag['class']
['title']
>>> tag = soup.div
>>> tag['class']
['tit3']
>>>
>>> tag.attrs
{'class': ['tit3']}

BeautifulSoup4: an HTML parsing package

>>> from bs4 import BeautifulSoup
>>> html = '<td class="title"><div class="tit3"><a href="/movie/bi/mi/basic.nhn?code=136872" title="미녀와야수">미녀와야수</a></div></td>'
>>> soup = BeautifulSoup(html, 'html.parser')
>>> soup
<td class="title"><div class="tit3"><a href="/movie/bi/mi/basic.nhn?code=136872" title="미녀와야수">미녀와야수</a></div></td>
>>> tag = soup.find('td', attrs={'class':'title'})
>>> tag
<td class="title"><div class="tit3"><a href="/movie/bi/mi/basic.nhn?code=136872" title="미녀와야수">미녀와야수</a></div></td>
>>> tag = soup.find('div', attrs={'class':'tit3'})
>>> tag
<div class="tit3"><a href="/movie/bi/mi/basic.nhn?code=136872" title="미녀와야수">미녀와야수</a></div>

BeautifulSoup4: an HTML parsing package

>>> import urllib.request
>>> from bs4 import BeautifulSoup
>>> html = urllib.request.urlopen('http://movie.naver.com/movie/sdb/rank/rmovie.nhn')
>>> soup = BeautifulSoup(html, 'html.parser')
>>> print(soup.prettify())
<!DOCTYPE html>
<html>
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="content-type"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="http://imgmovie.naver.com/today/naverme/naverme_profile.jpg" property="me2:image"/>
  <meta content="네이버영화" property="me2:post_tag"/>
  <meta content="네이버영화" property="me2:category1"/>
  ...(omitted)...
  <!-- //Footer -->
  </div>
 </body>
</html>

BeautifulSoup4: an HTML parsing package

<td class="title">
    <div class="tit3">
        <a href="/movie/bi/mi/basic.nhn?code=135874" title="스파이더맨: 홈커밍">스파이더맨: 홈커밍</a>
    </div>
</td>

BeautifulSoup4: an HTML parsing package

>>> tags = soup.find_all('div', attrs={'class':'tit3'})
>>> tags
[<div class="tit3">
<a href="/movie/bi/mi/basic.nhn?code=135874" title="스파이더맨: 홈커밍">스파이더맨: 홈커밍</a>
</div>, <div class="tit3">
<a href="/movie/bi/mi/basic.nhn?code=146480" title="덩케르크">덩케르크</a>
</div>, <div class="tit3">
<a href="/movie/bi/mi/basic.nhn?code=76309" title="플립">플립</a>
...(omitted)...
<a href="/movie/bi/mi/basic.nhn?code=149048" title="100미터">100미터</a>
</div>]
>>> for tag in tags:
...     print(tag.a)
<a href="/movie/bi/mi/basic.nhn?code=135874" title="스파이더맨: 홈커밍">스파이더맨: 홈커밍</a>
...(omitted)...
>>> for tag in tags:
...     print(tag.a.text)
스파이더맨: 홈커밍
덩케르크
...(omitted)...
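Putting the pieces together, a minimal end-to-end sketch that fetches the ranking page used above, extracts the titles, and saves them in the same JSON style as the earlier crawlers. The URL and class names are the ones from the examples; the output file name is an assumption:

import json
import urllib.request
from bs4 import BeautifulSoup

def main():
    # Fetch the Naver Movie daily ranking page used in the examples above.
    html = urllib.request.urlopen('http://movie.naver.com/movie/sdb/rank/rmovie.nhn')
    soup = BeautifulSoup(html, 'html.parser')

    # Each ranked title sits in <div class="tit3"><a title="...">...</a></div>.
    jsonResult = []
    for rank, tag in enumerate(soup.find_all('div', attrs={'class': 'tit3'}), start=1):
        jsonResult.append({'rank': rank,
                           'title': tag.a.text.strip(),
                           'link': tag.a['href']})

    # Save in the same JSON style as the earlier crawlers (file name is an assumption).
    with open('naver_movie_rank.json', 'w', encoding='utf8') as outfile:
        outfile.write(json.dumps(jsonResult, indent=4, sort_keys=True, ensure_ascii=False))

if __name__ == '__main__':
    main()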