
Big Data Collection, Analysis, and Visualization with Python. Part 2: Data Collection. 이원하

Table of Contents

1. FACEBOOK Crawling
2. TWITTER Crawling
3. NAVER Crawling
4. Public Data Portal (공공데이터포털) Crawling
5. General WEB Page Crawling

JSON (JavaScript Object Notation) is built from two structures: objects ({ }) and arrays ([ ]). Example:

{
    "name": "John",
    "age": 30,
    "cars": ["Ford", "BMW", "Fiat"]
}
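Python's built-in json module maps JSON objects to dicts and arrays to lists; a quick shell check using the sample above:

>>> import json
>>> data = json.loads('{"name": "John", "age": 30, "cars": ["Ford", "BMW", "Fiat"]}')
>>> data['name']
'John'
>>> data['cars'][1]
'BMW'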

1. FACEBOOK CRAWLING

Facebook Graph API

Request format:

GET /[version info]/[node|edge name] HTTP/1.1
Host: graph.facebook.com

Response format:

{
    "fieldname": {field-value},
    ...
}

Facebook Graph API

Cursor-based paging response:

{
    "data": [
        ... retrieved data ...
    ],
    "paging": {
        "cursors": {
            "after": "MTAxNTExOTQ1MjAwNzI5NDE=",
            "before": "NDMyNzQyODI3OTQw"
        },
        "previous": "https://graph.facebook.com/me/albums?limit=25&before=NDMyNzQyODI3OTQw",
        "next": "https://graph.facebook.com/me/albums?limit=25&after=MTAxNTExOTQ1MjAwNzI5NDE="
    }
}

Facebook Graph API

Time-based paging response:

{
    "data": [
        ... retrieved data ...
    ],
    "paging": {
        "previous": "https://graph.facebook.com/me/feed?limit=25&since=1364849754",
        "next": "https://graph.facebook.com/me/feed?limit=25&until=1364587774"
    }
}
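Either way, collecting every page reduces to a loop: request a page, keep its "data", then follow paging['next'] until it disappears. A minimal sketch with no error handling (the full crawler later in this part wraps requests in get_request_url instead):

import json
import urllib.request

def fetch_all(url):
    # Follow "next" links until the response carries no further page.
    results = []
    while url:
        with urllib.request.urlopen(url) as response:
            page = json.loads(response.read().decode('utf-8'))
        results.extend(page.get('data', []))
        url = page.get('paging', {}).get('next')  # None ends the loop
    return results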


Getting a Facebook Numeric ID

http://www.facebook.com/jtbcnews
https://graph.facebook.com/v2.8/[page_id]/?access_token=[App_ID]|[Secret_Key]

>>> page_name = "jtbcnews"
>>> app_id = "2009*******7013"
>>> app_secret = "daccef14d5*******95060d65e66c41d"
>>> access_token = app_id + "|" + app_secret
>>>
>>> base = "https://graph.facebook.com/v2.8"
>>> node = "/" + page_name
>>> parameters = "/?access_token=%s" % access_token
>>> url = base + node + parameters
>>> print (url)
https://graph.facebook.com/v2.8/jtbcnews/?access_token=2009*******7013|daccef14d5*******95060d65e66c41d

Opening that URL returns:

{
    "name": "JTBC 뉴스",
    "id": "240263402699918"
}

Getting a Facebook Numeric ID

>>> import urllib.request
>>> import json
>>> req = urllib.request.Request(url)
>>> response = urllib.request.urlopen(req)
>>> data = json.loads(response.read().decode('utf-8'))
>>> page_id = data['id']
>>> page_name = data['name']
>>> print ("%s Facebook Numeric ID : %s" % (page_name, page_id))
JTBC 뉴스 Facebook Numeric ID : 240263402699918

Getting a Facebook Numeric ID

import sys
import urllib.request
import json

if __name__ == '__main__':
    page_name = "jtbcnews"
    app_id = "200920440387013"
    app_secret = "daccef14d5cd41c0e95060d65e66c41d"
    access_token = app_id + "|" + app_secret

    base = "https://graph.facebook.com/v2.8"
    node = "/" + page_name
    parameters = "/?access_token=%s" % access_token
    url = base + node + parameters

    req = urllib.request.Request(url)
    try:
        response = urllib.request.urlopen(req)
        if response.getcode() == 200:
            data = json.loads(response.read().decode('utf-8'))
            page_id = data['id']
            print ("%s Facebook Numeric ID : %s" % (page_name, page_id))
    except Exception as e:
        print (e)

Notes:
- geturl(): returns the URL of the server that actually responded
- info(): returns the page headers and meta information
- getcode(): returns the server's HTTP status code (200 means success)
- If an error occurs inside the try block, execution jumps to the except block instead of crashing the program
- Exception gives access to the error details

Crawling Facebook Posts (/{post-id})

GET /[version info]/{post-id} HTTP/1.1
Host: graph.facebook.com

Field          Description                                  Type
id             post ID                                      String
comments       comment information                          Object
created_time   when the post was first created              Datetime
from           profile of the user who wrote the post       Profile
link           link embedded in the post                    String
message        post message                                 String
name           name of the link                             String
object_id      ID of an uploaded photo or video             String
parent_id      parent post of this post                     String
picture        links to pictures included in the post       String
place          location where the post was written          Place
reactions      reaction info (Like, Angry, and so on)       Object
shares         number of times the post was shared          Object
type           object type of the post                      enum{link, status, photo, video, offer}
updated_time   when the post was last updated               Datetime

Crawling Facebook Posts (/{post-id})

{
    "data": [
        {
            "comments": {
                "data": [ ... ],
                "summary": {
                    "order": "ranked",
                    "total_count": 12,
                    "can_comment": "false"
                }
            },
            "message": "즉 청와대가 최씨의 국정개입 사건을 파악하고도\n은폐했다는 사실이 안전수석 입에서 나온 겁니다.",
            "type": "link",
            "shares": {
                "count": 46
            },
            "reactions": {
                "data": [ ... ],
                "summary": {
                    "viewer_reaction": "none",
                    "total_count": 443
                }
            },
            "created_time": "2017-02-23T00:00:00+0000",
            "name": "안종범 \"재단 임원 인사에 최순실 개입, 알고도 숨겼다\"",
            "id": "240263402699918_1328805163845731",
            "link": "http://news.jtbc.joins.com/article/article.aspx?news_id=NB11427906&pDate=20170222"
        }
    ],
    "paging": {
        "next": "https://graph.facebook.com/v2.8/240263402699918/posts?fields=...",
        "previous": "https://graph.facebook.com/v2.8/240263402699918/posts?fields=..."
    }
}
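Reading values out of this response is plain dict and list indexing once it has been parsed. For example, assuming the sample above was parsed with json.loads into data:

>>> post = data['data'][0]
>>> post['reactions']['summary']['total_count']
443
>>> post['shares']['count']
46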

Crawling Facebook Posts (/{post-id})

def get_request_url(url):
    req = urllib.request.Request(url)
    try:
        response = urllib.request.urlopen(req)
        if response.getcode() == 200:
            print ("[%s] Url Request Success" % datetime.datetime.now())
            return response.read().decode('utf-8')
    except Exception as e:
        print(e)
        print("[%s] Error for URL : %s" % (datetime.datetime.now(), url))
        return None

Notes:
- On success, returns the body as response.read().decode('utf-8')
- Returns None when an error occurs

Crawling Facebook Posts (/{post-id})

def getFacebookNumericID(page_id, access_token):
    base = "https://graph.facebook.com/v2.8"
    node = "/" + page_id
    parameters = "/?access_token=%s" % access_token
    url = base + node + parameters

    retData = get_request_url(url)

    if (retData == None):
        return None
    else:
        jsonData = json.loads(retData)
        return jsonData['id']

Crawling Facebook Posts (/{post-id})

def getFacebookPost(page_id, access_token, from_date, to_date, num_statuses):
    base = "https://graph.facebook.com/v2.8"
    node = "/%s/posts" % page_id
    fields = "/?fields=id,message,link,name,type,shares,reactions" + \
             ".limit(0).summary(true),created_time,comments" + \
             ".limit(0).summary(true)"
    duration = "&since=%s&until=%s" % (from_date, to_date)
    parameters = "&limit=%s&access_token=%s" % (num_statuses, access_token)
    url = base + node + fields + duration + parameters

    retData = get_request_url(url)

    if (retData == None):
        return None
    else:
        return json.loads(retData)

Crawling Facebook Posts (/{post-id})

def getPostItem(post, key):
    try:
        if key in post.keys():
            return post[key]
        else:
            return ''
    except:
        return ''

def getPostTotalCount(post, key):
    try:
        if key in post.keys():
            return post[key]['summary']['total_count']
        else:
            return 0
    except:
        return 0

Reference structure:

"reactions": {
    "data": [ ... ],
    "summary": {
        "viewer_reaction": "none",
        "total_count": 443
    }
}
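Against the reactions block above, the helpers behave like this (a hypothetical shell session):

>>> getPostTotalCount({'reactions': {'summary': {'total_count': 443}}}, 'reactions')
443
>>> getPostItem({}, 'message')   # a missing key falls back to ''
''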

Crawling Facebook Posts (/{post-id})

def getFacebookReaction(post_id, access_token):
    base = "https://graph.facebook.com/v2.8"
    node = "/%s" % post_id
    reactions = "/?fields=" \
        "reactions.type(LIKE).limit(0).summary(total_count).as(like)" \
        ",reactions.type(LOVE).limit(0).summary(total_count).as(love)" \
        ",reactions.type(WOW).limit(0).summary(total_count).as(wow)" \
        ",reactions.type(HAHA).limit(0).summary(total_count).as(haha)" \
        ",reactions.type(SAD).limit(0).summary(total_count).as(sad)" \
        ",reactions.type(ANGRY).limit(0).summary(total_count).as(angry)"
    parameters = "&access_token=%s" % access_token
    url = base + node + reactions + parameters

    retData = get_request_url(url)

    if (retData == None):
        return None
    else:
        return json.loads(retData)

Crawling Facebook Posts (/{post-id})

def getPostData(post, access_token, jsonResult):
    post_id = getPostItem(post, 'id')
    post_message = getPostItem(post, 'message')
    post_name = getPostItem(post, 'name')
    post_link = getPostItem(post, 'link')
    post_type = getPostItem(post, 'type')

    post_num_reactions = getPostTotalCount(post, 'reactions')
    post_num_comment = getPostTotalCount(post, 'comments')
    post_num_shares = 0 if 'shares' not in post.keys() else post['shares']['count']

    post_created_time = getPostItem(post, 'created_time')
    post_created_time = datetime.datetime.strptime(post_created_time, '%Y-%m-%dT%H:%M:%S+0000')
    post_created_time = post_created_time + datetime.timedelta(hours=+9)
    post_created_time = post_created_time.strftime('%Y-%m-%d %H:%M:%S')

Crawling Facebook Posts (/{post-id})

    reaction = getFacebookReaction(post_id, access_token) if post_created_time > '2016-02-24 00:00:00' else {}
    post_num_likes = getPostTotalCount(reaction, 'like')
    post_num_likes = post_num_reactions if post_created_time < '2016-02-24 00:00:00' else post_num_likes

    post_num_loves = getPostTotalCount(reaction, 'love')
    post_num_wows = getPostTotalCount(reaction, 'wow')
    post_num_hahas = getPostTotalCount(reaction, 'haha')
    post_num_sads = getPostTotalCount(reaction, 'sad')
    post_num_angrys = getPostTotalCount(reaction, 'angry')

    jsonResult.append({'post_id': post_id, 'message': post_message, 'name': post_name,
                       'link': post_link, 'created_time': post_created_time,
                       'num_reactions': post_num_reactions, 'num_comments': post_num_comment,
                       'num_shares': post_num_shares, 'num_likes': post_num_likes,
                       'num_loves': post_num_loves, 'num_wows': post_num_wows,
                       'num_hahas': post_num_hahas, 'num_sads': post_num_sads,
                       'num_angrys': post_num_angrys})

Crawling Facebook Posts (/{post-id})

go_next = True
...
page_id = getFacebookNumericID(page_name, access_token)
...
jsonPost = getFacebookPost(page_id, access_token, from_date, to_date, num_posts)
...
while (go_next):
    for post in jsonPost['data']:
        getPostData(post, access_token, jsonResult)

    if 'paging' in jsonPost.keys():
        jsonPost = json.loads(get_request_url(jsonPost['paging']['next']))
    else:
        go_next = False
...
with open('%s_facebook_%s_%s.json' % (page_name, from_date, to_date), 'w', encoding='utf8') as outfile:
    str_ = json.dumps(jsonResult, indent=4, sort_keys=True, ensure_ascii=False)
    outfile.write(str_)

2. TWITTER CRAWLING

OAuth
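The examples in this section sign requests with OAuth 1.0a through the third-party oauth2 package (python-oauth2). The install step below is an assumption based on the calls those examples rely on:

[Python installation path]> pip install oauth2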

Twit Crawling

def oauth2_request(consumer_key, consumer_secret, access_token, access_secret):
    try:
        consumer = oauth2.Consumer(key=consumer_key, secret=consumer_secret)
        token = oauth2.Token(key=access_token, secret=access_secret)
        client = oauth2.Client(consumer, token)
        return client
    except Exception as e:
        print(e)
        return None

def get_user_timeline(client, screen_name, count=50, include_rts='false'):
    base = "https://api.twitter.com/1.1"
    node = "/statuses/user_timeline.json"
    fields = "?screen_name=%s&count=%s&include_rts=%s" % (screen_name, count, include_rts)
    #fields = "?screen_name=%s" % (screen_name)
    url = base + node + fields

    response, data = client.request(url)
    try:
        if response['status'] == '200':
            return json.loads(data.decode('utf-8'))
    except Exception as e:
        print(e)
        return None

Twit Crawling

def getTwitterTwit(tweet, jsonResult):
    tweet_id = tweet['id_str']
    tweet_message = '' if 'text' not in tweet.keys() else tweet['text']
    screen_name = '' if 'user' not in tweet.keys() else tweet['user']['screen_name']

    tweet_link = ''
    if tweet['entities']['urls']:  # list
        for i, val in enumerate(tweet['entities']['urls']):
            tweet_link = tweet_link + tweet['entities']['urls'][i]['url'] + ' '
    else:
        tweet_link = ''

    hashtags = ''
    if tweet['entities']['hashtags']:  # list
        for i, val in enumerate(tweet['entities']['hashtags']):
            hashtags = hashtags + tweet['entities']['hashtags'][i]['text'] + ' '
    else:
        hashtags = ''

Twit Crawling

    if 'created_at' in tweet.keys():
        # Twitter uses UTC; KST = UTC + 9 hours. Format example: Fri Feb 10 03:57:27 +0000 2017
        tweet_published = datetime.datetime.strptime(tweet['created_at'], '%a %b %d %H:%M:%S +0000 %Y')
        tweet_published = tweet_published + datetime.timedelta(hours=+9)
        tweet_published = tweet_published.strftime('%Y-%m-%d %H:%M:%S')
    else:
        tweet_published = ''

    num_favorite_count = 0 if 'favorite_count' not in tweet.keys() else tweet['favorite_count']
    num_comments = 0
    num_shares = 0 if 'retweet_count' not in tweet.keys() else tweet['retweet_count']
    num_likes = num_favorite_count
    num_loves = num_wows = num_hahas = num_sads = num_angrys = 0

    jsonResult.append({'post_id': tweet_id, 'message': tweet_message, 'name': screen_name,
                       'link': tweet_link, 'created_time': tweet_published,
                       'num_reactions': num_favorite_count, 'num_comments': num_comments,
                       'num_shares': num_shares, 'num_likes': num_likes,
                       'num_loves': num_loves, 'num_wows': num_wows,
                       'num_hahas': num_hahas, 'num_sads': num_sads,
                       'num_angrys': num_angrys, 'hashtags': hashtags})

Twit Crawling

def main():
    screen_name = "jtbc_news"
    num_posts = 50
    jsonResult = []

    client = oauth2_request(CONSUMER_KEY, CONSUMER_SECRET, ACCESS_TOKEN, ACCESS_SECRET)
    tweets = get_user_timeline(client, screen_name)

    for tweet in tweets:
        getTwitterTwit(tweet, jsonResult)

    with open('%s_twitter.json' % (screen_name), 'w', encoding='utf8') as outfile:
        str_ = json.dumps(jsonResult, indent=4, sort_keys=True, ensure_ascii=False)
        outfile.write(str_)

3. NAVER CRAWLING

Naver Search API

def get_request_url(url):
    req = urllib.request.Request(url)
    req.add_header("X-Naver-Client-Id", app_id)
    req.add_header("X-Naver-Client-Secret", "scxkesjyib")
    try:
        response = urllib.request.urlopen(req)
        if response.getcode() == 200:
            print ("[%s] Url Request Success" % datetime.datetime.now())
            return response.read().decode('utf-8')
    except Exception as e:
        print(e)
        print("[%s] Error for URL : %s" % (datetime.datetime.now(), url))
        return None

Naver Search API

def getNaverSearchResult(sNode, search_text, page_start, display):
    base = "https://openapi.naver.com/v1/search"
    node = "/%s.json" % sNode
    parameters = "?query=%s&start=%s&display=%s" % (urllib.parse.quote(search_text), page_start, display)
    url = base + node + parameters

    retData = get_request_url(url)

    if (retData == None):
        return None
    else:
        return json.loads(retData)

Naver Search API

def getPostData(post, jsonResult):
    title = post['title']
    description = post['description']
    org_link = post['originallink']
    link = post['link']

    # Tue, 14 Feb 2017 18:46:00 +0900
    pDate = datetime.datetime.strptime(post['pubDate'], '%a, %d %b %Y %H:%M:%S +0900')
    pDate = pDate.strftime('%Y-%m-%d %H:%M:%S')

    jsonResult.append({'title': title, 'description': description,
                       'org_link': org_link, 'link': link, 'pdate': pDate})
    return

Response structure:

{
    "display": 100,
    "items": [
        {
            "description": "[article summary]",
            "link": "[Naver URL]",
            "originallink": "[original article URL]",
            "pubDate": "Thu, 13 Jul 2017 03:38:00 +0900",
            "title": "[article title]"
        }
    ],
    "lastBuildDate": "Thu, 13 Jul 2017 03:50:47 +0900",
    "start": 1,
    "total": 454265
}

Naver Search API

def main():
    jsonResult = []

    # 'news', 'blog', 'cafearticle'
    sNode = 'news'
    search_text = '탄핵'
    display_count = 100

    jsonSearch = getNaverSearchResult(sNode, search_text, 1, display_count)

    while ((jsonSearch != None) and (jsonSearch['display'] != 0)):
        for post in jsonSearch['items']:
            getPostData(post, jsonResult)

        nStart = jsonSearch['start'] + jsonSearch['display']
        jsonSearch = getNaverSearchResult(sNode, search_text, nStart, display_count)

    with open('%s_naver_%s.json' % (search_text, sNode), 'w', encoding='utf8') as outfile:
        retJson = json.dumps(jsonResult, indent=4, sort_keys=True, ensure_ascii=False)
        outfile.write(retJson)

Naver Map (지도) API

def getGeoData(address):
    base = "https://openapi.naver.com/v1/map/geocode"
    node = ""
    parameters = "?query=%s" % urllib.parse.quote(address)
    url = base + node + parameters

    retData = get_request_url(url)

    if (retData == None):
        return None
    else:
        return json.loads(retData)
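A quick usage sketch; the address is a hypothetical example, and the response is printed as-is rather than parsed, since its exact layout depends on the API version (assumes the json and urllib imports used earlier):

if __name__ == '__main__':
    # Hypothetical example address (Seoul City Hall); replace with your own.
    jsonData = getGeoData('서울특별시 중구 세종대로 110')
    if jsonData is not None:
        print(json.dumps(jsonData, indent=4, sort_keys=True, ensure_ascii=False))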

4. PUBLIC DATA PORTAL (공공데이터포털) CRAWLING

Visitor Counts for Paid Tourist Attractions Nationwide (전국유료관광지입장객정보)

def getTourPointVisitor(yyyymm, sido, gungu, nPageNum, nItems):
    end_point = "http://openapi.tour.go.kr/openapi/service/TourismResourceStatsService/getPchrgTrrsrtVisitorList"

    parameters = "?_type=json&serviceKey=" + access_key
    parameters += "&YM=" + yyyymm
    parameters += "&SIDO=" + urllib.parse.quote(sido)
    parameters += "&GUNGU=" + urllib.parse.quote(gungu)
    parameters += "&RES_NM=&pageNo=" + str(nPageNum)
    parameters += "&numOfRows=" + str(nItems)

    url = end_point + parameters

    retData = get_request_url(url)

    if (retData == None):
        return None
    else:
        return json.loads(retData)

Visitor Counts for Paid Tourist Attractions Nationwide (전국유료관광지입장객정보)

def getTourPointData(item, yyyymm, jsonResult):
    addrCd = 0 if 'addrCd' not in item.keys() else item['addrCd']
    gungu = '' if 'gungu' not in item.keys() else item['gungu']
    sido = '' if 'sido' not in item.keys() else item['sido']
    resNm = '' if 'resNm' not in item.keys() else item['resNm']
    rnum = 0 if 'rnum' not in item.keys() else item['rnum']
    ForNum = 0 if 'csForCnt' not in item.keys() else item['csForCnt']
    NatNum = 0 if 'csNatCnt' not in item.keys() else item['csNatCnt']

    jsonResult.append({'yyyymm': yyyymm, 'addrCd': addrCd, 'gungu': gungu,
                       'sido': sido, 'resNm': resNm, 'rnum': rnum,
                       'ForNum': ForNum, 'NatNum': NatNum})

Visitor Counts for Paid Tourist Attractions Nationwide (전국유료관광지입장객정보)

nItems = 100
nStartYear = 2011
nEndYear = 2017

for year in range(nStartYear, nEndYear):
    for month in range(1, 13):
        yyyymm = "{0}{1:0>2}".format(str(year), str(month))
        nPageNum = 1

        while True:
            jsonData = getTourPointVisitor(yyyymm, sido, gungu, nPageNum, nItems)

            if (jsonData['response']['header']['resultMsg'] == 'OK'):
                nTotal = jsonData['response']['body']['totalCount']

                if nTotal == 0:
                    break

                for item in jsonData['response']['body']['items']['item']:
                    getTourPointData(item, yyyymm, jsonResult)

                nPage = math.ceil(nTotal / nItems)
                if (nPageNum == nPage):
                    break
                nPageNum += 1
            else:
                break

Entry and Departure Tourism Statistics Service (출입국관광통계서비스)

def getNatVisitor(yyyymm, nat_cd, ed_cd):
    end_point = "http://openapi.tour.go.kr/openapi/service/EdrcntTourismStatsService/getEdrcntTourismStatsList"

    parameters = "?_type=json&serviceKey=" + access_key
    parameters += "&YM=" + yyyymm
    parameters += "&NAT_CD=" + nat_cd
    parameters += "&ED_CD=" + ed_cd

    url = end_point + parameters

    retData = get_request_url(url)

    if (retData == None):
        return None
    else:
        return json.loads(retData)

Entry and Departure Tourism Statistics Service (출입국관광통계서비스)

def main():
    # ... code omitted ...
    for year in range(nStartYear, nEndYear):
        for month in range(1, 13):
            yyyymm = "{0}{1:0>2}".format(str(year), str(month))
            jsonData = getNatVisitor(yyyymm, national_code, ed_cd)

            if (jsonData['response']['header']['resultMsg'] == 'OK'):
                krName = jsonData['response']['body']['items']['item']["natKorNm"]
                krName = krName.replace(' ', '')
                iTotalVisit = jsonData['response']['body']['items']['item']["num"]
                print('%s_%s : %s' % (krName, yyyymm, iTotalVisit))

                jsonResult.append({'nat_name': krName, 'nat_cd': national_code,
                                   'yyyymm': yyyymm, 'visit_cnt': iTotalVisit})
    # ... code omitted ...
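The slide omits the setup and save code around this loop. A plausible sketch following the pattern of the earlier crawlers; every value and name here is an assumption for illustration, not the author's original code:

def main():
    jsonResult = []
    national_code = "112"   # hypothetical country code; check the service's code table
    ed_cd = "E"             # hypothetical flag selecting entry vs. departure statistics
    nStartYear, nEndYear = 2011, 2017

    # ... the loop shown above ...

    with open('%s_visitors.json' % national_code, 'w', encoding='utf8') as outfile:
        retJson = json.dumps(jsonResult, indent=4, sort_keys=True, ensure_ascii=False)
        outfile.write(retJson)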

5. GENERAL WEB PAGE CRAWLING

Web Page Copyright (웹페이지 저작권)

BeautifulSoup4: an HTML parsing package

[Python installation path]> pip install beautifulsoup4
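A quick sanity check after installing; 'html.parser' is Python's built-in parser, so no extra dependency is needed:

>>> from bs4 import BeautifulSoup
>>> BeautifulSoup('<b>ok</b>', 'html.parser').b.text
'ok'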

BeautifulSoup4: an HTML parsing package

>>> from bs4 import BeautifulSoup
>>> html = '<td class="title"><div class="tit3"><a href="/movie/bi/mi/basic.nhn?code=136872" title="미녀와야수">미녀와야수</a></div></td>'
>>> soup = BeautifulSoup(html, 'html.parser')
>>> soup
<td class="title"><div class="tit3"><a href="/movie/bi/mi/basic.nhn?code=136872" title="미녀와야수">미녀와야수</a></div></td>
>>> tag = soup.td
>>> tag
<td class="title"><div class="tit3"><a href="/movie/bi/mi/basic.nhn?code=136872" title="미녀와야수">미녀와야수</a></div></td>
>>> tag = soup.div
>>> tag
<div class="tit3"><a href="/movie/bi/mi/basic.nhn?code=136872" title="미녀와야수">미녀와야수</a></div>
>>> tag = soup.a
>>> tag
<a href="/movie/bi/mi/basic.nhn?code=136872" title="미녀와야수">미녀와야수</a>
>>> tag.name
'a'

BeautifulSoup4: an HTML parsing package

>>> from bs4 import BeautifulSoup
>>> html = '<td class="title"><div class="tit3"><a href="/movie/bi/mi/basic.nhn?code=136872" title="미녀와야수">미녀와야수</a></div></td>'
>>> soup = BeautifulSoup(html, 'html.parser')
>>> soup
<td class="title"><div class="tit3"><a href="/movie/bi/mi/basic.nhn?code=136872" title="미녀와야수">미녀와야수</a></div></td>
>>> tag = soup.td
>>> tag['class']
['title']
>>> tag = soup.div
>>> tag['class']
['tit3']
>>>
>>> tag.attrs
{'class': ['tit3']}

BeautifulSoup4: an HTML parsing package

>>> from bs4 import BeautifulSoup
>>> html = '<td class="title"><div class="tit3"><a href="/movie/bi/mi/basic.nhn?code=136872" title="미녀와야수">미녀와야수</a></div></td>'
>>> soup = BeautifulSoup(html, 'html.parser')
>>> soup
<td class="title"><div class="tit3"><a href="/movie/bi/mi/basic.nhn?code=136872" title="미녀와야수">미녀와야수</a></div></td>
>>> tag = soup.find('td', attrs={'class':'title'})
>>> tag
<td class="title"><div class="tit3"><a href="/movie/bi/mi/basic.nhn?code=136872" title="미녀와야수">미녀와야수</a></div></td>
>>> tag = soup.find('div', attrs={'class':'tit3'})
>>> tag
<div class="tit3"><a href="/movie/bi/mi/basic.nhn?code=136872" title="미녀와야수">미녀와야수</a></div>

BeautifulSoup4: an HTML parsing package

>>> import urllib.request
>>> from bs4 import BeautifulSoup
>>> html = urllib.request.urlopen('http://movie.naver.com/movie/sdb/rank/rmovie.nhn')
>>> soup = BeautifulSoup(html, 'html.parser')
>>> print(soup.prettify())
<!DOCTYPE html>
<html>
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="content-type"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="http://imgmovie.naver.com/today/naverme/naverme_profile.jpg" property="me2:image"/>
  <meta content="네이버영화" property="me2:post_tag"/>
  <meta content="네이버영화" property="me2:category1"/>
  ...(omitted)...
  <!-- //Footer -->
  </div>
 </body>
</html>

BeautifulSoup4: an HTML parsing package

<td class="title">
    <div class="tit3">
        <a href="/movie/bi/mi/basic.nhn?code=135874" title="스파이더맨: 홈커밍">스파이더맨: 홈커밍</a>
    </div>
</td>

BeautifulSoup4: an HTML parsing package

>>> tags = soup.find_all('div', attrs={'class':'tit3'})
>>> tags
[<div class="tit3">
<a href="/movie/bi/mi/basic.nhn?code=135874" title="스파이더맨: 홈커밍">스파이더맨: 홈커밍</a>
</div>, <div class="tit3">
<a href="/movie/bi/mi/basic.nhn?code=146480" title="덩케르크">덩케르크</a>
</div>, <div class="tit3">
<a href="/movie/bi/mi/basic.nhn?code=76309" title="플립">플립</a>
...(omitted)...
<a href="/movie/bi/mi/basic.nhn?code=149048" title="100미터">100미터</a>
</div>]
>>> for tag in tags:
...     print(tag.a)
<a href="/movie/bi/mi/basic.nhn?code=135874" title="스파이더맨: 홈커밍">스파이더맨: 홈커밍</a>
...(omitted)...
>>> for tag in tags:
...     print(tag.a.text)
스파이더맨: 홈커밍
덩케르크
...(omitted)...
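Putting the pieces together, a minimal end-to-end sketch that fetches the ranking page used above, extracts the titles, and saves them in the same JSON style as the earlier crawlers. The URL and class names are the ones from the examples; the output file name is an assumption:

import json
import urllib.request
from bs4 import BeautifulSoup

def main():
    # Fetch the Naver Movie daily ranking page used in the examples above.
    html = urllib.request.urlopen('http://movie.naver.com/movie/sdb/rank/rmovie.nhn')
    soup = BeautifulSoup(html, 'html.parser')

    # Each ranked title sits in <div class="tit3"><a title="...">...</a></div>.
    jsonResult = []
    for rank, tag in enumerate(soup.find_all('div', attrs={'class': 'tit3'}), start=1):
        jsonResult.append({'rank': rank,
                           'title': tag.a.text.strip(),
                           'link': tag.a['href']})

    # Save in the same JSON style as the earlier crawlers (file name is an assumption).
    with open('naver_movie_rank.json', 'w', encoding='utf8') as outfile:
        outfile.write(json.dumps(jsonResult, indent=4, sort_keys=True, ensure_ascii=False))

if __name__ == '__main__':
    main()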