Week 13 Social Data Mining 02 Joonhwan Lee Human-Computer Interaction + Design Lab.
Crawling Twitter Data OAuth Crawling Data using OpenAPI Advanced Web Crawling
1. Crawling Twitter Data
Twitter API
REST API: https://dev.twitter.com/rest/public
Streaming API: https://dev.twitter.com/streaming/public
Both require application credentials (consumer_key etc.), obtained by registering an application.
Documentation: https://dev.twitter.com/docs
Registering a Twitter application: https://apps.twitter.com/app/new
Twitter Consumer Key and Access Token: after registering, the Consumer Key/Secret and Access Token/Secret are shown on the application's settings page.
Data Formats for Exchange
The Twitter and Facebook APIs return data as JSON (JavaScript Object Notation), a lightweight text format built from key:value pairs (http://ko.wikipedia.org/wiki/json).
Data Formats for Exchange
JSON (JavaScript Object Notation): an object is a set of key:value pairs enclosed in braces, e.g. {"name1": true, "name2": 50, "name3": "text"}. Values may be strings, numbers, booleans, arrays, or nested objects.
Data Formats for Exchange JSON: Facebook Example
Data Formats for Exchange
XML (Extensible Markup Language): a W3C-standard markup language. Like HTML, it uses nested tags, but it is more verbose than JSON.
Data Formats for Exchange XML: Food Menu
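A food-menu XML document can be parsed in Python with the standard xml.etree module; the tiny menu below is a minimal stand-in for the slide's example:

```python
import xml.etree.ElementTree as ET

# Minimal stand-in for the classic "food menu" XML example.
xml_string = """
<breakfast_menu>
  <food>
    <name>Belgian Waffles</name>
    <price>$5.95</price>
  </food>
  <food>
    <name>French Toast</name>
    <price>$4.50</price>
  </food>
</breakfast_menu>
"""

root = ET.fromstring(xml_string)
for food in root.findall("food"):
    # Each <food> element has <name> and <price> children.
    print(food.find("name").text, food.find("price").text)
```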
Using JSON from Python
Python's built-in json module parses a JSON string:
import json
json_data = json.loads(json_string)
* json_data is a Python dictionary
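A short round trip with the standard json module; the payload below is made up for illustration:

```python
import json

# A JSON string such as an API might return (illustrative only).
json_string = '{"name": "hong", "age": 25, "tags": ["python", "crawling"], "active": true}'

json_data = json.loads(json_string)   # JSON text -> Python dict
print(type(json_data))                # <class 'dict'>
print(json_data["tags"][0])           # python

# json.dumps goes the other way: Python dict -> JSON text
print(json.dumps(json_data, sort_keys=True))
```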
Twitter: crawling with tweepy
With the OAuth token, tweepy builds an api object:

import tweepy

# OAuth setup
consumer_key = 'YOUR-CONSUMER-KEY'
consumer_secret = 'YOUR-CONSUMER-SECRET'
access_token = 'YOUR-ACCESS-TOKEN'
access_secret = 'YOUR-ACCESS-SECRET'

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
api = tweepy.API(auth)
Twitter
api = tweepy.API(auth)
my_timeline = api.home_timeline()
for tweet in my_timeline:
    print(tweet.text)
>> RT @skibbie81: ...
>> RT @tora_ru: ...
>> RT @Keyton_S_Park: ...
>> RT @PRESSIAN_news: ... https://t.co/o1kacdaumx
Twitter Streaming APIs (https://dev.twitter.com/docs/streaming-apis)
Unlike the REST API, which returns data per request, the Streaming API pushes tweets continuously over a persistent connection. Three kinds: Public Streaming API, User Streaming API, Site Streaming API.
Twitter Streaming APIs
Public Streaming API: a sample (roughly 1%) of all public tweets; a filtered stream can track up to 400 keywords (Global Trends). User Streaming API: the stream of a single authenticated user. Site Streaming API: user streams for many users at once.
Streaming Tweet Data
To use the Streaming API with tweepy: subclass StreamListener and create a listener instance, then create a Stream with the OAuth handler and the listener. The Stream object connects to the Twitter Streaming API and delivers incoming tweets to the listener's callbacks.
2. OAuth
Facebook Login (diagram)
Each Facebook user (User A, B, C, D, Me) logs in with a Login/Password pair, which Facebook checks against the Facebook DB.
OAuth
OAuth lets a third-party application access a user's data without the user's password: instead of an ID/password, the third party receives an Access Token. OAuth is used by Twitter, Facebook, Google, and many other services.
OAuth (diagram)
Each Facebook App requests an access privilege from the user and asks Facebook for an OAuth token; once the user grants access, Facebook issues the app an OAuth token, which the app then uses to access the Facebook DB instead of the user's password.
3. Crawling using OpenAPI
OpenAPI
Like Twitter, many services publish their data through an OpenAPI; government and public institutions also provide open data through OpenAPI portals.
Using an OpenAPI
Register an application with the API provider to obtain an app-key; the key is attached to each request, and the provider typically limits the number of calls per key.
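A request URL carrying an app-key can be assembled with the standard urllib module; the endpoint and parameter names below are hypothetical — each portal documents its own base URL and parameters:

```python
from urllib.parse import urlencode

# Hypothetical OpenAPI endpoint and app-key (illustration only).
base_url = "http://api.example.org/openapi/service"
params = {
    "serviceKey": "YOUR-APP-KEY",  # the app-key issued at registration
    "pageNo": 1,
    "numOfRows": 10,
}
request_url = base_url + "?" + urlencode(params)
print(request_url)
```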
OpenAPI provider sites: data.go.kr (Government 3.0 public data portal)
OpenAPI provider sites: data.seoul.go.kr (Seoul Open Data Plaza)
4. Advanced Web Crawling 1
Dynamic Websites
Many sites load content with AJAX, updating the page with no page reload (e.g. Facebook, Twitter); crawling the static HTML misses this content (e.g. Media Daum comments).
Use the browser developer tools (network inspector) to find the underlying request and crawl the JSON response directly.
Developer Tools
In Chrome (similar in Safari): View > Developer > Developer Tools, then open the Network tab.
Developer Tools
Start recording network traffic, then filter the requests for "comment".
Developer Tools
Inspect the matching request URL (it includes the post id).
Developer Tools
http://comment.daum.net/apis/v1/posts/15712900/comments?parentid=0&offset=0&limit=3&sort=recommend
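The request found in the inspector can be rebuilt in Python; composing the query string with urllib makes the parameters explicit (parameter names are those in the URL on the slide; actually fetching it requires network access):

```python
from urllib.parse import urlencode

# Endpoint and parameters as discovered in the browser's network inspector.
base = "http://comment.daum.net/apis/v1/posts/15712900/comments"
params = {"parentid": 0, "offset": 0, "limit": 3, "sort": "recommend"}
url = base + "?" + urlencode(params)
print(url)
# The JSON response could then be fetched with urllib.request (or requests)
# and parsed with json.loads().
```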
5. Advanced Web Crawling 2
Crawling Using Selenium & Webdriver
Selenium: a Python library for automating web browsers (pip install selenium); it can load a URL, click a link, and so on. Selenium drives the browser through a driver:
Firefox driver: https://github.com/mozilla/geckodriver/releases
Chrome driver: https://sites.google.com/a/chromium.org/chromedriver/downloads
Using Selenium Sample Code

from selenium import webdriver
from bs4 import BeautifulSoup

url = "..."
driver = webdriver.Firefox()
driver.get(url)
element = driver.find_element_by_xpath("//div[@class='alex_more']")
element.click()
html = driver.page_source
soup = BeautifulSoup(html, "html.parser")
## process soup
XPath
driver.find_element_by_xpath('//h1') — matches <h1>~</h1>
...xpath('//div') — matches <div>~</div>
...xpath('//div[@class="footer"]') — matches <div class="footer">~</div>
...xpath('//div[@id="nav"]') — matches <div id="nav">~</div>
...xpath('//div[@class="header"]//a[@id="twitter_anywhere"]') — matches <a id="twitter_anywhere"> inside <div class="header">
...xpath('//ul[@class="paging"]//li[not(@class="btn btn_next")]') — matches <li> elements inside <ul class="paging"> whose class is not "btn btn_next"
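Selenium evaluates XPath with the browser's full engine; for quick offline experimenting, Python's standard xml.etree supports a useful subset of the same selectors (no not(...), and paths start with ".//"). The fragment below is made up for illustration:

```python
import xml.etree.ElementTree as ET

# A tiny well-formed HTML-like fragment to try selectors against.
html = """
<html>
  <body>
    <div class="header"><a id="twitter_anywhere" href="#">follow</a></div>
    <div class="footer">contact</div>
    <div id="nav">menu</div>
  </body>
</html>
"""
root = ET.fromstring(html)

# Attribute predicates work much like the Selenium examples above.
footer = root.find('.//div[@class="footer"]')
print(footer.text)   # contact

nav = root.find('.//div[@id="nav"]')
print(nav.text)      # menu

# A nested path: the <a> child of the header div, selected by id.
link = root.find('.//div[@class="header"]/a[@id="twitter_anywhere"]')
print(link.text)     # follow
```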
Questions?