Web Information Retrieval
September 21, 2001
School of Computer Science, Kookmin University
Seung-Shik Kang
Contents
- Web search engines
- Domestic and international search engines
- Characteristics of the Web and of its users
- Web search engine issues
- Web spider (crawler)
- Ranking: document connectivity techniques
  - PageRank, HITS
- Conclusion
Search engine development (international)
- Lycos: CMU research project (1994)
- Excite: Stanford graduate students
- OpenText: University of Waterloo
- HotBot: built on U.C. Berkeley's search engine
- AltaVista: DEC (1995)
- Google: Stanford graduate students
- Inktomi, Northern Light, etc.
Search engine development (domestic)
- Simmani: Haansoft (한글과컴퓨터)
- 까치네: a Daegu University student club
- Naver (June 1998): Samsung SDS
- Empas (Nov. 1999): Soongsil University
- HanMir, 와카노
- WiseNut (Aug. 2001)
- Other portal sites
http://kr.wisenut.com/ (screenshot)
Evolution of search engines
- Retrieval model: Boolean model → vector model
- Queries: keyword search → natural-language search
- Extra features: image/sound search, clustering of search results
- H/W: workstation → PC servers
Search engine evaluation measures
- Recall: the fraction of the relevant documents that are retrieved
- Precision: the fraction of the retrieved documents that are relevant
- F-measure: combines recall and precision into a single value
- R-precision (precision at rank n): the fraction of relevant documents among the top n results
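A minimal sketch of these measures in Python; the `retrieved` ranked list and the `relevant` judgment set are hypothetical inputs, not from the slides:

```python
def recall(retrieved, relevant):
    """Fraction of the relevant documents that were retrieved."""
    return len(set(retrieved) & relevant) / len(relevant)

def precision(retrieved, relevant):
    """Fraction of the retrieved documents that are relevant."""
    return len(set(retrieved) & relevant) / len(retrieved)

def f_measure(retrieved, relevant):
    """Harmonic mean of precision and recall."""
    p, r = precision(retrieved, relevant), recall(retrieved, relevant)
    return 2 * p * r / (p + r) if p + r else 0.0

def precision_at(retrieved, relevant, n):
    """Precision computed over the top-n results only."""
    return precision(retrieved[:n], relevant)
```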
Search results for "princess diana" (figure)
- Engine 1 vs. Engine 2 vs. Engine 3, with each result classified as:
  - relevant and high quality
  - relevant but low quality
  - not relevant (index pollution)
Evaluation of domestic search engines (Sungkonghoe University)
- http://green.skhu.ac.kr/~skhuir/
- Method: average precision@k over the top 10 results, e.g.
  (1 + 1 + 1 + 3/4 + 4/5 + 5/6 + 6/7 + 7/8 + 7/9 + 8/10) / 10 = 0.87
- 2001: Naver, Empas, Hanmir & Lycos, Simmani, Yahoo, MSN, Daum, AltaVista
- 2000: Hanmir, Lycos, Naver & Yahoo, Simmani, AltaVista
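A sketch reproducing the slide's number: the formula is precision@k averaged over k = 1..10, with the per-rank relevance pattern inferred from the terms on the slide:

```python
# rel[k] is True if the (k+1)-th result was judged relevant; this particular
# pattern (misses at ranks 4 and 9) is what the slide's formula implies.
rel = [True, True, True, False, True, True, True, True, False, True]

# average precision@k over the top 10 ranks
score = sum(sum(rel[:k]) / k for k in range(1, 11)) / 10
print(round(score, 2))  # 0.87
```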
Changes in domestic search engine rankings (chart)
Performance of internet search engines
- Number of web documents: about 1 billion (Dec. 1999); 10-30 million in Korean
- Web spider (crawler)
- Ranking of search results
  - number of relevant documents within the top 20-30
  - ranking algorithm
- Update cycle: 1 day to 2 weeks
Characteristics of the Web
- Instability: 23% of pages change per day, 38% per week
- Diverse data: text, image, sound, script
- Documents in many languages
- Duplicate documents
  - syntactic: about 30%
  - semantic: ???
- High linkage: 8 links/page on average
Query and user characteristics
- Queries
  - 2.35 terms on average
  - imprecise queries
  - queries without operators: about 80%
- Users
  - 85% look at one screen of results only
  - 78% never modify their query
  - follow links instead
Components of a web search engine
- Web spider (crawler): collects web documents
- Indexer: extracts index terms and builds the index storage structure
- Search interface: query analysis and retrieval
Web information retrieval issues
- Document collection
  - priority: pages that are updated daily?
  - load balancing: internal, external
  - trap avoidance
  - dead servers
  - deleted pages
- Document processing
  - duplicate elimination
  - index term extraction and storage structure
  - query-independent ranking
  - document classification
Web information retrieval issues (continued)
- Query processing
  - query-dependent ranking
  - duplicate elimination
  - query refinement/expansion
  - clustering of search results
Web document collection
- Goal: collect the documents that match users' needs
  - static: html, text, image, audio
  - dynamic: DB access
- Issues
  - obtaining the URL list: hyperlinks, all web servers (IPs)
  - how to collect static pages
  - how to learn dynamic pages
Web crawling
- Crawling process (see the sketch below)
  1. Get the link at the top of the queue of links to explore
     (expired pages from the index are re-queued as well)
  2. Fetch the page
  3. Index the page and parse its links
  4. Add each new URL to the queue
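A bare-bones sketch of this loop, with simple stand-ins for HTTP fetching and link extraction; a real crawler would add robots.txt handling, politeness delays, and robust HTML parsing:

```python
from collections import deque
from urllib.request import urlopen
import re

def fetch(url):
    # naive page fetch; real crawlers handle timeouts, encodings, MIME types
    return urlopen(url).read().decode("utf-8", errors="ignore")

def extract_links(html):
    # crude absolute-link extraction; a real parser is far more careful
    return re.findall(r'href="(http[^"]+)"', html)

def crawl(seed_urls, max_pages=100):
    queue, seen, index = deque(seed_urls), set(seed_urls), {}
    while queue and len(index) < max_pages:
        url = queue.popleft()              # 1. get link at top of queue
        try:
            page = fetch(url)              # 2. fetch page
        except OSError:
            continue                       # dead server / deleted page
        index[url] = page                  # 3. index page
        for link in extract_links(page):   #    ... and parse its links
            if link not in seen:
                seen.add(link)
                queue.append(link)         # 4. add new URL to queue
    return index
```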
Queuing discipline
- Standard graph exploration
  - random
  - BFS
  - DFS (+ depth limits)
- Priority based on query-independent ranking
  - highest in-degree
  - highest potential PageRank
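One way to sketch the priority-based discipline: a frontier that always pops the URL with the highest in-degree seen so far, a cheap proxy for potential PageRank. The class and its API are illustrative, not from the slides:

```python
import heapq

class PriorityFrontier:
    """Pops the not-yet-fetched URL with the highest known in-degree."""
    def __init__(self):
        self.heap, self.indegree, self.done = [], {}, set()

    def add(self, url):
        # each newly discovered link to `url` raises its priority
        self.indegree[url] = self.indegree.get(url, 0) + 1
        heapq.heappush(self.heap, (-self.indegree[url], url))

    def pop(self):
        while self.heap:
            d, url = heapq.heappop(self.heap)
            # skip entries that are stale or already fetched
            if url not in self.done and -d == self.indegree[url]:
                self.done.add(url)
                return url
        raise IndexError("frontier is empty")
```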
Load balancing
- Internal
  - response time
  - size of answers
  - number of threads, number of open connections, etc.
- External
  - server overload
  - queuing discipline
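A sketch of the external side: enforce a minimum delay between consecutive requests to the same host so no server is overloaded. The 2-second default is an assumed value, not from the slides:

```python
import time
from urllib.parse import urlparse

class PoliteFetcher:
    """Tracks the last request time per host and sleeps if needed."""
    def __init__(self, min_delay=2.0):
        self.min_delay, self.last_hit = min_delay, {}

    def wait_for(self, url):
        host = urlparse(url).netloc
        elapsed = time.time() - self.last_hit.get(host, 0.0)
        if elapsed < self.min_delay:
            time.sleep(self.min_delay - elapsed)
        self.last_hit[host] = time.time()
```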
Ranking: examples
- Query-independent ranking: assign a weight to each document
- Query-dependent ranking: the cosine measure of the vector model (see the sketch below)
- Document analysis techniques
  - ad-hoc factors: publication, location, human annotation, web advertising?
  - document connectivity techniques
    - query-independent: PageRank, in-degree
    - query-dependent: HITS
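A minimal sketch of the cosine measure over raw term frequencies; a real engine would add idf weighting, stemming, and precomputed document norms:

```python
import math
from collections import Counter

def cosine(query, doc):
    """Cosine of the angle between two term-frequency vectors."""
    q, d = Counter(query.split()), Counter(doc.split())
    dot = sum(q[t] * d[t] for t in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * \
           math.sqrt(sum(v * v for v in d.values()))
    return dot / norm if norm else 0.0

print(cosine("web search engine", "a web search engine for web pages"))
```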
Document connectivity techniques (PageRank)
- Idea: exploit the hyperlink information of the Web
- Assumptions
  - links often connect related pages
  - a link between pages is a recommendation
PageRank: query-independent ranking
- Graph representation of the Web: an edge (u, v) is a link from page u to page v
- Quality of a web page: determined by its in-degree and by the quality of the pages that link to it
- The PageRank of a page is proportional to the time a user spends on that page
- One of the ranking techniques used by Google
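A power-iteration sketch of PageRank on a toy graph; the damping factor d = 0.85 is the customary choice, not something stated on the slide, and dangling pages (no out-links) are ignored for brevity:

```python
def pagerank(graph, d=0.85, iters=50):
    """graph: {page: [pages it links to]}; returns {page: score}."""
    pages = list(graph)
    pr = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iters):
        # every page gets the base (1 - d) share, plus d times the
        # rank flowing in from each page that links to it
        new = {p: (1 - d) / len(pages) for p in pages}
        for u, outs in graph.items():
            for v in outs:
                new[v] += d * pr[u] / len(outs)
        pr = new
    return pr

toy = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
print(pagerank(toy))
```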
HITS: query-dependent ranking
- Given a query, find:
  - good sources of content (authorities)
  - good sources of links (hubs)
- Better authority comes from in-edges from good hubs
- Being a better hub comes from out-edges to good authorities
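A sketch of the mutually reinforcing HITS iteration on a small query-specific subgraph, with L2 normalization each round:

```python
import math

def hits(graph, iters=20):
    """graph: {page: [pages it links to]}, restricted to the query's base set."""
    auth = {p: 1.0 for p in graph}
    hub = {p: 1.0 for p in graph}
    for _ in range(iters):
        # authority score: sum of hub scores over in-edges
        auth = {p: sum(hub[u] for u in graph if p in graph[u]) for p in graph}
        # hub score: sum of authority scores over out-edges
        hub = {p: sum(auth[v] for v in graph[p]) for p in graph}
        for scores in (auth, hub):
            norm = math.sqrt(sum(x * x for x in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return auth, hub
```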
Modified HITS
- Problem: some edges are wrong
  - multiple edges from the same author
  - automatically generated links
- Solution: edge weighting (see the sketch below)
- Problem: topic drift
  - e.g., "jaguar + car" drifts toward generic pages about cars
- Solution: analyze content and assign topic scores to nodes
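A sketch of the edge-weighting fix: when k pages on the same host all point at one target, each of those edges is down-weighted to 1/k so a single author cannot dominate the computation. The exact scheme here is illustrative:

```python
from collections import Counter
from urllib.parse import urlparse

def weighted_edges(edges):
    """edges: list of (source_url, target_url); returns {(u, v): weight}."""
    # count parallel edges from the same host to the same target
    per_host = Counter((urlparse(u).netloc, v) for u, v in edges)
    return {(u, v): 1.0 / per_host[(urlparse(u).netloc, v)]
            for u, v in edges}
```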
HITS experimental results (chart): valuable pages within the top 10 answers,
averaged over 28 topics; authority and hub quality compared for the original
algorithm, edge weighting, and edge weighting + content analysis.
PageRank vs. HITS
- PageRank
  - computation: expensive, but done once for all documents and queries (offline)
  - query-independent: requires combination with query-dependent criteria
  - hard to spam
- HITS
  - computation: expensive, and required for each query
  - query-dependent
  - relatively easy to spam
  - quality depends on the quality of the start set
  - gives hubs as well as authorities
Connectivity Server
- Basic operations
  - InEdges(URL u, int k)
  - OutEdges(URL u, int k)
- Difficulties
  - memory usage: 180M nodes, 1B edges
  - preprocessing time: days
  - query time: 0.0001 sec per result URL
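A sketch of the two operations over plain in-memory adjacency lists; the actual Connectivity Server relies on compressed representations to fit 180M nodes and 1B edges in memory:

```python
class ConnectivityServer:
    """Answers in/out-edge queries over a precomputed link graph."""
    def __init__(self, edges):
        # edges: list of (source_url, target_url) pairs
        self.out, self.inc = {}, {}
        for u, v in edges:
            self.out.setdefault(u, []).append(v)
            self.inc.setdefault(v, []).append(u)

    def in_edges(self, u, k):     # InEdges(URL u, int k)
        return self.inc.get(u, [])[:k]

    def out_edges(self, u, k):    # OutEdges(URL u, int k)
        return self.out.get(u, [])[:k]
```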
URL database
- A sorted list of URLs takes 8.7 GB (about 48 bytes/URL)
- Delta encoding reduces it to 3.8 GB (about 21 bytes/URL): each URL is stored
  as the size of the prefix it shares with the previous URL plus the remaining
  text, and its byte offset serves as its node ID:

  Original text                  Delta encoding (node ID: prefix size + suffix)
  www.foobar.com/                 0: 0  www.foobar.com/
  www.foobar.com/gandalf.htm     26: 15 gandalf.htm
  www.foograb.com/               41: 7  grab.com/
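A sketch of the encoder, which reproduces the shared-prefix sizes in the example above:

```python
def delta_encode(sorted_urls):
    """Encode each URL as (shared-prefix length, remaining suffix)."""
    out, prev = [], ""
    for url in sorted_urls:
        k = 0
        while k < min(len(prev), len(url)) and prev[k] == url[k]:
            k += 1
        out.append((k, url[k:]))
        prev = url
    return out

urls = ["www.foobar.com/", "www.foobar.com/gandalf.htm", "www.foograb.com/"]
print(delta_encode(urls))
# [(0, 'www.foobar.com/'), (15, 'gandalf.htm'), (7, 'grab.com/')]
```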
Other I.R. issues
- Duplicate filtering: removing duplicate documents
- Update-cycle issues
- Search results
  - clustering (www.northernlight.com)
  - summarization
- Directory service
  - document classification
- Domain-specific search engines
Duplicate filtering
- Near-duplicate documents
  - computing pair-wise edit distance
  - a short sketch for each document
- Near-duplicate hosts (mirrors): pre-filtering techniques
  - IP-based
  - URL-string based: similar hostnames, similar paths
  - URL-string & hyperlink based
  - hostname & hyperlink based
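One common way to realize the "short sketch per document" idea is shingling: represent each document by its set of word n-grams and compare sets by Jaccard overlap. This specific scheme is an illustration; production systems hash and sample the shingles to keep each sketch small:

```python
def shingles(text, w=4):
    """Set of contiguous word w-grams (shingles) of a document."""
    words = text.split()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

def resemblance(a, b):
    """Jaccard overlap of shingle sets; near 1.0 means near-duplicate."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0
```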
User interface & visualization
- Category or directory overview
  - MeSHBrowse
  - Scatter/Gather
  - 2/3-dimensional overviews
- Query specification
  - Venn diagram
  - filter-flow visualization
  - block-oriented diagram visualization
- Current document set shown in the context of other information types
MeSHBrowse interface for category labels (figure)
Scatter/Gather clustering (figure)
Three-dimensional clustering (figure)
Two-dimensional view of Web pages (figure)
Venn diagram visualization (figure)
Query terms placed in an abstract graphical space (figure)
Graphical depiction of Web link structure (figure)
Conclusion
- Introduced web search engines
- Examined the characteristics of web documents and of web users
- Web information retrieval issues
  - document collection
  - ranking: document connectivity techniques (PageRank, HITS)

Reference: Monika Henzinger (Google), "Web Information Retrieval",
http://www.henzinger.com/~monika