Big Data in Practice: Recommendation System using Mahout
2014.12.23
IT Merchant Development Team, Lee Tae-young (이태영)
Mahout Installation
1) Download Mahout 0.9
   Go to http://mahout.apache.org and download the distribution.
2) Move the archive to your home directory
   $ mv mahout-distribution-0.9.tar.gz ~
3) Extract the archive and create a mahout symbolic link
   $ cd ~
   $ tar xzf mahout-distribution-0.9.tar.gz
   $ ln -s mahout-distribution-0.9 mahout
4) Add MAHOUT_HOME and the PATH entry to .bash_profile

   # .bash_profile

   # Get the aliases and functions
   if [ -f ~/.bashrc ]; then
       . ~/.bashrc
   fi

   # User specific environment and startup programs

   export JAVA_HOME=$HOME/java
   export HADOOP_HOME=$HOME/hadoop
   export PYTHON_HOME=$HOME/python
   export MAHOUT_HOME=$HOME/mahout

   PATH=$PATH:$HOME/bin:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PYTHON_HOME/:$MAHOUT_HOME/bin

   export PATH
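Collaborative Filtering Algorithms
1. User based (first, find similar users)
   Check which items user B, whose tastes are similar to mine, has purchased, then recommend the items B purchased.
2. Item based (first, find similar items)
   Starting from the items I have purchased, recommend related items.
Figures: User based Recommendation, Item based Recommendation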
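The two approaches can be sketched in plain Java without Mahout. This is a minimal illustration, not how Mahout implements them. The class name PurchaseCfSketch, the sample shoppers ("me", "B", "C"), and the use of Jaccard overlap as the similarity are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class PurchaseCfSketch {

    /** Jaccard overlap of two sets: |A and B| / |A or B| (0 when both are empty). */
    static <T> double jaccard(Set<T> a, Set<T> b) {
        Set<T> inter = new HashSet<>(a);
        inter.retainAll(b);
        Set<T> union = new HashSet<>(a);
        union.addAll(b);
        return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
    }

    /** User based: find the most similar other user, then return what they bought that the target has not. */
    static Set<String> userBased(Map<String, Set<String>> purchases, String target) {
        Set<String> mine = purchases.get(target);
        String bestUser = null;
        double bestSim = -1.0;
        for (Map.Entry<String, Set<String>> e : purchases.entrySet()) {
            if (e.getKey().equals(target)) continue;
            double sim = jaccard(mine, e.getValue());
            if (sim > bestSim) { bestSim = sim; bestUser = e.getKey(); }
        }
        Set<String> result = new TreeSet<>();
        if (bestUser != null) {
            result.addAll(purchases.get(bestUser));
            result.removeAll(mine);
        }
        return result;
    }

    /** Item based: return the unowned item whose buyer set best matches the buyers of any owned item. */
    static String itemBased(Map<String, Set<String>> purchases, String target) {
        // Invert "user -> items" into "item -> buyers".
        Map<String, Set<String>> buyers = new TreeMap<>();
        for (Map.Entry<String, Set<String>> e : purchases.entrySet()) {
            for (String item : e.getValue()) {
                buyers.computeIfAbsent(item, k -> new HashSet<>()).add(e.getKey());
            }
        }
        Set<String> mine = purchases.get(target);
        String bestItem = null;
        double bestSim = 0.0;
        for (String candidate : buyers.keySet()) {
            if (mine.contains(candidate)) continue;
            for (String owned : mine) {
                double sim = jaccard(buyers.get(candidate), buyers.get(owned));
                if (sim > bestSim) { bestSim = sim; bestItem = candidate; }
            }
        }
        return bestItem; // null when no unowned item shares a buyer with an owned item
    }

    /** Hypothetical purchase history used only for this illustration. */
    static Map<String, Set<String>> sample() {
        Map<String, Set<String>> p = new TreeMap<>();
        p.put("me", new HashSet<>(Arrays.asList("milk", "bread")));
        p.put("B", new HashSet<>(Arrays.asList("milk", "bread", "butter")));
        p.put("C", new HashSet<>(Arrays.asList("beer", "chips")));
        return p;
    }

    public static void main(String[] args) {
        System.out.println("user based: " + userBased(sample(), "me"));
        System.out.println("item based: " + itemBased(sample(), "me"));
    }
}
```

With this sample data both routes arrive at "butter": the user based route because B is the shopper most similar to "me", the item based route because butter shares a buyer with milk and bread.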
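Similarity
1. Euclidean Distance
   Computes the distance between two objects' preferences; the smaller the distance, the more similar their tastes.
2. Cosine Similarity (equivalent to Pearson similarity when the preference vectors are mean-centered)
   Treats the two objects' preferences as vectors; the smaller the angle between the vectors, the more similar they are.
3. Jaccard Similarity
   The share of the two objects' combined elements (the union) that they have in common (the intersection).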
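The three measures can be written in a few lines of plain Java. This is a sketch for illustration only: the class name SimilaritySketch and the sample vectors are assumptions, and Mahout's own similarity classes handle missing preferences and other details not shown here. Pearson is written as the cosine of the mean-centered vectors to make the relation between the two explicit.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class SimilaritySketch {

    /** Straight-line distance between two preference vectors; smaller means more similar. */
    static double euclideanDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Cosine of the angle between two vectors (NaN if either vector is all zeros). */
    static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Subtracts the mean from every component. */
    static double[] center(double[] v) {
        double mean = 0.0;
        for (double x : v) mean += x;
        mean /= v.length;
        double[] out = new double[v.length];
        for (int i = 0; i < v.length; i++) out[i] = v[i] - mean;
        return out;
    }

    /** Pearson correlation, computed as the cosine of the mean-centered vectors. */
    static double pearson(double[] a, double[] b) {
        return cosine(center(a), center(b));
    }

    /** Size of the intersection divided by the size of the union. */
    static double jaccard(Set<Integer> a, Set<Integer> b) {
        Set<Integer> inter = new HashSet<>(a);
        inter.retainAll(b);
        Set<Integer> union = new HashSet<>(a);
        union.addAll(b);
        return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
    }

    public static void main(String[] args) {
        double[] a = {5, 4, 5};
        double[] b = {1, 2, 1};
        System.out.println("euclidean = " + euclideanDistance(a, b));
        System.out.println("cosine    = " + cosine(a, b));
        System.out.println("pearson   = " + pearson(a, b));
        System.out.println("jaccard   = " + jaccard(
                new HashSet<>(Arrays.asList(1, 2, 3)),
                new HashSet<>(Arrays.asList(2, 3, 4))));
    }
}
```

For the sample vectors the plain cosine is positive while Pearson is negative, because Pearson removes each object's average rating level first. That difference is why the two should not be treated as interchangeable on raw ratings.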
Mahout
What is Mahout?
The Apache Mahout project's goal is to build a scalable machine learning library.
   Clustering
   Classification
   Recommendation (including collaborative filtering)
   Pattern Mining
   Regression
   Evolutionary Algorithms
   Dimension Reduction
Mahout is written in Java.
The Mahout core library can be used directly in ordinary Java programs; it is not limited to Hadoop.
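Recommendation
1. MovieLens dataset (http://grouplens.org/datasets/movielens)
   Rating datasets are provided in three sizes: 100 thousand, 1 million, and 10 million ratings.
   The data consists of users' ratings of movies screened in the United States.
   It is training data for recommendation algorithms, collected from 1997/9/19 to 1998/4/22 by a computer science lab at the University of Minnesota.
   Download (about 4.8 MB)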
Recommendation
2. SLF4J library (http://www.slf4j.org)
   Required for compatibility with the Mahout libraries.
   Download (about 4.3 MB)
Recommendation
3. Guava library (https://code.google.com/p/guava-libraries)
   Mahout's data objects depend on the Guava library.
   Download (about 2.2 MB)
Recommendation
4. Apache Commons Math library (http://commons.apache.org/proper/commons-math)
   A library for numerical computation.
   Download (about 14.3 MB)
Recommendation
Final library list
   commons-math3-3.4.jar
   guava-18.0.jar
   mahout-core-0.9.jar
   mahout-integration-0.9.jar
   mahout-math-0.9.jar
   slf4j-api-1.7.7.jar
   slf4j-nop-1.7.7.jar
Item Based Recommendation
Create the project
   Project name: ItemRecommender
   Java 5 SDK or later
Item Based Recommendation
Project development preparation
   Copy all of the libraries into the lib folder.
   Unzip ml-100k.zip (the MovieLens data) and copy u.data into the data folder.
Description from the README file:
   u.data -- The full u data set, 100000 ratings by 943 users on 1682 items.
   Each user has rated at least 20 movies. Users and items are numbered consecutively from 1.
   The data is randomly ordered. This is a tab separated list of
   user id | item id | rating | timestamp.
   The time stamps are unix seconds since 1/1/1970 UTC.
Recommendation
Project development preparation
   Convert the u.data file to CSV format and save it as movies.csv.
   Replace the tab (\t) that separates the fields in u.data with a comma (,).
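The conversion can be done with any text tool. As one option, here is a small Java sketch that stays within the project's own language. The class name TsvToCsv and the default paths data/u.data and data/movies.csv are assumptions that follow the folder layout described above.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class TsvToCsv {

    /** Replaces every tab in one record with a comma. */
    static String toCsvLine(String tsvLine) {
        return tsvLine.replace('\t', ',');
    }

    /** Copies the input file to the output file line by line, converting the delimiter. */
    static void convert(Path in, Path out) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(in, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(toCsvLine(line));
                writer.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Default paths are assumptions based on the project layout in this deck.
        Path in = Paths.get(args.length > 0 ? args[0] : "data/u.data");
        Path out = Paths.get(args.length > 1 ? args[1] : "data/movies.csv");
        if (!Files.exists(in)) {
            System.err.println("Input file not found: " + in);
            return;
        }
        convert(in, out);
        System.out.println("Wrote " + out);
    }
}
```

Each output record keeps the four fields from the README (user id, item id, rating, timestamp) in the same order, separated by commas.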
User Based Recommendation
Add the class
   Package: recommend.item
   Class: UserRecommend
   Add the libraries to the Java Build Path.
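UserRecommend class

    package recommend.item;

    import java.io.File;
    import java.util.List;

    import org.apache.mahout.cf.taste.impl.common.LongPrimitiveIterator;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.ThresholdUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.recommender.UserBasedRecommender;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    public class UserRecommend {
        public static void main(String[] args) throws Exception {
            /* Create the data model */
            DataModel dm = new FileDataModel(new File("data/movies.csv"));

            /* Create the similarity model */
            UserSimilarity sim = new PearsonCorrelationSimilarity(dm);

            /* Neighborhood: all users whose similarity to the given user
               meets or exceeds the threshold */
            UserNeighborhood neighborhood = new ThresholdUserNeighborhood(0.1, sim, dm);

            /* Create the user based recommender */
            UserBasedRecommender recommender =
                    new GenericUserBasedRecommender(dm, neighborhood, sim);

            int x = 1;
            /* Step through the users in the data model and recommend items for each */
            for (LongPrimitiveIterator users = dm.getUserIDs(); users.hasNext();) {
                long userID = users.nextLong(); /* current user ID */

                /* Recommend 5 items for the current user ID */
                List<RecommendedItem> recommendations = recommender.recommend(userID, 5);

                for (RecommendedItem recommendation : recommendations) {
                    System.out.println(userID + "," + recommendation.getItemID()
                            + "," + recommendation.getValue());
                }

                if (++x > 5) break; /* print only up to user ID 5 */
            }
        }
    }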
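To make the role of the 0.1 threshold concrete, the following plain Java sketch shows the idea behind a threshold neighborhood and a similarity-weighted rating estimate. It is a simplified illustration, not Mahout's implementation. The class name ThresholdNeighborhoodSketch and the sample similarities and ratings are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

final class ThresholdNeighborhoodSketch {

    /** Indices of the users whose similarity to the target meets or exceeds the threshold. */
    static List<Integer> neighborhood(double[] simToTarget, double threshold) {
        List<Integer> result = new ArrayList<>();
        for (int u = 0; u < simToTarget.length; u++) {
            if (!Double.isNaN(simToTarget[u]) && simToTarget[u] >= threshold) {
                result.add(u);
            }
        }
        return result;
    }

    /**
     * Similarity-weighted average of the neighbors' ratings for one item.
     * A NaN rating means "this neighbor has not rated the item".
     */
    static double estimate(double[] simToTarget, double[] ratings, List<Integer> neighbors) {
        double weighted = 0.0;
        double total = 0.0;
        for (int u : neighbors) {
            if (Double.isNaN(ratings[u])) continue;
            weighted += simToTarget[u] * ratings[u];
            total += simToTarget[u];
        }
        return total == 0.0 ? Double.NaN : weighted / total;
    }

    public static void main(String[] args) {
        // Hypothetical similarities of four other users to the target user.
        double[] sims = {0.9, 0.05, 0.5, -0.3};
        // Hypothetical ratings those four users gave to one candidate item.
        double[] ratings = {5.0, 1.0, 2.0, 4.0};

        List<Integer> neighbors = neighborhood(sims, 0.1);
        System.out.println("neighbors = " + neighbors);
        System.out.println("estimate  = " + estimate(sims, ratings, neighbors));
    }
}
```

Raising the threshold shrinks the neighborhood to closer matches, at the cost of having fewer ratings to average; lowering it does the opposite.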
UserRecommend class: execution results
Output format: <user ID, recommended item ID, strength> (the strength is the value returned by getValue())

    sim = new PearsonCorrelationSimilarity(dm);    sim = new LogLikelihoodSimilarity(dm);
    1,1558,5.0                                     1,1500,5.0
    1,1500,5.0                                     1,1467,5.0
    1,1467,5.0                                     1,1189,5.0
    1,1189,5.0                                     1,1293,5.0
    1,1293,5.0                                     1,1367,4.7517056
    2,1643,5.0                                     2,1500,5.0
    2,1467,5.0                                     2,1293,5.0
    2,1500,5.0                                     2,1189,5.0
    2,1293,5.0                                     2,1449,4.608227
    2,1189,5.0                                     2,1594,4.5082903
    3,1189,5.0                                     3,1500,5.0
    3,1500,5.0                                     3,1189,5.0
    3,1302,5.0                                     3,1293,5.0
    3,1368,5.0                                     3,1449,4.76954
    3,1398,4.759591                                3,1450,4.686902
    4,1104,4.7937207                               4,1467,5.0
    4,853,4.729132                                 4,1500,5.0
    4,169,4.655577                                 4,1189,5.0
    4,1449,4.60582                                 4,1293,5.0
    4,408,4.582672                                 4,1594,4.566541
    5,1500,5.0                                     5,1500,5.0
    5,1233,5.0                                     5,1467,5.0
    5,851,5.0                                      5,1189,5.0
    5,1189,5.0                                     5,1293,5.0
    5,119,5.0                                      5,1642,4.66432
Item Based Recommendation
Add the class
   Package: recommend.item
   Class: ItemRecommend
   Add the libraries to the Java Build Path.
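ItemRecommend class

    package recommend.item;

    import java.io.File;
    import java.io.IOException;
    import java.util.List;

    import org.apache.mahout.cf.taste.common.TasteException;
    import org.apache.mahout.cf.taste.impl.common.LongPrimitiveIterator;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

    public class ItemRecommend {
        public static void main(String[] args) {
            DataModel dm;
            try {
                /* Create the data model */
                dm = new FileDataModel(new File("data/movies.csv"));

                /* Choose the similarity model */
                ItemSimilarity sim = new PearsonCorrelationSimilarity(dm);

                /* Choose the recommender: item based */
                GenericItemBasedRecommender recommender =
                        new GenericItemBasedRecommender(dm, sim);

                int x = 1;
                /* Step through the items in the data model and find similar items for each */
                for (LongPrimitiveIterator items = dm.getItemIDs(); items.hasNext();) {
                    long itemID = items.nextLong(); /* current item ID */

                    /* Find the 5 items most similar to the current item ID */
                    List<RecommendedItem> recommendations =
                            recommender.mostSimilarItems(itemID, 5);

                    /* Print "current item ID, similar item ID, similarity" */
                    for (RecommendedItem recommendation : recommendations) {
                        System.out.println(itemID + "," + recommendation.getItemID()
                                + "," + recommendation.getValue());
                    }

                    x++; /* 5 similar items each, up to item ID 5 */
                    if (x > 5) break;
                }
            } catch (IOException | TasteException e) { /* multi-catch needs Java 7 or later */
                e.printStackTrace();
            }
        }
    }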
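What a call such as mostSimilarItems(itemID, 5) computes can be sketched in plain Java: score every other item against the reference item and keep the top few. The version below is an illustration under stated assumptions, not Mahout's code. The class name MostSimilarItemsSketch, the tiny rating matrix, and the convention that 0 means "not rated" are all made up for the example. Pearson correlation is taken over the users who rated both items.

```java
import java.util.ArrayList;
import java.util.List;

final class MostSimilarItemsSketch {

    /** Pearson correlation of items i and j over users who rated both (0 = not rated). */
    static double pearson(double[][] r, int i, int j) {
        int n = 0;
        double sumI = 0.0, sumJ = 0.0;
        for (double[] row : r) {
            if (row[i] > 0 && row[j] > 0) { n++; sumI += row[i]; sumJ += row[j]; }
        }
        if (n == 0) return Double.NaN;
        double meanI = sumI / n, meanJ = sumJ / n;
        double num = 0.0, varI = 0.0, varJ = 0.0;
        for (double[] row : r) {
            if (row[i] > 0 && row[j] > 0) {
                double a = row[i] - meanI, b = row[j] - meanJ;
                num += a * b;
                varI += a * a;
                varJ += b * b;
            }
        }
        double denom = Math.sqrt(varI) * Math.sqrt(varJ);
        return denom == 0.0 ? Double.NaN : num / denom;
    }

    /** Indices of the howMany items most similar to the given item, best first. */
    static List<Integer> mostSimilarItems(double[][] r, int item, int howMany) {
        int itemCount = r[0].length;
        double[] sims = new double[itemCount];
        List<Integer> candidates = new ArrayList<>();
        for (int j = 0; j < itemCount; j++) {
            if (j == item) continue;
            sims[j] = pearson(r, item, j);
            if (!Double.isNaN(sims[j])) candidates.add(j); // skip undefined similarities
        }
        candidates.sort((x, y) -> Double.compare(sims[y], sims[x])); // descending
        return new ArrayList<>(candidates.subList(0, Math.min(howMany, candidates.size())));
    }

    /** Hypothetical ratings: rows are users, columns are items, 0 means not rated. */
    static double[][] sample() {
        return new double[][] {
            {5, 4, 1, 0},
            {4, 5, 2, 3},
            {1, 2, 5, 4},
            {2, 1, 4, 5},
        };
    }

    public static void main(String[] args) {
        double[][] r = sample();
        for (int j : mostSimilarItems(r, 0, 2)) {
            System.out.println("0," + j + "," + pearson(r, 0, j));
        }
    }
}
```

The printed lines follow the same "reference item, compared item, similarity" layout as the ItemRecommend output, using item indices instead of MovieLens IDs.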
ItemRecommend class: execution results
Output format: <reference item ID, compared item ID, similarity>

    sim = new PearsonCorrelationSimilarity(dm);    sim = new LogLikelihoodSimilarity(dm);
    1,973,1.0                                      1,117,0.9953521
    1,885,1.0                                      1,151,0.9953065
    1,920,1.0                                      1,121,0.9952347
    1,757,1.0                                      1,405,0.99500656
    1,341,1.0                                      1,50,0.99491894
    2,341,1.0                                      2,403,0.9964998
    2,119,1.0                                      2,233,0.9964557
    2,308,1.0                                      2,161,0.9961404
    2,75,1.0                                       2,231,0.9960143
    2,74,1.0                                       2,385,0.9959657
    3,560,1.0                                      3,405,0.99037176
    3,422,1.0                                      3,235,0.9893157
    3,344,1.0                                      3,121,0.9880421
    3,400,1.0                                      3,250,0.9880041
    3,115,1.0                                      3,100,0.98773706
    4,1038,1.0                                     4,56,0.99627966
    4,868,1.0                                      4,174,0.99601305
    4,927,1.0                                      4,204,0.9959589
    4,643,1.0                                      4,202,0.99582237
    4,360,1.0                                      4,385,0.9957967
    5,348,1.0                                      5,218,0.99432045
    5,34,1.0                                       5,98,0.9922024
    5,113,1.0                                      5,234,0.99179345
    5,35,1.0                                       5,56,0.99115413
    5,6,1.0                                        5,53,0.9909523
References
1. Apache Mahout Recommender Quick Start
   http://mahout.apache.org/users/recommender/quickstart.html
2. Recommendation System: Focusing on Collaborative Filtering
   http://rosaec.snu.ac.kr/meet/file/20120728b.pdf
3. A Taste of Apache Mahout (Building a Recommendation System in 30 Minutes)
   http://www.slideshare.net/pitzcarraldo/mahout-cook-book
4. Movie Recommendation Sampling with Mahout
   http://www.mimul.com/pebble/default/2012/03/23/1332494169544.html
5. Recommendation: Recommendation Algorithms: Item-Based Filtering
   http://hochul.net/blog/recommendation-daisy/