Naive Bayes Classifier
박창이 (Department of Statistics, University of Seoul)
Contents
- Naive Bayes classification
- Implementation
- Example
Naive Bayes Classification I

Given input values $x = (x_1, \dots, x_p)$, the posterior probability that $Y = k$ is

$$P(Y = k \mid X_1 = x_1, \dots, X_p = x_p) \propto P(X_1 = x_1, \dots, X_p = x_p \mid Y = k)\,P(Y = k).$$

The naive Bayes assumption is that the inputs are conditionally independent given the class:

$$P(X_1 = x_1, \dots, X_p = x_p \mid Y = k) = \prod_{j=1}^{p} P(X_j = x_j \mid Y = k),$$

so that

$$P(Y = k \mid X_1 = x_1, \dots, X_p = x_p) \propto P(Y = k) \prod_{j=1}^{p} P(X_j = x_j \mid Y = k).$$
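The factorization above can be checked with a tiny numeric sketch (a two-class, two-feature example; all probabilities are made-up illustrative values, not estimates from any data):

```python
# Hypothetical priors P(Y = k) and conditionals cond[k][j] = P(X_j = x_j | Y = k)
# for one observed input x; the numbers are for illustration only.
priors = {0: 0.5, 1: 0.5}
cond = {
    0: [0.2, 0.7],
    1: [0.6, 0.3],
}

# Unnormalized posterior: P(Y = k) * prod_j P(X_j = x_j | Y = k)
unnorm = {k: priors[k] * cond[k][0] * cond[k][1] for k in priors}

# Normalize so the posteriors sum to 1
total = sum(unnorm.values())
posterior = {k: v / total for k, v in unnorm.items()}
# posterior[1] = 0.09 / 0.16 = 0.5625, so class 1 is predicted
```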
Naive Bayes Classification II

Using the training data, obtain estimates $\hat{P}(Y = k)$ and $\hat{P}(X_j = x_j \mid Y = k)$ for all $j$ and $k$; then, for a given test point $z = (z_1, \dots, z_p)$, predict $Y$ as

$$\arg\max_{k \in K} \hat{P}(Y = k) \prod_{j=1}^{p} \hat{P}(X_j = z_j \mid Y = k).$$

When an input variable is continuous, it is commonly discretized into bins and treated as categorical.

If the estimated probability for some variable is exactly 0, the product is always 0, no matter how large the probability estimates for the other variables are.
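The zero-probability problem is easy to see numerically (illustrative values): a single factor of 0 wipes out the product regardless of the other factors.

```python
import math

# Three strong factors and one zero estimate (made-up numbers):
probs = [0.9, 0.95, 0.0, 0.99]

# The product collapses to 0.0 because of the single zero factor.
product = math.prod(probs)
```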
Naive Bayes Classification III

Laplace smoothing. Example: present/absent counts of 0 and 990; Laplace correction gives $k$ and $990 + k$; probability estimates become $\frac{k}{990+2k}$ and $\frac{990+k}{990+2k}$. These differ little from the uncorrected values, yet the problem caused by estimates being exactly 0 no longer occurs.
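The corrected estimates in the example can be computed directly. This is a minimal sketch; `laplace_probs` is a hypothetical helper, not part of the slides' code.

```python
def laplace_probs(count_present, count_absent, k=1):
    """Laplace-smoothed estimates for a binary feature: add k to each cell."""
    total = count_present + count_absent + 2 * k
    return (count_present + k) / total, (count_absent + k) / total

# The slides' example: counts 0 and 990, with k = 1
p_present, p_absent = laplace_probs(0, 990, k=1)
# p_present = 1/992 > 0, so the product can no longer collapse to 0
```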
Implementation I

Preprocessing: split a message into words and convert everything to lowercase.

```python
import re
from collections import defaultdict

def tokenize(message):
    message = message.lower()                      # convert to lowercase
    all_words = re.findall("[a-z0-9']+", message)  # extract the words
    return set(all_words)                          # remove duplicates
```

Count word frequencies for the spam and non-spam messages.

```python
def count_words(training_set):
    """training set consists of pairs (message, is_spam)"""
    counts = defaultdict(lambda: [0, 0])
    for message, is_spam in training_set:
        for word in tokenize(message):
            counts[word][0 if is_spam else 1] += 1
    return counts
```
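A quick standalone check of the two helpers, redefined here so the snippet runs on its own (the two toy messages are made up):

```python
import re
from collections import defaultdict

def tokenize(message):
    message = message.lower()
    return set(re.findall("[a-z0-9']+", message))

def count_words(training_set):
    counts = defaultdict(lambda: [0, 0])
    for message, is_spam in training_set:
        for word in tokenize(message):
            counts[word][0 if is_spam else 1] += 1
    return counts

counts = count_words([("Free money now", True),
                      ("money for lunch", False)])
# counts["money"] == [1, 1]: seen once in spam, once in non-spam
# counts["free"]  == [1, 0]: seen only in spam
```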
Implementation II

Probability estimation with smoothing.

```python
def word_probabilities(counts, total_spams, total_non_spams, k=0.5):
    """turn the word_counts into a list of triplets
    w, p(w | spam) and p(w | ~spam)"""
    return [(w,
             (spam + k) / (total_spams + 2 * k),
             (non_spam + k) / (total_non_spams + 2 * k))
            for w, (spam, non_spam) in counts.items()]
```
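A standalone check (counts and totals are invented): with `k = 0.5`, a word seen in 10 of 20 spam messages gets the smoothed estimate $(10 + 0.5)/(20 + 1) = 0.5$.

```python
def word_probabilities(counts, total_spams, total_non_spams, k=0.5):
    return [(w,
             (spam + k) / (total_spams + 2 * k),
             (non_spam + k) / (total_non_spams + 2 * k))
            for w, (spam, non_spam) in counts.items()]

triplets = word_probabilities({"money": [10, 2]},
                              total_spams=20, total_non_spams=30, k=0.5)
w, p_spam, p_ham = triplets[0]
# p_spam = 10.5 / 21 = 0.5, p_ham = 2.5 / 31
```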
Implementation III

Predicting the probability that a message is spam.

```python
import math

def spam_probability(word_probs, message):
    message_words = tokenize(message)
    log_prob_if_spam = log_prob_if_not_spam = 0.0

    for word, prob_if_spam, prob_if_not_spam in word_probs:
        # for each word in the message,
        # add the log probability of seeing it
        if word in message_words:
            log_prob_if_spam += math.log(prob_if_spam)
            log_prob_if_not_spam += math.log(prob_if_not_spam)
        # for each word that's not in the message, add the log
        # probability of _not_ seeing it
        else:
            log_prob_if_spam += math.log(1.0 - prob_if_spam)
            log_prob_if_not_spam += math.log(1.0 - prob_if_not_spam)

    prob_if_spam = math.exp(log_prob_if_spam)
    prob_if_not_spam = math.exp(log_prob_if_not_spam)
    return prob_if_spam / (prob_if_spam + prob_if_not_spam)
```
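Summing logs instead of multiplying probabilities matters because a product of many small factors underflows in floating point; a small illustration (the factor 1e-10 and the count 500 are arbitrary):

```python
import math

p = 1e-10

# Multiplying 500 copies of p directly underflows to exactly 0.0 ...
direct = p ** 500

# ... while the equivalent sum of logs stays a perfectly ordinary float.
log_sum = 500 * math.log(p)
```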
Implementation IV

The naive Bayes classifier.

```python
class NaiveBayesClassifier:

    def __init__(self, k=0.5):
        self.k = k
        self.word_probs = []

    def train(self, training_set):
        # count spam and non-spam messages
        num_spams = len([is_spam
                         for message, is_spam in training_set
                         if is_spam])
        num_non_spams = len(training_set) - num_spams

        # run training data through our "pipeline"
        word_counts = count_words(training_set)
        self.word_probs = word_probabilities(word_counts,
                                             num_spams,
                                             num_non_spams,
                                             self.k)

    def classify(self, message):
        return spam_probability(self.word_probs, message)
```
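Putting the pieces together, the whole pipeline can be exercised end to end on a tiny hand-made training set (toy messages, not the slides' data; the helpers are repeated here so the snippet runs standalone):

```python
import math
import re
from collections import defaultdict

def tokenize(message):
    return set(re.findall("[a-z0-9']+", message.lower()))

def count_words(training_set):
    counts = defaultdict(lambda: [0, 0])
    for message, is_spam in training_set:
        for word in tokenize(message):
            counts[word][0 if is_spam else 1] += 1
    return counts

def word_probabilities(counts, total_spams, total_non_spams, k=0.5):
    return [(w,
             (spam + k) / (total_spams + 2 * k),
             (non_spam + k) / (total_non_spams + 2 * k))
            for w, (spam, non_spam) in counts.items()]

def spam_probability(word_probs, message):
    words = tokenize(message)
    log_spam = log_ham = 0.0
    for word, p_spam, p_ham in word_probs:
        if word in words:
            log_spam += math.log(p_spam)
            log_ham += math.log(p_ham)
        else:
            log_spam += math.log(1.0 - p_spam)
            log_ham += math.log(1.0 - p_ham)
    p1, p0 = math.exp(log_spam), math.exp(log_ham)
    return p1 / (p1 + p0)

class NaiveBayesClassifier:
    def __init__(self, k=0.5):
        self.k = k
        self.word_probs = []

    def train(self, training_set):
        num_spams = sum(1 for _, is_spam in training_set if is_spam)
        num_non_spams = len(training_set) - num_spams
        self.word_probs = word_probabilities(count_words(training_set),
                                             num_spams, num_non_spams, self.k)

    def classify(self, message):
        return spam_probability(self.word_probs, message)

clf = NaiveBayesClassifier()
clf.train([("win money now", True), ("cheap money offer", True),
           ("lunch at noon", False), ("project meeting notes", False)])
# A money-related message scores above 0.5, an office-related one below.
```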
Example I

Unzip spam.zip somewhere convenient and extract each email's subject line as follows.

```python
import glob
import re

path = r"D:\spam\*\*"
data = []

# regex for stripping out the leading "Subject:" and any spaces after it
subject_regex = re.compile(r"^Subject:\s+")

# glob.glob returns every filename that matches the wildcarded path
for fn in glob.glob(path):
    is_spam = "ham" not in fn
    with open(fn, "r", encoding="ISO-8859-1") as file:
        for line in file:
            if line.startswith("Subject:"):
                subject = subject_regex.sub("", line).strip()
                data.append((subject, is_spam))
```
Example II

Randomly split into training and test data, then train.

```python
import random

random.seed(0)  # just so you get the same answers as me
train_data, test_data = split_data(data, 0.75)
classifier = NaiveBayesClassifier()
classifier.train(train_data)
```
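`split_data` is not defined on these slides; a minimal sketch consistent with how it is used here (a random split assigning each row to the first list with probability `prob`):

```python
import random

def split_data(data, prob):
    """Split data into two lists with fractions [prob, 1 - prob]."""
    results = [], []
    for row in data:
        results[0 if random.random() < prob else 1].append(row)
    return results

random.seed(0)
train, test = split_data(list(range(100)), 0.75)
# every row lands in exactly one of the two lists
```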
Example III

Results.

```python
from collections import Counter

classified = [(subject, is_spam, classifier.classify(subject))
              for subject, is_spam in test_data]
counts = Counter((is_spam, spam_probability > 0.5)  # (actual, predicted)
                 for _, is_spam, spam_probability in classified)
print(counts)
# Counter({(False, False): 704, (True, True): 101,
#          (True, False): 38, (False, True): 33})
```

Precision 101/(101+33) ≈ 75%, recall 101/(101+38) ≈ 73%.
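The precision and recall figures follow directly from the confusion counts (keys are `(actual, predicted)` pairs, with the final count of 33 implied by the stated precision of 101/(101+33)):

```python
from collections import Counter

counts = Counter({(False, False): 704, (True, True): 101,
                  (True, False): 38, (False, True): 33})

tp = counts[(True, True)]    # spam correctly flagged
fp = counts[(False, True)]   # non-spam wrongly flagged
fn = counts[(True, False)]   # spam that slipped through

precision = tp / (tp + fp)   # 101 / 134, about 75%
recall = tp / (tp + fn)      # 101 / 139, about 73%
```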
Example IV

Words with the highest spam probability.

```python
def p_spam_given_word(word_prob):
    word, prob_if_spam, prob_if_not_spam = word_prob
    return prob_if_spam / (prob_if_spam + prob_if_not_spam)

words = sorted(classifier.word_probs, key=p_spam_given_word)
spammiest_words = words[-5:]
hammiest_words = words[:5]
```

spammiest words: year, rates, sale, systemworks, money
hammiest words: spambayes, users, razor, zzzzteana, sadev
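A standalone check of the ranking (the two triplets are invented): a word much likelier under spam sorts to the end, a word much likelier under non-spam sorts to the front.

```python
def p_spam_given_word(word_prob):
    word, prob_if_spam, prob_if_not_spam = word_prob
    return prob_if_spam / (prob_if_spam + prob_if_not_spam)

# Hypothetical (word, p(w | spam), p(w | ~spam)) triplets
probs = [("sale", 0.30, 0.02), ("meeting", 0.01, 0.20)]
ranked = sorted(probs, key=p_spam_given_word)
# "meeting" (score ~0.048) sorts before "sale" (score ~0.94)
```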
Example V

Miscellaneous remarks:
- The message body, not just the subject line, could also be used.
- A minimum word frequency could be imposed.
- Synonyms can be handled via stemming; the Porter stemmer (http://tartarus.org/martin/porterstemmer/) is a common choice.
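As a rough illustration of what stemming buys, a toy suffix stripper (this is NOT the Porter algorithm, just a hypothetical sketch): mapping inflected forms to a shared token lets their counts be pooled.

```python
def crude_stem(word):
    """Toy suffix stripper: maps e.g. 'rates' and 'rate' to the same token
    so related forms share counts. Real use should call a Porter stemmer."""
    for suffix in ("ing", "s"):
        # only strip when a reasonably long stem remains
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

stemmed = crude_stem("rates")  # 'rate', pooled with the singular form
```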