15 2 2011 4 스팸메일필터링을위한한글변칙어인식방법 안희국 *, 한욱표 *, 신승호 *, 양동일 **, 노희영 * Hee-Kook Ahn * Uk-Pyo Han * Seung-Ho Shin * Dong-Il Yang ** and Hee-Young Roh * 요약,.,.,..,..,, Smith-Waterman... Abstract As electronic mails are being widely used for facility and speedness of information communication, as the amount of spam mails which have malice and advertisement increase and cause lots of social and economic problem. A number of approaches have been proposed to alleviate the impact of spam. These approaches can be categorized into pre-acceptance and post-acceptance methods. Post-acceptance methods include bayesian filters, collaborative filtering and e-mail prioritization which are based on words or sentances. But, spammers are changing those characteristics and sending to avoid filtering system. In the case of Korean, the abnormal usages can be much more than other languages because syllable is composed of chosung, jungsung, and jongsung. Existing formal expressions and learning algorithms have the limits to meet with those changes promptly and efficiently. So, we present an methods for recognizing Korean abnormal language(koral) to improve accuracy and efficiency of filtering system. The method is based on syllabic than word and Smith-waterman algorithm. Through the experiment on filter keyword and e-mail extracted from mail server, we confirmed that Koral is recognized exactly according to similarity level. The required time and space costs are within the permitted limit. Key words : Spam Mail Filtering, Korean Abnormal Language, Smith- Waterman Algorithm, Keyword Similarity I. 서론 * ** 1 (First Author) : : : 2011 3 23 ( ) : 2011 3 24 ( : 2011 4 23 ) : 2011 4 30
288 15 2 2011 4.,, /., 2003, 2009 2.16, 2008 2.12 0.04. [1]. (spam) (nonspam) [2]. (pre-acceptance) (post-acceptance) [3]. (Domain, IP, E-mail address), [4-8], (Support Vector Machine),, [9, 10]. 97% [11], (false positive) 15%, (.,, ). SVM 98.9% [12].,,.,. (ISP),,.,,.., (Dynamic Programmi- ng Algorithm. DPA).. DPA.,. II smith-watherman, III,.,. IV. V, VI,. II. 관련연구 2-1 Dynamic Programming Algorithm(DPA)
,,,, ; 289 1970 Needleman & Wunsch DPA divide & conquer.,, [13, 14]. S T (a,b,c,...z),, (scoring function: σ). (,, ). 1. x y, σ(x, y) x y (align), (scoring function: σ)., +2, -1 (penalty). a c σ(a, a)=σ(c, c)=+2, σ(a, c)=σ(a, -)=σ(-, c)=-1. 1 x(acbcdb) y(cadbd) 6-5=1., S T. V(i, j) S[1]...S[i] T[1]...T[j]., S T V(n, m). S i T 0 ( ) A(i,0). : A(0,0)=0 A(i,0)=A(i-1,0) + σ(s[i],-), for i>0 A(0,j)=A(0, j-1) + σ(,-, T[j]), for j>0, 0<i, 0<j. : A(i,j)=max( A(i-1,j-1) + σ(s[i],t[j]), A(i-1,j) + σ(s[i],-), A(i,j-1) + σ(-,t[j]) ),. S=BASEBALL T=BASKETBALL 2. 그림 1. 서열정렬의예 Fig. 1. An example of sequence alignment. 2. S T (optimal alignment) strings., DPA. DPA Needleman- Wunsch,. 그림 2. DPA 에의한 S, T 매트릭스 Fig. 2. S, T matrix of DPA. n, m 3. S =n, T =m string S, T
290 15 2 2011 4 A(i-1,j-1) + σ(s[i],t[j]), A(i-1,j) + σ(s[i],-), A(i,j-1) + σ(-,t[j]) ) 그림 3. DPA 에의한최적정렬탐색과정 Fig. 3. An optimal search of DPA., 4. S=abcxdex, T=xxxcde, match: +2, mismatch: -1, space: -1,. 5 A(6,6)=5., 0. 그림 4. DPA 에의한최적정렬탐색결과 Fig. 4. Result of optimal alignment with DPA..,. 2-2 Smith-waterman algorithm 그림 5. smith-waterman 알고리즘의회복과정 Fig. 5. Recovering of smith-waterman algorithm.. DPA m n,, Smith-Waterman [15]. 0, 0,.. : (0<i, 0<j ) A(i,0)=0 A(0,j)=0 : A(i,j)=max( 0, 그림 6. smith-waterman 알고리즘의부분정렬 Fig. 6. Partial alignment of smith-waterman algorithm. 6 3 1, 3 (2)+1 (-1)=5. III. 스팸메일의유형분석,. 3-1 스팸메일의유형분석 2009, 459,885
,,,, ; 291 44,678 9.5% 89% 419,126, 1.2% 6,081., 40%, 20%, 15%, 5%., 5. 1 2 HTML 3 URL 4 5,., URL,, HTML, Meta-character. 그림 7. 한국어스팸메일예와대응방법 Fig. 7. Response method and example for Korean spam mail.,. 그림 8. 일반정규식으로해결불가능한경우 Fig. 8. Impracticable case of regular expression.,.,,.. 3.,,.,..,. 3-2 한글변칙어의패턴분석,,,,.
292 15 2 2011 4 girl gir1 drag dr@g 그림 9. 영어의변칙어예 Fig. 9. An example of abnormal English..,,,,,, B, ㅂ1 그림 10. 한국어의변칙어예 Fig. 10. An example of abnormal Korean.,,,, ( ) (,, ). (,, ). 1,.,. ) : ㄷ H, 2.,. ) :, 3.,. ) :.,. ) :, : 4. ) :. 1,. 2,. 3.. 4,,,. IV. Koral(Korean Abnormal Language) 인식모듈 그림 11. 한글변칙어의특성 Fig. 11. Characteristics of abnormal Korean.,,. smith-waterman., Koral.
,,,, ; 293 4-1 한글변칙어인식을위한 DPA 의특성.,,. 그림 12. DPA 의특성분석 I Fig. 12. Analysis of DPA characteristics I. match: +2, mismatch:-1, space:-1 " ㄷ H ". 7., +2, 10,, -5.. (1) (sim)., 7/10=0.7.,,,. 0.875., " 0.555. 15,, 0.437,.,.,,,. ) & : 0.85 & ㄷH : 0.85,,., Needleman-Wunsch Smith-Waterman.,,,, 0.954.,. 4-2 알고리즘의상세.,. 그림 13. DPA 특성분석 II. Fig. 13. Analysis of DPA characteristics II. 13
294 15 2 2011 4 Koral.,,. 5-1 실험환경 그림 14. 스팸필터링시스템의구조 Fig. 14. Structure of spam mail filtering system. 14 Koral /.. Smith-Waterman. 3.2.,,,. Koral. 1.. 2.. 3. 6.. 4.. 5.., spam. (, ), nonspam. ( ).., Intel Pentium Processor 1.73GHz, 1G RAM Windows XP Professional. Microsoft C++. 6, 3019 50., Koral.. subject body. DPA,, 100, 204, 393, 807, 1875, 3794. 2081 44192, 28., subject body 1:1, 28 100, 204, 393, 807, 1875, 3794,. match : +2, mismatch : -1, space : -1. 5-2 실험결과 V. 실험
,,,, ; 295 그림 17. 메일의크기에따른요구공간 Fig. 17. Required space by mail size. 그림 15. 대출 의유사도 Fig. 15. Similarity value for " 대출.. 1,, 60%.,,., 0.7,,, i,.., ^*, ^, ^^, ㄷ H. 49.,,.,. 그림 16. 메일크기에따른소요시간 Fig. 16. Required time by mail size. 17 2Byte,. 6,493, 363KB. VI. 결론,.,. Smith-Waterman A, subject body A.,.,..,.,,,.
296 15 2 2011 4 참고문헌 [ 1 ], 2010 (National Informatization Protection White Paper)", pp. 107-109, 2010. [ 2 ]., 13 2, 12, 2004. [ 3 ] L. H. Gomes and C. Cazita, "Characterizing a Spam Traffic.," in Proc. 2004 Internet Measurement Conference, Taormina, Sicily, Italy. Oct. 2004. [4] V. Keselj, E. Milios, A. Tuttle, S. Wang, and R. Zhang. "TREC 2005 Spam Track: Spam Filtering Using N-gram-based Techniques", Proceedings of Text REtrieval Conference, 2005. [5],,,, 31 8, pp.1092-1100, 2004. [6] R. Segal. "IBM SpamGuru on the TREC 2005 Spam Track," Proceedings of Text REtrieval Conference, 2005. [7] Al Brakto, B. Filipic. "Spam Filtering Using Character-Level Markov Models: Experiments for the TREC 2005 Spam Track," Proceedings fo Text REtrieval Conference, 2005. [8] L. A. Breyer. "DBACL at the TREC 2005," Proceedings of Text REtrieval Conference, 2005. [9] http://www.csie.ntu.edu.tw/~cjlin/libsvm [10],, URL, B, 15-B 1, pp.61-68, 2008. [ 11] F. Zhou, L. Zhuang, B. Zhao, L. Huang, A. Joseph, and J. Kubiatozicz, "Approximate object location and spam filtering on peer-to-peer systems," in Proc. Middleware, Rio de Janeiro, Brazil, June 2003. [12],, B, 17-B 3, pp.249-254, 2010. [ 13 ] S. B. Needleman and C. D. Wunsch. "A general method applicable to the search for similarities in the amino acid sequences of two proteins," Journal of Molecular Biology. vol. 48: 443-453, 1970. [ 14 ] Wagner, R. A. and Fischer, M. J. "The string-to-string correction problem," J. ACM 21, 168-173, Jan. 1974. [15 ] T. F. Smith and M. S. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, vol. 147(1): 195-197, Mar. 1981. 안희국 ( 安熙國 )) 2003 2 : 2007 2 : 2011 : :,, 한욱표 ( 韓旭彪 ) 1993 2 : 1996 2 : 2003 : 2007 8 : 2011 : :,, 신승호 ( 辛承浩 ) 1998 : 2000 : 2001 : 2005 : 2011 : :,,, 양동일 ( 梁東一 ) 2004 2 : 2007 8 : 2011 : :,,
,,,, ; 297 노희영 ( 盧熙瑩 ) 1972 ( - ) 1978 (Vordiploma- ) 1982 (Diploma- ) 1983 1984 2 ~ : :,,