Natural Language Text Retrieval Based on Neural Networks (1999)
Abstract

As the world becomes connected by online networks, the volume of information pouring in exceeds what a user can classify by hand. This thesis applies the Support Vector Machine (SVM) to the problem of classifying such text automatically. The SVM rests on the Structural Risk Minimization (SRM) principle of statistical learning theory, and its training is a mathematically well-understood Convex Programming problem. On the five most frequent topics of the Reuters-21578 collection, the SVM reaches an accuracy above 97% and a better precision/recall break-even point than a Naïve Bayesian classifier, and a multilayer perceptron trained on only the Support Vectors (SVs) the SVM produces retains good performance while training far faster.

Keywords: Support Vector Machine, Structural Risk Minimization, Convex Programming
Contents

1 Introduction
  1.1 Motivation
  1.2 Related Work
  1.3 Outline of the Thesis
2 Text Representation
  2.1 Document Preprocessing
  2.2 Dimensionality of the Feature Space
3 Neural Networks
  3.1 Feedforward Networks
    3.1.1 The Perceptron
    3.1.2 Multilayer Perceptrons
    3.1.3 RBF (Radial Basis Function) Networks
  3.2 Support Vector Machines (SVMs)
4 SVM
  4.1 Statistical Learning Theory
    4.1.1 Risk Minimization
    4.1.2 Structural Risk Minimization (SRM)
  4.2 SVM
    4.2.1 Linearly Separable Patterns
    4.2.2 SVM and SRM
    4.2.3 Linearly Nonseparable Patterns
    4.2.4 Nonlinear SVM
    4.2.5 Training the SVM
5 Experiments
  5.1 The Reuters-21578 Collection
  5.2 Preprocessing
  5.3 Results
    5.3.1 SVM
    5.3.2 Comparison with the Naïve Bayesian Classifier
    5.3.3 Retraining with Support Vectors Only
6 Conclusion
References
List of Tables

Table 4-1: Common kernel functions
Table 5-1: Category sets of Reuters-21578
Table 5-2: The three standard splits of Reuters-21578
Table 5-3: Effect of the DF threshold on the vocabulary
Table 5-4: The ten most frequent TOPICS categories
Table 5-5: Contingency table for a binary classifier
Table 5-6: Number of support vectors per category and kernel
List of Figures

Figure 3-1: The perceptron
Figure 3-2: A multilayer perceptron
Figure 3-3: An RBF network
Figure 4-1: Structural Risk Minimization
Figure 4-2: The optimal separating hyperplane and its margin
Figure 4-3: Nonseparable patterns
Figure 4-4: Mapping the input space to a feature space
Figure 5-1: Results with polynomial kernels on the five categories
Figure 5-2: Results with the RBF kernel (σ = 1)
Figure 5-3: Results with the two-layer perceptron kernel (β₀ = 2, β₁ = 1)
Figure 5-4: Break-even points of the Naïve Bayesian classifier and the SVM
Figure 5-5: Results of the multilayer perceptron trained on SVs only
Figure 5-6: Results of the RBF network trained on SVs only
1 1 11 1950 (WWW) [Yang 97] SVM(Support Vector Machne) SVM SV(Support Vector) 1
1.2 Related Work

Several machine learning methods have been applied to text classification. The Naïve Bayesian classifier assigns a document Doc the most probable category value

v_NB = argmax over v_j ∈ V of  P(v_j) · Π_{i ∈ positions} P(a_i | v_j),

where a_i is the word occurring at position i of Doc [Mitchell 97]. Decision-tree learners such as Quinlan's C4.5 [Quinlan 93] have also been used, as have instance-based methods such as k-NN (k-Nearest Neighbor), which labels a document with the majority category among its k nearest training documents.
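As an illustration of this rule, the following is a minimal Naïve Bayesian text classifier in Python, with Laplace smoothing and whitespace tokenization; it is a sketch for exposition, not the classifier evaluated in Chapter 5.

```python
import math
from collections import Counter

def train_naive_bayes(docs, labels):
    """Estimate P(v_j) and the smoothed P(a_i | v_j) from training documents."""
    vocab = {w for d in docs for w in d.split()}
    prior, cond = {}, {}
    for v in set(labels):
        docs_v = [d for d, y in zip(docs, labels) if y == v]
        prior[v] = len(docs_v) / len(docs)
        counts = Counter(w for d in docs_v for w in d.split())
        total = sum(counts.values())
        # Laplace smoothing so unseen words do not zero out the product
        cond[v] = {w: (counts[w] + 1) / (total + len(vocab)) for w in vocab}
    return prior, cond

def classify(doc, prior, cond):
    """v_NB = argmax_v P(v) * prod_i P(a_i | v), computed in log space."""
    def log_score(v):
        return math.log(prior[v]) + sum(
            math.log(cond[v][w]) for w in doc.split() if w in cond[v])
    return max(prior, key=log_score)
```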
1.3 Outline of the Thesis

Chapter 2 describes how documents are turned into feature vectors. Chapter 3 reviews neural network models and motivates the SVM. Chapter 4 develops the theory of the SVM. Chapter 5 reports the experiments on the Reuters-21578 collection, and Chapter 6 concludes.
2 Text Representation

2.1 Document Preprocessing

Text classification is treated as supervised learning: the learner receives training pairs {doc, category-value} and must induce a classifier that assigns the correct category value to an unseen document.
Two standard preprocessing steps are stemming and stop-list filtering. Stemming conflates morphological variants of a word to a common stem: 'engineering', 'engineered', and 'engineer' all reduce to the stem 'engineer'. Porter's algorithm is the stemmer used here. A stop list removes extremely common function words such as 'the', 'of', 'and', and 'to'; the ten most frequent words alone account for roughly 20~30% of the word occurrences in typical English text [Frakes et al 92].

The remaining words are then weighted. The simplest weight is TF (Term Frequency): for a word w_i and a document doc,
TF(w_i, doc) = number of times w_i occurs in document doc.

TF alone overweights words that are frequent everywhere, so it is combined with IDF (Inverse Document Frequency) [Frakes et al 92]. With n the total number of documents,

IDF(w_i) = log( n / DF(w_i) ),

where DF (Document Frequency) is

DF(w_i) = number of documents in which w_i occurs.

The tfidf weight of a word in a document is the product TF · IDF: a word scores highly when it is frequent in the document (high TF) but rare in the collection (low DF). A word that occurs in every document has IDF = 0, and hence a tfidf weight of 0.
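The tfidf weighting is compact enough to state directly. The sketch below (assuming whitespace-tokenized documents) computes a sparse tfidf vector for every document in a collection:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Weight each word w in each doc by TF(w, doc) * log(n / DF(w))."""
    n = len(docs)
    df = Counter()                        # DF(w): number of docs containing w
    for doc in docs:
        df.update(set(doc.split()))
    vectors = []
    for doc in docs:
        tf = Counter(doc.split())         # TF(w, doc): occurrences of w in doc
        vectors.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return vectors

docs = ["the cat sat", "the dog sat", "the cat ran"]
print(tfidf_vectors(docs))   # 'the' occurs in every doc, so its weight is 0
```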
2.2 Dimensionality of the Feature Space

Even after preprocessing, the vocabulary holds on the order of 20000 distinct stems, so each document becomes a vector in a space of roughly 20000 dimensions. Since a single document contains only a tiny fraction of the vocabulary, these vectors are sparse: almost all of their components are 0.
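A dictionary keyed by the indices of the nonzero components is one natural sparse representation; the following sketch is purely illustrative:

```python
def to_sparse(dense):
    """Keep only the nonzero components of a ~20,000-dimensional vector."""
    return {i: x for i, x in enumerate(dense) if x != 0.0}

def dot_sparse(u, v):
    """Inner product of two sparse vectors, iterating over the smaller one."""
    if len(u) > len(v):
        u, v = v, u
    return sum(x * v.get(i, 0.0) for i, x in u.items())

dense = [0.0] * 20000
dense[42], dense[17313] = 1.7, 0.3   # a document touches very few dimensions
sparse = to_sparse(dense)            # 2 entries instead of 20,000
print(dot_sparse(sparse, sparse))    # 1.7**2 + 0.3**2 = 2.98
```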
3 Neural Networks

3.1 Feedforward Networks

3.1.1 The Perceptron

The perceptron, proposed by Rosenblatt (1958), models a single neuron: an input vector is combined with a synaptic weight vector and a bias, as shown in Figure 3-1. The induced local field is

v = Σ_{i=1}^{m} w_i x_i + b,

where w_1, ..., w_m are the synaptic weights, x_1, ..., x_m the inputs, and b the bias.
The output is obtained by the threshold activation

y = φ(v) = +1 if v ≥ 0, −1 if v < 0

(see Figure 3-1). Geometrically, the perceptron divides its input space into two decision regions separated by the hyperplane

wᵀx + b = 0,  w = (w_1, ..., w_n),

the same kind of separating hyperplane that the SVM of Section 4.2 optimizes.
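A minimal sketch of this decision rule (the values are chosen only for illustration):

```python
import numpy as np

def perceptron_output(w, b, x):
    """y = phi(v) with v = w.x + b and the threshold activation above."""
    v = np.dot(w, x) + b            # induced local field
    return 1 if v >= 0 else -1

# w and b define the separating hyperplane w.x + b = 0:
w, b = np.array([1.0, -2.0]), 0.5
print(perceptron_output(w, b, np.array([3.0, 1.0])))   # v = 1.5  -> +1
print(perceptron_output(w, b, np.array([0.0, 1.0])))   # v = -1.5 -> -1
```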
3.1.2 Multilayer Perceptrons

A multilayer perceptron arranges neurons into an input layer, one or more hidden layers, and an output layer, connected as a network as shown in Figure 3-2. Each hidden and output neuron passes its induced local field through a nonlinear activation function.
Two common activation functions are:

Logistic function:            φ(v) = 1 / (1 + exp(−av))
Hyperbolic tangent function:  φ(v) = a·tanh(bv)

The weights are learned with the backpropagation algorithm, which repeats two passes over the network: a forward pass that computes the outputs, and a backward pass that propagates the output error back through the layers to adjust the weights.
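Both activation functions are one-liners; a and b are free design constants, set below to one common choice rather than values prescribed by this thesis:

```python
import numpy as np

def logistic(v, a=1.0):
    """Logistic function: phi(v) = 1 / (1 + exp(-a v)), values in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a * v))

def hyperbolic_tangent(v, a=1.7159, b=2.0 / 3.0):
    """Hyperbolic tangent: phi(v) = a tanh(b v), values in (-a, a)."""
    return a * np.tanh(b * v)

v = np.linspace(-4.0, 4.0, 9)
print(logistic(v))
print(hyperbolic_tangent(v))
```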
3.1.3 RBF (Radial Basis Function) Networks

An RBF network has a single hidden layer of radial basis function units, as shown in Figure 3-3. Each hidden unit computes a radial basis function φᵢ(x), here the Gaussian

φᵢ(x) = exp( −‖x − tᵢ‖² / (2σ²) ),

where tᵢ is the center of the i-th radial basis function and σ controls its width.
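A sketch of a single Gaussian hidden unit:

```python
import numpy as np

def rbf_unit(x, t, sigma):
    """Gaussian radial basis function: exp(-||x - t||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - t) ** 2) / (2.0 * sigma ** 2))

t = np.array([1.0, 1.0])                       # center t_i of the unit
print(rbf_unit(np.array([1.0, 1.0]), t, 1.0))  # 1.0 exactly at the center
print(rbf_unit(np.array([3.0, 1.0]), t, 1.0))  # exp(-2), decays with distance
```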
The network output is the weighted sum of the hidden-unit responses,

F(x) = Σ_{i=1}^{m₁} w_i φᵢ(x) + b.

3.2 Support Vector Machines (SVMs)

Training a multilayer perceptron or an RBF network minimizes a nonconvex cost, so the procedure may return a local minimum instead of the global minimum, and structural parameters, the number of hidden units or the radial basis centers, must be found by trial and error. The SVM is attractive for four reasons. First, training an SVM singles out a subset of the training examples, the support vectors (SVs), which by themselves determine the decision boundary [Haykin 98]. Second, the SV set acts as a condensed representation of the training data,
comparable in spirit to data reduction techniques such as PCA (Principal Component Analysis), SOM (Self-Organizing Map), and VQ (Vector Quantizer); Chapter 5 exploits this by retraining networks on the SVs alone. Third, SVM training is a convex optimization problem, so the solution it finds is guaranteed to be the global optimum. Fourth, the SVM is built directly on Structural Risk Minimization (SRM), which gives it a theoretical handle on generalization. The next chapter develops SRM and the SVM in detail.
4 SVM

This chapter presents the theory of the SVM.

4.1 Statistical Learning Theory

4.1.1 Risk Minimization

The learner observes l training examples (x_1, d_1), ..., (x_l, d_l) with x_i ∈ Rᴺ and d_i ∈ {−1, +1}, and must choose a decision function F(x, w) from a family {F(x, w) : w ∈ W}, where each F(x, w): Rᴺ → {−1, +1} and w denotes the free parameters.
The best choice F(x, w*) is the one minimizing the risk functional (the expected risk)

R(w) = ∫ |d − F(x, w)| dF_{X,D}(x, d),

where F_{X,D}(x, d) is the joint probability distribution of the input x and the label d, and w is the free parameter. Since F_{X,D}(x, d) is unknown, R(w) cannot be evaluated; what can be evaluated on the training sample is the empirical risk

R_emp(w) = (1/l) Σ_{i=1}^{l} |d_i − F(x_i, w)|.

Empirical risk minimization selects the w that minimizes R_emp(w). Whether this w also makes R(w) small is the question of consistency: the principle is consistent when

P( sup over w of |R(w) − R_emp(w)| > ε ) → 0  as l → ∞.

Vapnik and Chervonenkis characterized consistency through the VC dimension (Vapnik-Chervonenkis dimension) h, a capacity measure of the family F(x, w), and proved that with probability 1 − η,

R(w) ≤ R_emp(w) + √[ ( h(ln(2l/h) + 1) − ln(η/4) ) / l ]  for all w ∈ W.  (4.1)
Here h is the VC dimension of the family F(x, w). The bound (4.1) splits the true risk R(w) into the empirical risk R_emp(w) plus a confidence term

√[ ( h(ln(2l/h) + 1) − ln(η/4) ) / l ],

which grows with the VC dimension h and shrinks with the number of examples l. A family with a large VC dimension can fit the sample well, making R_emp(w) small, but pays with a large confidence term; minimizing R_emp(w) alone therefore does not minimize the bound on R(w).

4.1.2 Structural Risk Minimization (SRM)

To minimize both terms of the bound at once, Vapnik proposed Structural Risk Minimization (SRM). The bound on the true error is the sum of the empirical risk and the VC confidence, and Figure 4-1 sketches how the two components trade off as the VC dimension grows.
Figure 4-1: Structural Risk Minimization

SRM works with a nested sequence of function families

F_k(x, w); w ∈ W_k,  k = 1, 2, ..., n,
ordered so that F_1 ⊂ F_2 ⊂ ... ⊂ F_n with VC dimensions h_1 ≤ h_2 ≤ ... ≤ h_n. The empirical risk is minimized within each F_k, the bound (4.1) is evaluated for each k, and SRM returns the solution from the family whose bound is smallest. The obstacle to applying SRM directly is that the VC dimension of most families can be neither computed nor controlled. The SVM is precisely a learning machine whose VC dimension can be controlled, as the next sections show.

4.2 SVM

4.2.1 Linearly Separable Patterns
Let the training patterns carry labels in {−1, +1} and split into the two sets

S₊ = { x_i : (x_i, d_i), d_i = +1 },  S₋ = { x_i : (x_i, d_i), d_i = −1 }.

They are linearly separable when some hyperplane²

wᵀx + b = 0  (4.2)

places them on opposite sides:

wᵀx_i + b ≥ 0 for x_i ∈ S₊,  wᵀx_i + b ≤ 0 for x_i ∈ S₋,

where w is the weight vector and b the bias. Rescaling w and b leaves the hyperplane unchanged, so they can be normalized such that

wᵀx_i + b ≥ +1 for x_i ∈ S₊,  wᵀx_i + b ≤ −1 for x_i ∈ S₋.  (4.3)

² A hyperplane is a set of the form H = { x ∈ Rⁿ : aᵀx = α }.
For a pattern x that attains equality in (4.3), the distance to the hyperplane is

r = |wᵀx + b| / ‖w‖ = 1 / ‖w‖,

and the margin of separation, the width of the gap between the two classes, is therefore

ρ = 2r = 2 / ‖w‖.  (4.4)

The patterns attaining equality in (4.3), those at distance r = 1/‖w‖ from the hyperplane, are the support vectors. The SVM seeks the hyperplane with the largest margin ρ, equivalently the smallest ‖w‖; Figure 4-2 shows the optimal hyperplane and its margin ρ.
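Both quantities are easy to verify numerically (the weights below are arbitrary):

```python
import numpy as np

def distance_to_hyperplane(x, w, b):
    """r = |w.x + b| / ||w||, the distance from x to the hyperplane w.x + b = 0."""
    return abs(np.dot(w, x) + b) / np.linalg.norm(w)

def margin_of_separation(w):
    """rho = 2r = 2 / ||w||, per equation (4.4)."""
    return 2.0 / np.linalg.norm(w)

w, b = np.array([3.0, 4.0]), -1.0                           # ||w|| = 5
print(margin_of_separation(w))                              # 0.4
print(distance_to_hyperplane(np.array([2.0, 0.0]), w, b))   # |6 - 1| / 5 = 1.0
```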
Figure 4-2: The optimal separating hyperplane and its margin

4.2.2 SVM and SRM

Section 4.1 concluded that SRM needs a way to control the VC dimension of the hypothesis family. For separating hyperplanes the VC dimension can be bounded through ‖w‖, so minimizing ‖w‖, which is the same as maximizing the margin, is how the SVM realizes SRM.
Vapnik made this precise. Let the patterns x_1, x_2, ..., x_l lie inside a ball³ of radius R, and consider the hyperplanes with bounded weight norm

F_k(w, x) = { wᵀx + b : ‖w‖ ≤ A_k }.

The VC dimension h_k of F_k satisfies

h_k ≤ min{ ⌈R²A_k²⌉, m₀ } + 1,

where m₀ is the dimension of the input space. Tightening the bound A_k on ‖w‖ thus lowers the VC dimension regardless of m₀, so margin maximization is SRM over the nested families F_k. For the training set (x_1, d_1), ..., (x_l, d_l), x_i ∈ Rᴺ, d_i ∈ {−1, +1}, the optimal hyperplane is the solution of

Minimize over w, b:  Φ(w) = ½ wᵀw
subject to  d_i(wᵀx_i + b) ≥ 1,  i = 1, 2, ..., l.  (4.5)

³ The ball with center x_o and radius r is B(x_o, r) = { x ∈ Rⁿ : ‖x − x_o‖ < r }.
4.2.3 Linearly Nonseparable Patterns

Real pattern sets are rarely separable by any hyperplane: whatever hyperplane is drawn, some training patterns fall on the wrong side, as in Figure 4-3.

Figure 4-3: Nonseparable patterns
Nonnegative slack variables {ξ_i}_{i=1}^{l} relax the constraints (4.3) to

d_i(wᵀx_i + b) ≥ 1 − ξ_i,  i = 1, 2, ..., l,  (4.6)

and the total violation Φ(ξ) = Σ_{i=1}^{l} ξ_i is added to the quantity being minimized.

4.2.4 Nonlinear SVM

When a nonlinear decision surface is required, the SVM maps the input patterns into a high-dimensional feature space and separates their images with a hyperplane there, as illustrated in Figure 4-4.
Figure 4-4: Mapping the input space to a feature space

Let φ be the mapping from the m₀-dimensional input space to the m₁-dimensional feature space,

φ(x) = { φ₁(x), ..., φ_{m₁}(x) }.

The SVM constructs the separating hyperplane

wᵀφ(x) + b = 0

in this feature space.
Combining the slack variables with the feature-space mapping yields the general SVM training problem:

Minimize over w, b, ξ:  Φ(w, ξ) = ½ wᵀw + C Σ_{i=1}^{l} ξ_i
subject to  d_i(wᵀφ(x_i) + b) ≥ 1 − ξ_i,  i = 1, 2, ..., l,
            ξ_i ≥ 0,  i = 1, 2, ..., l.  (4.7)

The user-chosen constant C in (4.7) trades the size of ‖w‖, i.e. the margin, against the amount of constraint violation.

4.2.5 Training the SVM

Problem (4.7) is a Convex Programming problem [Peressini et al 88]. A function f(x) is convex if

f(λx₁ + [1−λ]x₂) ≤ λf(x₁) + [1−λ]f(x₂),

and Convex Programming minimizes a convex target function over a convex feasible set; such problems have no spurious local optima. Problem (4.7) is solved through Lagrangian duality by passing to its dual problem [Nash et al 97]. The Lagrangian primal function of (4.7) is

L(w, b, ξ, λ, γ) = ½ wᵀw + C Σ_{i=1}^{l} ξ_i − Σ_{i=1}^{l} λ_i{ d_i(wᵀφ(x_i) + b) − 1 + ξ_i } − Σ_{i=1}^{l} γ_i ξ_i,  (4.8)

where λ_i ≥ 0 and γ_i ≥ 0 are the Lagrange multipliers.
By min-max duality⁴, the dual problem of (4.7) is

maximize over λ ≥ 0, γ ≥ 0:  min over w, b, ξ of  L(w, b, ξ, λ, γ).  (4.9)

Because (4.8) is convex in (w, b, ξ), the inner minimization is characterized by vanishing partial derivatives:

cond1: ∂L/∂w = w − Σ_{i=1}^{l} λ_i d_i φ(x_i) = 0
cond2: ∂L/∂b = −Σ_{i=1}^{l} λ_i d_i = 0
cond3: ∂L/∂ξ_i = C − λ_i − γ_i = 0

Substituting cond1 through cond3 into (4.9) eliminates w, b, ξ, and γ, leaving the dual in λ alone:

maximize  L*(λ) = Σ_{i=1}^{l} λ_i − ½ Σ_{i=1}^{l} Σ_{j=1}^{l} λ_i λ_j d_i d_j φᵀ(x_i)φ(x_j)
subject to  Σ_{i=1}^{l} λ_i d_i = 0,  0 ≤ λ_i ≤ C.  (4.10)

In (4.10) the feature mapping appears only through the inner product φᵀ(x_i)φ(x_j), which the SVM replaces by a kernel function K(x_i, x_j).

⁴ max_{y∈Y} min_{x∈X} F(x, y) = min_{x∈X} max_{y∈Y} F(x, y) holds at a saddle point (x*, y*), i.e. a point with F(x*, y) ≤ F(x*, y*) ≤ F(x, y*).
The kernel is symmetric,

K(x_i, x_j) = φᵀ(x_i)φ(x_j) = φᵀ(x_j)φ(x_i) = Σ_k a_k φ_k(x_i)φ_k(x_j),

and Table 4-1 lists the kernels used in this thesis.

Table 4-1: Common kernel functions

Polynomial:             K(x, x_i) = (xᵀx_i + 1)ᵖ
RBF:                    K(x, x_i) = exp( −‖x − x_i‖² / (2σ²) )
Two-layer perceptron:   K(x, x_i) = tanh( β₀xᵀx_i + β₁ )

Solving (4.10) for the multipliers λ_i gives the decision surface

Σ_{i=1}^{l} λ_i d_i K(x, x_i) + b = 0,

whose weight vector is carried by the support vectors alone:

w = Σ_{i=1}^{Ns} λ_i d_i φ(x_i),  Ns = Number of Support Vectors.

The bias follows as b = 1 − Σ_i λ_i d_i K(x_i, x₀) from any support vector x₀ with d₀ = 1 and 0 < λ₀ < C. These facts rest on the Karush-Kuhn-Tucker (KKT) conditions, which hold at the saddle point⁵ (w*, b*, ξ*, λ*, γ*):

λ_i[ d_i(wᵀφ(x_i) + b) − 1 + ξ_i ] = 0,  i = 1, 2, ..., l
γ_i ξ_i = 0,  i = 1, 2, ..., l

If λ_i < C, cond3 gives γ_i > 0 and hence ξ_i = 0; every support vector x_i with 0 < λ_i < C therefore lies exactly on the margin, d_i(wᵀφ(x_i) + b) − 1 = 0, which is the relation used above to recover b.

⁵ With L the Lagrangian, (x*, λ*) is a saddle point when L(x*, λ) ≤ L(x*, λ*) ≤ L(x, λ*).
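The kernels of Table 4-1 and the resulting decision function translate directly into code; the sketch below is illustrative, with β₀ and β₁ defaulting to the values used in Chapter 5:

```python
import numpy as np

def polynomial_kernel(x, xi, p=2):
    """K(x, x_i) = (x.x_i + 1)^p."""
    return (np.dot(x, xi) + 1.0) ** p

def rbf_kernel(x, xi, sigma=1.0):
    """K(x, x_i) = exp(-||x - x_i||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - xi) ** 2) / (2.0 * sigma ** 2))

def perceptron_kernel(x, xi, beta0=2.0, beta1=1.0):
    """K(x, x_i) = tanh(beta0 * x.x_i + beta1)."""
    return np.tanh(beta0 * np.dot(x, xi) + beta1)

def decision_surface(x, svs, d, lam, b, kernel):
    """sum_i lambda_i d_i K(x, x_i) + b, summed over the support vectors."""
    return sum(l * di * kernel(x, xi) for l, di, xi in zip(lam, d, svs)) + b
```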
The dual (4.10) is a Convex Programming problem whose target function is quadratic in the unknowns, i.e. a Quadratic Programming problem [Nash et al 97]. Written in matrix form, (4.10) becomes

Minimize  F(Λ) = −Λᵀ1 + ½ ΛᵀHΛ
subject to  Λᵀd = 0,  Λ ≤ C1,  Λ ≥ 0,  (4.11)

where H_ij = d_i d_j K(x_i, x_j). In this thesis the Quadratic Programming solver LOQO [Vanderbei 97] is used to solve (4.11).
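As a sketch, the dual (4.11) can also be handed to any Quadratic Programming package; the version below uses the open-source cvxopt solver, an assumption of this sketch rather than the solver used in the thesis, and expects X and d as numpy arrays:

```python
import numpy as np
from cvxopt import matrix, solvers   # stand-in for LOQO [Vanderbei 97]

def train_dual_svm(X, d, C, kernel):
    """Solve (4.11): min -1'L + (1/2) L'HL  s.t.  d'L = 0, 0 <= L <= C1."""
    l = len(d)
    K = np.array([[kernel(X[i], X[j]) for j in range(l)] for i in range(l)])
    H = np.outer(d, d) * K                           # H_ij = d_i d_j K(x_i, x_j)
    P, q = matrix(H), matrix(-np.ones(l))
    G = matrix(np.vstack([-np.eye(l), np.eye(l)]))   # encodes 0 <= lambda_i <= C
    h = matrix(np.hstack([np.zeros(l), C * np.ones(l)]))
    A, b = matrix(d.reshape(1, l).astype(float)), matrix(0.0)
    lam = np.ravel(solvers.qp(P, q, G, h, A, b)['x'])
    return lam                    # the support vectors are those with lam > 0
```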
5 Experiments

5.1 The Reuters-21578 Collection

The experiments use the Reuters-21578 collection of Reuters newswire stories, originally assembled by Reuters and the Carnegie Group as Reuters-22173 and distributed by David Lewis; after 595 duplicate documents were removed, the collection was renamed Reuters-21578. The documents are marked up in SGML, and each may carry category labels from the five category sets shown in Table 5-1.
Table 5-1: Category sets of Reuters-21578

Category set   Number of categories
EXCHANGES          39
ORGS               56
PEOPLE            267
PLACES            175
TOPICS            135

This thesis uses the TOPICS categories. Reuters-21578 defines three standard divisions of the collection into training and test sets:

ModLewis
  Training (13625):  LEWISSPLIT="TRAIN"; TOPICS="YES" or "NO"
  Test (6188):       LEWISSPLIT="TEST"; TOPICS="YES" or "NO"
  Not used (1765):   LEWISSPLIT="NOT-USED" or TOPICS="BYPASS"

ModApte
  Training (9603):   LEWISSPLIT="TRAIN"; TOPICS="YES"
  Test (3299):       LEWISSPLIT="TEST"; TOPICS="YES"
  Not used (8676):   all remaining documents

ModHayes
  Training (20856):  CGISPLIT="TRAINING-SET"
  Test (722):        CGISPLIT="PUBLISHED-TESTSET"
  Not used (0)

Table 5-2 summarizes these three splits.

Table 5-2: The three standard splits of Reuters-21578

5.2 Preprocessing

The experiments use the ModApte split. Only the text between the <BODY> and </BODY> tags of each Reuters newsline is kept. The raw text contains on the order of 100000 distinct words; the vocabulary is reduced by stemming, stop-list filtering, and a DF (Document Frequency) cutoff, whose effect is shown in Table 5-3.

Table 5-3: Effect of the DF threshold on the vocabulary
After stemming, stop-word removal, and the frequency cutoff, 8754 words remain, so each document becomes an 8754-dimensional feature vector for the SVM. The vectors are written sparsely as "n: value" pairs, one for each feature n whose tfidf weight is nonzero; absent features are 0. The ModApte split provides 9603 training and 3299 test documents; restricting to those with a nonempty <BODY></BODY> section leaves 8762 training and 3009 test documents. Of the 135 TOPICS categories, the 10 most frequent are used, listed in Table 5-4.
Table 5-4: The ten most frequent TOPICS categories

Category    Training documents   Test documents
Earn             2839                1043
Acq              1611                 672
Money-fx          513                 143
Grain             415                 125
Crude             370                 156
Trade             351                 102
Interest          323                  97
Wheat             198                  61
Ship              175                  66
Corn              152                  36

Among these ten, the five categories "corn", "crude", "earn", "grain", and "interest" are used in the experiments that follow.

5.3 Results

Classifiers are evaluated with two measures. The first is accuracy, the fraction of test documents classified correctly.
The second is the precision/recall break-even point. With the counts of Table 5-5,

Table 5-5: Contingency table for a binary classifier

                 actual +1   actual −1
assigned +1          a           b
assigned −1          c           d

recall = a/(a+c) and precision = a/(a+b). The break-even point is the value at which precision equals recall; when the two are not exactly equal, it is approximated by their average,

precision/recall break-even point = (precision + recall) / 2,

and accuracy is likewise expressed as (a+d)/(a+b+c+d).
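All four measures follow from the contingency counts; a small self-contained sketch:

```python
def evaluate(assigned, actual):
    """Measures of Table 5-5 from parallel lists of +1/-1 labels."""
    a = sum(1 for p, t in zip(assigned, actual) if p == +1 and t == +1)
    b = sum(1 for p, t in zip(assigned, actual) if p == +1 and t == -1)
    c = sum(1 for p, t in zip(assigned, actual) if p == -1 and t == +1)
    d = sum(1 for p, t in zip(assigned, actual) if p == -1 and t == -1)
    recall = a / (a + c) if a + c else 0.0
    precision = a / (a + b) if a + b else 0.0
    break_even = (precision + recall) / 2.0   # the approximation above
    accuracy = (a + d) / (a + b + c + d)
    return precision, recall, break_even, accuracy

print(evaluate([+1, +1, -1, -1], [+1, -1, +1, -1]))   # (0.5, 0.5, 0.5, 0.5)
```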
5.3.1 SVM

All SVMs were trained with C = 1000. Figures 5-1, 5-2, and 5-3 show the results for the three kernels of Table 4-1.

Figure 5-1: Results with polynomial kernels on the five categories

Figure 5-2: Results with the RBF kernel (σ = 1)
Figure 5-3: Results with the two-layer perceptron kernel (β₀ = 2, β₁ = 1)

Table 5-6: Number of support vectors per category and kernel

Kernel                  corn   crude   earn   grain   interest
Polynomial              1540   2248    3297   2566    1517
RBF                      943   1491    2469   1749    1083
Two-layer perceptron     124    179     419    209     286

As Figures 5-1, 5-2, and 5-3 show, the SVM exceeds 97% accuracy on all five categories, with 'earn' classified best. Performance tracks the amount of training data in Table 5-4, which grows in the order corn, interest, crude, grain, earn; Table 5-6 lists how many training documents each kernel retains as SVs.
5.3.2 Comparison with the Naïve Bayesian Classifier

A Naïve Bayesian classifier was trained on the same five categories, and Figure 5-4 compares its precision/recall break-even points with the SVM's. The SVM with the RBF kernel beats the Naïve Bayesian classifier on every category.

Figure 5-4: Break-even points of the Naïve Bayesian classifier and the SVM

5.3.3 Retraining with Support Vectors Only
To test how well the SVs summarize the training data, a multilayer perceptron was trained on only the SVs produced by the RBF-kernel SVM rather than on all 8762 training documents; as Table 5-6 shows, the SV sets are several times smaller. Figures 5-5 and 5-6 give the results.

Figure 5-5: Results of the multilayer perceptron trained on SVs only
Figure 5-6: Results of the RBF network trained on SVs only

As Figures 5-5 and 5-6 show, the networks trained on the SVs alone perform close to the full SVMs of Figures 5-1 and 5-2, while training on a far smaller set.
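The retraining experiment can be reproduced in outline with scikit-learn; this is an assumption of the sketch (the thesis used its own SVM and network implementations), and gamma = 0.5 here corresponds to σ = 1 in the RBF kernel of Table 4-1:

```python
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

def retrain_on_svs(X_train, y_train, X_test, y_test):
    """Train an RBF-kernel SVM, keep only its SVs, retrain an MLP on them.

    X_* are numpy arrays of tfidf vectors; y_* are +1/-1 category labels.
    """
    svm = SVC(kernel='rbf', C=1000, gamma=0.5).fit(X_train, y_train)
    sv = svm.support_                    # indices of the support vectors
    print(len(sv), "SVs out of", len(X_train), "training documents")
    mlp = MLPClassifier(hidden_layer_sizes=(50,), max_iter=500)
    mlp.fit(X_train[sv], y_train[sv])    # reduced training set: the SVs only
    return mlp.score(X_test, y_test)     # accuracy on the full test set
```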
6 Conclusion

This thesis applied the SVM to text classification, a problem whose input space is very high-dimensional and whose document vectors are sparse. On the five frequent Reuters-21578 topics, the SVM reached an accuracy above 97% and a better precision/recall break-even point than the Naïve Bayesian classifier. The SVs that the SVM extracts proved to be an effective summary of the training data: a multilayer perceptron trained on the SVs alone retained nearly the same performance while training far faster. The remaining cost of the SVM is its Quadratic
6 Programmmng SVM Quadratc Optmzer 43
References

[Cherkassky et al 98] Vladimir Cherkassky and Filip Mulier, Learning From Data: Concepts, Theory, and Methods, John Wiley & Sons, Inc., 1998.
[Frakes et al 92] William B. Frakes and Ricardo Baeza-Yates, Information Retrieval: Data Structures & Algorithms, Prentice-Hall, Inc., 1992.
[Haykin 98] Simon Haykin, Neural Networks, 2nd edition, Prentice-Hall, Inc., 1998.
[Lewis 91] David D. Lewis, "Evaluating Text Categorization," In Proceedings of the Speech and Natural Language Workshop, pp. 312-317, 1991.
[Liere 97] Ray Liere and Prasad Tadepalli, "Active Learning with Committees for Text Categorization," In Proceedings of AAAI-97, pp. 591-596, 1997.
[Mitchell 97] Tom M. Mitchell, Machine Learning, McGraw-Hill Companies, Inc., 1997.
[Nash et al 97] Stephen G. Nash and Ariela Sofer, Linear and Nonlinear Programming, McGraw-Hill Companies, Inc., 1997.
[Osuna et al 97] Edgar E. Osuna, Robert Freund, and Federico Girosi, "Support Vector Machines: Training and Applications," AI Memo, MIT AI Lab, 1997.
[Peressini et al 88] A. L. Peressini, F. E. Sullivan, and J. J. Uhl, Jr., The Mathematics of Nonlinear Programming, Springer-Verlag New York, Inc., 1988.
[Quinlan 93] J. Ross Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, Inc., 1993.
[Stitson 96] M. O. Stitson, J. A. E. Weston, A. Gammerman, V. Vovk, and V. Vapnik, "Theory of Support Vector Machines," Technical Report CSD-TR-96-17, Royal Holloway, University of London, 1996.
[Vanderbei 97] Robert J. Vanderbei, "LOQO User's Manual, Version 3.10," Technical Report SOR-97-08, Princeton University, 1997.
[Vapnik 95] V. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag New York, Inc., 1995.
[Yang 97] Yiming Yang, "An Evaluation of Statistical Approaches to Text Categorization," Technical Report CMU-CS-97-127, Carnegie Mellon University, 1997.
ABSTRACT

Now that the world is connected by online networks, it is an age of a flood of information. It is difficult and time-consuming to classify, according to each user's interests, the enormous volume of information pouring in from online networks. Therefore, if classification systems can be built automatically using machine learning techniques, it will be very efficient. The text classification problem has a very high-dimensional input space, and the information that each text itself contains is sparse. In this paper, the Support Vector Machine (SVM), an algorithm suited to problems with these characteristics, is implemented. In order to examine the effect of the Support Vectors (SVs) that the SVM produces, a multilayer perceptron network is trained over the reduced data set consisting only of SVs. The SVM is a very strong algorithm based on the Structural Risk Minimization (SRM) principle of statistical learning theory. In addition, the SVM's learning process, which searches for the optimal solution, is a mathematically well-modeled process called Convex Programming. In the experiments on the five most frequent topics of the Reuters-21578 document set, it is remarkable that the resulting accuracy is higher than 97%. The SVM also shows a better break-even point than the Naïve Bayesian classifier's. In addition, the multilayer perceptron network trained using only SVs not only shows good performance but also reduces training time remarkably.

Keywords: Text Classification, Multilayer Perceptron Network, SVM, SRM, Convex Programming