Computer Vision and Pattern Recognition Workshop, 2009.2.12. Support Vector Machines. http://cespc1.kumoh.ac.kr/~nonezero/svm ws cvpr.pdf. Jaepil Ko, School of Computer Engineering, Kumoh National Institute of Technology.
Contents
Introduction
Optimal Hyperplane
Soft-Margin SVM
Nonlinear SVM with Kernel Trick
Training SVM: SMO
Link to Statistical Learning Theory
Multiclass SVM with Classifiers Ensemble
SVM Demo
References
Sample: each object is a sample. [figure: objects plotted in the weight-height plane]
Feature: weight (x_1) and height (x_2) are features; a sample is represented by a feature vector x = (x_1, x_2).
Training Set: {(x_i, y_i)}_{i=1}^{N}, where each x_i = (x_1, x_2) is a feature vector (weight, height) and y_i is its class label, e.g. y_1 = giraffe, y_2 = hippo.
How to Classify Them Using a Computer? [figure: the two classes of samples in the weight-height plane]
How to Classify Them Using a Computer? Consider the line x_2 = (3/4)x_1, i.e. 4x_2 = 3x_1, i.e. 3x_1 - 4x_2 = 0, i.e. (3 -4)(x_1 x_2)^T = 0. With w = (3, -4)^T the line is w^T x = 0. Checking the two samples: for (2, 3), 3*2 - 4*3 < 0; for (4, 2), 3*4 - 4*2 > 0.
Linear Classification: the hyperplane (decision surface) w^T x + b = 0 separates class 1 (y = +1), where w^T x + b > 0, from class 2 (y = -1), where w^T x + b < 0.
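To make the decision rule concrete, here is a minimal Python sketch (not from the original slides) of classification by the sign of w^T x + b, using the toy hyperplane w = (3, -4)^T, b = 0 from the previous slide; numpy is assumed.

```python
import numpy as np

def linear_classify(x, w, b):
    """Assign class +1 if w^T x + b > 0, else class -1."""
    return 1 if w @ x + b > 0 else -1

# the toy hyperplane from the previous slide: w = (3, -4)^T, b = 0
w, b = np.array([3.0, -4.0]), 0.0
print(linear_classify(np.array([2.0, 3.0]), w, b))   # -1: 3*2 - 4*3 < 0
print(linear_classify(np.array([4.0, 2.0]), w, b))   # +1: 3*4 - 4*2 > 0
```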
Optimal Hyperplane: many hyperplanes w^T x + b = 0 separate the training set without error; which one should we choose? [figure: several candidate separating lines in the weight-height plane]
Optimal Hyperplane: the optimal hyperplane w^T x + b = 0 is the one that maximizes the margin between the region w^T x + b > 0 and the region w^T x + b < 0.
Canonical Hyperplane: rescale w and b (a scale factor) so that the support vectors, the samples closest to the optimal hyperplane w^T x + b = 0, lie exactly on w^T x + b = -1 (y = -1) and w^T x + b = +1 (y = +1); all other samples satisfy w^T x + b <= -1 or >= +1.
Normal Vector: for the line through (2, 4), x_2 = (4/2)x_1, i.e. 2x_2 = 4x_1, i.e. 4x_1 - 2x_2 = 0, i.e. (4 -2)(x_1 x_2)^T = 0, so the line is w^T x = 0 with w = (4, -2)^T. The weight vector is normal to the line: (2 4)(4 -2)^T = 8 - 8 = 0. Recall the inner product: x.y = <x, y> = x^T y = x_1 y_1 + x_2 y_2 = ||x|| ||y|| cos(theta).
Distance from a Sample to the Optimal Hyperplane: write x = x_p + r w/||w||, where x_p is the projection of x onto the hyperplane g(x) = w^T x + b = 0 and r is the signed distance. Then g(x) = w^T(x_p + r w/||w||) + b = (w^T x_p + b) + r w^T w/||w|| = r ||w||, hence r = g(x)/||w||.
Margin of Separation: a support vector x^(s) satisfies w^T x^(s) + b = -1 for y^(s) = -1 and w^T x^(s) + b = +1 for y^(s) = +1. Its distance to the hyperplane is r = g(x^(s))/||w||, where g(x^(s)) = w^T x^(s) + b = +-1 for y^(s) = +-1. The margin of separation is therefore 2r = 2/||w||.
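A minimal Python sketch (my own, using numpy) of the two formulas just derived: the signed distance r = g(x)/||w|| and the margin 2/||w||, evaluated on the toy hyperplane from the earlier example.

```python
import numpy as np

def distance_to_hyperplane(x, w, b):
    """Signed distance r = g(x)/||w|| of a sample to the hyperplane w^T x + b = 0."""
    return (w @ x + b) / np.linalg.norm(w)

def margin(w):
    """Margin of separation 2/||w|| of the canonical hyperplane."""
    return 2.0 / np.linalg.norm(w)

w, b = np.array([3.0, -4.0]), 0.0                           # toy hyperplane from earlier
print(distance_to_hyperplane(np.array([2.0, 3.0]), w, b))   # -1.2
print(margin(w))                                            # 0.4
```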
Finding the Optimal Hyperplane: maximize the margin 2/||w||, equivalently minimize (1/2)||w||^2, subject to w^T x_i + b <= -1 for y_i = -1 and w^T x_i + b >= +1 for y_i = +1, i.e. y_i(w^T x_i + b) >= 1 for all i. This is a constrained optimization problem with a convex (quadratic) objective/cost function and linear constraints.
Convex Quadratic Problem: the Lagrangian has a saddle point; minimizing over the primal variables gives the primal problem, maximizing over the Lagrange multipliers gives the dual problem, and at the saddle point the two coincide.
Lagrange Multipliers Method: to optimize a cost function f(x) subject to a constraint g(x) = 0, require grad f(x) = lambda * grad g(x): at the solution the tangent (line) of the constraint is shared, so the gradients (normals) are parallel, whether f is convex or concave. Equivalently, form the Lagrange function L(x, lambda) = f(x) + lambda * g(x) with Lagrange multiplier lambda, and solve grad_x L = 0 together with g(x) = 0.
A Brief Overview of Optimization Theory (from the 2001 Fall CVPR tutorial) [several slides of equations, including the Karush-Kuhn-Tucker conditions]
Lagrange Function for the Optimal Hyperplane (Primal Problem): L_P(w, b, alpha) = (1/2)||w||^2 - Sum_i alpha_i [y_i(w^T x_i + b) - 1] with alpha_i >= 0. Solution: dL_P/dw = 0 gives w = Sum_i alpha_i y_i x_i, and dL_P/db = 0 gives Sum_i alpha_i y_i = 0. KKT condition: alpha_i [y_i(w^T x_i + b) - 1] = 0 for all i.
Remark on Support Vectors: by the KKT condition, if alpha_i > 0 then y_i(w^T x_i + b) = 1 and x_i is a support vector; if alpha_i = 0, x_i is not a support vector. Hence w = Sum_i alpha_i y_i x_i is related only to the support vectors x_i.
Lagrange Function for the Optimal Hyperplane (Dual Problem): maximize L_D(alpha) = Sum_i alpha_i - (1/2) Sum_i Sum_j alpha_i alpha_j y_i y_j x_i^T x_j subject to alpha_i >= 0 and Sum_i alpha_i y_i = 0. Optimizing L_D depends on the input patterns only in the form of a set of dot products x_i^T x_j: (+) it does not depend on the dimension of the input patterns, and (+) the dot product can be replaced with a kernel.
Large-Scale Quadratic Problem: the dual QP has one multiplier per training sample, so its Hessian is an n x n matrix; for n > 1000 this becomes a large-scale quadratic problem.
Support Vector Machine Classifier for the Linearly Separable Case: optimal weight w = Sum_i alpha_i y_i x_i (the sum runs over the support vectors) and optimal bias b = y^(s) - w^T x^(s) for any support vector (x^(s), y^(s)); the classifier is f(x) = sgn(w^T x + b).
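As an illustration of the dual formulation on the previous slides, here is a minimal sketch that solves the hard-margin dual QP with the cvxopt solver (an assumed dependency, not mentioned in the slides) and recovers the optimal weight and bias from the support vectors; the toy data are my own.

```python
import numpy as np
from cvxopt import matrix, solvers

def train_linear_svm(X, y):
    """Hard-margin SVM: X is (N, d), y is (N,) with labels in {-1, +1}."""
    N = X.shape[0]
    Q = (y[:, None] * X) @ (y[:, None] * X).T        # Q_ij = y_i y_j x_i . x_j
    P = matrix(Q.astype(float))
    q = matrix(-np.ones(N))                          # maximize sum(alpha) <-> minimize -1^T alpha
    G = matrix(-np.eye(N))                           # -alpha_i <= 0
    h = matrix(np.zeros(N))
    A = matrix(y.reshape(1, -1).astype(float))       # equality constraint sum_i alpha_i y_i = 0
    b = matrix(0.0)
    solvers.options['show_progress'] = False
    alpha = np.array(solvers.qp(P, q, G, h, A, b)['x']).flatten()

    sv = alpha > 1e-6                                # support vectors have alpha_i > 0 (KKT)
    w = ((alpha * y)[:, None] * X).sum(axis=0)       # w = sum_i alpha_i y_i x_i
    b_opt = np.mean(y[sv] - X[sv] @ w)               # b = y^(s) - w^T x^(s), averaged over SVs
    return w, b_opt

# toy usage on two separable clusters (my own data)
X = np.array([[2., 3.], [1., 3.], [4., 1.], [4., 2.]])
y = np.array([-1., -1., 1., 1.])
w, b = train_linear_svm(X, y)
print(np.sign(X @ w + b))                            # reproduces y
```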
Optimal Hyperplane for the Non-Separable Case: when misclassification cannot be avoided, i.e. it is impossible to find a separating hyperplane, some samples must be given up as errors while minimizing the probability of classification error averaged over the training set.
Soft Margin Technique. Problem: the constraints y_i(w^T x_i + b) >= 1 cannot be satisfied for all i. Adopting slack variables xi_i >= 0, the constraints become y_i(w^T x_i + b) >= 1 - xi_i. Minimizing errors: min (1/2)||w||^2 + C Sum_i xi_i; to keep the problem a QP, the count of misclassifications is replaced by the sum of the slacks.
Lagrange Function for the Soft-Margin Technique: the term C Sum_i xi_i penalizes errors and the term (1/2)||w||^2 penalizes complexity; the constant C controls the trade-off between the two.
Dual Problem for the Soft-Margin Technique: the dual is unchanged except that the multipliers are box-constrained, 0 <= alpha_i <= C; patterns with 0 < alpha_i < C are non-bound patterns (exactly on the margin) and patterns with alpha_i = C are bound patterns (inside the margin or misclassified). The optimal value of C is determined experimentally; it cannot be readily related to the characteristics of the dataset or model.
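Since the slide notes that C must be chosen experimentally, here is a minimal sketch of doing exactly that with cross-validated grid search; scikit-learn and its SVC implementation are assumed, and the synthetic dataset is only for illustration.

```python
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_classification

# noisy, roughly linearly separable 2D data (illustrative only)
X, y = make_classification(n_samples=200, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, flip_y=0.1, random_state=0)

# try several values of C and keep the one with the best cross-validated accuracy
search = GridSearchCV(SVC(kernel='linear'),
                      param_grid={'C': [0.01, 0.1, 1, 10, 100]},
                      cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```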
Nonlinear Support Vector Machines: What if the problem is not linear?
Feature Space (1D to 2D): data that are linearly non-separable in 1D (the low-dimensional input space) can become linearly separable in 2D (a higher-dimensional feature space) after a nonlinear mapping; see the sketch below for an illustrative example.
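A minimal sketch of this idea on a 1D toy problem (my own construction, not the slide's figure): the points are not linearly separable on the line, but become separable after the mapping x -> (x, x^2).

```python
import numpy as np

x = np.linspace(-1, 1, 21)
y = np.where(np.abs(x) < 0.45, 1, -1)       # inner points (+1) vs. outer points (-1):
                                            # no single threshold on x separates them

phi = np.column_stack([x, x ** 2])          # nonlinear map into 2D feature space
# in the feature space the line x2 = 0.2 separates the classes: w = (0, -1), b = 0.2
w, b = np.array([0.0, -1.0]), 0.2
print(np.all(np.sign(phi @ w + b) == y))    # True: linearly separable after the mapping
```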
Dual Problem and Classifier in the Feature Space: replace the dot products x_i^T x_j with phi(x_i)^T phi(x_j). Computing phi explicitly is not feasible: the feature space can be huge or even infinite-dimensional. However, the phi features appear only in the form of inner products.
Kernel: a function k that takes 2 variables and computes a scalar value (a kind of similarity), k(x, x') = phi(x)^T phi(x'). Kernel matrix: the m x m matrix K with elements K_ij = k(x_i, x_j). Standard kernels: the polynomial kernel, the radial basis function (RBF) kernel, and the sigmoid kernel; see the sketch below.
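A minimal sketch of the three standard kernels named above as plain Python/numpy functions, plus a helper that builds the kernel matrix; the hyperparameter names (degree, gamma, coef0) are my own choice, not taken from the slides.

```python
import numpy as np

def polynomial_kernel(x, z, degree=2, coef0=1.0):
    """(coef0 + x . z)^degree"""
    return (coef0 + x @ z) ** degree

def rbf_kernel(x, z, gamma=1.0):
    """exp(-gamma ||x - z||^2)"""
    return np.exp(-gamma * np.sum((x - z) ** 2))

def sigmoid_kernel(x, z, gamma=1.0, coef0=0.0):
    """tanh(gamma x . z + coef0)"""
    return np.tanh(gamma * (x @ z) + coef0)

def kernel_matrix(X, k):
    """m x m matrix with K_ij = k(x_i, x_j)."""
    m = X.shape[0]
    return np.array([[k(X[i], X[j]) for j in range(m)] for i in range(m)])
```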
Mercer's Condition: valid kernel functions should satisfy Mercer's condition. For any g(x) for which the integral of g(x)^2 is finite, it must be the case that the double integral of k(x, x') g(x) g(x') is non-negative. A practical criterion is that the kernel should be positive semi-definite. Theorem: if a kernel is positive semi-definite, i.e. Sum_i Sum_j k(x_i, x_j) c_i c_j >= 0 where the c_i are real numbers, then there exists a function phi, possibly mapping into a higher-dimensional space, such that k defines an inner product: k(x_i, x_j) = phi(x_i)^T phi(x_j).
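In practice Mercer's condition is usually checked through positive semi-definiteness of the kernel matrix; a minimal numpy sketch (my own) follows, testing the eigenvalues of kernel matrices built on random data.

```python
import numpy as np

def is_psd(K, tol=1e-8):
    """True if the symmetric matrix K has no eigenvalue below -tol."""
    eigvals = np.linalg.eigvalsh((K + K.T) / 2)   # symmetrize against round-off
    return bool(np.all(eigvals >= -tol))

X = np.random.randn(20, 3)

# RBF kernel matrix: always positive semi-definite
K_rbf = np.exp(-np.sum((X[:, None] - X[None, :]) ** 2, axis=-1))
print(is_psd(K_rbf))          # True

# sigmoid kernel: not PSD in general, so this check can fail
# depending on the parameters and the sample drawn
K_sig = np.tanh(X @ X.T - 1.0)
print(is_psd(K_sig))
```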
Dual Problem and Classifier with a Kernel (Generalized Inner-Product SVM): maximize L_D(alpha) = Sum_i alpha_i - (1/2) Sum_i Sum_j alpha_i alpha_j y_i y_j k(x_i, x_j) subject to the same constraints, and classify with f(x) = sgn(Sum_i alpha_i y_i k(x, x_i) + b). This is the kernel trick: it works without ever computing the mapping phi.
Example: XOR Problem (small-scale QP problem). Training set: {(-1, -1; -1), (-1, +1; +1), (+1, -1; +1), (+1, +1; -1)}, denoted x_1, x_2, x_3, x_4. Let the kernel be k(x, x_i) = (1 + x^T x_i)^2 with x = (x_1, x_2)^T and x_i = (x_i1, x_i2)^T. Then k(x, x_i) = (1 + x_1 x_i1 + x_2 x_i2)^2 = 1 + x_1^2 x_i1^2 + 2 x_1 x_2 x_i1 x_i2 + x_2^2 x_i2^2 + 2 x_1 x_i1 + 2 x_2 x_i2, which corresponds to the mapping phi(x) = (1, x_1^2, x_2^2, sqrt(2) x_1, sqrt(2) x_2, sqrt(2) x_1 x_2)^T. Kernel matrix (4 x 4): K = [[9, 1, 1, 1], [1, 9, 1, 1], [1, 1, 9, 1], [1, 1, 1, 9]].
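A minimal numpy sketch (not in the original slides) checking the algebra above: the explicit 6-dimensional mapping phi reproduces the kernel (1 + x^T x_i)^2 as an ordinary dot product, and yields the 4 x 4 kernel matrix stated above.

```python
import numpy as np

def phi(x):
    """Explicit feature map for the polynomial kernel (1 + x^T z)^2 in 2D."""
    return np.array([1.0, x[0]**2, x[1]**2,
                     np.sqrt(2)*x[0], np.sqrt(2)*x[1], np.sqrt(2)*x[0]*x[1]])

X = np.array([[-1., -1.], [-1., 1.], [1., -1.], [1., 1.]])   # XOR inputs
K_kernel = (1 + X @ X.T) ** 2                                # kernel matrix via the kernel
K_map = np.array([[phi(a) @ phi(b) for b in X] for a in X])  # kernel matrix via the mapping
print(np.allclose(K_kernel, K_map))                          # True
print(K_kernel)                                              # [[9 1 1 1],[1 9 1 1],[1 1 9 1],[1 1 1 9]]
```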
Example: XOR Problem. Dual problem: L_D(alpha) = alpha_1 + alpha_2 + alpha_3 + alpha_4 - (1/2)(9 alpha_1^2 - 2 alpha_1 alpha_2 - 2 alpha_1 alpha_3 + 2 alpha_1 alpha_4 + 9 alpha_2^2 + 2 alpha_2 alpha_3 - 2 alpha_2 alpha_4 + 9 alpha_3^2 - 2 alpha_3 alpha_4 + 9 alpha_4^2). Optimizing L_D (setting the partial derivatives to zero): 9 alpha_1 - alpha_2 - alpha_3 + alpha_4 = 1, -alpha_1 + 9 alpha_2 + alpha_3 - alpha_4 = 1, -alpha_1 + alpha_2 + 9 alpha_3 - alpha_4 = 1, alpha_1 - alpha_2 - alpha_3 + 9 alpha_4 = 1. Optimal Lagrange multipliers: alpha_1 = alpha_2 = alpha_3 = alpha_4 = 1/8 > 0, so all the samples are support vectors. Optimal w: w = Sum_i alpha_i y_i phi(x_i) = (1/8)(-1) phi(x_1) + (1/8)(+1) phi(x_2) + (1/8)(+1) phi(x_3) + (1/8)(-1) phi(x_4) = (0, 0, 0, 0, 0, -1/sqrt(2))^T. Optimal hyperplane: f(x) = sgn(w^T phi(x)) = sgn(-x_1 x_2).
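The worked solution can also be verified numerically; a minimal numpy sketch (my own) solves the 4 x 4 stationarity system for the multipliers and reconstructs the decision function sgn(-x_1 x_2).

```python
import numpy as np

X = np.array([[-1., -1.], [-1., 1.], [1., -1.], [1., 1.]])
y = np.array([-1., 1., 1., -1.])
K = (1 + X @ X.T) ** 2
Q = (y[:, None] * y[None, :]) * K            # Q_ij = y_i y_j k(x_i, x_j)

# stationarity of L_D: 1 - Q alpha = 0 (the constraint sum_i alpha_i y_i = 0
# happens to be satisfied by this solution)
alpha = np.linalg.solve(Q, np.ones(4))
print(alpha)                                 # [0.125 0.125 0.125 0.125]

def f(x):
    """f(x) = sgn( sum_i alpha_i y_i k(x, x_i) ); the bias is zero here."""
    return int(np.sign(np.sum(alpha * y * (1 + X @ x) ** 2)))

print([f(x) for x in X])                     # [-1, 1, 1, -1]: XOR is reproduced
```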
Sequential Minimal Optimization (Platt 98): in the case of a large-scale QP problem, divide the large QP into a series of smaller QP sub-problems and optimize them sequentially. What is the smallest working set? Just 2 Lagrange multipliers are updated at a time (the equality constraint Sum_i alpha_i y_i = 0 makes it impossible to change a single multiplier alone). Updating a subset of the variables while the others are fixed still converges to the global optimum.
Sequential Minimal Optimization: write the dual problem as a function of just the 2 chosen alphas and update them analytically. Update rules: no numerical QP routine is needed, and memory is used only for buffering the prediction errors E. See the sketch below.
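For concreteness, here is a minimal sketch (my own simplification, not Platt's full pseudocode) of the analytic two-multiplier update at the core of SMO; it omits the pair-selection heuristics and the threshold update. E is assumed to hold the current prediction errors E_k = f(x_k) - y_k and K the kernel matrix.

```python
import numpy as np

def smo_pair_update(i, j, alpha, y, K, E, C):
    """Optimize alpha_i and alpha_j with all other multipliers held fixed."""
    if y[i] != y[j]:                                 # box [0, C] combined with the
        L = max(0.0, alpha[j] - alpha[i])            # equality sum_k alpha_k y_k = 0
        H = min(C, C + alpha[j] - alpha[i])          # restricts the pair to a segment
    else:
        L = max(0.0, alpha[i] + alpha[j] - C)
        H = min(C, alpha[i] + alpha[j])
    eta = K[i, i] + K[j, j] - 2.0 * K[i, j]          # curvature along that segment
    if L == H or eta <= 0:
        return alpha                                 # degenerate pair: a full SMO
                                                     # evaluates the objective at L and H
    a_j = alpha[j] + y[j] * (E[i] - E[j]) / eta      # unconstrained optimum for alpha_j
    a_j = float(np.clip(a_j, L, H))                  # clip to the feasible segment
    a_i = alpha[i] + y[i] * y[j] * (alpha[j] - a_j)  # keep sum_k alpha_k y_k unchanged
    new = alpha.copy()
    new[i], new[j] = a_i, a_j
    return new
```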
Link to Statistical Learning Theory (Vapnik 95): learning f from a training set ideally minimizes the expected risk, the loss L(y, f(x)) averaged over the joint distribution P(x, y); but P(x, y) is unknown. Minimize instead the empirical risk, the average loss over the training set: R_emp(f) = (1/N) Sum_i L(y_i, f(x_i)).
A Risk Bound: with high probability, R(f) <= R_emp(f) + a confidence term that grows with the complexity (capacity) of the function set and shrinks with the number of training samples. Minimizing this bound over a nested sequence of function sets is structural risk minimization.
Vapnik-Chervonenkis (VC) Dimension: the maximum number of points that can be labeled (shattered) in all possible ways by functions of the set. The VC dimension of linear classifiers in N dimensions is h = N + 1; lines (dichotomies) can shatter 3 points in 2D, but not 4. It is a measure of the complexity of the function set, so minimizing the VC dimension minimizes complexity.
VC Dimension of Margin Hyperplanes: the VC dimension h of margin hyperplanes satisfies h <= min(ceil(R^2 ||w||^2), N) + 1, where R is the radius of the smallest sphere containing the data points. Maximizing the margin 2/||w|| therefore minimizes a bound on the VC dimension.
Results for Gaussian Kernel [figures: classification results with the Gaussian (RBF) kernel]
Summary: Optimal separating hyperplane (margin). Global minimum solution (convexity). Only SVs are relevant (sparseness): the SVs are selected automatically, and the number of SVs can be viewed as the number of hidden units of an MLP. Remaining issues: the model selection problem (kernel selection), training speed and methods for large training sets, and the fact that the basic SVM is a binary classifier.
Multiclass Support Vector Machines (Multiple Classifier Systems): an ensemble of binary support vector machines, categorized by the coding/decoding scheme: OPC (one-per-class), PWC (pair-wise coupling), ECOC (error-correcting output coding). See the sketch below.
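A minimal sketch, assuming scikit-learn, of the three coding schemes named above as ensembles of binary SVMs (OPC as one-vs-rest, PWC as one-vs-one, ECOC via output codes); the iris data are only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import (OneVsRestClassifier, OneVsOneClassifier,
                                OutputCodeClassifier)
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
ensembles = [
    ('OPC (one-vs-rest)', OneVsRestClassifier(SVC(kernel='rbf'))),
    ('PWC (one-vs-one)',  OneVsOneClassifier(SVC(kernel='rbf'))),
    ('ECOC',              OutputCodeClassifier(SVC(kernel='rbf'),
                                               code_size=2, random_state=0)),
]
for name, clf in ensembles:
    print(name, clf.fit(X, y).score(X, y))   # training accuracy of each ensemble
```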
Tree-based Methods
Open Software
Demo http://www.eee.metu.edu.tr/~alatan/courses/demo/appletsvm.html http://www.csie.ntu.edu.tw/~cjlin/libsvm/#gui
Applications: Biometrics, Object Detection and Recognition, Character Recognition, Information and Image Retrieval, and other applications.
References: S. Haykin, Neural Networks: A Comprehensive Foundation, Prentice Hall, 1999, Chap. 6. C. J. C. Burges, "A Tutorial on Support Vector Machines for Pattern Recognition," Data Mining and Knowledge Discovery, Vol. 2, pp. 121-167, 1998. J. C. Platt, "Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines," Microsoft Research, Technical Report MSR-TR-98-14, 1998. C. W. Hsu and C. J. Lin, "A Comparison of Methods for Multi-Class Support Vector Machines," IEEE Transactions on Neural Networks, Vol. 13, pp. 415-425, 2002.
Books: V. N. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag, 1995. N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge University Press, 2000. B. Schölkopf and A. J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, MIT Press, 2001. Links: http://videolectures.net/ http://www.kernel-machines.org/
Thank you.