Learning Processes

What is learning?
[Diagram: Stimulus → System → New Response]

Learning paradigms
- Supervised learning: learning with a teacher
- Unsupervised learning or self-organized learning: learning without a teacher

Learning algorithms
- Error-correction learning
- Memory-based learning
- Hebbian learning
- Competitive learning
- Stochastic learning

(1) Error-correction learning
Supervised learning
[Diagram: the teacher supplies the desired response $d[n]$ for the input $x[n]$; the system output $y[n]$ is compared with $d[n]$ to form the error $e[n] = d[n] - y[n]$]
Delta rule or Widrow-Hoff rule: LMS (least mean square) algorithm
Let the system parameter be $w[n]$ and $y[n] = w^T[n]\,x[n]$. Then,
$$w^* = \arg\min_w E[n] = \arg\min_w \tfrac{1}{2} e^2[n]$$
From the LMS algorithm,
$$w[n+1] = w[n] + \eta\, e[n]\, x[n]$$
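A minimal Matlab sketch of the delta rule above; the linear system, data dimensions, noise level, and step size are illustrative assumptions, not from the notes.

% Minimal sketch of the LMS / delta rule (illustrative data).
M = 3; N = 200;
x = randn(M, N);                     % input vectors x[n], one per column
w_true = [1; -2; 0.5];               % hypothetical system to identify
d = w_true' * x + 0.01*randn(1, N);  % desired response d[n] with noise
eta = 0.05;                          % learning rate parameter
w = zeros(M, 1);                     % initial weights w[0]
for n = 1:N
    y = w' * x(:, n);                % y[n] = w'[n] x[n]
    e = d(n) - y;                    % error e[n] = d[n] - y[n]
    w = w + eta * e * x(:, n);       % w[n+1] = w[n] + eta e[n] x[n]
end

After the loop, w approximates w_true; decreasing eta slows convergence but reduces misadjustment.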
Step size parameter $\eta$ is called the "learning rate parameter"
Batch mode algorithm: LS (least square) algorithm

(2) Memory-based learning
Supervised learning
Past experiences are explicitly stored in a large memory as $\{(x_i, d_i)\}_{i=1}^N$
Given a new input $x_{test}$,
- Define a local neighborhood of $x_{test}$
- Apply a classification rule to the local neighborhood
  - Nearest neighbor rule
  - k-nearest neighbor classifier

(3) Hebbian learning
Unsupervised learning
Hebbian synapse
- If two neurons on either side of a synapse (connection) are activated simultaneously (or synchronously), then the strength of that synapse is selectively strengthened
- If two neurons on either side of a synapse (connection) are activated asynchronously, then that synapse is selectively weakened or eliminated
Properties of a Hebbian synapse
- Time-dependent mechanism
- Local mechanism
- Interactive mechanism
- Conjunctional or correlational mechanism
Mathematical models: $\Delta w_{kj}[n] = F(y_k[n], x_j[n])$
Hebb's hypothesis: $\Delta w_{kj}[n] = \eta\, y_k[n]\, x_j[n]$; exponential growth, may be saturated
Covariance hypothesis: $\Delta w_{kj}[n] = \eta\,(y_k[n] - \bar{y})(x_j[n] - \bar{x}_j)$ with time averages $\bar{y}$ and $\bar{x}_j$
Modification using a forgetting factor:
$$\Delta w_{kj}[n] = \eta\, y_k[n]\, x_j[n] - \alpha\, y_k[n]\, w_{kj}[n] = \alpha\, y_k[n]\big(c\, x_j[n] - w_{kj}[n]\big), \quad c = \eta/\alpha$$

(4) Competitive learning
Unsupervised learning
Competitive learning rule
- Competition among neurons with the same structure but with different weights
- Strength (output) of each neuron has a certain limit
- The winner(s) is one (i.e., competitive) or more neurons (i.e., cooperative) with the biggest strength, called winner-takes-all neuron(s)
Mathematical model (clustering):
$$\Delta w_j[n] = \begin{cases} \eta\,(x[n] - w_j[n]) & \text{if neuron } j \text{ wins the competition} \\ 0 & \text{otherwise} \end{cases}$$
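A minimal Matlab sketch of the two unsupervised rules above; the input, layer sizes, and rates are illustrative assumptions.

% Hebbian update with a forgetting factor (top) and the competitive
% rule (bottom); inputs and sizes are illustrative.
eta = 0.1; alpha = 0.05;
x = rand(4, 1); w = rand(4, 1);
y = w' * x;                          % postsynaptic activity
dw = eta * y * x - alpha * y * w;    % = alpha*y*(c*x - w), c = eta/alpha
w = w + dw;

% Competitive learning: only the winning neuron moves toward x.
W = rand(4, 3);                      % one weight column per neuron
[~, j] = max(W' * x);                % winner = neuron with biggest output
W(:, j) = W(:, j) + eta * (x - W(:, j));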
Clustering and Classification

(1) Clustering
Unsupervised learning
- Labeling could be too costly
- Understand the internal structure of the data distribution from clusters
- Preprocessing for classification, since features within the same cluster are similar

Clustering problem definition
Given a set of vectors $\{x_k\}_{k=1}^K$, find a set of $C$ cluster centers $\{w_i\}_{i=1}^C$ such that each $x_k$ is assigned to a cluster $w_i$ so that the average distortion
$$D = \frac{1}{K} \sum_{k=1}^{K} \sum_{i=1}^{C} I(x_k, i)\, d(x_k, w_i)$$
is minimized, where $d(x, w)$ is a distance measure and $I(x_k, i)$ is the indicator function of the assignment of $x_k$ to cluster $i$

K-means clustering algorithm
Initialization: randomize $\{w_i\}_{i=1}^C$; set $I(x_k, i) = 0$ for $1 \le k \le K$ and $1 \le i \le C$; $D[0] = 0$; $n = 1$
Repeat
- Compute $d(x_k, w_i)$ for $1 \le k \le K$ and $1 \le i \le C$
- Evaluate, for $1 \le k \le K$,
$$I(x_k, i) = \begin{cases} 1 & \text{if } d(x_k, w_i) < d(x_k, w_j) \text{ for all } j \ne i \\ 0 & \text{otherwise} \end{cases}$$
- Compute $D[n] = \dfrac{1}{K} \sum_{k=1}^{K} \sum_{i=1}^{C} I(x_k, i)\, d(x_k, w_i)$
- Update $w_i = \dfrac{1}{N_i} \sum_{k=1}^{K} I(x_k, i)\, x_k$ with $N_i = \sum_{k=1}^{K} I(x_k, i)$, for $1 \le i \le C$
- If $\dfrac{|D[n-1] - D[n]|}{D[n]} < \varepsilon$, stop
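A minimal Matlab sketch of the K-means loop above, assuming squared Euclidean distance; the two-dimensional data and the number of clusters are illustrative.

% Sketch of K-means (assumes no cluster goes empty; needs implicit
% expansion, i.e., MATLAB R2016b+ or Octave).
K = 300; C = 3;
X = [randn(2,100), randn(2,100)+3, randn(2,100)-3];  % 2 x K samples
W = X(:, randperm(K, C));            % randomize cluster centers
Dprev = inf;
for iter = 1:100
    dist = zeros(C, K);              % d(x_k, w_i) for all k, i
    for i = 1:C
        dist(i, :) = sum((X - W(:, i)).^2, 1);
    end
    [dmin, idx] = min(dist, [], 1);  % implements the indicator I(x_k, i)
    D = mean(dmin);                  % average distortion D[n]
    for i = 1:C                      % update w_i as the cluster mean
        W(:, i) = mean(X(:, idx == i), 2);
    end
    if abs(Dprev - D) / D < 1e-6, break; end
    Dprev = D;
end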
Distance measures
- Norm: $d(x, y) = \|x - y\|_p$
- Mahalanobis distance: $d(x, y) = (x - y)^T S^{-1} (x - y)$, with covariance matrix $S$
- Angle: $d(x, y) = \dfrac{x^T y}{\|x\|_2\, \|y\|_2}$
- Tanimoto coefficient: $d(x, y) = \dfrac{x^T y}{x^T x + y^T y - x^T y}$

Distortion measures
- Mean square error:
$$D = \sum_{i=1}^{C} \sum_{k=1}^{K} I(x_k, i)\, \|x_k - w_i\|_2^2 = \sum_{i=1}^{C} \frac{1}{2 N_i} \sum_{x, y \in C(i)} \|x - y\|_2^2$$
- In general, $D = \sum_{i=1}^{C} \sum_{x, y \in C(i)} d(x, y)$ or $D = \sum_{i=1}^{C} \min_{x, y \in C(i)} d(x, y)$

Scattering criteria (computed in the sketch below)
- Mean of cluster $i$: $m_i = \dfrac{1}{N_i} \sum_{k=1}^{K} I(x_k, i)\, x_k$
- Total mean: $m = \dfrac{1}{K} \sum_{k=1}^{K} x_k = \dfrac{1}{K} \sum_{i=1}^{C} N_i\, m_i$
- Scatter matrix of cluster $i$: $S_i = \sum_{k=1}^{K} I(x_k, i)\,(x_k - m_i)(x_k - m_i)^T = \sum_{x_k \in C(i)} (x_k - m_i)(x_k - m_i)^T$
- Within-cluster scatter matrix: $S_W = \sum_{i=1}^{C} S_i$
- Between-cluster scatter matrix: $S_B = \sum_{i=1}^{C} N_i\,(m_i - m)(m_i - m)^T$
- Total scatter matrix: $S_T = S_W + S_B = \sum_{k=1}^{K} (x_k - m)(x_k - m)^T$
- Note that $D = \operatorname{tr}(S_W)$ for the mean square error distortion

Distance between clusters
- $d_{\min}\{C(i), C(j)\} = \min_{x \in C(i),\, y \in C(j)} d(x, y)$
- $d_{\max}\{C(i), C(j)\} = \max_{x \in C(i),\, y \in C(j)} d(x, y)$
- $d_{avg}\{C(i), C(j)\} = \dfrac{1}{N_i N_j} \sum_{x \in C(i)} \sum_{y \in C(j)} d(x, y)$
- $d_{mean}\{C(i), C(j)\} = \|m_i - m_j\|$
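A short Matlab sketch of the scattering criteria just defined; the labeled data are illustrative assumptions.

% Scatter matrices for a labeled sample set (illustrative data).
X = [randn(2,50), randn(2,50) + 3];      % 2 x K samples
idx = [ones(1,50), 2*ones(1,50)];        % cluster index of each sample
C = max(idx);
m = mean(X, 2);                          % total mean
SW = zeros(2); SB = zeros(2);
for i = 1:C
    Xi = X(:, idx == i); Ni = size(Xi, 2);
    mi = mean(Xi, 2);                    % mean of cluster i
    SW = SW + (Xi - mi) * (Xi - mi)';    % accumulate S_i into S_W
    SB = SB + Ni * (mi - m) * (mi - m)'; % between-cluster scatter
end
ST = SW + SB;                            % total scatter matrix
D  = trace(SW);                          % distortion for squared-error d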
Hierarchical clustering
- Merge: initially, each $x_k$ is a cluster; during iterations, the nearest pair of distinct clusters are merged
- Split: initially, all $\{x_k\}_{k=1}^K$ belong to one cluster; during iterations, one cluster is split into two or more clusters if the within-cluster scattering is large

(2) Classification
Classification problem definition
Assume data samples $X = \{x_k\}_{k=1}^K$ are drawn from $M$ classes $\{C(i)\}_{i=1}^M$. Given an observation $x$, find a decision rule $g(x)$ such that the probability of correct classification $\Pr\{g(x) = C(i) \mid x \in C(i)\}$ is maximized

Nearest neighbor classifier (see the sketch below)
Assume that an already classified set of data, or mapping, is available, i.e., we have $(y_i, C(j))$ for $1 \le i \le N$ and $1 \le j \le M$. For a new sample $x$, the decision rule: choose $g(x) = C(j)$ if $y^* = \arg\min_{y_i} \|y_i - x\|$ is paired with $C(j)$

k-nearest neighbor classifier
Examine the k nearest stored samples and classify $x$ into their majority class

Statistical decision rules
- Maximum a posteriori probability (MAP) classifier
- Maximum likelihood (ML) classifier
- Neyman-Pearson (NP) detector
- Bayes detector
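A minimal Matlab sketch of the nearest neighbor and k-nearest neighbor rules above; the stored prototypes, labels, and test sample are illustrative assumptions.

% Nearest neighbor decision rule (illustrative data).
Y = [0 0; 0 1; 1 0; 1 1]';              % stored samples y_i (2 x N)
labels = [1 2 2 1];                      % class C(j) paired with each y_i
x = [0.9; 0.2];                          % new sample to classify
[~, istar] = min(sum((Y - x).^2, 1));    % y* = argmin ||y_i - x||
g = labels(istar);                       % decide g(x) = class of y*

% k-nearest neighbor: majority class among the k closest samples.
k = 3;
[~, ord] = sort(sum((Y - x).^2, 1));
g_knn = mode(labels(ord(1:k)));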
(3) Features
[Diagram: Raw Data $x$ → Feature Extractor → Feature $y$ → Classifier → Class $c$]
Feature representation
- Symbolic vs. numeric
- Higher dimensional features
Feature selection
- Selecting a subset of the available features can improve classification
- Selection of a subspace or subspace approximation
- Hidden neurons in an MLP are feature detectors
- Hidden neuron pruning is a kind of feature selection
Feature transformation
- Affine transformation $y = Ax + b$
- Rotation
- Linear filtering
- Fourier transform (DFT)
- Discrete cosine transform
- Karhunen-Loeve expansion (principal component analysis)
- Eigendecomposition
- Edge or line detection
- Other linear or nonlinear operations

(4) Data sampling
- Sample data independently from the underlying population
- Use resampling with randomization
- Use M-fold cross-validation or leave-one-out cross-validation (fold bookkeeping is sketched below)
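A Matlab sketch of the M-fold cross-validation index handling mentioned above; the sample count, fold count, and the train_fn / test_fn calls are hypothetical placeholders, not toolbox functions.

% M-fold cross-validation bookkeeping (illustrative sizes; any samples
% left over by floor(K/M) are simply unused in this sketch).
K = 100; M = 5;
perm = randperm(K);                      % resampling with randomization
foldsz = floor(K / M);
err = zeros(1, M);
for m = 1:M
    testIdx  = perm((m-1)*foldsz + 1 : m*foldsz);
    trainIdx = setdiff(perm, testIdx);
    % model  = train_fn(X(:, trainIdx), d(trainIdx));   % fit on M-1 folds
    % err(m) = test_fn(model, X(:, testIdx), d(testIdx)); % test on 1 fold
end
% cvError = mean(err);                   % average over the M folds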
Artificial Neural Network (ANN)

An (artificial) neural network is a massively parallel distributed processor made up of simple processing units, which has a natural propensity for storing experiential knowledge and making it available for use. It resembles the brain in two respects:
- Knowledge is acquired by the network from its environment through a learning process
- Interneuron connection strengths, known as synaptic weights, are used to store the acquired knowledge

Properties of artificial neural networks
- Nonlinearity
- Input-output mapping
- Adaptivity
- Evidential response
- Contextual information
- Fault tolerance
- VLSI implementability
- Uniformity of analysis and design
- Neurobiological analogy

(1) Models of a neuron
[Diagram: input signals $x_1, x_2, \ldots, x_M$ weighted by synaptic weights $w_1, \ldots, w_M$, a bias $b$, a summing junction producing $v$, and an activation function $\varphi(\cdot)$ producing the output $y$]

A neuron is an information processing unit with
- A set of synapses or connecting links, each with a weight or strength
- An adder or linear combiner
- An activation function or squashing function
$$v = \sum_{j=0}^{M} w_j x_j \quad \text{with } w_0 = b,\ x_0 = 1, \qquad y = \varphi(v)$$

Activation functions
- Threshold function or Heaviside function (McCulloch-Pitts model, all-or-none):
$$y = \varphi(v) = \begin{cases} 1 & \text{if } v \ge 0 \\ 0 & \text{if } v < 0 \end{cases}$$
or the signum function
$$y = \varphi(v) = \operatorname{sgn}(v) = \begin{cases} 1 & \text{if } v > 0 \\ 0 & \text{if } v = 0 \\ -1 & \text{if } v < 0 \end{cases}$$
- Piecewise linear function (can have a gain):
$$y = \varphi(v) = \begin{cases} 1 & \text{if } v \ge +\tfrac{1}{2} \\ v & \text{if } -\tfrac{1}{2} < v < +\tfrac{1}{2} \\ 0 & \text{if } v \le -\tfrac{1}{2} \end{cases} \quad \text{or} \quad y = \varphi(v) = \begin{cases} 1 & \text{if } v \ge +1 \\ v & \text{if } -1 < v < +1 \\ -1 & \text{if } v \le -1 \end{cases}$$
- Sigmoid function: a strictly increasing function with a graceful balance between linear and nonlinear behavior, for example the logistic function
$$y = \varphi(v) = \frac{1}{1 + \exp(-av)}$$
or the hyperbolic tangent function
$$y = \varphi(v) = \tanh(v)$$
- Stochastic model
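A minimal Matlab sketch of the neuron model and the activation functions above, with the bias folded in as $w_0 = b$, $x_0 = 1$; the weights, input, and gain are illustrative assumptions.

% One neuron: induced local field and several activation functions.
x = [1; 0.5; -0.3];                  % x_0 = 1 prepended for the bias
w = [0.2; 0.7; -1.1];                % w_0 = b
v = w' * x;                          % v = sum_j w_j x_j

y_threshold = double(v >= 0);        % Heaviside (McCulloch-Pitts)
y_piecewise = max(-1, min(1, v));    % piecewise linear on [-1, +1]
a = 2;                               % slope parameter of the sigmoid
y_logistic  = 1 / (1 + exp(-a*v));   % logistic function
y_tanh      = tanh(v);               % hyperbolic tangent function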
(2) Signal flow graph, architectural graph, and Matlab representation

Signal flow graph
[Diagram: nodes $x_0 = 1, x_1, x_2, \ldots, x_M$ with branch weights $w_0 = b, w_1, w_2, \ldots, w_M$ converging on $v$, followed by $\varphi(\cdot)$ and the output $y$]

Architectural graph
[Diagram: input nodes $x_0 = 1, x_1, x_2, \ldots, x_M$ connected to a single computation node producing $y$]

Matlab representation ($R = M$)
[Diagram: Matlab block notation for a single neuron with $R$ inputs]

(3) Network architectures
Single-layer feedforward network
[Figure]
Multilayer feedforward network
[Figure]
Recurrent network or dynamic network
[Figure]

(4) Knowledge representation
Knowledge refers to stored information or a model used by a person or machine to interpret, predict, and appropriately respond to the outside world

Information
- Prior information
- Observations or measurements: provide a pool of information from which the examples are drawn to train the ANN
Examples: a set of training data or training samples
- Labeled: supervised learning
- Unlabeled: unsupervised learning

Four rules of knowledge representation for ANN
- Rule 1: Similar inputs from similar classes should usually produce similar representations inside the network, and should therefore be classified as belonging to the same category
- Rule 2: Items to be categorized as separate classes should be given widely different representations in the network
- Rule 3: If a particular feature is important, then there should be a large number of neurons involved in the representation of that item in the network
- Rule 4: Prior information and invariances should be built into the design of a neural network, thereby simplifying the network design by not having to learn them
"In general, use your common sense"

Training and generalization
Single-Layer Perceptron

BACKGROUND MATERIALS
Unconstrained optimization techniques
- Steepest descent
- Newton's method
- Gauss-Newton method
Wiener filter
Adaptive filter using the LMS (least mean square) algorithm
LS (least square) method

(1) Perceptron
[Diagram: inputs $x_1, x_2, \ldots, x_M$ with weights $w_1, w_2, \ldots, w_M$ and bias $b$, summed to $v$ and passed through the hard limiter $\varphi(v)$ to give $y$]

The decision boundary is a hyperplane,
$$v = \sum_{i=1}^{M} w_i x_i + b = w^T x + b = 0$$
and $x$ belongs to class $C_1$ if $w^T x + b > 0$, and to class $C_2$ if $w^T x + b \le 0$
Perceptron convergence algorithm
Let $x[n] = [+1, x_1[n], \ldots, x_M[n]]^T$ and $w[n] = [b, w_1[n], \ldots, w_M[n]]^T$
- Initialization: $n = 0$ and $w[0] = 0$
- Activation: apply $x[n]$ and get $d[n]$
- Response: $y[n] = \operatorname{sgn}\!\big(w^T[n]\, x[n]\big)$
- Weight adaptation (LMS): $w[n+1] = w[n] + \eta\,(d[n] - y[n])\, x[n]$, where
$$d[n] = \begin{cases} +1, & \text{if } x[n] \text{ belongs to class } C_1 \\ -1, & \text{if } x[n] \text{ belongs to class } C_2 \end{cases}$$

(2) Perceptron as a linear classifier (Matlab)
newp, sim, init, learnp, adapt
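A Matlab sketch of the perceptron convergence algorithm above; the training set (two linearly separable Gaussian clouds), learning rate, and epoch count are illustrative assumptions.

% Perceptron convergence algorithm on illustrative separable data.
N = 100; eta = 0.1;
X = [randn(2, N/2) + 2, randn(2, N/2) - 2];
d = [ones(1, N/2), -ones(1, N/2)];   % +1 for C1, -1 for C2
Xa = [ones(1, N); X];                % x[n] = [+1, x1[n], ..., xM[n]]'
w = zeros(3, 1);                     % w[0] = 0, with w = [b w1 w2]'
for epoch = 1:50
    for n = 1:N
        y = sign(w' * Xa(:, n));                 % response y[n]
        w = w + eta * (d(n) - y) * Xa(:, n);     % weight adaptation
    end
end

The same classifier can be built with the toolbox calls listed above (newp to create it, adapt with learnp to train, sim to evaluate), though the exact calling conventions depend on the toolbox release.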
(3) Limitations of the perceptron
XOR problem: the two XOR classes are not linearly separable, so no single-layer perceptron can realize this mapping
[Figure: XOR inputs in the plane, with no single line separating the classes]
Multilayer Perceptron

Multilayer perceptron (MLP)
- Input layer
- Hidden layer(s)
- Output layer
- Feedforward
- Nonlinear activation functions
- Backpropagation learning algorithm

(1) Structure of MLP
[Figure]

(2) Backpropagation learning algorithm
Epoch: one complete presentation of the entire set of training samples

At the output layer, at iteration $n$ (i.e., the $n$th training example), at the $j$th neuron,
$$e_j[n] = d_j[n] - y_j[n]$$
$$E[n] = \frac{1}{2} \sum_{j \in C} e_j^2[n], \quad \text{where } C \text{ is the set of all neurons at the output layer}$$
$$v_j[n] = \sum_{i=0}^{M} w_{ji}[n]\, y_i[n] \quad \text{and} \quad y_j[n] = \varphi_j(v_j[n])$$

Chain rule
$$\frac{\partial E[n]}{\partial w_{ji}[n]} = \frac{\partial E[n]}{\partial e_j[n]} \frac{\partial e_j[n]}{\partial y_j[n]} \frac{\partial y_j[n]}{\partial v_j[n]} \frac{\partial v_j[n]}{\partial w_{ji}[n]} = -e_j[n]\, \varphi_j'(v_j[n])\, y_i[n]$$

LMS algorithm
$$\Delta w_{ji} = -\eta\, \frac{\partial E[n]}{\partial w_{ji}[n]} = \eta\, e_j[n]\, \varphi_j'(v_j[n])\, y_i[n] = \eta\, \delta_j[n]\, y_i[n]$$
with the local gradient
$$\delta_j[n] = -\frac{\partial E[n]}{\partial v_j[n]} = -\frac{\partial E[n]}{\partial e_j[n]} \frac{\partial e_j[n]}{\partial y_j[n]} \frac{\partial y_j[n]}{\partial v_j[n]} = e_j[n]\, \varphi_j'(v_j[n])$$

At a hidden layer, at iteration $n$ (i.e., the $n$th training example), at the $j$th neuron,
$$\delta_j[n] = -\frac{\partial E[n]}{\partial v_j[n]} = -\frac{\partial E[n]}{\partial y_j[n]} \frac{\partial y_j[n]}{\partial v_j[n]} = -\frac{\partial E[n]}{\partial y_j[n]}\, \varphi_j'(v_j[n])$$
From $E[n] = \frac{1}{2} \sum_{k \in C} e_k^2[n]$,
$$\frac{\partial E[n]}{\partial y_j[n]} = \sum_k e_k[n] \frac{\partial e_k[n]}{\partial y_j[n]} = \sum_k e_k[n] \frac{\partial e_k[n]}{\partial v_k[n]} \frac{\partial v_k[n]}{\partial y_j[n]}$$
Since $e_k[n] = d_k[n] - y_k[n] = d_k[n] - \varphi_k(v_k[n])$ for an output layer neuron $k$,
$$\frac{\partial e_k[n]}{\partial v_k[n]} = -\varphi_k'(v_k[n])$$
Since $v_k[n] = \sum_{j=0}^{M} w_{kj}[n]\, y_j[n]$,
$$\frac{\partial v_k[n]}{\partial y_j[n]} = w_{kj}[n]$$
Therefore,
$$\frac{\partial E[n]}{\partial y_j[n]} = -\sum_k e_k[n]\, \varphi_k'(v_k[n])\, w_{kj}[n] = -\sum_k \delta_k[n]\, w_{kj}[n]$$
Finally, at the $j$th neuron of the hidden layer,
$$\delta_j[n] = \varphi_j'(v_j[n]) \sum_k \delta_k[n]\, w_{kj}[n]$$
LMS algorithm: $\Delta w_{ji} = \eta\, \delta_j[n]\, y_i[n]$

Activation functions
Logistic function
$$y_j[n] = \varphi_j(v_j[n]) = \frac{1}{1 + \exp(-a\, v_j[n])}, \quad a > 0 \text{ and } -\infty < v_j[n] < \infty$$
$$\varphi_j'(v_j[n]) = a\, y_j[n]\,\big(1 - y_j[n]\big), \quad \text{and}$$
$$\delta_j[n] = \begin{cases} a\,\big(d_j[n] - y_j[n]\big)\, y_j[n]\,\big(1 - y_j[n]\big) & \text{for the output layer} \\ a\, y_j[n]\,\big(1 - y_j[n]\big) \sum_k \delta_k[n]\, w_{kj}[n] & \text{for a hidden layer} \end{cases}$$

Hyperbolic tangent function
$$y_j[n] = \varphi_j(v_j[n]) = a \tanh\!\big(b\, v_j[n]\big), \quad a, b > 0 \text{ and } -\infty < v_j[n] < \infty$$
$$\varphi_j'(v_j[n]) = \frac{b}{a}\,\big(a - y_j[n]\big)\big(a + y_j[n]\big), \quad \text{and}$$
$$\delta_j[n] = \begin{cases} \dfrac{b}{a}\,\big(d_j[n] - y_j[n]\big)\big(a - y_j[n]\big)\big(a + y_j[n]\big) & \text{for the output layer} \\ \dfrac{b}{a}\,\big(a - y_j[n]\big)\big(a + y_j[n]\big) \sum_k \delta_k[n]\, w_{kj}[n] & \text{for a hidden layer} \end{cases}$$

Momentum
$$\Delta w_{ji}[n] = \alpha\, \Delta w_{ji}[n-1] + \eta\, \delta_j[n]\, y_i[n] \quad \text{(stabilizing effect)}$$
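A Matlab sketch of sequential backpropagation with momentum for one hidden layer, using the logistic function with $a = 1$; the network sizes, XOR toy task, and rates are illustrative assumptions.

% Sequential backpropagation, one hidden layer, logistic activations.
M = 2; H = 4; eta = 0.5; alpha = 0.9;
X = [0 0 1 1; 0 1 0 1]; D = [0 1 1 0];      % XOR as a toy task
W1 = 0.5*randn(H, M+1); W2 = 0.5*randn(1, H+1);
dW1 = zeros(size(W1)); dW2 = zeros(size(W2));
phi = @(v) 1 ./ (1 + exp(-v));              % logistic function, a = 1
for epoch = 1:5000
    for n = randperm(4)                     % randomize each epoch
        x  = [1; X(:, n)];                  % bias input x_0 = 1
        y1 = phi(W1 * x);                   % hidden outputs
        y2 = phi(W2 * [1; y1]);             % network output
        e  = D(n) - y2;                     % e[n] = d[n] - y[n]
        d2 = e .* y2 .* (1 - y2);           % output local gradient
        d1 = y1 .* (1 - y1) .* (W2(:, 2:end)' * d2);  % hidden gradients
        dW2 = alpha*dW2 + eta * d2 * [1; y1]';        % with momentum
        dW1 = alpha*dW1 + eta * d1 * x';
        W2 = W2 + dW2; W1 = W1 + dW1;
    end
end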
Modes of training
- A set of training examples $\{(x[i], d[i])\}_{i=1}^N$ constitutes one epoch
- Randomize the order of the samples at each epoch
- Sequential mode (on-line, pattern, or stochastic mode): update the weights sample by sample
- Batch mode (off-line): update the weights at the end of each epoch,
$$E_{av} = \frac{1}{N} \sum_{n=1}^{N} E[n] \quad \text{and} \quad \Delta w_{ji} = -\eta\, \frac{\partial E_{av}}{\partial w_{ji}} = -\frac{\eta}{N} \sum_{n=1}^{N} e_j[n] \frac{\partial e_j[n]}{\partial w_{ji}}$$
Stopping criteria
- Small norm of the gradient vector
- Small absolute value of the change in the average squared error per epoch

(3) Heuristics
- When the training data set is large and redundant, the sequential mode is usually faster and better
- When the training data set is not large, there are several batch mode algorithms that are faster
- Information content of a training example
  - Use an example that results in the largest training error
  - Use an example that is different from all those previously used
  - The distribution of training examples should not be distorted
  - Avoid any outlier in the training data set
- Activation function
- Target values
- Input normalization
- Initialization
- Use any prior information
- Learning rate and momentum
  - Every adjustable network parameter should have its own individual learning rate parameter
  - Every learning rate parameter should be allowed to vary from one iteration to the next
  - When the derivative of the cost function w.r.t. a synaptic weight has the same algebraic sign for several consecutive iterations, the corresponding learning rate parameter should be increased
  - When the algebraic sign of the derivative of the cost function w.r.t. a synaptic weight alternates for several consecutive iterations, the corresponding learning rate parameter should be decreased

(4) Output representation and decision rule
Assume the ANN is already trained, and consider an M-class classification problem. Let $x_j$ denote the $j$th sample (prototype) to be classified; the ANN produces the output
$$y_j = [y_{1,j}, \ldots, y_{M,j}]^T = [F_1(x_j), \ldots, F_M(x_j)]^T = F(x_j)$$
Note that the function $F$ depends on the training data set $\{(x[i], d[i])\}_{i=1}^N$
What is the optimal decision rule for classifying the M outputs of the ANN?
- Assigning a single class out of M distinct classes: $x_j \in C_k$ if $F_k(x_j) > F_l(x_j)$ for all $l \ne k$
- Assigning multiple classes out of M distinct classes: $x_j \in C_k$ if $F_k(x_j) > \text{threshold}$ (e.g., 0.5)

(5) Generalization
A network is said to generalize well when the input-output mapping is correct for test data never used in creating or training the network
- Overtraining or overfitting problem
- Bias-variance trade-off
Factors influencing generalization
- Size of the training set and how representative it is of the environment of interest
- Architecture of the ANN
- Physical complexity of the problem at hand

(6) Cross-validation
The backpropagation learning algorithm encodes an input-output mapping into the synaptic weights and thresholds of an MLP
For better generalization, partition the training set into two subsets:
- Estimation subset, used to train or select the model
- Validation subset, used to test or validate the model
Early stopping rule: stop training when the error on the validation subset starts increasing

(7) Network growing and pruning techniques
- Network growing
- Network pruning

(8) Supervised learning viewed as an optimization problem
- Conjugate gradient method
- Quasi-Newton method

(9) Matlab experiments (a usage sketch follows)
- newff, init, sim
- adapt: learngd, learngdm
- train: traingd, traingdm, traingda, traingdx, trainrp, traincgf, traincgb, trainscg, trainbfg, trainoss, trainlm, trainbr
- premnmx, postmnmx, tramnmx, prestd, poststd, trastd
- prepca, trapca
- postreg
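A sketch using some of the older Neural Network Toolbox functions listed in (9); the exact signatures changed across toolbox releases (this follows the pre-R2010 conventions), and the data and network sizes are illustrative assumptions.

% Two-layer MLP trained with traingdx on a toy problem.
P = rand(2, 200);                    % inputs, one sample per column
T = double(sum(P, 1) > 1);           % targets for a toy boundary
[Pn, minp, maxp] = premnmx(P);       % normalize inputs to [-1, 1]
net = newff(minmax(Pn), [5 1], {'tansig', 'logsig'}, 'traingdx');
net.trainParam.epochs = 300;
net = train(net, Pn, T);             % batch training
Y = sim(net, Pn);                    % network response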