Nonlinear Classifiers I


Nonlinear Classifiers: Introduction


• Classifiers

• Supervised Classifiers

• Linear Classifiers

• Perceptron

• Least Squares Methods

• Linear Support Vector Machine

• Nonlinear Classifiers

• Part I: Multi Layer Neural Networks, Convolutional Neural Network

• Part II: Nonlinear Support Vector Machine

• Decision Trees

• Unsupervised Classifiers


Nonlinear Classifiers: Introduction

• An example: Suppose we’re in 1-dimension. What would a linear SVM do with this data?

[Figure: one-dimensional data set plotted along an axis through x=0]

• Not a big surprise:

[Figure: the same data with the positive “plane” and the negative “plane” marked on the axis around x=0]

• Harder 1-dimensional dataset. What can be done about this?

[Figure: harder one-dimensional data set plotted along an axis through x=0]

• Use a non-linear basis function:

$$z_k = (x_k, x_k^2)$$

[Figure: the data mapped by the non-linear basis function into the (x, x²) plane]

Nonlinear Classifiers: Introduction

• Linear classifiers are simple and computationally efficient.

• However for nonlinearly separable features, they might lead to very inaccurate decisions.

• Then we may trade simplicity and efficiency for accuracy using a nonlinear classifier.
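The effect of the basis function z_k = (x_k, x_k²) can be illustrated with a minimal sketch. The data points, the weights and the helper names (`basis`, `linear_classify`) are hypothetical and chosen only for illustration; the point is that a single line in z-space separates data that no single threshold on x can separate.

```python
def basis(x):
    """Map a scalar x to z = (x, x^2), the non-linear basis function."""
    return (x, x * x)

# Hypothetical 1-D data: class +1 lies far from x = 0, class -1 lies close to it.
xs = [-3.0, -2.5, 2.5, 3.0, -1.0, -0.5, 0.5, 1.0]
labels = [+1, +1, +1, +1, -1, -1, -1, -1]

def linear_classify(z, w, w0):
    """Linear decision in z-space: sign of w^T z + w0."""
    return 1 if w[0] * z[0] + w[1] * z[1] + w0 > 0 else -1

# Assumed separating line in z-space: z2 = 3, i.e. x^2 = 3.
w, w0 = (0.0, 1.0), -3.0
predictions = [linear_classify(basis(x), w, w0) for x in xs]
print(predictions == labels)
```

The classifier stays linear in z; all of the nonlinearity sits in the mapping from x to z.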


The Perceptron

The Perceptron is a learning algorithm that adjusts the weights $w_i$ of its weight vector $w$ such that for all examples $x_i$:

$$w^T x_i > 0 \;\; \forall x_i \in \omega_1, \qquad w^T x_i < 0 \;\; \forall x_i \in \omega_2$$

Here, the intercept is included in $w$:

$$w = [w_1, \dots, w_l, w_0]^T, \qquad x = [x_1, \dots, x_l, 1]^T$$

It is assumed that the problem is linearly separable. Hence such a vector $w$ exists. The decision hyperplane is

$$g(x) = w^T x = 0$$

The Perceptron

• w must minimize the classification error.

• w is found using an optimization algorithm.

General steps towards a classifier:

1. Define a cost function to be minimized.

2. Choose an algorithm to minimize it.

3. The minimum corresponds to a solution.


The Perceptron Cost Function

Goal:

$$w^T x > 0 \;\; \forall x \in \omega_1, \qquad w^T x < 0 \;\; \forall x \in \omega_2$$

Cost function:

$$J(w) = \sum_{x \in Y} \delta_x w^T x$$

Y: subset of the training vectors which are misclassified by the hyperplane defined by w.

$\delta_x = -1$ if $x \in \omega_1$ but is classified in $\omega_2$

$\delta_x = +1$ if $x \in \omega_2$ but is classified in $\omega_1$

$$J(w) \geq 0 \;\; \forall w, \qquad J(w) = 0 \;\text{ if }\; Y = \emptyset$$

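The Perceptron Algorithm

J(w) is continuous and piecewise linear: it is linear in w as long as Y is constant, and its slope changes whenever Y changes.

[Figure: J(w) plotted over w₁, with the regions where Y is constant and the points where Y changes marked]

J(w) is minimized by gradient descent (update w by taking steps that are proportional to the negative of the gradient of the cost function J(w)):

$$w(t+1) = w(t) - \rho_t \left. \frac{\partial J(w)}{\partial w} \right|_{w = w(t)}$$

where $\rho_t$ is the step size at iteration t. With

$$J(w) = \sum_{x \in Y} \delta_x w^T x$$

the gradient is

$$\frac{\partial J(w)}{\partial w} = \frac{\partial}{\partial w} \left( \sum_{x \in Y} \delta_x w^T x \right) = \sum_{x \in Y} \delta_x x$$

so the update becomes

$$w(t+1) = w(t) - \rho_t \sum_{x \in Y} \delta_x x$$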
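The update rule w(t+1) = w(t) − ρ Σ δ_x x can be sketched as follows. This is only an illustration under stated assumptions: the four training vectors are hypothetical, the step size is held constant at ρ = 0.5, samples lying exactly on the hyperplane are counted as misclassified, and the function names are not from the slides.

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def misclassified(w, samples):
    """Return the set Y as a list of (delta_x, x).
    samples: list of (x, cls), x augmented with a trailing 1, cls in {1, 2}.
    Assumption: points exactly on the hyperplane count as misclassified."""
    Y = []
    for x, cls in samples:
        s = dot(w, x)
        if cls == 1 and s <= 0:
            Y.append((-1, x))      # x in omega_1 but classified in omega_2
        elif cls == 2 and s >= 0:
            Y.append((+1, x))      # x in omega_2 but classified in omega_1
    return Y

def cost(w, samples):
    """Perceptron cost J(w) = sum over Y of delta_x * w^T x."""
    return sum(d * dot(w, x) for d, x in misclassified(w, samples))

def train(samples, rho=0.5, max_iter=1000):
    """Gradient descent on J(w) with a constant step size rho."""
    w = [0.0] * len(samples[0][0])
    for _ in range(max_iter):
        Y = misclassified(w, samples)
        if not Y:                  # J(w) = 0: every sample is on its correct side
            break
        grad = [sum(d * x[k] for d, x in Y) for k in range(len(w))]
        w = [wk - rho * gk for wk, gk in zip(w, grad)]
    return w

# Hypothetical linearly separable data, augmented with a constant 1.
samples = [((2.0, 2.0, 1.0), 1), ((3.0, 1.0, 1.0), 1),
           ((0.0, 0.0, 1.0), 2), ((-1.0, 1.0, 1.0), 2)]
w = train(samples)
print(w, cost(w, samples))
```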








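The Perceptron as a Neural Network

Once the perceptron is trained, it is used to perform the classification:

$$\text{if } w^T x > 0 \text{ assign } x \text{ to } \omega_1, \qquad \text{if } w^T x < 0 \text{ assign } x \text{ to } \omega_2$$

The perceptron is the simplest form of a “Neural Network”:

[Figure: perceptron with inputs weighted by the synaptic weights, followed by the activation function f, which maps $w^T x$ to the outputs 1 and -1]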

Nonlinear Classifiers: Agenda


Part I: Nonlinear Classifiers Multi Layer Neural Networks

• XOR problem

• Two-Layer Perceptron

• Backpropagation

• Choice of the network size

• Model selection techniques

• Applications: XOR, ZIP Code, OCR problem

• Demo: SNNS, BPN


The XOR problem

x₁   x₂   XOR   Class
0    0    0     B
0    1    1     A
1    0    1     A
1    1    0     B

• There is no single line (hyperplane) that separates class A from class B. In contrast, the AND and OR operations are linearly separable problems.

• For the XOR problem, draw two lines, instead of one.

• Then class B is outside the shaded area and class A is inside.

• We call it a two-step design.

The Two-Layer Perceptron

• Step 1: Draw two lines (hyperplanes)

$$g_1(x) = 0, \qquad g_2(x) = 0$$

Each of them is realized by a perceptron. The outputs of the perceptrons will be

$$y_i = f(g_i(x)) \in \{0, 1\}, \qquad i = 1, 2$$

depending on the value of x (f is the activation function).

• Step 2: Find the ‘position’ of x w.r.t. both lines, based on the values of y₁, y₂.

• Equivalently:

1. The computations of the first step perform a mapping $x \to y = [y_1, y_2]^T$.

2. The decision is then performed on the transformed data y.

1st step                      2nd step
x₁   x₂   y₁     y₂     Class
0    0    0(-)   0(-)   B(0)
0    1    1(+)   0(-)   A(1)
1    0    1(+)   0(-)   A(1)
1    1    1(+)   1(+)   B(0)

The Two-Layer Perceptron

• This decision can be performed via a second line $g(y) = 0$, which can also be realized by a perceptron.

• The computations of the first step perform a mapping that transforms the nonlinearly separable problem to a linearly separable one.

The Two-Layer Perceptron

• The architecture

[Figure: two-layer perceptron with two inputs, two hidden nodes and one output node]

This is known as the two-layer perceptron with one hidden and one output layer. The activation functions are:

$$f(x) = \begin{cases} 1 & x > 0 \\ 0 & x < 0 \end{cases}$$

• The nodes (neurons) of the figure realize the following lines (hyperplanes):

$$g_1(x) = x_1 + x_2 - \tfrac{1}{2} = 0$$

$$g_2(x) = x_1 + x_2 - \tfrac{3}{2} = 0$$

$$g_3(y) = y_1 - 2 y_2 - \tfrac{1}{2} = 0$$

• Classification capabilities:

All possible mappings performed by the first layer are onto the vertices of the unit side square, i.e., (0, 0), (0, 1), (1, 0), (1, 1).
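The three nodes g₁, g₂ and g₃ can be checked against the XOR truth table with a short sketch. The function names are illustrative; the weights are the ones of the three lines, and the step activation returns 1 for a positive argument and 0 otherwise.

```python
def step(v):
    """Step activation: 1 for a positive argument, 0 otherwise."""
    return 1 if v > 0 else 0

def two_layer_xor(x1, x2):
    # Hidden layer: position of x with respect to the two lines.
    y1 = step(x1 + x2 - 0.5)        # g1(x) = x1 + x2 - 1/2
    y2 = step(x1 + x2 - 1.5)        # g2(x) = x1 + x2 - 3/2
    # Output layer: one line in the transformed (y1, y2) space.
    return step(y1 - 2 * y2 - 0.5)  # g3(y) = y1 - 2*y2 - 1/2

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, "->", "A" if two_layer_xor(x1, x2) else "B")
```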

Classification capabilities

• The more general case

$$x \in R^l, \qquad x \to y = [y_1, \dots, y_p]^T \in R^p, \qquad y_i = f(g_i(x)) \in \{0, 1\}, \quad i = 1, 2, \dots, p$$

$$g_i(x) = \sum_{k=1}^{l} w_{ik} x_k + w_{i0} = w_i^T x + w_{i0} = 0, \qquad w_i, x \in R^l$$

$$g_j(y) = \sum_{k=1}^{p} w_{jk} y_k + w_{j0} = w_j^T y + w_{j0} = 0, \qquad w_j, y \in R^p$$

The $g_i(x)$’s perform a mapping of a vector x onto y, representing the vertices of the unit side hypercube $H_p$.

The mapping is achieved with p nodes, each realizing a hyperplane. The output of each of these nodes is 0 or 1 depending on the relative position of x w.r.t. the hyperplane.

Classification capabilities

Intersections of these hyperplanes form regions in the l-dimensional space. Each region corresponds to a vertex of the unit hypercube $H_p$.

For example, the 001 vertex corresponds to the region which is located

• to the (-) side of g₁(x) = 0
• to the (-) side of g₂(x) = 0
• to the (+) side of g₃(x) = 0

Classification capabilities

The output node realizes a hyperplane in the y space that separates some of the vertices from the others. Thus, the two-layer perceptron has the capability to classify vectors into classes that consist of unions of polyhedral regions.

But not ANY union. It depends on the relative position of the corresponding vertices.

The Three-Layer Perceptron

• This is capable of classifying vectors into classes consisting of ANY union of polyhedral regions.

• The idea is similar to the XOR problem. It realizes more than one plane in the y space.

The reasoning:

• For each vertex corresponding to class A, construct a hyperplane which leaves THIS vertex on one side (+) and ALL the others on the other side (-).

• The output neuron realizes an OR gate.

Overall:

The first layer of the network forms the hyperplanes, the second layer forms the regions and the output nodes form the classes.
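The reasoning above can be sketched in a few lines. The construction of the second-layer hyperplane for a vertex (weight +1 where the vertex has a 1, weight -1 where it has a 0, threshold set just below the number of ones) is one possible choice and an assumption of this sketch, as are the two first-layer hyperplanes and the function names.

```python
def step(v):
    return 1 if v > 0 else 0

def first_layer(x, hyperplanes):
    """hyperplanes: list of (w, w0); returns the hypercube vertex y for x."""
    return [step(sum(wk * xk for wk, xk in zip(w, x)) + w0) for w, w0 in hyperplanes]

def vertex_node(y, vertex):
    """Second-layer node: fires (+) only if y equals the given vertex."""
    s = sum(yk if vk == 1 else -yk for yk, vk in zip(y, vertex))
    return step(s - sum(vertex) + 0.5)

def three_layer(x, hyperplanes, class_a_vertices):
    y = first_layer(x, hyperplanes)                     # layer 1: hyperplanes
    z = [vertex_node(y, v) for v in class_a_vertices]   # layer 2: regions
    return step(sum(z) - 0.5)                           # output: OR gate

# Hypothetical example: two hyperplanes x1 = 0.5 and x2 = 0.5 split the plane
# into four regions; class A is the union of the regions of vertices 01 and 10.
planes = [((1.0, 0.0), -0.5), ((0.0, 1.0), -0.5)]
class_a = [(0, 1), (1, 0)]
for x in [(0.2, 0.9), (0.9, 0.1), (0.1, 0.1), (0.8, 0.7)]:
    print(x, "->", "A" if three_layer(x, planes, class_a) else "B")
```

At the chosen vertex the weighted sum exceeds the threshold by 1/2, and at every other vertex it falls short by at least 1/2, so each second-layer node isolates exactly one region.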

Multi-Layer Neural Networks

Many parameters:

• Number of layers

• Number of nodes in each layer

• Number of connections of each node with the previous layer (e.g. fully connected)

The Multi-Layer Neural Network

[Figure: node k of layer r-1 connected to node j of layer r]

$y_k^{r-1}(i)$: output of the k-th node at layer r-1 for the i-th training pair

$v_j^r(i)$: argument of f(.) at the j-th node of layer r for the i-th training pair

$$v_j^r(i) = \sum_{k=1}^{k_{r-1}} w_{jk}^r \, y_k^{r-1}(i) + w_{j0}^r = \sum_{k=0}^{k_{r-1}} w_{jk}^r \, y_k^{r-1}(i), \qquad \text{with } y_0^{r-1}(i) \equiv 1$$

$$y_j^r(i) = f\big(v_j^r(i)\big) = f\big(w_j^{rT} y^{r-1}(i)\big)$$
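The layer-by-layer computation v_j^r = Σ_k w_jk^r y_k^(r-1) with y_0^(r-1) = 1, followed by y_j^r = f(v_j^r), can be sketched as below. The nested-list weight layout (`weights[r][j][k]`, with index 0 holding the bias w_j0) and the function names are assumptions made for this illustration.

```python
import math

def logistic(v, a=1.0):
    return 1.0 / (1.0 + math.exp(-a * v))

def forward(x, weights, f=logistic):
    """Forward pass through a multilayer network.
    weights[r][j] = [w_j0, w_j1, ..., w_jk] for node j of layer r+1.
    Returns the outputs of all layers; the first entry is the input itself."""
    y = list(x)
    outputs = [y]
    for layer in weights:
        y_aug = [1.0] + y                                   # y_0 = 1 carries the bias
        v = [sum(w * yk for w, yk in zip(w_j, y_aug)) for w_j in layer]
        y = [f(v_j) for v_j in v]
        outputs.append(y)
    return outputs

# Usage with the XOR network weights and a step activation.
xor_weights = [[[-0.5, 1.0, 1.0], [-1.5, 1.0, 1.0]],   # hidden layer: g1, g2
               [[-0.5, 1.0, -2.0]]]                     # output layer: g3
step = lambda v: 1.0 if v > 0 else 0.0
print(forward([1.0, 0.0], xor_weights, f=step)[-1])
```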

Multi-Layer Neural Networks

Designing Multilayer Networks

– One strategy could be to adopt the above rationale and construct a structure that classifies correctly all the training patterns (usually impossible).

– Second strategy: Start with a (large) network structure and compute the w’s, often called ‘synaptic weights’, to optimize a cost function.

– Back Propagation is an algorithmic procedure that computes the synaptic weights iteratively, so that an adopted cost function is minimized (optimized).

Nonlinear Classifiers: Agenda

Part I: Nonlinear Classifiers Multi Layer Neural Networks

• XOR problem

• Two-Layer Perceptron

• Backpropagation algorithm to train multilayer perceptrons

• Choice of the network size

• Model selection techniques

• Applications: XOR, ZIP Code, OCR problem

• Demo: SNNS, BPN

The Backpropagation Algorithm (BP)

The Steps:

1. Adopt an optimizing cost function J, summed over the per-pattern costs E(i), e.g.,

• Least Squares Error

• Relative Entropy

between desired responses and actual responses of the network for the available training patterns.

That is, from now on we have to live with errors resulting from structure and cost function. We only try to minimize them, using certain criteria.

[Figure: connection between a node of layer r-1 and node j of layer r]

The Backpropagation Algorithm

The Steps:

2. Adopt an algorithmic procedure for the optimization of the cost function with respect to the weights w, e.g.:

– Gradient descent

– Newton’s algorithm

– Conjugate gradient

The Backpropagation Algorithm

The Steps:

3. The task is a nonlinear optimization, e.g. with gradient descent:

$$w_j^r(\text{new}) = w_j^r(\text{old}) + \Delta w_j^r$$

$$\Delta w_j^r = -\mu \frac{\partial J}{\partial w_j^r}, \qquad J = \sum_{i=1}^{N} E(i)$$

where $\mu$ is the learning rate.

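BackProp: Step 3 nonlinear optimization

Detail: Computation of the gradients.

$$w_j^r(\text{new}) = w_j^r(\text{old}) + \Delta w_j^r, \qquad \Delta w_j^r = -\mu \frac{\partial J}{\partial w_j^r}$$

with

$$J = \sum_{i=1}^{N} E(i) \qquad \text{and} \qquad v_j^r(i) = \sum_{k=1}^{k_{r-1}} w_{jk}^r \, y_k^{r-1}(i) + w_{j0}^r$$

By the chain rule:

$$\frac{\partial E(i)}{\partial w_j^r} = \frac{\partial E(i)}{\partial v_j^r(i)} \, \frac{\partial v_j^r(i)}{\partial w_j^r}$$

Define

$$\delta_j^r(i) := \frac{\partial E(i)}{\partial v_j^r(i)}$$

and note that

$$\frac{\partial v_j^r(i)}{\partial w_j^r} = \left[ \frac{\partial v_j^r(i)}{\partial w_{j0}^r}, \dots, \frac{\partial v_j^r(i)}{\partial w_{j k_{r-1}}^r} \right]^T = y^{r-1}(i) := \left[ 1, \, y_1^{r-1}(i), \dots, y_{k_{r-1}}^{r-1}(i) \right]^T$$

Hence

$$\Delta w_j^r = -\mu \sum_{i=1}^{N} \delta_j^r(i) \, y^{r-1}(i)$$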

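BackProp: Step 3 nonlinear optimization

Detail: Computation of $\delta_j^r(i) = \partial E(i) / \partial v_j^r(i)$ for Least Squares

$$E(i) = \frac{1}{2} \sum_{m=1}^{k_L} e_m^2(i) = \frac{1}{2} \sum_{m=1}^{k_L} \big( f(v_m^L(i)) - y_m(i) \big)^2$$

Case r = L (last layer):

$$\delta_j^L(i) = e_j(i) \, f'\big(v_j^L(i)\big)$$

Case r < L:

$$\frac{\partial E(i)}{\partial v_j^{r-1}(i)} = \sum_{k=1}^{k_r} \frac{\partial E(i)}{\partial v_k^r(i)} \, \frac{\partial v_k^r(i)}{\partial v_j^{r-1}(i)}
\quad \Rightarrow \quad
\delta_j^{r-1}(i) = \sum_{k=1}^{k_r} \delta_k^r(i) \, \frac{\partial v_k^r(i)}{\partial v_j^{r-1}(i)}$$

$$\frac{\partial v_k^r(i)}{\partial v_j^{r-1}(i)} = \frac{\partial}{\partial v_j^{r-1}(i)} \left[ \sum_{m=0}^{k_{r-1}} w_{km}^r \, y_m^{r-1}(i) \right] = w_{kj}^r \, f'\big(v_j^{r-1}(i)\big), \qquad \text{with } y_m^{r-1}(i) = f\big(v_m^{r-1}(i)\big)$$

$$\delta_j^{r-1}(i) = \left[ \sum_{k=1}^{k_r} \delta_k^r(i) \, w_{kj}^r \right] f'\big(v_j^{r-1}(i)\big)$$

Defining $e_j^{r-1}(i) := \sum_{k=1}^{k_r} \delta_k^r(i) \, w_{kj}^r$, this has the same form as the last-layer case:

$$\delta_j^{r-1}(i) = e_j^{r-1}(i) \, f'\big(v_j^{r-1}(i)\big)$$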

BackProp: Step 3 summary

3. The task is a nonlinear optimization, e.g. gradient descent:

$$w_j^r(\text{new}) = w_j^r(\text{old}) + \Delta w_j^r, \qquad \Delta w_j^r = -\mu \frac{\partial J}{\partial w_j^r}$$

with the following update rules:

$$\Delta w_j^r = -\mu \sum_{i=1}^{N} \delta_j^r(i) \, y^{r-1}(i)$$

$$\delta_j^L(i) = e_j(i) \, f'\big(v_j^L(i)\big)$$

$$\delta_j^{r-1}(i) = \left[ \sum_{k=1}^{k_r} \delta_k^r(i) \, w_{kj}^r \right] f'\big(v_j^{r-1}(i)\big)$$

Error $e_j(i)$: difference between the actual and the desired response of the j-th output neuron:

$$e_j(i) = y_j^L(i) - y_j(i)$$

where $y_j^L(i) = \hat{y}_j(i)$ is the output and $y_j(i)$ is the target.

The Backpropagation Algorithm


The Procedure:

1. Initialization:

Initialize unknown weights randomly with small values.

2. Forward computations:

For each of the training examples compute the output of all neurons of all layers. Compute the cost function for the current estimate of weights.

3. Backward computations:

Compute the gradient terms backwards, starting with the weights of the last (e.g. 3rd) layer and then moving towards the first.

4. Update:

Update the weights.

5. Termination:

Repeat steps 2 to 4 until a termination criterion is met.
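The five steps of the procedure can be sketched in plain Python for the least squares cost and the logistic activation (a = 1). This is a minimal batch-mode illustration, not a reference implementation: the 2-2-1 network, the XOR training set, the learning rate μ = 0.5, the fixed number of epochs used as termination criterion and all function names are assumptions.

```python
import math
import random

def f(v):
    """Logistic activation with a = 1; its derivative is f(v) * (1 - f(v))."""
    return 1.0 / (1.0 + math.exp(-v))

def forward(x, W):
    """Outputs of all layers; ys[0] is the input, ys[-1] the network output."""
    ys = [list(x)]
    for layer in W:
        y_aug = [1.0] + ys[-1]                  # y_0 = 1 carries the bias w_j0
        ys.append([f(sum(w * y for w, y in zip(w_j, y_aug))) for w_j in layer])
    return ys

def cost(W, data):
    """Least squares cost J = sum_i 1/2 sum_m (output_m(i) - target_m(i))^2."""
    return sum(0.5 * sum((o - t) ** 2 for o, t in zip(forward(x, W)[-1], target))
               for x, target in data)

def gradients(W, data):
    """Batch-mode gradient dJ/dw for every weight, computed backwards."""
    G = [[[0.0] * len(w_j) for w_j in layer] for layer in W]
    for x, target in data:
        ys = forward(x, W)
        # Last layer: delta_j^L = e_j * f'(v_j^L), with e_j = output - target.
        delta = [(o - t) * o * (1.0 - o) for o, t in zip(ys[-1], target)]
        for r in range(len(W) - 1, -1, -1):
            y_prev = [1.0] + ys[r]
            for j, d in enumerate(delta):
                for k, y in enumerate(y_prev):
                    G[r][j][k] += d * y          # delta_j^r * y^{r-1}
            if r > 0:
                # delta_j^{r-1} = (sum_k delta_k^r * w_kj^r) * f'(v_j^{r-1})
                delta = [sum(d * W[r][k][j + 1] for k, d in enumerate(delta))
                         * ys[r][j] * (1.0 - ys[r][j]) for j in range(len(ys[r]))]
    return G

def train(W, data, mu=0.5, epochs=2000):
    """Steps 2 to 4, repeated for a fixed number of epochs (assumed criterion)."""
    for _ in range(epochs):
        G = gradients(W, data)
        for r, layer in enumerate(W):
            for j, w_j in enumerate(layer):
                for k in range(len(w_j)):
                    w_j[k] -= mu * G[r][j][k]
    return W

# Step 1: initialize the unknown weights randomly with small values.
random.seed(0)
sizes = [2, 2, 1]
W = [[[random.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_out)]
     for n_in, n_out in zip(sizes, sizes[1:])]
xor_data = [([0.0, 0.0], [0.0]), ([0.0, 1.0], [1.0]),
            ([1.0, 0.0], [1.0]), ([1.0, 1.0], [0.0])]
J_before = cost(W, xor_data)
train(W, xor_data)
J_after = cost(W, xor_data)
print(J_before, J_after)
```

Whether such a small network actually reaches a solution of the XOR problem depends on the initialization, since gradient descent may end in a local minimum; the sketch only shows the mechanics of the updates.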


The Backpropagation Algorithm

• In a large number of optimizing procedures, computation of derivatives is involved. Hence, discontinuous activation functions pose a problem, i.e.,

$$f(x) = \begin{cases} 1 & x > 0 \\ 0 & x < 0 \end{cases}$$

• There is always an escape path! E.g. the logistic function:

$$f(x) = \frac{1}{1 + \exp(-a x)}, \qquad f'(x) = a \, f(x) \big( 1 - f(x) \big)$$

Other differentiable functions are also possible and in some cases more desirable.
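For the logistic function f(x) = 1 / (1 + exp(−ax)), the derivative has the closed form f'(x) = a f(x)(1 − f(x)). A small sketch can compare this expression with a central finite difference; the function names and the test points are chosen only for illustration.

```python
import math

def logistic(x, a=1.0):
    return 1.0 / (1.0 + math.exp(-a * x))

def logistic_derivative(x, a=1.0):
    """Closed form: f'(x) = a * f(x) * (1 - f(x))."""
    fx = logistic(x, a)
    return a * fx * (1.0 - fx)

def numeric_derivative(x, a=1.0, h=1e-6):
    """Central finite difference, used only as a cross-check."""
    return (logistic(x + h, a) - logistic(x - h, a)) / (2.0 * h)

for x in (-2.0, 0.0, 1.5):
    print(x, logistic_derivative(x, a=2.0), numeric_derivative(x, a=2.0))
```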

The Backpropagation Algorithm

Two major philosophies:

• Batch mode: The gradients of the last layer are computed once ALL training data have appeared to the algorithm, i.e., by summing up all error terms.

• Pattern mode: The gradients are computed every time a new training data pair appears. Thus gradients are based on successive individual errors.

The Backpropagation Algorithm

A major problem: The algorithm may converge to a local minimum.

The cost function choice. Examples:

• The Least Squares

$$J = \sum_{i=1}^{N} E(i)$$

$$E(i) = \frac{1}{2} \sum_{m=1}^{k} e_m^2(i) = \frac{1}{2} \sum_{m=1}^{k} \big( y_m(i) - \hat{y}_m(i) \big)^2, \qquad i = 1, 2, \dots, N$$

$y_m(i)$: Desired response of the m-th output node (1 or 0) for input x(i).

$\hat{y}_m(i)$: Actual response of the m-th output node, in the interval [0, 1], for input x(i).

The Backpropagation Algorithm

The cost function choice. Examples:

• The cross-entropy

$$J = \sum_{i=1}^{N} E(i)$$

$$E(i) = -\sum_{m=1}^{k} \Big\{ y_m(i) \ln \hat{y}_m(i) + \big( 1 - y_m(i) \big) \ln \big( 1 - \hat{y}_m(i) \big) \Big\}$$

This presupposes an interpretation of y and ŷ as probabilities!

• Classification error rate:

– Also known as discriminative learning.

– Most of these techniques use a smoothed version of the classification error.
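The two per-pattern costs E(i), least squares and cross-entropy, can be compared with a short sketch. The function names and the example values are hypothetical; the cross-entropy version assumes that every actual response lies strictly between 0 and 1, so that both logarithms are defined.

```python
import math

def least_squares(y, y_hat):
    """E(i) = 1/2 * sum_m (y_m - y_hat_m)^2."""
    return 0.5 * sum((t - o) ** 2 for t, o in zip(y, y_hat))

def cross_entropy(y, y_hat):
    """E(i) = -sum_m [y_m ln(y_hat_m) + (1 - y_m) ln(1 - y_hat_m)].
    Assumes 0 < y_hat_m < 1 for every output node."""
    return -sum(t * math.log(o) + (1 - t) * math.log(1 - o)
                for t, o in zip(y, y_hat))

desired = [1, 0]                 # desired responses y_m(i)
good = [0.9, 0.2]                # actual responses close to the targets
bad = [0.01, 0.2]                # first output confidently wrong
print(least_squares(desired, good), cross_entropy(desired, good))
print(least_squares(desired, bad), cross_entropy(desired, bad))
```

In this example the least squares cost of a single output is bounded by 1/2, while the cross-entropy grows without bound as a confidently wrong output approaches 0 or 1.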


The Backpropagation Algorithm

“Well formed” cost functions :

• Danger of local minimum convergence.

• “Well formed” cost functions guarantee convergence to a “good” solution.

• That is one that classifies correctly ALL training patterns, provided such a solution exists.

• The cross-entropy cost function is a well formed one. The Least Squares is not.

The Backpropagation Algorithm

Both the Least Squares and the cross-entropy lead to output values $\hat{y}_m(i)$ that approximate optimally class a-posteriori probabilities:

$$\hat{y}_m(i) \approx P\big( \omega_m \mid x(i) \big)$$

That is, the probability of class $\omega_m$ given $x(i)$.

• It does not depend on the underlying distributions! It is a characteristic of certain cost functions and the chosen architecture of the network. How good or bad the approximation is depends on the model.

• It is valid at the global minimum.

Nonlinear Classifiers: Agenda

Part I: Nonlinear Classifiers Multi Layer Neural Networks

• XOR problem

• Two-Layer Perceptron

• Backpropagation

• Choice of the network size

• Number of layers and of neurons per layer

• Model selection techniques

• Pruning techniques

• Constructive techniques

• Applications: XOR, ZIP Code, OCR problem

• Demo: SNNS, BPN

Choice of the network size

How big should a network be? How many layers and how many neurons per layer?

There are two major techniques:

• Pruning Techniques:

These techniques start from a large network and then weights and/or neurons are removed iteratively, according to a criterion.

• Constructive techniques:

They start with a small network and keep increasing it, according to a predetermined procedure and criterion.

Choice of the network size

Idea: Start with a large network and leave the algorithm to decide which weights are small.

Generalization properties:

• Large networks learn the particular details of the training set.

• They are then not able to perform well when presented with data unknown to them.

The size of the network must be:

• Large enough to learn what makes data of the same class similar and data from different classes dissimilar.

• Small enough not to be able to learn underlying differences between data of the same class. Learning such differences leads to so-called overfitting.