Nonlinear Classifiers I
Nonlinear Classifiers: Introduction
• Classifiers
• Supervised Classifiers
• Linear Classifiers
• Perceptron
• Least Squares Methods
• Linear Support Vector Machine
• Nonlinear Classifiers
• Part I: Multi Layer Neural Networks, Convolutional Neural Network
• Part II: Nonlinear Support Vector Machine
• Decision Trees
• Unsupervised Classifiers
• An example: Suppose we’re in 1 dimension.
What would a linear SVM do with this data?
[Figure: 1-dimensional data set plotted along the x axis, around x = 0]
Not a big surprise:
[Figure: the same 1-dimensional data around x = 0, with the positive “plane” and the negative “plane” marked]
• Harder 1-dimensional dataset
What can be done about this?
[Figure: harder 1-dimensional data set around x = 0]
Non-linear basis function:
$$z_k = (x_k, x_k^2)$$
[Figure: non-linear basis function $z_k = (x_k, x_k^2)$, plotted around x = 0]
• Linear classifiers are simple and computationally efficient.
• However for nonlinearly separable features, they might lead to very inaccurate decisions.
• Then we may trade simplicity and efficiency for accuracy using a nonlinear classifier.
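The effect of the non-linear basis function $z_k = (x_k, x_k^2)$ can be illustrated with a minimal sketch. The data set, the threshold and the hand-picked weights below are assumptions made for illustration, not values from the slides.

```python
import numpy as np

# Hypothetical 1-D data: the negative class sits near x = 0,
# the positive class lies on both sides of it (not linearly separable in x).
x = np.array([-4.0, -3.0, -2.5, -1.0, -0.5, 0.0, 0.5, 1.0, 2.5, 3.0, 4.0])
labels = np.where(np.abs(x) > 2.0, 1, -1)

def basis(x):
    """Map each scalar x_k to the 2-D point z_k = (x_k, x_k^2)."""
    return np.column_stack([x, x ** 2])

z = basis(x)

# In z-space the horizontal line z_2 = 4 separates the classes:
# g(z) = w^T z + w0 with hand-picked w = (0, 1), w0 = -4.
w = np.array([0.0, 1.0])
w0 = -4.0
predictions = np.sign(z @ w + w0).astype(int)

print(np.array_equal(predictions, labels))
```

A single threshold on x cannot separate this data, but a linear classifier in the (x, x^2) plane can.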
The Perceptron
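The Perceptron is a learning algorithm that adjusts the weights $w_i$ of its weight vector $w$ such that for all examples $x_i$:
$$w^T x_i > 0 \quad \forall x_i \in \omega_1$$
$$w^T x_i < 0 \quad \forall x_i \in \omega_2$$
Here, the intercept is included in $w$:
$$w = [w_0, w_1, \ldots, w_l]^T, \qquad x = [1, x_1, \ldots, x_l]^T$$
It is assumed that the problem is linearly separable. Hence this vector $w$ exists.
[Figure: decision hyperplane $g(x) = w^T x = 0$]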
w must minimize the classification error.
w is found using an optimization algorithm.
General steps towards a classifier:
1. Define a cost function to be minimized.
2. Choose an algorithm to minimize it.
3. The minimum corresponds to a solution.
The Perceptron Cost Function
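Goal:
$$w^T x > 0 \quad \forall x \in \omega_1, \qquad w^T x < 0 \quad \forall x \in \omega_2$$
Cost function:
$$J(w) = \sum_{x \in Y} \delta_x w^T x$$
$Y$: subset of the training vectors which are misclassified by the hyperplane defined by $w$.
$$\delta_x = -1 \ \text{if } x \in \omega_1 \text{ but is classified in } \omega_2$$
$$\delta_x = +1 \ \text{if } x \in \omega_2 \text{ but is classified in } \omega_1$$
Every term $\delta_x w^T x$ with $x \in Y$ is non-negative, so
$$J(w) \ge 0 \ \text{for all } w, \qquad J(w) = 0 \ \text{if } Y = \emptyset$$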
The Perceptron Algorithm
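$J(w)$ is continuous and piecewise linear: it is linear in $w$ as long as $Y$ is constant, and its slope changes whenever $Y$ changes.
[Figure: $J(w)$ as a function of $w_1$, showing regions where $Y$ is constant and points where $Y$ changes]
$J(w)$ is minimized by gradient descent (update $w$ by taking steps that are proportional to the negative of the gradient of the cost function $J(w)$):
$$w(t+1) = w(t) - \rho_t \left.\frac{\partial J(w)}{\partial w}\right|_{w = w(t)}$$
$$\frac{\partial J(w)}{\partial w} = \frac{\partial}{\partial w}\left(\sum_{x \in Y} \delta_x w^T x\right) = \sum_{x \in Y} \delta_x x$$
$$w(t+1) = w(t) - \rho_t \sum_{x \in Y} \delta_x x$$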
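The update rule $w(t+1) = w(t) - \rho_t \sum_{x \in Y} \delta_x x$ can be sketched as follows. This is a minimal illustration, assuming a constant step size `rho`, labels coded as +1 for $\omega_1$ and -1 for $\omega_2$, and a toy AND data set. The function names are hypothetical.

```python
import numpy as np

def augment(X):
    """Prepend a constant 1 to every x so that the intercept w0 is part of w."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

def perceptron_cost(w, X, y):
    """J(w) = sum over misclassified x of delta_x * w^T x (always >= 0)."""
    scores = augment(X) @ w
    mis = y * scores <= 0            # Y: misclassified vectors (score 0 counts as error)
    return float(np.sum(-y[mis] * scores[mis]))

def perceptron_train(X, y, rho=1.0, max_iter=1000):
    """Batch perceptron with labels y in {+1, -1} (+1 means omega_1)."""
    Xa = augment(X)
    w = np.zeros(Xa.shape[1])
    for _ in range(max_iter):
        mis = y * (Xa @ w) <= 0
        if not mis.any():            # Y is empty: J(w) = 0, stop
            break
        delta = -y[mis]              # delta_x = -1 for omega_1, +1 for omega_2
        w = w - rho * (delta[:, None] * Xa[mis]).sum(axis=0)
    return w

# Hypothetical linearly separable toy problem: logical AND.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([-1, -1, -1, 1])
w = perceptron_train(X, y)
print(w, perceptron_cost(w, X, y))
```

On linearly separable data the loop stops as soon as no training vector is misclassified, at which point the cost is zero.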
The Perceptron as a Neural Network
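Once the perceptron is trained, it is used to perform the classification:
$$\text{if } w^T x > 0 \text{ assign } x \text{ to } \omega_1, \qquad \text{if } w^T x < 0 \text{ assign } x \text{ to } \omega_2$$
The perceptron is the simplest form of a “Neural Network”:
[Figure: perceptron with synaptic weights and an activation function $f$ that outputs $1$ or $-1$ depending on $w^T x$]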
Nonlinear Classifiers: Agenda
Part I: Nonlinear Classifiers Multi Layer Neural Networks
• XOR problem
• Two-Layer Perceptron
• Backpropagation
• Choice of the network size
• Model selection techniques
• Applications: XOR, ZIP Code, OCR problem
• Demo: SNNS, BPN
The XOR problem
x1  x2  XOR  Class
0   0   0    B
0   1   1    A
1   0   1    A
1   1   0    B
• There is no single line (hyperplane) that separates class A from class B. In contrast, the AND and OR operations are linearly separable problems.
• For the XOR problem, draw two lines instead of one.
• Then class B is outside the shaded area and class A is inside.
• We call it a two-step design.
The Two-Layer Perceptron
• Step 1: Draw two lines (hyperplanes), $g_1(x) = 0$ and $g_2(x) = 0$.
Each of them is realized by a perceptron.
The outputs of the perceptrons will be
$$y_i = f(g_i(x)) = \begin{cases} 0 \\ 1 \end{cases}, \qquad i = 1, 2$$
depending on the value of $x$ ($f$ is the activation function).
• Step 2: Find the ‘position’ of $x$ w.r.t. both lines, based on the values of $y_1$, $y_2$.
• Equivalently:
1. The computations of the first step perform a mapping $x \to y = [y_1, y_2]^T$.
2. The decision is then performed on the transformed data $y$.

         1st step      2nd step
x1  x2   y1    y2
0   0    0(-)  0(-)    B(0)
0   1    1(+)  0(-)    A(1)
1   0    1(+)  0(-)    A(1)
1   1    1(+)  1(+)    B(0)
• This decision can be performed via a second line, $g(y) = 0$, which can also be realized by a perceptron.
The Two-Layer Perceptron
Computations of the first step perform a mapping
that transforms the nonlinearly separable problem
to a linearly separable one.
The Two-Layer Perceptron
• The architecture
This is known as the two-layer perceptron, with one hidden and one output layer.
The activation functions are:
$$f(x) = \begin{cases} 1, & x > 0 \\ 0, & x < 0 \end{cases}$$
The Two-Layer Perceptron
• The nodes (neurons) of the figure realize the following lines (hyperplanes):
$$g_1(x) = 1 \cdot x_1 + 1 \cdot x_2 - \frac{1}{2} = 0$$
$$g_2(x) = 1 \cdot x_1 + 1 \cdot x_2 - \frac{3}{2} = 0$$
$$g_3(y) = 1 \cdot y_1 - 2 \cdot y_2 - \frac{1}{2} = 0$$
• Classification capabilities:
All possible mappings performed by the first layer are onto the vertices of the unit side square, i.e., (0, 0), (0, 1), (1, 0), (1, 1).
The activation function is again
$$f(x) = \begin{cases} 1, & x > 0 \\ 0, & x < 0 \end{cases}$$
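The two-step design for XOR can be checked with a minimal sketch. It assumes the three lines read as $g_1(x) = x_1 + x_2 - 1/2$, $g_2(x) = x_1 + x_2 - 3/2$ and $g_3(y) = y_1 - 2 y_2 - 1/2$, together with the hard-limiter activation. The function names are illustrative.

```python
import numpy as np

def step(v):
    """Hard-limiter activation: 1 for v > 0, 0 otherwise."""
    return (np.asarray(v) > 0).astype(int)

def two_layer_xor(X):
    """Return the hidden mapping y = [y1, y2] and the output class bit."""
    x1, x2 = X[:, 0], X[:, 1]
    y1 = step(x1 + x2 - 0.5)          # first hidden node, g1(x)
    y2 = step(x1 + x2 - 1.5)          # second hidden node, g2(x)
    out = step(y1 - 2 * y2 - 0.5)     # output node, g3(y)
    return np.column_stack([y1, y2]), out

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
hidden, out = two_layer_xor(X)
print(hidden)
print(out)   # 1 = class A, 0 = class B
```

The hidden layer maps the four inputs onto only three vertices of the unit square, (0, 0), (1, 0) and (1, 1), which a single line in the y plane can separate.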
Classification capabilities
• The more general case
$$x \in R^l, \qquad y_i \in \{0, 1\}, \quad i = 1, 2, \ldots, p$$
$$x \to y = [y_1, \ldots, y_p]^T, \quad y \in R^p, \qquad y_i = f(g_i(x))$$
First layer:
$$g_i(x) = \sum_{k=1}^{l} w_{ik} x_k + w_{i0} = w_i^T x + w_{i0} = 0, \qquad w_i, x \in R^l$$
Output node:
$$g_j(y) = \sum_{k=1}^{p} w_{jk} y_k + w_{j0} = w_j^T y + w_{j0} = 0, \qquad w_j, y \in R^p$$
The $g_i(x)$’s perform a mapping of a vector $x$ onto $y$, representing the vertices of the unit side hypercube $H_p$.
The mapping is achieved with $p$ nodes, each realizing a hyperplane. The output of each of these nodes is 0 or 1 depending on the relative position of $x$ w.r.t. the hyperplane.
Classification capabilities
Intersections of these hyperplanes form regions in the $l$-dimensional space. Each region corresponds to a vertex of the unit hypercube $H_p$. For example, the 001 vertex corresponds to the region which is located
to the (-) side of $g_1(x) = 0$, to the (-) side of $g_2(x) = 0$, to the (+) side of $g_3(x) = 0$.
The output node realizes a hyperplane in the $y$ space that separates some of the vertices from the others. Thus, the two-layer perceptron has the capability to classify vectors into classes that consist of unions of polyhedral regions. But not ANY union. It depends on the relative position of the corresponding vertices.
The Three-Layer Perceptron
• This is capable of classifying vectors into classes consisting of ANY union of polyhedral regions.
• The idea is similar to the XOR problem. It realizes more than one plane in the space.
The reasoning
• For each vertex corresponding to class A, construct a hyperplane which leaves THIS vertex on one side (+) and ALL the others on the other side (-).
• The output neuron realizes an OR gate.
Overall:
The first layer of the network forms the hyperplanes, the second layer forms the regions and the output nodes form the classes.
Multi-Layer Neural Networks
Many parameters:
• Number of layers
• Number of nodes in each layer
• Number of connections of each node with the previous layer (e.g. fully connected)
[Figure: nodes of layer r-1 connected to nodes of layer r]
The Multi-Layer Neural Network
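[Figure: node $k$ of layer $r-1$ connected to node $j$ of layer $r$]
$$v_j^r(i) = \sum_{k=1}^{k_{r-1}} w_{jk}^r \, y_k^{r-1}(i) + w_{j0}^r$$
$y_k^{r-1}(i)$: output of the $k$-th node at layer $r-1$ for the $i$-th training pair.
$v_j^r(i)$: argument of $f(\cdot)$ at node $j$ of layer $r$ for the $i$-th training pair.
With $y_0^{r-1}(i) \equiv 1$ the bias is absorbed into the sum:
$$v_j^r(i) = \sum_{k=0}^{k_{r-1}} w_{jk}^r \, y_k^{r-1}(i) = w_j^{rT} y^{r-1}(i)$$
$$y_j^r(i) = f(v_j^r(i)) = f(w_j^{rT} y^{r-1}(i))$$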
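The layer-by-layer computation $y_j^r(i) = f(w_j^{rT} y^{r-1}(i))$ with $y_0^{r-1}(i) = 1$ can be sketched as a forward pass. This is only an illustration: the logistic activation, the layer sizes and the random weights are assumptions, and the bias $w_{j0}^r$ is stored in column 0 of each weight matrix.

```python
import numpy as np

def logistic(v):
    return 1.0 / (1.0 + np.exp(-v))

def forward(x, weights, f=logistic):
    """Propagate one input x through all layers.

    weights[r] has shape (k_r, k_{r-1} + 1); column 0 holds the biases w_{j0},
    matching the convention y_0 = 1 for the previous layer.
    Returns the list of layer outputs [y^0 = x, y^1, ..., y^L].
    """
    outputs = [np.asarray(x, dtype=float)]
    for W in weights:
        y_prev = np.concatenate(([1.0], outputs[-1]))  # prepend y_0 = 1
        v = W @ y_prev                                  # v_j = w_j^T y^{r-1}
        outputs.append(f(v))                            # y_j = f(v_j)
    return outputs

# Hypothetical network: 2 inputs, 3 hidden nodes, 1 output node.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 2 + 1)), rng.normal(size=(1, 3 + 1))]
ys = forward([0.5, -1.0], weights)
print([y.shape for y in ys])
```

Storing the bias as an extra column keeps each layer a single matrix-vector product.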
Designing Multilayer Networks
– One strategy could be to adopt the above rationale and construct a structure that correctly classifies all the training patterns (usually impossible).
– Second strategy: Start with a (large) network structure and compute the $w$’s, often called ‘synaptic weights’, to optimize a cost function.
– Back Propagation is an algorithmic procedure that computes the synaptic weights iteratively, so that an adopted cost function is minimized (optimized).
Nonlinear Classifiers: Agenda
Part I: Nonlinear Classifiers Multi Layer Neural Networks
• XOR problem
• Two-Layer Perceptron
• Backpropagation algorithm to train multilayer perceptrons
• Choice of the network size
• Model selection techniques
• Applications: XOR, ZIP Code, OCR problem
• Demo: SNNS, BPN
The Backpropagation Algorithm (BP)
The Steps:
1. Adopt an optimizing cost function $J$, e.g.,
• Least Squares Error
• Relative Entropy
between desired responses and actual responses of the network for the available training patterns.
That is, from now on we have to live with errors resulting from structure and cost function. We only try to minimize them, using certain criteria.
[Figure: node $k$ of layer $r-1$ connected to node $j$ of layer $r$]
The Backpropagation Algorithm
The Steps:
2. Adopt an algorithmic procedure for the optimization of the cost function with respect to the weights $w$, e.g.:
– Gradient descent
– Newton’s algorithm
– Conjugate gradient
The Backpropagation Algorithm
The Steps:
3. The task is a nonlinear optimization, e.g. with gradient descent:
$$w_j^r(\text{new}) = w_j^r(\text{old}) + \Delta w_j^r$$
$$\Delta w_j^r = -\mu \frac{\partial J}{\partial w_j^r}$$
$$J = \sum_{i=1}^{N} E(i)$$
BackProp: Step 3 nonlinear optimization
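Detail: Computation of the Gradients.
$$\frac{\partial J}{\partial w_j^r} = \sum_{i=1}^{N} \frac{\partial E(i)}{\partial w_j^r} \qquad \text{with} \quad J = \sum_{i=1}^{N} E(i)$$
$$\frac{\partial E(i)}{\partial w_j^r} = \frac{\partial E(i)}{\partial v_j^r(i)} \, \frac{\partial v_j^r(i)}{\partial w_j^r} \qquad \text{and} \quad v_j^r(i) = \sum_{k=1}^{k_{r-1}} w_{jk}^r \, y_k^{r-1}(i) + w_{j0}^r$$
$$\delta_j^r(i) := \frac{\partial E(i)}{\partial v_j^r(i)}$$
$$\frac{\partial v_j^r(i)}{\partial w_j^r} = y^{r-1}(i) := \left[1, \; y_1^{r-1}(i), \; \ldots, \; y_{k_{r-1}}^{r-1}(i)\right]^T$$
$$\Delta w_j^r = -\mu \sum_{i=1}^{N} \delta_j^r(i) \, y^{r-1}(i)$$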
BackProp: Step 3 nonlinear optimization
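Detail: Computation of $\delta_j^r(i) = \dfrac{\partial E(i)}{\partial v_j^r(i)}$ for Least Squares
$$E(i) = \frac{1}{2} \sum_{m=1}^{k_L} e_m^2(i) = \frac{1}{2} \sum_{m=1}^{k_L} \left( f(v_m^L(i)) - y_m(i) \right)^2$$
Case $r = L$ (last layer):
$$\delta_j^L(i) = e_j(i) \, f'(v_j^L(i))$$
Case $r < L$:
$$\frac{\partial E(i)}{\partial v_j^{r-1}(i)} = \sum_{k=1}^{k_r} \frac{\partial E(i)}{\partial v_k^r(i)} \, \frac{\partial v_k^r(i)}{\partial v_j^{r-1}(i)}$$
$$\delta_j^{r-1}(i) = \sum_{k=1}^{k_r} \delta_k^r(i) \, \frac{\partial v_k^r(i)}{\partial v_j^{r-1}(i)}$$
$$\frac{\partial v_k^r(i)}{\partial v_j^{r-1}(i)} = \frac{\partial}{\partial v_j^{r-1}(i)} \left[ \sum_{m=0}^{k_{r-1}} w_{km}^r \, y_m^{r-1}(i) \right] \qquad \text{with} \quad y_m^{r-1}(i) = f(v_m^{r-1}(i))$$
$$\frac{\partial v_k^r(i)}{\partial v_j^{r-1}(i)} = w_{kj}^r \, f'(v_j^{r-1}(i))$$
$$\delta_j^{r-1}(i) = \left[ \sum_{k=1}^{k_r} \delta_k^r(i) \, w_{kj}^r \right] f'(v_j^{r-1}(i))$$
$$\delta_j^{r-1}(i) = e_j^{r-1}(i) \, f'(v_j^{r-1}(i)) \qquad \text{with} \quad e_j^{r-1}(i) := \sum_{k=1}^{k_r} \delta_k^r(i) \, w_{kj}^r$$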
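BackProp: Step 3 summary
3. The task is a nonlinear optimization, e.g. gradient descent:
$$w_j^r(\text{new}) = w_j^r(\text{old}) + \Delta w_j^r, \qquad \Delta w_j^r = -\mu \frac{\partial J}{\partial w_j^r}$$
with the following update rules:
$$\Delta w_j^r = -\mu \sum_{i=1}^{N} \delta_j^r(i) \, y^{r-1}(i)$$
$$\delta_j^L(i) = e_j(i) \, f'(v_j^L(i))$$
$$\delta_j^{r-1}(i) = e_j^{r-1}(i) \, f'(v_j^{r-1}(i)), \qquad e_j^{r-1}(i) = \sum_{k=1}^{k_r} \delta_k^r(i) \, w_{kj}^r$$
Error $e_j(i)$: difference of actual and desired response for the $j$-th output neuron:
$$e_j(i) = \hat{y}_j(i) - y_j(i) \qquad (\text{output} - \text{target})$$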
The Backpropagation Algorithm
The Procedure:
1. Initialization:
Initialize unknown weights randomly with small values.
2. Forward computations:
For each of the training examples compute the output of all neurons of all layers. Compute the cost function for the current estimate of weights.
3. Backward computations:
Compute the gradient terms backwards, starting with the weights of the last (e.g. 3rd) layer and then moving towards the first.
4. Update:
Update the weights.
5. Termination:
Repeat steps 2 to 4 until a termination criterion is met.
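One pass through steps 1 to 4 can be sketched for a small network. This is a minimal illustration, not a full trainer: it assumes one hidden layer with two nodes, the logistic activation with $a = 1$, the least squares cost, batch mode, and XOR as toy data. All names and the step size `mu` are hypothetical.

```python
import numpy as np

def f(v):
    return 1.0 / (1.0 + np.exp(-v))

def f_prime(v):
    return f(v) * (1.0 - f(v))

def forward(X, W1, W2):
    """X: (N, l). Each weight matrix holds the bias w_{j0} in column 0."""
    Y0 = np.hstack([np.ones((X.shape[0], 1)), X])       # inputs with y_0 = 1
    V1 = Y0 @ W1.T                                      # arguments of f, hidden layer
    Y1 = np.hstack([np.ones((X.shape[0], 1)), f(V1)])   # hidden outputs with y_0 = 1
    V2 = Y1 @ W2.T                                      # arguments of f, output layer
    return Y0, V1, Y1, V2, f(V2)

def cost(X, T, W1, W2):
    """Least squares cost J = sum_i 1/2 * sum_m e_m(i)^2."""
    return 0.5 * np.sum((forward(X, W1, W2)[-1] - T) ** 2)

def gradients(X, T, W1, W2):
    """Backward pass in batch mode; returns dJ/dW1 and dJ/dW2."""
    Y0, V1, Y1, V2, out = forward(X, W1, W2)
    D2 = (out - T) * f_prime(V2)            # last layer: delta = e * f'(v)
    D1 = (D2 @ W2[:, 1:]) * f_prime(V1)     # hidden: (sum_k delta_k w_kj) * f'(v)
    return D1.T @ Y0, D2.T @ Y1             # sum over i of delta(i) y^{r-1}(i)^T

rng = np.random.default_rng(0)
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
T = np.array([[0.0], [1.0], [1.0], [0.0]])   # XOR targets
W1 = 0.5 * rng.normal(size=(2, 3))           # step 1: small random weights
W2 = 0.5 * rng.normal(size=(1, 3))

mu = 0.01
J_before = cost(X, T, W1, W2)                # step 2: forward computations
G1, G2 = gradients(X, T, W1, W2)             # step 3: backward computations
W1_new, W2_new = W1 - mu * G1, W2 - mu * G2  # step 4: update
J_after = cost(X, T, W1_new, W2_new)
print(J_before, J_after)
```

A full training run would repeat steps 2 to 4 in a loop until a termination criterion is met, for example a small gradient norm or a maximum number of iterations.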
The Backpropagation Algorithm
• In a large number of optimizing procedures, the computation of derivatives is involved. Hence, discontinuous activation functions pose a problem, i.e.,
$$f(x) = \begin{cases} 1, & x > 0 \\ 0, & x < 0 \end{cases}$$
• There is always an escape path!!!
e.g. the logistic function:
$$f(x) = \frac{1}{1 + \exp(-ax)}$$
$$f'(x) = a \, f(x) \left( 1 - f(x) \right)$$
Other differentiable functions are also possible and in some cases more desirable.
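As a quick check of the logistic function and its derivative, the sketch below compares the closed form $a\,f(x)(1 - f(x))$ against a central finite difference. The slope parameter value and the sample points are arbitrary choices for illustration.

```python
import numpy as np

def logistic(x, a=1.0):
    """f(x) = 1 / (1 + exp(-a x))."""
    return 1.0 / (1.0 + np.exp(-a * x))

def logistic_derivative(x, a=1.0):
    """Closed-form derivative: a * f(x) * (1 - f(x))."""
    fx = logistic(x, a)
    return a * fx * (1.0 - fx)

x = np.linspace(-3.0, 3.0, 7)
a = 2.0
h = 1e-6
# Central finite difference as an independent estimate of the derivative.
numeric = (logistic(x + h, a) - logistic(x - h, a)) / (2 * h)
max_error = np.max(np.abs(numeric - logistic_derivative(x, a)))
print(max_error)
```

Because the derivative is expressed through $f(x)$ itself, the backward pass can reuse the values already computed in the forward pass.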
The Backpropagation Algorithm
Two major philosophies:
• Batch mode: The gradients of the last layer are computed once ALL training data have appeared to the algorithm, i.e., by summing up all error terms.
• Pattern mode: The gradients are computed every time a new training data pair appears. Thus gradients are based on successive individual errors.
The Backpropagation Algorithm
A major problem:
The algorithm may converge to a local minimum.
The cost function choice
Examples:
• The Least Squares
$$J = \sum_{i=1}^{N} E(i)$$
$$E(i) = \frac{1}{2} \sum_{m=1}^{k} e_m^2(i) = \frac{1}{2} \sum_{m=1}^{k} \left( y_m(i) - \hat{y}_m(i) \right)^2, \qquad i = 1, 2, \ldots, N$$
$y_m(i)$: Desired response of the $m$-th output node (1 or 0) for input $x(i)$.
$\hat{y}_m(i)$: Actual response of the $m$-th output node, in the interval [0, 1], for input $x(i)$.
The Backpropagation Algorithm
The cost function choice
Examples:
• The cross-entropy
$$J = \sum_{i=1}^{N} E(i)$$
$$E(i) = -\sum_{m=1}^{k} \left[ y_m(i) \ln \hat{y}_m(i) + \left( 1 - y_m(i) \right) \ln \left( 1 - \hat{y}_m(i) \right) \right]$$
This presupposes an interpretation of $y$ and $\hat{y}$ as probabilities!
• Classification error rate:
  – Also known as discriminative learning.
  – Most of these techniques use a smoothed version of the classification error.
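The two cost functions can be compared on a single training pair with a minimal sketch. The target and output vectors are made-up numbers, and the clipping constant `eps` is an implementation assumption that avoids taking the logarithm of zero.

```python
import numpy as np

def least_squares(y, y_hat):
    """E(i) = 1/2 * sum_m (y_m - y_hat_m)^2."""
    return 0.5 * np.sum((y - y_hat) ** 2)

def cross_entropy(y, y_hat, eps=1e-12):
    """E(i) = -sum_m [y_m ln y_hat_m + (1 - y_m) ln(1 - y_hat_m)]."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)   # keep the logarithms finite
    return -np.sum(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

y = np.array([1.0, 0.0, 0.0])        # desired responses (1 or 0)
good = np.array([0.9, 0.1, 0.2])     # outputs close to the targets
bad = np.array([0.1, 0.9, 0.2])      # outputs far from the targets

print(least_squares(y, good), least_squares(y, bad))
print(cross_entropy(y, good), cross_entropy(y, bad))
```

Both costs grow as the outputs move away from the targets, but the cross-entropy penalizes confident wrong outputs much more strongly.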
The Backpropagation Algorithm
“Well formed” cost functions :
• Danger of local minimum convergence.
• “Well formed” cost functions guarantee convergence to a “good” solution.
• That is one that classifies correctly ALL training patterns, provided such a solution exists.
• The cross-entropy cost function is a well formed one. The Least Squares is not.
The Backpropagation Algorithm
Both the Least Squares and the cross-entropy lead to output values $\hat{y}_m(i)$ that optimally approximate class a-posteriori probabilities!
$$\hat{y}_m(i) \approx P(\omega_m \mid x(i))$$
That is, the probability of class $\omega_m$ given $x(i)$.
It does not depend on the underlying distributions!!!
It is a characteristic of certain cost functions and the chosen architecture of the network. It depends on the model how good or bad the approximation is.
It is valid at the global minimum.
Nonlinear Classifiers: Agenda
Part I: Nonlinear Classifiers Multi Layer Neural Networks
• XOR problem
• Two-Layer Perceptron
• Backpropagation
• Choice of the network size
• Number of layers and of neurons per layer
• Model selection techniques
• Pruning techniques
• Constructive techniques
• Applications: XOR, ZIP Code, OCR problem
• Demo: SNNS, BPN
Choice of the network size
How big can a network be? How many layers and how many neurons per layer?
There are two major techniques:
• Pruning Techniques:
These techniques start from a large network and then weights and/or neurons are removed iteratively, according to a criterion.
• Constructive techniques:
They start with a small network and keep increasing it, according to a predetermined procedure and criterion.
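A single pruning step can be sketched as follows. The removal criterion used here, zeroing the weights with the smallest magnitude, is an assumption chosen for illustration; the slides do not fix a specific criterion, and the function name and weight values are hypothetical.

```python
import numpy as np

def prune_smallest(W, fraction):
    """Return a copy of W with the given fraction of smallest-magnitude weights set to 0."""
    W = W.copy()
    n_prune = int(fraction * W.size)
    if n_prune > 0:
        # Flat indices of the n_prune weights with the smallest absolute value.
        idx = np.argsort(np.abs(W), axis=None)[:n_prune]
        W.flat[idx] = 0.0
    return W

# Hypothetical weight matrix of one layer.
W = np.array([[0.8, -0.05, 0.3],
              [0.01, -1.2, 0.02]])
W_pruned = prune_smallest(W, 0.5)
print(W_pruned)
```

In an iterative scheme, a step like this would alternate with further training of the remaining weights until the chosen criterion says to stop.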
Choice of the network size
Idea: Start with a large network and leave the algorithm to decide which weights are small.
Generalization properties:
• Large networks learn the particular details of the training set.
• They are then not able to perform well when presented with data unknown to them.
The size of the network must be:
• Large enough to learn what makes data of the same class similar and data from different classes dissimilar.
• Small enough not to be able to learn underlying differences between data of the same class. Learning such differences leads to the so-called overfitting.
Choice of the network size
Example:
• Decision curve (a) before and (b) after pruning.