The Bitvector Machine: A Fast and Robust Machine Learning Algorithm for Non-linear Problems
S. Edelkamp, M. Stommel, Universität Bremen
26.9.2012
Observation
Typical pattern recognition situation
High-dimensional data
Often simplified by Principal Component Analysis
Or vectors represent distances to a set of prototypes
Qualitative description of the data set with respect to the coordinate axes is often appropriate
Information coded in dimensionality instead of exact feature value
Proposed Approach
Binarise data
Vector components treated independently
Minimal preprocessing, no training
Use efficient kernel function
Exploit that the input values are binary
Trade-off between speed and accuracy
Exact feature value is lost in the binarisation
But the kernel is much faster
Bitvector Machine
Demonstration of the approach
for a Support Vector Machine with binarised input
Binarisation Procedure
Data set
k vectors of dimensionality d
Median of k values
Element in the k/2-th position after sorting
Can be computed in linear time
Binarisation
Computation of d medians,
one for every component of the input set
Compare every component of a vector to the respective median
Set to 0 if less, or 1 otherwise
Complexity O(kd) for both median computation and thresholding
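The binarisation step above can be sketched as follows; this is an illustrative implementation (function name and data are hypothetical, not the authors' code), computing one median per dimension and thresholding every component against it.

```python
# Sketch of the binarisation procedure: one median per dimension,
# then componentwise thresholding (0 if below the median, 1 otherwise).
from statistics import median

def binarise(data):
    """Map k real-valued d-dimensional vectors to bitvectors in O(k*d)."""
    d = len(data[0])
    # One median per component of the input set
    medians = [median(vec[j] for vec in data) for j in range(d)]
    # Compare every component to the respective median
    return [[0 if vec[j] < medians[j] else 1 for j in range(d)]
            for vec in data]

points = [[0.1, 5.0], [0.4, 2.0], [0.9, 3.0]]
print(binarise(points))  # → [[0, 1], [1, 0], [1, 1]]
```

Note that `statistics.median` sorts internally; a linear-time selection algorithm (as assumed on the slide) would replace it in a production implementation.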
[Figure: 2D example of the binarisation; the median of each dimension (dim 1, dim 2) defines a cut that maps components below it to 0 and components above it to 1]
Support Vector Machine
Decision function
f(x) = sgn( Σ_{i=1..s} α_i K(x_i, x) + b )
with s support vectors x_i
Coefficient α_i includes the Lagrange multiplier from the optimisation and the class label
Kernel function
K(x_i, x) stands for a dot product after applying a non-linear mapping
Bitvector Machine
Decision function
f(x) = sgn( Σ_{i=1..s} α_i K(β(x_i), β(x)) + b )
where β(·) is the binarisation of the input
Kernel function
K maps the input first to the Boolean space {0,1}^d before lifting it into higher dimensions
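The decision function can be sketched as below; all names are hypothetical, and the toy kernel only illustrates the call structure (the real kernel is the tabulated RBF kernel described on the next slide).

```python
# Minimal sketch of the BVM decision function:
# sign( sum_i alpha_i * K(beta(x_i), beta(x)) + b ),
# where the support vectors are stored already binarised.
def bvm_decide(x_bits, support_bits, alphas, bias, kernel):
    """Classify a binarised input against binarised support vectors."""
    s = sum(a * kernel(sv, x_bits) for a, sv in zip(alphas, support_bits))
    return 1 if s + bias >= 0 else -1

# Toy kernel for illustration only: 1 if the bitvectors match, else 0.
toy_kernel = lambda a, b: 1.0 if a == b else 0.0
print(bvm_decide([1, 0], [[1, 0], [0, 1]], [2.0, -1.0], 0.0, toy_kernel))  # → 1
```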
Kernel Computation
Observation
Squared Euclidean distance and Hamming distance in {0,1}^d coincide and yield the same d+1 results 0…d
Hamming distance
Number of different bits
Population count of the bitwise XOR of the input arguments
RBF-Kernel
Replace squared norm ||…||² by the Hamming distance
Precompute and tabulate kernel because it can only yield d+1 results
Saves the multiplication and the exponential during classification
Time complexity of the classification
Native popcount-CPU instruction: O( t + d )
No native popcount: O( d + t lg* d )
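The kernel computation above can be sketched as follows; the dimensionality, bandwidth γ, and packing of the bitvectors into integers are illustrative assumptions, not values from the paper.

```python
# Sketch of the tabulated RBF kernel over bitvectors packed into machine words.
import math

d = 8          # dimensionality of the binarised vectors (illustrative)
gamma = 0.5    # RBF bandwidth parameter (hypothetical value)

# The Hamming distance can only take the d+1 values 0..d, so the kernel
# exp(-gamma * h) is precomputed once; classification then needs only a
# popcount and a table lookup, saving the multiplication and exponential.
KERNEL_TABLE = [math.exp(-gamma * h) for h in range(d + 1)]

def popcount(x):
    # Replaced by a native CPU popcount instruction where available
    return bin(x).count("1")

def bvm_kernel(a, b):
    """Kernel value for two bitvectors packed into integers."""
    return KERNEL_TABLE[popcount(a ^ b)]

a = 0b10110010
b = 0b10011010
print(popcount(a ^ b))   # Hamming distance: 2
print(bvm_kernel(a, b))  # exp(-0.5 * 2) = exp(-1)
```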
Experimental Results
First scenario
Gaussian Mixtures
2 and 5 class problems
2...256 dimensions
Second scenario
High-dimensional XOR problem
With noise
Third scenario
Computer vision task
Classification of facial features (16 classes)
Mixed Gaussian Distribution
2-class problem
3 Gaussians per class at random positions in the unit hypercube
70 sample points per Gaussian
5-class problem
5 Gaussians per class
Dimensionality
2, 4, 8, …, 256 dimensions
Ground-truth
Maximum likelihood classification (Underlying distribution is known)
Difficulty of the problem controlled by the bandwidth of the Gaussians
Bandwidth adjusted to dimensionality
2D-Example
[Figure: data set for 2 classes (left) and 5 classes (right)]
Class Borders in the Example
[Figure: class borders for the 2-class (left) and 5-class (right) data sets]
Results: 2-Class Problem
(“SVM” refers to an SVM with a Gaussian kernel)
Results: Five Classes
CPU-Time for Classification
Classification of the whole data sets
Maximum speed-up factor of 32 for 2 classes and 128 dimensions
XOR-Problem
Gaussians now centred in
random corners of the unit cube
Class labels assigned to each Gaussian according to the XOR problem
Parameters
4 planar XORs
Dimensionality 8, 16, …, 256
70 data points per Gaussian
Gaussians in 0…100% of the corners
Bandwidth adjusted as before
Accuracy for 4 Planar XORs
SVM best; BVM follows the SVM at a lower accuracy level
Linear classifier fails
8-Dimensional XOR
Linear classifier fails for more than 10% of the corners used
SVM good; BVM better when more than 70% of the corners are used
XOR suits BVM
Real-World Data Set
SIFT-Descriptors computed for 15 different face parts
SIFT represents histograms of edge orientations in a local image area
Data set created using manual annotation of the FERET data set
Additional rejection class representing non-face patterns
Equally sized, randomised training and test sets
SVM best
Linear classifier worst, tends to overfit
BVM is 2% worse than SVM but 5% better than linear classifier
CPU-Time
Computation of 116M kernels
BVM with native 64-bit popcount is 48 times faster than LIBSVM
Classification of 17K vectors in 16 classes using 7000 Support Vectors
Speed improvement of factor 17
High influence of the code optimisation by the compiler
Conclusion
Bitvector Machine is based on a dramatic simplification of the input data
Fortunate speed-accuracy trade-off
Kernel evaluation up to 48 times faster
Classification up to 32 times faster
Accuracy not as good as SVM in most cases
But much higher accuracy than a linear classifier
XOR-like problems suit the BVM
Applicable to real-world pattern recognition problems as demonstrated for the popular SIFT descriptors