The Bitvector Machine: A Fast and Robust Machine Learning Algorithm for Non-linear Problems
S. Edelkamp, M. Stommel, Universität Bremen
26.9.2012
Observation
Typical pattern recognition situation
High-dimensional data
Often simplified by Principal Component Analysis
Or vectors represent distances to a set of prototypes
Qualitative description of the data set with respect to the coordinate axes is often appropriate
Information coded in dimensionality instead of exact feature value
Proposed Approach
Binarise data
Vector components treated independently
Minimal preprocessing, no training
Use efficient kernel function
Exploit that the input values are binary
Trade-off between speed and accuracy
Exact feature value is lost in the binarisation
But the kernel is much faster
Bitvector Machine
Demonstration of the approach
for a Support Vector Machine with binarised input
Binarisation Procedure
Data set
k vectors of dimensionality d
Median of k values
Element in the k/2-th position after sorting
Can be computed in linear time
Binarisation
Computation of d medians,
one for every component of the input set
Compare every component of a vector to the respective median
Set to 0 if less, or 1 otherwise
Complexity O(kd) for both median computation and thresholding
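The binarisation step above can be sketched as follows; this is an illustrative implementation (function name and data are hypothetical, not the authors' code), computing one median per dimension and thresholding every component against it.

```python
# Sketch of the binarisation procedure: one median per dimension,
# then componentwise thresholding (0 if below the median, 1 otherwise).
from statistics import median

def binarise(data):
    """Map k real-valued d-dimensional vectors to bitvectors in O(k*d)."""
    d = len(data[0])
    # One median per component of the input set
    medians = [median(vec[j] for vec in data) for j in range(d)]
    # Compare every component to the respective median
    return [[0 if vec[j] < medians[j] else 1 for j in range(d)]
            for vec in data]

points = [[0.1, 5.0], [0.4, 2.0], [0.9, 3.0]]
print(binarise(points))  # → [[0, 1], [1, 0], [1, 1]]
```

Note that `statistics.median` sorts internally; a linear-time selection algorithm (as assumed on the slide) would replace it in a production implementation.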
[Figure: 2D example of the binarisation; the median of each dimension (dim 1, dim 2) defines a cut that maps components below it to 0 and components above it to 1]
Support Vector Machine
Decision function
f(x) = sgn( Σ_{i=1..s} α_i K(x_i, x) + b )
with s support vectors x_i
Coefficient α_i includes the Lagrange multiplier from the optimisation and the class label
Kernel function
K(x_i, x) stands for a dot product after applying a non-linear mapping
Bitvector Machine
Decision function
f(x) = sgn( Σ_{i=1..s} α_i K(β(x_i), β(x)) + b )
where β(·) is the binarisation of the input
Kernel function
K maps the input first to the Boolean space {0,1}^d before lifting it into higher dimensions
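The decision function can be sketched as below; all names are hypothetical, and the toy kernel only illustrates the call structure (the real kernel is the tabulated RBF kernel described on the next slide).

```python
# Minimal sketch of the BVM decision function:
# sign( sum_i alpha_i * K(beta(x_i), beta(x)) + b ),
# where the support vectors are stored already binarised.
def bvm_decide(x_bits, support_bits, alphas, bias, kernel):
    """Classify a binarised input against binarised support vectors."""
    s = sum(a * kernel(sv, x_bits) for a, sv in zip(alphas, support_bits))
    return 1 if s + bias >= 0 else -1

# Toy kernel for illustration only: 1 if the bitvectors match, else 0.
toy_kernel = lambda a, b: 1.0 if a == b else 0.0
print(bvm_decide([1, 0], [[1, 0], [0, 1]], [2.0, -1.0], 0.0, toy_kernel))  # → 1
```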
Kernel Computation
Observation
Squared Euclidean distance and Hamming distance in {0,1}^d coincide and yield the same d+1 results 0…d
Hamming distance
Number of different bits
Population count of the bitwise XOR of the input arguments
RBF-Kernel
Replace squared norm ||…||² by the Hamming distance
Precompute and tabulate kernel because it can only yield d+1 results
Saves the multiplication and the exponential during classification
Time complexity of the classification
Native popcount-CPU instruction: O( t + d )
No native popcount: O( d + t lg* d )
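The kernel computation above can be sketched as follows; the dimensionality, bandwidth γ, and packing of the bitvectors into integers are illustrative assumptions, not values from the paper.

```python
# Sketch of the tabulated RBF kernel over bitvectors packed into machine words.
import math

d = 8          # dimensionality of the binarised vectors (illustrative)
gamma = 0.5    # RBF bandwidth parameter (hypothetical value)

# The Hamming distance can only take the d+1 values 0..d, so the kernel
# exp(-gamma * h) is precomputed once; classification then needs only a
# popcount and a table lookup, saving the multiplication and exponential.
KERNEL_TABLE = [math.exp(-gamma * h) for h in range(d + 1)]

def popcount(x):
    # Replaced by a native CPU popcount instruction where available
    return bin(x).count("1")

def bvm_kernel(a, b):
    """Kernel value for two bitvectors packed into integers."""
    return KERNEL_TABLE[popcount(a ^ b)]

a = 0b10110010
b = 0b10011010
print(popcount(a ^ b))   # Hamming distance: 2
print(bvm_kernel(a, b))  # exp(-0.5 * 2) = exp(-1)
```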
Experimental Results
First scenario
Gaussian Mixtures
2 and 5 class problems
2...256 dimensions
Second scenario
High-dimensional XOR problem
With noise
Third scenario
Computer vision task
Classification of facial features (16 classes)
Mixed Gaussian Distribution
2-class problem
3 Gaussians per class at random positions in the unit hypercube
70 sample points per Gaussian
5-class problem
5 Gaussians per class
Dimensionality
2, 4, 8, …, 256 dimensions
Ground-truth
Maximum likelihood classification (Underlying distribution is known)
Difficulty of the problem controlled by the bandwidth of the Gaussians
Bandwidth adjusted to dimensionality
2D-Example
[Figure: data set for 2 classes (left) and 5 classes (right)]
Class Borders in the Example
[Figure: class borders for the 2-class (left) and 5-class (right) data sets]
Results: 2-Class Problem
(“SVM” refers to an SVM with a Gaussian kernel)
Results: Five Classes
CPU-Time for Classification
Classification of the whole data sets
Maximum speed-up factor of 32 for 2 classes and 128 dimensions
XOR-Problem
Gaussians now centred in
random corners of the unit cube
Class labels assigned to each Gaussian according to the XOR problem
Parameters
4 planar XORs
Dimensionality 8, 16, …, 256
70 data points per Gaussian
Gaussians in 0…100% of the corners
Bandwidth adjusted as before
Accuracy for 4 Planar XORs
SVM best; BVM follows the SVM at a lower accuracy level
Linear classifier fails
8-Dimensional XOR
Linear classifier fails for more than 10% of the corners used
SVM good; BVM better when more than 70% of the corners are used
XOR suits BVM
Real-World Data Set
SIFT-Descriptors computed for 15 different face parts
SIFT represents histograms of edge orientations in a local image area
Data set created using manual annotation of the FERET data set
Additional rejection class representing non-face patterns
Equally sized, randomised training and test sets
SVM best
Linear classifier worst, tends to overfit
BVM is 2% worse than SVM but 5% better than linear classifier
CPU-Time
Computation of 116M kernels
BVM with native 64-bit popcount is 48 times faster than LIBSVM
Classification of 17K vectors in 16 classes using 7000 Support Vectors
Speed improvement of factor 17
High influence of the code optimisation by the compiler
Conclusion
Bitvector Machine is based on a dramatic simplification of the input data
Fortunate speed-accuracy trade-off
Kernel evaluation up to 48 times faster
Classification up to 32 times faster
Accuracy not as good as SVM in most cases
But much higher accuracy than a linear classifier
XOR-like problems suit the BVM
Applicable to real-world pattern recognition problems as demonstrated for the popular SIFT descriptors