
2.6 Feature Selection

2.6.5 Embedded Methods

Recursive Feature Elimination  The feature ranking scores obtained by Golub's method can directly be used as weights in a classifier, i.e. the feature selection results induce a linear classifier.

The opposite way would be to train a classifier and use its weight vector as a feature ranking criterion. This is exactly what recursive feature elimination (RFE) [Guyon et al., 2002] does. It is inspired by optimal brain damage (OBD) [LeCun et al., 1990], a method for successively reducing the number of connections in a neural network by setting those connection weights to zero that cause a minimal performance loss. The loss can be approximated by expanding the cost function locally as a second-order Taylor series. For linear SVMs with a quadratic cost function, this corresponds to discarding the features with minimum absolute weight (see Figure 2.10 for the overall algorithm). To reduce runtime, Guyon et al. suggest discarding multiple features within each iteration.
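A minimal sketch of this elimination loop (cf. Figure 2.10 below), assuming a linear soft-margin SVM from scikit-learn, binary labels and one discarded feature per iteration; the function and variable names are illustrative:

```python
import numpy as np
from sklearn.svm import SVC

def rfe_rank(X, y):
    """Recursive feature elimination: repeatedly train a linear SVM and
    discard the surviving feature with the smallest squared weight."""
    surviving = list(range(X.shape[1]))     # s <- (1, ..., d)
    ranks = []                              # r; most relevant features end up first
    while surviving:
        svm = SVC(kernel="linear").fit(X[:, surviving], y)
        w = svm.coef_.ravel()               # weight vector of the linear SVM (binary case)
        f = int(np.argmin(w ** 2))          # ranking criterion c_i = w_i^2
        ranks.insert(0, surviving.pop(f))   # r <- (s_f, r): prepend the eliminated feature
    return ranks
```

Since features eliminated last survive the longest, the head of the returned list holds the putatively most relevant features; discarding several features per iteration, as suggested by Guyon et al., merely replaces the arg min by a partial sort.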

Feature Selection via Mathematical Programming  Alternatively, feature selection may be regarded as an optimisation problem in which both the number of features and the training error are minimised. Consider two classes given by the data matrices A and B, containing m and k samples, respectively.

Input:  Feature vectors x_i and class labels y_i
Output: List of ranks r, weight vector w and bias b

Initialise the list of surviving features s ← (1, . . . , d)
while s ≠ ∅ do
    Reduce the feature vectors to the surviving features, i.e. X = X(:, s)
    Train an SVM on X and y, store the weight vector w
    Compute the ranking criterion for all features, i.e. c_i ← (w_i)²
    Find the feature with minimum rank, i.e. f ← arg min c
    Add this feature to the rank list, i.e. r ← (s_f, r)
    Eliminate the feature with the lowest rank, i.e. s ← (s_1, . . . , s_{f−1}, s_{f+1}, . . .)
end

Figure 2.10: Recursive feature elimination

The optimisation problem [Bradley and Mangasarian, 1998]

\[
\text{minimise}\quad (1-\lambda)\left(\frac{\mathbf{1}^{T}y}{m}+\frac{\mathbf{1}^{T}z}{k}\right)+\lambda\,\mathbf{1}^{T}v(w)
\qquad\text{with}\qquad
v(w)_{i}=\begin{cases}0 & \text{if } w_{i}=0\\ 1 & \text{otherwise}\end{cases}
\]

\[
\text{subject to}\quad
-A^{T}w+\gamma\mathbf{1}+\mathbf{1}\le y,\qquad
B^{T}w-\gamma\mathbf{1}+\mathbf{1}\le z,\qquad
y\ge 0,\quad z\ge 0,\quad -v\le w\le v
\]

allows a trade-off between the training error and the number of non-zero entries in the weight vector by choosing an appropriate λ ∈ [0, 1). Here, w denotes the weight vector, γ is the bias such that any new sample x belongs to class A if x^T w > γ, and v is an auxiliary vector bounding the weights component-wise. The vectors y and z are the class-specific slack variables, i.e. they quantify how far each pattern is away from being correctly classified. The objective function may be linearised, and the resulting problem can then be solved by a successive linearisation algorithm. Thus, a classifier with inherent feature selection is obtained.
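A sketch of this program under two simplifications stated up front: the non-convex counting term 𝟙ᵀv(w) is replaced by its usual 1-norm surrogate Σ_j |w_j|, so that a single linear program suffices instead of the successive linearisation, and SciPy's linprog is used as the solver; the function and parameter names are illustrative:

```python
import numpy as np
from scipy.optimize import linprog

def sparse_linear_classifier(A, B, lam=0.5):
    """1-norm relaxation of the Bradley/Mangasarian feature selection program.
    A, B: d x m and d x k data matrices (one sample per column).
    LP variables: w (d), gamma (1), slacks y (m), z (k), and t (d) with |w| <= t."""
    d, m = A.shape
    _, k = B.shape
    n_var = 2 * d + 1 + m + k

    c = np.zeros(n_var)
    c[d + 1:d + 1 + m] = (1 - lam) / m           # average slack on class A
    c[d + 1 + m:d + 1 + m + k] = (1 - lam) / k   # average slack on class B
    c[d + 1 + m + k:] = lam                      # sparsity surrogate: sum(t) >= sum(|w_j|)

    rows, rhs = [], []
    for j in range(m):                           # -A^T w + gamma + 1 <= y
        r = np.zeros(n_var)
        r[:d], r[d], r[d + 1 + j] = -A[:, j], 1.0, -1.0
        rows.append(r); rhs.append(-1.0)
    for j in range(k):                           # B^T w - gamma + 1 <= z
        r = np.zeros(n_var)
        r[:d], r[d], r[d + 1 + m + j] = B[:, j], -1.0, -1.0
        rows.append(r); rhs.append(-1.0)
    for j in range(d):                           # -t_j <= w_j <= t_j
        r1 = np.zeros(n_var); r1[j], r1[d + 1 + m + k + j] = 1.0, -1.0
        r2 = np.zeros(n_var); r2[j], r2[d + 1 + m + k + j] = -1.0, -1.0
        rows += [r1, r2]; rhs += [0.0, 0.0]

    bounds = [(None, None)] * (d + 1) + [(0, None)] * (m + k + d)
    res = linprog(c, A_ub=np.array(rows), b_ub=np.array(rhs),
                  bounds=bounds, method="highs")
    w, gamma = res.x[:d], res.x[d]
    return w, gamma                              # x belongs to class A if x @ w > gamma
```

Choosing λ closer to 1 yields sparser weight vectors at the price of larger average slacks, mirroring the trade-off described above.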

Feature Scaling Methods for Support Vector Machines  Feature scaling based methods iteratively increase the weights of putatively relevant features and decrease the weights of putatively irrelevant features. In case of convergence, this weight vector (not to be confused with the weight vector of the entire classifier) quantifies the relevance of each feature. Most approaches alternate between solving a support vector machine and ranking the features according to some error criterion [e.g. Jebara and Jaakkola, 2000, Chapelle et al., 2002, Weston et al., 2000].
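The alternation can be illustrated by a multiplicative update of per-feature scale factors derived from a linear SVM; this is a generic sketch of the idea under the assumption of binary labels, not a reimplementation of any of the cited methods:

```python
import numpy as np
from sklearn.svm import SVC

def feature_scaling(X, y, n_iter=10):
    """Alternate between training a linear SVM on rescaled data and
    updating the per-feature scaling from the SVM weights."""
    z = np.ones(X.shape[1])                  # relevance weights, one per feature
    for _ in range(n_iter):
        w = SVC(kernel="linear").fit(X * z, y).coef_.ravel()
        z *= np.abs(w)                       # boost features with large |w|
        z /= z.max()                         # keep the scaling bounded
    return z                                 # large z_j = putatively relevant feature
```

Features whose scale factor decays towards zero are effectively removed from the classifier.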

One-Norm Support Vector Machines  The one-norm support vector machine [Zhu et al., 2004] is an application of the lasso [Tibshirani, 1996] to classification. It minimises

\[
\sum_{i=1}^{n}\Bigl[1-y_i\Bigl(b+\sum_{j=1}^{q}w_j h_j(x_i)\Bigr)\Bigr]_{+}
\qquad\text{subject to}\qquad \lVert w\rVert_{1}\le s
\]

with [k]_+ = max(k, 0), for a set of basis functions {h_1, . . . , h_q}, a weight vector w, a bias b and a tuning parameter s. A more SVM-like notation is obtained by using the original features instead of basis functions, i.e. h_j(x) = x_j, and a dual representation:

\[
\text{minimise}\quad \sum_{i=1}^{n}\bigl[1-y_i\bigl(w^{T}x_i+b\bigr)\bigr]_{+}+\lambda\lVert w\rVert_{1}.
\]
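In practice, the penalised form can be approximated with an ℓ1-regularised linear SVM, for instance scikit-learn's LinearSVC, which uses a squared hinge loss instead of the plain hinge above and whose parameter C acts roughly as the inverse of λ; the toy data below is purely illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Toy data: 100 samples, 50 features, of which only 5 are informative.
X, y = make_classification(n_samples=100, n_features=50, n_informative=5,
                           n_redundant=0, random_state=0)

# L1-penalised linear SVM: the 1-norm penalty on w plays the role of the
# lasso constraint above and drives many weights to exactly zero.
clf = LinearSVC(penalty="l1", loss="squared_hinge", dual=False, C=0.1).fit(X, y)
print("non-zero weights:", int(np.sum(clf.coef_.ravel() != 0)))
```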

Zero-norm Based Methods  All of the above support vector related methods enforce sparse solutions by adding parameters or constraints to the standard SVM optimisation procedure. Thus, sparsity is a side effect rather than the primary aim. In contrast, one may approximate the zero-norm minimising weight vector of a separating hyperplane directly [Weston et al., 2003].

We assume the dataset D to be linearly separable, i.e.

\[
\exists\, w\in\mathbb{R}^{d},\; b\in\mathbb{R} \quad\text{with}\quad y_i\bigl(w^{T}x_i+b\bigr)\ge 0 \;\;\forall i \quad\text{and}\quad w\neq 0, \tag{2.7}
\]

where the normal vector w ∈ ℝ^d and the bias b ∈ ℝ describe the separating hyperplane up to a constant factor. Obviously, if w and b are solutions to the inequalities, then λw and λb solve them as well for any λ ∈ ℝ₊. In general, there is no unique solution to (2.7). A solution with the least number of features

\[
\text{minimises}\quad \lVert w\rVert_{0} \qquad\text{subject to}\qquad y_i\bigl(w^{T}x_i+b\bigr)\ge 0 \;\;\forall i \quad\text{and}\quad w\neq 0 \tag{2.8}
\]

with ‖w‖₀ = card{ w_i | w_i ≠ 0 }. Note that any solution of (2.8) can be multiplied by a positive factor and remains a solution. Weston et al. proposed to solve the above problem with a variant of the support vector machine by

\[
\text{minimising}\quad \lVert w\rVert_{0} \qquad\text{subject to}\qquad y_i\bigl(w^{T}x_i+b\bigr)\ge 1 \;\;\forall i. \tag{2.9}
\]

Indeed, as long as there exists a solution to (2.8) for which y_i(w^T x_i + b) > 0 for all i = 1, . . . , n, solving (2.9) yields a solution to (2.8). Unfortunately, (2.8) as well as (2.9) are NP-hard and cannot be solved in polynomial time.

Input:  Feature vectors x_i and class labels y_i
Output: Weight vector w and bias b

Initialise z = (1, . . . , 1)
repeat
    Minimise ‖w‖₁ such that y_i(w^T(x_i ∗ z) + b) ≥ 1, where ∗ denotes the element-wise product
    Update z = z ∗ w
until convergence

Figure 2.11: Iterative zero-norm approximating algorithm according to [Weston et al., 2003].

Therefore, Weston et al. proposed to approximate (2.9) by

\[
\text{minimising}\quad \sum_{j=1}^{d}\ln\bigl(\varepsilon+\lvert w_j\rvert\bigr) \qquad\text{subject to}\qquad y_i\bigl(w^{T}x_i+b\bigr)\ge 1 \;\;\forall i \tag{2.10}
\]

with 0 < ε ≪ 1. If w and w̃ denote optimisers of (2.9) and (2.10), respectively, then

\[
\lVert \tilde{w}\rVert_{0} \le \lVert w\rVert_{0} + \mathcal{O}\!\left(\frac{1}{\ln\varepsilon}\right), \tag{2.11}
\]

i.e. both solutions coincide as ε → 0. Thus, by minimising (2.10) an approximate solution to (2.9) is found. However, (2.10) is not convex, may have many local minima, and is still hard to solve. Weston et al. proposed an iterative scheme (see Figure 2.11) which finds a local minimum of (2.10) by solving a sequence of linear programs. This modification of the support vector machine effectively reduces the feature space used for classification. However, the number of features may be reduced further by discarding the margin maximisation induced by the constraints y_i(w^T x_i + b) ≥ 1. This is the basic idea of the support feature machine proposed in the next chapter.
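A compact sketch of the scheme in Figure 2.11, assuming labels y_i ∈ {−1, +1}, linearly separable data as in (2.7), and each ℓ1 minimisation step solved as a linear program with SciPy; a fixed iteration budget stands in for the convergence test and all names are illustrative:

```python
import numpy as np
from scipy.optimize import linprog

def l1_step(X, y, z):
    """Minimise ||w||_1 subject to y_i (w^T (x_i * z) + b) >= 1,
    written as an LP over (w, b, t) with -t <= w <= t and objective sum(t)."""
    n, d = X.shape
    Xs = X * z                                            # element-wise rescaling by z
    c = np.concatenate([np.zeros(d + 1), np.ones(d)])     # minimise sum(t)
    A_margin = np.hstack([-y[:, None] * Xs, -y[:, None], np.zeros((n, d))])
    A_abs = np.vstack([np.hstack([ np.eye(d), np.zeros((d, 1)), -np.eye(d)]),
                       np.hstack([-np.eye(d), np.zeros((d, 1)), -np.eye(d)])])
    res = linprog(c,
                  A_ub=np.vstack([A_margin, A_abs]),
                  b_ub=np.concatenate([-np.ones(n), np.zeros(2 * d)]),
                  bounds=[(None, None)] * (d + 1) + [(0, None)] * d,
                  method="highs")
    return res.x[:d], res.x[d]                            # w, b

def zero_norm_approx(X, y, n_iter=20):
    """Iterative scheme of Figure 2.11: z absorbs the weights, so features
    with z_j close to zero are effectively discarded."""
    z = np.ones(X.shape[1])
    b = 0.0
    for _ in range(n_iter):
        w, b = l1_step(X, y, z)
        z = z * w                                         # update z = z * w
    return z, b                                           # decision rule: sign(z @ x + b)
```

After the final update, z equals the product of z with the last weight vector, so the classifier on the original features is simply sign(z·x + b).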

2.7 Conclusions

Statistical learning theory and support vector based methods are well-established research fields, but their paradigms may fail in high-dimensional small-sample-size scenarios. Such datasets are prone to the empty space phenomenon, distance concentration, hubness and incidental separability. Further, support vector classification may produce completely unintuitive leave-one-out cross-validation errors. Therefore, irrelevant features need to be excluded from the training data, or, if no prior information about relevance is available, feature selection methods should be used for preprocessing. Here, multidimensional embedded methods are most promising as feature selection and classification are directly linked. Finally, there is some evidence that zero-norm based approaches are well suited for feature selection in high-dimensional spaces, as the phenomenon of distance concentration becomes less prominent.

organs of beasts and fowls. He liked thick giblet soup, nutty gizzards, a stuffed roast heart, liverslices fried with crustcrumbs, fried hencods' roes. Most of all he liked grilled mutton kidneys which gave to his palate a fine tang of faintly scented urine.

«Ulysses», James Joyce