
1.2 Feature Selection for Data Mining

1.2.3 Categories of Feature Selection Algorithms

Feature selection algorithms can be classified into various categories from different perspectives. Below, we show five different ways of categorizing feature selection algorithms.

1.2.3.1 Degrees of Supervision

In the process of feature selection, the training data can be either labeled, unlabeled, or partially labeled, leading to the development of supervised, unsupervised, and semi-supervised feature selection algorithms. In the evaluation process, a supervised feature selection algorithm [158, 192] determines feature relevance by evaluating features' correlation with the class or their utility for creating accurate models. Without labels, an unsupervised feature selection algorithm may exploit feature variance or the data distribution to evaluate feature relevance [47, 74]. A semi-supervised feature selection algorithm [221, 197] can use both labeled and unlabeled data; the idea is to use a small amount of labeled data as additional information to improve the performance of unsupervised feature selection.

[Figure: the feature selection process — phase I (feature selection): feature subset generation, evaluation, and a stop criterion yielding the best subset, using the training and validation data; phase II (model fitting/performance evaluation): the learning model is trained on the selected features and tested on the test data.]
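As a rough illustration of the difference in supervision, the following sketch (Python with numpy, synthetic data, and hypothetical helper names) scores features with a supervised criterion, absolute correlation with the class label, and with an unsupervised one, feature variance. Neither criterion is prescribed by the text; they are simply common examples of each kind.

```python
import numpy as np

def supervised_relevance(X, y):
    """Score each feature by its absolute Pearson correlation with the class label."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    num = Xc.T @ yc
    den = np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12
    return np.abs(num / den)

def unsupervised_relevance(X):
    """Score each feature by its variance (no labels needed)."""
    return X.var(axis=0)

# Toy data: 100 samples, 5 features; feature 0 tracks the label, the rest are noise.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=100)
X = rng.normal(size=(100, 5))
X[:, 0] += 2 * y  # make feature 0 class-dependent

print(supervised_relevance(X, y))   # feature 0 receives the highest score
print(unsupervised_relevance(X))    # ranks by spread alone, ignoring the labels
```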

1.2.3.2 Relevance Evaluation Strategies

Different strategies have been used in feature selection to design the feature evaluation criteria r(·) in Equation (1.1). These strategies broadly fall into three categories: the filter, the wrapper, and the embedded models.

To evaluate the utility of features in the evaluation step, feature selection algorithms with a filter model [80, 147, 37, 158, 74, 112, 98, 222, 161] rely on analyzing the general characteristics of features, for example, the features' correlations to the class variable. In this case, features are evaluated without involving any learning algorithm. The evaluation criteria r(·) used in the algorithms of a filter model usually assume that features are independent. Therefore, they evaluate features independently, $r(\hat{X}) = r(f_{i_1}) + \ldots + r(f_{i_k})$. Based on this assumption, the problem specified in Equation (1.1) can be solved by simply picking the top k features with the largest r(f) values. Some feature selection algorithms with a filter model also consider low-order feature interactions [70, 40, 212]. In this case, heuristic search strategies, such as greedy search, best-first search, and genetic-algorithmic search, can be used in a backward elimination or a forward selection process to obtain a suboptimal solution.
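To make the independence assumption concrete, here is a minimal sketch of filter-style selection: each feature is scored on its own (mutual information with the class is used purely as an example criterion), and the k largest scores are kept, which is exactly how Equation (1.1) simplifies under the additive form of r(·). The data set and parameter values are synthetic and only illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

# Toy data with a few informative features among many irrelevant ones.
X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=3, n_redundant=0, random_state=0)

# Filter criterion r(f): mutual information between each feature and the class.
scores = mutual_info_classif(X, y, random_state=0)

# Under the independence assumption, r(X_hat) = r(f_i1) + ... + r(f_ik),
# so Equation (1.1) is solved by keeping the k features with the largest r(f).
k = 3
selected = np.argsort(scores)[::-1][:k]
print("selected feature indices:", selected)
```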

Feature selection algorithms with a wrapper model [80, 91, 92, 93, 111, 183, 110] require a predetermined learning algorithm and use its performance on the selected features as r(·) to estimate feature relevance. Since the predetermined learning algorithm is used as a black box for evaluating features, the behavior of the corresponding feature evaluation function r(·) is usually highly nonlinear. In this case, obtaining a globally optimal solution is infeasible for high-dimensional data. To address this problem, heuristic search strategies, such as greedy search and genetic-algorithmic search, can be used to identify a feature subset.
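A minimal sketch of the wrapper idea, assuming scikit-learn and a k-nearest-neighbor classifier as an arbitrarily chosen predetermined learner: cross-validated accuracy on a candidate subset plays the role of r(·), and a greedy forward search keeps adding features while the score improves.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, n_redundant=0, random_state=0)

learner = KNeighborsClassifier()  # the predetermined learning algorithm (a black box)

def r(subset):
    """Wrapper criterion: cross-validated accuracy of the learner on the subset."""
    return cross_val_score(learner, X[:, subset], y, cv=5).mean()

# Greedy forward selection: repeatedly add the feature that most improves r(.).
selected, remaining, best_score = [], list(range(X.shape[1])), 0.0
while remaining:
    candidate_scores = {f: r(selected + [f]) for f in remaining}
    f_best = max(candidate_scores, key=candidate_scores.get)
    if candidate_scores[f_best] <= best_score:
        break  # stop when no remaining feature improves the score
    selected.append(f_best)
    remaining.remove(f_best)
    best_score = candidate_scores[f_best]

print("selected features:", selected, "cv accuracy:", round(best_score, 3))
```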

Feature selection algorithms with an embedded model, e.g., C4.5 [141], LARS [48], the 1-norm support vector machine [229], and sparse logistic regression [26], also require a predetermined learning algorithm. But unlike an algorithm with the wrapper model, they incorporate feature selection as a part of the training process by attaching a regularization term to the original objective function of the learning algorithm. In the training process, feature relevance is evaluated by analyzing the features' utility for optimizing the adjusted objective function, which forms r(·) for feature evaluation. In recent years, the embedded model has gained increasing interest in feature selection research due to its superior performance. Currently, most embedded feature selection algorithms are designed by applying an L0-norm [192, 79] or an L1-norm [115, 229, 227] constraint to an existing learning model, such as the support vector machine, logistic regression, or principal component analysis, to achieve a sparse solution. When the constraint is derived from the L1 norm and the original problem is convex, r(·) (the adjusted objective function) is also convex and a global optimal solution exists. In this case, various existing convex optimization techniques can be applied to obtain a global optimal solution efficiently [115].
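As a concrete, if simplified, example of the embedded model, the sketch below fits an L1-regularized logistic regression with scikit-learn: the penalty attached to the training objective drives most coefficients to exactly zero, and the features with nonzero coefficients form the selected subset. The regularization strength C is an arbitrary illustrative value.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=3, n_redundant=0, random_state=0)

# L1 penalty attached to the logistic regression objective: feature selection
# happens as part of training, because the penalty zeroes out most coefficients.
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
model.fit(X, y)

selected = np.flatnonzero(model.coef_[0])  # features with nonzero weights are kept
print("selected feature indices:", selected)
```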

Compared with the wrapper and the embedded models, feature selection algorithms with the filter model are independent of any learning model and therefore are not biased toward a specific learner. This is one advantage of the filter model. Feature selection algorithms of the filter model are usually very fast, and their structures are often simple; they are easy to design and, once implemented, can be easily understood by other researchers. This explains why most existing feature selection algorithms are of the filter model. On the other hand, researchers also recognize that feature selection algorithms of the wrapper and embedded models can select features that result in higher learning performance for the predetermined learning algorithm. Compared with the wrapper model, feature selection algorithms of the embedded model are usually more efficient, since they look into the structure of the predetermined learning algorithm and use its properties to guide feature evaluation and feature subset searching.

1.2.3.3 Output Formats

Feature selection algorithms with filter and embedded models may return either a subset of selected features or the weights (measuring the feature relevance) of all features. According to the type of output, feature selection algorithms can therefore be divided into feature weighting algorithms and subset selection algorithms. Feature selection algorithms of the wrapper model usually return feature subsets and are therefore subset selection algorithms.
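A tiny sketch of the two output formats, with made-up weight values: a feature weighting algorithm returns a weight per feature, which can then be truncated into a subset, whereas a subset selection algorithm returns the index set directly.

```python
import numpy as np

# Output of a feature weighting algorithm: one relevance weight per feature.
weights = np.array([0.02, 0.71, 0.05, 0.64, 0.01])

# The weights can be converted into a subset by keeping the k largest ones;
# a subset selection algorithm would return such an index set directly.
k = 2
subset = np.argsort(weights)[::-1][:k]
print("subset derived from weights:", subset)
```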

1.2.3.4 Number of Data Sources

To the best of the authors' knowledge, most existing feature selection algorithms are designed to handle learning tasks with only one data source, and are therefore single-source feature selection algorithms. In many real data mining applications, however, we may have multiple data sources for the same set of features and samples. These sources depict the characteristics of features and samples from multiple perspectives. Multi-source feature selection [223] studies how to integrate multiple information sources in feature selection to improve the reliability of relevance estimation. Figure 1.8 demonstrates how multi-source feature selection works. Recent studies show that using multiple data and knowledge sources in feature selection can effectively enrich our information and enhance the reliability of relevance estimation [118, 225, 226]. Different information sources about features and samples may have very different representations; one of the key challenges in multi-source feature selection is how to effectively handle the heterogeneous representations of multiple information sources.

[Figure 1.8: multi-source feature selection — the target data (instances by features) together with Information of Features (1) through (p) and Information of Samples.]
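The following toy sketch is not the method of [223]; it only illustrates, under heavy simplification, why heterogeneous sources need to be brought onto a common scale before their relevance evidence about the same features can be combined. All scores are made up.

```python
import numpy as np

def normalize(scores):
    """Rescale a score vector to [0, 1] so heterogeneous sources become comparable."""
    scores = np.asarray(scores, dtype=float)
    span = scores.max() - scores.min()
    return (scores - scores.min()) / span if span > 0 else np.zeros_like(scores)

# Relevance estimated from the target data (e.g., correlation with the class) ...
data_scores = np.array([0.10, 0.85, 0.20, 0.75])
# ... and a second source describing the same features (e.g., prior knowledge scores).
knowledge_scores = np.array([3.0, 9.0, 8.0, 1.0])

# One simplistic integration: average the normalized scores from both sources.
combined = 0.5 * normalize(data_scores) + 0.5 * normalize(knowledge_scores)
print("combined relevance:", np.round(combined, 2))
```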