
recovers the global nonlinear structure from locally linear fits. In comparison to Isomap [73], which uses a traversal graph, LLE puts the small linear neighborhoods together by finding a set of weights that perform local linear interpolations capable of approximating the data [69]. The advantage of LLE is that, even though it requires solving a $p \times p$ eigenvalue problem just like Isomap does, the matrix for LLE has a higher degree of sparsity (because many weights are zero), whereas the matrix for Isomap has zero values only on the diagonal. PCA [20] and MDS map faraway data points to nearby points in the plane and are not able to discover the underlying structure of the manifold. In Figure 2.5 different embeddings are shown for a dataset consisting of selected digits from MNIST [74] corresponding to the digits 1, 5, and 7. For each digit around 180 images were selected. The results were obtained by using the sklearn.ensemble.RandomTreesEmbedding class implemented in Python [75].
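As an illustration of how such a comparison of embeddings can be reproduced, a minimal sketch using scikit-learn's decomposition and manifold modules is given below. The use of scikit-learn's built-in 8x8 digits dataset as a stand-in for the 64-dimensional digit images and the choice of 16 neighbors are assumptions made for this example; this is not the exact code behind Figure 2.5.

```python
# Minimal sketch: 2-D embeddings of selected 64-dimensional digits (1, 5, 7)
# computed with PCA, MDS, Isomap, and LLE from scikit-learn.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import MDS, Isomap, LocallyLinearEmbedding

# 8x8 digit images (64-dimensional feature vectors); keep only digits 1, 5, 7.
digits = load_digits()
mask = np.isin(digits.target, [1, 5, 7])
X, y = digits.data[mask], digits.target[mask]

embeddings = {
    "PCA":    PCA(n_components=2),
    "MDS":    MDS(n_components=2, random_state=0),
    "Isomap": Isomap(n_neighbors=16, n_components=2),
    "LLE":    LocallyLinearEmbedding(n_neighbors=16, n_components=2),
}

for name, model in embeddings.items():
    X_2d = model.fit_transform(X)   # project the 64-D digits to 2-D
    print(name, X_2d.shape)         # (n_samples, 2) for each method
```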

LLE has been applied in various domains; e.g., it was used on images of lips and of human faces in order to find meaningful attributes such as pose and facial expression [76]. It was also used in vision for the construction of two-dimensional parameterizations of different digit shapes [76]. The LLE algorithm was generalized to use metric distances other than the Euclidean distance and was applied to the MNIST handwritten digit dataset [74]. LLE was also used in domains other than computer vision; for example, it was applied to a sample of 500 galaxy spectra, and it was found that the position of a galaxy within the embedding space is directly correlated with the mean age of the galaxy and its spectral type [77].

2.3 Random Forest

The Random Forest (RF) approach is based on learning decision trees for either classification or regression. In 1984, the book "Classification and Regression Trees" by Breiman et al. [78] described the basics of decision trees and their use for classification and regression. At that time decision trees were used individually; it was later shown that boosting algorithms, which iteratively re-weight the training data, can linearly combine "weak" learners into a "strong" classifier with higher accuracy and better generalization [79]. Following this idea, a random decision forest is an ensemble of randomly trained decision trees [80]. For digit recognition, for example, the trees were trained using random feature selection and different split functions [81]. It was subsequently shown that random decision trees generalize better than boosting algorithms on digit and shape recognition data [82, 83]. Later, Breiman introduced the idea of bringing randomness into the forest by randomly choosing a subset of the original training dataset for each tree [24, 25]. The RF, consisting of a specific number of trees, finally chooses the class with the most votes.

RFs are now widely used because they perform better than or on par with other techniques [84]. They are also known for their role in the success of the Microsoft Kinect for Xbox 360 [85], which is able to accurately predict the 3D positions of body joints from a single depth image, without using temporal information [86, 87].

Figure 2.5: (a) The dataset consists of selected 64-dimensional digits (digits used: 1, 5, and 7) from MNIST [74]. Embeddings obtained with (b) PCA, (c) MDS (d) Isomap (K = 16), and (e) LLE (K = 16) are shown. Note that with PCA the 3 digits are mixed and it is therefore not possible to separate them. Digits 1 and 7 are quite similar to each other and are easily confused with PCA; with MDS they are better separated but still too close to each other. With MDS, the distances in the two-dimensional representation preserve well the distances in the original 64-dimensional space. With Isomap the 3 digits are better separated than with PCA and MDS. With LLE, local clusters are formed. Compared to PCA, MDS, or Isomap, LLE shows a higher degree of separability of the three digits, even for the digits which are easily confused, 1 and 7.



A forest contains a specific number of trees, $\mathit{nrMaxTrees}$: $\{T_1, T_2, \ldots, T_{\mathit{nrMaxTrees}}\}$, which are trained independently. This can be done efficiently in parallel. After training, each new data point $x_{\text{test}}$ is sent through the learned trees. In the end, the prediction for $x_{\text{test}}$, i.e., the class label output, is given by a single forest prediction rather than by the predictions of individual trees. In the following, the process is explained in more detail.
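As a concrete illustration of this train-in-parallel-and-vote procedure, a minimal sketch using scikit-learn's RandomForestClassifier is given below; the number of trees and the use of the built-in digits dataset are illustrative assumptions, not parameters of this work.

```python
# Minimal sketch: train an ensemble of independently grown trees in parallel
# and let the forest (not a single tree) predict the class of a new point.
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,   # number of trees (nrMaxTrees)
    n_jobs=-1,          # trees are independent, so training runs in parallel
    random_state=0,
)
forest.fit(X_train, y_train)

# The forest prediction aggregates the per-tree class distributions.
print(forest.predict(X_test[:1]))        # predicted class label for x_test
print(forest.predict_proba(X_test[:1]))  # averaged class probabilities
```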

2.3.1 Training the Random Forest

The training step is done offline and is meant to optimize the parameters of the split function at each node of a tree. Randomness is used only during the training phase, so that the computed trees differ from each other; this improves generalization. As mentioned earlier, two of the most frequently used ways of inducing randomness in the trees of the RF are: (i) random sampling of the training dataset, e.g., bagging [25], and (ii) node optimization with randomly chosen features [81].
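These two randomization steps can be sketched in a few lines of NumPy; the dataset and the size of the feature subset (here the square root of the number of features) are assumptions chosen only for illustration.

```python
# Minimal sketch of the two sources of randomness, using the 8x8 digits
# dataset as a stand-in for the training data.
import numpy as np
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)
rng = np.random.default_rng(0)
n_samples, n_features = X.shape

# (i) Bagging: a bootstrap sample (drawn with replacement) trains one tree.
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)
X_tree, y_tree = X[bootstrap_idx], y[bootstrap_idx]

# (ii) Node optimization on a random feature subset: only these candidate
# features are considered when optimizing the split at a given node.
n_candidates = int(np.sqrt(n_features))
feature_subset = rng.choice(n_features, size=n_candidates, replace=False)
```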

The split functions are weak learners. At each node $j$ of a tree there is a binary split function:

$h(x, \theta_j) \in \{0, 1\}$, (2.3.1)

where 0 represents false and 1 true. Depending on the outcome of the split function, a data point $x$ is sent to the left or to the right branch of the subtree rooted at node $j$. The split-function parameters $\theta = (\Phi, \Psi, \tau)$ consist of: $\Phi$, the filter function which selects only some features from $x$; $\Psi$, the geometric primitive which is used for separating the data; and $\tau$, the thresholds of the binary tests. The most common weak learner is, for example, the axis-aligned hyperplane [88], but generally oriented hyperplanes are also used [80]. In order to obtain the optimal parameters $\theta_j$ of the split node $j$, the information gain is maximized:

$\theta_j = \operatorname{argmax}_{\theta_j} IG_j$, (2.3.2)

where $IG_j = IG(D_j, D_j^L, D_j^R, \theta_j)$ depends on the training data points before the split, $D_j$, the data points after the split, $D_j^L$ and $D_j^R$, and on the split parameters $\theta_j$. The information gain obtained by splitting the data is given by:

$IG_j = H(D_j) - \sum_{i \in \{L, R\}} \frac{|D_j^i|}{|D_j|} H(D_j^i)$, (2.3.3)

where $i$ indexes the left and right child of node $j$, and $H(D) = -\sum_{c \in C} p(c) \log(p(c))$ is the Shannon entropy.
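For illustration, a small NumPy sketch of the Shannon entropy and of the information gain of Equation (2.3.3) is given below; the function names are chosen only for this example.

```python
# Illustrative implementation of the Shannon entropy H(D) and the
# information gain IG_j of Equation (2.3.3); labels are class indices.
import numpy as np

def entropy(labels):
    """H(D) = -sum_c p(c) log p(c) over the classes present in D."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def information_gain(labels, labels_left, labels_right):
    """IG_j = H(D_j) - sum_{i in {L,R}} |D_j^i| / |D_j| * H(D_j^i)."""
    n = len(labels)
    return entropy(labels) - sum(
        len(part) / n * entropy(part) for part in (labels_left, labels_right)
    )
```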

We consider the example shown in Figure 2.6. For a given training dataset $D$ with corresponding labels (in this case there are 3 classes: red, green, and blue), a subset $D_{11}$ of data points is randomly chosen from $D$. $D_{21}$ denotes the subset of training data points from $D_{11}$ which reaches the left child of the root, and $D_{22}$ the subset which reaches the right child. At each node $j$ the split parameters $\theta_j$ are optimized, i.e., the split which maximizes the information gain $IG_j$ is chosen. The terminal nodes (leaves) contain the corresponding class probability distributions. This can be seen in the example in Figure 2.6 for trees with two levels.
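Building on the entropy and information-gain helpers sketched above, the node optimization of Equation (2.3.2) with an axis-aligned weak learner can be sketched as an exhaustive search over candidate features and thresholds; the helper name and the choice of candidate thresholds are illustrative assumptions, not the exact procedure used in this work.

```python
# Illustrative node optimization: pick the axis-aligned split h(x, theta_j)
# (feature index and threshold tau) that maximizes the information gain IG_j.
import numpy as np

def best_axis_aligned_split(X_node, y_node, feature_subset):
    best_feature, best_tau, best_gain = None, None, -np.inf
    for f in feature_subset:                   # Phi: randomly chosen features
        for tau in np.unique(X_node[:, f]):    # candidate thresholds
            go_left = X_node[:, f] <= tau      # h(x, theta_j): 0 -> left, 1 -> right
            if go_left.all() or (~go_left).all():
                continue                       # skip splits that send all points one way
            gain = information_gain(y_node, y_node[go_left], y_node[~go_left])
            if gain > best_gain:
                best_feature, best_tau, best_gain = f, tau, gain
    return best_feature, best_tau, best_gain   # theta_j maximizing IG_j
```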


Figure 2.6: A RF example where several trees are learned. Each tree contains in its root $D_{11}$, randomly chosen data from the whole dataset $D$. Different colors were chosen for the trees to emphasize that the trees are different, even though the data in each root has the same notation, i.e., $D_{11}$. At each level the data is split using a split function which maximizes the information gain in the respective node. The leaves store the class probability distributions, which are later used for classifying a new data point, as shown in Figure 2.7.
