
2.2 Machine learning

2.2.1 The basics of decision tree regression

A part of machine learning, more specifically supervised learning, deals with finding algorithms that can reproduce some kind of input-output behaviour by analyzing a small subset of the possible inputs. In regression, the outputs (also called target values) are continuous numerical values that we would like to predict.

An algorithm for building decision trees has to find these splits and determine the value in every leaf.

Unfortunately, the run time of all known algorithms for building an (in many senses) optimal decision tree grows exponentially. Thus, it is computationally not feasible to build the optimal tree.

That is the reason why decision trees are usually built in an iterative, greedy fashion: At the beginning, all the events start in the root of the tree. The value of a node as seen in Figure 2.6 is the mean of all target values of events in that node. This means that in this case the mean pz is 0.702.

Each node has an impurity, which measures how different the target values of the events in that node are. In this case it is "mse", the mean squared error. For this tree, the impurity of the root is $\frac{1}{311823} \sum_{i=1}^{311823} (p_{z,i} - 0.702)^2 = 187464.496$.
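As a minimal illustration of these two quantities, the following Python sketch computes the value and the "mse" impurity of a node. It uses randomly generated placeholder values instead of the actual pz data behind Figure 2.6, so the numbers will not match those quoted above.

    import numpy as np

    # Placeholder target values standing in for the pz values of the events in a node.
    y = np.random.default_rng(0).normal(loc=0.7, scale=430.0, size=311823)

    node_value = y.mean()                           # prediction of the node: mean target value
    node_impurity = np.mean((y - node_value) ** 2)  # "mse" impurity: mean squared deviation

    print(node_value, node_impurity)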

To find a split that divides the events within a node into two children, every possible split on every variable is considered and its impurity decrease

$\Delta I = \frac{N_\text{node}}{N_\text{total}} \left( I_\text{node} - \frac{N_\text{right}}{N_\text{node}} I_\text{right} - \frac{N_\text{left}}{N_\text{node}} I_\text{left} \right)$    (2.1)

is calculated. Here, N denotes the number of events and I the impurity. The subscript "node" corresponds to the current node and "left" and "right" to its children.
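A direct translation of Equation (2.1) into Python could look as follows (a sketch; the function and argument names are chosen here purely for illustration):

    def impurity_decrease(n_node, n_total, i_node, n_left, i_left, n_right, i_right):
        """Weighted impurity decrease of a candidate split, following Equation (2.1)."""
        return (n_node / n_total) * (
            i_node
            - (n_right / n_node) * i_right
            - (n_left / n_node) * i_left
        )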

Finally, the split with the highest impurity decrease is chosen. This method makes locally optimal decisions even though they might not lead to the globally optimal tree. Note that the impurity in a child node can be higher than the impurity in its parent. This happens when a large portion of the data can be grouped together by a split such that its impurity is small. The small remaining portion of the data may not be as homogeneous, but its increased impurity is multiplied by a small weight and hence has a smaller impact, so the overall impurity still decreases.
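The exhaustive, greedy search over all possible splits on a single feature could be sketched as below, reusing the impurity_decrease function from the previous sketch. Here x and y are assumed to be NumPy arrays holding one feature and the target values of the events in the current node.

    import numpy as np

    def mse(y):
        """'mse' impurity of a set of target values."""
        return float(np.mean((y - y.mean()) ** 2))

    def best_split(x, y, n_total):
        """Try every threshold on one feature and keep the split with the
        largest impurity decrease (a locally optimal, greedy choice)."""
        i_node, n_node = mse(y), len(y)
        best_threshold, best_delta = None, -np.inf
        for threshold in np.unique(x)[:-1]:   # the largest value would leave the right child empty
            left, right = y[x <= threshold], y[x > threshold]
            delta = impurity_decrease(n_node, n_total, i_node,
                                      len(left), mse(left), len(right), mse(right))
            if delta > best_delta:
                best_threshold, best_delta = threshold, delta
        return best_threshold, best_delta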

The building procedure stops when there are no possible splits left, i.e. a node is a leaf if it is not possible to split the data such that the impurity decreases. It is also possible (and sometimes useful) to pose restrictions on tree growth to avoid overfitting. Similarly, a node is a leaf if there is no split left that decreases the impurity while satisfying all posed restrictions. Those restrictions are also called regularization.


Intuitively, a machine learning model is said to overfit if it stops generalizing patterns and starts to learn details about the training data. A perfect example of an overfitting decision tree would be a tree that has one leaf for every training event and therefore predicts the value of that single event. Hence, it is perfectly accurate for the training data. From an intuitive point of view, one would not say that this tree has "learned" anything in terms of finding patterns. It just memorized the data.

A simple example of such a restriction is fixing the maximal depth (and therefore the number of leaves). The tree shown in Figure 2.6 is built by requiring a maximal depth of 2. The depth is also the maximal number of splits on a path from the root to a leaf.
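With scikit-learn, such a restriction can be imposed via the max_depth parameter. The following sketch uses randomly generated placeholder data instead of the actual (pz, eta) features, so it only illustrates the mechanism:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    # Placeholder data standing in for the two features and the pz target values.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 2))
    y = 3.0 * X[:, 0] + rng.normal(size=1000)

    # At most two splits on any root-to-leaf path, i.e. at most four leaves.
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X, y)
    print(tree.get_depth(), tree.get_n_leaves())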

It is generally a good idea to limit the number of leaves to reduce overfitting.

Another way to do this is to require that each leaf contains at least x% of the training data. Doing so, the prediction value of a leaf is computed from several data points and tends to be less influenced by details of the training set.
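In scikit-learn this restriction corresponds to passing a fraction to min_samples_leaf (a sketch; the 5% used here is an arbitrary example value):

    from sklearn.tree import DecisionTreeRegressor

    # A float for min_samples_leaf is interpreted as a fraction of the training set,
    # so here every leaf must contain at least 5% of the training events.
    tree = DecisionTreeRegressor(min_samples_leaf=0.05)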

Another possibility to restrict the learning of the decision tree is to limit the maximal number of features considered for a split. This way, the tree might not find the best split and has to use a slightly worse split of the data. Thus, it is less likely that the tree learns details of the training data. Note that (at least in the scikit-learn implementation) features will be inspected until a valid split is found, even if this requires considering more features than the parameter specifies. For example, if the parameter is set to 1 such that only splits on one feature should be considered but there is no valid split on that feature, splits on another feature are evaluated and chosen if they are valid. A nice benefit of this parameter is that it also reduces computation time because fewer splits are considered.
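In the scikit-learn implementation this restriction is controlled by the max_features parameter, for example:

    from sklearn.tree import DecisionTreeRegressor

    # Consider only one randomly chosen feature per split; if no valid split is found
    # on it, further features are inspected as described in the text above.
    tree = DecisionTreeRegressor(max_features=1, random_state=0)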

The fraction of the total impurity decrease accumulated while building the tree that is achieved by splits on a specific feature is called the feature importance of that feature.

Doing these calculations for the tree shown in Figure 2.6 yields a decrease in impurity of 70933.90 for pz, achieved in the initial split, and decreases in impurity of 18349.08 and 8701.36 for the left and right eta splits. Therefore, the feature importances for this tree are 65.69% for pz and 34.31% for eta.
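With scikit-learn, the feature importances of a fitted tree are available directly via the feature_importances_ attribute. The following sketch again uses placeholder data, so the printed numbers will not match the values quoted above:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(1)
    X = rng.normal(size=(1000, 2))                   # hypothetical "pz" and "eta" columns
    y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=1000)

    tree = DecisionTreeRegressor(max_depth=2).fit(X, y)

    # Normalized fractions of the total weighted impurity decrease per feature; they sum to 1.
    print(dict(zip(["pz", "eta"], tree.feature_importances_)))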

Jerome H. Friedman proposed a different split criterion for decision trees, which is (usually) used for gradient boosting [18] and referred to as "friedman mse". Instead of maximizing the impurity decrease, the split that maximizes the expression $\frac{N_\text{left} N_\text{right}}{N_\text{left} + N_\text{right}} (\bar{y}_\text{left} - \bar{y}_\text{right})^2$ is chosen. Impurities and impurity decreases are calculated and used for feature importance as shown before using Equation (2.1).
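Written out as a small Python helper (a sketch with illustrative argument names), the quantity that is maximized reads:

    def friedman_mse_score(n_left, n_right, mean_left, mean_right):
        """Friedman's split criterion: weighted squared difference of the
        mean target values of the two children."""
        return n_left * n_right / (n_left + n_right) * (mean_left - mean_right) ** 2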