
2.2 Machine learning

2.2.1 The basics of decision tree regression

A part of machine learning, more specifically supervised learning, deals with finding algorithms that can reproduce some kind of input-output behaviour by analyzing only a small subset of the possible inputs. In regression, the outputs (also called target values) are continuous numerical values that we would like to predict.

[Figure: a decision tree of depth 2. The root (mse = 187464.496, samples = 311823, value = 0.702) splits on jet1_pz <= -0.732. Its left child (mse = 116105.765, samples = 154623, value = -267.842) splits on jet1_eta <= -1.623 into leaves with values -616.059 (mse = 186673.572, samples = 36156) and -161.567 (mse = 46267.094, samples = 118467). Its right child (mse = 116948.438, samples = 157200, value = 264.845) splits on jet1_eta <= 1.614 into leaves with values 158.681 (mse = 46608.784, samples = 120566) and 614.24 (mse = 189272.592, samples = 36634).]

Figure 2.6: A decision tree of max depth 2. It predicts the momentum along the beam axis.

Decision trees are a simple machine learning method that is easy to use out of the box. In contrast to other machine learning techniques, they allow one to interpret and understand the learned rules. Therefore, decision trees and models derived from them are widely used to build comprehensible models.

To understand how decision tree learning for regression works, it makes sense to start at the end and consider how a completely trained decision tree processes the input to make its prediction. To this end, Figure 2.6 shows a small, fully trained decision tree. It is inspired by the decision tree discussed in Section 4.1.

A decision tree consists of finitely many nodes that may (inner nodes) or may not (leaf nodes) split the data. In the example tree in Figure 2.6, the top node (the root of the tree) splits on the pz of jet1.

The picture that one should have in mind when thinking about splits is that every event with jet1_pz ≤ −0.732 progresses to the left child node, while all the other events, those with jet1_pz > −0.732, progress to the right child of the root.

This yields, for every event, a path through the tree that ends in a leaf. That leaf has a "value" attached to it (as has every other node in Figure 2.6). When predicting (reconstructing the momentum for) an event, the prediction of the decision tree is the "value" of the leaf in which the path for that event ends.

Therefore, if the decision tree in Figure 2.6 consisted only of the upper two levels, any event would get the prediction −267.842 if jet1_pz ≤ −0.732 and 264.845 otherwise.
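To make this prediction step concrete, the following minimal sketch fits a depth-2 regression tree with scikit-learn and uses it for prediction. The feature names jet1_pz and jet1_eta and the toy data are only placeholders, not the data set used in this thesis.

```python
# Minimal sketch (not the thesis code): fit a depth-2 regression tree with
# scikit-learn and let it predict.
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)

# Toy stand-in for the real data set: two features (jet1_pz, jet1_eta)
# and a continuous target to be reconstructed.
X = rng.normal(size=(10_000, 2))
y = 500.0 * X[:, 0] + 50.0 * rng.normal(size=10_000)

tree = DecisionTreeRegressor(max_depth=2)  # impurity: mean squared error
tree.fit(X, y)

# Every event follows one path from the root to a leaf; the prediction is the
# "value" (mean target) stored in that leaf.
print(tree.predict([[-1.0, 0.3], [2.0, -0.5]]))

# Text rendering of the learned splits, analogous to Figure 2.6.
print(export_text(tree, feature_names=["jet1_pz", "jet1_eta"]))
```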

An algorithm for building decision trees has to find these splits and determine the value of every leaf.

Unfortunately, the run time of all known algorithms that build the (in many senses) optimal decision tree grows exponentially with the size of the problem. Thus, it is computationally not feasible to build the optimal tree.

That is the reason why decision trees are usually built in an iterative, greedy fashion: At the beginning, all the events start in the root of the tree. The value of a node as seen in Figure 2.6 is the mean of all target values of events in that node. This means that in this case the mean pz is 0.702.

Each node has an impurity, which measures how different the target values of the events in that node are. In this case it is the "mse", the mean squared error. For this tree, the impurity of the root is $\frac{1}{311823}\sum_{i=1}^{311823}\left(p_{z,i} - 0.702\right)^2 = 187464.496$.
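As a small illustration (not taken from the thesis code), the value and the mse impurity of a node can be computed directly from the target values of the events it contains:

```python
import numpy as np

def node_value_and_impurity(targets):
    """Prediction value (mean) and mse impurity of a node holding `targets`."""
    value = targets.mean()
    impurity = np.mean((targets - value) ** 2)  # mean squared deviation from the mean
    return value, impurity

# Applied to the 311823 true pz values that reach the root of Figure 2.6,
# this would return value = 0.702 and mse = 187464.496.
```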

To find a split that divides the events within a node into two children, every possible split on every variable is considered and its impurity decrease

$$\Delta I = \frac{N_\mathrm{node}}{N_\mathrm{total}}\left(I_\mathrm{node} - \frac{N_\mathrm{right}}{N_\mathrm{node}} I_\mathrm{right} - \frac{N_\mathrm{left}}{N_\mathrm{node}} I_\mathrm{left}\right) \qquad (2.1)$$

is calculated. Here, N denotes the number of events and I the impurity. The subscript "node" corresponds to the current node and "left" and "right" to its children.
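The following sketch implements Equation (2.1); plugging in the node statistics of Figure 2.6 reproduces the impurity decrease of the first split (roughly 70934, cf. the feature importance discussion below):

```python
def impurity_decrease(n_total, n_node, i_node, n_left, i_left, n_right, i_right):
    """Impurity decrease of a candidate split, following Equation (2.1)."""
    return (n_node / n_total) * (
        i_node
        - (n_right / n_node) * i_right
        - (n_left / n_node) * i_left
    )

# Root split of the tree in Figure 2.6 (N_total = N_node for the root):
print(impurity_decrease(
    n_total=311_823, n_node=311_823, i_node=187_464.496,
    n_left=154_623, i_left=116_105.765,
    n_right=157_200, i_right=116_948.438,
))  # roughly 70934
```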

Finally, the split with the highest impurity decrease is chosen. This method makes locally optimal decisions even though they might not lead to the globally optimal tree. Note that the impurity in a child node can be higher than the impurity in its parent. This happens when a large portion of the data can be grouped together by a split such that its impurity is small. The small remaining portion of the data may be less homogeneous than the rest, but its increased impurity is multiplied by a small weight and hence has a smaller impact, so the overall impurity still decreases.
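A self-contained (and deliberately naive) sketch of this exhaustive, greedy search over one feature could look as follows; real implementations keep running sums instead of recomputing the impurities from scratch:

```python
import numpy as np

def mse(y):
    """Mean squared error of the targets y around their mean."""
    return float(np.mean((y - y.mean()) ** 2))

def best_split_on_feature(x, y):
    """Try every threshold between consecutive sorted values of the feature x
    and return the one with the largest impurity decrease
    (Equation (2.1) with N_total = N_node)."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    n = len(y)
    parent_impurity = mse(y_sorted)
    best_threshold, best_decrease = None, 0.0
    for i in range(1, n):
        if x_sorted[i] == x_sorted[i - 1]:
            continue  # identical feature values cannot be separated by a threshold
        left, right = y_sorted[:i], y_sorted[i:]
        decrease = (parent_impurity
                    - (len(left) / n) * mse(left)
                    - (len(right) / n) * mse(right))
        if decrease > best_decrease:
            best_decrease = decrease
            best_threshold = 0.5 * (x_sorted[i - 1] + x_sorted[i])
    return best_threshold, best_decrease
```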

The building procedure stops when there are no possible splits left, i.e. a node is a leaf if it is not possible to split its data such that the impurity decreases. It is also possible (and sometimes useful) to pose restrictions on tree growth to avoid overfitting. In that case, a node is a leaf if there is no split left that decreases the impurity while satisfying all posed restrictions. Those restrictions are also called regularization.

Intuitively, a machine learning model is said to overfit if it stops generalizing patterns and starts to learn details about the training data. A perfect example of an overfitting decision tree would be a tree that has one leaf for every training event and therefore predicts the target value of that single event. Hence, it is perfectly accurate on the training data. From an intuitive point of view, one would not say that this tree has "learned" anything in terms of finding patterns; it has just memorized the data.
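On toy data (an assumption for illustration, not the thesis data set), this memorization effect is easy to demonstrate: an unrestricted tree scores perfectly on its training events but noticeably worse on held-out ones.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(5_000, 2))
y = 500.0 * X[:, 0] + 100.0 * rng.normal(size=5_000)  # noisy continuous target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Without restrictions the tree grows one leaf per distinct training event ...
full_tree = DecisionTreeRegressor().fit(X_train, y_train)

# ... so it reproduces the training data perfectly (R^2 = 1.0) but generalizes
# noticeably worse to events it has never seen.
print(full_tree.score(X_train, y_train), full_tree.score(X_test, y_test))
```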

A simple example of such a restriction is fixing the maximal depth (and therefore the number of leaves). The tree shown in Figure 2.6 is built by requiring a maximal depth of 2. The depth is also the maximal number of splits on a path from the root to a leaf.

It is generally a good idea to limit the number of leaves to reduce overfitting. Another way to do this is by requiring that a leaf has to contain at least x% of the training data. Doing so, the prediction value of a leaf is found by considering several data points and tends to be less influenced by details of the training set.

Another possibility to restrict the learning of the decision tree is to limit the maximal number of features considered for a split. This way, the tree might not find the best split and has to use a slightly worse split of the data. Thus, it is less likely that the tree learns details of the training data. Note that (at least in the scikit-learn implementation) features will be inspected until a valid split is found, even if this requires considering more features than the maximal number of features parameter specifies. For example, if the parameter is set to 1 such that only splits on 1 feature should be considered but there is no valid split on that feature, splits on another feature are evaluated and chosen if they are valid splits. A nice benefit of this parameter is that it also reduces computation time because fewer splits are considered.
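In scikit-learn, the restrictions discussed above roughly correspond to the following parameters of DecisionTreeRegressor; the concrete values here are placeholders, not the settings used later in this thesis:

```python
from sklearn.tree import DecisionTreeRegressor

# Placeholder values for illustration, not the settings used in this thesis.
regularized_tree = DecisionTreeRegressor(
    max_depth=2,            # at most 2 splits on any path from the root to a leaf
    min_samples_leaf=0.05,  # a float is read as a fraction: every leaf must hold >= 5% of the events
    max_features=1,         # number of features considered per split; more are
                            # inspected if no valid split is found among them
)
```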

The fraction of the total impurity decrease accumulated while building the tree that is achieved by splits on a specific feature is called the feature importance of that feature.

Doing these calculations for the tree shown in Figure 2.6 yields a decrease in impurity of 70933.90 for pz, achieved in the initial split, and decreases in impurity of 18349.08 and 18701.36 for the left and right eta splits. Therefore, the feature importances for this tree are 65.69% for pz and 34.31% for eta.
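As a quick sanity check (a sketch, not the thesis code), normalizing the per-feature impurity decreases quoted above reproduces these feature importances; scikit-learn exposes the same normalized quantities via the feature_importances_ attribute of a fitted tree.

```python
# Per-feature impurity decreases of the tree in Figure 2.6 (from the text above).
decreases = {
    "jet1_pz": 70_933.90,               # initial split on pz
    "jet1_eta": 18_349.08 + 18_701.36,  # the two second-level splits on eta
}

total = sum(decreases.values())
for feature, decrease in decreases.items():
    print(f"{feature}: {decrease / total:.2%}")
# prints jet1_pz: 65.69% and jet1_eta: 34.31%
```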

Jerome H. Friedman proposed a different split criterion for decision trees which is (usually) used for gradient boosting [18], referred to as "friedman mse". Instead of maximizing the impurity decrease, the split that maximizes the expression $\frac{N_\mathrm{left} N_\mathrm{right}}{N_\mathrm{left} + N_\mathrm{right}}\left(\bar{y}_\mathrm{left} - \bar{y}_\mathrm{right}\right)^2$ is chosen, where $\bar{y}$ denotes the mean target value of the events in the respective child. Impurities and impurity decreases are still calculated and used for the feature importance as shown before, using Equation (2.1).
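A small sketch of this criterion (the quantity that is maximized); in scikit-learn it is selected via criterion="friedman_mse":

```python
import numpy as np

def friedman_improvement(y_left, y_right):
    """Friedman's split criterion: the weighted squared difference between the
    children's mean target values (the split maximizing this is chosen)."""
    n_left, n_right = len(y_left), len(y_right)
    diff = np.mean(y_left) - np.mean(y_right)
    return n_left * n_right / (n_left + n_right) * diff ** 2

# In scikit-learn this criterion is selected with
# DecisionTreeRegressor(criterion="friedman_mse") and is the default criterion
# of the trees grown inside GradientBoostingRegressor.
```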