
For each of the machine learning techniques mentioned below, the implementation from scikit-learn [27] is used.

In each case, two similar data sets were generated. One of them is used for training and model selection; it is split into a training set containing 80% of the events and a validation set containing the remaining 20%. The other one serves as a test set for evaluating the performance.
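As a minimal sketch of this split, assuming the events of the two generated data sets are available as feature matrices and target vectors (the arrays below are only placeholders), the 80/20 division can be done directly with scikit-learn:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays standing in for the generated events
# (input variables and regression target, e.g. pz of the top quark).
X, y = np.random.normal(size=(10_000, 20)), np.random.normal(size=10_000)
X_test, y_test = np.random.normal(size=(10_000, 20)), np.random.normal(size=10_000)

# 80% of the first data set for training, 20% for validation / model selection;
# the second, independently generated data set is kept as the test set.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
```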

Decision tree regressors are used to gain a few first insights. Eventually, gradient boosted decision trees are evaluated as the regression model for this problem, because they are a powerful and still comprehensible machine learning method. Since this method is based on decision trees, the analysis developed for single decision trees can be adapted to understand how the trained model works.
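Continuing the sketch above, the two model types correspond to the following scikit-learn estimators; the parameter values here are only illustrative, the actual values are determined by the grid search described in Section 3.2.1:

```python
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import GradientBoostingRegressor

# A single tree for the first exploratory look ...
tree = DecisionTreeRegressor(max_depth=12, min_samples_leaf=0.001).fit(X_train, y_train)

# ... and the boosted ensemble evaluated as the final regression model.
gbr = GradientBoostingRegressor(
    n_estimators=2000, max_depth=4, learning_rate=0.1
).fit(X_train, y_train)
```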

All methods are based on decision trees; therefore, it is not necessary to scale the data, since the splitting rules of decision trees are invariant under such transformations.

3.2.1 Hyperparameter optimization

Each of the regression models mentioned above depends on parameters that are set before training. In order to find the best parameter combination, an exhaustive grid search was performed, evaluating all possible combinations of multiple values for each parameter.

The searched parameter space is spanned by the values shown in Table 3.3.

Every possible combination was trained on the training set and the performance was evaluated on the validation set using different metrics (R², mean absolute error, max error and mean squared error).
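All of these validation metrics are available in scikit-learn; a sketch for the single tree fitted above:

```python
from sklearn.metrics import r2_score, mean_absolute_error, max_error, mean_squared_error

y_pred = tree.predict(X_val)
scores = {
    "R2":        r2_score(y_val, y_pred),
    "MAE":       mean_absolute_error(y_val, y_pred),
    "max error": max_error(y_val, y_pred),
    "MSE":       mean_squared_error(y_val, y_pred),
}
```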

Method                       Parameter                    Possible values
Decision tree                max depth                    10-20
                             split criterion              mse, friedman mse
                             minimum samples per leaf     0.01%, 0.1%
                             maximal number of features   1, 25%, 50%, 75%, all
GradientBoosting regressor   number of trees              2000
                             max depth                    4-8
                             maximal number of features   1, 25%, 50%, 75%, all
                             learning rate                1, 0.1, 0.01, 0.001
                             minimum samples per leaf     any, 0.1%, 0.01%

Table 3.3: Parameter ranges used for hyperparameter optimization. Percentages in maximal number of features represent the fraction of input variables considered for a split. For minimum samples per leaf, percentages represent the minimum fraction of events that must be contained in a leaf for it to be built.
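One possible way to encode the decision-tree rows of Table 3.3 and run the exhaustive search is sketched below. The mapping of the percentages and of "all" onto scikit-learn arguments is an assumption (fractions as floats, "all" as None), and recent scikit-learn releases call the mse criterion "squared_error":

```python
from sklearn.model_selection import ParameterGrid
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

grid = ParameterGrid({
    "max_depth": list(range(10, 21)),
    "criterion": ["squared_error", "friedman_mse"],  # "mse" in older scikit-learn versions
    "min_samples_leaf": [0.0001, 0.001],             # 0.01% and 0.1% of the training events
    "max_features": [1, 0.25, 0.5, 0.75, None],      # 1 feature, fractions, or all features
})

best_score, best_params = -float("inf"), None
for params in grid:
    model = DecisionTreeRegressor(**params).fit(X_train, y_train)
    score = r2_score(y_val, model.predict(X_val))
    if score > best_score:
        best_score, best_params = score, params
```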

3.2.2 Effect of the correct jet permutation

Since generator information was used to sort the jets into columns, it is interesting to investigate how this affected the performance.

This is only analyzed for the NanoAOD data set because it is the only one where the correct assignment is known. This makes it possible to create data sets in which the fraction of incorrectly assigned jets is p ∈ [0, 0.5]. As decision trees do not have any initial knowledge of what the columns represent, every fraction above 0.5 corresponds to a fraction below 0.5 after swapping the columns, which has no effect on the training.
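A hypothetical helper producing such partially mis-assigned data sets could look as follows; the name and the array layout are assumptions for illustration:

```python
import numpy as np

def mix_jet_assignment(jets_a, jets_b, p, seed=0):
    """Return copies of the two jet columns in which a random fraction p of events is swapped.

    jets_a, jets_b: arrays of shape (n_events, n_features) holding the two jet columns.
    """
    rng = np.random.default_rng(seed)
    swap = rng.random(len(jets_a)) < p               # mark roughly a fraction p of the events
    a, b = jets_a.copy(), jets_b.copy()
    a[swap], b[swap] = jets_b[swap], jets_a[swap]    # exchange the jet assignment for them
    return a, b
```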

Here, an exhaustive grid search over a smaller parameter space was performed for p from 0% to 50% in steps of 5%. The parameter ranges were inspired by the parameter combinations that seemed to yield good results for the normal NanoAOD sorted data set and are shown in Table 3.4.

Parameter                    Possible values
number of trees              2000
max depth                    5-8
maximal number of features   25%, 50%, 75%, all
learning rate                0.1, 0.01
minimum samples per leaf     1%, 0.1%, 0.01%

Table 3.4: Parameter ranges used for hyperparameter optimization when evaluating the effect of the correct jet permutation. Percentages in maximal number of features represent the fraction of input variables considered for a split. For minimum samples per leaf, percentages represent the minimum fraction of events that must be contained in a leaf for it to be built.
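The scan over p could then be sketched as follows, reusing the helper above. The column layout (first five columns for jet 1, next five for jet 2) and the fixed model configuration are assumptions; in the actual study the full grid of Table 3.4 is searched for every p:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

JET1, JET2 = slice(0, 5), slice(5, 10)  # hypothetical column layout of the two jets

for p in np.arange(0.0, 0.55, 0.05):
    X_mixed = X_train.copy()
    a, b = mix_jet_assignment(X_train[:, JET1], X_train[:, JET2], p)
    X_mixed[:, JET1], X_mixed[:, JET2] = a, b

    model = GradientBoostingRegressor(
        n_estimators=2000, max_depth=5, learning_rate=0.1, min_samples_leaf=0.001
    ).fit(X_mixed, y_train)
    # ... evaluate on the (equally mixed) validation set as before
```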

4 Results

After performing the searches described above, the regressor with the highest coefficient of determination (R²) was selected for each data set and feature, respectively. Their predictions on the test set and properties like the feature importances are the subject of this chapter.

4.1 A single decision tree

Since gradient boosting uses a collection of decision trees, it is helpful to analyze the performance and properties of a single decision tree first.

The grid search over the parameter space listed in Table 3.3 for the decision tree predicting the momentum along the beam axis (pz) in the NanoAOD sorted data set yields the following best parameter combination (a scikit-learn sketch of this configuration follows the list):

• maximal depth: 12

• split criterion: mse

• minimum samples per leaf: 0.1 %

• maximal number of features: all
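Translated into scikit-learn arguments, this combination would be configured roughly as below, assuming a recent scikit-learn version in which the mse criterion is called "squared_error":

```python
from sklearn.tree import DecisionTreeRegressor

best_tree = DecisionTreeRegressor(
    max_depth=12,
    criterion="squared_error",   # "mse" in older scikit-learn versions
    min_samples_leaf=0.001,      # 0.1% of the training events
    max_features=None,           # consider all input variables at every split
).fit(X_train, y_train)
```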

The usual way to present the predicted and true values is a two dimensional histogram of points of the form (truth, prediction); it was already used in Figure 3.1.
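Such a histogram can be produced directly with matplotlib; a sketch, assuming the predictions of the fitted tree on the independent test set:

```python
import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm

y_pred = best_tree.predict(X_test)

plt.hist2d(y_test, y_pred, bins=100, norm=LogNorm())   # (truth, prediction) pairs
lim = [y_test.min(), y_test.max()]
plt.plot(lim, lim, color="orange", label="(x, x) ideal distribution")
plt.xlabel("simulated pz")
plt.ylabel("reconstructed pz")
plt.colorbar()
plt.legend()
plt.show()
```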


Figure 4.1: Two dimensional histogram of points (simulated pz,top, predicted pz,top) for pz in the NanoAOD sorted data set. If everything were predicted perfectly, the histogram would have the shape of the orange line.

Figure 4.1 shows the prediction of the decision tree. Since each decision path of a tree ends in a leaf node containing the prediction, only a finite number of distinct predictions is possible. This behaviour produces the discrete horizontal stripes in the figure. Ensemble learning will help to smooth out the jumps in the predictions.

It is also possible to look at the decision tree structure. Since this particular tree has a depth of up to 12 and 720 leaves that would have to be placed next to each other, it cannot be pictured completely here, but the first few levels can be visualized.
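scikit-learn can draw such a truncated view directly; a sketch for the tree fitted above:

```python
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# Draw only the first two levels; the deeper levels and leaves are truncated.
plt.figure(figsize=(14, 6))
plot_tree(best_tree, max_depth=2, filled=True)
plt.show()
```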

Figure 4.2: Diagram of the first two levels of the decision tree found by the exhaustive grid search having the highest R².

In Figure 4.2 we can see that early splits based on pz and η of particles that originate from the top decay yield the best partition of the data.

It is also possible to determine which variables yield the best splits in the training process by looking at the feature importances (the fraction of decrease in impurity achieved by splitting on this feature).
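In scikit-learn these values are exposed on the fitted estimator; a sketch, using placeholder names for the input variables:

```python
import numpy as np

feature_names = [f"feature {i}" for i in range(best_tree.n_features_in_)]  # placeholder names

importances = best_tree.feature_importances_
for idx in np.argsort(importances)[::-1][:4]:
    print(f"{feature_names[idx]}: {importances[idx]:.1%}")
```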

For the whole tree we obtain the following feature importances:

• jet1 pz: 81.9%

• lept1 eta: 10.7%

• lept1 pz: 3.6%

• jet1 eta: 2.9%

All the other input variables have an importance of less than 1% each.

This is (at least qualitatively) what one would expect from a physical point of view, since these are the variables describing the movement along the z-axis. So far, the machine learning procedure agrees well with physical intuition, but it can also reconstruct properties of particles that cannot be measured.