
2.2 Machine learning

2.2.3 Training, evaluation and model selection

The content of this section is covered in the scikit-learn reference on model selection [20].

Consider the overfitted decision tree from above. It is easy to see that such a tree has not learned anything general. When asked to predict values for data it has not seen before, its performance would probably be much worse.

That is the reason why one avoids evaluating a machine learning model on the data that was used for training. Usually, one set, the training set, is used for training, and another one, usually called the test set, is used for evaluating the performance. This way, every model has to show how good it really is on new, unseen data.

Often, the training set is divided again into a "real" training set used for training and a validation set for model selection. The validation set plays a similar role as the test set and is therefore also not used in training. If there are several candidate models (for example because there are several choices for parameters like the ones that restrict the building process of a decision tree) that are trained on the training set, one selects the one that performs best on the validation set.
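The split described above can be sketched with scikit-learn. The data here is a toy placeholder; the split proportions (60/20/20) are an illustrative choice, not the ones used in this work.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples, 4 features (values are arbitrary placeholders).
X = np.arange(400).reshape(100, 4)
y = np.arange(100)

# First split off the test set, then carve a validation set out of the rest.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

# 60% training, 20% validation, 20% test.
print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```

The test set is set aside first so that neither training nor model selection ever sees it.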

Using the test set both for model selection and for performance evaluation would introduce a bias, because the selection already prefers models that happen to perform well on the test set.

There are a variety of metrics to evaluate the performance of a regression model.

The mean absolute error (MAE) is the mean of the absolute deviations between the predictions $\hat{y}_i$ and the correct values $y_i$: $\frac{1}{n}\sum_{i=1}^{n}|y_i-\hat{y}_i|$. It represents the typical error of the prediction.

The mean squared error (MSE) is the mean of the squared deviations: $\frac{1}{n}\sum_{i=1}^{n}(y_i-\hat{y}_i)^2$. Consider two predictions with a mean absolute error of 1:

In the first one, every absolute error is exactly 1; in the second one, 50% of the data was predicted perfectly (no error) and the other half has an absolute error of 2 each. Even though the mean absolute errors are equal, the mean squared errors differ: it is 1 for the first prediction and 2 for the second one.

The mean squared error is, when compared to the mean absolute error, a measure for how much the errors tend to fluctuate. If all absolute errors are identical, the mean squared error is just the square of the mean absolute error. Otherwise, it will be larger than the square of the mean absolute error.
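The two-prediction example above can be verified numerically; the zero target values are an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.zeros(4)
pred_a = np.array([1.0, 1.0, 1.0, 1.0])   # every absolute error is 1
pred_b = np.array([0.0, 0.0, 2.0, 2.0])   # half perfect, half off by 2

# Both predictions have the same MAE ...
assert mean_absolute_error(y_true, pred_a) == 1.0
assert mean_absolute_error(y_true, pred_b) == 1.0

# ... but the MSE differs: 1 for the first prediction, 2 for the second.
print(mean_squared_error(y_true, pred_a), mean_squared_error(y_true, pred_b))
```

The second prediction's errors fluctuate more, which is exactly what the MSE penalizes.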

The max error is the maximal absolute error made: $\max_{1\leq i\leq n}|y_i-\hat{y}_i|$. It shows the range of the occurring errors, but since it represents only a single event, it is not the most powerful metric.

The coefficient of determination (short: $R^2$) is another measure for the quality of the prediction. It is defined as $R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i-\hat{y}_i)^2}{\sum_{i=1}^{n}(y_i-\bar{y})^2}$, where $\hat{y}_i$ is as usual the $i$-th prediction and $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$ the mean of the target values. It is always $\leq 1$, and a prediction that is always equal to the mean has an $R^2$ of 0.

Originally, $R^2$ was introduced as a measure in linear regression. Due to the special properties of linear regression, the coefficient of determination is there exactly equal to the square of the correlation $R$; hence the name $R^2$, even though this does not hold for a general regression method [21].

Therefore, the name $R^2$ can be misleading, because the coefficient of determination can be negative, as the predictions of a model can become arbitrarily bad.

Since it depends on the variance within the data set and not only on the deviations, one should be careful when comparing $R^2$ values that were calculated on different data sets.
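Both properties of $R^2$ mentioned above (zero for the mean predictor, negative for bad predictions) can be checked directly; the target values below are arbitrary.

```python
import numpy as np
from sklearn.metrics import r2_score

y = np.array([1.0, 2.0, 3.0, 4.0])

# Predicting the mean everywhere gives R^2 = 0.
r2_mean = r2_score(y, np.full_like(y, y.mean()))

# A sufficiently bad prediction (here: the values reversed) gives R^2 < 0.
r2_bad = r2_score(y, np.array([4.0, 3.0, 2.0, 1.0]))

print(r2_mean, r2_bad)
```

Unlike the MAE or MSE, $R^2$ has no lower bound, so a large negative value simply signals a model far worse than the trivial mean predictor.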

The correlation $R$ measures the linear dependence between two variables. It is normalized to have values between -1 and 1. A positive correlation (close to 1) between two variables means that they tend to rise and fall together, i.e. as one of them increases, the other one tends to increase as well. Similarly, a negative correlation (close to -1) means that one of the variables tends to decrease when the other one increases, and vice versa.

The above-mentioned restrictions on the growth/learning of a decision tree are examples of hyperparameters. The use of the least-squares loss for gradient boosting is another example. Since there are many possible combinations, it is not clear which combination is the best.

There are multiple strategies to find the optimal parameter combination.

Probably the simplest one (and the one used here) is the exhaustive grid search. In order to perform an exhaustive grid search, one picks some candidate values for each of the parameters to be optimized, trains a model with every possible combination of parameters, evaluates it on the validation set and picks the one that is best with respect to some criterion (in this case: maximal $R^2$).

Some advantages of this method are the easy implementation and the fact that the search can be performed in parallel on multiple machines with almost no effort. On the other hand, if the optimal combination is not among the selected ones, i.e. it does not lie on the discrete grid, it won't be found by this approach, and some time might be wasted on parameter combinations that are far from optimal.
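An exhaustive grid search over a validation set can be sketched as follows; the toy data, the two decision-tree hyperparameters and their candidate values are illustrative choices, not the grid used in this work.

```python
import numpy as np
from itertools import product
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split

# Toy regression data (placeholders for the real kinematic variables).
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.1, size=400)

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Candidate values for two hyperparameters of the decision tree.
grid = {"max_depth": [2, 4, 8], "min_samples_leaf": [1, 5, 20]}

# Train one model per combination, keep the one with maximal validation R^2.
best_score, best_params = -np.inf, None
for depth, leaf in product(grid["max_depth"], grid["min_samples_leaf"]):
    model = DecisionTreeRegressor(max_depth=depth, min_samples_leaf=leaf, random_state=0)
    model.fit(X_train, y_train)
    score = model.score(X_val, y_val)   # R^2 on the validation set
    if score > best_score:
        best_score, best_params = score, (depth, leaf)

print(best_params, best_score)
```

Since every combination is trained independently, the loop body parallelizes trivially, which is the advantage mentioned above.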

3 Methods

3.1 Event generation and processing

For data taken at a collider experiment, it is not possible to determine the underlying process and the involved particles directly. Especially for top quarks, which cannot be detected directly, it is difficult to determine whether one was present in an event or not. Therefore, generated events are widely used. Instead of using real data, one simulates the collision and the measurements of the resulting particles in a detector to obtain an output that contains data similar to the measurements of a real detector, together with the information which particles were involved in which process.

This way, it is possible to develop new analysis methods and improve existing ones by comparing analysis results to the real quantity or by focusing on one specific process.

The tools MadGraph5_aMC@NLO [22] (version 2.7.0) together with the PDF set NNPDF23_lo_as_0130_qed, in combination with Pythia 8.2 [23] and Delphes [6, 24] (version 3.4.2), are able to generate the underlying processes, simulate events based on them, perform decays and hadronization, and finally simulate the detector. The output is saved in a root file, which can be read and converted into numpy [25] arrays by the python module uproot.

Two million $pp \to t\bar{t}$ events in the dilepton channel were generated at a center-of-mass energy of 13 TeV using the above-mentioned tools to obtain one of the data sets. The decay $W^+ \to \tau^+ \nu_\tau$ and the respective decay of the $W^-$ were included in the simulation. Since Delphes is the distinctive part of this production chain, we will refer to its output by mentioning Delphes explicitly, as in "Delphes output". Those events are separated into two sets of one million events each to obtain a training and a test set.

Furthermore, NanoAOD samples were used. NanoAOD simulations are produced centrally using POWHEG v2 with the NNPDF31_nnlo_hessian_pdfas PDF set at NLO, together with NLO matching of matrix element and parton shower simulation as proposed by S. Prestel [26]. They are distributed as root files and can therefore be processed similarly to Delphes output, even though NanoAOD and Delphes output might contain different variables. Also, while NanoAOD is based on a full simulation of the CMS detector, Delphes contains a simplified simulation of a detector.

There are only minor differences in the structure of the output between NanoAOD and Delphes, such that the same tools can be used to process the data. One difference is the naming convention: NanoAOD uses lowercase variable names where Delphes capitalizes the first letter, e.g. "px" instead of "Px", and the Delphes variable "MET MET" corresponds to "MET pt" in NanoAOD. The naming convention of the respective set will be carried through.

Also, the information whether a jet is suspected to have contained a bottom quark is stored differently in Delphes and NanoAOD output. In Delphes, a jet that probably contained a bottom quark gets a btag of 1, and a jet that probably did not gets a btag of 0. So btag has two discrete values in Delphes, and every jet with a 1 is considered a btagged jet.

In NanoAOD, the tagging method used (btagDeepB) is based on a classifier that outputs a number between 0 and 1 that is stored as btag. This can be interpreted as the probability that a jet contained a bottom quark. Here, jets with such a probability of at least 0.2770 are considered btagged.
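The threshold-based selection described above (combined with the later requirement of exactly 2 btagged jets per event) can be sketched as follows; the btagDeepB scores are made-up values for three hypothetical events.

```python
import numpy as np

BTAG_WP = 0.2770  # btagDeepB threshold used for NanoAOD in the text

# Hypothetical btagDeepB scores for the jets of three events.
events = [
    np.array([0.95, 0.60, 0.10]),        # 2 tagged jets -> kept
    np.array([0.90, 0.05]),              # 1 tagged jet  -> rejected
    np.array([0.40, 0.30, 0.35, 0.01]),  # 3 tagged jets -> rejected
]

# Keep only events with exactly two jets above the working point.
selected = [jets for jets in events if (jets >= BTAG_WP).sum() == 2]
print(len(selected))
```

For Delphes the same logic applies with the discrete btag value, i.e. counting jets with btag equal to 1.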

In addition, NanoAOD contains the type of the parton from which the jet originated. That is obtained by matching a jet to a parton from the simulation that is geometrically close to the jet. This information is stored as partonFlavour and directly yields which jet contained the bottom and which jet contained the antibottom.

Each event is required to have exactly 2 btagged jets and 2 leptons of opposite charge. 63,447 out of the first million generated Delphes events and 63,287 from the second million Delphes events fulfilled these criteria.

For NanoAOD, 180,721 out of 1,190,000 and 170,053 out of 1,120,000 events fulfilled these criteria. No further cuts were applied. It should be mentioned that the Delphes simulation already applies cuts during the object reconstruction.

Type    shortcut  variables
lepton  lept      pt, eta, phi, px, py, pz, E
jet     jet       pt, eta, phi, mass, px, py, pz, E, btagDeepB (NanoAOD)
MET     MET       MET (called pt in NanoAOD), phi, px, py

Table 3.1: Overview of the input variables used for the reconstruction.

    lept1 PT    lept2 Px   jet1 Eta     jet2 Py    MET Phi
0  63.683491  -13.405689  -1.785956   38.043738   0.850217
1  20.479212   89.475971  -1.248753   40.449163  -2.370134
2  15.019216   56.703254  -1.334634   30.390645  -2.407620
3  63.861282   36.655279  -0.035643  -60.828295  -0.564938
4  80.606110   59.085128  -0.589899   33.034572  -0.127876

Table 3.2: Table containing some of the variables of the first 5 events in "Delphes unsorted".

To also evaluate the influence of this, only the requirement for leptons to have $p_T \geq 10$ GeV was adopted for NanoAOD; therefore, in NanoAOD all leptons with $p_T < 10$ GeV were ignored. This probably explains the difference in the number of selected events.

Table 3.1 shows the variables used as input. Those are the most commonly used kinematic variables as introduced earlier. There are some redundancies from using Cartesian coordinates and polar coordinates together with the pseudorapidity. Even though such changes of coordinates are mathematically simple, a decision tree cannot perform this calculation, so it might be beneficial to provide the kinematic variables in both coordinate systems.
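The redundancy between the two coordinate systems amounts to the standard conversions between $(p_x, p_y, p_z)$ and $(p_T, \phi, \eta)$, sketched here for a single hypothetical momentum vector.

```python
import numpy as np

# Cartesian components of a hypothetical jet momentum (GeV).
px, py, pz = 40.0, 30.0, 120.0

# Transverse momentum, azimuthal angle and pseudorapidity.
pt = np.hypot(px, py)
phi = np.arctan2(py, px)
p = np.sqrt(px**2 + py**2 + pz**2)
eta = np.arctanh(pz / p)  # equivalent to -ln(tan(theta/2))

# Converting back recovers the Cartesian components.
assert np.isclose(pt * np.cos(phi), px)
assert np.isclose(pt * np.sin(phi), py)
assert np.isclose(pt * np.sinh(eta), pz)
```

A tree-based model built on axis-parallel splits cannot compose such nonlinear functions of its inputs, which is why supplying both representations can help.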

The target variables, the top momenta, were taken from the top quarks with the Pythia 8 status code 62.

The data is then represented in a pandas data frame such that each row corresponds to an event and each column corresponds to a variable as shown in table 3.2.

Columns are named as follows: (shortcut)(number) (variable). If there is only one column of a type, the number is omitted. Where possible, the entries in columns with the number 1 belong to the top decay and, similarly, the number 2 belongs to the antitop decay.

This works well for leptons, since the lepton in a top decay is always positive while the lepton in an antitop decay is always negative, and exactly 2 leptons of opposite charge were required. Unfortunately, it is not trivial to assign the correct number to a btagged jet. There are several possibilities to address this problem. The simplest one is of course to ignore it and sort the jets into columns by some property, like $p_T$. This procedure yields the first two data sets: "Delphes unsorted" and "NanoAOD unsorted", where jet1 is the jet with the higher $p_T$. This could of course also be done for real data.
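Sorting the two jets into columns by $p_T$ can be sketched on a toy data frame; the column names follow the scheme above, but the values and the reduced variable list are made up for illustration.

```python
import pandas as pd

# Toy frame with the naming scheme from the text (hypothetical values).
df = pd.DataFrame({
    "jet1 PT": [30.0, 80.0],
    "jet2 PT": [45.0, 20.0],
    "jet1 Px": [10.0, 50.0],
    "jet2 Px": [40.0, -5.0],
})

# Rows where jet2 has the larger pT: swap the two jets' columns there,
# so that jet1 is always the jet with the higher pT.
swap = df["jet2 PT"] > df["jet1 PT"]
for var in ["PT", "Px"]:
    a, b = f"jet1 {var}", f"jet2 {var}"
    df.loc[swap, [a, b]] = df.loc[swap, [b, a]].values

print(df["jet1 PT"].tolist())
```

In the real data sets the swap loop would run over all jet variables of Table 3.1, not just the two shown here.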

On the other hand, one can obtain simple solutions to this problem by using generator information to determine the correct assignment.

Unfortunately, the Delphes output does not contain information that allows doing this directly, but we can try to approximate it:

If everything were measured and reconstructed perfectly, we would have $p_{top} = p_{jet1} + p_{lept1} + p_{neutrino}$. Based on this, one can try to find a better assignment using generator information: a simple way is to let jet1 be the jet for which the absolute difference in the x-component of the above equation is minimized.
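For a single event, this assignment rule can be sketched as follows; all momentum values are made-up placeholders for the generator-level and reconstructed quantities.

```python
import numpy as np

# Hypothetical x-components (GeV): generator-level top momentum and the
# reconstructed lepton, neutrino and the two b-tagged jets of one event.
top_px = 120.0
lept1_px = 40.0
neutrino_px = 30.0
jet_px = np.array([55.0, -10.0])

# Pick as jet1 the jet that best closes the momentum balance
# p_top = p_jet1 + p_lept1 + p_neutrino in the x-component.
residuals = np.abs(top_px - (jet_px + lept1_px + neutrino_px))
jet1_index = int(np.argmin(residuals))
print(jet1_index)
```

Here the first jet gives a residual of 5 GeV versus 60 GeV for the second, so it is chosen as jet1.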

This yields the third used data set ”Delphes sorted” made from the same simulation data as ”Delphes unsorted” but formatted differently.

Even though this simple approach will not yield the perfect assignment, it makes it possible to study how much accuracy these machine learning based reconstruction methods would gain if there were a procedure to determine which jet is which.

For NanoAOD, the partonFlavour information can be used to sort the jets and build the data set "NanoAOD sorted". In order to allow this sorting method to work consistently, it is necessary to change the selection criterion on jets. Since there might be btagged jets that did not really emerge from a bottom quark, only events are selected that have exactly one jet with partonFlavour 5 (bottom quark) and exactly one jet with partonFlavour -5 (antibottom quark). Conversely, some b jets were not btagged, so "NanoAOD sorted" also includes events not present before. Based on the same simulations as for "NanoAOD unsorted", 311,823 out of 1,190,000 and 292,518 out of 1,120,000 events were selected. The difference compared to the unsorted set comes from the fact that only 75% of the jets in "NanoAOD sorted" are btagged, so (assuming independence) only $0.75^2 \approx 0.56$, i.e. 56%, of these events are expected to be recovered for "NanoAOD unsorted".

Figure 3.1: Two-dimensional histograms of the points $(p_t, p_t^{reco})$ for each data set: (a) Delphes, jets sorted by $p_T$; (c) NanoAOD, jets sorted by $p_T$; (d) NanoAOD, jets sorted with generator information. If everything in every event were treated perfectly, the points would be of the form $(x, x)$ and the histogram a diagonal.

Hence, it must be noted that while "Delphes unsorted" and "Delphes sorted" contain the same events with differently formatted input, "NanoAOD unsorted" and "NanoAOD sorted" might contain different events. This reduces the comparability of these two sets.

Nevertheless, this allows further study of the effect of correct assignments, since it can be assumed that by using generator information the assignment of jets is done perfectly.

To evaluate how well these jet assignment methods work, one can use the neutrino momentum from the simulation, reconstruct the top momentum via $p_{top}^{reco} = p_{jet1} + p_{lept1} + p_{neutrino}$ and compare it to the correct top momentum.

Figure 3.1 shows that, as expected, the sorting performance improves when minimizing the reconstruction error and improves further when using generator information.