
4.2 Gradient Boosting

4.2.6 Interpretation and feature importance

Since a decision-tree-based model is used, it is possible to analyze what has been learned and relate it to properties of the data set. In fact, many of the interesting observations made before can be explained by looking at the feature importances together with the properties of the corresponding variables.
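As a sketch of how such an analysis can be carried out with scikit-learn (on hypothetical toy data, not the thesis data sets; the variable names are made up for illustration), the impurity-based importances of a fitted model are exposed via feature_importances_:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 5_000

# Hypothetical stand-ins for two informative inputs and one useless one.
lept_px = rng.normal(0.0, 40.0, n)
jet_px = rng.normal(0.0, 40.0, n)
noise = rng.normal(0.0, 40.0, n)
target = lept_px + jet_px + rng.normal(0.0, 10.0, n)

X = np.column_stack([lept_px, jet_px, noise])
model = GradientBoostingRegressor(n_estimators=100, random_state=0).fit(X, target)

# Impurity-based importances; scikit-learn normalizes them to sum to 1.
for name, imp in zip(["lept_px", "jet_px", "noise"], model.feature_importances_):
    print(f"{name}: {imp:.3f}")
```

The two informative inputs share almost all of the importance, while the pure-noise column receives close to none.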

Variables that are (anti-)correlated with the target can be helpful for a decision tree: intuitively, a split on a variable that is strongly (anti-)correlated with the target also approximately separates the target values. The separation will not be exact, but the target values on each side tend to lie closer together, so that the mean is close to all of them and the tree can make good predictions. Nonetheless, it is important to keep in mind that correlation only measures linear dependence; the target can therefore be fully determined by a variable while being only weakly correlated with it.
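The last caveat is easy to demonstrate with a toy example (hypothetical data, unrelated to the training sets): a target that is fully determined by an input can still be nearly uncorrelated with it.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 100_000)

# The target is fully determined by x, but the dependence is purely quadratic.
y = x ** 2

# Pearson correlation only captures the linear part of the dependence.
corr = np.corrcoef(x, y)[0, 1]
print(f"corr(x, x^2) = {corr:.3f}")  # close to 0 despite full determinism
```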

        Delphes unsorted                    Delphes sorted
     jet1    jet2    lept1   lept2     jet1    jet2    lept1   lept2
pT   0.5769  0.3835  0.4273  0.3347    0.4809  0.3874  0.4273  0.3347
px   0.0634  0.0197  0.6476 -0.5040    0.6826 -0.5692  0.6476 -0.5040
py   0.0598  0.0230  0.6424 -0.5008    0.5997 -0.5088  0.6424 -0.5008
pz   0.5606  0.4737  0.7700  0.2883    0.7589  0.2900  0.7700  0.2883

        NanoAOD unsorted                    NanoAOD sorted
     jet1    jet2    lept1   lept2     jet1    jet2    lept1   lept2
pT   0.5919  0.3773  0.3924  0.2971    0.5195  0.3981  0.3865  0.2926
px   0.0851  0.0169  0.5807 -0.4596    0.6659 -0.5265  0.5761 -0.4545
py   0.0794  0.0237  0.5810 -0.4565    0.6618 -0.5235  0.5746 -0.4501
pz   0.5299  0.4496  0.6934  0.2735    0.8390  0.3082  0.6808  0.2973

Table 4.5: Correlation between a variable (row) of the physical object (column) and the same variable of the top quark within the training set named in the block heading. For example, the correlation between the pT of the top and the pT of jet1 in the Delphes unsorted training set is 0.5769.

px and py benefit significantly

Comparing the feature importances for px and py between Figure 4.8a and Figure 4.8b and between Figure 4.8c and Figure 4.8d, we can see that the momentum of the positive lepton is the most important feature in the unsorted data sets. In the sorted sets, the jets gain importance significantly and become as important as or even more important than the lepton.

Intuitively, this makes sense, but we can also understand it by looking at the correlations between top momentum and jet momentum.

Looking at Table 4.5, we can observe that the correlations between px/py of the top and jet1 increase significantly when generator information is used to sort the jets, while at the same time the correlations between the top's and jet2's px/py become significantly negative.

Recall that in the unsorted sets, jet1 is the jet with the higher pT. Unfortunately, sorting the jets by pT completely loses the information of which jet originated from the top decay, because both jets are equally likely to have the higher pT. Restoring that order recovers the information that the jet is likely to travel in the same direction as the top did right before its decay.

Since the tt̄ quarks are approximately back to back in the x-y-plane when they are created, the momentum of the top is anticorrelated with the momentum of the jet from the antitop decay. This is the reason why the correlations become close to 0 when the jets are permuted. Due to radiation off the top and the antitop, this anticorrelation is not as strong as the correlation with the jet from the top decay.
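A small toy simulation (with made-up numbers, not the thesis data) illustrates both effects: the generator-matched jet is positively correlated with the top, the jet from the antitop decay is anticorrelated, and a random 50/50 assignment, which mimics a pT ordering that is blind to the decay origin, averages the two to roughly zero.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Toy model: the jet from the top decay follows the top px, the jet from
# the antitop decay is back to back, and both are smeared by radiation.
top_px = rng.normal(0.0, 50.0, n)
jet_top = top_px + rng.normal(0.0, 30.0, n)
jet_antitop = -top_px + rng.normal(0.0, 30.0, n)

corr_top = np.corrcoef(top_px, jet_top)[0, 1]          # strongly positive
corr_antitop = np.corrcoef(top_px, jet_antitop)[0, 1]  # strongly negative

# Random assignment: jet1 is a coin flip between the two jets per event.
swap = rng.random(n) < 0.5
jet1 = np.where(swap, jet_antitop, jet_top)
corr_mixed = np.corrcoef(top_px, jet1)[0, 1]           # close to 0

print(f"{corr_top:.3f} {corr_antitop:.3f} {corr_mixed:.3f}")
```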

The same argument explains the correlation between top and lepton momentum, but the decay W+ → l+ + νl introduces another particle, so the correlation is intuitively weaker than the one between top and jets. On the other hand, the measurement of lepton momenta is more precise than that of jets. Also, leptons are very likely to be identified correctly by their charge, so that this correlation can always be found in the data sets. Having both of these variables as input explains the performance gain when the jets are assigned correctly to the decays.

Another interesting observation is that the angle ϕ is used for py but not for px. This can be explained by the value range ϕ ∈ [−π, π), in which the x-axis corresponds to the angles 0 and −π. A particle moving in the positive x-direction therefore has a value of ϕ close to 0, while a particle moving in the negative x-direction has a value either slightly smaller than π or slightly greater than −π. Thus, it is much more difficult to separate the positive and the negative x-direction with a single split on ϕ.

In contrast, the y-axis corresponds to ±π/2, so the general direction of movement with respect to the y-axis can be determined much more easily, i.e. both directions can be separated by a single split like ϕ ≤ 0.
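This splitting argument can be checked with depth-1 decision stumps on uniformly distributed toy angles (a sketch under the stated geometry, not the thesis setup): a single split on ϕ separates the sign of py almost perfectly, but not the sign of px, whose classes interleave in ϕ.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
phi = rng.uniform(-np.pi, np.pi, 50_000).reshape(-1, 1)

px_sign = np.sign(np.cos(phi.ravel()))  # +1 for |phi| < pi/2
py_sign = np.sign(np.sin(phi.ravel()))  # +1 for phi > 0

# A single split (depth-1 stump) on phi separates the y-direction perfectly...
stump_y = DecisionTreeRegressor(max_depth=1).fit(phi, py_sign)
score_y = stump_y.score(phi, py_sign)  # R^2 close to 1

# ...but cannot separate the x-direction with one threshold.
stump_x = DecisionTreeRegressor(max_depth=1).fit(phi, px_sign)
score_x = stump_x.score(phi, px_sign)  # R^2 far below 1

print(f"py: {score_y:.3f}, px: {score_x:.3f}")
```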

So the comprehensibility of the gradient boosted decision trees allowed us to explain the difference in performance when sorting the jets with generator information, and even to recover a detail of the definition of the coordinates within the detector.

pT gains less than px and py

When comparing the feature importances for predicting pT in the unsorted and sorted data sets, the most prominent difference is that jet1 actually becomes less important and jet2 becomes more important.

If we try to explain this again with the correlations, we observe one major difference between the situation for pT and that for px and py: for pT, there is still a positive correlation between the top momentum and the momentum of the jet emerging from the antitop decay. Therefore, randomly permuting the jets only weakens this correlation instead of destroying it.

It might seem surprising that the correlation between the top's and the first jet's pT actually decreases when generator information is used to identify the jet from the top decay, but the high correlation in the unsorted set is due to a bias introduced by sorting by pT. Sorting by pT increases the tendency that jet1 has a high pT whenever the top has a high pT, since the peak of the jet pT distribution lies at a lower value than that of the top.
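One way this bias can arise is sketched by the following toy model (hypothetical numbers): when both jet pTs follow the top pT with independent smearing, always picking the harder jet preferentially selects upward fluctuations, and the pT-leading jet ends up more strongly correlated with the top pT than the generator-matched jet.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000

# Toy model: both jet pTs track the top pT with independent smearing,
# and neither jet is preferred a priori.
top_pt = rng.normal(100.0, 20.0, n)
jet_a = 0.5 * top_pt + rng.normal(0.0, 15.0, n)  # jet from the top decay
jet_b = 0.5 * top_pt + rng.normal(0.0, 15.0, n)  # jet from the antitop decay

corr_matched = np.corrcoef(top_pt, jet_a)[0, 1]
# pT sorting always picks the harder jet, inflating the correlation.
corr_sorted = np.corrcoef(top_pt, np.maximum(jet_a, jet_b))[0, 1]

print(f"matched jet: {corr_matched:.3f}, pT-leading jet: {corr_sorted:.3f}")
```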

So the difference in performance can again be explained by looking at the feature importances and the properties of the data: sorting the jets with generator information decreases the usefulness of jet1's pT and simultaneously increases that of jet2's pT. The achieved performance increase shows that, in this case, these two slightly less useful input variables are in combination more useful than the single useful input variable in the unsorted sets.

This continuous trade-off between features explains the slowly decreasing curve in Figure 4.7. In addition, a decision tree never uses the correlation explicitly, so it also makes sense that the performance increases as jet1 more reliably becomes the jet belonging to the top decay.

In total, we see that for pT the presented gradient boosting model does not benefit as much from improved jet assignment methods as it does for px and py.

Performance on Delphes seems to be better than on NanoAOD

Looking at the feature importances, the predictions on the unsorted sets are dominated by lepton properties, so it is natural to have a closer look at the leptons. In Table 4.5 we can see that the correlation between top momentum and lepton momentum is higher in Delphes than in the NanoAOD output, although no difference between those sets that would explain this has been pointed out so far.

Since we use dilepton events, it makes sense to analyze the different decays of the top separately. Looking at this decay within the selected events of the unsorted sets, we observe several differences: there are almost no events with hadronically decaying taus in Delphes, while those events make up a significant portion of the NanoAOD set, as shown in Figure 4.9. Furthermore,

(a) Delphes unsorted        (b) NanoAOD sorted

Figure 4.9: Fraction of top decay products in the selected events per data set.

Figure 4.10: Correlation between top px and lepton px by top decay.

the correlation between lepton and top momentum is higher in Delphes and, in contrast to NanoAOD, does not decrease for the products of antitau decays, as shown in Figure 4.10.

Using the simulation data, it is possible to check where these electrons come from. As expected, most of the leptons in decays without a tau originate from the W+ decay. Since the events with hadronically decaying taus have "lost" a lepton, it makes sense that the vast majority of leptons in these events emerged as radiation from hadrons. Of course, those momenta are not correlated with the top momentum in the way the momentum of a correctly identified lepton is. In fact, a similar phenomenon affects the leptonically decaying taus: in about 15% of the events, the lepton from the tau decay does not have enough pT to be detected or considered. If another lepton is detected in such an event, it gets selected even though its momentum does not correlate strongly with the top momentum. Again, the misidentification of another lepton as the lepton of the top decay is the reason for the decreased correlation.

In both cases, the mistakenly chosen lepton is typically not isolated, i.e. there are other particles with significant momentum within a small cone around its trajectory. By requiring that a particle carries at least some portion of the total momentum of all particles within a cone around its trajectory, one can require that it is isolated. Such a requirement is applied in Delphes by default. Small studies suggest that after introducing a corresponding criterion for the NanoAOD samples, the discrepancy in performance becomes insignificant.
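A minimal sketch of such a criterion, here written as a CMS-style relative isolation (the function name, cone size, and threshold are assumptions for illustration, not the exact Delphes definition):

```python
import numpy as np

def relative_isolation(lep_eta, lep_phi, lep_pt,
                       cand_eta, cand_phi, cand_pt, cone=0.4):
    """Sum of candidate pT within a cone of radius ΔR around the lepton,
    divided by the lepton pT (the lepton itself is not in the candidates)."""
    dphi = np.abs(cand_phi - lep_phi)
    dphi = np.where(dphi > np.pi, 2 * np.pi - dphi, dphi)  # wrap into [0, pi]
    dr = np.hypot(cand_eta - lep_eta, dphi)
    return cand_pt[dr < cone].sum() / lep_pt

# Hypothetical event: one nearby soft particle, one particle far away.
cand_eta = np.array([0.1, 2.0])
cand_phi = np.array([0.2, -1.0])
cand_pt = np.array([5.0, 30.0])

iso = relative_isolation(0.0, 0.0, 50.0, cand_eta, cand_phi, cand_pt)
print(iso)  # 5 / 50 = 0.1, which would pass a cut like iso < 0.15
```

A small value means the lepton carries most of the momentum in its neighbourhood, i.e. it is isolated; the equivalent formulation in the text above requires the lepton's share of the total cone momentum to exceed a threshold.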

With these steps, we have essentially rediscovered the isolation cut. Similarly, this kind of procedure can be used to improve or introduce variables that intuitively should be beneficial for a better reconstruction of the momentum. This makes it possible to gradually determine the best preprocessing steps, even beyond commonly used criteria and towards new ones. Also, being able to analyze the learning in this manner simplifies the detection of errors or misconceptions in the analysis. If we had not been aware of the differences in event generation, this behaviour would have been very hard to explain without looking at the learning.

pz gains less than px and py

Qualitatively, the difference between the sorted and the unsorted sets originates from the same observation as the difference in px and py. For pz, the leptons alone provide more information useful for the reconstruction than in the case of px and py, and randomly assigned jets still correlate with the momentum of the top.

This makes sense, as the whole system is usually boosted due to the different momentum fractions carried by the partons forming the tt̄ pair, and this boost affects the jets as well as the leptons. Because there is no W decay that might introduce another boost, the jet momentum is more strongly correlated with the top momentum than the lepton momentum.

Interestingly, there is a difference between the unsorted NanoAOD set and the sorted NanoAOD set with permuted jets. This can be seen by comparing Table 4.3 with Figure 4.7. This effect also vanishes when isolated leptons are required.

5 Conclusion and outlook

This work can be considered a proof of concept for the kinematic reconstruction of tt̄ pairs with gradient boosted decision trees (GBDTs). GBDTs define a powerful yet understandable and easily interpretable model. These properties can be used to search for improvements and are also helpful for spotting inconsistencies in the testing and evaluation of the model.

Depending on the reconstructed variable, it turned out that it would be helpful to have a method that can separate jets originating from bottom and antibottom quarks (perhaps using a classifier), or to have correlated variables that are independent of the order of these jets. Still, many aspects can be improved and investigated further in following studies.

On the computational side, there are more efficient implementations of gradient boosting than the one used here. At the time of this thesis (September 2020), scikit-learn includes the still experimental HistGradientBoostingRegressor, with apparently similar implementations in other software projects. It uses different data structures so that more training data can be processed to build more trees in less time. This might offer the opportunity to obtain even more robust regression models with comparable predictive performance in similar or even less time. In addition, besides the impurity-based feature importance used here, there are other notions of feature importance such as the permutation importance. Those might also show interesting patterns that can be used to understand how to improve the regression model or the preprocessing of the data sets.

At the moment, the data sets contain redundant input variables. Therefore, it might be interesting to use procedures for feature selection or dimensionality reduction, such as a principal component analysis, and evaluate their results. Even introducing new physical variables as input might yield new insights.
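As a sketch of how a principal component analysis exposes linear redundancy (hypothetical columns, not the thesis inputs): when one column is a linear combination of others, the smallest principal component carries essentially no variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(6)
n = 10_000

# Hypothetical redundant inputs: two momenta and their (linearly dependent) sum.
a = rng.normal(0.0, 40.0, n)
b = rng.normal(0.0, 40.0, n)
X = np.column_stack([a, b, a + b])

pca = PCA().fit(X)
# The last ratio is ~0: the data occupy fewer directions than there are columns.
print(pca.explained_variance_ratio_)
```

Note that PCA only captures linear redundancy; a nonlinearly derived variable such as pT = hypot(px, py) would not collapse this way.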

The whole complex of the effect of cuts in event selection is not covered in this study. As mentioned in the previous chapter, isolation requirements seem to be a key to obtaining better results easily. Similarly, small example runs suggest that introducing cuts on kinematic variables like pT usually yields a data set on which the GBDTs perform even better. The effect of background-reducing cuts, like a Z veto on the invariant mass of the leptons, was not evaluated and is hard to predict beforehand. At the end of this process, a comparison with methods that are already used in the kinematic reconstruction of the top momentum is possible and interesting.
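Such a kinematic cut is a simple boolean selection; a minimal sketch (the 30 GeV threshold is an assumption for illustration):

```python
import numpy as np

# Hypothetical leading-jet pT values for five events, in GeV.
jet1_pt = np.array([12.0, 45.0, 80.0, 25.0, 33.0])

# Keep only events whose leading jet exceeds the threshold.
mask = jet1_pt > 30.0
selected = jet1_pt[mask]
print(selected)  # [45. 80. 33.]
```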

Everything mentioned here was only studied and evaluated within the standard model. Before applying the presented method to beyond-standard-model searches, the performance of GBDTs trained on standard model data should be evaluated on beyond-standard-model events.


I hereby declare on oath that I have written this Bachelor's thesis in the degree programme Computing in Science independently and have used no aids other than those stated, in particular no internet sources not named in the list of references. All passages taken verbatim or in substance from publications have been marked as such. I further declare that I have not previously submitted this thesis in another examination procedure and that the submitted written version corresponds to the version on the electronic storage medium.

Hamburg, 28 September 2020

Karim Ritter von Merkl