
Application of machine learning in the kinematic reconstruction of tt̄ events

Submitted by Karim Ritter von Merkl (student number 6927983)

in the study program B.Sc. Computing in Science at the Department of Informatics of the MIN Faculty

September 28, 2020

1. Examiner: Prof. Dr. Christian Schwanenberger

2. Examiner: Dr. Alexander Grohsjean


1 Introduction 1

2 Fundamental concepts 3

2.1 Particle physics . . . 3

2.1.1 The basics of the standard model . . . 3

2.1.2 The top quark at hadron colliders like the LHC . . . . 6

2.2 Machine learning . . . 14

2.2.1 The basics of decision tree regression . . . 14

2.2.2 Gradient boosting . . . 18

2.2.3 Training, evaluation and model selection . . . 20

3 Methods 23

3.1 Event generation and processing . . . 23

3.2 Machine learning methods . . . 28

3.2.1 Hyper parameter optimization . . . 28

3.2.2 Effect of the correct jet permutation . . . 29

4 Results 31

4.1 A single decision tree . . . 31

4.2 Gradient Boosting . . . 33

4.2.1 Reconstruction of px . . . 34

4.2.2 Reconstruction of py . . . 35

4.2.3 Reconstruction of pz . . . 36

4.2.4 Reconstruction of pT . . . 38

4.2.5 Jet permutation . . . 38

4.2.6 Interpretation and feature importance . . . 40

5 Conclusion and outlook 47

Bibliography 49


First of all, I would like to thank my parents and my family, who have always supported me and without whom I would not have come this far.

I would also like to thank my supervisors Christian Schwanenberger and Alexander Grohsjean as well as all other members of the DESY Exotics group, who welcomed me very warmly and helped me with questions of all kinds.

During the SARS-CoV-2 pandemic of 2020, while this thesis was written, it was especially nice to have my flatmates Theresa, Simon and Ruben as friends. Above all, I will remember the long conversations in the kitchen, which more than once ended with a new insight or a good piece of advice.

I would especially like to thank Lotte and Simon for proofreading and for their help with the English language.

Finally, I do not want to leave unmentioned all of my friends not named so far, who, especially in March and April, contributed to a feeling of normality through digital meetings.


1 Introduction

"The Large Hadron Collider (LHC) is the world's largest and most powerful particle accelerator" [1], and it is the second accelerator constructed that is able to produce top quarks.

The data sets of the last run of the Tevatron, the collider where the top quark was initially discovered, contained somewhere between a few hundred and a few thousand tt̄ pairs [2]. In comparison, during LHC Run 2 (2015-2018), data corresponding to more than one hundred million top pairs was collected by the CMS detector [3]. With increased luminosity from ongoing upgrades of the LHC and a higher cross section at 14 TeV center of mass energy, the production of even more top pairs can be expected in Run 3.

Further improvements of the hardware towards the high luminosity LHC (HL-LHC) are already planned for the future.

With this many events available for analysis, the statistical uncertainties will decrease and systematic uncertainties will dominate. Therefore, new and improved analysis methods are needed to obtain more accurate results from the new data sets.

Many predictions of the standard model are tested and confirmed to a high precision. Nonetheless, the standard model is known to be incomplete. For example, it does not contain an explanation for dark matter/energy [4].

The properties of tt¯ decays provide another test for the standard model.

Additionally, there are theories for beyond standard model physics that are connected to the top quark, for example the heavy Higgs. For those applications, improved analysis methods for top pair events are needed.


Every decay channel of the top quark brings its own difficulties for the analysis.

Events in the dilepton and lepton+jets channels contain at least one neutrino that is not detected, such that the system of equations used for the kinematic reconstruction is underdetermined. For all-hadronic decays, all products can be measured by the detector, but it is not obvious how to determine which jets belong to the top decay and which jets belong to the antitop decay.

Over time, the application of machine learning to problems within the sciences has become more common. Erdmann et al. [5] already proposed a neural network to find the correct permutation of jets in the lepton+jets channel and obtained a higher accuracy than a purely likelihood-based fit.

This permutation can then be used in the kinematic reconstruction.

In this work, a purely machine-learning-based approach is presented. It uses gradient boosted decision trees (GBDTs) for regression to reconstruct the top quark momenta in dilepton events. In addition, the performance of this method is evaluated on two different data sets: a simplified detector simulation with Delphes [6], and NanoAOD samples containing a complete detector simulation. Finally, a comparison of the two by interpreting the results and the learning of the GBDTs is attempted.


2 Fundamental concepts

2.1 Particle physics

2.1.1 The basics of the standard model

Most of the following section is covered by Bettini [7]. If another reference is used, it will be cited explicitly.

The standard model of particle physics describes the existence and interactions of all fundamental particles we know of so far.

Hundreds of particles have been discovered over the course of time, but most of them are bound combinations of other particles. So far, 17 particles are suspected to be fundamental. All of them are shown together with some of their properties in Figure 2.1. These particles are divided into bosons, which have integer spin, and fermions, which have half-odd-integer spin (there are non-fundamental fermions as well); the fermions are further divided into leptons and quarks.

Note: The numerical values of physical quantities in particle physics are usually extremely small when converted to SI units. In addition, a lot of constants pop up in various places, blowing up formulas and equations.

Therefore, the natural unit system is widely used within particle physics and related fields. Natural units work by defining $c = \hbar = \varepsilon_0 = k_B = G_N = 1$.

Using this unit system, Einstein's equation for a particle at rest becomes $E = m$.

[Figure 2.1: chart "Standard Model of Elementary Particles", listing the three generations of matter (fermions), i.e. the quarks and leptons, and the interaction/force carriers (gauge and scalar bosons), each with its mass, charge and spin; see the caption below.]

Figure 2.1: Diagram visualizing the classification of fundamental particles. Each quark generation corresponds to a column, with the up-type and down-type quarks in their respective rows. Similarly, the lepton families are represented as columns, whereas the rows contain the charged and the neutral member of each family. On the right-hand side, the bosons are organized in two columns. One column consists of the bosons with spin 1, called vector bosons. The other column contains the Higgs boson, the only known boson with spin 0, called a scalar boson. Technically, the standard model requires that neutrinos are massless even though we already know that neutrinos have a non-zero mass [8].


But the unit of energy is not yet defined. The SI unit Joule is impractical, as the masses of particles tend to be small. The electron volt (eV) is relatively close to the order of magnitude of quantities like momenta, energies and masses of particles, so it is used as the unit of energy. Thus, it makes sense to speak of a mass equal to 511 keV, as is the case for the electron in natural units.

We know 6 types of quarks (called flavours) which are grouped in pairs of two. These three pairs are called generations and are ordered by mass (and simultaneously by time of discovery). In each generation we have a positively and a negatively charged quark. The up (u) and the down (d) quark make up the first generation; therefore, one often refers to positively charged quarks as up-type quarks and, similarly, to negatively charged quarks as down-type quarks. The remaining generations are (in up-type, down-type order) charm (c), strange (s) and top (t), bottom (b).

Similarly, there are three families of leptons, also ordered by mass. Each family consists of a negatively charged lepton and a neutrino. Those families are the electron (e) together with the electron neutrino (νe), the muon (µ) with the muon neutrino (νµ) and the tau lepton (τ) with the tau neutrino (ντ).

Each fermion also has another partner, its antiparticle (usually denoted with a bar above the letter). A particle and its antiparticle share the same mass, but carry charges of opposite signs, i.e. the charge of a muon is −1 whereas the charge of an antimuon is +1.

In addition, there are four gauge bosons, which are responsible for the forces between particles. Photons (γ) are the particles corresponding to the electromagnetic force and therefore only interact with charged particles.

Hence, photons do not interact with each other.

Gluons (g) represent the strong force. They interact with quarks, which are said to carry color charge, which has the three distinct states red, blue and green. Antiquarks carry the anticolors antired, antiblue and antigreen. There are 8 combinations of colors and anticolors that gluons can carry. Therefore, gluons can interact with other gluons. Color charged particles always form color neutral bound states, in the sense that no color charged particle has been observed individually. This behaviour, called color confinement, massively influences the possible measurements in detectors, as there won't be any single quarks.

Several quarks together with gluons can form a variety of bound states, the hadrons. Mesons consist of a quark and an antiquark while baryons consist


of a combination of three quarks or three antiquarks. Both mesons and baryons are color neutral and are observable bound states of matter.

Famous examples for hadrons are protons and neutrons, the building blocks of atomic nuclei. Note that the proton is the only stable hadron and all mesons are unstable.

The mathematical description of the standard model requires massless particles. This contradicts our observations and is resolved via the Higgs mechanism together with the Higgs boson (H), which is the most recently discovered fundamental particle (2012) [9]. It interacts with all particles that are said to have mass: the heavier a particle, the stronger its interaction with the Higgs boson.

For the purpose of reconstructing tt̄ events, it suffices to keep in mind that the W± and Z bosons can interact with any of the fermions above; in the situation considered for the reconstruction, the interaction has already taken place. The W bosons provide the only way for quark flavour changes, i.e. reactions where the number of quarks per flavour is not conserved. For example, a charm quark can decay into a strange quark and a W+.

2.1.2 The top quark at hadron colliders like the LHC

Most of the content of this section is covered by Bettini [7] or Wagner [10].

If further resources are used, they are cited explicitly.

Hadron colliders

At a hadron collider, as the name suggests, hadrons (protons in the case of the LHC) are accelerated to almost the speed of light and collide at an interaction point in a detector that identifies and measures the properties of the products of the collision (event). At the LHC, two proton beams are accelerated to 6.5 TeV each in opposite directions, such that the center of mass energy of the collision, the length of the total four-momentum, equals 13 TeV. For a collision with equal but opposite momenta, the rest frame, i.e. the inertial system in which the center of mass is at rest, is the laboratory frame.

As mentioned above, protons are not fundamental particles. Besides three valence quarks (2 up quarks and a down quark), a proton consists of many


Figure 2.2: Slice of the CMS-detector with trajectories of particles and their interac- tions with the detector. The muon is a positively charged antimuon, the charged pion a positively charged hadron and the electron’s charge is negative [11].

quark-antiquark pairs (the sea quarks) that are generated and destroyed spontaneously and gluons exchanging energy between them.

Therefore, for an accelerated proton, the total momentum is divided among its partons (all constituents of the proton) in a non-deterministic way. In general, one finds that the gluons carry about half of the total momentum of the proton and are especially common among the partons carrying a small fraction of the total momentum; intuitively, there are many gluons within the proton that each carry a small fraction of the total momentum. A collision of two protons is actually a collision of two partons, one from each proton.

Such a collision results in a lot of particles and energy emerging from it and the detector is designed to identify them and measure their properties as accurately as possible. The following paragraphs summarize roughly what the most important components of the CMS detector are and what they do.

For more details about the CMS detector, see the CMS Collaboration [11]; for more details about the physics used in detectors, see Cerrito [12].

Figure 2.2 shows a slice of the CMS detector. It is shaped like a rotationally symmetric cylinder with respect to the beam axis, consisting of the cylinder barrel together with an end cap on each of the two ends. The shown slice illustrates a part of the barrel and the components used to detect and measure the properties of various decay products. The end caps contain


the same instruments arranged such that particles that move rapidly in the beam direction can be detected there in a similar way as the particles in the barrel.

The silicon tracker consists of millions of pixel detectors having side lengths smaller than about a tenth of a millimeter each. A highly-energetic charged particle leaves a trace of activated pixels in the silicon tracker that can be reconstructed to see its path through the tracker.

The calorimeters (electromagnetic and hadronic) measure the total energy of electrons and photons (ECAL) or hadrons (HCAL). A particle entering a calorimeter will start to form a shower of many more particles within the calorimeter (the dark blue clusters in the green/yellow regions in Figure 2.2) that will eventually pass all their energy to the calorimeter. In some sense, every created particle is the origin of a smaller subshower, and this chain reaction terminates fast enough that the total shower is mostly contained in the calorimeter. Measuring the total energy of all particles of a shower yields the energy of the initial particle.

The superconducting solenoid produces a strong magnetic field that curves the trajectories of all charged particles. Due to the magnetic field, the charge and momentum of a particle are related via the radius of the curve. In addition, the sign of the charge can be read off from the direction of the magnetic force: for the orientation of the magnetic field shown here, positive particles move clockwise while negative particles move counter-clockwise within the magnet. Outside of the solenoid the magnetic field reverses its direction, so the sense of curvature is reversed as well. The radius of the magnet is chosen such that most of the instruments can be placed inside of it, to minimize inaccuracies introduced by interactions with the solenoid.

Muons are detected in the outermost part of the detector, the muon chambers, because they can pass through all the inner layers without showering. Almost all other particles do not reach that far out. The muon chambers work similarly to another layer of trackers that detect the path of the particle passing through them.

Since the interactions in a hadron collider happen on parton level, the rest frame of a collision is not necessarily the laboratory frame. Instead, the rest frame is (approximately) z-boosted due to the different fractions of the proton momentum carried by the interacting partons. Unfortunately, it is not possible to know how exactly this system moves with respect to the


detector. Therefore, it is desirable to use a coordinate system that is invariant under Lorentz z-boosts [2].

The Cartesian coordinates in a detector are usually defined such that the z-axis is tangent to the counter-clockwise beam and the y-axis points upwards. To satisfy the common right hand rule, the x-axis then has to point towards the center of the accelerator ring.

This definition implies that the x-y-plane is perpendicular to the beam and therefore, movement in this plane is invariant under Lorentz z-boosts.

Since no direction is favoured physically (because gravity is negligible) and detectors are usually designed to be rotationally symmetric with respect to the z-axis, it is natural to use polar coordinates for the x-y-plane. Movement in this plane is therefore described by the two variables $p_T = \sqrt{p_x^2 + p_y^2}$ and $\varphi \in [-\pi, \pi]$, the azimuthal angle between the x-axis and the momentum vector.

The polar angle θ is not invariant under z-boosts, so another coordinate to measure the movement in the z-direction is needed. The rapidity has the desirable property of transforming additively, such that differences are invariant under z-boosts. Unfortunately, it is a kinematic quantity influenced by momentum and energy and not only by the direction. The purely geometric pseudo-rapidity, on the other hand, is defined as $\eta = -\ln \tan\frac{\theta}{2}$ and therefore only depends on the direction. For massless particles, rapidity and pseudo-rapidity agree, but this is not true for particles with mass [13]. Instead, the rapidity converges to the pseudo-rapidity for momenta much larger than the mass (highly relativistic particles), as is the case for the majority of decay products [2].

We have that:

$p_x = p_T \cos\varphi, \qquad p_y = p_T \sin\varphi, \qquad p_z = p_T \sinh\eta$
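These relations translate directly into code; a minimal sketch of the conversion, where the function and variable names are illustrative and not taken from the analysis code of this thesis:

```python
import numpy as np

def to_cartesian(pt, phi, eta):
    """Return (px, py, pz) for a given transverse momentum, azimuthal angle
    and pseudo-rapidity (works for scalars or numpy arrays)."""
    px = pt * np.cos(phi)
    py = pt * np.sin(phi)
    pz = pt * np.sinh(eta)
    return px, py, pz
```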

Since the total transverse momentum vanishes (approximately) before the collision, this should still be the case after the collision, but weakly interacting particles like neutrinos escape the detector with almost no reaction and therefore, their momentum is not measured. The constraint of vanishing

Figure 2.3: Feynman diagrams of lowest order (two vertices) for top pair production at hadron colliders: (a)-(c) gluon fusion, (d) quark-antiquark annihilation.

transverse momentum can be used to obtain another pair of variables that describe the total transverse momentum of the weakly interacting particles, called $\vec{p}_T^{\,\text{miss}}$. It is the negative sum of all measured transverse momenta. The absolute value of this vector is also called missing transverse energy (MET / $E_T^{\text{miss}}$) [2].
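As a minimal sketch of this definition (array names are assumptions, not the thesis code), the missing transverse momentum can be computed as the negative vector sum of all measured transverse momenta of one event:

```python
import numpy as np

def missing_transverse_momentum(px_visible, py_visible):
    """px_visible, py_visible: arrays with the measured momentum components
    of all reconstructed objects in one event."""
    met_x = -np.sum(px_visible)
    met_y = -np.sum(py_visible)
    met = np.hypot(met_x, met_y)  # |p_T^miss|, the missing transverse energy
    return met_x, met_y, met
```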

The top quark

One especially interesting particle is the top quark. Having a mass of about 173 GeV (about the mass of a gold atom) makes it the heaviest fundamental particle, even heavier than the Higgs boson (about 125 GeV rest mass), and by far the heaviest fundamental fermion, compared to the bottom quark in second place with a rest mass of "only" about 4 GeV.

Figure 2.3 shows the two-vertex processes contributing to top pair production at hadron colliders. At a proton-proton collider specifically, the antiquarks must be sea quarks while the quark can be either a valence or a sea quark. In the case of the LHC at 13 TeV center of mass energy, about 90% of the top pairs are created via gluon fusion [2]. Thus, the valence quarks contribute only little to the top pair production. At the Tevatron, the proton-antiproton collider where the top quark was discovered, about 80-90% of the produced top pairs were quark-antiquark annihilation events.

Another noteworthy property of the top is connected to its decays. Firstly, its average lifetime is about two orders of magnitude smaller than the time scale on which the strong interaction operates. This implies that the top quark almost certainly decays before being able to form any kind of non-fundamental particle like mesons and baryons. Therefore, the measurements of top quark properties are not "disturbed" by any kind of bound states involving the top. Secondly, the by far most dominant decay channel is


Figure 2.4: Diagram showing the cone structure of jets together with the secondary vertex for bjets [14].

$t \to W^+ b$, happening about 99.8% of the time. Similarly, the antitop decays mostly via $\bar{t} \to W^- \bar{b}$ before forming hadrons.

Another handy property of this decay is that bottom quarks (and antibottom quarks) exhibit special properties, too. Since they are heavy particles, they decay quickly as well.

As stated above, free color charged particles are not observed. Whenever a quark or gluon emerges from a collision, it is the origin of a jet, i.e. multiple hadrons moving in a cone in roughly the same direction. Often, this is pictured as "tension"/high energy density in some kind of potential of the strong force that is resolved via the creation of new particles that form color neutral bound states (hadrons). This results in a lot of particles hitting the detector, making measurements more difficult and imprecise, since many particles now share the energy of the single particle the shower emerged from.

If there is a jet containing a bottom quark in the detector, the bottom can travel a small distance before decaying. After that short amount of time, the bottom quark will decay into another quark and a W boson. That creates a secondary vertex and yields a jet within the jet. This is the key observation that led to b tagging of jets to determine whether a bottom quark was contained in a jet. This is pictured in Figure 2.4.


Therefore, analyzing tt̄ events can be done by analyzing the simultaneous decay of a top and an antitop, which can be further simplified by considering only the by far most common decay of each. But this is where things become a bit more complicated. The W+ (and the W− in a similar way, with particles and antiparticles changing roles) can decay leptonically into an anti-lepton and its neutrino, or hadronically into an up-type quark and a down-type antiquark.

This results in 3 different decay channels for the decay of a tt̄ pair. Those are:

• dilepton: Both Ws decay leptonically. This yields 2 bjets, 2 neutrinos and 2 leptons.

• lepton+jets: One W decays leptonically and the other hadronically. This yields 2 bjets, 2 jets, 1 lepton and 1 neutrino.

• all hadronic/alljets: Both Ws decay hadronically. This yields 2 bjets and 4 jets.

Often, decays including the tau are excluded from the categorization above and considered separately because of the various decay channels of the tau, but we will consider them as part of the dileptonic decays.

For every process one would like to observe, there might be other processes that have the same final state; their measured data might therefore look similar to the process we are interested in, since we can only measure the final state of a collision event. Those other possibilities are called the background for this process. When using data measured at a collider experiment, we have to somehow separate signal and background to analyze particular particles and interactions.

Due to the non-deterministic nature of particle physics events, statistical methods are widely used. In order to obtain results that are as precise as possible, it is favourable to have many events available such that statistical uncertainties decrease.

The number of events occurring per unit time interval is characterized as the product of the two quantities L and σ, where σ is the cross section and L is the luminosity.

The cross section contains information that is specific for a process, e.g. two protons colliding, forming a tt̄ pair and decaying in the dilepton channel.

The luminosity summarizes all the properties of the experiment, e.g. the

[Figure 2.5: pie chart "Top Pair Decay Channels" showing the branching fractions of the W decay combinations: all-hadronic, electron+jets, muon+jets, tau+jets, and the dilepton final states (ee, eµ, eτ, µµ, µτ, ττ).]

Figure 2.5: This graphic shows the relative frequency of final states for tt̄ events. The labels containing the abbreviations of multiple quarks represent the by far most common quark combinations in jets originating from top pair decays [15].

number of particles per beam, the width of the beam and how often the beams cross at the interaction point.

When considering one specific collider, a high rate of occurrence corresponds to a high cross section. Thus, it would be nice if a particularly interesting decay had a small background, i.e. comparably few or rarely occurring events with the same signature in the detector, and a high cross section, so that it occurs often.

Figure 2.5 shows the relative frequency (branching ratio) of the most dominant decay channels. As one can see, the dilepton channel is the rarest, followed by lepton+jets (when, as usual, excluding taus), and a lot of events decay hadronically. In fact, the background of the all-hadronic decay channel is quite large, because a lot of processes can produce jets, e.g. gluon


radiation which yields an additional jet. On the other hand, the background for the dilepton channel is comparably small.

To summarize, we find these practical properties of the decay channels:

• dilepton: small background yields clean event samples, but has a comparably small cross section.

• lepton+jets: moderate amount of background and cross section.

• all hadronic: high cross section, but also high background

Furthermore, every decay channel admits some difficulties to face in the kinematic reconstruction of events, which tries to determine the kinematic properties of the involved particles. The all-hadronic channel yields a large number of jets, but it is not clear which jet originated from which W boson.

In addition, it is not possible to decide whether a bjet contained a bottom quark or an antibottom quark. These ambiguities have to be resolved in the analysis. Every event containing a lepton (and hence a neutrino) will have missing energy since neutrinos do not interact with matter strongly enough to be measured in the detector.

This is just a shallow and mostly phenomenological introduction to the standard model, but it should be sufficient to follow the physical consequences arising from it.

2.2 Machine learning

The content of this section is covered by Alpaydin [16]. More details of the used implementation can be found in the scikit-learn reference on decision trees [17].

2.2.1 The basics of decision tree regression

A part of machine learning, more specifically supervised learning, tries to find algorithms that can reproduce some kind of input-output behaviour by analyzing a small subset of the possible inputs. In regression, the outputs (also called target values) are continuous numerical values that we would like to predict.


[Figure 2.6, node contents: the root splits on jet1_pz <= -0.732 (mse = 187464.496, samples = 311823, value = 0.702); its left child splits on jet1_eta <= -1.623 (mse = 116105.765, samples = 154623, value = -267.842) with the leaves (mse = 186673.572, samples = 36156, value = -616.059) and (mse = 46267.094, samples = 118467, value = -161.567); its right child splits on jet1_eta <= 1.614 (mse = 116948.438, samples = 157200, value = 264.845) with the leaves (mse = 46608.784, samples = 120566, value = 158.681) and (mse = 189272.592, samples = 36634, value = 614.24).]

Figure 2.6: A decision tree of max depth 2. It predicts the momentum along the beam axis.

Decision trees are a simple, easy-to-use, out-of-the-box method in machine learning. In contrast to other machine learning techniques, they allow one to interpret and understand the learned rules. Therefore, decision trees and derived models are widely used to build comprehensible models.

To understand how decision tree learning for regression works, it makes sense to start at the end and consider how a completely trained decision tree processes the input to make its prediction. Figure 2.6 shows a small, fully trained decision tree. It is inspired by the decision tree discussed in Section 4.1.

A decision tree consists of finitely many nodes that may (inner nodes) or may not (leaf nodes) split the data. In the example tree in Figure 2.6, the top node (the root of the tree) splits on the pz of jet1.

The picture that one should have in mind when thinking about splits is that every event having jet1_pz ≤ −0.732 progresses to the left node, while all the other events, having jet1_pz > −0.732, progress to the right child of the root.

This yields a path through this tree that ends in a leaf for each event. That leaf has a ”value” attached to it (as any other node in Figure 2.6 has). When predicting (reconstructing the momentum for) an event, the prediction of a decision tree is the ”value” of the leaf the path for that event ends in.

Therefore, if the decision tree in Figure 2.6 consisted only of the upper two levels, an event would get the prediction −267.842 if jet1_pz ≤ −0.732 and 264.845 otherwise.
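To make the decision path concrete, the tree of Figure 2.6 can be written out as nested conditions. This is a minimal sketch in which the thresholds and leaf values are read off the figure; the assignment of the leaves to the two branches of each eta split is an assumption:

```python
def predict_pz(jet1_pz, jet1_eta):
    """Follow the decision path of the example tree and return the leaf value."""
    if jet1_pz <= -0.732:            # root split
        if jet1_eta <= -1.623:       # left inner node (leaf ordering assumed)
            return -616.059
        return -161.567
    if jet1_eta <= 1.614:            # right inner node (leaf ordering assumed)
        return 158.681
    return 614.24
```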


A building algorithm for decision trees has to find these splits and determine the values in every leaf.

Unfortunately, the run time of all known algorithms that build an optimal decision tree (optimal in many senses) grows exponentially. Thus, it is computationally not feasible to build the optimal tree.

That is the reason why decision trees are usually built in an iterative, greedy fashion: At the beginning, all the events start in the root of the tree. The value of a node as seen in Figure 2.6 is the mean of all target values of events in that node. This means that in this case the mean pz is 0.702.

Each node has an impurity, which measures how different the target values of the events in that node are. In this case it is "mse", the mean squared error. For this tree the impurity of the root is $\frac{1}{311823}\sum_{i=1}^{311823} (p_{z,i} - 0.702)^2 = 187464.496$.

To find a split that divides the events within a node into two children, every possible split on every variable is considered and its impurity decrease

$$\Delta I = \frac{N_{\text{node}}}{N_{\text{total}}} \left( I_{\text{node}} - \frac{N_{\text{right}}}{N_{\text{node}}} I_{\text{right}} - \frac{N_{\text{left}}}{N_{\text{node}}} I_{\text{left}} \right) \qquad (2.1)$$

is calculated. Here, N denotes the number of events and I the impurity. The subscript "node" corresponds to the current node and "left" and "right" to its children.

Finally, the split with the highest impurity decrease is chosen. This method makes locally optimal decisions even though they might not lead to the globally optimal split. Note that the impurity in a child node can be higher than the impurity in its parent. This happens when a large portion of the data can be grouped together by a split such that its impurity is small. The small remaining portion of the data may not be as similar as the rest, but its increased impurity is multiplied by a small number and hence has a smaller impact, so the overall impurity still decreases.
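Equation (2.1) can be written as a small helper function. This is a minimal sketch, assuming the event counts N and impurities I of a candidate split are already known:

```python
def impurity_decrease(n_node, n_left, n_right, i_node, i_left, i_right, n_total):
    """Weighted impurity decrease of a candidate split, cf. Equation (2.1);
    the greedy tree-building algorithm picks the split maximizing this value."""
    return (n_node / n_total) * (
        i_node
        - (n_right / n_node) * i_right
        - (n_left / n_node) * i_left
    )
```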

The building procedure stops when there are no possible splits left, i.e. a node is a leaf if it is not possible to split the data such that the impurity decreases. It is also possible (and sometimes useful) to pose restrictions on tree growth to avoid overfitting. Similarly, a node is a leaf if there is no split left that decreases the impurity while satisfying all posed restrictions. Those restrictions are also called regularization.


Intuitively, a machine learning model is said to overfit if it stops generalizing patterns and starts to learn details about the training data. A perfect example of an overfitting decision tree would be a tree that has one leaf for every training event and therefore, predicts the value of that single event. Hence, it is perfectly accurate for the training data. From an intuitive point of view, one would not say that this tree has ”learned” anything in terms of finding patterns. It just memorized the data.

A simple example of such a restriction is fixing the maximal depth (and therefore the number of leaves). The tree shown in Figure 2.6 is built by requiring a maximal depth of 2. The depth is also the maximal number of splits on a path from the root to a leaf.

It is generally a good idea to limit the number of leaves to reduce overfitting. Another way to do this is to require that a leaf has to contain at least x% of the training data. Doing so, the prediction value of a leaf is found by considering several data points and tends to be less influenced by details of the training set.

Another possibility to restrict the learning of the decision tree is to limit the maximal number of features considered for a split. This way, the tree might not find the best split and has to use a slightly worse split of the data. Thus, it is less likely that the tree learns details of the training data. Note that (at least in the scikit-learn implementation) features will be inspected until a valid split is found, even if this requires considering more features than the maximal number of features parameter specifies. For example, if the parameter is set to 1, such that only splits on 1 feature should be considered, but there is no valid split on that feature, splits on another feature are evaluated and chosen if they are valid. A nice benefit of this parameter is that it also reduces computation time, because fewer splits are considered.

The fraction of the total impurity decrease accumulated while building the tree that is achieved by splits on a specific feature is called the feature importance of that feature.

Doing these calculations for the tree shown in Figure 2.6 yields a decrease in impurity of 70933.90 for pz, achieved in the initial split, and decreases of 18349.08 and 8701.36 for the left and right eta splits. Therefore, the feature importances for this tree are 65.69% for pz and 34.31% for eta.
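A minimal sketch of this normalization step, assuming the impurity decreases of all splits have already been summed per feature (function and names are illustrative):

```python
def feature_importances(decrease_per_feature):
    """decrease_per_feature: dict mapping a feature name to the summed
    impurity decrease of all splits on that feature."""
    total = sum(decrease_per_feature.values())
    return {name: value / total for name, value in decrease_per_feature.items()}
```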

Jerome H. Friedman proposed a different split criterion for decision trees which is (usually) used for gradient boosting [18], referred to as "friedman_mse". Instead of maximizing the impurity decrease, the split that maximizes the expression $\frac{N_{\text{left}} N_{\text{right}}}{N_{\text{left}} + N_{\text{right}}} \left(\bar{y}_{\text{left}} - \bar{y}_{\text{right}}\right)^2$ is chosen. Impurities and impurity decreases are still calculated and used for the feature importance as shown before, using Equation (2.1).

2.2.2 Gradient boosting

The content of this section is covered in the ensemble documentation of the scikit-learn reference [19]. Also, the calculation presented below follows the derivation there closely.

Unfortunately, single decision trees are not very powerful. Boosting techniques aim to construct a strong machine learning model out of a collection of weaker ones by combining their predictions.

Gradient boosting is one particular way to construct this strong model. In our case, the weaker models will always be decision trees, but the method would work with any type of regression model, even though scikit-learn only supports decision trees.

Assume we have already trained the first k decision trees $d_1, \ldots, d_k$ used to construct the strong model G, which is the sum of the individual trees and an initial prediction $g_0$, which is a constant by default.

Denote the training set by $(x_i, y_i)$, where $x$ denotes the input and $y$ the target.

The prediction of the whole model is the sum of the initial prediction and the predictions of each tree such that

$$G(x_i) = g_0 + \sum_{j=1}^{k} d_j(x_i) \qquad (2.2)$$

In general, this is not equal to $y_i$.

Of course, we would like to have $G(x_i) = y_i$ after adding the (k+1)-th decision tree. To find out how to do this, we use the concept of a loss function.

A loss function measures how "bad" the prediction is compared to the correct value. The higher the loss, the worse the prediction. The perfect prediction has a loss of 0. There are many possible loss functions with specific


properties, but we will only consider the least square loss here and use it later on. The least square loss is defined by

$$L(G(x_i), y_i) = (y_i - G(x_i))^2 \qquad (2.3)$$

For this loss, the initial prediction $g_0$ is the mean of the training targets, since it minimizes the squared error.

Note that this calculation could be done for any differentiable loss function.

For the general case, one would use an approximation by the Taylor expansion of the loss and obtain the same result. But since only the least square loss is considered here, we can do the calculation explicitly.

To minimize the total loss, we have to minimize it for every event in the training set individually. Including the (k+1)-th decision tree and substituting into Equation (2.3), we obtain that

$$L(G(x_i) + d_{k+1}(x_i), y_i) = y_i^2 - 2 y_i \big(G(x_i) + d_{k+1}(x_i)\big) + G(x_i)^2 + 2\, G(x_i)\, d_{k+1}(x_i) + d_{k+1}(x_i)^2 \qquad (2.4)$$

is the expression that we want to minimize by training the (k+1)-th decision tree.

This yields a quadratic expression in the prediction of the (k+1)-th decision tree, $d_{k+1}(x_i)$, which is the only term that can be varied in this step. Equation (2.4) is minimized if and only if $d_{k+1}(x_i)^2 + 2\,(G(x_i) - y_i)\, d_{k+1}(x_i)$ is minimized. Using basic calculus, we get that this happens for $d_{k+1}(x_i) = y_i - G(x_i)$, which is proportional to the gradient $\frac{\partial L(G(x_i), y_i)}{\partial G(x_i)}$ of the loss.

By the above calculation, we found that we have to train the (k+1)-th decision tree to predict a multiple of the gradient of the loss function, hence the name gradient boosting.

This result makes sense intuitively: given a prediction $G(x_i)$, we obtain a better prediction by trying to find the error $y_i - G(x_i)$ and adding these two predictions together.
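A minimal sketch of this procedure for the least square loss, fitting each new tree to the residuals of the current model with scikit-learn decision trees; the number of trees and the tree settings are illustrative, and this is not the GradientBoostingRegressor used later in the thesis:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gradient_boosting(X, y, n_trees=100, max_depth=3):
    """Fit n_trees regression trees, each predicting the residual (i.e. the
    negative gradient of the least square loss) of the current model."""
    g0 = y.mean()                       # constant initial prediction
    prediction = np.full(len(y), g0)
    trees = []
    for _ in range(n_trees):
        residual = y - prediction       # y_i - G(x_i)
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residual)
        prediction += tree.predict(X)
        trees.append(tree)
    return g0, trees

def predict(X, g0, trees):
    return g0 + sum(tree.predict(X) for tree in trees)
```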

By posing restrictions on the individual trees, one obtains many regularization methods for the gradient boosted model. This way, all of the above mentioned regularization methods for decision trees can be applied.


Another possible parameter is the learning rate ε, which reduces the contribution of each decision tree to the prediction. Instead of just adding the contributions of the decision trees together, they are multiplied by the constant ε, which is usually smaller than 1, so that $G(x_i) = g_0 + \varepsilon \sum_{k=1}^{n} d_k(x_i)$. When using decision trees, we can also define the feature importance for the boosted model by averaging the feature importance over the individual trees.

This gives a possibility to analyze the learning of the boosted model.

2.2.3 Training, evaluation and model selection

The content of this section is covered in the scikit-learn reference on model selection [20].

Consider the overfitting decision tree from above. It is easy to discover that such a tree has not learned anything. When given the task to predict values for data that it has not seen before, the performance would probably be much worse.

That is the reason why one tries to avoid evaluating a machine learning model on the data that was used for training. Usually, one set is used for training, the training set, and another one is used for evaluating the performance, usually called the test set. Doing it that way, every model has to show how good it really is on new/unseen data.

Often, one divides the training set again into a "real" training set used for training and a validation set for model selection. The validation set plays a similar role as the test set and is therefore also not used in training. If several possible models (for example with different settings of the parameters that restrict the building process of a decision tree) are trained on the training set, one selects the one that performs best on the validation set.
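A minimal sketch of such a split with scikit-learn on stand-in data; in this thesis X would hold the kinematic input variables of the events and y one of the top momentum components:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative stand-in data, not the thesis data sets.
X = np.random.normal(size=(1000, 5))
y = np.random.normal(size=1000)

# Keep 20% of the (training) data aside as a validation set.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
```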

Using the test set for model selection and performance evaluation might introduce a bias towards the selected model because it already prefers the models that tend to perform well on the test set.

There are a variety of metrics to evaluate the performance of a regression model.


The mean absolute error (mae) is the mean of the absolute deviations between the predictions $\hat{y}_i$ and the correct values $y_i$: $\frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|$. It represents the typical error of the prediction.

The mean squared error (mse) is the mean of the squared deviations: $\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$. Consider two predictions with a mean absolute error of 1: in the first one, every absolute error is exactly 1; in the second one, 50% of the data was predicted perfectly (no error) and the other half of the predictions have an absolute error of 2 each. Even though the mean absolute error is equal, the mean squared error differs: it is 1 for the first prediction and 2 for the second one.

The mean squared error is, when compared to the mean absolute error, a measure for how much the errors tend to fluctuate. If all absolute errors are identical, the mean squared error is just the square of the mean absolute error. Otherwise, it will be larger than the square of the mean absolute error.

The max error is the maximal absolute error made: $\max_{1 \le i \le n} |y_i - \hat{y}_i|$. It shows the range of the occurring errors, but since it represents only one event, it is not the most powerful metric.

The coefficient of determination (short: $R^2$) is another measure for the quality of the prediction. It is defined as $R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$, where $\hat{y}_i$ is as usual the i-th prediction and $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$ the mean of the target values. It is always ≤ 1, and a prediction that is always equal to the mean has an $R^2$ of 0.

Originally, $R^2$ was introduced as a measure in linear regression. Due to the special properties of linear regression, the coefficient of determination is there exactly equal to the square of the correlation R, hence the name $R^2$, even though this does not hold for a general regression method [21]. The name $R^2$ can therefore be a bit misleading: the coefficient of determination can be negative, as the predictions of a model can become arbitrarily bad.

Since it depends on the variance within the data set and not only on the deviations, one should be careful when comparing $R^2$ values that were calculated for different data sets.
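All four metrics are available in scikit-learn; a minimal sketch on stand-in numbers:

```python
import numpy as np
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             max_error, r2_score)

y_true = np.array([1.0, 2.0, 3.0, 4.0])   # stand-in targets
y_pred = np.array([1.1, 1.9, 3.5, 3.0])   # stand-in predictions

print("mae:      ", mean_absolute_error(y_true, y_pred))
print("mse:      ", mean_squared_error(y_true, y_pred))
print("max error:", max_error(y_true, y_pred))
print("R^2:      ", r2_score(y_true, y_pred))
```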

The correlation R measures the linear dependence between two variables. It is normalized to have values between -1 and 1. A positive correlation (close to 1) between two variables means that they tend to rise and fall together, i.e. as one of them increases, the other one increases as well. Similarly, a


negative correlation (close to -1) means that one of the variables increases when the other one decreases and the other way around.

The above mentioned restrictions on the growth/learning of a decision tree are examples of hyper parameters. The usage of the least square loss for gradient boosting is another example of a hyper parameter. Since there are many possible combinations, it is not clear which one is the best.

There are multiple strategies to find the optimal parameter combination.

Probably the simplest one (and the one used here) is the exhaustive grid search.

In order to do an exhaustive grid search, one has to pick some possible values for each of the parameters that shall be optimized, train a model with every possible combination of parameters, evaluate it on the validation set and pick the one that is the best with respect to some criterion (in this case: maximal $R^2$).

Some advantages of this method are the easy implementation and the fact that the search can be performed in parallel on multiple machines with almost no effort. On the other hand, if the optimal combination is not among the selected ones, i.e. it is not on the discrete grid, it won't be found by this approach, and some time might be wasted on parameter combinations that are far from optimal.
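A minimal sketch of such a grid search, looping over all parameter combinations and scoring each on a fixed validation set; the data and the grid are illustrative stand-ins, not the full grid of Table 3.3:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import ParameterGrid

# Stand-in training and validation data.
X_train = np.random.normal(size=(400, 5)); y_train = np.random.normal(size=400)
X_val   = np.random.normal(size=(100, 5)); y_val   = np.random.normal(size=100)

# Illustrative subset of a parameter grid.
grid = ParameterGrid({
    "max_depth": [4, 6, 8],
    "learning_rate": [0.1, 0.01],
    "max_features": [0.25, 0.5, 1.0],
})

best_score, best_params = -np.inf, None
for params in grid:
    model = GradientBoostingRegressor(n_estimators=100, **params)
    model.fit(X_train, y_train)
    score = model.score(X_val, y_val)   # R^2 on the validation set
    if score > best_score:
        best_score, best_params = score, params

print(best_params, best_score)
```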


3 Methods

3.1 Event generation and processing

For data taken at a collider experiment, it is not possible to determine with certainty which process occurred and which particles were involved. Especially for top quarks, which cannot be detected directly, it is difficult to determine whether one of them was present in an event or not. Therefore, generated events are widely used. Instead of using real data, one simulates the collision and the measurements of the resulting particles in a detector to obtain an output that contains data similar to the measurements of a real detector and, in addition, the information which particles were involved in which process.

This way, it is possible to develop new analysis methods and improve existing ones by comparing analysis results to the real quantity or by focusing on one specific process.

The tool MadGraph5_aMC@NLO [22] (version 2.7.0), together with the PDF set NNPDF23_lo_as_0130_qed and in combination with Pythia 8.2 [23] and Delphes [6, 24] (version 3.4.2), is able to generate the underlying processes, simulate events based on them, perform decays and hadronization, and finally simulate the detector. The output is saved in a ROOT file, which can be read and converted into numpy [25] arrays by the Python module uproot.

Two million pp → tt̄ events in the dilepton channel were generated at a center of mass energy of 13 TeV using the above mentioned tools to obtain one of the data sets. The decay W+ → τ+ + ντ and the respective decay for


the W− were included in the simulation. Since Delphes is unique to this production chain, we will refer to this data set by mentioning Delphes explicitly, as in "Delphes output". Those events are separated into two sets of one million events each to obtain a training and a test set.

Furthermore, NanoAOD samples were used. NanoAOD simulations are produced centrally using POWHEG v2 with the NNPDF31_nnlo_hessian_pdfas PDF set at NLO, together with NLO matching of matrix element and parton shower simulation as proposed by S. Prestel [26]. They are distributed as ROOT files and can therefore be processed similarly to Delphes output, even though NanoAOD and Delphes output might contain different variables. Also, while NanoAOD performs a full simulation of the CMS detector, Delphes contains a simplified simulation of a detector.

There are only minor differences in the structure of the output between NanoAOD and Delphes, such that the same tools can be used to process the data. One difference in the format of the data is a change in naming conventions: lowercase variable names are used instead of names starting with capital letters, e.g. "px" instead of "Px", and MET_MET instead of MET_pt. The naming convention of the respective set will be carried through.

Also, the information whether a jet is suspected to have contained a bottom quark or not is stored differently in Delphes and NanoAOD output. In Delphes, for a jet that probably contained a bottom quark, a 1 is stored as btag, and jets that probably did not contain a bottom quark get the btag 0. So btag has 2 discrete values in Delphes, and every jet having a 1 is considered to be a btagged jet.

In NanoAOD, the used tagging method (btagDeepB) is based on a classifier that outputs a number between 0 and 1 that is stored as btag. This can be interpreted as the probability that a jet contained a bottom quark. Here, jets having such a probability of at least 0.2770 are considered as btagged jet.

In addition, NanoAOD contains the type of the parton from which the jet originated. That is obtained by matching a jet to a parton from the simulation that is geometrically close to the jet. This information is stored as partonFlavour and directly yields which jet contained the bottom and which jet contained the antibottom.

Each event is required to have exactly 2 btagged jets and 2 leptons of opposite charge. 63,447 out of the first million generated Delphes events and 63,287 from the second million Delphes events fulfilled these criteria.

For NanoAOD, 180,721 out of 1,190,000 and 170,053 of 1,120,000 fulfilled


Type   | shortcut | variables
lepton | lept     | pt, eta, phi, px, py, pz, E
jet    | jet      | pt, eta, phi, mass, px, py, pz, E, btagDeepB (NanoAOD)
MET    | MET      | MET (called pt in NanoAOD), phi, px, py

Table 3.1: Overview of the input variables used for the reconstruction.

  | lept1 PT  | lept2 Px   | jet1 Eta  | jet2 Py    | MET Phi
0 | 63.683491 | -13.405689 | -1.785956 | 38.043738  | 0.850217
1 | 20.479212 | 89.475971  | -1.248753 | 40.449163  | -2.370134
2 | 15.019216 | 56.703254  | -1.334634 | 30.390645  | -2.407620
3 | 63.861282 | 36.655279  | -0.035643 | -60.828295 | -0.564938
4 | 80.606110 | 59.085128  | -0.589899 | 33.034572  | -0.127876

Table 3.2: Table containing some of the variables of the first 5 events in Delphes unsorted.

these criteria. No further cuts were applied. It should be mentioned that the Delphes simulation already applies cuts during the object reconstruction.

To also evaluate the influence of this, only the requirement for leptons to have a pT ≥ 10 GeV was adapted to NanoAOD. Therefore, in NanoAOD all leptons having pT < 10 GeV were ignored. This probably explains the difference in the number of selected events.
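A minimal sketch of this selection; the per-event lists of jet b-tag values and lepton charges/pT used as inputs are assumptions about the data layout, not the thesis code (for Delphes, where btag is already 0 or 1, the same threshold logic applies):

```python
def select_event(jet_btags, lepton_charges, lepton_pts, btag_threshold=0.2770):
    """Return True if the event has exactly 2 b-tagged jets and exactly
    2 opposite-charge leptons with pT >= 10 GeV."""
    charges = [q for q, pt in zip(lepton_charges, lepton_pts) if pt >= 10.0]
    n_btagged = sum(b >= btag_threshold for b in jet_btags)
    return n_btagged == 2 and len(charges) == 2 and charges[0] * charges[1] < 0
```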

Table 3.1 shows the variables used as input. Those are the most commonly used kinematic variables as introduced earlier. There are some redundancies, since Cartesian coordinates and polar coordinates together with the pseudo-rapidity are used. Even though changes of coordinates are mathematically possible, a decision tree cannot perform such a calculation, so it might be beneficial to provide the kinematic variables in both coordinate systems.

The target variables, the top momenta, were taken from the top quark with the pythia8 status code 62.

The data is then represented in a pandas data frame such that each row corresponds to an event and each column corresponds to a variable as shown in table 3.2.

Columns are named as follows: (shortcut)(number) (variable). If there is only one column of a type, the number is omitted. If possible, the entries in columns with the number 1 belong to the top decay and, similarly, the number 2 belongs to the antitop decay.


This works well for leptons, since the lepton in the top decay will always be positive while the lepton in every antitop decay is negative, and exactly 2 leptons of opposite charge were required. Sadly, it is not trivial to assign the correct number to a btagged jet. There are several possibilities to address this problem. The simplest one is of course to ignore it and sort the jets into columns by some property, like pT. This procedure yields the first two data sets: "Delphes unsorted" and "NanoAOD unsorted", where jet1 is the jet that has the higher pT. This could of course also be done for real data.

On the other hand, one can obtain simple solutions to this problem by using generator information to determine the correct assignment.

Unfortunately, the Delphes output does not contain information that allows doing this directly, but we can try to approximate it:

If everything were measured and reconstructed perfectly, we would have $p_{\text{top}} = p_{\text{jet1}} + p_{\text{lept1}} + p_{\text{neutrino}}$. Based on that, one can try to find a better assignment using this generator information. A simple way is to let jet1 be the jet such that the absolute difference in the x-component of the above equation is minimized.

This yields the third data set used, "Delphes sorted", made from the same simulation data as "Delphes unsorted" but formatted differently.
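A minimal sketch of this assignment; the function and argument names are illustrative, and only the x-components enter the comparison, as described above:

```python
def assign_jet1(jet_px_candidates, lept1_px, neutrino_px, top_px):
    """Return the index of the b-tagged jet that, combined with the positive
    lepton and the neutrino, best matches the generated top p_x."""
    return min(
        range(len(jet_px_candidates)),
        key=lambda i: abs(top_px - (jet_px_candidates[i] + lept1_px + neutrino_px)),
    )
```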

Even though this simple approach will not yield the perfect assignment, it makes it possible to study how much accuracy this machine-learning-based reconstruction method would gain if there was a procedure to determine which jet is which.

For NanoAOD, the partonFlavour information can be used to sort the jets and build the data set "NanoAOD sorted". In order to allow this sorting method to work consistently, it is necessary to change the selection criterion on jets. Since there might be btagged jets that did not really emerge from a bottom quark, only events are selected that have exactly one jet with partonFlavour 5 (bottom quark) and exactly one jet with partonFlavour -5 (antibottom quark). Conversely, some bjets were not btagged, so there are also new events included in "NanoAOD sorted". Based on the same simulations as for NanoAOD unsorted, 311,823 out of 1,190,000 and 292,518 of 1,120,000 events were selected. The difference compared to the unsorted set comes from the fact that only 75% of the jets in NanoAOD sorted are btagged, so (assuming independence) only $0.75^2 \approx 0.56$, i.e. 56%, of the events are expected to be recovered for NanoAOD unsorted.


[Figure 3.1 panels: (a) Delphes, jets sorted by pT; (b) Delphes, jets sorted such that $|p_t^{\text{reco}} - p_t|$ is minimized; (c) NanoAOD, jets sorted by pT; (d) NanoAOD, jets sorted with generator information.]

Figure 3.1: Two dimensional histogram of the points $(p_t, p_t^{\text{reco}})$ for each data set. If everything in every event were treated perfectly, the points would be of the form (x, x) and the histogram a diagonal.

Hence, it must be noted that while "Delphes unsorted" and "Delphes sorted" contain the same events with differently formatted input, "NanoAOD unsorted" and "NanoAOD sorted" might contain different events. This reduces the comparability of these two sets.

Anyway, this allows further study of the effect of correct assignments, since it can be assumed that the assignment of jets can be done perfectly by using generator information.

To evaluate how well these jet assignment methods work, one can use the neutrino momentum from the simulation, reconstruct the top momentum via $p_{\text{top}}^{\text{reco}} = p_{\text{jet1}} + p_{\text{lept1}} + p_{\text{neutrino}}$, and compare this to the correct top momentum.

Figure 3.1 shows that, as expected, the sorting performance increases when trying to minimize the reconstruction error and increases further when using generator information.

3.2 Machine learning methods

For each of the below mentioned machine learning techniques, the implementation from scikit-learn [27] is used.

Every time, two similar data sets were generated. One of them is used for training and model selection; it is split into a training set containing 80% of the events and a validation set containing the remaining 20% of the events. The other one is used as a test set for evaluating the performance.

Decision tree regressors are used to gain a few first insights. Eventually, gradient boosted decision trees will be evaluated as the regression model for this problem, because they are a powerful and still comprehensible machine learning method. Since this is based on decision trees, it is possible to adapt the analysis developed for decision trees in order to see how the trained model works.

All methods are based on decision trees, therefore it is not necessary to scale the data since the building rules of decision trees are invariant under such transformations.

3.2.1 Hyper parameter optimization

Each of the above mentioned regression models depends on parameters set before training. In order to find the best parameter combination, an exhaustive grid search was made to evaluate all the possible combinations out of multiple values for each parameter.

The searched parameter space is spanned by the values shown in Table 3.3. Every possible combination was trained on the training set and the performance was evaluated on the validation set using different metrics ($R^2$, mean absolute error, max error and mean squared error).


Method                     | Parameter                  | Possible values
Decision tree              | max depth                  | 10-20
                           | split criterion            | mse, friedman_mse
                           | minimum samples per leaf   | 0.01%, 0.1%
                           | maximal number of features | 1, 25%, 50%, 75%, all
GradientBoosting regressor | number of trees            | 2000
                           | max depth                  | 4-8
                           | maximal number of features | 1, 25%, 50%, 75%, all
                           | learning rate              | 1, 0.1, 0.01, 0.001
                           | minimum samples per leaf   | any, 0.1%, 0.01%

Table 3.3: Parameter ranges used for hyper parameter optimization. Percentages in maximal number of features represent the fraction of input variables considered for a split. For minimum samples per leaf, percentages represent the minimum fraction of events that must be contained in a leaf for it to be built.

3.2.2 Effect of the correct jet permutation

Since generator information was used to sort the jets into columns, it is interesting to investigate how this affected the performance.

This is only analyzed for the NanoAOD data set because it is the only one where the correct assignment is known. This makes it possible to create data sets such that the fraction of incorrectly assigned jets is $p \in [0, 0.5]$. As decision trees do not have any initial knowledge or understanding of what the columns represent, every fraction above 0.5 corresponds to a fraction below 0.5 after swapping the columns, which has no effect on the training.
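A minimal sketch of how such data sets could be produced by swapping the jet1/jet2 columns in a random fraction p of the events; the column names follow the naming scheme above but are an assumption, and the data frame df stands for one of the sorted NanoAOD tables:

```python
import numpy as np

def swap_jet_columns(df, p, variables=("pt", "eta", "phi", "mass", "px", "py", "pz", "E"),
                     seed=0):
    """Return a copy of the pandas DataFrame df in which the jet1/jet2 columns
    are swapped for a random fraction p of the events."""
    rng = np.random.default_rng(seed)
    swapped = df.copy()
    mask = rng.random(len(df)) < p
    for var in variables:
        a, b = f"jet1 {var}", f"jet2 {var}"
        # .to_numpy() drops the column labels, so the assignment swaps values.
        swapped.loc[mask, [a, b]] = df.loc[mask, [b, a]].to_numpy()
    return swapped
```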

Here, an exhaustive grid search over a smaller parameter space was performed for p from 0% to 50% in steps of 5%. The parameter ranges were inspired by the parameter combinations that seemed to yield good results in the case of the normal NanoAOD sorted data set and are shown in Table 3.4.


Parameter                  | Possible values
number of trees            | 2000
max depth                  | 5-8
maximal number of features | 25%, 50%, 75%, all
learning rate              | 0.1, 0.01
minimum samples per leaf   | 1%, 0.1%, 0.01%

Table 3.4: Parameter ranges used for hyper parameter optimization when evaluating the effect of the correct jet permutation. Percentages in maximal number of features represent the fraction of input variables considered for a split. For minimum samples per leaf, percentages represent the minimum fraction of events that must be contained in a leaf for it to be built.


4 Results

After performing the above described searches, the regressor having the highest coefficient of determination (R2) was selected for each data set and feature respectively. Their predictions on the test set and properties like feature importance will be the subject of this chapter.

4.1 A single decision tree

Since gradient boosting uses a collection of decision trees, it is helpful to analyze the performance and properties of a single decision tree first.

The grid search over the parameter space listed in Table 3.3, for the decision tree predicting the momentum along the beam axis (pz) of the NanoAOD sorted data set, yields the best parameter combination:

• maximal depth: 12

• split criterion: mse

• minimum samples per leaf: 0.1 %

• maximal number of features: all

The usual way to present the predicted and true values will be a two dimensional histogram of points of the form (truth, prediction), as already used in Figure 3.1.



Figure 4.1: Two dimensional histogram of the points $(p_{z,\text{top}}, p_{z,\text{top}}^{\text{predicted}})$ for pz in NanoAOD sorted. If everything were predicted perfectly, the histogram would have the shape of the orange (x, x) line.

Figure 4.1 shows the prediction of the decision tree. Since decision trees follow a decision path with a leaf node containing the prediction at the end, there is only a discrete number of distinct predictions possible. This behaviour yields the discrete horizontal stripes in the figure. Ensemble learning will help to smooth out the jumps in the predictions.

It is also possible to look at the decision tree structure. Since the depth of this particular tree is up to 12 and it has 720 leaves that would be next to each other, it is not possible to picture it completely here, but it is possible to visualize the first few levels.

We also can see that early splits based on pz andη of particles that originate from the top decay yield the best partition of the data.

It is also possible to determine which variables yield the best splits in the training process by looking at the feature importances (the fraction of decrease in impurity achieved by splitting on this feature).

For the whole tree we obtain the feature importances:
