(1) Max-Planck-Institut für Physik / Technische Universität München

Deep Learning Based Pulse Shape Analysis for GERDA

Philipp Holl, October 2, 2017
Master's thesis in Nuclear, Particle and Astrophysics
Supervisor and Referee: Béla Majorovits
Secondary Referee: Friedemann Reinhard

(2) Deep Learning Based Pulse Shape Analysis for GERDA. Philipp Holl, October 2, 2017.

Abstract. The Gerda experiment searches for the hypothesized, very rare neutrinoless double beta (0νββ) decay in 76Ge. Pulse shape discrimination is used to reduce the background index and thereby improve the limits on the half-life of the decay. Radioactive decays constitute a major component of the background and, unlike the 0νββ decay, often deposit energy at multiple locations inside one germanium detector. Current machine learning based pulse shape discrimination methods built to recognize these events require more labeled training data to prevent overfitting than is available. In this thesis, a new deep learning based two-stage discrimination scheme is presented which combines unsupervised learning for nonlinear information reduction with a second, supervised classification step. The resulting algorithm does not suffer from overfitting as all of the mostly unlabeled calibration data can be used for training. This algorithm is evaluated on Gerda Phase IIa data including events from calibrations with radioactive sources as well as about half a year of physics data. The two-stage approach is shown to perform approximately on par with the current implementations on calibration data while rejecting more background events in physics data.

(3) Contents

1 Introduction  4
  1.1 Neutrinoless double beta decay  4
  1.2 The Gerda experiment  5
  1.3 Pulse shapes of germanium detectors  6
  1.4 Pulse shape discrimination in Gerda  7
    1.4.1 Current PSD methods  7
    1.4.2 Thorium calibrations  7
    1.4.3 Limitations of current PSD methods  8
  1.5 New deep learning possibilities  9
2 Analysis software and computing  10
  2.1 Current Gerda analysis software  10
  2.2 GERDADeepLearning.jl - A new open-source framework  10
  2.3 Hardware  11
3 Data selection and preprocessing  12
  3.1 Selection of training data  12
  3.2 Preprocessing of waveforms  12
4 Artificial neural networks  15
  4.1 Types of layers  15
  4.2 Training process of a neural network  17
  4.3 Unsupervised learning and autoencoder  17
  4.4 Training, validation and test sets  19
5 Nonlinear information reduction of pulses  20
  5.1 Network layout and optimization  20
  5.2 Reconstruction accuracy and sensitivity to noise  22
  5.3 Limitations and possible further applications of the autoencoder  27
6 Results: single-site and multi-site event discrimination efficiencies  31
  6.1 Layout and training of classifying network  31
  6.2 Comparison to one-stage classification  32
  6.3 Efficiencies on peaks, Compton region and 2νββ compared to current implementations  33
  6.4 Investigation of systematic uncertainties  36
    6.4.1 Selection of training data  36
    6.4.2 Energy dependence  36
    6.4.3 Time dependence  38
  6.5 Evaluation on physics data  38
7 Conclusions and outlook  41

(4) 8 Acknowledgements  42
9 References  43
A Autoencoder reconstruction accuracy by detector  46
B Classification performance and energy dependence by detector  82

(5) 1 Introduction

The standard model of particle physics is one of the most successful models in physics, correctly predicting the behavior of particles up to the TeV energy scale [1]. Though, as of this writing, no direct test has found any significant deviation from the standard model, we know for several reasons that it is incomplete:

• The imbalance in matter and antimatter is one of the unresolved fundamental problems in physics. The standard model cannot explain the extent of baryogenesis, the overproduction of matter in the early universe [2].

• In the standard model, neutrinos are assumed massless. However, neutrino oscillations have proven that assumption wrong and provided values for the mass differences of the different neutrino types. The absolute masses of the neutrinos are still unknown, yet upper limits from cosmology place the sum of all three neutrino masses at under 0.3 eV, depending on the theoretical model [3]. It is not even clear which of the neutrinos has the highest mass, a problem referred to as the hierarchy problem [4].

Many extensions of the standard model [5, 6, 7] rely on neutrinos having a Majorana component. Majorana particles differ from the standard Dirac particles in that they are their own antiparticles [8]. This property could account for baryogenesis by permitting a variety of lepton number violating processes which are forbidden in the standard model. It could also explain why neutrinos are so light, their masses being at least six orders of magnitude lower than the mass of the electron. Because the cross sections of interactions between neutrinos and other standard model particles are very small, a high sensitivity is required to detect a new process. This leaves only a handful of experimental possibilities to probe the nature of the neutrino, one of which is the neutrinoless double beta decay.

1.1 Neutrinoless double beta decay

Neutrinoless double beta (0νββ) decay, depicted in figure 1, is a lepton number violating process in which two neutrons simultaneously transform into two protons [9]. In contrast to the regular double beta decay, which is allowed by the standard model and has been observed with long half-lives for several isotopes [10], the 0νββ version emits no neutrinos. This is only possible if the neutrino is its own antiparticle, as the electron antineutrino from one decay could be absorbed by the second decay as an electron neutrino, increasing the lepton number by two.

[Figure 1: Feynman diagram of the neutrinoless double beta (0νββ) decay: two neutrons convert into two protons, each emitting an electron at a W- vertex. The blue line represents the neutrino interaction connecting the two vertices.]

The half-life of the 0νββ decay scales inversely with the Majorana mass component of the neutrino [11]:

$$ \langle m_{\beta\beta} \rangle = \Big| \sum_{i=1}^{3} U_{ei}^{2}\, m_i \Big| \;\propto\; \frac{1}{\sqrt{T_{1/2}}} $$

(6) Setting lower limits on the 0νββ half-life therefore puts upper limits on the Majorana mass. In principle, any radioactive isotope with two succeeding beta decays in the decay chain can perform a 0νββ decay. For the purpose of detecting this decay, however, it is necessary to pick an element for which the single beta decay is energetically forbidden. Existing and planned experiments are therefore looking for the 0νββ decay in 130Te (CUORE [12, 13], SNO+ [14]), 136Xe (KamLAND-Zen [15], EXO-200 [16], NEXT-100 [17]) and 76Ge (Gerda [18], Majorana [19]).

1.2 The Gerda experiment

The Germanium Detector Array (Gerda) [20] searches for the 0νββ decay of 76Ge. Its goal is to either discover the decay or set limits on its rate and thereby on the Majorana mass component of the electron neutrino. So far, no 0νββ decay has been observed and a half-life limit of T1/2 > 5.3 × 10^25 yr, resulting in a limit on the effective Majorana mass of mββ < 0.15-0.33 eV (90% C.L.), has been published [21].

Gerda consists of 40 enriched, high-purity 76Ge crystals with a total mass of 35.6 kg which simultaneously act as semiconductor detectors. This allows for a good energy resolution of 2.6-4.4 keV for radiation produced by the ββ decays inside the detector. As 0νββ decays only emit electrons, the difference in binding energy Qββ = 2039 keV is distributed solely to their motion and can thus be accurately measured, clearly separating 0νββ from 2νββ events [22].

[Figure 2: Artist's view (Ge array not to scale) of the Gerda experiment [20], showing the germanium array inside the liquid argon tank (-196°C), which is surrounded by the water tank for the muon veto.]

With the available limits on the 0νββ decay, less than one event per year is expected in Gerda. This extremely low signal rate puts high constraints on the background rate of the experiment. To minimize events induced by cosmic radiation, the whole setup, shown in figure 2, is located at the underground Laboratori Nazionali del Gran Sasso (LNGS) of the INFN in Italy, underneath about 1.4 km of rock with a shielding equivalent of 3.5 km of water. Furthermore, the detectors are surrounded by ultrapure liquid argon (LAr) for cooling and shielding. Photomultipliers at the top and the bottom of the cylindrical LAr tank can detect scintillation light emitted by LAr upon energy deposition by background radiation (LAr veto).

(7) A water tank surrounding the liquid argon, equipped with additional photomultipliers, can detect Cherenkov light of cosmic muons passing through the experiment (muon veto). Additionally, the detectors themselves act as veto devices, as simultaneous energy depositions in more than one detector are excluded as potential 0νββ candidates.

In order to prevent a scientific bias towards either discovery or exclusion, the data available to the Gerda collaboration is blinded. All events within a 50 keV window around Qββ = 2038.5 keV are excluded from the normal analysis and are only processed for special unblindings which happen approximately once a year.

1.3 Pulse shapes of germanium detectors

There are two types of germanium detectors used in Gerda: Broad energy germanium (BEGe) and semi-coaxial [21, 20], which are depicted in figure 3. While the two types differ in geometry, their operating principles as well as the read-out electronics are identical. High voltage is applied between the borated p+ contact and the Li drifted n+ surface of the detector. Energy depositions inside the germanium crystal create electron-hole pairs which then drift to the opposite electrodes. This induces mirror charges in the contact electrodes, creating the charge pulse or trace which is read out by the electronics and digitized and recorded by the data acquisition system (DAQ). Some example current pulses, obtained by differentiating the recorded traces, are plotted in figure 3 for both BEGe and semi-coaxial detectors. These current pulses are referred to as waveforms or simply pulses in this thesis. The DAQ samples the interval around the trigger with 100 MHz [23]^1. To avoid background from natural radioactivity, most components of the DAQ are located above the water tank. Only initial FETs and preamplifiers are located inside the tank. This forces the use of long cables which constitute the main contribution to the superimposed noise seen in the waveforms.

[Figure 3: Top: Sketch of BEGe (left) and semi-coaxial (right) detectors; the DAQ records the signal at the contact (negative), high voltage is applied to the surface (positive). Bottom: Recorded single-site and multi-site current pulses around 2 MeV for both detector types over roughly 1 µs. Current pulses are obtained by differentiation.]

^1 For storage reasons, not the whole time interval around the trigger is saved with 100 MHz sampling rate. Instead, only 10 µs are stored as the high-frequency waveform around the trigger and a longer time interval is recorded with one fourth of the sampling rate.

(8) The pulse shape strongly depends on the number of interactions that happened inside a detector, their energy depositions and their physical locations. Therefore, close inspection of the recorded waveform can shed light on the type of event that occurred. This part of the analysis is referred to as pulse shape discrimination.

1.4 Pulse shape discrimination in Gerda

Energy depositions caused by a 0νββ decay are most likely to be single-site events as the decay energy is distributed mainly onto the two emitted electrons. Therefore, analysis of the pulse shape can be used to reject multi-site background events such as Compton-scattering photons. This pulse shape discrimination (PSD) reduces the background index and thereby improves the sensitivity of the experiment.

1.4.1 Current PSD methods

Currently, two independent methods for PSD, one for BEGe and one for semi-coaxial detectors [23], are used in Gerda.

The geometry of BEGe detectors is ideal for reaching a good PSD performance. Because of the small point-like contact, the signal shape induced by a single energy deposition inside the bulk of the detector mostly depends on the distance from the contact. Its current pulse forms a sharp peak as can be seen in figure 3. Due to the relatively slow drift inside the detector, multiple energy depositions at different locations within the bulk lead to a multiple peak structure in the event pulse. This allows for a simple discrimination using a single parameter A/E, the maximum electric current divided by the total deposited energy. This parameter exploits the fact that the maximum amplitudes of two energy depositions only pile up if the depositions occur at the same drift distance from the p+ contact. Therefore, for the majority of multi-site events the amplitude is lower than for single-site events. Other event types such as alpha decays at the contact can also be excluded with A/E, as these have an even sharper peak and a higher A/E value than single-site events.

The semi-coaxial detectors have a much larger cylindrical contact. This makes PSD more difficult as even the pulse shapes of single energy depositions strongly depend on the hit location inside the detector. Finding an explicit algorithm with high discrimination efficiency by hand is therefore difficult. Instead, the current analysis employs a two-layer neural network in combination with a hand-written dimensionality reduction algorithm. Nevertheless, the PSD performance of the semi-coaxial detectors cannot compete with the A/E discrimination of the BEGes.
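To make the A/E parameter concrete, the following is a minimal Julia sketch. It is illustrative only and not the Gerda implementation; the function names, the normalization of A/E and the cut threshold are hypothetical and would in practice be calibrated per detector, e.g. on the 1592 keV double escape peak discussed in the next section.

```julia
# Minimal A/E sketch (illustrative only): A is the maximum of the current
# pulse, E the reconstructed energy. Multi-site events spread their charge
# arrival over several smaller current peaks, so A is lower at the same E.
a_over_e(current_pulse::AbstractVector{<:Real}, energy_keV::Real) =
    maximum(current_pulse) / energy_keV

# Hypothetical single-site cut: accept events above a calibrated threshold.
is_single_site_like(pulse, energy_keV; threshold = 1e-3) =
    a_over_e(pulse, energy_keV) >= threshold
```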
1.4.2 Thorium calibrations

Both currently used discrimination algorithms use energy lines from the weekly 228Th calibration runs to fix their parameters and determine their cut windows. During a calibration, radioactive 228Th sources are brought into the vicinity of the detectors [24]. The observed energy spectrum, shown in figure 4, results from interactions of photons from the 228Th decay chain with the germanium detectors.

The continuum of the calibration spectrum consists of events which can leave either one or multiple energy depositions in a detector. The vast majority of these are Compton interactions; however, the single- or multi-site topology is not known a priori for any of the continuum events. On top of the continuum, there are narrow peaks corresponding to discrete energy depositions of photons from internal transitions of the thorium decay chain [22]. The β- decays of 208Tl and 212Bi create photons at 2615 keV and 1620 keV, respectively. Many of them deposit all of their energy in multiple locations inside one detector, producing the spectral lines. Some of the photons from the decays create electron-positron pairs inside the germanium crystal. Both electron and positron quickly lose their kinetic energy to the detector. The positron then annihilates with an electron from the environment, thereby creating two 511 keV photons.

(9) [Figure 4: Energy spectrum (events per keV vs. energy in keV, 500 to 3000 keV) of all Thorium-228 calibrations from Gerda Phase IIa combined, shown for all detectors together as well as for BEGes and semi-coaxial detectors separately. The histogram contains a total of 64 million events, 63% of which are contributed by the BEGe detectors. The four labeled peaks are used to evaluate the performance of any single-site vs. multi-site classifier; in ascending order: 208Tl double escape peak (1592 keV), 212Bi full energy peak (1620 keV), 208Tl single escape peak (2103 keV), 208Tl full energy peak (2614 keV).]

These photons can escape the detector without further energy depositions. The line at 2103 keV is comprised of the mostly multi-site events from the 208Tl decay in which one of the two 511 keV photons escapes while the second photon deposits its full energy. Similarly, the peak at 1592 keV accumulates the double escape events. As both photons leave the detector, the remaining energy depositions from electron and positron are localized in a small volume, which makes the pulses single-site, just like the expected events of the 0νββ decay.

Calibration and training of the current PSD algorithms use the 212Bi full energy peak at 1620 keV and the 208Tl double escape peak at 1592 keV as proxies for multi-site and single-site events, respectively. One advantage of these two peaks is that they lie close together in energy. This is favorable as energy-dependent effects, such as the inverse scaling of the relative noise with energy, are unlikely to be trained on.

1.4.3 Limitations of current PSD methods

The currently used PSD methods are calibrated on the 212Bi full energy and 208Tl double escape peaks from the calibration data. These events constitute about 1% of the overall calibration data. This data set is split into training and evaluation sets, further reducing the event count to only a few hundred events per calibration.

While this is enough to determine a cut window for A/E, it causes problems for the neural network classifier of the coaxial detectors. The implementation in use contains more than 5000 trainable parameters [23]. Because the amount of training data is of the same order, the network can simply "memorize" the training data. This process is called overfitting or overtraining. An overfitted network will perform very well when presented with the training data but does not generalize, i.e. it performs worse on statistically independent data.

The A/E discrimination is a simple classification that works well when calibrated correctly. However, for mostly unknown reasons the mean A/E values of detectors shift over time. Linear corrections have to be applied for the time in between calibrations and parts of the data have to be discarded as the algorithm is not stable enough to cope with these instabilities [23].

(10) Both the instability of A/E and the overfitting of neural networks motivate the investigation of a new, robust approach for pulse shape analysis.

1.5 New deep learning possibilities

Unsupervised learning algorithms are designed to deal with unlabeled data. Unlabeled refers to the fact that the desired information, i.e. which class a given data point belongs to, is not available, neither for evaluation nor for training data. The use of unsupervised learning seems natural as the vast majority of Gerda data is not labeled. On the other hand, the final result for PSD should be a classification parameter providing information on whether a pulse is single-site or multi-site. Such a classification requires supervised training.

In this thesis, a new two-step approach is presented which combines the two paradigms. In the first step an unsupervised learning algorithm reduces the dimensionality of the data to a compact representation. Then, in the second step the compact representation of the labeled data is used for supervised training of a classifier. The unsupervised part plays a key role for a successful classification by providing the compact representation. Compared to the original sampled waveform, it has a number of advantages:

Size. The compact representation has about two orders of magnitude fewer parameters. Only the quintessential information of the pulse is stored in these parameters.

Noise filter. The high-frequency noise of most detectors can be regarded as random noise for all practical purposes. As the compact representation only contains a handful of parameters, the unsupervised learning algorithm cannot afford to include specific noise patterns in it, especially if the training dataset is large enough to contain many versions of common events.

Generalization. If small enough, the compact representation suppresses energy-dependent phenomena like the scaling of noise. It gives a natural representation of the shape of a waveform.

As will be shown in this thesis, the described scheme solves the problems of current PSD methods mentioned in 1.4.3. Overfitting is overcome as the number of trainable parameters of the classifier is massively reduced due to the lower input dimension. Therefore fewer training events are required to adjust these weights. The presented method is also more robust over time than the A/E classification because noise in the waveform, the main problem for A/E, is learned to be ignored by the unsupervised learning algorithm. The compact representation might even be used for other purposes in the future. The denoising effect could for example be useful for a more precise energy reconstruction.

(11) 2 Analysis software and computing

2.1 Current Gerda analysis software

The current data processing of Gerda is performed by the proprietary framework GELATIO and a number of scripts [25]. GELATIO is written in C++ and depends on the open-source ROOT framework [26] including the TAM extension for cluster computing [27]. GELATIO uses the custom MGDO (ROOT) format as input to extract and compute higher-level event properties. For that, it embeds all digital filters that are applied to the waveforms as modules which can be executed in any order. This processing includes multiple energy reconstructions, quality and background rejection cuts, LAr veto and muon veto flags and pulse shape parameters for discrimination.

Once the GELATIO output is computed, a number of smaller programs and scripts provided by individual members of the Gerda collaboration are run. These calibrate the PSD cut parameters and cut values, finalizing the classifiers, e.g. for multi-site or alpha event rejection.

2.2 GERDADeepLearning.jl - A new open-source framework

Within the context of this thesis, I developed a new waveform processing and classification framework called GERDADeepLearning.jl. It is freely available on GitHub [28]. Unlike the current Gerda software, it is written exclusively in the programming language Julia [29], a dynamic high-performance language designed for numerical computing.

GERDADeepLearning is independent of the current GERDA software. In principle, it only requires the same data format as GELATIO and supports importing other high-level Gerda data files, both for data filtering and for cross-checking the results^2. For these tasks, GERDADeepLearning depends on the Julia wrappers for ROOT, ROOT.jl [30] and ROOTFramework.jl [31]. GERDADeepLearning provides a high-level and easy-to-use API for working with germanium detector data. Examples, tutorials and documentation are available on GitHub.

GERDADeepLearning uses a customizable HDF5 [32] format for storing data and provides tools for converting from the proprietary Gerda format to HDF5. Working with large data sets is achieved through a just-in-time initialization concept whereby only the part of the data that is being worked on is read into memory. This is performed under the hood, mostly without requiring user code.

In addition to the feature-rich Julia environment, GERDADeepLearning implements multiple signal processing filters specifically tailored to pulse shape analysis, some of which are described in chapter 3. It supports preprocessing and filtering of data as well as many other commonly used functions. The deep learning capability is based on the open-source library MXNet [33] through the official Julia wrapper MXNet.jl [34]. MXNet provides common machine learning features such as fully connected layers and convolutions for neural networks. In addition to CPU processing, it can make use of the computing power of GPUs through NVidia's general purpose processing API CUDA [35]. Splitting the work among multiple machines is also supported. While MXNet is very efficient and fast, it lacks some basic functionality like recording network performance throughout training. GERDADeepLearning.jl therefore contains a submodule to add some of the missing functionality. Additionally, GERDADeepLearning provides statistical tools for efficiency calculation as well as a rich set of plotting functions.

As of this writing, GERDADeepLearning.jl is tested on Ubuntu 16 and Julia 0.6. Though the framework was designed for the Gerda experiment, the method and most of the code can be applied to a more general set of germanium detector data.

^2 In addition to the waveforms, the following high-level properties are imported: event time, energy, multiplicity, muon and liquid argon veto, A/E value and flag, classifier output for multi-site and alpha rejection for semi-coaxial detectors, test-pulse and baseline flags.

(12) 2.3 Hardware

For data processing and neural network training, we set up a compute server at the Max-Planck-Institute for Physics. With two 14-core (28-thread) Intel Xeon processors and four high-end consumer graphics cards, it is well suited for highly parallel, high-throughput processing tasks. The server specification as well as its performance on neural network training is shown in figure 5. Our tests have shown that the GPUs are superior to the CPUs in neural network processing. Due to hardware restrictions^3, scaling beyond two graphics cards did not improve performance any further for the relatively small networks used in this work. Network training was therefore performed on two GPUs through the CUDA capability of MXNet.

Data conversion and preprocessing was implemented in GERDADeepLearning.jl to run in parallel, making use of the 28-core CPU architecture. This workload was not optimized for GPU processing, though this may be a possibility for future Gerda computing using Julia.

[Figure 5: Left: Photograph and specification of the processing units of the compute server of the Gerda group at the Max-Planck-Institute for Physics (CPU: 2x Intel Xeon E5-2680 v4 @ 2.40 GHz; GPU: 4x NVidia GTX 1080 Ti @ 1.58 GHz). Right: Neural network performance (time in seconds vs. batch size) on one CPU of the compute server compared to one GPU. The curves show the measured time for an autoencoder to optimize for two epochs and compute a prediction for a data set of 300,000 events; the used model is described in section 5.1. The batch size determines how many events are processed in one go before an optimization step is performed. For small batch sizes the synchronization overhead dominates, while large batch sizes imply less frequent network optimization steps. Network optimization is described in more detail in section 4.]

^3 Each Xeon processor is connected to two GPUs. While each pair of GPUs can communicate efficiently via the PCIe bus, synchronization among all GPUs requires data transport to and from the CPUs. The resulting overhead scales with the number of synchronizations, which is determined by the network complexity and the batch size.

(13) 3 Data selection and preprocessing

This work is based on blinded Gerda Phase IIa data, taken from December 2015 until June 2016 (run 53 through 64). The results for the 0νββ search of this data set were published in [21].

3.1 Selection of training data

The presented method was evaluated on the 36 enriched detectors that are used for the current analysis. For these, around 180,000 physical events were recorded during live time. This data set is referred to as physics data and mostly consists of events from the 2νββ decay as well as natural radioactive radiation and events caused by cosmic muons. This set already excludes test pulses, baseline events and events that did not pass the quality cuts. Of all physics data events, 1000 triggered the muon veto system, 101,000 are rejected by the liquid argon veto and 16,000 show simultaneous energy depositions in more than one detector, also excluding them from the set of potential 0νββ events. These experimental cuts leave 72,000 triggered events, most of them at energies below 500 keV. For the analysis of this work, this filtered dataset was used. The energy values used for filtering and plotting events are imported from the current Gerda software output.

Additionally, the weekly to bi-weekly calibration runs provide 64 million real events ranging from below 500 keV to about 3 MeV, as shown in figure 4. These constitute the primary source for training and evaluation of a learning algorithm. Because the dominant part of the noise does not scale with energy, the low-energy events are less useful due to their high relative noise. Events below 1 MeV are therefore excluded from training.

In addition to there being two types of germanium detectors in Gerda, each detector has its own unique characteristics. Therefore a separate classifier is trained for each detector using only the calibration data from that channel.

3.2 Preprocessing of waveforms

The purpose of preprocessing is to make it as easy as possible for a learning algorithm to identify the hidden relations in the data. To achieve this, a number of filters are applied to the raw data before it is fed to neural networks.

The raw waveforms recorded by the DAQ capture how the charge of an event was accumulated over time. An energy deposition in the detector creates electron-hole pairs which drift to the surface and contact, respectively. The DAQ records the charge level at the contact as a function of time, which reaches its maximum when all holes have drifted to the contact. The electronics connected to the detector then slowly let the charge dissipate, producing the exponential decay towards the baseline after the physical event. The traces, which contain the important information for inferring the type of event, are sampled at 100 MHz over 10 microseconds and hence consist of 1000 samples. The raw waveforms of some example pulses are plotted at the top left of figure 6. Starting from the traces, a number of preprocessing steps are performed.

Subtract baseline. The baseline level, averaged over the first 200 samples of the waveform using a Hamming window, is subtracted from the pulse. The whole waveform is then sign-reversed so accumulated charge increases its value.

Energy normalization. As labeled data is only available for the peaks at 1592 keV (single-site) and 1620 keV (multi-site), it is important to prevent the network from learning an energy dependence. This is achieved in this step by normalizing all pulses to their total charge, determined by averaging over the last 200 samples of the waveform. The result of this stage is shown at the top right of figure 6. The waveform is divided by this value to remove the energy information. This simple energy reconstruction does not take the exponential decay of the waveform into account. While this may not be precise enough for energy reconstruction, it is sufficient for the purposes described here.

(14) [Figure 6: Visualization of the preprocessing steps for three single-site events (SSE at 800 keV, 1500 keV and 2038 keV) and one multi-site event (MSE at 2038 keV). Top left: Raw waveforms as recorded by the DAQ. Top right: The baseline level is subtracted and the height is normalized to constant energy. Bottom left: Current waveforms are computed by differentiation. Bottom right: Waveforms are aligned and trimmed to 256 samples. For the shown waveforms, the alignment is defined so that the 50% level of the charge pulse lies in the center.]

Alignment and cut. The trigger system of the DAQ does not perfectly align the rise times of the pulses. This results in the majority of the sampled region lying either before or after the rise window. This preprocessing step aligns the charge pulses, either by their 50% value or their maximum slope. Then, a region of 256 samples around that point is selected; samples lying outside are discarded. This step can fail for strange events which do not have a clear rise but still pass the Gerda quality cuts. These, usually about 10 per million events, are excluded from the subsequent analysis.

Differentiation. The charge pulse is useful for determining the energy of the pulse; however, for inferring the event type, the current pulse is better suited. The current pulse is obtained simply by differentiating the charge pulse waveform, i.e. taking the difference between neighbouring samples.

Ultimately, the preprocessed waveform is nothing but equally spaced values along a time axis. This is very similar to images, where the equally spaced values represent pixels. In recent years, much research has been done in the field of image recognition. To take advantage of the techniques and recipes developed for images, the waveform is treated as a one-dimensional image from this point on.
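The preprocessing chain above can be summarized in a short Julia sketch. This is illustrative only and not the GERDADeepLearning.jl implementation; the Hamming-window averaging and the maximum-slope alignment option are simplified away, the window handling is approximate, and all names are made up for this sketch.

```julia
using Statistics

# Simplified preprocessing of one raw 1000-sample charge trace:
# baseline subtraction, energy normalization, 50%-level alignment,
# trimming to 256 samples and differentiation to a current pulse.
function preprocess(trace::Vector{Float64}; nkeep::Int = 256)
    # 1. Baseline subtraction (plain average of the first 200 samples here;
    #    the thesis uses a Hamming window) and sign reversal so that
    #    accumulated charge increases the value.
    pulse = -(trace .- mean(trace[1:200]))

    # 2. Energy normalization: divide by the plateau level estimated from
    #    the last 200 samples to remove the energy information.
    pulse ./= mean(pulse[end-199:end])

    # 3. Alignment: center a window of nkeep samples on the sample where
    #    the charge first crosses 50% of its final value.
    t50 = findfirst(>=(0.5), pulse)
    t50 === nothing && return nothing          # no clear rise: reject event
    lo = clamp(t50 - nkeep ÷ 2, 1, length(pulse) - nkeep + 1)
    charge = pulse[lo:lo+nkeep-1]

    # 4. Differentiation: the current pulse is the difference between
    #    neighbouring samples of the charge pulse (one sample shorter).
    return diff(charge)
end
```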

(15) During this work, further preprocessing methods were investigated. Wavelet decompositions, for example, transform the waveform into a space with both time and frequency information. The hope that this transformation would make distinguishing between the different signal topologies easier did not materialize, however.

A method to further decrease the dimensionality of the input is the truncated singular value decomposition. This algorithm linearly decomposes the waveforms into their most important features, which are extracted from the data. It was found, however, that the dimensionality of the pulses could not be reduced this way without losing potentially important information. Instead, convolutions with filters that are trained from the data were chosen for the last step of preprocessing. This step is contained in the neural network and will be described in chapter 5.
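For reference, the truncated singular value decomposition mentioned above can be sketched in a few lines of Julia. This is an illustration only; pulses are assumed to be stacked as matrix columns and k denotes the number of retained components.

```julia
using LinearAlgebra

# Truncated SVD: linearly decompose a matrix whose columns are pulses into
# k components learned from the data. Each pulse is then described by k
# coefficients instead of its 256 samples.
function truncated_svd_features(pulses::Matrix{Float64}, k::Int)
    F = svd(pulses)              # pulses ≈ F.U * Diagonal(F.S) * F.Vt
    basis  = F.U[:, 1:k]         # k dominant waveform components
    coeffs = basis' * pulses     # k coefficients per pulse
    return basis, coeffs
end
```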

(16) 4 Artificial neural networks

Artificial neural networks are a mathematical model that is optimized or fitted on data and can, in principle, learn any nonlinear function [36]. Originally developed in the 1950s, they became very popular in recent years, not least because of the advances in computing power. The most common use of neural networks is to learn an algorithm that would otherwise be hard to implement manually. The network is trained by feeding it some data plus their desired outputs or labels, often the classes or categories they belong to. An optimization algorithm then adjusts the internal parameters or weights of the network to best reproduce the training labels. After successful training, the network is able to predict the labels of new data it is presented with.

Neural networks consist of layers. There are many types of layers, some of which will be introduced here. To predict a label, the input data travels in forward direction through the network from layer to layer, being manipulated at each step. The output layer computes the predicted label which can be a single value or a higher-dimensional object.

4.1 Types of layers

Over the years, many types of network layers have been developed. Some of them have trainable weights while others perform fixed functions. Each layer computes an output x_out for a given input x_in, where the shape and dimensionality of the data can be arbitrary. This output is then given to the next layer as input. In the following, the size, i.e. parameter count, of input and output will be abbreviated with N_in and N_out, respectively.

Fully connected layer. This may be the most important layer for neural networks. Mathematically, it performs a matrix multiplication and adds a constant offset, also called bias b: x_out = M · x_in + b. Consequently, the matrix M consists of N_out × N_in entries and the bias vector contains N_out values. All of these N_out · (N_in + 1) weights are adjusted during training. A fully connected layer can be visualized as N_in neurons on one side connected to N_out neurons on the other side, as shown in figure 7. Each neuron in the input layer is connected to each neuron in the output layer. The input values travel through each connection, thereby being multiplied by the weight of that connection. When they reach the output layer, all values arriving at one neuron are summed. The bias is introduced as an additional neuron at the input layer with a constant value of 1.

[Figure 7: Visualization of a fully connected layer with N_in input neurons and N_out output neurons. The layer performs a matrix multiplication with the input x_in and adds a bias b to produce the output x_out.]

Convolutional layer. As the name suggests, this layer performs a certain number N_f of convolutions on the input. For each convolution, a filter of predefined size L_f is convolved with the input x_in and a bias b_f is added.

(17) The N_f · (L_f + 1) values of the filters, including biases, are learned, i.e. they are optimized in the training process. Each convolutional filter can be represented the same way as a fully connected layer, only each neuron of the input layer is connected only to neurons in the output layer that are close by. The complete layer creates a stack of the outputs of the individual convolutions, as shown in figure 8. This stack can be reshaped to a vector afterwards, for example if it is fed to a fully connected layer.

A convolutional layer usually has fewer weights than a fully connected layer. While a fully connected layer contains N_out · (N_in + 1) weights, a one-dimensional convolutional layer maps N_in ordered inputs to N_out ≈ N_f · N_in outputs using N_f learned filters of length L_f ≪ N_in. The total number of weights, N_f · (L_f + 1) ≪ N_out · (N_in + 1), is therefore much smaller.

[Figure 8: Visualization of a convolutional layer. The input x_in is convolved with a number of learned filters and a bias b is added, producing the output x_out.]

Activation. So far, all described operations were linear. Nonlinearity is commonly introduced with activation layers. An activation layer applies a nonlinear function to each value, x_out,i = h(x_in,i). While in principle any nonlinear function can be chosen, a handful of functions have proven to be very effective and are extremely popular in machine learning research. These include the sigmoid function σ(x) = 1/(1 + e^(-x)), tanh(x) and ReLU(x) = max(0, x). The first two can be transformed into each other by scaling and translation operations. Both σ and tanh converge to a finite output for large x, which means that their derivatives, e.g. (d/dx) σ(x) → 0 for x → ∞, tend towards zero. This poses a problem for gradient based optimization algorithms and can lead to dead ends in training when the change to the weights at each optimization step becomes very small. The ReLU function may seem very simple, only clipping negative values to zero, but it does not suffer from this problem. It can deal with large and small inputs, which makes it easy to work with. Choosing the correct initial weights for training a new network is also less critical, as the magnitude of values in the intermediate layers does not affect the result. These advantages have made ReLU one of the most popular activation functions in machine learning research today.

Pooling. Pooling or downsampling reduces the dimensionality of an input by combining neighboring values. This can be in the form of taking the average over n neighboring samples or choosing the maximum or minimum of them. This layer is often put after a convolutional layer. While the convolutional layer increases the dimensionality for N_f > 1, the pooling layer decreases it within each convolved output. The combination of the two layers also makes the network invariant to small translations (shifts) in the input, as the convolutions simply shift their output which is then downsampled.

(18) 4.2 Training process of a neural network

After a network has been set up, it needs to be initialized and trained. The initialization is usually performed by assigning random values with a uniform or Gaussian distribution to all weights w. Then, an optimization algorithm runs over the training data and tries to find a minimum in the loss function Q(w). The loss function determines what the network should be optimized for. The most commonly used function is the least squares loss

$$ Q(w) = \frac{1}{|X|} \sum_{x \in X} \| y_w(x) - z_x \|^2 \qquad (1) $$

which results in linear regression. Here, X denotes some data set of size |X| for which the loss is computed, y_w(x) is the prediction of the network and z_x the training label or ground truth.

The optimization is usually performed in batches for computational and stability reasons. The training data are split into equally sized batches which are processed one at a time. The loss is computed for each batch and the weights are adjusted. This is computationally much cheaper than optimizing only after a full data pass but, depending on the batch size, stable enough for convergence towards a minimum.

How the minimum is approached depends on the type of optimization algorithm and its settings. A simple optimizer is stochastic gradient descent (SGD) [37]. It computes the loss Q(w) at the output and its derivative with respect to each weight w. It then adjusts the weights by

$$ \delta w = -\eta \nabla Q(w) = -\eta \sum_{x \in X_{\text{batch}}} \nabla Q_x(w) \qquad (2) $$

towards the minimum in the loss function, with the amount of change determined by the learning rate η. The value of the learning rate influences the stability and speed of training. A higher learning rate will lead to quicker training at the cost of stability, while a low learning rate will lead to a smooth training curve at the cost of increased computation effort. The SGD optimizer can be extended with a momentum term which keeps optimizing in the direction of the previously computed gradients for a number of steps [37]. This usually speeds up convergence drastically. Nevertheless, many other optimizers have been developed, such as Adam [38], which are designed to converge faster than SGD.

In principle, the training process could run indefinitely, constantly changing the weights. As computing power is limited, an exit condition must be defined. Reasonable choices are stopping when the loss function stops decreasing, either on the training set or on the cross validation set (see section 4.4). The networks trained for this work were trained for a fixed number of full data passes and the best configuration was chosen afterwards.
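To make equations (1) and (2) concrete, the following Julia sketch performs one SGD step on the least-squares loss for a toy one-layer model. It is illustrative only; the model, sizes and names are arbitrary and this is not the MXNet-based training code used in this work.

```julia
# Toy model: a single fully connected layer followed by ReLU,
# y_w(x) = relu(M*x .+ b). One SGD step on the batch least-squares loss.
relu(v) = max.(v, 0.0)

function sgd_step!(M::Matrix{Float64}, b::Vector{Float64},
                   X::Vector{Vector{Float64}}, z::Vector{Vector{Float64}},
                   η::Float64)
    # Accumulate gradients of Q(w) = (1/|X|) Σ ‖y_w(x) − z_x‖² over the batch
    gM, gb = zero(M), zero(b)
    for (x, zx) in zip(X, z)
        pre = M * x .+ b                      # pre-activation
        y   = relu(pre)                       # prediction y_w(x)
        δ   = 2 .* (y .- zx) .* (pre .> 0)    # ∂Q_x/∂pre (ReLU gradient)
        gM .+= δ * x'                         # ∂Q_x/∂M
        gb .+= δ                              # ∂Q_x/∂b
    end
    # Gradient descent update δw = −η ∇Q(w), averaged over the batch
    M .-= η .* gM ./ length(X)
    b .-= η .* gb ./ length(X)
    return M, b
end
```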
4.3 Unsupervised learning and autoencoder

The previously described learning techniques all presumed the existence of labeled training data, i.e. that the labels z_x, often classes, of the training samples x are available. With unsupervised learning, hidden structures in unlabeled data, i.e. data for which no classification or categorization is available, can be inferred. In contrast to supervised learning, the accuracy of an unsupervised learning algorithm cannot be evaluated directly as there is no ground truth label to compare the predictions to.

Over the years, many unsupervised learning algorithms have been developed. Many of these, like k-means [39] or Gaussian mixture models [40], focus on inferring an underlying structure by clustering points in groups. Also more complex models based on neural networks have proven to be very successful [41, 42]. For this work, the so-called autoencoder [43] was chosen for the unsupervised learning on pulse shapes.

(19) An autoencoder [43] is a special neural network with the goal to find a representation of the input data in a given number of dimensions, usually chosen much smaller than the number of input dimensions. As the autoencoder is effectively a neural network, it is compatible with neural network techniques and therefore benefits from the many recent breakthroughs. For example, in computer vision, two-dimensional convolutional layers are very popular as they respect the relations between neighboring pixels. These layers can simply be added to an autoencoder, making it a convolutional autoencoder.

[Figure 9: Sketch of an autoencoder. The encoder (left, a convolution followed by a fully connected layer) learns the transformation to the feature space φ : X → F, mapping the input x to the feature vector φ(x); the decoder (right, a fully connected layer followed by a deconvolution) learns the inverse function ψ : F → X, producing the reconstruction ψ(φ(x)).]

The general structure of an autoencoder consists of two neural networks stacked on top of each other, as shown in figure 9.

Encoder. The encoder learns a function φ : X → F that transforms the input x ∈ X into a feature vector. As the encoder is a neural network itself, φ can in principle be an arbitrary nonlinear function.

Decoder. The decoder learns the reverse function ψ : F → X, transforming a feature vector φ(x) to a reconstruction of x.

The whole network is then trained to minimize the reconstruction error. The loss function for a set of samples X can be written like in equation 1 as

$$ Q(w) = \frac{1}{|X|} \sum_{x \in X} \| x - \psi_w(\phi_w(x)) \|^2 . \qquad (3) $$

Consequently, the autoencoder is trained to find the functions φ and ψ so that

$$ \phi, \psi = \arg\min_{\phi, \psi} \sum_{x \in X_{\text{train}}} \| x - \psi(\phi(x)) \|^2 . \qquad (4) $$
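As a minimal illustration of this objective, a toy linear encoder/decoder pair and the loss of equation (3) can be written as follows. The real encoder and decoder are the nonlinear convolutional networks described in chapter 5; all names and sizes here are made up for this sketch.

```julia
# Toy autoencoder: a linear encoder E (7×256) and decoder D (256×7).
# encode and decode play the roles of φ and ψ in equations (3) and (4).
encode(E::Matrix{Float64}, x::Vector{Float64}) = E * x   # 256 samples -> 7 features
decode(D::Matrix{Float64}, f::Vector{Float64}) = D * f   # 7 features  -> 256 samples

# Reconstruction loss of equation (3) over a set of pulses X
reconstruction_loss(E, D, X) =
    sum(sum(abs2, x .- decode(D, encode(E, x))) for x in X) / length(X)
```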

(20) The network must therefore pack as much of the quintessential information of x as possible into the low-dimensional feature vector φ(x): it learns an encoding of x. Since no constraints are predefined except for the size of the feature space, the feature representation depends only on the training data. Thus, the only bias introduced in the encoding is the choice of network configuration and the selection of training data.

4.4 Training, validation and test sets

As a neural network is trained on a certain set of data, it will learn to best classify these specific samples. The goal, however, is to obtain a generalization, i.e. a function that works well on previously unseen data. Depending on the data complexity and network layout, these two evaluation metrics may give very different results. The case when a classifier does very well on the data it was trained on but generalizes poorly is called overfitting.

One common method which can help to prevent overfitting is dropout [44]. During each training batch, a fraction of randomly selected neurons are disabled, i.e. their output is set to zero and the overall output of that layer is rescaled. This has a similar effect as averaging over multiple neural networks trained with different initializations, as it prohibits the network from assigning very specialized functions to individual neurons.

In any case, it is necessary to split the data to get an unbiased performance evaluation. This procedure is common practice in machine learning applications and was also performed for this work. The data are split into three disjoint sets of different sizes.

Training set X_train. This set is used by the optimizer to adjust the weights w of the network during training towards a minimum in the loss function Q(w).

Cross validation set X_xval. While the network is continuously being optimized on the training set, it may learn features that are specific to the selection of training data but do not represent general features that should be learned. This overfitting usually dominates in the later stages of training, ideally after the quintessential features of the data have already been learned. The cross validation set is an independent data set which is used to cross-check that the network is still learning useful features.

Test set. As both X_train and X_xval are used in the process of training, a third independent set is required to evaluate the performance of the algorithm after training.

In typical machine learning applications, the data are split into about 60% for training, 20% for cross validation and 20% for evaluation. However, for the classification of Gerda events, an optimal cut value must be determined later on. This procedure is sensitive to statistical fluctuations in the test set, which is why for this work the split was chosen to be 40% for training, 20% for cross validation and 40% for evaluation, enough to reduce statistical uncertainties in the cut value to an acceptable level.
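A minimal sketch of this 40/20/40 split in Julia (illustrative only; in this work the split is performed per detector on the calibration data):

```julia
using Random

# Randomly partition event indices into 40% training, 20% cross validation
# and 40% test, as used in this work.
function split_indices(n_events::Int; rng = MersenneTwister(0))
    perm = randperm(rng, n_events)
    n_train = floor(Int, 0.4 * n_events)
    n_xval  = floor(Int, 0.2 * n_events)
    train = perm[1:n_train]
    xval  = perm[n_train+1:n_train+n_xval]
    test  = perm[n_train+n_xval+1:end]
    return train, xval, test
end
```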

(21) 5 Nonlinear information reduction of pulses

While the preprocessed waveforms still contain 256 samples, the important information specifying the physical process can be expressed in only a handful of values. This chapter describes the method used to move from the initial high-dimensional representation to a much more compact feature representation containing almost two orders of magnitude fewer parameters. This is achieved using the autoencoder, an unsupervised learning algorithm which learns the feature representation from the data.

5.1 Network layout and optimization

The configuration of the autoencoder, which is identical for all detectors, reduces the 256-dimensional input down to a seven-dimensional feature vector. The layer structure is visualized in figure 10. The parameters determining the layout of a neural network, such as the number of layers, neurons, etc., are referred to as hyperparameters.

[Figure 10: Configuration of the autoencoder used in this work. Encoder φ: preprocessed current pulse x → 1D convolution → ReLU activation → max pooling of length 4 → fully connected layer → feature vector φ(x). Decoder ψ: fully connected layer → ReLU activation → 4x upsampling → 1D deconvolution → ReLU activation → reconstruction ψ(φ(x)). The encoder consists of one convolutional layer and one fully connected layer; the decoder is mirrored to match the encoder.]

The whole calibration data of Gerda Phase IIa serves as the data set. It is split into 40% training data, 20% cross validation and 40% for testing. The exact training set sizes for each detector can be found in the appendix, section A.

Starting with the 256-dimensional input, the encoder first performs two convolutions where each convolutional filter has a length of nine samples. This is followed by a max pooling of length 4. Then the nonlinear activation function ReLU is applied to the reduced 2 × 64 output. A fully connected layer with seven neurons then transforms the intermediate result into the feature vector. Mathematically, this is a matrix multiplication of the flattened 128-dimensional ReLU output vector with a 7 × 128 matrix of which each entry is learned.

For the reconstruction, a mirrored version of this setup transforms the feature vector back to the input dimension. This step includes another fully connected layer with ReLU activation, followed by a 4x upsampling with quadratic interpolation and a deconvolution.

The network optimization is performed by the ADAM optimizer [38], running 50 times over the complete data set with a mini-batch size of 3000.
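As a rough plausibility check of the model size, the weight-counting rules from section 4.1 can be applied to this layout. The tally below assumes that the decoder simply mirrors the encoder's filter and layer sizes, so the exact count may differ slightly from the real implementation.

```julia
# Approximate trainable weight count of the autoencoder layout above, using
# N_f·(L_f+1) for (de)convolutions and N_out·(N_in+1) for fully connected
# layers. The exact count depends on implementation details.
conv_enc = 2 * (9 + 1)       # 2 filters of length 9, plus biases  -> 20
fc_enc   = 7 * (128 + 1)     # 128-dim ReLU output -> 7 features   -> 903
fc_dec   = 128 * (7 + 1)     # 7 features -> 128-dim intermediate  -> 1024
deconv   = 2 * (9 + 1)       # mirrored deconvolution              -> 20
total    = conv_enc + fc_enc + fc_dec + deconv   # ≈ 1967 weights
println("approximate weight count: ", total)
```

This is consistent with the statement below that the configuration has just below 2000 weights including both encoder and decoder.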

(22) While no definitive rules exist for determining these hyperparameters, finding the best configuration is not simply a case of trial and error. Heuristics from theory combined with manual optimization led to these specific settings. The following paragraphs lay out why the final network configuration was chosen and what effect the change of certain hyperparameters has.

Complexity. There are mainly two factors limiting the complexity of the network. If the network has too many trainable weights, overfitting becomes an issue. As a rule of thumb, the number of training samples should be ten to a hundred times larger than the number of weights. For Gerda Phase IIa, the number of calibration pulses in the training set varies between about 300,000 and 1.7 million between detectors. On the other hand, the network needs to have a complex enough structure to learn effective encoding and decoding functions. The proposed configuration has just below 2000 weights including both encoder and decoder.

Size of feature space. The size of the feature space is a compromise between efficiency and reconstruction accuracy. While lower-dimensional feature spaces allow for a more compact representation, the size needs to stay large enough to capture all important features of typical pulses. It was found experimentally that the accuracy gains drastically fall off beyond a feature space size of five. Stable and reliable convergence for all detectors was seen starting at a size of seven. Going beyond seven does not significantly improve the reconstruction accuracy, which is why this particular value was chosen for the information reduction stage. Different autoencoder layouts may, however, require more or fewer weights.

Depth. Even within a tight bound on the overall complexity of the network, many configurations with different numbers of layers are possible. Each additional layer can abstract the concepts learned by a previous layer. For data with highly complex relationships it can help to increase the number of layers, though the performance gains usually drop off beyond three layers. A network with a lower layer count, on the other hand, is faster and more stable to train. For the waveforms of calibration events, the training of an autoencoder with one more layer for both encoder and decoder (two convolutions or two fully connected layers) was found to be too unstable. Though this problem could probably be solved by using a variational autoencoder [45], the complexity of the pulse shapes does not require more layers to be effectively encoded in seven parameters.

Convolutions. Similar to how convolutional layers respect the order of the pixels in image recognition, here they are used to take advantage of the time axis ordering the individual samples of a pulse. The configuration employed in this work uses N_f = 2 filters with a length of L_f = 9 for a total weight count of 20. It was found that increasing the number of convolutional filters simply led to the added filters learning identical features.

Pooling. The goal of the pooling operation is to reduce the dimensionality of the convolution output. The downsampling factor was chosen to be four and the pooling algorithm was set to taking the maximum. It was, however, found that averaging the neighboring samples resulted in the same reconstruction accuracy within uncertainties.

Activation function. For the activation function, the rectifying linear unit ReLU(x) = max(0, x) was chosen because of the reasons outlined in section 4.1. Other activation functions were tested but did not converge as reliably.

Dropout. Dropout is a way to counteract overfitting that is often used in the training process. In the case of calibration waveforms, it was found that dropout did not improve performance, as enough data was available to prevent overfitting.

Optimizer, learning rate, batch size. In this work, stochastic gradient descent (SGD) with momentum and ADAM were tested. Both of them were found to have a very similar convergence behavior. Whenever one of them did not converge to a solution, the other algorithm was not able to find one either. The performance of successfully trained networks was also very similar for both algorithms. Generally, ADAM converges faster than SGD, which is why it was chosen for this work.

The batch size influences the frequency with which a network optimization step is performed, and the learning rate η determines by how much the weights are changed per iteration, according to equation 2. The networks trained here are small compared to some networks used in industry, e.g. for image classification. Therefore the processing of a single batch is very fast on a modern discrete GPU. When scaling up to a multi-GPU setup, a small batch size increases overhead as the GPUs need to frequently synchronize the network weights. Here, a rather large batch size of 3000 was chosen to minimize GPU synchronization overhead. To compensate for this rather large batch size, the learning rate of the ADAM optimizer was set to 0.005. With these values, training is very stable and converges to a solution after 5 to 30 data passes, as can be seen in figure 11. The convergence for BEGes is generally slower than for semi-coaxial detectors because of the narrower spikes present in their waveforms.

[Figure: learning curves (root loss versus full data passes) for the training and cross validation sets of detectors GD00A and ANG1.]

Figure 11: Learning curves of autoencoders for detectors GD00A (BEGe) and ANG1 (semi-coaxial). The optimization algorithm minimizes the reconstruction error over time. The reconstruction error is measured as the square root of the loss function Q(w) defined in equation 3. The training of the autoencoder for the semi-coaxial detector converges faster than for the BEGe. The training shows no signs of overfitting, as the performance on the cross validation set is equal to that on the training set.

The training curves in figure 11 also show that there is no difference in the reconstruction performance for events from the training or cross validation set. This complete absence of overfitting allows us to reuse the training set for training the classifier later on.

5.2 Reconstruction accuracy and sensitivity to noise

The goal of the autoencoder is to find a low-dimensional representation of the waveforms in order to facilitate classification afterwards. As the encoding and decoding functions are learned from the data, there is no intuitive interpretation of what the seven components of the feature vector φ(x) represent. Instead, one can look at the reconstructions x̃(x) ≡ ψ(φ(x)) and measure the deviation from the original pulse x. The top of figure 12 shows two reconstructed waveforms from a BEGe detector. Due to the low number of parameters of φ(x), the reconstructed pulses are less detailed. The deviation from the original pulse is quantified in the bottom part of figure 12. The difference between the original pulse x and the reconstruction x̃, measured as $\sigma_\text{reconst} = \sqrt{\frac{1}{256}\sum_{i=1}^{256}(x_i - \tilde{x}_i)^2}$, is dominated by the noise that is present in the data. In the plot, the expected deviation due to noise, σnoise, is

therefore subtracted on a per-event basis. The noise level, $\sigma_\text{noise} = \sqrt{\frac{1}{256}\sum_{j=1}^{256}(x_j - \bar{x})^2}$ with x̄ the mean value, is obtained from an interval of the same length as the preprocessed waveforms but taken at the beginning of the event trace, before the energy deposition causes a spike.

The visible band structure broadens at low energies due to the high relative noise and forms a constant Gaussian-like peak at high energies, where the changes made to the waveform by the autoencoder dominate. The median additional deviation caused by the autoencoder is around σreconst − σnoise ≈ 1.8 · 10⁻⁴. This is an order of magnitude smaller than the mean noise level of σnoise ≈ 1.9 · 10⁻³. The ²¹²Bi full energy peak at 1592 keV and the ²⁰⁸Tl double escape peak at 1620 keV demonstrate the difference in reconstruction accuracy between multi-site and single-site events. Single-site events are easier to encode, as there are fewer possible pulse shapes than for multi-site events. This leads to a higher reconstruction accuracy for the pulses of single-site events. The evaluation of reconstruction accuracies for all detectors is shown in appendix section A.

[Figure: top, current pulses and reconstructions of a single-site and a multi-site event at 2038 keV; bottom, σreconst − σnoise versus energy for all events above 1 MeV with the SSE and MSE peaks marked and the projected distribution on the right.]

Figure 12: Top: Reconstructed current pulses for a single-site and a multi-site event at 2038 keV from BEGe detector GD00A. Bottom: Additional reconstruction error σreconst − σnoise of the autoencoder for GD00A, measured in the same units as the current pulses above, plotted against energy. The single-site peak at 1592 keV (SSE) is on average reconstructed more accurately than the multi-site peak at 1620 keV (MSE). The median additional reconstruction error is 0.00018.

For most of the detectors, the superimposed noise, which is mostly introduced by the cables, is neither correlated with the pulse shape nor does it follow a pattern. This randomness

prevents the network from learning specific noise patterns, and thus the noise information of x is averaged out in x̃(x). Instead, the network learns to focus on the most relevant information.

Led by the metric ‖x − x̃‖², the network aims to improve those parts of the waveforms with the biggest discrepancy: the spikes. It learns common shapes of spikes and then fits any given waveform to its internal models, producing the feature vector φ(x). Therefore, the denoising behavior is not simply a low-pass filter but a sophisticated algorithm that can guess what the noise-free shape should look like. This is especially apparent for events at lower energies, which have a higher relative noise level. In this case, the algorithm can adjust the spike, e.g. of a single-site event, to better match the common single-site shape. The examples in figure 13 demonstrate this behavior. This adjustment is much easier for single-site than for multi-site events due to their less diverse pulse shapes. This difference can be seen in figure 12, where the single-site peak at 1592 keV is clustered around zero while the multi-site peak at 1620 keV extends further towards higher deviations from the original pulse.

[Figure: current pulses and reconstructions of one SSE (555 keV) and two MSEs (587 keV and 582 keV).]

Figure 13: Reconstructed low-energy single-site (SSE) and multi-site (MSE) events between 500 and 600 keV from detector GD00A. The autoencoder alters the waveform to better fit the single-site or multi-site pulse shapes learned from events above 1 MeV.

The denoising behavior also motivates the idea of improving the A/E resolution of BEGe detectors using this capability of the autoencoder. Figure 14 shows the single-site band for both the original and the reconstructed pulses. Here, A/E is defined as the maximum value of the energy-normalized current pulse. After reconstruction, the single-site band becomes more densely clustered while the overall range of A/E values stays the same. Because the single-site band has a negative slope in energy, the change in peak width is calculated only for energy slices. Around the region of interest, the Gaussian-like peak shrinks by 14% on average. This result looks promising for a potential improvement in A/E resolution, but, due to time constraints, this process was not implemented in this work.

For a few detectors, like ANG5, the noise shows a periodic behavior, as can be seen in figure 15. This allows the autoencoder to eliminate a large fraction of the noise without expending many parameters: the clipped sinusoidal curve with variable phase, which is produced in addition to the spike structure, only requires two parameters in principle. Due to the constraints placed on the autoencoder, it can only produce the positive values of the sine function. For these detectors, the reconstruction comes closer to the original data than one would expect from random noise most of the time. The statistics of this relation are shown at the bottom of figure 15. Making use of this property has not been pursued in this work but could provide a way to eliminate the characteristic noise and improve the energy reconstruction.
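To make the per-event quantities used in this section concrete, the following NumPy sketch computes the noise level σnoise from a baseline window, the reconstruction deviation σreconst, and A/E as the maximum of the energy-normalized current pulse. The function and variable names are illustrative and not taken from the analysis code.

```python
import numpy as np

def noise_level(baseline):
    """sigma_noise: RMS of a 256-sample window taken before the energy deposition."""
    return np.sqrt(np.mean((baseline - baseline.mean()) ** 2))

def reconstruction_error(pulse, reconstruction):
    """sigma_reconst: RMS deviation of the autoencoder reconstruction from the original pulse."""
    return np.sqrt(np.mean((pulse - reconstruction) ** 2))

def a_over_e(current_pulse, energy):
    """A/E: maximum of the current pulse, normalized to the reconstructed energy."""
    return np.max(current_pulse) / energy

# Quantity plotted against energy in figures 12 and 15:
# added_error = reconstruction_error(pulse, reconstruction) - noise_level(baseline)
```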

[Figure: A/E versus energy for detector GD00A in the original data and after autoencoder reconstruction, with the projected A/E distributions in the range [1700 keV, 2090 keV] on the right.]

Figure 14: Distribution of A/E values around the single-site band of detector GD00A (BEGe) in the original data (left) and after autoencoder reconstruction (middle). In the energy range from 1700 keV to 2090 keV (right), the width of the single-site band is decreased by 14%, from 0.00192 to 0.00165.
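Because of the slight negative slope of the single-site band, the 14% improvement quoted in the caption is obtained by comparing band widths slice by slice in energy rather than over the full range. A sketch of such a slice-wise width estimate is given below; the bin edges, the minimum event count per slice and the use of the standard deviation as width estimator are assumptions, as the thesis does not specify the estimator.

```python
import numpy as np

def band_width(energies, aoe, e_min=1700.0, e_max=2090.0, n_slices=10):
    """Average width of the single-site A/E band, estimated in narrow energy slices."""
    edges = np.linspace(e_min, e_max, n_slices + 1)
    widths = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_slice = (energies >= lo) & (energies < hi)
        if np.count_nonzero(in_slice) > 10:      # skip poorly populated slices
            widths.append(np.std(aoe[in_slice]))
    return np.mean(widths) if widths else float("nan")

# Relative improvement after reconstruction (illustrative usage):
# 1.0 - band_width(energies, aoe_reconstructed) / band_width(energies, aoe_original)
```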

[Figure: top, current pulses and reconstructions of a single-site and a multi-site event at 2038 keV from ANG5; bottom, σreconst − σnoise versus energy for all events above 1 MeV with the projected distribution on the right.]

Figure 15: Top: Reconstructions of the current pulses of a single-site and a multi-site event at 2038 keV from semi-coaxial detector ANG5. Bottom: Reconstruction error of the autoencoder for ANG5 plotted against energy. The units and axes are defined as in figure 12. As the autoencoder models a sinusoidal component of the characteristic noise, it is more often than not more accurate than what would be expected from pure noise removal. This can be seen in the negative median additional reconstruction error of −0.000375 (peak of the blue curve).

5.3 Limitations and possible further applications of the autoencoder

The autoencoder learns the quintessential features of the data it is trained on. The importance of a certain class of events is determined by its abundance in the training process. As the training data consists mostly of events resulting from Compton scattering interactions, the autoencoder learns the typical single-site and multi-site events from calibration data. This also implies that event classes that are very rare or completely absent from the training data are not optimized for. Investigating these failure cases allows some insight into how the autoencoder works and what other applications could be feasible. Figure 16 shows some of the worst pulse reconstructions of a BEGe detector from physics data. There are many causes which can lead to a poor reconstruction.

[Figure: six example current pulses and their reconstructions, labeled "No energy deposition (594 keV)", "Strong noise (22 keV)", "Undershoot (60 keV)", "Narrow spike (1053 keV)", "Three spikes (1523 keV)" and "Spikes far apart (1524 keV)".]

Figure 16: Example waveforms with a high reconstruction error from physics data of BEGe detector GD00A.

The waveform labeled No energy deposition shows a trace which does not arise from an energy deposition. This is usually caused by trigger problems. While the traces of three detectors passed all quality cuts during this event, the waveform here is simply an increase in the baseline level, possibly due to detector crosstalk.

At very low energies, such as in Strong noise, the relative noise level, which scales approximately inversely with energy, is very large. The autoencoder tries to reproduce some of the noise patterns but cannot reach an accuracy comparable to higher-energy events. To prevent the autoencoder from learning specific noise patterns, all pulses resulting from energy depositions below 1 MeV were excluded from the training data.

The large undershoot in Undershoot is a very uncommon scenario which influences the alignment performed in preprocessing. The likely cause here is crosstalk between detectors, as seven detectors triggered in this event.

At energies above 1 MeV, high reconstruction errors are caused only by pulse shapes that are very rare or not present in calibration data. The spike in Narrow spike is not as broad as in common single-site events from the training set, which causes the autoencoder to underestimate its height. In Three spikes, the autoencoder fails to reproduce the three energy depositions because such events are very rare and hard to encode in the feature space. The same goes for Spikes far apart, where the time between the two energy depositions is uncommonly large, which prevents the autoencoder from properly identifying the spike heights. The bad reconstructions in Narrow spike, Three spikes and Spikes far apart give some insight into how the autoencoder works. In all of these cases, the autoencoder tries to fit a learned pulse shape or a combination of multiple learned shapes to the data. This fails because these events are not common enough to force the autoencoder to reserve some feature space for them.

Comparison of the reconstructed and the original pulse can help to identify unphysical events. The example labeled Undershoot shows an event where the current pulse contains a negative spike. This feature is not supposed to occur given our understanding of the detector physics and electronics model, and it is extremely rare in the training data, if present at all.

Most of the failure examples from figure 16 are due to the fact that the original waveforms are extremely rare in the training data. The training set is completely made up of calibration data. The calibration runs, however, do not produce all event types that are encountered in physics data. Alpha decays at the contact of the detector, for example, happen so infrequently that close to none are seen during calibration. In physics data, however, the exposure time is much larger and alpha events contribute the largest part of events above the 2νββ region, with between 30 and 300 events per semi-coaxial detector during Phase IIa (the count is lower for BEGe detectors, between 1 and 23, due to their smaller contact). Selecting only events above 3 MeV produces a data set that consists mostly of this type of alpha decay. The total number of recorded alpha events is about 1500. The traces from alpha decays at the contact usually have a shorter rise time than other event types. For BEGes, this is not a problem, as the established A/E discrimination can exclude alpha events at the contact with an upper A/E limit. This does not work for semi-coaxial detectors, though, and neural networks are currently used to reject these events.
The number of alpha events is too low to correctly train any reasonably sized neural network without overfitting. For a detector with 300 alpha events during data taking, a recognizing network should not contain more than 30 weights. This is next to impossible to achieve if the network is trained directly on the waveform. A shallow network trained on the feature vector φ(x), however, could work with just 8 weights. This motivates an investigation into the reconstruction quality of alpha events from physics data. Figure 17 shows some reconstructions of alpha events of different semi-coaxial detectors. The reconstruction quality strongly depends on the pulse shape and the detector. For ANG1-4, most of the events are reasonably well reconstructed, while the RGs and ANG5 contain a considerable number of bad reconstructions. Overall, the reconstruction accuracy is probably too low to perform efficient alpha classification. One might consider inserting an inflated alpha set into the training data of the autoencoder so that it can learn to correctly project these events into the feature space. This could be done either by simply duplicating alpha events so they have more weight in the training process or by data augmentation, i.e. artificially enlarging the set of samples.

[Figure: top, example alpha pulses and their reconstructions for ANG1 (5203 keV), RG1 (5009 keV) and ANG5 (4475 keV); bottom, σreconst − σnoise versus energy for alpha events, with projected distributions for ANG1-4 (510 events), RG1,2 (142 events) and ANG5 (70 events).]

Figure 17: Reconstruction quality of alpha events from physics data. The alpha events were selected in the energy range from 3 MeV to 6 MeV with the muon veto and the anti-coincidence cut active. Top: Reconstructions of example alpha pulses of different detectors. Bottom: Distribution of the added reconstruction error (as in figures 12 and 15).

Augmentation could include adding baseline noise to the alpha events to produce more of them for training. This technique would allow a shallow classifier network to recognize alpha events in the same way single-site events are distinguished from multi-site events, as described in chapter 6 (a minimal sketch of this idea is given at the end of this section).

Another application of the autoencoder is again fueled by its smart noise removal. It was already shown that this automatically reduces the width of the single-site band. A similar approach could be used to improve the energy reconstruction, which is currently limited by the noise in the waveform. This could be achieved with an autoencoder that is trained on the charge pulse instead of the current pulse. Just like the autoencoder used here, it would learn common pulse shapes from calibration data and try to reproduce them, thereby removing the high-frequency noise.

As explained above, the feature space was chosen to be seven parameters in size, as this allowed for stable and reliable convergence with high accuracy. As this value belongs to the hyperparameters of the autoencoder, it can be set to arbitrary values, including one, as long as the training converges. An autoencoder with a one-dimensional feature vector φ = (φ₁) maps any given pulse to a single value φ₁ in a continuous fashion. Changing this value therefore smoothly interpolates between all learned pulse shapes under the inverse transformation ψ, giving φ₁ a similar role to A/E, but for semi-coaxial detectors. Experiments in this direction showed some promising results but could not

be reproduced for every detector.
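As a concrete illustration of the augmentation and shallow-classifier ideas above: one could enlarge the small set of alpha waveforms by adding recorded baseline noise to duplicates, encode the augmented set with the trained autoencoder, and fit a classifier with exactly 8 weights (7 feature weights plus a bias) on the resulting feature vectors. The sketch below assumes NumPy arrays of waveforms, baseline traces of the same length, labels and precomputed feature vectors; it is one possible realization, not a procedure used in this work.

```python
import numpy as np

def augment_with_noise(waveforms, baselines, copies=20):
    """Enlarge a small event set by adding recorded baseline noise to duplicated waveforms."""
    rng = np.random.default_rng(0)
    augmented = [w + baselines[rng.integers(len(baselines))]
                 for w in waveforms for _ in range(copies)]
    return np.array(augmented)

def train_shallow_classifier(features, labels, lr=0.1, epochs=1000):
    """Logistic regression on the 7-dim feature vectors: 7 weights + 1 bias = 8 parameters."""
    w = np.zeros(features.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(features @ w + b)))   # sigmoid output
        grad = p - labels                               # gradient of the cross-entropy loss
        w -= lr * features.T @ grad / len(labels)
        b -= lr * grad.mean()
    return w, b
```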
