Online Track and Vertex Reconstruction on GPUs for the Mu3e Experiment
Dorothea vom Bruch
March 7th, 2017
Connecting the Dots / Workshop on Intelligent Trackers 2017
Mu3e Signal
Signal
● Coincident in time
● Single vertex
● $\sum_i \vec{p}_i = 0$
● $E = m_\mu$
(Figure: signal decay topology with tracks e+, e+, e−.)
Search for charged lepton flavour-violating decay μ+ → e+e−e+ with a sensitivity in branching ratio better than 10⁻¹⁶
The Mu3e Detector
(Figure: detector layout along the μ beam in the magnetic field B. Labels: Target, Inner pixel layers, Scintillating fibres, Outer pixel layers, Recurl pixel layers, Scintillator tiles. Scale markers: 10 cm and 4.5 cm.)
Readout Scheme
FPGA: Field-Programmable Gate Array, GPU: Graphics Processing Unit
Front-end (inside magnet):
● 2844 Pixel Sensors → up to 45 × 1.25 Gbit/s links → 86 FPGAs
● 3072 Fibre Readout Channels → 12 FPGAs
● 6272 Tiles → 14 FPGAs
● 1 × 6 Gbit/s link each → Switching Boards
Switching Boards:
● 12 × 10 Gbit/s links per Switching Board → 12 GPU PCs, 8 inputs each
GPU PCs → Gbit Ethernet → Data Collection Server → Mass Storage
From Switching board: get 50 ns time slices of data containing full detector information
Readout Rate
Detector          Data rate [Gbit/s]
Pixel detector    40
Fibre detector    20
Tile detector     negligible
Total             ~ 60
At a rate of 10⁸ muons / s
Triggerless, zero-suppressed readout
Need factor ~80 reduction to reach 100 MB/s
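The reduction factor can be checked with a few lines of arithmetic. This is only a sketch using the rates quoted above; the variable names are illustrative, and decimal units (1 Gbit = 1000 Mbit) are assumed.

```python
# Data rates quoted on the slide, in Gbit/s (the tile detector is negligible).
rates_gbit_per_s = {"pixel": 40.0, "fibre": 20.0}

total_gbit_per_s = sum(rates_gbit_per_s.values())

# Assumption: decimal units, 1 Gbit = 1000 Mbit, 8 bit per byte.
total_mbyte_per_s = total_gbit_per_s * 1000.0 / 8.0

target_mbyte_per_s = 100.0  # storage bandwidth quoted on the slide
reduction_factor = total_mbyte_per_s / target_mbyte_per_s

print(f"total: {total_mbyte_per_s:.0f} MB/s, reduction factor: {reduction_factor:.0f}")
```

With these inputs the factor comes out at 75, which the slide rounds to ~80.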
Selection Process
How do we find the three signal tracks?
1) Track fitting
2) Vertex search
(Figure: three signal tracks e+, e+, e−.)
Geometrical Selection
(Figure: hits 0, 1, 2 in consecutive layers, shown in two projections with axes r and x, y.)
● Between hits 0 and 1, cut on z1 − z0 and Φ1 − Φ0
● Between hits 1 and 2, cut on z2 − z1 and Φ2 − Φ1
Reduce 3-hit combinations by factor 50
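A minimal sketch of such a geometrical pre-selection, assuming hits are given as (x, y, z) tuples and that the same cut values apply to both layer pairs. The function names and the cut values `max_dz` and `max_dphi` are hypothetical and not taken from the talk.

```python
import math
from itertools import product


def delta_phi(phi_a, phi_b):
    """Azimuthal difference phi_b - phi_a, wrapped into [-pi, pi)."""
    return (phi_b - phi_a + math.pi) % (2.0 * math.pi) - math.pi


def passes_cuts(hit_a, hit_b, max_dz, max_dphi):
    """Cut on the z-difference and the phi-difference of two hits."""
    phi_a = math.atan2(hit_a[1], hit_a[0])
    phi_b = math.atan2(hit_b[1], hit_b[0])
    return (abs(hit_b[2] - hit_a[2]) <= max_dz
            and abs(delta_phi(phi_a, phi_b)) <= max_dphi)


def select_triplets(layer0, layer1, layer2, max_dz, max_dphi):
    """Keep only 3-hit combinations that pass both pairwise cuts."""
    return [
        (h0, h1, h2)
        for h0, h1, h2 in product(layer0, layer1, layer2)
        if passes_cuts(h0, h1, max_dz, max_dphi)
        and passes_cuts(h1, h2, max_dz, max_dphi)
    ]


# Toy hits (mm); the cut values are placeholders, not the experiment's values.
layer0 = [(20.0, 0.0, 0.0)]
layer1 = [(30.0, 2.0, 5.0), (-30.0, 0.0, 5.0)]
layer2 = [(70.0, 5.0, 12.0), (70.0, 5.0, 200.0)]
triplets = select_triplets(layer0, layer1, layer2, max_dz=30.0, max_dphi=0.5)
print(len(triplets), "of", len(layer0) * len(layer1) * len(layer2), "combinations kept")
```

Only the combinations that survive these cheap cuts need to be passed on to the fit.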
Fitting
● Use Multiple Scattering Fit (→ talk by A. Kozlinskiy)
● Fit hits in first three layers
● Propagate to 4th layer
● Select hit in 4th layer closest to propagated position
● Redo fit with a second triplet, cut on χ²
After all selections:
● 98.5 % of true 4-hit MC tracks selected
● 74 % of 4-hit tracks are true MC tracks
Vertex Estimate: XY-Plane
● Study each combination of two e+, one e-
● In xy-plane: find intersections of track circles
● Calculate weights of intersections based on uncertainties due to
– multiple scattering
– pixel size
(Figure: track circles of e+, e+ and e− in the xy-plane with their intersections.)
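The intersections of two track circles in the xy-plane follow from standard geometry. The sketch below assumes each track is described by a circle centre and radius; the function name and the toy numbers are illustrative, and the uncertainty-based weighting from the slide is not included.

```python
import math


def circle_intersections(c0, r0, c1, r1):
    """Return the 0, 1 or 2 intersection points of two circles in the xy-plane."""
    dx, dy = c1[0] - c0[0], c1[1] - c0[1]
    d = math.hypot(dx, dy)
    # No intersection: concentric, too far apart, or one circle inside the other.
    if d == 0.0 or d > r0 + r1 or d < abs(r0 - r1):
        return []
    # Distance from c0 to the chord through the intersection points.
    a = (r0**2 - r1**2 + d**2) / (2.0 * d)
    # Half-length of the chord; clamp against rounding for touching circles.
    h = math.sqrt(max(r0**2 - a**2, 0.0))
    mx, my = c0[0] + a * dx / d, c0[1] + a * dy / d
    if h == 0.0:
        return [(mx, my)]
    return [(mx - h * dy / d, my + h * dx / d),
            (mx + h * dy / d, my - h * dx / d)]


points = circle_intersections((0.0, 0.0), 5.0, (8.0, 0.0), 5.0)
print(points)
```

For each combination of two e+ and one e−, this would be evaluated for every pair of tracks before the intersections are weighted.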
Vertex Estimate
(Figure: xy-plane with the weighted mean and the points of closest approach PCAxy 1, PCAxy 2, PCAxy 3 on the three tracks.)
● Calculate weighted mean of intersections from three different tracks
● Find point of closest approach (PCAxy) to weighted mean in xy-plane on each track
● Calculate z-position PCAz and weight at PCAxy
● Find weighted mean in z-coordinate
● Achieve vertex resolution of ~400 μm sigma
$$\chi^2 = \sum_{i=0}^{3} \left[ \frac{PCA_{xy,i} - \overline{xy}}{\sigma_{PCA_{xy},i}} + \frac{PCA_{z,i} - \bar{z}}{\sigma_{PCA_{z},i}} \right]$$
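The weighted means used for the vertex estimate can be sketched as follows. Inverse-variance weights (1/σ²) are an assumption here; the slide only states that the weights are based on the uncertainties. Function names and numbers are illustrative.

```python
def weighted_mean(values, sigmas):
    """Weighted mean with weights 1 / sigma^2 (assumed weighting scheme)."""
    weights = [1.0 / s**2 for s in sigmas]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def weighted_mean_xy(points, sigmas):
    """Apply the same weighted mean separately to the x and y coordinates."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return weighted_mean(xs, sigmas), weighted_mean(ys, sigmas)


# Toy example: three intersection points (mm) with their uncertainties (mm).
intersections = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0)]
uncertainties = [1.0, 1.0, 1.0]
vertex_xy = weighted_mean_xy(intersections, uncertainties)
print(vertex_xy)
```

The same helper can be reused for the z-coordinate, with the PCAz values and their uncertainties as inputs.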
χ² Distribution
(Figure: χ² from 0 to 100 versus number of entries, with ticks at 10³ and 10⁴; series: Random combinations, Signal.)
Cut Effects
Signal reference: full offline track reconstruction and offline vertex fit
(Figure: signal frames accepted and background frames accepted; axis ranges 0.986 to 1.000 and 0 to 0.35; series: background, tight signal cut.)
Fast Reconstruction on GPU
● Use time slices of 50 ns for track & vertex search
→ Process 20·10⁶ time slices per second
● Plan for 12 filter farm PCs with one GPU each
→ Process at least 1.7·10⁶ time slices per second
→ use GPUs
● Thousands of cores
● Optimal parallel performance
● Best suited for many floating-point operations / second
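The required throughput follows directly from the slice length and the number of PCs. A short sketch of that arithmetic, with illustrative variable names:

```python
NS_PER_SECOND = 10**9

slice_length_ns = 50   # length of one time slice
n_filter_pcs = 12      # one GPU per filter farm PC

# Back-to-back 50 ns slices: 1 s / 50 ns slices per second in total.
slices_per_second = NS_PER_SECOND // slice_length_ns
slices_per_pc = slices_per_second / n_filter_pcs

print(f"{slices_per_second:.1e} slices/s in total, {slices_per_pc:.2e} per PC")
```

This reproduces the 20·10⁶ slices per second in total and roughly 1.7·10⁶ per PC quoted above.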
Selection on GPU
PCIe FPGA:
● Inputs: hits layer 1, hits layer 2, hits layer 3, hits layer 4; recurl station hits, timing information
● Coordinate transformation
● Geometrical three-hit selection
→ DMA → GPU memory
GPU:
● Three-hit fit
● Propagation, four-hit fit
● Positive tracks, negative tracks
● Vertex selection
→ GPU memory: selection decision → DMA
Parallelization Track Fit
(Figure: grid of time slices 1, 2, ..., N, each processed by threads 1, 2, ..., N.)
● Fit for one combination of three hits
● Propagation to 4th layer
● Loop over hits in 4th layer: check if hit exists in proximity of propagated track, re-fit
● Wait for all cores in one time slice to be done with previous steps
16 x 8192 50 ns time slices
96 threads / time slice
Total of 12.6 million threads to be distributed among 2560 cores
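The thread count quoted above is the product of the grid dimensions. A short check, with illustrative variable names:

```python
n_time_slices = 16 * 8192     # time slices processed together
threads_per_slice = 96        # one thread per three-hit combination
n_cores = 2560                # GPU cores quoted on the slide

total_threads = n_time_slices * threads_per_slice
threads_per_core = total_threads / n_cores

print(f"{total_threads} threads in total, {threads_per_core:.1f} per core")
```

Each core therefore works through several thousand threads, which is what keeps the GPU occupied while individual threads wait on memory.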
Parallelization Vertex Selection
(Figure: grid of time slices 1, 2, ..., N, each processed by threads 1, 2, ..., N.)
● For one electron & one positron from this 50 ns time slice:
– Loop over all other positrons
– Find vertex estimate
● Decide whether to keep this time slice
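A sequential sketch of the combinatorics behind the vertex selection. On the GPU each thread handles one electron-positron pair and loops over the other positrons; here the same set of two-positron, one-electron candidates is simply enumerated. The function names and the `is_good_vertex` predicate are hypothetical stand-ins for the vertex estimate and its cuts.

```python
from itertools import combinations


def vertex_candidates(positrons, electrons):
    """Yield every combination of two positrons and one electron."""
    for electron in electrons:
        for pos_a, pos_b in combinations(positrons, 2):
            yield pos_a, pos_b, electron


def keep_time_slice(positrons, electrons, is_good_vertex):
    """Keep the time slice if any candidate passes the vertex selection."""
    return any(is_good_vertex(c) for c in vertex_candidates(positrons, electrons))


# Toy example with track labels instead of fitted tracks.
candidates = list(vertex_candidates(["p0", "p1", "p2"], ["e0", "e1"]))
print(len(candidates), "candidates")
```

With three positrons and two electrons this yields six candidates; a slice with fewer than two positrons or no electron can never be kept.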
Performance
Optimizations performed:
● Memory layout and access pattern
● Register usage
● Grid dimensions
● Approximations
Currently process 2·10⁶ time slices / s on one NVIDIA GTX 1080 at a muon stopping rate of 7·10⁷ Hz
Backup
Muon Stopping Rate Study I
(Figure: signal frames accepted and background frames accepted versus muon stopping rate on target [Hz], 4·10⁷ to 1.2·10⁸; axis ranges 0.86 to 1 and 0 to 0.06; series: background, tight signal cut, truth signal, loose signal cut.)
Muon Stopping Rate Study II
(Figure: frames / s, 0 to 4·10⁶, versus muon stopping rate on target, 4·10⁷ to 1.4·10⁸.)
(Figure: frames with hit overflow and frames with triplet overflow versus muon stopping rate on target, 4·10⁷ to 1.4·10⁸.)
The Mu3e Experiment
Search for charged lepton flavour-violating decay μ+ → e+e−e+ with a sensitivity in branching ratio better than 10⁻¹⁶
Branching ratio suppressed in Standard Model to below 10⁻⁵⁴
Any hint of a signal → new physics
● Supersymmetry
● Grand unified models
● Extended Higgs sector
● ...
Mupix7: Efficiency
Mupix7, HV = -85 V
Mupix: Mechanics
● 50 μm silicon
● ∼ 50 μm flexprint: Kapton, aluminum, copper
● 25 μm Kapton foil
→ O(0.1 %) radiation length
Sensitivity Study
(Figure: events per 0.2 MeV/c², 10⁻³ to 10², versus reconstructed mass m_rec [MeV/c²], 96 to 110; curves: μ → eee at 10⁻¹², 10⁻¹³, 10⁻¹⁴ and 10⁻¹⁵, μ → eeeνν, Bhabha + Michel; Mu3e Phase I, 10¹⁵ muon stops at 10⁸ muons/s.)
Multiple Scattering
● Muons decay at rest
→ momentum < 53 MeV/c
● Momentum resolution to first order:
σp/p ∼ θMS/Ω
● Use recurling tracks for momentum measurement
Mupix Prototype
● Readout electronics on chip
● Fast LVDS link: 1.25 Gbit/s
● Mupix7: latest prototype
● Thinned to 50 μm
● 32 x 40 pixel matrix
● Pixel size: 103 μm x 80 μm
● 3.2 x 3.2 mm²
Muon Beam
@ Paul Scherrer Institute (PSI)
● 590 MeV cyclotron
● 2.2 mA proton beam
● Most powerful proton beam worldwide
● Target E: 28 MeV/c surface muons to πE5 beamline
Data Transfer
● Transfer data from FPGA to RAM via direct memory access (DMA)
● Tested at 1.5 GB/s: BER ≤ 4·10⁻¹⁶ (at 95% confidence level)
● Tested on beam test campaigns
● Will be used for readout of next MuPix prototype
LVDS connector for data cable from MuPix chip
Multiple Scattering Fit
● Electrons: 12 – 53 MeV/c
● Resolution dominated by multiple Coulomb scattering
● Ignore hit uncertainty
● Three consecutive hits: “triplet”
● Multiple scattering at middle hit of triplet
● Minimize multiple scattering
$$\chi^2 = \frac{\Phi_{MS}^2}{\sigma_{\Phi}^2} + \frac{\theta_{MS}^2}{\sigma_{\theta}^2}$$
(Figure: triplet in two projections, with path lengths S01 and S12 between the hits and scattering angles Φ_MS and Θ_MS at the middle hit.)
Geometrical Selection
In subsequent layers, cut on:
● Z-difference of hits
● Φ-difference of hits
After all cuts: reduce 3-hit combinations by factor 50
(Figure: Φ1 − Φ0 between hits in subsequent layers.)
Radius Distribution
(Figure: distribution from −400 to 400 for positrons and electrons; y-axis: number of events / 4 mm, 0 to 3500.)
Z distance
Uncertainty at Intersection
$\sigma_{MS,PCA} = \sigma_{MS,\text{first layer}} \cdot s \approx 0.8\ \text{mm}$
Pixel size: $\sigma = 0.08\ \text{mm} / \sqrt{12} \approx 0.02\ \text{mm}$
Take both into account when calculating weights
(Figure: multiple scattering sigma at first layer [rad], 0 to 0.1; y-axis: number of events / mrad.)
(Figure: path length in xy-plane from first layer to PCA [mm], 0 to 50; y-axis: number of events / 0.5 mm.)
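The two uncertainty contributions can be put side by side numerically. In the sketch below, adding them in quadrature and using 1/σ² as the weight are assumptions, since the slide only says that both are taken into account; the variable names are illustrative.

```python
import math

pixel_pitch_mm = 0.08
# Standard deviation of a uniform distribution over one pixel.
sigma_pixel_mm = pixel_pitch_mm / math.sqrt(12.0)

# Typical multiple-scattering uncertainty at the PCA, as quoted on the slide.
sigma_ms_mm = 0.8

# Assumption: independent contributions, combined in quadrature.
sigma_total_mm = math.hypot(sigma_ms_mm, sigma_pixel_mm)
weight = 1.0 / sigma_total_mm**2  # assumed inverse-variance weight

print(f"pixel: {sigma_pixel_mm:.4f} mm, total: {sigma_total_mm:.4f} mm")
```

For these typical values multiple scattering dominates and the pixel term changes the total by well under a per cent.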
Offline Reconstruction Reference
● Full detector simulation is available
● For this study:
– Simulated signal events with one signal decay / 50 ns frame
– Simulated background events with ordinary muon decays
● Full offline reconstruction includes:
– Track reconstruction with hits from all layers and recurl stations
– Matching and linking of recurling track pieces
– Linearised vertex fit for low momentum tracks in magnetic field
Selection on GPU
● Obtain 50 ns data slices on DAQ computer, so-called frames
● Need to process 20·10⁶ frames / s
● Will have about 10 DAQ computers
● → Process 2·10⁶ frames / s on each computer
FPGA:
● Geometric selection cuts
● Save hit positions of the three hits belonging to one triplet and hits in fourth layer
→ DMA →
GPU:
● Fits with three and four hits
● Vertex selection
● Save frame decision
Vertex Position Distribution
(Figures: true − estimated vertex position, number of events / 0.1 mm; x and y shown from −10 to 10 mm; 7603 entries per histogram, each with a Gaussian fit.)
● x [mm]: Mean 0.01541, RMS 1.332; fit: χ² / ndf 84.29 / 12, Constant 600.6 ± 11.1, Mean −0.002901 ± 0.006102, Sigma 0.3914 ± 0.0068
● y [mm]: Mean 0.04704, RMS 1.331; fit: χ² / ndf 84.32 / 14, Constant 613 ± 10.9, Mean −0.004342 ± 0.005676, Sigma 0.3941 ± 0.0058
● Third histogram (axis label not recovered): Mean −0.02629, RMS 0.8089; fit: χ² / ndf 40.6 / 6, Constant 1065 ± 18.8, Mean −0.01068 ± 0.00355, Sigma 0.2314 ± 0.0037
Combined Momentum and Energy
(Figure: combined momentum magnitude [MeV/c], 0 to 100; y-axis: number of events / MeV/c; series: Signal, Random combinations.)
(Figure: combined energy [MeV], 0 to 200; y-axis: number of events / MeV; series: Signal, Random combinations.)
Distance to Target
(Figure: distance to target [mm], 0 to 20; y-axis: number of events / 0.1 mm; series: Signal, Random combinations.)
Pixel Detector
● High Voltage Monolithic Active Pixel Sensors (HV-MAPS)
● Fast charge collection via drift
● Thinned down to 50 μm
● Pixel size: 80 μm x 80 μm
● Chip size: 2 cm x 2 cm
● Thickness chip & readout: O(0.1 %) radiation length
Readout Scheme
Front-end board:
● Sort hits according to time stamps
● Send off via optical links
Switching board:
● Merge data from different detector regions
● Pack into 50 ns time slices
● Send off via optical links
PCIe board:
● First data selection
● Transfer data to RAM of PC via PCIe