Online Track and Vertex Reconstruction on GPUs for the Mu3e Experiment

(1)

Online Track and Vertex Reconstruction on GPUs for the Mu3e Experiment

Dorothea vom Bruch

March 7th, 2017

Connecting the Dots / Workshop on Intelligent Trackers 2017

(2)

Mu3e Signal

Search for the charged lepton flavour-violating decay μ⁺ → e⁺ e⁻ e⁺ with a sensitivity in branching ratio better than 10⁻¹⁶

Signal

● Coincident in time

● Single vertex

● ∑ p⃗_i = 0

● E = m_μ

(Figure: signal topology with e⁺, e⁺ and e⁻.)

(3)

The Mu3e Detector

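(Figure: schematic of the detector with the μ beam and magnetic field B. Labelled components: target, inner pixel layers, scintillating fibres, outer pixel layers, recurl pixel layers, scintillator tiles. Scale markers: 10 cm and 4.5 cm.)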

(5)

Readout Scheme

FPGA: Field-Programmable Gate Array, GPU: Graphics Processing Unit

Front-end (inside magnet):

● 2844 pixel sensors, read out over up to 45 links at 1.25 Gbit/s → 86 FPGAs

● 3072 fibre readout channels → 12 FPGAs

● 6272 tiles → 14 FPGAs

● One 6 Gbit/s link from each front-end FPGA

Switching boards:

● 12 links at 10 Gbit/s per switching board

Filter farm:

● 12 GPU PCs, 8 inputs each

● Data collection server and mass storage, connected via Gbit Ethernet

(6)

Readout Scheme

From the switching boards: get 50 ns time slices of data containing full detector information

(7)

Readout Rate

Detector          Data rate [Gbit / s]

Pixel detector    40

Fiber detector    20

Tile detector     negligible

Total             ~ 60

At a rate of 10⁸ muons / s

Triggerless, zero-suppressed readout

Need factor ~80 reduction to reach 100 MB/s
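The quoted reduction factor can be checked with a few lines of arithmetic. This is a minimal sketch that assumes decimal units (1 Gbit = 1000 Mbit, 8 bit per byte); the variable names are illustrative only.

```python
# Rates quoted in the table above, in Gbit/s (tile detector is negligible).
pixel_gbit_s = 40.0
fibre_gbit_s = 20.0
total_gbit_s = pixel_gbit_s + fibre_gbit_s

# Convert to MB/s, assuming decimal units: 1 Gbit = 1000 Mbit, 8 bit = 1 byte.
total_mb_s = total_gbit_s * 1000.0 / 8.0

# Output bandwidth quoted on the slide.
target_mb_s = 100.0
reduction_factor = total_mb_s / target_mb_s

print(f"input: {total_mb_s:.0f} MB/s, required reduction: {reduction_factor:.0f}x")
```

Under these assumptions the input is 7500 MB/s and the factor comes out at 75, which the slide rounds to ~80.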

(9)

Selection Process

How do we find the three signal tracks?

1) Track fitting

2) Vertex search

(Figure: the three signal tracks e⁺, e⁺ and e⁻.)

(10)

Geometrical Selection

(Figure: hits 0, 1 and 2 in three consecutive layers, shown in two projections with axes x, y and r.)

Cut on the differences between hits in consecutive layers:

● z₁ - z₀ and Ф₁ - Ф₀

● z₂ - z₁ and Ф₂ - Ф₁

Reduce 3-hit combinations by factor 50
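The cuts on z- and Ф-differences between consecutive layers could be sketched as follows. This is an illustration only: the hit representation, the function names and the cut values are assumptions, not the values used in the experiment.

```python
import math

def delta_phi(phi_a, phi_b):
    """Signed azimuthal difference phi_b - phi_a, wrapped into [-pi, pi)."""
    return (phi_b - phi_a + math.pi) % (2.0 * math.pi) - math.pi

def passes_geometric_cuts(hit0, hit1, hit2, max_dz, max_dphi):
    """Each hit is a (phi [rad], z [mm]) tuple in layers 0, 1 and 2."""
    for inner, outer in ((hit0, hit1), (hit1, hit2)):
        if abs(outer[1] - inner[1]) > max_dz:
            return False
        if abs(delta_phi(inner[0], outer[0])) > max_dphi:
            return False
    return True

# Hypothetical cut values, chosen only to make the example concrete.
MAX_DZ_MM = 30.0
MAX_DPHI_RAD = 0.8

triplet = ((0.10, 0.0), (0.20, 5.0), (0.35, 11.0))
print(passes_geometric_cuts(*triplet, MAX_DZ_MM, MAX_DPHI_RAD))
```

Wrapping the Ф-difference avoids rejecting hit pairs that straddle the ±π boundary.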

(14)

Fitting

● Use Multiple Scattering Fit (→ talk by A. Kozlinskiy)

● Fit hits in first three layers

● Propagate to 4th layer

● Select hit in 4th layer closest to propagated position

● Redo fit with a second triplet, cut on χ²

After all selections:

● 98.5 % of true 4-hit MC tracks selected

● 74 % of 4-hit tracks are true MC tracks
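The step "select hit in 4th layer closest to propagated position" can be illustrated with a small helper. This is a sketch under assumptions: positions are plain (x, y, z) tuples in mm, and the search window `max_distance` is a hypothetical parameter not given in the talk.

```python
import math

def closest_hit(predicted, hits, max_distance):
    """Return the hit nearest to the propagated position, or None if no hit
    lies within max_distance. All positions are (x, y, z) tuples in mm."""
    best, best_dist = None, max_distance
    for hit in hits:
        dist = math.dist(predicted, hit)
        if dist <= best_dist:
            best, best_dist = hit, dist
    return best

layer4_hits = [(70.0, 1.0, 10.0), (68.0, 12.0, 40.0), (71.0, 0.5, 9.0)]
print(closest_hit((70.5, 0.8, 9.5), layer4_hits, max_distance=5.0))
```

The selected hit would then enter the second triplet fit, followed by the χ² cut.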

(15)

Vertex Estimate: XY-Plane

● Study each combination of two e⁺, one e⁻

● In xy-plane: find intersections of track circles

● Calculate weights of intersections based on uncertainties due to

– multiple scattering

– pixel size

(Figure: track circles of e⁺, e⁺ and e⁻ and their intersections in the xy-plane.)
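Finding the intersections of two track circles in the xy-plane is standard geometry. The sketch below is a generic circle-circle intersection, not the GPU implementation from the talk; the circle parameters in the example are made up.

```python
import math

def circle_intersections(c0, r0, c1, r1):
    """Intersection points of two circles given by centre (x, y) and radius.
    Returns a list with zero, one or two (x, y) points."""
    dx, dy = c1[0] - c0[0], c1[1] - c0[1]
    d = math.hypot(dx, dy)
    if d == 0.0 or d > r0 + r1 or d < abs(r0 - r1):
        return []  # concentric, too far apart, or one circle inside the other
    # Distance from c0 to the chord through the intersection points.
    a = (r0 * r0 - r1 * r1 + d * d) / (2.0 * d)
    # Half-length of that chord.
    h = math.sqrt(max(r0 * r0 - a * a, 0.0))
    mx, my = c0[0] + a * dx / d, c0[1] + a * dy / d
    if h == 0.0:
        return [(mx, my)]  # circles touch in a single point
    ox, oy = -dy * h / d, dx * h / d
    return [(mx + ox, my + oy), (mx - ox, my - oy)]

print(circle_intersections((0.0, 0.0), 5.0, (6.0, 0.0), 5.0))
```

For three tracks this would be called once per track pair, and the resulting points weighted by their uncertainties.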

(19)

Vertex Estimate

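● Calculate weighted mean of intersections from three different tracks

● Find point of closest approach (PCA_xy) to weighted mean in xy-plane on each track

● Calculate z-position PCA_z and weight at PCA_xy

● Find weighted mean in z-coordinate

● Achieve vertex resolution of ~400 μm (sigma)

$$\chi^2 = \sum_{i=1}^{3} \left[ \frac{\left(\mathrm{PCA}_{xy,i} - \overline{xy}\right)^2}{\sigma_{\mathrm{PCA}_{xy,i}}^2} + \frac{\left(\mathrm{PCA}_{z,i} - \bar{z}\right)^2}{\sigma_{\mathrm{PCA}_{z,i}}^2} \right]$$

(Figure: PCA_xy 1, PCA_xy 2 and PCA_xy 3 on the three tracks around the weighted mean in the xy-plane, with a second view along z.)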
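The weighted mean and the χ² of the three points of closest approach can be sketched as below. Two assumptions are made that the slide does not state: weights are taken as inverse variances, and the xy residual is taken as the distance in the xy-plane between PCA_xy and the weighted mean. All names are illustrative.

```python
import math

def weighted_mean(values, sigmas):
    """Inverse-variance weighted mean (the weighting scheme is an assumption)."""
    weights = [1.0 / (s * s) for s in sigmas]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

def vertex_chi2(pca_xy, sigma_xy, pca_z, sigma_z, mean_xy, mean_z):
    """Sum over tracks of squared, normalised residuals in xy and in z.
    pca_xy is a list of (x, y) points; the other inputs are lists of floats,
    except mean_xy, which is an (x, y) point."""
    chi2 = 0.0
    for (x, y), s_xy, z, s_z in zip(pca_xy, sigma_xy, pca_z, sigma_z):
        d_xy = math.hypot(x - mean_xy[0], y - mean_xy[1])
        chi2 += (d_xy / s_xy) ** 2 + ((z - mean_z) / s_z) ** 2
    return chi2

pca_xy = [(1.0, 0.0), (-1.0, 0.0), (0.0, 0.0)]
pca_z = [0.0, 0.0, 2.0]
sig = [1.0, 1.0, 1.0]
print(vertex_chi2(pca_xy, sig, pca_z, sig, mean_xy=(0.0, 0.0), mean_z=0.0))
```

A small χ² indicates that the three tracks are compatible with a common vertex, which is what separates signal from random combinations.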

(20)

χ² Distribution

(Plot: number of entries, logarithmic scale with ticks at 10³ and 10⁴, versus χ² from 0 to 100, for signal and for random combinations.)

(21)

Cut Effects

Signal reference: full offline track reconstruction and offline vertex fit

(Plot: signal frames accepted, axis from 0.986 to 1.000, and background frames accepted, axis from 0 to 0.35. Series: background, tight signal cut.)

(22)

Fast Reconstruction on GPU

● Use time slices of 50 ns for track & vertex search

→ Process 20∙10⁶ time slices per second

● Plan for 12 filter farm PCs with one GPU each

→ Process at least 1.7∙10⁶ time slices per second on each GPU

→ use GPUs

● Thousands of cores

● Optimal parallel performance

● Best suited for many floating-point operations / second
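The throughput numbers above follow directly from the time slice length and the number of GPUs. A minimal check, with illustrative variable names:

```python
SLICE_LENGTH_NS = 50   # length of one time slice
N_GPUS = 12            # one GPU in each of the 12 filter farm PCs

# Number of time slices produced per second by the detector.
slices_per_second = 1_000_000_000 // SLICE_LENGTH_NS

# Share each GPU has to process if the load is spread evenly.
slices_per_gpu = slices_per_second / N_GPUS

# Average time a single GPU may spend on one time slice.
time_budget_ns = SLICE_LENGTH_NS * N_GPUS

print(f"{slices_per_second:.1e} slices/s in total")
print(f"{slices_per_gpu:.2e} slices/s per GPU")
print(f"{time_budget_ns} ns average budget per slice on each GPU")
```

Each GPU therefore has on average 600 ns per time slice, which is only achievable by processing many slices in parallel.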

(23)

Selection on GPU

FPGA (PCIe board):

● Inputs: hits in layers 1, 2, 3 and 4, recurl station hits, timing information

● Coordinate transformation

● Geometrical three-hit selection

→ DMA to GPU memory

GPU:

● Three-hit fit

● Propagation, four-hit fit

● Positive tracks, negative tracks

● Vertex selection

● Selection decision written to GPU memory

→ DMA

(25)

Parallelization Track Fit

(Diagram: time slices 1, 2, ..., N are processed in parallel, each by threads 1, 2, ..., N.)

● 16 x 8192 time slices of 50 ns

● 96 threads / time slice

Per thread:

● Fit for one combination of three hits

● Propagation to 4th layer

● Loop over hits in 4th layer: check if hit exists in proximity of propagated track, re-fit

● Wait for all cores in one time slice to be done with previous steps

Total of 12.6 million threads to be distributed among 2560 cores
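The total thread count follows from the grid dimensions. The sketch below reproduces the arithmetic and shows one possible mapping from a global thread index to a time slice and a hit combination; that mapping is an assumption made for illustration, not taken from the talk.

```python
GRID = (16, 8192)          # time slices processed per launch
THREADS_PER_SLICE = 96     # threads working on one time slice
CORES = 2560               # GPU cores quoted on the slide

n_slices = GRID[0] * GRID[1]
total_threads = n_slices * THREADS_PER_SLICE
threads_per_core = total_threads / CORES

def locate(global_thread_id):
    """Hypothetical mapping: (time slice index, thread index within slice)."""
    return divmod(global_thread_id, THREADS_PER_SLICE)

print(f"{n_slices} time slices, {total_threads} threads")
print(f"{threads_per_core:.1f} threads per core")
print(locate(97))
```

With about 4900 threads per core, the scheduler always has work available while other threads wait for memory.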

(27)

Parallelization Vertex Selection

(Diagram: time slices 1, 2, ..., N are processed in parallel, each by threads 1, 2, ..., N.)

Per thread:

● For one electron & one positron from this 50 ns time slice:

– Loop over all other positrons

– Find vertex estimate

● Decide whether to keep this time slice
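The loop structure of the vertex selection can be written sequentially as below. This is a sketch, not the GPU kernel: `is_good_vertex` stands in for the vertex estimate and χ² cut, and restricting the inner loop to later positrons (so each positron pair is tested once) is an assumption.

```python
def thread_vertex_search(electron, positron, other_positrons, is_good_vertex):
    """Work of one thread: fixed e-/e+ pair, loop over the remaining positrons."""
    for other in other_positrons:
        if is_good_vertex(positron, other, electron):
            return True
    return False

def keep_time_slice(electrons, positrons, is_good_vertex):
    """Keep the time slice if any e+ e+ e- combination gives a good vertex."""
    for electron in electrons:
        for i, positron in enumerate(positrons):
            # positrons[i + 1:] avoids testing the same positron pair twice.
            if thread_vertex_search(electron, positron, positrons[i + 1:], is_good_vertex):
                return True
    return False

# Placeholder criterion for illustration: accept only one specific combination.
accept = lambda p_a, p_b, e: (p_a, p_b, e) == ("p1", "p3", "e2")
print(keep_time_slice(["e1", "e2"], ["p1", "p2", "p3"], accept))
```

On the GPU, the two outer loops would be replaced by the thread index, with one thread per electron-positron pair.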

(28)

Performance

Optimizations performed:

● Memory layout and access pattern

● Register usage

● Grid dimensions

● Approximations

Currently process 2∙10⁶ time slices / s on one NVIDIA GTX 1080 at a muon stopping rate of 7∙10⁷ Hz

(30)

Backup

(31)

Muon Stopping Rate Study I

(Plot: signal frames accepted, axis from 0.86 to 1, and background frames accepted, axis from 0 to 0.06, versus muon stopping rate on target [Hz] from 4∙10⁷ to 1.2∙10⁸. Series: background, tight signal cut, truth signal, loose signal cut.)

(32)

Muon Stopping Rate Study II

(Plot: frames / s, axis from 0 to 4∙10⁶, versus muon stopping rate on target from 4∙10⁷ to 1.4∙10⁸.)

(Plot: frames with hit overflow and frames with triplet overflow, axis from 0 to 0.01, versus muon stopping rate on target from 4∙10⁷ to 1.4∙10⁸.)

(33)

The Mu3e Experiment

Search for the charged lepton flavour-violating decay μ⁺ → e⁺ e⁻ e⁺ with a sensitivity in branching ratio better than 10⁻¹⁶

Branching ratio suppressed in Standard Model to below 10⁻⁵⁴

→ Any hint of a signal indicates new physics

● Supersymmetry

● Grand unified models

● Extended Higgs sector

● ...

(35)

Mupix7: Efficiency

Mupix7, HV = -85 V

(36)

Mupix: Mechanics

● 50 μm silicon

● ∼ 50 μm flexprint: Kapton, aluminum, copper

● 25 μm Kapton foil

→ O(0.1 %) radiation length

(37)

Sensitivity Study

(Plot: events per 0.2 MeV/c², logarithmic scale from 10⁻³ to 10², versus reconstructed mass m_rec [MeV/c²] from 96 to 110. Curves: μ → eee at branching ratios of 10⁻¹², 10⁻¹³, 10⁻¹⁴ and 10⁻¹⁵, μ → eeeνν, and Bhabha + Michel. Mu3e Phase I, 10¹⁵ muon stops at 10⁸ muons/s.)

(38)

Multiple Scattering

● Muons decay at rest

→ momentum < 53 MeV/c

● Momentum resolution to first order:

σ_p/p ∼ θ_MS/Ω

● Use recurling tracks for momentum measurement

(39)

Mupix Prototype

● Readout electronics on chip

● Fast LVDS link: 1.25 Gbit/s

● Mupix7: latest prototype

● Thinned to 50 μm

● 32 x 40 pixel matrix

● Pixel size: 103 μm x 80 μm

● 3.2 x 3.2 mm²

(40)

Muon Beam

@ Paul Scherrer Institute (PSI)

● 590 MeV cyclotron

● 2.2 mA proton beam

● Most powerful proton beam worldwide

● Target E: 28 MeV/c surface muons to πE5 beamline

(41)

Data Transfer

● Transfer data from FPGA to RAM via direct memory access (DMA)

● Tested at 1.5 GB/s: BER ≤ 4•10^-16 (at 95% confidence level)

● Tested on beam test campaigns

● Will be used for readout of next MuPix prototype

LVDS connector for data cable from MuPix chip
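A bit error rate limit of this kind can be related to the amount of data transferred. The sketch below assumes that no bit errors were observed during the test, which the slide does not state explicitly; under that assumption the 95% upper limit is BER ≤ -ln(0.05) / N for N transferred bits.

```python
import math

ber_limit = 4e-16            # upper limit quoted on the slide
confidence_level = 0.95
rate_bytes_s = 1.5e9         # 1.5 GB/s, assuming decimal gigabytes

# With zero observed errors: (1 - p)^N ~ exp(-p N) = 1 - CL
# => N = -ln(1 - CL) / p bits are needed to set the limit p.
n_bits = -math.log(1.0 - confidence_level) / ber_limit

seconds = n_bits / (rate_bytes_s * 8.0)
days = seconds / 86400.0

print(f"bits needed: {n_bits:.2e}")
print(f"test duration at 1.5 GB/s: {days:.1f} days")
```

Under these assumptions the limit corresponds to roughly a week of continuous error-free transfer.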

(42)

Multiple Scattering Fit

● Electrons: 12 – 53 MeV/c

● Resolution dominated by multiple Coulomb scattering

● Ignore hit uncertainty

● Three consecutive hits: “triplet”

● Multiple scattering at middle hit of triplet

● Minimize multiple scattering

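$$\chi^2 = \frac{\Phi_{MS}^2}{\sigma_{\Phi}^2} + \frac{\theta_{MS}^2}{\sigma_{\theta}^2}$$

(Figure: a triplet with track segments S₀₁ and S₁₂ and the scattering angles Φ_MS and Θ_MS at the middle hit, shown in two projections.)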

(43)

Geometrical Selection

In subsequent layers, cut on:

● Z-difference of hits

● Φ-difference of hits

After all cuts: reduce 3-hit combinations by factor 50

(Figure: Ф₁ - Ф₀ between hits in consecutive layers.)

(44)

Radius Distribution

(Plot: number of events / 4 mm, axis from 0 to 3500, versus radius from −400 to 400, for positrons and electrons.)

(45)

Z distance

(46)

Uncertainty at Intersection

σ_{MS, PCA} = σ_{MS, first layer} ⋅ s ≈ 0.8 mm, with s the path length in the xy-plane from the first layer to the PCA

σ_pixel = 0.08 mm / √12 ≈ 0.02 mm

Take both into account when calculating weights

(Plot: number of events / mrad versus multiple scattering sigma at first layer [rad], from 0 to 0.1.)

(Plot: number of events / 0.5 mm versus path length in xy-plane from first layer to PCA [mm], from 0 to 50.)
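The two uncertainty contributions can be evaluated numerically. In this sketch the input values (0.04 rad, 20 mm) are illustrative picks from the plotted ranges, and combining the two terms in quadrature with an inverse-variance weight is an assumption; the slide only says that both are taken into account.

```python
import math

PIXEL_SIZE_MM = 0.08

# Standard deviation of a uniform distribution over one pixel.
sigma_pixel_mm = PIXEL_SIZE_MM / math.sqrt(12.0)

def sigma_ms_at_pca(sigma_ms_rad, path_length_mm):
    """Scattering angle uncertainty projected over the path length to the PCA."""
    return sigma_ms_rad * path_length_mm

def intersection_weight(sigma_ms_rad, path_length_mm):
    """Inverse variance, with both contributions added in quadrature (assumed)."""
    sigma_ms_mm = sigma_ms_at_pca(sigma_ms_rad, path_length_mm)
    return 1.0 / (sigma_ms_mm ** 2 + sigma_pixel_mm ** 2)

print(f"sigma_pixel = {sigma_pixel_mm:.4f} mm")
print(f"sigma_MS,PCA = {sigma_ms_at_pca(0.04, 20.0):.2f} mm")
print(f"weight = {intersection_weight(0.04, 20.0):.3f} / mm^2")
```

For these inputs the multiple scattering term dominates the pixel term by more than an order of magnitude.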

(47)

Offline Reconstruction Reference

● Full detector simulation is available

● For this study:

– Simulated signal events with one signal decay / 50 ns frame

– Simulated background events with ordinary muon decays

● Full offline reconstruction includes:

– Track reconstruction with hits from all layers and recurl stations

– Matching and linking of recurling track pieces

– Linearised vertex fit for low momentum tracks in magnetic field

(48)

March 7th, 2017 D. vom Bruch, Mu3e 48

Selection on GPU

● Obtain 50 ns data slices, so-called frames, on the DAQ computer

● Need to process 20∙10⁶ frames / s

● Will have about 10 DAQ computers

● → Process 2∙10⁶ frames / s on each computer

FPGA:

● Geometric selection cuts

● Save hit positions of the three hits belonging to one triplet and hits in fourth layer

→ DMA

GPU:

● Fits with three and four hits

● Vertex selection

● Save frame decision

(49)

Vertex Position Distribution

(Histograms of true - estimated vertex position, 7603 entries each, with Gaussian fits. The axis label of the first panel and the fit parameters of the y panel are not recoverable.)

Panel          Mean [mm]   RMS [mm]   χ² / ndf     Fit constant   Fit mean [mm]          Fit sigma [mm]

(first panel)  −0.02629    0.8089     40.6 / 6     1065 ± 18.8    −0.01068 ± 0.00355     0.2314 ± 0.0037

x              0.01541     1.332      84.29 / 12   600.6 ± 11.1   −0.002901 ± 0.006102   0.3914 ± 0.0068

y              0.04704     1.331      84.32 / 14

(50)

Combined Momentum and Energy

(Plot: number of events / MeV/c versus combined momentum magnitude [MeV/c] from 0 to 100, for signal and random combinations.)

(Plot: number of events / MeV versus combined energy [MeV] from 0 to 200, for signal and random combinations.)

(52)

Distance to Target

(Plot: distribution of distance to target [mm] from 0 to 20, for signal and random combinations.)

(54)

Pixel Detector

● High Voltage Monolithic Active Pixel Sensors (HV-MAPS)

● Fast charge collection via drift

● Thinned down to 50 μm

● Pixel size: 80 μm x 80 μm

● Chip size: 2 cm x 2 cm

● Thickness chip & readout: O(0.1 %) radiation length

(55)

Readout Scheme

Front-end board:

● Sort hits according to time stamps

● Send off via optical links

Switching board:

● Merge data from different detector regions

● Pack into 50 ns time slices

● Send off via optical links

PCIe board:

● First data selection

● Transfer data to RAM of PC via PCIe
