(1)

Feb 29, 2016 Online Tracking for Mu3e 1

Online Track Reconstruction on GPUs for the Mu3e Experiment

Dorothea vom Bruch

for the Mu3e Collaboration

DPG Frühjahrstagung 2016, T42: Trigger und DAQ II

(2)

The Mu3e Experiment

Search for the charged lepton flavour-violating decay μ⁺ → e⁺e⁻e⁺ with a sensitivity to the branching ratio better than 10⁻¹⁶

Branching ratio suppressed in the Standard Model to below 10⁻⁵⁴

Any hint of a signal → new physics:

Supersymmetry

Grand unified models

Extended Higgs sector

...

Current limit on the branching ratio: 10⁻¹² (SINDRUM, 1988)

(3)


Signal versus Background

Signal

Coincident in time

Single vertex

∑ E = m_μ, ∑ p_i = 0

Random Combinations

Not coincident in time

No single vertex

∑ E ≠ m_μ, ∑ p_i ≠ 0

Internal Conversion (μ⁺ → e⁺e⁻e⁺νν̄, the neutrinos escape undetected)

Coincident in time

Single vertex

∑ E ≠ m_μ, ∑ p_i ≠ 0
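The kinematic conditions on this slide can be written as a simple test on three reconstructed tracks. The Python sketch below is illustrative only: the `(charge, px, py, pz)` track layout, the function names and the 1 MeV tolerances are assumptions, not part of the Mu3e software, and a real selection would also use the vertex and timing information.

```python
import math

# Masses in MeV/c^2 (PDG values, rounded)
M_MU = 105.658
M_E = 0.511


def total_kinematics(tracks):
    """Return (sum of energies, |vector sum of momenta|) for tracks given as
    (charge, px, py, pz) in MeV/c, assigning each track the electron mass."""
    e_sum = 0.0
    px_sum = py_sum = pz_sum = 0.0
    for _charge, px, py, pz in tracks:
        e_sum += math.sqrt(px * px + py * py + pz * pz + M_E * M_E)
        px_sum += px
        py_sum += py
        pz_sum += pz
    return e_sum, math.sqrt(px_sum ** 2 + py_sum ** 2 + pz_sum ** 2)


def is_signal_like(tracks, e_tol=1.0, p_tol=1.0):
    """Hypothetical check for mu+ -> e+ e- e+ from a muon at rest:
    charges (+, +, -), sum E = m_mu and sum p = 0 within tolerances in MeV."""
    if sorted(t[0] for t in tracks) != [-1, 1, 1]:
        return False
    e_sum, p_mag = total_kinematics(tracks)
    return abs(e_sum - M_MU) < e_tol and p_mag < p_tol
```

Three equal-momentum tracks at 120° to each other with ∑ E = m_μ pass. Scaling all momenta down keeps ∑ p_i = 0 but loses energy, as with the undetected neutrinos of internal conversion, and fails the energy condition.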

(4)

The Mu3e Detector

Requirements

Excellent momentum resolution: < 0.5 MeV/c

Good timing resolution: 100 ps for tiles, 1 ns for fibres, < 20 ns for pixels

Good vertex resolution: 300 μm

High rates: 10⁸–10⁹ μ/s (Paul Scherrer Institute, Switzerland)

(5)

[Figure: Mu3e detector layout, 10 cm scale bar]

(6)

Readout Scheme

Triggerless readout → 50 Gbit/s data rate

Online data reduction:

Track reconstruction and vertex fitting on Graphics Processing Units (GPUs)

Reduction factor of ~1000

[Figure: readout chain]

~1100 pixel sensors → up to 45 links at 1.25 Gbit/s → 38 FPGAs

38 FPGAs → one 6.4 Gbit/s link each → switching boards

Switching boards → twelve 6.4 Gbit/s links per board → 12 PCs with GPUs

PCs → Gbit Ethernet → data collection server → mass storage
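The readout numbers above allow a rough bandwidth budget. The sketch below assumes that the 50 Gbit/s are shared evenly by the 12 PCs and that the ~1000 reduction applies uniformly; both are simplifications for illustration, and the constant names are our own.

```python
# Numbers from the readout slide; the even split over PCs is an assumption.
INPUT_GBIT_S = 50.0       # triggerless readout data rate
REDUCTION = 1000.0        # online reduction factor (~1000)
N_PCS = 12                # PCs with GPUs
ETHERNET_GBIT_S = 1.0     # Gbit Ethernet towards the data collection server


def readout_budget(input_gbit_s=INPUT_GBIT_S, reduction=REDUCTION, n_pcs=N_PCS):
    """Return (average input per PC, output per PC, total output), in Gbit/s."""
    per_pc_in = input_gbit_s / n_pcs
    per_pc_out = per_pc_in / reduction
    return per_pc_in, per_pc_out, input_gbit_s / reduction


per_pc_in, per_pc_out, total_out = readout_budget()
print(f"{per_pc_in:.2f} Gbit/s in per PC, {total_out * 1e3:.0f} Mbit/s out in total")
```

With these inputs each PC receives about 4.2 Gbit/s on average, and after the reduction the whole farm sends about 50 Mbit/s towards mass storage, far below the Gbit Ethernet bandwidth.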

(7)


Fast Data Transfer

[Figure: FPGA and GPU attached via PCIe to the CPU and its RAM]

Direct Memory Access (DMA) from the FPGA to main memory

Copy from main memory to GPU memory

At 1.5 GB/s: measured bit error rate < 4 × 10⁻¹⁶
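The slide does not state how the bit error rate limit was obtained. Assuming a run in which no errors were observed and a one-sided 95 % confidence Poisson bound (our assumption), the limit translates into a required number of error-free bits and a transfer time:

```python
import math


def error_free_bits_needed(ber_limit, confidence=0.95):
    """Bits to transfer with zero observed errors to claim BER < ber_limit.

    Poisson statistics: P(0 errors) = exp(-N * BER); requiring this to be
    at most 1 - confidence gives N = -ln(1 - confidence) / BER."""
    return -math.log(1.0 - confidence) / ber_limit


def transfer_time_s(n_bits, rate_bytes_per_s):
    """Time to move n_bits at a link rate given in bytes per second."""
    return n_bits / (8.0 * rate_bytes_per_s)


bits = error_free_bits_needed(4e-16)
seconds = transfer_time_s(bits, 1.5e9)      # at 1.5 GB/s
print(f"{bits:.2e} bits, {seconds / 86400:.1f} days")
```

This prints 7.49e+15 bits and 7.2 days, i.e. about a week of continuous error-free transfer at 1.5 GB/s under the stated assumption.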

(8)

Online Reconstruction

Number of possible track candidates ~ n³

At 10⁸ μ/s: ~10 hits / layer / 50 ns → O(10³) combinations / 50 ns

[Figure: online selection chain]

FPGA: geometrical selection

→ DMA transfer into main memory (used as buffer) → DMA transfer to GPU RAM

GPU: multiple scattering fit → matching with layer 4 → track combinations (e⁺e⁺e⁻) → vertex fit → selection decision

Selected frames → main memory
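The n³ scaling can be made concrete with a toy frame. The sketch below (hypothetical names; hits are represented only by indices) enumerates all three-layer combinations for 10 hits per layer and scales to one second of 50 ns frames, which reproduces the 2 × 10¹⁰ triplets/s quoted in the performance table for 10⁸ μ/s.

```python
from itertools import product

FRAME_NS = 50          # length of one time slice (frame)
HITS_PER_LAYER = 10    # ~10 hits / layer / 50 ns at 1e8 muons/s


def triplet_candidates(layer0, layer1, layer2):
    """Every combination of one hit from each of the first three layers."""
    return list(product(layer0, layer1, layer2))


def triplets_per_second(hits_per_layer=HITS_PER_LAYER, frame_ns=FRAME_NS):
    """n^3 candidates per frame times the number of frames per second."""
    frames_per_s = 1e9 / frame_ns
    return hits_per_layer ** 3 * frames_per_s


toy_frame = [range(HITS_PER_LAYER)] * 3       # hit indices only
print(len(triplet_candidates(*toy_frame)))    # 1000 candidates per frame
print(f"{triplets_per_second():.0e}")         # 2e+10 triplets per second
```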

(9)


Geometrical Selection

[Figure: hits in layers 0, 1 and 2 in the r–z and x–y projections]

(10)

Geometrical Selection

[Figure: hits in layers 0, 1 and 2 in the r–z and x–y projections]

Cuts on z₁ − z₀ and Φ₁ − Φ₀

(11)


Geometrical Selection

[Figure: hits in layers 0, 1 and 2 in the r–z and x–y projections]

Cuts on z₂ − z₁ and Φ₂ − Φ₁

(12)

Geometrical Selection
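The geometrical selection amounts to window cuts on the differences in z and Φ between hits in consecutive layers. A minimal Python sketch, assuming hits are given as (Φ, z) pairs; the cut values and units are placeholders, not the Mu3e values. Φ differences are wrapped so that hits on either side of Φ = ±π are compared correctly.

```python
import math


def wrap_phi(dphi):
    """Map an azimuthal difference into [-pi, pi)."""
    return (dphi + math.pi) % (2.0 * math.pi) - math.pi


def passes_geometrical_selection(hits, max_dz=(20.0, 20.0), max_dphi=(0.2, 0.2)):
    """hits: [(phi, z), ...] for layers 0, 1 and 2 (phi in rad, z in mm).

    Applies window cuts on z1 - z0, phi1 - phi0 and on z2 - z1, phi2 - phi1.
    The cut values are placeholders, not tuned Mu3e values."""
    for i in range(2):
        (phi_a, z_a), (phi_b, z_b) = hits[i], hits[i + 1]
        if abs(z_b - z_a) > max_dz[i]:
            return False
        if abs(wrap_phi(phi_b - phi_a)) > max_dphi[i]:
            return False
    return True
```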

(13)


Multiple Scattering Fit

Electrons: 12–53 MeV/c

Resolution dominated by multiple Coulomb scattering → hit uncertainty ignored

Describe track as a sequence of hit triplets

Multiple scattering at the middle hit of each triplet

Minimize multiple scattering, triplet χ²:

$$\chi^2 = \frac{\Phi_{MS}^2}{\sigma_{MS,\Phi}^2} + \frac{\theta_{MS}^2}{\sigma_{MS,\theta}^2}$$
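The triplet χ² above can be evaluated once the two kink angles and their expected widths are known. In the sketch below the widths are estimated with the Highland formula from the PDG review, which is our choice for illustration (the slide does not say how σ_MS is obtained), and the 30 MeV/c momentum and 0.1 % X₀ per layer are hypothetical inputs. The actual fit determines the kink angles from the hit positions and minimizes χ²; that step is not shown.

```python
import math


def highland_theta0(p_mev, x_over_x0, mass_mev=0.511, charge=1):
    """RMS plane scattering angle in rad (Highland formula, as in the PDG review)."""
    beta = p_mev / math.sqrt(p_mev ** 2 + mass_mev ** 2)
    return (13.6 / (beta * p_mev) * abs(charge) * math.sqrt(x_over_x0)
            * (1.0 + 0.038 * math.log(x_over_x0 * charge ** 2 / beta ** 2)))


def triplet_chi2(phi_ms, theta_ms, sigma_phi, sigma_theta):
    """chi^2 = Phi_MS^2 / sigma_MS,Phi^2 + theta_MS^2 / sigma_MS,theta^2."""
    return (phi_ms / sigma_phi) ** 2 + (theta_ms / sigma_theta) ** 2


# Hypothetical inputs: 30 MeV/c electron, one layer of 0.1 % X0,
# and the same width assumed for both kink angles.
sigma = highland_theta0(30.0, 1e-3)
print(f"sigma_MS = {sigma * 1e3:.1f} mrad")
print(triplet_chi2(0.5 * sigma, 1.0 * sigma, sigma, sigma))
```

For these inputs σ_MS comes out at about 10.6 mrad.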

(14)

Propagation to 4th Layer

Position of the 4th layer is known:

α: propagate in the xy-plane

β: propagate in the z direction

[Figure: track of radius R extrapolated to the 4th layer, with angles α and β]

After all selections:

98 % of true 4-hit tracks selected

65 % random combinations of 3 hits
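Because the radius of the 4th layer is known, the track can be propagated analytically: intersect the fitted circle with the layer circle in the xy-plane, then advance in z by the transverse arc length. The sketch below assumes a helix with known circle centre, radius and dz/ds from the triplet fit, takes the intersection nearest to the last hit, and uses names of our own; its turning angle need not coincide with the α and β of the figure.

```python
import math


def propagate_to_layer(center, radius, last_hit, r_layer, z_last, dz_ds):
    """Propagate a helix from its last hit to a cylinder of radius r_layer.

    center, radius: track circle in the xy-plane (e.g. from the triplet fit)
    last_hit: (x, y) of the outermost hit used in the fit
    dz_ds: change of z per unit transverse arc length (assumed known)
    Returns (x, y, z) at the intersection nearest to last_hit, or None.
    """
    xc, yc = center
    d = math.hypot(xc, yc)
    # Track circle and layer circle intersect only if |R - r| <= d <= R + r.
    if d == 0.0 or d > radius + r_layer or d < abs(radius - r_layer):
        return None
    a = (r_layer ** 2 - radius ** 2 + d ** 2) / (2.0 * d)
    h = math.sqrt(max(r_layer ** 2 - a ** 2, 0.0))
    ux, uy = xc / d, yc / d
    candidates = [(a * ux - h * uy, a * uy + h * ux),
                  (a * ux + h * uy, a * uy - h * ux)]
    x, y = min(candidates, key=lambda q: math.dist(q, last_hit))

    # xy-plane: turning angle around the circle centre between the two points
    v1 = (last_hit[0] - xc, last_hit[1] - yc)
    v2 = (x - xc, y - yc)
    turn = abs(math.atan2(v1[0] * v2[1] - v1[1] * v2[0],
                          v1[0] * v2[0] + v1[1] * v2[1]))
    # z direction: advance along z by arc length times dz/ds
    z = z_last + radius * turn * dz_ds
    return x, y, z
```

For example, a circle of radius 50 centred at (0, 50), a last hit at (√819, 9) and a layer at r = 70 give the intersection (√2499, 49). The nearest-intersection rule is a simplification that holds while the track has not yet curled back.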

(15)


Parallelization

[Figure: grid of GPU threads, one per hit combination]

~2000 compute cores on GPU

Per thread:

Fit for one combination of three hits

Cut on χ²

Propagation to 4th layer

Loop over hits in 4th layer: check whether a hit exists in the proximity of the propagated track
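Each GPU thread handles one three-hit combination. The Python sketch below emulates two pieces of that per-thread work under stated assumptions: mapping a flat thread index onto hit indices in layers 0 to 2, and the loop over 4th-layer hits with a hypothetical (Φ, z) proximity window. The fit and χ² cut in between are omitted, and the function names are illustrative.

```python
import math


def triplet_from_index(idx, n0, n1, n2):
    """Map a flat thread index to hit indices (i0, i1, i2) in layers 0, 1, 2,
    one thread per three-hit combination; None for threads past the end."""
    if idx >= n0 * n1 * n2:
        return None
    i0, rest = divmod(idx, n1 * n2)
    i1, i2 = divmod(rest, n2)
    return i0, i1, i2


def has_hit_nearby(predicted, layer4_hits, max_dphi, max_dz):
    """Loop over 4th-layer hits and report whether any lies inside a
    (phi, z) window around the propagated track position."""
    phi_p, z_p = predicted
    for phi, z in layer4_hits:
        dphi = (phi - phi_p + math.pi) % (2.0 * math.pi) - math.pi
        if abs(dphi) <= max_dphi and abs(z - z_p) <= max_dz:
            return True
    return False
```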

(16)

Performance

Fits / s          GTX680       GTX980
10⁸ muons / s     2 × 10⁷      3 × 10⁷
10⁹ muons / s     9.7 × 10⁹    1.6 × 10¹⁰

Pictures: pcmag.com, nvidia.com

(17)


Performance


At 10⁸ muons / s                 Reduction factor    Triplets / s
Total                                                2 × 10¹⁰
After geometrical selection      50                  4 × 10⁸
After multiple scattering fit    2                   2 × 10⁸
After propagation to 4th layer   2.5                 8 × 10⁷

At 10⁸ μ/s: O(10) DAQ computers are sufficient
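The reduction chain in the table is a running division by the listed factors. The sketch below reproduces the triplet rates and, under our assumption that every triplet surviving the geometrical selection must be fitted at the GTX980 rate quoted for 10⁸ μ/s, gives a GPU count consistent with the O(10) estimate.

```python
import math


def reduction_chain(total_per_s, factors):
    """Triplet rate after each selection step, dividing by each reduction factor."""
    rates = [total_per_s]
    for factor in factors:
        rates.append(rates[-1] / factor)
    return rates


def gpus_needed(fits_required_per_s, fits_per_s_per_gpu):
    """Whole GPUs needed if together they must sustain the required fit rate."""
    return math.ceil(fits_required_per_s / fits_per_s_per_gpu)


# 1e8 muons/s: geometrical selection, multiple scattering fit, 4th-layer match
rates = reduction_chain(2e10, [50, 2, 2.5])
print([f"{r:.0e}" for r in rates])
# Assumption: every triplet passing the geometrical selection is fitted,
# at the GTX980 rate quoted for 1e8 muons/s.
print(gpus_needed(rates[1], 3e7))
```

This prints the rates 2e+10, 4e+08, 2e+08, 8e+07 and 14 GPUs. Under the same assumption the backup figures for 10⁹ μ/s (4 × 10¹¹ triplets/s after geometrical selection, 1.6 × 10¹⁰ fits/s on a GTX980) give 25 GPUs.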

(18)

Next Steps

Study and optimize vertex fit performance

Simplify for GPU implementation

Implement geometrical selection on FPGA

Test the whole chain of online selection

More Mu3e talks:

Mu3e Experiment: T22.4&5, T42.7, T75.7, T98.1&5

MuPix Telescope: T42.6, T99.5

HV-MAPS / MuPix: T72.1-3

(19)


Backup Slides

(20)

Multiple Scattering Fit

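[Figure: hit triplet in the x–y plane (arc lengths S01, S12; kink angle Φ_MS) and in the s–z plane (arc lengths S01, S12; kink angle θ_MS)]

$$\chi^2 = \frac{\Phi_{MS}^2}{\sigma_{MS,\Phi}^2} + \frac{\theta_{MS}^2}{\sigma_{MS,\theta}^2}$$

R_3D from fit

Sign of R_3D → track curvature

Cut on fit success and χ² → reduces by a factor of 2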

(21)


Required Momentum Resolution

Graph: R. M. Djilkibaev, R. V. Konoplich, Phys. Rev. D 79 (2009) 073004

(22)

Performance @ 10⁹ muons/s

At 10⁹ muons / s                 Reduction factor    Triplets / s
Total                                                2 × 10¹³
After geometrical selection      50                  4 × 10¹¹
After multiple scattering fit    2                   2 × 10¹¹
After propagation to 4th layer   2.6                 8 × 10¹⁰

(23)


GPU Properties

Highly parallel structure

Process large blocks of data

Nvidia: API extension to C: CUDA (Compute Unified Device Architecture)

[Figure: host (CPU) with memory and device (GPU card) with DRAM, cache and streaming multiprocessors (SMs); host code allocates memory on host and device, launches the kernel and copies the results back]

(24)

GPU Architecture

Device

[Figure: SM 0, SM 1, SM 2, SM 3, ... each running blocks (Block 0 to Block 7, ...); a block is split into warps of 32 threads: Thread 0 ... Thread 31 = Warp 0, Thread 32 ... Thread 63 = Warp 1, Thread 64 ... Thread 95 = Warp 2, ...]

Limits on the number of blocks per SM

Specs for GTX680:

8 SMs

Max. 1024 threads per block

Max. 2048 threads per SM

Each thread runs one instance of the kernel; all threads execute the same kernel
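Thread indices split into warps of 32, and the number of blocks an SM can hold is limited both by its 2048 threads and by a maximum block count. The sketch below uses the GTX680 figures listed above plus the 16-blocks-per-SM limit of compute capability 3.0; it ignores register and shared-memory limits, so real occupancy can be lower, and the names are our own.

```python
WARP_SIZE = 32

# GTX680 figures from the slide; 16 resident blocks per SM is the limit the
# CUDA Programming Guide lists for compute capability 3.0 (Kepler).
N_SM = 8
MAX_THREADS_PER_BLOCK = 1024
MAX_THREADS_PER_SM = 2048
MAX_BLOCKS_PER_SM = 16


def warp_and_lane(thread_idx):
    """Warp number and position inside the warp for a thread index in a block."""
    return divmod(thread_idx, WARP_SIZE)


def resident_blocks_per_sm(block_dim):
    """Blocks that fit on one SM when only thread and block counts limit it
    (register and shared-memory use are ignored in this sketch)."""
    if not 0 < block_dim <= MAX_THREADS_PER_BLOCK:
        raise ValueError("block size must be between 1 and 1024 threads")
    return min(MAX_THREADS_PER_SM // block_dim, MAX_BLOCKS_PER_SM)


for block_dim in (64, 128, 1024):
    blocks = resident_blocks_per_sm(block_dim)
    print(block_dim, blocks, N_SM * blocks * block_dim)
```

With 64-thread blocks the block limit, not the thread limit, caps an SM at 1024 resident threads, one reason to use blocks of 128 or more.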

(25)


Fitting Kernel

[Figure: grid of blocks Block (0,0), Block (0,1), ..., Block (0,N-1); each block contains Thread (0,0), Thread (0,1), ..., Thread (0,127)]

Grid dimension N = # selected triplets / 128, rounded up

Block dimension x = 128 (or another multiple of 32)

Launch grid with all possible hit combinations → apply selection cuts → store indices of selected triplets (moved to the FPGA in the final implementation)
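A grid of N = # triplets / 128 blocks only covers every triplet if the division is rounded up, and the last block then needs a bounds check. The serial Python emulation below shows this; it uses 1D indices instead of the (0, x) labels on the slide, and `emulate_launch` is a hypothetical stand-in for a CUDA launch, not an API.

```python
BLOCK_DIM = 128  # threads per block; a multiple of the 32-thread warp size


def grid_dim(n_triplets, block_dim=BLOCK_DIM):
    """Number of blocks: n_triplets / block_dim, rounded up."""
    return (n_triplets + block_dim - 1) // block_dim


def emulate_launch(n_triplets, kernel, block_dim=BLOCK_DIM):
    """Serial stand-in for a 1D CUDA launch: each (block, thread) pair computes
    its global index; threads beyond the last triplet do nothing."""
    for block_idx in range(grid_dim(n_triplets, block_dim)):
        for thread_idx in range(block_dim):
            i = block_idx * block_dim + thread_idx  # blockIdx.x * blockDim.x + threadIdx.x
            if i < n_triplets:  # bounds guard for the partially filled last block
                kernel(i)


fitted = []
emulate_launch(300, fitted.append)   # the "kernel" just records its triplet index
print(grid_dim(300), len(fitted))    # 3 blocks cover all 300 triplets
```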

(26)

DMA: Implementation

Stratix V / IV development board: DMA engine, PCIe interface

Kernel module for communication with FPGA

– Mapping of memory addresses

– Read, write functions

– Interrupt handling

CUDA API: memory allocation of page-locked memory, usable for DMA from FPGA to RAM and from RAM to GPU memory

Use DMA with scatter / gather mapping

– Large (GB) memory buffers possible

(27)


DMA: Implementation

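CUDA API: memory allocation

[Figure: a buffer allocated via the CUDA API is contiguous in virtual memory but consists of physical memory segments (Length 1, Length 2, Length 3); the FPGA holds the data memory and a 256 kB address memory]

Write addresses and lengths of the segments to the FPGA

Data are then copied from the buffer to the GPU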
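The scatter/gather mapping lets a buffer that is contiguous in virtual memory but scattered in physical memory be described by a list of (address, length) segments, which is what gets written to the FPGA. A minimal sketch of building such a list, assuming 4 kB pages and a hypothetical list of physical page addresses; how the kernel module obtains these addresses and how the entries are encoded in the 256 kB address memory is not given here.

```python
PAGE_SIZE = 4096  # bytes; assumed page granularity of the pinned buffer


def scatter_gather_list(page_addresses, page_size=PAGE_SIZE):
    """Coalesce the physical addresses of a buffer's pages, in virtual order,
    into (address, length) segments; physically adjacent pages are merged."""
    segments = []
    for addr in page_addresses:
        if segments and segments[-1][0] + segments[-1][1] == addr:
            start, length = segments[-1]
            segments[-1] = (start, length + page_size)
        else:
            segments.append((addr, page_size))
    return segments


# Hypothetical physical page addresses of a 6-page buffer
pages = [0x1000, 0x2000, 0x3000, 0x8000, 0x9000, 0x20000]
for start, length in scatter_gather_list(pages):
    print(hex(start), length)
```

The example buffer yields three segments of 12288, 8192 and 4096 bytes, matching the Length 1 to Length 3 picture.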

(28)

Segmentation, Interrupt Messages

64 DMA blocks (DMA block, DMA block, ...)

16 PCIe blocks, 4 kB each

Interrupts
