(1)

Track Reconstruction on GPUs for the Mu3e Experiment

Dorothea vom Bruch for the Mu3e Collaboration

DPG Frühjahrstagung, T41: Detektoren und DAQ 1, March 10, 2015

Physikalisches Institut Heidelberg

(3)

The Mu3e Experiment

Mu3e searches for the charged lepton-flavour violating decay µ⁺ → e⁺e⁺e⁻ with a sensitivity better than 10⁻¹⁶.

µ decays at rest → E_e < 53 MeV

Signal (e⁺e⁺e⁻): coincident in time, single vertex, momentum sum Σ p_i = 0, E = m_µ
Random combinations: not coincident in time, no single vertex, Σ p_i ≠ 0, E ≠ m_µ


(4)

The Mu3e Detector

Requirements:

Excellent momentum resolution: < 0.5 MeV/c
Good timing resolution: 100 ps for tiles, 1 ns for fibres, < 20 ns for pixels
Good vertex resolution: 300 µm
High rates: O(10⁸–10⁹ µ/s)

Detector layout (figure): µ beam, target, inner pixel layers, scintillating fibres, outer pixel layers, recurl pixel layers, scintillator tiles.

(5)

Readout Scheme

Readout chain (figure): ~1100 pixel sensors → 38 FPGAs → 2 RO boards → 12 GPU PCs → data collection server → mass storage.
Link rates (figure): up to 45 × 1.25 Gbit/s links, 12 × 6.4 Gbit/s links per RO board, one 6.4 Gbit/s link each into the GPU PCs, Gbit Ethernet to the data collection server.

Triggerless readout → ~1 Tbit/s data rate; online data reduction required.
Track reconstruction and vertex fitting on GPUs ⇒ reduction factor of ~1000


(6)

Multiple Scattering Fit

Low momentum electrons: 15–53 MeV

Resolution is dominated by multiple Coulomb scattering → ignore the hit uncertainty.
Describe the track as a sequence of hit triplets, with multiple scattering at the middle hit of each triplet (figure: triplet 1, triplet 2).
Minimize the multiple scattering:

χ² = φ²_MS / σ²_MS + θ²_MS / σ²_MS
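
To make the fit quantity concrete, the following CUDA sketch computes this triplet χ² from the two multiple-scattering angles and the expected scattering width; the function name and arguments are illustrative assumptions, not taken from the Mu3e code.

    // Hypothetical sketch: chi^2 of one hit triplet from the multiple-scattering
    // angles phi_MS and theta_MS, both assumed to have the same width sigma_MS.
    __device__ float triplet_chi2(float phi_ms, float theta_ms, float sigma_ms)
    {
        return (phi_ms * phi_ms + theta_ms * theta_ms) / (sigma_ms * sigma_ms);
    }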

(7)

Fit on the GPU

Number of possible track candidates ~ n³, with n = number of hits per layer.

On the GPU: loop over all possible combinations:
geometrical selection cuts
triplet fit
vertex fit

⇒ Reduction factor of ~1000

Image source: http://www.pcmag.com/article2/0,2817,2401953,00.asp
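
A hedged CUDA sketch of such a brute-force combination loop: the block index picks a hit in the first layer, the thread index a hit in the second layer, and each thread loops over the hits of the third layer, applying the geometrical cuts before any fit. The Hit struct, the cut logic and the function names are assumptions for illustration, not the actual Mu3e kernels.

    // Hypothetical brute-force triplet search over all hit combinations.
    // Launched with one block per layer-1 hit and one thread per layer-2 hit.
    struct Hit { float x, y, z; };

    __device__ bool pass_geometric_cuts(const Hit& a, const Hit& b, const Hit& c)
    {
        // Placeholder for the geometrical selection cuts, e.g. require z-ordering.
        return a.z < b.z && b.z < c.z;
    }

    __global__ void geometric_kernel(const Hit* layer1, const Hit* layer2,
                                     const Hit* layer3, int n3, int* n_candidates)
    {
        const Hit h1 = layer1[blockIdx.x];   // one block per hit in layer 1
        const Hit h2 = layer2[threadIdx.x];  // one thread per hit in layer 2

        for (int k = 0; k < n3; ++k) {       // loop over the hits in layer 3
            if (pass_geometric_cuts(h1, h2, layer3[k])) {
                // A real kernel would run the triplet fit here and store the candidate;
                // this sketch only counts the accepted combinations.
                atomicAdd(n_candidates, 1);
            }
        }
    }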


(8)

GPU Properties

Figure: the device (= GPU card) contains streaming multiprocessors (SMs), cache and DRAM; the host (= CPU) has its own memory. The host code allocates memory on host and device, launches the kernel, and copies the results back.

Highly parallel structure, processes large blocks of data.
Nvidia provides an API extension to C: CUDA (Compute Unified Device Architecture).
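
A minimal host-side sketch of the allocate / launch kernel / copy back pattern from the figure, using standard CUDA runtime calls; the kernel and buffer names are illustrative only.

    // Minimal host-side CUDA workflow: allocate, copy to device, launch kernel, copy back.
    #include <cuda_runtime.h>
    #include <vector>

    __global__ void process(const float* in, float* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = 2.0f * in[i];                 // stand-in for the real kernel
    }

    int main()
    {
        const int n = 1 << 20;
        std::vector<float> host_in(n, 1.0f), host_out(n);

        float *dev_in, *dev_out;
        cudaMalloc(&dev_in,  n * sizeof(float));          // allocate device memory
        cudaMalloc(&dev_out, n * sizeof(float));
        cudaMemcpy(dev_in, host_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

        int threads = 256;
        int blocks  = (n + threads - 1) / threads;
        process<<<blocks, threads>>>(dev_in, dev_out, n); // launch kernel

        cudaMemcpy(host_out.data(), dev_out, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(dev_in);
        cudaFree(dev_out);
        return 0;
    }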

(9)

GPU Architecture (GTX 680 as example)

Figure: the device has 8 SMs (SM 0, SM 1, ...); thread blocks (block 0, block 1, ...) are distributed over the SMs. Within a block, threads are grouped into warps: threads 0–31 form warp 0, threads 32–63 warp 1, threads 64–95 warp 2, and so on.

Max. 2048 threads per SM → limits the number of blocks per SM
Max. 1024 threads per block
All threads execute the same kernel (one kernel instance per thread)
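
Inside a kernel, each thread derives its own work item from its block and thread indices; the following sketch shows this standard indexing pattern (generic CUDA, not specific to the Mu3e code).

    // Standard CUDA indexing: a unique global index per thread, plus the warp number
    // within the block (warpSize is the built-in warp size, 32 on the GTX 680).
    __global__ void indexing_example(float* data, int n)
    {
        int global_id = blockIdx.x * blockDim.x + threadIdx.x;
        int warp_id   = threadIdx.x / warpSize;
        if (global_id < n) {
            data[global_id] += (float)warp_id;   // dummy work to keep the sketch self-contained
        }
    }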


(10)

Main Steps

Sort the hit arrays with respect to the z-coordinate — on the CPU (a small sketch follows below).
Geometric filtering on the ordered hits, selecting N track candidates — on the GPU (geometric kernel).
Triplet fitting and selection via χ² — on the GPU (fitting kernel).
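
A minimal sketch of the CPU-side sorting step under an assumed hit structure; std::sort with a comparator on z is one straightforward way to do it, not necessarily how the Mu3e code implements it.

    // Hypothetical sketch: sort the hits of one layer by their z-coordinate on the CPU.
    #include <algorithm>
    #include <vector>

    struct Hit { float x, y, z; };

    void sort_hits_by_z(std::vector<Hit>& hits)
    {
        std::sort(hits.begin(), hits.end(),
                  [](const Hit& a, const Hit& b) { return a.z < b.z; });
    }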

(11)

Grid for Geometric Kernel

Figure: the geometric kernel is launched with grid dimension = n[1] blocks and block dimension = n[2] threads per block; each thread loops over n[3] hits.
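
Expressed as a launch configuration, this grid layout could look as follows; the call reuses the hypothetical geometric_kernel and Hit struct sketched earlier, and the variable names are placeholders.

    // Hypothetical launch of the geometric_kernel sketched earlier:
    // n1 blocks of n2 threads, each thread looping over n3 hits.
    void launch_geometric_kernel(const Hit* layer1, const Hit* layer2, const Hit* layer3,
                                 int n1, int n2, int n3, int* n_candidates)
    {
        dim3 grid(n1);    // grid dimension  = number of hits in layer 1
        dim3 block(n2);   // block dimension = number of hits in layer 2
        geometric_kernel<<<grid, block>>>(layer1, layer2, layer3, n3, n_candidates);
    }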


(12)

Compute Utilization for Fitting Kernel

(13)

Performance

Process 1.4 × 10¹⁰ triplets/s.
Most time is spent in the geometric kernel → outsource it to the FPGA.
Ratio of CPU-to-GPU copy time to compute time: 40%; will improve once the selection cuts are applied on the FPGA.
For 10⁸ µ/s: 10¹² hit combinations → 48 GPU computers.
For 10⁹ µ/s: more improvements needed.


(14)

Outlook

Study data transmission from the FPGA to the GPU:
FPGA → (PCIe, DMA) → CPU → (PCIe) → GPU
FPGA → (PCIe, DMA) → GPU

Outsource the selection cuts to the FPGA.
Include the vertex fit.

(15)

Backup Slides


(16)

Multiple Scattering at middle hit of triplet

Figure: scattering angles φ_MS (in the x–y projection) and ϑ_MS (in the s–z projection) at the middle hit, between the triplet segments S01 and S12.

Minimize χ² = φ²_MS / σ²_MS + θ²_MS / σ²_MS

(17)

Irreducible Background


(18)

Phases of Mu3e

Phase 1: O(10⁸ µ/s), central part + 1 recurl station
Phase 2: O(10⁹ µ/s), central part + 2 recurl stations

(19)

Hardware Implementation: Warps

After a block is assigned to an SM, it is divided into units called warps. On the GTX 680: 1 warp = 32 threads.

Figure (as on the architecture slide): threads 0–31 form warp 0, threads 32–63 warp 1, threads 64–95 warp 2.


(20)

Warp Scheduling

Warps execute in SIMD fashion (Single Instruction, Multiple Data) and not in a fixed order; 1 warp = 32 threads.

Branch divergence (figure): when the threads of a warp take different paths at a branch (option 1 vs. option 2), the paths are executed one after the other before the warp continues with the next common instruction.

SM instruction scheduler (figure): instructions from ready warps are interleaved over time, e.g. warp 22 instruction 13, warp 22 instruction 14, warp 13 instruction 4, warp 13 instruction 5, warp 96 instruction 33.
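
A small CUDA sketch (not from the talk) of code that produces branch divergence: even and odd threads of a warp take different branches, so the hardware serializes the two paths.

    // Illustration of branch divergence within a warp: even and odd threads take
    // different branches, so the two paths are executed one after the other.
    __global__ void divergent_kernel(float* data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        if (threadIdx.x % 2 == 0) {
            data[i] *= 2.0f;    // option 1: even threads of the warp
        } else {
            data[i] += 1.0f;    // option 2: odd threads of the warp
        }
    }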

(21)

GPU Memory

Figure: each thread has its own registers; the threads of a block share 48 kB of shared memory; all blocks access the global memory and the constant memory, which the host reads and writes.

Registers: fastest, limited to 65536 registers per block
Shared memory (48 kB per block): extremely fast, highly parallel
Global memory (4 GB): high access latency (400–800 cycles), finite access bandwidth
Constant memory (64 kB): read only, short latency
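
A brief CUDA sketch (illustrative, not from the talk) of using shared memory: each block stages a tile of data from the slow global memory into its fast shared memory before working on it.

    // Sketch: stage data from global memory into the fast per-block shared memory.
    // Assumes blocks of at most 256 threads.
    __global__ void shared_memory_example(const float* global_in, float* global_out, int n)
    {
        __shared__ float tile[256];                        // allocated in shared memory

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) tile[threadIdx.x] = global_in[i];       // one global-memory read per thread
        __syncthreads();                                   // wait until the whole tile is loaded

        if (i < n) global_out[i] = 2.0f * tile[threadIdx.x];  // fast shared-memory access
    }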


(22)

Memory Access

Coalesced memory access (figure): the 32 threads of a warp (thread IDs 0–31) access consecutive addresses (e.g. 128–256), so the warp's access is served as 128 bytes in a single transaction.

Non-coalesced memory access (figure): the threads of a warp access scattered addresses, so several transactions are needed.
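
A CUDA sketch (illustrative) contrasting a coalesced access pattern, where consecutive threads read consecutive addresses, with a strided pattern that breaks coalescing.

    // Coalesced: thread i reads element i, so a warp touches one contiguous 128-byte segment.
    __global__ void coalesced_read(const float* in, float* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }

    // Non-coalesced: the stride scatters the accesses of one warp over many segments,
    // so several memory transactions are needed per warp.
    __global__ void strided_read(const float* in, float* out, int n, int stride)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i * stride < n) out[i] = in[i * stride];
    }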

(23)

Diagnostic Tools (One Kernel)

Nvidia provides several diagnostic tools:
Profiler for terminal usage: time spent by CPU and GPU
Memory check: memory allocation errors, misuse, ...
Visual profiler: diagnostics for the performance of GPU code


(24)

Diagnostic Tools (One Kernel)

Kernel profile: CUDA code and machine code
