
(1)

Track and Vertex Reconstruction on GPUs for the Mu3e Experiment

Dorothea vom Bruch for the Mu3e Collaboration

GPU Computing in High Energy Physics, Pisa September 11th, 2014

(2)

Outline

The Mu3e experiment
Readout and event selection
Track fit on the GPU
Current performance
Outlook

(3)

Motivation

The Mu3e experiment searches for...

... the charged lepton-flavour violating decay µ → e⁺e⁺e⁻ with a sensitivity better than 10⁻¹⁶

[Feynman diagram: µ⁺ decay via W⁺ with neutrino mixing νµ → νe and a virtual photon (γ*) converting into an e⁺e⁻ pair, yielding e⁺e⁺e⁻]

Suppressed in the Standard Model to below 10⁻⁵⁴. Any hint of a signal indicates new physics:

Supersymmetry
Grand unified models
Extended Higgs sector
...

(4)

Signal versus Background

[Sketch: e⁺ e⁺ e⁻ event topologies]

Signal: coincident in time, single vertex, Σ p⃗_i = 0, E = m_µ

Combinatorial background: not coincident in time, no single vertex, E ≠ m_µ, Σ p⃗_i ≠ 0

Internal conversion background: coincident in time, single vertex, E ≠ m_µ, Σ p⃗_i ≠ 0


(7)

Resolution

µ decays at rest → p_e < 53 MeV/c

Resolution dominated by multiple Coulomb scattering (∝ 1/p)

Minimize material:
High-Voltage Monolithic Active Pixel Sensors (HV-MAPS) thinned to 50 µm
Ultralight mechanics

(8)

Detector Requirements

Excellent momentum resolution: < 0.5 MeV/c
Good timing resolution: 100 ps
Good vertex resolution: 100 µm

(15)

The Detector

[Detector schematic: µ beam, stopping target, inner pixel layers, scintillating fibres, outer pixel layers, recurl pixel layers, scintillator tiles]

(16)

Beam and Statistics

Beam provided by the Paul Scherrer Institut

Currently: 10⁸ µ/s
In future: up to 2×10⁹ µ/s

Triggerless readout, 1 Tbit/s data rate
Online selection → reduction by a factor of 1000

(17)

Readout Scheme

[Readout chain:
1116 pixel sensors (HV-MAPS) → 38 FPGAs: up to 45 links at 800 Mbit/s
38 FPGAs → 2 readout (RO) boards: one 6.4 Gbit/s link each
RO boards → 12 GPU PCs: 12 links at 6.4 Gbit/s per RO board
GPU PCs → data collection server → mass storage: Gbit Ethernet]

(19)

Event Filtering

(Slide: Niklaus Berger, Lepton Moments 2014, slide 21)

GPU gets a 50 ns time slice with the full detector information

Find 3 tracks originating from a common vertex

(20)

Multiple Scattering Fit

Ignore spatial uncertainty

Multiple scattering at the middle hit of the triplet

Minimize the multiple scattering χ²:

[Sketch: hit triplet with path lengths S01, S12 and scattering angles Φ_MS (x-y plane) and Θ_MS (s-z plane) at the middle hit]

Minimize χ² = Φ_MS² / σ_MS² + Θ_MS² / σ_MS²
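The χ² above is cheap to evaluate per triplet. A minimal sketch of such a device function (hypothetical names, assuming a single effective σ_MS for both scattering angles; not the Mu3e production code):

```cuda
// Sketch only: per-triplet multiple scattering chi^2 as defined above.
//   phiMS   - scattering angle at the middle hit in the x-y plane [rad]
//   thetaMS - scattering angle at the middle hit in the s-z plane [rad]
//   sigmaMS - expected RMS multiple scattering angle from the material budget [rad]
__device__ float triplet_chi2(float phiMS, float thetaMS, float sigmaMS)
{
    float s2 = sigmaMS * sigmaMS;
    return (phiMS * phiMS + thetaMS * thetaMS) / s2;
}
```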

(21)

Multiple Scattering Fit

[Sketch: a track split into overlapping hit triplets (Triplet 1, Triplet 2, ...)]

Describe the track as a sequence of hit triplets

Non-iterative fit

(22)

GPU Specifications

Use Nvidia's CUDA environment
GeForce GTX 680
8 Streaming Multiprocessors

Image source: http://www.pcmag.com/article2/0,2817,2401953,00.asp

(23)

Fit on the GPU

Consider the first three detector layers

Number of possible track candidates ~ n[1] × n[2] × n[3]
n[i]: number of hits in layer i

On the GPU:
Loop over all possible combinations
Geometrical selection cuts
Triplet fit
Vertex fit

⇒ Goal: reduction factor of ~1000

(24)

Sharing the Work

On FPGA:

Sort hits

Copy hit arrays to global memory of GPU Currently: FPGA tasks are performed by CPU Within kernel / thread:

Apply geometrical selection cuts:

For pairs of hits in layers [1,2] and [2,3] check proximity in x-y plane and in z

Do triplet fit

Cut onχ2 and fit completion status

If all cuts passed: Count triplets and save hits in global memory using atomic function

Copy back global index array
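A minimal sketch of this per-thread flow (hypothetical names and data layout, with trivial placeholder cuts and fit; not the Mu3e production code):

```cuda
// Placeholder device helpers; the real cuts and fit are detector specific.
__device__ bool pass_geometry_cuts(float3 a, float3 b, float3 c)
{
    // e.g. proximity of the [1,2] and [2,3] hit pairs in z (and in x-y)
    return fabsf(b.z - a.z) < 30.f && fabsf(c.z - b.z) < 30.f;
}

__device__ float fit_triplet(float3 a, float3 b, float3 c)
{
    return 1.0f;  // placeholder for the non-iterative triplet fit chi^2
}

// One thread per (i1, i2, i3) hit combination: apply the cuts, fit, and
// append the indices of accepted triplets to global memory via an atomic counter.
__global__ void select_and_fit(const float3 *h1, const float3 *h2,
                               const float3 *h3, int n3,
                               int3 *selected, unsigned int *nSelected,
                               float chi2Cut)
{
    int i1 = blockIdx.x;      // hit index in layer 1
    int i2 = blockIdx.y;      // hit index in layer 2
    int i3 = threadIdx.x;     // hit index in layer 3
    if (i3 >= n3) return;

    if (!pass_geometry_cuts(h1[i1], h2[i2], h3[i3])) return;

    float chi2 = fit_triplet(h1[i1], h2[i2], h3[i3]);
    if (chi2 > chi2Cut) return;

    unsigned int slot = atomicAdd(nSelected, 1u);   // count accepted triplets
    selected[slot] = make_int3(i1, i2, i3);         // store their hit indices
}
```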

(25)

Sharing the Work

Some instruction

Branch

Option 1

Option 2

Some instruction

Branch divergence

Within kernel / thread:

Apply geometrical selection cuts:

For pairs of hits in layers [1,2]

and [2,3] check

proximity in x-y plane and in z Do triplet fit

Cut onχ2 and fit completion status

If all cuts passed: Count triplets and save hits in global memory using atomic function

(26)

Sharing the Work

Some instruction

Branch

Option 1

Option 2

Some instruction

Branch divergence

Within kernel / thread:

Apply geometrical selection cuts:

For pairs of hits in layers [1,2]

and [2,3] check

proximity in x-y plane and in z Do triplet fit

Cut onχ2 and fit completion status

If all cuts passed: Count triplets and save hits in global memory using atomic function

(27)

Grid Alternatives: One Fit per Thread

[Grid layout: grid dimension x = n[1], grid dimension y = n[2]; each block contains n[3] threads → one triplet fit per thread]
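A minimal host-side sketch of this launch configuration (hypothetical names; it reuses the select_and_fit kernel sketched under "Sharing the Work" and assumes n[3] does not exceed the 1024-thread block limit quoted in the backup slides):

```cuda
// Sketch only: "one fit per thread" launch.
// One block per (layer-1, layer-2) hit pair, one thread per layer-3 hit.
void launch_one_fit_per_thread(const float3 *dH1, const float3 *dH2,
                               const float3 *dH3, int n1, int n2, int n3,
                               int3 *dSelected, unsigned int *dNSelected,
                               float chi2Cut)
{
    dim3 grid(n1, n2);   // grid dimension x = n[1], y = n[2]
    dim3 block(n3);      // block dimension x = n[3]
    select_and_fit<<<grid, block>>>(dH1, dH2, dH3, n3,
                                    dSelected, dNSelected, chi2Cut);
}
```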

(28)

Grid Alternatives: Several Fits per Thread

[Grid layout: grid dimension x = n[1]; block dimension x = n[2]; each thread loops over the n[3] hits of layer 3]
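A sketch of the corresponding kernel (hypothetical names, with a trivial placeholder for the cuts and the fit):

```cuda
// Sketch only: the "several fits per thread" layout.
// Grid x = n[1] (layer-1 hit), block x = n[2] (layer-2 hit); each thread
// loops over all n[3] layer-3 hits instead of using a third grid dimension.
__global__ void fits_per_thread(const float3 *h1, const float3 *h2,
                                const float3 *h3, int n2, int n3,
                                int *nAccepted)
{
    int i1 = blockIdx.x;           // hit index in layer 1
    int i2 = threadIdx.x;          // hit index in layer 2
    if (i2 >= n2) return;

    int accepted = 0;
    for (int i3 = 0; i3 < n3; ++i3) {
        // placeholder for the geometrical cuts and the triplet fit on
        // (h1[i1], h2[i2], h3[i3]); here: a trivial z-proximity check
        if (fabsf(h3[i3].z - h2[i2].z) < 30.f &&
            fabsf(h2[i2].z - h1[i1].z) < 30.f)
            ++accepted;
    }
    atomicAdd(nAccepted, accepted); // bookkeeping placeholder
}
```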

(29)

Separate Kernels

First kernel: launch a grid with all possible hit combinations, apply the selection cuts, and store the indices of the selected triplets.

Second (fitting) kernel, launched only on the selected triplets:
[Grid layout: block dimension x = 128 (or another multiple of 32); grid dimension N = # selected triplets / 128]

Advantages:
No idle threads in the time-intensive fitting kernel
Block dimension is a multiple of 32 (the warp size)
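A sketch of the second-stage fitting kernel and its launch over the compacted triplet list (hypothetical names; the fit body is a placeholder):

```cuda
// Sketch only: fit kernel over the compacted list of selected triplets.
// Block size 128 (a multiple of the 32-thread warp size); the grid is
// rounded up so that only part of the last block idles.
__global__ void fit_selected(const int3 *selected, unsigned int nSelected,
                             const float3 *h1, const float3 *h2,
                             const float3 *h3, float *chi2Out)
{
    unsigned int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= nSelected) return;

    int3 idx = selected[t];
    // ... full triplet fit on (h1[idx.x], h2[idx.y], h3[idx.z]) ...
    chi2Out[t] = 0.f;   // placeholder result
}

// Host-side launch:
//   const int threads = 128;
//   const int blocks  = (nSelected + threads - 1) / threads;
//   fit_selected<<<blocks, threads>>>(dSelected, nSelected, dH1, dH2, dH3, dChi2);
```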

(30)

Kernel Profile

One-kernel version (selection cuts and fit in the same kernel): 87 % branch divergence during the fit procedure.

Separate-kernel version: branch divergence only in the first (selection cuts) kernel, no divergence during the fit.

⇒ Choose the separate-kernel version

(31)

Compute Utilization: Fitting Kernel

(32)

Other Optimization Attempts

Idea: Count triplets using atomicInc on a shared variable (sketched below).
Problem: Synchronization of the threads and copying back to global memory takes too long.

Idea: Compose the grid of only one block with n[1] threads; load the hit arrays into shared memory for quicker access.
Problem: The amount of shared memory per Streaming Multiprocessor is not enough; the number of blocks is too small to hide latency.
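For illustration, a sketch of the first idea (hypothetical names; assumes at most 256 threads per block), which lost out against plain per-triplet atomics on global memory:

```cuda
// Sketch only: count accepted triplets in shared memory per block, then
// flush to global memory. The extra synchronization and the serial
// copy-back are what made this slower in practice.
__global__ void count_in_shared(const int *accepted, int n,
                                int *globalOut, unsigned int *globalCount)
{
    __shared__ unsigned int localCount;
    __shared__ int localBuf[256];               // at most one slot per thread

    if (threadIdx.x == 0) localCount = 0;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && accepted[i]) {
        unsigned int slot = atomicInc(&localCount, 0xffffffffu);
        localBuf[slot] = i;
    }
    __syncthreads();

    // flush the block-local results to global memory
    if (threadIdx.x == 0) {
        unsigned int base = atomicAdd(globalCount, localCount);
        for (unsigned int k = 0; k < localCount; ++k)
            globalOut[base + k] = localBuf[k];
    }
}
```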

(33)

Current Performance

Wall time of CPU & GPU:
One kernel: 7.6×10⁹ triplets/s (run time measured over > 11 days)
Separate kernels: 1.4×10¹⁰ triplets/s (run time measured over > 15 hours)

Most time is spent on the selection cuts
Can be improved by using the FPGA for the selection

Currently: the fit is performed on both CPU and GPU to compare the output
→ Contributes to the computation time

(34)

Summary

Searching for µ → e⁺e⁺e⁻ with a sensitivity better than 10⁻¹⁶

Goal: find 2×10⁹ tracks/s online
Achieved: process 10¹⁰ triplets/s

(35)

Outlook

Include the vertex fit or alternative vertex selection criteria
Outsource the pre-fit selection to the FPGA
Write data to the GPU via Direct Memory Access from the FPGA

(36)

Thank You

Thank you for your attention!

(37)

Backup

Backup Slides

(38)

More Detailed Performance for Separate Kernels

Wall time of CPU & GPU without the fit on the CPU: 1.4×10¹⁰ triplets/s

GPU time only:
Average time per fit: 26 µs
Average time for fit & memory copying: 30 µs
Fit & copying: 1.1×10⁷ fits/s

(39)

Multiple Scattering

[Sketch: track curling in magnetic field B with multiple scattering angle θ_MS]

(40)

NVVP Profile: One Kernel

Kernel profile for one selected kernel

(41)

NVVP Profile: Separate Kernels

Kernel profile of one selected fitting kernel (without selection kernel):

(42)

CPU - GPU Communication

[Diagram: host (CPU) with its memory and host code — allocate, launch kernel, copy back — connected to the device (GPU) with DRAM, cache and Streaming Multiprocessors (SMs)]

Nvidia: API extension to C: CUDA (Compute Unified Device Architecture)

Compile with nvcc and gcc
→ runs on the host (= CPU) and the device (= GPU)

Very similar to C / C++ code
Compatible with other languages

(43)

CPU - GPU Communication

Host code:
Some CPU code ...
GPU function (kernel) launched as a grid on the GPU
...
Some more CPU code

CUDA: special variables / functions introduced for:
Identification of GPU code
Allocation of GPU memory
Access to the grid size
Options for the grid launch
...

CUDA grid:
A grid consists of blocks
A block consists of threads
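A minimal, generic CUDA program showing these pieces together — kernel definition, device memory allocation, launch as a grid of blocks of threads, and copy back (illustrative only, not Mu3e code):

```cuda
#include <cstdio>

// Kernel: each thread scales one array element.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) data[i] *= factor;
}

int main()
{
    const int n = 1024;
    float host[n];
    for (int i = 0; i < n; ++i) host[i] = 1.0f;

    float *dev = 0;
    cudaMalloc(&dev, n * sizeof(float));                      // allocate GPU memory
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);

    scale<<<(n + 255) / 256, 256>>>(dev, n, 2.0f);            // launch grid of blocks
    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);

    printf("host[0] = %f\n", host[0]);                        // expected: 2.0
    return 0;
}
```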

(44)

CUDA Architecture (GTX 680 as example)

Each thread executes one instance of the kernel
Up to 3 dimensions for block and thread indices
Up to 1024 threads per block
Max dimensions of the grid: 65535 × 65535 × 65535
Access to the thread & block index via built-in variables within the kernel

[Diagram: grid of blocks (0,0) … (m,n); each block is a 2-D array of threads (0,0) … (M,N)]

(45)

Hardware Implementation (GTX 680 as example)

All threads in a grid execute the same kernel
Execution order of the blocks is arbitrary

Blocks are scheduled on Streaming Multiprocessors (SMs) according to:
Resource usage: memory, registers
Thread number limit

[Diagram: kernel grid of blocks 0–7 distributed over the device's 8 SMs]

Max. 2048 threads per SM → limits the number of blocks per SM

(46)

Hardware Implementation: Warps

After a block is assigned to an SM, it is divided into units called warps
On the GTX 680: 1 warp = 32 threads

[Diagram: a block on an SM split into Warp 0 (threads 0–31), Warp 1 (threads 32–63), Warp 2 (threads 64–95), ...]
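For illustration, a thread can derive its warp and lane from its index and the built-in warpSize (generic sketch, not Mu3e code):

```cuda
// Sketch only: which warp and lane within the block does this thread belong to?
__global__ void warp_info(int *warpIdOut, int *laneIdOut)
{
    int t      = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    int warpId = threadIdx.x / warpSize;   // warpSize is a built-in (= 32 here)
    int laneId = threadIdx.x % warpSize;
    warpIdOut[t] = warpId;
    laneIdOut[t] = laneId;
}
```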

(47)

Warp Scheduling

Warps execute:
In SIMD fashion (Single Instruction, Multiple Data)
Not in any fixed order

[Diagram: branch divergence — if the threads of a warp take different sides of a branch, Option 1 and Option 2 are executed one after the other before the warp continues with the next common instruction]

[Diagram: the SM instruction scheduler interleaves ready warps over time, e.g. warp 22 instruction 13, warp 22 instruction 14, warp 13 instruction 4, warp 13 instruction 5, warp 96 instruction 33]
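A small sketch of the two cases (generic example): a condition that differs within a warp diverges, while a condition that is uniform per warp does not:

```cuda
// Sketch only: branch divergence within a warp vs. a warp-uniform branch.
__global__ void divergence_demo(float *x)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (threadIdx.x % 2 == 0)        // even/odd lanes differ -> divergent warp
        x[i] = x[i] * 2.0f;
    else
        x[i] = x[i] + 1.0f;

    if (threadIdx.x < warpSize)      // whole warps take the same path -> no divergence
        x[i] = x[i] - 1.0f;
}
```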

(48)

GPU Memory

Registers: fastest, limited to 65536 registers per block
Shared memory: 48 kB per block, extremely fast, highly parallel
Global memory: 4 GB, high access latency (400–800 cycles), finite access bandwidth
Constant memory: 64 kB, read only, short latency

[Diagram: the host talks to global and constant memory on the device; each block has its own 48 kB of shared memory; each thread has its own registers]
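How the four memory spaces appear in CUDA source (generic sketch; assumes a block size of at most 256 threads):

```cuda
// Sketch only: registers, shared, global and constant memory in one kernel.
__constant__ float cCut = 42.0f;           // constant memory: read-only, cached

__global__ void memory_spaces(const float *gIn, float *gOut)   // global memory
{
    __shared__ float sTile[256];           // shared memory: per block, fast

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = gIn[i];                      // 'v' lives in a register (fastest)

    sTile[threadIdx.x] = v;
    __syncthreads();                       // make the tile visible to the block

    gOut[i] = (sTile[threadIdx.x] < cCut) ? sTile[threadIdx.x] : cCut;
}
```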

(49)

Memory Access

Coalesced memory access: the 32 threads of a warp (thread IDs 0–31) access a contiguous, aligned region (addresses 128–256) → served as 128 bytes in a single transaction.

Non-coalesced memory access: the accesses of the warp are scattered over the address range → several transactions are needed.
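Two generic kernels illustrating the access patterns (sketch only): consecutive threads reading consecutive elements coalesce, while a stride greater than one spreads a warp's accesses over multiple memory segments:

```cuda
// Coalesced: thread k of a warp touches element k -> 128 contiguous bytes per warp.
__global__ void coalesced_copy(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];             // consecutive threads, consecutive addresses
}

// Non-coalesced for stride > 1: the warp's accesses are spread over many segments.
__global__ void strided_copy(const float *in, float *out, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}
```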
