GPU-based online track reconstruction for the MuPix-telescope
Carsten Grzesik for the Mu3e collaboration
February 29, 2016
Motivation
Mu3e experiment
• high data rate: ∼1 Tbit s⁻¹
• online track reconstruction
• reduction factor: ∼1000
MuPix telescope
• test setup: pixel sensors, readout and online reconstruction
• high beam rates: O(1 MHz)
• max. output rate: 4×1.25 Gbit s⁻¹
High Voltage Monolithic Active Pixel Sensor
• 180 nm HV-CMOS technology
• reverse biased up to 90 V
• thin depletion region
• thinned to 50 µm
• readout logic directly on chip
• zero-suppressed serial data output at 1.25 Gbit s⁻¹
• details in T72.1/2/3
• I. Perić, Nucl. Instrum. Meth. A 582 (2007) 876
Setup
[Diagram: telescope planes along the beam]
• MuPix7
• sensor size: 3.2×3.2 mm²
• serial data output via LVDS
Data Transmission - DMA
• serial data output from sensor planes
• merge and sort on FPGA
• PCIe connection to PC
• Direct Memory Access to GPU not available
• DMA via main memory instead (sketch after the diagram below)
• data rate: ≤1.5 GB s⁻¹
[Diagram: MuPix → FPGA → PCIe → RAM (CPU) → GPU]
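A minimal sketch of this staged transfer, assuming the FPGA driver exposes its DMA target as an ordinary host buffer; `fpga_read`, the `Hit` layout and `N_HITS` are hypothetical placeholders, not the actual readout API:

```cuda
#include <cuda_runtime.h>
#include <cstdint>
#include <cstddef>

// Hypothetical hit record; the real MuPix data format may differ.
struct Hit { uint8_t plane; uint16_t col, row; uint32_t frame; };

int main() {
    const size_t N_HITS = 1u << 20;
    Hit *h_buf = nullptr, *d_buf = nullptr;

    // Pinned (page-locked) host memory: the FPGA DMAs into main memory,
    // and pinned pages let cudaMemcpyAsync reach full PCIe bandwidth.
    cudaHostAlloc((void**)&h_buf, N_HITS * sizeof(Hit), cudaHostAllocDefault);
    cudaMalloc((void**)&d_buf, N_HITS * sizeof(Hit));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // fpga_read(h_buf, N_HITS);  // hypothetical: FPGA driver fills h_buf

    // Second hop: main memory -> GPU; with double buffering this can
    // overlap with the next FPGA transfer.
    cudaMemcpyAsync(d_buf, h_buf, N_HITS * sizeof(Hit),
                    cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);

    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}
```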
Graphics Processing Unit
• programming: CUDA API
• commercial gaming GPUs
• GTX 980: 2048 cores @ 1.3 GHz
• straight track model → few calculation steps (sketch below)
• combinatorics of hits → many memory loads
• memory-bound algorithm → need high memory throughput
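To illustrate why the arithmetic side is cheap, here is a hedged sketch of a straight-line fit per track candidate; the four-plane geometry, unit weights and function name are assumptions, not the collaboration's actual fit code:

```cuda
// Least-squares fit x = a + b*z through 4 telescope planes; only a few
// dozen floating point operations per candidate, so kernel runtime is
// dominated by fetching the hit combinations from memory.
__device__ float straight_line_chi2(const float x[4], const float z[4])
{
    float sz = 0.f, sx = 0.f, szz = 0.f, szx = 0.f;
    for (int i = 0; i < 4; ++i) {
        sz  += z[i];        sx  += x[i];
        szz += z[i] * z[i]; szx += z[i] * x[i];
    }
    const float n   = 4.f;
    const float det = n * szz - sz * sz;
    const float b   = (n * szx - sz * sx) / det;   // slope
    const float a   = (szz * sx - sz * szx) / det; // intercept

    // chi2 = sum of squared residuals (unit weights for illustration)
    float chi2 = 0.f;
    for (int i = 0; i < 4; ++i) {
        const float r = x[i] - (a + b * z[i]);
        chi2 += r * r;
    }
    return chi2;
}
```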
Memory Coalescing
[Illustration: 16 threads issuing one grouped load from consecutive memory addresses]
• example for 16 threads/SM
• 16 threads perform the same operation at the same time (e.g. a memory load)
• consecutive data in memory → grouped into 1 load operation (sketch below)
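A short sketch of the two access patterns, with hypothetical kernel names; when thread i of a group reads element i, the hardware serves the whole group with one wide transaction:

```cuda
// Coalesced: neighbouring threads read neighbouring addresses,
// so the group's reads collapse into a single wide memory load.
__global__ void coalesced_copy(const float* __restrict__ in,
                               float* __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Counter-example: a stride spreads the group over many cache lines,
// turning one grouped load into many separate transactions.
__global__ void strided_copy(const float* __restrict__ in,
                             float* __restrict__ out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n)
        out[i] = in[i * stride];
}
```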
GPU implementation
• parallelization: one timeframe per thread → no communication required across thread boundaries
• hits from consecutive frames stored next to each other (coalesced memory access)
• data therefore needs to be sorted by plane and time (layout below)
memory pos.      | 0     | 1     | 2     | ... | 31
hit.plane.frame  | 0.0.0 | 0.0.1 | 0.0.2 | ... | 0.0.31
memory pos.      | 32    | 33    | ...   |     | 63
hit.plane.frame  | 1.0.0 | 1.0.1 | ...   |     | 1.0.31
memory pos.      | 256   | 257   | ...
hit.plane.frame  | 0.1.0 | 0.1.1 | ...
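A hedged sketch of how this layout is consumed; the group size of 32 frames matches the table (with 8 hit slots per plane, plane 1 starts at position 1·8·32 = 256), while the kernel name and `float2` hit encoding are assumptions:

```cuda
// One block per group of 32 consecutive timeframes, one thread per frame.
__global__ void process_group(const float2* __restrict__ hits,
                              int max_hits, int n_planes)
{
    int frame = threadIdx.x;  // 0..31: one timeframe per thread
    const float2* group = hits + blockIdx.x * n_planes * max_hits * 32;

    for (int plane = 0; plane < n_planes; ++plane) {
        for (int h = 0; h < max_hits; ++h) {
            // The same (plane, hit slot) for 32 consecutive frames sits in
            // 32 consecutive positions -> the 32 threads load it coalesced.
            float2 hit = group[(plane * max_hits + h) * 32 + frame];
            // ... feed hit into the track search / straight-line fit ...
            (void)hit;
        }
    }
}
```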
Setup - DESY Testbeam
planned:
[Diagram: MuPix → FPGA (merging and sorting) → PCIe → RAM → GPU]
implemented:
[Diagram: MuPix → FPGA → PCIe → RAM → sorting on CPU → GPU]
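With the FPGA sorting firmware not yet in use, the ordering by plane and time apparently runs on the host before the GPU transfer. A minimal sketch of such a step; the `Hit` record and packed key are illustrative, and the final coalesced layout additionally interleaves the 32 frames of a group:

```cuda
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical hit record, as in the DMA sketch above.
struct Hit { uint8_t plane; uint16_t col, row; uint32_t frame; };

// Plane is the slowest-varying part of the target layout, time the fastest.
static uint64_t sort_key(const Hit& h) {
    return (uint64_t(h.plane) << 32) | h.frame;
}

void sort_hits(std::vector<Hit>& hits) {
    std::sort(hits.begin(), hits.end(),
              [](const Hit& a, const Hit& b) {
                  return sort_key(a) < sort_key(b);
              });
}
```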
Results - DESY Testbeam
[Histograms: track residuals (res1_x) [µm] for CPU and GPU reconstruction; each with entries 636869, mean 5.006, RMS 33.35]
Results - DESY Testbeam
[Histogram: residual of residuals (difference between CPU and GPU residuals) [µm]; entries 636869, mean 1.312×10⁻⁴, RMS 1.586×10⁻⁴]
• deviation < 1 nm
• small bias towards larger CPU values
• caused by execution differences between CPU and GPU (e.g. floating point precision, see backup)
Summary and Outlook
• GPU tracking implemented
• DMA working up to 1.5 GB s⁻¹
• offline tracking on GPU gives reasonable results
To do:
• finally test the FPGA firmware
• use GPU tracking online
• optimize the GPU code
Acknowledgments
The measurements leading to these results have been performed at the Test Beam Facility at DESY Hamburg (Germany), a member of the Helmholtz Association (HGF).
Backup
• memory-bound GPU kernels → 32 bit floating point
• IEEE 754 floating point arithmetic
• GPU uses Fused Multiply-Add (FMA), see the sketch below
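A self-contained sketch of why FMA contraction shifts results at the ULP level; the numbers are illustrative, not taken from the measurement (compile the host version with -ffp-contract=off so the compiler does not fuse the "separate" line itself):

```cuda
#include <cmath>
#include <cstdio>

int main() {
    // x = 1 + 2^-12, exactly representable in float
    float x = 1.0f + ldexpf(1.0f, -12);

    // x*x = 1 + 2^-11 + 2^-24 exactly; rounding the product to float
    // drops the 2^-24 term, so multiply-then-add yields exactly 2^-11.
    float separate = x * x - 1.0f;

    // The fused multiply-add rounds only once and keeps the 2^-24 term,
    // which is what the GPU compiler generates for a*b + c by default.
    float fused = fmaf(x, x, -1.0f);

    printf("separate: %.10g\nfused:    %.10g\ndiff:     %g\n",
           separate, fused, fused - separate);   // diff = 2^-24 ≈ 6e-8
    return 0;
}
```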
[2D histogram, residual vs value: χ²(GPU) − χ²(double) against χ²(double); entries 1310720, mean x 11.43, mean y 5.017×10⁻⁷, RMS x 8.625, RMS y 9.745×10⁻⁷]
Backup
[Histogram: (chi2_float − chi2_double)/chi2_float for chi2_float > 0; entries 999799, mean 0.02322, RMS 0.1488]
• IEEE 754: float ULP ≈ 10⁻⁷
Backup
[Histogram: χ² distribution; entries 626660, mean 56.56, RMS 146.1]
[Histogram: χ² distribution with selection chi2 == chi2 (removes NaN entries); entries 636869, mean 56.58, RMS 146.2]