GPU-based online track reconstruction for the MuPix-telescope
Carsten Grzesik for the Mu3e collaboration
February 29, 2016
Motivation
Mu3e experiment
• high data rate: ∼1 Tbit s⁻¹
• online track reconstruction
• reduction factor: ∼1000
MuPix telescope
• test setup: pixel sensors, readout and online reconstruction
• high beam rates: O(1 MHz)
• max. output rate: 4×1.25 Gbit s⁻¹
High Voltage Monolithic Active Pixel Sensor
• 180 nm HV-CMOS technology
• reverse biased up to 90 V
• thin depletion region
• thinned to 50 µm
• readout logic directly on chip
• zero-suppressed serial data output at 1.25 Gbit s⁻¹
• details in T72.1/2/3
• I. Perić, Nucl. Instrum. Meth. A 582 (2007) 876
Setup
[Diagram: telescope planes along the beam]
• MuPix7
• sensor size: 3.2×3.2 mm²
• serial data output via LVDS
Data Transmission - DMA
• serial data output from sensor planes
• merge and sort on FPGA
• PCIe connection to PC
• Direct Memory Access to GPU not available
• DMA via main memory instead (sketch after the diagram below)
• data rate: ≤1.5 GB s⁻¹
[Diagram: MuPix → FPGA → PCIe → RAM (CPU) → GPU]
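A minimal sketch of this staged transfer, assuming the FPGA driver exposes its DMA target as an ordinary host buffer; `fpga_read`, the `Hit` layout and `N_HITS` are hypothetical placeholders, not the actual readout API:

```cuda
#include <cuda_runtime.h>
#include <cstdint>
#include <cstddef>

// Hypothetical hit record; the real MuPix data format may differ.
struct Hit { uint8_t plane; uint16_t col, row; uint32_t frame; };

int main() {
    const size_t N_HITS = 1u << 20;
    Hit *h_buf = nullptr, *d_buf = nullptr;

    // Pinned (page-locked) host memory: the FPGA DMAs into main memory,
    // and pinned pages let cudaMemcpyAsync reach full PCIe bandwidth.
    cudaHostAlloc((void**)&h_buf, N_HITS * sizeof(Hit), cudaHostAllocDefault);
    cudaMalloc((void**)&d_buf, N_HITS * sizeof(Hit));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // fpga_read(h_buf, N_HITS);  // hypothetical: FPGA driver fills h_buf

    // Second hop: main memory -> GPU; with double buffering this can
    // overlap with the next FPGA transfer.
    cudaMemcpyAsync(d_buf, h_buf, N_HITS * sizeof(Hit),
                    cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);

    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}
```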
Graphics Processing Unit
• programming: CUDA API
• commercial gaming GPUs
• GTX 980: 2048 cores @ 1.3 GHz
• straight track model → few calculation steps (sketch below)
• combinatorics of hits → many memory loads
• memory-bound algorithm → need high memory throughput
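To illustrate why the arithmetic side is cheap, here is a hedged sketch of a straight-line fit per track candidate; the four-plane geometry, unit weights and function name are assumptions, not the collaboration's actual fit code:

```cuda
// Least-squares fit x = a + b*z through 4 telescope planes; only a few
// dozen floating point operations per candidate, so kernel runtime is
// dominated by fetching the hit combinations from memory.
__device__ float straight_line_chi2(const float x[4], const float z[4])
{
    float sz = 0.f, sx = 0.f, szz = 0.f, szx = 0.f;
    for (int i = 0; i < 4; ++i) {
        sz  += z[i];        sx  += x[i];
        szz += z[i] * z[i]; szx += z[i] * x[i];
    }
    const float n   = 4.f;
    const float det = n * szz - sz * sz;
    const float b   = (n * szx - sz * sx) / det;   // slope
    const float a   = (szz * sx - sz * szx) / det; // intercept

    // chi2 = sum of squared residuals (unit weights for illustration)
    float chi2 = 0.f;
    for (int i = 0; i < 4; ++i) {
        const float r = x[i] - (a + b * z[i]);
        chi2 += r * r;
    }
    return chi2;
}
```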
Memory Coalescing
[Illustration: 16 threads issuing one grouped load from consecutive memory addresses]
• example for 16 threads/SM
• 16 threads perform the same operation at the same time (e.g. a memory load)
• consecutive data in memory → grouped into 1 load operation (sketch below)
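A short sketch of the two access patterns, with hypothetical kernel names; when thread i of a group reads element i, the hardware serves the whole group with one wide transaction:

```cuda
// Coalesced: neighbouring threads read neighbouring addresses,
// so the group's reads collapse into a single wide memory load.
__global__ void coalesced_copy(const float* __restrict__ in,
                               float* __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Counter-example: a stride spreads the group over many cache lines,
// turning one grouped load into many separate transactions.
__global__ void strided_copy(const float* __restrict__ in,
                             float* __restrict__ out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n)
        out[i] = in[i * stride];
}
```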
GPU implementation
• parallelization: one timeframe per thread → no communication required across thread boundaries
• hits from consecutive frames stored next to each other (coalesced memory access)
• data therefore needs to be sorted by plane and time (layout below)
memory pos.      | 0     | 1     | 2     | ... | 31
hit.plane.frame  | 0.0.0 | 0.0.1 | 0.0.2 | ... | 0.0.31
memory pos.      | 32    | 33    | ...   |     | 63
hit.plane.frame  | 1.0.0 | 1.0.1 | ...   |     | 1.0.31
memory pos.      | 256   | 257   | ...
hit.plane.frame  | 0.1.0 | 0.1.1 | ...
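A hedged sketch of how this layout is consumed; the group size of 32 frames matches the table (with 8 hit slots per plane, plane 1 starts at position 1·8·32 = 256), while the kernel name and `float2` hit encoding are assumptions:

```cuda
// One block per group of 32 consecutive timeframes, one thread per frame.
__global__ void process_group(const float2* __restrict__ hits,
                              int max_hits, int n_planes)
{
    int frame = threadIdx.x;  // 0..31: one timeframe per thread
    const float2* group = hits + blockIdx.x * n_planes * max_hits * 32;

    for (int plane = 0; plane < n_planes; ++plane) {
        for (int h = 0; h < max_hits; ++h) {
            // The same (plane, hit slot) for 32 consecutive frames sits in
            // 32 consecutive positions -> the 32 threads load it coalesced.
            float2 hit = group[(plane * max_hits + h) * 32 + frame];
            // ... feed hit into the track search / straight-line fit ...
            (void)hit;
        }
    }
}
```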
Setup - DESY Testbeam
planned:
[Diagram: MuPix → FPGA (merging and sorting) → PCIe → RAM → GPU]
implemented:
[Diagram: MuPix → FPGA → PCIe → RAM → sorting on CPU → GPU]
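With the FPGA sorting firmware not yet in use, the ordering by plane and time apparently runs on the host before the GPU transfer. A minimal sketch of such a step; the `Hit` record and packed key are illustrative, and the final coalesced layout additionally interleaves the 32 frames of a group:

```cuda
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical hit record, as in the DMA sketch above.
struct Hit { uint8_t plane; uint16_t col, row; uint32_t frame; };

// Plane is the slowest-varying part of the target layout, time the fastest.
static uint64_t sort_key(const Hit& h) {
    return (uint64_t(h.plane) << 32) | h.frame;
}

void sort_hits(std::vector<Hit>& hits) {
    std::sort(hits.begin(), hits.end(),
              [](const Hit& a, const Hit& b) {
                  return sort_key(a) < sort_key(b);
              });
}
```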
Results - DESY Testbeam
[Histograms: track residuals (res1_x) [µm] for CPU and GPU reconstruction; each with entries 636869, mean 5.006, RMS 33.35]
Results - DESY Testbeam
[Histogram: residual of residuals (difference between CPU and GPU residuals) [µm]; entries 636869, mean 1.312×10⁻⁴, RMS 1.586×10⁻⁴]
• deviation < 1 nm
• small bias towards larger CPU values
• caused by execution differences between CPU and GPU (e.g. floating point precision, see backup)
Summary and Outlook
• GPU tracking implemented
• DMA working up to 1.5 GB s⁻¹
• offline tracking on GPU gives reasonable results
To do:
• finally test the FPGA firmware
• use GPU tracking online
• optimize the GPU code
Acknowledgments
The measurements leading to these results have been performed at the Test Beam Facility at DESY Hamburg (Germany), a member of the Helmholtz Association (HGF).
Backup
• memory-bound GPU kernels → 32 bit floating point
• IEEE 754 floating point arithmetic
• GPU uses Fused Multiply-Add (FMA), see the sketch below
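A self-contained sketch of why FMA contraction shifts results at the ULP level; the numbers are illustrative, not taken from the measurement (compile the host version with -ffp-contract=off so the compiler does not fuse the "separate" line itself):

```cuda
#include <cmath>
#include <cstdio>

int main() {
    // x = 1 + 2^-12, exactly representable in float
    float x = 1.0f + ldexpf(1.0f, -12);

    // x*x = 1 + 2^-11 + 2^-24 exactly; rounding the product to float
    // drops the 2^-24 term, so multiply-then-add yields exactly 2^-11.
    float separate = x * x - 1.0f;

    // The fused multiply-add rounds only once and keeps the 2^-24 term,
    // which is what the GPU compiler generates for a*b + c by default.
    float fused = fmaf(x, x, -1.0f);

    printf("separate: %.10g\nfused:    %.10g\ndiff:     %g\n",
           separate, fused, fused - separate);   // diff = 2^-24 ≈ 6e-8
    return 0;
}
```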
[2D histogram, residual vs value: χ²(GPU) − χ²(double) against χ²(double); entries 1310720, mean x 11.43, mean y 5.017×10⁻⁷, RMS x 8.625, RMS y 9.745×10⁻⁷]
Backup
[Histogram: (chi2_float − chi2_double)/chi2_float for chi2_float > 0; entries 999799, mean 0.02322, RMS 0.1488]
• IEEE 754: float ULP ≈ 10⁻⁷
Backup
[Histogram: χ² distribution; entries 626660, mean 56.56, RMS 146.1]
[Histogram: χ² distribution with selection chi2 == chi2 (removes NaN entries); entries 636869, mean 56.58, RMS 146.2]