
CUDA provides access to different memory spaces, which differ in purpose, scope, and access characteristics. We give a brief introduction to each memory space. Figure 3.2 gives an overview of the scope of the memory spaces and their hierarchical relationship with caches and device memory.

Device and Global Memory Device memory is the off-chip main memory of a GPU. Essentially every CUDA kernel works at least on device memory, as it is the only memory area through which data can be exchanged with the host. While it is possible to directly read from and write to system memory from the GPU, this is undesirable, as the bandwidth through the PCI Express bus is about two orders of magnitude lower than for device memory accesses.

From the perspective of a grid, device memory is also called global memory, as it is globally readable and writable by all threads in the grid.
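The following listing is a minimal sketch of this pattern (kernel name and sizes are only illustrative): input data is copied from the host into global memory via the CUDA runtime API, processed by a kernel, and copied back, since this is the only way for a kernel to exchange data with the host.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical kernel: every access to data[] goes to global (device) memory.
__global__ void scale(float* data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

int main()
{
    const int n = 1 << 20;
    float* h_data = new float[n];
    for (int i = 0; i < n; ++i) h_data[i] = float(i);

    float* d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));                             // allocate device (global) memory
    cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);

    scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);

    cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("%f\n", h_data[42]);

    cudaFree(d_data);
    delete[] h_data;
    return 0;
}
```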

The first generation of CUDA-enabled GPUs had very strict so-called coalescing rules for efficient global memory access, which we will not discuss here. All following generations have an L2 cache and some have an L1 cache, both of which have much simpler coalescing rules.


Figure 3.3: Depiction of a warp accessing global memory via the L1 cache (red) or the L2 cache (green). The L1 cache access results in two 128-byte transactions, as the warp accesses two 128-byte L1 cache lines; on a miss, 256 bytes would have to be loaded from the L2 cache. The L2 cache access results in four 32-byte transactions, as the warp accesses four 32-byte L2 cache lines. (Based on [NVIDIA 2017a], Figure 16)

The presence of the L1 cache varies between GPU series and also between models within a series. Global memory is divided into segments of 32 bytes for the L2 cache and 128 bytes for the L1 cache. When the threads in a warp access global memory, the access is simply split into as many memory transactions as different segments are accessed. Figure 3.3 depicts this for L1 and L2 accesses. In the worst case, each thread in a warp accesses a different segment, which results in 32 transactions. Thus, to keep the number of transactions low, threads in a warp should access nearby addresses, i.e., related data should be stored close together.
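The two hypothetical kernels below illustrate the difference: in the first, the 32 threads of a warp read 32 consecutive floats, i.e. a single 128-byte L1 cache line (or four 32-byte L2 segments); in the second, a stride of 32 floats makes every thread touch a different 128-byte segment, so the warp access is split into 32 transactions.

```cuda
// Coalesced access: consecutive threads read consecutive addresses,
// so a warp touches one 128-byte L1 line (or four 32-byte L2 segments).
__global__ void copy_coalesced(const float* __restrict__ in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Strided access: with stride = 32 floats (128 bytes), every thread of a
// warp touches a different segment, resulting in 32 memory transactions.
__global__ void copy_strided(const float* __restrict__ in, float* out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n)
        out[i * stride] = in[i * stride];
}
```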

Local and Constant Memory Local memory and constant memory are two additional memory types which reside in device memory. Local memory derives its name from the fact that it has thread-local scope. It is mainly used for register spilling if a kernel uses too many registers, or for thread-scope arrays which have no static access pattern. The thread-local traversal stack used in the GPU ray tracing kernels from Aila and Laine [2009], for example, ends up in local memory, as it is accessed in an unpredictable manner. Constant memory has a designated cache, which is optimized for multiple simultaneous 4-byte accesses to the same address. Thus, it is meant for constant data that is needed by several threads at the same time. Simultaneous accesses to multiple addresses are serialized into the number of different addresses.
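As a rough sketch (names and sizes are hypothetical), the kernel below combines both cases: a __constant__ coefficient table that all threads read at the same address, which the constant cache serves as a single broadcast, and a per-thread stack whose data-dependent indexing typically prevents it from being kept in registers, so the compiler places it in local memory.

```cuda
__constant__ float coeff[16];        // constant memory, written from the host via cudaMemcpyToSymbol

__global__ void eval(const float* x, float* y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float stack[32];                 // dynamically indexed per-thread array -> typically placed in local memory
    int top = 0;

    // Data-dependent pushes: the access pattern is not static, so the array
    // cannot be mapped to registers. All threads read the same coeff[k]
    // simultaneously, which the constant cache serves as one broadcast.
    for (int k = 0; k < 16 && top < 32; ++k)
        if (x[i] > coeff[k])
            stack[top++] = x[i] - coeff[k];

    float acc = 0.0f;
    while (top > 0)
        acc += stack[--top];
    y[i] = acc;
}
```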

Texture Memory Texture memory is the last type of memory that resides in device memory. It allows 1-, 2-, or 3-dimensional indices for addressing, with optimized performance for lookups in a 2D or 3D neighborhood. For the last two variants the input data first has to be converted into a CUDA Array, which stores the data in an optimized, opaque, proprietary memory layout. All NVIDIA GPUs access texture memory via an additional dedicated read-only L1 cache. The CUDA programming guide is unspecific regarding optimal texture memory access patterns. The only hint is that “[if] the memory reads do not follow the access patterns that global or constant memory reads must follow to get good performance, higher bandwidth can be achieved providing that there is locality in the texture fetches” [NVIDIA 2017a]. As we will see in Chapter 7, Section 7.2.1, our experiments with certain access patterns, which are bad for either global or for both global and shared memory, reveal an almost equal performance for texture memory with either access pattern.

Figure 3.4: Depiction of the organization of shared memory into banks and of access patterns in a warp which cause bank conflicts. For clarity, the warp size and the number of banks are reduced to 16. Consecutive 4-byte words are periodically assigned to the memory banks. Threads 3 and 4 cause a two-way bank conflict as they access two different words in the same bank. Threads 7, 8, 9, and 10 cause a three-way bank conflict as they access three different words in the same bank. (Based on [NVIDIA 2017a], Figure 18)

Owing to its computer graphics origins, texture memory provides some additional hardware features such as different addressing modes, value interpolation, and unpacking of specially stored data. These features are not important in the context of this dissertation.
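For illustration, the sketch below uses the texture object API of the CUDA runtime; it is not the setup used for the experiments in Chapter 7, and the function and variable names are hypothetical. The input is copied into a CUDA array and wrapped in a texture object, and the kernel performs 2D-addressed fetches through the read-only texture cache.

```cuda
#include <cuda_runtime.h>

// Hypothetical 2D lookup kernel: fetches go through the dedicated texture cache.
__global__ void sample(cudaTextureObject_t tex, float* out, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < w && y < h)
        out[y * w + x] = tex2D<float>(tex, x + 0.5f, y + 0.5f);   // 2D-addressed fetch
}

// Host-side setup: copy linear data into a CUDA array (opaque, texture-optimized
// layout) and create a texture object referencing it.
cudaTextureObject_t makeTexture(const float* h_img, int w, int h)
{
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaArray_t array;
    cudaMallocArray(&array, &desc, w, h);
    cudaMemcpy2DToArray(array, 0, 0, h_img, w * sizeof(float),
                        w * sizeof(float), h, cudaMemcpyHostToDevice);

    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeArray;
    res.res.array.array = array;

    cudaTextureDesc texDesc = {};
    texDesc.addressMode[0] = cudaAddressModeClamp;   // addressing mode for out-of-range coordinates
    texDesc.addressMode[1] = cudaAddressModeClamp;
    texDesc.filterMode     = cudaFilterModePoint;    // no interpolation, return the stored value
    texDesc.readMode       = cudaReadModeElementType;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &res, &texDesc, nullptr);
    return tex;
}
```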

Shared Memory Shared memory is located on-chip and as such has “much higher bandwidth and much lower latency than local or global memory” [NVIDIA 2017a]. It has block scope and is intended to be shared between the threads in a block in a cooperative way. The amount of available shared memory is only a few tens of kilobytes per multiprocessor. It is organized in 32 banks, which can be accessed simultaneously. Consecutive 4-byte words are periodically assigned to the 32 banks; that is, shared memory addresses which are 128 bytes apart are assigned to the same bank. The bandwidth of each bank is one 4-byte word per clock. Multiple threads are allowed to access different banks or the same bank. When several threads access different words which are assigned to the same bank, a bank conflict occurs. In this case the accesses to the different words have to be serialized, which effectively reduces the instruction throughput. Thus, efficient shared memory usage aims at reducing the number of bank conflicts. Figure 3.4 depicts the shared memory organization and conflicting access patterns. Common strided access patterns of the form threadIdx*stride cause bank conflicts if stride has common divisors with the number of banks and should be avoided if possible. The worst case is a stride equal to the number of banks itself, which results in a 32-way bank conflict.
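A classic illustration is a tile transpose through shared memory, sketched below under the assumption of 32×32 thread blocks and a square matrix whose side is a multiple of 32: without the extra padding column, the column-wise reads in the second phase would map all 32 threads of a warp to the same bank, i.e. a 32-way bank conflict; the padding shifts consecutive rows to different banks and makes both phases conflict-free.

```cuda
#define TILE 32

// Assumes a launch with dim3(TILE, TILE) threads per block and a square
// matrix of side `width`, where width is a multiple of TILE.
__global__ void transpose(const float* in, float* out, int width)
{
    __shared__ float tile[TILE][TILE + 1];   // +1 column of padding avoids bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];    // row-wise store, conflict-free

    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    out[y * width + x] = tile[threadIdx.x][threadIdx.y];   // column-wise read, conflict-free due to padding
}
```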