
class    n                processed by            remarks

Large    n > 2048         #SMs blocks/node/dim.   geometry parallelism
Medium   256 < n ≤ 2048   block/node/dim.         node parallelism
Small    32 < n ≤ 256     warp/node/dim.          implicit SIMD synchr.
Tiny     16 < n ≤ 32      warp/node               exact SAH
Micro    8 < n ≤ 16       half warp/node          exact SAH
Nano     n ≤ 8            quarter warp/node       exact SAH

Table 8.2: The different node classes along with their triangle limits and parallelization for BVH single-job processing. Each class is processed by a dedicated kernel. The kd-tree implementation uses a similar classification.
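As a small illustration, the thresholds of Table 8.2 can be expressed as a classification helper along the following lines (the enum and function names are hypothetical, not taken from the implementation):

// Map a node's primitive count n to one of the six classes of Table 8.2.
enum NodeClass { NODE_LARGE, NODE_MEDIUM, NODE_SMALL, NODE_TINY, NODE_MICRO, NODE_NANO };

__host__ __device__ inline NodeClass classifyNode(unsigned int n)
{
    if (n > 2048) return NODE_LARGE;    // several blocks per node and dimension
    if (n > 256)  return NODE_MEDIUM;   // one block per node and dimension
    if (n > 32)   return NODE_SMALL;    // one warp per node and dimension
    if (n > 16)   return NODE_TINY;     // one warp per node, exact SAH
    if (n > 8)    return NODE_MICRO;    // half warp per node, exact SAH
    return NODE_NANO;                   // quarter warp per node, exact SAH
}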

data types, triangle bounds are stored in order-preserving integer format [Terdiman 2000]. After a block has iterated over all its triangles, it atomically combines its results with those from other blocks in global memory. For code simplicity, a block performs binning only in a single dimension. Thus, a set of blocks is generated for each dimension and we only require a single set of bins per dimension. This reduces the required auxiliary memory to a constant 5.25 KB with our 64 bins per dimension. The resulting minimum number of blocks for multi-job binning is B_MJ = 3 · N_SM. Depending on the chosen block size and kernel resource usage, integer multiples of B_MJ have to be used to maximize occupancy. After binning, the results of all chunks are combined and the best split plane is determined with parallel prefix sum and reduction.
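A minimal sketch of how a block might atomically merge its per-dimension bins into the single global set of bins, assuming the integer-encoded bounds mentioned above (struct layout and names are illustrative; a 28-byte bin as below is consistent with the 5.25 KB figure for 3 × 64 bins):

struct Bin {                      // one SAH bin; bounds in order-preserving integer format
    int minX, minY, minZ;
    int maxX, maxY, maxZ;
    unsigned int count;
};

// After a block has binned its triangles into shared-memory bins, merge them
// atomically into the single global set of bins for this binning dimension.
__device__ void mergeBins(Bin* globalBins, const Bin* sharedBins, int numBins)
{
    for (int b = threadIdx.x; b < numBins; b += blockDim.x) {
        atomicMin(&globalBins[b].minX, sharedBins[b].minX);
        atomicMin(&globalBins[b].minY, sharedBins[b].minY);
        atomicMin(&globalBins[b].minZ, sharedBins[b].minZ);
        atomicMax(&globalBins[b].maxX, sharedBins[b].maxX);
        atomicMax(&globalBins[b].maxY, sharedBins[b].maxY);
        atomicMax(&globalBins[b].maxZ, sharedBins[b].maxZ);
        atomicAdd(&globalBins[b].count, sharedBins[b].count);
    }
}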

The next step is the distribution of primitive references to the new children. We explicitly dedicate blocks to distributing primitives only to the left or only to the right side.

Each block compacts its triangles and writes them to a position specified by atomic counters for both children. The resulting number of blocks for splitting is S_MJ = 2 · ⌈N_chunk primitives / S_B⌉.

In the spirit of the GPU kd-tree construction approach from Danilewski et al. [2010], a single-job uses different specialized kernels for binning and partitioning depending on the number of primitives in a node. This adapts to the shift from primitive parallelism to node parallelism by changing the mapping of threads to primitives and nodes. Danilewski et al. [2010] execute their different specializations in stages. Each stage works on the complete set of nodes. When threads responsible for a node detect that the node has the wrong primitive count they immediately return, causing unnecessary overhead. Stages which map nodes to warps are especially inefficient because they lose effective occupancy when warps in a block partially return if they have the wrong node primitive count.
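As a rough illustration of the distribution step described above, a block dedicated to the left child of a chunk might perform its compaction roughly as follows (a block size of 256, a centroid-based side test, and all array and counter names are hypothetical simplifications of the actual kernels):

// One block per chunk and child side (left shown): test each primitive of the
// chunk against the split plane, compact matching references in shared memory
// and reserve output space in the child's array with one atomic per block.
__global__ void distributeLeft(const int* primRefs, const float* centroids,
                               int numPrims, float splitPos, int splitDim,
                               int* leftRefs, unsigned int* leftCounter)
{
    __shared__ int          localRefs[256];
    __shared__ unsigned int localCount;
    __shared__ unsigned int baseOffset;

    if (threadIdx.x == 0) localCount = 0;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numPrims && centroids[3 * i + splitDim] <= splitPos) {
        unsigned int slot = atomicAdd(&localCount, 1u);   // compact within the block
        localRefs[slot] = primRefs[i];
    }
    __syncthreads();

    if (threadIdx.x == 0)                                  // reserve space for this block
        baseOffset = atomicAdd(leftCounter, localCount);
    __syncthreads();

    for (unsigned int j = threadIdx.x; j < localCount; j += blockDim.x)
        leftRefs[baseOffset + j] = localRefs[j];
}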

To avoid these problems we analyze the current set of nodes and classify each node w.r.t. its number of primitives into six different classes. For each class a compact list of node IDs is extracted, which is then processed by the corresponding specialized implementation. We use a block size of 256 for all binning and partitioning specializations. The different classes along with their triangle limits and parallelization are depicted in Table 8.2. Large-nodes still offer enough geometry parallelism that they can essentially be processed by several multiprocessors like a chunk of a multi-job, but with a set of blocks for every node. With Medium-nodes parallelism slowly changes to node parallelism, where a single block per dimension performs binning on a node in a couple of iterations over the geometry in the node. With Small-nodes effective occupancy would start to decrease, as there would be fewer primitives than threads in a block, which causes warps in a block not to be assigned to any primitives.


Thus, Small-nodes switch to mapping warps in a block to nodes and let warps iterate over the triangles of a node for binning in a persistent-warps manner. We also take advantage of the implicit synchronization of threads in a warp by omitting block synchronization primitives, which greatly increases performance. With fewer than 32 primitives per node the Small-node approach starts to suffer from decreasing SIMD efficiency. The last three node classes Tiny, Micro, and Nano account for this by partitioning a warp into sub-warps which each process their own node, keeping more lanes busy. Similar to Danilewski et al. [2010] for kd-trees, we noticed that initialization of all bins and performing scans on the bin data dominates computation time for such small node primitive counts. This is even more severe for BVHs, as BVH bins store 3.5 times as much information as kd-tree bins. Thus, like Danilewski et al. [2010], we switch to exact SAH computation for such small nodes by letting each thread iterate over all primitives. It also turned out to be beneficial to directly handle all three candidate dimensions.
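A schematic sketch of such an exact SAH evaluation, where every lane of a (sub-)warp evaluates one candidate split in one dimension by iterating over all primitives of the node, might look as follows; the AABB type, its helpers and the function names are illustrative, not the actual implementation:

#include <cfloat>
#include <cuda_runtime.h>

struct Aabb { float3 lo, hi; };      // minimal hypothetical AABB type

__device__ inline Aabb emptyAabb() {
    return { make_float3( FLT_MAX,  FLT_MAX,  FLT_MAX),
             make_float3(-FLT_MAX, -FLT_MAX, -FLT_MAX) };
}
__device__ inline Aabb mergeAabb(Aabb a, Aabb b) {
    return { make_float3(fminf(a.lo.x, b.lo.x), fminf(a.lo.y, b.lo.y), fminf(a.lo.z, b.lo.z)),
             make_float3(fmaxf(a.hi.x, b.hi.x), fmaxf(a.hi.y, b.hi.y), fmaxf(a.hi.z, b.hi.z)) };
}
__device__ inline float surfaceArea(Aabb a) {
    float dx = a.hi.x - a.lo.x, dy = a.hi.y - a.lo.y, dz = a.hi.z - a.lo.z;
    return 2.0f * (dx * dy + dx * dz + dy * dz);
}

// Exact SAH for a node with at most 32 primitives: each lane of the (sub-)warp
// evaluates one candidate split by iterating over all primitives of the node.
__device__ float evalExactSah(const float3* centroids, const Aabb* bounds,
                              int n, int lane, int dim, float* outPos)
{
    if (lane >= n) return FLT_MAX;                      // inactive lanes report infinite cost
    float splitPos = ((const float*)&centroids[lane])[dim];

    Aabb leftBox = emptyAabb(), rightBox = emptyAabb();
    int  leftCnt = 0, rightCnt = 0;
    for (int i = 0; i < n; ++i) {                       // every lane scans all primitives
        if (((const float*)&centroids[i])[dim] <= splitPos) {
            leftBox = mergeAabb(leftBox, bounds[i]);  ++leftCnt;
        } else {
            rightBox = mergeAabb(rightBox, bounds[i]); ++rightCnt;
        }
    }
    *outPos = splitPos;
    return surfaceArea(leftBox) * (float)leftCnt + surfaceArea(rightBox) * (float)rightCnt;
}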

With additional effort it should be possible to introduce more node classes for more differentiated levels of parallelism.

8.4.2 kd-Tree Implementation

Implementation of multi- and single-job processing is analogous to the BVH case with the different node classes. The major difference is that we have to count enter and exit events of primitive references, which is simpler in terms of computation and memory consumption than growing bin bounds in BVH construction. The primitive distribution step is more involved, as we also actually have to split triangles to compute clipped primitive bounds. A straightforward implementation requires a couple of nested conditional statements with computational complexity in their bodies, which causes poor SIMD efficiency due to high thread divergence and also results in high register usage that additionally reduces occupancy. Empirically, most primitives in a node do not straddle the split plane.

According to Wald and Havran [2006], "for reasonable scenes[1], there will be (at most) O(√N) triangles overlapping" the split plane. Thus, our approach of choice is to exploit this by splitting primitive distribution into two phases. The first phase copies all primitive references and triangles to their respective side of the split plane, including references which have to be split. Duplicate primitives are explicitly stored compactly at the ends of the arrays of each side. Without the primitive splitting code this kernel is essentially a memory copy kernel, which has high occupancy due to its simplicity. In the second phase the highly divergent and inefficient primitive bounds splitting kernel is only executed on the few compacted duplicate primitives. Performance of splitting increased by almost one order of magnitude with this approach compared to the branching version.
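A condensed sketch of the first phase (routing whole references and deferring straddling ones to compact duplicate regions) could look as follows; the actual kernel also copies the triangles themselves and compacts per block, so the per-thread atomics and names here are a deliberate simplification:

// Phase 1: route each primitive reference. References that lie entirely on one
// side are a plain copy; straddling references are appended to compact
// duplicate regions of both sides and only those are clipped in phase 2.
__global__ void distributeRefs(const int*   refs,
                               const float* boundsMin,   // per-reference min in split dimension
                               const float* boundsMax,   // per-reference max in split dimension
                               int numRefs, float splitPos,
                               int* leftRefs,  unsigned int* leftCount,
                               int* rightRefs, unsigned int* rightCount,
                               int* leftDup,   unsigned int* leftDupCount,
                               int* rightDup,  unsigned int* rightDupCount)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numRefs) return;

    if (boundsMax[i] <= splitPos) {                       // entirely left of the plane
        leftRefs[atomicAdd(leftCount, 1u)] = refs[i];
    } else if (boundsMin[i] >= splitPos) {                // entirely right of the plane
        rightRefs[atomicAdd(rightCount, 1u)] = refs[i];
    } else {                                              // straddles: duplicate, clip later
        leftDup [atomicAdd(leftDupCount, 1u)]  = refs[i];
        rightDup[atomicAdd(rightDupCount, 1u)] = refs[i];
    }
}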

8.4.3 Out-of-Core Work and Data Management

CUDA kernels are grouped in a so-called task. Data dependencies of kernels are registered with the task. Multi-jobs have tasks for binning, combination of binning results, and chunk splitting. A whole single-job is mapped to one task.
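Purely as an illustration of this grouping (all field names are hypothetical and not taken from the implementation), a task might carry little more than its kernel launches and the buffers they depend on:

#include <cstddef>
#include <functional>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical sketch of the task abstraction: a group of kernel launches plus
// the data dependencies that must reside in GPU memory before they can run.
struct DataDependency {
    void*  hostCopy;        // backing copy in page-locked system memory
    void*  deviceCopy;      // pointer into some GPU's memory pool, or nullptr
    size_t sizeInBytes;
    int    residentDevice;  // device currently holding the data, -1 if host only
};

struct Task {
    std::vector<std::function<void(cudaStream_t)>> kernels;  // launches, run in order
    std::vector<DataDependency*> dependencies;               // resolved before execution
    int assignedDevice = -1;
};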

A GPU device processes at most two tasks at a time. One task executes its kernels while the other resolves its data dependencies. Devices in need of work register at a task scheduler. A good scheduling strategy aims at reducing host-to-GPU, GPU-to-host, and GPU-to-GPU memory transactions. At the same time, GPUs have to be kept as busy as possible. Our chosen task scheduling strategy for a requesting device is as follows: the scheduler iterates over all available tasks and determines for each task the most suitable device. The primary deciding factor for suitability is the number of already resolved data dependencies. The second factor is the number of tasks currently processed by a device.

A requesting device is only assigned tasks it is the most suitable device for. Thus, we intentionally assign no task if there are other more suitable devices for the available tasks.

If a task is equally suitable for all devices, it is assigned to the requesting device. Though this seems subpar at first, it proved to be a good strategy as it trades some idle time for a significant reduction in memory transfers.
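A schematic version of this selection, with the hypothetical Task type from the sketch above, a hypothetical Device record, and a placeholder for the dependency lookup, might read as follows:

#include <vector>

struct Device { int activeTasks; };   // hypothetical per-GPU bookkeeping

// Hypothetical helper: number of the task's data dependencies that already
// reside in the memory pool of the given device (implementation omitted).
int resolvedDependencies(const Task& task, int device);

// Pick the task (if any) to hand to the requesting device: for every available
// task determine the most suitable device and assign the task only if the
// requester is that device; ties go to the requester.
int pickTask(const std::vector<Task>& tasks, const std::vector<Device>& devices,
             int requester)
{
    for (int t = 0; t < (int)tasks.size(); ++t) {
        int best = requester;
        for (int d = 0; d < (int)devices.size(); ++d) {
            if (d == requester) continue;
            int rBest = resolvedDependencies(tasks[t], best);   // primary factor
            int rD    = resolvedDependencies(tasks[t], d);
            if (rD > rBest ||                                    // more data already resident
                (rD == rBest && devices[d].activeTasks < devices[best].activeTasks))
                best = d;                                        // secondary: less loaded
        }
        if (best == requester) return t;   // requester is (among) the most suitable
    }
    return -1;                             // idle on purpose: others fit better
}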

GPU memory allocation functions in CUDA cause overhead and, even worse, synchronize the GPU. To avoid these issues we allocate a large self-managed GPU memory pool for each GPU on startup. Allocations are performed in a first-fit manner and are evicted to system memory with an LRU strategy. Data dependencies belonging to the two tasks a device can process at a time are protected from eviction. In the rare case that no memory can be allocated due to external fragmentation, a defragmentation step is performed.
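A condensed sketch of first-fit allocation from such a pre-allocated pool (eviction, LRU bookkeeping and defragmentation omitted; types and names are illustrative):

#include <cstddef>
#include <iterator>
#include <list>

struct PoolBlock { size_t offset, size; bool free; };

// First-fit allocation from the pre-allocated per-GPU pool. If nothing fits,
// the caller evicts least-recently-used allocations to system memory and
// retries; if external fragmentation still blocks the request, a
// defragmentation pass is run.
void* poolAlloc(char* poolBase, std::list<PoolBlock>& blocks, size_t bytes)
{
    for (auto it = blocks.begin(); it != blocks.end(); ++it) {
        if (!it->free || it->size < bytes) continue;
        if (it->size > bytes) {                          // split off the free remainder
            blocks.insert(std::next(it),
                          PoolBlock{ it->offset + bytes, it->size - bytes, true });
            it->size = bytes;
        }
        it->free = false;
        return poolBase + it->offset;
    }
    return nullptr;                                      // caller evicts / defragments
}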

When resolving dependencies, data that resides in other GPUs’ memory pools is directly and asynchronously copied via peer-to-peer GPU copy. This avoids expensive round trips through system memory. For transactions from or to system memory to be asynchronous, CUDA requires the involved system memory to be page-locked. Allocation of page-locked memory has a much higher overhead than GPU memory allocation and also causes synchronization. Again, we avoid these issues by allocating a huge self-managed memory pool of page-locked memory on startup.
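The CUDA calls involved are the standard ones; a minimal sketch of the startup allocations and of a direct peer-to-peer transfer, with error handling and concrete pool sizes omitted, might look like this:

#include <cstddef>
#include <cuda_runtime.h>

// Startup: one large device pool per GPU plus one page-locked host pool, so no
// cudaMalloc / cudaHostAlloc calls (and their implicit synchronization) are
// needed during construction.
void initPools(int dev, size_t devicePoolBytes, size_t pinnedPoolBytes,
               void** devicePool, void** pinnedPool)
{
    cudaSetDevice(dev);
    cudaMalloc(devicePool, devicePoolBytes);
    cudaHostAlloc(pinnedPool, pinnedPoolBytes, cudaHostAllocPortable);
}

// Resolving a dependency that currently resides on another GPU: copy it
// directly and asynchronously, avoiding a round trip through system memory.
void copyFromPeer(void* dst, int dstDev, const void* src, int srcDev,
                  size_t bytes, cudaStream_t stream)
{
    cudaSetDevice(dstDev);
    cudaDeviceEnablePeerAccess(srcDev, 0);   // returns an error (only) if already enabled
    cudaMemcpyPeerAsync(dst, dstDev, src, srcDev, bytes, stream);
}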