To the best of our knowledge, complete SAH-based out-of-core BVH construction has not been done on GPUs. We argue that it is nevertheless worth the effort, as the higher tree quality results in fewer memory loads in the final application due to better separation of geometry. This matters even more for out-of-core rendering. Although kd-trees are typically larger than BVHs, they can outperform them. However, out-of-core kd-tree construction has not been addressed on either CPUs or GPUs. The main reason is their unbounded memory footprint, which is problematic in an out-of-core context.

We show that full SAH-based top-down out-of-core BVH and even kd-tree construction can be done in a way that efficiently exploits the massive parallelism of GPUs. Apart from handling the huge data itself, the biggest challenge is the significant overhead introduced by memory copy operations between host and GPU. Memory copies drastically reduce computational throughput and introduce additional synchronization requirements between the two sides. This becomes an even more serious bottleneck when the system is extended to multiple GPUs.

Our contributions are as follows:

• an efficient out-of-core multi-GPU algorithm for BVH and kd-tree construction that allows the memory footprint of the output tree as well as the geometry to exceed graphics memory,

• a construction that applies SAH right from the beginning and does not rely on quality-degrading pre-clustering of geometry, and

• an SAH improvement threshold that allows rendering performance to be traded for a reduced acceleration structure memory footprint and construction time in a controllable way.

8.1 Related Work

8.1.1 kd-Trees

Zhou et al. [2008] proposed the first GPU kd-tree construction algorithm, which achieved real-time build times for small scenes. It applies a hybrid construction strategy that uses cheap spatial median splits in the upper levels and switches to expensive SAH splits as soon as a node contains at most 64 triangles. The poor splitting decisions made in most of the upper part of the tree cannot be corrected or compensated for in the last 1 to 6 levels.

As a result, the negative effect of spatial median splitting on tree quality increases with scene size. Wu et al. [2011] proposed an efficient GPU implementation and Choi et al. [2010] proposed an efficient multi-core implementation of full SAH construction. Full SAH algorithms involve sorting the complete data several times for each dimension.

Popov et al. [2006] proposed binned kd-tree construction, which does not require sorting. This approach uses a fixed number of equidistant split planes to sample the SAH cost function at discrete positions. It allows for much faster implementations with negligible deterioration of tree quality. Danilewski et al. [2010] presented an efficient single-GPU implementation of binned SAH kd-tree construction. All steps are implemented in five different variations (stages). Each stage is optimized for a distinct amount of geometry per node and number of such nodes in a tree level. Only one stage is computed at a time. Thus, nodes that are classified for a stage other than the current one are scheduled for later processing. Scheduling details and overhead are not discussed, but the authors state that their implementation is faster than the lower-quality hybrid construction of Zhou et al. [2008].
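The binning idea described above can be illustrated with a minimal CPU sketch in Python. It is not the implementation of Popov et al. or Danilewski et al.; the function names, the default of 16 bins, and the cost constants `c_trav` and `c_isect` are assumptions chosen for illustration. Primitives are represented by their bounding boxes, and the SAH cost is evaluated only at the equidistant bin boundaries along one axis, using per-bin counts of where primitives start and end instead of sorting.

```python
from typing import List, Optional, Tuple

Vec3 = Tuple[float, float, float]
Box = Tuple[Vec3, Vec3]  # (min corner, max corner)


def surface_area(lo, hi) -> float:
    """Surface area of an axis-aligned box."""
    dx, dy, dz = hi[0] - lo[0], hi[1] - lo[1], hi[2] - lo[2]
    return 2.0 * (dx * dy + dy * dz + dz * dx)


def binned_sah_split(node: Box, prims: List[Box], axis: int,
                     num_bins: int = 16, c_trav: float = 1.0,
                     c_isect: float = 1.0) -> Tuple[float, Optional[float]]:
    """Return (best cost, split position) among equidistant kd-tree planes.

    Hypothetical helper: num_bins, c_trav and c_isect are illustrative
    assumptions, not values taken from the cited papers.
    """
    lo, hi = node
    extent = hi[axis] - lo[axis]
    parent_area = surface_area(lo, hi)
    if extent <= 0.0 or parent_area <= 0.0 or num_bins < 2:
        return float("inf"), None

    def bin_of(v: float) -> int:
        # Map a coordinate to its bin index, clamped to the valid range.
        return min(max(int((v - lo[axis]) / extent * num_bins), 0), num_bins - 1)

    # One pass over the primitives: count where each one starts and ends.
    starts = [0] * num_bins
    ends = [0] * num_bins
    for p_lo, p_hi in prims:
        starts[bin_of(p_lo[axis])] += 1
        ends[bin_of(p_hi[axis])] += 1

    best_cost, best_pos = float("inf"), None
    n_left, n_right = 0, len(prims)
    for k in range(1, num_bins):  # candidate plane between bin k-1 and bin k
        n_left += starts[k - 1]   # primitives starting left of the plane
        n_right -= ends[k - 1]    # primitives ending left of the plane drop out
        pos = lo[axis] + extent * k / num_bins
        left_hi = list(hi)
        left_hi[axis] = pos
        right_lo = list(lo)
        right_lo[axis] = pos
        cost = c_trav + c_isect * (
            surface_area(lo, left_hi) * n_left
            + surface_area(right_lo, hi) * n_right
        ) / parent_area
        if cost < best_cost:
            best_cost, best_pos = cost, pos
    return best_cost, best_pos
```

A primitive that straddles a plane is counted on both sides, which mirrors the reference duplication that makes kd-tree memory consumption hard to bound. A GPU version would compute the per-bin counts and the running sums in parallel rather than in Python loops.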

8.1.2 BVHs

The first efficient GPU algorithms for BVH construction were proposed by Lauterbach et al. [2009]. They presented three approaches with different trade-offs between tree quality and construction time. The fastest algorithm, called linear BVH (LBVH), first assigns Morton codes to triangles. The triangles are then sorted by their codes using an efficient parallel radix sort. The whole BVH can then be extracted from the sorted Morton codes by interpreting them as coordinates in an octree. This simple construction roughly corresponds to spatial median splitting, which results in poor tree quality but is fast to compute. The second algorithm is a parallel approach for full binned-SAH BVH construction. Tree quality is high but construction is much slower, especially since the approach lacks sufficient parallelism in the upper levels. To strike a balance, they propose a third algorithm that is a hybrid of the former two. The upper levels are constructed with the highly parallel first algorithm, while the remaining levels expose enough parallelism to be constructed efficiently with the second one. As a result, the output tree is of lower quality than a full SAH tree, as it suffers from the same problems as Zhou et al.’s approach. Exact SAH values are omitted, but the authors report tracing times close to full SAH for the hybrid algorithm and up to 7 times higher for LBVH.

Pantaleoni and Luebke [2010] and Garanzha et al. [2011] proposed much faster implementations of LBVH and of the hybrid algorithm, called HLBVH, which allow real-time rebuilds for scenes with up to 2 million triangles. A key change in the hybrid algorithm is that LBVH is used to build the lower levels of the tree first. The roots of the resulting subtrees are then used as input for binned top-down SAH BVH construction. Thus, the expensive part of the algorithm is performed on far fewer input elements, and tree quality is improved in the important upper levels. The authors report a tree quality about halfway between LBVH and full SAH.
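The first two LBVH steps described above, assigning Morton codes and sorting by them, can be sketched as follows. This is a simplified CPU illustration under stated assumptions, not the code of Lauterbach et al.: the function names are hypothetical, triangles are represented by their centroids, 10 bits per axis are used, and Python's built-in sort stands in for the parallel radix sort. Extracting the hierarchy from the sorted codes is not shown.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def morton3(ix: int, iy: int, iz: int, bits: int = 10) -> int:
    """Interleave the lowest `bits` bits of three integers (x is most significant)."""
    code = 0
    for b in range(bits):
        code |= ((ix >> b) & 1) << (3 * b + 2)
        code |= ((iy >> b) & 1) << (3 * b + 1)
        code |= ((iz >> b) & 1) << (3 * b)
    return code


def morton_order(centroids: Sequence[Vec3], scene_lo: Vec3, scene_hi: Vec3,
                 bits: int = 10) -> List[int]:
    """Return primitive indices sorted by the Morton code of their centroid.

    Hypothetical helper: quantizes each centroid to a 2^bits grid that spans
    the scene bounding box, then sorts by the interleaved code.
    """
    cells = 1 << bits

    def quantize(v: float, lo: float, hi: float) -> int:
        if hi <= lo:  # degenerate axis: everything falls into cell 0
            return 0
        return min(max(int((v - lo) / (hi - lo) * cells), 0), cells - 1)

    codes = [
        morton3(*(quantize(c[a], scene_lo[a], scene_hi[a]) for a in range(3)),
                bits=bits)
        for c in centroids
    ]
    # Stand-in for the parallel radix sort used on the GPU.
    return sorted(range(len(centroids)), key=codes.__getitem__)
```

Because the top three bits of each code select one of the eight octants of the scene box, the next three bits a sub-octant, and so on, a common code prefix corresponds to a common octree cell. This is why the resulting hierarchy behaves like spatial median splitting.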

8.1.3 Out-of-Core Construction

The GPU-based BVH and kd-tree construction techniques discussed above require both static scene geometry and transient data to fit into GPU memory. There is little work on out-of-core construction of ray tracing acceleration structures, especially in the context of GPUs. Wald et al. [2001a] roughly outlined a hypothetical out-of-core algorithm for kd-tree (called BSP-tree in the paper) construction involving several compute nodes with CPUs. The exact construction strategy is, however, not stated.

Baert et al. [2013] proposed an out-of-core CPU algorithm for regular voxelization and bottom-up sparse voxel octree construction of extremely large triangle meshes. By exploiting the relationship between Morton codes and octrees, they manage to be roughly as fast as an unoptimized in-core solution for in-core datasets with just 1 GB of available memory. Their concepts are not applicable to top-down SAH-based kd-tree or BVH construction.

Pantaleoni et al. [2010] proposed an out-of-core two-level BVH construction algorithm for complex scenes that runs entirely on the CPU. First, the scene geometry is divided into a regular 3D grid of buckets. Afterwards, geometry buckets are merged or split into chunks with respect to a specified target chunk size. For each chunk of geometry, a separate SAH-based BVH is constructed. Then, the chunks themselves are organized into a single top-level SAH-based BVH. Finally, the tree of each chunk is decomposed into a set of smaller treelets (bricks) and stored on disk.

Wang et al. [2013] presented a combined pre-processing and rendering approach for many-lights rendering of out-of-core scenes that uses GPUs in all steps. Acceleration structure construction is very similar to [Pantaleoni et al. 2010]. The main difference is how the initial chunks are determined. Each primitive is associated with a Morton code. The list of primitives is then sorted by code and partitioned into chunks of a specified target size. The resulting spatial clustering should be at least similar to that of [Pantaleoni et al. 2010]. Instead of an ordinary SAH-based BVH, a higher-quality SBVH [Stich et al. 2009] is constructed for each chunk. SBVHs additionally allow spatial splits to be applied adaptively during construction where beneficial. No efficient SBVH GPU implementation has been presented to date, so it is unfortunate that the authors omitted all details of their implementation. Again, the chunks themselves are organized in a single top-level BVH.

Both methods can be described as a combined bottom-up/top-down approach. This is not possible with kd-trees, as the placement of the root splitting plane is a global decision that affects all following steps due to triangle splits. Furthermore, both methods lead to reduced acceleration structure quality, as the initial chunking enforces a spatial-median-like distribution of triangles.
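The Morton-based chunking attributed above to Wang et al. can be sketched in a few lines. Since the authors give no implementation details, this is only one plausible reading: the function name is hypothetical, the Morton codes are assumed to be precomputed integers, and a fixed chunk size stands in for whatever size criterion the original uses.

```python
from typing import List, Sequence


def chunk_by_morton(codes: Sequence[int], target_size: int) -> List[List[int]]:
    """Sort primitive indices by Morton code and cut the list into chunks.

    Hypothetical helper: `codes[i]` is the precomputed Morton code of
    primitive i; every chunk except possibly the last has `target_size` entries.
    """
    if target_size < 1:
        raise ValueError("target_size must be positive")
    order = sorted(range(len(codes)), key=codes.__getitem__)
    return [order[i:i + target_size] for i in range(0, len(order), target_size)]
```

Each chunk then covers a contiguous range of the space-filling curve, so the chunk boundaries depend only on primitive positions and counts, not on an SAH cost. This is the source of the quality loss noted above.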

Finally, Hou et al. [2011] presented semi-out-of-core extensions to the GPU kd-tree construction approach of Zhou et al. [2008] and the hybrid BVH construction approach of Lauterbach et al. [2009]. Here, only the size of the tree is allowed to exceed graphics memory; the scene geometry has to fit into memory completely. As an extension of the two algorithms, their approach also inherits the inferior quality in the upper tree levels from which these algorithms suffer. They propose a partial-BFS order that processes as many nodes in parallel, in a BFS manner, as memory allows. In every iteration, as many descendants of the previous batch as fit into memory are processed. Thus, if not all nodes of a tree level fit into memory, the algorithm gradually transforms into a DFS-BFS traversal. For small scenes, the algorithm degenerates to full BFS processing. Special care is taken to reduce the amount of memory resulting from triangle duplicates in kd-tree construction.
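The scheduling behaviour of such a partial-BFS order can be illustrated with a small sketch. It is a simplified interpretation and not the implementation of Hou et al.: the function name, the `cost` and `split` callbacks, and the scalar memory `budget` are assumptions made for illustration. Pending nodes are kept on a stack so that descendants of the most recent batch are always taken first.

```python
from typing import Callable, List, Sequence, TypeVar

Node = TypeVar("Node")


def partial_bfs(root: Node,
                cost: Callable[[Node], int],
                split: Callable[[Node], Sequence[Node]],
                budget: int) -> List[List[Node]]:
    """Return the batches of nodes processed per iteration.

    Hypothetical model: `cost(node)` is the memory a node needs while it is
    processed, `split(node)` returns its children (empty for a leaf), and
    `budget` is the memory available per iteration.
    """
    pending = [root]  # stack: children of the latest batch end up on top
    batches: List[List[Node]] = []
    while pending:
        batch: List[Node] = []
        used = 0
        # Take nodes from the top of the stack while they still fit. A single
        # node is always accepted so that the loop makes progress; a real
        # system would have to handle an oversized node explicitly.
        while pending and (not batch or used + cost(pending[-1]) <= budget):
            node = pending.pop()
            used += cost(node)
            batch.append(node)
        batches.append(batch)
        children = [child for node in batch for child in split(node)]
        pending.extend(reversed(children))  # keep left-to-right order on pop
    return batches
```

With a budget large enough for a whole tree level, every batch is exactly one level, which is the full BFS case for small scenes. With a smaller budget, one part of a level is refined down to its leaves before the deferred siblings are picked up again, which gives the mixed DFS-BFS behaviour described above.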

They propose the commonly used technique already introduced by Havran and Bittner [2002] to store a primitive reference for every triangle that consists of its bounding box