GPU implementation - Computing Performance

6.2 Computing Performance

6.2.2 GPU implementation

[Figure 6.8: four panels, one per configuration (CHT-Full-Pyth, CHT-Full-Approx, CHT-Resize-Pyth, CHT-Resize-Approx). Bars are grouped by Sx, the number of vote samples (x = 16, 32), and NRx, the number of robots (1, 2, 4, 8, 16, 32, 64). Time axis: 0 to 30 ms. Legend: MemCopy HostToDev, Object Segmentation, Resize, CHT, MemCopy DevToHost, Total.]

Figure 6.8: GPU computing performance on the GTX-580 for the implemented kernels. The experiments were performed for different numbers of robots and CHT vote samples, and measured in processing time (ms).

Figure 6.9 presents the execution times of the implemented kernels on the NVIDIA GTX-780 GPU. The characteristics of the results on the GTX-780 are similar to those obtained on the GTX-580. However, owing to its newer architecture and fabrication process, the GTX-780 achieves shorter processing times. The GTX-780 is based on NVIDIA’s Kepler architecture and manufactured in a 28 nm process, whereas the GTX-580 is based on NVIDIA’s Fermi architecture and fabricated in a 40 nm process.

Comparisons of the computing performances of the GTX-780 and GTX-580 GPUs for implementations of configurations with the CHT algorithm are shown in Figure 6.10.

The GTX-780 implementations produce significantly shorter execution times than the GTX-580 implementations in all scenarios. With its 2304 CUDA cores, the GTX-780 reduces the execution time by up to 30%. The proposed design with 16 or 32 CHT vote samples (S16 or S32) is the most favorable configuration when considering the trade-off between computing performance (Figure 6.10) and detection performance (Table 6.3). With S16 and S32, the GTX-780 reaches frame rates of 135 and 128 fps, respectively, whereas the GTX-580 reaches 94 and 89 fps, respectively.
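The frame rates quoted above can be converted back into per-frame processing times to check them against the stated reduction of up to 30%. The following is a minimal sketch that uses only the four fps values from the text; the function names are illustrative and not part of the original implementation.

```python
# Frame rates quoted in the text (frames per second).
FPS = {
    ("GTX-580", "S16"): 94,
    ("GTX-580", "S32"): 89,
    ("GTX-780", "S16"): 135,
    ("GTX-780", "S32"): 128,
}


def frame_time_ms(fps: float) -> float:
    """Per-frame processing time in milliseconds for a given frame rate."""
    return 1000.0 / fps


def time_reduction_percent(slow_fps: float, fast_fps: float) -> float:
    """Relative reduction in per-frame time when going from slow_fps to fast_fps."""
    slow_ms = frame_time_ms(slow_fps)
    fast_ms = frame_time_ms(fast_fps)
    return 100.0 * (slow_ms - fast_ms) / slow_ms


for samples in ("S16", "S32"):
    fps_580 = FPS[("GTX-580", samples)]
    fps_780 = FPS[("GTX-780", samples)]
    print(
        f"{samples}: GTX-580 {frame_time_ms(fps_580):.2f} ms, "
        f"GTX-780 {frame_time_ms(fps_780):.2f} ms, "
        f"reduction {time_reduction_percent(fps_580, fps_780):.1f}%"
    )
```

Both vote-sample settings give a reduction slightly above 30%, which agrees with the figure stated in the text.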

[Figure 6.9: four panels, one per configuration (CHT-Full-Pyth, CHT-Full-Approx, CHT-Resize-Pyth, CHT-Resize-Approx). Bars are grouped by Sx, the number of vote samples (x = 16, 32), and NRx, the number of robots (1, 2, 4, 8, 16, 32, 64). Time axis: 0 to 20 ms. Legend: MemCopy HostToDev, Object Segmentation, Resize, CHT, MemCopy DevToHost, Total.]

Figure 6.9: GPU computing performance on the GTX-780 for the implemented kernels. The experiments were performed for different numbers of robots and CHT vote samples, and measured in processing time (ms).

[Figure 6.10: execution time (ms, axis 0 to 30) for NR1 to NR64. Series: GTX580 and GTX780, each with CHT-Full-Pyth, CHT-Full-Approx, CHT-Resize-Pyth, and CHT-Resize-Approx at S16 and S32.]

Figure 6.10: GPU computing performance for configurations using the CHT algorithm.

Configuration with CSW-based algorithm

Figure 6.11 shows the execution times of the implemented kernels on the GTX-580 and GTX-780 GPUs for CSW-based configurations with different numbers of CSW vote samples (Sx). The CSW kernel performance is determined by the video frame size and the number of CSW vote samples. Increasing the number of CSW vote samples from 16 to 32 significantly increases the execution time, particularly for configurations that use the full frame size. In contrast to the CHT kernel, the computing performance of the CSW kernel does not depend on the number of robots (NRx): a scenario with one robot has the same execution time as a scenario with four, 64, or even more robots. Therefore, the CSW kernel is more favorable for applications that use a large number of robots.

Configurations with the resize (downscaling) technique have significantly higher computing performance than those with the full frame approach, as shown in Figure 6.11. Reducing the frame size of the segmented image yields an improvement of up to 75% for the CSW and Sobel kernel operations. Overall, it provides a speed-up of approximately a factor of two for the GPU implementation compared to the full frame size approach, while producing only a small reduction in the detection performance. This applies in particular to configurations with 16 and 32 CSW vote samples (S16 and S32), as shown in Table 6.4. Therefore, the proposed designs with S16 and S32 CSW vote samples are the favorable configurations considering the trade-off between the computing performance (Figure 6.11) and the detection performance (Table 6.4). The fastest execution time is achieved when using 16 CSW vote samples (S16) combined with the downscaling technique.
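The gap between a kernel-level improvement of up to 75% and an overall speed-up of about two can be illustrated with an Amdahl-style estimate. The sketch below assumes that the 75% figure means a 75% reduction in the time of the CSW and Sobel kernels (a stage speed-up of 4) and that all other stages (memory copies, segmentation, the resize step itself) take the same time as before. Both are simplifying assumptions for illustration, and the resulting share is derived, not measured.

```python
def overall_speedup(accelerated_fraction: float, stage_speedup: float) -> float:
    """Amdahl-style estimate: only part of the frame time becomes faster."""
    remaining = (1.0 - accelerated_fraction) + accelerated_fraction / stage_speedup
    return 1.0 / remaining


def fraction_for_target(target_speedup: float, stage_speedup: float) -> float:
    """Share of the original frame time that must be accelerated to reach the target."""
    return (1.0 - 1.0 / target_speedup) / (1.0 - 1.0 / stage_speedup)


# Assumption: a 75% time reduction in the CSW and Sobel kernels, i.e. a 4x stage speed-up.
stage_speedup = 1.0 / (1.0 - 0.75)

# Share of the full-frame time those kernels would need to occupy for a 2x overall gain.
share = fraction_for_target(2.0, stage_speedup)
print(f"required kernel share of frame time: {share:.2f}")
print(f"resulting overall speed-up: {overall_speedup(share, stage_speedup):.2f}")
```

Under these assumptions, the CSW and Sobel kernels would account for roughly two thirds of the full-frame processing time, which is consistent with a pipeline dominated by those kernels.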

[Figure 6.11: panels for the GTX-580 and the GTX-780, one per configuration (CSW-Full-Pyth, CSW-Full-Approx, CSW-Resize-Pyth, CSW-Resize-Approx), each with bars for S16 and S32. Sx: number of vote samples; NRx: number of robots (scalable). Time axis: 0 to 45 ms. Legend: MemCopy DevToHost, Sobel Filter, Object Segmentation, CHT, Resize, Total.]

Figure 6.11: GPU computing performance for configurations using the CSW algorithm on the GTX-580 and GTX-780 GPUs.

Computing performance comparison between CHT and CSW-based configurations

Figure 6.12 presents the computing performance of the proposed design for the CHT- and CSW-based configurations. For the same number of vote samples and for up to 64 robots, the CHT-based approach achieves a shorter execution time than the CSW-based design. With a small number of robots, the difference in execution time between the CHT- and CSW-based configurations is large. However, this difference becomes smaller as the number of robots increases. This is because the execution time of the CSW-based design is constant for any number of robots, whereas the execution time of the CHT-based design increases with the number of robots, as shown in Figure 6.12 (top). Because the operation in the CHT algorithm depends on the active edge pixels, its execution time could exceed that of the CSW-based design when the number of robots is very high (e.g., 100). Additionally, if the robot arena contains many objects (such as obstacles) with the same color as the circle of the robot marker (e.g., red), the execution time of the CHT-based design could increase beyond the results shown in Figure 6.10 and Figure 6.12.
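The break-even point between the two designs can be sketched with a simple model. The example below assumes, purely for illustration, that the CHT time grows linearly with the number of robots while the CSW time stays constant. The linear form and all three numeric parameters in the usage example are hypothetical and are not fitted to the measurements in this section.

```python
import math
from typing import Optional


def cht_time_ms(n_robots: int, base_ms: float, per_robot_ms: float) -> float:
    """Hypothetical linear model of the CHT execution time."""
    return base_ms + per_robot_ms * n_robots


def break_even_robots(base_ms: float, per_robot_ms: float, csw_ms: float) -> Optional[int]:
    """Smallest robot count at which the modelled CHT time exceeds the constant CSW time."""
    if cht_time_ms(1, base_ms, per_robot_ms) > csw_ms:
        return 1
    if per_robot_ms <= 0:
        return None  # the CHT time never grows past the CSW time in this model
    n = math.floor((csw_ms - base_ms) / per_robot_ms) + 1
    # Guard against floating-point rounding at the boundary.
    while cht_time_ms(n, base_ms, per_robot_ms) <= csw_ms:
        n += 1
    while n > 1 and cht_time_ms(n - 1, base_ms, per_robot_ms) > csw_ms:
        n -= 1
    return n


# Hypothetical parameters, chosen only to show how the function is used.
print(break_even_robots(base_ms=7.0, per_robot_ms=0.03, csw_ms=9.0))
```

A steeper per-robot cost, for example caused by additional same-colored objects in the arena, moves the break-even point toward smaller robot counts.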

In the GPU implementation, the CPU is used to process the graph clustering algorithm. As depicted in Figure 6.12, its execution time depends on the number of circle center candidates output by the GPU. For a small number of robots (e.g., NR1 up to NR16), the execution time is very short and insignificant compared to the execution time on the GPU. However, it becomes significant for a high number of robots, such as 32 (NR32) or more. The execution time of the graph clustering operation could be decreased by increasing the threshold value for the circle center candidates on the GPU, although this adjustment can reduce the detection performance.

Figure 6.13 compares the computing performance of the GPU-CPU implementation with that of the CPU-only implementation for multi-robot localization. The GTX-580- and GTX-780-based designs achieve speed-ups of about 7 and 10, respectively, compared to the multithreaded CPU-based implementation. This performance is obtained using only a single GPU. If necessary, a higher computing performance could be achieved using multiple GPUs.

[Figure 6.12: panels for the GTX-580 and the GTX-780. Horizontal axis: number of robots (NR1, NR2, NR4, NR8, NR16, NR32, NR64). Vertical axis: time (ms), 0 to 18 in the top panels and 0 to 24 in the bottom panels. Series: CHT-Resize-Approx-S16, CHT-Resize-Approx-S32, CSW-Resize-Approx-S16, CSW-Resize-Approx-S32. The bottom panels include the graph clustering in CPU.]

Figure 6.12: Computing performance of the proposed design on the GTX-580 and GTX-780 GPUs for CHT- and CSW-based configurations. Top: without clustering in CPU. Bottom: with clustering in CPU.

[Figure 6.13: time (ms) versus number of robots (axis 0 to 70). Series: Intel i7 quad-core CPU (4770K, Haswell), and GTX580 and GTX780 with CHT-Resize-Approx and CSW-Resize-Approx at S16 and S32.]

Figure 6.13: Comparison of computing performance between the GPU-accelerated computing system and the CPU-based system for detecting different numbers of robots (1 to 64), measured in processing time (ms).

From the document: Heterogeneous computing systems for vision-based multi-robot tracking (pages 150-160)