6.2 Computing Performance
6.2.2 GPU implementation
[Figure 6.8 plot: four bar-chart panels, Time (ms) on a 0 to 30 axis. Legend: MemCopy HostToDev, Object Segmentation, Resize, CHT, MemCopy DevToHost, Total. Sx: number of vote samples (x = 16, 32); NRx: number of robots (1, 2, 4, 8, 16, 32, 64). Total-time labels (ms) as extracted, per panel:
CHT-Resize-Approx: 10.59 11.17 10.63 11.16 10.67 11.25 10.75 11.31 10.86 11.71 11.10 12.04 11.69 13.20
CHT-Full-Approx: 19.16 21.53 19.17 21.83 19.32 21.95 19.42 22.02 19.68 22.70 20.49 24.22 21.92 26.52
CHT-Full-Pyth: 19.37 21.75 19.41 21.92 19.46 22.00 19.72 22.46 20.04 23.00 20.66 24.61 22.10 26.88
CHT-Resize-Pyth: 10.68 11.19 10.70 11.27 10.74 11.35 10.78 11.46 10.89 11.79 11.16 12.16 11.72 13.24]
Figure 6.8: GPU computing performances on the GTX-580 for the implemented kernels. The experiments were performed for different numbers of robots and CHT vote samples, and measured in processing time (ms).
Figure 6.9 presents the execution times of the implemented kernels on the NVIDIA GTX-780 GPU. The results follow the same pattern as those obtained on the GTX-580 GPU, but the GTX-780 achieves shorter processing times owing to its newer architecture and fabrication process. The GTX-780 is based on NVIDIA's Kepler architecture and manufactured using a 28 nm process, whereas the GTX-580 is based on NVIDIA's Fermi architecture and fabricated using a 40 nm process.
Comparisons of the computing performances of the GTX-780 and GTX-580 GPUs for implementations of configurations with the CHT algorithm are shown in Figure 6.10.
The GTX-780 implementations produce significantly shorter execution times than the GTX-580 implementations in all scenarios. With its 2304 CUDA cores, the GTX-780 GPU achieves up to 30% faster execution times. The proposed design with 16 or 32 CHT vote samples (S16 or S32) is the most favorable configuration when considering the trade-off between computing performance (Figure 6.10) and detection performance (Table 6.3). With S16 and S32, the GTX-780 GPU reaches frame rates of 135 and 128 fps, respectively, whereas the GTX-580 obtains 94 and 89 fps, respectively.
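The quoted frame rates follow from the per-frame totals by simple arithmetic. The sketch below assumes that the rates correspond to the smallest CHT-Resize-Approx totals labeled in Figures 6.8 and 6.9 (10.59 and 11.17 ms on the GTX-580, 7.37 and 7.76 ms on the GTX-780); this mapping is inferred from the bar labels, and the function names are illustrative only. It also shows one possible reading of the 30% figure, as a reduction in per-frame time.

```python
# Per-frame totals in ms, read from the bar labels of Figures 6.8 and 6.9.
# Assumption: these are the single-robot CHT-Resize-Approx totals.
TOTAL_MS = {
    ("GTX-580", "S16"): 10.59,
    ("GTX-580", "S32"): 11.17,
    ("GTX-780", "S16"): 7.37,
    ("GTX-780", "S32"): 7.76,
}


def frames_per_second(total_ms):
    """Convert a per-frame processing time in ms to a frame rate."""
    return 1000.0 / total_ms


def time_reduction(slow_ms, fast_ms):
    """Fraction by which fast_ms is shorter than slow_ms."""
    return 1.0 - fast_ms / slow_ms


for (gpu, samples), ms in TOTAL_MS.items():
    print(f"{gpu} {samples}: {ms:.2f} ms -> {frames_per_second(ms):.1f} fps")

reduction = time_reduction(TOTAL_MS[("GTX-580", "S16")],
                           TOTAL_MS[("GTX-780", "S16")])
print(f"S16 time reduction, GTX-580 to GTX-780: {reduction:.1%}")
```

Truncated to whole frames, these totals give 94, 89, 135, and 128 fps, which agrees with the rates quoted in the text.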
[Figure 6.9 plot: four bar-chart panels, Time (ms) on a 0 to 20 axis. Legend: MemCopy HostToDev, Object Segmentation, Resize, CHT, MemCopy DevToHost, Total. Sx: number of vote samples (x = 16, 32); NRx: number of robots (1, 2, 4, 8, 16, 32, 64). Total-time labels (ms) as extracted, per panel:
CHT-Full-Approx: 12.13 13.30 12.16 13.40 12.19 13.47 12.35 13.81 12.62 14.34 13.24 15.56 14.52 18.03
CHT-Full-Pyth: 12.58 13.78 12.62 13.88 12.65 13.92 12.82 14.30 13.10 14.80 13.68 16.03 14.94 18.49
CHT-Resize-Pyth: 7.40 7.81 7.42 7.87 7.44 7.91 7.50 8.05 7.62 8.27 7.86 8.73 8.35 9.69
CHT-Resize-Approx: 7.37 7.76 7.39 7.83 7.41 7.84 7.46 7.96 7.58 8.18 7.82 8.67 8.31 9.62]
Figure 6.9: GPU computing performances on the GTX-780 for the implemented kernels. The experiments were performed using different numbers of robots and CHT vote samples, and measured in processing time (ms).
[Figure 6.10 plot: Time (ms) on a 0 to 30 axis for NR1 to NR64. Series: GTX580 and GTX780, each with CHT-Full-Pyth, CHT-Full-Approx, CHT-Resize-Pyth, and CHT-Resize-Approx at S16 and S32.]
Figure 6.10: GPU computing performances for configurations using the CHT algorithm.
Configuration with CSW-based algorithm
Figure 6.11 shows the execution times of the implemented kernels on the GTX-580 and GTX-780 GPUs for CSW-based configurations with different numbers of CSW vote samples (Sx). The performance of the CSW kernel is determined by the video frame size and the number of CSW vote samples. Increasing the number of vote samples from 16 to 32 significantly increases the execution time, particularly for configurations that use the full frame size. In contrast to the CHT kernel, the computing performance of the CSW kernel does not depend on the number of robots (NRx): a scenario with one robot has the same execution time as a scenario with four, 64, or even more robots. The CSW kernel is therefore more favorable for applications that use a large number of robots.
Configurations with the resize (downscaling) technique have significantly higher computing performances than those with the full frame approach, as shown in Figure 6.11. Reducing the frame size of the segmented image yields an improvement of up to 75% for the CSW and Sobel kernel operations. Overall, it provides a speed-up of approximately two times for the GPU implementation compared to the full frame size approach, while producing only a small reduction in the detection performance. This applies in particular to configurations with 16 and 32 CSW vote samples (S16 and S32), as shown in Table 6.4. Therefore, the proposed designs with S16 and S32 CSW vote samples are the favorable configurations considering the trade-off between computing performance (Figure 6.11) and detection performance (Table 6.4). The fastest execution time is achieved when using 16 CSW vote samples (S16) combined with the downscaling technique.
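The overall speed-up from downscaling can be checked against the total times labeled in Figure 6.11. The following sketch uses the CSW-Full-Approx and CSW-Resize-Approx totals and assumes that the larger totals belong to the GTX-580 panels, which is inferred from the figure labels rather than stated explicitly. The dictionary layout and function name are illustrative.

```python
# Total per-frame times in ms, read from the bar labels of Figure 6.11.
# Assumption: the larger totals belong to the GTX-580 panels.
CSW_TOTAL_MS = {
    "GTX-580": {
        "Full-Approx": {"S16": 29.14, "S32": 40.34},
        "Resize-Approx": {"S16": 13.30, "S32": 16.13},
    },
    "GTX-780": {
        "Full-Approx": {"S16": 19.84, "S32": 27.95},
        "Resize-Approx": {"S16": 9.18, "S32": 11.33},
    },
}


def resize_speedup(gpu, samples):
    """Ratio of full-frame total time to downscaled total time."""
    totals = CSW_TOTAL_MS[gpu]
    return totals["Full-Approx"][samples] / totals["Resize-Approx"][samples]


for gpu in CSW_TOTAL_MS:
    for samples in ("S16", "S32"):
        print(f"{gpu} {samples}: speed-up {resize_speedup(gpu, samples):.2f}x")
```

Under these assumptions the ratios fall between roughly 2.2 and 2.5, in line with the approximately twofold speed-up stated above.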
[Figure 6.11 plot: eight bar-chart panels in two groups (GTX-580 and GTX-780), Time (ms) on a 0 to 45 axis, with bars for S16 and S32. Legend: MemCopy DevToHost, CHT, Sobel Filter, Resize, Object Segmentation, Total. Sx: number of vote samples; NRx: number of robots (scalable). Total-time labels (ms) for S16 and S32, as extracted:
First group: CSW-Full-Pyth 29.37, 40.57; CSW-Full-Approx 29.14, 40.34; CSW-Resize-Pyth 13.35, 16.19; CSW-Resize-Approx 13.30, 16.13
Second group: CSW-Full-Pyth 20.30, 28.41; CSW-Resize-Pyth 9.21, 11.36; CSW-Full-Approx 19.84, 27.95; CSW-Resize-Approx 9.18, 11.33]
Figure 6.11: GPU computing performances for configurations using the CSW algorithm on the GTX-580 and GTX-780 GPUs.
Computing performance comparison between CHT and CSW-based configurations
Figure 6.12 presents the computing performances of the proposed design for the CHT- and CSW-based configurations. For the same number of vote samples and for up to 64 robots, the CHT-based approach achieves a shorter execution time than the CSW-based design. The difference between the two is large for a small number of robots and shrinks as the number of robots grows. This is because the execution time of the CSW-based design is constant for any number of robots, whereas the execution time of the CHT-based design increases with the number of robots, as shown in Figure 6.12 (top). Because the work done by the CHT algorithm depends on the number of active edge pixels, its execution time could exceed that of the CSW-based design when the number of robots is very high (e.g., 100). Additionally, if the robot arena contains many objects (such as obstacles) with the same color as the circle of the robot marker (e.g., red), the execution time of the CHT-based design could increase beyond the results shown in Figure 6.10 and Figure 6.12.
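As a rough illustration of where the two approaches might cross over, the sketch below extrapolates the CHT-Resize-Approx totals linearly from NR1 to NR64 and finds the robot count at which they would reach the constant CSW-Resize-Approx total. The linear extrapolation is a simplifying assumption made here, not a model taken from the measurements; if the growth is steeper, the crossover comes earlier. The sketch also assumes that the bar labels in Figures 6.8, 6.9, and 6.11 are ordered as (S16, S32) pairs for NR1 through NR64, and it ignores the clustering time on the CPU and any scene-dependent edge pixels.

```python
# GPU totals in ms, read from the bar labels of Figures 6.8, 6.9 and 6.11.
# Assumed mapping: (CHT total at NR1, CHT total at NR64, constant CSW total).
RESIZE_APPROX_MS = {
    ("GTX-580", "S16"): (10.59, 11.69, 13.30),
    ("GTX-580", "S32"): (11.17, 13.20, 16.13),
    ("GTX-780", "S16"): (7.37, 8.31, 9.18),
    ("GTX-780", "S32"): (7.76, 9.62, 11.33),
}


def crossover_robots(cht_nr1_ms, cht_nr64_ms, csw_ms):
    """Robot count at which a linearly extrapolated CHT time equals csw_ms."""
    slope_ms_per_robot = (cht_nr64_ms - cht_nr1_ms) / (64 - 1)
    return 1 + (csw_ms - cht_nr1_ms) / slope_ms_per_robot


for (gpu, samples), totals in RESIZE_APPROX_MS.items():
    estimate = crossover_robots(*totals)
    print(f"{gpu} {samples}: crossover near {estimate:.0f} robots")
```

Under these assumptions the estimated crossover lies on the order of 120 to 160 robots for the GPU stages alone, which is broadly consistent with the statement that the CHT-based design could become slower for robot counts around 100 or more.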
In the GPU implementation, the CPU is used to process the graph clustering algorithm. As depicted in Figure 6.12, the execution time of this step depends on the number of circle center candidates output by the GPU. For a small number of robots (e.g., NR1 up to NR16), the clustering time is very short and insignificant compared to the execution time on the GPU. However, it becomes significant for a high number of robots, such as 32 (NR32) or above. The execution time of the graph clustering operation could be decreased by increasing the threshold value for the circle center candidates on the GPU, at the cost of a possible reduction in detection performance.
Figure 6.13 compares the computing performances of the GPU-CPU implementation and the CPU-only implementation for multi-robot localization. The GTX-580- and GTX-780-based designs achieve speed-ups of about 7 and 10 times, respectively, compared to the multithreaded CPU-based implementation. This performance is obtained using only a single GPU; a higher computing performance could be achieved with multiple GPUs when necessary.
[Figure 6.12 plot: Time (ms) versus number of robots (NR1 to NR64), with separate panels for the GTX-780 and GTX-580. Series: CHT-Resize-Approx-S16, CHT-Resize-Approx-S32, CSW-Resize-Approx-S16, CSW-Resize-Approx-S32. Top panels: 0 to 18 ms axis. Bottom panels: 0 to 24 ms axis, including graph clustering on the CPU.]
Figure 6.12: Computing performances of the proposed design on the GTX-580 and GTX-780 GPUs for CHT- and CSW-based configurations. Top: without clustering on the CPU. Bottom: with clustering on the CPU.
[Figure 6.13 plot: Time (ms) on a 0 to 30 axis with a break (~), versus number of robots on a 0 to 70 axis. Series: Intel i7 quad-core CPU (4770K, Haswell); GTX580 and GTX780, each with CHT-Resize-Approx and CSW-Resize-Approx at S16 and S32. Data labels as extracted: 109, 110; 14.285, 15.895, 15.795, 18.725; 10.905, 12.215, 13.925, 11.775.]
Figure 6.13: Comparison of computing performances between the GPU-accelerated computing system and the CPU-based system for detecting different numbers of robots (1 to 64), measured in processing time (ms).