Conclusion - Multi-modal Statistics of Local Image Structures and its Applications for Depth Pr

Figure 7.25: Surface Verification Experiment (taken from [Kjargaard et al., 2007]). (a) Setup of the scene. (b) View of the 3 predicted surfaces. (c) The robot moving in position to verify the surface on the box. (d) The sensor in contact with the surface on the box. (e) The 3 detected haptic primitives shown as small red squares. (f) The robot moving through the wrongly predicted surface without detecting a contact.

7.7 Conclusion

This chapter has introduced a voting model that estimates the depth at homogeneous or weakly textured image patches from the depth of the bounding edge-like structures. The depth at edge-like structures is computed using a feature-based stereo algorithm and is used to vote for the depth of a mono, which is otherwise difficult to compute because of the correspondence problem. The results have been compared with different dense stereo algorithms to show that our feature-based algorithm works well for scenes for which dense stereo algorithms are not suited. However, our aim is not to claim that our approach is better, but rather to suggest that the different approaches are suited to different image contexts and that a combination of them is necessary. A naive combination of the approaches is provided as a proof of concept to show that such a combination benefits from both and is able to work in textured as well as non-textured image areas.
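
The idea of letting bounding edge features vote for the depth at a mono can be illustrated with a minimal sketch. It is not the voting scheme of this chapter: the function name `vote_depth`, the tuple layout of the edge features and the inverse-distance weighting are all assumptions chosen only to keep the example short.

```python
import math


def vote_depth(mono_xy, edges, power=2.0):
    """Estimate the depth at a mono from votes of its bounding edges.

    mono_xy: (x, y) image position of the homogeneous patch.
    edges:   iterable of (x, y, depth) tuples, one per edge feature whose
             depth is already known (e.g., from feature-based stereo).
    power:   exponent of the inverse-distance weight (an assumption).
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for ex, ey, depth in edges:
        dist = math.hypot(ex - mono_xy[0], ey - mono_xy[1])
        if dist == 0.0:
            # The mono coincides with an edge feature: use its depth directly.
            return depth
        weight = 1.0 / dist ** power
        weighted_sum += weight * depth
        weight_total += weight
    if weight_total == 0.0:
        raise ValueError("at least one edge feature is required")
    return weighted_sum / weight_total


# A mono halfway between two edges receives the mean of their depths.
print(vote_depth((0.0, 0.0), [(-1.0, 0.0, 2.0), (1.0, 0.0, 4.0)]))
```

In this sketch nearby edges dominate the vote; a model that exploits 3D line orientations, as the chapter does, would replace the weighting with a geometric consistency test.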

tion discontinuities differently since they are indications of different 3D information, as suggested in [Barrow and Tenenbaum, 1981].

One quality of the model is that it can be extended to regard the best two clusters as two different depth hypotheses at each mono, and modified not to decide between them until it can pass the hypotheses to a higher-level process that makes the decision. One example of such a high-level process is demonstrated in section 7.6.7.
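
Keeping two hypotheses instead of one can be sketched as follows. The clustering used here (grouping sorted depth votes that lie within a fixed gap of each other) and the names `top_two_hypotheses` and `gap` are assumptions for illustration; the clustering of the model itself is not reproduced.

```python
def top_two_hypotheses(votes, gap=0.5):
    """Return up to two (mean_depth, support) pairs from a list of depth votes.

    Votes are sorted and split into clusters wherever two consecutive
    votes differ by more than `gap`. The two largest clusters are kept
    as competing depth hypotheses; no decision is made between them.
    """
    if not votes:
        return []
    ordered = sorted(votes)
    clusters = [[ordered[0]]]
    for vote in ordered[1:]:
        if vote - clusters[-1][-1] <= gap:
            clusters[-1].append(vote)
        else:
            clusters.append([vote])
    # Largest support first; the sort is stable, so ties keep depth order.
    clusters.sort(key=len, reverse=True)
    return [(sum(c) / len(c), len(c)) for c in clusters[:2]]


hypotheses = top_two_hypotheses([1.0, 4.0, 1.0, 9.0, 4.0, 1.0])
print(hypotheses)
```

The returned list can be handed unchanged to a later stage, which is free to pick either hypothesis once additional evidence is available.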

As motivated in chapter 1, depth prediction can be understood as a feedback mechanism which completes the missing information in early vision. This makes depth prediction a part of an early cognitive vision framework where different cues interact with each other to resolve the ambiguities and fill in the missing information in early vision.

We are planning to combine the depth prediction method with dense stereo methods in a feedback mechanism. In this mechanism, using high-resolution cameras with a built-in region of interest facility, it is possible to capture image regions at low and high resolutions. At the low resolution, the texture on a surface might be very weak, which favors the utilization of the depth prediction method. At the high resolution, the texture on a surface can become sufficiently distinguishable for stereo matching, which favors dense stereo methods. Based on these observations, we propose to use the depth prediction at the low resolution, and then zoom in to the region of interest to get more signal information (i.e., texture detail) from the surface, and verify or refine the original depth predictions using the disparity estimation from dense methods. Such a system can be considered as an attention mechanism where the details are acquired by attending to the regions of interest.
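
The verify-or-refine step of the planned feedback loop could look roughly like the sketch below. It assumes a rectified stereo pair, for which depth follows from disparity as Z = f * B / d; the function names, the relative tolerance and the three status labels are hypothetical and are not part of the proposed system.

```python
def depth_from_disparity(disparity, focal_px, baseline):
    """Depth of a point in a rectified stereo pair: Z = f * B / d.

    Returns None when the disparity is not positive, i.e., when dense
    matching produced no usable estimate.
    """
    if disparity is None or disparity <= 0.0:
        return None
    return focal_px * baseline / disparity


def verify_or_refine(predicted, dense, tolerance=0.05):
    """Compare a low-resolution depth prediction with a dense estimate.

    predicted: depth predicted at the low resolution.
    dense:     depth from dense stereo in the zoomed-in region, or None.
    tolerance: allowed relative deviation (an assumed threshold).
    """
    if dense is None:
        return predicted, "unverified"  # still too little texture
    if abs(predicted - dense) <= tolerance * abs(dense):
        return predicted, "verified"    # the two estimates agree
    return dense, "refined"             # trust the high-resolution estimate


dense = depth_from_disparity(25.0, focal_px=500.0, baseline=0.1)
print(verify_or_refine(2.0, dense))
```

Preferring the dense estimate on disagreement is one possible policy; averaging the two values or keeping both as hypotheses would fit the same interface.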

7.8 Acknowledgements

The publications of the author which are relevant for this chapter are [Kalkan et al., 2007b, Kalkan et al., 2007a, Kjargaard et al., 2007, Kraft et al., 2007, Başeski et al., 2007, Kalkan et al., 2008].

Chapter 8 Conclusions

The current chapter concludes the thesis in the following two sections with a summary and an outlook of the contributions.

8.1 Summary

Extraction of different modalities and processing of local image structures are limited, ambiguous and incomplete, as argued in chapter 1. Biological vision systems can cope with such ambiguities and the missing information by:

1. exploiting the redundancy of information in natural images, which is accessible through the statistical properties of the visual entities,

2. using feedback information from higher visual levels and

3. using lateral feedback information between different visual modalities, for example, in the form of an interpolation process.

Note that these issues overlap; i.e., addressing one of them may make use of another. This thesis addressed the above-mentioned issues in the context of an early cognitive vision system:

• In chapter 3, the extent of the problem of local processing (i.e., the aperture problem) in the case of optical flow estimation is investigated using different optic flow estimation algorithms on natural images.


tic interpretation of junctions is used as a feedback mechanism for removing outliers and selecting reliable junction detections.

• In chapter 5, the relation between local image structures and local 3D structure is investigated. The results of this investigation are important for understanding the possible mechanisms underlying depth interpolation processes.

• In chapter 6, the investigations in chapter 5 are extended using higher-order relations between local 3D structures. The results of this investigation provide insights into depth interpolation mechanisms and can be used as priors in a depth prediction model.

• In chapter 7, motivated by the results of chapters 5 and 6, a voting-based depth prediction model that predicts depth at homogeneous image areas is proposed. This model utilizes the sparse local 3D features extracted using feature-based stereo, and its performance is extensively compared against several dense stereo methods. Such a model can be regarded as lateral feedback over long distances between the edge features extracted in early vision, which completes the missing information at homogeneous image patches using depth interpolation.

The contribution of chapter 7 is the proposal of a depth cue that exploits the redundancy of information in images. Currently, the depth cue makes use of the 3D information computed using stereo; however, it can work with other depth cues such as structure from motion as long as 3D positions and 3D line orientations are provided.

The thesis utilizes the concept of intrinsic dimensionality in all the chapters for detecting local image structures. Especially in chapter 3, for analyzing the quality of optic flow estimation, and in chapter 5, for investigating the relation between local image structures and local 3D structures, intrinsic dimensionality proves to be a useful tool that enables thorough analysis and makes explicit those properties of local image features which are otherwise more difficult to observe.
