Multi-modal Statistics of Local Image Structures and its Applications for Depth Prediction


Dissertation

E  -D

”D  ”G-A-U¨ G¨

submitted by Sinan Kalkan from Ankara

Göttingen 2007


Referee: Prof. Florentin Wörgötter. Co-referee: Prof. Norbert Krüger. Date of the oral examination: 15 January 2008.


Abstract

Visual processing in both artificial and biological vision systems starts with the extraction of local visual modalities (like optical flow, disparity and contrast transition) and local image structures (edge-like, junction-like and texture-like structures). Since information in early vision is processed only locally, it is inherently ambiguous. For example, estimation of optical flow faces the aperture problem, and thus only the flow along the intensity gradient is computable for edge-like structures. Moreover, the extracted flow information at weakly-textured image areas is unreliable.

Analogously, stereopsis needs to deal with the correspondence problem: since correspondences at weakly-textured image areas cannot be found, the disparity information at such places is not accurate. One way to deal with the missing and ambiguous information is to make use of the redundancy of visual information by exploiting the statistical regularities of natural scenes. Such regularities are exploited in the visual system through feedback mechanisms between different layers, or through lateral connections within a layer.

This thesis investigates, by statistical means, the ambiguous, biased and missing information in the processing of optic flow, stereo and junctions. It uses statistical properties of images to analyze the extent of the ambiguity in optical flow estimation, and to examine whether the missing information in stereo can be recovered by interpolating depth information at edge-like structures. Moreover, it proposes a feedback mechanism for dealing with the bias in junction detection, and a model for recovering the missing depth information in stereo computation using only the depth information at edges.


Acknowledgements

First of all, I would like to thank my unofficial supervisor Prof. Norbert Krüger from Denmark for his supervision and his important contributions to my scientific thinking. Moreover, it is his understanding and friendly support regarding my personal requirements that kept this study going.

Prof. Florentin Wörgötter, my official supervisor, is an important ingredient of this thesis. Being the official supervisor, he allowed me to perform my research independently of a physical location. I also learned a lot from him about managing a big research group whose members come from totally different fields and work on different subjects.

This study is a product of working in three different locations: Stirling in Scotland; Odense, Copenhagen and Esbjerg in Denmark; and Göttingen. I would like to thank all my friends and colleagues from these countries, some of whom are (ordered by surname): Emre Başeski, Babette Dellen, Tao Geng, Matthias Hennig, Christoph Kolodziejski, Dirk Kraft, Irene Markelić, Ailsa Millen, Florian Pilz, Yan Shi, Steffen Wischmann and Alexander Wolf. I should also mention that kicker (i.e., table football) was an amusing part of my daily life in Göttingen: thank you, Kicker Gang!

My friends and colleagues Nicolas Pugeault, Marina Wimmer and Ailsa Millen from Stirling were really important for this study. It is they who made Scotland feel like a second home for me.

Tomas Kulvičius is also to be thanked for leaving Scotland with me for Göttingen, and for going through a tough beginning in Germany.

I would like to thank my friends from Turkey, whose remote and cyber friendship deserves great credit, as going abroad never meant leaving them behind: İrem Aktuğ, Barış Sertkaya, Ergül Pekesen, Gülhan Bilen, Levent Karagöl, Gökhan Kars, Behiye Erkenci, Burçin Sapaz, Sevgi Yaşar. Special credits go to


Contents

Abstract 3

Acknowledgements 4

1 Introduction 10

1.1 Marr’s Theory of Vision . . . 12

1.2 Early Vision and Early Cognitive Vision . . . 14

1.3 3D Reconstruction, the Correspondence Problem and Depth Interpolation . . . 16

1.4 Vision and Natural Image Statistics . . . 18

1.5 Outline and Contributions . . . 19

2 Background 22

2.1 A Continuous Definition of Intrinsic Dimensionality . . . 22

2.2 Multi-modal Visual Features – Primitives . . . 24

2.3 Acknowledgements . . . 25

3 Local Image Structures and Optic Flow Estimation 29

3.1 Distribution of Local Image Structures . . . 32

3.2 Distribution of Orientation of Local Image Structures . . . 34

3.3 Optic Flow Estimation Algorithms . . . 35

3.3.1 The Lucas-Kanade Algorithm . . . 35


3.4.2 Analysis of Quality of Optic Flow Estimation . . . 40

3.5 Discussion . . . 44

3.6 Acknowledgements . . . 45

4 Improving Junction Detection by Semantic Interpretation 46

4.1 Junction Detection Algorithms . . . 49

4.1.1 Harris Operator . . . 49

4.1.2 SUSAN Operator . . . 50

4.2 Improving Localization . . . 51

4.3 Semantic Interpretation of Junctions . . . 52

4.4 Results and Discussions . . . 54

4.5 Summary . . . 55

4.6 Acknowledgements . . . 55

5 Statistical Relation between Local Image Structures and Local 3D Structure 58

5.1 Relevant Studies . . . 59

5.2 Local 2D and 3D Structures . . . 59

5.3 Methods . . . 60

5.3.1 Measure for Gap Discontinuity: µGD . . . 62

5.3.2 Measure for Orientation Discontinuity: µOD . . . 64

5.3.3 Measure for Irregular Gap Discontinuity: µIGD . . . 65

5.3.4 Combining the Measures . . . 66

5.4 Results . . . 68

5.5 Discussion . . . 72

5.5.1 Limitations of the current work . . . 73

5.6 Acknowledgements . . . 74


6 Statistical Relation between Local 3D Structures 75

6.1 Methods . . . 76

6.1.1 Representation . . . 77

6.1.2 Collecting the Data Set . . . 79

6.1.3 Definition of coplanarity . . . 82

6.2 Results . . . 83

6.3 Discussion . . . 86

6.3.1 Limitations of the current work . . . 87

6.4 Acknowledgements . . . 87

7 A Model for Depth Prediction from 3D Edge Features 88

7.1 Cues for depth extraction . . . 90

7.2 Related studies . . . 93

7.3 Relations between Primitives . . . 96

7.3.1 Co–planarity . . . 96

7.3.2 Linear dependence . . . 97

7.3.3 Co–colority . . . 98

7.4 Formulation of the Model . . . 98

7.4.1 Bounding edges of a mono . . . 100

7.4.2 The vote of a pair of edge primitives on a mono . . . 101

7.4.3 Combining the votes . . . 101

7.4.4 Combining the predictions using area information . . . 103

7.4.5 Round object mode . . . 105

7.5 Dense Stereo Methods . . . 108

7.5.1 Phase-based approach (PB) . . . 108

7.5.2 Region matching with squared sum of differences (SSD) . . . 109

7.5.3 Region matching with absolute differences and a scanline global optimization (SO) . . . 109

7.5.4 Region matching with absolute differences and a dynamic programming global optimization (DP) . . . 109

7.6 Results . . . 110


7.6.5 Time issues . . . 120

7.6.6 Limitations of the current work . . . 120

7.6.7 Integration into a multi-sensorial framework . . . 121

7.7 Conclusion . . . 122

7.8 Acknowledgements . . . 123

8 Conclusions 124

8.1 Summary . . . 124

8.2 Outlook . . . 126

A Algorithmic Details of Intrinsic Dimensionality 128

B Grouping 2D Primitives 133

B.1 Proximity (cp[li,j]) . . . 133

B.2 Collinearity (cco[li,j]) . . . 134

B.3 Co–circularity (cci[li,j]) . . . 134

B.4 Geometric Constraint (Gi,j) . . . 135

B.5 Multi–modal Constraint (Mi,j) . . . 135

B.6 Primitive Affinity (Ai,j) . . . 136

B.7 Acknowledgements . . . 136

C Computation of an Ellipse and the Definition of Coplanarity 137

C.1 Parameters of an ellipse . . . 137

C.2 Definition of coplanarity . . . 138

Bibliography 140

Curriculum Vitae 159


Chapter 1

Introduction

Vision is the process of understanding scenes from their 2D projections, which come in the form of a set of images. The intensity values in an image are formed by one or more of the following factors: (1) the geometry of the environment, (2) its illumination, (3) the reflectances of the surfaces, and (4) the viewpoint. By definition, this makes vision an ill-posed¹ inverse problem [Bertero et al., 1987].

Processing in most artificial vision systems and in the human vision system starts with the extraction of local visual modalities (like optical flow, disparity and contrast transition) and local image structures (edge-like, junction-like and texture-like structures). This stage is called early vision in, e.g., [Papathomas et al., 1995]. Since information in early vision is processed only locally, it is inherently ambiguous. For example, estimation of optical flow faces the aperture problem, and thus only the flow along the intensity gradient is computable for edge-like structures. Moreover, the extracted flow information at weakly-textured image areas is unreliable. Analogously, stereopsis needs to deal with the correspondence problem: as correspondences at weakly-textured image areas cannot be found, the disparity information at such places is not accurate. Nonetheless, the human visual system can extract meaningful 3D interpretations from early vision in spite of the ambiguities and the missing information. Accordingly, an artificial vision system is expected to operate and create 3D world models from such information, too.
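The aperture problem can be made concrete with the brightness-constancy constraint: at a single pixel, one linear equation constrains the two unknown flow components, so only the component along the intensity gradient is determined. The following sketch uses made-up derivative values for an edge-like patch; it is an illustration of the general principle, not code from this thesis:

```python
import numpy as np

# Brightness-constancy constraint at one pixel:
#   Ix*u + Iy*v + It = 0  -- one equation, two unknowns (u, v).
# Hypothetical derivatives for a vertical edge (gradient along x):
Ix, Iy, It = 2.0, 0.0, -4.0

grad = np.array([Ix, Iy])
n = grad / np.linalg.norm(grad)            # unit vector along the gradient
t = np.array([-n[1], n[0]])                # unit vector along the edge

# The flow component along the gradient (the "normal flow") is determined:
normal_flow = -It / np.linalg.norm(grad)   # = 2.0 for these values

# ...but any tangential component satisfies the constraint equally well,
# which is exactly the aperture problem:
for tangential in (0.0, 1.0, -3.5):
    u, v = normal_flow * n + tangential * t
    assert abs(Ix * u + Iy * v + It) < 1e-9
```

Because every flow of the form (normal flow + arbitrary tangential motion) is consistent with the local measurement, an edge-like structure pins down only one of the two flow components.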

The ambiguous and biased information from early vision is processed and integrated by global mechanisms at a stage of early cognitive vision (as defined in [Wörgötter et al., 2004]) in order to create more accurate, meaningful and complete visual entities.

¹ According to [Hadamard, 1923], a problem is well-posed if (1) a solution exists, (2) the solution is unique, and (3) it depends continuously on the data. A problem is ill-posed if it is not well-posed.


A more condensed representation in terms of a semantic descriptor of reduced dimensionality is required; i.e., feedback mechanisms should make use of sparse symbolic descriptors rather than signal-level information.

Homogeneous image areas are signals of uniform intensity. Such image areas are neglected in early vision since retinal ganglion cells are excitable only by contrast differences. Early cognitive vision is believed to infer visual information (including first estimates of depth information) at homogeneous image areas from the visual information available in early vision, using interpolation mechanisms². There are already psychophysical experiments [Anderson et al., 2002, Collett, 1985, Julesz, 1971, Treue et al., 1995] and computational theories [Barrow and Tenenbaum, 1981, Grimson, 1982, Terzopoulos, 1988] which suggest that the human visual system performs interpolation in depth and completes the missing depth information at weakly-structured image areas.

Feedback mechanisms in a vision system make use of the regularities in the input images. In fact, it is believed that the human visual system is adapted to the statistics of the retinal projections of the environment, in order to exploit the regularities, or the redundancy of information, in the environment [Brunswik and Kamiya, 1953]. With the availability of computational and technological means, it has become possible to verify such claims [Krueger, 1998, Geisler et al., 2001], and the results of such investigations have proven useful in several computational vision problems [Elder et al., 2003, Pugeault et al., 2004, Zhu, 1999] (see [Simoncelli, 2003] for a review).

In summary, biological vision systems can cope with the ambiguities and the missing information mentioned above by (1) exploiting the redundancy of information in natural images, (2) using feedback information from higher visual levels, and (3) using lateral feedback between different visual modalities (such as optical flow, colour, contrast etc.), for example in the form of an interpolation process.

This thesis is concerned with the analysis of ambiguities in the visual modalities such as optical

² The term interpolation is not meant in strict mathematical terms (i.e., regression) in this thesis; filling in missing information is usually called interpolation in the literature (see, e.g., [Grimson, 1982]).


flow and disparity, and with the computational modeling of feedback mechanisms for two problems:

(1) the reliable and complete extraction of junctions in images (chapter 4), and (2) the estimation of depth at weakly-structured image areas (chapter 7), where correspondence-based depth cues provide unreliable information or no information at all (see [Bayerl and Neumann, 2007] for a computational model of feedback mechanisms in optical flow estimation). Statistical investigations on natural images and chromatic range data are provided that support the models developed in this work and in previous works by other researchers, and that quantify some assumptions widely made by the vision community (chapters 3, 5 and 6).

This thesis contributes to an existing early cognitive vision framework in two respects: (1) junctions with condensed symbolic descriptors, and (2) homogeneous image patches with predicted depth information. This early cognitive vision framework was mainly developed in the European ECOVISION project [ECOVISION, 2003], and so far makes use of only edge-like structures [Pugeault et al., 2006]. With depth information available in this framework, homogeneous image patches can be combined to create object surfaces, which can then be used for several tasks such as grasping objects with a robot arm (European PACO-PLUS project [PACO-PLUS, 2007]) or driving a car on the road (European DRIVSCO project [Drivsco, 2007]).

In the following sections, the thesis is put into several contexts, and its contributions are described in each context.

1.1 Marr’s Theory of Vision

Vision research has been influenced most by David Marr's paradigm [Marr, 1982]. This is because the paradigm (1) laid out computational vision as an information-processing task, (2) addressed the main problems that had to be solved in order to achieve this task, and (3) proposed a computational framework as a solution to it.

One of Marr's first contributions was to combine the findings and theories of his time from neurophysiology, psychology and artificial intelligence into a coherent and complete vision theory. He clearly defined vision as an information-processing task, and, in combination with the existing psychophysical experiments, he arrived at a distinction between (1) the computational theory, (2) the representation and algorithmic implementation of a theory, and (3) the hardware implementation (for


2. Primal sketch. The sketch of the image is extracted in terms of edges, corners and other local structures as well as perceptual groups.

3. 2½-D sketch. This level is viewer-based and concerned with extracting the relative or absolute 3D distances and orientations of objects.

4. 3-D model representation. This is the goal of a complete visual system. It includes models of objects, in an object-centered coordinate system, as well as how these objects are organized in space.

Marr reduced the vision process to a set of subproblems called visual modules. The visual modules include stereo, shape-from-X methods, and the extraction of several visual modalities like optical flow, contrast transition etc. However, in the last decade, (1) the evidence from neurophysiology that biological visual systems are equipped with feedback mechanisms, which constitute an important proportion of the visual cortex, and (2) the ambiguities and missing information in early vision have led scientists to realize that visual modules cannot be solved unambiguously without feedback from other visual modules or from higher levels of visual processing, and several attempts have been initiated to combine the different visual modules (see, e.g., [Aloimonos and Shulman, 1989]).

In this context, this thesis contributes two feedback mechanisms for two different tasks. First, a simple feedback mechanism is proposed for the detection and extraction of junctions using their semantic interpretation (chapter 4). The semantic interpretation of junctions is used to detect and remove outliers and to produce very reliable detections in spite of the detectors' high sensitivity to contrast. Second, 3D features extracted by a feature-based stereo algorithm are used in a depth prediction model as lateral feedback to interpolate depth in homogeneous image areas, where correspondence-based methods usually fail to compute depth (chapter 7).


1.2 Early Vision and Early Cognitive Vision

According to Marr's paradigm (see section 1.1 and [Marr, 1982]), vision involves the extraction of meaningful representations from input images, starting at the pixel level and building up the interpretation more or less in the following order: local filters, extraction of relevant features, the 2½-D sketch and the 3-D sketch. One possible distinction of image structures is as described below:

• Homogeneous structures: Homogeneous patches are signals of uniform intensity. It is assumed that they correspond to continuous surfaces (an assumption quantified in chapter 5), and they are not made much use of in early vision because retinal ganglion cells are not excitable by homogeneous intensities [Bruce et al., 2003].

• Edge–like structures: Edges are low-level structures which constitute the boundaries between homogeneous or texture-like image areas (see, e.g., [Koenderink and Dorn, 1982, Marr, 1982] for their importance in vision). Detection of edge-like structures in the human visual system starts with orientation-sensitive cells in V1 [Hubel and Wiesel, 1969], and biological and machine vision systems depend on their reliable extraction and utilization [Marr, 1982, Koenderink and Dorn, 1982].

• Junction–like structures: Junctions are image patches where two or more edge-like structures with significantly different orientations intersect (see, e.g., [Guzman, 1968, Rubin, 2001, Shevelev et al., 2003] for their importance in vision). It has been suggested that the human visual system makes use of them for different tasks like recovery of surface occlusion [Guzman, 1968, Rubin, 2001] and shape interpretation [Malik, 1987, Shevelev et al., 2003]. It is known that junctions are detected in the primary visual cortex (see, e.g., [Shevelev et al., 1998]).

• Texture–like structures: Although there is no generally agreed definition, textures are often defined as image patches which consist of repetitive, random or directional structures (for their analysis, extraction and importance in vision, see, e.g., [Tuceryan and Jain, 1998]). Our world contains textures on many surfaces, and the fact that we can reliably reconstruct the 3D structure of any textured environment indicates that the human visual system is very good at the analysis and utilization of textures.
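The thesis formalizes this taxonomy with a continuous notion of intrinsic dimensionality (chapter 2). As a rough illustration of the underlying idea only — not the thesis's method — the eigenvalues of a patch's structure tensor can serve as a discrete proxy for the four classes; the thresholds and test patches below are arbitrary:

```python
import numpy as np

def classify_patch(patch, tau=0.1, alpha=0.5):
    """Crude structure-tensor proxy for the taxonomy above: both
    eigenvalues small -> homogeneous; one dominant -> edge-like;
    both large -> junction/texture-like. Thresholds are arbitrary."""
    gy, gx = np.gradient(patch.astype(float))
    J = np.array([[np.sum(gx * gx), np.sum(gx * gy)],
                  [np.sum(gx * gy), np.sum(gy * gy)]])
    l2, l1 = np.linalg.eigvalsh(J)        # ascending order: l2 <= l1
    if l1 < tau:
        return "homogeneous"
    if l2 < alpha * l1:
        return "edge-like"
    return "junction/texture-like"

flat = np.ones((9, 9))                                    # uniform intensity
edge = np.tile(np.r_[np.zeros(5), np.ones(4)], (9, 1))    # vertical step edge
corner = np.zeros((9, 9)); corner[:5, :5] = 1.0           # two orientations meet

print(classify_patch(flat))    # homogeneous
print(classify_patch(edge))    # edge-like
print(classify_patch(corner))  # junction/texture-like
```

A single dominant gradient orientation leaves one eigenvalue near zero (the aperture problem again), while junctions and textures excite both eigenvalues — which is why the taxonomy matters for the reliability of flow and disparity estimates.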


Early vision involves the acquisition of a set of visual modalities as well as the extraction of the local image structures (except for homogeneous image structures). These visual modalities include disparity, optical flow, texture information, occlusions etc., and, together with the local image structures, they carry the information necessary to interpret a scene.

Owing to purely local processing, early vision usually carries ambiguous, biased or false information. For example, the visual modalities face the correspondence problem, i.e., looking for the corresponding image features between the different views of a scene. Due to the correspondence problem, only the optic flow along the intensity gradient of an edge can be found; or, in the case of stereopsis, no disparity can be computed at weakly-structured image areas (see [Baker et al., 2001]).

The ambiguous and biased information from early vision is processed and integrated by global mechanisms at the stage of early cognitive vision (as defined in [Wörgötter et al., 2004]) in order to create more accurate, meaningful and complete visual entities. At this stage, the visual information is disambiguated by recurrent loops, attention and feedback from higher visual processing layers. Moreover, it is our belief that homogeneous image patches, which are neglected in early vision, are added back to visual processing at this stage.

Physiological evidence [Angelucci et al., 2002, Galuske et al., 2002] as well as computational models [Bayerl and Neumann, 2007, Bullier, 2001] already exist that study and support the usage of feedback mechanisms in the processing of different kinds of visual information.

The contributions of this thesis in the context of early vision and early cognitive vision are:

1. As mentioned already in section 1.1, the use of a simple feedback mechanism to improve junction detection through semantic interpretation (chapter 4). The extracted interpretation of detected junctions is used to remove outliers and select reliable detections.

2. Analysis of the extent of the ambiguity of visual information in the context of optical flow using natural image statistics (chapter 3).

3. Analysis of the relation between local image structures and local 3D structures. Such an analysis


is important for understanding possible mechanisms underlying interpolation processes (chapters 5 and 6).

4. As mentioned already in section 1.1, the proposal of a depth prediction model that uses lateral feedback between 3D features, extracted by a feature-based stereo algorithm, to interpolate depth at homogeneous image areas (chapter 7). With this contribution, this thesis becomes part of an early cognitive vision framework that so far includes edge features only [Krüger et al., 2003, Pugeault et al., 2006].

1.3 3D Reconstruction, the Correspondence Problem and Depth Interpolation

Depth cues can be classified as pictorial, or monocular (such as shading, the utilization of texture gradients, or linear perspective), and multi-view (like stereo and structure from motion) [Faugeras, 1993]. Depth cues which make use of multiple views require correspondences between the different 2D views of a scene. In contrast, pictorial cues use statistical and geometrical relations in one image to make statements about the underlying 3D structure.

Finding the correspondences between the different views of a scene means matching image points in one view to image points in other views that might have originated from the same 3D point. Junctions are the most distinctive local image features, which makes them suitable for finding correspondences. So are edge-like structures, unless they are parallel to the epipolar line, in which case correspondences cannot be found. As for homogeneous image areas, the correspondence problem is not solvable, or very difficult to solve, by direct methods, as there is no structure (see, e.g., [Baker et al., 2001] for a systematic evaluation). However, many surfaces have only weak texture or no texture at all. Nevertheless, humans are able to reconstruct 3D information for these surfaces, too. Existing psychophysical experiments (see, e.g., [Anderson et al., 2002, Collett, 1985, Julesz, 1971, Treue et al., 1995]) and computational theories (see, e.g., [Barrow and Tenenbaum, 1981, Grimson, 1982, Terzopoulos, 1988]) suggest that the human visual system realizes an interpolation process that, starting with the local analysis of edges, corners and textures, computes depth also in areas where correspondences cannot easily be found.
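Why direct matching fails at homogeneous areas can be illustrated with a toy one-dimensional SSD matcher; the scanline values, window size and disparity range below are made up for illustration (real stereo, such as the region-matching methods of section 7.5, compares 2-D windows along epipolar lines):

```python
import numpy as np

def ssd_costs(left, right, x, half, max_disp):
    """SSD matching cost of the left-scanline window at x against
    right-scanline windows shifted by each candidate disparity."""
    win = left[x - half : x + half + 1]
    return np.array([np.sum((win - right[x - d - half : x - d + half + 1]) ** 2)
                     for d in range(max_disp + 1)])

# Toy scanline pair: a step edge with a true disparity of 2 pixels.
right = np.array([5.0] * 8 + [9.0] * 8)
left = np.roll(right, 2)            # left view: same edge, shifted by 2

edge_costs = ssd_costs(left, right, x=10, half=2, max_disp=4)  # at the edge
flat_costs = ssd_costs(left, right, x=13, half=2, max_disp=4)  # homogeneous

print(np.argmin(edge_costs))  # 2: unique minimum at the true disparity
print(flat_costs)             # minimum is not unique: several disparities tie
```

At the step edge the cost has a single, well-defined minimum at the true disparity; inside the homogeneous region several disparities produce identical (zero) cost, so no reliable disparity can be assigned there.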

Processing of depth in the human visual system starts with the processing of local image structures

(such as edge-like structures, corner-like structures and textures) in V1 [Gallant et al., 1994, Hubel and Wiesel, 1969,


evolve rather late in the development of the human visual system. For example, pictorial depth cues are made use of only after approximately 6 months [Kellman and Arterberry, 1998]. This indicates that experience may play an important role in the development of these cues, i.e., that we have to understand depth perception as a statistical learning problem [Knill and Richards, 1996, Purves and Lotto, 2002, Rao et al., 2002]. A step towards such an understanding is the investigation and use of the statistical relations between the local image structures and the underlying 3D structure for each of these depth cues [Knill and Richards, 1996, Purves and Lotto, 2002, Rao et al., 2002].

This thesis distinguishes depth prediction from surface interpolation. Surface interpolation assumes that a dense depth map of the scene is already available, from which the 3D surface orientation at each point can be estimated and then used to complete the missing depth information (see, e.g., [Grimson, 1982, Grimson, 1984, Guy and Medioni, 1994, Lee and Medioni, 1998, Lee et al., 2002, Terzopoulos, 1982, Terzopoulos, 1988]). Depth prediction, as understood in this thesis, makes use only of 3D line orientations at edge segments, which are computed using the feature-based stereo algorithm proposed in [Pugeault and Krüger, 2003].

This thesis, in the context of 3D reconstruction, makes the following contributions:

1. Analysis of the relation between local image structures and local 3D structure, which is important for understanding the possible mechanisms underlying depth interpolation processes (chapters 5 and 6).

2. As already mentioned in sections 1.1 and 1.2, the proposal of a depth prediction model that uses lateral feedback between 3D features (extracted by a feature-based stereo algorithm) to interpolate depth at homogeneous image areas (chapter 7).
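As a deliberately oversimplified caricature of the depth-prediction idea — not the voting model of chapter 7, which additionally uses 3D line orientation, coplanarity and co-colority — depth at a structureless image point can be predicted as a distance-weighted combination of the depths of nearby edge features. All positions, depths and the weighting kernel below are hypothetical:

```python
import numpy as np

# Hypothetical 3D edge samples: image position (x, y) and depth z.
edges = np.array([
    [10.0, 20.0, 1.2],
    [40.0, 22.0, 1.4],
    [25.0, 60.0, 2.0],
])

def predict_depth(px, py, edges, sigma=15.0):
    """Gaussian distance-weighted average of nearby edge depths --
    prediction strength falls off with image distance, echoing the
    statistical finding of chapter 6."""
    dist = np.hypot(edges[:, 0] - px, edges[:, 1] - py)
    w = np.exp(-dist ** 2 / (2 * sigma ** 2))
    return np.sum(w * edges[:, 2]) / np.sum(w)

z = predict_depth(20.0, 25.0, edges)   # dominated by the two nearest edges
```

The prediction always stays within the range of the contributing edge depths, and a query point close to one edge inherits essentially that edge's depth; the full model replaces this naive weighting with votes from pairs of coplanar, co-colour edge primitives.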


1.4 Vision and Natural Image Statistics

The set of images that can be observed in nature is a very small subset of the images that could be constructed from arbitrary combinations of intensity values [Field, 1994]. This suggests that natural images bear intrinsic regularities, which are believed to be exploited by our visual system for perceiving the environment (see, e.g., [Krüger and Wörgötter, 2004]), especially for the purpose of resolving ambiguities inherent in the local processing of various visual modalities such as optic flow and disparity.

For example, it is widely acknowledged that the Gestalt principles of perceptual organization are the result of our visual system's adaptation to the statistical regularities in natural scenes. This hypothesis was first pointed out in [Brunswik and Kamiya, 1953], but could not be tested or justified until the 1990s due to insufficient computational means. In [Field et al., 1993], computer-generated, randomly-oriented data was used to develop a theory of contour grouping in the human visual system, called the association field.

In 1998, [Krueger, 1998] used natural images instead of computer-generated data to demonstrate the relation between grouping mechanisms and natural image statistics. Such investigations were extended in [Elder and Goldberg, 2002, Geisler et al., 2001, Krüger and Wörgötter, 2002], and the results were utilized in several computer vision tasks, including contour grouping, object recognition and stereo (see, e.g., [Elder et al., 2003, Pugeault et al., 2004, Zhu, 1999]).

Statistical regularities of natural images have also helped researchers understand the principles of sensory coding in the early stages of visual processing. It was shown that Independent Component Analysis and Principal Component Analysis of patches from natural images produce Gabor-wavelet-like patterns, which resemble the receptive fields of simple cells in V1 of the human visual system (see, e.g., [Jones and Palmer, 1987]).
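The PCA part of this observation is easy to reproduce in miniature. The toy image below stands in for a natural-image database, so the resulting components only hint at the oriented, wavelet-like filters reported in the literature; the image, patch count, patch size and random seed are all arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a natural image: smooth oriented structure plus noise.
xx, yy = np.meshgrid(np.arange(128), np.arange(128))
image = np.sin(0.3 * xx + 0.1 * yy) + 0.1 * rng.standard_normal((128, 128))

# Sample 2000 8x8 patches, flatten, and remove each patch's mean (DC).
rows = rng.integers(0, 120, 2000)
cols = rng.integers(0, 120, 2000)
patches = np.stack([image[i:i + 8, j:j + 8].ravel()
                    for i, j in zip(rows, cols)])
patches -= patches.mean(axis=1, keepdims=True)

# PCA = eigendecomposition of the patch covariance matrix;
# each row of `components`, reshaped to 8x8, is one spatial filter.
cov = patches.T @ patches / len(patches)
eigvals, eigvecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
components = eigvecs[:, ::-1].T               # sorted by decreasing variance

print(components.shape)                        # (64, 64)
```

On large natural-image databases the leading components become oriented band-pass filters; ICA sharpens this further into localized, Gabor-like receptive fields.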

The availability of relatively cheap range scanners has made it possible to analyze the statistical properties of the 3D world together with its 2D image projections. Such analyses are important (1) for quantifying and understanding the assumptions that vision researchers have been making, and (2) for understanding the intrinsic properties of the 3D world. In [Yang and Purves, 2003, Huang et al., 2000, Potetz and Lee, 2003], the correlation between the properties of 3D surfaces (like roughness, 3D orientation, distance, size, curvature etc.) and the intensity of the images is analyzed. Such studies


chapter 5 for details). Moreover, range image statistics allow the explanation of several visual illusions [Howe and Purves, 2002, Howe and Purves, 2004].

[Krüger and Wörgötter, 2004] provides a summary of the evidence from developmental psychology suggesting that depth extraction based on the statistical regularities used in perceptual organization develops at a later stage than depth extraction based on stereopsis and motion. In particular, it is discussed that perceptual organization based on edge structures is in place after approximately 6 months of visual experience but not before [Kellman and Arterberry, 1998, Spelke, 1993], as also mentioned in the previous section. This suggests that the detection of statistical regularities in visual scenes plays an important role in the establishment of such abilities.

This thesis provides natural image statistics (some of which have already been mentioned in sections 1.2 and 1.3) regarding several visual processing phenomena. Chapter 3 investigates the extent of the aperture problem depending on local image structures, and the quality of several optical flow algorithms, using ground-truth optical flow. In chapter 5, the relation between local image structures and the underlying local 3D structure is analyzed. Chapter 6 asks whether the depth at homogeneous image areas can be predicted from the depth of edge-like structures. The results provided in chapter 6 are important for understanding the possible mechanisms underlying depth interpolation processes, and they motivate the depth prediction model provided in chapter 7.

1.5 Outline and Contributions

In this section, the contributions of the thesis are summarized, and the relevant publications of the author are listed.

• Chapter 2 provides background information about the continuous definition of intrinsic dimensionality that is used throughout the thesis to distinguish between different local image structures. Moreover, this chapter introduces the visual features, called primitives, that represent different local image structures.


Relevant publication from the author: [Felsberg et al., 2007b].

• Chapter 3 analyzes the quality of different optical flow algorithms depending on the local image structure. This analysis provides insight into the extent of the aperture problem for different image structures. The chapter proposes intrinsic dimensionality as a new tool for better analyzing the inherent properties of optic flow algorithms depending on the local image structures.

Relevant publications from the author: [Kalkan et al., 2004a, Kalkan et al., 2004b, Kalkan et al., 2005].

• Chapter 4 discusses the problems of junction detection methods, in relation to their sensitivity to contrast, and proposes a local feedback mechanism for improving the quality of any junction detection method. The feedback comes from the condensed description, i.e., the semantic interpretation of the junctions, which is used to differentiate true positives from false positives. The chapter presents results on real examples showing the usefulness of such a feedback mechanism for different junction detection methods.

Relevant publications from the author: [Kalkan et al., 2007f, Pilz et al., 2007]

• Chapter 5 uses chromatic range data to investigate the likelihood of observing a certain local 3D structure, given its 2D projection. The results justify a widely used assumption called 'no news is good news'. This assumption basically states that two image points which do not have any contrast difference in-between can be assumed to be on the same surface. This chapter challenges this assumption by showing that most contrast differences also form continuous surfaces.

Relevant publications from the author: [Kalkan et al., 2006, Kalkan et al., 2007c].

• Chapter 6 investigates whether depth at homogeneous image areas can be predicted from the depth of edge-like structures. It shows that an edge segment in the neighborhood of a homogeneous image patch can predict the depth at the homogeneous image patch. The strength of this prediction is shown to decrease with distance and to increase with the existence of a second coplanar edge segment. This investigation is important for understanding possible mechanisms that might underlie depth interpolation.

Relevant publications from the author: [Kalkan et al., 2007d, Kalkan et al., 2007c].

• Chapter 7, motivated by the statistics provided in chapter 6, develops a voting model that predicts depth at homogeneous image areas from the depth of edge-like structures. The model is able


Kraft et al., 2007, Başeski et al., 2007, Kalkan et al., 2008]

The list of accepted publications:

Citation                  Year  Journal/Conference Title                      Publication Type
[Kalkan et al., 2004a]    2004  Brain Inspired Cognitive Systems              Conference
[Kalkan et al., 2004b]    2004  Dynamic Perception Workshop                   Workshop
[Kalkan et al., 2005]     2005  Network: Computation in Neural Systems        Journal
[Kalkan et al., 2006]     2006  IEEE Computer Vision and Pattern Recognition  Conference
[Kalkan et al., 2007d]    2007  Computer Vision Theory and Applications       Conference
[Kalkan et al., 2007f]    2007  Computer Vision Theory and Applications       Conference
[Kalkan et al., 2007b]    2007  Maersk Institute, Uni. of Southern Denmark    Technical Report
[Kalkan et al., 2007a]    2007  Maersk Institute, Uni. of Southern Denmark    Technical Report
[Kjargaard et al., 2007]  2007  Maersk Institute, Uni. of Southern Denmark    Technical Report
[Kalkan et al., 2007c]    2007  Network: Computation in Neural Systems        Journal
[Başeski et al., 2007]    2007  3D Representation for Recognition             Workshop
[Pilz et al., 2007]       2007  Int. Symposium on Visual Computing            Conference
[Kraft et al., 2007]      2007  International Journal of Humanoid Robotics    Journal
[Kalkan et al., 2008]     2008  Computer Vision Theory and Applications       Conference

The list of submitted or in-preparation publications:

Citation                  Year  Journal/Conference Title    Publication Type
[Felsberg et al., 2007b]  2007  Image and Vision Computing  Journal


Chapter 2

Background

This chapter presents two crucial tools that are used throughout the thesis. Section 2.1 describes the concept of intrinsic dimensionality, which is used in this thesis for distinguishing between different kinds of local image structures, and section 2.2 briefly introduces the local homogeneous and edge-like features.

2.1 A Continuous Definition of Intrinsic Dimensionality

In image processing, intrinsic dimensionality (iD) was introduced by [Zetzsche and Barth, 1990] and was used to formalize a discrete distinction between homogeneous, edge-like and junction-like structures. This corresponds to a classical interpretation of local image structures in computer vision.

Homogeneous, edge-like and junction-like structures are classified by iD as intrinsically zero dimensional (i0D), intrinsically one dimensional (i1D) and intrinsically two dimensional (i2D), respectively.

The spectral representation of a local image patch (see figure 2.1(a,b)) reveals that the energy of an i0D signal is concentrated at the origin (figure 2.1(b)-top), the energy of an i1D signal is concentrated along a line (figure 2.1(b)-middle), while the energy of an i2D signal varies in more than one dimension (figure 2.1(b)-bottom).

It has been shown [Felsberg and Krüger, 2003, Krüger and Felsberg, 2003, Felsberg et al., 2007b] that the structure of the iD can be understood as a triangle that is spanned by two measures: origin variance and line variance. Origin variance describes the deviation of the energy from a concentration at the origin, while line variance describes the deviation from a line structure (see figures 2.1(b) and 2.1(c)).



The triangular structure of the intrinsic dimension is counter-intuitive at first, since it realizes a two-dimensional topology in contrast to the linear one-dimensional structure that is expressed in the discrete counting 0, 1 and 2. As shown in [Krüger and Felsberg, 2003, Felsberg and Krüger, 2003, Felsberg et al., 2007b], this triangular interpretation allows for a continuous formulation of iD in terms of 3 confidences assigned to each discrete case. This is achieved by first computing two measurements of origin and line variance which define a point in the triangle (see figure 2.1(c)). The barycentric coordinates (see, e.g., [Coxeter, 1969]) of this point in the triangle directly lead to a definition of three confidences that add up to one:

c_i0D = 1 − x,
c_i1D = x − y,                                        (2.1)
c_i2D = y.

These three confidences reflect the areas of the three sub-triangles which are defined by the point in the triangle and the corners of the triangle (see figure 2.1(c)). For example, for an arbitrary point P in the triangle, the area of the sub-triangle i0D-P-i1D denotes the confidence for i2D, as shown in figure 2.1(c). That leads to the decision areas for i0D, i1D and i2D as seen in figure 2.1(d). See appendix A and [Felsberg et al., 2007a] for more details.
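Equation (2.1) can be transcribed directly; the sketch below is illustrative only, taking x and y to be the origin-variance and line-variance coordinates of the point in the triangle (figure 2.1(c)), with the function name being an assumption of this sketch:

```python
# A minimal sketch of equation (2.1): mapping a point in the iD triangle,
# given by x (origin variance) and y (line variance) with 0 <= y <= x <= 1,
# to the three barycentric confidences (c_i0D, c_i1D, c_i2D).

def id_confidences(x, y):
    """Return (c_i0D, c_i1D, c_i2D) for a point inside the iD triangle."""
    assert 0.0 <= y <= x <= 1.0, "the point must lie inside the triangle"
    c_i0d = 1.0 - x   # equation (2.1), first confidence
    c_i1d = x - y
    c_i2d = y
    return c_i0d, c_i1d, c_i2d

# The corner points recover the discrete cases, and the three confidences
# always add up to one:
print(id_confidences(0.0, 0.0))  # pure i0D: (1.0, 0.0, 0.0)
print(id_confidences(1.0, 0.0))  # pure i1D: (0.0, 1.0, 0.0)
print(id_confidences(1.0, 1.0))  # pure i2D: (0.0, 0.0, 1.0)
```

Interior points such as (0.6, 0.2) yield mixed confidences, which is exactly what the continuous formulation is designed to express.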

For the example image in figure 2.1, the computed iD is shown in figure 2.2.

Figure 2.3 shows how a set of example local image structures maps onto the iD triangle. The figure shows that different visual structures map to different areas in the triangle. A detailed analysis of how 2D structures are distributed over the intrinsic dimensionality triangle and how some visual information depends on this distribution can be found in chapters 3 and 5 and references [Kalkan et al., 2005, Kalkan et al., 2006].

This thesis proposes intrinsic dimensionality as a new tool for analyzing the inherent properties of different image structures using the intrinsic dimensionality triangle. In chapter 3, this is performed for the analysis of the distribution of local image structures and the quality of different optic flow algorithms. Chapter 5 uses the iD triangle for the analysis of the relation between local 2D and 3D structures.

2.2 Multi-modal Visual Features – Primitives

This thesis extensively utilizes primitives, which are local, multi-modal visual feature descriptors that were introduced in [Krüger et al., 2004b]. They are semantically and geometrically meaningful descriptions of local image patches, motivated by the hyper-columnar structures in V1 ([Hubel and Wiesel, 1969]).

Primitives can be edge-like or homogeneous, and either 2D or 3D. For edge-like primitives, the corresponding 3D primitive is extracted using stereo. As for homogeneous primitives, the 3D primitive is estimated from the 3D edge-like primitives, which is the topic of chapter 7.

An edge-like 2D primitive is defined as:

π_e = (x, θ, ω, (c_l, c_m, c_r), f),                  (2.2)

where x is the image position of the primitive; θ is the 2D orientation; ω represents the contrast transition; (c_l, c_m, c_r) is the representation of the color, corresponding to the left (c_l), the middle (c_m) and the right side (c_r) of the primitive; and f is the optical flow extracted using the Nagel-Enkelmann optic flow algorithm [Nagel and Enkelmann, 1986].

As the underlying structure of a homogeneous image patch is different from that of an edge-like patch, a different representation is needed for homogeneous 2D primitives (called monos in this thesis):

π_m = (x, c),                                         (2.3)

where x is the image position, and c is the color of the mono¹.

See [Krüger et al., 2007] for more information about these modalities and their extraction. Figure 2.4 shows extracted primitives for an example scene.
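As an illustration only, the 2D primitives of equations (2.2) and (2.3) could be held in containers like the following; all class and field names are assumptions of this sketch and not part of the original feature-extraction system:

```python
# Illustrative containers for the 2D primitives pi_e and pi_m.
from dataclasses import dataclass
from typing import Tuple

Color = Tuple[float, float, float]

@dataclass
class EdgePrimitive2D:              # pi_e = (x, theta, omega, (c_l, c_m, c_r), f)
    x: Tuple[float, float]          # image position
    theta: float                    # 2D orientation
    omega: float                    # contrast transition
    color: Tuple[Color, Color, Color]   # left, middle and right side colors
    flow: Tuple[float, float]       # optical flow f

@dataclass
class Mono2D:                       # pi_m = (x, c)
    x: Tuple[float, float]          # image position
    color: Color                    # color of the mono

pi_e = EdgePrimitive2D((10.0, 20.0), 0.5, 0.8,
                       ((0.1,) * 3, (0.5,) * 3, (0.9,) * 3), (1.0, 0.0))
pi_m = Mono2D((40.0, 40.0), (0.2, 0.3, 0.4))
```

The structural asymmetry between the two classes mirrors the point made in the text: a mono carries no orientation or contrast transition, because a homogeneous patch has none.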

π_e is a 2D feature which can be used to find correspondences in a stereo framework to create 3D primitives (as introduced in [Krüger and Felsberg, 2004, Pugeault et al., 2006]) which have the following

¹For analyzing shape from shading, a representation of the local intensity variance can be included in a further study.


side (c_r) of the 3D primitive.

In chapter 7, we estimate the 3D representation Π_m of monos, which stereo fails to compute due to the correspondence problem:

Π_m = (X, n, c),                                      (2.5)

where X and c are as in equation 2.3, and n is the orientation (i.e., normal) of the plane that locally represents the mono.

2.3 Acknowledgements

Section 2.1 is a product of collaboration with Michael Felsberg and is published in a co-authored journal article [Felsberg et al., 2007b].



Figure 2.1: Illustration of iD (sub-figures (a,b) are taken from [Felsberg and Krüger, 2003]). (a) Three image patches for three different intrinsic dimensions. (b) The 2D spatial frequency spectra of the local patches in (a), from top to bottom: i0D, i1D, i2D. (c) The topology of iD. Origin variance is variance from a point, i.e., the origin. Line variance is variance from a line, measuring the junctionness of the signal. c_iND for N = 0, 1, 2 stands for the confidence for being i0D, i1D and i2D, respectively. The confidences for an arbitrary point P are shown in the figure, reflecting the areas of the sub-triangles defined by P and the corners of the triangle. (d) The decision areas for local image structures.


Figure 2.2: Computed iD for the image in figure 2.1; black means zero and white means one. From left to right: c_i0D, c_i1D, c_i2D, and the highest confidence marked in gray, white and black for i0D, i1D and i2D, respectively.


Figure 2.3: How a set of 54 patches maps to the different areas of the intrinsic dimensionality triangle. Some examples from these patches are also shown. The horizontal and vertical axes of the triangle denote the contrast and the orientation variances of the image patches, respectively.



Figure 2.4: Extracted primitives (b) for the example image in (a). Magnified edge primitives, and edge primitives together with monos, are shown in (c) and (d), respectively.


Chapter 3

Local Image Structures and Optic Flow Estimation

As mentioned in section 1.2, optic flow information in early vision is ambiguous. This ambiguity can be resolved by using the flow information available at the junction-like structures in early cognitive vision. Such a disambiguation has been modeled as a feedback mechanism in [Bayerl and Neumann, 2007].

This chapter investigates the extent of the ambiguity in optic flow estimation and analyzes it for different local image structures. Namely, the continuous definition of intrinsic dimensionality introduced in section 2.1 is used to investigate (1) the quality of different optic flow estimation algorithms depending on the underlying local image structure and (2) the distribution of signals in natural images according to their intrinsic dimensionality. The results suggest that the quality of optic flow estimation and the underlying local image structure are strongly linked.

Regarding the distribution of signals, the chapter shows that:

D0. i0D signals split into two clusters: one peak corresponding to over-illuminated (white) or under-illuminated (black) patches, and a Gaussian-shaped cluster corresponding to image noise at homogeneous but not under- or over-illuminated image patches (see figure 3.1(a)).

D1. For i1D signals, there exists a concentration of signals in a stripe-shaped cluster corresponding to high origin variance (high amplitude) and low line variance (see figure 3.1(a)). This also reflects the


Figure 3.1: (a) Schematic representation of the distribution of local image patches in natural images according to their intrinsic dimension. (b) Schematic representation of the quality of optic flow estimation according to the intrinsic dimension of the underlying signal.

importance of an orientation criterion that is based on local amplitude and orientation information (see, e.g., [Princen et al., 1990]).

D2. In contrast to the i0D and i1D cases, there exists no cluster for i2D signals but there is a smoothly decreasing surface towards the i2D corner. This continuity in the distribution for the i2D case indicates that it is rather difficult to formulate a purely local criterion to detect corners in natural images.

Optic flow in early vision is ambiguous because local estimation of optic flow faces the well-known aperture problem: through an aperture, the true flow is observable only for two-dimensional structures, i.e., corner-like structures, ends of edges and some kinds of textures. As for edge-like structures, only the flow along the intensity gradient can be computed.

The properties of optic flow estimation at homogeneous image patches, edges and corners have been discussed extensively (see, e.g., [Barron et al., 1994, Zetzsche et al., 1991, Mota and Barth, 2000]). It has been argued that many different motion detectors specialised to particular image structures exist in human vision (for a discussion, see [Cavanagh and Mather, 1989, Johnston and Clifford, 1995]). In general, it is acknowledged that:

A0. Optic flow estimates at homogeneous image patches tend to be unreliable as the lack of structure makes it impossible to find correspondences in consecutive frames.


algorithms fail to estimate the true motion. In order to get the true motion field, flow algorithms need to deal with at least two different motions in the local area [Bayerl and Neumann, 2007].

This chapter investigates these claims more closely for several optic flow algorithms (namely, Nagel-Enkelmann [Nagel and Enkelmann, 1986], Lucas-Kanade [Lucas and Kanade, 1981] and a phase-based approach from [Gautama and Hulle, 2002]). It will be shown that the continuous formulation of intrinsic dimensionality allows for a better quantitative investigation and characterization of the quality of optic flow estimation (and hence, of the optic flow properties as stated in A0-A2) depending on the local signal structure. Namely:

• The algorithms that have been tested in this chapter all had problems with local image structures that were very close to the i0D corner of the iD-triangle (see figure 3.1(b)).

• The performance for image structures in the stripe-shaped cluster corresponding to edge-like structures was affected by the aperture problem (see figure 3.1(b)). However, the results depend both quantitatively and qualitatively on the different algorithms, and even on different parameters when the same algorithm was used.

• The improvement of performance for signals in the i2D area of the iD triangle was visible but small. Average performance increases smoothly and slightly towards the i2D corner (see figure 3.1(b)).

These results support the above-mentioned statements (A0)-(A2) about optic flow estimation. However, by making use of a continuous understanding of intrinsic dimensionality, these statements have been made quantitatively more specific in terms of (1) the characterization of sub-areas for which they hold and (2) their strength. The analysis in this chapter suggests a relationship between the distribution of the signals in the continuous intrinsic dimensionality space and properties of optic flow estimation. In this way, a new tool for better analysis of optic flow algorithms is introduced.


There have been other works analyzing errors in optic flow estimates [Fermueller et al., 2001, Simoncelli et al., 1991, Nagel and Haag, 1998]. In [Simoncelli et al., 1991], using a probabilistic framework for estimating optical flow, it is proven that uncertainty is involved in this estimation process due to several causes such as image noise and inherent limits of motion estimation. In [Nagel and Haag, 1998], it is shown that gradient-based motion estimation methods underestimate the true flow. In [Fermueller et al., 2001], too, it is analytically shown that certain kinds of bias in different classes of optic flow algorithms, caused by noise in the image data, usually lead not only to an underestimation of the magnitude of optical flow but also to a consistent bias in the estimation of the direction. In contrast to the investigations in [Fermueller et al., 2001, Simoncelli et al., 1991, Nagel and Haag, 1998], this chapter is interested in the quality of flow estimates depending on the local image structure. This is achieved not by analytic means but by statistical comparisons using ground truth data.

3.1 Distribution of Local Image Structures

The distribution of local image structures is analyzed using a set of 7 natural sequences with 10 images each (see figure 3.5). The images have a resolution of 1276×1016. For the analysis, the origin and the line variance are computed for each pixel (for details see section 2.1). This corresponds to one point in the iD triangle (figure 2.1(c)). The distribution of the frequency of these points in the triangular structure is shown in figure 3.2(a). Since there exist large differences in the histogram, only the logarithm is shown.

The distribution shows two main clusters. The peak close to the origin corresponds to low origin variance. It is visible that most of the signals that have low origin variance have high line variance.

These correspond to nearly homogeneous image patches. Since the orientation is almost random for such homogeneous image patches, it causes high line variance. There is also a small peak at position (0,0) that corresponds to saturated/black image patches. The other cluster is for high origin variance signals with low line variance, corresponding to edge-like structures. The form of this cluster is a small horizontal stripe rather than a peak. Finally, there is a smooth decrease while approaching the i2D area of the triangle. That means that there does not exist a cluster for corner-like structures like the ones for homogeneous image patches or edges. Along the origin variance axis, a small continuous gap is observed. This gap suggests that there are no signals with zero line variance. This is due to the fact that at positions with positive origin variance (i.e., positive magnitude), there is always noise included, which


Figure 3.2: Logarithmic plot of the distribution of intrinsic dimensionality. (a) The distribution for regularly sampled points. (b) The distribution when the positions are modified according to iD (see the text for details of this modification).

causes some line variance.

Also seen from the figure is that there are far more i0D signals than i1D or i2D signals. Besides, it is clear that there are more i1D structures than i2D structures in natural images. The percentages of i0D, i1D and i2D structures turned out to be 86%, 11% and 3%, respectively, in the natural images that have been used.
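The origin and line variance used above come from the formulation of section 2.1; as a rough, hedged stand-in, the classical 2D structure tensor yields qualitatively similar coordinates. This substitution, and all names below, are assumptions of this sketch, not the measure used in the thesis:

```python
# A rough stand-in for the two iD coordinates: origin variance is
# approximated by the (squashed) local gradient energy, line variance by
# one minus the orientation coherence of the 2D structure tensor.
import numpy as np

def id_coordinates(patch, eps=1e-9):
    gy, gx = np.gradient(patch.astype(float))
    jxx, jyy, jxy = (gx * gx).mean(), (gy * gy).mean(), (gx * gy).mean()
    trace = jxx + jyy
    # coherence: 1 for a perfect single orientation, 0 for isotropy
    coherence = np.sqrt((jxx - jyy) ** 2 + 4 * jxy ** 2) / (trace + eps)
    origin_variance = trace / (trace + 1.0)   # squashed into [0, 1)
    line_variance = 1.0 - coherence
    return origin_variance, line_variance

flat = np.ones((9, 9))                          # homogeneous patch
edge = np.tile(np.linspace(0, 90, 9), (9, 1))   # strong vertical ramp/edge
print(id_coordinates(flat))   # low origin variance: towards the i0D corner
print(id_coordinates(edge))   # high origin, low line variance: towards i1D
```

Consistent with the text, the homogeneous patch lands near zero origin variance (where line variance is meaningless, as the triangle collapses), while the strong edge lands in the high-origin-variance, low-line-variance stripe.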


Figure 3.3: Illustration of positioning for an edge. (a) Without positioning. (b) With positioning as explained in the text.

Edges or corners are structures that are bound to a specific position. For example, the position of an edge is supposed to be placed directly on the image discontinuity; or, for a corner, in which a certain number of lines intersect, the corner should be placed directly on the intersection. This positioning can be achieved by making use of the local amplitude information in the image depending on the intrinsic dimensionality, which is described in detail in [Krüger et al., 2004a] (see figure 3.3). Note also that


features such as orientation and optic flow depend on this positioning. When the positions of edge- and corner-like structures are determined accordingly, the distribution of local image structures becomes as shown in figure 3.2(b). It is qualitatively similar to the distribution achieved with regular sampling. However, since the position is determined depending on the local amplitude (and in this way by maximizing origin variance; see [Krüger and Felsberg, 2003]), there is a shift towards positions with higher amplitude that constitutes the gaps at the border between i0D, i1D and i2D signals and the stripe along the i1D-i2D border of the triangle. In the later stages of the analysis in this chapter, this positioning is adopted.

Zetzsche and his colleagues also investigated the distribution of local image structures in [Wegmann and Zetzsche, 1990].

They analyzed the multi-dimensional hyperspace which was constructed from all possible combinations of orientation filter outputs. The hyperspace consisted of m axes corresponding to m different orientations, such that the origin denoted the homogeneous signals; the axes and the planes between the neighboring axes denoted the i1D structures; and the planes between the non-neighboring axes denoted i2D signals.

Zetzsche and his colleagues could derive proportions of the different local structures (which basically reflect the percentages provided above) and visualize clusters of the structures for a few orientation pairs.

Due to the complexity of the hyperspace, however, the visualization becomes more complex than the triangular representation of iD.

3.2 Distribution of Orientation of Local Image Structures


Figure 3.4: Orientation distribution depending on iD. The first image shows the total distribution. The sequences that have been used for this analysis are introduced in section 3.1.


that the distribution of orientations of i0D and i2D signals should be homogeneous.

The distribution of the orientation of signals and the quantitative differences depending on the intrinsic dimensionality of the patches are displayed in figure 3.4. The figure shows that there are significant peaks for the i0D and i2D signals, although they are smaller than the peaks in the distribution of i1D signals. This suggests that orientation is a meaningful concept for some non-i1D signals, too. This also stresses the advantages of a continuous understanding of intrinsic dimensionality.

3.3 Optic Flow Estimation Algorithms

This section briefly describes the optic flow algorithms that have been used in this chapter.

3.3.1 The Lucas-Kanade Algorithm

The Lucas-Kanade algorithm works by minimizing the following functional [Lucas and Kanade, 1981] over a spatial neighborhood Ω:

∬_Ω W²(x,y) [∇I(x,y,t) · v + I_t(x,y,t)]² dx dy,      (3.1)

where W(x,y) is the window function over Ω that gives more influence to constraints at the center of the neighborhood; ∇I(x,y,t) denotes the intensity gradient at time t at spatial location (x,y); v is the velocity field to be found; and I_t denotes the derivative of I with respect to t. Basically, the Lucas-Kanade algorithm makes use of the well-known gradient constraint equation ∇Iᵀ · v + I_t = 0, where weighting is performed over a local neighbourhood.

Lucas-Kanade is an optic flow algorithm which uses first-order derivatives. Due to its lower complexity compared with the other algorithms, it is known to be fast.
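The minimization in equation (3.1) can be sketched at a single pixel as a small least-squares problem; the sketch below uses a uniform window W for simplicity, and the function name and toy data are assumptions of this sketch, not the implementation used in the thesis:

```python
# A minimal single-pixel Lucas-Kanade estimate: stack the gradient
# constraint Ix*vx + Iy*vy + It = 0 over a window and solve in the
# least-squares sense.
import numpy as np

def lucas_kanade_at(I1, I2, row, col, half=2):
    """Estimate the flow vector v = (vx, vy) for the window around (row, col)."""
    I1 = I1.astype(float); I2 = I2.astype(float)
    Iy, Ix = np.gradient(I1)            # spatial derivatives of the first frame
    It = I2 - I1                        # temporal derivative
    sl = (slice(row - half, row + half + 1), slice(col - half, col + half + 1))
    A = np.stack([Ix[sl].ravel(), Iy[sl].ravel()], axis=1)
    b = It[sl].ravel()
    v, *_ = np.linalg.lstsq(A, -b, rcond=None)   # uniform W for simplicity
    return v

# Toy example: a 2D Gaussian blob shifted by one pixel to the right.
y, x = np.mgrid[0:32, 0:32]
blob = lambda cx: np.exp(-((x - cx) ** 2 + (y - 16) ** 2) / 20.0)
v = lucas_kanade_at(blob(15), blob(16), 16, 16, half=3)
print(v)   # approximately (1, 0): rightward motion of one pixel
```

Windows whose system matrix is (near-)singular, i.e., homogeneous or purely edge-like patches, are exactly where statements A0 and the aperture problem predict unreliable or normal-flow-only estimates.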


3.3.2 The Nagel–Enkelmann Algorithm

The Nagel–Enkelmann algorithm [Nagel and Enkelmann, 1986] also makes use of the gradient constraint equation but applies a second-order derivative constraint in addition. The following functional is minimized:

∬ (∇Iᵀv + I_t)² + (α² / (‖∇I‖₂² + 2δ)) [(u_x I_y − u_y I_x)² + (v_x I_y − v_y I_x)² + δ(u_x² + u_y² + v_x² + v_y²)] dx dy,      (3.2)

where α and δ are constants; u and v are respectively the horizontal and the vertical components of the velocity vector v; and, for a function F, F_z denotes the partial derivative of F with respect to the variable z.

The main terms of the formula are (u_x I_y − u_y I_x)² + (v_x I_y − v_y I_x)² and (u_x² + u_y² + v_x² + v_y²). The first term smoothes the velocity anisotropically, i.e., orthogonal to the intensity gradient. The second, isotropic term states that the velocity should be constant over position¹.

Since the Nagel–Enkelmann algorithm can be interpreted as a diffusion process (see [Alvarez et al., 2000]) with a fixed number of iterations, an increase in the number of iterations means an increase in the region of influence used in the computation, and hence the use of more global information. The Nagel-Enkelmann algorithm encourages slow variations in the gradient of the vector field through the smoothing term in equation 3.2. With an increasing number of iterations (i.e., increasing diffusion), this naturally leads to a more regular distribution of directions (as visible in the first two rows of figure 3.6). In this chapter, the effect of using more global information on the accuracy of the flow estimation is also investigated.

3.3.3 The Phase-Based Approach

Phase-based optic flow algorithms make use of the phase gradient for finding the flow. It has been shown that the temporal evolution of contours of constant phase provides a better approximation to the local flow (see, e.g., [Fleet and Jepson, 1990]). The basic assumption is that phase contours should be constant over time [Fleet and Jepson, 1990, Gautama and Hulle, 2002]. This assumption can be formulated as φ(x,y,t) = c, where φ(x,y,t) denotes the phase component at spatial location (x,y) at time t. Taking differentiation

¹In our simulations, the standard values 0.5 and 1.0 for α and δ, respectively, are used, as suggested and usually practiced in the literature (see, e.g., [Barron et al., 1994]).


which the constraint (3.3) is solved for a number of Gabor filters and the flow orthogonal to the orientation of each filter is found. Combining the solutions reached by the filters yields the true displacement.

This chapter will show that, in this way, good optic flow can be estimated even for a large number of i0D signals.
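The phase-constancy assumption can be illustrated with a toy 1D example; the single complex Gabor filter below recovers the component velocity from the temporal phase difference. This is a deliberate simplification (an assumption of the sketch): the cited approach combines many filters of different orientations to recover the full 2D displacement.

```python
# Toy illustration of phase constancy phi(x, t) = c for a 1D translating
# signal, using one complex Gabor filter tuned to the signal frequency.
import numpy as np

k = 0.5                        # spatial frequency of the signal
true_v = 2.0                   # translation in pixels per frame
x = np.arange(256, dtype=float)
frame = lambda t: np.cos(k * (x - true_v * t))

# Complex Gabor filter (Gaussian envelope, carrier frequency k)
xs = np.arange(-32, 33, dtype=float)
gabor = np.exp(-xs**2 / (2 * 10.0**2)) * np.exp(1j * k * xs)

phase_at = lambda t: np.angle(np.convolve(frame(t), gabor, mode="same")[128])
# The phase moves with the signal, so the component velocity follows from
# the (wrapped) temporal phase difference divided by the frequency k.
dphi = np.angle(np.exp(1j * (phase_at(1) - phase_at(0))))
v_est = -dphi / k
print(v_est)   # close to true_v = 2.0
```

Note that, unlike intensity-based constraints, the phase response remains informative even when the local amplitude is low, which is one intuition behind the good behavior of the phase-based approach on near-i0D signals reported above.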

Figure 3.5: Some of the image sequences used in our analysis. The first 3 images are from one of the sequences (the starting image, the middle image and the last image). The remaining images are from the other sequences.

3.4 Optic Flow Estimation

This section analyzes the distribution of optic flow direction (subsection 3.4.1) and the error of optic flow estimation and its relation to the iD triangle (subsection 3.4.2).


3.4.1 Optic Flow Direction

The distribution of the flow direction of the optic flow vectors (using the Nagel–Enkelmann algorithm with 10 and 100 iterations, and the phase-based approach) is shown in figure 3.6.

The distribution of the direction varies significantly with the intrinsic dimensionality. The statistics of the true flow can be expected to show some homogeneity since a translational forward motion is dominant in the sequences that leads to a regular flow field (see, e.g., [Lappe et al., 1999]). A detailed discussion of first order statistics of optic flow in natural scenes can be found in [Calow et al., 2004].

They showed that the main factor for irregularity is that the large amount of structure in the lower visual field, as compared to the lack of structure in the upper visual field, causes larger flow in the lower visual field. This, however, does not affect the magnitude but only the orientation. However, for the Nagel–Enkelmann algorithm with 10 iterations (figure 3.6, top row), the distribution of the direction of optic flow vectors of i1D signals directly reflects the distribution of orientation of i1D signals. Since only the normal flow can be computed for ideal i1D signals (using local information only), the dominance of vertical and horizontal orientations (see section 3.2) leads to peaks at horizontal and vertical flows. The fact that there basically exists a direct quantitative equivalence between the distribution of i1D orientations and the distribution of optic flow directions reflects the seriousness of the aperture problem. In contrast, the distribution of direction of optic flow vectors of i0D and i2D signals is much more homogeneous. When the number of iterations is increased (and hence, more global information is used in the computation of the flow, as explained in section 3.3), the peaks that correspond to horizontal and vertical lines become smaller (figure 3.6, middle row). For the phase-based approach and Lucas-Kanade, a different picture occurs (figure 3.6, last two rows): the peaks are less apparent.

In summary, figure 3.6 suggests that there is a relation between the direction of estimated optic flow and the orientation distribution of signals. However, the strength of this relation depends on the particular algorithm and its parameters. For example, when the information used is very local, the Nagel–Enkelmann algorithm basically computes the normal flow, which results in a strong relation between the distribution of optic flow directions and the distribution of orientations in the images. However, when the number of iterations is increased, this relation becomes weaker because of the decrease of the aperture effect due to using more global information.



Figure 3.6: Distribution of direction of optic flow vectors depending on the intrinsic dimension. The histograms show the summed-up distributions over the sequences which are introduced in section 3.1. From top to bottom: the Nagel–Enkelmann algorithm with 10 iterations; the Nagel–Enkelmann algorithm with 100 iterations; the phase-based approach; the Lucas-Kanade algorithm. From left to right: the total distribution; the distribution for i0D signals; the distribution for i1D signals; the distribution for i2D signals.


3.4.2 Analysis of Quality of Optic Flow Estimation

This subsection analyzes the quality of optic flow estimation depending on the intrinsic dimension.

For this, the computed flow needs to be compared with ground truth. To this end, the Brown Range Image Database (BRID), a database of 197 range images collected by Ann Lee, Jinggang Huang and David Mumford at Brown University (see also [Huang et al., 2000]), is used. The range images were recorded with a laser range-finder². The data of each point consist of 4 values: the distance, the horizontal angle and the vertical angle in spherical coordinates, and a value for the reflected intensity of the laser beam (see figure 3.7). The knowledge about the 3D data structure allows for a simulation of a moving camera in a scene and is used to estimate the correct flow for nearly all pixel positions of a frame of an image sequence. It should be noted that this approach cannot produce correct flow for occluded areas.

The simulated motion is forward translation. Other motion types (such as pure rotation, or combined rotation and translation) produce different global flow fields. Therefore, the results in this chapter are valid only for translational motion, and other types of motion should be expected to yield quantitatively, if not qualitatively, different results.

Different flow estimation algorithms yield flow fields with different densities; i.e., they estimate the motion only for a certain proportion of the image data. For this analysis, the parameters of the flow algorithms used in this chapter were adjusted to make the flow fields as dense as possible, which turned out to be on average 100%, 90% and 86% for the Nagel–Enkelmann algorithm, the Lucas–Kanade algorithm and the phase-based approach, respectively.
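The density of a flow field can be measured as the fraction of pixels carrying a valid estimate; a minimal sketch, assuming invalid estimates are marked with NaN:

```python
import numpy as np

def flow_density(flow):
    """Fraction of pixels of an H x W x 2 flow field for which
    both flow components are finite (i.e., an estimate exists)."""
    valid = np.all(np.isfinite(flow), axis=-1)
    return float(valid.mean())
```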

The quality of optic flow estimation is displayed in a histogram over the iD triangle (see figures 3.8 and 3.9). The error is calculated using the well-known measure:

$$
e(\mathbf{u},\mathbf{v}) = \arccos\!\left(\frac{\mathbf{u}\cdot\mathbf{v} + 1}{\sqrt{(\mathbf{u}\cdot\mathbf{u}+1)\,(\mathbf{v}\cdot\mathbf{v}+1)}}\right), \qquad (3.4)
$$

where $\mathbf{u}$ and $\mathbf{v}$ are the flow vectors (see also [Barron et al., 1994])3. This measure is called the combined error in this chapter.
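As a sketch, the combined error of equation (3.4) can be implemented directly; this NumPy rendering assumes the flow vectors are given as (u_x, u_y) pairs, with the appended 1 playing the role of the unit temporal component of the space-time vector:

```python
import numpy as np

def combined_error(u, v):
    """Combined error of eq. (3.4) (Barron et al., 1994): the angle
    between the space-time vectors (u_x, u_y, 1) and (v_x, v_y, 1),
    in radians."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    num = np.sum(u * v, axis=-1) + 1.0
    den = np.sqrt((np.sum(u * u, axis=-1) + 1.0) *
                  (np.sum(v * v, axis=-1) + 1.0))
    # Clip against floating-point round-off before arccos.
    return np.arccos(np.clip(num / den, -1.0, 1.0))
```

Identical vectors give an error of 0, and comparing a missing estimate (0, 0) against a unit flow gives arccos(1/√2) = π/4.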

2Each image contains 444×1440 measurements with an angular separation of 0.18 degrees. The field of view is 80 degrees vertically and 259 degrees horizontally. The distance of each point is calculated from the time of flight of the laser beam; the operational range of the sensor is 2–200 m. The wavelength of the laser beam is 0.9 µm, in the near-infrared region.

3Measurements using the angular and magnitude errors $e_{\mathrm{ang}}(\mathbf{u},\mathbf{v}) = \arccos\!\left(\frac{\mathbf{u}\cdot\mathbf{v}}{|\mathbf{u}|\,|\mathbf{v}|}\right)$ and $e_{\mathrm{mag}}(\mathbf{u},\mathbf{v}) = \frac{\bigl|\,|\mathbf{u}|-|\mathbf{v}|\,\bigr|}{|\mathbf{u}|+|\mathbf{v}|}$ yield similar results (for details, see [Kalkan et al., 2004a]).
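The two alternative error measures of this footnote can be sketched as follows (function names are illustrative):

```python
import numpy as np

def angular_error(u, v):
    """e_ang: angle between the two flow vectors, in radians."""
    u, v = np.asarray(u, float), np.asarray(v, float)
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(cos, -1.0, 1.0))

def magnitude_error(u, v):
    """e_mag: normalized difference of the flow magnitudes."""
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    return abs(nu - nv) / (nu + nv)
```

Note that, unlike the combined error, the angular error is undefined for zero-length vectors, which is one motivation for the +1 terms in equation (3.4).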
