
2.4 Image Primitives and Correspondence

One of the main problems of Computer Vision is the establishment of correspondences between images. The correspondence problem consists of finding a given pattern in multiple images. In most applications the real interest lies in finding the pixel coordinates of the same world point in multiple images, i.e. the correspondence of image points which are projections of the same point in space. Once the correspondence problem is solved, more specific geometric information about the world can be obtained. For example, given the image position of the same world point in the left and right rectified images of a calibrated standard stereo camera, the corresponding 3D position of the world point is obtained using Equation 2.12.
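Equation 2.12 is not reproduced in this excerpt; the sketch below assumes the usual rectified standard-stereo relation Z = fB/d with disparity d = u_left − u_right. The function name and parameters are illustrative, not taken from the original text.

```python
import numpy as np

def triangulate_rectified(u_left, u_right, v, f, baseline, cu, cv):
    """Recover the 3D point for a correspondence in a rectified stereo pair.

    Assumes the standard stereo geometry: identical cameras with focal
    length f (pixels) and principal point (cu, cv), separated by `baseline`
    along the x-axis. Depth is inversely proportional to the disparity
    d = u_left - u_right.
    """
    d = u_left - u_right
    if d <= 0:
        raise ValueError("disparity must be positive for a point in front of the cameras")
    z = f * baseline / d
    x = (u_left - cu) * z / f
    y = (v - cv) * z / f
    return np.array([x, y, z])
```

For instance, with f = 500 px, a 10 cm baseline and a disparity of 10 px, the recovered depth is 5 m.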

The main problem in correspondence is how to infer the motion of 3D geometric structures given only the measurement of their reflected or emitted light intensity. One could assume that a point having the same intensity value in the left and right images corresponds to the projection of the same world point. A solution based on this observation alone will fail for a number of reasons. The first reason is that, given the small range of intensity values of the image (typically encoded in 8 bits, i.e. [0, 255]), each intensity value of the sought point is expected to appear multiple times within the image (assuming a uniform distribution of intensity values in a VGA image, the same brightness value is expected to appear 1200 times).

A second problem is the noise affecting the image brightness. The effects of noise can be reduced by attaching a support region around the point of interest, but an exact match is still very unlikely to be found. It is therefore more convenient to define some discrepancy function and search for those image regions which minimize it. We address this point later in this section.

2.4.1 Translational Motion Model

The simplest image correspondence model is obtained by assuming that the whole support region of the reference point moves uniformly. Strictly, this is valid only for those world regions which are flat, parallel to the image plane and which move parallel to it. But even when these assumptions are not fulfilled, almost any motion of any structure is well approximated by this model if the camera motion is small.

Let us suppose that both images are taken from infinitesimally close vantage points. Let us rewrite I1(p1) and I2(p2) as I(p(t), t) and I(p(t + dt), t + dt) respectively, i.e. the continuous version of the discrete equations. Let us assume that the brightness remains constant over time, i.e. I(p(t), t) is constant for all t. Now, since dt is an infinitesimal increment, we can rewrite p(t + dt) = p(t) + v dt, where v is a velocity vector. Then we have

I(p(t), t) = I(p(t) + v dt, t + dt). (2.17)

Figure 2.7: The aperture problem. Two parts of a figure are observed through apertures. Although the triangle moves diagonally, only a horizontal motion can be observed from the lower aperture. When enough 2D information is available, the 2D motion can be measured (upper aperture).

Applying a Taylor series expansion around p(t) to the r.h.s. and neglecting the second- and higher-order terms, we obtain the brightness constancy constraint

∇Iᵀ v + I_t = 0. (2.18)

The vector ∇I is the spatial image derivative, or image gradient, and the scalar I_t is the temporal derivative of I(p, t). Observing that the velocity vector is v = (du/dt, dv/dt), Equation 2.18 can also be written as:

(∂I/∂u)(du/dt) + (∂I/∂v)(dv/dt) + ∂I/∂t = 0. (2.20)

Equations 2.18 and 2.20 involve two unknowns and one constraint; therefore there are infinitely many solutions for v that satisfy the equation. This is called the aperture problem in the literature. The name comes from the geometrical interpretation which exemplifies this fact and which is shown in Figure 2.7. If the solution is sought in the direction of the image gradient ∇I, the resulting vector is called normal flow.

Geometrically, the normal flow is the minimum-norm vector that satisfies the brightness constancy constraint and represents the projection of the real motion vector onto the gradient direction. It is given by

v_n = −(I_t ∇I) / ‖∇I‖². (2.21)
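Equation 2.21 can be evaluated directly once the derivatives are available. The following minimal Python sketch (not part of the original text) computes the normal flow at a single point:

```python
import numpy as np

def normal_flow(grad, i_t):
    """Minimum-norm velocity satisfying the brightness constancy
    constraint grad^T v + i_t = 0 (Equation 2.18), i.e. Equation 2.21.

    grad : (2,) array, spatial image gradient at the point
    i_t  : scalar, temporal derivative at the point
    """
    g2 = np.dot(grad, grad)
    if g2 == 0.0:
        raise ValueError("zero gradient: normal flow is undefined")
    return -i_t * grad / g2
```

By construction the returned vector is parallel to the gradient and satisfies the constraint exactly, which is easy to verify by substituting it back into Equation 2.18.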


Figure 2.8: Features selected for tracking. Observe that every selected window presents high textureness; regions of constant intensity values are avoided.

For the computation of the real motion vector, more constraints are required. A support region around the point provides enough information to find a solution if i) the region contains enough “information” and ii) the motion of the region is constant.

The second condition is required in order to provide enough constraints on v. The first condition implies that the region must contain enough “texture”. As remarked before, it is quite improbable that the brightness remains exactly constant over time because of image noise. Equation 2.17 can be rewritten to consider noise as

I(p(t), t) = I(p(t) + v dt, t + dt) + η (2.22)

where η is a noise term. A solution is then found by minimizing the sum of the squared residuals, which leads to

E(v) = Σ_{p̃ ∈ W(p)} (∇Iᵀ(p̃) v + I_t(p̃))² (2.23)

where W(p) is the support region around p. The minimum can be found in the least-squares sense by setting the derivatives of E(v) w.r.t. v to zero, i.e.

[ Σ I_u²    Σ I_u I_v ]       [ Σ I_u I_t ]
[ Σ I_u I_v  Σ I_v²   ]  v  + [ Σ I_v I_t ]  = 0 (2.24)

or in matrix form,

Gv+e= 0. (2.25)

If the matrix G is invertible, a solution can be found for v:

v = −G⁻¹ e. (2.26)
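Equations 2.24–2.26 can be sketched in a few lines of Python (illustrative code, not from the original text; the derivative arrays are assumed to be sampled over the support window W(p)):

```python
import numpy as np

def lk_translation(Iu, Iv, It):
    """Least-squares translational velocity for one support window,
    following Equations 2.24-2.26.

    Iu, Iv : arrays of spatial derivatives over the window W(p)
    It     : array of temporal derivatives over the window W(p)
    """
    # Build G and e from the windowed derivative sums (Equation 2.24).
    G = np.array([[np.sum(Iu * Iu), np.sum(Iu * Iv)],
                  [np.sum(Iu * Iv), np.sum(Iv * Iv)]])
    e = np.array([np.sum(Iu * It), np.sum(Iv * It)])
    # G must be well-conditioned; otherwise the window lacks texture.
    if np.linalg.cond(G) > 1e8:
        raise np.linalg.LinAlgError("ill-conditioned G: not enough texture")
    # v = -G^{-1} e (Equation 2.26), solved without forming the inverse.
    return -np.linalg.solve(G, e)
```

On synthetic derivatives generated from a known velocity via the brightness constancy constraint, the function recovers that velocity exactly.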

If the vector p, for which v was computed, is some fixed integer position in the image, then the motion vector is called optical flow. If instead p is computed repeatedly, as a particle which moves through the image domain, the motion is called feature tracking. When G is singular, no solution can be found by Equation 2.26. This happens when the intensity in the support region varies in only one dimension (i.e. I_u = 0 or I_v = 0) or there is no variation at all (i.e. I_u = 0 and I_v = 0). In fact, Equation 2.26 can be solved reliably if the matrix G is well-conditioned and above the image noise level. The matrix G is above the image noise level if both eigenvalues are large. The conditioning requirement means that the eigenvalues cannot differ by several orders of magnitude. Both requirements are normally implemented by just checking whether the smallest eigenvalue is larger than a predefined threshold. This is normally enough to check for both requirements, since the maximal eigenvalue is actually upper bounded because of the limited intensity range of the image; therefore, the difference between eigenvalues is also finite. Observe that the matrix G gives a measure of the textureness contained in the support region. This method is used by the KLT tracker [LK81] [ST94] [TK91], which is used in the experimental results of this dissertation for the computation of feature tracking. Another possibility is to threshold the quantity det(G) − k tr(G)², where k is some small value. This variant is known as the Harris corner detector [HS88].

Figure 2.8 shows an example of feature selection. The matrix G is computed for every point of the image, and the points are sorted in descending order according to the smallest eigenvalue of the matrix. The top 200 features which satisfy a minimal distance constraint are chosen as features and are shown in the figure.
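The selection procedure just described can be sketched as follows. This is a simplified Python illustration under stated assumptions (the per-pixel smallest eigenvalue of G is precomputed into an array; real trackers such as KLT add border handling and sub-pixel refinement):

```python
import numpy as np

def select_features(min_eig, n_features=200, min_dist=10, threshold=1.0):
    """Greedy feature selection: sort candidate pixels by the smallest
    eigenvalue of G (precomputed in `min_eig`, an HxW array), then keep
    the strongest points that respect a minimal mutual distance.
    """
    ys, xs = np.nonzero(min_eig > threshold)          # candidates above threshold
    order = np.argsort(min_eig[ys, xs])[::-1]         # strongest first
    chosen = []
    for i in order:
        p = np.array([xs[i], ys[i]])                  # (x, y) pixel position
        if all(np.linalg.norm(p - q) >= min_dist for q in chosen):
            chosen.append(p)
            if len(chosen) == n_features:
                break
    return chosen
```

The minimal-distance test is what spreads the features over the image instead of clustering them on the few strongest corners.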

The computation of image velocities with the method described above is usually expensive. It requires the computation of spatial and temporal derivatives as well as some matrix operations. An alternative is to define some function which measures the discrepancy between support regions, and then find the displacement which minimizes it. A typical dissimilarity measure is the Sum of Squared Differences (SSD) criterion. Considering Equation 2.22, the sum of the squared residuals is minimized in this way, and therefore,

SSD = Σ_{p̃ ∈ W(p)} (I(p̃(t), t) − I(p̃(t) + d, t + dt))² (2.27)

where d = v dt is the displacement. The result is found as the displacement d that minimizes the SSD. Observe that Equations 2.27 and 2.26 are related, since v dt = (−G⁻¹e) dt is a first-order approximation of d. The SSD is the correlation function of the stereo algorithm used in the experimental results in Chapter 8.
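For the rectified stereo case, minimizing Equation 2.27 reduces to a 1D search along the same image row. A minimal Python sketch (illustrative; the function name and window handling are assumptions, not the thesis implementation):

```python
import numpy as np

def ssd_match(left, right, y, x, half, max_disp):
    """Find the disparity minimizing the SSD (Equation 2.27) between a
    (2*half+1)^2 window around (x, y) in the left image and candidate
    windows shifted along the same row of the right image."""
    ref = left[y - half:y + half + 1, x - half:x + half + 1].astype(float)
    best_d, best_ssd = 0, np.inf
    for d in range(0, max_disp + 1):
        if x - d - half < 0:                      # window leaves the image
            break
        cand = right[y - half:y + half + 1,
                     x - d - half:x - d + half + 1].astype(float)
        ssd = np.sum((ref - cand) ** 2)
        if ssd < best_ssd:
            best_d, best_ssd = d, ssd
    return best_d
```

The same loop structure applies to SAD or ZNCC by swapping the inner dissimilarity expression.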

Other dissimilarity criteria are also used in the literature. The Sum of Absolute Differences (SAD) is an alternative to the SSD. Locally scaled versions as well as zero-mean normalized versions of SAD and SSD are also commonly used. The Zero-mean Normalized Cross-Correlation coefficient (ZNCC) and the pseudo-normalized correlation [Mor80] are also examples of similarity functions. All these functions are listed and evaluated for performance and robustness in [AG92] and [Bad02].


2.4.2 Affine and Projective Motion Models

More advanced motion models have been proposed. It is actually rather improbable that all the points corresponding to the support region have the same transformation, which is a requirement for finding a solution. A better model would be to consider some deformation of the support region. From Equation 2.17, a general transformation model is expressed as

I(p̃(t), t) = I(h(p̃(t), α), t + dt)  ∀ p̃ ∈ W(p) (2.28)

where h is the motion model function and α captures all the parameters corresponding to the model (e.g. in the translational case h(p, α) = p + d and α = {d}).

In the affine transformation model, the motion of every point in the support region depends linearly on its location with respect to the reference point p, so h(p, α) = Ap + d (and α = {A, d}), where A is a 2×2 deformation matrix. This model approximates the motion of a planar patch with arbitrary translation, arbitrary rotation about the optical axis, and small rotation about any other axis. The affine transformation model is used by Shi and Tomasi [ST94] for monitoring the quality of a track over non-consecutive frames.

An arbitrary rotation and translation of a planar surface is modeled by the projective motion model, which applies an affine model to the homogeneous coordinates of the points. In this case, h(p, α) = Hp, where H is a 3×3 matrix defined up to a scale factor. The projective motion model is also called a homography motion model, since H describes the homography between p(t) and p(t + dt).
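Applying h(p, α) = Hp in practice means lifting the pixel to homogeneous coordinates, multiplying by H, and dividing by the third component. A short illustrative Python sketch (not from the original text):

```python
import numpy as np

def warp_homography(H, p):
    """Apply the projective motion model h(p, alpha) = H p to a pixel
    position p = (u, v), using homogeneous coordinates.
    H is a 3x3 matrix defined only up to a scale factor."""
    ph = np.array([p[0], p[1], 1.0])   # lift to homogeneous coordinates
    q = H @ ph
    return q[:2] / q[2]                # perspective division back to pixels
```

Note that scaling H by any nonzero factor leaves the warped point unchanged, which is exactly the up-to-scale ambiguity mentioned above; the affine model is the special case where the last row of H is (0, 0, 1).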

Observe that the complexity of deriving the solution increases for the affine and projective motion models, and the corresponding implemented algorithms are much slower in comparison to the translational model. The reader is referred to the bibliography cited in the next section for further details.