3.3 Neuro-plausible Visual Object Perception

In section 2.1.2 we discussed the involvement of the brain's posterior regions of the Inferior Temporal Cortex (ITC) in language processing. Moreover, the ITC, in particular the posterior Inferior Temporal Sulcus (ITS), is part of the ventral pathway in visual processing and is involved in representing visual information for recognising objects^11, primarily by integrating shape and colour features received from the Visual Cortex Four (V4) area [150, 204]. The shape representation^12 codes the discrimination of objects by combining a number of contour fragments, each described by its curvature and angular position relative to the object's center of mass [212, 304]. The colour representation codes hue (and saturation) information of the object invariant to luminance changes [97, 277]. To allow a neurocognitively plausible learning robot to visually observe an object in the environment, it is a necessary condition that the object recognition captures these representations found in the V4 area and provides this information to a neural model mimicking the integration.

To learn and capture visual object characteristics quickly and efficiently, Lowe proposed the feature-based Scale Invariant Feature Transform (SIFT) approach [171]. SIFT introduces the concept of key locations: pixels that are local minima or maxima compared with their eight surrounding pixels and with the corresponding extrema on layers of increased scale. The key locations are local descriptors of gradients for salient points in the image, which are filtered, weighted, and ordered into bins of orientation histograms. Overall, the result is a vector of (usually 128) features, which is stored for an object and used for later comparison. Speeded Up Robust Features (SURF) uses the same kind of features, but the filter step is done on integral images and the key locations are determined by Hessian matrices instead of computed gradients, which further accelerates the approach [18].
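For illustration, the following minimal Python sketch extracts SIFT key locations and descriptors with OpenCV; the file name object.png is a placeholder, and the detector's default parameters are used, not necessarily those of [171].

    import cv2

    # Load a greyscale image of the object (placeholder file name).
    img = cv2.imread("object.png", cv2.IMREAD_GRAYSCALE)

    # Detect key locations (scale-space extrema) and compute their
    # 128-dimensional orientation-histogram descriptors.
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(img, None)

    # descriptors has shape (number of key locations, 128); these vectors
    # are stored per object and matched against later observations.
    print(len(keypoints), descriptors.shape)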

Other efficient approaches are based on or combined with a) Haar-like features, whereby combinations of salient pixels (e.g. L-shaped) are associated with specific locations of an image patch; b) Histogram of Oriented Gradients (HOG) features, where salient points are described by the most frequently occurring gradient orientations (similar to SIFT); or c) Principal Component Analysis (PCA) features, which define salient points by the most important eigenvectors in a feature sub-space [58, 80, 213].
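As a brief sketch, HOG features for a single image patch can be computed with OpenCV as follows; the 64x128 detection window and 9 orientation bins are OpenCV's defaults, not values taken from [58, 80, 213], and patch.png is a placeholder.

    import cv2

    # Resize a patch to the descriptor's default 64x128 window.
    patch = cv2.imread("patch.png", cv2.IMREAD_GRAYSCALE)
    patch = cv2.resize(patch, (64, 128))

    # Default HOG: 8x8 cells, 16x16 blocks, 9 orientation bins.
    hog = cv2.HOGDescriptor()
    features = hog.compute(patch)  # 3780 gradient-orientation histogram values
    print(features.shape)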

The discussed approaches are widely used in robot vision. However, they share a main drawback: they describe objects by a number of global or sub-space features relative to the whole image, but not necessarily by features of the physical entity alone. The resulting representation can thus differ vastly from the representation in V4/posterior ITS. As an alternative, the approach developed for this thesis captures objects by determining salient points on the contour of an object, represented as normalised distances to the center of mass, as well as constant hue values for the area within the contour. The steps of this approach make use of conventional visual perception methods and are shown in figure 3.6.

^11 Object recognition defines perceiving known objects or objects with known components.

^12 Findings mainly based on studies of the macaque brain.

[Figure 3.6 diagram: Field of View → Mean Shift → Canny Edge → Salient Points & Contour Distances → Perception, encoded as shape distances d ∈ [0, 1] (n_sha,01 … n_sha,16), colour (R, G, B) (n_col,1), and position (x, y) (n_pos,1).]

Figure 3.6: Schematic process of visual perception and encoding. The input is a single frame taken by the NAO camera, while the output is the neural activity over N neurons, with N being the sum of shape, colour, and position features.

Visual Perception and Encoding

At first, the mean shift algorithm is employed for segmentation on an image taken by the robotic learner [54]. The algorithm finds good segmentation parameters by determining the modes that best describe the clusters in a transformed 3-D feature space^13, estimating the best matching Probability Density Functions (PDFs). Secondly, the Canny edge detection as well as the OpenCV^14 contour finder are applied for object discrimination [44, 273]. The first algorithm applies a number of filters to find strong edges and their directions, while the second determines a complete contour by finding the best match of contour components. Thirdly, the center of mass and 16 distances to salient points around the contour are calculated.
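These first three steps can be sketched in Python with OpenCV as follows; pyrMeanShiftFiltering stands in for the mean shift segmentation of [54], and the bandwidths, Canny thresholds, and largest-contour heuristic are illustrative assumptions rather than the thesis's parameters.

    import cv2

    # A single camera frame (placeholder file name).
    frame = cv2.imread("nao_frame.png")

    # Step 1: mean shift smoothing with spatial/colour bandwidths sp, sr.
    segmented = cv2.pyrMeanShiftFiltering(frame, sp=21, sr=51)

    # Step 2: Canny edges on the smoothed image, then contour extraction.
    grey = cv2.cvtColor(segmented, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(grey, 50, 150)
    contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    contour = max(contours, key=cv2.contourArea)  # assume largest contour = object

    # Step 3: center of mass from the contour moments.
    m = cv2.moments(contour)
    cx, cy = m["m10"] / m["m00"], m["m01"] / m["m00"]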

Here, salient means, for example, the largest or shortest distance between the center of mass and the contour within intervals of 22.5°. Finally, the distances are scaled by the square root of the object's area and ordered clockwise, starting with the largest. The resulting encoding of 16 values in [0, 1] represents the characteristic shape, which is invariant to scaling and rotation.
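One possible reading of this encoding is sketched below, continuing from the variables contour, cx, and cy of the previous sketch; taking the farthest point per sector and breaking ties via np.argmax are assumptions, and an explicit clipping to [0, 1] is omitted.

    import cv2
    import numpy as np

    pts = contour.reshape(-1, 2).astype(float)
    dx, dy = pts[:, 0] - cx, pts[:, 1] - cy

    # Angle and distance of every contour point relative to the center of mass.
    angles = np.mod(np.arctan2(dy, dx), 2 * np.pi)
    dists = np.hypot(dx, dy)

    # 16 sectors of 22.5 degrees; keep the farthest (salient) point per sector.
    sectors = (angles // (np.pi / 8)).astype(int)
    shape = np.zeros(16)
    for s in range(16):
        in_sector = dists[sectors == s]
        if in_sector.size:
            shape[s] = in_sector.max()

    # Scale invariance: divide by the square root of the object's area.
    shape /= np.sqrt(cv2.contourArea(contour))

    # Rotation invariance: rotate the vector so the largest distance comes first.
    shape = np.roll(shape, -int(np.argmax(shape)))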

Encoding of the perceived colour is realised by averaging the three R, G, and B values of the area within the shape. Other colour spaces, e.g. based on only hue and saturation, could be used as well, but at this step they are mainly a technical choice. Additionally, the perceived relative position of the object is encoded by measuring the two values of the centroid coordinate in the field of view, to allow for later tests on interrelations between multiple objects. For an overview, figure 3.7a shows some of the used objects, figure 3.7b displays the prototypical objects from the perspective of the robotic learner, and figure 3.7c provides two example results of the perception process. The objects have been designed via 3D print to possess similar masses despite different shapes, and similar colour characteristics across the shapes, to provide robustly and controllably perceivable characteristics.
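A corresponding sketch of the colour and position encodings, reusing frame, contour, cx, and cy from the sketches above; the normalisation by 255 and by the image width and height is an assumption made here to obtain a [0, 1] range.

    import cv2
    import numpy as np

    # Mean R, G, B inside the contour, via a filled mask (OpenCV stores BGR).
    mask = np.zeros(frame.shape[:2], dtype=np.uint8)
    cv2.drawContours(mask, [contour], -1, 255, thickness=cv2.FILLED)
    b, g, r, _ = cv2.mean(frame, mask=mask)
    colour = np.array([r, g, b]) / 255.0

    # Relative position of the centroid in the field of view.
    h, w = frame.shape[:2]
    position = np.array([cx / w, cy / h])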

^13 E.g. the Luv colour space (colourimetry), which aims to describe human colour perception as defined by the International Commission on Illumination (CIE) [235].

^14 OpenCV, short for open source computer vision, is a library of recent algorithms and machine learning mechanisms for computer vision [33].

(a) Some objects of interest.

(b) The robot’s view. (c) Prototypical perceived shapes.

Figure 3.7: Exemplary objects and results for visual perception.

Approach Summary

Overall, the approach for object perception can easily be applied to the video stream of the NAO robot as well as to other robot platforms. While recording data, frame rates of up to 5 Frames Per Second (FPS) were measured on a standard remotely connected PC, due to the expensive computations of the mean shift algorithm plus the contour finding, which makes the approach not ideal for real-time use. However, the process works quite robustly for objects with a simple texture and under a reasonable level of noise. We can observe quite similar shape, colour, and position features for our objects on plain and on moderately structured backgrounds, and inconsistent features only for objects with diverse texture (e.g. a cup with multi-colour logos).

Nevertheless, other approaches have been proposed with the purpose of closely reproducing the human visual system. The attention model by Itti et al. extracts salient features of a scene, inspired by the visual system of primates [137]. In contrast to the approach used in this thesis, the authors proposed an architecture of center-surround processing units that yields relations between regions in a scene and therefore helps to either find particularly salient parts of an image or to provide a description of a whole scene. With the Hierarchical Model and X (HMAX) algorithm, Riesenhuber and Poggio proposed to determine object features by a hierarchy of alternating simple and complex cells [233]. In those layers, simple features like edge orientations in small patches of an image are composed and then pooled, e.g. by basic linear summation or nonlinear maximisation (a toy sketch of such a stage follows below). The results are again translation- and scale-invariant features that describe parts of the image and can be used to compare new image patches.
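As a toy sketch of such an alternating stage, the following Python lines compute Gabor "simple cell" responses at four orientations and pool them by a local maximum; kernel size, filter parameters, and pooling neighbourhood are illustrative choices, not the parameters of [233].

    import cv2
    import numpy as np

    img = cv2.imread("object.png", cv2.IMREAD_GRAYSCALE).astype(np.float32)

    # S1: Gabor filters as simple cells tuned to four edge orientations.
    s1 = []
    for theta in np.arange(0, np.pi, np.pi / 4):
        kern = cv2.getGaborKernel((11, 11), 4.0, theta, 8.0, 0.5)
        s1.append(np.abs(cv2.filter2D(img, cv2.CV_32F, kern)))

    # C1: complex cells as a local maximum over each response map,
    # yielding some tolerance to translation.
    c1 = [cv2.dilate(resp, np.ones((8, 8), np.uint8)) for resp in s1]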

In a similar approach, Borghetti Soares et al. integrated 3D point clouds of objects into a hierarchical convolution architecture [29]. This framework inherently captures features from the largest connected surface in the field of view and represents distinctive features by their geometrical relations to each other. In particular, a coherent representation is built up based on multiple viewpoints to determine a mental 3D representation.

However, for our goal of finding an invariant description of a specific shape with potentially little diversity in the texture, the method developed for this thesis is computationally sufficient and similarly plausible.