Implicit Shape Model - Furniture Categorization

Applying Semantics - Grounding through Visual Perception

4.1. Furniture Categorization

4.1.1. Implicit Shape Model

The implemented ISM approach is based on a method first proposed by Salti et al. (2010). They suggest using the original 2D ISM method (Leibe et al., 2004) in an extended form for 3D object categorization. The main benefit of ISM classification compared to other statistical classifiers is that it takes into account the spatial relations between the object parts found in the models.

They define the Implicit Shape Model for a category $C$ as

$$ISM(C) = (I_C, P_{I,C}) \tag{4.1}$$

where $I_C$ is an alphabet of typical local appearances of the selected object category (termed “codebook”) and $P_{I,C}$ is a spatial probability distribution which specifies where each codebook entry may be found on an object. So the learned models contain the frequencies and relative positions of typical regions of the objects within the corresponding class. Specifically, the approach learns the possible geometric relations between the features and a reference point, preferably the object’s center. This is realized by assigning to each learned feature a vector which points to the reference point. If the same features and relations to the reference point of one model are found on a candidate object, this object can be assigned to the corresponding class.
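The pair in Eq. (4.1) can be pictured as a small data structure. The following is a minimal sketch, not the implementation used here: the names `CodebookEntry`, `ImplicitShapeModel`, and `add_occurrence` are hypothetical, and the spatial distribution is approximated by a plain list of observed offsets to the reference point.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Vector = Tuple[float, float, float]


@dataclass
class CodebookEntry:
    """One typical local appearance, i.e. one element of I_C."""
    descriptor: List[float]
    # Observed offsets from a feature position to the reference point.
    # Together they act as an empirical stand-in for P_{I,C}.
    offsets: List[Vector] = field(default_factory=list)


@dataclass
class ImplicitShapeModel:
    """ISM(C) = (I_C, P_{I,C}) for a single category C."""
    category: str
    codebook: List[CodebookEntry] = field(default_factory=list)

    def add_occurrence(self, entry_index: int, feature_position: Vector,
                       reference_point: Vector) -> Vector:
        """Store the vector pointing from the feature to the reference point."""
        offset = tuple(r - f for f, r in zip(feature_position, reference_point))
        self.codebook[entry_index].offsets.append(offset)
        return offset
```

Each stored offset corresponds to one training observation of the form "this kind of region was seen at this position relative to the object's center".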

For the furniture recognition system presented here, the original 3D ISM algorithm of Salti et al. (2010) is adapted in two ways. First, the feature calculation step is adapted to the special requirements of the furniture recognition domain. Because they are man-made, typical indoor scenes contain many planar surface structures. This includes general structures of the room itself, such as walls, floors, and doors, but it is also true for the furniture within the room, especially shelves, cupboards, and tables. The 3D shape descriptors should therefore focus on non-planar features of the furniture, because planar surfaces do not carry enough descriptive power to distinguish furniture from the background or to describe the properties of the furniture that enable categorization.

Second, an alternative to the original Hough Space Voting is presented, which is used for feature-position-aware detection and classification of the shape models. The new approach allows the use of an unlimited amount of training data while keeping the upper bound of the computational effort constant. Furthermore, it eliminates the need for models of correct real-world scale in the training and classification of shape models.


Training Procedure

The classifier is trained on a web database of artificial 3D furniture models. A successful learning scheme for furniture categorization using this database was demonstrated by Martinez Mozos et al. (2012).

Since the features for the Shape Models are calculated from point clouds, the artificial meshes from the database need to be preprocessed. To obtain realistic data similar to the real-world data expected at prediction time, virtual 3D scans of the models are created (see Figure 4.1). These emulate a depth sensor that captures point clouds from 12 virtual positions around the target object. Visualizations of the models used can be found in Appendix C.

Figure 4.1: Left: furniture meshes from the database. Right: virtual scans.
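The placement of the virtual sensor can be sketched as follows. The text only states that 12 positions around the target object are used; the even spacing on a circle, the fixed height, and the function name `virtual_scan_positions` are assumptions made for illustration.

```python
import math


def virtual_scan_positions(center, radius, height, count=12):
    """Return `count` sensor positions on a circle around `center`.

    Assumption: evenly spaced azimuth angles at a single fixed height.
    """
    cx, cy, cz = center
    positions = []
    for k in range(count):
        azimuth = 2.0 * math.pi * k / count
        positions.append((cx + radius * math.cos(azimuth),
                          cy + radius * math.sin(azimuth),
                          cz + height))
    return positions
```

Each returned position would serve as the origin of one simulated depth scan directed at the object's center.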

As stated above, it is important to focus on non-planar regions of the objects in order to find descriptive features. This is realized by performing a boundary estimation on the given point clouds, which yields a set of keypoints for feature calculation. The boundary estimation is based on angle differences of normals in the neighborhood of a target point.
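A strongly simplified version of such a keypoint selection is sketched below. It is an assumption about how an angle criterion on normals could look, not the boundary estimation actually used: a point is kept when at least one neighbor within `radius` has a normal that deviates by more than a threshold. The function name, the 30 degree default, and the brute-force neighbor search are hypothetical choices.

```python
import numpy as np


def select_keypoints(points, normals, radius, angle_threshold_deg=30.0):
    """Return indices of points whose neighborhood contains a strongly deviating normal."""
    points = np.asarray(points, dtype=float)
    normals = np.asarray(normals, dtype=float)
    normals = normals / np.linalg.norm(normals, axis=1, keepdims=True)
    cos_threshold = np.cos(np.radians(angle_threshold_deg))
    keypoints = []
    for i, point in enumerate(points):
        distances = np.linalg.norm(points - point, axis=1)
        neighbors = (distances <= radius) & (distances > 0.0)
        if not neighbors.any():
            continue
        # The smallest cosine corresponds to the largest angle difference.
        min_cos = np.min(normals[neighbors] @ normals[i])
        if min_cos < cos_threshold:
            keypoints.append(i)
    return keypoints
```

On a purely planar patch all normals agree, so no point is selected; points near an edge between two surfaces are kept.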

To describe the keypoints found on the object’s boundaries, the Signature of Histograms of Orientations (SHOT) descriptor is used (Tombari et al., 2010). This local descriptor characterizes a keypoint by describing the neighborhood (support) of the target point.


One reason for choosing this descriptor is that it defines a unique local reference frame at the target point. The detected characteristics of the surrounding surface are stored in local coordinates, which makes the descriptor rotation- and viewpoint-invariant. The descriptor vector is calculated by assigning the neighboring points to spatial bins, which are defined by 8 azimuth, 2 elevation, and 2 radial divisions of a virtual sphere around the target point. For each bin, a histogram is calculated over the cosines of the angles between the normals of the points within the bin and the keypoint’s normal. With equally spaced bins, the use of cosines makes the histogram coarser for normals that are nearly parallel to the keypoint’s normal and finer for normals that are nearly orthogonal to it. Since the points whose normals have a large angular distance to the keypoint’s normal are the most informative ones, the finer binning increases the descriptive power of the descriptor. The 32 spatial bins around the keypoint, each containing an 11-dimensional histogram, result in a 352-dimensional descriptor vector, which is finally normalized so that it is independent of the number of points in the neighborhood.
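The descriptor size and the effect of binning over cosines can be checked with a few lines. This is an illustrative sketch with hypothetical function names; it uses hard bin assignment and does not reproduce the full SHOT computation.

```python
import math

AZIMUTH_DIVISIONS = 8
ELEVATION_DIVISIONS = 2
RADIAL_DIVISIONS = 2
HISTOGRAM_BINS = 11


def descriptor_length():
    """Number of spatial bins times the number of histogram bins per spatial bin."""
    spatial_bins = AZIMUTH_DIVISIONS * ELEVATION_DIVISIONS * RADIAL_DIVISIONS
    return spatial_bins * HISTOGRAM_BINS


def cosine_bin(cos_angle, bins=HISTOGRAM_BINS):
    """Map cos(angle) in [-1, 1] to one of `bins` equally spaced cosine bins."""
    cos_angle = max(-1.0, min(1.0, cos_angle))
    index = int((cos_angle + 1.0) / 2.0 * bins)
    return min(index, bins - 1)


def bin_widths_in_degrees(bins=HISTOGRAM_BINS):
    """Angular range covered by each cosine bin, ordered from cos = -1 to cos = 1."""
    edges = [-1.0 + 2.0 * k / bins for k in range(bins + 1)]
    angles = [math.degrees(math.acos(max(-1.0, min(1.0, e)))) for e in edges]
    return [angles[k] - angles[k + 1] for k in range(bins)]
```

The widths returned by `bin_widths_in_degrees()` are largest for the outermost bins and smallest for the middle bin around 90 degrees, which is the coarse-versus-fine behavior described above.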

The local reference frame of the feature needs to be repeatable and unambiguous so that the same descriptor is generated independent of the viewpoint. SHOT therefore uses an adapted Principal Component Analysis of the neighboring data. The data for the calculation of the covariance matrix is weighted by the distance to the target point:

$$C = \frac{1}{\sum_{i : d_i \le R} (R - d_i)} \sum_{i : d_i \le R} (R - d_i)(p_i - p)(p_i - p)^T \tag{4.2}$$

where $R$ is the radius of the sphere and $d_i$ is the Euclidean distance between $p_i$ and the keypoint $p$. Points far from the keypoint thus receive smaller weights, which increases the repeatability of the local reference frame in the presence of clutter. To disambiguate the axes of the principal components found, the algorithm ensures that the sign of each eigenvector is coherent with the majority of the points it represents (see Tombari et al., 2010, for more details).
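Eq. (4.2) translates directly into a few NumPy operations. The sketch below is an illustration under stated assumptions rather than the reference implementation: the function names are hypothetical, the x axis is taken from the largest and the z axis from the smallest eigenvalue, and the sign rule is reduced to a simple majority count over the support points.

```python
import numpy as np


def weighted_covariance(points, keypoint, radius):
    """Return the distance-weighted covariance of Eq. (4.2) and the support vectors."""
    points = np.asarray(points, dtype=float)
    keypoint = np.asarray(keypoint, dtype=float)
    diffs = points - keypoint                # p_i - p
    dists = np.linalg.norm(diffs, axis=1)    # d_i
    inside = dists <= radius                 # keep only points with d_i <= R
    weights = radius - dists[inside]         # R - d_i
    total = weights.sum()
    if total <= 0.0:
        raise ValueError("no support points with positive weight")
    support = diffs[inside]
    covariance = (weights[:, None] * support).T @ support / total
    return covariance, support


def local_reference_frame(points, keypoint, radius):
    """Return a 3x3 matrix whose rows are the x, y, and z axes of the local frame."""
    covariance, support = weighted_covariance(points, keypoint, radius)
    _, eigenvectors = np.linalg.eigh(covariance)  # eigenvalues in ascending order
    x_axis = eigenvectors[:, 2].copy()            # largest eigenvalue
    z_axis = eigenvectors[:, 0].copy()            # smallest eigenvalue
    for axis in (x_axis, z_axis):
        projections = support @ axis
        # Flip the axis if most support points lie on its negative side.
        if np.sum(projections < 0.0) > np.sum(projections >= 0.0):
            axis *= -1.0
    y_axis = np.cross(z_axis, x_axis)             # completes a right-handed frame
    return np.vstack([x_axis, y_axis, z_axis])
```

Multiplying a vector by the returned matrix expresses it in the local coordinates of the keypoint.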

These local descriptors are now used to generate a codebook describing typical parts of furniture. It is generated by clustering the SHOT descriptors from all training samples in feature space using the K-Means clustering scheme. The centroids of the resulting clusters define the words of the codebook of size $K$.
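The codebook generation can be sketched with a basic K-Means loop. The text does not specify the initialization, the number of iterations, or the value of K, so the random initialization, the fixed iteration count, and the name `build_codebook` are assumptions; the dense distance matrix is only practical for small illustrative inputs.

```python
import numpy as np


def build_codebook(descriptors, k, iterations=20, seed=0):
    """Cluster descriptors with plain Lloyd iterations and return the k centroids."""
    descriptors = np.asarray(descriptors, dtype=float)
    rng = np.random.default_rng(seed)
    start = rng.choice(len(descriptors), size=k, replace=False)
    centroids = descriptors[start].copy()
    for _ in range(iterations):
        # Squared distance of every descriptor to every centroid.
        diffs = descriptors[:, None, :] - centroids[None, :, :]
        labels = (diffs ** 2).sum(axis=2).argmin(axis=1)
        for j in range(k):
            members = descriptors[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids
```

Each row of the returned array is one codeword.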

To generate a Shape Model for each class, the SHOT descriptors calculated on the training samples of that class activate codewords via nearest-neighbor search. Additionally, each codeword builds up a histogram of voting directions. This strategy differs from the original 3D ISM approach. Every keypoint that activates a codeword votes for a direction that points to the object’s center. The voting direction is represented in the descriptor’s local reference frame. In this way, each codeword is assigned a probabilistic belief about the direction in which the object’s center is located. This information is later used by the Hough Space Voting mechanism. The Shape Model for a category of furniture thus consists of a set of activated codewords, each with its own vote direction histogram. The frequencies of the codewords are implicitly represented in the vote direction histograms.
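The training step described above can be sketched as follows. The discretization of directions is not specified in the text; the azimuth and elevation binning, the bin counts, and the names `direction_bin` and `train_vote_histograms` are assumptions for illustration. Each local reference frame is expected as a 3x3 matrix whose rows are the frame axes.

```python
import numpy as np


def direction_bin(direction, azimuth_bins=8, elevation_bins=4):
    """Map a 3D direction to a bin index on a sphere (assumed discretization)."""
    x, y, z = np.asarray(direction, dtype=float) / np.linalg.norm(direction)
    azimuth = np.arctan2(y, x) % (2.0 * np.pi)
    elevation = np.arccos(np.clip(z, -1.0, 1.0))
    a = min(int(azimuth / (2.0 * np.pi) * azimuth_bins), azimuth_bins - 1)
    e = min(int(elevation / np.pi * elevation_bins), elevation_bins - 1)
    return e * azimuth_bins + a


def train_vote_histograms(codebook, descriptors, positions, frames, center,
                          azimuth_bins=8, elevation_bins=4):
    """Accumulate one vote direction histogram per codeword for a single class."""
    codebook = np.asarray(codebook, dtype=float)
    center = np.asarray(center, dtype=float)
    histograms = np.zeros((len(codebook), azimuth_bins * elevation_bins))
    for descriptor, position, frame in zip(descriptors, positions, frames):
        # Nearest-neighbor search activates exactly one codeword.
        distances = np.linalg.norm(codebook - np.asarray(descriptor, dtype=float), axis=1)
        codeword = int(np.argmin(distances))
        to_center = center - np.asarray(position, dtype=float)
        if np.linalg.norm(to_center) == 0.0:
            continue  # a keypoint at the center has no defined direction
        # Express the direction to the center in the keypoint's local frame.
        local = np.asarray(frame, dtype=float) @ to_center
        histograms[codeword, direction_bin(local, azimuth_bins, elevation_bins)] += 1
    return histograms
```

The row sums of the result give the activation frequency of each codeword, which is why the frequencies do not need to be stored separately.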

In document: The attentive robot companion: learning spatial information from observation and verbal interaction (pages 114-117)