
The huge number of objects in the world causes humans to group similar entities into meaningful categories starting in their second year [31]. This process is, at first, much facilitated by the color or shape of objects and, at a later stage, by higher-level features. According to Piaget, one of the leading scientists on child development in the 1960s, this Adaptation process can be split into two complementary sub-processes: Assimilation and Accommodation [32]. Assimilation describes the mechanism by which perceived and familiar objects are sorted into existing categories. If the visual impression of a new object is too different from existing categories, the Accommodation process forms a novel category [33]. Assimilation, for example, is a powerful acquisition during a child's development as it allows transferring knowledge from a group of known objects to new objects. If you encounter a new knife and recognize it as belonging to the knife category, you can recall trajectories and grasping points for cutting and quickly handle it without relearning from scratch. While recognizing the knife demands Object Categorization (OC), using it requires (among others) Pose Estimation at the category level (PEC).

3.2.1 Object Categorization (OC)

Being able to recognize the category to which a given object belongs is called OC. As different objects are combined into single classes, this can lead to complex decision boundaries in the object signature space.

While generalizability for Instance Recognition (IR) is limited to recognizing a known object under varying recording conditions, OC has to generalize across different objects of the same category, so the object signature must retain some discriminative power for generalizability. Constrained algorithms like template matching or Hough transforms no longer work well, because objects in one category can have remarkable differences in their appearance, such that local image patches have little spatial coherence across different objects. This is why a geometric verification stage is less common in OC. Instead, objects are more frequently described by object signatures using, for example, Bag of Words (BoW) [34], Fisher vectors [35–37], or sparse coding [38, 39].
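To make the notion of an object signature more concrete, the following sketch builds a BoW signature: local descriptors from training images are clustered into a visual vocabulary, and each image is then summarized by a normalized histogram of its nearest visual words. This is only an illustration; the descriptor extractor is a random stub standing in for, e.g., SIFT or ORB features, and the vocabulary size is arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans

def extract_local_descriptors(image):
    # Hypothetical stand-in for a local feature extractor (e.g., SIFT/ORB):
    # returns an (n_descriptors x 128) array of local descriptors per image.
    rng = np.random.default_rng(0)
    return rng.normal(size=(200, 128))

def build_vocabulary(images, n_words=256):
    """Cluster all local descriptors of the training images into n_words visual words."""
    descriptors = np.vstack([extract_local_descriptors(img) for img in images])
    return KMeans(n_clusters=n_words, n_init=10, random_state=0).fit(descriptors)

def bow_signature(image, vocabulary):
    """Object signature: normalized histogram of visual-word occurrences."""
    words = vocabulary.predict(extract_local_descriptors(image))
    hist = np.bincount(words, minlength=vocabulary.n_clusters).astype(float)
    return hist / (hist.sum() + 1e-8)
```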

Machine learning algorithms (e.g., Support Vector Machines (SVMs) [40–42], decision trees and random forests [43–47], or boosting [48]) are then trained on signatures with known labels (supervised learning). After learning, the predictive model should be able to merge signatures of objects of the same class (by assigning the same label) and differentiate them from object signatures of other classes (by assigning different labels).
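A minimal sketch of this supervised step, assuming signatures are already available (here replaced by synthetic data) and using an SVM as one of the cited classifier families; the hyperparameters are illustrative, not tuned:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for object signatures (one row per image) with known category labels.
X, y = make_classification(n_samples=500, n_features=256, n_informative=40,
                           n_classes=5, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = SVC(kernel="rbf", C=10.0, gamma="scale")  # hyperparameters chosen for illustration
clf.fit(X_train, y_train)                       # supervised learning on labeled signatures
print("held-out accuracy:", clf.score(X_test, y_test))
```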

While these classical methods tend to work well for a small number of classes, they do not scale to large object categorization problems with up to 1000 classes and hundreds of thousands of instances like the ImageNet Large Scale Visual Recognition Challenge (ILSVRC)³ [49]. This growth in data led to the advent of Deep Convolutional Neural Network (DCNN) architectures⁴ with thousands of learnable parameters. DCNNs take a special role in that they replace the traditional pipeline <feature extraction> → <signature generation> → <class learning> (with fixed feature extraction and signature generation steps) by a pipeline which starts at the signal level provided by, for example, an RGB-D camera⁵. They use a stack of consecutive layers (the first is the input signal layer, the last is the output layer) with each layer being the input to the next layer. A very powerful property of DCNNs is the fact that the last layer can predict any kind of output (e.g., in Section 5.2 we predict not only the category, but also the pose of objects). Layers are connected by neurons which are only applied in local regions of their input layer (receptive field) and share weights across image regions (using convolution).
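The sketch below shows such a layer stack in PyTorch, purely for illustration: the architecture, layer sizes, and number of categories are invented and far smaller than networks used in practice. The final fully connected layer maps the learned features to the desired output and could, in principle, be extended with further outputs such as a pose regression head.

```python
import torch
import torch.nn as nn

class TinyDCNN(nn.Module):
    """Illustrative layer stack: raw image in, category scores out."""
    def __init__(self, n_categories=10, in_channels=3):
        super().__init__()
        self.features = nn.Sequential(
            # early layers: small receptive fields, weights shared via convolution
            nn.Conv2d(in_channels, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # last layer: maps the learned features to the desired output (here: category scores)
        self.classifier = nn.Linear(32 * 16 * 16, n_categories)

    def forward(self, x):                 # x: batch of 64x64 RGB images
        h = self.features(x)
        return self.classifier(h.flatten(start_dim=1))

logits = TinyDCNN()(torch.randn(4, 3, 64, 64))  # -> shape (4, 10)
```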

Interestingly, lower-layer neurons automatically tune to local image features based on gradients (like edges and corners), whereas neurons in later layers adapt to characteristic higher-level features for categories [50]. Thus they learn the steps of traditional feature extraction and signature generation⁶.

While DCNNs can be tailored to datasets by pure learning, they need an enormous amount of labeled training data to avoid over-fitting. Although there are popular remedies such as augmenting the training data (shifted, rotated, and/or flipped versions of training images) or drop-out [55] (randomly deactivating neurons in the network to prevent over-fitting), DCNNs are not practical if training data is scarce.
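The following sketch illustrates both remedies with PyTorch/torchvision; the specific transforms, layer sizes, and drop-out probability are arbitrary examples rather than recommended settings:

```python
import torch.nn as nn
from torchvision import transforms

# Data augmentation: present shifted / rotated / flipped variants of each training image.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(degrees=15),
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),  # small shifts
    transforms.ToTensor(),
])

# Drop-out: randomly deactivate neurons during training to reduce over-fitting.
classifier_head = nn.Sequential(
    nn.Linear(8192, 512), nn.ReLU(),   # 8192 = flattened feature size of the sketch above
    nn.Dropout(p=0.5),
    nn.Linear(512, 10),
)
```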

³ http://image-net.org/challenges/LSVRC/

⁴ A comprehensive introduction and tutorial on DCNNs for Visual Recognition by Li and Karpathy is available at http://vision.stanford.edu/teaching/cs231n/syllabus.html.

⁵ A camera which records red, green, and blue as well as distance to the camera for each pixel.

To address this, we introduce our TransClean algorithm. It is able to automatically retrieve large amounts of OC training data when given a category name (e.g., nut) together with a descriptive context (e.g., crack or delicious). In this example, nut has a double meaning: it can refer to either a food nut or a hex nut. Downloading training data directly from large word-based image databases like Google Image Search results in many irrelevant images in the training set. Using the context, TransClean can disambiguate the name of the category and retrieve task-relevant images.

In Section 5.2 we randomly generate cluttered indoor scenes in 3D. From those we can synthesize an unlimited number of training images. Compared to the TransClean algorithm, this approach is limited to categories where full 3D models are available. It quickly makes up for this disadvantage by being able to automatically annotate all objects with a full 6 Degrees of Freedom (DoF) pose. This allows for the training of methods for category-level pose estimation.

3.2.2 Pose Estimation at the Category Level (PEC)

In order to execute tasks, the agent needs to determine exactly where and in which pose an object is located in a scene. For example, when filling a cup you should not hold it upside down. If you want to sit down on a chair, you cannot do so from the backrest's side. The problem of finding the transformation between an object's intrinsic reference frame (where actions are defined) and the object recording in the world's reference frame (e.g., room-axis aligned) is addressed by pose estimation. Interestingly, we, as humans, can infer the pose of an object even if we have never seen it before. For example, if you see a new object of a known category (a cup, a knife, and so on), you immediately know how to use it. While trivial for us, artificial agents cannot accomplish this yet.
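As a small numerical illustration of this transformation, the sketch below applies a hypothetical 6 DoF pose (rotation plus translation, written as a 4×4 homogeneous matrix) to map a grasp point defined in the object's intrinsic frame into the world frame; all numbers are made up:

```python
import numpy as np

def pose_matrix(yaw, pitch, roll, translation):
    """Build a 4x4 homogeneous transform from Euler angles (radians) and a translation."""
    cz, sz = np.cos(yaw), np.sin(yaw)
    cy, sy = np.cos(pitch), np.sin(pitch)
    cx, sx = np.cos(roll), np.sin(roll)
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    T = np.eye(4)
    T[:3, :3] = Rz @ Ry @ Rx
    T[:3, 3] = translation
    return T

# Hypothetical pose of a cup in the room-axis-aligned world frame.
obj_to_world = pose_matrix(yaw=0.5, pitch=0.0, roll=0.0, translation=[1.0, 0.2, 0.8])

grasp_in_object = np.array([0.04, 0.0, 0.06, 1.0])  # grasp point on the handle (homogeneous)
grasp_in_world = obj_to_world @ grasp_in_object      # where the gripper must go
print(grasp_in_world[:3])
```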

A lot of research has been conducted to solve pose estimation for known instances in a scene (see Section 3.1.2). There, it is normally solved by aligning full object models to the partial recordings of objects in a scene. However, this can only be done with known instances.
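At its core, such an alignment estimates a rigid transform between the model and its partial recording. The sketch below shows the closed-form least-squares step (a Kabsch-style SVD solution) for the simplified case where point correspondences are already known; real pipelines typically iterate this inside ICP or combine it with feature matching, and the example data here are synthetic:

```python
import numpy as np

def rigid_align(model_pts, scene_pts):
    """Least-squares rigid transform (R, t) with scene ≈ model @ R.T + t,
    assuming the i-th model point corresponds to the i-th scene point."""
    mu_m, mu_s = model_pts.mean(axis=0), scene_pts.mean(axis=0)
    H = (model_pts - mu_m).T @ (scene_pts - mu_s)                # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # guard against reflections
    R = Vt.T @ D @ U.T
    t = mu_s - R @ mu_m
    return R, t

# Tiny self-check: a "scene" that is a rotated and shifted copy of the model.
rng = np.random.default_rng(1)
model = rng.normal(size=(100, 3))
theta = 0.3
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([0.5, -0.2, 1.0])
scene = model @ R_true.T + t_true
R_est, t_est = rigid_align(model, scene)
print(np.allclose(R_est, R_true), np.allclose(t_est, t_true))
```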

Doing pose estimation at the category level is much harder. For example, aligning two similar, but different, objects usually fails. Comparing each object in a scene to a huge collection of stored models (from one category), in order to increase the chances of finding a good match, would be very inefficient. While the variance within a category as well as the complexity of the problem calls for rich models (e.g., DCNNs), scarcity of training data with annotated 6