
3.4. Syntactic Methods in Another Domain

could be improved by 10% [112]. As we will see in Chapter 6, this of course affects the performance in assembly detection. Right now, we can conclude that the integration of lower level image processing and higher level analysis of symbolic information suggests an avenue towards more robust and reliable recognition in computer vision.

[Figure 3.26 shows a tree diagram: a ROOM-SCENE decomposes into a SPACE-PROVIDING-PART (ROOM, ...) and a SPACE-CONSUMING-PART, which expands into subscenes such as a TABLE-SCENE (with TABLE-PART and CHAIR-PART) or a SHELF-SCENE (with a CARRIER-PART comprising RACK, C-SPC-PROV-PRT, C-SPC-CONS-PRT, and a CARRIED-PART comprising SHELF, BOOK, ...).]

Figure 3.26.: Syntactic model of a room scene. A room scene is decomposed into two functional units; the one that consumes space may be represented by several subscenes, which in turn can have recursive component structures.

how to instruct and steer a mobile robot in ordinary home or office environments. It does not primarily aim at traditional problems of mobile robotics like autonomous exploration or path planning but, like the SFB 360, is focused on man-machine interaction. And, similar to the cooperative construction scenario, the intended service robot should rely on visual and acoustic input while communicating with a user. Capabilities like self-localization or obstacle avoidance, in contrast, may also rest on laser or sonar sensors but need not be restricted to them.

A mobile robot that has to navigate in an environment of several distinct rooms certainly needs a mechanism to estimate its current position. Consequently, there is intensive research on problems of global and local localization, and the literature on these issues is vast (cf. [43] for an overview). Besides approaches applying sensors like sonar or laser rangefinders, computer vision based techniques are very popular. These usually make use of landmarks, which may be objects characteristic of a location or visual markers attached to, for example, walls. As the latter constitute an artificial and auxiliary means generally not present in ordinary environments, they are ineligible for our setting. Thus, questions of self-localization from vision motivated preparatory work on 3D furniture recognition [50]. However, some pieces of furniture might be present in several rooms and thus are of no use for localization. Others might be characteristic but unrecognizable due to perspective occlusion. This led to the idea of taking contextual information into account and modeling the furnishing of entire rooms syntactically.

Figure 3.26 illustrates how this might be done applying the techniques presented so far. It depicts a general syntactic model of the component structure of a room scene. The model assumes that a ROOM-SCENE is made of two functional parts: one that provides space and one that consumes space. The space-providing unit might either be an empty room or a room with some furniture in it, i.e. a ROOM-SCENE. Typical groups of furniture are space-consuming units. A group of chairs arranged around a table, for instance, can be recursively decomposed into a chair and a table with a lesser number of chairs. Thus, a TABLE-SCENE could be composed of a TABLE-PART and a CHAIR-PART. The TABLE-PART might be a TABLE or a less complex TABLE-SCENE, while the CHAIR-PART could be one of several elementary types of chairs.

Other space-consuming parts may have a more complex structure. A SHELF-SCENE, for instance, should generally consist of a unit that carries several shelves. As the number of shelves is arbitrary, this CARRIER-PART is either a RACK or a SHELF-SCENE itself. And as the shelves carried by a rack may contain several books or other items, each CARRIED-PART of a SHELF-SCENE is again recursively structured.

In contrast to mechanical assemblies, this syntactic model of a ROOM-SCENE and its components is not bound to functions of mating features. The functional relations between the variables are rather abstract, and the model was deduced from experience instead. Nevertheless, if there were a set of suitable relational representations of room scenes, the algorithm presented in Section 3.1.2 would be able to produce a similar context-free grammatical model.

While the concept of mechanical mating features thus seems superfluous in deriving context-free models of classes of complex entities, the next section will show that it still is advantageous when implementing detectors for syntactic structures.

3.4.2. An Experimental Implementation

Intending to verify the applicability of syntactic methods in detecting constellations of furniture from sets of 3D object data, we slightly modified the TABLE-SCENE part of the room model and implemented it as an Ernest network. The modification further generalized the model in order to allow the treatment of scenes like the one shown in Fig. 3.27(a), which contains more than one table. Assuming binary partitions of complex scenes resulted in the following CFG:

TABLE-SCENE → TABLE-PART CHAIR-PART | TABLE-PART TABLE-PART
TABLE-PART  → TABLE-SCENE | TABLE1 | TABLE2 | ...
CHAIR-PART  → CHAIR1 | CHAIR2 | ...

A TABLE-SCENE may thus consist of a TABLE-PART and a CHAIR-PART or of two TABLE-PARTs. And while the CHAIR-PART cannot be of complex nature, a TABLE-PART may have recursive structure.
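These binary productions can be read directly as a derivability test. The following sketch is our own illustration, not part of the thesis implementation; the function names and the restriction to two table and chair types are assumptions. It checks whether a collection of recognized object labels can be derived as a TABLE-SCENE:

```python
# Hypothetical derivability check for the binary TABLE-SCENE grammar.
# Object collections are tuples of type labels; order is irrelevant,
# so all binary splits are tried.
from itertools import combinations

TABLES = {"TABLE1", "TABLE2"}
CHAIRS = {"CHAIR1", "CHAIR2"}


def is_chair_part(objs):
    """CHAIR-PART -> CHAIR1 | CHAIR2 (always elementary)."""
    return len(objs) == 1 and objs[0] in CHAIRS


def is_table_part(objs):
    """TABLE-PART -> TABLE-SCENE | TABLE1 | TABLE2 (may be recursive)."""
    if len(objs) == 1 and objs[0] in TABLES:
        return True
    return is_table_scene(objs)


def is_table_scene(objs):
    """TABLE-SCENE -> TABLE-PART CHAIR-PART | TABLE-PART TABLE-PART."""
    if len(objs) < 2:
        return False
    idx = range(len(objs))
    for k in range(1, len(objs)):
        for left in combinations(idx, k):
            part = tuple(objs[i] for i in left)
            rest = tuple(objs[i] for i in idx if i not in left)
            if is_table_part(part) and (is_chair_part(rest) or is_table_part(rest)):
                return True
    return False
```

Because every split is strictly smaller than its parent, the mutual recursion between `is_table_part` and `is_table_scene` terminates; the exhaustive enumeration of splits is acceptable only for the small object sets considered here.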

Implementing the grammar as an Ernest knowledge base is straightforward. Variables and terminals are mapped to concepts, and productions are represented by suitable part links. However, as mentioned above, the data that should be searched for syntactic

Figure 3.27.: A table scene consisting of several chairs and tables, and corresponding positions of virtual mating features. Green balls associated with tables indicate positions near which one would expect a chair or another table to be situated; red balls indicate the parts of chairs that usually are near a table.

structures results from 3D object recognition, i.e. suitable strategies to parse sets of 3D object models have to be found.

A practicable control algorithm to detect table scenes can actually be adopted from the ideas developed for the baufix® domain. At first sight, this might be surprising, since the Ernest network for baufix® assembly detection continuously makes use of the idea and properties of mating features. Remember that their properties not only led to the syntactic model of bolted assemblies. Considerations regarding their appearance and distribution in an image also guide the detection of assembly structures and allow connection details to be deduced. But there are no such entities in the furniture domain. Instead, pieces of furniture might form a group without being physically connected.

However, though they are more abstract and not as obvious as in the case of mechanical assemblies, relations among pieces of furniture can still be stated geometrically.

A chair, for instance, has an intrinsic front which, in a common table scene, is oriented towards a table. Thus, although there are no rigid connections, there still are geometrical or spatial constraints that apply to well-formed table scenes. Such constraints, of course, can easily be modeled by means of attributes or restrictions associated with concepts of an Ernest network.

But spatial relations among three-dimensional objects are inherently fuzzy. For example, a chair near a table might be oriented towards the table. Likewise, there will be orientations where the chair does not point to the table. But then there also will be orientations where it somehow does and somehow does not point towards the table. In

[Figure 3.28: three panels, (a)-(c); panel (c) shows a parse tree of nested TABLE-SCENE, TABLE-PART, and CHAIR-PART nodes over the recognized TABLE and CHAIR objects.]

Figure 3.28.: 3.28(a) A scene of several pieces of furniture. 3.28(b) The table scene detected within this configuration. 3.28(c) Syntactic structure resulting from a parsing process.

other words, there are fuzzy transitions from one orientation to the other where a binary decision will be difficult.
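Such a gradual transition can be made concrete with a simple orientation score. The cosine-based measure below is only a hypothetical illustration of this fuzziness (names and coordinate conventions are our own), not the model used in the thesis:

```python
import math


def facing_score(chair_pos, chair_front, table_pos):
    """Soft degree in [0, 1] to which a chair's intrinsic front points
    at a table. Positions and the front vector are 2D ground-plane
    tuples (x, y); this is an assumed, illustrative model only."""
    to_table = (table_pos[0] - chair_pos[0], table_pos[1] - chair_pos[1])
    norm = math.hypot(*to_table) * math.hypot(*chair_front)
    if norm == 0.0:
        return 0.0  # degenerate: chair sits on the table position
    cos = (to_table[0] * chair_front[0] + to_table[1] * chair_front[1]) / norm
    return 0.5 * (cos + 1.0)  # map cosine from [-1, 1] to [0, 1]
```

A chair facing the table scores 1, one facing away scores 0, and sideways orientations fall smoothly in between, which is exactly the region where a binary predicate would be arbitrary.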

Actually, this phenomenon can be observed for any relation expressed with spatial prepositions, and there is intensive research on automatic spatial reasoning and semantic modeling of spatial prepositions (cf. e.g. [1, 42, 94]). For our purposes, we were not interested in especially sophisticated computational models but looked for an easy and practicable way to calculate spatial restrictions for well-formed table scenes. In analogy to mechanical assemblies, and in order to facilitate computations and to adopt our established control strategy, we thus assigned virtual mating features to the geometric furniture models. Figure 3.27(b) shows such features assigned to exemplary pieces of furniture.

Each table is associated with a set of 3D points which are indicated by green balls. They represent prototypical positions around the table near which one would usually expect to find a chair or another table. Chairs come along with points indicated by red balls, which represent the part of a chair that, in a common table scene, is most likely to be found near a table.

The idea of virtual mating features apparently simplifies the search for interrelated objects in a set of furniture models. Constraints for well-formedness could be stated using any measure of distance between virtual features. In the current implementation, we simply apply the Euclidean distance in ℝ³, but more sophisticated methods, as for instance proposed in [1], would be possible, too. And as in the case of baufix® assemblies, positions of mating features can guide the parsing process performed by the semantic network.
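Under these assumptions, the well-formedness test reduces to comparing feature points. A minimal sketch (the threshold value and all names are our own; the thesis does not specify a tolerance):

```python
import math

NEAR_THRESHOLD = 0.6  # assumed tolerance in scene units, not from the thesis


def euclidean(p, q):
    """Euclidean distance in R^3 between two (x, y, z) points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def matching_features(table_feats, chair_feats, threshold=NEAR_THRESHOLD):
    """Return all pairs of (table feature, chair feature) whose distance
    falls below the tolerance -- i.e. the pairs that would license
    instantiating the chair as part of the table's scene."""
    return [(t, c)
            for t in table_feats
            for c in chair_feats
            if euclidean(t, c) < threshold]
```

Any other metric could be substituted for `euclidean` without touching the rest of the control strategy, which is the design advantage the text alludes to.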

Again, the Ernest knowledge base contains attributes which represent context-dependent, domain-specific knowledge like, for example, the states of virtual mating features. And in analogy to the baufix® domain, there are attributes which govern the instantiation process, as they register examined objects and estimate how to continue parsing.

Given sets of 3D object models as input data, the semantic network searches them for table scenes similarly to the method introduced in Section 3.2.2. A table is an obligatory part of any table scene. Thus, after instantiating concepts for all objects in the input data, a table serves as the starting point of a first parsing attempt. It is instantiated as a TABLE-PART which in turn is annotated as a part of a modified concept of a TABLE-SCENE. Then, all other objects with an unused virtual mating feature in a certain range about an unused feature of the initial table are considered as possible further components of the TABLE-SCENE. The nearest one that complies with the syntactic model is instantiated correspondingly, e.g. as a CHAIR-PART, and the TABLE-SCENE becomes an instance as well.

The features responsible for this instantiation are henceforth considered to be used, and the TABLE-SCENE is conjoined with earlier found ones, if necessary. Afterwards, parsing restarts at another table if there is any, or continues with unexamined objects near already instantiated ones. This process iterates until all objects in the input data have been examined. Figure 3.28(b) exemplifies how detected structures are visualized. It depicts a result obtained from the scene in Fig. 3.28(a). In this example, the chair upon the table is not close enough to any of the table's virtual features to be understood as part of a conventional table scene. Consequently, it was not instantiated as such and thus appears as a wireframe model in the output.
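The control strategy just described can be summarized, in strongly simplified form, as a greedy loop over unused virtual features. The sketch below is our own reading of the process (it ignores conjoin operations and complex TABLE-PARTs, and all names are assumptions):

```python
import math


def euclidean(p, q):
    """Euclidean distance in R^3 between two (x, y, z) points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def parse_table_scene(tables, chairs, threshold=0.6):
    """Greedy sketch of the instantiation process: start at a table,
    repeatedly attach the nearest chair whose virtual feature lies
    within `threshold` of a still-unused feature of the scene.
    `tables` / `chairs` map object names to lists of feature points."""
    if not tables:
        return []  # a table is an obligatory part of any table scene
    start = next(iter(tables))  # arbitrary table starts the attempt
    open_feats = list(tables[start])
    scene = [("TABLE-PART", start)]
    remaining = dict(chairs)
    while remaining:
        best = None  # (distance, object name, scene feature used)
        for name, feats in remaining.items():
            for f in feats:
                for g in open_feats:
                    d = euclidean(f, g)
                    if d < threshold and (best is None or d < best[0]):
                        best = (d, name, g)
        if best is None:
            break  # no remaining object complies with the constraints
        _, name, used = best
        open_feats.remove(used)  # mark the scene feature as used
        scene.append(("CHAIR-PART", name))
        del remaining[name]
    return scene
```

On the example of Fig. 3.28, such a loop would leave the chair standing on the table unattached, since none of its features lies within range of an unused table feature.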

The syntactic structure which the semantic network produced for this table scene is shown in Fig. 3.28(c). Since there is just one table to initialize parsing, no conjoin operations were necessary and the resulting structure is linear (see page 48). Two of the three chairs included in the structure are of type CHAIR1. The one at the right front of the table is closest to a virtual mating feature of the table and thus is represented on the lowest level of the parsing tree. The CHAIR2-type chair at the left front is to be found on the next level, and, even though the perspective is confusing, the chair in the back is farthest from the table and thus is integrated on the highest hierarchical level.