
The Attentive Robot Companion

Learning Spatial Information from Observation and Verbal Interaction


Declaration of Authorship

According to Bielefeld University’s doctoral degree regulations §8(1)g: I hereby declare to acknowledge the current doctoral degree regulations of the Faculty of Technology at Bielefeld University. Furthermore, I certify that this thesis has been composed by me and is based on my own work, unless stated otherwise. Third parties have neither directly nor indirectly received any monetary advantages in relation to mediation advice or activities regarding the content of this thesis. Also, no other person’s work has been used without due acknowledgment. All references and verbatim extracts have been quoted, and all sources of information, including graphs and data sets, have been specifically acknowledged. This thesis or parts of it have neither been submitted for any other degree at this university nor elsewhere.


The Attentive Robot Companion

Learning Spatial Information from Observation and Verbal Interaction

Leon Ziegler

March 2015

A doctoral thesis presented for the degree of Doctor of Engineering (Dr.-Ing.) at

Faculty of Technology
Bielefeld University
Applied Informatics
Inspiration 1
33619 Bielefeld
Germany

Reviewers

Dr.-Ing. habil. Sven Wachsmuth
Prof. James J. Little

Examination Board

Prof. Dr. Philipp Cimiano
Dr.-Ing. Kirsten Bergmann


Acknowledgments

This thesis would not have been possible without the support of so many people. At this point I would like to take the opportunity to express my appreciation and say “thank you”. First of all, I owe the possibility to even study and choose a field I am truly interested in to the overwhelming support of my parents and family, who always believed in my abilities and supported me not only financially but also with their encouragement in numerous other ways. Especially during my time as a PhD student I have to thank my lovely girlfriend Julia, who never tired of providing me with optimism and confidence. It helped a lot to know that someone is always there for support and assistance outside the university. You are the one who had to suffer from the very time-consuming and resource-demanding endeavor that is a doctoral thesis. But my friends, especially my former roommates, deserve a special “thank you” as well.

In my professional environment, special thanks go to Sven, who has been a great supervisor and always provided valuable input and discussions throughout the process of this thesis. Furthermore, I would like to thank Jim Little for so readily agreeing to review my thesis.

Thank you, Gerhard, Franz, and Britta, for welcoming me into the Applied Informatics Group and providing such a wonderful working environment. The same holds for all my colleagues who always ensured a communicative, friendly, and amusing atmosphere. Especially Frederic, Florian, and Marco [...] thankful for the opportunity of being a part of Team ToBi.

Speaking of which, I want to thank all other ToBis for their hard work and commitment during the various competitions. In particular, I want to thank Sven (as the team leader), Frederic (for the early years), as well as Sebastian, Matthias, and Lukas (for the more recent years). It has been a great time not only meeting all of you but also working with you – and supervising at least some of you. Thank you all, it has been a great experience and a great success!

Furthermore, I thank Jens and Florian for sharing an office and the fruitful collaborations and discussions. Also thanks to my student assistants Michael, Lukas, Phillip, and Tobias.

Finally, I very much appreciate that my sister Lena, as well as Gwendolyn and Johannes, consented to proofread my thesis.


Abstract

This doctoral thesis investigates how a robot companion can gain a certain degree of situational awareness through observation and interaction with its surroundings. The focus lies on the representation of the spatial knowledge gathered constantly over time in an indoor environment. Against the background of research on an interactive service robot, however, methods for deploying this knowledge in inference and verbal communication tasks are presented. The design and application of the models are guided by the requirements of referential communication. The approach here involves the analysis of the dynamic properties of structures in the robot’s field of view, allowing it to distinguish objects of interest from other agents and background structures. The use of multiple persistent models representing these dynamic properties enables the robot to track changes in multiple scenes over time to establish spatial and temporal references. This work includes building a coherent representation considering allocentric and egocentric aspects of spatial knowledge for these models. Spatial analysis is extended with a semantic interpretation of objects and regions. This top-down approach for generating additional context information enhances the grounding process in communication. A holistic, boosting-based classification approach using a wide range of 2D and 3D visual features anchored in the spatial representation allows the system to identify room types. The process of grounding referential descriptions from a human interlocutor in the spatial representation is evaluated through referencing furniture. This method uses a probabilistic network for handling ambiguities in the descriptions and employs a strategy for resolving conflicts. In order to approve the real-world [...]


Contents

List of Tables

List of Figures

Glossary

1. Introduction
1.1. Robot Companions in the Home
1.2. From Vista Space to Environmental Space
1.3. Research Questions
1.4. Scenario & System Foundation
1.5. Outline

2. Analysis and Statement of the Research Problem
2.1. Functional Requirements
2.2. The Choice of Scope
2.3. Knowledge Representation
2.4. Applying a Situation Model in Interaction
2.5. Summary: Contribution of this Thesis

3. Partitioning the Workspace
3.1. The Geometric Foundation
3.2. Detecting Roles in Articulated Scenes


3.2.2. The Challenge in Segmenting a Scene
3.2.3. The Articulated Scene Model
3.3. Anchoring and Integrating Egocentric Models
3.3.1. A Twofold Spatial Representation
3.3.2. Registering a Scene Model with the Current View
3.3.3. Generating a Valid Model for the Current View
3.3.4. Applications Exploiting the Model’s Potential
3.4. Focusing the Robot’s Attention
3.4.1. The Interlocutor’s Viewing Direction
3.4.2. Detecting Interaction Spaces for Manipulation
3.4.3. Repositioning for Observation
3.5. Evaluation
3.5.1. Quantitative Evaluation
3.5.2. Event Detection with ASM
3.5.3. Robot Behavior Performance
3.6. Summary

4. Applying Semantics
4.1. Furniture Categorization
4.1.1. Implicit Shape Model
4.1.2. Ray-Based Hough Space Voting
4.1.3. Evaluation
4.1.4. Summary & Discussion
4.2. Classification of Household Objects
4.2.1. Boosted Classification
4.2.2. Evaluation
4.3. Room Categorization
4.3.1. Generation of Training Data
4.3.2. Anchoring of Features
4.3.3. Evaluation
4.4. Summary

5. Perception and Communication
5.1. Benefits of Combining Perception and Communication
5.2. Reference Frame Selection in Human Communication
5.2.1. Conducting an Online Study
5.2.2. Empirical Results


5.3. A Probabilistic Model
5.3.1. Visual Analysis
5.3.2. Maintaining and Updating the Spatial Network
5.3.3. Resolving Conflicts
5.3.4. Adaptation to Personal Preferences
5.3.5. Application in Human-Robot Interaction
5.4. Evaluation
5.4.1. Online Evaluation
5.4.2. Real-World Evaluation
5.5. Summary

6. Discussion & Conclusion

Bibliography

Appendices
A. Situation cases for ASM evaluation
B. Results from Multi-View ASM Evaluation
B.1. Evaluation of simple ASM
B.2. Evaluation of naive matching ASM
B.3. Evaluation of multi-view ASM
C. Models for 3D ISM Training
D. Results from Evaluation of Household Object Classification


List of Tables

3.1. Categories for pixel-wise evaluation
3.2. Categories for event evaluation
4.1. Confusion matrix of the voting scheme evaluation
4.2. Results of the furniture categorization
4.3. Recognition results on real-world indoor scenes
4.4. Confusion matrix of the E-SAMME-ALL condition
4.5. Confusion matrix of the E-SAMME-3D condition
D.1. Confusion matrix of the E-SAMME-2D configuration
D.2. Confusion matrix of the E-SAMME-OBJ-T configuration
D.3. Confusion matrix of the E-SAMME-OBJ-S configuration


List of Figures

1.1. The humanoid robot Nao
1.2. First generation of BIRON
1.3. Cosero
1.4. BIRON II
3.1. SLAM example
3.2. Articulated Scene Model
3.3. Scene segmentation processing pipeline
3.4. Twofold spatial representation
3.5. Registering a scene model
3.6. Naive model matching
3.7. Merging premises
3.8. Rear projection to view frustum
3.9. Scene from different perspectives
3.10. Movement strategies for a mobile robot
3.11. Multi-modal anchoring
3.12. SeAM Layers
3.13. Viewpoint calculation
3.14. Schematic visualization of settings (I)
3.15. Schematic visualization of settings (II)
3.16. Required information for quantitative analysis
3.17. Schematic visualization of settings (II)
3.18. Quantitative results from additional test cases


3.20. Evaluation results from additional settings
3.21. Evaluation results from additional settings
3.22. Scenario for the qualitative evaluation
4.1. Virtual scans of furniture meshes
4.2. Clustering of vote rays
4.3. Intersection of spheres and vote rays
4.4. Recognition results on real world indoor scenes
4.5. SAMME error plot for changing number of classes
4.6. Objects used for evaluation of E-SAMME
4.7. Recognition results on household objects
4.8. Feature appearance in E-SAMME
4.9. Feature test error comparison for object recognition
4.10. A reconstruction of a living room
4.11. Feature test error comparison
4.12. Single feature comparison
4.13. Feature appearance in E-SAMME
4.14. Confusion matrices from training on room database
4.15. Graph of base classifier usage
4.16. MLP comparison
4.17. Confusion matrices of training on IKEA room database
5.1. Leonardo
5.2. Reference object with located object for RF selection
5.3. Vehicle and opposite objects
5.4. Percentage use for each reference frame
5.5. Furniture Segmentation
5.6. Furniture graph example
5.7. Details of the probabilistic model
5.8. A sequence of graph configurations for backtracking
5.9. Selected graph configurations after three descriptions
5.10. Selected graph configurations after four descriptions
5.11. Sequences of graph configurations with RF preferences
5.12. Room for online study
5.13. Probability distributions of furniture in online study
5.14. Correct matches in online evaluation
5.15. Describing spatial relations in a real-world apartment


5.16. Furniture layout for evaluation
5.17. Probability distributions of furniture in real-world study
5.18. Correct matches in evaluation
5.19. Amount of correct matches by groups
5.20. Amount of vertices that were assigned a wrong label


Glossary

AdaBoost

A supervised machine learning meta-algorithm that combines several weak classifiers into a single strong classifier.

ASM

Articulated Scene Model.

Base classifier

One of the simple classifiers used in boosting to generate an ensemble classification scheme.

BIRON

Bielefeld Robot Companion.

BIRON II

Bielefeld Robot Companion V2.

BonSAI

Biron Sensor and Actuator Interface.

BoW

Bag of Words.

BRISK

Binary Robust Invariant Scalable Keypoints.

DTree

Decision Tree.

E-SAMME

Exhaustive SAMME.

Environmental space

The psychological space that is projectively larger than the body and surrounds it. It is too large to apprehend directly without considerable locomotion.

Figural space

The psychological space that is projectively smaller than the body and can be directly perceived from one place without appreciable locomotion.

FPFH

Fast Point Feature Histogram.

FPR

False Positive Rate.

FREAK

Fast Retina Keypoint.

Geographical space

The psychological space that is projectively much larger than the body and cannot be apprehended directly through locomotion.


Home tour

A scenario in which a human introduces a new robot to her apartment and shows it around to familiarize it with this new environment.

HRI

Human-Robot Interaction.

ICP

Iterative Closest Point.

ISM

Implicit Shape Model.

KinFu

Kinect Fusion.

Lost key scenario

A recurring scenario for demonstrating various aspects of this thesis. A mobile robot is able to tell where certain objects are, just by observing the human’s actions.

MLP

Multilayer Perceptron.

Opposite object

The intrinsic left/right axis of opposite objects is primarily assigned in a way that corresponds to standing in front of the object.

ORB

Oriented FAST and Rotated BRIEF.

PCL

Point Cloud Library.

Point cloud

A set of data points in some coordinate system. In this thesis, this term always refers to a set of points in a three-dimensional Cartesian coordinate system.

RANSAC

Random Sample Consensus.

RBPF

Rao-Blackwellized Particle Filters.

RF

Reference Frame.

RSB

Robotics Service Bus.

SAMME

Stagewise Additive Modeling using a Multi-Class Exponential Loss Function.

SeAM

Semantic Annotation Mapping.

SHOT

Signature of Histograms of Orientations.

SIFT

Scale-Invariant Feature Transform.

Situation awareness

An awareness of the geometrical, functional, and social situation an agent is located in.


Situation model

This term is used in psychology as a means to express the multi-dimensional representation of the situation under discussion.

SLAM

Simultaneous Localization and Mapping.

SPD

Scene Plane Descriptor.

Superpixel

A set of pixels of a digital image that together represent an image segment according to some criterion.

SURF

Speeded Up Robust Features.

SVM

Support Vector Machine.

Vehicle object

The intrinsic left/right axis of vehicle objects is primarily assigned in a way that corresponds to sitting in the object.

Vista space

The psychological space that is projectively as large or larger than the body but can be visually apprehended from a single place without appreciable locomotion.


Chapter 1

Introduction

The development of mechanical and digital hardware is progressing rapidly, so researchers are trying to bring robotic applications into human living and working environments. Personal robots with a human-like situation awareness that are able to perform seamlessly as companions in everyday situations are the subject of many utopian visions. Considering the rapid aging of populations in Europe and many other countries, personal assistive robots are considered a key technology for prolonging the independence of elderly people. According to Schaal (2007), even more functions relevant to our society will be fulfilled by robots, for example in education, health care, rehabilitation, and entertainment. However, we have learned that the progression from static and well-defined environments in laboratories or industrial settings to dynamic, uncertain, and very complex domains is extremely hard. There is still a long way to go before real personal robots become mature enough to function among us.

“One reason for this gap is that it has been much harder than expected to enable computers and robots to sense their surrounding environment and to react quickly and accurately.”

(Gates, 2007)

An awareness of what the environment looks like is crucial for an artificial agent. In recent years, advances in technologies for sensing and interpreting the surrounding’s spatial properties have enabled researchers to develop robotic systems that are able to perform in highly complex real-world scenarios (Thrun et al., 2006).


But not only the spatial structure of the environment is important. For a personal robot, it is at least equally important to know what the structure’s function is and what situation it represents. This enables it not only to perform the specific tasks it is asked to do, but it is also a prerequisite for successful Human-Robot Interaction (HRI). Especially in terms of appropriate communication about items in the environment, a sophisticated situation model is essential. The term situation model is used in psychology as a means to express the multi-dimensional representation of the situation at hand (van Dijk and Kintsch, 1983; Johnson-Laird, 1983). Zwaan and Radvansky (1998) state that the model contains at least five dimensions of situations: time, space, causality, intentionality, and protagonist (reference to the main individuals under discussion).
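To make this notion concrete, the sketch below shows one possible way to encode these five dimensions as a data structure. All field names and types are illustrative assumptions for a robotics domain, not part of Zwaan and Radvansky's model.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SituationEntry:
    """Illustrative record covering the five dimensions of a situation model
    named by Zwaan and Radvansky (1998). All field choices are assumptions."""
    timestamp: float                       # time: when the situation was observed
    position: tuple                        # space: e.g. (x, y, z) in a world frame
    cause: Optional[str] = None            # causality: what triggered the event
    intention: Optional[str] = None        # intentionality: inferred goal of the actor
    protagonists: List[str] = field(default_factory=list)  # main individuals involved


# Example: the robot observed a person placing keys on the kitchen table.
entry = SituationEntry(timestamp=1420.5, position=(3.2, 1.5, 0.8),
                       cause="object placed", intention="store keys",
                       protagonists=["person_1"])
```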

However, it is not enough to build up an isolated knowledge base of facts about the physical environment. In communication, dialog turns are linked across interlocutors, and the meaning of the conversational content depends on the interlocutors’ implicit consensus, not on explicit definition (Sacks et al., 1974; Brennan and Clark, 1996). This means that a model for situation awareness always depends on the context of the current situation and the alignment in communication, in other words, the common ground between the interaction partners. According to Branigan et al. (2000), a coordination of interlocutors occurs when they share the same representation at some level. So, Pickering and Garrod (2004) argue that the “Alignment of situation models [. . . ] forms the basis of successful dialogue”. While alignment is not per se necessary for successful communication, the alternatives would be very inefficient in terms of production and comprehension of utterances.

From a usability point of view, the components of a system not only have to operate as the developer conceptualized them, meaning that they fulfill their functions and are technically stable; the system also has to be both easy and safe to use, as well as socially acceptable (e.g. Dix et al., 2004; Nielsen, 1993).

If robots are supposed to actually be involved in our society in the future like Schaal suggests, they need to be accepted by children and adults. This can only be realized if they comply with certain social behaviors and standards that we as humans find acceptable. Dautenhahn (2007) formulates a set of social rules for robot behavior containing different paradigms regarding the social relationship of robots and people. This includes a means of communication that aligns to the communication partner and the context of the interaction. This exposes a need for a representation that comprises interlocutor-specific and context-specific semantic knowledge: the situation model.

Now the question is: Which information should be available in a situation model, and how should this information be represented? Also, which mechanisms are needed in order to apply the knowledge in real-world situations? These questions outline the work I will present in this thesis, though it is not possible to answer them in a comprehensive way. Instead I will take a closer look at three different aspects of building a consistent situation model. These aspects focus on the space, time, and protagonist dimensions of Zwaan and Radvansky’s (1998) definition. The causality and intentionality dimensions will only be covered marginally in the enclosing high-level applications.

The basis for such a model is a geometric description of the surroundings. I will explore possibilities for representing the data in a way that allows an appropriate level of detail for the task at hand and enables inference about the functional roles of certain structures through observation. The aspect of learning (in terms of knowledge acquisition) is very important for the successful generation of an adaptive model. This is also true for interpretations of the surroundings that cannot directly be inferred from observation. It is important for a personal robot to also have semantic knowledge about different areas of its working environment in order to act appropriately. In order to enrich the situation model with the according information, I will present an approach for applying semantics to the enclosed areas of an apartment. Nevertheless, a situation model should not consist purely of visually perceived information. The communication with a human interlocutor provides useful information as well, not only about the situation itself, but also about the way this information is represented in the interlocutor’s mental situation model. In order to align to the partner on a communicative level, it is important to establish methods to access the situation model in a way that supports this alignment process. This thesis is embedded in the research program of the collaborative research cluster “Alignment in Communication” at Bielefeld University. The program involves many interdisciplinary projects which collaborate to investigate different kinds of alignment phenomena and their implications for conversation and situation models. Wachsmuth et al. (2013) give an overview of selected topics within [...]


1.1. Robot Companions in the Home

Many robotic platforms have been developed in recent years that aim to lead the way for future personal robot companions. Many of them focus on technical design and appearance in order to support research on motion and HRI, like the adult-sized, futuristic-looking robot HRP-4 (Kaneko et al., 2011), the infant-sized iCub (Metta et al., 2010), and the anthropomorphic robot head Flobi (Lütkebohle et al., 2010).

A few robots have made the transition into the real world, like the impressive robotic car Stanley, winner of the DARPA Grand Challenge, developed by Thrun et al. (2006), or the TOOMAS shopping guide (Gross et al., 2009). However, all of these robots have a very distinct task to fulfill, and their hardware and software design is highly optimized for that task.

Figure 1.1.: Nao by Aldebaran Robotics. Image taken from Bader et al. (2013).

Other obvious examples are vacuum cleaning, floor washing, and lawn mowing robots that have been available for purchase for several years now. But there are also commercial robots on the market that still serve a very limited, but social function. The robotic seal Paro is used in care facilities with elderly people or other patients in order to increase their social interaction, similar to animal-assisted therapy (Wada and Shibata, 2007). Comparable effects were found with the toy dinosaur Pleo in children’s play (Fernaeus et al., 2010).

Another commercially available robot is the Nao by Aldebaran Robotics (Gouaillier et al., 2009) (see Figure 1.1). It has been designed for a much wider range of applications than the afore-mentioned robots. In practice, however, it is mostly used as a toy or a research platform because the software still lacks essential abilities to truly understand its surroundings and its communication partner.

Most of the basic research in robotics is done using platforms not designed for end users, but to support the research itself. The first generation of our research platform, Bielefeld Robot Companion (BIRON), was introduced by Haasch et al. (2004). It was a modified PeopleBot from ActiveMedia equipped with a pan-tilt camera, a pair of microphones, and a laser range finder (see Figure 1.2).

Figure 1.2.: First generation of BIRON.

At the time, it was used in a home tour scenario which involved sensing of humans (Fritsch et al., 2004), sensing the environment using the laser range finder for obstacle avoidance, and recognition of human speech (Wachsmuth et al., 1998) coupled with a basic dialog management system. By contemporary standards, the home tour scenario is a comparatively simple challenge. The robot behaves purely reactively. It passively follows a human to new locations and is introduced to new facts about the environment, which basically consist of references from labels to coordinates. It does not learn anything new about its environment except when notified by the human.

Today comparable research projects go beyond the home tour scenario and progress to more complex scenarios. Those either require a more sophisticated situation model, more powerful perception, or a dialog system that handles more complex interactions. Further, most projects involve a pro-active robot behavior, like in the scenario for Dora The Explorer, first introduced by Hawes et al. (2010). Dora is driven by a motivational system that triggers an active exploration behavior to fill gaps in the spatial knowledge of the environment. Meanwhile, the robot tries to do a categorical labeling of rooms by analyzing functionally important objects, as well as considering ontology-driven inference on the results of this uninformed search. The architecture is composed of reactive goal generators which create new goals that pass a collection of filters for a first selection step. A management mechanism then determines which of the remaining goals to pursue (Sjoo et al., 2010). It contains goal generators for frontier-based exploration, view planning, and a visual search using the pan-tilt-zoom camera. The spatial information is stored using a framework for cognitive spatial mapping (Pronobis et al., 2009). The map is assembled from so-called “places” that define the spatial relations representing the structure of the environment. A “place” is a collection of arbitrary distinctive features that can be complex or abstract in nature. Also, there is a concept called “scene” for segmentation of space and grouping of similar feature values. The map is only a topological representation of the environment and does not require maintenance of a global spatial consistency.

Meger et al. (2010) pursued a similar goal with the visual searching platform Curious George. This robot won first place in the 2007 and 2008 robot league of the Semantic Robot Vision Challenge (Helmer et al., 2009). As the competition requires the robots to identify objects from instantly-learned categories using web imagery, Curious George is able to download the required data from web services like Google Image Search. For exploration, the team implemented a frontier-based strategy as proposed by Yamauchi (1997). For visual search they do not use a 2D occupancy grid, but a 3D representation of the environment, the result of a horizontal surface-finding algorithm developed by Rusu et al. (2009d) as a package for the Robot Operating System (ROS) (Quigley et al., 2009). As an attention system, they implemented the saliency map approach proposed by Itti et al. (1998).

Figure 1.3.: Cosero.

One of the most advanced robot platforms in terms of real-world applicability in household scenarios is probably Cosero (Stückler et al., 2014) (see Figure 1.3). The team NimbRo@Home from the University of Bonn won the RoboCup@Home competition (Wisspeintner et al., 2009) in 2011, 2012, and 2013 using the Cosero platform and its predecessor Dynamaid. It is equipped with a height-adjustable torso on an omni-directionally moving base and two anthropomorphic arms. The human-like appearance is meant to support HRI. To represent the environment, the deployed system uses a global occupancy map refined through the so-called 3D surfel grid approach (Stückler and Behnke, 2014). This global representation is used mainly for planning in navigation, while an egocentric 3D representation of the current situation is used mainly for local planning and grasping. For people awareness they augment the global environment representation with person hypotheses (Stückler and Behnke, 2011), which in turn profits from semantic knowledge about the surrounding structure retrieved from this representation.

Although these robots are already quite sophisticated, they still lack a knowledge representation that is powerful enough to handle future tasks of a truly personal service robot. The current representations do not generalize to tasks arbitrarily different from those described in the research publications.

Further, large parts of the gathered information are not preserved long-term. Most of it is only kept for intermediate usage, and only very high-level representations are preserved for later reference (Dora The Explorer is an exception here). Another shortcoming of the robotic systems described here is that there seem to be no strategies to align the situation model to the communication partner. This is certainly a requirement for future personal robots. It is impractical to keep up the command-like communication pattern that current artificial systems require in order to understand the interlocutor. HRI will be based on natural language in the future, which will require alignment strategies in the robotic systems that are able to match internal representations to instances of differently represented information. That a robot like Cosero, which lacks these abilities, is so successful in the RoboCup@Home competition shows that research must still evolve in these areas. The tasks assigned to the robots are not designed to require such abilities1. This is probably because research has not come far enough yet for enabling the participating teams to perform real natural language HRI or accessing a generalizable multi-purpose knowledge base. The “Enduring General Purpose Service Robot” task goes in this direction, but from personal experience, I can report that the last years’ commands were all solvable using standard tools. Most tasks require pre-knowledge of allocentric information like labeled locations and areas. Accordingly, the majority, if not all, allocentric knowledge is provided beforehand and the robot just has to build up egocentric representations to carry out commands at specific locations.

1 The 2014 rulebook for the RoboCup@Home competition can be found at http://www.


It seems there is a lack of widespread, functioning solutions for an integrated approach to gathering and maintaining knowledge with varying spatial scope. This is one of the reasons why it might be valuable to shift the attention in research from processing robotics problems in the easily perceivable space in the direct vicinity of the robot to a more comprehensive view of the wider environment.

1.2. From Vista Space to Environmental Space

In psychology, it is mutually agreed that cognitive functions differ when applied to different scales of space, as discussed by Montello (1993). He argues that in human psychology the representation of space is scale-dependent. Applied to actual tasks this means that comparably small scenes, such as those in manipulation tasks, are represented differently than those in tasks like navigating to another room, which requires representation of a much wider area. Montello (1993) distinguishes four major classes of psychological spaces. The figural space is “projectively smaller than the body and can be directly perceived from one place without appreciable locomotion”. The vista space is “projectively as large or larger than the body but can be visually apprehended from a single place without appreciable locomotion”. The environmental space is “projectively larger than the body and surrounds it. It is too large to apprehend directly without considerable locomotion”. Usually it requires the integration of information over a significant period of time to fully perceive this space. Geographical space is “projectively much larger than the body and cannot be apprehended directly through locomotion”.

For robotics, this means that it might also be advantageous to make a similar distinction. In previous work, Swadzba (2011) explored ways to model the vista space of a mobile robot. In this thesis, I will proceed to a more comprehensive view of the spaces relevant for a personal robot in an apartment environment. However, although I will present approaches for integrating representations of different scopes, the focus will lie on vista space and environmental space representations. Geographical space is out of scope for a domestic service robot, and figural space is explicitly covered by another project within the collaborative research cluster in which this thesis is embedded (cf. Meier et al., 2011; Li et al., 2012).
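As a rough illustration only, the following sketch pairs Montello's taxonomy with the representation scopes used in this thesis. The mapping is an assumption made for clarity, not a claim from Montello (1993).

```python
from enum import Enum


class PsychSpace(Enum):
    FIGURAL = "figural"              # smaller than the body, perceived without locomotion
    VISTA = "vista"                  # room-sized, perceived from a single place
    ENVIRONMENTAL = "environmental"  # apartment-sized, requires locomotion
    GEOGRAPHICAL = "geographical"    # city-sized, not apprehensible through locomotion


# Illustrative assumption: which representation a domestic robot might use per scale.
# Figural and geographical space are out of scope for this thesis.
REPRESENTATION_FOR_SCALE = {
    PsychSpace.VISTA: "egocentric point cloud of the current view",
    PsychSpace.ENVIRONMENTAL: "allocentric occupancy grid of the apartment",
}

print(REPRESENTATION_FOR_SCALE[PsychSpace.VISTA])
```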

As the representation of space should be applied in a real-world scenario, it must be handled as a continuous model, although the different scopes are represented in different ways. It is impractical to model the scopes of a scene in completely isolated representations that prevent a bidirectional interchange or collaboration.

Ruetschi and Timpf (2005) argue for a similar distinction between spaces with different scopes in the real-world scenario of wayfinding in public transport. They found that humans use the network space, which is “a mediated space, presenting itself by means of maps and schedules, but also by audible announcements and tardiness. It exhibits a network structure”. In addition, they use the scene space, which is “directly experienced but documented only implicitly and within itself. [. . . ] It exhibits a hierarchical structure”. These spaces have a geographical scale and an environmental or vista scale, respectively, following Montello’s definition. Further, they state that network space and scene space are linked in many ways and interact closely in the application domain of public transport.

So in addition to the definitions mentioned in Section 1, a situation model should support representation of different scopes of the space surrounding an agent in a scope-dependent, but continuous way. A robotic system that implements such a situation model needs strategies for cooperation between the different scope-dependent representations.

1.3. Research Questions

Although we have seen many very impressive performances of robots in recent years in a wide spectrum of application scenarios, when looking at an individual system, the capabilities are very limited. Usually, complete systems are designed to function in exactly one scenario. They may represent the optimal system for solving the task at hand, but apart from that, most systems are completely useless. There are different reasons for this. One is certainly that many researchers perform basic research on delimited fields, which is good because most basic capabilities a true personal robot requires are still far from solved. Another aspect might be the lack of applicability of many software components to realistic circumstances, or to some extent, the inability to adapt to situations other than the one they were optimized for. This leads to another aspect that is typically underestimated: the integration of context into analyses and mechanisms. An object recognition component could largely profit from knowing which functional role the currently perceived scene has, and a pro-active knowledge gathering behavior might not be appropriate in the middle of the night.

With the advancements in available functionality, middleware implementations, and system coordination approaches over the last years, more and more focus is applied to system integration aspects. With the availability of more complex (in terms of number of available functions) and more compatible systems, there is also a growing demand for unitary solutions rather than multiple island solutions, especially in knowledge representation techniques. This demand leads to the following research questions, which represent the basic skeleton of this work.

Question 1: How to represent spatial knowledge?

Which frames of reference should be used (egocentric, global)? How can structural information or instances and their relations be represented? Which data structures should be used? How can world knowledge and inferred knowledge be combined?

Question 2: What and when to represent?

Which level of detail should be applied and how does this depend on the situation? How can the relevance of certain data be judged before insertion?

Question 3: How to solve temporal integration?

How does the update process work? Which additional dimensions are required in the situation model? How can the temporal aspect of the representation be exploited?

Question 4: How to include context information?

How do components benefit from context knowledge? How can background knowledge be referenced on a later occasion? Can the spatial layout of the knowledge representation facilitate the selection of peripheral information?

All of these questions need to be answered when trying to build a situation model for a personal robot companion that is applicable in arbitrary situations. Certainly there are more aspects to this topic (semantic ontologies, logical inference, intentions, etc.), which cannot be answered in the scope of this thesis. These questions indicate, however, that the pure representation is not the only key to a successful situation model development, but also that strategies for handling data processing are required.

Nevertheless, the posed questions underlie the work described in this thesis, and in the following chapters, I will propose answers to them.

1.4. Scenario & System Foundation

The different aspects covered in the following chapters will be linked to a common scenario in order to demonstrate the various applications of the described solutions. I will refer to this scenario as the lost key scenario. It is an analogy of a situation in which a person cannot remember where she/he last placed a key ring and asks someone for help finding it. In concrete terms, the mobile service robot of a homeowner observes the actions and utterances in its surroundings to build up a situation model. For this, it pro-actively moves around the apartment and closely inspects presumably relevant events or locations. At some point, when someone asks it about a certain object, it is able to report the location or the last performed manipulation of the target object. This scenario requires the afore-mentioned aspects investigated in this thesis. It requires a situation model implementation with long-term capabilities in representing distinct structures and actions. To maximize the informative content and minimize the effort, the robot must select the most relevant events to observe and represent those in an efficient way. It must be able to link possibly ambiguous verbal references to spatial structures by aligning descriptions to actions and to the situation model. Further, a verbalization of the found results must be available which supports the alignment to the communication partner. The hardware and software components needed to enable such a scenario, as a prerequisite for the implementations done for this thesis, will now be explained.

The Hardware Platform

The Bielefeld Robot Companion V2 (BIRON II) hardware platform (see Figure 1.4) we use, based on the research platform GuiaBot™ by Adept MobileRobots2, is customized and equipped with sensors that allow analysis of the current situation in a human-robot interaction. The platform used here is the second generation of the BIRON platform series, which has been continuously developed since 2001. It comprises two piggyback laptops to provide the computational power and to achieve a system running autonomously and in real-time for HRI. The robot base is a PatrolBot™ which is 59 cm in length, 48 cm in width, and weighs approx. 45 kilograms with batteries. It is maneuverable with 1.7 meters per second maximum translation and 300+ degrees rotation per second. The drive is a two-wheel differential drive with two passive rear casters for balance. Inside the base, there are two laser range finders that add up to a 360° laser scan with a scanning height of 30 cm above the floor. To control the base and solve navigational tasks, we rely on the ROS navigation stack3.
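For readers unfamiliar with differential drives, the standard kinematics that turn a commanded forward velocity v and rotational velocity ω into left and right wheel speeds can be sketched as follows. The track width used here is a placeholder assumption, not the PatrolBot's actual specification.

```python
def diff_drive_wheel_speeds(v: float, omega: float, track_width: float = 0.4):
    """Standard differential-drive kinematics.

    v           -- forward velocity of the base [m/s]
    omega       -- rotational velocity [rad/s]
    track_width -- distance between the two drive wheels [m] (placeholder value)

    Returns (v_left, v_right) wheel surface speeds in m/s.
    """
    v_left = v - omega * track_width / 2.0
    v_right = v + omega * track_width / 2.0
    return v_left, v_right


# Example: drive 1.0 m/s forward while turning at 1.0 rad/s.
print(diff_drive_wheel_speeds(1.0, 1.0))  # (0.8, 1.2)
```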

For room perception, gesture recognition, and 3D object recognition, the robot has two ASUS Xtion Pro Live RGB-D sensors4 for real-time 3D image data acquisition: one facing down (objects) and an additional one facing towards the user/environment. The object recognition system is supported through high-quality 2D imagery from a Sony Alpha 5100 consumer camera. A high-resolution webcam is used for facial recognition. The corresponding computer vision components rely on implementations from Open Source libraries like OpenCV5 and PCL6.
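The point clouds used throughout this thesis are typically obtained from such RGB-D depth images by back-projection through the pinhole camera model. The following minimal NumPy sketch illustrates this; the intrinsics (fx, fy, cx, cy) are rough placeholder values for Xtion-class sensors, not calibrated ones.

```python
import numpy as np


def depth_to_point_cloud(depth, fx=570.0, fy=570.0, cx=319.5, cy=239.5):
    """Back-project a depth image (meters, shape HxW) into an Nx3 point cloud
    using the pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.dstack((x, y, z)).reshape(-1, 3)
    return points[points[:, 2] > 0]  # drop invalid (zero-depth) pixels


# Example with a synthetic 480x640 depth image, all pixels 2 m away:
cloud = depth_to_point_cloud(np.full((480, 640), 2.0))
print(cloud.shape)  # (307200, 3)
```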

Additionally, the robot is equipped with the Katana IPR 5 degrees-of-freedom (DOF) arm, a small and lightweight manipulator driven by 6 DC motors with integrated digital position encoders. The end-effector is a sensor-gripper with distance and touch sensors (6 inside, 4 outside), allowing it to grasp and manipulate objects up to 400 grams throughout the arm’s envelope of operation. The on-board microphone has a hyper-cardioid polar pattern and is mounted on top of the upper part of the robot. For speech recognition and synthesis, we use the Open Source toolkits CMU Sphinx7 and MARY TTS8. The upper section of the robot also houses a touch screen (≈ 15 in) as well as the system speaker. The overall height is approximately 140 cm.

2 http://www.mobilerobots.com/ (visited: March 1, 2015)
3 http://wiki.ros.org/navigation (visited: March 1, 2015)
4 http://www.asus.com/de/Multimedia/Xtion_PRO_LIVE/ (visited: March 1, 2015)
5 http://opencv.org/ (visited: March 1, 2015)
6 http://pointclouds.org/ (visited: March 1, 2015)
7 http://cmusphinx.sourceforge.net/ (visited: March 1, 2015)
8 http://mary.dfki.de/ (visited: March 1, 2015)

Figure 1.4.: BIRON II.

For real-world applications, the robot can be deployed in a laboratory apartment in the new CITEC building of Bielefeld University. This so-called Intelligent Apartment measures 60 square meters and has three rooms, including a kitchen, a living room, a gym, and a bathroom. It contains plenty of hidden technology, but looks like a regular apartment.

System Architecture

To model the robot behavior in a flexible manner, we use the Biron Sensor and Actuator Interface (BonSAI) framework. It is a domain-specific library built on the concept of sensors and actuators that allow the linking of perception to action (Siepmann and Wachsmuth, 2011). These are organized into robot skills that exploit certain strategies for informed decision making (Lohse et al., 2013). BonSAI supports modeling of the control flow using State Chart XML. The coordination engine serves as a sequencer for the overall system by executing BonSAI skills to construct the desired robot behavior. This allows the robot to separate the execution of the skills from the data structures they facilitate, thus increasing the re-usability of the skills.
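BonSAI itself models the control flow in State Chart XML; the plain-Python sketch below only illustrates the underlying idea of a sequencer that executes named skills and selects the next skill based on outcome labels. All class and method names here are invented for illustration and do not mirror the actual BonSAI API.

```python
from typing import Callable, Dict


class Skill:
    """A named behavior linking perception to action (illustrative only)."""
    def __init__(self, name: str, execute: Callable[[], str]):
        self.name = name
        self.execute = execute  # returns an outcome label, e.g. "success"


class Sequencer:
    """Minimal state-machine sequencer: outcome labels select the next skill."""
    def __init__(self, transitions: Dict[str, Dict[str, str]], skills: Dict[str, Skill]):
        self.transitions = transitions  # state -> {outcome -> next state}
        self.skills = skills

    def run(self, start: str):
        state = start
        while state != "done":
            outcome = self.skills[state].execute()
            state = self.transitions[state].get(outcome, "done")


# Example behavior: navigate somewhere, then announce arrival.
skills = {
    "navigate": Skill("navigate", lambda: "success"),
    "announce": Skill("announce", lambda: print("I have arrived.") or "success"),
}
Sequencer({"navigate": {"success": "announce"}, "announce": {"success": "done"}},
          skills).run("navigate")
```

Separating the transition table from the skill implementations mirrors the re-usability argument above: the same skill can appear in arbitrarily many behaviors.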

The robot’s architecture relies on the lightweight and flexible middleware Robotics Service Bus (RSB) for inter-component communication (Wienke and Wrede, 2011). RSB-enabled components communicate using a message-oriented, event-driven pattern over a logically unified bus that is organized through hierarchical scopes.
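The defining property of this pattern is that subscriptions are made on hierarchical scopes, so a handler registered on /robot/ also receives events published on any subscope such as /robot/vision/. The toy sketch below reimplements only this matching logic for illustration; it is not the RSB API.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List


class ScopedBus:
    """Toy event bus with hierarchical scopes (illustrative, not RSB itself)."""
    def __init__(self):
        self.handlers: DefaultDict[str, List[Callable]] = defaultdict(list)

    def subscribe(self, scope: str, handler: Callable):
        self.handlers[scope].append(handler)

    def publish(self, scope: str, event):
        # A handler on "/robot/" also fires for events on "/robot/vision/...".
        for sub_scope, handlers in self.handlers.items():
            if scope.startswith(sub_scope):
                for handler in handlers:
                    handler(scope, event)


bus = ScopedBus()
bus.subscribe("/robot/", lambda s, e: print("robot-wide:", s, e))
bus.subscribe("/robot/vision/", lambda s, e: print("vision only:", s, e))
bus.publish("/robot/vision/persons", {"id": 1})    # triggers both handlers
bus.publish("/robot/navigation/goal", (1.0, 2.0))  # triggers only the first
```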


1.5. Outline

The thesis is structured as follows: Chapter 2 takes a more detailed look at the problem at hand. I will analyze the research questions, formulating concrete approaches to be further developed in subsequent chapters. The main contributions of this thesis will also be formulated there. After this I will proceed semantically, starting with basic requirements and ending with complex conversational aspects. Chapter 3 deals with the structural representation of the robot’s surroundings (Research Question 1). The chapter will discuss spatial and temporal integration aspects from the research questions (mainly Questions 2 and 3) and present a pro-active robot behavior that utilizes the developed models. A more semantic view of the surrounding structures is employed in Chapter 4. Here, a grounding of certain entities and areas to general semantic categories which expose certain functional properties is described. Further, the integration of peripheral information into the decision making process will be explored. This mainly refers to Research Questions 3 and 4. Chapter 5 takes a look at instances and their relations in the situation model, suggesting how to align these to the model of an interlocutor in a communicative situation (Research Question 1). It also demonstrates how different cognitive functions can benefit from one another, mainly contributing to Question 4. Finally, I will give a closing overview in Chapter 6, including a discussion of the achieved contributions and an outlook on future perspectives for implementing situation models for personal robots.


Chapter 2

Analysis and Statement of the Research Problem

In the introduction I promised to explore ways of representing a situation model in an artificial robotic system. As stated above, there are many aspects to this topic that cannot all be answered comprehensively. For this thesis, I chose referential communication as a guiding theme to develop representations and strategies that contribute to a universal situation model. In concrete terms, the following chapters cover aspects that enable an interactive robot to ground references to specific objects in a scene in multiple modalities. For that reason, it gains spatial knowledge as a persistent model in a way that allows it to ground these references. Mainly three different aspects of building a consistent model will be pursued: representing the spatial layout of the environment, applying semantics to geometric structures and areas of the environment, and deploying and aligning the model in human-robot interaction. This chapter aims to analyze the implications of the different available choices regarding these three aspects. It also contains conclusions from experiences made during the work with mobile robots, by myself and others. By dealing with these questions, a more fine-grained statement of the research problem will emerge. Ultimately, this analysis leads to specifically formulated goals and contributions of this thesis.


2.1. Functional Requirements

With the vision of a multi-purpose, generally-deployable robotic system in mind, it becomes quite clear that, to build a coherent system, it is unrealistic to just combine the many island solutions that currently exist for isolated problems. Today many researchers focus on their specific problems and invent representations that perfectly fit their requirements. This leads to a variety of very distinct solutions with no avenue of collaboration or exchange, instead resulting in huge overhead to maintain all the different representations. It would therefore be desirable to build a central, comprehensive representation that can be used by all components of the robotic system. The problem is that the requirements of the solutions for the various problems of such a general-purpose system are very divergent. It would require a very flexible and powerful representation with many support mechanisms to serve all the posed requirements. A thesis like this cannot claim to find the ultimate solution for this problem, but I do propose a representation designed to function as a basis for several software components in a robotic system. Considering the guiding theme of this thesis, components enabling referential communication will be considered for design decisions. Further, it may be extended to other problems not considered in this thesis. But what are the general functional requirements of such a representation apart from the task-specific ones? First, it must grant direct access to the data. This means a component must be able to easily receive the required data without having to transform or remap the information in order to fit it to the internally used format. To a certain degree, this also requires the components to adjust their formats to those supported by the central representation. Otherwise, the divergence in the representations on component level would just be transferred to the central representation, yielding no gain for the system.

Further, the representation must support the efficient analysis of the data. The representation’s data structures must be chosen so that the components can implement fast algorithms on them. However, if multiple data structures are maintained, they also need to be closely linked to quickly transfer data. The representation itself needs to be resource-efficient to guarantee low latency when components access the data. This means that transformation or search tasks within the descriptions must be implemented efficiently. This calls for a sparse representation of the spatial data. The implementation should support the representation of every kind of data in a level of detail appropriate for the task at hand. This reduces the memory load and thereby the latency.

2.2. The Choice of Scope

One of the most obvious differences in the many representations used across different components is the scope of the spatial description. In object manipulation tasks, the spatial representation has a totally different scope than in a path-planning task for navigation. In general, they can be divided into an allocentric scope, which defines a view on the scene from a global perspective, and the egocentric scope, which defines a view from the personal perspective. The latter supports a description of the scene relative to the point of view of the agent that generates it, usually representing the field of view of the perceptual system. The allocentric scope, in contrast, may support representations of the complete known environment in a global coordinate system, or just a subspace (e.g. the immediate surroundings of the robot).
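The two scopes are linked by the robot's pose: given its position and heading in the global frame, an egocentric observation can be converted into allocentric coordinates with a planar rigid-body transform. A minimal sketch:

```python
import math


def ego_to_allo(x_r, y_r, theta_r, x_e, y_e):
    """Convert a point observed egocentrically (x_e ahead of, y_e to the left of
    the robot) into the global frame, given the robot pose (x_r, y_r, theta_r)."""
    x_a = x_r + math.cos(theta_r) * x_e - math.sin(theta_r) * y_e
    y_a = y_r + math.sin(theta_r) * x_e + math.cos(theta_r) * y_e
    return x_a, y_a


# Robot at (2, 3), facing 90 degrees: a point 1 m ahead lies at about (2, 4) globally.
print(ego_to_allo(2.0, 3.0, math.pi / 2, 1.0, 0.0))  # ≈ (2.0, 4.0)
```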

From participating in several RoboCup@Home competitions, I can report that nearly all competing robotic systems use an allocentric representation for long-distance navigation and for storing positions of relevant objects and locations. Meanwhile, obstacle avoidance and manipulation tasks are nearly exclusively done exploiting the egocentric scope. This is not surprising, because localization tasks usually profit from relating entities (including the self) to landmarks, which is particularly convenient in allocentric representations. On the other hand, avoidance and manipulation tasks rely on the relation of the self to structures in the immediate environment. For this, egocentric representations are most suitable.

A general representation should cope with these different scopes. It must enable the components to choose the scope of their spatial descriptions, but must also maintain links and relations between allocentric and egocentric views. The argument for a sparse representation applies here as well. Not every part of the environment needs to be represented egocentrically, and less so from multiple points of view. Similarly, it may not be necessary to keep the egocentric representations updated all the time. Depending on the task at hand, they may only become relevant in certain situations.

2.3. Knowledge Representation

In order to identify a preliminary set of data structures for the general spatial representation scheme, I will have a look at the software components currently running on the BIRON II platform (see Section 1.4). One of the most fundamental components of a mobile robot’s system is the navigation. For localization and mapping of landmarks in the form of physical obstacles, it uses a probabilistic occupancy grid representation of obstacles in the environment (Moravec, 1988). It is an allocentric representation that depicts the spatial layout of the complete environment known to the robot. A semantic mapping approach for probabilistically labeling areas in the environment based on certain semantic properties uses a similar representation (Ziegler, 2010). The resolution of those representations is adjustable, but in practice, a rather coarse resolution is chosen (usually ∼ 5 cm cell size) because the associated tasks do not require more detail.
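An occupancy grid in this style stores, per cell, the belief that the cell is occupied, commonly updated in log-odds form so that repeated sensor evidence can simply be added. The following is a minimal sketch; the sensor-model increments are assumed values, not those of the cited systems.

```python
import numpy as np


class OccupancyGrid:
    """Minimal log-odds occupancy grid (Moravec-style); constants are assumptions."""
    L_OCC, L_FREE = 0.85, -0.4  # log-odds increments for hit / miss observations

    def __init__(self, width_m, height_m, cell_size=0.05):  # ~5 cm cells, as in the text
        self.cell_size = cell_size
        self.log_odds = np.zeros((int(height_m / cell_size), int(width_m / cell_size)))

    def update(self, x, y, occupied: bool):
        """Integrate one observation at world coordinates (x, y) in meters."""
        row, col = int(y / self.cell_size), int(x / self.cell_size)
        self.log_odds[row, col] += self.L_OCC if occupied else self.L_FREE

    def probability(self, x, y):
        """Recover the occupancy probability from the stored log-odds value."""
        row, col = int(y / self.cell_size), int(x / self.cell_size)
        return 1.0 - 1.0 / (1.0 + np.exp(self.log_odds[row, col]))


grid = OccupancyGrid(10.0, 10.0)
for _ in range(3):
    grid.update(1.0, 2.0, occupied=True)      # three consistent "hit" observations
print(round(grid.probability(1.0, 2.0), 2))  # ≈ 0.93
```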

For persistent storage of locations and objects, the system uses a plain database containing descriptions of the entities in global coordinates that relate to the occupancy grid representation.

The person tracking component contains an allocentric representation, as well. Person hypotheses are also maintained on an instance level with global coordinates. However, these hypotheses are fed with information from detectors building upon egocentric representations. Human torsos are detected using an egocentric point cloud representation from a depth camera. Legs are detected from a polar representation of distance measures from a laser scanner.

More egocentric representations are used in recognition and manipulation. Geometric analyses for finding candidates and obstacles for grasping also use a point cloud representation from a depth camera. However, this point cloud has a higher resolution than the one used to detect torsos and is limited to the maximum range of the robot’s arm, whereas the torso detector needs a significantly longer range. The visual object recognition component cooperates closely with the 3D geometrical analysis component and works on 2D imagery taken from the robot’s visual sensors.


Summarizing these insights, one can identify a set of data structures that would satisfy the demands of most components of the current BIRON II system.

Allocentric areal representation. This could be a probabilistic grid structure or, alternatively, a hierarchical quadtree representation (Hunter and Steiglitz, 1979). A three-dimensional voxel grid or octree (Meagher, 1982) representation would be imaginable, as well.

Allocentric instance representation. In the current BIRON system this is just a plain database of instance descriptions, but a network structure would be possible as well.

Egocentric areal representation. An obvious data structure for this is a point cloud or depth image.
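Taken together, a first-cut sketch of these three structures might look as follows. All class names and fields are illustrative assumptions for this discussion, not the actual interfaces of the BIRON II system.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class AllocentricArealRep:
    """Grid over the known environment; a quadtree or octree would also fit."""
    cell_size: float = 0.05
    cells: np.ndarray = field(default_factory=lambda: np.zeros((200, 200)))

@dataclass
class EntityInstance:
    """Allocentric instance description in global coordinates."""
    label: str
    x: float
    y: float

@dataclass
class EgocentricArealRep:
    """Egocentric structure, e.g. a point cloud from a depth camera."""
    points: np.ndarray    # (N, 3) points in the sensor frame
    robot_pose: tuple     # (x, y, theta) at capture time, linking the
                          # model to the allocentric frame
```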

This selection of data structures deliberately omits representations for the 2D imagery and polar range descriptions mentioned above. There are several reasons for this. First of all, a general representation needs to represent the lowest common denominator for all named requirements. But it certainly cannot universally manage all types of representations that are used internally across the components of a system. A compromise must be found. Secondly, the relevance of certain data structures in a central representation is proportionate to the persistence of their demands. Neither of the two data structures in question requires persistence in the way it is used in its respective software component; the data is processed and directly forgotten. Only the results of the analysis may be relevant for future reference, but these can be represented using the identified set of data structures. The same holds true for egocentric instance representations that may be relevant in the specific execution of, for example, a manipulation task, but to persistently represent these instances, the egocentric frame is probably unnecessary. Nevertheless, if new requirements occur that demand persistence for these structures, an extension mechanism that allows one to link in arbitrary structures to the default representation would be imaginable.

Persistence is a central topic for a general knowledge representation of a multi-purpose robotic system. It allows the system to use the representation as a spatial memory. The current BIRON system only has a limited spatial memory. As far as I can tell, this also applies to most robotic systems that participated in the RoboCup@Home competition in recent years. From my experience, it is sufficient for these robots to maintain only the allocentric representations for later reference. Since there is no task where the robot has to re-visit a previously analyzed scene a second time, there is no need to reference previously gathered egocentric knowledge on a later occasion. In situations when egocentric representations are required, like when grasping objects, the scene is analyzed bottom-up, and the data is discarded as soon as the robot finishes its task at this location. In real-world applications that go beyond those in the RoboCup competition, persistence is of greater importance. A robot needs to transfer knowledge from one location to a different situation in the future. This is especially important for information that cannot be re-generated in a bottom-up manner. Moreover, persistence also reduces the cognitive effort of the system by eliminating the need to repeatedly analyze the same scene from scratch.

For a general persistent knowledge representation, this means it needs to maintain several egocentric representations for later reference. They need to be linked in a way that allows the system to compare these models with each other and with the allocentric representation (cf. Section 2.1). This is particularly important for supporting the inference of referenced objects in communication. Further, methods for spatial and temporal integration, which are self-evident for allocentric representations, also need to be implemented for these egocentric structures.
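The required link essentially comes down to storing the robot pose at capture time with each egocentric model. A minimal sketch of the resulting 2D transform, assuming a pose (x, y, θ) in the allocentric map frame:

```python
import math

def ego_to_allo(p_ego, robot_pose):
    """Map a 2D point from the egocentric frame into the allocentric map
    frame, given the robot pose (x, y, theta) stored at capture time."""
    x, y, theta = robot_pose
    px, py = p_ego
    return (x + math.cos(theta) * px - math.sin(theta) * py,
            y + math.sin(theta) * px + math.cos(theta) * py)
```

With the capture pose stored alongside each egocentric model, two models can be compared by mapping both into the allocentric frame, which is exactly the kind of interconnection demanded here.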

2.4. Applying a Situation Model in Interaction

In the previous sections, I mainly discussed functional properties of a spatial representation for multi-purpose service robots. However, in interaction situations, methodical aspects of such a representation become particularly relevant. In HRI, the communication is not purely auditory, although in many robotic systems the communication is limited to the speech modality. Similarly, the interpretation of action should not be purely visual. A robotic system that aims to understand humans in a way that promotes their acceptance in society will have to cope with multiple modalities in interactions. Referencing objects in a dialog is a common example of this. To correctly ground the sentence “This is the object I mean”, the system needs to either interpret a gesture or it must know certain properties of the objects in the near vicinity of the interlocutor in order to narrow down a probable target object. Similarly, to interpret the sentence “I mean the chair in front of the cupboard”, one requires a rough concept of which objects might be meant by “chair” and “cupboard”, especially if the current scene contains multiple instances with these labels. The spatial relation meant by “in front of” also needs to be interpreted regarding different perspectives or reference frames. The context of an interaction might be important for the correct interpretation as well. For example, the sentence “Please bring me the book” might relate to the novel the interaction partner is currently reading, which is located on the bookshelf, if this conversation takes place in the living room. However, if this sentence is said in the kitchen while cooking dinner, it might relate to the cookbook lying open on the table.

For a general spatial representation, this means it must support inference about multiple aspects represented in the system. For example, a component for gesture recognition that works on a 3D egocentric representation may also reference the allocentrically represented surroundings of the robot in order to correctly interpret the gesture.

Especially when grounding utterances, a close collaboration of the different representations is crucial for success. Even if an utterance is perceived correctly on a linguistic level, its content might still be ambiguous. Including semantic information about the context (e.g., in which room does the interaction take place and what is its function?) might improve the interpretation process. In referential communication, “perspective-taking” is a key concept for enhancing the process of associating the described relations to instances. This involves both egocentric and allocentric representations. The same is true when different reference frames are used in the descriptions. In turn, the production of signals to the interlocutor profits from close collaboration of the different representations in the same way. However, the representation alone is not the key to successfully resolving ambiguities. It requires a sophisticated algorithm that can handle multiple hypotheses in a probabilistic way and include a variety of evidences in the process of finding the most likely interpretation of the ambiguous utterance. One such evidence may be linguistic world knowledge regarding the preferences of humans in speech production in various situations.
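As a rough illustration of such a resolution step, the sketch below ranks candidate referents by a naive product of three evidences: a soft label match, a spatial-relation score, and a context prior. The weighting scheme and score functions are invented for illustration; the approach actually pursued is discussed in Chapter 5.

```python
def resolve_reference(candidates, spoken_label, relation_score, context_prior):
    """Rank candidate referents by a naive product of evidences (a sketch).

    candidates: list of dicts with at least a 'label' key.
    relation_score(c): plausibility of the uttered spatial relation for c.
    context_prior(c): prior derived from semantic context (e.g. room label).
    """
    def score(c):
        label_match = 1.0 if c["label"] == spoken_label else 0.1  # soft mismatch
        return label_match * relation_score(c) * context_prior(c)

    return sorted(candidates, key=score, reverse=True)
```

For the book example above, context_prior would favor the cookbook whenever the current room is labeled as a kitchen.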

These arguments again promote a good interconnection of the different types of representations. It must be easy to transfer information from one representation into the other. More importantly, it becomes obvious that references should play a significant role in a persistent spatial representation. Both spatial relations for interpreting speech and action, as well as temporal references to the status of a past scene to detect change, are of great importance. Only this allows inference about the dynamic properties of certain structures and, therefore, communication about events that were not directly observed.

2.5. Summary: Contribution of this Thesis

These analyses allow a more precise formulation of the topics explored in this thesis. This, in turn, helps define the specific goals to pursue. The research problems identified in the previous sections closely relate to the research questions identified from the semantic analysis of the situation model in Section 1.3.

Research Question 1 addresses the representation of spatial knowledge. As discussed in Section 2.1, a complex robotic system for multiple purposes should contain a central representation that handles the spatial information for the individual software components. This storage should represent the data sparsely to minimize computational overhead. Also, it should manage different types of representations that are well connected and allow components to use the type of representation that suits their algorithms best. These types differ in the spatial scope and the data structures they use. Specifically, three types of representations were identified: an areal and an instance-based representation with an allocentric scope, and an egocentric representation for describing structures in the robot's field of view. Chapter 3 will explore the realization of such a representation.

The requirement that the representation should be sparse has implications for Research Question 2 (What and when to represent?). The existence of multiple types of representations within the central storage enables the developer to choose an appropriate level of detail for the different representations. These can be chosen according to the application they are used for and the resolution required by the algorithm using them. As discussed before, there should be a set of egocentric representations in addition to the allocentric ones. Consequently, there need to be strategies that decide when new egocentric models need to be introduced and when they need to be merged or deleted. These will be discussed in Chapter 3.
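Purely for illustration, one conceivable strategy is pose-based: introduce a new egocentric model once the robot has moved sufficiently far from all stored capture poses, and merge models whose capture poses nearly coincide. The thresholds and the data layout here are assumptions; the strategies actually used are developed in Chapter 3.

```python
import math

def pose_distance(a, b):
    """Euclidean distance between two (x, y, theta) poses, ignoring theta."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def maintain_models(models, current_pose, new_threshold=1.0, merge_threshold=0.2):
    """models: list of dicts, each with a 'pose' key holding (x, y, theta).
    Returns the pruned model list and whether a new model should be captured."""
    merged = []
    for m in models:  # drop models captured from (nearly) the same pose
        if all(pose_distance(m["pose"], k["pose"]) >= merge_threshold
               for k in merged):
            merged.append(m)
    should_capture = all(pose_distance(current_pose, m["pose"]) > new_threshold
                         for m in merged)
    return merged, should_capture
```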


As seen in the analysis of application in interaction, temporal integration and temporal references are crucial mechanisms for spatial representation in a robotic system. The same issue is addressed by Research Question 3. To detect change that is not directly observable, the maintenance of a history of certain structures or properties is important. In the detection process, references to situations in the past will be established, which need to be represented by the central spatial storage. This aspect will be discussed in Chapter 3. In Chapters 4 and 5, the aspect of spatial and temporal integration while updating the instance-based representation will be discussed.

Research Question 4 focuses on the integration aspect of multiple types of data in the interpretation process. The analysis of the application in interaction suggests that the different representations need to collaborate closely to enable the interpretation process to integrate context data. A multi-cue classification process is described in Chapter 4 that relies on this collaboration and the interconnection aspect of the representation. This system explores a boosting-based classification approach that uses a variety of features to classify different room types. It gathers a vast amount of visual cues and uses them to label different parts of the environment according to their function. Using the allocentric areal representation, these labels are published as context information for other interpretation processes. The system described in Chapter 5 does not focus so much on using the central spatial representation; rather, it explores an approach for resolving ambiguities using a variety of evidences from multiple modalities. It incorporates an allocentric probabilistic network approach for tracking multiple hypotheses for interpretation.
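As a compact stand-in for such a boosting-based multi-cue classifier, the sketch below trains AdaBoost over decision stumps using scikit-learn (the estimator parameter name assumes scikit-learn >= 1.2). The cue vectors and room labels are random placeholders; the actual feature set is described in Chapter 4.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Placeholder cue vectors: each row bundles assorted visual cues for one
# observation (e.g. color statistics, detected-object counts, geometry).
rng = np.random.default_rng(0)
X_train = rng.random((100, 16))
y_train = rng.choice(["kitchen", "living_room", "corridor"], size=100)

# Boosting over decision stumps: each weak learner keys on a single cue.
classifier = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),
    n_estimators=50,
)
classifier.fit(X_train, y_train)
room_label = classifier.predict(rng.random((1, 16)))  # label for a new scene
```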


Chapter 3

Partitioning the Workspace - Spatial and Temporal Integration of Informative Local Observations

For building an informative situation model, the first requirement is to have an idea of the general spatial structure surrounding the robot. This has already been discussed in the previous chapters. It does not necessarily mean that a complete, detailed three-dimensional representation of the environment needs to be tracked through 3D SLAM or similar approaches. Several approaches for reconstructing a robot's environment have been presented, which typically build up a comprehensive allocentric representation (cf. Wiemann, 2013). However, for specific atomic tasks like grasping an object, the representation of spatial structures is often strictly limited to the relevant parts. Typically, only the target object and possible obstacles in the close neighborhood are represented in an egocentric fashion (cf. Rusu et al., 2009c). Particularly in the field of domestic service robotics, a large set of assumptions about the setting can be applied, for example about the size of the work space, the number of entities inside this space, structural properties of the floor and walls, etc. However, these might not be true for other fields of robotic research like rescue or outdoor scenarios. Depending on the task at hand, it might suffice to know the rough layout (in domestic robotics, of the apartment) and a small set of more detailed areas which are relevant for typical tasks and interactions. The scope of such a representation would be located between those of the two extremes described above.
