
A system approach to interactive learning of visual concepts

Danijel Skočaj¹, Matej Kristan¹, Aleš Leonardis¹, Marko Mahnič¹, Alen Vrečko¹, Miroslav Janíček², Geert-Jan M. Kruijff², Pierre Lison², Michael Zillich³, Charles Gretton⁴, Marc Hanheide⁴, Moritz Göbelbecker⁵

¹ University of Ljubljana, Slovenia
² DFKI, Saarbrücken, Germany
³ Vienna University of Technology, Austria
⁴ University of Birmingham, UK
⁵ Albert-Ludwigs-Universität Freiburg, Germany

Abstract

In this work we present a system and underlying mechanisms for continuous learning of visual concepts in dialogue with a human.

1. Introduction

Cognitive systems are often characterised by their ability to learn, communicate and act autonomously.

In combining these competencies we envision a system that incrementally learns about the scene by being engaged in mixed-initiative dialogues with a human tutor. In this paper we outline how our robot George, depicted in Fig. 1, learns and refines visual conceptual models, either by attending to information deliberatively provided by a human tutor (tutor-driven learning, e.g., H: "This is a red box.") or by taking the initiative itself and asking the tutor for specific information (tutor-assisted learning, e.g., G: "Is the elongated object yellow?"). Our approach unifies these cases into an integrated system comprising incremental visual learning, selection of learning goals, continual planning to select actions for optimal learning behaviour, and a dialogue subsystem. George is one system in a family of memory-oriented integrated systems that aim to understand where their own knowledge is incomplete and that subsequently take actions to extend it. Our objective is to demonstrate a cognitive system that can efficiently acquire conceptual models in an interactive learning process that is not overly taxing with respect to tutor supervision and is performed in an intuitive, user-friendly way.

2. The system

The implementation of the robot is based on CAS, the CoSy Architecture Schema (Hawes and Wyatt, 2010). The schema is essentially a distributed working memory model composed of several subarchitectures (SAs) implementing different functionalities. George is composed of four such SAs, as depicted in Fig. 1 (there, components are depicted as rounded boxes and exchanged data structures as rectangles, with arrows indicating the conceptual information flow).
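To make the schema concrete, the following is a minimal sketch of a shared working memory with change notifications; the class and method names are hypothetical illustrations for exposition, not the actual CAST API.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class WorkingMemory:
    """Shared store through which subarchitectures (SAs) exchange data."""
    entries: dict = field(default_factory=dict)
    listeners: list = field(default_factory=list)

    def write(self, key: str, value: Any) -> None:
        self.entries[key] = value
        for callback in self.listeners:   # notify subscribed SAs of the change
            callback(key, value)

    def subscribe(self, callback: Callable[[str, Any], None]) -> None:
        self.listeners.append(callback)

# E.g. the Binder SA reacting to visual properties written by the Visual SA:
wm = WorkingMemory()
wm.subscribe(lambda k, v: print(f"binder saw update: {k} -> {v}"))
wm.write("visual.object1.colour", {"red": 0.83, "yellow": 0.11})
```

The point of such a design is decoupling: each SA only reads and writes shared data structures, so components can be developed and exchanged independently.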

The Visual SA processes the scene as a whole using stereo pairs of images and identifies spaces of interest (SOIs), which are further analysed; potential objects are segmented and then subjected to feature extraction. The extracted features are used for learning and recognition of objects and of qualitative visual attributes, such as colour and shape.
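Schematically, the pipeline can be pictured as follows; the stand-in implementations of attention, segmentation and feature extraction are placeholders for exposition only, not the actual Visual SA components.

```python
import numpy as np

def detect_sois(image: np.ndarray) -> list:
    """Bottom-up attention stand-in: treat bright regions as a space of interest."""
    mask = image.mean(axis=-1) > 128         # crude saliency threshold
    return [mask]                            # a single SOI mask, for brevity

def extract_features(image: np.ndarray, soi: np.ndarray) -> np.ndarray:
    """Feature extraction stand-in: mean colour of the segmented region."""
    pixels = image[soi]
    return pixels.mean(axis=0) if len(pixels) else np.zeros(3)

def process_scene(image: np.ndarray) -> list:
    # SOIs -> segmented proto-objects -> feature vectors for learning/recognition
    return [extract_features(image, soi) for soi in detect_sois(image)]

scene = np.random.randint(0, 256, (120, 160, 3))
print(process_scene(scene))                  # one feature vector per SOI
```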

Learning is based on an online method that enables updating from positive examples (learning) as well as from negative examples (unlearning) (Kristan et al., 2010). Our approach also does not make the closed-world assumption; at every step the system takes into account the probability that it has encountered a concept it has not observed before.
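As a deliberately simplified stand-in for the online kernel density estimation of Kristan et al. (2010), the following sketch keeps one running prototype per concept, down-weights it on negative examples, and reserves probability mass for a not-yet-observed concept; the bandwidth and unknown-density values are assumptions for illustration.

```python
import numpy as np

class OnlineConceptModel:
    """Simplified stand-in for online KDE: one running prototype per concept."""

    def __init__(self, bandwidth: float = 0.1, unknown_density: float = 0.05):
        self.models = {}                        # concept -> (weight, mean)
        self.bandwidth = bandwidth
        self.unknown_density = unknown_density  # open world: "a concept not seen yet"

    def update(self, concept: str, x: np.ndarray, positive: bool = True) -> None:
        w, mu = self.models.get(concept, (0.0, np.zeros_like(x, dtype=float)))
        if positive:
            w += 1.0
            mu = mu + (x - mu) / w              # incremental mean (learning)
        else:
            w = max(w - 1.0, 0.0)               # down-weight (unlearning)
        self.models[concept] = (w, mu)

    def posteriors(self, x: np.ndarray) -> dict:
        # Score every known concept and the explicit "unknown concept" hypothesis.
        scores = {"<unknown>": self.unknown_density}
        for c, (w, mu) in self.models.items():
            if w > 0:
                scores[c] = float(np.exp(-0.5 * np.sum((x - mu) ** 2) / self.bandwidth**2))
        z = sum(scores.values())
        return {c: s / z for c, s in scores.items()}

model = OnlineConceptModel()
model.update("red", np.array([0.9, 0.1, 0.1]))        # H: "This is red."
model.update("yellow", np.array([0.9, 0.9, 0.1]))     # H: "This is yellow."
print(model.posteriors(np.array([0.85, 0.15, 0.1])))  # red should dominate
```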

The recognised visual properties are then forwarded to the Binder SA, which serves as a central hub for gathering information from different modalities about entities currently perceived in the environment. Since this information was extracted by the robot itself, we call the resulting information structure a private belief.
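The distinction between private, attributed and shared beliefs (cf. Fig. 1) can be pictured with the following toy structure; the fields and the naive fusion rule are illustrative assumptions, not the Binder's actual model.

```python
from dataclasses import dataclass
from enum import Enum

class Epistemic(Enum):
    PRIVATE = "private"        # extracted by the robot itself
    ATTRIBUTED = "attributed"  # asserted by the human tutor
    SHARED = "shared"          # agreed common ground after fusion

@dataclass
class Belief:
    entity_id: str
    status: Epistemic
    content: dict              # e.g. {"colour": {"red": 0.83}}

def fuse(private: Belief, attributed: Belief) -> Belief:
    """Toy multi-modal fusion: merge tutor-asserted properties into the robot's
    own estimate about the same entity, yielding a shared belief."""
    merged = {**private.content, **attributed.content}
    return Belief(private.entity_id, Epistemic.SHARED, merged)

robot = Belief("obj1", Epistemic.PRIVATE, {"shape": {"box": 0.9}})
tutor = Belief("obj1", Epistemic.ATTRIBUTED, {"colour": {"red": 1.0}})
print(fuse(robot, tutor))
```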

Beliefs can also be created by the Dialogue SA.

It analyses an incoming audio signal, parses the created word lattice and chooses the contextually most appropriate meaning representation for the utterance (Lison and Kruijff, 2009). It then establishes which meaningful parts might refer to objects in the visual context. The actual reference resolution takes place during dialogue interpretation, taking into account the information stored in the robot's beliefs. In this process, we use weighted abductive inference to establish the intention behind the utterance. As a result, an attributed belief containing the information asserted by the human is constructed from the meaning of the utterance.
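A schematic sketch of how reference resolution might pick a referent by matching the restrictive parts of an utterance meaning against belief contents; the representations and scoring are invented placeholders, not the actual weighted abduction machinery.

```python
def resolve_referent(meaning: dict, beliefs: dict):
    """Toy reference resolution: pick the perceived entity whose belief content
    best matches the restrictive parts of the utterance meaning."""
    def match(content: dict) -> float:
        return sum(content.get(attr, {}).get(val, 0.0)
                   for attr, val in meaning["restrictors"].items())
    scored = {eid: match(c) for eid, c in beliefs.items()}
    return max(scored, key=scored.get) if scored else None

# "Is the elongated object yellow?" restricts on shape and queries colour.
beliefs = {"obj1": {"shape": {"elongated": 0.9}}, "obj2": {"shape": {"box": 0.8}}}
meaning = {"restrictors": {"shape": "elongated"}, "queried": "colour"}
print(resolve_referent(meaning, beliefs))   # -> "obj1"
```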

Figure 1: Left: Scenario setup and observed scene. Right: Schematic system architecture (the Visual, Binder, Dialogue and Planning SAs with their components and exchanged data structures).

The beliefs, being high-level symbolic representations, provide a unified model of the environment and of the attributed information, which can be efficiently used for planning. In the Planning SA, the motivation layer (Hanheide et al., 2010) monitors beliefs to generate goals. In the tutor-driven case, attributed beliefs are taken as learning opportunities, eventually leading to epistemic goals that require the new information to be used to update the visual models. To implement robot-initiated, tutor-assisted learning, the Planning SA also continuously accounts for the goal of maximising the system's knowledge, in the sense of reducing the uncertainty in the visual models. In order to take action only when there is a significant learning opportunity in terms of reward, goal management employs a threshold to decide whether a plan should be executed. Goal management continuously manages goals according to their priority, interrupting execution if a higher-priority goal shows up. We assign human-initiated goals a higher priority, enabling the system to respond immediately to human input.
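As an illustration of thresholded, priority-ordered goal management, consider the following sketch; the threshold value, goal representation and class names are assumptions for exposition, not the actual motivation-layer interface.

```python
import heapq

GAIN_THRESHOLD = 0.2   # assumed: minimum expected reward for acting at all

class GoalManager:
    def __init__(self):
        self.queue = []

    def add(self, priority: int, info_gain: float, goal: str) -> None:
        if info_gain >= GAIN_THRESHOLD:        # ignore insignificant opportunities
            heapq.heappush(self.queue, (-priority, -info_gain, goal))

    def next_goal(self):
        # Highest priority first; among equal priorities, highest expected gain.
        return self.queue[0][2] if self.queue else None

gm = GoalManager()
gm.add(priority=1, info_gain=0.3, goal="ask: Is the elongated object yellow?")
gm.add(priority=2, info_gain=0.9, goal="learn: This is a red box. (human-initiated)")
print(gm.next_goal())   # the human-initiated goal pre-empts the clarification
```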

Plan execution proceeds according to the continual planning paradigm (Brenner and Nebel, 2009), monitoring the system's beliefs to trigger replanning if required. In tutor-driven learning, the actions scheduled for execution typically include sending a learning instruction to the Visual SA, which triggers an update of the visual representations. In tutor-assisted learning, execution usually involves sending a clarification request to the Dialogue SA, which is subsequently synthesised, typically as a polar or open question about a certain object property; the tutor's answer is then used to update the models.
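A minimal sketch of such an execution loop, assuming invented `replan` and `consistent` hooks (the actual planner interface is not described here):

```python
def execute_plan(plan, beliefs, replan, consistent):
    """Continual-planning stand-in (after Brenner and Nebel, 2009): execute
    actions one by one, replanning whenever the monitored beliefs invalidate
    the remaining plan."""
    while plan:
        if not consistent(plan, beliefs):      # monitoring step
            plan = replan(beliefs)             # e.g. the scene changed mid-plan
            continue
        action = plan.pop(0)
        action(beliefs)                        # e.g. send a learning instruction

beliefs = {"obj1.colour": "unknown"}
learn = lambda b: b.update({"obj1.colour": "red (model updated)"})
execute_plan([learn], beliefs, replan=lambda b: [], consistent=lambda p, b: True)
print(beliefs)
```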

3. Conclusion

In this work we briefly presented the integrated system and underlying mechanisms for continuous learning of visual concepts in dialogue with a human tutor. Building on this system, our final goal is to produce an autonomous robot that will be able to efficiently learn and act by capturing and processing cross-modal information in interaction with the environment and other cognitive agents.

Acknowledgment

The work was supported by the EC FP7 IST project CogX-215181.

References

Brenner, M. and Nebel, B. (2009). Continual planning and acting in dynamic multiagent environments. JAAMAS, 19(3):297–331.

Hanheide, M., et al. (2010). A framework for goal generation and management. In Proceedings of the AAAI Conference on Artificial Intelligence.

Hawes, N. and Wyatt, J. (2010). Engineering intelligent information-processing systems with CAST. Advanced Engineering Informatics, 24(1):27–39.

Kristan, M., et al. (2010). Online kernel density estimation for interactive learning. Image and Vision Computing, 28(7):1106–1116.

Lison, P. and Kruijff, G. (2009). Efficient parsing of spoken inputs for human-robot interaction. In Proceedings of RO-MAN 2009, Toyama, Japan.
