Multimodal Learning of Actions with Deep Neural Network Self-Organization

Dissertation

submitted to the Universität Hamburg, Faculty of Mathematics, Informatics and Natural Sciences, Department of Informatics, in partial fulfilment of the requirements for the degree of Doctor rerum naturalium (Dr. rer. nat.)

German I. Parisi

Hamburg, 2016

Submitted: November 10, 2016

Day of oral defence: March 10, 2017

Dissertation committee:

Prof. Dr. Stefan Wermter (advisor)

Department of Informatics, Universität Hamburg, Germany

Dr. Victor Uc-Cetina (reviewer)

Department of Informatics, Universität Hamburg, Germany

Prof. Dr. Jianwei Zhang (chair)

All illustrations, except where explicitly noted, are the work of German I. Parisi and are licensed under the Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license. To view a copy of this license, visit: https://creativecommons.org/licenses/by-sa/4.0/

“Experience is the teacher of all things.” — JULIUS CAESAR

Abstract

Perceiving the actions of other people is one of the most important social skills of human beings. We are able to reliably discern a variety of socially relevant information from people’s body motion such as intentions, identity, gender, and affective states. This ability is supported by highly developed visual skills and the integration of additional modalities that in concert contribute to providing a robust perceptual experience. Multimodal integration is a fundamental feature of the brain that together with widely studied biological mechanisms for action perception has served as inspiration for the development of artificial systems. However, computational mechanisms for processing and integrating knowledge reliably from multiple perceptual modalities are still to be fully investigated.

The goal of this thesis is to study and develop artificial learning architectures for action perception. In light of a wide understanding of the brain areas and underlying neural mechanisms for processing biological motion patterns, we propose a series of neural network models for learning multimodal action representations. Consistent with neurophysiological studies evidencing a hierarchy of cortical layers driven by the distribution of the input, we demonstrate how computational models of input-driven self-organization can account for the learning of action features with increasing complexity of representation. For this purpose, we introduce a novel model of recurrent self-organization for learning action features with increasingly large spatiotemporal receptive fields. Visual representations obtained through unsupervised learning are incrementally associated with symbolic action labels for the purpose of action classification.

From a multimodal perspective, we propose a model in which multimodal action representations can develop from neural network organization in terms of associative connectivity patterns between unimodal representations. We report a set of experiments showing that deep self-organizing hierarchies allow the learning of statistically significant features of actions, with multimodal representations emerging from co-occurring audiovisual stimuli. We evaluated our neural network architectures on the tasks of human action recognition, body motion assessment, and the detection of abnormal behavior. Finally, we conducted two robot experiments that provide quantitative evidence for the advantages of multimodal integration for triggering sensory-driven motor behavior. The first scenario consists of an assistive task for the detection of falls, whereas in the second experiment we propose audiovisual integration in an interactive reinforcement learning scenario. Together, our results demonstrate that deep neural self-organization can account for robust action perception, yielding state-of-the-art performance even in the presence of sensory uncertainty and conflict.

The research presented in this thesis comprises interdisciplinary aspects of action perception and multimodal integration for the development of efficient neurocognitive architectures. While the brain mechanisms for multimodal perception are still to be fully understood, the proposed neural network architectures may be seen as a basis for modeling higher-level cognitive functions.

Zusammenfassung

Die Wahrnehmung von Aktionen anderer Personen ist eine der wichtigsten sozialen Kompetenzen von Menschen. Wir sind in der Lage, eine Vielzahl von relevanten sozialen Informationen aus den Körperbewegungen von Personen zu extrahieren; dazu gehören Absichten, Identität, Geschlecht und Gefühlszustände. Diese Fähigkeit stützt sich auf ein hochentwickeltes visuelles System und die Integration von zusätzlichen Modalitäten, die gemeinsam dazu beitragen, eine robuste Wahrnehmungserfahrung zu schaffen. Multimodale Wahrnehmung ist eine fundamentale Eigenschaft des Gehirns, welche zusammen mit den biologisch gut erforschten Mechanismen zur Aktionswahrnehmung als Inspiration für die Entwicklung künstlicher Systeme gedient hat. Dennoch ist die Forschungsfrage, wie man Wissen maschinell aus einer Vielzahl von Modalitäten verlässlich verarbeiten und verbinden kann, noch offen.

Die vorliegende Arbeit beschäftigt sich mit der Erforschung und Entwicklung von künstlichen Lernarchitekturen zur Aktionswahrnehmung. Vor dem Hintergrund des weitverbreiteten Verständnisses der Gehirnregionen und zugrundeliegenden neuronalen Mechanismen zur Verarbeitung von Bewegung in biologischen Systemen, präsentieren wir eine Reihe von neuronalen Netzwerkmodellen zum Erlernen von Repräsentationen von multimodalen Aktionen. In Einklang mit neurophysiologischen Studien, die eine stimulusgetriebene Hierarchie von kortikalen Ebenen belegen, zeigen wir, wie Computermodelle stimulusgetriebener Selbstorganisation dem Erlernen von Aktionsmerkmalen Rechnung tragen können. Zu diesem Zweck stellen wir ein neues Modell rekurrenter Selbstorganisation zum Erlernen von Aktionsmerkmalen vor, welches wachsende raum-zeitliche rezeptive Felder nutzt. Visuelle Repräsentationen, welche mit Hilfe von unüberwachtem Lernen gewonnen werden, werden zum Zweck der Aktionsklassifikation inkrementell mit symbolischen Aktionslabeln assoziiert.

Von einer multimodalen Perspektive stellen wir ein Modell vor, in dem sich Aktionsrepräsentation aus neuronaler Netzwerkorganisation ergibt, im Sinne von Mustern in der Konnektivität von Assoziationen unimodaler Repräsentationen. Wir führen eine Reihe von Experimenten durch, die zeigen, wie tiefe, selbstorganisierende Hierarchien das Erlernen von statistisch signifikanten Aktionsmerkmalen erlauben, wobei multimodale Repräsentation aus gemeinsam auftretenden audiovisuellen Stimuli hervorgeht. Wir evaluieren unsere neuronalen Netzwerkarchitekturen mit Aufgaben zur Erkennung menschlicher Bewegung, zur Körperbewegungsbeurteilung und zur Erkennung von abnormalem Verhalten. Abschliessend führen wir zwei Experimente mit Robotern durch, welche quantitativ die Vorteile von multimodaler Integration zum Auslösen von sensorgetriebenem motorischen Verhalten belegen. Das erste Szenario besteht aus einer assistiven Aufgabe zur Sturzerkennung, während im zweiten Experiment ein Vorschlag zur audiovisuellen Integration in einem interaktiven Szenario erbracht wird. Zusammen zeigen unsere Ergebnisse, dass tiefe neuronale Selbstorganisation eine robuste Aktionswahrnehmung ermöglicht und dem Stand der Technik entsprechende Ergebnisse liefern kann, selbst bei unsicheren oder widersprüchlichen Sensormessungen.

Die Forschung in dieser Arbeit beinhaltet interdisziplinäre Aspekte der Aktionswahrnehmung und der multimodalen Integration mit dem Ziel der Entwicklung von effizienten neurokognitiven Architekturen. Während die Mechanismen, welche das Gehirn zur multimodalen Wahrnehmung nutzt, noch näher erforscht werden müssen, können die vorgestellten neuronalen Netzwerkarchitekturen als Basis zur Modellierung von höheren kognitiven Funktionen gesehen werden.

Contents

1 Introduction 1

2 Multimodal Action Recognition 5

2.1 Action Recognition in the Brain . . . 5

2.1.1 How We Learn to See Others . . . 5

2.1.2 Neural Mechanisms for Action Perception . . . 7

2.2 Computational Approaches . . . 10

2.2.1 Trends in Action Recognition . . . 10

2.2.2 Learning to Recognize Actions . . . 13

2.2.3 Body Motion Assessment . . . 15

2.2.4 Abnormal Event Detection . . . 17

2.2.5 Assistive Robotics . . . 19

2.3 Summary . . . 20

3 Computational Models of Self-Organization 22

3.1 Experience-driven Self-Organization . . . 22

3.1.1 Introduction . . . 22

3.1.2 Artificial Self-Organizing Networks . . . 24

3.2 Feedforward Self-Organizing Networks . . . 26

3.2.1 Self-Organizing Feature Maps . . . 27

3.2.2 Growing Self-Organizing Networks . . . 28

3.3 Recurrent Self-Organizing Networks . . . 31

3.4 Summary . . . 33

4 Self-Organizing Neural Integration of Visual Action Cues 34

4.1 Introduction . . . 34

4.2 KT Full-Body Action Dataset . . . 35

4.3 Two-Stream Hierarchical Processing . . . 37

4.3.1 Learning Architecture . . . 38

4.3.2 Action Classification . . . 40

4.3.3 Results and Evaluation . . . 40

4.4 Hierarchical GWR Model . . . 42

4.4.1 GWR-based Learning Architecture . . . 42

4.4.2 Results and Evaluation . . . 45

4.5.1 Proposed Architecture . . . 50

4.5.2 Results and Evaluation . . . 52

4.6 Summary . . . 54

5 Self-Organizing Emergence of Multimodal Action Representations 55

5.1 Introduction . . . 55

5.2 Associative Action–Word Mappings . . . 57

5.2.1 A Self-Organizing Spatiotemporal Hierarchy . . . 57

5.3 GWR-based Associative Learning . . . 59

5.3.1 Semi-Supervised Label Propagation . . . 59

5.3.2 Sequence-Selective Synaptic Links . . . 62

5.4 Bidirectional Retrieval of Audiovisual Inputs . . . 63

5.4.1 Action–to–Word Patterns . . . 63

5.4.2 Word–to–Action Patterns . . . 64

5.5 Experiments and Evaluation . . . 65

5.5.1 Audiovisual Inputs . . . 65

5.5.2 Results and Evaluation . . . 66

5.6 Summary . . . 70

6 Action Learning and Assessment with Recurrent Self-Organization 72

6.1 Introduction . . . 72

6.2 Human Motion Assessment . . . 73

6.2.1 Proposed Architecture . . . 74

6.2.2 Merge-GWR . . . 74

6.2.3 Feedback from Prediction . . . 77

6.2.4 Experimental Results . . . 78

6.3 Deep Self-Organizing Learning . . . 81

6.3.1 Introduction . . . 81

6.3.2 Proposed Architecture . . . 82

6.3.3 Experiments and Evaluation . . . 86

6.4 Summary . . . 93

7 A Neurocognitive Robot for Multimodal Action Recognition 94

7.1 Introduction . . . 94

7.2 A Multimodal Approach for Abnormal Event Detection . . . 95

7.2.1 Active Tracking . . . 96

7.2.2 Sound Source Localization . . . 99

7.2.3 Automatic Speech Recognition . . . 99

7.2.4 Multimodal Controller . . . 100

7.2.5 Fall Detection . . . 102

7.3 Integration of Dynamic Audiovisual Patterns . . . 108

7.3.1 Introduction . . . 108

7.3.2 Robot Scenario . . . 109

7.3.3 Automatic Speech Recognition . . . 110

7.3.5 Audiovisual Integration . . . 114

7.3.6 Experimental Results . . . 116

7.4 Summary . . . 118

8 Conclusion 121

8.1 Thesis Summary . . . 121

8.2 Discussion . . . 122

8.3 Future Work . . . 126

8.4 Conclusion . . . 129

A List of Abbreviations 131

B Supplementary Algorithms 133

C Action Sequences 135

D Additional Results 136

E Publications Originating from this Thesis 138

F Acknowledgements 141

List of Figures

2.1 Schematic illustration of the brain for visual processing . . . 8

2.2 Hierarchical two-pathway neural model for the processing of form and motion . . . 9

2.3 Person monitoring in a home-like environment . . . 12

3.1 Different types of self-organizing networks . . . 26

3.2 Comparison of GNG and GWR . . . 30

4.1 Snapshots of actions from the KT dataset . . . 36

4.2 Action representations . . . 36

4.3 Three-stage GNG hierarchical pose-motion processing. . . 38

4.4 Evaluation on recognition accuracy (GNG) . . . 41

4.5 GWR hierarchical architecture . . . 42

4.6 A GWR network trained with normally distributed data . . . 44

4.7 Noise detection from a GWR network . . . 44

4.8 Confusion matrix for GNG-based architecture . . . 46

4.9 Confusion matrix for GWR-based architecture . . . 46

4.10 Daily actions from the CAD-60 dataset . . . 47

4.11 Architecture for transitive action recognition . . . 51

4.12 Skeletons of transitive actions . . . 53

4.13 Evaluation of transitive action recognition . . . 53

5.1 Multimodal hierarchical learning architecture . . . 58

5.2 Hierarchical learning of neural activations . . . 60

5.3 Classification accuracy of OSS-GWR . . . 62

5.4 Λλ function for different firing counters . . . . 64

5.5 Representation of full-body actions from the KT dataset . . . 66

5.6 Confusion matrix for the OSS-GWR approach . . . 68

5.7 Confusion matrix for the S-GWR approach . . . 68

5.8 OSS-GWR: Average classification accuracy . . . 69

5.9 Visual representations generated from speech . . . 69

6.1 Visual feedback for squat sequence . . . 74

6.2 Learning architecture for motion assessment . . . 75

6.3 Temporal quantization error over 30 timesteps . . . 77

6.4 Movement prediction for action assessment . . . 78

6.6 Segmented body motion representation . . . 87

6.7 Confusion matrix for the AG-GWR approach . . . 89

6.8 AG-GWR: Classification accuracy on the KT action dataset . . . . 89

6.9 Sample frames of body shapes from the Weizmann dataset . . . 91

6.10 AG-GWR: Classification accuracy on the Weizmann dataset . . . . 92

7.1 Overall architecture of our multimodal system . . . 96

7.2 Nao with Xtion sensor . . . 97

7.3 Active tracking with Nao . . . 97

7.4 Communication network diagram . . . 98

7.5 SSL with cross-correlation using different microphones . . . 100

7.6 Multimodal robot perception . . . 101

7.7 Fall detection scenario . . . 102

7.8 Abnormal event detection from video sequences . . . 103

7.9 Flow chart of our SOM-based learning stage . . . 104

7.10 Effects of outliers in the clustering of training data . . . 104

7.11 Multimodal architecture for our IRL scenario . . . 109

7.12 Cleaning scenario with the NICO robot . . . 110

7.13 Hand segmentation and pose estimation . . . 112

7.14 FINGeR pipeline for hierarchical processing . . . 112

7.15 Gestures used as advice in the robotic scenario . . . 113

7.16 Confidence functions . . . 115

7.17 Integrated rewards with different thresholds . . . 117

7.18 Collected rewards with advice from audiovisual input . . . 117

C.1 Example sequences from the KT action dataset . . . 135

List of Tables

4.1 Our approach compared to the state of the art for CAD-60 . . . 48

4.2 Two-stream hierarchical learning: Training results on the two datasets 49

5.1 Training parameters for the S-GWR and the OSS-GWR . . . 61

6.1 Single-subject evaluation. . . 80

6.2 Multi-subject evaluation. . . 80

6.3 Training parameters for the Gamma-GWR architecture . . . 88

6.4 Results on the Weizmann dataset for 10-second action snippets . . . 92

6.5 Results on the Weizmann dataset for full action sequences . . . 92

7.1 ASUS Xtion Live sensor specifications . . . 96

7.2 Performance of our abnormality detection algorithm . . . 107

7.3 Training parameters for GWR hierarchical learning . . . 114

Chapter 1

Introduction

The daily perceptual experience of human beings is driven by an array of sensors that in concert contribute to the efficient and robust interaction with the environment (Stein and Meredith, 1993; Ernst and Bülthoff, 2004; Stein et al., 2009). We are able to reliably discern a variety of relevant social cues from people’s body motion such as intentions, identity, gender, and affective states (Blake and Shiffrar, 2007; Giese and Rizzolatti, 2015), which is supported by the development of a highly skilled visual perception and the integration of additional modalities. The ability to integrate multisensory information is a fundamental and widely studied feature of the brain, yielding the effective processing of body motion patterns even from strongly degraded stimuli (Neri et al., 1998; Thornton et al., 1998; Poom and Olsson, 2002). Therefore, the findings on the underlying biological mechanisms for action perception have played an inspiring role in the development of artificial systems aimed to address the robust recognition of actions, for instance, by integrating auditory and visual patterns. Computational models for multimodal integration are a paramount ingredient for autonomous robots to form robust and meaningful representations of perceived events (Ursino et al., 2014).

Multimodal representations have been shown to improve performance in the research areas of human action recognition, human-robot interaction, and sensory-driven robot motor behavior (Kachouie et al., 2014; Noda et al., 2014; Bauer et al., 2015). However, multisensory inputs must be represented and integrated in an appropriate way so that they result in a reliable perceptual experience aimed to trigger adequate behavioral responses. Since real-world events unfold at multiple spatial and temporal scales, artificial learning architectures aiming at tackling complex perceptual tasks should account for the multimodal processing of spatiotemporal stimuli with multiple levels of complexity and abstraction (Fonlupt, 2003; Hasson et al., 2008; Lerner et al., 2011). This kind of hierarchical aggregation is an essential organizational principle of brain cortical networks that together with the interplay of multiple modalities drives a series of perceptual and cognitive processes (Taylor et al., 2015). Consequently, the question of how to acquire, process, and integrate multimodal knowledge in artificial neurocognitive systems represents a fundamental issue still to be fully investigated.

Research Objective

The main goal of this thesis is the study and development of artificial learning architectures for action perception motivated by a set of neurophysiological findings and behavioral studies. We take inspiration from the underlying neural mechanisms of the brain areas dedicated to processing biological motion from a set of available perceptual cues. These mechanisms include the hierarchical nature of cortical areas for processing spatiotemporal patterns with an increasing complexity and abstraction of representation (Hasson et al., 2008; Taylor et al., 2015) and the development of cortical connectivity patterns through neural network self-organization (Willshaw and von der Malsburg, 1976; Nelson, 2000). In the light of a more substantial understanding of the development and properties of cortical maps in the mammalian brain, well-studied computational mechanisms of input-driven self-organization can be extended to model learning architectures that account for complex multimodal tasks, e.g., from rudimentary action perception to higher-level cognitive functions.

The key objective of this thesis is the development of multimodal action representations from neural network self-organization. More specifically, we ask how statistically significant action cues from co-occurring auditory and visual inputs can be combined in an unsupervised manner by learning connectivity patterns between unimodal representations. Although the development of associations between co-occurring stimuli for multimodal binding has been supported extensively by neurophysiological studies (Fiebelkorn et al., 2009), with strong links between the brain areas governing visual and language processing (Foxe et al., 2000; Pulvermüller, 2005), computational models for the efficient multimodal binding of spatiotemporal features have remained an open issue (Ursino et al., 2014).

As a complementary goal, we aim to validate the proposed neural network models for multimodal action perception in robot experiments with real-world tasks. In contrast to the evaluation of computational models with data collected in highly controlled conditions, these experiments are aimed at assessing how the proposed neural architectures deal with rich streams of information also in the case of sensory uncertainty and conflict. In particular, we wish to provide quantitative evidence on the advantages conveyed by the use of multiple modalities for human-robot interaction tasks comprising sensory-driven motor behavior.

Contribution to Knowledge

The contribution to knowledge of this thesis is a detailed study of neural network self-organization and the development of deep self-organizing architectures for learning multimodal action representations. These architectures are in line with a set of biological findings evidencing a hierarchy of neural detectors for processing spatiotemporal body motion cues with increasing complexity of representation. We demonstrate how self-organizing architectures can be extended to account for a set of visual tasks such as human action recognition, body motion assessment, and the detection of abnormal behavior. In particular, we propose a deep self-organizing architecture for learning visual action representations in an unsupervised manner. This architecture comprises multiple layers of recurrent neural networks to implement the hierarchical processing of visual cues with increasingly larger spatiotemporal receptive fields from depth map videos. Furthermore, we propose an approach for learning multimodal action representations from neural self-organization in terms of asymmetric connectivity patterns between unimodal representations, allowing the bidirectional retrieval of audiovisual patterns. Our experimental results with computer simulations and interactive robots show the importance of multimodal processing for improving human-robot interaction and sensory-driven motor behavior, especially in the case of sensory uncertainty and conflict in real-world tasks.

Thesis Organization

For a better understanding of the challenges considered in this thesis, we provide an introduction to multimodal action recognition in Chapter 2, where we review well-established findings regarding action perception in the brain along with a background on computational architectures for state-of-the-art human action recognition, body motion assessment, and abnormal behavior detection in assistive robot scenarios. In Chapter 3, we present the pillars of experience-driven cortical organization and computational models of neural network self-organization. As a modelling foundation to address our research question, we focus on a number of topology-preserving networks for the development of topological maps driven by the distribution of the input.

In Chapter 4, we propose a set of neurobiologically-motivated neural network architectures for action recognition from depth map videos in real time. Our approach consists of hierarchically-arranged self-organizing networks processing action cues in terms of body posture and motion features. Furthermore, we introduce our dataset of full-body actions that we use to evaluate the architectures proposed in this and following chapters. In Chapter 5, we investigate the use of hierarchical self-organizing learning for the development of congruent multimodal action representations. In particular, we propose a model where multimodal representations emerge from the co-occurrence of auditory and visual stimuli via the learning of associative connections between unimodal representations, yielding the bidirectional retrieval of audiovisual patterns.

In Chapter 6, we propose a novel temporal extension of a self-organizing network equipped with recurrent connectivity for dealing with time-varying patterns. We use this recurrent network in a hierarchical architecture for the unsupervised learning of action representations with increasingly larger spatiotemporal receptive fields. In order to compare our proposed architecture with respect to current trends in deep learning, we show how our model accounts for the learning of robust action-label mappings also in the case of occasionally absent or even contradictory action class labels during training sessions. Additionally, we show how the same recurrent neural network mechanism can deal with both action recognition and body motion assessment in real time.

In Chapter 7, we apply aspects of multimodal integration for enhancing human-robot interaction and triggering robust sensory-driven human-robot behavior in dynamic environments. We conduct experiments in two scenarios: a robot-human assistance task for fall detection and a multimodal interactive reinforcement learning task with a robot cleaning a table and receiving instructions from both vocal and gesture commands. Experiments show that the integration of multiple modalities leads to a significant improvement of performance with respect to unimodal approaches.

In the concluding Chapter 8, the proposed neural network architectures and reported results are discussed from the perspective of our research questions, analyzing analogies and limitations with respect to biological findings and providing a number of future research directions.

Chapter 2

Multimodal Action Recognition

The robust recognition of others’ actions represents a crucial component underlying social cognition. Humans can reliably discriminate a variety of socially relevant cues from body motion such as intentions, identity, gender, and affective states (Blake and Shiffrar, 2007; Giese, 2015). Neurophysiological studies have identified a specialized area for the visual coding of complex motion in the mammalian brain (Perrett et al., 1982), comprising neurons selective to biological motion in terms of time-varying patterns of form and motion features in a wide number of brain structures (Giese and Rizzolatti, 2015). Furthermore, the ability of the brain to integrate multisensory information plays a crucial role in providing a robust perceptual experience for an efficient interaction with the environment (Stein and Meredith, 1993; Ernst and Bülthoff, 2004; Stein et al., 2009). Consequently, the investigation of the biological mechanisms of action perception is fundamental to the development of artificial systems that should account for the robust processing of body motion cues from cluttered environments and rich streams of information.

In Section 2.1, we provide an introduction to multimodal action perception in humans and the underlying neural mechanisms in the brain, whereas in Section 2.2 we describe a variety of computational models aimed to tackle complex visual tasks such as human action recognition, body motion assessment, and the detection of abnormal behavior, along with a set of technical challenges involved in embedding these systems into robotic platforms.

2.1 Action Recognition in the Brain

2.1.1 How We Learn to See Others

The skill to recognize biological motion in humans arises in early life. The ability of neonates to imitate manual gestures suggests that the recognition of complex motion may depend on innate neural mechanisms (Meltzoff et al., 1977). Studies on preferential looking with four-month-old infants evidence a preference for staring at human motion sequences for a longer duration than at sequences with random motion (Bertenthal and Pinto, 1993). Additional behavioral studies have shown that young children aged three to five years steadily enhance their skills to identify human and non-human biological motion portrayed as animations of point-light tokens and reach adult performance by age five (Pavlova et al., 2001).

The preservation of the ability to reliably discriminate different forms of body motion from normal and impoverished stimuli has been reported for observers older than sixty years (Norman et al., 2004), in contrast to reported age-related deficits in the visual system such as deterioration of speed discrimination and detection of low-contrast moving contours. Experiments on action discrimination tasks have evidenced a remarkable efficiency of adult observers in temporally integrating body motion from highly impoverished visual stimuli, e.g., partially occluded bodies, body motion embedded within noise, or animated figures represented by a small number of moving dots (Johansson, 1973; Neri et al., 1998; Thornton et al., 1998; Poom and Olsson, 2002). On the other hand, significantly decreased performance of action perception has been reported for temporal disruptions of the stimuli (temporally scrambled frames of videos) and strong spatial rotation (upside-down clips) of both biological and artificial motion morphs (Bertenthal and Pinto, 1993; Jastorff et al., 2006). Interestingly, Jastorff et al. (2006) have shown that after a number of trials, observers improve their ability to recognize sequences of upside-down body motion, whereas such an improvement over multiple trials has not been reported for temporally disrupted versions of videos, thus suggesting that action recognition is highly selective in terms of the temporal order of presented stimuli. Moreover, these studies have shown that learning plays an important role in complex motion discrimination, with recognition speed and accuracy of humans being improved after a number of training sessions, not only for biologically relevant motion but also for artificial motion patterns with an underlying skeleton structure (Jastorff et al., 2006; Hiris, 2007).

In addition to highly skilled visual mechanisms for motion analysis, a vast variety of studies has shown that visual perception is strongly interwoven with additional perceptual modalities and higher-level cognitive processes (Foxe et al., 2000; Raij et al., 2000; Pulvermüller, 2005). Words for actions and events appear to be among children’s earliest vocabulary (Bloom, 1993). A central question in the field of developmental learning is how children first attach verbs to their referents. During their development, children have a wide range of perceptual, social, and linguistic cues at their disposal that they can use to attach a novel label to a novel referent (Hirsch-Pasek et al., 2000). The referential ambiguity of verbs may then be solved by children assuming that words map onto the most perceptually salient action in their environment. Recent experiments have shown that human infants are able to learn action–word mappings using cross-situational statistics, even in the presence of occasionally unavailable ground-truth action words (Smith and Yu, 2008). Furthermore, action words can be progressively learned and improved from linguistic and social cues so that novel words can be attached to existing visual representations. This hypothesis is supported by neurophysiological studies evidencing strong links between the cortical areas governing visual and language processing, and suggesting high levels of functional interaction of these areas for the formation of multimodal representations of audiovisual stimuli (Foxe et al., 2000; Raij et al., 2000; Belin et al., 2000, 2002; Pulvermüller, 2005).

Together, these studies suggest a highly robust and adaptive system for the efficient analysis of biological motion and synthetically generated patterns of biomechanically plausible motion. For over five decades, the neural mechanisms of the mammalian brain for action perception have been subject to multidisciplinary studies, with insights about biological motion processing having the dual goal of improving our understanding of the brain and contributing to the development of artificial models of perception.

2.1.2 Neural Mechanisms for Action Perception

Studies have identified a specialized area for the visual coding of complex, articulated motion in the mammalian brain (Perrett et al., 1982). Early processing of visual input starts in the primary visual cortex (V1) and extends to higher-level and diverse areas of the brain. In particular, neurons selective to biological motion in terms of time-varying patterns of form and motion features have been found in a wide number of brain structures such as the superior temporal sulcus (STS), the parietal, the premotor and the motor cortex (Giese and Rizzolatti, 2015). A schematic illustration of the brain containing a series of areas involved in visual processing is shown in Fig. 2.1.

Two-Pathway Processing of Visual Cues

Neurophysiological studies have shown that the mammalian visual system processes biological motion in two neural pathways (Ungerleider and Mishkin, 1982; Felleman and Van Essen, 1991). The ventral pathway recognizes sequences of snapshots of body form, while the dorsal pathway recognizes movements in terms of optic-flow patterns. Both pathways comprise hierarchies that extrapolate visual features with increasing complexity of representation. Visual processing in cortical areas is hierarchical, with increasingly larger spatiotemporal receptive windows where simple features manifest in low-level layers closest to sensory inputs, while increasingly complex representations develop in deeper layers (Taylor et al., 2015; Hasson et al., 2008; Lerner et al., 2011). Specifically for the visual cortex, Hasson et al. (2008) have shown that while early visual areas such as the primary visual cortex (V1) and the motion-sensitive area (MT+) yield higher responses to instantaneous sensory input, high-level areas such as the superior temporal sulcus (STS) are more affected by information accumulated over longer timescales. Neurons in higher levels of the hierarchy are also characterized by gradual invariance to the position and the scale of the stimulus (Orban et al., 1982). This kind of hierarchical aggregation is a fundamental organizational principle of cortical networks for dealing with perceptual and cognitive processes that unfold over time (Fonlupt, 2003).

Although there has been a long-standing debate on which visual cue is predominant for action understanding, i.e. either snapshots of body form (Lange and Lappe, 2006) or optic flow patterns (Troje, 2002), it has been found that neurons in the macaque STS are sensitive to both motion and posture for representing similarities among actions, thus suggesting contributions from converging cues received from the ventral and dorsal pathways (Oram and Perrett, 1996). On the basis of additional studies showing that neurons in the human STS are activated by body articulation (Beauchamp et al., 2003), there is a consensus that posture and motion together play a key role in biological motion perception (Garcia and Grossman, 2008; Thirkettle et al., 2009). It should be noted that the conceptual separation into two distinct pathways represents a simplification, since it is known that the two processing streams comprise interactions at several levels (Felleman and Van Essen, 1991). The underlying neural mechanisms and functional underpinnings of this interaction are still to be fully investigated.

Figure 2.1: Schematic illustration of the brain for visual processing. IT, inferior temporal cortex; MT, middle temporal cortex; STG, superior temporal gyrus; STS, superior temporal sulcus; V1, primary visual cortex; V2, secondary visual cortex (prestriate cortex); V4, visual area in the extrastriate visual cortex.

A well-established computational model used to provide a qualitative analysis of existing data on biological movement recognition was proposed by Giese and Poggio (2003). It consists of a feedforward, two-pathway architecture for learning prototypical action patterns based on neurophysiological evidence. The architecture includes primarily visual areas involved in the recognition of body movement. An overview of the architecture is illustrated in Fig. 2.2, showing the different types of neuron detectors and corresponding areas in the mammalian brain involved in the processing. Consistent with biological findings, both streams comprise a hierarchy of neural detectors that process form-motion features with increasing complexity, i.e. the size of the receptive fields and the position and scale invariance of the detectors increase along the hierarchy. The model assumes that the hierarchy is predominantly feedforward. While this assumption does not rule out the need for top-down signals, it is based on the fact that recognition of biological motion in the STS exhibits short latencies, thus making a key role of top-down modulation unlikely for early action perception. For instance, Johansson (1976) showed that stimulus presentation times below 300 ms are sufficient for the recognition of biological motion, while Oram and Perrett (1996) observed that motion-selective neurons in the STS exhibit latencies of less than 200 ms. However, anatomical and neurophysiological studies have shown that the visual cortex is characterized by significant feedback connectivity between different cortical areas (Felleman and Van Essen, 1991; Salin and Bullier, 1995). In particular, action perception demonstrates strong top-down modulatory influences from attentional mechanisms (Thornton et al., 2002) and higher-level cognitive representations such as biomechanically plausible motion (Shiffrar and Freyd, 1990). Furthermore, although the model accounts for the biologically plausible processing of form-motion cues, it does not explain how information from the two streams is subsequently integrated as a joint percept.

Figure 2.2: Hierarchical, two-pathway neural model for the processing of form and motion. F5, ventral premotor cortex; IT, inferior temporal cortex; KO, kinetic occipital cortex; MT, middle temporal cortex; MST, medial superior temporal cortex; OF, optic flow; STS, superior temporal sulcus; V1, primary visual cortex; V2, secondary visual cortex (prestriate cortex); V4, visual area in the extrastriate visual cortex. Adapted from Giese and Poggio (2003).

Multimodal Action Perception

It has been argued that the STS in the mammalian brain may be the basis of an action-encoding network with neurons driven by the perception of dynamic human bodies and that for this purpose it receives converging inputs from earlier visual areas from both the ventral and dorsal pathways (Beauchamp, 2005; Garcia and Grossman, 2008; Vangeneugden et al., 2009; Thirkettle et al., 2009). Neuroimaging studies have shown that the posterior STS (pSTS) shows a greater response to audiovisual stimuli than to unimodal visual or auditory stimuli (Calvert, 2001; Beauchamp et al., 2004; Wright et al., 2003; Senkowski et al., 2011). Wright et al. (2003) conducted an event-related fMRI study showing a strong activation of the STS region in subjects evoked by both unimodal and multimodal audiovisual stimuli from an animated character, with the greatest levels of activity elicited by audiovisual speech. In a study of actions involving the use of objects, Beauchamp et al. (2004) observed that the pSTS and middle temporal gyrus (MTG) showed an enhanced response when auditory and visual object features were presented together with respect to the response to a single modality. Thus, the STS area is thought to be an associative learning device for linking different unimodal representations and accounting for the mapping of naturally occurring, highly correlated features such as body pose and motion, the characteristic sound of an action (Beauchamp et al., 2004; Barraclough et al., 2005), and linguistic stimuli (Belin et al., 2002; Wright et al., 2003; Stevenson and James, 2009).

These findings together suggest that multimodal representations of actions in the brain play an important role for a robust perception of complex action patterns, with the STS representing a multisensory area in the brain network for signaling the social significance of biological motion (Allison et al., 2000; Adolphs, 2003; Beauchamp, 2005; Beauchamp et al., 2008).

Formation of Cortical Maps

It is now known that rudimentary patterns of cortical connectivity for visual processing are established early in development (see Section 3.1). However, normal visual input is required for the correct development of the visual cortex through input-driven self-organization (Hubel and Wiesel, 1962, 1967, 1970; Hubel et al., 1977). The ability of the cortex to self-organize with respect to the distribution of the inputs becomes a less prominent feature as the system stabilizes through a well-specified set of developmental stages (Nelson, 2000). Nevertheless, this ability is not absent in the adult system, which exhibits mechanisms of transient reorganization at a smaller scale (Stiles, 2000).

The ability of the brain to adapt to dynamic input distributions provides vital insight into how the connectivity and function of the cortex are shaped and how they recover after injury. We will discuss the pillars of cortical experience-driven learning mechanisms and computational models of self-organization in Chapter 3.

2.2 Computational Approaches

2.2.1 Trends in Action Recognition

The task of human action recognition has been of strong interest for different fields of research. Artificial systems aimed to tackle complex visual tasks such as the classification of actions from videos have been extensively studied in the literature, with a large variety of models and methodologies tested on different action benchmark datasets (Poppe, 2010). In particular, learning-based approaches have been successfully used to generalize a set of training action samples and then predict the labels of unseen samples by computing their similarity with respect to the learned action templates. Deep learning architectures motivated by biological evidence have been shown to recognize actions with high accuracy from video sequences with the use of spatiotemporal hierarchies that functionally resemble the organization of earlier areas of the visual cortex (see Section 2.1.2). Many of these models show high computational costs linked to the extraction of action features such as body posture and motion characteristics from rich streams of information (Guo et al., 2016).

In the last half decade, the emergence of low-cost depth sensing devices such as the Microsoft Kinect and ASUS Xtion Live has led to a large number of vision-based applications using depth information instead of, or in combination with, color information. This sensor technology provides depth measurements used to obtain reliable estimations of 3D human motion in cluttered environments, including a set of body joints in real-world coordinates and their orientations. Depth sensors represent a significant contribution to the field of action recognition since they address a set of limitations of traditional 2D sensors (e.g. RGB cameras), thereby increasing robustness under varying illumination conditions and reducing the computational effort for motion segmentation and body pose estimation (see Han et al. (2013) for a survey). Depth sensors have the additional advantage of avoiding privacy issues regarding the identity of the monitored person since color information is not required at any stage. However, although this approach allows 3D motion features to be computed efficiently in real time, robust mechanisms for learning relevant spatiotemporal action features remain an open question.
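
To make the skeleton-based processing concrete, the sketch below (our illustration, not a specific published pipeline) derives a simple pose-motion descriptor from the joint positions delivered by such a depth sensor; the function name, the choice of joint 0 as the torso reference, and the frame rate are hypothetical.

```python
import numpy as np

def pose_motion_features(joints, fps=30.0):
    """Build per-frame pose-motion descriptors from a tracked skeleton.

    joints: array of shape (T, J, 3) holding J body joints in real-world
    coordinates for each of T depth frames (e.g. as estimated by a
    Kinect/Xtion skeleton tracker).
    """
    torso = joints[:, 0:1, :]                          # assumption: joint 0 is the torso/root
    pose = (joints - torso).reshape(len(joints), -1)   # torso-centered posture (translation-invariant)
    velocity = np.diff(pose, axis=0) * fps             # frame-to-frame joint motion
    return np.hstack([pose[1:], velocity])             # concatenate posture and motion cues
```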

Contrary to fixed sensors, mobile robots may be designed to process the sensed information and undertake actions that benefit people with disabilities and seniors in a residential context (Fig. 2.3). In this context, the reliable recognition of actions and potentially dangerous behaviors such as fall events plays a crucial role. There has been an increasing number of ongoing research projects aimed at developing assistive robots in smart environments for self-care and independence at home. Moreover, advanced robotic technologies may encompass socially-aware assistive solutions for interactive robot companions, able to support basic daily tasks of independent living and enhance the user experience through a more flexible human-robot interaction (e.g., gesture recognition, dialogues, and vocal commands). Recent studies support the idea that the use of socially assistive robots leads to positive effects on the senior’s well-being in domestic environments (see Kachouie et al. 2014 for a review). On the other hand, the use of robotic technologies brings a vast set of challenges and technical concerns.

Figure 2.3: Person monitoring in a home-like environment. The humanoid robot tracks the person performing daily activities (Parisi et al., 2016c).

To cope with the dynamic nature of real-world scenarios, learning artificial systems should also be adaptive to unseen situations. In addition to detecting short-term behavior such as domestic daily actions and abnormal behavior with respect to specific action patterns, it may be of particular interest to learn the user’s behavior over longer periods of time (Vettier and Garbay, 2014). In this setting, it would be desirable to collect sensory data to, e.g., perform medium- and long-term gait assessment of the person, which can be an important indicator for a variety of health problems, e.g. physical diseases and neurological disorders such as Parkinson’s disease (Aerts et al., 2012). To enhance the user’s experience, assistive robots may be given the capability to adapt over time to better interact with the monitored user. This would include, for instance, a more natural human-robot communication including the recognition of hand gestures and full-body actions, speech recognition, and a set of reactive behaviors based on the user’s habits. In this context, interdisciplinary research aimed at addressing the vast set of technical and social issues regarding robots for assisted living is fundamental to providing feasible and reliable solutions in the near future.

Computational models for action recognition through multiple sensor modalities are a paramount ingredient for autonomous robots to form robust and meaningful representations of perceived events (Ursino et al., 2014). There are numerous advantages from the multimodal processing of sensory inputs conveyed by rich and uncertain information streams. For instance, the integration of stimuli from different sources may be used to attenuate noise and remove ambiguities from converging or complementary inputs. Multimodal representations have been shown to improve robustness in the context of action recognition, human-robot interaction, and sensory-driven motor behavior (Kachouie et al., 2014; Noda et al., 2014; Bauer et al., 2015). However, multisensory inputs must be integrated in an appropriate way so that they result in a reliable cognitive experience aimed to trigger adequate behavioral responses. Consequently, the question of how to effectively acquire, process, and bind multimodal knowledge from rich information streams represents a fundamental issue still to be fully investigated.

2.2.2 Learning to Recognize Actions

Machine learning and neural network techniques processing multi-cue features from natural images have shown promising results for classifying a set of training actions. Typically, baselines of performance in terms of classification accuracy are provided by evaluating the approach on publicly available action datasets. Examples of common public datasets are the KTH human motion dataset (Schuldt et al., 2004), the Weizmann human action dataset (Gorelick et al., 2005), the UCF sports action dataset (Rodriguez et al., 2008), and the CAD-60 dataset with depth map video sequences (Sung et al., 2012).

Xu et al. (2012) presented a system for action recognition using dynamic poses by coupling local motion information with pose in terms of skeletal joint points. They generated a codebook of dynamic poses from two RGB action benchmarks (KTH and UCF Sports), and then classified these features with an Intersection Kernel Support Vector Machine. Jiang et al. (2012) explored a prototype-based approach using pose-motion features in combination with tree-based prototype matching via hierarchical clustering and look-up table indexing for classification. They evaluated the algorithm on the Weizmann, KTH, UCF Sports, and CMU action benchmarks. It should be noted that although these two approaches use pose-motion cues to enhance classification accuracy with respect to traditional single-cue approaches, they do not take into account an integration function that learns order-selective prototypes of joint pose-motion representations of action segments from training sequences. Furthermore, these classification algorithms can be susceptible to noise which may occur during live recognition.

Learning systems using depth information from low-cost sensors have been increasingly popular in the research community, encouraged by the combination of computational efficiency and robustness to light changes in indoor environments. In recent years, a large number of applications using 3D motion information has been proposed for human activity recognition such as classification of full-body actions (Faria et al., 2014; Shan and Akella, 2014), fall detection (Rougier et al., 2011; Parisi and Wermter, 2013), and recognition of hand gestures (Suarez and Murphy, 2012). A vast number of depth-based methods has used a 3D human skeleton model to extract relevant action features for the subsequent use of a classification algorithm. For instance, Sung et al. (2012) combined the skeleton model with Histogram of Oriented Gradient (HOG) features and then used a hierarchical maximum entropy Markov model to classify 12 different actions. The learning model used a Gaussian mixture model to cluster and segment the original training data into activities.

Using the same action benchmark for the evaluation, Shan and Akella (2014) used action templates computed from 3D body poses to train multiple classifiers: Hidden Markov Model, Random Forests, k-Nearest Neighbor, and Support Vector Machine (SVM). Faria et al. (2014) used a dynamic Bayesian Mixture Model designed to combine multiple classifier likelihoods and compute probabilistic body motion. Zhu et al. (2014) evaluated a set of spatiotemporal interest point features from raw depth map images to classify actions with an SVM. Experiments were also conducted using interest points in combination with skeleton joint positions and color information, obtaining better results. However, the authors also showed that noisy depth data and cluttered background have a significant impact on the detection of points of interest, and that actions without much motion are not well recognized.

Computational models inspired by the hierarchical organization of the visual cortex (see Section 2.1.2) have become increasingly popular for learning complex visual patterns such as action sequences from video (Giese and Poggio, 2003; Layher et al., 2013). In particular, neural network approaches with deep learning architectures have produced state-of-the-art results on a set of benchmark datasets containing daily actions (e.g. Baccouche et al. 2011; Jain et al. 2015; Jung et al. 2015). Typically, visual models using deep learning comprise a set of convolution and pooling layers trained in a hierarchical fashion for obtaining action feature representations with an increasing degree of abstraction (see Guo et al. (2016) for a recent survey). This processing scheme is in agreement with neurophysiological studies supporting the presence of functional hierarchies with increasingly larger spatial and temporal receptive fields along cortical pathways.
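
As a rough, generic illustration of this convolution-pooling scheme (a minimal PyTorch sketch under our own assumptions, not one of the cited architectures), a small stack of 3D convolution and pooling layers over short video snippets could look as follows; the layer sizes and the number of action classes are arbitrary.

```python
import torch.nn as nn

# Input: video snippets of shape (batch, 1, frames, height, width).
# Deeper layers cover increasingly larger spatiotemporal receptive fields
# and yield increasingly abstract action features.
action_cnn = nn.Sequential(
    nn.Conv3d(1, 16, kernel_size=(3, 5, 5)),   # low-level spatiotemporal filters
    nn.ReLU(),
    nn.MaxPool3d(kernel_size=(1, 2, 2)),       # spatial pooling
    nn.Conv3d(16, 32, kernel_size=(3, 3, 3)),  # mid-level form-motion features
    nn.ReLU(),
    nn.MaxPool3d(kernel_size=(2, 2, 2)),       # spatiotemporal pooling
    nn.AdaptiveAvgPool3d(1),                   # global pooling over the remaining volume
    nn.Flatten(),
    nn.Linear(32, 10),                         # scores for 10 hypothetical action classes
)
```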

The above-described methods are trained with a batch learning scheme and thus assume that all the training samples and sample labels are available during the training phase. However, an additional strong assumption is that training samples, typically represented as a sequence of feature vectors extracted from video frames, are well segmented so that ground-truth labels can be univocally assigned. Therefore, it is usually the case that raw visual data collected by sensors must undergo an intensive processing pipeline before training a model. These pre-processing stages are mainly performed manually, thereby hindering the automatic, continuous learning of actions from live video.

From a multimodal perspective, a number of computational models have been proposed aiming to effectively integrate multisensory information, in particular audiovisual input. These approaches typically use unsupervised learning for obtaining visual representations of the environment and then link these features to auditory cues. For instance, Vavrečka and Farkaš (2014) presented a connectionist architecture that learns to bind visual properties of objects (spatial location, shape and color) to proper lexical features. These unimodal representations are bound together based on the co-occurrence of audiovisual inputs using a self-organizing neural network (see Section 3.2). Similarly, Morse et al. (2015) investigated how infants may map a name to an object and how body posture may affect these mappings. The computational model is driven by visual input and learns word–to–object mappings through body posture changes and online speech recognition. Unimodal representations are obtained with neural network self-organization and multimodal representations develop through the activation of unimodal modules via associative connections.
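
The general co-occurrence-based binding scheme described above can be sketched as follows (a simplified illustration, not a reimplementation of either cited model): two already-trained unimodal maps are linked by associative weights that are strengthened, Hebbian-like, whenever visual and auditory inputs co-occur, so that activity in one map can later retrieve the best-matching unit of the other. All function names and the learning rate are hypothetical.

```python
import numpy as np

def best_matching_unit(prototypes, x):
    """Index of the unimodal prototype closest to input x."""
    return int(np.argmin(np.linalg.norm(prototypes - x, axis=1)))

def train_associations(visual_protos, audio_protos, pairs, lr=0.1):
    """Strengthen links between units that respond to co-occurring inputs.

    pairs: iterable of (visual_input, audio_input) samples observed together.
    """
    assoc = np.zeros((len(visual_protos), len(audio_protos)))
    for v_x, a_x in pairs:
        v = best_matching_unit(visual_protos, v_x)
        a = best_matching_unit(audio_protos, a_x)
        assoc[v, a] += lr                     # Hebbian-like co-activation update
    return assoc

def retrieve_audio_unit(assoc, visual_unit):
    """Cross-modal retrieval: strongest auditory unit associated with a visual unit."""
    return int(np.argmax(assoc[visual_unit]))
```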

The development of associations between co-occurring stimuli for multimodal binding has been strongly supported by neurophysiological evidence (Fiebelkorn et al., 2009). However, these approaches do not naturally scale up to learn more complex spatiotemporal patterns such as action–word mappings. In fact, action words do not label actions in the same way that nouns label objects (Gentner, 1982). While nouns typically refer to objects that can be perceived as distinct units, action words refer instead to spatiotemporal relations within events that may be performed in many different ways with high spatial and temporal variance. Thus, further work is required to address the learning of multimodal representations of spatiotemporal inputs for obtaining robust action–word mappings.

2.2.3 Body Motion Assessment

The analysis and assessment of human body motion have recently attracted significant interest in the healthcare community, with many application areas such as physical rehabilitation, diagnosis of pathologies, and assessment of sports performance. In this context, the correctness of postural transitions is paramount during the execution of well-defined physical routines, since inaccurate movements may significantly reduce the overall efficiency of the movement and increase the risk of injury (Kachouie et al., 2014). For instance, in the case of weight-lifting training, correct postures improve the mechanical efficiency of the body and allow the athlete to achieve higher effectiveness during training sessions. Similarly, in the healthcare domain, the correct execution of physical rehabilitation routines is crucial for patients to improve their health condition (Velloso et al., 2013a).

Human proprioception may not be sufficient to spot movement mistakes. Thus, expert trainers observing the movement can give the trainee proficient feedback for improving the quality of the performance in a timely manner and avoiding persistent inaccuracies. However, a personal trainer is not always available to assess the quality of movements during their execution. Therefore, there is a strong motivation to develop automatic systems able to detect mistakes during the performance of well-defined routines and provide feedback in real time.

While the aim of action recognition is to categorize a set of distinct classes by extrapolating inter-class spatiotemporal differences, action assessment instead requires a model to capture intra-class dissimilarities, i.e. to express a measure of how closely an action follows its learned template. In this setting, efficient approaches to learning spatiotemporal templates for computing intra-class dissimilarities have remained an open issue. Common computational bottlenecks are the robust extraction of body features from video streams and the definition of suitable metrics for comparing two actions in terms of their spatiotemporal structure. The former issue has been partly addressed with the use of depth sensors that allow the efficient tracking of human motion and the estimation of a 3D skeleton model. On the other hand, effective methods for computing a similarity measure between two actions still represent a major challenge.

Automatic systems for the visual assessment of body motion have been previously investigated for applications mainly focused on physical rehabilitation and sports training. For instance, Chang et al. (2011) proposed a physical rehabilitation system for young patients with motor disabilities using a Kinect sensor. The idea was to assist the users while performing a set of simple movements necessary to improve their motor proficiency during the rehabilitation period. Users were instructed by a therapist on how to perform the movements. During the autonomous execution, visual hints were shown to users to motivate the performance of the routines. Although experimental results showed improved motivation for users receiving visual hints, only movements involving the arms at constant speed were considered. Furthermore, the estimation of real-time feedback that would enable the user to spot and correct mistakes was not considered.

Similarly, Su (2013) proposed the estimation of feedback for Kinect-based rehabilitation exercises by comparing the performed motion with a pre-recorded execution by the same person. The comparison was carried out on sequences using dynamic time warping (DTW) and fuzzy logic, with the Euclidean distance as a similarity measure. The evaluation of the exercises was based on the degree of similarity between the current sequence and a correct sequence. The system provided qualitative feedback on the similarity of body joints and execution speed, but it did not suggest to the user how to correct the movement.
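
The following is a minimal sketch of sequence comparison with DTW using the Euclidean distance as the local cost, illustrating the kind of similarity measure employed above; it is not the cited implementation, and the pose dimensionality and sequences are illustrative.

```python
import numpy as np

# Dynamic time warping (DTW) with the Euclidean distance as the local frame cost.
# Pose representation (frames x joint coordinates) and sequence lengths are assumptions.

def dtw_distance(seq_a, seq_b):
    """Accumulated alignment cost between two pose sequences (frames x features)."""
    n, m = len(seq_a), len(seq_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])   # Euclidean frame distance
            cost[i, j] = d + min(cost[i - 1, j],               # insertion
                                 cost[i, j - 1],               # deletion
                                 cost[i - 1, j - 1])           # match
    return cost[n, m]

# Example: compare a performed sequence against a pre-recorded template.
template = np.random.rand(40, 45)    # e.g. 40 frames of 15 joints x 3 coordinates
performed = np.random.rand(55, 45)   # the same movement executed at a different speed
print(dtw_distance(performed, template))
```

A lower accumulated cost indicates that the performed sequence follows the template more closely, independently of moderate differences in execution speed.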

Paiement et al. (2014) proposed a method for assessing the quality of gait from sequences of people on stairs. As a measure of quality, Kinect-based body poses were compared to learned normal occurrences of a movement from a statistical model. The likelihood of the model describing the current movement was computed frame by frame over a sequence of postures and motion speed. The system triggered an alarm if the current movement differed from the correct movement template. For this purpose, a suitable threshold must be chosen empirically to decide the degree of tolerance with respect to the template. Although this method represents a useful application for detecting abnormal behavioral patterns, it does not provide any hints on how to correct motion mistakes.
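
As an illustration of frame-by-frame quality monitoring with an empirical tolerance threshold, the sketch below scores each pose frame under a single multivariate Gaussian fitted to normal executions, which is a much simpler statistical model than the one used in the cited work; the feature dimensionality and the threshold value are assumptions.

```python
import numpy as np

# Frame-wise log-likelihood monitoring under a Gaussian model of normal poses.
# An alarm is raised when the likelihood drops below an empirical threshold.

def fit_gaussian(normal_frames):
    mean = normal_frames.mean(axis=0)
    cov = np.cov(normal_frames, rowvar=False) + 1e-6 * np.eye(normal_frames.shape[1])
    return mean, cov

def frame_log_likelihood(frame, mean, cov):
    diff = frame - mean
    _, logdet = np.linalg.slogdet(cov)
    d = len(mean)
    return -0.5 * (diff @ np.linalg.solve(cov, diff) + logdet + d * np.log(2 * np.pi))

# Example: learn from normal executions, then monitor a new sequence.
normal = np.random.rand(500, 45)        # frames of 15 joints x 3 coordinates (assumption)
mean, cov = fit_gaussian(normal)
threshold = -30.0                       # empirically chosen tolerance (assumption)
for frame in np.random.rand(60, 45):
    if frame_log_likelihood(frame, mean, cov) < threshold:
        print("abnormal movement detected")
        break
```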

Velloso et al. (2013b) investigated qualitative action recognition with a Kinect sensor for specifying the correct execution of movements, detecting mistakes, and providing feedback to the user. A baseline was created by asking the users to perform a routine ten times, from which individual repetitions were manually segmented. Hidden Markov models were trained with tuples containing the joint angles and the timestamp for individual exercises. Similar to Chang et al. (2011) and Su (2013), the system was tested only on arm movements, in this case for dumbbell lifting. A strong limitation of this approach is that the correct duration and motion intensity of movements were computed using the timestamp from body joint estimation. Therefore, although the system provides feedback to correct body posture in terms of joint angles, it does not provide any robust feedback on temporal discrepancies.

For the assessment of human motion in sports, Pirsiavash et al. (2014) predicted scores of performed movements from annotated footage. The system trained a regression model from spatiotemporal pose features to scores obtained from expert judges and computed the gradient of the predicted score for each body joint. Feedback was provided in terms of which joints should be changed to obtain the maximum score. In contrast to the previously discussed approaches, this method extracts body features from RGB sequences. Thus, the estimation of body joints is not as robust as a 3D skeleton model obtained with a depth sensor. Experimental results showed that the system predicted scores better than non-expert humans but significantly worse than expert judges.
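
A hedged sketch of this score-regression idea is given below: a linear model maps pose features to a quality score, and the gradient of the predicted score with respect to the per-joint features indicates which joints should change to raise the score. The feature layout, data, and scoring scale are illustrative assumptions rather than the cited method.

```python
import numpy as np

# Linear score regression from pose features with joint-level feedback.
# Data, dimensionalities, and the 0-10 scoring scale are assumptions.

rng = np.random.default_rng(1)
n_features = 45                          # e.g. 15 joints x 3 coordinates (assumption)
X = rng.random((200, n_features))        # pooled pose features of performed actions
y = rng.random(200) * 10.0               # hypothetical expert scores

# Least-squares fit of a linear scoring function y ~ X w + b.
Xb = np.hstack([X, np.ones((len(X), 1))])
w_b, *_ = np.linalg.lstsq(Xb, y, rcond=None)
w, b = w_b[:-1], w_b[-1]

def predict_score(features):
    return float(features @ w + b)

def joint_feedback(n_joints=15):
    """Per-joint gradient magnitude: for a linear model the gradient is w, so
    joints with larger values affect the predicted score the most."""
    grad = w.reshape(n_joints, 3)
    return np.linalg.norm(grad, axis=1)

sample = rng.random(n_features)
print(predict_score(sample), joint_feedback().argmax())
```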


While the correct execution of well-defined movements plays a crucial role in physical rehabilitation and sports, artificial learning systems for assessing the quality of actions and providing feedback for correcting inaccurate movements have remained an open issue in the literature.

2.2.4 Abnormal Event Detection

Falls represent a major concern in the public healthcare domain, especially among the elderly population. According to the World Health Organization, fall-related injuries are common among older persons and represent the leading cause of pain, disability, loss of independence, and premature death (World Health Organization: Global report on falls prevention in older age, http://www.who.int/ageing/publications/Falls_prevention7March.pdf). Although fall events do not necessarily cause a fatal injury, fallen people may be unable to get up without assistance, thereby resulting in complications from long lie times such as hypothermia, dehydration, bronchopneumonia, and pressure sores (Tinetti et al., 1993). Moreover, fear of falling has been associated with a decreased quality of life, avoidance of activities, and mood disorders such as depression (Scheffer et al., 2008).

As a response to increasing life expectancy, research has been conducted to provide technological solutions for supporting living at home and smart environments for assisted living. The motivation of assistive fall systems is the ability to promptly report a fall event, thereby enhancing the person's perception of safety and avoiding the loss of confidence due to functional disabilities. Recent systems for elderly care aim mostly to detect hazardous events such as falls and to monitor physiological measurements (e.g. heart rate, breath rate) using wearable sensors to detect and report emergency situations in real time (Kaluza et al., 2013; Vettier and Garbay, 2014). Vision-based fall detection is currently the predominant approach due to the constant development of computer vision techniques that yield increasingly promising results in both experimental and real-world scenarios. While low-cost depth sensors introduce significant advantages in terms of body motion and posture estimation, these approaches are characterized by a number of issues that may prevent them from operating in real-world environments. For instance, their operating range (the distance covered by the sensor) is quite limited (between 0.8 m and 5 m), as is their field of view, thereby requiring mobile or multi-sensor setups to monitor an extensive area of interest.

Lee and Mihailidis (2005) proposed a vision-based method with a ceiling camera for monitoring falls at home. The authors considered falls as lying down in a stretched or tucked position. The system's accuracy was evaluated in a pilot study with 21 subjects comprising 126 simulated falls. Personalized thresholds for fall detection were based on the height of the subjects. The system detected fall events with 77% accuracy and had a false alarm rate of 5%. Miaou et al. (2006) proposed a customized fall detection system using an omni-camera capturing 360-degree scene images. Falls were detected based on the change of the ratio of people's height and width. Two scenarios were used for the detection: with and without considering the user's health history, for which the system showed 81% and 70% accuracy respectively.
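
The following sketch illustrates the general idea of ratio-based fall detection with a personalized threshold, without reproducing the cited systems; the bounding boxes, threshold values, and detection rule are illustrative assumptions.

```python
# Ratio-based fall detection on bounding boxes of a tracked person.
# Thresholds and the detection rule are illustrative assumptions.

def aspect_ratio(bbox):
    """bbox = (x, y, width, height) of the segmented person."""
    x, y, w, h = bbox
    return h / max(w, 1e-6)

def detect_fall(bbox_sequence, ratio_threshold=0.8, drop_threshold=1.0):
    """Flag a fall when the ratio is low AND it dropped sharply since the last frame."""
    previous = None
    for t, bbox in enumerate(bbox_sequence):
        ratio = aspect_ratio(bbox)
        if previous is not None and ratio < ratio_threshold and previous - ratio > drop_threshold:
            return t                     # frame index of the suspected fall
        previous = ratio
    return None

# Example: a standing person (tall, narrow box) who suddenly ends up lying down.
boxes = [(100, 50, 60, 180)] * 20 + [(80, 200, 170, 60)] * 10
print(detect_fall(boxes))
```

In practice, the threshold would be personalized, for instance based on the subject's height, as done in the cited pilot study.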

In a multi-camera scenario, Cucchiara et al. (2007) presented a vision system with multiple cameras for tracking people in different rooms and detecting falls based on a hidden Markov model (HMM). People tracking was based on geometrical and color constraints, and the tracking output was sent to the HMM-based posture classifier. Four main postures were considered: walking, sitting, crawling, and lying down. When a fall was detected, the system triggered an alarm via SMS to a clinician's PDA with a link to a live low-bandwidth video stream. Experiments showed that occlusions had a strong negative impact on the system's performance. Hazelhoff et al. (2008) detected falls using two fixed perpendicular cameras. The foreground region was extracted from both cameras, and principal component analysis (PCA) was applied to each object to determine the direction of the main axis of the body and the ratio of the variances. Using these features, a Gaussian multi-frame classifier was used to recognize falls. To increase robustness and mitigate false positives, the position of the head was taken into account. The system was also evaluated for partially occluded people. Experiments showed real-time performance with an 85% overall detection rate.
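
To illustrate the PCA-based orientation features, the sketch below computes the direction of the body's main axis and the ratio of variances from the coordinates of foreground (silhouette) pixels; the synthetic silhouette and the feature interpretation are illustrative and do not reproduce the cited classifier.

```python
import numpy as np

# PCA on foreground pixel coordinates: the leading principal component gives the
# body's main axis, and the eigenvalue ratio indicates how elongated the blob is.

def orientation_features(foreground_pixels):
    """foreground_pixels: array of (row, col) coordinates of silhouette pixels."""
    pts = foreground_pixels - foreground_pixels.mean(axis=0)
    cov = np.cov(pts, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
    main_axis = eigvecs[:, -1]                    # direction of largest variance
    angle = np.degrees(np.arctan2(main_axis[1], main_axis[0]))
    variance_ratio = eigvals[-1] / max(eigvals[0], 1e-9)
    return angle, variance_ratio

# Example: an elongated vertical blob of pixels (roughly an upright person).
rows = np.repeat(np.arange(0, 180), 3)
cols = np.tile(np.array([40, 41, 42]), 180)
print(orientation_features(np.stack([rows, cols], axis=1)))
```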

Rougier et al. (2011) presented a method for fall detection by analysing human shape deformation in depth map image sequences. Falls were distinguished from normal activities using a Gaussian mixture model with 98% accuracy. The overall system performance increased when taking into account the lack of significant body motion after the detected fall event. Liu et al. (2010) detected falls while considering privacy issues, thereby processing only human silhouettes without featural properties such as the face. A k-nearest neighbor (kNN) algorithm was used to classify postures using the ratio and the difference of the height and width of the human body silhouette's bounding box. Recognized postures were divided into three categories: standing, temporary transition, and lying down. Experiments with 15 subjects showed a detection accuracy of 84.44% on fall and lying-down events. Diraco et al. (2010) addressed the detection of falls and the recognition of several postures with 3D information. The system used a fixed time-of-flight camera that provided robust measurements under different illumination settings. Moving regions with respect to the floor plane were detected by applying Bayesian segmentation to the 3D point cloud. Posture recognition was carried out using the distance of the 3D body centroid from the floor plane and the estimated body orientation. The system yielded promising results on synthetic data with threshold-based clustering for different centroid height thresholds.
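
A minimal sketch of kNN posture classification on bounding-box features (height/width ratio and height-width difference) is given below; the training examples, labels, and the parameter k are synthetic illustrations rather than data or settings from the cited study.

```python
import numpy as np

# kNN posture classification on simple silhouette bounding-box features.
# Training data and labels are synthetic assumptions for illustration.

POSTURES = ["standing", "temporary transition", "lying down"]

def bbox_features(w, h):
    return np.array([h / max(w, 1e-6), h - w], dtype=float)

# Synthetic training set: (width, height, label index).
train = [(60, 180, 0), (65, 175, 0), (90, 120, 1), (100, 110, 1),
         (170, 60, 2), (160, 55, 2)]
X_train = np.array([bbox_features(w, h) for w, h, _ in train])
y_train = np.array([label for _, _, label in train])

def knn_classify(w, h, k=3):
    dists = np.linalg.norm(X_train - bbox_features(w, h), axis=1)
    nearest = y_train[np.argsort(dists)[:k]]
    return POSTURES[np.bincount(nearest).argmax()]

print(knn_classify(150, 65))   # expected output: "lying down"
```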

Most of the above-described approaches rely on predefined threshold values to detect abnormal behavior. Furthermore, the reported experiments were conducted in highly controlled environments with fixed vision sensors. It is questionable whether these approaches would allow for the robust detection of abnormal behavior if embedded in mobile robot platforms.


2.2.5 Assistive Robotics

Mobile robots have seen constant development for aging-at-home scenarios. In contrast to fixed sensors, mobile assistive robots may be designed to process the sensed information and undertake actions that benefit people with disabilities and seniors in a residential context. In fact, the mobility of robots represents a considerable benefit for the non-invasive monitoring of users, thereby better addressing fixed sensors' limited field of view, blind spots, and occlusions. Despite different functional perspectives concerning elderly care and user needs (e.g. rehabilitation, social robotics), there is a strong affinity regarding the intrinsic challenges and issues involved in operating these systems in real-world scenarios. For instance, mobile robots may generally be combined with ambient sensors embedded in the environment (e.g. cameras, microphones) to enhance the agent's perception and increase robustness under in-the-wild conditions. On the other hand, complementary research efforts have been conducted on the deployment of stand-alone mobile robot platforms that are able to sense and navigate the environment by relying exclusively on onboard sensors.

Particularly when operating in natural environments, the robust and efficient processing of multimodal information plays a key role in perceiving human activity. Research efforts have been made towards robots exploiting multisensory integration to improve HRI capabilities. For instance, Lacheze et al. (2009) used auditory information to recognize objects that were partially occluded and thus difficult to detect by vision only. Sanchez-Riera et al. (2009) presented a scenario with a robot companion that performs audiovisual fusion for multimodal speaker detection. The system targeted multiple speakers in a domestic environment, processing information from two microphones and two cameras mounted on a humanoid robot. Martinson (2014) introduced a navigational aid for visually impaired people using a mobile robot platform. The system used depth information to detect other people in the environment and avoid dynamic obstacles, and it communicated the direction of motion to reach the goal destination to the person via a tactile belt around the waist.

For abnormal behavior detection, promising experimental results have been obtained by combining mobile robots and 3D information from depth sensors. This approach overcomes limitations in the operating range of the sensors while keeping the computational requirements low enough for real-time operation. Mundher and Zhong (2014) proposed a mobile robot with a Kinect sensor for fall detection based on floor-plane estimation. The robot tracks and follows the user in an indoor environment and can trigger an alarm in case of a detected fall event. The system recognizes two gestures to start and stop a distance-based user-following procedure, and three voice commands to enable fall detection and call for help in case of a fall. Volkhardt et al. (2013) presented a mobile robot to detect fallen persons, i.e. a user already lying on the floor. The system segments objects from the ground plane and layers them to address partial occlusions. A classifier trained on positive and negative examples is used to detect object layers as a fallen human. Experiments reveal that the overall accuracy of the system is strongly dependent
